APPARATUS FOR DETECTING UNKNOWN MALWARE USING VARIABLE OPCODE SEQUENCE AND METHOD USING THE SAME
Disclosed herein are an apparatus for detecting unknown malware using a variable-length operation code (opcode) and a method using the apparatus. The method includes collecting opcode information from a detection target; generating a multi-pixel image having a variable length by performing feature engineering on the opcode information; and detecting unknown malware by inputting the multi-pixel image to a deep-learning model based on AI.
This application claims the benefit of Korean Patent Application No. 10-2020-0142203, filed Oct. 29, 2020, and No. 10-2021-0060608, filed May 11, 2021, which are hereby incorporated by reference in their entireties into this application.
BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates generally to technology for detecting malware, and more particularly to technology for detecting unknown malware using AI by processing information about an operation code (opcode), which is an instruction code that is used when static or dynamic analysis is performed on malware.
2. Description of the Related Art
In a conventional method, malware is detected using signature-based detection, a mechanism that determines whether a suspected malware file matches pattern information of a specific code section of malware. In particular, a conventional antivirus detection technique detects malware based on byte information of a specific code section used by malware, or determines whether a file is malicious based on information about the structure of the file and on various kinds of log information (information about a DLL, a call of an API function, and the like) generated when malware is dynamically executed. As a result, its ability to detect new malware or new variants thereof is limited.
Accordingly, various static and dynamic techniques for analysis of malware have recently been proposed, and attempts to effectively analyze and detect various types of malware including unknown files are ongoing, but it is difficult to release such techniques to the public due to the accuracy and performance limitations thereof.
Documents of Related Art
- (Patent Document 1) Korean Patent No. 10-1880686, registered on Jul. 16, 2018 and titled “Malware code detection system based on AI deep-learning”.
An object of the present invention is to provide unknown-malware detection technology based on AI, the detection accuracy and performance of which can be improved.
Another object of the present invention is to provide malware detection technology capable of classifying and detecting unknown malware based on feature information characterized by the variable length thereof.
A further object of the present invention is to provide malware detection technology that can also be applied to detection of a document-type or script-type malware file.
In order to accomplish the above objects, a method for detecting unknown malware according to the present invention includes collecting operation code (opcode) information from a detection target; generating a multi-pixel image having a variable length by performing feature engineering on the opcode information; and detecting unknown malware by inputting the multi-pixel image to a deep-learning model based on AI.
Here, the multi-pixel image may correspond to a multi-pixel RGB image based on an n-gram corresponding to the opcode information.
Here, generating the multi-pixel image may include storing n-gram sequences for hexadecimal (hex) codes having a variable length based on the opcode information; and mapping a 3-gram of opcodes to an RGB code based on the n-gram sequences, thereby generating the multi-pixel RGB image.
Here, collecting the opcode information may include extracting a text section from an executable file corresponding to the detection target; converting raw data in the text section to opcodes in an assembly language format using a binary analysis tool; and extracting the hex codes based on the opcodes.
Here, detecting the unknown malware may be configured to detect the unknown malware in consideration of the similarity between feature information of each type of malware classified through training and multi-pixel feature information output from the deep-learning model.
Here, the method may further include generating a multi-pixel image corresponding to malware based on opcode information collected based on the malware and generating training data so as to correspond to the multi-pixel image corresponding to the malware; and training the deep-learning model using the training data.
Here, training the deep-learning model may be configured to acquire at least one of information about entropy of each piece of malware, the original creation date thereof, the final update date thereof, and information about a website via which the malware is distributed and to use the acquired information as the training data.
Here, the opcode may be a code for providing at least one function, among a logical operation, program flow control, memory manipulation, and an arithmetic operation.
Here, the opcode may correspond to an opcode in the assembly language format in a one-to-one manner.
Here, the opcode may include a 1-byte or 2-byte instruction and multiple operand values.
Also, an apparatus for detecting unknown malware according to an embodiment of the present invention includes a processor for collecting operation code (opcode) information from a detection target, generating a multi-pixel image having a variable length by performing feature engineering on the opcode information, and detecting unknown malware by inputting the multi-pixel image to a deep-learning model based on AI; and memory for storing the opcode information and the multi-pixel image.
Here, the multi-pixel image may correspond to a multi-pixel RGB image based on an n-gram corresponding to the opcode information.
Here, the processor may be configured to store n-gram sequences for hexadecimal (hex) codes having a variable length based on the opcode information; and to map a 3-gram of opcodes to an RGB code based on the n-gram sequences, thereby generating the multi-pixel RGB image.
Here, the processor may be configured to extract a text section from an executable file corresponding to the detection target; to convert raw data in the text section to opcodes in an assembly language format using a binary analysis tool; and to extract the hex codes based on the opcodes.
Here, the processor may be configured to detect the unknown malware in consideration of the similarity between feature information of each type of malware classified through training and multi-pixel feature information output from the deep-learning model.
Here, the processor may be configured to generate a multi-pixel image for training using opcode information collected based on multiple pieces of benign code and multiple pieces of malware; to generate training data so as to correspond to the multi-pixel image for training; and to train the deep-learning model using the training data.
Here, the processor may be configured to acquire at least one of information about entropy of each piece of malware, the original creation date thereof, the final update date thereof, and information about a website via which the malware is distributed, and to use the acquired information as the training data.
Here, the opcode may be a code for providing at least one function, among a logical operation, program flow control, memory manipulation, and an arithmetic operation.
Here, the opcode may correspond to an opcode in the assembly language format in a one-to-one manner.
Here, the opcode may include a 1-byte or 2-byte instruction and multiple operand values.
The above and other objects, features, and advantages of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
The present invention will be described in detail below with reference to the accompanying drawings. Repeated descriptions and descriptions of known functions and configurations which have been deemed to unnecessarily obscure the gist of the present invention will be omitted below. The embodiments of the present invention are intended to fully describe the present invention to a person having ordinary knowledge in the art to which the present invention pertains. Accordingly, the shapes, sizes, etc. of components in the drawings may be exaggerated in order to make the description clearer.
Hereinafter, a preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.
Referring to
For example, as illustrated in
Here, a text section may be extracted from an executable file corresponding to the detection target.
Here, raw data in the text section may be converted into opcodes in an assembly language format using a binary analysis tool.
Here, in order to extract opcodes for static or dynamic analysis of malware, any of various binary analysis tools (disassemblers) may be used. Particularly, using an analysis tool, such as IDA, objdump, OllyDbg, Visual Studio, PE Explorer, or the like, depending on the OS of the analysis system, opcodes may be extracted using a static method. Also, when dynamic analysis is performed, opcodes of an executable file may be extracted and converted using a CPU or using a processor tracer or the like, even in a virtualized environment.
Here, the text section of an executable file (a PE, a DLL, or the like) may be extracted using any of various binary analysis tools. For example, raw data present in SECTION_text in a Portable Executable (PE) file, like what is illustrated in
Here, reverse engineering based on general binary analysis is a manual method that consults all of the various types of static/dynamic information, such as the type of a file, the size thereof, header information thereof, a certificate, the internal structure of the file, a registry, a network operation, information about a relevant API function, and the like, whereas the present invention fundamentally uses only opcodes, which are the operation codes obtained when static/dynamic analysis is performed.
Also, in the case of packed or obfuscated malware, it is necessary to decode or deobfuscate the same in advance such that opcodes can be collected therefrom, but this part will not be described in detail because it does not form part of the gist of the present invention.
Here, an opcode according to the present invention is a machine-language instruction processed by a CPU, and may provide functions of a logical operation, program flow control, memory manipulation, or an arithmetic operation. Particularly, each opcode corresponds exactly to one assembly language instruction in a one-to-one manner. The structure of such an opcode may include a 1-byte or 2-byte instruction (opcode) and multiple operand values.
For example, referring to
In the present invention, in order to find semantics having such characteristics, hexadecimal (hex) code values of opcodes (Intel's X86 Opcode and Instruction Reference) are used.
Here, hexadecimal codes may be extracted based on the opcodes.
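The extraction of hex codes from disassembled opcodes can be sketched as follows. This is a minimal illustration, not the implementation of the present invention: it assumes the binary analysis tool has already produced an objdump-style text listing (address, raw bytes, mnemonic), and the names `extract_opcode_hex`, `LINE_RE`, and `SAMPLE` are hypothetical. As a simplification, the first byte of each instruction is treated as the opcode, except that the x86 escape byte 0x0f is taken to introduce a 2-byte opcode; instruction prefixes are not handled.

```python
import re

# One objdump-style listing line: "address:<TAB>hex bytes<TAB>mnemonic operands".
LINE_RE = re.compile(r"^\s*[0-9a-fA-F]+:\t((?:[0-9a-fA-F]{2} ?)+)\s*\t(\S+)")


def extract_opcode_hex(listing: str) -> list[tuple[str, str]]:
    """Return (mnemonic, opcode_hex) pairs for each instruction line."""
    result = []
    for line in listing.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip section headers, labels, and blank lines
        raw_bytes = match.group(1).split()
        # Assumption: 0x0f is the escape byte that introduces a 2-byte opcode;
        # every other instruction is treated as having a 1-byte opcode.
        width = 2 if raw_bytes[0].lower() == "0f" and len(raw_bytes) > 1 else 1
        result.append((match.group(2), "".join(raw_bytes[:width]).lower()))
    return result


# Hypothetical sample listing used only for illustration.
SAMPLE = (
    "Disassembly of section .text:\n"
    "  401000:\t55                   \tpush   %rbp\n"
    "  401001:\t89 e5                \tmov    %esp,%ebp\n"
    "  401003:\t0f 84 10 00 00 00    \tje     401019\n"
    "  401009:\tc3                   \tret\n"
)

if __name__ == "__main__":
    for mnemonic, opcode_hex in extract_opcode_hex(SAMPLE):
        print(mnemonic, opcode_hex)
```

Only the opcode bytes are kept; operand bytes such as the jump displacement are discarded, which matches the statement that the invention fundamentally uses only opcodes.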
Also, in the method for detecting unknown malware using a variable-length opcode according to an embodiment of the present invention, feature engineering is performed on the opcode information, whereby a multi-pixel image having a variable length is generated at step S120.
Here, step S120 may correspond to the information-processing and optimization step illustrated in
Here, the multi-pixel image may be a multi-pixel RGB image based on an n-gram corresponding to the opcode information.
Here, based on the opcode information, an n-gram sequence of hexadecimal codes having a variable length may be stored.
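Storing n-gram sequences of the hex codes can be sketched as follows. The sketch assumes an overlapping sliding window with a stride of one, which the description does not specify; the names `ngrams` and `all_ngram_sequences` and the parameter `max_n` are hypothetical.

```python
def ngrams(hex_codes: list[str], n: int) -> list[tuple[str, ...]]:
    """Slide a window of size n over the opcode hex codes (stride 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # When the input is shorter than n, the range is empty and no n-gram is produced.
    return [tuple(hex_codes[i:i + n]) for i in range(len(hex_codes) - n + 1)]


def all_ngram_sequences(hex_codes: list[str], max_n: int = 3) -> dict[int, list[tuple[str, ...]]]:
    """Build the 1-gram through max_n-gram sequences so each can be stored separately."""
    return {n: ngrams(hex_codes, n) for n in range(1, max_n + 1)}


if __name__ == "__main__":
    codes = ["55", "89", "83", "c3"]
    for n, sequence in all_ngram_sequences(codes).items():
        print(n, sequence)
```

Because the number of opcodes differs from file to file, the resulting sequences naturally have variable length.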
For example, when the hex code values of the opcodes are extracted at the opcode-based information collection step illustrated in
Here, referring to
Here, based on the n-gram sequences, a 3-gram of opcodes is mapped to an RGB code, whereby a multi-pixel image may be generated.
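One possible reading of the RGB code mapping is that each of the three opcode hex codes in a 3-gram supplies one color channel of a single pixel. The sketch below illustrates that reading under stated assumptions: the function name `trigram_to_rgb` is hypothetical, and keeping only the low byte of a 2-byte opcode is a choice made for illustration so that every channel fits in the range 0 to 255.

```python
def trigram_to_rgb(trigram: tuple[str, str, str]) -> tuple[int, int, int]:
    """Map a 3-gram of opcode hex codes to one (R, G, B) pixel."""
    if len(trigram) != 3:
        raise ValueError("expected exactly three opcode hex codes")
    # Assumption: only the low byte of each opcode is used as the channel value.
    r, g, b = (int(code, 16) & 0xFF for code in trigram)
    return (r, g, b)


if __name__ == "__main__":
    print(trigram_to_rgb(("55", "89", "c3")))  # push, mov, ret
```

Under this mapping, the pixel sequence preserves the order of the 3-grams, so neighboring pixels correspond to neighboring instruction windows.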
For example, a 1-gram sequence of opcodes may be converted to a grayscale image, and a 2-gram of opcodes may be converted into X and Y coordinates to be used as location information in a plane. Here, a 3-gram sequence of opcode information may be converted to an image having pixels of M×M (jpg, png, bmp, or the like) through RGB code mapping, as illustrated in
If the 3-gram sequences respectively generated from one hundred thousand pieces of malware include 900 3-grams on average, an image of 30×30 pixels may be generated (30 × 30 = 900). Here, when the number of 3-grams in a 3-gram sequence of malware is less than the average, zero padding is added, which appears as black pixels when the 3-gram sequence is converted into an image. Also, when the number of 3-grams in a 3-gram sequence of malware is equal to or greater than the average, control may be performed such that only a number of 3-grams equal to the average, among all of the 3-grams in the 3-gram sequence, are converted into an image.
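The padding and truncation rule can be sketched as follows, assuming each 3-gram has already been converted to one (R, G, B) pixel. The function name `to_pixel_grid` and the parameter `target_count` (the average number of 3-grams, 900 in the example above) are hypothetical, and the sketch assumes the target count is a perfect square so that an M×M grid results.

```python
import math

BLACK = (0, 0, 0)  # zero padding renders as a black pixel


def to_pixel_grid(pixels: list[tuple[int, int, int]], target_count: int) -> list[list[tuple[int, int, int]]]:
    """Fit a pixel sequence to exactly target_count pixels and reshape it to M x M."""
    side = math.isqrt(target_count)
    if side * side != target_count:
        raise ValueError("target_count must be a perfect square")
    fitted = list(pixels[:target_count])                # truncate long sequences
    fitted += [BLACK] * (target_count - len(fitted))    # zero-pad short sequences
    return [fitted[row * side:(row + 1) * side] for row in range(side)]


if __name__ == "__main__":
    grid = to_pixel_grid([(85, 137, 195)] * 700, 900)
    print(len(grid), len(grid[0]))  # rows and columns of the resulting grid
```

The resulting grid can then be written out in any image format; that step is omitted here because it depends on the imaging library used.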
Here, the format of the resultant image may take any of various formats, such as jpg, png, gif, bmp, and the like.
Also, in the method for detecting unknown malware using a variable-length operation code according to an embodiment of the present invention, the multi-pixel image is input to a deep-learning model based on AI, whereby unknown malware is detected at step S130.
Here, step S130 may correspond to an unknown malware detection step in the step of analyzing and classifying malware based on AI, illustrated in
Here, in the present invention, the deep-learning model based on AI may be trained in advance using training data generated for each type of malware, and a description thereof will be made later with reference to Table 2.
For example, the deep-learning model according to an embodiment of the present invention may correspond to a model based on a Convolutional Neural Network (CNN), as illustrated in
Here, unknown malware may be detected based on the similarity between feature information of each type of malware classified through training and multi-pixel feature information output from the deep-learning model.
For example, when the similarity between feature information of each type of malware classified through training and multi-pixel feature information output from the deep-learning model is equal to or greater than a preset similarity, it may be determined that the detection target is unknown malware. That is, when the detection target includes feature information that is not the same as that of previously classified malware but similar thereto, it may be determined that the detection target is a new unknown malware variant.
Here, in consideration of the similarity between feature information of each type of malware classified through training and the multi-pixel feature information output from the deep-learning model, malware of a previously classified type may also be detected.
For example, it may be assumed that the multi-pixel feature information output from the deep-learning model is X and that A, B and C are present as feature information for respective types of malware classified through training. Here, the multi-pixel feature information X may be compared with each of A, B, and C in order to determine the similarity therebetween.
If the similarity between X and A is equal to or greater than a preset reference, X may be determined to be malware corresponding to the feature information A.
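The comparison of X with A, B, and C can be sketched as follows. The description does not name a similarity measure, so cosine similarity over feature vectors is used here purely as an assumption; the function names `cosine_similarity` and `classify` and the parameter `threshold` (standing in for the preset reference) are hypothetical.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(x: list[float], references: dict[str, list[float]], threshold: float):
    """Return (label, score) for the most similar malware type.

    The label is None when even the best score is below the threshold.
    """
    best_label, best_score = max(
        ((label, cosine_similarity(x, vec)) for label, vec in references.items()),
        key=lambda item: item[1],
    )
    return (best_label if best_score >= threshold else None, best_score)


if __name__ == "__main__":
    # Illustrative two-dimensional feature vectors for types A, B, and C.
    refs = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [-1.0, 0.0]}
    print(classify([0.9, 0.1], refs, threshold=0.8))
```

In practice the feature vectors would come from the trained deep-learning model rather than being written by hand as in this example.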
Also, although not illustrated in
Also, although not illustrated in
For example, training data corresponding to malicious or benign code is labeled with a malware type that is classified in advance, and is then used as training data for the deep-learning model based on AI.
Here, because the criteria for classifying malware into each type differ among anti-virus vendors, the classification of types is not uniform, and the criteria used for labeling for training may also be variable. For example, Table 2 shows an example of labeling for classifying malware into ten types when training data in the form of an image is generated.
Here, in the case of malware sample data for training that is classified based on various criteria (criteria of other anti-virus vendors, VirusTotal, and the like), other than the criteria used for classification into the classes listed in Table 2, a multi-pixel image may be generated based on information (an anti-virus product name, whether malware is detected, a version, a detection name result, an update date, and the like) detected by each vendor in response to a specific malware file or a hash value, and may then be applied to various AI technologies, such as the CNN-based model illustrated in
Here, at least one of information about entropy (randomness) of each piece of malware, the original creation date thereof, the final update date thereof, and information about the website via which the malware is distributed may be acquired and used as training data.
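The entropy of a piece of malware can be computed, for example, as the Shannon entropy of its byte distribution. The description does not state which entropy measure is used, so this choice is an assumption, and the function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Sum p * log2(1/p) over the observed byte values.
    return sum((count / total) * math.log2(total / count) for count in Counter(data).values())


if __name__ == "__main__":
    print(shannon_entropy(b"aaaa"))
    print(shannon_entropy(bytes(range(256))))
```

Values close to 8 bits per byte indicate nearly random content, which is commonly associated with packed or encrypted files.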
For example, information about entropy of specific malware, the original creation date thereof, the final update date thereof, and information about the website (IP address or domain) via which the specific malware is distributed over the Internet may be additionally acquired at the step of optimizing the m-pixel feature information based on a 3-gram, illustrated in
Also, in the present invention, the various n-gram sequences of opcodes that are separately stored at the optimization step illustrated in
Through the above-described method for detecting unknown malware, detection accuracy and performance of technology for detecting unknown malware based on AI may be improved.
Also, malware detection technology capable of classifying and detecting unknown malware based on feature information characterized by the variable length thereof may be provided.
Also, malware detection technology that can also be applied to detection of a document-type or script-type malware file may be provided.
Referring to
The communication unit 1210 may serve to transmit and receive information required for detection of unknown malware through a communication network. Here, the network provides a path via which data is delivered between devices, and may be conceptually understood to encompass networks that are currently being used and networks that have yet to be developed.
For example, the network may be an IP network, which provides service for transmission and reception of a large amount of data and uninterrupted data service through an Internet Protocol (IP), an all-IP network, which is an IP network structure that integrates different networks based on IP, or the like, and may be configured as a combination of one or more of a wired network, a Wireless Broadband (WiBro) network, a 3G mobile communication network including WCDMA, a 3.5G mobile communication network including a High-Speed Downlink Packet Access (HSDPA) network and an LTE network, a 4G mobile communication network including LTE advanced, a satellite communication network, and a Wi-Fi network.
Also, the network may be any one of a wired/wireless local area network for providing communication between various kinds of data devices in a limited area, a mobile communication network for providing communication between mobile devices or between a mobile device and the outside thereof, a satellite communication network for providing communication between earth stations using a satellite, and a wired/wireless communication network, or may be a combination of two or more selected therefrom. Meanwhile, the transmission protocol standard for the network is not limited to existing transmission protocol standards, but may include all transmission protocol standards to be developed in the future.
The processor 1220 collects operation code (opcode) information from a detection target.
Also, the processor 1220 generates a multi-pixel image having a variable length by performing feature engineering on the opcode information.
Here, the multi-pixel image may be a multi-pixel RGB image based on an n-gram corresponding to the opcode information.
Here, n-gram sequences of hexadecimal codes having a variable length may be stored based on the opcode information, and a 3-gram of opcodes may be mapped to an RGB code based on the n-gram sequences, whereby the multi-pixel RGB image may be generated.
Here, a text section may be extracted from an executable file corresponding to the detection target, raw data in the text section may be converted to opcodes in an assembly language format using a binary analysis tool, and the hex codes may be extracted based on the opcodes.
Here, the opcode may be a code for providing at least one function, among a logical operation, program flow control, memory manipulation, and an arithmetic operation.
Here, the opcode may correspond to the opcode in an assembly language format in a one-to-one manner.
Here, the opcode may be configured with a 1-byte or 2-byte instruction and multiple operand values.
Also, the processor 1220 inputs the multi-pixel image to a deep-learning model based on AI, thereby detecting unknown malware.
Here, unknown malware may be detected in consideration of the similarity between feature information of each type of malware classified through training and multi-pixel feature information output from the deep-learning model.
Also, the processor 1220 may generate a multi-pixel image for training using opcode information collected based on multiple pieces of benign code and multiple pieces of malware, generate training data so as to correspond to the multi-pixel image for training, and train the deep-learning model using the training data.
Here, at least one of information about entropy of each piece of malware, the original creation date thereof, the final update date thereof, and information about a website via which the malware is distributed may be acquired and used as the training data.
The memory 1230 stores the opcode information and the multi-pixel image.
Also, the memory 1230 stores various kinds of information generated in the above-described apparatus for detecting unknown malware according to an embodiment of the present invention.
According to an embodiment, the memory 1230 may be separate from the apparatus for detecting unknown malware, and may support the function for detecting unknown malware. Here, the memory 1230 may operate as separate mass storage, and may include a control function for performing operations.
Meanwhile, the apparatus for detecting unknown malware includes memory installed therein, whereby information may be stored therein. In an embodiment, the memory is a computer-readable medium. In an embodiment, the memory may be a volatile memory unit, and in another embodiment, the memory may be a nonvolatile memory unit. In an embodiment, the storage device is a computer-readable recording medium. In different embodiments, the storage device may include, for example, a hard-disk device, an optical disk device, or any other kind of mass storage device.
Using the above-described apparatus for detecting unknown malware, detection accuracy and performance of technology for detecting unknown malware based on AI may be improved.
Also, malware detection technology capable of classifying and detecting unknown malware based on feature information characterized by the variable length thereof may be provided.
Also, malware detection technology that can also be applied to detection of a document-type or script-type malware file may be provided.
Referring to
Accordingly, an embodiment of the present invention may be implemented as a nonvolatile computer-readable storage medium in which methods implemented using a computer or instructions executable in a computer are recorded. When the computer-readable instructions are executed by a processor, the computer-readable instructions may perform a method according to at least one aspect of the present invention.
According to the present invention, unknown-malware detection technology based on AI, the detection accuracy and performance of which can be improved, may be provided.
Also, the present invention may provide malware detection technology capable of classifying and detecting unknown malware based on feature information characterized by the variable length thereof.
Also, the present invention may provide malware detection technology that can also be applied to detection of a document-type or script-type malware file.
As described above, the apparatus for detecting unknown malware using a variable-length operation code and the method using the apparatus according to the present invention are not limited to the configurations and operations of the above-described embodiments, but all or some of the embodiments may be selectively combined and configured, so that the embodiments may be modified in various ways.
Claims
1. A method for detecting unknown malware, comprising:
- collecting operation code (opcode) information from a detection target;
- generating a multi-pixel image having a variable length by performing feature engineering on the opcode information; and
- detecting unknown malware by inputting the multi-pixel image to a deep-learning model based on AI.
2. The method of claim 1, wherein the multi-pixel image corresponds to a multi-pixel RGB image based on an n-gram corresponding to the opcode information.
3. The method of claim 2, wherein generating the multi-pixel image comprises:
- storing n-gram sequences for hexadecimal (hex) codes having a variable length based on the opcode information; and
- mapping a 3-gram of opcodes to an RGB code based on the n-gram sequences, thereby generating the multi-pixel RGB image.
4. The method of claim 3, wherein collecting the opcode information comprises:
- extracting a text section from an executable file corresponding to the detection target;
- converting raw data in the text section to opcodes in an assembly language format using a binary analysis tool; and
- extracting the hex codes based on the opcodes.
5. The method of claim 1, wherein detecting the unknown malware is configured to detect the unknown malware in consideration of a similarity between feature information of each type of malware classified through training and multi-pixel feature information output from the deep-learning model.
6. The method of claim 1, further comprising:
- generating a multi-pixel image for training using opcode information collected based on multiple pieces of benign code and multiple pieces of malware and generating training data based on the multi-pixel image for training; and
- training the deep-learning model using the training data.
7. The method of claim 6, wherein training the deep-learning model is configured to acquire at least one of information about entropy of each piece of malware, an original creation date thereof, a final update date thereof, and information about a website via which the malware is distributed and to use the acquired information as the training data.
8. The method of claim 4, wherein the opcode is a code for providing at least one function, among a logical operation, program flow control, memory manipulation, and an arithmetic operation.
9. The method of claim 4, wherein the opcode corresponds to an opcode in the assembly language format in a one-to-one manner.
10. The method of claim 4, wherein the opcode includes a 1-byte or 2-byte instruction and multiple operand values.
11. An apparatus for detecting unknown malware, comprising:
- a processor for collecting operation code (opcode) information from a detection target, generating a multi-pixel image having a variable length by performing feature engineering on the opcode information, and detecting unknown malware by inputting the multi-pixel image to a deep-learning model based on AI; and
- memory for storing the opcode information and the multi-pixel image.
12. The apparatus of claim 11, wherein the multi-pixel image corresponds to a multi-pixel RGB image based on an n-gram corresponding to the opcode information.
13. The apparatus of claim 12, wherein the processor is configured to:
- store n-gram sequences for hexadecimal (hex) codes having a variable length based on the opcode information; and
- map a 3-gram of opcodes to an RGB code based on the n-gram sequences, thereby generating the multi-pixel RGB image.
14. The apparatus of claim 13, wherein the processor is configured to:
- extract a text section from an executable file corresponding to the detection target;
- convert raw data in the text section to opcodes in an assembly language format using a binary analysis tool; and
- extract the hex codes based on the opcodes.
15. The apparatus of claim 11, wherein the processor is configured to detect the unknown malware in consideration of a similarity between feature information of each type of malware classified through training and multi-pixel feature information output from the deep-learning model.
16. The apparatus of claim 11, wherein the processor is configured to:
- generate a multi-pixel image for training using opcode information collected based on multiple pieces of benign code and multiple pieces of malware;
- generate training data so as to correspond to the multi-pixel image for training; and
- train the deep-learning model using the training data.
17. The apparatus of claim 16, wherein the processor is configured to:
- acquire at least one of information about entropy of each piece of malware, an original creation date thereof, a final update date thereof, and information about a website via which the malware is distributed, and
- use the acquired information as the training data.
18. The apparatus of claim 14, wherein the opcode is a code for providing at least one function, among a logical operation, program flow control, memory manipulation, and an arithmetic operation.
19. The apparatus of claim 14, wherein the opcode corresponds to an opcode in the assembly language format in a one-to-one manner.
20. The apparatus of claim 14, wherein the opcode includes a 1-byte or 2-byte instruction and multiple operand values.
Type: Application
Filed: Aug 30, 2021
Publication Date: May 5, 2022
Patent Grant number: 11790085
Applicant: Electronics and Telecommunications Research Institute (Daejeon)
Inventors: Jung-Tae KIM (Daejeon), Ji-Hyeon SONG (Daejeon), Jong-Hyun KIM (Daejeon), Sang-Min LEE (Daejeon), Ik-Kyun KIM (Daejeon), Dae-Sung MOON (Daejeon)
Application Number: 17/461,337