Anti-obfuscation binary code clone detection method based on software genes

Document No.: 135029 Publication date: 2021-10-22

Reading note: This invention, "Anti-obfuscation binary code clone detection method based on software genes" (基于软件基因的抗混淆二进制代码克隆检测方法), was created by 单征, 刘福东, 张春燕, 唐柯, 黄一钊, 桂海仁, 乔猛, 熊其冰, 徐恋秋 and 宋智辉 on 2021-06-30. Abstract: The invention discloses an anti-obfuscation binary code clone detection method based on software genes. The source program is first compiled with the O-LLVM compiler to obtain an assembly program, and the control flow graph (CFG) is extracted from that assembly program. Applying the concept of software genes, each node of the CFG is split into independent software gene blocks, whose instructions are then normalized. A random walk algorithm traverses the nodes of the CFG to obtain software gene sequences, which serve as the training set. A machine learning algorithm is then trained on this set: a natural language processing method (Word2Vec) embeds the assembly instructions as words, and Doc2Vec then embeds the software gene sequences semantically to extract the semantic information of each function, so that the trained model performs well at obfuscation-resistant code clone detection. The invention can effectively measure the similarity of binary code while resisting obfuscation options.

1. An anti-obfuscation binary code clone detection method based on software genes, characterized by comprising the following steps:

step 1: compiling the source program by using an Obfuscator-LLVM compiler to obtain a corresponding assembler program;

step 2: traversing all assembler files, analyzing the content of the assembler files, extracting a program control flow graph of the assembler, obtaining a plurality of basic blocks, and storing the basic blocks into a data structure;

step 3: dividing basic blocks in the data structure into software gene blocks, removing empty basic blocks, and transferring the subdivided software gene blocks into a gene data structure;

step 4: normalizing the assembly instructions of the software gene blocks in the gene data structure;

step 5: traversing the nodes (that is, the basic blocks) of the program control flow graph with a random walk algorithm to obtain software gene sequences as a training set, each software gene sequence consisting of a plurality of software gene blocks; performing word embedding on the assembly instructions in the software gene sequences with Word2Vec; then performing semantic embedding on the software gene sequences with Doc2Vec to extract the semantic information of each assembly function and obtain a mathematical vector containing that information; concatenating the plurality of mathematical vectors of the same assembly function; and computing the cosine similarity of the concatenated vectors to compare the assembly functions with one another, thereby completing binary code clone detection.

2. The anti-obfuscation binary code clone detection method based on software genes as claimed in claim 1, wherein in step 2, traversing all assembler files and parsing their content comprises:

first creating a collections.OrderedDict() object named data that stores the contents of all assembler files, in which each key is the file name of an assembler file and each value is a new collections.OrderedDict() object; in that new object, each key is the name of a function in the current assembler file and each value is again a collections.OrderedDict() object; in that innermost object, each key is a label in the current function, namely the identifier of a basic block, and each value is a list that stores the assembly instructions of that basic block.

3. The anti-obfuscation binary code clone detection method based on software genes as claimed in claim 1, wherein in step 3, dividing the basic blocks in the data structure into software gene blocks comprises:

traversing each assembly instruction in the basic block; when the current instruction is found to be a jump instruction, ending the current software gene block; and, if further instructions follow in the basic block, creating a new software gene block to store the subsequent assembly instruction sequence.

4. The anti-obfuscation binary code clone detection method based on software genes as claimed in claim 1, wherein step 4 comprises:

replacing registers such as %eax, %ebx and %edx with "REG", immediate values with "IMM", accessed memory addresses with "ADDRESS", function names following call instructions with "FUNC", variable names with "VAR", and label references in the assembly program with "label".

5. The anti-obfuscation binary code clone detection method based on software genes as claimed in claim 1, further comprising, after step 4:

and storing the extracted data into files, wherein each assembly function is stored into two files, one file is used for storing all basic blocks in the assembly function and the connection relation between the basic blocks, and the other file is used for storing an assembly instruction sequence of the software gene block corresponding to each basic block.

6. The anti-obfuscation binary code clone detection method based on software genes as claimed in claim 1, wherein in step 5, performing word embedding on the assembly instructions in the software gene sequence with Word2Vec comprises:

treating an assembly instruction as a word, a software gene block consisting of a plurality of assembly instructions as a sentence, and a software gene sequence consisting of a plurality of software gene blocks as a paragraph; and obtaining word vectors with the skip-gram architecture of the Word2Vec model, taking a complete assembly instruction as the unit.

Technical Field

The invention belongs to the technical field of network security, and particularly relates to an anti-obfuscation binary code clone detection method based on software genes.

Background

In recent years, as information technology has advanced, software of all kinds has made people's lives more convenient but has also brought many security problems, such as code piracy, software infringement, and the abuse of malicious code. Reverse engineering is essential to addressing these problems: after reverse engineering, unknown code is identified by comparing it against a library of known code and measuring the repetition rate or similarity of code segments, which helps to deal with software piracy, malicious code variants, and similar issues. However, obfuscation tools have become increasingly sophisticated and obfuscation strategies complex and diverse, so that even programs with similar logic, obfuscated by similar tools, yield disassembled code that differs greatly in structure and logic. While these obfuscation techniques largely protect software copyright, they also make code piracy and malicious code variants difficult to detect (M. Lindorfer, A. Di Federico, F. Maggi, P. M. Comparetti, and S. Zanero, "Lines of malicious code: insights into the malicious software industry," in Proceedings of the 28th Annual Computer Security Applications Conference - ACSAC '12, Orlando, Florida, 2012, p. 349, doi: 10.1145/2420950.2421001). Although many methods for studying binary code similarity exist (Y. Hu, Y. Zhang, J. Li, H. Wang, B. Li, and D. Gu, "BinMatch: A Semantics-based Hybrid Approach on Binary Code Clone Analysis," arXiv:1808.06216 [cs], Aug. 2018, Accessed: Mar. 28, 2021. [Online]. Available: http://arxiv.org/abs/1808.06216), none of them resists obfuscation well (L. Luo, J. Ming, D. Wu, P. Liu, and S. Zhu, "Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection," 2014).

Disclosure of Invention

To address the problem that existing binary code similarity methods do not resist obfuscation well, the invention provides an anti-obfuscation binary code clone detection method based on software genes, which can effectively measure the similarity of binary code while resisting obfuscation options.

In order to achieve the purpose, the invention adopts the following technical scheme:

An anti-obfuscation binary code clone detection method based on software genes comprises the following steps:

step 1: compiling the source program by using an Obfuscator-LLVM compiler to obtain a corresponding assembler program;

step 3: dividing basic blocks in the data structure into software gene blocks, removing empty basic blocks, and transferring the subdivided software gene blocks into a gene data structure;

step 4: normalizing the assembly instructions of the software gene blocks in the gene data structure;

Further, in step 2, traversing all the assembler files, and parsing the content of the assembler file includes:

Further, in step 3, the dividing the basic block in the data structure into software gene blocks includes:

Further, the step 4 comprises:

Further, after the step 4, the method further comprises:

Further, in the step 5, Word embedding the assembly instruction in the software gene sequence by using Word2Vec includes:

Compared with the prior art, the invention has the following beneficial effects:

1. Instead of working in the reverse direction, from the binary file back to the assembly program, the method analyzes the binary code in the forward direction, that is, by compiling the source code to the assembly program. The result is the same, but the workload is significantly reduced.

2. A random walk algorithm converts the program control flow graph into software gene sequences. The control flow graph thus becomes a set of sequential assembly code sequences, which avoids graph matching algorithms altogether, effectively reduces computational complexity, and improves efficiency.

3. The assembly program is processed by using Word2Vec to embed the assembly instructions in the software gene sequences as words and then using Doc2Vec to embed the software gene sequences semantically. This yields vectors containing the semantic information of the assembly program, so that the similarity of binary code can be measured effectively while resisting obfuscation options.

Drawings

FIG. 1 is a flow chart of the anti-obfuscation binary code clone detection method based on software genes according to an embodiment of the invention;

FIG. 2 is an example diagram of basic block division in the method according to an embodiment of the invention;

FIG. 3 is an example diagram of software gene block segmentation in the method according to an embodiment of the invention;

FIG. 4 is an example diagram of the assembly instruction normalization process in the method according to an embodiment of the invention;

FIG. 5 is a first example diagram of word vector extraction in the method according to an embodiment of the invention;

FIG. 6 is a second example diagram of word vector extraction in the method according to an embodiment of the invention;

FIG. 7 is a first diagram of the word vector extraction effect of the method according to an embodiment of the invention;

FIG. 8 is a second diagram of the word vector extraction effect of the method according to an embodiment of the invention;

FIG. 9 is a line graph of the similarity between functions for different word vector dimensions in the method according to an embodiment of the invention.

Detailed Description

The invention is further illustrated by the following examples in conjunction with the accompanying drawings:

As shown in FIG. 1, an anti-obfuscation binary code clone detection method based on software genes comprises:

step S101: compiling the source program by using an Obfuscator-LLVM compiler (O-LLVM compiler) to obtain a corresponding assembler program;

step S102: traversing all assembler files, analyzing the content of the assembler files, extracting a program Control Flow Graph (CFG) of the assembler, obtaining a plurality of basic blocks, and storing the basic blocks in a data structure;

step S103: dividing basic blocks in the data structure into software gene blocks, removing empty basic blocks, and transferring the subdivided software gene blocks into a gene data structure;

step S104: normalizing the assembly instructions of the software gene blocks in the gene data structure;

step S105: traversing the nodes (that is, the basic blocks) of the program control flow graph with a random walk algorithm to obtain software gene sequences as a training set, each software gene sequence consisting of a plurality of software gene blocks; performing word embedding on the assembly instructions in the software gene sequences with Word2Vec; then performing semantic embedding on the software gene sequences with Doc2Vec to extract the semantic information of each assembly function and obtain a mathematical vector containing that information; concatenating the plurality of mathematical vectors of the same assembly function; and computing the cosine similarity of the concatenated vectors to compare the assembly functions with one another, thereby completing binary code clone detection.

Specifically, in step S101:

Obfuscator-LLVM is a multi-platform LLVM-based compilation suite that protects software through code obfuscation and tamper-proofing. It has three main obfuscation functions: instruction substitution, bogus control flow, and control flow flattening. The source program is compiled separately with each of these three obfuscation techniques. We selected the widely used open-source OpenSSL source code and several other open-source code libraries; details are given in Table 1. These are then compiled with the Obfuscator-LLVM compiler to obtain the corresponding assembly programs.

Table 1 data set description

Specifically, in step S102:

All assembler files are traversed and their content is parsed. The parsing procedure is as follows. First, a collections.OrderedDict() object named data is created to store the contents of all assembler files: each key is the file name of an assembler file and each value is a new collections.OrderedDict() object. In that new object, each key is the name of a function in the current assembler file and each value is again a collections.OrderedDict() object. In that innermost object, each key is a label in the current function, namely the identifier of a basic block, and each value is a list that stores the assembly instructions of that basic block. Once all assembler files have been parsed, all assembly code is held in the data structure and the basic blocks have been divided. The division of basic blocks is shown in FIG. 2.
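The nested structure described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `parse_assembly`, the AT&T-syntax input, and the rule that a label not starting with "." opens a new function are all assumptions made for the example.

```python
from collections import OrderedDict

def parse_assembly(files):
    """Build data[file_name][function_name][label] -> list of instructions.

    `files` maps a file name to assembly text. Hypothetical helper; the
    label conventions below are assumptions, not taken from the patent.
    """
    data = OrderedDict()
    for file_name, text in files.items():
        functions = OrderedDict()
        current_func = None
        current_label = None
        for raw in text.splitlines():
            line = raw.split("#", 1)[0].strip()  # drop trailing comments
            if not line:
                continue
            if line.endswith(":"):
                label = line[:-1]
                if not label.startswith("."):
                    # Assumption: a global label starts a new function.
                    current_func = functions.setdefault(label, OrderedDict())
                elif current_func is None:
                    continue  # local label outside any function
                current_label = label
                current_func[current_label] = []
                continue
            if line.startswith("."):
                continue  # assembler directive, not an instruction
            if current_func is not None:
                current_func[current_label].append(line)
        data[file_name] = functions
    return data
```

Keying the innermost dictionary by label keeps the basic blocks in source order, which is the property an ordered dictionary provides here.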

Specifically, in step S103:

Software genes are the code segments obtained by dividing an assembly program according to function. Using this concept, the original basic blocks are divided into "software gene blocks": within each software gene block the control flow executes sequentially, and only the last instruction is a jump instruction or a ret instruction. The software gene blocks are connected to one another according to the logical structure of the program.

The main data obtained by parsing the assembler files is stored in the data structure. This step refines each basic block in data once more: each basic block is divided into software gene blocks, empty basic blocks are removed, and the subdivided software gene blocks are transferred into a gene data structure. The subdivision proceeds roughly as follows: each instruction in the basic block is traversed; when the current instruction is found to be a jump instruction, the current software gene block is ended; and if further instructions follow in the basic block, a new software gene block is created to store the subsequent instruction sequence. A schematic diagram of software gene block segmentation is shown in FIG. 3. Once all nodes in the assembler file have been traversed, the segmentation of software gene blocks is complete.
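The subdivision rule can be illustrated with a short sketch. The helper names are hypothetical, and treating every x86 mnemonic that begins with "j" or "ret" as a block terminator is an assumption made for this example.

```python
from collections import OrderedDict

def is_terminator(instr):
    """True for jump and return instructions (assumed x86 mnemonics)."""
    mnemonic = instr.split()[0].lower()
    return mnemonic.startswith("j") or mnemonic.startswith("ret")

def split_gene_blocks(basic_block):
    """Split one basic block (a list of instructions) into gene blocks."""
    genes, current = [], []
    for instr in basic_block:
        current.append(instr)
        if is_terminator(instr):
            genes.append(current)  # the jump ends the current gene block
            current = []
    if current:
        genes.append(current)  # trailing instructions form the last block
    return genes

def build_gene_structure(function_blocks):
    """Map label -> gene blocks, dropping empty basic blocks."""
    genes = OrderedDict()
    for label, instrs in function_blocks.items():
        blocks = split_gene_blocks(instrs)
        if blocks:
            genes[label] = blocks
    return genes
```

For example, `split_gene_blocks(["movl $1, %eax", "jle .L2", "addl $1, %eax", "retq"])` yields two gene blocks, each ending in a jump or return.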

Specifically, in step S104:

After the gene data structure is obtained, the data needs further processing. Each instruction consists of an opcode and operands, but the operands are complex and varied: immediates can be any integer, registers include %eax, %ebx, %edx and many others, and memory addresses can be written in a variety of addressing modes. The main normalization rules are as follows: registers such as %eax, %ebx and %edx are all replaced with "REG", immediate values with "IMM", accessed memory addresses with "ADDRESS", function names following call instructions with "FUNC", variable names with "VAR", and label references in the assembly program with "label". The normalization process is shown in FIG. 4.
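A minimal sketch of these normalization rules is given below, assuming AT&T-syntax x86 assembly. The regular expressions, the helper names, and the simplification that any operand of a call becomes "FUNC" and any operand of a jump becomes "label" (including indirect targets) are assumptions for illustration only.

```python
import re

REGISTER = re.compile(r"%[a-z][a-z0-9]*")
IMMEDIATE = re.compile(r"\$-?(?:0x[0-9a-fA-F]+|\d+)")

def split_operands(text):
    """Split on commas that are not inside parentheses."""
    parts, depth, cur = [], 0, ""
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(cur.strip())
            cur = ""
        else:
            cur += ch
    if cur.strip():
        parts.append(cur.strip())
    return parts

def normalize_operand(mnemonic, op):
    if mnemonic.startswith("call"):
        return "FUNC"      # function name after a call
    if mnemonic.startswith("j"):
        return "label"     # jump target
    if "(" in op:
        return "ADDRESS"   # memory operand such as -8(%rbp)
    if REGISTER.fullmatch(op):
        return "REG"
    if IMMEDIATE.fullmatch(op):
        return "IMM"
    return "VAR"           # anything else is treated as a variable name

def normalize(instr):
    """Return the instruction with its operands replaced by class tokens."""
    fields = instr.split(None, 1)
    mnemonic = fields[0]
    if len(fields) == 1:
        return mnemonic
    ops = [normalize_operand(mnemonic, op) for op in split_operands(fields[1])]
    return mnemonic + " " + ", ".join(ops)
```

Under these assumptions, `normalize("movq -8(%rbp), %rax")` gives `"movq ADDRESS, REG"`, so instructions that differ only in register allocation or stack offsets collapse to the same token.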

Specifically, after step S104, the method further includes:

After replacement, the extracted data is stored in files as the data set. Each function is stored as two files: one file (.edge) stores all basic blocks (nodes) of the function and the connections between them, and the other file (.node) stores the assembly instruction sequences of the software gene blocks corresponding to each basic block.
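The text does not specify the layout of the two files, so the following is only one possible sketch. The line formats ("node ...", "edge ...", tab-separated gene blocks) and all function names are hypothetical.

```python
import os

def format_edge_file(nodes, edges):
    """One 'node <label>' line per basic block, one 'edge <src> <dst>' per link."""
    lines = ["node " + n for n in nodes]
    lines += ["edge " + src + " " + dst for src, dst in edges]
    return "\n".join(lines) + "\n"

def format_node_file(gene_blocks):
    """One line per gene block: label, block index, instruction sequence."""
    lines = []
    for label, blocks in gene_blocks.items():
        for index, block in enumerate(blocks):
            lines.append(label + "\t" + str(index) + "\t" + "; ".join(block))
    return "\n".join(lines) + "\n"

def save_function(out_dir, func_name, nodes, edges, gene_blocks):
    """Write <func_name>.edge and <func_name>.node into out_dir."""
    edge_path = os.path.join(out_dir, func_name + ".edge")
    node_path = os.path.join(out_dir, func_name + ".node")
    with open(edge_path, "w", encoding="utf-8") as f:
        f.write(format_edge_file(nodes, edges))
    with open(node_path, "w", encoding="utf-8") as f:
        f.write(format_node_file(gene_blocks))
    return edge_path, node_path
```

Separating graph structure from instruction content lets the random walk read only the small .edge file and look up instructions afterwards.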

Specifically, in step S105:

First, a random walk is used to obtain ordered code sequences (software gene sequences) as training data, and Word2Vec is then used to embed the assembly instructions in those sequences as words. Here we borrow from natural language processing and use the Word2Vec model. Word2Vec is a family of machine learning models that generate word vectors; the models are shallow two-layer neural networks trained on text to learn the semantic information of words. A Word2Vec model maps any word to a fixed-length high-dimensional feature vector and comes in two architectures: continuous bag-of-words (CBOW) and skip-gram. In the CBOW architecture, the model predicts the current word from a window of surrounding context words and does not consider the order of those words. In the skip-gram architecture, the model uses a fixed-size window and predicts the context words in the window from the current word. Both architectures represent an input word as a fixed-length feature vector, but the CBOW model has two obvious weaknesses: it loses the order of the words in a sentence and ignores the semantic information the words carry.

Assembly instructions are treated as words, a gene block made up of several assembly instructions as a sentence, and a gene sequence made up of several gene blocks as a paragraph. Using the CBOW model would discard the important information described above, which is unacceptable here, so the skip-gram architecture is used to train the word vectors. The resulting word vectors must have the following property: words with similar meanings are mapped to vectors that lie close together in Euclidean distance. In this way, as much of the semantic information of the words as possible is preserved when words are mapped to vectors, so that the final vectors capture as much of the functional information of each function as possible and can serve as the basis for comparing function similarity.
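To make the skip-gram setup concrete, the sketch below enumerates the (current word, context word) training pairs that a fixed-size window produces; in skip-gram each pair is an example in which the model predicts the context word from the current word. The function name and the default window size are assumptions for illustration, and the training itself is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """List (center, context) pairs for every token and its window neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

With normalized instructions as tokens, `skipgram_pairs(["movl IMM, REG", "cmpl IMM, REG", "jle label"], window=1)` pairs each instruction with its immediate neighbours in the gene sequence.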

When training the model, word vectors are obtained first. Following the Word2Vec model from natural language processing (NLP), a word vector can be obtained either by taking each token of an assembly instruction as the unit, as shown in FIG. 5, or by taking a complete assembly instruction as the unit, as shown in FIG. 6. Based on later experimental comparison, the complete assembly instruction is used as the unit for obtaining word vectors.
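The two segmentation choices differ only in how a gene block is turned into a token list. The sketch below shows both, with hypothetical function names; it covers tokenization only and does not train a model.

```python
import re

def tokens_per_word(gene_block):
    """Each opcode and each operand becomes its own token."""
    out = []
    for instr in gene_block:
        out.extend(t for t in re.split(r"[\s,]+", instr) if t)
    return out

def tokens_per_instruction(gene_block):
    """Each complete instruction becomes a single token."""
    return [" ".join(instr.split()) for instr in gene_block]

block = ["movl IMM, REG", "jle label"]
print(tokens_per_word(block))         # five tokens
print(tokens_per_instruction(block))  # two tokens
```

If a library such as gensim were used (an assumption; the text does not name one), token lists like these would serve as the `sentences` input to `Word2Vec`, with `sg=1` selecting skip-gram.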

The random walk algorithm operates on the program control flow graph: an arbitrary node of the graph is selected, and from that node, following the direction of control flow, a directly connected node is chosen at random as the next node. This is repeated until a stopping condition is reached, such as a fixed sequence length or the end of the program. To keep the random node sequences from growing too long, a truncated random walk is used: the maximum length of a node sequence produced by the random walk is 10, and any execution path of the function longer than 10 is truncated. In this way the control flow graph is converted into a series of assembly sequences that serve as training data.
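A truncated random walk of this kind can be sketched as follows. The CFG is assumed to be a dictionary mapping each node to its list of successors; `walks_per_node`, `seed`, and the helper names are hypothetical parameters introduced for the example, while the maximum length of 10 comes from the text.

```python
import random

def truncated_random_walk(cfg, start, max_len=10, rng=random):
    """Follow random successors from `start`, stopping at max_len nodes
    or at a node with no successors (end of the function)."""
    walk = [start]
    node = start
    while len(walk) < max_len:
        successors = cfg.get(node, [])
        if not successors:
            break
        node = rng.choice(successors)
        walk.append(node)
    return walk

def generate_walks(cfg, walks_per_node=2, max_len=10, seed=0):
    """Start several walks from every node of the CFG."""
    rng = random.Random(seed)
    walks = []
    for start in cfg:
        for _ in range(walks_per_node):
            walks.append(truncated_random_walk(cfg, start, max_len, rng))
    return walks

def walk_to_sequence(walk, node_instrs):
    """Concatenate the instructions of the visited nodes into one sequence."""
    sequence = []
    for node in walk:
        sequence.extend(node_instrs[node])
    return sequence
```

Because each walk is a plain sequence, the downstream models never have to compare graphs directly.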

Doc2Vec is then used to embed the software gene sequences semantically and extract the semantic information of each assembly function, yielding a mathematical vector that contains this information; the cosine similarity between these vectors is computed to compare the assembly functions with one another. Doc2Vec is an unsupervised machine learning model that maps a variable-length text (a sentence, a passage, or even a whole article) to a fixed-length feature vector. It learns a vector representing the text by training to predict the words in it, and a large body of research shows that the resulting Paragraph Vector makes up for the shortcomings of other text vector representations such as the bag-of-words model. Although the paragraph vector is initialized randomly, after training with the Doc2Vec model it represents the semantic information of the text to some extent. The Doc2Vec model is applied to the semantic extraction of the assembly sequences, and because word embedding in the manner of Word2Vec is performed implicitly inside the model, no separate Word2Vec model needs to be trained to obtain word vectors.
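The comparison step relies only on the standard cosine similarity formula, which can be written in a few lines. The function name is illustrative, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|) for two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: no direction, no similarity
    return dot / (norm_u * norm_v)
```

Cosine similarity depends only on the direction of the vectors, so two function vectors of different magnitude but similar direction still score close to 1.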

To verify the effect of the present invention, the following experiment was performed:

(a) Assembly instruction segmentation method

The assembly instructions were trained with each of the two segmentation methods to obtain the corresponding feature vectors. To inspect the training results more clearly and compare them, t-SNE was introduced. t-SNE is a commonly used tool for visualizing high-dimensional data: it reduces high-dimensional data to two or three dimensions, and the reduced data can then be plotted with the matplotlib package, so that the similarity between the learned word embedding vectors can be seen directly from the figure. FIG. 7 and FIG. 8 show the vectors obtained with the two segmentation methods.

From FIG. 7 it can be seen that the points corresponding to the words are spaced fairly uniformly: the distance between words with similar meanings (such as "jle" and "jl") is not clearly different from the distance to words with very different meanings (such as "popq"), so this representation does not reflect the semantic information of the words well.

In FIG. 8, by contrast, the instructions are clearly not evenly distributed; groups of instructions cluster together and are separated from the rest. The enlarged region of the figure shows that instructions with similar functions are mapped to nearby points in the two-dimensional space. For example, the instructions "cmovlel REG" and "cmovbel REG" are semantically similar, and the word vectors obtained for them after Word2Vec training are also close together. We can therefore conclude that treating a complete instruction as a word captures the semantic information of the instruction to some extent.

These two experiments show that the two instruction segmentation methods give very different word embedding results. When each instruction is split into tokens before embedding, the word vectors are distributed fairly uniformly, which means they do not capture the semantic information of the assembly instructions well. When a complete instruction is treated as a word, instructions with similar functions cluster together in the two-dimensional projection, which means the word vectors capture the semantic information of the program instructions better. This also indicates, to some extent, that applying NLP methods to instruction sequences is reasonable.

(b) Dimension of word vector

To study how the dimension of the word vectors affects the results when training the Doc2Vec model, experiments were run with word vector dimensions of 25, 50, 100, 150 and 200, using the similarity of the same function as the criterion for judging the results. To save time and reduce experimental cost, a small part of the training data was randomly selected as experimental data for training the Doc2Vec model. After training, the trained model was tested; the results are shown in FIG. 9, where the horizontal axis represents the vector dimension and the vertical axis represents the similarity between function vectors.

FIG. 9 shows that the similarity between function vectors varies with the vector dimension, and that for a given function there is an optimal dimension at which the similarity between its obfuscated and non-obfuscated versions is highest. In this experiment, the similarity between the vectors of similar functions is highest when the word vector dimension is 150, so 150 is selected as the word vector dimension for training the Doc2Vec model.

With the optimal training parameters selected in the experiments above, the Doc2Vec model is trained and a vector representing the semantics of each node sequence is obtained. A function necessarily contains several node sequences, however, so comparing two functions actually means comparing one set of vectors with another. There are several options: one is to sum the vectors and take the average, and another is to concatenate the vectors directly. Each has advantages and disadvantages, and the two must be compared on experimental data. Repeated experiments showed that directly concatenating the vectors of a function gives higher accuracy, so this method is used to compute function similarity in the tests below.
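The two aggregation options can be sketched as below. The function names are illustrative. Note that concatenated vectors can only be compared when both functions contribute the same number of sequence vectors; how differing counts are handled is not stated in the text, so that detail is left open here.

```python
def average_vectors(vectors):
    """Element-wise mean of equal-length vectors: output keeps the dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def concatenate_vectors(vectors):
    """Join the vectors end to end: output dimension is the sum of the inputs."""
    return [x for vector in vectors for x in vector]

sequence_vectors = [[1.0, 2.0], [3.0, 4.0]]
print(average_vectors(sequence_vectors))      # [2.0, 3.0]
print(concatenate_vectors(sequence_vectors))  # [1.0, 2.0, 3.0, 4.0]
```

Averaging blends the sequences into one point and can wash out differences between them, whereas concatenation keeps every sequence vector intact at the cost of a higher dimension.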

Meaning of the p@n test: if a and b are a pair of similar functions, b is placed together with 99 (or more) different randomly selected functions, the similarity between a and each of these 100 functions is computed, and the similarities are sorted in descending order; p@n is the probability that b ranks among the top n. In our experiments, the similar functions are assembly functions compiled from the same source function with different obfuscation options of the O-LLVM compiler. For example, function a is an assembly function compiled with any one of the three obfuscation options, and the 100 functions it is compared against are assembly functions compiled without any obfuscation option.
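The p@n metric can be sketched as follows. The function names are hypothetical, a plain dot product stands in for the similarity measure so that the block is self-contained, and breaking ties in favour of the true match is an assumption of this sketch.

```python
def dot(u, v):
    """Stand-in similarity for the example; any similarity function works."""
    return sum(a * b for a, b in zip(u, v))

def rank_of_match(query_vec, match_vec, distractor_vecs, similarity=dot):
    """1-based rank of the true match among match + distractors.
    Ties are resolved in favour of the match (an assumption)."""
    target = similarity(query_vec, match_vec)
    better = sum(1 for d in distractor_vecs if similarity(query_vec, d) > target)
    return 1 + better

def precision_at_n(ranks, n):
    """Fraction of queries whose true match ranks within the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

ranks = [1, 2, 5, 11]
print(precision_at_n(ranks, 1), precision_at_n(ranks, 3), precision_at_n(ranks, 10))
```

Here each entry of `ranks` would come from one obfuscated query function a, its non-obfuscated counterpart b, and the randomly chosen distractor functions.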

Test 1: This test uses the LibTomCrypt data set. The functions in each test set are all compiled with the same obfuscation option (instruction substitution (sub), bogus control flow (bcf), or control flow flattening (fla)), and each set contains 100 functions. During testing, the similarity between each obfuscated function and each non-obfuscated function is computed, the similarities are sorted, and p@1, p@3 and p@10 are calculated. The results are shown in Table 2.

Table 2 test results for LibTomCrypt

Test 2: This test uses the LibGmp data set instead; the test method and procedure are identical to those of Test 1. The results are shown in Table 3.

Table 3 test results of LibGmp

The experimental results show that the Doc2Vec model performs best against the sub obfuscation option of the O-LLVM compiler, where the probability in the p@10 test is very close to 1, which shows that the invention resists the obfuscation options well.

In conclusion, instead of working in the reverse direction from the binary file back to the assembly program, the method analyzes the binary code in the forward direction, that is, by compiling the source code to the assembly program; the result is the same, but the workload is significantly reduced. The method uses a random walk algorithm to convert the program control flow graph into software gene sequences, turning the control flow graph into sequential assembly code sequences; this avoids graph matching algorithms, effectively reduces computational complexity, and improves efficiency. By processing the assembly program with natural language processing methods, the invention obtains vectors containing the semantic information of the assembly program, so that it can resist obfuscation options while effectively measuring the similarity of binary code.

The above shows only the preferred embodiments of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.
