Industrial software vulnerability detection method based on self-attention mechanism

Document No. 190876 · Publication date: 2021-11-02

Reading note: this technology, "Industrial software vulnerability detection method based on self-attention mechanism", was designed and created by Zhang Liguo, Xue Jingfang, Jin Mei, Li Jiaqing, Qin Qian, Wang Lei, Shen Qian, Meng Zijie, Geng Xingshuo and Huang Wenhan on 2021-07-22.

Abstract: The invention discloses an industrial software vulnerability detection method based on a self-attention mechanism, which uses the self-attention mechanism to detect software vulnerabilities; the detection process is divided into a code preprocessing process and a vulnerability detection model training and testing process. First, database vulnerability programs are obtained and library API function calls and program fragments are extracted; these are divided into 5 types of vulnerability problems to construct a vulnerability library of industrial software vulnerabilities. The key points of the vulnerability library are then sliced, the sliced programs are assembled into line-association codes and subjected to data processing, after which the line-association codes are converted into corresponding vectors to obtain feature vectors, which are processed to the same length. The feature vectors are added to position-encoding vectors as the input of the vulnerability detection model; the D_transformer neural network is then trained, and the detection capability of the trained model is verified on test samples. The method further improves detection and classification precision and reduces the missed-report (false negative) rate.

1. The industrial software vulnerability detection method based on the self-attention mechanism is characterized in that the method adopts the self-attention mechanism to detect industrial software vulnerabilities, and the detection process comprises a code preprocessing process and a vulnerability detection model training and testing process:

the code preprocessing process comprises the following steps:

s11, extracting API function calls and program fragments according to a database vulnerability program, dividing the API function into 5 types of vulnerability problems to construct a vulnerability library of industrial software vulnerabilities, taking key points of the vulnerability library as the entry points of program slices, and extracting parameters, statements and expressions related to the key points in codes;

s12, slicing key points of the leak library by using a slicing tool to obtain a sliced program, assembling the sliced program into line association codes, performing data processing on the line association codes, and converting the line association codes into corresponding vectors to obtain characteristic vectors;

s13, uniformly processing the feature vectors into vectors with the same length, supplementing 0 after the vectors when the length of the line-associated code vectors is insufficient, and deleting the excess parts when the length of the line-associated code vectors exceeds the set length;

the training and testing process of the code vulnerability detection model comprises the following steps:

s21, extracting a fixed number of samples from the training samples in a batch extraction mode in the training process, and transmitting the samples into a code vulnerability detection model adopting a self-attention mechanism to obtain a prediction result;

s22, updating parameters of the error between the prediction result and the real result through a back propagation and gradient descent algorithm, and training a model through multiple iterations;

and S23, testing by using the trained model in the testing process, comparing whether the testing result of the model is the same as the actual result, and testing the detection capability of the model.

2. The method for detecting industrial software vulnerabilities based on the self-attention mechanism as claimed in claim 1, wherein the database adopts source code from the NIST vulnerability databases, comprising the NVD, which contains vulnerabilities in software products, and the SARD, which contains academic security vulnerabilities, and wherein 80% of the programs are randomly selected as training programs and 20% as testing programs.

3. The method for industrial software vulnerability detection based on the self-attention mechanism of claim 1, wherein the 5 types of vulnerability problems are the buffer overflow problem, the null pointer dereference problem, the heap overflow problem, the API misuse problem and the information leakage problem.

4. The method of claim 1, wherein the slices comprise a forward slice and a backward slice, the forward slice corresponding to the statements affected by the relevant parameters and the backward slice to the statements that may affect them, both slices being extracted using the data dependency graph.

5. The self-attention-mechanism-based industrial software vulnerability detection method of claim 1, wherein a line-association code consists of lines of code that are semantically associated with one another.

6. The method for detecting industrial software vulnerabilities based on the self-attention mechanism as claimed in claim 1, wherein the data processing mainly comprises labeling the line-association codes, with the label set to 1 when a vulnerability is contained and 0 otherwise, performing word segmentation and deduplication on the line-association codes, and assigning different weights to samples to address the imbalance of the data samples.

7. The method for detecting industrial software vulnerabilities based on the self-attention mechanism as claimed in claim 1, wherein the conversion of the line-association code data into corresponding vectors adopts the skip-gram model of word2vec based on hierarchical softmax.

8. The method for detecting industrial software vulnerabilities based on the self-attention mechanism as claimed in claim 1, wherein the code vulnerability detection model is a D_transformer model using the self-attention mechanism, its network structure adopts an encoder-decoder architecture, and each internal encoder and decoder comprises a self-attention mechanism layer and a feedforward neural network.

Technical Field

The invention relates to the technical field of network security, in particular to an industrial software vulnerability detection method based on a self-attention mechanism.

Background

With the continuous development of software development technology across industries, industrial software has emerged alongside it. As an important component of industrial internet application systems, industrial software is application software that meets specific industrial requirements: it frees people from dull, repetitive physical labor so they can devote themselves to more valuable creative work, lets enterprises enjoy advanced software and hardware technology at low cost, improves production efficiency and, through large-scale reuse, further raises the level of intelligent manufacturing in industry. But while industrial software brings convenience to industrial technicians, industrial software vulnerabilities have become a major problem that developers cannot ignore: in an industrial internet environment, once software is connected to a system, a vulnerability can allow machines and equipment to be attacked and the production process to be damaged, disrupted or even halted. The earlier a potential vulnerability in software is detected, the smaller the property and safety losses incurred during production. Conventional static detection methods have a high false positive rate, while dynamic detection methods are prone to missed reports and are time-consuming. In recent years, with the development of deep learning techniques, deep learning has achieved great success in many fields: it can learn higher-level, more complex and abstract feature representations, automatically learn latent features or representations that generalize better, and offers the flexibility to customize network architectures for different application scenarios, making it possible to learn software vulnerability patterns from large amounts of code.

In computer vision, convolutional neural networks obtain fine-grained features from images and are suited to learning structured spatial data; applied to words or sentences, they capture only local context, so they learn only simple contextual code semantics. Recurrent neural networks have achieved impressive results in natural language processing for sequential data; in particular, their bidirectional forms can capture long-range dependencies in sequences, so many studies have used bidirectional long short-term memory networks and gated recurrent unit structures to learn code context dependencies. Conventional deep learning models have also applied attention mechanisms, selecting only key parts of the input for processing, i.e. allocating different amounts of attention to different positions in the input sequence to improve the efficiency of the neural network, which is important for understanding the semantics of many types of vulnerabilities.

Disclosure of Invention

The invention aims to further improve the precision and efficiency of software vulnerability detection, to show a degree of adaptive capability when the network environment changes, to improve the detection of unknown vulnerabilities, and to establish a more complete and comprehensive detection model, thereby improving the effect of software vulnerability detection. To this end, the invention provides an industrial software vulnerability detection method based on a self-attention mechanism.

In order to solve the technical problems and achieve the purpose of the invention, the invention is realized by the following technical scheme: the method is characterized in that the industrial software vulnerability is detected by adopting a self-attention mechanism, and the detection process comprises a code preprocessing process and a vulnerability detection model training and testing process:

the code preprocessing process comprises the following steps:

s11, extracting API function call and program fragments according to a database vulnerability program, dividing the API function into 5 types of vulnerability problems to construct a vulnerability library of industrial software vulnerabilities, and extracting parameters, statements and expressions related to key points in codes by taking the key points of the vulnerability library as the entry points of program slices; s12, slicing key points of the leak library by using a slicing tool to obtain a sliced program, assembling the sliced program into line association codes, performing data processing on the line association codes, and converting the line association codes into corresponding vectors to obtain characteristic vectors; s13, uniformly processing the feature vectors into vectors with the same length, supplementing 0 at the last of the vectors when the length of the line association code vector is less than the set length, and deleting the excess parts when the length of the line association code vector exceeds the set length;

the training and testing process of the code vulnerability detection model comprises the following steps:

s21, extracting a fixed number of samples from the training samples in a batch extraction mode in the training process, and transmitting the samples into a code vulnerability detection model adopting a self-attention mechanism to obtain a prediction result; s22, updating parameters of the error between the prediction result and the real result through a back propagation and gradient descent algorithm, and training a model through multiple iterations; and S23, testing by using the trained model in the testing process, comparing whether the testing result of the model is the same as the actual result, and testing the detection capability of the model.

Preferably, the database adopts source code from the NIST vulnerability databases, comprising the NVD, which contains vulnerabilities in software products, and the SARD, which contains academic security vulnerabilities; 80% of the programs are randomly selected as training programs and 20% as testing programs.

Preferably, the 5 types of vulnerability problems are the buffer overflow problem, the null pointer dereference problem, the heap overflow problem, the API misuse problem and the information leakage problem.

Preferably, the slices include a forward slice and a backward slice, wherein the forward slice corresponds to a statement affected by the relevant parameter, and the backward slice corresponds to a statement that can affect the relevant parameter, and the two slices are extracted by using the data dependency graph.

Preferably, a line-association code consists of lines of code that are semantically associated with one another.

Preferably, the data processing mainly comprises labeling the line-association codes, with the label set to 1 when a vulnerability is contained and 0 otherwise, performing word segmentation and deduplication on the line-association codes, and assigning different weights to samples to address the imbalance of the data samples.

Preferably, the conversion of the line-association code data into corresponding vectors adopts the skip-gram model of word2vec based on hierarchical softmax.

Preferably, the code vulnerability detection model is a D_transformer model adopting the self-attention mechanism, its network structure adopts an encoder-decoder architecture, and each internal encoder and decoder comprises a self-attention mechanism layer and a feedforward neural network.

Compared with the prior art, the invention has the following beneficial effects:

(1) the method adopts a self-attention mechanism; for long sentences, the weight assigned to each input item depends on the interactions between the input items, which avoids the long-term dependency problem, and the mechanism helps the current node attend not only to the current word, so that the semantics of the context can be obtained;

(2) the invention adopts a multi-head attention mechanism, thereby improving the fitting capability of the model;

(3) the method uses the D_transformer model, which accepts all vectors as input simultaneously and therefore has the advantage of parallel computation and a higher training speed; a deeper model, which converges to a lower test error than a shallower one, can be used to obtain higher precision, so the vulnerability detection method better meets practical engineering requirements.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a code pre-processing flow diagram of the present invention;

FIG. 3 is a flow chart of the code vulnerability detection model training and testing of the present invention;

FIG. 4 is a network structure diagram of the D_transformer model according to the present invention;

FIG. 5 is a schematic diagram of the internal structure of the encoder and decoder of the present invention;

FIG. 6 is a detailed structural diagram of the D_transformer of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention belong to the protection scope of the present invention.

The invention will be described in more detail with reference to the following detailed description and accompanying drawings:

the hardware experimental environment of the invention comprises CPU Intel Core I7-9750H, 2.60GHZ,16GRAM, GPU NVIDIA GeForce GTX3060 and video memory 6G. The software running environment is a Windows1064 bit operating system, the programming language is Python3 programming, and a deep learning framework based on the pytorch is adopted. A Joern tool is needed for program slices, a Linux environment is needed for Joern, and the program slices are realized by using a virtual machine, namely Vmware works locations 15.5 Pro, a Linux version Ubuntu18.04.4, a memory 8G and a python version 2.7. Tools for sectioning Joern version 0.3.1, Neo4j version Community-2.1.5, Gremlin 8G, Python-Joern 0.3.1, Joern-tools 0.3.1, JDK 1.7.

Fig. 1 is a flowchart of the method provided by the invention, which comprises a code preprocessing flow and a vulnerability detection model training and testing flow, shown in fig. 2 and fig. 3 respectively. In the code preprocessing flow (fig. 2), the NIST vulnerability databases are obtained: the NVD, which contains vulnerabilities in production software, and the SARD, which contains academic security vulnerabilities. In the NVD, each vulnerability has a unique Common Vulnerabilities and Exposures identifier (CVE ID) and a Common Weakness Enumeration identifier (CWE ID) indicating the type of vulnerability involved. 80% of the programs were randomly selected as training programs and the remaining 20% as testing programs. 6045 C/C++ library/API function calls, including standard library function calls and basic Windows API and Linux kernel API function calls, were extracted from the programs; in total 56902 library/API function calls were extracted, including 7255 forward function calls and 49647 backward function calls. The API functions are divided into 5 types of vulnerability problems to construct the vulnerability library of industrial software vulnerabilities: the buffer overflow problem, the null pointer dereference problem, the heap overflow problem, the API misuse problem and the information leakage problem. Some key points of the 5 classes of vulnerability problems adopted by the invention are shown in Table 1.

TABLE 1 Partial program vulnerabilities and the key points they contain

The key points of the vulnerability library are taken as the entry points of program slicing, and the parameters, statements and expressions related to those key points are extracted from the code. The key points are sliced with the tool Joern; forward and backward slices are extracted using the data dependency graph to obtain the sliced programs, and the sliced programs are assembled into line-association codes. According to the 5 problem categories, 5 data sets are compiled from the line-association code database. The vulnerability data types and the number of line-association codes for the 5 data sets are shown in Table 2.

TABLE 2 Vulnerability data types and number of line-association codes
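As a rough illustration of the slicing and assembly step, the Python sketch below collects the forward slice (lines affected by a key point) and the backward slice (lines that affect it) from a data dependency graph and merges them, in source order, into a line-association code. The graph representation and helper names are hypothetical stand-ins; the invention itself performs this step with Joern on real dependency graphs.

```python
from collections import deque

def slice_lines(dep_graph, key_line, forward=True):
    """Collect the line numbers reachable from key_line along data dependencies.

    dep_graph maps a line number to the set of lines that depend on it;
    following these edges gives the forward slice, following them in
    reverse gives the backward slice.
    """
    if not forward:  # invert the edges for the backward slice
        inverted = {}
        for src, dsts in dep_graph.items():
            for dst in dsts:
                inverted.setdefault(dst, set()).add(src)
        dep_graph = inverted
    seen, queue = {key_line}, deque([key_line])
    while queue:
        line = queue.popleft()
        for nxt in dep_graph.get(line, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def assemble_line_association_code(source_lines, dep_graph, key_line):
    """Merge the forward and backward slices, keeping the original line order."""
    keep = (slice_lines(dep_graph, key_line, forward=True)
            | slice_lines(dep_graph, key_line, forward=False))
    return [source_lines[i] for i in sorted(keep)]

# Toy example: line 2 calls memcpy (a key point); lines 0 and 1 feed into it.
code = ["char buf[8];", "int n = read_input();", "memcpy(buf, src, n);", "use(buf);"]
deps = {0: {2}, 1: {2}, 2: {3}}   # line -> lines that depend on it
print(assemble_line_association_code(code, deps, key_line=2))
```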

Data processing is then performed on the line-association codes, mainly by labeling them: a label of 1 means a vulnerability is contained and 0 means none is. For a line-association code extracted from an NVD program, if at least one of its statements was deleted or modified in a patch, it is automatically labeled "1", otherwise "0"; because automatic labeling can mislabel, the line-association codes automatically labeled "1" are checked manually and corrected where needed. For a line-association code extracted from an SARD program, since every program in the SARD is already labeled "good", "bad" or "mixed", a line-association code extracted from a "good" program is labeled "0", while one extracted from a "bad" or "mixed" program is labeled "1" if at least one of its lines contains a statement causing a vulnerability and "0" otherwise. Deduplication is then performed: for line-association codes with identical code and identical labels only one copy is kept, and line-association codes with identical code but different labels are deleted. After deduplication, the comments in the programs are removed, word segmentation is performed line by line, and user-defined variable names and function names are replaced with variable_m and function_n, respectively, to avoid interference with the vulnerability semantics.
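The following sketch shows one plausible implementation of the tokenization, identifier renaming and deduplication just described; the regex tokenizer, the call heuristic and the known_funcs set are assumptions, since the patent does not specify them (a real pipeline would also keep the identifier maps consistent across all lines of one line-association code).

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def normalize(line, known_funcs):
    """Tokenize one line and map user-defined identifiers to placeholders."""
    tokens = TOKEN_RE.findall(line)
    var_map, func_map, out = {}, {}, []
    for i, tok in enumerate(tokens):
        is_call = i + 1 < len(tokens) and tokens[i + 1] == "("
        if tok in known_funcs or not re.match(r"[A-Za-z_]", tok):
            out.append(tok)                     # keep library APIs, literals, punctuation
        elif is_call:                           # user-defined function name
            func_map.setdefault(tok, f"function_{len(func_map)}")
            out.append(func_map[tok])
        else:                                   # user-defined variable name
            var_map.setdefault(tok, f"variable_{len(var_map)}")
            out.append(var_map[tok])
    return out

def deduplicate(samples):
    """samples: list of (token_list, label). Keep one copy of identical codes
    with a consistent label; drop codes that appear with conflicting labels."""
    labels = {}
    for code, label in samples:
        labels.setdefault(tuple(code), set()).add(label)
    return [(list(code), lab.pop()) for code, lab in labels.items() if len(lab) == 1]

print(normalize("memcpy(buf, src, n);", known_funcs={"memcpy"}))
# ['memcpy', '(', 'variable_0', ',', 'variable_1', ',', 'variable_2', ')', ';']
```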

The words produced by segmenting the line-association codes are converted into corresponding vectors using the skip-gram model of word2vec based on hierarchical softmax. Following the idea of distributed representation, each word is represented as a fixed-length vector, so that the vectors better capture the relationships between different words. The hierarchical-softmax approach uses a binary tree data structure, which greatly reduces the cost of computing training gradients. The skip-gram model is concerned with the probability of generating the context words given a central word; when gradients are updated by back propagation, the skip-gram model adjusts the network using several words, so it outperforms the continuous bag-of-words model on rare words. The skip-gram model is trained by minimizing its loss function to obtain the training samples, and the test samples are obtained in the same way.
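A minimal sketch of this vectorization step using gensim's word2vec is given below; sg=1 selects the skip-gram model and hs=1 (with negative=0) enables hierarchical softmax, matching the description above. The embedding size of 512 is an assumption inferred from the W_i ∈ ℝ^(512×128) matrices in equation (3); the patent does not state it explicitly.

```python
from gensim.models import Word2Vec

# Toy corpus: each "sentence" is the token list of one line-association code.
token_lists = [
    ["memcpy", "(", "variable_0", ",", "variable_1", ",", "variable_2", ")", ";"],
    ["variable_0", "=", "function_0", "(", ")", ";"],
]

# sg=1 -> skip-gram; hs=1 with negative=0 -> hierarchical softmax (binary tree).
w2v = Word2Vec(sentences=token_lists, vector_size=512, window=5,
               min_count=1, sg=1, hs=1, negative=0, epochs=10)

vec = w2v.wv["memcpy"]        # fixed-length vector for one token
print(vec.shape)              # (512,)
```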

Because the vector lengths of the vectorized line-association codes may differ, all line-association code vectors are uniformly processed into vectors of the same length: when a vector is shorter than the set length, 0s are appended to the end; when it is longer, the excess is deleted. Once vectorized, the line-association codes can be fed into the neural network for training and detection.
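A simple sketch of this length normalization is shown below, assuming each sample is a (num_tokens × embedding_dim) tensor; max_tokens=50 follows the training setup described next.

```python
import torch

def pad_or_truncate(seq, max_tokens=50):
    """seq: (num_tokens, dim) tensor -> (max_tokens, dim) tensor."""
    if seq.size(0) >= max_tokens:
        return seq[:max_tokens]               # delete the excess part
    pad = torch.zeros(max_tokens - seq.size(0), seq.size(1), dtype=seq.dtype)
    return torch.cat([seq, pad], dim=0)       # append 0s at the end

x = pad_or_truncate(torch.randn(37, 512))
print(x.shape)                                # torch.Size([50, 512])
```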

The vulnerability detection model training and testing flow is shown in fig. 3. In the training process, the number of tokens represented by each line-association code vector is set to 50, dropout is set to 0.5, 300 hidden nodes are used, and the optimization algorithm trains with a learning rate of 0.001; training proceeds in batches over multiple epochs. A fixed number of samples is extracted from the training samples in batches and passed into the code vulnerability detection model based on the self-attention mechanism.
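The batch-wise training of steps S21 and S22 might look like the following sketch; the classifier here is a simple placeholder standing in for the D_transformer defined below, and the batch size and epoch count are assumptions (the patent gives dropout 0.5, 300 hidden nodes, learning rate 0.001 and 50 tokens per sample, but not these two values).

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

x = torch.randn(256, 50, 512)              # toy feature vectors (padded to 50 tokens)
y = torch.randint(0, 2, (256,))            # 1 = vulnerable, 0 = clean
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(50 * 512, 300),
                      nn.ReLU(), nn.Dropout(0.5), nn.Linear(300, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):                     # multiple iterations
    for xb, yb in loader:                  # batch-wise sample extraction
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)      # error between prediction and truth
        loss.backward()                    # back propagation
        optimizer.step()                   # gradient descent update
```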

The code vulnerability detection model is a D_transformer model based on the self-attention mechanism. Its network structure is shown in FIG. 4: it adopts an encoder-decoder architecture and consists of 12 stacked encoder layers and 12 stacked decoder layers. The simple internal structure of each encoder and decoder is shown in FIG. 5: the encoder comprises a self-attention mechanism layer and a feedforward neural network, and the self-attention mechanism helps the current node attend not only to the current word, so that the semantics of the context can be obtained. The decoder contains the same two layers as the encoder, but between them there is an encoder-decoder attention layer that helps the current node acquire the key content it currently needs to attend to.
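A structural sketch of such a network using PyTorch's built-in transformer blocks is shown below. The 12+12 layer count comes from the description above, and nhead=4 and d_model=512 follow the multi-head discussion later in this section; the classification head and the remaining details are assumptions, not the patent's exact D_transformer.

```python
import torch
from torch import nn

class DTransformer(nn.Module):
    def __init__(self, d_model=512, nhead=4, num_layers=12, num_classes=2):
        super().__init__()
        # Encoder-decoder stack; each layer pairs self-attention with a
        # feedforward network, plus residual connections and layer norm.
        self.core = nn.Transformer(d_model=d_model, nhead=nhead,
                                   num_encoder_layers=num_layers,
                                   num_decoder_layers=num_layers,
                                   dropout=0.5, batch_first=True)
        self.classify = nn.Linear(d_model, num_classes)   # final FC layer

    def forward(self, src, tgt):
        out = self.core(src, tgt)          # (batch, tgt_len, d_model)
        return self.classify(out[:, -1])   # logits; softmax applied in the loss

model = DTransformer()
src = torch.randn(2, 50, 512)              # embedded + position-encoded input
tgt = torch.randn(2, 1, 512)
logits = model(src, tgt)
print(logits.shape)                        # torch.Size([2, 2])
```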

The self-attention mechanism is intended to let the network notice the correlations between different parts of the whole input. It is implemented as follows: each given input vector is multiplied by three coefficient matrices W^q, W^k, W^v to obtain three values Q, K and V, representing the query, key and value of the input, respectively. The similarity A between every pair of input vectors is computed from Q and K by a dot product, as shown in equation (1):

A = K^T · Q;  (1)

The similarity A is divided by a constant and normalized with a softmax operation, and the output of the self-attention layer is then obtained by a dot product with V; the output vector O is given by equation (2):

O = V · softmax(A / √d_k);  (2)

where the constant d_k is the first dimension of the key matrix; scaling by √d_k keeps the dot-product results from becoming too large.
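Equations (1) and (2) transcribe directly into code; in the sketch below the columns of Q, K and V are the per-token vectors, matching the K^T·Q convention above.

```python
import torch

d, n = 512, 50
x = torch.randn(d, n)                # 50 input vectors of dimension 512, as columns
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q, K, V = Wq @ x, Wk @ x, Wv @ x     # query, key, value for every input vector

d_k = K.size(0)
A = K.T @ Q                                     # equation (1): pairwise similarities
weights = torch.softmax(A / d_k ** 0.5, dim=0)  # scale, then softmax over the keys
O = V @ weights                                 # equation (2): O = V·softmax(A/√d_k)
print(O.shape)                                  # torch.Size([512, 50])
```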

A multi-head attention mechanism is added on top of the self-attention mechanism: instead of initializing only one group of Q, K, V matrices, several groups are initialized; the invention uses 4 groups. The specific process is as follows: Q, K and V first undergo linear transformations and 4 scaled dot products to obtain 4 heads, where each head is given by equation (3):

head_i = O(Q·W_i^q, K·W_i^k, V·W_i^v);  (3)

where W_i^q, W_i^k and W_i^v are the i-th coefficient matrices for Q, K and V, with W_i^q ∈ ℝ^(512×128), W_i^k ∈ ℝ^(512×128) and W_i^v ∈ ℝ^(512×128).

The computation is performed 4 times, one head at a time, and the parameters W of the Q, K, V linear transformations differ between heads. Since the feedforward neural network cannot take 4 matrices as input, the 4 head outputs are concatenated into one large matrix, which is then multiplied by a randomly initialized matrix, finally yielding the multi-head attention result, as shown in equation (4):

MultiHead(Q, K, V) = Concat(head_1, ..., head_4)·W^o;  (4)

where W^o is a randomly initialized matrix, W^o ∈ ℝ^(512×512).

Because the multi-head attention mechanism performs the calculation 4 times rather than once, the model can, like an ensemble, attend to information from different representation subspaces, which improves its fitting capability. A shortcut structure from residual networks is also adopted around the self-attention mechanism to address the gradient degradation problem in deep learning.
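The 4-head computation of equations (3) and (4) is sketched below with the stated shapes W_i ∈ ℝ^(512×128) and W_o ∈ ℝ^(512×512). For readability the inputs here are row-wise (n × 512) matrices, so each head uses the equivalent softmax(q·kᵀ/√d_k)·v form of equations (1) and (2).

```python
import torch

d_model, d_head, n_heads = 512, 128, 4
Wq = [torch.randn(d_model, d_head) for _ in range(n_heads)]
Wk = [torch.randn(d_model, d_head) for _ in range(n_heads)]
Wv = [torch.randn(d_model, d_head) for _ in range(n_heads)]
Wo = torch.randn(d_model, d_model)               # randomly initialized, then learned

def head(Q, K, V, i):
    q, k, v = Q @ Wq[i], K @ Wk[i], V @ Wv[i]    # equation (3): per-head projections
    a = torch.softmax(q @ k.T / d_head ** 0.5, dim=-1)
    return a @ v                                  # (n, d_head)

def multi_head(Q, K, V):
    heads = [head(Q, K, V, i) for i in range(n_heads)]
    return torch.cat(heads, dim=-1) @ Wo          # equation (4): Concat(...)·W^o

x = torch.randn(50, 512)                          # 50 tokens, 512-dim each
out = multi_head(x, x, x)                         # self-attention: Q = K = V = x
print(out.shape)                                  # torch.Size([50, 512])
```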

Before the data is passed through the feedforward neural network's activation function, layer normalization (LN) is applied to prevent the inputs from falling into the activation function's saturation region. LN computes a mean and a variance over each sample, as shown in equation (5):

LN(x_i) = α · (x_i - μ_L) / √(σ_L² + ε) + β;  (5)

where x_i is an input sample, α is a scale factor, β is a shift factor, ε is a small positive number used to avoid a zero divisor, μ_L is the mean of the sample data, and σ_L² is the variance of the sample data.
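Written out in code, equation (5) coincides with PyTorch's layer normalization, where α and β correspond to its learnable weight and bias, as the sketch below checks:

```python
import torch

def layer_norm(x, alpha, beta, eps=1e-5):
    mu = x.mean(dim=-1, keepdim=True)             # per-sample mean
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return alpha * (x - mu) / torch.sqrt(var + eps) + beta

x = torch.randn(4, 512)
ours = layer_norm(x, torch.ones(512), torch.zeros(512))
ref = torch.nn.functional.layer_norm(x, (512,))   # library reference
assert torch.allclose(ours, ref, atol=1e-5)
```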

The detailed structure of the D_transformer is shown in FIG. 6. So that the model can capture the order information of the words, an additional position-encoding vector is added to the inputs of the encoder and decoder layers; the position encoding is computed as shown in equations (6) and (7):

PE(pos, 2i) = sin(pos / 10000^(2i/d_model));  (6)

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model));  (7)

where PE is a two-dimensional matrix, pos is the position of the current word in the sentence, i is the index of each value in the vector, and d_model is the dimension of the word vector.
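Equations (6) and (7), sine at even vector indices and cosine at odd ones, can be generated as follows and added to the token embeddings before they enter the model:

```python
import torch

def positional_encoding(max_len=50, d_model=512):
    pos = torch.arange(max_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angle = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                # equation (6): even indices
    pe[:, 1::2] = torch.cos(angle)                # equation (7): odd indices
    return pe

x = torch.randn(50, 512)                          # embedded tokens
x = x + positional_encoding()                     # model input with order information
```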

The encoder starts by processing the input sequence. The output of the top encoder is then converted into a set of attention vectors K and V, which each decoder uses in its "encoder-decoder attention" layer to focus attention on the appropriate positions in the input sequence. After the encoding phase is completed, the decoding phase begins: each step of the decoding phase outputs one element of the output sequence, and the process repeats until a symbol indicates that the decoder has finished. The output of each step is fed to the bottom decoder at the next time step, and position encodings are embedded and added to the decoder inputs to represent the position of each word, just as with the encoder inputs. After decoding, a fully connected layer and a softmax layer are appended at the end, and the word with the highest probability is taken as the final result.

The D_transformer breaks the limitation that RNN models cannot be computed in parallel and, unlike a CNN, the number of operations required to relate two positions does not grow with their distance. Its fully parallelized structure therefore trains faster, and a deeper model can be used that converges to a lower test error than a shallower one, so the model achieves higher precision.

The code is passed through the vulnerability detection model D_transformer to obtain a prediction result; the model parameters are updated from the error between the predicted and true results by back propagation and gradient descent, and the model is trained over multiple iterations. During testing, the trained model is used for prediction, and whether the model's test results match the actual results is compared to assess its detection capability.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention. It should be noted that like reference numerals and letters refer to like items in the figures; once an item has been defined in one figure, it need not be further defined or explained in subsequent figures.
