Graph-based statement-level program repair method and system

Document No.: 1921121; published: 2021-12-03

Description: this technique, "A graph-based statement-level program repair method and system," was designed and created by Li Bin, Tang Ben, Sun Xiaobing, and Bo Lili on 2021-08-02. Abstract: The invention discloses a graph-based statement-level program repair method and system, belonging to the field of software debugging. The method first extracts defective code with its patches, together with well-formed code, to construct the training and pre-training data sets; preprocesses the data sets and pre-trains a programming language model; performs data embedding with the programming language model, and constructs and trains a translation model based on the Graph-to-Sequence architecture; and finally generates patches for defective statements with the trained translation model. The invention represents code with a code graph that fuses multiple features of the source code, and combines a pre-trained model to learn coding conventions and speed up the training convergence of the translation model. It can optimize the context representation of defective statements so that the translation model better learns the syntactic and semantic associations between defective and correct statements, thereby better capturing the semantics of defect repair and generating high-quality repair patches that follow programming-language conventions to repair defective programs automatically, greatly reducing the cost of defect repair.

1. A graph-based statement-level program repair method, the method comprising the steps of:

step 1, crawling defective code files and their patch files from open-source communities to construct a training data set for the translation model, and crawling methods whose number of modification commits is below a set threshold to construct a pre-training data set for the programming language model;

step 2, converting the code statements in the pre-training data set into Tokens and training the programming language model with the pre-training data set; preprocessing the training data set of the translation model by converting the repaired statements into Tokens and constructing a code graph from each defective statement and its context; and generating vector representations with the trained programming language model so as to conform to the input of the translation model;

step 3, embedding the data in the training data set with the pre-trained programming language model, and training a translation model based on the Graph-to-Sequence architecture on the embedded training data set; the translation model comprises a graph encoder and a sequence decoder; the graph encoder adds a super node to the input code graph as an abstract representation of the defective statement, the super node being connected to all nodes related to the defective statement; the graph encoder iteratively generates node embeddings by aggregating node neighbor information, and the resulting node embedding of the super node serves as the input of the sequence decoder to generate the sequence corresponding to the defect subgraph; the sequence decoder is a recurrent neural network with an attention mechanism that iteratively generates candidate Tokens to form a Token sequence;

and step 4, for a newly input defective program, extracting the defective statement and its context, constructing a code graph, generating vector representations with the programming language model, and generating a patch with the trained translation model.

2. The graph-based statement-level program repair method according to claim 1, wherein the specific process of step 1 comprises:

step 1-1, crawling defective code files in open-source projects and the patch files submitted to repair those defects from the open-source community, and constructing bug-fix pairs to form the model training data; crawling methods whose number of modification commits is below a set threshold from the open-source projects as the model pre-training data;

step 1-2, selecting from the crawled model training data the bug-fix pairs that repair only a single-line statement, retaining the method containing the repaired statement as context, and removing data whose context length is greater than or equal to a set threshold; for the crawled model pre-training data, removing methods whose length is greater than or equal to a set threshold and removing duplicate methods.

3. The graph-based statement-level program repair method according to claim 1, wherein the specific process of step 2 comprises:

step 2-1, separating the code of each method in the pre-training data set into words, and splitting and recombining the code sequences into Token-unit sequences using the BPE tokenization method;

step 2-2, inputting the Token sequences generated in step 2-1 to train a BERT model; step 2-3, splitting and recombining the repaired statements in the training data set of the translation model into Token-unit sequences using the BPE tokenization method;

step 2-4, for an input defective statement and its context, optimizing the context of the defective statement through program slicing to remove context irrelevant to the semantics of the defective statement, and then constructing a code graph from the defective statement and its optimized context;

and step 2-5, generating vector representations for the Token sequences generated in step 2-3 and for the nodes of the code graphs constructed in step 2-4 using the trained BERT model.

4. The graph-based statement-level program repair method according to claim 3, wherein the specific process of step 2-4 comprises:

step 2-4-1, building a PDG from the defective statement and its context, searching the PDG for context statements related to the defective statement, removing statements unrelated to the defective statement, and representing the defective statement and the context statements remaining after program slicing in sequence form;

step 2-4-2, converting the code sequence generated in step 2-4-1 into an AST, wherein nodes in the AST are represented by words processed with the BPE tokenization method, and nodes in the AST tree are connected with different types of edges according to the following rules:

(1) connecting nodes with control-flow relations in the AST using ControlFlow edges according to control-flow rules;

(2) connecting nodes with data-flow relations in the AST using DataFlow edges according to data-flow rules;

(3) connecting nodes with sequential-order relations in the AST using NaturalSeq edges according to the natural order of the source code.

5. The graph-based statement-level program repair method according to claim 1, wherein the constructing of the graph encoder in step 3 specifically includes:

step 3-1, marking the nodes related to the defective statement in the code graph corresponding to the defective statement and its context, adding a super node V_s connected to all marked nodes, and randomly initializing the initial representation of V_s; all marked nodes, together with every edge connecting any two marked nodes, form a defect subgraph, and the super node V_s can be regarded as an aggregation of the defect subgraph, i.e., an abstract representation of the defective statement;

step 3-2, for any node v_i in the code graph, iteratively aggregating its K-order neighbor information through a node aggregation algorithm and generating a node embedding h_i;

step 3-3, for the super node V_s added in step 3-1, generating a node embedding h_s with the node aggregation algorithm of step 3-2; this node embedding is regarded as the subgraph embedding and serves as the input of the sequence decoder to generate the sequence corresponding to the defect subgraph.

6. The graph-based statement-level program repair method according to claim 5, wherein the constructing of the sequence decoder in step 3 specifically comprises:

step 3-4, obtaining the vector representations (h_1, h_2, …, h_V) of all nodes of the defect subgraph from step 3-3;

step 3-5, computing a relevance score e_{tj} between each input position j and the current output position with a scoring function, from the decoder hidden vector s_{t-1} at the previous time step and the subgraph node representation h_j obtained in step 3-4, wherein the decoder hidden vector at the initial time step is the vector representation h_s of the super node V_s;

step 3-6, computing the attention weights α_{tj} and the context vector c_t from the relevance scores e_{tj} of step 3-5:

α_{tj} = exp(e_{tj}) / Σ_{k=1}^{V} exp(e_{tk}),    c_t = Σ_{j=1}^{V} α_{tj} · h_j

where V is the number of node vector representations obtained in step 3-4;

step 3-7, computing the state vector s_t of the current time step t with a nonlinear activation function, from the context vector c_t obtained in step 3-6, the decoder hidden vector s_{t-1}, and the decoder output y_{t-1} at the previous time step;

step 3-8, computing the output probability p of the current position with a multi-layer nonlinear function, from the state vector s_t obtained in step 3-7, the context vector c_t, and the decoder output y_{t-1} at the previous time step; and step 3-9, repeating steps 3-5 to 3-8, iteratively generating the Token with the highest probability score at each position until the sequence terminates, to obtain the Token sequence produced by converting the defect subgraph.

7. The graph-based statement-level program repair method according to claim 5, wherein the node aggregation algorithm of step 3-2 specifically comprises the following steps:

step 3-2-1, for node v, dividing its neighbor nodes into forward neighbors N⁺(v) and backward neighbors N⁻(v) according to the direction of the edges;

step 3-2-2, aggregating the representations of the forward neighbors of node v into a single vector:

h^k_{N⁺(v)} = max({σ(W_pool · h^{k-1}_u + b), ∀u ∈ N⁺(v)})

where max denotes the element-wise maximization operator, W_pool is the pooling matrix, σ denotes a nonlinear activation function, b is a bias constant, and k is the current neighbor order;

step 3-2-3, concatenating the current feature vector h^{k-1}_{v⁺} of node v with the generated forward aggregation vector h^k_{N⁺(v)} and feeding the result into a fully connected layer with σ activation, through which the forward representation h^k_{v⁺} of node v is updated;

step 3-2-4, applying the processing of steps 3-2-2 and 3-2-3 to the backward neighbors N⁻(v) of node v to update its backward representation h^k_{v⁻};

step 3-2-5, repeating steps 3-2-2 to 3-2-4 K times, where K is the aggregation order, and concatenating the forward and backward representations of node v after the K repetitions to generate the final representation of node v.

8. The graph-based statement-level program repair method according to claim 1, wherein the specific process of step 4 comprises:

step 4-1, marking the defective statement of the newly input defective program, extracting the method containing the defective statement as context, constructing a code graph, and completing the embedding of the code-graph nodes;

step 4-2, inputting the embedded code graph generated in step 4-1 into the translation model of step 3, and predicting candidate Token sequences with the trained translation model based on the Graph-to-Sequence architecture;

step 4-3, restoring the candidate Token sequences of step 4-2 with the BPE tokenization method to generate candidate patch sequences;

and step 4-4, replacing the defective statement in the source code file with the candidate patch sequences generated in step 4-3, verifying the correctness of each patch with the test cases, and outputting the candidate patch sequences that pass the test cases as correct patches.

9. A graph-based statement-level program repair system, comprising:

the data set extraction module, used for crawling defective code files and their patch files from open-source communities to construct a training data set for the translation model, and crawling methods whose number of modification commits is below a set threshold to construct a pre-training data set for the programming language model;

the pre-training module, used for converting the code statements in the pre-training data set into Tokens and training the programming language model with the pre-training data set; preprocessing the training data set of the translation model by converting the repaired statements into Tokens and constructing a code graph from each defective statement and its context; and generating vector representations with the trained programming language model so as to conform to the input of the translation model;

the translation model training module, used for embedding the data in the training data set with the pre-trained programming language model and training a translation model based on the Graph-to-Sequence architecture on the embedded training data set; the translation model comprises a graph encoder and a sequence decoder; the graph encoder adds a super node to the input code graph as an abstract representation of the defective statement, the super node being connected to all nodes related to the defective statement; the graph encoder iteratively generates node embeddings by aggregating node neighbor information, and the resulting node embedding of the super node serves as the input of the sequence decoder to generate the sequence corresponding to the defect subgraph; the sequence decoder is a recurrent neural network with an attention mechanism that iteratively generates candidate Tokens to form a Token sequence;

and the program repair module, used for extracting the defective statement and its context from a newly input defective program, constructing a code graph, generating vector representations with the programming language model, and generating a patch with the trained translation model.

10. A graph-based statement-level program repair system comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the graph-based statement-level program repair method according to any one of claims 1 to 8.

Technical Field

The invention belongs to the field of software debugging, and particularly relates to a graph-based statement-level program repair method and system.

Background

Program defects inevitably arise during software development, and developers must spend considerable effort repairing them. As the scale of modern software keeps growing, both the number of program defects and the difficulty of repairing them increase, and these defects cause enormous economic losses to enterprises. To improve software reliability and reduce development costs, researchers have proposed many Automated Program Repair (APR) techniques to repair defective programs automatically.

Traditional defect-repair methods rely on expert knowledge: domain experts must spend considerable effort constructing repair templates or repair strategies, so these methods lack generalization ability. Because software defects tend to recur, researchers have found that defect-repair history can provide effective guidance for automated repair, and have therefore introduced deep learning models that learn historical defect-repair features to guide repair. Methods based on neural machine translation (NMT) models are a typical application of deep learning in the APR field. NMT-based approaches automatically learn abstract repair patterns from historical bug-fix data to capture the associations between buggy statements and repaired statements. These models are highly general because they do not depend on the programming language, only on the historical data used for training. Despite their great advantages over traditional techniques, NMT-based methods still have deficiencies. The code representations they adopt cannot preserve rich syntactic and semantic information. They also ignore implicit semantics in the source code, since they tend to represent the source code as a sequence and apply a sequence-to-sequence model to generate patches. In addition, these models learn inefficiently when the input sequence is too long.

Disclosure of Invention

Purpose of the invention: the invention aims to provide a statement-level program repair method and system with strong generalization ability, excellent extraction of defect-repair features, high repair-patch quality, and good prospects for industrial application.

Technical scheme: the technical scheme for realizing the purpose of the invention is as follows. A graph-based statement-level program repair method comprises the following steps:

step 1, crawling defective code files and their patch files from open-source communities to construct a training data set for the translation model, and crawling methods whose number of modification commits is below a set threshold to construct a pre-training data set for the programming language model;

step 2, converting the code statements in the pre-training data set into Tokens and training the programming language model with the pre-training data set; preprocessing the training data set of the translation model by converting the repaired statements into Tokens and constructing a code graph from each defective statement and its context; and generating vector representations with the trained programming language model so as to conform to the input of the translation model;

step 3, embedding the data in the training data set with the pre-trained programming language model, and training a translation model based on the Graph-to-Sequence architecture on the embedded training data set; the translation model comprises a graph encoder and a sequence decoder; the graph encoder adds a super node to the input code graph as an abstract representation of the defective statement, the super node being connected to all nodes related to the defective statement; the graph encoder iteratively generates node embeddings by aggregating node neighbor information, and the resulting node embedding of the super node serves as the input of the sequence decoder to generate the sequence corresponding to the defect subgraph; the sequence decoder is a recurrent neural network with an attention mechanism that iteratively generates candidate Tokens to form a Token sequence;

and step 4, for a newly input defective program, extracting the defective statement and its context, constructing a code graph, generating vector representations with the programming language model, and generating a patch with the trained translation model.

Further, the specific process of step 1 comprises:

step 1-1, crawling defective code files in open-source projects and the patch files submitted to repair those defects from the open-source community, and constructing bug-fix pairs to form the model training data; crawling methods whose number of modification commits is below a set threshold from the open-source projects as the model pre-training data;

step 1-2, selecting from the crawled model training data the bug-fix pairs that repair only a single-line statement, retaining the method containing the repaired statement as context, and removing data whose context length is greater than or equal to a set threshold; for the crawled model pre-training data, removing methods whose length is greater than or equal to a set threshold and removing duplicate methods.

Further, the specific process of step 2 comprises:

step 2-1, separating the code of each method in the pre-training data set into words, and splitting and recombining the code sequences into Token-unit sequences using the BPE tokenization method;

step 2-2, inputting the Token sequences generated in step 2-1 to train a BERT model; step 2-3, splitting and recombining the repaired statements in the training data set of the translation model into Token-unit sequences using the BPE tokenization method;

step 2-4, for an input defective statement and its context, optimizing the context of the defective statement through program slicing to remove context irrelevant to the semantics of the defective statement, and then constructing a code graph from the defective statement and its optimized context;

and step 2-5, generating vector representations for the Token sequences generated in step 2-3 and for the nodes of the code graphs constructed in step 2-4 using the trained BERT model.

Further, the specific process of step 2-4 comprises:

step 2-4-1, building a PDG from the defective statement and its context, searching the PDG for context statements related to the defective statement, removing statements unrelated to the defective statement, and representing the defective statement and the context statements remaining after program slicing in sequence form;

step 2-4-2, converting the code sequence generated in step 2-4-1 into an AST, wherein nodes in the AST are represented by words processed with the BPE tokenization method, and nodes in the AST tree are connected with different types of edges according to the following rules:

(1) connecting nodes with control-flow relations in the AST using ControlFlow edges according to control-flow rules;

(2) connecting nodes with data-flow relations in the AST using DataFlow edges according to data-flow rules;

(3) connecting nodes with sequential-order relations in the AST using NaturalSeq edges according to the natural order of the source code.

Further, the constructing of the graph encoder in step 3 specifically includes:

step 3-1, marking the nodes related to the defective statement in the code graph corresponding to the defective statement and its context, adding a super node V_s connected to all marked nodes, and randomly initializing the initial representation of V_s; all marked nodes, together with every edge connecting any two marked nodes, form a defect subgraph, and the super node V_s can be regarded as an aggregation of the defect subgraph, i.e., an abstract representation of the defective statement;

step 3-2, for any node v_i in the code graph, iteratively aggregating its K-order neighbor information through a node aggregation algorithm and generating a node embedding h_i;

step 3-3, for the super node V_s added in step 3-1, generating a node embedding h_s with the node aggregation algorithm of step 3-2; this node embedding is regarded as the subgraph embedding and serves as the input of the sequence decoder to generate the sequence corresponding to the defect subgraph.

Further, constructing the sequence decoder in step 3 specifically includes:

step 3-4, obtaining the vector representations (h_1, h_2, …, h_V) of all nodes of the defect subgraph from step 3-3;

step 3-5, computing a relevance score e_{tj} between each input position j and the current output position with a scoring function, from the decoder hidden vector s_{t-1} at the previous time step and the subgraph node representation h_j obtained in step 3-4, wherein the decoder hidden vector at the initial time step is the vector representation h_s of the super node V_s;

step 3-6, computing the attention weights α_{tj} and the context vector c_t from the relevance scores e_{tj} of step 3-5:

α_{tj} = exp(e_{tj}) / Σ_{k=1}^{V} exp(e_{tk}),    c_t = Σ_{j=1}^{V} α_{tj} · h_j

where V is the number of node vector representations obtained in step 3-4;

step 3-7, computing the state vector s_t of the current time step t with a nonlinear activation function, from the context vector c_t obtained in step 3-6, the decoder hidden vector s_{t-1}, and the decoder output y_{t-1} at the previous time step;

step 3-8, computing the output probability p of the current position with a multi-layer nonlinear function, from the state vector s_t obtained in step 3-7, the context vector c_t, and the decoder output y_{t-1} at the previous time step; and step 3-9, repeating steps 3-5 to 3-8, iteratively generating the Token with the highest probability score at each position until the sequence terminates, to obtain the Token sequence produced by converting the defect subgraph.

Further, the node aggregation algorithm in step 3-2 specifically includes:

step 3-2-1, for node v, dividing its neighbor nodes into forward neighbors N⁺(v) and backward neighbors N⁻(v) according to the direction of the edges;

step 3-2-2, aggregating the representations of the forward neighbors of node v into a single vector:

h^k_{N⁺(v)} = max({σ(W_pool · h^{k-1}_u + b), ∀u ∈ N⁺(v)})

where max denotes the element-wise maximization operator, W_pool is the pooling matrix, σ denotes a nonlinear activation function, b is a bias constant, and k is the current neighbor order;

step 3-2-3, concatenating the current feature vector h^{k-1}_{v⁺} of node v with the generated forward aggregation vector h^k_{N⁺(v)} and feeding the result into a fully connected layer with σ activation, through which the forward representation h^k_{v⁺} of node v is updated;

step 3-2-4, applying the processing of steps 3-2-2 and 3-2-3 to the backward neighbors N⁻(v) of node v to update its backward representation h^k_{v⁻};

step 3-2-5, repeating steps 3-2-2 to 3-2-4 K times, where K is the aggregation order, and concatenating the forward and backward representations of node v after the K repetitions to generate the final representation of node v.

Further, the specific process of step 4 includes:

step 4-1, marking the defective statement of the newly input defective program, extracting the method containing the defective statement as context, constructing a code graph, and completing the embedding of the code-graph nodes;

step 4-2, inputting the embedded code graph generated in step 4-1 into the translation model of step 3, and predicting candidate Token sequences with the trained translation model based on the Graph-to-Sequence architecture;

step 4-3, restoring the candidate Token sequences of step 4-2 with the BPE tokenization method to generate candidate patch sequences;

and step 4-4, replacing the defective statement in the source code file with the candidate patch sequences generated in step 4-3, verifying the correctness of each patch with the test cases, and outputting the candidate patch sequences that pass the test cases as correct patches.

Based on the same inventive concept, the invention provides a graph-based statement-level program repair system, which comprises:

the data set extraction module, used for crawling defective code files and their patch files from open-source communities to construct a training data set for the translation model, and crawling methods whose number of modification commits is below a set threshold to construct a pre-training data set for the programming language model;

the pre-training module, used for converting the code statements in the pre-training data set into Tokens and training the programming language model with the pre-training data set; preprocessing the training data set of the translation model by converting the repaired statements into Tokens and constructing a code graph from each defective statement and its context; and generating vector representations with the trained programming language model so as to conform to the input of the translation model;

the translation model training module, used for embedding the data in the training data set with the pre-trained programming language model and training a translation model based on the Graph-to-Sequence architecture on the embedded training data set; the translation model comprises a graph encoder and a sequence decoder; the graph encoder adds a super node to the input code graph as an abstract representation of the defective statement, the super node being connected to all nodes related to the defective statement; the graph encoder iteratively generates node embeddings by aggregating node neighbor information, and the resulting node embedding of the super node serves as the input of the sequence decoder to generate the sequence corresponding to the defect subgraph; the sequence decoder is a recurrent neural network with an attention mechanism that iteratively generates candidate Tokens to form a Token sequence;

and the program repair module, used for extracting the defective statement and its context from a newly input defective program, constructing a code graph, generating vector representations with the programming language model, and generating a patch with the trained translation model.

Based on the same inventive concept, the invention provides a graph-based statement-level program repair system comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when loaded into the processor, the computer program implements the graph-based statement-level program repair method described above.

Advantageous effects: compared with the prior art, the invention has the following notable advantages: 1) the source code is represented by graphs such as abstract syntax trees, data-flow graphs, and control-flow graphs, which can model the complex syntactic and semantic features of code and facilitate the extraction of defect-information features; 2) code embeddings are generated with a pre-trained programming language model (BERT), which normalizes the code and accelerates the training convergence of the translation model; 3) repair patches are generated for the defective statement only, which avoids the low translation-model efficiency caused by overly long input sequences; 4) program slicing optimizes the context range of the defective statement, avoiding the noise caused by overly long contexts; 5) aggregating both the forward and backward relations of nodes allows node features to be learned better; 6) compared with the traditional Sequence-to-Sequence model, the Graph-to-Sequence model replaces the sequence encoder with a graph encoder and can learn repair templates while preserving the structural integrity of the code graph, improving the generality of the model.

Drawings

FIG. 1 is a flow diagram of a graph-based statement-level program repair method in one embodiment;

FIG. 2 is a diagram illustrating program slicing effects in one embodiment;

FIG. 3 is a code graph representation in one embodiment;

FIG. 4 is a diagram illustrating a Graph-to-Sequence architecture-based translation model in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

In one embodiment, in conjunction with FIG. 1, the invention provides a graph-based statement-level program repair method, which includes the following steps:

step 1, data set extraction: constructing the training data set of the translation model and the pre-training data set of the programming language model from data crawled from open-source communities;

step 2, training data set preprocessing and programming language model pre-training: training the programming language model with the pre-training data set, and preprocessing the training data set of the translation model to conform to the input of the translation model;

step 3, translation model construction and training: embedding the data in the training data set with the pre-trained programming language model, and training a translation model based on the Graph-to-Sequence architecture on the embedded training data set;

and step 4, patch statement generation and validation: for the preprocessed program, generating candidate patches with the trained translation model, verifying patch correctness against the test cases, and outputting the correct patches.

Further, in one embodiment, the specific process of extracting the data set in step 1 includes:

step 1-1, crawling defective code files in open-source projects and the patch files submitted to repair those defects from the open-source community, and constructing bug-fix pairs to form the model training data; crawling methods with few modification commits from the open-source projects as the model pre-training data;

step 1-2, screening and filtering the crawled data: selecting from the model training data obtained in step 1-1 the bug-fix pairs that repair only a single-line statement, retaining the method (function) containing the current repaired statement as context, and removing data whose context length is greater than or equal to 1000; for the model pre-training data obtained in step 1-1, removing methods whose length is greater than or equal to 1000 and removing duplicate methods;

and step 1-3, constructing the training data set and the pre-training data set from the data of step 1-2; for the training data set, randomly selecting 80% of the bug-fix pairs as the training set and the remaining 20% as the test set.
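As an illustration, the screening of step 1-2 and the split of step 1-3 can be sketched in Python as follows (a minimal sketch; the in-memory bug-fix pair fields are assumptions for illustration, while the length bound and the 80/20 ratio follow the description above):

    import random

    def build_datasets(bug_fix_pairs, methods, max_len=1000, seed=42):
        """Filter the crawled data (step 1-2) and split the training data (step 1-3)."""
        # Keep only pairs whose fix repairs a single-line statement and whose
        # retained method context is shorter than the threshold.
        pairs = [p for p in bug_fix_pairs
                 if "\n" not in p["fix"] and len(p["context"]) < max_len]
        # Pre-training data: drop over-long methods and duplicates.
        pretrain = sorted({m for m in methods if len(m) < max_len})
        random.seed(seed)
        random.shuffle(pairs)
        cut = int(0.8 * len(pairs))
        return pairs[:cut], pairs[cut:], pretrain  # train set, test set, pre-training set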

Further, in one embodiment, the data set preprocessing and programming language model pre-training in step 2 specifically includes:

step 2-1, preprocessing the pre-training data set: separating the method code in the pre-training data set into words, and splitting and recombining the code sequences into Token-unit sequences using BPE (Byte Pair Encoding) tokenization;
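A minimal sketch of this step, using the open-source HuggingFace tokenizers library as one possible BPE implementation (the vocabulary size, special tokens, and corpus file name are illustrative assumptions):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # Word-level separation first; BPE merges then rebuild Token-unit sequences.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=32000,
                         special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    tokenizer.train(files=["pretrain_methods.txt"], trainer=trainer)  # hypothetical corpus file
    tokenizer.save("code_bpe.json")

    print(tokenizer.encode("int mid = ( lo + hi ) / 2 ;").tokens)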

step 2-2, inputting the Token sequences generated in step 2-1 into a BERT model and training it. The BERT (Bidirectional Encoder Representations from Transformers) model is one of the most commonly used pre-trained models; a fully trained BERT model can generate a corresponding vector representation for an input sequence. The pre-training data formed from the Token sequences of step 2-1 are fed into the open-source deep learning framework TensorFlow, which packages a complete BERT model, and the API is called to run 100 epochs, yielding a fully trained BERT model;
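The pre-training itself can be sketched as follows; the embodiment uses TensorFlow, but for brevity this sketch substitutes the HuggingFace transformers API on PyTorch, and the model sizes, file name, and masking rate are illustrative assumptions rather than the embodiment's actual configuration:

    import torch
    from transformers import BertConfig, BertForMaskedLM, PreTrainedTokenizerFast

    tok = PreTrainedTokenizerFast(tokenizer_file="code_bpe.json",  # from the BPE step above
                                  unk_token="[UNK]", pad_token="[PAD]", mask_token="[MASK]")
    model = BertForMaskedLM(BertConfig(vocab_size=tok.vocab_size, hidden_size=256,
                                       num_hidden_layers=4, num_attention_heads=4))
    opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

    batch = tok(["public int add ( int a , int b ) { return a + b ; }"],
                return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    masked = torch.rand(labels.shape) < 0.15      # mask 15% of the Tokens
    batch["input_ids"][masked] = tok.mask_token_id
    labels[~masked] = -100                        # compute the loss only on masked Tokens

    loss = model(**batch, labels=labels).loss     # one masked-language-model training step
    loss.backward(); opt.step(); opt.zero_grad()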

step 2-3, preprocessing the repaired statements by splitting and recombining them into Token-unit sequences using BPE tokenization. The training data set stores data as bug-fix pairs, where the bug is a defective statement together with its context and the fix is the repaired version of the defective statement;

step 2-4, processing the defective statement and its context: for an input defective statement and its context, optimizing the context through program slicing to remove context irrelevant to the semantics of the defective statement, and then constructing a code graph from the defective statement and its optimized context;

and step 2-5, generating vector representations for the Token sequences generated in step 2-3 and the nodes of the code graphs constructed in step 2-4 using the fully trained BERT model.

By adopting the scheme of this embodiment, splitting code sequences with BPE tokenization effectively alleviates the vocabulary-explosion problem, while generating embeddings for the split sequences with the pre-trained model effectively improves the syntactic normality of patches and accelerates the convergence of the translation model.

Further, in one embodiment, the processing of the defective statement and its context in step 2-4 includes:

step 2-4-1, program slicing. First, the open-source tool Joern is used to construct a PDG (Program Dependence Graph) from the defective statement and its context; the PDG is a graphical representation of the control and data dependences within a program. Context statements associated with the defective statement are then found along the PDG, and statements irrelevant to the defective statement are removed. The defective statement and the context statements remaining after slicing are expressed in sequence form;
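A minimal sketch of this slicing step, assuming the Joern-produced PDG has already been loaded into a networkx digraph whose nodes are statements; taking the union of the backward and forward slices is one reasonable reading of "context statements associated with the defective statement":

    import networkx as nx

    def slice_context(pdg: nx.DiGraph, defect_node):
        """Keep only statements linked to the defect by control/data dependence."""
        keep = {defect_node}
        keep |= nx.ancestors(pdg, defect_node)      # statements the defect depends on
        keep |= nx.descendants(pdg, defect_node)    # statements influenced by the defect
        return pdg.subgraph(keep).copy()            # everything else is sliced away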

step 2-4-2, code graph construction. The code sequence generated in step 2-4-1 is converted into an AST (Abstract Syntax Tree) using the open-source tool GumTree. An AST can be viewed as a graphical (tree) representation of the program's syntactic structure, with nodes represented by words processed with BPE tokenization. The nodes of the constructed AST tree are then connected with different types of edges according to the following rules (a construction sketch follows the list):

(1) connecting nodes with control-flow relations in the AST using ControlFlow edges according to control-flow rules;

(2) connecting nodes with data-flow relations in the AST using DataFlow edges according to data-flow rules;

(3) connecting nodes with sequential-order relations in the AST using NaturalSeq edges according to the natural order of the source code.
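A minimal construction sketch in Python, assuming the GumTree AST and the Joern-derived flow relations have already been extracted into node and edge lists (the input format is hypothetical):

    import networkx as nx

    def build_code_graph(ast_nodes, ast_edges, cf_edges, df_edges, seq_edges):
        """Merge AST, ControlFlow, DataFlow and NaturalSeq relations into one graph."""
        g = nx.MultiDiGraph()
        for nid, bpe_word in ast_nodes:             # AST nodes carry BPE-processed words
            g.add_node(nid, token=bpe_word)
        for etype, edges in [("AST", ast_edges), ("ControlFlow", cf_edges),
                             ("DataFlow", df_edges), ("NaturalSeq", seq_edges)]:
            for u, v in edges:
                g.add_edge(u, v, etype=etype)       # typed edges per rules (1)-(3)
        return g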

By adopting the scheme of this embodiment, PDG-based program slicing removes statements irrelevant to the semantics of the defective statement, effectively streamlining its context and reducing noise during training. Meanwhile, the mixed representation combining several syntactic and semantic code graphs preserves more syntactic and semantic information than the traditional sequence representation.

Further, in one embodiment, the building and training of the Graph-to-Sequence architecture-based translation model in step 3 specifically includes:

the Graph-to-Sequence architecture-based translation model comprises a Graph encoder and a Sequence decoder:

(1) constructing the graph encoder, which specifically comprises:

step 3-1, adding the super node. The nodes related to the defective statement in the code graph generated in step 2-4 are marked, and a super node V_s connected to all marked nodes is added, with the initial representation of V_s generated by random initialization. All marked nodes, together with every edge connecting any two marked nodes, form the defect subgraph; the super node V_s can be regarded as an aggregation of the defect subgraph, i.e., an abstract representation of the defective statement;
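A sketch of this super-node insertion on a code graph such as the one built in step 2-4-2 (the embedding dimension and random initializer are illustrative assumptions):

    import numpy as np

    def add_super_node(g, marked_nodes, dim=256, seed=0):
        """Add V_s and connect it to every node marked as defect-related (step 3-1)."""
        rng = np.random.default_rng(seed)
        g.add_node("V_s", h=rng.standard_normal(dim))   # random initial representation
        for n in marked_nodes:
            g.add_edge("V_s", n, etype="Super")
            g.add_edge(n, "V_s", etype="Super")         # V_s aggregates the defect subgraph
        return "V_s"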

step 3-2, iteratively generating node embeddings. For any node v_i in the code graph, its K-order neighbor information is iteratively aggregated through a node aggregation algorithm to generate a node embedding h_i;

step 3-3, generating the subgraph embedding from the node embeddings. For the super node V_s added in step 3-1, a node embedding h_s is generated with the node aggregation algorithm of step 3-2. Because V_s is directly connected to the nodes associated with the defective statement, V_s captures the information of all connected nodes and can be regarded as a representation of the defect subgraph; this node embedding is treated as the subgraph embedding and serves as the input of the sequence decoder to generate the sequence corresponding to the defect subgraph;

(2) constructing the sequence decoder, a recurrent neural network with an attention mechanism, which specifically comprises the following steps:

step 3-4, obtaining the vector representations (h_1, h_2, …, h_V) of all nodes of the defect subgraph from step 3-3;

step 3-5, computing the relevance score e_{tj} between each input position j and the current output position, from the decoder hidden vector s_{t-1} at the previous time step and the subgraph node representations h_j obtained in step 3-4, wherein the decoder hidden vector at the initial time step is the vector representation h_s of the super node V_s:

e_{tj} = a(s_{t-1}, h_j)

where a is a scoring function that scores how well the input nodes around input position j match the current output;

step 3-6, computing the attention weights α_{tj} and the context vector c_t from the relevance scores e_{tj} of step 3-5:

α_{tj} = exp(e_{tj}) / Σ_{k=1}^{V} exp(e_{tk}),    c_t = Σ_{j=1}^{V} α_{tj} · h_j

where V is the number of node vector representations obtained in step 3-4;

step 3-7, computing the state vector s_t of the current time step t from the context vector c_t obtained in step 3-6, the decoder hidden vector s_{t-1}, and the decoder output y_{t-1} at the previous time step:

s_t = f(s_{t-1}, y_{t-1}, c_t)

where f is a nonlinear activation function that combines the context vector c_t, the decoder hidden vector s_{t-1}, and the previous decoder output y_{t-1} through weight matrices to compute the current decoder state vector s_t;

step 3-8, computing the output probability p of the current position from the state vector s_t obtained in step 3-7, the context vector c_t, and the decoder output y_{t-1} at the previous time step, where y_t is the Token output at time step t:

p(y_t | y_1, …, y_{t-1}) = g(y_{t-1}, s_t, c_t)

where g is a multi-layer nonlinear function that computes the probability score of the current position from the state vector s_t, the context vector c_t, and the output at the previous time step;

step 3-9, repeating steps 3-5 to 3-8 to iteratively generate Tokens until the sequence terminates, obtaining the Token sequence produced by converting the defect subgraph;

the above functions are existing functions, and the specific expression form can refer to Cho K, VanB, Gulcehre C, et al, article Learning phenyl representation using RNN encoder-decoder for statistical machine translation, which is not described herein in detail.

(3) Training a Graph-to-Sequence architecture-based translation model, which specifically comprises the following steps:

step 3-10, computing the loss between the Token sequence generated in step 3-9 and the fix sequence produced in step 2-3, and updating the parameters of the graph encoder and the sequence decoder by gradient descent according to the loss;

and 3-11, repeating the steps 3-1 to 3-10 for each bug-fix pair preprocessed in the training data set, and adjusting parameters in the Graph encoder and the Sequence decoder to obtain a trained translation model based on the Graph-to-Sequence architecture.

By adopting the scheme of this embodiment, the Graph-to-Sequence translation model learns the conversion from graph-structured data to sequence-structured data end to end, giving the model stronger learning ability and generalization performance.

Further, in one embodiment, the node aggregation algorithm in step 3-2 specifically includes:

step 3-2-1, for node v, dividing its neighbor nodes into forward neighbors N⁺(v) and backward neighbors N⁻(v) according to the direction of the edges, where the forward neighbors N⁺(v) are the set of neighbor nodes that v points to and the backward neighbors N⁻(v) are the set of neighbor nodes that point to v;

step 3-2-2, aggregating the representations of the forward neighbors of node v into a single vector:

h^k_{N⁺(v)} = max({σ(W_pool · h^{k-1}_u + b), ∀u ∈ N⁺(v)})

where max denotes the element-wise maximization operator, W_pool is the pooling matrix, σ denotes a nonlinear activation function, b is a bias constant, and k is the current neighbor order;

step 3-2-3, concatenating the current feature vector h^{k-1}_{v⁺} of node v with the generated forward aggregation vector h^k_{N⁺(v)} and feeding the result into a fully connected layer with σ activation; the forward representation h^k_{v⁺} of node v is updated through this fully connected layer;

step 3-2-4, applying the processing of steps 3-2-2 and 3-2-3 to the backward neighbors N⁻(v) of node v to update its backward representation h^k_{v⁻};

step 3-2-5, repeating steps 3-2-2 to 3-2-4 K times, where K is the aggregation order, and concatenating the forward and backward representations of node v after the K repetitions to generate the final representation of node v.

By adopting the scheme of this embodiment, aggregating the feature information of both the forward and backward neighbors of a node allows the model to learn better node representations, strengthening the learning effect.
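A numpy sketch of this bidirectional aggregation (parameters are shared across the K orders for brevity, and tanh is an assumed choice of σ):

    import numpy as np

    def aggregate(h0, fwd, bwd, W_pool, b, W_fc, K=3):
        """Steps 3-2-1 to 3-2-5: h0 maps node -> feature vector;
        fwd/bwd map node -> forward/backward neighbor lists."""
        sig = np.tanh
        hf, hb = dict(h0), dict(h0)                 # forward / backward representations
        for _ in range(K):
            nf, nb = {}, {}
            for v in h0:
                for nbrs, cur, out in ((fwd, hf, nf), (bwd, hb, nb)):
                    if nbrs[v]:                     # max-pool the neighbor messages
                        agg = np.max([sig(W_pool @ cur[u] + b) for u in nbrs[v]], axis=0)
                    else:
                        agg = np.zeros_like(b)
                    out[v] = sig(W_fc @ np.concatenate([cur[v], agg]))  # fully connected update
            hf, hb = nf, nb
        return {v: np.concatenate([hf[v], hb[v]]) for v in h0}  # final node representations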

Further, in one embodiment, the generation and verification of the patch statement in step 4 includes:

step 4-1, preprocessing the defective program. For a newly input defective program, its defective statement is first marked, and the method containing the defective statement is extracted as context. Step 2-4 (processing of the defective statement and its context) is then repeated to construct the code graph. Finally, step 2-5 is repeated to complete the embedding of the code-graph nodes;

step 4-2, inputting the embedded code graph generated in step 4-1 into the translation model of step 3, and predicting candidate Token sequences with the trained translation model based on the Graph-to-Sequence architecture;

step 4-3, restoring the candidate Token sequences of step 4-2 with the BPE tokenization method to generate candidate patch sequences;

and step 4-4, replacing the defective statement in the source code file with the candidate patch sequences generated in step 4-3, verifying the correctness of each patch with the test cases, and outputting the candidate patch sequences that pass the test cases as correct patches.
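The validation loop of step 4-4 can be sketched as follows, assuming the defect line number is known from fault localization and that the project's test command (illustratively, mvn test) returns a nonzero exit code when any test case fails:

    import subprocess

    def validate_patches(src_file, defect_line, candidates, test_cmd=("mvn", "test")):
        """Try each candidate patch in place; output the first that passes the test cases."""
        original = open(src_file).read().splitlines()
        for patch in candidates:
            patched = list(original)
            patched[defect_line - 1] = patch            # replace the defective statement
            open(src_file, "w").write("\n".join(patched))
            if subprocess.run(test_cmd).returncode == 0:
                return patch                            # passes all test cases: correct patch
        open(src_file, "w").write("\n".join(original))  # no candidate passed: restore the file
        return None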

As a specific example, in one embodiment, the graph-based statement-level program repair method of the invention is further illustrated and verified as follows:

1. Data set extraction. Defective code files in open-source projects and the patch files submitted to repair them are crawled from the open-source community, and bug-fix pairs are constructed to form the model training data. The crawled data are screened and filtered: the method (function) containing the current repaired statement is kept as context, data whose context length is greater than or equal to 1000 are removed, and 80% of the bug-fix pairs are randomly selected as the training set, with the remaining 20% as the test set. Methods with few modification commits (fewer than 5, which can be regarded as strongly normative code) are crawled from the open-source projects as model pre-training data; methods whose length is greater than or equal to 1000 and duplicate methods are removed to form the pre-training data set. One sample of the training set is shown in Table 1 below, and one sample of the pre-training data set is shown in Table 2 below. The programming language of the samples shown in the tables is Java; the programming language of the data set is not limited in practical scenarios.

Table 1. A sample from the training set

Table 2. A sample from the pre-training data set

2. Data set preprocessing and programming language model pre-training. The method code in the pre-training data set is separated into words, split and recombined into Token-unit sequences with BPE (Byte Pair Encoding) tokenization, and fed in to train the BERT model. The repaired statements are likewise split and recombined into Token-unit sequences with BPE tokenization. For an input defective statement and its context, the open-source tool Joern is first used to construct a PDG from the defective statement and its context; context statements associated with the defective statement are then found along the PDG, and statements irrelevant to the defective statement are removed. The slicing effect on one sample of the data set is shown in FIG. 2. A code graph is then constructed from the defective statement and its optimized context: based on the AST generated by the open-source tool GumTree, nodes in the AST tree are connected with different types of edges according to the rules, using ControlFlow edges to connect nodes with control-flow relations, DataFlow edges to connect nodes with data-flow relations, and NaturalSeq edges to connect nodes in the natural sequential order of the source code; the code graph constructed for one method in the data set is shown in FIG. 3. Finally, vector representations are generated with the fully trained BERT model for the Token sequences generated in step 2-3 and the nodes of the code graphs constructed in step 2-4.

3. Construction and training of the translation model based on the Graph-to-Sequence architecture. As shown in FIG. 4, the construction and training processes of the model are as follows:

the Graph-to-Sequence architecture-based translation model comprises a Graph encoder and a Sequence decoder:

step 3-1, adding the super node. The nodes related to the defective statement in the code graph generated in step 2-4 are marked, and a super node V_s connected to all marked nodes is added, with the initial representation of V_s generated by random initialization. All marked nodes, together with every edge connecting any two marked nodes, form the defect subgraph; the super node V_s can be regarded as an aggregation of the defect subgraph, i.e., an abstract representation of the defective statement;

step 3-2, iteratively generating node embeddings. For any node v_i in the code graph, its K-order neighbor information is iteratively aggregated through a node aggregation algorithm to generate a node embedding h_i; the specific process of the node aggregation algorithm comprises the following steps:

step 3-2-1, for node v, dividing its neighbor nodes into forward neighbors N⁺(v) and backward neighbors N⁻(v) according to the direction of the edges, where the forward neighbors N⁺(v) are the set of neighbor nodes that v points to and the backward neighbors N⁻(v) are the set of neighbor nodes that point to v;

step 3-2-2, aggregating the representations of the forward neighbors of node v into a single vector:

h^k_{N⁺(v)} = max({σ(W_pool · h^{k-1}_u + b), ∀u ∈ N⁺(v)})

where max denotes the element-wise maximization operator, W_pool is the pooling matrix, σ denotes a nonlinear activation function, b is a bias constant, and k is the current neighbor order;

step 3-2-3, concatenating the current feature vector h^{k-1}_{v⁺} of node v with the generated forward aggregation vector h^k_{N⁺(v)} and feeding the result into a fully connected layer with σ activation; the forward representation h^k_{v⁺} of node v is updated through this fully connected layer;

step 3-2-4, applying the processing of steps 3-2-2 and 3-2-3 to the backward neighbors N⁻(v) of node v to update its backward representation h^k_{v⁻};

step 3-2-5, repeating steps 3-2-2 to 3-2-4 K times (K = 3), then concatenating the forward and backward representations of node v to generate its final representation.

step 3-3, generating the subgraph embedding from the node embeddings. For the super node V_s added in step 3-1, a node embedding h_s is generated with the node aggregation algorithm of step 3-2. Because V_s is directly connected to the nodes associated with the defective statement, V_s captures the information of all connected nodes and can be regarded as a representation of the defect subgraph; this node embedding is treated as the subgraph embedding and serves as the input of the sequence decoder to generate the sequence corresponding to the defect subgraph;

step 3-4, obtaining the vector representations (h_1, h_2, …, h_V) of all nodes of the defect subgraph from step 3-3;

step 3-5, computing the relevance score e_{tj} between each input position j and the current output position, from the decoder hidden vector s_{t-1} at the previous time step and the subgraph node representations h_j obtained in step 3-4, wherein the decoder hidden vector at the initial time step is the vector representation h_s of the super node V_s:

e_{tj} = a(s_{t-1}, h_j)

where a is a scoring function that scores how well the input nodes around input position j match the current output;

step 3-6, computing the attention weights α_{tj} and the context vector c_t from the relevance scores e_{tj} of step 3-5:

α_{tj} = exp(e_{tj}) / Σ_{k=1}^{V} exp(e_{tk}),    c_t = Σ_{j=1}^{V} α_{tj} · h_j

where V is the number of node vector representations obtained in step 3-4;

step 3-7, computing the state vector s_t of the current time step t from the context vector c_t obtained in step 3-6, the decoder hidden vector s_{t-1}, and the decoder output y_{t-1} at the previous time step:

s_t = f(s_{t-1}, y_{t-1}, c_t)

where f is a nonlinear activation function that combines the context vector c_t, the decoder hidden vector s_{t-1}, and the previous decoder output y_{t-1} through weight matrices to compute the current decoder state vector s_t;

step 3-8, computing the output probability p of the current position from the state vector s_t obtained in step 3-7, the context vector c_t, and the decoder output y_{t-1} at the previous time step, where y_t is the Token output at time step t:

p(y_t | y_1, …, y_{t-1}) = g(y_{t-1}, s_t, c_t)

where g is a multi-layer nonlinear function that computes the probability score of the current position from the state vector s_t, the context vector c_t, and the output at the previous time step;

step 3-9, repeating steps 3-5 to 3-8 to iteratively generate Tokens until the sequence terminates, obtaining the Token sequence produced by converting the defect subgraph;

step 3-10, computing the loss between the Token sequence generated in step 3-9 and the fix sequence produced in step 2-3, and updating the parameters of the graph encoder and the sequence decoder by gradient descent according to the loss;

and 3-11, repeating the steps 3-1 to 3-10 for each bug-fix pair preprocessed in the training data set, and adjusting parameters in the Graph encoder and the Sequence decoder to obtain a trained translation model based on the Graph-to-Sequence architecture.

4. Patch statement generation and validation. For a newly input defective program, its defective statement is first marked and the method containing it is extracted as context. Step 2-4 (processing of the defective statement and its context) is then repeated to construct the code graph, and step 2-5 is repeated to complete the embedding of the code-graph nodes. The trained translation model predicts candidate Token sequences, which are restored with BPE tokenization to generate candidate patch sequences. Finally, the generated candidate patch sequences replace the defective statement in the source code file, the correctness of each patch is verified with the test cases, and the candidate patch sequences that pass the test cases are output as correct patches.

The invention represents code with a code graph that fuses multiple features of the source code to better capture the semantics of defect repair, and combines a pre-trained model to learn coding conventions and accelerate the training convergence of the translation model, so that the translation model can better learn the syntactic and semantic associations between defective and correct statements. By learning graph-structured defective statements with the translation model based on the Graph-to-Sequence architecture, correct code patches that conform to the syntax conventions can be generated automatically and accurately, greatly reducing the cost of defect repair.

Based on the same inventive concept, in one embodiment, the invention provides a graph-based statement-level program repair system, which includes: a data set extraction module for crawling defective code files and their patch files from open-source communities to construct a training data set for the translation model, and crawling methods whose number of modification commits is below a set threshold to construct a pre-training data set for the programming language model; a pre-training module for converting the code statements in the pre-training data set into Tokens, training the programming language model with the pre-training data set, preprocessing the training data set of the translation model by converting the repaired statements into Tokens and constructing a code graph from each defective statement and its context, and generating vector representations with the trained programming language model so as to conform to the input of the translation model; a translation model training module for embedding the data in the training data set with the pre-trained programming language model and training a translation model based on the Graph-to-Sequence architecture on the embedded training data set; and a program repair module for extracting the defective statement and its context from a newly input defective program, constructing a code graph, generating vector representations with the programming language model, and generating a patch with the trained translation model. For the specific implementation details of each module, refer to the method embodiments above, which are not repeated here.

Based on the same inventive concept, in one embodiment, the invention provides a graph-based statement-level program repair system comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when loaded into the processor, the computer program implements the graph-based statement-level program repair method of the above embodiments.

The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above; the embodiments and description merely illustrate the principles of the invention, and various changes and improvements may be made without departing from the spirit and scope of the invention, all of which fall within the scope of protection claimed. The scope of the invention is defined by the appended claims and their equivalents.
