Regular expression description generation method based on BART model

Document No. 115962 · Published: 2021-10-19

Reading note: this technique, "A regular expression description generation method based on a BART model", was designed and created by Yu Chi, Chen Xiang, Yang Guang and Liu Ke on 2021-04-21. Its main content is as follows: the invention provides a regular expression description generation method based on a BART model, which comprises the following steps: (1) collecting high-quality regular expressions, manually annotating each with a corresponding natural language description, and preprocessing the data; (2) inputting the tokens into an embedding layer to generate the final feature vector X; (3) improving the BART model. The invention has the beneficial effects that the method generates high-quality natural language descriptions for input regular expressions, thereby helping computer science beginners and developers understand regular expressions more quickly.

1. A regular expression description generation method based on a BART model is characterized by comprising the following steps:

(1) collecting high-quality regular expressions, manually labeling the collected regular expressions to obtain corresponding natural language descriptions, and finally generating a data set D, where each instance in the data set D is a regular expression paired with its natural language description;

(2) tokenizing the regular expressions in the data set D using the byte-level BBPE (Byte-Level Byte Pair Encoding) method, so that the BART model can better learn semantics from the resulting tokens;

(3) the data set D is further divided into a training set and a verification set; an initial model is trained from the BART model on the training set and fine-tuned on the verification set, and finally the regular expression description generation model based on the BART model is constructed;

the parameters of the model are set as follows:

the dropout of the regular expression description generation model is set to 0.1;

the activation function of the regular expression description generation model is set to the gelu function;

the number of attention heads of the regular expression description generation model is set to 16;

the word embedding dimension of the regular expression description generation model is set to 1024;

the number of hidden layers of the regular expression description generation model is set to 12;

the vocabulary size of the regular expression description generation model is set to 50265;

the number of encoder-decoder layers of the regular expression description generation model is set to 12;

(4) inputting a new regular expression into the trained regular expression description generation model to generate a corresponding high-quality text description, which can assist developers in understanding the meaning of the regular expression; the specific contents are as follows: after the regular expression is tokenized, the tokens are input into the model's encoder to learn context information vectors; the encoder output is then passed to the model's decoder for decoding, a softmax function is used to obtain the probability of each word, and finally the text description is generated using the Beam Search algorithm.

2. The method for generating a regular expression description based on a BART model as claimed in claim 1, wherein step (1) collects high-quality regular expressions of length 20 to 50 and manually labels each with a natural language description of length 20 to 50.

3. The method for generating a regular expression description based on a BART model according to claim 1, wherein in step (2) the BBPE (Byte-Level Byte Pair Encoding) segmentation method is applied to the regular expressions of the data set; BBPE splits each regular expression into a byte sequence during segmentation and appends the suffix "</w>" at the end.

4. The method for generating a regular expression description based on a BART model as claimed in claim 1, wherein the initial model is trained from the BART model in step (3), and for the regular expression description generation problem, the Self-Attention mechanism used in the encoder and decoder of the original BART model is replaced by a Norm-Attention mechanism, which applies a normalization technique to Q and K of the original Attention mechanism. With Q_nor and K_nor denoting the normalized Q and K, the Norm-Attention formula can be expressed as: Norm-Attention(Q, K, V) = softmax(g · Q_nor · K_nor^T) · V, where V is the value parameter matrix of the original Attention mechanism and g is a scaling factor; the Norm-Attention mechanism ensures that the BART model can still translate regular expressions into high-quality natural language descriptions with fewer resources.
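As a concrete illustration of the Norm-Attention computation in claim 4, here is a minimal NumPy sketch. The choice of row-wise L2 normalization and the default value of the scale g are assumptions; the claim only states that a normalization technique is applied to Q and K.

```python
import numpy as np

def norm_attention(Q, K, V, g=1.0):
    # L2-normalize Q and K row-wise (the exact normalization is an
    # assumption; the claim only says a "normalization technique").
    Q_nor = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    K_nor = K / np.linalg.norm(K, axis=-1, keepdims=True)
    # Each score is bounded by |g|, so the softmax cannot saturate badly.
    scores = g * (Q_nor @ K_nor.T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # Norm-Attention(Q, K, V)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = norm_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Because the normalized dot products lie in [-1, 1], the pre-softmax scores stay in a bounded range regardless of the hidden dimension, which is what keeps the softmax from saturating.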

5. The method for generating a regular expression description based on a BART model of claim 1, wherein in step (3) the BART initial model is further fine-tuned: the model embeds the tokens to obtain a feature vector Word Embedding, which captures the relationships between words; it learns the position relationship vector Position Embedding through position encoding of the tokens; it learns the semantic relationship vector Segment Embedding between two adjacent sentences through segment encoding; and the three learned vectors are added to obtain the final feature vector X of the code segment, expressed as: X = Position Embedding + Segment Embedding + Word Embedding.

6. The BART model-based regular expression description generation method of claim 1, wherein in step (4) a Beam Search algorithm is added after the softmax function. At each prediction step the search keeps the Top-k highest-probability words as the next input, where k is the beam size (beam width). At the first time step, the k words with the highest conditional probability are selected as the first words of the candidate output sequences; at each subsequent time step, based on the output sequences of the previous step, the k candidates with the highest conditional probability among all combinations are kept as the candidate output sequences; finally the optimal one is selected from the k candidates.
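The Beam Search procedure described in claim 6 can be sketched in a few lines of Python over toy decoder outputs. The vocabulary and per-step probabilities below are invented for illustration; in the actual model they would come from the decoder's softmax at each time step.

```python
import math

def beam_search(step_probs, k):
    """Keep the Top-k highest-probability partial sequences at each
    time step, as described in claim 6 (k is the beam width)."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for probs in step_probs:
        candidates = []
        for seq, score in beams:
            for tok, p in probs.items():
                candidates.append((seq + [tok], score + math.log(p)))
        # keep only the k highest-scoring candidate sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]  # the optimal candidate

# Toy per-step softmax outputs (invented for illustration).
steps = [{"the": 0.6, "a": 0.4},
         {"string": 0.55, "line": 0.45},
         {"matches": 0.9, "starts": 0.1}]
print(beam_search(steps, k=2))  # ['the', 'string', 'matches']
```

Summing log-probabilities instead of multiplying raw probabilities avoids numeric underflow on long sequences; with k = 1 the procedure reduces to greedy decoding.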

7. The method for generating a regular expression description based on a BART model as claimed in claim 1, wherein in step (4), when tuning the BART model, the processed data set is divided into a training set, a validation set, and a test set at a ratio of 8:1:1.

Technical Field

The invention relates to the technical field of computer application, in particular to a regular expression description generation method based on a BART model.

Background

In the field of computer science, regular expressions are very important concepts, generally used to retrieve and replace text that conforms to a certain pattern (rule). Regular expressions describe matching rules that are then used to validate string formats or extract string content. They can be used on various operating systems (e.g., Windows, Linux, Macintosh) and are supported by almost all programming languages (e.g., Python, C, Java, PHP). Regular expressions are widely used in different scenarios (e.g., software development, software maintenance, and software testing), but understanding their semantics is challenging for students or developers who are not familiar with regular expression syntax; it is relatively difficult, especially for computer beginners. A method that can quickly translate an input regular expression into a natural language text description would be an effective way to address this problem.
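As a small illustration of the kind of matching rule a regular expression encodes, here is a minimal Python example; the pattern and the input string are invented for illustration.

```python
import re

# A rule like "one or more digits, then anything, then the literal 'dog'".
pattern = re.compile(r"([0-9]+).*(dog)")

match = pattern.search("There are 3 hungry dogs")
print(match.group(1))  # "3"
print(match.group(2))  # "dog"
```

Reading the pattern requires knowing what `[0-9]`, `+`, and `.*` mean; the method of this invention aims to produce such an explanation ("one or more digits, then anything, then 'dog'") automatically.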

At present, little work addresses this problem: existing related methods translate an input natural language description into the corresponding regular expression, while translating a regular expression into a natural language description remains difficult. How to solve the above problem is the subject of the present invention.

Disclosure of Invention

The invention aims to provide a regular expression description generation method based on a BART model, which can translate a regular expression input by a developer into an understandable natural language description.

The idea of the invention is as follows: the invention provides a deep-learning-based method for generating natural language descriptions of regular expressions; the regular expression is used as text input, a machine translation model is constructed by means of an improved BART model, and the quality of the text descriptions translated by the proposed method is better than that of methods using other deep learning models (such as Transformer and BERT).

The invention is realized by the following measures: a regular expression description generation method based on a BART model comprises the following steps:

(1) collecting high-quality regular expressions, manually labeling the collected regular expressions to obtain corresponding natural language descriptions, and generating the final data set D, where each instance is a regular expression paired with its natural language description; the data set D comprises 10000 pairs of high-quality regular expressions and corresponding natural language descriptions.

(2) tokenizing the regular expression as text; because regular expressions contain a large number of special characters, tokenizing with byte-level BBPE effectively solves the OOV (out-of-vocabulary) problem and handles segmentation well, so that the BART model can better learn the semantics of the tokens;

(3) inputting the tokens into the embedding layer of the BART model to convert them into feature vectors, specifically comprising the following steps:

(3-1) the model generates the original feature vector Word Embedding through word embedding;

(3-2) the position relation of the tokens in the sentence is learned through position encoding, generating the position vector Position Embedding;

(3-3) the semantic relation vector Segment Embedding between two adjacent sentences is learned through segment encoding, and finally the three vectors are added to generate the final feature vector X, namely X = Position Embedding + Segment Embedding + Word Embedding;
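The three-way embedding sum of steps (3-1) to (3-3) can be sketched with toy NumPy lookup tables. The table sizes and token ids below are illustrative assumptions, not the model's real 50265-word vocabulary or 1024-dimensional embeddings.

```python
import numpy as np

# Toy lookup tables (sizes are assumptions for illustration).
rng = np.random.default_rng(1)
word_table = rng.normal(size=(16, 8))   # Word Embedding table
pos_table = rng.normal(size=(4, 8))     # Position Embedding table
seg_table = rng.normal(size=(2, 8))     # Segment Embedding table

token_ids = np.array([3, 7, 2, 9])      # a tokenized input (hypothetical ids)
positions = np.arange(len(token_ids))   # position of each token
segments = np.zeros(len(token_ids), dtype=int)  # single-segment input

# X = Position Embedding + Segment Embedding + Word Embedding
X = pos_table[positions] + seg_table[segments] + word_table[token_ids]
print(X.shape)  # (4, 8)
```

Each row of X is the element-wise sum of the three vectors looked up for that token, so the sequence length and embedding dimension are preserved.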

(4) the method comprises the following steps of improving a BART model to obtain a specific regular expression description generation model, and specifically comprises the following improvement steps:

(4-1) the invention replaces the Self-Attention mechanism in the original BART model with a Norm-Attention mechanism; this attention mechanism keeps the softmax function from saturating without sacrificing expressiveness, thereby ensuring the quality of the natural language descriptions generated by the BART model with fewer resources;

(4-2) adding a Beam Search algorithm in the generation part after the softmax function; using Beam Search mitigates the otherwise low quality of the generated natural language descriptions;

(5) dividing the data set into a training set, a verification set, and a test set at a ratio of 8:1:1, and training the constructed improved BART model with the training set to obtain the regular expression description generation model:

the parameter settings of the regular expression description generation model are as follows:

the dropout of the regular expression description generation model is set to 0.1;

the activation function of the regular expression description generation model is set to gelu;

the number of attention heads of the regular expression description generation model is set to 16;

the word embedding dimension of the regular expression description generation model is set to 1024;

the number of hidden layers of the regular expression description generation model is set to 12;

the vocabulary size (vocab_size) of the regular expression description generation model is set to 50265;

the number of encoder-decoder layers of the regular expression description generation model is set to 12.

Compared with the prior art, the invention has the following beneficial effects: the regular expression description generation method based on the BART model constructs a regular expression description generation model by improving the BART model, replaces the Self-Attention mechanism in the original BART model with the newly proposed Norm-Attention mechanism, and mitigates the low quality of the translated natural language descriptions by adding a Beam Search algorithm. The method therefore achieves considerable performance; as measured by multiple metrics, the natural language descriptions generated by translation explain the meaning of regular expressions with high quality, helping computer science beginners learn regular expressions better.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.

Fig. 1 is a system framework diagram of a regular expression description generation method based on a BART model provided in the present invention.

Fig. 2 is a flow chart of an embedding layer in the method provided by the present invention.

Fig. 3 is a block diagram of an encoder of the method provided by the present invention.

FIG. 4 is a structural diagram of the Norm-Attention mechanism used in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. Of course, the specific embodiments described herein are merely illustrative of the invention and are not intended to be limiting.

Example 1

Referring to fig. 1, a regular expression description generation method based on a BART model specifically includes the following steps:

(1) the method collects high-quality regular expressions and adds a corresponding natural language description to each by manual annotation; the data set comprises 10000 regular expressions and their corresponding natural language descriptions. Table 1 shows length statistics of the regular expressions in the data set, and Table 2 shows length statistics of the corresponding natural language descriptions.

TABLE 1

TABLE 2

(2) The regular expression is tokenized using the byte-level BBPE (Byte-Level BPE) algorithm; for example, the original regular expression "([0-9]+).*(dog)" is split into the byte-level tokens "(", "[", "0", "-", "9", "]", "+", ")", ".", "*", "(", "d", "o", "g", ")". The byte-level BBPE algorithm effectively solves the OOV problem and yields good tokenization results;
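The byte-level starting point of BBPE can be sketched as follows. The example expression "([0-9]+).*(dog)" is an assumption, and the pair-merging stage of real BBPE (as well as the "</w>" suffix) is omitted for brevity.

```python
# Byte-level starting point of BBPE: decompose the input into UTF-8 bytes,
# so no character can ever be out-of-vocabulary. Real BBPE then merges
# frequent byte pairs and appends the "</w>" suffix; both are omitted here.
regex = "([0-9]+).*(dog)"  # assumed example expression
byte_tokens = [bytes([b]).decode("latin-1") for b in regex.encode("utf-8")]
print(byte_tokens[:6])  # ['(', '[', '0', '-', '9', ']']
```

Because every possible byte is in the base vocabulary of 256 symbols, any special character a regular expression may contain is representable, which is why byte-level tokenization sidesteps the OOV problem.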

(3) as shown in fig. 2, the input tokens are converted into the corresponding feature vector X through the embedding layer, with the following formula:

the feature vector X = Position Embedding + Segment Embedding + Word Embedding;

(4) the data set is divided into a training set, a validation set, and a test set at a ratio of 8:1:1, where the training set is used to train and fine-tune the model, the validation set is used for model optimization, and the test set is used to evaluate the performance of the built model.
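A minimal sketch of the 8:1:1 split, assuming the 10000 (regular expression, description) pairs stated in step (1); the shuffle and its seed are assumptions, since the embodiment does not specify how instances are assigned to each subset.

```python
import random

def split_dataset(pairs, seed=42):
    """Divide (regex, description) pairs into train/validation/test
    sets at the 8:1:1 ratio used in step (4)."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # assumed: a seeded random shuffle
    n = len(pairs)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (pairs[:n_train],                     # training set (80%)
            pairs[n_train:n_train + n_val],      # validation set (10%)
            pairs[n_train + n_val:])             # test set (10%)

data = [(f"regex_{i}", f"description_{i}") for i in range(10000)]
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))  # 8000 1000 1000
```

Giving the remainder to the test set keeps the three subsets disjoint and exhaustive even when the data set size is not divisible by ten.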

(5) The BART model is improved: Norm-Attention replaces the original Self-Attention in the encoder and the decoder, with the encoder structure shown in FIG. 3; a Beam Search algorithm is added in the generation part after the softmax function; because Beam Search considers multiple candidate results in a single search, it yields better outputs than other search algorithms;

(6) training the constructed improved BART-based model based on the constructed data set to obtain a regular expression description generation model:

the parameter settings of the regular expression description generation model are as follows:

the dropout of the regular expression description generation model is set to 0.1;

the activation function of the regular expression description generation model is set to gelu;

the number of attention heads of the regular expression description generation model is set to 16;

the word embedding dimension of the regular expression description generation model is set to 1024;

the number of hidden layers of the regular expression description generation model is set to 12;

the vocabulary size (vocab_size) of the regular expression description generation model is set to 50265;

(7) inputting the feature vector X generated in step (3) into the regular expression description generation model from step (6), generating natural language descriptions for the regular expressions, and measuring the translation results using four metrics: BLEU, METEOR, ROUGE-L, and CIDEr:

Table 3: experimental results on the four metrics

Experiments show that the regular expression description generation method based on the BART model provided by the invention performs best on every metric compared with the other natural language description generation models constructed using Transformer and BERT models.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
