Unsupervised machine reading comprehension training method based on self-supervised learning

Document No.: 1816626  Publication date: 2021-11-09

Note: this invention, "Unsupervised machine reading comprehension training method based on self-supervised learning", was designed and created by Sun Le, Bian Ning, Han Xianpei, and Chen Bo on 2021-07-16. Abstract: The invention discloses an unsupervised machine reading comprehension training method based on self-supervised learning, comprising the following steps: 1) for a natural language text paragraph in a corpus, first find the words or phrases in the paragraph that meet a set condition, and replace one such word or phrase with a set symbol; 2) train a machine reading comprehension model with the replaced paragraph; during training, the machine reading comprehension model locates a continuous character string in the replaced paragraph, according to the context information of the set symbol, to fill the position of the set symbol. The invention learns the abilities required for machine reading comprehension directly from the information in the corpus itself, requires neither manually annotated reading comprehension data nor training data synthesized with heuristic rules, and yields a highly robust trained model.

1. An unsupervised machine reading comprehension training method based on self-supervised learning, comprising the following steps:

1) for a natural language text paragraph in a corpus, first finding the words or phrases in the paragraph that meet a set condition, and replacing one such word or phrase with a set symbol;

2) training a machine reading comprehension model with the replaced paragraph; during training, the machine reading comprehension model locates a continuous character string in the replaced paragraph, according to the context information of the set symbol in the paragraph, to fill the position of the set symbol.

2. The method according to claim 1, wherein in step 1) the replaced paragraphs are screened using statistical information to obtain the paragraphs used to train the machine reading comprehension model; the screening method is as follows: first, compute Score_pass, the sum of the TF-IDF values of all words in the paragraph; compute Score_mask, the sum of the TF-IDF values of all words in the replaced phrase; and compute Score_ans, the TF-IDF similarity between the context word sequence of the set symbol and the context word sequence of the answer position; then compute the paragraph score Score = Score_pass + α·Score_mask + β·Score_ans and select the top N data items with the highest scores, where α and β are hyperparameters.

3. The method of claim 2, further comprising an iterative data screening mechanism, which scores the remaining self-supervised data with a machine reading comprehension model trained on part of the self-supervised data, removes the self-supervised data whose score falls below a set threshold, and uses the remaining self-supervised data for the next training iteration; the self-supervised data are the paragraphs obtained after screening with statistical information.

4. The method of claim 1, wherein the set condition comprises: (1) the character string occurs repeatedly in the paragraph; (2) the character string contains no stop words or punctuation marks; (3) the character string is at most 5 words long.

5. The method of claim 4, wherein the number of repeated occurrences t satisfies 2 ≦ t ≦ 4.

6. The method according to claim 1, 2 or 3, wherein the set symbol is not an English word.

7. A method for obtaining answers with a machine reading comprehension model, comprising the following steps:

1) for a natural language text paragraph in a corpus, first finding the words or phrases in the paragraph that meet a set condition, and replacing one such word or phrase with a set symbol;

2) training a machine reading comprehension model with the replaced paragraph; during training, the machine reading comprehension model locates a continuous character string in the replaced paragraph, according to the context information of the set symbol in the paragraph, to fill the position of the set symbol;

3) before inputting a question from a reading comprehension task into the trained machine reading comprehension model, converting the interrogative word in the question into the set symbol, thereby converting the question into a declarative sentence; then inputting the declarative sentence into the trained machine reading comprehension model, which extracts the corresponding answer from a given text paragraph according to the declarative sentence.

8. A server, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 7.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

Technical Field

The invention relates to a machine reading comprehension model training method, in particular to an unsupervised machine reading comprehension model training method based on self-supervised learning, and belongs to the technical field of natural language processing.

Background

Machine reading comprehension refers to training a model by machine learning methods so that, given a text paragraph and a related question, the model can extract the answer to the question from the paragraph.

Machine reading comprehension is one of the supporting technologies for building intelligent information services. Internet text data contains a great deal of information and knowledge, but it is stored in unstructured form. Traditional information acquisition technology, centered on information retrieval, struggles to meet users' fine-grained and diversified information needs. Automatic question-answering systems built around machine reading comprehension provide an effective means of meeting such fine-grained information needs.

Machine reading comprehension models typically rely on large amounts of manually annotated training data and are therefore hard to apply directly to new domains and low-resource languages. Existing pre-trained language models are not designed for reading comprehension: there is a capability gap between the context-dependent word prediction they focus on and the answer extraction that reading comprehension requires. Existing unsupervised machine reading comprehension methods based on synthetic data struggle to generate high-quality reading comprehension data, and models trained on such synthetic data rarely meet the performance and robustness requirements of real application scenarios.

Self-supervised learning refers to techniques that give a model a specific capability by designing a self-supervised training task and training the model on the information contained in unlabeled data.

Disclosure of Invention

To train unsupervised machine reading comprehension models more effectively, the invention provides an unsupervised machine reading comprehension training method based on self-supervised learning.

The method comprises: (1) a self-supervised learning task and training method for machine reading comprehension; and (2) a question conversion method for the unsupervised reading comprehension model. The self-supervised learning task learns the abilities required for machine reading comprehension directly from the information in the corpus. The question conversion method works together with the self-supervised learning to resolve the syntactic mismatch between declarative sentences in the unlabeled corpus and questions in the reading comprehension task.

The technical scheme adopted by the invention is summarized as follows:

An unsupervised machine reading comprehension training method based on self-supervised learning comprises the following parts:

1) Setting and training of the self-supervised learning task: for each natural language text paragraph in the corpus, the self-supervised learning task first finds meaningful words or phrases that occur repeatedly in the paragraph and replaces one occurrence with a set [MASK] symbol (usually a special character distinct from ordinary English words); the replaced paragraphs are then screened using statistical information; finally, a machine reading comprehension model is trained to locate a continuous character string in the replaced paragraph (by predicting the string's start and end positions), according to the context of the [MASK] symbol, and to fill the [MASK] position in the paragraph.

2) A question conversion method for unsupervised machine reading comprehension: in the model testing and application stage, before a question in the reading comprehension task is input into the trained machine reading comprehension model, the question is converted into a declarative sentence bearing a [MASK] symbol, keeping the form of the input data consistent between the training stage and the testing (application) stage. In this conversion, the interrogative word in the question is found and replaced with [MASK]; the resulting declarative sentence is then input into the model to obtain the answer.

Further, the meaningful word or phrase selected in part 1) is a character string satisfying the following conditions: (1) its number of occurrences t in the text paragraph satisfies 2 ≤ t ≤ 4; (2) it contains no stop words or punctuation marks; (3) it is at most 5 words long. Under these constraints, the word or phrase replaced by [MASK] carries more information, which benefits the training of the machine reading comprehension model.

Further, the data screening in part 1) includes defining a scoring function for the self-supervised data:

(1) Score_pass: the sum of the TF-IDF values of all words in the paragraph.

(2) Score_mask: the sum of the TF-IDF values of all words in the replaced phrase.

(3) Score_ans: the TF-IDF similarity between the context word sequence of [MASK] and the context word sequence of the answer position.

The score of a data item is the weighted sum of the three scores above, i.e.

Score = Score_pass + α·Score_mask + β·Score_ans

where α and β are hyperparameters. Ranking the data with the three scoring functions combined further selects the data most useful for training the reading comprehension model.

Further, part 1) also includes an iterative data screening mechanism, which scores the remaining data with a model trained on part of the self-supervised data (a number of screened paragraphs), removes the self-supervised data whose score falls below a set threshold, and uses the remaining self-supervised data in the next training iteration. This mechanism continues to screen the data during iterative model training, ensuring the quality of the data finally used to train the reading comprehension model.

Further, part 2) uses a template-based question conversion algorithm that converts the question into a declarative sentence by replacing the interrogative word in the question with a template bearing the [MASK] character. This question conversion effectively resolves the syntactic mismatch between questions in the question-answering setting and declarative sentences in the model training setting.

Compared with the prior art, the invention has the following beneficial effects:

1) The self-supervised learning task and training method for machine reading comprehension learn the abilities required for machine reading comprehension directly from the information in the corpus, requiring neither manually annotated reading comprehension data nor training data synthesized with heuristic rules.

2) The question conversion method for the unsupervised reading comprehension model effectively resolves the syntactic mismatch between declarative sentences in the unlabeled corpus and questions in the reading comprehension task, improving the robustness of the reading comprehension model.

Drawings

Fig. 1 is a flowchart of the self-supervised learning training method for machine reading comprehension.

Fig. 2 illustrates the conversion of a question in the reading comprehension task into a declarative sentence bearing a [MASK] symbol during model testing.

Detailed Description

The present invention will be described in further detail below to make the above objects, features and advantages of the present invention more apparent and understandable.

The invention comprises an unsupervised machine reading comprehension training method based on self-supervised learning with two key parts:

1) Setting and training of the self-supervised learning task: for each natural language text paragraph in the corpus, the self-supervised learning task first finds meaningful words or phrases that occur repeatedly in the paragraph and replaces one occurrence with a set [MASK] symbol; the replaced paragraphs are then screened using statistical information; finally, a machine reading comprehension model is trained to locate a continuous character string in the replaced paragraph (by predicting the string's start and end positions), according to the context of the [MASK] symbol, and to fill the [MASK] position in the paragraph.

2) A question conversion method for unsupervised machine reading comprehension: in the model testing and application stage, before a question in the reading comprehension task is input into the machine reading comprehension model trained as above, the question is converted into a declarative sentence bearing a [MASK] symbol, keeping the form of the input data consistent between the training stage and the testing (application) stage.

Self-supervised learning task:

Specifically, the self-supervised task for machine reading comprehension designed by the invention can be stated formally as follows: given a text paragraph p_mask containing a [MASK] character, the self-supervised task requires the machine reading comprehension model to predict the start and end positions [i_start, i_end] of a character string in the paragraph that can fill in the content of the [MASK] character.

For example, in the text paragraph shown in Fig. 1, the second occurrence of "vitamin D" is replaced with a [MASK] character, and the self-supervised task requires the machine reading comprehension system to predict the start and end positions of the first "vitamin D" and to fill the [MASK] position with it.

Unlike existing pre-trained language models, the self-supervised task restricts the filling of the [MASK] character to a single continuous string from the paragraph. Through training on this task, the machine reading comprehension model learns to mine the deep associations between sentences in a text paragraph and to extract the required character string from the paragraph accurately, the core abilities machine reading comprehension requires.

To make the self-supervised training more effective, the part replaced with the [MASK] character should be a meaningful phrase carrying a sufficient amount of information. To this end, character strings that occur more than once in a text paragraph are found first, and a string is selected that satisfies the following conditions (a masking sketch follows the list):

(1) the number of occurrences t of the string in the text paragraph satisfies 2 ≤ t ≤ 4, since strings with too high a frequency tend to carry little information;

(2) the string contains no stop words or punctuation marks, since these generally do not appear in short answers to questions;

(3) the string is at most 5 words long, since the context of an overly long string provides less information.
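As an illustration of this selection-and-masking step, here is a minimal Python sketch; the function names, the whitespace tokenizer, and the tiny stop-word list are illustrative assumptions, not part of the invention:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "was"}  # tiny illustrative list
MASK = "[MASK]"

def candidate_spans(tokens, max_len=5, min_count=2, max_count=4):
    """Word n-grams occurring 2-4 times, containing no stop words or
    punctuation, and at most 5 words long (conditions (1)-(3))."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if any(w.lower() in STOP_WORDS or not w.isalnum() for w in gram):
                continue
            counts[gram] += 1
    return [g for g, c in counts.items() if min_count <= c <= max_count]

def mask_one_occurrence(tokens, span):
    """Replace one occurrence of `span` with [MASK]; the other occurrences
    remain in the paragraph and serve as the answer string.
    (Which occurrence to mask can be chosen freely; the first is masked here.)"""
    n = len(span)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == span:
            return tokens[:i] + [MASK] + tokens[i + n:], " ".join(span)
    raise ValueError("span not found")

tokens = "vitamin D helps calcium absorption and a lack of vitamin D weakens bones".split()
for span in candidate_spans(tokens):
    masked, answer = mask_one_occurrence(tokens, span)
    print(" ".join(masked), "=>", answer)
```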

Self-supervised learning trains the machine reading comprehension model with a cross-entropy loss function. Given a p_mask, the training goal is to maximize the log probability of the correct start and end positions. Specifically, the training objective is to maximize

Σ_{j=1}^{N} ( log p_start^(j) + log p_end^(j) )

where p_start^(j) and p_end^(j) are the probabilities the machine reading comprehension model assigns to the correct start and end positions of the j-th training example, j indexes the training data, and N is the total number of training examples.
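A minimal sketch of this objective, assuming a PyTorch setup in which the model outputs per-position start and end logits (tensor shapes and names are illustrative); minimizing the two cross-entropy terms maximizes Σ_j (log p_start^(j) + log p_end^(j)):

```python
import torch
import torch.nn.functional as F

def span_loss(start_logits, end_logits, start_positions, end_positions):
    """Cross-entropy loss over start/end positions.

    start_logits, end_logits: [batch, seq_len] scores per token position.
    start_positions, end_positions: [batch] gold indices i_start, i_end.
    """
    return (F.cross_entropy(start_logits, start_positions)
            + F.cross_entropy(end_logits, end_positions))

# toy check with random logits
batch, seq_len = 4, 128
loss = span_loss(torch.randn(batch, seq_len), torch.randn(batch, seq_len),
                 torch.randint(0, seq_len, (batch,)), torch.randint(0, seq_len, (batch,)))
print(loss)
```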

Data sorting and screening:

A data sorting and screening mechanism is used to remove low-quality, overly easy-to-predict items from the self-supervised learning data. The sorting mechanism uses a TF-IDF based data scoring function with three parts:

(1) Score_pass: to assess the quality of a text paragraph and exclude meaningless paragraphs, the sum of the TF-IDF values of all words in the paragraph is calculated.

(2) Score_mask: to evaluate the information content of the phrase replaced by the [MASK] character and avoid replacing an overly simple, meaningless phrase, the sum of the TF-IDF values of all words in the replaced phrase is calculated.

(3) Score_ans: to select the better answer position when multiple answer positions exist, the TF-IDF similarity between the context word sequence of [MASK] and the context word sequence of the answer position is calculated, where the context is a window of 10 words. The similarity between two word sequences can be computed from the TF-IDF values of their words.

The score of a data item is the weighted sum of the three scores above, i.e.

Score = Score_pass + α·Score_mask + β·Score_ans

where α and β are hyperparameters.

All self-supervised data are then sorted by score in descending order, and the top N highest-scoring items are taken for self-supervised learning, as sketched below.
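One possible implementation of this scoring and ranking, sketched with scikit-learn's TfidfVectorizer; the dictionary keys, the use of cosine similarity as the "TF-IDF similarity", and the default values of alpha, beta, and top_n are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_self_supervised_data(examples, alpha=1.0, beta=1.0, top_n=1000):
    """Rank masked-paragraph examples by
    Score = Score_pass + alpha * Score_mask + beta * Score_ans.

    Each example is a dict with keys (illustrative):
      'passage'      - the full paragraph text,
      'answer'       - the phrase replaced by [MASK],
      'mask_context' - the 10-word window around the [MASK] position,
      'ans_context'  - the 10-word window around the answer position.
    """
    vectorizer = TfidfVectorizer().fit(e["passage"] for e in examples)

    def tfidf_sum(text):                      # sum of TF-IDF values of all words
        return vectorizer.transform([text]).sum()

    def tfidf_sim(a, b):                      # TF-IDF (cosine) similarity
        m = vectorizer.transform([a, b])
        return cosine_similarity(m[0], m[1])[0, 0]

    scored = [(tfidf_sum(e["passage"])
               + alpha * tfidf_sum(e["answer"])
               + beta * tfidf_sim(e["mask_context"], e["ans_context"]), e)
              for e in examples]
    scored.sort(key=lambda se: se[0], reverse=True)
    return [e for _, e in scored[:top_n]]
```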

Through this data sorting mechanism, the machine reading comprehension model can learn more efficiently from higher-quality paragraphs.

Iterative data screening:

In the self-supervised data, not all [MASK] characters are predictable, which is inconsistent with the setting of the reading comprehension task. To reduce such unpredictable data, an iterative data screening mechanism is used to filter out self-supervised data with low prediction probability.

Specifically, the self-supervised data are first divided into 5 groups. A machine reading comprehension model is trained on one group; the trained model is then used to compute the prediction probability of each item in the next group, and items whose probability falls below a set threshold τ are removed. Training of the machine reading comprehension model then continues on the remaining data. This process is repeated until all 5 groups have been used for training.

To compute the prediction probability of a data item, the model is first used to compute, for each word position in the text paragraph, the probability that it is the start or end position; the probability of a substring is the product of the start and end probabilities of its first and last positions. The 20 substrings with the highest probability are taken and their probabilities are normalized with Softmax; the prediction probability of the data item is the normalized probability of the correct answer string, or 0 if the correct answer is not among the 20 highest-probability substrings (see the sketch below).
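The prediction-probability computation and one possible reading of the iterative filtering loop could look like this sketch; `train_model`, `predict_probs`, the `gold_span` key, and the value of tau are placeholders for whatever training routine, scoring interface, and threshold are actually used:

```python
import numpy as np

def prediction_probability(start_probs, end_probs, gold_span, top_k=20):
    """Normalized probability of the gold answer span among the top-20
    substrings; 0 if the gold span is not among them."""
    n = len(start_probs)
    spans = [(i, j) for i in range(n) for j in range(i, n)]
    # substring probability = product of its start and end probabilities
    spans.sort(key=lambda s: start_probs[s[0]] * end_probs[s[1]], reverse=True)
    top = spans[:top_k]
    scores = np.array([start_probs[i] * end_probs[j] for i, j in top])
    normalized = np.exp(scores) / np.exp(scores).sum()  # Softmax over the top 20
    for (i, j), p in zip(top, normalized):
        if (i, j) == tuple(gold_span):
            return float(p)
    return 0.0

def iterative_screening(groups, train_model, predict_probs, tau=0.1):
    """Train on group 1; for each further group keep only the items the
    current model predicts with probability >= tau, then keep training."""
    kept = list(groups[0])
    model = train_model(kept)
    for group in groups[1:]:
        for ex in group:
            start_probs, end_probs = predict_probs(model, ex)
            if prediction_probability(start_probs, end_probs, ex["gold_span"]) >= tau:
                kept.append(ex)
        model = train_model(kept)
    return model, kept
```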

Machine reading comprehension model:

The machine reading comprehension model in the invention can adopt a variety of model architectures, including but not limited to machine reading comprehension models based on pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers).

Question conversion:

In the model testing and actual use stage, to resolve the syntactic mismatch between declarative sentences in the unlabeled corpus and questions in the reading comprehension task, the question must be converted into a declarative sentence bearing a [MASK] symbol. Specifically, as shown in Fig. 2, before a question is input into the machine reading comprehension model, a template-based question conversion algorithm replaces the interrogative word in the question with a template bearing the [MASK] character, converting the question into a declarative sentence. The machine reading comprehension model then extracts the answer to the question from the given text paragraph based on the converted declarative sentence.
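A minimal sketch of such a template-based conversion; the template table below is illustrative (the patent does not enumerate its templates), and a practical converter would also reorder auxiliary verbs:

```python
MASK = "[MASK]"

# illustrative interrogative-word templates (checked longest-first)
TEMPLATES = [
    ("how many", MASK),
    ("whom", MASK),
    ("who", MASK),
    ("when", "in " + MASK),
    ("where", "in " + MASK),
    ("which", MASK),
    ("what", MASK),
]

def question_to_statement(question):
    """Replace the interrogative word with a [MASK]-bearing template,
    turning the question into a declarative sentence."""
    q = question.strip().rstrip("?").lower()
    for wh, template in TEMPLATES:
        if q.startswith(wh + " "):
            return template + q[len(wh):] + "."
    return q + " " + MASK + "."  # fallback: append [MASK]

print(question_to_statement("Who discovered vitamin D?"))
# -> [MASK] discovered vitamin d.
```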

The above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and those skilled in the art can make modifications or equivalent substitutions on the technical solutions of the present invention without departing from the principle and scope of the present invention, and the scope of the present invention should be determined by the claims.
