Pre-training framework for language understanding and generation with two-stage decoder


Abstract (filed 2021-07-05; inventors: 俞凯, 陈露, 马达, 陈志�): An embodiment of the present invention provides a pre-training framework for language understanding and generation with a two-stage decoder, including: an encoder for receiving noisy text in a conditional generation task and encoding it to obtain a noisy text sequence, wherein the noisy text comprises text with segment masking and/or sentence scrambling; and a decoder for receiving the noisy text sequence output by the encoder, which, in a first decoding stage, reconstructs the noisy text sequence to obtain reconstructed text and generates a meaning representation corresponding to the noisy text sequence, and, in a second decoding stage, generates subsequent text based on the reconstructed text and the meaning representation. An embodiment of the present invention also provides a two-stage decoder. The embodiments perform reconstruction plus understanding in the first decoding stage; the second decoding stage both preserves understanding of the text and generates subsequent text, so the pre-training quality is better. The explicit understanding can be applied to downstream tasks, makes context information easy to obtain, and widens the range of applicable tasks.

1. A two-stage decoder, comprising:

in a first decoding stage, reconstructing the output of an encoder to obtain reconstructed text and generating a meaning representation corresponding to the output;

in a second decoding stage, generating subsequent text based on the reconstructed text and the meaning representation.

2. The two-stage decoder of claim 1, wherein the meaning representation is explicit and is used for processing downstream summarization or question answering tasks.

3. The two-stage decoder of claim 1, wherein reconstructing the output of the encoder to obtain the reconstructed text and generating the meaning representation corresponding to the output comprises:

in a first decoding stage, receiving a noisy text sequence output by an encoder in a conditional generation task, reconstructing it to obtain the reconstructed text, and generating a meaning representation corresponding to the encoder output in the conditional generation task, wherein the noisy text comprises: text with segment masking and/or sentence scrambling.

4. A pre-training framework for language understanding and generation with a two-stage decoder, comprising:

an encoder configured to receive noisy text in a conditional generation task and encode the noisy text to obtain a noisy text sequence, wherein the noisy text comprises: text with segment masking and/or sentence scrambling; and

a decoder configured to receive the noisy text sequence output by the encoder and,

in a first decoding stage, reconstruct the noisy text sequence to obtain reconstructed text and generate a meaning representation corresponding to the noisy text sequence; and,

in a second decoding stage, generate subsequent text based on the reconstructed text and the meaning representation.

5. The pre-training framework of claim 4, wherein the conditional generation task comprises a text summarization task;

the encoder is configured to receive the text of the text summarization task and encode it to obtain a text sequence; and

the decoder is configured to receive the text sequence output by the encoder,

reconstruct the text sequence in a first decoding stage to obtain a coherent, understandable text sequence after reconstruction and generate a meaning representation of the text sequence, and,

in a second decoding stage, generate a text summary based on the coherent text sequence and the meaning representation.

6. The pre-training framework of claim 4, wherein the conditional generation task comprises a question answering task;

the encoder is configured to receive the text of the question answering task and encode it to obtain a text sequence, wherein the text sequence comprises: a passage text sequence and a question sequence related to the passage text; and

the decoder is configured to receive the text sequence output by the encoder,

reconstruct the question sequence in a first decoding stage to obtain an understandable question sequence after reconstruction and generate a meaning representation of the question sequence, and,

in a second decoding stage, generate an answer based on the passage text sequence, the reconstructed understandable question sequence, and the meaning representation.

7. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the two-stage decoder in the pre-training framework of any of claims 4-6.

8. A storage medium having a computer program stored thereon, wherein the program, when executed by a processor, carries out the steps of the two-stage decoder in the pre-training framework as claimed in any one of claims 4 to 6.

Technical Field

The present invention relates to the field of intelligent speech, and more particularly to a pre-training framework for language understanding and generation with a two-stage decoder.

Background

Self-supervised pre-training has raised the state of the art for Natural Language Generation (NLG) tasks, whose purpose is to generate natural language sentences from given documents (conditions), such as Context-to-Response generation in task-oriented dialogs and conversational response generation.

To implement pre-training, various types of pre-training models exist for conditional text generation tasks, and most of them fall into two categories. The first class of models, such as MASS (Masked Sequence to Sequence) and BART (Bidirectional and Auto-Regressive Transformers), decodes the masked portions or recovers the original text given the corrupted text. The second class, such as PALM (Pre-training an Autoencoding & Autoregressive Language Model), generates subsequent text from the context. The former has good context understanding, while the latter is good at predicting future text.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:

MASS and BART only reconstruct the original text or the masked portions of the original text and do no pre-training related to the generation of subsequent text, resulting in insufficient subsequent-generation capability and poor natural language generation. PALM only pre-trains the generation of subsequent text and does not explicitly generate the related understanding, so the generated subsequent text may be inconsistent with the context or incorrect due to insufficient understanding capability.

Disclosure of Invention

The present invention aims to at least solve the problems in the prior art that the generation of subsequent text is not pre-trained, so that the subsequent-generation capability is insufficient, and that no explicit understanding is generated, so that the generated subsequent text may be insufficient or not fluent.

In a first aspect, an embodiment of the present invention provides a two-stage decoder, including:

in a first decoding stage, reconstructing the output of an encoder to obtain reconstructed text and generating a meaning representation corresponding to the output;

in a second decoding stage, generating subsequent text based on the reconstructed text and the meaning representation.

In a second aspect, embodiments of the present invention provide a pre-training framework for language understanding and generation with a two-stage decoder, comprising:

an encoder configured to receive noisy text in a conditional generation task and encode the noisy text to obtain a noisy text sequence, wherein the noisy text comprises: text with segment masking and/or sentence scrambling; and

a decoder configured to receive the noisy text sequence output by the encoder and,

in a first decoding stage, reconstruct the noisy text sequence to obtain reconstructed text and generate a meaning representation corresponding to the noisy text sequence; and,

in a second decoding stage, generate subsequent text based on the reconstructed text and the meaning representation.

In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the two-stage decoder in the pre-training framework for language understanding and generation with the two-stage decoder of any of the embodiments of the present invention.

In a fourth aspect, an embodiment of the present invention provides a storage medium having a computer program stored thereon, wherein the program, when executed by a processor, performs the steps of the two-stage decoder in the pre-training framework for language understanding and generation with a two-stage decoder according to any of the embodiments of the present invention.

The embodiments of the invention have the following beneficial effects: reconstruction and understanding are performed in the first decoding stage, and the second decoding stage both preserves understanding of the text and generates subsequent text, so the pre-training quality is better. Moreover, the explicit understanding can be applied to downstream tasks; and because a single decoder performs the two stages under joint training, context information is easier to obtain than with two separate decoders. The framework is also applicable to summarization and question answering tasks.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a block diagram of a two-stage decoder according to an embodiment of the present invention;

FIG. 2 is a block diagram of a pre-training framework for language understanding and generation with a two-stage decoder according to an embodiment of the present invention;

FIG. 3 is a diagram of a pre-training model architecture for a pre-training framework for language understanding and generation with a two-stage decoder according to an embodiment of the present invention;

FIG. 4 is an architecture diagram of different fine-tuning settings of PLUTO, the pre-training framework for language understanding and generation with a two-stage decoder provided by an embodiment of the present invention;

FIG. 5 is a diagram of the results on the summarization test datasets of the pre-training framework for language understanding and generation with a two-stage decoder according to an embodiment of the present invention;

FIG. 6 is a diagram of the results on the CoQA development dataset of the pre-training framework for language understanding and generation with a two-stage decoder according to an embodiment of the present invention;

FIG. 7 is a graph of the results on the test set of the Cornell Movie Dialog Corpus (lower is better) of the pre-training framework for language understanding and generation with a two-stage decoder provided by an embodiment of the present invention;

FIG. 8 is an illustration of the linearization of conversational training data of the pre-training framework for language understanding and generation with a two-stage decoder according to an embodiment of the invention;

FIG. 9 is a graph of the Context-to-Response results on MultiWOZ2.0 of the pre-training framework for language understanding and generation with a two-stage decoder according to an embodiment of the present invention;

FIG. 10 is a graph of the Context-to-Response results on CamRest676 of the pre-training framework for language understanding and generation with a two-stage decoder according to an embodiment of the present invention;

FIG. 11 is a diagram of the GLUE benchmark results of the pre-training framework for language understanding and generation with a two-stage decoder according to an embodiment of the present invention;

FIG. 12 is a graph of PPL versus pre-training steps for PLUTO-2 and BART-2 on the Cornell Movie Dialog Corpus, for the pre-training framework for language understanding and generation with a two-stage decoder provided by an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a block diagram of a two-stage decoder according to an embodiment of the present invention, which includes the following structures:

S11: in a first decoding stage, reconstructing the output of an encoder to obtain reconstructed text and generating a meaning representation corresponding to the output;

S12: in a second decoding stage, generating subsequent text based on the reconstructed text and the meaning representation.

In this embodiment, the two-stage decoder belongs to the Transformer encoder-decoder framework. It should be noted that the two-stage decoder is a single decoder that performs two decoding stages, rather than a large decoder consisting of two smaller decoders. This design takes into account that, in pre-training, the actual input is noisy text, the understanding of that noisy text is its original text, and its reply is the subsequent text; the original text and the subsequent text form a context, and a single decoder, being a language model (like GPT), can capture the context between them, whereas splitting the decoder into two parts would lose this language-model property.

Formally, for S11, in the conditional generation task, let x = [x1, x2, ..., xm] denote a conditional text sequence, r = [r1, r2, ..., rk] denote the understanding sequence of x, and y = [y1, y2, ..., yn] denote a coherent subsequent text sequence. Given a condition x, which is input to the encoder, the two-stage decoder first generates the understanding r in the first stage and then predicts the coherent text y. The factorization is as follows:

P(r, y|x) = P(r|x) · P(y|x, r)

where P(r|x) is the Understanding term and P(y|x, r) is the Generation term. The meaning representation is the understanding sequence r, corresponding to the Understanding term in the above formula, and the reconstructed text is the predicted coherent text y, corresponding to the Generation term in the above formula.

For S12, in the pre-training step, the encoder receives the noisy/corrupted text f(x), where f(·) is a noise function, and the output of the encoder is fed to the two-stage decoder, resulting in:

P(x,y|f(x))=P(x|f(x))·P(y|f(x),x)
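In this pre-training instance, the first-stage understanding is the original text x itself and the second-stage output is the subsequent text y, so a single decoder can realize both factors by decoding the concatenation of x and y. Below is a minimal sketch of this target construction and the teacher-forced loss; the separator token, the `seq2seq_model` call signature (token-id inputs in, logits of shape batch × length × vocabulary out), and the use of PyTorch are illustrative assumptions rather than details fixed by the description.

```python
import torch
import torch.nn.functional as F

def two_stage_targets(noisy_ids, original_ids, subsequent_ids, sep_id):
    """Build the encoder input and decoder target for one pre-training example.

    The single decoder first reconstructs the original text x (first stage) and
    then continues with the subsequent text y (second stage), so one
    autoregressive pass realizes P(x | f(x)) * P(y | f(x), x).
    """
    encoder_input = noisy_ids                          # f(x): the corrupted condition
    decoder_target = original_ids + [sep_id] + subsequent_ids
    return encoder_input, decoder_target

def two_stage_loss(seq2seq_model, encoder_input, decoder_target):
    """Teacher-forced cross-entropy over the concatenated two-stage target."""
    inputs = torch.tensor([decoder_target[:-1]])       # shifted decoder input
    labels = torch.tensor([decoder_target[1:]])        # next-token labels
    logits = seq2seq_model(torch.tensor([encoder_input]), inputs)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```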

As an embodiment, reconstructing the output of the encoder to obtain the reconstructed text and generating the meaning representation corresponding to the output comprises:

in a first decoding stage, receiving a noisy text sequence output by an encoder in a conditional generation task, reconstructing it to obtain the reconstructed text, and generating a meaning representation corresponding to the encoder output in the conditional generation task, wherein the noisy text comprises: text with segment masking and/or sentence scrambling.

Following BART, the method introduces text infilling (segment masking) and sentence permutation (sentence scrambling) to corrupt the conditional text. To construct a large amount of pre-training data, a large number of text segments are selected from large unlabeled corpora. Let L denote the maximum length (in tokens) of each segment; the method cuts several consecutive tokens from the beginning of a segment as the conditional text and treats the remaining tokens as the assumed coherent subsequent text, so that the model learns to understand the human-written text before generating its continuation (the conditional segments are the text that is masked and/or sentence-scrambled). The length of x is 80% of L and the length of y is 20%, since coherent subsequent text is typically short, as in summaries and generative question answering. This arrangement enhances the correlation between pre-training and fine-tuning.
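A minimal sketch of this splitting and corruption procedure is given below, assuming the segment is already tokenized into sentences; the 15% masking ratio, the mask symbol, and applying both noising operations to every segment are assumptions — the description above only fixes the two noise types and the 80%/20% split.

```python
import random

MASK = "<mask>"  # placeholder mask token; the actual symbol is an assumption

def mask_segment(tokens, mask_ratio=0.15):
    """Segment masking: replace one contiguous span with a single mask token."""
    if not tokens:
        return tokens
    span = max(1, int(mask_ratio * len(tokens)))
    start = random.randrange(0, len(tokens) - span + 1)
    return tokens[:start] + [MASK] + tokens[start + span:]

def scramble_sentences(sentences):
    """Sentence scrambling: shuffle the sentence order and re-flatten to tokens."""
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return [tok for sent in shuffled for tok in sent]

def build_pretraining_example(sentences):
    """Split one <= L-token segment (a list of tokenized sentences) into the
    conditional text x (first 80% of tokens) and the subsequent text y (last
    20%), then corrupt x by sentence scrambling followed by segment masking."""
    tokens = [tok for sent in sentences for tok in sent]
    split = int(0.8 * len(tokens))
    x, y = tokens[:split], tokens[split:]

    # Re-segment x into whole sentences so that sentence scrambling is defined.
    x_sentences, used = [], 0
    for sent in sentences:
        if used + len(sent) > split:
            x_sentences.append(tokens[used:split])   # trailing partial sentence
            break
        x_sentences.append(sent)
        used += len(sent)

    noisy_x = mask_segment(scramble_sentences(x_sentences))
    return noisy_x, x, y                             # f(x), original x, subsequent y
```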

Unlike BART, in the second decoding stage the method generates the subsequent text based on the reconstructed coherent text, rather than generating it directly from the corrupted text. Unlike PALM, the method explicitly generates an understanding through a denoising objective and generates the subsequent text based on that understanding, thereby improving the quality of the generated subsequent text.

As one embodiment, the meaning representation is explicit and is used for processing downstream summarization or question answering tasks.

Since the understanding in PALM is implicit, there is a gap between pre-training and some downstream understanding tasks that need to generate an understanding. For example, for the Context-to-Response task in task-oriented dialog, the response depends on the result queried from the database according to the belief state, and the corresponding result cannot be accurately determined when the understanding is only implicit.

In contrast, the MR (Meaning Representation) used by the method is explicit; the MR may be some part of the condition (as in extractive summarization and QA), a rewritten summary or other comprehension (e.g., a belief state), or even exactly the same text as the condition. In the first decoding stage, an understanding of the condition is generated, and the coherent text is predicted in the second decoding stage. The conditional text sequence refers to the text sequence given in a task: for example, in a text summarization task the given text requires the model to output a summary, so the given text is the conditional text; in a question answering task the given article and question require the model to output an answer, so the article and question are the conditional text.

According to this embodiment, reconstruction and understanding are performed in the first decoding stage, and the second decoding stage both preserves understanding of the text and generates the subsequent text, so the pre-training quality is better. Moreover, the explicit understanding can be applied to downstream tasks; and because a single decoder performs the two stages under joint training, context information is easier to obtain than with two separate decoders. The framework is also applicable to summarization and question answering tasks.

Fig. 2 is a block diagram of a pre-training framework for language understanding and generation with a two-stage decoder according to an embodiment of the present invention, which includes the following structures:

S21: an encoder configured to receive noisy text in a conditional generation task and encode the noisy text to obtain a noisy text sequence, wherein the noisy text comprises: text with segment masking and/or sentence scrambling;

S22: a decoder configured to receive the noisy text sequence output by the encoder and,

in a first decoding stage, reconstruct the noisy text sequence to obtain reconstructed text and generate a meaning representation corresponding to the noisy text sequence; and,

in a second decoding stage, generate subsequent text based on the reconstructed text and the meaning representation.

In this embodiment, the decoder is the aforementioned two-stage decoder. The pre-training framework for language understanding and generation with a two-stage decoder may also be called PLUTO (Pre-training framework for Language Understanding and generation with Two-stage decOding); its specific structure is shown in FIG. 3. The decoder of PLUTO has two stages: understanding/reconstruction and generation. In downstream tasks, different tasks correspond to different stages, and the stage in which the task output is generated should be determined according to the specific downstream task.

For example, the summarization task may be viewed as an understanding of the document; thus the summary is generated in the first decoding stage. However, it is not appropriate to generate a dialog response in the first stage, since the response is not a direct understanding of the question: in a practical scenario, the question should first be understood and then a response generated. Therefore it is more appropriate to generate the response in the second stage, while the question itself can be regarded as the understanding.
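The task-dependent stage assignment described above can be expressed by laying out the decoder target differently per task. The sketch below is an illustrative assumption consistent with the description; the separator token and the task names are not fixed by the patent.

```python
SEP = "<sep>"  # stage separator token; an assumption, not fixed by the description

def build_decoder_target(task, understanding_tokens, output_tokens=None):
    """Lay out the decoder target according to which stage produces the output.

    - summarization: the summary *is* the understanding of the document, so it
      is generated in the first decoding stage and there is no second stage.
    - qa / dialog_response: the question (or belief state) is the first-stage
      understanding, and the answer or response follows in the second stage.
    """
    if task == "summarization":
        return understanding_tokens                           # first stage only
    if task in ("qa", "dialog_response"):
        return understanding_tokens + [SEP] + output_tokens   # stage 1 + stage 2
    raise ValueError(f"unknown task: {task}")
```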

As an embodiment, the conditional generation task includes a text summarization task;

the encoder is configured to receive the text of the text summarization task and encode it to obtain a text sequence; and

the decoder is configured to receive the text sequence output by the encoder,

reconstruct the text sequence in a first decoding stage to obtain a coherent, understandable text sequence after reconstruction and generate a meaning representation of the text sequence, and,

in a second decoding stage, generate a text summary based on the coherent text sequence and the meaning representation.

In this embodiment, as shown in FIG. 4(b), x represents the question in the summarization task (in the text summarization task, the question refers to the entire input text). After being encoded by the encoder, the encoded text is input to the decoder for two-stage decoding, yielding the reconstructed understandable coherent text sequence and the corresponding meaning representation. In this way, in the second decoding stage, a text summary can be generated based on the coherent text sequence and the meaning representation.

As another embodiment, the conditional generation task includes a question answering task;

the encoder is configured to receive the text of the question answering task and encode it to obtain a text sequence, wherein the text sequence comprises: a passage text sequence and a question sequence related to the passage text; and

the decoder is configured to receive the text sequence output by the encoder,

reconstruct the question sequence in a first decoding stage to obtain an understandable question sequence after reconstruction and generate a meaning representation of the question sequence, and,

in a second decoding stage, generate an answer based on the passage text sequence, the reconstructed understandable question sequence, and the meaning representation.

In this embodiment, in the generative question answering task shown in FIG. 4(a), x is a question and p is a passage, not two questions. The task is to give an article p and then ask a question x about this article, similar to a reading comprehension exercise in middle-school English. For this task, the first stage of the model reconstructs the question x, which represents the understanding of the question, and the second stage generates the answer.

It can be seen from this embodiment that PLUTO has a two-stage decoder that first focuses on understanding and then generates the coherent subsequent text. The two-stage decoding mechanism combines denoising and subsequent-text prediction as pre-training objectives, enhancing both the understanding and the generation capability of PLUTO.

In the experiments on the method, PLUTO has 12 layers in both the encoder and the decoder, and the hidden size is 1024. As the pre-training corpus, the method uses BookCorpus and the latest English Wikipedia (16 GB in total). The tokenizer is the same as that of BART, and the maximum length L is set to 512. To create the aforementioned text segments, a sliding window containing at most L tokens of complete sentences is used. Like PALM, the parameters are initialized with BART; training uses a batch size of 384, 100K steps, and a linear learning rate scheduler with a peak learning rate of 1e-5.
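For reference, the pre-training configuration stated above can be collected as follows; the field names follow common training-framework conventions and are assumptions, while the values come from the description.

```python
# Pre-training configuration of PLUTO as described above; only the values are
# taken from the text, the key names are illustrative.
PRETRAIN_CONFIG = {
    "encoder_layers": 12,
    "decoder_layers": 12,
    "hidden_size": 1024,
    "max_segment_length": 512,                         # maximum length L
    "corpora": ["BookCorpus", "English Wikipedia"],    # ~16 GB in total
    "tokenizer": "same as BART",
    "init_from": "BART",                               # as in PALM
    "batch_size": 384,
    "total_steps": 100_000,
    "lr_scheduler": "linear",
    "peak_learning_rate": 1e-5,
}
```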

In the summarization task, there is one document and one summary for that document, so the fine-tuning of this task is very intuitive: the summary is considered an understanding of the document and is generated in the first decoding stage. For the summarization setup, the method is evaluated on three datasets: CNN/DailyMail, XSum, and Gigaword. The pre-trained PLUTO is fine-tuned for 20K steps on all datasets. The batch size is set to 80 for CNN/DailyMail and XSum and to 256 for Gigaword. The peak learning rate is 3e-5, with a linear learning rate scheduler for CNN/DailyMail and a cosine learning rate scheduler for the other two. During generation, the beam size is set to 4 for CNN/DailyMail, 6 for XSum, and 5 for Gigaword. To evaluate the model, the ROUGE script (automatic summarization evaluation) is used.

All results are shown in FIG. 5. On CNN/DailyMail, PLUTO outperforms all of the baselines listed here, whose pre-training corpora are comparable. On XSum, PLUTO achieves a higher Rouge-1 than the best BART result and matches its Rouge-2 and Rouge-L. On Gigaword, PLUTO performs better than BART but not as well as PALM. In general, PLUTO performs better than BART, demonstrating the effectiveness of the two-stage decoding of the method.

For fine-tuning on generative question answering (QA), CoQA, a conversational question answering dataset, is used. The examples in CoQA are conversational, and the model should generate an answer from the dialog history (including the question of the current turn) and a passage. The method investigates two fine-tuning approaches: (1) similar to UniLM, concatenating the dialog history and the passage, feeding the concatenation to the encoder, and generating the answer in the first stage; (2) using the two-stage decoding of the method, where the question of the current turn is reconstructed in the first stage and the answer is predicted in the second stage.

With settings similar to UniLM and ERNIE-GEN, PLUTO is fine-tuned to generate answers on CoQA. During fine-tuning, the batch size is set to 64, and the model is optimized for 10K steps with a linear learning rate scheduler and a peak learning rate of 3e-5. During inference, the beam size is set to 5. The evaluation script comes from the official website.
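The two fine-tuning approaches differ only in how the encoder input and decoder target are assembled. The sketch below illustrates this; the concatenation order, the separator token, and the function signature are assumptions.

```python
SEP = "<sep>"  # stage separator; an assumption

def build_coqa_example(passage, history, question, answer, two_stage=True):
    """Build (encoder input, decoder target) for conversational QA fine-tuning.

    two_stage=False: the UniLM-style setting -- the concatenated history and
    passage go to the encoder and the answer is generated in the first stage.
    two_stage=True: the question of the current turn is reconstructed in the
    first stage (the understanding) and the answer is predicted in the second.
    `history` is a list of previous turn strings.
    """
    encoder_input = " ".join(history + [question, passage])
    decoder_target = f"{question} {SEP} {answer}" if two_stage else answer
    return encoder_input, decoder_target
```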

All results are shown in FIG. 6. The two-stage decoding method greatly improves both BART and PLUTO; even the generative method with two-stage decoding is superior to the extractive UniLM. In addition, PLUTO-2 works best, indicating that the pre-training of the method works well. Surprisingly, PLUTO-1 also outperforms BART-1, which is attributed to PLUTO's better understanding of the passage and the conversation history. In CoQA, almost all answers are sub-spans of the passage or yes/no, so the answer can be regarded as a meaning representation of the conversation history concatenated with the passage. Compared with BART, the reconstruction is influenced by the generation of the second decoding stage, and the pre-training objective helps the model understand better.

Experiments on the response generation task were conducted on the Cornell Movie Dialog Corpus, following the settings of MASS and PALM. However, the PPL (perplexity) results are not compared with theirs: since the vocabulary of the BART tokenizer is different from both, it would be unfair to compare PPL values over different vocabularies. Following them, experiments were performed on the complete data (110K) and on 10K randomly sampled data. During fine-tuning, the batch size is set to 64 and a linear learning rate scheduler with a peak learning rate of 3e-5 is used. The models are optimized for 20K steps on the complete data and 2K steps on the 10K randomly sampled training data. Perplexity (PPL) is used to evaluate the model.

The results are reported in FIG. 7. Compared with first-stage decoding, both BART and PLUTO obtain better performance after fine-tuning with two-stage decoding, which shows that the two-stage decoding method is effective. Furthermore, PLUTO performs best with two-stage decoding, and with only 10K data it is even better than first-stage decoding with 110K data, indicating that the pre-training objective of the method is necessary.

In the fine-tuning task, the model should generate a response from the dialog history (including the current-turn user utterance). Unlike QA on CoQA, a belief state should first be generated in order to query the database. The dialog data is linearized as shown in FIG. 8; for example: User: "hi, i am looking for a train to cambridge by 08:15." System: "certainly, where will you be departing from?" User: "i'll be leaving from bishops stortford on monday." From this history, the decoder produces:

[train] destination cambridge departure bishops stortford [/s] 0001 [resp] it looks like the [value_id] is what you are looking for, departing [value_departure] at [value_leave] and arriving in [value_destination] at [value_arrive]. would you like to book? [e]


The dialog history is concatenated as the input to the encoder. The belief state is considered the understanding of the dialog history, so the decoder generates it in the first stage. Given the belief state, the database query result, e.g., the number of entities that meet the user's requirements, is encoded as a binary sequence. The decoder then predicts the response conditioned on the belief state and the query result.
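The linearization described above (belief state in the first stage, database match code and delexicalized response in the second) can be sketched as follows; the function and argument names are illustrative assumptions, and only the target layout follows the example in FIG. 8.

```python
def linearize_dialog_turn(history_utts, belief_state, db_match_code, response):
    """Linearize one task-oriented dialog turn for two-stage decoding.

    Encoder input: the concatenated dialog history (including the current user
    utterance). Decoder target: the belief state (the first-stage understanding),
    followed by the binary database match code and the delexicalized response
    for the second stage.
    """
    encoder_input = " ".join(history_utts)
    bs = " ".join(
        f"[{domain}] " + " ".join(f"{slot} {value}" for slot, value in slots.items())
        for domain, slots in belief_state.items()
    )
    decoder_target = f"{bs} [/s] {db_match_code} [resp] {response} [e]"
    return encoder_input, decoder_target

# Example matching the dialog above (response shortened with "..."):
# linearize_dialog_turn(
#     ["hi , i am looking for a train to cambridge by 08:15", "..."],
#     {"train": {"destination": "cambridge", "departure": "bishops stortford"}},
#     "0001",
#     "it looks like the [value_id] is what you are looking for ...",
# )
```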

PLUTO is then validated on the Context-to-Response task in task-oriented dialog on MultiWOZ2.0 and CamRest676. Following DAMD, all system responses are delexicalized to reduce the diversity of surface language. During fine-tuning on MultiWOZ2.0, the batch size is set to 32 and the model is optimized for 10K steps using a linear learning rate scheduler with a peak learning rate of 3e-5. For CamRest676, the batch size is 64, and PLUTO is optimized for 20 epochs using a cosine learning rate scheduler with a peak learning rate of 3e-5. During generation, the beam size is set to 5 for both datasets. Inform, Success, and BLEU are reported: the first two evaluate task completion, i.e., whether the system returns an appropriate entity (Inform) and answers all of the attributes requested by the user (Success), while BLEU assesses the fluency of the response. A combined score, Combined = (Inform + Success) × 0.5 + BLEU, is also used as an overall quality measure. The same evaluation scripts as MinTL (BART) and DAMD are used.
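As a worked example of the combined score (the metric values used here are hypothetical, not results from FIG. 9 or FIG. 10):

```python
def combined_score(inform, success, bleu):
    """Combined quality measure: (Inform + Success) * 0.5 + BLEU."""
    return (inform + success) * 0.5 + bleu

# Hypothetical values: Inform 90.0, Success 80.0, BLEU 18.0
# -> (90.0 + 80.0) * 0.5 + 18.0 = 103.0
print(combined_score(90.0, 80.0, 18.0))
```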

The results on MultiWOZ2.0 are shown in FIG. 9, which also lists the annotations used by the various models. In the setting at the top, where the belief state is given, PLUTO performs best on all metrics except Inform. Unlike SOLOIST, which is pre-trained on labeled task-oriented dialog corpora including Schema and Taskmaster, BART and PLUTO do not use any labeled task-oriented dialog data. BART performs better than SOLOIST on Success and BLEU but worse on Inform. The combined score of PLUTO is further improved by 1.86 points compared with BART, with Inform and Success contributing most of the improvement, indicating that PLUTO can produce more satisfactory responses. The bottom of the figure is the end-to-end setting, in which the model must also generate the belief state. In addition to the dialog context, MinTL (BART) inputs the previous belief state to the encoder. The method uses the linearization shown in FIG. 8 and obtains better results for BART. PLUTO again gains 0.9 points on Inform and finally achieves the best overall performance.

The results on CamRest676 are shown in FIG. 10. Both BART and PLUTO achieve a high Inform score because the belief state on CamRest676 is very straightforward. However, PLUTO performs much better on Success and BLEU, reaching a new state-of-the-art combined score, which indicates that the two-stage pre-training of the method is effective.

Fine-tuning and results: furthermore, the method evaluates PLUTO on several discriminative tasks. Specifically, the model is tested on the GLUE benchmark. Only the first decoding stage is used, because discriminative tasks are understanding tasks. Similar to BART, the same input is fed to both the encoder and the decoder, and the final hidden state of the last decoder token in the first stage is input to a new multi-class linear classifier. All results are shown in FIG. 11. PLUTO matches the performance of BART and RoBERTa: on average it performs slightly better than BART, but not as well as RoBERTa. The GLUE datasets comprise MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, RTE, and WNLI.
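A minimal sketch of this first-stage classification head follows, assuming a generic encoder-decoder module that returns per-position decoder hidden states; the module interface and class name are assumptions.

```python
import torch.nn as nn

class FirstStageClassifier(nn.Module):
    """GLUE-style classification head using only the first decoding stage.

    `seq2seq` is assumed to return decoder hidden states of shape
    (batch, length, hidden); hidden_size = 1024 matches the model above.
    """
    def __init__(self, seq2seq, hidden_size=1024, num_classes=2):
        super().__init__()
        self.seq2seq = seq2seq
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids):
        # As in BART, the same token sequence is fed to encoder and decoder.
        hidden = self.seq2seq(input_ids, input_ids)
        last_token_state = hidden[:, -1, :]   # final hidden state of the last decoder token
        return self.classifier(last_token_state)
```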

To verify the effectiveness of the pre-training objective of the method, the relationship between PLUTO performance and the number of pre-training steps was explored on the Cornell Movie Dialog Corpus (see FIG. 12). With both 110K and 10K training data, the perplexity decreases as the number of pre-training steps increases, which validates the pre-training of the method.

The parameters of PLUTO are initialized from BART, so one might worry that the performance improvement of PLUTO over BART simply comes from the extra pre-training steps. To eliminate this concern, the method continues pre-training BART for 40K steps using the same pre-training data as PLUTO.

FIG. 12 shows that PLUTO performs better than BART at the same pre-training step, indicating that the pre-training objective of PLUTO is more effective. Moreover, as the number of BART pre-training steps increases, there is no significant performance improvement, as shown by the dashed line in FIG. 12: the performance trend of BART, which lacks the two-stage decoding mechanism, is very stable, indicating that the improvement of PLUTO does not come from further pre-training alone.

Finally, in general, the two pre-training methods most similar to PLUTO are BART and PALM. Unlike BART, PLUTO handles understanding and generation separately with a two-stage decoder. Unlike PALM, PLUTO explicitly generates understandings, which is more relevant to Context-to-Response in task-oriented dialogs that require generating belief states. Furthermore, the pre-training objective of PALM is less relevant to the summarization task, since the text to be generated tends to overlap with the document and is an understanding of it, not subsequent text like a response; in PLUTO, a summary can be generated in the first decoding stage. The proposed PLUTO is built on a Transformer. Although it is similar to BART, PLUTO has a two-stage decoder that first understands and then generates coherent text. The two-stage decoding mechanism combines denoising and subsequent-text prediction as pre-training objectives, enhancing both the understanding and the generation capability of PLUTO.

The storage medium, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the steps of the two-stage decoder in the pre-training framework for language understanding and generation with a two-stage decoder in any of the method embodiments described above.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the two-stage decoder in the pre-training framework for language understanding and generation with the two-stage decoder of any of the embodiments of the present invention.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.

(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.

(3) Portable entertainment devices such devices may display and play multimedia content. The devices comprise audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.

(4) Other electronic devices with data processing capabilities.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
