Optimal alignment method based on transformer attention mechanism output

Document No.: 1556963    Publication date: 2020-01-21

Reading note: this technique, "An optimized alignment method based on transformer attention mechanism output" (Optimal alignment method based on transformer attention mechanism output), was created by 闫明明, 陈绪浩, 李迅波, 罗华成, 赵宇 and 段世豪 on 2019-09-27. Its main content is as follows: the invention discloses an optimized alignment method based on the transformer attention mechanism output, applied to the attention-based transformer framework model. The inputs to the attention mechanism function are the word-vector tensors Q and K of the source and target languages; within the translation framework each attention function produces one alignment tensor as output, so several attention functions yield several alignment tensors, and because random parameters vary during computation each output is different. For these alignment tensor outputs, their respective two-norm values Ti are first computed, and a cosine similarity formula is then used to compute the optimal value; the optimal value serves as the final alignment tensor and participates in the whole translation process. By introducing this regularized computation, the most appropriate of the multiple outputs can be determined, which effectively improves the alignment quality of the attention mechanism function and raises the translation quality and score. The algorithm can be applied to any model containing an attention mechanism without modifying the model framework.

1. An optimized alignment method based on the output of a transformer attention mechanism, applied to the attention-based transformer model, characterized in that the method comprises the following steps: computing the word vectors of the input sentences in the source language and the target language during translation, and simultaneously obtaining a plurality of attention function outputs output(i);

computing the two-norm of each output tensor to obtain T(i), and regularizing the outputs output(i) according to the formula

(optimal-value formula given in the patent drawing, computed from the two-norms T(i) of the outputs and their cosine similarity)

2. The optimized alignment method based on the transformer attention mechanism output according to claim 1, characterized in that the method is realized in the following specific manner:

The first step: generate the temporal semantic vector

c_t = Σ_i α_ti·h_i

α_ti = exp(e_ti) / Σ_k exp(e_tk)

s_t = tanh(W[s_{t-1}, y_{t-1}])

e_ti = s_t·W_a·h_i

The second step: pass the hidden-layer information and make the prediction

s_t = f(s_{t-1}, y_{t-1}, c_t)

p(y_t | y_1, …, y_{t-1}, x) = g(y_{t-1}, s_t, c_t)

Taking the word-vector tensors Q and K of the source language and the target language as the initial quantities of the calculation, the outputs of a plurality of attention functions are computed; because of the parameter matrices, each output carries its own error, so a regularization constraint operation is applied to the outputs to obtain the final output, which is then used in the subsequent calculation. The process is as follows:

Step 1: let the hidden-layer output vectors be K_i, and perform the dot-product operation Q·K^T to obtain S_i.

Step 2: perform softmax normalization to obtain the alignment weights A_i, with the calculation formula as follows;

A_i = exp(S_i) / Σ_j exp(S_j)

Step 3: multiply A_i by V_i to obtain Attention(Q, K, V), with the calculation formula

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V

Step 4: the attention function is repeatedly calculated a plurality of times, obtaining output1, output2, …, output(i);

Step 5: from the obtained output1, output2, …, output(i), compute the final output according to the calculation formula

(optimal-value formula given in the patent drawing, computed from the two-norms T(i) of the outputs and their cosine similarity)

Step 6: the final output participates in the subsequent operations.

Technical Field

The invention relates to neural machine translation, and in particular to an optimized alignment method based on the transformer attention mechanism output.

Background

Neural network machine translation is a machine translation method proposed in recent years. Compared with traditional statistical machine translation, it trains a neural network that maps one sequence to another and can produce output sequences of variable length, which gives better performance in translation, dialogue and text summarization. Neural network machine translation is in fact an encoding-decoding system: the encoder encodes the source-language sequence and extracts its information, and the decoder converts that information into another language, the target language, thereby completing the translation.

When the model generates its output, it produces an attention range indicating which parts of the input sequence should be focused on when generating the next output, generates that output according to the focused region, and repeats the process. The attention mechanism resembles certain human behaviour: when a person reads a sentence, they usually focus only on the informative words rather than on all words, that is, the attention weight assigned to each word differs. The attention mechanism increases the training difficulty of the model but improves the quality of the generated text. This patent improves precisely this attention mechanism function.

Since the neural machine translation system was first proposed in 2013, and with the rapid growth of computing power, neural machine translation has developed quickly; the seq2seq model, the Transformer model and others have been proposed in succession. In 2013, Nal Kalchbrenner and Phil Blunsom proposed a novel end-to-end encoder-decoder structure for machine translation. The model uses a convolutional neural network (CNN) to encode a given piece of source text into a continuous vector, and then uses a recurrent neural network (RNN) as the decoder to convert the state vector into the target language. In 2017, Google released a new machine learning model, the Transformer, which performed far better than existing algorithms in machine translation and other language understanding tasks.

The Seq2Seq framework stands for Sequence to Sequence. It is a general encoder-decoder framework that can be used in scenarios such as machine translation, text summarization, dialogue modelling and image captioning. Seq2Seq is not an official open-source implementation of the GNMT (Google Neural Machine Translation) system; the purpose of the framework is to accomplish a wider range of tasks, of which neural machine translation is only one.

The traditional technology has the following technical problems:

In the alignment process of the attention mechanism function, the existing framework first computes the similarity of the word vectors of the two input sentences and then performs a series of calculations to obtain the alignment function. Each alignment function produces one output per calculation, and that output is used as the input of the next calculation. Such single-threaded computation is prone to error accumulation. We therefore introduce a regularization computation method to find the optimal solution among multiple computation passes, so as to achieve the best translation effect.

Disclosure of Invention

Therefore, in order to overcome the above-mentioned disadvantages, the present invention provides an optimized alignment method based on the transformer attention mechanism output.

The invention is realized in such a way that an optimized alignment method based on the output of the transformer attention mechanism is constructed and applied to the attention-based transformer model, characterized in that the method comprises the following steps: computing the word vectors of the input sentences in the source language and the target language during translation, and simultaneously obtaining a plurality of attention function outputs output(i);

computing the two-norm of each output tensor to obtain T(i), and regularizing the outputs output(i) according to the formula

(optimal-value formula given in the patent drawing, computed from the two-norms T(i) of the outputs and their cosine similarity)

The resulting output is taken as the final output.
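To make this selection step concrete, a minimal numpy sketch follows. The patent gives the exact optimal-value formula only as a drawing, so the rule used here, keeping the output whose mean cosine similarity to the other outputs (normalized by their two-norms T(i)) is largest, is an assumption, and the function name select_aligned_output is purely illustrative.

```python
import numpy as np

def select_aligned_output(outputs):
    """Pick one alignment tensor out of several attention outputs.

    Assumed selection rule: keep the output whose mean cosine similarity
    to the other outputs is largest, using the two-norms T(i) as normalizers.
    """
    if len(outputs) == 1:
        return outputs[0]
    flat = [o.reshape(-1) for o in outputs]        # flatten each alignment tensor
    T = [np.linalg.norm(f) for f in flat]          # two-norm T(i) of every output
    best_idx, best_score = 0, -np.inf
    for i, fi in enumerate(flat):
        # mean cosine similarity between output i and every other output j
        sims = [fi @ fj / (T[i] * T[j]) for j, fj in enumerate(flat) if j != i]
        score = float(np.mean(sims))
        if score > best_score:
            best_idx, best_score = i, score
    return outputs[best_idx]                       # final alignment tensor
```

The chosen tensor then plays the role of the single alignment output that the rest of the translation pipeline consumes, so nothing outside the attention function has to change.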

The optimized alignment method based on the transformer attention mechanism output according to the invention is realized in the following specific manner:

The first step: generate the temporal semantic vector

c_t = Σ_i α_ti·h_i

α_ti = exp(e_ti) / Σ_k exp(e_tk)

s_t = tanh(W[s_{t-1}, y_{t-1}])

e_ti = s_t·W_a·h_i

The second step: pass the hidden-layer information and make the prediction

s_t = f(s_{t-1}, y_{t-1}, c_t)

p(y_t | y_1, …, y_{t-1}, x) = g(y_{t-1}, s_t, c_t)

Taking the word-vector tensors Q and K of the source language and the target language as the initial quantities of the calculation, the outputs of a plurality of attention functions are computed; because of the parameter matrices, each output carries its own error, so a regularization constraint operation is applied to the outputs to obtain the final output, which is then used in the subsequent calculation. The process is as follows:

Step 1: let the hidden-layer output vectors be K_i, and perform the dot-product operation Q·K^T to obtain S_i.

Step 2: perform softmax normalization to obtain the alignment weights A_i, with the calculation formula as follows;

A_i = exp(S_i) / Σ_j exp(S_j)

Step 3: multiply A_i by V_i to obtain Attention(Q, K, V), with the calculation formula

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V

Step 4: the attention function is repeatedly calculated a plurality of times, obtaining output1, output2, …, output(i);

Step 5: from the obtained output1, output2, …, output(i), compute the final output according to the calculation formula

(optimal-value formula given in the patent drawing, computed from the two-norms T(i) of the outputs and their cosine similarity)

This Output is the final output we obtain;

Step 6: the final output participates in the subsequent operations.

The invention has the following advantages. The invention relates to an optimized alignment method based on the transformer attention mechanism output, applied to the attention-based transformer framework model. The method comprises the following steps: the inputs to the attention mechanism function are the word-vector tensors Q and K of the source and target languages; within the translation framework each attention function produces one alignment tensor as output, so several attention functions yield several alignment tensors, and because random parameters vary during computation each output is different. For the plurality of alignment tensor outputs, their respective two-norm values Ti (i = 0, 1, 2, …) are first obtained, and a cosine similarity formula is then used to calculate the optimal value

(optimal-value formula given in the patent drawing, computed from the two-norms T(i) of the outputs and their cosine similarity)

The obtained optimal value is used as the final alignment tensor output and participates in the whole translation process. By introducing a regularized computation, the most appropriate of the multiple outputs can be determined, which effectively improves the alignment quality of the attention mechanism function and raises the translation quality and score. The algorithm can be applied to any model containing an attention mechanism, and the model framework does not need to be modified.

Detailed Description

The present invention will be described in detail below, and the technical solutions in the embodiments of the present invention will be described clearly and completely. All other embodiments which can be derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

Through improvement, the invention provides an optimized alignment method based on the output of the transformer attention mechanism, applied to the attention-based transformer model, characterized in that the method comprises the following steps: computing the word vectors of the input sentences in the source language and the target language during translation, and simultaneously obtaining a plurality of attention function outputs output(i);

computing the two-norm of each output tensor to obtain T(i), and regularizing the outputs output(i) according to the formula; the resulting output is taken as the final output.

Transformer framework introduction:

The Encoder consists of 6 identical layers, each containing two sub-layers: the first sub-layer is a multi-head attention layer, followed by a simple fully connected (feed-forward) layer. Each sub-layer is wrapped with a residual connection and layer normalization.

The Decoder also consists of 6 identical layers, but each layer differs from the encoder layer in that it comprises three sub-layers: a self-attention layer, then an encoder-decoder attention layer, and finally a fully connected layer. The first two sub-layers are both based on multi-head attention. One particular point is masking, which prevents future output words from being used during training.
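A structural sketch of these layers, assuming the standard residual-plus-layer-normalization wrapping, is given below; self_attn, cross_attn and feed_forward are placeholders for the usual sub-layers rather than the patent's own implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize over the feature dimension
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attn, feed_forward):
    # sub-layer 1: multi-head self-attention, wrapped in residual + layer norm
    x = layer_norm(x + self_attn(x, x, x))
    # sub-layer 2: fully connected (feed-forward) layer, wrapped the same way
    return layer_norm(x + feed_forward(x))

def decoder_layer(x, memory, self_attn, cross_attn, feed_forward):
    # sub-layer 1: masked self-attention over the outputs generated so far
    x = layer_norm(x + self_attn(x, x, x))
    # sub-layer 2: encoder-decoder attention over the encoder memory
    x = layer_norm(x + cross_attn(x, memory, memory))
    # sub-layer 3: fully connected layer
    return layer_norm(x + feed_forward(x))
```

Stacking six such encoder layers and six such decoder layers reproduces the framework described above.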

Attention model:

The encoder-decoder model, although classical, is also quite limited. A major limitation is that the link between encoding and decoding is a single fixed-length semantic vector C; that is, the encoder compresses the information of the entire sequence into one fixed-length vector. This has two disadvantages: the semantic vector cannot fully represent the information of the whole sequence, and the information carried by earlier inputs is diluted by later inputs. The longer the input sequence, the more severe this phenomenon becomes. As a result, not enough information about the input sequence is available at the start of decoding, which compromises accuracy.

To solve the above problem, the attention model was proposed one year after Seq2Seq appeared. When the model generates its output, it produces an attention range indicating which parts of the input sequence should be focused on when generating the next output, generates that output according to the focused region, and repeats the process. Attention has certain similarities with human behaviour: when a person reads a sentence, they usually focus only on the informative words rather than on all words, that is, the attention weight assigned to each word differs. The attention model increases the training difficulty of the model but improves the quality of the generated text.

The first step: generate the temporal semantic vector

c_t = Σ_i α_ti·h_i

α_ti = exp(e_ti) / Σ_k exp(e_tk)

s_t = tanh(W[s_{t-1}, y_{t-1}])

e_ti = s_t·W_a·h_i

The second step: pass the hidden-layer information and make the prediction

s_t = f(s_{t-1}, y_{t-1}, c_t)

p(y_t | y_1, …, y_{t-1}, x) = g(y_{t-1}, s_t, c_t)
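A minimal numpy sketch of the first-step formulas, assuming s_t is the current decoder state, the rows of H are the encoder hidden states h_i, and W_a is the alignment parameter matrix (all names illustrative), could look like this:

```python
import numpy as np

def temporal_semantic_vector(s_t, H, W_a):
    """Context vector c_t from decoder state s_t and encoder states H (rows h_i):
    e_ti = s_t·W_a·h_i, alpha_ti = exp(e_ti)/sum_k exp(e_tk), c_t = sum_i alpha_ti*h_i."""
    e_t = H @ (W_a.T @ s_t)                     # score e_ti for every encoder state h_i
    e_t = e_t - e_t.max()                       # stabilize the softmax numerically
    alpha_t = np.exp(e_t) / np.exp(e_t).sum()   # alignment weights alpha_ti
    c_t = alpha_t @ H                           # weighted sum of the encoder states
    return c_t, alpha_t
```

The resulting c_t is the temporal semantic vector that, together with the previous state and previous output, drives the second-step prediction.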

The improvement here is a modification in the attention function.

Taking the word-vector tensors Q and K of the source language and the target language as the initial quantities of the calculation, the outputs of a plurality of attention functions are computed; because of the parameter matrices, each output carries its own error, so a regularization constraint operation is applied to the outputs to obtain the final output, which is then used in the subsequent calculation. The process is as follows (a code sketch of these steps is given after step 6):

Step 1: let the hidden-layer output vectors be K_i, and perform the dot-product operation Q·K^T to obtain S_i.

Step 2: perform softmax normalization to obtain the alignment weights A_i, with the calculation formula

A_i = exp(S_i) / Σ_j exp(S_j)

Step 3: multiply A_i by V_i to obtain Attention(Q, K, V), with the calculation formula

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V

Step 4: the attention function is repeatedly calculated a plurality of times, obtaining output1, output2, …, output(i).

Step 5: from the obtained output1, output2, …, output(i), compute the final output according to the calculation formula

(optimal-value formula given in the patent drawing, computed from the two-norms T(i) of the outputs and their cosine similarity)

This Output is the final output we obtain.

Step 6: and finally outputting to participate in subsequent operation.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
