Weight distribution method based on multiple attention mechanisms of a transformer

Document No.: 1628480    Publication date: 2020-01-14

Reading note: this technology, "一种基于transformer多种注意力机制的权重分配方法" (A weight distribution method based on multiple transformer attention mechanisms), was designed and created by 闫明明, 陈绪浩, 罗华成, 赵宇 and 段世豪 on 2019-09-27. Its main content is as follows: the invention discloses a weight distribution method based on multiple attention mechanisms of a transformer. The input to the attention mechanism is the word vectors of the target language and the source language, and the output is an alignment tensor. Using multiple attention mechanism functions yields multiple alignment tensors, and because random parameters vary during the calculation, each output is different. All attention mechanism models are put into operation, and the outputs of the various attention mechanisms are combined by a regularization calculation to approach the optimal output. This regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the optimality of each attention model; if one attention model performs particularly well in experiments, its weight function is increased to enlarge its influence on the final output, thereby improving translation quality.

1. A weight distribution method based on multiple attention mechanisms of a transformer, applied to an attention-based transformer model, characterized in that the method comprises the following steps:

Step 1: in the transformer model, select the attention-model outputs that perform better for the given application scenario.

Step 2: initialize the weight sequence δ; on the first calculation the weights are random numbers subject to δ1 + δ2 + ... + δi = 1;

Step 3: normalize each model output and compute the center point (the point closest to all output values); compute fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi as the optimal matching value and take it as the final output, where δ1 + δ2 + ... + δi = 1, δi is the weight parameter we set, and Oi is the output of the i-th attention model;

Step 4: substitute the final output into the subsequent operations and compute the change of the loss function relative to the previous training step; if the loss function decreases, increase the proportion of the δ weights for the outputs close to the center point; if the loss function increases, increase the proportion of the δ weight for the output farthest from the center point; the whole process strictly complies with the rule δ1 + δ2 + ... + δi = 1;

Step 5: iterate the calculation for multiple rounds and finally determine the optimal weight sequence δ.

Technical Field

The invention relates to the field of neural machine translation, in particular to a transformer-based multi-attention-mechanism weight distribution method.

Background

Neural network machine translation is a machine translation method proposed in recent years. Compared with traditional statistical machine translation, neural network machine translation trains a neural network that maps one sequence to another and can produce output sequences of variable length, so it achieves better performance in translation, dialogue, and text summarization. Neural network machine translation is essentially an encoding-decoding system: the encoder encodes the source-language sequence and extracts its information, and the decoder converts that information into another language, namely the target language, thereby completing the translation.

When the model generates an output, it first produces an attention range indicating which parts of the input sequence should be focused on for the next output, generates the next output according to the focused region, and repeats this process. The attention mechanism resembles certain human reading behavior: when reading a sentence, a person usually focuses only on the informative words rather than all words, that is, the attention weight given to each word differs. The attention mechanism increases the training difficulty of the model, but improves the quality of the generated text. In this patent we improve precisely the attention mechanism function.
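As an illustrative aside (not part of the claimed method), the following minimal Python sketch shows how an attention function assigns a different weight to each word, using scaled dot-product attention as one common formulation; the variable names and toy data are assumptions for illustration only.

```python
import numpy as np

def scaled_dot_product_attention(query, keys, values):
    """Illustrative attention step: score every source word, turn the scores
    into weights that sum to 1, and return the weighted sum of the values."""
    d_k = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d_k)      # similarity of the query to each word
    weights = np.exp(scores - scores.max())     # softmax: non-negative weights
    weights = weights / weights.sum()           # ... normalized to sum to 1
    return weights @ values, weights            # context vector and per-word weights

# Toy example: 4 source words represented by 8-dimensional vectors.
rng = np.random.default_rng(0)
keys = values = rng.normal(size=(4, 8))
query = rng.normal(size=(8,))
context, weights = scaled_dot_product_attention(query, keys, values)
print(weights)  # informative words receive larger weights
```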

Since the neural machine translation system was proposed in 2013, neural machine translation has developed rapidly along with the rapid growth of computing power, and models such as the seq2seq model and the Transformer model have been proposed in succession. In 2013, Nal Kalchbrenner and Phil Blunsom [4] proposed a novel end-to-end encoder-decoder structure for machine translation. The model uses a convolutional neural network (CNN) to encode a given piece of source text into a continuous vector, and then uses a recurrent neural network (RNN) as the decoder to convert the state vector into the target language. In 2017, Google released a new machine learning model, the Transformer, which performed far better than existing algorithms in machine translation and other language understanding tasks.

The traditional technology has the following technical problems:

In the alignment process of the attention mechanism function, the existing framework first calculates the similarity of the word vectors of the two input sentences and then performs a series of calculations to obtain an alignment function. Each alignment function produces one output per calculation, and that output is used as the input of the next calculation. Such single-threaded computation easily accumulates errors. We therefore introduce multiple attention mechanisms with weight assignment, i.e. we seek the optimal solution across several calculation processes, so as to achieve the best translation effect.

Disclosure of Invention

Therefore, in order to overcome the above-mentioned disadvantages, the present invention provides a weight assignment method based on multiple attention mechanisms of a transformer; the method is applied to an attention-based transformer framework model. The method comprises the following steps: the input to the attention mechanism is the word vectors of the target language and the source language, and the output is an alignment tensor. Using multiple attention mechanism functions yields multiple alignment tensors, and because random parameters vary during the calculation, each output is different. Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism, and the local attention mechanism, and each has different outputs and characteristics. All attention mechanism models are put into operation, and the outputs of the various attention mechanism models are combined by a regularization calculation to approach the optimal output.

The invention is realized as follows: a weight distribution method based on multiple attention mechanisms of a transformer is constructed, applied to an attention-based transformer model, and characterized in that the method comprises the following steps:

Step 1: in the transformer model, select the attention-model outputs that perform better for the given application scenario.

Step 2: initialize the weight sequence δ; on the first calculation the weights are random numbers subject to δ1 + δ2 + ... + δi = 1;

Step 3: normalize each model output and compute the center point (the point closest to all output values); compute fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi as the optimal matching value and take it as the final output, where δ1 + δ2 + ... + δi = 1, δi is the weight parameter we set, and Oi is the output of the i-th attention model;

Step 4: substitute the final output into the subsequent operations and compute the change of the loss function relative to the previous training step; if the loss function decreases, increase the proportion of the δ weights for the outputs close to the center point; if the loss function increases, increase the proportion of the δ weight for the output farthest from the center point; the whole process strictly complies with the rule δ1 + δ2 + ... + δi = 1;

Step 5: iterate the calculation for multiple rounds and finally determine the optimal weight sequence δ.

The invention has the following advantages: the invention discloses a weight distribution method based on multiple attention mechanisms of a transformer, applied to an attention-based transformer framework model. The method comprises the following steps: the input to the attention mechanism is the word vectors of the target language and the source language, and the output is an alignment tensor. Using multiple attention mechanism functions yields multiple alignment tensors, and because random parameters vary during the calculation, each output is different. Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism, and the local attention mechanism, and each has different outputs and characteristics. All attention mechanism models are put into operation, and their outputs are combined by a regularization calculation to approach the optimal output, applying the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, where δ1 + δ2 + ... + δi = 1, δi is the weight parameter we set, and Oi is the output of the i-th attention model. This regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the optimality of each attention model; if one attention model performs particularly well in experiments, its weight function is increased to enlarge its influence on the final output, thereby improving translation quality.
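A minimal sketch of the combination formula above (illustrative only; it assumes that the outputs of the different attention models share the same shape, which the patent does not state explicitly):

```python
import numpy as np

def combine_attention_outputs(outputs, deltas):
    """fin_out = δ1*O1 + δ2*O2 + ... + δi*Oi, with the weights summing to 1."""
    deltas = np.asarray(deltas, dtype=float)
    assert np.isclose(deltas.sum(), 1.0), "the weight sequence δ must sum to 1"
    stacked = np.stack(outputs)                    # shape: (num_models, ...)
    return np.tensordot(deltas, stacked, axes=1)   # weighted sum over the model axis

# Toy example: three attention-model outputs of identical shape.
O = [np.full((2, 4), k) for k in (1.0, 2.0, 3.0)]
fin_out = combine_attention_outputs(O, [0.5, 0.3, 0.2])  # elementwise 0.5*1 + 0.3*2 + 0.2*3
```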

Detailed Description

The present invention will be described in detail below, and the technical solutions in the embodiments of the present invention will be described clearly and completely. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

Through this improvement, the invention provides a weight distribution method based on multiple attention mechanisms of a transformer. The method is applied to an attention-based transformer framework model.

Transformer framework introduction:

The Encoder consists of 6 identical layers, each containing two sub-layers: the first sub-layer is a multi-head attention layer, followed by a simple fully connected feed-forward layer. Each sub-layer is wrapped with a residual connection and layer normalization.

The Decoder also consists of 6 identical layers, but each layer differs from the encoder in that it contains three sub-layers: a self-attention layer, an encoder-decoder attention layer, and finally a fully connected layer. The first two sub-layers are both based on multi-head attention. One particular point is masking, which prevents future output words from being used during training.
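The structure described above can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation; `self_attention` and `feed_forward` are placeholder callables standing in for the multi-head attention and fully connected sub-layers.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize the features at each position, as applied after every sub-layer.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward):
    # Each sub-layer is wrapped with a residual connection and layer normalization.
    x = layer_norm(x + self_attention(x))  # sub-layer 1: multi-head self-attention
    x = layer_norm(x + feed_forward(x))    # sub-layer 2: fully connected layer
    return x

def causal_mask(length):
    # Decoder masking: position t may not attend to positions after t,
    # so future output words cannot be used during training.
    return np.triu(np.full((length, length), float("-inf")), k=1)

print(causal_mask(4))
```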

Attention model:

The encoder-decoder model, although classical, is also very limited. A major limitation is that the only link between encoding and decoding is a fixed-length semantic vector C; that is, the encoder compresses the information of the entire sequence into a single fixed-length vector. This has two disadvantages: the semantic vector cannot fully represent the information of the whole sequence, and the information carried by the earlier inputs is diluted by the later inputs. The longer the input sequence, the more severe this phenomenon is. As a result, insufficient information about the input sequence is available at the start of decoding, which compromises accuracy.

In order to solve the above problem, an attention model was proposed one year after Seq2Seq appeared. When the model generates an output, it produces an attention range indicating which parts of the input sequence to focus on for the next output, generates the next output according to the focused region, and repeats this process. Attention shares certain similarities with human behavior: when reading a sentence, a person usually focuses only on the informative words rather than all words, that is, the attention weight given to each word differs. The attention model increases the training difficulty, but improves the quality of the generated text.

First, generate the semantic vector at the current time step:

[Formula image BDA0002218636610000041 in the original patent; not reproduced here.]

s_t = tanh(W[s_{t-1}, y_{t-1}])

Second, transfer the hidden-layer information and make the prediction:

[Formula images BDA0002218636610000044 and BDA0002218636610000045 in the original patent; not reproduced here.]
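Since the two formula images are not reproduced, the following is only a hedged sketch of one decoder step under the common attention formulation (a weighted context vector over the encoder hidden states, plus the state update s_t = tanh(W[s_{t-1}, y_{t-1}]) given above); `score_fn` is an assumed alignment-scoring function, not taken from the patent.

```python
import numpy as np

def attention_decoder_step(encoder_states, s_prev, y_prev, W, score_fn):
    """One decoder step of an attention model (illustrative; the exact formulas
    in the patent's figure images are not reproduced here)."""
    # Attention weights over the encoder hidden states.
    scores = np.array([score_fn(s_prev, h) for h in encoder_states])
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    # Semantic (context) vector at this moment: weighted sum of encoder states.
    c_t = alpha @ encoder_states
    # Decoder state update as given in the text: s_t = tanh(W [s_{t-1}; y_{t-1}]).
    s_t = np.tanh(W @ np.concatenate([s_prev, y_prev]))
    return c_t, s_t, alpha
```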

many attention mechanism models have been proposed, such as self-attention mechanism, multi-head attention mechanism, total attention mechanism, local attention mechanism, etc., and each different attention mechanism has different outputs and characteristics.

The improvement here is a modification in the attention function.

All attention mechanism models are put into operation, and the outputs of the various attention mechanisms are combined by a regularization calculation to approach the optimal output, applying the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, where δ1 + δ2 + ... + δi = 1, δi is the weight parameter we set, and Oi is the output of the i-th attention model. This regularization calculation ensures that the obtained values do not deviate too far from the optimal values and preserves the optimality of each attention model. The specific implementation steps are as follows:

Step 1: in the transformer model, select the attention-model outputs that perform better for the given application scenario.

Step 2: initialize the weight sequence δ; on the first calculation the weights are random numbers subject to δ1 + δ2 + ... + δi = 1;

And step 3: each model output is normalized and the center point (the point closest to all values) of each output is calculated, and the calculation formula fin _ out is δ1O12O23O3.......+δiOiAnd calculating the optimal matching value as final output.

Step 4: substitute the final output into the subsequent operations and compute the change of the loss function relative to the previous training step; if the loss function decreases, increase the proportion of the δ weights for the outputs close to the center point; if the loss function increases, increase the proportion of the δ weight for the output farthest from the center point; the whole process strictly complies with δ1 + δ2 + ... + δi = 1.

Step 5: iterate the calculation for multiple rounds and finally determine the optimal weight sequence δ.
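The five steps above can be summarized in the following hedged sketch. The update magnitude `step` and the helpers `run_attention_model` and `train_step` are hypothetical, since the patent fixes only the direction of the adjustment and the constraint that the weights sum to 1.

```python
import numpy as np

def update_weights(deltas, outputs, loss, prev_loss, step=0.05):
    """One iteration of steps 3-4 (illustrative update magnitude). Outputs close
    to the center point gain weight when the loss falls; the farthest output
    gains weight when the loss rises. The weights are renormalized to sum to 1."""
    center = outputs.mean(axis=0)                                  # point closest to all outputs
    dist = np.array([np.linalg.norm(o - center) for o in outputs])
    new = deltas.copy()
    if loss < prev_loss:
        new[np.argmin(dist)] += step   # reward the output nearest the center point
    else:
        new[np.argmax(dist)] += step   # otherwise favor the output farthest from it
    return new / new.sum()             # keep δ1 + δ2 + ... + δi = 1

# Step 2: random initial weights that sum to 1.
rng = np.random.default_rng(0)
deltas = rng.random(3)
deltas /= deltas.sum()

# Step 5 (outline): loop until the weight sequence δ stabilizes.
#   outputs   = np.stack([run_attention_model(m) for m in models])  # hypothetical helper
#   fin_out   = combine_attention_outputs(outputs, deltas)          # see the sketch above
#   loss      = train_step(fin_out)                                 # hypothetical helper
#   deltas    = update_weights(deltas, outputs, loss, prev_loss)
#   prev_loss = loss
```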

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
