Tree-to-sequence-based Mongolian Chinese machine translation method

Document No.: 1567775 · Publication date: 2020-01-24

Note: This technology, "A tree-to-sequence-based Mongolian-Chinese machine translation method" (Tree-to-sequence-based Mongolian Chinese machine translation method), was designed and created by 苏依拉, 薛媛, 赵旭, 卞乐乐, 范婷婷 and 张振 on 2019-09-27. Abstract: The invention relates to a Mongolian-Chinese machine translation method based on a tree-to-sequence NMT model. It extends the sequence-to-sequence model with a source-side phrase structure and adds a self-attention mechanism to the model. This self-attention mechanism not only enables the decoder to actively query the most relevant information at each step, but also greatly shortens the distance over which information flows; in addition, it allows the decoder to align with the phrases and words of the source sentence while generating translated words. Experimental results on a 1.2-million-sentence Mongolian-Chinese bilingual parallel corpus show that the model of the invention clearly outperforms the sequence-to-sequence attentional NMT model and also surpasses the state-of-the-art tree-to-string SMT system.

1. A Mongolian-Chinese machine translation method based on tree-to-sequence, which adopts an NMT model with an encoder-decoder structure as the overall framework of the translation process. The encoder is composed of a sequence encoder and a tree-based encoder, each of which generates a sentence vector. In the tree-based encoder, a source sentence is composed of a plurality of phrase units and is represented as a binary tree based on a head-driven phrase structure grammar. The tree-based encoder is a tree-transformer structure constructed from transformer units, each node in the binary tree being represented by a transformer unit, so that, following the phrase structure of the source sentence, the sentence is recursively encoded in a bottom-up manner to generate a vector representation composed of the structural information of the sentence. The initial decoder $s_1$ has two sub-units, namely the final sequence encoder unit $h_n$ and the final tree-based encoder unit $h_{root}^{(phrase)}$.

2. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 1, wherein the tree-based encoder is built on top of a standard sequence encoder.

3. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 2, wherein the tree-based encoder uses the left and right child hidden units $h_k^l$ and $h_k^r$ to compute the $k$-th parent hidden unit $h_k^{(phrase)}$ for the $k$-th phrase as

$$h_k^{(phrase)} = f_{tree}(h_k^l, h_k^r)$$

wherein $f_{tree}$ is a non-linear function; when initializing the tree-based encoder units, sequence transformer units are used for the leaves, and a tree-transformer is used to compute the transformer unit of a parent node from its two child transformer units.

4. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 3, wherein, when initializing the tree-based encoder units, the sequence transformer unit representation is adopted with the initial hidden unit $h_0 = 0$; a tree-transformer is used to compute the transformer unit of a parent node from its two child transformer units, formulated as

$$h_k^{(phrase)} = f_{tree}(h_k^l, h_k^r)$$

5. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 3, wherein the initial decoder is

$$s_1 = g_{tree}(h_n, h_{root}^{(phrase)})$$

6. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 1, wherein a self-attention mechanism is added to the transformer, which learns a weight for each word of the input sentence vector. Each word in the self-attention mechanism has 3 different vectors, namely the Q, K and V vectors, each of length 64. They are obtained by multiplying the embedding vector $X$ by three different weight matrices $W^Q$, $W^K$, $W^V$, where the embedding vector $X$ is converted from the input word and each of the three weight matrices is of size 512 × 64.

7. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 6, wherein the transformer in the decoder further adds an encoder-decoder attention mechanism, in which Q comes from the previous output of the decoder, and K and V come from the output of the encoder; in machine translation, the decoding process is a sequential operation, that is, when the $k$-th feature vector is decoded, only the $(k-1)$-th and earlier results are visible.

8. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 7, wherein the NMT model is trained using BlackOut.

9. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 7, wherein, in the decoding process, beam search is used to decode the target sentence of a source sentence $x$, and the sum of the log-likelihoods of the target sentence $y = (y_1, y_2, \ldots, y_m)$ is calculated as the beam score:

$$\log P(y \mid x) = \sum_{j=1}^{m} \log P(y_j \mid y_{<j}, x)$$

Sentence length statistics are used in the beam search; the length of the target sentence is related to the length of the source sentence, and the score of each candidate is redefined as follows:

$$\mathrm{score}(x, y) = L_{x,y} + \log P(y \mid x)$$

$$L_{x,y} = \log P(\mathrm{len}(y) \mid \mathrm{len}(x))$$

wherein $L_{x,y}$ is the conditional-probability penalty on the target sentence length given the source sentence length $\mathrm{len}(x)$, which allows the model to decode the sentence by taking the length of the target sentence into account;

finally, decoding of the source sentences is achieved while the input phrases and words are aligned with the output by means of the GIZA++ tool.

Technical Field

The invention belongs to the technical field of machine translation, and particularly relates to a tree-to-sequence based Mongolian-Chinese machine translation method.

Background

Machine Translation (MT) has been one of the most complex language processing problems, and recent advances in Neural Machine Translation (NMT) have made it possible to translate using a simple end-to-end architecture.

In the encoder-decoder model, the encoder reads the entire sequence of source words to produce a fixed-length vector, and the decoder then generates the target words from that vector. The encoder-decoder model has been extended with an attention mechanism that allows the model to jointly learn soft alignments between the source and target languages. NMT models achieve state-of-the-art results on English-French and English-German translation tasks. However, for more structurally distant language pairs (e.g., Chinese-Mongolian), it remains to be seen whether NMT is competitive with traditional Statistical Machine Translation (SMT) methods.

Table 1 shows a pair of parallel Chinese and Mongolian sentences. In many respects, Chinese and Mongolian are linguistically distant: they have different syntactic structures, and words and phrases are defined in different lexical units. In SMT, it is known that incorporating syntactic components of the source language into the model can improve word alignment and translation accuracy. However, existing NMT models do not allow such alignments to be made.

TABLE 1 A Mongolian-Chinese sentence pair illustrating the word-order problem (provided as an image in the original filing)

Disclosure of Invention

To overcome the above-described drawbacks of the prior art, it is an object of the present invention to provide a tree-to-sequence based Mongolian-Chinese machine translation method that employs an attentional NMT model to exploit syntactic information: following the phrase structure of a source sentence, it recursively encodes the sentence in a bottom-up manner to produce a vector representation of the sentence, and decodes while aligning the input phrases and words with the output.

In order to achieve the above purpose, the invention adopts the following technical scheme:

a Mongolian Chinese machine translation method based on a Tree-to-sequence adopts an NMT model of an encoder-decoder structure as an integral framework of a translation process, wherein the encoder consists of a sequence encoder and a Tree-based encoder, the sequence encoder and the Tree-based encoder respectively generate a sentence vector, wherein in the Tree-based encoder, a source sentence consists of a plurality of phrase units and is represented as a binary Tree based on a head-driven phrase structure grammar, the Tree-based encoder is a Tree-transformer structure constructed by using a transformer, each node in the binary Tree is represented by the transformer unit, so that after the phrase structure of the source sentence, the sentence is recursively encoded in a bottom-to-top manner to generate a vector representation consisting of structure information of the sentence, the sequence encoder obtains a vector representation of a normal sentence, the Tree-based encoder obtains a vector representation of the phrase structure of the sentence, initial decoder s1Having two sub-units, respectively final sequence encoder unit hnAnd a final tree-based encoder unitUsing final sequence encoder unit h when initializing leaf nodesnTree-based encoder units for use in initializing parent nodes

Figure BDA0002218025350000022

The tree-based encoder is built on top of a standard sequence encoder; the structural relationship between the two is shown in FIG. 3.

The tree-based encoder uses the left and right child hidden units $h_k^l$ and $h_k^r$ to compute the $k$-th parent hidden unit $h_k^{(phrase)}$ for the $k$-th phrase as follows:

$$h_k^{(phrase)} = f_{tree}(h_k^l, h_k^r)$$

where $f_{tree}$ is a non-linear function. When initializing the tree-based encoder units, sequence transformer units are used for the leaves, and a tree-transformer is used to compute the transformer unit of a parent node from its two child transformer units.

When initializing the tree-based encoder units, the sequence transformer unit representation is used, with the initial hidden unit $h_0 = 0$; a tree-transformer is used to compute the transformer unit of a parent node from its two child transformer units, formulated as

$$h_k^{(phrase)} = f_{tree}(h_k^l, h_k^r)$$

where $f_{tree}$ denotes a non-linear function.

The initial decoder is

$$s_1 = g_{tree}(h_n, h_{root}^{(phrase)})$$

where $g_{tree}$ has the same functional form as $f_{tree}$. This initialization allows the decoder to capture information from both the sequential data and the phrase structure, and the tree-transformer-initialized decoder is used to translate multiple source languages into a target language. When the parser cannot output a parse tree for a sentence, the tree unit is set to

$$h_{root}^{(phrase)} = \mathbf{0}$$

so that the sentence is encoded using the sequence encoder alone.

A self-attention mechanism is added to the transformer, which learns a weight for each word of the input sentence vector. Each word in the self-attention mechanism has 3 different vectors, namely the Q, K and V vectors, each of length 64. They are obtained by multiplying the embedding vector $X$ by three different weight matrices $W^Q$, $W^K$, $W^V$, where the embedding vector $X$ is converted from the input word and each of the three weight matrices is of size 512 × 64.

The transformer in the decoder also adds an encoder-decoder attention mechanism, in which Q comes from the previous output of the decoder and K and V come from the output of the encoder. In machine translation, the decoding process is a sequential operation, that is, when the $k$-th feature vector is decoded, only the $(k-1)$-th and earlier results are visible.

The invention trains the NMT model by BlackOut.

In the decoding process, beam search is used to decode the target sentence of a source sentence $x$, and the sum of the log-likelihoods of the target sentence $y = (y_1, y_2, \ldots, y_m)$ is calculated as the beam score:

$$\log P(y \mid x) = \sum_{j=1}^{m} \log P(y_j \mid y_{<j}, x)$$

Sentence length statistics are used in the beam search; the length of the target sentence is related to the length of the source sentence, and the score of each candidate is redefined as follows:

$$\mathrm{score}(x, y) = L_{x,y} + \log P(y \mid x)$$

$$L_{x,y} = \log P(\mathrm{len}(y) \mid \mathrm{len}(x))$$

where $L_{x,y}$ is the conditional-probability penalty on the target sentence length given the source sentence length $\mathrm{len}(x)$, which allows the model to decode the sentence by taking the length of the target sentence into account;

Finally, decoding of the source sentences is achieved while the input phrases and words are aligned with the output by means of the GIZA++ tool.

Compared with the prior art, the present invention is based on a tree-to-sequence approach which uses an attentional NMT model to exploit syntactic information: following the phrase structure of the source sentence, it recursively encodes the sentence in a bottom-up manner to generate a vector representation of the sentence, which can improve word alignment and translation accuracy.

Drawings

FIG. 1 is a diagram of a parallel sentence alignment of a pair of Chinese and Mongolian languages.

Fig. 2 is a schematic diagram of an attention-based encoder-decoder model.

FIG. 3 is a schematic diagram of an attention-based Tree-to-sequence NMT model.

FIG. 4 is a schematic diagram of an example sentence translation and attention relationships of the model of the present invention.

FIG. 5 is a schematic diagram of the transformer encoder structure.

FIG. 6 is a schematic diagram of the transformer decoder structure.

Detailed Description

The embodiments of the present invention will be described in detail below with reference to the drawings and examples.

Fig. 1 shows a pair of parallel Chinese and Mongolian sentences. In many respects, Chinese and Mongolian are linguistically distant: they have different syntactic structures, and words and phrases are defined in different lexical units. The present invention aims to incorporate syntactic components of the source language into the model, using a soft-alignment mechanism, to improve word alignment and translation accuracy.

To achieve the above objects, the present invention employs an attentional NMT model to exploit syntactic information, still adopting the encoder-decoder model as the overall framework of the translation flow: following the phrase structure of the source sentence, it recursively encodes the sentence in a bottom-up manner to produce a vector representation of the sentence, and decodes while aligning the input phrases and words with the output.

To describe the tree-to-sequence attentional NMT model of the present invention, we first introduce the encoder-decoder model:

1. Tree-to-sequence modeling

1.1 Encoder-decoder model

NMT is an end-to-end, data-driven approach to machine translation in which the NMT model directly estimates the conditional probability $P(y \mid x)$ given a large number of source and target sentence pairs $(x, y)$. The NMT model consists of an encoder and a decoder and is therefore called an encoder-decoder model. In the encoder-decoder model, a sentence is regarded as a sequence of words. During encoding, the encoder embeds each source word of $x = (x_1, x_2, \ldots, x_n)$ into a $d$-dimensional vector space. Then, given the information about the source sentence supplied by the encoder, the decoder outputs the word sequence $y = (y_1, y_2, \ldots, y_m)$ in the target language. Here, $n$ and $m$ are the lengths of the source and target sentences, and $x_n$ and $y_m$ denote the $n$-th word of the source sentence and the $m$-th word of the target sentence, respectively.

The transformer network structure allows sequential data to be embedded efficiently into a vector space. Given the $i$-th input $x_i$ in the encoder and the previous hidden unit $h_{i-1} \in \mathbb{R}^{d \times 1}$, the $i$-th hidden unit $h_i \in \mathbb{R}^{d \times 1}$ is computed as

$$h_i = f_{en}(x_i, h_{i-1}), \qquad (1)$$

where $\mathbb{R}^{d \times 1}$ denotes the $d \times 1$-dimensional vector space, $f_{en}$ is a non-linear encoding function, and the initial hidden unit is $h_0 = 0$. The encoding function $f_{en}$ is applied recursively until the $n$-th hidden unit $h_n$ is obtained. The transformer encoder-decoder model assumes that $h_n$ is a vector representing the meaning of the input sequence up to the $n$-th word.
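To make equation (1) concrete, the following is a minimal sketch, not the patented implementation, of the recursive encoding loop. It uses NumPy and a hypothetical tanh-based combiner standing in for the unspecified non-linear function $f_{en}$ (a transformer unit in the full model); all parameter names are illustrative assumptions.

```python
import numpy as np

def f_en(x_i, h_prev, W_x, W_h, b):
    # Hypothetical non-linear encoding function standing in for f_en in eq. (1);
    # the text leaves f_en abstract (a transformer unit in the full model).
    return np.tanh(W_x @ x_i + W_h @ h_prev + b)

def encode_sequence(X, W_x, W_h, b):
    """Recursively apply f_en over the source words x_1..x_n (eq. (1))."""
    d = W_h.shape[0]
    h = np.zeros(d)               # initial hidden unit h_0 = 0
    hidden_units = []
    for x_i in X:                 # X: list of d-dimensional word embeddings
        h = f_en(x_i, h, W_x, W_h, b)
        hidden_units.append(h)
    return hidden_units           # hidden_units[-1] is h_n

# Toy usage with random embeddings and parameters
rng = np.random.default_rng(0)
d = 8
X = [rng.normal(size=d) for _ in range(5)]
W_x, W_h, b = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)
hs = encode_sequence(X, W_x, W_h, b)
print(len(hs), hs[-1].shape)      # 5 (8,)
```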

After the entire input sentence has been encoded into the vector space, it is decoded in a similar manner. The initial decoder unit $s_1$ is initialized with the source sentence vector ($s_1 = h_n$). Given the previous target word and the $j$-th hidden unit of the decoder, the conditional probability of generating the $j$-th target word is calculated as

$$P(y_j \mid y_{<j}, x) = g(s_j), \qquad (2)$$

where $s_j$ is the $j$-th hidden unit of the decoder and $g$ is a non-linear function; $s_j$ is computed using another non-linear function $f_{de}$ as

$$s_j = f_{de}(y_{j-1}, s_{j-1}). \qquad (3)$$

the use of a transformer unit allows for better parallelism of the model when translating.

1.2 Attention encoder-decoder model

An NMT model with an attention mechanism can softly align each decoder state with the encoder states. The attention mechanism allows the NMT model to explicitly quantify how much each encoder state contributes to word prediction at each time step.

In the attentional NMT model of Luong et al., at the $j$-th step of the decoding process, the attention score $\alpha_j(i)$ between the $i$-th source hidden unit $h_i$ (encoder hidden unit) and the $j$-th target hidden unit $s_j$ (decoder hidden unit) is computed as follows:

$$\alpha_j(i) = \frac{\exp(h_i \cdot s_j)}{\sum_{k=1}^{n} \exp(h_k \cdot s_j)}. \qquad (5)$$

The $j$-th context vector $d_j$ is the $\alpha_j(i)$-weighted sum of the source hidden units:

$$d_j = \sum_{i=1}^{n} \alpha_j(i)\, h_i. \qquad (6)$$

the model predicts the jth word using the softmax function:

P(yj|y<j,x)=softmax(Ws+bs+Attention(Q,K,V)) (7)

wherein WS∈R|V|×dAnd bs∈R|V|×1Weight matrix and bias vector, respectively, | V | represents the size of the target vocabulary. Since in the encoder, the data first passes through a module called "self Attention" to obtain a weighted feature vector Attention, and then the Attention is sent to the next module of the encoder, namely the feedback neural network module. Here, two layers are fully connected, the first layer is ReLU, and the second layer is a linear activation function, which can be expressed as:

FFN(Attention)=max(0,djW1+b1)W2+b2(8)
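To make equations (5), (6) and (8) concrete, here is a minimal NumPy sketch of the dot-product attention score, the context vector, and the two-layer feed-forward block. It follows the formulas above under toy dimensions; the parameter names and sizes are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def attention_scores(H, s_j):
    """alpha_j(i) = exp(h_i . s_j) / sum_k exp(h_k . s_j)   -- eq. (5)."""
    logits = H @ s_j                       # H: (n, d) encoder hidden units
    logits -= logits.max()                 # numerical stability
    e = np.exp(logits)
    return e / e.sum()

def context_vector(H, alpha):
    """d_j = sum_i alpha_j(i) * h_i        -- eq. (6)."""
    return alpha @ H

def ffn(x, W1, b1, W2, b2):
    """Two-layer feed-forward block: ReLU then linear  -- eq. (8)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
n, d, d_ff = 6, 8, 16
H = rng.normal(size=(n, d))                # encoder hidden units h_1..h_n
s_j = rng.normal(size=d)                   # decoder hidden unit at step j
alpha = attention_scores(H, s_j)
d_j = context_vector(H, alpha)
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
print(alpha.sum(), ffn(d_j, W1, b1, W2, b2).shape)   # ~1.0 (8,)
```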

the structure of the encoder is shown in fig. 5, and the structure of the decoder is shown in fig. 6, both of which include a self-attention mechanism, and the decoder has an additional encoder-decoder attention mechanism, wherein the self-attention mechanism is used for representing the relationship between the current translation and the translated text; and the encoder-decoder attention mechanism is used to represent the relationship between the current translated and encoded feature vectors.

2. Objective function of the NMT model

The objective function for training the NMT model is the sum of the log-likelihoods of the translation pairs in the training data, normalized by the size of the training set:

$$J(\theta) = \frac{1}{|D|} \sum_{(x, y) \in D} \log P(y \mid x), \qquad (9)$$

where $D$ denotes the set of parallel sentence pairs and $|D|$ the size of the training set. The model parameters are learned by the stochastic gradient descent (SGD) method.
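As a minimal illustration of this objective, the sketch below evaluates $J(\theta)$ over a set of parallel pairs, assuming a hypothetical `model.log_prob(x, y)` method that returns the sentence-level log-likelihood $\log P(y \mid x)$; it is only a sketch of the quantity being optimized, not the training loop itself.

```python
def objective(model, D):
    """J(theta) = (1/|D|) * sum_{(x, y) in D} log P(y | x)   -- eq. (9).

    `model.log_prob(x, y)` is a hypothetical method returning log P(y | x)
    under the NMT model; SGD would then maximize this quantity.
    """
    return sum(model.log_prob(x, y) for x, y in D) / len(D)
```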

3. Attentional tree-to-sequence model

3.1 Tree-based encoder + sequence encoder

Existing NMT models treat a sentence as a sequence of words and ignore the syntactic structure inherent in language. The present invention proposes a new tree-based encoder to explicitly consider syntactic structure in the NMT model. The invention focuses on the phrase structure of the sentence and constructs the sentence vector from phrase vectors in a bottom-up manner. Thus, the sentence vector in the tree-based encoder is composed of structural information rather than sequential data. FIG. 3 shows the proposed model, which is referred to as the tree-to-sequence attentional NMT model.

In a head-driven phrase structure grammar, a sentence is composed of a plurality of phrase units and is represented as a binary tree, as shown in FIG. 3. Following the structure of the sentence, a tree-based encoder is constructed on top of the standard sequence encoder. The left and right child hidden units $h_k^l$ and $h_k^r$ are used to compute the $k$-th parent hidden unit $h_k^{(phrase)}$ for the $k$-th phrase as follows:

$$h_k^{(phrase)} = f_{tree}(h_k^l, h_k^r), \qquad (10)$$

where $f_{tree}$ is a non-linear function.

The present invention constructs a tree-based encoder using a transformer, wherein each node in the binary tree is represented by a transformer unit, and sequence transformer units are used in initializing leaf units of the tree-based encoder.

Each non-leaf node in the binary tree is also represented by a transformer unit, and a tree-transformer is used to compute the transformer unit of a parent node from its two child transformer units.

The tree-based encoder proposed by the present invention is a natural extension of the conventional sequence encoder, since the tree-transformer is a generalization of the chain transformer. Because the transformer units of the leaf nodes are computed in the context of the preceding units, the encoder constructs phrase nodes in a context-dependent manner, for example allowing the model to compute different representations for multiple occurrences of the same word in a sentence. This capability contrasts with the original tree-transformer, in which a leaf consists only of a word embedding without any contextual information.
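The following is a minimal sketch of the bottom-up tree encoding over a binarized phrase structure, assuming the sequence encoder output supplies the leaf units. A stand-in tanh combiner plays the role of the tree-transformer unit $f_{tree}$ (the actual unit in the patent is a transformer cell); the tree layout and parameter names are illustrative assumptions.

```python
import numpy as np

def f_tree(h_left, h_right, W_l, W_r, b):
    # Stand-in non-linear combiner for the tree-transformer unit:
    # h_parent = f_tree(h_left, h_right)   -- eq. (10).
    return np.tanh(W_l @ h_left + W_r @ h_right + b)

def encode_tree(node, leaf_units, params):
    """Recursively encode a binary phrase-structure tree bottom-up.

    `node` is either an int (a leaf: index into the sequence-encoder units,
    so leaves are context-dependent) or a pair (left_subtree, right_subtree).
    Returns the hidden unit of `node`; the root call yields the phrase
    vector of the whole sentence.
    """
    W_l, W_r, b = params
    if isinstance(node, int):
        return leaf_units[node]
    left, right = node
    h_l = encode_tree(left, leaf_units, params)
    h_r = encode_tree(right, leaf_units, params)
    return f_tree(h_l, h_r, W_l, W_r, b)

# Toy usage: 4-word sentence with the binary tree ((0 1) (2 3))
rng = np.random.default_rng(2)
d = 8
leaf_units = [rng.normal(size=d) for _ in range(4)]   # from the sequence encoder
params = (rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d))
h_root = encode_tree(((0, 1), (2, 3)), leaf_units, params)
print(h_root.shape)   # (8,)
```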

3.2 Initial decoder setup

The present invention has two different sentence vectors: one from the sequence encoder and the other from the tree-based encoder. As shown in FIG. 3, another tree-transformer unit is provided which takes the final sequence encoder unit $h_n$ and the final tree-based encoder unit $h_{root}^{(phrase)}$ as its two sub-units, and it is set as the initial decoder $s_1$ as follows:

$$s_1 = g_{tree}(h_n, h_{root}^{(phrase)}), \qquad (11)$$

where sequence encoder units are used to initialize the leaf nodes, tree encoder units are used to initialize the parent nodes, and $g_{tree}$ has the same functional form as $f_{tree}$ but with another set of tree-transformer parameters.

This initialization allows the decoder to capture information from both the sequential data and the phrase structure. The tree-transformer is used to initialize the decoder, with which multiple source languages can be translated into a target language. When the parser cannot output a parse tree for a sentence, the tree unit is set to

$$h_{root}^{(phrase)} = \mathbf{0}$$

so that the sentence is encoded using the sequence encoder alone. Thus, the tree-based encoder proposed by the present invention is applicable to any sentence.
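A minimal sketch of the decoder initialization described above: the final sequence unit $h_n$ and the tree root unit are combined by a $g_{tree}$ with its own parameter set, and the tree unit falls back to a zero vector when no parse is available. The tanh combiner and parameter names are illustrative assumptions, not the patent's exact transformer unit.

```python
import numpy as np

def g_tree(h_n, h_root, W_seq, W_phr, b):
    # Same functional form as f_tree but with a separate parameter set,
    # as stated for g_tree in the text above (eq. (11)).
    return np.tanh(W_seq @ h_n + W_phr @ h_root + b)

def init_decoder(h_n, h_root_phrase, params):
    """s_1 = g_tree(h_n, h_root_phrase); fall back to a zero tree unit
    when the parser produced no tree for the sentence."""
    W_seq, W_phr, b = params
    if h_root_phrase is None:          # parser failed: sequence encoder only
        h_root_phrase = np.zeros_like(h_n)
    return g_tree(h_n, h_root_phrase, W_seq, W_phr, b)

rng = np.random.default_rng(3)
d = 8
params = (rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d))
h_n = rng.normal(size=d)
s1_with_tree = init_decoder(h_n, rng.normal(size=d), params)
s1_no_tree = init_decoder(h_n, None, params)
print(s1_with_tree.shape, s1_no_tree.shape)
```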

3.3 Self-attention mechanism in the model

The self-attention mechanism is the core of the transformer. It learns a weight for each word of the input sentence vector; each word in the self-attention mechanism has 3 different vectors, namely the Q (query), K (key) and V (value) vectors, each of length 64. They are obtained by multiplying the embedding vector $X$ by three different weight matrices $W^Q$, $W^K$, $W^V$, each of size 512 × 64. The following steps illustrate the concrete implementation:

the specific Attention is calculated as follows:

(1) converting input words into embedded vectors

(2) Obtaining three vectors of Q, K and V according to the embedded vector

(3) Calculate one score for each vector: score is q · k, and q and k are Q, K components, respectively.

(4) score dot-by-V for each component

(5)

Figure BDA0002218025350000092

The above steps can be generalized to n.
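The steps above correspond to standard scaled dot-product self-attention; the sketch below illustrates them with NumPy, using the dimensions mentioned in the text (embedding size 512, Q/K/V size 64). It is an illustrative sketch under those assumptions, not the patent's exact implementation.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over one sentence.

    X:        (n, 512) embedded input words
    W_Q/K/V:  (512, 64) projection matrices, as described in the text
    Returns:  (n, 64) attention outputs.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # steps (1)-(2)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # step (3), scaled
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over positions
    return weights @ V                             # steps (4)-(5)

rng = np.random.default_rng(4)
n, d_model, d_k = 5, 512, 64
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) * 0.01 for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)      # (5, 64)
```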

3.4 Encoder-decoder attention mechanism in the model

The transformer module in the decoder has an additional encoder-decoder attention mechanism on top of the encoder's modules, in which Q comes from the previous output of the decoder and K and V come from the output of the encoder. In machine translation, the decoding process is a sequential operation: when the $k$-th feature vector is decoded, only the $(k-1)$-th and earlier results are visible, which is why this is called a masked attention mechanism.
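A minimal sketch of the visibility constraint described above, shown here for the decoder's own attention: when computing attention for position $k$, all later positions are blocked so that only the $(k-1)$-th and earlier outputs are visible. This illustrates the masking idea only, not the patent's exact decoder.

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Decoder attention with a causal mask: position k may only attend
    to positions <= k (i.e., already-generated outputs)."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # future positions
    scores = np.where(mask, -1e9, scores)              # block the future
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(5)
n, d_k = 4, 64
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
print(masked_self_attention(Q, K, V).shape)   # (4, 64)
```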

3.5 Sample-based NMT model approximation

The largest computational bottleneck in training the NMT model is the softmax layer described in equation (7), because its computational cost increases linearly with the size of the vocabulary. GPU-based acceleration techniques have proven useful for sequence-based NMT models, but they are not easily applied when processing tree-structured data. To reduce the training cost of the NMT model at the softmax layer, BlackOut, a sampling-based approximation method, is used. BlackOut has proven effective in such language models and allows the model to run fairly quickly even with a vocabulary of one million words.

In each word-prediction step during training, BlackOut uses a weighted softmax function to estimate the conditional probability in equation (2) using the target word and K negative samples. The negative samples are drawn from the unigram distribution raised to the power $\beta \in [0, 1]$; the unigram distribution is estimated from the training data, and $\beta$ is a hyper-parameter. BlackOut is closely related to noise contrastive estimation (NCE) and has been reported to compare favorably with the original softmax and with NCE in RNN language models. After training, the model can be evaluated with the original softmax.
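As a rough illustration of sampling-based softmax approximation, the sketch below draws K negative samples from the unigram distribution raised to the power β and evaluates a softmax restricted to the target word plus the negatives. It is a simplified stand-in for the idea, not the exact BlackOut estimator, and all names are illustrative assumptions.

```python
import numpy as np

def sampled_logprob(logits_fn, target_id, unigram, beta, K, rng):
    """Approximate log P(target | context) with a softmax restricted to the
    target word plus K negatives drawn from unigram**beta.
    Simplified stand-in for BlackOut's weighted-softmax estimator."""
    q = unigram ** beta
    q /= q.sum()
    negatives = rng.choice(len(q), size=K, replace=False, p=q)
    negatives = negatives[negatives != target_id]      # keep the target unique
    ids = np.concatenate(([target_id], negatives))
    scores = logits_fn(ids)                            # output-layer scores
    scores -= scores.max()
    probs = np.exp(scores)
    probs /= probs.sum()
    return np.log(probs[0])                            # position 0 is the target

# Toy usage with a random output layer and unigram distribution
rng = np.random.default_rng(6)
V, d = 1000, 32
W_out = rng.normal(size=(V, d)) * 0.1
hidden = rng.normal(size=d)
unigram = rng.random(V)
unigram /= unigram.sum()
lp = sampled_logprob(lambda ids: W_out[ids] @ hidden,
                     target_id=42, unigram=unigram, beta=0.4, K=50, rng=rng)
print(lp)
```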

4. Experiment

4.1 Training data

The proposed model was applied to a Mongolian-Chinese parallel corpus of 1.2 million sentence pairs. To obtain phrase structures for the source language, i.e., Mongolian, the probabilistic HPSG parser Enju is used; Enju is used only to obtain the binary phrase structure of each sentence, and no HPSG-specific information is used. For the target language, i.e., Chinese, the Chinese segmentation tool jieba is used and the pre-processing steps recommended for word2vec are performed. Translation pairs whose sentence length exceeds 50 or whose source sentence could not be parsed successfully are then filtered out. Two experiments were performed: a small training data set was used to study the effectiveness of the model of the invention, and a large training data set was used to compare it with other systems.

The vocabulary includes the words observed at least N times in the training data. The invention sets N = 2 for the small training data set and N = 5 for the large training data set. Out-of-vocabulary words are mapped to the special token "unk". Another special symbol, "eos", is added for both languages and inserted at the end of every sentence.
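A minimal sketch of the vocabulary construction described above: words occurring at least N times are kept, everything else maps to "unk", and "eos" is appended to every sentence. The function names and the toy corpus are illustrative assumptions only.

```python
from collections import Counter

def build_vocab(sentences, N):
    """Keep words observed at least N times; reserve 'unk' and 'eos'."""
    counts = Counter(w for s in sentences for w in s)
    vocab = {"unk": 0, "eos": 1}
    for word, c in counts.items():
        if c >= N and word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def to_ids(sentence, vocab):
    """Map a sentence to ids, replacing rare words and appending 'eos'."""
    return [vocab.get(w, vocab["unk"]) for w in sentence] + [vocab["eos"]]

corpus = [["我", "爱", "机器", "翻译"], ["机器", "翻译", "很", "有趣"]]
vocab = build_vocab(corpus, N=2)
print(to_ids(corpus[0], vocab))   # rare words become 'unk', 'eos' appended
```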

4.3 Decoding procedure

The present invention uses beam search to decode the target sentence of a source sentence $x$ and calculates the sum of the log-likelihoods of the target sentence $y = (y_1, y_2, \ldots, y_m)$ as the beam score:

$$\log P(y \mid x) = \sum_{j=1}^{m} \log P(y_j \mid y_{<j}, x). \qquad (12)$$

Decoding in the NMT model is a generative process and depends on the target language model given a source sentence. As the target sentence becomes longer, its score becomes smaller, so simple beam search does not work well when decoding long sentences. In preliminary experiments, beam search with the length normalization of Cho et al. was not effective for Mongolian-to-Chinese translation. The method of Pouget-Abadie et al. requires another NMT model to estimate the conditional probability $P(x \mid y)$ and is therefore not suitable for the present invention.

The present invention uses sentence length statistics in the beam search. Assuming that the length of the target sentence correlates with the length of the source sentence, the score for each candidate is redefined as follows:

$$\mathrm{score}(x, y) = L_{x,y} + \log P(y \mid x), \qquad (13)$$

$$L_{x,y} = \log P(\mathrm{len}(y) \mid \mathrm{len}(x)), \qquad (14)$$

where $L_{x,y}$ is the conditional-probability penalty on the target sentence length given the source sentence length $\mathrm{len}(x)$. It allows the model to decode a sentence by considering the length of the target sentence. The conditional probability $P(\mathrm{len}(y) \mid \mathrm{len}(x))$ is pre-computed from statistics collected over the first 1,000,000 pairs of the training data set, and the decoder is allowed to generate up to 100 words.
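A minimal sketch of the redefined candidate score in equations (13)-(14): the log-likelihood of the hypothesis plus the log of the pre-computed length probability $P(\mathrm{len}(y) \mid \mathrm{len}(x))$. The length table here is a toy stand-in for the statistics collected from the training data; the function names are illustrative assumptions.

```python
import math

def length_penalty(len_y, len_x, length_table):
    """L_{x,y} = log P(len(y) | len(x))   -- eq. (14).
    `length_table[len_x]` maps target lengths to probabilities
    (pre-computed from training-data statistics; toy values below)."""
    p = length_table.get(len_x, {}).get(len_y, 1e-10)   # floor unseen lengths
    return math.log(p)

def candidate_score(logprob_y_given_x, len_y, len_x, length_table):
    """score(x, y) = L_{x,y} + log P(y | x)   -- eq. (13)."""
    return length_penalty(len_y, len_x, length_table) + logprob_y_given_x

# Toy usage: source of length 5, two candidates of different lengths;
# the overly long candidate is penalized despite a similar log-likelihood.
length_table = {5: {4: 0.2, 5: 0.5, 6: 0.3}}
print(candidate_score(-7.1, len_y=5, len_x=5, length_table=length_table))
print(candidate_score(-6.9, len_y=9, len_x=5, length_table=length_table))
```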

5. Qualitative analysis

The translation of the test data is illustrated with a model with $d = 512$, together with several attention relationships observed when decoding sentences. In FIG. 4, a Mongolian sentence represented as a binary tree is translated into Chinese, and several attention relationships between Mongolian words or phrases and Chinese words with the highest attention scores $\alpha$ are shown. Additional attention relationships are also illustrated for comparison, and the target words can be seen to be softly aligned with the source words and phrases.

6. Conclusion

In summary, the present invention extends the attentional NMT model by focusing on the phrase structure of the source sentence and building a tree-based encoder on top of the parse tree. The tree-based encoder of the present invention is a natural extension of the sequence encoder model: the tree-transformer leaf units in the encoder work together with the original sequence transformer encoder. Furthermore, the attention mechanism allows the tree-based encoder to align not only the input words but also the input phrases with the output words.

Experimental results on the 1.2-million-sentence Mongolian-Chinese parallel corpus show that the model proposed by the invention obtains the best BLEU score and is superior to the sequential attentional NMT model.

Experimental results on the 1.2-million-sentence Mongolian-Chinese translation task show that the model provided by the invention achieves state-of-the-art translation accuracy.
