Tree-to-sequence-based Mongolian Chinese machine translation method

Document No.: 1567775 · Publication date: 2020-01-24

Note: This technology, "A tree-to-sequence-based Mongolian-Chinese machine translation method" (Tree-to-sequence-based Mongolian Chinese machine translation method), was designed and created by 苏依拉, 薛媛, 赵旭, 卞乐乐, 范婷婷 and 张振 on 2019-09-27. Abstract: The invention relates to a Mongolian-Chinese machine translation method based on a tree-to-sequence NMT model. It extends the sequence-to-sequence model with a source-side phrase structure and adds a self-attention mechanism to the model. This self-attention mechanism not only enables the decoder to actively query the most relevant information at each step, but also greatly shortens the distance over which information flows; in addition, it allows the decoder to align with the phrases and words of the source sentence while generating translated words. Experimental results on a 1.2-million-sentence Mongolian-Chinese bilingual parallel corpus show that the model of the invention clearly outperforms the sequence-to-sequence attentional NMT model and also surpasses the state-of-the-art tree-to-string SMT system.

1. A Mongolian-Chinese machine translation method based on tree-to-sequence, which adopts an NMT model with an encoder-decoder structure as the overall framework of the translation process. The encoder is composed of a sequence encoder and a tree-based encoder, each of which generates a sentence vector. In the tree-based encoder, a source sentence is composed of a plurality of phrase units and is represented as a binary tree based on a head-driven phrase structure grammar. The tree-based encoder is a tree-transformer structure constructed from transformer units, each node in the binary tree being represented by a transformer unit, so that, following the phrase structure of the source sentence, the sentence is recursively encoded in a bottom-up manner to generate a vector representation composed of the structural information of the sentence. The initial decoder $s_1$ has two sub-units, namely the final sequence encoder unit $h_n$ and the final tree-based encoder unit $h_{root}^{(phrase)}$.

2. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 1, wherein the tree-based encoder is built on top of a standard sequence encoder.

3. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 2, wherein the tree-based encoder uses the left and right child hidden units $h_k^l$ and $h_k^r$ to compute the $k$-th parent hidden unit $h_k^{(phrase)}$ for the $k$-th phrase as

$$h_k^{(phrase)} = f_{tree}(h_k^l, h_k^r)$$

wherein $f_{tree}$ is a non-linear function; when initializing the tree-based encoder units, sequence transformer units are used for the leaves, and a tree-transformer is used to compute the transformer unit of a parent node from its two child transformer units.

4. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 3, wherein, when initializing the tree-based encoder units, the sequence transformer unit representation is adopted with the initial hidden unit $h_0 = 0$; a tree-transformer is used to compute the transformer unit of a parent node from its two child transformer units, formulated as

$$h_k^{(phrase)} = f_{tree}(h_k^l, h_k^r)$$

5. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 3, wherein the initial decoder is

$$s_1 = g_{tree}(h_n, h_{root}^{(phrase)})$$

6. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 1, wherein a self-attention mechanism is added to the transformer, which learns a weight for each word of the input sentence vector. Each word in the self-attention mechanism has 3 different vectors, namely the Q, K and V vectors, each of length 64. They are obtained by multiplying the embedding vector $X$ by three different weight matrices $W^Q$, $W^K$, $W^V$, where the embedding vector $X$ is converted from the input word and each of the three weight matrices is of size 512 × 64.

7. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 6, wherein the transformer in the decoder further adds an encoder-decoder attention mechanism, in which Q comes from the previous output of the decoder, and K and V come from the output of the encoder; in machine translation, the decoding process is a sequential operation, that is, when the $k$-th feature vector is decoded, only the $(k-1)$-th and earlier results are visible.

8. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 7, wherein the NMT model is trained using BlackOut.

9. The tree-to-sequence based Mongolian-Chinese machine translation method according to claim 7, wherein, in the decoding process, beam search is used to decode the target sentence of a source sentence $x$, and the sum of the log-likelihoods of the target sentence $y = (y_1, y_2, \ldots, y_m)$ is calculated as the beam score:

$$\log P(y \mid x) = \sum_{j=1}^{m} \log P(y_j \mid y_{<j}, x)$$

Sentence length statistics are used in the beam search; the length of the target sentence is related to the length of the source sentence, and the score of each candidate is redefined as follows:

$$\mathrm{score}(x, y) = L_{x,y} + \log P(y \mid x)$$

$$L_{x,y} = \log P(\mathrm{len}(y) \mid \mathrm{len}(x))$$

wherein $L_{x,y}$ is the conditional-probability penalty on the target sentence length given the source sentence length $\mathrm{len}(x)$, which allows the model to decode the sentence by taking the length of the target sentence into account;

finally, decoding of the source sentences is achieved while the input phrases and words are aligned with the output by means of the GIZA++ tool.

Technical Field

The invention belongs to the technical field of machine translation, and particularly relates to a tree-to-sequence based Mongolian-Chinese machine translation method.

Background

Machine Translation (MT) has been one of the most complex language processing problems, and recent advances in Neural Machine Translation (NMT) have made it possible to translate using a simple end-to-end architecture.

In the encoder-decoder model, the encoder reads the entire sequence of source words to produce a fixed-length vector, and the decoder then generates the target words from that vector. The encoder-decoder model has been extended with an attention mechanism that allows the model to jointly learn soft alignments between the source and target languages. NMT models achieve state-of-the-art results on English-French and English-German translation tasks. However, for more structurally distant language pairs (e.g., Chinese-Mongolian), it remains to be seen whether NMT is competitive with traditional Statistical Machine Translation (SMT) methods.

Table 1 shows a pair of parallel Chinese and Mongolian sentences. In many respects, Chinese and Mongolian are linguistically distant: they have different syntactic structures, and words and phrases are defined in different lexical units. In SMT, it is known that incorporating syntactic components of the source language into the model can improve word alignment and translation accuracy. However, existing NMT models do not allow such alignments to be made.

TABLE 1 A Mongolian-Chinese sentence pair illustrating the word-order problem (provided as an image in the original filing)

Disclosure of Invention

To overcome the above-described drawbacks of the prior art, it is an object of the present invention to provide a tree-to-sequence based Mongolian-Chinese machine translation method that employs an attentional NMT model to exploit syntactic information: following the phrase structure of a source sentence, it recursively encodes the sentence in a bottom-up manner to produce a vector representation of the sentence, and decodes while aligning the input phrases and words with the output.

In order to achieve the above purpose, the invention adopts the following technical scheme:

a Mongolian Chinese machine translation method based on a Tree-to-sequence adopts an NMT model of an encoder-decoder structure as an integral framework of a translation process, wherein the encoder consists of a sequence encoder and a Tree-based encoder, the sequence encoder and the Tree-based encoder respectively generate a sentence vector, wherein in the Tree-based encoder, a source sentence consists of a plurality of phrase units and is represented as a binary Tree based on a head-driven phrase structure grammar, the Tree-based encoder is a Tree-transformer structure constructed by using a transformer, each node in the binary Tree is represented by the transformer unit, so that after the phrase structure of the source sentence, the sentence is recursively encoded in a bottom-to-top manner to generate a vector representation consisting of structure information of the sentence, the sequence encoder obtains a vector representation of a normal sentence, the Tree-based encoder obtains a vector representation of the phrase structure of the sentence, initial decoder s1Having two sub-units, respectively final sequence encoder unit hnAnd a final tree-based encoder unitUsing final sequence encoder unit h when initializing leaf nodesnTree-based encoder units for use in initializing parent nodes

Figure BDA0002218025350000022

The tree-based encoder is built on top of a standard sequence encoder; the structural relationship between the two is shown in FIG. 3.

The tree-based encoder uses the left and right child hidden units $h_k^l$ and $h_k^r$ to compute the $k$-th parent hidden unit $h_k^{(phrase)}$ for the $k$-th phrase as follows:

$$h_k^{(phrase)} = f_{tree}(h_k^l, h_k^r)$$

where $f_{tree}$ is a non-linear function. When initializing the tree-based encoder units, sequence transformer units are used for the leaves, and a tree-transformer is used to compute the transformer unit of a parent node from its two child transformer units.

When initializing the tree-based encoder units, the sequence transformer unit representation is used, with the initial hidden unit $h_0 = 0$; a tree-transformer is used to compute the transformer unit of a parent node from its two child transformer units, formulated as

$$h_k^{(phrase)} = f_{tree}(h_k^l, h_k^r)$$

where $f_{tree}$ denotes a non-linear function.

The initial decoder is

$$s_1 = g_{tree}(h_n, h_{root}^{(phrase)})$$

where $g_{tree}$ has the same functional form as $f_{tree}$. This initialization allows the decoder to capture information from both the sequential data and the phrase structure, and the tree-transformer-initialized decoder is used to translate multiple source languages into a target language. When the parser cannot output a parse tree for a sentence, the tree unit is set to

$$h_{root}^{(phrase)} = \mathbf{0}$$

so that the sentence is encoded using the sequence encoder alone.

A self-attention mechanism is added to the transformer, which learns a weight for each word of the input sentence vector. Each word in the self-attention mechanism has 3 different vectors, namely the Q, K and V vectors, each of length 64. They are obtained by multiplying the embedding vector $X$ by three different weight matrices $W^Q$, $W^K$, $W^V$, where the embedding vector $X$ is converted from the input word and each of the three weight matrices is of size 512 × 64.

The transformer in the decoder also adds an encoder-decoder attention mechanism, in which Q comes from the previous output of the decoder and K and V come from the output of the encoder. In machine translation, the decoding process is a sequential operation, that is, when the $k$-th feature vector is decoded, only the $(k-1)$-th and earlier results are visible.

The invention trains the NMT model by BlackOut.

In the decoding process, beam search is used to decode the target sentence of a source sentence $x$, and the sum of the log-likelihoods of the target sentence $y = (y_1, y_2, \ldots, y_m)$ is calculated as the beam score:

$$\log P(y \mid x) = \sum_{j=1}^{m} \log P(y_j \mid y_{<j}, x)$$

Sentence length statistics are used in the beam search; the length of the target sentence is related to the length of the source sentence, and the score of each candidate is redefined as follows:

$$\mathrm{score}(x, y) = L_{x,y} + \log P(y \mid x)$$

$$L_{x,y} = \log P(\mathrm{len}(y) \mid \mathrm{len}(x))$$

where $L_{x,y}$ is the conditional-probability penalty on the target sentence length given the source sentence length $\mathrm{len}(x)$, which allows the model to decode the sentence by taking the length of the target sentence into account;

Finally, decoding of the source sentences is achieved while the input phrases and words are aligned with the output by means of the GIZA++ tool.

Compared with the prior art, the present invention is based on a tree-to-sequence approach which uses an attentional NMT model to exploit syntactic information: following the phrase structure of the source sentence, it recursively encodes the sentence in a bottom-up manner to generate a vector representation of the sentence, which can improve word alignment and translation accuracy.

Drawings

FIG. 1 is a diagram of a parallel sentence alignment of a pair of Chinese and Mongolian languages.

Fig. 2 is a schematic diagram of an attention-based encoder-decoder model.

FIG. 3 is a schematic diagram of an attention-based Tree-to-sequence NMT model.

FIG. 4 is a schematic diagram of an example sentence translation and attention relationships of the model of the present invention.

FIG. 5 is a schematic diagram of the transformer encoder structure.

FIG. 6 is a schematic diagram of the transformer decoder structure.

Detailed Description

The embodiments of the present invention will be described in detail below with reference to the drawings and examples.

Fig. 1 shows a pair of parallel Chinese and Mongolian sentences. In many respects, Chinese and Mongolian are linguistically distant: they have different syntactic structures, and words and phrases are defined in different lexical units. The present invention aims to incorporate syntactic components of the source language into the model, using a soft-alignment mechanism, to improve word alignment and translation accuracy.

To achieve the above objects, the present invention employs an attentional NMT model to exploit syntactic information, still adopting the encoder-decoder model as the overall framework of the translation flow: following the phrase structure of the source sentence, it recursively encodes the sentence in a bottom-up manner to produce a vector representation of the sentence, and decodes while aligning the input phrases and words with the output.

To describe the tree-to-sequence attentional NMT model of the present invention, we first introduce the encoder-decoder model:

1. Tree-to-sequence modeling

1.1 Encoder-decoder model

NMT is an end-to-end, data-driven approach to machine translation in which the NMT model directly estimates the conditional probability $P(y \mid x)$ given a large number of source and target sentence pairs $(x, y)$. The NMT model consists of an encoder and a decoder and is therefore called an encoder-decoder model. In the encoder-decoder model, a sentence is regarded as a sequence of words. During encoding, the encoder embeds each source word of $x = (x_1, x_2, \ldots, x_n)$ into a $d$-dimensional vector space. Then, given the information about the source sentence supplied by the encoder, the decoder outputs the word sequence $y = (y_1, y_2, \ldots, y_m)$ in the target language. Here, $n$ and $m$ are the lengths of the source and target sentences, and $x_n$ and $y_m$ denote the $n$-th word of the source sentence and the $m$-th word of the target sentence, respectively.

The transformer network structure allows sequential data to be embedded efficiently into a vector space. Given the $i$-th input $x_i$ in the encoder and the previous hidden unit $h_{i-1} \in \mathbb{R}^{d \times 1}$, the $i$-th hidden unit $h_i \in \mathbb{R}^{d \times 1}$ is computed as

$$h_i = f_{en}(x_i, h_{i-1}), \qquad (1)$$

where $\mathbb{R}^{d \times 1}$ denotes the $d \times 1$-dimensional vector space, $f_{en}$ is a non-linear encoding function, and the initial hidden unit is $h_0 = 0$. The encoding function $f_{en}$ is applied recursively until the $n$-th hidden unit $h_n$ is obtained. The transformer encoder-decoder model assumes that $h_n$ is a vector representing the meaning of the input sequence up to the $n$-th word.
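To make equation (1) concrete, the following is a minimal sketch, not the patented implementation, of the recursive encoding loop. It uses NumPy and a hypothetical tanh-based combiner standing in for the unspecified non-linear function $f_{en}$ (a transformer unit in the full model); all parameter names are illustrative assumptions.

```python
import numpy as np

def f_en(x_i, h_prev, W_x, W_h, b):
    # Hypothetical non-linear encoding function standing in for f_en in eq. (1);
    # the text leaves f_en abstract (a transformer unit in the full model).
    return np.tanh(W_x @ x_i + W_h @ h_prev + b)

def encode_sequence(X, W_x, W_h, b):
    """Recursively apply f_en over the source words x_1..x_n (eq. (1))."""
    d = W_h.shape[0]
    h = np.zeros(d)               # initial hidden unit h_0 = 0
    hidden_units = []
    for x_i in X:                 # X: list of d-dimensional word embeddings
        h = f_en(x_i, h, W_x, W_h, b)
        hidden_units.append(h)
    return hidden_units           # hidden_units[-1] is h_n

# Toy usage with random embeddings and parameters
rng = np.random.default_rng(0)
d = 8
X = [rng.normal(size=d) for _ in range(5)]
W_x, W_h, b = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)
hs = encode_sequence(X, W_x, W_h, b)
print(len(hs), hs[-1].shape)      # 5 (8,)
```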

After the entire input sentence has been encoded into the vector space, it is decoded in a similar manner. The initial decoder unit $s_1$ is initialized with the source sentence vector ($s_1 = h_n$). Given the previous target word and the $j$-th hidden unit of the decoder, the conditional probability of generating the $j$-th target word is calculated as

$$P(y_j \mid y_{<j}, x) = g(s_j), \qquad (2)$$

where $s_j$ is the $j$-th hidden unit of the decoder and $g$ is a non-linear function; $s_j$ is computed using another non-linear function $f_{de}$ as

$$s_j = f_{de}(y_{j-1}, s_{j-1}). \qquad (3)$$

the use of a transformer unit allows for better parallelism of the model when translating.

1.2 Attention encoder-decoder model

An NMT model with an attention mechanism can softly align each decoder state with the encoder states. The attention mechanism allows the NMT model to explicitly quantify how much each encoder state contributes to word prediction at each time step.

In the attentional NMT model of Luong et al., at the $j$-th step of the decoding process, the attention score $\alpha_j(i)$ between the $i$-th source hidden unit $h_i$ (encoder hidden unit) and the $j$-th target hidden unit $s_j$ (decoder hidden unit) is computed as follows:

$$\alpha_j(i) = \frac{\exp(h_i \cdot s_j)}{\sum_{k=1}^{n} \exp(h_k \cdot s_j)}. \qquad (5)$$

The $j$-th context vector $d_j$ is the $\alpha_j(i)$-weighted sum of the source hidden units:

$$d_j = \sum_{i=1}^{n} \alpha_j(i)\, h_i. \qquad (6)$$

the model predicts the jth word using the softmax function:

P(yj|y<j,x)=softmax(Ws+bs+Attention(Q,K,V)) (7)

wherein WS∈R|V|×dAnd bs∈R|V|×1Weight matrix and bias vector, respectively, | V | represents the size of the target vocabulary. Since in the encoder, the data first passes through a module called "self Attention" to obtain a weighted feature vector Attention, and then the Attention is sent to the next module of the encoder, namely the feedback neural network module. Here, two layers are fully connected, the first layer is ReLU, and the second layer is a linear activation function, which can be expressed as:

FFN(Attention)=max(0,djW1+b1)W2+b2(8)
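To make equations (5), (6) and (8) concrete, here is a minimal NumPy sketch of the dot-product attention score, the context vector, and the two-layer feed-forward block. It follows the formulas above under toy dimensions; the parameter names and sizes are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def attention_scores(H, s_j):
    """alpha_j(i) = exp(h_i . s_j) / sum_k exp(h_k . s_j)   -- eq. (5)."""
    logits = H @ s_j                       # H: (n, d) encoder hidden units
    logits -= logits.max()                 # numerical stability
    e = np.exp(logits)
    return e / e.sum()

def context_vector(H, alpha):
    """d_j = sum_i alpha_j(i) * h_i        -- eq. (6)."""
    return alpha @ H

def ffn(x, W1, b1, W2, b2):
    """Two-layer feed-forward block: ReLU then linear  -- eq. (8)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
n, d, d_ff = 6, 8, 16
H = rng.normal(size=(n, d))                # encoder hidden units h_1..h_n
s_j = rng.normal(size=d)                   # decoder hidden unit at step j
alpha = attention_scores(H, s_j)
d_j = context_vector(H, alpha)
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
print(alpha.sum(), ffn(d_j, W1, b1, W2, b2).shape)   # ~1.0 (8,)
```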

the structure of the encoder is shown in fig. 5, and the structure of the decoder is shown in fig. 6, both of which include a self-attention mechanism, and the decoder has an additional encoder-decoder attention mechanism, wherein the self-attention mechanism is used for representing the relationship between the current translation and the translated text; and the encoder-decoder attention mechanism is used to represent the relationship between the current translated and encoded feature vectors.

2. Objective function of the NMT model

The objective function for training the NMT model is the sum of the log-likelihoods of the translation pairs in the training data, normalized by the size of the training set:

$$J(\theta) = \frac{1}{|D|} \sum_{(x, y) \in D} \log P(y \mid x), \qquad (9)$$

where $D$ denotes the set of parallel sentence pairs and $|D|$ the size of the training set. The model parameters are learned by the stochastic gradient descent (SGD) method.
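As a minimal illustration of this objective, the sketch below evaluates $J(\theta)$ over a set of parallel pairs, assuming a hypothetical `model.log_prob(x, y)` method that returns the sentence-level log-likelihood $\log P(y \mid x)$; it is only a sketch of the quantity being optimized, not the training loop itself.

```python
def objective(model, D):
    """J(theta) = (1/|D|) * sum_{(x, y) in D} log P(y | x)   -- eq. (9).

    `model.log_prob(x, y)` is a hypothetical method returning log P(y | x)
    under the NMT model; SGD would then maximize this quantity.
    """
    return sum(model.log_prob(x, y) for x, y in D) / len(D)
```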

3. Attentional tree-to-sequence model

3.1 Tree-based encoder + sequence encoder

Existing NMT models treat a sentence as a sequence of words and ignore the syntactic structure inherent in language. The present invention proposes a new tree-based encoder to explicitly consider syntactic structure in the NMT model. The invention focuses on the phrase structure of the sentence and constructs the sentence vector from phrase vectors in a bottom-up manner. Thus, the sentence vector in the tree-based encoder is composed of structural information rather than sequential data. FIG. 3 shows the proposed model, which is referred to as the tree-to-sequence attentional NMT model.

In a head-driven phrase structure grammar, a sentence is composed of a plurality of phrase units and is represented as a binary tree, as shown in FIG. 3. Following the structure of the sentence, a tree-based encoder is constructed on top of the standard sequence encoder. The left and right child hidden units $h_k^l$ and $h_k^r$ are used to compute the $k$-th parent hidden unit $h_k^{(phrase)}$ for the $k$-th phrase as follows:

$$h_k^{(phrase)} = f_{tree}(h_k^l, h_k^r), \qquad (10)$$

where $f_{tree}$ is a non-linear function.

The present invention constructs a tree-based encoder using a transformer, wherein each node in the binary tree is represented by a transformer unit, and sequence transformer units are used in initializing leaf units of the tree-based encoder.

Each non-leaf node in the binary tree is also represented by a transformer unit, and a tree-transformer is used to compute the transformer unit of a parent node from its two child transformer units.

The tree-based encoder proposed by the present invention is a natural extension of the conventional sequence encoder, since the tree-transformer is a generalization of the chain transformer. Because the transformer units of the leaf nodes are computed in the context of the preceding units, the encoder constructs phrase nodes in a context-dependent manner, for example allowing the model to compute different representations for multiple occurrences of the same word in a sentence. This capability contrasts with the original tree-transformer, in which a leaf consists only of a word embedding without any contextual information.
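The following is a minimal sketch of the bottom-up tree encoding over a binarized phrase structure, assuming the sequence encoder output supplies the leaf units. A stand-in tanh combiner plays the role of the tree-transformer unit $f_{tree}$ (the actual unit in the patent is a transformer cell); the tree layout and parameter names are illustrative assumptions.

```python
import numpy as np

def f_tree(h_left, h_right, W_l, W_r, b):
    # Stand-in non-linear combiner for the tree-transformer unit:
    # h_parent = f_tree(h_left, h_right)   -- eq. (10).
    return np.tanh(W_l @ h_left + W_r @ h_right + b)

def encode_tree(node, leaf_units, params):
    """Recursively encode a binary phrase-structure tree bottom-up.

    `node` is either an int (a leaf: index into the sequence-encoder units,
    so leaves are context-dependent) or a pair (left_subtree, right_subtree).
    Returns the hidden unit of `node`; the root call yields the phrase
    vector of the whole sentence.
    """
    W_l, W_r, b = params
    if isinstance(node, int):
        return leaf_units[node]
    left, right = node
    h_l = encode_tree(left, leaf_units, params)
    h_r = encode_tree(right, leaf_units, params)
    return f_tree(h_l, h_r, W_l, W_r, b)

# Toy usage: 4-word sentence with the binary tree ((0 1) (2 3))
rng = np.random.default_rng(2)
d = 8
leaf_units = [rng.normal(size=d) for _ in range(4)]   # from the sequence encoder
params = (rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d))
h_root = encode_tree(((0, 1), (2, 3)), leaf_units, params)
print(h_root.shape)   # (8,)
```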

3.2 Initial decoder setup

The present invention has two different sentence vectors: one from the sequence encoder and the other from the tree-based encoder. As shown in FIG. 3, another tree-transformer unit is provided which takes the final sequence encoder unit $h_n$ and the final tree-based encoder unit $h_{root}^{(phrase)}$ as its two sub-units, and it is set as the initial decoder $s_1$ as follows:

$$s_1 = g_{tree}(h_n, h_{root}^{(phrase)}), \qquad (11)$$

where sequence encoder units are used to initialize the leaf nodes, tree encoder units are used to initialize the parent nodes, and $g_{tree}$ has the same functional form as $f_{tree}$ but with another set of tree-transformer parameters.

This initialization allows the decoder to capture information from both the sequential data and the phrase structure. The tree-transformer is used to initialize the decoder, with which multiple source languages can be translated into a target language. When the parser cannot output a parse tree for a sentence, the tree unit is set to

$$h_{root}^{(phrase)} = \mathbf{0}$$

so that the sentence is encoded using the sequence encoder alone. Thus, the tree-based encoder proposed by the present invention is applicable to any sentence.
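A minimal sketch of the decoder initialization described above: the final sequence unit $h_n$ and the tree root unit are combined by a $g_{tree}$ with its own parameter set, and the tree unit falls back to a zero vector when no parse is available. The tanh combiner and parameter names are illustrative assumptions, not the patent's exact transformer unit.

```python
import numpy as np

def g_tree(h_n, h_root, W_seq, W_phr, b):
    # Same functional form as f_tree but with a separate parameter set,
    # as stated for g_tree in the text above (eq. (11)).
    return np.tanh(W_seq @ h_n + W_phr @ h_root + b)

def init_decoder(h_n, h_root_phrase, params):
    """s_1 = g_tree(h_n, h_root_phrase); fall back to a zero tree unit
    when the parser produced no tree for the sentence."""
    W_seq, W_phr, b = params
    if h_root_phrase is None:          # parser failed: sequence encoder only
        h_root_phrase = np.zeros_like(h_n)
    return g_tree(h_n, h_root_phrase, W_seq, W_phr, b)

rng = np.random.default_rng(3)
d = 8
params = (rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d))
h_n = rng.normal(size=d)
s1_with_tree = init_decoder(h_n, rng.normal(size=d), params)
s1_no_tree = init_decoder(h_n, None, params)
print(s1_with_tree.shape, s1_no_tree.shape)
```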

3.3 Self-attention mechanism in the model

The self-attention mechanism is the core of the transformer. It learns a weight for each word of the input sentence vector; each word in the self-attention mechanism has 3 different vectors, namely the Q (query), K (key) and V (value) vectors, each of length 64. They are obtained by multiplying the embedding vector $X$ by three different weight matrices $W^Q$, $W^K$, $W^V$, each of size 512 × 64. The following steps illustrate the concrete implementation:

the specific Attention is calculated as follows:

(1) converting input words into embedded vectors

(2) Obtaining three vectors of Q, K and V according to the embedded vector

(3) Calculate one score for each vector: score is q · k, and q and k are Q, K components, respectively.

(4) score dot-by-V for each component

(5)

Figure BDA0002218025350000092

The above steps can be generalized to n.
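The steps above correspond to standard scaled dot-product self-attention; the sketch below illustrates them with NumPy, using the dimensions mentioned in the text (embedding size 512, Q/K/V size 64). It is an illustrative sketch under those assumptions, not the patent's exact implementation.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over one sentence.

    X:        (n, 512) embedded input words
    W_Q/K/V:  (512, 64) projection matrices, as described in the text
    Returns:  (n, 64) attention outputs.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # steps (1)-(2)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # step (3), scaled
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over positions
    return weights @ V                             # steps (4)-(5)

rng = np.random.default_rng(4)
n, d_model, d_k = 5, 512, 64
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) * 0.01 for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)      # (5, 64)
```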

3.4 Encoder-decoder attention mechanism in the model

The transformer module in the decoder has an additional encoder-decoder attention mechanism on top of the encoder's modules, in which Q comes from the previous output of the decoder and K and V come from the output of the encoder. In machine translation, the decoding process is a sequential operation: when the $k$-th feature vector is decoded, only the $(k-1)$-th and earlier results are visible, which is why this is called a masked attention mechanism.
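A minimal sketch of the visibility constraint described above, shown here for the decoder's own attention: when computing attention for position $k$, all later positions are blocked so that only the $(k-1)$-th and earlier outputs are visible. This illustrates the masking idea only, not the patent's exact decoder.

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Decoder attention with a causal mask: position k may only attend
    to positions <= k (i.e., already-generated outputs)."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # future positions
    scores = np.where(mask, -1e9, scores)              # block the future
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(5)
n, d_k = 4, 64
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
print(masked_self_attention(Q, K, V).shape)   # (4, 64)
```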

3.5 Sample-based NMT model approximation

The largest computational bottleneck in training the NMT model is the softmax layer described in equation (7), because its computational cost increases linearly with the size of the vocabulary. GPU-based acceleration techniques have proven useful for sequence-based NMT models, but they are not easily applied when processing tree-structured data. To reduce the training cost of the NMT model at the softmax layer, BlackOut, a sampling-based approximation method, is used. BlackOut has proven effective in such language models and allows the model to run fairly quickly even with a vocabulary of one million words.

In each word-prediction step during training, BlackOut uses a weighted softmax function to estimate the conditional probability in equation (2) using the target word and K negative samples. The negative samples are drawn from the unigram distribution raised to the power $\beta \in [0, 1]$; the unigram distribution is estimated from the training data, and $\beta$ is a hyper-parameter. BlackOut is closely related to noise contrastive estimation (NCE) and has been reported to compare favorably with the original softmax and with NCE in RNN language models. After training, the model can be evaluated with the original softmax.
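As a rough illustration of sampling-based softmax approximation, the sketch below draws K negative samples from the unigram distribution raised to the power β and evaluates a softmax restricted to the target word plus the negatives. It is a simplified stand-in for the idea, not the exact BlackOut estimator, and all names are illustrative assumptions.

```python
import numpy as np

def sampled_logprob(logits_fn, target_id, unigram, beta, K, rng):
    """Approximate log P(target | context) with a softmax restricted to the
    target word plus K negatives drawn from unigram**beta.
    Simplified stand-in for BlackOut's weighted-softmax estimator."""
    q = unigram ** beta
    q /= q.sum()
    negatives = rng.choice(len(q), size=K, replace=False, p=q)
    negatives = negatives[negatives != target_id]      # keep the target unique
    ids = np.concatenate(([target_id], negatives))
    scores = logits_fn(ids)                            # output-layer scores
    scores -= scores.max()
    probs = np.exp(scores)
    probs /= probs.sum()
    return np.log(probs[0])                            # position 0 is the target

# Toy usage with a random output layer and unigram distribution
rng = np.random.default_rng(6)
V, d = 1000, 32
W_out = rng.normal(size=(V, d)) * 0.1
hidden = rng.normal(size=d)
unigram = rng.random(V)
unigram /= unigram.sum()
lp = sampled_logprob(lambda ids: W_out[ids] @ hidden,
                     target_id=42, unigram=unigram, beta=0.4, K=50, rng=rng)
print(lp)
```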

4. Experiment

4.1 Training data

The proposed model was applied to a Mongolian-Chinese parallel corpus of 1.2 million sentence pairs. To obtain phrase structures for the source language, i.e., Mongolian, the probabilistic HPSG parser Enju is used; Enju is used only to obtain the binary phrase structure of each sentence, and no HPSG-specific information is used. For the target language, i.e., Chinese, the Chinese segmentation tool jieba is used and the pre-processing steps recommended for word2vec are performed. Translation pairs whose sentence length exceeds 50 or whose source sentence could not be parsed successfully are then filtered out. Two experiments were performed: a small training data set was used to study the effectiveness of the model of the invention, and a large training data set was used to compare it with other systems.

The vocabulary includes the words observed at least N times in the training data. The invention sets N = 2 for the small training data set and N = 5 for the large training data set. Out-of-vocabulary words are mapped to the special token "unk". Another special symbol, "eos", is added for both languages and inserted at the end of every sentence.
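A minimal sketch of the vocabulary construction described above: words occurring at least N times are kept, everything else maps to "unk", and "eos" is appended to every sentence. The function names and the toy corpus are illustrative assumptions only.

```python
from collections import Counter

def build_vocab(sentences, N):
    """Keep words observed at least N times; reserve 'unk' and 'eos'."""
    counts = Counter(w for s in sentences for w in s)
    vocab = {"unk": 0, "eos": 1}
    for word, c in counts.items():
        if c >= N and word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def to_ids(sentence, vocab):
    """Map a sentence to ids, replacing rare words and appending 'eos'."""
    return [vocab.get(w, vocab["unk"]) for w in sentence] + [vocab["eos"]]

corpus = [["我", "爱", "机器", "翻译"], ["机器", "翻译", "很", "有趣"]]
vocab = build_vocab(corpus, N=2)
print(to_ids(corpus[0], vocab))   # rare words become 'unk', 'eos' appended
```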

4.3 Decoding procedure

The present invention uses beam search to decode the target sentence of a source sentence $x$ and calculates the sum of the log-likelihoods of the target sentence $y = (y_1, y_2, \ldots, y_m)$ as the beam score:

$$\log P(y \mid x) = \sum_{j=1}^{m} \log P(y_j \mid y_{<j}, x). \qquad (12)$$

Decoding in the NMT model is a generative process and depends on the target language model given a source sentence. As the target sentence becomes longer, its score becomes smaller, so simple beam search does not work well when decoding long sentences. In preliminary experiments, beam search with the length normalization of Cho et al. was not effective for Mongolian-to-Chinese translation. The method of Pouget-Abadie et al. requires another NMT model to estimate the conditional probability $P(x \mid y)$ and is therefore not suitable for the present invention.

The present invention uses sentence length statistics in the beam search. Assuming that the length of the target sentence correlates with the length of the source sentence, the score for each candidate is redefined as follows:

$$\mathrm{score}(x, y) = L_{x,y} + \log P(y \mid x), \qquad (13)$$

$$L_{x,y} = \log P(\mathrm{len}(y) \mid \mathrm{len}(x)), \qquad (14)$$

where $L_{x,y}$ is the conditional-probability penalty on the target sentence length given the source sentence length $\mathrm{len}(x)$. It allows the model to decode a sentence by considering the length of the target sentence. The conditional probability $P(\mathrm{len}(y) \mid \mathrm{len}(x))$ is pre-computed from statistics collected over the first 1,000,000 pairs of the training data set, and the decoder is allowed to generate up to 100 words.
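A minimal sketch of the redefined candidate score in equations (13)-(14): the log-likelihood of the hypothesis plus the log of the pre-computed length probability $P(\mathrm{len}(y) \mid \mathrm{len}(x))$. The length table here is a toy stand-in for the statistics collected from the training data; the function names are illustrative assumptions.

```python
import math

def length_penalty(len_y, len_x, length_table):
    """L_{x,y} = log P(len(y) | len(x))   -- eq. (14).
    `length_table[len_x]` maps target lengths to probabilities
    (pre-computed from training-data statistics; toy values below)."""
    p = length_table.get(len_x, {}).get(len_y, 1e-10)   # floor unseen lengths
    return math.log(p)

def candidate_score(logprob_y_given_x, len_y, len_x, length_table):
    """score(x, y) = L_{x,y} + log P(y | x)   -- eq. (13)."""
    return length_penalty(len_y, len_x, length_table) + logprob_y_given_x

# Toy usage: source of length 5, two candidates of different lengths;
# the overly long candidate is penalized despite a similar log-likelihood.
length_table = {5: {4: 0.2, 5: 0.5, 6: 0.3}}
print(candidate_score(-7.1, len_y=5, len_x=5, length_table=length_table))
print(candidate_score(-6.9, len_y=9, len_x=5, length_table=length_table))
```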

5. Qualitative analysis

The translation of the test data is illustrated with a model with $d = 512$, together with several attention relationships observed when decoding sentences. In FIG. 4, a Mongolian sentence represented as a binary tree is translated into Chinese, and several attention relationships between Mongolian words or phrases and Chinese words with the highest attention scores $\alpha$ are shown. Additional attention relationships are also illustrated for comparison, and the target words can be seen to be softly aligned with the source words and phrases.

6. Conclusion

In summary, the present invention extends the attentional NMT model by focusing on the phrase structure of the source sentence and building a tree-based encoder on top of the parse tree. The tree-based encoder of the present invention is a natural extension of the sequence encoder model: the tree-transformer leaf units in the encoder work together with the original sequence transformer encoder. Furthermore, the attention mechanism allows the tree-based encoder to align not only the input words but also the input phrases with the output words.

Experimental results on the 1.2-million-sentence Mongolian-Chinese parallel corpus show that the model proposed by the invention obtains the best BLEU score and is superior to the sequential attentional NMT model.

Experimental results on the 1.2-million-sentence Mongolian-Chinese translation task show that the model provided by the invention achieves state-of-the-art translation accuracy.
