Method for improving zero sample translation capability of multi-language neural machine translation model

Document No.: 169371 | Published: 2021-10-29

Reading note: This technology, "A method for improving the zero-sample translation capability of a multilingual neural machine translation model", was designed and created by Zhang Ting, Jin Renren and Xiong Deyi on 2021-07-02. Its main content is as follows: The invention belongs to the technical field of machine translation and discloses a method for improving the zero-sample translation capability of a multilingual neural machine translation model, which comprises the following steps. Designing a loss function: calculating the similarity between the Encoder outputs corresponding to the source language sentence and the target language sentence in each training sample, adding the corresponding similarity to the original Loss, and using the resulting Loss to optimize the parameters of the model, thereby guiding the Encoder to map sentences with similar semantics to nearby positions in the semantic space. Optimizing the model structure: adding an independent linear layer for each target language between the output of the last layer of the Decoder and the linear layer that generates the probability distribution over the target language vocabulary, and additionally adding target language information in the Decoder; and adjusting the sampling method. The invention can significantly improve the zero-sample translation capability of a multilingual neural machine translation model.

1. A method for improving the zero-sample translation capability of a multilingual neural machine translation model, characterized in that the zero-sample translation capability of the multilingual neural machine translation model is improved by constructing a neural machine translation model, designing a loss function for the translation model, optimizing the model structure, and adjusting the sampling method.

2. The method for improving the zero-sample translation capability of a multilingual neural machine translation model according to claim 1, wherein the method comprises the following steps:

step one, designing a loss function: calculating the similarity between the Encoder outputs corresponding to the source language sentence and the target language sentence in each training sample, adding the corresponding similarity to the original Loss, and using the resulting Loss to optimize the parameters of the model, thereby guiding the Encoder to map sentences with similar semantics to nearby positions in the semantic space;

step two, optimizing the model structure: adding an independent linear layer for each target language between the output of the last layer of the Decoder and the linear layer that generates the probability distribution over the target language vocabulary; additionally adding target language information in the Decoder; and adjusting the sampling method.

3. The method for improving the zero-sample translation capability of a multilingual neural machine translation model according to claim 2, wherein in step one, calculating the similarity between the Encoder outputs corresponding to the source language sentence and the target language sentence in the training sample and adding the corresponding similarity to the original Loss comprises:

calculating the similarity by either of two methods and adding it to the Loss accordingly;

(1) the similarity is calculated using the following formula:

shifting the value range of the similarity to the range [-2, 0], taking its negative, and adding the negated value to the original Loss as follows:

(2) similarity calculation is carried out by calculating the difference between the two vectors, and the calculation result is directly added to the original Loss as follows:

where λ represents a hyper-parameter for controlling the ratio between the added Loss and the original Loss.

4. The method for improving zero-sample translation capability of multi-lingual neural machine translation model according to claim 1, wherein the method for improving zero-sample translation capability of multi-lingual neural machine translation model further comprises:

an identifier that can uniquely represent the target language is added to the beginning of the target language sentence, and the added identifier is deleted before the target language sentence is input to the Encoder.

5. The method for improving the zero-sample translation capability of a multilingual neural machine translation model according to claim 1, wherein in step two, the adjusting of the sampling method comprises:

letting d_l be the size of the training set of the l-th language pair, the ratio of this language pair's training set to the original total training set is:

the probability that the training samples of the adjusted ith language pair are sampled is:

when T = 1, the probability that a training sample of each language pair is sampled equals the proportion of its training set in the original total training set; when T → ∞, the probability that a training sample of each language pair is sampled becomes equal, which means the training sets of low-resource languages are oversampled.

6. A program storage medium for receiving user input, the stored computer program causing an electronic device to perform the method of any of claims 1-5 for enhancing zero-sample translation capability of a multi-lingual neural machine translation model, the method comprising the steps of:

step one, designing a loss function: calculating the similarity between the Encoder outputs corresponding to the source language sentence and the target language sentence in each training sample, adding the corresponding similarity to the original Loss, and using the resulting Loss to optimize the parameters of the model, thereby guiding the Encoder to map sentences with similar semantics to nearby positions in the semantic space;

step two, optimizing the model structure: adding an independent linear layer for each target language between the output of the last layer of the Decoder and the linear layer that generates the probability distribution over the target language vocabulary; additionally adding target language information in the Decoder; and adjusting the sampling method.

7. A computer program product stored on a computer readable medium, comprising a computer readable program for providing a user input interface for implementing a method of any one of claims 1-5 for enhancing zero-sample translation capabilities of a multi-lingual neural machine translation model when executed on an electronic device.

8. A zero-sample translation multi-lingual neural machine translation model for implementing the method of any one of claims 1 to 5 for enhancing zero-sample translation capability of the multi-lingual neural machine translation model.

Technical Field

The invention belongs to the technical field of machine translation, and particularly relates to a method for improving zero sample translation capability of a multilingual neural machine translation model.

Background

At present: machine translation studies how to transform text in one language into text in another language with the same semantics by a computer, which is not only faster but also less expensive than manual translation. In view of the huge application prospect of machine translation, the academic and industrial fields put great manpower, material resources and financial resources into the field of machine translation, and machine translation is one of the research hotspots in the academic and industrial fields. After many years of research, the research results of machine translation have been practically applied, and many companies such as google, microsoft, Baidu and the like construct own translation engines and provide online translation services for the outside. According to the message issued by google, the translation service platform under the google flag provides up to 10 hundred million translation services every day, and the vocabulary translated every day exceeds 1000 hundred million. Machine translation has become an indispensable tool in life today of various language communication collisions.

Neural machine translation is the general term for machine translation methods based on neural networks; it studies how to realize semantically equivalent conversion of text between two languages through a neural network. Thanks to the emergence of large-scale parallel corpora and the growth of computing power, neural methods have shown superior translation quality and have gradually replaced statistical machine translation as the mainstream approach. The mainstream neural machine translation model follows the encoder-decoder framework (Sutskever et al., 2014), in which an encoder encodes a source language sentence into high-dimensional vectors and a decoder generates the target language sentence from those vectors. The translation process of a neural machine translation model can be regarded as solving a probability problem. Let X = (x_1, x_2, x_3, …, x_n) denote a source language sentence containing n words and Y = (y_1, y_2, y_3, …, y_m) denote a target language sentence containing m words, where X and Y carry the same semantics; at translation time the model can be viewed as searching for the Y that maximizes the probability P(Y|X) given the sequence X. Since the encoder encodes the source language sentence X into high-dimensional vectors E = (e_1, e_2, e_3, …, e_n), we have:

(e_1, e_2, e_3, …, e_n) = Encoder(x_1, x_2, x_3, …, x_n)

When the target language word y_i is to be generated, the decoder produces y_i from the source sentence encoding E and the previously generated target words (y_1, y_2, y_3, …, y_{i-1}), as shown in the following formula:

y_i = Decoder(e_1, e_2, e_3, …, e_n, y_1, y_2, y_3, …, y_{i-1})
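As a purely illustrative sketch of this factorization (not part of the patent; the encode and decode_step functions and the greedy search below are assumptions standing in for a real model), the following Python fragment generates a translation one target word at a time, each choice conditioned on E and the words generated so far:

```python
# Minimal sketch of autoregressive encoder-decoder translation.
# encode(src_words) -> E, one vector per source word (hypothetical callable).
# decode_step(E, target_prefix) -> dict mapping each candidate word to P(y_i | E, y_1..y_{i-1}).
def translate_greedy(src_words, encode, decode_step, bos="<s>", eos="</s>", max_len=100):
    E = encode(src_words)                 # E = (e_1, ..., e_n)
    target = [bos]
    for _ in range(max_len):
        probs = decode_step(E, target)    # distribution over the next target word
        y_i = max(probs, key=probs.get)   # greedy choice of the most probable word
        if y_i == eos:
            break
        target.append(y_i)
    return target[1:]
```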

at the beginning of birth of a neural machine translation model, an encoder and a decoder are generally composed of a cyclic neural network, however, the input of the cyclic neural network at the current moment depends on the output of the previous moment, so that the parallel processing capacity of a GPU cannot be efficiently utilized in the calculation process of the model, and the training and reasoning speed of the model cannot achieve a satisfactory effect. In addition, the neural machine translation model based on the recurrent neural network cannot effectively model the long-distance dependence in the sentence, which results in that the translation effect of the model on the long sentence is not satisfactory. For the defects of a neural machine translation model based on a cyclic neural network, the prior art 1 provides a Transformer structure, the Transformer abandons the structures of the traditional cyclic neural network and a convolutional neural network, and a good translation effect can be obtained only by means of an attention mechanism and a feedforward neural network. The advantages of the Transformer gradually replace the traditional cyclic neural network and convolutional neural network structure to become the current mainstream neural machine translation model structure.

Multilingual neural machine translation studies how to use a single neural machine translation model to translate multiple language pairs. Depending on the number of languages involved at the source and target ends of the model, multilingual neural machine translation can be classified into many-to-one, one-to-many and many-to-many translation. To support multiple language pairs, early multilingual neural machine translation models typically added a separate encoder for each source language and a separate decoder for each target language; if parameters were to be shared across languages, the encoders of some source languages, the decoders of some target languages, or the attention mechanism could be shared. However, all of these schemes cause the total number of model parameters to grow substantially as the number of languages increases. Unlike the above solutions, prior art 2 implements translation of multiple language pairs using only one encoder and one decoder. With only one decoder, if no modification is made to the training data or the training procedure, the decoder may generate sentences in the wrong target language because it lacks information about which target language should be generated. To inject the target language information and guide the decoder, prior art 3 adds an identifier that uniquely identifies the target language at the beginning of each source language sentence; for example, a Chinese sentence to be translated into English, "multilingual neural machine translation", is modified to "<EN> multilingual neural machine translation".

Zero-sample translation (zero-shot translation) means that a machine translation model, although never trained on parallel corpora of a particular language pair, is nevertheless able to translate that language pair. A neural machine translation model for a single language pair has essentially no zero-shot translation capability, because it never learns training samples involving the relevant languages. Multilingual neural machine translation models, in contrast, are usually trained jointly: even though the training set contains no parallel corpus for a specific language pair, the model still sees training samples involving the source or the target language, and therefore acquires a certain zero-shot translation capability. For example, a multilingual neural machine translation model trained on German → English and English → Chinese parallel corpora can translate German directly into Chinese.

The zero-shot translation phenomenon of multilingual neural machine translation models means that, by means of a pivot language, a many-to-many neural machine translation model trained only on parallel corpora of the N other languages → pivot language and pivot language → N other languages can realize translation between any two of those N languages. This is equivalent to training N × (N-1) single-language-pair neural machine translation models, so it can greatly reduce the number of models actually deployed in a production environment and simplify the deployment workflow. In addition, when the parallel corpora of some language pairs are insufficient, a model with translation capability for those pairs can still be obtained in this way.

Although the zero-shot translation capability of a multilingual neural machine translation model has many advantages, the model never learns samples of the relevant language pair during training, so its zero-shot translation quality is usually very poor: at translation time the model tends to generate words of the pivot language rather than words of the requested target language, and it may therefore end up producing sentences in the pivot language instead of the given target language. For example, with English as the pivot language, if the multilingual neural machine translation model translates French directly into Chinese, the generated sentences are mixed with English words, as shown in Table 1.

TABLE 1

Through the above analysis, the problems and defects of the prior art are as follows: when an existing machine translation model performs zero-shot translation, it may generate words in the wrong target language and its translation accuracy is low, and attempts to remedy this can in turn harm the model's translation quality on the language pairs trained with supervision.

The difficulty in solving the above problems and defects is as follows. The basic building block of a neural machine translation model is the neural network. Although neural-network-based natural language processing models have made breakthrough progress on many tasks in recent years, the decision logic of such models cannot yet be explained directly by existing scientific theory; for our purposes the neural network remains largely a black box. Why does a multilingual neural machine translation model have zero-shot translation capability at all? Why is its zero-shot translation quality always unsatisfactory? How can the zero-shot translation capability be improved without degrading the model's translation quality on the other, supervised language pairs? None of these questions has been fully answered in academia or industry, and further research is needed. Moreover, because of this lack of theoretical guidance, research on zero-shot translation of multilingual neural machine translation models cannot derive an effective improvement method from first principles; one can only rely on accumulated experience and a large number of experiments to search for factors that may influence the model's zero-shot translation capability, summarize the experimental results to obtain empirical rules, and finally design better model structures, training methods and training-data sampling methods according to those rules. However, the training data of a multilingual neural machine translation model usually contains parallel corpora of many language pairs and the training set is usually very large, so training consumes substantial computing resources; without strong computing power as support it is difficult to run the relevant experiments, and the quality of a scheme for improving zero-shot translation capability cannot be verified experimentally.

The significance of solving the above problems and defects is as follows. The zero-shot translation capability of a multilingual neural machine translation model means that, with one pivot language and only parallel corpora between the N other languages and that pivot, translation between any two of the N languages can be realized. In a real production environment a machine translation system needs to support as many language pairs as possible to meet different requirements. If a system capable of translating between any two of N languages is needed but the model has no zero-shot translation capability, or its zero-shot translation quality is unsatisfactory, then parallel corpora and models would be required for all N × (N-1) translation directions. Although a number of large-scale parallel corpora have been published, high-quality large-scale parallel corpora remain a scarce resource and cover only a small fraction of the world's languages; for low-resource languages it is still difficult to obtain enough data to train a model, and a model with good zero-shot translation capability can effectively alleviate this problem. When the zero-shot translation capability of the model is good, the translation requirements of N × (N-1) directions can be met by training only 2N translation directions, which reduces the number of models actually trained and simplifies system deployment. When the parallel corpora between two languages are insufficient, the usual way to translate between them is to choose a pivot language that has more parallel data with both, translate the text of one language into the pivot language, and then translate the pivot-language text into the other language. Compared with this two-stage approach, the zero-shot translation capability of the model allows the whole translation to be completed without ever having learned training samples of the relevant language pair, i.e. translating directly from one language into the other, which both preserves translation speed and avoids the error accumulation of two-stage translation.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a method for improving the zero sample translation capability of a multilingual neural machine translation model.

The invention is realized in such a way that a method for improving the zero sample translation capability of a multi-language neural machine translation model comprises the following steps:

the zero sample translation capability of the multilingual neural machine translation model is improved by constructing the neural machine translation model, designing a loss function of the translation model, optimizing a model structure and adjusting a sampling method.

Further, the method for improving the zero sample translation capability of the multi-language neural machine translation model comprises the following steps:

Step one, designing a loss function: calculate the similarity between the Encoder outputs corresponding to the source language sentence and the target language sentence in each training sample, add the corresponding similarity to the original Loss, and use the resulting Loss to optimize the parameters of the model, thereby guiding the Encoder to map sentences with similar semantics to nearby positions in the semantic space. Multilingual neural machine translation can be decomposed into a domain adaptation problem and a multi-task learning problem, where sentences of the various source languages can be treated as data from different domains and translation into different target languages can be treated as different tasks. Viewing the zero-shot translation phenomenon from the perspective of domain adaptation, it can be understood as the model learning common features from different source languages: even if the model never sees the relevant language pair during training, the Encoder can still output a vector representation carrying the semantic information of the source language sentence, which is one of the reasons the model can perform zero-shot translation at all; in practice, however, the zero-shot translation quality of the model is poor. One reason the zero-shot translation capability suffers is that the original training method causes the semantic encoding learned by the Encoder to carry source language information, i.e. the semantic encoding of a source language sentence output by the Encoder is correlated with the source language, and this information interacts with the target language information contained in the Decoder, so the model translates poorly when handling language pairs it has not seen. Suppose a multilingual neural machine translation model is trained with French → Chinese and Chinese → English parallel corpora. For the Chinese → English direction, the model can produce good translations because it has learned the corresponding parallel corpus; and if the Encoder could encode French sentences and Chinese sentences with the same semantics into similar vector representations, the model would also handle the French → English direction well. It is therefore necessary to improve the original training method so that the Encoder encodes sentences with the same semantics into similar vector representations and the semantic encoding output by the Encoder does not contain source language information.

Step two, optimizing the model structure: adding an independent linear layer for each target language between the output of the last layer of the Decoder and the linear layer for generating probability distribution on the vocabulary of the target language; additionally adding target language information in the Decoder; and adjust the sampling method to alleviate the problem of data imbalance between different languages. The multilingual neural machine translation model always generates wrong target language sentences during zero-sample translation (zero-shot translation), and the invention conjectures the reason for generating the problem mainly from two reasons: 1. in the multilingual neural machine translation model, an Encoder only needs to pay attention to semantic information of a source language sentence, a Decode not only needs to pay attention to the semantic information of the source language sentence, but also needs to pay attention to information such as the semantic information and word sequence information of a target language sentence, and different target languages have obvious difference, so that all target languages share all parameters in the Decode, and the method is not necessarily the optimal method. And 2. the Decoder cannot finally generate a specific target language sentence due to insufficient target language information received by the Decoder. In order to instruct the Decoder to generate a specific target language sentence during the training of the multilingual neural machine translation model, an identifier capable of uniquely identifying the target language is usually added at the beginning of the source language sentence or the beginning of the target language sentence, for example, a chinese sentence which needs to be translated into english is "multilingual neural machine translation" which is modified to "< EN > multilingual neural machine translation", but this method only modifies the original corpus, and in order to further instruct the Decoder to translate, certain modifications need to be made on the structure of the model.

Further, in step one, calculating the similarity between the Encoder outputs corresponding to the source language sentence and the target language sentence in the training sample and adding the corresponding similarity to the original Loss comprises:

calculating the similarity by either of two methods and adding it to the Loss accordingly;

(1) the similarity is calculated using the following formula:

shifting the value range of the similarity to the range [-2, 0], taking its negative, and adding the negated value to the original Loss as follows:

(2) similarity calculation is carried out by calculating the difference between the two vectors, and the calculation result is directly added to the original Loss as follows:

where λ represents a hyper-parameter for controlling the ratio between the added Loss and the original Loss.

Further, the method for improving the zero sample translation capability of the multi-language neural machine translation model further comprises the following steps:

an identifier that can uniquely represent the target language is added to the beginning of the target language sentence, and the added identifier is deleted before the target language sentence is input to the Encoder.

Further, in step two, the method for adjusting sampling includes:

letting d_l be the size of the training set of the l-th language pair, the ratio of this language pair's training set to the original total training set is:

the probability that the training samples of the adjusted ith language pair are sampled is:

when T = 1, the probability that a training sample of each language pair is sampled equals the proportion of its training set in the original total training set; when T → ∞, the probability that a training sample of each language pair is sampled becomes equal, which means the training sets of low-resource languages are oversampled.

Combining all of the above technical schemes, the invention has the following advantages and positive effects: it can significantly improve the zero-shot translation capability of a multilingual neural machine translation model and alleviate the problem of the model generating words in the wrong target language during zero-shot translation, without affecting the model's translation quality on the language pairs trained with supervision.

Drawings

FIG. 1 is a flowchart of a method for improving zero-sample translation capability of a multilingual neural machine translation model according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating an improved multilingual neural machine translation model training method according to an embodiment of the present invention.

Fig. 3 is a model diagram of each target language with an additional linear transformation layer according to an embodiment of the present invention.

Fig. 4 is a structural diagram after target language information is additionally added to the Decoder according to the embodiment of the present invention.

Fig. 5 is a graph comparing the sampling probabilities produced by the adjusted sampling method for language pairs with different training set sizes when T is 1, 5, and 100, respectively, according to the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Aiming at the problems in the prior art, the invention provides a method for improving the zero sample translation capability of a multilingual neural machine translation model, and the invention is described in detail below by combining the attached drawings.

The method for improving the zero sample translation capability of the multi-language neural machine translation model provided by the embodiment of the invention comprises the following steps:

the zero sample translation capability of the multilingual neural machine translation model is improved by constructing the neural machine translation model, designing a loss function of the translation model, optimizing a model structure and adjusting a sampling method.

As shown in fig. 1, the method for improving zero sample translation capability of a multilingual neural machine translation model according to an embodiment of the present invention includes the following steps:

s101, designing a loss function: calculating the similarity between Encoder outputs corresponding to source language sentences and target language sentences in training samples, adding the corresponding similarity into the original Loss, and guiding the Encoder to map sentences with similar semantics to the similar positions in a semantic space by using the parameters of the Loss optimization model;

s102, optimizing a model structure: adding an independent linear layer for each target language between the output of the last layer of the Decoder and the linear layer for generating probability distribution on the vocabulary of the target language; additionally adding target language information in the Decoder; and adjusts the sampling method.

The method provided by the embodiment of the invention for calculating the similarity between the Encoder outputs corresponding to the source language sentence and the target language sentence in a training sample comprises the following steps:

calculating the similarity by either of two methods and adding it to the Loss accordingly;

(1) the similarity is calculated using the following formula:

shifting the value range of the similarity to the range [-2, 0], taking its negative, and adding the negated value to the original Loss as follows:

(2) similarity calculation is carried out by calculating the difference between the two vectors, and the calculation result is directly added to the original Loss as follows:

where λ represents a hyper-parameter for controlling the ratio between the added Loss and the original Loss.

The method for improving the zero sample translation capability of the multi-language neural machine translation model provided by the embodiment of the invention further comprises the following steps:

an identifier that can uniquely represent the target language is added to the beginning of the target language sentence, and the added identifier is deleted before the target language sentence is input to the Encoder.

The sampling adjusting method provided by the embodiment of the invention comprises the following steps:

letting d_l be the size of the training set of the l-th language pair, the ratio of this language pair's training set to the original total training set is:

the probability that the training samples of the adjusted ith language pair are sampled is:

when T = 1, the probability that a training sample of each language pair is sampled equals the proportion of its training set in the original total training set; when T → ∞, the probability that a training sample of each language pair is sampled becomes equal, which means the training sets of low-resource languages are oversampled.

The technical solution of the present invention is further illustrated by the following specific examples.

Example 1:

multilingual neural machine translation can be decomposed into domain adaptation problems and multitask learning problems, where various source language sentences can be treated as data from different domains and translation into different target languages can be treated as different tasks. If the zero-sample translation (zero-shot translation) phenomenon of the model is considered from the perspective of domain adaptation, the invention can be understood that the model learns common features from different source languages, and even if the model does not see relevant language pairs during training, the Encoder can still output vector representation with semantic information of source language sentences, which is one of the reasons why the model can perform zero-sample translation (zero-shot translation), but actually the zero-sample translation (zero-shot translation) effect of the model is not good. One of the reasons why the zero-sample translation (zero-shot translation) capability of the model is influenced is that the original model training method leads the semantic code learned by the Encoder to carry source language information, namely, the semantic code of the source language sentence output by the Encoder is related to the source language, and the information interacts with the target language information contained in the Decoder to cause the model to have poor translation effect when processing unseen language. Assuming that a multilingual neural machine translation model is trained with parallel corpora of French → Chinese and Chinese → English, for the direction of Chinese → English, the model can generate a better translation result because the model learns the corresponding parallel corpora, and if the Encoder can encode the French sentences and the Chinese sentences with the same semantics into similar vector representations, the model can also better process the direction of French → English. Therefore, there is a need to improve the original model training method to allow the Encoder to encode sentences with the same semantics into similar vector representations, and the semantic code output by the Encoder does not contain source language information.

A multilingual neural machine translation model often generates sentences in the wrong target language during zero-shot translation, and the invention conjectures that this problem arises mainly for two reasons. 1. In a multilingual neural machine translation model the Encoder only needs to attend to the semantic information of the source language sentence, whereas the Decoder must attend not only to the semantic information of the source language sentence but also to information such as the semantics and word order of the target language sentence; since different target languages differ markedly, having all target languages share all parameters in the Decoder is not necessarily optimal. 2. The Decoder receives insufficient target language information and therefore fails to generate sentences in the specified target language. To instruct the Decoder to generate a specific target language sentence when training a multilingual neural machine translation model, an identifier that uniquely identifies the target language is usually added at the beginning of the source language sentence or the target language sentence; for example, a Chinese sentence to be translated into English, "multilingual neural machine translation", is modified to "<EN> multilingual neural machine translation". This method, however, only modifies the original corpus; to further guide the Decoder during translation, certain modifications to the model structure are needed.

Designing a loss function

A common training method for a neural machine translation model is to input the source language sentence into the Encoder, feed the Encoder output together with the target language sentence into the Decoder, use the cross entropy between the probability distribution over the target language vocabulary generated by the Decoder and the words of the target language sentence as the Loss, and finally optimize the parameters of the model according to the obtained Loss. Although a good model can be obtained with this training method, its zero-shot translation quality is not ideal.

The invention conjectures that if the Encoder can map sentences with similar semantics in all languages to nearby positions in the semantic space, it can better handle the differences between source languages and the zero-shot translation capability of the model can be improved. The Encoder output can be viewed as a vector representation of the semantics of the source language sentence; therefore, for sentences with similar semantics, the invention aims to make the differences between the vector representations output by the Encoder for those sentences as small as possible. During training, since the source language sentence and the target language sentence in a training sample already have the same semantics, the invention can feed the source language sentence and the target language sentence into the Encoder separately and then minimize the difference between their corresponding Encoder outputs. In the concrete implementation, the method calculates the similarity between the Encoder outputs corresponding to the source language sentence and the target language sentence in the training sample, adds the similarity to the original Loss (as shown in FIG. 2), and then uses this Loss to optimize the parameters of the model so as to guide the Encoder to map sentences with similar semantics to nearby positions in the semantic space.

Two methods are used to calculate the similarity: the first is given by formula 3.1 and the second by formula 3.2. The similarity obtained with the first method lies in the range [-1, 1], and the closer the result is to 1, the more similar the two vectors are; to maximize this similarity, its value range is shifted to [-2, 0], negated, and then added to the original Loss, as shown in formulas 3.3 and 3.5. The second method computes the difference between the two vectors, and the closer the result is to 0, the more similar the two vectors are, so the result of the second method can be added directly to the original Loss, as shown in formulas 3.4 and 3.5. λ in formula 3.5 is a hyper-parameter used to control the ratio between the added Loss and the original Loss. To ensure that the semantic encoding output by the Encoder contains no source language information, in the concrete implementation the source language sentence is not modified; only an identifier that uniquely represents the target language is added at the beginning of the target language sentence, and this identifier is removed before the target language sentence is fed into the Encoder. A sketch of this loss computation is given below.
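The following PyTorch-style sketch illustrates the two similarity terms described above. Since formulas 3.1-3.5 are not reproduced in this text, the mean-pooling of the Encoder outputs into single sentence vectors and the squared-difference form of the second method are assumptions made for illustration, not the patent's exact implementation:

```python
import torch.nn.functional as F

def similarity_loss(enc_src, enc_tgt, method="cosine"):
    """enc_src, enc_tgt: Encoder outputs (seq_len x d_model) for the source and
    target sentence of one training pair; mean-pooled here into single sentence
    vectors, which is an assumption made for this sketch."""
    v_src = enc_src.mean(dim=0)
    v_tgt = enc_tgt.mean(dim=0)
    if method == "cosine":
        sim = F.cosine_similarity(v_src, v_tgt, dim=0)  # value range [-1, 1]
        # shift to [-2, 0] and negate; minimizing this drives the similarity towards 1
        return -(sim - 1.0)
    # second method: difference between the two vectors (squared L2 assumed here)
    return (v_src - v_tgt).pow(2).sum()

def total_loss(ce_loss, enc_src, enc_tgt, lam=0.1, method="cosine"):
    # original cross-entropy Loss plus the weighted similarity term; lam is the
    # hyper-parameter controlling the ratio between the added and the original Loss
    return ce_loss + lam * similarity_loss(enc_src, enc_tgt, method)
```

During training, the source language sentence and the target language sentence of the same sample are each passed through the Encoder to obtain enc_src and enc_tgt, matching the procedure shown in FIG. 2.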

Optimized model structure

The current mainstream multilingual neural machine translation model adopts a fully parameter-shared scheme: all languages share all parameters in the model, and the translation process of the Decoder is guided only by the identifier that uniquely identifies the target language. Different languages, however, differ in word formation, word order, pronunciation and other respects, which makes it challenging for the Decoder to generate sentences that conform to the linguistic norms of the target language. To model the commonalities between languages while also modeling their differences, the Encoder could be shared across all source languages while each target language receives its own Decoder, but with that approach the number of model parameters grows quickly as the number of source and target languages increases.

Instead, the invention adds an independent linear layer for each target language between the output of the last Decoder layer and the linear layer that generates the probability distribution over the target language vocabulary; during training and prediction, each target language sentence passes through its own independent linear layer, so the differences between languages can be modeled without a rapid increase in the total number of model parameters. To address the problem of the multilingual neural machine translation model generating sentences in the wrong target language during zero-shot translation, target language information is additionally added to the Decoder to guide it towards generating the correct words. The optimized model structure is shown in FIGS. 3-4, and a sketch of the added per-language layer is given below.
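The following PyTorch-style sketch shows the idea of the added per-language layer. The class and argument names are illustrative assumptions, and the separate mechanism for feeding target language information into the Decoder (FIG. 4) is not shown:

```python
import torch.nn as nn

class PerLanguageOutput(nn.Module):
    """Output head with one independent linear layer per target language, placed
    between the last Decoder layer and the shared projection onto the vocabulary."""
    def __init__(self, d_model, vocab_size, target_langs):
        super().__init__()
        # one independent linear layer for each target language
        self.lang_proj = nn.ModuleDict(
            {lang: nn.Linear(d_model, d_model) for lang in target_langs})
        # shared linear layer producing the distribution over the target vocabulary
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, decoder_out, tgt_lang):
        h = self.lang_proj[tgt_lang](decoder_out)    # language-specific transformation
        return self.out_proj(h).log_softmax(dim=-1)  # log-probabilities over the vocabulary

# e.g. head = PerLanguageOutput(512, 32000, ["en", "zh", "fr"]); head(dec_out, "zh")
```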

Method for adjusting sampling

The training set of a multilingual neural machine translation model is generally formed by mixing the training sets of all language pairs. However, the training sets of different language pairs differ in size, often by several orders of magnitude between high-resource and low-resource pairs, and simply mixing them all means that during training the model sees samples of the low-resource pairs too rarely, so its translation quality on low-resource languages is unsatisfactory. To alleviate this problem, the invention adjusts the sampling method during training. Let d_l be the size of the training set of the l-th language pair; the proportion of this language pair's training set in the original total training set is:

p_l = d_l / Σ_k d_k    (3.6)

the probability that the training samples of the adjusted ith language pair are sampled is:

t in equation 3.7 is a controllable hyperparameter, and when T is 1, the probability that the training sample of each language pair is sampled is the proportion of its training set to the original total training set. When T ═ infinity, the probability that the training samples for each language pair are sampled is equal, which is equivalent to the present invention oversampling the training set for low resource languages.

The technical effects of the present invention will be further explained below with reference to specific experiments.

To verify the validity of the proposed solution, several many-to-many multilingual machine translation experiments were carried out on the public data set TED-58. Since TED-58 does not contain a test set for zero-shot translation, the zero-shot translation test set of another public data set, OPUS-100, is used to evaluate the zero-shot translation quality of the model. The experimental results show that the scheme adopted by the invention can significantly improve the zero-shot translation capability of a multilingual neural machine translation model and alleviate the problem of the model generating words in the wrong target language during zero-shot translation, without affecting the model's translation quality on the language pairs trained with supervision.

Data pre-processing

The original TED-58 data set is already partitioned into a training set, a validation set and a test set, but there are sentences repeated between the training set and the validation set and between the training set and the test set; these are removed so that the true translation quality of the model can be evaluated. To control the vocabulary size, BPE segmentation is applied to the mixed corpora of all languages.

Experimental setup

Basic experimental setup: the present invention adds an identifier to the beginning of each target language sentence indicating the language to which the sentence belongs, and sets T to 1.5 in equation 3.7.

Experiment A: the invention uses a Transformer to carry out the experiment, on the basis of basic experiment setting, the invention adds an identifier which represents the language to which the sentence belongs at the beginning of each source language sentence, and the source end and the target end of the model use independent word lists.

Experiment B: the invention uses a Transformer to carry out the experiment, on the basis of basic experiment setting, the invention adds an identifier which represents the language to which each source language sentence belongs at the beginning of each source language sentence, and the source end and the target end of the model share a word list and embedding.

Experiment C: the invention uses a Transformer to carry out the experiment, and on the basis of basic experiment setting, the invention adopts the methods of the formulas 3.3 and 3.5 to calculate the Loss, and the source end and the target end of the model share the vocabulary and the embedding.

Experiment D: the invention carries out the experiment by using the model shown in fig. 3, and on the basis of basic experiment setting, the source end and the target end of the model share the vocabulary and the embedding.

Experiment E: the invention uses the model shown in fig. 3 to carry out the experiment, and on the basis of the basic experiment setting, the invention adopts the methods of the formulas 3.2 and 3.4 to calculate the Loss, and the source end and the target end of the model share the vocabulary and the embedding.

Experiment F: the invention uses the model shown in fig. 3 to carry out the experiment, and on the basis of the basic experiment setting, the invention adopts the methods of the formulas 3.3 and 3.5 to calculate the Loss, and the source end and the target end of the model share the vocabulary and the embedding.

Results of the experiment

The results of the experiments, tested in a total of 116 directions (58 languages → English and English → 58 languages), are shown in Table 2. The experimental results show that the optimization scheme adopted by the invention does not affect the translation quality of the model on the language pairs trained with supervision.

TABLE 2 Average BLEU values over the 116 tested directions (58 languages → English and English → 58 languages)

The zero-shot translation quality of the model produced by each set of experiments is shown in Table 3, and the proportion of wrong-language words in the model's translation output is also measured, as shown in Table 4. The experimental results show that the settings of experiments D, E and F significantly improve the zero-shot translation capability of the model and guide it to generate words in the correct target language.

TABLE 3 BLEU values (rounded to two decimal places) from the zero-shot translation tests of the models produced in each set of experiments

TABLE 4 Proportion of wrong-language words generated in the zero-shot translation tests of the models produced in each set of experiments

Language pair A B C D E F
ar-de 0.074 0.091 0.076 0.000 0.000 0.000
ar-fr 0.048 0.072 0.020 0.000 0.000 0.000
ar-nl 0.089 0.076 0.066 0.000 0.000 0.000
ar-ru 0.083 0.179 0.086 0.000 0.000 0.000
ar-zh 0.128 0.195 0.129 0.000 0.000 0.000
de-fr 0.052 0.073 0.023 0.000 0.000 0.000
de-nl 0.086 0.090 0.059 0.000 0.000 0.000
de-ru 0.130 0.208 0.122 0.000 0.000 0.000
de-zh 0.153 0.189 0.145 0.000 0.000 0.000
fr-nl 0.080 0.084 0.058 0.000 0.000 0.000
fr-ru 0.102 0.200 0.093 0.000 0.000 0.000
fr-zh 0.132 0.217 0.137 0.000 0.000 0.000
nl-ru 0.129 0.192 0.119 0.000 0.000 0.000
nl-zh 0.139 0.174 0.132 0.000 0.000 0.000
ru-zh 0.112 0.190 0.126 0.000 0.000 0.000

The present invention uses SentencePiece to perform BPE segmentation on the corpus and limits the vocabulary size to 32k.
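A possible SentencePiece invocation for the BPE segmentation described above is sketched below; the corpus path and model prefix are placeholders, since the patent does not list the exact training options:

```python
import sentencepiece as spm

# Train a BPE model on the mixed corpora of all languages with a 32k vocabulary.
spm.SentencePieceTrainer.train(
    input="mixed_corpus.txt",   # placeholder: concatenated training corpora of all languages
    model_prefix="bpe32k",
    model_type="bpe",
    vocab_size=32000,
)

sp = spm.SentencePieceProcessor(model_file="bpe32k.model")
print(sp.encode("multilingual neural machine translation", out_type=str))
```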

In all experiments, the number of Encoder and Decoder layers in the model is set to 6, d_model is set to 512, d_ff is set to 2048, the number of attention heads is set to 8, and dropout is set to 0.25.

The invention selects Adam as the optimizer for all models; the initial learning rate is set to 0.0005 and the weight decay is set to 0.0001.

When calculating the original Loss, the invention adopts label smoothing, with ε_ls set to 0.1.

The invention adopts a warmup strategy when training all models; the learning rate changes over the course of training according to the following rule:

The number of warmup steps is set to 3000.
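The image containing the learning-rate formula is not reproduced in this text. Purely as an assumption consistent with the stated peak learning rate of 0.0005 and 3000 warmup steps, the sketch below uses a common inverse-square-root warmup schedule; the patent's actual rule may differ:

```python
def learning_rate(step, peak_lr=0.0005, warmup_steps=3000):
    """Assumed inverse-square-root schedule: linear warmup to peak_lr over
    warmup_steps updates, then decay proportional to step ** -0.5."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```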

When the loss function designed by the present invention is used, the present invention sets λ in equation 3.5 to 0.1.

When the sampling scheme of the present invention is employed, the present invention sets T in equation 3.7 to 1.5.

In all experiments, each training batch contains 5000 subwords, and the models are trained for a total of 30 epochs.

The present invention trains all models using a gradient accumulation strategy, with update_freq set to 4.
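The name update_freq matches fairseq's gradient-accumulation option, although the patent does not name the toolkit; the hand-written loop below is only a sketch of what gradient accumulation does (model, optimizer and compute_loss are assumed to be provided by the caller):

```python
def train_with_accumulation(model, optimizer, train_batches, compute_loss, update_freq=4):
    """Accumulate gradients over update_freq batches before each parameter update."""
    optimizer.zero_grad()
    for i, batch in enumerate(train_batches):
        loss = compute_loss(model, batch) / update_freq  # scale so the summed gradients
        loss.backward()                                  # match one large batch
        if (i + 1) % update_freq == 0:
            optimizer.step()        # one parameter update every update_freq batches
            optimizer.zero_grad()
```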

The present invention trains all models with teacher forcing.

At decoding time, the present invention uses beam search with a beam size of 5.

The training of all models is carried out on 4 GPUs.

It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided on a carrier medium such as a disk, CD-or DVD-ROM, programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier, for example. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.

The above description is only for the purpose of illustrating the present invention and the appended claims are not to be construed as limiting the scope of the invention, which is intended to cover all modifications, equivalents and improvements that are within the spirit and scope of the invention as defined by the appended claims.
