Chinese-Vietnamese pseudo-parallel corpus generation method fusing a monolingual language model

Document No.: 1099032    Publication date: 2020-09-25

Reading note: this technique, 融合单语语言模型的汉越伪平行语料生成方法 (Chinese-Vietnamese pseudo-parallel corpus generation method fusing a monolingual language model), was designed and created by 余正涛, 贾承勋, 赖华, 文永华 and 于志强 on 2020-04-30. Abstract: The invention relates to a Chinese-Vietnamese pseudo-parallel corpus generation method fusing a monolingual language model, and belongs to the technical field of natural language processing. Considering the availability of monolingual data, the invention builds on the back-translation method by fusing a language model trained on large amounts of monolingual data with a neural machine translation model; during back-translation, linguistic properties are injected through the language model, so that a more fluent, higher-quality pseudo-parallel corpus is generated. The generated corpus is added to the original small-scale corpus to train the final translation model. By fusing the language model with the neural machine translation model, the invention produces higher-quality pseudo-parallel corpora and thereby further improves the performance of the Chinese-Vietnamese neural machine translation system.

1. A method for generating a Chinese-Vietnamese pseudo-parallel corpus fusing a monolingual language model, characterized by comprising the following specific steps:

Step1, generating the pseudo-parallel corpus: generating pseudo-parallel data in both directions by forward translation and back-translation;

Step2, monolingual language model fusion: during generation of the pseudo-parallel corpus, a target-language language model trained on monolingual data is fused into the neural machine translation model, so that linguistic properties are injected into the generated pseudo-parallel corpus through the language model;

Step3, pseudo-parallel data screening: the generated pseudo-parallel data are filtered at the sentence-pair level using a method based on language-model perplexity;

Step4, model training and translation: the final Chinese-Vietnamese neural machine translation model is trained on the filtered pseudo-parallel corpus together with the original data; the test-set data are then translated with the trained model and decoded to obtain the model's final BLEU score.

2. The method for generating a Chinese-Vietnamese pseudo-parallel corpus fusing a monolingual language model according to claim 1, wherein:

in Step1, for the generation of the pseudo-parallel corpus, the back-translation method trains a Vietnamese-Chinese neural machine translation model on the Chinese-Vietnamese bilingual corpus and translates Vietnamese monolingual data into Chinese, forming back-translated Chinese-Vietnamese pseudo-parallel data; the forward-translation method trains a Chinese-Vietnamese neural machine translation model on the same bilingual corpus and translates Chinese monolingual data into Vietnamese, forming forward-translated Chinese-Vietnamese pseudo-parallel data.

3. The method for generating a Chinese-Vietnamese pseudo-parallel corpus fusing a monolingual language model according to claim 1, wherein: in Step2, two fusion methods are used for the monolingual language model. The first is fusion based on independently trained language models: the recurrent neural network language model and the neural machine translation model are trained separately, and their output probabilities are combined with a weighted sum. The second is fusion based on joint training: the hidden state of the recurrent neural network language model is concatenated with the hidden state of the neural machine translation decoder for training, with the language model's hidden state supplied as an input at every time step.

4. The method for generating a Chinese-Vietnamese pseudo-parallel corpus fusing a monolingual language model according to claim 1, wherein: in Step3, the generated pseudo-parallel data are sorted and indexed; a Chinese language model and a Vietnamese language model are then used to score the perplexity of each language's side of the pseudo-parallel data; sentences meeting the set threshold are kept together with their sentence indices; the intersection of the retained Chinese and Vietnamese sentence indices is taken; and the corresponding sentence pairs are traversed and retained according to those indices.

Technical Field

The invention relates to a Chinese-Vietnamese pseudo-parallel corpus generation method fusing a monolingual language model, and belongs to the technical field of natural language processing.

Background

Neural Machine Translation (NMT) is an end-to-end machine translation method proposed by Sutskever et al. The more training data available, the better the model performs; but for resource-scarce languages the available bilingual data is very limited, which is the main reason for their poor translation quality.

There are many methods for improving the performance of low-resource neural machine translation systems, and expanding pseudo-parallel data from existing resources is currently one of the more effective ones. Data expansion methods fall into four main categories. The first extracts pseudo-parallel sentence pairs from comparable corpora: the source and target languages are mapped into the same space and candidate parallel pairs are selected by rule; this can extract pseudo-parallel corpora effectively, but sentence features are hard to capture and the extracted pairs are noisy. The second is based on word replacement: designated words in the existing small-scale parallel sentences are replaced by rule to obtain new pseudo-parallel pairs, but the results are poor when word correspondences are one-to-many. The third is based on pivot languages, which Li et al. classify into system-level, corpus-level and phrase-level approaches; it improves translation performance by expanding the generated training data and optimizing word-alignment quality, and suits zero-resource languages, but the generated corpora are of poor quality. The fourth uses monolingual data for Back Translation (BT): a target-to-source translation model is trained on the small-scale training data and target-language monolingual data are translated into the source language, generating pseudo-parallel data.

Chinese-Vietnamese is a typical low-resource language pair: few parallel corpora are available, and generating pseudo-parallel data through data expansion can alleviate this problem. Monolingual data are easy to obtain and plentiful, yet most existing methods do not make full use of them, so this work studies methods for generating pseudo-parallel corpora from monolingual data. Because a language model trained on large amounts of monolingual data learns linguistic properties well, the monolingual language model is fused with the neural machine translation model, so that the target language's characteristics can be injected through the language model while the pseudo-parallel data are generated. Experiments show that, compared with the baseline system, the pseudo-parallel data generated by the proposed method effectively improve the performance of Chinese-Vietnamese neural machine translation.

Disclosure of Invention

The invention provides a Chinese-Vietnamese pseudo-parallel corpus generation method fusing a monolingual language model, aimed at the following problems: pseudo-parallel data currently generated from monolingual data by translation are of limited quality, and most existing methods do not consider how to improve that quality.

The technical solution of the invention is as follows: the Chinese-Vietnamese pseudo-parallel corpus generation method fusing a monolingual language model comprises the following specific steps:

Step1, generating the pseudo-parallel corpus: generating pseudo-parallel data in both directions by forward translation and back-translation;

Step2, monolingual language model fusion: during generation of the pseudo-parallel corpus, a target-language language model trained on monolingual data is fused into the neural machine translation model, so that linguistic properties are injected into the generated pseudo-parallel corpus through the language model;

Step3, pseudo-parallel data screening: the generated pseudo-parallel data are filtered at the sentence-pair level using a method based on language-model perplexity;

Step4, model training and translation: the final Chinese-Vietnamese neural machine translation model is trained on the filtered pseudo-parallel corpus together with the original data; the test-set data are then translated with the trained model and decoded to obtain the model's final BLEU score.
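Step4 ends by decoding the test set and reporting BLEU. As a self-contained illustration of the metric (not the patent's own evaluation code, which would normally come from a standard tool such as sacrebleu), a minimal corpus-level BLEU with clipped n-gram precisions and a brevity penalty can be sketched as:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Corpus-level BLEU (single reference, uniform n-gram weights)."""
    match = [0] * max_n   # clipped n-gram matches against the reference
    total = [0] * max_n   # n-grams proposed by the hypotheses
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngram_counts(h, n), ngram_counts(r, n)
            match[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += sum(hc.values())
    if min(match) == 0:   # any zero precision zeroes the geometric mean
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)
```

A perfect hypothesis scores 100; a hypothesis with no 4-gram overlap scores 0, which is why BLEU is aggregated over the whole corpus rather than per sentence.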

Further, in Step1, for the generation of the pseudo-parallel corpus, the back-translation method trains a Vietnamese-Chinese neural machine translation model on the Chinese-Vietnamese bilingual corpus and translates Vietnamese monolingual data into Chinese, forming back-translated Chinese-Vietnamese pseudo-parallel data; the forward-translation method trains a Chinese-Vietnamese neural machine translation model on the same bilingual corpus and translates Chinese monolingual data into Vietnamese, forming forward-translated Chinese-Vietnamese pseudo-parallel data.
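The two generation directions reduce to a simple pairing rule: the human-written monolingual sentence always sits opposite the synthetic one. A minimal sketch, in which the trained translation models are stood in by arbitrary callables (hypothetical here, not the patent's implementation):

```python
def back_translate(vi_mono, translate_vi2zh):
    """Back-translation: build (synthetic Chinese, real Vietnamese) pairs.

    translate_vi2zh stands in for the Vietnamese-Chinese NMT model trained
    on the small bilingual corpus. The real Vietnamese sentence goes on the
    target side, so the final Chinese-Vietnamese model learns to emit
    clean, human-authored output.
    """
    return [(translate_vi2zh(v), v) for v in vi_mono]

def forward_translate(zh_mono, translate_zh2vi):
    """Forward translation: build (real Chinese, synthetic Vietnamese) pairs."""
    return [(z, translate_zh2vi(z)) for z in zh_mono]
```

Both pseudo-parallel sets are later filtered (Step3) before being mixed with the original bilingual data.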

Further, in Step2, two fusion methods are used for the monolingual language model. The first is fusion based on independently trained language models: the recurrent neural network language model and the neural machine translation model are trained separately, and their output probabilities are combined with a weighted sum. The second is fusion based on joint training: the hidden state of the recurrent neural network language model is concatenated with the hidden state of the neural machine translation decoder for training, with the language model's hidden state supplied as an input at every time step.
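The first method (independent training with a weighted combination of output probabilities, often called shallow fusion in the NMT literature) acts once per decoding step. A minimal sketch, where the two models' per-token log-probability tables and the weight `beta` are illustrative assumptions rather than values from the patent:

```python
import math

def shallow_fusion_step(nmt_logprobs, lm_logprobs, beta=0.3):
    """Score each candidate token as log p_nmt + beta * log p_lm.

    nmt_logprobs / lm_logprobs: dicts mapping token -> log-probability at
    the current decoding step (hypothetical stand-ins for model output).
    beta weights the language model's contribution; tokens the LM has
    never seen get a large penalty.
    """
    return {tok: lp + beta * lm_logprobs.get(tok, -1e9)
            for tok, lp in nmt_logprobs.items()}

def best_token(scores):
    """Greedy pick; in practice these scores would feed a beam search."""
    return max(scores, key=scores.get)
```

With `beta = 0` the translation model decides alone; a positive `beta` lets the monolingual language model pull decoding toward more fluent target-language choices.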

Further, in Step3, the generated pseudo-parallel data are sorted and indexed; a Chinese language model and a Vietnamese language model are then used to score the perplexity of each language's side of the pseudo-parallel data; sentences meeting the set threshold are kept together with their sentence indices; the intersection of the retained Chinese and Vietnamese sentence indices is taken; and the corresponding sentence pairs are traversed and retained according to those indices.
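This index-and-intersect screening can be sketched as follows; the two perplexity scorers and the thresholds are hypothetical stand-ins for the trained Chinese and Vietnamese language models and their tuned cut-offs:

```python
def filter_by_perplexity(pairs, ppl_zh, ppl_vi, max_ppl_zh, max_ppl_vi):
    """Keep sentence pairs whose both sides pass the perplexity thresholds.

    pairs: list of (chinese, vietnamese) pseudo-parallel sentences.
    ppl_zh / ppl_vi: callables returning a language-model perplexity for
    one sentence (hypothetical interfaces to the two monolingual LMs).
    """
    # Index the pairs, then score each side independently.
    keep_zh = {i for i, (z, _) in enumerate(pairs) if ppl_zh(z) <= max_ppl_zh}
    keep_vi = {i for i, (_, v) in enumerate(pairs) if ppl_vi(v) <= max_ppl_vi}
    # Intersection: a pair survives only if both sides are fluent enough.
    kept = keep_zh & keep_vi
    return [pairs[i] for i in sorted(kept)]
```

Because each side is scored by its own monolingual model, a pair is discarded as soon as either the synthetic or the real sentence is implausible, which removes noisy translations before the final model is trained.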

The invention has the beneficial effects that:

1. by fusing the target language's language model into the generation of the pseudo-parallel data, the method injects the target language's characteristics into that data, so the pseudo-parallel data are of higher quality and bring a larger improvement to the Chinese-Vietnamese neural machine translation model;

2. after the pseudo-parallel data are generated, they are filtered by language-model perplexity, which reduces noise in the data and also reduces the amount of computation needed to train the model.

Drawings

FIG. 1 is an overall flow chart of the present invention;

FIG. 2 is a structural diagram of the language model fusion method based on independent training;

FIG. 3 is a flow chart of a language model fusion method based on merged training;

FIG. 4 is a flow diagram of data screening based on language-model perplexity.

Detailed Description
