Method and device for constructing a Lao-Chinese bilingual corpus with Thai as pivot

Document No.: 1556961  Publication date: 2020-01-21

Reading note: This technology, "Method and device for constructing a Lao-Chinese bilingual corpus with Thai as pivot" (一种以泰语为枢轴的老-汉双语语料库构建方法及装置), was designed and created by 毛存礼, 高旭, 余正涛, 高盛祥, 王振晗 and 聂男 on 2019-09-11. Main content: The invention relates to a method and a device for constructing a Lao-Chinese bilingual corpus with Thai as the pivot, and belongs to the field of natural language processing. The method first performs Thai word segmentation on Chinese-Thai parallel corpus data. It then constructs a Lao-Thai bilingual dictionary and uses it to translate the Thai sentences word by word into Lao sentence sequences, yielding candidate Lao-Thai parallel sentence pairs. Next, it builds a bidirectional-LSTM-based Lao-Thai parallel sentence pair classification model and classifies the candidates to obtain Lao-Thai bilingual parallel sentence pairs. Finally, it matches Lao with Chinese, using Thai as the pivot language, to construct a Lao-Chinese bilingual parallel corpus. The steps are also modularized by function into a device for constructing a Lao-Chinese bilingual parallel corpus with Thai as the pivot language. The invention addresses the scarcity of Lao-Chinese corpus data and has theoretical significance and practical application value for the construction of Lao-Chinese bilingual corpora.

1. A method for constructing a Lao-Chinese bilingual corpus with Thai as a pivot, characterized in that the method comprises the following steps:

Step 1, extracting Thai sentences from existing Chinese-Thai parallel corpus data and performing Thai word segmentation;

Step 2, constructing a Lao-Thai bilingual dictionary, and translating the Thai sentences word by word into Lao sentence sequences using the Lao-Thai bilingual dictionary to obtain candidate Lao-Thai parallel sentence pairs;

Step 3, constructing a Lao-Thai parallel sentence pair classification model based on a bidirectional LSTM, classifying the candidate Lao-Thai parallel sentence pairs, and extracting the mutually translated Lao-Thai sentences to obtain Lao-Thai bilingual parallel sentence pairs;

Step 4, matching the obtained Lao-Thai bilingual parallel sentence pairs with the existing Chinese-Thai parallel corpus, taking Thai as the pivot language, to construct a Lao-Chinese bilingual parallel corpus.

2. The method of claim 1 for constructing a Lao-Chinese bilingual corpus with Thai as a pivot, wherein the specific steps of Step 1 are as follows:

Step 1.1, selecting Thai sentences of 20-50 characters from an existing Chinese-Thai bilingual parallel corpus;

Step 1.2, performing word segmentation on the selected Thai sentences.

3. The method of claim 1 for constructing a Lao-Chinese bilingual corpus with Thai as a pivot, wherein the specific steps of Step 2 are as follows:

Step 2.1, construction of the Lao-Thai bilingual dictionary: English is used as the intermediate language; on the basis of a Lao-English dictionary and a Thai-English dictionary, Lao and Thai words are aligned through their English words to construct the Lao-Thai bilingual dictionary;

Step 2.2, because Lao and Thai are extremely similar, the Thai sentences in the acquired Chinese-Thai bilingual parallel sentence pairs are translated word by word using the Lao-Thai bilingual dictionary; because of polysemy, dictionary translation can generate several Lao sentences with different meanings, giving the candidate Lao-Thai parallel sentence pairs, in which each Thai sentence corresponds to a group of several Lao sentences, not all of which are true translations of the Thai sentence.

4. The method of claim 1 for constructing a Lao-Chinese bilingual corpus with Thai as a pivot, wherein the specific steps of Step 3 are as follows:

Step 3.1, manually constructing a sentence-aligned Lao-Thai parallel corpus;

Step 3.2, because Lao and Thai are highly similar in vocabulary and pronunciation, a bidirectional LSTM is used to represent the constructed Lao-Thai parallel sentence pairs in a shared semantic space; specifically, the bidirectional LSTM produces forward and backward state vectors, which are concatenated to give the sentence vector representation in the shared semantic space, namely:

Figure FDA0002198521760000022

Figure FDA0002198521760000023

Figure FDA0002198521760000024

Figure FDA0002198521760000025

Figure FDA0002198521760000026

wherein [symbol] represents the forward hidden vector of the i-th Thai sentence at state N;

Figure FDA0002198521760000028

Figure FDA00021985217600000210

Figure FDA00021985217600000212

Figure FDA00021985217600000213

Figure FDA00021985217600000216

Figure FDA00021985217600000218

then, the matching information between the two sentence vectors is captured using the vector dot product and the vector difference to obtain the matching vector:

Figure FDA00021985217600000219

Figure FDA00021985217600000220

Figure FDA00021985217600000221

wherein

Figure FDA00021985217600000222

Step 3.3, finally, a fully connected layer of a convolutional neural network, followed by a sigmoid function, computes the probability that the Lao sentence and the Thai sentence are parallel, in order to decide whether the two sentences are translations of each other:

p(y_i = 1 | h_i) = σ(W_3 h_i + c)

wherein p(y_i = 1 | h_i) is the probability, given the obtained vector h_i, that the two sentences are translations of each other; y_i = 1 means that the two sentences are translations of each other; W_3 and c are parameters of the convolutional neural network model; and σ is the activation function;

Step 3.4, the cross-entropy loss given below is used as the loss function; over multiple iterations the parameters of the bidirectional LSTM model and the convolutional neural network model are updated, thereby training the two models, that is, the Lao-Thai parallel sentence pair classification model; the candidate Lao-Thai parallel sentence pairs are then classified by the trained model, and the mutually translated Lao-Thai sentences are extracted to obtain the Lao-Thai parallel sentence pairs;

wherein the loss function is as follows:

Figure FDA0002198521760000031

wherein y_i = 1 or y_i = 0; y_i = 1 indicates that the Lao sentence and the Thai sentence are parallel, and y_i = 0 indicates that they are not; n is the number of positive samples (parallel sentences) used in training the model, and m is the number of negative samples (non-parallel sentences) used in training the model.

5. A device for constructing a Lao-Chinese bilingual corpus with Thai as a pivot, characterized in that the device comprises a data preprocessing module, a dictionary translation module, a Lao-Thai parallel sentence pair extraction module and a Lao-Chinese parallel corpus construction module;

the data preprocessing module is used for extracting Thai sentences from existing Chinese-Thai parallel corpus data and performing Thai word segmentation;

the dictionary translation module is used for constructing a Lao-Thai bilingual dictionary and translating the Thai sentences word by word into Lao sentence sequences using the Lao-Thai bilingual dictionary to obtain candidate Lao-Thai parallel sentence pairs;

the Lao-Thai parallel sentence pair extraction module is used for constructing a bidirectional-LSTM-based Lao-Thai parallel sentence pair classification model, classifying the candidate Lao-Thai parallel sentence pairs, and extracting the mutually translated Lao-Thai sentences to obtain Lao-Thai bilingual parallel sentence pairs;

the Lao-Chinese parallel corpus construction module is used for matching the obtained Lao-Thai bilingual parallel sentence pairs with the existing Chinese-Thai parallel corpus, taking Thai as the pivot language to match Lao with Chinese, and constructing the Lao-Chinese bilingual parallel corpus.

Technical Field

The invention relates to a method and a device for constructing a Lao-Chinese bilingual corpus with Thai as a pivot, and belongs to the technical field of natural language processing.

Background

Corpus construction is a prerequisite for natural language processing research, and a Lao-Chinese bilingual corpus is an important data resource for Chinese-Lao machine translation and cross-language retrieval. Lao is a low-resource language among the Southeast Asian languages; Lao-Chinese bilingual parallel resources are scarce, and obtaining them directly from the Internet is very difficult.

Lao and Thai both belong to the Zhuang-Dai branch of the Zhuang-Dong group of the Sino-Tibetan language family; their basic vocabularies are nearly identical or similar, and their syntactic structures are highly similar. Chinese-Thai parallel corpora are also relatively easy to obtain. The similarity between Lao and Thai can therefore be used to obtain Lao-Thai parallel sentence pairs, on the basis of which a Lao-Chinese bilingual parallel corpus is constructed with Thai as the pivot.

Disclosure of Invention

The invention provides a method and a device for constructing a Lao-Chinese bilingual corpus with Thai as a pivot, which are used to construct a Lao-Chinese bilingual parallel corpus.

The technical scheme of the invention is as follows: a method for constructing a Lao-Chinese bilingual corpus with Thai as a pivot comprises the following steps:

Step 1, extracting Thai sentences from existing Chinese-Thai parallel corpus data and performing Thai word segmentation;

Step 2, constructing a Lao-Thai bilingual dictionary, and translating the Thai sentences word by word into Lao sentence sequences using the Lao-Thai bilingual dictionary to obtain candidate Lao-Thai parallel sentence pairs;

Step 3, constructing a Lao-Thai parallel sentence pair classification model based on a bidirectional LSTM, classifying the candidate Lao-Thai parallel sentence pairs, and extracting the mutually translated Lao-Thai sentences to obtain Lao-Thai bilingual parallel sentence pairs;

Step 4, matching the obtained Lao-Thai bilingual parallel sentence pairs with the existing Chinese-Thai parallel corpus, taking Thai as the pivot language, to construct a Lao-Chinese bilingual parallel corpus.
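The pivot matching of Step 4 can be pictured as a join on the shared Thai sentence. The following Python sketch is only an illustration under stated assumptions: the function name `pivot_join`, the tuple layouts, and the use of exact string equality on the Thai sentence as the join key are not specified in the patent, and the strings in the usage note are placeholders rather than real sentences.

```python
from collections import defaultdict


def pivot_join(lao_thai_pairs, chinese_thai_pairs):
    """Join (lao, thai) and (chinese, thai) pairs on identical Thai sentences.

    Assumption: both corpora carry the Thai sentence in the same string form,
    so that plain equality can serve as the pivot key.
    """
    # Index the Chinese side by its Thai sentence.
    thai_to_chinese = defaultdict(list)
    for chinese, thai in chinese_thai_pairs:
        thai_to_chinese[thai].append(chinese)

    # Every Lao sentence inherits the Chinese sentences of its Thai pivot.
    lao_chinese_pairs = []
    for lao, thai in lao_thai_pairs:
        for chinese in thai_to_chinese.get(thai, []):
            lao_chinese_pairs.append((lao, chinese))
    return lao_chinese_pairs
```

For example, `pivot_join([("lo_a", "th_a")], [("zh_a", "th_a")])` pairs `"lo_a"` with `"zh_a"`, while a Lao-Thai pair whose Thai sentence has no Chinese counterpart is dropped.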

Further, the specific steps of Step 1 are as follows:

Step 1.1, selecting Thai sentences of 20-50 characters from an existing Chinese-Thai bilingual parallel corpus;

Step 1.2, performing word segmentation on the selected Thai sentences, wherein the word segmentation tool is the Southeast Asian minor-language information processing platform developed by Kunming University of Science and Technology, available at http://222.197.219.24:8099/.

The invention takes into account that Thai is written continuously, without word delimiters, so unsegmented text can be neither translated word by word nor used in the model. A Thai word segmentation tool is therefore applied to obtain segmented Thai sentences.

This preferred scheme is an important component of the invention: it supplies the corpus and the data preprocessing on which the subsequent dictionary translation and model use are based.
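The preprocessing of Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the patent: the function names are hypothetical, `len()` counts Unicode code points (the patent does not say how characters are counted, and Thai vowel and tone marks are separate code points), and the segmenter is passed in as a callable because the interface of the Kunming platform is not described here.

```python
def select_thai_sentences(pairs, min_chars=20, max_chars=50):
    """Keep (chinese, thai) pairs whose Thai sentence has 20-50 characters.

    Assumption: length is measured in Unicode code points after stripping
    surrounding whitespace.
    """
    return [
        (chinese, thai)
        for chinese, thai in pairs
        if min_chars <= len(thai.strip()) <= max_chars
    ]


def segment_all(pairs, segment):
    """Apply a caller-supplied Thai word segmenter to every Thai sentence.

    `segment` is any callable mapping a sentence string to a list of words;
    it stands in for the segmentation tool named in Step 1.2.
    """
    return [(chinese, segment(thai)) for chinese, thai in pairs]
```

In a quick trial, `str.split` can serve as a stand-in segmenter for already-spaced placeholder text; a real Thai segmenter would be substituted in practice.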

Further, the specific steps of Step 2 are as follows:

Step 2.1, construction of the Lao-Thai bilingual dictionary: English is used as the intermediate language; on the basis of a Lao-English dictionary and a Thai-English dictionary, Lao and Thai words are aligned through their English words to construct the Lao-Thai bilingual dictionary;

Step 2.2, because Lao and Thai are extremely similar, the Thai sentences in the acquired Chinese-Thai bilingual parallel sentence pairs are translated word by word using the Lao-Thai bilingual dictionary; because of polysemy, dictionary translation can generate several Lao sentences with different meanings, giving the candidate Lao-Thai parallel sentence pairs, in which each Thai sentence corresponds to a group of several Lao sentences, not all of which are true translations of the Thai sentence.
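The English-pivoted dictionary construction of Step 2.1 can be sketched as follows. The data layout (each dictionary maps a word to a list of English glosses), the case-insensitive gloss matching, and the function name are assumptions made for illustration; the patent does not describe the dictionary formats.

```python
from collections import defaultdict


def build_lao_thai_dictionary(lao_english, thai_english):
    """Map each Thai word to the Lao words that share an English gloss.

    Both inputs are assumed to be dicts of the form word -> list of glosses.
    """
    # Invert the Lao-English dictionary: English gloss -> Lao words.
    english_to_lao = defaultdict(set)
    for lao, glosses in lao_english.items():
        for gloss in glosses:
            english_to_lao[gloss.lower()].add(lao)

    # Align Thai words to Lao words through the shared English gloss.
    thai_to_lao = {}
    for thai, glosses in thai_english.items():
        candidates = set()
        for gloss in glosses:
            candidates |= english_to_lao.get(gloss.lower(), set())
        if candidates:
            thai_to_lao[thai] = sorted(candidates)
    return thai_to_lao
```

A Thai word with several English senses collects Lao words for each sense, which is one source of the polysemy mentioned in Step 2.2.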

This preferred scheme is the key process for obtaining candidate Lao-Thai parallel sentences: it analyzes and exploits the similarity of Lao and Thai in word formation and other respects, and translates word by word with the constructed dictionary to obtain a candidate parallel corpus, in preparation for the subsequent model-based extraction of the Lao-Thai parallel corpus.
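Word-by-word translation with a polysemous dictionary naturally yields one candidate Lao sentence per combination of word choices. The sketch below illustrates this with `itertools.product`. Two details are assumptions not stated in the patent: Thai words missing from the dictionary are copied through unchanged, and the number of candidates is capped by a hypothetical `max_candidates` parameter to keep the combinatorial growth bounded.

```python
from itertools import islice, product


def candidate_lao_sentences(thai_tokens, thai_to_lao, max_candidates=100):
    """Enumerate candidate Lao token sequences for one segmented Thai sentence.

    thai_to_lao maps a Thai word to a list of possible Lao translations.
    """
    # One list of options per Thai word; unknown words fall back to themselves.
    options = [thai_to_lao.get(token, [token]) for token in thai_tokens]
    # The Cartesian product gives every combination of word choices.
    return [list(choice) for choice in islice(product(*options), max_candidates)]
```

Each returned sequence, paired with the source Thai sentence, is one candidate Lao-Thai sentence pair to be scored by the classification model of Step 3.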

Further, the specific steps of Step 3 are as follows:

Step 3.1, manually constructing a sentence-aligned Lao-Thai parallel corpus;

The invention trains the model on a Lao-Thai parallel corpus, so a high-quality parallel corpus is required for the trained model to be effective. The Lao-Thai parallel corpus is therefore constructed manually, ensuring that the training data are accurate parallel sentences, from which the Lao-Thai parallel sentence classification model is obtained.

Step 3.2, because Lao and Thai are highly similar in vocabulary and pronunciation, a bidirectional LSTM is used to represent the constructed Lao-Thai parallel sentence pairs in a shared semantic space; specifically, the bidirectional LSTM produces forward and backward state vectors, which are concatenated to give the sentence vector representation in the shared semantic space, namely:

Figure BDA0002198521770000021

Figure BDA0002198521770000022

Figure BDA0002198521770000023

Figure BDA0002198521770000024

Figure BDA0002198521770000031

wherein:

Figure BDA0002198521770000033 denotes the forward hidden vector of the i-th Thai sentence at state N;

Figure BDA0002198521770000034 denotes the forward hidden vector of the i-th Thai sentence at state N-1;

Figure BDA0002198521770000035 denotes the word vector of the i-th Thai sentence at state N, and LSTM denotes the LSTM function;

Figure BDA0002198521770000036 denotes the backward hidden vector of the i-th Thai sentence at state N;

Figure BDA0002198521770000037 denotes the backward hidden vector of the i-th Thai sentence at state N+1;

the sentence vector of the i-th Thai sentence is obtained by concatenating the final vectors from the two directions;

Figure BDA0002198521770000039 denotes the forward hidden vector of the i-th Lao sentence at state N;

Figure BDA00021985217700000310 denotes the forward hidden vector of the i-th Lao sentence at state N-1;

Figure BDA00021985217700000311 denotes the word vector of the i-th Lao sentence at state N;

Figure BDA00021985217700000312 denotes the backward hidden vector of the i-th Lao sentence at state N;

Figure BDA00021985217700000313 denotes the backward hidden vector of the i-th Lao sentence at state N+1;

the sentence vector of the i-th Lao sentence is obtained by concatenating the final vectors from the two directions;

then, the matching information between the two sentence vectors is captured using the vector dot product and the vector difference to obtain the matching vector:

Figure BDA00021985217700000315

Figure BDA00021985217700000316

Figure BDA00021985217700000317

wherein the two matching vectors, which carry the sentence matching information, are obtained by computing the dot product and the difference of the Lao and Thai sentence vectors, respectively; h_i is the final vector representation containing the matching information; and W_1, W_2 and b are parameters of the bidirectional LSTM model;
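The formula images of Step 3.2 are not reproduced in this text. As a reading aid, the LaTeX below gives one possible form that is consistent with the symbol definitions in the text; it is not taken from the patent figures. The superscripts th (Thai) and lo (Lao), the word vector symbol x, the reading of the "dot product" as an element-wise product, the absolute value on the difference, and the tanh nonlinearity are all assumptions.

```latex
% Assumed notation: th = Thai, lo = Lao, x = word vector, N = state (time step).
\begin{align*}
\overrightarrow{h}^{th}_{i,N} &= \mathrm{LSTM}\bigl(\overrightarrow{h}^{th}_{i,N-1},\; x^{th}_{i,N}\bigr) \\
\overleftarrow{h}^{th}_{i,N}  &= \mathrm{LSTM}\bigl(\overleftarrow{h}^{th}_{i,N+1},\; x^{th}_{i,N}\bigr) \\
h^{th}_{i} &= \bigl[\overrightarrow{h}^{th}_{i};\; \overleftarrow{h}^{th}_{i}\bigr]
  && \text{(final states of the two directions, concatenated)} \\
\overrightarrow{h}^{lo}_{i,N} &= \mathrm{LSTM}\bigl(\overrightarrow{h}^{lo}_{i,N-1},\; x^{lo}_{i,N}\bigr) \\
\overleftarrow{h}^{lo}_{i,N}  &= \mathrm{LSTM}\bigl(\overleftarrow{h}^{lo}_{i,N+1},\; x^{lo}_{i,N}\bigr) \\
h^{lo}_{i} &= \bigl[\overrightarrow{h}^{lo}_{i};\; \overleftarrow{h}^{lo}_{i}\bigr] \\
h^{(1)}_{i} &= h^{lo}_{i} \odot h^{th}_{i}
  && \text{(product term, assumed element-wise)} \\
h^{(2)}_{i} &= \bigl|\, h^{lo}_{i} - h^{th}_{i} \,\bigr|
  && \text{(difference term, absolute value assumed)} \\
h_{i} &= \tanh\bigl(W_1 h^{(1)}_{i} + W_2 h^{(2)}_{i} + b\bigr)
  && \text{(tanh assumed)}
\end{align*}
```

Under this reading, h_i is the vector that the probability formula of Step 3.3 takes as input.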

Step 3.3, finally, a fully connected layer of a convolutional neural network, followed by a sigmoid function, computes the probability that the Lao sentence and the Thai sentence are parallel, in order to decide whether the two sentences are translations of each other:

p(y_i = 1 | h_i) = σ(W_3 h_i + c)

wherein p(y_i = 1 | h_i) is the probability, given the obtained vector h_i, that the two sentences are translations of each other; y_i = 1 means that the two sentences are translations of each other; W_3 and c are parameters of the convolutional neural network model; and σ is the activation function;
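To make the scoring of Steps 3.2 and 3.3 concrete, the following pure-Python sketch computes the matching features and the probability for one pair of sentence vectors. It is an illustration under assumptions, not the patented implementation: the product is taken element-wise, the difference is taken in absolute value, tanh combines the two feature vectors, and W3 is treated as a single weight row so that the output is a scalar. The sentence vectors would come from a trained bidirectional LSTM, which is not shown.

```python
import math


def match_features(h_lao, h_thai):
    """Return the element-wise product and absolute difference of two vectors."""
    prod = [a * b for a, b in zip(h_lao, h_thai)]
    diff = [abs(a - b) for a, b in zip(h_lao, h_thai)]
    return prod, diff


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def parallel_probability(h_lao, h_thai, W1, W2, b, W3, c):
    """Compute p(y = 1 | h) = sigmoid(W3 . h + c) for one sentence pair.

    h = tanh(W1 @ prod + W2 @ diff + b); the tanh is an assumed nonlinearity.
    """
    prod, diff = match_features(h_lao, h_thai)
    h = [
        math.tanh(p + q + r)
        for p, q, r in zip(matvec(W1, prod), matvec(W2, diff), b)
    ]
    z = sum(w * x for w, x in zip(W3, h)) + c
    return 1.0 / (1.0 + math.exp(-z))
```

A pair would then be accepted as parallel when the returned probability exceeds a chosen threshold; the patent text does not state the threshold value.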

Step 3.4, the cross-entropy loss given below is used as the loss function; over multiple iterations the parameters of the bidirectional LSTM model and the convolutional neural network model are updated, thereby training the two models, that is, the Lao-Thai parallel sentence pair classification model; the candidate Lao-Thai parallel sentence pairs are then classified by the trained model, and the mutually translated Lao-Thai sentences are extracted to obtain the Lao-Thai parallel sentence pairs;

wherein the loss function is as follows:

Figure BDA0002198521770000041

wherein y_i = 1 or y_i = 0; y_i = 1 indicates that the Lao sentence and the Thai sentence are parallel, and y_i = 0 indicates that they are not; n is the number of positive samples (parallel sentences) used in training the model, and m is the number of negative samples (non-parallel sentences) used in training the model.
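The loss formula image is not reproduced in this text. A standard binary cross-entropy over the n positive and m negative samples matches the description, and the sketch below computes it in pure Python. Whether the patent sums or averages the per-sample terms is not stated; the sum used here, the function name, and the clipping constant `eps` are assumptions.

```python
import math


def cross_entropy_loss(probabilities, labels, eps=1e-12):
    """Binary cross-entropy summed over all training pairs.

    probabilities[i] is the predicted p(y_i = 1 | h_i); labels[i] is 1 for a
    parallel pair and 0 for a non-parallel pair.
    """
    total = 0.0
    for p, y in zip(probabilities, labels):
        # Clip to avoid log(0) for saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

In training, this value would be minimized with respect to the parameters of both the bidirectional LSTM and the output layer, typically by a gradient-based optimizer in a deep learning framework.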

A device for constructing a Lao-Chinese bilingual corpus with Thai as a pivot comprises a data preprocessing module, a dictionary translation module, a Lao-Thai parallel sentence pair extraction module and a Lao-Chinese parallel corpus construction module;

the data preprocessing module is used for extracting Thai sentences from existing Chinese-Thai parallel corpus data and performing Thai word segmentation;

the dictionary translation module is used for constructing a Lao-Thai bilingual dictionary and translating the Thai sentences word by word into Lao sentence sequences using the Lao-Thai bilingual dictionary to obtain candidate Lao-Thai parallel sentence pairs;

the Lao-Thai parallel sentence pair extraction module is used for constructing a bidirectional-LSTM-based Lao-Thai parallel sentence pair classification model, classifying the candidate Lao-Thai parallel sentence pairs, and extracting the mutually translated Lao-Thai sentences to obtain Lao-Thai bilingual parallel sentence pairs;

the Lao-Chinese parallel corpus construction module is used for matching the obtained Lao-Thai bilingual parallel sentence pairs with the existing Chinese-Thai parallel corpus, taking Thai as the pivot language to match Lao with Chinese, and constructing the Lao-Chinese bilingual parallel corpus.

The beneficial effects of the invention are as follows:

Lao is a low-resource language among the Southeast Asian languages, and it is very difficult to obtain parallel Lao-Chinese bilingual resources directly from the Internet.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a diagram of Lao-Thai syntactic similarity in the present invention;

FIG. 3 is a diagram of polysemy in word-by-word translation in the present invention;

FIG. 4 is a flow chart of parallel sentence classification in the present invention;

FIG. 5 is a structural diagram of the device of the present invention;

FIG. 6 is a block diagram of the general process flow of the present invention.

Detailed Description
