Method and device for extracting parallel sentence pairs for transfer learning based on language similarity

Document No.: 1816624 Publication date: 2021-11-09

Reading note: This technology, "Method and device for extracting parallel sentence pairs for transfer learning based on language similarity" (基于语言相似性的迁移学习平行句对抽取方法及装置), was designed and created by 毛存礼 (Mao Cunli), 满志博 (Man Zhibo), 余正涛 (Yu Zhengtao), 高盛祥 (Gao Shengxiang), 黄于欣 (Huang Yuxin) and 王振晗 (Wang Zhenhan) on 2021-07-01. Main content: The invention relates to a method and a device for extracting parallel sentence pairs for transfer learning based on language similarity, and belongs to the field of natural language processing. First, the Thai and Lao corpora are preprocessed: sub-words and words are replaced on the basis of phonetic symbols to obtain a unified representation of Thai and Lao sentences. Then, based on the language similarity between Thai and Lao, data transfer and model transfer are used to transfer the Chinese-Thai parallel sentence pair extraction model to the Chinese-Lao model. Finally, the pre-trained parallel sentence pair extraction model predicts the Chinese-Lao parallel sentence pairs fed into it. The proposed method effectively models language similarity and transfers the resource-rich Chinese-Thai sentence pair extraction model to the resource-scarce Chinese-Lao sentence pair extraction model, thereby improving the performance of the Chinese-Lao model; it has important theoretical and practical value.

1. A method for extracting parallel sentence pairs for transfer learning based on language similarity, characterized in that the method comprises the following specific steps:

Step1, preprocessing the Thai and Lao data: representing the word, sub-word and pronunciation information of Lao on the basis of Thai;

Step2, training a Chinese-Thai parallel sentence pair extraction model based on transfer learning: training the model on Chinese-Thai parallel sentences, and then transferring the model to the Chinese-Lao parallel sentence pair extraction model;

extracting the input Chinese-Lao parallel sentence pairs with the pre-trained Chinese-Thai parallel sentence pair extraction model, and judging the sentence similarity.

2. The method for extracting parallel sentence pairs for transfer learning based on language similarity according to claim 1, wherein the specific steps of Step1 are as follows:

Step1.1, first, performing word segmentation on the input Thai and Lao sentences;

Step1.2, replacing words in the Thai-Lao data on the basis of a Thai-Lao bilingual dictionary and a phonetic symbol dictionary.

3. The method for extracting parallel sentence pairs for transfer learning based on language similarity according to claim 1, wherein the specific steps of Step1.2 are as follows:

at the data preprocessing layer, a Thai-Lao word dictionary, sub-word dictionary and phonetic symbol dictionary are used to replace Lao with Thai and represent it accordingly, so that, when the data are fed into the model and vectorized, the bilingual data of the two languages can be effectively mixed for training, achieving data augmentation; the specific steps are as follows:

Step1.2.1, Thai-Lao word representation: a Thai sentence $S^{Th}_{w}=(w^{th}_{1},w^{th}_{2},\ldots,w^{th}_{n})$ containing n words and a corresponding Lao sentence $S^{Lao}_{w}=(w^{lao}_{1},w^{lao}_{2},\ldots,w^{lao}_{n})$ containing n words are input; using the correspondence between Thai sub-words and Lao vocabulary, the segmented Thai and Lao sentences are replaced on the basis of the Thai-Lao dictionary, and the words of the Lao sentences are replaced by Thai words, so that all Lao sentences fed into the model are represented as Thai sentences; after replacement, the input sentences are expressed as follows:

$S^{Th}_{s}=(s^{th}_{1},s^{th}_{2},\ldots,s^{th}_{n})$

$S^{Lao}_{s}=(s^{lao}_{1},s^{lao}_{2},\ldots,s^{lao}_{n})$

Step1.2.2, Thai-Lao phonetic symbol representation: the phonetic symbols of Thai and Lao are vectorized, and the phonetic symbol information is concatenated as a vector into the sentence vector representation, so that each Thai sentence $S^{Th}_{w}=(w^{th}_{1},w^{th}_{2},\ldots,w^{th}_{n})$ and each Lao sentence $S^{Lao}_{w}=(w^{lao}_{1},w^{lao}_{2},\ldots,w^{lao}_{n})$ fed into the model has a corresponding phonetic-symbol-level representation.

4. The method for extracting parallel sentence pairs for transfer learning based on language similarity according to claim 1, wherein the specific steps of Step2 are as follows:

Step2.1, obtaining Thai-Lao word vectors based on a pre-trained language model: at the input layer, the Chinese-Thai and Chinese-Lao bilingual data are mixed for training following the idea of data transfer; specifically, the input word vector representation is generated by BERT, as shown in the following formula:

Step2.2, obtaining Thai-Lao phonetic symbol vectors: a dictionary of Thai, Lao, sub-words and phonetic symbols is constructed according to pronunciation similarity, and the phonetic symbol vectors of Thai and Lao are generated from the constructed phonetic symbol dictionary with Word2vec using the Skip-gram language model; during Thai-Lao sentence replacement, replacement is first performed at the word level, the sub-word correspondence is then used to replace the Thai-Lao characters and sub-words that could not be replaced, and the generated character and sub-word correspondence table is also used for replacement and representation; the phonetic symbol vectors of Thai and Lao obtained by model training are denoted separately for each language;

Step2.3, concatenating Thai-Lao word vectors and phonetic symbol vectors: the word vectors and phonetic symbol vectors of Thai and Lao obtained in the preceding steps are concatenated, as shown in the following equation:

Step2.4, model training layer: a Poly-encoder is used to encode the bilingual sentences and compute their similarity; for the source-language and target-language sentences fed into the Poly-encoder, the Poly-encoder structure contains two encoders, and the target-language Chinese sentence is encoded as a single vector; each Thai or Lao sentence fed into the model is represented by the concatenation of the m Thai or Lao word vectors and phonetic symbol vectors from Step2.3, specifically as follows:

$S^{Th}_{E}=(E^{th}_{1},E^{th}_{2},\ldots,E^{th}_{m})$

$S^{Lao}_{E}=(E^{lao}_{1},E^{lao}_{2},\ldots,E^{lao}_{m})$

the vectorized Thai and Lao sentences are further represented, by means of an attention mechanism, as n vectors $y_{1}^{Th\backslash Lao},y_{2}^{Th\backslash Lao},\ldots,y_{n}^{Th\backslash Lao}$, where n affects the speed of the whole model training process; to obtain n global features of the input, the model training part learns n node vectors $c_{1},\ldots,c_{n}$, where each $c_{i}$ extracts the representation $y_{i}^{Th\backslash Lao}$ by attending over all outputs of the previous layer; $y_{i}^{Th\backslash Lao}$ is given by the following formula:

where the weights in the formula are the training weights of the source language, $h_{1},\ldots,h_{N}$ are the context vectors generated by the attention mechanism, and $N$ is the number of source-language training weights;

finally, given the n global context features, the target-language Chinese vector $y_{Ch}$ is used as the query vector in the training process:

where $(w_{1},\ldots,w_{m})=\mathrm{softmax}(y_{Ch}\cdot y_{1}^{Th\backslash Lao},\ldots,y_{Ch}\cdot y_{m}^{Th\backslash Lao})$ represents the target-language weight information;

finally, the similarity between the output Thai or Lao sentence and the target-language Chinese sentence is computed as the dot-product score $\mathrm{Score}(Th\backslash Lao,Ch)=y_{i}^{Th\backslash Lao}\cdot y_{Ch}$.

5. A device for extracting parallel sentence pairs for transfer learning based on language similarity, characterized in that the device comprises the following modules:

a Thai-Lao preprocessing module, used to represent the word, sub-word and pronunciation information of Lao on the basis of Thai;

a transfer-learning-based parallel sentence pair extraction module, used to transfer the Chinese-Thai parallel sentence pair extraction model to the Chinese-Lao parallel sentence pair extraction model;

a parallel sentence pair extraction module, used to extract the input Chinese-Lao parallel sentence pairs with the pre-trained Chinese-Thai parallel sentence pair extraction model.

Technical Field

The invention relates to a method and a device for extracting parallel sentence pairs for transfer learning based on language similarity, and belongs to the technical field of natural language processing.

Background

Using transfer learning to address the shortage of corpora for low-resource languages is a current research hotspot in natural language processing. Transferring existing Chinese-Thai parallel sentence corpora to Chinese-Lao through transfer learning can achieve good results, mainly because Thai and Lao share a certain degree of language similarity. Chinese-Thai and Chinese-Lao bilingual sentence pairs are relatively scarce, which directly leads to poor performance of Chinese-Thai and Chinese-Lao translation models. A common strategy is to use a certain number of parallel sentence pairs to build a parallel sentence pair extraction model and then extract high-quality Chinese-Thai sentence pairs from comparable corpora or pseudo-parallel sentence pairs on the Internet, which can effectively improve machine translation performance. In the invention, the similarity information of Thai and Lao at different levels is fused into a joint representation, and the Chinese-Thai and Chinese-Lao sentence pair extraction models are shared, so that the linguistic information of the resource-rich language is used effectively.

Disclosure of Invention

The invention provides a method and a device for extracting parallel sentence pairs for transfer learning based on language similarity, to solve the problems that labeled Chinese-Lao data are scarce, that the training data are small in scale so that parallel sentence pair extraction performs poorly, and that models trained solely on labeled data perform poorly.

The technical scheme of the invention is as follows: the method for extracting parallel sentence pairs for transfer learning based on language similarity comprises the following specific steps:

Step1, performing word segmentation on the Thai and Lao data, and representing the word, sub-word and pronunciation information of Lao on the basis of Thai;

Step2, training a Chinese-Thai parallel sentence pair extraction model based on transfer learning: training the model on Chinese-Thai parallel sentences, and then transferring the model to the Chinese-Lao parallel sentence pair extraction model;

extracting the input Chinese-Lao parallel sentence pairs with the pre-trained Chinese-Thai parallel sentence pair extraction model, and judging the sentence similarity.

Further, the specific steps of Step1 are as follows:

Step1.1, first, performing word segmentation on the input Thai and Lao sentences;

Step1.2, replacing words in the Thai-Lao data on the basis of a Thai-Lao bilingual dictionary and a phonetic symbol dictionary.

Further, the specific steps of Step1.2 are as follows:

At the data preprocessing layer, a Thai-Lao word dictionary, sub-word dictionary and phonetic symbol dictionary are used to replace Lao with Thai and represent it accordingly, so that, when the data are fed into the model and vectorized, the bilingual data of the two languages can be effectively mixed for training, achieving data augmentation. The specific steps are as follows:

Step1.2.1, Thai-Lao word representation: a Thai sentence $S^{Th}_{w}=(w^{th}_{1},w^{th}_{2},\ldots,w^{th}_{n})$ containing n words and a corresponding Lao sentence $S^{Lao}_{w}=(w^{lao}_{1},w^{lao}_{2},\ldots,w^{lao}_{n})$ containing n words are input, and replacement is performed using the correspondence between Thai sub-words and Lao vocabulary. Taking the Thai and Lao sentences whose Chinese meaning is "I love China" as an example, the segmented Thai and Lao sentences are replaced on the basis of the Thai-Lao dictionary, and the words of the Lao sentence are replaced by Thai words, so that all Lao sentences fed into the model are represented as Thai sentences. After replacement, the input sentences are expressed as formula (1):

Because the Thai-Lao dictionary is limited in size, not every Lao word has a corresponding Thai word to replace it, so some Lao words remain in the original sentence after replacement. This does not harm the performance of the subsequent model: BERT relies on a masking mechanism, and retaining some Lao words introduces a degree of noise that can improve model performance.

Step1.2.2, Thai-Lao phonetic symbol representation: the language similarity between Thai and Lao is mainly reflected in pronunciation, and all characters of both languages can be represented by corresponding phonetic symbols. To further fuse the similarity features of Thai and Lao, to use this language similarity as a constraint on the representation of the two languages, and to model their semantics explicitly, the phonetic symbols of Thai and Lao are vectorized and concatenated as vectors into the sentence vector representation, so that each input Thai sentence $S^{Th}_{w}=(w^{th}_{1},w^{th}_{2},\ldots,w^{th}_{n})$ and Lao sentence $S^{Lao}_{w}=(w^{lao}_{1},w^{lao}_{2},\ldots,w^{lao}_{n})$ has a corresponding phonetic-symbol-level representation.

For example, the Thai and Lao sentences whose Chinese meaning is "I love China" are each expressed in phonetic symbols according to the constructed phonetic symbol dictionary. These representations further constrain the similarity of the two languages. The phonetic-symbol form of the Thai and Lao sentences is given as formula (2):

Further, the specific steps of Step2 are as follows:

Step2.1, obtaining Thai-Lao word vectors based on a pre-trained language model: at the input layer, the Thai-Chinese and Lao-Chinese bilingual data are mixed for training following the idea of data transfer. The multilingual BERT pre-trained model covers 108 languages collected from Wikipedia; its Southeast Asian languages include Thai, Burmese and Vietnamese but not Lao. The language similarity between Thai and Lao is therefore used to expand the Lao-Chinese data at the data level. Specifically, the input word vector representation is generated by BERT; the Thai and Lao word vectors generated here have 768 dimensions, and the pre-trained language model yields word vectors that carry context information, as shown in formula (3):

Step2.2, obtaining Thai-Lao phonetic symbol vectors: sub-words are the smallest semantic granularity in a language, and the relationships among most words in a language can be represented by sub-words, so a dictionary of Thai, Lao, sub-words and phonetic symbols is constructed according to pronunciation similarity. The phonetic symbol vectors of Thai and Lao are generated from the constructed phonetic symbol dictionary with Word2vec using the Skip-gram language model. During Thai-Lao sentence replacement, to better capture the correspondence between sentences of the two languages, replacement is first performed at the word level. Because not all Thai-Lao words can be replaced one-to-one, the sub-word correspondence is then used to replace the characters and sub-words that could not be replaced, and the generated character and sub-word correspondence table is also used for replacement and representation. The advantage of this method is that a vector can easily be obtained for any character or symbol. The phonetic symbol vectors of Thai and Lao obtained by model training are denoted separately for each language.

Step2.3, concatenating Thai-Lao word vectors and phonetic symbol vectors: the word vectors and phonetic symbol vectors of Thai and Lao obtained in the preceding steps are concatenated, as shown in the following equation:

Step2.4, model training layer: a Poly-encoder is used to encode the bilingual sentences and compute their similarity. Compared with a bi-encoder and a cross-encoder, the Poly-encoder structure can extract more bilingual sentence information both faster and more accurately. For the source-language and target-language sentences fed into the Poly-encoder, the structure contains two encoders, and the target-language Chinese sentence is encoded as a single vector. Each Thai or Lao sentence fed into the model is represented by the concatenation of the m Thai or Lao word vectors and phonetic symbol vectors from Step2.3, specifically as follows:

$S^{Th}_{E}=(E^{th}_{1},E^{th}_{2},\ldots,E^{th}_{m})$

$S^{Lao}_{E}=(E^{lao}_{1},E^{lao}_{2},\ldots,E^{lao}_{m})$ (5)

The vectorized Thai and Lao sentences are further represented, by means of an attention mechanism, as n vectors $y_{1}^{Th\backslash Lao},y_{2}^{Th\backslash Lao},\ldots,y_{n}^{Th\backslash Lao}$, where n affects the speed of the whole model training process. To obtain n global features of the input, the model training part learns n node vectors $c_{1},\ldots,c_{n}$, where each $c_{i}$ extracts the representation $y_{i}^{Th\backslash Lao}$ by attending over all outputs of the previous layer; $y_{i}^{Th\backslash Lao}$ is given by the following formula:

where the weights in the formula are the training weights of the source language, $h_{1},\ldots,h_{N}$ are the context vectors generated by the attention mechanism, and $N$ is the number of source-language training weights;

finally, given the n global context features, the target-language Chinese vector $y_{Ch}$ is used as the query vector in the training process:

where $(w_{1},\ldots,w_{m})=\mathrm{softmax}(y_{Ch}\cdot y_{1}^{Th\backslash Lao},\ldots,y_{Ch}\cdot y_{m}^{Th\backslash Lao})$ represents the target-language weight information;

finally, the similarity between the output Thai or Lao sentence and the target-language Chinese sentence is computed as the dot-product score $\mathrm{Score}(Th\backslash Lao,Ch)=y_{i}^{Th\backslash Lao}\cdot y_{Ch}$.

The device for extracting parallel sentence pairs for transfer learning based on language similarity comprises the following modules:

a Thai-Lao preprocessing module, used to represent the word, sub-word and pronunciation information of Lao on the basis of Thai;

a transfer-learning-based parallel sentence pair extraction module, used to transfer the Chinese-Thai parallel sentence pair extraction model to the Chinese-Lao parallel sentence pair extraction model;

a parallel sentence pair extraction module, used to extract the input Chinese-Lao parallel sentence pairs with the pre-trained Chinese-Thai parallel sentence pair extraction model.

The beneficial effects of the invention are as follows:

1. The similarity information of Thai and Lao at different levels is fused into a joint representation, so that the Chinese-Thai and Chinese-Lao sentence pair extraction models are shared during training.

2. Vector representations at different levels are constructed from the similarity between Thai and Lao, strengthening the representation of cross-language similarity.

3. Based on the word, sub-word and pronunciation similarity of Thai and Lao, a pre-trained multilingual BERT model is fine-tuned on the Lao data set, and the dependency information among the words in a sentence is captured with a deep poly-encoding mechanism, thereby improving the performance of the Lao-Chinese bilingual sentence pair extraction model.

Drawings

FIG. 1 illustrates the method for extracting parallel sentence pairs for transfer learning based on language similarity;

FIG. 2 is an overall flow chart of the invention.

Detailed Description

Example 1: as shown in FIGS. 1-2, the method for extracting parallel sentence pairs for transfer learning based on language similarity comprises the following specific steps:

Step1, performing word segmentation on the Thai and Lao data, and representing the word, sub-word and pronunciation information of Lao on the basis of Thai;

As a preferred embodiment of the invention, the specific steps of Step1 are as follows:

Step1.1, first, a word segmentation tool is used to segment the input Thai and Lao sentences;

Step1.2, words in the Thai-Lao data are replaced on the basis of a Thai-Lao bilingual dictionary and a phonetic symbol dictionary.

As a preferred embodiment of the invention, the specific steps of Step1.2 are as follows:

At the data preprocessing layer, a Thai-Lao word dictionary, sub-word dictionary and phonetic symbol dictionary are used to replace Lao with Thai and represent it accordingly, so that, when the data are fed into the model and vectorized, the bilingual data of the two languages can be effectively mixed for training, achieving data augmentation. The specific steps are as follows:

Step1.2.1, Thai-Lao word representation: a Thai sentence $S^{Th}_{w}=(w^{th}_{1},w^{th}_{2},\ldots,w^{th}_{n})$ containing n words and a corresponding Lao sentence $S^{Lao}_{w}=(w^{lao}_{1},w^{lao}_{2},\ldots,w^{lao}_{n})$ containing n words are input, and replacement is performed using the correspondence between Thai sub-words and Lao vocabulary. Taking the Thai and Lao sentences whose Chinese meaning is "I love China" as an example, the segmented Thai and Lao sentences are replaced on the basis of the Thai-Lao dictionary, and the words of the Lao sentence are replaced by Thai words, so that all Lao sentences fed into the model are represented as Thai sentences. After word-level and sub-word-level replacement, the input sentences are expressed as formula (1).

$S^{Th}_{s}=(s^{th}_{1},s^{th}_{2},\ldots,s^{th}_{n})$ (1)

$S^{Lao}_{s}=(s^{lao}_{1},s^{lao}_{2},\ldots,s^{lao}_{n})$

Because the Thai-Lao dictionary is limited in size, not every Lao word has a corresponding Thai word to replace it, so some Lao words remain in the original sentence after replacement. This does not harm the performance of the subsequent model: BERT relies on a masking mechanism, and retaining some Lao words introduces a degree of noise that can improve model performance.
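The replacement procedure of Step1.2.1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `replace_lao_with_thai`, the greedy longest-match strategy for sub-words, and the placeholder dictionary entries are all hypothetical, since the source does not specify the dictionary format or the matching rule.

```python
from typing import Dict, List


def replace_lao_with_thai(
    lao_tokens: List[str],
    word_dict: Dict[str, str],
    subword_dict: Dict[str, str],
) -> List[str]:
    """Rewrite a segmented Lao sentence with Thai units where a mapping exists.

    Word-level replacement is tried first. Tokens without a word-level entry
    fall back to sub-word replacement (greedy longest match, an assumption).
    Anything that still has no mapping is kept unchanged, mirroring the
    retained Lao words described in the text.
    """
    result = []
    for token in lao_tokens:
        if token in word_dict:
            result.append(word_dict[token])  # word-level hit
            continue
        pieces = []
        i = 0
        while i < len(token):
            for j in range(len(token), i, -1):
                if token[i:j] in subword_dict:
                    pieces.append(subword_dict[token[i:j]])
                    i = j
                    break
            else:
                pieces.append(token[i])  # no mapping: keep the original character
                i += 1
        result.append("".join(pieces))
    return result


# Placeholder entries only; a real dictionary would hold Lao and Thai strings.
toy_words = {"L1": "T1"}
toy_subwords = {"ab": "AB"}
print(replace_lao_with_thai(["L1", "abz", "L9"], toy_words, toy_subwords))
```

In this toy run the first token is replaced at the word level, the second is partly replaced at the sub-word level, and the third is left untouched because neither dictionary covers it.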

Step1.2.2, Thai-Lao phonetic symbol representation: the language similarity between Thai and Lao is mainly reflected in pronunciation, and all characters of both languages can be represented by corresponding phonetic symbols. To further fuse the similarity features of Thai and Lao, to use this language similarity as a constraint on the representation of the two languages, and to model their semantics explicitly, the phonetic symbols of Thai and Lao are vectorized and concatenated as vectors into the sentence vector representation, so that each input Thai sentence $S^{Th}_{w}=(w^{th}_{1},w^{th}_{2},\ldots,w^{th}_{n})$ and Lao sentence $S^{Lao}_{w}=(w^{lao}_{1},w^{lao}_{2},\ldots,w^{lao}_{n})$ has a corresponding phonetic-symbol-level representation.

For example, the Thai and Lao sentences whose Chinese meaning is "I love China" are each expressed in phonetic symbols according to the constructed phonetic symbol dictionary. These representations further constrain the similarity of the two languages. The phonetic-symbol form of the Thai and Lao sentences is given as formula (2):

As a preferred embodiment of the invention, the specific steps of Step2 are as follows:

Step2.1, Thai-Lao word vectors based on a pre-trained language model: at the input layer, the Thai-Chinese and Lao-Chinese bilingual data are mixed for training following the idea of data transfer. The multilingual BERT pre-trained model covers 108 languages collected from Wikipedia; its Southeast Asian languages include Thai, Burmese and Vietnamese but not Lao. The language similarity between Thai and Lao is therefore used to expand the Lao-Chinese data at the data level. The input word vector representation is generated by BERT; the generated Thai and Lao word vectors have 768 dimensions, and the pre-trained language model yields word vectors that carry context information, as shown in formula (3).

Step2.2, Thai-Lao phonetic symbol vectors: sub-words are the smallest semantic granularity in a language, and the relationships among most words in a language can be represented by sub-words, so a dictionary of Thai, Lao, sub-words and phonetic symbols is constructed according to pronunciation similarity, as shown in Table 1. The phonetic symbol vectors of Thai and Lao are generated from the constructed phonetic symbol dictionary with Word2vec using the Skip-gram language model. During Thai-Lao sentence replacement, to better capture the correspondence between sentences of the two languages, replacement is first performed at the word level. Because not all Thai-Lao words can be replaced one-to-one, the sub-word correspondence is then used to replace the characters and sub-words that could not be replaced, and the generated character and sub-word correspondence table is also used for replacement and representation. The advantage of this method is that a vector can easily be obtained for any character or symbol. The phonetic symbol vectors of Thai and Lao obtained by model training are denoted separately for each language.
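To make the Skip-gram idea in Step2.2 concrete, the sketch below shows how (center, context) training pairs could be drawn from a sentence written as a sequence of phonetic symbols. It is only an illustration: the function name `skipgram_pairs`, the window size and the placeholder symbols are assumptions, and the actual vectors in the source are trained with Word2vec rather than with this code.

```python
from typing import List, Tuple


def skipgram_pairs(symbols: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for Skip-gram training.

    Each symbol is paired with every other symbol that lies at most
    `window` positions to its left or right.
    """
    pairs = []
    for i, center in enumerate(symbols):
        start = max(0, i - window)
        stop = min(len(symbols), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, symbols[j]))
    return pairs


# Placeholder phonetic symbols, not taken from the real dictionary.
print(skipgram_pairs(["a", "b", "c"], window=1))
```

A Skip-gram model is then trained to predict the context symbol from the center symbol, so that symbols appearing in similar contexts in Thai and Lao end up with nearby vectors.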

Step2.3, concatenating Thai-Lao word vectors and phonetic symbol vectors: the word vectors and phonetic symbol vectors of Thai and Lao are obtained in the preceding steps, and for model training they are concatenated, as shown in formula (4).
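The concatenation in Step2.3 amounts to joining, for each token, its word vector and its phonetic symbol vector end to end. The sketch below assumes one phonetic vector per token and uses tiny toy dimensions; the function name `concat_embeddings` is hypothetical, and the source only states that the BERT word vectors have 768 dimensions, not the size of the phonetic vectors.

```python
from typing import List

Vector = List[float]


def concat_embeddings(word_vecs: List[Vector], phon_vecs: List[Vector]) -> List[Vector]:
    """Concatenate each token's word vector with its phonetic symbol vector."""
    if len(word_vecs) != len(phon_vecs):
        raise ValueError("expected one phonetic vector per word vector")
    return [w + p for w, p in zip(word_vecs, phon_vecs)]


# Toy sizes: 2-dimensional word vectors and 1-dimensional phonetic vectors.
sentence = concat_embeddings([[1.0, 2.0], [3.0, 4.0]], [[0.5], [0.6]])
print(sentence)
```

Each resulting vector has a dimension equal to the sum of the two input dimensions, which is the per-token input passed to the model training layer.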

Step2.4, model training layer: a Poly-encoder is used to encode the bilingual sentences and compute their similarity. Compared with a bi-encoder and a cross-encoder, the Poly-encoder structure can extract more bilingual sentence information both faster and more accurately. For the source-language and target-language sentences fed into the Poly-encoder, the structure contains two encoders, and the target-language Chinese sentence is encoded as a single vector. Each Thai or Lao sentence fed into the model is represented by the concatenation of the m Thai or Lao word vectors and phonetic symbol vectors from Step2.3, specifically as follows:

In addition, the vectorized Thai and Lao sentences are further represented, by means of an attention mechanism, as n vectors $y_{1}^{Th\backslash Lao},y_{2}^{Th\backslash Lao},\ldots,y_{n}^{Th\backslash Lao}$, where n affects the speed of the whole model training process. To obtain n global features of the input, the model training part learns n node vectors $c_{1},\ldots,c_{n}$, where each $c_{i}$ extracts the representation $y_{i}^{Th\backslash Lao}$ by attending over all outputs of the previous layer. $y_{i}^{Th\backslash Lao}$ is expressed as shown in equation (6):

where the weights in the formula are the training weights of the source language, and $h_{1},\ldots,h_{N}$ are the context vectors generated by the attention mechanism.

Finally, given the n global context features, the target-language Chinese vector $y_{Ch}$ is used as the query vector in the training process:

where $(w_{1},\ldots,w_{m})=\mathrm{softmax}(y_{Ch}\cdot y_{1}^{Th\backslash Lao},\ldots,y_{Ch}\cdot y_{m}^{Th\backslash Lao})$ represents the target-language weight information.

Finally, the similarity between the output Thai or Lao sentence and the target-language Chinese sentence is computed as the dot-product score $\mathrm{Score}(Th\backslash Lao,Ch)=y_{i}^{Th\backslash Lao}\cdot y_{Ch}$.
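The attention and scoring steps of Step2.4 can be sketched in plain Python as follows. This is a hedged illustration that assumes the standard Poly-encoder formulation: each code vector attends over the token outputs with softmax weights, the Chinese vector then attends over the resulting global features, and the aggregated source vector is scored by a dot product. The function names and the toy numbers are hypothetical, and the real model would obtain the token outputs and code vectors from trained encoders.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector]) -> Vector:
    """Softmax-weighted sum of `keys`, with weights from query-key dot products."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]


def poly_encoder_score(h: List[Vector], codes: List[Vector], y_ch: Vector) -> float:
    """Score a Thai or Lao sentence (token outputs `h`) against a Chinese vector."""
    ys = [attend(c, h) for c in codes]  # n global features y_1 ... y_n
    y_src = attend(y_ch, ys)            # Chinese vector used as the query
    return dot(y_src, y_ch)             # dot-product similarity score


# Toy inputs: two token outputs, one code vector, one Chinese sentence vector.
score = poly_encoder_score(
    h=[[1.0, 0.0], [0.0, 1.0]],
    codes=[[0.0, 0.0]],
    y_ch=[1.0, 1.0],
)
print(round(score, 6))
```

With a zero code vector the two token outputs receive equal weight, so the single global feature is their average, and its dot product with the Chinese vector is 1.0 in this toy case.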

The device for extracting parallel sentence pairs for transfer learning based on language similarity comprises the following modules:

a Thai-Lao preprocessing module, used to represent the word, sub-word and pronunciation information of Lao on the basis of Thai;

a transfer-learning-based parallel sentence pair extraction module, used to transfer the Chinese-Thai parallel sentence pair extraction model to the Chinese-Lao parallel sentence pair extraction model;

a parallel sentence pair extraction module, used to extract the input Chinese-Lao parallel sentence pairs with the pre-trained Chinese-Thai parallel sentence pair extraction model.

Specifically, the Chinese-Thai parallel corpus is obtained from the open-source corpus OPUS, the Chinese-Lao parallel corpus is obtained from the open-source Asian Language Treebank (ALT), and part of the corpus is constructed manually. The training, validation and test sets used in the experiments are shown in Tables 1 and 2.

TABLE 1 Chinese-Thai experimental data set

| | Training set | Validation set | Test set |
|---|---|---|---|
| Number of sentence pairs | 196000 | 2000 | 2000 |

TABLE 2 Chinese-Lao experimental data set

| | Training set | Validation set | Test set |
|---|---|---|---|
| Number of sentence pairs | 96000 | 2000 | 2000 |

To test the performance of the proposed model, precision, recall and the F1 value (F1-measure) are selected as evaluation metrics for judging whether the model correctly classifies Chinese-Lao and Chinese-Thai parallel sentence pairs. The formulas are given in (7), (8) and (9):

$\mathrm{Precision}=\dfrac{TP}{TP+FP}$ (7)

$\mathrm{Recall}=\dfrac{TP}{TP+FN}$ (8)

$F1=\dfrac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$ (9)

where TP is the number of true positives, FN the number of false negatives, FP the number of false positives, and TN the number of true negatives.
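The three evaluation metrics can be computed as in the short sketch below, which uses the standard definitions of precision, recall and F1. The function names and the zero-denominator convention (returning 0.0) are choices made for this illustration, not details given in the source.

```python
from typing import Tuple


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_from_pr(precision, recall)


# Toy counts: 8 true positives, 2 false positives, 2 false negatives.
print(precision_recall_f1(tp=8, fp=2, fn=2))
```

As a consistency check, a precision of 65.65 and a recall of 77.20 give an F1 of about 70.96 under this definition, which matches the first row reported in Table 4.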

In the experimental part, in order to verify the effectiveness of the proposed method, the proposed method is compared with the existing baseline model, and the proposed method is respectively based on machine learning: SVM, LR, and method of deep learning BiLSTM, the specific baseline model introduces the following (1) - (3):

as shown in Table 3, the method of the invention achieves better effects on the combination of three data sets, compared with the mode of machine learning SVM and LR, the method of the invention can obtain better semantic representation of word vectors based on the mode of pre-training BERT language model, and obtain better context information representation based on the mode of attention mechanism, the traditional mode based on machine learning depends on the size of data scale, the performance is not good on low-resource Thai and Lao, the method is limited by data scale, and the results of the SVM and LR methods are not obviously improved. The baseline model is respectively based on two different test sets and training sets for experimental analysis, and longitudinal comparison shows that the Hantai experimental effect is superior to the Hanlao experimental effect, because the Hantai experimental data set is larger in scale than the Hanlao experimental data set.

TABLE 3 Experimental results of comparison with other models

Compared with the deep learning-based methods, the proposed method also performs well. Because the proposed method is an improvement built on the Poly-encoder, the Poly-encoder baseline achieves results comparable to it. In addition, the BERT-based method performs relatively poorly on Lao, because Lao words are missing from the multilingual BERT vocabulary. This further verifies that the proposed method makes full use of the language similarity between Thai and Lao and thereby improves the performance of the parallel sentence pair extraction model.

When the training corpus is a mixture of the Chinese-Thai and Chinese-Lao corpora, the F1 value of the proposed method reaches 76.36% on the Chinese-Thai test set and 56.15% on the Chinese-Lao test set. This indicates that mixed training on the two bilingual corpora serves as data augmentation: the two corpora are combined and the training parameters are shared between the similar languages, which further demonstrates the advantage of the proposed method. In addition, when the training set is Chinese-Thai and the test set is Chinese-Lao, the F1 value of the proposed method reaches 74.16%, and when both the training set and the test set are Chinese-Lao, the F1 value reaches 53.88%, which shows that the Poly-encoder captures the information of bilingual sentences well even when used directly.

To verify the influence of the ratio of positive to negative samples on the results, experiments were run with several different ratios. The results are shown in Tables 4 and 5.

TABLE 4 Influence of different positive and negative sample ratios on experimental results when the test set is Thai

Sample ratio   Training corpus              Test corpus    P (%)   R (%)   F1 (%)
1:1            Chinese-Thai + Chinese-Lao   Chinese-Thai   65.65   77.20   70.96
1:2            Chinese-Thai + Chinese-Lao   Chinese-Thai   60.19   76.20   67.26
1:3            Chinese-Thai + Chinese-Lao   Chinese-Thai   70.66   80.20   75.13
1:4            Chinese-Thai + Chinese-Lao   Chinese-Thai   72.30   80.90   76.36
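
As a worked check, the F1 column of Table 4 can be recomputed from the P and R columns as their harmonic mean. The sketch below copies the four rows from Table 4; the helper name `f1_from_pr` is an illustrative assumption.

```python
def f1_from_pr(p, r):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# (sample ratio, P, R, reported F1), in percent, copied from Table 4.
table4 = [
    ("1:1", 65.65, 77.20, 70.96),
    ("1:2", 60.19, 76.20, 67.26),
    ("1:3", 70.66, 80.20, 75.13),
    ("1:4", 72.30, 80.90, 76.36),
]

for ratio, p, r, reported in table4:
    recomputed = round(f1_from_pr(p, r), 2)
    print(ratio, recomputed, reported)
```

For all four rows the recomputed value agrees with the reported F1 to two decimal places.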

TABLE 5 Influence of different positive and negative sample ratios on experimental results when the test set is Lao

Tables 4 and 5 show that the results are best when the ratio of positive to negative samples is set to 1:4. This ratio affects the parameters learned during model training, and the best results are achieved only when the ratio is properly controlled.
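
This section does not describe how the negative samples are generated. One common approach, assumed here purely for illustration, is to pair each source sentence with randomly chosen target sentences from other pairs. The function name `build_samples`, the parameter `neg_per_pos` and the toy data are hypothetical.

```python
import random


def build_samples(pairs, neg_per_pos=4, seed=0):
    """Label each true pair 1 and add `neg_per_pos` mismatched pairs labelled 0.

    `pairs` is a list of (source_sentence, target_sentence) tuples assumed to
    be parallel and to have distinct targets. Each negative pairs a source
    sentence with the target of a different pair.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two parallel pairs to sample negatives")
    rng = random.Random(seed)
    samples = []
    for i, (src, tgt) in enumerate(pairs):
        samples.append((src, tgt, 1))
        for _ in range(neg_per_pos):
            j = rng.randrange(len(pairs) - 1)
            if j >= i:
                j += 1  # skip index i so the true target is never reused
            samples.append((src, pairs[j][1], 0))
    rng.shuffle(samples)
    return samples


# Toy placeholders standing in for Chinese and Lao sentences.
toy = [("zh-1", "lo-1"), ("zh-2", "lo-2"), ("zh-3", "lo-3")]
data = build_samples(toy, neg_per_pos=4)
print(len(data))
```

With `neg_per_pos=4` every positive pair is accompanied by four negatives, which corresponds to the 1:4 setting reported as best above.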

Ablation experiments were designed to investigate how the different components affect the results. Four groups of comparison experiments were run: subword, word, pronunciation, and subword + pronunciation + word. The results are shown in Tables 6 and 7. The results are best when the three granularities of word, subword and pronunciation are combined, because this expresses the different forms of similarity between Thai and Lao more fully: fusing the similarities at the three granularities captures deeper similarity and yields the best representation. When only word-level similarity is used, there is no pronunciation constraint, so the semantic distance between some words that cannot be matched becomes larger. When only the pronunciation similarity between Thai and Lao is used as a constraint, word-meaning information cannot be incorporated. Combining the word, subword and pronunciation similarities of Thai and Lao, so that they constrain one another, gives the most accurate similarity representation and therefore the largest improvement in model performance.

TABLE 6 Influence of ablation experiments on experimental results when the test set is Thai

Component                       Training corpus              Test corpus    P (%)   R (%)   F1 (%)
Subword                         Chinese-Thai + Chinese-Lao   Chinese-Thai   56.43   58.90   57.54
Word                            Chinese-Thai + Chinese-Lao   Chinese-Thai   50.98   56.11   53.45
Pronunciation                   Chinese-Thai + Chinese-Lao   Chinese-Thai   67.85   66.55   67.21
Subword + pronunciation + word  Chinese-Thai + Chinese-Lao   Chinese-Thai   72.30   80.90   76.36

TABLE 7 Influence of ablation experiments on experimental results when the test set is Lao

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
