Data generation method and device and readable storage medium

Document No.: 1861774    Publication date: 2021-11-19

Reading note: This technology, "Data generation method and device and readable storage medium" (数据生成方法、装置及可读存储介质), was designed and created on 2021-09-07 by Mu Chang (穆畅) and Li Xiang (李响). Its main content is as follows: The present disclosure relates to a data generation method, an apparatus and a readable storage medium, the method comprising: performing noise-adding processing on an initial first language text to obtain a noise-added first language text; processing the noise-added first language text according to a pre-trained language model to obtain a target first language text; performing reverse translation processing on the target first language text to obtain a second language text; and obtaining training data for training a translation model based on the target first language text and the second language text. The disclosed method can improve the diversity of the training data used for training the translation model and alleviate the shortage of training data.

1. A data generation method, comprising:

performing noise-adding processing on an initial first language text to obtain a noise-added first language text;

processing the noise-added first language text according to a pre-trained language model to obtain a target first language text;

performing reverse translation processing on the target first language text to obtain a second language text;

and obtaining training data for training a translation model based on the target first language text and the second language text.

2. The method of claim 1, wherein the first language text is chapter-level text;

the performing reverse translation processing on the target first language text to obtain a second language text comprises:

splitting the target first language text to obtain a plurality of target first language sub-texts; wherein the first language sub-text is a sentence level text;

according to a reverse translation model, performing reverse translation processing on the plurality of target first language sub-texts to obtain a plurality of second language sub-texts;

and performing fusion processing on the plurality of second language sub-texts to obtain the second language text.

3. The method according to claim 1, wherein the performing noise-adding processing on the initial first language text to obtain a noise-added first language text comprises:

performing deletion-replacement noise-adding processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text lacks a preset number of words, or words at preset positions, relative to the initial first language text.

4. The method according to claim 1, wherein the performing noise-adding processing on the initial first language text to obtain a noise-added first language text comprises:

performing out-of-order noise-adding processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text differs in sentence order or word order from the initial first language text.

5. The method of claim 1, wherein the pre-trained language model is a bidirectional and auto-regressive Transformer model.

6. The method of claim 2, wherein the reverse translation model is trained by:

obtaining a plurality of training samples; wherein each of the training samples comprises a sample first language sub-text and a sample second language sub-text, both of which are sentence-level texts;

iteratively updating parameters of an initial reverse translation model based on the plurality of training samples to reduce loss function values corresponding to the training samples, to obtain a trained reverse translation model;

wherein the loss function value corresponding to each of the training samples is determined by the following process:

processing the sample first language sub-text through the reverse translation model to obtain a predicted second language sub-text;

determining the loss function value based at least on a difference between the predicted second language sub-text and the sample second language sub-text.

7. The method of claim 6, wherein the reverse translation model is a Transformer model.

8. A data generation apparatus, comprising:

a noise-adding module configured to perform noise-adding processing on an initial first language text to obtain a noise-added first language text;

a processing module configured to process the noise-added first language text according to a pre-trained language model to obtain a target first language text;

a reverse translation module configured to perform reverse translation processing on the target first language text to obtain a second language text;

and a training data determination module configured to obtain training data for training a translation model based on the target first language text and the second language text.

9. A data generation apparatus, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to execute the executable instructions in the memory to implement the steps of the method of any one of claims 1-7.

10. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 7.

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a data generation method and apparatus, and a readable storage medium.

Background

Machine translation refers to the process of converting a source language into a target language by a computer. With the development of machine learning and deep learning techniques, machine translation has gradually moved from statistical machine translation into the era of neural machine translation. For a neural machine translation model, the training data used for training the translation model has a very important influence on the prediction of the model. Therefore, how to better construct the training data of the translation model so as to improve its translation accuracy is a problem to be solved urgently.

Disclosure of Invention

To overcome the problems in the related art, the present disclosure provides a data generation method, apparatus, and readable storage medium.

According to a first aspect of the embodiments of the present disclosure, there is provided a data generation method, including:

performing noise-adding processing on an initial first language text to obtain a noise-added first language text;

processing the noise-added first language text according to a pre-trained language model to obtain a target first language text;

performing reverse translation processing on the target first language text to obtain a second language text;

and obtaining training data for training a translation model based on the target first language text and the second language text.

In some embodiments, the first language text is chapter-level text, and the performing reverse translation processing on the target first language text to obtain a second language text includes:

splitting the target first language text to obtain a plurality of target first language sub-texts; wherein the first language sub-text is a sentence-level text;

performing reverse translation processing on the plurality of target first language sub-texts according to a reverse translation model to obtain a plurality of second language sub-texts;

and performing fusion processing on the plurality of second language sub-texts to obtain the second language text.

In some embodiments, the performing noise-adding processing on the initial first language text to obtain a noise-added first language text includes:

performing deletion-replacement noise-adding processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text lacks a preset number of words, or words at preset positions, relative to the initial first language text.

In some embodiments, the performing noise-adding processing on the initial first language text to obtain a noise-added first language text includes:

performing out-of-order noise-adding processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text differs in sentence order or word order from the initial first language text.

In some embodiments, the pre-trained language model is a bidirectional and auto-regressive Transformer (BART) model.

In some embodiments, the reverse translation model is trained by:

obtaining a plurality of training samples; wherein each of the training samples includes a sample first language sub-text and a sample second language sub-text, both of which are sentence-level texts;

iteratively updating parameters of an initial reverse translation model based on the plurality of training samples to reduce loss function values corresponding to the training samples, to obtain a trained reverse translation model;

wherein the loss function value corresponding to each training sample is determined by the following process:

processing the sample first language sub-text through the reverse translation model to obtain a predicted second language sub-text;

determining the loss function value based at least on a difference between the predicted second language sub-text and the sample second language sub-text.

In some embodiments, the reverse translation model is a Transformer model.

According to a second aspect of the embodiments of the present disclosure, there is provided a data generation apparatus including:

a noise-adding module configured to perform noise-adding processing on an initial first language text to obtain a noise-added first language text;

a processing module configured to process the noise-added first language text according to a pre-trained language model to obtain a target first language text;

a reverse translation module configured to perform reverse translation processing on the target first language text to obtain a second language text;

and a training data determination module configured to obtain training data for training a translation model based on the target first language text and the second language text.

In some embodiments, the first language text is chapter-level text, and the reverse translation module is further configured to:

splitting the target first language text to obtain a plurality of target first language sub-texts; wherein the first language sub-text is a sentence level text;

according to a reverse translation model, performing reverse translation processing on the plurality of target first language sub-texts to obtain a plurality of second language sub-texts;

and performing fusion processing on the plurality of second language sub-texts to obtain the second language text.

In some embodiments, the noise-adding module is further configured to perform deletion-replacement noise-adding processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text lacks a preset number of words, or words at preset positions, relative to the initial first language text.

In some embodiments, the noise-adding module is further configured to perform out-of-order noise-adding processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text differs in sentence order or word order from the initial first language text.

In some embodiments, the pre-trained language model is a bidirectional and auto-regressive Transformer (BART) model.

In some embodiments, the apparatus further comprises a training module configured to:

obtaining a plurality of training samples; wherein each of the training samples includes a sample first language sub-text and a sample second language sub-text, both of which are sentence-level texts;

iteratively updating parameters of an initial reverse translation model based on the plurality of training samples to reduce loss function values corresponding to the training samples, to obtain a trained reverse translation model;

wherein the loss function value corresponding to each training sample is determined by the following process:

processing the sample first language sub-text through the reverse translation model to obtain a predicted second language sub-text;

determining the loss function value based at least on a difference between the predicted second language sub-text and the sample second language sub-text.

In some embodiments, the reverse translation model is a Transformer model.

According to a third aspect of the embodiments of the present disclosure, there is provided a data generation apparatus including:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to execute the executable instructions in the memory to implement the steps of the method of any one of the first aspects of the present disclosure.

According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of any one of the first aspects of the present disclosure.

The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects: the noise-added first language text is processed by the pre-trained language model, so that a target first language text different from the initial first language text can be generated; training data are then constructed based on the target first language text and its reverse translation result, so that training data reflecting different chapter content can be obtained, diversification of the training data is achieved, and the training effect of the translation model is improved. Moreover, the pre-trained model does not need to be additionally trained, which greatly improves the efficiency of generating training data.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 is a flow chart illustrating a method of data generation according to an exemplary embodiment of the present disclosure;

FIG. 2 is a flow diagram illustrating training a reverse translation model according to an exemplary embodiment of the present disclosure;

FIG. 3 is a block diagram illustrating a data generation apparatus according to an exemplary embodiment of the present disclosure;

FIG. 4 is a block diagram illustrating a data generation apparatus according to an exemplary embodiment of the present disclosure.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

Before introducing the data generation method provided by the present disclosure, an application scenario related to the various embodiments of the present disclosure is first described. The present disclosure may be applied to a process of training a translation model, where the trained translation model may be used to translate chapter-level text.

In some embodiments, in a chapter translation scenario, the original chapter-level text may be split to obtain a plurality of sentence-level texts, and training data for training a chapter-level translation model is constructed according to the translation results corresponding to the plurality of sentence-level texts and the original chapter-level text. However, this approach has the following problems: complex dependency relationships exist among the sentences in a chapter, and after the chapter-level text is simply split, the obtained sentence-level texts do not reflect these dependency relationships, so that the finally generated translation results are contextually inconsistent and lack coherence. Furthermore, training data constructed in this way reduces the training effect of the model. In addition, in the chapter translation scenario, the training data required for training the translation model is in short supply, which limits the wide application of chapter-level translation models.

Therefore, the present disclosure provides a data generation method, a data generation apparatus, and a readable storage medium. The noise-added first language text is processed by a pre-trained language model, so that a target first language text different from the initial first language text, i.e., a target first language text containing translation errors, can be generated; training data is then constructed based on the target first language text and its reverse translation result, so that training data reflecting different chapter content can be obtained, diversification of the training data is achieved, and the training effect of the translation model is improved. Moreover, the pre-trained model does not need to be additionally trained, which greatly improves the efficiency of generating training data.

FIG. 1 is a flow chart illustrating a data generation method according to an exemplary embodiment. As shown in FIG. 1, the method may include the following steps.

In step S11, noise-adding processing is performed on the initial first language text to obtain a noise-added first language text.

In some embodiments, the initial first language text is first language text that has not been processed. The first language text may be a text obtained by translating a text to be translated (i.e., a source language text). The first language text may be text in any language, such as English, Chinese, or German. In some embodiments, the initial first language text may be chapter-level text. A chapter may be a complete linguistic unit consisting of a series of consecutive words, phrases, clauses, sentences, or paragraphs, and chapter-level text may be text formed from such a unit, such as an article, a book, or a periodical. It will be appreciated that chapter-level text may include a large number (e.g., four thousand, ten thousand, etc.) of characters.

In some embodiments, the noise-added first language text may be a text obtained by performing noise-adding processing on the initial first language text. In some embodiments, the noise-adding processing may include at least one of: replacing at least one word in the initial first language text, deleting at least one word segment in the initial first language text, changing the order of at least two sentences in the initial first language text, and rotating the initial first language text.

In some embodiments, the performing noise-adding processing on the initial first language text to obtain a noise-added first language text may include: performing deletion-replacement noise-adding processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text lacks a preset number of words, or words at preset positions, relative to the initial first language text.

It will be appreciated that the deletion-replacement noise-adding process may be used to obtain a noise-added first language text that lacks a preset number of words, or words at preset positions, relative to the initial first language text. The preset number and the preset positions can be set according to the actual situation. In some embodiments, the deletion-replacement noise-adding processing may include at least one of: deleting at least one word in the initial first language text, and deleting at least one word segment in the initial first language text.

In some embodiments, when the noise-adding processing is a replacement processing, at least one word in the initial first language text may be replaced with a preset marker, for example, the preset marker MASK. The replacement may be a random replacement. Illustratively, taking the case where the initial first language text includes sentence A whose word sequence is {a1, a2, a3, a4, a5} as an example, if the word a3 in sentence A is replaced, the word sequence of sentence A in the noise-added first language text becomes {a1, a2, MASK, a4, a5}. The manner of replacing other words is similar to this replacement process and is not repeated here.

In some embodiments, replacing at least one word in the initial first language text may include: replacing at least one word other than connection words and entity words in the initial first language text. A connection word may refer to a word used to join sentences, e.g., words expressing causality, transition, summarization, or inference. An entity may be any object that can be described, such as a service, a person name, or a place name, and an entity word may be a word corresponding to an entity. By replacing only words other than the connection words and the entity words, it can be ensured that the overall structure of the noise-added first language text changes as little as possible relative to the initial first language text, so that the subsequently generated training data better matches the domain of real text data and the translation model is trained more effectively.

In some embodiments, at least one word in the initial first language text may be randomly deleted. For example, taking the case where the initial first language text includes sentence A whose word sequence is {a1, a2, a3, a4, a5} as an example, if the words a4 and a5 are deleted, the word sequence of sentence A in the noise-added first language text becomes {a1, a2, a3}.

In some embodiments, a word segment may be a segment consisting of a plurality of contiguous words. In some embodiments, the word segment may be a plurality of contiguous words in a sentence included in the initial first language text. It is worth mentioning that a plurality of word segments may be word segments included in the same sentence or in different sentences of the initial first language text. For example, still taking sentence A with word sequence {a1, a2, a3, a4, a5} as an example, the plurality of word segments may include: a1-a2, a1-a2-a3, a2-a3-a4-a5, and the like.

In some embodiments, at least one word segment in the first language text may be randomly deleted. For example, still following the foregoing example, the word segment a1-a2 or a1-a2-a3 may be deleted, and the word sequence of sentence A in the noise-added first language text then becomes {a3, a4, a5} or {a4, a5}, respectively. In some embodiments, the length of each of the at least one word segment follows a Poisson distribution with λ = 3.
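As a purely illustrative aid (not part of the disclosed method), the following Python sketch shows one way such Poisson-length segment deletion could be implemented; NumPy, the function name, and the parameters are assumptions made for this example.

```python
import numpy as np

def delete_word_segment(words, lam=3.0, rng=None):
    """Delete one contiguous word segment from a word sequence.

    The segment length is drawn from a Poisson distribution with lambda = 3,
    as described above; the start position is chosen uniformly at random.
    """
    rng = rng or np.random.default_rng()
    words = list(words)
    if not words:
        return words
    length = min(int(rng.poisson(lam)), len(words))
    if length == 0:
        return words
    start = int(rng.integers(0, len(words) - length + 1))
    return words[:start] + words[start + length:]

# Example: deleting the segment a1-a2 from {a1, ..., a5} leaves {a3, a4, a5}.
print(delete_word_segment(["a1", "a2", "a3", "a4", "a5"]))
```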

In some embodiments, the performing noise-adding processing on the initial first language text to obtain a noise-added first language text may include: performing out-of-order noise-adding processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text differs in sentence order or word order from the initial first language text.

It will be appreciated that the out-of-order noise-adding process can be used to obtain a noise-added first language text whose sentence order or word order differs from that of the initial first language text. In some embodiments, the out-of-order noise-adding processing may include at least one of: replacing at least one word in the initial first language text, changing the order of at least two sentences in the initial first language text, and rotating the initial first language text.

In some embodiments, at least one word in the initial first language text may be replaced with a preset marker, for example, the preset marker MASK. The replacement may be a random replacement. Illustratively, still taking the case where the initial first language text includes sentence A whose word sequence is {a1, a2, a3, a4, a5} as an example, if the word a3 in sentence A is replaced, the word sequence of sentence A in the noise-added first language text becomes {a1, a2, MASK, a4, a5}. The manner of replacing other words is similar to this replacement process and is not repeated here.

In some embodiments, replacing at least one word in the initial first language text may include: replacing at least one word other than connection words and entity words in the initial first language text. A connection word may refer to a word used to join sentences, e.g., words expressing causality, transition, summarization, or inference. An entity may be any object that can be described, such as a service, a person name, or a place name, and an entity word may be a word corresponding to an entity. By replacing only words other than the connection words and the entity words, it can be ensured that the overall structure of the noise-added first language text changes as little as possible relative to the initial first language text, so that the subsequently generated training data better matches the domain of real text data and the translation model is trained more effectively.

In some embodiments, the order of at least two sentences in the initial first language text may be randomly changed. In some embodiments, the initial first language text may be divided into a plurality of sentences with periods as separators, and the order of the plurality of sentences may be randomly changed. For example, taking the sentence sequence of the initial first language text as {A, B, C, D, E}, the order of sentences A to E can be changed arbitrarily, for example to obtain the sentence sequence {D, E, A, B, C} of the noise-added first language text.

In some embodiments, rotating the initial first language text may refer to randomly selecting a sentence and rotating the initial first language text so that the selected sentence becomes the new beginning. For example, still taking the sentence sequence of the initial first language text as {A, B, C, D, E}, if the initial first language text is rotated around sentence D, the sentence sequence of the noise-added first language text is {D, E, A, B, C}.

In some embodiments, multiple kinds of noise-adding processing may be performed on the initial first language text to obtain the noise-added first language text. In some embodiments, the multiple kinds of noise-adding processing may include the deletion-replacement noise-adding processing and the out-of-order noise-adding processing. The execution order of the multiple kinds of noise-adding processing can be set according to the actual situation, and the present disclosure does not set any limit to this. For example, still taking the initial first language text with sentence sequence {A, B, C, D, E} and sentence A with word sequence {a1, a2, a3, a4, a5} as an example, if the noise-adding processing includes: changing the order of sentences A and B → rotating the initial first language text around sentence D → deleting the word segment a1-a2 in sentence A → deleting the word a4 in sentence A → replacing the word a3 in sentence A, then the sentence sequence of the noise-added first language text is {D, E, B, A, C} and the word sequence of sentence A is {MASK, a5}. It should be noted that, to simplify the examples of the present disclosure, the sentences included in the initial first language text and the words included in the sentences are not exhaustively listed, and the present disclosure does not set any limit on the number of sentences or the number of words.
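For illustration only, the following Python sketch combines several of the noise-adding operations described above (sentence reordering, rotation, and replacement with the MASK marker); the function names, the MASK string, and the ordering of the operations are assumptions of this example rather than part of the disclosure.

```python
import random

MASK = "MASK"  # preset marker used in the examples above

def shuffle_sentences(sentences, rng=random):
    """Out-of-order noising: randomly change the order of the sentences."""
    sentences = list(sentences)
    rng.shuffle(sentences)
    return sentences

def rotate_document(sentences, rng=random):
    """Rotation noising: pick a sentence at random and make it the new beginning."""
    if not sentences:
        return list(sentences)
    pivot = rng.randrange(len(sentences))
    return sentences[pivot:] + sentences[:pivot]

def mask_random_word(words, protected=frozenset(), rng=random):
    """Replacement noising: replace one word (other than protected connection
    words or entity words) with the preset MASK marker."""
    candidates = [i for i, w in enumerate(words) if w not in protected]
    if not candidates:
        return list(words)
    i = rng.choice(candidates)
    return words[:i] + [MASK] + words[i + 1:]

# Example roughly following the {A, B, C, D, E} illustration above.
sentences = ["A", "B", "C", "D", "E"]
print(rotate_document(sentences))                        # e.g. ['D', 'E', 'A', 'B', 'C']
print(mask_random_word(["a1", "a2", "a3", "a4", "a5"]))  # e.g. ['a1', 'a2', 'MASK', 'a4', 'a5']
```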

In step S12, the noise-added first language text is processed according to the pre-trained language model to obtain a target first language text.

In some embodiments, the target first language text may be a text obtained by reconstructing the noise-added first language text with the pre-trained language model. Since the pre-trained language model processes noisy data, in some embodiments there are differences between the target first language text and the initial first language text; that is, the reconstructed text differs from the real text and contains recovery errors.

In some embodiments, the pre-trained language model may be a pre-trained neural network model for text generation. In some embodiments, the pre-trained language model may be a Bidirectional and Auto-Regressive Transformers (BART) model, which is a pre-trained language model based on the overall Transformer architecture. When the BART model is pre-trained, the input text is first corrupted with various types of noise and is then reconstructed by a sequence-to-sequence model to obtain the output sample.

In some embodiments, the BART model may include an encoder and a decoder, where the encoder uses the encoder component of the BERT (Bidirectional Encoder Representations from Transformers) model, which can encode the input text from two directions to obtain more context information, and the decoder uses the decoder component of GPT (Generative Pre-Training) for reconstructing the input text.

In some embodiments, the BART model may process a noise-added first language text that includes a preset number of characters, and thus the initial first language text may be a language text including a preset number of characters, for example, two thousand characters. In some embodiments, the initial first language text may be a smaller chapter-level text (e.g., a 2,000-character text) obtained by splitting a larger chapter-level text (e.g., a 10,000-character text). In some embodiments, a large amount of training data can be constructed by performing the data generation method of the present disclosure on a plurality of initial first language texts each including a preset number of characters, thereby improving the accuracy of translation model training.
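As a hedged sketch only: step S12 could, for example, be run with the public facebook/bart-large checkpoint of the Hugging Face transformers library. The disclosure does not specify a particular checkpoint or library, so the names below are assumptions of this example.

```python
# Minimal sketch of step S12: reconstructing a noise-added text with BART.
# Assumes the Hugging Face `transformers` library and the facebook/bart-large
# checkpoint; the disclosure itself does not prescribe either.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# A noise-added first language text in which one word was replaced by the mask token.
noised_text = "The committee <mask> the proposal after a short debate."
inputs = tokenizer(noised_text, return_tensors="pt", truncation=True, max_length=1024)

# The decoder reconstructs the corrupted input, yielding the target first
# language text, which may differ from the initial first language text.
output_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=1024)
target_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(target_text)
```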

Because the BART model uses the context information of the text, i.e., bidirectional semantic information, during pre-training encoding, the target first language text reconstructed by the BART model better matches the semantics of the real language text (namely, the initial first language text) and has stronger fluency and coherence. Furthermore, the second language text obtained by reverse-translating the target first language text also has stronger fluency and coherence, which facilitates the subsequent generation of training data.

In step S13, the target first language text is subjected to a reverse translation process to obtain a second language text.

In step S14, based on the target first language text and the second language text, training data for training a translation model is obtained.

In some embodiments, the reverse translation processing may refer to a process of translating a language text obtained by translation back into the source language text. In some embodiments, the second language text may be the source language text. For example, if the target first language text is English text, the second language text may be Chinese text; for another example, if the target first language text is German text, the second language text may be English text. The first language text and the second language text correspond to different languages. It should be noted that the first language text and the second language text may correspond to any pair of languages, and the disclosure is not limited thereto.

In some embodiments, the reverse translation processing may be performed by a model. As mentioned above, the first language text is chapter-level text; in some embodiments, the chapter-level text may be split to obtain sentence-level texts, so that the reverse translation processing is performed on the sentence-level texts.

In some embodiments, performing reverse translation processing on the target first language text to obtain a second language text may include: splitting the target first language text to obtain a plurality of target first language sub-texts, wherein each first language sub-text is a sentence-level text; performing reverse translation processing on the plurality of target first language sub-texts according to the reverse translation model to obtain a plurality of second language sub-texts; and performing fusion processing on the plurality of second language sub-texts to obtain the second language text.

In some embodiments, the plurality of target first language sub-texts may be a plurality of sentence-level texts obtained by splitting the chapter-level target first language text into sentences. In some embodiments, the reverse translation model may be a pre-trained machine learning model; the trained reverse translation model may output a second language sub-text based on an input first language sub-text. The second language sub-text may be the source language text of the first language sub-text and, as will be appreciated, may be sentence-level text. For the training process of the reverse translation model, reference may be made to FIG. 2 and the related description thereof, which are not repeated here.

In some embodiments, the plurality of second language sub-texts may be subjected to fusion processing to obtain the second language text. In some embodiments, the fusion processing may include concatenation. In some embodiments, the plurality of second language sub-texts may be concatenated according to the sentence structure of the target first language text. For example, still taking the sentence sequence of the target first language text as {A, B, C, D, E}, if the plurality of second language sub-texts are A', B', C', D', and E', the concatenation order of the plurality of second language sub-texts is A' → B' → C' → D' → E'.

In some embodiments, training data may be obtained based on the target first language text and the second language text. The target first language text is a language text obtained through translation and the second language text is a source language text, so a parallel data set can be constructed from the target first language text and the second language text, and the parallel data set can be used for training a translation model. In some embodiments, the translation model may be a model for chapter-level text translation.
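A minimal sketch of steps S13 and S14 is given below, assuming a simple period-based sentence splitter and a `reverse_translate` callable that stands in for the trained reverse translation model of FIG. 2; the function names and interfaces are illustrative assumptions, not part of the disclosure.

```python
import re
from typing import Callable, List, Tuple

def split_into_sentences(chapter_text: str) -> List[str]:
    """Split a chapter-level text into sentence-level sub-texts (naive splitter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", chapter_text) if s.strip()]

def build_training_pair(
    target_first_language_text: str,
    reverse_translate: Callable[[str], str],
) -> Tuple[str, str]:
    """Steps S13-S14: split, reverse-translate per sentence, fuse, and pair."""
    sub_texts = split_into_sentences(target_first_language_text)
    # Reverse-translate each sentence-level sub-text into the second language.
    second_language_sub_texts = [reverse_translate(s) for s in sub_texts]
    # Fusion: concatenate in the sentence order of the target first language text.
    second_language_text = " ".join(second_language_sub_texts)
    # One parallel pair (source, target) for training the chapter-level model.
    return second_language_text, target_first_language_text
```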

In the above embodiments, the chapter-level target first language text is split into a plurality of sentence-level target first language sub-texts; reverse translation processing is performed on the plurality of target first language sub-texts according to the reverse translation model to obtain a plurality of second language sub-texts; and fusion processing is performed on the plurality of second language sub-texts to obtain the second language text. Because the reverse translation model is used for sentence-level text translation, its training is simple and easy to apply, and the efficiency of obtaining the training data of the translation model can be improved. Moreover, because the target first language text better conforms to the semantics of the real language text and has strong fluency and coherence, although the plurality of target first language sub-texts obtained by splitting are sentence-level texts, they still embody the contextual semantic information, so that the chapter-level second language text obtained based on the plurality of sentence-level target first language sub-texts also has stronger fluency and coherence.

In addition, the present disclosure employs a BART model pre-trained on a large corpus, so no additional model needs to be trained. Meanwhile, the target first language text obtained through the BART model not only keeps the overall structural style of the initial first language text, but also, through the noise addition, increases the diversity of the generated language text. This further increases the diversity of the finally generated chapter-level second language text, effectively alleviates the shortage of training data for the translation model in the chapter translation scenario, and facilitates the training of the subsequent translation model.

FIG. 2 is a flow diagram illustrating training a reverse translation model in accordance with an exemplary embodiment. As shown in fig. 2, the process includes:

In step S21, a plurality of training samples are acquired, wherein each of the training samples includes a sample first language sub-text and a sample second language sub-text, both of which are sentence-level texts.

In some embodiments, the training samples may be data input into the initial reverse translation model for training the reverse translation model. In some embodiments, the sample first language sub-text may be a sentence-level translated target language text and the sample second language sub-text may be a sentence-level source language text. For more details about the sample first language sub-text and the sample second language sub-text, reference may be made to step S14 and its related description, which are not repeated herein.

In some embodiments, multiple training samples may be obtained through a database or by invoking an associated interface.

In step S22, parameters of the initial reverse translation model are iteratively updated based on a plurality of training samples to reduce the loss function values corresponding to the training samples, so as to obtain a trained reverse translation model.

In some embodiments, the reverse translation model may be a Transformer model. During the training of the reverse translation model, the parameters of the initial reverse translation model may be iteratively updated based on the plurality of training samples. Specifically, the parameters of the initial reverse translation model may be continuously adjusted to reduce the loss function values corresponding to the training samples until the loss function values satisfy a preset condition, for example, the loss function values converge or are less than a preset value. When the preset condition is satisfied, the model training is completed and a trained reverse translation model is obtained. The trained reverse translation model can perform reverse translation processing on the target first language sub-texts to obtain the second language sub-texts.

In some embodiments, the loss function value for each training sample is determined by: processing the sample first language sub-text through a reverse translation model to obtain a predicted second language sub-text; determining a loss function value based at least on a difference between the predicted second language sub-text and the sample second language sub-text.
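For illustration, the training procedure of FIG. 2 could be sketched as follows with PyTorch and a recent version of the Hugging Face transformers library; the MarianMT checkpoint, hyperparameters, and toy sample are assumptions of this example, standing in for the reverse translation model, and are not part of the disclosure.

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

# An example sentence-level checkpoint standing in for the initial reverse
# translation model (first language = English, second language = Chinese).
checkpoint = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Each training sample pairs a sample first language sub-text (input) with a
# sample second language sub-text (reference output); both are sentence-level.
samples = [("An example sentence.", "一个示例句子。")]

model.train()
for epoch in range(3):
    for first_lang, second_lang in samples:
        batch = tokenizer(first_lang, text_target=second_lang,
                          return_tensors="pt", padding=True, truncation=True)
        outputs = model(**batch)   # cross-entropy between prediction and labels
        loss = outputs.loss        # the loss function value for this sample
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```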

FIG. 3 is a block diagram illustrating a data generation apparatus 300 according to an exemplary embodiment. Referring to FIG. 3, the apparatus includes a noise-adding module 310, a processing module 320, a reverse translation module 330, and a training data determination module 340.

The noise-adding module 310 is configured to perform noise-adding processing on an initial first language text to obtain a noise-added first language text.

The processing module 320 is configured to process the noise-added first language text according to a pre-trained language model to obtain a target first language text.

The reverse translation module 330 is configured to perform reverse translation processing on the target first language text to obtain a second language text.

The training data determination module 340 is configured to obtain training data for training a translation model based on the target first language text and the second language text.

In some embodiments, the first language text is chapter-level text, and the reverse translation module 330 is further configured to:

splitting the target first language text to obtain a plurality of target first language sub-texts; wherein the first language sub-text is a sentence level text;

according to a reverse translation model, performing reverse translation processing on the plurality of target first language sub-texts to obtain a plurality of second language sub-texts;

and performing fusion processing on the plurality of second language sub-texts to obtain the second language text.

In some embodiments, the noise-adding module 310 is further configured to perform deletion-replacement noise-adding processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text lacks a preset number of words, or words at preset positions, relative to the initial first language text.

In some embodiments, the noise-adding module 310 is further configured to perform out-of-order noise-adding processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text differs in sentence order or word order from the initial first language text.

In some embodiments, the pre-trained language model is a bidirectional and auto-regressive Transformer (BART) model.

In some embodiments, the apparatus further comprises a training module configured to:

obtaining a plurality of training samples; wherein each of the training samples comprises a sample first language sub-text and a sample second language sub-text; wherein the sample first language sub-text and the sample second language sub-text are the sentence level text;

iteratively updating parameters of the initial reverse translation model based on a plurality of training samples to reduce loss function values corresponding to the training samples, and obtaining a trained reverse translation model;

wherein the loss function value corresponding to each training sample is determined by the following process:

processing the sample first language sub-text through the reverse translation model to obtain a predicted second language sub-text;

determining the loss function value based at least on a difference between the predicted second language sub-text and the sample second language sub-text.

In some embodiments, the reverse translation model is a Transformer model.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Fig. 4 is a block diagram illustrating a data generation apparatus 400 according to an example embodiment. For example, the apparatus 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 4, the apparatus 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an interface for input/output (I/O) 412, a sensor component 414, and a communication component 416.

The processing component 402 generally controls overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions to perform all or a portion of the steps of the data generation method described above. Further, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 can include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.

The memory 404 is configured to store various types of data to support operations at the apparatus 400. Examples of such data include instructions for any application or method operating on the device 400, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 404 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

Power components 406 provide power to the various components of device 400. Power components 406 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for apparatus 400.

The multimedia component 408 includes a screen that provides an output interface between the device 400 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 408 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 400 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 410 is configured to output and/or input audio signals. For example, audio component 410 includes a Microphone (MIC) configured to receive external audio signals when apparatus 400 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 404 or transmitted via the communication component 416. In some embodiments, audio component 410 also includes a speaker for outputting audio signals.

The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 414 includes one or more sensors for providing status assessments of various aspects of the apparatus 400. For example, the sensor component 414 may detect an open/closed state of the apparatus 400 and the relative positioning of components, such as the display and keypad of the apparatus 400. The sensor component 414 may also detect a change in the position of the apparatus 400 or of a component of the apparatus 400, the presence or absence of user contact with the apparatus 400, the orientation or acceleration/deceleration of the apparatus 400, and a change in the temperature of the apparatus 400. The sensor component 414 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 416 is configured to facilitate wired or wireless communication between the apparatus 400 and other devices. The apparatus 400 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 416 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described data generation methods.

In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 404 comprising instructions, executable by the processor 420 of the apparatus 400 to perform the data generation method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned data generation method when executed by the programmable apparatus.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
