Corpus generation method and device and computer equipment

Document No.: 907627 | Publication date: 2021-02-26

Reading note: This technology, Corpus generation method and device and computer equipment, was designed and created by 刘恒友, 李辰, 包祖贻, 黄睿, 徐光伟 and 李林琳 on 2019-08-19. Main content: the embodiments of the application provide a corpus generating method and device and computer equipment. A positive sample in a target domain is obtained, and a replacement word corresponding to at least one correct word in the positive sample is determined. The at least one correct word is replaced with the corresponding replacement word to obtain a negative sample corresponding to the positive sample, and a first error correction parallel corpus of the target domain is generated based at least on the positive sample and the negative sample. In this way, a large number of high-quality first error correction parallel corpora of the target domain can be obtained quickly.

1. A corpus generating method, comprising:

acquiring a positive sample in a target field;

determining a replacement word corresponding to at least one correct word in the positive sample;

replacing the at least one correct word with the corresponding replacement word to obtain a negative sample corresponding to the positive sample;

and generating a first error correction parallel corpus of the target field at least based on the positive sample and the negative sample.
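As an illustrative sketch only (not the patented implementation; the alignment table and the sample text below are invented for this example), the four steps of claim 1 can be expressed in Python as:

```python
import random

# Hypothetical word-alignment table (invented): correct word -> candidate replacement words.
ALIGNMENT = {"shoes": ["shose", "shoos"], "running": ["runing"]}

def generate_pair(positive_sample: str, rng: random.Random) -> tuple:
    """Return a <negative sample, positive sample> error-correction pair."""
    words = positive_sample.split()
    # Step 2: determine which correct words have corresponding replacement words.
    replaceable = [i for i, w in enumerate(words) if w in ALIGNMENT]
    if not replaceable:
        return positive_sample, positive_sample  # nothing to corrupt
    i = rng.choice(replaceable)
    # Step 3: replace the selected correct word with a corresponding replacement word.
    words[i] = rng.choice(ALIGNMENT[words[i]])
    # Step 4: the pair forms one entry of the error correction parallel corpus.
    return " ".join(words), positive_sample

neg, pos = generate_pair("running shoes", random.Random(0))
```

Each returned pair <negative sample, positive sample> would be one entry of the first error correction parallel corpus.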

2. The method according to claim 1, wherein the generating the first error correction parallel corpus of the target domain based on at least the positive examples and the negative examples comprises:

training a prediction model of the target domain based at least on the positive samples and the negative samples;

predicting, by using the prediction model, at least one negative text corresponding to a positive text to be predicted in the target domain;

and generating a first error correction parallel corpus of the target domain based on the positive text to be predicted and the at least one negative text corresponding thereto.

3. The method of claim 1, wherein the obtaining positive samples in a target domain comprises:

acquiring a search text with the search frequency meeting a search threshold value in the target field;

taking the search text as the positive sample.

4. The method of claim 1, wherein the determining a replacement word corresponding to at least one correct word in the positive sample comprises:

selecting at least one correct word to be replaced in the positive sample according to a preset replacement rule;

determining a replacement word corresponding to the at least one correct word.

5. The method according to claim 4, wherein the selecting at least one correct word to be replaced in the positive sample according to a preset replacement rule comprises:

and randomly selecting a first preset number of correct words to be replaced in the positive sample.

6. The method of claim 4, wherein the determining the replacement word corresponding to the at least one correct word comprises:

acquiring a word alignment data set; the word alignment data set establishes a corresponding relation between a correct word and a replacement word set acquired from multiple fields; wherein the set of replacement words comprises at least one replacement word;

determining a replacement word corresponding to the at least one correct word based on the word alignment dataset.

7. The method of claim 6, wherein obtaining the set of word alignment data comprises:

acquiring a word alignment corpus containing replacement words and correct words in the multiple fields; wherein the word alignment corpus at least comprises a second error correction parallel corpus data set in multiple fields;

performing word alignment processing on the word alignment corpus to obtain a data set of error-correcting words;

and obtaining a word alignment data set for establishing the alignment relation between the correct word and the replacement word set at least based on the error-correcting word data set, the shape-similar word data set and the common error word data set.
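A hedged sketch of claim 7's final step, merging the error-correcting word, shape-similar word and common error word data sets into one alignment table (all three input mappings below are invented placeholders):

```python
from collections import defaultdict

def build_word_alignment(*sources):
    """Union several correct-word -> replacement-word mappings into one table,
    e.g. the error-correcting word, shape-similar word and common-error data sets."""
    table = defaultdict(set)
    for source in sources:
        for correct, replacements in source.items():
            table[correct].update(replacements)
    return dict(table)

# Illustrative inputs (invented for this sketch):
corrected_pairs = {"shoes": ["shose"]}   # from word-aligned correction corpora
shape_similar = {"shoes": ["shoos"]}     # visually similar words
common_errors = {"running": ["runing"]}  # frequently observed misspellings
alignment = build_word_alignment(corrected_pairs, shape_similar, common_errors)
```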

8. The method of claim 6, wherein determining, based on the set of word alignment data, a replacement word corresponding to the at least one correct word comprises:

determining a set of replacement words in the word alignment data set corresponding to the at least one correct word;

determining a replacement word corresponding to the at least one correct word from the set of replacement words.

9. The method of claim 8, wherein the determining, from the set of replacement words, a replacement word corresponding to the at least one correct word comprises:

calculating the replacement probability corresponding to the replacement words in the replacement word set;

determining a replacement word corresponding to the at least one correct word from the set of replacement words based on the replacement probability.

10. The method of claim 9, wherein the calculating the replacement probability corresponding to the replacement word in the set of replacement words comprises:

counting the probability distribution of the correct words of the replacement words in the replacement word set;

and determining a replacement probability corresponding to the replacement word in the replacement word set based on the probability distribution.

11. The method of claim 9, wherein determining, based on the replacement probability, a replacement word from the set of replacement words that corresponds to the at least one correct word comprises:

taking the replacement probability as the replacement weight of the replacement word in the corresponding replacement word set;

and selecting a replacement word corresponding to the at least one correct word in the replacement word set through weighted random sampling based on the replacement weight.
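Claims 9 to 11 together describe choosing a replacement by weighted random sampling, with each weight derived from the frequency distribution of observed errors. A minimal sketch under that reading (the observed error list is invented):

```python
import random
from collections import Counter

def sample_replacement(observed_errors, rng):
    """Pick a replacement word by weighted random sampling, with weights equal
    to each word's share of the observed error frequency (claims 10 and 11)."""
    counts = Counter(observed_errors)             # probability distribution (claim 10)
    words = list(counts)
    total = sum(counts.values())
    weights = [counts[w] / total for w in words]  # replacement probabilities
    return rng.choices(words, weights=weights, k=1)[0]

# Illustrative counts: "shose" was observed twice as often as "shoos".
picked = sample_replacement(["shose", "shose", "shoos"], random.Random(42))
```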

12. The method of claim 6, wherein determining, based on the set of word alignment data, a replacement word corresponding to the at least one correct word comprises:

judging whether the word alignment data set has a replacement word corresponding to the at least one correct word;

if so, executing the step of replacing the at least one correct word by using the corresponding replacement word to obtain a negative sample corresponding to the positive sample;

if not, re-determining at least one correct word to be replaced in the positive sample according to a preset replacement rule until determining that a replacement word corresponding to the at least one correct word exists in the word alignment data set.

13. The method of claim 2, wherein training the predictive model for the target domain based on at least the positive and negative examples comprises:

generating a first reverse error correction parallel corpus of the target field based on the corresponding relation of the positive sample and the negative sample;

acquiring a second reverse error correction parallel corpus in multiple fields;

and training a prediction model of the target field based on the first reverse error correction parallel corpus and the second reverse error correction parallel corpus.

14. The method according to claim 13, wherein the obtaining of the second reverse error correction parallel corpus in the multiple fields comprises:

acquiring error correction parallel corpora in the multiple fields;

and performing reverse mapping processing on the corresponding relation between the positive text and the negative text in the error correction parallel corpus in the multiple fields to obtain a second reverse error correction parallel corpus in the multiple fields.
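The reverse mapping of claim 14 simply swaps the two sides of each parallel-corpus entry, turning <negative text, positive text> pairs into <positive text, negative text> pairs. For example (the entry is invented):

```python
def reverse_corpus(pairs):
    """Swap each <negative text, positive text> entry into <positive, negative>,
    producing a reverse error correction parallel corpus."""
    return [(positive, negative) for negative, positive in pairs]

corpus = [("runing shose", "running shoes")]  # illustrative entry
reverse = reverse_corpus(corpus)
```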

15. The method according to claim 13, wherein the training of the prediction model of the target domain based on the first reverse error correction parallel corpus and the second reverse error correction parallel corpus comprises:

pre-training a prediction model of the target domain based on the second reverse error correction parallel corpus to obtain an initial prediction model;

and carrying out optimization training on the initial prediction model based on the first reverse error correction parallel corpus to obtain the prediction model.

16. The method according to claim 15, wherein the performing of optimization training on the initial prediction model based on the first reverse error correction parallel corpus to obtain the prediction model comprises:

acquiring a third reverse error correction parallel corpus in the target field; the third reverse error correction parallel corpus is generated based on a second error correction parallel corpus obtained by manual labeling in the target field;

and performing optimization training on the initial prediction model based on an optimization training sample generated by the first reverse error correction parallel corpus and the third reverse error correction parallel corpus to obtain the prediction model.

17. The method according to claim 2, wherein the obtaining of the positive text to be predicted in the target domain comprises:

acquiring a search text with the search frequency meeting a search threshold value in the target field;

and taking the search text as the positive text to be predicted.

18. The method according to claim 2, wherein the obtaining of the at least one negative text corresponding to the text to be predicted by using the prediction model of the target field comprises:

predicting the positive text to be predicted by using the prediction model to obtain at least one negative text sorted by matching priority; wherein the matching priority represents the degree to which each of the at least one negative text matches the positive text to be predicted.

19. The method according to claim 2, wherein after the predicting, by using the prediction model in the target domain, at least one negative text corresponding to the text to be predicted, the method further comprises:

and determining the matching priority of the at least one negative text and the positive text to be predicted according to a preset condition.

20. The method according to claim 18 or 19, wherein the generating of the first error correction parallel corpus of the target domain based on the positive text to be predicted and the corresponding at least one negative text comprises:

selecting, according to the matching priority, the first N of the at least one negative text as predicted negative texts; wherein N is a preset value greater than or equal to 1;

and generating a first error correction parallel corpus of the target domain based on the corresponding relation between the predicted negative texts and the positive text to be predicted.
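Claim 20's top-N selection can be sketched as follows (the score values standing in for the matching priority are invented; the patent leaves the priority metric to claims 18 and 19):

```python
def top_n_pairs(positive, scored_negatives, n):
    """Keep the N negative texts with the highest matching priority and pair
    each with the positive text to be predicted (N >= 1)."""
    ranked = sorted(scored_negatives, key=lambda item: item[1], reverse=True)
    return [(negative, positive) for negative, _score in ranked[:n]]

pairs = top_n_pairs(
    "running shoes",
    [("runing shoes", 0.9), ("ruining shoes", 0.4), ("runnin shoes", 0.7)],
    n=2,
)
```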

21. The method according to claim 1, wherein after generating the first error correction parallel corpus of the target domain based on at least the positive examples and the negative examples, further comprising:

and training an error correction model of the target field at least based on the first error correction parallel corpus.

22. The method according to claim 21, wherein said training an error correction model of said target domain based on at least said first error correction parallel corpus comprises:

acquiring a second error correction parallel corpus of the target field; the second error correction parallel corpus is an error correction parallel corpus generated by manual labeling in the target field;

and training an error correction training sample generated based on the first error correction parallel corpus and the second error correction parallel corpus to obtain an error correction model of the target field.

23. The method according to claim 21, wherein after training the error correction model of the target domain based on the first error correction parallel corpus, the method further comprises:

acquiring a text to be corrected in a target field input by a user;

determining a correct text corresponding to the text to be corrected by using the error correction model of the target field;

and returning the correct text for the user terminal to search the target object based on the correct text.

24. A corpus generating method, comprising:

acquiring a positive text to be predicted in a target field;

predicting and obtaining at least one negative text corresponding to the text to be predicted by using the prediction model of the target field; wherein the prediction model is obtained based on at least positive samples in the target field and negative samples corresponding to the positive samples; the negative sample is obtained by replacing at least one correct word in the positive sample with a replacement word corresponding to each correct word;

generating a first error correction parallel corpus of the target domain based on the positive text to be predicted and the at least one negative text corresponding thereto; the first error correction parallel corpus is used for training an error correction model of the target domain, so that the error correction model can correct a to-be-corrected text, input by the user, in the target domain and obtain the correct text corresponding to the to-be-corrected text.

25. A text error correction method, comprising:

acquiring at least one candidate text output by a text input interface aiming at user input operation;

judging whether a text to be corrected in the target field exists in the at least one candidate text;

if the text to be corrected exists, determining a correct text corresponding to the text to be corrected by using the error correction model of the target domain; the error correction model is trained on a first error correction parallel corpus of the target domain; the first error correction parallel corpus is generated by using a prediction model of the target domain to predict at least one negative text corresponding to a positive text to be predicted in the target domain, and is based on the positive text to be predicted and the at least one negative text corresponding thereto; the prediction model is trained on positive samples of the target domain and negative text samples obtained by replacing at least one correct word in each positive sample with a corresponding replacement word;

and replacing, in the at least one candidate text, the text to be corrected with the correct text, so that the user terminal can search for the target object based on any replaced candidate text determined by the user.

26. A corpus generating device, comprising:

the first positive sample acquisition module is used for acquiring a positive sample in a target field;

a first alternative word determination module, configured to determine an alternative word corresponding to at least one correct word in the positive sample;

the first negative sample acquisition module is used for replacing the at least one correct word by using the corresponding replacement word to acquire a negative sample corresponding to the positive sample;

and the first training sample generation module is used for generating a first error correction parallel corpus of the target field at least based on the positive sample and the negative sample.

27. A corpus generating device, comprising:

the to-be-predicted positive text acquisition module is used for acquiring a positive text to be predicted in the target domain;

the negative text prediction module is used for predicting and obtaining at least one negative text corresponding to the text to be predicted by using the prediction model in the target field; wherein the prediction model is obtained based on at least positive samples in the target field and negative samples corresponding to the positive samples; the negative sample is obtained by replacing at least one correct word in the positive sample with a replacement word corresponding to each correct word;

the second training sample generation module is used for generating a first error correction parallel corpus of the target domain based on the positive text to be predicted and the at least one negative text corresponding thereto; the first error correction parallel corpus is used for training an error correction model of the target domain, so that the error correction model can correct a to-be-corrected text, input by the user, in the target domain and obtain the correct text corresponding to the to-be-corrected text.

28. A computer device comprising a processing component and a storage component; the storage component stores one or more computer instructions, which are invoked and executed by the processing component;

the processing component is configured to:

acquiring a positive sample in a target field;

determining a replacement word corresponding to at least one correct word in the positive sample;

replacing the at least one correct word with the corresponding replacement word to obtain a negative sample corresponding to the positive sample;

and generating a first error correction parallel corpus of the target field at least based on the positive sample and the negative sample.

29. A computer device comprising a processing component and a storage component; the storage component stores one or more computer instructions, which are invoked and executed by the processing component;

the processing component is configured to:

acquiring a positive text to be predicted in a target field;

predicting and obtaining at least one negative text corresponding to the text to be predicted by using the prediction model of the target field; wherein the prediction model is obtained based on at least positive samples in the target field and negative samples corresponding to the positive samples; the negative sample is obtained by replacing at least one correct word in the positive sample with a replacement word corresponding to each correct word;

generating a first error correction parallel corpus of the target domain based on the positive text to be predicted and the at least one negative text corresponding thereto; the first error correction parallel corpus is used for training an error correction model of the target domain, so that the error correction model can correct a to-be-corrected text, input by the user, in the target domain and obtain the correct text corresponding to the to-be-corrected text.

Technical Field

The embodiment of the application relates to the technical field of networks, in particular to a corpus generating method and device and computer equipment.

Background

With the rapid development of internet technology, users increasingly shop through online shopping malls. When a user searches for a desired commodity, the text information of the commodity to be searched generally needs to be input in a search box of the user terminal; the user terminal then searches based on the text information and displays the matching commodities to the user.

However, when inputting the text information of the commodity to be searched, the user may enter a wrong text due to limited knowledge or a slip of the hand during input. To improve the user experience, search error correction is performed on the wrong text information input by the user to obtain correct text information that meets the user's requirements; the search is then performed based on the correct text information, and the correct search result is returned to the user. For example, a user who wants to search for shepherd's purse may enter a mistyped query; search error correction recovers the correct text "shepherd's purse", and commodities related to shepherd's purse, found based on the corrected text, are returned to the user.

At present, search error correction for commodities in the e-commerce field mainly relies on manual labeling to establish the correspondence between wrong texts and correct texts in search text information, generating an error correction parallel corpus data set in which each entry has the format <wrong text, correct text>. The error correction parallel corpus data set is then used as training samples and, in combination with a statistical analysis method or a supervised machine learning method, fed into an error correction model for training, yielding an error correction model for the e-commerce field. The trained model is then used to correct wrong text information input by the user and obtain correct text information.
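As a toy illustration of this format (entries invented; real data would come from manually annotated search logs), an error correction parallel corpus data set and the training inputs/targets derived from it might look like:

```python
# Illustrative <wrong text, correct text> entries (invented for this sketch).
parallel_corpus = [
    ("runing shoes", "running shoes"),
    ("womens hand bag", "women's handbag"),
]

# A supervised error correction model would map each wrong text to its correct text.
training_inputs = [wrong for wrong, _correct in parallel_corpus]
training_targets = [correct for _wrong, correct in parallel_corpus]
```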

In practice, the error correction effect of the model is strongly influenced by the quality and quantity of the training samples: the higher the quality and the larger the quantity, the better the error correction effect of the trained model, and vice versa. However, because the number of error correction parallel corpora in the e-commerce field is currently small, how to quickly and effectively obtain a large number of high-quality error correction parallel corpora to improve the error correction effect of the error correction model has become an urgent problem.

Disclosure of Invention

The embodiments of the application provide a corpus generating method, a corpus generating device and computer equipment, which can automatically generate a large number of high-quality error correction parallel corpora in a target domain without manual annotation, thereby laying a foundation for improving the error correction effect of an error correction model.

In a first aspect, an embodiment of the present application provides a corpus generating method, including:

acquiring a positive sample in a target field;

determining a replacement word corresponding to at least one correct word in the positive sample;

replacing the at least one correct word with the corresponding replacement word to obtain a negative sample corresponding to the positive sample;

and generating a first error correction parallel corpus of the target field at least based on the positive sample and the negative sample.

In a second aspect, an embodiment of the present application provides a corpus generating method, including:

acquiring a positive text to be predicted in a target field;

predicting and obtaining at least one negative text corresponding to the text to be predicted by using the prediction model of the target field; wherein the prediction model is obtained based on at least positive samples in the target field and negative samples corresponding to the positive samples; the negative sample is obtained by replacing at least one correct word in the positive sample with a replacement word corresponding to each correct word;

generating a first error correction parallel corpus of the target domain based on the positive text to be predicted and the at least one negative text corresponding thereto; the first error correction parallel corpus is used for training an error correction model of the target domain, so that the error correction model can correct a to-be-corrected text, input by the user, in the target domain and obtain the correct text corresponding to the to-be-corrected text.

In a third aspect, an embodiment of the present application provides a text error correction method, including:

acquiring at least one candidate text output by a text input interface aiming at user input operation;

judging whether a text to be corrected in the target field exists in the at least one candidate text;

if the text to be corrected exists, determining a correct text corresponding to the text to be corrected by using the error correction model of the target domain; the error correction model is trained on a first error correction parallel corpus of the target domain; the first error correction parallel corpus is generated by using a prediction model of the target domain to predict at least one negative text corresponding to a positive text to be predicted in the target domain, and is based on the positive text to be predicted and the at least one negative text corresponding thereto; the prediction model is trained on positive samples of the target domain and negative text samples obtained by replacing at least one correct word in each positive sample with a corresponding replacement word;

and replacing, in the at least one candidate text, the text to be corrected with the correct text, so that the user terminal can search for the target object based on any replaced candidate text determined by the user.
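The candidate-replacement flow of this aspect can be sketched as follows, with a toy stub standing in for the trained error correction model (both function names and the sample data are hypothetical):

```python
def replace_candidates(candidates, correct_fn):
    """Replace any candidate flagged as erroneous with its correction, so the
    user terminal can search the target object from a corrected candidate."""
    return [correct_fn(text) or text for text in candidates]

# Hypothetical stand-in for the trained error correction model:
# returns the correction, or None when the text is already correct.
def toy_model(text):
    return {"runing shoes": "running shoes"}.get(text)

fixed = replace_candidates(["runing shoes", "sneakers"], toy_model)
```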

In a fourth aspect, an embodiment of the present application provides a corpus generating device, including:

the first positive sample acquisition module is used for acquiring a positive sample in a target field;

a first alternative word determination module, configured to determine an alternative word corresponding to at least one correct word in the positive sample;

the first negative sample acquisition module is used for replacing the at least one correct word by using the corresponding replacement word to acquire a negative sample corresponding to the positive sample;

and the first training sample generation module is used for generating a first error correction parallel corpus of the target field at least based on the positive sample and the negative sample.

In a fifth aspect, an embodiment of the present application provides a corpus generating device, including:

the to-be-predicted positive text acquisition module is used for acquiring a positive text to be predicted in the target domain;

the negative text prediction module is used for predicting and obtaining at least one negative text corresponding to the text to be predicted by using the prediction model in the target field; wherein the prediction model is obtained based on at least positive samples in the target field and negative samples corresponding to the positive samples; the negative sample is obtained by replacing at least one correct word in the positive sample with a replacement word corresponding to each correct word;

the second training sample generation module is used for generating a first error correction parallel corpus of the target domain based on the positive text to be predicted and the at least one negative text corresponding thereto; the first error correction parallel corpus is used for training an error correction model of the target domain, so that the error correction model can correct a to-be-corrected text, input by the user, in the target domain and obtain the correct text corresponding to the to-be-corrected text.

In a sixth aspect, an embodiment of the present application provides a computer device, including a processing component and a storage component; the storage component stores one or more computer instructions, which are invoked and executed by the processing component;

the processing component is configured to:

acquiring a positive sample in a target field;

determining a replacement word corresponding to at least one correct word in the positive sample;

replacing the at least one correct word with the corresponding replacement word to obtain a negative sample corresponding to the positive sample;

generating a first error correction parallel corpus of the target domain based at least on the positive sample and the negative sample; the first error correction parallel corpus is used for training an error correction model of the target domain, so that the error correction model can correct a to-be-corrected text, input by the user, in the target domain and obtain the correct text corresponding to the to-be-corrected text.

In a seventh aspect, an embodiment of the present application provides a computer device, including a processing component and a storage component; the storage component stores one or more computer instructions, which are invoked and executed by the processing component;

the processing component is configured to:

acquiring a positive text to be predicted in a target field;

predicting and obtaining at least one negative text corresponding to the text to be predicted by using the prediction model of the target field; wherein the prediction model is obtained based on at least positive samples in the target field and negative samples corresponding to the positive samples; the negative sample is obtained by replacing at least one correct word in the positive sample with a replacement word corresponding to each correct word;

generating a first error correction parallel corpus of the target domain based on the positive text to be predicted and the at least one negative text corresponding thereto; the first error correction parallel corpus is used for training an error correction model of the target domain, so that the error correction model can correct a to-be-corrected text, input by the user, in the target domain and obtain the correct text corresponding to the to-be-corrected text.

Compared with the prior art, the application can obtain the following technical effects:

the embodiments of the application provide a corpus generating method, a corpus generating device and computer equipment. The method determines a replacement word corresponding to at least one correct word of a positive sample in a target domain and replaces the at least one correct word with the corresponding replacement word to obtain a negative sample corresponding to the positive sample. The correspondence between positive and negative samples is thus established automatically, so that a large number of high-quality first error correction parallel corpora of the target domain can be obtained quickly and effectively. Training an error correction model of the target domain with the obtained first error correction parallel corpus further improves the error correction effect of the model; the model then corrects the to-be-corrected text input by the user in the target domain and returns the correct text that best meets the user's search requirement, greatly improving the user experience.

These and other aspects of the present application will be more readily apparent from the following description of the embodiments.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a flow chart illustrating an embodiment of a corpus generation method according to the present application;

FIG. 2 is a flow chart illustrating a corpus generation method according to another embodiment of the present disclosure;

FIG. 3 shows a schematic flow chart diagram of yet another embodiment of a data processing method according to the present application;

FIG. 4 is a schematic flow chart diagram illustrating a corpus generation method according to yet another embodiment of the present application;

FIG. 5 shows a schematic flow chart diagram of another embodiment of a data processing method according to the present application;

FIG. 6 is a schematic flow chart diagram illustrating a further embodiment of a text correction method according to the present application;

FIG. 7 is a schematic diagram illustrating a text correction method based on an input text interface according to the present application;

FIG. 8 is a schematic diagram illustrating an embodiment of a corpus generating device according to the present application;

FIG. 9 is a schematic structural diagram illustrating another embodiment of a corpus generating device according to the present application;

FIG. 10 is a schematic diagram illustrating an architecture of yet another embodiment of a data processing apparatus according to the present application;

FIG. 11 is a schematic diagram illustrating a corpus generating apparatus according to another embodiment of the present application;

FIG. 12 is a schematic block diagram illustrating a data processing apparatus according to yet another embodiment of the present application;

FIG. 13 is a schematic block diagram illustrating another embodiment of a data processing apparatus according to the present application;

FIG. 14 is a schematic diagram illustrating a structure of another embodiment of a text correction device according to the present application;

FIG. 15 illustrates a schematic block diagram of one embodiment of a computer device, in accordance with the present application;

FIG. 16 illustrates a schematic block diagram of one embodiment of a computer device, in accordance with the present application;

FIG. 17 is a schematic diagram illustrating an embodiment of a computer device according to the present application.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

Some of the flows described in the specification, claims, and figures of this application include operations that occur in a particular order. It should be clearly understood that these operations may be performed out of the order in which they appear herein or in parallel. Operation numbers such as 101 and 102 are used only to distinguish operations; the numbers themselves do not imply any order of execution. The flows may also include more or fewer operations, which may be performed sequentially or in parallel. Note that the terms "first", "second", and so on herein distinguish different messages, devices, modules, etc.; they neither imply a sequential order nor require that "first" and "second" be of different types.

In the embodiment of the application, a replacement word corresponding to at least one correct word of a positive sample in the target field is determined, and the correct word is replaced with the corresponding replacement word to obtain a negative sample corresponding to the positive sample. This automatically establishes the correspondence between positive and negative samples, so a large number of high-quality first error correction parallel corpora of the target field can be obtained quickly and effectively. Training an error correction model of the target field with the obtained first error correction parallel corpus further improves the error correction effect of the model: the model corrects the text to be corrected that a user enters in the target field and returns the correct text that best matches the user's search intent, greatly improving the user experience.

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Fig. 1 is a schematic flow chart of an embodiment of a corpus generating method according to an embodiment of the present application. The method may comprise the steps of:

101: a positive sample in the target domain is acquired.

Alternatively, the target field may be any application field that uses search and query technology, such as the rapidly developing new retail field or the live video field; it is not specifically limited here.

Generating the error correction parallel corpus of the target field is in fact the process of establishing the correspondence between a correct text searched in the target field and the erroneous texts corresponding to it, so as to generate error correction parallel corpora of the form <error text, correct text>. A large number of such parallel corpora of the target field are obtained as training samples of an error correction model, so that the error correction model of the target field can be obtained by training.

In order to improve the quality of the error correction parallel corpus, the correct search text may be obtained from the historical search texts of the target field. As an optional implementation, obtaining the positive sample in the target field may include:

acquiring a search text with the search frequency meeting a search threshold value in the target field;

taking the search text as the positive sample.

When a user searches for a commodity through a search box, the user enters search text such as the commodity name or the shop name. A small number of users enter erroneous search texts, and since the reasons for the errors vary from user to user, many different erroneous search texts may exist. It can be understood, however, that when users notice the error they re-enter the correct search text, so statistically the search frequency of the correct text tends to be higher than that of any erroneous text.

Therefore, based on the historical search records of the target field, the high-frequency search texts in the records, that is, the search texts whose search frequency is greater than a specified search threshold, are taken as positive samples. The search threshold may be determined from the probability distribution of search frequencies; for example, if the number of search texts searched more than 500 times is much higher than the number searched fewer than 500 times, the threshold may be set to 500. High-frequency search texts are then screened out by the threshold and used as positive samples. Of course, to ensure that the positive samples contain as few erroneous search texts as possible, a higher threshold may be set to reduce the probability that a positive sample contains an error.
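As a minimal sketch of this screening step (the log contents and the small threshold value here are hypothetical; in practice the threshold, e.g. 500, is chosen from the frequency distribution), the high-frequency texts can be filtered with a frequency counter:

```python
from collections import Counter

# Hypothetical historical search log of the target field.
search_log = ["red dress", "red dress", "red drss",
              "red dress", "blue shoes", "blue shoes"]
SEARCH_THRESHOLD = 2  # illustrative; set from the frequency distribution

# Keep only texts searched more often than the threshold as positive samples.
freq = Counter(search_log)
positive_samples = [text for text, count in freq.items()
                    if count > SEARCH_THRESHOLD]
print(positive_samples)  # only the high-frequency text survives
```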

Alternatively, to further improve the quality of the positive samples, a positive sample may be a commodity information text based on a commodity name or a shop name in the target field. The samples are not specifically limited here and may be selected according to actual needs.

102: determining a replacement word corresponding to at least one correct word in the positive sample.

103: and replacing the at least one correct word with the corresponding replacement word to obtain a negative example corresponding to the positive example.

Since a positive sample is text, it can be split into at least one word through word segmentation. It can be understood that, in general, the erroneous text in an error correction parallel corpus is not completely different from the correct text; the two usually differ in only one or a few words.

Therefore, to improve sample quality and avoid negative samples that differ too much from the positive sample and no longer match actual search behavior, only a small part of the correct words in the positive sample are replaced. Replacing at least one correct word in the positive sample yields the negative sample of the target field corresponding to the positive sample. Optionally, in some embodiments, the determining the replacement word corresponding to the at least one correct word in the positive sample may include:

selecting at least one correct word to be replaced in the positive sample according to a preset replacement rule;

determining a replacement word corresponding to the at least one correct word.

The preset replacement rule may be set according to sample requirements. For example, words may be selected by their position in the positive sample: a word in the middle may be preferred, or the first or last word may be chosen. Words may also be selected by part of speech: nouns and verbs first, then adjectives, auxiliary words, and so on. Of course, the correct words of the positive sample may also be selected at random. As an optional implementation, the selecting, according to the preset replacement rule, at least one correct word to be replaced in the positive sample may include:

and randomly selecting a first preset value of correct words to be replaced in the positive sample.

In practical application at least one correct word is selected, so the first preset value is greater than or equal to 1. Several words may be selected to improve sample diversity, but a maximum number of replacements should be set according to the replacement effect; for example, at most three correct words of a positive sample are replaced, so as to guarantee the quality of the negative sample.

Meanwhile, because search texts are diverse, some positive samples contain only one word while others contain many. A threshold range may therefore be set, for example [1, 3]: depending on the number of words in the positive sample, at least one and at most three correct words are randomly selected for replacement, so that negative samples corresponding to the positive sample can be obtained more flexibly and diversely.
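The random selection within the [1, 3] range can be sketched as follows; the segmented word list is a hypothetical example:

```python
import random

def pick_correct_words(words, low=1, high=3):
    """Randomly pick between `low` and `high` correct words to replace,
    capped by the number of words in the positive sample."""
    k = random.randint(low, min(high, len(words)))
    return random.sample(words, k)

# Hypothetical positive sample after word segmentation.
words = ["red", "cotton", "summer", "dress"]
picked = pick_correct_words(words)
```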

In practice, an erroneous text produced by erroneous user input usually differs from the intended search text in individual words, often because a similar-shaped character, homophone, or synonym of the correct word was entered. Therefore similar-shaped characters, homophones, synonyms, and the like can serve as replacement words for the correct words, and at least one correct word in the positive sample can be replaced with them.

As an optional implementation, the determining the replacement word corresponding to the at least one correct word may include:

acquiring a word alignment data set; the word alignment data set establishes a corresponding relation between a correct word and a replacement word set acquired from multiple fields; the set of replacement words comprises at least one replacement word.

Determining a replacement word corresponding to the at least one correct word based on the word alignment dataset.

The word alignment data set establishes a mapping from each correct word to a set of erroneous words, where the set contains at least one erroneous word. Therefore, the replacement word corresponding to the at least one correct word can be determined based on this mapping.

Optionally, the acquiring the word alignment data set may include:

acquiring a word alignment corpus containing replacement words and correct words in the multiple fields; wherein the word alignment corpus at least comprises a second error correction parallel corpus data set in multiple fields;

performing word alignment processing on the word alignment corpus to obtain a data set of error-correcting words;

and obtaining a word alignment data set for establishing the alignment relation between the correct word and the replacement word set at least based on the error-correcting word data set, the shape-similar word data set and the common error word data set.

To further enrich the samples and make the mapping established by the word alignment data set better match users' actual error patterns, the word alignment data set is generated from the second error correction parallel corpora of multiple fields, a similar-shaped-character data set, a common-error word table, and the like, where the multiple fields may include the target field.

Because the second error correction parallel corpora of the multiple fields map correct texts to erroneous texts, word alignment processing must be performed on them to derive the correspondence between correct words and erroneous words from the correspondence between correct texts and erroneous texts. The similar-shaped-character data set and the common-error word table can be obtained from existing similar-shaped-character tables and common-error word tables.

The correspondence between each correct word and its replacement word set is then established from the obtained correct-word-to-erroneous-word correspondences, similar-shaped characters, homophones, synonyms, and the like, yielding the word alignment data set.
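Merging the several sources into one correct-word-to-replacement-set mapping can be sketched as below; all word pairs are hypothetical examples:

```python
from collections import defaultdict

# Hypothetical sources: word-aligned error-correction pairs, a
# similar-shape table, and a common-error table, each (correct, replacement).
aligned_error_pairs = [("dress", "drss"), ("dress", "dres")]
similar_shape_pairs = [("dress", "drees")]
common_error_pairs = [("shoes", "shose")]

# Merge all sources into one correct-word -> replacement-word-set mapping.
word_alignment = defaultdict(set)
for pairs in (aligned_error_pairs, similar_shape_pairs, common_error_pairs):
    for correct, replacement in pairs:
        word_alignment[correct].add(replacement)
```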

Further, the determining, based on the word alignment dataset, a replacement word corresponding to the at least one correct word may include:

determining a set of replacement words in the word alignment data set corresponding to the at least one correct word;

determining a replacement word corresponding to the at least one correct word from the set of replacement words.

Because the word alignment data set establishes a mapping between each correct word and a replacement word set, when determining a replacement word for at least one correct word, the replacement word set corresponding to the correct word is first determined from that mapping, and then at least one replacement word is selected from the set according to the preset number of replacements.

For example, to improve the diversity of negative samples and obtain more first error correction parallel corpora, one positive sample may optionally be replaced N times, each replacement producing one negative sample. N first error correction parallel corpora can then be obtained from one positive sample and its N negative samples, greatly enriching the training samples.

However, not every replacement word actually meets the replacement requirement. For example, when the semantics or fluency of the negative sample obtained after replacement changes greatly, the quality of the negative sample is poor, which degrades the training of the error correction model.

Therefore, in order to improve the sample quality of the negative examples, the probability that the error word is corrected to the corresponding replacement word in the second error-corrected parallel corpus may be determined. As an optional implementation manner, the determining, from the set of replacement words, a replacement word corresponding to the at least one correct word may include:

calculating the replacement probability corresponding to the replacement words in the replacement word set;

determining a replacement word corresponding to the at least one correct word from the set of replacement words based on the replacement probability.

The higher the replacement probability of a replacement word, the more frequently it appears in search texts and the more often it is corrected into the correct word; conversely, a replacement word that appears rarely in search texts is rarely corrected into the correct word. Therefore, preferentially selecting replacement words with high replacement probability yields negative samples of higher quality.

In this embodiment of the present application, a data statistics manner may be adopted to determine a replacement probability corresponding to each replacement word, and specifically, as an implementation manner that can be implemented, the calculating a replacement probability corresponding to a replacement word in the replacement word set may include:

counting the probability distribution with which each replacement word in the replacement word set is corrected into the corresponding correct word;

and determining a replacement probability corresponding to the replacement word in the replacement word set based on the probability distribution.

The actual second error correction parallel corpora of multiple fields contain many correct texts and erroneous texts. By counting how many times each erroneous word appears in erroneous texts and how many times it is corrected into each correct word, the probability that each replacement word in a replacement word set of the word alignment data set is corrected into the corresponding correct word can be calculated. For example, if word A is corrected into word D 50 times, word B into word D 10 times, and word C into word D 40 times, then the probability distribution of the replacement word set A, B, C for word D is [0.5, 0.1, 0.4]. In practice a probability distribution function of each replacement word being corrected into its correct word may also be fitted, and the replacement probability of each replacement word determined from it. The description here is only illustrative; the probability distribution of each replacement word being corrected into the corresponding correct word can be counted according to the actual situation.

After the replacement probabilities of the replacement words are determined from the probability distribution, the replacement words in the set may be ranked by replacement probability, and those with higher probability are preferentially used to replace words in the positive sample. When several replacement words must be selected to obtain several negative samples, negative samples of decreasing quality can be obtained in order of replacement probability from high to low.
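Using the counts from the example above (A corrected into D 50 times, B 10 times, C 40 times), the replacement probabilities and ranking can be computed as a short sketch:

```python
# Counts of how often each replacement word was corrected into word D.
correction_counts = {"A": 50, "B": 10, "C": 40}

# Normalize the counts into replacement probabilities.
total = sum(correction_counts.values())
replacement_probs = {w: c / total for w, c in correction_counts.items()}

# Rank replacement words by probability, highest first.
ranked = sorted(replacement_probs, key=replacement_probs.get, reverse=True)
```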

When more than one correct word of the positive sample is to be replaced and each correct word has several candidate replacement words, the candidates can be permuted and combined into multiple replacement combinations to further improve the diversity of the negative samples. For example, suppose correct words D and E are selected in the positive sample, with replacement words A and B determined for D and replacement words W and H for E. Permutation and combination yield the replacement combinations (A, W), (A, H), (B, W), and (B, H) for (D, E), so four negative samples corresponding to the positive sample are obtained and four first error correction parallel corpora can be generated.
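The permutation and combination in this example corresponds to a Cartesian product of the candidate sets:

```python
from itertools import product

# The example from the text: correct words (D, E), with candidate
# replacement words {A, B} for D and {W, H} for E.
replacements_for_D = ["A", "B"]
replacements_for_E = ["W", "H"]

# Each tuple replaces (D, E) and yields one negative sample.
combinations = list(product(replacements_for_D, replacements_for_E))
```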

It can be understood that when the amount of positive sample data is large, the probability that the same correct word is selected repeatedly is very high; for example, positive sample P contains correct word D and positive sample Q also contains correct word D. If the selection always follows the highest replacement probability, every occurrence of D would be replaced by the same replacement word A, and sample diversity would be lost, because even replacement words with a small replacement probability still appear in real erroneous texts.

In order to further increase the diversity of the negative examples and improve the quality of the examples, as an optional implementation, the determining, based on the replacement probability, a replacement word corresponding to the at least one correct word from the replacement word set may include:

taking the replacement probability as the replacement weight of the replacement word in the corresponding replacement word set;

and selecting a replacement word corresponding to the at least one correct word in the replacement word set through weighted random sampling based on the replacement weight.

Using the replacement probability as the replacement weight of each replacement word and selecting replacement words by weighted random sampling avoids always replacing the same correct word with the same replacement word, and better matches the diversity of erroneous texts in real searches, greatly improving the quality of the first error correction parallel corpus.
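The weighted random sampling step can be sketched with the standard library; the word set and weights reuse the earlier hypothetical example:

```python
import random

# Replacement words and their replacement probabilities used as weights,
# so low-probability replacements still occasionally appear.
replacement_set = ["A", "B", "C"]
weights = [0.5, 0.1, 0.4]

# Weighted random sampling of one replacement word.
chosen = random.choices(replacement_set, weights=weights, k=1)[0]
```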

104: and generating a first error correction parallel corpus of the target field at least based on the positive sample and the negative sample.

The first error correction parallel corpus is used for training an error correction model of the target field, so that the error correction model is used for correcting the error of the text to be corrected in the target field input by the user, and the correct text corresponding to the text to be corrected is obtained.

Based on the corresponding relationship between the positive sample and the negative sample, the positive sample can be used as a correct text, the negative sample can be used as an incorrect text to generate a first error correction parallel corpus of the target field, and the data format of the first error correction parallel corpus can be expressed as < negative sample, positive sample >.
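Pairing each generated negative sample with its positive sample in the <negative sample, positive sample> format can be sketched as below; the strings are hypothetical examples:

```python
# One positive sample and the negative samples generated from it by
# word replacement (hypothetical texts).
positive = "red dress"
negatives = ["red drss", "red dres"]

# First error correction parallel corpus: <negative sample, positive sample>.
first_corpus = [(neg, positive) for neg in negatives]
```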

Since the vocabulary of the word alignment data set is limited by the number and quality of the word alignment corpora actually obtained, some correct words may have no replacement word set in the word alignment data set. As an optional implementation, the determining the replacement word corresponding to the at least one correct word may include:

judging whether the word alignment data set has a replacement word corresponding to the at least one correct word;

if yes, go to step 103;

if not, re-determining at least one correct word to be replaced in the positive sample according to the preset replacement rule until determining that a replacement word corresponding to the at least one correct word exists in the word alignment data set.

When it is judged that no replacement word corresponding to the at least one correct word exists in the word alignment data set, two cases can be distinguished: in one case, some of the selected correct words have no corresponding replacement word; in the other, none of the correct words has a corresponding replacement word.

If the replacement rule does not require a fixed number of correct words, then in the first case only the correct words that do have replacement words need to be replaced. If a fixed number is required, for example two correct words must be replaced in each positive sample, then in both cases other words in the positive sample are re-selected as the correct words to be replaced according to the preset replacement rule. If no correct word in the positive sample has a corresponding replacement word, the positive sample is discarded.

In the embodiment of the application, the correspondence between positive and negative samples is established by word replacement, and the replacement word for at least one correct word of the positive sample is determined statistically, so the negative samples obtained by word replacement match users' everyday search habits. Compared with manually labeling training samples, this corpus generation method obtains a large number of high-quality first error correction parallel corpora of the target field more quickly and effectively while guaranteeing sample quality, and greatly reduces labor cost.

Training the error correction model of the target field with this large, high-quality first error correction parallel corpus therefore further improves the error correction effect of the model: the model corrects the text to be corrected that a user enters in the target field and returns the correct text that best matches the user's search intent, greatly improving the user experience.

Fig. 2 is a schematic flow chart of another embodiment of a corpus generating method according to an embodiment of the present application. The method may comprise the steps of:

201: a positive sample in the target domain is acquired.

202: determining a replacement word corresponding to at least one correct word in the positive sample.

203: and replacing the at least one correct word with the corresponding replacement word to obtain a negative example corresponding to the positive example.

204: training a predictive model for the target domain based at least on the positive samples and the negative samples.

In order to further improve the sample quality of the error correction model training samples, the prediction model of the target field can be obtained through training by taking the positive samples and the negative samples generated based on the statistical mode as the training corpora of the prediction model.

Alternatively, the prediction model may be any machine translation model, such as a Sequence-to-Sequence (Seq2Seq) model or an NMT (Neural Machine Translation) model built on the Seq2Seq framework; it is not specifically limited here. Seq2Seq is a general encoder-decoder framework widely used in machine translation because it does not limit the sequence lengths of the input and output.

In practical applications, the encoder and decoder may be implemented with an RNN (Recurrent Neural Network) model or an LSTM (Long Short-Term Memory network) model, among others; this is not specifically limited here. The encoder learns the positive and negative samples and encodes them into a semantic vector that represents the semantic association between them. The encoder passes the semantic vector to the decoder, which decodes the positive sample into a corresponding predicted sample; by repeatedly adjusting the model parameters during training, the predicted sample gradually approaches the negative sample corresponding to the positive sample, and the prediction model of the target field is obtained.

In the embodiment of the application, to obtain a better prediction effect, both the encoder and the decoder of the prediction model may preferably adopt a 2-layer LSTM, and an attention mechanism from deep learning is added: an attention model is used in the decoding part as an interface between the encoder and the decoder. This helps solve the problem that, when a text sequence is too long, effective information is hidden or diluted during machine translation, so that the decoder cannot obtain enough effective information about the input sequence and the prediction effect deteriorates.

The attention model obtains an alignment vector derived from the attention scores computed when the encoder encodes the input sequence, and introduces the alignment vector into the semantic vector. The alignment vector measures the importance of each word in the input sequence, so the decoder can determine which region of the input sequence to attend to when decoding, ensuring that the effective information of the input sequence reaches the decoder and the output sequence is generated according to the attended region. Although adding the attention mechanism increases the model complexity to a certain extent, it effectively improves the prediction effect, laying the foundation for obtaining high-quality first error correction parallel corpora.
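The attention computation described above can be illustrated with a minimal dot-product attention sketch in plain Python; the toy vectors stand in for encoder and decoder hidden states and are not the 2-layer LSTM itself:

```python
import math

def attention_context(decoder_state, encoder_states):
    """Minimal dot-product attention: score each encoder state against the
    current decoder state, softmax the scores into an alignment vector, and
    return the weighted sum of encoder states as the context vector."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    m = max(scores)                       # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alignment = [x / total for x in exps]
    dim = len(decoder_state)
    context = [sum(a * enc[i] for a, enc in zip(alignment, encoder_states))
               for i in range(dim)]
    return context, alignment

# Two toy encoder states; the decoder state matches the first one,
# so the alignment weight on the first state should dominate.
context, alignment = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```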

As an implementable embodiment, the training of the predictive model for the target domain based at least on the positive examples and the negative examples may comprise:

generating a first reverse error correction parallel corpus of the target field based on the corresponding relation of the positive sample and the negative sample;

acquiring a second reverse error correction parallel corpus in multiple fields;

and training a prediction model of the target field based on the first reverse error correction parallel corpus and the second reverse error correction parallel corpus.

As described above, the data format of the error correction parallel corpus for training the error correction model can be expressed as <error text, correct text>, a forward mapping from erroneous text to correct text. When the error correction model is trained, the erroneous text in the corpus serves as the source language and the correct text as the target language.

The prediction model, in contrast, predicts the erroneous text corresponding to a correct text, so a reverse mapping from correct text to erroneous text must be established, and the data format of the generated reverse error correction parallel corpus can be expressed as <positive sample, negative sample>.

Optionally, the reverse error correction parallel corpus in other fields can be expanded into the training corpus of the prediction model, so that the word quantity of the training corpus can be increased, more word alignment relations can be learned, the prediction model can learn to obtain more effective semantic information, and the prediction effect of the prediction model can be further improved.

Optionally, the obtaining the second inverse error correction parallel corpus in multiple domains may include:

acquiring error correction parallel corpora in the multiple fields;

and performing reverse mapping processing on the corresponding relation between the positive text and the negative text in the error correction parallel corpus in the multiple fields to obtain a second reverse error correction parallel corpus in the multiple fields.

In practical application, the error correction parallel corpora in the multiple fields include error correction parallel corpora historically generated in the target field, for example corpora generated by manual tagging. Error correction parallel corpora from other fields related to the search engine, for example the live video field, the news media field, the science and technology field, and the like, can also be introduced to enrich the vocabulary of the training corpus; no specific limitation is made herein.

Two ways can be adopted for training the prediction model. The first is to combine the first reverse error correction parallel corpus and the second reverse error correction parallel corpus into the training corpus of the prediction model and train on the whole; however, the combined training corpus is large, so the training complexity and the time cost are high.

Further, in order to improve the training efficiency of the prediction model, a second way may be adopted: the prediction model pre-trained on the second reverse error correction parallel corpus is used as an initial model, and the initial model is fine-tuned (optimization training) with the first reverse error correction parallel corpus obtained from the positive samples and the negative samples, finally obtaining the prediction model of the target field, as described in the following embodiment.

As an optional implementation manner, the training the prediction model of the target domain based on the first and second inverse error correction parallel corpora may include:

pre-training a prediction model of the target field based on the second reverse error correction parallel corpus to obtain an initial prediction model;

and carrying out optimization training on the initial prediction model based on the first reverse error correction parallel corpus to obtain the prediction model.
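The pre-train-then-fine-tune flow can be sketched as below; the `train` function is only a stand-in for a real seq2seq training loop, and the toy corpora and step counting are illustrative assumptions:

```python
def train(model_params, corpus, epochs, lr):
    # Stand-in for a real seq2seq training loop over <correct, error> pairs;
    # here we only count update steps so the two-stage flow is visible.
    # A real loop would use lr to scale parameter updates.
    for _ in range(epochs):
        for _src, _tgt in corpus:
            model_params["steps"] = model_params.get("steps", 0) + 1
    return model_params

# Toy reverse corpora; real ones hold <correct text, error text> pairs.
second_reverse_corpus = [("search engine", "seach engine")] * 4   # multi-domain, large
first_reverse_corpus = [("red dress", "rad dress")] * 2           # in-domain, small

# Stage 1: pre-train on the multi-domain reverse corpus to get the initial model.
initial_model = train({}, second_reverse_corpus, epochs=1, lr=1e-3)
# Stage 2: fine-tune the initial model on the in-domain reverse corpus,
# typically with a lower learning rate.
prediction_model = train(initial_model, first_reverse_corpus, epochs=2, lr=1e-4)
```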

As an optional implementation manner, the performing optimization training on the initial prediction model based on the first inverse error correction parallel corpus to obtain the prediction model may include:

and acquiring a third reverse error correction parallel corpus in the target field.

And generating the third reverse error correction parallel corpus based on a second error correction parallel corpus obtained by manual labeling in the target field.

And performing optimization training on the initial prediction model based on an optimization training sample generated by the first reverse error correction parallel corpus and the third reverse error correction parallel corpus to obtain the prediction model.

In the optimization training process, the obtained first reverse error correction parallel corpus is combined with the third reverse error correction parallel corpus obtained from the history of the target field to generate the optimization training samples of the prediction model, and the pre-trained initial model is then optimization-trained to obtain the prediction model of the target field.

In the second way, the pre-training process yields the initial parameters of the prediction model, and the optimization training then only needs to adjust the model parameters of the initial prediction model with the optimization training samples. This greatly reduces the data volume handled at each stage, the pre-training time and the training complexity, and thus greatly improves the training speed of the prediction model. The prediction effect, however, is slightly worse than that of a prediction model trained in the first way, so in practical application different model training ways can be selected according to actual requirements.

205: and predicting and obtaining at least one negative text corresponding to the text to be predicted of the target field by utilizing the prediction model.

206: and generating a first error correction parallel corpus of the target field based on the positive text to be predicted and the at least one negative text corresponding to the positive text to be predicted.

The first error correction parallel corpus is used for training an error correction model of the target field, so that the error correction model is used for correcting the error of the text to be corrected in the target field input by the user, and the correct text corresponding to the text to be corrected is obtained.

And predicting, based on the prediction model obtained by training, the positive text to be predicted in the target field to obtain at least one negative text corresponding to it. The prediction model can learn deep semantic information between the positive samples and the negative samples, so the negative texts it predicts contain more effective semantic information and cover more varied error forms; the constructed errors better match the user's habitual input errors, further improving the sample quality of the first error correction parallel corpus.

As an optional implementation manner, the obtaining, by using the prediction model, at least one negative text corresponding to a text to be predicted of the target field by prediction may include:

acquiring a positive text to be predicted of the target field;

and predicting and obtaining at least one negative text corresponding to the text to be predicted by utilizing the prediction model.

As an optional implementation manner, the obtaining of the text to be predicted of the target domain may include:

acquiring a search text with the search frequency meeting a search threshold value in the target field;

and taking the search text as the positive text to be predicted.
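Selecting positive texts by search frequency can be sketched as a simple counting filter; the log entries and threshold below are illustrative:

```python
from collections import Counter

def select_positive_texts(search_logs, threshold):
    """Keep search texts whose frequency meets the threshold; high-frequency
    queries are assumed to be mostly well-formed (positive) texts."""
    counts = Counter(search_logs)
    return [text for text, freq in counts.items() if freq >= threshold]

logs = ["red dress", "red dress", "red dress",
        "rad dres",                      # rare, likely a typo: filtered out
        "blue shoes", "blue shoes"]
print(select_positive_texts(logs, threshold=2))  # ['red dress', 'blue shoes']
```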

Optionally, as an implementable implementation, the obtaining, by using the prediction model of the target domain, at least one negative text corresponding to the text to be predicted by prediction may include:

predicting the positive text to be predicted by using the prediction model to obtain at least one negative text sorted according to matching priority; wherein the matching priority represents the degree to which each of the at least one negative text matches the positive text to be predicted.

From the foregoing, when the prediction model incorporates the attention mechanism, the importance of each word alignment in the positive text to be predicted can be measured through the alignment vector of the Attention model, and the matching degree of the corresponding negative text with the positive text to be predicted can be determined based on that importance. If an error word in the predicted negative text is aligned with a correct word in the positive text to be predicted, then the higher the attention score of that alignment, the higher the matching degree of the negative text with the positive text to be predicted, and conversely the lower. It is understood that the matching degree of an actually predicted negative text depends not only on the attention scores of word alignments but also on the semantics of the negative text's context, such as the fluency of the sentence and its proximity to the positive text; these can all serve as matching parameters influencing the matching degree of the negative text.

Therefore, when the prediction model predicts that at least one negative text is obtained, the matching priority of each negative text can be determined based on the matching degree of each negative text, the higher the matching degree is, the higher the corresponding matching priority is, and conversely, the lower the matching priority is.
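Assigning matching priorities then reduces to sorting the predicted negative texts by their matching degree; a minimal sketch with made-up texts and scores:

```python
def rank_by_match(candidates):
    """Sort predicted negative texts by matching degree, highest first;
    position in the sorted list is the matching priority."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)

# (negative text, matching degree) pairs as a prediction model might emit them.
scored = [("ipone case", 0.82), ("iphnoe case", 0.64), ("iphone csae", 0.91)]
ranked = rank_by_match(scored)
# ranked[0] == ("iphone csae", 0.91)  -- highest matching priority
```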

As another implementation manner, after the obtaining, by using the prediction model in the target domain, at least one negative text corresponding to the text to be predicted by prediction, the method may further include:

and determining the matching priority of the at least one negative text and the positive text to be predicted according to a preset condition.

For a prediction model that does not incorporate an "attention mechanism," the degree of matching of at least one negative sample output by the prediction model with a positive sample to be predicted may be evaluated based on preset matching parameters. The matching parameters can adopt the same parameters as the above, such as semantic smoothness, proximity to the positive text to be predicted, importance of the aligned words in the negative text and the positive text to be predicted, and the like, and can be set according to actual requirements. The method for evaluating the matching degree may be implemented in any manner, for example, different weight values are set according to the importance degree of the matching parameter, and the matching value of each negative sample is calculated by a weighted summation manner, which is not specifically limited herein.
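The weighted-summation evaluation described above can be sketched as follows; the parameter names and weight values are illustrative assumptions, not prescribed by this application:

```python
def match_score(params, weights):
    """Weighted sum of matching parameters (e.g. sentence fluency, proximity
    to the positive text, importance of aligned words)."""
    return sum(weights[name] * value for name, value in params.items())

# Weights set by the importance of each matching parameter (illustrative).
weights = {"fluency": 0.4, "proximity": 0.4, "alignment_importance": 0.2}
score = match_score(
    {"fluency": 0.9, "proximity": 0.8, "alignment_importance": 0.5},
    weights,
)
# 0.4*0.9 + 0.4*0.8 + 0.2*0.5 = 0.78
```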

In practical applications, a first error correction parallel corpus generated from negative texts with low matching degree is of poor quality. Therefore, in order to obtain a high-quality first error correction parallel corpus, optionally, the generating the first error correction parallel corpus of the target field based on the positive text to be predicted and the corresponding at least one negative text may include:

according to the matching priority, selecting the top N negative texts in the at least one negative text as predicted negative texts; wherein N is a preset value greater than or equal to 1;

and generating a first error correction parallel corpus of the target field based on the corresponding relation between the predicted negative text and the to-be-predicted positive text.
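Selecting the top-N negative texts and pairing each with the positive text yields the &lt;negative text, positive text&gt; entries of the corpus; a minimal sketch with illustrative data:

```python
def build_corpus(positive_text, ranked_negatives, n):
    """Take the top-N negative texts by matching priority and pair each with
    the positive text as a <negative text, positive text> corpus entry."""
    return [(neg, positive_text) for neg, _score in ranked_negatives[:n]]

# Negative texts already sorted by matching priority (highest first).
ranked = [("iphone csae", 0.91), ("ipone case", 0.82), ("iphnoe case", 0.64)]
pairs = build_corpus("iphone case", ranked, n=2)
# [("iphone csae", "iphone case"), ("ipone case", "iphone case")]
```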

In the embodiment of the application, on the basis of obtaining the positive sample and the negative sample by performing word replacement based on statistical analysis, the semantic relation between the positive sample and the negative sample is further learned by combining a neural network translation model, and the neural network translation model can construct a negative text with richer error types and higher quality based on the mapping relation between the positive sample and at least one negative sample.

In addition, more training word quantities and semantic information can be introduced by introducing the second error correction parallel corpus irrelevant to the target field to train the prediction model, so that negative texts constructed by the prediction model can be further enriched to have more various error forms, and the habit of daily input errors of a user is fitted, so that the sample quality of the first error correction parallel corpus is further improved, and a foundation is laid for obtaining the first error correction parallel corpus with higher quality.

Fig. 3 is a schematic flowchart of a data processing method according to another embodiment of the present application. The method may comprise the steps of:

301: a positive sample in the target domain is acquired.

302: determining a replacement word corresponding to at least one correct word in the positive sample.

303: and replacing the at least one correct word with the corresponding replacement word to obtain a negative example corresponding to the positive example.

304: training a predictive model for the target domain based at least on the positive samples and the negative samples.

The prediction model is used for predicting and obtaining at least one negative text corresponding to the text to be predicted in the target field, and generating a first error correction parallel corpus in the target field based on the text to be predicted and the at least one negative text corresponding to the text to be predicted.

The foregoing detailed description has described specific embodiments of the present application in detail, which are not repeated herein.

In the embodiment of the application, on the basis of obtaining the positive samples and the negative samples by word replacement based on statistical analysis, the semantic relation between the positive samples and the negative samples is further learned by a neural network translation model. Based on the mapping relation between a positive sample and at least one negative sample, the neural network model can construct negative texts with richer error types and higher quality, laying a foundation for obtaining a higher-quality first error correction parallel corpus.

Fig. 4 is a flowchart illustrating an embodiment of a corpus generating method according to an embodiment of the present application. The method may comprise the steps of:

401: and acquiring the positive text to be predicted of the target field.

402: and predicting and obtaining at least one negative text corresponding to the text to be predicted by using the prediction model of the target field.

The prediction model is obtained based on a positive sample in the target field and a negative sample corresponding to the positive sample through training; and the negative sample is obtained by replacing at least one correct word in the positive sample with a replacement word corresponding to each correct word.

403: and generating a first error correction parallel corpus of the target field based on the positive text to be predicted and the at least one negative text corresponding to the positive text to be predicted.

The first error correction parallel corpus is used for training an error correction model of the target field, so that the error correction model is used for correcting the error of the text to be corrected in the target field input by the user, and the correct text corresponding to the text to be corrected is obtained.

The foregoing detailed description has described specific embodiments of the present application in detail, which are not repeated herein.

In the embodiment of the application, on the basis of the prediction model of the target field obtained based on the training, at least one negative text corresponding to the text to be predicted is obtained by prediction through the prediction model. In order to further improve the quality of the first error correction parallel corpus, on the basis of determining the matching degree of the at least one negative text and the positive text to be predicted, the negative text with a larger matching degree is preferentially selected as the predicted negative text. Therefore, the high-quality and large-quantity first error correction parallel corpora can be obtained more quickly and efficiently based on the prediction model, and a foundation is laid for training to obtain an error correction model with a better error correction effect.

Fig. 5 is a schematic flowchart of a data processing method according to another embodiment of the present application. The method may comprise the steps of:

501: and acquiring a first error correction parallel corpus of the target field.

The first error correction parallel corpus is generated based on a positive sample in the target field and a negative sample corresponding to the positive sample; and the negative sample is obtained by replacing at least one correct word in the positive sample with a replacement word corresponding to each correct word.

Or the first error correction parallel corpus is generated by using a prediction model of the target field to predict at least one negative text corresponding to the positive text to be predicted of the target field, based on the positive text to be predicted and the at least one negative text corresponding to it; the prediction model is trained based on positive samples of the target field and negative samples obtained by replacing at least one correct word in each positive sample with its replacement word.

502: and training an error correction model of the target field at least based on the first error correction parallel corpus.

The error correction model is used for correcting the error of the text to be corrected in the target field input by the user and returning the correct text corresponding to the text to be corrected.

In practical applications, the error correction model may be any machine translation model; for example, in addition to the NMT model, an SMT (Statistical Machine Translation) model may be used. The SMT model is a statistical translation model constructed by statistical analysis of a large number of error correction parallel corpora; it can be trained with existing SMT training methods, optimizing and adjusting its model parameters based on the corresponding relation between the positive texts and the negative texts in the first error correction parallel corpus. Whether the error correction model adopts an NMT model or an SMT model, model training makes the correct text output by the error correction model for a negative text the same as the corresponding positive text, thereby obtaining the error correction model of the target field.

As an optional implementation, the training of the error correction model of the target domain based on at least the first error correction parallel corpus may include:

acquiring a second error correction parallel corpus of the target field; the second error correction parallel corpus is an error correction parallel corpus generated by manual labeling in the target field;

and training an error correction training sample generated based on the first error correction parallel corpus and the second error correction parallel corpus to obtain an error correction model of the target field.

In practical application, the error correction model can be trained in two ways. In the first, an initial error correction model of the target field is trained with the historically generated second error correction parallel corpus as the pre-training corpus; the initial error correction model is then optimization-trained with the first error correction parallel corpus generated by the present application as the optimization training corpus, thereby obtaining the error correction model of the target field.

Of course, in order to further improve the error correction effect of the model, the first error correction parallel corpus and the second error correction parallel corpus may be combined to generate an error correction training corpus. Different model training modes can be selected according to actual requirements, and are not particularly limited herein.

The foregoing detailed description has described specific embodiments of the present application in detail, which are not repeated herein.

In the embodiment of the application, under the scene that a large amount of parallel corpus training data is needed, a large amount of high-quality first error correction parallel corpus in the target field is obtained based on the embodiment, and the error correction model in the target field is trained, so that the error correction effect and quality of the error correction model in the field can be further improved, and a guarantee is provided for providing a higher-quality error searching and correcting service for a user.

Fig. 6 is a schematic flowchart of another embodiment of a text error correction method according to an embodiment of the present application. The method may include:

601: and acquiring the text to be corrected in the target field input by the user.

602: and determining a correct text corresponding to the text to be corrected by using the error correction model of the target field.

Wherein the error correction model is trained at least based on a first error correction parallel corpus of the target field; the first error correction parallel corpus is generated by using a prediction model of the target field to predict at least one negative text corresponding to the positive text to be predicted of the target field, based on the positive text to be predicted and the at least one negative text corresponding to it; the prediction model is trained based on positive samples of the target field and negative samples obtained by replacing at least one correct word in each positive sample with its replacement word.

Optionally, the text to be corrected is a search text of the target field input by the user in any search box. The search box may be in an application client of the target field or in any webpage; when the user inputs a search text, the error correction model automatically detects whether a matching correct text exists. If yes, the correct text is returned; if not, the search is performed directly based on the text entered by the user.

603: and returning the correct text for the user terminal to search the target object based on the correct text.

And the search box searches based on the returned correct text and prompts the corrected correct text of the user on a search page so that the user can confirm whether the corrected correct text meets the search requirement of the user, and further the automatic search error correction process of the target field is completed.

In practical application, the embodiment of the application can also perform text error correction on candidate words output by a text input interface provided by any text input service based on user input operation, so that a user can directly input correct texts in a text search box to search and obtain target objects more quickly and efficiently. The text application service may be provided by a local input method application carried by a system in a terminal device such as a smart phone, a computer, a tablet computer, or the like, or an input method application of any invoked third-party application platform, which is not specifically limited herein.

Optionally, in some embodiments, an embodiment of the present application further provides a text correction method, which may include:

acquiring at least one candidate text output by a text input interface aiming at user input operation;

when a user inputs characters through an input device such as a text input keyboard or an input drawing board, the text input application determines at least one candidate text matched with the input operation of the user according to the characters generated by the input device. In practical applications, the input keyboard may be a virtual keyboard, a user may perform an input operation by displaying the virtual keyboard on the electronic display screen, the user may also directly write text to be input by the user in the input drawing board by means of an auxiliary input device, such as a stylus or with a finger, and the text input application may determine at least one candidate text by recognizing the text input by the user.

In practice, when the user triggers an input operation, the text input application may be invoked and the text input interface displayed; at the same time, the characters input by the user through the input device are displayed in the text input interface, for example pinyin characters under a pinyin input rule, English characters under an English input rule, and the like, which are not specifically limited herein. The text input interface simultaneously displays at least one candidate text determined from the characters input by the user, arranged in an order determined by the degree of matching with those characters, for example with common words and high-frequency words displayed preferentially.

Judging whether a text to be corrected in the target field exists in the at least one candidate text;

because the at least one candidate word output by the text input interface is determined only according to the user input operation, text or words which do not belong to the target field may exist in the at least one candidate word. Therefore, it is necessary to determine in advance whether the at least one candidate text belongs to the text to be corrected in the target field category, so as to improve the efficiency of text correction.

If the text to be corrected exists, determining a correct text corresponding to the text to be corrected by using the error correction model of the target field; the error correction model is trained based on the first error correction parallel corpus of the target field; the first error correction parallel corpus is generated by using a prediction model of the target field to predict at least one negative text corresponding to the positive text to be predicted of the target field, based on the positive text to be predicted and the at least one negative text corresponding to it; the prediction model is trained based on positive samples of the target field and negative samples obtained by replacing at least one correct word in each positive sample with its replacement word;

and replacing the at least one candidate text with the correct text so that the user terminal can search the target object based on any replaced candidate text determined by the user.

Optionally, in the embodiment of the present application, the correct text may directly replace the corresponding candidate text in the text input interface as a new candidate text, or may be displayed alongside the original candidate text to prompt the user of the correction; when the user selects that candidate, the correct text is entered in the search text box. This may be set according to actual requirements and is not specifically limited herein.

As shown in fig. 7(a), when the user inputs "qi'cai", the text input interface displays at least one candidate text. When the user performs the text input operation in a search text box of the user side, text error correction is performed on the at least one candidate text displayed on the text input interface: according to the target field corresponding to the user side, the candidate "full dishes" is determined to be a text to be corrected in the target field, error correction is performed on it with the error correction model, and the corresponding correct text is determined to be "shepherd's purse". As shown in fig. 7(b), the text input interface then displays the replaced candidate texts, for example "1. shepherd's purse", "2. equipment", "3. seven-color" and the like. When the user selects "1", the correct text "shepherd's purse" is directly entered in the text search box.

Therefore, the text error correction can be directly realized on the input text interface, so that the text error correction can be realized by a user in the text input process, the text error correction time can be saved, the target object searching efficiency is improved, and the user experience is further improved.

The foregoing detailed description has described specific embodiments of the present application in detail, which are not repeated herein.

In the embodiment of the application, precisely because a large number of high-quality first error correction parallel corpora obtained by combining statistical analysis with the neural network model are used as error correction training samples, the trained error correction model has a better error correction effect. A better search and error correction service can therefore be provided for the user based on this model, the corrected positive text meets the user's search requirement to the greatest extent, the user's search efficiency is improved, and the user obtains a better experience.

Fig. 8 is a schematic structural diagram of an embodiment of a corpus generating device according to an embodiment of the present application. The apparatus may include:

a first positive sample obtaining module 801, configured to obtain a positive sample in a target field;

a first alternative word determining module 802, configured to determine an alternative word corresponding to at least one correct word in the positive sample;

a first negative sample obtaining module 803, configured to replace the at least one correct word with a corresponding replacement word, and obtain a negative sample corresponding to the positive sample;

a first training sample generating module 804, configured to generate a first error correction parallel corpus of the target domain based on at least the positive sample and the negative sample.

The first error correction parallel corpus is used for training an error correction model of the target field, so that the error correction model is used for correcting the error of the text to be corrected in the target field input by the user, and the correct text corresponding to the text to be corrected is obtained.

In order to improve the quality of the error correction parallel corpus, the positive sample may be obtained from the historical search texts of the target field. As an optional implementation manner, the first positive sample obtaining module 801 may be specifically configured to:

acquiring a search text with the search frequency meeting a search threshold value in the target field;

taking the search text as the positive sample.

Optionally, in some embodiments, the first alternative word determining module 802 may specifically include:

a correct word selecting unit, configured to select at least one correct word to be replaced in the positive sample according to a preset replacement rule;

a replacement word determining unit for determining a replacement word corresponding to the at least one correct word.

Of course, the correct word of the positive sample may also be randomly selected, and as an optional implementation manner, the correct word selecting unit may specifically be configured to:

and randomly selecting a first preset value of correct words to be replaced in the positive sample.

As an optional implementation manner, the alternative word determining unit may specifically be configured to:

acquiring a word alignment data set; the word alignment data set establishes a corresponding relation between a correct word and a replacement word set acquired from multiple fields; the set of replacement words comprises at least one replacement word.

Determining a replacement word corresponding to the at least one correct word based on the word alignment dataset.
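The word alignment data set described above can be pictured as a simple mapping from each correct word to its observed replacement-word set; the toy entries below are invented for illustration:

```python
# Hypothetical word-alignment data set: each correct word maps to the
# set of replacement (error) words observed for it across multiple domains.
word_alignment = {
    "dress": {"dres", "drss", "derss"},
    "shoes": {"shoos", "shose"},
}

def replacements_for(correct_word):
    """Return the replacement-word set for a correct word,
    or an empty set when the word has no known replacements."""
    return word_alignment.get(correct_word, set())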

Optionally, the obtaining of the word alignment data set may specifically be configured to:

acquiring a word alignment corpus containing replacement words and correct words in the multiple fields; wherein the word alignment corpus at least comprises a second error correction parallel corpus data set in multiple fields;

performing word alignment processing on the word alignment corpus to obtain a data set of error-correcting words;

and obtaining a word alignment data set for establishing the alignment relation between the correct word and the replacement word set at least based on the error-correcting word data set, the shape-similar word data set and the common error word data set.
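One hedged way to sketch the merge of the three data sources named above (word-aligned error-correction pairs, shape-similar words, and common error words) into a single alignment mapping; the input formats assumed here are illustrative, not prescribed by the embodiment:

```python
def build_word_alignment(error_corrected_pairs, similar_shape, common_errors):
    """Merge three sources into one correct-word -> replacement-set mapping.

    error_corrected_pairs: iterable of (error_word, correct_word) obtained
        by word-aligning multi-domain error-correction parallel corpora.
    similar_shape / common_errors: dicts mapping a correct word to an
        iterable of visually similar or commonly mistyped words.
    """
    alignment = {}
    for err, correct in error_corrected_pairs:
        alignment.setdefault(correct, set()).add(err)
    for source in (similar_shape, common_errors):
        for correct, errs in source.items():
            alignment.setdefault(correct, set()).update(errs)
    return alignment

alignment = build_word_alignment(
    [("dres", "dress")],
    {"dress": ["drcss"]},
    {"shoes": ["shoos"]},
)
```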

Further, the determining, based on the word alignment data set, a replacement word corresponding to the at least one correct word may specifically be used to:

determining a set of replacement words in the word alignment data set corresponding to the at least one correct word;

determining a replacement word corresponding to the at least one correct word from the set of replacement words.

Further, in order to improve the sample quality of the negative examples, the probability that an error word is corrected to the corresponding replacement word in the second error correction parallel corpus may be determined. As an optional implementation manner, the determining, from the replacement word set, of a replacement word corresponding to the at least one correct word may specifically be configured to:

calculating the replacement probability corresponding to each replacement word in the replacement word set;

determining a replacement word corresponding to the at least one correct word from the set of replacement words based on the replacement probability.

In this embodiment of the present application, a data statistics manner may be adopted to determine the replacement probability corresponding to each replacement word. Specifically, as an implementable manner, the calculating of the replacement probability corresponding to each replacement word in the replacement word set may specifically be used to:

counting the probability distribution of each replacement word in the replacement word set being corrected to the corresponding correct word;

and determining a replacement probability corresponding to the replacement word in the replacement word set based on the probability distribution.
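Under the assumption that the probability distribution is estimated by simple frequency counting over the aligned corpora, the calculation might look like this; the sample error words are fabricated:

```python
from collections import Counter

def replacement_probabilities(observed_errors):
    """Estimate, for one correct word, the probability of each replacement
    word from its observed frequency in the aligned corpora."""
    counts = Counter(observed_errors)
    total = sum(counts.values())
    return {word: freq / total for word, freq in counts.items()}

# "dres" was observed three times as an error for "dress", "drss" once.
probs = replacement_probabilities(["dres", "dres", "dres", "drss"])
```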

In order to further increase the diversity of the negative examples and improve the quality of the examples, as an optional implementation manner, the determining, based on the replacement probability, a replacement word corresponding to the at least one correct word from the replacement word set may specifically be configured to:

taking the replacement probability as the replacement weight of the replacement word in the corresponding replacement word set;

and selecting a replacement word corresponding to the at least one correct word in the replacement word set through weighted random sampling based on the replacement weight.
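The weighted random sampling step can be sketched with the standard-library `random.choices`, using the replacement probabilities as sampling weights so that frequent real-world errors are generated more often; this is one possible realization, not the patented implementation:

```python
import random

def sample_replacement(replacement_probs):
    """Draw one replacement word, using the replacement probabilities
    as sampling weights."""
    words = list(replacement_probs)
    weights = [replacement_probs[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

random.seed(1)
word = sample_replacement({"dres": 0.75, "drss": 0.25})
```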

As an optional implementation manner, the determining of the replacement word corresponding to the at least one correct word may specifically be used to:

judging whether the word alignment data set has a replacement word corresponding to the at least one correct word;

if yes, triggering a first negative sample acquisition module 803;

if not, re-determining at least one correct word to be replaced in the positive sample according to the preset replacement rule until determining that a replacement word corresponding to the at least one correct word exists in the word alignment data set.
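Rather than literally looping until a replaceable word is found, an equivalent sketch can first filter the sample down to words that have entries in the alignment data set and then pick among them; all names and data are hypothetical, and the filtering shortcut is a design choice of this sketch, not of the embodiment:

```python
import random

def choose_replaceable_word(tokens, word_alignment):
    """Pick a word that is guaranteed to have a replacement.

    Returns (correct_word, replacement) or None if no token in the
    sample appears in the alignment data set.
    """
    candidates = [t for t in tokens if word_alignment.get(t)]
    if not candidates:
        return None
    correct = random.choice(candidates)
    return correct, random.choice(sorted(word_alignment[correct]))

random.seed(0)
result = choose_replaceable_word(["red", "dress"],
                                 {"dress": {"dres", "drss"}})
```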

The foregoing detailed description has described specific embodiments of the present application in detail, which are not repeated herein.

In the embodiment of the application, the corresponding relation between the positive sample and the negative sample is established by local word replacement, and the replacement word corresponding to at least one correct word in the positive sample is determined in a statistical manner, so that the negative sample obtained by word replacement conforms to the daily search habits of users. On the basis of ensuring the quality of the training samples, this method of generating error correction parallel corpora can obtain a large number of high-quality first error correction parallel corpora in the target field more quickly and effectively than the traditional method of manually labeling training samples, while greatly reducing labor cost.

Therefore, the error correction model in the target field is trained by using the obtained high-quality and large-quantity first error correction parallel corpus, so that the error correction effect of the error correction model can be further improved, the error correction model is used for correcting the text to be corrected in the target field input by the user, the correct text which corresponds to the text to be corrected and best meets the search requirement of the user is obtained, and the user experience is greatly improved.

Fig. 9 is a schematic structural diagram of another embodiment of a corpus generating device according to an embodiment of the present application. The apparatus may include:

a first positive sample acquiring module 901, configured to acquire a positive sample in a target domain.

A first alternative word determining module 902, configured to determine an alternative word corresponding to at least one correct word in the positive sample.

A first negative example obtaining module 903, configured to replace the at least one correct word with a corresponding replacement word, and obtain a negative example corresponding to the positive example.

A first training sample generating module 904, configured to generate a first error correction parallel corpus of the target domain based on at least the positive sample and the negative sample.

The first training sample generation module 904 may include:

a prediction model training unit 911, configured to train a prediction model of the target domain based on at least the positive samples and the negative samples.

As an implementation manner, the prediction model training unit 911 may be specifically configured to:

generating a first reverse error correction parallel corpus of the target field based on the corresponding relation of the positive sample and the negative sample;

acquiring a second reverse error correction parallel corpus in multiple fields;

and training a prediction model of the target field based on the first reverse error correction parallel corpus and the second reverse error correction parallel corpus.

Optionally, the obtaining of the second inverse error correction parallel corpus in multiple domains may specifically be used to:

acquiring error correction parallel corpora in the multiple fields;

and performing reverse mapping processing on the corresponding relation between the positive text and the negative text in the error correction parallel corpus in the multiple fields to obtain a second reverse error correction parallel corpus in the multiple fields.
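The reverse mapping described above amounts to swapping each (negative text, positive text) pair, so a model trained on the result learns to generate errors rather than correct them; a minimal sketch with invented pairs:

```python
def reverse_parallel_corpus(pairs):
    """Swap the (negative, positive) direction of an error-correction
    parallel corpus, yielding (positive, negative) pairs suitable for
    training an error-generating prediction model."""
    return [(positive, negative) for (negative, positive) in pairs]

forward = [("rad dress", "red dress"), ("shoos", "shoes")]
reverse = reverse_parallel_corpus(forward)
```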

As an optional implementation manner, the training of the prediction model in the target domain based on the first and second inverse error correction parallel corpora may specifically be used to:

pre-training a prediction model of the target field based on the second backward error correction parallel corpus to obtain an initial prediction model;

and carrying out optimization training on the initial prediction model based on the first reverse error correction parallel corpus to obtain the prediction model.

As an optional implementation manner, the optimal training of the initial prediction model based on the first inverse error correction parallel corpus to obtain the prediction model may specifically be used to:

and acquiring a third reverse error correction parallel corpus in the target field.

And generating the third reverse error correction parallel corpus based on a second error correction parallel corpus obtained by manual labeling in the target field.

And a negative text prediction unit 912, configured to obtain at least one negative text corresponding to the text to be predicted in the target field by using the prediction model.

The first training sample generating unit 913 is configured to generate a first error correction parallel corpus of the target field based on the text to be predicted and the at least one negative text corresponding to the text to be predicted.

The first error correction parallel corpus is used for training an error correction model of the target field, so that the error correction model is used for correcting the error of the text to be corrected in the target field input by the user, and the correct text corresponding to the text to be corrected is obtained.

As an alternative implementation, the negative text prediction unit 912 may include:

a to-be-predicted positive text acquiring subunit, configured to acquire the positive text to be predicted in the target field;

and the negative text prediction subunit is used for obtaining at least one negative text corresponding to the text to be predicted by using the prediction model.

As an optional implementation manner, the to-be-predicted positive text acquiring subunit may specifically be configured to:

acquiring a search text with the search frequency meeting a search threshold value in the target field;

and taking the search text as the positive text to be predicted.

Optionally, as an implementable manner, the negative text prediction subunit may be specifically configured to:

predicting the positive text to be predicted by using the prediction model to obtain at least one negative text sorted according to matching priority; wherein the matching priority represents the degree of matching between each of the at least one negative text and the positive text to be predicted.

As another implementation, the apparatus may further include, after the negative text prediction subunit:

and the priority determining subunit is used for determining the matching priority of the at least one negative text and the positive text to be predicted according to a preset condition.

In practical applications, negative texts with a low matching degree yield a first error correction parallel corpus of poor quality. Therefore, in order to obtain a high-quality first error correction parallel corpus, optionally, the first training sample generating unit 913 may be specifically configured to:

according to the matching priority, selecting the first N negative texts among the at least one negative text as predicted negative texts; wherein N is a preset value greater than or equal to 1;

and generating a first error correction parallel corpus of the target field based on the corresponding relation between the predicted negative text and the to-be-predicted positive text.
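Assuming the prediction model already returns negative texts sorted by matching priority (best first), the top-N selection reduces to a slice; the ranked examples below are fabricated:

```python
def top_n_negative_texts(negative_texts, n):
    """Keep the first N predicted negative texts, assuming the input is
    already sorted by matching priority, best match first."""
    if n < 1:
        raise ValueError("N must be >= 1")
    return negative_texts[:n]

ranked = ["rad dress", "red drss", "rd dress"]
selected = top_n_negative_texts(ranked, 2)
```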

The foregoing detailed description has described specific embodiments of the present application in detail, which are not repeated herein.

In the embodiment of the application, on the basis of obtaining the positive sample and the negative sample by performing word replacement based on statistical analysis, the semantic relation between the positive sample and the negative sample is further learned by combining a neural network translation model, and the neural network translation model can construct a negative text with richer error types and higher quality based on the mapping relation between the positive sample and at least one negative sample.

In addition, more training word quantities and semantic information can be introduced by introducing the second error correction parallel corpus irrelevant to the target field to train the prediction model, so that negative texts constructed by the prediction model can be further enriched to have more various error forms, and the habit of daily input errors of a user is fitted, so that the sample quality of the first error correction parallel corpus is further improved, and a foundation is laid for obtaining the first error correction parallel corpus with higher quality.

Fig. 10 is a schematic structural diagram of another embodiment of a data processing apparatus according to an embodiment of the present application. The apparatus may include:

a second positive sample obtaining module 1001, configured to obtain a positive sample in a target field;

a second alternative word determining module 1002, configured to determine a replacement word corresponding to at least one correct word in the positive sample;

a second negative sample obtaining module 1003, configured to replace the at least one correct word with a corresponding replacement word, and obtain a negative sample corresponding to the positive sample;

a predictive model training module 1004 for training a predictive model for the target domain based at least on the positive examples and the negative examples.

The prediction model is used for predicting and obtaining at least one negative text corresponding to the text to be predicted in the target field, and generating a first error correction parallel corpus in the target field based on the text to be predicted and the at least one negative text corresponding to the text to be predicted.

The foregoing detailed description has described specific embodiments of the present application in detail, which are not repeated herein.

In the embodiment of the application, on the basis of obtaining the positive sample and the negative sample through statistics-based word replacement, the semantic relation between the positive sample and the negative sample is further learned by a neural network translation model. Based on the mapping relation between the positive sample and at least one negative sample, the neural network model can construct negative texts with richer error types and higher quality, laying a foundation for obtaining a higher-quality first error correction parallel corpus.

Fig. 11 is a schematic structural diagram of another embodiment of a corpus generating device according to an embodiment of the present application. The apparatus may include:

a text to be predicted acquisition module 1101, configured to acquire a positive text to be predicted in a target field;

and the negative text prediction module 1102 is configured to predict and obtain at least one negative text corresponding to the text to be predicted by using the prediction model in the target field.

Wherein the prediction model is obtained based on at least positive samples in the target field and negative samples corresponding to the positive samples; and the negative sample is obtained by replacing at least one correct word in the positive sample with a replacement word corresponding to each correct word.

A second training sample generating module 1103, configured to generate a first error correction parallel corpus of the target field based on the text to be predicted and the at least one negative text corresponding to the text to be predicted.

The first error correction parallel corpus is used for training an error correction model of the target field, so that the error correction model is used for correcting the error of the text to be corrected in the target field input by the user, and the correct text corresponding to the text to be corrected is obtained.

The foregoing detailed description has described specific embodiments of the present application in detail, which are not repeated herein.

In the embodiment of the application, at least one negative text corresponding to the positive text to be predicted is obtained through the trained prediction model of the target field. In order to further improve the quality of the first error correction parallel corpus, after the matching degree between the at least one negative text and the positive text to be predicted is determined, negative texts with a higher matching degree are preferentially selected as the predicted negative texts. In this way, a large number of high-quality first error correction parallel corpora can be obtained more quickly and efficiently based on the prediction model, laying a foundation for training an error correction model with a better error correction effect.

Fig. 12 is a schematic structural diagram of another embodiment of a data processing apparatus according to an embodiment of the present application. The apparatus may include:

a first obtaining module 1201, configured to obtain a first error correction parallel corpus of the target field.

The first error correction parallel corpus is generated based on a positive sample in the target field and a negative sample corresponding to the positive sample; and the negative sample is obtained by replacing at least one correct word in the positive sample with a replacement word corresponding to each correct word.

A first error correction model training module 1202, configured to train an error correction model of the target domain based on the first error correction parallel corpus.

The error correction model is used for correcting the error of the text to be corrected in the target field input by the user and returning the correct text corresponding to the text to be corrected.

The foregoing detailed description has described specific embodiments of the present application in detail, which are not repeated herein.

In the embodiment of the application, under the scene that a large amount of parallel corpus training data is needed, a large amount of high-quality first error correction parallel corpus in the target field is obtained based on the embodiment, and the error correction model in the target field is trained, so that the error correction effect and quality of the error correction model in the field can be further improved, and a guarantee is provided for providing a higher-quality error searching and correcting service for a user.

Fig. 13 is a schematic structural diagram of another embodiment of a data processing apparatus according to an embodiment of the present application. The apparatus may include:

a second obtaining module 1301, configured to obtain the first error correction parallel corpus of the target field.

The first error correction parallel corpus is generated by utilizing a prediction model of the target field, obtaining at least one negative text corresponding to a text to be predicted of the target field through prediction and based on the text to be predicted and the at least one negative text corresponding to the text to be predicted; the prediction model is obtained based on training of a positive sample of the target domain and a negative text sample obtained by replacing at least one correct word in the positive sample with the replacement word.

A second error correction model training module 1302, configured to train an error correction model of the target field based on the first error correction parallel corpus.

The error correction model is used for correcting the error of the text to be corrected in the target field input by the user and returning the correct text corresponding to the text to be corrected.

As an optional implementation manner, the second error correction model training module 1302 may be specifically configured to:

acquiring a second error correction parallel corpus of the target field; the second error correction parallel corpus is an error correction parallel corpus generated by manual labeling in the target field;

and training an error correction training sample generated based on the first error correction parallel corpus and the second error correction parallel corpus to obtain an error correction model of the target field.
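A trivial sketch of combining the automatically generated first corpus with the manually labeled second corpus into one training set; the (negative text, positive text) pair format is an assumption of this sketch:

```python
def build_training_corpus(first_corpus, second_corpus):
    """Merge the automatically generated first error correction parallel
    corpus with the manually labeled second corpus into one training set
    of (negative_text, positive_text) pairs."""
    return list(first_corpus) + list(second_corpus)

auto_pairs = [("rad dress", "red dress")]
manual_pairs = [("shoos", "shoes")]
training_set = build_training_corpus(auto_pairs, manual_pairs)
```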

The foregoing detailed description has described specific embodiments of the present application in detail, which are not repeated herein.

In the embodiment of the application, under the scene that a large amount of parallel corpus training data is needed, a large amount of high-quality first error correction parallel corpus in the target field is obtained based on the embodiment, and the error correction model in the target field is trained, so that the error correction effect and quality of the error correction model in the field can be further improved, and a guarantee is provided for providing a higher-quality error searching and correcting service for a user.

Fig. 14 is a schematic structural diagram of another embodiment of a text error correction apparatus according to an embodiment of the present application. The apparatus may include:

a text to be corrected acquiring module 1401, configured to acquire a text to be corrected in a target field input by a user.

And the error correction module 1402 is configured to determine a correct text corresponding to the text to be corrected by using the error correction model in the target field.

Wherein the error correction model is obtained at least based on training with a first error correction parallel corpus in the target field; the first error correction parallel corpus is generated by using the prediction model of the target field to predict at least one negative text corresponding to a positive text to be predicted in the target field, and is generated based on the positive text to be predicted and the at least one negative text corresponding thereto; the prediction model is obtained by training based on a positive sample of the target field and a negative text sample obtained by replacing at least one correct word in the positive sample with a corresponding replacement word.

A returning module 1403, configured to return the correct text, so that the user end searches for the target object based on the correct text.

The foregoing detailed description has described specific embodiments of the present application in detail, which are not repeated herein.

In the embodiment of the application, precisely because a large number of high-quality first error correction parallel corpora, obtained by combining statistical analysis with a neural network model, are used as error correction training samples, the trained error correction model has a better error correction effect. A better error searching and correcting service can therefore be provided to the user based on this model, the corrected text meets the user's search requirements to the greatest extent, the user's search efficiency is improved, and the user obtains a better experience.

Fig. 15 is a schematic structural diagram of an embodiment of a computer device provided in an embodiment of the present application. The computer device may include a processing component 1501 and a storage component 1502. The storage component 1502 is configured to store one or more computer instructions, wherein the one or more computer instructions are invoked for execution by the processing component 1501.

The processing component 1501 may be configured to:

acquiring a positive sample in a target field;

determining a replacement word corresponding to at least one correct word in the positive sample;

replacing the at least one correct word with the corresponding replacement word to obtain a negative sample corresponding to the positive sample;

generating a first error correction parallel corpus of the target field at least based on the positive sample and the negative sample; the first error correction parallel corpus is used for training an error correction model of the target field, so that the error correction model is used for correcting the error of the text to be corrected in the target field input by the user, and the correct text corresponding to the text to be corrected is obtained.

The processing component 1501 may include one or more processors executing computer instructions to perform all or part of the steps of the above-described method. Of course, the processing elements may also be implemented as one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components configured to perform the above-described methods.

The storage component 1502 is configured to store various types of data to support operations in the server. The memory components may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

Of course, a computer device may also include other components, such as input/output interfaces, communication components, and so forth.

The input/output interface provides an interface between the processing components and peripheral interface modules, which may be output devices, input devices, etc.

The communication component is configured to facilitate wired or wireless communication between the server and other devices, such as with a terminal.

The embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a computer, the corpus generating method in the embodiments shown in fig. 1 to fig. 4 may be implemented.

Fig. 16 is a schematic structural diagram of an embodiment of a computer device provided in an embodiment of the present application, where the terminal device may include a processing component 1601 and a storage component 1602. The storage component 1602 is configured to store one or more computer instructions, where the one or more computer instructions are invoked for execution by the processing component.

The processing component 1601 may be configured to:

acquiring a first error correction parallel corpus of the target field; the first error correction parallel corpus is generated based on a positive sample in the target field and a negative sample corresponding to the positive sample; the negative sample is obtained by replacing at least one correct word in the positive sample with a replacement word corresponding to each correct word;

training an error correction model of the target field based on the first error correction parallel corpus; the error correction model is used for correcting the error of the text to be corrected in the target field input by the user and returning the correct text corresponding to the text to be corrected.

Processing component 1601 may include one or more processors executing computer instructions to perform all or some of the steps of the methods described above. Of course, the processing elements may also be implemented as one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components configured to perform the above-described methods.

The storage component 1602 is configured to store various types of data to support operations in the server. The memory components may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

Of course, a computer device may also include other components, such as input/output interfaces, communication components, and so forth.

An embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a computer, the data processing method of the embodiment shown in fig. 5 may be implemented.

Fig. 17 is a schematic structural diagram of an embodiment of a computer device provided in the embodiment of the present application, where the terminal device may include a processing component 1701 and a storage component 1702. The storage component 1702 is configured to store one or more computer instructions, wherein the one or more computer instructions are invoked for execution by the processing component.

The processing component 1701 may be configured to:

acquiring a text to be corrected in a target field input by a user;

determining a correct text corresponding to the text to be corrected by using the error correction model of the target field; wherein the error correction model is obtained at least based on training with a first error correction parallel corpus in the target field; the first error correction parallel corpus is generated by using the prediction model of the target field to predict at least one negative text corresponding to a positive text to be predicted in the target field, and is generated based on the positive text to be predicted and the at least one negative text corresponding thereto; the prediction model is obtained by training based on a positive sample of the target field and a negative text sample obtained by replacing at least one correct word in the positive sample with a corresponding replacement word;

and returning the correct text for the user terminal to search the target object based on the correct text.

The processing component 1701 may include one or more processors executing computer instructions to perform all or some of the steps of the methods described above. Of course, the processing elements may also be implemented as one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components configured to perform the above-described methods.

The storage component 1702 is configured to store various types of data to support operations in the server. The memory components may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

Of course, a computer device may also include other components, such as input/output interfaces, communication components, and so forth.

An embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a computer, the text error correction method according to the embodiment shown in fig. 6 can be implemented.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
