Text error correction method, related device and readable storage medium

Document No.: 1242923 Publication date: 2020-08-18

Reading note: this technology, "Text error correction method, related device and readable storage medium", was designed by 杨子清, 崔一鸣, 伍大勇, 王士进, 胡国平 and 刘挺 on 2020-04-30. Main content: The application discloses a text error correction method, a related device and a readable storage medium. After a text to be corrected is obtained, character-level disorder error correction and word-level disorder error correction are performed on it separately, yielding a character-level disorder-corrected text and a word-level disorder-corrected text, from which the final corrected text is obtained. This scheme corrects disorder occurring in text and, compared with manual correction, saves labor and time and thus improves correction efficiency.

1. A text error correction method, comprising:

acquiring a text to be corrected;

performing character-level disorder error correction on the text to be corrected to obtain a character-level disorder-corrected text;

performing word-level disorder error correction on the text to be corrected to obtain a word-level disorder-corrected text;

and obtaining a final corrected text based on the character-level disorder-corrected text and the word-level disorder-corrected text.

2. The method according to claim 1, wherein said performing character-level disorder error correction on the text to be corrected to obtain a character-level disorder-corrected text comprises:

performing adjacent character exchange processing on the text to be corrected to obtain a character-level candidate text set, wherein each character-level candidate text in the set is a text obtained by performing adjacent character exchange processing on the text to be corrected;

determining a target text set from the character-level candidate text set, wherein each target text in the target text set is a grammatically compliant character-level candidate text;

and determining the character-level disorder-corrected text based on the target text set.

3. The method of claim 2, wherein determining a target text set from the character-level candidate text set comprises:

for each character-level candidate text, calculating a grammar compliance probability value of the character-level candidate text and a grammar compliance probability value of the text to be corrected;

and determining whether the character-level candidate text is a target text based on the grammar compliance probability value of the character-level candidate text and the grammar compliance probability value of the text to be corrected, wherein all the determined target texts form the target text set.

4. The method of claim 2, wherein determining the character-level disorder-corrected text based on the target text set comprises:

for each target text in the target text set, determining the part of the target text that differs from the text to be corrected;

calculating a first probability value of the characters at the position, in the target text, corresponding to the differing part, and a second probability value of the characters at the position, in the text to be corrected, corresponding to the differing part;

and determining, according to the first probability value and the second probability value, whether the target text is the character-level disorder-corrected text.

5. The method of claim 1, wherein the performing word-level disorder error correction on the text to be corrected to obtain a word-level disorder-corrected text comprises:

performing word exchange processing on the text to be corrected to obtain a word-level candidate text set, wherein each word-level candidate text in the set is a text obtained by performing word exchange processing on the text to be corrected;

and determining the word-level disorder-corrected text based on the word-level candidate text set.

6. The method according to claim 5, wherein the performing word exchange processing on the text to be corrected comprises:

determining a disordered fragment in the text to be corrected, wherein the disordered fragment comprises at least two words;

and exchanging the word at the disorder start position and the word at the disorder end position of the disordered fragment to obtain the word-level candidate text.

7. The method according to claim 6, wherein the determining a disordered fragment in the text to be corrected comprises:

inputting the text to be corrected into a disordered fragment determination model, which outputs the disordered fragment in the text to be corrected, wherein the disordered fragment determination model is trained by taking training texts as training samples and the disordered fragments annotated in the training texts as sample labels.

8. The method of claim 5, wherein determining the word-level disorder-corrected text based on the word-level candidate text set comprises:

for each word-level candidate text in the word-level candidate text set, calculating a grammar compliance probability value of the word-level candidate text and a grammar compliance probability value of the text to be corrected;

and determining, according to the grammar compliance probability value of the word-level candidate text and the grammar compliance probability value of the text to be corrected, whether the word-level candidate text is the word-level disorder-corrected text.

9. The method of claim 1, wherein obtaining the final corrected text based on the character-level disorder-corrected text and the word-level disorder-corrected text comprises:

determining, based on the character-level disorder-corrected text and the word-level disorder-corrected text, a segment to be corrected in the text to be corrected and the corrected segment corresponding to that segment;

and replacing the segment to be corrected in the text to be corrected with its corresponding corrected segment to obtain the final corrected text.

10. The method of claim 9, wherein determining the segment to be corrected in the text to be corrected based on the character-level disorder-corrected text and the word-level disorder-corrected text comprises:

for each of the character-level disorder-corrected text and the word-level disorder-corrected text, determining the part of the text to be corrected that differs from that text;

determining that the differing part of the text to be corrected is a segment to be corrected in the text to be corrected;

or judging whether the differing part of the text to be corrected matches a configured text filtering rule; if not, determining that the differing part is a segment to be corrected in the text to be corrected; and if so, determining that the differing part is not a segment to be corrected in the text to be corrected.

11. The method of claim 10, wherein the judging whether the differing part of the text to be corrected matches the configured text filtering rule comprises:

judging whether the words that the configured text filtering rule specifies as not being disordered include the differing part of the text to be corrected;

if yes, determining that the differing part of the text to be corrected matches the configured text filtering rule;

and if not, determining that the differing part of the text to be corrected does not match the configured text filtering rule.

12. A text correction apparatus, comprising:

an acquiring unit, configured to acquire a text to be corrected;

a character-level disorder error correction unit, configured to perform character-level disorder error correction on the text to be corrected to obtain a character-level disorder-corrected text;

a word-level disorder error correction unit, configured to perform word-level disorder error correction on the text to be corrected to obtain a word-level disorder-corrected text;

and a text error correction unit, configured to obtain a final corrected text based on the character-level disorder-corrected text and the word-level disorder-corrected text.

13. A text correction apparatus comprising a memory and a processor;

the memory is used for storing programs;

the processor, configured to execute the program, implementing the steps of the text correction method according to any one of claims 1 to 11.

14. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text correction method according to any one of claims 1 to 11.

Technical Field

The present application relates to the field of natural language processing technologies, and in particular, to a text error correction method, a related device, and a readable storage medium.

Background

As information technology develops, more and more information is presented in the form of text, but in some scenarios disorder may occur in the text (e.g., in a sentence meaning "the development of the cause enters a new stage", two characters appear in swapped order). For example, users unfamiliar with Chinese grammar (e.g., primary school pupils and foreigners learning Chinese) may write out-of-order Chinese text; and when a user edits text on a computer, operations such as copying and pasting move the cursor, so the cursor can easily end up in the wrong position and the edited text may become disordered.

Disordered text hinders readers from correctly understanding the content, so disorder in text needs to be corrected.

At present, disorder in text is mostly detected and corrected manually, which consumes a large amount of labor and time and is inefficient.

Disclosure of Invention

In view of the foregoing problems, the present application provides a text error correction method, a related device and a readable storage medium. The specific scheme is as follows:

a text error correction method comprising:

acquiring a text to be corrected;

performing character-level disorder error correction on the text to be corrected to obtain a character-level disorder-corrected text;

performing word-level disorder error correction on the text to be corrected to obtain a word-level disorder-corrected text;

and obtaining a final corrected text based on the character-level disorder-corrected text and the word-level disorder-corrected text.

Optionally, the performing character-level disorder error correction on the text to be corrected to obtain a character-level disorder-corrected text includes:

performing adjacent character exchange processing on the text to be corrected to obtain a character-level candidate text set, wherein each character-level candidate text in the set is a text obtained by performing adjacent character exchange processing on the text to be corrected;

determining a target text set from the character-level candidate text set, wherein each target text in the target text set is a grammatically compliant character-level candidate text;

and determining the character-level disorder-corrected text based on the target text set.

Optionally, the determining a target text set from the character-level candidate text set includes:

for each character-level candidate text, calculating a grammar compliance probability value of the character-level candidate text and a grammar compliance probability value of the text to be corrected;

and determining whether the character-level candidate text is a target text based on the grammar compliance probability value of the character-level candidate text and the grammar compliance probability value of the text to be corrected, wherein all the determined target texts form the target text set.
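The filtering step above can be illustrated with a minimal sketch. The application does not state here how the two probability values are compared, so the rule used below (keep a candidate when its score exceeds the score of the text to be corrected by more than `margin`) is an assumption, and `score`, `margin` and `select_target_texts` are hypothetical names. Any scorer that returns a grammar compliance probability could be passed in.

```python
from typing import Callable, List


def select_target_texts(
    original: str,
    candidates: List[str],
    score: Callable[[str], float],
    margin: float = 0.0,
) -> List[str]:
    """Keep candidates that score as more grammatically compliant than the original.

    `score` is a hypothetical scorer returning a grammar compliance probability;
    the comparison rule (difference greater than `margin`) is an assumption.
    """
    base = score(original)
    return [c for c in candidates if score(c) - base > margin]


# Toy scorer for illustration only: a lookup table instead of a real model.
toy_scores = {"abdc": 0.2, "abcd": 0.9, "bacd": 0.1}
targets = select_target_texts("abdc", ["abcd", "bacd"], lambda t: toy_scores.get(t, 0.0))
```

With the toy scores above, only "abcd" is kept, because it is the only candidate that scores higher than the text to be corrected.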

Optionally, the determining the character-level disorder-corrected text based on the target text set includes:

for each target text in the target text set, determining the part of the target text that differs from the text to be corrected;

calculating a first probability value of the characters at the position, in the target text, corresponding to the differing part, and a second probability value of the characters at the position, in the text to be corrected, corresponding to the differing part;

and determining, according to the first probability value and the second probability value, whether the target text is the character-level disorder-corrected text.

Optionally, the performing word-level disorder error correction on the text to be corrected to obtain a word-level disorder-corrected text includes:

performing word exchange processing on the text to be corrected to obtain a word-level candidate text set, wherein each word-level candidate text in the set is a text obtained by performing word exchange processing on the text to be corrected;

and determining the word-level disorder-corrected text based on the word-level candidate text set.

Optionally, the performing word exchange processing on the text to be corrected includes:

determining a disordered fragment in the text to be corrected, wherein the disordered fragment comprises at least two words;

and exchanging the word at the disorder start position and the word at the disorder end position of the disordered fragment to obtain the word-level candidate text.
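The exchange step above can be sketched as follows. The sketch assumes the text has already been segmented into a list of words and that the disordered fragment is given as inclusive word indices `start` and `end`; the function name and this representation are hypothetical and not taken from the application.

```python
from typing import List


def swap_fragment_ends(words: List[str], start: int, end: int) -> List[str]:
    """Swap the words at the start and end positions of a disordered fragment.

    `start` and `end` are inclusive word indices (an assumed representation).
    The fragment must span at least two words, so start < end is required.
    """
    if not (0 <= start < end < len(words)):
        raise ValueError("fragment must cover at least two words inside the text")
    swapped = list(words)  # copy so the input list is left unchanged
    swapped[start], swapped[end] = swapped[end], swapped[start]
    return swapped


candidate = swap_fragment_ends(["A", "B", "C", "D"], 1, 3)
```

When the fragment contains exactly two words the swap covers adjacent words; a longer fragment covers the non-adjacent case, since only the two words at its ends are exchanged.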

Optionally, the determining a disordered fragment in the text to be corrected includes:

inputting the text to be corrected into a disordered fragment determination model, which outputs the disordered fragment in the text to be corrected, wherein the disordered fragment determination model is trained by taking training texts as training samples and the disordered fragments annotated in the training texts as sample labels.

Optionally, the determining the word-level disorder-corrected text based on the word-level candidate text set includes:

for each word-level candidate text in the word-level candidate text set, calculating a grammar compliance probability value of the word-level candidate text and a grammar compliance probability value of the text to be corrected;

and determining, according to the grammar compliance probability value of the word-level candidate text and the grammar compliance probability value of the text to be corrected, whether the word-level candidate text is the word-level disorder-corrected text.

Optionally, the obtaining a final corrected text based on the character-level disorder-corrected text and the word-level disorder-corrected text includes:

determining, based on the character-level disorder-corrected text and the word-level disorder-corrected text, a segment to be corrected in the text to be corrected and the corrected segment corresponding to that segment;

and replacing the segment to be corrected in the text to be corrected with its corresponding corrected segment to obtain the final corrected text.

Optionally, the determining a segment to be corrected in the text to be corrected based on the character-level disorder-corrected text and the word-level disorder-corrected text includes:

for each of the character-level disorder-corrected text and the word-level disorder-corrected text, determining the part of the text to be corrected that differs from that text;

determining that the differing part of the text to be corrected is a segment to be corrected in the text to be corrected;

or judging whether the differing part of the text to be corrected matches a configured text filtering rule; if not, determining that the differing part is a segment to be corrected in the text to be corrected; and if so, determining that the differing part is not a segment to be corrected in the text to be corrected.

Optionally, the judging whether the differing part of the text to be corrected matches the configured text filtering rule includes:

judging whether the words that the configured text filtering rule specifies as not being disordered include the differing part of the text to be corrected;

if yes, determining that the differing part of the text to be corrected matches the configured text filtering rule;

and if not, determining that the differing part of the text to be corrected does not match the configured text filtering rule.
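The fusion and filtering steps above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the application: each corrected text is assumed to differ from the text to be corrected in one contiguous span, the filtering rule is modeled as a set of protected strings (`protected`), and overlapping spans are resolved by keeping the rightmost one. The names `diff_span` and `fuse` are hypothetical.

```python
from typing import Iterable, List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, replacement) relative to the original text


def diff_span(original: str, corrected: Optional[str]) -> Optional[Span]:
    """Return the single span where `corrected` differs from `original`, if any."""
    if corrected is None or corrected == original:
        return None
    limit = min(len(original), len(corrected))
    prefix = 0
    while prefix < limit and original[prefix] == corrected[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and original[-1 - suffix] == corrected[-1 - suffix]:
        suffix += 1
    return prefix, len(original) - suffix, corrected[prefix:len(corrected) - suffix]


def fuse(original: str, corrected_texts: Iterable[Optional[str]],
         protected: Set[str] = frozenset()) -> str:
    """Apply the differing spans of each corrected text to the original text."""
    spans: List[Span] = []
    for corrected in corrected_texts:
        span = diff_span(original, corrected)
        if span is None:
            continue
        start, end, _ = span
        if original[start:end] in protected:
            continue  # the filtering rule marks this segment as not disordered
        spans.append(span)
    result = original
    last_start = len(original)
    # Apply right to left so earlier offsets stay valid; skip overlaps (assumption).
    for start, end, replacement in sorted(spans, reverse=True):
        if end > last_start:
            continue
        result = result[:start] + replacement + result[end:]
        last_start = start
    return result


fused = fuse("xabycdz", ["xbaycdz", "xabydcz"])
```

Passing `None` for a corrected text models the case in which one of the two error correction results is empty, so that only the other result contributes to the final text.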

A text correction apparatus comprising:

an acquiring unit, configured to acquire a text to be corrected;

a character-level disorder error correction unit, configured to perform character-level disorder error correction on the text to be corrected to obtain a character-level disorder-corrected text;

a word-level disorder error correction unit, configured to perform word-level disorder error correction on the text to be corrected to obtain a word-level disorder-corrected text;

and a text error correction unit, configured to obtain a final corrected text based on the character-level disorder-corrected text and the word-level disorder-corrected text.

Optionally, the character-level disorder error correction unit includes:

an adjacent character exchange processing unit, configured to perform adjacent character exchange processing on the text to be corrected to obtain a character-level candidate text set, wherein each character-level candidate text in the set is a text obtained by performing adjacent character exchange processing on the text to be corrected;

a first filtering unit, configured to determine a target text set from the character-level candidate text set, wherein each target text in the target text set is a grammatically compliant character-level candidate text;

and a second filtering unit, configured to determine the character-level disorder-corrected text based on the target text set.

Optionally, the first filtering unit includes:

a first grammar compliance probability value calculation unit, configured to calculate, for each character-level candidate text, a grammar compliance probability value of the character-level candidate text and a grammar compliance probability value of the text to be corrected;

and a target text determining unit, configured to determine whether the character-level candidate text is a target text based on the grammar compliance probability value of the character-level candidate text and the grammar compliance probability value of the text to be corrected, wherein all the determined target texts form the target text set.

Optionally, the second filtering unit includes:

a difference determining unit, configured to determine, for each target text in the target text set, the part of the target text that differs from the text to be corrected;

a probability calculation unit, configured to calculate a first probability value of the characters at the position, in the target text, corresponding to the differing part, and a second probability value of the characters at the position, in the text to be corrected, corresponding to the differing part;

and a text determining unit, configured to determine, according to the first probability value and the second probability value, whether the target text is the character-level disorder-corrected text.

Optionally, the word-level disorder error correction unit includes:

a word exchange processing unit, configured to perform word exchange processing on the text to be corrected to obtain a word-level candidate text set, wherein each word-level candidate text in the set is a text obtained by performing word exchange processing on the text to be corrected;

and a word-level disorder-corrected text determining unit, configured to determine the word-level disorder-corrected text based on the word-level candidate text set.

Optionally, the word exchange processing unit includes:

a disordered fragment determining unit, configured to determine a disordered fragment in the text to be corrected, wherein the disordered fragment comprises at least two words;

and an exchanging unit, configured to exchange the word at the disorder start position and the word at the disorder end position of the disordered fragment to obtain the word-level candidate text.

Optionally, the disordered fragment determining unit is specifically configured to:

input the text to be corrected into a disordered fragment determination model, which outputs the disordered fragment in the text to be corrected, wherein the disordered fragment determination model is trained by taking training texts as training samples and the disordered fragments annotated in the training texts as sample labels.

Optionally, the word-level disorder-corrected text determining unit includes:

a second grammar compliance probability value calculating unit, configured to calculate, for each word-level candidate text in the word-level candidate text set, a grammar compliance probability value of the word-level candidate text and a grammar compliance probability value of the text to be corrected;

and a word-level candidate text processing unit, configured to determine, according to the grammar compliance probability value of the word-level candidate text and the grammar compliance probability value of the text to be corrected, whether the word-level candidate text is the word-level disorder-corrected text.

Optionally, the text error correction unit includes:

a determining unit, configured to determine, based on the character-level disorder-corrected text and the word-level disorder-corrected text, a segment to be corrected in the text to be corrected and the corrected segment corresponding to that segment;

and a replacing unit, configured to replace the segment to be corrected in the text to be corrected with its corresponding corrected segment to obtain the final corrected text.

Optionally, the determining unit includes:

a differing part determining unit, configured to determine, for each of the character-level disorder-corrected text and the word-level disorder-corrected text, the part of the text to be corrected that differs from that text;

a first segment determining unit, configured to determine that the differing part of the text to be corrected is a segment to be corrected in the text to be corrected;

or, a second segment determining unit, configured to judge whether the differing part of the text to be corrected matches a configured text filtering rule; if not, determine that the differing part is a segment to be corrected in the text to be corrected; and if so, determine that the differing part is not a segment to be corrected in the text to be corrected.

Optionally, the second segment determining unit includes:

a filtering rule judging unit, configured to judge whether the words that the configured text filtering rule specifies as not being disordered include the differing part of the text to be corrected; if yes, determine that the differing part of the text to be corrected matches the configured text filtering rule; and if not, determine that the differing part of the text to be corrected does not match the configured text filtering rule.

A text error correction apparatus includes a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the steps of the text error correction method.

A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the text correction method as described above.

With the above technical solution, the present application discloses a text error correction method, a related device and a readable storage medium. After the text to be corrected is obtained, character-level disorder error correction and word-level disorder error correction are performed on it separately to obtain a character-level disorder-corrected text and a word-level disorder-corrected text, and the final corrected text is obtained based on these two texts.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a schematic flow chart illustrating a text error correction method disclosed in an embodiment of the present application;

FIG. 2 is a schematic diagram of an out-of-order fragment determination model as disclosed herein;

FIG. 3 is a schematic structural diagram of a text error correction apparatus disclosed in an embodiment of the present application;

FIG. 4 is a block diagram of a hardware structure of a text error correction apparatus disclosed in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Next, the text error correction method provided in the present application will be described by the following embodiments.

Referring to FIG. 1, FIG. 1 is a schematic flowchart of a text error correction method disclosed in an embodiment of the present application. The method may include:

Step S101: acquiring the text to be corrected.

In the present application, the text to be corrected may be text written in any of various languages, for example, Chinese text. The text to be corrected may be a sentence, a paragraph, or a whole document, such as a news manuscript.

As one implementation, acquiring the text to be corrected may include obtaining an original text to be corrected and using it directly as the text to be corrected.

However, the format or content of the original text to be corrected may not meet the requirements of error correction. Therefore, as another possible implementation, the original text to be corrected may be preprocessed, and the preprocessed text is used as the text to be corrected. The preprocessing may be text cleaning, which includes, but is not limited to, using a language technology platform tool to split the original text into sentences, deleting blank lines, deleting illegal characters, deleting garbled characters caused by encoding errors, and deleting whitespace at the beginning and end of a sentence.

It should be noted that if the preprocessed text is empty, no disorder error exists in the text, and an empty result is returned directly. If the preprocessed text is not empty, the subsequent steps continue to be executed.
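The text cleaning described above can be illustrated with a minimal sketch. It does not use the language technology platform tool mentioned in the text; a regular expression that splits after sentence-ending punctuation stands in for sentence segmentation. Treating control characters and the Unicode replacement character as the characters to delete is an assumption, and `clean_text` is a hypothetical name.

```python
import re
import unicodedata
from typing import List

# Zero-width split point right after sentence-ending punctuation (Python 3.7+).
_SENTENCE_END = re.compile(r"(?<=[。！？!?])")


def clean_text(raw: str) -> List[str]:
    """Split raw text into cleaned sentences; an empty list means nothing to correct."""
    sentences: List[str] = []
    for line in raw.splitlines():
        # Drop control characters and U+FFFD, which often marks a decoding error.
        line = "".join(
            ch for ch in line
            if ch != "\ufffd" and unicodedata.category(ch) != "Cc"
        )
        for piece in _SENTENCE_END.split(line):
            piece = piece.strip()  # remove leading and trailing whitespace
            if piece:              # blank lines and empty pieces are discarded
                sentences.append(piece)
    return sentences


cleaned = clean_text("  你好。再见！ \n\n\ufffdabc\x00 ")
```

An empty list returned by `clean_text` corresponds to the case in which the preprocessed text is empty and an empty result is returned without running the later steps.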

Step S102: performing character-level disorder error correction on the text to be corrected to obtain a character-level disorder-corrected text.

In this application, character-level disorder refers to the reversal of the order of two adjacent characters; for example, in the text "I am on the go", the order of the two characters "also" and "on" is reversed. A specific implementation of performing character-level disorder error correction on the text to be corrected to obtain the character-level disorder-corrected text is described in detail in the following embodiments.

Step S103: performing word-level disorder error correction on the text to be corrected to obtain a word-level disorder-corrected text.

In this application, word-level disorder refers to the reversal of the order of words. It may be the reversal of the order of two adjacent words; for example, in the text "please keep fit", the order of the two words "body" and "keep fit" is reversed. It may also be the reversal of the order of two non-adjacent words; for example, in the text "display of the public material base, currently …", the order of the two words "display" and "base" is reversed. A specific implementation of performing word-level disorder error correction on the text to be corrected to obtain the word-level disorder-corrected text is described in detail in the following embodiments.

In the present application, the execution order of step S102 and step S103 is not limited; the two steps may be executed one after the other or simultaneously.

Step S104: obtaining a final corrected text based on the character-level disorder-corrected text and the word-level disorder-corrected text.

In the present application, the character-level disorder-corrected text and the word-level disorder-corrected text may be fused to obtain the final corrected text; a specific fusion manner is described in detail in the following embodiments.

It should be noted that the text to be corrected may contain no disorder, only character-level disorder, only word-level disorder, or both character-level and word-level disorder. In the present application, step S102, step S103, and step S104 are executed regardless of which kind of disorder exists in the text to be corrected. However, the character-level disorder-corrected text and/or the word-level disorder-corrected text may be empty, and the final corrected text may be the same as or different from the text to be corrected.

This embodiment discloses a text error correction method. After the text to be corrected is obtained, character-level disorder error correction and word-level disorder error correction are performed on it separately to obtain a character-level disorder-corrected text and a word-level disorder-corrected text, and the final corrected text is obtained based on these two texts.

In another embodiment of the present application, an implementation of step S102, performing character-level disorder error correction on the text to be corrected to obtain the text after character-level disorder error correction, is described. The implementation may include the following steps:

Step S201: performing adjacent character exchange processing on the text to be corrected to obtain a character-level candidate text set, where each character-level candidate text in the set is a text obtained by performing adjacent character exchange processing on the text to be corrected.

In this application, performing adjacent character exchange processing on the text to be corrected may include: determining two adjacent target characters in the text to be corrected; and exchanging the order of the two target characters in the text to be corrected to obtain a character-level candidate text.

For ease of understanding, assume the text to be corrected is "great khingan"; the character-level candidate text set is then { "great khingan", "great khingan ampere" }.
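The adjacent character exchange of step S201 can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the function name `adjacent_swap_candidates` is hypothetical, and skipping pairs of identical characters (whose exchange would reproduce the input) is an assumption not stated in the text.

```python
def adjacent_swap_candidates(text: str) -> list[str]:
    """Return every text obtained by exchanging one pair of adjacent characters."""
    candidates = []
    for i in range(len(text) - 1):
        # Assumption: exchanging two identical characters yields the input
        # itself, so such pairs are skipped.
        if text[i] == text[i + 1]:
            continue
        chars = list(text)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        candidates.append("".join(chars))
    return candidates


if __name__ == "__main__":
    print(adjacent_swap_candidates("abc"))
```

A text of n characters yields at most n - 1 candidates, so the candidate set stays small enough to score every member in step S202.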

Step S202: determining a target text set from the character-level candidate text set.

In the present application, each target text in the target text set is a grammatically compliant character-level candidate text.

It should be noted that a character-level candidate text being grammatically compliant specifically means that the characters that underwent adjacent character exchange in that candidate text are grammatically compliant.

In this application, determining the target text set from the character-level candidate text set may include the following steps:

Step S2021: for each character-level candidate text, calculating the grammar compliance probability value of the character-level candidate text and the grammar compliance probability value of the text to be corrected.

In the present application, the text to be corrected and each character-level candidate text may be input into a grammar compliance probability value calculation model, which outputs the grammar compliance probability value of the text to be corrected and the grammar compliance probability value of each character-level candidate text.

It should be noted that, in the present application, the grammar compliance probability value calculation model may specifically be an n-gram language model built from grammatically correct text such as news, novels, encyclopedias and daily conversations. Using statistics over the corpus, the n-gram language model estimates the probability of the character at each position of the text to be corrected and from these computes the probability of the whole text, whose value may be used as the grammar compliance probability value of the text to be corrected. In the same way, it estimates the probability of the character at each position of each character-level candidate text and computes the probability of that candidate text, whose value may be used as the grammar compliance probability value of the candidate text.

Step S2022: determining whether the character-level candidate text is a target text based on the grammar compliance probability value of the character-level candidate text and the grammar compliance probability value of the text to be corrected, where all the determined target texts form the target text set.

In the application, the difference obtained by subtracting the grammar compliance probability value of the text to be corrected from that of the character-level candidate text may be calculated, and it is judged whether the difference is greater than a first preset threshold. If so, the character-level candidate text is determined to be a target text; otherwise, it is determined not to be a target text.
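Steps S2021 and S2022 can be illustrated with a small character-level n-gram scorer. The sketch below is written under stated assumptions and is not the model of the application: it fixes n = 2, uses add-one smoothing, compares log-probabilities rather than raw probabilities, and the names `BigramModel` and `select_targets` as well as the threshold value are hypothetical.

```python
import math
from collections import Counter


class BigramModel:
    """Character bigram language model with add-one smoothing (an assumption)."""

    def __init__(self, corpus: list[str]):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        vocab = {"</s>"}
        for sentence in corpus:
            chars = ["<s>"] + list(sentence) + ["</s>"]
            vocab.update(sentence)
            self.context_counts.update(chars[:-1])
            self.bigram_counts.update(zip(chars, chars[1:]))
        self.vocab_size = len(vocab)

    def log_prob(self, text: str) -> float:
        """Log-probability of the whole text, used as its grammar compliance score."""
        chars = ["<s>"] + list(text) + ["</s>"]
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            num = self.bigram_counts[(prev, cur)] + 1
            den = self.context_counts[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def select_targets(original: str, candidates: list[str],
                   model: BigramModel, threshold: float) -> list[str]:
    """Keep candidates whose score exceeds the original's by more than threshold."""
    base = model.log_prob(original)
    return [c for c in candidates if model.log_prob(c) - base > threshold]


if __name__ == "__main__":
    model = BigramModel(["abc", "abc", "abd"])
    print(select_targets("acb", ["cab", "abc"], model, threshold=1.0))
```

Working in log space avoids numerical underflow on long texts; under that choice the first preset threshold is a margin on a log-ratio rather than on a probability difference.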

Step S203: determining the text after character-level disorder error correction based on the target text set.

In this application, determining the text after character-level disorder error correction based on the target text set may include the following steps:

Step S2031: for each target text in the target text set, determining the part of the target text that differs from the text to be corrected.

For ease of understanding, assume the text to be corrected is "he takes us up a mountain" and the target text set is { "he takes us up a mountain", "he takes us up a mountain" }; for the target text "he takes us up a mountain", the part that differs from the text to be corrected is "just taking".

Step S2032: calculating a first probability value that the characters at the positions of the differing part are the characters at the corresponding positions of the target text, and a second probability value that the characters at those positions are the characters at the corresponding positions of the text to be corrected.

In the application, the part of the target text that differs from the text to be corrected may be marked with [MASK], and a Transformer-based neural network language model may be used to predict the probability that the characters at the [MASK] positions are the characters at the corresponding positions of the target text, which is the first probability value, and the probability that the characters at the [MASK] positions are the characters at the corresponding positions of the text to be corrected, which is the second probability value.

For ease of understanding, assume the text to be corrected is "he takes us and goes up the mountain". The text in which the differing part is marked with [MASK] is "his [MASK] goes up the mountain". This text is input into the Transformer-based neural network language model, which predicts the probability that the [MASK] positions hold the characters of the target text (the first probability value) and the probability that they hold the characters of the text to be corrected (the second probability value).

It should be noted that the Transformer-based neural network language model may be any one of a BERT (Bidirectional Encoder Representations from Transformers) model, a RoBERTa-large Chinese pre-training model, RoBERTa-wwm-ext, and RoBERTa-wwm-large-ext.

Step S2033: determining whether the target text is the text after character-level disorder error correction according to the first probability value and the second probability value.

In this step, the difference obtained by subtracting the second probability value from the first probability value may be calculated, and if the difference is greater than a second preset threshold, the target text is determined to be the text after character-level disorder error correction.
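The masking and comparison of steps S2031 to S2033 can be sketched as follows. The masked language model is represented by a hypothetical callable `scorer(masked_tokens, position, char)` that returns the probability of `char` at a masked position; in practice it would wrap a BERT-style model. Combining several masked positions by multiplying their probabilities is an assumption, since the text does not say how multi-character differences are aggregated.

```python
from typing import Callable, Sequence

MASK = "[MASK]"
# Hypothetical interface: probability that `char` fills masked_tokens[position].
Scorer = Callable[[Sequence[str], int, str], float]


def diff_positions(original: str, target: str) -> list[int]:
    """Positions at which the target text differs from the text to be corrected."""
    return [i for i, (a, b) in enumerate(zip(original, target)) if a != b]


def accept_target(original: str, target: str,
                  scorer: Scorer, threshold: float) -> bool:
    """Apply the S2033 test: first probability minus second exceeds threshold."""
    positions = diff_positions(original, target)
    if not positions:
        return False  # nothing was changed, so nothing to accept
    masked = [MASK if i in positions else ch for i, ch in enumerate(target)]
    first = second = 1.0
    for i in positions:
        first *= scorer(masked, i, target[i])     # characters of the target text
        second *= scorer(masked, i, original[i])  # characters of the text to be corrected
    return first - second > threshold


if __name__ == "__main__":
    # Toy scorer standing in for a real masked language model.
    def toy_scorer(tokens, position, char):
        return 0.9 if "abc"[position] == char else 0.05

    print(accept_target("acb", "abc", toy_scorer, threshold=0.5))
```

Because an adjacent exchange preserves length, the differing part can be found by position-wise comparison, and the same masked input serves for both probability values.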

In another embodiment of the present application, an implementation of step S103, performing word-level disorder error correction on the text to be corrected to obtain the text after word-level disorder error correction, is described. The implementation may include the following steps:

Step S301: performing word exchange processing on the text to be corrected to obtain a word-level candidate text set, where each word-level candidate text in the set is a text obtained by performing word exchange processing on the text to be corrected.

In this application, performing word exchange processing on the text to be corrected may include: determining a disorder fragment in the text to be corrected, the disorder fragment including at least two words; and exchanging the word at the disorder start position and the word at the disorder end position in the disorder fragment to obtain a word-level candidate text.

For ease of understanding, assume the text to be corrected is "Chinese is one of the most popular languages" and the disorder fragments are "one of the languages of people" and "people". Exchanging the word at the disorder start position with the word at the disorder end position in the fragment "people" yields the word-level candidate text "Chinese is one of the most popular languages"; exchanging the word at the disorder start position with the word at the disorder end position in the fragment "one of the languages" yields the word-level candidate text "Chinese is one of the most popular languages".

It should be noted that, in this application, determining the disorder fragment in the text to be corrected may include: inputting the text to be corrected into a disorder fragment determination model, which outputs the disorder fragments in the text to be corrected.

In this application, the disorder fragment determination model may also be the Transformer-based neural network language model mentioned in step S2032, except that it is obtained by training with training texts as training samples and the disorder fragments labeled in the training texts as sample labels.

For ease of understanding, fig. 2 is a schematic diagram of the disorder fragment determination model disclosed in the present application. Based on this model, the text to be corrected is split into characters and input into the model; the embedding layer (character vectors) and the Transformer compute a hidden-layer representation at each position, and a fully connected layer predicts the category corresponding to each position. In one implementation, the categories may include normal, disorder start, disorder middle and disorder end. A disorder fragment is a fragment composed of positions whose categories are disorder start and disorder end, or a fragment composed of positions whose categories are disorder start, disorder middle and disorder end.
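Given per-position categories from such a model, the exchange of step S301 can be sketched as below. The single-letter labels `O` (normal), `B` (disorder start), `M` (disorder middle) and `E` (disorder end) are hypothetical, and treating each run of `B` positions as the start word and each run of `E` positions as the end word is an assumption; the tags themselves would come from the disorder fragment determination model, which is not reproduced here.

```python
import re


def swap_fragment_candidates(chars: str, tags: list[str]) -> list[str]:
    """Build one word-level candidate per disorder fragment found in the tags.

    tags[i] is the predicted category of chars[i]: "O", "B", "M" or "E".
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    tag_string = "".join(tags)
    candidates = []
    # A fragment is a start run, an optional middle run, then an end run.
    for match in re.finditer(r"(B+)(M*)(E+)", tag_string):
        b0, b1 = match.span(1)
        m0, m1 = match.span(2)
        e0, e1 = match.span(3)
        # Exchange the start word and the end word; the middle stays in place.
        swapped = (chars[:b0] + chars[e0:e1] + chars[m0:m1]
                   + chars[b0:b1] + chars[e1:])
        candidates.append(swapped)
    return candidates


if __name__ == "__main__":
    print(swap_fragment_candidates("xABCDy", list("OBBMEO")))
```

Each fragment produces its own candidate from the unmodified input, so the candidates can be scored independently in step S302.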

Step S302: determining the text after word-level disorder error correction based on the word-level candidate text set.

In this application, when the word-level candidate text set is not empty, determining the text after word-level disorder error correction based on the word-level candidate text set may include the following steps:

step S3021: for each word level candidate text in the word level candidate text set, calculating a grammar compliance probability value of the word level candidate text and a grammar compliance probability value of the text to be corrected.

The text to be corrected and each word-level candidate text may be input into a grammar compliance probability value calculation model that outputs a grammar compliance probability value of the text to be corrected and a grammar compliance probability value of each word-level candidate text.

It should be noted that, in the present application, the grammar compliance probability value calculation model may specifically be an n-gram language model built from grammatically correct text such as news, novels, encyclopedias and daily conversations. Using statistics over the corpus, the n-gram language model estimates the probability of the word at each position of the text to be corrected and from these computes the probability of the whole text, whose value may be used as the grammar compliance probability value of the text to be corrected. In the same way, it estimates the probability of the word at each position of each word-level candidate text and computes the probability of that candidate text, whose value may be used as the grammar compliance probability value of the candidate text.

Step S3022: determining whether the word-level candidate text is the text after word-level disorder error correction according to the grammar compliance probability value of the word-level candidate text and the grammar compliance probability value of the text to be corrected.

In this application, this determination may be implemented as follows: calculating the difference obtained by subtracting the grammar compliance probability value of the text to be corrected from that of the word-level candidate text, and if the difference is greater than a third preset threshold, determining that the word-level candidate text is the text after word-level disorder error correction.

In another embodiment of the present application, an implementation of step S104, obtaining the error-corrected final text based on the text after character-level disorder error correction and the text after word-level disorder error correction, is described. The implementation may include the following steps:

Step S401: determining, based on the text after character-level disorder error correction and the text after word-level disorder error correction, an error correction segment of the text to be corrected and an error-corrected segment corresponding to that error correction segment.

In this application, determining the error correction segment of the text to be corrected based on the text after character-level disorder error correction and the text after word-level disorder error correction may include:

S4011: for each of the text after character-level disorder error correction and the text after word-level disorder error correction, determining the part of the text to be corrected that differs from that text;

For ease of understanding, assume the text to be corrected is "he congratulates one way along the wind" and the text after disorder error correction is "he congratulates one way along the wind"; the part of the text to be corrected that differs from the text after character-level disorder error correction is "congratulatory", and the part that differs from the text after word-level disorder error correction is "one way along the wind".

S4012: determining that the part of the text to be corrected that differs from that text is an error correction segment of the text to be corrected.

For ease of understanding, continuing the example above, in which the differing parts are "congratulating" and "one way along the wind", the error correction segments of the text to be corrected are "congratulating" and "one way along the wind".

However, in some cases an apparent disorder in entity words, new words, domain words or whitelisted words is not a true disorder. For example, "zhongying building" is better known and more common in the corpus than "yin zhongying building", yet "yin zhongying building" does exist in suitable contexts. As another example, in judicial documents the term "case involves" may appear frequently; from a general-domain perspective it looks like a disordered form of "involves", but it is in fact a domain-specific word. Treating such words as disordered may cause errors in text correction.

To solve the above problem, in the present application a text filtering rule may further be configured in advance, setting the words in which no disorder exists (such as entity words, new words, domain words and whitelisted words). For each of the text after character-level disorder error correction and the text after word-level disorder error correction, after the part of the text to be corrected that differs from that text is determined, it is judged whether this part matches the configured text filtering rule. If it does not match, the part is determined to be an error correction segment of the text to be corrected; if it matches, the part is determined not to be an error correction segment of the text to be corrected.

The process of judging whether the differing part of the text to be corrected matches the configured text filtering rule may include: judging whether the words set by the configured text filtering rule as having no disorder include the differing part; if so, determining that the differing part matches the configured text filtering rule; and if not, determining that the differing part does not match the configured text filtering rule.

For ease of understanding, assume the pre-configured text filtering rule sets "case wad" as a word in which no disorder exists, the text to be corrected is "case wad", and the text after disorder error correction is "wad". The part of the text to be corrected that differs from the corrected text is "case wad"; because the rule sets "case wad" as a word in which no disorder exists, "case wad" is determined not to be an error correction segment of the text to be corrected.

In the application, after an error correction segment of the text to be corrected is determined, the error-corrected segment corresponding to it also needs to be determined. The error-corrected segment corresponding to an error correction segment is the text segment at the position of that error correction segment in the text after character-level disorder error correction or in the text after word-level disorder error correction.

It should be noted that, for an error correction segment, if a corresponding error-corrected segment exists only in the text after character-level disorder error correction, or only in the text after word-level disorder error correction, that segment is the error-corrected segment corresponding to the error correction segment. If corresponding error-corrected segments exist in both texts, the grammar compliance probability value of the text after character-level disorder error correction and that of the text after word-level disorder error correction are obtained, and the error-corrected segment in the text with the larger grammar compliance probability value is determined to be the error-corrected segment corresponding to the error correction segment.

For ease of understanding, continuing the example in which the text to be corrected is "he congratulates one way along the wind" and its error correction segments are "congratulating" and "one way along the wind", the error-corrected segment corresponding to the error correction segment "congratulating" is "parallel congratulatory", and the error-corrected segment corresponding to the error correction segment "one way along the wind" is "one way along the wind".

Step S402: replacing each error correction segment in the text to be corrected with its corresponding error-corrected segment to obtain the error-corrected final text.

For ease of understanding, based on the above steps, "and" replaces "congratulatory" in the text to be corrected "he congratulates and takes one way along the wind", and "one way along the wind" replaces "one way along the wind" in the same text, so that the error-corrected final text is "he congratulates and takes one way along the wind".
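The fusion of steps S401 and S402, including the text filtering rule, can be sketched as follows. This is an illustration under several assumptions rather than the implementation of the application: all texts are taken to have equal length (exchanges only reorder characters), differing parts are found by position-wise comparison, the filtering rule is modelled as exact membership in a set `protected`, and differing spans from the two corrected texts are assumed either to coincide or to be disjoint. The names `diff_spans`, `fuse` and `score` are hypothetical; `score` stands in for the grammar compliance probability value calculation model.

```python
from typing import Callable


def diff_spans(original: str, corrected: str) -> list[tuple[int, int]]:
    """Maximal [start, end) runs where the two equal-length texts differ."""
    spans, start = [], None
    for i, (a, b) in enumerate(zip(original, corrected)):
        if a != b and start is None:
            start = i
        elif a == b and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(original)))
    return spans


def fuse(original: str, corrected_texts: list[str],
         score: Callable[[str], float],
         protected: frozenset = frozenset()) -> str:
    """Merge the non-empty corrected texts into one final text."""
    chosen = {}  # span -> (score of source text, error-corrected segment)
    for text in corrected_texts:
        text_score = score(text)
        for start, end in diff_spans(original, text):
            # Filtering rule: protected words are never error correction segments.
            if original[start:end] in protected:
                continue
            best = chosen.get((start, end))
            # When both texts correct the same segment, keep the higher-scoring one.
            if best is None or text_score > best[0]:
                chosen[(start, end)] = (text_score, text[start:end])
    result = list(original)
    for (start, end), (_, replacement) in chosen.items():
        result[start:end] = replacement
    return "".join(result)


if __name__ == "__main__":
    print(fuse("ba dc", ["ab dc", "ba cd"], score=lambda t: 0.0))
```

If both corrected texts are empty, `corrected_texts` is an empty list and the function returns the text to be corrected unchanged, which matches the remark that the final text may be the same as the input.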

The following describes the text error correction device disclosed in the embodiment of the present application, and the text error correction device described below and the text error correction method described above may be referred to correspondingly.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a text error correction apparatus disclosed in the embodiment of the present application. As shown in fig. 3, the text correction apparatus may include:

an obtaining unit 11, configured to obtain a text to be corrected;

a character-level disorder error correction unit 12, configured to perform character-level disorder error correction on the text to be corrected to obtain a text after character-level disorder error correction;

a word-level disorder error correction unit 13, configured to perform word-level disorder error correction on the text to be corrected to obtain a text after word-level disorder error correction;

and a text error correction unit 14, configured to obtain an error-corrected final text based on the text after character-level disorder error correction and the text after word-level disorder error correction.

Optionally, the character-level disorder error correction unit includes:

an adjacent character exchange processing unit, configured to perform adjacent character exchange processing on the text to be corrected to obtain a character-level candidate text set, where each character-level candidate text in the set is a text obtained by performing adjacent character exchange processing on the text to be corrected;

a first filtering unit, configured to determine a target text set from the character-level candidate text set, where each target text in the target text set is a grammatically compliant character-level candidate text;

and a second filtering unit, configured to determine the text after character-level disorder error correction based on the target text set.

Optionally, the first filtering unit includes:

a first grammar compliance probability value calculation unit, configured to calculate, for each character-level candidate text, the grammar compliance probability value of the character-level candidate text and the grammar compliance probability value of the text to be corrected;

and a target text determining unit, configured to determine whether the character-level candidate text is a target text based on the grammar compliance probability value of the character-level candidate text and the grammar compliance probability value of the text to be corrected, where all the determined target texts form the target text set.

Optionally, the second filtering unit includes:

a difference determining unit, configured to determine, for each target text in the target text set, the part of the target text that differs from the text to be corrected;

a probability calculation unit, configured to calculate a first probability value that the characters at the positions of the differing part are the characters at the corresponding positions of the target text, and a second probability value that the characters at those positions are the characters at the corresponding positions of the text to be corrected;

and a text determining unit, configured to determine whether the target text is the text after character-level disorder error correction according to the first probability value and the second probability value.

Optionally, the word-level disorder error correction unit includes:

the word exchange processing unit is used for carrying out word exchange processing on the text to be corrected to obtain a word-level candidate text set; each word-level candidate text in the word-level candidate text set is a text obtained after the word exchange processing is carried out on the text to be corrected;

and the text determining unit is used for determining the text subjected to word level disorder and error correction based on the word level candidate text set.

Optionally, the word exchange processing unit includes:

the disorder fragment determining unit is used for determining a disorder fragment from the text to be corrected, wherein the disorder fragment at least comprises two words;

and the exchanging unit is used for exchanging the words at the disorder starting position and the words at the disorder ending position in the disorder fragments to obtain the word-level candidate texts.

Optionally, the disorder fragment determining unit is specifically configured to:

input the text to be corrected into a disorder fragment determination model, which outputs the disorder fragments in the text to be corrected, where the disorder fragment determination model is obtained by training with training texts as training samples and the disorder fragments labeled in the training texts as sample labels.

Optionally, the unit for determining the text after word-level disorder error correction includes:

a second grammar compliance probability value calculation unit, configured to calculate, for each word-level candidate text in the word-level candidate text set, the grammar compliance probability value of the word-level candidate text and the grammar compliance probability value of the text to be corrected;

and a word-level candidate text processing unit, configured to determine whether the word-level candidate text is the text after word-level disorder error correction according to the grammar compliance probability value of the word-level candidate text and the grammar compliance probability value of the text to be corrected.

Optionally, the text correction unit includes:

a determining unit, configured to determine, based on the text after character-level disorder error correction and the text after word-level disorder error correction, an error correction segment of the text to be corrected and an error-corrected segment corresponding to that error correction segment;

and a replacing unit, configured to replace the error correction segment in the text to be corrected with its corresponding error-corrected segment to obtain the error-corrected final text.

Optionally, the determining unit includes:

a differing part determining unit, configured to determine, for each of the text after character-level disorder error correction and the text after word-level disorder error correction, the part of the text to be corrected that differs from that text;

a first error correction segment determining unit, configured to determine that the part of the text to be corrected that differs from that text is an error correction segment of the text to be corrected;

or, a second error correction segment determining unit, configured to judge whether the part of the text to be corrected that differs from that text matches a configured text filtering rule; if it does not match, determine that the part is an error correction segment of the text to be corrected; and if it matches, determine that the part is not an error correction segment of the text to be corrected.

Optionally, the second error correction segment determining unit includes:

a filtering rule judging unit, configured to judge whether the words set by the configured text filtering rule as having no disorder include the part of the text to be corrected that differs from that text; if so, determine that the part matches the configured text filtering rule; and if not, determine that the part does not match the configured text filtering rule.

Referring to fig. 4, fig. 4 is a block diagram of a hardware structure of a text error correction device disclosed in an embodiment of the present application. The hardware structure of the text error correction device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;

in the embodiment of the application, there is at least one of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;

the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application;

the memory 3 may include a high-speed RAM, and may further include a non-volatile memory, such as at least one disk memory;

wherein the memory stores a program and the processor can call the program stored in the memory, the program for:

acquiring a text to be corrected;

performing word-level disorder error correction on the text to be corrected to obtain a text subjected to word-level disorder error correction;

performing word-level disorder error correction on the text to be corrected to obtain a text subjected to word-level disorder error correction;

and obtaining an error-corrected final text based on the word-level disorder error-corrected text and the word-level disorder error-corrected text.

Optionally, the refined functions and extended functions of the program may be as described above.
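The four program steps above can be sketched as follows, treating the first pass as operating on individual characters and the second on segmented words. This is only an illustrative sketch: the `score` callable (for example, a language-model fluency score), the `segment` callable (a word segmenter), and the rule of keeping the higher-scoring of the two results are hypothetical stand-ins that this section does not specify. The sketch also assumes that concatenating the units reproduces the text, as is the case for unspaced Chinese text.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str], float]          # hypothetical fluency scorer
Segmenter = Callable[[str], List[str]]   # hypothetical word segmenter


def adjacent_swaps(units: Sequence[str]) -> List[str]:
    """Texts obtained by swapping exactly one pair of adjacent units."""
    candidates = []
    for i in range(len(units) - 1):
        swapped = list(units)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        candidates.append("".join(swapped))
    return candidates


def correct_at_level(text: str, units: Sequence[str], score: Scorer) -> str:
    """Keep the original text unless some adjacent swap scores strictly higher."""
    return max([text] + adjacent_swaps(units), key=score)


def correct_text(text: str, segment: Segmenter, score: Scorer) -> str:
    char_level = correct_at_level(text, list(text), score)      # characters as units
    word_level = correct_at_level(text, segment(text), score)   # words as units
    # Assumed combination rule: keep the better-scoring of the two results.
    return max([char_level, word_level], key=score)
```

With a real scorer and segmenter plugged in, `correct_text` returns the input unchanged whenever no single adjacent swap improves the score, because `max` returns the first of several equally scored candidates and the original text is listed first.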

Embodiments of the present application further provide a storage medium, which may store a program suitable for execution by a processor, the program being configured to:

acquiring a text to be corrected;

performing character-level disorder error correction on the text to be corrected to obtain a text subjected to character-level disorder error correction;

performing word-level disorder error correction on the text to be corrected to obtain a text subjected to word-level disorder error correction;

and obtaining an error-corrected final text based on the character-level disorder error-corrected text and the word-level disorder error-corrected text.

Optionally, the refined functions and extended functions of the program may be as described above.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
