Character error correction method, device, equipment and storage medium

Document No.: 1113569  Publication date: 2020-09-29

Reading note: this technology, 文字纠错方法、装置、设备及存储介质 (Character error correction method, device, equipment and storage medium), was designed by 杨震, 李彦, 亓超 and 马宇驰 on 2020-06-11. Main content: The embodiments of the application provide a character error correction method, apparatus, device and storage medium, relating to the computer field. The method uses a pre-trained language error correction model to detect the occurrence probability of each original character of a sentence to be corrected at its corresponding position in that sentence, and determines the character to be corrected according to the occurrence probability. It then selects, from the candidate characters in a pre-configured character table, a target character to replace the character to be corrected, and corrects the sentence. The technical solution identifies and corrects wrongly written characters in the sentence to be corrected efficiently and accurately.

1. A character error correction method is characterized by comprising the following steps:

detecting and obtaining the occurrence probability of each original character of the sentence to be corrected at the corresponding position in the sentence to be corrected by utilizing a pre-trained language error correction model;

determining characters to be corrected in the sentences to be corrected according to the occurrence probability;

and selecting target characters for replacing the characters to be corrected from all candidate characters in a pre-configured character table, and correcting the sentences to be corrected by using the target characters.

2. The text error correction method according to claim 1, wherein the step of detecting and obtaining the occurrence probability of each original character of the sentence to be corrected at the corresponding position in the sentence to be corrected by using the pre-trained language error correction model comprises:

acquiring a sentence to be corrected, inputting the sentence to be corrected into a pre-trained language correction model, and analyzing semantic association relations among original characters of the sentence to be corrected through the language correction model;

and obtaining the occurrence probability of each original character of the sentence to be corrected at the corresponding position in the sentence to be corrected based on the semantic association relation.

3. The text error correction method according to claim 2, wherein the step of obtaining the sentence to be corrected comprises:

performing optical character recognition on video frame images to obtain video subtitle sentences, and screening the sentences to be corrected out of the video subtitle sentences.

4. The text error correction method according to claim 1, wherein the step of determining the character to be corrected in the sentence to be corrected according to the occurrence probability comprises:

acquiring the occurrence probability of each original character of the sentence to be corrected at the corresponding position in the sentence to be corrected;

and comparing the occurrence probability with a preset threshold; if the occurrence probability is smaller than the preset threshold, determining the position as a position to be corrected, and determining the original character at the position to be corrected as the character to be corrected.

5. The text error correction method of claim 1, wherein the step of selecting a target text for replacing the text to be corrected from among the candidate texts in the pre-configured word table comprises:

acquiring each candidate character of a pre-configured character table, and analyzing a probability vector corresponding to the candidate character at the position to be corrected by using the pre-trained language error correction model;

and determining a target character for replacing the character to be corrected from the candidate characters according to the probability vector.

6. The text error correction method of claim 5, wherein the step of determining the target text for replacing the text to be corrected from the candidate texts according to the probability vector comprises:

extracting, from the candidate characters, the similar-shaped characters of the character to be corrected and the probability values corresponding to the similar-shaped characters in the probability vector;

and comparing the probability values corresponding to the similar-shaped characters, and selecting, according to the comparison result, the similar-shaped character with the maximum probability value as the target character, so as to replace the character to be corrected with the target character.

7. The text error correction method of claim 1, wherein, before the step of selecting the target character for replacing the character to be corrected from the candidate characters in the pre-configured character table, the method further comprises:

constructing candidate characters by encoding Chinese characters with shape-similarity codes, and collecting the candidate characters to generate the pre-configured character table; wherein the shape-similarity code comprises at least one of a Chinese character structure, strokes, and a four-corner code.

8. A character error correction apparatus, comprising:

the detection module is used for detecting and obtaining the occurrence probability of each original character of the sentence to be corrected at the corresponding position in the sentence to be corrected by utilizing a pre-trained language error correction model;

a determining module, configured to determine, according to the occurrence probability, a word to be corrected in the sentence to be corrected;

and the correcting module is used for selecting target characters for replacing the characters to be corrected from all candidate characters in the pre-configured character table and correcting the sentences to be corrected by using the target characters.

9. A word correction device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the word correction method according to any one of claims 1-7 are implemented when the processor executes the program.

10. A storage medium containing computer-executable instructions for performing the steps of the text error correction method of any one of claims 1-7 when executed by a computer processor.

Technical Field

The present application relates to the field of computer software, and in particular, to a method, an apparatus, a device, and a storage medium for text error correction.

Background

With the rapid development of digital media technology, people need to input characters of other media, such as paper or multimedia (such as video), into a computer so as to analyze and process the characters through the computer.

In the related art, characters on paper or in video can be quickly recognized using Optical Character Recognition (OCR) technology to obtain text that can be edited on a computer. However, factors such as the character background and the character font often cause OCR errors in which a character is recognized as a similar-shaped character, so the accuracy of the OCR output is low.

Disclosure of Invention

The present application aims to solve at least one of the above technical drawbacks, and in particular, to solve the problem of low accuracy of the output result of character recognition.

In a first aspect, an embodiment of the present application provides a text error correction method, including the following steps:

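detecting, by using a pre-trained language error correction model, the occurrence probability of each original character of a sentence to be corrected at its corresponding position in the sentence to be corrected;

determining the characters to be corrected in the sentence to be corrected according to the occurrence probability;

and selecting, from the candidate characters in a pre-configured character table, target characters for replacing the characters to be corrected, and correcting the sentence to be corrected by using the target characters.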

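In an embodiment, the step of detecting, by using a pre-trained language error correction model, the occurrence probability of each original character of a sentence to be corrected at its corresponding position in the sentence to be corrected includes:

acquiring a sentence to be corrected, inputting the sentence to be corrected into the pre-trained language error correction model, and analyzing the semantic association relations among the original characters of the sentence to be corrected through the language error correction model;

and obtaining, based on the semantic association relations, the occurrence probability of each original character of the sentence to be corrected at its corresponding position in the sentence to be corrected.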

In an embodiment, the step of obtaining the sentence to be corrected includes:

performing optical character recognition on video frame images to obtain video subtitle sentences, and screening the sentences to be corrected out of the video subtitle sentences.

In an embodiment, the step of determining the words to be corrected in the sentence to be corrected according to the occurrence probability includes:

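acquiring the occurrence probability of each original character of the sentence to be corrected at its corresponding position in the sentence to be corrected;

and comparing the occurrence probability with a preset threshold; if the occurrence probability is smaller than the preset threshold, determining the position as a position to be corrected, and determining the original character at the position to be corrected as the character to be corrected.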

In one embodiment, the step of selecting a target word to replace the word to be corrected from the candidate words in the pre-configured word table comprises:

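acquiring each candidate character of the pre-configured character table, and analyzing, by using the pre-trained language error correction model, the probability vector corresponding to the candidate characters at the position to be corrected;

and determining, according to the probability vector, a target character for replacing the character to be corrected from the candidate characters.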

In an embodiment, the step of determining a target word for replacing the word to be corrected from the candidate words according to the probability vector comprises:

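extracting, from the candidate characters, the similar-shaped characters of the character to be corrected and the probability values corresponding to the similar-shaped characters in the probability vector;

and comparing the probability values corresponding to the similar-shaped characters, and selecting, according to the comparison result, the similar-shaped character with the maximum probability value as the target character, so as to replace the character to be corrected with the target character.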

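In an embodiment, before the step of selecting a target character to replace the character to be corrected from the candidate characters in the pre-configured character table, the method further includes:

constructing candidate characters by encoding Chinese characters with shape-similarity codes, and collecting the candidate characters to generate the pre-configured character table; wherein the shape-similarity code comprises at least one of a Chinese character structure, strokes, and a four-corner code.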

In a second aspect, an embodiment of the present application further provides a text error correction device, including:

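a detection module, configured to detect, by using a pre-trained language error correction model, the occurrence probability of each original character of a sentence to be corrected at its corresponding position in the sentence to be corrected;

a determining module, configured to determine, according to the occurrence probability, the character to be corrected in the sentence to be corrected;

and a correcting module, configured to select, from the candidate characters in a pre-configured character table, a target character for replacing the character to be corrected, and to correct the sentence to be corrected by using the target character.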

In a third aspect, an embodiment of the present application further provides a text error correction apparatus, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the text error correction method mentioned in any embodiment of the first aspect when executing the program.

In a fourth aspect, the present application further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are used to perform the steps of the text error correction method as mentioned in any one of the embodiments of the first aspect.

The method, the apparatus, the device, and the storage medium for text error correction provided in the foregoing embodiments obtain, by using a pre-trained language error correction model, an occurrence probability of each original character of a sentence to be corrected at a corresponding position in the sentence to be corrected, and determine, according to the occurrence probability, a text to be corrected in the sentence to be corrected; and selecting a target character for replacing the character to be corrected from all candidate characters in the pre-configured character table, and correcting the sentence to be corrected, so that the wrongly written characters in the sentence to be corrected can be efficiently and accurately identified, and the wrongly written characters in the sentence to be corrected can be corrected.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow chart of a text error correction method according to an embodiment;

FIG. 2 is a schematic diagram illustrating a prediction principle of the occurrence probability of each original character of the sentence to be corrected according to an embodiment;

FIG. 3 is a diagram illustrating another prediction principle of the occurrence probability of each original character of the sentence to be corrected according to an embodiment;

FIG. 4 is a schematic diagram illustrating yet another prediction principle of the occurrence probability of each original character of the sentence to be corrected according to an embodiment;

FIG. 5 is a schematic structural diagram of a text error correction apparatus according to an embodiment;

FIG. 6 is a schematic structural diagram of a text error correction device according to an embodiment.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

FIG. 1 is a flowchart of a text error correction method according to an embodiment. The method is implemented in a computer device, such as a server, a personal computer, a notebook computer, a tablet computer, a scanner, or a smartphone.

Specifically, as shown in FIG. 1, the text error correction method may include the following steps:

S110, detecting and obtaining the occurrence probability of each original character of the sentence to be corrected at the corresponding position in the sentence to be corrected by utilizing a pre-trained language error correction model.

The language error correction model may be obtained in advance by training on a large number of text samples. Optionally, a text training sample set is input into a preset neural network language model for training, so as to obtain the language error correction model. An input training sample may be a word, or a sentence composed of several words.

Put simply, a language model is a probability distribution over strings of words. Specifically, for a text of length m, the language model determines a probability distribution P that indicates how likely the text is to occur. A pre-trained word representation should carry rich syntactic and semantic information and be able to model ambiguous words; a language model is used to obtain such a context-dependent pre-trained representation.
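As a worked form of the definition above, one standard way to write the distribution P for a text w_1, ..., w_m is the chain-rule factorization below. This is a textbook identity given for illustration; it is not a formula quoted from the application, and the symbols w_i are introduced here only for this purpose.

```latex
P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
```

Each factor is the probability of one word or character given the text before it, which is the kind of per-position quantity the later steps compare against a threshold.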

Optionally, in this embodiment, the language error correction model may be a BERT language model. The pre-trained BERT language model performs a spell check on the input sentence and, through a softmax function, outputs a probability vector for the character position following the current character position, from which the occurrence probability of the original character at its corresponding position in the input sentence is obtained.

Because the characters of a word or sentence are syntactically and semantically related to one another, in this embodiment the language error correction model can be used to detect the occurrence probability of each character of the sentence to be corrected at its corresponding position in that sentence. The corresponding position of an original character may be determined relative to the position of one or more other characters. For example, in the sentence "playing basketball", the position corresponding to "playing" can be understood as the position separated by one character from the character "ball", or it may be another position determined from the semantics of the sentence to be corrected.

For example, the language error correction model, trained on a large number of text training samples, learns the grammatical, syntactic and semantic associations among characters. Based on the semantic association between the character at the current position and the character at the next position, it calculates the occurrence probability of every possible character at the next position, so as to obtain a probability distribution over the possible characters.

For another example, if the character at the current position is "name", the language error correction model calculates the occurrence probability of the next character being "sentence" or some other input character, such as "package". Based on the semantic association between "name" and "sentence", and between "name" and "package", it finds that "sentence" has a higher occurrence probability than "package".

Similarly, when the sentence to be corrected, "this is a thousand-ancient name package", is input into the language error correction model, the model can calculate the occurrence probability of each original character at its corresponding position according to the syntactic, grammatical and semantic associations among the characters of the sentence. For example, the sentence to be corrected comprises 8 original characters, such as "this", "is", "one", "thousand", "ancient", "name" and "package". The corresponding position of "this" is the first character position; that of "is" is the second character position (equivalently, the position following "this" or the position preceding "one"); and so on, so that the corresponding position of "package" is the eighth character position (the position following "name").

S120, determining the characters to be corrected in the sentence to be corrected according to the occurrence probability.

In this embodiment, the occurrence probability of an original character reflects its accuracy: the higher the probability, output by the language error correction model, that an original character appears at its corresponding position in the sentence to be corrected, the more likely the character is correct and the less likely it results from an input or recognition error.

Optionally, the occurrence probability of each original character of the sentence to be corrected at its corresponding position is obtained and compared with a preset threshold. If the occurrence probability is smaller than the preset threshold, the position is determined as a position to be corrected, and the original character at the position to be corrected is determined as the character to be corrected. There may be one or more positions to be corrected and, correspondingly, one or more characters to be corrected.

It should be noted that the language error correction model is biased by default toward treating each original character of the sentence to be corrected as correct. The occurrence probability of each original character at its corresponding position is therefore higher than that of any other character, even when the original character is wrong; in other words, the original character has the highest occurrence probability at its position. In this case, if the highest occurrence probability among the characters at the current position is smaller than the preset threshold, the current position is a position to be corrected, and the original character at that position is extracted and determined as the character to be corrected.

Assume that the preset threshold is 0.9. Based on the above example, the occurrence probabilities of the 8 original characters of the sentence to be corrected, such as "this", "is", "one", "thousand", "ancient", "name" and "package", at their corresponding positions are, for example, 0.99, 0.98, 0.99, ..., and 0.6, respectively. Because the occurrence probability of the original character "package" is 0.6, which is lower than the preset threshold of 0.9, the position of the 8th character of the sentence to be corrected, namely the position immediately following "name", is determined as the position to be corrected, and the original character at that position, namely "package", is determined as the character to be corrected.
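The threshold comparison of step S120 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `find_positions_to_correct`, the placeholder sentence "ABCDEFGH" and the probability values are all assumptions made for the example, and positions are reported with 0-based indices (index 7 is the eighth character).

```python
def find_positions_to_correct(sentence, probabilities, threshold=0.9):
    """Return (position, character) pairs whose probability is below threshold."""
    if len(sentence) != len(probabilities):
        raise ValueError("need exactly one probability per character")
    return [
        (index, char)
        for index, (char, prob) in enumerate(zip(sentence, probabilities))
        if prob < threshold
    ]


# Hypothetical eight-character sentence and per-position probabilities.
sentence = list("ABCDEFGH")
probabilities = [0.99, 0.98, 0.99, 0.97, 0.96, 0.95, 0.94, 0.6]
flagged = find_positions_to_correct(sentence, probabilities)
```

Because the function returns a list, it covers the case mentioned above in which more than one position falls below the threshold.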

S130, selecting target characters for replacing the characters to be corrected from all candidate characters of the pre-configured character table, and correcting the sentences to be corrected by using the target characters.

In this embodiment, the pre-configured character table includes a plurality of candidate characters, where the candidate characters include characters identical to the original characters of the sentence to be corrected and characters similar to those original characters.
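One way to picture such a table is as a mapping from each character to the characters that share its shape code. The sketch below is an assumption-laden illustration: the function `build_character_table` is hypothetical, the Latin letters stand in for Chinese characters, and the codes "round" and "cross" are placeholders rather than real structure, stroke or four-corner codes.

```python
from collections import defaultdict


def build_character_table(shape_codes):
    """Group characters that share a shape code into candidate lists."""
    groups = defaultdict(set)
    for char, code in shape_codes.items():
        groups[code].add(char)
    # Each character maps to every character with the same code, itself included.
    return {char: sorted(groups[code]) for char, code in shape_codes.items()}


# Placeholder characters and codes, for illustration only.
shape_codes = {"O": "round", "Q": "round", "X": "cross"}
table = build_character_table(shape_codes)
```

Here `table["O"]` contains both "O" itself and the look-alike "Q", which matches the description of a table holding the original character together with similar characters.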

Optionally, a target character is selected according to the occurrence probability of each candidate character of the pre-configured character table at the position to be corrected; the target character may be the candidate with the highest occurrence probability among the candidate characters other than the character to be corrected. Optionally, other selection manners may be used, such as a user-defined selection, from the pre-configured character table, of another character whose pronunciation or glyph shape is the same as or similar to that of the character to be corrected. The character to be corrected is then replaced with the target character so as to correct the sentence to be corrected.
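The first selection manner described above (highest probability among the candidates other than the character to be corrected) can be sketched as follows. The helper names `select_target` and `correct_sentence`, and the candidate probabilities, are hypothetical and chosen only for illustration.

```python
def select_target(candidate_probs, char_to_correct):
    """Pick the most probable candidate other than the character being corrected."""
    others = {c: p for c, p in candidate_probs.items() if c != char_to_correct}
    if not others:
        return None  # no alternative candidate is available
    return max(others, key=others.get)


def correct_sentence(sentence, position, target):
    """Replace the character at the given 0-based position with the target."""
    chars = list(sentence)
    chars[position] = target
    return "".join(chars)


# Hypothetical probabilities of the candidates at the position to be corrected.
candidate_probs = {"X": 0.6, "Y": 0.3, "Z": 0.1}
target = select_target(candidate_probs, "X")
corrected = correct_sentence("ABX", 2, target)
```

Excluding the original character matters because, as noted earlier, the model tends to give the original character the highest probability at its own position even when it is wrong.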

In the text error correction method provided by this embodiment, the occurrence probability of each original character of a sentence to be corrected at a corresponding position in the sentence to be corrected is obtained by using a pre-trained language error correction model, and the text to be corrected in the sentence to be corrected is determined according to the occurrence probability; and selecting a target character for replacing the character to be corrected from all candidate characters in the pre-configured character table, and correcting the sentence to be corrected, so that the wrongly written characters in the sentence to be corrected can be efficiently and accurately identified, and the wrongly written characters in the sentence to be corrected can be corrected.

In order to make the technical solutions of the present application clearer and easier to understand, specific implementation processes and modes of a plurality of steps in the technical solutions are described in detail below.

In an embodiment, the step S110 of detecting, by using a pre-trained language correction model, occurrence probabilities of original characters of a sentence to be corrected at corresponding positions in the sentence to be corrected may include the following steps:

S1101, obtaining a sentence to be corrected, inputting the sentence to be corrected into a pre-trained language error correction model, and analyzing semantic association relations among original characters of the sentence to be corrected through the language error correction model.

Optionally, the sentence to be corrected may be obtained from the executing device itself, from a local external device, or from the cloud. The sentence to be corrected may be a manually input sentence, or a sentence obtained by, for example, device scanning and recognition.

In one embodiment, optical character recognition is performed on video frame images to obtain video subtitle sentences, and the sentences to be corrected are screened out of the video subtitle sentences.

In this embodiment, multiple video frames are captured, and optical character recognition is performed on the video frame images. Optionally, before optical character recognition is performed, the video frame images that meet the requirements may be screened from the captured video frame images. For example, video frame images that have no subtitles, incomplete subtitles, or subtitles of insufficient definition are deleted to obtain original video frame images that meet the requirements. Further, after binarization and other processing of the original video frame images, optical character recognition may be performed to obtain the video subtitle sentences corresponding to the original video frame images.
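The binarization step mentioned above can be illustrated with a fixed-threshold sketch. The application does not state which binarization method is used, so the fixed threshold of 128, the function name `binarize`, and the tiny grayscale frame below are all assumptions made for the example.

```python
def binarize(gray_image, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (dark) or 255 (bright)."""
    return [
        [255 if pixel >= threshold else 0 for pixel in row]
        for row in gray_image
    ]


# Hypothetical 2x3 grayscale frame region.
frame = [
    [12, 200, 130],
    [90, 255, 127],
]
binary = binarize(frame)
```

In practice the threshold would likely be chosen per image rather than fixed, since subtitle contrast varies with the background.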

Further, the sentences to be corrected, that is, those that need to be checked for errors, are screened out of the video subtitle sentences. Optionally, the sentences to be corrected may be selected by labeling or by random sampling, or all of the video subtitle sentences may be taken as sentences to be corrected for detection.

In this embodiment, the sentence to be corrected is processed and converted into a corresponding sentence sequence, a character feature vector is constructed for each character in the sentence sequence, and the character feature vectors are input into the pre-trained language error correction model. The model then analyzes the semantic association relations among the character feature vectors of the original characters by taking into account the semantics of the surrounding context in the sentence to be corrected. Generally, the more frequently a character appears in a word or a sentence, the stronger its semantic association within that word or sentence.

S1102, obtaining the occurrence probability of each original character of the sentence to be corrected at the corresponding position in the sentence to be corrected based on the semantic association relation.

The semantic association relation is determined from a semantic analysis of a word or sentence, or from the frequency with which a given character occurs in a given word or sentence. In general, the more common the words or word combinations that form a sentence, the stronger the semantic association relation among the characters of those words or that sentence.

In this embodiment, if the semantic association relationship between the current position character and the previous position character and/or the next position character is stronger, the probability that the character appears at the current position is higher; if the semantic association relationship between the current position character and the previous position character and/or the next position character is weaker, the probability that the character appears at the current position is smaller.

For example, the sentence to be corrected is "hit blue ball", and the language error correction model obtains, based on the semantic association relations of the sentence, the occurrence probabilities of its three original characters, namely "hit", "blue" and "ball", at their corresponding positions. Given these semantic association relations, the probability that "blue" follows "hit" or precedes "ball" is low, whereas the probabilities that "hit" and "ball" appear at their current positions are high, so it can be predicted that "blue" is possibly a wrongly written character.

The corresponding position of each original character in the sentence to be corrected may be determined relative to the position of one or more other characters. For example, the position corresponding to "hit" can be understood as the position separated by one character from the character "ball", or it may be another position determined from the semantics of the sentence to be corrected.

To illustrate the technical solution of the present application more clearly, the following takes the sentence to be corrected "weather is clear" as an example.

In an embodiment, the following text can be inferred from the preceding text, so as to predict the occurrence probability of the current character. As shown in FIG. 2, which is a schematic diagram of a prediction principle of the occurrence probability of each original character of the sentence to be corrected according to an embodiment, the occurrence probability of the current character "qi" can be predicted from the preceding input character "day"; similarly, the occurrence probability of the current character "true" can be predicted from the preceding input characters "weather".

In an embodiment, the preceding text can be inferred from the following text, so as to predict the occurrence probability of the current character. As shown in FIG. 3, which is a schematic diagram of another prediction principle of the occurrence probability of each original character of the sentence to be corrected according to an embodiment, the occurrence probability of the current character "fine" can be predicted from the following input character "bright"; similarly, the occurrence probability of the current character "true" can be predicted from the following input characters "clear", and so on.

In an embodiment, the occurrence probability of the current character can be inferred from the context on both sides. As shown in FIG. 4, which is a schematic diagram of yet another prediction principle of the occurrence probability of each original character of the sentence to be corrected according to an embodiment, the occurrence probability of the current character "qi" can be predicted from the preceding input character "day" and the following input characters "true clear", and so on.
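The three prediction directions can be illustrated with a toy character-pair model. This is only a sketch under stated assumptions: the application describes a neural language model such as BERT, whereas the code below swaps in simple adjacent-pair counts over a made-up corpus, and combining the two directions by multiplication in `context_score` is an assumption of this example.

```python
from collections import Counter

# Hypothetical training strings; each letter stands in for one character.
corpus = ["abcd", "abce", "xbcd"]

# Count every adjacent (previous, next) character pair in the corpus.
pair_counts = Counter()
for text in corpus:
    pair_counts.update(zip(text, text[1:]))


def forward_prob(prev_char, cur_char):
    """P(cur | prev): predict from the preceding character (as in FIG. 2)."""
    total = sum(n for (p, _), n in pair_counts.items() if p == prev_char)
    return pair_counts[(prev_char, cur_char)] / total if total else 0.0


def backward_prob(cur_char, next_char):
    """P(cur | next): predict from the following character (as in FIG. 3)."""
    total = sum(n for (_, q), n in pair_counts.items() if q == next_char)
    return pair_counts[(cur_char, next_char)] / total if total else 0.0


def context_score(prev_char, cur_char, next_char):
    """Use both sides (as in FIG. 4); the product is an illustrative choice."""
    return forward_prob(prev_char, cur_char) * backward_prob(cur_char, next_char)
```

A character that rarely follows its left neighbor or rarely precedes its right neighbor receives a low score, which is the signal the threshold comparison in step S120 relies on.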

The prediction of the occurrence probability of each original character in the sentence to be corrected at the corresponding position in the sentence to be corrected provided by the embodiment can efficiently and accurately identify the wrongly written characters in the sentence to be corrected.

In an embodiment, selecting a target character to replace the character to be corrected from the candidate characters in the pre-configured character table in step S130 may include the following steps:

S1301, obtaining each candidate character of a pre-configured character table, and analyzing a probability vector corresponding to the candidate characters at the position to be corrected by using the pre-trained language error correction model.

In this embodiment, the pre-configured character table may be built into the language error correction model, or may be an externally input character table.

When the sentence to be corrected is input into the pre-trained language error correction model to detect each original character of the sentence, the language error correction model also obtains each candidate character of the pre-configured character table. The model therefore detects both the occurrence probability of each original character at its position in the sentence to be corrected and the occurrence probability of each candidate character at the position of each original character.

Further, the occurrence probabilities of the candidate characters at the position of an original character of the sentence to be corrected are normalized by a softmax function to obtain the probability vector of the candidate characters at that position. Each probability value in the probability vector is mapped one-to-one to its candidate character, and the probability values of all candidate characters sum to one.
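The normalization described above can be sketched as follows. This is a minimal illustration only: the candidate names and raw scores are hypothetical, and the helper names `softmax` and `probability_vector` are not taken from the application.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def probability_vector(candidates, raw_scores):
    """Map each candidate character one-to-one to its normalized probability."""
    return dict(zip(candidates, softmax(raw_scores)))


# Hypothetical candidates and raw model scores for a single position.
candidates = ["cand_a", "cand_b", "cand_c"]
raw_scores = [2.0, 1.0, 0.1]
vector = probability_vector(candidates, raw_scores)
```

Each key of `vector` is a candidate character and each value is its relative probability at the position, so the values can be compared directly.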

S1302, determining a target word for replacing the word to be corrected from the candidate words according to the probability vector.

Because the language error correction model assumes by default that each original character of the sentence to be corrected is correct, the occurrence probability of an original character at its own position is higher than that of any other character, even if the original character is wrong; that is, the occurrence probability of the original character is the maximum occurrence probability at that position. When this maximum occurrence probability is smaller than a preset threshold, the original character is regarded as a wrongly written character: the original character is determined as the character to be corrected, and its position is determined as the position to be corrected.
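The threshold comparison can be sketched as below. The sentence, the probabilities and the threshold value of 0.5 are illustrative assumptions; the application does not specify a threshold value, and `find_positions_to_correct` is a hypothetical helper name.

```python
def find_positions_to_correct(sentence, occurrence_probs, threshold=0.5):
    """Return (position, character) pairs whose probability is below threshold."""
    if len(sentence) != len(occurrence_probs):
        raise ValueError("one probability is required per character")
    return [
        (i, ch)
        for i, (ch, p) in enumerate(zip(sentence, occurrence_probs))
        if p < threshold
    ]


# Hypothetical per-character occurrence probabilities for a four-character input.
sentence = "abcd"
occurrence_probs = [0.91, 0.12, 0.88, 0.97]
flagged = find_positions_to_correct(sentence, occurrence_probs)
```

Here only the second character falls below the assumed threshold, so it alone is flagged as a character to be corrected.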

In this embodiment, the probability values in the probability vector are relative values: the higher the probability value of a candidate character, the more likely that candidate character is correct at the position. The candidate character with the maximum probability value among the candidate characters other than the character to be corrected (namely the original character) is selected as the target character, and the character to be corrected is replaced with the target character to correct the sentence to be corrected.
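A minimal sketch of this selection follows, assuming the probability vector is held as a mapping from candidate character to probability. The function names and the sample values are hypothetical.

```python
def select_target(probability_vector, original):
    """Pick the most probable candidate other than the original character."""
    others = {c: p for c, p in probability_vector.items() if c != original}
    if not others:
        return None  # no alternative candidate is available
    return max(others, key=others.get)


def replace_at(sentence, position, target):
    """Return the sentence with the character at `position` replaced."""
    return sentence[:position] + target + sentence[position + 1:]


# Hypothetical vector: the original "x" is the most probable character,
# but its probability is still below the assumed threshold.
vector = {"x": 0.40, "y": 0.35, "z": 0.25}
target = select_target(vector, original="x")
corrected = replace_at("axc", 1, target)
```

Excluding the original character matters because, as noted above, the original character usually has the highest probability at its own position.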

Further, in an embodiment, the step S1302 of determining, from the candidate words according to the probability vector, a target word for replacing the word to be corrected, may include the following steps:

S3021, extracting the shape-similar characters of the character to be corrected and the corresponding probability values in the probability vector from the candidate characters.

The shape-similar characters are characters whose structure is similar to that of the character to be corrected; there may be one or more of them. The candidate characters of the pre-configured character table include the shape-similar characters of the character to be corrected. When the candidate characters of the pre-configured character table are input into the language error correction model, the occurrence probability of each candidate character at the position of each original character of the sentence to be corrected is obtained, and these occurrence probabilities are normalized by a softmax function to obtain the probability vector of the candidate characters at the position to be corrected. After normalization, the probability values in the probability vector sum to 1.

In the probability vector, each candidate character is mapped one-to-one to its occurrence probability value at the corresponding position of the original character of the sentence to be corrected. Further, in this embodiment, the shape-similar characters of the character to be corrected are extracted from the candidate characters of the pre-configured character table detected at the position to be corrected, and the probability value corresponding to each shape-similar character is extracted from the probability vector.

S3022, comparing the probability values corresponding to the shape-similar characters, and selecting the shape-similar character with the maximum probability value as a target character according to the comparison result so as to replace the character to be corrected with the target character.

In this embodiment, the greater the probability value of a shape-similar character, the more likely that shape-similar character is the target character. The target character is the correct character and is used to replace the character to be corrected, thereby correcting the sentence to be corrected.

For example, the character to be corrected is "package", and the shape-similar characters of "package" extracted from the candidate characters are "sentence", "even" and "spoon", whose probability values in the probability vector are 0.2, 0.1 and 0.05 respectively. Because "sentence" has the largest probability value, "sentence" is taken as the target character and replaces the character to be corrected.
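The example above can be reproduced with a short sketch. The English glosses stand in for the characters, the values 0.2, 0.1 and 0.05 come from the example, and the remaining entries ("package" at 0.4 and an unrelated candidate "other" at 0.25) are assumptions added only so that the vector sums to one.

```python
def select_shape_similar_target(probability_vector, similar_chars):
    """Pick the shape-similar character with the largest probability value."""
    scored = {
        c: probability_vector[c] for c in similar_chars if c in probability_vector
    }
    if not scored:
        return None  # none of the shape-similar characters is in the table
    return max(scored, key=scored.get)


# Probability vector at the position to be corrected (partly assumed values).
vector = {
    "package": 0.40,   # the original character (assumed value)
    "other": 0.25,     # unrelated candidate (assumed value)
    "sentence": 0.20,
    "even": 0.10,
    "spoon": 0.05,
}
shape_similar = ["sentence", "even", "spoon"]
target = select_shape_similar_target(vector, shape_similar)
```

Restricting the comparison to shape-similar characters means that "other" is ignored even though its assumed probability is higher than that of "sentence".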

In this embodiment, the occurrence probability of each candidate character (or shape-similar character) is normalized, so that the relative occurrence probability of each candidate character (or shape-similar character) at the position of each original character of the sentence to be corrected can be obtained more accurately, and the target character is identified more accurately.

In an embodiment, before selecting a target word to replace the word to be corrected from the candidate words in the pre-configured word table in step S130, the method may further include the following steps:

S100, encoding by utilizing the shape-approximate codes of Chinese characters to construct candidate characters, and collecting the candidate characters to generate a pre-configured character table.

Wherein the shape-approximate code comprises at least one of a Chinese character structure, a stroke and a four-corner code.

The Chinese character structure refers to the specific dots and lines that constitute the glyph of a Chinese character, and is also the minimum structural unit of the Chinese character. Strokes are the uninterrupted dots and lines of various shapes that make up a Chinese character, such as horizontal (一), vertical (丨), left-falling (丿), right-falling (㇏) and turning (乛) strokes; they are the smallest continuous units that make up a Chinese character glyph. The four-corner code is the Chinese character encoding used by the four-corner input method: specific stroke shapes are assigned to the ten digits 0, 1, 2, 3 … 9, the four corners of a Chinese character are each classified according to these stroke shapes, and the corresponding digits are used as a code to represent and distinguish Chinese characters.

In this embodiment, the shape-approximate codes of Chinese characters, such as the Chinese character structure, strokes and four-corner code, are encoded and combined. Because the shape-approximate codes are the minimum units constituting Chinese characters, different candidate characters can be constructed from different encodings and combinations of the shape-approximate codes. The candidate characters are usually common characters and can cover each original character that may appear in the sentence to be corrected together with its shape-similar characters. The candidate characters are then collected to generate the pre-configured character table, which is stored.
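One possible way to organize such a table is sketched below. The application does not state how similarity is computed from the shape-approximate codes, so the rule used here (codes differing in at most one digit) is an assumption, and the character names and four-digit codes are placeholders rather than real four-corner codes.

```python
def hamming(a, b):
    """Count the positions at which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def build_character_table(shape_codes, max_distance=1):
    """Collect candidate characters and index their shape-similar characters."""
    table = sorted(shape_codes)
    similar = {
        ch: [
            other
            for other in table
            if other != ch
            and hamming(shape_codes[ch], shape_codes[other]) <= max_distance
        ]
        for ch in table
    }
    return table, similar


# Placeholder characters and codes (NOT real four-corner codes).
shape_codes = {"c1": "1234", "c2": "1235", "c3": "9876"}
table, similar = build_character_table(shape_codes)
```

With these placeholder codes, "c1" and "c2" are treated as shape-similar to each other, while "c3" has no shape-similar character in the table.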

The following describes in detail a related embodiment of the text error correction apparatus.

Fig. 5 is a schematic structural diagram of a text error correction apparatus according to an embodiment, and as shown in fig. 5, the text error correction apparatus 10 is adapted to automatically correct an input text, and includes: a detection module 110, a determination module 120, and a correction module 130.

The detection module 110 is configured to detect, by using a pre-trained language error correction model, occurrence probabilities of original characters of a sentence to be corrected at corresponding positions in the sentence to be corrected;

a determining module 120, configured to determine, according to the occurrence probability, a word to be corrected in the sentence to be corrected;

a correcting module 130, configured to select a target word for replacing the word to be corrected from each candidate word in the pre-configured word table, and correct the sentence to be corrected by using the target word.

In the text error correction apparatus provided in the foregoing embodiment, the detection module 110 detects, by using a pre-trained language error correction model, the occurrence probability of each original character of a sentence to be corrected at its corresponding position in the sentence to be corrected; the determination module 120 determines, according to the occurrence probability, the character to be corrected in the sentence to be corrected; and the correction module 130 selects a target character for replacing the character to be corrected from the candidate characters in the pre-configured character table and corrects the sentence to be corrected. The wrongly written characters in the sentence to be corrected are thereby efficiently and accurately identified and corrected.
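How the three modules might be chained is sketched below. The pre-trained language error correction model is replaced by a hypothetical callback, `toy_scorer`, that returns one probability vector per position; the function names, the sample values and the threshold of 0.5 are all assumptions.

```python
from typing import Callable, Dict, List

# Stand-in for the pre-trained model: one candidate->probability map per position.
Scorer = Callable[[str], List[Dict[str, float]]]


def detect(sentence: str, scorer: Scorer):
    """Detection module: probability of each original character at its position."""
    vectors = scorer(sentence)
    probs = [vec.get(ch, 0.0) for ch, vec in zip(sentence, vectors)]
    return probs, vectors


def determine(probs: List[float], threshold: float) -> List[int]:
    """Determination module: positions whose probability is below the threshold."""
    return [i for i, p in enumerate(probs) if p < threshold]


def correct(sentence: str, positions: List[int], vectors) -> str:
    """Correction module: replace flagged characters with the best other candidate."""
    chars = list(sentence)
    for i in positions:
        others = {c: p for c, p in vectors[i].items() if c != chars[i]}
        if others:
            chars[i] = max(others, key=others.get)
    return "".join(chars)


def toy_scorer(sentence: str) -> List[Dict[str, float]]:
    # Hypothetical model: every character is trusted except "x".
    return [
        {"x": 0.40, "b": 0.35, "c": 0.25} if ch == "x" else {ch: 0.9, "x": 0.1}
        for ch in sentence
    ]


probs, vectors = detect("axc", toy_scorer)
fixed = correct("axc", determine(probs, threshold=0.5), vectors)
```

Passing the model in as a callback keeps the three modules independent of any particular model implementation.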

In one embodiment, the detection module 110 includes: a semantic analysis unit and a probability obtaining unit;

the semantic analysis unit is used for acquiring a sentence to be corrected, inputting the sentence to be corrected into a pre-trained language error correction model, and analyzing semantic association relations among original characters of the sentence to be corrected through the language error correction model; and the probability obtaining unit is used for obtaining the occurrence probability of each original character of the sentence to be corrected at the corresponding position in the sentence to be corrected based on the semantic association relation.

In one embodiment, the semantic analysis unit includes a sentence screening subunit, which is used for carrying out optical character recognition on video frame images to obtain video caption sentences and screening out sentences to be corrected from the video caption sentences.

In one embodiment, the determining module 120 includes: an occurrence probability acquisition unit and a to-be-corrected character determination unit;

the occurrence probability acquisition unit is used for acquiring the occurrence probability of each original character of the sentence to be corrected at the corresponding position in the sentence to be corrected; and the to-be-corrected character determination unit is used for comparing the occurrence probability with a preset threshold and, if the occurrence probability is smaller than the preset threshold, determining the position as the position to be corrected and the original character at the position to be corrected as the character to be corrected.

In one embodiment, the correction module 130 includes: a probability vector calculation unit and a target character determination unit;

the probability vector calculation unit is used for acquiring each candidate character of a pre-configured character table and analyzing the probability vector corresponding to the candidate character at the position to be corrected by using the pre-trained language error correction model; and the target character determining unit is used for determining a target character for replacing the character to be corrected from the candidate characters according to the probability vector.

In one embodiment, the target character determination unit includes: a shape-similar character extraction subunit and a target character selection subunit;

the shape-similar character extraction subunit is used for extracting, from the candidate characters, the shape-similar characters of the character to be corrected and their corresponding probability values in the probability vector; and the target character selection subunit is used for comparing the probability values corresponding to the shape-similar characters, and selecting the shape-similar character with the maximum probability value as the target character according to the comparison result so as to replace the character to be corrected with the target character.

In one embodiment, the text error correction apparatus 10 further includes a character table generating module, which is used for encoding by utilizing the shape-approximate codes of Chinese characters to construct candidate characters and collecting the candidate characters to generate a pre-configured character table; wherein the shape-approximate code comprises at least one of a Chinese character structure, a stroke and a four-corner code.

The character error correction device provided by the above can be used for executing the character error correction method provided by any of the above embodiments, and has corresponding functions and beneficial effects.

Fig. 6 is a schematic structural diagram of a text error correction apparatus according to an embodiment, and as shown in fig. 6, the text error correction apparatus includes a processor 60, a memory 61, an input device 62, and an output device 63. The memory 61 stores thereon a computer program operable on the processor 60, which when executed by the processor 60 implements the text error correction method as in any of the embodiments described above.

The number of the processors 60 in the text error correction device may be one or more, and fig. 6 exemplifies one processor 60. The number of the memories 61 in the text error correction device may be one or more, and one memory 61 is taken as an example in fig. 6. The processor 60 and the memory 61 of the text error correction device may be connected by a bus or other means, as exemplified by the bus connection in fig. 6. In the embodiment, the character error correction device can be a computer, a scanner, a server and the like.

The memory 61 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules (e.g., the detection module 110, the determination module 120, and the correction module 130) corresponding to the text error correction method according to any embodiment of the present disclosure. The memory 61 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like. Further, the memory 61 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 61 may further include memory located remotely from the processor 60, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 62 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device, and may also include a camera for acquiring images and a sound pickup device for acquiring audio data. The output device 63 may include an audio device such as a speaker or a text output device such as a printer. It should be noted that the specific composition of the input device 62 and the output device 63 can be set according to actual situations.

The processor 60 executes various functional applications of the device and data processing by executing software programs, instructions and modules stored in the memory 61, i.e. implements the above-mentioned text error correction method.

When the computer device provided by the above-mentioned embodiment executes the character error correction method provided by any of the above-mentioned embodiments, the computer device has corresponding functions and beneficial effects.

The present embodiments also provide a storage medium containing computer-executable instructions that, when executed by a computer processor, perform a method of text error correction, comprising:

determining characters to be corrected in the sentences to be corrected according to the occurrence probability;

Of course, the storage medium provided by this embodiment contains computer-executable instructions, and the computer-executable instructions are not limited to the above-mentioned operation of the text error correction method, and may also perform related operations in the text error correction method provided by any embodiment, and have corresponding functions and advantages.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software together with the necessary general-purpose hardware, and can of course also be implemented by hardware alone, although the former is the better implementation in many cases. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH memory, a hard disk or an optical disk of a computer, and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) to execute the text error correction method according to any embodiment of the present application.

The application discloses A1. A character error correction method, comprising the following steps:

determining characters to be corrected in the sentences to be corrected according to the occurrence probability;

A2. According to the text error correction method described in A1, the step of detecting, by using a pre-trained language error correction model, the occurrence probability of each original character of a sentence to be corrected at a corresponding position in the sentence to be corrected includes:

A3. According to the text error correction method described in A2, the step of obtaining the sentence to be corrected includes:

carrying out optical character recognition on video frame images to obtain video caption sentences, and screening out sentences to be corrected from the video caption sentences.

A4. According to the text error correction method described in A1, the step of determining the character to be corrected in the sentence to be corrected according to the occurrence probability includes:

acquiring the occurrence probability of each original character of the sentence to be corrected at the corresponding position in the sentence to be corrected;

A5. The text error correction method according to A1, wherein the step of selecting a target character for replacing the character to be corrected from the candidate characters in the pre-configured character table comprises:

and determining a target character for replacing the character to be corrected from the candidate characters according to the probability vector.

A6. The text error correction method according to A5, wherein the step of determining a target character for replacing the character to be corrected from the candidate characters according to the probability vector comprises:

extracting the shape-similar characters of the character to be corrected and their corresponding probability values in the probability vector from the candidate characters;

A7. The text error correction method according to A1, wherein before the step of selecting the target character for replacing the character to be corrected from the candidate characters in the pre-configured character table, the method further comprises:

B8. A text error correction apparatus comprising:

a determining module, configured to determine, according to the occurrence probability, a word to be corrected in the sentence to be corrected;

B9. The text error correction apparatus according to B8, wherein the detection module comprises: a semantic analysis unit and a probability obtaining unit;

and the probability obtaining unit is used for obtaining the occurrence probability of each original character of the sentence to be corrected at the corresponding position in the sentence to be corrected based on the semantic association relation.

B10. The text error correction apparatus according to B9, wherein the semantic analysis unit includes a sentence screening subunit, which is used for carrying out optical character recognition on video frame images to obtain video caption sentences and screening out sentences to be corrected from the video caption sentences.

B11. The text error correction apparatus according to B8, wherein the determining module comprises: an occurrence probability acquisition unit and a to-be-corrected character determination unit;

the occurrence probability acquisition unit is used for acquiring the occurrence probability of each original character of the sentence to be corrected at the corresponding position in the sentence to be corrected;

and the to-be-corrected character determination unit is used for comparing the occurrence probability with a preset threshold and, if the occurrence probability is smaller than the preset threshold, determining the position as the position to be corrected and the original character at the position to be corrected as the character to be corrected.

B12. The text error correction apparatus according to B8, wherein the correction module comprises: a probability vector calculation unit and a target character determination unit;

and the target character determining unit is used for determining a target character for replacing the character to be corrected from the candidate characters according to the probability vector.

B13. The text error correction apparatus according to B12, wherein the target character determination unit includes: a shape-similar character extraction subunit and a target character selection subunit;

and the target character selection subunit is used for comparing the probability values corresponding to the shape-similar characters, and selecting the shape-similar character with the maximum probability value as the target character according to the comparison result so as to replace the character to be corrected with the target character.

B14. The text error correction apparatus according to B8, wherein the text error correction apparatus further comprises a character table generating module, which is used for encoding by utilizing the shape-approximate codes of Chinese characters to construct candidate characters and collecting the candidate characters to generate a pre-configured character table; wherein the shape-approximate code comprises at least one of a Chinese character structure, a stroke and a four-corner code.

C15. A text error correction device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor, when executing the program, implementing the steps of the text error correction method according to any one of A1 to A7.

D16. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the steps of the text error correction method according to any one of A1 to A7.

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

The foregoing is only a part of the embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements shall also fall within the protection scope of the present application.
