Text error correction method, text error correction device, computer storage medium and computer program product

Document number: 1861741 | Publication date: 2021-11-19 | Views: 30 | Original language: Chinese

Reading note: this technology, "Text error correction method, apparatus, computer storage medium and computer program product", was designed and created by 包祖贻, 李辰 and 章波 on 2021-09-01. Main content: the embodiments of the present application provide a text error correction method, a text error correction apparatus, a computer storage medium and a computer program product. The method includes: segmenting a text to be corrected to obtain a plurality of characters in the text to be corrected; obtaining the semantic vector, pronunciation vector and font vector corresponding to each character, and generating feature vectors of the characters from the semantic, pronunciation and font vectors; because the semantics, pronunciation and font of each character are considered, the generated feature vectors carry richer information and express the character features more accurately. According to the feature vectors of the characters, the probabilities that the characters in the text to be corrected are respectively mapped to a plurality of candidate characters in a word list are obtained; according to the sound-shape vectors of the characters and the sound-shape vectors of the word list, the similarity between the text to be corrected and the correct text is calculated. The output result of the text to be corrected is then determined from the probabilities of the candidate characters and the similarity, improving the accuracy of text error correction.

1. A text error correction method comprising:

segmenting a text to be corrected to obtain a plurality of characters in the text to be corrected;

obtaining semantic vectors, pronunciation vectors and font vectors corresponding to the characters respectively, and generating feature vectors of the characters according to the semantic vectors, the pronunciation vectors and the font vectors;

obtaining the probability that the characters in the text to be corrected are respectively mapped to a plurality of candidate characters in a word list according to the feature vectors of the characters;

calculating the similarity between the text to be corrected and the correct text according to the sound-shape vectors of the characters and the sound-shape vectors of the word list, wherein the sound-shape vectors are generated according to the pronunciation vectors and the font vectors;

and determining an output result of the text to be corrected according to the probability and the similarity of the candidate characters.

2. The method of claim 1, wherein obtaining the probability that the characters in the text to be corrected are respectively mapped to a plurality of candidate characters in a word list according to the feature vectors of the characters comprises:

for each character in the characters, coding the feature vector of the character to obtain the hidden state of the feature vector of the character;

and classifying the hidden state of the feature vector of the character based on the candidate characters in the word list to obtain the probability that the character is respectively mapped to a plurality of candidate characters in the word list.

3. The method of claim 2, wherein classifying the hidden state of the feature vector of the character based on the candidate characters in the word list to obtain the probability that the character is respectively mapped to a plurality of candidate characters in the word list comprises:

classifying the hidden state of the feature vector of the character based on the candidate characters in the word list according to a classification model to obtain initial probabilities of the character respectively mapped to a plurality of candidate characters in the word list;

and normalizing the initial probabilities of the characters respectively mapped to the candidate characters in the word list through a first normalization function to obtain the probabilities of the characters respectively mapped to the candidate characters in the word list.

4. The method of claim 1, wherein calculating the similarity between the text to be corrected and the correct text according to the sound-shape vectors of the characters and the sound-shape vectors of the word list comprises:

multiplying the sound-shape vectors of the characters and the sound-shape vectors of the word list to obtain a sound-shape vector product;

and normalizing the product of the sound-shape vectors according to a second normalization function to obtain the similarity.

5. The method of claim 1, wherein determining the output result of the text to be corrected according to the probabilities of the candidate characters and the similarities comprises:

multiplying the probabilities of the candidate characters by the similarities to obtain similarity probabilities for the text to be corrected;

and taking the text corresponding to the maximum of the similarity probabilities as the output result of the text to be corrected.

6. The method of claim 1, wherein the method further comprises:

generating a pronunciation vector of the word list according to the pronunciations of a plurality of candidate characters in the word list;

generating a font vector of the word list according to the fonts of a plurality of candidate characters in the word list;

and generating the sound-shape vector of the word list according to the sum of the pronunciation vector of the word list and the font vector of the word list.

7. The method according to claim 1, wherein the text to be corrected includes at least one of phonogram characters, numerals and special characters in addition to ideographic characters, and wherein segmenting the text to be corrected to obtain a plurality of characters in the text to be corrected comprises:

carrying out single character segmentation on the ideographic characters in the text to be corrected to obtain a plurality of ideographic characters;

segmenting at least one of the phonogram characters, the numerals and the special characters in the text to be corrected according to a preset segmentation rule to obtain a plurality of segmented characters;

wherein the plurality of characters includes the plurality of ideographic characters and the plurality of segmented characters.

8. The method of any of claims 1-7, further comprising:

acquiring a parallel corpus containing an error text and a correct text corresponding to the error text;

constructing a training sample according to the parallel corpus;

training a machine learning model according to the training samples so that the trained machine learning model executes the text error correction method; wherein the machine learning model comprises: an encoding model part for vectorizing the characters, an encoding model part for encoding the feature vectors of the characters to obtain hidden states, and a classification model part for classifying the hidden states of the feature vectors of the characters.

9. A text error correction apparatus comprising:

the segmentation module is used for segmenting a text to be corrected to obtain a plurality of characters in the text to be corrected;

the feature vector generation module is used for obtaining semantic vectors, pronunciation vectors and font vectors corresponding to the characters respectively and generating feature vectors of the characters according to the semantic vectors, the pronunciation vectors and the font vectors;

a probability determining module, configured to obtain, according to the feature vectors of the multiple words, probabilities that the multiple words in the text to be corrected are respectively mapped to multiple candidate words in a word list;

the similarity calculation module is used for calculating the similarity between the text to be corrected and the correct text according to the sound-shape vectors of the characters and the sound-shape vectors of the word list, wherein the sound-shape vectors are generated according to the pronunciation vectors and the font vectors;

and the output result module is used for determining the output result of the text to be corrected according to the probability and the similarity of the candidate characters.

10. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the text error correction method as claimed in any one of claims 1 to 8.

11. A computer program product comprising computer instructions for instructing a computing device to perform operations corresponding to the text correction method of any one of claims 1-8.

Technical Field

The embodiments of the present application relate to the field of computer technologies, and in particular, to a text error correction method, apparatus, computer storage medium, and computer program product.

Background

With the continuous development of computer software technology, people usually input text by means of various input tools. Compared with phonographic languages such as English, ideographic languages such as Chinese have rich and varied expressions and meanings, and spelling errors involving similar shapes and pronunciations occur easily. On the one hand, misspellings make the text easier to misunderstand and reduce communication efficiency; on the other hand, many formal document scenarios, such as judicial documents, contracts and news manuscripts, have very low tolerance for spelling errors.

To ensure correct input, the input text needs to be corrected. In the prior art, taking Chinese characters as an example, error correction is achieved by obtaining a confusion set for each Chinese character in the text to be corrected (a confusion set is a set of candidate characters whose pronunciations and fonts are similar to the character's), selecting candidate characters from the confusion set, and then determining the output result of the text to be corrected according to the candidate characters.

However, the number of Chinese characters is very large, and so is the number of Chinese characters with similar pronunciations and fonts; the search space of a Chinese character's confusion set is therefore very large. Correcting text by selecting candidate characters from such a confusion set is inefficient and reduces the accuracy of error correction. The same problem exists for ideographic scripts similar to Chinese characters, such as Japanese and Korean.

Disclosure of Invention

In view of the above, embodiments of the present application provide a text error correction scheme to at least partially solve the above problem.

According to a first aspect of embodiments of the present application, there is provided a text error correction method, including: segmenting a text to be corrected to obtain a plurality of characters in the text to be corrected; obtaining semantic vectors, pronunciation vectors and font vectors corresponding to the characters respectively, and generating feature vectors of the characters according to the semantic vectors, the pronunciation vectors and the font vectors; obtaining the probability that the characters in the text to be corrected are respectively mapped to a plurality of candidate characters in a word list according to the feature vectors of the characters; calculating the similarity between the text to be corrected and the correct text according to the sound-shape vectors of the characters and the sound-shape vectors of the word list, wherein the sound-shape vectors are generated according to the pronunciation vectors and the font vectors; and determining an output result of the text to be corrected according to the probability and the similarity of the candidate characters.

According to a second aspect of embodiments of the present application, there is provided a text correction apparatus including: the segmentation module is used for segmenting a text to be corrected to obtain a plurality of characters in the text to be corrected; the feature vector generation module is used for obtaining semantic vectors, pronunciation vectors and font vectors corresponding to the characters respectively and generating feature vectors of the characters according to the semantic vectors, the pronunciation vectors and the font vectors; a probability determining module, configured to obtain, according to the feature vectors of the multiple words, probabilities that the multiple words in the text to be corrected are respectively mapped to multiple candidate words in a word list; the similarity calculation module is used for calculating the similarity between the text to be corrected and the correct text according to the sound-shape vectors of the characters and the sound-shape vectors of the word list, wherein the sound-shape vectors are generated according to the pronunciation vectors and the font vectors; and the output result module is used for determining the output result of the text to be corrected according to the probability and the similarity of the candidate characters.

According to a third aspect of embodiments of the present application, there is provided an electronic apparatus, including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus; the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the text error correction method according to the first aspect.

According to a fourth aspect of embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the text correction method according to the first aspect.

According to a fifth aspect of embodiments of the present application, there is provided a computer program product including computer instructions for instructing a computing device to execute operations corresponding to the text error correction method according to the first aspect.

According to the text error correction scheme provided by the embodiments of the present application, a text to be corrected is segmented to obtain a plurality of characters; semantic vectors, pronunciation vectors and font vectors corresponding to the characters are obtained, and feature vectors of the characters are generated from them. According to the feature vectors, the probabilities that the characters in the text to be corrected are respectively mapped to a plurality of candidate characters in a word list are obtained. The similarity between the text to be corrected and the correct text is calculated from the sound-shape vectors of the characters and the sound-shape vectors of the word list, where a sound-shape vector is generated from the pronunciation vector and the font vector; in effect, the conversion probability from the text to be corrected to the correct text is computed via the sound-shape vectors. The output result of the text to be corrected is then determined from the probabilities of the candidate characters and the similarities. Because the semantics, pronunciation and font of the characters are considered when determining the probabilities of the candidate characters, and the pronunciation and font are considered when determining the similarities, the combination of the probabilities and the similarities allows the scheme not only to handle the near-pronunciation and near-font spelling errors addressed by traditional schemes, but also to effectively handle spelling errors that are not similar in pronunciation or font, improving the accuracy of text error correction.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description cover only some embodiments described in the embodiments of the present application, and other drawings can be derived from them by those skilled in the art.

Fig. 1 is a flowchart illustrating steps of a text error correction method according to an embodiment of the present application;

fig. 2 is an application scenario diagram of a text error correction method according to an embodiment of the present application;

fig. 3 is a flowchart illustrating the steps of another text error correction method according to an embodiment of the present application;

fig. 4 is a block diagram of a text error correction apparatus according to an embodiment of the present application;

fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

To make those skilled in the art better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application shall fall within the protection scope of the embodiments of the present application.

It should be noted that the terms "first" and "second" (for example, the first normalization function and the second normalization function) are used only to distinguish names; they do not represent a sequential relationship, and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features.

The following further describes specific implementations of embodiments of the present application with reference to the drawings of the embodiments of the present application.

Embodiment 1,

Fig. 1 is a flowchart of the steps of a text error correction method provided in an embodiment of the present application. The text error correction method includes the following steps:

step S101, segmenting the text to be corrected to obtain a plurality of characters in the text to be corrected.

In this example, the text to be corrected may be segmented character by character into a plurality of characters. In this embodiment, the text to be corrected contains at least ideographic text such as Chinese, and may additionally contain other text and characters, such as English, French, Korean, Japanese, numerals, Greek letters and punctuation.

Take as an example a text to be corrected that contains Chinese, English, numerals and punctuation, rendered here as "I surprise day 11 point, eat KFC" (a Chinese sentence in which the character glossed "surprise" is a typo for "today"). The text is first segmented into a plurality of characters: "I | surprise | day | 11 | point | , | eat | KFC". Chinese is segmented into single Chinese characters, punctuation marks are segmented individually, and numerals, English and special symbols are kept whole to preserve their integrity. For example, "11 points" is split into "11 | points"; "11:00" is kept as a single token "11:00"; the date written in the Chinese year-month-day form is split into "2021 | year | 01 | month | 01 | day"; and "2021-01-01" is kept as a single token "2021-01-01". It should be clear to those skilled in the art, however, that in many cases the text to be corrected may contain only ideographic text, in which case single-character segmentation alone is performed.
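The segmentation behavior described above can be sketched with a simple rule set. The regular expression below is an illustrative assumption, not the patent's preset segmentation rule: it keeps dates, clock times, digit runs and Latin-letter runs whole, and splits everything else (including ideographic characters and punctuation) one character at a time.

```python
import re

# Illustrative segmentation rule (an assumption, not the patented rule):
# alternation order matters, so whole-token patterns come before "." .
TOKEN = re.compile(
    r"\d{4}-\d{2}-\d{2}"   # date strings such as 2021-01-01 stay whole
    r"|\d+:\d+"            # clock times such as 11:00 stay whole
    r"|\d+"                # runs of digits stay whole
    r"|[A-Za-z]+"          # runs of Latin letters (e.g. KFC) stay whole
    r"|."                  # everything else (CJK characters, punctuation) one by one
)

def segment(text: str) -> list[str]:
    return TOKEN.findall(text)

print(segment("我惊天11点，吃KFC"))
# → ['我', '惊', '天', '11', '点', '，', '吃', 'KFC']
```

Note that because `re.findall` tries the alternatives left to right, placing the date pattern before the bare-digit pattern is what keeps "2021-01-01" intact.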

Step S102, obtaining semantic vectors, pronunciation vectors and font vectors corresponding to the characters respectively, and generating feature vectors of the characters according to the semantic vectors, the pronunciation vectors and the font vectors.

For each character, a semantic vector is generated according to the character's semantics (the semantic vector can also be understood as a character embedding); a pronunciation vector is generated according to its pronunciation; and a font vector is generated according to its glyph. The feature vector of the character is then generated from its semantic vector, pronunciation vector and font vector, thereby yielding the feature vectors of the plurality of characters. Optionally, the sum of the semantic vector, the pronunciation vector and the font vector is used as the feature vector, which can be understood as a comprehensive vector that jointly considers semantics, pronunciation and glyph.

Take the characters for "I" (pinyin "wo") and "today" (pinyin "jin") as examples. Semantically, "I" is related to the other personal pronouns ("I, you, he, she, it"); in pronunciation, it groups with characters that are homophones or near-homophones of "wo"; in font, it groups with characters that share its components or have similar shapes, such as the characters for "hungry" and "seek". Semantically, "today" is related to the other time words ("today, tomorrow, the day after tomorrow, yesterday"); in pronunciation, it groups with characters that sound the same as or similar to "jin", including the character glossed "surprise"; in font, it groups with characters that contain the same components as or look similar to "today".

It should be noted that, when characters are vectorized in the embodiments of the present application, the semantics, pronunciation and font of a character may be encoded by an encoding model to obtain its semantic vector, pronunciation vector and font vector. The encoding model may be any suitably configured machine learning model capable of vectorizing characters; the embodiments of the present application do not limit the specific structure of the machine learning model used, as long as it can encode characters into vectors.
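The vector construction described above can be illustrated with a minimal sketch. The embedding tables below are random stand-ins for the outputs of a trained encoding model; the elementwise-sum combination follows the "sum of the semantic vector, the pronunciation vector and the font vector" option of step S102, and the sound-shape vector (used later in step S104 and claim 6) is the sum of the pronunciation and font vectors only.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding dimension, chosen arbitrarily for this sketch

# Hypothetical embedding tables; in the described method these come from a
# trained encoding model, here they are random stand-ins.
chars = ["惊", "今"]
semantic      = {c: rng.normal(size=DIM) for c in chars}
pronunciation = {c: rng.normal(size=DIM) for c in chars}
font          = {c: rng.normal(size=DIM) for c in chars}

def feature_vector(ch):
    # Feature vector: elementwise sum of the semantic, pronunciation
    # and font vectors (one option described in step S102).
    return semantic[ch] + pronunciation[ch] + font[ch]

def sound_shape_vector(ch):
    # Sound-shape vector: sum of the pronunciation and font vectors only,
    # with semantics excluded, as used for the similarity calculation.
    return pronunciation[ch] + font[ch]
```

The sound-shape vector is thus the feature vector with the semantic component removed, which matches the noisy-channel reasoning given for step S104.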

Step S103, obtaining the probability that a plurality of characters in the text to be corrected are respectively mapped to a plurality of candidate characters in the word list according to the feature vectors of the plurality of characters.

The word list contains a large number of candidate characters. For the feature vector of each character obtained by segmenting the text to be corrected, the character is mapped to a plurality of candidate characters in the word list, and the corresponding probability between the character and each candidate character is obtained; that is, a probability distribution over the word list is obtained for each character. For example, if the word list contains 10,000 candidate characters, each character corresponds to 10,000 probabilities. The probabilities in this example indicate the confidence that the character matches each candidate character.
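The per-character mapping of claims 2 and 3 (encode the feature vector into a hidden state, classify it over the word list, then normalize) can be sketched as follows. The encoder and classifier weights are random stand-ins for trained parameters, and softmax is an assumed choice for the first normalization function, which the source does not name.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, VOCAB = 8, 5  # tiny sizes for the sketch; a real word list is far larger

# Hypothetical trained parameters (assumptions, not the patent's model).
W_enc = rng.normal(size=(DIM, DIM))    # encodes a feature vector into a hidden state
W_cls = rng.normal(size=(VOCAB, DIM))  # classifies the hidden state over the word list

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def candidate_probabilities(feature_vec):
    hidden = np.tanh(W_enc @ feature_vec)  # hidden state of the feature vector
    logits = W_cls @ hidden                # initial (unnormalized) scores
    return softmax(logits)                 # first normalization function
```

Each call returns one probability distribution over the word list, i.e. the per-character distribution described above.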

And step S104, calculating the similarity between the text to be corrected and the correct text according to the sound-shape vectors of the characters and the sound-shape vectors of the word list, wherein the sound-shape vectors are generated according to the pronunciation vectors and the font vectors.

The feature vector in step S102 above jointly considers the semantic vector, the pronunciation vector and the font vector, whereas the sound-shape vector in this example jointly considers only the pronunciation vector and the font vector. For each character of the text to be corrected, the similarity between the character's sound-shape vector and the sound-shape vector of each candidate character is calculated; this similarity is used to represent the similarity between the text to be corrected and the correct text.

In the present example, the pronunciation and font of the character are considered when calculating the similarity, but the semantics are not. The reason follows the noisy-channel view: a correct message passes through a channel (the typing process) and is intended to come out correct, but under the influence of a wrong character an erroneous message is produced. Since the error is reflected semantically, the semantic vector acts as noise and is excluded, and only the influence of the pronunciation vector and the font vector is considered. By calculating the similarity between the sound-shape vectors, the probability P(e|c) of typing the erroneous text e given the correct text c is obtained; multiplying by the prior P(c) gives P(e|c) × P(c), which by Bayes' rule is proportional to P(c|e), the probability of the correct text given the observed erroneous text.
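Claim 4's similarity computation (multiply the character's sound-shape vector with the word list's sound-shape vectors, then normalize with a second normalization function) might look like the sketch below. The vectors are random stand-ins, and softmax is again an assumed choice of normalization function.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM, VOCAB = 8, 5
# Hypothetical sound-shape vectors of the word list's candidate characters.
vocab_sound_shape = rng.normal(size=(VOCAB, DIM))

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def similarity(char_sound_shape):
    # Claim 4: product of the character's sound-shape vector with each
    # candidate's, normalized into a distribution over the word list.
    products = vocab_sound_shape @ char_sound_shape
    return softmax(products)
```

The result is a second distribution over the word list, this time driven purely by pronunciation and glyph similarity.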

Step S105, determining an output result of the text to be corrected according to the probabilities of the candidate characters and the similarities.

For each character, the output result of the character is obtained from the probabilities that the character maps to the candidate characters in the word list and the similarities between the character and those candidate characters; the output result of the text to be corrected is then determined from the per-character output results. Because the semantics, pronunciation and font are considered when determining the probabilities of the candidate characters, and the pronunciation and font are considered when determining the similarities, combining the probabilities and the similarities allows the scheme, compared with selecting candidate characters from a single confusion set, not only to handle the near-pronunciation and near-font spelling errors handled by traditional schemes, but also to effectively handle spelling errors that are not similar in pronunciation or font, improving the accuracy of text error correction.
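The combination step of claim 5 (multiply the mapping probabilities by the similarities elementwise and output the candidate with the maximum similarity probability) can be sketched with hypothetical values; the tiny candidate list and the two distributions below are illustrative stand-ins for the outputs of steps S103 and S104.

```python
import numpy as np

vocab = ["今", "惊", "金", "斤", "经"]  # hypothetical tiny candidate list

# Stand-in distributions for one character of the input text:
prob = np.array([0.40, 0.30, 0.15, 0.10, 0.05])  # mapping probabilities (step S103)
sim  = np.array([0.35, 0.25, 0.20, 0.15, 0.05])  # sound-shape similarities (step S104)

# Claim 5: multiply elementwise to get the similarity probabilities, then
# output the candidate corresponding to the maximum value.
similarity_prob = prob * sim
best = vocab[int(np.argmax(similarity_prob))]
print(best)  # → 今
```

Repeating this per character and concatenating the winners yields the corrected output text.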

In this example, the output result of the text to be corrected may be the output text directly, may be output characters corresponding to the plurality of characters, or may be the correction result for the erroneous character only. For the text to be corrected "I surprise day 11 point, eat KFC": in one realizable manner, the output result is the corrected sentence "I today 11 point, eat KFC"; in another realizable manner, the output result is a per-character mapping "I-I, surprise-today, day-day, 11-11, point-point, eat-eat, KFC-KFC"; in another implementation, the output result is only the correction pair "surprise-today". All of these manners can be applied to the scheme of the embodiments of the present application.

Fig. 2 is a schematic diagram of an application scenario of a text error correction method according to an embodiment of the present application. The text to be corrected is "I surprise day 11 point, eat KFC"; it is segmented into "I | surprise | day | 11 | point | , | eat | KFC". The segmented characters are encoded to obtain their semantic vectors, pronunciation vectors and font vectors, feature vectors are generated from these, and the probability that each character in the text to be corrected maps to the candidate characters in the word list is obtained from the feature vectors. The similarity between the text to be corrected and the correct text is calculated from the sound-shape vectors of the characters and the sound-shape vectors of the word list. The output result is then determined from the probabilities and similarities of the candidate characters; fig. 2 illustrates the output result "I today 11 point, eat KFC" ("I am at 11 o'clock today, eating KFC").

According to the text error correction scheme provided by the embodiments of the present application, a text to be corrected is segmented to obtain a plurality of characters; semantic vectors, pronunciation vectors and font vectors corresponding to the characters are obtained, and feature vectors of the characters are generated from them. According to the feature vectors, the probabilities that the characters in the text to be corrected are respectively mapped to a plurality of candidate characters in a word list are obtained. The similarity between the text to be corrected and the correct text is calculated from the sound-shape vectors of the characters and the sound-shape vectors of the word list, where a sound-shape vector is generated from the pronunciation vector and the font vector; in effect, the conversion probability from the text to be corrected to the correct text is computed via the sound-shape vectors. The output result of the text to be corrected is then determined from the probabilities of the candidate characters and the similarities. Because the semantics, pronunciation and font of the characters are considered when determining the probabilities of the candidate characters, and the pronunciation and font are considered when determining the similarities, combining the probabilities and the similarities allows the scheme not only to handle the near-pronunciation and near-font spelling errors addressed by traditional schemes, but also to effectively handle spelling errors that are not similar in pronunciation or font, improving the accuracy of text error correction.

The text error correction method of the embodiments of the present application may be executed by any suitable electronic device with data processing capability, including but not limited to: server, mobile terminal (such as mobile phone, PAD, etc.), PC, etc.

Example II,

Optionally, in an example of the embodiments of the present application, the second embodiment may include the following steps S301 to S307, as shown in fig. 3, which is a flowchart of the steps of another text error correction method provided in an embodiment of the present application.

Step S301, segmenting the text to be corrected to obtain a plurality of characters in the text to be corrected.

Optionally, in an embodiment of the present application, the text to be corrected includes, in addition to ideographic characters, at least one of phonogram characters, numerals, and special characters. When the text is segmented, step S301 may be further implemented by the following steps: performing single-character segmentation on the ideographic characters in the text to be corrected to obtain a plurality of ideographic characters; segmenting at least one of the phonogram characters, numerals, and other characters in the text to be corrected according to a preset segmentation rule to obtain a plurality of segmented characters; the plurality of characters includes the plurality of ideographic characters and the plurality of segmented characters.

Ideographic characters include, but are not limited to, Chinese, Korean, and Japanese characters. When ideographic characters are segmented, they are split into individual characters; in the case of Chinese, the text is split into individual Chinese characters.

Phonogram characters include, but are not limited to, English, French, and Greek characters; other characters include, but are not limited to, punctuation marks, spaces, and other symbols used to assist in recording language. When phonogram characters, numerals, and other characters are segmented, they are split according to a preset segmentation rule that preserves the integrity of each character sequence, obtaining a plurality of segmented characters. In practical applications, the preset segmentation rule may be set by a person skilled in the art according to actual requirements, or determined by analyzing the segmentation rules used in segmenting a large number of characters, which is not limited in the embodiments of the present application.
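As an illustration only (the concrete rule below is a hypothetical stand-in for the patent's unspecified "preset segmentation rule"), a minimal Python sketch of this segmentation step might be:

```python
import re

def segment(text):
    """Split a mixed-script string: each CJK ideograph becomes its own
    token, while runs of Latin letters or digits stay whole, and any
    other non-space character (punctuation, symbols) is a single token."""
    # Alternation order matters: CJK ideographs first, then letter runs,
    # then digit runs, then any remaining non-space character.
    pattern = re.compile(r'[\u4e00-\u9fff]|[A-Za-z]+|[0-9]+|\S')
    return pattern.findall(text)
```

For example, `segment("我今天11点吃KFC")` keeps the numeral `11` and the abbreviation `KFC` whole while splitting each Chinese character into its own unit, matching the segmentation behavior described above.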

Step S302, generating a semantic vector according to the semantics of each character, a pronunciation vector according to its pronunciation, and a font vector according to its font; and generating the feature vectors of the plurality of characters according to the semantic vectors, the pronunciation vectors, and the font vectors.

Illustratively, the semantics, the pronunciation and the font of the character are respectively coded through a coding model to obtain a semantic vector, a pronunciation vector and a font vector of the character, and the sum of the semantic vector, the pronunciation vector and the font vector is used as a feature vector.

In practical applications, the semantic vector, pronunciation vector, and font vector may instead be generated in advance in other manners and directly obtained when needed.
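A minimal sketch of the feature-vector construction described above, assuming the three per-character vectors share one dimension and are simply summed element-wise (as in the illustrative encoding of step S302):

```python
def feature_vector(semantic, pronunciation, glyph):
    """Element-wise sum of the three per-character vectors: the resulting
    feature vector carries semantic, pronunciation, and font information
    at once, as described for step S302."""
    if not (len(semantic) == len(pronunciation) == len(glyph)):
        raise ValueError("all three vectors must share one dimension")
    return [v + p + s for v, p, s in zip(semantic, pronunciation, glyph)]
```

In a real system these three inputs would come from trained embedding tables; the summation itself is the only operation sketched here.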

Step S303, for each of the plurality of characters, encoding the feature vector of the character to obtain the hidden state of the feature vector of the character.

For each character, in the embodiments of the present application, the feature vector of the character is encoded through an encoding model to obtain the hidden state of the feature vector. The encoding model used to encode the feature vector into a hidden state may include, but is not limited to, a Neural Network (NN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Long Short-Term Memory network (LSTM), a bidirectional LSTM, a Transformer, and the like.

Taking LSTM as an example, LSTM is a long short-term memory neural network suitable for processing sequence data. Encoding the feature vector with the LSTM can be understood as fusing the context information of each character in the text to be corrected into its feature vector to obtain the hidden state of that feature vector, which is the feature data output by the current hidden layer.
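A toy illustration of how a bidirectional recurrent encoder fuses left and right context into each position's hidden state; a plain tanh recurrence with scalar recurrence weights stands in for the actual LSTM, so this is a sketch of the idea, not the patent's model:

```python
import math

def birnn_hidden_states(X, Wf, Wb):
    """Minimal bidirectional recurrent encoder (a plain tanh RNN standing
    in for the LSTM): X is a list of per-character feature vectors, and
    Wf/Wb are scalar recurrence weights for the forward/backward passes.
    Each position's hidden state mixes in context from both directions."""
    def run(seq, w):
        h, out = [0.0] * len(seq[0]), []
        for x in seq:
            # New state depends on the current input and the previous state.
            h = [math.tanh(xi + w * hi) for xi, hi in zip(x, h)]
            out.append(h)
        return out

    fwd = run(X, Wf)
    bwd = list(reversed(run(list(reversed(X)), Wb)))
    # Concatenate directions: H_t = [h_fwd_t ; h_bwd_t]
    return [f + b for f, b in zip(fwd, bwd)]
```

Note how the hidden state at the middle position is nonzero even when its own input is zero: it inherits information from its neighbors, which is exactly the context fusion described above.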

Step S304, classifying the hidden state of the feature vector of the character based on candidate characters in a word list to obtain the probability that the character is respectively mapped to a plurality of candidate characters in the word list.

In the embodiments of the present application, the hidden state of the feature vector of a character is classified based on the candidate characters in the word list by a classification model. The classification model may be any network for classifying the hidden state, for example, a forward classification network based on a Softmax function; the embodiments of the present application do not limit the specific structure of the classification model, as long as it can classify the hidden state and map the feature vector of the character to each candidate character in the word list. For example, if the word list includes 10000 characters, the long short-term memory network encodes the feature vector of a character to obtain the encoded hidden state H, and the classification model classifies H to obtain the probabilities that the character maps to the candidate characters in the word list.

Optionally, in an embodiment of the present application, step S304 may also be implemented by: classifying the hidden state of the feature vector of the character based on candidate characters in a word list according to a classification model to obtain initial probabilities of the character respectively mapped to a plurality of candidate characters in the word list; and normalizing the initial probabilities of the characters respectively mapped to the candidate characters in the word list through a first normalization function to obtain the probabilities of the characters respectively mapped to the candidate characters in the word list.

In this example, the characters are classified by combining the classification model and the normalization function to obtain the probabilities that each character maps to the candidate characters in the word list. For example, if the word list includes 10000 characters, the hidden state H is classified by the classification model and the normalization function to obtain 10000 probabilities, one per candidate character, whose sum is 1. Classifying the hidden state through a classification model and a normalization function reduces the error of the output result when it is subsequently determined from the candidate probabilities and the similarity. Optionally, the normalization function in this example may be a softmax function.
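A minimal sketch of this classification step, assuming a single linear layer W as the "forward classification network" followed by the first normalization function (softmax); the weights and dimensions are illustrative:

```python
import math

def softmax(scores):
    """Numerically stable softmax: the first normalization function."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def classify(hidden, W):
    """Map a hidden state H to probabilities over the word list:
    logits = W . H (one row of W per candidate character), then softmax."""
    logits = [sum(wi * hi for wi, hi in zip(row, hidden)) for row in W]
    return softmax(logits)
```

The output is a probability distribution over the candidate characters: non-negative values summing to 1, with the largest logit receiving the highest probability.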

It should be noted that, in step S304, the embodiments of the present application may also use only the classification model to classify the hidden state, without normalizing the probabilities mapped to the candidate characters. Because the product of the sound-shape vectors is normalized in the subsequent step S306 to obtain the similarity, and the probabilities of the candidate characters are then multiplied by that similarity, skipping normalization of the candidate probabilities does not affect the accuracy of the similarity probability of the text to be corrected.

Step S305, generating a pronunciation vector of the word list according to the pronunciations of the candidate characters in the word list; generating a font vector of the word list according to the fonts of the candidate characters in the word list; and generating the sound-shape vector of the word list as the sum of the pronunciation vector of the word list and the font vector of the word list.

The sound-shape vectors of the word list in this example include the sound-shape vectors of the candidate characters in the word list. Illustratively, the pronunciations and fonts of the candidate characters are respectively encoded by an encoding model to obtain a pronunciation vector and a font vector, and the sum of the two is used as the sound-shape vector, thereby obtaining the sound-shape vectors of the candidate characters in the word list.

Steps S301 to S304 have no required ordering relative to step S305; they may also be executed in parallel, which is not limited in the embodiments of the present application.

In addition, this step is optional; in practical applications, sound-shape vectors generated in advance in other manners may be directly obtained and used.

Step S306, multiplying the sound-shape vectors of the characters and the sound-shape vectors of the word list to obtain a sound-shape vector product; and normalizing the product of the sound-shape vectors according to the second normalization function to obtain the similarity.

In this example, the sound-shape vectors of the characters in the text to be corrected are multiplied by the sound-shape vectors of the word list to obtain a sound-shape vector product. This product reflects the similarity of each character in the text to be corrected to each candidate character in the word list, but the differences among these similarities can be large; if they were multiplied directly by the probabilities of the candidate characters, the resulting similarity probability would contain errors, reducing the accuracy of the output result determined from it. Therefore, in this example, the sound-shape vector product is normalized through a normalization function to obtain the similarity characterizing the text to be corrected against the correct text. Normalizing the sound-shape vector product reduces the error of the output result when it is subsequently determined from the candidate probabilities and the similarity. Optionally, the normalization function in this example may be a softmax function.
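A minimal sketch of this similarity computation, assuming dot products between sound-shape vectors followed by a per-character softmax (the second normalization function); the vectors below are illustrative toy values:

```python
import math

def softmax(scores):
    """The second normalization function applied to each row of products."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def phono_glyph_similarity(e_src, e_vocab):
    """For each character's sound-shape vector in e_src, take its dot
    product with every candidate's sound-shape vector in e_vocab, then
    normalize the row with softmax to obtain the similarity."""
    rows = []
    for c in e_src:
        dots = [sum(a * b for a, b in zip(c, v)) for v in e_vocab]
        rows.append(softmax(dots))
    return rows
```

Each row of the result is a normalized similarity distribution of one character over the candidates, so candidates whose sound-shape vectors point in the same direction as the character's receive higher similarity.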

Step S307, multiplying the probabilities and similarities of the candidate characters to obtain the similarity probabilities of the text to be corrected; and taking the text corresponding to the maximum probability value among the similarity probabilities as the output result of the text to be corrected.

In this example, the probabilities of the candidate characters represent the probability that each character maps to each candidate character in the word list, and the similarity represents the similarity of each character to each candidate character in the word list; multiplying the two yields the similarity probability of each character in the text to be corrected. The text corresponding to the maximum value among the similarity probabilities can then be used as the output result of the text to be corrected.
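A minimal sketch of combining the two signals as step S307 describes: multiply each candidate's probability by its similarity and pick the candidate with the largest product. The word list and probability values below are purely illustrative:

```python
def correct(p_vocab, p_similarity, vocab):
    """Similarity probability P' = P_vocab * P_similarity per candidate;
    the candidate with the largest P' is the output for each character."""
    out = []
    for pv, ps in zip(p_vocab, p_similarity):
        p_prime = [a * b for a, b in zip(pv, ps)]
        out.append(vocab[p_prime.index(max(p_prime))])
    return out
```

Note that a candidate favored by the language signal (P_vocab) can still be overruled when its sound-shape similarity is low, and vice versa, which is the point of combining the two.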

For example, suppose the text to be corrected is "I surprise-day at 11 o'clock eat KFC", a misspelling of "I today at 11 o'clock eat KFC" in which the character glossed "surprise" should be the similar-sounding character glossed "now". The candidate character corresponding to the maximum similarity probability of "I" is "I", that of "surprise" is "now", that of "day" is "day", that of "o'clock" is "o'clock", that of "eat" is "eat", and the numeral "11" and the English abbreviation "KFC" remain unchanged. The output result of the text to be corrected in this example may take three forms. In the first, the candidate character with the maximum similarity probability for each character is used as the output result: "I-I, surprise-now, day-day, 11-11, o'clock-o'clock, eat-eat, KFC-KFC". In the second, an output text is constructed from those candidate characters and used as the output result: "I eat KFC at 11 o'clock today". In the third, only the corrected text is output, i.e., the candidate character corresponding to the maximum similarity probability of each erroneous character: "surprise-now".

In the prior art, when correcting text, a confusion set of similar pronunciations and similar fonts is usually constructed in advance, and much manual intervention is required to construct the relationships between similar pronunciations and similar fonts. When correcting a text, candidate characters are generated from the pre-constructed confusion set, the candidates are ranked, and the candidate with the highest probability of appearing at the current position is selected, thereby obtaining the output result of the text to be corrected.

However, while limiting the search space to characters with similar pronunciations and similar fonts does reduce the search space, the confusion set also limits the accuracy of text error correction: on the one hand, a confusion set cannot accurately model the degree of similarity between characters in sound and shape; on the other hand, a scheme that relies only on a confusion set of similar pronunciations and similar fonts cannot handle spelling errors that are neither similar in pronunciation nor similar in font, reducing the accuracy of text error correction.

In the embodiments of the present application, the text is segmented to obtain a plurality of characters, and each character is vectorized according to its semantics, pronunciation, and font. Because all three are considered when generating the feature vector, the feature vector carries richer information and expresses the character's features more accurately. The feature vector is encoded with context information to obtain the character's hidden state, and the hidden state is classified with a classification network and a normalization function to obtain the probability that each character maps to the candidate characters in the word list. The similarity between each character and the candidate characters in the word list is calculated, characterizing the similarity between the text to be corrected and the correct text, and the similarity probability is computed from the candidate probabilities and this similarity, from which the output result is obtained. Because semantics, pronunciation, and font are considered when determining the candidate probabilities, and pronunciation and font are considered when determining the similarity, combining the two allows the text error correction scheme provided by the embodiments of the present application to handle not only the similar-pronunciation and similar-font spelling errors addressed by traditional schemes, but also spelling errors that are neither similar in pronunciation nor similar in font, improving the accuracy of text error correction.

It should be noted that the main objective of the embodiments of the present application is to correct Chinese characters: a hidden state is obtained through character segmentation, character vectorization, and context-information encoding; a classification network yields the probability of each character mapping to the candidate characters in the word list; the similarity between each character and the candidate characters is calculated; and the similarity probability is computed from the candidate probabilities and the similarity, from which the output result is obtained. Characters such as numerals, English, and punctuation also pass through these steps, but because the models used in these steps are trained mainly on the Chinese output results via the back-propagation algorithm, the output results for numerals, English, and punctuation are not optimized. It can be understood that for phonogram characters, numerals, and special characters in the text to be corrected, the above steps are still executed during error correction but effectively perform no processing on them, and the output result is the same unchanged phonogram character, numeral, or special character.

In the above steps S301 to S307, three machine learning model parts are used: an encoding model part for vectorizing the characters to obtain the semantic vector, pronunciation vector, and font vector; an encoding model part for encoding the feature vectors of the characters to obtain the hidden states; and a classification model part for classifying the hidden states of the feature vectors of the characters.

Optionally, in an embodiment of the present application, a parallel corpus including an error text and a correct text corresponding to the error text is obtained; constructing a training sample according to the parallel corpora; training the machine learning model according to the training samples so that the trained machine learning model executes a text error correction method; wherein the machine learning model comprises: the character encoding system comprises an encoding model part for vectorizing characters, an encoding model part for encoding the characteristic vectors of the characters to obtain hidden states, and a classification model part for classifying the hidden states of the characteristic vectors of the characters.

The parallel corpus includes an error text and the correct text corresponding to it; for example, the error text is "countless wind and rain in human voice" and the correct text is "countless wind and rain in life". Training samples are constructed from the parallel corpus, and the machine learning model is trained on them.

To distinguish, during training, the encoding model that vectorizes the characters from the encoding model that encodes the feature vectors to obtain the hidden states, the latter is referred to in this example as the neural network model; in this example, the neural network model can be understood as an encoding model.

The specific training steps include: segmenting the error text sample to obtain a plurality of first characters in the error text, and segmenting the corresponding correct text sample to obtain a plurality of second characters in the correct text. The semantics, pronunciations, and fonts of the first characters are encoded by the initial encoding model to generate their semantic vectors, pronunciation vectors, and font vectors, which are added to obtain the feature vectors. For each first character, its feature vector is encoded by the initial neural network model to obtain the hidden state of the feature vector. The hidden state of the feature vector of each first character is classified based on the candidate characters in the word list by the initial classification model, obtaining the probability that the first character maps to the candidate characters in the word list. The sound-shape vectors of the first characters are multiplied by the sound-shape vectors of the word list, and the similarity characterizing the error text sample against the correct text sample is obtained through a normalization function. The output result of the error text sample is determined from the probabilities of the first characters mapping to the candidate characters and the similarity.
A loss value is calculated according to a loss function, combining the output result of the error text sample with the plurality of second characters in the correct text; and the initial encoding model, initial classification model, and initial neural network model are trained according to the loss value to obtain the encoding model, classification model, and neural network model.

When the three machine learning model parts are trained with the training samples, the model parameters are continuously adjusted and iteratively updated until a training termination condition is reached, yielding the trained encoding model, classification model, and neural network model. The training termination condition is, for example, that the number of training iterations reaches a preset number, or that the prediction result of the machine learning model falls within a preset deviation range. Illustratively, the loss function used in the training step may be a cross-entropy loss function.
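As an illustration of the loss computation only (a sketch, not the patent's training code; the gradients of this loss would be back-propagated to all three model parts), the cross-entropy over the similarity probabilities might be computed as:

```python
import math

def cross_entropy_loss(p_prime_rows, target_ids):
    """Cross-entropy over the similarity probabilities P': the loss is the
    mean negative log-probability that each row assigns to the correct
    (tgt) character. p_prime_rows[i] is P' for the i-th character, and
    target_ids[i] is the word-list index of the correct character."""
    eps = 1e-12  # guard against log(0)
    losses = [-math.log(row[t] + eps)
              for row, t in zip(p_prime_rows, target_ids)]
    return sum(losses) / len(losses)
```

A model that puts more probability mass on the correct character receives a lower loss, which is what drives the parameter updates toward the termination condition described above.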

In the embodiments of the present application, learning from the training data yields an encoding model for vectorizing the characters, an encoding model for encoding the feature vectors of the characters to obtain hidden states, and a classification model for classifying the hidden states of the feature vectors. With these three machine learning model parts, the probability that each character maps to the candidate characters in the word list is more accurate, and combining it with the sound-shape similarity between the error text and the correct text solves the problem that a confusion set in the traditional scheme cannot accurately model the similarity between characters in sound and shape. The text error correction scheme provided by the embodiments of the present application can handle not only the similar-pronunciation and similar-font spelling errors of traditional schemes, but also spelling errors that are neither similar in pronunciation nor similar in font, improving the accuracy of text error correction.

The text error correction method of the embodiments of the present application may be executed by any suitable electronic device with data processing capability, including but not limited to: a server, a mobile terminal (such as a mobile phone, a tablet, etc.), a PC, etc.

Example III,

In this embodiment, the text error correction method of the embodiments of the present application is described by taking text error correction with a machine learning model as an example. The machine learning model includes: an encoding model part for vectorizing the characters, an encoding model part for encoding the feature vectors of the characters to obtain hidden states, and a classification model part for classifying the hidden states of the feature vectors of the characters.

First, a training process of the machine learning model will be briefly described below. The training process comprises:

(I) Constructing training samples.

Specifically, a parallel corpus including an error text and a correct text corresponding to the error text may be obtained; and constructing a training sample according to the parallel corpus.

For example, the parallel corpus is labeled as sentence pairs of a misspelled sentence and its correctly spelled counterpart, e.g., the error sentence "countless wind and rain in human voice" ---> the correct sentence "countless wind and rain in life". Both sentences are segmented character by character to obtain the error sequence src and the correct sequence tgt, together with the plurality of characters corresponding to each. Training samples are constructed on this basis.

(II) Training the machine learning model according to the training samples so that the trained machine learning model executes the text error correction method.

Specifically, the method may include:

2.1. Through the encoding model part of the machine learning model for vectorizing the characters: (1) perform semantic vectorization (i.e., word vectorization) on each character in the error text src according to its semantics, obtaining a semantic vector matrix E_v formed by the semantic vectors of the characters in src; (2) perform pronunciation vectorization on each character in src according to its pronunciation, obtaining a pronunciation vector matrix E_p formed by the pronunciation vectors of the characters in src; (3) perform font vectorization on each character in src according to its font, obtaining a font vector matrix E_s formed by the font vectors of the characters in src; (4) add them to obtain the feature vector matrix E of the characters in src, where E = E_v + E_p + E_s. The encoding model part in this step may be any suitable encoding model capable of vectorization.

2.2. Input the feature vector matrix E of the characters in src into the encoding model part for encoding the feature vectors to obtain hidden states, in this example a multilayer bidirectional LSTM network, which encodes based on sentence context information to obtain the hidden states H. Then, through the classification model part for classifying the hidden states of the feature vectors, such as a forward classification network, obtain the probability P_vocab that each character in src maps to each candidate character in the word list V.

It should be noted that the LSTM in this example may be replaced by an encoding model of another network structure; the embodiments of the present application are not limited in this respect, as long as the feature vector matrix E of the characters can be encoded to obtain the hidden states H.

3.1. Perform the same operations as (2) and (3) of 2.1 on the candidate characters in the word list V to obtain the pronunciation vector matrix EV_p and the font vector matrix EV_s of the word list, and add them to obtain the sound-shape vector matrix of the word list, E_vocab = EV_p + EV_s. Meanwhile, add the E_p and E_s obtained for src in (2) and (3) of 2.1 to obtain the sound-shape vector matrix of the src sentence, E_src = E_p + E_s. Multiply E_src by E_vocab, and obtain, through the softmax function of the classification model part, the similarity matrix P_similarity of each character in src to each candidate character in the word list V.

3.2. Multiply the probability P_vocab by the similarity matrix P_similarity to obtain the similarity probability P'.
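Pulling steps 2.1 through 3.2 together, a toy end-to-end forward pass using the document's matrix names might look as follows. This is a sketch with illustrative tiny dimensions: an identity "encoder" stands in for the multilayer bidirectional LSTM, and a single linear layer W stands in for the forward classification network.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def forward(E_v, E_p, E_s, EV_p, EV_s, W):
    """Toy forward pass: E = E_v + E_p + E_s; H = encoder(E) (identity
    here); P_vocab = softmax(W . H); E_src = E_p + E_s;
    E_vocab = EV_p + EV_s; P_similarity = softmax(E_src . E_vocab^T);
    and finally P' = P_vocab * P_similarity, element-wise."""
    n = len(E_v)
    E = [[a + b + c for a, b, c in zip(E_v[i], E_p[i], E_s[i])]
         for i in range(n)]
    H = E  # identity "encoder" stands in for the bidirectional LSTM
    P_vocab = [softmax([sum(w * h for w, h in zip(row, H[i])) for row in W])
               for i in range(n)]
    E_src = [[b + c for b, c in zip(E_p[i], E_s[i])] for i in range(n)]
    E_vocab = [[b + c for b, c in zip(p, s)] for p, s in zip(EV_p, EV_s)]
    P_sim = [softmax([sum(a * b for a, b in zip(E_src[i], v))
                      for v in E_vocab]) for i in range(n)]
    return [[pv * ps for pv, ps in zip(P_vocab[i], P_sim[i])]
            for i in range(n)]
```

During training, P' would then feed the cross-entropy loss of step 3.3; at inference, the candidate with the largest P' per character is the output.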

3.3. According to the similarity probability P' and tgt, calculate a loss value through the cross-entropy loss function; train the machine learning model through the back-propagation algorithm according to the loss value, adjusting the model parameters and iteratively updating until a training termination condition is reached.

After the training process is finished, the trained machine learning model has the text error correction function. The process of text error correction based on the trained machine learning model comprises the following steps:

(I) Segmenting the text to be corrected to obtain a plurality of characters of the text to be corrected.

The text to be corrected is explained taking an example that includes Chinese, punctuation, English, and numerals: the Chinese characters and punctuation are segmented into single characters, while English words and numerals are kept whole to preserve their integrity. For example, the text to be corrected "I surprise-day at 11 o'clock eat KFC" is segmented into "I | surprise | day | 11 | o'clock | , | eat | KFC".

(II) Through the encoding model part for vectorizing the characters, vectorize each of the characters according to its semantics, pronunciation, and font, and determine its feature vector from the resulting semantic vector, pronunciation vector, and font vector. Then, pass the feature vectors through the encoding model part that encodes them to obtain hidden states (in this example, the aforementioned LSTM network) and the classification model part that classifies the hidden states, obtaining the probability P_vocab that each segmented unit (i.e., each character) in the text to be corrected maps to each candidate character in the given word list V.

(III) Calculate the similarity from the sound-shape vector of each character of the text to be corrected and the sound-shape vectors of the candidate characters in the word list V, normalize it to obtain the similarity P_similarity characterizing the text to be corrected against the correct text, and determine the similarity probability P' as the product of P_vocab and P_similarity. Generate the error correction result from P' to obtain the spell-check result; for example, the probabilities in P' may be sorted and the candidate character with the highest probability selected as the output result. P_similarity can also be understood as the transition probability between the error text and the correct text.

In this example, candidate characters are generated by the LSTM with effective use of contextual semantic information, and by training the encoding model that vectorizes the characters, a more accurate sound-shape similarity P_similarity can be determined. This solves the problem that a confusion set in the traditional scheme cannot accurately model the similarity between characters in sound and shape; meanwhile, by combining P_similarity with the probability P_vocab, spelling errors that are neither similar in pronunciation nor similar in font can also be handled.

Text correction schemes for Western European languages such as English are mature and mainly rely on word-granularity checking and correction. Chinese differs greatly from such languages. First, the number of Chinese characters is very large, with more than 3,000 in common use, so the search space of a confusion set in a Chinese error correction scheme is far larger than in English. Second, Chinese words are generally short, so a single misspelling often changes the meaning of a word substantially. Traditional Chinese correction schemes usually pre-construct a confusion set of similar pronunciations and similar fonts, that is, a set of candidate characters similar to the characters in the text to be corrected; building such a set requires extensive manual intervention to establish the similar-pronunciation and similar-font relationships. According to this confusion set, the search space is restricted to characters with similar pronunciation or similar font. Being a hard, manually specified restriction of the candidate range, the confusion set limits the error correction performance of traditional Chinese correction schemes. Specifically, on the one hand, the confusion set itself cannot accurately model the degree of sound-shape similarity between characters; on the other hand, a scheme that relies solely on such a confusion set cannot handle spelling errors that are neither similar in pronunciation nor similar in font, which reduces the accuracy of text error correction.

The embodiment of the present application provides a text error correction scheme in which a coding model part for vectorizing characters, a coding model part for encoding the feature vectors of characters into hidden states, and a classification model part for classifying those hidden states are obtained by learning from training data. This machine learning model can accurately model how similar characters are in sound and shape, adaptively learning the similarity relations rather than relying on the "hard" similarity constraints on candidate characters of a traditional confusion set. The scheme can therefore handle not only the similar-pronunciation and similar-font spelling errors addressed by traditional schemes, but also spelling errors that are neither similar in pronunciation nor similar in font, improving the accuracy of text error correction.

The text error correction method of the embodiments of the present application may be executed by any suitable electronic device with data processing capability, including but not limited to: servers, mobile terminals (such as mobile phones, tablets, etc.), PCs, etc.

Embodiment Four

Based on any of the text error correction methods described in Embodiments One to Three, an embodiment of the present application provides a text error correction apparatus. As shown in fig. 4, which is a text error correction apparatus provided in an embodiment of the present application, the text error correction apparatus 40 includes: a segmentation module 401, configured to segment the text to be corrected to obtain a plurality of characters in the text to be corrected; a feature vector generation module 402, configured to obtain the semantic vector, pronunciation vector and font vector corresponding to each of the plurality of characters, and to generate the feature vectors of the characters from the semantic vectors, pronunciation vectors and font vectors; a probability determination module 403, configured to obtain, from the feature vectors of the characters, the probabilities that the characters in the text to be corrected map to the candidate characters in a word list; a similarity calculation module 404, configured to calculate the similarity between the text to be corrected and the correct text from the sound-shape vectors of the characters and the sound-shape vectors of the word list, where a sound-shape vector is generated from a pronunciation vector and a font vector; and an output result module 405, configured to determine the output result of the text to be corrected from the probabilities of the candidate characters and the similarity.

Optionally, in an embodiment of the present application, the probability determination module 403 is further configured to, for each of the plurality of characters, encode the feature vector of the character to obtain the hidden state of the feature vector, and to classify that hidden state based on the candidate characters in the word list to obtain the probabilities that the character maps to the candidate characters in the word list.

Optionally, in an embodiment of the present application, the probability determination module 403 is further configured to classify the hidden state of the feature vector of a character according to a classification model, based on the candidate characters in the word list, to obtain the initial probabilities that the character maps to the candidate characters in the word list; and to normalize those initial probabilities with a first normalization function to obtain the probabilities that the character maps to the candidate characters in the word list.
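A softmax is the usual choice for such a "first normalization function"; that choice, and the numeric scores below, are assumptions for illustration (the source only states that the initial probabilities are normalized):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max())
    return e / e.sum()

initial = np.array([2.0, 1.0, 0.1])  # illustrative initial scores for 3 candidates
probs = softmax(initial)             # normalized mapping probabilities
```

After normalization the scores form a proper probability distribution over the candidate characters, while the ranking of candidates is preserved.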

Optionally, in an embodiment of the present application, the similarity calculation module 404 is further configured to multiply the sound-shape vectors of the plurality of characters with the sound-shape vectors of the word list to obtain a sound-shape vector product, and to normalize that product according to a second normalization function to obtain the similarity.
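The module's two steps, multiplying sound-shape vectors and then normalizing the product, can be sketched with random stand-ins for the learned vectors; using a row-wise softmax as the "second normalization function" is an assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, V = 4, 5  # illustrative sizes

char_ss = rng.normal(size=(3, dim))   # sound-shape vectors of 3 input characters
vocab_ss = rng.normal(size=(V, dim))  # sound-shape vectors of the word list V

# Step 1: product of the characters' and the word list's sound-shape vectors.
scores = char_ss @ vocab_ss.T
# Step 2: normalize each row with the second normalization function
# (a row-wise softmax here, as an assumption).
e = np.exp(scores - scores.max(axis=1, keepdims=True))
p_similar = e / e.sum(axis=1, keepdims=True)
```

Each row of `p_similar` is the similarity distribution of one input character over all candidate characters in the word list.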

Optionally, in an embodiment of the present application, the output result module 405 is further configured to multiply the probabilities of the candidate characters with the similarity to obtain the similarity probability of the text to be corrected, and to take the character corresponding to the maximum value in that similarity probability as the output result of the text to be corrected.

Optionally, in an embodiment of the present application, the text error correction apparatus 40 further includes a sound-shape vector calculation module, configured to generate the pronunciation vector of the word list according to the pronunciations of the candidate characters in the word list; to generate the font vector of the word list according to the fonts of those candidate characters; and to generate the sound-shape vector of the word list as the sum of the pronunciation vector of the word list and the font vector of the word list.
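The sound-shape vector calculation reduces to an element-wise sum per candidate character; in this sketch the pronunciation and font vectors are random stand-ins for the vectors derived from the word list's pronunciations and fonts:

```python
import numpy as np

rng = np.random.default_rng(3)
V, dim = 4, 8  # illustrative word-list size and vector width

# Stand-ins for vectors derived from the pronunciations and the fonts of the
# word list's candidate characters (random here; learned in practice).
pron_vecs = rng.normal(size=(V, dim))
font_vecs = rng.normal(size=(V, dim))

# The word list's sound-shape vector is the sum of the two, per candidate.
soundshape_vecs = pron_vecs + font_vecs
```

The resulting matrix is what the similarity calculation module multiplies against the input characters' sound-shape vectors.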

Optionally, in an embodiment of the present application, the text to be corrected includes, in addition to ideographic characters, at least one of phonetic characters, numerals and special characters; the segmentation module 401 is further configured to segment the ideographic characters in the text to be corrected character by character to obtain a plurality of ideograms, and to segment the phonetic characters, numerals and other characters in the text to be corrected according to a preset segmentation rule to obtain a plurality of segmented characters; the plurality of characters includes the plurality of ideograms and the plurality of segmented characters.
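A minimal segmentation sketch: CJK ideographs are split one character at a time, while runs of letters, digits and other symbols are grouped by a simple regular expression. The source's "preset segmentation rule" is not specified, so the grouping rule here is an assumption:

```python
import re

def split_text(text):
    """Split ideographic (CJK) characters one by one; group runs of phonetic
    letters, digits, and other non-space symbols as single units."""
    pattern = r'[\u4e00-\u9fff]|[A-Za-z]+|\d+|[^\u4e00-\u9fffA-Za-z\d\s]+'
    return re.findall(pattern, text)

tokens = split_text("天气hao好123!")
# → ['天', '气', 'hao', '好', '123', '!']
```

Ideographs become individual tokens while the Latin run "hao" and the digit run "123" each stay as one segmented character unit.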

Optionally, in an embodiment of the present application, the text error correction apparatus 40 further includes a training module, configured to obtain a parallel corpus containing erroneous texts and the correct texts corresponding to them; to construct training samples from the parallel corpus; and to train the machine learning model on the training samples so that the trained model performs the text error correction method. The machine learning model includes: a coding model part for vectorizing characters, a coding model part for encoding the feature vectors of characters into hidden states, and a classification model part for classifying those hidden states.
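Constructing training samples from the parallel corpus can be sketched as pairing each erroneous character with its correct counterpart. The length-equality filter reflects the assumption that spelling errors preserve text length; this filter, and the toy corpus below, are illustrative and not stated in the source:

```python
def build_training_samples(parallel_corpus):
    """Turn (erroneous_text, correct_text) pairs into aligned character
    sequences for training. Pairs whose lengths differ are skipped,
    since they cannot be aligned character by character."""
    samples = []
    for wrong, correct in parallel_corpus:
        if len(wrong) != len(correct):
            continue  # no character-level alignment possible
        samples.append((list(wrong), list(correct)))
    return samples

# Illustrative parallel corpus: '惊' is a misspelling of '京' in the first pair.
corpus = [("我爱北惊", "我爱北京"), ("短", "长短")]
samples = build_training_samples(corpus)  # the mismatched pair is skipped
```

Each sample pairs the model's input sequence with its per-position target characters, which is the supervision the classification model part needs.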

The text error correction apparatus of the embodiment of the present application is used to implement the corresponding text error correction methods of the foregoing method embodiments and has the beneficial effects of those embodiments, which are not repeated here. In addition, for the functional implementation of each module in the text error correction apparatus, reference may be made to the descriptions of the corresponding parts of the foregoing method embodiments, which are likewise not repeated here.

Embodiment Five

Based on any of the text error correction methods described in Embodiments One to Three, the present application provides an electronic device. It should be noted that the text error correction method of the present application may be executed by any suitable electronic device with text error correction capability, including but not limited to: servers, mobile terminals (such as mobile phones, tablets, etc.), PCs, etc. As shown in fig. 5, which is a structural diagram of an electronic device according to an embodiment of the present application, the specific embodiments of the present application do not limit the specific implementation of the electronic device. The electronic device 50 may include: a processor 502, a communications interface 504, a memory 506, and a communication bus 508.

The processor 502, the communication interface 504 and the memory 506 communicate with one another via the communication bus 508.

A communication interface 504 for communicating with other electronic devices or servers.

The processor 502 is configured to execute the computer program 510, and may specifically perform the relevant steps in the above-described text error correction method embodiment.

In particular, the computer program 510 may comprise computer program code comprising computer operating instructions.

The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The electronic device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.

The memory 506 is configured to store the computer program 510. The memory 506 may comprise a high-speed RAM memory and may also include a non-volatile memory, such as at least one disk memory.

For the specific implementation of each step in the computer program 510, reference may be made to the corresponding steps and the corresponding descriptions of the units in the foregoing text error correction method embodiments, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments.

Embodiment Six

Based on the text error correction methods described in the first to third embodiments, an embodiment of the present application provides a computer storage medium storing a computer program, and the computer program is executed by a processor to implement the text error correction methods described in the first to third embodiments.

Based on the text error correction methods described in the first to third embodiments, an embodiment of the present application provides a computer program product, which includes computer instructions for instructing a computing device to execute operations corresponding to the text error correction methods described in the first to third embodiments.

It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present application may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present application.

The above-described methods according to the embodiments of the present application may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and stored in a local recording medium, so that the methods described herein can be processed by such software on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the text error correction methods described herein. Further, when a general-purpose computer accesses code for implementing the text error correction methods shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing those methods.

Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.

The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of patent protection of the embodiments of the present application should be defined by the claims.
