Chinese text error correction method based on pinyin identity or similarity

Document No.: 1043308 Publication date: 2020-10-09

Reading note: This technology, "Chinese text error correction method based on pinyin identity or similarity", was designed by 何卓威 on 2020-06-03. Abstract: The invention provides a Chinese text error correction method based on pinyin identity or similarity, comprising the following steps: S1, adapting the traditional n-gram language model to build a Chinese character-structure language model whose granularity is a single Chinese character; S2, performing candidate processing on the sentence to be corrected to generate a candidate sequence; S3, performing error detection on the candidate sequence based on a confusion set and the MAD algorithm to obtain the candidate sequence of the sentence to be corrected; S4, decoding with a double-selection Viterbi algorithm, based on the maximum posterior probability of the Chinese character-structure language model, and outputting the error correction result. Compared with the traditional word-granularity approach, the method is more accurate and corrects errors faster.

1. A Chinese text error correction method based on pinyin identity or similarity, characterized by comprising the following steps:

S1, adapting the traditional n-gram language model to build a Chinese character-structure language model whose granularity is a single Chinese character;

S2, performing candidate processing on the sentence to be corrected to generate a candidate sequence;

S3, performing error detection on the candidate sequence based on a confusion set and the MAD algorithm to obtain the candidate sequence of the sentence to be corrected;

and S4, decoding with a double-selection Viterbi algorithm, based on the maximum posterior probability of the Chinese character-structure language model, and outputting the error correction result.

2. The Chinese text error correction method based on pinyin identity or similarity as claimed in claim 1, wherein step S1 specifically comprises:

S101, preprocessing the corpus and generating a word segmentation file;

S102, converting the word segmentation file into pinyin phrases, and then splitting the pinyin phrases into character structures, all of which together form a character-structure text;

S103, generating the Chinese character-structure language model, whose granularity is a single Chinese character, from the text finally converted into character structures.

3. The Chinese text error correction method based on pinyin identity or similarity as claimed in claim 2, wherein step S1 further comprises:

S104, upgrading the Chinese character-structure language model into a class language model, grouping similar words into word classes, and replacing the member words with their word class labels when computing n-gram statistics.

4. The Chinese text error correction method based on pinyin identity or similarity as claimed in claim 2, wherein step S101 specifically comprises:

S111, unifying the text corpus to half-width characters, removing punctuation, and processing it line by line;

S112, converting Chinese numerals into Arabic numerals;

and S113, performing word segmentation with the jieba Chinese word segmentation library to obtain the word segmentation file.

5. The Chinese text error correction method based on pinyin identity or similarity as claimed in claim 1, wherein step S2 specifically comprises: according to homophone or similar-sound rules, generating candidate sequences for a sentence character by character, with one layer per character and an average of 100 to 150 candidates per layer, wherein the pinyin of common polyphonic characters is optimized and a small number of infrequent pronunciations are removed.

6. The Chinese text error correction method based on pinyin identity or similarity as claimed in claim 1, wherein step S3 specifically comprises:

Confusion set error detection:

S301, judging whether the sentence to be corrected contains an entry from a user-defined error set, and if so, directly outputting the correction result;

S302, after the sentence to be corrected is segmented into words, treating phrases with abnormal word frequency as an error set and adding them to the candidate sequence of the sentence to be corrected;

MAD algorithm error detection:

S311, dynamically dividing the sentence to be corrected into two groups of segments, of length 2 and length 3 respectively;

S312, calculating the language model probability of the segments in each of the two groups;

S313, taking a weighted average according to the segment length to obtain two groups of weighted probability values;

and S314, averaging the two groups of weighted probability values, and then detecting outliers with the MAD algorithm, each outlier being an error position.

7. The Chinese text error correction method based on pinyin identity or similarity as claimed in claim 6, wherein step S314 specifically comprises:

(1) calculating the median of all observation points, median(X);

(2) calculating the absolute deviation of each observation point from the median, abs(X - median(X));

(3) calculating the median of the absolute deviations from (2), i.e., MAD = median(abs(X - median(X)));

(4) dividing the values obtained in (2) by the value obtained in (3) to obtain, for all observation points, a set of MAD-based distances from the center, abs(X - median(X)) / MAD;

(5) setting a threshold, any observation point whose distance exceeds the threshold being considered an outlier, i.e., an error position.

8. The Chinese text error correction method based on pinyin identity or similarity as claimed in claim 1, wherein the double-selection Viterbi algorithm, which combines the beam search algorithm with the Viterbi algorithm, specifically comprises:

S401, setting two constraint parameters, BeamSize1 and BeamSize2, wherein BeamSize1 < BeamSize2;

S402, at the current candidate layer, first using the beam search algorithm to obtain the BeamSize1 maximum-probability paths;

S403, excluding the nodes already covered by the BeamSize1 paths, and filling the remaining BeamSize2 - BeamSize1 slots from the remaining nodes, node by node, according to the Viterbi algorithm, so that each layer keeps BeamSize2 maximum-probability paths;

and S404, outputting the error correction result using the maximum-probability path.

Technical Field

The invention relates to the technical field of text error correction, in particular to a Chinese text error correction method based on pinyin identity or similarity.

Background

Text error correction applies to many fields. In assisted typing, wrongly written characters can be automatically detected and flagged after the user types, which reduces errors caused by carelessness and improves the user's input efficiency and quality. In search error correction, users of search interfaces such as e-commerce sites and search engines often mistype their queries; by analyzing the form and characteristics of the search terms, the query can be corrected automatically and the user prompted, so that results better matching the user's intent are returned and the effect of wrongly written characters on the user's real needs is masked. In speech recognition and conversational robots, embedding text error correction in the dialogue system allows wrongly written characters produced when speech is converted to text to be corrected automatically, and the corrected sentence is passed to the dialogue understanding system, which markedly improves speech recognition accuracy and the overall product experience.

In the prior art, substitution errors between similar characters require error detection and correction. This is usually done with a confusion set, correcting the range to be modified character by character; building a confusion set takes a great deal of time and manual maintenance, so it is costly and inconvenient to use. Existing statistical error correction language models are usually based on word granularity, that is, they take the word as the unit of analysis and examine the relations between words to correct errors. However, the accuracy of this traditional word-granularity approach is low, and a new model built on a different idea is needed.

Disclosure of Invention

To address the problems that building a confusion set takes a great deal of time and manual maintenance, is costly and inconvenient to use, and that the traditional word-granularity method has low accuracy, the invention provides a Chinese text error correction method based on pinyin identity or similarity. The method builds a Chinese character-structure language model whose granularity is a single Chinese character, detects errors in the candidate sequence using a confusion set and the MAD algorithm, and decodes with a double-selection Viterbi algorithm to output the error correction result.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A Chinese text error correction method based on pinyin identity or similarity comprises the following steps:

S1, adapting the traditional n-gram language model to build a Chinese character-structure language model whose granularity is a single Chinese character;

S2, performing candidate processing on the sentence to be corrected to generate a candidate sequence;

S3, performing error detection on the candidate sequence based on a confusion set and the MAD algorithm to obtain the candidate sequence of the sentence to be corrected;

and S4, decoding with a double-selection Viterbi algorithm, based on the maximum posterior probability of the Chinese character-structure language model, and outputting the error correction result.

By building a Chinese character-structure language model whose granularity is a single Chinese character, the method corrects errors at character granularity, that is, it takes the single character as the unit of analysis and considers the relations between characters, which gives higher accuracy than the traditional method. Error detection on the candidate sequence is based on a confusion set and the MAD algorithm; the preparatory work needs little manual effort, requiring only a text corpus from the vertical domain rather than extensive time and labor for maintenance, so the method is low in cost and convenient to use. The error correction result is decoded and output with the double-selection Viterbi algorithm, which screens out the most probable path faster and more accurately than the traditional method.

Preferably, the step S1 specifically includes:

S101, preprocessing the corpus and generating a word segmentation file;

S102, converting the word segmentation file into pinyin phrases, and then splitting the pinyin phrases into character structures, all of which together form a character-structure text;

S103, generating the Chinese character-structure language model, whose granularity is a single Chinese character, from the text finally converted into character structures.
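As a rough illustration of a language model at single-character granularity, the following Python sketch counts character bigrams and scores a sentence with add-one smoothing. The function names, the bigram order, the smoothing scheme, and the toy corpus are all assumptions made for illustration; the source does not specify them, and the pinyin conversion of S102 is omitted here.

```python
import math
from collections import Counter


def train_char_bigram(lines):
    """Count character-level bigrams; each line is one preprocessed sentence."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = {"</s>"}
    for line in lines:
        chars = ["<s>"] + list(line) + ["</s>"]
        vocab.update(line)
        context_counts.update(chars[:-1])          # every token that has a successor
        bigram_counts.update(zip(chars, chars[1:]))
    return context_counts, bigram_counts, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed log probability of a sentence (assumed smoothing)."""
    context_counts, bigram_counts, vocab_size = model
    chars = ["<s>"] + list(sentence) + ["</s>"]
    return sum(
        math.log((bigram_counts[(a, b)] + 1) / (context_counts[a] + vocab_size))
        for a, b in zip(chars, chars[1:])
    )


# Toy corpus, for illustration only.
model = train_char_bigram(["今天天气好", "今天很好"])
```

With this toy model, a sentence seen in training scores higher than the same sentence with one character replaced by an unseen homophone, which is the signal the later detection and decoding steps rely on.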

Preferably, the step S1 further includes:

S104, upgrading the Chinese character-structure language model into a class language model, grouping similar words into word classes, and replacing the member words with their word class labels when computing n-gram statistics.

The Chinese character-structure language model is built on the n-gram model, which is sparse and treats all words as completely different things. For any given word, the model needs enough training data to estimate its probability accurately. By taking word similarity into account and grouping commonly confused phrases into word classes, the rest of the sentence can be checked more reliably for errors, which improves error correction accuracy.
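The class-model idea can be sketched in Python as follows: members of a word class are replaced by the class label before n-gram statistics are counted, so that sparse individual words share counts. The class name `<CITY>`, its members, and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical word class; the label and its members are illustrative only.
WORD_CLASSES = {"<CITY>": {"北京", "上海", "广州"}}


def to_class_tokens(tokens, classes=WORD_CLASSES):
    """Replace every word that belongs to a class with the class label."""
    member_to_label = {
        word: label for label, members in classes.items() for word in members
    }
    return [member_to_label.get(tok, tok) for tok in tokens]


def class_bigram_counts(segmented_sentences, classes=WORD_CLASSES):
    """Count bigrams over class-mapped tokens instead of raw words."""
    counts = Counter()
    for tokens in segmented_sentences:
        mapped = to_class_tokens(tokens, classes)
        counts.update(zip(mapped, mapped[1:]))
    return counts


counts = class_bigram_counts([["我", "在", "北京"], ["我", "在", "上海"]])
```

Here the two sentences contribute to the same bigram, ("在", "<CITY>"), instead of two separate bigrams that each occur once.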

Preferably, the step S101 specifically includes:

S111, unifying the text corpus to half-width characters, removing punctuation, and processing it line by line;

S112, converting Chinese numerals into Arabic numerals;

and S113, performing word segmentation with the jieba Chinese word segmentation library to obtain the word segmentation file.

The invention converts pure numbers in the corpus, which carry no meaning for text error correction, into the wildcard <d>; that is, uppercase Chinese numerals and Arabic numerals that are not part of a phrase are mapped to a single class, which improves the generalization ability and error correction efficiency of the language model.
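A minimal Python sketch of the line-level preprocessing might look like the following. It assumes that Unicode NFKC normalization is an acceptable way to unify full-width characters to half-width, and it handles only runs of Arabic digits; converting Chinese numerals and the jieba segmentation step are left out. The function name is hypothetical.

```python
import re
import unicodedata


def preprocess_line(line: str) -> str:
    """Normalize one corpus line: half-width, no punctuation, digits -> <d>."""
    # NFKC folds full-width letters, digits and punctuation to half-width forms.
    text = unicodedata.normalize("NFKC", line)
    # Drop every character whose Unicode category is a punctuation class (P*).
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # Remove whitespace, then replace each run of digits with the wildcard.
    text = "".join(text.split())
    return re.sub(r"\d+", "<d>", text)


cleaned = preprocess_line("订单１２３号，已发货！")
```

The wildcard is inserted after punctuation removal so that its angle brackets are not affected by that step.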

Preferably, the step S2 specifically includes: according to homophone or similar-sound rules, candidate sequences are generated for a sentence character by character, with one layer per character and an average of 100 to 150 candidates per layer; the pinyin of common polyphonic characters is optimized and a small number of infrequent pronunciations are removed, which improves the generalization ability and error correction efficiency of the language model.
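The layered candidate structure can be sketched in Python as below. The tiny pinyin table and the similar-sound rule (treating zh/z, ch/c and sh/s as interchangeable initials) are assumptions for illustration; the source does not list its similarity rules, and a real system would use a full pinyin dictionary.

```python
# Toy pinyin table (hypothetical); a real system needs a full dictionary.
TOY_PINYIN = {
    "是": "shi", "事": "shi", "十": "shi", "四": "si",
    "天": "tian", "田": "tian", "填": "tian",
}
# Assumed similar-sound rule: retroflex and flat initials are confusable.
FUZZY_INITIALS = [("zh", "z"), ("ch", "c"), ("sh", "s")]


def fuzzy_variants(pinyin):
    """Return the pinyin itself plus its similar-sounding variants."""
    variants = {pinyin}
    for retroflex, flat in FUZZY_INITIALS:
        if pinyin.startswith(retroflex):
            variants.add(flat + pinyin[len(retroflex):])
        elif pinyin.startswith(flat):
            variants.add(retroflex + pinyin[len(flat):])
    return variants


def candidate_layers(sentence, table=TOY_PINYIN):
    """Build one candidate layer per character of the sentence."""
    by_pinyin = {}
    for ch, py in table.items():
        by_pinyin.setdefault(py, set()).add(ch)
    layers = []
    for ch in sentence:
        candidates = {ch}  # the original character is always a candidate
        if ch in table:
            for py in fuzzy_variants(table[ch]):
                candidates |= by_pinyin.get(py, set())
        layers.append(sorted(candidates))
    return layers


layers = candidate_layers("四天")
```

Characters missing from the table simply yield a single-candidate layer containing themselves.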

Preferably, the step S3 specifically includes:

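Confusion set error detection:

S301, judging whether the sentence to be corrected contains an entry from a user-defined error set, and if so, directly outputting the correction result;

S302, after the sentence to be corrected is segmented into words, treating phrases with abnormal word frequency as an error set and adding them to the candidate sequence of the sentence to be corrected;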

MAD algorithm error detection:

S311, dynamically dividing the sentence to be corrected into two groups of segments, of length 2 and length 3 respectively;

S312, calculating the language model probability of the segments in each of the two groups;

S313, taking a weighted average according to the segment length to obtain two groups of weighted probability values;

and S314, averaging the two groups of weighted probability values, and then detecting outliers with the MAD algorithm, each outlier being an error position.
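One possible reading of steps S311 to S313 is sketched below in Python: the sentence is cut into sliding windows of length 2 and 3, each window is scored by the language model, each character receives the average score of the windows that cover it, and the two per-character series are then averaged. The sliding-window interpretation, the edge padding, the function names, and the toy scoring function are all assumptions; the source states the steps only in outline.

```python
def position_scores(sentence, score_fn, n):
    """Average, per character, the scores of the length-n windows covering it."""
    # Requires len(sentence) >= n.
    windows = [score_fn(sentence[i:i + n]) for i in range(len(sentence) - n + 1)]
    # Pad both ends so that every position is covered by exactly n windows.
    padded = [windows[0]] * (n - 1) + windows + [windows[-1]] * (n - 1)
    return [sum(padded[i:i + n]) / n for i in range(len(sentence))]


def combined_scores(sentence, score_fn, lengths=(2, 3)):
    """Average the per-character series obtained for each window length."""
    series = [position_scores(sentence, score_fn, n) for n in lengths]
    return [sum(values) / len(values) for values in zip(*series)]


def toy_score(segment):
    """Stand-in for a language model log probability (hypothetical)."""
    return -10.0 if "添" in segment else -1.0


scores = combined_scores("今添天气好", toy_score)
```

The resulting per-character scores are the observation points that the MAD step then inspects for outliers.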

Preferably, the step S314 specifically includes:

(1) calculating the median of all observation points, median(X);

(2) calculating the absolute deviation of each observation point from the median, abs(X - median(X));

(3) calculating the median of the absolute deviations from (2), i.e., MAD = median(abs(X - median(X)));

(4) dividing the values obtained in (2) by the value obtained in (3) to obtain, for all observation points, a set of MAD-based distances from the center, abs(X - median(X)) / MAD;

(5) setting a threshold, any observation point whose distance exceeds the threshold being considered an outlier, i.e., an error position.
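The five steps above translate almost directly into Python. In the sketch below, the threshold value of 3.0 and the choice to report no outliers when the MAD is zero are assumptions; the source only says that a threshold is set.

```python
from statistics import median


def mad_outliers(values, threshold=3.0):
    """Return the indices whose MAD-based distance exceeds the threshold."""
    med = median(values)                       # step (1)
    abs_dev = [abs(v - med) for v in values]   # step (2)
    mad = median(abs_dev)                      # step (3)
    if mad == 0:
        # Assumed handling: more than half the points are identical,
        # so the ratio in step (4) is undefined; report nothing.
        return []
    # Steps (4) and (5): scaled distance compared against the threshold.
    return [i for i, dev in enumerate(abs_dev) if dev / mad > threshold]


error_positions = mad_outliers([-2.1, -2.0, -2.2, -9.5, -2.1])
```

For these sample scores the median is -2.1 and the MAD is about 0.1, so only the fourth value, at index 3, lies far enough from the center to be flagged.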

Preferably, the double-selection Viterbi algorithm, which combines the beam search algorithm with the Viterbi algorithm, specifically includes:

S401, setting two constraint parameters, BeamSize1 and BeamSize2, wherein BeamSize1 < BeamSize2;

S402, at the current candidate layer, first using the beam search algorithm to obtain the BeamSize1 maximum-probability paths;

S403, excluding the nodes already covered by the BeamSize1 paths, and filling the remaining BeamSize2 - BeamSize1 slots from the remaining nodes, node by node, according to the Viterbi algorithm, so that each layer keeps BeamSize2 maximum-probability paths;

and S404, outputting the error correction result using the maximum-probability path.

Compared with the beam search algorithm, the Viterbi algorithm comes closer to the true maximum-probability path. However, the language model used here is a 5-gram model; when it is combined with the Viterbi algorithm, for a sentence to be corrected whose total length exceeds 5, each node keeps only its own maximum-probability path, and that path is not necessarily the globally maximum-probability path.

To address this problem, the invention combines the beam search algorithm with the Viterbi algorithm and names the result the double-selection Viterbi algorithm. As in beam search, two constraint parameters, BeamSize1 and BeamSize2 (BeamSize1 < BeamSize2), must be set. At the current candidate layer, the beam search algorithm is applied first to obtain the BeamSize1 maximum-probability paths; the nodes already covered by those BeamSize1 paths are then excluded, and the remaining BeamSize2 - BeamSize1 slots are filled node by node from the remaining nodes according to the beam search algorithm, so that each layer keeps BeamSize2 maximum-probability paths. The double-selection Viterbi algorithm comes closer to the true maximum-probability path, and screens out the maximum-probability path faster and more accurately than the traditional method.
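The two-stage selection can be sketched in Python as follows. This is one interpretation of the description, not the patented implementation: the first stage keeps the top BeamSize1 paths overall, and the second stage keeps, for each node not yet covered, only that node's best path (a per-node selection in the spirit of Viterbi, as claim 8 words it) until BeamSize2 paths are held. The function names, the default beam sizes, and the toy bigram scores are hypothetical.

```python
def dual_select_decode(layers, score_fn, beam1=2, beam2=4):
    """Decode candidate layers, keeping at most beam2 paths per layer."""
    paths = [(0.0, "")]  # (cumulative log probability, decoded prefix)
    for layer in layers:
        expanded = [
            (logp + score_fn(prefix, ch), prefix + ch)
            for logp, prefix in paths
            for ch in layer
        ]
        expanded.sort(key=lambda item: item[0], reverse=True)
        # First selection: plain beam search over all expanded paths.
        top = expanded[:beam1]
        covered = {prefix[-1] for _, prefix in top}
        # Second selection: best path per node not covered by the first one.
        best_per_node = {}
        for logp, prefix in expanded[beam1:]:
            node = prefix[-1]
            if node not in covered and node not in best_per_node:
                best_per_node[node] = (logp, prefix)  # first hit is the best
        fill = sorted(best_per_node.values(), key=lambda item: item[0], reverse=True)
        paths = top + fill[: beam2 - beam1]
    return max(paths, key=lambda item: item[0])[1]


# Toy bigram log probabilities standing in for the language model.
GOOD_BIGRAMS = {("今", "天"): -0.1, ("天", "天"): -0.2}


def toy_score(prefix, ch):
    if not prefix:
        return -1.0 if ch == "今" else -2.0
    return GOOD_BIGRAMS.get((prefix[-1], ch), -5.0)


result = dual_select_decode([["今", "金"], ["天", "添", "田"], ["天", "田"]], toy_score)
```

The second selection keeps alive nodes that a plain beam of size BeamSize1 would have dropped, which is how the scheme tries to stay closer to the globally best path while bounding the number of paths per layer.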

The invention has the following beneficial effects: by building a Chinese character-structure language model whose granularity is a single Chinese character, the method corrects errors at character granularity, that is, it takes the single character as the unit of analysis and considers the relations between characters, which gives higher accuracy than the traditional method. Error detection on the candidate sequence is based on a confusion set and the MAD algorithm; the preparatory work needs little manual effort, requiring only a text corpus from the vertical domain rather than extensive time and labor for maintenance, so the method is low in cost and convenient to use. The error correction result is decoded and output with the double-selection Viterbi algorithm, which screens out the most probable path faster and more accurately than the traditional method.

Drawings

FIG. 1 is a flow chart of the method of the present embodiment;

FIG. 2 is a flowchart of the Viterbi algorithm of the present embodiment;

FIG. 3 is a flowchart of the beam search algorithm of the present embodiment.

Detailed Description
