Wrongly-written character detection method and device, computer storage medium and electronic equipment

Document No.: 1544837 · Published: 2020-01-17

Abstract: This application, "Wrongly-written character detection method and device, computer storage medium and electronic equipment" (一种错别字检测方法、装置及计算机存储介质、电子设备), was created by 龚伟松 and 郭得庆 on 2019-09-09. It discloses a method, a device, a computer storage medium and an electronic device for detecting wrongly written characters, comprising: determining text data to be detected; converting the text data into pinyin data; generating a feature template of the pinyin data based on an ngram model; inputting the feature template of the pinyin data into a pre-constructed wrongly written character detection model, the wrongly written character detection model being obtained by training according to a conditional random field (CRF) model and a feature template based on an ngram model; and determining, according to the output of the wrongly written character detection model, whether the text data to be detected contains wrongly written characters. With the scheme of this application, wrongly written characters can be detected simply and efficiently.

1. A method for detecting wrongly written characters, comprising:

determining text data to be detected;

converting the text data into pinyin data;

generating a feature template of the pinyin data based on an ngram model;

inputting the feature template of the pinyin data based on an ngram model into a pre-constructed wrongly written character detection model, wherein the wrongly written character detection model is obtained by training according to a conditional random field (CRF) model and a feature template based on an ngram model;

and determining, according to the output of the wrongly written character detection model, whether the text data to be detected contains wrongly written characters.

2. The method of claim 1, wherein the wrongly written character detection model is constructed as follows:

collecting a training corpus;

annotating the training corpus with pinyin;

generating a feature template of the pinyin based on an ngram model;

and training a CRF model by taking the feature template as a feature function to obtain the wrongly written character detection model.

3. The method of claim 1 or 2, wherein generating the feature template of the pinyin data comprises:

generating a first feature for each pinyin according to the preceding pinyin and the following pinyin of that pinyin;

generating a second feature for each pinyin according to the number of times that pinyin appears in the pinyin data;

extracting the pinyin data with a preset window of 2 or 3 to generate binary character groups, and generating two third features by taking each character in the binary character groups as an ngram feature;

generating the feature template of the pinyin data according to the first feature, the second feature and the two third features, wherein the feature template of the pinyin data includes a feature template of each pinyin.

4. The method of claim 3, wherein generating the first feature for each pinyin according to the preceding pinyin and the following pinyin of that pinyin comprises:

determining the previous pinyin and the next pinyin of a current pinyin in the pinyin data;

generating a first feature of the current pinyin;

wherein the first feature is (current pinyin, previous pinyin of the current pinyin, next pinyin of the current pinyin).

5. The method of claim 3, wherein extracting the pinyin data with a preset window of 2 or 3 to generate binary character groups, and generating two third features by taking each character in the binary character groups as an ngram feature, comprises:

extracting the pinyin data with a window whose preset value is 2 or 3 to generate a binary character group;

generating a first third feature by taking the first character in the binary character group as an ngram feature, the first third feature being (pinyin of the i-th character, pinyin of the (i+1)-th character: probability that the pinyin preceding the pinyin of the (i+1)-th character is the pinyin of the i-th character);

generating a second third feature by taking the second character in the binary character group as an ngram feature, the second third feature being (pinyin of the j-th character, pinyin of the (j+1)-th character: probability that the pinyin following the pinyin of the j-th character is the pinyin of the (j+1)-th character).

6. The method of claim 2, further comprising, after generating the binary character groups and before generating the two third features:

counting the frequency of each binary character group;

and removing the binary character groups whose frequency is lower than a preset frequency threshold.

7. The method of claim 1, further comprising:

and correcting the wrongly written characters when the text data to be detected has the wrongly written characters.

8. A wrongly written character detecting apparatus, comprising:

the data determining module is used for determining text data to be detected;

the pinyin conversion module is used for converting the text data into pinyin data;

the template generating module is used for generating a feature template of the pinyin data based on an ngram model;

the model detection module is used for inputting the feature template of the pinyin data into a pre-constructed wrongly written character detection model, wherein the wrongly written character detection model is obtained by training according to a conditional random field (CRF) model and a feature template based on an ngram model;

and the result determining module is used for determining, according to the output of the wrongly written character detection model, whether the text data to be detected contains wrongly written characters.

9. A computer storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

10. An electronic device comprising one or more processors, and memory for storing one or more programs; the one or more programs, when executed by the one or more processors, implement the method of any of claims 1 to 7.
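The feature construction recited in claims 3 to 6 could be sketched roughly as follows. This is a minimal illustration in Python and not the applicants' implementation: the function names, the `<BOS>`/`<EOS>` boundary markers, the `min_count` parameter, the use of a window of 2 (adjacent pairs), and the choice to estimate the two conditional probabilities from the counts of the bigrams that survive filtering are all assumptions made here.

```python
from collections import Counter

def first_features(pinyins):
    """First feature (claim 4): (current, previous, next) pinyin per position."""
    feats = []
    for i, cur in enumerate(pinyins):
        prev = pinyins[i - 1] if i > 0 else "<BOS>"  # assumed boundary marker
        nxt = pinyins[i + 1] if i + 1 < len(pinyins) else "<EOS>"  # assumed
        feats.append((cur, prev, nxt))
    return feats

def second_features(pinyins):
    """Second feature (claim 3): how often each pinyin occurs in the data."""
    counts = Counter(pinyins)
    return [counts[p] for p in pinyins]

def bigram_features(pinyins, min_count=1):
    """Third features (claims 5 and 6), one reading of the claim wording.

    Returns {(a, b): (p_prev, p_next)} where
      p_prev = P(previous pinyin is a | current pinyin is b)
      p_next = P(next pinyin is b | current pinyin is a)
    """
    bigram_counts = Counter(zip(pinyins, pinyins[1:]))  # window of 2 (assumed)
    # Claim 6: drop bigrams rarer than the preset frequency threshold.
    kept = {bg: c for bg, c in bigram_counts.items() if c >= min_count}
    as_first = Counter()   # occurrences of a pinyin as first element of a kept bigram
    as_second = Counter()  # occurrences of a pinyin as second element of a kept bigram
    for (a, b), c in kept.items():
        as_first[a] += c
        as_second[b] += c
    return {
        (a, b): (c / as_second[b], c / as_first[a])
        for (a, b), c in kept.items()
    }
```

For the pinyin sequence `["ni", "hao", "ni", "hao", "ma"]`, the bigram `("ni", "hao")` occurs twice and the other two bigrams once each, so raising `min_count` to 2 keeps only `("ni", "hao")`.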

Technical Field

The present application relates to data processing technologies, and in particular, to a method and an apparatus for detecting wrongly written characters, a computer storage medium, and an electronic device.

Background

With the popularization of smartphones and other mobile devices, people communicate mainly by typing in pinyin. Various accidental factors during typing, such as typing too fast, failing to find an uncommon character, or a slip of the finger, can introduce wrongly written characters into the communication. A human reader can recognize and correct wrongly written characters, but they cause great problems for machines. In a computer, characters are stored as sequences of 0s and 1s; different characters have different values, and these values are independent of one another and carry none of the relationships that exist between the characters themselves (such as identical pronunciation or similar glyph shape). Consequently, wrongly written characters need to be corrected when a computer performs natural language processing in human-computer interaction.

Current techniques identify wrongly written characters mainly by frequency statistics over large amounts of text combined with a dictionary. This approach is complicated and computationally slow, and its recognition of wrongly written characters has to be updated from time to time.

Disclosure of Invention

The embodiments of the present application provide a method and a device for detecting wrongly written characters, a computer storage medium and electronic equipment, so as to solve the above technical problems.

According to a first aspect of embodiments of the present application, there is provided a method for detecting a wrongly written word, including:

determining text data to be detected;

converting the text data into pinyin data;

generating a feature template of the pinyin data based on an ngram model;

inputting the feature template of the pinyin data into a pre-constructed wrongly written character detection model, wherein the wrongly written character detection model is obtained by training according to a conditional random field (CRF) model and a feature template based on an ngram model;

and determining whether the text data to be detected has wrongly written characters according to the output result of the wrongly written character detection model.
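The five steps of the first aspect can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions: the toy character-to-pinyin table, the simplified context feature, the `"OK"`/`"ERR"` label names, and the `model` callable standing in for the trained CRF are all hypothetical, since the source does not specify the pinyin converter or the model's output format.

```python
from typing import Callable, List, Sequence, Tuple

Feature = Tuple[str, str, str]  # (current, previous, next) pinyin

# Toy lookup table standing in for a real pinyin converter (assumption).
TOY_PINYIN = {"你": "ni", "好": "hao", "吗": "ma"}

def to_pinyin(text: str) -> List[str]:
    """Step 2: convert text to pinyin; unknown characters pass through unchanged."""
    return [TOY_PINYIN.get(ch, ch) for ch in text]

def build_features(pinyins: Sequence[str]) -> List[Feature]:
    """Step 3 (simplified): one context feature per pinyin."""
    padded = ["<BOS>", *pinyins, "<EOS>"]  # assumed boundary markers
    return [
        (padded[i], padded[i - 1], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]

def has_wrong_characters(
    text: str,
    model: Callable[[List[Feature]], List[str]],
) -> bool:
    """Steps 1, 4 and 5: run the model and inspect its per-position labels."""
    features = build_features(to_pinyin(text))
    labels = model(features)  # hypothetical stand-in for the trained CRF
    return "ERR" in labels
```

Passing the model in as a callable keeps the sketch independent of any particular CRF library; a trained model would replace the stub.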

According to a second aspect of embodiments of the present application, there is provided a wrongly written word detecting apparatus, including:

the data determining module is used for determining text data to be detected;

the pinyin conversion module is used for converting the text data into pinyin data;

the template generating module is used for generating a feature template of the pinyin data based on an ngram model;

the model detection module is used for inputting the feature template of the pinyin data into a pre-constructed wrongly written character detection model, wherein the wrongly written character detection model is obtained by training according to a conditional random field (CRF) model and a feature template based on an ngram model;

and the result determining module is used for determining, according to the output of the wrongly written character detection model, whether the text data to be detected contains wrongly written characters.

According to a third aspect of embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method as described above.

According to a fourth aspect of embodiments herein, there is provided an electronic device comprising one or more processors, and memory for storing one or more programs; the one or more programs, when executed by the one or more processors, implement the method as described above.

With the wrongly written character detection method and device, the computer storage medium and the electronic equipment provided by the embodiments of the present application, the text data to be detected is converted into pinyin, a feature template of the pinyin data is generated and input into the pre-constructed wrongly written character detection model, and the model output is used to determine whether the text data contains wrongly written characters.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

Fig. 1 is a schematic flow chart of an implementation of a method for detecting wrongly written characters according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a device for detecting wrongly written characters according to a second embodiment of the present application;

Fig. 3 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.

Detailed Description

In the process of implementing the present application, the inventors found that:

A Long Short-Term Memory (LSTM) neural network model could be considered for correcting wrongly written characters. Although this approach avoids the inconvenient updating required by frequency-based and dictionary-based methods, the strength of LSTM lies in long-text prediction, whereas wrongly written characters in a sentence are a local problem within the text, on which LSTM performs only moderately.

In view of the above problems, embodiments of the present application provide a method and an apparatus for detecting wrongly written characters, a computer storage medium, and an electronic device, in which a feature template for a CRF model is constructed from training samples, the CRF model is trained and its parameters are adjusted, and wrongly written characters are then recognized and corrected. Wrongly written characters can thus be corrected quickly and accurately, in a simple and fast manner.

The CRF model is the Conditional Random Field model. In mathematical terms: let X and Y be random variables and let P(Y | X) be the conditional probability distribution of Y given X; if the random variable Y constitutes a Markov random field, then the conditional probability distribution P(Y | X) is called a conditional random field.
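For reference, the special case most often used for sequence labelling is the linear-chain CRF. The source does not state which graph structure it uses, so the form below is an assumption; the feature functions f_k, the weights λ_k and the sequence length T are notation introduced here for illustration.

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

In this notation, training a CRF "by taking the feature template as a feature function" would correspond to deriving the f_k from the template and fitting the weights λ_k.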

The ngram model is a language model that uses collocation information between adjacent words in the context, for example to convert input automatically into Chinese characters. It assumes that the occurrence of the Nth word depends only on the preceding N-1 words and on no other words, so that the probability of a whole sentence is the product of the occurrence probabilities of its words.
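As a concrete instance of this assumption, the sketch below estimates a bigram model (N = 2) by simple counting and multiplies the conditional probabilities to score a sentence. It is an illustration only: the function names, the `<s>` start marker, the absence of an end marker, and the lack of smoothing are choices made here, not details from the source.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences."""
    context = Counter()  # how often each token appears as a left context
    bigram = Counter()   # how often each (previous, current) pair appears
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence)  # assumed start-of-sentence marker
        context.update(tokens[:-1])
        bigram.update(zip(tokens, tokens[1:]))
    return context, bigram

def sentence_probability(tokens, context, bigram):
    """Product of P(token | previous token), using unsmoothed relative frequencies."""
    prob = 1.0
    prev = "<s>"
    for tok in tokens:
        if context[prev] == 0:
            return 0.0  # context never seen in training
        prob *= bigram[(prev, tok)] / context[prev]
        prev = tok
    return prob
```

Without smoothing, any sentence containing an unseen pair gets probability zero, which is one reason practical systems usually add smoothing or back-off.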

The scheme in the embodiments of the present application can be implemented in various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.

In order to make the technical solutions and advantages of the embodiments of the present application clearer, the exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application and are not an exhaustive list of all embodiments. It should be noted that the embodiments and the features of the embodiments in the present application may be combined with each other provided there is no conflict.
