Text error correction method and device based on artificial intelligence, computer equipment and storage medium

Document No.: 1043310. Publication date: 2020-10-09.

Abstract: This technology, an artificial intelligence based text error correction method, apparatus, computer device and storage medium (一种基于人工智能的文本纠错方法、装置、计算机设备及存储介质), was created by Zheng Liying (郑立颖) and Xu Liang (徐亮) on 2020-06-28. The scheme relates to big data processing and provides a text error correction method based on artificial intelligence, comprising: acquiring historical official document data; performing new word discovery on the official document text; adding the new words to a dictionary library; determining candidate error words in the original text to be corrected; determining a homophone set; replacing each candidate error word with its homophones; and selecting the corrected text that exceeds a preset text smoothness as the final corrected text. Because official documents use their own terminology, new word discovery is performed on the official document text and the new words are added to the dictionary library, so that words specific to official documents supplement the dictionary library and the target dictionary library contains them. Candidate error words are then determined against this library, which avoids the problem of a general error correction model failing to recognize a specific term and changing correct content into an error. The invention also relates to blockchain technology: the target dictionary library is stored in a blockchain.

1. A text error correction method based on artificial intelligence, characterized by comprising the following steps:

acquiring historical official document data, wherein the historical official document data comprises official document texts;

carrying out new word discovery processing on the official document text to obtain new words;

adding the new words into an original dictionary library to obtain a target dictionary library added with the new words;

acquiring an original text to be corrected;

determining candidate error words in the original text to be corrected according to the original text to be corrected and the target dictionary library;

determining a homophone set of each candidate error word according to each candidate error word;

respectively replacing candidate error words of the original text to be corrected with corresponding homophones in the homophone set to obtain a corrected text set;

and selecting the corrected text which exceeds the preset text smoothness from the corrected text set as the final corrected text.

2. The artificial intelligence based text error correction method of claim 1, wherein the performing new word discovery processing on the official document text to obtain new words comprises:

sequentially splitting the characters of the official document text into n-tuples (n-grams), and taking the obtained n-tuples as a candidate phrase set;

performing word segmentation on the official document text with a word segmentation toolkit to obtain a word segmentation set corresponding to the official document text;

deleting the words of the word segmentation set corresponding to the official document text from the candidate phrase set to obtain a target candidate phrase set;

scoring the phrases of the target candidate phrase set according to the occurrence probability of each character in each phrase to obtain a score corresponding to each phrase in the target candidate phrase set;

sorting the scores corresponding to the phrases in the target candidate phrase set to obtain a sorting result;

and screening the phrases in the target candidate phrase set according to the sorting result and a preset threshold to obtain the new words.

3. The artificial intelligence based text error correction method according to claim 2, wherein obtaining the score corresponding to each phrase in the target candidate phrase set according to the occurrence probability of each character in each phrase comprises:

sequentially disassembling phrases of the target candidate phrase set into a first character and a second character;

acquiring the occurrence probability of the first character, the occurrence probability of the second character and the occurrence probability of the phrase;

acquiring the information entropy at the left side of the phrase and the information entropy at the right side of the phrase;

and aiming at each phrase in the target candidate phrase set, obtaining a score corresponding to each phrase according to the probability of the first character, the probability of the second character, the probability of the phrase, the information entropy on the left side of the phrase and the information entropy on the right side of the phrase corresponding to each phrase.

4. The artificial intelligence based text error correction method of claim 3, wherein when the phrases in the target candidate phrase set are binary groups, the sequentially breaking the phrases of the target candidate phrase set into a first character and a second character comprises:

sequentially splitting the binary group into a first character and a second character in sequence, wherein the first character and the second character are single characters;

the obtaining, for each phrase in the target candidate phrase set, a score corresponding to each phrase according to the probability of occurrence of the first character, the probability of occurrence of the second character, the probability of occurrence of the phrase, the information entropy on the left side of the phrase, and the information entropy on the right side of the phrase corresponding to each phrase respectively includes:

obtaining a score corresponding to each phrase by using a score calculation formula (the formula image is not reproduced in this text).

5. The artificial intelligence based text error correction method of claim 3, wherein when the phrases in the target candidate phrase set are triples, the sequentially breaking the phrases of the target candidate phrase set into a first character and a second character comprises:

splitting each triple in sequence into a first character and a second character, wherein the first character is a two-character string and the second character is a single character;

obtaining a score corresponding to each phrase by using a score calculation formula (the formula image is not reproduced in this text);

wherein p(x, y) is the probability of the first character x and the second character y appearing together, p(x) is the probability of the first character x appearing, p(y) is the probability of the second character y appearing, LE is the information entropy on the left side of the phrase, and RE is the information entropy on the right side of the phrase.

6. The artificial intelligence based text error correction method of claim 1, wherein the determining candidate error words in the original text to be corrected according to the original text to be corrected and the target dictionary database comprises:

performing word segmentation on the original text to be corrected with a word segmentation tool to obtain the segmented words of the original text to be corrected;

analyzing the segmented words of the original text to be corrected with a statistical language analysis toolkit to obtain an analysis result indicating whether each segmented word exists in an existing dictionary library;

if a segmented word of the original text to be corrected does not exist in the existing dictionary library, judging whether that segmented word exists in the target dictionary library;

and if the segmented word does not exist in the target dictionary library, determining the segmented word as a candidate error word.

7. The artificial intelligence based text error correction method of claim 1, wherein the selecting the corrected text that exceeds the preset text smoothness from the corrected text set as the final corrected text comprises:

calculating, with a Bayesian formula in a statistical language model, a sentence smoothness score for each corrected text in the corrected text set after homophone replacement, wherein the Bayesian formula is as follows:

p(w1 w2 ... wn) = p(w1) * p(w2|w1) * p(w3|w1 w2) * ... * p(wn|w1 w2 ... wn-1);

wherein p(w1 w2 ... wn) is the sentence smoothness score of the corrected text after homophone replacement; w1 is the first word in the corrected text; p(w1) is the probability of the first word in the corrected text; wn is the nth word in the corrected text; p(wn) is the probability of the nth word in the corrected text; and p(wn|w1 w2 ... wn-1) is the conditional probability that the word wn occurs given the preceding words w1 w2 ... wn-1;

and selecting the corrected text exceeding the preset text smoothness as the final corrected text.

8. An artificial intelligence based text error correction apparatus, comprising:

the first acquisition module is used for acquiring historical official document data, wherein the historical official document data comprises official document texts;

the new word discovery module is used for carrying out new word discovery processing on the official document text to obtain new words;

the new word adding module is used for adding the new words into the original dictionary base to obtain a target dictionary base added with the new words;

the second acquisition module is used for acquiring the original text to be corrected;

the first determining module is used for determining candidate error words in the original text to be corrected according to the original text to be corrected and the target dictionary library;

the second determining module is used for determining a homophone set of each candidate error word according to each candidate error word;

the replacing module is used for respectively replacing the candidate error words of the original text to be corrected with the corresponding homophones in the homophone set to obtain a corrected text set;

and the selecting module is used for selecting the corrected text which exceeds the preset text smoothness from the corrected text set as the final corrected text.

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the artificial intelligence based text correction method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the artificial intelligence based text correction method according to any one of claims 1 to 7.

Technical Field

The present invention relates to big data processing, and more particularly, to a text error correction method and apparatus based on artificial intelligence, a computer device, and a storage medium.

Background

Text error correction is one of the challenges in natural language processing. Text errors mainly include word-usage errors, substitution errors, grammatical errors, wrongly written characters, redundant-word errors, missing-word errors and the like. Similar-word substitution errors are widespread in text data; for example, "short board" is wrongly written as "short class", or "aid decision" is wrongly written with a similar word. Wrongly written words usually lead directly to word segmentation errors, and word segmentation errors confuse the semantics of the text, which makes text processing difficult. Text error correction has many application scenarios, including input method error correction, ASR (speech-to-text) error correction, and error correction in official document writing.

Existing error correction methods use end-to-end deep learning to perform error detection and error correction simultaneously and output the corrected sentence. Such methods place high demands on the training data set: a usable text error correction model can be trained only after a large corpus of annotated errors has been collected in advance. In special scenarios such as official documents, which contain many scenario-specific terms, word segmentation tools tend to segment these terms incorrectly or fail to recognize them, so a general error correction model is prone to missing a specific term and changing correct content into an error.

Disclosure of Invention

The invention provides a text error correction method and device based on artificial intelligence, computer equipment and a storage medium, which aim to solve the problem that a general error correction model fails to recognize scenario-specific terms and changes correct content into errors.

A text error correction method based on artificial intelligence comprises the following steps:

acquiring historical official document data, wherein the historical official document data comprises official document texts;

carrying out new word discovery processing on the official document text to obtain new words;

adding the new words into an original dictionary library to obtain a target dictionary library added with the new words;

acquiring an original text to be corrected;

determining candidate error words in the original text to be corrected according to the original text to be corrected and the target dictionary library;

determining a homophone set of each candidate error word according to each candidate error word;

respectively replacing candidate error words of the original text to be corrected with corresponding homophones in the homophone set to obtain a corrected text set;

and selecting the corrected text which exceeds the preset text smoothness from the corrected text set as the final corrected text.

An artificial intelligence based text correction apparatus comprising:

the first acquisition module is used for acquiring historical official document data, wherein the historical official document data comprises official document texts;

the new word discovery module is used for carrying out new word discovery processing on the official document text to obtain new words;

the new word adding module is used for adding the new words into the original dictionary base to obtain a target dictionary base added with the new words;

the second acquisition module is used for acquiring the original text to be corrected;

the first determining module is used for determining candidate error words in the original text to be corrected according to the original text to be corrected and the target dictionary library;

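the second determining module is used for determining a homophone set of each candidate error word according to each candidate error word;

the replacing module is used for respectively replacing the candidate error words of the original text to be corrected with the corresponding homophones in the homophone set to obtain a corrected text set;

and the selecting module is used for selecting the corrected text which exceeds the preset text smoothness from the corrected text set as the final corrected text.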

A computer device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the steps of the artificial intelligence based text correction method described above when executing said computer program.

A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the artificial intelligence based text error correction method described above.

In one scheme implemented by the above text error correction method based on artificial intelligence, device, computer equipment and storage medium, the terminology specific to the official document scenario is taken into account: historical official document data is acquired, new word discovery is performed on the official document text, and the new words are added to the dictionary library. Words from the official document scenario are thereby mined as a supplement to the dictionary library, and the resulting target dictionary library contains the new words for scenario-specific terms. Candidate error words of the original text to be corrected are then determined according to the original text to be corrected and the target dictionary library, which avoids the problem of a general error correction model failing to recognize a specific term and changing correct content into an error. Finally, the candidate error words are replaced with homophones, and the corrected text exceeding the preset text smoothness is selected from the corrected text set as the final corrected text. Screening the final result by its smoothness score improves the accuracy of text error correction based on artificial intelligence.

Drawings

To illustrate the technical solution of the present invention more clearly, the drawings used in the description are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of an artificial intelligence based text correction method in one embodiment of the present invention;

FIG. 2 is another flow chart of a method for artificial intelligence based text correction in an embodiment of the present invention;

FIG. 3 is another flow chart of a method for artificial intelligence based text correction in an embodiment of the present invention;

FIG. 4 is another flow chart of a method for artificial intelligence based text correction in an embodiment of the present invention;

FIG. 5 is a schematic block diagram of an artificial intelligence based text correction apparatus according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a computer device according to an embodiment of the invention.

Detailed Description

The technical solutions in the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In an embodiment, as shown in fig. 1, a text error correction method based on artificial intelligence is provided, which is described by taking the method as an example applied to a server, and includes the following steps:

S10: acquiring historical official document data, wherein the historical official document data comprises official document texts.

Understandably, historical official document data from an official document writing scenario is acquired; the official document text refers to the sentences, paragraphs, words and the like in that scenario. It is emphasized that the artificial intelligence based text error correction method of the present invention is a method for automatically recognizing and correcting errors that occur in the use of natural language.

S20: performing new word discovery processing on the official document text to obtain new words.

Understandably, the characters of the official document text are split in sequence into n-tuples (n-grams), and the obtained n-tuples are used as a candidate phrase set. The official document text is segmented with a word segmentation toolkit to obtain the corresponding word segmentation set, and the words of that set are deleted from the candidate phrase set to obtain a target candidate phrase set. Each phrase in the target candidate phrase set is scored according to the occurrence probability of each character in the phrase. The scores are sorted to obtain a sorting result, and the phrases are screened according to the sorting result and a preset threshold to obtain the new words.
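The scoring step can be illustrated with a minimal sketch. The score formula itself is not reproduced in this text, so the combination used here, pointwise mutual information plus the smaller of the left and right information entropies, is only an assumption built from the quantities the claims name (p(x, y), p(x), p(y), LE, RE). The function names and the toy string are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbouring characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def score_bigrams(text):
    """Assumed score per character pair: PMI(x, y) + min(LE, RE)."""
    n = len(text)
    char_counts = Counter(text)
    pair_counts = Counter(text[i:i + 2] for i in range(n - 1))
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(n - 1):
        pair = text[i:i + 2]
        if i > 0:
            left[pair][text[i - 1]] += 1   # character to the left of the pair
        if i + 2 < n:
            right[pair][text[i + 2]] += 1  # character to the right of the pair
    scores = {}
    for pair, count in pair_counts.items():
        p_xy = count / (n - 1)
        p_x = char_counts[pair[0]] / n
        p_y = char_counts[pair[1]] / n
        pmi = math.log(p_xy / (p_x * p_y))
        scores[pair] = pmi + min(entropy(left[pair]), entropy(right[pair]))
    return scores


# Toy text: "ab" recurs with varied neighbours on both sides.
scores = score_bigrams("cabdeabfgabh")
```

In this toy string the pair "ab" scores higher than "ca" or "bd" because its neighbours vary on both sides, which is the intuition behind using the left and right entropies.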

S30: adding the new words into the original dictionary library to obtain a target dictionary library added with the new words.

Illustratively, the original dictionary library is the jieba dictionary library. jieba lets a developer specify a custom dictionary so that words not included in the jieba dictionary library can be added. Understandably, the new words are added to the jieba dictionary library to obtain the target dictionary library, which is later used to determine the candidate error words of the original text to be corrected. It is emphasized that, to further ensure the privacy and security of the target dictionary library, the target dictionary library may also be stored in a node of a blockchain.

S40: acquiring the original text to be corrected.

Understandably, the original text that may require correction is acquired.

S50: determining candidate error words in the original text to be corrected according to the original text to be corrected and the target dictionary library.

In an embodiment, as shown in fig. 4, in step S50, that is, determining candidate error words in the original text to be corrected according to the original text to be corrected and the target dictionary library, the method specifically includes the following steps:

S51: performing word segmentation on the original text to be corrected with a word segmentation tool to obtain the segmented words of the original text to be corrected.

Illustratively, the jieba toolkit is used to segment the original text to be corrected. Understandably, the jieba toolkit combines rule-based and statistics-based approaches. First, a word graph scan is performed based on a prefix dictionary. In a prefix dictionary, the words are arranged by prefix inclusion: for example, if "buy" appears in the dictionary, then the words beginning with "buy", such as "buy water" and further "buy fruit", are grouped under it, forming a hierarchical inclusion structure. If words are regarded as nodes and the split points between words as edges, each segmentation scheme corresponds to a path from the first character to the last character, and all possible segmentation results together form a directed acyclic graph.
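The prefix dictionary and the directed acyclic graph can be sketched as follows. This is a simplified illustration of the idea, not jieba's actual code; the toy dictionary, its frequencies and the function names are hypothetical.

```python
def build_prefix_dict(word_freqs):
    """Map every word to its frequency and every proper prefix to 0 if unseen."""
    prefix = {}
    for word, freq in word_freqs.items():
        prefix[word] = freq
        for i in range(1, len(word)):
            prefix.setdefault(word[:i], 0)  # prefix only, not itself a word
    return prefix


def build_dag(sentence, prefix):
    """For each start index, list the end indices of dictionary words found there."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = []
        end = start
        fragment = sentence[start]
        while end < n and fragment in prefix:
            if prefix[fragment] > 0:
                ends.append(end)
            end += 1
            fragment = sentence[start:end + 1]
        if not ends:
            ends.append(start)  # unknown character: fall back to a single character
        dag[start] = ends
    return dag


# Hypothetical toy dictionary with made-up frequencies.
toy_words = {"买": 5, "买水果": 2, "水": 3, "水果": 4, "果": 1}
dag = build_dag("买水果", build_prefix_dict(toy_words))
```

Each entry `dag[i]` lists where a word starting at position `i` may end, so every path from position 0 to the end of the sentence is one candidate segmentation.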

S52: analyzing the segmented words of the original text to be corrected with a statistical language analysis toolkit to obtain an analysis result indicating whether each segmented word exists in the existing dictionary library.

Illustratively, the statistical language analysis toolkit can be the kenlm tool. The statistical language model trained with the kenlm tool is a statistical language model trained on national literature; the kenlm tool trains quickly and supports training on big data on a single machine. Common words are extracted from the People's Daily news corpus based on the statistical language model to serve as the existing dictionary library. It is then judged whether each segmented word of the original text to be corrected exists in the existing dictionary library, which gives the analysis result.

Understandably, if the analysis result indicates that a segmented word of the original text to be corrected does not exist in the existing dictionary library, it is judged whether that segmented word exists in the target dictionary library; if the analysis result indicates that the segmented word exists in the existing dictionary library, the segmented word does not need to be corrected.

S53: if the analysis result indicates that a segmented word of the original text to be corrected does not exist in the existing dictionary library, judging whether that segmented word exists in the target dictionary library.

S54: if the segmented word does not exist in the target dictionary library, determining the segmented word as a candidate error word.

Understandably, it is judged whether the segmented word of the original text to be corrected exists in the target dictionary library. If it does not, the segmented word is determined as a candidate error word; if it does, the segmented word does not need to be corrected.
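Steps S52 to S54 amount to a two-stage dictionary lookup, sketched below. The two dictionaries are modelled as plain Python sets, and the token list and names are hypothetical; how the existing dictionary library is actually built with the language model is outside this sketch.

```python
def find_candidate_errors(tokens, existing_dict, target_dict):
    """Return tokens found in neither the existing nor the target dictionary."""
    candidates = []
    for token in tokens:
        if token in existing_dict:
            continue  # S52: known common word, no correction needed
        if token in target_dict:
            continue  # S53: known scenario-specific term, no correction needed
        candidates.append(token)  # S54: unknown to both, flag as candidate error
    return candidates


# Hypothetical toy data: "term_x" stands for a term added by new word discovery.
existing_dict = {"we", "adopt", "the"}
target_dict = existing_dict | {"term_x"}
flagged = find_candidate_errors(
    ["we", "adopt", "the", "term_x", "tern_y"], existing_dict, target_dict
)
```

Checking the target dictionary library second is what keeps a scenario-specific term from being flagged merely because the general dictionary does not contain it.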

In the embodiment corresponding to fig. 4, considering that wrongly written words are relatively rare in the official document scenario, a statistical language analysis toolkit is used to analyze the segmented words of the original text to be corrected, thereby achieving unsupervised recognition of wrongly written words among them.

S60: determining the homophone set of each candidate error word according to each candidate error word.

Illustratively, Python provides a library named PyPinyin for converting Chinese characters into pinyin, which can be used for phonetic annotation, sorting, retrieval and similar tasks on Chinese characters. In this step it can supply the pinyin by which the homophones of each candidate error word are looked up.
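A homophone lookup can be sketched by indexing dictionary words by their pinyin. To stay self-contained, the sketch uses a small hand-written pinyin table, which is hypothetical; in practice the readings could come from a pinyin library, for instance pypinyin's `lazy_pinyin`. The example words and function names are also assumptions.

```python
from collections import defaultdict

# Hypothetical toy pinyin table (toneless readings).
TOY_PINYIN = {"短": "duan", "板": "ban", "班": "ban", "版": "ban"}


def pinyin_key(word, table=TOY_PINYIN):
    """Toneless pinyin of each character, used as a lookup key."""
    return tuple(table[ch] for ch in word)


def build_homophone_index(dictionary):
    """Group dictionary words by their pinyin key."""
    index = defaultdict(set)
    for word in dictionary:
        index[pinyin_key(word)].add(word)
    return index


def homophones(word, index):
    """Dictionary words that sound like `word`, excluding `word` itself."""
    return index.get(pinyin_key(word), set()) - {word}


index = build_homophone_index({"短板", "短版"})
homophone_set = homophones("短班", index)
```

Ignoring tones widens the homophone set; whether tones should be matched is a design choice the text does not specify.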

S70: respectively replacing the candidate error words of the original text to be corrected with the corresponding homophones in the homophone set to obtain a corrected text set.

Understandably, a wrongly written homophone appears in the same context as the correct word it stands for, so each candidate error word is replaced in turn by each of its homophones to obtain the corrected text set.
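The replacement step can be sketched as generating one text per combination of homophones. The function name and toy sentence are hypothetical, and replacing every occurrence of a candidate error word is an assumption of this sketch.

```python
from itertools import product


def build_corrected_texts(text, homophones_by_candidate):
    """Return one corrected text per combination of homophone replacements."""
    candidates = list(homophones_by_candidate)
    options = [sorted(homophones_by_candidate[c]) for c in candidates]
    results = []
    for combo in product(*options):
        corrected = text
        for candidate, replacement in zip(candidates, combo):
            corrected = corrected.replace(candidate, replacement)
        results.append(corrected)
    return results


# Hypothetical example: one candidate error word with two homophones.
corrected_set = build_corrected_texts("补齐短班", {"短班": {"短板", "短版"}})
```

The number of generated texts is the product of the homophone set sizes, so it grows quickly when a sentence contains several candidate error words.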

S80: selecting the corrected text that exceeds the preset text smoothness from the corrected text set as the final corrected text.

In an embodiment, step S80, that is, selecting the corrected text that exceeds the preset text smoothness from the corrected text set as the final corrected text, specifically includes the following steps:

S81: calculating, with a Bayesian formula (the chain rule of conditional probability) in a statistical language model, the sentence smoothness score of each corrected text in the corrected text set after homophone replacement, wherein the Bayesian formula is as follows:

p(w1 w2 ... wn) = p(w1) * p(w2|w1) * p(w3|w1 w2) * ... * p(wn|w1 w2 ... wn-1);

Illustratively, a statistical language model trained with the kenlm tool is used to calculate the sentence smoothness score after the candidate error word is replaced by each homophone, giving one score per corrected text; the corrected text exceeding the preset text smoothness is selected as the final corrected text. Understandably, the preset text smoothness is a preset numerical value, for example 0.5, 0.6 or 0.7.

Understandably, the following Bayesian formula in the statistical language model is used for the calculation:

p(w1 w2 ... wn) = p(w1) * p(w2|w1) * p(w3|w1 w2) * ... * p(wn|w1 w2 ... wn-1)

wherein p(w1 w2 ... wn) is the sentence smoothness score of the corrected text after homophone replacement; w1 is the first word; p(w1) is the probability of the first word; wn is the nth word; p(wn) is the probability of the nth word; and p(wn|w1 w2 ... wn-1) is the conditional probability that the word wn occurs given the preceding words w1 w2 ... wn-1.
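The product of conditional probabilities can be illustrated with a small count-based model. This sketch assumes a first-order (bigram) approximation, in which each word is conditioned only on the word before it, and unsmoothed maximum-likelihood estimates; the text prescribes neither. The toy corpus and names are hypothetical.

```python
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams, with "<s>" marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def sentence_probability(sentence, unigrams, bigrams):
    """p(w1..wn) approximated as the product of p(wi | wi-1)."""
    prob = 1.0
    prev = "<s>"
    for word in sentence:
        if unigrams[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigrams[(prev, word)] / unigrams[prev]
        prev = word
    return prob


# Hypothetical toy corpus of already segmented sentences.
unigrams, bigrams = train_bigram([["a", "b", "c"], ["a", "b", "d"]])
score = sentence_probability(["a", "b", "c"], unigrams, bigrams)
```

Here p(a) = 1, p(b|a) = 1 and p(c|b) = 1/2, so the sentence probability is 0.5. A practical model would add smoothing so that one unseen word pair does not force the score to zero.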

Meanwhile, the probability distribution calculated by the language model differs from the "ideal" probability distribution, so how close the model comes to the ideal distribution needs to be evaluated. A common measure for evaluating a language model is perplexity, which is also rendered as complexity or degree of confusion.
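Perplexity is commonly defined as the inverse of the sentence probability, normalized by the number of words: PP = p(w1 w2 ... wn)^(-1/n). A minimal sketch computes it from per-word conditional probabilities in log space; the function name and the input format are assumptions.

```python
import math


def perplexity(word_probs):
    """Perplexity of a sentence given each word's conditional probability."""
    n = len(word_probs)
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / n)


# Four words, each predicted with probability 0.5.
pp = perplexity([0.5, 0.5, 0.5, 0.5])
```

A model that assigns every word probability 0.5 has perplexity 2, as if it were choosing between two equally likely words at each step; lower perplexity means the model fits the text better.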

S82: selecting the corrected text exceeding the preset text smoothness as the final corrected text.

Understandably, the preset text smoothness is the degree of fluency that is considered reasonable for the text; it may be 0.6, 0.7, 0.8 and so on, and the invention does not limit the specific value. For example, the candidate error word is "local fan", its homophones are "local fan", "local meter" and "document fan", and the corrected texts obtained by replacing it with these homophones are "in the era of local fan today", "in the era of local meter today" and "in the era of document fan today". Their sentence smoothness scores are 0.9, 0.5 and 0.6 respectively. With a preset text smoothness of 0.8, the corrected text scoring 0.9, which is the only one exceeding 0.8, is selected as the final corrected text.
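The selection step is a threshold filter, sketched here with the scores from the example (0.9, 0.5, 0.6) and the preset smoothness 0.8. The text labels and the function name are hypothetical, and ordering the survivors by score is an assumption for the case where several texts pass.

```python
def select_corrected(scored_texts, threshold):
    """Keep texts whose smoothness exceeds the threshold, best first."""
    passing = [(text, s) for text, s in scored_texts.items() if s > threshold]
    return sorted(passing, key=lambda item: item[1], reverse=True)


# Placeholder labels stand in for the three corrected texts of the example.
scored = {"corrected text 1": 0.9, "corrected text 2": 0.5, "corrected text 3": 0.6}
final = select_corrected(scored, 0.8)
```

If no corrected text exceeds the threshold the result is empty, in which case a caller might keep the original text unchanged.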

In this embodiment, a Bayesian formula in the statistical language model is used to calculate the sentence smoothness score of the corrected text, which improves the accuracy of the smoothness score and, in turn, the accuracy with which the corrected text is selected.

In the embodiment corresponding to fig. 1, the terminology specific to the official document scenario is taken into account: historical official document data is acquired, new word discovery is performed on the official document text, and the new words are added to the dictionary library. Words from the official document scenario are thereby mined as a supplement to the dictionary library, and the resulting target dictionary library contains the new words for scenario-specific terms. Candidate error words of the original text to be corrected are determined according to the original text to be corrected and the target dictionary library, which avoids the problem of a general error correction model failing to recognize a specific term and changing correct content into an error. The candidate error words are then replaced with homophones, and the corrected text exceeding the preset text smoothness is selected from the corrected text set as the final corrected text. Screening the final result by its smoothness score improves the accuracy of text error correction based on artificial intelligence.

In an embodiment, as shown in fig. 2, step S20, that is, performing new word discovery processing on the official document text to obtain new words, specifically includes the following steps:

S21: Split the characters of the official document text in order into multi-element groups (n-tuples), and take the resulting multi-element groups as the candidate phrase set.

Illustratively, the characters of the official document text are split in order into binary groups and triples, and the resulting binary groups and triples are taken as the candidate phrase set. For example, the four characters "新", "词", "发", "现" of the text "新词发现" ("new word discovery") are split in order into the binary groups "新词", "词发", "发现" and the triples "新词发", "词发现", and these binary groups and triples together form the candidate phrase set.

In this embodiment, the characters of the official document text are split in order into binary groups and triples, and the resulting binary groups and triples are taken as the candidate phrase set, so that every character sequence in the official document text that could be a new word is extracted as a separate candidate phrase.
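Step S21 amounts to sliding a window of two and then three characters over the text. A minimal sketch, assuming the text is the Chinese phrase "新词发现" ("new word discovery") used in the example; the function names are illustrative and not part of the scheme:

```python
def ngrams(text, n):
    """Return every run of n consecutive characters, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def candidate_phrases(text, sizes=(2, 3)):
    """Collect binary groups first, then triples, as the candidate phrase set."""
    phrases = []
    for n in sizes:
        phrases.extend(ngrams(text, n))
    return phrases


candidates = candidate_phrases("新词发现")
print(candidates)  # ['新词', '词发', '发现', '新词发', '词发现']
```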

S22: Perform word segmentation on the official document text with a word segmentation toolkit to obtain the word segmentation set corresponding to the official document text.

Illustratively, the word segmentation toolkit may be the jieba toolkit. For example, segmenting the official document text "新词发现" ("new word discovery") with the jieba toolkit yields the word segmentation set "新词" ("new word") and "发现" ("discovery").

S23: Delete the words of the word segmentation set corresponding to the official document text from the candidate phrase set to obtain the target candidate phrase set.

For example, for the text "新词发现" ("new word discovery"), the words "新词" and "发现" of the word segmentation set are deleted from the candidate phrase set, which consists of the binary groups "新词", "词发", "发现" and the triples "新词发", "词发现"; the resulting target candidate phrase set consists of the binary group "词发" and the triples "新词发", "词发现".
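Steps S22 and S23 can be sketched as a simple filter. To keep the sketch self-contained, the segmentation result is written out by hand as stated in the example for the text "新词发现" ("new word discovery"); in practice it would come from a segmentation toolkit such as jieba. The function name is an assumption.

```python
def target_candidates(candidates, segmented_words):
    """Drop every candidate phrase that the segmenter already recognizes as a word."""
    known = set(segmented_words)
    return [phrase for phrase in candidates if phrase not in known]


candidates = ["新词", "词发", "发现", "新词发", "词发现"]
segmented = ["新词", "发现"]  # assumed segmenter output, as given in the example
targets = target_candidates(candidates, segmented)
print(targets)  # ['词发', '新词发', '词发现']
```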

S24: For the phrases of the target candidate phrase set, calculate a score corresponding to each phrase according to the probability of occurrence of each word in the phrase.

In an embodiment, as shown in fig. 3, step S24, that is, calculating a score corresponding to each phrase of the target candidate phrase set according to the probability of occurrence of each word in the phrase, specifically includes the following steps:

S241: Split each phrase of the target candidate phrase set in order into a first character and a second character.

Understandably, taking the text "新词发现" ("new word discovery") as an example, if the phrase in the target candidate phrase set is the binary group "词发", it is split in order into the first character "词" and the second character "发"; if the phrase is the triple "新词发", it is split either into the first character "新词" and the second character "发", or into the first character "词发" and the second character "新".

S242: Acquire the probability of occurrence of the first character, the probability of occurrence of the second character, and the probability of occurrence of the phrase of the target candidate phrase set.

S243: Acquire the information entropy on the left side of the phrase of the target candidate phrase set and the information entropy on the right side of the phrase of the target candidate phrase set.

S244: For each phrase in the target candidate phrase set, obtain the score corresponding to the phrase according to the probability of occurrence of the first character, the probability of occurrence of the second character, the probability of occurrence of the phrase, the information entropy on the left side of the phrase and the information entropy on the right side of the phrase.

In an embodiment, when the phrases in the target candidate phrase set are binary groups, splitting the phrases of the target candidate phrase set in order into a first character and a second character includes the following step:

splitting the binary group in order into a first character and a second character, wherein the first character and the second character are both single characters;

step S244, that is, obtaining, for each phrase in the target candidate phrase set, the score corresponding to the phrase according to the probability of occurrence of the first character, the probability of occurrence of the second character, the probability of occurrence of the phrase, the information entropy on the left side of the phrase and the information entropy on the right side of the phrase, specifically includes the following step:

obtaining, by a score calculation formula, the score corresponding to each phrase in the binary groups according to the probability of occurrence of the first character of the phrase, the probability of occurrence of the second character of the phrase, the probability of occurrence of the phrase, the information entropy on the left side of the phrase and the information entropy on the right side of the phrase.

Understandably, the information entropy of each word is calculated and used as the weight of the word. The information entropy is given by $H(w) = -\sum p \log p$, where $w$ is the word and $p$ is the proportion of the occurrences of $w$ in which a given distinct word appears on its left (or right) side. For example, if the sequence A, W, C appears twice in an article and the sequence B, W, D appears once, the left information entropy of W is $-\frac{2}{3}\log\frac{2}{3} - \frac{1}{3}\log\frac{1}{3}$: of the 3 occurrences of W, A appears on its left 2 times, giving 2/3, and B appears only once, giving 1/3. The right information entropy of W is calculated in the same way and takes the same value, since C appears 2 times and D once. If both the left and right information entropies of a word are large, the word is likely to be a keyword.
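The A, W, C / B, W, D example can be checked numerically. The sketch below assumes the standard Shannon entropy with the natural logarithm (the base is not stated in the text) and represents the article as a flat token list; the function names are illustrative.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Shannon entropy of the distribution of neighboring tokens."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def left_right_entropy(tokens, word):
    """Entropy of the tokens seen immediately to the left and to the right of `word`."""
    left = [tokens[i - 1] for i, t in enumerate(tokens) if t == word and i > 0]
    right = [tokens[i + 1] for i, t in enumerate(tokens) if t == word and i < len(tokens) - 1]
    return neighbor_entropy(left), neighbor_entropy(right)


# "A, W, C" appears twice and "B, W, D" once, as in the example.
tokens = ["A", "W", "C", "A", "W", "C", "B", "W", "D"]
h_left, h_right = left_right_entropy(tokens, "W")
print(round(h_left, 4), round(h_right, 4))  # 0.6365 0.6365
```

With a different logarithm base the values scale by a constant factor, so the ranking of candidate phrases is unaffected.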

In this embodiment, a score calculation formula is used to obtain the score corresponding to each phrase in the binary groups from the probability of occurrence of the first character of the phrase, the probability of occurrence of the second character of the phrase, the probability of occurrence of the phrase, and the information entropies on the left and right sides of the phrase. Calculating the score from these parameters improves the accuracy of the score corresponding to each phrase, and the score of each phrase determines how likely the phrase is to be a new word.
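The score calculation formula itself is not reproduced in this text. Purely as a hypothetical stand-in, the sketch below combines the five inputs in a way often used for new word discovery: pointwise mutual information between the two parts plus the smaller of the two boundary entropies. The formula, the function name and the sample probabilities are all assumptions, not the formula of this embodiment.

```python
import math


def phrase_score(p_first, p_second, p_phrase, h_left, h_right):
    """Hypothetical score: cohesion of the two parts (PMI) plus boundary freedom (min entropy)."""
    pmi = math.log(p_phrase / (p_first * p_second))
    return pmi + min(h_left, h_right)


# Sample values (assumed): the phrase occurs 10 times more often than chance would predict.
score = phrase_score(p_first=0.02, p_second=0.03, p_phrase=0.006, h_left=0.64, h_right=0.90)
print(round(score, 3))
```

A phrase whose parts co-occur no more often than chance has a mutual information near zero, so its score is driven only by the boundary entropy.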

In an embodiment, when the phrases in the target candidate phrase set are triples, splitting the phrases of the target candidate phrase set in order into a first character and a second character includes:

splitting the triple in order into a first character and a second character, wherein the first character is a double character and the second character is a single character;

the method further comprises the following steps:

obtaining, by a score calculation formula, a plurality of scores corresponding to each phrase of the triples according to the probability of occurrence of the first character of the phrase, the probability of occurrence of the second character of the phrase, the probability of occurrence of the phrase, the information entropy on the left side of the phrase and the information entropy on the right side of the phrase;

Understandably, a double character is a string consisting of two single characters, and a single character consists of only one character. Taking the text "新词发现" ("new word discovery") as an example, if the phrase in the target candidate phrase set is the triple "新词发", it is split either into the first character "新词" and the second character "发", or into the first character "词发" and the second character "新". If "新词发" is split into the first character "新词" and the second character "发", the obtained score is 0.3; if it is split into the first character "词发" and the second character "新", the obtained score is 0.5.
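The two ways of splitting a triple, and the resulting plurality of scores, can be sketched as follows. The function names are illustrative, and the scoring function is a stand-in that simply looks up the example values 0.3 and 0.5 quoted above; how the several scores of one triple are later combined is not stated here and is not assumed.

```python
def trigram_splits(phrase):
    """Return both (double character, single character) splits of a three-character phrase."""
    if len(phrase) != 3:
        raise ValueError("expected a three-character phrase")
    return [(phrase[:2], phrase[2]), (phrase[1:], phrase[0])]


def split_scores(phrase, score_fn):
    """Score every split of the phrase with the supplied scoring function."""
    return {split: score_fn(*split) for split in trigram_splits(phrase)}


# Stand-in scorer: looks up the scores quoted in the example.
example_scores = {("新词", "发"): 0.3, ("词发", "新"): 0.5}
scores = split_scores("新词发", lambda first, second: example_scores[(first, second)])
print(scores)
```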

In this embodiment, a score calculation formula is used to obtain a plurality of scores corresponding to each phrase of the triples from the probability of occurrence of the first character of the phrase, the probability of occurrence of the second character of the phrase, the probability of occurrence of the phrase, and the information entropies on the left and right sides of the phrase. Calculating the scores from these parameters improves the accuracy of the scores corresponding to the phrases of the triples, and the score of each phrase determines how likely the phrase is to be a new word.

In the embodiment corresponding to fig. 3, for each phrase in the target candidate phrase set, the score corresponding to the phrase is obtained from the probability of occurrence of the first character, the probability of occurrence of the second character, the probability of occurrence of the phrase, and the information entropies on the left and right sides of the phrase. Calculating the score from these parameters improves the accuracy of the score corresponding to each phrase, and the score of each phrase determines how likely the phrase is to be a new word.

S25: Sort the scores corresponding to the phrases in the target candidate phrase set to obtain a sorting result.

Understandably, the phrases in the target candidate phrase set are sorted by score to obtain the sorting result.

S26: Screen the phrases in the target candidate phrase set according to the sorting result and a preset threshold to obtain the new words.

Illustratively, the preset threshold is a value set in advance; for example, it may be 0.6, 0.7, 0.8, and the like. Phrases whose scores are smaller than the preset threshold are removed, and the phrases in the target candidate phrase set whose scores exceed the preset threshold are selected as new words.
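Steps S25 and S26 can be sketched as a sort followed by a threshold filter. The phrase labels and their scores below are placeholders chosen for illustration, and the function name is an assumption; the threshold 0.6 is one of the example values mentioned above.

```python
def select_new_words(scores, threshold):
    """Rank phrases by score (highest first) and keep those above the preset threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [phrase for phrase, score in ranked if score > threshold]


# Placeholder phrases and scores (assumed).
scores = {"phrase_c": 0.4, "phrase_a": 0.9, "phrase_b": 0.65}
new_words = select_new_words(scores, threshold=0.6)
print(new_words)  # ['phrase_a', 'phrase_b']
```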

In the embodiment corresponding to fig. 2, new words are discovered in the official document scene as follows: the characters of the official document text are split in order into multi-element groups, and the resulting multi-element groups are taken as the candidate phrase set; the words of the word segmentation set obtained with the word segmentation toolkit are then removed to obtain the target candidate phrase set; a score corresponding to each phrase in the target candidate phrase set is calculated according to the probability of occurrence of each word in the phrase; and finally the phrases in the target candidate phrase set are screened according to the sorting result and the preset threshold to obtain the new words.

It should be understood that, the sequence numbers of the steps in the above embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the present invention.

In one embodiment, an artificial intelligence based text error correction apparatus is provided, and the artificial intelligence based text error correction apparatus corresponds to the artificial intelligence based text error correction method in the above embodiment one to one. As shown in fig. 5, the artificial intelligence based text correction apparatus includes a first obtaining module 10, a new word discovering module 20, a new word adding module 30, a second obtaining module 40, a first determining module 50, a second determining module 60, a replacing module 70 and a selecting module 80. The functional modules are explained in detail as follows:

a first obtaining module 10, configured to acquire historical official document data, wherein the historical official document data comprises official document texts;

a new word discovery module 20, configured to perform new word discovery processing on the official document text to obtain new words;

a new word adding module 30, configured to add the new words to the original dictionary library to obtain the target dictionary library with the new words added; it is emphasized that, to further ensure the privacy and security of the target dictionary library, the target dictionary library may also be stored in a node of a blockchain;

a second obtaining module 40, configured to acquire the original text to be corrected;

a first determining module 50, configured to determine candidate error words in the original text to be corrected according to the original text to be corrected and the target dictionary library;

a second determining module 60, configured to determine a homophone set for each candidate error word according to the candidate error word;

a replacing module 70, configured to replace the candidate error words of the original text to be corrected with the corresponding homophones in the homophone set to obtain a corrected text set;

and a selecting module 80, configured to select the corrected text exceeding the preset text smoothness from the corrected text set as the final corrected text.

For the specific definition of the artificial intelligence based text error correction apparatus, reference may be made to the above definition of the artificial intelligence based text error correction method, which is not repeated here. The modules in the artificial intelligence based text error correction apparatus can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
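As an illustration of how the replacing module 70 could build the corrected text set from the outputs of modules 50 and 60, the sketch below substitutes each homophone for its candidate error word in turn. The function name, the dictionary-based homophone table and the English sample sentence are assumptions made only for this sketch.

```python
def corrected_text_set(text, candidate_errors, homophones):
    """Build one corrected text per (candidate error word, homophone) pair."""
    results = []
    for word in candidate_errors:
        for alternative in homophones.get(word, []):
            # str.replace substitutes every occurrence of the candidate error word.
            results.append(text.replace(word, alternative))
    return results


# Assumed inputs: the candidate error words and the homophone table would come from
# the first determining module 50 and the second determining module 60 respectively.
texts = corrected_text_set("we meat today", ["meat"], {"meat": ["meet", "mete"]})
print(texts)  # ['we meet today', 'we mete today']
```

Each text in the resulting set would then be scored for smoothness by the selecting module 80.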

In one embodiment, as shown in fig. 6, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

acquiring historical official document data, wherein the historical official document data comprises official document texts;

carrying out new word discovery processing on the official document text to obtain new words;

adding the new words into an original dictionary library to obtain a target dictionary library added with the new words; it is emphasized that, to further ensure the privacy and security of the target dictionary library, the target dictionary library may also be stored in a node of a blockchain;

acquiring an original text to be corrected;

determining candidate error words in the original text to be corrected according to the original text to be corrected and the target dictionary library;

determining a homophone word set of each candidate error word according to each candidate error word;

respectively replacing candidate error words of the original text to be corrected with corresponding homophones in the homophone set to obtain a corrected text set;

and selecting the corrected text which exceeds the preset text smoothness from the corrected text set as the final corrected text.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

acquiring historical official document data, wherein the historical official document data comprises official document texts;

carrying out new word discovery processing on the official document text to obtain new words;

adding the new words into an original dictionary library to obtain a target dictionary library added with the new words;

acquiring an original text to be corrected;

determining candidate error words in the original text to be corrected according to the original text to be corrected and the target dictionary library;

determining a homophone word set of each candidate error word according to each candidate error word;

respectively replacing candidate error words of the original text to be corrected with corresponding homophones in the homophone set to obtain a corrected text set;

and selecting the corrected text which exceeds the preset text smoothness from the corrected text set as the final corrected text.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.
