Semantic collocation word checking method

Document No.: 1889948. Publication date: 2021-11-26.

Summary: The semantic collocation word checking method was designed by 谈辉张硕谢振平夏振涛李艳朱立烨 on 2021-08-18. The invention provides a semantic collocation word checking method, which comprises the following steps: extracting collocations from an article data set and establishing a collocation dictionary; converting the text to be judged into a text vector; inputting the text vector into a deep learning model, which feeds the text vector into an encoder for encoding and passes the encoded vector through a fully connected layer to obtain a judgment result for the text; and, if the collocations in the input text are judged to be correct, outputting the judgment result as correct. With this method, the error-detection precision, error-detection recall, error-detection F value and error-detection accuracy can all exceed 90%.

1. A semantic collocation word checking method, characterized in that it comprises:

extracting collocations from an article data set, and establishing a collocation dictionary;

converting the text to be judged into a text vector;

inputting the text vector into a deep learning model, the deep learning model feeding the text vector into an encoder for encoding and processing the encoded vector through a fully connected layer to obtain a judgment result for the text;

if the collocations in the input text to be judged are judged to be correct, outputting the judgment result as correct.

2. The method of claim 1, wherein the collocations are screened to determine whether they qualify, the screening specifically using the following formula (1):

$$U_i = \frac{1}{D}\sum_{j=1}^{D}\left(p_i^j - \bar{p}_i\right)^2 \tag{1}$$

wherein $w$ is a basic word, $w_i$ is a collocate of the basic word, $\mathit{freq}_i$ is the frequency with which the basic word $w$ co-occurs with its collocate $w_i$, $\bar{f}$ is the average frequency for the basic word, $p_i^j$ is the number of occurrences of the collocation $(w, w_i)$ at distance $j$, $\bar{p}_i$ is the average number of occurrences of the collocation $(w, w_i)$ over all $D$ distances, and $U_i$ measures how uniformly the collocation is distributed;

the three conditions under which a collocation $(w, w_i)$ is judged reasonable are given by the following formula (2):

$$\begin{cases} k_i = \dfrac{\mathit{freq}_i - \bar{f}}{\sigma} \ge k_0 \\ U_i \ge U_0 \\ p_i^j \ge \bar{p}_i + k_1\sqrt{U_i} \end{cases} \tag{2}$$

wherein $\sigma$ is the standard deviation of the frequencies $\mathit{freq}_i$, and $k_0$, $k_1$ and $U_0$ are user-defined thresholds;

if a collocation $(w, w_i)$ satisfies formula (2), the collocation is judged to be established, the established collocation is added to the collocation dictionary, and a collocation dictionary knowledge base is built.

3. The method of claim 1, wherein if the text to be judged is a long sentence, the long sentence is divided into a plurality of short sentences, the short sentences are stored in a list, and the short sentences in the list are converted in turn into short text vectors.

4. The method of claim 3, wherein the short text vectors are input in turn into the deep learning model, the deep learning model feeds each short text vector into the encoder for encoding, and the encoded short text vector is processed by the fully connected layer to obtain a judgment result for the short sentence.

5. The method of claim 4, wherein if at least one input short text is judged to contain a collocation error, the judgment result is output as wrong.

6. The method of claim 4, wherein if the collocations in all input short texts are judged to be correct, the short sentences are spliced in order into the original text to be judged, and the judgment result is output as correct.

7. The method of claim 1, wherein the deep learning model is a trained deep learning model CMM-ERNIE.

8. The method of claim 7, wherein the performance of the model is evaluated, the output of the model being evaluated using the following formulas (3) to (6):

$$P = \frac{TP}{TP + FP} \tag{3}$$

$$R = \frac{TP}{TP + FN} \tag{4}$$

$$F = \frac{2 \times P \times R}{P + R} \tag{5}$$

$$A = \frac{TP + TN}{TP + FP + FN + TN} \tag{6}$$

wherein $P$ is the collocation precision; $R$ is the collocation recall; $F$ is the collocation F value; $A$ is the collocation accuracy; $TP$ (true positive) is a case that is actually positive and predicted positive; $FP$ (false positive) is a case that is actually negative but predicted positive; $FN$ (false negative) is a case that is actually positive but predicted negative; $TN$ (true negative) is a case that is actually negative and predicted negative.

Technical Field

The invention relates to the field of computers, and in particular to a semantic collocation word checking method.

Background

With the rapid development of the internet and of artificial intelligence, people's lives and work are increasingly tied to the internet. People want machines to understand natural language text more deeply, and expect satisfactory services in areas such as human-machine dialogue, machine translation and language teaching.

In everyday communication, people place low demands on the accuracy of word collocation, and under the influence of factors such as social development, internet buzzwords and the social environment, they also use non-standard collocations. Improper collocation can nevertheless hinder communication.

In certain scenarios, such as work reports and official document writing, collocation in the given context must be highly accurate. At present there is considerable research on word-level error correction and on proofreading of grammatical structure, but little on word collocation, which is a relatively complex problem. A proofreading method for word collocation relationships therefore needs to be developed.

Therefore, there is a need to provide a new solution.

Disclosure of Invention

In order to solve the technical problems in the prior art, the invention discloses a semantic collocation word checking method whose error-detection precision, error-detection recall, error-detection F value and error-detection accuracy can all exceed 90%. The specific technical scheme is as follows:

the invention provides a semantic collocation word checking method, which comprises the following steps:

extracting collocations from a data set consisting of a plurality of articles, and establishing a collocation dictionary;

converting the text to be judged into a text vector;

inputting the text vector into a deep learning model, the deep learning model feeding the text vector into an encoder for encoding and processing the encoded vector through a fully connected layer to obtain a judgment result for the text;

if the collocations in the input text to be judged are judged to be correct, outputting the judgment result as correct.

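Further, the collocations are screened to determine whether they qualify, the screening specifically using the following formula (1):

$$U_i = \frac{1}{D}\sum_{j=1}^{D}\left(p_i^j - \bar{p}_i\right)^2 \tag{1}$$

wherein $p_i^j$ is the number of occurrences of the collocation $(w, w_i)$ at distance $j$, and $\bar{p}_i$ is the average number of occurrences of the collocation over all $D$ distances;

the three conditions under which a collocation $(w, w_i)$ is judged reasonable are given by the following formula (2):

$$\begin{cases} k_i = \dfrac{\mathit{freq}_i - \bar{f}}{\sigma} \ge k_0 \\ U_i \ge U_0 \\ p_i^j \ge \bar{p}_i + k_1\sqrt{U_i} \end{cases} \tag{2}$$

wherein $\mathit{freq}_i$ is the frequency with which the basic word $w$ co-occurs with its collocate $w_i$, $\bar{f}$ is the average frequency for the basic word, $\sigma$ is the standard deviation of the frequencies $\mathit{freq}_i$, and $k_0$, $k_1$ and $U_0$ are user-defined thresholds;

if a collocation $(w, w_i)$ satisfies formula (2), the collocation is judged to be established, the established collocation is added to the collocation dictionary, and a collocation dictionary knowledge base is built.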

Further, if the text to be judged is a long sentence, the long sentence is divided into a plurality of short sentences, which are stored in a list, and the short sentences in the list are converted in turn into short text vectors.

Further, the short text vectors are input in turn into the deep learning model, the deep learning model feeds each short text vector into the encoder for encoding, and the encoded short text vector is processed by the fully connected layer to obtain a judgment result for the short sentence.

Further, if at least one input short text is judged to contain a collocation error, the judgment result is output as wrong.

Further, if the collocations in all input short texts are judged to be correct, the short sentences are spliced in order into the original text to be judged, and the judgment result is output as correct.

Further, the deep learning model is the trained deep learning model CMM-ERNIE.

Further, the performance of the model is evaluated, the output of the model being evaluated using the following formulas (3) to (6):

$$P = \frac{TP}{TP + FP} \tag{3}$$

$$R = \frac{TP}{TP + FN} \tag{4}$$

$$F = \frac{2 \times P \times R}{P + R} \tag{5}$$

$$A = \frac{TP + TN}{TP + FP + FN + TN} \tag{6}$$

The invention has the following beneficial effects:

with the semantic collocation word checking method, the error-detection precision, error-detection recall, error-detection F value and error-detection accuracy can all exceed 90%.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of the semantic collocation word checking method according to the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar function. The embodiments described below with reference to the drawings are illustrative, are intended to explain the invention, and are not to be construed as limiting it.

The invention restricts the Chinese context to the official-document domain and obtains corpora such as news reports and official document texts from websites. Collocations are extracted with the Smadja algorithm proposed by Frank Smadja in the paper 'Retrieving Collocations from Text: Xtract'. The distance d within which candidate collocates of a basic word are counted is set to 5, and after statistical screening the collocation combinations are stored in an Excel table. The correct corpus is used as positive samples, and negative samples are obtained by randomly replacing collocates in the correct corpus. The corpus consists of official-document text with standard grammatical structure and standard collocations, and therefore meets the experimental requirements.
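The negative-sample construction described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_negative_sample`, its parameters, and the choice to exclude known valid collocates from the replacement pool are hypothetical and are not taken from the patent, which only states that collocates are replaced at random.

```python
import random


def make_negative_sample(tokens, collocate, candidate_words, valid_collocates, rng):
    """Build a negative sample by swapping `collocate` for a random other word.

    tokens            -- the tokenized correct sentence (a positive sample)
    collocate         -- the collocate to replace
    candidate_words   -- pool of words that may be used as replacements
    valid_collocates  -- words known to collocate correctly with the basic word;
                         excluded so the result is (probably) a wrong collocation
                         (this exclusion is an assumption of the sketch)
    rng               -- a random.Random instance, passed in for reproducibility
    """
    replacements = sorted(w for w in candidate_words if w not in valid_collocates)
    if not replacements:
        raise ValueError("no replacement word available")
    new_word = rng.choice(replacements)
    return [new_word if t == collocate else t for t in tokens]


# Toy usage with English stand-in tokens.
positive = ["optimize", "the", "economic", "structure"]
negative = make_negative_sample(
    positive,
    collocate="structure",
    candidate_words={"structure", "speed", "mode"},
    valid_collocates={"structure", "mode"},
    rng=random.Random(0),
)
```

Passing the random generator in explicitly keeps the generated data set reproducible between runs.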

The semantic collocation word checking method provided by the invention follows this flow: construct the dictionary → mask the corpus using the CMM strategy together with the dictionary to build the data set used for model training → train the model → obtain the trained model → input the text to be judged into the model → the model judges whether the input text contains collocation errors → output the result. The method specifically comprises the following steps:

1. Collocation extraction

Within the same context, the words of a collocation are associated with one another, and this association is reflected in information such as their relative positions and word semantics. The invention uses a Chinese government-affairs corpus as the data set and the Smadja algorithm as the main method of word collocation extraction, and thereby builds a word collocation knowledge base. The data set in the present invention is a data set composed of a plurality of articles.

Smadja proposes three conditions for screening reasonable collocations.

Let the frequency with which a basic word $w$ co-occurs with its collocate $w_i$ be $\mathit{freq}_i$, let the average frequency for the basic word be $\bar{f}$, let $p_i^j$ be the number of occurrences of the collocation $(w, w_i)$ at distance $j$, and let $\bar{p}_i$ be the average number of occurrences of the collocation $(w, w_i)$ over all $D$ distances. Formula (1) is:

$$U_i = \frac{1}{D}\sum_{j=1}^{D}\left(p_i^j - \bar{p}_i\right)^2 \tag{1}$$

The value $U_i$ obtained from formula (1) measures how uniformly the collocation is distributed over the distances, and is used as a variable in formula (2) to evaluate the screening conditions.

The three conditions under which a collocation $(w, w_i)$ is judged reasonable are:

$$\begin{cases} k_i = \dfrac{\mathit{freq}_i - \bar{f}}{\sigma} \ge k_0 \\ U_i \ge U_0 \\ p_i^j \ge \bar{p}_i + k_1\sqrt{U_i} \end{cases} \tag{2}$$

In formula (2), $\sigma$ is the standard deviation of the frequencies $\mathit{freq}_i$, and $k_0$, $k_1$ and $U_0$ are user-defined thresholds, set empirically here to $k_0 = 1$, $k_1 = 1$ and $U_0 = 10$. When a collocation satisfies the three conditions in formula (2), it is judged to be established and is added to the collocation dictionary, thereby building the collocation dictionary knowledge base.
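The screening by formulas (1) and (2) can be sketched in a few lines of Python. This is a hedged illustration, not the patented implementation: the function names, the input layout (one list of per-distance counts for each candidate collocate) and the toy counts are assumptions, and the strength, spread and peak conditions follow the definitions in Smadja's Xtract paper, which the text cites as its basis.

```python
from math import sqrt
from statistics import mean, pstdev


def spread(position_counts):
    """Formula (1): variance of the co-occurrence counts over all distances."""
    p_bar = mean(position_counts)
    return sum((p - p_bar) ** 2 for p in position_counts) / len(position_counts)


def filter_collocates(counts_by_collocate, k0=1.0, k1=1.0, u0=10.0):
    """Return {collocate: [indices of peak distances]} for pairs passing formula (2).

    counts_by_collocate maps each candidate collocate w_i of ONE basic word w
    to a list of co-occurrence counts, one entry per distance j (assumed layout).
    """
    freqs = {w_i: sum(c) for w_i, c in counts_by_collocate.items()}
    f_bar = mean(list(freqs.values()))
    sigma = pstdev(list(freqs.values()))
    accepted = {}
    for w_i, counts in counts_by_collocate.items():
        # Condition 1: strength of the pair relative to the other candidates.
        strength = (freqs[w_i] - f_bar) / sigma if sigma > 0 else 0.0
        # Condition 2: spread of the counts over the distances.
        u_i = spread(counts)
        if strength < k0 or u_i < u0:
            continue
        # Condition 3: keep only the distances where the count peaks.
        threshold = mean(counts) + k1 * sqrt(u_i)
        peaks = [j for j, p in enumerate(counts) if p >= threshold]
        if peaks:
            accepted[w_i] = peaks
    return accepted


# Toy counts (invented for illustration) for one basic word and four candidates.
toy_counts = {
    "structure": [0, 0, 0, 0, 0, 40, 0, 0, 0, 0],
    "the": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    "a": [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    "b": [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
}
kept = filter_collocates(toy_counts)
```

In the toy data, only "structure" is both frequent enough and concentrated at one distance, so it is the only candidate kept. The frequent but evenly spread word "the" is rejected.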

2. Collocation judgment

In the field of natural language processing there are two common masking strategies: Padding-Mask, which handles input sequences of variable length, and Sequence-Mask, which prevents leakage of label information. The invention proposes a masking strategy for collocation judgment, CMM (Collocation-Mid-Mask), which masks the information lying between the two words of a collocation whose distance is greater than 1.

Using the collocation information in the dictionary, the CMM masking strategy selects, from the statistically extracted collocations, those whose distance is greater than 1, and masks the words between the basic word and the collocate. In Chinese, the words between the basic word and the collocate are often auxiliary words, prepositions, adverbs and modifiers of the collocate, so this intervening information is unimportant when judging the collocation. After masking, the self-attention mechanism can allocate more attention to the context of the collocation and make better use of the prior knowledge in the sentence.
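The masking step can be illustrated with a short sketch. Several details here are assumptions made for the example and are not specified in the text: the function name `cmm_mask`, the `[MASK]` placeholder token, a dictionary that maps each basic word to a set of collocates, and the simplification of searching only to the right of the basic word.

```python
MASK = "[MASK]"  # placeholder token; the actual mask symbol is an assumption


def cmm_mask(tokens, collocation_dict, max_distance=5):
    """Mask the tokens lying strictly between a basic word and its collocate.

    collocation_dict maps a basic word to the set of its accepted collocates.
    Adjacent pairs (distance 1) have nothing between them and are left as-is.
    """
    masked = list(tokens)
    for i, word in enumerate(tokens):
        collocates = collocation_dict.get(word, ())
        # Look at positions i+2 .. i+max_distance (distance greater than 1).
        last = min(i + max_distance, len(tokens) - 1)
        for j in range(i + 2, last + 1):
            if tokens[j] in collocates:
                for k in range(i + 1, j):
                    masked[k] = MASK
    return masked


# Toy usage with English stand-in tokens.
dictionary = {"transform": {"mode"}, "optimize": {"structure"}}
print(cmm_mask(["transform", "the", "economic", "development", "mode"], dictionary))
print(cmm_mask(["optimize", "structure"], dictionary))
```

The first call masks the three intervening tokens, while the second leaves the adjacent pair untouched because there is nothing between its two words.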

Model training: the corpus is processed with the CMM masking strategy together with the constructed dictionary to build the data set for model training. The ERNIE-1.0 deep learning model proposed by Baidu is used as the base architecture, and the trained model is saved and named CMM-ERNIE.

As shown in FIG. 1, the specific collocation judgment process is as follows. Take 'transform the economic development mode, optimize the economic structure' as the text to be judged. An overly long text is divided into short sentences, which are stored in a list, giving ['transform the economic development mode', 'optimize the economic structure']. The short texts in the list are converted in turn into vectors that reflect the text features, and collocation judgment is performed by the trained deep learning model CMM-ERNIE: the model receives a text vector, feeds it into the encoder for encoding based on the multi-head attention mechanism, and passes the encoded vector through a fully connected layer to obtain the judgment result for the short sentence. The result is correct/wrong, that is, whether or not the short sentence contains a collocation error. The short sentences in the list are input and judged in turn; if at least one short sentence in the list contains a collocation error, the input long sentence is considered to contain a collocation error. The short sentences are then spliced back into the original long sentence in order from front to back, and the judgment result for the long sentence is output as correct/wrong. In this example the result is correct.
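The split, judge and aggregate flow described above can be sketched as follows. The trained CMM-ERNIE model is not reproduced here. Instead, `judge_short_sentence` is a hypothetical stand-in, meaning any callable that returns True when a short sentence has no collocation error. The function names and the punctuation set used for splitting are likewise assumptions.

```python
import re


def split_short_sentences(text):
    """Split a long sentence at Chinese and Western clause punctuation."""
    parts = re.split(r"[，,；;。.!！?？]", text)
    return [p.strip() for p in parts if p.strip()]


def check_text(text, judge_short_sentence):
    """Judge every short sentence and aggregate into one verdict.

    Returns ("correct" or "error", list of short sentences). The text is
    "error" as soon as at least one short sentence is judged wrong.
    """
    short_sentences = split_short_sentences(text)
    results = [judge_short_sentence(s) for s in short_sentences]
    verdict = "correct" if all(results) else "error"
    return verdict, short_sentences


# Toy usage: a dummy judge that accepts everything stands in for the model.
text = "transform the economic development mode, optimize the economic structure"
verdict, pieces = check_text(text, lambda s: True)
print(verdict, pieces)
```

Keeping the judge as a parameter separates the sentence-level bookkeeping from the model, so the same aggregation logic can be exercised with a dummy judge before a trained model is available.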

3. Analysis of experiments

The result output after judgment by the trained deep learning model CMM-ERNIE is whether the long sentence contains a collocation error, that is, correct or wrong. The quality of the result is judged mainly by the error-detection accuracy among the model evaluation indices.

The invention uses four indices, collocation precision (P), collocation recall (R), collocation F value (F) and collocation accuracy (A), to evaluate the performance of the CMM-ERNIE model. A true positive (TP) is actually positive and predicted positive; a false positive (FP) is actually negative but predicted positive; a false negative (FN) is actually positive but predicted negative; a true negative (TN) is actually negative and predicted negative. The calculation formulas are as follows:

$$P = \frac{TP}{TP + FP} \tag{3}$$

$$R = \frac{TP}{TP + FN} \tag{4}$$

$$F = \frac{2 \times P \times R}{P + R} \tag{5}$$

$$A = \frac{TP + TN}{TP + FP + FN + TN} \tag{6}$$
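The four indices follow directly from the four confusion counts. The sketch below uses the standard definitions of precision, recall, F value (taken here to be the balanced F1 score, which is an assumption) and accuracy. The function name and the example counts are made up for illustration and are not results from the patent.

```python
def evaluate(tp, fp, fn, tn):
    """Compute precision, recall, F value and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_value = 2 * precision * recall / (precision + recall)
    else:
        f_value = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"P": precision, "R": recall, "F": f_value, "A": accuracy}


# Invented counts, for illustration only.
scores = evaluate(tp=8, fp=2, fn=1, tn=9)
print(scores)
```

The guards against zero denominators only matter for degenerate test sets, for example one in which the model never predicts the positive class.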

Experiments were carried out using these evaluation indices, and the experimental results are shown in Table 1.

Table 1 Comparison of experimental results

As can be seen from Table 1, the error-detection precision, error-detection recall, error-detection F value and error-detection accuracy of the semantic collocation word checking method of the present invention can all exceed 90%.

In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example" or "some examples" means that a particular feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art may combine the various embodiments or examples described in this specification.

While embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention, and that those of ordinary skill in the art may make changes, modifications and variations to them within the scope of the present invention.
