Electric power professional word bank construction method based on hybrid model and clustering algorithm

Document No.: 1799318    Publication date: 2021-11-05

Note: This technique, "A method for constructing a power professional word bank based on a hybrid model and a clustering algorithm" (Electric power professional word bank construction method based on hybrid model and clustering algorithm), was designed by 陈文刚, 宰洪涛, 刘建国, 张轲, 许泳涛, 何洪英, 罗滇生, 尹希浩, 奚瑞瑶, 符芳育, 方 on 2021-07-30. Abstract: The invention relates to the field of artificial intelligence, in particular to a method for constructing a power professional word bank based on a hybrid model and a clustering algorithm. The power text and a parallel corpus are preprocessed and then segmented by a word segmentation model: the mutual information and left-right entropy algorithm and the TextRank algorithm combine words from the Jieba segmentation result, the TF-IDF algorithm and the Word2Vec word clustering algorithm extract text keywords from the Jieba segmentation result, and the information entropy segmentation algorithm segments the text directly; these results are aggregated and compared to obtain characteristic corpus words. Power professional vocabulary is selected from the characteristic corpus words as seed words; meanwhile, the exported power text word bank is used as candidate words to segment the power text, and the word2vec algorithm then converts the words into word vectors. Clustering yields similar words, and rule filtering then yields the power professional word bank. With a single clustering model, the invention can filter out most professional words outside the power domain, and the professional vocabulary obtained is relatively complete.

1. A method for constructing a power professional word bank based on a hybrid model and a clustering algorithm, characterized by comprising the following steps:

step one, preprocessing a power text and a non-power-professional parallel corpus, removing spaces, punctuation and words without entity meaning, to obtain qualified input text data;

step two, segmenting the power text and the parallel corpus with a word segmentation model to obtain a power text word bank and a parallel corpus word bank, and comparing the power text word bank with the parallel corpus word bank to obtain characteristic corpus words;

step three, selecting power professional vocabulary from the characteristic corpus words as seed words; meanwhile, segmenting the power text using the power text word bank derived in step two, and then converting the words into word vectors with the word2vec algorithm;

step four, inputting the word vectors and the seed words into a clustering model, clustering to obtain similar words, then filtering out non-power professional words with rules, and finally obtaining the power professional word bank.

2. The construction method according to claim 1, wherein in step one the power text includes power science and technology papers, project reports, power regulations or power operation manuals, and the parallel corpus is a crawled Wikipedia corpus.

3. The construction method according to claim 1, wherein in step two, in the word segmentation model, word set 1 is obtained through a TF-IDF statistical model, a Word2Vec word clustering model, a TextRank model, and a left-right information entropy and mutual information entropy model, all based on Jieba word segmentation; word set 2 is established through frequency, degree of solidification and degree of freedom; and finally the two word sets are merged to obtain the final word bank.

4. The method of claim 3, wherein word set 1 is established as follows: word combination is performed through the TextRank model and the left-right information entropy and mutual information entropy model, and the words are merged to obtain word set 1.

5. The method of claim 3, wherein the word set 2 is established as follows:

(1) Counting: count from the corpus the frequency of each character, P_a and P_b, and the co-occurrence frequency of each pair of adjacent characters, P_ab;

(2) Cutting: set a threshold min_prob on the occurrence frequency and a threshold min_pmi on the mutual information; then, wherever P_ab < min_prob or log(P_ab / (P_a · P_b)) < min_pmi in the corpus, cut between the two adjacent characters;

(3) Filtering: after the cutting of step (2), count the frequency P_w' of each candidate word obtained in step (2), and retain only those with P_w' > min_prob;

(4) Redundancy removal: sort the candidate words obtained in step (3) from longest to shortest by number of characters; delete each candidate word from the word bank in turn, segment it using the remaining words and their frequencies, and compute the mutual information between the original word and its sub-words as P_w / (P_w1 · P_w2 · … · P_wk); if this value is greater than 1, restore the word, otherwise keep it deleted and update the frequencies of the sub-words obtained from the segmentation;

(5) Counting: for the words obtained after the redundancy removal of step (4), count over the word set the left information entropy of each word, H_L(w) = -Σ P(a_i|w)·log P(a_i|w) (i = 1, …, n), and the right information entropy, H_R(w) = -Σ P(b_j|w)·log P(b_j|w) (j = 1, …, m), where n and m are the numbers of distinct characters adjacent to the word on the left and right respectively; define the degree of freedom of a text fragment as the smaller of its left and right neighbour information entropies, set a freedom threshold min_pdof, and if the degree of freedom is greater than the threshold, consider the fragment an independent word.

Technical Field

The invention relates to the field of artificial intelligence, in particular to a method for constructing a power professional word bank based on a hybrid model and a clustering algorithm.

Background

In Chinese, a single character has weak expressive power and scattered meaning, while a word expresses meaning much more strongly and can describe an object more accurately; therefore, in natural language processing the word (including single-character words) is generally the most basic processing unit. For Latin-alphabet languages such as English, words can be extracted simply and accurately because the spaces between words mark word boundaries. In Chinese, apart from punctuation marks, characters are written contiguously with no obvious boundaries, so extracting words is difficult. Chinese word segmentation methods fall roughly into three categories: dictionary-based segmentation, statistical-model-based segmentation and rule-based segmentation. Dictionary-based segmentation is a relatively common and efficient approach, but its precondition is the availability of a word bank.

At present, a relatively complete professional word bank has not been established in the electric power field. As the demand for semantic understanding of power texts grows, the need to construct a word bank for the power professional field becomes more and more urgent. The power industry has accumulated a large amount of text data, including power science and technology papers, project reports, power regulations, power operation manuals and the like. Using natural language processing technology on these data to carry out vocabulary discovery research for the power professional field, and further to construct a power-domain dictionary, is of great significance for the subsequent development of text understanding, mining and information management in the power field. However, since text mining belongs to the new technologies that have appeared in artificial intelligence in recent years, word discovery and word bank construction are likewise a new frontier in the domestic power professional field; most research is still at the experimental stage, and practical application effects have yet to be demonstrated.

Chinese differs from most Western languages: written Chinese has no explicit space marks between words, and sentences appear as character strings, so the first step in processing Chinese is automatic word segmentation, i.e. converting character strings into word strings. At the same time, Chinese is complex and variable, with phenomena such as crossing ambiguity, combination ambiguity, ambiguity that cannot be resolved within a sentence, and out-of-vocabulary words, which make Chinese word segmentation difficult. To complete language processing tasks well, word segmentation must be performed first in Chinese data mining. Existing common word segmentation methods are all based on a manually built word bank; commonly used words can be collected into the word bank by hand, but such methods cannot cope with the endless stream of new words, especially new Internet words, which are often the key difficulty of the segmentation task. Therefore, one core task of Chinese word segmentation is to improve new word discovery algorithms. New word discovery means automatically discovering language fragments that can form words directly from a large-scale corpus, without any prior material.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for constructing a power professional word bank based on a hybrid model and a clustering algorithm. The method can overcome the shortcomings of the word segmentation algorithms used in existing word bank construction techniques for the power professional field, and can mine new words from power text data.

The scheme of the invention comprises the following steps:

step one, preprocessing a power text and a parallel corpus, removing spaces, punctuation and words without entity meaning, to obtain qualified input text data;

step two, segmenting the power text and the non-power-professional parallel corpus with a word segmentation model to obtain a power text word bank and a parallel corpus word bank, and comparing the two word banks to obtain characteristic corpus words;

step three, selecting power professional vocabulary from the characteristic corpus words as seed words; meanwhile, segmenting the power text using the power text word bank derived in step two as the candidate vocabulary, and then converting the words into word vectors with the word2vec algorithm;

step four, inputting the word vectors and the seed words into a clustering model, clustering to obtain words of the power professional field, then filtering out non-power professional words with rules, and finally obtaining the power professional word bank.

In step one, the power text includes power science and technology papers, project reports, power regulations, power operation manuals and the like, and the parallel corpus can be a crawled Wikipedia corpus.

In the word segmentation model, word set 1 is obtained through a TF-IDF statistical model, a Word2Vec word clustering model, a TextRank model, and a left-right information entropy and mutual information entropy model, all based on Jieba word segmentation; word set 2 is established through frequency, degree of solidification and degree of freedom; finally the two word sets are merged to obtain the final word bank.

Word set 1 is established as follows: word combination is performed through the TextRank model and the left-right information entropy and mutual information entropy model, and the words are merged to obtain word set 1.

Word set 2 is established as follows:

(1) Counting: count from the corpus the frequency of each character, P_a and P_b, and the co-occurrence frequency of each pair of adjacent characters, P_ab;

(2) Cutting: set a threshold min_prob on the occurrence frequency and a threshold min_pmi on the mutual information; then, wherever P_ab < min_prob or log(P_ab / (P_a · P_b)) < min_pmi in the corpus, cut between the two adjacent characters;

(3) Filtering: after the cutting of step (2), count the frequency P_w' of each candidate word obtained in step (2), and retain only those with P_w' > min_prob;

(4) Redundancy removal: sort the candidate words obtained in step (3) from longest to shortest by number of characters; delete each candidate word from the word bank in turn, segment it using the remaining words and their frequencies, and compute the mutual information between the original word and its sub-words as P_w / (P_w1 · P_w2 · … · P_wk); if this value is greater than 1, restore the word, otherwise keep it deleted and update the frequencies of the sub-words obtained from the segmentation;

(5) Counting: for the words obtained after the redundancy removal of step (4), count over the word set the left information entropy of each word, H_L(w) = -Σ P(a_i|w)·log P(a_i|w) (i = 1, …, n), and the right information entropy, H_R(w) = -Σ P(b_j|w)·log P(b_j|w) (j = 1, …, m), where n and m are the numbers of distinct characters adjacent to the word on the left and right respectively; define the degree of freedom of a text fragment as the smaller of its left and right neighbour information entropies, set a freedom threshold min_pdof, and if the degree of freedom is greater than the threshold, consider the fragment an independent word.

Based on the hybrid model, the method segments document text in the power field; the segmented words conform to Chinese semantics, and the segmentation task can be completed effectively. Compared with a single model, the hybrid model makes the constructed word bank more complete and the vocabulary richer. The words extracted by the hybrid model include some words outside the power field; the clustering model is therefore used to cluster the power-domain professional words, and the clustering results show that most non-power-domain professional words can be filtered out while the clustered power-domain professional words are complete, so the effect is good.

Drawings

Fig. 1 is a schematic diagram of the text preprocessing process.

Fig. 2 is a schematic diagram of the characteristic corpus word extraction process.

Fig. 3 is a schematic diagram of the word segmentation model.

Fig. 4 is a schematic diagram of the power professional field word bank construction process.

Fig. 5 is a schematic diagram of word combination.

Detailed Description

The invention discloses a method for constructing a power professional word bank based on a hybrid model and a clustering algorithm, which comprises the following steps:

step one, preprocessing the power text and the parallel corpus, that is, deleting spaces, punctuation, special characters and characters or words without entity meaning from the initial text data, to obtain qualified input text data;

step two, segmenting the power text and the non-power-professional parallel corpus with the word segmentation model to obtain a power text word bank and a parallel corpus word bank, and comparing the two word banks to obtain characteristic corpus words of the power field;

step three, selecting power professional vocabulary from the characteristic corpus words as seed words (the characteristic corpus words still contain non-power professional vocabulary); meanwhile, segmenting the power text using the power text word bank derived in step two as the candidate vocabulary, and then converting the words into word vectors with the word2vec algorithm;

step four, inputting the word vectors and the seed words into the clustering model, clustering to obtain words of the power professional field, then filtering out non-power professional words with rules, and finally obtaining the power professional word bank.

In the text data preprocessing shown in Fig. 1, the initial power-domain text data and the parallel text data contain a large number of spaces and punctuation marks, special characters such as '%', and characters or words without entity meaning (such as common function words). To obtain qualified input text, the text must be processed accordingly. The power professional field text includes power science and technology papers, project reports, power regulations, power operation manuals and the like, while the parallel corpus can be language data such as Wikipedia or People's Daily, which should be distinct from the power text data. In addition, the power text data and the parallel corpus should be large enough so that the constructed word bank is sufficiently large.
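For illustration, a minimal Python sketch of this preprocessing step is given below; the stop-character set and the punctuation list are placeholders chosen for the example, not the actual lists used by the method.

```python
import re
import string

# Illustrative placeholder for characters without entity meaning;
# the actual list used by the method is not specified in the text.
STOP_CHARS = set("的了")

# ASCII punctuation plus a few common Chinese punctuation marks and symbols.
PUNCT = set(string.punctuation) | set("，。、；：？！“”‘’（）《》【】％")

def preprocess(text: str) -> str:
    """Remove spaces, punctuation, special characters and non-entity characters."""
    text = re.sub(r"\s+", "", text)   # drop all whitespace
    return "".join(ch for ch in text
                   if ch not in PUNCT and ch not in STOP_CHARS)

if __name__ == "__main__":
    print(preprocess("变压器 油色谱分析的结果，正常！（详见附录 %）"))
```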

In the characteristic corpus word extraction shown in Fig. 2, the power professional field text and the parallel corpus are segmented by the word segmentation model to obtain two word banks, which are then compared to obtain the characteristic corpus words.

In the word segmentation model, word set 1 is obtained through a TF-IDF statistical model, a Word2Vec word clustering model, a TextRank model, and a left-right information entropy and mutual information entropy model, all based on Jieba word segmentation; word set 2 is established through frequency, degree of solidification and degree of freedom; finally the two word sets are merged to obtain the final word bank.

The word set 1 is established as follows:

(1) Jieba word segmentation: Jieba is a good text word segmentation tool and can segment the text accurately, but the words it produces have small granularity, so most terms of the power professional field are split apart and the resulting power-domain vocabulary is not rich enough. Therefore, these small-granularity words are combined to enrich the whole word bank, as shown in Fig. 5:

(2) Combining: because the granularity of the Jieba segmentation result is small and most terms of the power professional field are split apart, the final result is obtained by combining these words.

a. Extracting keywords with the TF-IDF model

The statistical model used is the TF-IDF model, whose score is the product of two statistics; there are several ways to determine the specific value of each statistic.

Term frequency (TF): the frequency of word w in document d, i.e. the ratio of the number of occurrences count(w, d) of word w in document d to the total number of words size(d) in document d: tf(w, d) = count(w, d) / size(d).

Inverse document frequency (IDF): the inverse document frequency of word w over the whole document set, i.e. the logarithm of the ratio of the total number of documents n to the number of documents df(w, D) in which word w appears: idf(w) = log(n / df(w, D)).

The weight of word i is therefore: w_i = tf_i × idf_i.

In this method, however, an improved TF-IDF model is used as the evaluation criterion; the improved model increases the penalty on document frequency (DF), so that words appearing in many documents receive lower scores.
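As a reference, the following is a minimal sketch of the plain TF-IDF weighting defined above (tf = count/size, idf = log(n/df)); the improved DF penalty mentioned in the text is not specified, so it is not reproduced here, and the sample documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=10):
    """docs: list of documents, each a list of tokens (e.g. Jieba output).
    Returns the top_k (word, weight) pairs for each document."""
    n = len(docs)
    # document frequency df(w, D): number of documents containing word w
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    results = []
    for doc in docs:
        size = len(doc)
        tf = Counter(doc)
        weights = {w: (c / size) * math.log(n / df[w]) for w, c in tf.items()}
        results.append(sorted(weights.items(),
                              key=lambda kv: kv[1], reverse=True)[:top_k])
    return results

# Example: two tiny token lists standing in for segmented power-domain documents
docs = [["变压器", "绕组", "变形", "试验"], ["变压器", "油", "色谱", "分析"]]
print(tfidf_keywords(docs, top_k=3))
```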

b. Extracting keywords by Word2Vec word clustering.

1) Word2Vec Word vector representation:

Word2Vec uses a shallow neural network model to automatically learn how words occur in the corpus and embeds the words into a vector space, usually of 100-500 dimensions, in which each word is represented as a word vector. The feature word vectors are then extracted from the trained word vector model.
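A minimal sketch of training such word vectors with gensim (4.x API) is shown below; the toy corpus, the 100-dimensional vector size and the other parameters are illustrative choices, not values prescribed by the method.

```python
from gensim.models import Word2Vec

# Each sentence is a list of tokens produced by the word segmentation step.
sentences = [["变压器", "绕组", "变形"], ["断路器", "跳闸", "保护"]]

# 100-500 dimensions is typical; 100 is used here for illustration.
model = Word2Vec(sentences, vector_size=100, window=5,
                 min_count=1, sg=1, workers=4)
model.save("power_w2v.model")

vec = model.wv["变压器"]   # word vector of a single term
print(vec.shape)           # (100,)
```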

2) K-means clustering algorithm:

The goal of a clustering algorithm is to find relationships among the data objects and group the data so that the similarity within groups is as large as possible and the similarity between groups is as small as possible.

The idea of the algorithm is as follows: first, K points are randomly selected as initial centroids, where K is the number of clusters specified by the user; each point is assigned to the nearest centroid (by computing the distance from each point to each centroid), forming K clusters; the centroid of each cluster is then recomputed from the points assigned to it; the assignment and centroid-update operations are repeated until the clusters no longer change or the maximum number of iterations is reached.

3) The implementation process of the Word2Vec Word cluster-based keyword extraction method comprises the following steps:

The main idea is as follows: for words represented by word vectors, the words in an article are clustered with the K-Means algorithm, the cluster centres are selected as the main keywords of the text, and the distance (i.e. similarity) between the other words and the cluster centres is computed; the top K words closest to a cluster centre are selected as keywords, where the similarity between words is computed from the vectors generated by Word2Vec. The specific steps are as follows (a code sketch follows the list):

i. Train a Word2Vec model on the corpus to obtain the word vector file;

ii. Preprocess the text to obtain N candidate keywords;

iii. Traverse the candidate keywords and extract their word vector representations from the word vector file;

iv. Perform K-Means clustering on the candidate keyword vectors to obtain the cluster centre of each category;

v. For each category, compute the distance (Euclidean or Manhattan) between the words in the group and the cluster centre, and sort in descending order by cluster size;

vi. Rank the computed results of the candidate keywords and take the top K words as the text keywords.
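The sketch below condenses steps i-vi, assuming a Word2Vec model trained as in the previous snippet and scikit-learn's KMeans; the number of clusters, the choice of Euclidean distance and top_k are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def w2v_cluster_keywords(candidates, model, n_clusters=3, top_k=5):
    """candidates: candidate keywords from the preprocessed text (step ii).
    model: a trained gensim Word2Vec model (step i); assumes at least
    n_clusters candidates are present in its vocabulary.
    Returns up to top_k words per cluster, ranked by distance to the centroid."""
    # iii. look up the word vector of each candidate that is in the vocabulary
    words = [w for w in candidates if w in model.wv]
    vectors = np.array([model.wv[w] for w in words])
    # iv. K-Means clustering of the candidate keyword vectors
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vectors)
    keywords = []
    for c in range(n_clusters):
        members = [(w, np.linalg.norm(model.wv[w] - km.cluster_centers_[c]))
                   for w, label in zip(words, km.labels_) if label == c]
        # v./vi. sort by Euclidean distance to the cluster centre, keep the closest
        members.sort(key=lambda t: t[1])
        keywords.extend(w for w, _ in members[:top_k])
    return keywords

# Usage (model trained and saved as in the previous sketch):
# from gensim.models import Word2Vec
# model = Word2Vec.load("power_w2v.model")
# print(w2v_cluster_keywords(["变压器", "绕组", "断路器", "跳闸"], model, n_clusters=2))
```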

c. Extracting keywords with the TextRank model

In the TextRank model, words are regarded as nodes and relations between words are constructed; the importance of each word is computed from the co-occurrence relations among the words. The algorithm used by TextRank for keyword extraction is as follows (a code sketch is given after the steps):

the TextRank model:

wherein d is a damping factor, generally 0.85, In (V)i) Indicates a point ViNode of, Out (V)i) Is represented by ViThe node to which it is directed. w is aijIs represented by node Vi→VjWeight of the edge of (3), WS (V)i) Represents the weight of node i, WS (V)j) Representing the weight of node j.

1) Segment the given text T into complete sentences, i.e. T = [S_1, S_2, …, S_m].

2) For each sentence S_i, perform word segmentation and part-of-speech tagging, filter out stop words, and retain only words with specified parts of speech, such as nouns, verbs and adjectives, i.e. S_i = [t_{i,1}, t_{i,2}, …, t_{i,n}], where the t_{i,j} are the retained candidate keywords.

3) Construct a candidate keyword graph G = (V, E), where V is the node set consisting of the candidate keywords generated in step 2); edges are built between pairs of nodes using the co-occurrence relation: an edge exists between two nodes only if the corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur.

4) Iteratively propagate the weight of each node according to the above formula until convergence.

5) Sort the node weights in descending order to obtain the most important T words as candidate keywords.

6) Mark the most important T words obtained in step 5) in the original text; if adjacent phrases are formed, combine them into multi-word keywords.
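A sketch of this TextRank procedure is given below using networkx's weighted PageRank, whose alpha parameter plays the role of the damping factor d = 0.85; the window length and the simple co-occurrence weighting are illustrative, and in practice the ready-made jieba.analyse.textrank function offers a comparable implementation.

```python
import networkx as nx

def textrank_keywords(tokens, window=5, top_t=10, d=0.85):
    """tokens: filtered candidate keywords (nouns/verbs/adjectives) in text order."""
    graph = nx.Graph()
    graph.add_nodes_from(set(tokens))
    # build co-occurrence edges: two words are linked if they appear
    # within the same window of length `window`
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                w = graph.get_edge_data(tokens[i], tokens[j], {"weight": 0})["weight"]
                graph.add_edge(tokens[i], tokens[j], weight=w + 1)
    # iterate the TextRank formula until convergence (weighted PageRank)
    scores = nx.pagerank(graph, alpha=d, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)[:top_t]

print(textrank_keywords(["变压器", "绕组", "变形", "变压器", "故障", "诊断"], top_t=3))
```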

d. Extracting multi-word keywords using the left-right information entropy and mutual information.

1) Calculating mutual information: first find first-order occurrences and return the word frequencies; then find second-order (adjacent pair) co-occurrences and return the mutual information and pair frequencies. The larger the pointwise mutual information PMI(a, b), the more strongly the two words a and b are related.

2) Calculating the left and right entropies: first find the left-neighbour frequencies and compute the left entropy H_L(x), returning the left entropy; then find the right-neighbour frequencies and compute the right entropy H_R(x), returning the right entropy.

3) Computing the result: score = PMI + min(H_L(x), H_R(x)). The larger the score, the greater the probability that the words should be combined.

Finally, the words obtained in steps a, b, c and d are merged to obtain word set 1.
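A minimal sketch of the step-d scoring, score = PMI + min(H_left, H_right), over adjacent Jieba tokens is shown below; probabilities are estimated by simple counting, and any threshold applied to the score would be an additional assumption not fixed by the text.

```python
import math
from collections import Counter, defaultdict

def combine_scores(tokens):
    """tokens: Jieba segmentation result as a flat list.
    Returns {(a, b): score} for adjacent token pairs, where
    score = PMI(a, b) + min(H_left, H_right); a larger score means the
    pair is more likely to be merged into one multi-word term."""
    total = len(tokens)
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if i > 0:
            left[pair][tokens[i - 1]] += 1        # left neighbour of the pair
        if i + 2 < len(tokens):
            right[pair][tokens[i + 2]] += 1       # right neighbour of the pair

    def entropy(counter):
        s = sum(counter.values())
        return -sum(c / s * math.log(c / s) for c in counter.values()) if s else 0.0

    scores = {}
    for (a, b), c_ab in bigram.items():
        pmi = math.log((c_ab / (total - 1)) /
                       ((unigram[a] / total) * (unigram[b] / total)))
        scores[(a, b)] = pmi + min(entropy(left[(a, b)]), entropy(right[(a, b)]))
    return scores
```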

In the process of establishing word set 2, the mutual information is the logarithm of the degree of solidification, and the left and right information entropies give the degree of freedom. Word set 2 is established as follows (a code sketch is given after the steps):

(1) Counting: count from the corpus the frequency of each character, P_a and P_b, and the co-occurrence frequency of each pair of adjacent characters, P_ab;

(2) Cutting: set a threshold min_prob on the occurrence frequency and a threshold min_pmi on the mutual information; then, wherever P_ab < min_prob or log(P_ab / (P_a · P_b)) < min_pmi in the corpus, cut between the two adjacent characters;

(3) Filtering: after the cutting of step (2), count the frequency P_w' of each candidate word obtained in step (2), and retain only those with P_w' > min_prob;

(4) Redundancy removal: sort the candidate words obtained in step (3) from longest to shortest by number of characters; delete each candidate word from the word bank in turn, segment it using the remaining words and their frequencies, and compute the mutual information between the original word and its sub-words as P_w / (P_w1 · P_w2 · … · P_wk); if this value is greater than 1, restore the word, otherwise keep it deleted and update the frequencies of the sub-words obtained from the segmentation;

(5) Counting: for the words obtained after the redundancy removal of step (4), count over the word set the left information entropy of each word, H_L(w) = -Σ P(a_i|w)·log P(a_i|w) (i = 1, …, n), and the right information entropy, H_R(w) = -Σ P(b_j|w)·log P(b_j|w) (j = 1, …, m), where n and m are the numbers of distinct characters adjacent to the word on the left and right respectively; define the degree of freedom of a text fragment as the smaller of its left and right neighbour information entropies, set a freedom threshold min_pdof, and if the degree of freedom is greater than the threshold, consider the fragment an independent word. Word set 2 is obtained through the above steps.
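The following simplified sketch implements steps (1)-(3) and (5) above (counting, PMI-based cutting, frequency filtering and the freedom check via left and right neighbour entropy); the redundancy-removal step (4) is omitted for brevity, and the threshold values are illustrative.

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    s = sum(counter.values())
    return -sum(c / s * math.log(c / s) for c in counter.values()) if s else 0.0

def discover_words(text, min_prob=1e-5, min_pmi=1.0, min_pdof=1.0):
    if len(text) < 2:
        return set()
    n = len(text)
    p_char = Counter(text)                    # (1) character frequencies
    p_pair = Counter(zip(text, text[1:]))     # (1) adjacent-pair frequencies

    # (2) cut between adjacent characters whose pair frequency or PMI is too low
    pieces, current = [], text[0]
    for a, b in zip(text, text[1:]):
        p_ab = p_pair[(a, b)] / (n - 1)
        pmi = math.log(p_ab / ((p_char[a] / n) * (p_char[b] / n)))
        if p_ab < min_prob or pmi < min_pmi:
            pieces.append(current)
            current = b
        else:
            current += b
    pieces.append(current)

    # (3) keep only candidate words that are frequent enough and longer than one character
    freq = Counter(pieces)
    candidates = {w for w, c in freq.items()
                  if c / len(pieces) > min_prob and len(w) > 1}

    # (5) freedom check: left/right neighbour entropy of each candidate over the text
    left, right = defaultdict(Counter), defaultdict(Counter)
    for w in candidates:
        start = text.find(w)
        while start != -1:
            if start > 0:
                left[w][text[start - 1]] += 1
            end = start + len(w)
            if end < n:
                right[w][text[end]] += 1
            start = text.find(w, start + 1)
    return {w for w in candidates
            if min(entropy(left[w]), entropy(right[w])) > min_pdof}
```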

Seed words are selected from the characteristic corpus words obtained in step two. The power professional field text is segmented using the word set as candidate words, and the segmented power text is then used to train a word vector model (the Word2Vec model) to obtain word vectors. The words are then clustered according to the word vectors produced by the word vector model: clustering starts from the selected seed words and finds a batch of similar words. The algorithm uses similarity transitivity (somewhat like a connectivity-based clustering algorithm): if A is similar to B and B is similar to C, then A, B and C are grouped together (even if A and C do not look similar on their own). Of course, such transfer could eventually traverse the whole vocabulary, so the similarity restriction is progressively strengthened. For example, if A is a seed word and B, C are not, the similarity threshold between A and B may be set to 0.6, while B and C must have a similarity greater than 0.7 to be considered similar. The similarity threshold is computed as Sim_i = k + d × (1 - e^(-d×i)), where i is the number of transfer steps, k is the initial similarity threshold, and d is generally 0.2-0.5.
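A sketch of this seed-word expansion with the rising threshold Sim_i = k + d × (1 - e^(-d×i)) is given below; it assumes gensim word vectors trained as described earlier, and the values of k, d, the search depth and the topn parameter are illustrative.

```python
import math
from collections import deque

def expand_seeds(seeds, model, k=0.6, d=0.3, max_depth=5):
    """Breadth-first expansion of seed words over word2vec similarity.
    A word reached at transfer depth i joins the cluster only if its cosine
    similarity to an already-accepted word exceeds sim_i = k + d*(1 - exp(-d*i))."""
    accepted = set(w for w in seeds if w in model.wv)
    queue = deque((w, 1) for w in accepted)        # (word, transfer depth i)
    while queue:
        word, depth = queue.popleft()
        if depth > max_depth:
            continue
        threshold = k + d * (1 - math.exp(-d * depth))
        # most_similar returns (word, cosine similarity) pairs, best first
        for cand, sim in model.wv.most_similar(word, topn=50):
            if sim >= threshold and cand not in accepted:
                accepted.add(cand)
                queue.append((cand, depth + 1))
    return accepted

# Usage (model trained as in the earlier sketch):
# print(expand_seeds(["变压器", "断路器"], model))
```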

Since the foregoing process is purely unsupervised, even after semantic clustering some non-power professional words are extracted, and some "non-words" are even retained, so filtering by rules is required. Finally, the clustered words are filtered by rules to obtain the final result.
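The filtering rules themselves are not enumerated in the text; the sketch below therefore assumes a few plausible rules (length bounds, no digits or Latin letters, no residual function characters) purely for illustration.

```python
import re

# Hypothetical rule set; the actual filtering rules are not listed in the text.
BAD_CHARS = set("的了和是在")

def rule_filter(words, min_len=2, max_len=8):
    kept = []
    for w in words:
        if not (min_len <= len(w) <= max_len):
            continue                      # too short or too long to be a term
        if re.search(r"[0-9A-Za-z]", w):
            continue                      # drop tokens containing digits or Latin letters
        if any(ch in BAD_CHARS for ch in w):
            continue                      # drop tokens containing function characters
        kept.append(w)
    return kept

print(rule_filter(["变压器故障诊断", "的措施", "abc变压器", "主变"]))
```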

Partial results of each stage of the method are shown below:

(1) Partial word results from segmentation based on frequency, degree of solidification and degree of freedom: power supply, etc.

(2) Partial keyword extraction results of the statistical model based on Jieba word segmentation: coordination, safety measures, oil engine, appropriateness, trip, person on duty, power line, equivalent circuit, flammability, explosiveness, caution, spirality, transformer factory, national grid, detection technology, model, oil stain, gear, magnetic flux, insulation paper, plastic cloth, firm, switch cabinet, DC power supply, electric power company, fire-fighting measures, etc.

(3) Partial keyword extraction results of the Word2Vec clustering model based on Jieba word segmentation: specification, leakage, data, teaching, primary parameters, ammeter, grounding device, high energy, control measures, filter, computer, carbon monoxide, inspection, operation, equipment, circuit breaker, gas, bus, process, personnel, body, operation, trip, direct current, responsible person, interior, iron core, etc.

(4) Partial key-phrase extraction results of the TextRank model based on Jieba word segmentation: power-off protection personnel, in-hole, national grid company, blocking mechanism box indication, meter chromatographic analysis data, load operation, transformer overhaul, outlet short circuit impact, grounding disconnecting link, discharge burning trace, action transformer, manufacturing quality problem, linear algorithm, electric energy loss, sleeve external insulation, contact iron core, dendritic discharge trace, etc.

(5) Partial key-phrase extraction results of the mutual information entropy and left-right information entropy model based on Jieba word segmentation: tap changer, main transformer, low-voltage winding, winding deformation, low-voltage circuit breaker, hanging cover inspection, oil replacement pump, transposed conductor, circuit disconnection, gas production rate, dip transformer, bus voltage, neutral point sleeve, power equipment preventability, vacuum oil filter, dry reactor, cylindrical winding, etc.

(6) Partial final results based on the clustering model: main magnetic flux, sensor, oil conservator, charger, cooler, manufacturer, double bus, transformer, respirator, main transformer, # main transformer, circuit breaker, low voltage circuit breaker, circuit breaker trip, low voltage impedance, high voltage winding, winding deformation, three-phase winding, arc discharge, oil replacement pump, on-load tap changer, bus voltage, transformer fault diagnosis, bus protection device, secondary main switch, insulation pad, fuse cutout, capacitor core, winding capacitance, series reactor, switch core, relay baffle, power cord power tool, AC charger, magnetic current, interference pulse, parallel conductor, bus voltage, characteristic gas, electrical equipment, power switch, DC power, air switch, tap changer, vacuum oil filter, oil paper capacitor sleeve, insulating oil chromatographic analysis, frequency response analysis, light gas protection action, heavy gas protection action, infrared thermal imaging detection, position alarm lamp, pressure pulse analysis, non-excitation tap switch, flammable and explosive articles, infrared thermal imaging, etc.

The statistical model and the Word2Vec word clustering model extract keywords from the Jieba segmentation results; it can be seen that these results still contain non-power words and the word granularity is fine. The mutual information and left-right information entropy model and the TextRank model extract key phrases from the Jieba segmentation results, which mitigates the problems of fine Jieba granularity and wrongly split domain words. The word segmentation model based on information entropy gives accurate segmentation results with good effect. The clustering algorithm aggregates the segmentation results of the above models and then clusters the power-domain words; the clustering results show that non-power words are screened out, and the clustering effect is obvious. The models cooperate with each other, and the word bank finally constructed is complete.

The invention provides a word segmentation method based on information entropy. The power text is segmented using the minimum information entropy principle, which makes the segmentation more accurate. The power text is first processed using frequency and the degree of solidification to screen out candidate words, and the degree of freedom is then used to screen the candidates further, preliminarily establishing a word bank and improving word segmentation accuracy.

The invention provides a method of word recombination. Because the Jieba-based segmentation result has small word granularity, some power professional terms are split apart and non-professional words remain; therefore keyword extraction and word combination are applied to the segmentation result using the statistical model, the Word2Vec clustering model, the left-right information entropy and mutual information entropy model, and the TextRank model, which enriches and improves the word bank.

The invention provides a word clustering method. The clustering rule is to find a batch of similar words starting from several selected seed words; this method can cluster words belonging to the power professional field out of the word bank, thereby enriching and improving the power professional field word bank.

Compared with existing word bank construction methods, the invention has the following advantages:

1. Addressing the defects that Jieba segmentation has small granularity and wrongly splits domain words, the mutual information and left-right information entropy word combination algorithm and the TextRank algorithm perform word combination on the Jieba segmentation result, discovering more power-domain words and alleviating these problems. A joint judgment on mutual information and left-right entropy is designed, making the decision on whether to combine words stricter and improving the accuracy of word combination.

2. The TF-IDF algorithm and the Word2Vec word clustering algorithm weight keywords in the Jieba segmentation result and can extract the important words of the text, i.e. part of the power-domain words. An improved TF-IDF algorithm is designed, which increases the penalty used in keyword extraction and makes keyword extraction more accurate.

3. The information entropy word segmentation algorithm designs three thresholds, namely word frequency, mutual information, and left-right information entropy, which makes the word-formation judgment stricter and improves segmentation accuracy; its results are aggregated with and supplement the segmentation results of the other models, establishing a more complete candidate word bank.

4. The word clustering algorithm clusters domain words: it groups the power-domain words in the segmentation results and filters out non-power-domain words, reducing the manual workload and making word bank construction simpler, more convenient, more complete and better.
