Text keyword extraction method fusing text structure information and semantic information

Document No.: 1831799. Publication date: 2021-11-12.

Reading note: This technology, "Text keyword extraction method fusing text structure information and semantic information", was designed by Chen Xue, Wang Xiaofei, and Wang Peng (陈雪, 王小飞, 王鹏) on 2021-07-19. The invention discloses a text keyword extraction method that fuses text structure information and semantic information, comprising the following steps: 1) recombining the paragraphs of a single text to form a new text; 2) preprocessing the new text, including word segmentation, part-of-speech tagging, and stop word removal, and keeping nouns and verbs as candidate keywords; 3) calculating the structural weight of each candidate keyword; 4) calculating the semantic weight of each candidate keyword; 5) calculating the weight of each candidate keyword from the structural weight obtained in step 3) and the semantic weight obtained in step 4), and selecting the K candidate keywords with the highest weight as the keywords of the text. Because the method exploits the structural and semantic characteristics of the text itself, it needs neither a domain corpus nor an iterative convergence computation, which makes it simple and more effective.

1. A text keyword extraction method fusing text structure information and semantic information, characterized by comprising the following steps:

1) recombining paragraphs of a single text to form a new text;

2) preprocessing the new text, including word segmentation, part-of-speech tagging and stop word removal, and keeping nouns and verbs as candidate keywords;

3) calculating the structural weight of each candidate keyword;

4) calculating the semantic weight of each candidate keyword;

5) calculating the weight of each candidate keyword according to the structural weight obtained in the step 3) and the semantic weight obtained in the step 4), and selecting K candidate keywords with the highest weight as the keywords of the text.

2. The method for extracting text keywords fusing text structure information and semantic information according to claim 1, wherein the paragraphs of the text are recombined in step 1) as follows: the title of the original text is used as the first paragraph of the new text; the first and last paragraphs of the original text are used as the second and third paragraphs of the new text, respectively; the remaining paragraphs of the original text follow in their original order; and the new text has n paragraphs in total.

3. The method for extracting text keywords fusing text structure information and semantic information according to claim 1, wherein the structural weight of each candidate keyword is calculated in step 3); for a candidate keyword v_i, its structural weight str(v_i, k) is calculated as follows:

where i ≤ m, and m is the number of candidate keywords in the text; k denotes the k-th paragraph of the text (k = 1, …, n); and freq(v_i, k) denotes the word frequency of candidate keyword v_i in the k-th paragraph.

4. The method for extracting text keywords fusing text structure information and semantic information according to claim 1, wherein the semantic weight of each candidate keyword is calculated in step 4); for a candidate keyword v_i, its semantic weight sem(v_i, k) denotes the co-occurrence frequency of v_i with the other candidate keywords v_j in the k-th paragraph, where i ≤ m and j ≤ m.

5. The method for extracting text keywords fusing text structure information and semantic information according to claim 1, wherein the weight of each candidate keyword is calculated in step 5); for a candidate keyword v_i, the weight is calculated as follows:

Technical Field

The invention relates to a text keyword extraction method fusing text structure information and semantic information, and in particular to a method that takes the text title as the first paragraph of the text, reorders the text according to the importance of its natural paragraphs, and extracts keywords by accumulating the structural weight and the semantic weight of each candidate keyword paragraph by paragraph.

Background

Text feature extraction is one of the most basic and important problems in natural language processing. The main approaches are statistics-based and neural-network-based text feature extraction. The statistics-based methods include TF-IDF, TextRank, and RAKE.

TF-IDF computes a word's weight as the product of its term frequency (TF) and its inverse document frequency (IDF). The method is simple and effective, but it depends on a text collection and cannot be computed from a single text alone; the quality of that collection is a key factor in the accuracy of the extracted keywords.
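The dependence on a text collection can be seen in a minimal sketch. The function name `tf_idf`, the unsmoothed logarithmic IDF, and the toy corpus are illustrative assumptions, not part of the patent:

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document.

    docs: list of token lists (already segmented).
    Assumption: plain tf * log(N / df), without smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights


corpus = [
    ["keyword", "extraction", "text"],
    ["text", "classification"],
    ["image", "segmentation"],
]
scores = tf_idf(corpus)
single = tf_idf([["keyword", "extraction", "text"]])
```

With a one-document collection every IDF term is log(1) = 0, so all weights in `single` vanish, which is the limitation described above.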

TextRank is a graph ranking algorithm derived from PageRank, the web page importance ranking algorithm. The text is converted into a graph whose nodes are words, the similarity between words is used as the edge weight, and the words are ranked by the TextRank values obtained through iteration. The method must iterate until convergence and therefore has higher complexity.
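The iterative character of this approach can be illustrated with a small sketch. It assumes that the edge weight is the number of co-occurrences within a sliding window (standing in for the "similarity" mentioned above), a damping factor of 0.85, and hypothetical names such as `textrank` and `window`:

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, tol=1e-6, max_iter=100):
    """Rank words with a PageRank-style iteration on a co-occurrence graph."""
    # Undirected graph; edge weight = co-occurrence count (an assumption).
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for u in tokens[i + 1 : i + 1 + window]:
            if u != w:
                graph[w][u] += 1.0
                graph[u][w] += 1.0
    if not graph:
        return {}
    scores = {w: 1.0 for w in graph}
    for _ in range(max_iter):
        new_scores = {}
        for w in graph:
            # Each neighbour u passes on its score in proportion to edge weight.
            rank = sum(
                graph[u][w] / sum(graph[u].values()) * scores[u]
                for u in graph[w]
            )
            new_scores[w] = (1 - damping) + damping * rank
        converged = max(abs(new_scores[w] - scores[w]) for w in graph) < tol
        scores = new_scores
        if converged:
            break
    return scores


ranks = textrank(["graph", "ranking", "graph", "word", "ranking", "graph"])
```

The loop over `max_iter` is the convergence step that the method of the invention avoids.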

The RAKE algorithm uses a dedicated stop word list to extract English phrases rather than single words, and weights each word by the ratio of its co-occurrence degree to its word frequency. Its accuracy on English text is therefore higher than on Chinese text.
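A rough sketch of this scoring follows. It assumes the commonly cited RAKE formulation, in which a word scores deg(w) / freq(w) and a phrase scores the sum of its word scores; the function name `rake_scores` and the tiny stop word list are hypothetical:

```python
from collections import defaultdict


def rake_scores(tokens, stop_words):
    """Score candidate phrases obtained by splitting the tokens at stop words."""
    phrases, current = [], []
    for tok in tokens:
        if tok in stop_words:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        phrases.append(current)

    freq = defaultdict(int)    # freq(w): number of occurrences of w
    degree = defaultdict(int)  # deg(w): co-occurrences within phrases, itself included
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    # A phrase scores the sum of its member word scores.
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}


sentence = "keyword extraction of text is a basic task of text mining".split()
phrase_scores = rake_scores(sentence, stop_words={"of", "is", "a"})
```

Splitting at stop words relies on whitespace-delimited words, which is one reason the approach transfers poorly to unsegmented Chinese text.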

The main idea of neural-network-based text feature extraction is to represent the words of a text with word vectors trained by a neural network, cluster the word vectors with a clustering algorithm, and select the top-N cluster centers as the text keywords. Training the word vector model requires a large corpus to achieve good results, so the cost is high.

Disclosure of Invention

The invention aims to address the shortcomings of the existing TF-IDF, TextRank, and RAKE methods by providing a text keyword extraction method that fuses text structure information and semantic information. Specifically, the method takes the text title as the first paragraph of the text, reorders the text according to the importance of its natural paragraphs, and extracts keywords by accumulating the structural weight and the semantic weight of each candidate keyword paragraph by paragraph. The method works on a single text, requires no auxiliary domain corpus, no iterative loop, and no large-scale training set.

In order to achieve the purpose, the invention adopts the following technical scheme:

A text keyword extraction method fusing text structure information and semantic information comprises the following steps:

1) recombining paragraphs of a single text to form a new text;

2) preprocessing the new text, including word segmentation, part-of-speech tagging and stop word removal, and keeping nouns and verbs as candidate keywords;

3) calculating the structural weight of each candidate keyword;

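4) calculating the semantic weight of each candidate keyword;

5) calculating the weight of each candidate keyword according to the structural weight obtained in step 3) and the semantic weight obtained in step 4), and selecting the K candidate keywords with the highest weight as the keywords of the text.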

The paragraphs of the text are recombined in step 1) as follows: the title of the original text is used as the first paragraph of the new text; the first and last paragraphs of the original text are used as the second and third paragraphs of the new text, respectively; the remaining paragraphs of the original text follow in their original order; and the new text has n paragraphs in total.
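The recombination in step 1) can be sketched as follows. The function name `recombine` and the handling of texts with fewer than two paragraphs are assumptions; the source does not cover that case:

```python
def recombine(title, paragraphs):
    """Title first, then the first and last paragraphs, then the rest in order."""
    if len(paragraphs) < 2:
        # Assumption: with fewer than two paragraphs there is no distinct tail.
        return [title] + list(paragraphs)
    head, *middle, tail = paragraphs
    return [title, head, tail] + middle


new_text = recombine("T", ["p1", "p2", "p3", "p4"])
```

For an original text with four paragraphs, the new text has n = 5 paragraphs in the order title, p1, p4, p2, p3.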

In step 3), the structural weight of each candidate keyword is calculated; for a candidate keyword v_i, its structural weight str(v_i, k) is calculated as follows:

In step 4), the semantic weight of each candidate keyword is calculated; for a candidate keyword v_i, its semantic weight sem(v_i, k) denotes the co-occurrence frequency of v_i with the other candidate keywords v_j in the k-th paragraph, where i ≤ m and j ≤ m.
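The two per-paragraph counts that these weights rest on, the word frequency of v_i in paragraph k and its co-occurrence frequency with other candidates in paragraph k, can be sketched as below. The source does not define co-occurrence precisely, so the sliding window, its size, and the function names are assumptions; the sketch only produces the counts and does not combine them into the patent's weight formulas:

```python
from collections import Counter


def paragraph_frequencies(paragraphs):
    """freq[k][v]: occurrences of candidate v in paragraph k (0-based here)."""
    return [Counter(tokens) for tokens in paragraphs]


def paragraph_cooccurrence(paragraphs, window=3):
    """cooc[k][v]: co-occurrences of v with other candidates in paragraph k.

    Assumption: two candidates co-occur when they lie at most
    `window - 1` positions apart within the same paragraph.
    """
    result = []
    for tokens in paragraphs:
        counts = Counter()
        for i, v in enumerate(tokens):
            for u in tokens[i + 1 : i + window]:
                if u != v:
                    counts[v] += 1
                    counts[u] += 1
        result.append(counts)
    return result


# Each paragraph is a list of candidate keywords left after preprocessing.
paras = [["keyword", "extraction"], ["text", "keyword", "weight", "text"]]
freq = paragraph_frequencies(paras)
cooc = paragraph_cooccurrence(paras)
```

Both counts need only a single pass over each paragraph, in line with the claim that no iterative computation is required.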

In step 5), the weight of each candidate keyword is calculated; for a candidate keyword v_i, the weight is calculated as follows:

Compared with the prior art, the text keyword weight calculation method of the invention has the following notable advantages:

The method needs no domain corpus and extracts keywords from a single text alone; it requires neither an iterative convergence process for word weights nor training on a large-scale training set. The method is therefore simple to apply and effective.

Drawings

Fig. 1 is a flowchart of a text keyword extraction method fusing text structure information and semantic information according to the present invention.

Detailed Description

The embodiments of the present invention will be further described with reference to the accompanying drawings.

In an embodiment of the present invention, a total of 1000 papers in 10 fields, 100 per field, were retrieved and downloaded from CNKI (https://www.cnki.net/). The 10 fields are: machine learning, computer vision, system architecture, astronomy, physics, music, electricity, economy, public health, and geography. Each downloaded paper carries its own keywords, which serve as the evaluation reference.

As shown in fig. 1, a method for extracting text keywords by fusing text structure information and semantic information includes the following specific steps:

1) Recombining the paragraphs of a single text to form a new text: the title of the original text is used as the first paragraph of the new text; the first and last paragraphs of the original text are used as the second and third paragraphs of the new text, respectively; the remaining paragraphs of the original text follow in their original order; and the new text has n paragraphs in total.

2) Preprocessing the new text, including word segmentation, part-of-speech tagging and stop word removal, and keeping nouns and verbs as candidate keywords;

3) Calculating the structural weight of each candidate keyword; for a candidate keyword v_i, its structural weight str(v_i, k) is calculated as follows:

4) Calculating the semantic weight of each candidate keyword; for a candidate keyword v_i, its semantic weight sem(v_i, k) denotes the co-occurrence frequency of v_i with the other candidate keywords v_j in the k-th paragraph, where i ≤ m and j ≤ m.

5) Calculating the weight of each candidate keyword according to the structural weight obtained in step 3) and the semantic weight obtained in step 4); for a candidate keyword v_i, the weight is calculated as follows:

and selecting the K candidate keywords with the highest weight as the keywords of the text.
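The candidate filtering in step 2) and the final selection in step 5) can be sketched as follows. The sketch assumes that word segmentation and part-of-speech tagging have already produced (word, tag) pairs, and that noun tags begin with "n" and verb tags with "v"; the function names, the tag convention, and the example weights are hypothetical, and the weights themselves would come from the formulas of steps 3) to 5):

```python
def candidate_keywords(tagged_tokens, stop_words):
    """Keep nouns and verbs that are not stop words.

    Assumption: noun tags start with "n" and verb tags start with "v".
    """
    return [
        word
        for word, tag in tagged_tokens
        if word not in stop_words and tag[:1] in ("n", "v")
    ]


def top_k(weights, k):
    """Return the k highest-weighted candidates (ties broken alphabetically)."""
    return sorted(weights, key=lambda w: (-weights[w], w))[:k]


tagged = [("text", "n"), ("the", "u"), ("extract", "v"),
          ("keyword", "n"), ("is", "v"), ("quickly", "d")]
candidates = candidate_keywords(tagged, stop_words={"the", "is"})

# Hypothetical weights, standing in for the output of steps 3) to 5).
example_weights = {"text": 0.9, "extract": 0.4, "keyword": 0.9}
keywords = top_k(example_weights, k=2)
```

The alphabetical tie-break is only there to make the result deterministic; the source does not say how ties are resolved.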

The keywords carried by each paper are taken as the standard set. Since the number of keywords per paper is not fixed, the accuracy index is defined as the percentage of the TOP-K keywords extracted by TF-IDF, RAKE, TextRank, and the present invention, respectively, that belong to the standard keyword set. The accuracy was calculated separately for each of the 10 fields. Table 1 shows the accuracy of the TOP-5 keywords extracted by the four methods. Table 2 shows the accuracy of the TOP-10 keywords extracted by the four methods.
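The accuracy index for a single paper can be written as a short sketch. The function name `top_k_accuracy`, the example keyword lists, and the choice to divide by the number of keywords actually returned when fewer than K are available are assumptions:

```python
def top_k_accuracy(extracted, reference, k):
    """Share of the top-k extracted keywords that appear in the reference set."""
    top = extracted[:k]
    if not top:
        return 0.0
    reference = set(reference)
    return sum(1 for word in top if word in reference) / len(top)


# Hypothetical example: 2 of the 5 extracted keywords are in the standard set.
extracted = ["keyword", "text", "weight", "method", "paper"]
standard_set = {"keyword", "weight", "semantics"}
accuracy = top_k_accuracy(extracted, standard_set, k=5)
```

Here `accuracy` is 2 / 5, since two of the five extracted keywords belong to the standard set.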

TABLE 1 accuracy of TOP-5 keywords for four methods

TABLE 2 accuracy of TOP-10 keywords for four methods

As can be seen from Tables 1 and 2, TF-IDF is affected by the other texts in the domain corpus. The fields searched on CNKI in this experiment are broad, so the texts within each field's collection are not similar enough and the IDF values are not accurate enough. RAKE performs poorly on Chinese text. TextRank performs well, but its iterative computation is complex. The present method achieves the highest accuracy, and its accuracy gradually improves as the number of extracted keywords increases.

The embodiments of the present invention have been described above with reference to the accompanying drawings, but the invention is not limited to these embodiments. Various changes may be made according to the purpose of the invention. Any change, modification, substitution, combination, or simplification made in accordance with the spirit and principle of the technical solution shall be regarded as an equivalent substitution and, provided it meets the purpose of the invention and does not depart from its technical principle and inventive concept, falls within the protection scope of the invention.
