Method and device for automatically extracting text keywords based on co-occurrence language network

Document No.: 1170288 | Publication date: 2020-09-18

Summary: this technique, "Method and device for automatically extracting text keywords based on co-occurrence language network" (基于共现语言网络的文本关键词自动抽取方法和装置), was created by 刘斌, 王维, 赵火军 and 聂常赟 on 2020-06-10. The invention discloses a method and a device for automatically extracting text keywords based on a co-occurrence language network. It removes the need for large amounts of manually labeled data in supervised machine learning, overcomes the weak generalization ability of linguistic analysis methods, and avoids the tendency of statistical methods to overlook keywords that are infrequent but important. The method comprises preprocessing the webpage, constructing a language network graph model, jointly extracting candidate keyword features, and comprehensively ranking the candidate keyword features to output the keywords. Through web text preprocessing, construction of the co-occurrence language network model, joint keyword feature extraction, and ranking-based selection of candidate keywords, the extracted keywords have good readability, coherence and relevance, and the method can be widely applied in natural language processing, information retrieval, text mining, sentiment analysis, multimodal human-computer interaction and other fields.

1. A method for automatically extracting text keywords based on a co-occurrence language network, characterized by comprising the following steps:

S1: preprocessing a webpage:

the webpage preprocessing comprises decomposing the web text into clauses, then performing word segmentation and part-of-speech tagging to obtain a standard formatted text that is split into sentences and words and contains no stop words; generating a candidate keyword set; analyzing each pair of adjacent candidate keywords within the same clause according to the word-formation rules for compound words to determine whether to merge them into a new candidate keyword or phrase; and adding the new candidate keyword or phrase to the candidate keyword set;

S2: constructing a language network graph model:

selecting candidate keywords from the candidate keyword set of S1 and representing each candidate keyword as a node in the graph model; establishing a weighted edge between the nodes of any two candidate keywords that appear in the same sentence with an interval not greater than 1; and establishing the language network graph model from three items: the nodes, the edges, and the edge weights;

S3: jointly extracting the features of the candidate keywords:

performing feature calculation on the nodes of the language network graph model of S2 using joint extraction of discourse features and language network graph model features;

for the discourse features part of speech and word frequency in the web text, setting part-of-speech weights and computing word-frequency statistics; for the discourse feature word span, distinguishing inner spacing from outer spacing by the mean spacing and calculating the entropy of the inner spacing and of the outer spacing, wherein the inner spacing is the distance between occurrences of a keyword within a key paragraph, and the outer spacing is the distance between occurrences of the keyword across multiple paragraphs;

calculating the decentrality of the nodes in the language network graph model, wherein the decentrality is the reciprocal of the distance from each node to a central node, and the node with the maximum degree centrality is the central node;

calculating the closeness centrality of the nodes in the language network graph model, which is the reciprocal of the sum of the distances between the node and the other nodes;

calculating the strength of each node in the language network graph model and, at the same time, the strength of its adjacent nodes, wherein the adjacent nodes are the nodes adjacent to that node;

S4: comprehensively ranking the features of the candidate keywords and outputting the keywords: normalizing the six features of S3, namely part of speech, word frequency, word span, node decentrality, node closeness centrality and adjacent node strength; establishing a linear model comprising the six features; loading the six normalized features into the linear model; arranging the candidate keywords in descending order of their feature values; and screening out the keywords according to the descending-order result.

2. The method for automatically extracting text keywords based on a co-occurrence language network according to claim 1, wherein the detailed steps of the webpage preprocessing in S1 are as follows:

web page standardization: correcting syntax errors in the webpage HTML file, and removing invalid tags and tag attributes;

clause decomposition: taking the web text as the language processing unit, preliminarily decomposing it at punctuation marks to obtain a clause set;

word segmentation and part-of-speech tagging: applying a word segmentation system to the clauses in the clause set to perform word segmentation and part-of-speech tagging, and then removing stop words, wherein the word segmentation system includes but is not limited to ICTCLAS and LTP;

generating a candidate keyword set: merging pairs of adjacent candidate keywords within the same clause according to the word-formation rules for compound words to form new compound candidate keywords or phrases.

3. The method according to claim 1, wherein the language network graph model $G$ is built in the form of a co-occurrence network graph, $G = (V, E, W)$, in which a node $V$ represents a candidate keyword and $E$ represents an edge between two nodes; $i$ and $j$ denote two candidate keywords; when candidate keyword $i$ and candidate keyword $j$ appear in the same sentence with an interval not greater than 1, an edge in $E$ connects them; $W_{ij}$ is the weight of that edge and is calculated as follows:

[formula not reproduced in the source]

wherein $f(i, j)$ represents the frequency with which candidate keywords $i$ and $j$ co-occur, and $f(i)$ and $f(j)$ represent the frequencies with which candidate keywords $i$ and $j$ respectively occur in the text.

4. The method according to claim 3, wherein the input of the language network graph model $G$ is the web text content processed in S1, the candidate keywords in each clause are traversed in sequence, any two candidate keywords with an interval not greater than 1 are connected to form an edge, and the edge is placed in the edge set $E$; if the edge already exists in $E$, its weight is increased by one.

5. The method for automatically extracting text keywords based on a co-occurrence language network according to claim 4, wherein the detailed steps of S3 are as follows:

the weight of the part of speech $POS_i$ is set as follows:

[formula not reproduced in the source]

for a word $t_i$ that occurs $n_{ij}$ times in document $D_j$, the word frequency $TF_i$ of the word is calculated as follows:

[formula not reproduced in the source]

the word span is computed as follows:

when a word that occurs $m$ times appears at positions $t_1, t_2, \ldots, t_m$, the word spacing is $d_i = t_{i+1} - t_i$;

judging: if the word spacing $d_i$ is less than the mean spacing $\mu$, it is classified as inner spacing; if the word spacing $d_i$ is not less than the mean spacing $\mu$, it is classified as outer spacing; the inner spacing entropy and the outer spacing entropy are then calculated respectively:

$H(d_i) = -\sum P_d \log_2 P_d$

wherein $P_d$ is the probability that a word appears at position $d$,

and the entropy difference between the inner spacing and the outer spacing is obtained as $ED^2(d) = (H(d_I))^2 - (H(d_E))^2$, wherein $H(d_I)$ and $H(d_E)$ denote the inner spacing entropy and the outer spacing entropy respectively;

the node decentrality is calculated as follows:

[formula not reproduced in the source]

first, the node with the maximum degree centrality is taken as the central node $c$, and then the distance $d_{ic}$ from each node $i$ to the central node $c$ is calculated;

the node closeness centrality is calculated as follows:

[formula not reproduced in the source]

the adjacent node strength $SNeigh_i$ is calculated as follows:

[formula not reproduced in the source]

wherein $n$ is the number of adjacent nodes.

6. The method for automatically extracting text keywords based on a co-occurrence language network according to claim 5, wherein the detailed steps of S4 are as follows:

the linear model is as follows:

[formula not reproduced in the source]

the importance of each candidate keyword is evaluated with the linear model, and the keywords are screened out after sorting in descending order.

7. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.

Technical Field

The invention relates to the field of keyword extraction, and in particular to a method and a medium for automatically extracting text keywords based on a co-occurrence language network.

Background

With the rapid development of network technology and the arrival of the big data era, massive amounts of web text are generated in cyberspace. Keyword extraction is a fundamental task in the analysis of large text collections and has important practical significance. Keyword extraction, also called keyword labeling, refers to extracting words or phrases related to the subject of a text from one or more texts. In the early days, because full-text retrieval was not supported, authors were required to assign keywords to their papers manually so that the papers could be retrieved by keyword. Manual keyword labeling cannot cope with today's volume of text data, and automatic keyword extraction emerged in response. According to the existing literature, current keyword extraction techniques fall into three main categories: linguistic analysis methods, statistical learning methods, and machine learning methods.

Keyword extraction based on linguistic analysis was first applied in library science, information science, natural language processing and related fields, where keywords were initially extracted by hand. The method extracts keywords that represent the meaning of a document by performing grammatical analysis (both lexical and syntactic) and semantic analysis on the text with reference to an authoritative dictionary.

Keyword extraction based on statistical learning obtains statistical features (word frequency, TF-IDF, N-gram, word co-occurrence, centrality and so on) from a large corpus and then builds a classifier to extract keywords. The approach was first proposed in 1957 by H. P. Luhn of IBM, who used word frequency as a statistical feature to index literature automatically; scholars in China and abroad regard this as the beginning of keyword extraction research. Later, H. P. Edmundson implemented the first program capable of extracting keywords from documents, marking a new stage of research in this area. The KEA system trains its classifier with Naive Bayes (NB), segments words at spaces, punctuation marks and line breaks, and uses TF-IDF and the position of a candidate word's first occurrence as training features. GenEx trains its classifier with a Decision Tree (DT) algorithm and uses word frequency and part of speech as features. These two keyword extraction systems became the baselines against which later improved methods are compared. In 2003, Song et al. built the KPSpotter system, which adds an information gain feature to KEA and uses WordNet to improve the accuracy of keyword extraction. Trieschnigg et al. constructed a Support Vector Machine (SVM) classification model that extracts keywords using word frequency, part of speech, the position of a candidate word's first occurrence, and TF-IDF as features.

Keyword extraction based on machine learning can be divided into supervised and unsupervised learning, according to whether samples must be labeled manually. A supervised method first labels a large number of keywords in texts by hand to form a training set, then trains a classifier on that set to obtain a classification model, and finally uses the model to extract keywords from new texts. Supervised methods treat all words in a text as candidate keywords and use a classification algorithm to decide whether each candidate is a keyword. Turney considers every word in the text a potential keyword, but only words that match manually specified keywords or phrases count as correct keywords. Turney combined heuristic rules with a genetic algorithm to extract keywords and developed the GenEx system on that basis. An unsupervised method generally first defines a quantitative importance index for keywords, then ranks the candidate words, and finally selects the top k as the keywords of the text. Unsupervised keyword extraction generally uses statistics-based, topic-based, or network-graph-based methods.

A review of the relevant literature and of related techniques in the industry, followed by analysis and comparison, leads to the following findings:

(1) The linguistic analysis method uses grammatical (lexical and syntactic) and semantic analysis. It is simple to apply and can improve the quality of keyword extraction, but it requires maintaining an additional word list or dictionary, requires grammatically well-formed text, generalizes poorly, and suffers from the low accuracy of grammatical and semantic analysis.

(2) The statistical method is simple to implement, is not tied to a specific application field, generalizes well, and places no requirements on text quality. However, its precision is low, and it easily overlooks words or phrases that are important but occur infrequently in the document. It focuses on surface-level statistical features of the document and ignores intra-sentence and inter-sentence structure and semantic information, so the structural and semantic information of keywords is lost.

(3) Among machine learning methods, supervised learning achieves high precision and places no requirements on text quality, but it depends on a specific field, needs a large number of manually labeled training samples, and training multiple classifiers is time-consuming. Labeling keywords for the training samples takes considerable labor and time, the process is complex, and the quality of keyword extraction depends on the quantity and quality of the training samples. Compared with supervised keyword extraction, unsupervised and semi-supervised keyword extraction have therefore become a research focus in recent years.

Disclosure of Invention

To address the shortcomings of existing keyword extraction for web text, the invention builds a co-occurrence language network graph model based on the small-world network property of words in natural language discovered by R. Ferrer i Cancho et al. and on the adjacency relations among words. The connections between nodes in the network correspond to the relations between words. The invention extracts discourse features and network graph features, determines the important nodes in the graph with a node importance algorithm for network topology graphs, extracts the words corresponding to those nodes, and finally generates the keywords of the document with a ranking-based method.

The method removes the need for large amounts of manually labeled data in supervised machine learning, overcomes the weak generalization ability of linguistic analysis methods, and avoids the tendency of statistical methods to overlook keywords that are infrequent but important.

The invention aims to construct a co-occurrence language network graph for web text, to consider jointly the centrality features of candidate word nodes in that graph, and to use feature indexes such as word frequency, position and neighbor node weight to calculate the importance of the candidate word nodes and rank the candidate words by weight so as to screen out the keywords. It provides a method and a medium for automatically extracting text keywords based on a co-occurrence language network, achieving unsupervised automatic extraction of text keywords.

The invention is realized by the following technical scheme:

The method for automatically extracting text keywords based on the co-occurrence language network comprises the following steps:

S1: preprocessing a webpage:

to improve the accuracy with which a focused crawler judges the relevance of web page content to a topic, web text preprocessing performs operations such as web page cleaning, word segmentation and feature extraction on the page without losing useful information; this stage yields a standard formatted text that is split into sentences and words and contains no stop words. The main steps are: decomposing the web text into clauses; performing word segmentation and part-of-speech tagging to obtain the stop-word-free formatted text; and merging pairs of adjacent candidate keywords within the same clause into new candidate keywords or phrases, which are collected in the candidate keyword set, where a candidate keyword is a word or phrase that may be selected as a keyword;

web page standardization: correcting syntax errors in the webpage HTML file, and removing invalid tags and tag attributes;

clause decomposition: taking the web text as the language processing unit, preliminarily decomposing it at punctuation marks to obtain a clause set;

word segmentation and part-of-speech tagging: applying a word segmentation system, such as ICTCLAS or LTP, to the clauses in the clause set to perform word segmentation and part-of-speech tagging, and then removing stop words;

generating a candidate keyword set: merging pairs of adjacent candidate keywords within the same clause according to the word-formation rules for compound words, such as the Chinese word-formation rules listed in "Advanced Chinese" by Wang Jianjun, to form new compound candidate keywords or phrases.
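The clause decomposition and stop-word removal of S1 can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the stop-word list, the punctuation set and the function names are hypothetical, and whitespace tokenization stands in for a real word segmentation system such as ICTCLAS or LTP. HTML cleaning, part-of-speech tagging and compound-word merging are not covered.

```python
import re

# Hypothetical stop-word list; a real system would load a full one.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

# Assumed set of clause-ending punctuation (Western and Chinese).
CLAUSE_DELIMS = r"[,.;:!?，。；：！？]"


def split_clauses(text):
    """Split raw text into clauses at punctuation marks."""
    parts = re.split(CLAUSE_DELIMS, text)
    return [p.strip() for p in parts if p.strip()]


def candidates_in_clause(clause):
    """Tokenize one clause and drop stop words.

    Whitespace splitting is a stand-in for a real segmenter.
    """
    tokens = clause.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(text):
    """Return one list of candidate words per non-empty clause."""
    clauses = (candidates_in_clause(c) for c in split_clauses(text))
    return [c for c in clauses if c]
```

Keeping the candidates grouped by clause preserves the information that the later graph construction needs, since edges are only created between candidates from the same clause.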

S2: constructing a language network graph model:

selecting candidate keywords from the candidate keyword set of S1, each node representing one candidate keyword; connecting two nodes by an edge when the candidate keywords they represent appear in the same sentence with an interval not greater than 1; and establishing the language network graph model from three items: the nodes, the edges, and the edge weights;

the language network graph model $G$ is built in the form of a co-occurrence network graph, $G = (V, E, W)$, in which a node $V$ represents a candidate word and $E$ represents an edge between two nodes; $i$ and $j$ denote two candidate keywords; when candidate keyword $i$ and candidate keyword $j$ appear in the same sentence with an interval not greater than 1, an edge in $E$ connects them; $W_{ij}$ is the weight of that edge and is calculated as follows:

[formula not reproduced in the source]

To distinguish the relative importance of the nodes, only nodes with an interval not greater than 1 are connected to form edges. The input is the text content processed by the multilevel word segmentation strategy, and the output is the co-occurrence network graph $G$.

The specific method is: the candidate keywords in each clause are traversed in sequence, any two candidate keywords with an interval not greater than 1 are connected to form an edge, and the edge is placed in the edge set $E$; if the edge already exists in $E$, its weight is increased by one.
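The traversal described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: "interval not greater than 1" is read here as "at most one other candidate between the two words", exposed as the hypothetical parameter `max_gap`, and the edge weight is the raw co-occurrence count obtained by adding one per repeated edge. The separate weight formula based on $f(i, j)$, $f(i)$ and $f(j)$ is not reproduced in the source and is therefore not implemented.

```python
from collections import defaultdict


def build_cooccurrence_graph(clauses, max_gap=1):
    """Build an undirected weighted co-occurrence graph.

    clauses: list of lists of candidate keywords, in text order.
    max_gap: assumed reading of "interval not greater than 1",
             i.e. the number of candidates allowed between a pair.
    Returns {(word_a, word_b): weight} with each pair sorted.
    """
    weights = defaultdict(int)
    for words in clauses:
        for pos, word in enumerate(words):
            # Partners are the next 1 + max_gap candidates in the clause.
            for other in words[pos + 1 : pos + 2 + max_gap]:
                if other != word:  # no self-loops
                    edge = tuple(sorted((word, other)))
                    weights[edge] += 1  # existing edge: weight plus one
    return dict(weights)
```

Storing each edge as a sorted pair makes the graph undirected, so `("a", "b")` and `("b", "a")` accumulate into the same weight.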

S3: jointly extracting the features of the candidate keywords:

to obtain high-quality keywords with readability, coherence and relevance, joint extraction of discourse features and language network graph features is adopted.

Feature calculation is performed on the nodes of the language network graph model of S2 using joint extraction of discourse features and language network graph model features;

for the discourse features part of speech and word frequency of the nodes in the web text, part-of-speech weights are set and word-frequency statistics are computed; for the discourse feature word span, inner spacing is distinguished from outer spacing by the mean spacing and the entropy of the inner spacing and of the outer spacing is calculated, wherein the inner spacing is the distance between occurrences of a keyword within a key paragraph, and the outer spacing is the distance between occurrences of the keyword across multiple paragraphs;

the closeness centrality of each node in the language network graph model is calculated as the reciprocal of the sum of the distances between the node and the other nodes;

the detailed calculation steps are as follows:

according to related research results, nouns and verbal nouns account for the largest share of keywords, followed by adjectives and adverbs. Part of speech is therefore taken as a feature, corresponding weights are set for nouns, verbs, adjectives and other parts of speech according to prior research experience, and the weight of the part of speech $POS_i$ is set as follows:

[formula not reproduced in the source]

based on the premise that keywords reflecting the theme of a document inevitably appear in it many times, word frequency is taken as an important feature; for a word $t_i$ that occurs $n_{ij}$ times in document $D_j$, the word frequency $TF_i$ of the word is calculated as follows:

[formula not reproduced in the source]

research has found that words expressing the author's writing intention are distributed as follows: keywords have small spacing within key paragraphs and large spacing across multiple paragraphs, while irrelevant terms are randomly distributed throughout the document. On this basis, the difference between the entropy of a word's inner spacing and that of its outer spacing is used in the industry to extract keywords. Here, inner spacing is distinguished from outer spacing by the mean spacing, and the word span is computed as follows:

when a word that occurs $m$ times appears at positions $t_1, t_2, \ldots, t_m$, the word spacing is $d_i = t_{i+1} - t_i$;

$H(d_i) = -\sum P_d \log_2 P_d$

wherein $P_d$ is the probability that a word appears at position $d$, computed relative to the length of the document [formula not reproduced in the source];
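A minimal sketch of the word span feature follows, combining the spacing definition above with the inner/outer split by mean spacing $\mu$ and the entropy difference $ED^2(d) = (H(d_I))^2 - (H(d_E))^2$ stated in claim 5. One point is an assumption: because the expression for $P_d$ is not reproduced in the source, the sketch estimates it as the empirical frequency of each spacing value within its group. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of values."""
    if not values:
        return 0.0
    total = len(values)
    counts = Counter(values).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def span_entropy_difference(positions):
    """Return ED^2 = H(inner)^2 - H(outer)^2 for one word.

    positions: token positions at which the word occurs.
    """
    positions = sorted(positions)
    spacings = [b - a for a, b in zip(positions, positions[1:])]
    if not spacings:  # a word occurring once has no spacing
        return 0.0
    mu = sum(spacings) / len(spacings)
    inner = [d for d in spacings if d < mu]   # below the mean spacing
    outer = [d for d in spacings if d >= mu]  # not below the mean spacing
    return entropy(inner) ** 2 - entropy(outer) ** 2
```

For example, a word at positions 1, 2, 3, 10 and 20 has spacings 1, 1, 7 and 10 with mean 4.75, so the inner group is {1, 1} and the outer group is {7, 10}.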

the node decentrality is calculated as follows:

[formula not reproduced in the source]

first, the node with the maximum degree centrality is taken as the central node $c$, and then the distance $d_{ic}$ from each node $i$ to the central node $c$ is calculated;
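The two steps just described, choosing the highest-degree node as the centre and measuring each node's distance to it, can be sketched as below. Several choices are assumptions rather than part of the source: distance is taken as the unweighted hop count found by breadth-first search, the centre itself (distance 0) is given the score 1.0, and unreachable nodes are given 0.0. The score for other nodes is the reciprocal of the distance, as stated in claim 1.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop-count distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def decentrality(adj):
    """Return (central node, {node: 1 / distance to central node}).

    adj: {node: set of neighbouring nodes}.
    """
    # Central node: the node with the maximum degree.
    centre = max(adj, key=lambda n: len(adj[n]))
    dist = bfs_distances(adj, centre)
    scores = {}
    for node in adj:
        if node == centre:
            scores[node] = 1.0  # assumption: 1/0 is undefined, cap at 1.0
        elif node in dist:
            scores[node] = 1.0 / dist[node]
        else:
            scores[node] = 0.0  # assumption: unreachable from the centre
    return centre, scores
```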

Closeness centrality reflects how close a node is to the other nodes: the closer a node is to the other nodes, the more central it is. The node closeness centrality is calculated as follows: [formula not reproduced in the source]
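Following the verbal definition given in S3, the reciprocal of the sum of the distances between a node and the other nodes, closeness centrality can be sketched as below. The assumptions are that distance means unweighted hop count, that nodes unreachable from the source are ignored, and that an isolated node scores 0.0; function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop-count distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def closeness_centrality(adj):
    """Return {node: 1 / sum of distances to all reachable other nodes}.

    adj: {node: set of neighbouring nodes}.
    """
    scores = {}
    for node in adj:
        dist = bfs_distances(adj, node)
        total = sum(d for other, d in dist.items() if other != node)
        # Assumption: an isolated node gets 0.0 instead of a division by zero.
        scores[node] = 1.0 / total if total > 0 else 0.0
    return scores
```

On a three-node path a, b, c, the middle node has distance sum 2 and the end nodes have distance sum 3, so the middle node scores highest.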

If the neighbor nodes of a node have high importance, the node itself generally also has high importance. The strength of the adjacent nodes is therefore taken into account when the strength of a node is calculated, and the adjacent node strength $SNeigh_i$ is calculated as follows: [formula not reproduced in the source]
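Because the expression for $SNeigh_i$ is not reproduced in the source, the sketch below is only one plausible reading and should be treated as an assumption throughout: node strength is taken as the sum of the weights of a node's incident edges, and the adjacent node strength as the mean strength over the node's $n$ adjacent nodes. The edge representation (weights keyed by node pairs) and function names are also hypothetical.

```python
from collections import defaultdict


def node_strengths(weights):
    """Strength of each node: sum of the weights of its incident edges.

    weights: {(node_a, node_b): edge weight} for an undirected graph.
    """
    strength = defaultdict(float)
    for (u, v), w in weights.items():
        strength[u] += w
        strength[v] += w
    return dict(strength)


def adjacent_node_strengths(weights):
    """Assumed form: mean strength of a node's n adjacent nodes."""
    strength = node_strengths(weights)
    neighbours = defaultdict(set)
    for u, v in weights:
        neighbours[u].add(v)
        neighbours[v].add(u)
    return {
        node: sum(strength[nb] for nb in nbs) / len(nbs)
        for node, nbs in neighbours.items()
    }
```

Under this reading, a low-frequency word attached to a strongly connected neighbor still receives a high score, which matches the stated motivation that important neighbors indicate an important node.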

The linear model is as follows:

[formula not reproduced in the source]

Finally, the candidate keywords are sorted in descending order of the importance index given by this ranking model, and the required keywords are screened out according to the sorting result.
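The normalization, linear combination and descending sort of S4 can be sketched as follows. The source does not reproduce the linear model or its coefficients, so the weighted sum, the min-max normalization and the coefficient values used in the usage example are all assumptions, and the function names are hypothetical.

```python
def min_max_normalise(values):
    """Scale a {word: value} mapping to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:  # constant feature carries no ranking information
        return {word: 0.0 for word in values}
    return {word: (v - lo) / (hi - lo) for word, v in values.items()}


def rank_candidates(features, coefficients, top_k=None):
    """Rank candidate keywords by a weighted sum of normalised features.

    features:     {feature name: {word: raw value}}, same words in each.
    coefficients: {feature name: weight in the linear model} (assumed).
    Returns [(word, score), ...] in descending order of score.
    """
    normalised = {name: min_max_normalise(vals) for name, vals in features.items()}
    words = next(iter(features.values())).keys()
    scores = {
        word: sum(coefficients[name] * normalised[name][word] for name in features)
        for word in words
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k] if top_k else ranked


# Usage with two of the six features and made-up coefficients.
example = rank_candidates(
    {"tf": {"x": 3, "y": 1, "z": 2}, "closeness": {"x": 0.2, "y": 0.5, "z": 0.2}},
    {"tf": 0.7, "closeness": 0.3},
)
```

Normalizing before combining keeps features with large raw ranges, such as word frequency, from dominating features that are already bounded, such as the centrality scores.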

Further, a computer-readable storage medium is provided in which a computer program is stored; when executed by a processor, the computer program carries out the steps of the method. Using the method involves a large number of calculations, so these calculations are preferably performed by a computer program; any computer program containing the steps protected in the method, and its storage medium, therefore also falls within the scope of the present application.

The invention has the following advantages and beneficial effects:

Keyword extraction is a basic task in text processing, and traditional approaches to it urgently need improvement. By preprocessing the web text, constructing a co-occurrence language network model, jointly extracting keyword features, and selecting candidate keywords by ranking, the invention yields keywords with good readability, coherence and relevance, and can be widely applied in natural language processing, information retrieval, text mining, sentiment analysis, multimodal human-computer interaction and other fields.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:

FIG. 1 is a schematic view of an overall process for automatically extracting keywords according to the present invention.

Fig. 2 is a schematic diagram of the network text preprocessing process.

Detailed Description

Before any embodiments of the invention are explained in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangements of components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced or of being carried out in various ways. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any inventive changes, are within the scope of the present invention.

As shown in FIG. 1, the method for automatically extracting text keywords based on the co-occurrence language network comprises the following steps:

S1: preprocessing a webpage:

as shown in FIG. 2, web page standardization: correcting syntax errors in the webpage HTML file, and removing invalid tags and tag attributes;

clause decomposition: taking the web text as the language processing unit, preliminarily decomposing it at punctuation marks to obtain a clause set;

S2: constructing a language network graph model:

S3: jointly extracting the features of the candidate keywords:

feature calculation is performed on the nodes of the language network graph model of S2 using joint extraction of discourse features and language network graph model features;

the closeness centrality of each node in the language network graph model is calculated as the reciprocal of the sum of the distances between the node and the other nodes;

the detailed calculation steps are as follows:

according to related research results, nouns and verbal nouns account for the largest share of keywords, followed by adjectives and adverbs. Part of speech is therefore taken as a feature, corresponding weights are set for nouns, verbs, adjectives and other parts of speech according to existing research experience, and the weight of the part of speech $POS_i$ is set as follows:

[formula not reproduced in the source]

when a word that occurs $m$ times appears at positions $t_1, t_2, \ldots, t_m$, the word spacing is $d_i = t_{i+1} - t_i$;

$H(d_i) = -\sum P_d \log_2 P_d$

wherein $P_d$ is the probability that a word appears at position $d$, computed relative to the length of the document [formula not reproduced in the source];

the node decentrality is calculated as follows:

[formula not reproduced in the source]

first, the node with the maximum degree centrality is taken as the central node $c$, and then the distance $d_{ic}$ from each node $i$ to the central node $c$ is calculated;

The calculation formula is as follows:

[formula not reproduced in the source]

The linear model is as follows:

[formula not reproduced in the source]

Preferably, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the method. The specific use of the method relies on a large number of calculations and it is therefore preferred that the above calculation is performed by a computer program, so any computer program and its storage medium containing the steps protected in the method also fall within the scope of the present application.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
