Keyword extraction method and device

Document number: 1922141; publication date: 2021-12-03

Reading note: this technology, "Keyword extraction method and device" (一种关键词提取方法及装置), was created by Zhang Yaqin (张雅琴) on 2021-09-08. Main content: The application provides a keyword extraction method and device. After word segmentation is performed on a sentence to be processed, fragment words in the segmentation result are merged, and the TF-IDF value of each word is then obtained from a keyword dictionary. The sentence to be processed is also split into short sentences; word segmentation and fragment-word merging are performed on each short sentence to obtain the words it contains, and dependency syntax analysis is performed on those words to obtain the core phrase of the short sentence. The keywords of the sentence to be processed are determined from each word in the sentence, its TF-IDF value, and the core phrases the sentence contains. Because words are first extracted from the whole sentence and core phrases are then extracted from each short sentence, important information is not omitted. Merging fragment words after segmentation also reduces the number of words and makes the extracted keyword information more complete.

1. A keyword extraction method is characterized by comprising the following steps:

performing word segmentation on a sentence to be processed to obtain a word segmentation result, and performing fragment-word merging on the word segmentation result to obtain a segmentation-and-merging result;

obtaining the term frequency-inverse document frequency (TF-IDF) of each word in the segmentation-and-merging result based on a pre-trained keyword dictionary, wherein the keyword dictionary comprises the TF-IDF corresponding to each keyword;

splitting the sentence to be processed into short sentences, performing word segmentation and fragment-word merging on each short sentence to obtain the words contained in the short sentence, and performing dependency syntax analysis on the words contained in each short sentence to obtain the core phrase contained in the short sentence;

and obtaining the keywords of the sentence to be processed based on the words contained in the sentence to be processed, the TF-IDF corresponding to the words, and the core phrase.

2. The method according to claim 1, wherein the obtaining the keywords of the sentence to be processed based on the words contained in the sentence to be processed, the TF-IDF corresponding to the words, and the core phrase comprises:

acquiring a weight coefficient corresponding to each word contained in the sentence to be processed, wherein the weight coefficient comprises a weight corresponding to the position of the word and a weight corresponding to the core phrase;

obtaining a target weight corresponding to each word based on the weight coefficient corresponding to the word and the TF-IDF of the word;

and determining a preset number of words as the keywords of the sentence to be processed, in descending order of the target weight corresponding to each word in the sentence to be processed.

3. The method according to claim 2, wherein the weight coefficient includes a first weight corresponding to the TF-IDF, a second weight corresponding to the core phrase, a third weight corresponding to the position of the short sentence in the sentence to be processed, and a fourth weight corresponding to the part of speech of each core phrase;

the obtaining of the target weight corresponding to each word based on the weight coefficient corresponding to the word and the TF-IDF of the word includes:

calculating the product of the first weight corresponding to the word and the TF-IDF of the word;

and calculating the sum of the product, the second weight, the third weight, and the fourth weight to obtain the target weight corresponding to the word.

4. The method of claim 3, wherein a sum of maximum values of the first weight, the second weight, the third weight, and the fourth weight is equal to 1;

the second weight corresponding to the core phrase is a second weight preset value, and the second weight corresponding to the words of the non-core phrase is 0;

the numerical value of the third weight corresponding to the short sentence at the beginning or the end of the sentence in the sentence to be processed is higher than the third weight corresponding to the short sentences at other positions in the sentence to be processed;

the fourth weights corresponding to words of different parts of speech are different.

5. The method according to claim 1, wherein the splitting the sentence to be processed into short sentences and performing word segmentation and fragment-word merging on each short sentence to obtain the words contained in the short sentence comprises:

dividing the sentence to be processed into short sentences according to the punctuation marks contained in the sentence to be processed;

and performing word segmentation on the short sentence to obtain a word segmentation result, and merging words in the word segmentation result whose co-occurrence frequency is greater than a preset threshold, so as to obtain the words contained in the short sentence.

6. The method according to claim 1, wherein performing dependency parsing on the words included in each short sentence to obtain the core phrases included in the short sentence comprises:

analyzing semantic dependency relations among words contained in the short sentences by using a dependency syntax analysis method;

and determining the core phrase in the short sentence according to the semantic dependency relationship.

7. The method according to claim 6, wherein the determining a core phrase included in the short sentence according to the semantic dependency relationship comprises:

extracting initial core words of the short sentence according to the semantic dependency relationship;

and expanding the initial core words according to the semantic dependency relationship corresponding to the initial core words to obtain the core phrase.

8. The method of claim 1, wherein the process of obtaining the keyword dictionary comprises:

performing word segmentation and fragment-word merging on any sentence in a training sentence set to obtain the words contained in the sentence;

and for each word, calculating the TF-IDF of the word according to the term frequency of the word and the number of sentences containing the word, so as to obtain the TF-IDF of each word contained in the training sentence set.

9. A keyword extraction apparatus, comprising:

a segmentation and fragment-word merging module, configured to perform word segmentation on a sentence to be processed to obtain a word segmentation result, and perform fragment-word merging on the word segmentation result to obtain a segmentation-and-merging result;

a TF-IDF obtaining module, configured to obtain the term frequency-inverse document frequency (TF-IDF) of each word in the segmentation-and-merging result based on a pre-trained keyword dictionary, wherein the keyword dictionary comprises the TF-IDF corresponding to each keyword;

a core phrase obtaining module, configured to split the sentence to be processed into short sentences, perform word segmentation and fragment-word merging on each short sentence to obtain the words contained in the short sentence, and perform dependency syntax analysis on the words contained in each short sentence to obtain the core phrase contained in the short sentence;

and a keyword determining module, configured to obtain the keywords of the sentence to be processed based on the words contained in the sentence to be processed, the TF-IDF corresponding to the words, and the core phrase.

10. A computer-readable storage medium, characterized in that the storage medium stores therein a program that realizes the keyword extraction method according to any one of claims 1 to 8 when executed by a computing device.

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a keyword extraction method and device.

Background

Keyword extraction automatically extracts, from a text, the words that express the meaning of the text. Current keyword extraction techniques, such as TF-IDF (term frequency-inverse document frequency), TextRank, and topic-model-based keyword extraction, are essentially designed for long-text corpora such as documents and articles. Such corpora are characterized by a large number of words, a large amount of information, a definite topic, and a very clear context.

In an automatic question-answering scenario, a user inputs a sentence and the system needs to extract the keywords of that sentence. Sentences in an automatic question-answering system generally have the following characteristics: (1) the content is short and contains few words; (2) the purpose is clear; (3) one sentence may contain multiple meanings; (4) the wording is highly colloquial, flexible in expression, and varied in style. The corpus of an automatic question-answering system therefore differs completely in character from a long-text corpus, so keyword extraction techniques designed for long text are not suitable for the short-text corpus of an automatic question-answering system.

Disclosure of Invention

In view of the above, the present invention provides a keyword extraction method and device to solve the above technical problems, and the disclosed specific technical solution includes:

in a first aspect, the present application provides a keyword extraction method, including:

In a possible implementation of the first aspect, obtaining the keywords of the sentence to be processed based on the words contained in the sentence to be processed, the TF-IDF corresponding to the words, and the core phrase includes:

obtaining a target weight corresponding to each word based on the weight coefficient corresponding to the word and the TF-IDF of the word;

In another possible implementation of the first aspect, the weight coefficient includes a first weight corresponding to the TF-IDF, a second weight corresponding to the core phrase, a third weight corresponding to the position of the short sentence in the sentence to be processed, and a fourth weight corresponding to the part of speech of each core phrase;

the obtaining of the target weight corresponding to each word based on the weight coefficient corresponding to the word and the TF-IDF of the word includes:

calculating the product of the first weight corresponding to the word and the TF-IDF of the word;

and calculating the sum of the product, the second weight, the third weight, and the fourth weight to obtain the target weight corresponding to the word.

In yet another possible implementation manner of the first aspect, a sum of maximum values of the first weight, the second weight, the third weight, and the fourth weight is equal to 1;

the second weight corresponding to the core phrase is a second weight preset value, and the second weight corresponding to the words of the non-core phrase is 0;

the fourth weights corresponding to words of different parts of speech are different.

In another possible implementation of the first aspect, splitting the sentence to be processed into short sentences and performing word segmentation and fragment-word merging on each short sentence to obtain the words contained in the short sentence includes:

dividing the sentences to be processed into short sentences according to punctuations contained in the sentences to be processed;

In another possible implementation manner of the first aspect, the performing dependency parsing on the terms included in each short sentence to obtain a core phrase included in the short sentence includes:

analyzing semantic dependency relations among words contained in the short sentences by using a dependency syntax analysis method;

and determining the core phrase in the short sentence according to the semantic dependency relationship.

In another possible implementation manner of the first aspect, the determining, according to the semantic dependency relationship, a core phrase included in the short sentence includes:

extracting initial core words of the short sentence according to the semantic dependency relationship;

and expanding the initial core words according to the semantic dependency relationship corresponding to the initial core words to obtain the core phrase.

In yet another possible implementation manner of the first aspect, the process of obtaining the keyword dictionary includes:

performing word segmentation and fragment-word merging on any sentence in a training sentence set to obtain the words contained in the sentence;

In a second aspect, the present application further provides a keyword extraction apparatus, including:

In a third aspect, the present application further provides an electronic device, including: a memory and a processor;

the memory stores a program, and the processor calls the program in the memory to implement the keyword extraction method according to any one of the possible implementation manners of the first aspect.

In a fourth aspect, the present application further provides a computer-readable storage medium, where a program is stored in the storage medium, and the program, when executed by a computing device, implements the keyword extraction method according to the first aspect or any possible implementation manner.

In the keyword extraction method provided by the application, word segmentation is performed on the sentence to be processed, fragment words in the segmentation result are merged, and the TF-IDF value of each word is then obtained from a keyword dictionary. The sentence to be processed is further split into short sentences; word segmentation and fragment-word merging are performed on each short sentence to obtain the words it contains, and dependency syntax analysis is performed on those words to obtain the core phrase of the short sentence. The keywords of the sentence to be processed are determined from each word in the sentence, its TF-IDF value, and the core phrases the sentence contains. Because words are first extracted from the whole sentence and core phrases are then extracted from each short sentence, important information is not omitted. Merging fragment words after segmentation also reduces the number of words and makes the extracted keyword information more complete. In summary, the scheme is suited to extracting keywords from the corpus of an automatic question-answering system; that is, the keywords it extracts from such a corpus are more accurate.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a schematic flowchart of a keyword extraction process provided in an embodiment of the present application;

Fig. 2 is a flowchart of a keyword extraction method provided in an embodiment of the present application;

Fig. 3 is a flowchart of a process of obtaining a keyword dictionary by corpus training according to an embodiment of the present application;

Fig. 4 is a flowchart of a process of obtaining the keywords of a sentence to be processed according to an embodiment of the present application;

Fig. 5 is a block diagram of a keyword extraction apparatus according to an embodiment of the present application.

Detailed Description

Before describing the embodiments of the method provided in the present application in detail, the technical terms referred to in the present application will be explained.

Term frequency-inverse document frequency: abbreviated TF-IDF. TF is the frequency with which a word appears in an article, i.e. the term frequency; IDF is the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the word. TF-IDF measures the importance of a word from a statistical perspective.

Chinese word segmentation: a Chinese word segmentation algorithm splits a sequence of Chinese characters into individual words. When Chinese text is interpreted semantically, several characters usually need to be combined into a word to express the actual meaning.

Fragment-word merging: two adjacent words with a high co-occurrence probability are merged into one word.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1 and fig. 2, fig. 1 shows a detailed flowchart of a keyword extraction process provided in an embodiment of the present application, and fig. 2 shows a flowchart of a keyword extraction method provided in an embodiment of the present application.

The keyword extraction method provided by the application is applied to electronic equipment, and the electronic equipment can be terminal equipment such as a mobile phone, a computer, a tablet computer and the like, and can also be a server.

As shown in fig. 1, for a sentence to be processed, word segmentation and fragment-word merging are performed first, and a keyword dictionary is queried to obtain the TF-IDF value of each word in the sentence (see S110 and S120 in fig. 2). The whole sentence is then split into several short sentences at punctuation marks, and the following steps are performed for each short sentence: word segmentation and fragment-word merging (see S130 in fig. 2), followed by dependency syntax analysis and extraction of the core phrase from the analysis result (see S140 in fig. 2). Finally, for each word in the sentence to be processed, a target weight is obtained from the word's TF-IDF value and its weight coefficients, and the top n words in descending order of target weight are determined to be the keywords of the sentence.

The following will describe the process of the keyword extraction method provided by the present application with reference to fig. 2:

S110, performing word segmentation on the sentence to be processed to obtain a word segmentation result, and performing fragment-word merging on the word segmentation result to obtain a segmentation-and-merging result.

The sentence to be processed refers to any sentence from which a keyword needs to be extracted, for example, in an online customer service automatic question and answer application scenario, the sentence to be processed is a sentence input by a user.

A Chinese word segmentation tool can be used to segment the sentence to be processed. For example, for the original sentence "trouble you tell me where to return goods" (rendered word for word), the word segmentation result is "trouble, you, tell, me, where, return goods".

Fragment-word merging serves two main functions. 1) Merging colloquial words. For example, after fragment-word merging, "trouble, you, tell, me, where, return goods" becomes "trouble you, tell me, where, return goods". Fragment-word merging can greatly reduce the number of segmented words in a sentence.

2) Merging and restoring proper nouns. For example, the term "water droplet insurance" is segmented into "water droplet" and "insurance"; the meaning changes after segmentation, so the two words need to be merged back into "water droplet insurance".

Fragment-word merging counts the number of times adjacent left and right words appear together; if the co-occurrence count exceeds a certain threshold, the two words are merged into one word. This technique can greatly reduce the number of words in a short time, and the extracted keywords preserve the semantics more completely.
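The counting-and-threshold rule described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the greedy left-to-right merge order, the `joiner` parameter, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter


def count_adjacent_pairs(segmented_sentences):
    """Count how often each (left, right) pair of adjacent words occurs together."""
    pair_counts = Counter()
    for words in segmented_sentences:
        pair_counts.update(zip(words, words[1:]))
    return pair_counts


def merge_fragments(words, pair_counts, threshold, joiner=""):
    """Merge adjacent words whose co-occurrence count exceeds the threshold.

    Assumption: pairs are merged greedily from left to right and a word
    takes part in at most one merge.
    """
    merged = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and pair_counts[(words[i], words[i + 1])] > threshold:
            merged.append(words[i] + joiner + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


# Toy corpus of already-segmented sentences (hypothetical data).
corpus = [
    ["trouble", "you", "tell", "me", "where", "return goods"],
    ["trouble", "you", "tell", "me", "how", "apply"],
    ["trouble", "you", "check", "order"],
    ["tell", "me", "where", "pay"],
]
pair_counts = count_adjacent_pairs(corpus)
merged = merge_fragments(corpus[0], pair_counts, threshold=2, joiner=" ")
print(merged)
```

With a threshold of 2, only the pairs ("trouble", "you") and ("tell", "me") occur often enough in this toy corpus to be merged, so the six segmented words are reduced to four. For Chinese text the default empty `joiner` would concatenate the two words directly.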

S120, obtaining the TF-IDF value of each word in the segmentation-and-merging result based on the pre-trained keyword dictionary.

The keyword dictionary comprises the TF-IDF value corresponding to each keyword.

As shown in fig. 3, the process of obtaining the keyword dictionary by corpus training includes the following steps:

S121, performing word segmentation and fragment-word merging on each sentence in the input sentence set to obtain the keywords contained in the sentence.

During training, all input sentences are combined into a data set, and word segmentation and fragment-word merging are first performed on each sentence.

S122, calculating the corresponding TF-IDF value of each keyword.

After a complete keyword set is obtained, for each keyword in the set, the term frequency (TF) of the word is counted first, then the number of texts in the corpus in which the word appears is counted, and the TF-IDF value of the word is calculated. Finally, each word and its TF-IDF value are output as a dictionary.

TF-IDF is used to assess how important a word is to a document in a corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears across the corpus. In other words, the more often a word appears in one text and the less often it appears across all documents, the more representative it is of that text.

TF, or word frequency, refers to the number of times a given word appears in a document, which number is typically normalized, e.g., TF is equal to the quotient of the number of times a word appears in a text divided by the total number of words in the text.

IDF is the inverse document frequency: the fewer documents contain the word, the greater the IDF. The IDF of a given word is obtained by dividing the total number of texts in the corpus by the number of texts containing the word and taking the logarithm of the quotient, that is:

IDF = log(total number of texts in the corpus / number of texts containing the word)

Finally, the TF-IDF value is calculated from the TF value and the IDF value obtained above, according to the formula TF-IDF = TF * IDF.

S123, generating a keyword dictionary from each keyword and its corresponding TF-IDF value.

After the TF-IDF value of each word in the corpus is calculated, each word and its TF-IDF value are output as the keyword dictionary. For example, the keyword dictionary is {'w1': tfidf_1, 'w2': tfidf_2, 'w3': tfidf_3, …, 'wn': tfidf_n}, where 'w1' represents a word and tfidf_1 represents the TF-IDF value of the word 'w1', and so on; 'wn' represents the nth word and tfidf_n represents the TF-IDF value of the word 'wn'.
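Steps S121 to S123 can be sketched as below. This is an illustrative sketch under stated assumptions: each training sentence is treated as one "text", TF is taken as the word's count divided by the total number of words in the training set, and the natural logarithm is used. The text does not fix these details, and the function name and sample sentences are hypothetical.

```python
import math
from collections import Counter


def build_keyword_dictionary(segmented_sentences):
    """Return {word: tf * idf} for every word in the training sentences.

    Assumptions: each sentence counts as one text; TF is the word's share
    of all words in the training set; IDF uses the natural logarithm.
    """
    total_sentences = len(segmented_sentences)
    term_counts = Counter()      # occurrences of each word
    sentence_counts = Counter()  # number of sentences containing each word
    for words in segmented_sentences:
        term_counts.update(words)
        sentence_counts.update(set(words))

    total_words = sum(term_counts.values())
    dictionary = {}
    for word, count in term_counts.items():
        tf = count / total_words
        idf = math.log(total_sentences / sentence_counts[word])
        dictionary[word] = tf * idf
    return dictionary


# Hypothetical training sentences, already segmented and merged.
training_sentences = [
    ["trouble you", "tell me", "where", "return goods"],
    ["return goods", "refund", "when"],
    ["tell me", "refund", "when"],
]
keyword_dict = build_keyword_dictionary(training_sentences)
print(keyword_dict["where"], keyword_dict["return goods"])
```

A lookup for a word of the sentence to be processed is then a plain dictionary access such as `keyword_dict.get(word, 0.0)`. Falling back to 0.0 for words missing from the dictionary is an assumption of this sketch. Note that a word appearing in every training sentence gets an IDF of log(1) = 0.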

The keyword dictionary can be queried directly for the TF-IDF value of a given word. For example, the dictionary is checked for a keyword identical to the word being queried (for instance, by using a Chinese word matching algorithm); if such a keyword exists, its TF-IDF value is read and used as the TF-IDF value of the queried word.

S130, splitting the sentence to be processed into short sentences, and performing word segmentation and fragment-word merging on each short sentence to obtain the words contained in each short sentence.

In an automatic question-answering system, a sentence entered by a user may contain several pieces of information separated by punctuation. To improve the accuracy of keyword extraction in this scenario, the sentence to be processed is split at punctuation marks into different short sentences. Keywords are then extracted for each short sentence so that important information is not lost.

For each short sentence, word segmentation and fragment-word merging are performed first to obtain the words contained in the short sentence.
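Splitting at punctuation can be sketched with a regular expression. The set of delimiter characters below (common Chinese and ASCII punctuation) and the sample input are assumptions for illustration; the text only states that the split is made according to punctuation.

```python
import re

# Assumed delimiter set: full-width and ASCII clause-ending punctuation.
CLAUSE_DELIMITERS = r"[，。！？；,.!?;]+"


def split_into_short_sentences(sentence):
    """Split a sentence at punctuation marks and drop empty pieces."""
    parts = re.split(CLAUSE_DELIMITERS, sentence)
    return [part.strip() for part in parts if part.strip()]


# Hypothetical user input.
short_sentences = split_into_short_sentences(
    "I bought a policy yesterday, does my situation meet the application requirements now?"
)
print(short_sentences)
```

Each resulting short sentence would then go through word segmentation and fragment-word merging. Treating "." as a delimiter would also split decimal numbers, so a production delimiter set may need to be narrower.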

S140, performing dependency syntax analysis on the words contained in each short sentence to obtain the core phrase contained in the short sentence.

Dependency syntax analysis automatically derives the syntactic structure of a sentence according to a given grammar system, and analyzes the syntactic units contained in the sentence and the relationships between them.

In one possible implementation, based on dependency syntax analysis, semantic dependency relationships between words included in a short sentence are analyzed, and further, a core phrase in the short sentence is determined according to the semantic dependency relationships.

In another possible implementation, the semantic dependency relationships between the words in a short sentence are analyzed using dependency syntax analysis, the core word of the short sentence is extracted according to those relationships as the initial core word, and the initial core word is then expanded according to the subject-predicate structure, verb-object structure, adverbial-head structure, and the like in which it participates, to obtain the core phrase of the short sentence.

For example, for the short sentence "Does my situation meet the application requirements now?", word segmentation and fragment-word merging are performed first. "Meet" and "application requirements" form a verb-object structure, and "application requirements" is a word with a specific meaning, so the core phrase is expanded to "meet application requirements".

In contrast, "my situation" and "meet" form a subject-predicate structure, but "my situation" is not a word with a specific meaning; adding such a word to the core phrase would not help it serve as a keyword, so "my situation" is not added to the core phrase.
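The expansion step can be sketched on a hand-written dependency analysis. Everything here is an assumption made for illustration: no real parser is called, the relation labels ("root", "subject", "object", "adverbial") are placeholder names, the tokens are English glosses of the example, and the test for "a word with a specific meaning" is approximated by a hypothetical exclusion set.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    head: int      # index of the governing token, -1 for the root
    relation: str  # placeholder label: "root", "subject", "object", "adverbial"


# Structures along which the initial core word may be expanded (assumed labels).
EXPANDABLE_RELATIONS = {"subject", "object", "adverbial"}


def extract_core_phrase(tokens, non_specific_words):
    """Take the root as the initial core word, then add its direct dependents
    that stand in an expandable relation and carry a specific meaning."""
    root_index = next(i for i, t in enumerate(tokens) if t.relation == "root")
    selected = {root_index}
    for i, token in enumerate(tokens):
        if (
            token.head == root_index
            and token.relation in EXPANDABLE_RELATIONS
            and token.word not in non_specific_words
        ):
            selected.add(i)
    # Keep the original word order of the short sentence.
    return [tokens[i].word for i in sorted(selected)]


# Hand-written analysis of the example short sentence (hypothetical).
tokens = [
    Token("my situation", head=2, relation="subject"),
    Token("now", head=2, relation="adverbial"),
    Token("meet", head=-1, relation="root"),
    Token("application requirements", head=2, relation="object"),
]
non_specific = {"my situation", "now"}  # assumed to lack a specific meaning
core_phrase = extract_core_phrase(tokens, non_specific)
print(core_phrase)
```

In this sketch the object "application requirements" is attached to the core word, while the subject "my situation" is filtered out, mirroring the reasoning in the example. How a word is judged to have a specific meaning is not specified in the text, so the exclusion set stands in for that decision.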

S150, obtaining the key words of the sentence to be processed based on the words contained in the sentence to be processed, the TF-IDF values corresponding to the words and the core phrases.

In an embodiment of the present application, as shown in fig. 4, the process of obtaining the keyword of the to-be-processed sentence may include:

S151, acquiring the words contained in the sentence to be processed and the weight coefficients corresponding to the words.

In one possible implementation, a corresponding weight coefficient is set for each word, for example, a position weight, a weight corresponding to the core phrase, and the like.

S152, based on the weight coefficient and the TF-IDF value corresponding to each word, the target weight corresponding to the word is obtained.

For a given word, several factors influence whether it is a keyword, such as the TF-IDF value of the word, the position of the word in the sentence, the part of speech of the word, and whether the word belongs to a core phrase. The embodiments of the present application describe the influence on keywords from these four dimensions. Thus, in one embodiment of the present application, the weight coefficient of each word includes the following four weights:

1) A weight tfidf_weight, i.e. the first weight, is set for the TF-IDF value.

The first weight represents the degree to which the TF-IDF dimension influences the keywords. The tfidf_weight value is the same for different TF-IDF values and can be set according to the actual situation: the larger the value, the greater the influence of the TF-IDF dimension on the keywords, and the smaller the value, the smaller that influence.

2) A weight w_word_group, i.e. the second weight, is set for the core phrase.

For a word contained in a sentence, if the word is a core phrase, the probability that the word is a keyword is greater than that of a word that is not a core phrase. Therefore, a weight is set for the words belonging to the core phrase. The weight characterizes the degree of influence of the dimension of the core phrase on the keyword. The second weight may be a fixed value, e.g., for a given word, the second weight takes a corresponding set value if the word is a core phrase and the second weight is 0 if the word is not a core phrase.

3) A position weight location_weight, i.e. the third weight, is set for the position of the word.

In the automatic question-answering system, the purpose of the words input by the user is clear, the purpose is generally embodied in the first sentence or the last sentence, and the information of the positions of the words is particularly important, so that the position weight of the words is introduced, and the position weight represents the influence degree of the words in different positions on whether the words are keywords or not.

The position weight values can be determined according to actual conditions: positions that can reflect the purpose of the sentence receive larger position weights, and other positions in the sentence receive smaller ones.

For example, if a sentence includes d phrases, a position weight is set for the position of each phrase, and the position weights corresponding to the d phrases are {1: location_weight_1, 2: location_weight_2, …, d: location_weight_d}, where location_weight_1 represents the position weight corresponding to the 1st phrase in the sentence, and so on, and location_weight_d represents the position weight corresponding to the d-th phrase.

For example, the values of location_weight_1 and location_weight_d are large and may be equal or different, while the position weights corresponding to phrases in other positions are small.
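The position-weight mapping described above can be sketched as follows. This is only an illustrative sketch: the function name build_location_weights and the values edge_weight and inner_weight are assumptions, since the application leaves the concrete values to be determined according to actual conditions.

```python
def build_location_weights(d, edge_weight=1.0, inner_weight=0.3):
    """Return {position: weight} for d positions, numbered from 1.

    The first and last positions receive the larger weight; all other
    positions receive the smaller one. Both values are hypothetical.
    """
    if d < 1:
        raise ValueError("d must be at least 1")
    weights = {pos: inner_weight for pos in range(1, d + 1)}
    weights[1] = edge_weight
    weights[d] = edge_weight
    return weights


location_weights = build_location_weights(4)
```

Here location_weight_1 and location_weight_d are set equal for simplicity, which the text permits but does not require.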

4) A part-of-speech weight w4, i.e., a fourth weight, is set for the part of speech.

For example, verbs are usually the central component of a sentence, and words of other parts of speech are usually governed by verbs, so the weight of a verb is the largest, the weight of a noun is the next largest, and the weight of an adjective is the smallest. Of course, weights corresponding to other parts of speech may also be set, and will not be described in detail here.

For example, the part-of-speech weights are {'verb': verb_w, 'noun': noun_w, 'adjective': adj_w}, where verb_w represents the weight corresponding to the verb, noun_w represents the weight corresponding to the noun, and adj_w represents the weight corresponding to the adjective.

After the weight coefficients are determined, the final target weight Final_weight of a given word c is calculated according to the following formula:

Final_weight = tfidf_weight * tfidf + w_word_group + location_weight + w4

For example, if a word c is located in the first short sentence of the whole sentence and its part of speech is a verb, the target weight corresponding to c is:

Final_weight_c = tfidf_weight * tfidf_c + w_word_group + location_weight_1 + verb_w

The specific numerical values of the four weights can be determined according to actual conditions and are not limited in the present application.
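As a minimal sketch of the formula above, the following computes the target weight of one word. The Word record, the constant names and every numeric value are hypothetical, chosen only to make the arithmetic concrete. Falling back to 0 for a position or part of speech that has no configured weight is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    tfidf: float     # TF-IDF value looked up in the keyword dictionary
    position: int    # 1-based index of the short sentence containing the word
    pos_tag: str     # part of speech, e.g. "verb"
    is_core: bool    # whether the word is a core phrase


# Hypothetical weight settings; real values are chosen per deployment.
TFIDF_WEIGHT = 2.0
W_WORD_GROUP = 0.5
LOCATION_WEIGHTS = {1: 1.0, 2: 0.3, 3: 1.0}
POS_WEIGHTS = {"verb": 0.6, "noun": 0.4, "adjective": 0.2}


def final_weight(word):
    """Final_weight = tfidf_weight * tfidf + w_word_group + location_weight + w4."""
    second = W_WORD_GROUP if word.is_core else 0.0
    third = LOCATION_WEIGHTS.get(word.position, 0.0)
    fourth = POS_WEIGHTS.get(word.pos_tag, 0.0)
    return TFIDF_WEIGHT * word.tfidf + second + third + fourth


# A verb in the first short sentence that is also a core phrase.
c = Word(text="refund", tfidf=0.25, position=1, pos_tag="verb", is_core=True)
weight_c = final_weight(c)  # 2.0 * 0.25 + 0.5 + 1.0 + 0.6
```

Because only the TF-IDF term is multiplied by its weight while the other three are added, tfidf_weight controls how strongly the TF-IDF value competes with the three additive terms.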

S153, determining a preset number of words as the keywords of the sentence to be processed, in descending order of the target weight corresponding to each word in the sentence to be processed.

After the target weight of each word in the sentence to be processed is calculated according to the formula, the words are sorted by target weight from high to low, and the first n words are selected as the keywords of the sentence to be processed.
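The selection in step S153 amounts to a sort followed by a cut-off, which can be sketched as below. The function name select_keywords and the sample weights are illustrative assumptions. How ties are broken is not specified in the application; this sketch keeps the original word order for equal weights.

```python
def select_keywords(target_weights, n):
    """Return the n words with the highest target weight, highest first.

    target_weights maps each word to its target weight. Python's sort is
    stable, so words with equal weights keep their insertion order.
    """
    ranked = sorted(target_weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


keywords = select_keywords({"please": 0.2, "refund": 2.6, "order": 1.4}, n=2)
```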

In the keyword extraction method provided in this embodiment, after word segmentation is performed on the sentence to be processed, word fragmentation combination is performed on the word segmentation result, and the TF-IDF value of each word in the word segmentation and combination result is then obtained based on a keyword dictionary. The sentence to be processed is also divided into short sentences; word segmentation and word fragmentation combination are performed on each short sentence to obtain the words it contains, and dependency syntax analysis is performed on those words to obtain the core phrase of the short sentence. The keywords of the sentence to be processed are then determined from the words it contains, their corresponding TF-IDF values, and the core phrases it contains. The scheme first extracts words from the whole sentence, then divides the whole sentence into short sentences and extracts a core phrase from each short sentence, so as to ensure that important information is not omitted. In addition, performing word fragmentation combination after word segmentation reduces the number of words and makes the extracted keyword information more complete. In conclusion, the scheme is suited to extracting keywords from the corpus of an automatic question-answering system, that is, the keywords extracted from such a corpus are more accurate.

Furthermore, when keywords are extracted, the position weight of a word in the whole sentence is introduced, so that words at positions containing important information can be extracted, which improves the accuracy of the extracted keywords. In addition, weights of other dimensions are set, namely the weight corresponding to TF-IDF, the weight corresponding to the core phrase and the weight corresponding to the part of speech, so that whether a word is a keyword is measured from several different dimensions, which further improves the accuracy of the extracted keywords. These dimensions are determined according to the characteristics of the corpus in the automatic question-answering system, so the scheme is better suited to the automatic question-answering system.

Corresponding to the embodiment of the keyword extraction method, the application also provides an embodiment of a keyword extraction device.

Referring to fig. 5, a block diagram of a keyword extraction apparatus provided in an embodiment of the present application is shown, where the apparatus is applied to an electronic device, and as shown in fig. 5, the apparatus may include:

A word segmentation and word fragmentation combination module 110, configured to perform word segmentation on the sentence to be processed to obtain a word segmentation result, and to perform word fragmentation combination on the word segmentation result to obtain a word segmentation and combination result.

In one possible implementation, the word segmentation and word fragmentation combination module 110 includes:

A short sentence dividing submodule, configured to divide the sentence to be processed into short sentences according to the punctuation contained in the sentence to be processed.

A word segmentation and word fragmentation combination submodule, configured to perform word segmentation on each short sentence to obtain a word segmentation result, and to merge the words in the word segmentation result whose co-occurrence frequency is greater than a preset threshold, to obtain the words contained in the short sentence.
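The two submodules above can be sketched as follows. This is a simplified illustration, not the application's implementation: the punctuation set, the single left-to-right pass over adjacent token pairs, the cooccurrence lookup table and the joiner parameter are all assumptions, and the word segmentation step itself is taken as already done.

```python
import re


def split_short_sentences(sentence):
    """Divide a sentence into short sentences at punctuation marks."""
    parts = re.split(r"[,.;!?，。；！？]+", sentence)
    return [part.strip() for part in parts if part.strip()]


def merge_fragments(tokens, cooccurrence, threshold, joiner=""):
    """Merge adjacent tokens whose co-occurrence frequency exceeds threshold.

    cooccurrence maps (left_token, right_token) to a frequency gathered
    beforehand (hypothetical data). Each token joins at most one merge.
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and cooccurrence.get(pair, 0) > threshold:
            merged.append(joiner.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


short_sentences = split_short_sentences("I bought a phone, it is broken. How to refund?")
words = merge_fragments(["credit", "card", "limit"], {("credit", "card"): 12}, threshold=5, joiner=" ")
```

For Chinese text the default empty joiner concatenates the fragments directly; the space joiner is used here only because the sample tokens are English.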

A term frequency-inverse document frequency (TF-IDF) obtaining module 120, configured to obtain the TF-IDF value of each word in the word segmentation and combination result based on the keyword dictionary obtained through pre-training.

The keyword dictionary includes a corresponding TF-IDF value for each keyword.

In an embodiment of the present application, the process of training to obtain the keyword dictionary includes:

performing word segmentation processing and word fragmentation combination on any sentence in a training sentence set to obtain a word contained in the sentence;
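As a rough illustration of how a keyword dictionary mapping each word to a TF-IDF value might be built from a training sentence set, the sketch below uses one common TF-IDF definition. The remaining training steps are not given in this passage, so the formula is an assumption and the application may use a different variant: here each training sentence is treated as one document, the term frequency is the word's share of all tokens in the set, and the inverse document frequency is log(N / df). The function name is hypothetical.

```python
import math
from collections import Counter


def train_keyword_dictionary(tokenized_sentences):
    """Map each word to a TF-IDF value computed over the training set.

    tokenized_sentences is a list of token lists, i.e. the output of word
    segmentation and word fragmentation combination for each sentence.
    """
    n_sentences = len(tokenized_sentences)
    term_counts = Counter()   # occurrences of each word in the whole set
    doc_counts = Counter()    # number of sentences containing each word
    for tokens in tokenized_sentences:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    total_terms = sum(term_counts.values())
    return {
        word: (count / total_terms) * math.log(n_sentences / doc_counts[word])
        for word, count in term_counts.items()
    }


dictionary = train_keyword_dictionary([["refund", "order"], ["order", "status"]])
```

With this definition a word that appears in every training sentence, such as "order" in the sample, gets a TF-IDF value of 0.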

The core phrase obtaining module 130 is configured to perform sentence segmentation on the sentence to be processed, perform word segmentation processing and word fragmentation combination on each short sentence to obtain words included in the short sentence, and perform dependency syntax analysis on the words included in each short sentence to obtain a core phrase included in the short sentence.

In one possible implementation, the dependency syntax analysis process includes: analyzing the semantic dependency relationships among the words contained in the short sentence by using a dependency syntax analysis method; and determining the core phrase in the short sentence according to the semantic dependency relationships.

The process of determining the core phrases contained in the short sentence comprises the following steps: extracting initial core words of the short sentence according to the semantic dependency relationship; and expanding the initial core words according to the semantic dependency relationship corresponding to the initial core words to obtain the core phrase.
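The extract-then-expand process above can be illustrated with a small sketch that works on dependency arcs supplied by some external parser. The arc format, the relation labels ("root", "nsubj", "obj") and the rule of expanding only along object relations are assumptions made for illustration; this passage does not state which relations the application uses.

```python
def core_phrase(tokens, arcs, expand_relations=("obj",)):
    """Return the core phrase of a short sentence as a list of words.

    tokens: the words of the short sentence, in order.
    arcs:   one (head, relation) pair per token, where head is the 1-based
            index of the governing token and 0 marks the root.
    """
    # Initial core word: the token governed by the root.
    root_idx = next(i for i, (head, _) in enumerate(arcs) if head == 0)
    selected = {root_idx}
    # Expansion: add direct dependents of the core word via chosen relations.
    for i, (head, relation) in enumerate(arcs):
        if head == root_idx + 1 and relation in expand_relations:
            selected.add(i)
    return [tokens[i] for i in sorted(selected)]


tokens = ["I", "want", "refund"]
arcs = [(2, "nsubj"), (0, "root"), (2, "obj")]
phrase = core_phrase(tokens, arcs)
```

In this sample the initial core word is "want", and expanding along its object relation yields the phrase made of "want" and "refund".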

A keyword determining module 140, configured to obtain the keywords of the sentence to be processed based on the words contained in the sentence to be processed, the TF-IDF values corresponding to the words, and the core phrases.

In one embodiment of the present application, the keyword determination module 140 may include:

A weight obtaining submodule, configured to obtain the weight coefficient corresponding to each word contained in the sentence to be processed.

The weight coefficient includes the weight corresponding to the position of the word and the weight corresponding to the core phrase.

A target weight calculation submodule, configured to obtain the target weight corresponding to each word based on the weight coefficient corresponding to the word and the TF-IDF value of the word.

In a possible implementation manner, the weight coefficient includes a first weight corresponding to TF-IDF, a second weight corresponding to the core phrase, a third weight corresponding to a position of the short sentence in the to-be-processed sentence, and a fourth weight corresponding to a part of speech of each core phrase.

Wherein, the first weights corresponding to the words with different TF-IDF values are the same.

The second weight corresponding to a core phrase is a preset second-weight value, and the second weight corresponding to a word that is not a core phrase is 0.

The third weight corresponding to a short sentence at the beginning or the end of the sentence to be processed is higher than the third weight corresponding to short sentences at other positions in the sentence to be processed.

The fourth weights corresponding to words of different parts of speech are different.

The target weight calculation sub-module may include: a first weight calculation submodule and a second weight calculation submodule.

A first weight calculation submodule, configured to calculate the product of the first weight corresponding to the word and the TF-IDF value of the word.

A second weight calculation submodule, configured to calculate the sum of the product, the second weight, the third weight and the fourth weight, to obtain the target weight corresponding to the word.

A keyword selection submodule, configured to determine a preset number of words as the keywords of the sentence to be processed, in descending order of the target weight corresponding to each word in the sentence to be processed.

The keyword extraction device provided by this embodiment first extracts words from the whole sentence, then divides the whole sentence into short sentences and extracts a core phrase from each short sentence, so as to ensure that important information is not missed. In addition, after word segmentation, word fragmentation combination is performed, that is, words with a high co-occurrence frequency are merged, which reduces the number of words and makes the extracted keyword information more complete. In conclusion, the scheme is suited to extracting keywords from the corpus of an automatic question-answering system, that is, the keywords extracted from such a corpus are more accurate.

An electronic device includes a processor and a memory having stored therein a program executable on the processor. The processor implements the above-described embodiment of the keyword extraction method when running the program stored in the memory.

The application also provides a storage medium executable by the computing device, wherein the storage medium stores a program, and the program realizes the keyword extraction method when being executed by the electronic device.

While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present invention is not limited by the illustrated ordering of acts, as some steps may occur in other orders or concurrently with other steps in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

It should be noted that technical features described in the embodiments in the present specification may be replaced or combined with each other, each embodiment is mainly described as a difference from the other embodiments, and the same and similar parts between the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The steps in the method of the embodiments of the present application may be sequentially adjusted, combined, and deleted according to actual needs.

The device and the modules and sub-modules in the terminal in the embodiments of the present application can be combined, divided and deleted according to actual needs.

In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical division, and there may be other divisions when the terminal is actually implemented, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules may be implemented in the form of hardware, or may be implemented in the form of software functional modules or sub-modules.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.
