Keyword extraction method and device, terminal equipment and storage medium

Document No.: 1414310 | Published: 2020-03-10

Abstract: This technology, "Keyword extraction method and device, terminal equipment and storage medium", was created by 赵洋, 魏世胜, 王宇 and 王亚奇 on 2019-10-30. The embodiments of the present application apply to the technical field of text processing and provide a keyword extraction method, apparatus, terminal device, and storage medium. The method comprises: collecting a plurality of texts to be processed, each text comprising a body text and a title text; calculating a word frequency index for each body word in each body text, and generating a text keyword set based on the target body words in the plurality of body texts; calculating a title weight value for each title text according to the text keyword set, and generating a title keyword set based on the title words in the target title texts; and extracting target keywords from the text keyword set and the title keyword set. Because keywords are extracted based on the importance of different words in the body text and the title text, the extracted keywords match the characteristics of the text content itself, which improves the efficiency and accuracy of keyword extraction.

1. A keyword extraction method is characterized by comprising the following steps:

collecting a plurality of texts to be processed, wherein each text comprises a body text and a title text;

calculating a word frequency index for each body word in each body text, and generating a text keyword set based on the target body words in the plurality of body texts, wherein the target body words are the body words that fall within a preset ranking interval of the corresponding body text after the body words are sorted by word frequency index;

calculating a title weight value of each title text according to the text keyword set, and generating a title keyword set based on each title word in the target title text; the target title text is the title text with the title weight value larger than a preset title weight threshold value;

and extracting target keywords from the text keyword set and the title keyword set.

2. The method of claim 1, further comprising, after said collecting a plurality of texts to be processed:

converting each body text into a plain text format, and deleting special characters from the plain-text body text to obtain a target body text;

and segmenting each target body text and each title text into words, and deleting stop words and single-character words from each target body text and each title text according to a preset stop word list.

3. The method of claim 1, wherein the word frequency index is a word frequency-inverse text index, and wherein calculating the word frequency index for each body word in each body text comprises:

counting the number of occurrences of each body word in each body text, counting the total number of words in all body texts, and calculating a first ratio of the number of occurrences to the total number of words;

counting a first number, being the number of all body texts, counting a second number, being the number of target body texts that contain the body word currently being calculated, and calculating a second ratio of the first number to the second number;

and multiplying the first ratio by the second ratio to obtain the word frequency-inverse text index of the body word currently being calculated.

4. The method of claim 3, wherein generating a set of body keywords based on respective target body words in a plurality of body texts comprises:

sorting the body words of each body text in descending order of their word frequency-inverse text indexes, and extracting a plurality of target body words within the preset ranking interval;

calculating a text word frequency weight for the target body words of each body text, wherein the text word frequency weight is obtained by counting the number of occurrences of the target body word in all body texts;

and extracting the target body words whose text word frequency weight is greater than a preset word frequency weight threshold to obtain the text keyword set.

5. The method of claim 1, wherein calculating a title weight value of each title text according to the text keyword set comprises:

respectively identifying whether the title words in each title text are contained in the text keyword set;

if a title word in a title text is contained in the text keyword set, incrementing the importance value of that title text by a first value, wherein every title text has the same initial importance value;

and calculating, for each title text, the ratio of its importance value to the number of title words it contains as the title weight value of that title text.

6. The method of claim 5, wherein generating a set of title keywords based on respective title words in the target title text comprises:

counting a third number of the occurrence times of target title words in all target title texts, and taking the third number as an initial title word frequency weight of the target title words, wherein the target title words are any title words in the target title texts;

if the target title word is the name of the object to be identified, multiplying a second numerical value by the third number to obtain the title word frequency weight of the target title word, wherein the second numerical value is greater than the first numerical value;

and sequencing all target title words according to the title word frequency weight to obtain a title keyword set.

7. The method of claim 1, wherein extracting target keywords from the body keyword set and the title keyword set comprises:

determining the text word frequency weight of each text keyword in the text keyword set, and determining the title word frequency weight of each title keyword in the title keyword set;

performing mixed sequencing on each text keyword and each title keyword according to the text word frequency weight and the title word frequency weight;

and extracting a plurality of target keywords in the target sorting interval after mixed sorting.

8. A keyword extraction apparatus, comprising:

a text acquisition module, configured to collect a plurality of texts to be processed, wherein each text comprises a body text and a title text;

a text keyword set generating module, configured to calculate a word frequency index for each body word in each body text and to generate a text keyword set based on the target body words in the plurality of body texts, wherein the target body words are the body words that fall within a preset ranking interval of the corresponding body text after the body words are sorted by word frequency index;

a title keyword set generating module, configured to calculate a title weight value of each title text according to the text keyword set and to generate a title keyword set based on the title words in the target title texts, wherein a target title text is a title text whose title weight value is greater than a preset title weight threshold;

and a keyword extraction module, configured to extract target keywords from the text keyword set and the title keyword set.

9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the keyword extraction method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the keyword extraction method according to any one of claims 1 to 7.

Technical Field

The present application belongs to the technical field of text processing, and in particular, to a keyword extraction method, apparatus, terminal device, and storage medium.

Background

With the development of information technology, a large amount of data is constantly generated on the internet, and news content is one kind of such data.

Faced with a large volume of news content, users find it difficult to quickly locate the important and valuable items. Keyword recommendation for news content has emerged in response. At present, keyword recommendation mainly works as follows: the relevant organizations manually screen a number of news items, label the keywords of the news content, and then recommend those keywords to users so that users can look up the related news.

However, the way of manually screening news and labeling keywords is easily affected by subjective factors of editors, and the characteristics of news contents are ignored, so that the labeled keywords may not be completely matched with the current important news contents, and the recommendation accuracy is low.

Disclosure of Invention

In view of this, embodiments of the present application provide a keyword extraction method, an apparatus, a terminal device, and a storage medium, so as to solve the problem in the prior art that news content itself is easily ignored in news keyword recommendation, so that recommended keywords may not completely match current important news content, and the recommendation accuracy is low.

A first aspect of an embodiment of the present application provides a keyword extraction method, including:

collecting a plurality of texts to be processed, wherein each text comprises a body text and a title text;

calculating a word frequency index for each body word in each body text, and generating a text keyword set based on the target body words in the plurality of body texts, wherein the target body words are the body words that fall within a preset ranking interval of the corresponding body text after the body words are sorted by word frequency index;

calculating a title weight value of each title text according to the text keyword set, and generating a title keyword set based on each title word in the target title text; the target title text is the title text with the title weight value larger than a preset title weight threshold value;

and extracting target keywords from the text keyword set and the title keyword set.

A second aspect of the embodiments of the present application provides a keyword extraction apparatus, including:

a text acquisition module, configured to collect a plurality of texts to be processed, wherein each text comprises a body text and a title text;

a text keyword set generating module, configured to calculate a word frequency index for each body word in each body text and to generate a text keyword set based on the target body words in the plurality of body texts, wherein the target body words are the body words that fall within a preset ranking interval of the corresponding body text after the body words are sorted by word frequency index;

a title keyword set generating module, configured to calculate a title weight value of each title text according to the text keyword set and to generate a title keyword set based on the title words in the target title texts, wherein a target title text is a title text whose title weight value is greater than a preset title weight threshold;

and a keyword extraction module, configured to extract target keywords from the text keyword set and the title keyword set.

A third aspect of an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the keyword extraction method according to the first aspect when executing the computer program.

A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the keyword extraction method according to the first aspect.

Compared with the prior art, the embodiment of the application has the following advantages:

In the embodiments of the present application, a text keyword set and a title keyword set can be generated by collecting a plurality of texts to be processed and then separately processing the body text and the title text of each text. Since each keyword in the text keyword set and the title keyword set is a word of high importance in the body or the title, the target keywords extracted from these sets can represent the collected texts to the greatest extent. Because keywords are extracted based on the importance of different words in the body text and the title text, the extracted keywords match the characteristics of the text content, which improves the efficiency and accuracy of keyword extraction and addresses the low labeling accuracy that results when keywords are labeled manually and are influenced by the subjective judgment of editors.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

FIG. 1 is a flowchart illustrating steps of a keyword extraction method according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating steps of another keyword extraction method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a keyword extraction apparatus according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a terminal device according to an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

The technical solution of the present application will be described below by way of specific examples.

Referring to FIG. 1, a flowchart of the steps of a keyword extraction method according to an embodiment of the present application is shown. The method may specifically include the following steps:

S101, collecting a plurality of texts to be processed, wherein each text comprises a body text and a title text;

It should be noted that the method can be applied to a terminal device. That is, the terminal device collects a plurality of texts and identifies and processes them, so that the keywords in the texts can be extracted. The terminal device may be a notebook computer, a desktop computer, or another type of device, which is not limited in this embodiment.

In this embodiment, a plurality of texts may be crawled by means of a web crawler. The text may be a news text, or other types of bulletin texts, planning information, and the like, which is not limited in this embodiment.

Take news text as an example. News on a news website of a specific type can be captured through a web crawler, so that a plurality of news texts in a certain field can be obtained. For example, a plurality of sports news texts may be extracted from a sports news website, a plurality of financial news texts may be extracted from a financial news website, and then hot word content in the sports news and the financial news may be extracted by using the keyword extraction method of the embodiment.

In general, a text title can briefly summarize text content, and in the embodiment, when extracting text keywords, a body text and a title text can be processed separately.

S102, calculating a word frequency index for each body word in each body text;

For the body text, the word frequency index of each body word in each body text may be calculated first. The word frequency index may be obtained by counting the number of times each body word appears in the body text, by calculating the ratio of the number of occurrences of each body word to the number of body texts, or by another algorithm for calculating word frequency; the specific calculation manner is not limited in this embodiment.

The calculated word frequency index may be used to indicate the importance of the body word in each body text or in the texts as a whole. Generally, the greater the word frequency index of a body word, the higher the importance of the word, and the more likely the word can serve as a keyword of the body text.

S103, generating a text keyword set based on the target body words in the plurality of body texts, wherein the target body words are the body words that fall within a preset ranking interval of the corresponding body text after the body words are sorted by word frequency index;

In this embodiment, after the word frequency index of each body word in each body text is calculated, some target body words may be screened out, and the text keyword set may then be obtained by pooling those target body words. The target body words may be the body words that fall within a preset ranking interval of the corresponding body text after sorting by word frequency index.

In a specific implementation, for each body text, after the word frequency index of each body word in that text has been calculated, the body words can be sorted by word frequency index in descending or ascending order, and the words with the largest word frequency indexes are then extracted as target body words. For example, after sorting the body words by word frequency index in descending order, the 50 highest-ranked body words in each body text can be extracted as the target body words.

Then, the text keyword set can be obtained by pooling the target body words extracted from each body text.

In a specific implementation, the target body words extracted from each body text can be pooled directly to form the text keyword set. Alternatively, the words can be screened further during pooling, so that only the body words with larger word frequency indexes are pooled into the text keyword set.

For example, for a target body word in a given body text, it can be determined whether the word is also a target body word of most of the other body texts. If so, the word can be considered important across many body texts and can be added to the text keyword set. If a target body word appears in only a few body texts, it can be considered relatively unimportant for the collection as a whole and need not be added to the text keyword set.
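The screening described above can be sketched in Python as follows. This is only an illustration: the function name `build_text_keyword_set` and the `min_share` parameter (the share of body texts that counts as "most") are assumptions, not part of the embodiment.

```python
from collections import Counter

def build_text_keyword_set(target_words_per_text, min_share=0.5):
    """Keep words that are target words in more than min_share of the body texts."""
    n_texts = len(target_words_per_text)
    # For each word, count how many body texts list it as a target word.
    doc_counts = Counter(w for words in target_words_per_text for w in set(words))
    return {w for w, c in doc_counts.items() if c / n_texts > min_share}

# Hypothetical target words of three body texts.
targets = [
    ["rate", "bank", "bond"],
    ["rate", "bank", "stock"],
    ["rate", "oil"],
]
keywords = build_text_keyword_set(targets)
print(sorted(keywords))  # ['bank', 'rate']
```

Here "rate" and "bank" are target words in at least two of the three texts and are kept, while words that are target words in a single text are dropped.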

Of course, the above processing manner is only an example provided in this embodiment, and when the text keyword set is generated according to the target text word, other manners may also be used for processing, which is not limited in this embodiment.

S104, calculating a title weight value of each title text according to the text keyword set;

Since the title of a text generally summarizes its main content accurately, the title carries high importance within the text as a whole. Therefore, this embodiment can process the title text of each text separately.

In this embodiment, the importance of a title text may be represented by a title weight value.

In a specific implementation, a weight value of a title of each title text may be calculated according to the generated text keyword set.

For example, it may be counted how many title words in the title text belong to the keywords in the text keyword set, and then the proportion of the title words in the whole title is calculated, and the proportion is used as the title weight value of the corresponding title text.

Of course, the title weight value of each title text may also be calculated in other manners, which is not limited in this embodiment.

S105, generating a title keyword set based on each title word in the target title text; the target title text is the title text with the title weight value larger than a preset title weight threshold value;

In a specific implementation, the title texts with higher title weight values may be taken as target title texts, and a title keyword set is then generated based on the words in the target title texts.

For example, those title texts whose title weight values are greater than 0.6 may be taken as target title texts, and then the respective words in the respective target title texts are all added to the title keyword set.

Alternatively, after the target title texts are identified, each target title text may be processed further: stop words and meaningless words are deleted, and the remaining words are then added to the title keyword set. This is not limited in this embodiment.

And S106, extracting target keywords from the text keyword set and the title keyword set.

After the text keyword set and the title keyword set are generated respectively, the target keywords can be extracted from the sets to be used as the hot words of all texts.

In a specific implementation, a portion of the text keywords can be extracted from the text keyword set and a portion of the title keywords can be extracted from the title keyword set; the two selections are then compared, and the keywords that appear in both are retained as the final target keywords. Alternatively, the text keyword set and the title keyword set may be merged, and a portion of the target keywords may then be extracted from the merged set.
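The two strategies can be sketched as follows. The function names are hypothetical, and the rule that a word present in both sets keeps the larger of its two weights is an assumption made for illustration; the embodiment does not specify how the merged set is ranked at this point.

```python
def intersect_keywords(text_keywords, title_keywords):
    """Strategy 1: keep only the keywords present in both selections."""
    return set(text_keywords) & set(title_keywords)

def merge_keywords(text_weights, title_weights, top_n):
    """Strategy 2: merge both sets and take the top_n words by weight."""
    pooled = dict(text_weights)
    for word, weight in title_weights.items():
        # Assumption: a word found in both sets keeps its larger weight.
        pooled[word] = max(pooled.get(word, 0), weight)
    ranked = sorted(pooled.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]

# Hypothetical weights for illustration.
text_weights = {"rate": 12, "bank": 7, "bond": 3}
title_weights = {"rate": 5, "merger": 9}

common = intersect_keywords(text_weights, title_weights)
merged = merge_keywords(text_weights, title_weights, top_n=3)
print(common)  # {'rate'}
print(merged)  # ['rate', 'merger', 'bank']
```

The intersection is stricter and returns fewer words; the merged ranking lets a word that is strong in only one of the two sets still surface.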

For the extracted target keywords, the target keywords can be displayed to the user in a diagram mode such as word cloud, and the user can conveniently know which hot words exist in the collected texts.

For example, if the collected texts are recent financial news items captured from a financial news website by a web crawler, the processing described above can present the hot words in that news to the user, so that the user can quickly learn about recent financial hot topics.

In the embodiments of the present application, a text keyword set and a title keyword set can be generated by collecting a plurality of texts to be processed and then separately processing the body text and the title text of each text. Since each keyword in the text keyword set and the title keyword set is a word of high importance in the body or the title, the target keywords extracted from these sets can represent the collected texts to the greatest extent. Because keywords are extracted based on the importance of different words in the body text and the title text, the extracted keywords match the characteristics of the text content, which improves the efficiency and accuracy of keyword extraction and addresses the low labeling accuracy that results when keywords are labeled manually and are influenced by the subjective judgment of editors.

Referring to FIG. 2, a flowchart of the steps of another keyword extraction method according to an embodiment of the present application is shown. The method may specifically include the following steps:

S201, collecting a plurality of texts to be processed, wherein each text comprises a body text and a title text;

It should be noted that the method can be applied to a terminal device. That is, the terminal device collects a plurality of texts and identifies and processes them, so that the keywords in the texts can be extracted.

In particular implementations, a particular type of text may be crawled by a web crawler. For example, a plurality of financial news texts may be crawled from a financial news website.

In this embodiment, after the plurality of texts are collected, the plurality of texts may be preprocessed. For example, each body text may be converted into a plain text format, the special characters in the body text in the plain text format may be deleted to obtain a target body text, then each target body text and each heading text may be segmented, and stop words and single words in each target body text and each heading text may be deleted according to a preset stop word table.

In a specific implementation, for a collected text, useless characters such as spaces and special characters can be filtered out of the body text first. A word segmentation tool is then used to segment the body text, stop words and single characters are removed, and the remaining words are retained as the body text for subsequent processing.

The title text may be preprocessed in the same manner: spaces, special characters and the like are first filtered out of the title, the title text is then segmented, and the words remaining after stop words and single characters are removed are retained as the title text for subsequent processing.
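A minimal sketch of this preprocessing is given below. It makes two simplifying assumptions: whitespace splitting stands in for a real word segmentation tool (which Chinese text would require), and `STOP_WORDS` is a small hypothetical stop word list rather than the preset table of the embodiment.

```python
import re

STOP_WORDS = {"the", "a", "of", "in", "and", "to"}  # hypothetical stop word list

def preprocess(text, stop_words=STOP_WORDS):
    """Strip special characters, split into words, drop stop words and single characters."""
    # Replace every character that is neither a word character nor whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    # Assumption: whitespace splitting stands in for a word segmentation tool.
    tokens = cleaned.lower().split()
    return [t for t in tokens if t not in stop_words and len(t) > 1]

words = preprocess("The central bank raised rates by 0.5% in Q3!")
print(words)  # ['central', 'bank', 'raised', 'rates', 'by', 'q3']
```

The same function can be applied to body texts and title texts, since both are preprocessed in the same way.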

It should be noted that, because the title has a higher importance for one text, the embodiment may perform subsequent processing on the preprocessed body text and the preprocessed title text respectively.

S202, respectively calculating the word frequency-inverse text index of each text word in each text;

For the body text, the word frequency index of each body word in each body text can be calculated first. The word frequency index can be a word frequency-inverse text index (TF-IDF).

In this embodiment, TF is the ratio of the number of times a word appears in a given text to the total number of words in all texts; it reflects how often the word appears, and a larger count indicates a more important word. IDF is the ratio of the number of all texts to the number of texts containing the word. The TF-IDF value of the word is obtained by multiplying the two values. In general, the larger the TF-IDF value of a word in a text, the higher the importance of the word in that text.

In this embodiment, the number of occurrences of each body word in each body text may be counted first, the total number of words in all body texts may be counted, and a first ratio of the number of occurrences to the total number of words, that is, the TF value, may be calculated. Then, a first number, being the number of all body texts, and a second number, being the number of target body texts containing the body word currently being calculated, are counted, and a second ratio of the first number to the second number, that is, the IDF value, is calculated. Multiplying the first ratio by the second ratio gives the word frequency-inverse text index of the body word currently being calculated, that is, its TF-IDF value.
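The calculation can be sketched as follows, using the two ratios exactly as described in this embodiment. Note that this differs from the more common TF-IDF formulation, which normalizes TF by the length of the individual document and takes the logarithm of the IDF ratio. The function name and the sample data are illustrative assumptions.

```python
from collections import Counter

def tf_idf_per_text(body_texts):
    """Return one {word: tf-idf} dict per body text, using the ratios described above."""
    total_words = sum(len(words) for words in body_texts)  # total words in all body texts
    n_texts = len(body_texts)                              # first number
    # Second number for each word: how many body texts contain it.
    doc_freq = Counter(w for words in body_texts for w in set(words))
    scores = []
    for words in body_texts:
        counts = Counter(words)
        scores.append({
            w: (c / total_words) * (n_texts / doc_freq[w])  # first ratio * second ratio
            for w, c in counts.items()
        })
    return scores

# Two hypothetical, already segmented body texts.
texts = [
    ["bank", "rate", "rate", "bond"],
    ["bank", "stock", "stock", "stock"],
]
scores = tf_idf_per_text(texts)
print(scores[0])  # {'bank': 0.125, 'rate': 0.5, 'bond': 0.25}
```

In the first text, "rate" occurs twice among 8 words in total (TF = 0.25) and appears in 1 of 2 texts (IDF = 2), giving 0.5, whereas "bank" appears in both texts and scores only 0.125.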

S203, sequencing the text words according to the word frequency-inverse text index of each text word in each text from large to small, and extracting a plurality of target text words in a preset sequencing interval;

In this embodiment, after the TF-IDF value of each body word in each body text is calculated, the body words may be sorted in descending order of TF-IDF value, and a certain number of the highest-ranked words are then extracted as target body words.

For example, for each body text, the 50 words with the highest TF-IDF values can be extracted from the text as the target body words.

Of course, the above number is only an example; the top 30 or top 100 words by TF-IDF value may also be extracted according to actual needs, which is not limited in this embodiment.
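Selecting the highest-ranked words of one body text can be sketched as below. The function name `top_k_words` and the alphabetical tie-breaking are assumptions added so that the result is deterministic.

```python
def top_k_words(scores, k=50):
    """Return the k words with the largest scores, highest first."""
    # Ties are broken alphabetically (an assumption, for reproducibility).
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]

# Hypothetical TF-IDF values of one body text.
sample_scores = {"rate": 0.5, "bond": 0.25, "bank": 0.125}
top_two = top_k_words(sample_scores, k=2)
print(top_two)  # ['rate', 'bond']
```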

S204, calculating the text word frequency weight of the target body words of each body text, wherein the text word frequency weight is obtained by counting the number of occurrences of the target body word in all body texts;

For the extracted target body words, the occurrence counts of identical words across the body texts can be accumulated to obtain the text word frequency weight of each target body word.

For example, if a word is a target body word of 30 body texts (that is, its TF-IDF value ranks in the top 50 in each of those 30 body texts), the numbers of times the word occurs in those 30 body texts may be accumulated to obtain the text word frequency weight of the word.
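The accumulation can be sketched as follows. The sketch follows the reading of the example above, in which occurrences are summed only over the body texts that selected the word as a target word; the function name and sample data are assumptions.

```python
from collections import Counter

def text_word_frequency_weights(body_texts, targets_per_text):
    """Sum each target word's occurrence counts over the body texts that selected it."""
    weights = Counter()
    for words, targets in zip(body_texts, targets_per_text):
        counts = Counter(words)
        for word in set(targets):
            weights[word] += counts[word]
    return weights

# Hypothetical segmented body texts and their target words.
body_texts = [
    ["rate", "rate", "bank"],
    ["rate", "bank", "bank", "oil"],
]
targets_per_text = [["rate"], ["rate", "bank"]]

weights = text_word_frequency_weights(body_texts, targets_per_text)
print(weights["rate"], weights["bank"])  # 3 2
```

"bank" is a target word only in the second text, so its single occurrence in the first text is not counted under this reading.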

S205, extracting a plurality of target text words of which the text word frequency weight is greater than a preset word frequency weight threshold value to obtain a text keyword set;

In a specific implementation, after the text word frequency weights of all target body words are calculated, the words can be sorted in descending order of text word frequency weight, and the words whose weight is greater than a certain threshold are extracted as the keywords of the text keyword set.

Alternatively, after sorting by text word frequency weight, a certain number of words with the largest weights can be extracted as the keywords of the text keyword set.
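Both selection rules can be sketched briefly. The function names and the sample weights are hypothetical.

```python
def select_by_threshold(weights, threshold):
    """Keep the words whose weight is strictly greater than the threshold."""
    return {word for word, value in weights.items() if value > threshold}

def select_top_n(weights, n):
    """Keep the n words with the largest weights, highest first."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

# Hypothetical text word frequency weights.
sample_weights = {"rate": 3, "bank": 2, "oil": 1}
above = select_by_threshold(sample_weights, threshold=1)
best = select_top_n(sample_weights, n=1)
print(sorted(above), best)  # ['bank', 'rate'] ['rate']
```

A threshold yields a set whose size depends on the data, while a fixed count always yields at most `n` words.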

S206, calculating a title weight value of each title text according to the text keyword set;

In this embodiment, when processing the title texts, title texts with too many or too few words may be filtered out first. For example, overly long title texts with more than 25 words may be deleted, and overly short title texts with fewer than 5 words may be deleted.

Then, the remaining title texts are processed, and the title weight value of each title text is calculated. The title weight value can be calculated according to the generated text keyword set.
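The length filter can be sketched as follows, using the example bounds of 5 and 25 words. Whether words are counted before or after stop word removal is not stated in the text; the sketch simply counts the words it is given.

```python
def filter_titles_by_length(titles, min_words=5, max_words=25):
    """Keep titles whose word count lies between min_words and max_words inclusive."""
    return [title for title in titles if min_words <= len(title) <= max_words]

# Hypothetical segmented titles with 3, 8 and 30 words.
sample_titles = [["w"] * 3, ["w"] * 8, ["w"] * 30]
kept = filter_titles_by_length(sample_titles)
print([len(title) for title in kept])  # [8]
```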

In this embodiment, it may be identified for each title text whether its title words are contained in the text keyword set. Each time a title word of a title text is contained in the text keyword set, the importance value of that title text is incremented by the first value. The ratio of the importance value of each title text to the number of title words it contains is then calculated as the title weight value of that title text.

It should be noted that the initial value of the importance value of each title text should be the same.

In a specific implementation, on the basis of word segmentation, it may be determined whether each title word belongs to a keyword in the text keyword set, and if so, 1 may be added to the importance value of the title text, so as to obtain the importance value of each title in an accumulated manner.

Taking the initial value of the importance value of each title as 0 and the increasing first value as 1 as an example, if 5 title words in a title belong to the keywords in the text keyword set, the importance value of the title is 5.

By calculating the title importance value, a large number of texts with lower importance values can be excluded to a certain extent, and the texts that are more valuable for finding hot words are retained.

The weight value of each title may then be obtained by dividing the importance value of the title by the number of words in the title text.

For example, if there are 8 title words in a title, and 5 title words belong to the keywords in the text keyword set, the importance value of the title is 5, and the corresponding weight value of the title is 5/8 = 0.625.
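The calculation of the title weight value can be sketched as follows, reproducing the 5/8 example. This is a sketch under stated assumptions: the function name `title_weight`, its parameter names, and the sample words are hypothetical, and the title is assumed to be already segmented into words.

```python
def title_weight(title_words, keyword_set, first_value=1, initial_value=0):
    """Importance value (incremented per matching word) divided by the word count."""
    if not title_words:
        return 0.0  # assumption: an empty title gets weight 0
    importance = initial_value
    for word in title_words:
        if word in keyword_set:
            importance += first_value
    return importance / len(title_words)


# Hypothetical title of 8 words, 5 of which are in the text keyword set.
title = ["acme", "bank", "announces", "merger", "deal", "with", "rival", "today"]
keywords = {"acme", "bank", "merger", "deal", "rival"}
print(title_weight(title, keywords))  # 0.625
```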

S207, counting a third number of the occurrence times of target title words in all target title texts, and taking the third number as an initial title word frequency weight of the target title words, wherein the target title words are any title words in the target title texts;

In this embodiment, the target title text may be a title text whose title weight value is greater than a preset title weight threshold. For example, the title texts whose title weight values are greater than 0.6 may be extracted as the target title texts.

For the target title texts, the number of times that any title word appears across all target title texts, that is, the third number, may be counted.

In the case where no word is repeated within the same title, the initial title word frequency weight of a target title word may simply be regarded as equal to the number of target title texts containing that word.
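Step S207 can be sketched as below. The sketch assumes the title weight values have already been computed and are supplied alongside the segmented titles; the function name `initial_title_word_weights`, the input layout, and the sample data are hypothetical, and 0.6 is the example threshold from the text.

```python
from collections import Counter


def initial_title_word_weights(weighted_titles, weight_threshold=0.6):
    """Count word occurrences over titles whose weight exceeds the threshold.

    weighted_titles is a list of (title_words, title_weight) pairs.
    """
    counts = Counter()
    for title_words, weight in weighted_titles:
        if weight > weight_threshold:  # only target title texts contribute
            counts.update(title_words)
    return counts


# Hypothetical data: the third title falls below the threshold and is ignored.
weighted_titles = [
    (["acme", "merger", "deal"], 0.67),
    (["acme", "bank", "report"], 0.75),
    (["weather", "today"], 0.2),
]
counts = initial_title_word_weights(weighted_titles)
print(counts["acme"], counts["weather"])  # 2 0
```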

S208, if the target title word is the name of the object to be identified, multiplying a second numerical value by the third number to obtain the title word frequency weight of the target title word, wherein the second numerical value is greater than the first numerical value;

The name of the object to be identified may be the name of a company, organization, or person of interest to the user. In general, if such a name appears in the title text, the name is of higher importance for the entire text.

Therefore, for such target title words, the title word frequency weight can be calculated from the initial title word frequency weight according to a certain multiple. That is, the third number of occurrences may be multiplied by a second value. As an example, the second value may be 3.

For target title words that are not names of objects to be identified, the initial title word frequency weight can be used directly as the final title word frequency weight.
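The weighting rule of step S208 can be sketched as follows. The function name `title_word_frequency_weights` and the sample data are hypothetical; the set of object names is assumed to be supplied by the user, and the multiplier 3 is the example second value given above.

```python
def title_word_frequency_weights(initial_weights, object_names, second_value=3):
    """Multiply the count of each object name by second_value; keep other counts as-is."""
    return {
        word: count * second_value if word in object_names else count
        for word, count in initial_weights.items()
    }


# Hypothetical initial weights (third numbers) and a user-supplied name list.
initial = {"acme": 2, "merger": 1}
print(title_word_frequency_weights(initial, {"acme"}))  # {'acme': 6, 'merger': 1}
```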

S209, sequencing each target title word according to the title word frequency weight to obtain a title keyword set;

After the title word frequency weights of all target title words are calculated, the target title words can be sorted by title word frequency weight in descending order, and the top-ranked title words are then extracted to form the title keyword set; alternatively, all target title words may be used as keywords in the title keyword set.

S210, extracting target keywords from the text keyword set and the title keyword set.

After the text keyword set and the title keyword set are generated respectively, the target keywords can be extracted from the sets and used as the hot words of all texts.

As an example of this embodiment, the text word frequency weight of each text keyword in the text keyword set and the title word frequency weight of each title keyword in the title keyword set may be determined first. The text keywords and title keywords are then ranked together according to these weights, and the target keywords that fall within a target sorting interval of the combined ranking are extracted. The target sorting interval may be the leading portion of the combined ranking, that is, the words with the largest weights.
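The mixed ranking can be sketched as below. The text does not say how a word that appears in both sets should be handled, so this sketch assumes the larger of its two weights is kept; that rule, the function name `mixed_rank`, and the sample weights are all assumptions made for illustration.

```python
def mixed_rank(text_weights, title_weights, top_n):
    """Rank text and title keywords together by weight and return the top_n words."""
    merged = dict(text_weights)
    for word, weight in title_weights.items():
        # Assumption: a word present in both sets keeps its larger weight.
        merged[word] = max(weight, merged.get(word, weight))
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:top_n]]


# Hypothetical keyword weights.
text_weights = {"bank": 12, "merger": 9, "report": 4}
title_weights = {"merger": 15, "acme": 6}
print(mixed_rank(text_weights, title_weights, 3))  # ['merger', 'bank', 'acme']
```

Other merge rules, such as summing the two weights, would be equally compatible with the description and would favor words that are prominent in both body and title.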

For the extracted target keywords, the target keywords can be displayed to the user in a diagram mode such as word cloud, and the user can conveniently know which hot words exist in the collected texts.

This embodiment first processes the body text of each text, making effective use of the body content to mine the candidate words and their weights, and then processes the title texts on that basis, which makes it possible to compare the body content against the title content. Because the title of a text is usually important, combining the body content with an iterative calculation over the words in the title texts yields a more accurate hot word result and allows a user to quickly grasp the content of the texts.

It should be noted that, the sequence numbers of the steps in the foregoing embodiments do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic of the process, and should not constitute any limitation on the implementation process of the embodiments of the present application.

Referring to fig. 3, a schematic diagram of a keyword extraction apparatus according to an embodiment of the present application is shown, and the keyword extraction apparatus may specifically include the following modules:

the text acquisition module 301 is configured to acquire a plurality of texts to be processed, where each text includes a body text and a title text;

a text keyword set generating module 302, configured to calculate the word frequency index of each text word in each body text, and generate a text keyword set based on the target text words in the multiple body texts; the target text words are the text words of the corresponding body text that fall within a preset sorting interval after being sorted according to the word frequency index;

a title keyword set generating module 303, configured to calculate a title weight value of each title text according to the text keyword set, and generate a title keyword set based on each title word in the target title text; the target title text is the title text with the title weight value larger than a preset title weight threshold value;

a keyword extraction module 304, configured to extract a target keyword from the text keyword set and the title keyword set.

In this embodiment, the apparatus may further include the following modules:

the preprocessing module is used for respectively converting each text into a plain text format, deleting special characters in the text in the plain text format and obtaining a target text; and respectively segmenting each target body text and each title text, and deleting stop words and single words in each target body text and each title text according to a preset stop word list.

In this embodiment of the present application, the word frequency index is a word frequency-inverse text index, and the text keyword set generating module 302 may specifically include the following sub-modules:

the first ratio calculation submodule is used for counting the number of occurrences of each text word in each body text, counting the total number of text words in all the body texts, and calculating a first ratio between the number of occurrences and the total number of words;

the second ratio calculation submodule is used for counting the first number of all the body texts, counting the second number of target body texts containing the text word currently being calculated, and calculating a second ratio between the first number and the second number;

and the word frequency-inverse text index calculation submodule is used for multiplying the first ratio and the second ratio to obtain the word frequency-inverse text index of the current text word to be calculated.
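The three submodules above can be sketched as follows. The sketch follows this passage literally: the first ratio divides by the total number of words across all body texts, and the second ratio is used without a logarithm, because this passage does not mention one (the conventional TF-IDF weighting, which this index resembles, divides by the length of the individual document and takes the logarithm of the second ratio). The function name `word_frequency_index` and the sample texts are hypothetical.

```python
from collections import Counter


def word_frequency_index(body_texts):
    """Return one {word: index} dict per segmented body text.

    index = (occurrences in this text / total words in all texts)
            * (number of texts / number of texts containing the word)
    """
    total_words = sum(len(words) for words in body_texts)
    first_number = len(body_texts)
    containing = Counter()  # second number, per word
    for words in body_texts:
        containing.update(set(words))
    indexes = []
    for words in body_texts:
        counts = Counter(words)
        indexes.append({
            word: (count / total_words) * (first_number / containing[word])
            for word, count in counts.items()
        })
    return indexes


# Hypothetical segmented body texts.
texts = [["bank", "merger", "bank"], ["bank", "report"], ["merger", "deal", "deal"]]
indexes = word_frequency_index(texts)
print(indexes[2]["deal"])  # 0.75
```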

In this embodiment of the present application, the text keyword set generating module 302 may further include the following sub-modules:

the target text word extraction submodule is used for sorting the text words in each body text by their word frequency-inverse text index in descending order, and extracting a plurality of target text words in a preset sorting interval;

the text word frequency weight calculation submodule is used for calculating the text word frequency weight of the target text words corresponding to each body text, and the text word frequency weight is obtained by counting the number of occurrences of the target text word in all the body texts;

and the text keyword set generation submodule is used for extracting a plurality of target text words of which the text word frequency weight is greater than a preset word frequency weight threshold value to obtain a text keyword set.
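How these three submodules fit together can be sketched as below. The sketch assumes the per-text index values are already available as dictionaries and that the preset sorting interval is the top `top_k` words of each text; the function name `text_keyword_set`, the parameter names, and the sample data are hypothetical.

```python
from collections import Counter


def text_keyword_set(body_texts, indexes, top_k, weight_threshold):
    """Pick the top_k words per text, weight them by total occurrences, then threshold."""
    targets = set()
    for index in indexes:
        ranked = sorted(index, key=index.get, reverse=True)
        targets.update(ranked[:top_k])  # target text words of this text
    occurrences = Counter(word for words in body_texts for word in words)
    return {
        word: occurrences[word]
        for word in targets
        if occurrences[word] > weight_threshold
    }


# Hypothetical segmented body texts and their precomputed index values.
texts = [["bank", "merger", "bank"], ["bank", "report"], ["merger", "deal", "deal"]]
indexes = [
    {"bank": 0.375, "merger": 0.1875},
    {"bank": 0.1875, "report": 0.375},
    {"merger": 0.1875, "deal": 0.75},
]
print(text_keyword_set(texts, indexes, top_k=1, weight_threshold=1) == {"bank": 3, "deal": 2})  # True
```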

In this embodiment of the present application, the title keyword set generating module 303 may specifically include the following sub-modules:

the title word identification submodule is used for respectively identifying whether the title words in each title text are contained in the text keyword set;

the importance value counting sub-module is used for increasing the importance value corresponding to each title text according to a first value if the title words in the title text are contained in the text keyword set, and the initial value of the importance value of each title text is the same;

and the title weight value calculating submodule is used for calculating the ratio of the importance value corresponding to each title text to the number of contained title words respectively as the title weight value of the title text.

In this embodiment of the present application, the title keyword set generating module 303 may further include the following sub-modules:

the initial title word frequency weight calculation submodule is used for counting the third number of the occurrence times of target title words in all target title texts, and taking the third number as the initial title word frequency weight of the target title words, wherein the target title words are any title words in the target title texts;

a title word frequency weight calculation submodule, configured to, if the target title word is the name of the object to be identified, multiply a second numerical value by the third number to obtain a title word frequency weight of the target title word, where the second numerical value is greater than the first numerical value;

and the title keyword set generation submodule is used for sequencing all target title words according to the title word frequency weight to obtain a title keyword set.

In this embodiment of the application, the keyword extraction module 304 may specifically include the following sub-modules:

a text word frequency weight determining submodule, configured to determine the text word frequency weight of each text keyword in the text keyword set;

the title word frequency weight determining submodule is used for determining the title word frequency weight of each title keyword in the title keyword set;

a mixed ordering submodule, configured to perform mixed ordering on each text keyword and each title keyword according to the text word frequency weight and the title word frequency weight;

and the target keyword extraction submodule is used for extracting a plurality of target keywords which are in a target sorting interval after mixed sorting.

For the apparatus embodiment, since it is substantially similar to the method embodiment, it is described relatively simply, and reference may be made to the description of the method embodiment section for relevant points.

Referring to fig. 4, a schematic diagram of a terminal device according to an embodiment of the present application is shown. As shown in fig. 4, the terminal device 400 of the present embodiment includes: a processor 410, a memory 420, and a computer program 421 stored in the memory 420 and executable on the processor 410. The processor 410, when executing the computer program 421, implements the steps in the various embodiments of the keyword extraction method, such as the steps S101 to S106 shown in fig. 1. Alternatively, the processor 410, when executing the computer program 421, implements the functions of the modules/units in the above device embodiments, such as the functions of the modules 301 to 304 shown in fig. 3.

Illustratively, the computer program 421 may be partitioned into one or more modules/units, which are stored in the memory 420 and executed by the processor 410 to implement the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and these segments describe the execution process of the computer program 421 in the terminal device 400. For example, the computer program 421 may be divided into a text acquisition module, a text keyword set generating module, a title keyword set generating module, and a keyword extraction module, and each module has the following specific functions:

the text acquisition module is used for acquiring a plurality of texts to be processed, and each text comprises a body text and a title text;

the text keyword set generating module is used for respectively calculating word frequency indexes of all text words in each text and generating a text keyword set based on all target text words in the plurality of text texts; the target text words are text words in a preset sequencing interval of the corresponding text after being sequenced according to the word frequency index;

the title keyword set generating module is used for calculating a title weight value of each title text according to the text keyword set and generating a title keyword set based on each title word in the target title text; the target title text is the title text with the title weight value larger than a preset title weight threshold value;

and the keyword extraction module is used for extracting target keywords from the text keyword set and the title keyword set.

The terminal device 400 may be a desktop computer, a notebook, a palm computer, a cloud server, or another computing device. The terminal device 400 may include, but is not limited to, a processor 410 and a memory 420. Those skilled in the art will appreciate that fig. 4 is only one example of a terminal device 400 and does not constitute a limitation of the terminal device 400, which may include more or fewer components than those shown, combine certain components, or use different components; for example, the terminal device 400 may also include input and output devices, network access devices, buses, etc.

The processor 410 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 420 may be an internal storage unit of the terminal device 400, such as a hard disk or a memory of the terminal device 400. The memory 420 may also be an external storage device of the terminal device 400, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and so on, provided on the terminal device 400. Further, the memory 420 may also include both an internal storage unit and an external storage device of the terminal device 400. The memory 420 is used for storing the computer program 421 and other programs and data required by the terminal device 400. The memory 420 may also be used to temporarily store data that has been output or is to be output.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.
