Markdown feature-aware unsupervised keyword extraction method

Document No.: 1170284 Published: 2020-09-18

Reading note: This technology, "A Markdown feature-aware unsupervised keyword extraction method" (一种Markdown特征感知的无监督关键词提取方法), was designed by 杨凌锋 (Yang Lingfeng), 赵慧 (Zhao Hui) and 华丽萍 (Hua Liping) on 2020-04-21. Abstract: The invention provides a Markdown feature-aware unsupervised keyword extraction method. It exploits the rich semantic markup of Markdown to extract a new text feature, MD-Feature (Markdown feature), which helps improve keyword extraction accuracy for blog posts written in Markdown. Using MD-Feature, it proposes two improved algorithms built on TF-IDF and TextRank: MD-TFIDF (Markdown feature aware TF-IDF) and MD-TextRank (Markdown feature aware TextRank). Combining MD-TFIDF and MD-TextRank, it proposes MDKE, a Markdown feature-aware keyword extraction algorithm for blog posts in Markdown format. The method can improve keyword extraction accuracy and meets the specific needs of Markdown-formatted blog posts.

1. A Markdown feature-aware unsupervised keyword extraction method, comprising the following steps:

(1) extracting the text feature MD-Feature from the semantic expression characteristics of a text in Markdown format, specifically comprising:

(1.1) dividing semantic blocks: dividing each Markdown-format text into a set of semantic blocks;

(1.2) segmenting the text into words: segmenting each semantic block into words using the jieba word segmenter, a technical-term dictionary and a stop-word dictionary;

(1.3) calculating position information: calculating the position information of each word in the semantic block;

(1.4) calculating MD-Feature;

(2) proposing the improved algorithms MD-TFIDF and MD-TextRank by incorporating MD-Feature into the TF-IDF algorithm and the TextRank algorithm;

(3) proposing the MDKE algorithm on the basis of (2), which makes full use of the statistical information and position information of words, together with the latent semantic information of the user captured by MD-Feature, to extract keywords from the text in Markdown format, specifically comprising:

(3.1) computing keyword candidate set L1: computing the keyword candidate set L1 of the document using the MD-TFIDF algorithm;

(3.2) computing keyword candidate set L2: computing the keyword candidate set L2 of the document using the MD-TextRank algorithm;

(3.3) obtaining the intersection L of the candidate sets, evaluating its size and adjusting it: taking the intersection L of the keyword candidate set L1 from (3.1) and the keyword candidate set L2 from (3.2); determining whether the size of L is smaller than the preset candidate set size Msize and, if so, selecting further top-ranked candidate words from L1 and L2 and adding them to L, stopping once the size of L is greater than or equal to Msize;

(3.4) scoring the candidate words and outputting them in order: ranking the words in the candidate set L using a topic model, and outputting the first K words of the ranking as the keywords of the text.

2. The Markdown feature-aware unsupervised keyword extraction method according to claim 1, wherein, in the division of semantic blocks in step (1.1), the Markdown text is divided into different semantic blocks according to the semantic description of each Markdown mark; each semantic block is extracted from the original Markdown-formatted document by a regular expression, and the type and content of a semantic block are separated by "\001".

3. The Markdown feature-aware unsupervised keyword extraction method according to claim 1, wherein MD-Feature in step (1.4) is calculated as follows: for a word w_ijk, the reciprocal of the ratio of the number of words in the j-th semantic block to the total number of words is calculated to measure the influence of the size of the semantic block on the weight of the word; the position order of w_ijk within the j-th semantic block is then calculated to measure the influence of the word's position in the semantic block on the weight; and, because different semantic blocks carry different semantic descriptions, each semantic block is given a different semantic weight W_md, whose value is obtained experimentally; finally, the three factors are multiplied to obtain the MD-Feature formula:

[Formula (1): image not reproduced]

wherein K_q denotes the number of words in the q-th semantic block.

4. The Markdown feature-aware unsupervised keyword extraction method according to claim 3, wherein the improved algorithm MD-TFIDF in step (2) uses the following formula:

[Formula (2): image not reproduced]

wherein tf_il denotes the frequency of the word w_ijk in document d_i; idf_jk denotes the inverse document frequency of w_ijk, that is, the inverse of the ratio of the number of documents n_j in the document collection D that contain w_ijk to the total number of documents N; and MDF(w_ijk) denotes the MD-Feature score of w_ijk;

the specific steps being as follows:

(4.1) segmenting semantic blocks into words: using a word segmentation tool, segmenting each semantic block of the blog post d_i from which keywords are to be extracted into words, denoted W_ij = {w_ij1, w_ij2, ..., w_ijk, ..., w_ijK}, wherein W_ij denotes the word segmentation result of the j-th semantic block of blog post i, w_ijk denotes the k-th word in the word segmentation result, and K is the number of words in W_ij;

(4.2) counting word frequencies: using tokenCount, with the word w_ijk as the key and its number of occurrences in the text as the value, recording the frequency of each word in the word segmentation result as key-value pairs;

(4.3) calculating word scores: counting the number of texts in the data set in which each word appears to obtain its idf value, and calculating the score of each word using formula (2);

(4.4) recording the scores: storing each score as a value in tokenScore, with the word w_ijk as the key;

(4.5) sorting: sorting the words in tokenScore by score, and taking the TopK highest-scoring words as the keywords of the text.

5. The Markdown feature-aware unsupervised keyword extraction method according to claim 4, wherein, in the improved algorithm MD-TextRank of step (2): G = (V, E) is a directed graph composed of a point set V and an edge set E, the points in V being the words of the text and the edges in E connecting words that appear in the same co-occurrence window; for a point v_i, the set of all nodes pointing to it (in-degree nodes) is denoted In(v_i) and the set of all nodes it points to (out-degree nodes) is denoted Out(v_i); and the node weight is calculated as follows:

[Formula (3): image not reproduced]

wherein w_ji denotes the weight of the edge between point v_j and point v_i; the adjustment (damping) coefficient d is typically set to 0.85; and mdf(v_i) denotes the MD-Feature score of the word corresponding to node v_i;

the specific steps being as follows:

(5.1) segmenting semantic blocks into words: using a word segmentation tool, segmenting each semantic block of the blog post d_i from which keywords are to be extracted into words, denoted W_ij = {w_ij1, w_ij2, ..., w_ijk, ..., w_ijK}, wherein w_ijk denotes the k-th word in the word segmentation result of the j-th semantic block of blog post d_i, and K is the number of words in W_ij;

(5.2) calculating the MD-Feature of each word;

(5.3) recording the scores: storing the calculated MD-Feature as the value in Weight, with the word w_ijk as the key, in key-value form; constructing the word graph G = (V, E) according to the co-occurrence relation and initializing the weight of each node from Weight; iterating with formula (3); and, on convergence, writing the converged weights back into Weight as key-value pairs;

(5.4) sorting: sorting the words in Weight by weight, and selecting the top-weighted words as the keywords of the text d_i.

Technical Field

The invention relates to the technical field of keyword extraction, in particular to a Markdown feature-aware unsupervised keyword extraction method, and especially to keyword extraction for blog posts in Markdown format.

Background

With the progress of information technology and the development of the Internet, people have moved from an age of information scarcity into an age of information diversity, and even of information overload. Take the Mining Network, one of the more active Internet IT technology exchange communities in China, as an example. It is an integrated platform for disseminating and sharing technology and knowledge: technology enthusiasts and practitioners can publish blog posts or other posts in the community to record and share their experience of, or opinions about, a technology; browse or bookmark technical articles that interest them; follow the activity of a particular technical expert; and take part in discussions of related technical topics. Every day, enthusiasts and practitioners produce a large number of technical articles, along with behaviors such as browsing, bookmarking, commenting and liking. As the data grows, both the consumers and the producers of technical articles face significant challenges: consumers find it very difficult to pick out the articles that interest them from a large volume of technical articles, while producers find it very difficult to spread their articles widely and attract more attention. A recommendation system is an effective way to filter information for users in an information-overload environment: it adaptively provides users with information that meets their needs, based on their interests and preferences.

Keyword extraction is an important text-mining technique and a basic, necessary component of a recommendation system; it can effectively help address information overload.

The most common unsupervised keyword extraction methods are based on topic models, on TF-IDF word-frequency statistics, or on the TextRank algorithm. Topic-model-based extraction computes the importance of a word from the similarity between the topic distributions of the document and of the word; because the topic distribution usually has to be learned from a training corpus, the quality of the extracted keywords depends heavily on the topic distribution of that corpus. TF-IDF-based extraction is a widely used approach that judges the importance of a word to an article mainly from its term frequency and inverse document frequency; it relies too heavily on the statistical properties of words and ignores features such as semantics and context. TextRank-based extraction is a graph-based approach: it builds a word graph from relations between nearby words (such as co-occurrence within a window) and then ranks the words of the text. However, the word weights in this method often carry no real meaning beyond the co-occurrence relation, and the method lacks an understanding of context.

Most blog posts in IT technical communities are stored in Markdown format. Markdown is a lightweight markup language that lets people write documents using a small set of predefined symbols. Text in Markdown format differs from conventional text in that some special symbols in it carry special semantics. For example, "#" would be treated as a stop word in plain text, but in Markdown-format text it denotes a first-level heading. How to use the semantic features of Markdown to make unsupervised keyword extraction more effective is therefore an innovation worth investigating.

Disclosure of Invention

The invention provides a Markdown feature-aware unsupervised keyword extraction method. It exploits the rich semantic markup of Markdown to extract a new text feature, MD-Feature (Markdown feature), which helps improve keyword extraction accuracy for blog posts written in Markdown. Using MD-Feature, it proposes two improved algorithms built on TF-IDF and TextRank: MD-TFIDF (Markdown feature aware TF-IDF) and MD-TextRank (Markdown feature aware TextRank). Combining MD-TFIDF and MD-TextRank, it proposes MDKE, a Markdown feature-aware keyword extraction algorithm for blog posts in Markdown format. The method can improve keyword extraction accuracy and meets the specific needs of Markdown-formatted blog posts.

The Markdown feature-aware unsupervised keyword extraction method provided by the invention comprises the following steps:

(1) extracting a new text feature, MD-Feature (Markdown feature), from the semantic expression characteristics of a blog post in Markdown format, specifically comprising:

(1.1) dividing semantic blocks: dividing each Markdown text into a set of semantic blocks;

(1.2) segmenting the text into words: segmenting each semantic block into words using the jieba word segmenter, a technical-term dictionary and a stop-word dictionary;

(1.3) calculating position information: calculating the position information of each word in the semantic block;

(1.4) calculating MD-Feature.

(2) Proposing the improved algorithms MD-TFIDF (Markdown Feature aware TF-IDF) and MD-TextRank (Markdown Feature aware TextRank) by incorporating MD-Feature into the TF-IDF algorithm and the TextRank algorithm.

(3) Proposing the MDKE algorithm on the basis of (2), which makes full use of the statistical information and position information of words, together with the latent semantic information of the user captured by MD-Feature, to extract keywords from a blog post in Markdown format. The specific steps are as follows:

(3.1) computing keyword candidate set L1: computing the keyword candidate set L1 of the document using the MD-TFIDF algorithm from (2);

(3.2) computing keyword candidate set L2: computing the keyword candidate set L2 of the document using the MD-TextRank algorithm from (2);

(3.3) obtaining the intersection L of the candidate sets, evaluating its size and adjusting it: taking the intersection L of the keyword candidate set L1 from (3.1) and the keyword candidate set L2 from (3.2); determining whether the size of L is smaller than the preset candidate set size Msize and, if so, selecting further top-ranked candidate words from L1 and L2 and adding them to L until the size of L is greater than or equal to Msize;

(3.4) scoring the candidate words and outputting them in order: because the MD-TFIDF and MD-TextRank algorithms score candidate words on inconsistent scales, the words in the candidate set L are ranked using a topic model (LDA), and the first K words of the ranking are output as the keywords of the article.
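Step (3.3) can be illustrated with a minimal Python sketch. The function name `merge_candidates` and the alternating top-up order are assumptions made for illustration; the source only says that further top-ranked words from L1 and L2 are added until the set reaches Msize.

```python
from itertools import zip_longest


def merge_candidates(l1, l2, msize):
    """Build the candidate set L from two ranked lists (best word first).

    l1, l2: candidate keywords from MD-TFIDF and MD-TextRank.
    msize:  preset minimum size of the merged candidate set.
    """
    in_l2 = set(l2)
    pool, seen = [], set()
    # Start from the intersection, kept in the rank order of l1.
    for word in l1:
        if word in in_l2 and word not in seen:
            pool.append(word)
            seen.add(word)
    # ASSUMPTION: top up by alternating between the heads of l1 and l2.
    for a, b in zip_longest(l1, l2):
        for word in (a, b):
            if len(pool) >= msize:
                return pool
            if word is not None and word not in seen:
                pool.append(word)
                seen.add(word)
    return pool


l1 = ["markdown", "keyword", "tfidf", "graph"]
l2 = ["keyword", "textrank", "markdown", "lda"]
print(merge_candidates(l1, l2, msize=4))
```

If both lists are exhausted before Msize is reached, the sketch simply returns everything it has collected.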

Further, the semantic blocks in step (1.1) are divided as follows:

Because the invention addresses keyword extraction from Markdown-format text, the Markdown text is divided into different semantic blocks according to the semantic description of each Markdown mark. Each semantic block is extracted from the original Markdown document by a regular expression, and the type and content of a semantic block are separated by "\001". Fig. 2 shows a sample of semantic blocks extracted from an original text. Table 1 lists the Markdown marks of interest to the invention, together with their descriptions and the corresponding semantic blocks.

TABLE 1 Markdown tag description table

[Table 1: image not reproduced]
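The block division described above can be sketched in Python with the standard `re` module. The block types (`h1`, `h2`, `h3`, `quote`, `list`, `bold`, `paragraph`) and their patterns are a hypothetical subset chosen for illustration, since the contents of Table 1 are not available here; only the "\001" separator between type and content comes from the text.

```python
import re

SEP = "\001"  # separator between block type and content, as stated in the text

# ASSUMPTION: a small illustrative set of Markdown marks, not the patent's Table 1.
LINE_RULES = [
    ("h1", re.compile(r"^#\s+(.*)$")),
    ("h2", re.compile(r"^##\s+(.*)$")),
    ("h3", re.compile(r"^#{3,6}\s+(.*)$")),
    ("quote", re.compile(r"^>\s?(.*)$")),
    ("list", re.compile(r"^(?:[-*+]|\d+\.)\s+(.*)$")),
]
BOLD = re.compile(r"\*\*(.+?)\*\*")


def split_semantic_blocks(markdown_text):
    """Return a list of 'type\\001content' strings, one per semantic block."""
    blocks = []
    for line in markdown_text.splitlines():
        line = line.strip()
        if not line:
            continue
        for block_type, pattern in LINE_RULES:
            match = pattern.match(line)
            if match:
                blocks.append(block_type + SEP + match.group(1))
                break
        else:
            # No line-level mark matched: treat the line as ordinary text.
            blocks.append("paragraph" + SEP + line)
        # Bold spans are emitted as extra blocks in addition to their line.
        for span in BOLD.findall(line):
            blocks.append("bold" + SEP + span)
    return blocks


sample = "# Intro\n\nSome **key** text\n- item one\n> note"
for block in split_semantic_blocks(sample):
    block_type, content = block.split(SEP, 1)
    print(block_type, "|", content)
```

A line-by-line pass keeps the sketch short; a real Markdown document would also need multi-line constructs such as fenced code blocks handled before this step.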

Further, MD-Feature in step (1.4) is calculated as follows:

Observation of where keywords appear in Markdown-format documents shows that keywords are related to the number of words in a semantic block: keywords are more likely to appear in semantic blocks with fewer words, and they often appear near the front of a semantic block. Therefore, in the invention, for a word w_ijk, the reciprocal of the ratio of the number of words in the j-th semantic block to the total number of words is calculated to measure the influence of the size of the semantic block on the weight of the word. The position order of w_ijk within the j-th semantic block is then calculated to measure the influence of the word's position in the semantic block on the weight. Because different semantic blocks carry different semantic descriptions, each semantic block is also given a different semantic weight W_md, whose value is obtained experimentally. Finally, the three factors are multiplied to obtain the MD-Feature formula:

[Formula (1): image not reproduced]

wherein K_q denotes the number of words in the q-th semantic block.
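A minimal Python sketch of the three-factor product follows. The exact form of formula (1) is not available here, so the sketch makes explicit assumptions: the position factor is taken as 1/k for the k-th word of a block, a word that occurs several times keeps its highest score, and the weights in `block_weights` are made-up values standing in for the experimentally determined W_md.

```python
def md_feature(blocks, block_weights):
    """Compute an MD-Feature-style score for every word of one document.

    blocks:        list of (block_type, [words]) pairs, in document order.
    block_weights: semantic weight W_md per block type (hypothetical values).
    """
    total_words = sum(len(words) for _, words in blocks)  # sum of K_q over all blocks
    scores = {}
    for block_type, words in blocks:
        if not words:
            continue
        # Reciprocal of the block's share of all words: small blocks weigh more.
        size_factor = total_words / len(words)
        w_md = block_weights.get(block_type, 1.0)
        for k, word in enumerate(words, start=1):
            position_factor = 1.0 / k  # ASSUMPTION: earlier words weigh more
            score = w_md * size_factor * position_factor
            # ASSUMPTION: a repeated word keeps its best score.
            scores[word] = max(scores.get(word, 0.0), score)
    return scores


blocks = [
    ("h1", ["markdown", "keywords"]),
    ("paragraph", ["we", "extract", "keywords", "from", "markdown", "text"]),
]
weights = {"h1": 3.0, "paragraph": 1.0}  # hypothetical W_md values
print(md_feature(blocks, weights))
```

With these inputs the heading words dominate, which matches the observation that keywords tend to sit in short blocks and near the front of a block.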

Further, the improved algorithm MD-TFIDF in step (2) uses the following formula:

[Formula (2): image not reproduced]

wherein tf_il denotes the frequency of the word w_ijk in document d_i; idf_jk denotes the inverse document frequency of w_ijk, that is, the inverse of the ratio of the number of documents n_j in the document collection D that contain w_ijk to the total number of documents N; and MDF(w_ijk) denotes the MD-Feature score of w_ijk.

The specific steps are as follows:

(4.1) segmenting semantic blocks into words: using a word segmentation tool, segment each semantic block of the blog post d_i from which keywords are to be extracted into words, denoted W_ij = {w_ij1, w_ij2, ..., w_ijk, ..., w_ijK}, where W_ij denotes the word segmentation result of the j-th semantic block of blog post i, w_ijk denotes the k-th word in the word segmentation result, and K is the number of words in W_ij;

(4.2) counting word frequencies: using tokenCount, with the word w_ijk as the key and its number of occurrences in the text as the value, record the frequency of each word in the word segmentation result as key-value pairs;

(4.3) calculating word scores: count the number of articles in the data set in which each word appears to obtain its idf value, and calculate the score of each word using formula (2);

(4.4) recording the scores: store each score as a value in tokenScore, with the word w_ijk as the key;

(4.5) sorting: sort the words in tokenScore by score, and take the TopK highest-scoring words as the keywords of the article.
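Steps (4.2) to (4.5) can be sketched in Python as follows. Since formula (2) is not available here, the sketch assumes the score is the product tf × idf × MDF with an unsmoothed logarithmic idf; the names `token_count` and `token_score` mirror tokenCount and tokenScore from the text, while the function name and the sample MD-Feature values are illustrative only.

```python
import math
from collections import Counter


def md_tfidf(doc_words, corpus, mdf, top_k=5):
    """Rank the words of one document by a TF-IDF score scaled by MD-Feature.

    doc_words: words of the target document (all semantic blocks concatenated).
    corpus:    list of word lists, one per document in the collection D.
    mdf:       {word: MD-Feature score} for the target document.
    """
    token_count = Counter(doc_words)  # step (4.2): word -> frequency
    n_docs = len(corpus)
    # Number of documents containing each word (n_j in the text).
    doc_freq = Counter(word for words in corpus for word in set(words))
    token_score = {}  # step (4.4): word -> score
    for word, count in token_count.items():
        tf = count / len(doc_words)
        # ASSUMPTION: idf = log(N / n_j), without smoothing.
        idf = math.log(n_docs / max(1, doc_freq[word]))
        # ASSUMPTION: the three factors are combined by multiplication.
        token_score[word] = tf * idf * mdf.get(word, 0.0)
    # Step (4.5): highest score first, ties broken alphabetically.
    ranked = sorted(token_score, key=lambda w: (-token_score[w], w))
    return ranked[:top_k]


corpus = [["markdown", "keyword", "the"], ["the", "graph"], ["the", "keyword"]]
mdf = {"markdown": 2.0, "keyword": 2.0, "the": 1.0}  # hypothetical scores
print(md_tfidf(corpus[0], corpus, mdf, top_k=2))
```

The word "the" appears in every document, so its idf and therefore its score are zero regardless of its MD-Feature value.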

Further, the improved algorithm MD-TextRank in step (2) is as follows:

One disadvantage of the traditional TextRank algorithm is that, when the graph is constructed, every word is given the same weight by default and the attributes of the words are ignored. These attributes include semantic attributes, position attributes, part-of-speech relations and so on. The semantic attributes of words are especially prominent in Markdown-format documents: different semantic blocks carry different semantic descriptions and can, to some extent, convey the author's intent. The TextRank algorithm is therefore improved and a new MD-TextRank algorithm is proposed: when the graph is built, MD-Feature is incorporated into the weight assigned to each node, so that the iteration starts from different initial values. The random walk thus becomes more targeted, and its result is closer to the keywords of the document.

Suppose G = (V, E) is a directed graph composed of a point set V and an edge set E, where the points in V are the words of the text and the edges in E connect words that appear in the same co-occurrence window. For a point v_i, denote the set of all nodes pointing to it (in-degree nodes) by In(v_i) and the set of all nodes it points to (out-degree nodes) by Out(v_i). The node weight is calculated as follows:

[Formula (3): image not reproduced]

wherein w_ji denotes the weight of the edge between point v_j and point v_i; the adjustment (damping) coefficient d is typically set to 0.85; and mdf(v_i) denotes the MD-Feature score of the word corresponding to node v_i.

The specific steps are as follows:

(5.1) segmenting semantic blocks into words: using a word segmentation tool, segment each semantic block of the blog post d_i from which keywords are to be extracted into words, denoted W_ij = {w_ij1, w_ij2, ..., w_ijk, ..., w_ijK}, where w_ijk denotes the k-th word in the word segmentation result of the j-th semantic block of blog post d_i, and K is the number of words in W_ij;

(5.2) calculating the MD-Feature of each word;

(5.3) recording the scores: store the calculated MD-Feature as the value in Weight, with the word w_ijk as the key, in key-value form; construct the word graph G = (V, E) according to the co-occurrence relation and initialize the weight of each node from Weight; iterate with formula (3); and, on convergence, write the converged weights back into Weight as key-value pairs;

(5.4) sorting: sort the words in Weight by weight, and select the top-weighted words as the keywords of the article d_i.
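The steps above can be sketched in Python. Formula (3) is not available here, so the update rule used below is an assumption: a weighted TextRank iteration in which the normalized MD-Feature score serves both as the initial node weight and as the teleport term, WS(v_i) = (1 - d) * mdf(v_i) + d * sum over v_j in In(v_i) of (w_ji / sum of w_jk over Out(v_j)) * WS(v_j). The window size, tolerance, function name and sample MD-Feature values are also illustrative.

```python
from collections import defaultdict


def md_textrank(words, mdf, window=3, d=0.85, max_iter=100, tol=1e-6):
    """Rank words with a TextRank-style iteration seeded by MD-Feature scores."""
    # Co-occurrence graph: words within `window` positions are linked both ways,
    # and the edge weight counts how often the pair co-occurs.
    weight = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                weight[word][other] += 1.0
                weight[other][word] += 1.0
    nodes = sorted(set(words))
    total = sum(mdf.get(v, 0.0) for v in nodes) or 1.0
    prior = {v: mdf.get(v, 0.0) / total for v in nodes}  # normalized MD-Feature
    score = dict(prior)  # initial node weights come from MD-Feature
    out_sum = {v: sum(weight[v].values()) for v in nodes}
    for _ in range(max_iter):
        new_score = {}
        for v in nodes:
            # The graph is symmetric, so the neighbors of v are its in-nodes.
            incoming = sum(weight[u][v] / out_sum[u] * score[u]
                           for u in weight[v] if out_sum[u] > 0)
            new_score[v] = (1 - d) * prior[v] + d * incoming
        converged = max(abs(new_score[v] - score[v]) for v in nodes) < tol
        score = new_score
        if converged:
            break
    return sorted(nodes, key=lambda v: (-score[v], v))


words = ["markdown", "keyword", "extraction", "markdown", "graph", "keyword"]
mdf = {"markdown": 5.0, "keyword": 1.0, "extraction": 1.0, "graph": 1.0}
print(md_textrank(words, mdf))
```

Seeding the teleport term with MD-Feature keeps the influence of the Markdown structure alive throughout the iteration, whereas using it only as an initial value would let a converging random walk forget it.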

The technical solution adopted by the invention has the following technical features:

1. Based on the syntax and semantic expression of Markdown, the invention proposes a Markdown semantic feature (MD-Feature) that effectively expresses the semantic information of computer blog data in Markdown format.

2. The invention extracts MD-Feature from semantic blocks and uses it to improve the TF-IDF and TextRank algorithms, proposing a Markdown feature-aware TF-IDF algorithm (Markdown Feature aware TF-IDF, MD-TFIDF) and a Markdown feature-aware TextRank algorithm (Markdown Feature aware TextRank, MD-TextRank), which meet the specific needs of blog posts in Markdown format.

3. Combining the MD-TFIDF and MD-TextRank algorithms, the invention proposes a Markdown feature-aware unsupervised keyword extraction method (MDKE), which helps improve the accuracy of keyword extraction for blog posts in Markdown format.

Drawings

Fig. 1 is a schematic flow chart of an embodiment of the Markdown feature-aware unsupervised keyword extraction method provided by the invention.

FIG. 2 is a sample graph of semantic block extraction from original text.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the accompanying drawings and the detailed description, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, which is a schematic flow chart of an embodiment of the Markdown feature-aware unsupervised keyword extraction method provided by the present invention, the method includes steps 101 to 108, which specifically include the following steps:

Step 101: Data acquisition.

In this embodiment, the data set consists of two parts. One part is the Mining Network experiment data, collected by a web crawler. The other part is auxiliary data, comprising a stop-word list and a technical-term list, which were downloaded from the Google machine learning glossary and the Tsinghua open IT lexicon and obtained through statistics and manual screening.

In this embodiment, both the computer blog data and the technical-community user behavior data come from the Mining Network, an active online technology exchange community in China. A web crawler was designed and implemented, collecting the historical interaction information of 38416 users, involving 85193 articles and covering 571 tags. The process of acquiring the Mining Network experiment data is described in detail below.

(1) Selecting seed users: the choice of seed users affects the depth of the crawler's traversal and the distribution of articles across professional fields in the data set. Six fields are therefore selected (front end, back end, artificial intelligence, databases, computer networks and algorithms), and 5 active users in each field are used as the seed users for data capture.

(2) Capturing user data: the captured data comprises the user's visible registration information, the user's historical interaction data, and the article data involved in those interactions.

(3) Obtaining new users to capture: traverse the seed users and add the users they follow to the URL queue as objects to be captured.

(4) Data persistence: the captured data is parsed, and the parsed data is stored in a MongoDB database for subsequent research.

(5) Termination condition: keep reading user addresses from the URL queue until the queue is empty.

Step 102: Data preprocessing.

Regarding the stop-word list and the technical-term list in this embodiment: because computer blog data belongs to a specialized domain, the technical vocabulary appearing in the text is often key to subsequent research. This embodiment therefore extracts stop words using a method based on word-frequency statistics and builds a stop-word list specific to computer blog data.
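One way to build such a list is sketched below in Python. The source does not give the statistic or the threshold, so the sketch assumes a document-frequency cutoff (`min_doc_ratio`) and assumes that words in the technical-term list are never treated as stop words; both choices, and the function name, are illustrative.

```python
from collections import Counter


def build_stop_words(corpus, technical_terms, min_doc_ratio=0.8):
    """Return words that occur in at least `min_doc_ratio` of all documents.

    corpus:          list of word lists, one per blog post.
    technical_terms: set of domain terms that must be kept as keyword candidates.
    """
    n_docs = len(corpus)
    # Count each word once per document, not once per occurrence.
    doc_freq = Counter(word for words in corpus for word in set(words))
    return {
        word
        for word, count in doc_freq.items()
        if count / n_docs >= min_doc_ratio and word not in technical_terms
    }


corpus = [
    ["the", "api", "is", "fast"],
    ["the", "api", "cache"],
    ["the", "is", "api"],
]
print(sorted(build_stop_words(corpus, {"api"}, min_doc_ratio=0.6)))
```

In practice the automatically selected words would still be reviewed by hand, in line with the manual screening mentioned for the auxiliary data.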

For the Mining Network experiment data, every Markdown syntax mark carries specific semantic description information. If these special symbols are simply removed from the text, the semantic information carried by the Markdown syntax marks is lost; if they are kept, they add extra noise. To make full use of Markdown's semantic information in subsequent research, this text introduces the concept of a semantic block: text described by the same semantics is called one type of semantic block. The Markdown text can be divided into different semantic blocks according to the semantic description of each Markdown mark. Table 1 lists the Markdown marks relevant to the research problem, together with their descriptions and the corresponding semantic blocks. The computer blog data in Markdown format is processed in the following steps:

(1) extract the title, the code blocks and the remaining article data article of blog post d using regular expressions;

(2) remove the extracted code blocks and process the blog post into a temporary format a = <title, article>;

(3) use regular expressions to partition article into semantic blocks SEBlock = {b_1, b_2, ..., b_i, ..., b_T}, where b_i denotes the i-th semantic block. The type and content of a semantic block are separated by \001; Fig. 2 shows a sample of semantic blocks extracted from the original text;

(4) process the keywords that the user tagged for blog post d into l = <kw_1, kw_2, ..., kw_j, ..., kw_n>;

(5) process l, a and SEBlock into the final format d' = <title, article, SEBlock>.
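Steps (1), (2) and (4) above can be sketched in Python. The regular expressions are assumptions for illustration (the title is taken to be the first level-one heading, and code blocks are taken to be fenced blocks); the record layout with `title`, `article` and `keywords` fields is hypothetical, and the SEBlock field is left out because it would come from a separate block-splitting step.

```python
import re

FENCE = "`" * 3  # Markdown code fence, built indirectly for readability
# ASSUMPTION: code blocks are fenced blocks; indented code is not handled.
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
# ASSUMPTION: the title is the first level-one heading of the post.
TITLE = re.compile(r"^#[ \t]+(.+)$", re.MULTILINE)


def preprocess(markdown_text, user_keywords):
    """Strip code blocks, pull out the title and attach the user's keywords."""
    without_code = CODE_BLOCK.sub("", markdown_text)
    match = TITLE.search(without_code)
    title = match.group(1).strip() if match else ""
    # Remove only the title line; the rest is the article body.
    article = TITLE.sub("", without_code, count=1).strip()
    return {"title": title, "article": article, "keywords": list(user_keywords)}


post = (
    "# My Post\n\nIntro text.\n\n"
    + FENCE + "python\n# not a title\nprint(1)\n" + FENCE
    + "\n\nMore text."
)
print(preprocess(post, ["markdown", "keywords"]))
```

Removing the code blocks before searching for the title avoids mistaking a `#` comment inside code for a heading.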

TABLE 1 Markdown tag description table

[Table 1: image not reproduced]

Step 103: Markdown feature extraction.

In this embodiment, the process of extracting Markdown features comprises four steps:

(1) dividing semantic blocks: dividing each Markdown text into a set of semantic blocks;

(2) segmenting the text into words: segmenting each semantic block into words using the jieba word segmenter, a technical-term dictionary and a stop-word dictionary;

(3) calculating position information: calculating the position information of each word in its semantic block;

(4) calculating MD-Feature. The MD-Feature formula is as follows:

[Formula (1): image not reproduced]

wherein K_q denotes the number of words in the q-th semantic block.

Step 104: Implementing the MD-TFIDF algorithm.

The improved algorithm MD-TFIDF uses the following formula:

[Formula (2): image not reproduced]

wherein tf_il denotes the frequency of the word w_ijk in document d_i; idf_jk denotes the inverse document frequency of w_ijk, that is, the inverse of the ratio of the number of documents n_j in the document collection D that contain w_ijk to the total number of documents N; and MDF(w_ijk) denotes the MD-Feature score of w_ijk.

The specific steps are as follows:

(1) segmenting semantic blocks into words: using a word segmentation tool, segment each semantic block of the blog post d_i from which keywords are to be extracted into words, denoted W_ij = {w_ij1, w_ij2, ..., w_ijk, ..., w_ijK}, where W_ij denotes the word segmentation result of the j-th semantic block of blog post i, w_ijk denotes the k-th word in the word segmentation result, and K is the number of words in W_ij;

(2) counting word frequencies: using tokenCount, with the word w_ijk as the key and its number of occurrences in the text as the value, record the frequency of each word in the word segmentation result as key-value pairs;

(3) calculating word scores: count the number of articles in the data set in which each word appears to obtain its idf value, and calculate the score of each word using formula (2);

(4) recording the scores: store each score as a value in tokenScore, with the word w_ijk as the key;

(5) sorting: sort the words in tokenScore by score and take the TopK highest-scoring words as the keywords of the article, obtaining the candidate set L1.

Step 105: Implementing the MD-TextRank algorithm.

The improved algorithm MD-TextRank is as follows:

One disadvantage of the traditional TextRank algorithm is that, when the graph is constructed, every word is given the same weight by default and the attributes of the words are ignored. These attributes include semantic attributes, position attributes, part-of-speech relations and so on. The semantic attributes of words are especially prominent in Markdown-format documents: different semantic blocks carry different semantic descriptions and can, to some extent, convey the author's intent. The TextRank algorithm is therefore improved and a new MD-TextRank algorithm is proposed: when the graph is built, MD-Feature is incorporated into the weight assigned to each node, so that the iteration starts from different initial values. The random walk thus becomes more targeted, and its result is closer to the keywords of the document.

Suppose G = (V, E) is a directed graph composed of a point set V and an edge set E, where the points in V are the words of the text and the edges in E connect words that appear in the same co-occurrence window. For a point v_i, denote the set of all nodes pointing to it (in-degree nodes) by In(v_i) and the set of all nodes it points to (out-degree nodes) by Out(v_i). The node weight is calculated as follows:

[Formula (3): image not reproduced]

wherein w_ji denotes the weight of the edge between point v_j and point v_i; the adjustment (damping) coefficient d is typically set to 0.85; and mdf(v_i) denotes the MD-Feature score of the word corresponding to node v_i.

The specific steps are as follows:

(1) segmenting semantic blocks into words: using a word segmentation tool, segment each semantic block of the blog post d_i from which keywords are to be extracted into words, denoted W_ij = {w_ij1, w_ij2, ..., w_ijk, ..., w_ijK}, where w_ijk denotes the k-th word in the word segmentation result of the j-th semantic block of blog post d_i, and K is the number of words in W_ij;

(2) calculating the MD-Feature of each word;

(3) recording the scores: store the calculated MD-Feature as the value in Weight, with the word w_ijk as the key, in key-value form; construct the word graph G = (V, E) according to the co-occurrence relation and initialize the weight of each node from Weight; iterate with formula (3); and, on convergence, write the converged weights back into Weight as key-value pairs;

(4) sorting: sort the words in Weight by weight and select the TopN highest-weighted words as the keywords of the article d_i, obtaining the candidate set L2.

Step 106: Obtaining the candidate set L.

Take the intersection L of the keyword candidate set L1 and the keyword candidate set L2, and determine whether the size of L is smaller than the preset candidate set size Msize. If so, select further top-scoring candidate words from L1 and L2 and add them to L until the size of L is greater than or equal to Msize.

Step 107: Ranking the candidate set L with the LDA topic model.

Because the MD-TFIDF and MD-TextRank algorithms score candidate words on inconsistent scales, the words in the candidate set L are ranked by the LDA topic model. The words in L are taken as the text to be analyzed and fed into an LDA model whose parameters are set to one topic and K words related to that topic.

Step 108: Obtaining the TopK keywords.

Since the words in the LDA model's output are sorted by their relevance to the topic, the K words in the output are finally output as the keywords of the article.

The protection of the present invention is not limited to the above embodiments. Variations and advantages that may occur to those skilled in the art may be incorporated into the invention without departing from the spirit and scope of the inventive concept, and the scope of the appended claims is intended to be protected.
