Method for detecting microblog emergency

Document No.: 1783064 | Publication date: 2019-12-06

Note: This technique, "Method for detecting microblog emergency" (一种微博突发事件的检测方法), was designed and created by 张仰森, 段宇翔, 段瑞雪 and 黄改娟 on 2019-01-23. Abstract: The invention discloses a method for detecting a microblog emergency, which comprises the following steps: step 1, performing time division on microblog data subjected to noise removal and word segmentation processing to obtain a microblog data set corresponding to each time window; step 2, calculating the burst word judgment weight Wk(w) of each word w in the microblogs within each time window, and taking words whose Wk(w) is greater than the burst word threshold as burst words, so as to obtain the burst word set of each time window; step 3, clustering the burst words in the burst word set with a clustering algorithm based on the coupling degree of every two burst words in the set, and detecting the emergencies of the time window based on the clustering result. Compared with existing methods, this burst-feature-word-based method for detecting microblog emergencies greatly improves the accuracy and the F value, i.e., the detection results are more accurate.

1. A method for detecting microblog emergency events, characterized by comprising the following steps:

Step 1, performing time division on the microblog data subjected to noise removal and word segmentation processing to obtain a microblog data set Dk = {d1, d2, …, dm-1, dm} corresponding to each time window, wherein Dk represents the microblog data set contained in the k-th time window Tk, di represents the i-th microblog, and i = 1, 2, …, m;

Step 2, calculating the burst word judgment weight Wk(w) of each word w in all microblogs within each time window, and taking words whose Wk(w) is greater than the burst word threshold as burst words, so as to obtain the burst word set of each time window; the burst word threshold is an empirical value;

Wherein Ck(w) is the word frequency feature of the word w, Tk(w) is the topic label feature of the word w, Bk(w) is the word frequency growth rate feature of the word w, and α, β, γ are the weights of Ck(w), Tk(w), Bk(w), with α + β + γ = 1;

Step 3, clustering the burst words in the burst word set by using a clustering algorithm based on the coupling degree of every two burst words in the burst word set, and detecting the burst event of a time window based on a clustering result;

The coupling degree of every two burst words is S(wp, wq) = C(wp, wq) + MI(wp, wq), wherein S(wp, wq), C(wp, wq) and MI(wp, wq) are respectively the coupling degree, the co-occurrence degree and the mutual information of the burst words wp and wq.

2. The method for detecting microblog emergencies according to claim 1, characterized in that:

The noise removal for the microblog data comprises: deleting short texts; deleting microblog data whose text consists of repeated letters, repeated Chinese characters or repeated symbols; deleting non-text information in the microblog data; and/or deleting stop words in the microblog data.

3. The method for detecting microblog emergencies according to claim 1, characterized in that:

The word frequency feature Ck(w) of the word w represents the word frequency feature of the word w under the time window Tk, fk(w) represents the word frequency of the word w under the time window Tk, fkmax represents the maximum word frequency among all words in the time window Tk, and δ is the initial value of the word frequency feature.

4. The method for detecting microblog emergencies according to claim 1, characterized in that:

The topic label feature Tk(w) of the word w represents the topic label feature of the word w under the time window Tk, Ntag(w) represents the number of topic labels in which the word w appears in the time window Tk, and Ntag represents the total number of topic labels in the time window Tk; Nt_blog(w) represents the frequency of occurrence of the word w in microblogs containing topic labels in the time window Tk, and Nt_blog represents the number of microblogs containing topic labels in the time window Tk; If(w) is a determination factor: if at least one topic label in the time window Tk contains the word w, If(w) takes 1, and if no topic label in the time window Tk contains the word w, If(w) takes 2.

5. The method for detecting microblog emergencies according to claim 1, characterized in that:

The word frequency growth rate feature Bk(w) of the word w is computed from fk(w), the word frequency of the word w in the time window Tk, and Ak(w), the historical average word frequency of the word w over the time window Tk and the preceding time windows.

6. The method for detecting microblog emergencies according to claim 1, characterized in that:

For the co-occurrence degree as given in the specification, R(wi|wj) is the relative co-occurrence degree of the word wi with respect to the word wj in the current time window, and R(wj|wi) is the relative co-occurrence degree of the word wj with respect to the word wi in the current time window.

7. The method for detecting microblog emergencies according to claim 1, characterized in that:

For the mutual information as given in the specification, P(wi) and P(wj) respectively represent the probabilities of the words wi and wj appearing in the current time window, and P(wi, wj) represents the probability of the words wi and wj appearing together in the current time window.

8. The method for detecting microblog emergencies according to claim 1, characterized in that:

In step 3, clustering is performed with a hierarchical clustering algorithm, further comprising the following sub-steps:

301: regarding each burst word in the burst word set as a cluster, obtaining an initial cluster set, and storing the initial cluster set in a cluster set Cluster;

302: calculating the similarity between every two clusters in the cluster set Cluster and storing the similarity in the set Temp, wherein the similarity between the two clusters is the reciprocal average value of the similarity between every two elements in the two clusters;

The similarity between every two clusters is the reciprocal average of the similarity between every two elements in the two clusters:

Wherein D(ya, yb) is the similarity between clusters ya and yb; |ya| and |yb| represent the numbers of burst words in clusters ya and yb, respectively; Smax and Smin represent the maximum and minimum values of the coupling degree between burst words in clusters ya and yb, respectively;

303: judging the number | Temp |, if | Temp |, skipping to step 304; if the value of | Temp | >1, two clusters corresponding to the minimum similarity value in the set Temp are taken, the two clusters are deleted from the cluster, the two clusters are merged and then added into the cluster, and the step 302 is executed again on the new cluster;

304: outputting a binary tree structure composed of the burst words in the burst word set.

9. The method for detecting microblog emergencies according to claim 1, characterized in that:

In step 3, a binary tree pruning and partitioning method based on internal similarity is adopted to prune and partition the binary tree structure output in the substep 304 to obtain an emergency partition set, so as to detect the emergency;

The binary tree pruning and segmenting method based on the internal similarity specifically comprises the following steps:

305: judging the sizes of the internal similarity D and theta of the left and right subtrees of the root node of the current binary tree, wherein the internal similarity D of the left and right subtrees is the similarity between clusters corresponding to the left subtree and the right subtree; the initial state of the current binary tree is the binary tree structure output by sub-step 304; theta is a preset cluster internal similarity threshold value;

If D ≥ θ, adding the current binary tree into the emergency division set and jumping to step 308; otherwise, executing step 306 and step 307 respectively;

306: judging whether the current left subtree is empty, if not, taking the left subtree as the current binary tree, and executing a substep 305; if the step is empty, terminating the step;

307: judging whether the current right subtree is empty, if not, taking the right subtree as the current binary tree, and executing a substep 305; if the step is empty, terminating the step;

308: and outputting the emergency division set.

10. The method for detecting microblog emergencies according to claim 1, characterized in that:

The weights α, β and γ are obtained by the following method:

(1) Constructing a judgment matrix whose element Uij represents the probability that the i-th evidence is more important than the j-th evidence, the evidences corresponding to the respective features; i = 1, 2, 3; j = 1, 2, 3; i ≠ j;

(2) Having a plurality of experts give the probability that the i-th evidence is more important than the j-th evidence, and calculating the combined probability value that the i-th evidence is more important than the j-th evidence according to D-S evidence theory, the combined probability value being Uij, so as to obtain a judgment matrix P;

(3) Converting the data above the diagonal in the judgment matrix P, namely, when i < j, letting Uij be the reciprocal of Uji;

(4) Carrying out a consistency check on the converted judgment matrix using the analytic hierarchy process, and if the check is passed, normalizing each column of the converted judgment matrix to obtain a new judgment matrix P';

(5) Summing each row of P' to obtain a vector xT = (x1, x2, x3) characterizing the influence of each feature on the burst words; normalizing the vector xT to obtain the weight vector αT = (α, β, γ) corresponding to the features.

Technical Field

The invention relates to the field of public opinion situation awareness in data mining, and in particular to a method for detecting microblog emergencies.

Background

With the emergence and rapid development of Internet media, distributed social media such as microblogs, which are built around interaction relationships between users, spread information faster and more widely, while also greatly enhancing interactivity. In particular, information about emergencies spreads exponentially within a short time, from one person to ten and from ten to a hundred. In order to discover and reasonably control sudden social events in time, emergencies need to be detected promptly and accurately from massive, dynamic microblog data.

In recent years, scholars at home and abroad have invested a great deal of research in emergency detection for social networking media; the core problem and difficulty at present is how to detect emergencies quickly and accurately from exponentially growing data. Existing emergency detection methods fall mainly into three categories: text-centered methods, burst-feature-word-centered methods, and methods centered on local region label features.

The text-centered emergency detection method is a text clustering method based on the distance between text semantics. It first slices time, assigns texts to the corresponding time windows according to their release time, clusters the microblog texts in each time slice, extracts burst features from each resulting cluster, and identifies the classes that satisfy the corresponding burst rules. However, because microblog texts contain a large amount of spam such as spoken expressions, Internet slang, advertisements and links, much noise is introduced when clustering and extracting burst features. In addition, microblog text clustering involves selecting several parameter thresholds, most of which are empirical values, which affects detection accuracy.

The emergency detection method centered on burst feature words extracts feature words with burstiness from microblog texts and clusters the obtained burst feature words to detect emergencies. The core of this method lies in the feature selection of burst words rather than the clustering of feature words, and it avoids the parameter threshold setting problem of text-centered detection methods. However, microblog texts contain a large number of documents irrelevant to any event, so noise removal and accurate extraction of burst words are key factors for improving the detection rate.

The emergency detection method centered on local region label features mainly targets microblog data containing regional information, including geographical tags attached to user information and places mentioned in microblog content. It can detect hot-spot emergencies that are not prominent in the whole-network microblog text but occur in a certain local region. The core problems focus on two aspects: how to extract regional burst feature words from microblog texts, and how to calculate the popularity of a microblog within a small area.

Disclosure of Invention

The invention aims to provide a method for detecting microblog emergencies based on burst feature words.

The invention discloses a method for detecting a microblog emergency, which comprises the following steps:

Step 1, performing time division on the microblog data subjected to noise removal and word segmentation processing to obtain a microblog data set Dk = {d1, d2, …, dm-1, dm} corresponding to each time window, wherein Dk represents the microblog data set contained in the k-th time window Tk, di represents the i-th microblog, and i = 1, 2, …, m;

Step 2, calculating the burst word judgment weight Wk(w) of each word w in all microblogs within each time window, and taking words whose Wk(w) is greater than the burst word threshold as burst words, so as to obtain the burst word set of each time window; the burst word threshold is an empirical value;

Wherein Ck(w) is the word frequency feature of the word w, Tk(w) is the topic label feature of the word w, Bk(w) is the word frequency growth rate feature of the word w, and α, β, γ are the weights of Ck(w), Tk(w), Bk(w), with α + β + γ = 1;

Step 3, clustering the burst words in the burst word set by using a clustering algorithm based on the coupling degree of every two burst words in the burst word set, and detecting the burst event of a time window based on a clustering result;

The coupling degree of every two burst words is S(wp, wq) = C(wp, wq) + MI(wp, wq), wherein S(wp, wq), C(wp, wq) and MI(wp, wq) are respectively the coupling degree, the co-occurrence degree and the mutual information of the burst words wp and wq.

Further, the noise removal for the microblog data comprises: first, deleting short texts; second, deleting microblog data whose text consists of repeated letters, repeated Chinese characters or repeated symbols; third, deleting non-text information in the microblog data; and/or fourth, deleting stop words in the microblog data.
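As an illustration of this noise-removal step, the sketch below filters a list of raw microblog texts. The length threshold, the repeated-character heuristic, the stop-word list and the tokenizer are illustrative assumptions, not values or tools fixed by the invention.

```python
import re

STOP_WORDS = {"的", "了", "是", "在", "就"}   # illustrative stop-word list (assumption)
MIN_LENGTH = 10                                # illustrative short-text threshold (assumption)

def clean_microblogs(posts):
    """Remove noisy microblogs and strip non-text content and stop words."""
    cleaned = []
    for text in posts:
        # delete non-text information: URLs, @mentions, bracketed emoticon codes
        text = re.sub(r"https?://\S+|@\S+|\[[^\]]+\]", "", text)
        # delete very short texts
        if len(text.strip()) < MIN_LENGTH:
            continue
        # delete posts whose text is one repeated letter / Chinese character / symbol
        if re.fullmatch(r"(.)\1+", text.strip()):
            continue
        # delete stop words from the segmented word list
        tokens = [w for w in tokenize(text) if w not in STOP_WORDS]
        cleaned.append(tokens)
    return cleaned

def tokenize(text):
    """Placeholder for Chinese word segmentation, e.g. a library such as jieba."""
    return text.split()
```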

Further, the word frequency feature Ck(w) of the word w represents the word frequency feature of the word w under the time window Tk, fk(w) represents the word frequency of the word w under the time window Tk, fkmax represents the maximum word frequency among all words in the time window Tk, and δ is the initial value of the word frequency feature.

Further, the topic label feature Tk(w) of the word w represents the topic label feature of the word w under the time window Tk, Ntag(w) represents the number of topic labels in which the word w appears in the time window Tk, and Ntag represents the total number of topic labels in the time window Tk; Nt_blog(w) represents the frequency of occurrence of the word w in microblogs containing topic labels in the time window Tk, and Nt_blog represents the number of microblogs containing topic labels in the time window Tk; If(w) is a determination factor: if at least one topic label in the time window Tk contains the word w, If(w) takes 1, and if no topic label in the time window Tk contains the word w, If(w) takes 2.

Further, the word frequency growth rate feature Bk(w) of the word w is computed from fk(w), the word frequency of the word w in the time window Tk, and Ak(w), the historical average word frequency of the word w over the time window Tk and the preceding time windows.

Further, for the co-occurrence degree, R(wi|wj) is the relative co-occurrence degree of the word wi with respect to the word wj in the current time window, and R(wj|wi) is the relative co-occurrence degree of the word wj with respect to the word wi in the current time window.

Further, for the mutual information, P(wi) and P(wj) respectively represent the probabilities of the words wi and wj appearing in the current time window, and P(wi, wj) represents the probability of the words wi and wj appearing together in the current time window.

Further, step 3 adopts a hierarchical clustering algorithm to perform clustering, and further comprises:

301: regarding each burst word in the burst word set as a cluster, obtaining an initial cluster set, and storing the initial cluster set in a cluster set Cluster;

302: calculating the similarity between every two clusters in the cluster set Cluster and storing the similarity in the set Temp, wherein the similarity between the two clusters is the reciprocal average value of the similarity between every two elements in the two clusters;

the similarity between every two clusters is the reciprocal average of the similarity between every two elements in the two clusters:

Wherein D(ya, yb) is the similarity between clusters ya and yb; |ya| and |yb| represent the numbers of burst words in clusters ya and yb, respectively; Smax and Smin represent the maximum and minimum values of the coupling degree between burst words in clusters ya and yb, respectively;

303: judging the number | Temp |, if | Temp |, skipping to step 304; if the value of | Temp | >1, two clusters corresponding to the minimum similarity value in the set Temp are taken, the two clusters are deleted from the cluster, the two clusters are merged and then added into the cluster, and the step 302 is executed again on the new cluster;

304: outputting a binary tree structure composed of the burst words in the burst word set.

further, in step 3, a binary tree pruning and partitioning method based on internal similarity is adopted to prune and partition the binary tree structure output in the sub-step 304 to obtain an emergency partition set, so as to detect the emergency;

The binary tree pruning and segmenting method based on the internal similarity specifically comprises the following steps:

305: judging the sizes of the internal similarity D and theta of the left and right subtrees of the root node of the current binary tree, wherein the internal similarity D of the left and right subtrees is the similarity between clusters corresponding to the left subtree and the right subtree; the initial state of the current binary tree is the binary tree structure output by sub-step 304; theta is a preset cluster internal similarity threshold value;

If D ≥ θ, adding the current binary tree into the emergency division set and jumping to step 308; otherwise, executing step 306 and step 307 respectively;

306: judging whether the current left subtree is empty, if not, taking the left subtree as the current binary tree, and executing a substep 305; if the step is empty, terminating the step;

307: judging whether the current right subtree is empty, if not, taking the right subtree as the current binary tree, and executing a substep 305; if the step is empty, terminating the step;

308: outputting the emergency division set.

Further, the weights α, β, γ are obtained by the following method:

(1) Constructing a judgment matrix whose element Uij represents the probability that the i-th evidence is more important than the j-th evidence, the evidences corresponding to the respective features; i = 1, 2, 3; j = 1, 2, 3; i ≠ j;

(2) Having a plurality of experts give the probability that the i-th evidence is more important than the j-th evidence, and calculating the combined probability value that the i-th evidence is more important than the j-th evidence according to D-S evidence theory, the combined probability value being Uij, so as to obtain a judgment matrix P;

(3) Converting the data above the diagonal in the judgment matrix P, namely, when i < j, letting Uij be the reciprocal of Uji;

(4) Carrying out a consistency check on the converted judgment matrix using the analytic hierarchy process, and if the check is passed, normalizing each column of the converted judgment matrix to obtain a new judgment matrix P';

(5) Summing each row of P' to obtain a vector xT = (x1, x2, x3) characterizing the influence of each feature on the burst words; normalizing the vector xT to obtain the weight vector αT = (α, β, γ) corresponding to the features.

The invention has the following characteristics and beneficial effects:

The microblog data set is sliced according to time information, and for the data in each time window the word frequency feature, the topic label feature and the word frequency growth rate feature of each word are calculated. A set of words with burst characteristics is then selected according to the weights. Next, the coupling degree between burst words is calculated based on the word co-occurrence degree and the combination closeness, and a similarity matrix is constructed as the input of the agglomerative hierarchical clustering algorithm. Finally, the clustering result is partitioned with a binary tree pruning algorithm based on internal similarity to obtain the emergencies corresponding to the time window.

Experimental results show that, compared with existing methods, the method for detecting microblog emergencies based on burst feature words greatly improves the accuracy and the F value; that is, the detection results are more accurate.

Drawings

FIG. 1 is a block flow diagram of the present invention;

FIG. 2 is a flow diagram of a method for feature fusion weight computation in an embodiment;

FIG. 3 is an example of microblog data in an embodiment;

FIG. 4 is a statistical chart of the frequency of the word "a celebrity" in the example;

FIG. 5 is a graph comparing the results of the experiments in the examples.

Detailed Description

For the purpose of promoting an understanding of the principles of the invention, reference will now be made in detail to the present principles, implementations and advantages of the invention.

The core problems of microblog emergency detection include data noise removal, parameter threshold setting and burst feature extraction. The invention detects microblog emergencies based on burst feature words; the flow diagram is shown in FIG. 1. Before emergency detection is carried out, the microblog data need to be preprocessed, and all microblog data are divided according to time to obtain a time window sequence T1, T2, …, Tn-1, Tn, wherein Tk represents the k-th time window and k takes the values 1, 2, …, n-1, n in turn. The time window in this embodiment is one day. The time window Tk contains the microblog data set Dk, so the sequence of microblog data sets formed by the time window sequence is D1, D2, …, Dn-1, Dn. The microblog data set Dk contains the preprocessed microblog data, Dk = {d1, d2, …, dm-1, dm}, wherein di represents one piece of preprocessed microblog data; since preprocessing already includes Chinese word segmentation, di = {w1, w2, …, wp-1, wp}, wherein wj is the j-th word in the microblog.
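A minimal sketch of this time slicing is given below; it groups preprocessed microblogs into one-day windows by publication timestamp. The (publish_time, tokens) input layout is an assumption made for the example, not a format prescribed by the invention.

```python
from collections import defaultdict
from datetime import datetime

def split_into_time_windows(microblogs):
    """Group preprocessed microblogs into daily time windows T1..Tn.

    `microblogs` is assumed to be a list of (publish_time, tokens) pairs,
    where tokens is the segmented word list di = {w1, ..., wp}.
    Returns a dict mapping each day to its data set Dk, in chronological order.
    """
    windows = defaultdict(list)
    for publish_time, tokens in microblogs:
        day = publish_time.date()   # one time window per day, as in this embodiment
        windows[day].append(tokens)
    return dict(sorted(windows.items()))

# usage: data sets D1..Dn indexed by calendar day
# windows = split_into_time_windows([(datetime(2019, 1, 23, 8, 0), ["词1", "词2"]), ...])
```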

the principles and implementation of the model to which the present invention relates will be described in detail below.

First, the burst feature word extraction model

1. Analysis and representation of burst word features

The occurrence of an emergency often produces corresponding characteristics, such as the word frequency, the growth rate of the word frequency, the topics formed, and so on. For example, the influence of some events in a certain time window Tk is small, but the influence increases greatly due to wide attention in the time window Tk+1, and meanwhile the words and topics related to the events also increase sharply. Therefore, the burst feature words are obtained from several features: word frequency, topic labels and word frequency growth rate.

(1) Word frequency characteristics

The word frequency feature most intuitively reflects the importance of a word in the data set of the whole time window, so word frequency is taken as one of the features of burst words. Word frequency features are usually calculated with the classic TF-IDF method, which can find high-frequency words with high discrimination in a document set and assign them weights. However, when detecting emergencies on microblog data, because the document set is large and the documents are short, directly applying TF-IDF indiscriminately assigns lower weights to words that appear many times across a large number of microblogs, so that some burst words cannot be detected. Therefore, the invention improves the word frequency weight calculation of the TF-IDF method, as shown in formula (1). In the microblog data set Dk contained in the time window Tk, the word frequency weight Ck(w) of a word is calculated as in formula (1), and this word frequency weight is the word frequency feature:

wherein Ck(w) represents the word frequency weight of the word w in the time window Tk, fk(w) represents the word frequency of the word w in the time window Tk, fkmax represents the maximum word frequency among the words in the time window Tk, and δ is the initial value of the word frequency weight, a value in the range 0 to 1, generally set to 0.5.

When the word frequency weight calculation method of formula (1) is used to extract burst words, the interference that the traditional TF-IDF method suffers from microblog data of varying length, all of which are short texts, is avoided, making the method more suitable for microblog-based emergency detection.
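Since formula (1) is not reproduced in this text, the sketch below uses the max-normalized form Ck(w) = δ + (1 - δ)·fk(w)/fkmax, which matches the stated ingredients (initial value δ, word frequency, maximum word frequency) but should be read as an assumption about the exact expression.

```python
from collections import Counter

def word_frequency_feature(window_docs, delta=0.5):
    """Word frequency feature Ck(w) for every word in one time window.

    Assumed form: Ck(w) = delta + (1 - delta) * fk(w) / fk_max, where fk(w)
    is the word frequency in the window and fk_max the maximum word frequency.
    """
    freq = Counter(w for doc in window_docs for w in doc)
    f_max = max(freq.values()) if freq else 1
    return {w: delta + (1 - delta) * f / f_max for w, f in freq.items()}
```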

(2) Topic tag features

The topic tag is one of the core functions of the Sina microblog; it enables a user to select a topic for the text they publish, i.e., a phrase that highly summarizes the text content. Burst words related to an emergency are likely to appear in the corresponding microblog topic labels. The invention therefore also takes the topic label as one of the features when extracting burst words. The topic label weight Tk(w) of a word w in the microblog data set Dk within the time window Tk is calculated with reference to formula (2) and formula (3), and this topic label weight is the topic label feature:

Ntag(w) represents the number of topic labels in which the word w appears in the time window Tk, and Ntag represents the total number of topic labels in the microblog data set Dk; similarly, Nt_blog(w) represents the number of times that the word w appears in microblogs containing topic labels in the time window Tk, and Nt_blog represents the number of microblogs containing topic labels in the time window Tk; If(w) is a determination factor for determining whether a topic label contains the word w: if at least one topic label in the time window Tk contains the word w, If(w) takes 1, and if no topic label in the time window Tk contains the word w, If(w) takes 2.

This topic label weight calculation method takes the positions of words into account and assigns higher weights to words appearing in topics or in microblogs that carry topics.
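Formulas (2) and (3) are likewise not reproduced here, so the sketch below only computes the named quantities Ntag(w), Ntag, Nt_blog(w), Nt_blog and If(w) and combines them in one plausible way; the final combination is an assumption, not the patented expression.

```python
def topic_label_feature(tagged_docs, window_tags, word):
    """Quantities entering the topic label feature Tk(w) for one word.

    `window_tags` is a list of tag word-lists (one per hashtag in the window);
    `tagged_docs` is the subset of segmented microblogs that carry at least
    one topic label.  The final combination below is an assumed placeholder.
    """
    n_tag = len(window_tags)
    n_tag_w = sum(1 for tag in window_tags if word in tag)        # Ntag(w)
    n_tblog = len(tagged_docs)                                    # Nt_blog
    n_tblog_w = sum(doc.count(word) for doc in tagged_docs)       # Nt_blog(w)
    if_w = 1 if n_tag_w > 0 else 2   # determination factor as stated in the text
    tag_ratio = n_tag_w / n_tag if n_tag else 0.0
    blog_ratio = n_tblog_w / n_tblog if n_tblog else 0.0
    return (tag_ratio + blog_ratio) / if_w   # assumed combination
```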

(3) Word frequency growth rate feature

The word frequency feature considers high-frequency words within a time window but does not consider the changing trend of word frequency. If an emergency has just occurred, its burst words surge only in the time window Tk and cannot be extracted through the word frequency weight alone, so it is necessary to introduce the word frequency growth rate feature to identify burst words. Combining historical data, the invention first calculates the historical average word frequency Ak(w) of a word w over the time window Tk and the preceding time windows, as shown in formula (4):

Wherein fk(w) represents the word frequency of the word w in the time window Tk; Ak-1(w) represents the historical average word frequency of the word w over the time window Tk-1 and the preceding time windows; and when k = 1, Ak(w) is the word frequency of the word w in the first time window.

The average word frequency Ak(w) over several consecutive time windows, calculated by formula (4), reflects the dynamic change of a word's frequency. The word frequency growth weight can then be calculated from the historical average word frequency and the word frequency of the current time window, representing whether a word is currently in a bursting, stable or sharply declining state. Bk(w) denotes the word frequency growth rate weight of a word w in the time window Tk, i.e., the word frequency growth rate feature, and is calculated as shown in formula (5):

Wherein fk(w) represents the word frequency of a word w in the time window Tk, and the word frequency growth rate weight Bk(w) reflects how active the word is compared with its history. If Bk(w) is greater than 0, the word is in a growth stage, and the larger the value, the more likely the word is a burst feature word; if it is less than 0, the word is in a decay stage and is essentially impossible to be a burst feature word.
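Formulas (4) and (5) are not shown in this text either; the sketch below assumes a plain running mean for Ak(w) with A1(w) = f1(w), and a relative change against that mean for Bk(w). Both expressions are assumptions consistent with the description, not the exact patented formulas.

```python
def historical_average(freqs):
    """Running historical average Ak(w) over time windows T1..Tk.

    `freqs` is the list [f1(w), ..., fk(w)]; a plain running mean with
    A1(w) = f1(w) is assumed here.
    """
    averages, total = [], 0.0
    for k, f in enumerate(freqs, start=1):
        total += f
        averages.append(total / k)
    return averages

def growth_rate_feature(freqs):
    """Word frequency growth rate Bk(w) for the latest window.

    Assumed form: Bk(w) = (fk(w) - Ak(w)) / Ak(w), positive for growing words
    and negative for decaying ones.
    """
    f_k = freqs[-1]
    a_k = historical_average(freqs)[-1]
    return (f_k - a_k) / a_k if a_k else 0.0
```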

Words with high frequency can be selected within a time window using the word frequency feature, representative words within the time window can be selected using the topic label feature, and words related to emergencies can be found quickly as time passes using the word frequency growth rate feature. Therefore, the word frequency feature Ck(w), the topic label feature Tk(w) and the word frequency growth rate feature Bk(w) of the word w are weighted to obtain the burst word judgment weight Wk(w), calculated as shown in formula (6).

Wherein Wk(w) represents the burst word weight of a word w in the time window Tk; words whose weighted result is greater than the preset burst word threshold are taken as burst words in the time window Tk.

Since the word frequency growth weight Bk(w) has a value range of (-∞, +∞), it needs to be normalized before weighting; Bkmax and Bkmin represent the maximum and minimum values of the word frequency growth weight within the time window Tk, respectively. α, β, γ denote the weights of the word frequency feature, the topic label feature and the word frequency growth rate feature, respectively, and α + β + γ = 1. The values of α, β and γ affect the burst word selection effect.
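The weighted fusion of formula (6) can then be sketched as below; the weighted-sum form and the min-max normalization of Bk(w) follow the description, while the exact expression remains an assumption since formula (6) is not reproduced here.

```python
def burst_word_weight(c_k, t_k, b_k, b_min, b_max, alpha, beta, gamma):
    """Burst word judgment weight Wk(w), sketching formula (6).

    Assumed form: Wk(w) = alpha*Ck(w) + beta*Tk(w) + gamma*norm(Bk(w)),
    with Bk(w) min-max normalized over the window and alpha+beta+gamma = 1.
    """
    b_norm = (b_k - b_min) / (b_max - b_min) if b_max > b_min else 0.0
    return alpha * c_k + beta * t_k + gamma * b_norm

# words whose weight exceeds the empirical threshold (0.7 in this embodiment)
# are kept as burst words, e.g.:
# burst_words = [w for w in vocab if burst_word_weight(...) > 0.7]
```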

2. Feature fusion method based on the combination of D-S evidence theory and the analytic hierarchy process

In order to better extract burst words from microblog data, the word frequency feature, the topic label feature and the word frequency growth rate feature are fused to obtain the burst word judgment weight, as shown in formula (6); the burst word judgment weight is used as the input of the subsequent emergency detection. Microblog emergencies are sudden and uncertain and are in an unknown state. D-S evidence theory is an uncertainty reasoning method that can handle the resulting uncertainty; meanwhile, the analytic hierarchy process can convert qualitative problems into quantitative calculations and perform consistency checks on the final quantitative results. Therefore, in this embodiment, the initial weight matrix given by experts is inferred using D-S evidence theory, the judgment matrix of the features is constructed using the analytic hierarchy process, and the consistency of the judgment matrix obtained above is checked with the consistency checking method of the analytic hierarchy process to verify the validity of the uncertainty inference process for the entire feature matrix, so as to obtain relatively accurate feature vectors, which are used as the weights of the features. The feature fusion process is shown in FIG. 2.

(1) Building an evaluation framework

The problem to be solved here is to determine whether a word extracted from microblog data is a burst word; the state of a word falls into two types, burst word and non-burst word. Therefore, an evaluation frame Θ = {Y, N} is defined, where the Y state indicates that the word is a burst word and the N state indicates that it is not. Since the burst word determination model mainly considers the word frequency feature, the topic label feature and the word frequency growth rate feature, an evidence triple E = (C, T, B) is constructed, and the value of the triple for a word w in a time window Tk is defined as (Ck(w), Tk(w), Bk(w)).

(2) Construction of judgment matrix by using D-S evidence theory

In D-S evidence theory, reasoning with the belief assignment function under the uncertainty of the evidence is the most critical step. The extraction of burst words is difficult to define with a standard data set, so the judgment matrix is constructed by combining the opinions of several experts, which minimizes the influence of any individual on the overall evaluation result and makes the result more objective.

A judgment matrix P of size 3 × 3 is defined to represent the relationships among the three pieces of evidence used for burst word determination, where the value Uij in the matrix represents the comparison result of the importance of the i-th evidence and the j-th evidence with respect to burst word determination; the larger Uij is, the more important evidence i is than evidence j. First, M experts are asked to give the probabilities m1(A), m2(A), …, mM(A) that the i-th evidence is more important than the j-th evidence, where A represents the hypothesis concerned; then, according to the D-S evidence combination rule, the combined probability value m(A) of the M experts is calculated, which is Uij. The calculation is shown in formula (7) and formula (8).

wherein ⊕ denotes the combination (orthogonal sum) operator, K denotes the normalization factor, and A1, A2, …, An are n hypotheses.
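Formulas (7) and (8) correspond to Dempster's rule of combination; a generic implementation over the frame Θ = {Y, N} is sketched below. The example expert masses are hypothetical and only illustrate how M expert opinions would be folded together to obtain Uij.

```python
from itertools import product

FRAME = frozenset({"Y", "N"})   # Y: the hypothesis holds, N: it does not

def combine(m1, m2):
    """Dempster's rule of combination for two basic probability assignments.

    m1 and m2 map frozensets of hypotheses (subsets of FRAME) to masses that
    sum to 1.  K is the total mass assigned to contradictory pairs, and the
    combined masses are renormalized by 1 - K.  Combining M expert opinions
    amounts to folding this function over their assignments.
    """
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("completely conflicting evidence")
    return {a: m / (1.0 - conflict) for a, m in combined.items()}

# hypothetical expert masses for "evidence i is more important than evidence j":
expert1 = {frozenset({"Y"}): 0.6, frozenset({"N"}): 0.1, FRAME: 0.3}
expert2 = {frozenset({"Y"}): 0.5, frozenset({"N"}): 0.2, FRAME: 0.3}
u_ij = combine(expert1, expert2)[frozenset({"Y"})]
```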

(3) Consistency check of the judgment matrix using the analytic hierarchy process

The judgment matrix P synthesized based on D-S evidence theory is shown in formula (9):

In order to use the judgment matrix P as an input to the analytic hierarchy process, the data above the diagonal in P is converted, i.e., the upper triangular part is converted to the reciprocal of the lower triangular part: if i < j, Uij = 1/Uji. Generally, if the elements of the judgment matrix satisfy aik · akj = aij, the matrix is called a consistent matrix; otherwise, it is an inconsistent matrix. If the matrix is judged inconsistent, the inconsistency index CI is calculated with the maximum eigenvalue method. After CI is obtained, the random consistency index RI is looked up in the consistency index table, and finally the relative consistency index CR is calculated; when CR < 0.1, the inconsistency of the judgment matrix P is within the allowable range, and the corresponding eigenvector {w1, w2, …, wn} can be used as the weight vector. If the consistency check is not passed, the pairwise comparison relationships of the values in the matrix need to be readjusted until the check is passed.
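The consistency check and the weight derivation of step (4) below can be sketched together as follows; this follows the standard AHP procedure named in the text (CI = (λmax - n)/(n - 1), CR = CI/RI, CR < 0.1), with a hypothetical judgment matrix in the usage comment.

```python
import numpy as np

# random consistency index RI for matrix orders 1..5 (standard AHP table)
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}

def ahp_weights(P):
    """Consistency check and weight derivation for a judgment matrix P.

    CI = (lambda_max - n) / (n - 1) and CR = CI / RI; when CR < 0.1 the
    column-normalized, row-summed, renormalized vector is returned as the
    feature weights (alpha, beta, gamma).
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    lam_max = max(np.linalg.eigvals(P).real)
    ci = (lam_max - n) / (n - 1)
    cr = ci / RI[n]
    if cr >= 0.1:
        raise ValueError(f"consistency check failed: CR = {cr:.3f}")
    P_prime = P / P.sum(axis=0)   # normalize each column
    x = P_prime.sum(axis=1)       # sum each row -> x1, x2, x3
    return x / x.sum()            # normalized weights alpha, beta, gamma

# example with a hypothetical 3x3 judgment matrix:
# alpha, beta, gamma = ahp_weights([[1, 2, 3], [1/2, 1, 2], [1/3, 1/2, 1]])
```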

(4) Calculation of weight of each index

Each column of the judgment matrix P that has passed the consistency check is normalized to obtain a new judgment matrix P', with the normalized elements Uij recorded as U'ij. Then, each row of the judgment matrix P' is summed to obtain a vector xT = (x1, x2, x3) characterizing the influence of each feature on burst word selection, where xi is the sum of the elements in the i-th row. Finally, the vector xT is normalized to obtain the weight vector αT = (α1, α2, α3) corresponding to the features, where αi is the weight of the i-th feature for burst word determination, that is, α = α1, β = α2, γ = α3. The burst word judgment weight of a word w is then calculated with formula (6) and compared with the burst word judgment threshold to decide whether the word is a burst word. The burst word judgment threshold is an empirical value that can be obtained through repeated experiments; its optimal value in this embodiment is 0.7.

3. Emergency detection model based on burst words

A time window Tk is defined and its burst word set Wk is extracted; emergencies are then detected based on the burst words in Wk. Because emergencies are uncertain and their exact number is difficult to determine, a clustering algorithm from machine learning is adopted to construct an event detection model based on burst feature words. This embodiment uses a hierarchical clustering algorithm to construct the emergency detection model.

(1) Method for calculating coupling degree of burst words

When an emergency occurs, a large number of microblogs may be generated in the same period, and certain words appear frequently in these microblogs, which are therefore very likely to describe the same emergency. To prevent semantic differences between extracted burst words from causing two burst words describing the same emergency to be divided into two classes and thus affecting the precision of emergency detection, the invention introduces the concept of the coupling degree of burst words. The burst word coupling degree is the fusion of the co-occurrence degree and the combination closeness between two burst words. The co-occurrence degree represents how often two burst words appear in the same microblog, and the combination closeness reflects the semantic correlation between the two burst words. Considering the case where words co-occur but are semantically unrelated, the coupling degree is provided as input to a hierarchical clustering algorithm to obtain a tree structure with burst words as nodes, and finally the tree structure is split and pruned to realize emergency detection.

The relative co-occurrence degree R(wi|wj) of a word wi with respect to a word wj and the relative co-occurrence degree R(wj|wi) of a word wj with respect to a word wi are defined as in formulas (10) and (11).

Wherein tf(wi, wj) and tf(wj, wi) represent the number of microblogs containing both words wi and wj, and tf(wi, wj) = tf(wj, wi); tf(wi) and tf(wj) represent the numbers of microblogs containing the words wi and wj, respectively.

In most cases, R(wi|wj) is not equal to R(wj|wi); for this reason, the co-occurrence degree C(wi, wj) is defined as shown in formula (12).

The combination closeness of two burst words appearing in the same microblog is calculated using mutual information, as shown in formula (13):

Wherein MI(wi, wj) represents the mutual information of the words wi and wj; P(wi) and P(wj) represent the probabilities of the occurrence of the words wi and wj, respectively, and P(wi, wj) represents the probability of the words wi and wj occurring together.

tf(wi) and tf(wj) respectively represent the number of microblogs containing the words wi and wj, tf(w) represents the total number of microblogs, and tf(wi, wj) represents the number of microblogs containing both words wi and wj.

A larger mutual information value indicates a higher combination closeness between two words. A word coupling degree calculation model is constructed by fusing the co-occurrence degree and the combination closeness of words, as shown in formula (14).

S(wi, wj) = C(wi, wj) + MI(wi, wj)    (14)

Wherein S(wi, wj) is the coupling degree between the words wi and wj.
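A sketch of the coupling degree calculation is given below. The counts follow the definitions above, but since formulas (10) to (13) are not reproduced in this text, the exact forms of R, C and MI used here (count ratios, their mean, and pointwise mutual information over microblog counts) are assumptions.

```python
import math

def coupling_degree(docs, wi, wj):
    """Coupling degree S(wi, wj) = C(wi, wj) + MI(wi, wj), sketching formula (14).

    `docs` is the list of segmented microblogs in the current time window.
    """
    n = len(docs)
    tf_i = sum(1 for d in docs if wi in d)
    tf_j = sum(1 for d in docs if wj in d)
    tf_ij = sum(1 for d in docs if wi in d and wj in d)
    if not (tf_i and tf_j and tf_ij):
        return 0.0
    r_i_given_j = tf_ij / tf_j            # relative co-occurrence of wi w.r.t. wj (assumed)
    r_j_given_i = tf_ij / tf_i
    c = (r_i_given_j + r_j_given_i) / 2   # assumed symmetric combination for formula (12)
    p_i, p_j, p_ij = tf_i / n, tf_j / n, tf_ij / n
    mi = math.log(p_ij / (p_i * p_j))     # combination closeness, formula (13)
    return c + mi
```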

The coupling degree matrix SW' of the burst word set Wk is constructed as shown in formula (15).

The coupling degree matrix SW' is symmetric, and its diagonal elements take the value Smax, the maximum value of the coupling degree between burst words. Each element in the coupling degree matrix SW' is normalized to obtain the normalized coupling degree matrix SW, and the matrix SW is taken as the input of hierarchical clustering.

(2) Event detection method based on agglomerative hierarchical clustering

This embodiment employs bottom-up agglomerative hierarchical clustering and uses an average-distance approach to calculate inter-cluster distances. Assuming that cluster ya = {wa1, wa2, …, wam} and cluster yb = {wb1, wb2, …, wbm} consist of different numbers of burst words, the average distance between clusters is affected by the similarity between burst words, so the inter-cluster distance D(ya, yb) is constructed by taking the reciprocal average of the similarities between pairs of elements in the two clusters as the average distance between the two clusters, as shown in formula (16).

Wherein |ya| and |yb| respectively represent the number of burst words in cluster ya and cluster yb, wi and wj respectively represent burst words in the two clusters, and Smax and Smin respectively represent the maximum and minimum values of the coupling degree between burst words in cluster ya and cluster yb. The inter-cluster distance is updated through this average distance formula to achieve clustering of the burst words.
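Formula (16) is not reproduced here either; the sketch below reads "reciprocal average of the similarities" as the reciprocal of the mean of min-max normalized coupling degrees between the two clusters, using Smax and Smin as stated. This normalization choice is an assumption.

```python
def cluster_distance(ya, yb, S, s_max, s_min):
    """Inter-cluster distance D(ya, yb) as a reciprocal average similarity.

    `S` maps burst-word pairs to coupling degrees; s_max and s_min are the
    maximum and minimum coupling degrees between words of the two clusters.
    """
    total = 0.0
    for wi in ya:
        for wj in yb:
            s = S.get((wi, wj), S.get((wj, wi), 0.0))
            total += (s - s_min) / (s_max - s_min) if s_max > s_min else 0.0
    mean_sim = total / (len(ya) * len(yb))
    return 1.0 / mean_sim if mean_sim > 0 else float("inf")
```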

In this embodiment, the implementation process of the bottom-up agglomerative hierarchical clustering method based on average distance is as follows:

Input: the burst word set Wk under the time window Tk and the coupling degree matrix SW' of the burst word set Wk;

Output: a binary tree structure with the burst words as nodes.

Step 1: regarding each burst word in the burst word set Wk = {w1, w2, …, wk} as a cluster; the current cluster set is Cluster = {y1, y2, …, yk} = {{w1}, {w2}, …, {wk}};

Step 2: calculating the similarity between any two clusters in the cluster set Cluster and caching it in the set Temp, where Temp = {D(y1, y2), D(y1, y3), …, D(y1, yk), D(y2, y3), …, D(yk-1, yk)};

Step 3: judging the number of elements |Temp|: if |Temp| = 1, skipping to step 5; if |Temp| > 1, selecting the minimum similarity value in the set Temp and obtaining the two corresponding clusters, assumed to be clusters ym and yn, deleting the two clusters from the set Cluster, merging them into a new cluster and adding the new cluster into the set Cluster, so that the cluster set becomes Cluster = {y1, y2, …, {ym, yn}, …, yk-1};

Step 4: updating the Temp set corresponding to the merged cluster set Cluster, and jumping back to step 3;

Step 5: outputting the corresponding binary tree structure.

Through the bottom-up agglomerative hierarchical clustering method, a binary tree structure composed of the burst words in the burst word set is obtained. The set of all leaf nodes in the binary tree is the initial burst word set Wk, while the non-leaf nodes in the binary tree are subsets of the burst word set Wk. To identify emergencies, the binary tree, whose leaf nodes are burst words and whose non-leaf nodes are clusters formed from burst words, needs to be segmented, and the burst words in the segmented subtrees are used as keywords of emergencies. In step 3 of the clustering method, when the two clusters ym and yn with the minimum similarity value are merged, the similarity between ym and yn is used as the criterion for partitioning the binary tree: if the similarity between the two merged clusters is not high enough, the cluster needs to be segmented.
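Steps 1 to 5 above can be sketched as the following bottom-up procedure, which returns nested pairs as the binary tree with burst words as leaves. The distance callback is assumed to implement the inter-cluster distance D(ya, yb), for example by wrapping the cluster_distance sketch above; the implementation details are illustrative rather than the patented ones.

```python
def agglomerative_clustering(burst_words, distance):
    """Bottom-up agglomerative clustering of burst words (steps 1-5).

    `distance(ya, yb)` takes two frozensets of burst words and returns the
    inter-cluster distance D.  Each tree node is either a single burst word
    (leaf) or a (left, right) pair of subtrees.
    """
    # step 1: every burst word starts as its own cluster / leaf
    clusters = [frozenset([w]) for w in burst_words]
    trees = {c: w for c, w in zip(clusters, burst_words)}
    while len(clusters) > 1:
        # step 2: pairwise distances between the current clusters (the Temp set)
        pairs = [(distance(a, b), a, b)
                 for i, a in enumerate(clusters) for b in clusters[i + 1:]]
        # step 3: merge the two clusters with the minimum distance value
        _, a, b = min(pairs, key=lambda t: t[0])
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        trees[merged] = (trees.pop(a), trees.pop(b))
        # step 4: loop back and recompute the distances for the updated cluster set
    # step 5: the remaining entry is the binary tree over all burst words
    return trees[clusters[0]]
```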

The implementation process of the binary tree pruning and segmenting method based on the internal similarity in the specific implementation mode is as follows:

Input: the binary tree structure output by the agglomerative hierarchical clustering and the intra-cluster similarity threshold θ; θ takes a value in the range 0.5 to 1.5, and an optimal value can be screened out through repeated experiments. Output: the division set E of emergencies;

Step 6: comparing the internal similarity D of the left and right subtrees of the root node with θ, wherein the internal similarity D of the left and right subtrees is the similarity between the clusters corresponding to the left and right subtrees; if D ≥ θ, the root node meets the intra-cluster similarity requirement, the whole binary tree (root) is added to the emergency division set E, and the procedure jumps to step 9; otherwise, step 7 and step 8 are executed respectively;

Step 7: judging whether the left subtree of the root node is empty; if not, taking the left subtree and the intra-cluster similarity threshold θ as input and performing internal-similarity-based binary tree pruning on the left subtree; if it is empty, terminating this step;

Step 8: judging whether the right subtree of the root node is empty; if not, taking the right subtree and the intra-cluster similarity threshold θ as input and performing internal-similarity-based binary tree pruning on the right subtree; if it is empty, terminating this step;

and step 9: and outputting the emergency division set E.

In the emergency division set E, all burst words are contained in the set and are placed in different clusters, and the sizes of the clusters are not necessarily the same. Emergencies are then judged manually according to the division set E.
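Steps 6 to 9 can be sketched as the recursive pruning below, operating on the nested-pair tree produced by the clustering sketch; `internal_similarity(left, right)` is assumed to return the similarity D between the clusters of the two subtrees, and treating a single leaf that is reached as its own one-word cluster is an assumption, since the text does not specify that case.

```python
def prune(tree, internal_similarity, theta):
    """Internal-similarity-based pruning of the clustering binary tree (steps 6-9)."""
    events = []

    def collect(node):
        # gather all burst words (leaves) under a node
        if not isinstance(node, tuple):
            return [node]
        return collect(node[0]) + collect(node[1])

    def visit(node):
        if not isinstance(node, tuple):              # leaf: a single burst word
            events.append([node])                    # assumed handling of single leaves
            return
        left, right = node
        if internal_similarity(left, right) >= theta:
            events.append(sorted(collect(node)))     # whole subtree is one emergency
        else:
            visit(left)                              # step 7: recurse into the left subtree
            visit(right)                             # step 8: recurse into the right subtree

    visit(tree)
    return events                                    # step 9: emergency division set E
```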
