Two-layer dissimilarity community discovery algorithm research based on central node

Document No.: 1087058. Publication date: 2020-10-20.

Abstract: This technology, "Two-layer dissimilarity community discovery algorithm research based on central node", was created by Zhang Yuexia and Chen Ziyang on 2019-04-03. The invention provides a two-layer dissimilarity community discovery algorithm based on central nodes (TDCN-CD). First, central nodes are selected according to node degree and distance, which prevents nodes that lie close together in the same community from being selected as central nodes at the same time. Second, a two-layer node dissimilarity index is proposed, which mines the dissimilarity between nodes more deeply and yields a more accurate community division. Finally, the Karate and Dolphins data sets are each used for simulation and result analysis; the results indicate that, compared with the classical Girvan-Newman and Fast-Newman community division algorithms, the TDCN-CD algorithm detects community structure effectively and divides communities more accurately.

1. A two-layer dissimilarity community discovery algorithm based on central nodes, characterized by comprising the following steps:

1) establishing a set of central nodes;

2) carrying out community division according to the two-layer dissimilarity index;

3) evaluating the division result using a modularity function.

2. The two-layer dissimilarity community discovery algorithm based on central nodes according to claim 1, wherein in step 1 the central node set is established as follows:

The TDCN-CD algorithm first traverses the whole network G and calculates the degrees of all nodes in the network. The nodes are sorted in descending order of degree, forming the set $V_H = \{V_1, \dots, V_i, V_{i+1}, \dots, V_N\}$, where $k_i > k_{i+1}$. The first 10% of the nodes in $V_H$ form the node set $V_{max} = \{V_1, \dots, V_E\}$, where $E = 10\%\,N$; $V_{max}$ is the set of high-degree nodes.

To avoid selecting central nodes that lie in the same community, the high-degree nodes must be screened a second time. Nodes in the same community are very close to one another, that is, the dissimilarity between high-degree nodes in the same community is far smaller than that between high-degree nodes in different communities. Therefore, the dissimilarity between every pair of nodes in $V_{max}$ is calculated using the node dissimilarity index, a threshold D is set, and the nodes with large dissimilarity are retained to form the central node set $V_{core} = \{V_1, \dots, V_F\}$, where F is the number of central nodes in the network.

3. The two-layer dissimilarity community discovery algorithm based on central nodes according to claim 1, wherein in step 2 the community division according to the two-layer dissimilarity index is carried out as follows:

After the central nodes of the network have been determined, the communities to which the other nodes belong must be determined. The community to which each node belongs is calculated according to the single-layer node dissimilarity index.

4. The two-layer dissimilarity community discovery algorithm based on central nodes according to claim 1, wherein in step 3 the division result is evaluated using the modularity function as follows:

The value of the modularity function Q reflects the strength of the community structure: the larger the Q value, the stronger the community structure; the smaller the Q value, the weaker the community structure. All nodes in the network are traversed and assigned to the communities of the different central nodes according to the two-layer dissimilarity index, and the modularity function Q is used as the evaluation index of community division quality.

Technical Field

The invention relates to the field of community discovery, in particular to a two-layer dissimilarity community discovery algorithm based on a central node.

Background

Complex networks can represent a variety of complex systems in nature and the real world, and are characterized by complexity, scale-free degree distributions, and the small-world property. A complex network has a specific organizational structure, and most complex networks exhibit community structure: dense locally and sparse overall. Community discovery in complex networks is important for research on structure analysis, behavior prediction, and related topics.

Many ideas and algorithms have been proposed for community discovery in complex networks. They can be divided into graph partitioning methods, spectral methods, modularity-based dynamic methods, and hierarchical clustering methods. Hierarchical clustering can effectively reveal the hierarchical structure of a network and facilitates further study of the network topology, so it has attracted much attention in community discovery research. Hierarchical clustering methods fall into two categories, divisive and agglomerative. Divisive methods, represented by the GN algorithm, cannot assign isolated nodes to a community in a sparse network, have high time complexity, and often fail to give accurate results. The time complexity of agglomerative methods is generally lower than that of divisive methods, so they have greater application prospects in community discovery research. Agglomerative methods can be classified into: global similarity methods, represented by the fast algorithm proposed by Newman (FN); local similarity methods, represented by the label propagation algorithm; and central-node-based clustering methods, represented by the K-means algorithm. Central-node-based clustering offers better accuracy, high community discovery quality, and a wide range of application.

The division result of the traditional K-means algorithm is strongly influenced by the initial central nodes, and its running time is long. To address these problems, the invention provides a Two-layer Dissimilarity Community discovery algorithm Based on the Central Node (TDCN-CD). The basic idea is to regard each node in the network as an independent community, select high-degree nodes using node degree as the index, and then perform a secondary screening using the distance between nodes as the index, which ensures that the central nodes are selected accurately. In addition, to handle the case in which several nodes have the same dissimilarity during community division, the TDCN-CD algorithm defines a two-layer node dissimilarity index that mines the dissimilarity between nodes more deeply, giving the algorithm higher accuracy and a wider range of application. The TDCN-CD algorithm can effectively detect community structure and divide communities accurately, and it has practical value and research significance.

Disclosure of Invention

The invention provides a two-layer dissimilarity community discovery algorithm based on central nodes (TDCN-CD). The algorithm selects central nodes according to node degree and distance, which prevents nodes that lie close together in the same community from being selected as central nodes at the same time. It also proposes a two-layer node dissimilarity index, which mines the dissimilarity between nodes more deeply and yields a more accurate community division. Simulation and result analysis on the Karate and Dolphins data sets indicate that, compared with the classical Girvan-Newman and Fast-Newman community division algorithms, the TDCN-CD algorithm detects community structure effectively and divides communities more accurately.

The two-layer dissimilarity community discovery algorithm (TDCN-CD) based on the central node comprises the following steps:

1) Establish a set of central nodes.

2) Carry out community division according to the two-layer dissimilarity index.

3) Evaluate the division result using a modularity function.

In step 1, the central node set is established as follows:

To avoid selecting central nodes that lie in the same community, the high-degree nodes must be screened a second time. Nodes in the same community are very close to one another, that is, the dissimilarity between high-degree nodes in the same community is far smaller than that between high-degree nodes in different communities. Therefore, the dissimilarity between every pair of nodes in the high-degree node set $V_{max}$ is calculated using the node dissimilarity index, a threshold D is set, and the nodes with large dissimilarity are retained to form the central node set $V_{core} = \{V_1, \dots, V_F\}$, where F is the number of central nodes in the network.

In step 2, communities are divided according to the two-layer dissimilarity index as follows:

When only the single-layer node dissimilarity index is considered, node i and the central node may share two single-layer neighbor nodes, b and c, which gives node i a certain single-layer dissimilarity value. Node j and the central node may likewise share two single-layer neighbor nodes, c and e, which gives node j the same single-layer dissimilarity value.

Node i and node j then have the same dissimilarity, and their community attribution cannot be decided. A two-layer node dissimilarity evaluation index is therefore defined. When the single-layer node dissimilarity values are equal, the two-layer node dissimilarity evaluation index is calculated, and each node is assigned to the community of the central node with the minimum dissimilarity.

In step 3, the division result is evaluated using the modularity function Q: the larger the Q value, the stronger the community structure.

Drawings

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings, in which:

FIG. 1 is a model structure diagram of the present invention.

Fig. 2 is a community relationship diagram of the network G.

Fig. 3 is a schematic diagram of common neighboring nodes with a central node.

Fig. 4 is a Karate network topology.

Fig. 5 is a schematic diagram of the node degrees of the Karate network.

Fig. 6 is a schematic diagram of the clustering coefficients of the Karate network.

FIG. 7 is a diagram of the community division result of the Karate network based on the TDCN-CD algorithm.

Fig. 8 is a Dolphins network topology.

Fig. 9 is a schematic diagram of the node degrees of the Dolphins network.

FIG. 10 is a schematic diagram of the clustering coefficients of the Dolphins network.

FIG. 11 is a TDCN-CD algorithm-based community partition result diagram of the Dolphins network.

Detailed Description

The following describes in further detail embodiments of the present invention with reference to the accompanying drawings.

When community division is performed according to the single-layer node dissimilarity index, several nodes may have the same dissimilarity, so the community to which a node belongs cannot be decided from the single-layer dissimilarity alone. To solve this problem, the TDCN-CD algorithm defines a two-layer node dissimilarity index, which further ensures the accuracy of the final community division.

Fig. 1 shows the model structure of the present invention. The specific steps are as follows:

1. Establish a set of central nodes.

The TDCN-CD algorithm first traverses the entire network G and calculates the degrees of all nodes in the network from formula (1). The nodes are sorted in descending order of degree, forming the set $V_H = \{V_1, \dots, V_i, V_{i+1}, \dots, V_N\}$, where $k_i > k_{i+1}$. The first 10% of the nodes in $V_H$ form the node set $V_{max} = \{V_1, \dots, V_E\}$, where $E = 10\%\,N$; $V_{max}$ is the set of high-degree nodes.

$$k_i = \sum_{j=1}^{N} A_{ij} \qquad (1)$$

In formula (1), the relations between nodes are represented by the adjacency matrix A, and the node degree $k_i$ is the number of neighbor nodes of node i.
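The degree calculation and the 10% cut described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `high_degree_nodes`, the use of a symmetric 0/1 adjacency matrix, and rounding E up to at least one node are all assumptions made here.

```python
import math

def high_degree_nodes(adj, fraction=0.10):
    """Return (degrees, V_H, V_max) for a symmetric 0/1 adjacency matrix.

    Hypothetical helper: the rounding rule for E is an assumption.
    """
    n = len(adj)
    # k_i is the row sum of A, i.e. the number of neighbours of node i.
    degrees = [sum(row) for row in adj]
    # V_H: node indices sorted by degree, largest first.
    # sorted() is stable, so nodes with equal degree keep index order.
    v_h = sorted(range(n), key=lambda i: -degrees[i])
    # E = fraction * N, rounded up and never smaller than 1 (assumption).
    e = max(1, math.ceil(fraction * n))
    return degrees, v_h, v_h[:e]

# Toy network with edges 0-1, 0-2, 0-3, 3-4.
adj = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
degrees, v_h, v_max = high_degree_nodes(adj)
print(degrees)  # [3, 1, 1, 2, 1]
print(v_h)      # [0, 3, 1, 2, 4]
print(v_max)    # [0]
```

On a five-node network the 10% rule rounds up to a single candidate; on the 34-node Karate network it would give four candidates under the same rounding assumption.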

To avoid selecting central nodes that lie in the same community, the high-degree nodes must be screened a second time. As shown in Fig. 2, nodes in the same community are very close to one another, that is, the dissimilarity between high-degree nodes in the same community is far smaller than that between high-degree nodes in different communities. Therefore, the dissimilarity between every pair of nodes in $V_{max}$ is calculated using the node dissimilarity index of formula (2), a threshold D is set, and the nodes with large dissimilarity are retained to form the central node set $V_{core} = \{V_1, \dots, V_F\}$, where F is the number of central nodes in the network.

$$\gamma_1(V_i, V_j) = S_{i,j} \qquad (2)$$

In formula (2), $S_{i,j}$ is the distance between nodes i and j, and $\gamma_1(V_i, V_j)$ is the single-layer node dissimilarity evaluation index.
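One way to read the secondary screening is as a greedy filter over the high-degree candidates. The sketch below makes two assumptions that the text does not state: the distance $S_{i,j}$ is taken to be the shortest-path length in hops, and a candidate is kept only if its dissimilarity to every central node already kept is at least the threshold D. The names `bfs_distances` and `screen_central_nodes` are hypothetical.

```python
from collections import deque

def bfs_distances(neighbors, source):
    """Hop distances from source to every reachable node (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def screen_central_nodes(neighbors, v_max, threshold_d):
    """Greedy screening (assumed rule): keep a candidate only if it is at
    least threshold_d hops away from every central node kept so far."""
    v_core = []
    for candidate in v_max:  # v_max is expected in descending degree order
        dist = bfs_distances(neighbors, candidate)
        # Unreachable nodes count as infinitely dissimilar.
        if all(dist.get(c, float("inf")) >= threshold_d for c in v_core):
            v_core.append(candidate)
    return v_core

# Two triangles (0,1,2) and (3,4,5) joined through node 6.
neighbors = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 6],
    6: [2, 3],
    3: [6, 4, 5], 4: [3, 5], 5: [3, 4],
}
print(screen_central_nodes(neighbors, [2, 0, 3], threshold_d=2))  # [2, 3]
```

Node 0 is dropped because it is one hop from node 2, which sits in the same triangle, while node 3 is kept because it is two hops away. Raising the threshold to 3 would leave node 2 as the only central node, which shows how strongly the result depends on D.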

2. Carry out community division according to the two-layer dissimilarity index.

As shown in Fig. 3, when only the single-layer node dissimilarity index is considered, node i and the central node may share two single-layer neighbor nodes, b and c, and formula (2) gives node i a certain single-layer dissimilarity value. Node j and the central node may likewise share two single-layer neighbor nodes, c and e, and formula (2) gives node j the same value.

Node i and node j then have the same dissimilarity, and their community affiliation cannot be decided.

As shown in Fig. 1, the possible connections between node i and node j fall into two main cases. In the first case, the two-layer neighbor node q of node i is directly connected to node j, that is, node i and node j are connected through the two nodes p and q. The two-layer node dissimilarity index of node i and node j is then written $\gamma_{21}(V_i, V_j)$, as given in formula (3).

In the second case, the two-layer neighbor node q of node i is connected to the one-layer neighbor node s of node j, that is, node i and node j are connected through the three nodes p, q, and s. The node dissimilarity of node i and node j is then written $\gamma_{22}(V_i, V_j)$, as given in formula (4).

The two-layer node dissimilarity evaluation index is finally expressed by formula (5). Using this index, each node is assigned to the community of the central node with the minimum dissimilarity.

The coefficient that appears in these formulas is the average clustering coefficient.
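The control flow of this step (use the single-layer index first, fall back to the two-layer index only on a tie) can be sketched as follows. Formulas (3) to (5) are not reproduced in this text, so the two-layer index is passed in as a caller-supplied function; `assign_communities`, `single_layer`, and `two_layer` are hypothetical names, and the toy dissimilarity values are made up purely to exercise the tie-breaking path.

```python
def assign_communities(nodes, centers, single_layer, two_layer):
    """Assign every node to the community of its least dissimilar central node.

    single_layer(v, c) and two_layer(v, c) are caller-supplied dissimilarity
    functions; two_layer is consulted only when single_layer ties.
    """
    membership = {c: c for c in centers}  # each central node labels its own community
    for v in nodes:
        if v in membership:
            continue
        best = min(single_layer(v, c) for c in centers)
        tied = [c for c in centers if single_layer(v, c) == best]
        if len(tied) == 1:
            membership[v] = tied[0]
        else:
            # Single-layer tie: decide with the two-layer index.
            membership[v] = min(tied, key=lambda c: two_layer(v, c))
    return membership

# Made-up values: node "i" is clearly closer to A, node "j" ties on the single layer.
single = {("i", "A"): 1, ("i", "B"): 2, ("j", "A"): 2, ("j", "B"): 2}
double = {("j", "A"): 0.7, ("j", "B"): 0.4}

result = assign_communities(
    nodes=["A", "B", "i", "j"],
    centers=["A", "B"],
    single_layer=lambda v, c: single[(v, c)],
    two_layer=lambda v, c: double[(v, c)],
)
print(result)  # {'A': 'A', 'B': 'B', 'i': 'A', 'j': 'B'}
```

Keeping the two indices as separate callables means the two-layer calculation, which is the more expensive one, runs only for nodes whose single-layer values are equal.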

3. Evaluate the division result using a modularity function.

Newman et al. proposed a network modularity evaluation function, defined in formula (7).

Here, $g_{\beta\beta}$ refers to edges whose two end nodes both lie within community β, and $\alpha_\beta$ refers to edges with at least one end node within community β. The Q value reflects the strength of the community structure: the larger the Q value, the stronger the community structure; the smaller the Q value, the weaker the community structure.
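Formula (7) itself is not reproduced in this text. Assuming it is the standard modularity of Newman and Girvan, with $g_{\beta\beta}$ the fraction of all edges that lie inside community β and $\alpha_\beta$ the fraction of all edge ends attached to nodes of community β, it would read:

$$Q = \sum_{\beta} \left( g_{\beta\beta} - \alpha_\beta^{2} \right)$$

As a worked example under that assumption, take two triangles joined by a single edge (7 edges in total) and treat each triangle as one community. Each community contains 3 of the 7 edges, so $g_{\beta\beta} = 3/7$, and holds 7 of the 14 edge ends, so $\alpha_\beta = 1/2$. Then

$$Q = 2\left(\frac{3}{7} - \frac{1}{4}\right) = \frac{5}{14} \approx 0.357 .$$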

All nodes in the network are traversed and assigned to the communities of the different central nodes according to the two-layer dissimilarity index, and the modularity function Q is used as the evaluation index of community division quality.
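A modularity evaluation of a finished division can be sketched as below. The sketch assumes the standard Newman-Girvan definition (fraction of edges inside each community minus the squared fraction of edge ends attached to it), which may differ in detail from the patent's own formula; the function name `modularity` and the input layout are choices made here.

```python
def modularity(edges, membership):
    """Newman-Girvan modularity Q for an undirected, unweighted edge list.

    edges: list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping each node to its community label.
    """
    m = len(edges)
    inside = {}  # number of edges with both ends in the community
    ends = {}    # number of edge ends attached to the community
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        ends[cu] = ends.get(cu, 0) + 1
        ends[cv] = ends.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (ends[c] / (2 * m)) ** 2 for c in ends)

# Path 0-1-2-3 split into {0, 1} and {2, 3}.
edges = [(0, 1), (1, 2), (2, 3)]
split = {0: "a", 1: "a", 2: "b", 3: "b"}
q_split = modularity(edges, split)   # 2 * (1/3 - 1/4) = 1/6
q_single = modularity(edges, {n: "a" for n in range(4)})  # one community gives 0
print(round(q_split, 4), q_single)
```

Placing every node in one community always gives Q = 0 under this definition, which makes it a convenient baseline when comparing divisions.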

Finally, the correctness of the model is verified through experimental simulation.

1) Community partition analysis based on Karate network

The Karate network is a social network built from the relationships among the members of a karate club at an American university. The network comprises 34 nodes and 78 edges; the nodes represent club members and the edges represent connections between members. The network topology is shown in Fig. 4. The TDCN-CD algorithm needs two indices, the degree and the average clustering coefficient, when dividing a complex network into communities. The node degrees of the Karate network are shown in Fig. 5, and the clustering coefficients are shown in Fig. 6.

The TDCN-CD algorithm is used to divide the Karate network into communities, and the result is shown in Fig. 7. Compared with the real network, node 10 is assigned to the wrong community, and all other nodes are assigned as in the real network.

2) Community partitioning analysis based on Dolphins network

The Dolphins network is a social network built from the relationships among the members of a dolphin community. The network comprises 62 nodes and 159 edges; the nodes represent community members and the edges represent connections between members. The network topology is shown in Fig. 8. The node degrees of the Dolphins network are shown in Fig. 9, and the clustering coefficients are shown in Fig. 10.

The TDCN-CD algorithm is used to divide the Dolphins network into communities, and the result is shown in Fig. 11. Compared with the real network, node 31 is assigned to the wrong community, and all other nodes are assigned as in the real network.

The quality of community division is evaluated using the modularity Q value, and the division quality of the TDCN-CD algorithm is compared with that of the classical GN and FN algorithms on the real network data sets. The Q value obtained by the proposed algorithm is 0.7293 on the Karate network and 0.8356 on the Dolphins network. Both values are higher than those of the other two classical algorithms, which supports the practical value of the proposed algorithm.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and it is obvious that those skilled in the art can make various changes and modifications of the present invention without departing from the spirit and scope of the present invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include such modifications and variations.
