Graph data partitioning method and apparatus, and computer device

Document No.: 1963806  Publication date: 2021-12-14  Views: 20  Language: Chinese

Reading note: This technology, "Graph data partitioning method and apparatus, and computer device" (图数据划分方法、装置和计算机设备), was created by 覃伟, 于纪平, 朱晓伟, and 陈文光 on 2021-11-15. Its main content: the embodiments of the specification disclose a graph data partitioning method and apparatus and a computer device. The method comprises: dividing vertices in graph data into a plurality of data sets; and dividing each edge in the graph data into the data set in which the target vertex of the edge is located; wherein the data sets are used by nodes in a distributed cluster to perform graph computation, and the computation amounts of the plurality of data sets are similar. The embodiments of the specification can balance the load among the nodes in the distributed cluster and save communication overhead.

1. A graph data partitioning method, comprising:

dividing vertices in graph data into a plurality of data sets;

dividing each edge in the graph data into the data set in which the target vertex of the edge is located; wherein the data sets are used by nodes in a distributed cluster to perform graph computation, and the computation amounts of the plurality of data sets are similar.

2. The method of claim 1, wherein the dividing vertices in graph data into a plurality of data sets comprises:

determining a computation amount reference value according to the graph data and the number of the data sets;

dividing the vertices in the graph data into the plurality of data sets; wherein, during the dividing, the computation amount of each data set is determined so that the computation amount of the data set is close to the computation amount reference value.

3. The method of claim 2, wherein the determining a computation amount reference value comprises:

determining the computation amount reference value according to the number of vertices in the graph data, the numbers of edges of the vertices, and the number of the data sets;

and wherein the determining the computation amount of each data set comprises:

determining the computation amount of the data set according to the number of vertices in the data set and the numbers of edges of those vertices.

4. The method of claim 1, wherein the dividing vertices in graph data into a plurality of data sets comprises:

acquiring a locality feature of the graph data, wherein the locality feature represents the degree of proximity between vertices;

dividing the vertices in the graph data into the plurality of data sets according to the locality feature.

5. The method of claim 4, wherein the dividing vertices in graph data into a plurality of data sets comprises:

assigning identifiers to the vertices in the graph data according to the locality feature, wherein the numbering order of the identifiers represents the degree of proximity;

dividing the vertices in the graph data into the plurality of data sets according to the numbering order of the identifiers.

6. The method of claim 1, wherein the dividing each edge in the graph data into the data set in which the target vertex of the edge is located comprises:

constructing a table in which each row and each column corresponds to one data set;

for each edge in the graph data, determining a target row in the table according to the source vertex of the edge, determining a target column according to the target vertex of the edge, and placing the edge in the cell defined by the target row and the target column;

dividing the edges in each column of cells into the data set corresponding to that column.

7. The method of claim 1, further comprising:

dividing a data set into a plurality of subsets, wherein the subsets are used by one node in the distributed cluster for graph computation.

8. The method of claim 7, wherein the dividing the data set into a plurality of subsets comprises:

dividing the data set into the plurality of subsets according to the number of threads or the number of processes of the node.

9. A graph data partitioning method, comprising:

acquiring a locality feature of graph data, wherein the locality feature represents the degree of proximity between vertices;

dividing vertices in the graph data into a plurality of data sets according to the locality feature;

dividing edges in the graph data into the plurality of data sets; wherein the data sets are used by nodes in a distributed cluster to perform graph computation, and the computation amounts of the plurality of data sets are similar.

10. The method of claim 9, wherein the dividing vertices in graph data into a plurality of data sets comprises:

determining a computation amount reference value according to the graph data and the number of the data sets;

dividing the vertices in the graph data into the plurality of data sets; wherein, during the dividing, the computation amount of each data set is determined so that the computation amount of the data set is close to the computation amount reference value.

11. The method of claim 10, wherein the determining a computation amount reference value comprises:

determining the computation amount reference value according to the number of vertices in the graph data, the numbers of edges of the vertices, and the number of the data sets;

and wherein the determining the computation amount of each data set comprises:

determining the computation amount of the data set according to the number of vertices in the data set and the numbers of edges of those vertices.

12. The method of claim 9, wherein the dividing vertices in graph data into a plurality of data sets comprises:

assigning identifiers to the vertices in the graph data according to the locality feature, wherein the numbering order of the identifiers represents the degree of proximity;

dividing the vertices in the graph data into the plurality of data sets according to the numbering order of the identifiers.

13. The method of claim 9, further comprising:

dividing a data set into a plurality of subsets, wherein the subsets are used by one node in the distributed cluster for graph computation.

14. The method of claim 13, wherein the dividing a data set into a plurality of subsets comprises:

dividing the data set into the plurality of subsets according to the number of threads or the number of processes of the node.

15. A graph data partitioning apparatus, comprising:

a first dividing unit configured to divide vertices in graph data into a plurality of data sets;

a second dividing unit configured to divide each edge in the graph data into the data set in which the target vertex of the edge is located; wherein the data sets are used by nodes in a distributed cluster to perform graph computation, and the computation amounts of the plurality of data sets are similar.

16. A graph data partitioning apparatus, comprising:

an acquisition unit configured to acquire a locality feature of graph data, the locality feature representing the degree of proximity between vertices;

a first dividing unit configured to divide vertices in the graph data into a plurality of data sets according to the locality feature;

a second dividing unit configured to divide edges in the graph data into the plurality of data sets; wherein the data sets are used by nodes in a distributed cluster to perform graph computation, and the computation amounts of the plurality of data sets are similar.

17. A computer device, comprising:

at least one processor;

a memory storing program instructions configured to be executed by the at least one processor, the program instructions comprising instructions for performing the method of any one of claims 1-14.

Technical Field

The embodiments of the present specification relate to the field of computer technology, and in particular to a graph data partitioning method and apparatus and a computer device.

Background

As a data structure, graph data has strong expressive power. In practical applications, business data with association relationships can be converted into graph data, and the graph data can be computed using a distributed cluster.

To do so, the graph data needs to be partitioned so that it can be distributed over the nodes of the distributed cluster.

Disclosure of Invention

The embodiments of the present specification provide a graph data partitioning method and apparatus and a computer device, so as to partition graph data.

In a first aspect of the embodiments of the present specification, a graph data partitioning method is provided, including:

dividing vertices in graph data into a plurality of data sets;

dividing each edge in the graph data into the data set in which the target vertex of the edge is located; wherein the data sets are used by nodes in a distributed cluster to perform graph computation, and the computation amounts of the plurality of data sets are similar.

In a second aspect of the embodiments of the present specification, a graph data partitioning method is provided, including:

acquiring a locality feature of graph data, wherein the locality feature represents the degree of proximity between vertices;

dividing vertices in the graph data into a plurality of data sets according to the locality feature;

dividing edges in the graph data into the plurality of data sets; wherein the data sets are used by nodes in a distributed cluster to perform graph computation, and the computation amounts of the plurality of data sets are similar.

In a third aspect of the embodiments of the present specification, a graph data partitioning apparatus is provided, including:

a first dividing unit configured to divide vertices in graph data into a plurality of data sets;

a second dividing unit configured to divide each edge in the graph data into the data set in which the target vertex of the edge is located; wherein the data sets are used by nodes in a distributed cluster to perform graph computation, and the computation amounts of the plurality of data sets are similar.

In a fourth aspect of the embodiments of the present specification, a graph data partitioning apparatus is provided, including:

an acquisition unit configured to acquire a locality feature of graph data, the locality feature representing the degree of proximity between vertices;

a first dividing unit configured to divide vertices in the graph data into a plurality of data sets according to the locality feature;

a second dividing unit configured to divide edges in the graph data into the plurality of data sets; wherein the data sets are used by nodes in a distributed cluster to perform graph computation, and the computation amounts of the plurality of data sets are similar.

In a fifth aspect of the embodiments of the present specification, a computer device is provided, comprising:

at least one processor;

a memory storing program instructions configured to be executed by the at least one processor, the program instructions comprising instructions for performing the method of the first or second aspect.

According to the technical solutions provided by the embodiments of the present specification, because the computation amounts of the data sets are similar, the load among the nodes in the distributed cluster can be balanced. In addition, because each edge in the graph data is divided into the data set in which its target vertex is located, the number of communications between nodes can be reduced, saving communication overhead. Likewise, dividing the vertices in the graph data into a plurality of data sets according to the locality feature reduces the number of communications between nodes and saves communication overhead.

Drawings

To describe the embodiments of the present specification or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below illustrate only some embodiments of the present specification; those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a diagram of graph data in an embodiment of the present specification;

FIG. 2 is a flow chart of a graph data partitioning method in an embodiment of the present specification;

FIG. 3 is a diagram of graph data in an embodiment of the present specification;

FIG. 4 is a flow chart of a graph data partitioning method in an embodiment of the present specification;

FIG. 5 is a schematic structural diagram of a graph data partitioning apparatus in an embodiment of the present specification;

FIG. 6 is a schematic structural diagram of a graph data partitioning apparatus in an embodiment of the present specification;

FIG. 7 is a schematic structural diagram of a computer device in an embodiment of the present specification.

Detailed Description

The technical solutions in the embodiments of the present specification will be described clearly and completely below with reference to the drawings. The described embodiments are only a part of the embodiments of the present specification, not all of them. All other embodiments obtained by a person skilled in the art based on these embodiments without creative effort fall within the scope of protection of the present specification.

Graph data is a data structure that includes vertices and edges, and may be directed or undirected. The edges of directed graph data are directed edges, each pointing from a source vertex to a target vertex; the edges of undirected graph data are undirected. For example, graph data may be represented as G = (V, E), where V is the vertex set containing the vertices of graph data G and E is the edge set containing the edges of graph data G. An edge in the edge set may be represented as e = (u, v), where u is the source vertex of edge e and v is its target vertex.
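The representation G = (V, E) described above can be illustrated with a minimal Python sketch; this is not code from the specification, and the class and field names are illustrative only:

```python
class Graph:
    """Directed graph G = (V, E): a vertex set V and a list of directed edges E."""

    def __init__(self):
        self.vertices = set()  # V: the vertex set
        self.edges = []        # E: list of edges e = (u, v)

    def add_edge(self, u, v):
        # u is the source vertex of the edge, v is its target vertex
        self.vertices.add(u)
        self.vertices.add(v)
        self.edges.append((u, v))

g = Graph()
g.add_edge("A1", "B1")
g.add_edge("A1", "C1")
print(len(g.vertices), len(g.edges))  # 3 2
```

A vertex with no incident edges could be added to `g.vertices` directly; the sketch only covers the edge-driven case used in the examples below.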

In practical application, business data with an association relationship can be converted into graph data. Specifically, business entities can be converted to vertices in graph data, and relationships between business entities can be converted to edges in graph data. For example, web pages may be converted into vertices in graph data, and link relationships between web pages may be converted into edges in graph data. As another example, accounts may be converted to vertices in the graph data and funds-transfer relationships between accounts may be converted to edges in the graph data.

Graph computation refers to computation performed on graph data. It is widely applied in fields such as social networks, recommendation systems, network security, text retrieval, and biomedicine. For example, web pages may be converted into vertices in graph data and the link relationships between web pages into edges. Algorithms such as PageRank can then be run on the graph data to obtain the importance of each web page, and pages with high importance can be ranked higher in Internet search results.

A distributed cluster may include a plurality of nodes, which may be computer devices. To facilitate graph computation on graph data using a distributed cluster, the graph data may be partitioned, that is, divided into a plurality of pieces of sub-graph data that can be distributed to the nodes of the cluster for computation. Graph data partitioning may take the following two factors into account. (1) The sizes of the pieces of sub-graph data should differ little, so that when they are distributed to the nodes of the cluster the computation amounts of the nodes are similar and the load among them is balanced. (2) The number of edges connecting different pieces of sub-graph data should be as small as possible, so that the nodes communicate less often, saving communication overhead and improving computation efficiency.

In the related art, a graph data partitioning method may include: hashing the vertices in the graph data into a plurality of data sets, and dividing each edge in the graph data into the data set in which the source vertex of the edge is located. Each data set can be understood as one piece of sub-graph data. However, hashing the vertices ignores the relationships among them, so many edges connect different data sets, which increases the number of communications between nodes. In addition, placing an edge in the data set of its source vertex also increases the number of communications. For example, a plurality of edges may share one source vertex and point to a plurality of target vertices. After hashing, the source vertex may reside on one node (hereinafter the first node) while the target vertices reside on another node (hereinafter the second node). In graph computation, the information of a target vertex is computed along an edge from the information of the source vertex. Since the edges and the source vertex are located on the first node, the first node must send the source vertex's information to the second node once for each such edge, so that the second node knows which target vertices to compute along which edges. The first node therefore transmits the same source vertex information many times, resulting in a large number of communications between the first node and the second node.
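The related-art scheme just described can be sketched in Python; this is an illustrative reconstruction, not code from the source, and the function names are assumptions:

```python
def hash_partition(ids, edges, P):
    """ids maps each vertex to its integer identifier; edges are (src, dst) pairs."""
    owner = {v: i % P for v, i in ids.items()}   # vertex -> data set index (hash by remainder)
    edge_home = {e: owner[e[0]] for e in edges}  # each edge follows its source vertex
    return owner, edge_home

ids = {"A1": 0, "B1": 1, "C1": 2, "D1": 3, "E1": 4, "F1": 5}
edges = [("A1", t) for t in ("B1", "C1", "D1", "E1", "F1")]
owner, edge_home = hash_partition(ids, edges, 3)
print(owner["A1"], owner["D1"])  # 0 0: identifiers 0 and 3 both leave remainder 0 mod 3
print(set(edge_home.values()))   # {0}: every edge stays with its source vertex A1
```

This reproduces the division used in the fig. 1 example below: vertices A1 and D1 land in data set S0, B1 and E1 in S1, C1 and F1 in S2, and all five edges in S0.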

The information of the vertex may include information of a service entity corresponding to the vertex. For example, the business entity corresponding to the vertex may be a web page, and the information of the vertex may include a probability that the web page is visited. As another example, the business entity corresponding to the vertex may be an account, and the information of the vertex may include a fund balance of the account.

For example, the graph data shown in fig. 1 may include vertex A1, vertex B1, vertex C1, vertex D1, vertex E1, vertex F1, edge e = (A1, B1), edge e = (A1, C1), edge e = (A1, D1), edge e = (A1, E1), and edge e = (A1, F1). The distributed cluster may include three nodes: node P0, node P1, and node P2. The identifiers of vertices A1, B1, C1, D1, E1, and F1 are 0, 1, 2, 3, 4, and 5, respectively. Since identifier 0 of vertex A1 and identifier 3 of vertex D1 leave a remainder of 0 when divided by 3, vertices A1 and D1 may be divided into data set S0 corresponding to node P0. Identifier 1 of vertex B1 and identifier 4 of vertex E1 leave a remainder of 1, so vertices B1 and E1 may be divided into data set S1 corresponding to node P1. Identifier 2 of vertex C1 and identifier 5 of vertex F1 leave a remainder of 2, so vertices C1 and F1 may be divided into data set S2 corresponding to node P2. The source vertex of edges e = (A1, B1), e = (A1, C1), e = (A1, D1), e = (A1, E1), and e = (A1, F1) is vertex A1, so these edges may all be divided into data set S0.

In performing graph computation, for edge e = (A1, B1), node P0 needs to send the information of vertex A1 to node P1; for edge e = (A1, C1), node P0 needs to send it to node P2; for edge e = (A1, E1), to node P1; and for edge e = (A1, F1), to node P2. Thus, node P0 needs to send the information of vertex A1 to node P1 twice and to node P2 twice, resulting in a large number of communications between node P0 and nodes P1 and P2.
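The message counts in this example can be reproduced with a short sketch; the owner mapping mirrors the hash division above, and the function name is illustrative:

```python
from collections import Counter

def count_messages(owner, edges):
    """Count per-edge messages when each edge is stored with its source vertex."""
    msgs = Counter()
    for src, dst in edges:
        if owner[src] != owner[dst]:
            # the node holding the edge and source vertex sends the source's
            # information to the node holding the target, once per such edge
            msgs[(owner[src], owner[dst])] += 1
    return msgs

owner = {"A1": 0, "B1": 1, "C1": 2, "D1": 0, "E1": 1, "F1": 2}
edges = [("A1", t) for t in ("B1", "C1", "D1", "E1", "F1")]
msgs = count_messages(owner, edges)
print(msgs[(0, 1)], msgs[(0, 2)])  # 2 2, matching the counts in the text
```

Edge (A1, D1) crosses no node boundary, so it generates no message; the remaining four edges produce the two-plus-two sends described above.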

Embodiments of the present description relate to an implementation environment that may include a graph data processing system.

In some embodiments, the graph data processing system may include distributed clusters. The distributed cluster may be used to partition graph data and may also be used to compute graph data.

A target node in the distributed cluster partitions the graph data into a plurality of data sets. Each data set may comprise vertices and/or edges and can thus be understood as sub-graph data. The target node may assign the data sets to the nodes of the distributed cluster, each node obtaining one or more data sets. Each node may perform graph computation directly on its data set. Alternatively, each node may further divide its data set into a plurality of subsets and perform graph computation on those subsets in parallel. Of course, the target node may also divide each data set into a plurality of subsets itself and assign them to a node, so that the node performs graph computation on the subsets in parallel.

Alternatively, the target node in the distributed cluster divides the vertices in the graph data into a plurality of vertex sets, each including one or more vertices. The identifiers in each vertex set constitute an identification set. The target node may assign the identification sets to the nodes of the distributed cluster. Each node may obtain one or more identification sets, read the corresponding vertices from the graph data, and read the edges that have those vertices as source or target vertices, thereby obtaining a data set comprising vertices and/or edges. Each node may perform graph computation directly on its data set, or may further divide the data set into a plurality of subsets and perform graph computation on them in parallel.
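The step in which a node materializes its data set from an identification set might look like the following sketch; the helper name is hypothetical, and vertices here double as their own identifiers for brevity:

```python
def build_data_set(id_set, graph_edges):
    """Materialize a node's data set: its assigned vertices plus incident edges."""
    vertices = set(id_set)
    # keep an edge if either endpoint belongs to this node's vertex set,
    # i.e. the vertex's out-edges (vertex is source) or in-edges (vertex is target)
    edges = [e for e in graph_edges if e[0] in vertices or e[1] in vertices]
    return vertices, edges

graph_edges = [("A2", "B2"), ("B2", "E2"), ("A2", "C2")]
vertices, edges = build_data_set({"B2"}, graph_edges)
print(sorted(edges))  # [('A2', 'B2'), ('B2', 'E2')]
```

A real system would read the vertices and edges from stored graph data rather than an in-memory list; the filter logic is the same.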

The target node may be selected from a distributed cluster. For example, a node may be randomly selected from the distributed cluster as the target node. As another example, the node with the strongest computing power may be selected from the distributed cluster as the target node.

The identifiers of the vertices may be predetermined. Alternatively, a locality feature of the graph data may be acquired, the locality feature representing the degree of proximity between vertices, and identifiers may be assigned to the vertices in the graph data according to that feature. The numbering order of the identifiers indicates the proximity between the vertices. The process of assigning identifiers to vertices in graph data is described in detail below.

In some embodiments, the graph data processing system may include a partitioning server and a distributed cluster.

The partitioning server partitions the graph data into a plurality of data sets, each of which may comprise vertices and/or edges and can thus be understood as sub-graph data. The partitioning server may distribute the data sets to the nodes of the distributed cluster, each node obtaining one or more data sets. Each node may perform graph computation directly on its data set, or may further divide it into a plurality of subsets and perform graph computation on them in parallel. Of course, the partitioning server may also divide each data set into a plurality of subsets itself and assign them to a node, so that the node performs graph computation on the subsets in parallel.

The embodiments of the present specification provide a graph data partitioning method.

The graph data partitioning method can be applied to a computer device, which may be a partitioning server or a node of a distributed cluster. Referring to fig. 2, the graph data partitioning method may include the following steps.

Step S21: vertices in graph data are partitioned into multiple datasets.

In some embodiments, the number of data sets may be determined from the number of nodes in the distributed cluster; it may be equal to, greater than, or less than that number. For example, the number of nodes in the distributed cluster may be used directly as the number of data sets, so that each data set corresponds to one node for graph computation. As another example, the number of nodes may be multiplied by 2 to obtain the number of data sets, so that every two data sets correspond to one node for graph computation.

In some embodiments, after step S21, each data set may include one or more vertices. In practical applications, the vertices in the graph data may be divided into the plurality of data sets randomly, or by hashing: for example, the remainder of dividing a vertex's identifier by P, the number of data sets, may be computed, and the vertex divided into the data set corresponding to that remainder. Alternatively, a locality feature of the graph data may be acquired, and the vertices in the graph data divided into the plurality of data sets according to that feature.

The graph data may be analyzed using a graph search algorithm to obtain its locality feature. Graph search algorithms include Breadth-First Search (BFS), Depth-First Search (DFS), and the like. The locality feature represents the proximity between vertices. The locality feature may include whether vertices are neighbors: if there is an edge between two vertices, that is, the two vertices are neighbor vertices, they are closer; if there is no edge between them, that is, they are not neighbor vertices, they are farther apart. Alternatively, the locality feature may include the shortest path between vertices: if the shortest path between two vertices is short, they are closer; if it is long, they are farther apart. Dividing the vertices in the graph data according to the locality feature reduces the number of edges connecting different data sets, which reduces the number of communications between nodes and saves communication overhead.
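As an illustration of using a graph search to obtain a locality ordering, the following sketch visits vertices in BFS order so that neighboring vertices receive nearby positions; the adjacency data and function name are assumptions, not from the source:

```python
from collections import deque

def bfs_order(adj, start):
    """Return vertices in breadth-first order from `start`; neighbors end up adjacent."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order

# adjacency of the fig. 3 example graph
adj = {"A2": ["B2", "C2", "D2"], "B2": ["E2"], "C2": ["F2", "G2"]}
print(bfs_order(adj, "A2"))  # ['A2', 'B2', 'C2', 'D2', 'E2', 'F2', 'G2']
```

Assigning each vertex its position in this order yields exactly the identifiers 0 through 6 used in the fig. 3 example below.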

Identifiers may be assigned to the vertices in the graph data according to the locality feature, and the vertices may then be divided into the plurality of data sets according to the numbering order of the identifiers. An identifier identifies a vertex and may be a number, a character, or a string composed of numbers and characters. The numbering sequence of the identifiers may be continuous or discontinuous, and its order indicates the proximity between vertices: vertices that are closer have identifiers closer together in the numbering order, and vertices that are farther apart have identifiers farther apart in the numbering order. The vertices in the graph data may be divided contiguously into the plurality of data sets in the numbering order of their identifiers. In this way, vertices within a data set are close to one another while vertices in different data sets are farther apart, which reduces the number of edges connecting different data sets.
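One simple way to divide vertices contiguously in identifier order is an equal-count split, sketched below. This is only an illustrative baseline under that assumption; the embodiments instead balance the split by computation amount, as described later:

```python
def contiguous_partition(ids, P):
    """ids: vertex -> integer identifier assigned by locality. Split into P
    contiguous, roughly equal-sized ranges so nearby identifiers share a set."""
    size = -(-len(ids) // P)  # ceiling division: vertices per data set
    return {v: i // size for v, i in ids.items()}

ids = {"A2": 0, "B2": 1, "C2": 2, "D2": 3, "E2": 4, "F2": 5, "G2": 6}
owner = contiguous_partition(ids, 3)
print(owner["A2"], owner["D2"], owner["G2"])  # 0 1 2
```

Note this equal-count split ({A2, B2, C2}, {D2, E2, F2}, {G2}) differs from the computation-amount-balanced split used in the fig. 3 example; both preserve the contiguity of identifiers.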

For example, the graph data shown in fig. 3 may include vertex A2, vertex B2, vertex C2, vertex D2, vertex E2, vertex F2, vertex G2, edge e = (A2, B2), edge e = (B2, E2), edge e = (A2, C2), edge e = (A2, D2), edge e = (C2, F2), and edge e = (C2, G2). In the graph data shown in fig. 3, vertex A2 is a neighbor of vertices B2, C2, and D2; vertex B2 is a neighbor of vertex E2; and vertex C2 is a neighbor of vertices F2 and G2. Accordingly, the identifiers 0, 1, 2, 3, 4, 5, and 6 may be assigned to vertices A2, B2, C2, D2, E2, F2, and G2, respectively, based on these neighbor relationships. The identifiers of these vertices are consecutive in numbering order. Further, identifier 0 of vertex A2 is close in the numbering order to identifier 1 of vertex B2, identifier 2 of vertex C2, and identifier 3 of vertex D2; identifier 1 of vertex B2 is close to identifier 4 of vertex E2; and identifier 2 of vertex C2 is close to identifier 5 of vertex F2 and identifier 6 of vertex G2.

The plurality of data sets may include data set S0, data set S1, and data set S2. In the numbering order of the identifiers, vertex A2 and vertex B2 may be divided into data set S0, vertex C2 and vertex D2 into data set S1, and vertex E2, vertex F2, and vertex G2 into data set S2. This reduces the number of edges connecting data set S0, data set S1, and data set S2.

Of course, the vertices in the graph data may also be divided into a plurality of data sets directly according to the locality feature. For example, they may be divided directly according to whether vertices are neighbors, or directly according to the shortest paths between vertices.

In some embodiments, in order to balance the load among the nodes in the distributed cluster, the computation amounts of the plurality of data sets may be similar. The computation amount of a data set can be understood as the workload of a node performing graph computation on that data set. "Similar computation amounts" may mean that the computation amounts are equal, or that the differences between them are within a preset range.

A computation amount reference value may be determined. In the process of dividing the vertices, the computation amount of each data set can be tracked so that it approximates the reference value. Approximating the reference value may mean that the computation amount of the data set equals the reference value, or that the difference between them falls within a preset range.

The computation amount reference value may be determined based on the number of vertices in the graph data and the number of data sets. For example, the reference value may be determined according to the formula V/P, where V represents the number of vertices in the graph data and P represents the number of data sets. Accordingly, the number of vertices in a data set may be counted as the computation amount of that data set. Alternatively, the number of edges of each vertex may be taken into account; it may include the number of in-edges and/or the number of out-edges of the vertex. An in-edge of a vertex is an edge with the vertex as its target vertex, and an out-edge is an edge with the vertex as its source vertex. The numbers of edges of different vertices may differ significantly. To evaluate the computation amount of a data set more accurately and improve the load balancing effect, the reference value may therefore be determined according to the number of vertices in the graph data, the numbers of edges of the vertices, and the number of data sets. For example, the reference value may be determined according to the formula (V + ΣEin + ΣEout)/P, where V represents the number of vertices in the graph data, ΣEin represents the sum of the numbers of in-edges of the vertices, ΣEout represents the sum of the numbers of out-edges of the vertices, and P represents the number of data sets. Accordingly, the computation amount of a data set may be determined from the number of vertices in the data set and the numbers of edges of those vertices, for example using the formula N + ΣNin + ΣNout, where N represents the number of vertices in the data set, ΣNin represents the sum of the numbers of in-edges of those vertices, and ΣNout represents the sum of the numbers of out-edges of those vertices.
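The reference value and the per-set computation amount can be sketched as follows. This is a hypothetical reading of the formulas (reference value = (vertex count + total in-edges + total out-edges) / number of data sets); the function names and the example degree tables are assumptions, using the fig. 3 graph:

```python
def reference_value(num_vertices, in_degrees, out_degrees, num_datasets):
    """Computation amount reference value: (V + sum(Ein) + sum(Eout)) / P."""
    return (num_vertices + sum(in_degrees) + sum(out_degrees)) / num_datasets

def dataset_load(dataset_vertices, in_deg, out_deg):
    """Computation amount of one data set: N plus the in- and out-edge
    counts of the vertices it contains."""
    return (len(dataset_vertices)
            + sum(in_deg[v] for v in dataset_vertices)
            + sum(out_deg[v] for v in dataset_vertices))

# Degrees for the fig. 3 graph: A2 has three out-edges, B2 one, C2 two;
# every vertex except A2 has exactly one in-edge.
in_deg = {"A2": 0, "B2": 1, "C2": 1, "D2": 1, "E2": 1, "F2": 1, "G2": 1}
out_deg = {"A2": 3, "B2": 1, "C2": 2, "D2": 0, "E2": 0, "F2": 0, "G2": 0}
ref = reference_value(7, in_deg.values(), out_deg.values(), 3)  # (7 + 6 + 6) / 3
```

For the set {E2, F2, G2}, for instance, the load is 3 vertices + 3 in-edges + 0 out-edges = 6, close to the reference value of 19/3.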

In practical applications, the plurality of data sets may be divided in a serial manner. Specifically, vertices in the graph data may be added to one data set while its computation amount is tracked, until the computation amount of that data set approximates the reference value. Vertices may then be added to the next data set in the same way, and so on, iterating until the division is complete.
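The serial division described above can be sketched as a greedy fill: keep adding vertices, in identifier order, to the current data set until its computation amount reaches the reference value, then start the next set. The function name and the per-vertex load values (taken here as 1 + in-edges + out-edges for the fig. 3 graph) are illustrative assumptions:

```python
def serial_partition(ordered_vertices, loads, ref, num_datasets):
    """Greedily fill data sets one after another until each set's
    computation amount approximates the reference value."""
    datasets, current, acc = [], [], 0.0
    for v in ordered_vertices:
        current.append(v)
        acc += loads[v]
        # Close this set once it reaches the reference value,
        # keeping the last set open for the remaining vertices.
        if acc >= ref and len(datasets) < num_datasets - 1:
            datasets.append(current)
            current, acc = [], 0.0
    datasets.append(current)
    return datasets

# Per-vertex load = 1 + in-edges + out-edges, for the fig. 3 graph.
loads = {"A2": 4, "B2": 3, "C2": 4, "D2": 2, "E2": 2, "F2": 2, "G2": 2}
parts = serial_partition(list(loads), loads, ref=19 / 3, num_datasets=3)
```

With these loads the greedy fill produces [A2, B2], [C2, D2, E2], [F2, G2]; the exact split depends on the load model and the reference value chosen.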

Alternatively, in the process of dividing the vertices, the computation amount of each data set may be determined and the computation amounts compared so that they remain similar. In practical applications, the plurality of data sets may also be divided in a parallel manner. Specifically, a batch of vertices in the graph data may be divided among the plurality of data sets, the computation amount of each data set determined, and the computation amounts compared so that they remain similar; then the next batch of vertices is divided in the same way. Iterating in this manner completes the division of the vertices.

Step S23: and dividing the edges in the graph data into data sets in which target vertexes of the edges are positioned.

In some embodiments, after step S23 each data set may include both vertices and edges, so each data set may be understood as a subgraph of the graph data. Because the edges in the graph data are divided into the data sets where their target vertices are located, during graph computation a node needs to send the information of a vertex to another node in the distributed cluster at most once, rather than many times, which reduces the number of communications between nodes and saves communication overhead. For example, a plurality of edges may share one source vertex and point to a plurality of target vertices. After the vertices are divided into multiple data sets, the source vertex may be located at one node (hereinafter the first node) while the target vertices are located at another node (hereinafter the second node). In graph computation, the information of a target vertex is computed along an edge from the information of the source vertex. Since the edges and their target vertices are all located at the second node, the first node only needs to send the information of the source vertex to the second node once, and the second node can then compute the information of all the target vertices along the plurality of edges. The number of communications between the first node and the second node is thus reduced.

Taking the graph data shown in fig. 1 as an example, the target vertices of edge e = (A1, B1), edge e = (A1, C1), edge e = (A1, D1), edge e = (A1, E1), and edge e = (A1, F1) are vertex B1, vertex C1, vertex D1, vertex E1, and vertex F1, respectively. Therefore, edge e = (A1, B1) and edge e = (A1, E1) may be divided into data set S1, edge e = (A1, C1) and edge e = (A1, F1) into data set S2, and edge e = (A1, D1) into data set S0.
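A minimal sketch of step S23, assuming a vertex-to-data-set mapping consistent with the fig. 1 example (B1 and E1 in S1, C1 and F1 in S2, D1 in S0); the function and variable names are illustrative:

```python
def divide_edges(edges, vertex_dataset):
    """Place each edge (src, dst) in the data set holding its target vertex dst."""
    edge_sets = {}
    for src, dst in edges:
        edge_sets.setdefault(vertex_dataset[dst], []).append((src, dst))
    return edge_sets

# Assumed placement of the fig. 1 vertices.
vertex_dataset = {"A1": "S0", "B1": "S1", "C1": "S2",
                  "D1": "S0", "E1": "S1", "F1": "S2"}
edges = [("A1", "B1"), ("A1", "C1"), ("A1", "D1"),
         ("A1", "E1"), ("A1", "F1")]
edge_sets = divide_edges(edges, vertex_dataset)
```

This reproduces the division above: S1 receives the edges to B1 and E1, S2 the edges to C1 and F1, and S0 the edge to D1.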

In performing the graph calculation, node P0 may send the information of vertex A1 to node P1 only once and to node P2 only once, so that the number of communications between node P0 and nodes P1 and P2 may be reduced.

In some embodiments, for each edge in the graph data, a dataset may be obtained in which a target vertex of the edge is located; the edge may be partitioned into the dataset in which the target vertex is located. Or, in order to improve the dividing efficiency of the edges, a table may be further constructed, where each row and each column of the table respectively correspond to one data set; for each edge in the graph data, determining a target row in the table according to a source vertex of the edge, determining a target column in the table according to a target vertex of the edge, and dividing the edge into cells defined by the target row and the target column; the edges in each column of cells may be divided into the data sets corresponding to that column.

The number of rows and columns of the table may be equal to the number of data sets. The target row may be a row corresponding to the data set where the source vertex is located. The target column may be a column corresponding to the data set where the target vertex is located. The cells defined by the target rows and the target columns may be cells in rows of the target rows and columns of the target columns.
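The table construction can be sketched as a P x P grid of cells: an edge lands in the cell whose row corresponds to the source vertex's data set and whose column corresponds to the target vertex's data set, and every edge in a column then goes to that column's data set. The helper names are assumptions; the example uses the fig. 3 graph:

```python
def build_table(edges, vertex_dataset, dataset_names):
    """P x P table of cells: row = data set of the source vertex,
    column = data set of the target vertex."""
    idx = {name: i for i, name in enumerate(dataset_names)}
    p = len(dataset_names)
    table = [[[] for _ in range(p)] for _ in range(p)]
    for src, dst in edges:
        table[idx[vertex_dataset[src]]][idx[vertex_dataset[dst]]].append((src, dst))
    return table

def edges_per_dataset(table, dataset_names):
    """Collect each column's edges: they belong to that column's data set."""
    p = len(dataset_names)
    return {dataset_names[c]: [e for r in range(p) for e in table[r][c]]
            for c in range(p)}

# Fig. 3 example: A2, B2 in S0; C2, D2 in S1; E2, F2, G2 in S2.
vd = {"A2": "S0", "B2": "S0", "C2": "S1", "D2": "S1",
      "E2": "S2", "F2": "S2", "G2": "S2"}
es = [("A2", "B2"), ("B2", "E2"), ("A2", "C2"),
      ("A2", "D2"), ("C2", "F2"), ("C2", "G2")]
names = ["S0", "S1", "S2"]
columns = edges_per_dataset(build_table(es, vd, names), names)
```

The column-wise collection matches the division stated for the fig. 3 example: S0 receives e = (A2, B2), S1 receives e = (A2, C2) and e = (A2, D2), and S2 receives the remaining three edges.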

Taking the graph data shown in fig. 3 as an example, a table as shown in table 1 below can be constructed.

TABLE 1

                data set S0     data set S1                     data set S2
data set S0     e = (A2, B2)    e = (A2, C2), e = (A2, D2)      e = (B2, E2)
data set S1                                                     e = (C2, F2), e = (C2, G2)
data set S2

With table 1, edge e = (A2, B2) may be divided into data set S0; edge e = (A2, C2) and edge e = (A2, D2) into data set S1; and edge e = (B2, E2), edge e = (C2, F2), and edge e = (C2, G2) into data set S2.

In some embodiments, a data set may further be divided into a plurality of subsets for graph computation by one node in the distributed cluster. The node may then perform the computation in a parallel manner over the plurality of subsets. In this way, random reads and writes against the data set are confined to a smaller subset, reducing the resource overhead caused by wide-range random access. Moreover, steps S21-S23 realize the division of the graph data among the nodes, while dividing a data set into a plurality of subsets realizes the division of the data within a node. Computation efficiency can therefore be improved through this two-layer division, between nodes and within a node.

The number of subsets may be determined according to the number of threads of the node corresponding to the data set, and may be equal to, greater than, or less than the number of threads. For example, the formula 2·T1 may be used to calculate the number of subsets, where T1 represents the number of threads of the node. In this way, the node may perform graph computation in a multi-threaded manner based on the plurality of subsets. Alternatively, the number of subsets may be determined according to the number of processes of the node corresponding to the data set, and may be equal to, greater than, or less than the number of processes. For example, the formula 2·T2 may be used to calculate the number of subsets, where T2 represents the number of processes of the node. In this way, the node may perform graph computation in a multi-process manner based on the plurality of subsets.
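A sketch of the subset sizing: twice the thread (or process) count leaves work stealing some slack. The round-robin split below is an illustrative simplification of my own; the specification divides subsets by computation amount, similarly to the data-set division:

```python
def subset_count(num_workers):
    """Number of subsets per data set: 2 * T, where T is the node's
    thread (or process) count."""
    return 2 * num_workers

def split_into_subsets(dataset_vertices, num_subsets):
    """Round-robin split of a data set's vertices into subsets
    (a simplification; a load-based split would track computation amounts)."""
    subsets = [[] for _ in range(num_subsets)]
    for i, v in enumerate(dataset_vertices):
        subsets[i % num_subsets].append(v)
    return subsets
```

For example, a node with 8 threads would use 16 subsets, and a 10-vertex data set split 4 ways yields subsets of sizes 3, 3, 2, and 2.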

Each data set may be divided into subsets in a manner similar to the division of the graph data into data sets. The computation amounts of the subsets are similar, so that load balancing between threads or between processes can be achieved.

In practical application, a node may perform graph computation by using a work stealing algorithm (work stealing) to achieve load balancing between threads or between processes. Of course, the nodes may also perform graph calculation by using other algorithms, which are not described herein again.

The graph data dividing method in the embodiments of the present specification may divide the vertices in graph data into a plurality of data sets, and divide each edge in the graph data into the data set in which the target vertex of the edge is located. The computation amounts of the multiple data sets are similar, so the load among the nodes in the distributed cluster can be balanced. In addition, dividing the edges into the data sets where their target vertices are located reduces the number of communications between nodes and saves communication overhead.

The embodiment of the specification provides a graph data dividing method.

The graph data dividing method can be applied to computer equipment. The computer device may include a partitioned server, a distributed cluster. Referring to fig. 4, the graph data partitioning method may include the following steps.

Step S41: a locality feature of the graph data is obtained, the locality feature being used to indicate the degree of proximity between vertices.

Step S43: vertices in the graph data are partitioned into multiple datasets according to locality characteristics.

Step S45: dividing edges in graph data into the plurality of data sets; the data sets are used for nodes in the distributed cluster to perform graph calculation, and the calculation amount of the data sets is similar.

The related descriptions of step S41-step S43 may be referred to the corresponding embodiment of fig. 2, and are not repeated herein.

In step S45, the edges in the graph data may be divided into the data sets in which the target vertices of the edges are located. Alternatively, the edges in the graph data may be divided into the data sets in which the source vertices of the edges are located. The embodiments of this specification are not limited in this respect.

According to the technical solution provided by the embodiments of this specification, a locality feature of the graph data can be acquired, the locality feature indicating the degree of proximity between vertices; the vertices in the graph data can be divided into a plurality of data sets according to the locality feature; and the edges in the graph data can be divided into the plurality of data sets. The computation amounts of the multiple data sets are similar, so the load among the nodes in the distributed cluster can be balanced. In addition, because the vertices are divided into the data sets according to the locality feature, the number of communications between nodes can be reduced and communication overhead saved.

The embodiment of the specification also provides a graph data dividing device. The graph data dividing device can be applied to dividing servers, distributed clusters or nodes in the distributed clusters. Referring to fig. 5, the graph data dividing apparatus includes the following units.

A first dividing unit 51 for dividing vertices in the graph data into a plurality of data sets;

a second dividing unit 53, configured to divide an edge in the graph data into data sets in which target vertices of the edge are located; the data sets are used for nodes in the distributed cluster to perform graph calculation, and the calculation amount of the data sets is similar.

The embodiment of the specification also provides a graph data dividing device. The graph data dividing device can be applied to dividing servers, distributed clusters or nodes in the distributed clusters. Referring to fig. 6, the graph data dividing apparatus includes the following elements.

An acquisition unit 61 configured to acquire a local feature of the graph data, the local feature being used to indicate a degree of proximity between vertices;

a first dividing unit 63 configured to divide vertices in the graph data into a plurality of data sets according to locality characteristics;

a second dividing unit 65 configured to divide edges in the graph data into the plurality of data sets; the data sets are used for nodes in the distributed cluster to perform graph calculation, and the calculation amount of the data sets is similar.

One embodiment of a computer apparatus of the present specification is described below. Fig. 7 is a hardware configuration diagram of the computer device in this embodiment. As shown in fig. 7, the computer device may include one or more processors (only one of which is shown), memory, and a transmission module. Of course, those skilled in the art will appreciate that the hardware configuration shown in fig. 7 is only an illustration, and does not limit the hardware configuration of the computer device. In practice the computer device may also comprise more or fewer component elements than those shown in fig. 7; or have a different configuration than that shown in fig. 7.

The memory may comprise high speed random access memory; alternatively, non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory may also be included. Of course, the memory may also comprise a remotely located network memory. The memory may be used to store program instructions or modules of application software, such as the program instructions or modules of the embodiments corresponding to fig. 2 or fig. 4 in this specification.

The processor may be implemented in any suitable way. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The processor may read and execute the program instructions or modules in the memory.

The transmission module may be used for data transmission via a network, for example via a network such as the internet, an intranet, a local area network, a mobile communication network, etc.

This specification also provides one embodiment of a computer storage medium. The computer storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk (HDD), a Memory Card (Memory Card), and the like. The computer storage medium stores computer program instructions. The computer program instructions when executed implement: the program instructions or modules of the embodiments corresponding to fig. 2 or fig. 4 in this specification.

It should be noted that, in the present specification, each embodiment is described in a progressive manner, and the same or similar parts in each embodiment may be referred to each other, and each embodiment focuses on differences from other embodiments. In particular, apparatus embodiments, computer device embodiments, and computer storage medium embodiments are substantially similar to method embodiments and therefore are described with relative ease, as appropriate with reference to the partial description of the method embodiments. In addition, it is understood that one skilled in the art, after reading this specification document, may conceive of any combination of some or all of the embodiments listed in this specification without the need for inventive faculty, which combinations are also within the scope of the disclosure and protection of this specification.

In the 1990s, an improvement in a technology could clearly be distinguished as an improvement in hardware (for example, an improvement in a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement in a method flow). However, as technology advances, many of today's improvements in method flows can be regarded as direct improvements in hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Thus, it cannot be said that an improvement in a method flow cannot be realized with hardware physical modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a PLD by programming it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, such programming is now mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), with VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog currently the most commonly used. Those skilled in the art will also appreciate that a hardware circuit implementing a logical method flow can easily be obtained merely by slightly programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

From the above description of the embodiments, it is clear to those skilled in the art that the present specification can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the present specification may be essentially or partially implemented in the form of software products, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and include instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.

The description is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

While the specification has been described with examples, those skilled in the art will appreciate that there are numerous variations and permutations of the specification that do not depart from the spirit of the specification, and it is intended that the appended claims include such variations and modifications that do not depart from the spirit of the specification.
