Method and device for partitioning nodes of computation graph and computer-readable storage medium

Document No.: 153087  Publication date: 2021-10-26

Abstract: The present disclosure relates to a method and device for partitioning nodes of a computational graph, and a computer-readable storage medium (filed 2020-04-26; inventor not disclosed). Nodes or operators that are supported by a machine learning processor and can be fused together are divided into corresponding subgraphs. After fusion, low-level optimization can be carried out over the whole subgraph, which avoids the problems that computing each node or operator one by one incurs a large input/output overhead and that the lower layer cannot perform context-aware optimization.

1. A method of partitioning nodes of a computational graph, comprising:

dividing nodes which support a machine learning processor and can be fused together into corresponding subgraphs;

separately processing nodes that do not support a machine learning processor.

2. The node partitioning method of claim 1, wherein said dividing nodes that support a machine learning processor and can be fused together into corresponding subgraphs comprises:

dividing the nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph.

3. The node partitioning method of claim 2, wherein said dividing nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph comprises:

taking a node without a predecessor node as a starting node, traversing a successor node of the starting node along a directed edge, and dividing the starting node and the successor node into the same subgraph when no other path returns to the starting node during the reverse traversal from the successor node to the starting node.

4. The node partitioning method of claim 3, wherein said dividing nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph further comprises:

fusing the subgraph into a new starting node, traversing a successor node of the new starting node along a directed edge starting from the new starting node, dividing the new starting node and its successor node into a new subgraph when no other path returns to the new starting node during the reverse traversal from that successor node to the new starting node, and repeating these steps until all nodes that support the machine learning processor and can be fused together are divided into the same maximum weakly connected convex subgraph.

5. The node partitioning method of claim 4, wherein said dividing nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph comprises:

taking a node without a predecessor node as a starting node, traversing a successor node of the starting node along a directed edge, and ending the access from the starting node to the successor node when another path returns to the starting node during the reverse traversal from the successor node to the starting node.

6. The node partitioning method of claim 5, wherein said dividing nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph comprises:

taking the successor node as a new starting node, and traversing a successor node of the new starting node along a directed edge starting from the new starting node;

dividing the new starting node and its successor node into the same subgraph when no other path returns to the new starting node during the reverse traversal from that successor node to the new starting node; and

ending the access from the new starting node to its successor node when another path returns to the new starting node during the reverse traversal from that successor node to the new starting node.

7. The node partitioning method according to claim 6, wherein the machine learning processor comprises one or more machine learning processors, and nodes corresponding to the same kind of machine learning processor can be fused in the same subgraph.

8. The node partitioning method according to any one of claims 1 to 7, wherein said separately processing nodes that do not support a machine learning processor comprises:

processing the nodes that do not support a machine learning processor by a central processor, a speech processor, a programmable logic processor, and/or an image processor.

9. The node partitioning method of any one of claims 1 to 7, wherein the subgraph and the maximum weakly connected convex subgraph are directed graphs.

10. The node partitioning method according to any one of claims 1 to 7, wherein the node partitioning method is applied to a PyTorch framework.

11. The node partitioning method according to any one of claims 1 to 7, wherein partitioning results of the node partitioning method are stored in a union-find (disjoint-set) structure, an ordinary set, or a linked list.

12. An apparatus for partitioning nodes of a computational graph, comprising:

a processor configured to execute program instructions; and

a memory configured to store the program instructions, which when loaded and executed by the processor, cause the apparatus to perform the method of any of claims 1-11.

13. A computer readable storage medium having stored therein program instructions adapted to be loaded by a processor and to perform the method according to any of claims 1-11.

Technical Field

The present disclosure relates to the field of computers, and more particularly, to a method, an apparatus, and a computer-readable storage medium for partitioning nodes of a computational graph.

Background

PyTorch is a popular deep learning framework. A deep learning network may contain operators that the machine learning processor does not support, so a complete network model cannot be guaranteed to run on the machine learning processor. In prior-art computational graphs, each node or operator is computed one by one, which causes high input/output overhead, and the lower layer cannot perform context-aware optimization.

Under such circumstances, there is a strong need in the art to solve at least some of the technical problems described above.

Disclosure of Invention

The inventors of the present disclosure have recognized that, in computational graphs, network computation speed is limited if fused-computation optimization is not performed. Moreover, if the fusion mode of the machine learning processor is invoked for a PyTorch network, the network cannot be fused into a single fused node or operator, because nodes or operators that the machine learning processor does not support exist in the network.

This contradiction in the prior art means that an algorithm is needed to select the sets of nodes or operators in a network that can be fused, each such set fusing a plurality of nodes or operators.

According to a first aspect of the present disclosure, there is provided a node partitioning method of a computational graph, which may include:

dividing nodes which support a machine learning processor and can be fused together into corresponding subgraphs; and separately processing nodes that do not support a machine learning processor.

According to a second aspect of the present disclosure, there is provided a node partitioning apparatus of a computational graph, which may include: a processor configured to execute program instructions; and a memory configured to store program instructions that, when loaded and executed by the processor, cause the apparatus to perform the method described above.

According to a third aspect of the present disclosure, there is provided a computer readable storage medium having stored therein program instructions adapted to be loaded by a processor and to perform the above-mentioned method.

By means of the above technical solution, nodes or operators that support the machine learning processor and can be fused together can be divided into corresponding subgraphs. After fusion, low-level optimization can be carried out over the whole subgraph rather than node by node; computing each node (operator) one by one is avoided, input/output (IO) overhead is reduced, and the lower layer can perform context-aware optimization.

Drawings

The above-described features of the present disclosure may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings, wherein like reference numerals refer to like elements and in which:

FIG. 1 is a schematic diagram illustrating a node partitioning method of a computational graph according to one embodiment of the present disclosure;

FIG. 2 is a schematic diagram illustrating a prior art node computation method;

FIG. 3A is a schematic diagram illustrating a node partitioning method of a computational graph according to one embodiment of the present disclosure;

FIG. 3B is a schematic diagram illustrating a node partitioning method of a computational graph according to one embodiment of the present disclosure;

FIG. 3C is a schematic diagram illustrating a node partitioning method of a computational graph according to one embodiment of the present disclosure;

FIG. 3D is a schematic diagram illustrating a node partitioning method of a computational graph according to one embodiment of the present disclosure;

FIG. 4A is a schematic diagram illustrating a node partitioning method of a computational graph according to one embodiment of the present disclosure;

FIG. 4B is a schematic diagram illustrating a node partitioning method of a computational graph according to one embodiment of the present disclosure;

FIG. 4C is a schematic diagram illustrating a node partitioning method of a computational graph according to one embodiment of the present disclosure;

FIG. 4D is a schematic diagram illustrating a node partitioning method of a computational graph according to one embodiment of the present disclosure;

FIG. 5 schematically illustrates a block diagram of a combined processing device according to one embodiment of the present disclosure; and

FIG. 6 schematically illustrates a structural diagram of a board card according to an embodiment of the present disclosure.

Detailed Description

With respect to the terms "operator" and "node" mentioned in the embodiments of the present disclosure, it should be noted that "operator" is used at the computational level (the software or algorithm level), while "node" is a more visual expression (at the graph or more intuitive level); the two terms refer to the same thing viewed from different sides. That is, in the embodiments of the present disclosure, the terms "operator" and "node" may be considered to have the same meaning and are merely described from different perspectives.

The term "maximum weakly connected convex subgraph" appearing in embodiments of the present disclosure aims to partition artwork with as few weakly connected convex subgraphs as possible. The definition of "weak connectivity" is that if all directed edges of a directed graph are replaced by undirected edges, the resulting undirected graph is a connected graph, and the directed graph has weak connectivity. A "convex subgraph" is defined as a convex subgraph if, for any two nodes in the subgraph, any path between them passes through only the nodes in the subgraph and not through other nodes in the original graph.

The PyTorch network is run layer by layer, each "operator" is regarded as a "node" of the computational graph, and the direction of data flow is regarded as the directed edges. The whole static directed acyclic computational graph formed by these directed edges (in PyTorch, a C++ representation of it can be obtained at the lower layer) is traversed to obtain the running-device information of each node, i.e., whether the device running the node supports a machine learning processor, and, if so, which kind of machine learning processor it is.
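
As an illustration of the kind of per-node device information collected here, the sketch below uses a minimal, made-up node model rather than the real PyTorch/C++ graph representation; the Node class, the MLU_SUPPORTED_OPS set, and all names are assumptions of this sketch. The example graph mirrors FIG. 2 (E -> A -> {D, B} -> C, with D unsupported).

    from dataclasses import dataclass, field

    # Hypothetical set of operator types the machine learning processor (MLU) supports.
    MLU_SUPPORTED_OPS = {"conv2d", "relu", "batch_norm", "add", "pool"}

    @dataclass
    class Node:
        name: str
        op_type: str
        successors: list = field(default_factory=list)  # directed edges: data flows to these nodes

    def collect_device_info(nodes):
        # Tag every node with whether it can run on the machine learning processor.
        return {n.name: (n.op_type in MLU_SUPPORTED_OPS) for n in nodes}

    e = Node("E", "conv2d"); a = Node("A", "relu")
    d = Node("D", "custom_op"); b = Node("B", "add"); c = Node("C", "pool")
    e.successors = [a]; a.successors = [d, b]; d.successors = [c]; b.successors = [c]

    print(collect_device_info([e, a, d, b, c]))
    # {'E': True, 'A': True, 'D': False, 'B': True, 'C': True}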

Node

FIG. 1 is a schematic diagram illustrating a node partitioning method 100 of a computational graph according to one embodiment of the present disclosure. The node partitioning method 100 may include the following steps, which may, for example, be performed by a processor:

Step 102: dividing nodes that support a machine learning processor and can be fused together into corresponding subgraphs. The subgraphs for fusion are divided according to the above information (the running-device information of each node). A divided subgraph supports the machine learning processor and its computation can be fused on the premise that: (1) for any two nodes A and B in the fused subgraph, no computation path from A to B passes through nodes outside the subgraph; and (2) the fused subgraph is weakly connected, that is, if its directed edges are replaced with undirected edges, the resulting graph is connected.

Step 104: separately processing the nodes that do not support the machine learning processor.

In one embodiment of the present disclosure, the machine learning processor may comprise one or more machine learning processors. For example, the machine learning processor may include a first machine learning processor and a second machine learning processor, and so on.

In one embodiment of the present disclosure, step 102, dividing nodes that support a machine learning processor and can be fused together into corresponding subgraphs, may comprise:

Step 106: dividing the nodes that support the machine learning processor and can be fused together into corresponding maximum weakly connected convex subgraphs.

In one embodiment of the present disclosure, step 106, dividing the nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph, may include:

Step 108: taking a node without a predecessor node as a starting node, traversing the successor nodes of the starting node, and dividing the starting node and a successor node into the same subgraph when no other path returns to the starting node during the reverse traversal from that successor node to the starting node. Because fusion proceeds step by step, several successor nodes may be encountered while fusing from the starting node toward its successors.

Optionally, the starting node has one or more successor nodes.

Optionally, when the starting node has a plurality of successor nodes, the successor nodes may be traversed in the logical execution order of their corresponding operators, or in the reverse of that order; this disclosure does not limit the traversal order.

Optionally, the reverse traversal from the successor node to the starting node is a reverse depth-first traversal.

Optionally, dividing the starting node and a successor node into the same subgraph when no other path returns to the starting node during the reverse traversal means the following: each successor node of the starting node is visited in turn, a reverse traversal is performed from each visited successor node, and if there is no other path from that successor node back to the starting node, the successor node and the starting node are divided into the same subgraph. In other words, each successor node is judged separately, and in this case the reverse path from the successor node to the starting node is unique.
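
The check just described, namely whether some other path leads from the successor node back to the starting node, can be carried out as a reverse depth-first traversal that ignores the direct edge just followed. Below is a minimal sketch under that assumption; the predecessor map and the function name are illustrative, not taken from the disclosure.

    def has_other_path_back(start, successor, predecessors):
        # Reverse depth-first traversal from `successor` toward `start`,
        # ignoring the direct edge start -> successor. Returns True if some
        # other path reaches `start`, in which case the two nodes must not be
        # fused into the same subgraph (convexity would be violated).
        stack = [p for p in predecessors.get(successor, ()) if p != start]
        seen = set()
        while stack:
            node = stack.pop()
            if node == start:
                return True       # another path returns to the starting node
            if node in seen:
                continue
            seen.add(node)
            stack.extend(predecessors.get(node, ()))
        return False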

In one embodiment of the present disclosure, step 106, dividing the nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph, further includes:

Step 110: fusing the subgraph into a new starting node, traversing the successor nodes of the new starting node along directed edges from the new starting node, dividing the new starting node and a successor node of the new starting node into a new subgraph when no other path returns to the new starting node during the reverse traversal from that successor node to the new starting node, and repeating these steps until all nodes that support the machine learning processor and can be fused together are divided into the same maximum weakly connected convex subgraph.

In one embodiment of the present disclosure, step 106, dividing the nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph, may include:

Step 112: taking a node without a predecessor node as a starting node, traversing the successor nodes of the starting node along directed edges, and ending the access from the starting node to a successor node when another path returns to the starting node during the reverse traversal from that successor node to the starting node. It should be noted that this means the following: each successor node of the starting node is visited in turn, a reverse traversal is performed from each visited successor node, and if another path leads from that successor node back to the starting node, the access from the starting node to that successor node ends. In other words, each successor node is judged separately; "another path returns to the starting node" means that the reverse traversal from the successor node to the starting node passes through other nodes, i.e., the path from the successor node back to the starting node is not unique. The case in which this path is not unique is described in detail in Example 1 and Example 2 below and is not repeated here.

In one embodiment of the present disclosure, step 106, dividing the nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph, may include:

Step 114: taking the successor node as a new starting node, and traversing the successor nodes of the new starting node along the directed edges starting from it; dividing the new starting node and a successor node of the new starting node into the same subgraph when no other path returns to the new starting node during the reverse traversal from that successor node to the new starting node; and ending the access from the new starting node to that successor node when another path returns to the new starting node during that reverse traversal.

FIG. 2 is a schematic diagram illustrating a node computation method in the prior art. For convenience of description, nodes that support the machine learning processor are indicated by a solid circle "●" and nodes that do not support it by an open circle "○" in FIG. 2, FIGS. 3A to 3D, and FIGS. 4A to 4D. In FIG. 2, node E is a node without a predecessor, node A is a successor of node E, nodes D and B are successors of node A, and node C is a successor of nodes D and B. In the prior-art computation method, node E is computed first, then node A along directed edge EA; two directed edges AD and AB then start from node A, along which node D and node B are computed respectively, and finally node C is computed. In the prior art the input/output (IO) overhead is relatively high and the computation efficiency is relatively low. For example, the output of node E is the input of node A, the output of node A is the input of nodes D and B, and the outputs of nodes D and B are the inputs of node C; node D is a node that does not support the machine learning processor. When the computation runs along directed edge AD to node D, because node D does not support the machine learning processor, the data coming from node A needs to be copied to, for example, a central processing unit for processing. After the central processing unit finishes processing, the processed data is transferred (or copied) from node D to node C along directed edge DC.

In the prior art, fusing nodes such as node E and node A to improve computation efficiency and reduce IO overhead is not considered.

According to the concepts of the present disclosure, the partitioning results may be recorded using a union-find (disjoint-set) structure, or another similar data structure that can record which nodes belong to the same group.
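
The union-find (disjoint-set) structure mentioned here can be sketched as below. This is the standard textbook structure with path compression, shown only to illustrate how fused groups might be recorded; it is not code from the disclosure.

    class UnionFind:
        # Disjoint-set structure recording which nodes belong to the same fused subgraph.
        def __init__(self):
            self.parent = {}

        def find(self, x):
            self.parent.setdefault(x, x)
            root = x
            while self.parent[root] != root:
                root = self.parent[root]
            # Path compression: point every visited node directly at the root.
            while self.parent[x] != root:
                self.parent[x], x = root, self.parent[x]
            return root

        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

        def groups(self):
            out = {}
            for x in list(self.parent):
                out.setdefault(self.find(x), set()).add(x)
            return out

    # Recording that F, G, H were fused, and J, K were fused:
    uf = UnionFind()
    uf.union("F", "G"); uf.union("G", "H")
    uf.union("J", "K")
    print(uf.groups())   # {'H': {'F', 'G', 'H'}, 'K': {'J', 'K'}}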

The steps of partitioning may be as follows:

(1) Starting from all nodes in the computational graph that have no predecessors, traverse all nodes remaining in the graph (a node that has already been deleted in a later step, e.g., node D, is skipped).

(2) For each node, for example node A, if it supports a machine learning processor (for example, it can run on an MLU), traverse all successor nodes starting from it (a directed edge added to this node in a later step is also traversed). It should be noted that nodes are connected by directed edges; as mentioned in the embodiments, traversal along a directed edge means going from one node to the next node along that edge, i.e., from a starting node to its successor node.

(3) For the end point of a visited directed edge, such as node B, check by reverse depth-first traversal (traversal from the end point of the directed edge toward its starting point) whether another computation path can be found from A to B. If so, end the traversal of this directed edge; if not, merge nodes A and B into a set AB (indicating that they belong to one fused subgraph), add all incoming edges of node B (e.g., AB) and outgoing edges (e.g., BC) to node A, and delete node B and its directed edges, completing the traversal of this edge.
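
Putting steps (1) to (3) together, one possible simplified implementation of the partitioning loop is sketched below. It works on a plain adjacency-list graph plus a per-node support flag and performs the merge of step (3) by redirecting the deleted node's edges onto the absorbing node. All names are assumptions of this sketch; grouping by machine-learning-processor kind and the exact traversal order are omitted for brevity, and a real implementation would operate on the traced PyTorch graph instead.

    def partition_graph(succ, supported):
        # succ: {node: set of successor nodes}, one entry per node (possibly empty).
        # supported: {node: True if the node can run on the machine learning processor}.
        # Returns {original node: representative node of its fused subgraph}.
        succ = {n: set(vs) for n, vs in succ.items()}
        pred = {n: set() for n in succ}
        for u, vs in succ.items():
            for v in vs:
                pred[v].add(u)
        merged_into = {n: n for n in succ}

        changed = True
        while changed:                       # repeat until no directed edge can be fused
            changed = False
            for a in [n for n in succ if supported.get(n)]:
                for b in list(succ[a]):
                    if not supported.get(b):
                        continue
                    # Step (3): reverse depth traversal from b toward a, ignoring edge a -> b.
                    stack = [p for p in pred[b] if p != a]
                    seen, other_path = set(), False
                    while stack:
                        n = stack.pop()
                        if n == a:
                            other_path = True
                            break
                        if n not in seen:
                            seen.add(n)
                            stack.extend(pred[n])
                    if other_path:
                        continue             # fusing a and b would violate convexity
                    # Merge b into a: move b's remaining edges onto a, then delete b.
                    succ[a].discard(b)
                    pred[b].discard(a)
                    for v in succ.pop(b):
                        pred[v].discard(b); pred[v].add(a); succ[a].add(v)
                    for u in pred.pop(b):
                        succ[u].discard(b); succ[u].add(a); pred[a].add(u)
                    succ[a].discard(a); pred[a].discard(a)   # defensive: drop self-loops
                    for n, rep in merged_into.items():
                        if rep == b:
                            merged_into[n] = a
                    changed = True
                    break                    # graph changed: restart the scan
                if changed:
                    break
        return merged_into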

The rationality of the node (algorithm) partitioning of the present disclosure lies in:

correctness: connectivity is a natural result of the algorithm's process of adding nodes;

convexity (for any two nodes A and B in the fused subgraph, there is no computation path from node A to node B that passes through nodes outside the subgraph): can be proven by contradiction;

optimality: can be proven by contradiction;

stability: the topological ordering of the nodes in the directed graph has stability, and a fixed traversal order is used.

The steps of fusion are described below in conjunction with FIGS. 3A-3D and FIGS. 4A-4D.

It is noted that, in the various embodiments of the present disclosure, a solid circle "●" represents a node that supports the machine learning processor, for which further fusion needs to be considered later, while an open circle "○" represents a node that does not support the machine learning processor and must be processed separately, for example by a central processor, a speech processor, a programmable logic processor, and/or an image processor.

Example 1

The fusion computation starts from node F (node F has no predecessor node, i.e., it is a starting node in this embodiment of the present disclosure). Two directed edges start from node F: directed edge FG leading to node G, and directed edge FH leading to node H. Nodes G and H are successors of node F, node I is a successor of node H, node J is a successor of nodes G and I, and node K is a successor of node J.

For example, with node F, which has no predecessor node, as the starting node, its successor nodes are traversed along directed edges; as explained above, traversal along a directed edge simply means going from a node to its successor node, here to node G and node H. Since no other path returns to the starting node (node F) during the reverse traversal from the successor nodes (node G and node H) back to node F, the starting node and these successor nodes may be divided into the same subgraph {F, G, H}, as shown in FIG. 3B; equivalently, nodes F, G, H may be merged into a set FGH. The subgraph {F, G, H} then serves as a new starting node, and fusion proceeds from {F, G, H} toward successor node J and successor node I. During the reverse traversal from successor node J back to the starting node {F, G, H}, it is found that another path may return to the starting node through node I (as shown in FIG. 3B). Therefore, with the node without a predecessor ({F, G, H}) as the starting node, successor node J is traversed along a directed edge, and because another path (via node I) returns to the starting node {F, G, H} during the reverse traversal from successor node J, the access from the starting node {F, G, H} to successor node J ends, as shown in FIG. 3B. Optionally, the node passed through on such a path may be a node that does not support the machine learning processor or one that supports it; when it supports the machine learning processor, it may be the same device as the starting node {F, G, H} or a different device. Continuing with the other successor node I of the starting node: node I cannot be divided into the same subgraph as the new starting node {F, G, H} or as node J, because it is a node that does not support the machine learning processor, i.e., a node that must be processed separately. In an alternative embodiment, if node I were a node supporting the machine learning processor and no other path returned from successor node I to the starting node, node I could be divided into one subgraph with the starting node {F, G, H} (not shown in the figures); in one possible embodiment, when successor node I supports the machine learning processor and its device information is consistent with that of the starting node {F, G, H}, node I may be divided into one subgraph with those nodes (not shown in the figures).

At this point, nodes F, G, and H may be considered divided into a maximum weakly connected convex subgraph, node J may be regarded as a node without a predecessor, and node J serves as a new starting node for the subsequent fusion computation. It should be noted that "successor node" and "node without a predecessor" are relative. For example, in FIG. 3C, node J is a successor of nodes F, G, and H, but since it cannot be fused with them into the same subgraph, node J becomes a node without a predecessor for the subsequent computation, i.e., a new starting node. In other words, node J and the new starting node {F, G, H} are not divided into the same subgraph, and node J can now serve as the next new starting node from which the subsequent fusion steps start. As shown in FIG. 3C, because no other path returns to starting node J during the reverse traversal from successor node K to node J, the new starting node J and its successor node K may be divided into the same subgraph, or nodes J and K may be merged into a set JK. That is, nodes J and K can be considered divided into another maximum weakly connected convex subgraph.

Finally, as shown in FIG. 3D, nodes F, G, and H form one maximum weakly connected convex subgraph, nodes J and K form another maximum weakly connected convex subgraph, and node I, which does not support the machine learning processor, is processed separately.

In the embodiment of FIGS. 3A to 3D, node K is a successor node of node J, and in fact its only successor node. In various implementations, a successor node may or may not be unique.
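
Using the illustrative partition_graph sketch given after the partitioning steps above (an assumption of this write-up, not code from the disclosure), the result of Example 1 can be reproduced as follows:

    # Graph of Example 1: F -> {G, H}, H -> I, {G, I} -> J, J -> K; node I is unsupported.
    succ = {"F": {"G", "H"}, "G": {"J"}, "H": {"I"}, "I": {"J"}, "J": {"K"}, "K": set()}
    supported = {"F": True, "G": True, "H": True, "I": False, "J": True, "K": True}

    print(partition_graph(succ, supported))
    # {'F': 'F', 'G': 'F', 'H': 'F', 'I': 'I', 'J': 'J', 'K': 'J'}
    # i.e. {F, G, H} form one fused subgraph, {J, K} another, and I stays on its own.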

Example 2

The fusion computation starts from node F (node F has no predecessor node, i.e., it is a starting node in this embodiment of the present disclosure). Two directed edges start from node F: directed edge FG leading to node G, and directed edge FH leading to node H. Nodes G and H are successors of node F, node L is a successor of node H, node I is a successor of node L, node J is a successor of nodes G and I, and node K is a successor of node J.

For example, with node F (which has no predecessor node) as the starting node, its successor nodes G and H are traversed along directed edges FG and FH. Since no other path returns to the starting node (node F) during the reverse traversal from the successor nodes G and H back to node F, the starting node and these successor nodes are divided into the same subgraph {F, G, H}, as shown in FIG. 4A; equivalently, nodes F, G, H are merged into a set FGH.

The subgraph {F, G, H} is then fused into a new starting node, and the successor nodes of this new starting node are traversed along directed edges starting from it, i.e., fusion proceeds from the new starting node {F, G, H} toward successor node J and successor node L. Since no other path returns to the new starting node {F, G, H} during the reverse traversal from its successor node L back to {F, G, H}, the new starting node {F, G, H} and its successor node L are divided into a new subgraph {F, G, H, L}; equivalently, nodes F, G, H, L are merged into a set FGHL. In the embodiments of FIGS. 4A to 4D, it is assumed that, according to the logical execution order of the operators corresponding to the nodes, node L is processed before node J; therefore, whether to fuse node L is decided first. These steps are repeated until all nodes that support the machine learning processor and can be fused together are divided into the same maximum weakly connected convex subgraph. During the reverse traversal from successor node J back to the starting node {F, G, H, L}, it is found that node I, which does not support the machine learning processor, may be passed through. At this point, successor node J and the new starting node {F, G, H, L} cannot be divided into the same subgraph. Node I, because it does not support the machine learning processor and must be processed separately, cannot be divided into the same subgraph as the new starting node {F, G, H, L} or as node J. Nodes F, G, H, L can now be considered divided into a maximum weakly connected convex subgraph; that is, subgraph {F, G, H, L} is a larger weakly connected convex subgraph than subgraph {F, G, H}.

With the node without a predecessor ({F, G, H, L}) as the starting node, successor node J is traversed along a directed edge. Because another path (via node I) returns to the starting node {F, G, H, L} during the reverse traversal from successor node J back to it, the access from the starting node {F, G, H, L} to successor node J ends, as shown in FIG. 4B. This means that nodes {F, G, H, L} and successor node J cannot be merged into one subgraph.

Node J and the new starting node {F, G, H, L} are thus not divided into the same subgraph, so node J can now serve as the next new starting node from which the subsequent fusion step starts. As shown in FIG. 4B, because no other path returns to starting node J during the reverse traversal from successor node K to node J, the new starting node J and its successor node K may be divided into the same subgraph; nodes J and K can be considered divided into another maximum weakly connected convex subgraph. That is, with the successor node (node J) as the new starting node, the successor node of the new starting node (node K) is traversed along the directed edge JK from the new starting node;

since no other path returns to the new starting node (node J) during the reverse traversal from its successor node (node K), the new starting node (node J) and its successor node (node K) are divided into the same subgraph {J, K}, or nodes J and K are merged into a set JK, as shown in FIG. 4C.

Finally, as shown in FIG. 4D, nodes F, G, H, and L form one maximum weakly connected convex subgraph, nodes J and K form another maximum weakly connected convex subgraph, and node I, which does not support the machine learning processor, is processed separately.

In one embodiment of the present disclosure, the subgraph and the maximum weakly connected convex subgraph may be directed graphs.

In one embodiment of the present disclosure, the node partitioning method is applied to a PyTorch framework.

In one embodiment of the present disclosure, the partitioning results of the node partitioning method are saved in a union-find (disjoint-set) structure, an ordinary set, or a linked list.

This algorithm, which PyTorch does not originally provide, supports computation of a heterogeneous mixed CPU/MLU network. It divides the original network into the fewest and largest possible fusion subgraphs and fuses the operators supported by the MLU as much as possible, with each subgraph satisfying the premises for fusion. Performance is thus enhanced to the greatest extent through fused computation and the computation speed is improved, achieving the goal of improving computation efficiency as much as possible even when operators not supported by the MLU exist.

FIG. 5 is a block diagram illustrating a combined processing device 500 according to an embodiment of the present disclosure. As shown, the combined processing device 500 includes the aforementioned node partitioning apparatus 502 of a computational graph, which can be configured to execute the node partitioning method of the computational graph described in conjunction with the preceding figures. In one or more embodiments, the node partitioning apparatus 502 of the computational graph may also be a chip or an integrated circuit that computes gradients of input data. In addition, the combined processing device includes a universal interconnection interface 504 and other processing devices 506. The node partitioning apparatus 502 of the computational graph according to the present disclosure may interact with the other processing devices 506 through the universal interconnection interface 504 to jointly perform operations specified by a user.

According to aspects of the present disclosure, the other processing devices may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit ("CPU"), a graphics processing unit ("GPU"), or an artificial intelligence processor, and their number may be determined according to actual needs rather than being limited. In one or more embodiments, the other processing device may include the aforementioned reference hardware platform or reference computing device, so that it may form a system under test together with the node partitioning device of the computational graph that includes the test hardware platform. In one or more embodiments, the other processing device may serve as an interface between the node partitioning device of the computational graph of the present disclosure (which may be embodied as an artificial-intelligence-related computing device) and external data and control, performing basic control including, but not limited to, data handling and starting and stopping the machine learning computing device; the other processing devices may also cooperate with the machine-learning-related computing device to perform computing tasks.

According to aspects of the present disclosure, the universal interconnect interface may be used to transfer data and control instructions between the node partitioning apparatus and other processing apparatuses of the computational graph. For example, the node partitioning apparatus of the computational graph may obtain required input data from other processing apparatuses via the general interconnection interface, and write the required input data into a storage apparatus (or memory) on the node partitioning apparatus chip of the computational graph. Further, the node partitioning apparatus of the computational graph may obtain the control instruction from another processing apparatus via the general interconnection interface, and write the control instruction into the control cache on the node partitioning apparatus chip of the computational graph. Alternatively or optionally, the universal interconnection interface may also read data in the storage module of the node partitioning apparatus of the computation graph and transmit the data to other processing apparatuses.

Optionally, the combined processing apparatus may further include a storage apparatus 508, which may be connected to the node dividing apparatus and the other processing apparatus of the computation graph, respectively. In one or more embodiments, the storage device may be used to store data of the node partitioning device and other processing devices of the computational graph, particularly data that may not be stored in its entirety in internal or on-chip storage of the node partitioning device or other processing devices of the computational graph.

Depending on the application scenario, the combined processing device can be used as a system-on-chip (SoC) for equipment such as a mobile phone, a robot, an unmanned aerial vehicle, or video-surveillance equipment, thereby effectively reducing the core area of the control part, increasing the processing speed, and reducing overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.

In some embodiments, the present disclosure also discloses a chip including the node dividing device or the combined processing device of the above computation graph. In other embodiments, the present disclosure also discloses a chip packaging structure, which includes the above chip.

In some embodiments, the present disclosure also discloses a board card, which includes the above chip packaging structure. Referring to FIG. 6, an exemplary board card is provided; besides the chip 602, the board card may include other components, including but not limited to: a memory device 604, an interface device 606, and a control device 608.

The memory device is connected with the chip in the chip packaging structure through a bus and used for storing data. The memory device may include a plurality of groups of memory cells 610. Each group of memory cells is connected with the chip through a bus. It is understood that each group of memory cells may be a DDR SDRAM ("Double Data Rate SDRAM").

DDR can double the speed of SDRAM without increasing the clock frequency, because DDR allows data to be read on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM. In one embodiment, the memory device may include four groups of memory cells, and each group may include multiple DDR4 chips. In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking.

In one embodiment, each group of memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling DDR is arranged in the chip and is used for controlling data transmission and data storage of each memory unit.

The interface device is electrically connected with the chip in the chip packaging structure. The interface means is used for enabling data transfer between the chip and an external device 612, such as a server or a computer. For example, in one embodiment, the interface device may be a standard PCIE interface. For example, the data to be processed is transmitted to the chip by the server through the standard PCIE interface, so that data transfer is implemented. In another embodiment, the interface device may also be another interface, and the present disclosure does not limit the specific representation of the other interface, and the interface unit may implement the switching function. In addition, the calculation result of the chip is still transmitted back to the external device (e.g., server) by the interface device.

The control device is electrically connected to the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). In one or more embodiments, the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and may drive multiple loads; the chip can therefore be in different working states such as heavy load and light load. The control device can regulate the working states of the processing chips, processing cores, and/or processing circuits in the chip.

In some embodiments, the present disclosure also discloses an electronic device or apparatus, which includes the above board card. According to different application scenarios, the electronic device or apparatus may include a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device. Vehicles include airplanes, boats, and/or vehicles; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.

According to different application scenarios, the node partitioning device of the computation graph, or the combined processing device including the node partitioning device of the computation graph, the chip for computing the gradient of the input data, and the corresponding computer-readable storage medium, and the integrated circuit for computing the gradient of the input data may be applied to devices such as a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a car recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device. Vehicles include airplanes, boats, and/or vehicles; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.

In addition, it should be noted that the term "data" referred to in the specification of the present disclosure should be broadly construed and may include graphics, images, video, audio, and the like.

It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, optical, acoustic, magnetic or other form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.

The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer readable memory. Based on such understanding, when the technical solution of the present disclosure can be embodied in the form of a software product, the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present disclosure. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

In the above embodiments of the present disclosure, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. The technical features of the embodiments may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The foregoing may be better understood in light of the following clauses:

clause a1, a method of node partitioning of a computational graph, comprising:

dividing nodes which support a machine learning processor and can be fused together into corresponding subgraphs;

separately processing nodes that do not support a machine learning processor.

Clause a2, the method according to clause a1, the machine learning processor comprising one or more machine learning processors.

Clause A3, the method according to clause A2, wherein said dividing nodes that support a machine learning processor and can be fused together into corresponding subgraphs comprises:

dividing the nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph.

Clause A4, the method of clause A3, wherein said dividing nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph comprises:

taking a node without a predecessor node as a starting node, traversing a successor node of the starting node along a directed edge, and dividing the starting node and the successor node into the same subgraph when no other path returns to the starting node during the reverse traversal from the successor node to the starting node.

Clause A5, the method according to clause A4, wherein said dividing nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph further comprises:

fusing the subgraph into a new starting node, traversing a successor node of the new starting node along a directed edge starting from the new starting node, dividing the new starting node and its successor node into a new subgraph when no other path returns to the new starting node during the reverse traversal from that successor node to the new starting node, and repeating these steps until all nodes that support the machine learning processor and can be fused together are divided into the same maximum weakly connected convex subgraph.

Clause A6, the method of clause A5, wherein said dividing nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph comprises:

taking a node without a predecessor node as a starting node, traversing a successor node of the starting node along a directed edge, and ending the access from the starting node to the successor node when another path returns to the starting node during the reverse traversal from the successor node to the starting node.

Clause A7, the method of clause A6, wherein said dividing nodes that support the machine learning processor and can be fused together into a corresponding maximum weakly connected convex subgraph comprises:

taking the successor node as a new starting node, and traversing a successor node of the new starting node along a directed edge starting from the new starting node;

dividing the new starting node and its successor node into the same subgraph when no other path returns to the new starting node during the reverse traversal from that successor node to the new starting node; and

ending the access from the new starting node to its successor node when another path returns to the new starting node during the reverse traversal from that successor node to the new starting node.

Clause A8, the method of any one of clauses A1 to A7, wherein said separately processing nodes that do not support a machine learning processor comprises:

processing the nodes that do not support a machine learning processor by a central processor, a speech processor, a programmable logic processor, and/or an image processor.

Clause A9, the method according to any one of clauses A1 to A7, wherein the subgraph and the maximum weakly connected convex subgraph are directed graphs.

Clause A10, the method according to any one of clauses A1 to A7, wherein the node partitioning method is applied to a PyTorch framework.

Clause A11, the method according to any one of clauses A1 to A7, wherein the partitioning results of the node partitioning method are stored in a union-find (disjoint-set) structure, an ordinary set, or a linked list.

Clause A12, a node partitioning apparatus of a computational graph, comprising:

a processor configured to execute program instructions; and

a memory configured to store the program instructions, which, when loaded and executed by the processor, cause the apparatus to perform the method according to any one of clauses A1 to A11.

Clause A13, a computer-readable storage medium having stored therein program instructions adapted to be loaded by a processor to perform the method according to any one of clauses A1 to A11.
