Networked computer domain with embedded ring

Document No. 1866276 · Published: 2021-11-19

Abstract: This technology, "Networked computer domain with embedded ring", was devised by S. Knowles, O. Torudbakken, S. Felix and L.P. Hughes (filed 2020-03-26). One aspect of the invention provides a computer comprising a plurality of interconnected processing nodes arranged in a ladder configuration comprising a plurality of facing pairs of processing nodes. The processing nodes of each pair are connected to each other by two links. The processing nodes in each pair are connected to corresponding processing nodes in adjacent pairs by at least one link. The processing nodes are programmed to operate the ladder configuration to transmit data around two embedded one-dimensional rings formed by respective sets of processing nodes and links, each ring using all of the processing nodes in the ladder only once.

1. A computer comprising a plurality of interconnected processing nodes arranged in a ladder configuration comprising a plurality of facing pairs of processing nodes, wherein the processing nodes of each pair are connected to each other by two links and the processing nodes of each pair are connected to corresponding processing nodes of an adjacent pair by at least one link, wherein the processing nodes are programmed to operate the ladder configuration to transmit data around two embedded one-dimensional rings formed by respective sets of processing nodes and links, each ring using all processing nodes in the ladder configuration only once.

2. The computer of claim 1, wherein each processing node is programmed to divide the respective partial vector of that node into fragments and to transmit successive fragments around each ring.

3. The computer of claim 2, programmed to operate each ring as a set of logical rings, wherein successive fragments are transmitted around each logical ring in simultaneous transmission steps.

4. The computer of claim 2 or 3, wherein each processing node is configured to output a respective fragment on each of two links simultaneously.

5. The computer of claim 2, 3 or 4, wherein each processing node is configured to receive a respective fragment on each of two links simultaneously.

6. The computer of any preceding claim, wherein each link is bidirectional.

7. The computer of claim 2, wherein each processing node is configured, in a reduce-scatter phase of an allreduce collective, to reduce two incoming fragments with two respective locally stored corresponding fragments of the respective partial vector at that processing node, and to transmit the reduced fragments simultaneously on each of two links.

8. The computer of claim 7, wherein each processing node is configured to transmit fully reduced fragments simultaneously on each of two links in an allgather phase of the allreduce collective.

9. The computer of any preceding claim, wherein the ladder configuration comprises a first pair of processing nodes, at least one intermediate pair of processing nodes, and a second pair of processing nodes, wherein corresponding processing nodes of the first and second pairs are connected to each other by at least one respective link.

10. A method of operating a computer, the computer comprising a plurality of interconnected processing nodes arranged in a ladder configuration in which facing pairs of processing nodes are connected by two links and the processing nodes of each pair are each connected to a corresponding processing node of an adjacent pair by at least one link, the method comprising operating the ladder configuration to transmit data around two embedded one-dimensional rings, each ring using all processing nodes in the ladder configuration only once.

11. The method of claim 10, wherein each processing node divides the respective partial vector of that node into fragments and transmits successive fragments around each ring.

12. The method of claim 10 or 11, comprising operating each embedded one-dimensional ring as a plurality of logical rings, and transmitting successive fragments around each logical ring in successive transmission steps.

13. The method of claim 10, 11 or 12, comprising operating the ladder configuration to implement an allreduce collective by performing a reduce-scatter phase using the two embedded rings simultaneously and an allgather phase using the two embedded rings simultaneously, wherein the allgather phase follows the reduce-scatter phase.

14. A method of building a computer in a ladder configuration, comprising:

providing a first pair of processing nodes;

connecting the first pair of processing nodes together by two first links;

providing a second pair of processing nodes;

connecting the second pair of processing nodes together by two second links;

connecting each processing node of the first pair of processing nodes to each corresponding processing node of the second pair of processing nodes over a respective ladder connection link;

providing a third pair of processing nodes;

connecting the third pair of processing nodes together by two third links;

connecting each processing node of the second pair of processing nodes to each corresponding processing node of the third pair of processing nodes over respective further ladder connection links;

connecting each processing node of the third pair of processing nodes to each corresponding processing node of the first pair of processing nodes via a respective ladder return link; and

configuring the computer to transmit data around two embedded one-dimensional rings, each ring using all processing nodes in the ladder configuration only once.

15. The method of claim 14, comprising expanding the ladder configuration by introducing at least one further second pair intermediate the first and third pairs, which serve as end pairs of the ladder configuration, wherein the processing nodes of the at least one further second pair are connected to corresponding processing nodes of adjacent pairs by further ladder connection links.

16. The method of claim 15, wherein the step of introducing at least one further second pair is accomplished by changing the state of the third pair to act as a further second pair, by disconnecting the ladder return link and using it as a ladder connection link to connect to a processing node of a subsequent pair.

17. The method of claim 16, wherein the subsequent pair serves as an end pair.

18. The method of claim 16, comprising providing another subsequent pair, wherein the subsequent pair is connected to the other subsequent pair, which functions as an end pair.

19. The method of claim 16, 17 or 18, wherein each processing node has four operable links, wherein a first and second of the four operable links are for connecting that node to its counterpart processing node, a third of the four operable links is for connecting the processing node to its counterpart node in a neighboring pair, and a fourth of the operable links is for connecting the processing node to its counterpart node in another neighboring pair.

20. A method as claimed in claim 16, 17 or 18, wherein each processing node has four operable links, wherein a first and second of the four operable links are for connecting that node to its counterpart processing node, a third of the four operable links is for connecting that processing node to its counterpart node in an adjacent pair, and a fourth of the four operable links is for connecting that processing node, as a ladder return link, to a processing node of an end pair.

Technical Field

The present disclosure relates to data exchange between processing nodes connected in a computer, particularly but not exclusively for optimizing data exchange in machine learning/artificial intelligence applications.

Background

Collectives are routines which are commonly used when processing data in a computer. They are routines which enable data to be shared and processed across multiple different processes, which may be running on the same processing node or on different processing nodes. For example, if one process reads data from a data store, it can use a "broadcast" collective to share that data with other processes. Another example is when the result of a particular function is needed by multiple processes. A "reduction" is the result obtained when a computation function is applied to data values from each of multiple processes. "Gather" and "scatter" collectives handle more than one data item. Certain collectives have become increasingly important in processing machine learning applications.

MPI (Message Passing Interface) is a message passing standard which can be applied to many parallel computing architectures. MPI defines a number of collectives which are applicable to machine learning. One such collective is known as "allreduce". An allreduce operation enables the result of a computation function acting on multiple data values from different source processes to be provided at a receiving process. Note that a receiving process may be one of the source processes, and that there may be multiple receiving processes. The allreduce collective reduces the data values from multiple source processes and distributes the result to all of the source processes (which act as receiving processes for the reduced result). According to the MPI standard, an allreduce collective may be implemented by reducing the data values from all source processes in a reduce collective (for example, at one of the processes) and then broadcasting the result to each source process.
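
To make those semantics concrete, the following minimal Python sketch (our own illustration, not part of the patent) realizes an allreduce naively as a reduce followed by a broadcast; the function names and data are illustrative only.

```python
# Minimal sketch of allreduce semantics: reduce the data values from all
# source processes, then broadcast the result so that every process holds
# the same reduction. Names and data here are illustrative only.

def allreduce(partials, combine):
    # Reduce: apply the combining function element-wise across processes.
    result = [combine(values) for values in zip(*partials)]
    # Broadcast: every source process receives a copy of the result.
    return [list(result) for _ in partials]

if __name__ == "__main__":
    partials = [[1, 2], [3, 4], [5, 6]]      # one vector per source process
    print(allreduce(partials, sum))          # -> [[9, 12], [9, 12], [9, 12]]
```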

FIG. 1 is a schematic block diagram of a distributed architecture for training a neural network. A source of training data 100 is provided. This may be a database or any other type of data store capable of holding training data appropriate for the neural network model being trained. Processing according to the neural network model is itself distributed across multiple processing units 110a, 110b, 110c, etc. Only three units are shown in FIG. 1, but it will readily be appreciated that any number of processing units could be utilized. Each processing unit 110a, b, c receives batches of training data from the training data source 100. Each processing unit 110a, b, c holds a set of parameters 112a, 112b, 112c which define the model. An incoming batch of training data is processed with the current set of parameters in a calculation function 114, and the results of the calculation function are used to generate so-called deltas, which represent the difference between the original parameters and new parameters as a result of applying the calculation function to the batch of training data and the current set of parameters. In many neural networks, these parameters are termed "weights", and so the delta values are termed "delta weights". In FIG. 1, the weights are labeled 112a, b, c and the delta weights are labeled 116a, b, c. It will be appreciated that in practice the weights and delta weights are stored in suitable memory accessible by the processing unit. If the weights and delta weights can be held in local memory, the training process is much more efficient.

The architecture of FIG. 1 is not intended to train three separate models, but to train a single model in a distributed manner. Therefore, the aim is to have the model parameters (or weights) converge to a single common set in each processing unit. It is evident that, starting from any particular set of weights, and assuming that the batch of training data received at each processing unit is not identical, the delta weights produced by each calculation function in each processing unit will vary. What is needed, therefore, is a way to distribute the delta weights across the processing units after each iteration over a batch of training data. This is shown diagrammatically in FIG. 1, where a combining function 118 receives the delta weights from each processing unit and performs a mathematical function which reduces the delta weights, such as an averaging function. The output of the combining function 118 is then fed back to combining circuitry 120a, 120b and 120c within each processing unit, respectively. A new set of weights is thus produced as a combination of the original weights and the combined output from the combining function 118, and the new weights 118a, 118b, 118c are stored back into local memory. Then, the next batch of training data is supplied to each processing unit and the process repeats multiple times. It is evident that if the starting weights of the processing units are the same, then after each iteration they will be reset again to the same new values. It can readily be seen that the above is an example where an allreduce function is particularly useful: the delta weights are supplied to the combining function 118, where they are reduced, and they are then supplied back to each processing unit in their reduced form, where they can be combined with the original weights.
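
As a hedged sketch of the update scheme just described (our own code; the gradient function and learning rate are illustrative placeholders, not part of the original text), one iteration might look like this:

```python
# One training iteration in the style of FIG. 1: every unit computes delta
# weights from its own batch (114), a combining function averages them (118),
# and each unit applies the same combined update (120a-c). grad_fn and lr
# are illustrative placeholders.

def train_step(weights, batches, grad_fn, lr=0.1):
    deltas = [grad_fn(weights, batch) for batch in batches]        # 116a-c
    combined = [sum(ds) / len(ds) for ds in zip(*deltas)]          # 118
    return [w - lr * d for w, d in zip(weights, combined)]         # new weights

if __name__ == "__main__":
    # Toy quadratic loss per unit; each "batch" is just a target vector here.
    grad = lambda w, target: [wi - ti for wi, ti in zip(w, target)]
    w = [0.0, 0.0]
    for _ in range(100):
        w = train_step(w, [[1.0, 2.0], [3.0, 2.0]], grad)
    print(w)  # converges towards the average target [2.0, 2.0]
```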

FIG. 1A is a schematic diagram illustrating how an allreduce collective might be implemented in a line connection topology of processing nodes N0...N5. These processing nodes may correspond to the processing units of FIG. 1, with the combining function distributed among the processing units. The processing nodes are shown connected in a line configuration, where each processing node is connected to its neighbor by a "forward" link LF and a "backward" link LB. As shown, and as the directional terms imply, the forward links connect the processing nodes from left to right in FIG. 1A, and the backward links connect the processing nodes from right to left in FIG. 1A. Each processing node has a processing capability, indicated at 200, and a storage capability, indicated at 202. The processing capability and storage capability can be implemented in any of a very large number of ways. In one particular manifestation, a processing node may comprise multiple tiles, each individual tile having its own processing capability and associated memory capability. Each processing node also has one or more link interfaces enabling it to communicate with its neighboring nodes via the links LF/LB.

To understand the implementation of the allreduce collective, assume that a first node N0 has generated a "partial" vector labeled Δ0. A "partial" may be a data structure comprising an array of delta weights, such as a vector or tensor. A partial vector is an array of multiple partials, each partial corresponding to a computation on the processing node. This is stored in the storage capability 202, ready to be exchanged in the allreduce collective. In a simple "streaming" line allreduce algorithm, the forward links are used for "reduce" and the backward links are used for "broadcast". The algorithm starts with the processing node at one end (the left-hand node in FIG. 1A) sending its partial Δ0 to its adjacent node N1. At that node, the incoming partial (here Δ0) is reduced with the corresponding partial (Δ1) which was generated by the computing capability 200 of node N1. The result of this reduction (shown as an ADD function in FIG. 1A) is then sent from processing node N1 to the next connected node, N2. As mentioned further herein, the ADD function could be replaced by any combining function which can be used to reduce the partials. The process occurs at each processing node until, at the final processing node (denoted N5 in FIG. 1A), the reduction of the partials is complete. At this point, the reduction (sum Δ) is sent back to each processing node via the backward links LB. It is received at each node, stored at that node in the memory capability, and also transmitted on to the next node. In this way, each processing node ends up with the reduced result.
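
The following toy simulation (our own construction, using addition as the combining function) traces the streaming line allreduce just described: partials are reduced left to right on the forward links, then the result is broadcast right to left on the backward links.

```python
# Toy simulation of the "streaming" line allreduce of FIG. 1A: reduce on the
# forward links, broadcast on the backward links. ADD stands in for any
# combining function.

def line_allreduce(partials):
    n = len(partials)
    acc = list(partials[0])
    for i in range(1, n):                    # forward links: reduce
        acc = [a + p for a, p in zip(acc, partials[i])]
    results = [None] * n
    for i in reversed(range(n)):             # backward links: broadcast
        results[i] = list(acc)
    return results

print(line_allreduce([[1, 0], [2, 1], [3, 2]]))   # every node: [6, 3]
```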

Fig. 1B shows a timing diagram of the reduction and broadcast phases. Note that a processing node does not send the reduced result to the next node until it receives incoming data from the previous node. Thus, for each outgoing transmission on the forward link, there is an inherent delay labeled Δ.

Furthermore, the backward links are not used for broadcast until the fully reduced result has been obtained at the end node. However, if the partial vectors are large then, due to pipelining, the lead data items of the result will return to the starting node well before that node has finished sending its own partials, so there may be a substantial overlap of activity on all of the forward and backward links.

In a modification of this algorithm, which represents a minor improvement, the processing nodes at each end of the line may begin transmitting their partials towards a central node, at which the reduction is completed. In that case, the result is broadcast back out to the end nodes. Note that in this scenario there would be a reversal in the direction of movement on both the forward and backward links, for example between nodes N2 and N3, and between nodes N3 and N4. If a line is closed into a ring (by connecting the final node N5 to the first node N0 on both the backward and forward links), a pipelined algorithm can serialize the reduction and broadcast in the same direction, so that the two logical rings formed by the bi-directional links can each operate independently on half of the data; see FIG. 1C. That is, each partial vector is divided into two halves. A first half, ΔA, is reduced on the forward links (as in FIG. 1A) and broadcast on the connecting leg between N5 and N0. The other half of the vector, ΔB, is reduced on the backward links and then broadcast on the connecting ring of the backward links.

Fig. 1D shows corresponding timing diagrams for the forward and reverse links.

The principles of the one-dimensional rings shown in fig. 1C and 1D can be extended to two-dimensional rings, such as in a torus (toroid) connected computer.

An alternative to using two-dimensional rings is to implement the allreduce using a reduce-scatter collective followed by an allgather collective. A paper authored by Jain and Sabharwal entitled "Optimal Bucket Algorithms for large MPI collectives on torus interconnects" (ICS '10, June 2-4, Tsukuba) presents bucket-based algorithms for allgather, reduce-scatter and allreduce collectives assuming bi-directional links between processing nodes in a torus-interconnected processor. The approach operates on the basis that there are multiple data values (fragments) to be handled in each step. In the reduce-scatter collective, each process starts with an initial partial vector. It is assumed that a reference to a process herein is a reference to a process executing on a processing node. A partial vector can be divided into multiple elements or fragments. The corresponding elements of all processes are reduced, and these reductions are then distributed across the processes. In the allgather collective, every process receives all of the elements from all of the other processes. The reduce-scatter collective reduces all partials and stores each reduction on a respective node; see FIG. 2. The allreduce collective operation can be implemented by performing a reduce-scatter collective followed by an allgather collective operation.
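
A compact sketch (our own illustration, with illustrative data) shows why reduce-scatter followed by allgather is equivalent to allreduce when there are as many elements as nodes:

```python
# Allreduce as reduce-scatter + allgather, as in FIG. 2. After reduce-scatter,
# node i holds only the reduction r_i of the i-th elements; allgather then
# distributes every r_i to every node.

def reduce_scatter(partials):
    n = len(partials)
    return [sum(p[i] for p in partials) for i in range(n)]   # r_i lands on node i

def allgather(reductions):
    return [list(reductions) for _ in reductions]            # every node gets all r_i

partials = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(allgather(reduce_scatter(partials)))   # each node: [28, 32, 36, 40]
```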

As discussed in the Jain paper, torus interconnects are attractive interconnection architectures for distributed-memory supercomputers. In the discussion above, collectives have been explained in the context of communication between processes. In a distributed supercomputer, processing nodes are interconnected, and each processing node may be responsible for one or more processes (in the context of collectives). A torus interconnect is a type of mesh interconnect with processing nodes arranged in an N-dimensional array, in which each node is connected to its nearest neighbors and corresponding nodes on opposite edges of the array are also connected. Bi-directional communication links exist between the interconnected processing nodes.

The algorithms for implementing collectives discussed in the above paper authored by Jain and Sabharwal are applied to torus-connected architectures. This allows the collectives to process different fragments of the vectors in rings of different dimensions at the same time, which makes the process bandwidth-efficient. Indeed, Jain and Sabharwal presented their techniques as optimal for asymmetric tori, and it has been accepted in the field that this is the case.

It is an object of the present disclosure to present an improved topology and method for implementing an allreduce function, particularly but not exclusively for use in processing functions in machine learning.

While the topologies and configurations described herein are particularly effective for the efficient implementation of allreduce, they may also be used to advantage for other machine learning collectives and other types of parallel programs.

Disclosure of Invention

One aspect of the invention provides a computer comprising a plurality of interconnected processing nodes arranged in a ladder configuration, the ladder configuration comprising a plurality of facing pairs of processing nodes, wherein the processing nodes of each pair are connected to each other by two links, and the processing nodes of each pair are connected to corresponding processing nodes of an adjacent pair by at least one link, wherein the processing nodes are programmed to operate the ladder configuration to transmit data around two embedded one-dimensional rings formed by respective sets of processing nodes and links, each ring using all processing nodes in the ladder only once.

The ladder configuration may include a first pair of processing nodes, at least one intermediate pair of processing nodes, and a second pair of processing nodes, wherein corresponding processing nodes of the first and second pairs of processing nodes are connected to each other by at least one respective link.

In one embodiment, each processing node is programmed to divide the respective partial vector of that node into fragments and to transmit successive fragments around each ring.

Each ring may be operated as a set of logical rings, wherein successive fragments are transmitted around each logical ring in simultaneous transmission steps.

Each processing node may be configured to output a respective fragment on each of two links simultaneously.

Each processing node may be configured to receive a respective fragment simultaneously on each of the two links.

In some embodiments, each link is bidirectional. This doubles the number of operating loops in the computer.

To implement an allreduce collective for ML or other distributed applications, each processing node may be configured, in a reduce-scatter phase of the allreduce collective, to reduce two incoming fragments with two respective locally stored corresponding fragments of the respective partial vector at that processing node, and to transmit the reduced fragments simultaneously on each of two links.

Each processing node may be configured to transmit fully reduced fragments simultaneously on each of two links in an allgather phase of the allreduce collective.

Another aspect of the invention provides a method of operating a computer comprising a plurality of interconnected processing nodes arranged in a ladder configuration in which facing pairs of processing nodes are connected by two links, and the processing nodes of each pair are each connected to a corresponding processing node of an adjacent pair by at least one link, the method comprising operating the ladder configuration to transmit data around two embedded one-dimensional rings, each ring using all processing nodes in the ladder only once.

A method according to some embodiments includes operating each embedded one-dimensional ring as a plurality of logical rings, and transmitting successive fragments around each logical ring in successive transmission steps.

Another aspect provides a method of building a computer in a ladder configuration, comprising:

providing a first pair of processing nodes;

connecting a first pair of processing nodes together by two first links;

providing a second pair of processing nodes;

connecting together a second pair of processing nodes by two second links;

connecting each processing node of the first pair of processing nodes to each corresponding processing node of the second pair of processing nodes via a respective ladder connection link;

providing a third pair of processing nodes;

connecting a third pair of processing nodes together by two third links;

connecting each processing node of the second pair of processing nodes to each corresponding processing node of the third pair of processing nodes by respective further ladder connection links;

connecting each processing node of the third pair of processing nodes to each corresponding processing node of the first pair of processing nodes via a respective ladder return link; and

configuring the computer to transmit data around two embedded one-dimensional rings, each ring using all processing nodes in the ladder configuration only once.

The ladder configuration may be expanded by introducing at least one further second pair intermediate the first and third pairs (which act as the end pairs of the ladder configuration). The processing nodes of the further second pair are connected to corresponding processing nodes of the adjacent pairs by further ladder connection links. In practice, the introduction of a further second pair may be implemented by changing the state of the third pair: its ladder return links are disconnected and used as ladder connection links to connect to the processing nodes of a subsequent pair. That subsequent pair may then act as an end pair, or it may itself be connected to another subsequent pair which will act as the end pair. In this manner, the ladder configuration may be expanded to incorporate additional processing nodes.

Each processing node may have four operable links: two operable links for connecting it to its facing processing node in the pair; one operable link for connecting it to its corresponding node in an adjacent pair; and one operable link which (if the node is in an intermediate pair) connects it to its corresponding node in another adjacent pair, or (if the node is in an end pair) serves as a ladder return link to the other end pair.

To implement an allreduce collective, the method may comprise operating the ladder configuration to perform a reduce-scatter phase using the two embedded rings simultaneously and an allgather phase using the two embedded rings simultaneously, wherein the allgather phase follows the reduce-scatter phase.

Each processing node may comprise a memory configured to store an array of data items (such as a vector or tensor) ready to be exchanged in the reduce-scatter phase, wherein each data item has a respective position in the array, and corresponding data items occupy corresponding positions in the arrays of the other processing nodes. The array may be a "partial" vector (a vector of partial results) in the reduce-scatter phase, or a "results" vector (a vector of fully reduced partials) in the allgather phase.

Each processing node may be programmed to transmit data items to its adjacent processing node in each ring in a forward direction in the reduce-scatter phase. The data items which may be transmitted in each step are termed "fragments". A fragment is a piece of a vector; as described herein, vectors are divided into fragments to make use of the logical rings formed in the embedded rings.

Each array may represent at least a part of a vector of deltas, each delta representing an adjustment to a value stored at each processing node. Each processing node may be programmed to generate the vector of deltas in a compute step. Each processing node may be programmed to divide its vector into two sub-arrays so as to utilize the two embedded rings respectively.

Drawings

For a better understanding of the present invention, and to show how the same may be carried into effect, reference will now be made, by way of example, to the accompanying drawings.

Fig. 1 is a schematic diagram illustrating distributed training in a neural network.

FIG. 1A is a schematic diagram showing a line of processing nodes implementing a simple "streaming" line allreduce algorithm.

FIG. 1B is a timing diagram of the "streaming" line allreduce algorithm.

FIG. 1C is a schematic diagram of a line with the end nodes connected into a ring.

FIG. 1D is a timing diagram of a ring allreduce algorithm.

FIG. 2 is a schematic diagram illustrating the implementation of an allreduce function by a reduce-scatter step followed by an allgather step.

FIGS. 3A and 3B illustrate a bucket-based allreduce algorithm.

FIGS. 4A and 4B are schematic diagrams of a processing node configuration in which two rings are embedded.

FIG. 4C shows a ladder configuration as a 2 x 12 ring (or double ring) with 2:1 bandwidth asymmetry.

FIG. 4D is a connectivity graph showing the allocation of the total node bandwidth B to the links.

Detailed Description

Aspects of the present invention have been developed in the context of a multi-tile processor designed to act as an accelerator for machine learning workloads. The accelerator comprises a plurality of interconnected processing nodes. Each processing node may be a single multi-tile chip, a package of multiple chips, or a rack of multiple packages. The aim herein is to devise a machine which is highly efficient at deterministic (repeatable) computation. The processing nodes are interconnected in a manner which enables collectives, especially broadcast and allreduce, to be implemented efficiently.

One particular application is to update models when training neural networks using distributed processing. In this context, distributed processing utilizes multiple processing nodes located in different physical entities (such as chips or packages or racks). That is, data transmission between processing nodes requires the exchange of messages over physical links.

The challenges in developing a topology dedicated to machine learning differ from those in the general field of High Performance Computing (HPC) networks. HPC networks usually emphasize on-demand asynchronous all-to-all personalized communication, so dynamic routing and bandwidth over-provisioning are normal. Excess bandwidth may be provisioned in an HPC network with the aim of reducing latency rather than providing bandwidth. Over-provisioning of active communication links wastes power which could otherwise contribute to compute performance. The most common link type in use in computing today draws power when it is active, whether or not it is being used to transmit data.

The inventors have developed a machine topology which is particularly adapted to machine intelligence (MI) workloads, and which addresses the following properties of MI workloads.

In MI workloads, inter-chip communication is currently dominated by broadcast and allreduce collectives. A broadcast collective can be implemented by a scatter collective followed by an allgather collective, and an allreduce collective can be implemented by a reduce-scatter collective followed by an allgather collective. In this context, the term inter-chip denotes any communication between processing nodes which are connected via external communication links. As mentioned, these processing nodes may be chips, packages or racks. Note that the communication links could be between chips on a printed circuit board, or between chips on different printed circuit boards.

The workload may be compiled such that, within an individual Intelligence Processing Unit (IPU) machine, all-to-all communication is primarily inter-chip.

An allreduce collective has been described above and is illustrated in FIG. 2. FIG. 2 shows a set of partial values, or "partial" vectors, P0, P1, P2, P3 on each of four nodes in a starting state S1. In this context, a node is a processing node in a network of processing nodes. Note that each node N0, N1, N2, N3 has four "corresponding" partials, which are shaded accordingly: large diamond grid, wide downward-diagonal stripe, large square grid, wide upward-diagonal stripe. That is, each partial has a position in its partial vector such that the position of P0(n) in the vector on node n is the same as the position of P0(n+1) in the vector on node n+1. The suffix (n) is used to denote the node on which the partial resides; thus, P0(0) is the partial P0 on node N0. In a reduce-scatter pass, corresponding partials are reduced, and the reduction is provided to one of the nodes. For example, the partials P0(0), P0(1), P0(2), P0(3) are reduced (to r0), which is placed onto node N0. Similarly, the partials P1(0), P1(1), P1(2), P1(3) are reduced (to r1), which is placed onto node N1. And so forth, so that in an intermediate state S2 each node holds one of the reductions r0, r1, r2 and r3. As explained, the reduction may be effected by any combining function, which could include independent operators (e.g., max) or associative operators, e.g., P1(0) * P1(1) * P1(2) * P1(3). Then, in an allgather pass, each reduction is provided to all nodes to arrive at a state S3, in which each node now holds all four reductions. Note that in S1 the "corresponding" partials, e.g., P0(0), P0(1), P0(2) and P0(3), may all differ, whereas in state S3 each reduction, e.g., r0, is the same at all nodes, where ri = f{(Pi(0), Pi(1), Pi(2), Pi(3))}. In machine learning, the sets of partial values P0, P1, P2, P3 are vectors. Each pass of the model during training produces a partial vector (e.g., of updated weights). The reductions r0 (diamond grid), r1 (downward-diagonal stripes), r2 (square grid), r3 (upward-diagonal stripes) on each node in state S3 are the full reduction vectors, i.e., vectors of results (or of fully reduced partials). In the context of machine learning, each partial could be a set of update deltas for the parameters of a model. Alternatively (in an arrangement not described further herein) it could be an updated parameter.

FIGS. 3A and 3B illustrate a bucket-based algorithm for reduce-scatter/allgather which assumes six "virtual" 1D rings. These are also termed "logical" rings herein. FIG. 3A is a schematic diagram illustrating the reduction of partials in multiple virtual rings. Each partial is split into six fragments. In FIG. 3A, the capital letters R, Y, G, B, P, L each denote a different fragment of a partial stored at each node, indicated by shadings: diamond grid, upward-diagonal stripe, square grid, horizontal stripe, downward-diagonal stripe, vertical stripe. The letters denote corresponding fragments which are to be reduced with each other, and define the "virtual" or "logical" ring for those fragments. Referring to FIG. 3A, the "R" fragments in each of the partials P0, P1, P2, P3, P4 and P5 are reduced into a single fragment in a result vector; similarly for the Y, G, B, P and L fragments.

FIG. 3B shows a timing diagram, with time on the horizontal axis, indicating the data exchanges and computations in each step of the allreduce process. In FIGS. 3A and 3B, the allreduce process is accomplished by a reduce-scatter phase followed by an allgather phase. In FIG. 3B, each fragment is denoted by different shading, as described above.

The notation in FIGS. 3A and 3B is as follows. The partials are denoted P0, P1, P2, P3, P4, P5. At the start of the process, each partial is stored on a respective node N0, N1, N2, N3, N4, N5. Each fragment is labeled according to its fragment ordinal and its position in the virtual ring in which it is deemed to be reduced. For example, RA0 denotes the R fragment in partial P0, because this is the first fragment in the virtual ring formed by nodes N0-N1-N2-N3-N4-N5-N0. RA1 denotes the R fragment at node N1, which is in the second position in its virtual ring. YA0 denotes the Y fragment at node N1. The "0" suffix indicates that it is the first fragment in its virtual ring, the Y ring being N1-N2-N3-N4-N5-N0-N1. Note in particular that the suffixes on A reflect the virtual rings and do not correspond to the physical nodes (or to the partials). Note that FIG. 3A shows only the virtual rings on the forward links. FIG. 3B shows that an equivalent process occurs on the backward links, with the fragments there denoted B.

In step one, the first fragment (the A0) in each virtual ring is transferred from its node to the next adjacent node, where it is reduced with the corresponding fragment at that node. That is, RA0 moves from N0 to N1, where it is reduced into R(A0 + A1). Once again, the "+" sign is used here as shorthand for any combining function. Note that, in the same step, the A0 fragments of every virtual ring are transmitted simultaneously. That is, the link between N1 and N2 is used to transmit YA0, the link between N2 and N3 is used to transmit GA0, and so on. In the next step, the corresponding reduced fragments are transmitted over the forward links to their next adjacent nodes. For example, R(A0 + A1) is transmitted from N1 to N2, and Y(A0 + A1) is transmitted from N2 to N3. Note that for reasons of clarity not all fragments, and not all transmissions, are numbered in FIG. 3A; the full set of fragments and numbers is shown in FIG. 3B. This process proceeds for five steps. After five steps, the reduction of all fragments is complete: at the end of the fifth step, the reduction of each fragment resides on the last node of the corresponding ring for that fragment. For example, the R reduction is on node N5.

The allgather phase begins with a transmission from the last node to the first node in each virtual ring. Thus, the final reduction of the R fragments ends on node N5, ready for the first step of the allgather phase; the final reduction of the Y fragments correspondingly ends on node N0. In each successive step of the allgather phase, the reduced fragments are transmitted again to their next adjacent nodes; thus the fully reduced R fragment arrives also at N2, the fully reduced Y fragment also at N3, and so on. In this way, at the end of the allgather phase, each node ends up with all of the fully reduced fragments R, Y, G, B, P, L of the partial vector.
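
The stepping pattern of FIGS. 3A and 3B can be checked with a small simulation. The sketch below is our own indexing convention; the send rule is a standard pipelined ring schedule rather than a quote from the patent.

```python
# Toy simulation of the bucket/virtual-ring allreduce of FIGS. 3A-3B on one
# one-dimensional ring: each partial is cut into n fragments, n-1 pipelined
# reduce-scatter steps run around the ring, then n-1 allgather steps.

import random

def ring_allreduce(partials):
    n = len(partials)                       # nodes; one fragment per node
    frags = [list(p) for p in partials]     # frags[node][fragment]
    # Reduce-scatter: in step s, node i sends fragment (i - s) mod n forward
    # to node i+1, where it is combined ("+" as in FIG. 3A).
    for s in range(n - 1):
        for i in range(n):
            f = (i - s) % n
            frags[(i + 1) % n][f] += frags[i][f]
    # Fragment f is now fully reduced on node (f - 1) mod n.
    # Allgather: in step s, node i forwards the fully reduced fragment
    # (i + 1 - s) mod n to node i+1, which simply stores it.
    for s in range(n - 1):
        for i in range(n):
            f = (i + 1 - s) % n
            frags[(i + 1) % n][f] = frags[i][f]
    return frags

n = 6                                       # six nodes, six fragments, five steps
parts = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]
want = [sum(col) for col in zip(*parts)]
assert all(node == want for node in ring_allreduce(parts))
print("all", n, "nodes hold the fully reduced vector", want)
```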

The implementation of the algorithm is optimal if the computation required for the reduction can be concealed behind the pipeline latency. Note that, when forming suitable rings in a computer for implementing allreduce, a tour of the ring must visit each node in the ring only once. Therefore, the natural ring formed by a line with bi-directional links (FIG. 1A) is not the most efficient ring.

An improved topology for an interconnected network of processing nodes will now be described which permits an efficient exchange of partials and results between the processing nodes to implement an allreduce collective.

FIG. 4A is a schematic diagram illustrating a connection topology of multiple processing nodes. In FIG. 4A there are eight processing nodes in a ladder configuration, but it will be appreciated that the principle could be extended to any number of pairs of processing nodes arranged in a ladder. Each processing node is shown as a circle, and is numbered for ease of reference; in the following description, the prefix N will be inserted when referring to a node. For example, N0 denotes the circle in FIG. 4A representing node N0. The processing nodes are connected by links in the manner to be described. Each link may be bi-directional, that is, it can transmit data in both directions over the link. Note that there is a physical interconnectivity and a logical connectivity. The logical connectivity is used to form two embedded continuous rings, each ring being bi-directional. Firstly, the physical connectivity will be described. In the Y direction, each processing node is connected to its adjacent node by a single bi-directional link; facing processing nodes are connected by two bi-directional links. For example (see the left-hand vertical line in FIG. 4A), N0 is connected to N2 by a single bi-directional link (shown as two oppositely facing arrows in FIG. 4). Similarly, N2 is connected to N4, N4 is connected to N6, and N6 is connected back to N0. In the horizontal direction, node N0 is connected to N1 by two bi-directional links (one shown as oppositely facing arrows in black, small-dashed lines, and the other as oppositely facing arrows in grey, large-dashed lines). Similarly, N2 is connected to N3, N4 is connected to N5, and N6 is connected to N7. This physical connectivity allows two logical embedded rings (each being bi-directional) to be embedded into the structure. The first ring (shown in grey, large-dashed lines) extends from N0 to N1, N1 to N3, N3 to N2, N2 to N4, N4 to N5, N5 to N7, N7 to N6, and then back to N0. The second logical ring (shown in black, small-dashed lines) extends from N0 to N2, N2 to N3, N3 to N5, N5 to N4, N4 to N6, N6 to N7, N7 to N1, and then from N1 back to N0. Each ring comprises all eight processing nodes. Note also that both rings can be used simultaneously, because there is no contention on any single path.
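
This connectivity can be checked mechanically. The sketch below (our own encoding of FIG. 4A, with node numbers as in the figure) verifies that each embedded ring visits every node exactly once and that the two rings never contend for the same physical link, remembering that each facing pair provides two links:

```python
from collections import Counter

# Physical links of FIG. 4A: two per facing pair, one per rail segment
# (including the two ladder return links 0-6 and 1-7).
CAPACITY = Counter({(0, 1): 2, (2, 3): 2, (4, 5): 2, (6, 7): 2,
                    (0, 2): 1, (2, 4): 1, (4, 6): 1, (0, 6): 1,
                    (1, 3): 1, (3, 5): 1, (5, 7): 1, (1, 7): 1})

VR1 = [0, 1, 3, 2, 4, 5, 7, 6]   # the grey / large-dashed ring of FIG. 4A
VR2 = [0, 2, 3, 5, 4, 6, 7, 1]   # the black / small-dashed ring of FIG. 4A

def edges(ring):
    return [tuple(sorted((ring[i], ring[(i + 1) % len(ring)])))
            for i in range(len(ring))]

for ring in (VR1, VR2):
    assert sorted(ring) == list(range(8))          # visits every node once

used = Counter(edges(VR1) + edges(VR2))
assert all(used[e] <= CAPACITY[e] for e in used)   # no physical link contended
print("both rings are node-covering and conflict-free")
```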

Consider node N0 in FIGS. 4A and 4B. Node N0 has two links connecting it to node N1, which is referred to as its facing pair node. These links are labeled L011 and L012. The designation L01 denotes a link connecting nodes N0 and N1; the suffix 1 denotes that it is the first link between nodes N0 and N1, and the suffix 2 denotes that it is the second link between nodes N0 and N1. Note that, in this embodiment, each link is bi-directional. Node N0 also has a third bi-directional link, labeled L02. This link L02 connects to the corresponding node N2 in the next adjacent pair of the ladder configuration; the link L02 is therefore referred to as a ladder connecting link. Node N0 has a fourth link, labeled L06. This link connects the corresponding nodes N0 and N6 of the end pairs of the ladder configuration. In FIGS. 4A and 4B, the first end pair is the facing pair of nodes N0 and N1, and the second end pair is the facing pair of nodes N6 and N7. Note that the facing processing nodes N6, N7 are similarly connected by two bi-directional links L671, L672.

The link L06 extending between the corresponding nodes of the end pairs (N0 and N6), and the link L17 extending between nodes N1 and N7, are referred to as ladder return links. These links render each ring embedded in the ladder configuration a complete one-dimensional ring.

The ladder configuration in the embodiment of FIG. 4 comprises two facing intermediate pairs, N2/N3 and N4/N5. The facing nodes in each pair are connected by respective links L231, L232 and L451, L452.

Corresponding nodes in adjacent pairs are connected by ladder connecting links. For example, node N1 in the first end pair is connected by link L13 to node N3 in its adjacent intermediate pair. In FIGS. 4A and 4B, each link is shown by two double-headed arrows. However, these may be accommodated in a single physical link structure, and represent the bi-directional communication paths over that physical link structure. In alternative embodiments, each direction of a link structure could be a separate physical manifestation.

In the embodiment of FIGS. 4A and 4B, four facing pairs, totalling eight nodes, form the ladder configuration. It will readily be appreciated that the ladder configuration can simply be extended to incorporate further facing pairs. This can be done by changing the connectivity of the links at the nodes of the end pairs. For example, to add another facing pair to the ladder configuration, link L60 is disconnected from node N0 and connected to a further node N8 (shown in dashed lines). Node N8 has its own physical link, which is then connected to node N0 as a ladder return link. Similarly, the link L71 extending from node N7 to node N1 is disconnected from node N1 and connected to a new node N9 (shown in dashed lines). The physical link from N9 is then connected to node N1 as a ladder return link. Nodes N8 and N9 are connected together by physical links L891 and L892.

Further pairs of nodes may be added to the ladder configuration as needed, to expand the processing capacity of the computer. Note that when new nodes are added, the existing connectivity of the other pairs of nodes in the ladder configuration does not need to be interrupted, with the exception of the end pair. This greatly simplifies expansion of the ladder configuration to incorporate additional processing nodes as needed.
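
In code, the expansion rule amounts to appending a pair to each rail and letting the return links implicitly close last-to-first. The following sketch (our own model, with illustrative names) captures the rewiring described above:

```python
# Model of the ladder as two rails; the ladder return links implicitly
# connect the last node of each rail back to the first. Extending the ladder
# re-points the return link of each old end node at the new node, and the
# new pair supplies the return links; nothing else changes.

def extend_ladder(rails, new_pair):
    for side, node in zip(("left", "right"), new_pair):
        rails[side].append(node)       # old return link becomes a rung link
    return rails

def return_links(rails):
    return [(rail[-1], rail[0]) for rail in rails.values()]

rails = {"left": [0, 2, 4, 6], "right": [1, 3, 5, 7]}   # FIG. 4A
extend_ladder(rails, (8, 9))                            # add N8/N9 as end pair
print(rails, return_links(rails))   # returns now run N8->N0 and N9->N1
```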

As shown in FIG. 4B, the rings embedded in the ladder configuration may be considered "barley twist" rings, labeled VR1, VR2, owing to their appearance in the ladder configuration. It is emphasized that the rings shown in FIG. 4B are virtual data transmission rings embedded in the ladder configuration. The rings can operate simultaneously because no link is used twice in conflicting rings.

FIG. 4C illustrates an embodiment having twenty-four processing nodes arranged in a 2 x 12 ring structure. FIG. 4C indicates the bandwidth usage on the links. The bandwidth usage of the intra-pair links is B/2; the bandwidth usage of the inter-pair links is B/4; and the bandwidth usage of the return links is B/4. This is referred to herein as 2:1 bandwidth asymmetry.

FIG. 4D illustrates in schematic form the two embedded rings which can circulate simultaneously in the structure. The rings are isomorphic (that is, they have the same shape and are therefore the same length). As previously described, each ring visits all nodes of the structure only once. Each ring uses B/4 of bandwidth on each link of the ring; where there are two links between processing nodes (the intra-pair links), the aggregate bandwidth is B/2, as noted above.

To use this structure, the partial (or fragment) to be transmitted is split into two parts at each node, and each part is all-reduced around one of the rings using the one-dimensional ring algorithm described above with reference to FIGS. 3A and 3B.

The size of the fragment output by each node at each step is v/(2n), where n is the number of nodes and v is the size of the data structure which is being reduce-scattered or allgathered in a particular phase. At the outset, v is the size of the partial vector. The number of fragments equals the number of nodes in the ring. Note that, in contrast with the structure described in the Jain paper discussed earlier, the rings here pass through all nodes, and all of the links are in use all of the time. It is assumed herein that each processing node can output its data on two links simultaneously, and can receive and process data simultaneously. Each ring is one-dimensional: a non-branched chain of processing nodes.
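
A small numeric illustration (our own figures, assuming the fragment size per step is v/(2n) as stated above) of the splitting just described:

```python
# Illustrative numbers only: each node's partial of size v is split into two
# halves, one per embedded ring; each half is handled in n fragments, so each
# fragment output per step has size v / (2 * n).

v = 48      # elements in each node's partial vector (illustrative)
n = 8       # nodes in the ladder, and hence in each embedded ring

half = v // 2                 # elements handled by each of the two rings
fragment = v / (2 * n)        # elements per fragment per step
print(f"each ring reduces {half} elements in {n} fragments of {fragment}")
```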

Each node is capable of implementing a processing or compute function. Each node could be implemented as a single processor. It is more likely, however, that each node will be implemented as a single chip or a package of chips, wherein each chip comprises multiple processors. There are many possible different manifestations of each individual node. In one example, a node may be constituted by an intelligence processing unit of the type described in British applications with publication numbers GB 2569847, GB 2569430 and GB 2569275, the contents of which are herein incorporated by reference. However, the techniques described herein may be used on any type of processor constituting the nodes. What is outlined herein is a method of exchanging data in an efficient manner to implement a particular exchange pattern which is useful in machine learning models. Furthermore, the links could be manifest in any suitable way. It is advantageous for them to be bi-directional, and preferable that they can operate in both directions simultaneously, although this is not an essential requirement. One particular category of communication link is a SERDES link, which has a power requirement that is independent of the amount of data carried over the link, or of the time spent carrying that data. SERDES is an acronym for Serializer/DeSerializer, and such links are known. In order to transmit a signal on a wire of such a link, power is required to be applied to the wire to change the voltage in order to generate the signal. A SERDES link has the characteristic that power is continually applied to the wire to maintain it at a certain voltage level, such that a signal may be conveyed by a variation in that voltage level (rather than by a variation between 0 and an applied voltage level). Thus, there is a fixed power for the bandwidth capacity on a SERDES link, whether or not it is used. A SERDES link is implemented at each end by circuitry which connects a link layer device to a physical link, such as copper wires. This circuitry is sometimes referred to as a PHY (physical layer). PCIe (Peripheral Component Interconnect Express) is an interface standard for connecting high-speed computers.

It is feasible for links to be dynamically deactivated so as to consume effectively no power while inactive. However, the activation time and the non-deterministic nature of machine learning applications generally render dynamic activation during program execution problematic. As a consequence, the inventors have determined that it may be better to make use of the fact that the power consumption of a chip-to-chip link is essentially constant for any particular configuration, and that therefore the best optimization is to maximize use of the physical links by maintaining chip-to-chip traffic as concurrent as possible with IPU activity.

SERDES PHYs are full duplex (that is, a 16 Gbit per second PHY supports 16 Gbit per second in each direction simultaneously), so full link bandwidth utilization implies balanced bi-directional traffic. Note moreover that there is a significant advantage in using direct chip-to-chip communication as compared with indirect communication, such as via switches: direct chip-to-chip communication is much more power-efficient than switched communication.

Another factor to be taken into account is the bandwidth requirement between nodes. One objective is to have sufficient bandwidth to conceal inter-node communication behind the computations carried out at each node for distributed machine learning.

In optimizing a machine architecture for machine learning, the allreduce collective can be used as a yardstick for the required bandwidth. An example of the allreduce collective has been given above in the handling of parameter updates for model averaging. Other examples include gradient averaging and the computation of norms.

As one example, the full reduction requirement of the residual learning network may be considered. Residual learning networks are a class of deep convolutional neural networks. In deep convolutional neural networks, multiple layers are utilized to learn the corresponding features within each layer. In residual learning, residuals may be learned instead of learning features. A special residual learning network called ResNet enables direct connection between the different layers of the network. It has been shown that in some cases it may be easier to train such a residual network than a conventional deep convolutional neural network.

ResNet 50 is a 50-layer residual network. ResNet 50 has 25M weights, so an allreduce of all weight gradients in half-precision floating point format F16 involves partials of 50 megabytes. For the purpose of illustrating the bandwidth requirement, it is assumed that one allreduce is required per full batch. This is likely (but not necessarily) to be an allreduce of gradients. To achieve this, each node must output 100 megabytes per allreduce. ResNet 50 requires 250 gigaflops (billions of floating point operations) of computation per image for training. If the sub-batch size per processing node is 16 images, each processor executes 400 gigaflops for each allreduce collective. If a processor achieves 100 teraflops (trillions of floating point operations) per second, then approximately 25 gigabits per second is required across all links in order to sustain concurrency of compute and allreduce communication. For a sub-batch of 8 images per processor, the required bandwidth is nominally doubled, mitigated in part by the lower achievable teraflops per second when processing the smaller batch.

An allreduce collective carried out between p processors, each starting with a partial of size m megabytes (equal to the size of the reduction), requires that at least 2m·(p-1)/p megabytes are sent over links. So if each processor has l links over which it can send simultaneously, the asymptotic minimum reduction time is 2m·(p-1)/(p·l).
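
The bound can be evaluated numerically. The sketch below is our own arithmetic; the processor count, link count and link speed are illustrative assumptions, not figures from the text, applied to partials of the ResNet 50 scale discussed above.

```python
# Asymptotic minimum allreduce time: each of p processors must send at least
# 2*m*(p-1)/p megabytes; with l links of link_gbps each per processor, the
# time bound is 2*m*(p-1) / (p*l*bandwidth). All inputs are illustrative.

def min_allreduce_time_s(m_mb, p, l, link_gbps):
    mb_to_send = 2 * m_mb * (p - 1) / p          # per processor, megabytes
    mb_per_s = link_gbps * 1e9 / 8 / 1e6         # one link, megabytes/second
    return mb_to_send / (l * mb_per_s)

# e.g. 50 MB partials (ResNet 50 in F16), 16 processors, 4 links of 25 Gbit/s
print(f"{min_allreduce_time_s(50, 16, 4, 25) * 1e3:.2f} ms")   # ~7.50 ms
```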

The concepts and technologies described above can be used in a number of different examples.

In one example, a fixed configuration for use as a computer is provided. In this example, the processing nodes are interconnected as described and illustrated in the various embodiments discussed above.

A fixed configuration may be constructed with the precise number of processing nodes for that configuration. Alternatively, it may be provided by partitioning it from a larger structure. That is, there may be provided a set of processing nodes which form a "ladder" having stacked layers of an original set. The processing nodes in each stacked layer may have an interlayer link to a corresponding processing node in an adjacent stacked layer, and intra-pair links between the facing processing nodes in the layer.

A fixed configuration with a desired number of stacked layers may be provided by disconnecting each interlayer link in a designated stacked layer of the original set and connecting it to an adjacent processing node in that designated stacked layer, to provide an intra-pair link. In this way, a designated stacked layer of the original set can be made to form one of the first and second endmost layers of the structure. Note that an original set may be partitioned in this way into more than one fixed configuration.

The interlayer and intra-pair links are physical links provided by suitable buses or wires, as described above. In one manifestation, each processing node has a set of wires extending from it for connecting it to another processing node. This may be done, for example, by one or more interfaces of each processing node having one or more ports to which the physical wires are connected.

In another manifestation, the links may be constituted by on-board wires. For example, a single board may support a group of chips, for example four chips. Each chip has an interface with ports connectable to the other chips. Connections may be formed between the chips by soldering wires onto the board according to a predetermined method. Note that the concepts and techniques described herein are particularly useful in that context, because they make maximal use of links which have been pre-soldered onto a printed circuit board.

The concepts and techniques described herein are particularly useful because they permit optimal use to be made of non-switchable links. A configuration may be built by connecting the processing nodes as described herein using fixed, non-switchable links between the nodes. In some manifestations, there is no need to provide additional links between the processing nodes if such links will not be used. For example, in intermediate layers of the configuration there are fewer links between processing nodes than in the endmost layers. Alternatively, links may be provided between processing nodes, but may be permanently deactivated in certain configurations.

In order to use the configuration, a set of parallel programs is generated. The set of parallel programs contains node-level programs, that is, programs designated to operate on particular processing nodes of the configuration. The set of parallel programs to operate on a particular configuration may be generated by a compiler. It is the responsibility of the compiler to generate node-level programs which correctly define the link to be used for each data transmission step involving certain data. These programs include one or more instructions for effecting data transmission in a data transmission stage, which use a link identifier to identify the link to be used for that transmission stage. For example, a processing node may have two or three active links at any one time (double that if the links are simultaneously bi-directional). The link identifier causes the correct link to be selected for the data items at that transmission stage. Note that each processing node may be agnostic of the actions of its neighboring nodes: the exchange activity is pre-compiled for each exchange stage.
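
As a conceptual sketch only (this is not the compiler's actual output; the schedule entries are hypothetical, and the link identifiers reuse the L0xx naming of FIG. 4A for illustration), a pre-compiled node-level exchange schedule could be modelled like this:

```python
# A pre-compiled exchange schedule for one node: every transfer names the
# link identifier chosen at compile time, so no routing decision is taken
# at run time. Schedule contents are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class Transfer:
    step: int        # exchange step within the phase
    link: str        # statically assigned link identifier
    fragment: int    # which fragment of the partial vector to send

NODE0_SCHEDULE = [
    Transfer(step=0, link="L011", fragment=0),   # ring 1, first pair link
    Transfer(step=0, link="L02",  fragment=4),   # ring 2 used simultaneously
    Transfer(step=1, link="L012", fragment=1),
    Transfer(step=1, link="L06",  fragment=5),
]

def run_exchange_phase(schedule, send):
    for t in sorted(schedule, key=lambda t: t.step):
        send(t.link, t.fragment)                 # link fixed at compile time

run_exchange_phase(NODE0_SCHEDULE,
                   lambda link, frag: print(f"send fragment {frag} on {link}"))
```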

Note also that the links do not have to be switched: there is no need for data items to be actively routed at the time at which they are transmitted, nor for the connectivity of the links to be changed.

As mentioned above, the configurations of computer networks described herein are intended to enhance parallelism in computing. In this context, parallelism is achieved by loading node-level programs into the processing nodes of the configuration, which are intended to be executed in parallel, for example to train an artificial intelligence model in a distributed manner as discussed earlier. It will readily be appreciated, however, that this is only one application of the parallelism enabled by the configurations described herein. One scheme for achieving parallelism is known as "bulk synchronous parallel" (BSP) computing. According to a BSP protocol, each processing node performs a compute phase and an exchange phase which follows the compute phase. During the compute phase, each processing node performs its computation tasks locally but does not exchange the results of its computations with the other processing nodes. In the exchange phase, each processing node is permitted to exchange the results of its computations from the preceding compute phase with the other processing nodes of the configuration. A new compute phase is not commenced until the exchange phase has been completed on the configuration. In this form of BSP protocol, a barrier synchronization is placed at the juncture transitioning from the compute phase into the exchange phase, or transitioning from the exchange phase into the compute phase, or both.

In the present embodiments, when the exchange phase is initiated, each processing node executes instructions to exchange data with its adjacent nodes, using the link identifiers established by the compiler for that exchange phase. The nature of the exchange phase can be established by using the MPI message passing standard discussed earlier. For example, a collective such as the allreduce collective may be recalled from a library. In this way, the compiler has pre-compiled node-level programs which control the links over which the partial vectors (or respective fragments of the partial vectors) are transmitted.

It will be apparent that other synchronization protocols may be used.

While specific embodiments have been described, other applications and variations of the disclosed technology may become apparent to those skilled in the art once given the disclosure herein. The scope of the present disclosure is not limited by the described embodiments, but only by the appended claims.
