Speculative resource allocation for routing on an interconnect fabric

Document No.: 1889044    Publication date: 2021-11-26

Note: This technology, "Speculative resource allocation for routing on an interconnect fabric," was created by I. Matosevic and L. Bajic on 2021-05-20. Abstract: Methods and systems related to speculative resource allocation for routing on an interconnect fabric are disclosed. One disclosed method includes speculatively allocating a set of resources to support a set of paths through an interconnect fabric. The method also includes aggregating a set of responses from the set of paths at a branch node on the set of paths. If resource contention is detected, the set of responses will include an indicator of the resource contention. The method then further includes, in response to the indicator of resource contention, transmitting a deallocation message downstream from the branch node and transmitting the indicator of resource contention upstream, and reallocating resources for the multicast after a hold period.

1. A method for multicasting over an interconnect fabric, comprising:

allocating a set of resources to support a set of paths through the interconnect fabric, whereby the set of paths are speculatively allocated for the multicast;

aggregating a set of responses from the set of paths at a branch node on the set of paths, wherein the set of responses includes an indicator of resource contention;

transmitting a deallocation message downstream from the branch node and transmitting the indicator of the resource contention upstream in response to the indicator of the resource contention;

deallocating the set of resources for a hold period in response to the deallocation message; and

reallocating resources for the multicast after the hold period.

2. The method of claim 1, further comprising:

detecting the resource contention at a node on the set of paths while the set of resources is being allocated; and

generating, at the node, the indicator of the resource contention in response to detecting the resource contention.

3. The method of claim 2, wherein:

the interconnect fabric links a set of processing cores;

the node comprises one processing core from the set of processing cores and one router from a set of routers;

the branch node comprises another processing core from the set of processing cores and another router from the set of routers;

the multicast distributes data for a composite computation performed by the set of processing cores; and

all steps of the method are performed by hardware in the interconnect fabric without requiring any computation on any of the processing cores.

4. The method of claim 1, wherein:

the allocation is done locally on a node-by-node basis; and

without any information about the global state of the interconnect fabric.

5. The method of claim 2, wherein:

the detecting occurs in a router at the node;

the generating of the indicator occurs at the router at the node; and

the indicator of the resource contention is transmitted upstream on a dedicated signal line.

6. The method of claim 2, wherein detecting the resource contention further comprises:

receiving an allocation flit at the node; and

detecting an allocation failure of the allocation flit.

7. The method of claim 6, wherein:

the set of resources comprises a set of buffers and a set of crossbar channels;

the detecting occurs in a router at the node; and

the allocation failure is due to a lack of available buffers or crossbar channels on the router.

8. The method of claim 2, wherein detecting the resource contention further comprises:

receiving an allocation flit at the node;

determining that the allocation flit is a multicast allocation flit;

triggering a prioritized flit handling routine at the node in response to determining that the allocation flit is a multicast allocation flit; and

wherein a standard flit handling routine at the node takes more clock cycles than the prioritized flit handling routine.

9. The method of claim 1, wherein deallocating the set of resources further comprises:

receiving the deallocation message at a downstream node, wherein the downstream node is downstream from the branch node; and

triggering a prioritized flit handling routine at the downstream node in response to receiving the deallocation message.

10. The method of claim 1, further comprising:

determining a duration of the hold period at a source node; and

wherein the duration is one of: (i) random; and (ii) increased exponentially based on a number of allocation attempts.
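For illustration only, the two duration policies of this claim can be sketched as follows; the function name, base, and cap values are hypothetical and not part of the claim.

```python
import random

def hold_period(attempt, base=8, cap=1024):
    """Illustrative hold-period policies per claim 10.
    attempt: number of allocation attempts so far (>= 1).
    Returns (random_duration, exponential_duration) in arbitrary cycle units."""
    # Option (i): a random duration, which decorrelates competing sources.
    random_duration = random.randint(1, cap)
    # Option (ii): a duration that grows exponentially with the number of
    # attempts, capped so the wait does not grow without bound.
    exponential_duration = min(base * (2 ** (attempt - 1)), cap)
    return random_duration, exponential_duration
```

In this sketch, a source node that fails repeatedly waits longer on each retry under option (ii), while option (i) reduces the chance that two contending sources retry in lockstep.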

11. The method of claim 1, further comprising:

compiling, using a compiler, a set of instructions for execution on a set of processing cores linked by the interconnect fabric; and

disabling a resource contention detection routine when the compiler determines that multicast resource contention can be avoided.

12. An interconnect fabric comprising:

a set of nodes;

a set of routers, wherein the set of nodes and the set of routers have a one-to-one correspondence; and

a set of distributed logic circuits distributed among the set of nodes and programmed to perform a method for multicasting on the interconnect fabric, the method comprising:

allocating a set of resources on the set of routers to support a set of paths through the set of nodes, whereby the set of paths are speculatively allocated for a multicast of data;

aggregating a set of responses from the set of paths at a branch node of the set of nodes, wherein the set of responses includes an indicator of resource contention;

transmitting a deallocation message downstream from the branch node and transmitting the indicator of the resource contention upstream in response to the indicator of the resource contention;

deallocating the set of resources for a hold period in response to the deallocation message; and

reallocating resources for the multicast after the hold period.

13. The interconnect fabric of claim 12, the method further comprising:

detecting the resource contention at a node of the set of nodes while the set of resources is being allocated; and

generating, at the node, the indicator of the resource contention in response to detecting the resource contention.

14. The interconnect fabric of claim 13, further comprising:

a set of terminals of the interconnect fabric, wherein the set of terminals are processing cores;

wherein the node and the branch node each comprise a processing core from the set of terminals and a router from the set of routers; and

wherein the multicasting distributes data for a composite computation performed by the processing cores.

15. The interconnect fabric of claim 12, wherein:

the allocation is done locally on a node-by-node basis; and

without any information about the global state of the interconnect fabric.

16. The interconnect fabric of claim 13, further comprising:

a set of dedicated signal lines connecting the set of nodes;

wherein the detecting occurs in a router at the node;

wherein the generating of the indicator occurs at the router at the node; and

wherein the indicator of the resource contention is transmitted upstream on a dedicated signal line of the set of dedicated signal lines.

17. The interconnect fabric of claim 13, wherein detecting resource contention further comprises:

receiving an allocation flit at the node; and

detecting an allocation failure of the allocation flit.

18. The interconnect fabric of claim 17, wherein:

the set of resources comprises a set of buffers and a set of crossbar channels;

the detecting occurs in a router at the node; and

the allocation failure is due to a lack of available buffers or crossbar channels on the router.

19. The interconnect fabric of claim 13, further comprising:

an allocator at the node; and

wherein detecting the resource contention further comprises:

receiving an allocation flit at the node;

determining that the allocation flit is a multicast allocation flit;

triggering a prioritized flit handling routine on the allocator in response to determining that the allocation flit is a multicast allocation flit; and

wherein a standard flit handling routine on the allocator takes more clock cycles than the prioritized flit handling routine.

20. The interconnect fabric of claim 12, wherein deallocating the set of resources further comprises:

receiving the deallocation message at a downstream node, wherein the downstream node is downstream from the branch node; and

triggering a prioritized flit handling routine at the downstream node in response to receiving the deallocation message.

21. The interconnect fabric of claim 12, the method further comprising:

determining a duration of the hold period at a source node; and

wherein the duration is one of: (i) random; and (ii) increased exponentially based on a number of allocation attempts.

22. The interconnect fabric of claim 12, further comprising:

a controller programmed to perform a control method comprising:

compiling, using a compiler, a set of instructions for execution on a set of processing cores linked by the interconnect fabric; and disabling a resource contention detection routine when the compiler determines that multicast resource contention is avoidable.

23. A method for supporting multicast on an interconnect fabric, comprising:

receiving, at a branch node on the interconnect fabric, an allocation message for resources to support a path of the multicast;

transmitting the allocation message from the branch node to a set of paths originating from the branch node;

receiving, at the branch node, a set of responses from the set of paths originating from the branch node, wherein the set of responses includes an indicator of resource contention;

transmitting a deallocation message downstream from the branch node to at least one path of the set of paths in response to the indicator of the resource contention; and

transmitting the indicator of the resource contention upstream from the branch node in response to the indicator of the resource contention.

24. A node for an interconnect fabric, comprising:

a router; and

logic circuitry programmed to perform a method for supporting multicasting on the interconnect fabric, the method comprising:

receiving, at the router, an allocation message for resources to support a path of the multicast;

transmitting, using the router, the allocation message to a set of paths originating from the node;

receiving, at the router, a set of responses from the set of paths originating from the node, wherein the set of responses includes an indicator of resource contention;

transmitting, using the router, a deallocation message downstream to at least one path of the set of paths in response to the indicator of the resource contention; and

transmitting, using the router, the indicator of the resource contention upstream in response to the indicator of the resource contention.

Technical Field

The invention relates to speculative resource allocation for routing on an interconnect fabric.

Background

Modern large-scale system-on-chip (SoC) devices may integrate many instances of various functional blocks, including processing cores, accelerators, memory, and off-chip interfaces. Additionally, modern SoCs can integrate modules that traditionally could not even be implemented on-chip, such as radio-frequency communication modules and microelectromechanical modules. The speed and power-consumption efficiency of SoC implementations, as well as the cost savings associated with building a single chip rather than multiple chips, drive ever-increasing integration. To realize these benefits, a communication system must be established that allows the various modules to communicate with high bandwidth and low latency. Conventional approaches to implementing such communication systems include on-chip buses and dedicated wiring for specific signals. However, dedicated wiring is resource-intensive and inefficient because the wiring typically sits idle, and on-chip buses do not scale as modern SoC bandwidth requirements increase.

Modern solutions for inter-block communication in SoCs include dedicated networks in the form of interconnect fabrics. As used herein, the term "interconnect fabric" refers to a programmable system that transfers data between terminals of the system, where the system is programmed by writing control information to the individual routers associated with those terminals. In an interconnect fabric, a router, together with any terminals associated with it, may be referred to as a "node." When the interconnect fabric is located entirely within a single chip, it may be referred to as a network-on-chip (NoC). Data may be transmitted from one terminal to another along a series of nodes through the interconnect fabric. The series of nodes may be referred to as a "path" through the interconnect fabric.

Fig. 1 shows a portion of a NoC 100 of an SoC in the form of a multicore processor, in which the terminals of the fabric are four processing cores 101, 102, 103, and 104. The processing cores are shown linked using four routers 110, 111, 112, and 113. The processing cores and routers constitute the nodes of the interconnect fabric. The routers communicate with their associated processing cores using network interface units (NIUs) located on each processing core. Data may be transmitted between cores using unicast, multicast, or broadcast transmissions. In conventional approaches, transmissions in the fabric may be single-hop or multi-hop, depending on the network topology and the physical locations of the source and destination cores in the network. For example, in NoC 100, neighboring cores 101 and 102 can communicate using single-hop transmissions, while remote cores 101 and 104 must communicate using multi-hop transmissions, in which data is passed in series between neighboring routers and forwarded from each intermediate router on toward the destination router.

An interconnect fabric can efficiently distribute data between multiple terminals with high bandwidth and efficient resource utilization by sharing the same network resources among many different transmissions through the fabric. However, design work is needed to ensure a fair and efficient distribution of these resources to the various terminals connected to the fabric. For example, what is referred to in the technical literature as a deadlock situation must be avoided or mitigated: two separate transfers each require the same resources, and each transfer is prevented by the other from completing and releasing the shared resources. An illustration of this situation is shown in block diagram 200 of fig. 2, where terminals 201 and 202 are the source and destination of unicast path 211, terminals 203 and 204 are the source and destination of unicast path 212, and terminals 205 and 206 are the source and destination of unicast path 213. In the illustrated case, deadlock can occur if there are not enough buffers and channels on at least two of the illustrated terminals along at least two of the illustrated paths. For example, if node 204 cannot support the flow for both path 213 and path 212, node 201 cannot support the flow for both path 213 and path 211, and node 203 cannot support the flow for both path 211 and path 212, then the flow in all three unicasts will be blocked with no way forward. Although only two paths are required to form a deadlock state, three paths are used in this example to illustrate that deadlock sometimes involves a large number of paths and complex interdependencies among them.
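The circular dependency of fig. 2 can be made concrete with a small sketch. The node and path names follow the figure; the dictionaries and helper function are illustrative only. Each path holds the channel at one node while waiting for a channel held by another path, so no path can ever proceed.

```python
# Channel held by each path, and channel each path is waiting for,
# matching the contention described for fig. 2.
held = {"path211": "node201", "path212": "node203", "path213": "node204"}
wanted = {"path211": "node203", "path212": "node204", "path213": "node201"}

def blocked_paths(held, wanted):
    """A path is blocked if the node it wants is held by another path."""
    holders = {node: path for path, node in held.items()}
    return {p for p, node in wanted.items()
            if holders.get(node) not in (None, p)}

# Every path is blocked by another path: a deadlock with no way forward.
print(sorted(blocked_paths(held, wanted)))  # ['path211', 'path212', 'path213']
```

The wait-for relation forms a cycle (211 → 212 → 213 → 211), which is exactly the condition the disclosed speculative deallocation breaks.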

Disclosure of Invention

Methods and systems related to the efficient operation of an interconnect fabric are disclosed herein. Certain methods and systems disclosed herein include: speculatively allocating resources to support a transmission over a set of paths of the interconnect fabric; aggregating a set of responses from the set of paths; conditionally allocating or deallocating those resources based on the set of responses; and, if those resources are deallocated, reallocating them after a hold period has elapsed. The resources are allocated speculatively because it is not known in advance whether the paths needed to support the transmission are fully available, or whether resource contention with another transmission through the interconnect fabric will be detected while these resources are allocated. If the responses indicate that contention was detected, the resources can be deallocated to wait for the other transmission to complete, and then reallocated to complete the transmission. Although deallocating resources is unavoidable whenever contention is detected, applicants have found that, in certain implementations and for reasonable workloads, the number of detected contentions is small and the benefits of fast speculative allocation outweigh the drawbacks of deallocation.
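The allocate-aggregate-deallocate-retry flow described above can be sketched in simplified form. The `ToyFabric` model and all method names below are illustrative assumptions, not the disclosed hardware:

```python
class ToyFabric:
    """Toy model in which each link supports one transmission at a time."""
    def __init__(self):
        self.held = set()

    def try_allocate(self, path):
        # Speculative allocation: acquire every link on the path, or
        # return None as a local contention indicator.
        if any(link in self.held for link in path):
            return None
        self.held.update(path)
        return list(path)            # links actually acquired

    def deallocate(self, grants):
        for links in grants:
            if links:                # only release what was acquired
                self.held.difference_update(links)

    def wait(self, cycles):
        pass                         # stand-in for a hold-period timer

def transmit_with_speculation(fabric, paths, max_attempts=8):
    """Speculatively allocate all paths; on any contention indicator,
    deallocate what was held and retry after a growing hold period."""
    for attempt in range(1, max_attempts + 1):
        grants = [fabric.try_allocate(p) for p in paths]
        if all(g is not None for g in grants):
            return True              # all paths held; transmission proceeds
        fabric.deallocate(grants)    # release speculatively held resources
        fabric.wait(min(2 ** attempt, 64))
    return False
```

Note that in this sketch, as in the disclosure, no central arbiter is consulted: each allocation attempt either succeeds or produces a local contention indicator that triggers deallocation and a later retry.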

Interconnect fabrics according to certain embodiments of the invention disclosed herein support various transfers in a manner that avoids deadlock situations while also being highly scalable. Certain embodiments disclosed herein are highly scalable in that they require neither a central arbiter nor negotiation between source nodes to allocate resources for potentially conflicting transmissions. Instead, resources are speculatively allocated, and distributed logic programmed into the various nodes of the interconnect fabric detects potential contention, aggregates responses, and prevents deadlock situations from occurring without central coordination. Thus, the disclosed methods can operate in an interconnect fabric without regard to the time taken for messages to travel through the fabric, and without regard to the time taken to notify all nodes of the actions of other nodes. Indeed, in certain embodiments of the present invention, the disclosed routing behavior is an emergent property of the individual nodes, such that the size of the fabric has no significant effect on routing and flow control through it.

The methods disclosed herein for utilizing speculative allocation in an interconnect fabric are broadly applicable to any form of transmission in an interconnect fabric. However, in certain embodiments of the present invention, the disclosed methods and systems are deployed to support multicast transmissions on an interconnect fabric in which each transmission involves a long data burst. In particular, for long data bursts, the additional overhead associated with speculative allocation is less significant relative to the transmission time. As a non-limiting example, in a standard-sized NoC for parallel computing applications, the time required to transfer tens of kilobytes or more at a time is relatively long, making the time required to speculatively allocate transfer resources according to certain embodiments disclosed herein comparatively insignificant. This particular example is described in more detail below with reference to step 303 in fig. 3. Furthermore, the disclosed methods may be deployed to support multi-path multicast transmissions, in which a path through the interconnect fabric includes at least one branch node from which two or more paths originate. The particular approaches disclosed are especially useful for multi-path multicast transmissions because there are no known solutions using intelligent compilers or routing algorithms designed to avoid resource-contention conditions entirely in advance. In contrast, because certain methods disclosed herein efficiently and quickly deallocate resources upon detection of contention, deadlocks are avoided entirely: transmissions do not need to be coordinated in advance, and yet one of any set of potentially blocked transmissions is always allowed to complete rather than being blocked by another. Additionally, using these disclosed methods, the multiple paths required to support a multicast transmission can be allocated in parallel without requiring a global ordering of the individual paths.
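At a branch node, the aggregation of responses from the paths originating there can be sketched as follows; the message values and function name are illustrative, not the disclosed signaling:

```python
def aggregate_at_branch(child_responses):
    """Aggregate the responses arriving on the paths that originate at a
    branch node. Returns the indicator to forward upstream and whether a
    deallocation message should be sent downstream."""
    # One contention indicator on any downstream path is enough to
    # abandon the speculative allocation for the whole multicast.
    contention = any(r == "contention" for r in child_responses)
    upstream_indicator = "contention" if contention else "ok"
    send_deallocation_downstream = contention
    return upstream_indicator, send_deallocation_downstream
```

Because the aggregation is a simple any-of reduction, it works for any number of paths branching from the node, regardless of the topology beyond the branch.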

The methods disclosed herein are broadly applicable to any interconnect fabric connecting any group of terminals, such as SoCs having various types of circuit blocks. However, in certain embodiments of the present invention, the disclosed methods and systems are deployed in an interconnect fabric that connects a set of terminals in the form of a set of parallel processors executing the directed graph of a neural network. As modern neural networks operate on large data structures that may need to be distributed in parallel to a large number of processors, an interconnect network that can efficiently perform multi-path multicast with large bursts of data is particularly beneficial in such an arrangement.

In particular embodiments of the invention in which a set of responses from different nodes is aggregated, the responses may be generated at the particular nodes in the interconnect fabric where a possible risk of deadlock is detected. A potential risk of deadlock may be detected at a node by detecting resource contention there. For example, a node may detect a request for a given resource at the node to support N+1 transmissions, where N is the number of simultaneous transmissions the particular resource can support. In this manner, using the particular methods disclosed herein, speculative allocation of resources is conducted locally on a node-by-node basis without any information about the global state of the interconnect fabric, while avoiding deadlock conditions even though the interconnect fabric supports a large number of transfers. In certain embodiments, this functionality arises from the emergent behavior of distributed logic in which each node is programmed, using hardware logic alone, to act in concert with the others. In certain embodiments of the invention disclosed herein, the allocation, deallocation, and reallocation of resources in the fabric may all be performed entirely in hardware, such as in the distributed logic and routers of the interconnect fabric. Thus, the overhead associated with speculative allocation and deallocation can be minimized.
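The local detection rule described above reduces to a counter check against the resource's capacity N; the class and method names below are hypothetical:

```python
class NodeResource:
    """A node-local resource that supports at most n simultaneous
    transmissions; the (n+1)-th request yields a contention indicator.
    No global fabric state is consulted."""
    def __init__(self, n):
        self.n = n
        self.in_use = 0

    def allocate(self):
        if self.in_use >= self.n:
            return "contention"      # indicator generated locally
        self.in_use += 1
        return "ok"

    def release(self):
        self.in_use = max(0, self.in_use - 1)
```

The indicator produced here is what would be returned upstream and aggregated at a branch node, so that detection stays purely local while deallocation decisions propagate through the fabric.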

In a particular embodiment of the present invention, a method for multicasting over an interconnect fabric is disclosed. The method includes allocating a set of resources to support a set of paths through the interconnect fabric, whereby the set of paths is speculatively allocated for the multicast. The method further includes aggregating a set of responses from the set of paths at a branch node on the set of paths, wherein the set of responses includes an indicator of resource contention. The method further includes, in response to the indicator of resource contention, transmitting a deallocation message downstream from the branch node and transmitting the indicator of resource contention upstream. The method further includes deallocating the set of resources for a hold period in response to the indicator of resource contention, and reallocating resources for the multicast after the hold period.

In a particular embodiment of the present invention, an interconnect fabric is disclosed. The fabric includes a set of nodes, a set of routers, and a set of distributed logic circuits. The set of nodes and the set of routers have a one-to-one correspondence. The set of distributed logic circuits is distributed among the set of nodes and programmed to perform a method for multicasting over the interconnect fabric. These circuits may be programmed manually or using a hardware description language (e.g., VHDL or Verilog) or another programming language whose output is a design for digital or mixed-signal hardware circuits. The program is then implemented in logic gates, registers, crossbar switches, memory, and other circuit elements. The method includes allocating a set of resources on the set of routers to support a set of paths through the set of nodes, whereby the set of paths is speculatively allocated for a multicast of data. The method further includes aggregating a set of responses from the set of paths at a branch node in the set of nodes, wherein the set of responses includes an indicator of resource contention. The method further includes, in response to the indicator of resource contention, transmitting a deallocation message downstream from the branch node and transmitting the indicator of resource contention upstream. The method further includes deallocating the set of resources for a hold period based on the aggregated set of responses, and reallocating resources for the multicast after the hold period.

In a particular embodiment of the present invention, a method for supporting multicast on an interconnect fabric is provided. The method includes receiving, at a branch node on the interconnect fabric, an allocation message for resources to support a path of the multicast. The method further includes transmitting the allocation message from the branch node to a set of paths originating from the branch node. The method further includes receiving, at the branch node, a set of responses from the set of paths, wherein the set of responses includes an indicator of resource contention. The method further includes transmitting a deallocation message downstream from the branch node to at least one path of the set of paths in response to the indicator of resource contention, and transmitting the indicator of resource contention upstream from the branch node in response to the indicator of resource contention.

In a particular embodiment of the present invention, a node for an interconnect fabric is disclosed. The node includes a router and logic circuitry programmed to perform a method for supporting multicast on the interconnect fabric. The logic circuitry may be programmed manually or using a hardware description language (e.g., VHDL or Verilog) or another programming language whose output is a design for digital or mixed-signal hardware circuits, as is known in the art. The method includes receiving, at the router, an allocation message for resources to support a path of the multicast. The method further includes transmitting, using the router, the allocation message to a set of paths originating from the node. The method further includes receiving, at the router, a set of responses from the set of paths originating from the node, wherein the set of responses includes an indicator of resource contention. The method further includes transmitting, using the router, a deallocation message downstream to at least one path of the set of paths in response to the indicator of resource contention, and transmitting, using the router, the indicator of resource contention upstream in response to the indicator of resource contention.

Drawings

Fig. 1 illustrates a portion of a NoC according to the related art;

FIG. 2 illustrates an interconnect structure experiencing a deadlock condition in accordance with the related art;

FIG. 3 illustrates a flow diagram for multicasting over an interconnect fabric and a block diagram of the interconnect fabric in allocating resources for multicasting in accordance with certain embodiments of the present invention disclosed herein;

FIG. 4 illustrates a block diagram of the contents of a node of a set of nodes forming an interconnect structure, in accordance with a particular embodiment of the present invention disclosed herein;

FIG. 5 illustrates a flowchart of a process conducted by a node receiving an allocation message and a block diagram of an interconnect structure performing the flow of the flowchart in accordance with certain embodiments of the invention disclosed herein;

FIG. 6 illustrates a flow diagram of a branch node process and a block diagram of an interconnect structure performing the flow of the flow diagram in accordance with certain embodiments of the present invention disclosed herein;

FIG. 7 illustrates a flow diagram of a source node process and a block diagram of an interconnect structure performing the flow of the flow diagram in accordance with certain embodiments of the invention disclosed herein;

FIG. 8 illustrates a block diagram of a router, in accordance with certain embodiments of the inventions disclosed herein;

fig. 9 illustrates a set of flow charts of processes that a router may perform in accordance with certain embodiments of the invention disclosed herein.

Detailed Description

Methods and systems for speculative resource allocation routing on an interconnect fabric according to the above summary are disclosed in detail herein. The methods and systems disclosed in this section are non-limiting embodiments of the invention, are presented for purposes of illustration only, and should not be used to limit the full scope of the invention. Although the specific examples provided in this section are directed to interconnect structures in the NoC format, the methods disclosed herein are broadly applicable to any interconnect structure. Furthermore, interconnect structures according to the present disclosure may be implemented in a single-chip system, a single-package multi-chip system, or a multi-chip system in which the chips are attached to a common substrate, such as a printed circuit board (PCB), interposer, or silicon mesh. An interconnect structure according to the present disclosure may also include chips on multiple substrates that are linked together by a higher-level common substrate, such as multiple PCBs that each hold a set of chips and are secured to a common backplane. Furthermore, although the specific examples provided in this section are directed to bidirectional regular square network topologies, the methods disclosed herein are more broadly applicable to networks having a variety of topologies, including unidirectional or bidirectional topologies as well as trees, meshes, rings, and other topologies. Indeed, a benefit of the particular approaches disclosed herein is that responses can be fully aggregated and resources fully deallocated regardless of the regularity or irregularity of the interconnect fabric topology.
Furthermore, although the specific examples provided in this section are directed to wormhole flow control on an interconnect fabric, the methods disclosed herein are more broadly applicable to any type of flow control on an interconnect fabric, including any type of cut-through, packet-buffered, or flit-buffered flow control. As used herein, the term "flit" refers to the smallest unit of data recognized by the flow control system of the interconnect fabric. Furthermore, although the specific examples provided in this section refer to the resources of a node as buffers and crossbar channels, the disclosed methods are more broadly applicable to monitoring any resource that may be speculatively allocated and held for transferring data over an interconnect fabric.

Fig. 3 illustrates a flow chart 300 for multicasting over an interconnect fabric and a block diagram 310 of the interconnect fabric while resources are allocated for the multicast, in accordance with particular embodiments of the invention disclosed herein. The steps of flow chart 300 are performed by an interconnect fabric. However, in certain embodiments of the present invention, these steps result from the emergent behavior of various nodes using distributed logic. Figs. 4-7 illustrate the behavior of the various nodes in more detail. Throughout this disclosure, the behavior of particular nodes is described in terms of "source" nodes, "branch" nodes, "intermediate" nodes, and "termination" nodes. However, in certain embodiments of the present invention, each node in the interconnection network may include logic to perform the behaviors of multiple node types in order to support multiple transmissions over the network at different times. Different logic may be activated based on the routing and flow control information received at a node. For example, one node may act as a source node for one transmission and as a termination node for another, depending on where data needs to be routed through the fabric. Furthermore, in certain embodiments of the present invention, if the same node supports multiple transmissions simultaneously, the node may act as multiple types of node simultaneously. By receiving routing and flow control information, a node can be assigned a particular node type. In certain embodiments of the invention, the routing and flow control information may arrive on the same hardware lines as the data being transferred. The distinction between data and control information may be encoded in the header of the information conveyed by the fabric. Alternatively, the header of the information passed through the fabric may always contain control information, while the body holds data.
In certain embodiments of the invention, the routing and flow control information may arrive at the node on different dedicated signal lines, separate from the lines used to route the data.

The flow chart 300 begins with step 301, in which a set of resources is allocated to support a set of paths through the interconnect fabric, whereby the set of paths is speculatively allocated for a multicast. Referring to block diagram 310, the multicast may be a multicast 311 from a source node 312, which has a set of termination nodes 313. As shown, block diagram 310 includes a set of nodes with both source node 312 and a second source node 314, meaning there is a possibility of a deadlock condition between transmissions from the two different source nodes. Multicast 311 is a multi-path multicast in that it fans out into a set of three paths to efficiently route data to the set of termination nodes 313. Multicast 311 requires resources at each node along the lines marking multicast 311 because movement through the interconnect fabric involves multi-hop data transmission from one node to the next. The paths are speculatively allocated in that source node 312 and the intermediate nodes along the paths forming multicast 311 are initially unaware that a subset of the nodes in the set of termination nodes 313 are also in a second set of termination nodes 315 of a multicast 320. In the illustrated case, if a node (e.g., node 316) does not have sufficient resources to support both multicast 311 and multicast 320, resource contention may occur between the two multicasts.

In particular embodiments of the present invention, as an allocation request propagates along a transmission path (e.g., a set of paths comprising a multi-path multicast), allocation occurs node by node without any information about the global state of the interconnect fabric. The allocation request may take the form of an allocation message communicated node by node along the paths forming multicast 311. The process may also include copies of the allocation request being generated by a branch node (e.g., branch node 317) and sent down the different branches. Branch node 317 is a branch node of multicast 311 because multiple downstream paths of multicast 311 originate from it. However, branch node 317 may have routing and flow control logic similar to the other nodes in the network, and is only temporarily configured as a branch node for multicast 311. In particular embodiments of the invention, the allocation message may be the header flit of a wormhole flow control packet. A node may include logic programmed to allocate resources on the node upon evaluating the allocation message, replicate the allocation message as needed, and then forward the one or more allocation messages downstream.
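The node-by-node handling of an allocation message described above can be sketched in software. The following is a minimal illustrative sketch, not the patented hardware logic: the node interface (`reserve_resources`, `output_ports`, `send_upstream`, `send_downstream`) and the `StubNode` class are hypothetical names invented for this illustration.

```python
def handle_allocation(node, alloc_msg):
    """Sketch of node-by-node speculative allocation. Each node acts only
    on local state, with no view of the global state of the fabric."""
    if not node.reserve_resources():        # buffers and a crossbar channel
        node.send_upstream("failure")       # indicator of resource contention
        return
    ports = node.output_ports(alloc_msg)    # output ports assigned at this hop
    if not ports:                           # no ports assigned: termination node
        node.send_upstream("success")
    for port in ports:                      # >1 port: a branch node replicates
        node.send_downstream(port, alloc_msg)

class StubNode:
    """Minimal stand-in used only to exercise the sketch."""
    def __init__(self, has_resources, ports):
        self.has_resources, self.ports = has_resources, ports
        self.upstream, self.downstream = [], []
    def reserve_resources(self): return self.has_resources
    def output_ports(self, msg): return self.ports
    def send_upstream(self, m): self.upstream.append(m)
    def send_downstream(self, p, m): self.downstream.append((p, m))
```

A termination node sends a success response upstream, a branch node forwards one copy of the message per branch, and a node without free resources responds with a failure message, matching the behaviors described above.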

In particular embodiments of the present invention, the actions of the nodes may be highly parallelized. Since routing control may be conducted without reference to a global state, each node may include logic circuitry, from the distributed logic circuitry, to manage the allocation of resources required to execute step 301 and to execute the additional method steps of flow chart 300. For example, block diagram 400 in Fig. 4 may be common to all the nodes in block diagram 310. In particular embodiments, not only are the network aspects of the nodes the same, but the terminals of the nodes may also match. For example, each node in interconnect fabric 310 may include a terminal 401 in the form of a processing core, from a set of processing cores linked by the interconnect fabric, and a router 402 from a set of routers that route data through the interconnect fabric. Terminal 401 is a processing core with a network interface unit (connected to the router), a processing pipeline, and a memory. The memory may store routines executed by the pipeline and data operated on by those routines. Accordingly, multicast 311 may be a multicast that distributes data for a composite computation executed by the set of processing cores. The composite computation could involve executing a directed graph to draw an inference from a neural network. However, in other embodiments, the nodes will differ in their terminals on the interconnect fabric. For example, the terminals may be a set of blocks in an SoC that are not entirely homogeneous but exchange data for the combined action of the system.

In certain embodiments of the present invention, in the execution of step 301, the resources allocated at each node of the interconnect fabric may include crossbar channels and buffers allocated by logic circuitry on the node in response to a received allocation message. In the illustrated example, the steps of the method of flow chart 300 may be executed by distributed logic circuitry instantiated by logic circuitry on each node of the fabric (e.g., logic circuitry 403) and by resources of the router on each node (e.g., resources 404). The logic circuitry may be part of distributed logic circuitry that is distributed among the set of nodes in the form of a set of matching logic circuits located on the set of nodes in a one-to-one correspondence. The node of block diagram 400 includes resources 404 on router 402 in the form of a set of crossbar channels that can route data between the various external lines 405, 406, 407, and 408 and terminal 401, and in the form of buffers that can temporarily store data while the channels are in use. In response to the router receiving an allocation message, the buffers and crossbar channels may be speculatively allocated for a transmission, such as multicast 311. The allocation message may be the header flit of a wormhole flow control packet. The logic circuitry may be programmed to allocate the resources upon evaluating the allocation message.

In particular embodiments of the present invention, the nodes may provide a set of responses in reply to the allocation messages. A response may indicate whether the resources were successfully allocated or whether resource contention was encountered. The responses may be generated by logic circuitry (e.g., logic circuitry 403) and transmitted upstream from the node. The logic circuitry may be programmed to generate a success response if the resources are successfully allocated and to transmit that response upstream. The logic circuitry may further be programmed to generate an indicator of resource contention at the node in response to detecting resource contention, and to responsively transmit the indicator of resource contention upstream. In particular embodiments of the present invention, the generation of a success response may be conditioned on determining that the node is a termination node. The upstream and downstream directions of a transmission through the node may be stored in a memory of the node's router upon receipt of the allocation message and/or upon allocation of resources at the node. The status of a node, as to which type of node it is for a particular transmission, may likewise be kept in the memory of the node's router.

Fig. 5 illustrates a flow chart 500 of the behavior of a termination node on an interconnect fabric and a block diagram 510 of the interconnect fabric as a success response is forwarded upstream, in accordance with particular embodiments of the invention disclosed herein. Flow chart 500 includes a set of steps executed by logic circuitry on termination node 511 in block diagram 510. The node is referred to as a termination node because it is the final hop on a path of multi-path multicast 311. However, as with the branch node mentioned above, there is not necessarily anything that permanently distinguishes termination node 511 beyond its temporarily executing the particular logic flow associated with a termination node for multicast 311.

The flow chart 500 begins with step 501, in which an allocation message for resources to support a transmission path, such as multicast 311, is received at a node, such as termination node 511. The logic circuitry of the node may be programmed to allocate the resources for the multicast or to detect contention, as described below. Assuming the allocation succeeds, the logic circuitry may also transmit the allocation message downstream or transmit a success message upstream, depending on the allocation message and the node's position in the multicast. In certain embodiments of the present invention, termination nodes will transmit a success message upstream, while intermediate nodes will not. The allocation message may be the header flit of a wormhole flow control packet. The resources may be crossbar channels and/or buffers on the node.

In particular embodiments of the invention, the logic circuitry of a node will examine the address in the allocation message to determine how the node should react. Indeed, in certain embodiments of the present invention, a node will not know what role it plays for a particular transmission until it reacts to the allocation message. Accordingly, flow chart 500 continues with step 502, checking the address in the allocation message. A node may use this check of the allocation message to determine what role it will play in a given transmission and which steps should be executed next.

Block diagram 510 illustrates the execution of flow chart 500 by a termination node. As shown in block diagram 510, the destination nodes of multicast 311 are explicitly defined by two addresses, (X_STR, Y_STR) and (X_END, Y_END). The logic circuitry of a termination node (e.g., termination node 511) may use this information, along with information about the previous nodes the path has traversed, a priori knowledge of the topology, and/or knowledge of the source address obtained from the allocation message, to determine that it is a termination node. At this point, the logic circuitry of the termination node may execute step 503, in which a response in the form of a success message is returned upstream, assuming there is no resource contention at node 511. The intermediate nodes on the paths forming multicast 311 may be programmed to forward any success message received from downstream further upstream.

The flow chart 500 also includes a step 504, in which the allocation message is transmitted from the node to a downstream node. This step is executed if it is determined in step 502 that the node is an intermediate node or a branch node. The allocation message may be transmitted from a branch node onto the set of paths originating from that branch node. The identity of a node as an intermediate node or a branch node may be determined using approaches similar to those described above for determining whether the node is a termination node.

Although an example of a multicast destination encoded as a rectangle is provided in Fig. 5 for purposes of explanation, the methods disclosed herein generalize to any multicast group in which a node's position on the transmission path can be determined from routing coordinates and the allocation message assigns one or more output ports at each hop. In particular, if the allocation message assigns multiple output ports, the node will activate the branch node logic routines described herein; if the allocation message assigns a single output port, the node will activate the intermediate node logic routines; and if the packet assigns no output ports, the node will activate the termination node logic routines.
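The role-selection rule in the preceding paragraph can be expressed compactly. The sketch below is illustrative only; the function name `node_role` and its string labels are assumptions for this example, not terms from the disclosure:

```python
def node_role(num_assigned_output_ports: int) -> str:
    """Classify a node's role for a transmission from the number of
    output ports the allocation message assigns at this hop."""
    if num_assigned_output_ports > 1:
        return "branch"        # replicate the allocation message downstream
    if num_assigned_output_ports == 1:
        return "intermediate"  # forward the allocation message downstream
    return "termination"       # final hop: return a success response upstream
```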

In particular embodiments of the invention, the responses from the various nodes that have been speculatively allocated to form one or more paths for a transmission through the interconnect fabric may be aggregated so that the fabric can determine the success or failure of the allocation. Certain embodiments of the invention disclosed herein aggregate the responses in a manner that guarantees every node potentially participating in a transmission will be informed of the failure of a speculative allocation, regardless of when and where in the fabric contention is detected during the allocation of resources. As with the other behaviors of the fabric disclosed herein, this aggregation is an emergent attribute of the fabric, arising from the behavior of the logic circuits of the various nodes acting in combination. In particular, the responses from the different paths of a multi-path multicast may be aggregated, evaluated, and acted upon at the branch nodes of the multicast.

The flow chart 300 continues with step 302, in which a set of responses from the set of paths forming at least part of the transmission is aggregated at a branch node on the set of paths. The set of responses may be aggregated through the action of response-forwarding logic as the response messages are transmitted between the nodes forming the multicast. In particular, the responses may be forwarded upstream by the intermediate nodes and held by the branch nodes for combined evaluation. The logic circuitry of a branch node may track how many paths originate from the branch and wait for a response from each before evaluating the responses in combination. The received responses may first be stored in a buffer. The set of responses may indicate that no resource contention was detected during resource allocation. For example, the termination nodes and downstream branch nodes of all the paths originating from a branch node may include logic blocks to return success signals, and the branch node may hold the responses received from all of these nodes while aggregating them. Alternatively, the set of responses may indicate that resource contention was detected during resource allocation.

In particular embodiments of the present invention, the logic circuitry of a branch node may be configured to execute different actions based on the evaluation of the aggregated responses. For example, the combined analysis of the aggregated responses may evaluate to a failure condition if any of the responses indicates a failed allocation, and may evaluate to a success condition only if all of the responses indicate a successful allocation. The logic circuitry may be triggered to execute different routines for the failure condition and the success condition. For example, under the success condition, the branch node may itself return a success signal to be passed upstream toward additional branch nodes or the source node of the transmission. Alternatively, under the failure condition, the branch node may return an indicator of the resource contention upstream and notify the other downstream paths of the allocation failure.

Fig. 6 illustrates a flow chart 600 for aggregating and acting on a set of responses from a set of paths at a branch node of an interconnect fabric, and a block diagram 610 of the interconnect fabric executing the flow of flow chart 600, in accordance with particular embodiments of the invention disclosed herein. The flow chart 600 begins with step 601, in which a set of responses is received at a branch node. The branch node may be branch node 611 in block diagram 610. The responses may be received on the same lines used in block diagram 610 to convey routing information and transmission data between the various nodes. If the interconnect fabric is a bidirectional fabric, multicast data sent downstream may travel on data lines while responses travel upstream on the corresponding data lines for data transfer in the opposite direction. Alternatively, the responses may be sent on dedicated lines between the nodes of the interconnect fabric. The dedicated lines may be upstream and/or downstream dedicated lines that follow the same set of hops used for routing the multicast data through the fabric. As shown, two paths branch off from branch node 611. One branch returns a success response generated by termination node 511, and the other returns an indicator of resource contention, in the form of a failure message, generated by a node in the set of nodes 612. Using the methods disclosed herein, contention may occur at any node in the set of nodes 612, and the failure response will be appropriately aggregated such that it is routed to branch node 611.

The flow chart 600 continues with step 602, in which the aggregated messages are analyzed. The branch node may evaluate all of the received responses by placing them into buffers and then applying them to a set of logic gates for evaluation once all the reserved buffers are full. The number of reserved buffers may be set based on a count of the number of paths originating from the branch node. The logic circuitry may implement an AND operation such that a success message is passed upstream only if all the responses are successes, and a different set of events is executed if any response includes a failure message.
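The buffer-then-AND evaluation described above can be modeled as follows. This is a hypothetical software model of behavior the disclosure attributes to hardware logic gates; the class and method names are invented for this illustration:

```python
class BranchNode:
    """Model of branch-node response aggregation: buffer one response per
    downstream path, then evaluate the set as a logical AND."""
    def __init__(self, num_branches: int):
        self.expected = num_branches  # count of paths originating here
        self.buffer = []

    def receive(self, response: str):
        """Buffer a response; evaluate once every branch has reported.
        Returns None while waiting, else the message to pass upstream."""
        self.buffer.append(response)
        if len(self.buffer) < self.expected:
            return None                # still waiting on other branches
        if all(r == "success" for r in self.buffer):
            return "success"           # AND of all responses
        return "failure"               # also triggers downstream deallocation
```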

Step 603 is the step of passing a success message upstream. This step involves generating a success message, or forwarding one, upstream toward the node's data source. The step may be executed using a dedicated signal line for the signal. The branch nodes and any intermediate or termination nodes may temporarily store routing information to support the transmission, which allows the nodes to track which direction is upstream for that transmission.

If any of the responses indicates a failed allocation, the steps taken include step 604 and step 605. These steps may include transmitting a deallocation message downstream from the branch node and transmitting an indicator of resource contention upstream, in response to the indicator of resource contention. In step 604, a failure message is transmitted from the branch node back upstream. This is illustrated by branch node 611 sending a failure signal along the multicast path toward source node 312. In step 605, a deallocation message is passed downstream from the branch node. This is illustrated by branch node 611 sending a deallocation message along the multicast path toward termination node 511. The deallocation message may be sent on a dedicated signal line or on the same signal lines used to transmit data. In particular, if a given interconnect fabric uses wormhole routing, in which the tail flit of a packet deallocates the resources at a given node, the deallocation messages may be transmitted and acted upon in substantially the same manner as those tail flits. The distributed logic of the interconnect fabric may be designed so that, upon receipt of a deallocation message, a node releases control of any resources reserved on it. The logic circuitry of the branch node may be programmed to send a deallocation message downstream on every branch originating from the branch node. In particular embodiments, the branch node may instead be programmed to send the deallocation message downstream only on those branches originating from the branch node that responded with a success response. In these embodiments, the branch nodes, intermediate nodes, source node, and termination nodes may be programmed to deallocate the multicast's resources in response to receiving a failure message, such that any branch that reported a failure will already have been deallocated by the time the failure response was generated and forwarded upstream. Intermediate nodes can be programmed to deallocate upon receipt of the deallocation message and to forward it further downstream.

In particular embodiments of the invention, the emergent behavior of the interconnect fabric in response to an indicator of resource contention from a particular node will cause the interconnect fabric to consistently deallocate the resources that had been speculatively allocated to support the transmission. This function is illustrated by step 304 of flow chart 300, in which the interconnect fabric waits a hold period before reallocating resources for the multicast in a second iteration of step 301. As described above, the various nodes in the interconnect fabric may each be programmed to deallocate any resources reserved for a transmission in response to receiving a deallocation message or an indicator of resource contention. For example, a failure message routed from downstream or a deallocation message routed from upstream may cause a node to release any resources reserved for the transmission associated with those messages. Thus, after all the responses have been aggregated, the aggregate effect across the nodes will be a complete deallocation of the transmission's resources. The resources of the interconnect fabric will then be available to support another transmission, such as the one that originally caused the resource contention.

In addition to deallocating any allocated resources, the interconnect fabric may also wait a hold period before retrying the transmission. The logic to control the hold period may be provided by logic circuitry in the source node of the transmission. In particular, the logic to determine the duration of the hold period and to implement it may be provided by logic circuitry in the source node of the transmission. The routines required to set and implement the hold period may be initiated once the source node receives an indicator of resource contention, e.g., a failure message. This function is shown in flow chart 300 as step 305.

As an alternative to deallocating resources and waiting for a reallocation, the source node may, when the allocation succeeds, start the transmission. This function is shown in flow chart 300 as step 303. After completion of step 303, the multicast will be complete and the resources associated with the transmission can be deallocated. In general, the amount of time required before a source node can begin transmission will be the time it takes for the source node to obtain an indication of a successful tree allocation, which is t = 2 × D × L, where D is the hop-count distance to the farthest destination in the multicast and L is the single-hop latency of transmission through the fabric. In a NoC with a unidirectional regular square topology with fewer than 100 terminals (such that the maximum distance D is on the order of 20), where tens of bytes (e.g., 16, 32, or 64) can be transmitted per cycle, the time t typically evaluates to a period equivalent to transmitting several hundred or a few thousand bytes. A transfer of tens of kilobytes will therefore find this latency relatively insignificant, and workloads that routinely involve transfers of this length or longer will be relatively unaffected by it. In particular embodiments of the invention, since a branch node holds the responses received from all the paths originating from it and only transmits a success message upstream once it has received feedback from each path, the aggregation of responses will be complete for the entire fabric by the time the source node receives a success message, and upon receiving that success message the source node will know that all resources have been allocated for the transmission. In particular embodiments of the invention, the transmission may begin before confirmation is received that the resources have been successfully allocated. In other words, the fabric does not need to wait the full 2DL period before starting the transmission. Referring to flow chart 300, this means that step 303 can begin before all the responses are aggregated, and even before all the resources have been allocated for the transmission. Interconnect fabrics that exhibit this behavior may be said to practice speculative transmission in addition to speculative allocation.
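The latency arithmetic above can be checked with a short calculation using the example figures from the text (D on the order of 20, tens of bytes per cycle). The function name and the choice of L = 1 cycle and 32 bytes per cycle are illustrative assumptions:

```python
def allocation_latency_cycles(hops_to_farthest: int, single_hop_delay: int) -> int:
    # t = 2 * D * L: the allocation messages travel D hops downstream and
    # the aggregated responses travel D hops back upstream.
    return 2 * hops_to_farthest * single_hop_delay

# With D = 20 and L = 1 cycle, allocation takes 40 cycles; at 32 bytes per
# cycle that window is equivalent to transmitting 1280 bytes, consistent
# with the "several hundred or a few thousand bytes" figure above.
cycles = allocation_latency_cycles(20, 1)
bytes_equivalent = cycles * 32
```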

In certain embodiments of the present invention, speculative transmission may be conducted in various ways. The data transmission may begin before all the responses are aggregated. It may start immediately after the allocation message is transmitted from the source node, or it may begin after waiting a period of time. These embodiments are best applied where writing the data of an incomplete multicast along the path causes no irreversible side effects. For example, in some environments, writing transmission data to a register on a node triggers a routine on the node that increments a counter that cannot be reset by subsequent routing or flow control information sent to the node. As another example, in some environments, writing transmission data to a register on a node may result in a write at an address that is not easily deleted or rewritten. In such cases, the routing or flow control system of the interconnect fabric has no way to notify higher-level systems of the multicast failure and revoke the write. In these implementations, it is advisable to withhold writing any data of the multicast until all responses have been successfully aggregated. In implementations where writes do cause side effects, but those side effects are reversible, such as setting a counter or writing to a register that must be cleared or rolled back if the resource allocation fails, the wait period before a speculative transmission may be calibrated to balance the investment in writing the data, and the cost of undoing it, against the probability that the transmission's resources are successfully allocated.

In particular embodiments, the source node may be programmed to implement the hold period in various ways. The hold period need not be set using any specific information about the transmission that caused the resource contention. Thus, no global coordination or inter-source coordination and negotiation is needed in order to share resources on the interconnect fabric. Instead, the hold period may be derived using only data programmed into the source node or data delivered to the source node in the indicator of resource contention. For example, the failure message may include information regarding the distance from the source node to the point of contention, which may be taken into account when determining the hold period; or a counter of the number of transmission attempts may be kept at the source node and likewise taken into account when calculating the hold period. The hold period may be a fixed value equal to the average time required for a multicast on the interconnect fabric to complete. In this way the interconnect fabric will have reasonable confidence that whatever transmission caused the collision will have completed by the time the resources are reallocated. The fixed value may be determined at design time, or it may be periodically updated based on measurements taken by the interconnect fabric while servicing the terminals under a standard workload. The hold period may also be a random value with a bias set according to the principles above, or a random offset from the fixed value above. This approach helps ensure that multiple transmissions blocked by an initial multicast do not subsequently attempt to reserve overlapping resources at exactly the same time and continue to block each other in an endless loop. The hold period may also be set to increase each time a transmission is forced to hold, and may increase exponentially each time a transmission is forced to hold. For these purposes, the source node may increment a counter each time an allocation message is sent on a given path, to track the number of allocation attempts. The source node may also be designed to raise a deadlock flag to a higher-level system when a maximum number of iterations is exceeded. The higher-level system may then initiate a global resource allocation routine to break the cycle.
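A hold-period policy combining the ideas above (exponential growth per attempt, random jitter, a cap, and a deadlock flag after a maximum number of attempts) might be sketched as follows. The constants `base`, `cap`, and `MAX_ATTEMPTS` are illustrative assumptions, not values from the disclosure:

```python
import random

def hold_period(attempt: int, base: int = 100, cap: int = 10_000) -> int:
    """Hold period in cycles before reallocation, derived only from local
    state at the source node (no global coordination). Exponential growth
    per failed attempt, plus random jitter so two blocked sources do not
    retry in lockstep and block each other in an endless loop."""
    backoff = min(base * (2 ** attempt), cap)
    return backoff + random.randrange(base)

MAX_ATTEMPTS = 8  # hypothetical limit before flagging a deadlock

def on_failure(attempt: int) -> int:
    """Return the next hold period, or raise a deadlock flag to a
    higher-level system once the attempt limit is exceeded."""
    if attempt >= MAX_ATTEMPTS:
        raise RuntimeError("deadlock flag raised to higher-level system")
    return hold_period(attempt)
```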

Fig. 7 shows a flow chart 700 of a source node's processing of a set of responses and its reaction to them, and a block diagram 710 of an interconnect fabric executing the flow of flow chart 700. The process begins with the source node receiving a response from each of the one or more branches that will support its transmission. Intermediate nodes on these branches may be programmed to forward any responses received from downstream further upstream. The branch nodes on these branches may be programmed to conditionally send a response upstream after aggregating the responses from their respective branches. Nodes that detect resource contention may be programmed to generate a response and transmit it upstream. Thus, in step 701, source node 312 may process a set of responses from each path on which it needs to transmit. In block diagram 710, this includes one path down through intermediate node 711 to termination node 712, and a second path to termination node 713. Block diagram 710 illustrates the fact that a source node, such as source node 312, may have to analyze multiple responses because multiple branches may originate from the source node. The responses may include success messages from all the termination nodes (e.g., node 713 and node 712) to which the source node will transmit data. Alternatively, as shown, at least one response may be an indicator of resource contention, e.g., a failure message generated in response to resource contention detected while allocating resources along the path to termination node 713.

The flow chart 700 continues with step 702, in which the responses from the various paths supporting the source node's transmission are analyzed. This step may involve aggregation and evaluation logic similar to that described above with reference to a branch node executing step 602. In particular, the source node may be programmed to proceed to step 703 and initiate a transmission if and only if all branches respond with a success message; or, if the interconnect fabric is programmed for speculative transmission, the transmission may begin before a success message is received. Further, if any branch responds with a failure message, the source node may be programmed to continue to step 704, determining a hold period, and to step 705, transmitting a deallocation message downstream. The hold period may be determined in any of the various ways described above. A deallocation message may be sent in step 705 if the source node has multiple branches and one node responds with a failure message while another responds with a success message or has not yet responded. The distributed logic of the interconnect fabric may be programmed so that the deallocation message passes downstream and either proceeds all the way to a termination node (e.g., node 712) or meets an indicator of resource contention traveling upstream at an intermediate node (e.g., node 711), effecting a complete deallocation of all resources in either case. A branch node may also be programmed to push the deallocation message downstream on any branch from which it has not yet received a response.

The processes described above with reference to figs. 3-6, in terms of the behavior of particular nodes and the emergent behavior of the interconnect fabric, may generally begin with detecting resource contention at a node on a set of paths for a data transmission while allocating a set of resources that will support the data transmission, and generating an indicator of the resource contention at that node in response to detecting the resource contention. The indicator of resource contention may be the failure message described above. Specific embodiments of these steps can be described with reference to figs. 8-9.

In a particular embodiment of the invention, the interconnect fabric may include a set of distributed logic circuits distributed among the set of nodes comprising the interconnect fabric, the logic circuits programmed to perform a method for multicasting on the interconnect fabric that includes detecting resource contention at a node in the set of nodes while allocating a set of resources to support the multicast. This step is illustrated by step 901 in flowchart 900 of fig. 9. The step may be performed at a router of the interconnect fabric, such as router 402, the details of which are shown in block diagram 800. The router includes a set of input ports 802, a set of output ports 803, a crossbar 804, and an allocator 805. The set of input ports 802 includes a set of buffers, illustrated by buffer 806 and buffer 807, with an ellipsis indicating that there may be as many buffers as there are inputs to the router. The set of output ports 803 likewise includes a set of buffers, illustrated by buffers 808 and 809, with an ellipsis indicating that there may be as many buffers as there are outputs from the router. The number of inputs and outputs will depend on the topology and, like the topology itself, does not limit the invention. In certain embodiments of the invention, the channels of crossbar 804 and the buffers comprising the sets of input and output ports are the resources that must be used to support data transmissions across the fabric. Accordingly, as in step 901, detecting resource contention at a node may include the router detecting that it does not have sufficient resources, in terms of buffer space or crossbar channels, to support another data transmission because existing data transmissions are already using those resources.
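The resource model just described can be sketched in Python. The class, its fields, and the one-of-each allocation rule are illustrative assumptions about a router like that of block diagram 800, not a definitive implementation:

```python
class RouterResources:
    """Toy model of a router's allocatable resources: free input
    buffers, free output buffers, and free crossbar channels."""

    def __init__(self, input_buffers, output_buffers, crossbar_channels):
        self.input_buffers = input_buffers
        self.output_buffers = output_buffers
        self.crossbar_channels = crossbar_channels

    def try_allocate_virtual_channel(self):
        """Reserve one input buffer, one output buffer, and one crossbar
        channel for a new transmission, or report contention (step 901)
        without reserving anything."""
        if min(self.input_buffers, self.output_buffers,
               self.crossbar_channels) < 1:
            return False  # resource contention detected
        self.input_buffers -= 1
        self.output_buffers -= 1
        self.crossbar_channels -= 1
        return True
```

A router already supporting as many transmissions as its buffers and crossbar allow will return `False` for the next request, which is the contention condition of step 901.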

In particular embodiments of the invention, the detection of resource contention at a node may occur in a router at the node and may include the processing of an allocation message by the router. This step may also be part of a standard resource allocation transmitted across the interconnect fabric. In flowchart 900, this step is represented by step 903: an allocation flit is received at the node. In block diagram 800, the allocation message may be an allocation flit 820 received at buffer 807. The resource contention may then be detected by detecting an allocation failure of the allocation flit. This step is represented by step 906: an allocation failure of the allocation flit is detected. The step may include allocator 805 examining the resource allocation request in the allocation flit and determining that router 402 cannot support the requested resource allocation. The allocation failure may be due to a lack of available buffers or crossbar channels on router 402 to support the virtual channel requested by allocation flit 820. A virtual channel may require allocator 805 to allocate at least one buffer in output ports 803, at least one buffer in input ports 802, and at least one crossbar channel, from the set of crossbar channels in crossbar 804, between those buffers. If these resources are not available because the router is already supporting as many transmissions as it can, allocator 805 may immediately return a failure indicator on dedicated line 810 and deallocate any resources that had already been designated for the transmission. Notably, the architecture of router 402 also allows an intermediate node to receive success or failure messages from downstream on a second dedicated line 811, respond to a failure message by processing it with allocator 805, thereby deallocating the resources on the node represented in block diagram 800, and forward the success or failure message upstream on dedicated line 810.
Dedicated line 810 and second dedicated line 811 may be dedicated signal lines in a set of dedicated signal lines connected to all nodes in the set of nodes of the interconnect fabric. The set of dedicated signal lines may run in parallel with the data transmission lines of the interconnect fabric and follow the same hop pattern through it. Additionally, the dedicated lines may be bidirectional regardless of whether the topology is unidirectional or bidirectional.
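The allocator behavior around the two dedicated lines can be summarized in a small Python sketch. The function names, the message strings, and the use of a simple reservation count are assumptions made for illustration; the sketch only mirrors the two cases described above (an allocation failing locally, and a failure arriving from downstream):

```python
def process_allocation_flit(resources_free, reserved):
    """Handle an allocation flit at a router.

    Returns (message, updated_reservation_count). On success the flit is
    forwarded downstream; on failure a failure indicator goes upstream
    (dedicated line 810) and local reservations are released.
    """
    if resources_free - reserved >= 1:
        return ("forward_downstream", reserved + 1)
    return ("failure_upstream", 0)  # deallocate anything already reserved

def process_downstream_response(message, reserved):
    """Handle a response arriving from downstream (second dedicated
    line 811): a failure releases this node's reservations before the
    message is forwarded upstream; a success is simply forwarded."""
    if message == "failure":
        reserved = 0
    return ("forward_upstream", reserved)
```

This captures the intermediate-node role: responses always continue upstream, but only a failure causes local deallocation along the way.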

In a particular embodiment of the invention, the interconnect fabric may include a set of distributed logic circuits distributed among the set of nodes comprising the interconnect fabric and programmed to perform a method for multicasting on the interconnect fabric that includes generating an indicator of resource contention at a node in response to detecting the resource contention. The indicator of resource contention may be transmitted upstream on the dedicated signal lines. Referring to flowchart 900 in fig. 9, the step of generating an indicator of resource contention is represented by step 902. This may occur immediately upon detecting the resource contention at the node and may include generating a failure signal for transmission upstream from the node. For example, it may include allocator 805 creating a signal for transmission on dedicated signal line 810. In an alternative embodiment, this step may include generating a return signal for transmission back along the main data lines of the interconnect fabric.

In particular embodiments of the invention, the processing of allocation messages may be prioritized so as to limit the amount of time it takes for a failed resource allocation to be reported back to a branch or source node in the system. For example, receipt of an allocation flit by a router may trigger a prioritized flit handling routine that overrides the allocator's processing of other messages on the router. In general, processing a flit includes receiving the flit at an input buffer, allocating a virtual channel for the packet or other transmission associated with the flit, arbitrating for the crossbar, and outputting the data to an output buffer. Allocating the virtual channel may include ensuring that the router is set to reserve space to receive and transmit flits over the identified channel until the channel is no longer needed. Such a conventional pipeline for flit processing may take 4-5 clock cycles per hop. In a particular embodiment of the invention, the logic circuits of the routers in the interconnect fabric are programmed to prioritize and immediately process path allocation messages. This results either in the immediate allocation of resources and forwarding of the allocation message downstream, or in the reporting of an allocation failure and the blocking or deallocation of any resource allocation for that transmission on the router. In a particular embodiment of the invention, the distributed logic of the interconnect fabric may be programmed to perform this action in two cycles per hop: the first cycle receives the flit and identifies it as an allocation flit, and the second cycle checks whether the allocation is available and either immediately designates the resources and forwards the flit or returns a failure message.
In particular embodiments, this is accomplished by using flits with reserved header bits that are written to a reserved register on an input port of a router in the interconnect fabric to trigger the prioritized flit handling routine.

Flowchart 900 includes optional steps 904 and 905 to illustrate steps associated with the prioritization of allocation messages. In particular embodiments of the invention, detecting resource contention may comprise step 904: determining that the allocation flit is a multicast allocation flit, and step 905: triggering a prioritized flit handling routine at the node in response to determining that the allocation flit is a multicast allocation flit. The standard flit handling routine at the node takes more clock cycles than the prioritized flit handling routine. For example, as described above, a conventional flit handling routine may take 4-5 clock cycles, while the prioritized flit handling routine may complete in two clock cycles.
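The dispatch between the standard pipeline and the prioritized routine of steps 904-905 can be sketched as follows. The specific bit position, the cycle counts, and the return values are illustrative assumptions; the specification only requires that a reserved header bit trigger the faster routine:

```python
MULTICAST_ALLOC_BIT = 0x1  # assumed reserved header bit for multicast allocation

def handle_flit(header, allocation_available):
    """Return (routine, cycles_per_hop, outcome) for an incoming flit.

    A flit with the reserved header bit set takes the two-cycle
    prioritized path: cycle 1 receives and identifies it, cycle 2 either
    designates resources and forwards it or returns a failure message.
    Other flits take the conventional 4-5 cycle pipeline (buffering,
    virtual-channel allocation, crossbar arbitration, output).
    """
    if header & MULTICAST_ALLOC_BIT:
        outcome = "forward" if allocation_available else "failure"
        return ("prioritized", 2, outcome)
    return ("standard", 5, "forward")
```

The point of the sketch is the asymmetry: the prioritized path resolves to "forward" or "failure" in two cycles per hop, so a failure indicator reaches the branch or source node far sooner than it would through the standard pipeline.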

In particular embodiments of the invention, the processing of deallocation messages may also be prioritized. The prioritized processing of a deallocation message may use the same routine as the prioritized processing of an allocation message. In particular embodiments, both allocation and deallocation messages will be received on the same hardware at the nodes along the transmission path. For example, header and tail flits similar to those of wormhole-routed packets may be received and processed by the router, and allocation and deallocation messages may be received and processed in a similar manner. Flowchart 910 illustrates a method of deallocating a set of resources, the method comprising step 911: a deallocation message is received at a node. The node may be a node downstream of the branch node that generated the deallocation message. For example, in block diagram 800, the deallocation message may be deallocation message 821 received from an upstream node at input buffer 806. The flowchart also includes step 912: in response to receiving the deallocation message, a prioritized flit handling routine is triggered at the downstream node. The prioritized flit handling routine may be the same prioritized flit handling routine described above with respect to allocation messages. In a particular embodiment of the invention, the deallocation message may use the same dedicated bits in the message header as are used for the allocation messages of the embodiments described above.
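Steps 911-912 can be sketched in the same style. The shared header bit, the function name, and the returned tuple are assumptions for illustration; the essential points from the text are that the deallocation message reuses the same reserved header bit, triggers the same prioritized routine, releases this node's reservations, and continues downstream:

```python
PRIORITY_BIT = 0x1  # assumed reserved header bit shared with allocation flits

def handle_deallocation(header, reserved):
    """Process a deallocation message at a downstream node.

    Returns (routine, released_count, next_action): everything this node
    held for the transmission is released, and the message is forwarded
    downstream toward the terminating nodes.
    """
    routine = "prioritized" if header & PRIORITY_BIT else "standard"
    released = reserved  # free all local resources for this transmission
    return (routine, released, "forward_downstream")
```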

In particular embodiments of the invention, the interconnect fabric may include the ability to disable the logic associated with implementing speculative allocation. The process may include determining that speculative allocation will not be needed and disabling the steps associated with the speculative allocation methods disclosed herein. The determination may include determining that, for a given workload the terminals of the interconnect fabric will execute, deadlock conditions are impossible or can be avoided entirely given certain rules for controlling the execution of the workload and the transfer of data through the interconnect fabric. The determination and disabling may be performed by a system at a higher level than the interconnect fabric hardware and routers. For example, these steps may be performed by a controller on one of the terminals of the interconnect fabric, such as the processing core of terminal 401. Alternatively, the interconnect fabric may have a dedicated higher-level controller for this role. Such a controller may also act as a terminal of the interconnect fabric and be dedicated to setting the state of the various routers for debugging, initializing, maintaining, and testing the interconnect fabric.

In a particular embodiment of the invention, the determination mentioned in the preceding paragraph may be made by a compiler tasked with compiling the program code for the workload the terminals of the interconnect fabric are to execute. For a particular workload, the compiler that generates the program code may be able to determine statically, at compile time, that deadlock conditions are entirely avoidable, making the overhead associated with speculative allocation superfluous. The compiler may make this determination for the entire workload or on a transmission-by-transmission basis. The process may include compiling a set of instructions for execution on a set of processing cores and determining that multicast resource contention is avoidable for one or more transmissions to be made during the execution of those instructions. The process may include statically marking each transfer instruction to indicate whether it requires speculative allocation. In certain embodiments of the invention, no additional mechanism is required at runtime to support this optimization in the hardware executing the workload. The compiling and determining steps may be controlled by a higher-level control system as described above.

In particular embodiments of the invention, the disabling of the speculative allocation process may be performed in a variety of ways. The process may include compiling, using a compiler, a set of instructions for execution on a set of processing cores linked by the interconnect fabric, and disabling the resource contention detection routines when the compiler determines that multicast resource contention is avoidable. Disabling speculative allocation may include disabling the hold time at the source node, or at all source nodes that transmit data over the interconnect fabric, so that a source node does not wait to receive responses from the terminating nodes of a transmission before beginning the data transmission. The disabling may apply to all source nodes, to a particular set of source nodes, or only when a particular source node is executing a particular transmission. Disabling speculative allocation may further include disabling prioritized allocation message processing and any additional resource contention detection logic that adds overhead beyond the usual processing of allocation messages. The act of disabling the speculative allocation process may be managed by the higher-level control system described above and may comprise setting the values of status registers on the various routers making up the interconnect fabric so that they shut off any speculative allocation processing. Alternatively, the actions may include setting a header bit on messages to be sent through the interconnect fabric to indicate that they should not be processed according to the speculative allocation process, or setting a flag on particular instructions that will be compiled into transmissions by the routers of the interconnect fabric. The act of disabling may further comprise both setting the state of the routers of the interconnect fabric and changing the header composition of the data to be transmitted through the interconnect fabric. The compilation, determination, and disabling may be controlled by the higher-level control system described above.
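The per-transmission marking described in these paragraphs can be illustrated with a hypothetical compiler pass. The function, the transfer tuples, and the idea of a precomputed set of statically proven contention-free transfers are all illustrative assumptions; the specification does not prescribe how the compiler performs its static analysis:

```python
def mark_transfers(transfers, contention_free):
    """Tag each transfer with whether it needs speculative allocation.

    transfers: list of (name, is_multicast) pairs emitted by the compiler.
    contention_free: set of transfer names the compiler statically proved
    free of multicast resource contention (and thus of deadlock risk).
    Returns (name, needs_speculative_allocation) pairs; unicast transfers
    and proven-safe multicasts skip the speculative allocation overhead.
    """
    return [(name, is_multicast and name not in contention_free)
            for name, is_multicast in transfers]
```

Because the marking is static, no additional runtime mechanism is needed: a transfer marked `False` is simply compiled without the hold time and prioritized handling.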

In certain embodiments of the present invention, all of the steps of flowchart 300 may be performed entirely in hardware, for example in distributed logic implemented on the routers, and the signal lines connecting the routers, disclosed herein. For example, in embodiments in which multicast 311 is a multicast of data distributed for a composite computation executed by the set of processing cores, all of the method steps disclosed herein may be performed by hardware in the interconnect fabric without executing any computation on any of those processing cores. In this way, the various processing cores may execute the individual component computations of their composite computation irrespective of the operation of the routing and flow control algorithms of the interconnect fabric.

While the specification has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing, may readily conceive of alterations to, variations of, and equivalents to these embodiments. The present description should not be limited to interconnect fabrics applied to a set of processing cores performing a complex computation, as the disclosed interconnect fabrics and associated methods may be applied to improve the performance of SoCs with widely varying terminals. Furthermore, the present disclosure should not be limited to any particular type of complex computation performed on a multi-core processor, as the programming and execution of any complex computation may be aided by particular embodiments disclosed herein, including training or drawing inferences from an artificial neural network, or executing hashing, encryption, decryption, or graphics rendering algorithms comprising numerous component computations. These and other modifications and variations to the present invention may be practiced by those of ordinary skill in the art without departing from the scope of the present invention, which is more particularly set forth in the appended claims.
