System and method for facilitating data-driven intelligent networks employing endpoint congestion detection and control

Document No.: 1909819. Publication date: 2021-11-30.

Abstract: The technology "System and method for facilitating data-driven intelligent networks employing endpoint congestion detection and control" was created by A·M·巴塔耶纳, T·J·约翰逊, T·L·科特, D·C·休森, J·P·比克罗夫特, J·G· on 2020-03-23. Data-driven intelligent networking systems and methods are provided. The system can adapt to dynamic traffic through fast, efficient endpoint congestion detection and control. The system may maintain state information for individual packet flows, which may be dynamically established or released based on the injected data. Upon arrival at a switch, each flow may be provided with a flow-specific input queue. Packets of a respective flow may be acknowledged upon reaching an egress point of the network, and the acknowledgement packets may be sent back along the same data path to the ingress point of the flow. Thus, each switch can obtain state information for each flow and perform flow control on a per-flow basis.

1. A method, comprising:

detecting endpoint congestion associated with a packet flow at an egress port of a destination switch of the packet flow, wherein the flow corresponds to packets having one or more common header fields and is identified by a flow identifier that is unique within an ingress port of an egress switch; and

generating an endpoint congestion notification message specific to the flow to be transmitted to a source of the flow.

2. The method of claim 1, wherein detecting the endpoint congestion comprises monitoring an output buffer associated with the egress port and calculating one or more congestion parameters.

3. The method of claim 2, wherein the congestion parameters comprise one or more of:

the total amount of data stored in the output buffer;

a total number of packets stored in the output buffer; and

a rate of change of data stored in the output buffer and data stored in a corresponding link partner input buffer.

4. The method of claim 3, wherein the calculation of the one or more parameters is performed when a packet arrives at or leaves the output buffer; and

wherein the endpoint congestion notification message indicates a maximum value of the calculated congestion parameters.

5. The method of claim 1, wherein the congestion notification message comprises the flow identifier and a congestion indicator indicating a level of congestion.

6. The method of claim 1, further comprising generating a regular acknowledgement message in response to a packet being transmitted to the egress port and the congestion having cleared.

7. A method, comprising:

receiving an endpoint congestion notification message comprising a flow identifier, wherein the flow identifier identifies a packet flow having one or more common header fields;

determining an endpoint congestion level based on a congestion indicator in a received message; and

applying flow control to packets of the flow based on the endpoint congestion level.

8. The method of claim 7, wherein applying flow control to packets of the flow comprises monitoring an input queue of the flow and determining a set of conditions for applying credit-based flow control to an upstream switch.

9. The method of claim 8, wherein applying credit-based flow control to the upstream switch comprises generating a message for the upstream switch indicating an amount of data that can be transmitted by the upstream switch.

10. The method of claim 8, wherein determining the set of conditions comprises determining, based on the endpoint congestion level, a target watermark and a drop watermark for monitoring the input queue of the flow.

11. A switch, comprising:

congestion detection logic to detect endpoint congestion associated with a flow of packets at an egress port of the switch, wherein the switch is a destination of the flow, and wherein the flow corresponds to packets having one or more common header fields and is identified by a flow identifier that is unique within an ingress port of the switch; and

an endpoint congestion notification message generation logic block to generate a flow-specific endpoint congestion notification message to be transmitted to a source of the flow.

12. The switch of claim 11, wherein the congestion detection logic block comprises an output buffer monitor to monitor an output buffer associated with the egress port and congestion parameter calculation logic block to calculate one or more congestion parameters.

13. The switch of claim 12, wherein the congestion parameters comprise one or more of:

the total amount of data stored in the output buffer;

a total number of packets stored in the output buffer; and

a rate of change of data stored in the output buffer and data stored in a corresponding link partner input buffer.

14. The switch of claim 13, wherein, in calculating the one or more parameters, the congestion detection logic is to perform the calculation when a packet arrives at or leaves the output buffer; and

wherein the endpoint congestion notification message indicates a maximum value of the calculated congestion parameters.

15. The switch of claim 11, wherein the congestion notification message comprises the flow identifier and a congestion indicator indicating a level of congestion.

16. The switch of claim 11, further comprising acknowledgement generation logic to generate a regular acknowledgement message in response to a packet being transmitted to the egress port and the congestion having cleared.

17. A switch, comprising:

congestion management logic to receive an endpoint congestion notification message and determine an endpoint congestion level based on a congestion indicator in the received message, wherein the received message includes a flow identifier that identifies a packet flow having one or more common header fields; and

flow control logic to apply flow control to packets of the flow based on the endpoint congestion level.

18. The switch of claim 17, wherein, in applying flow control to packets of the flow, the flow control logic is to monitor an input queue of the flow and determine a set of conditions for applying credit-based flow control to an upstream switch.

19. The switch of claim 18, wherein, when applying credit-based flow control to the upstream switch, the flow control logic is to generate a message for the upstream switch indicating an amount of data that can be transmitted by the upstream switch.

20. The switch of claim 18, wherein, in determining the set of conditions, the flow control logic is to determine, based on the endpoint congestion level, a target watermark and a drop watermark for monitoring the input queue of the flow.

Technical Field

The present disclosure relates generally to the field of networking technology. More particularly, the present disclosure relates to systems and methods for facilitating scalable, data-driven intelligent networks employing endpoint congestion detection and control.

Background

As network-enabled devices and applications become more prevalent, the variety of traffic and the ever-increasing network load continue to demand higher performance from the underlying network architecture. For example, applications such as High Performance Computing (HPC), streaming media, and the Internet of Things (IoT) may generate different types of traffic with distinctive characteristics. As a result, in addition to traditional network performance metrics such as bandwidth and latency, network architects continue to face challenges in scalability, versatility, and efficiency.

Disclosure of Invention

A data-driven intelligent networking system employing endpoint congestion detection and control is provided. The system can adapt to dynamic traffic through fast and efficient congestion control. The system may maintain state information for individual packet flows, which may be dynamically established or released based on injected data. Upon arrival at a switch, each flow may be provided with a flow-specific input queue. Packets of a respective flow are acknowledged upon reaching an egress point of the network, and the acknowledgement packets are sent back along the same data path to the ingress point of the flow. Thus, each switch can obtain per-flow state information and perform endpoint congestion detection and control on a flow-by-flow basis.

Drawings

FIG. 1 illustrates an example network that facilitates flow channels.

FIG. 2A illustrates an example switch that facilitates flow channels.

FIG. 2B shows an example of how switches on a data path may maintain flow state information.

FIG. 3A shows an exemplary fabric header of a data packet.

FIG. 3B illustrates an exemplary acknowledgement (ACK) packet format.

FIG. 3C illustrates the relationship between different variables used to obtain and maintain state information for a flow.

FIG. 4A shows an example of how a flow may be transmitted using flow channel tables.

FIG. 4B shows an example of an Edge Flow Channel Table (EFCT).

FIG. 4C shows an example of an Input Flow Channel Table (IFCT).

FIG. 4D shows an example of an Output Flow Channel Table (OFCT).

FIG. 5 shows an example of unfair sharing of link bandwidth that may occur in a network.

FIG. 6 shows an example of endpoint congestion.

FIG. 7A illustrates a flow diagram of an exemplary process of generating an explicit endpoint congestion notification ACK.

FIG. 7B illustrates an exemplary endpoint congestion management logic block.

FIG. 8 shows a flow diagram of an exemplary process of generating an ACK in response to a packet being dequeued from an output buffer.

FIG. 9A shows a flow diagram of an exemplary fine-grained flow control (FGFC) process.

FIG. 9B shows an example of a network interface controller supporting FGFC.

FIG. 10 shows an example of fabric link congestion.

FIG. 11 shows a flow diagram of an example process for applying credit-based flow control on a congested fabric link.

FIG. 12 illustrates an example edge switching system that facilitates flow channels.

FIG. 13 illustrates an exemplary intermediate switching system that facilitates flow channels.

In the drawings, like reference numerals refer to like elements.

Detailed Description

Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. The invention is thus not limited to the embodiments shown.

Overview

The present disclosure describes systems and methods that accommodate dynamic data traffic by maintaining state information for individual packet flows, enabling fast and efficient endpoint congestion detection and control. More specifically, packets injected into a network of switches may be classified into flows, which may be mapped to their layer-2, layer-3, or other protocol-specific header information. Each flow may be tagged with an identifier that is unique within an input port of a switch and provided with a flow-specific input buffer, so that each flow can be controlled independently. In addition, packets of a respective flow may be acknowledged upon reaching the egress point of the network, and the acknowledgement packets may be sent back to the ingress point of the flow along the same data path in the reverse direction. As a result, each switch can obtain state information for the active packet flows it is forwarding and can perform highly responsive, flow-specific flow control. Such flow control allows the network to operate at higher capacity while providing versatile traffic-engineering capabilities.

In the present disclosure, a packet stream may also be referred to as a "packet flow" or simply a "flow". The data path through which the flow passes and its configuration information maintained by the switch may be referred to as a "flow channel". Furthermore, the terms "buffer" and "queue" are used interchangeably in this disclosure.

FIG. 1 illustrates an example network that facilitates flow channels. In this example, a network 100 of switches (which may also be referred to as a "switch fabric") may include switches 102, 104, 106, 108, and 110. Each switch may have a unique address or ID within switch fabric 100. Various types of devices and networks may be coupled to the switch fabric. For example, storage array 112 may be coupled to switch fabric 100 via switch 110; InfiniBand (IB)-based HPC network 114 may be coupled to switch fabric 100 via switch 108; a plurality of end hosts (e.g., host 116) may be coupled to switch fabric 100 via switch 104; and IP/Ethernet network 118 may be coupled to switch fabric 100 via switch 102. In general, a switch may have edge ports and fabric ports. An edge port may be coupled to a device external to the fabric. A fabric port may be coupled to another switch within the fabric via a fabric link.

In general, traffic may be injected into switch fabric 100 via an ingress port of an edge switch and leave switch fabric 100 via an egress port of another (or the same) edge switch. The ingress edge switch may group injected packets into flows, which may be identified by flow IDs. The concept of a flow is not limited to a particular protocol or layer (e.g., layer 2 or layer 3 in the Open Systems Interconnection (OSI) reference model). For example, a flow may map to traffic with a particular source Ethernet address, traffic between a source IP address and a destination IP address, traffic corresponding to a TCP or UDP port/IP 5-tuple (source and destination IP addresses, source and destination TCP or UDP port numbers, and IP protocol number), or traffic produced by a process or thread running on an end host. In other words, a flow may be configured to map to data between any physical or logical entities. The configuration of this mapping may be done remotely or locally at the ingress edge switch.

After receiving injected packets, the ingress edge switch may assign a flow ID to the flow. The flow ID may be included in a special header, which the ingress edge switch may use to encapsulate the injected packets. Furthermore, the ingress edge switch may inspect the original header fields of an injected packet to determine the appropriate egress edge switch's address, and include this address as the destination address in the encapsulation header. Note that the flow ID may be a locally significant value specific to a link, and this value may be unique only to a particular input port on a switch. When the packet is forwarded to the next-hop switch, the packet enters another link, and the flow ID may be updated accordingly. As the packets of a flow traverse multiple links and switches, the flow IDs corresponding to this flow may form a unique chain. That is, at every switch, before a packet leaves the switch, the packet's flow ID may be updated to the flow ID used by the outgoing link. This upstream-to-downstream one-to-one mapping between flow IDs may begin at the ingress edge switch and end at the egress edge switch. Because a flow ID only needs to be unique within an incoming link, a switch can accommodate a large number of flows. For example, if a flow ID is 11 bits long, an input port can support up to 2^11 = 2048 flows. Furthermore, the match pattern (one or more header fields of a packet) used to map to a flow may include a greater number of bits. For instance, a 32-bit match pattern, which may include multiple fields in a packet header, can distinguish 2^32 different header field patterns. If a fabric has N ingress edge ports, a total of N x 2^32 identifiable flows can be supported.
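The hop-by-hop flow ID rewriting described above can be illustrated with a minimal Python sketch. The class `HopFlowMap`, the function `walk_path`, the table layout, and the "next unused ID" allocation policy are assumptions made purely for illustration; the disclosure states only that a flow ID is local to a link and may be rewritten at each switch. Releasing IDs is deliberately left out of this sketch.

```python
FLOW_ID_BITS = 11                         # example width used in the text
MAX_FLOWS_PER_LINK = 2 ** FLOW_ID_BITS    # 2048 flow IDs per link


class HopFlowMap:
    """Hypothetical per-switch table: (input port, incoming flow ID) -> outgoing flow ID."""

    def __init__(self):
        self.mapping = {}      # (in_port, in_flow_id) -> out_flow_id
        self.next_free = {}    # out_port -> next unused flow ID on that outgoing link

    def translate(self, in_port, in_flow_id, out_port):
        """Return the flow ID to write into the header before the packet leaves."""
        key = (in_port, in_flow_id)
        if key not in self.mapping:
            out_id = self.next_free.get(out_port, 0)
            if out_id >= MAX_FLOWS_PER_LINK:
                raise RuntimeError("no free flow ID on outgoing link")
            self.next_free[out_port] = out_id + 1
            self.mapping[key] = out_id
        return self.mapping[key]


def walk_path(switches, hops, flow_id):
    """Return the chain of flow IDs a packet carries along its path.

    hops[i] is the (input port, output port) pair used at switches[i].
    """
    chain = [flow_id]
    for switch, (in_port, out_port) in zip(switches, hops):
        flow_id = switch.translate(in_port, flow_id, out_port)
        chain.append(flow_id)
    return chain
```

In this sketch, two flows that arrive at a switch on different input ports may carry the same incoming flow ID, yet they receive distinct IDs on a shared outgoing link, which is what keeps the chain one-to-one.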

A switch may assign each flow a separate, dedicated input queue. This configuration allows the switch to monitor and manage the level of congestion of individual flows, and to prevent the head-of-line blocking that can occur when a shared buffer is used for multiple flows. When a packet is delivered to the destination egress switch, the egress switch may generate an acknowledgement (ACK) and send it back to the ingress edge switch in the upstream direction along the same data path. Because the ACK packet traverses the same data path, the switches along that path can obtain state information associated with the delivery of the corresponding flow by monitoring the amount of outstanding, unacknowledged data. This state information can then be used to perform flow-specific traffic management to ensure the health of the entire network and fair treatment of the flows. As explained in more detail below, this per-flow queuing, combined with flow-specific delivery acknowledgements, can allow the switch fabric to implement efficient, fast, and accurate congestion control. In turn, the switch fabric can deliver traffic with significantly improved network utilization without suffering from congestion.

Flows may be established and released dynamically, or "on the fly", based on demand. Specifically, a flow may be established by an ingress edge switch (e.g., by establishing a mapping from packet headers to a flow ID) when a data packet arrives at the switch and no flow ID has previously been assigned to this packet. As the packet travels through the network, a flow ID may be assigned at each switch the packet traverses, and a chain of flow IDs may be established from ingress to egress. Subsequent packets belonging to the same flow may use the same flow IDs along the data path. When packets are delivered to the destination egress switch and the switches on the data path receive the ACK packets, each switch may update its state information regarding the amount of outstanding, unacknowledged data for the flow. When a switch's input queue for the flow is empty and there is no more unacknowledged data, the switch may release the flow ID (i.e., release the flow channel) and reuse it for other flows. This data-driven mechanism for dynamic flow setup and teardown may eliminate the need for centralized flow management and allow the network to respond quickly to changes in traffic patterns.
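The setup and release conditions described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class `InputPortFlows`, its method names, the dictionary layout, and the choice to count data as unacknowledged once it is dequeued are all hypothetical and are not taken from the disclosure.

```python
from collections import deque


class InputPortFlows:
    """Hypothetical per-input-port flow state (illustrative names and layout)."""

    def __init__(self, id_bits=11):
        self.free_ids = deque(range(2 ** id_bits))  # flow IDs available on this port
        self.by_header = {}   # header match pattern -> flow ID
        self.state = {}       # flow ID -> {"header", "queue", "unacked"}

    def on_packet(self, header, size):
        """Queue a packet, establishing the flow if the header is not yet mapped."""
        flow_id = self.by_header.get(header)
        if flow_id is None:
            flow_id = self.free_ids.popleft()  # IndexError if all IDs are in use
            self.by_header[header] = flow_id
            self.state[flow_id] = {"header": header, "queue": deque(), "unacked": 0}
        self.state[flow_id]["queue"].append(size)
        return flow_id

    def on_forward(self, flow_id):
        """Dequeue one packet and count it as sent but unacknowledged."""
        st = self.state[flow_id]
        size = st["queue"].popleft()
        st["unacked"] += size
        return size

    def on_ack(self, flow_id, size):
        """Account for acknowledged data; release the flow once it is idle."""
        st = self.state[flow_id]
        st["unacked"] -= size
        if not st["queue"] and st["unacked"] == 0:
            del self.by_header[st["header"]]
            del self.state[flow_id]
            self.free_ids.append(flow_id)  # the ID can now serve another flow
            return True
        return False
```

A flow here is released only when both conditions from the text hold: the input queue is empty and no unacknowledged data remains. No central controller is involved in either step.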

Note that the network architecture described herein is different from software-defined networks (SDNs), which typically use the OpenFlow protocol. In an SDN, switches are configured by a central network controller, and packets are forwarded based on one or more fields in the layer-2 (data link layer, such as Ethernet), layer-3 (network layer, such as IP), or layer-4 (transport layer, such as TCP or UDP) headers. In an SDN, such header-field lookups are performed at every switch in the network, and there is no fast forwarding based on flow IDs as is done in the networks described herein. Furthermore, because OpenFlow header-field lookups are performed using ternary content-addressable memory (TCAM), the cost of such lookups can be high. Moreover, because the header-field mapping configuration is done by the central controller, the setup and teardown of each mapping is relatively slow and can require a fair amount of control traffic. As a result, an SDN network may be slow to respond to various network conditions, such as congestion. In contrast, in the networks described herein, flows can be set up and torn down dynamically based on traffic demand, and packets can be forwarded using fixed-length flow IDs. In other words, flow channels can be data driven and managed (i.e., set up, monitored, and torn down) in a distributed manner, without the intervention of a central controller. Furthermore, flow-ID-based forwarding can reduce the amount of TCAM space used, and as a result a much larger number of flows can be accommodated.

Referring to the example shown in FIG. 1, suppose that storage array 112 is to send data to host 116 using TCP/IP. During operation, storage array 112 may send the first packet with host 116's IP address as the destination address and a predetermined TCP port specified in the TCP header. When this packet reaches switch 110, the packet processor at the input port of switch 110 may identify the TCP/IP 5-tuple of the packet. The packet processor of switch 110 may also determine that this 5-tuple is currently not mapped to any flow ID, and may allocate a new flow ID to the 5-tuple. Furthermore, switch 110 may determine the egress switch for this packet, which is switch 104, based on the destination IP address (i.e., that of host 116), assuming switch 110 knows that host 116 is coupled to switch 104. Switch 110 may then encapsulate the received packet with a fabric header that indicates the newly assigned flow ID and switch 104's fabric address. Switch 110 may then schedule the encapsulated packet to be forwarded toward switch 104 based on a fabric forwarding table, which may be computed by all the switches in fabric 100 using a routing algorithm such as a link-state or distance-vector algorithm.
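The ingress steps in this example (identify the 5-tuple, allocate a flow ID if none is mapped, look up the egress switch, encapsulate) might look like the following sketch. The types `FiveTuple` and `FabricPacket`, the class `IngressEdgePort`, and the host-to-switch lookup table are hypothetical; the actual fabric header format is the subject of FIG. 3A and is not reproduced here. The IP addresses and port numbers are placeholders.

```python
from dataclasses import dataclass
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    protocol: int     # IP protocol number, e.g. 6 for TCP
    src_port: int
    dst_port: int


@dataclass
class FabricPacket:
    """Stand-in for an encapsulated packet; field names are assumptions."""
    flow_id: int
    dest_switch: int  # fabric address of the egress edge switch
    payload: bytes


class IngressEdgePort:
    def __init__(self, host_to_switch):
        self.host_to_switch = host_to_switch  # destination IP -> egress switch address
        self.flow_ids = {}                    # FiveTuple -> flow ID on this port

    def encapsulate(self, five_tuple, payload):
        # Allocate a new flow ID only if this 5-tuple is not yet mapped.
        flow_id = self.flow_ids.setdefault(five_tuple, len(self.flow_ids))
        dest_switch = self.host_to_switch[five_tuple.dst_ip]
        return FabricPacket(flow_id, dest_switch, payload)


# Usage mirroring the example: the egress switch for the host is switch 104.
port = IngressEdgePort({"192.0.2.16": 104})
first = port.encapsulate(FiveTuple("192.0.2.12", "192.0.2.16", 6, 49152, 3260), b"data")
```

Later packets with the same 5-tuple hit the existing dictionary entry, which corresponds to the faster handling of subsequent packets once the flow ID exists.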

Note that the operations described above can be performed substantially at line speed, with little buffering and delay when the first packet is received. After the first packet is processed and scheduled for transmission, subsequent packets from the same flow can be processed by switch 110 even faster because the same flow ID is used. In addition, the flow channels can be designed such that allocating, matching, and deallocating a flow channel have substantially the same cost. For example, a conditional allocation of a flow channel based on a lookup match and a separate, independent deallocation of another flow channel can be performed concurrently in nearly every clock cycle. This means that creating and managing flow channels adds almost no overhead to the regular forwarding of packets. The congestion control mechanism, on the other hand, may improve the performance of some applications by more than three orders of magnitude.

At each switch on the data path (including switches 110, 106, and 104), a dedicated input buffer may be provided for the flow and the amount of transmitted but unacknowledged data may be tracked. When the first packet arrives at switch 104, switch 104 may determine that the destination fabric address in the fabric header of the packet matches its own address. In response, switch 104 may decapsulate the packet according to the fabric header and forward the decapsulated packet to host 116. In addition, switch 104 may generate an ACK packet and send the ACK packet back to switch 110. When the ACK packet traverses the same data path, switches 106 and 110 may each update their own state information for the flow's unacknowledged data.

In general, congestion within a network causes network buffers to fill. When a network buffer is full, traffic trying to pass through it should ideally be slowed down or stopped. Otherwise, the buffer may overflow and packets may be dropped. In conventional networks, congestion control is typically done end-to-end at the edge. The core of the network is assumed to function only as a "dumb pipe", whose main purpose is to forward traffic. Such a network design is often slow to respond to congestion, because congestion information frequently cannot be sent to the edge devices quickly, and the resulting actions taken by the edge devices are not always effective in removing the congestion. This slow response in turn limits the utilization of the network, because to keep the network free of congestion the operator often needs to limit the total amount of traffic injected into it. Furthermore, end-to-end congestion control is usually effective only if the network is not already congested. Once the network is heavily congested, end-to-end congestion control no longer works, because the congestion notification messages themselves may be delayed by the congestion (unless they are sent over a control-plane network that is separate from the data-plane network).

In contrast, flow channels can prevent such congestion from growing within the switch fabric. When a flow is experiencing some degree of congestion, the flow channel mechanism can recognize this and, in response, slow down or stop new packets of the same flow from entering the fabric. In turn, these new packets can be buffered in a flow channel queue on the edge port and are only allowed into the fabric when packets of the same flow leave the fabric at the destination edge port. This process can limit the total buffering requirement of the flow within the fabric to an amount that would not cause the fabric buffers to become too full.

With flow channels, the switches have reasonably accurate state information on the amount of outstanding, in-transit data within the fabric. This state information can be aggregated over all the flows on an ingress edge port, which means that the total amount of data injected through that ingress edge port is known. Consequently, the flow channel mechanism can set a limit on the total amount of data in the fabric. When all edge ports apply this limit, the total amount of packet data in the entire fabric can be well controlled, which in turn can prevent the entire fabric from becoming saturated. Flow channels can also slow the progress of an individual congested flow within the fabric without slowing down other flows. This feature can keep packets away from a congestion hot spot while preventing buffers from becoming full, and ensure that free buffer space remains available for unrelated traffic.
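As a rough illustration of the limiting behavior, the sketch below aggregates the outstanding data of all flows on one ingress edge port and refuses to inject more once a cap would be exceeded. The class `IngressLimiter`, the fixed cap `limit_units`, and the accounting in abstract data units are assumptions; the disclosure does not specify how the limit is chosen or enforced.

```python
class IngressLimiter:
    """Hypothetical per-ingress-port cap on data outstanding in the fabric."""

    def __init__(self, limit_units):
        self.limit = limit_units  # assumed fixed cap, in data units
        self.outstanding = {}     # flow ID -> units injected but not yet acknowledged

    def total_outstanding(self):
        """Aggregate the in-flight data over all flows on this port."""
        return sum(self.outstanding.values())

    def try_inject(self, flow_id, units):
        """Admit a packet only if the aggregate stays within the cap."""
        if self.total_outstanding() + units > self.limit:
            return False  # caller keeps the packet in the edge flow channel queue
        self.outstanding[flow_id] = self.outstanding.get(flow_id, 0) + units
        return True

    def on_ack(self, flow_id, units):
        """An ACK from the egress edge frees room for new injections."""
        self.outstanding[flow_id] -= units
```

A packet that is refused stays buffered at the edge and can be retried after an ACK arrives, so the fabric's buffers never have to absorb more than the cap from this port.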

Operation of flow channels

In general, a flow channel may define a path for each communication session across the switch fabric. The path and the amount of data belonging to each flow may be described in a set of dynamically connected flow tables associated with each link of the switch fabric. On every ingress port, whether edge or fabric, a set of flow channel queues may be defined, with one queue per flow channel. As packets arrive, they may either be assigned to a flow channel on an edge port, or have already been assigned to a flow channel by the link partner's egress fabric port when they arrive on a fabric ingress port. The flow channel information may be used to direct each packet into the appropriate flow channel queue.

FIG. 2A illustrates an example switch that facilitates flow channels. In this example, the switch may include a crossbar switch 202. Crossbar switch 202 may have a number of input ports (e.g., input port 204) and a number of output ports (e.g., output port 208). Crossbar switch 202 may forward packets from an input port to an output port. Each input port may be associated with a number of input queues, each assigned to a different incoming flow arriving on that input port. For example, packets arriving on a given port of the switch may first be separated based on their individual flows and stored in flow-specific input queues (e.g., input queue 206). The packets stored in the input queues may be dequeued and sent to crossbar switch 202 based on scheduling algorithms designed to control congestion (described in more detail in later sections). On the output side, once a packet passes through crossbar switch 202, it may be temporarily stored in an output transmission queue (e.g., output transmission queue 210), which may be shared by all the flows leaving on the same output port. Before a packet is dequeued from the output transmission queue and transmitted on the outgoing link, the packet's header may be updated with the flow ID for the outgoing link. Note that this hop-by-hop flow ID mapping may be established when the first packet in the flow travels across the network. When the packet reaches the next-hop switch, it may again be stored in a flow-specific input queue, and the same process may be repeated. Note that a flow ID is used to distinguish between flows traveling on the same fabric link, and may typically be assigned by the transmitter end of this link, which is the output port of the switch that is transmitting onto this link.

By providing flow-specific input queues, the switch can allow each flow to move independently of all other flows. The switch can thereby avoid the head-of-line blocking problem, which is common with shared input buffers. The flow-specific input queues also allow the packets within a single flow to be kept in order. As a flow passes through the switches, a flow-specific input queue may be allocated for it on each input port, and these input queues become linked, effectively forming one long queue for the flow that reaches across the entire fabric, in which the packets of the flow can be kept in order.
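The difference between a shared input buffer and flow-specific input queues can be seen in a small sketch. The functions `drain_shared` and `drain_per_flow`, and the idea of modeling congestion as a set of blocked flow labels, are simplifications introduced here for illustration only.

```python
from collections import deque


def drain_shared(packets, blocked_flows):
    """Shared FIFO: nothing behind a blocked head-of-line packet can leave."""
    queue = deque(packets)
    sent = []
    while queue and queue[0][0] not in blocked_flows:
        sent.append(queue.popleft())
    return sent


def drain_per_flow(packets, blocked_flows):
    """Flow-specific queues: a blocked flow holds back only its own packets."""
    queues = {}
    for flow, seq in packets:
        queues.setdefault(flow, deque()).append((flow, seq))
    sent = []
    for flow, queue in queues.items():
        if flow not in blocked_flows:
            sent.extend(queue)  # packets of one flow leave in arrival order
    return sent


# Each packet is (flow label, sequence number); flow "A" is blocked downstream.
arrivals = [("A", 0), ("B", 0), ("A", 1), ("B", 1)]
print(drain_shared(arrivals, {"A"}))    # [] because A's packet blocks the head
print(drain_per_flow(arrivals, {"A"}))  # [('B', 0), ('B', 1)]
```

With the shared FIFO, flow B makes no progress even though only flow A is congested; with per-flow queues, B drains in order while A waits.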

The progress of successful delivery of packets belonging to a flow may be reported by a sequence of ACKs generated by the edge port of the egress switch. The ACK packets may travel in the reverse direction along the data path traversed by the data packets and may be forwarded by the switches according to the forwarding information maintained in the flow tables. As ACK packets travel upstream, they may be processed by each switch's input queue manager, which may update the corresponding flow's state information based on the information carried by the ACK packets. The ACK packets may have a type field that provides advanced information about the downstream data path, such as congestion. A switch's input queue manager may use this information to make decisions about the pending data packets currently buffered in its input queues, such as throttling the transmission rate or changing the forwarding path. In addition, the input queue manager may update the information carried in an ACK packet based on the state information of a buffered flow, so that the upstream switches can make appropriate decisions. For example, if the input queue for a given flow is experiencing congestion (e.g., the amount of data in the queue is above a predetermined threshold), the input queue manager may update the ACK packet that is forwarded to the next upstream switch to include this congestion information.
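A minimal sketch of how an input queue manager might process an ACK on its way upstream is shown below. The `Ack` record, the boolean `congested` flag standing in for the ACK type field, the class `InputQueueManager`, and the single queue-depth threshold are all assumptions; per-link flow ID translation is omitted for brevity.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Ack:
    flow_id: int
    acked_units: int
    congested: bool = False  # stand-in for the ACK type field (assumption)


class InputQueueManager:
    """Hypothetical manager tracking per-flow queue depth and unacknowledged data."""

    def __init__(self, congestion_threshold):
        self.threshold = congestion_threshold
        self.queued_units = {}   # flow ID -> units waiting in the input queue
        self.unacked_units = {}  # flow ID -> units sent but not yet acknowledged
        self.throttled = set()   # flows this switch currently rate-limits

    def handle_ack(self, ack):
        """Update local state from an ACK and return the ACK to forward upstream."""
        remaining = self.unacked_units.get(ack.flow_id, 0) - ack.acked_units
        self.unacked_units[ack.flow_id] = max(0, remaining)
        # React to congestion reported from downstream.
        if ack.congested:
            self.throttled.add(ack.flow_id)
        else:
            self.throttled.discard(ack.flow_id)
        # Add this switch's own congestion state for the upstream switches.
        if self.queued_units.get(ack.flow_id, 0) > self.threshold:
            ack = replace(ack, congested=True)
        return ack
```

The returned ACK carries congestion information either inherited from downstream or added locally, so each upstream switch sees the worst condition along the remaining path.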

If the ACK corresponds to the last packet of the flow, the switch may determine that the flow does not have any unacknowledged data. Accordingly, the switch may release the flow channel by deleting the corresponding entry in the flow table.

As described above, the input queue manager at each switch may maintain information about the transmitted but unacknowledged data of a given flow. FIG. 2B shows an example of how switches along a data path may maintain flow state information. In this example, the data path taken by a flow may include switches 222, 224, and 226. The amount of transmitted but unacknowledged flow data may be indicated by a variable "flow_extend", which may be measured in fixed-length data units, such as 256 bytes. Furthermore, flow_extend and other flow state information may be maintained by a switch's input queue manager, which may continuously monitor all the flow-specific queues.

In the example of FIG. 2B, the value of flow_extend at the input queue manager of switch 226 is 1, because one data unit has been sent out of the input queue and forwarded through the crossbar switch. Note that a data packet sent by an input queue may be temporarily buffered in the output transmission buffer because of the scheduling of all the data packets to be transmitted over the output link. While such a packet is buffered in the output port's transmission buffer, the input queue may still treat it as transmitted for the purpose of updating the flow_extend value.

Correspondingly, because the input queue for the given flow at switch 226 has six queued data units, and two additional data units are in transit between switches 224 and 226, the flow_extend value at switch 224 is 9. Similarly, the flow_extend value at switch 222 is 13, because three data units are stored in the input queue at switch 224 and one data unit is in transit between switches 222 and 224.
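The numbers in this example can be reproduced with a short sketch. It rests on one reading of the text, stated here as an assumption: the value at each switch equals the value at the next downstream switch, plus the data units queued in that downstream switch's input queue, plus the data units in transit on the link between them. The function name `flow_extend_values` is hypothetical.

```python
def flow_extend_values(last_hop_value, hops):
    """Compute the per-switch values from egress back toward ingress.

    hops lists (queued_at_downstream_switch, in_transit_on_link) pairs,
    ordered from the last link on the path to the first.
    """
    values = [last_hop_value]
    for queued, in_transit in hops:
        values.append(values[-1] + queued + in_transit)
    return values


# Switch 226 -> 224 -> 222, using the quantities given in the text.
print(flow_extend_values(1, [(6, 2), (3, 1)]))  # [1, 9, 13]
```

The value grows toward the ingress because each upstream switch is responsible for everything it has sent that has not yet been acknowledged, including data still held by the switches below it.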

In general, a flow channel may remain assigned to a single flow until all ACKs for all packets sent on the flow channel have been returned. This means that a flow channel table entry can remain active for longer near the fabric ingress edge port than near the egress edge port. If a single packet is injected into the network, a flow channel may be allocated for the ingress edge port, then another flow channel may be allocated for the next fabric link the packet traverses, and so on, until the last flow channel is allocated when the packet reaches the last fabric link. Each allocation may generate a flow ID, denoted as the variable "flow_id," to identify the entry in the flow table of that fabric link. (More details about the flow channel table are provided below in connection with Fig. 4A.) The first packet may therefore cause a different flow_id to be allocated on each fabric link it traverses through the switch fabric.

At the input queue of each switch, the flow channel table entry may indicate the state information of each flow from that point downstream to the flow's egress destination edge port, including the flow_extent value. Packets received on the local input port may increase the flow_extent value by the amount of incoming data, while ACKs may decrease the flow_extent value by the amount of acknowledged transmitted data.

When a packet reaches the final destination egress port, an ACK packet may be generated for the packet and sent back. The ACK may be routed using the data path information stored in the corresponding flow channel table entry of each switch on the data path. Consequently, the ACK packet itself need not carry path information and can therefore be small and lightweight. If no other packets are sent on the flow, the ACK may release each flow channel in reverse order. Once released, the flow channel on each switch may be assigned to a different flow.

If another packet follows the first packet on the same flow, the ACK corresponding to the second packet needs to be received before a given switch releases the flow channel. In one embodiment, the flow channel can be released only when the ACKs for all transmitted packets of the same flow have been returned.

In general, various protocols may require in-order packet delivery. Flow channels can be used to guarantee this delivery order even when the fabric uses adaptive routing to balance load across multiple data paths. If packets between an ingress edge port and an egress edge port (possibly located in a different switch on the far side of the fabric) are injected at a very low rate, each injected packet may reach its destination and return an ACK to the source before the next packet is injected. In this case, each packet may be a lead packet and may take any path across the fabric, selected using the best available dynamic adaptive routing choice. This is possible because the first packet of a flow can define the path of the flow through the fabric.

Now assume that the packet injection rate is increased slightly, to the point that the next packet of the same flow is injected before the ACK for the current packet has returned to the source. The second packet may meet the ACK of the first packet somewhere along the data path of the flow. Beyond this meeting point, the ACK will have released the flow channels allocated to the first packet, because the flow_extent value associated with the first packet returns to zero when the ACK is processed by the flow channel logic. The second packet can therefore define a new flow from that point on, since it again causes flow channels to be allocated on each subsequent fabric link. While causing flow channels to be allocated beyond the meeting point, the second packet can still be forwarded along a different path based on dynamic adaptive routing. Before the meeting point, on the other hand, the second packet may extend the outstanding flow created by the first packet to include the second packet. This means that the ACK for the first packet may not reduce the flow_extent value to zero, and the flow channel may remain active upstream of the meeting point. It also means that the second packet follows the exact path taken by the first packet up to the meeting point. It should be noted that, while it follows the previous packet, the second packet cannot reach the egress edge port before the first packet, and thus the correct packet order is maintained.

If the injection rate of the flow is further increased, the second packet will meet the ACK of the first packet at a location closer to the destination edge port. Depending on the packet injection rate and packet-to-ACK round trip delay for the flow, it is also possible that a third, fourth, fifth or additional packet enters the fabric before the ACK for the first packet returns to the source edge port. The maximum packet rate depends on the size of the packet and the bandwidth of the link. The round trip delay of the data packets and ACKs may be an important parameter for the fabric implementation and may be used along with the maximum packet rate to calculate the maximum number of flow channels required per link. Ideally, the design can provide a reasonable number of unassigned flow channels regardless of traffic flow patterns. When a large number of packets arriving at an ingress edge port have different destinations and these packets have smaller sizes and high injection rates, the demand on the number of flow channels may be high. In the most extreme case, each packet may be assigned a different flow channel. These flow channels are released when the ACK for the packet returns. Accordingly, the number of required flow channels may be calculated as ((packet rate) × (average packet to ACK round trip delay)).
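The product stated above can be illustrated with a small calculation. The link speed, packet size, and round-trip delay used here are hypothetical values chosen only for illustration; they do not come from the described system, and framing overhead is ignored.

```python
import math


def packet_rate_pps(link_bits_per_s, packet_bytes):
    """Maximum packets per second on a link, ignoring framing overhead."""
    return link_bits_per_s / (packet_bytes * 8)


def flow_channels_required(packet_rate, round_trip_s):
    """Worst case: every packet in flight occupies its own flow channel."""
    return math.ceil(packet_rate * round_trip_s)


# Hypothetical figures: 100 Gb/s link, 64-byte packets, 2 microsecond
# average packet-to-ACK round trip delay.
rate = packet_rate_pps(100e9, 64)
channels = flow_channels_required(rate, 2e-6)
print(rate, channels)
```

With these assumed figures the worst case needs a few hundred flow channels per link. Halving the round-trip delay or doubling the packet size would halve that requirement.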

Note that the packet rate on a single flow channel is not to be confused with the packet rate on a link. If the traffic pattern is such that many small packets are sent to different destinations, consecutive packets sent onto the link may have different destinations. This means that each packet may belong to a different flow and may be the only packet that uses the corresponding flow channel. In this example, the link may experience a high packet rate, while the packet rate of each flow may be low. Optionally, multiple ACKs (e.g., 48 ACKs) may be aggregated together to form a single ACK frame for transmission over the link, protected by a frame check sequence (e.g., a 32-bit FCS). For example, each ACK may occupy 25 bits and the frame may have an overhead of 9 bytes. That is, the per-ACK overhead on a full-size frame is about 9/(25/8 × 48) × 100% = 6%. The logic may optimize the number of ACKs per frame so that, when ACKs arrive slowly, an ACK does not have to wait too long to be aggregated. For example, the ACK aggregation logic may manage ACK transmission based on the activity of the outgoing link using three timers. These timers may be started when a new ACK arrives at the ACK aggregation logic block. If the outgoing link is idle, a first timer (which may be set to 30 ns, for example) may be used to hold the ACK while waiting for additional ACKs to arrive. When this timer expires, all ACKs received within the corresponding time window may be aggregated into one frame and transmitted onto the outgoing link. If the outgoing link is busy, a second timer (which may be set to 60 ns, for example) may be used to wait for additional ACKs. Using this second timer may allow more ACKs to be aggregated into a single frame, and the frame may be transmitted only when a predetermined number of ACKs has been collected. It should be noted that, due to Ethernet framing constraints, certain numbers of ACKs in a single frame may use less line bandwidth per ACK than other numbers of ACKs.
If that predetermined number of ACKs has not been collected and the outgoing link is still busy transmitting normal data packets, a third timer may be used, which may be set to 90 ns, for example. Once the third timer expires, all collected ACKs may be aggregated into one frame and transmitted onto the link. By using these three timers, the system can significantly reduce the overhead of sending ACKs on the outgoing link.
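The overhead figure and the three-timer policy described above can be sketched as follows. The function names are hypothetical, and `should_flush` encodes only one plausible reading of the policy: on an idle link the frame is sent when the first timer expires; on a busy link it is sent after the second timer only if the target ACK count has been reached, and unconditionally after the third timer.

```python
def ack_frame_overhead(acks_per_frame=48, ack_bits=25, overhead_bytes=9):
    """Framing overhead relative to the ACK payload carried in one frame."""
    payload_bytes = acks_per_frame * ack_bits / 8
    return overhead_bytes / payload_bytes


def should_flush(elapsed_ns, link_busy, collected, target,
                 t1_ns=30, t2_ns=60, t3_ns=90):
    """Decide whether the collected ACKs should be sent as a frame now.

    elapsed_ns is the time since the timers were started by the first
    ACK that arrived at the aggregation logic.
    """
    if collected == 0:
        return False                      # nothing to send
    if not link_busy:
        return elapsed_ns >= t1_ns        # idle link: short hold only
    if elapsed_ns >= t3_ns:
        return True                       # busy link: hard deadline
    return elapsed_ns >= t2_ns and collected >= target


print(round(ack_frame_overhead() * 100, 2))  # percent overhead per ACK
```

The default arguments reproduce the figures quoted in the text (48 ACKs of 25 bits each, 9 bytes of overhead, and timers of 30 ns, 60 ns, and 90 ns).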

In some examples, an ingress edge port of a switch may encapsulate a received packet with a fabric header, which allows the packet to be forwarded using flow channels. Fig. 3A shows an exemplary fabric header of a data packet. The fabric header may include a "flow_id" field (which may identify a flow channel) and a "data_flow" field (which may indicate the progress of the entire flow).

At least one ACK may be generated when a data packet is delivered to its destination. Fig. 3B shows an exemplary ACK packet format. The ACK packet may include a "flow_id" field, an "ack_flow" field, an "ACK type" field, and a Cyclic Redundancy Check (CRC) field. The flow_id field may indicate the flow to which the ACK packet belongs. The ack_flow field may indicate the data packet acknowledged by the ACK packet. Recall that each switch may maintain a flow_extent value that indicates the amount of transmitted but unacknowledged data. The value of flow_extent can be obtained as data_flow - ack_flow, where the data_flow value is taken from the last transmitted packet.

The ACK type field may indicate different types of ACKs. As described above, during normal operation, when a data packet is transferred to a destination edge port, a regular ACK packet may be generated and sent back to the source. Accordingly, the ACK type field in the ACK packet may indicate a normal ACK. When congestion occurs in the fabric, the ACK type field may be used to indicate various types and severity of congestion, such as new congestion, persistent congestion, severe congestion on egress edge ports that require rerouting of flows. In addition, in special cases such as heavily congested fabric links, dropped packets, or link errors, ACKs may also be generated by intermediate switches that are not the final destination, and the ACK type field may be used to inform upstream switches of different types of network conditions. Other additional fields may also be included in the ACK packet.

FIG. 3C illustrates the relationship between different variables used to obtain and maintain state information for a flow. In this example, the switch may use the variable "total_extent" to track the total amount of unacknowledged transmitted data and data currently queued at the switch. The value of total_extent may be equal to the sum of flow_extent (the amount of data transmitted but not acknowledged) and queue_extent (the amount of data stored in the input queue of the corresponding flow). The variable "ack_flow" may indicate the data position corresponding to the latest ACK for the flow. The variable "data_flow" may indicate the position of the next packet to be transmitted, which also corresponds to the packet stored at the head of the input queue. The variable "next_data_flow" may indicate the position of the next packet that the switch can expect to receive from the upstream switch. Note that queue_extent = next_data_flow - data_flow, and flow_extent = data_flow - ack_flow.
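The relationships among these variables can be captured in a few lines. The following is a minimal sketch, not the switch's actual implementation: the class name `FlowState` and its methods are hypothetical, positions are counted in abstract data units, and the range checks are added assumptions.

```python
from dataclasses import dataclass


@dataclass
class FlowState:
    ack_flow: int = 0        # position covered by the latest ACK
    data_flow: int = 0       # position of the next packet to transmit
    next_data_flow: int = 0  # position expected from the upstream switch

    @property
    def flow_extent(self):
        """Data transmitted but not yet acknowledged."""
        return self.data_flow - self.ack_flow

    @property
    def queue_extent(self):
        """Data still held in the flow's input queue."""
        return self.next_data_flow - self.data_flow

    @property
    def total_extent(self):
        return self.flow_extent + self.queue_extent

    def receive(self, units):
        self.next_data_flow += units

    def transmit(self, units):
        if units > self.queue_extent:
            raise ValueError("cannot transmit more than is queued")
        self.data_flow += units

    def acknowledge(self, ack_flow):
        if not self.ack_flow <= ack_flow <= self.data_flow:
            raise ValueError("ACK outside the transmitted range")
        self.ack_flow = ack_flow


state = FlowState()
state.receive(5)      # five units arrive from upstream
state.transmit(3)     # three of them are forwarded downstream
state.acknowledge(1)  # the first unit is acknowledged
print(state.flow_extent, state.queue_extent, state.total_extent)
```

Storing three positions rather than the extents themselves means each event updates exactly one field, and the extents are always derived consistently from them.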

In some examples, flow channel tables may be used to facilitate flow channels throughout the fabric. A flow channel table is a data structure that stores the forwarding and state information for a given flow at a port of a switch. Fig. 4A illustrates an example of how flow channel tables may be used to store state information associated with multiple flows. This state information can be specific to each flow and efficiently stored in a table. Assume that source host 402 is sending packets to destination host 404 via a fabric. The data path traversed by the packets may include ingress edge switch 406, intermediate switches 408 and 430, and egress edge switch 432.

When a packet arrives on ingress edge link 403 of switch 406, address translation logic 410 may analyze the header of the packet. Address translation logic 410 may determine the destination fabric address of the egress switch (in this case switch 432) based on the Ethernet, IP, or HPC header information of the packet. It should be noted that address translation logic 410 may also use header information associated with other protocols or combinations of different protocols. The fabric destination address determined by address translation logic 410 may then be used to perform a lookup in an Edge Flow Channel Table (EFCT) 412. EFCT 412 may perform a lookup operation for the packet using the packet's fabric destination address and, optionally, additional values extracted from the packet's header (together referred to as a matching pattern). EFCT 412 may compare the matching pattern of the packet with the stored matching patterns of all currently allocated flows. If a match is found, the packet is part of an existing flow and the previously allocated flow ID may be returned for the packet. If no match is found, a new flow ID may be allocated for the packet and the matching pattern may be added to EFCT 412. In other words, EFCT 412 may be used to determine whether a flow channel already exists for an incoming packet or whether a new flow channel needs to be allocated. In addition to the destination fabric address, other packet header information (such as traffic class, TCP or UDP port number, and process or thread ID) may also be used to map or allocate flow IDs.
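The match-or-allocate behavior described for the EFCT can be sketched in software as a dictionary keyed by the matching pattern. This is only an illustration under assumptions: the class and method names are hypothetical, the pattern is modeled as a tuple, and a hardware implementation would perform the match in a CAM or TCAM rather than a hash table.

```python
class EdgeFlowChannelTable:
    """Maps a matching pattern to a flow ID, allocating one on a miss."""

    def __init__(self, num_channels):
        # Reverse order so that pop() hands out the lowest free ID first.
        self._free = list(range(num_channels - 1, -1, -1))
        self._by_pattern = {}

    def lookup(self, pattern):
        """Return (flow_id, is_new_flow) for the given matching pattern."""
        flow_id = self._by_pattern.get(pattern)
        if flow_id is not None:
            return flow_id, False          # packet joins an existing flow
        if not self._free:
            raise RuntimeError("no free flow channel")
        flow_id = self._free.pop()
        self._by_pattern[pattern] = flow_id
        return flow_id, True               # new flow channel allocated

    def release(self, pattern):
        """Free the flow channel once the last ACK of the flow returns."""
        self._free.append(self._by_pattern.pop(pattern))


efct = EdgeFlowChannelTable(num_channels=2)
# Hypothetical patterns: (fabric destination address, traffic class).
first = efct.lookup(("switch-432", 0))
again = efct.lookup(("switch-432", 0))
other = efct.lookup(("switch-432", 1))
print(first, again, other)
```

Packets with the same pattern map to the same flow ID for as long as the flow is outstanding, and a released ID becomes available to an unrelated flow.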

The flow ID obtained by EFCT 412 may then be used as an index to map to an entry in an input flow channel table (IFCT) 414. Each entry in IFCT 414 may be indexed by a flow ID and may store state information for the corresponding flow. An entry in IFCT 414 may store the values of next_data_flow, data_flow, and ack_flow (see FIG. 3C) associated with the flow. In addition, an IFCT entry may store other parameters for congestion control and dynamic routing of the flow.

The flow ID may also be used to identify or assign a flow-specific input queue in which incoming packets may be temporarily stored. Status information for a particular queue and parameters used to monitor and control the queue (e.g., thresholds for detecting congestion) may be stored in corresponding entries in IFCT 414. The input queue management logic may determine when a packet may be dequeued from an input queue and sent to the data crossbar switch 413 based on flow control parameters stored in entries of the IFCT 414.

When a packet is dequeued from an input queue and sent through data crossbar switch 413 to an output port, the packet may be sent together with the number of the input port on which it arrived at switch 406. When the packet arrives at the transmission buffer of the output port, the header of the packet may be updated, based on the packet's flow ID and input port number, with a new flow ID to be used by the next-hop switch (i.e., switch 408) for the same flow. This is because each link, in each direction, may have its own set of flow channels identified by their respective flow IDs. The mapping from the incoming flow ID to the outgoing flow ID used on the next link can be done by looking up the output flow channel table (OFCT) 416. OFCT 416 may perform the lookup using a matching pattern that is a combination of the local input port number corresponding to link 403 and the packet's flow ID produced by EFCT 412. If a match is found, the flow has already been defined and the packet's flow ID is updated with the value corresponding to the matching pattern (this new outgoing flow ID will be used by the downstream next-hop switch 408). If no match is found, a new flow channel may be allocated with a new outgoing flow ID, which may be mapped to the input port number and the previous incoming flow ID. An entry including the outgoing flow ID, the input port number, and the incoming flow ID may be stored in OFCT 416.

In the case where the packet is the first packet in the flow, the lookup in OFCT 416 does not produce a mapping. In turn, OFCT 416 may allocate for the packet a flow channel with a flow ID to be used by the input port and IFCT 418 on switch 408. This new flow channel, identified by its flow ID, may be added to the packet header for transmission onto link 417 and may be used by IFCT 418 of the link partner (switch 408) to access the congestion information of the flow channel. As previously described, if no match is found, OFCT 424 may further generate a new flow channel using its matching pattern, which combines the immediately upstream input port number and the flow ID associated with link 417. OFCT 424 may then allocate a new flow channel identified by a new flow ID. It should be noted that OFCT 416 may also serve as a forwarding table for the flow's ACKs in the upstream direction. After being forwarded upstream from switch 408 to switch 406, an ACK packet may be updated with the flow ID associated with edge link 403 and forwarded to the appropriate input port on switch 406, as indicated by the corresponding entry in OFCT 416. The ACK packet may be forwarded to that input port in the upstream direction by ACK crossbar switch 415.
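The two roles of the OFCT described above, remapping flow IDs for data packets and routing ACKs back upstream, can be sketched as a pair of dictionaries. The class and method names are hypothetical. As a simplification, the sketch hands out monotonically increasing outgoing IDs, whereas a real table would recycle released IDs.

```python
class OutputFlowChannelTable:
    def __init__(self):
        self._out_by_in = {}  # (input_port, incoming_flow_id) -> outgoing_flow_id
        self._in_by_out = {}  # outgoing_flow_id -> (input_port, incoming_flow_id)
        self._next_id = 0

    def map_data_packet(self, input_port, incoming_flow_id):
        """Return the flow ID to place in the header for the next link."""
        key = (input_port, incoming_flow_id)
        if key not in self._out_by_in:
            # First packet of the flow on this link: allocate a channel.
            out_id = self._next_id
            self._next_id += 1
            self._out_by_in[key] = out_id
            self._in_by_out[out_id] = key
        return self._out_by_in[key]

    def route_ack(self, outgoing_flow_id):
        """Return (input_port, upstream_flow_id) for an ACK heading upstream."""
        return self._in_by_out[outgoing_flow_id]


ofct = OutputFlowChannelTable()
out_a = ofct.map_data_packet(input_port=3, incoming_flow_id=17)
out_b = ofct.map_data_packet(input_port=5, incoming_flow_id=17)
out_a_again = ofct.map_data_packet(input_port=3, incoming_flow_id=17)
print(out_a, out_b, out_a_again, ofct.route_ack(out_a))
```

Because the key includes the input port, the same incoming flow ID arriving on two different ports maps to two distinct outgoing flow channels, and the reverse mapping lets an ACK find its way back without carrying any path information.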

Subsequently, when a packet arrives at switch 408, its flow ID may be used to identify the input queue to use and determine the entry in IFCT 418. If switch 408 has not previously allocated the flow ID of the packet, a new input queue may be provided and a new entry may be created in IFCT 418. From this point on, a similar process may be performed to forward packets across switches 408 and 430 until the packets reach egress switch 432.

When a packet arrives at switch 432, ACK generator logic block 420 may generate an ACK packet based on the packet's flow ID and input port number after the packet has been forwarded by data crossbar switch 423. The ACK packet may then be forwarded in the upstream direction by ACK crossbar switch 422. Meanwhile, based on the ACK packet, IFCT 421 may update the state information of the flow in the corresponding table entry. When the ACK packet arrives at switch 430, OFCT 419 may be looked up to determine the upstream flow ID and the upstream input port to which the ACK packet is to be forwarded. The ACK packet's flow ID may then be updated, and the ACK packet may be forwarded to the appropriate input port in the upstream direction. As the ACK packet traverses the data path upstream in this manner, the IFCT at each switch may update its table entry for the flow based on the ACK.

It should be noted that the flow_extent variable may be an important parameter because it represents the total amount of data in the flow's packets that are downstream. When the flow_extent of an entry is zero, the flow channel is considered free to be reassigned to another flow. Typically, upon receiving a new packet, the input logic may issue a request to send the data to an output port. The selected output port may be a function of the flow_extent stored in the IFCT. If flow_extent is zero, there are no packets of the flow downstream on the way to the destination egress edge port. The switch can therefore use load-based adaptive routing to select any valid path to the destination. In a multi-path network, dynamic adaptive routing can thus be performed without reordering packets. If flow_extent is not zero and in-order delivery is required, the packet may use the same route taken by the previous packets. The IFCT may have a field storing the previous output port number, which is loaded when a packet request is issued to an output port and may be used to ensure that the packet is sent to the previously used output port.
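The port selection rule described above can be written as a short function. This is a sketch under assumptions: the function name is hypothetical, and `port_loads` stands in for whatever load metric the adaptive routing logic uses, modeled here as a dictionary from candidate output port to a number where lower means less loaded.

```python
def select_output_port(flow_extent, previous_port, port_loads, in_order=True):
    """Pick the output port for the next packet of a flow.

    With data still outstanding downstream and ordering required, the
    packet must follow its predecessors. Otherwise any candidate port is
    safe, so the least loaded one is chosen.
    """
    if flow_extent != 0 and in_order:
        return previous_port
    return min(port_loads, key=port_loads.get)


loads = {1: 0.7, 2: 0.2, 3: 0.9}  # hypothetical load per candidate port
print(select_output_port(0, previous_port=3, port_loads=loads))
print(select_output_port(4, previous_port=3, port_loads=loads))
```

The first call is free to choose the least loaded port because nothing is outstanding downstream, while the second call stays on the previous port to preserve ordering.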

As previously described, a flow channel may use a matching function to identify packets belonging to an existing flow. When an Ethernet frame or another type of packet is received on an ingress edge port, the received frame or packet may be parsed in real time, and some fields of the packet header may be used for a lookup in a Content Addressable Memory (CAM) or Ternary Content Addressable Memory (TCAM). If there is a match, the matching address may become the flow ID used to select the flow channel. When no match occurs, the switch hardware may load the pattern that failed to match directly onto a free line of the CAM, which may be done without additional delay. Thus, any subsequent packet can be matched to the new entry without a significant amount of buffering. The selected free entry becomes the new flow ID of the new flow channel entry. It should be noted that no external software intervention is required to load the new entry. This process may be completed autonomously by the switch hardware.

The de-allocation of the flow ID and the corresponding CAM match line may also be performed automatically by hardware when the last ACK of the flow is returned. With respect to new packets that could potentially match, the de-allocation can be performed in hardware without external software intervention.

In some examples, ingress edge switch 406 may include fine-grained flow control logic 434, which may communicate with a Network Interface Controller (NIC) 401 on host 402 to apply flow control on a per-flow basis. More details regarding fine-grained flow control are provided below in connection with the description of congestion management.

Fig. 4B shows an example of an EFCT. In this example, the EFCT may include a data_flow field 454, an ack_flow field 456, and optionally additional fields. The EFCT may be associated with an input port, and entries in the EFCT may be indexed by flow_ID values, such as flow_ID 452. In one embodiment, the match pattern field may reside in a match function logic block, which may comprise a CAM or TCAM. The match function logic block may use the match pattern to generate a flow_ID value, which in turn may be used as an index to the corresponding EFCT entry. From the perspective of the EFCT, flow_extent (i.e., data_flow - ack_flow) may include all unacknowledged data downstream of this table, which may include the local flow_queue plus the flow_extent value of the corresponding IFCT.

Fig. 4C shows an example of an IFCT. In this example, the IFCT may be associated with an input port and may include a follow_port field 466, a next_data_flow field 468, a data_flow field 470, an ack_flow field 472, an ep_congestion field 474, an Upstream Metering (UM) flag field 477, a Downstream Metering (DM) flag field 478, and optionally additional fields. The flow_ID value of an incoming packet (e.g., flow_ID 464) may be used as an index to look up the output port number indicated by follow_port field 466 and the state information associated with the corresponding flow. Congestion control information associated with endpoint congestion (e.g., ep_congestion field 474) and hop-by-hop credit-based flow control (e.g., UM flag field 477 and DM flag field 478) may also be stored in the IFCT, as described in more detail later in this document. The IFCT may further store information about dynamic routing associated with different flows.

Fig. 4D shows an example of an OFCT. In this example, the OFCT may be associated with an output port and may include an input_port field 482, an input_port_flow_ID field 484 (which corresponds to the packet's existing flow_ID when it arrives at the input port), a data_flow field 486, an ack_flow field 488, and optionally additional fields. Data_flow field 486 and ack_flow field 488 may be used to determine the value of flow_extent from the OFCT onward. The combination of input_port field 482 and input_port_flow_ID field 484 (which may also be referred to as the "incoming flow_ID") may be used to determine or allocate the outgoing flow_ID of a packet that is ready for transmission onto the outgoing link corresponding to the OFCT. In one embodiment, an outgoing flow_ID value (e.g., flow_ID 486) may be used as an index to look up an entry in the OFCT.

Congestion management

As described above, each flow on a given switch may have its own dedicated packet queue. This configuration facilitates separate flow control for each flow. Thus, the network can remain largely lossless and one flow using a link can be blocked without blocking any other flow using the same link. Unlike conventional packet-switched networks, congestion in a certain part of the network can only affect the flows that cause the congestion. For example, in a conventional network, the buffer before the congested link may soon be filled with packets that cause congestion. This in turn may force a switch to issue a pause command or use some other flow control method to prevent neighboring switches from sending packets to the congested link. Thus, packets that cause congestion may be stopped or slowed down, and all other packets that may not be destined for the congested link may also be stopped or slowed down. Thus, from a topological point of view, congestion may spread laterally and increase the size of the saturation tree.

In contrast, flow channels may reduce, on the links causing congestion, only the load that corresponds to the flows causing the congestion. This load reduction may allow other flows sharing these links to use more link bandwidth and transmit their payloads faster, while only the packets contributing to the congested links are slowed down.

In general, conventional networks can operate properly as long as the network load is not at or near full capacity. This is most often the case for small and medium size networks. However, for large or very large networks running multiple high bandwidth applications, a portion of the network may be saturated by traffic load at any point in time. In these cases, unfair packet delivery may occur even if the individual switches implement fairness policies locally.

Fig. 5 shows an example of the unfair sharing of link bandwidth that may occur in a network. In this example, each of sources A through K attempts to send a packet stream to destination L, forming an incast scenario in which multiple sources send packets to a single destination. Source nodes A, B, and C are coupled to switch 502; source nodes D, E, and F are coupled to switch 504; source nodes G, H, and I are coupled to switch 506; and source nodes J and K, as well as destination node L, are coupled to switch 508. Assume that each switch employs a fair arbitration policy, i.e., it selects an equal number of packets from each of its input ports for any particular output port. However, as shown in Fig. 5, sources closer to the destination may obtain a much higher proportion of the final link bandwidth than sources whose traffic needs to pass through more switching stages. Switch 508 has three sources of incoming data, from nodes J and K and from switch 506, and may divide the bandwidth on the outgoing link to node L equally among them. Thus, nodes J and K may each obtain 33.3% of the bandwidth on the outgoing link to destination node L.

The same may be true for the next closest switch (i.e., switch 506), and so on. In this example, with only four switch stages, only three or four inputs per stage, and only a total of 11 inputs attempting to send to destination node L, each of the three most distant sources (nodes A, B, and C) obtains only 1/48 of the bandwidth that each of the two closest sources (nodes J and K) obtains on the outgoing link to destination node L. Thus, even with locally fair arbitration policies, nodes far from the destination receive very unfair treatment. A more realistic network topology may involve more switching stages, more switch inputs, and more sources attempting to send to a single destination. A medium-sized incast may result in a difference of six orders of magnitude between the delivered bandwidths of different sources.
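The 1/48 figure can be checked with exact fractions. The sketch below assumes the chain topology of the Fig. 5 example, in which every switch except the most distant one has exactly one extra input from the next switch upstream, and each switch splits its output bandwidth equally among its inputs. The function name is hypothetical.

```python
from fractions import Fraction


def incast_shares(stages):
    """Bandwidth share of each source on the final link.

    stages lists the local sources of each switch, starting with the
    switch nearest the destination and moving outward.
    """
    shares = {}
    remaining = Fraction(1)  # bandwidth available to this switch's inputs
    for index, sources in enumerate(stages):
        has_upstream = index < len(stages) - 1
        inputs = len(sources) + (1 if has_upstream else 0)
        per_input = remaining / inputs
        for source in sources:
            shares[source] = per_input
        remaining = per_input  # the upstream switch gets one input's share
    return shares


shares = incast_shares([
    ["J", "K"],       # switch 508
    ["G", "H", "I"],  # switch 506
    ["D", "E", "F"],  # switch 504
    ["A", "B", "C"],  # switch 502
])
print(shares["J"], shares["G"], shares["D"], shares["A"])
print(shares["A"] / shares["J"])
```

Under per-port fairness, nodes J and K each obtain 1/3 of the link while nodes A, B, and C each obtain 1/144, which gives the ratio of 1/48 stated above.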

The unfairness problem described above is often caused by the fact that the arbitration policy implemented by a switch is based on input ports. That is, bandwidth throttling is done at per-port granularity. In contrast, by facilitating flow channels and enforcing flow-specific throttling, the network can significantly reduce the degree of unfairness among different flows. For example, in the scenario shown in Fig. 5, when the switches implement a per-flow fair bandwidth allocation policy, all eleven source nodes may share the bandwidth of the edge link between switch 508 and destination node L substantially equally. The extreme tail latency of individual packets can also be greatly reduced by providing a fairer, flow-based arbitration policy. For large system installations, controlling the maximum latency through the network is often a major concern for architects. Typically, this can be achieved only by limiting the input bandwidth into the network to a small fraction of the peak bandwidth. For example, an input bandwidth limit of 20% of peak bandwidth may be typical for a large data center. By contrast, with flow channels and appropriate control mechanisms, it is possible to build networks that do not impose such restrictions.

In addition to fairness, another challenge faced by network architects is congestion. In general, two types of congestion may occur in a network. The first type is endpoint congestion, where egress edge links coupled to a destination device are congested. The second type is fabric link congestion, i.e., intermediate fabric link congestion.

Fig. 6 shows an example of endpoint congestion. In this example, two source hosts 602 and 604 are sending data to a destination host 606. Traffic from source hosts 602 and 604 converges at edge switch 610 and egress edge link 608 between switch 610 and host 606 may become congested. Such a congestion scenario typically occurs in incast, where multiple sources send traffic to a single destination. Congestion may occur when an egress edge link reaches its full data rate capacity, or when the destination host 606 is unable to process all incoming packets at a fast enough rate. In any event, when endpoint congestion occurs, the outgoing transport buffers on the switch 610 coupled to the link 608 may experience an increase in their amount of stored data.

The switch may detect endpoint congestion by monitoring output buffers on the egress edge links and relieve endpoint congestion by sending ACKs with congestion information to upstream switches and source nodes. More specifically, an output buffer coupled to an egress edge link may monitor the status of the buffer and detect congestion when certain criteria are met. When a packet arrives or leaves the output buffer, the output buffer may calculate three congestion detection parameters, such as: (1) the amount of data stored in the buffer, (2) the number of packets stored in the buffer, and (3) the rate of change of the buffer depth (the amount of data stored in the buffer). Three thresholds may be set for the three monitoring parameters, respectively, but more or fewer thresholds may be set. When at least one of these parameters exceeds a corresponding threshold, congestion is considered to be present.
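A minimal sketch of such an output buffer monitor is shown below. The class names, the threshold values, and the way the rate of change is estimated (the difference in buffer depth between two consecutive events divided by the elapsed time) are assumptions made for illustration; the text does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    max_bytes: int
    max_packets: int
    max_growth_bytes_per_ns: float


class OutputBufferMonitor:
    def __init__(self, thresholds):
        self.thresholds = thresholds
        self.bytes = 0        # parameter 1: amount of stored data
        self.packets = 0      # parameter 2: number of stored packets
        self.growth = 0.0     # parameter 3: rate of change of buffer depth
        self._last_bytes = 0
        self._last_time_ns = 0

    def _update_rate(self, now_ns):
        elapsed = now_ns - self._last_time_ns
        if elapsed > 0:
            self.growth = (self.bytes - self._last_bytes) / elapsed
            self._last_bytes = self.bytes
            self._last_time_ns = now_ns

    def enqueue(self, size_bytes, now_ns):
        self.bytes += size_bytes
        self.packets += 1
        self._update_rate(now_ns)

    def dequeue(self, size_bytes, now_ns):
        self.bytes -= size_bytes
        self.packets -= 1
        self._update_rate(now_ns)

    def congested(self):
        """Congestion is declared when any parameter exceeds its threshold."""
        t = self.thresholds
        return (self.bytes > t.max_bytes
                or self.packets > t.max_packets
                or self.growth > t.max_growth_bytes_per_ns)


monitor = OutputBufferMonitor(Thresholds(1000, 4, 50.0))
monitor.enqueue(500, now_ns=10)
before = monitor.congested()
monitor.enqueue(600, now_ns=110)
after = monitor.congested()
print(before, after)
```

In this run the second packet pushes the stored data past the byte threshold, so congestion is reported even though the packet count and growth rate remain within their limits.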

When congestion is detected, the switch may generate and transmit an endpoint congestion notification ACK corresponding to the packet that has just entered the output buffer. The ACK may include a value indicating the severity of the congestion. It should be noted that the endpoint congestion notification ACK is not intended to inform upstream switches of the successful delivery of the packet, but rather of the existence and extent of congestion on the egress edge link. (In fact, when the endpoint congestion notification ACK is sent, the packet may still be stored in the output buffer waiting to be transmitted onto the egress edge link.) This fast, explicit congestion notification mechanism allows the switches to act quickly on the specific flow that contributes to the congestion.

In addition, the output buffer may update the congestion detection parameters as packets are dequeued and transmitted to the egress edge link. If there is no congestion, a regular ACK is generated and sent, which may clear any previous congestion notifications received by the upstream switch operating on the corresponding flow. If there is congestion, the ACK may be tagged with a flag that allows the ACK to inform the switch of continued congestion on the egress edge link and successful delivery of the packet.

Fig. 7A illustrates a flow diagram of an exemplary process of generating an explicit endpoint congestion notification ACK. During operation, the system may continuously monitor the output buffer of the egress edge link. The system may then receive a packet at the output buffer (operation 702). After receiving the packet, the system may calculate the three congestion parameters (total amount of data, total number of packets (e.g., as represented by the total number of headers), and rate of change of buffer depth) for the output buffer (operation 704). The system may further determine whether any of these parameters exceeds its corresponding threshold (operation 706). If at least one parameter exceeds its threshold, congestion is deemed to be present. Accordingly, the system may generate an explicit endpoint congestion notification ACK packet corresponding to the flow of the packet and send it to the upstream switch (operation 708). If congestion is not detected, the system may resume normal operation.

Fig. 7B illustrates an exemplary endpoint congestion management logic block. In this example, the endpoint congestion management logic block 730 may include an output buffer monitor 732, a congestion parameter calculation logic block 734, and an endpoint congestion notification ACK generation logic block 736. During operation, the output buffer monitor 732 may monitor the status of the output buffers associated with the egress edge links. Based on the status of the monitored output buffers, the congestion parameter calculation logic block 734 may calculate the three congestion parameters (see operation 704 in the flow chart in fig. 7A). When one of these parameters exceeds the corresponding threshold, the endpoint congestion notification ACK generation logic 736 may generate an endpoint congestion notification ACK and send the ACK to the upstream switch.

Fig. 8 shows a flow diagram of an exemplary process of generating an ACK in response to a packet being dequeued from an output buffer. In this example, the system first dequeues the packet from the output buffer (operation 802). The system may then calculate the three congestion parameters (total amount of data, total number of packets, and rate of buffer depth change) for the output buffer (operation 804). The system may determine whether any of these parameters exceed corresponding thresholds (operation 806). If at least one parameter exceeds a threshold, congestion is deemed to be present. Accordingly, the system may generate an ACK packet with a flag indicating persistent congestion (operation 808). If congestion is not detected, the system may generate a regular ACK packet (operation 809). The system may then send the ACK packet to the upstream switch (operation 810) and transmit the dequeued data packet onto the egress edge link (operation 812).
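The enqueue-time check of fig. 7A and the dequeue-time check of fig. 8 can be illustrated together with a minimal sketch. The class and field names, the string labels for the ACK types, and the choice to measure the rate of change as the depth difference between consecutive buffer events are assumptions made for illustration only; they are not taken from the described hardware.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Thresholds:
    max_bytes: int    # threshold on the amount of data stored
    max_packets: int  # threshold on the number of packets stored
    max_growth: int   # threshold on depth growth between events (assumed metric)


class OutputBufferMonitor:
    """Hypothetical monitor for one output buffer on an egress edge link."""

    def __init__(self, thresholds):
        self.t = thresholds
        self.queue = deque()  # sizes in bytes of the stored packets
        self.depth = 0        # total bytes stored
        self.prev_depth = 0   # depth before the latest event

    def _congested(self):
        growth = self.depth - self.prev_depth
        return (self.depth > self.t.max_bytes
                or len(self.queue) > self.t.max_packets
                or growth > self.t.max_growth)

    def enqueue(self, size):
        """Fig. 7A: return a notification label on congestion, else None."""
        self.prev_depth = self.depth
        self.queue.append(size)
        self.depth += size
        return "ECN_ACK" if self._congested() else None

    def dequeue(self):
        """Fig. 8: return a flagged ACK if congestion persists, else a regular ACK."""
        self.prev_depth = self.depth
        self.depth -= self.queue.popleft()
        return "ACK_FLAGGED" if self._congested() else "ACK"
```

In this sketch a single congestion test serves both events, which mirrors the statement that the same three parameters are recalculated when a packet arrives and when it departs.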

One mechanism for detecting endpoint congestion uses an endpoint congestion ACK (ECA) value, which is determined as the maximum of the following parameters:

- A function of the endpoint output queue depth (in bytes). The function has two control components: an equilibrium value that sets the minimum depth at which a value is generated, and a scaling component applied to the remaining depth.

- A derivative component, which yields a high ECA value if the output queue depth increases rapidly.

- A function of the total number of headers in the output queue (which can detect incast caused by flows of small packets).

In one embodiment, the endpoint output buffer may track metrics for feeding back congestion information via congestion ACKs that are part of the ECA. This feedback can be done on a queue-by-queue basis (there may be multiple queues for different buffer/traffic classifications) and involves tracking multiple factors. When a request reaches the endpoint output buffer, the factors of the queue, including the impact of the request, may be examined, and the maximum of them may be forwarded to the OFCT for use in deciding what congestion feedback to provide. These factors may be checked again when the request wins arbitration, is dequeued, and is forwarded to the OFCT. A flag may be forwarded with the request to indicate whether the factors are all zero (that is, whether congestion persists). This value does not include the effect of dequeuing the request. The following paragraphs describe these retained factors and how they are calculated.

Linear byte factor: the linear byte factor tracks the current depth (in bytes) of requests in the associated queue. It can subtract the equilibrium value from this depth and then place the result into 256 buckets that are evenly distributed over the dynamic range. The details of the calculation are as follows:

lbf_thresh: linear byte factor threshold per queue (in units of 256 B).

lbf_shift: linear byte factor shift per queue.

qu_depth: the number of bytes of requests currently in the queue (in units of 256 B).

The result of the 8-bit lbf_factor is:

In one embodiment, the (qu_depth - lbf_thresh) term is not allowed to be negative. If the result is negative, it will be forced to zero. Any lbf_factor value exceeding 255 will be forced to 255.
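The expression for the linear byte factor is not reproduced in the text above. A minimal sketch that is consistent with the listed parameters is given below; it assumes that the shift is a right shift applied to the clamped difference, which is an interpretation of the bucket description rather than a confirmed formula. The function name is hypothetical.

```python
def linear_factor(qu_depth, thresh, shift):
    """Assumed form: clamp (depth - threshold) at zero, right-shift, saturate at 255."""
    excess = max(qu_depth - thresh, 0)  # negative differences are forced to zero
    return min(excess >> shift, 255)    # values above 255 are forced to 255
```

For example, with a depth of 100 units, a threshold of 20, and a shift of 2, this sketch yields (100 - 20) >> 2 = 20. A depth below the threshold yields zero, and a very deep queue saturates at 255.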

Linear packet factor: the linear packet factor tracks the current number of requests in the associated queue. It can subtract the equilibrium value from this count and then place the result into 256 buckets that are evenly distributed over the dynamic range. The details of the calculation are as follows:

lpf_thresh: linear packet factor threshold per queue.

lpf_shift: linear packet factor shift per queue.

qu_pkt: the number of requests currently in the queue.

The result of the calculation of the 8-bit lpf_factor is:

In one embodiment, the (qu_pkt - lpf_thresh) term is not allowed to be negative. If the result is negative, it will be forced to zero. Any lpf_factor value exceeding 255 will be forced to 255.

Derivative byte factor: the derivative byte factor tracks whether the current queue depth is growing. For each time period, it takes the difference in depth between the beginning and the end of the period and divides it by a time factor. To simplify the required hardware, the time factor can be limited to a power of 2, so that the division becomes a shift. The details of the calculation are as follows:

dbf_clk_cnt: derivative byte factor clock count for each queue.

dbf_shift: derivative byte factor shift per queue.

dbf_thresh: derivative byte factor threshold for each queue.

qu_depth: the number of bytes of requests in the queue at the end of the time period.

qu_depth_old: the number of bytes of requests in the queue at the beginning of the time period.

The result of the calculation of the 8-bit dbf_factor is:

In one embodiment, the (qu_depth - qu_depth_old) term is not allowed to be negative. If the result is negative, dbf_factor will be forced to zero. In addition, if the current qu_depth is not greater than dbf_thresh, dbf_factor is also forced to zero. Any dbf_factor value exceeding 255 will be forced to 255. The dbf_factor is recalculated every dbf_clk_cnt clocks, and incoming requests use the most recently calculated value.
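The growth-based factor and its combination with the two linear factors can be sketched as follows. The expression itself is not reproduced in the text, so this sketch assumes that the division by the power-of-2 time factor is a right shift by dbf_shift and that the depth threshold mentioned above is the per-queue dbf_thresh. The function names are hypothetical.

```python
def derivative_byte_factor(qu_depth, qu_depth_old, dbf_thresh, dbf_shift):
    """Assumed form: growth over the period, right-shifted and saturated at 255."""
    if qu_depth <= dbf_thresh:
        return 0                                # shallow queues report no growth
    growth = max(qu_depth - qu_depth_old, 0)    # shrinking queues are forced to zero
    return min(growth >> dbf_shift, 255)


def eca_value(lbf_factor, lpf_factor, dbf_factor):
    """The ECA value is described as the maximum of the three factors."""
    return max(lbf_factor, lpf_factor, dbf_factor)
```

Taking the maximum means that any one symptom (a deep queue in bytes, many small packets, or rapid growth) is enough to raise the reported congestion level.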

Different types of congestion ACKs that may be returned from an endpoint fabric egress port may be used to report different degrees of endpoint congestion. These endpoint congestion ACKs may be used to manage the bandwidth of flows flowing into heavily congested endpoint egress ports.

The current ECA value, source port ID, and source flow ID may be sent to the egress OFCT (EFCT) each time a header is enqueued to the egress port output buffer queue. If the ECA value is not zero (indicating congestion), an ACK_ECA generated at the EFCT front end may be returned to all upstream Input Flow Channel Tables (IFCTs). The goal of ACK_ECA ACKs is to quickly inform upstream flow channels that they are contributing to endpoint congestion so that they can react quickly. A deepening endpoint egress port output buffer queue indicates an incast. The ACK_ECA value, calculated as described above, may carry the extent of the incast to all upstream IFCTs that the flow's packets have traversed.

Another congestion level calculation may be performed when a packet has been processed and is ready to leave the endpoint egress buffer queue and move to the NIC/node connected to the egress port. Based on this new congestion calculation, if the egress port congestion persists, the system may convert the normal ACK_DELTA ACK (the normal flow channel ACK) used by the flow channel mechanism to ACK_DELTA_ECA (e.g., by setting a flag). The ACK_DELTA_ECA ACK may be used to clear or maintain the previous congestion status of the ingress port flow channels that caused the congestion.

There may be other ACKs, such as ACK_route, computed using ECA congestion management at the intermediate fabric ports. These reroute ACKs may be generated to help flows avoid congestion in the middle of the fabric.

It should be noted that the endpoint congestion management logic shown in fig. 7B may also perform the operations described in the flowchart shown in fig. 8. In other words, the endpoint congestion management logic 730 may potentially generate an endpoint congestion notification ACK when a packet arrives at an output buffer and when a packet departs from the output buffer.

When an endpoint congestion notification ACK traverses the fabric, the IFCT of each switch on the path may apply a bandwidth limitation to the flow corresponding to the ACK. In effect, the fabric may slow the delivery of the flow in a distributed manner at each switch on the data path. When the endpoint congestion notification ACK passes an IFCT, its value may be stored in the table entry of the flow as an ep_congestion value, which may be used to select the desired maximum bandwidth for the flow. Each value of ep_congestion may have a corresponding set of high, target, and drop water level values. For high levels of congestion, when ep_congestion has a higher value, the water level values may be lower so that congestion can be alleviated more aggressively. For low levels of congestion, when ep_congestion has a lower value, a different set of larger high, target, and drop water level values may be used to achieve higher flow bandwidth. For example, a table indexed by the ep_congestion value may be used. For each ep_congestion value, the table may indicate a corresponding set of high, target, and drop water level values. The entries of the table may be predetermined, so that when an endpoint congestion notification ACK is received, the switch may perform a lookup in the table using the ep_congestion value and apply the three corresponding water level values to the identified flow.
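The table lookup described above can be illustrated with a short sketch. The table contents, the units, the number of congestion levels, and the dictionary-based flow entry are all hypothetical placeholders; only the idea of indexing a predetermined table by the congestion value and applying the three resulting values to the flow comes from the description.

```python
from typing import NamedTuple


class WaterLevels(NamedTuple):
    high: int
    target: int
    drop: int


# Hypothetical predetermined table (values in KB): higher congestion, lower levels.
WATER_LEVEL_TABLE = {
    0: WaterLevels(high=64, target=32, drop=16),
    1: WaterLevels(high=32, target=16, drop=8),
    2: WaterLevels(high=8, target=4, drop=2),
}


def on_endpoint_congestion_ack(flow_entry, ack_value):
    """Store the value carried by the ACK and apply the matching water levels."""
    flow_entry["ep_congestion"] = ack_value
    flow_entry["water_levels"] = WATER_LEVEL_TABLE[ack_value]
    return flow_entry
```

Because every IFCT on the path performs the same lookup as the ACK passes, each switch independently arrives at the same per-flow limits without further coordination.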

In some cases, if the sources inject data greedily, simply slowing down the forwarding speed inside the network may not be enough to completely eliminate the congestion. To address this issue, the ingress edge switch may be configured to instruct the source device (which typically resides outside the fabric) to limit data injection in a fine-grained, flow-by-flow manner. This switch-to-host flow control mechanism may be referred to as fine-grained flow control (FGFC).

In particular, in an HPC environment, an end host or compute node may have a large number of cores running a large number of threads, processes, or virtual machines, each of which may inject its own data flow into the network through a common physical Network Interface Controller (NIC). When congestion is present, port-based flow control can only limit the overall data rate of a single port on the NIC, which can be 40 Gb/s or higher. Throttling the total data rate of the entire port may be unfair to the flows that are not causing congestion. FGFC can extend the concept of individual flows, or groups of associated flows, to their ultimate source, which can be a single thread executing on one of the cores.

To slow down the injection of data from the source, the FGFC logic block on the ingress edge switch (e.g., FGFC logic block 434 in edge switch 406 in fig. 4A) may use a hybrid pause-credit approach to restrict incoming data associated with a particular flow or group of flows. A pause-based method typically involves the receiving end issuing a pause command to the transmitting end, in response to which the transmitting end may stop transmission until further notice. In a credit-based approach, the receiving end may send transmission credits to the transmitting end, which allow the transmitting end to send more data, but only up to the amount specified by the credit value. This mechanism allows the receiving end to control its input buffer depth more accurately to avoid overflow while allowing transmission to continue. FGFC may use a hybrid approach in which, upon detecting congestion, the ingress edge switch may issue FGFC frames with set timer values for one or more flows to the end host NIC (e.g., NIC 401 on end host 402 in fig. 4A). After the FGFC frame is received, the ingress edge switch may turn on a credit-based flow control mode. In response, the NIC may limit the transmission data rate of the corresponding flow(s) based on the received credits, while allowing other flows to transmit at normal data rates. After the predetermined timer expires, the end host NIC may resume normal transmission of the restricted flow(s) unless another pause command is received. It should be noted that the restricted flow may be identified by any field derived from the packet. The restricted flow may be specific to a single process or thread executing on the end host.

FGFC can implement control communications between the edge switch and the end host NIC using Ethernet frames with an Organizationally Unique Identifier (OUI) extended Ethertype field. The frames may indicate one or more of: (1) the protocol used by the controlled flow; (2) an identifier indicating the source (e.g., an application, process, or thread) that generates the packets that need to be restricted; (3) a pause time value for which flow control is to continue (which may prevent lock-up if a subsequent FGFC frame is lost due to an error); and (4) a credit value, which may be zero, indicating the number of frames or amount of data that may be sent during the pause.
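The interplay between the pause time value and the credit value on the NIC side can be illustrated with a minimal sketch. The class name, the use of bytes as the credit unit, and the abstract integer clock are assumptions for illustration; an actual NIC may implement this behavior in hardware, software, or both.

```python
class NicFlowLimiter:
    """Hypothetical per-flow limiter driven by received FGFC frames."""

    def __init__(self):
        self.expires_at = None  # time at which the current pause ends, if any
        self.credits = 0        # bytes that may still be sent during the pause

    def on_fgfc_frame(self, now, pause_time, credit):
        """A new frame (re)starts the pause timer and sets the credit balance."""
        self.expires_at = now + pause_time
        self.credits = credit

    def try_send(self, now, size):
        """Return True if a packet of `size` bytes may be injected at time `now`."""
        if self.expires_at is not None and now >= self.expires_at:
            self.expires_at = None  # timer expired: resume normal transmission
        if self.expires_at is None:
            return True
        if size <= self.credits:
            self.credits -= size    # restricted, but credit is available
            return True
        return False
```

The timer expiry path is what prevents a flow from remaining blocked indefinitely if a later FGFC frame is lost.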

It should be noted that the identifier used to indicate the source flow subject to flow control may differ based on the protocol associated with the flow. For Layer 2 Ethernet Virtual Local Area Network (VLAN) traffic, the identifier may include a VLAN number. For IPv4 traffic, the identifier may include a source/destination IP address pair, a UDP or TCP/IP 5-tuple containing a UDP or TCP port number, or an optional flow label. For IPv6 traffic, the identifier may include one or more IPv6 addresses or IPv6 flow labels. For proprietary HPC protocol traffic, the identifier may comprise a process or thread ID. Typically, this identifier is also stored in the EFCT of the edge switch, as it is used to map the corresponding traffic flow to the flow ID.

To trigger FGFC, the IFCT of the ingress edge switch may monitor its flow-specific input queues. For each queue, the corresponding IFCT entry may indicate three water level values (high, target, and drop), which may be used to measure queue depth. In some examples, these water level values may be included as additional fields in the IFCT as shown in fig. 4C, or may be stored in a separate table and linked by fields in the IFCT. When the queue depth is less than the target value, FGFC is not required. When the queue depth reaches the target water level value, the IFCT may communicate with the FGFC logic block to initiate FGFC with the NIC of the end host. When the queue depth falls below the drop water level value, FGFC may be stopped and normal transmission of the flow resumed.

Fig. 9A shows a flow chart of an exemplary FGFC process. During operation, at the ingress edge switch, the system may monitor the flow-specific input queues (operation 902). The system may further determine whether the FGFC is currently on for the corresponding flow (operation 904). If FGFC is currently turned on for the flow, the system may determine whether the queue depth is below the drop water level (operation 906). If the queue depth has not dropped below the drop level, the system may continue credit-based transmission in FGFC mode (operation 912). If the queue depth has decreased below the drop water level, the system may revert to normal transmission of the stream (operation 914). Returning to operation 904, if the FGFC is not currently on, the system may determine whether the queue depth is greater than the target water level (operation 908). If so, the system may initiate FGFC for the stream (operation 910). The FGFC logic block in the edge switch may obtain flow identification information (e.g., VLAN tag, TCP/IP 5-tuple, thread ID, etc.) from the EFCT entry corresponding to the flow and send the FGFC ethernet frame to the NIC on the end host. Subsequently, the system may continue to monitor the input queue (operation 902). If the queue depth is not greater than the target level, the system may continue with regular data transfer (operation 914).
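The decision logic of fig. 9A amounts to a two-threshold hysteresis, which can be sketched as a single step function. The function name and the boolean state representation are hypothetical; the comparisons follow the flow chart as described (start above the target level, stop below the drop level).

```python
def fgfc_step(fgfc_on, queue_depth, target, drop):
    """Return the new FGFC on/off state for one flow after one monitoring pass."""
    if fgfc_on:
        # Operations 906/912/914: stay in FGFC mode until depth falls below drop.
        return queue_depth >= drop
    # Operations 908/910/914: start FGFC only when depth exceeds the target.
    return queue_depth > target


# Example trace with target=60 and drop=20.
state = False
trace = []
for depth in (10, 50, 70, 40, 25, 15, 30):
    state = fgfc_step(state, depth, target=60, drop=20)
    trace.append(state)
```

Using separate start and stop thresholds keeps the mechanism from toggling rapidly when the queue depth hovers around a single value.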

To facilitate FGFC, the NIC may be configured to process FGFC ethernet frames so that the NIC may communicate with an application or process on the end host that is generating data. The parsing of the FGFC ethernet frames and the communication with the application or process may be done in software, hardware, or a combination of both. Fig. 9B shows an example of a NIC supporting FGFC. In this example, the NIC 930 may include a processor 932, a memory 934, a transmitter 936, a receiver 938, an FGFC logic block 940, and a communications logic block 942. During operation, the transmitter 936 and the receiver 938 may perform communication with the edge switch via the edge link. The communication logic 942 may communicate with a central processing unit of an end host where the NIC 930 is located via a data bus, such as a peripheral component interconnect express (PCIe) bus. Processor 932 and memory 934 internal to NIC 930 may perform local processing of data. During operation, FGFC logic block 940 may work with edge switches to apply FGFC in a flow-by-flow manner. Additionally, the FGFC logic block 940 may communicate with the central processing unit of the end host via the communication logic block 942 to limit data injection by a single application or process corresponding to a particular flow subject to FGFC, thereby controlling the amount of data injected into the fabric.

As described above, two types of congestion may occur in the network. The first type is endpoint congestion and the second type is fabric link congestion. Fig. 10 shows an example of fabric link congestion. In this example, two intermediate switches 1002 and 1006 communicate via fabric link 1004. Multiple source/destination pairs may transmit traffic via fabric link 1004. Thus, fabric link 1004 may experience congestion, even though the links leading to and away from fabric link 1004 may not be congested. When such congestion occurs, the fabric link 1004 may appear to be a "hot spot".

To alleviate fabric link congestion, the switch may apply dynamic, per-flow, credit-based flow control. At the switch, if an input queue begins to fill up and the queue_extent value for the flow reaches a predetermined threshold, the switch may generate a special ACK to notify the IFCT of the upstream switch of the congestion. This special hop-by-hop ACK may be referred to as a "Headroom ACK". Upon receipt of the Headroom ACK, the IFCT of the upstream switch may initiate credit-based flow control toward the downstream switch. In the downstream IFCT entry, a flag called Upstream Metering (UM) may be set to indicate that data transmissions from the upstream switch are now metered on a credit basis. The Headroom ACK packet may also include a credit value.

When the upstream switch receives the Headroom ACK, a flag called Downstream Metering (DM) may be set in the corresponding entry of the IFCT. The IFCT may also store, in a signed headroom field in the IFCT entry, the credit value carried by the Headroom ACK (i.e., the headroom value indicates the number of credits). The headroom field may represent the amount of data that may be forwarded to the downstream switch. This establishes credit-based flow control for the corresponding flow. If a Headroom ACK is received when the DM flag in the flow entry has already been set, the credit value carried by the Headroom ACK may be added to the existing headroom value.

If the headroom value is not greater than zero (i.e., no credit is available), new packets received by the upstream IFCT may be blocked. These packets may fill the input queue for the flow and may in turn cause the IFCT to initiate per-flow, credit-based flow control toward its own upstream IFCT, and so on. If the headroom value is greater than zero, the packets stored in the input queue may be dequeued and forwarded to the downstream switch, and the headroom value may be decremented by the size of each forwarded packet, which may cause the headroom value to become zero or negative.
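The upstream side of this credit accounting can be sketched as follows. The class and method names are hypothetical, and credits are assumed to be expressed in bytes; the behavior (set DM on the first Headroom ACK, accumulate on later ones, forward only while headroom is positive, allow the balance to go negative) follows the description.

```python
class UpstreamFlowEntry:
    """Hypothetical IFCT entry on the upstream switch for one flow."""

    def __init__(self):
        self.dm = False    # Downstream Metering flag
        self.headroom = 0  # signed credit balance (bytes, assumed unit)

    def on_headroom_ack(self, credit):
        if self.dm:
            self.headroom += credit  # add to the existing headroom
        else:
            self.dm = True           # establish credit-based flow control
            self.headroom = credit

    def try_forward(self, packet_size):
        """Return True if the packet may be forwarded downstream."""
        if not self.dm:
            return True              # no metering: normal transmission
        if self.headroom <= 0:
            return False             # no credit available: packet is blocked
        self.headroom -= packet_size # may drive the balance to zero or below
        return True
```

Allowing the balance to go negative lets a whole packet be sent whenever any credit remains, with the overshoot repaid out of later credits.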

Since the flow is restricted from sending new packets to the downstream IFCT, the input queue of the downstream IFCT may begin to drain at some rate, depending on its downstream congestion. As described above, the input queue for each flow may have three queue depth water level values, i.e., high, target, and drop, which may be used to manage credit-based flow control. The target water level may approximate the ideal queue depth for the desired flow bandwidth; it indicates that sufficient buffering is available to transmit data downstream. When congestion occurs, the credit-based flow control mechanism may attempt to keep the queue_extent value of the flow near the target water level.

If the queue_extent value is between the high and drop levels and greater than the target level, then a credit slightly smaller than the packet size may be returned to the upstream switch along with the Headroom ACK when forwarding the packet. If the value of queue_extent does not exceed the target level, a credit slightly larger than the packet size may be returned to the upstream switch along with the Headroom ACK when forwarding the packet.

If the queue_extent depth is greater than the high water level, no credit is returned when forwarding the packet. This mechanism can lower the queue_extent value more quickly and is typically used when congestion is first detected. The input queue for a flow may become empty more quickly if the congestion clears. Credit-based flow control may be turned off when the queue depth is less than the drop water level. This may be done by clearing the UM flag in the IFCT entry and returning a Headroom ACK with the maximum credit value to the upstream switch. When received by the upstream IFCT, this Headroom ACK will clear the DM flag of the entry and turn off flow control based on the headroom value.
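The credit-return policy on the downstream side can be summarized in a small decision function. The function name, the `delta` parameter standing in for "slightly smaller" or "slightly larger", and the `max_credit` parameter are hypothetical; the sketch assumes the ordering drop < target < high.

```python
def credit_to_return(queue_extent, packet_size, high, target, drop,
                     delta, max_credit):
    """Return (credit, um_flag) for the Headroom ACK sent when forwarding a packet.

    Assumes drop < target < high. `delta` is an illustrative adjustment.
    """
    if queue_extent < drop:
        return max_credit, False              # turn credit-based control off
    if queue_extent > high:
        return 0, True                        # drain the queue quickly
    if queue_extent > target:
        return max(packet_size - delta, 0), True  # shrink toward the target
    return packet_size + delta, True          # grow back toward the target
```

Returning slightly less than was forwarded shrinks the upstream allowance over time, while returning slightly more expands it, so the queue depth is steered toward the target level from either side.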

It should be noted that in a typical network topology, there may be multiple switches and multiple data paths between two endpoints. In a multi-path network, various methods may be used to control fabric link congestion. For example, injection limits described later in this document may control the maximum amount of data in the entire fabric. This means that if a particular fabric link is overloaded, a flow may use a different data path that does not pass through the congested link. An overloaded link may be detected and a "reroute" ACK generated for a group of flows. A reroute ACK may temporarily block a flow in an upstream switch, and when all ACKs for that flow have been returned, the flow may be unblocked and may be free to use a different path through the fabric. The load-based dynamic adaptive routing mechanism may then direct the flow's leading packet to use a different, non-congested fabric link. Furthermore, the load on the entire fabric may become more balanced.

In one embodiment, the IFCT may have 24 configurable flow bandwidth and buffer settings. The ECA value returned in the ACK_ECA congestion ACK may be used to select one of the 24 settings. The initial flow channel bandwidth setting provides full bandwidth, and the bandwidth decreases as congestion ACKs return. These bandwidth and buffer settings may accommodate individual flow bandwidths ranging from 200 Gbps down to 10 Mbps.

Once the ECA value is stored in the IFCT, the bandwidth and buffer settings corresponding to that value can be used each time a new header is injected or forwarded into the network fabric. The final ACK that closes the flow is the flow channel ACK that is normally returned from the endpoint egress port. This ACK may be ACK_DELTA or ACK_DELTA_ECA, depending on the depth of the final egress port output buffer queue. As previously described, the endpoint egress port OFCT may generate an ACK_ECA ACK when a packet is enqueued, but the queue may since have drained to a point where it is no longer considered congested. If the flow channel receives ACK_DELTA instead of ACK_DELTA_ECA, this indicates that the congestion has cleared, and the flow channel will return to full speed. However, if the flow channel receives ACK_DELTA_ECA instead of ACK_DELTA, the congestion status of the flow channel and the reduced injection rate are maintained.
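How an IFCT entry might react to the three ACK types can be sketched as follows. The dictionary-based entry, the string labels, and the assumption that the ECA value has already been mapped to a setting index (with index 0 meaning full bandwidth) are illustrative choices; the mapping from ECA value to one of the configurable settings is not specified here.

```python
FULL_BANDWIDTH_SETTING = 0  # hypothetical index of the unrestricted setting


def on_flow_ack(entry, ack_type, setting_from_eca=None):
    """Update a flow channel entry according to the ACK type received."""
    if ack_type == "ACK_ECA":
        # Congestion reported at enqueue time: select a reduced setting.
        entry["setting"] = setting_from_eca
        entry["congested"] = True
    elif ack_type == "ACK_DELTA":
        # Normal flow channel ACK: congestion has cleared, return to full speed.
        entry["setting"] = FULL_BANDWIDTH_SETTING
        entry["congested"] = False
    elif ack_type == "ACK_DELTA_ECA":
        # Congestion persists: keep the current status and reduced rate.
        pass
    else:
        raise ValueError(f"unknown ACK type: {ack_type}")
    return entry
```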

It is also worth mentioning that the system can initialize a new flow in a "slow start" mode, similar to TCP. By setting a slightly lower maximum flow bandwidth, less payload data will be injected into the fabric by the time the first frame is enqueued to a congested endpoint egress port queue. If the slow start of the flow does not result in congestion, the ACK_DELTA ACK can quickly remove the restriction and allow the flow to proceed at full bandwidth. Various HPC applications generate many short flows from each source. Having a slight bandwidth limitation at the start of a single flow does not generally have adverse consequences for the total bandwidth from each node.

In a typical busy network, moderate congestion often appears and disappears. This may be due to hot spots on the route to the destination or to moderate incasts of multiple flows at the destination. Any source of congestion will cause input buffers to fill up. Once a buffer is full, the bandwidth on the link may collapse. The flow channels include a hard backpressure (flow control) mechanism to limit frame forwarding for each individual flow that cannot transmit frames at full bandwidth to the final egress port. If hard backpressure is acting on one flow, other flows using the same link need not be slowed down by the congested flow.

Backpressure may reduce the amount of input buffer space consumed while a flow slowly makes progress into an incast-congested endpoint. The wider the incast, the slower the progress of each individual flow. This mechanism can operate on a single link and is affected only by the round trip delay of that link. This means that more unwanted frames can be prevented from reaching the final input buffer, where the incast places the highest pressure on input buffer space.

An application may have thousands of nodes collaborating on a problem. Ideally, a well-designed HPC application generates structured traffic patterns that do not cause congestion. However, not all applications are designed to avoid large incasts. For example, the final action of each major step of an application may be to aggregate the results of every node in one place, and a simple, large-scale incast is the easiest way to do so. However, the impact of large incasts on other applications can be severe. The described congestion management mechanism can control these events even when the incast becomes very wide (e.g., 1000 nodes). The ECA value may be important for controlling incast events. The endpoint egress port queue may directly indicate the width of the incast.

Incast can develop very quickly, and it is important that the feedback control for incast is equally fast. The first frame of a flow arriving at a rapidly deepening endpoint egress port queue may generate an ACK_ECA ACK carrying the ECA value. Each IFCT traversed by the ACK may select the appropriate bandwidth and buffer setting. The upstream flow extent field may define the maximum flow extent for the bandwidth setting. This may be an absolute limit on the amount of frame data downstream. Since the mechanism runs on each upstream IFCT of the flow, the transfer of frame data into a wide incast may be stopped as the first ACK passes each IFCT.

The upstream flow extent restriction may be effective, but only up to a point, because at its limit it can only restrict each flow that is part of the incast to a single frame. This does not provide sufficient protection if the frame size is the size of the MTU, and the situation is worse when the MTU has been set to a jumbo frame size. For example, 1000 flows with one 9 kB frame each amount to 9 MB. This volume of data may flood the endpoint egress port buffer.

In one embodiment, the bandwidth and buffer settings may include another setting called the limit-after-ack bit. This bit may be configured on the host ingress port. When this bit is set, an additional restriction can be placed on the flow that can limit its contribution to much less than a single frame in the fabric. If the limit-after-ack bit is set, fc-state may be set to block-until-empty when the flow extent exceeds upstream-flow-extent. This prevents any frame from leaving until the ACKs for all downstream frames have been returned. When the ACK of the last packet sent is returned, fc-state may be changed to RUNNING, and the wake-up time may be set to (local-rolling-time + (BYTE_TIME << BYTE_TIME_SCALE)). The next frame may wait for the new wake-up time, so that the average contribution of the flow to the egress age queue may be reduced to much less than a single frame.
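The limit-after-ack behavior can be sketched as a small state machine. This sketch assumes that the wake-up expression denotes a left shift of BYTE_TIME by BYTE_TIME_SCALE, that the blocked state releases when the flow extent returns to zero, and that time is an abstract integer counter; the class and attribute names are hypothetical.

```python
class LimitAfterAckFlow:
    """Hypothetical per-flow state on a host ingress port with limit-after-ack set."""

    def __init__(self, upstream_flow_extent, byte_time, byte_time_scale):
        self.upstream_flow_extent = upstream_flow_extent
        self.byte_time = byte_time
        self.byte_time_scale = byte_time_scale
        self.state = "RUNNING"
        self.flow_extent = 0  # unacknowledged bytes downstream
        self.wake_time = 0

    def can_send(self, now):
        return self.state == "RUNNING" and now >= self.wake_time

    def on_send(self, size):
        self.flow_extent += size
        if self.flow_extent > self.upstream_flow_extent:
            self.state = "BLOCK_UNTIL_EMPTY"  # hold frames until all ACKs return

    def on_ack(self, size, now):
        self.flow_extent -= size
        if self.state == "BLOCK_UNTIL_EMPTY" and self.flow_extent == 0:
            self.state = "RUNNING"
            # Assumed reading of the wake-up expression: time + scaled byte time.
            self.wake_time = now + (self.byte_time << self.byte_time_scale)
```

The added delay after the last ACK is what lets the long-run average amount of data per flow fall below one frame, since the flow spends part of each cycle with nothing in the fabric.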

Ordering is enforced by forcing the subsequent packets of a flow to follow the path of the first frame of the flow. If frames are not forced to follow this path but the flow is still created in the normal way, the flow channel tree is formed in much the same way as a multicast tree. The ACK flow may proceed in the normal manner, except that there is no frame replication.

The sustained bandwidth of out-of-order flows tends to be higher than that of ordered flows because adaptive routing can balance the overall load more effectively. The latest adaptive load values can be used for each frame. However, out-of-order flows may create more congestion, because the spreading effect of adaptive routing increases the chance of filling more buffers in the network as incasts develop.

These flows can be controlled using ECA values. The ACK_ECA ACK will select a new bandwidth and buffer setting, using the new flow extent for the out-of-order flow tree. Once any small incast develops, the congested flows can be further controlled by quickly changing out-of-order flows to in-order flows. This policy provides the bandwidth advantage of out-of-order delivery for well-behaved applications, but protects other applications when an application does not behave well and generates incasts.

Fig. 11 shows a flow diagram of an example process for applying credit-based flow control over a congested fabric link. During operation, the switch system may monitor its flow-specific input queues (operation 1102). The system may determine whether an entry in its IFCT has the UM flag set (operation 1104). If the UM flag is set, which means that credit-based flow control is on, the system may further determine whether the queue_extent value is less than the drop water level value (operation 1106). If the queue_extent value is less than the drop water level value, the system may clear the UM flag, turn off credit-based flow control, and resume normal data transfer (operation 1014). If the queue_extent value is greater than the drop water level value, the system may continue credit-based flow control (operation 1106). Returning to operation 1104, if the UM flag is not set, which means that the system is in a normal transmission mode, the system may determine whether the queue_extent value is greater than the target water level value (operation 1108). If so, the system may initiate credit-based flow control and send a Headroom ACK to the upstream switch (operation 1110). If the queue_extent value is not greater than the target water level value, the system may continue with normal data transfer (operation 1112).

In general, a flow channel switch may use a combination of several congestion detection and control mechanisms. For example, different degrees of endpoint congestion may be reported using endpoint congestion notification ACKs, which may be returned from the final fabric egress edge port. This ACK type may be used to manage the bandwidth of flows heading into a heavily congested egress edge port. The system may also manage fabric link congestion using hop-by-hop credit-based flow control. Such a hop-by-hop congestion management mechanism can cope effectively with low to moderate levels of congestion, because its response time can be much shorter than the network-wide round-trip delay.

The system may also apply per-flow injection limits when congestion is severe, as can happen with a large incast. The injection limit for a flow may be determined from the ep_congestion value and compared with the flow_extent value in every IFCT that the flow traverses. If flow_extent is greater than the limit, the IFCT may block packets of that flow from being forwarded out of the input queue. This mechanism can reduce packet forwarding over an entire flow to as little as a single packet.
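As a rough illustration of the per-flow check, the sketch below derives an injection limit from the endpoint congestion value (called `ep_congestion` in the code) and compares it with the flow extent. The text does not specify how the limit is computed, so the halving rule, the `max_limit` default of 64, and the choice of packets as the unit are purely hypothetical; only the comparison against `flow_extent` and the floor of a single packet follow the description.

```python
def injection_limit(ep_congestion: int, max_limit: int = 64) -> int:
    """Hypothetical mapping: halve the limit per congestion level, never below 1 packet."""
    return max(1, max_limit >> ep_congestion)


def may_forward(flow_extent: int, ep_congestion: int, max_limit: int = 64) -> bool:
    """Return False when the IFCT should hold the flow's packets in the input queue.

    Per the description, forwarding is blocked when flow_extent exceeds the limit.
    """
    return flow_extent <= injection_limit(ep_congestion, max_limit)
```

With these assumed numbers, a flow at congestion level 3 is held once more than 8 packets are outstanding, and at very high congestion levels the limit bottoms out at a single packet.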

The system can also protect unrelated traffic from the extreme congestion caused by an incast with a large number of contributors. In this case, the ep_congestion value may be set to a high value, and the average amount of data a flow has in the fabric may be reduced to a small fraction of a packet. This may be achieved by releasing the next packet of a given flow into the fabric from the IFCT of the ingress edge port only after a programmable delay has elapsed since the ACK of the previous packet was received.
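The delayed-release behavior can be modeled as below. This is a minimal sketch under stated assumptions: the `PacedFlow` class and its method names are invented for illustration, time is passed in explicitly as a number rather than read from a clock, and the flow is assumed to have at most one packet outstanding.

```python
class PacedFlow:
    """Hypothetical model of one flow at the ingress edge port's IFCT."""

    def __init__(self, delay: float):
        self.delay = delay           # programmable delay after each ACK
        self.awaiting_ack = False    # True while a packet is in the fabric
        self.last_ack_time = None    # time the previous packet's ACK arrived

    def on_ack(self, now: float) -> None:
        """Record that the ACK for the outstanding packet has returned."""
        self.awaiting_ack = False
        self.last_ack_time = now

    def try_release(self, now: float) -> bool:
        """Release the next packet only if no packet is outstanding and the delay has passed."""
        if self.awaiting_ack:
            return False
        if self.last_ack_time is not None and now - self.last_ack_time < self.delay:
            return False
        self.awaiting_ack = True
        return True
```

In this model, if a packet spends a round-trip time R in the fabric and the flow then idles for a delay D, the flow has on average about R / (R + D) of a packet in flight, which is how a sufficiently long delay brings the average below one packet.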

In addition to the per-flow injection limits, the system may also measure the amount of data injected into the fabric on a per-ingress-port basis, and set injection limits that cap the total amount of data a port can inject into the fabric. Because every ingress port can apply such a limit, the system can control the maximum amount of data allowed inside the fabric. Limiting the amount of data entering the fabric ensures that buffers are not exhausted when bandwidth is scarce. As a result, traffic that does not use the reduced-bandwidth paths is not affected.

To facilitate per-port injection limiting, the IFCT may maintain a total traffic flow count. The total count may be incremented each time a packet is injected into the fabric from an edge port, and decremented when an ACK is returned for a flow. Once all ACKs for all flows of the ingress port have been returned (i.e., when the sum of the flow_extent values for all flows becomes zero), the total traffic flow count may be set to zero.
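The bookkeeping described above can be sketched as follows. The `IngressPortCounter` class, its method names, and the use of a byte-like `size` argument are assumptions made for illustration; the text only states that the count rises on injection, falls on ACK, and is set to zero once every flow's extent has returned to zero.

```python
class IngressPortCounter:
    """Hypothetical per-ingress-port counter enforcing a total injection cap."""

    def __init__(self, port_limit: int):
        self.port_limit = port_limit  # cap on data this port may have in the fabric
        self.total = 0                # total traffic count for the port
        self.flow_extent = {}         # outstanding data per flow ID

    def can_inject(self, size: int) -> bool:
        """Check whether injecting `size` more data would stay within the cap."""
        return self.total + size <= self.port_limit

    def on_inject(self, flow_id, size: int) -> None:
        """Account for a packet entering the fabric from this edge port."""
        self.total += size
        self.flow_extent[flow_id] = self.flow_extent.get(flow_id, 0) + size

    def on_ack(self, flow_id, size: int) -> None:
        """Account for an ACK that acknowledges `size` data of `flow_id`."""
        self.total -= size
        self.flow_extent[flow_id] -= size
        if self.flow_extent[flow_id] == 0:
            del self.flow_extent[flow_id]
        if not self.flow_extent:
            # All flows fully acknowledged: set the count to zero, as described.
            self.total = 0
```

Setting the count to zero when no flow has outstanding data keeps any accumulated accounting error from persisting across idle periods.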

Fig. 12 illustrates an example edge switching system that facilitates flow channels (which may correspond to switch 406 in Fig. 4A, for example). In this example, switch 1202 may include a plurality of communication ports, such as port 1220. Each port may include a transmitter and a receiver. Switch 1202 may also include processor 1204, storage device 1206, and flow channel switching logic block 1208. Flow channel switching logic block 1208 may be coupled to all communication ports and may further include a crossbar switch 1210, an EFCT logic block 1212, an IFCT logic block 1214, and an OFCT logic block 1216.

Crossbar switch 1210 may include one or more crossbar switch chips that may be configured to forward data packets and control packets (e.g., ACK packets) between communication ports. EFCT logic block 1212 may process packets received from an edge link and map the received packets to corresponding flows based on one or more header fields in the packets. Additionally, EFCT logic block 1212 may assemble FGFC Ethernet frames, which may be transmitted to an end host to control the amount of data injected by individual processes or threads. IFCT logic block 1214 may include an IFCT and perform various flow control methods in response to control packets, such as endpoint congestion notification ACKs and fabric link credit-based flow control ACKs. OFCT logic block 1216 may include a memory unit that stores an OFCT, and may communicate with the IFCT logic block of another switch to update the flow ID of a packet as it is forwarded to the next-hop switch.
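The mapping of received packets to flows by header fields can be pictured with a small table keyed on the chosen fields. This is a sketch only: the `EdgeFlowTable` class, the dictionary representation of a packet header, and the default field names `src`, `dst`, and `proto` are hypothetical and do not come from the description, which says only that one or more header fields identify a flow.

```python
class EdgeFlowTable:
    """Hypothetical edge flow table mapping header fields to locally assigned flow IDs."""

    def __init__(self, fields=("src", "dst", "proto")):
        self.fields = fields  # header fields that define a flow (assumed names)
        self._flows = {}      # header-field tuple -> flow ID
        self._next_id = 0

    def _key(self, header: dict) -> tuple:
        return tuple(header.get(f) for f in self.fields)

    def lookup(self, header: dict) -> int:
        """Return the flow ID for a packet, establishing a new flow if needed."""
        key = self._key(header)
        if key not in self._flows:
            self._flows[key] = self._next_id
            self._next_id += 1
        return self._flows[key]

    def release(self, header: dict) -> None:
        """Drop the flow's state, for example once all of its data is acknowledged."""
        self._flows.pop(self._key(header), None)
```

Packets that share the selected header fields receive the same flow ID, while a packet that differs in any of those fields establishes a new flow.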

Fig. 13 illustrates an example intermediate switch system that facilitates flow channels (which may correspond to switches 408 and 430 in Fig. 4A, for example). In this example, switch 1302 may include a plurality of communication ports, such as port 1320. Each port may include a transmitter and a receiver. Switch 1302 may also include a processor 1304, a storage device 1306, and flow channel switching logic block 1308. Flow channel switching logic block 1308 may be coupled to all communication ports and may further include a crossbar switch 1310, an IFCT logic block 1314, and an OFCT logic block 1316.

Crossbar switch 1310 may include one or more crossbar switch chips that may be configured to forward data packets and control packets (e.g., ACK packets) between communication ports. Additionally, EFCT logic block 1312 may assemble FGFC Ethernet frames, which may be transmitted to an end host to control the amount of data injected by individual processes or threads. IFCT logic block 1314 may include an IFCT and perform various flow control methods in response to control packets (e.g., endpoint congestion notification ACKs and fabric link credit-based flow control ACKs). OFCT logic block 1316 may include a memory unit that stores an OFCT, and may communicate with the IFCT logic block of another switch to update the flow ID of a packet as it is forwarded to the next-hop switch.

In summary, the present disclosure describes a data-driven intelligent networking system that can accommodate dynamic traffic through fast, efficient endpoint congestion detection and control. The system may maintain state information for individual packet flows, which may be dynamically established or released based on injected data. A packet flow may be mapped to its layer 2, layer 3, or other protocol-specific header information. Upon arrival at a switch, each flow may be provided with a flow-specific input queue. Packets of a given flow may be acknowledged upon reaching the egress point of the network, and the acknowledgement packets may be sent back to the ingress point of the flow along the same data path in the reverse direction. Thus, each switch can obtain state information for each flow and perform flow control on a per-flow basis. Such flow control allows better utilization of the network while providing versatile traffic engineering and congestion control capabilities.

The methods and processes described above may be performed by hardware logic blocks, modules, or devices. A hardware logic block, module, or device may include, but is not limited to, an Application Specific Integrated Circuit (ASIC) chip, a Field Programmable Gate Array (FPGA), a dedicated or shared processor that executes code at a particular time, and other programmable logic devices now known or later developed. When activated, the hardware logic blocks, modules, or devices perform the methods and processes included therein.

The methods and processes described herein may also be embodied as code or data, which may be stored in a storage device or computer-readable storage medium. The methods and processes may be performed by a processor when the stored code or data is read and executed by the processor.

The foregoing descriptions of embodiments of the present invention have been presented only for purposes of illustration and description. The description is not intended to be exhaustive or to limit the invention to the precise form disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention. The scope of the invention is defined by the appended claims.
