Roller arbitration method and circuit for on-chip data exchange


Reading note: this technology, Roller arbitration method and circuit for on-chip data exchange, was designed and created by 王东辉, 赵鹏, 常亮, 桑永奇, 李甲 and 姚飞 on 2021-07-29. Its main content is as follows: the invention discloses a roller arbitration method for on-chip data exchange, based on an NxN crossbar network with N input ports and N output ports, comprising the following steps: S1, determine a priority arbitration permutation; S2, judge whether each expected transmission pair in the permutation has a transmission requirement; if so, determine it as an actual transmission pair and let the determined actual transmission pair transmit data immediately, otherwise treat it as a non-transmission pair; S3, perform sequential row/column or column/row arbitration and select the switching point with the highest priority as an actual transmission pair; S4, after the polling of step S3 is finished, roll the priority arbitration permutation to obtain a new arbitration permutation; S5, repeat S2-S4 cyclically. The method improves the efficiency and speed of on-chip data exchange and is particularly suitable for artificial intelligence and big data processing chips, especially chips with a SIMT architecture. The method belongs to the fields of chip design, network-on-chip, system-on-chip, and computer architecture.

1. A roller arbitration method for on-chip data exchange, based on an NxN crossbar network with N input ports and N output ports, wherein all switching points corresponding to one input port form a row, all switching points corresponding to one output port form a column, and each input/output switching point is a transmission pair; characterized by comprising the following steps:

S1, determining a priority arbitration permutation W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], …, VOQ[N-1,x]}, where a, b, c, …, x ∈ [0, N-1] and are mutually distinct; the priority arbitration permutation W_VOQ represents N expected transmission pairs, wherein VOQ[0,a] denotes the expected transmission pair whose expected input port is PI_0 and whose expected output port is PO_a; transmission pairs not in the priority arbitration permutation W_VOQ are unexpected transmission pairs;

S2, judging whether each expected transmission pair in the permutation has a transmission requirement; if so, determining it as an actual transmission pair, and immediately performing data transmission on the determined actual transmission pair;

S3, for the switching points of the unexpected transmission pairs, first performing column arbitration on the column where each switching point is located or row arbitration on the row where it is located, to obtain possible actual transmission pairs; then performing row arbitration or column arbitration on the possible actual transmission pairs, and selecting the unexpected transmission pair with the highest priority as an actual transmission pair;

S4, after the polling in step S3 is finished, when a certain condition is met, rolling the priority arbitration permutation W_VOQ to obtain a new arbitration permutation W'_VOQ; otherwise, the roller remains still;

S5, repeating steps S2-S4 cyclically.

2. The method of claim 1, wherein in S1, the priority arbitration permutation is determined as W_VOQ = {VOQ[0,0], VOQ[1,1], VOQ[2,2], …, VOQ[N-1,N-1]}.

3. The method of claim 1, wherein in S2 and S3, for burst transmission applications, it is confirmed that there is no outstanding transmission on a transmission pair before determining it as an actual transmission pair.

4. The method of claim 1, wherein in S2 and S3, after determining the actual transmission pair, the row and column where the actual transmission pair is located are cleared, and no longer participate in the arbitration of the next transmission request.

5. The method of claim 1, wherein in S4, the rolling condition is that every expected transmission pair with a request has been confirmed as an actual transmission pair.

6. The method of claim 1, wherein in S4, the rolling is performed such that all expected input ports remain unchanged and all expected output ports are incremented by n, where n is chosen so that all transmission pairs are covered as the roller rolls cyclically.

7. The method of claim 1, wherein in S4, the rolling is performed such that all expected output ports remain unchanged and all expected input ports are incremented by n, where n is chosen so that all transmission pairs are covered as the roller rolls cyclically.

8. A roller arbitration circuit for on-chip data exchange, based on an NxN crossbar network with N input ports and N output ports, characterized by comprising:

-a rolling point selection circuit for determining the priority arbitration permutation W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], …, VOQ[N-1,x]} in the NxN crossbar network and for performing the rolling update of the priority arbitration permutation W_VOQ;

-an expectation point matching circuit for determining whether each expected transmission pair in the permutation has a transmission requirement, marking pairs with a transmission requirement as actual transmission pairs and pairs without a transmission requirement as non-transmission pairs;

-a row-column polling arbitration circuit for performing a column arbitration of the column where the switching point is located or a row arbitration of the row where the switching point is located, and obtaining possible actual transmission pairs; and performing row arbitration or column arbitration on the possible actual transmission pairs, and selecting the switching point with high priority as the actual transmission pair.

9. The circuit of claim 8, further comprising:

-a row-column clearing circuit, each switching point of the NxN crossbar network being provided with clearing logic for preventing the row and column where a matched point is located from participating in the arbitration of transmission requests after the expected transmission pair has been determined to be an actual transmission pair.

10. The circuit of claim 8, wherein in the rolling point selection circuit, each row and each column of the NxN crossbar network is provided with one k-bit register, k = ceiling(log2(N)), and rolling is realized by sequentially shifting the register values along the rows/columns.

11. The circuit of claim 8, wherein in the expectation point matching circuit, each switching point of the NxN crossbar network is provided with a comparator, and whether the switching point is an expected point is judged by comparing the values of the row register and the column register to which the switching point belongs.

12. The circuit of claim 8, wherein in the row-column polling arbitration circuit, one arbiter is arranged in each row and each column of the NxN crossbar network, and each arbiter has a priority pointer that rolls as the roller jumps.

Technical Field

The invention relates to the fields of chip design, network-on-chip, system-on-chip, and computer architecture, and in particular to a roller scheduling method for an on-chip data switching network and its circuit implementation. The method improves the efficiency and speed of on-chip data exchange and is particularly suitable for artificial intelligence and big data processing chips, especially chips with a SIMT architecture.

Background

Machine learning, scientific computing, and graphics rendering require significant computing power, which is typically provided by large chips (e.g., GPUs, TPUs, APUs) that implement highly complex machine learning and graphics processing tasks. Machine learning for image recognition requires very deep (deep learning) networks and large amounts of image data, and the training process is time-consuming; in three-dimensional applications or game scenes, if recursive ray tracing is adopted for rendering and the scene is complex, massive computation is required and massive data must also be transmitted. This demands extremely high computational performance and therefore extremely wide data exchange bandwidth. High-performance on-chip switches have thus become an important component of AI and GPU chips.

For this particular scenario, AI and graphics computation, the arbitration method for on-chip caching and data exchange is very important. An inefficient arbitration method and arbiter design become the bottleneck of the system, greatly affecting its performance. Therefore, the arbitration method and the arbiter circuit must achieve high performance with low complexity.

There is a long history of arbitration-method research in network switches, and the research results from the period of rapid Internet development are especially numerous. A review and description of various methods are given in H. Jonathan Chao's High Performance Switches and Routers (Wiley-IEEE Press, 2007) and in George Varghese's Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices (Morgan Kaufmann, 2004). Virtual Output Queue (VOQ) switches are a typical data exchange approach, and research in this area has been quite fruitful.

Important achievements in this respect include PIM, RRM, iSLIP, DRRM and GA classes of methods.

Among them, PIM has problems of fairness and complexity because each selection is random and requires three steps.

RRM and iSLIP adopt round-robin priority arbitration, whose logic is simpler than random arbitration; iSLIP improves the update condition of the grant pointer and thereby improves fairness, but three steps are still needed, so the complexity problem remains and high-speed circuit implementation is difficult.

DRRM uses two independent round-robin arbitration mechanisms for arbitration at the inputs and the outputs; its arbitration time is shorter than that of the iSLIP scheme while achieving performance comparable to iSLIP. GA passes the grant information of the output ports to the input ports on the basis of DRRM; although arbitration efficiency is improved, the complexity increases beyond that of DRRM. Since the complexity grows rapidly as the number of ports increases (on the order of N^3·logN), it is difficult for DRRM and GA to implement more than two iterations of arbitration in high-speed circuits as the port count grows.

Disclosure of Invention

To solve the problems described in the background art, the invention provides a roller scheduling method for an on-chip data switching network together with its circuit implementation.

The invention first discloses a roller arbitration method for on-chip data exchange, based on an NxN crossbar network with N input ports and N output ports; all switching points corresponding to one input port form a row, all switching points corresponding to one output port form a column, and each input/output switching point is a transmission pair. The method comprises the following steps:

S1, determining a priority arbitration permutation W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], …, VOQ[N-1,x]}, where a, b, c, …, x ∈ [0, N-1] and are mutually distinct; the priority arbitration permutation W_VOQ represents N expected transmission pairs, wherein VOQ[0,a] denotes the expected transmission pair whose expected input port is PI_0 and whose expected output port is PO_a; transmission pairs not in the priority arbitration permutation W_VOQ are unexpected transmission pairs;

S2, judging whether each expected transmission pair in the permutation has a transmission requirement; if so, determining it as an actual transmission pair, and immediately performing data transmission on the determined actual transmission pair;

S3, for the switching points of the unexpected transmission pairs, first performing column arbitration on the column where each switching point is located or row arbitration on the row where it is located, to obtain possible actual transmission pairs; then performing row arbitration or column arbitration on the possible actual transmission pairs, and selecting the unexpected transmission pair with the highest priority as an actual transmission pair;

S4, after the polling in step S3 is finished, when a certain condition is met, rolling the priority arbitration permutation W_VOQ to obtain a new arbitration permutation W'_VOQ; otherwise, the roller remains still;

S5, repeating steps S2-S4 cyclically.

Preferably, in S1, the priority arbitration permutation is determined as W_VOQ = {VOQ[0,0], VOQ[1,1], VOQ[2,2], …, VOQ[N-1,N-1]}.

Preferably, in S2 and S3, for burst transmission applications, there is a need to confirm that there is no outstanding transmission on the transmission pair before determining it as an actual transmission pair.

Preferably, after determining the actual transmission pair in S2 and S3, the row and column where the actual transmission pair is located are cleared, and no longer participate in the arbitration of the next transmission requirement.

Preferably, in S4, the rolling condition is that each requested expected transmission pair is confirmed as an actual transmission pair.

As one rolling manner of the roller, in S4 the rolling is performed such that all expected input ports remain unchanged and all expected output ports are incremented by n, where the value of n must ensure that all transmission pairs are covered as the roller rolls cyclically.

As another rolling manner of the roller, in S4 the rolling is performed such that all expected output ports remain unchanged and all expected input ports are incremented by n, where the value of n must ensure that all transmission pairs are covered as the roller rolls cyclically.

The invention also discloses a roller arbitration circuit for on-chip data exchange, based on an NxN crossbar network with N input ports and N output ports, which comprises:

-a rolling point selection circuit for determining the priority arbitration permutation W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], …, VOQ[N-1,x]} in the NxN crossbar network and for performing the rolling update of the priority arbitration permutation W_VOQ;

-an expectation point matching circuit for determining whether each expected transmission pair in the permutation has a transmission requirement, marking pairs with a transmission requirement as actual transmission pairs and pairs without a transmission requirement as non-transmission pairs;

-a row-column polling arbitration circuit for performing a column arbitration of the column where the switching point is located or a row arbitration of the row where the switching point is located, and obtaining possible actual transmission pairs; and performing row arbitration or column arbitration on the possible actual transmission pairs, and selecting the switching point with high priority as the actual transmission pair.

Preferably, it further comprises:

-a row-column clearing circuit, each switching point of the NxN crossbar network being provided with clearing logic for preventing the row and column where a matched point is located from participating in the arbitration of transmission requests after the expected transmission pair has been determined to be an actual transmission pair.

Specifically, in the rolling point selection circuit, each row and each column of the NxN crossbar network is provided with one k-bit register, k = ceiling(log2(N)), and rolling is realized by sequentially shifting the register values along the rows/columns.

Specifically, in the expectation point matching circuit, each switching point of the NxN crossbar network is provided with a comparator, and whether the switching point is an expected point is judged by comparing the values of the row register and the column register to which the switching point belongs.

Specifically, in the row-column polling arbitration circuit, one arbiter is arranged in each row and each column of the NxN crossbar network, and each arbiter has a priority pointer that rolls as the roller jumps.

Advantages of the invention

The application provides a roller scheduling method for an on-chip data switching network and its circuit implementation; fairness of data transmission opportunity for every switching point is achieved through rolling scheduling. The initial column/row polling guarantees that arbitration traverses every switching point; the subsequent row/column polling guarantees the uniqueness of each input-output pairing and avoids collisions. After a transmission pair is determined, clearing the row and column of its switching point again ensures the uniqueness of each input-output pairing, reduces arbitration conflicts and arbitration passes, and improves transmission efficiency. Under the roller priority scheduling algorithm, no arbitration is needed for the expected transmission pairs on the roller; it is only necessary to judge whether an expected transmission pair on the roller has a transmission request. The decision logic is simple and the arbitration time is short, so the method is easy to realize in high-speed circuits; the advantage is especially obvious in high-speed circuits with strict timing requirements, and two iterations are easily realized on chip.

Drawings

FIG. 1 is a schematic diagram of NxN cross-network routing structure

FIG. 2 is a schematic view of a roller representing priority

FIG. 3 is a schematic diagram of row and column select clearing

FIG. 4 is a schematic diagram of the movement of the roller

FIG. 5a is a schematic diagram of a VOQ circuit implemented in FIFO mode

FIG. 5b is a schematic diagram of a VOQ circuit implemented in a manner of managing linked list pointers

FIG. 6a is a schematic diagram of a roller point selection circuit

FIG. 6b is a diagram showing the row (column) incremental update of the register in the rolling point selection circuit

FIG. 7 is a schematic diagram of a desired point matching circuit

FIG. 8 is a schematic diagram of a roller pattern

FIG. 9 is a diagram of an exemplary arbitration circuit for row and column polling

FIG. 10 is an exemplary diagram of three steps comprising one iteration of a PIM scheduling algorithm

Fig. 11 is an exemplary diagram of three steps constituting one iteration of the RRM scheduling algorithm

FIG. 12 is an exemplary diagram of the iSLIP algorithm

FIG. 13 is an exemplary diagram of the DRRM algorithm

FIG. 14 is an exemplary graph of the GA algorithm

Detailed Description

The invention is further illustrated by the following examples, without limiting the scope of the invention:

Fig. 1 shows an NxN crossbar routing structure: each intersection of I0, I1, …, IN-1 with O0, O1, …, ON-1 is a routing path, also called a transmission pair. Each intersection is a VOQ[i,j] request path, where the first subscript of VOQ is the input port number and the second is the output port number. Each input port Ii can have only one routing node selected in one cycle and, similarly, each output port can have only one routing node selected in one cycle; at most N paths are selected per cycle.

In order to ensure that each input port is arbitrated fairly, that each input port obtains as equal a share of data transmission as possible, and that each virtual output queue (VOQ) of each input port obtains as equal a share of data transmission as possible, the invention discloses a roller arbitration method for on-chip data exchange, which comprises the following steps:

S1, determining a priority arbitration permutation W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], …, VOQ[N-1,x]}, where a, b, c, …, x ∈ [0, N-1] and are mutually distinct; the priority arbitration permutation W_VOQ represents N expected transmission pairs, wherein VOQ[0,a] denotes the expected transmission pair whose expected input port is PI_0 and whose expected output port is PO_a; transmission pairs not in the priority arbitration permutation W_VOQ are unexpected transmission pairs;

In connection with fig. 8, at a given point in time the roller represents the expected output port of each input port: N expected output ports for N input ports. The N expected output ports are all different, i.e., they are {0, 1, 2, …, N-1} in some arrangement (a permutation); one such arrangement is called a pattern, and there are N! patterns in total. At initialization, {0, 1, 2, …, N-1} may be used directly, that is, input port PI_0 expects output port PO_0, input port PI_1 expects output port PO_1, …, and input port PI_N-1 expects output port PO_N-1. We denote this arrangement as W = {(0,0), (1,1), (2,2), …, (N-1,N-1)}, as shown by the small hollow hexagonal nodes in fig. 2 (in this example, a = 0, b = 1, c = 2, …, and x = N-1).
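As an aid to reading, the roller pattern can be modeled as a simple permutation; the following minimal Python sketch (the names N, initial_pattern and pattern are illustrative assumptions, not part of the patent) shows the diagonal initialization described above.

# Illustrative sketch: the roller pattern as a permutation mapping each input
# port to its currently expected output port.
N = 8  # example port count

def initial_pattern(n):
    # Diagonal initialization: input PI_i expects output PO_i.
    return list(range(n))  # W = {(0,0), (1,1), ..., (N-1,N-1)}

pattern = initial_pattern(N)
# pattern[i] == j means VOQ[i, j] is the expected transmission pair of input i.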

S2, judging whether each expected transmission pair in the permutation has a transmission requirement; if so, determining it as an actual transmission pair, and immediately performing data transmission on the determined actual transmission pair;

Assume that at time t, W(t) = {(0, (t+0)%N), (1, (t+1)%N), (2, (t+2)%N), …, (N-1, (t+N-1)%N)}; these pairs correspond to the queues having the highest priority. That is, if input port i has a request for its given output port j_i, then this request must be granted. Under this scheme, different patterns can be selected according to the correspondence between inputs and outputs, and the patterns roll cyclically to achieve fairness.

In the preferred embodiment, after an actual transmission pair is determined, the row and column in which it is located are cleared and no longer participate in the transmission demand arbitration. Referring to fig. 3, a pattern node lies at VOQ[1,0] (small solid hexagonal node); if VOQ[1,0] has a request, this node's route is selected. Meanwhile, the other nodes in the row of this node, i.e., the requests VOQ[1,1], VOQ[1,2], …, VOQ[1,N-1] of all other output nodes of input port I1 (small hollow hexagonal nodes), and the nodes in its column, i.e., the requests VOQ[0,0], VOQ[2,0], …, VOQ[N-1,0] of all input nodes of output O0 (yellow small hexagonal nodes), are all cleared and do not participate in the row-column arbitration of S3, which improves the efficiency of the second, row-column arbitration.

In a preferred embodiment, the above can be realized in code, as sketched below.
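A minimal Python sketch of the S2 matching and row/column clearing, assuming a boolean request matrix req[i][j] and the pattern list introduced above; these names are illustrative, not the patent's original implementation.

# Illustrative sketch: S2 expected-point matching with row/column clearing.
def match_expected_points(req, pattern):
    n = len(req)
    actual = []                      # actual transmission pairs chosen in S2
    row_free = [True] * n            # rows still allowed to take part in S3
    col_free = [True] * n            # columns still allowed to take part in S3
    for i, j in enumerate(pattern):  # expected pair VOQ[i, pattern[i]]
        if req[i][j]:
            actual.append((i, j))    # transmit immediately on this pair
            row_free[i] = False      # clear the row of the matched point
            col_free[j] = False      # clear the column of the matched point
    return actual, row_free, col_free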

S3, for the switching points of the unexpected transmission pairs, first performing column arbitration on the column where each switching point is located or row arbitration on the row where it is located, to obtain possible actual transmission pairs; then performing row arbitration or column arbitration on the possible actual transmission pairs, and selecting the unexpected transmission pair with the highest priority as an actual transmission pair;

In a preferred embodiment, step S3 proceeds as follows: after the column and row arbitrations, the matched roller points VOQ[i, t_i] and the VOQ[i, j] points confirmed by the row arbitration are the VOQs for which arbitration succeeds; data on these VOQs can be transmitted immediately, while the corresponding burst counter is decremented by one (VOQ[i, t_i].cnt-- or VOQ[i, j].cnt--).
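A minimal Python sketch of this two-stage S3 arbitration under the same assumptions; the round-robin pointers col_ptr and row_ptr are illustrative stand-ins for the arbiters' priority pointers.

# Illustrative sketch: S3 column arbitration followed by row arbitration on the
# switching points left over after S2 (rows/columns of matched points already cleared).
def row_column_arbitrate(req, row_free, col_free, col_ptr, row_ptr):
    n = len(req)
    # Column arbitration: each free column picks one requesting, free row (round-robin).
    col_winner = [None] * n
    for j in range(n):
        if not col_free[j]:
            continue
        for k in range(n):
            i = (col_ptr[j] + k) % n
            if row_free[i] and req[i][j]:
                col_winner[j] = i
                break
    # Row arbitration: each row picks one of the columns that granted it.
    actual = []
    for i in range(n):
        if not row_free[i]:
            continue
        candidates = [j for j in range(n) if col_winner[j] == i]
        if candidates:
            j = min(candidates, key=lambda c: (c - row_ptr[i]) % n)
            actual.append((i, j))    # winning unexpected pair becomes an actual pair
    return actual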

In S2 and S3, for single-cycle transmission applications, when an expected transmission pair has a transmission demand it is directly taken as an actual transmission pair and data transmission is performed. For multi-cycle transmission applications (e.g., burst transmissions), it is necessary to confirm that there are no outstanding transmissions on the transmission pair before determining it to be an actual transmission pair.

S4, after the polling in step S3 is finished, certain conditions may be set; when the conditions are satisfied, the priority arbitration permutation W_VOQ rolls to obtain a new arbitration permutation W'_VOQ. The condition may be: all expected transmission pairs with requests have been determined to be actual transmission pairs, i.e., each requested expected transmission pair has completed at least one transmission. This condition ensures freedom from starvation and fairness; any other condition must also take both properties into account.

In conjunction with FIG. 4, the roller needs to cycle through a set of patterns. As shown in the example of FIG. 4, the small hollow hexagonal nodes on the diagonal (VOQ[0,0], VOQ[1,1], …, VOQ[N-1,N-1]) form the pattern before the roller rolls, and the small black solid hexagonal nodes (VOQ[1,0], VOQ[2,1], …, VOQ[N-1,N-2], VOQ[0,N-1]) form the pattern after one rolling step of the roller. The other patterns roll similarly; the roller can roll according to a preset step +n, where the value of n must ensure that all transmission pairs are covered as the roller rolls cyclically (for an NxN crossbar routing structure, taking N odd as an example, n = 2 also meets the requirement of complete coverage).

In a preferred embodiment, step S4 is realized in code, as sketched below.
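A minimal Python sketch of the rolling step, assuming served/requested bookkeeping per expected pair; the names are illustrative, not the patent's original implementation.

# Illustrative sketch: S4 rolling of the priority arbitration permutation.
def roll(pattern, n=1):
    # Inputs stay fixed; every expected output advances by n (mod N). n must be
    # chosen so that repeated rolling covers every transmission pair.
    N = len(pattern)
    return [(j + n) % N for j in pattern]

def maybe_roll(pattern, requested, served, n=1):
    # Roll only when every requested expected pair has been served at least once;
    # otherwise the roller keeps still.
    if all(served[i] for i in range(len(pattern)) if requested[i]):
        return roll(pattern, n)
    return pattern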

S5, repeating steps S2-S4 cyclically.

The invention also provides a roller arbitration circuit for on-chip data exchange, based on an NxN crossbar network with N input ports and N output ports. The VOQ circuit can be realized in two ways. One is the FIFO manner shown in fig. 5a, which routes each VOQ routing request to the FIFO of the corresponding output port according to the destination address information of the input I_s; this implementation is simple but consumes hardware resources. The other, shown in fig. 5b, implements the VOQ through linked-list management: the inputs are stored in a set of linked lists, and the VOQ information is managed by managing the linked-list pointers. Each input port contains a pointer register queue of depth M, head and tail pointers for each of the N queues, and a valid-request vector of width N, where M is the maximum number of requests that can be received and stored and N is the number of request target ports. Compared with the FIFO implementation, the circuit area is smaller and fewer hardware resources are used.
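The linked-list variant can be illustrated with the following Python sketch; class and field names such as LinkedListVOQ and next_ptr are assumptions for illustration, not the circuit's actual structure.

# Illustrative sketch of linked-list VOQ management at one input port:
# M request slots shared by N per-output queues, tracked with head/tail pointers.
class LinkedListVOQ:
    def __init__(self, m, n):
        self.next_ptr = [None] * m        # pointer register queue of depth M
        self.payload = [None] * m
        self.free = list(range(m))        # free slot list
        self.head = [None] * n            # head pointer per output queue
        self.tail = [None] * n            # tail pointer per output queue
        self.valid = [False] * n          # valid-request vector of width N

    def enqueue(self, out_port, data):
        slot = self.free.pop()
        self.payload[slot] = data
        self.next_ptr[slot] = None
        if self.valid[out_port]:
            self.next_ptr[self.tail[out_port]] = slot   # link behind current tail
        else:
            self.head[out_port] = slot
            self.valid[out_port] = True
        self.tail[out_port] = slot

    def dequeue(self, out_port):
        slot = self.head[out_port]
        data = self.payload[slot]
        self.head[out_port] = self.next_ptr[slot]
        if self.head[out_port] is None:                 # queue drained
            self.valid[out_port] = False
        self.free.append(slot)
        return data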

The roller arbitration circuit further comprises:

-a rolling point selection circuit for determining the priority arbitration permutation W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], …, VOQ[N-1,x]} in the NxN crossbar network and for performing the rolling update of the priority arbitration permutation W_VOQ;

in a preferred embodiment, the scroll point selection circuit is: each row and each column of the NxN crossover network are respectively provided with 1 k-bit register, k is ceiling (n), the registers of one row can sequentially shift the registered values, and the registers of one column can also sequentially shift the registered values. The line registers are designated R0, R1, …, R N-1. The column registers are designated C0, C1, …, C N-1. At initialization, R [0] ═ 0, R [1] ═ 1, …, R [ N-1] ═ N-1, C [0] ═ 0, C [1] ═ 1, …, and C [ N-1] ═ N-1. At each roll, C0 ═ C N-1, C [ N-1] ═ C [ N-2], C [ N-2] ═ C [ N-3], …, C2 ═ C1, C1 ═ C0, and R [ N-1], …, and R [0] remain unchanged.

With reference to fig. 6a, an example pattern is shown in which the row and column positions of the pattern nodes are mutually exclusive (no two pattern nodes share a row or a column). Each node of the pattern carries a routing node number. Referring to fig. 6b, each pattern completes the node update of the corresponding column (row) by incrementing along the row (column), and the column (row) pre-selection is stored in the register.

-an expectation point matching circuit for determining whether each expected transmission pair in the permutation has a transmission requirement, marking pairs with a transmission requirement as actual transmission pairs and pairs without a transmission requirement as non-transmission pairs;

In a preferred embodiment, the expectation point matching circuit is as follows: each switching point of the NxN crossbar network is provided with a comparator, and whether the switching point is an expected point is judged by comparing the values of the row register and the column register to which the switching point belongs.

Referring to fig. 7, each node on the pattern outputs a row-column routing number; when this number equals the routing number of the VOQ, that is, when there is a corresponding VOQ request on the pattern node, the expected point is matched. All nodes of the current pattern are traversed and checked to find the expected matching points on the roller.
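A compact Python sketch of this comparator-based matching, for illustration; in hardware all NxN comparisons are evaluated in parallel.

# Illustrative sketch: switching point (i, j) lies on the roller pattern when its
# row register value equals its column register value.
def expected_points(R, C):
    n = len(R)
    return [(i, j) for i in range(n) for j in range(n) if R[i] == C[j]]

def matched_points(R, C, req):
    # Expected points that also carry a VOQ request become actual transmission pairs.
    return [(i, j) for (i, j) in expected_points(R, C) if req[i][j]]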

-a row-column polling arbitration circuit for performing a column arbitration of the column where the switching point is located or a row arbitration of the row where the switching point is located, and obtaining possible actual transmission pairs; and performing row arbitration or column arbitration on the possible actual transmission pairs, and selecting the switching point with high priority as the actual transmission pair.

FIG. 9 is a more complete diagram of the roller arbitration circuit. After the expected points on the roller have been matched, the row-column polling arbitration can be completed by arbitrating twice, once in the column direction and once in the row direction, as shown in the figure (the result of the column polling cannot guarantee that the row selected by each column is different, so the row polling circuit must arbitrate the column polling results once more, ensuring that the final result has at most one selected point in each row and each column and that no data transmission conflict occurs); this completes all channel routing arbitration within one period. Each row and each column in fig. 9 has an arbiter, and each arbiter has a priority pointer that rolls as the roller jumps, implemented in the figure by a shift register.
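Each per-row or per-column arbiter can be pictured with the following Python sketch; a plain integer stands in for the shift-register priority pointer, and the names are illustrative.

# Illustrative sketch: a round-robin arbiter whose priority pointer advances as
# the roller jumps.
def round_robin_arbiter(requests, pointer):
    # Return the index of the winning request, scanning from `pointer` onward.
    n = len(requests)
    for k in range(n):
        idx = (pointer + k) % n
        if requests[idx]:
            return idx
    return None

# Example: with pointer = 3 and requests at indices 1 and 5, index 5 wins.
assert round_robin_arbiter([False, True, False, False, False, True], 3) == 5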

In a preferred embodiment, it further comprises:

-a row-column clearing circuit: each switching point of the NxN crossbar network is provided with clearing logic that, after the actual transmission pairs have been determined, prevents the rows and columns where the matched points are located from participating in the arbitration of transmission requests. One implementation scheme is: the labels of these VOQs are masked locally at the input port.

Comparison with existing CrossBar arbitration algorithms

Most current scheduling algorithms maximize the number of connection paths per arbitration and thereby maximize the bandwidth, but these algorithms are too complex; their complexity increases rapidly with the number of ports (on the order of N^3·logN), making them hard to implement in hardware and slow to complete. Today's generic CrossBar is based on iterative or non-iterative polling algorithms. The more classical algorithms include the FIFO, PIM, iSLIP, DRRM, and GA algorithms. The roller arbitration algorithm, and its differences from and advantages over these algorithms, are analyzed below on this basis.

One, VOQ introduction

Early input ports waited for arbitration allocation in a FIFO queuing mode; with FIFO-based arbitration scheduling the maximum input throughput is only 58.6% (limited by head-of-line blocking). The later evolution to the VOQ mode raised the throughput to 100% and provides the conditions for finding a maximum matching each time. In the VOQ implementation here, the VOQ is realized through linked-list management, which saves hardware resources to the greatest extent. The advantage of this implementation is even more pronounced for large network routers with long latencies.

Two, PIM algorithm

PIM refers to the Parallel Iterative Matching algorithm, which consists of three steps, as shown in fig. 10, an example of the three steps constituting one iteration of the PIM scheduling algorithm.

Step 1: Request. Each input sends a request to every output for which it has a queued cell;

Step 2: Grant. Each output randomly selects one of the inputs that requested it. In this example, inputs 1 and 3 both request output 2, and output 2 grants input 3;

Step 3: Accept. Each input randomly selects one of the outputs that granted it. In this example, outputs 2 and 4 both grant input 3, and input 3 accepts output 2.

In this example, the first iteration does not match input 4 and output 4, even though it does not conflict with other connections. This connection will be established in the second iteration.
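For reference, one PIM iteration can be sketched in Python as follows; the request matrix req and the free_in/free_out masks are illustrative assumptions.

import random

# Illustrative sketch of one PIM iteration: request, grant, accept, each with
# uniform random selection among the candidates.
def pim_iteration(req, free_in, free_out):
    n = len(req)
    # Grant: each still-free output randomly picks one requesting, still-free input.
    grants = {}                      # maps input -> outputs that granted it
    for j in range(n):
        if free_out[j]:
            candidates = [i for i in range(n) if free_in[i] and req[i][j]]
            if candidates:
                grants.setdefault(random.choice(candidates), []).append(j)
    # Accept: each input randomly picks one of the outputs that granted it.
    matches = []
    for i, outs in grants.items():
        j = random.choice(outs)
        matches.append((i, j))
        free_in[i] = False           # matched pairs drop out of later iterations
        free_out[j] = False
    return matches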

Three features of the PIM algorithm:

First, each iteration considers only the connections left unmatched by previous iterations, so completing the maximal matching requires on the order of logN iterations;

Second, it ensures that all requests are eventually granted, so no VOQ is starved of service;

third, it means that no memory or state is used to track the time a connection was established in the past.

Performance of PIM algorithm: because the arbitration process of PIM is random, there are several limitations:

First, it uses random arbitration, which is difficult to implement at high speed: each arbiter must make a random selection among all requesting candidates;

secondly, PIM can lead to unfairness between connections when CrossBar is overloaded;

Finally, PIM performs poorly with a single iteration: it limits throughput to approximately 63%, only slightly higher than a FIFO switch. When the Crossbar is overloaded with requests, the efficiency is only 1 - 1/e ≈ 63%.

Three, RRM algorithm

RRM refers to the Basic Round-Robin Matching algorithm. The RRM algorithm is also divided into three steps, as shown in fig. 11, an example of the three steps constituting one iteration of the RRM scheduling algorithm.

RRM potentially overcomes the two problems of PIM: complexity and unfairness. The round-robin arbitration procedure, implemented as a priority encoder, is much simpler and faster to execute than random arbitration, and rotating priority helps the algorithm distribute bandwidth more fairly among requesting connections. The three steps of arbitration are:

step 1: request, each input sends a Request at each output where there is a queued VOQ.

Step 2: Grant. If an output receives any requests, it selects the one that appears next in a fixed round-robin schedule starting from the highest-priority element. The output notifies each input whether or not its request was granted, and the grant pointer is incremented (modulo N) to one location beyond the granted input.

Step 3: Accept. If an input receives any grants, it accepts the one that appears next in a fixed round-robin schedule starting from the highest-priority element, and the accept pointer is incremented (modulo N) to one location beyond the accepted output.

RRM performance analysis

RRM becomes unstable for an offered load of only 63%. The reason for RRM's poor performance is the rule for updating the pointers at the output arbiters. We illustrate this with the example in the figure: inputs 1 and 2 are both under heavy load and receive a new cell for both outputs in every cell time, but because the output schedulers move in lockstep, only one input is served in each cell time. Note that the grant pointers change in lockstep: in cell time 1 both point to input 1, in cell time 2 both point to input 2, and so on. This synchronization phenomenon limits the maximum throughput to only 50% for this traffic pattern. Synchronization of the grant pointers also limits performance under random arrival patterns.

Four, iSLIP algorithm

The iSLIP algorithm uses rotating-priority ("round-robin") arbitration to schedule each input and output in turn. Its main characteristic is simplicity: it is easy to implement in hardware and can operate at high speed. Studies have found that its performance under uniform traffic is high; for uniform, independent, identically distributed Bernoulli arrivals, a single-iteration iSLIP can achieve 100% throughput. A careful comparison reveals that iSLIP is actually a variant of the simple basic round-robin matching algorithm (RRM), which is probably the simplest such scheme.

Fig. 12 gives an example of iSLIP.

Compared with RRM, iSLIP makes the following change: the grant pointer is not moved unless the grant is accepted; only then is the grant pointer updated. The Grant step of iSLIP is changed as follows:

Step 2: Grant. If an output receives any requests, it selects the one that appears next in the fixed round-robin schedule, starting from the highest-priority element. The output informs each input whether or not its request was granted, and the grant pointer is moved to one location beyond the granted input only if the grant is accepted in step 3. This small change to the algorithm leads to the following properties:

characteristic 1: the most recently established connection has the lowest priority. This is because when the arbiter moves the pointer, the input (output) that was most recently granted (accepted) becomes the lowest priority for that output (input). If the input priority connection is at the next calculation of the unit time.

Characteristic 2: no connection is starved. This is because an input keeps requesting an output until it is successful, and because the grant pointer is not moved until a grant is accepted, the output waits a bounded number of cell times to be accepted by each input; thus a requesting input is always served within a bounded time.

Characteristic 3: under heavy load, all queues with a common output have the same throughput. This is a consequence of Characteristic 2: the grant pointer moves to each requesting input in a fixed order, providing each with the same throughput.

But most importantly, this small change prevents the output arbiters from moving in lockstep, which yields a large improvement in performance.
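The pointer-update difference between RRM and iSLIP can be summarized with a small Python sketch; the names are illustrative, not taken from the cited papers.

# Illustrative sketch: iSLIP grant-pointer rule. Unlike RRM, the grant pointer
# only advances when the grant is actually accepted in the Accept step.
def update_grant_pointer(grant_ptr, granted_input, accepted, n):
    if accepted:
        return (granted_input + 1) % n   # newest connection gets lowest priority
    return grant_ptr                     # RRM would advance here regardless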

Five, DRRM algorithm

DRRM is Dual Round-Robin Matching. Because two independent round-robin arbitration mechanisms perform arbitration at the inputs and the outputs, the scheme is called DRRM (Dual Round-Robin Matching).

Each input port has a request arbiter and N VOQs; based on its pointer value, the request arbiter selects at most one non-empty queue, which determines the requested output port. Each output port has a grant arbiter that arbitrates among up to N requests to select one input port and then sends the grant back to that input port; when a request is granted, the grant arbiter updates its pointer value, and the corresponding request arbiter also updates its pointer value.

FIG. 13 shows an example of a DRRM algorithm:

the DRRM algorithm includes two phases:

step 1: request, each Request input in the polling schedule arbitrates out one Request;

step 2: and Grant, each output arbitrates one input in all the requests of the port.

The DRRM scheme has a shorter arbitration time than the iSLIP scheme while achieving performance comparable to iSLIP.
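One DRRM cycle can be sketched in Python as follows; nonempty[i][j] marks a non-empty VOQ, and in_ptr/out_ptr are the round-robin pointers (names are illustrative).

# Illustrative sketch of one DRRM cycle: request arbitration at the inputs,
# grant arbitration at the outputs; pointers advance only when a request is granted.
def drrm_cycle(nonempty, in_ptr, out_ptr):
    n = len(nonempty)
    # Step 1: Request - each input selects at most one non-empty VOQ (round-robin).
    request = [None] * n
    for i in range(n):
        for k in range(n):
            j = (in_ptr[i] + k) % n
            if nonempty[i][j]:
                request[i] = j
                break
    # Step 2: Grant - each output selects one input among the requests it received.
    matches = []
    for j in range(n):
        requesters = [i for i in range(n) if request[i] == j]
        if requesters:
            i = min(requesters, key=lambda r: (r - out_ptr[j]) % n)
            matches.append((i, j))
            in_ptr[i] = (j + 1) % n    # request pointer moves past the granted output
            out_ptr[j] = (i + 1) % n   # grant pointer moves past the granted input
    return matches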

Six, GA algorithm

The GA algorithm is the Grant-Aware Scheduling Algorithm for VOQ-Based Input-Buffered Packet Switches. As shown in fig. 14, this arbitration algorithm iterates the DRRM algorithm multiple times within one cycle to maximize arbitration efficiency, and each iteration informs the input ports of the output-port arbitration results. Efficiency is improved, but the complexity is greater than that of DRRM, and for even moderately large route arbitration (N >= 8) it is difficult to implement more than two iterations in high-speed circuits, because each iteration is equivalent to performing DRRM again.

Seven, roller arbitration algorithm (this application)

The greatest advantage of the roller arbitration algorithm provided in this patent is that roller priority arbitration is applied first, followed by DRRM-style arbitration. The method and device realize two arbitrations in one cycle: the first arbitration only checks, according to the roller priority principle, whether the roller nodes have requests, which is very easy to implement in a hardware circuit and saves a large amount of time for the second arbitration. The second arbitration applies the DRRM algorithm after the nodes decided in the first arbitration have been filtered out.

Compared with the PIM, RRM, iSLIP, DRRM, and GA algorithms, the roller priority algorithm is simpler, more direct, and easier to implement. In existing high-speed circuit designs, it is difficult to complete more than two iterations of PIM, RRM, iSLIP, DRRM, or GA arbitration within one clock cycle, because each iteration of those algorithms must logically process the arbitration priorities of the input/output ports. In the roller priority algorithm, the arbitration priority of the input/output ports is a pattern structure that is independent of the arbitration logic, leaving sufficient time for the arbitration decision, so more iterations can be realized within one clock cycle. Compared with implementations that perform one iteration per clock cycle and reach about 63% efficiency, an efficiency of about 86% can be achieved, 23 percentage points higher. The method has been implemented on an AI GPU chip.

Because the roller pattern is selected by the user according to the data-routing characteristics, the arbitration time in practical applications is much shorter than that of GA, and more iterations can be performed to improve arbitration efficiency.

The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
