Glitch-free multiplexer and glitch propagation prevention


Abstract: This technology, "Glitch-free multiplexer and glitch propagation prevention," was created by W. J. Dally on 2021-06-03. When a signal glitches, the logic receiving the signal may respond to each change, charging and/or discharging nodes within the logic and dissipating power. Providing a glitch-free signal may reduce the number of times a node is charged and/or discharged, thereby reducing power consumption. A technique for removing glitches is to insert a storage element that samples the signal after the signal change is complete to produce a glitch-free output signal. The storage element is enabled by a "ready" signal having a delay that matches the delay of the circuit generating the signal. This technique prevents the output signal from changing until the final value of the signal is reached. The output signal changes only once, typically reducing the number of times a node in the logic receiving the signal is charged and/or discharged, thereby also reducing power consumption.

1. A circuit, comprising:

a decoder circuit configured to:

receive a select ready signal, the select ready signal being negated until a select signal generated by combinational logic is unchanged and being asserted after the select signal is unchanged;

generate at least one sample enable signal corresponding to a set of data input signals in accordance with the select signal, wherein the at least one sample enable signal is negated while the select ready signal is negated, and the at least one sample enable signal is asserted in response to the assertion of the select ready signal; and

generate a hold signal that is asserted while the at least one sample enable signal is negated, and that is negated in response to the assertion of the at least one sample enable signal; and

a sampling circuit configured to:

hold an output signal constant while the hold signal is asserted; and

sample one of the data input signals according to the at least one sample enable signal while the hold signal is negated to transfer the level of the sampled data input signal to the output signal.

2. The circuit of claim 1, wherein the select signal comprises a multi-bit signal, and each bit in the select signal is associated with a different one of the set of data input signals and only one bit is asserted at a time.

3. The circuit of claim 2, wherein each of the bits in the select signal is used to sample the associated data input signal.

4. The circuit of claim 2, wherein the decoder circuit is further configured to:

receive a set of ready signals, each ready signal of the set of ready signals being associated with a different one of the set of data input signals, wherein

each ready signal is negated until the associated data input signal is unchanged, and

each ready signal is asserted after the associated data input signal is unchanged.

5. The circuit of claim 4, wherein the decoder is further configured to: generate a set of enable signals, each enable signal in the set of enable signals being associated with a different one of the set of data input signals, wherein

each of the enable signals is negated while the associated ready signal is negated; and

each of the enable signals is asserted in response to the assertion of the associated ready signal when the associated data input signal is asserted.

6. The circuit of claim 1, wherein the decoder is further configured to:

receive a set of ready signals, each ready signal of the set of ready signals being associated with a different one of the set of data input signals, wherein

each ready signal is negated until the associated data input signal is unchanged, and

each ready signal is asserted after the associated data input signal is unchanged.

7. The circuit of claim 1, wherein the decoder circuit is further configured to assert the hold signal while the hold signal is asserted and the at least one sample enable signal is negated.

8. The circuit of claim 1, wherein the sampling circuit comprises a feedback loop configured to assert the output signal at the same time that the hold signal is asserted and the output signal is asserted.

9. The circuit of claim 1, wherein both the hold signal and the at least one sample enable signal are asserted for a first duration.

10. The circuit of claim 1, wherein the hold signal is inverted to generate an output ready signal.

11. The circuit of claim 1, wherein the set of data input signals comprises three data input signals, and the select signal is configured to select one of the three data input signals to generate the output signal.

12. The circuit of claim 1, wherein the set of data input signals includes four data input signals, and the select signal is configured to select two of the four data input signals to produce the output signal and an additional output signal.

13. The circuit of claim 1, wherein the circuit is included within a processor configured to generate an image and the processor is part of a server or a data center and streams the image to a user device.

14. The circuit of claim 1, wherein the circuit is included within a processor configured to train, test, or validate a neural network employed in a machine, robot, or autonomous vehicle.

15. The circuit of claim 1, wherein the circuit is included within a processor configured to implement a neural network model.

16. A computer-implemented method, comprising:

receiving a select ready signal, the select ready signal being negated until a select signal generated by combinational logic is unchanged and being asserted after the select signal is unchanged;

generating at least one sample enable signal corresponding to a set of data input signals in accordance with the select signal, wherein the at least one sample enable signal is negated while the select ready signal is negated, and the at least one sample enable signal is asserted in response to the assertion of the select ready signal;

generating a hold signal that is asserted while the at least one sample enable signal is negated, and that is negated in response to the assertion of the at least one sample enable signal;

holding an output signal constant while the hold signal is asserted; and

sampling one of the data input signals according to the at least one sample enable signal while the hold signal is negated to transfer the value of the sampled data input signal to the output signal.

17. The computer-implemented method of claim 16, wherein the select signal comprises a multi-bit signal, and each bit of the select signal is associated with a different one of the set of data input signals and only one bit is asserted at a time.

18. The computer-implemented method of claim 17, further comprising:

receiving a set of ready signals, each ready signal of the set of ready signals being associated with a different one of the set of data input signals, wherein

each ready signal is negated until the associated data input signal is unchanged, and

each ready signal is asserted after the associated data input signal is unchanged.

19. The computer-implemented method of claim 17, wherein a feedback loop is configured to assert the output signal at the same time that the hold signal is asserted and the output signal is asserted.

20. The computer-implemented method of claim 16, wherein the steps of generating, receiving, and sampling are performed within a processor configured to implement a neural network model.

21. A circuit, comprising:

a delay circuit configured to generate a ready signal that is negated at a first transition of a clock signal and is asserted after a first delay relative to the first transition, wherein the first delay is at least as long as a second delay; and

a sampling circuit configured to:

receive an input signal generated by combinational logic, wherein after the second delay following the first transition of the clock signal, a change in a first signal received at an input of the combinational logic causes a corresponding change in the input signal at an output of the combinational logic; and

sample the input signal to transfer a level of the input signal to an output signal of the sampling circuit while the ready signal is asserted, wherein the input signal is unchanged from the second delay until the input signal is sampled.

22. The circuit of claim 21, wherein the sampling circuit is further configured to hold the output signal at a constant level from the first transition of the clock signal until the input signal is sampled.

23. The circuit of claim 21, wherein the sampling circuit comprises a latch storage element.

24. The circuit of claim 23, wherein the latch storage element comprises cross-coupled logic gates configured to cause a feedback path to hold the output signal stable until after the input signal is sampled.

25. The circuit of claim 21, wherein the input signal changes at least once after the first transition of the clock signal and before the second delay.

26. The circuit of claim 21, wherein the delay circuit delays the clock signal by the first delay to generate the ready signal.

27. The circuit of claim 21, wherein the delay circuit inverts the clock signal to generate the ready signal.

28. The circuit of claim 21, wherein the second delay is a propagation delay of the combinational logic, and the first delay is equal to or greater than the second delay.

29. The circuit of claim 21, wherein the delay circuit is further configured to negate the ready signal after a third delay relative to the first delay.

30. The circuit of claim 29, wherein the first delay and the third delay occur within a period of the clock signal.

31. The circuit of claim 21, wherein the sampling circuit is further configured to generate an output ready signal that is negated for the first delay and that is asserted once the level of the input signal is transferred to the output signal.

32. The circuit of claim 31, wherein the ready signal is negated in response to an assertion of the output ready signal.

33. The circuit of claim 21, wherein the circuit is included within a processor configured to generate an image and the processor is part of a server or a data center and streams the image to a user device.

34. The circuit of claim 21, wherein the circuit is included within a processor configured to train, test, or validate a neural network employed in a machine, robot, or autonomous vehicle.

35. The circuit of claim 21, wherein the circuit is included within a processor configured to implement a neural network model.

36. A computer-implemented method, comprising:

generating a ready signal that is negated at a first transition of a clock signal and that is asserted after a first delay relative to the first transition, wherein the first delay is produced by a delay circuit and is at least as long as a second delay;

receiving an input signal generated by combinational logic, wherein after the first transition of the clock signal, a change in a first signal received at an input of the combinational logic causes a corresponding change in the input signal at an output of the combinational logic after the second delay; and

sampling the input signal to transfer a level of the input signal to an output signal while the ready signal is asserted, wherein the input signal is unchanged from the second delay until sampled.

37. The computer-implemented method of claim 36, further comprising holding the output signal at a constant level from the first transition of the clock signal until the input signal is sampled.

38. The computer-implemented method of claim 36, wherein the input signal changes at least once after the first transition of the clock signal and before the second delay.

39. The computer-implemented method of claim 36, wherein the second delay is a propagation delay of the combinational logic, and the first delay is equal to or greater than the second delay.

40. The computer-implemented method of claim 36, wherein the steps of generating, receiving, and sampling are performed within a processor configured to implement a neural network model.

Technical Field

The present disclosure relates to preventing glitch propagation in a circuit. In particular, the present disclosure relates to removing glitches in a signal by inserting storage elements that sample the signal to produce a glitch-free output signal.

Background

Conventional sequential circuits include combinational logic with inputs driven by synchronous registers or flip-flops. "combinational logic" refers to logic that receives one or more inputs that are combined to produce an output without storing the inputs, the output, or any intermediate values. In other words, the combinational logic is "stateless" and may be asynchronous (not driven by a clock signal). In contrast, for sequential circuits (logic), registers store state.

On the rising edge of the clock, the output of a register changes at most once. However, multiple paths through the combinational logic may cause a signal output by the combinational logic to change multiple times before reaching its final level. The signal may be a data input to a multiplexer, or may be a select signal that causes the multiplexer to select one of the data inputs for output. The output of the multiplexer may change several times in response to the changing data and select inputs before the final state is established. These changes in the multiplexer output are considered glitches, and additional combinational logic receiving the output may respond by charging and/or discharging nodes and dissipating power. Providing a glitch-free output signal may reduce the number of times a node is charged and/or discharged, thereby reducing the power dissipated by the additional combinational logic. There is a need to address these and/or other problems associated with the prior art.

Disclosure of Invention

When a signal glitch occurs, the logic receiving the signal may respond to each change, charging and/or discharging nodes within the logic and dissipating power. In the context of the following description, a glitch is at least one high or low pulse of at least one bit of a signal in a clock cycle. In particular, the pulse is a high-low-high transition or a low-high-low transition. Providing a glitch-free signal may reduce the number of times a node is charged and/or discharged, thereby reducing power consumption. A technique for removing glitches in a signal is to insert a storage element that samples the signal after the signal change is complete to produce a glitch-free output signal. The storage element is enabled by a "ready" signal having a delay that matches the delay of the circuit generating the signal. The ready-enabled storage element may prevent the output signal from changing until the final value of the signal is reached. The output signal changes only once, transitioning from low to high or high to low, typically reducing the number of times a node in the logic receiving the signal is charged and/or discharged, thereby also reducing power consumption.

A method, computer-readable medium, and system for preventing glitch propagation are disclosed. In one embodiment, a decoder circuit is configured to receive a select ready signal that is negated until the select signal generated by combinational logic is unchanged and is asserted after the select signal is unchanged. The decoder circuit generates at least one sample enable signal corresponding to a set of data input signals in accordance with the select signal, wherein the at least one sample enable signal is negated while the select ready signal is negated and is asserted in response to the assertion of the select ready signal. The decoder circuit generates a hold signal that is asserted while the at least one sample enable signal is negated and is negated in response to the assertion of the at least one sample enable signal. A sampling circuit is configured to hold the output signal constant while the hold signal is asserted and to sample one of the data input signals in accordance with the at least one sample enable signal while the hold signal is negated, conveying the level of the sampled data input signal to the output signal.

A method, computer-readable medium, and system for preventing glitch propagation are disclosed. In one embodiment, a delay circuit is configured to generate a ready signal that is negated at a first transition of the clock signal and is asserted after a first delay relative to the first transition, wherein the first delay is at least as long as a second delay. A sampling circuit is configured to receive an input signal generated by combinational logic, wherein a change in a first signal received at the input of the combinational logic causes a corresponding change in the input signal at the output of the combinational logic after the second delay following the first transition of the clock signal. The sampling circuit is further configured to sample the input signal while the ready signal is asserted to transfer a level of the input signal to an output signal of the sampling circuit, wherein the input signal is unchanged from the second delay until the input signal is sampled.

Drawings

FIG. 1A illustrates a block diagram of a glitch-free sampling circuit and combinational logic, according to one embodiment.

FIG. 1B illustrates a timing diagram of the circuit shown in FIG. 1A according to one embodiment.

FIG. 1C shows a flow diagram of a method for generating a glitch-free signal, according to one embodiment.

FIG. 1D illustrates a glitch-free sampling circuit, according to one embodiment.

FIG. 1E illustrates an overlapping inverter circuit, according to one embodiment.

FIG. 1F illustrates another glitch-free sampling circuit in accordance with one embodiment.

FIG. 1G illustrates an asymmetric ready signal generation circuit, in accordance with one embodiment.

FIG. 1H illustrates a timing diagram of the asymmetric ready signal generation circuit of FIG. 1G, according to one embodiment.

FIG. 2A illustrates a block diagram of a glitch-free N-to-1 multiplexer, according to one embodiment.

FIG. 2B illustrates a timing diagram for the glitch-free N-to-1 multiplexer of FIG. 2A, according to one embodiment.

FIG. 2C illustrates a block diagram of the timing decoder of FIG. 2A according to one embodiment.

FIG. 2D illustrates a flow diagram of a method for generating a glitch-free multiplexer output signal, according to one embodiment.

FIG. 2E illustrates an extended ready signal generation circuit, according to one embodiment.

FIG. 2F illustrates a timing diagram of the extended ready signal generation circuit of FIG. 2E, according to one embodiment.

FIG. 2G illustrates a fast return circuit, according to one embodiment.

FIG. 2H illustrates a timing diagram for the fast return circuit of FIG. 2G, according to one embodiment.

FIG. 3 illustrates a parallel processing unit according to one embodiment.

FIG. 4A illustrates a general processing cluster within the parallel processing unit of FIG. 3, according to one embodiment.

FIG. 4B illustrates a memory partition unit of the parallel processing unit of FIG. 3, according to one embodiment.

FIG. 5A illustrates the streaming multiprocessor of FIG. 4A, according to one embodiment.

FIG. 5B is a conceptual diagram of a processing system implemented using the PPU of FIG. 3, according to one embodiment.

FIG. 5C illustrates an exemplary system in which the various architecture and/or functionality of the various previous embodiments may be implemented.

Detailed Description

In response to changing data and select inputs, the output of a multiplexer may change multiple times before settling to a final state. In the context of the following description, a change is a transition in voltage level that identifies a different state. For example, a change is identified as an asserted state as compared to a negated or deasserted state. In another example, a change is a transition from a high level to a low level or from a low level to a high level. In the context of the following description, a stable or constant level may vary while remaining within the range of voltage values identified as the same state (e.g., logic true or logic false).

Multiple changes in the output of the multiplexer are considered glitches, and combinational logic receiving the output may respond by charging and/or discharging nodes within the combinational logic and dissipating power. Glitches can be prevented, and power consumption reduced, by inserting additional registers. For example, the data and select inputs may be registered by inserting flip-flops at the inputs to the multiplexer. The registers add a pipeline stage and prevent glitches at the inputs to the multiplexer because each input changes only once (or remains constant) on a clock edge. However, registers are very expensive in terms of power consumption, delay, and chip area. Alternatively, the multiplexer output may be registered using a delayed clock that ensures that all inputs to the multiplexer have reached their final state, and that the multiplexer output is no longer changing, before the multiplexer output is registered. Inserting a register at the output of the multiplexer may reduce the power consumption of the multiplexer and the additional combinational logic. However, registers are an expensive solution.

Another technique for removing glitches in a signal is to insert a storage element that samples the signal after the change is completed, producing a glitch-free output signal in each clock cycle. The storage element is enabled by a "ready" signal having a delay that matches the delay of the circuit generating the signal. This technique prevents the output signal from changing until the final value of the signal is reached. Providing a glitch-free signal may reduce the number of times a node in the logic receiving the signal is charged and/or discharged, thereby reducing the power consumed by the logic.

For example, a multiplexer may be used to select only non-zero activations and/or weights for convolution operations in a convolutional neural network. If the select signal to the multiplexer glitches, the output of the multiplexer may glitch, causing the receiving logic to evaluate several different products, with power consumption several times higher than evaluating a single product. As further described herein, providing a glitch-free select signal to select a glitch-free non-zero value may reduce power consumption. In the context of the following description, a changing signal is ignored until the signal becomes stable, at which point the signal may be sampled so that the combinational logic receiving the signal evaluates only once.

FIG. 1A illustrates a block diagram 100 of a glitch-free sampling circuit 110 and combinational logic 103 and 113, according to one embodiment. Register 101 outputs a signal A in response to a rising edge of the clock signal. Depending on the input (not explicitly shown) of register 101, the level of signal A may change or remain the same at each rising edge of the clock signal, and is stable (constant) between rising edges of the clock signal. The combinational logic 103 receives the input signal A and generates the output signal B. The combinational logic 103 may also receive one or more additional inputs and generate one or more additional outputs (not explicitly shown). Due to the different timing paths in combinational logic 103, output signal B may transition through several intermediate states before settling to a final value. If the glitching output signal B were input directly to combinational logic 113, such as a complex arithmetic circuit, the glitches could cause high power consumption.

The delay circuit 105 is configured to match the propagation delay from a transition of signal A to a valid output on signal B. In one embodiment, the delay is equal to or greater than the worst-case propagation delay. In one embodiment, the delay is greater than the worst-case propagation delay. In one embodiment, delay circuit 105 includes an even number of inverters coupled in series. Delay circuit 105 receives the clock as an input and outputs a signal BR (B ready) that indicates when signal B is stable (glitch-free) and should be sampled.

Sampling circuit 110 is a latch storage element configured to sample input D when enable input E is asserted, propagating the level of the sampled input signal B to output Q to generate output signal BX. The sampling circuit 110 is transparent when signal BR is asserted, and the output signal BX "follows" the input signal B. The output signal BX is an input signal to the combinational logic 113. When signal BR, which is coupled to the enable input E, is negated, the sampling circuit 110 is opaque: the input signal B is not sampled and signal BX remains stable. Because input signal B is glitch-free when signal BR is asserted, output signal BX changes at most once, in response to signal BR transitioning from negated to asserted. In other words, the level of B is sampled after it reaches steady state, in response to signal BR being asserted.

FIG. 1B illustrates a timing diagram of the sampling circuit 110 shown in FIG. 1A, according to one embodiment. Shortly after the rising edge of the clock signal, signal A is valid and stable. Signal A propagates through combinational logic 103, and signal B glitches, switching between high and low levels before eventually reaching a stable level. The matched delay delta provided by delay circuit 105 delays the clock signal to produce the rising edge of BR. The matched delay used to generate signal BR is long enough to ensure that signal B has completed changing and is stable. While signal BR is asserted, signal B is sampled and propagated to signal BX. When signal BR is negated, sampling of signal B stops and the level of signal BX remains unchanged. Importantly, the glitch-free signal BX transitions only once per clock cycle, or remains constant (when signal B is unchanged from the previous clock cycle). Sampling circuit 110 effectively filters out glitches so that combinational logic 113 evaluates only once, thereby minimizing power consumption.

In another embodiment, signal A is valid and stable on the falling edge of the clock signal rather than on the rising edge of the clock signal. In this other embodiment, signal B is sampled and propagated to signal BX when signal BR is negated, and sampling of signal B stops and the level of signal BX is held when signal BR is asserted. Those of ordinary skill in the art will appreciate that any logic that performs the operations of block diagram 100 and the corresponding waveforms shown in FIG. 1B is within the scope and spirit of embodiments of the present disclosure.
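
As an illustration, the rising-edge behavior of FIGS. 1A and 1B can be sketched behaviorally in Python. This is a minimal model, not the patent's circuit: the event times, the delay value DELTA, and the helper level_at are assumptions chosen for the example.

# Behavioral sketch of FIGS. 1A/1B (illustrative, not a netlist). Signal B
# glitches after the clock edge and settles before the matched delay DELTA;
# BX follows B only while BR is asserted, so BX changes at most once.

CLOCK_EDGE = 0.0
DELTA = 5.0  # assumed matched delay of delay circuit 105

# (time, level) change events on B: one glitch, then stable high at t = 3.0
b_events = [(1.0, 1), (2.0, 0), (3.0, 1)]

def level_at(events, t, initial=0):
    """Return the level of a signal at time t, given its change events."""
    level = initial
    for when, value in events:
        if when <= t:
            level = value
    return level

bx = 0  # output of sampling circuit 110, held from the previous cycle
for step in range(20):
    t = step * 0.5
    br = t >= CLOCK_EDGE + DELTA  # BR asserts once B is guaranteed stable
    if br:                        # latch is transparent: BX follows B
        bx = level_at(b_events, t)
    print(f"t={t:4.1f}  B={level_at(b_events, t)}  BR={int(br)}  BX={bx}")

# BX ignores the glitches on B and transitions exactly once, at t = 5.0.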

More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may be optionally incorporated with or without the exclusion of other features described.

Fig. 1C shows a flow diagram of a method 115 for generating a glitch-free signal according to one embodiment. The method 115 is described in the context of logic or circuitry and may also be executed within a processor. For example, the method 115 may be performed by a GPU (graphics processing unit), a CPU (central processing unit), or any processor capable of generating a glitch-free signal. Moreover, one of ordinary skill in the art will appreciate that any system that performs the method 115 is within the scope and spirit of embodiments of the present disclosure.

At step 120, a ready signal is generated that is negated at a first transition of the clock signal and is asserted after a first delay relative to the first transition, wherein the first delay is produced by a delay circuit configured to match a second delay. In one embodiment, the first delay is matched to the second delay when the first delay is equal to the second delay. In one embodiment, the first delay is matched to the second delay when the first delay is equal to or greater than the second delay. In one embodiment, the first delay is greater than the second delay and is within the same clock cycle. In one embodiment, the ready signal is BR, the first delay is generated by delay circuit 105, and the second delay is a propagation delay through combinatorial logic 103. In one embodiment, delay circuit 105 inverts the clock signal to generate a delayed signal. In one embodiment, delay circuit 105 includes a chain of inverters coupled in series.

At step 125, an input signal generated by combinational logic is received, wherein after the first transition of the clock signal, a change in a first signal causes a corresponding change in the input signal after the second delay. In one embodiment, input signal B generated by combinational logic 103 is received at sampling circuit 110, and a change in signal A causes a corresponding change in input signal B after the matched delay. In one embodiment, the change in the first signal is a falling transition from a high level to a low level. In another embodiment, the change in the first signal is a rising transition from a low level to a high level.

At step 130, the input signal is sampled while the ready signal is asserted to transfer the level of the input signal to the output signal, wherein the input signal remains unchanged from the second delay until the sampling. In one embodiment, the input signal B is sampled by the sampling circuit 110 to transfer the level of the input signal B to the output signal BX. In one embodiment, the sampling circuit 110 is configured to hold the output signal BX at a constant level from the first transition of the clock signal until the input signal B is sampled. In one embodiment, the input signal B changes (glitches) at least once after the first transition of the clock signal and before the second delay.

In an embodiment, the delay circuit is further configured to negate the ready signal after a third delay relative to the first delay, wherein the ready signal is negated before the next clock cycle. In one embodiment, the sampling circuit is further configured to generate an output ready signal that is negated for the first delay and is asserted upon transferring the level of the input signal to the output signal. In one embodiment, the ready signal is negated in response to the assertion of the output ready signal.

FIG. 1D illustrates the glitch-free sampling circuit 110, according to one embodiment. The sampling circuit 110 includes two AND gates 122 and 124, an OR gate 126, and an enable inverter 118. When both the enable input E and the D input are high (asserted), the output of AND gate 122 is high and, therefore, the output Q of OR gate 126 is asserted. Otherwise, the output of AND gate 122 is low (negated).

The output Q of OR gate 126 is fed back within the sampling circuit 110 and input to AND gate 124. The enable signal E is inverted by the enable inverter 118 to produce signal NE (not enable), which is input to AND gate 124. Therefore, when the enable signal is high, AND gate 124 effectively disables the feedback path. When the enable signal is low (NE is high), AND gate 124 enables the feedback path, propagating the level of output Q to OR gate 126 to maintain the level of output Q until the enable signal is asserted. In other words, the enable inverter 118 is configured to enable the feedback path that keeps output signal Q constant until the D input signal is sampled. The enable inverter 118 ensures that when enable signal E transitions from low to high, the NE signal remains high, so that the feedback path remains enabled through AND gate 124 until the rising enable and the high level on input D propagate through AND gate 122 to output Q; this ensures that Q does not glitch as enable E is asserted while D is asserted. Otherwise, if the NE signal transitioned low (driving the output of AND gate 124 low) before the high level on D propagated through AND gate 122, both inputs to OR gate 126 would momentarily be low, causing output Q to glitch low before the output of AND gate 122 drives output Q high.
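
The race described above can be made concrete with a discrete-step model. In the sketch below, the gate delays are assumptions chosen to expose the hazard: AND gate 122 is modeled as slow and the enable inverter as fast. Lengthening the inverter delay stands in for an NE that overlaps the enable, and the glitch disappears.

# Discrete-step sketch of sampling circuit 110 (FIG. 1D); delays are assumed.
def simulate(inv_delay, steps=10):
    E = [0, 0] + [1] * steps                # enable E rises at t = 2
    D = [1] * len(E)                        # data input D held high
    q, ne, a122, a124 = [1], [1], [0], [1]  # steady state for E = 0, Q = 1
    for t in range(1, len(E)):
        ne.append(1 - E[max(t - inv_delay, 0)])           # enable inverter 118
        a122.append(E[max(t - 3, 0)] & D[max(t - 3, 0)])  # slow AND gate 122
        a124.append(ne[t - 1] & q[t - 1])                 # feedback AND gate 124
        q.append(a122[t] | a124[t])                       # OR gate 126
    return q

print(simulate(inv_delay=1))  # NE falls early: Q glitches low for one step
print(simulate(inv_delay=4))  # overlapped NE: Q stays high, no glitch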

In one embodiment, an output ready signal is generated that indicates when the Q output is ready to be sampled by receiving logic. In one embodiment, the output ready signal is negated during the delay provided by delay circuit 105 and asserted once the level of input signal D is conveyed to output signal Q. In one embodiment, the enable signal E is negated in response to the assertion of the output ready signal. In one embodiment, a delay circuit that matches the propagation delay from the enable signal E to the output signal Q is used to delay the enable signal E to generate the output ready signal. In one embodiment, an even number of inverters coupled in series delays the enable signal E to generate the output ready signal.

FIG. 1E shows an overlapping inverter circuit 119, according to one embodiment. Inverter circuit 119 includes two cross-coupled logic gates that generate EO (enable output) and NE. The cross-coupled logic gates are an OR gate 132 with one inverting input, which generates EO, and a NAND gate 134, which generates NE. The enable input E is input to both logic gates. When enable E transitions from low to high, OR gate 132 drives EO from low to high. When the rising edge of EO is received at NAND gate 134, the NE output transitions low. The propagation delay through NAND gate 134 ensures that EO and NE are high at the same time, providing overlap time. Similarly, when enable E transitions from high to low, NAND gate 134 drives NE from low to high. When the rising edge of NE is received at OR gate 132, EO is driven from high to low, and the propagation delay through OR gate 132 ensures that EO and NE are high at the same time, providing overlap time. It should be understood that inverter circuit 119 may replace enable inverter 118 in sampling circuit 110, with EO routed to AND gate 122 and NE routed to AND gate 124.
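
A unit-delay model of circuit 119 (the delays are assumptions) shows the overlap window appearing at both edges of E:

# Unit-delay sketch of overlapping inverter circuit 119 (delays assumed).
# OR gate 132 (one inverting input): EO = E or (not NE)
# NAND gate 134:                     NE = not (E and EO)
E = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # E rises at t = 2, falls at t = 7
eo, ne = [0], [1]                           # steady state for E = 0
for t in range(1, len(E)):
    eo.append(E[t - 1] | (1 - ne[t - 1]))   # EO follows E after one gate delay
    ne.append(1 - (E[t - 1] & eo[t - 1]))   # NE responds one gate later
for t, (e, o, n) in enumerate(zip(E, eo, ne)):
    print(t, e, o, n)
# Around each edge of E there is one step where EO and NE are both 1: the
# overlap that keeps the two AND gates of circuit 110 from both dropping low.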

FIG. 1F illustrates another glitch-free sampling circuit 140, in accordance with one embodiment. Glitch-free sampling circuit 110 may be replaced by glitch-free sampling circuit 140. OR gate 138 receives the intermediate signals generated by AND gates 135, 136, and 137, where AND gate 137 has an inverting input. When both input D and enable E are high, AND gate 135 propagates input D to OR gate 138 to drive output Q high. When enable E is low, AND gate 137 provides a feedback path to hold output Q high. AND gate 136 provides a feedback path that keeps output Q high when input D is high, regardless of the level of enable E, thereby preventing output Q from glitching low during the rising edge of enable E when input D is high. Output Q is driven low when input D is low and enable E is high, or when enable E is low and output Q is low.
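
Viewed as a boolean equation, circuit 140 computes Q = (D AND E) OR (Q AND NOT E) OR (Q AND D), where the AND gate 136 term is the consensus of the other two. A minimal sketch of this reading (the function form is an assumption, not the patent's netlist):

# Sketch of sampling circuit 140 (FIG. 1F). Without the AND gate 136 term,
# (D and E) or (Q and not E) has a static-1 hazard when E toggles while
# D = Q = 1; the consensus term (Q and D) keeps one OR input high.
def q_next(d, e, q, with_consensus=True):
    out = (d & e) | (q & (1 - e))  # AND gates 135 and 137 into OR gate 138
    if with_consensus:
        out |= q & d               # AND gate 136: consensus term
    return out

# With D = Q = 1, the consensus term is 1 for every value of E, so the OR
# output cannot dip low while E and its inverted copy are in transition.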

Returning to the block diagram 100 of FIG. 1A, when signal B has a large delay (more than half a clock cycle), it may be desirable to use an "asymmetric" ready signal BRA that is low for a longer time than it is high. In one embodiment, the asymmetric ready signal BRA is low at the end of the clock cycle.

FIG. 1G illustrates an asymmetric ready signal generation circuit 145, according to one embodiment. The asymmetric ready signal generation circuit 145 may replace the delay circuit 105 in the block diagram 100. The ready signal BRA is delayed from the rising edge of the clock signal by D1 and has a pulse width equal to D2. The first delay circuit 147 provides the delay D1, and the second delay circuit 146 provides the delay D2. An AND gate 148 having an inverting input generates the output BRA. In one embodiment, the first and second delay circuits 147 and 146 are implemented using a chain of serially coupled inverters, with an intermediate output of the chain providing the delay D1 and a further delayed output of the chain providing the delay D2.

FIG. 1H illustrates a timing diagram of the asymmetric ready signal generation circuit of FIG. 1G, according to one embodiment. The first delay circuit 147 causes the rising edge of BRA to occur at time D1 after the rising edge of the clock. The second delay circuit 146 sets the pulse width of the signal BRA to D2. The asymmetric ready signal BRA rises while the clock signal is low and falls before the next clock cycle begins.
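
The waveform of FIG. 1H follows directly from the structure of FIG. 1G. A sketch with an assumed discrete time base:

# Sketch of FIGS. 1G/1H (time base and delay values assumed):
# BRA(t) = clk(t - D1) and not clk(t - D1 - D2),
# a pulse of width D2 beginning D1 after the clock's rising edge.
D1, D2 = 3, 2
clk = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]  # one high phase of the clock

def delayed(sig, d, t):
    """Value of sig delayed by d steps (0 before the window)."""
    return sig[t - d] if t >= d else 0

bra = [delayed(clk, D1, t) & (1 - delayed(clk, D1 + D2, t))
       for t in range(len(clk))]
print(bra)  # [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]: width D2, offset D1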

Glitch-free sampling circuit 110 is a storage element that can be used to eliminate glitches in a signal by sampling the signal after it changes level and stabilizes. The delay of the ready signal matches the delay of the circuit generating the signal and enables the sampling that transfers the stable level of the signal to the output in a glitch-free manner. Sampling based on the ready signal produces a glitch-free output signal. Any changes in the signal prior to sampling are filtered out by the glitch-free sampling circuit 110, thereby reducing the number of times a node within the logic receiving the output signal is charged and discharged. As a result, the power consumed by the logic receiving the output signal is reduced compared to receiving the changing signal directly.

Glitch-Free Multiplexer

Ideally, to minimize charging and discharging of nodes within the combinational logic receiving the output of a multiplexer, the output of the multiplexer should remain unchanged until all inputs, including the select signal, reach their final state, and then the output should change exactly once (or remain unchanged). Glitch-free multiplexing can be achieved by using "bundled" self-timed signals. In particular, a "ready" signal may be associated with each possibly multi-bit logic signal (e.g., each data input and the select signal). The ready signal is only asserted when the associated logic signal has reached a final state and is unchanged (e.g., glitch-free) within a clock cycle.

For example, signal "BR" indicates when signal "B" is ready. As previously described, the ready signal may be generated from the clock signal by a delay circuit having a delay matched to the combinational logic block generating signal B. In one embodiment, an AND gate is used to provide asymmetric rise and fall delays, as shown in FIG. 1G. In one embodiment, an inverted version of the clock signal is used directly as the ready signal.

FIG. 2A illustrates a block diagram of a glitch-free 4-to-1 multiplexer 200, according to one embodiment. At a high level, a multiplexer is a logic component that receives a plurality (N) of data input signals and a select signal and then transmits one of the data input signals to an output signal based on the value of the select signal. For example, a multiplexer with inputs a and b, select, and output c may be configured such that c is a when select is 0 and c is b when select is 1. However, if inputs a and b or select glitch, the glitches propagate to output c. Reducing the glitches on output c reduces the power consumed by the combinational logic receiving output c as an input.
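
The behavior just described, and how a glitching select propagates, can be sketched as follows (an illustrative model only):

# Behavioral 2-to-1 multiplexer as described above.
def mux2(a, b, select):
    return b if select else a

# A select that glitches 0 -> 1 -> 0 within a cycle, with a != b, drives the
# output through the same glitch; downstream logic evaluates every
# intermediate value, charging and discharging its internal nodes each time.
for select in (0, 1, 0):
    print(mux2(a=0, b=1, select=select))  # prints 0, 1, 0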

The previously described glitch-free technique can be used to implement the glitch-free 4-to-1 multiplexer 200, where the data input signals are A, …, D. The input ready signals AR, …, DR are delayed to match the respective input signals. The select signal "Sel" selects one of the inputs to be sampled and transmitted to the output signal X. The feedback from X prevents output glitches by keeping the previous level on X while the hold signal is asserted.

Multiplexer 200 includes a timing decoder 210 and an output stage 215, each having signal terminals carrying inputs and outputs to and from the circuit. Glitch-free multiplexing is achieved by associating a ready signal with each input signal (Sel and the data input signals), any of which may be multi-bit. The ready signal goes high when the associated input signal reaches a steady state (e.g., a constant or glitch-free value) within the current clock cycle. For example, the signal AR indicates when the input signal A is stable. The ready signals may be generated from the circuit clock signal by inverter chains that match the delays of the combinational logic blocks generating each input signal. In some cases, AND gates may be used in the chain to provide asymmetric rise and fall delays, as previously described in connection with FIGS. 1G and 1H. In other cases, a positive or negative version of the clock may be used directly as the ready signal.

The timing decoder 210 receives as inputs a select signal (Sel), a select ready signal (SelR), and a plurality of input ready signals (AR, …, DR). The timing decoder 210 generates sample enable signals (AS, …, DS), each of which is asserted to cause the output stage 215 to sample a corresponding data input signal. The timing decoder 210 also generates a hold signal. In the context of the following description, the sample enable signals may be considered a multi-bit (e.g., one-hot) sample enable signal, where each bit corresponds to a different one of the data input signals and the corresponding ready signals. In one embodiment, the timing of the different sample enable signals may vary based on SelR and the input ready signals.

In general, each data input signal may be a multi-bit signal, and the corresponding ready signal indicates that all bits of the data input signal are glitch-free. In one embodiment, the sample enable signal for each data input may be used to sample all bits of the multi-bit data input signal and the output stage 215 is replicated for each bit to generate the multi-bit output signal X.

The output stage 215 accepts as inputs the data input signals, the sample enable signals, and the hold signal. The output stage 215 generates at least one output signal. The output signal may be fed back to the output stage 215 to be used as a hold feedback input. The timing decoder 210 is coupled to the output stage 215 to provide the sample enable signals to the output stage 215. The sample enable signals are timed to enable the output stage 215 to sample one of the data input signals according to the select input after each input signal has reached a final value in each clock cycle. The sampled signal is propagated to the at least one output signal, as further illustrated in FIG. 2B.

The select ready and input ready signals, in combination with the hold signal, cause multiplexer 200 to hold the output signal stable (unchanged) until all select and data input signals are ready, meaning at their final stable values for the clock interval, as indicated by the corresponding ready signals. The different logic paths and combinational logic that generate each data input and select signal may cause the various signals to reach their ready states at different times. Each ready signal may be adjusted based on the logic path traversed by its respective data input signal to provide the appropriate delay. Once the select ready signal and the ready signal corresponding to the selected data input indicate that the data input and the select signal have reached a glitch-free and stable state, multiplexer 200 propagates the selected data input to the output signal.

The output stage 215 shown in FIG. 2A includes a select gate 206 for each data input signal (A to D), and a select gate 205 for the hold signal. The output stage 215 also includes an output gate 212. "Gate" refers to any logic configured to combine one or more inputs according to a logic equation or truth table. For example, an AND gate is logic that combines multiple inputs according to a boolean AND operation. The term gate does not imply a particular arrangement of transistors, and in some cases a gate may be implemented using a combination of different boolean operations suited to the implementation.

Each select gate 205 and 206 may logically act as a two-input AND gate. Each select gate 206 receives as inputs a data input signal and a corresponding sample enable signal (A and AS, B and BS, …). When the sample enable signal is a logic "1", the output of the select gate matches the data input. When the sample enable signal is a logic "0", the output of the select gate is a logic "0". In one embodiment, the timing decoder 210 is configured as a one-hot output generator such that no more than one sample enable signal is high (logic "1") at any given time.

The outputs of select gates 206 are coupled to the inputs of output gate 212. The output gate 212 may act as a multiple-input OR gate. In this manner, the sampled signal produced at a select gate 206, as previously described, may propagate through output gate 212 to output signal X. In one embodiment, the output signal X is also fed back as an input to the select gate 205. Select gate 205 provides feedback that is input to output gate 212 to hold output signal X high while the hold signal is high. In this way, when the hold signal is high, the output signal is selected as the next input to propagate through the output gate 212, effectively holding the output signal steady for the clock interval. In other words, when the hold signal is asserted (and all sample enable signals are low), the output signal X maintains its current state.

The hold signal must remain asserted until one of the sample enable signals (AS, …, DS) is asserted, to prevent glitches at the output when sampling the selected input signal. In particular, when output signal X is asserted (high) and the selected input signal (A, …, D) is high, the overlap of the hold and sample enable signals ensures that output signal X does not glitch low (e.g., when output signal X is high, it does not transition low and return high in one clock cycle). If no overlap is used, a low glitch may result from the race condition when the hold signal transitions low. A low glitch occurs when the input signal has a high value and the output of select gate 205 transitions low before the sample enable signal transitions high at the input of select gate 206, resulting in all inputs to output gate 212 being low at the same time.
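
The output-stage equation and the overlap requirement can be sketched as follows; the step sequences are assumptions that illustrate the race, not a simulation of the actual gates:

# Behavioral sketch of output stage 215 (FIG. 2A).
def output_stage(inputs, enables, hold, x_prev):
    """X = OR over (input AND enable) terms, plus the (X AND hold) feedback."""
    x = x_prev & hold                  # select gate 205: hold feedback term
    for data, en in zip(inputs, enables):
        x |= data & en                 # select gates 206
    return x

A, B, C, D = 1, 0, 0, 0                # input A is high and is selected
x = 1                                  # output was high in the previous cycle

# Correct sequence: AS rises, then hold falls (overlap t0) -> no glitch.
for en_a, hold in [(0, 1), (1, 1), (1, 0)]:
    x = output_stage((A, B, C, D), (en_a, 0, 0, 0), hold, x)
    print(x)  # 1, 1, 1

# Race: hold falls before AS rises -> X glitches low for one step.
x = 1
for en_a, hold in [(0, 1), (0, 0), (1, 0)]:
    x = output_stage((A, B, C, D), (en_a, 0, 0, 0), hold, x)
    print(x)  # 1, 0, 1  <- the low glitch that the overlap prevents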

The feedback path is inexpensive in terms of transistors and power (the cost of a single NAND gate is typically 4 to 6 transistors). The increased resource cost is similar to adding an additional input to multiplexer 200 and is less expensive than adding flip-flops. The feedback path and the proper sequencing of the hold and sample enable signals prevent glitching of the output signal X of multiplexer 200.

Those of ordinary skill in the art will readily recognize that output stage 215 may include additional or slightly different elements not shown and not necessary to this description. AND gates are used to describe the select gates 205 and 206, and an OR gate is used to describe the output gate 212, because these symbols represent the functionality of the underlying logic used to generate the output signal based on the inputs. However, other combinations of logic may be implemented to achieve the same function. Five two-input NAND gates may feed into, for example, a single five-input NAND gate. Alternatively, three AND-OR-Invert (AOI) gates may feed into a three-input NAND gate. Many circuit arrangements can achieve the same logic result while differing in the number of transistors used, the delay incurred, and the power consumed. Those of ordinary skill in the art will readily recognize that multiplexer 200 may include additional or slightly different elements not shown and not necessary to this description.

Multiplexer 200 provides a reliably stable output signal value even when the data input and/or Sel signal changes value multiple times and/or at different times. Thus, for multiplexer 200, power consumption and noise associated with glitches and unnecessary output signal switching is reduced, as well as additional work performed by circuits receiving potentially unstable signals as inputs.

FIG. 2B illustrates a timing diagram of the glitch-free 4-to-1 multiplexer 200 of FIG. 2A, according to one embodiment. The timing diagram depicts the timing of the sample enable signals generated by the timing decoder 210 to achieve glitch-free multiplexing, with arrows indicating causality. The signal SelR is asserted to indicate that the select signal Sel is ready. Before signal SelR is asserted, Sel may glitch one or more times, changing between values (as indicated by the transitions labeled "glitch") until a steady state is reached, after which SelR is asserted. In this example, the select signal selects the data input signal A, and the input ready signal AR is asserted before SelR is asserted, indicating that the data input signal A is ready. In response to both AR and SelR having been asserted, timing decoder 210 asserts sample enable signal AS to sample data input signal A and propagate the sampled value to output signal X of multiplexer 200.

If the AR signal has not been asserted when SelR is asserted, the timing decoder 210 waits for the AR signal to be asserted before asserting the AS signal. After a short overlap delay (t0) following the assertion of the AS signal, the hold signal is deasserted. The overlap of the AS and hold signal assertions during t0 is for the case where both the output signal X and the bit of the data input signal A are asserted. The overlap ensures that the bit of the output signal X does not glitch into the deasserted state between the time the output of the hold select gate 205 in the output stage 215 becomes deasserted and the time the output of the A select gate 206 in the output stage 215 is asserted. To reset multiplexer 200 for the next clock cycle, the hold signal is asserted after both the SelR signal and the AR signal are deasserted, and after an overlap delay (t1) the AS signal is deasserted. The delays t0 and t1 may be equal or different. In one embodiment, the delays t0 and t1 are approximately the delay of a logic gate. In some implementations, an inverted version of the hold signal (possibly with added delay) may be used as the "output ready" signal XR.

FIG. 2C illustrates a block diagram of the timing decoder 210 of FIG. 2A, according to one embodiment. Sequencing logic is added to order the hold and sample enable signals AS, …, DS to avoid glitches. The one-hot decoder 220 may be omitted if the Sel signal input is already in one-hot form.

In the ready signal logic 221, each one-hot signal ad, …, dd is ANDed with the select ready signal SelR and a corresponding input ready signal (e.g., AR for input signal A, etc.). Depending on the implementation, the AND gates that implement the ready signal logic 221 may be implemented within the one-hot decoder 220 or by extending the AND gates in the sample enable signal logic 216 and the hold signal logic 218.

The resulting decoded ready signals adr, …, ddr output from the ready signal logic 221 drive a set of reset/set (RS) flip-flops or latches, implemented as AND-OR gates in the depicted embodiment, to generate and hold the sample enable signals AS, …, DS. The combinational logic comprising the sample enable signal logic 216 and the hold signal logic 218 is gated by the ready signal logic 221 to generate the sample enable and hold signals.

When a decoded input ready signal (e.g., adr) is asserted, it sets the output of the latch implemented by the corresponding sample enable signal logic 216 (e.g., signal AS). When any sample enable signal (e.g., AS) is asserted, it resets the flip-flop implemented in hold signal logic 218, causing the hold signal to go low. If additional overlap is required to ensure reliable operation, a delay circuit may be inserted into the reset path. However, in many cases, the delay of the RS flip-flop in the sample enable signal logic 216 itself is sufficient to avoid deassertion glitches when both the held output signal X and the newly selected input signal are asserted.

For multi-bit multiplexers (a multiplexer that selects a multi-bit input signal as a multi-bit output signal), each block of sample enable signal logic 216 (e.g., an AND-OR gate configuration) performs the operation of an RS flip-flop or latch for all bits of one multi-bit input signal. Other low-gate-count configurations with RS/SR flip-flop behavior may also be utilized.

To reset and prepare the timing decoder 210 for the next clock cycle, the hold RS flip-flop implemented in the hold signal logic 218 is set when the enable signals, such as adr, go low. Since only one enable signal is asserted at a time, this condition is detected in the hold signal logic 218 by a NOR gate (e.g., an AND gate with inverting inputs) configured to detect that all enable signals are deasserted. The hold signal becomes asserted, and then the selected RS flip-flop in the sample enable signal logic 216 is reset. A delay may be inserted in the hold-to-sample-enable reset path if the overlap period needs to be extended.
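
The coupled set/reset behavior described above can be sketched with one evaluation step per latch delay. The dominance choices and delays below are assumptions consistent with the described sequencing:

# Sketch of the coupled RS latches in timing decoder 210 (FIG. 2C).
adr = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # decoded ready (SelR AND AR) for A
AS, hold = [0], [1]                   # start of cycle: hold asserted
for t in range(1, len(adr)):
    # sample enable latch (logic 216): set by adr, reset by hold
    AS.append(adr[t - 1] | (AS[t - 1] & (1 - hold[t - 1])))
    # hold latch (logic 218): set when all enables are low, reset by any AS
    hold.append((1 - adr[t - 1]) | ((1 - AS[t - 1]) & hold[t - 1]))
print(list(zip(adr, AS, hold)))
# AS and hold are both asserted for one step at each handoff: the t0 overlap
# when AS asserts, and the t1 overlap when hold re-asserts at end of cycle.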

Fig. 2D illustrates a flow diagram of a method 225 for generating a glitch-free multiplexer output signal, according to one embodiment. The method 225 may be performed by logic or custom circuitry. For example, method 225 may be performed by a GPU, a CPU, or any processor capable of generating a glitch-free signal. Moreover, one of ordinary skill in the art will appreciate that any system that performs the method 225 is within the scope and spirit of embodiments of the present disclosure.

At step 230, a select ready signal is received at the timing decoder 210. The select ready signal is negated until the select signal generated by combinational logic is unchanged (glitch-free) and is asserted after the select signal is unchanged. At step 235, the timing decoder 210 generates at least one sample enable signal according to the select signal, wherein the at least one sample enable signal corresponds to a set of data input signals. The at least one sample enable signal is negated while the select ready signal is negated, and the at least one sample enable signal is asserted when both the select ready signal and the corresponding data ready signal are asserted.

In one embodiment, the select signal comprises a multi-bit signal, and each bit in the select signal is associated with a different one of the set of data input signals, and only one bit is asserted at a time. For example, the select signal may be one-hot encoded by one-hot decoder 220 to produce ad, …, dd. In one embodiment, each bit in the select signal is used to sample the associated data input signal. For example, a multi-bit select signal is used to generate sample enable signals that are each associated with a different one of the data input signals. In one embodiment, the select signal may be binary coded, wherein each binary value selects one of the data input signals.

In one embodiment, the timing decoder 210 receives a set of ready signals, each ready signal in the set being associated with a different one of the data input signals in the set of data input signals. Furthermore, each ready signal is negated until the associated data input signal is unchanged, and each ready signal is asserted after the associated data input signal is unchanged.

In one embodiment, the timing decoder 210 is further configured to generate a set of enable signals, e.g., adr, …, ddr, wherein each enable signal in the set is associated with a different one of the data input signals in the set. Further, each of the enable signals is negated when the associated ready signal is negated, and each of the enable signals is asserted in response to the assertion of the associated ready signal when the associated data input signal is asserted.

At step 240, the timing decoder 210 generates a hold signal that is asserted while the at least one sample enable signal is negated and that is negated in response to the assertion of the at least one sample enable signal. In one embodiment, the timing decoder 210 is configured to assert the hold signal while the hold signal is asserted and the at least one sample enable signal is negated. In one embodiment, both the hold signal and the at least one sample enable signal are asserted for a first duration (e.g., t0 and/or t1). In one embodiment, the hold signal is inverted to generate an output ready signal associated with the output signal.

At step 245, the sampling circuit, e.g., output stage 215, holds the output signal unchanged while the hold signal is asserted. At step 250, one of the data input signals is sampled according to at least one sampling enable signal while the hold signal is negated to transfer the value of the sampled data input signal to the output signal.
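
Putting steps 230 through 250 together, the following Python sketch gives a cycle-level behavioral model of the glitch-free multiplexer described by method 225 (the names and the dictionary used to carry the latch state are illustrative assumptions, not elements of the figures):

    def glitch_free_mux_step(data, data_ready, select, select_ready, state):
        # data:         levels of the data input signals
        # data_ready:   per-input ready signals
        # select:       one-hot select bits, one per data input
        # select_ready: asserted after the select signal is glitch-free
        # state:        sample-enable flip-flops, hold flip-flop, held output
        n = len(data)
        # Step 235: set a sample enable only when the select bit, the select
        # ready signal, and the corresponding data ready signal all agree.
        for i in range(n):
            if select[i] and select_ready and data_ready[i]:
                state["enable"][i] = True
        # Step 240: the hold signal is negated once any sample enable asserts.
        if any(state["enable"]):
            state["hold"] = False
        # Steps 245/250: while hold is negated, transfer the level of the
        # sampled data input to the output; otherwise the output is held.
        if not state["hold"]:
            for i in range(n):
                if state["enable"][i]:
                    state["out"] = data[i]
        return state["out"]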

In one embodiment, the set of input data signals includes three input data signals, and the selection signal is configured to select one of the three input data signals to produce the output signal. In one embodiment, the set of input data signals includes four input data signals, and the selection signal is configured to select two of the four input data signals to produce the output signal and the additional output signal.

A particular embodiment of a glitch-free multiplexer is a 4-2 multiplexer that simultaneously selects two of four inputs a0 to a3 to propagate to two output signals q0 and q1. With order preserved, as shown in Table 1, there are six possible selections for this multiplexer, encoded by a three-bit select signal:

sel    q0    q1
0      a0    a1
1      a0    a2
2      a0    a3
3      a1    a2
4      a1    a3
5      a2    a3

TABLE 1

The data path of the 4-2 multiplexer may be implemented as two 3-1 multiplexers with corresponding decoded input ready signals derived from the 3-6 decoding scheme, as shown by the following equations:

For the 3-1 multiplexer that produces q0:

a00dr=(s0∨s1∨s2)∧selr∧a0r

a01dr=(s3∨s4)∧selr∧a1r

a02dr=s5∧selr∧a2r

For the 3-1 multiplexer that produces q1:

a11dr=s0∧selr∧a1r

a12dr=(s1∨s3)∧selr∧a2r

a13dr=(s2∨s4∨s5)∧selr∧a3r

Here, sx (where x takes a value of 0-5) indicates that the select signal has the value x; auvdr is the decoded input ready signal that gates the input signal av to the output signal qu. These signals are used to drive the four RS latches of each 3-1 multiplexer of the 4-2 multiplexer. For the q0 multiplexer, three RS latches generate signals a00s, a01s, and a02s, enabling the paths from inputs a0, a1, and a2 to q0, respectively. The final RS latch holds q0. A similar set of four RS latches controls the q1 multiplexer. For example, the formula for the a00s RS latch, which is set by a00dr and reset by the hold signal, is:

a00s=a00dr∨(hold′∧a00s)
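
The decode equations above may be cross-checked against Table 1 with a short Python sketch (illustrative only; the names mirror the equations):

    # Table 1: select value -> (index of input feeding q0, index feeding q1)
    SEL_TABLE = {0: (0, 1), 1: (0, 2), 2: (0, 3),
                 3: (1, 2), 4: (1, 3), 5: (2, 3)}

    def decoded_input_ready(sel, selr, a_ready):
        # Return the asserted auvdr signals for a given select value.
        # For example, sel=1 with selr and a2r asserted yields a12dr,
        # matching a12dr=(s1 v s3) ^ selr ^ a2r above.
        q0_src, q1_src = SEL_TABLE[sel]
        return {"a0%ddr" % q0_src: selr and a_ready[q0_src],
                "a1%ddr" % q1_src: selr and a_ready[q1_src]}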

fig. 2E illustrates an extended ready signal generation circuit 255 according to one embodiment. The select ready signal is extended with RS flip-flops including and gate 256 and or gate 257 to ensure overlap with the input ready signal. For the glitch-free multiplexer to function properly, the select ready (SelR) signal and the input ready signal (XR) overlap for a sufficient time to set the selected RS flip-flop in the sample enable signal logic 216 of the timing decoder 210. The delay of the input ready signal is much greater than the delay of the select ready signal and no overlap may occur. This problem can be solved by extending the select ready signal with additional RS flip-flops. The extension signal (SelXR) is set by setting SelR high and the extension signal is cleared by setting the corresponding sampling enable signal (XS) high. The timing decoder 210 may be modified to include additional RS flip-flops for each input data signal. Specifically, an extended ready signal generation circuit 255 may be inserted to replace the SelR input to each and gate within the ready signal logic 221.

Fig. 2F illustrates a timing diagram of the extended ready signal generation circuit 255 of Fig. 2E, according to one embodiment. Because SelR and XR may fail to overlap in time when asserted, XS is generated using SelXR, the extended version of SelR. SelXR is set in response to SelR being asserted and is reset in response to XS being asserted.
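
In equation form, one plausible composition of and gate 256 and or gate 257 (with the clear input dominant, as the timing diagram suggests) is SelXR=(SelR∨SelXR)∧XS′, or as a one-line Python sketch:

    def update_selxr(selr, xs, selxr):
        # RS flip-flop of FIG. 2E: or gate 257 sets SelXR when SelR asserts,
        # and gate 256 clears it when the sample enable XS asserts.
        return (selr or selxr) and not xs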

In some cases, the input ready signal has a longer duration than needed, and the reset of the glitch-free N-1 multiplexer 200 to prepare it for the next set of inputs may be unnecessarily delayed. A quick return circuit may be implemented on the input ready signal. The quick return circuit returns the input ready signal to the deasserted state as soon as the output ready signal is detected.

FIG. 2G illustrates a quick return circuit 260 according to one embodiment. The logic formula of the circuit is as follows:

ARq=AR∧S’

S=(AR∧S)∨(ARq∧OR)

where AR is the input data ready signal, ARq is the quick return version of the input data ready signal, OR is the output ready signal (a delayed version of the sample enable signal AS), S is the state variable, and S' is the complement of S.

FIG. 2H illustrates a timing diagram for the quick return circuit 260 of FIG. 2G, according to one embodiment. When the input ready signal AR rises, the signal ARq also rises, but as soon as the output ready signal OR goes high, the signal ARq falls. The signal AR itself remains high for some time afterward. To enable a faster reset of the glitch-free N-1 multiplexer 200, the quick return circuit 260 generates the signal ARq, which returns to zero with low delay after OR goes high. The quick return behavior is enabled by setting the state signal S, which is generated by an RS flip-flop. The RS flip-flop is set by OR going high while ARq is high, and it keeps S high until AR goes low to reset it.
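
The two logic equations of the quick return circuit 260 can be exercised with a short Python trace (a discrete-step sketch; the actual circuit evaluates the equations continuously):

    def quick_return_step(ar, out_ready, s):
        # One evaluation of: ARq = AR ^ S'   and   S = (AR ^ S) v (ARq ^ OR)
        arq = ar and not s
        s_next = (ar and s) or (arq and out_ready)
        return arq, s_next

    s = False
    for ar, or_ in [(True, False), (True, True), (True, False), (False, False)]:
        arq, s = quick_return_step(ar, or_, s)
        # trace: ARq rises with AR, falls once OR has gone high, and S
        # stays high until AR goes low to reset it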

In other cases, the output ready signal (OR) may need to be extended until all of the outputs combined with it are ready and the subsequent circuit accepts the combination. This event may be signaled by an ack signal (which may itself be an output ready signal of the subsequent circuit). In these cases, simply using the complement of the hold signal as OR may not be long enough to meet the timing constraints of the implementation. The extended output ready signal may be generated using an RS flip-flop that is set when hold is deasserted and reset when ack is asserted:

OR=hold’∨(ack’∧OR)

in this case, the hold signal logic 218 of FIG. 2C may be modified to prevent the hold from becoming de-asserted again until ack is asserted.
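
Rendered the same way, the extended output ready equation becomes the following Python sketch:

    def extended_output_ready(hold, ack, out_ready):
        # OR = hold' v (ack' ^ OR): set while hold is deasserted,
        # then retained until ack is asserted.
        return (not hold) or ((not ack) and out_ready)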

Glitch-free techniques may be used to reduce the power consumption of multiplexer logic. The output signal is prevented from changing until a glitch-free value is propagated to the output signal, causing the output signal to transition only once (or remain constant) per clock cycle. When a glitch occurs in a signal, the logic receiving the signal may respond, charging and/or discharging nodes within the logic and consuming power. Providing glitch-free signals may reduce the number of times a node is charged and/or discharged, thereby reducing the power consumed by the logic.

The glitch-free technique associates a ready signal with each input of the glitch-free multiplexer to control sampling of the selected input after the input is stable and glitch-free. Embodiments of the sampling element include a transparent latch or a multiplexer with feedback. The delay of each ready signal matches the delay of the associated input, and the ready signal may be extended, generated as an asymmetric pulse, or quickly reset. Within the multiplexer, a hold signal is generated to hold the output signal stable through a feedback path until the sampled signal value has propagated. The hold signal may also be used to generate an output ready signal.

The glitch-free technique may be implemented in any circuit including combinational logic and sequential logic. For example, a glitch-free technique may be used within one or more logic blocks within a processor and for the inputs or outputs of the processor. In particular, glitch-free multiplexers may be used to select only non-zero activations and/or weights for convolution operations. In one embodiment, the glitch-free techniques may be implemented within a parallel processing architecture, as further described herein.

Parallel processing architecture

FIG. 3 illustrates a Parallel Processing Unit (PPU) 300 according to one embodiment. In one embodiment, the PPU300 is a multithreaded processor implemented on one or more integrated circuit devices. The PPU300 is a latency hiding architecture designed to process many threads in parallel. A thread (e.g., a thread of execution) is an instance of a set of instructions configured to be executed by the PPU300. In one embodiment, the PPU300 is a Graphics Processing Unit (GPU) configured to implement a graphics rendering pipeline for processing three-dimensional (3D) graphics data in order to generate two-dimensional (2D) image data for display on a display device, such as a Liquid Crystal Display (LCD). In other embodiments, the PPU300 may be used to perform general-purpose computations. While one example parallel processor is provided herein for illustrative purposes, it should be strongly noted that such a processor is set forth for illustrative purposes only, and that any processor may be employed in addition to and/or in place of it.

The one or more PPUs 300 are configured to accelerate high performance computing ("HPC"), data center, and machine learning applications. In at least one embodiment, the PPU300 is configured to accelerate deep learning systems and applications, including autonomous vehicle platforms, deep learning, high-precision speech, image, and text recognition systems, intelligent video analytics, molecular simulation, drug discovery, disease diagnosis, weather forecasting, big data analytics, astronomy, molecular dynamics simulation, financial modeling, robotics, factory automation, real-time language translation, online search optimization, personalized user recommendations, and the like.

As shown in FIG. 3, the PPU300 includes, but is not limited to, an input/output ("I/O") unit 305, a front end unit 315, a scheduler unit 320, a work allocation unit 325, a hub 330, a crossbar ("Xbar") 370, one or more general processing clusters (GPCs) 350, and one or more memory partition units 380. In at least one embodiment, the PPU300 is interconnected to a host processor or other PPUs 300 via one or more NVLinks 310. The PPU300 is connected to a host processor or other peripheral devices via an interconnect 302. The PPU300 is also coupled to a local memory 304 comprising a plurality of memory devices. In at least one embodiment, the local memory 304 comprises a plurality of dynamic random access memory (DRAM) devices. The DRAM devices may be configured as a high bandwidth memory (HBM) subsystem, with multiple DRAM dies stacked within each device.

The NVLink310 interconnect enables systems to scale and include one or more PPUs 300 combined with one or more CPUs, supports cache coherence between the PPUs 300 and the CPUs, and supports CPU mastering. Data and/or commands may be transmitted by the NVLink310 through the hub 330 to/from other units of the PPU300, such as one or more copy engines, a video encoder, a video decoder, a power management unit, etc. (not explicitly shown). The NVLink310 is described in more detail in conjunction with FIG. 5B.

The I/O unit 305 is configured to transmit and receive communications (e.g., commands, data) to and from a host processor (not shown) over the interconnect 302. The I/O unit 305 may communicate with the host processor directly via the interconnect 302 or through one or more intermediate devices (e.g., a memory bridge). In one embodiment, the I/O unit 305 may communicate with one or more other processors (e.g., one or more PPUs 300) via the interconnect 302. In at least one embodiment, the I/O unit 305 implements a peripheral component interconnect express ("PCIe") interface for communicating over a PCIe bus, and the interconnect 302 is a PCIe bus. In at least one embodiment, the I/O unit 305 implements other types of well-known interfaces for communicating with external devices.

The I/O unit 305 decodes packets received via the interconnect 302. In one embodiment, the packets represent commands configured to cause the PPU300 to perform various operations. The I/O unit 305 transmits the decoded commands to various other units of the PPU300 as the commands may specify. For example, some commands are transmitted to the hub 330 or other units of the PPU300, such as one or more copy engines, a video encoder, a video decoder, a power management unit, etc. (not explicitly shown). In other words, the I/O unit 305 is configured to route communications between and among the various logical units of the PPU300.

In one embodiment, a program executed by a host processor encodes a command stream in a buffer that provides a workload to the PPU300 for processing. The workload includes a plurality of instructions and data to be processed by those instructions. A buffer is an area in memory that is accessible (e.g., read/write) by both the host processor and the PPU 300. For example, I/O unit 305 may be configured to access buffers in system memory coupled to interconnect 302 via memory requests transmitted over interconnect 302. In one embodiment, the host processor writes the command stream to a buffer and then sends a pointer to the PPU300 indicating the start of the command stream. The front end unit 315 receives pointers to one or more command streams. The front end unit 315 manages one or more streams, reads commands from the streams and forwards the commands to the various units of the PPU 300.

The front end units 315 are coupled to a scheduler unit 320, which scheduler unit 320 configures various GPCs 350 to process tasks defined by one or more streams. The scheduler unit 320 is configured to track status information related to various tasks managed by the scheduler unit 320. The state information may indicate which GPCs 350 the task is assigned to, whether the task is active or inactive, a priority associated with the task, and so on. In at least one embodiment, the scheduler unit 320 manages a plurality of tasks executing on one or more GPCs 350.

In at least one embodiment, the scheduler unit 320 is coupled to a work allocation unit 325, the work allocation unit 325 being configured to dispatch tasks for execution on the GPCs 350. In at least one embodiment, the work allocation unit 325 tracks a plurality of scheduled tasks received from the scheduler unit 320. In one embodiment, the work distribution unit 325 manages a pending task pool and an active task pool for each GPC 350. The pool of pending tasks includes a plurality of time slots (e.g., 32 time slots) that contain tasks assigned to be processed by a particular GPC 350. The active task pool may include a number of slots (e.g., 4 slots) for tasks actively processed by the GPCs 350. When execution of a task is completed in a GPC350, the task is evicted from the active task pool of the GPC350 and another task is selected from the pending task pool and scheduled to execute on the GPC 350. In at least one embodiment, if the active task is in an idle state on the GPC350, such as while waiting for a data dependency to be resolved, the active task is evicted from the GPC350 and returned to the pending task pool while another task in the pending task pool is selected and scheduled to execute on the GPC 350.

The work distribution unit 325 communicates with the one or more GPCs 350 via the XBar 370. The XBar 370 is an interconnect network that couples many of the units of the PPU300 to other units of the PPU300. For example, the XBar 370 may be configured to couple the work distribution unit 325 to a particular GPC350. Although not shown explicitly, one or more other units of the PPU300 may also be connected to the XBar 370 via the hub 330.

Tasks are managed by the scheduler unit 320 and dispatched to the GPCs 350 by the work distribution unit 325. The GPCs 350 are configured to process the tasks and generate results. The results may be consumed by other tasks within the GPC350, routed to a different GPC350 via the XBar 370, or stored in the memory 304. The results can be written to the memory 304 via the memory partition units 380, which implement a memory interface for reading data from and writing data to the memory 304. The results can be transmitted to another PPU300 or CPU via the NVLink310. In one embodiment, the PPU300 includes, but is not limited to, a number U of memory partition units 380 that is equal to the number of separate and distinct memory devices 304 coupled to the PPU300. The memory partition unit 380 is described in more detail below in conjunction with FIG. 4B.

In one embodiment, the host processor executes a driver core that implements an Application Programming Interface (API) that enables one or more applications executing on the host processor to schedule operations to execute on the PPU 300. In one embodiment, multiple computing applications are executed simultaneously by the PPU300, and the PPU300 provides isolation, quality of service ("QoS"), and independent address spaces for the multiple computing applications. The application generates instructions (e.g., in the form of API calls) that cause the driver core to generate one or more tasks for execution by the PPU 300. The driver core outputs tasks to one or more streams processed by the PPU 300. Each task includes one or more related thread groups, which may be referred to as thread bundles (warp). In one embodiment, the thread bundle includes 32 related threads that may be executed in parallel. Cooperative threads may refer to multiple threads, including instructions for performing tasks and exchanging data through shared memory. Threads and cooperative threads are described in more detail in conjunction with FIG. 5A.

FIG. 4A illustrates a GPC350 of the PPU300 of FIG. 3, according to one embodiment. As shown in FIG. 4A, each GPC350 includes a number of hardware units for processing tasks. In one embodiment, each GPC350 includes a pipeline manager 410, a pre-raster operations unit (PROP) 415, a raster engine 425, a work distribution crossbar (WDX) 480, a memory management unit (MMU) 490, and one or more data processing clusters (DPCs) 420. It should be understood that the GPC350 of FIG. 4A may include other hardware units in addition to, or in place of, the units shown in FIG. 4A.

In one embodiment, the operation of the GPC350 is controlled by the pipeline manager 410. The pipeline manager 410 manages the configuration of the one or more DPCs 420 for processing tasks allocated to the GPC350. In one embodiment, the pipeline manager 410 configures at least one of the one or more DPCs 420 to implement at least a portion of a graphics rendering pipeline. For example, a DPC420 may be configured to execute a vertex shader program on the programmable streaming multiprocessor (SM) 440. The pipeline manager 410 may also be configured to route packets received from the work distribution unit 325 to the appropriate logical units within the GPC350. For example, some packets may be routed to fixed function hardware units in the PROP415 and/or raster engine 425, while other packets may be routed to the DPCs 420 for processing by the primitive engine 435 or the SM440. In at least one embodiment, the pipeline manager 410 may configure at least one of the one or more DPCs 420 to implement a neural network model and/or a compute pipeline.

The PROP unit 415 is configured to route data generated by the raster engine 425 and the DPCs 420 to a Raster Operations (ROP) unit, described in more detail in conjunction with FIG. 4B. The PROP unit 415 may also be configured to perform optimizations for color blending, organize pixel data, perform address translations, and the like.

The raster engine 425 includes a number of fixed function hardware units configured to perform various raster operations. In one embodiment, the raster engine 425 includes a setup engine, a coarse raster engine, a culling engine, a clipping engine, a fine raster engine, and a tile coalescing engine. The setup engine receives transformed vertices and generates plane equations associated with the geometric primitives defined by the vertices. The plane equations are transmitted to the coarse raster engine to generate coverage information (e.g., an x, y coverage mask for a tile) for the primitive. The output of the coarse raster engine is transmitted to the culling engine, where fragments associated with the primitive that fail a z-test are culled, and the surviving fragments are transmitted to the clipping engine, where fragments lying outside a view frustum are clipped. The fragments that survive clipping and culling are passed to the fine raster engine to generate attributes for the pixel fragments based on the plane equations generated by the setup engine. The output of the raster engine 425 comprises fragments to be processed, for example, by a fragment shader implemented within a DPC420.

Each DPC420 included in the GPC350 includes an M-pipe controller (MPC) 430, a primitive engine 435, and one or more SMs 440. In at least one embodiment, the MPC430 controls the operation of the DPC420, routing packets received from the pipeline manager 410 to the appropriate units in the DPC420. For example, packets associated with a vertex are routed to the primitive engine 435, which is configured to fetch the vertex attributes associated with the vertex from the memory 304; in contrast, packets associated with a shader program may be transmitted to the SM440.

SM440 includes a programmable streaming processor configured to process tasks represented by a plurality of threads. Each SM440 is multithreaded and configured to concurrently execute multiple threads (e.g., 32 threads) from a particular thread group. In one embodiment, SM440 implements a SIMD ("single instruction multiple data") architecture, in which each thread in a set of threads (e.g., a thread bundle) is configured to process different sets of data based on a common set of instructions. All threads in a thread group execute the same instruction. In one embodiment, the SM440 implements a SIMT ("single instruction multi-threading") architecture, where each thread in a group of threads is configured to process different sets of data based on a common set of instructions, but where the individual threads in the group of threads are allowed to diverge during execution. In at least one embodiment, a program counter, call stack, and execution state are maintained for each thread bundle to enable concurrency between the thread bundle and serial execution within the thread bundle as threads in the thread bundle diverge. In another embodiment, a program counter, call stack, and execution state are maintained for each individual thread, so that there is equal concurrency between all threads within and between thread bundles. While the execution state is maintained for each individual thread, threads executing common instructions may be converged and executed in parallel to improve efficiency. SM440 is described in more detail below in conjunction with fig. 5A.

The MMU490 provides an interface between the GPCs 350 and the memory partition units 380, and the MMU490 provides translation of virtual addresses to physical addresses, memory protection, and arbitration of memory requests. In one embodiment, the MMU490 provides one or more translation lookaside buffers ("TLBs") for performing translations of virtual addresses to physical addresses in the memory 304.

FIG. 4B illustrates a memory partition unit 380 of the PPU300 of FIG. 3, according to one embodiment. As shown in FIG. 4B, the memory partition unit 380 includes, but is not limited to, a Raster Operations (ROP) unit 450; a level two (L2) cache 460; and a memory interface 470. The memory interface 470 is coupled to the memory 304. The memory interface 470 may implement 32, 64, 128, 1024-bit data buses, or the like, for high-speed data transfer. In one embodiment, the PPU300 incorporates U memory interfaces 470, one memory interface 470 per pair of memory partition units 380, where each pair of memory partition units 380 is connected to a corresponding memory device 304. For example, the PPU300 may be connected to up to Y memory devices, such as high bandwidth memory stacks or graphics double-data-rate, version 5, synchronous dynamic random access memory, or other types of memory.

In one embodiment, the memory interface 470 implements an HBM2 memory interface, and Y equals half U. In one embodiment, the HBM2 memory stacks are located on the same physical package as the PPU300, providing substantial power and area savings compared with conventional GDDR5 SDRAM systems. In one embodiment, each HBM2 stack includes four memory dies and Y equals 4, with each HBM2 stack including two 128-bit channels per die, for a total of 8 channels and a data bus width of 1024 bits.

In one embodiment, memory 304 supports Single Error Correction Double Error Detection (SECDED) Error Correction Codes (ECC) to protect data. ECC may provide greater reliability for computing applications that are sensitive to data corruption. Reliability is particularly important in large-scale clustered computing environments where the PPU300 handles very large data sets and/or long running applications.

In one embodiment, the PPU300 implements a multi-level memory hierarchy. In one embodiment, the memory partition unit 380 supports a unified memory that provides a single unified virtual address space for CPU and PPU300 memory, enabling data sharing between virtual memory systems. In one embodiment, the frequency of accesses by a PPU300 to memory located on other processors is tracked to ensure that memory pages are moved to the physical memory of the PPU300 that is accessing the pages more frequently. In one embodiment, the NVLink310 supports address translation services allowing the PPU300 to directly access a CPU's page tables and providing full access to CPU memory by the PPU300.

In one embodiment, copy engines transfer data between multiple PPUs 300 or between a PPU300 and a CPU. The copy engines can generate page faults for addresses that are not mapped into the page tables; the memory partition unit 380 then services the page faults, mapping the addresses into the page table, after which the copy engine performs the transfer. In conventional systems, memory is pinned (e.g., made non-pageable) for copy engine operations between multiple processors, substantially reducing the available memory. With hardware page faulting, addresses can be passed to the copy engines without concern over whether the memory pages are resident, and the copy process is transparent.

Data from the memory 304 or other system memory is fetched by the memory partition unit 380 and stored in the L2 cache 460, which is located on-chip and is shared among the various GPCs 350. As shown, each memory partition unit 380 includes at least a portion of the L2 cache 460 associated with a corresponding memory 304. Lower-level caches are then implemented in various units within the GPCs 350. For example, each SM440 may implement a level one (L1) cache, where the L1 cache is private memory dedicated to a particular SM440. Data from the L2 cache 460 may be fetched and stored in each of the L1 caches for processing in the functional units of the SMs 440. The L2 cache 460 is coupled to the memory interface 470 and the XBar 370.

The ROP unit 450 performs graphics raster operations related to pixel color, such as color compression, pixel blending, and the like. The ROP unit 450 also implements depth testing in conjunction with the raster engine 425, receiving the depth of a sample location associated with a pixel fragment from the culling engine of the raster engine 425. The depth is tested against a corresponding depth in a depth buffer for the sample location associated with the fragment. If the fragment passes the depth test for the sample location, the ROP unit 450 updates the depth buffer and transmits the result of the depth test to the raster engine 425. It will be appreciated that the number of memory partition units 380 may be different from the number of GPCs 350, and, therefore, each ROP unit 450 may be coupled to each of the GPCs 350. The ROP unit 450 tracks packets received from the different GPCs 350 and determines the GPC350 to which a result generated by the ROP unit 450 is routed through the XBar 370. Although the ROP unit 450 is included within the memory partition unit 380 in FIG. 4B, in other embodiments the ROP unit 450 may be outside of the memory partition unit 380. For example, the ROP unit 450 may reside in the GPC350 or another unit.

Fig. 5A illustrates the streaming multiprocessor ("SM") 440 of fig. 4A, according to one embodiment. As shown in fig. 5A, the SM440 includes, but is not limited to, an instruction cache 505; one or more scheduler units 510; a register file 520; one or more processing cores 550; one or more special function units (SFUs) 552; one or more load/store units (LSUs) 554; an interconnect network 580; and a shared memory/L1 cache 570.

As described above, the work distribution unit 325 dispatches tasks for execution on the GPCs 350 of the PPU300. The tasks are allocated to a particular DPC420 within a GPC350 and, if the task is associated with a shader program, the task may be allocated to an SM440. The scheduler unit 510 receives the tasks from the work distribution unit 325 and manages instruction scheduling for one or more thread blocks assigned to the SM440. The scheduler unit 510 schedules thread blocks for execution as thread bundles of parallel threads, where each thread block is allocated at least one thread bundle. In one embodiment, each thread bundle executes 32 threads. The scheduler unit 510 may manage a plurality of different thread blocks, allocating the thread bundles to the different thread blocks and then dispatching instructions from the plurality of different cooperative groups to the various functional units (e.g., cores 550, SFUs 552, and LSUs 554) during each clock cycle.

Cooperative groups are a programming model for organizing groups of communicating threads that allows developers to express the granularity at which threads are communicating, enabling the expression of richer, more efficient parallel decompositions. Cooperative launch APIs support synchronization amongst thread blocks for the execution of parallel algorithms. Conventional programming models provide a single, simple construct for synchronizing cooperating threads: a barrier across all threads of a thread block (e.g., the syncthreads() function). However, programmers would often like to define groups of threads at smaller than thread block granularities and synchronize within the defined groups to enable greater performance, design flexibility, and software reuse in the form of collective group-wide function interfaces.

Cooperative groups enable programmers to define groups of threads explicitly at sub-block (e.g., as small as a single thread) and multi-block granularities, and to perform collective operations such as synchronization on the threads in a cooperative group. The programming model supports clean composition across software boundaries, so that libraries and utility functions can synchronize safely within their local context without having to make assumptions about convergence. Cooperative group primitives enable new patterns of cooperative parallelism, including producer-consumer parallelism, opportunistic parallelism, and global synchronization across an entire grid of thread blocks.

A dispatch unit 515 is configured to transmit instructions to one or more of the functional units. In this embodiment, the scheduler unit 510 includes two dispatch units 515 that enable two different instructions from the same thread bundle to be dispatched during each clock cycle. In alternative embodiments, each scheduler unit 510 may include a single dispatch unit 515 or additional dispatch units 515.

Each SM440 includes a register file 520 that provides a set of registers for the functional units of the SM 440. In one embodiment, register file 520 is divided among each functional unit such that each functional unit is assigned a dedicated portion of register file 520. In another embodiment, the register file 520 is divided among the different threads executed by the SM 440. Register file 520 provides temporary storage for operands connected to the data paths of the functional units.

Each SM440 comprises L processing cores 550. In one embodiment, the SM440 includes a large number (e.g., 128, etc.) of distinct processing cores 550. Each core 550 may include a fully-pipelined, single-precision, double-precision, and/or mixed-precision processing unit that includes a floating point arithmetic logic unit and an integer arithmetic logic unit. In one embodiment, the floating point arithmetic logic units implement the IEEE 754-2008 standard for floating point arithmetic. In one embodiment, the cores 550 include 64 single-precision (32-bit) floating point cores, 64 integer cores, 32 double-precision (64-bit) floating point cores, and 8 tensor cores.

Tensor cores are configured to perform matrix operations, and, in one embodiment, one or more tensor cores are included in the cores 550. In particular, the tensor cores are configured to perform deep learning matrix arithmetic, such as convolution operations for neural network training and inferencing. In one embodiment, each tensor core operates on a 4×4 matrix and performs a matrix multiply and accumulate operation D = A × B + C, where A, B, C, and D are 4×4 matrices.

In one embodiment, the matrix multiply inputs A and B are 16-bit floating point matrices, while the accumulation matrices C and D may be 16-bit floating point or 32-bit floating point matrices. Tensor cores operate on 16-bit floating point input data with 32-bit floating point accumulation. The 16-bit floating point multiply requires 64 operations and results in a full precision product that is then accumulated using 32-bit floating point addition with the other intermediate products for a 4×4×4 matrix multiply. In practice, tensor cores are used to perform much larger two-dimensional or higher dimensional matrix operations built up from these smaller elements. An API, such as the CUDA 9 C++ API, exposes specialized matrix load, matrix multiply and accumulate, and matrix store operations to efficiently use tensor cores from a CUDA-C++ program. At the CUDA level, the thread bundle-level interface assumes 16×16 size matrices spanning all 32 threads of the thread bundle.
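
As a functional illustration only (not a model of the hardware data path), the mixed-precision multiply-accumulate of a single tensor core can be emulated in a few lines of NumPy:

    import numpy as np

    def tensor_core_mma(a, b, c):
        # Emulate D = A x B + C on 4x4 tiles: 16-bit floating point inputs
        # are multiplied at full precision and accumulated in 32-bit
        # floating point, as described above.
        assert a.shape == b.shape == c.shape == (4, 4)
        return a.astype(np.float32) @ b.astype(np.float32) + c.astype(np.float32)

    a = np.random.rand(4, 4).astype(np.float16)
    b = np.random.rand(4, 4).astype(np.float16)
    d = tensor_core_mma(a, b, np.zeros((4, 4), np.float16))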

Each SM440 also comprises M SFUs 552 that perform special functions (e.g., attribute evaluation, inverse square root, and the like). In one embodiment, the SFUs 552 may include a tree traversal unit configured to traverse a hierarchical tree data structure. In one embodiment, the SFUs 552 may include a texture unit configured to perform texture map filtering operations. In one embodiment, the texture units are configured to load texture maps (e.g., a 2D array of texels) from the memory 304 and sample the texture maps to produce sampled texture values for use in shader programs executed by the SM440. The texture maps are stored in the shared memory/L1 cache 570. The texture units implement texture operations, such as filtering operations, using texture maps of varying levels of detail. In one embodiment, each SM440 includes two texture units.

Each SM440 also comprises N LSUs 554 that implement load and store operations between the shared memory/L1 cache 570 and the register file 520. Each SM440 includes an interconnect network 580 that connects each of the functional units to the register file 520 and connects the LSU554 to the register file 520 and the shared memory/L1 cache 570. In one embodiment, the interconnect network 580 is a crossbar that can be configured to connect any of the functional units to any of the registers in the register file 520, and to connect the LSUs 554 to the register file 520 and to memory locations in the shared memory/L1 cache 570.

The shared memory/L1 cache 570 is an array of on-chip memory that allows for data storage and communication between the SM440 and the primitive engine 435, as well as between threads in the SM440. In one embodiment, the shared memory/L1 cache 570 comprises 128KB of storage capacity and is in the path from the SM440 to the memory partition unit 380. The shared memory/L1 cache 570 can be used to cache reads and writes. One or more of the shared memory/L1 cache 570, L2 cache 460, and memory 304 are backing stores.

Combining data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The capacity is usable as a cache by programs that do not use shared memory. For example, if the shared memory is configured to use half of the capacity, texture and load/store operations can use the remaining capacity. Integration within the shared memory/L1 cache 570 enables the shared memory/L1 cache 570 to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data.

When configured for general purpose parallel computation, a simpler configuration can be used compared with graphics processing. Specifically, the fixed function graphics processing units shown in FIG. 3 are bypassed, creating a much simpler programming model. In the general purpose parallel computation configuration, the work distribution unit 325 assigns and distributes blocks of threads directly to the DPCs 420. The threads in a block execute the same program, using a unique thread ID in the calculation to ensure that each thread generates unique results, using the SM440 to execute the program and perform calculations, the shared memory/L1 cache 570 to communicate between threads, and the LSU554 to read and write global memory through the shared memory/L1 cache 570 and the memory partition unit 380. When configured for general purpose parallel computation, the SM440 can also write commands that the scheduler unit 320 can use to launch new work on the DPCs 420.

The PPU300 may be included in a desktop computer, a laptop computer, a tablet computer, a server, a supercomputer, a smartphone (e.g., a wireless, handheld device), a personal digital assistant (PDA), a digital camera, a vehicle, a head mounted display, a handheld electronic device, and the like. In one embodiment, the PPU300 is embodied on a single semiconductor substrate. In another embodiment, the PPU300 is included in a system-on-a-chip (SoC) along with one or more other devices, such as additional PPUs 300, the memory 304, a reduced instruction set computer (RISC) CPU, a memory management unit (MMU), a digital-to-analog converter (DAC), and the like.

In one embodiment, the PPU300 may be included on a graphics card that includes one or more memory devices. The graphics card may be configured to interface with a PCIe slot on a motherboard of a desktop computer. In yet another embodiment, the PPU300 may be an integrated graphics processing unit (iGPU) or a parallel processor included in the chipset of a motherboard.

Exemplary computing System

Systems with multiple GPUs and CPUs are used in a variety of industries as developers expose and leverage more parallelism in applications such as artificial intelligence computing. High-performance GPU-accelerated systems with tens of thousands of compute nodes have been deployed in data centers, research facilities, and supercomputers to solve ever larger problems. As the number of processing devices within high-performance systems increases, the communication and data transfer mechanisms need to scale to support the increased bandwidth.

FIG. 5B is a conceptual diagram of a processing system 500 implemented using the PPU300 of FIG. 3, according to one embodiment. The processing system 500 may be configured to implement the method 115 shown in FIG. 1C and/or the method 225 shown in FIG. 2D. The processing system 500 includes a CPU 530, a switch 510, and multiple PPUs 300, each with a corresponding memory 304. The NVLink310 provides high-speed communication links between each of the PPUs 300. Although a particular number of NVLink310 and interconnect 302 connections are illustrated in FIG. 5B, the number of connections to each PPU300 and the CPU 530 may vary. The switch 510 interfaces between the interconnect 302 and the CPU 530. The PPUs 300, memories 304, and NVLinks 310 may be situated on a single semiconductor platform to form a parallel processing module 525. In one embodiment, the switch 510 supports two or more protocols to interface between various different connections and/or links.

In another embodiment (not shown), the NVLink310 provides one or more high-speed communication links between each of the PPUs 300 and the CPU 530, and the switch 510 interfaces between the interconnect 302 and each of the PPUs 300. The PPUs 300, memories 304, and interconnect 302 may be situated on a single semiconductor platform to form a parallel processing module 525. In yet another embodiment (not shown), the interconnect 302 provides one or more communication links between each of the PPUs 300 and the CPU 530, and the switch 510 interfaces between each of the PPUs 300 using the NVLink310 to provide one or more high-speed communication links between the PPUs 300. In another embodiment (not shown), the NVLink310 provides one or more high-speed communication links between the PPUs 300 and the CPU 530 through the switch 510. In yet another embodiment (not shown), the interconnect 302 provides one or more communication links between each of the PPUs 300 directly. One or more of the NVLink310 high-speed communication links may be implemented as a physical NVLink interconnect or as an on-chip or on-die interconnect using the same protocol as the NVLink310.

In the context of this specification, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit fabricated on a die or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation and make substantial improvements over utilizing a conventional bus implementation. Of course, the various circuits or devices may also be situated separately or in various combinations of semiconductor platforms per the desires of the user. Alternatively, the parallel processing module 525 may be implemented as a circuit board substrate, and each of the PPUs 300 and/or memories 304 may be packaged devices. In one embodiment, the CPU 530, switch 510, and the parallel processing module 525 are situated on a single semiconductor platform.

In one embodiment, the signaling rate of each NVLink310 is 20 to 25 gigabits/second, and each PPU300 includes six NVLink310 interfaces (as shown in FIG. 5B, five NVLink310 interfaces are included for each PPU300). Each NVLink310 provides a data transfer rate of 25 gigabytes/second in each direction, with six links providing 300 gigabytes/second. When the CPU 530 also includes one or more NVLink310 interfaces, the NVLinks 310 may be used exclusively for PPU-to-PPU communication as shown in FIG. 5B, or for some combination of PPU-to-PPU and PPU-to-CPU communication.

In one embodiment, the NVLink310 allows direct load/store/atomic access from the CPU 530 to the memory 304 of each PPU300. In one embodiment, the NVLink310 supports coherency operations, allowing data read from the memories 304 to be stored in the cache hierarchy of the CPU 530, reducing cache access latency for the CPU 530. In one embodiment, the NVLink310 includes support for Address Translation Services (ATS), allowing the PPU300 to directly access page tables within the CPU 530. One or more of the NVLinks 310 may also be configured to operate in a low-power mode.

Fig. 5C illustrates an exemplary system 565 in which the various architectures and/or functionalities of the various previous embodiments may be implemented. Exemplary system 565 may be configured to implement method 115 shown in fig. 1C and/or method 225 shown in fig. 2D.

As shown, a system 565 is provided, the system 565 including at least one central processing unit 530 coupled to a communication bus 575. The communication bus 575 may be implemented using any suitable protocol, such as PCI (peripheral component interconnect), PCI-Express, AGP (accelerated graphics Port), HyperTransport, or any other bus or point-to-point communication protocol. The system 565 also includes a main memory 540. Control logic (software) and data are stored in main memory 540, which may take the form of Random Access Memory (RAM).

The system 565 also includes input devices 560, the parallel processing system 525, and display devices 545, e.g., a conventional CRT (cathode ray tube), LCD (liquid crystal display), LED (light emitting diode), plasma display, or the like. User input may be received from the input devices 560, e.g., a keyboard, mouse, touchpad, microphone, and the like. Each of the foregoing modules and/or devices may even be situated on a single semiconductor platform to form the system 565. Alternatively, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.

Further, the system 565 may be coupled to a network (e.g., a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the internet, a peer-to-peer network, a cable network, or the like) through the network interface 535 for communication purposes.

The system 565 may also include secondary storage (not shown). The secondary storage includes, for example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, a digital versatile disk (DVD) drive, a recording device, or universal serial bus (USB) flash memory. The removable storage drive reads from and/or writes to a removable storage unit in a well-known manner.

Computer programs, or computer control logic algorithms, may be stored in the main memory 540 and/or the secondary storage. Such computer programs, when executed, enable the system 565 to perform various functions. The main memory 540, the secondary storage, and/or any other storage are possible examples of computer-readable media.

The architecture and/or functionality of the various previous figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system, and/or any other desired system. For example, the system 565 may take the form of a desktop computer, a laptop computer, a tablet computer, a server, a supercomputer, a smartphone (e.g., a wireless, handheld device), a personal digital assistant (PDA), a digital camera, a vehicle, a head mounted display, a handheld electronic device, a mobile phone device, a television, a workstation, a game console, an embedded system, and/or any other type of logic.

While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Graphics processing pipeline

In one embodiment, PPU300 includes a Graphics Processing Unit (GPU). PPU300 is configured to receive commands that specify shader programs for processing graphics data. Graphics data may be defined as a set of primitives, such as points, lines, triangles, quadrilaterals, triangle strips, and so forth. Typically, a primitive includes data (e.g., in a model space coordinate system) that specifies a plurality of vertices for the primitive and attributes associated with each vertex of the primitive. PPU300 may be configured to process graphics primitives to generate a frame buffer (e.g., pixel data for each pixel of a display).

An application writes model data for a scene (e.g., a collection of vertices and attributes) to a memory, such as system memory or the memory 304. The model data defines each of the objects that may be visible on the display. The application then makes API calls to the driver kernel to request that the model data be rendered and displayed. The driver kernel reads the model data and writes commands to one or more streams to perform operations to process the model data. The commands may reference different shader programs to be implemented on the SMs 440 of the PPU300, including one or more of a vertex shader, a hull shader, a domain shader, a geometry shader, and a pixel shader. For example, one or more of the SMs 440 may be configured to execute a vertex shader program that processes a number of vertices defined by the model data. In one embodiment, different SMs 440 may be configured to execute different shader programs concurrently. For example, a first subset of SMs 440 may be configured to execute a vertex shader program, while a second subset of SMs 440 may be configured to execute a pixel shader program. The first subset of SMs 440 processes the vertex data to produce processed vertex data and writes the processed vertex data to the L2 cache 460 and/or the memory 304. After the processed vertex data is rasterized (e.g., transformed from three-dimensional data into two-dimensional data in screen space) to produce fragment data, the second subset of SMs 440 executes a pixel shader to produce processed fragment data, which is then blended with other processed fragment data and written to the frame buffer in the memory 304. The vertex shader program and the pixel shader program may execute concurrently, processing different data from the same scene in a pipelined fashion until all of the model data for the scene has been rendered to the frame buffer. Then, the contents of the frame buffer are transmitted to a display controller for display on a display device.

Images generated using one or more of the techniques disclosed herein may be displayed on a monitor or other display device. In some embodiments, the display device may be coupled directly to the system or processor generating or rendering the images. In other embodiments, the display device may be coupled indirectly to the system or processor, for example, via a network. Examples of such networks include the internet, mobile telecommunications networks, WIFI networks, as well as any other wired and/or wireless networking systems. When the display device is indirectly coupled, the images generated by the system or processor may be streamed over the network to the display device. Such streaming allows, for example, video games or other applications that render images to be executed on a server or in a data center, and the rendered images to be transmitted and displayed on one or more user devices (e.g., computers, video game consoles, smartphones, other mobile devices, etc.) that are physically separate from the server or data center. Hence, the techniques disclosed herein can be applied to enhance the images that are streamed and to enhance services that stream images, such as NVIDIA GeForce Now (GFN), Google Stadia, and the like.

Machine learning

Deep Neural Networks (DNNs) developed on processors, such as the PPU300, have been used for diverse use cases, from self-driving cars to faster drug development, from automatic image captioning in image databases to smart real-time language translation in video chat applications. Deep learning is a technique that models the neural learning process of the human brain, continually learning, continually getting smarter, and delivering more accurate results more quickly over time. Just as a child is initially taught by an adult to correctly identify and classify various shapes, and is eventually able to identify shapes without any coaching, a deep learning or neural learning system needs to be trained in object recognition and classification so that it becomes smarter and more efficient at identifying basic objects, occluded objects, etc., while also assigning context to the objects.

At the simplest level, neurons in the human brain look at various inputs received, assign an importance level to each of these inputs, and pass the output to other neurons to operate on them. Artificial neurons are the most basic model of neural networks. In one example, a neuron may receive one or more inputs representing various features of an object being trained by the neuron to recognize and classify, and assign a weight to each of the features based on the importance of the feature in defining the shape of the object.

A Deep Neural Network (DNN) model includes many layers of connected nodes (e.g., neurons, Boltzmann machines, radial basis functions, convolutional layers, etc.) that can be trained with enormous amounts of input data to quickly solve complex problems with high accuracy. In one example, a first layer of the DNN model breaks down an input image of an automobile into various sections and looks for basic patterns such as lines and angles. The second layer assembles the lines to look for higher-level patterns such as wheels, windshields, and mirrors. The next layer identifies the type of vehicle, and the final few layers generate a label for the input image, identifying the model of a specific automobile brand.

Once the DNNs are trained, they may be deployed and used to identify and classify objects or patterns in a process called reasoning. Examples of reasoning (the process by which DNN extracts useful information from a given input) include recognizing handwritten numbers on checks deposited into ATM machines, recognizing friend images in photographs, providing movie recommendations to more than fifty million users, recognizing and classifying cars, pedestrians, and road hazards in different types of unmanned cars, or translating human speech in real time.

During training, data flows through the DNN in a forward propagation phase until a prediction is produced that indicates a label corresponding to the input. If the neural network does not correctly label the input, errors between the correct label and the predicted label are analyzed, and the weights are adjusted for each feature during a backward propagation phase until the DNN correctly labels the input and the other inputs in the training dataset. Training complex neural networks requires massive amounts of parallel computing performance, including the floating point multiplications and additions supported by the PPU300. Inferencing is less compute-intensive than training; it is a latency-sensitive process in which a trained neural network is applied to new inputs it has not seen before to classify images, translate speech, and generally infer new information.
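
As a toy illustration of this forward propagation, error analysis, and backward weight adjustment (the sizes, data, and learning rate are arbitrary assumptions), one training step of a minimal linear model may be sketched in NumPy as:

    import numpy as np

    def train_step(w, x, labels, lr=0.1):
        # One forward pass, error computation, and weight update.
        pred = x @ w                   # forward propagation phase
        err = pred - labels            # error between prediction and label
        grad = x.T @ err / len(x)      # back propagation (MSE gradient)
        return w - lr * grad           # adjust the weights

    x = np.random.rand(8, 3)
    labels = x @ np.array([[1.0], [2.0], [3.0]])   # synthetic targets
    w = np.zeros((3, 1))
    for _ in range(200):               # repeat until the model fits
        w = train_step(w, x, labels)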

Neural networks rely heavily on matrix mathematics, and complex multi-layer networks require a large amount of floating point performance and bandwidth for efficiency and speed. The PPU300 has thousands of processing cores, is optimized for matrix mathematical operations, provides performance of tens to hundreds of TFLOPS, and is a computing platform capable of providing performance required for artificial intelligence and machine learning applications based on deep neural networks.

Further, images generated using one or more of the techniques disclosed herein may be used to train, test, or certify DNNs for identifying objects and environments in the real world. Such images may include scenes of roads, factories, buildings, urban environments, rural environments, humans, animals, and any other physical object or real environment. Such images may be used to train, test, or certify DNNs used in machines or robots to manipulate, process, or modify physical objects in the real world. Further, such images may be used to train, test, or certify DNNs used in autonomous vehicles to navigate and move vehicles in the real world. Additionally, images generated using one or more of the techniques disclosed herein may be used to convey information to users of such machines, robots, and vehicles.

Note that the techniques described herein may be embodied in executable instructions stored in a computer-readable medium for use by or in connection with a processor-based instruction execution machine, system, apparatus, or device. Those skilled in the art will appreciate that for some embodiments various types of computer readable media may be included to store data. As used herein, "computer-readable medium" includes one or more of any suitable medium for storing executable instructions of a computer program, such that the instruction execution machine, system, apparatus, or device can read (or retrieve) the instructions from the computer-readable medium and execute the instructions for performing the described embodiments. Suitable storage formats include one or more of electronic, magnetic, optical, and electromagnetic formats. A non-exhaustive list of conventional exemplary computer readable media includes: portable computer diskette, Random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory devices, optical storage devices including portable Compact Discs (CD), portable Digital Video Discs (DVD), and the like.

It is to be understood that the arrangement of components shown in the figures is for illustrative purposes and that other arrangements are possible. For example, one or more elements described herein may be implemented in whole or in part as electronic hardware components. Other elements may be implemented in software, hardware, or a combination of software and hardware. Also, some or all of these other elements may be combined, some other elements may be omitted entirely, and additional components may be added while still achieving the functionality described herein. Accordingly, the subject matter described herein can be embodied in a number of different variations, and all such variations are considered to be within the scope of the claims.

To facilitate an understanding of the subject matter described herein, many aspects are described in terms of sequences of actions. Those skilled in the art will recognize that various actions can be performed by specialized circuits or circuitry, by program instructions being executed by one or more processors, or by a combination of both. The description herein of any sequence of actions is not intended to imply that a particular order must be followed in order for the sequence to be performed. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context.

The use of the terms "a" and "an" and "the" and similar references in the context of describing the subject matter (particularly in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The term "at least one" followed by a list of one or more items (e.g., "at least one of A and B") is to be construed to mean one item selected from the listed items (A or B) or any combination of two or more of the listed items (A and B), unless otherwise indicated herein or clearly contradicted by context. Furthermore, the foregoing description is for the purpose of illustration only, and not for the purpose of limitation, as the scope of protection sought is defined by the claims as set forth hereinafter together with any equivalents thereof. The use of any and all examples, or exemplary language (e.g., "such as") provided herein, is intended merely to better illustrate the subject matter and does not pose a limitation on the scope of the subject matter unless otherwise claimed. The use of the term "based on" and other similar phrases, in both the claims and the written description, indicates a condition for bringing about a result and is not intended to foreclose any other conditions that bring about that result. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the claimed invention.
