Techniques to configure a processor to efficiently execute instructions

Document No.: 1464838    Publication date: 2020-02-21

Note: This technique, "Techniques to configure a processor to efficiently execute instructions," was designed and created on 2019-08-08 by A. Naik, T. Raja, K. Wilde, R. Selvasan, D. Ramakrishnan, D. Rodriguez, and B. Falk. Its main content is as follows: The present invention provides a technique for configuring a processor to efficiently execute instructions. In particular, systems and techniques are provided for improving circuit performance while accommodating dynamic voltage drops caused by executing noisy instructions (e.g., high power consumption instructions). Performance is improved by selectively slowing down the operating frequency for each type of noisy instruction. An exemplary technique controls a clock by detecting an instruction that is predicted to be of a predetermined noisy type having predefined noise characteristics (e.g., a high level of noise generated on a voltage rail of the circuit due to the greater amount of current consumed by the instruction), and, in response to the detection, reducing the frequency of the clock. The detection occurs prior to execution of the instruction. Changing the frequency according to the instruction type enables the circuit to operate at a high frequency even though some workloads include instructions for which the operating frequency is slowed.

1. A method of controlling a clock of a processor, comprising:

detecting an instruction predicted to have a predefined noise characteristic, the detecting occurring prior to execution of the instruction by the processor; and

changing a frequency of the clock in response to the detecting.

2. The method of claim 1, wherein the changing the frequency comprises: reducing the frequency by an offset determined to compensate for a predicted voltage drop corresponding to the predefined noise characteristic.

3. The method of claim 2, further comprising:

executing the instruction at least partially when the clock is operating at the reduced frequency; and

raising the frequency of the clock after the executing.

4. The method of claim 3, further comprising subjecting the raising of the frequency to a hysteresis process.

5. The method of claim 4, wherein the reducing is performed without undergoing a hysteresis process.

6. The method of claim 2, wherein the offset is determined by a programmable mapping of instructions to frequency offsets.

7. The method of claim 1, wherein the detecting comprises: determining that the type of the instruction matches a type in a predetermined set of instruction types.

8. The method of claim 7, wherein the predetermined set of instruction types includes at least one high power instruction type predicted to consume more than a threshold power level.

9. The method of claim 7, wherein the predetermined set of instruction types includes at least two groups of instruction types, each of the two groups having a mapping to a respective frequency offset,

wherein the detecting further comprises determining to which of the at least two groups the instruction belongs, and

wherein the changing further comprises adjusting the frequency according to the respective frequency offset mapped to the determined one of the at least two groups.

10. The method of claim 9, wherein at least one of the composition of the at least two groups or the respective frequency offsets is programmable.

11. The method of claim 1, further comprising:

configuring the clock to operate at a default frequency, and

wherein the detecting occurs while the clock is operating at the default frequency, wherein the default frequency is determined according to other instructions predicted to have a lower noise level than the detected instruction.

12. The method of claim 1, wherein the detecting occurs during decoding of the instruction and the changing occurs before the processor executes the decoded instruction.

13. The method of claim 12, further comprising:

in response to the detecting, stalling the instruction prior to execution; and

unstalling the instruction in response to the frequency being changed or a timer signal.

14. The method of claim 1, wherein changing the frequency of the clock comprises: changing a frequency of a clock source within a predetermined fixed number of clock cycles after receiving a clock change signal at the clock source.

15. The method of claim 14, wherein changing the frequency of the clock further comprises: identifying a target frequency from a look-up table, and setting the clock source to the target frequency.

16. The method of claim 15, wherein changing the frequency of the clock further comprises: adjusting the frequency according to a noise-aware frequency-locked loop technique.

17. A system comprising processing circuitry and a clock generator that controls an operating frequency of the processing circuitry, the processing circuitry configured to:

detecting an instruction predicted to have a predefined noise characteristic, the detecting occurring prior to execution of the instruction in the processor; and

in response to the detecting, signaling the clock generator to change a frequency of the clock generator.

18. The system of claim 17, further comprising a mapping from a type of the instruction to a frequency offset determined to compensate for a predicted voltage drop corresponding to the predefined noise characteristic, wherein the signaling comprises including the frequency offset in the signaling.

19. The system of claim 17, wherein the detecting is performed during decoding of the instruction, and the processing circuitry is further configured to execute the instruction after the changing the frequency.

20. The system of claim 17, wherein the detecting comprises: determining that the type of the instruction matches a type in a predetermined set of instruction types.

21. The system of claim 20, further comprising at least first and second signaling paths between the processing circuitry and the clock generator, wherein the predetermined set of instruction types includes at least first and second groups of instruction types associated with the first and second signaling paths, respectively, each of the two groups having a mapping to a respective frequency offset,

wherein the detecting further comprises determining to which of the at least two groups the instruction belongs, and

wherein the changing further comprises adjusting the frequency according to the respective frequency offset mapped to the determined one of the at least two groups.

22. The system of claim 17, wherein changing the frequency comprises: changing the frequency of a clock source within a predetermined fixed number of clock cycles after receiving a clock change signal at the clock generator.

23. The system of claim 22, wherein changing the frequency further comprises: identifying a target frequency from a look-up table, and setting the clock source to the target frequency.

24. The system of claim 23, wherein changing the frequency further comprises: adjusting the frequency according to a Noise Aware Frequency Locked Loop (NAFLL) technique.

25. The system of claim 17, wherein the processing circuitry comprises an instruction pipeline configured to prevent execution of the instruction until a predetermined signal is received from the clock generator or a timer.

26. The system of claim 17, wherein the system further comprises a plurality of processor cores comprising the processing circuitry, wherein each of the processor cores receives the frequency from the clock generator, and wherein the clock generator is configured to:

receive signals from one or more of the plurality of processor cores, and

vary the frequency in accordance with a selected subset of the received signals.

27. The system of claim 17, wherein the processing circuitry and the clock generator are formed on an integrated circuit.

Technical Field

The present disclosure relates to power management in processors, and more particularly, to varying clock speed and/or power based on the type of instruction being executed. More particularly, the present disclosure relates to dynamically controlling the clock frequency, in a manner that accommodates the noise and consequent voltage drops associated with executing certain types of instructions.

Background

Modern processors are very complex, containing millions of transistors on a single substrate. Although each individual transistor consumes only a small amount of power, the millions of transistors on the substrate together consume a large amount of power. This power consumption generates heat and also draws a large amount of power from the power supply. Power delivery is therefore an important aspect of modern processor design.

The wiring for delivering power on the processor substrate forms a power delivery network. The power delivery network includes, for example, power rails for supplying current of appropriate voltage to different components on the processor chip. Power delivery networks affect many performance aspects, such as the amount of noise on the voltage rails. The term "noise" as used in this document refers to voltage transients that occur on the voltage rails. While the power supply may be designed to be almost noise free, the processor itself may generate noise on its internal power delivery network. For example, executing a high-power instruction (which switches many transistors on the chip simultaneously or nearly simultaneously) can result in transient voltage fluctuations on the voltage rails. Excessive noise on the voltage rails can cause various circuit operational failures, such as the processor not being able to obtain the power (current and/or voltage) required for its operating frequency, resulting in timing errors, erroneous switching, and the like.

In more detail, one or more voltage rails are powered from a power source (such as a power supply to the processor). A voltage regulator typically controls the voltage on each voltage rail to one or more pads of the processor (or an integrated circuit containing the processor) according to the power requirements of the various components of the processor. Metal vias and traces distribute the power received at the pads throughout the processor (and/or integrated circuit). One or more voltage regulators may dynamically raise or lower the supply voltage according to changes in the workload, and thus the power requirements, of the processor. The nominally constant voltage of the rail supplying the processing core is called "Vdd".

As processors become faster and smaller, the design of the power delivery network within the chip becomes more complex. Generally, the faster a processor runs, the more power it consumes. Conversely, slowing down a processor reduces the amount of power required by its various functional units. The standard characterization of processor power consumption P makes this relationship between power and frequency clear: P = C·V²·f, where C, V, and f represent the switched capacitance of the processor, the voltage seen by the processor, and the operating frequency of the processor, respectively.
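By way of illustration only, a minimal sketch of this relationship follows; the capacitance value is hypothetical, and the two frequencies are the 1886 MHz / 1871 MHz figures discussed later in this disclosure, used here purely as example inputs:

```python
# Minimal sketch of the dynamic power relationship P = C * V^2 * f.
# The switched capacitance below is hypothetical, chosen only to show how
# power scales with frequency and voltage.

def dynamic_power(c_switched_farads, v_volts, f_hertz):
    """Dynamic switching power in watts: P = C * V^2 * f."""
    return c_switched_farads * v_volts ** 2 * f_hertz

C = 50e-9         # 50 nF effective switched capacitance (hypothetical)
V = 0.9           # 900 mV supply
F_FULL = 1.886e9  # full-speed clock, Hz
F_SLOW = 1.871e9  # slowed clock, Hz

p_full = dynamic_power(C, V, F_FULL)
p_slow = dynamic_power(C, V, F_SLOW)
print(f"Power at full speed:  {p_full:.2f} W")
print(f"Power at slowed clock: {p_slow:.2f} W ({100 * (1 - p_slow / p_full):.2f}% lower)")
```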

The speed at which a circuit operates is affected by the voltage at which it is driven. Voltage fluctuations (noise) on the voltage rails can cause the voltage seen by one or more processor circuits to vary from the nominal voltage. The main factor affecting the noise amplitude is the workload that the circuit (or other circuits on the chip) is handling. In particular, the dynamic resistive drop (also referred to as "IR drop") component of the noise is affected by the current I drawn by a particular instruction when executed. Noise has two main components: di/dt and IR drop. Di/dt noise depends on the rate of change of current over time, while IR drop depends only on the magnitude of the current. Noise-aware frequency-locked loop (NAFLL) techniques can be used to account for the di/dt component of noise. The techniques described in this disclosure are primarily directed to IR drop.
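For a rough sense of the two components, a hedged sketch follows; the rail resistance, loop inductance, and current figures are invented for illustration and do not describe any particular chip:

```python
# Rough sketch of the two supply-noise components described above:
#   IR drop     = I * R        (depends only on the current magnitude)
#   di/dt droop = L * dI/dt    (depends on how fast the current changes)
# All resistance, inductance, and current numbers are hypothetical.

R_PDN = 0.002   # 2 milliohm effective rail resistance (hypothetical)
L_PDN = 5e-12   # 5 pH effective loop inductance (hypothetical)

def rail_noise(i_amps, delta_i_amps, dt_seconds):
    ir_drop = i_amps * R_PDN
    didt_droop = L_PDN * (delta_i_amps / dt_seconds)
    return ir_drop, didt_droop

# A hypothetical "noisy" instruction drawing 40 A, ramping by 20 A in 0.5 ns:
ir, didt = rail_noise(40.0, 20.0, 0.5e-9)
print(f"IR drop:     {ir * 1e3:.1f} mV")
print(f"di/dt droop: {didt * 1e3:.1f} mV")
```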

Different instructions result in different numbers of transistors switching, consuming different amounts of current, and thus causing different amounts of IR drop on the voltage rails. By way of a rough analogy, consider a muggy day in the middle of a summer workday, with the thermometer reaching 100 degrees Fahrenheit (38 degrees Celsius). When everyone in a city goes home from work, the first thing they do is turn on the air conditioner to cool their living space. The grid may not be designed to handle such a high peak current draw, resulting in a "brownout", i.e., voltage fluctuations that deviate from the nominal 110 VAC that the grid is designed to deliver. Now consider this in microscopic detail on a modern processor chip, where millions of transistors switch thousands of times per second. This is like synchronously turning the air conditioners of hundreds of cities on and off many times per second. The resulting power fluctuations would generate a great deal of noise on the cities' power transmission network.

As processors become smaller and more complex, the increase in power density between circuits may further exacerbate problems associated with noise generated on the voltage rails by executing certain instructions.

In some processors, such as Nvidia Graphics Processing Units (GPUs) from the past few years, the operating frequency (i.e., speed) at a given voltage Vdd is set conservatively so that even the noisiest expected workload will result in proper operation. In addition, features such as NAFLL have been implemented to address the effects of noise on the power distribution network by tracking noise on the voltage rail and dynamically changing the clock frequency in response. In particular, when a voltage drop is detected on the voltage rail, the NAFLL mechanism reduces the clock frequency (i.e., the rate at which the processor's transistors switch) to allow the transistors more time to operate properly and ensure that there are no circuit failures. This is effective because a transistor requires a higher voltage to switch faster; reducing the switching rate enables the transistor to operate correctly at a lower, reduced voltage level. When the drop disappears, the clock frequency can be increased. For example, U.S. Patent Publication No. 2018/0123604 describes an example NAFLL that follows this principle.
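For contrast with the predictive approach introduced later, the following is a simplified behavioral sketch of a reactive, NAFLL-style control loop that senses the rail and only then adjusts the clock; it is not the implementation of the cited publication, and the thresholds, step size, and frequency bounds are assumptions:

```python
# Simplified behavioral sketch of a *reactive* NAFLL-style loop: the clock is
# adjusted only after a droop is sensed on the voltage rail.  Thresholds,
# step size, and frequency bounds are hypothetical.

NOMINAL_V = 0.900        # volts
DROOP_THRESHOLD_V = 0.015
F_MAX_MHZ = 1886
F_MIN_MHZ = 1800
F_STEP_MHZ = 15

def reactive_clock_control(sensed_rail_voltages):
    freq = F_MAX_MHZ
    frequencies = []
    for v in sensed_rail_voltages:
        if NOMINAL_V - v > DROOP_THRESHOLD_V:
            freq = max(F_MIN_MHZ, freq - F_STEP_MHZ)  # slow down only after the droop is seen
        else:
            freq = min(F_MAX_MHZ, freq + F_STEP_MHZ)  # recover once the droop clears
        frequencies.append(freq)
    return frequencies

print(reactive_clock_control([0.900, 0.899, 0.880, 0.878, 0.895, 0.901]))
```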

However, as processors continue to become smaller, faster, and more complex, techniques (e.g., NAFLL) may not fully address the problems and inefficiencies associated with noise on the voltage rail. Therefore, further improvements are desired to adapt the processor to the effects of noise of different instruction types.

Disclosure of Invention

The illustrative embodiments correct some of the deficiencies of the above-described techniques for addressing noise-related voltages present in a processor. Systems and methods for controlling the operation of a processor in a manner that accommodates noise on a voltage rail are disclosed.

The illustrative embodiments provide a method of controlling a clock of a processor. The method includes detecting an instruction predicted to have a predefined noise characteristic, and in response to the detecting, changing a frequency of a clock. The detection may occur prior to execution of the instructions by the processor.

According to another exemplary embodiment, a system is provided that includes a processing circuit and a clock generator that controls an operating frequency of the processing circuit. The processing circuit is configured to detect an instruction predicted to have a predefined noise characteristic and, in response to the detection, signal the clock generator to change a frequency of the clock generator. This detection may occur prior to execution of the instruction in the processing circuitry.

Drawings

FIG. 1A illustrates a block diagram of a processing system including a fast clock slow down circuit, according to some example embodiments.

FIG. 1B illustrates a block diagram of a processor implementing fast clock slowing in accordance with certain example embodiments.

FIG. 1C is another example processor implementing fast clock slowing according to some example embodiments.

Fig. 2A illustrates a process for performing fast clock slowing according to some example embodiments.

FIG. 2B is a timing diagram illustrating a clock slow down signal and a corresponding acknowledge signal that may be used in a processor incorporating fast clock slow down, according to some demonstrative embodiments.

Fig. 2C illustrates a change in clock frequency over time when performing fast clock slowing, according to some demonstrative embodiments.

FIG. 2D shows a timing diagram illustrating stopping execution of instructions during fast clock slow down, according to some example embodiments.

FIG. 3 illustrates a parallel processing unit according to one embodiment.

FIG. 4A illustrates a general processing cluster within the parallel processing unit of FIG. 3, according to one embodiment.

FIG. 4B illustrates a memory partition unit of the parallel processing unit of FIG. 3, according to one embodiment.

FIG. 5A illustrates the streaming multiprocessor of FIG. 4A, according to one embodiment.

FIG. 5B is a conceptual diagram of a processing system implemented using the Parallel Processing Unit (PPU) of FIG. 3, according to one embodiment.

Fig. 5C illustrates an exemplary system in which the various architecture and/or functionality of the various previous embodiments may be implemented.

FIG. 6 is a conceptual diagram of a graphics processing pipeline implemented by the PPU of FIG. 3, according to one embodiment.

Detailed Description

Certain exemplary embodiments of the present invention dynamically control the operating frequency of a processor and/or integrated circuit to reduce the inefficiencies inherent in conventional techniques for handling the dynamic voltage drops caused by the noise that certain instructions generate on the power distribution network. For example, some embodiments provide for detecting noisy workloads based on instructions in the instruction pipeline and quickly slowing down the clock frequency before executing the detected noisy instructions. Further exemplary embodiments provide for detecting and responding to such noisy workloads in localized processing regions whose noise may not be detected by a technology such as NAFLL.

As mentioned in the Background section above, in some current commercial processors the operating frequency is set so that even the noisiest expected workload can be executed at a given nominal or set voltage (i.e., the frequency is reduced to compensate for the worst possible drop from the nominal voltage). However, this approach is suboptimal because it requires that a frequency margin be maintained for all other workloads that produce less noise than the worst-case (e.g., noisiest) workload. That is, by setting the frequency according to the noisiest workload, which requires a lower frequency for stable operation, even less noisy instructions are forced to execute at a reduced frequency. For example, in one commercially available processor, at Vdd = 900 mV the operating frequency of all instructions is set to 1871 MHz based on the voltage drop predicted to be caused by the noisiest instruction type, while all other instruction types could operate safely at a faster frequency of 1886 MHz, meaning that approximately 15 MHz of frequency margin remains unused when instructions other than the noisiest type are executing. For various Vdd levels, similar margins in the range of 2-26 MHz have been observed. This slows down the effective processing rate, which affects how quickly a high-performance processor can complete complex tasks.

Furthermore, in NAFLL implementations in existing processors, voltage/noise sensing is performed only at the locations on the chip where the NAFLL logic is physically located. Thus, the NAFLL may miss noise/voltage drops in other parts of the chip that are localized and do not propagate to the NAFLL sensing location. In particular, IR drop can be a highly localized phenomenon: circuits on the chip with high current density experience higher IR drops than are detected at the NAFLL. Noise that is not detected by the NAFLL, and therefore not compensated for, must still be accounted for (flagged), and has been shown to cause a frequency loss of about 3-4% for workloads that produce lower noise. Further, the response of the NAFLL occurs only after the noise/voltage drop has already happened.

The exemplary embodiments proactively slow down the clock frequency based on the type of instruction to accommodate expected or predicted noise (consisting primarily of IR drop) on the voltage rails. The exemplary embodiments are predictive and respond to a predicted noise/voltage drop before the drop occurs, in contrast to the NAFLL, which responds after the noise/voltage drop has happened. In some embodiments, the clock frequency slowing compensates for predicted noise that is not compensated for by the NAFLL. By selectively slowing one or more clocks only for instructions of the types expected to produce high levels of noise, exemplary embodiments allow the instructions producing lower noise to be clocked at a faster rate than the noisy instructions, thereby improving performance for at least the lower-noise instructions. Thus, in a typical workload where most instructions are of a lower-noise type, the illustrative embodiments may also improve the overall performance of the processor and/or the integrated circuit containing the processor.

Some exemplary embodiments take advantage of the following characteristic: the large inrush currents that cause increased noise on the voltage rails occur during instruction execution, after the instruction has been decoded. This allows one or more clocks to be modulated based on the decoded instruction, but prior to the start of execution of the decoded instruction, proactively shifting the clocks to account for the frequency loss due to noise generated during instruction execution. By rapidly modulating one or more clocks to reduce the frequency before executing certain noisy instructions, exemplary embodiments allow the processor to operate at the reduced frequency only while executing those noisy instructions.

Fast clock slow down system based on instruction type

FIG. 1A is a block diagram of a system 100 for instruction type based fast clock slowing according to some demonstrative embodiments. The system 100 is configured to proactively slow down the clock frequency based on the type of instruction in order to accommodate noise (consisting primarily of IR drop) on the voltage rail. The fast clock slowing in the exemplary embodiments is achieved through coordination between detection and signaling circuitry, actuation circuitry for clock slowing, and clock generation circuitry. In an exemplary embodiment, the processor 102 may include the detection and signaling components, and the actuation component and fast clock slow down circuit may be in clock logic 114 associated with the processor. In an exemplary embodiment where the fast clock slowing is implemented in a parallel processing unit (PPU), e.g., a Graphics Processing Unit (GPU) such as that shown in FIG. 3, the detection and signaling components may be in a streaming multiprocessor (SM) of the PPU (as shown in FIG. 4A). However, embodiments are not limited to GPUs and other parallel processing units, and may include any instruction processing circuitry that relies on a controllable clock signal.

The processor 102 receives power from one or more voltage regulators 110 through one or more power rails 112. The one or more voltage regulators 110 control the voltage from a power supply unit (not shown) according to the requirements of the processor 102. In some example embodiments, the one or more voltage regulators 110 may control the voltage supplied to the processor 102 such that one or more voltage rails to the processor have a different voltage than another rail to the processor. The power received through the voltage rails 112 is distributed to various power domains 106 of the processor, including the various functional units (e.g., Arithmetic and Logic Units (ALUs), Floating Point Units (FPUs), load/store units, branch units, dedicated execution paths, SIMD execution units, etc.) of the processing circuitry's data path 104. The data path 104 (and other components of the processor 102 or an integrated circuit that includes the processor 102) receives a clock signal 116 from one or more clock generators in the clock logic 114.

As described above, if the reduced frequency required for noisy instructions were set as the clock frequency for all instructions executing on processor 102, the unused voltage and frequency headroom during execution of less noisy instructions would result in inefficiencies. The illustrative embodiments avoid setting the clock frequency based on worst-case noisy instructions by taking advantage of the characteristic, described above, that the increase in noise occurs during instruction execution, after the instruction has been decoded in the instruction pipeline 108. This allows one or more clock signals 116 to be modulated based on a decoded noisy instruction prior to the start of execution of that instruction, proactively shifting the clock frequency to account for the voltage losses incurred due to noise during instruction execution. When an instruction reaches an execution stage in the instruction pipeline, execution of the instruction begins and it is dispatched for execution on the data path 104. More specifically, the surge of current, and thus the increase in noise, occurs when the functional units of the data path are activated according to the noisy instruction; therefore, in the exemplary embodiments the clock frequency is slowed down before the functional units in the data path begin to execute the noisy instruction.

The instruction pipeline 108 may be a pipeline of any number of pipeline stages, including separate pipeline stages for instruction decoding and instruction execution. An exemplary pipeline may have separate pipeline stages for instruction fetch, instruction decode, instruction execution, memory access, and write. The pipeline enables different parts of the processing and control circuitry to be used simultaneously by different instructions in the pipeline.

The instruction detection and signaling block 118 includes logic for monitoring the instruction pipeline 108. More specifically, block 118 includes logic for detecting when an instruction that may cause some predetermined level of noise on the voltage rails is decoded in pipeline 108. The detection may be based on an instruction type code (i.e., opcode) and/or a combination of instruction type code and operands. The types of instructions to be detected may be hard-coded and/or specified in a programmable table. When hard-coded, according to some embodiments, classification may be done using the current (I) level as a proxy for the noise level. The classification may be a relative ordering rather than an absolute one (e.g., "instruction A: 1 Amp"; "instruction B: 2 Amp"; "instruction C: 2.1 Amp"; "instruction D: 3 Amp"). Assuming a constant resistance (R), I can be considered a good proxy for IR. Thus, for example, instruction A may be classified in a low power group, instructions B and C may be classified in a medium power group, and instruction D may be classified in a high power group. The specific amount of clock slow down required for each group can be characterized on silicon and programmed into a slow down table. Thus, as is clear from the above description, in some embodiments knowledge of the actual noise of each instruction is not necessary for the grouping.
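A minimal sketch of this kind of grouping, using the ampere figures from the example above; the group thresholds and slow down amounts are hypothetical stand-ins for values that would be characterized on silicon:

```python
# Sketch of grouping instruction types by representative current draw (used
# as a proxy for IR drop) and mapping each group to a programmable slow down.
# Instruction names and ampere figures follow the example in the text above;
# the thresholds and slow down fractions are hypothetical.

REPRESENTATIVE_CURRENT_A = {
    "instruction_A": 1.0,
    "instruction_B": 2.0,
    "instruction_C": 2.1,
    "instruction_D": 3.0,
}

GROUP_THRESHOLDS_A = {"low": 1.5, "medium": 2.5}              # above "medium" => "high"
SLOW_DOWN_TABLE = {"low": 0.0, "medium": 0.02, "high": 0.04}  # fraction of clock frequency

def noise_group(instr):
    i = REPRESENTATIVE_CURRENT_A[instr]
    if i <= GROUP_THRESHOLDS_A["low"]:
        return "low"
    if i <= GROUP_THRESHOLDS_A["medium"]:
        return "medium"
    return "high"

def slow_down_fraction(instr):
    return SLOW_DOWN_TABLE[noise_group(instr)]

for name in REPRESENTATIVE_CURRENT_A:
    print(name, noise_group(name), slow_down_fraction(name))
```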

Note that while the exemplary embodiments detect instructions that may cause some level of noise, the embodiments compensate for the frequency loss incurred when running high power instructions. The frequency loss may also be due to jitter and/or other second-order components, but most of the frequency loss may be due to noise.

The instruction detection and signaling block 118 may also include logic for signaling the clock logic 114 in response to detecting a predetermined instruction type in the pipeline 108. For example, the signaling to the clock logic 114 may include asserting clock slow down signal 120 in response to detecting a noisy instruction in pipeline 108.

Additionally, in at least some embodiments, the instruction detection and signaling block 118 may also include interlock logic to stall the pipeline 108 until the clock is slowed down in response to detecting a noisy instruction in the pipeline.

Although FIG. 1A shows clock logic 114 as being external to processor 102, embodiments are not so limited. Example embodiments may include various combinations of processing circuitry and clock logic, e.g., on the same die, with a single clock per processor, or a single clock shared by multiple processors, wherein the communication between the processing circuitry and the clock logic is fast and efficient enough that a clock slow down can be requested and take effect during the time interval between an instruction's decoding and the start of its execution.

FIG. 1B illustrates a block diagram of a processor 130, according to some demonstrative embodiments. According to some embodiments, processor 130 may be similar to processor 102, but with the clock logic located on the same die as the processor. As in FIG. 1A, the processor 130 also receives power through one or more voltage rails controlled by one or more voltage regulators; neither the voltage rails nor the voltage regulators are shown separately in FIG. 1B.

Processor 130 includes one or more processing cores 132 (e.g., core a and core B), and clock logic 133. Clock logic 133 provides clock signal 158 to control the operating frequency of one or more processing cores 132.

The noisy instruction detect and stall block 136 is configured to detect certain predetermined instruction types 141 among the instructions 140 in the instruction pipeline 134 prior to the execution stage of the pipeline, where the instructions are executed by the one or more processor cores 132. The predetermined instruction types 141 to be detected (also referred to as "noisy instruction types") may be hard-coded and/or specified in a programmable table. According to some embodiments, the instructions detected for clock slowing comprise certain high power instructions. That is, the detected instructions are instructions that are expected to consume an amount of power above a certain threshold. For example, some matrix multiply instructions utilize more functional units in the data path and/or use data paths with very high gate densities, and therefore consume considerably more power than many other instruction types. For example, according to one embodiment, a half-precision matrix multiply accumulate (HMMA) instruction may utilize a dedicated high-gate-density HMMA data path in the SM (as described with respect to FIG. 4A), which consumes more power than many instructions and which also tends to generate more noise on the voltage rails than many other instructions. The noise of the HMMA instruction and the resulting voltage drop may be localized to the vicinity of the HMMA data path. HMMA instructions are described, for example, in U.S. Application No. 15/826,435 filed on November 29, 2017, which is incorporated herein by reference in its entirety. Thus, block 136 may monitor the instruction pipeline to detect when a high power instruction is decoded (or being decoded). However, some embodiments may be configured to detect one or more noisy instructions that are not necessarily high power instructions but that generate a significant amount of noise on the voltage rails (e.g., due to the density of concurrently activated portions of the data path, etc.).

The clock control signaling block 138 is configured to assert the clock slow down signal 154 in response to detection by the noisy instruction detection block 136. Clock slow down signal 154 is a request to clock logic 133 to slow down clock signal 158 provided to processor core 132.

In some embodiments, the clock logic 133 may provide a clock slow down acknowledge signal 156 indicating that it has achieved the requested clock slow down in response to receiving the clock slow down request signal 154.

In some embodiments, the noisy instruction detection and stall block 136 also includes logic for stalling the instruction pipeline 134, in response to asserting the clock slow down request signal 154, until it is determined that the clock frequency has been slowed down as requested. Upon receipt of the acknowledge signal 156, the pipeline may be un-stalled so that the detected noisy instruction can be dispatched for execution. Stalling helps to ensure that a detected noisy instruction is not executed before the clock is slowed down in response to the detection. In some embodiments, where the latency between instruction decode and execution is too small for the clock slow down to take effect, block 136 may also include logic for stalling.
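A behavioral sketch (not hardware RTL) of this kind of interlock follows; the method names, acknowledge latency, and timeout value are assumptions for illustration only:

```python
# Behavioral sketch of the dispatch interlock: when a noisy instruction is
# detected at decode, assert the slow down request and hold the instruction
# at dispatch until the clock logic acknowledges the slower clock (or a
# timeout expires).  Names and cycle counts are hypothetical.

ACK_TIMEOUT_CYCLES = 16

def dispatch_with_interlock(instr_is_noisy, clock_logic):
    if not instr_is_noisy:
        return "dispatch"                     # low-noise instructions proceed untouched
    clock_logic.assert_slow_down_request()
    for _ in range(ACK_TIMEOUT_CYCLES):
        if clock_logic.slow_down_acknowledged():
            return "dispatch"                 # clock is slow; safe to execute
        clock_logic.tick()                    # stall one cycle and re-check
    return "dispatch"                         # timer fallback, per the timer-based variant

class FakeClockLogic:
    """Stand-in for the clock logic, acknowledging after a fixed latency."""
    def __init__(self, ack_latency=10):
        self.cycles_since_request = None
        self.ack_latency = ack_latency
    def assert_slow_down_request(self):
        self.cycles_since_request = 0
    def tick(self):
        self.cycles_since_request += 1
    def slow_down_acknowledged(self):
        return (self.cycles_since_request is not None
                and self.cycles_since_request >= self.ack_latency)

print(dispatch_with_interlock(True, FakeClockLogic()))
```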

In the embodiment shown in FIG. 1B, clock logic 133 is provided on the same die as one or more processor cores 132. However, in some embodiments, clock logic 133 may not be located on the same die as the processor core that detects and/or executes the noisy instructions.

The signaling paths of signals 154 and 156 between the one or more processor cores 132 and the clock logic 133 are configured for low latency. In an embodiment, it is highly desirable to keep the delay on the signaling path, and the time for the clock to achieve the slow down, low enough that the instruction pipeline does not have to be stalled. In some embodiments, request/acknowledge messages (e.g., REQ/ACK bits) may be moved between the SM and the clock logic through retiming stages. Such retiming stages may, for example, be custom selected and/or custom placed. Between retiming hops, data may be routed in a manner similar to a trunk clock, for example by driving wide upper-metal-layer shielded nets through large clock buffers, using custom routing rules that eliminate expensive routing. In some embodiments, as an example, the delay between decoding and execution may be on the order of 12 cycles, and the delay between the request and the clock change may be on the order of 10 cycles. Thus, in an exemplary embodiment, the clock slowing occurs before the decoded instruction reaches execution.

Clock logic 133 may include a clock generator block 142 and a clock update block 148. In some embodiments, the clock generator block 142 may include a clock source, such as a Digital Voltage Controlled Oscillator (DVCO) clock source of a NAFLL mechanism, such as described in U.S. Patent Application No. 15/340,901 (U.S. Patent Publication No. 2018/0123604), filed on November 1, 2016, the entire contents of which are incorporated herein by reference.

However, the clock frequency change technique described in U.S. Patent Application No. 15/340,901 may not by itself meet the timing constraint of the rapid slow down required in some exemplary embodiments, namely achieving the slow down within the short time interval between the decode and execution pipeline stages. In some embodiments, the clock frequency change is implemented by performing a lookup in a lookup table, such as either of tables 144 and 146. In embodiments where the clock frequency is configured to be either "full speed" (e.g., the maximum possible operating frequency with no margin for voltage noise) or "reduced speed" (e.g., the maximum possible operating frequency with margin for noise), lookup table 144 may provide a voltage-frequency curve (VF curve) for full speed operation and table 146 may provide a VF curve for reduced speed operation. In response to the clock slow down request 154, some embodiments may change the frequency at the current voltage to that of the VF curve in table 146 (reduced speed operation).

In another embodiment, there may be no look-up table, and the clock source or other clock generator may simply be set to the requested slowed-down frequency.

At block 152, incoming clock slow down requests may be aggregated. For example, block 152 determines which of the slow down requests 154 received from the various processing cores 132 requests the greatest slow down, and forwards that request to the clock generator block 142. This ensures that the adjusted clock frequency is at all times equal to or lower than the clock frequency required to safely execute all instructions.

The hysteresis block 150 may be located in the signal path 154 between the aggregation block 152 and the clock generator block 142. In some example embodiments, although the assertion of a clock slow down request is forwarded to the clock generator block 142 immediately (i.e., sent to the clock generator without hysteresis), the de-assertion of the clock slow down request may be subjected to a hysteresis process before being forwarded to the clock generator block 142. Subjecting de-assertions to hysteresis may minimize or avoid rapidly switching the clock frequency back and forth between the nominal normal operating frequency and one or more reduced frequencies.
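A minimal sketch of this asymmetric hysteresis (immediate assertion, delayed release); the window length is an assumption:

```python
# Sketch of asymmetric hysteresis on the clock slow down request: assertion
# is forwarded to the clock generator immediately, but de-assertion is
# forwarded only after the request has stayed low for a full hysteresis
# window.  The window length is hypothetical.

HYSTERESIS_CYCLES = 32

class SlowDownHysteresis:
    def __init__(self):
        self.cycles_low = 0
        self.forwarded = False   # value currently seen by the clock generator

    def step(self, request_asserted):
        if request_asserted:
            self.forwarded = True          # slow down takes effect immediately
            self.cycles_low = 0
        else:
            self.cycles_low += 1
            if self.cycles_low >= HYSTERESIS_CYCLES:
                self.forwarded = False     # release only after a quiet window
        return self.forwarded

h = SlowDownHysteresis()
trace = [True] * 3 + [False] * 40
print([h.step(r) for r in trace].count(True), "cycles spent slowed down")
```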

The clock update block 148 controls updates to the clock and may also generate the clock slow down acknowledge signal 156 when a clock slow down has been achieved. The clock update block 148 may initiate an update of the clock frequency by the clock generator 142 in response to receiving an assertion or de-assertion of the clock slow down signal 154.

FIG. 1C illustrates a block diagram of another example processor implementing fast clock slowing in accordance with some example embodiments. The embodiment shown in FIG. 1C achieves a finer-grained level of slow down, thereby achieving better performance with workloads whose instructions consume significantly varying amounts of power (e.g., when compared to an implementation having only a single level of frequency slowing).

It has been observed that the amount of power consumed in the data path by the various compute instructions used by certain applications varies significantly. Such variation in the power consumed by different instruction types when executed in the data path can be expected to result in corresponding variation in noise and IR drop across instruction types. On some example processors, the variation may be in the range of 10-12% depending on the instruction type. Under an estimate that every 1% of noise may result in about a 2% slow down, such a variation in power level could cause the processor to be slowed more than necessary. Implementing multi-level slowing as shown in the embodiment of FIG. 1C may therefore achieve better performance when a workload contains more than two different instruction types that consume significantly different amounts of power and thus produce significantly different amounts of noise on the voltage rails.

A parallel processor 161, such as the PPU shown in FIG. 3, may include one or more processors (e.g., processors 160a and 160b). Each processor, such as processor 160a, may include one or more processing cores 162. For example, with respect to the embodiment shown in FIG. 3, processor 160a may correspond to a GPC 350, and each processing core 162 may correspond to a respective Streaming Multiprocessor (SM) 440 (see FIG. 4A).

In each processing core 162, instructions 170 enter pipeline 164, where they are decoded and then dispatched for execution in pipeline 164. The noisy instruction detect and stall block 166 monitors the pipeline 164 to detect when any instructions in the predetermined instruction set 171 are decoded. For example, block 166 may monitor the decoding of any instruction (e.g., a high power instruction) from a predetermined noisy instruction set 171. A clock slow down request signal is generated when one or more instructions in the predetermined set are detected.

In the embodiment described with respect to FIG. 1C, the clock slow down request signal 182 (e.g., signals 182a and 182b) may depend on the type of instruction detected (e.g., different detected instruction types may result in different clock slow down signals). The detected instructions are classified according to their estimated noise generation. Classification may be performed according to a programmable classification in the instruction noise class block 167. Programmability enables the pre-silicon classification to be revised post-silicon based on simulation on the actual chip. The noise caused by instructions may vary from chip to chip and/or workload to workload, and programmability enables the classification to be tailored to the specific environment.

The classification may be based on multiple groups of instruction types, where each group of instruction types is mapped to a respective clock offset. The groups of instruction types may be determined based on the amount of noise the instructions of each instruction type are expected to generate. For example, all instruction types expected to generate noise levels in a first range may be in a different group than all instruction types expected to generate noise levels in a second range. The clock offset specified in a group's mapping may be the amount by which the current frequency must be slowed to compensate for the voltage drop caused by the noise level expected to be produced by instructions of the types in that group.

In some embodiments, the mapping is programmable, thus enabling post-silicon reprogramming or reconfiguration of the mapping of instruction types to specific groups. In this manner, the initial mapping made based on pre-silicon design considerations and noise estimates may be subsequently overwritten based on the more accurate noise levels of the instruction types observed in the post-silicon simulation. The pre-silicon and post-silicon terms used herein refer to the chip design stage prior to delivery to fabrication (tapeout) and to the delivered fabricated chip, respectively. Because the noise level generated by each type of instruction in the actual circuit (i.e., the post-silicon) may, and typically does, differ significantly from the pre-silicon estimate, the programmability of the map provided in some embodiments enables instruction types to be reclassified into groups according to post-silicon simulation, which can significantly improve the performance of the processor. In some embodiments, the clock offset associated with each group may also be programmable.

Based on the detection and classification, the clock control signaling logic 168 is configured to generate one of the clock slow down request signals 182a and 182b for each processor core 162. In an exemplary embodiment, the clock slow down request signals 182a and 182b may be mapped to a first group of instruction types that are high power consumers and a second group of instruction types that are medium power consumers, respectively. Thus, the clock slow down request signal 182a may request a greater slow down than the signal 182b.

In an exemplary embodiment, for each added level of slow down, a separate wire may need to be routed from the processor core 162 to the clock block 163. In the illustrated embodiment, separate wires (i.e., 182a and 182b) are provided for each of the two slow down levels. As also noted with respect to FIGS. 1A and 1B, the conductors for sending the clock slow down signal must be designed to minimize delay in order to meet the stringent timing constraint of sending the clock slow down signal and receiving the slowed clock within the time interval between the instruction decode and execution pipeline stages.

For each level of slow down, the signals from all processor cores 162 may be aggregated with corresponding aggregation logic (e.g., aggregated clock request blocks 183a and 183b). In an exemplary embodiment, the aggregation may include subjecting the input signals to a logical OR operation. In some embodiments, the aggregation may include grouping the processor cores 162 and determining whether any respective group exceeds a preset threshold of high-noise requests that would trigger a clock slow down. These groups may be predetermined based on the physical layout of the cores, adjacency, and/or other considerations of the power distribution network. In some embodiments, clock slow down requests may be assigned weights based on characteristics of the triggering core and/or group. The aggregated clock slow down request signal from each aggregation block is provided to the clock logic 163.

The clock logic 163 may include a priority encoder 190 that receives the clock slow down requests for each of the multiple slow down levels and ensures that the resulting clock frequency 188 is at the lowest requested level. Therefore, only one of 182a and 182b is asserted between the priority encoder 190 and the clock generator 172. This is important to ensure that there are no failures in the data path executing the noisiest (e.g., most power consuming) instructions. This priority encoding is done when the clock generator is shared between cores, to ensure that there are no failures in the data path executing the noisiest instructions; such priority encoding is not required between different clock generators. Thus, for example, at least some Nvidia GPUs have GPCs, each with a set of TPCs and a NAFLL. In some embodiments, the priority encoding may be performed across the TPCs within a GPC, but not across GPCs, as each GPC has its own NAFLL.
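The following sketch illustrates per-level OR aggregation across cores followed by priority encoding that honors the largest requested slow down; the level names and offsets are hypothetical:

```python
# Sketch of per-level aggregation and priority encoding: each slow down
# level's request line is the OR of that level's requests from all cores,
# and the priority encoder then honors only the largest requested slow down
# so the resulting frequency is safe for the noisiest pending instruction.
# Level indices and offsets are hypothetical.

LEVEL_OFFSETS_MHZ = {0: 0, 1: 8, 2: 15}   # level 0 = no slow down

def aggregate_per_level(per_core_requests):
    """per_core_requests: list of {level: bool} dicts, one per core."""
    return {lvl: any(core.get(lvl, False) for core in per_core_requests)
            for lvl in LEVEL_OFFSETS_MHZ}

def priority_encode(aggregated):
    """Return the single highest asserted level (largest slow down wins)."""
    asserted = [lvl for lvl, req in aggregated.items() if req and lvl > 0]
    return max(asserted) if asserted else 0

cores = [{1: True}, {}, {2: True, 1: True}]
level = priority_encode(aggregate_per_level(cores))
print("selected level:", level, "offset:", LEVEL_OFFSETS_MHZ[level], "MHz")
```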

In some embodiments, the hysteresis block 180 may be triggered (e.g., a hysteresis counter may be started) any time a clock change occurs, even within a slow down level. Hysteresis is applied only when the clock slow down signal is de-asserted, to prevent the frequency from switching back and forth between frequencies.

The lookup table 174 may include one or more frequency entries for each voltage. In embodiments in which either of the two slow down levels 192 and 194 may be signaled to the clock generation block 172, the clock update logic 178 and/or the clock generation block 172 may access frequency settings for both the immediate transition and the long-term lock. For example, three frequency entries per voltage may be accessed in the lookup table 174. For each voltage level, ndiv, ndiv_offset, and dvco_offset may be obtained from the lookup table and used by the NAFLL logic 196 to set the clock frequency. ndiv provides the long-term frequency target for low noise instructions, while (ndiv - ndiv_offset) represents the long-term frequency target for high noise instructions. dvco_offset provides an immediate pull-down code to quickly change the clock frequency from ndiv down to (ndiv - ndiv_offset). The change in clock frequency is described in more detail below in conjunction with FIG. 2C.
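A sketch of how such a three-entry-per-voltage table might be consulted follows; the ndiv and offset values are invented for illustration and are not characterization data:

```python
# Sketch of the per-voltage lookup described above: for each voltage level
# the table holds ndiv (long-term target for low-noise instructions),
# ndiv_offset (so that ndiv - ndiv_offset is the long-term target for
# high-noise instructions), and dvco_offset (the immediate pull-down code
# used to drop the frequency quickly).  All numeric values are hypothetical.

LOOKUP_TABLE = {
    # voltage (V): (ndiv, ndiv_offset, dvco_offset)
    0.80: (60, 2, 3),
    0.90: (68, 2, 4),
    1.00: (76, 3, 5),
}

def frequency_targets(voltage, slow_down_requested):
    ndiv, ndiv_offset, dvco_offset = LOOKUP_TABLE[voltage]
    if slow_down_requested:
        # Apply the fast pull-down now; relock slowly to the reduced target.
        return {"immediate_pull_down": dvco_offset,
                "long_term_ndiv": ndiv - ndiv_offset}
    return {"immediate_pull_down": 0, "long_term_ndiv": ndiv}

print(frequency_targets(0.90, slow_down_requested=True))
```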

The acknowledge signal 186 may also require more than one bit in order to support smooth handling of the slow down transitions. A multi-bit acknowledge signal 186 enables the stall logic in the processor to properly process the acknowledge.

Instruction type-based fast clock slowing method

Fig. 2A illustrates a process 200 for performing fast clock slowing according to some example embodiments. Process 200 may be performed by a processor and clock logic as shown in any of fig. 1A, 1B, and 1C. In some exemplary embodiments, process 200 is implemented entirely in hardware. According to an exemplary embodiment, process 200 is performed at least in part by an SM (e.g., as shown in fig. 4A).

At operation 202, the process 200 is initiated when one or more noisy instructions are detected in an instruction pipeline of a processor (e.g., any of the processors shown in FIGS. 1A, 1B, and 1C). The types of noisy instructions for which the instruction pipeline is monitored may be predetermined, and may be hard-coded, specified in one or more registers, or specified in a programmable table. In one exemplary embodiment, each instruction entering the instruction pipeline is tested to determine whether its instruction type matches any of one or more predetermined instruction types.
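A minimal sketch of the decode-stage check in operation 202; apart from HMMA, which is discussed above, the opcode names in the monitored set are placeholders rather than a documented instruction list:

```python
# Sketch of the decode-stage check in operation 202: the opcode of each
# decoded instruction is tested against the predetermined set of noisy
# instruction types (hard-coded, held in registers, or in a programmable
# table).  The monitored set below is hypothetical.

NOISY_OPCODES = {"HMMA", "IMMA", "DMMA"}   # placeholder monitored set

def is_noisy(decoded_opcode):
    return decoded_opcode in NOISY_OPCODES

for op in ("FADD", "HMMA", "LDG"):
    print(op, "-> slow down request" if is_noisy(op) else "-> no action")
```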

At operation 204, the detected one or more instructions are grouped according to the noise level the detected instructions are expected to generate. After an instruction is decoded in the instruction pipeline, programmable grouping logic enables the decoded instruction to be grouped into a bucket (e.g., a logical grouping) along with other instructions of similar noise characteristics. Each instruction bucket is configured to assert a clock slow down signal that is sent to the clock logic.

The programmability of the grouping enables instructions to be moved from one bucket to another based on post-silicon learning, and allows any errors in the pre-silicon power characterization of the instructions (e.g., inaccurate estimates of the noise levels associated with various instruction types) to be corrected on silicon. For example, simulations performed on the post-silicon die may show that the noise level of a particular instruction type is different from the noise level corresponding to the group/bucket into which that instruction type was placed according to the pre-silicon estimate. In this case, the programmability of the grouping enables the particular instruction type to be remapped to a different group than the one to which it was originally mapped based on the pre-silicon estimate of its noise generation characteristics.

At operation 206, the clock slow down request signal is asserted in response to detecting the noisy instruction at operation 202. In some embodiments, the clock slow down signal may include an associated clock modulation or slow down value designated as a frequency offset or frequency. The value of the frequency offset or frequency may be obtained from a mapping of the group associated with the clock slow down signal. In some embodiments, the clock slow down signal may include an associated voltage or voltage offset rather than a frequency or frequency offset.

At operation 208, a dispatch stall signal is asserted to stall the detected instruction prior to execution. The dispatch stall is optional, and some embodiments may proceed from operation 206 to operation 210 without performing operation 208. For example, in some embodiments the pipeline may not be stalled where it can be reliably expected that the clock frequency will be adjusted to the reduced frequency between the time the clock slow down request is asserted (during or at the end of the decode stage of the instruction pipeline) and the execution stage of the pipeline. However, in embodiments where it cannot be reliably expected that the clock frequency will be adjusted in response to the clock slow down signal before execution begins, the pipeline may be stalled until the requested clock slow down takes effect.

The signaling from the processor core (e.g., SM) to the clock logic should have very low latency in order not to significantly degrade performance. The physical design and the loading on the signal must be carefully controlled to ensure very low delay. Ideally, the delay is low enough that, after the signal is asserted, the clock slows down before the instruction reaches the execution pipeline, allowing uninterrupted instruction progress. However, this may not always be possible in view of the size of the chip, hence the need for the interlock logic. In some embodiments, the interlock logic stalls execution of one or more instructions in the processor core until the clock logic asserts a signal that acknowledges (ack) that the clock has been slowed. After the acknowledge signal is asserted, all stalled processor cores are un-stalled in a staggered manner (e.g., according to a software-configurable delay for each core) to ensure that the current ramps up gradually, avoiding di/dt noise issues.

At operation 210, the clock frequency is reduced in response to asserting the clock slow down request signal. When multiple processor cores are clocked by the same clock source, the clock slow down signals asserted by the various processor cores are aggregated (e.g., "OR") for each group before interfacing with the clock logic.

The clock logic includes fast clock slow down logic and hysteresis logic, and may also include a priority encoder. Additionally, since in some embodiments the fast clock slow down works with the NAFLL clock logic, the clock logic includes logic to ensure that the frequency locked loop function of the NAFLL can be maintained with the fast clock slow down.

In some embodiments, where multiple clock slow down levels may be requested, a priority encoder may be used to ensure that the clock is set to honor the maximum slow down request corresponding to the highest current instruction that will be executed on any of the processing cores clocked by the NAFLL. In embodiments where there is only one slow down level (other than normal or full speed operating frequency), a priority encoder may not be required.

In some embodiments, each clock slow down signal has a programmable slow down value associated with it. When the signal is asserted, the frequency is reduced by the associated slow down amount (after the priority encoder) within a predetermined small number of clock cycles by a fast slow down mechanism internal to the NAFLL. According to one embodiment, the predetermined small number of cycles is 1.

In some exemplary embodiments, the clock generation internal to the NAFLL is through a DVCO built from tunable parallel drivers. In normal operation, the number of parallel drivers of the DVCO is tuned until the frequency matches the desired frequency of the NAFLL. However, a typical frequency transition from one state to another is done in small tuning steps and requires many microseconds. In an exemplary embodiment, a fast tuning change mechanism controlled by a side-band is employed to change the frequency within one clock cycle. In some embodiments, the DVCO may include an input (e.g., offset_enable) to immediately reduce the number of active parallel drivers (and thereby the frequency) by some preset amount (e.g., offset) within a short duration (e.g., one or two clock cycles). Decoder logic may be included within the DVCO that applies the specified offset only when indicated by the DVCO input (e.g., when offset_enable = 1). This fast-offset mechanism is separate from the main tuning mechanism of the DVCO. The preset offset may be a function of the operating voltage. This low-latency frequency change mechanism helps meet the stringent timing requirements of the exemplary embodiments.
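A behavioral sketch of this fast-offset side-band follows; the driver counts and the preset offset are hypothetical, and the model only illustrates the intent of the offset_enable input rather than any actual DVCO design:

```python
# Behavioral sketch of the DVCO fast-offset mechanism described above: a
# side-band offset_enable input disables a preset number of the parallel
# drivers within a cycle or two, bypassing the slow main tuning loop.
# Driver counts and the offset amount are hypothetical.

class DvcoSketch:
    def __init__(self, enabled_drivers=64, preset_offset=4):
        self.enabled_drivers = enabled_drivers   # set by the main (slow) tuning loop
        self.preset_offset = preset_offset       # fast-offset amount (voltage dependent)

    def effective_drivers(self, offset_enable):
        # The output frequency tracks the number of active parallel drivers;
        # when offset_enable is asserted the preset offset applies immediately.
        if offset_enable:
            return self.enabled_drivers - self.preset_offset
        return self.enabled_drivers

dvco = DvcoSketch()
print("normal:        ", dvco.effective_drivers(offset_enable=False))
print("fast slow down:", dvco.effective_drivers(offset_enable=True))
```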

Additionally, in contrast to other clock slowing techniques (such as LDIV or clock skipping), the fast tuning change mechanism employed in some example embodiments can slow down the clock frequency by a small percentage (e.g., 2%, 4%, etc.) while maintaining the duty cycle across all cycles.

Hysteresis logic is used to ensure that the NAFLL does not switch frequently between slow down modes, which can cause instability in the frequency lock of the NAFLL. Switching from the smaller slow down mode to the larger slow down mode occurs quickly. Hysteresis may only apply from a larger slow down to a smaller slow down, or when the slow down is released.

Additional logic may also be required to maintain the frequency locked loop function of the NAFLL while engaging and disengaging the various slow downs. In some embodiments, the two VF curves, "full speed" and "slow down", are characterized separately on silicon, each representing the maximum frequency (Fmax) for the workloads whose instructions it accommodates. These frequency targets are provided to the frequency locked loop of the NAFLL for the current voltage. A tunable error accumulator may be used to keep the tunable DVCO of the NAFLL locked to the target frequency, which is typically the "full speed" frequency.

When entering the slow down state, the control loop of the NAFLL may temporarily stop tracking the target "full speed" frequency and immediately apply its fast slow down mechanism to reduce the frequency and allow the slowed-down instruction to execute. It will then attempt to slowly relock to the "slow down" frequency, since the fast slow down mechanism may have extra margin to account for PVT (process, voltage, temperature) variations or to reduce characterization effort.

The NAFLL may again suspend target frequency tracking when leaving the slow down state after the hysteresis window has expired. The internal sequencer may optionally apply static, transient, negative frequency offsets to reduce the risk of later overshoot (e.g., due to PVT variations or characterization uncertainties). It can then disengage the fast slow down mechanism, restoring the frequency to less than or equal to the original "full speed" frequency. Finally, it moves its long-term lock target to the "full speed" frequency, which will gradually restore the frequency to the pre-slow down frequency. Fig. 2C shows the behavior of the clock frequency when the fast slow down mechanism is disengaged.

At operation 212, in some embodiments, the clock slow down acknowledge signal is asserted by the clock logic. An acknowledgement may be sent to the processor core once the slowdown is engaged, and may be used to un-stall instruction execution in the event of large loop delays in engaging the clock slowdown.

With the processor core and the clock logic working in concert, the clock frequency can be rapidly modulated depending on instruction type, allowing instructions of different power levels to execute at their respective optimal frequencies.

At operation 214, in response to the clock slow down acknowledge signal, instruction execution is un-stalled and the instruction is dispatched for execution. In the event that the instruction pipeline has stalled, then upon receipt of the acknowledgement signal, the detected instruction may be dispatched for execution in the data path. As described above, instruction execution includes activating a data path to execute the instruction.

At operation 216, one or more noisy instructions are executed at the reduced clock frequency. In other words, one or more noisy instructions are executed by a processor resource (e.g., in a data path), with the clock frequency set to a maximum frequency for the noisy instructions.

At operation 218, the clock slow down request signal is deasserted. Deassertion may occur before or after the instruction pipeline is un-stalled. In some embodiments, deassertion may be based on expiration of a timer started when the clock slow down request signal is asserted. Two exemplary techniques related to deassertion of the clock slow down request signal are described with respect to fig. 2D.

After the execution of the one or more instructions is completed at the reduced frequency, the clock frequency is recovered at operation 220. In other words, the clock frequency is restored to the maximum frequency of normal instructions (i.e., instructions that are not noisy instructions). An example of clock frequency recovery is shown in fig. 2C.
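
The pipeline-side sequence of operations 212-220 can be summarized by the following self-contained sketch; the interface, the helper functions, and the modeled acknowledge behavior are assumptions for illustration, not the claimed implementation:

struct Instruction { int opcode; };

// Assumed interface to the clock logic (request/acknowledge handshake);
// the acknowledge is modeled here as instantaneous for simplicity.
struct ClockLogic {
    bool ack = false;
    void assertSlowdownRequest()   { ack = true; }
    void deassertSlowdownRequest() { ack = false; }
    bool slowdownAcknowledged() const { return ack; }
};

void dispatch(const Instruction&) {}          // placeholder: send to the data path
void waitForCompletion(const Instruction&) {} // placeholder: wait for completion

void runNoisyInstruction(ClockLogic& clk, const Instruction& insn) {
    clk.assertSlowdownRequest();              // request a slower clock
    while (!clk.slowdownAcknowledged()) {}    // stall until acknowledged (operation 212)
    dispatch(insn);                           // operation 214: un-stall and dispatch
    waitForCompletion(insn);                  // operation 216: execute at reduced clock
    clk.deassertSlowdownRequest();            // operation 218: release the request
    // operation 220: the clock logic restores the full-speed frequency
}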

Signaling of fast clock slow down based on instruction type

Fig. 2B is a timing diagram illustrating a clock slow down request signal 222 and a corresponding acknowledge signal 224 that may be used in fast clock slow down according to some example embodiments. The timing diagram shows the assertion and deassertion of each signal over time, from left to right. A rising edge of a curve indicates that the corresponding signal is asserted, and a falling edge indicates that it is deasserted.

As described above, signal 222 is asserted by an instruction pipeline associated with processing circuitry (e.g., instruction pipelines 108, 134, and 164) when an instruction is detected during decoding of the instruction as belonging to one or more instruction types being monitored. One example of a monitored instruction that may cause a significant amount of noise in the power supply rails is a matrix multiply instruction.

In response to request signal 222, signal 224 is asserted by the clock logic. In some embodiments, the acknowledgement signal 224 is asserted after the clock logic has set the clock to the target reduced frequency. In some embodiments, the acknowledgement signal 224 may be asserted immediately before or during the change of the clock to the target reduced frequency.

As shown, in some embodiments, the clock slow down request signal 222 may remain active even after the acknowledgement signal 224 is received at the instruction pipeline and/or processing logic and the detected high power instruction is dispatched for execution. After the detected high power instruction has completed execution, the instruction pipeline and/or processing logic may deactivate request signal 222.

In some embodiments, the acknowledgement signal 224 may remain active until the request signal is inactive. For example, the clock logic may disable the acknowledge signal 224 after determining that the request signal 222 has been disabled.

Fig. 2C illustrates the change in clock frequency 228 over time when performing fast clock slowing, according to some demonstrative embodiments. The y-axis represents clock frequency and the x-axis represents time. This figure illustrates an exemplary variation of a clock frequency 228 associated with a clock slow down request signal 226 in a system such as that shown in FIG. 1A, FIG. 1B, or FIG. 1C.

In the exemplary embodiment shown, the clock may be configured to operate at either of two frequencies: a maximum frequency for non-noisy (e.g., non-high-power) instructions (e.g., F_non-high-power-instructions 230, referred to in the example described with respect to FIG. 1B, or "full speed"), or a maximum frequency for noisy (e.g., high-power) instructions (e.g., F_high-power-instructions 232, referred to in the example described with respect to FIG. 1B, or "slow down"). As shown, the maximum frequency for non-noisy instructions may be faster than the maximum frequency for high power instructions.

A primary advantage of the exemplary embodiments is that they reduce the time during which the processor must operate at a lower (e.g., "slow down") frequency due to margin requirements associated with noise. More specifically, the illustrative embodiments operate at the reduced frequency only for the duration of execution of certain instructions that may produce a noise effect large enough to cause a voltage drop in the circuit.

Therefore, it is expected that the processor will operate at frequency 230 most of the time. When the instruction pipeline and/or processing logic detects the decoding of a monitored instruction, the clock slow down request signal 226 is asserted, as described above. The clock slow down request signal 226 may remain asserted until the detected instruction has completed execution and/or until a predetermined interval expires. The clock logic is configured to decrease the clock frequency 228 when it receives the clock slow down request signal 226.

As shown (see the leftmost sharp drop in clock frequency 228), when the clock logic receives the clock slow down request signal 226, it can quickly slow the clock to the target reduced frequency 232. In some embodiments, the clock logic may make the initial drop larger than the target requires. This undershoot is illustrated by the clock frequency 228 briefly dipping below the target reduced frequency 232. After the rapid drop, the clock frequency 228 gradually settles at the target reduced frequency 232.

The fast slow down may be performed by referring to a look-up table to determine the new frequency and then immediately setting the clock source to the new operating frequency. In some embodiments, the new frequency becomes active in one clock cycle after the clock logic receives the clock slow down request signal 226.

After the clock slow down request signal 226 is deasserted, the clock logic may restore the clock frequency 228, now at the reduced frequency 232, back to the higher frequency 230. Recovery may be initiated by temporarily reducing the frequency to the lower frequency 234 before recovering the higher frequency. In some exemplary embodiments, a reduction to the lower frequency 234 may be necessary in order to accommodate the overshoot.

As can be seen in the frequency curve 228, the drop to the target reduced frequency 232 is sharp compared to the gradual increase in frequency during recovery. As described above, the clock slowdown must take effect quickly, in the interval between decoding an instruction and dispatching it to the data path. The recovery, however, is subject to a hysteresis process so that oscillation between frequencies 230 and 232 is avoided.

FIG. 2D is a timing diagram showing the stalling of instruction execution during fast clock slow down and the release of the stall, according to some demonstrative embodiments. The figure illustrates the timing associated with the detection of a noisy instruction and the assertion of a clock slow down request signal 240, a corresponding slow down acknowledge signal 242, a stall signal 244 according to some embodiments, and an alternate stall signal 246 according to some other embodiments.

With respect to the two stall signals 244 and 246, as shown, both are asserted on the rising edge of the slow down request signal 240 (or shortly after the slow down request signal 240 is asserted). However, while in one embodiment stall signal 244 is deasserted when acknowledgement signal 242 is asserted, in another embodiment stall signal 246, which is controlled by a programmable timer, may be deasserted without waiting for an acknowledgement signal. For example, the stall may be released after a predetermined duration measured by the timer. The duration may be programmed based on post-silicon simulation.
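
As an illustrative sketch of the timer-based alternative (stall signal 246), with hypothetical names and an arbitrary cycle count:

// The stall begins when the slowdown request is asserted and is released
// after a programmable number of clock cycles, without waiting for an ack.
struct TimerStall {
    int programmedCycles = 8;   // assumed value; programmed from post-silicon data
    int counter = 0;
    bool stalled = false;

    void onSlowdownRequestAsserted() { stalled = true; counter = 0; }

    void tick() {                       // called once per clock cycle
        if (stalled && ++counter >= programmedCycles)
            stalled = false;            // release the stall
    }
};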

Although the fast clock slowing shown in the systems of fig. 1A, 1B, and 1C and the process described with respect to fig. 2A may be described in the context of a processing unit and certain hardware components, these systems and processes may also be performed by custom circuits, or by a combination of CPUs, GPUs, custom circuits, and/or programs. Those of ordinary skill in the art will appreciate that any system that performs, for example, method 200, is within the scope and spirit of embodiments of the present invention.

Parallel processing architecture using fast clock slowing

More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may be implemented, as desired by the user. It should be particularly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any way. Any of the following features may optionally be incorporated, with or without the exclusion of other features described.

FIG. 3 illustrates a Parallel Processing Unit (PPU)300 according to one embodiment. In one embodiment, the PPU300 is a multi-threaded processor implemented on one or more integrated circuit devices. The PPU300 is a latency hiding architecture designed for processing many threads in parallel. A thread (i.e., an execution thread) is an instance of a set of instructions configured to be executed by the PPU 300. In one embodiment, PPU300 is a Graphics Processing Unit (GPU) configured to implement a graphics rendering pipeline for processing three-dimensional (3D) graphics data in order to generate two-dimensional (2D) image data for display on a display device, such as a Liquid Crystal Display (LCD) device. In other embodiments, the PPU300 may be used to perform general-purpose computations. Although one example parallel processor is provided herein for purposes of illustration, it is specifically noted that this processor is set forth for purposes of illustration only, and any processor may be used in addition to and/or in place of this processor.

One or more PPUs 300 may be configured to accelerate thousands of High Performance Computing (HPC), data center, and machine learning applications. The PPU300 may be configured to accelerate a wide variety of deep learning systems and applications, including autonomous vehicle platforms, deep learning, high-precision speech, image, and text recognition systems, intelligent video analysis, molecular simulation, drug discovery, disease diagnosis, weather forecasting, big data analysis, astronomy, molecular dynamics simulation, financial modeling, robotics, factory automation, real-time language translation, online search optimization, personalized user recommendations, and the like.

As shown in FIG. 3, PPU300 includes an input/output (I/O) unit 305, a front end unit 315, a scheduler unit 320, a work distribution unit 325, a hub 330, a crossbar (Xbar) 370, one or more general purpose processing clusters (GPCs) 350, and one or more partition units 380. PPUs 300 may be connected to a host processor or other PPUs 300 via one or more high-speed NVLink 310 interconnects. PPU300 may be connected to a host processor or other peripheral devices via interconnect 302. The PPU300 may also be connected to local memory comprising a plurality of memory devices 304. In one embodiment, the local memory may include a plurality of Dynamic Random Access Memory (DRAM) devices. The DRAM devices may be configured as a High Bandwidth Memory (HBM) subsystem, with multiple DRAM dies stacked within each device.

The NVLink310 interconnect enables the system to scale and include one or more PPUs 300 in conjunction with one or more CPUs, support cache coherency between PPUs 300 and CPUs, and CPU hosting. Data and/or commands may be sent by NVLink310 to and from other units of PPU300, such as one or more replication engines, video encoders, video decoders, power management units, etc. (not explicitly shown) via hub 330. NVLink310 is described in more detail in conjunction with fig. 5B.

The I/O unit 305 is configured to send and receive communications (e.g., commands, data, etc.) from a host processor (not shown) over the interconnect 302. The I/O unit 305 may communicate with the host processor directly via the interconnect 302, or through one or more intermediate devices (such as a memory bridge). In one embodiment, I/O unit 305 may communicate with one or more other processors (e.g., one or more PPUs 300) via interconnect 302. In one embodiment, I/O unit 305 implements a peripheral component interconnect express (PCIe) interface for communicating over a PCIe bus, and interconnect 302 is a PCIe bus. In alternative embodiments, the I/O unit 305 may implement other types of known interfaces for communicating with external devices.

The I/O unit 305 decodes data packets received via the interconnect 302. In one embodiment, the data packet represents a command configured to cause the PPU300 to perform various operations. The I/O unit 305 transmits the decoded command to various other units of the PPU300 as specified by the command. For example, some commands may be sent to the front end unit 315. Other commands may be sent to hub 330 or other units of PPU300, such as one or more replication engines, video encoders, video decoders, power management units, and the like (not explicitly shown). In other words, the I/O unit 305 is configured to route communications between and among the various logical units of the PPU 300.

In one embodiment, a program executed by a host processor encodes a command stream in a buffer that provides workloads to the PPU300 for processing. The workload may include a number of instructions and data to be processed by those instructions. A buffer is an area of memory that is accessible (e.g., read/write) by both the host processor and the PPU 300. For example, I/O unit 305 may be configured to access buffers in system memory connected to interconnect 302 via memory requests transmitted over interconnect 302. In one embodiment, the host processor writes the command stream to a buffer and then sends a pointer to the beginning of the command stream to the PPU 300. The front end unit 315 receives pointers to one or more command streams. The front end unit 315 manages one or more streams, reads commands from the streams and forwards the commands to the various units of the PPU 300.

The front end units 315 are coupled to a scheduler unit 320, which configures various GPCs 350 to process tasks defined by one or more streams. The scheduler unit 320 is configured to track status information related to various tasks managed by the scheduler unit 320. The status may indicate which GPC350 the task is assigned to, whether the task is active or inactive, a priority associated with the task, and so on. The scheduler unit 320 manages the execution of multiple tasks on one or more GPCs 350.

The scheduler unit 320 is coupled to a work allocation unit 325 configured to dispatch tasks for execution on the GPCs 350. The work allocation unit 325 may track several scheduled tasks received from the scheduler unit 320. In one embodiment, the work distribution unit 325 manages a pending (pending) task pool and an active task pool for each GPC 350. The pool of pending tasks may include a number of time slots (e.g., 32 time slots) that contain tasks assigned to be processed by a particular GPC 350. The active task pool may include a number of slots (e.g., 4 slots) for tasks being actively processed by the GPCs 350. When a GPC350 completes execution of a task, the task is evicted from the active task pool of the GPC350, and one of the other tasks from the pending task pool is selected and scheduled for execution on the GPC 350. If an active task on a GPC350 has been idle, for example while waiting for a data dependency to be resolved, the active task may be evicted from the GPC350 and returned to the pool of pending tasks, while another task in the pool of pending tasks is selected and scheduled for execution on the GPC 350.
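
A host-side sketch of the pending/active pool bookkeeping described above; the container choices are arbitrary, and the slot counts merely follow the example numbers in the text rather than any actual implementation:

#include <deque>
#include <vector>

struct Task { int id; };

struct GpcTaskPools {
    static const int kActiveSlots = 4;   // active task pool (e.g., 4 slots)
    std::deque<Task> pending;            // pending task pool (e.g., 32 slots)
    std::vector<Task> active;

    // Promote pending tasks while active slots are free.
    void schedule() {
        while ((int)active.size() < kActiveSlots && !pending.empty()) {
            active.push_back(pending.front());
            pending.pop_front();
        }
    }

    // A completed task is evicted and a pending task takes its slot.
    void onComplete(int id) {
        removeActive(id);
        schedule();
    }

    // An idle task (e.g., waiting on a data dependency) returns to the
    // pending pool and another pending task is scheduled in its place.
    void onIdle(int id) {
        for (size_t i = 0; i < active.size(); ++i) {
            if (active[i].id == id) { pending.push_back(active[i]); break; }
        }
        removeActive(id);
        schedule();
    }

private:
    void removeActive(int id) {
        for (size_t i = 0; i < active.size(); ++i) {
            if (active[i].id == id) { active.erase(active.begin() + i); return; }
        }
    }
};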

The work distribution unit 325 communicates with one or more GPCs 350 via XBar 370. XBar370 is an interconnection network that couples many of the elements of PPU300 to other elements of PPU 300. For example, XBar370 may be configured to couple work allocation unit 325 to a particular GPC 350. Although not explicitly shown, one or more other units of PPU300 may also be connected to XBar370 via hub 330.

Tasks are managed by the scheduler unit 320 and dispatched to GPCs 350 by the work distribution unit 325. GPCs 350 are configured to process tasks and generate results. The results may be consumed by other tasks within the GPC350, routed to a different GPC350 via the XBar370, or stored in the memory 304. The results may be written to memory 304 via partition units 380, which implement a memory interface for reading data from and writing data to memory 304. The results may be sent to another PPU 300 or a CPU via NVLink 310. In one embodiment, the PPU300 includes a number U of partition units 380 equal to the number of separate and distinct memory devices 304 coupled to the PPU 300. Partition unit 380 is described in more detail below in conjunction with FIG. 4B.

In one embodiment, the host processor executes a driver kernel that implements an Application Programming Interface (API) that enables one or more applications to execute on the host processor to schedule operations for execution on the PPU 300. In one embodiment, multiple computing applications are executed simultaneously by the PPU300, and the PPU300 provides isolation, quality of service (QoS), and independent address spaces for the multiple computing applications. An application may generate instructions (e.g., API calls) that cause a driver kernel to generate one or more tasks to be executed by PPU 300. The driver kernel exports tasks to one or more streams being processed by the PPU 300. Each task may include one or more related thread groups, referred to herein as thread bundles (warp). In one embodiment, the thread bundle includes 32 related threads that may be executed in parallel. Cooperative threads may refer to multiple threads that include instructions to perform tasks and may exchange data through a shared memory. Threads and cooperative threads are described in more detail in conjunction with FIG. 5A.
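
For illustration, a minimal CUDA host program that submits work to a stream; the kernel, sizes, and launch configuration are arbitrary assumptions, and each launched thread block is executed by the hardware as thread bundles (warps) of 32 threads:

#include <cuda_runtime.h>

__global__ void scale(float* data, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The runtime/driver places this launch on the stream; the device
    // schedules the resulting thread blocks, each executed as warps of 32 threads.
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, 2.0f, n);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}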

FIG. 4A illustrates a GPC350 of the PPU300 of FIG. 3, according to one embodiment. As shown in fig. 4A, each GPC350 includes multiple hardware units for processing tasks. In one embodiment, each GPC350 includes a pipeline manager 410, a pre-raster operations unit (PROP)415, a raster engine 425, a work distribution crossbar (WDX)480, a Memory Management Unit (MMU)490, and one or more Data Processing Clusters (DPCs) 420. It should be understood that the GPCs 350 of fig. 4A may include other hardware units in place of or in addition to the units shown in fig. 4A.

In one embodiment, the operation of GPCs 350 is controlled by a pipeline manager 410. The pipeline manager 410 manages the configuration of one or more DPCs 420 for processing tasks assigned to the GPC 350. In one embodiment, pipeline manager 410 may configure at least one of the one or more DPCs 420 to implement at least a portion of a graphics rendering pipeline. For example, DPC 420 may be configured to execute a vertex shading program on programmable Streaming Multiprocessor (SM) 440. The pipeline manager 410 may also be configured to route data packets received from the work distribution unit 325 to the appropriate logical unit in the GPC 350. For example, some packets may be routed to fixed function hardware units in the PROP 415 and/or raster engine 425, while other packets may be routed to the DPC 420 for processing by the primitive engine 435 or SM 440. In one embodiment, pipeline manager 410 may configure at least one of the one or more DPCs 420 to implement a neural network model and/or a compute pipeline.

The PROP unit 415 is configured to route data generated by the raster engine 425 and the DPC 420 to a Raster Operations (ROP) unit, described in more detail in connection with fig. 4B. The PROP unit 415 may also be configured to perform optimization of color mixing, organize pixel data, perform address translation, and the like.

The raster engine 425 includes several fixed-function hardware units configured to perform various raster operations. In one embodiment, raster engine 425 includes a setup engine, a coarse raster engine, a culling engine, a cropping engine, a fine raster engine, and a tile aggregation engine. The setup engine receives the transformed vertices and generates plane equations associated with the geometric primitives defined by the vertices. The plane equations are sent to a coarse raster engine to generate coverage information for the primitive (e.g., x, y coverage masks for the tiles). The output of the coarse raster engine is sent to a culling engine, where fragments associated with primitives that fail the z-test are culled, and the culled fragments are sent to a clipping engine, where fragments outside of the view frustum are clipped. Those fragments that remain after clipping and culling may be passed to a fine raster engine to generate attributes for the pixel fragments based on a plane equation generated by a setup engine. The output of the raster engine 425 includes, for example, fragments to be processed by a fragment shader implemented within the DPC 420.

Each DPC 420 included in a GPC350 includes an M-pipeline controller (MPC)430, a primitive engine 435, and one or more SMs 440. The MPC 430 controls the operation of the DPC 420 and routes data packets received from the pipeline manager 410 to the appropriate elements in the DPC 420. For example, packets associated with the vertices may be routed to primitive engine 435, primitive engine 435 configured to fetch the vertex attributes associated with the vertices from memory 304. Instead, packets associated with the shading program may be sent to SM 440.

SM440 includes a programmable streaming processor configured to process tasks represented by a plurality of threads. Each SM440 is multithreaded and configured to concurrently execute multiple threads (e.g., 32 threads) from a particular thread group. In one embodiment, SM440 implements a SIMD (single instruction, multiple data) architecture, where each thread in a thread group (e.g., a thread bundle) is configured to process a different data set based on the same instruction set. All threads in the thread group execute the same instruction. In another embodiment, the SM440 implements a SIMT (single instruction, multiple thread) architecture, where each thread in a thread group is configured to process a different data set based on the same instruction set, but where individual threads in the thread group are allowed to diverge during execution. In one embodiment, a program counter, call stack, and execution state are maintained for each thread bundle, enabling concurrency between thread bundles and serial execution within a thread bundle when the threads within the thread bundle diverge. In another embodiment, a program counter, call stack, and execution state are maintained for each individual thread, enabling equal concurrency among all threads, within and between thread bundles. When execution state is maintained for each individual thread, threads executing the same instruction may be converged and executed in parallel for maximum efficiency. SM440 is described in more detail below in conjunction with fig. 5A.
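
A small CUDA kernel illustrating SIMT divergence: threads of the same thread bundle may take different branches depending on their data, and the hardware executes each path with the relevant subset of threads active before the threads reconverge. The kernel itself is only an illustrative example:

__global__ void branchy(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Threads within one warp may disagree on this condition; each path is
    // executed with a subset of the warp active, and the threads reconverge
    // after the branch.
    if (in[i] % 2 == 0)
        out[i] = in[i] * 2;
    else
        out[i] = in[i] + 1;
}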

The MMU 490 provides an interface between the GPCs 350 and the partition units 380. The MMU 490 may provide translation of virtual addresses to physical addresses, memory protection, and arbitration of memory requests. In one embodiment, the MMU 490 provides one or more Translation Lookaside Buffers (TLBs) for performing translations from virtual addresses to physical addresses in memory 304.

FIG. 4B illustrates a memory partition unit 380 of the PPU300 of FIG. 3, according to one embodiment. As shown in FIG. 4B, memory partition unit 380 includes a Raster Operations (ROP) unit 450, a level two (L2) cache 460, and a memory interface 470. A memory interface 470 is coupled to the memory 304. Memory interface 470 may implement a 32, 64, 128, 1024 bit data bus, etc. for high speed data transfer. In one embodiment, PPU300 incorporates U memory interfaces 470, one memory interface 470 for each pair of partition units 380, where each pair of partition units 380 is coupled to a corresponding memory device 304. For example, the PPU300 may be connected to up to Y memory devices 304, such as a high bandwidth memory stack or a graphics double data rate version 5 synchronous dynamic random access memory or other type of persistent memory.

In one embodiment, memory interface 470 implements the HBM2 memory interface, and Y equals half of U. In one embodiment, the HBM2 memory stack is located on the same physical package as the PPU300, providing significant power and area savings compared to conventional GDDR5 SDRAM systems. In one embodiment, each HBM2 stack includes four memory dies and Y equals 4, where the HBM2 stack includes two 128-bit lanes per die, for a total of 8 lanes and a data bus width of 1024 bits.

In one embodiment, memory 304 supports Single Error Correction Double Error Detection (SECDED) Error Correction Codes (ECC) to protect data. For computing applications that are sensitive to data corruption, ECC provides higher reliability. In large cluster computing environments, reliability is particularly important where the PPU300 handles very large data sets and/or long running applications.

In one embodiment, PPU300 implements a multi-level memory hierarchy. In one embodiment, memory partition unit 380 supports unified memory to provide a single unified virtual address space for CPU and PPU300 memory, enabling data sharing between virtual memory systems. In one embodiment, the frequency of accesses by the PPU300 to memory located on other processors is tracked to ensure that a page of memory is moved to the physical memory of the PPU300 that accesses the page more frequently. In one embodiment, NVLink310 supports an address translation service that allows PPU300 to directly access CPU's page tables and provides full access to CPU memory by PPU 300.
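
A short, hedged example of a unified address space from the CUDA runtime's point of view: the allocation is visible to both the CPU and the GPU, and pages migrate on demand. The sizes and the kernel are illustrative:

#include <cuda_runtime.h>

__global__ void increment(int* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1;
}

int main() {
    const int n = 1024;
    int* x = nullptr;
    cudaMallocManaged(&x, n * sizeof(int));    // single unified virtual address space

    for (int i = 0; i < n; ++i) x[i] = i;      // CPU touches the pages first

    increment<<<(n + 255) / 256, 256>>>(x, n); // GPU access migrates pages on demand
    cudaDeviceSynchronize();

    int ok = (x[0] == 1 && x[n - 1] == n);     // CPU reads the results back
    cudaFree(x);
    return ok ? 0 : 1;
}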

In one embodiment, the copy engine transfers data between multiple PPUs 300 or between a PPU300 and a CPU. The copy engine may generate page faults for addresses that are not mapped into the page tables. The memory partition unit 380 may then service the page faults, mapping the addresses into the page table, after which the copy engine may perform the transfer. In conventional systems, memory is pinned (e.g., made non-pageable) for copy engine operations between multiple processors, which significantly reduces the available memory. With hardware page faulting, addresses can be passed to the copy engine without concern over whether the memory pages are resident, and the copy process is transparent.

Data from memory 304 or other system memory may be fetched by memory partition unit 380 and stored in L2 cache 460, which is located on-chip and shared among the GPCs 350. As shown, each memory partition unit 380 includes a portion of the L2 cache 460 associated with a corresponding memory device 304. Lower-level caches may then be implemented in multiple units within the GPCs 350. For example, each SM440 may implement a level one (L1) cache. The L1 cache is a private memory dedicated to a particular SM 440. Data from L2 cache 460 may be fetched and stored in each of the L1 caches for processing in the functional units of the SMs 440. L2 cache 460 is coupled to memory interface 470 and XBar 370.

ROP unit 450 performs graphics raster operations related to pixel colors, such as color compression, pixel blending, and the like. ROP unit 450 also enables depth testing with raster engine 425, which receives the depth of sample locations associated with pixel fragments from a culling engine of raster engine 425. The sample locations associated with the fragments are tested for depth relative to corresponding depths in the depth buffer. If the fragment passes the depth test for the sample location, ROP unit 450 updates the depth buffer and sends the results of the depth test to raster engine 425. It will be understood that the number of partition units 380 may be different than the number of GPCs 350, and thus each ROP unit 450 may be coupled to each GPC 350. ROP unit 450 tracks packets received from different GPCs 350 and determines to which GPC350 the results generated by ROP unit 450 are routed through Xbar 370. Although ROP unit 450 is included within memory partition unit 380 in fig. 4B, ROP unit 450 may be external to memory partition unit 380 in other embodiments. For example, ROP unit 450 may reside in the GPC350 or another unit.

FIG. 5A illustrates the streaming multiprocessor 440 of FIG. 4A according to one embodiment. As shown in fig. 5A, SM440 includes an instruction cache 505, one or more scheduler units 510, a register file 520, one or more processing cores 550, one or more Special Function Units (SFUs) 552, one or more load/store units (LSUs) 554, an interconnection network 580, a shared memory/L1 cache 570.

As described above, the work distribution unit 325 schedules tasks to execute on GPCs 350 of PPUs 300. A task is assigned to a particular DPC 420 within the GPC350, and if the task is associated with a shader program, the task may be assigned to the SM 440. The scheduler unit 510 receives tasks from the work allocation unit 325 and manages the scheduling of instructions assigned to one or more thread blocks of the SM 440. Scheduler unit 510 schedules thread blocks to execute as bundles of parallel threads, where each thread block is assigned at least one bundle. In one embodiment, 32 threads are executed per bundle. Scheduler unit 510 may manage multiple different thread blocks, assign thread bundles to different thread blocks, and then dispatch instructions from multiple different cooperative groups to various functional units (i.e., cores 550, SFUs 552, and LSUs 554) during each clock cycle.

Cooperative groups are a programming model for organizing groups of communicating threads that allows developers to express the granularity at which threads communicate, enabling richer, more efficient parallel decompositions to be expressed. Cooperative launch APIs support synchronization between thread blocks for the execution of parallel algorithms. Conventional programming models provide a single, simple construct for synchronizing cooperating threads: a barrier across all threads of a thread block (e.g., the __syncthreads() function). However, programmers often want to define thread groups at a granularity smaller than the thread block and synchronize within the defined groups, enabling higher performance, design flexibility, and software reuse in the form of collective group-wide function interfaces.

Cooperative groups enable programmers to explicitly define thread groups at sub-block (e.g., as small as a single thread) and multi-block granularity and to perform collective operations, such as synchronization, on the threads in a cooperative group. The programming model supports clean composition across software boundaries, so that libraries and utility functions can synchronize safely within their local context without having to make assumptions about convergence. Cooperative group primitives enable new patterns of cooperative parallelism, including producer-consumer parallelism, opportunistic parallelism, and global synchronization across an entire grid of thread blocks.
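
The following CUDA sketch uses the cooperative groups API to form warp-sized (32-thread) tiles within a thread block, perform a collective operation within each tile, and synchronize at sub-block granularity. The buffer layout and size assumptions are for illustration only:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each 32-thread tile reduces its own values and writes one partial sum.
// Assumes blockDim.x is a multiple of 32 and the grid exactly covers `in`.
__global__ void tileSum(const float* in, float* partial) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[gid];

    // Collective reduction across the tile using register shuffles.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);

    if (tile.thread_rank() == 0)
        partial[gid / 32] = v;      // one result per warp-sized tile

    block.sync();                   // block-wide barrier (cf. __syncthreads())
}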

Dispatch unit 515 is configured to transmit instructions to one or more functional units. In this embodiment, the scheduler unit 510 includes two dispatch units 515 that enable two different instructions from the same thread bundle to be scheduled during each clock cycle. In alternative embodiments, each scheduler unit 510 may include a single dispatch unit 515 or additional dispatch units 515.

Each SM440 includes a register file 520 that provides a set of registers for the functional units of the SM 440. In one embodiment, register file 520 is divided among each functional unit such that each functional unit is allocated a dedicated portion of register file 520. In another embodiment, the register file 520 is divided between different thread bundles executed by the SM 440. Register file 520 provides temporary storage for operands connected to the data paths of the functional units.

Each SM440 includes L processing cores 550. In one embodiment, SM440 includes a large number (e.g., 128, etc.) of distinct processing cores 550. Each core 550 may include a fully pipelined, single-precision, double-precision, and/or mixed-precision processing unit that includes a floating-point arithmetic logic unit and an integer arithmetic logic unit. In one embodiment, the floating-point arithmetic logic units implement the IEEE 754-2008 standard for floating-point arithmetic. In one embodiment, the cores 550 include 64 single-precision (32-bit) floating-point cores, 64 integer cores, 32 double-precision (64-bit) floating-point cores, and 8 tensor cores.

The tensor cores are configured to perform matrix operations, and in one embodiment, one or more tensor cores are included in the cores 550. In particular, the tensor cores are configured to perform deep learning matrix arithmetic, such as convolution operations for neural network training and inference. In one embodiment, each tensor core operates on a 4 × 4 matrix and performs a matrix multiply-and-accumulate operation D = A × B + C, where A, B, C, and D are 4 × 4 matrices.

In one embodiment, the matrix multiply inputs A and B are 16-bit floating-point matrices, while the accumulation matrices C and D may be 16-bit or 32-bit floating-point matrices. Tensor cores operate on 16-bit floating-point input data with 32-bit floating-point accumulation. The 16-bit floating-point multiply requires 64 operations and produces a full-precision product that is then accumulated, using 32-bit floating-point addition, with the other intermediate products of the 4 × 4 matrix multiplication. In practice, tensor cores are used to perform much larger two-dimensional or higher-dimensional matrix operations built up from these smaller elements. APIs (such as the CUDA 9 C++ API) expose specialized matrix load, matrix multiply-and-accumulate, and matrix store operations to efficiently use tensor cores from a CUDA-C++ program. At the CUDA level, the warp-level interface assumes 16 × 16 size matrices spanning all 32 threads of the thread bundle.
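
A minimal warp-level tensor core example using the CUDA WMMA API along the lines described above (16 × 16 × 16 tiles, half-precision inputs, single-precision accumulation). The launch configuration and data layout are assumptions; one thread bundle computes one 16 × 16 output tile:

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp multiplies a 16x16 half-precision A by a 16x16 half-precision B
// and accumulates into a 16x16 single-precision result (D = A * B + C, C = 0).
__global__ void wmma16x16(const half* a, const half* b, float* d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::fill_fragment(accFrag, 0.0f);
    wmma::load_matrix_sync(aFrag, a, 16);     // leading dimension 16
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);
    wmma::store_matrix_sync(d, accFrag, 16, wmma::mem_row_major);
}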

Each SM440 also includes M SFUs 552 that perform special functions (e.g., attribute evaluation, inverse square root, etc.). In one embodiment, SFUs 552 may include a tree traversal unit configured to traverse a hierarchical tree data structure. In one embodiment, SFUs 552 may include texture units configured to perform texture map filtering operations. In one embodiment, the texture units are configured to load texture maps (e.g., 2D arrays of texels) from memory 304 and sample the texture maps to produce sampled texture values for use in shader programs executed by SM 440. In one embodiment, the texture maps are stored in shared memory/L1 cache 570. The texture units implement texture operations, such as filtering operations using mip-maps (i.e., texture maps of varying levels of detail). In one embodiment, each SM440 includes two texture units.

Each SM440 also includes N LSUs 554 that implement load and store operations between shared memory/L1 cache 570 and register file 520. Each SM440 includes an interconnection network 580 that connects each functional unit to register file 520 and LSU554 to register file 520, shared memory/L1 cache 570. In one embodiment, interconnect network 580 is a crossbar that may be configured to connect any functional unit to any register in register file 520, and to connect LSU554 to a register file and a memory location in shared memory/L1 cache 570.

The shared memory/L1 cache 570 is an on-chip memory array that allows data storage and communication between the SM440 and the primitive engine 435 and between threads in the SM 440. In one embodiment, shared memory/L1 cache 570 comprises 128KB of storage capacity and is in the path from SM440 to partition unit 380. Shared memory/L1 cache 570 may be used for cache reads and writes. One or more of shared memory/L1 cache 570, L2 cache 460, and memory 304 are backing stores.

Combining data caching and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The capacity is usable as a cache by programs that do not use shared memory. For example, if the shared memory is configured to use half of the capacity, texture and load/store operations can use the remaining capacity. Integration within the shared memory/L1 cache 570 enables it to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth, low-latency access to frequently reused data.
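
A hedged example of steering the unified shared memory / L1 capacity from the CUDA runtime: the preferred-carveout attribute hints what fraction of the array should act as shared memory, leaving the rest available to the L1 cache. The kernel and the 50% value are arbitrary illustrations:

#include <cuda_runtime.h>

__global__ void myKernel(float* data) {
    data[threadIdx.x] *= 2.0f;    // placeholder body
}

void configureCarveout() {
    // Hint that about half of the unified array should serve as shared
    // memory; texture and load/store traffic can use the remaining capacity.
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 50);
}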

When configured for general-purpose parallel computing, a simpler configuration can be used compared to graphics processing. In particular, the fixed function graphics processing unit shown in FIG. 3 is bypassed, creating a simpler programming model. In a general-purpose parallel computing configuration, the work allocation unit 325 assigns and allocates thread blocks directly to the DPCs 420. The threads in the block execute the same program, use unique thread IDs in the computations to ensure that each thread generates a unique result, execute the program and perform the computations using the SM440, use the shared memory/L1 cache 570 to communicate between the threads, and use the LSU554 to read and write to global memory through the shared memory/L1 cache 570 and the memory partition unit 380. When configured for general purpose parallel computing, SM440 may also write commands that scheduler unit 320 may use to initiate new work on DPC 420.

The PPU300 may be included in a desktop computer, a laptop computer, a tablet computer, a server, a supercomputer, a smartphone (e.g., wireless, handheld device), a Personal Digital Assistant (PDA), a digital camera, a vehicle, a head-mounted display, a handheld electronic device, and so forth. In one embodiment, the PPU300 is included on a single semiconductor substrate. In another embodiment, the PPU300 is included on a system on a chip (SoC) with one or more other devices, such as an additional PPU300, memory 304, a Reduced Instruction Set Computer (RISC) CPU, a Memory Management Unit (MMU), a digital-to-analog converter (DAC), and so forth.

In one embodiment, PPU300 may be included on a graphics card that includes one or more memory devices 304. The graphics card may be configured to interface with a PCIe slot on a motherboard of the desktop computer. In yet another embodiment, the PPU300 may be an Integrated Graphics Processing Unit (iGPU) or a parallel processor contained in a chipset of a motherboard.

Exemplary computing System

Systems with multiple GPUs and CPUs are used in various industries as developers expose and exploit more parallelism in applications such as artificial intelligence computing. High performance GPU acceleration systems with tens to thousands of compute nodes are deployed in data centers, research institutions, and supercomputers to address larger problems. As the number of processing devices within high performance systems increases, communication and data transfer mechanisms need to be extended to support the increased bandwidth.

FIG. 5B is a conceptual diagram of a processing system 500 implemented using the PPU300 of FIG. 3, according to one embodiment. Exemplary system 500 may be configured to implement method 200 shown in fig. 2A. Processing system 500 includes a CPU530, a switch 555, and each of a plurality of PPUs 300, as well as a respective memory 304. NVLink310 provides a high-speed communication link between each PPU 300. Although a particular number of NVLink310 and interconnect 302 connections are shown in FIG. 5B, the number of connections to each PPU300 and CPU530 may vary. Switch 555 interfaces between interconnect 302 and CPU 530. PPU300, memory 304, and NVLink310 may be located on a single semiconductor platform to form parallel processing module 525. In one embodiment, switch 555 supports two or more protocols that interface between various different connections and/or links.

In another embodiment (not shown), NVLink310 provides one or more high speed communication links between each PPU300 and CPU530, and switch 555 interfaces between interconnect 302 and each PPU 300. PPU300, memory 304, and interconnect 302 may be located on a single semiconductor platform to form parallel processing module 525. In yet another embodiment (not shown), interconnect 302 provides one or more communication links between each PPU300 and CPU530, and switch 555 interfaces between each PPU300 using NVLink310 to provide one or more high-speed communication links between PPUs 300. In another embodiment (not shown), NVLink310 provides one or more high speed communication links between PPU300 and CPU530 through switch 555. In yet another embodiment (not shown), interconnect 302 provides one or more communication links directly between each PPU 300. One or more NVLink310 high speed communication links may be implemented as physical NVLink interconnects or on-chip or bare-die interconnects using the same protocol as NVLink 310.

In the context of this specification, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit fabricated on a die or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity that simulate on-chip operation and provide substantial improvements over utilizing a conventional bus implementation. Of course, the various circuits or devices may also be placed separately or in various combinations of semiconductor platforms, according to the needs of the user. Alternatively, the parallel processing module 525 may be implemented as a circuit board substrate, and each of the PPUs 300 and/or memories 304 may be a packaged device. In one embodiment, CPU530, switch 555, and parallel processing module 525 are located on a single semiconductor platform.

In one embodiment, the signaling rate of each NVLink310 is 20 to 25 Gbits/second, and each PPU300 includes six NVLink310 interfaces (as shown in FIG. 5B, five NVLink310 interfaces are included for each PPU300). Each NVLink310 provides a data transfer rate of 25 Gbits/second in each direction, with six links providing 300 Gbits/second. When CPU530 also includes one or more NVLink310 interfaces, NVLink310 may be used exclusively for PPU-to-PPU communication as shown in FIG. 5B, or for some combination of PPU-to-PPU and PPU-to-CPU communication.

In one embodiment, NVLink310 allows direct load/store/atomic access from CPU530 to memory 304 of each PPU 300. In one embodiment, NVLink310 supports coherency operations, allowing data read from memory 304 to be stored in the cache hierarchy of CPU530, reducing cache access latency of CPU 530. In one embodiment, NVLink310 includes support for Address Translation Services (ATS), allowing PPU300 direct access to page tables within CPU 530. One or more nvlinks 310 may also be configured to operate in a low power mode.

Fig. 5C illustrates an exemplary system 565 in which the various architectures and/or functionalities of the various previous embodiments may be implemented. Exemplary system 565 may be configured to implement method 200 shown in fig. 2A.

As shown, a system 565 is provided that includes at least one central processing unit 530 coupled to a communication bus 575. The communication bus 575 may be implemented using any suitable protocol, such as PCI (peripheral component interconnect), PCI-Express, AGP (accelerated graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The system 565 also includes a main memory 540. Control logic (software) and data are stored in main memory 540, and main memory 540 may take the form of Random Access Memory (RAM).

System 565 also includes an input device 560, a parallel processing system 525, and a display device 545, such as a conventional CRT (cathode ray tube), LCD (liquid crystal display), LED (light emitting diode), plasma display, and the like. User input may be received from an input device 560 (e.g., keyboard, mouse, touchpad, microphone, etc.). Each of the aforementioned modules and/or devices may even reside on a single semiconductor platform to form system 565. Alternatively, the various modules may also be placed separately or in various combinations of semiconductor platforms, as desired by the user.

Further, system 565 can be coupled for communication purposes to a network (e.g., a telecommunications network, a Local Area Network (LAN), a wireless network, a Wide Area Network (WAN) such as the internet, a peer-to-peer network, a cable network, etc.) through network interface 535.

System 565 can also include secondary storage (not shown). The secondary storage includes, for example, a hard disk drive and/or a removable storage drive representing a floppy disk drive, a magnetic tape drive, an optical disk drive, a digital versatile disk (DVD) drive, a recording device, or a universal serial bus (USB) flash memory. The removable storage drive reads from and/or writes to a removable storage unit in a well-known manner.

Computer programs, or computer control logic algorithms, may be stored in main memory 540 and/or secondary storage. Such computer programs, when executed, enable system 565 to perform various functions. Memory 540, storage, and/or any other storage are possible examples of computer-readable media.

The architecture and/or functionality of the various previous figures may be implemented in the context of a general purpose computer system, a circuit board system, a game console system dedicated for entertainment purposes, a dedicated system, and/or any other desired system. For example, system 565 may take the form of a desktop computer, laptop computer, tablet computer, server, supercomputer, smartphone (e.g., wireless, handheld device), Personal Digital Assistant (PDA), digital camera, vehicle, head-mounted display, handheld electronic device, mobile phone device, television, workstation, game console, embedded system, and/or any other type of logic.

While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Graphics processing pipeline

In one embodiment, PPU300 includes a Graphics Processing Unit (GPU). PPU300 is configured to receive commands specifying a shading program for processing graphics data. Graphics data may be defined as a set of primitives, such as points, lines, triangles, quadrilaterals, triangle strips, and so forth. Typically, a primitive includes data that specifies a plurality of vertices (e.g., in a model space coordinate system) of the primitive and attributes associated with each vertex of the primitive. The PPU300 may be configured to process the primitives to generate a frame buffer (e.g., pixel data for each of the pixels of the display).

The application writes model data (e.g., a set of vertices and attributes) for the scene to a memory, such as system memory or memory 304. The model data defines each of the objects that may be visible on the display. The application then makes an API call to the driver kernel, which requests the model data to be rendered and displayed. The driver kernel reads the model data and writes commands to one or more streams to perform operations to process the model data. These commands may reference different shading programs to be implemented on the SM440 of the PPU300, including one or more of vertex shading, shell shading, domain shading, geometry shading, and pixel shading. For example, one or more of the SMs 440 may be configured to execute a vertex shading program that processes a plurality of vertices defined by model data. In one embodiment, different SMs 440 may be configured to execute different shading programs simultaneously. For example, a first subset of SMs 440 may be configured to execute a vertex shading program, while a second subset of SMs 440 may be configured to execute a pixel shading program. The first subset of SMs 440 processes the vertex data to produce processed vertex data and writes the processed vertex data to L2 cache 460 and/or memory 304. After the processed vertex data is rasterized (e.g., converted from three-dimensional data to two-dimensional data in screen space) to generate fragment data, a second subset of the SMs 440 performs pixel shading to generate processed fragment data, which is then mixed with other processed fragment data and written to a frame buffer in memory 304. The vertex shading program and the pixel shading program may execute concurrently, processing different data from the same scene in a pipelined manner until all model data for the scene has been rendered to the frame buffer. The contents of the frame buffer are then transferred to the display controller for display on the display device.

FIG. 6 is a conceptual diagram of a graphics processing pipeline 600 implemented by the PPU300 of FIG. 3, according to one embodiment. Graphics processing pipeline 600 is an abstract flow diagram of the processing steps implemented to generate 2D computer-generated images from 3D geometric data. It is well known that pipelined architectures can more efficiently perform long-latency operations by dividing the operation into multiple stages, where the output of each stage is coupled to the input of the next successive stage. Thus, graphics processing pipeline 600 receives input data 601 that is passed from one stage to the next stage of graphics processing pipeline 600 to generate output data 602. In one embodiment, graphics processing pipeline 600 may be represented by an API-defined graphics processing pipeline. Alternatively, graphics processing pipeline 600 may be implemented in the context of the functionality and architecture of the previous figure and/or any subsequent figure or figures.

As shown in FIG. 6, graphics processing pipeline 600 comprises a pipeline architecture comprising a plurality of stages. These stages include, but are not limited to, a data assembly stage 610, a vertex shading stage 620, a primitive assembly stage 630, a geometry shading stage 640, a viewport scale, culling, and clip (VSCC) stage 650, a rasterization stage 660, a fragment shading stage 670, and a raster operations stage 680. In one embodiment, input data 601 includes commands that configure processing units to implement stages of graphics processing pipeline 600 and configure geometric primitives (e.g., points, lines, triangles, quadrilaterals, triangle strips, or sectors, etc.) to be processed by these stages. The output data 602 may include pixel data (i.e., color data) that is copied into a frame buffer or other type of surface data structure in memory.

The data assembly stage 610 receives input data 601 specifying vertex data for high-order surfaces, primitives, and the like. The data assembly stage 610 collects vertex data in temporary storage or queues, such as by receiving a command from a host processor that includes a pointer to a buffer in memory and reading the vertex data from the buffer. The vertex data is then passed to vertex shading phase 620 for processing.

Vertex shading phase 620 processes vertex data by performing a set of operations (e.g., a vertex shader or program) on each of the vertices, one vertex at a time. A vertex may, for example, be specified as a 4-coordinate vector (e.g., <x, y, z, w>) associated with one or more vertex attributes (e.g., color, texture coordinates, surface normal, etc.). Vertex shading phase 620 may manipulate individual vertex attributes such as position, color, texture coordinates, and the like. In other words, vertex shading phase 620 performs operations on the vertex coordinates or other vertex attributes associated with a vertex. These operations commonly include lighting operations (e.g., modifying color attributes of a vertex) and transformation operations (e.g., modifying the coordinate space of a vertex). For example, vertices may be specified using coordinates in an object coordinate space, which are transformed by multiplying the coordinates by a matrix that translates the coordinates from the object coordinate space into a world space or a normalized device coordinate (NDC) space. Vertex shading phase 620 generates transformed vertex data that is passed to primitive assembly phase 630.
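
A simple CUDA sketch of the core of a vertex shading operation: each thread multiplies one vertex position by a 4 × 4 transformation matrix (e.g., from object space to another coordinate space). The data layout and the row-major matrix convention are assumptions for illustration:

#include <cuda_runtime.h>

// One thread per vertex; m points to a 4x4 row-major transformation matrix.
__global__ void vertexTransform(const float4* inPos, float4* outPos,
                                const float* m, int numVertices) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numVertices) return;

    float4 v = inPos[i];
    float4 r;
    r.x = m[0]  * v.x + m[1]  * v.y + m[2]  * v.z + m[3]  * v.w;
    r.y = m[4]  * v.x + m[5]  * v.y + m[6]  * v.z + m[7]  * v.w;
    r.z = m[8]  * v.x + m[9]  * v.y + m[10] * v.z + m[11] * v.w;
    r.w = m[12] * v.x + m[13] * v.y + m[14] * v.z + m[15] * v.w;
    outPos[i] = r;
}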

Primitive assembly stage 630 collects the vertices output by vertex shading stage 620 and groups the vertices into geometric primitives for processing by geometry shading stage 640. For example, primitive assembly stage 630 may be configured to group every three consecutive vertices into geometric primitives (e.g., triangles) for communication to geometry shading stage 640. In some embodiments, a particular vertex may be reused for consecutive geometric primitives (e.g., two consecutive triangles in a triangle strip may share two vertices). Primitive assembly stage 630 passes the geometric primitives (e.g., the set of associated vertices) to geometry shading stage 640.

Geometry shading phase 640 processes geometric primitives by performing a set of operations (e.g., a geometry shader or program) on the geometric primitives. A tessellation (tessellation) operation may generate one or more geometric primitives from each geometric primitive. In other words, geometry shading stage 640 may subdivide each geometric primitive into a finer grid of two or more geometric primitives for processing by the remainder of graphics processing pipeline 600. The geometry shading stage 640 passes the geometric primitives to the viewport SCC stage 650.

In one embodiment, graphics processing pipeline 600 may operate within a streaming multiprocessor, and vertex shading stage 620, primitive assembly stage 630, geometry shading stage 640, fragment shading stage 670, and/or hardware/software associated therewith may sequentially perform the processing operations. Once the sequential processing operations are complete, in one embodiment, the viewport SCC stage 650 may utilize the data. In one embodiment, primitive data processed by one or more of the stages in graphics processing pipeline 600 may be written to a cache (e.g., an L1 cache, a vertex cache, etc.). In this case, in one embodiment, the viewport SCC stage 650 may access the data in the cache. In one embodiment, the viewport SCC stage 650 and the rasterization stage 660 are implemented as fixed function circuitry.

The viewport SCC stage 650 performs viewport scaling, culling, and clipping of the geometric primitives. Each surface being rendered is associated with an abstract camera position. The camera position represents the position of a viewer looking at the scene and defines a view frustum that encloses the objects of the scene. The view frustum may include a viewing plane, a back plane, and four clipping planes. Any geometric primitive that lies entirely outside the view frustum may be culled (e.g., discarded) because it will not contribute to the final rendered scene. Any geometric primitive that is partially inside and partially outside the view frustum may be clipped (e.g., converted into new geometric primitives that are enclosed within the view frustum). Furthermore, each geometric primitive may be scaled based on the depth of the view frustum. All potentially visible geometric primitives are then passed to the rasterization stage 660.
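
The following is a minimal sketch of the coarse culling test described above, assuming clip-space positions where the frustum is defined by -w ≤ x, y, z ≤ w; primitives that straddle a plane would instead be clipped into new primitives. The representation and thresholds are assumptions for illustration.

```cpp
// Illustrative sketch: a triangle can be culled if all three of its vertices
// lie outside the same frustum plane, as the viewport SCC stage 650 might test.
struct ClipVert { float x, y, z, w; };

static bool outsidePlane(const ClipVert& v, int plane) {
    switch (plane) {
        case 0: return v.x < -v.w;  case 1: return v.x > v.w;
        case 2: return v.y < -v.w;  case 3: return v.y > v.w;
        case 4: return v.z < -v.w;  default: return v.z > v.w;
    }
}

bool canCull(const ClipVert& a, const ClipVert& b, const ClipVert& c) {
    for (int p = 0; p < 6; ++p) {
        if (outsidePlane(a, p) && outsidePlane(b, p) && outsidePlane(c, p)) return true;
    }
    return false;  // partially inside: keep (and possibly clip) the primitive
}
```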

The rasterization stage 660 converts the 3D geometric primitives into 2D fragments (e.g., capable of being utilized for display, etc.). The rasterization stage 660 may be configured to set up a set of plane equations from the vertices of the geometric primitives, from which various attributes may be interpolated. The rasterization stage 660 may also compute a coverage mask for a plurality of pixels that indicates whether one or more sample positions of a pixel intersect the geometric primitive. In one embodiment, a z-test may also be performed to determine whether the geometric primitive is occluded by other geometric primitives that have already been rasterized. The rasterization stage 660 generates fragment data (e.g., interpolated vertex attributes associated with a particular sample position for each covered pixel), which is passed to the fragment shading stage 670.
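
As an illustration of the coverage determination described above, the sketch below uses edge functions to decide whether a single sample position is covered by a screen-space triangle; the edge-function formulation and winding assumption are illustrative choices, not the disclosed hardware.

```cpp
// Illustrative sketch: edge-function coverage test for one sample position,
// of the kind the rasterization stage 660 might evaluate per pixel sample.
struct P2 { float x, y; };

// Signed area term for edge (a -> b) versus point p; the sign tells which side p is on.
static float edge(const P2& a, const P2& b, const P2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Covered if the sample lies on the same side of all three edges
// (counter-clockwise winding assumed here).
bool covers(const P2& v0, const P2& v1, const P2& v2, const P2& sample) {
    return edge(v0, v1, sample) >= 0.0f &&
           edge(v1, v2, sample) >= 0.0f &&
           edge(v2, v0, sample) >= 0.0f;
}
```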

The fragment shading stage 670 processes fragment data by performing a set of operations (e.g., a fragment shader or program) on each of the fragments. The fragment shading stage 670 may generate pixel data (e.g., color values) for a fragment, such as by performing lighting operations or sampling texture maps using interpolated texture coordinates for the fragment. The fragment shading stage 670 generates pixel data, which is sent to the raster operations stage 680.
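
A minimal sketch of a lighting operation of the kind mentioned above is shown below, assuming unit-length interpolated normals and light direction; the names and the simple Lambertian model are assumptions for illustration.

```cpp
// Illustrative sketch: scale an interpolated surface color by the cosine of the
// angle between the surface normal and the light direction, as a fragment
// shader in stage 670 might do.
#include <algorithm>

struct V3 { float x, y, z; };

static float dot3(const V3& a, const V3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

V3 shadeDiffuse(const V3& baseColor, const V3& normal, const V3& lightDir) {
    float ndotl = std::max(0.0f, dot3(normal, lightDir));
    return { baseColor.x * ndotl, baseColor.y * ndotl, baseColor.z * ndotl };
}
```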

The raster operations stage 680 may perform various operations on the pixel data, such as performing alpha tests, stencil tests, and blending the pixel data with other pixel data corresponding to other fragments associated with the pixel. When the raster operations stage 680 has finished processing the pixel data (e.g., the output data 602), the pixel data may be written to a render target, such as a frame buffer, a color buffer, or the like.
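
The sketch below illustrates one common blend operation of the kind named above; the specific source-over blend mode and structure names are assumptions, since the disclosure does not prescribe a particular blend equation.

```cpp
// Illustrative sketch: source-over alpha blending of an incoming fragment color
// with the color already stored in the render target, one operation the raster
// operations stage 680 might perform.
struct RGBA { float r, g, b, a; };

RGBA blendSrcOver(const RGBA& src, const RGBA& dst) {
    float inv = 1.0f - src.a;
    return { src.r * src.a + dst.r * inv,
             src.g * src.a + dst.g * inv,
             src.b * src.a + dst.b * inv,
             src.a + dst.a * inv };
}
```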

It should be appreciated that one or more additional stages may be included in the graphics processing pipeline 600 in addition to or in place of one or more of the stages described above. Various implementations of the abstract graphics processing pipeline may implement different stages. Furthermore, in some embodiments, one or more of the stages described above may be excluded from the graphics processing pipeline (such as the geometry shading stage 640). Other types of graphics processing pipelines are contemplated as being within the scope of the present disclosure. Furthermore, any of the stages of the graphics processing pipeline 600 may be implemented by one or more dedicated hardware units within a graphics processor (such as the PPU 300). Other stages of the graphics processing pipeline 600 may be implemented by programmable hardware units, such as the SM 440 of the PPU 300.

The graphics processing pipeline 600 may be implemented via an application program executed by a host processor, such as a CPU. In one embodiment, a device driver may implement an application programming interface (API) that defines various functions that may be utilized by an application to generate graphical data for display. The device driver is a software program that includes a plurality of instructions that control the operation of the PPU 300. The API provides an abstraction for a programmer that allows the programmer to utilize specialized graphics hardware (such as the PPU 300) to generate graphics data without requiring the programmer to utilize the specific instruction set of the PPU 300. The application may include API calls that are routed to the device driver of the PPU 300. The device driver interprets the API calls and performs various operations in response to the API calls. In some cases, the device driver may perform operations by executing instructions on the CPU. In other cases, the device driver may perform operations, at least in part, by initiating operations on the PPU 300 using an input/output interface between the CPU and the PPU 300. In one embodiment, the device driver is configured to implement the graphics processing pipeline 600 using the hardware of the PPU 300.

Various programs may be executed within the PPU 300 in order to implement the various stages of the graphics processing pipeline 600. For example, the device driver may launch a kernel on the PPU 300 to perform the vertex shading stage 620 on one SM 440 (or multiple SMs 440). The device driver (or the initial kernel executed by the PPU 300) may also launch other kernels on the PPU 300 to perform other stages of the graphics processing pipeline 600, such as the geometry shading stage 640 and the fragment shading stage 670. In addition, some of the stages of the graphics processing pipeline 600 may be implemented on fixed-unit hardware, such as a rasterizer or a data assembler implemented within the PPU 300. It will be appreciated that results from one kernel may be processed by one or more intermediate fixed-function hardware units before being processed by a subsequent kernel on an SM 440.

Machine learning

Deep neural networks (DNNs) developed on processors, such as the PPU 300, have been used for diverse use cases: from self-driving cars to faster drug development, from automatic image captioning in image databases to smart real-time language translation in video chat applications. Deep learning is a technique that models the neural learning process of the human brain, continually learning, continually getting smarter, and delivering more accurate results more quickly over time. A child is initially taught by an adult to correctly identify and classify various shapes, eventually being able to identify shapes without any coaching. Similarly, a deep learning or neural learning system needs to be trained in object recognition and classification in order for it to get smarter and more efficient at identifying basic objects, occluded objects, etc., while also assigning context to objects.

At the simplest level, neurons in the human brain look at the various inputs they receive, assign a level of importance to each of these inputs, and pass the output on to other neurons for processing. An artificial neuron or perceptron is the most basic model of a neural network. In one example, a perceptron may receive one or more inputs representing various features of an object that the perceptron is being trained to recognize and classify, and each of these features is given a weight based on the importance of that feature in defining the shape of the object.
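
The sketch below illustrates the perceptron behavior just described, weighting each input feature, summing, and thresholding; the bias term, threshold, and names are assumptions for illustration.

```cpp
// Illustrative sketch: a single perceptron that scales each input by its weight
// (importance), sums the results with a bias, and fires if the sum is positive.
#include <vector>

float perceptron(const std::vector<float>& inputs,
                 const std::vector<float>& weights,
                 float bias) {
    float sum = bias;
    for (size_t i = 0; i < inputs.size() && i < weights.size(); ++i) {
        sum += inputs[i] * weights[i];  // each feature scaled by its importance
    }
    return sum > 0.0f ? 1.0f : 0.0f;    // threshold activation
}
```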

Deep Neural Network (DNN) models include multiple layers of many connected nodes (e.g., perceptrons, boltzmann machines, radial basis functions, convolutional layers, etc.), which can be trained with large amounts of input data to solve complex problems quickly and with high accuracy. In one example, the first layer of the DNN model decomposes the input image of the car into various parts and finds basic patterns (such as lines and corners). The second layer assembles the lines to look for higher level patterns such as wheels, windshields and mirrors. The next layer identifies the type of vehicle, and the last few layers generate labels for the input images to identify the model of a particular automobile brand.

Once the DNN is trained, it may be deployed and used to identify and classify objects or patterns in a process known as inference. Examples of inference (the process through which a DNN extracts useful information from a given input) include identifying handwritten numbers on checks deposited into ATM machines, identifying images of friends in photos, delivering movie recommendations to over fifty million users, identifying and classifying different types of automobiles, pedestrians, and road hazards in driverless cars, or translating human speech in real time.

During training, data flows through the DNN in a forward propagation phase until a prediction is produced that indicates a label corresponding to the input. If the neural network does not correctly label the input, errors between the correct label and the predicted label are analyzed, and the weights are adjusted for each feature during a backward propagation phase until the DNN correctly labels the input and the other inputs in the training dataset. Training complex neural networks requires massive amounts of parallel computing performance, including the floating-point multiplications and additions supported by the PPU 300. Inference is less compute-intensive than training; it is a latency-sensitive process in which a trained neural network is applied to new inputs it has not seen before in order to classify images, translate speech, and generally infer new information.
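
The following is a minimal sketch of the forward-then-backward weight adjustment described above, reduced to a single linear neuron with a squared-error objective; the learning rate, update rule, and names are assumptions chosen only to make the loop concrete.

```cpp
// Illustrative sketch: one training step for a single linear neuron.
// Forward propagation produces a prediction; backward propagation adjusts each
// weight in proportion to its contribution to the error.
#include <vector>

void trainStep(std::vector<float>& weights, const std::vector<float>& inputs,
               float target, float learningRate) {
    // Forward propagation: compute the prediction from the current weights.
    float prediction = 0.0f;
    for (size_t i = 0; i < weights.size(); ++i) prediction += weights[i] * inputs[i];

    // Backward propagation: adjust each weight to reduce the labeling error.
    float error = prediction - target;
    for (size_t i = 0; i < weights.size(); ++i) {
        weights[i] -= learningRate * error * inputs[i];
    }
}
```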

Neural networks rely heavily on matrix math operations, and complex multi-layered networks require tremendous amounts of floating-point performance and bandwidth for both efficiency and speed. With thousands of processing cores optimized for matrix math operations and delivering tens to hundreds of TFLOPS of performance, the PPU 300 is a computing platform capable of delivering the performance required for deep-neural-network-based artificial intelligence and machine learning applications.

Exemplary technical advantages of some embodiments

Certain exemplary embodiments provide for rapid slowing of the clock frequency, so that the processor can effectively reduce the operating frequency only for noisy instructions while enabling quiet instructions to be executed at or near the highest nominal frequency for the voltage level.
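
Purely as a hypothetical software analogy of the behavior just described (the disclosed technique operates in hardware before instruction execution; the instruction names, table, and offset values below are all assumptions, not part of the disclosure), the selection of a reduced frequency for noisy instruction types might be sketched as follows.

```cpp
// Hypothetical sketch: choose a clock-frequency target based on whether an
// upcoming instruction is classified as a noisy (high-power) type.
#include <string>
#include <unordered_map>

// Assumed mapping from instruction type to a frequency offset in MHz.
const std::unordered_map<std::string, int> kNoiseOffsetMHz = {
    {"matrix_multiply", 150},  // example noisy type and offset (assumed values)
    {"double_fma", 100},
};

int selectClockMHz(const std::string& instrType, int nominalMHz) {
    auto it = kNoiseOffsetMHz.find(instrType);
    // Quiet instructions run at or near the nominal frequency; noisy ones are
    // slowed by their mapped offset before execution.
    return (it == kNoiseOffsetMHz.end()) ? nominalMHz : nominalMHz - it->second;
}
```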

Compared to solutions such as NAFLL, which are designed to address the frequency effects of noise but which address noise seen over a large area, the exemplary embodiments also accommodate highly localized noise. In addition, the exemplary embodiments allow flexibility in IR drop closure, since known power-intensive units can be closed to a looser IR drop budget, with the expectation that the clock is quickly slowed to a lower frequency when instructions are executed on the power-intensive unit. Relaxing the IR drop budget may alleviate routing congestion and prevent or reduce die area growth.

Many modifications and variations of the present invention are possible in light of the above teachings. It is, therefore, to be understood that within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.
