Techniques to configure a processor to efficiently execute instructions
Reading note: this technique, "Techniques to configure a processor to efficiently execute instructions," was created on 2019-08-08 by A. Naik, T. Raja, K. Wilde, R. Selvasan, D. Ramakrishnan, D. Rodriguez, and B. Falk (names transliterated from the Chinese record). Its main content is as follows: The present invention provides a technique for configuring a processor to efficiently execute instructions. In particular, systems and techniques are provided for improving circuit performance while accommodating the dynamic voltage drops caused by executing noisy instructions (e.g., high power consumption instructions). Performance is improved by selectively slowing the operating frequency for each type of noisy instruction. An exemplary technique controls a clock by detecting an instruction predicted to be of a predetermined noisy type having a predefined noise characteristic (e.g., a high level of noise generated on a voltage rail of the circuit due to the larger amount of current the instruction consumes) and, in response to the detection, reducing the frequency of the clock. The detection occurs prior to execution of the instruction. Varying the frequency according to instruction type enables the circuit to operate at a high frequency even when some workloads include instructions whose operating frequency is slowed.
1. A method of controlling a clock of a processor, comprising:
detecting an instruction predicted to have a predefined noise characteristic, the detecting occurring prior to execution of the instruction by the processor; and
changing a frequency of the clock in response to the detecting.
2. The method of claim 1, wherein the changing the frequency comprises: reducing the frequency by an offset determined to compensate for a predicted voltage drop corresponding to the predefined noise characteristic.
3. The method of claim 2, further comprising:
executing the instruction at least partially when the clock is operating at the reduced frequency; and
raising the frequency of the clock after the executing.
4. The method of claim 3, further comprising subjecting the raising to a hysteresis process.
5. The method of claim 4, wherein the reducing is performed without undergoing a hysteresis process.
6. The method of claim 2, wherein the offset is determined by a programmable mapping of instructions to frequency offsets.
7. The method of claim 1, wherein the detecting comprises: determining that the type of the instruction matches a type in a predetermined set of instruction types.
8. The method of claim 7, wherein the predetermined set of instruction types includes at least one high power instruction type predicted to consume more than a threshold power level.
9. The method of claim 7, wherein the predetermined set of instruction types includes at least two groups of instruction types, each of the two groups having a mapping to a respective frequency offset,
wherein the detecting further comprises determining to which of the at least two groups the instruction belongs, and
wherein the changing further comprises adjusting the frequency according to the respective frequency offset mapped to the determined one of the at least two groups.
10. The method of claim 9, wherein at least one of the composition of the at least two groups or the respective frequency offsets is programmable.
11. The method of claim 1, further comprising:
configuring the clock to operate at a default frequency,
wherein the detecting occurs while the clock is operating at the default frequency, and wherein the default frequency is determined from other instructions predicted to have a lower noise level than the detected instruction.
12. The method of claim 1, wherein the detecting occurs during decoding of the instruction and the changing occurs before the processor executes the decoded instruction.
13. The method of claim 12, further comprising:
in response to the detecting, stalling the instruction prior to execution; and
releasing the stall of the instruction in response to the frequency being changed or a timer signal.
14. The method of claim 1, wherein changing the frequency of the clock comprises: changing a frequency of a clock source for a predetermined fixed number of clock cycles after receiving a clock change signal at the clock source.
15. The method of claim 14, wherein changing the frequency of the clock further comprises: identifying a target frequency from a look-up table, and setting the clock source to the target frequency.
16. The method of claim 15, wherein changing the frequency of the clock further comprises: adjusting the frequency according to a noise-aware frequency-locked loop technique.
17. A system comprising processing circuitry and a clock generator that controls an operating frequency of the processing circuitry, the processing circuitry configured to:
detect an instruction predicted to have a predefined noise characteristic, the detecting occurring prior to execution of the instruction by the processing circuitry; and
in response to the detecting, signal the clock generator to change a frequency of the clock generator.
18. The system of claim 17, further comprising a mapping from a type of the instruction to a frequency offset determined to compensate for a predicted voltage drop corresponding to the predefined noise characteristic, wherein the signaling comprises including the frequency offset in the signaling.
19. The system of claim 17, wherein the detecting is performed during decoding of the instruction, and the processing circuitry is further configured to execute the instruction after the changing the frequency.
20. The system of claim 17, wherein the detecting comprises: determining that the type of the instruction matches a type in a predetermined set of instruction types.
21. The system of claim 20, further comprising at least first and second signaling paths between the processing circuitry and the clock generator, wherein the predetermined set of instruction types includes at least first and second groups of instruction types associated with the first and second signaling paths, respectively, each of the two groups having a mapping to a respective frequency offset,
wherein the detecting further comprises determining to which of the at least two groups the instruction belongs, and
wherein the changing further comprises adjusting the frequency according to the respective frequency offset mapped to the determined one of the at least two groups.
22. The system of claim 17, wherein changing the frequency comprises: changing the frequency of a clock source for a predetermined fixed number of clock cycles after receiving a clock change signal at the clock generator.
23. The system of claim 22, wherein varying the frequency further comprises: identifying a target frequency from a look-up table, and setting the clock source to the target frequency.
24. The system of claim 23, wherein varying the frequency further comprises: adjusting the frequency according to a Noise Aware Frequency Locked Loop (NAFLL) technique.
25. The system of claim 17, wherein the processing circuitry comprises an instruction pipeline configured to prevent execution of the instruction until a predetermined signal is received from the clock generator or a timer.
26. The system of claim 17, wherein the system further comprises a plurality of processor cores comprising the processing circuitry, wherein each of the processor cores receives the frequency from the clock generator, and wherein the clock generator is configured to:
receive signals from one or more of the plurality of processor cores, and
vary the frequency in accordance with a selected subset of the received signals.
27. The system of claim 17, wherein the processing circuitry and the clock generator are formed on an integrated circuit.
Technical Field
The present disclosure relates to power management in processors and, more particularly, to varying clock speed based on the type of instruction being executed. More particularly, the present disclosure relates to dynamically controlling clock frequency, in a manner that accommodates noise, to address the voltage drops associated with executing certain types of instructions.
Background
Modern processors are very complex, containing millions of transistors on a single substrate. Although each individual transistor consumes only a small amount of power, the millions of transistors on the substrate together consume a large amount of power. This power consumption generates heat and draws a large amount of current from the power supply. Power delivery is therefore an important aspect of modern processor design.
The wiring that delivers power on the processor substrate forms a power delivery network. The power delivery network includes, for example, power rails for supplying current at appropriate voltages to different components on the processor chip. The power delivery network affects many aspects of performance, such as the amount of noise on the voltage rails. The term "noise" as used in this document refers to voltage transients that occur on the voltage rails. While the power supply may be designed to be almost noise free, the processor itself may generate noise on its internal power delivery network. For example, executing a high power instruction (which switches many transistors on the chip simultaneously or nearly simultaneously) can result in transient voltage fluctuations on the voltage rails. Excessive noise on the voltage rails can cause various circuit operational failures, such as the processor not being able to obtain the power (current and/or voltage) required for its operating frequency, resulting in timing errors, erroneous switching, and the like.
In more detail, one or more voltage rails are powered from a power source (such as a power supply to the processor). A voltage regulator typically controls the voltage on each voltage rail to one or more pads of the processor (or of an integrated circuit containing the processor) according to the power requirements of the various components of the processor. Metal vias and traces distribute the power received at the pads throughout the processor (and/or integrated circuit). One or more voltage regulators may dynamically raise or lower the supply voltage according to changes in the workload, and thus in the power requirements, of the processor. The nominal constant voltage of the rail leading to the processing core is called "Vdd".
As processors become faster and smaller, the design of the power delivery network within the chip becomes more complex. Generally, the faster a processor runs, the more power it consumes. Conversely, slowing down a processor reduces the amount of power required by its various functional units. The characterization of processor power consumption P makes this relationship between power and frequency clear: P = C·V²·f, where C, V, and f represent the switched capacitance of the processor, the voltage seen by the processor, and the operating frequency of the processor, respectively.
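As a purely illustrative sketch (the component values below are hypothetical, not taken from any real processor), the scaling of dynamic power with frequency can be checked numerically:

```python
def dynamic_power(c_farads, v_volts, f_hertz):
    """Dynamic power P = C * V^2 * f: switched capacitance times
    voltage squared times operating frequency."""
    return c_farads * v_volts ** 2 * f_hertz

# Hypothetical values: 1 nF effective switched capacitance,
# a 0.9 V rail, and a 1.8 GHz clock.
p_full = dynamic_power(1e-9, 0.9, 1.8e9)
p_half = dynamic_power(1e-9, 0.9, 0.9e9)
# Power is linear in f: halving the clock halves the dynamic power.
```

This linearity in f is what makes slowing the clock an effective lever against instruction-induced current draw.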
The speed at which a circuit operates is affected by the voltage at which it is driven. Voltage fluctuations (noise) on the voltage rails can cause the voltage seen by one or more processor circuits to vary from the nominal voltage. The main factor affecting the noise amplitude is the workload that the circuit (or other circuits on the chip) is handling. Noise has two main components: di/dt and IR drop. The di/dt noise depends on the rate of change of current over time, while the IR drop (also referred to as the dynamic resistive drop) depends only on the magnitude of the current I drawn by a particular instruction when executed. Noise-aware frequency-locked loop (NAFLL) techniques can be used to account for the di/dt component of noise. The techniques described in this disclosure are primarily directed to IR drop.
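The two noise components can be sketched with hypothetical rail parameters (the resistance and inductance values below are invented for illustration only, not measurements of any real power delivery network):

```python
def ir_drop(current_amps, rail_resistance_ohms):
    """IR-drop component: depends only on the magnitude of the drawn current."""
    return current_amps * rail_resistance_ohms

def di_dt_drop(rail_inductance_henries, delta_current_amps, delta_time_seconds):
    """di/dt component: depends on the rate of change of current (L * di/dt)."""
    return rail_inductance_henries * (delta_current_amps / delta_time_seconds)

# Hypothetical rail: 5 milliohm resistance, 10 picohenry inductance.
v_ir = ir_drop(20.0, 0.005)            # steady 20 A draw
v_l = di_dt_drop(10e-12, 10.0, 1e-9)   # 10 A current step over 1 ns
```

Note that a steady high-current instruction contributes only to the IR term, which is why a predictive, instruction-type-based response is useful where a di/dt tracker alone would not react.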
Different instructions cause different numbers of transistors to switch, consuming different amounts of current and thus producing different amounts of IR drop on the voltage rails. By way of a rough analogy, consider a muggy midsummer workday on which the thermometer reads 100 degrees Fahrenheit (38 degrees Celsius). When everyone in a city comes home from work, the first thing they do is turn on the air conditioner to cool their living space. The grid may not be designed to handle such a high peak current draw, resulting in a "brownout", i.e., voltage fluctuations that deviate from the nominal 110 VAC the grid is designed to deliver. Now consider this in microscopic detail on a modern processor chip, where millions of transistors switch billions of times per second. It is as if the air conditioners of hundreds of cities were being switched on and off synchronously many times per second. The resulting power fluctuations would generate a great deal of noise on the cities' power transmission network.
As processors become smaller and more complex, the increase in power density between circuits may further exacerbate problems associated with noise generated on the voltage rails by executing certain instructions.
In some processors, such as Nvidia Graphics Processing Units (GPUs) from the past few years, the operating frequency (i.e., speed) at a voltage Vdd is set conservatively so that even the noisiest expected workload results in proper operation. In addition, features such as NAFLL have been implemented to address the effects of noise on the power distribution network by tracking noise on the voltage rail and dynamically changing the clock frequency in response. In particular, when a voltage drop is detected on the voltage rail, the NAFLL mechanism reduces the clock frequency (i.e., the rate at which the processor's transistors switch) to allow the transistors more time to operate properly and to ensure that there are no circuit failures. This is effective because a transistor requires a higher voltage to switch faster; reducing the switching rate enables the transistors to operate at a lower, reduced voltage level. When the drop disappears, the clock frequency can be increased again. For example, U.S. Patent Publication No. 2018/0123604 describes an example NAFLL that follows this principle.
However, as processors continue to become smaller, faster, and more complex, techniques such as NAFLL may not fully address the problems and inefficiencies associated with noise on the voltage rails. Therefore, further improvements are desired to adapt the processor to the noise effects of different instruction types.
Disclosure of Invention
The illustrative embodiments correct some of the deficiencies of the above-described techniques for addressing noise-related voltage drops in a processor. Systems and methods are disclosed for controlling the operation of a processor in a manner that accommodates noise on a voltage rail.
The illustrative embodiments provide a method of controlling a clock of a processor. The method includes detecting an instruction predicted to have a predefined noise characteristic, and in response to the detecting, changing a frequency of a clock. The detection may occur prior to execution of the instructions by the processor.
According to another exemplary embodiment, a system is provided that includes a processing circuit and a clock generator that controls an operating frequency of the processing circuit. The processing circuit is configured to detect an instruction predicted to have a predefined noise characteristic and, in response to the detection, signal the clock generator to change a frequency of the clock generator. This detection may occur prior to execution of the instruction in the processing circuitry.
Drawings
FIG. 1A illustrates a block diagram of a processing system including a fast clock slow down circuit, according to some example embodiments.
FIG. 1B illustrates a block diagram of a processor implementing fast clock slowing in accordance with certain example embodiments.
FIG. 1C is another example processor implementing fast clock slowing according to some example embodiments.
Fig. 2A illustrates a process for performing fast clock slowing according to some example embodiments.
FIG. 2B is a timing diagram illustrating a clock slow down signal and a corresponding acknowledge signal that may be used in a processor incorporating fast clock slow down, according to some demonstrative embodiments.
Fig. 2C illustrates a change in clock frequency over time when performing fast clock slowing, according to some demonstrative embodiments.
FIG. 2D shows a timing diagram illustrating stopping execution of instructions during fast clock slow down, according to some example embodiments.
FIG. 3 illustrates a parallel processing unit according to one embodiment.
FIG. 4A illustrates a general processing cluster within the parallel processing unit of FIG. 3, according to one embodiment.
FIG. 4B illustrates a memory partition unit of the parallel processing unit of FIG. 3, according to one embodiment.
FIG. 5A illustrates the streaming multiprocessor of FIG. 4A, according to one embodiment.
FIG. 5B is a conceptual diagram of a processing system implemented using the Parallel Processing Unit (PPU) of FIG. 3, according to one embodiment.
Fig. 5C illustrates an exemplary system in which the various architecture and/or functionality of the various previous embodiments may be implemented.
FIG. 6 is a conceptual diagram of a graphics processing pipeline implemented by the PPU of FIG. 3, according to one embodiment.
Detailed Description
Certain exemplary embodiments of the present invention dynamically control the operating frequency of a processor and/or integrated circuit to reduce inefficiencies inherent in conventional techniques for handling the dynamic voltage drops caused by the noise certain instructions generate on the power distribution network. For example, some embodiments provide for detecting noisy workloads based on instructions in the instruction pipeline and quickly slowing the clock frequency before the detected noisy instructions execute. Further exemplary embodiments provide for detecting and responding to such noisy workloads in local processing regions whose noise is not detected by a technique such as NAFLL.
As mentioned in the Background section above, in some current commercial processors the operating frequency is set so that even the noisiest expected workload can execute at a given nominal or set voltage (i.e., the frequency is reduced to compensate for the worst possible drop from the nominal voltage). However, this approach is suboptimal because it forces a frequency margin to be maintained for all other workloads that produce less noise than the worst-case (e.g., noisiest) workload. That is, by setting the frequency according to the noisiest workload, which requires a lower frequency for stable operation, even less noisy instructions are forced to execute at the reduced frequency. For example, in one commercially available processor, at Vdd = 900 mV the operating frequency for all instructions is set to 1871 MHz based on the voltage drop predicted to be caused by the noisiest instruction type, while all other instruction types could operate safely at a faster frequency of 1886 MHz, meaning that approximately 15 MHz of frequency margin remains unused whenever instructions other than the noisiest type are being executed. For various values of Vdd, similar margins in the range of 2-26% have been observed. This slows the effective processing rate, which affects how quickly a high-performance processor can complete complex tasks.
Furthermore, in NAFLL implementations in existing processors, voltage/noise sensing is performed only at the locations on the chip where the NAFLL logic is physically located. Thus, NAFLL may miss noise/voltage drops in other parts of the chip that are localized and do not propagate to the NAFLL sensing location. In particular, IR drop can be a highly localized phenomenon: circuits on the chip with high current density experience higher IR drops than are detected at the NAFLL. Noise that is not detected by the NAFLL, and therefore not compensated for, must be margined for (flagged), and has been shown to cause a frequency loss of about 3-4% for workloads that produce lower noise. Further, the response of the NAFLL occurs only after the noise/voltage drop has occurred.
The exemplary embodiments proactively slow the clock frequency based on the type of instruction to accommodate expected or predicted noise (consisting primarily of IR drop) on the voltage rails. In contrast to NAFLL, which responds after the noise/voltage drop has occurred, the exemplary embodiments are predictive and respond to a predicted noise/voltage drop before it occurs. In some embodiments, the clock frequency slowing compensates for predicted noise that is not compensated for by the NAFLL. By selectively slowing one or more clocks only for instruction types expected to produce high levels of noise, the exemplary embodiments allow instructions producing lower noise to be clocked at a faster rate than the noisy instructions, thereby improving performance for at least the lower-noise instructions. Thus, for a typical workload in which most instructions are of a lower-noise type, the illustrative embodiments may also improve the overall performance of the processor and/or of an integrated circuit containing the processor.
Some exemplary embodiments take advantage of the following characteristic: the large inrush currents that cause increased noise on the voltage rails occur during instruction execution, after the instruction has been decoded. This allows one or more clocks to be modulated based on the decoded instruction but prior to the start of its execution, proactively shifting the clocks to account for the frequency loss due to noise generated during instruction execution. By rapidly modulating one or more clocks to reduce the frequency before executing certain noisy instructions, exemplary embodiments allow the processor to operate at the reduced frequency only while executing those noisy instructions.

Fast Clock Slow Down System Based on Instruction Type
FIG. 1A is a block diagram of a processing system including a fast clock slow down circuit, according to some example embodiments.
The processor 102 receives power from one or more voltage rails controlled by one or more voltage regulators.
If the reduced frequency required for noisy instructions were set as the clock frequency for all instructions executing on processor 102, as described above, the unused voltage and frequency headroom during execution of less noisy instructions would result in inefficiency. The illustrative embodiments avoid setting the clock frequency based on the worst-case noisy instructions by taking advantage of the characteristic that the noise increase occurs during instruction execution, after the instruction has been decoded in the instruction pipeline.
The
The instruction detection and
Note that while the exemplary embodiment detects instructions that may cause some level of noise, the embodiment compensates for frequency loss when running high power instructions. The frequency loss may also be due to jitter and/or other second order components, but most of the frequency loss may be due to noise.
The instruction detection and signaling block 118 may also include logic for communicating signaling with the clock logic.
Additionally, in at least some embodiments, the instruction detection and signaling block 118 may also include interlock logic to stall the instruction pipeline.
Although fig. 1A shows
FIG. 1B illustrates a block diagram of a processor 130, according to some demonstrative embodiments. According to some embodiments, processor 130 may be identical to processor 102, but with the clock logic located on the same die. As in FIG. 1A, the processor 130 receives power through one or more voltage rails controlled by one or more voltage regulators; neither the voltage rails nor the voltage regulators are shown separately in FIG. 1B.
Processor 130 includes one or more processing cores 132 (e.g., core a and core B), and clock logic 133. Clock logic 133 provides clock signal 158 to control the operating frequency of one or more processing cores 132.
The noisy instruction detect and stall block 136 is configured to detect certain predetermined instruction types 141 among the instructions 140 in the instruction pipeline 134 prior to an execution stage of the pipeline, where the instructions are executed by the one or more processor cores 132. The predetermined instruction types 141 to be detected (also referred to as "noisy instruction types") may be hard-coded and/or specified in a programmable table. According to some embodiments, the instructions detected for clock slowing comprise certain high power instructions. That is, the detected instructions are instructions expected to consume an amount of power above a certain threshold. For example, some matrix multiply instructions utilize more functional units in the datapath and/or use datapaths with very high gate densities, and are therefore more prodigious power consumers than many other instruction types. For example, according to one embodiment, a half-precision matrix multiply accumulate (HMMA) instruction may utilize a dedicated high-gate-density HMMA data path in the SM (as described with respect to FIG. 4A), which consumes more power than many instructions and which is also prone to generate more noise on the voltage rails than many other instructions. The noise of the HMMA instruction and the resulting voltage drop may be localized to the vicinity of the HMMA data path. HMMA instructions are described, for example, in U.S. Application No. 15/826,435, filed on November 29, 2017, which is incorporated herein by reference in its entirety. Thus, block 136 may monitor the instruction pipeline to detect when a high power instruction is decoded (or being decoded). However, some embodiments may be configured to detect one or more noisy instructions that are not necessarily high power instructions but that generate a significant amount of noise on the voltage rails (e.g., due to the density of concurrently activated portions of the data path, etc.).
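The matching of decoded instructions against a programmable table of noisy instruction types might be modeled behaviorally as follows (Python standing in for hardware comparator logic; HMMA is named in the text, while the other opcode names are hypothetical placeholders):

```python
# Programmable set of noisy instruction types. In real hardware this would
# live in configuration registers, not a Python set.
NOISY_INSTRUCTION_TYPES = {"HMMA", "IMMA", "DMMA"}

def is_noisy(decoded_opcode, noisy_types=NOISY_INSTRUCTION_TYPES):
    """Return True if the decoded instruction matches a predetermined
    noisy instruction type, triggering a clock slow down request."""
    return decoded_opcode in noisy_types
```

Because the table is programmable, the set of types treated as noisy can be tuned per chip or per application without changing the detection logic itself.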
The clock control signaling block 138 is configured to assert the clock slow down signal 154 in response to detection by the noisy instruction detection block 136. Clock slow down signal 154 is a request to clock logic 133 to slow down clock signal 158 provided to processor core 132.
In some embodiments, the clock logic 133 may provide a clock slow down acknowledge signal 156 indicating that it has achieved the requested clock slow down in response to receiving the clock slow down request signal 154.
In some embodiments, the noisy instruction detection and stall block 136 also includes logic for stalling the instruction pipeline 134, in response to asserting the clock slow down request signal 154, until it is determined that the clock frequency has slowed as requested. Upon receipt of the acknowledgement signal 156, the pipeline may be unstalled so that the detected noisy instruction can be dispatched for execution. Stalling helps ensure that a detected noisy instruction is not executed before the clock has been slowed in response to the detection. In some embodiments, where the latency between instruction decode and execution is too small for the clock slow down to take effect, block 136 may also include logic for stalling.
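The stall condition just described can be sketched as a simple predicate (a behavioral model, not the actual interlock circuitry): the pipeline holds the noisy instruction while a slow down has been requested but neither the acknowledgement nor a timeout has arrived.

```python
def should_stall(slowdown_requested, ack_received, timer_expired):
    """Hold the detected noisy instruction before the execute stage until
    the clock logic acknowledges the slow down or a timer releases it."""
    return slowdown_requested and not (ack_received or timer_expired)
```

The timer term reflects the claimed alternative of releasing the stall on a timer signal rather than an explicit acknowledgement.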
In the embodiment shown in FIG. 1B, clock logic 133 is provided on the same die as one or more processor cores 132. However, in some embodiments, clock logic 133 may not be located on the same die as the processor core that detects and/or executes the noisy instructions.
The signaling paths for signals 154 and 156 between the one or more processor cores 132 and clock logic 133 are configured for low latency. In an embodiment, it is highly desirable to keep the latency on the signaling path low so that the slow down takes effect quickly and the instruction pipeline does not have to be stalled. In some embodiments, request/acknowledge messages (e.g., REQ/ACK bits) may be moved between the SM and the clock logic through retiming stages. Such retiming stages may be custom selected and/or custom placed, for example. Between retiming hops, data may be routed in a manner similar to a trunk clock, for example, by driving wide upper-metal-layer nets through large clock buffers, using custom routing rules that avoid expensive detours. As an example, in some embodiments the latency between decode and execution may be on the order of 12 cycles, and the latency between the request and the clock change may be on the order of 10 cycles. Thus, in an exemplary embodiment, the clock slowing occurs before the decoded instruction reaches execution.
Clock logic 133 may include a clock generator block 142 and a clock update block 148. In some embodiments, the clock generator block 142 may include a clock source, such as a digital voltage controlled oscillator (DVCO), and/or a NAFLL mechanism, such as described in U.S. Patent Application No. 15/340,901 (U.S. Patent Publication No. 2018/0123604), filed on November 1, 2016, the entire contents of which are incorporated herein by reference.
However, the clock frequency change technique described in U.S. Patent Application No. 15/340,901 is not sufficient to meet the timing constraint of the rapid slow down required in some exemplary embodiments, namely achieving the slow down within the short interval between the decode and execute pipeline stages. In some embodiments, the clock frequency change is implemented by performing a lookup in a lookup table, such as either of tables 144 and 146. In embodiments where the clock frequency is configured to be either "full speed" (e.g., the maximum possible operating frequency with no margin for voltage noise) or "reduced speed" (e.g., the maximum possible operating frequency with margin for noise), lookup table 144 may provide a voltage-frequency curve (VF curve) for full-speed operation and table 146 may provide a VF curve for reduced-speed operation. In response to the clock slow down request 154, some embodiments may change the frequency at the current voltage to that given by the VF curve in table 146 (reduced-speed operation).
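Selecting a target frequency from the full-speed or reduced-speed VF table might be sketched as follows. The 900 mV entries echo the 1886/1871 MHz example from the Background section; the remaining table entries are hypothetical.

```python
# Hypothetical VF curves: rail voltage in mV -> clock frequency in MHz.
VF_FULL_SPEED = {800: 1700, 900: 1886, 1000: 2050}     # no noise margin
VF_REDUCED_SPEED = {800: 1690, 900: 1871, 1000: 2030}  # with noise margin

def target_frequency_mhz(vdd_mv, slow_down_requested):
    """Look up the clock target on the appropriate VF curve."""
    table = VF_REDUCED_SPEED if slow_down_requested else VF_FULL_SPEED
    return table[vdd_mv]
```

A table lookup like this is fast enough to fit in the short decode-to-execute window, which is the point of preferring it over a slower closed-loop retune.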
In another embodiment, there may be no lookup table, and the clock source or other clock generator may simply be set to the requested slowed-down frequency.
At block 152, incoming clock slow down requests may be aggregated. For example, block 152 determines which of the slow down requests 154 received from the various processing cores 132 requests the greatest slow down, and forwards that request to the clock generator block 142. This ensures that the adjusted clock frequency is at all times equal to or lower than the clock frequency required to safely execute all in-flight instructions.
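The aggregation reduces to taking the largest requested slow down; a minimal sketch (offsets in MHz are hypothetical):

```python
def aggregate_slowdown_mhz(requested_offsets_mhz):
    """Pick the largest slow down requested by any core so the shared clock
    is safe for every in-flight instruction; 0 means no request pending."""
    return max(requested_offsets_mhz, default=0)
```

Taking the maximum is the conservative choice: the shared clock satisfies every core's request simultaneously, at the cost of slowing cores that asked for less.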
The hysteresis block 150 may be located in the signal path 154 between the aggregation block 152 and the clock generator block 142. In some example embodiments, although a clock slow down request is forwarded to the clock generator block 142 immediately (i.e., sent to the clock generator without being subjected to hysteresis), the de-assertion of the clock slow down may be subjected to a hysteresis process before being forwarded to the clock generator block 142. Subjecting the de-assertion to hysteresis may minimize or entirely avoid switching the clock frequency back and forth between a nominal normal operating frequency and one or more reduced frequencies.
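The asymmetric treatment (slow down immediately, delay the speed-up) can be modeled cycle by cycle; the hold-cycle count below is an arbitrary illustrative parameter, not a value from the disclosure:

```python
class SlowDownHysteresis:
    """Slow-down assertions take effect immediately; de-assertion must
    persist for hold_cycles before the clock may speed back up."""

    def __init__(self, hold_cycles):
        self.hold_cycles = hold_cycles
        self.quiet_cycles = 0
        self.slowed = False

    def step(self, request_asserted):
        if request_asserted:
            self.slowed = True       # forwarded immediately, no hysteresis
            self.quiet_cycles = 0
        elif self.slowed:
            self.quiet_cycles += 1
            if self.quiet_cycles >= self.hold_cycles:
                self.slowed = False  # speed up only after a quiet period
        return self.slowed
```

The asymmetry is safety-motivated: slowing late risks a timing failure, while speeding up late costs only a little performance.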
The clock update block 148 controls updating of the clock and may also generate the clock slow down acknowledge signal 156 when a clock slow down has been achieved. The clock update block 148 may initiate an update of the clock frequency by the clock generator 142 in response to receiving an assertion or de-assertion of the clock slow down signal 154.
FIG. 1C illustrates a block diagram of another example processor implementing fast clock slowing in accordance with some example embodiments. The embodiment shown in FIG. 1C achieves a finer level of slowdown, thereby achieving better performance with workloads having instructions that consume significantly varying amounts of power (e.g., when compared to an implementation having only a single level of frequency slowing).
It has been observed that the amount of power consumed by various computing instructions used during certain applications varies significantly when the instructions are executed in the data path. Such variations in power consumed by various instruction types are expected to result in variations in noise and IR drop for different instruction types. On some example processors, the variation may be in the range of 10-12% depending on the instruction type. According to an estimate that every 1% of noise may result in about a 2% slow down, such a change in power level may result in slowing the processor more than necessary. Implementing multi-level slowing as shown in the embodiment of fig. 1C may achieve better performance when the workload includes more than two different instruction types that consume correspondingly and significantly varying amounts of power, and thus produce correspondingly and significantly varying amounts of noise on the voltage rails.
In the embodiment described with respect to fig. 1C, the clock slow down request signal 182 (e.g., signals 182a and 182b) may depend on the type of instruction detected (e.g., different detected instruction types may result in different clock slow down signals). The detected instructions are classified according to their estimated noise generation. Classification may be performed according to a programmable classification in the command
The classification may be based on multiple sets of instruction types, where each of the one or more sets of instruction types is mapped to a respective clock offset. The sets of instruction types may be determined based on the amount of noise expected to be generated by the instructions of each instruction type. For example, all instruction types expected to generate noise levels in the first range may be in a different group than all instruction types expected to generate noise levels in the second range. The clock offset specified in the map of the group may be an amount of slowing down of the current frequency required to compensate for the voltage drop caused by the expected noise level produced by the instruction type of instruction in the group.
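The group-to-offset mapping described above can be sketched as a small table. All instruction-type names, group identifiers, and offset values here are invented for illustration; the source does not specify them.

```python
NOISE_GROUPS = {
    # group id: (instruction types in the group, slow-down offset as a
    # fraction of the clock frequency needed to compensate the group's
    # expected voltage drop)
    0: ({"IADD", "MOV"},          0.00),  # quiet instructions: full speed
    1: ({"FMA", "FMUL"},          0.02),  # moderate noise: 2% slow-down
    2: ({"HMMA", "IMMA", "DMMA"}, 0.04),  # high-noise instructions: 4%
}

def slowdown_for(instruction_type):
    """Return the fractional slow-down mapped to an instruction's group."""
    for _, (members, offset) in NOISE_GROUPS.items():
        if instruction_type in members:
            return offset
    return 0.0  # unclassified instructions run at full speed

print(slowdown_for("HMMA"))  # -> 0.04
```

Because the table is data rather than fixed logic, reprogramming it post-silicon (as the following paragraph describes) amounts to moving an instruction type from one group's member set to another's.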
In some embodiments, the mapping is programmable, thus enabling post-silicon reprogramming or reconfiguration of the mapping of instruction types to specific groups. In this manner, an initial mapping made based on pre-silicon design considerations and noise estimates may subsequently be overridden based on the more accurate noise levels of the instruction types observed in post-silicon simulation. The terms pre-silicon and post-silicon as used herein refer, respectively, to the chip design stage prior to delivery to fabrication (tapeout) and to the delivered fabricated chip. Because the noise level generated by each type of instruction in the actual circuit (i.e., post-silicon) may, and typically does, differ significantly from the pre-silicon estimate, the programmability of the mapping provided in some embodiments enables instruction types to be reclassified into groups according to post-silicon simulation, which can significantly improve the performance of the processor. In some embodiments, the clock offset associated with each group may also be programmable.
Based on the detection and classification, the clock
In an exemplary embodiment, for each level of throttling added, a separate wire may need to be routed from
For each level of throttling, signals from all
The
In some embodiments, the
The lookup table 174 may include one or more frequency entries for each voltage. According to embodiments in which either of the two
The
Instruction type-based fast clock slowing method
Fig. 2A illustrates a process 200 for performing fast clock slowing according to some example embodiments. Process 200 may be performed by a processor and clock logic as shown in any of fig. 1A, 1B, and 1C. In some exemplary embodiments, process 200 is implemented entirely in hardware. According to an exemplary embodiment, process 200 is performed at least in part by an SM (e.g., as shown in fig. 4A).
At operation 202, the process 200 is initiated when one or more noisy instructions are detected in an instruction pipeline of a processor (e.g., any of the processors shown in fig. 1A, 1B, and 1C). The types of noisy instructions for which the instruction pipeline is monitored may be predetermined, and either hard-coded, specified in one or more registers, or specified in a programmable table. In one exemplary embodiment, each instruction entering the instruction pipeline is tested to determine whether its instruction type matches any of one or more predetermined instruction types.
At operation 204, the detected one or more instructions are grouped according to the noise level expected to be generated by the detected instructions. After instructions in the instruction pipeline are decoded, programmable grouping logic enables a decoded instruction to be grouped into a bucket (e.g., a logical grouping) along with other instructions having similar noise characteristics. Each instruction bucket is configured to assert a clock slow down signal that feeds the clock logic.
The programmability of the groupings enables instructions to move from one bucket to another based on post-silicon learning, and allows any errors in the pre-silicon power characterization of the instructions (e.g., inaccurate estimates of noise levels associated with the various instruction types) to be corrected on silicon. For example, simulations performed on the post-silicon die may show that the noise level of a particular instruction type differs from the noise level corresponding to the group/bucket into which the particular instruction was grouped according to the pre-silicon estimate. In this case, the programmability of the groupings enables the particular instruction type to be remapped to a different group than the one to which it was originally mapped based on the pre-silicon estimate of the noise generation characteristics of the particular instruction.
At operation 206, the clock slow down request signal is asserted in response to detecting the noisy instruction at operation 202. In some embodiments, the clock slow down signal may include an associated clock modulation or slow down value, specified as a frequency offset or a frequency. The value of the frequency offset or frequency may be obtained from the mapping of the group associated with the clock slow down signal. In some embodiments, the clock slow down signal may include an associated voltage or voltage offset rather than a frequency or frequency offset.
At operation 208, a dispatch stall signal is asserted to stall the detected instruction prior to execution. The dispatch stall is optional, and some embodiments may proceed from operation 206 to operation 210 without performing operation 208. For example, in some embodiments the pipeline may not stall, where it can be reliably expected that the clock frequency will be adjusted to the reduced frequency between the time the clock slow down request is asserted (during or at the end of the decode stage of the instruction pipeline) and the execution stage of the pipeline. However, in embodiments where it cannot be reliably expected that the clock frequency will be adjusted in response to the clock slow down signal before execution begins, the pipeline may be stalled until the requested clock slow down is in effect.
The signaling from the processor core (e.g., SM) to the clock logic should have very low latency in order not to significantly degrade performance. The physical design and loading of the signal must be carefully controlled to ensure very low delay. Ideally, the delay is low enough that, after the signal is asserted, the clock slows down before the instruction reaches the execution pipeline, allowing uninterrupted instruction progress. However, this may not always be possible in view of the size of the chip, and interlock logic may therefore be needed. In some embodiments, the interlock logic halts execution of one or more instructions in the processor core until the clock logic asserts a signal that acknowledges (ack) that the clock has slowed. After the signal is asserted, all stalled processor cores are un-stalled in a staggered manner (e.g., according to a software-configurable delay for each core) to ensure that the current ramps up gradually and di/dt noise issues are avoided.
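The staggered un-stall can be sketched as a release schedule computed after the ack arrives. The function name, the cycle-based units, and the particular delay values are assumptions for illustration; the source only says the per-core delays are software configurable.

```python
def release_schedule(stalled_cores, per_core_delay_cycles):
    """Return (cycle, core) release events, staggered so the supply current
    ramps up gradually rather than all cores resuming at once."""
    events, cycle = [], 0
    for core in stalled_cores:
        events.append((cycle, core))
        # Each core's software-configurable delay spaces out the next release.
        cycle += per_core_delay_cycles.get(core, 1)
    return events

# After the ack is asserted, release cores 0..2 with 4-cycle spacing each.
print(release_schedule([0, 1, 2], {0: 4, 1: 4, 2: 4}))
# -> [(0, 0), (4, 1), (8, 2)]
```

Spacing the releases bounds di/dt: only one core's current step occurs per interval instead of all cores stepping simultaneously.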
At operation 210, the clock frequency is reduced in response to the assertion of the clock slow down request signal. When multiple processor cores are clocked by the same clock source, the clock slow down signals asserted by the various processor cores are aggregated (e.g., logically OR'ed) for each group before interfacing with the clock logic.
The clock logic includes fast clock slow down logic and hysteresis logic, and may also include a priority encoder. Additionally, since in some embodiments the fast clock slow down works with the NAFLL clock logic, the clock logic includes logic to ensure that the frequency locked loop function of the NAFLL can be maintained with the fast clock slow down.
In some embodiments, where multiple clock slow down levels may be requested, a priority encoder may be used to ensure that the clock is set to honor the maximum slow down request corresponding to the highest current instruction that will be executed on any of the processing cores clocked by the NAFLL. In embodiments where there is only one slow down level (other than normal or full speed operating frequency), a priority encoder may not be required.
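A minimal sketch of the priority-encoding step follows. The group names and offset values are invented; the point is only that of all currently asserted per-group lines, the largest programmed slow-down is the one applied, so the highest-current instruction in flight is always honored.

```python
# Hypothetical per-group programmed slow-downs (fractions of the clock frequency).
GROUP_OFFSET = {"low_noise": 0.02, "mid_noise": 0.04, "high_noise": 0.06}

def encode_slowdown(asserted_groups):
    """Return the fractional slow-down to apply (0.0 = full speed).

    asserted_groups: the set of group request lines currently asserted,
    after OR-aggregation across all cores sharing the clock source.
    """
    return max((GROUP_OFFSET[g] for g in asserted_groups), default=0.0)

# The low- and high-noise lines are asserted simultaneously:
print(encode_slowdown({"low_noise", "high_noise"}))  # -> 0.06
```

With a single slow-down level this collapses to a plain OR, which is why the text notes the priority encoder may be omitted in that case.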
In some embodiments, each clock slow down signal has a programmable slow down value associated with it. When the signal is asserted, the frequency is reduced by the associated slow down amount (after the priority encoder) within a predetermined small number of clock cycles, by a fast slow down mechanism internal to the NAFLL. According to one embodiment, the predetermined small number of cycles is one.
In some exemplary embodiments, clock generation internal to the NAFLL is performed by a DVCO built from tunable parallel drivers. In normal operation, the multiple parallel drivers of the DVCO are tuned until the frequency matches the desired frequency of the NAFLL. However, a typical frequency transition from one state to another is done in small tuning changes and requires many microseconds. In an exemplary embodiment, a fast tuning change mechanism controlled by a side-band is employed to change the frequency within one clock cycle. In some embodiments, the DVCO may include an input (e.g., offset_enable) to immediately reduce the number of parallel drivers by some preset amount (e.g., offset) for a short duration (e.g., one or two clock cycles) in order to reduce the frequency. Decoder logic may be included within the DVCO that applies the specified offset only when indicated by the input of the DVCO (e.g., offset_enable = 1). This fast-offset mechanism is distinct from the main tuning mechanism of the DVCO. The preset offset may be a function of the operating voltage. This low-latency frequency change mechanism helps meet the stringent timing requirements of the exemplary embodiments.
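The fast-offset idea can be modeled behaviorally as below. The class, the per-driver frequency gain, and the driver/offset counts are all assumed numbers for illustration; real DVCOs are not linear in this simple way.

```python
class ToyDVCO:
    """Toy model: output frequency scales with the number of active drivers."""

    MHZ_PER_DRIVER = 25.0  # illustrative gain, not a real device parameter

    def __init__(self, drivers, offset):
        self.drivers = drivers      # set slowly by the main tuning loop
        self.offset = offset        # preset fast-slow-down amount (voltage-dependent)
        self.offset_enable = 0      # side-band control, takes effect immediately

    def frequency_mhz(self):
        # The side-band subtracts `offset` drivers without touching the main tuning.
        active = self.drivers - (self.offset if self.offset_enable else 0)
        return active * self.MHZ_PER_DRIVER

dvco = ToyDVCO(drivers=60, offset=3)
print(dvco.frequency_mhz())   # -> 1500.0 (full speed)
dvco.offset_enable = 1        # fast slow-down engages within a cycle
print(dvco.frequency_mhz())   # -> 1425.0 (5% lower)
```

The key property is that `offset_enable` bypasses the slow main tuning loop entirely, which is what allows a frequency drop within one clock cycle instead of many microseconds.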
Additionally, in contrast to other clock slowing techniques (such as LDIV or clock skipping), the fast tuning change mechanism employed in some example embodiments may slow down the clock frequency by a small percentage (e.g., 2%, 4%, etc.) while maintaining the duty cycle across all cycles.
Hysteresis logic is used to ensure that the NAFLL does not switch frequently between slow down modes, which could cause instability in the frequency lock of the NAFLL. Switching from a smaller slow down mode to a larger slow down mode occurs quickly. Hysteresis may apply only when moving from a larger slow down to a smaller slow down, or when the slow down is released.
Additional logic may also be required to maintain the frequency locked loop function of the NAFLL while engaging and disengaging the various slow downs. In some embodiments, the two VF curves, "full speed" and "slow down," are characterized separately on silicon, each representing the maximum frequency (Fmax) for the workload whose instructions it accommodates. These frequency targets are provided, for the current voltage, to the frequency locked loop of the NAFLL. A tunable error accumulator may be used to keep the tunable DVCO of the NAFLL locked to the target frequency, typically the "full speed" frequency.
When entering the slow down state, the control loop of the NAFLL may temporarily stop tracking the target "full speed" frequency and immediately apply its fast slow down mechanism to reduce the frequency and allow the slowed-down instruction to execute. It then attempts to slowly relock to the "slow down" frequency, since the fast slow down mechanism may include extra margin to account for PVT (process, voltage, temperature) variations or to reduce characterization effort.
The NAFLL may again suspend target frequency tracking when leaving the slow down state after the hysteresis window has expired. An internal sequencer may optionally apply a static, transient, negative frequency offset to reduce the risk of a later overshoot (e.g., due to PVT variations or characterization uncertainties). It can then disengage the fast slow down mechanism, restoring the frequency to less than or equal to the original "full speed" frequency. Finally, it moves its long-term lock target to the "full speed" frequency, which gradually restores the frequency to the pre-slow down frequency. Fig. 2C shows the behavior of the clock frequency when the fast slow down mechanism is disengaged.
At operation 212, in some embodiments, a clock slow down acknowledge signal is asserted by the clock logic. The acknowledgement may be sent to the processor core once the slow down has been initiated, and may be used to un-stall instruction execution in the event of long round-trip delays in the clock slow down.
With the processor core and clock logic working in concert, the clock frequency can be rapidly modulated depending on the instruction type, allowing instructions of different power levels to execute at their respective optimal frequencies.
At operation 214, in response to the clock slow down acknowledge signal, instruction execution is un-stalled and the instruction is dispatched for execution. If the instruction pipeline had been stalled, then upon receipt of the acknowledge signal the detected instruction may be dispatched for execution in the data path. As described above, instruction execution includes activating a data path to execute the instruction.
At operation 216, the one or more noisy instructions are executed at the reduced clock frequency. In other words, the one or more noisy instructions are executed by a processor resource (e.g., in a data path) with the clock frequency set to the maximum frequency for the noisy instructions.
At operation 218, the clock slow down request signal is de-asserted. The de-assertion may be performed before or after the instruction pipeline is un-stalled. In some embodiments, the de-assertion may be based on expiration of a timer set when the clock slow down request signal was asserted. Two exemplary techniques for de-asserting the clock slow down request signal are described with respect to fig. 2D.
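The timer-based de-assertion option can be sketched as follows. The class name, the cycle-based interface, and the hold length are assumptions; the source only says a timer is armed when the request is asserted.

```python
class SlowdownRequestTimer:
    """Hold the slow-down request for a fixed number of cycles after detection."""

    def __init__(self, hold_cycles):
        self.hold_cycles = hold_cycles  # sized to cover the instruction's execution
        self._remaining = 0

    def detect_noisy_instruction(self):
        self._remaining = self.hold_cycles  # (re)arm on each detection

    def tick(self):
        """Advance one cycle; return the request line state after this cycle."""
        if self._remaining > 0:
            self._remaining -= 1
        return self._remaining > 0

t = SlowdownRequestTimer(hold_cycles=2)
t.detect_noisy_instruction()
print(t.tick(), t.tick(), t.tick())  # -> True False False
```

Re-arming on each detection means back-to-back noisy instructions extend the hold instead of toggling the request, which complements the hysteresis on the de-assert path.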
After execution of the one or more instructions is completed at the reduced frequency, the clock frequency is restored at operation 220. In other words, the clock frequency is restored to the maximum frequency for normal instructions (i.e., instructions that are not noisy instructions). An example of clock frequency restoration is shown in fig. 2C.
Signaling of fast clock slow down based on instruction type
Fig. 2B is a timing diagram illustrating a clock slow down request signal 222 and a corresponding acknowledge signal 224 that may be used in fast clock slow down according to some example embodiments. The timing diagram shows the assertion and de-assertion of each signal from left to right over time. A rising edge of a curve indicates that the signal is asserted, and a falling edge indicates that it is de-asserted.
As described above, signal 222 is asserted by an instruction pipeline associated with processing circuitry (e.g.,
In response to the request signal 222, signal 224 is asserted by the clock logic. In some embodiments, the acknowledge signal 224 is asserted after the clock logic has set the clock to the target reduced frequency. In some embodiments, the acknowledge signal 224 may be asserted immediately before or during the change of the clock to the target reduced frequency.
As shown, in some embodiments the clock slow down request signal 222 may remain asserted even after the acknowledge signal 224 is received at the instruction pipeline and/or processing logic and the detected high power instruction is dispatched for execution. After the detected high power instruction has completed execution, the instruction pipeline and/or processing logic may de-assert the request signal 222.
In some embodiments, the acknowledge signal 224 may remain asserted until the request signal is de-asserted. For example, the clock logic may de-assert the acknowledge signal 224 after determining that the request signal 222 has been de-asserted.
Fig. 2C illustrates the change in
In the exemplary embodiment shown, the clock may be configured to operate at either of two frequencies: the maximum frequency for non-noisy (e.g., non-high power) instructions (e.g., Fnon-high power instructions 230, or "full speed," referred to in the example described with respect to fig. 1B), or the maximum frequency for noisy (e.g., high power) instructions (e.g., Fhigh power instructions 232, or "slow down"). As shown, the maximum frequency for non-noisy instructions may be higher than the maximum frequency for high power instructions.
A primary advantage of the exemplary embodiments is that they reduce the time during which the processor must operate at a lower frequency (e.g., the "slow down" frequency) due to margin requirements associated with noise. More specifically, the illustrative embodiments provide for operation at the reduced frequency only for the duration of execution of certain instructions that may produce a noise effect large enough to cause a significant voltage drop across the circuit.
Therefore, it is expected that the processor will operate at frequency 230 most of the time. When the instruction pipeline and/or processing logic detects the decoding of a monitored instruction, then the clock slow down
As shown (see the leftmost sharp drop in clock frequency 228), when it receives the clock slow down
The fast slow down may be performed by referring to a look-up table to determine the new frequency and then immediately setting the clock source to the new operating frequency. In some embodiments, the new frequency becomes active in one clock cycle after the clock logic receives the clock slow down
After the clock slow down
As can be noted in the
FIG. 2D illustrates a timing diagram showing stalling execution of instructions during fast clock slow down and stall release, according to some demonstrative embodiments. The figure illustrates the timing associated with the detection of noisy instructions and the assertion of a clock slow down
With respect to the two
Although the fast clock slowing shown in the systems of fig. 1A, 1B, and 1C and the process described with respect to fig. 2A may be described in the context of a processing unit and certain hardware components, these systems and processes may also be performed by custom circuits, or by a combination of CPUs, GPUs, custom circuits, and/or programs. Those of ordinary skill in the art will appreciate that any system that performs, for example, method 200, is within the scope and spirit of embodiments of the present invention.
Parallel processing architecture using fast clock slowing
More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may optionally be incorporated, with or without the exclusion of other features described.
FIG. 3 illustrates a Parallel Processing Unit (PPU)300 according to one embodiment. In one embodiment, the PPU300 is a multi-threaded processor implemented on one or more integrated circuit devices. The PPU300 is a latency hiding architecture designed for processing many threads in parallel. A thread (i.e., an execution thread) is an instance of a set of instructions configured to be executed by the
One or more PPUs 300 may be configured to accelerate thousands of High Performance Computing (HPC), data center, and machine learning applications. The PPU300 may be configured to accelerate numerous deep learning systems and applications, including autonomous vehicle platforms, deep learning, high-accuracy speech, image, and text recognition systems, intelligent video analytics, molecular simulation, drug discovery, disease diagnosis, weather forecasting, big data analytics, astronomy, molecular dynamics simulation, financial modeling, robotics, factory automation, real-time language translation, online search optimization, and personalized user recommendations, among others.
As shown in FIG. 3, PPU300 includes input/output (I/O) unit 305, front end unit 315,
The NVLink310 interconnect enables the system to scale and include one or more PPUs 300 in conjunction with one or more CPUs, support cache coherency between
The I/O unit 305 is configured to send and receive communications (e.g., commands, data, etc.) from a host processor (not shown) over the
The I/O unit 305 decodes data packets received via the
In one embodiment, a program executed by a host processor encodes a command stream in a buffer that provides workloads to the PPU300 for processing. The workload may include a number of instructions and data to be processed by those instructions. A buffer is an area of memory that is accessible (e.g., read/write) by both the host processor and the
The front end units 315 are coupled to a
The
The
Tasks are managed by the
In one embodiment, the host processor executes a driver kernel that implements an Application Programming Interface (API) that enables one or more applications to execute on the host processor to schedule operations for execution on the
FIG. 4A illustrates a GPC350 of the PPU300 of FIG. 3, according to one embodiment. As shown in fig. 4A, each GPC350 includes multiple hardware units for processing tasks. In one embodiment, each GPC350 includes a
In one embodiment, the operation of
The
The
Each
SM440 includes a programmable streaming processor configured to process tasks represented by a plurality of threads. Each SM440 is multithreaded and configured to execute multiple threads (e.g., 32 threads) from a particular thread group concurrently. In one embodiment, SM440 implements a SIMD (single instruction, multiple data) architecture, where each thread in a thread group (e.g., a warp) is configured to process a different set of data based on the same set of instructions. All threads in the thread group execute the same instructions. In another embodiment, the SM440 implements a SIMT (single instruction, multiple thread) architecture, where each thread in a thread group is configured to process a different set of data based on the same set of instructions, but where individual threads in the thread group are allowed to diverge during execution. In one embodiment, a program counter, call stack, and execution state are maintained for each thread bundle, enabling concurrency between thread bundles and serial execution within a thread bundle when the threads within the thread bundle diverge. In another embodiment, a program counter, call stack, and execution state are maintained for each individual thread, thereby enabling equal concurrency among all threads, within and between thread bundles. When execution state is maintained for each individual thread, threads executing the same instructions may be converged and executed in parallel for maximum efficiency. SM440 is described in more detail below in conjunction with fig. 5A.
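The divergence behavior described above can be illustrated with a simplified sketch: all threads in a thread bundle share one instruction stream, so when a branch splits the bundle, each side is executed serially under an active mask. The function, the 4-thread "bundle," and the example code are invented for illustration; real hardware reconvergence is considerably more involved.

```python
def simt_branch(values, cond, then_fn, else_fn):
    """Apply then_fn where cond holds and else_fn elsewhere, one path at a time,
    mimicking masked serial execution of a divergent branch."""
    out = list(values)
    taken = [cond(v) for v in values]
    for i, v in enumerate(values):       # first pass: threads where cond is True
        if taken[i]:
            out[i] = then_fn(v)
    for i, v in enumerate(values):       # second pass: the divergent threads
        if not taken[i]:
            out[i] = else_fn(v)
    return out

print(simt_branch([1, 2, 3, 4], lambda v: v % 2 == 0,
                  lambda v: v * 10, lambda v: -v))  # -> [-1, 20, -3, 40]
```

The two serial passes are why divergent branches cost roughly the sum of both paths, and why maintaining per-thread state (as in the later embodiment) allows threads on the same path to reconverge and run in parallel.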
The
FIG. 4B illustrates a
In one embodiment,
In one embodiment,
In one embodiment, PPU300 implements a multi-level memory hierarchy. In one embodiment,
In one embodiment, the replication engine transfers data between
Data from
FIG. 5A illustrates the
As described above, the
Cooperative groups are a programming model for organizing groups of communicating threads that allows developers to express the granularity at which threads are communicating, enabling richer, more efficient parallel decompositions to be expressed. Cooperative launch APIs support synchronization among thread blocks for the execution of parallel algorithms. Conventional programming models provide a single, simple construct for synchronizing cooperating threads: a barrier across all threads of a thread block (e.g., the syncthreads() function). However, programmers often want to define groups of threads at smaller than thread block granularities and synchronize within the defined groups, enabling greater performance, design flexibility, and software reuse in the form of collective group-wide function interfaces.
Cooperative groups enable programmers to define groups of threads explicitly at sub-block (e.g., as small as a single thread) and multi-block granularities, and to perform collective operations, such as synchronization, on the threads in a cooperative group. The programming model supports clean composition across software boundaries, so that libraries and utility functions can synchronize safely within their local context without having to make assumptions about convergence. Cooperative group primitives enable new patterns of cooperative parallelism, including producer-consumer parallelism, opportunistic parallelism, and global synchronization across an entire grid of thread blocks.
Dispatch unit 515 is configured to transmit instructions to one or more functional units. In this embodiment, the scheduler unit 510 includes two dispatch units 515 that enable two different instructions from the same thread bundle to be scheduled during each clock cycle. In alternative embodiments, each scheduler unit 510 may include a single dispatch unit 515 or additional dispatch units 515.
Each SM440 includes a register file 520 that provides a set of registers for the functional units of the
Each SM440 includes
The tensor cores are configured to perform matrix operations, and in one embodiment, one or more tensor cores are included in
In one embodiment, the matrix multiply inputs A and B are 16-bit floating-point matrices, while the accumulation matrices C and D may be 16-bit floating-point or 32-bit floating-point matrices. The tensor cores operate on 16-bit floating-point input data with 32-bit floating-point accumulation. The 16-bit floating-point multiply requires 64 operations and results in a full-precision product, which is then accumulated using 32-bit floating-point addition with the other intermediate products of the 4 x 4 matrix multiply. In practice, tensor cores are used to perform much larger two-dimensional or higher-dimensional matrix operations, built up from these smaller elements. An API (such as the CUDA 9 C++ API) exposes specialized matrix load, matrix multiply and accumulate, and matrix store operations to efficiently use tensor cores from a CUDA-C++ program. At the CUDA level, the thread-bundle-level interface assumes 16 x 16 size matrices spanning all 32 threads of the thread bundle.
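The D = A × B + C operation described above can be sketched numerically as follows. This models only the arithmetic shape of a 4 x 4 multiply-accumulate, not the fp16-to-fp32 precision conversion or the hardware; the function name is an invention for the example.

```python
def tensor_core_mma(a, b, c):
    """4x4 matrix multiply-accumulate, D = A*B + C.

    Each output element sums four products (64 multiplies total for the tile),
    then adds the corresponding accumulator element, mirroring the operation
    a tensor core performs per clock.
    """
    n = 4
    return [[sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j]
             for j in range(n)] for i in range(n)]

a = [[1.0] * 4 for _ in range(4)]  # stand-in for an fp16 input tile
b = [[1.0] * 4 for _ in range(4)]
c = [[0.0] * 4 for _ in range(4)]  # fp32 accumulator tile
print(tensor_core_mma(a, b, c)[0][0])  # -> 4.0
```

Larger matrix multiplies are tiled into many such 4 x 4 fragments, with C carrying the running partial sums between fragments.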
Each SM440 also includes
Each SM440 also includes
The shared memory/L1 cache 570 is an on-chip memory array that allows data storage and communication between the SM440 and the
Combining data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The capacity is usable as a cache by programs that do not use shared memory. For example, if the shared memory is configured to use half of the capacity, texture and load/store operations can use the remaining capacity. Integration within the shared memory/L1 cache 570 enables the shared memory/L1 cache 570 to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data.
When configured for general-purpose parallel computing, a simpler configuration can be used compared to graphics processing. In particular, the fixed function graphics processing unit shown in FIG. 3 is bypassed, creating a simpler programming model. In a general-purpose parallel computing configuration, the
The PPU300 may be included in a desktop computer, a laptop computer, a tablet computer, a server, a supercomputer, a smartphone (e.g., wireless, handheld device), a Personal Digital Assistant (PDA), a digital camera, a vehicle, a head-mounted display, a handheld electronic device, and so forth. In one embodiment, the PPU300 is included on a single semiconductor substrate. In another embodiment, the PPU300 is included on a system on a chip (SoC) with one or more other devices, such as an additional PPU300,
In one embodiment, PPU300 may be included on a graphics card that includes one or
Exemplary computing System
Systems with multiple GPUs and CPUs are used in various industries as developers expose and exploit more parallelism in applications such as artificial intelligence computing. High performance GPU acceleration systems with tens to thousands of compute nodes are deployed in data centers, research institutions, and supercomputers to address larger problems. As the number of processing devices within high performance systems increases, communication and data transfer mechanisms need to be extended to support the increased bandwidth.
FIG. 5B is a conceptual diagram of a
In another embodiment (not shown), NVLink310 provides one or more high speed communication links between each PPU300 and CPU530, and switch 555 interfaces between
In the context of this specification, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit fabricated on a die or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity, which simulate on-chip operation and make substantial improvements over utilizing a conventional bus implementation. Of course, the various circuits or devices may also be situated separately or in various combinations of semiconductor platforms, per the desires of the user. Alternatively, the
In one embodiment, the signaling rate of each NVLink310 is 20 to 25 Gigabits/second, and each PPU300 includes six NVLink310 interfaces (as shown in fig. 5B, five NVLink310 interfaces are included for each PPU300). Each NVLink310 provides a data transfer rate of 25 Gigabits/second in each direction, with six links providing 300 Gigabits/second. When CPU530 also includes one or more NVLink310 interfaces, the NVLink310 may be dedicated to PPU-to-PPU communication as shown in fig. 5B, or some combination of PPU-to-PPU and PPU-to-CPU.
In one embodiment, NVLink310 allows direct load/store/atomic access from CPU530 to
Fig. 5C illustrates an exemplary system 565 in which the various architectures and/or functionalities of the various previous embodiments may be implemented. Exemplary system 565 may be configured to implement method 200 shown in fig. 2A.
As shown, a system 565 is provided that includes at least one
System 565 also includes an input device 560, a
Further, system 565 can be coupled for communication purposes to a network (e.g., a telecommunications network, a Local Area Network (LAN), a wireless network, a Wide Area Network (WAN) such as the internet, a peer-to-peer network, a cable network, etc.) through network interface 535.
System 565 can also include secondary storage (not shown). Secondary storage 610 includes, for example, a hard disk drive and/or a removable storage drive representing a floppy disk drive, a magnetic tape drive, an optical disk drive, a Digital Versatile Disk (DVD) drive, a recording device, or a Universal Serial Bus (USB) flash memory. The removable storage drive reads from and/or writes to a removable storage unit in a well-known manner.
Computer programs, or computer control logic algorithms, may be stored in main memory 540 and/or secondary storage. Such computer programs, when executed, enable system 565 to perform various functions. Memory 540, storage, and/or any other storage are possible examples of computer-readable media.
The architecture and/or functionality of the various previous figures may be implemented in the context of a general purpose computer system, a circuit board system, a game console system dedicated for entertainment purposes, a dedicated system, and/or any other desired system. For example, system 565 may take the form of a desktop computer, laptop computer, tablet computer, server, supercomputer, smartphone (e.g., wireless, handheld device), Personal Digital Assistant (PDA), digital camera, vehicle, head-mounted display, handheld electronic device, mobile phone device, television, workstation, game console, embedded system, and/or any other type of logic.
While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
Graphics processing pipeline
In one embodiment, PPU300 includes a Graphics Processing Unit (GPU). PPU300 is configured to receive commands specifying a shading program for processing graphics data. Graphics data may be defined as a set of primitives, such as points, lines, triangles, quadrilaterals, triangle strips, and so forth. Typically, a primitive includes data that specifies a plurality of vertices (e.g., in a model space coordinate system) of the primitive and attributes associated with each vertex of the primitive. The PPU300 may be configured to process the primitives to generate a frame buffer (e.g., pixel data for each of the pixels of the display).
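The primitive representation described above (a set of vertices, each carrying a position and per-vertex attributes) can be sketched as a simple data structure. The class and field names below are purely illustrative; they are not part of any actual PPU interface:

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    position: tuple                                  # e.g. model-space (x, y, z)
    attributes: dict = field(default_factory=dict)   # color, normal, uv, ...

@dataclass
class Primitive:
    kind: str       # "point", "line", "triangle", ...
    vertices: list  # the per-vertex records described in the text

tri = Primitive("triangle",
                [Vertex((0.0, 0.0, 0.0)),
                 Vertex((1.0, 0.0, 0.0)),
                 Vertex((0.0, 1.0, 0.0))])
print(len(tri.vertices))  # 3
```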
The application writes model data (e.g., a set of vertices and attributes) for the scene to a memory, such as system memory or
FIG. 6 is a conceptual diagram of a
As shown in FIG. 6,
The data assembly stage 610 receives
Primitive assembly stage 630 collects the vertices output by
Geometry shading stage 640 processes geometric primitives by performing a set of operations (e.g., a geometry shader or program) on the geometric primitives. A tessellation operation may generate one or more geometric primitives from each geometric primitive. In other words, geometry shading stage 640 may subdivide each geometric primitive into a finer mesh of two or more geometric primitives for processing by the remainder of
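As an illustration of the kind of subdivision a tessellation operation might perform, the sketch below splits one triangle into four finer triangles by connecting edge midpoints. This is one common scheme, shown in plain Python rather than shader code; it is not the specific algorithm of any embodiment:

```python
def midpoint(a, b):
    """Component-wise midpoint of two vertex positions."""
    return tuple((x + y) / 2.0 for x, y in zip(a, b))

def subdivide(tri):
    """Split one triangle into four by connecting its edge midpoints."""
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]

tri = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
print(len(subdivide(tri)))  # 4 finer triangles
```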
In one embodiment,
The viewport SCC stage 650 performs viewport scaling, culling, and clipping of geometric primitives. Each surface being rendered is associated with an abstract camera position. The camera position represents the position of a viewer looking at the scene and defines a view frustum that encloses the objects of the scene. The view frustum may include a viewing plane, a back plane, and four clipping planes. Any geometric primitive that lies completely outside the view frustum may be culled (e.g., discarded) because it will not contribute to the final rendered scene. Any geometric primitive that is partially inside and partially outside the view frustum may be clipped (e.g., converted into new geometric primitives that are enclosed within the view frustum). Furthermore, each geometric primitive may be scaled based on the depth of the view frustum. All potentially visible geometric primitives are then passed to the
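The cull/clip/keep decision described above can be sketched as classifying a primitive's vertices against a set of frustum planes. The plane representation (normal, d) with inside defined by dot(n, v) + d ≥ 0, and the function itself, are illustrative assumptions rather than the actual hardware implementation:

```python
def classify(vertices, planes):
    """Classify a primitive against frustum planes given as (normal, d);
    a vertex is inside a plane when dot(n, v) + d >= 0."""
    inside_all = True
    for n, d in planes:
        dots = [sum(ni * vi for ni, vi in zip(n, v)) + d for v in vertices]
        if all(s < 0 for s in dots):
            return "culled"       # fully outside one plane: discard
        if any(s < 0 for s in dots):
            inside_all = False    # straddles this plane
    return "inside" if inside_all else "clip"

# Single clip plane x >= 0, as a minimal stand-in for a full frustum.
planes = [((1.0, 0.0, 0.0), 0.0)]
fully_out = [(-1.0, 0.0, 0.0), (-2.0, 1.0, 0.0), (-1.0, -1.0, 0.0)]
straddle = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(classify(fully_out, planes))  # culled
print(classify(straddle, planes))   # clip
```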
The
The
Raster operations stage 680 may perform various operations on the pixel data, such as performing alpha testing, stencil testing (stencil test), and blending the pixel data with other pixel data corresponding to other fragments associated with the pixel. When raster operations stage 680 has completed processing pixel data (e.g., output data 602), the pixel data may be written to a render target, such as a frame buffer, color buffer, or the like.
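Blending of the kind performed in raster operations stage 680 can be illustrated by the standard source-over formula, out = src·α + dst·(1 − α). The sketch below is a generic textbook example, not the specific blend modes of any embodiment:

```python
def blend_over(src_rgb, src_a, dst_rgb):
    """Standard source-over blend: out = src * alpha + dst * (1 - alpha)."""
    return tuple(s * src_a + d * (1.0 - src_a) for s, d in zip(src_rgb, dst_rgb))

# Half-transparent red fragment over an opaque blue pixel.
print(blend_over((1.0, 0.0, 0.0), 0.5, (0.0, 0.0, 1.0)))  # (0.5, 0.0, 0.5)
```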
It should be appreciated that one or more additional stages may be included in
Various programs may be executed within PPU300 to implement the various stages of
Machine learning
Deep Neural Networks (DNNs) developed on processors such as PPU300 have been used for diverse use cases: from self-driving cars to faster drug development, from automatic image captioning in image databases to intelligent real-time language translation in video chat applications. Deep learning is a technique that models the neural learning process of the human brain, continually learning, continually getting smarter, and delivering more accurate results more quickly over time. A child is initially taught by an adult to correctly identify and classify various shapes, eventually being able to identify shapes without any coaching. Similarly, a deep learning or neural learning system needs to be trained in object recognition and classification so it becomes smarter and more efficient at identifying basic objects, occluded objects, and the like, while also assigning context to objects.
At the simplest level, neurons in the human brain look at various inputs received, assign a level of importance to each of these inputs, and pass the output to other neurons for processing. Artificial neurons or perceptrons are the most basic model of neural networks. In one example, a perceptron may receive one or more inputs representing various features of an object that the perceptron is being trained to recognize and classify, and each of these features is given a weight based on the importance of the feature when defining the shape of the object.
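The weighted-sum-and-threshold behavior of a perceptron described above can be captured in a few lines. The weights below are hand-picked to realize an AND gate and are purely illustrative:

```python
def perceptron(inputs, weights, bias):
    """The most basic artificial neuron: a weighted sum of inputs
    followed by a step threshold."""
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if s >= 0 else 0

# Hand-picked weights realizing an AND gate: fires only when both inputs are 1.
print(perceptron((1, 1), (0.5, 0.5), -0.7))  # 1
print(perceptron((1, 0), (0.5, 0.5), -0.7))  # 0
```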
Deep Neural Network (DNN) models include multiple layers of many connected nodes (e.g., perceptrons, boltzmann machines, radial basis functions, convolutional layers, etc.), which can be trained with large amounts of input data to solve complex problems quickly and with high accuracy. In one example, the first layer of the DNN model decomposes the input image of the car into various parts and finds basic patterns (such as lines and corners). The second layer assembles the lines to look for higher level patterns such as wheels, windshields and mirrors. The next layer identifies the type of vehicle, and the last few layers generate labels for the input images to identify the model of a particular automobile brand.
Once DNNs are trained, they may be deployed and used to identify and classify objects or patterns in a process known as inference. Examples of inference (the process by which a DNN extracts useful information from a given input) include identifying handwritten numbers on checks deposited at ATM machines, identifying images of friends in photos, delivering movie recommendations to over fifty million users, identifying and classifying different types of automobiles, pedestrians, and road hazards in driverless cars, or translating human speech in real time.
During training, data flows through the DNN during the forward propagation phase until a prediction is generated, which indicates the label corresponding to the input. If the neural network does not correctly label the input, the error between the correct label and the predicted label is analyzed and the weights are adjusted for each feature during the back propagation phase until the DNN correctly labels the input and the other inputs in the training dataset. Training a complex neural network requires a significant amount of parallel computational performance, including floating-point multiplication and addition, supported by the
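The forward-prediction and error-driven weight-adjustment loop described above can be illustrated in miniature by the classic perceptron learning rule, here training the AND function. This is a deliberately simplified stand-in for full backpropagation; all names and constants are illustrative:

```python
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def train(samples, epochs=20, lr=0.1):
    """Perceptron learning rule: predict (forward pass), compare the
    predicted label with the correct label, and nudge each weight by
    the error (analogous to the backward pass described above)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            pred = 1 if x[0] * w[0] + x[1] * w[1] + b >= 0 else 0
            err = target - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

w, b = train(AND)
# After training, the learned weights label every AND sample correctly.
for x, target in AND:
    assert (1 if x[0] * w[0] + x[1] * w[1] + b >= 0 else 0) == target
```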
Neural networks rely heavily on matrix mathematics, and complex multi-layer networks require tremendous amounts of floating-point performance and bandwidth for both efficiency and speed. With thousands of processing cores optimized for matrix math operations and delivering tens to hundreds of TFLOPS of performance, PPU300 is a computing platform capable of delivering the performance required for deep-neural-network-based artificial intelligence and machine learning applications.
Exemplary technical advantages of some embodiments
Certain exemplary embodiments provide for rapidly slowing down the clock frequency, so that the processor can effectively reduce the operating frequency only for noisy instructions while allowing noiseless instructions to execute at or near the highest nominal frequency for the voltage level.
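The control logic summarized above — detect an instruction of a predetermined noisy type before it executes, then lower the clock by a compensating offset — can be sketched as follows. The instruction type names, nominal frequency, and offsets below are hypothetical assumptions for illustration, not values from this disclosure:

```python
# Hypothetical per-instruction-type clock selection, per the scheme above.
NOMINAL_MHZ = 1800
NOISE_OFFSET_MHZ = {           # assumed offsets compensating predicted droop
    "tensor_mma": 300,         # assumed high-current (noisy) instruction type
    "wide_vector_fma": 150,    # assumed moderately noisy instruction type
}

def clock_for(instruction_type):
    """Before issuing an instruction: noisy types run at nominal minus a
    per-type offset chosen to compensate for the predicted voltage drop;
    quiet types run at the full nominal frequency."""
    offset = NOISE_OFFSET_MHZ.get(instruction_type, 0)
    return NOMINAL_MHZ - offset

print(clock_for("tensor_mma"))  # 1500
print(clock_for("iadd"))        # 1800
```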
Compared to solutions such as NAFLL, which are designed to address the frequency effects of noise but target noise observed over a large area, the exemplary embodiments also accommodate highly localized noise. In addition, the illustrative embodiments allow flexibility in IR drop closure, since known power-hungry units can be closed to a looser IR drop budget with the expectation that the clock will quickly slow to a lower frequency whenever instructions execute on the power-hungry unit. Relaxing the IR drop budget may alleviate routing congestion and prevent or reduce die area growth.
Many modifications and variations of the present invention are possible in light of the above teachings. It is, therefore, to be understood that within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.