Instructions and logic to perform floating point and integer operations for machine learning

Document No.: 1155845 | Publication date: 2020-09-15

Note: This technology, Instructions and logic to perform floating point and integer operations for machine learning, was created by H. Kaul, M.A. Anders, S.K. Mathew, Anbang Yao, J. Ray, P.T. Tang, and M.S. Strickland on 2018-04-27. Its principal content is as follows: One embodiment provides a machine learning hardware accelerator including a compute unit having an adder and a multiplier shared between an integer data path and a floating point data path, the upper bits of an input operand to the multiplier to be gated during a floating point operation.
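The operand gating described in the abstract can be modeled behaviorally. The sketch below is an illustrative approximation only; the 16-bit operand width and the 11-bit significand mask for the floating-point case are assumptions for illustration, not details taken from the patent's detailed description:

```python
def gate_multiplier_inputs(a: int, b: int, fp_mode: bool, sig_bits: int = 11):
    """Behavioral model of input gating on a multiplier shared between
    integer and floating-point data paths.

    In integer mode the full 16-bit operands reach the multiplier.
    In floating-point mode only the low significand bits contribute to
    the product, so the upper bits are gated (held at zero), which
    keeps the unused portion of the shared multiplier array from
    toggling.
    """
    mask = (1 << sig_bits) - 1 if fp_mode else 0xFFFF
    return a & mask, b & mask
```

Gating the unused upper bits is a power optimization: bits that cannot affect the floating-point significand product never switch inside the shared multiplier.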

1. A graphics processing unit for accelerating machine learning operations, the graphics processing unit comprising:

a multiprocessor having a single-instruction multi-thread (SIMT) architecture, the multiprocessor to execute at least one single instruction across multiple threads of the multiprocessor, wherein at least a portion of a register file is dedicated to each of the multiple threads;

a first compute unit included within the multiprocessor, the at least one single instruction to cause the first compute unit to perform a two-dimensional matrix multiply accumulate operation, wherein performing the two-dimensional matrix multiply accumulate operation includes computing a 32-bit intermediate product of 16-bit operands and computing a 32-bit sum based on the 32-bit intermediate product, wherein the first compute unit is to compute the 32-bit intermediate product from two or more 16-bit operands of the at least one single instruction, perform a 16-bit floating-point multiplication on the 16-bit operands to generate a 16-bit product, and convert the 16-bit product to the 32-bit intermediate product.

2. The graphics processing unit of claim 1, the multiprocessor to execute parallel threads of a thread group, each thread of the thread group having an independent thread state.

3. The graphics processing unit of claim 2, the multiprocessor comprising a scheduler for scheduling parallel threads of the thread group to a plurality of compute units within the multiprocessor.

4. The graphics processing unit of claim 3, the plurality of compute units within the multiprocessor comprising a second compute unit for executing integer operations, the scheduler to schedule floating point operations to the first compute unit and integer operations to the second compute unit, wherein the multiprocessor is to concurrently execute floating point operations on the first compute unit and integer operations on the second compute unit.

5. The graphics processing unit of claim 4, the multiprocessor to simultaneously perform a first floating point operation at a first precision and a second floating point operation at a second precision on the first compute unit.

6. The graphics processing unit of claim 1, the first compute unit further comprising one or more shifters to normalize or align intermediate results.

7. The graphics processing unit of claim 6, the first compute unit to compute a 16-bit sum based on the 32-bit intermediate product.

8. A data processing system comprising:

a graphics processing unit to accelerate machine learning operations, the graphics processing unit comprising a multiprocessor having a single-instruction multi-thread (SIMT) architecture, the multiprocessor to execute at least one single instruction across multiple threads of the multiprocessor, wherein at least a portion of a register file is dedicated to each of the multiple threads;

a first compute unit included within the multiprocessor, the at least one single instruction to cause the first compute unit to perform a two-dimensional matrix multiply accumulate operation, wherein performing the two-dimensional matrix multiply accumulate operation includes computing a 32-bit intermediate product of 16-bit operands and computing a 32-bit sum based on the 32-bit intermediate product, wherein the first compute unit is to compute the 32-bit intermediate product from two or more 16-bit operands of the at least one single instruction, perform a 16-bit floating-point multiplication on the 16-bit operands to generate a 16-bit product, and convert the 16-bit product to the 32-bit intermediate product; and

a memory communicatively coupled with the graphics processing unit.

9. The data processing system of claim 8, the multiprocessor to execute parallel threads of a thread group, each thread of the thread group having an independent thread state.

10. The data processing system of claim 9, the multiprocessor comprising a scheduler for scheduling the parallel threads to a plurality of compute units within the multiprocessor.

11. The data processing system of claim 10, the plurality of compute units within the multiprocessor comprising a second compute unit for executing integer operations, the scheduler to schedule floating point operations to the first compute unit and integer operations to the second compute unit, wherein the multiprocessor is to concurrently execute floating point operations on the first compute unit and integer operations on the second compute unit.

12. The data processing system of claim 11, the multiprocessor to simultaneously perform a first floating point operation at a first precision and a second floating point operation at a second precision on the first compute unit.

13. The data processing system of claim 8, the first compute unit further comprising one or more shifters to normalize or align intermediate results.

14. The data processing system of claim 13, the first compute unit to compute a 16-bit sum based on the 32-bit intermediate product.

15. A method for accelerating machine learning operations, the method comprising:

decoding a single instruction on a Graphics Processing Unit (GPU), the GPU having a single instruction multi-thread (SIMT) architecture;

executing, by a multiprocessor within the GPU, the single instruction being executed across multiple threads of the multiprocessor, wherein at least a portion of a register file is dedicated to each of the multiple threads; and

in response to executing the single instruction by the multiprocessor, performing a two-dimensional matrix multiply-accumulate operation on a first compute unit of the multiprocessor, wherein performing the two-dimensional matrix multiply-accumulate operation includes computing a 32-bit intermediate product of 16-bit operands and computing a 32-bit sum based on the 32-bit intermediate product, wherein computing the 32-bit intermediate product includes performing a 16-bit floating-point multiplication on two or more 16-bit operands of the single instruction to generate a 16-bit product and converting the 16-bit product into the 32-bit intermediate product.

16. The method of claim 15, further comprising: executing parallel threads of a thread group, each thread of the thread group having an independent thread state.

17. The method of claim 16, further comprising: scheduling parallel threads of the thread group to a plurality of compute units within the multiprocessor.

18. The method of claim 17, further comprising:

scheduling floating point operations to the first compute unit and integer operations to a second compute unit; and

executing, by the second compute unit within the multiprocessor, the integer operations concurrently with the floating point operations on the first compute unit.

19. The method as recited in claim 18, further comprising:

simultaneously performing, on the first compute unit, a first floating point operation at a first precision and a second floating point operation at a second precision.

20. The method of claim 15, further comprising:

computing a 16-bit sum based on the 32-bit intermediate product.
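The multiply-accumulate flow recited in claims 1, 8, and 15 (a 16-bit floating-point multiply whose product is converted to a 32-bit intermediate and then accumulated into a 32-bit sum) can be modeled numerically. The NumPy sketch below is an illustrative behavioral model of the claimed precision sequence, not the hardware design:

```python
import numpy as np

def mma_2d(a, b, c):
    """Two-dimensional matrix multiply-accumulate, modeled per claim 1:
    each product is computed in 16-bit floating point, converted to a
    32-bit intermediate, and accumulated into a 32-bit sum.

    a, b: float16 matrices of shape (m, k) and (k, n)
    c:    float32 accumulator matrix of shape (m, n)
    """
    m, k = a.shape
    _, n = b.shape
    out = c.astype(np.float32).copy()
    for i in range(m):
        for j in range(n):
            for p in range(k):
                prod16 = np.float16(a[i, p] * b[p, j])      # 16-bit product
                prod32 = np.float32(prod16)                 # 32-bit intermediate
                out[i, j] = np.float32(out[i, j] + prod32)  # 32-bit sum
    return out
```

Accumulating in 32-bit precision preserves accuracy that a pure 16-bit accumulator would lose over long dot products, which is why the claims convert each 16-bit product to a 32-bit intermediate before the sum.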

Technical Field

Background

Current parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data, such as, for example, linear interpolation, tessellation, rasterization, texture mapping, depth testing, and so forth. Traditionally, graphics processors use fixed-function computational units to process graphics data; more recently, however, portions of graphics processors have been made programmable, enabling such processors to support a wide variety of operations for processing vertex and fragment data.

To further increase performance, graphics processors often implement processing techniques such as pipelining that attempt to process as much graphics data as possible in parallel throughout different portions of the graphics pipeline. Parallel graphics processors with single instruction multi-thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In the SIMT architecture, parallel thread groups attempt to execute program instructions synchronously together as often as possible to increase processing efficiency. A general overview of software and hardware for the SIMT architecture can be found in CUDA Programming by Shane Cook, Chapter 3, pages 37-51 (2013) and/or the CUDA Handbook, A Comprehensive Guide to GPU Programming by Nicholas Wilt, Sections 2.6.2 to 3.1.2 (June 2013).

Drawings

So that the manner in which the features of the invention can be understood in detail, a more particular description of the invention may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.

FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the embodiments described herein.

FIGS. 2A-2D illustrate parallel processor components, according to an embodiment.

FIGS. 3A-3B are block diagrams of a graphics multiprocessor, according to an embodiment.

FIGS. 4A-4F illustrate exemplary architectures in which multiple GPUs are communicatively coupled to multiple multicore processors.

FIG. 5 illustrates a graphics processing pipeline, according to an embodiment.

FIG. 6 illustrates a machine learning software stack, according to an embodiment.

FIG. 7 illustrates a highly parallel general purpose graphics processing unit, according to an embodiment.

FIG. 8 illustrates a multi-GPU computing system, according to an embodiment.

FIGS. 9A-9B illustrate layers of an exemplary deep neural network.

FIG. 10 illustrates an exemplary recurrent neural network.

FIG. 11 illustrates training and deployment of a deep neural network.

FIG. 12 is a block diagram illustrating distributed learning.

FIG. 13 illustrates an exemplary inference System On Chip (SOC) suitable for performing inference using a training model.

FIG. 14 is a block diagram of a multiprocessor unit according to an embodiment.

FIGS. 15A-15B illustrate a design of a logic unit to perform integer and floating point fused multiply-add operations, according to an embodiment.

FIG. 16 illustrates a fused multiply-add logic unit with merged floating point and integer data paths, according to an embodiment.

FIGS. 17A-17B illustrate logic units including combined computational circuitry to perform floating point and integer fused multiply-accumulate operations, according to an embodiment.

FIGS. 18A-18B illustrate a data processing system and associated compute and logic units that perform accelerated training and inference operations for machine learning.

FIG. 19 illustrates details of an activation instruction module, according to an embodiment.

FIG. 20 illustrates a stochastic quantization unit, according to an embodiment.

FIG. 21 illustrates an FPU encoding and configuration module, according to one embodiment.

FIG. 22 illustrates logic for processing instructions using a dynamically configurable compute unit, according to an embodiment.

FIGS. 23A-23B are flow diagrams illustrating logic to perform sparse compute operations within a GPGPU, as provided by embodiments described herein.

FIG. 24 is a block diagram of a processing system, according to an embodiment.

FIG. 25 is a block diagram of a processor, according to an embodiment.

FIG. 26 is a block diagram of a graphics processor, according to an embodiment.

FIG. 27 is a block diagram of a graphics processing engine of a graphics processor, according to some embodiments.

FIG. 28 is a block diagram of a graphics processor provided by an additional embodiment.

FIG. 29 illustrates thread execution logic including an array of processing elements employed in some embodiments.

FIG. 30 is a block diagram illustrating a graphics processor instruction format according to some embodiments.

FIG. 31 is a block diagram of a graphics processor, according to another embodiment.

FIGS. 32A-32B illustrate graphics processor command formats and command sequences, according to some embodiments.

FIG. 33 illustrates an exemplary graphics software architecture for a data processing system, according to some embodiments.

FIG. 34 is a block diagram illustrating an IP core development system, according to an embodiment.

FIG. 35 is a block diagram illustrating an exemplary system-on-chip integrated circuit according to an embodiment.

FIG. 36 is a block diagram illustrating an additional graphics processor, according to an embodiment.

FIG. 37 is a block diagram illustrating an additional exemplary graphics processor of a system-on-chip integrated circuit according to an embodiment.

Embodiments relate generally to data processing, and more particularly to data processing via a general purpose graphics processing unit.
