System and method for INT9 quantization

Document No.: 168453 Publication date: 2021-10-29

Reading note: This technique, "System and method for INT9 quantization", was designed and created on 2021-04-29 by A·苏达尼, U·哈恩巴特, and 陈嘉欣. Abstract: Embodiments of the present disclosure relate to systems and methods for INT9 quantization. A method of converting data stored in a memory from a first format to a second format is disclosed. The method comprises: expanding the number of bits in data stored in a Double Data Rate (DDR) memory by 1 bit to form extension data. The method further comprises: determining whether the data stored in the DDR is signed data or unsigned data. In response to determining that the data is signed, a sign value is added to the most significant bit of the extension data and the data is copied to the lower-order bits of the extension data. In response to determining that the data is unsigned, the data is copied to the lower-order bits of the extension data and the most significant bit is set to an unsigned value, e.g., zero. The extension data is stored in an on-chip memory (OCM) of a processing tile of a machine learning computer array.

1. A method of converting data stored in a memory from a first format to a second format for machine learning ML operations, the method comprising:

expanding the number of bits in the data stored in a double data rate DDR memory by 1 bit to form expanded data;

determining whether the data stored in the DDR is signed data or unsigned data;

in response to determining that the data is signed, adding a sign value to the most significant bits of the extension data and copying the data to the lower order bits of the extension data;

in response to determining that the data is unsigned, copying the data to lower order bits of the extension data and setting the most significant bits to an unsigned value; and

storing the expansion data in an on-chip memory OCM of a processing tile of a machine learning computer array.

2. The method of claim 1, wherein the data is an unsigned integer.

3. The method of claim 1, wherein the data is a signed integer.

4. The method of claim 1, wherein the data is 8 bits, and wherein the extension data is 9 bits.

5. The method of claim 1, wherein the extension data is int9 data.

6. The method of claim 1, further comprising:

tracking whether the data stored in the DDR is signed or unsigned; and

scheduling an appropriate instruction for the extension data based on whether the data is signed or unsigned.

7. The method of claim 6, further comprising: performing an arithmetic logic unit (ALU) operation on the extension data as operands.

8. The method of claim 7, further comprising: storing results of the operations in the OCMs of the processing tiles of the machine learning computer array.

9. The method of claim 8, further comprising: storing the result stored in the OCM into the DDR.

10. The method of claim 9, further comprising: prior to storing the result in the DDR, adjusting a value of the result to a maximum value of a range for the data if the value of the result exceeds the maximum value, and adjusting the value of the result to a minimum value of the range for the data if the value of the result is below the minimum value of the range for the data.

11. The method of claim 9, further comprising: discarding the most significant bits of the result before storing the result from the OCM in the DDR.

12. The method of claim 1, wherein the data stored in the DDR is an integer representation of the data for floating point data.

13. The method of claim 12, wherein the floating point data is scaled and quantized to form the data in the first format.

14. The method of claim 13, wherein a first scaling value is used to convert the floating point data to the int8 format, and wherein a second scaling value is used to convert the floating point data to the uint8 format.

15. A system, comprising:

a Double Data Rate (DDR) memory configured to store integer data in a first format; and

a machine learning processing unit comprising a plurality of processing tiles, wherein each processing tile comprises:

an on-chip memory (OCM) configured to accept and maintain extension data for various ML operations, the extension data converted from the integer data in the first format from the DDR memory, wherein the extension data includes one additional bit compared to the integer data in the first format, and wherein a most significant bit of the extension data is signed if the integer data in the first format is signed, and wherein the most significant bit of the extension data is set to an unsigned value if the integer data in the first format is unsigned, and wherein a least significant bit of the extension data is the same as the integer data in the first format.

16. The system of claim 15, wherein the integer data of the first format is int8 or uint8.

17. The system of claim 15, wherein the extension data is int9.

18. The system of claim 15, wherein the integer data in the first format stored in the DDR memory is tracked as being signed or unsigned, and wherein an appropriate instruction is scheduled depending on whether the integer data in the first format is signed or unsigned.

19. The system of claim 15, wherein the extension data is an operand for an operation.

20. The system of claim 19, wherein a result of the operation is stored in the OCM.

21. The system of claim 20, wherein the results of the operations stored in the OCM are also stored in the DDR memory.

22. The system of claim 21, wherein prior to storing the result in the DDR memory, if a value of the result exceeds a maximum value of a range of the integer data for the first format, the value of the result is adjusted to the maximum value, and if the value of the result is below a minimum value of the range of the data, the value of the result is adjusted to a minimum value of the range of the integer data for the first format.

23. The system of claim 21, wherein the most significant bits of the result are discarded before storing the result in the DDR memory.

24. The system of claim 15, wherein the integer in the first format is an integer representation of floating point data, and wherein the floating point data is scaled and quantized to form the integer data in the first format.

25. The system of claim 24, wherein a first scaling value is used to convert the floating point data to the int8 format, and wherein a second scaling value is used to convert the floating point data to the uint8 format.

Technical Field

Embodiments of the present disclosure relate to machine learning.

Background

Hardware-based Machine Learning (ML) systems typically include multiple cores/subsystems (blocks and tiles), each with its own processing unit and on-chip memory (OCM). The ML system can process the quantized numbers to perform various calculations. For example, quantized data stored in a memory unit, such as a Double Data Rate (DDR) memory, may be transferred to a processing tile so that the processing unit may process the data for various ML operations.

Typically, floating point numbers (data) are converted to a quantized data format for storage, for example, in DDR and subsequent processing, for example, by an ML system. Quantization formats may include, but are not limited to, signed integers, unsigned integers, etc., which are used in Arithmetic Logic Unit (ALU) computations. Typically, a mixture of quantization format types is used for various computations, e.g., by ML systems.

Unfortunately, no mechanism currently exists to perform ALU computations on a mixture of quantization format types without introducing an offset. Thus, when a mixture of quantization format types is used in an ALU computation, the format type of each operand must be tracked, which increases the complexity and latency of the ALU computation. Furthermore, in ML systems, integer values may need to be rescaled before being input to the processing unit. However, mathematical division in a processor is often time consuming and inefficient.

The foregoing examples of related art and limitations related thereto are intended to be illustrative, and not exclusive. Other limitations of the related art will become apparent from a reading of the specification and a study of the drawings.

Disclosure of Invention

According to some implementations, there is provided a method of converting data stored in a memory from a first format to a second format for machine learning ML operations, the method comprising: expanding the number of bits in the data stored in the double data rate DDR memory by 1 bit to form expanded data; determining whether data stored in the DDR is signed data or unsigned data; in response to determining that the data is signed, adding a sign value to the most significant bits of the extension data and copying the data to the lower order bits of the extension data; in response to determining that the data is unsigned, copying the data to lower order bits of the extension data and setting the most significant bit to an unsigned value; and storing the expansion data in an on-chip memory OCM of a processing tile of the machine learning computer array.

According to some implementations, there is provided a system comprising: a Double Data Rate (DDR) memory configured to store integer data in a first format; and a machine learning processing unit comprising a plurality of processing tiles, wherein each processing tile comprises: an on-chip memory OCM configured to accept and maintain extension data for various ML operations, the extension data converted from integer data of a first format from a DDR memory, wherein the extension data includes one additional bit compared to the integer data of the first format, and wherein a most significant bit of the extension data is signed if the integer data of the first format is signed, and wherein the most significant bit of the extension data is set to an unsigned value if the integer data of the first format is unsigned, and wherein a least significant bit of the extension data is the same as the integer data of the first format.

Drawings

Aspects of the present disclosure are best understood from the following detailed description when read with the accompanying figures. Note that in accordance with standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or decreased for clarity.

FIG. 1 depicts an illustrative example of a hardware-based programmable architecture configured to support machine learning in accordance with an aspect of the present embodiment.

Fig. 2A-2D illustrate quantization from one digital format type to another digital format type implemented by the hardware-based programmable architecture of fig. 1.

Fig. 3A-3B illustrate a flow diagram for converting stored data from a first format to a second format in accordance with an aspect of the present embodiment.

Fig. 4 depicts a diagram of an example of an architecture of a POD unit in accordance with an aspect of the present embodiment.

FIG. 5 illustrates a methodology for rescaling integer data in a machine learning operation in accordance with an aspect of the present embodiment.

Detailed Description

The following disclosure provides many different embodiments, or examples, for implementing different features of the subject matter. Specific examples of components and arrangements are described below to simplify the present disclosure. Of course, these are merely examples and are not intended to be limiting. Additionally, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.

Before the various embodiments are described in greater detail, it is to be understood that the embodiments are not limiting, as the elements of such embodiments may vary. It should also be understood that a particular embodiment described and/or illustrated herein has elements that can be readily separated from that particular embodiment and optionally combined with any other embodiment or substituted for elements in any of a number of other embodiments described herein. It is also to be understood that the phraseology used herein is for the purpose of description of certain concepts and that this terminology is not intended to be limiting. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood in the art to which the embodiments belong.

A new programmable hardware architecture for Machine Learning (ML) is proposed, which comprises at least a host, a memory, a core, a data streaming engine, an instruction streaming engine, and an inference engine. The memory is configured to store floating point numbers in a quantized format, including but not limited to int8, uint8, and the like. According to some embodiments, the quantized data stored in memory is converted to int9 format, thereby uniformly representing different quantized data format types (e.g., int8, uint8, etc.) and providing symmetric quantization of the data (i.e., quantization symmetric with respect to zero), while eliminating the need to perform offset calculations. It is to be appreciated that converting data to the int9 format type enables the inference engine to perform ALU computations on homogeneous int9 operands without having to track the format type of the quantized operands, thereby reducing complexity and processing time.

In some embodiments, when reading data from a memory cell (e.g., DDR), the 8-bit number is converted to the int9 format type based on whether the number stored in the memory cell is int8 or uint8, as a non-limiting example. In some embodiments, 9 bits are used: the int8 format type is sign-extended to int9, while the uint8 format type is copied to the least significant bits of the 9 bits of data and its most significant bit (i.e., the 9th bit) is unsigned, e.g., set to zero. It is to be appreciated that the software component can ensure that the int9 value written to the memory cell (e.g., 8-bit DDR) is within the appropriate range of int8, uint8, etc., since eight bits are not sufficient to store the full int9 range.
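As a non-limiting illustration, the sign/zero extension described above may be sketched as follows; the function names and mask constants are illustrative, not part of the disclosure:

```python
def to_int9(byte_value: int, is_signed: bool) -> int:
    """Extend an 8-bit value read from DDR to a 9-bit (int9) pattern.

    `byte_value` is the raw 8-bit pattern (0..255). For signed (int8)
    data, the sign bit is replicated into the new 9th bit; for unsigned
    (uint8) data, the 9th bit is set to zero. In both cases the original
    8 bits are copied into the low-order bits unchanged.
    """
    assert 0 <= byte_value <= 0xFF
    if is_signed and (byte_value & 0x80):  # negative int8: replicate sign
        return byte_value | 0x100          # set bit 8 (sign extension)
    return byte_value                      # zero extension (bit 8 = 0)

def int9_to_value(pattern: int) -> int:
    """Interpret a 9-bit pattern as a two's-complement int9 in [-256, 255]."""
    return pattern - 0x200 if (pattern & 0x100) else pattern
```

For example, the int8 pattern 0xFF (-1) sign-extends to 0x1FF, while the same pattern read as uint8 zero-extends and keeps the value 255.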

In some embodiments, the software component is configured to perform operations to limit the range of an int9 number to within the range of int8, uint8, etc., or alternatively, to represent the int9 number as two int9 numbers, one within the range of int8 and the other within the range of uint8. Representing the int9 number as two int9 numbers, one within the int8 range and the other within the uint8 range, allows the least significant bits of the int9 number to be copied into the 8-bit DDR without losing information.

In some embodiments, the core of the programmable hardware architecture is configured to interpret a plurality of ML commands/instructions for ML operations and/or data received from the host, and to coordinate the activities of the streaming engines and the inference engine based on the data in the received ML commands. The inference engine may include a dense operation engine and an irregular operation engine. The dense operation engine is optimized to efficiently process dense data with regular operations (e.g., matrix operations such as multiplication, matrix manipulation, tanh, sigmoid, etc.). The irregular operation engine, on the other hand, is optimized to efficiently process sporadic data with irregular operations (e.g., memory transposes, operations on irregular data structures such as trees, graphs, and priority queues). According to some embodiments, the core may coordinate some of the instructions received from the host to be processed. In some embodiments, the core may be a general purpose processor, such as a CPU.

In some embodiments, the core is specifically configured to divide the plurality of ML commands between the core and the inference engine for efficient execution. The ML commands and related data to be executed by the inference engine are sent from the core and memory to the instruction streaming engine and data streaming engine for efficient streaming to the inference engine. As described above, the data read from the memory cell is converted into int9 format. The data and instruction streaming engines are configured to send one or more data streams and ML commands to the inference engine in response to programming instructions received from the core. The inference engine is configured to process the instruction/data streams received from the streaming engines for ML operations according to the received programming instructions.

It is to be appreciated that the data input of the dense operation engine of the inference engine may need to be rescaled before certain operations (e.g., tanh, sigmoid, etc.). To rescale in an efficient manner, the data input, in int32 format, is multiplied by an integer scale value and then shifted. To obtain the highest possible precision and lowest error in the calculation, the relationship between the integer scale value and the shift value is determined based on the size of the register storing the integer data (e.g., int32).
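The multiply-and-shift rescaling described above may be sketched as follows; the helper name `make_rescale` and the choice of 15 fractional bits are illustrative assumptions, not taken from the disclosure (hardware would fix the shift from the register width):

```python
def make_rescale(factor: float, scale_bits: int = 15):
    """Approximate multiplication by a real `factor` with an integer
    multiply followed by a right shift, avoiding a hardware divide.

    `scale_bits` fixes the fixed-point precision: the integer scale is
    round(factor * 2**scale_bits), so (x * scale) >> scale_bits
    approximates x * factor. A larger scale_bits gives more precision
    but needs more headroom in the accumulator.
    """
    scale = round(factor * (1 << scale_bits))
    def rescale(x: int) -> int:
        # arithmetic right shift; Python ints emulate a wide accumulator
        return (x * scale) >> scale_bits
    return rescale, scale
```

For instance, rescaling by 1/3 becomes a multiply by round(2^15/3) = 10923 followed by a 15-bit shift, with no division in the data path.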

Referring now to fig. 1, an example of a diagram of a hardware-based programmable system/architecture 101 configured to support machine learning is depicted. Although these figures depict components as functionally separate, such depiction is merely for illustrative purposes. It should be apparent that the components depicted in this figure may be arbitrarily combined or divided into separate software, firmware, and/or hardware components. Further, it will also be apparent that these components, regardless of how they are combined or divided, may execute on the same host or multiple hosts, and where multiple hosts may be connected by one or more networks. Each engine in architecture 101 is a dedicated hardware block/component that includes one or more microprocessors and an on-chip memory unit that stores data and software instructions programmed by a user for various machine learning operations. When the software instructions are executed by the microprocessor, each hardware component will become a special purpose hardware component for practicing certain machine learning functions discussed in detail below. In some embodiments, the architecture 101 is on a single chip, such as a system on a chip (SOC).

In the example of fig. 1, architecture 101 may include a host 110 coupled to a memory (e.g., DDR) 120 and a core engine 130. The memory 120 may be coupled to a Direct Memory Access (DMA) engine (not shown) and a Network Interface Controller (NIC) (not shown) to receive external data. The memory 120 may be internally connected to the data streaming engine 140. Core 130 is coupled to instruction streaming engine 150, and instruction streaming engine 150 is coupled to data streaming engine 140. The core 130 is also coupled to a general purpose processor 165. In some embodiments, general purpose processor 165 may be part of core 130. The instruction streaming engine 150 and the data streaming engine 140 are coupled to an inference engine 160 that includes a dense operation engine 161 and an irregular operation engine 163. It is to be appreciated that inference engine 160 may include an array for performing various computations. It is to be appreciated that any discussion of an array for performing various computations in inference engine 160 is for purposes of illustration and should not be construed as limiting the scope of the embodiments. For example, in some embodiments, the array used to perform the various computations may be located external to inference engine 160.

It is to be appreciated that the external data may be in a floating point format, such as 32-bit floating point. Thus, when data is stored in memory 120 (e.g., 8-bit DDR), it may be converted to an integer format type, such as int8, uint8, and so on. It is understood that the range of uint8 is 0 to 255, and that of int8 is -128 to 127. In contrast, int9 ranges from -256 to 255, so int8 and uint8 can both be represented without any offset calculation. Because int9 covers both the uint8 range and the int8 range, data can still be copied to standard 8-bit DDR. It is to be appreciated that the description regarding the use of 32-bit floating point and 8-bit DDR is for illustrative purposes and should not be construed to limit the scope of the embodiments. The floating point data is ultimately quantized to int9 instead of int8 or uint8. Furthermore, since the int9 range encompasses both positive and negative values, the quantization has zero offset, which further simplifies rescaling of int9 numbers in ML systems. Thus, when data is read from memory 120 (e.g., 8-bit DDR), the data is converted to int9 format. It is to be appreciated that once the data is converted to int9 format, there is no need to track the type of operand when a mix of different format types is used in the computation. For example, using int9 eliminates the need to track whether operands in the computations being performed are int8, uint8, etc.
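As a non-limiting illustration of why no offset is needed: both 8-bit ranges fall inside int9, so mixed-format operands can be combined with plain signed arithmetic once extended. The names below are illustrative, and saturation of the sum is omitted for brevity:

```python
# int9 (-256..255) contains both quantized 8-bit ranges, so operands of
# mixed format can be combined directly after extension to int9 -- no
# per-operand offset bookkeeping is needed.
INT8_MIN, INT8_MAX = -128, 127
UINT8_MIN, UINT8_MAX = 0, 255
INT9_MIN, INT9_MAX = -256, 255

def fits_int9(value: int) -> bool:
    """Check that a value lies in the two's-complement int9 range."""
    return INT9_MIN <= value <= INT9_MAX

def mixed_add(a_int8: int, b_uint8: int) -> int:
    """Add an int8 operand and a uint8 operand as uniform int9 values."""
    assert INT8_MIN <= a_int8 <= INT8_MAX
    assert UINT8_MIN <= b_uint8 <= UINT8_MAX
    return a_int8 + b_uint8  # plain signed arithmetic once both are int9
```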

It is also appreciated that in some embodiments, the memory 120 (e.g., DDR) may store floating point numbers (e.g., 32-bit floating point) as four 8-bit values. Thus, when data is read from memory 120 (e.g., 8-bit DDR) into on-chip memory, quantization from 32-bit floating point to int9 is performed in the general purpose processor 165 and/or the irregular operation engine 163. In some embodiments, registers within the general purpose processor 165 and/or the irregular operation engine 163 are 32 bits wide and hold the 32-bit floating point values. In this way, floating point numbers can be converted to int9 numbers for use in ML systems. However, the 32-bit floating point number is first scaled to convert it to int9 format. For example, a suitable scale may be:

scale = (floating point upper limit - floating point lower limit) / (int9 upper limit - int9 lower limit) = (End - (-End)) / (255 - (-255)) = 2·End / (2·255) = End/255.

It is to be appreciated that for the int9 lower limit, the same scale can be used when extended to include-256. It is to be appreciated that fig. 2A illustrates symmetric quantization and mapping of 32-bit floating points to the full range of int9, while fig. 2B illustrates the mapping in fig. 2A to include-256 for int 9. As shown, the same scale may be used for both fig. 2A and 2B. Fig. 2C illustrates the mapping of 32-bit floating points to int9 and the representation of int8 within the int9 range. It will be appreciated that the scale for int8 representation within int9 is different from the scale identified above. In some embodiments, the scale for representing int8 with 9 bits may be:

scale = (floating point upper limit - floating point lower limit) / (int8 upper limit - int8 lower limit) = (End - (-End)) / (127 - (-127)) = End/127.
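The two scale derivations above may be checked with a short sketch; the function names are illustrative, and `End` is taken as the magnitude of the symmetric floating point range, as in the formulas:

```python
def symmetric_scale(end: float, q_max: int) -> float:
    """Scale for symmetric quantization of floats in [-End, End] onto
    the integer grid [-q_max, q_max]:

        scale = (End - (-End)) / (q_max - (-q_max)) = End / q_max

    q_max = 255 gives the int9 mapping (scale = End/255); q_max = 127
    gives the int8-within-int9 mapping (scale = End/127).
    """
    return (end - (-end)) / (q_max - (-q_max))

def quantize(x: float, scale: float, q_min: int, q_max: int) -> int:
    """Quantize a float to the integer grid and saturate to [q_min, q_max]."""
    return max(q_min, min(q_max, round(x / scale)))
```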

FIG. 2D illustrates a 32-bit floating point to int9 mapping and a representation of uint8 within int9. It is to be appreciated that the representation of uint8 within int9 has the same scale as FIGS. 2A and 2B.

In some embodiments, when data is transferred from memory 120 into an array (e.g., inference engine 160, etc.), the data to be transferred is sign-extended or zero-extended depending on whether the data to be transferred is int8 or uint8. In other words, data is converted from one format type (e.g., int8, uint8, etc.) to another format type (e.g., int9). As a non-limiting example, when converting data from int8 or uint8 to int9 format, 8-bit data may be converted to 9-bit data by extending the number of bits by 1 bit. It is determined whether the data being converted is signed (e.g., int8) or unsigned (e.g., uint8). If the data being converted is signed, the sign is extended into the most significant bit of the 9-bit int9 data; if the data being converted is unsigned, the most significant bit of the 9 bits of int9 is set to zero. It is to be appreciated that the int8 or uint8 data is copied directly to the low-order bits (i.e., the lower 8 bits) of the int9 data. It is to be appreciated that the int9 data may be referred to as extension data, as compared to the int8 or uint8 format types. The extension data, in this example of the int9 format type, is stored in the inference engine 160 for operation thereon. In some embodiments, the extension data may be stored in the on-chip memory (OCM) of inference engine 160 for processing by the processing tiles of the ML computer array. It is to be appreciated that in some embodiments, floating point numbers (e.g., 32-bit floating point) may be converted to integer representations, e.g., int9. In one illustrative embodiment, as shown in FIGS. 2A-2D, floating point numbers are quantized and scaled appropriately to convert them to the int9 format type. As shown, one scaling value may be used to convert the floating point data to int8 represented within int9, while another scaling value may be used to convert the floating point data to uint8 represented within int9.
It is to be appreciated that 16-bit floating point numbers stored in memory unit 120 (e.g., DDR) remain unchanged as they are moved from memory unit 120 into the OCM of inference engine 160. As a non-limiting example, the lower 7 bits hold the lower 7 bits of the mantissa; the 8th bit is an extension bit not used in operations; the 9th and 10th bits hold the remaining mantissa bits; these are followed by the exponent and sign bits; and the sign bit is followed by an additional extension bit that is not used in any operation.

It is to be appreciated that inference engine 160 may include a plurality of processing tiles arranged in a two-dimensional array of a plurality of rows and columns (e.g., 8 rows by 8 columns). Each processing tile may include at least one OCM, one POD unit, and one processing engine/element (PE). Here, the OCM in the processing tile is configured to receive data from data streaming engine 140 in a streaming manner. It is to be appreciated that the received data may be in int9 format, as described above. OCM enables efficient local access to data per processing tile. The processing units (e.g., POD and PE) are configured to perform highly specialized tasks on the received data in the OCM, such as intensive and sparse computations of ML operations, respectively.

It is to be appreciated that for various ALU operations associated with an ML operation, the OCM of the respective processing tile may receive data in the int9 format type. It is to be appreciated that in some embodiments, the format type (e.g., signed or unsigned) of the data stored in memory 120 is tracked such that the appropriate instructions can be scheduled and streamed for execution by the appropriate processing unit (e.g., the POD/PE of the respective processing tile). In other words, the processing tiles perform various ALU operations on the data received in int9 format. The data received in int9 format may be operands of various ALU operations. The results of these ALU operations, in the int9 format type, may be stored in the corresponding OCM.

In some embodiments, the inference engine 160 includes a dense operation engine 161 optimized to efficiently process dense data (e.g., data received from memory 120 in int9 format) with regular operations (e.g., matrix operations such as multiplication, matrix manipulation, tanh, sigmoid, etc.). On the other hand, the inference engine 160 may also include an irregular operation engine 163 optimized to efficiently process sporadic data (e.g., of the int9 format type) with irregular operations (e.g., memory transposes, addition operations, operations on irregular data structures such as trees, graphs, and priority queues). According to some embodiments, the core 130 may coordinate some instructions received from the host 110 to be processed by a general purpose processor 165 (e.g., a CPU, etc.).

In some embodiments, core 130 is configured to execute any software code written in a common high-level language. The core 130 is configured to handle a number of performance non-critical operations, such as data/instruction preparation work, data collection, data mapping, and the like. In some embodiments, the core 130 may also be configured to decompose received ML commands into performance-critical and non-critical operations/tasks, such that performance non-critical operations may be processed by the core 130 while performance-critical operations (e.g., matrix multiplication) may be processed by the inference engine 160. In other words, the core 130 is configured to divide the plurality of ML commands between the core 130 and the inference engine 160 for efficient execution. In some embodiments, the core 130 may also be configured to assign/partition multiple ML commands (also referred to as tasks or subtasks) to various components (e.g., the inference engine 160) for processing. In some embodiments, the core 130 is configured to allocate one or more locations in the memory 120 for storing tasks/commands, data, and post-processing results, for access and use by the core 130 or other components in the architecture 101 (e.g., the inference engine 160). In this way, core 130 and inference engine 160 are configured to perform the entire ML algorithm and operations on their own, without having to rely on or require host 110 to perform certain ML commands or operations. By supporting and performing the entire ML operation on programmable hardware architecture 101, core 130 eliminates the performance overhead of transferring data to host 110 and back to perform any unsupported ML operations, relieving host 110 of that burden and achieving higher performance.

In some embodiments, the ML commands and related data (e.g., in int8 format) to be executed by the inference engine 160 are sent from the core 130 and memory 120 to the instruction streaming engine 150 and data streaming engine 140 for efficient streaming to the inference engine 160. In some embodiments, the data and instruction streaming engines 140 and 150 are configured to send one or more data streams and programming instructions to the inference engine 160 in response to the ML commands received from the core 130. It is to be appreciated that in some embodiments, the format type (e.g., signed or unsigned) of the data stored in memory 120 is tracked so that appropriate instructions can be scheduled to be streamed to the dense operation engine 161 and/or the irregular operation engine 163 of inference engine 160. In other words, the engines in inference engine 160 perform various ALU operations on data received in int9 format. The data received in int9 format may be operands of various ALU operations. The results of various ALU operations of the int9 format type may be stored in the dense operation engine 161 and/or the irregular operation engine 163 of the inference engine 160. In some embodiments, the results may be stored in the appropriate OCM of the processing tiles of the ML computer array.

It is to be appreciated that in some embodiments, the results of ALU operations within inference engine 160 are stored in the memory component (e.g., OCM) of the corresponding processing tile within inference engine 160. The results stored in inference engine 160 may be sent for storage to memory unit 120 (e.g., DDR). However, before storing a result, if its value exceeds the upper limit (e.g., maximum value) of the range of the format type of the data in memory unit 120, the value is adjusted to that upper limit, and if its value is below the lower limit (e.g., minimum value) of that range, the value is adjusted to that lower limit. It is to be appreciated that the most significant bit of the int9 result is discarded when the result from the OCM of the corresponding processing tile is stored into memory unit 120.

It is to be appreciated that in some embodiments, the processing results stored in the respective OCM may be sent to be stored back to memory unit 120 (e.g., DDR). However, before the result is stored, if its value exceeds the upper limit (e.g., the maximum value) of the format type of the data in the memory unit 120, the value of the result may be adjusted to the upper limit of the range for the data, and if its value is below the lower limit of the range for the memory unit 120, the value of the result may be adjusted to the lower limit (e.g., the minimum value) for the data. In other words, the data may be limited to an appropriate range, such as the int8 range, the uint8 range, and the like. It is to be appreciated that the most significant bit of the int9 result is discarded when the result from the OCM of the corresponding processing tile is stored into memory unit 120. Moreover, it is to be appreciated that the software module may track whether data stored in memory unit 120 is signed or unsigned, such that when data is transferred from the inference engine (e.g., the corresponding OCM of inference engine 160) to memory unit 120 (e.g., DDR), the int9 data format type may be correctly interpreted into the appropriate format type, e.g., int8 for int8-formatted data in memory unit 120, uint8 for uint8-formatted data in memory unit 120, and so on.
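The range adjustment and most-significant-bit discard described above can be sketched as follows. This is an illustrative sketch, not the hardware implementation; the function name is an assumption, and the 9-bit result is modeled as a Python int.

```python
def store_result_to_ddr(result, signed):
    """Clamp a 9-bit-wide ALU result to the range of the destination format
    in DDR (int8 if the data is signed, uint8 otherwise), then drop the
    extra most significant bit, leaving an 8-bit pattern.
    """
    lo, hi = (-128, 127) if signed else (0, 255)
    clamped = max(lo, min(hi, result))
    # Discarding the 9th (most significant) bit leaves the 8-bit pattern;
    # e.g., -128 becomes the int8 bit pattern 0x80 (128 as a raw byte).
    return clamped & 0xFF
```

For example, a signed result of 200 saturates to 127, and a negative result stored as unsigned saturates to 0.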

Referring now to fig. 3A and 3B, examples of processes for supporting conversion of data stored in memory (e.g., DDR) from a first format (e.g., int8, uint8, floating point, etc.) to a second format type (e.g., int9) as described in fig. 1-2D are shown. Although the figures depict the functional steps in a particular order for purposes of illustration, the processes are not limited to any particular order or arrangement of steps. One skilled in the relevant art will appreciate that the various steps depicted in the drawings may be omitted, rearranged, combined, and/or altered in various ways.

As shown in fig. 3A and 3B, in step 310, the number of bits in the data stored in memory unit 120 (e.g., DDR) is extended by 1 bit to form extended data (e.g., int9). Thus, int8 or uint8 data, which includes 8 bits, is extended to 9 bits. It is to be appreciated that in some embodiments, the data stored in memory unit 120 is a floating point number. In step 320, it is determined whether the data stored in memory 120 is signed (e.g., int8) or unsigned (e.g., uint8). In step 330, the extended data is sign-extended in response to determining that the data is signed. Conversely, in step 340, the most significant bit of the extended data is set to zero in response to determining that the data is unsigned. In step 350, the data is copied to the lower order bits, i.e., all bits except the most significant bit. In step 360, the extended data is copied to inference engine 160, e.g., the OCM of inference engine 160. It is to be appreciated that in step 370, whether the data stored in memory unit 120 (e.g., DDR) is signed or unsigned is tracked, and as such, in step 380, the appropriate instructions for the extended data are scheduled. In step 382, various ALU operations may be performed on the extended data. At step 384, the result of the ALU operation is stored in the OCM. At step 386, the result of the ALU operation stored in the OCM is also stored/copied to the memory unit 120 (e.g., DDR). In step 388, the most significant bit of the result is discarded before the result is stored from the OCM to the DDR. It is to be appreciated that, optionally, in step 390, the value of the result of the ALU operation may be adjusted, before it is stored from the OCM to the memory unit 120 (e.g., DDR), based on the range of the number format type stored in the memory unit 120. For example, if the value of the result stored in the OCM exceeds the upper limit of the range of the number type (e.g., int8 or uint8) stored in the memory unit 120, the result is adjusted and changed to the maximum value or upper limit of the range of that number type (int8, uint8, etc.).
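Steps 310 through 350 can be sketched as follows. This is a minimal sketch: the function names and the modeling of bit patterns as Python ints are assumptions for illustration, not the hardware implementation.

```python
def extend_to_int9(value, signed):
    """Extend an 8-bit value (raw pattern 0..255) to a 9-bit representation.

    For signed data the sign bit is replicated into the new most significant
    bit (sign extension, step 330); for unsigned data the new most significant
    bit is set to zero (step 340). In both cases the original 8 bits are
    copied into the lower order bits (step 350).
    """
    assert 0 <= value <= 0xFF
    if signed and (value & 0x80):   # sign bit set: replicate it into bit 8
        return 0x100 | value
    return value                    # bit 8 = 0, low 8 bits copied


def int9_to_signed(bits):
    """Interpret a 9-bit pattern as a signed integer in [-256, 255]."""
    return bits - 0x200 if bits & 0x100 else bits
```

For example, the pattern 0xFF becomes -1 when treated as signed int8 but 255 when treated as uint8, and both are representable in int9.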

Fig. 4 depicts a diagram of an example of the architecture of a POD. It is to be appreciated that the number of components, component sizes and numbers of bits, matrix sizes, etc. shown in FIG. 4 are for purposes of illustration and are not intended to limit the scope of the embodiments. Although matrix multiplication is used as a non-limiting example in the following discussion, it is to be understood that the POD is also configured to perform other types of intensive computational tasks of ML operations. In the example of fig. 4, the POD includes a compute POD instruction control 699 configured to control loading of data/instructions into various components (e.g., registers, tanh/sigmoid unit 614, etc.). It is to be appreciated that the POD includes a matrix multiplication block 602, which is a two-dimensional array having X rows and Y columns, with each element/cell in the array having a certain number of registers (e.g., MIPS, i.e., microprocessors without interlocked pipeline stages). The matrix multiplication block 602 is configured to multiply two matrices (matrix A with X rows and Z columns and matrix B with Z rows and Y columns) to generate matrix C with X rows and Y columns. It is to be appreciated that even though the data stored in memory unit 120 is in a different format type (e.g., int8, uint8, floating point, etc.), the data being multiplied may be of the int9 format type, as stored in the respective OCMs.

In the example of fig. 4, the POD also includes three types of registers that feed matrix data to the matrix multiplication block 602 for matrix multiplication: an A register 604, a B register 606, and a C register 608. The A register 604 comprises a set of registers, e.g., m registers, each configured to hold one row/column of the A matrix fed to a column of the array of the matrix multiplication block 602. Each A register may have several entries (e.g., k elements), each entry a certain number of bits wide and supporting one read or write operation per cycle. It is to be appreciated that the data in each register can be of the int9 format type even though the data stored in memory unit 120 is of a different format type (e.g., int8, uint8, floating point, etc.). In other words, data is converted from the format in memory unit 120 to a different format type (e.g., int9) to be stored in the corresponding OCM of the processing tile and used for the ALU computations of PE and/or POD operations. These entries allow each A register to fetch the next portion of the A matrix before the matrix multiplication block 602 needs it. The B register 606 comprises a set of registers, e.g., n registers, each configured to hold one row/column of the B matrix fed to a row of the array of the multiplication block 602. Similar to the A register 604, each B register may have several entries (e.g., k elements), each entry a certain number of bits wide and supporting one read or write operation per cycle. These entries allow each B register to fetch the next portion of the B matrix before the matrix multiplication block 602 needs it. The C register 608 is configured to hold the matrix multiplication result (the C matrix) produced by the multiplication block 602. The C register 608 includes several banks, each configured to hold one row/column of the C matrix. The C matrix is configured to have m x n elements.

During the matrix multiplication process, the matrix multiplication block 602 is configured to read the elements of matrices A and B from the OCM into the A and B registers only once each (rather than repeatedly reading each row or each column of the matrix), thereby saving memory access time to the OCM. In particular, each matrix multiplication operation has its own structure, where a row of the first matrix will be multiplied by all columns in the second matrix, and a column of the second matrix will be multiplied by all rows in the first matrix. When the matrix multiplication block 602 performs a matrix multiplication operation, each row of the A register 604 remains unchanged while the columns of the B register 606 are fed to the matrix multiplication block 602, one column at a time, to be multiplied by the rows in the A register 604. At the same time, each column of the B register 606 remains unchanged while the rows of the A register 604 are fed to the matrix multiplication block 602, one row at a time, to be multiplied by the columns of the B register 606. As such, the matrix multiplication block 602 is configured to simultaneously multiply each row of the first matrix by all columns of the second matrix and each column of the second matrix by all rows of the first matrix. The outputs of these multiplications are accumulated and stored in the C register until the matrix multiplication process is complete.

As shown in the example of fig. 4, the A register 604, B register 606, and C register 608 are associated with corresponding OCM streamers 603, 605, and 607, respectively, where each OCM streamer is programmed and configured to stream data from the OCM into its corresponding register to ensure that the matrix multiplication block 602 can perform matrix multiplication operations in a streaming manner. Each OCM streamer has an address range of the OCM to read, and a stride to follow for the next read. The registers of type A or B are configured to send a next-row-ready signal for each memory bank to the corresponding streamer, where the bit pattern of the signal indicates which memory banks are requesting the next row of data. The corresponding streamer of the A or B register responds to the read signal by sending the corresponding line of data from the OCM to the register. The streamer sends a completion signal to its corresponding register when sending the last line of data to be transmitted. When all banks of a register have rows of data, the A or B register sends a ready signal to the matrix multiplication block 602 indicating that the next set of A or B registers is ready to be read into the matrix multiplication block 602 for matrix multiplication. In some embodiments, each register bank has a valid bit that informs the matrix multiplication block 602 which values are valid and should be operated on.

When the matrix multiplication is complete, e.g., when the end of the rows of the A matrix and the end of the columns of the B matrix are reached, the matrix multiplication block 602 notifies the C register 608 that all accumulations in the entries of the C register 608 are complete and that the entries are ready to be written back to the OCM via their corresponding streamer 607. Each bank of the C register 608 then sends data to the OCM. If the OCM is not ready to accept data from a bank of the C register 608, the transmission is suspended and retried in the next cycle until the OCM is ready to accept the data. In some embodiments, during the next matrix multiply operation, the C register 608 is preloaded with data or reset to zero before the next set of accumulations. This preloading allows biases to be added as part of the next matrix multiplication. In some embodiments, each PE is configured to accept, process, and write the output C matrix from the matrix multiplication block 602 of the POD into the OCM.
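The accumulate-into-C behavior described above, including preloading C with bias values before the next set of accumulations, can be sketched in plain Python. This sketch covers the arithmetic only; the streaming, banking, and signaling mechanics are omitted, and the function name is an assumption.

```python
def matmul_accumulate(A, B, C=None):
    """Multiply an X-by-Z matrix A with a Z-by-Y matrix B, accumulating
    into C (X-by-Y). Passing a preloaded C models the bias-add fusion
    described for the C register; by default C starts at zero.
    """
    X, Z, Y = len(A), len(B), len(B[0])
    if C is None:
        C = [[0] * Y for _ in range(X)]   # reset to zero before accumulating
    for i in range(X):
        for j in range(Y):
            for k in range(Z):
                C[i][j] += A[i][k] * B[k][j]   # accumulate partial products
    return C
```

With C preloaded to all ones, each output element of the product is simply offset by one, which is the bias-add effect described above.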

According to one example, the processing results stored in the respective OCM may be sent to be stored back to memory unit 120 (e.g., DDR). However, before the result is stored, if its value exceeds the upper limit (e.g., the maximum value) of the format type of the data in the memory unit 120, the value of the result may be adjusted to the upper limit of the range for the data, and if its value is below the lower limit of the range for the memory unit 120, the value of the result may be adjusted to the lower limit (e.g., the minimum value) for the data. In other words, the data may be limited to an appropriate range, such as the int8 range, the uint8 range, and the like. It is to be appreciated that the most significant bit of the int9 result is discarded when the result from the OCM of the corresponding processing tile is stored into memory unit 120. Moreover, it is to be appreciated that the software module may track whether data stored in memory unit 120 is signed or unsigned, such that when data is transferred from the inference engine (e.g., the corresponding OCM of inference engine 160) to memory unit 120 (e.g., DDR), the int9 data format type may be correctly interpreted into the appropriate format type, e.g., int8 for int8-formatted data in memory unit 120, uint8 for uint8-formatted data in memory unit 120, and so on.

In some embodiments, the inference engine 160 is configured to fuse/integrate these post matrix multiplication operations by each PE with the matrix multiplication operations by the corresponding POD, such that the post matrix multiplication operations are performed immediately on the outputs from the matrix multiplication block 602, without having to first send and save the outputs to the OCM and then read the C matrix from the OCM again for the post matrix multiplication operations. By bypassing the round trip to the OCM, the fusion of post matrix multiplication operations with matrix multiplication operations saves time and improves the efficiency of the inference engine 160. For example, it is to be appreciated that in some embodiments, additional regular operations may need to be performed on the output C matrix, such as rectified linear units (ReLUs), quantization, and so forth. Thus, a switching mechanism can be integrated into the POD architecture to determine whether additional regular operations are required and, if so, to operate on the output rather than writing the output C matrix to another memory location. For example, when a rectified linear operation is required, the output C matrix is streamed to the RELU unit 601, which is configured to perform a RELU operation on the C matrix. Similarly, when quantization is required, the output C matrix or the output of the RELU unit 601 is streamed to a quantization unit 612, which is configured to quantize the C matrix or the result of the RELU operation.

In some embodiments, the scaling, shifting, and/or offset values required for the quantization/re-quantization operations may be statically set by core 130 and may be different for different ML operations. In some embodiments, these values may be part of the ML model downloaded to the core, where the values corresponding to the ML operation may be read from the model and written into the appropriate registers before the quantization operation begins. It is to be appreciated that the re-quantization performs a re-scaling of the output values stored in the C register 608 as input to the quantization unit 612 and/or tanh/sigmoid unit 614, after which the values are stored directly into the corresponding OCM block. It is to be appreciated that in this example re-quantization is performed on the output data (e.g., the C register 608), but in other examples re-quantization may be performed on other outputs from other registers. As such, performing re-quantization on the data stored in the C register 608 is for illustration purposes and should not be construed as limiting the scope of the embodiments. In some embodiments, a single scaling value is applied to all elements of the output. It is to be appreciated that the scaling operation (which is a division operation) may be replaced with integer multiplication and shift operations. It is also to be understood that the relationship between the value of the integer multiplication (also called the integer scale value) and the shift value determines the accuracy and error in the system. In some embodiments, the relationship between the integer scale value and the shift value is obtained, and the largest possible integer scale value and its corresponding shift value are selected based on the size of the register storing the result of the multiplication (the multiplication of the output from the C register 608 by the integer scale value). In some embodiments, the output from the C register 608 may be represented as V and the quantization multiplier may be represented as x, where x may be greater or less than 1. It will be appreciated that the relationship between the integer scale value and the shift value determines the quantization multiplier; the relationship is approximately given by equation (1) below.

x = integer scale value / 2^(shift value) (1)

Therefore,

integer scale value = int(x * 2^(shift value)) (2)

It will be appreciated that the maximum integer scale value is limited by the size of the register holding the result of the integer multiplication, and thus the scaling of the output of the C register 608, i.e., the V value, is limited by the integer scale value. For example, if V is int32 and the register size is 64 bits, the integer scale value must be less than the maximum 32-bit integer, otherwise the multiplication will overflow. In other words, the maximum allowable value is 2147483647. It is to be appreciated that other maximum allowable values may be different; the example provided above is for illustrative purposes only and is not intended to limit the scope of the embodiments. Thus, the condition set forth in equation (3) below must be satisfied.

integer scale value / maximum allowable value < 1 (3)
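As a numeric illustration of equations (1) through (3), consider the following sketch. The multiplier and shift values used here are assumptions chosen for illustration, not values prescribed by the text.

```python
# Assumed example values: quantization multiplier x = 0.3, shift value = 10.
x = 0.3
shift_value = 10

# Equation (2): the integer scale value approximating x at this shift.
integer_scale_value = int(x * 2 ** shift_value)            # int(307.2) = 307

# Equation (1): the multiplier reconstructed from scale and shift.
reconstructed_x = integer_scale_value / 2 ** shift_value   # approximately 0.2998

# Equation (3): the scale must stay below the register's maximum allowable
# value (here, the maximum 32-bit signed integer from the text).
max_allowable_value = 2 ** 31 - 1                          # 2147483647
assert integer_scale_value / max_allowable_value < 1
```

A larger shift value yields a larger integer scale value and a closer approximation of x, until equation (3) is violated.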

In some embodiments, equations (2) and (3) are iterated in order to obtain the largest possible integer scale value. Initially the shift value is 0, and after each iteration the shift value is incremented, e.g., to 1, 2, 3, and so on. The shift value determines the possible integer scale value, and the iteration continues as long as the condition identified by equation (3) holds. This process is repeated until equation (3) is no longer true, at which point the previous shift value and its corresponding integer scale value are selected. It will be appreciated that any previous shift value and its corresponding integer scale value may be selected, even though the largest previous integer scale value and its corresponding shift value provide the highest precision for a given register size. The above procedure for selecting the largest possible integer scale value and its corresponding shift value is illustrated with the following Python listing:

def calculate_rscale_rshift(which_type, x):
    # Given a quantized multiplier x, calculate the values of rscale and rshift.
    # Element-by-element operations are limited to 49 bits in the accumulator, so
    # the maximum rscale must be reduced to 18 bits.
    # Otherwise, a 64-bit accumulator is used and a 32-bit rscale value is allowed;
    # the same logic used to determine rshift can be retained, since rshift will
    # exit earlier.
    for i in range(65):  # i.e., 0-64: rshifts 1 through 63 are needed, and one more
        ...              # cycle is needed for shift 63 to be correct
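The listing above is truncated in the source. A complete, runnable reconstruction of the iteration it describes might look as follows; this is a sketch under the assumption of a 64-bit accumulator with a 32-bit rscale (the 49-bit element-wise path mentioned in the comments would use an 18-bit limit instead), not the patent's exact code.

```python
def calculate_rscale_rshift(x, max_rscale_bits=32):
    """Select the largest integer scale value (rscale) and shift (rshift)
    such that rscale = int(x * 2**rshift) stays below the register limit,
    per equations (2) and (3). max_rscale_bits=32 assumes the 64-bit
    accumulator case; the 49-bit element-wise case would pass 18 instead.
    """
    max_allowable = 2 ** (max_rscale_bits - 1) - 1  # largest signed value of that width
    rscale, rshift = int(x), 0
    for shift in range(65):  # i.e., 0-64, one extra iteration so shift 63 is covered
        candidate = int(x * 2 ** shift)
        if candidate / max_allowable >= 1:  # equation (3) no longer holds: stop
            break
        rscale, rshift = candidate, shift   # keep the last pair satisfying (3)
    return rscale, rshift
```

For x = 0.3 this selects rshift = 32 with rscale = int(0.3 * 2^32), so rscale / 2^rshift approximates 0.3 to within about 1e-10.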

It is to be appreciated that once the integer scale value and its corresponding shift value are selected, a quantization/re-quantization operation may be performed. The output of the C register 608 is multiplied by the integer scale value. The result of the multiplication is shifted by the selected shift value to form scaled integer data. Once the data is scaled, additional operations may be performed, such as tanh operations, sigmoid operations, rounding operations, clipping/clamping operations, and the like. In some embodiments, the rounding operation is performed by considering the most significant bit dropped by the shift operation and rounding the remaining result based on that dropped bit. It is to be appreciated that the scaled integer data can be further adjusted based on the range of the integer data. For example, if the integer data stored in memory unit 120 is int8, and if the scaled integer data exceeds the upper limit of int8, the scaled integer data is changed and adjusted to the maximum value or upper limit of int8. Similarly, if the integer data stored in the memory unit 120 is uint8, and if the scaled integer data exceeds the upper limit of uint8, the scaled integer data is changed and adjusted to the maximum value or upper limit of uint8. Conversely, if the value of the scaled integer data is below the minimum value or lower limit of the range of the data stored in the memory unit 120, e.g., int8 or uint8, the scaled integer data is adjusted and changed to the minimum value or lower limit of the range of the integer data in the memory unit 120.
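The multiply-shift-round-clamp sequence described above can be sketched as follows. This is an illustrative sketch: the function name is an assumption, and the rounding-by-dropped-bit detail follows the description above rather than any exact hardware behavior.

```python
def requantize(value, rscale, rshift, signed):
    """Requantize an accumulator value: multiply by the integer scale value,
    shift right by rshift, round using the most significant dropped bit, and
    clamp to the int8 or uint8 range of the destination memory.
    """
    product = value * rscale
    if rshift > 0:
        round_bit = (product >> (rshift - 1)) & 1  # MSB dropped by the shift
        product = (product >> rshift) + round_bit  # round based on dropped bit
    lo, hi = (-128, 127) if signed else (0, 255)
    return max(lo, min(hi, product))               # clamp to destination range
```

For instance, with rscale = 307 and rshift = 10 (approximating a multiplier of 0.3), an accumulator value of 100 requantizes to 30, while a value of 1000 saturates at the int8 upper limit of 127.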

Referring now to FIG. 5, a method for rescaling integer data in a machine learning operation is shown. It is to be appreciated that the method illustrated in FIG. 5 is a flow of the operations illustrated in FIG. 4. In step 510, the relationship between the integer scale value and the shift value is determined as described in equation (1). In step 520, the shift value is iteratively increased and its corresponding integer scale value is obtained from equation (2) until the condition shown in equation (3) is no longer true, i.e., until the ratio is greater than or equal to 1. In step 530, the last shift value and its corresponding integer scale value for which equation (3) still holds are selected. It is to be appreciated that in some non-limiting examples, steps 510 through 530 are performed during the compilation step and prior to any inference by inference engine 160. In step 540, an integer value, for example in int32 format, is received, for example from the C register 608. In step 550, the received integer value is multiplied by the selected integer scale value. In step 560, the result of the multiplication is shifted by the shift value corresponding to the selected integer scale value. In step 570, further operations may be performed, such as tanh, sigmoid, rounding, clipping, clamping, and the like. In step 580, the value of the scaled integer data may be adjusted based on the range of the integer data stored in memory unit 120. For example, if int8 type data is stored in memory unit 120 (e.g., DDR) and the scaled integer data exceeds the upper limit of the int8 data type, the scaled integer data is changed to the maximum or upper limit value of the int8 data type. Similarly, if uint8 type data is stored in memory unit 120 (e.g., DDR) and the scaled integer data exceeds the upper limit of the uint8 data type, the scaled integer data is changed to the maximum or upper limit value of the uint8 data type. Conversely, if the scaled integer data is less than the lower limit of the int8 data type, the scaled integer data is changed to the minimum value or lower limit of the int8 data type stored in the memory unit 120 (e.g., DDR). Similarly, if the scaled integer data is less than the lower limit of the uint8 data type, the scaled integer data is changed to the minimum value or lower limit of the uint8 data type stored in memory unit 120 (e.g., DDR). Thus, based on the register size, greater precision and accuracy may be achieved.

The foregoing description of various embodiments of the claimed subject matter has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the claimed subject matter to the precise form disclosed. Many modifications and variations will be apparent to practitioners skilled in the art. The embodiments were chosen and described in order to best describe the principles of the invention and its practical application, thereby enabling others skilled in the relevant art to understand the claimed subject matter, various embodiments, and with various modifications as are suited to the particular use contemplated.
