Posit tensor processing

Document No. 95721 | Publication date: 2021-10-12

Reading note: This technology, posit tensor processing, was created by V. S. Ramesh on 2021-03-26. Abstract: The present application relates to posit tensor processing. A method related to posit tensor processing may comprise: receiving, by a plurality of multiply-accumulator (MAC) units coupled to one another, a plurality of universal number (unum) or posit bit strings that are organized into a matrix and are to be used as operands in a plurality of respective recursive operations performed using the plurality of MAC units; and performing the plurality of respective recursive operations using the MAC units. An iteration of a respective recursive operation is performed using at least one bit string that is the same as a bit string used for a previous iteration of that recursive operation. The method may further include, prior to receiving the plurality of unum or posit bit strings, performing an operation to organize the plurality of unum or posit bit strings so as to achieve a threshold bandwidth ratio, a threshold latency, or both during performance of the plurality of respective recursive operations.

1. A method for posit tensor processing, comprising:

organizing a plurality of universal number (unum) or posit bit strings (331, 431) in a matrix based at least in part on a threshold bandwidth ratio or a threshold latency or both associated with recursive operations of a plurality of multiply-accumulator (MAC) units (226-1, 226-2, 226-3, 226-N) coupled to one another;

receiving, by the MAC units (226-1, 226-2, 226-3, 226-N), the plurality of unum or posit bit strings (331, 431) organized into the matrix;

performing a plurality of respective recursive operations using the MAC units (226-1, 226-2, 226-3, 226-N), wherein iterations of the respective recursive operations are performed using at least one bit string that is the same as a bit string used for a previous iteration of the respective recursive operation; and

storing the iteration results of the plurality of respective recursive operations in respective registers (227-1, 227-2, 227-3, 227-N) coupled to respective MAC units (226-1, 226-2, 226-3, 226-N) of the plurality of MAC units (226-1, 226-2, 226-3, 226-N).

2. The method of claim 1, further comprising performing at least a portion of the plurality of recursive operations simultaneously.

3. The method of claim 1, wherein the plurality of MAC units (226-1, 226-2, 226-3, 226-N) comprise circuitry configured to perform the plurality of respective recursive operations and reside on an application specific integrated circuit or a field programmable gate array.

4. The method of any of claims 1-3, wherein the organizing of the plurality of unum or posit bit strings (331, 431) is performed by respective data arbiters coupled to respective ones of the plurality of MAC units.

5. The method of any of claims 1 to 3, wherein the plurality of MAC units (226-1, 226-2, 226-3, 226-N) are organized in a grid configuration with respect to each other.

6. The method of any one of claims 1-3, further comprising:

receiving, by the plurality of MAC units (226-1, 226-2, 226-3, 226-N), a command to assert a first physical portion of the respective MAC unit (226-1, 226-2, 226-3, 226-N) residing along a first lateral dimension of the plurality of MAC units; and

receiving the plurality of unum or posit bit strings (331, 431) by a second physical portion of the respective MAC unit (226-1, 226-2, 226-3, 226-N) residing along a second lateral dimension of the plurality of MAC units (226-1, 226-2, 226-3, 226-N), wherein the first and second lateral dimensions are orthogonal to each other.

7. The method of any one of claims 1-3, further comprising performing the plurality of respective recursive operations by performing sub-operations associated with the plurality of respective recursive operations by a different MAC unit (226-1, 226-2, 226-3, 226-N) among the plurality of MAC units (226-1, 226-2, 226-3, 226-N).

8. The method of any of claims 1-3, further comprising using at least one identical unum or posit bit string (331, 431) as an operand for a plurality of iterations of the plurality of respective recursive operations.

9. An apparatus for posit tensor processing, comprising:

a posit multiply-accumulator (MAC) block (124, 224) comprising a plurality of MAC units (226-1, 226-2, 226-3, 226-N) organized in a grid configuration and configured to:

receive a command via an interface (216-1, 216-M) residing along a first lateral dimension of the posit MAC block (124, 224), the command to perform a respective recursive operation using universal number or posit bit strings (331, 431) organized into a matrix;

receive the universal number or posit bit strings (331, 431) along a second lateral dimension of the posit MAC block (124, 224), the second lateral dimension being substantially orthogonal to the first lateral dimension; and

perform the respective recursive operation using the received unum or posit bit strings (331, 431), wherein an iteration of the respective recursive operation is performed using at least one bit string that is the same as a bit string used for a previous iteration of the respective recursive operation.

10. The apparatus of claim 9, further comprising a data arbiter (218-1, 218-2, 218-3, 218-N) coupled to the MAC block (124, 224) and configured to organize the bit strings to achieve a threshold bandwidth ratio, a threshold latency, or both during performance of the plurality of respective recursive operations.

11. The apparatus of claim 9, wherein the MAC block (124, 224) further comprises an accumulator register (227-1, 227-2, 227-3, 227-N) configured to store an intermediate result corresponding to at least one iteration of the respective recursive operation.

12. The apparatus of any of claims 9-11, wherein the MAC block (124, 224) resides on an application specific integrated circuit or a field programmable gate array.

13. The apparatus of any of claims 9 to 11, wherein the MAC block (124, 224) is configured to perform a plurality of matrix operations as part of performing the respective recursive operations using the received bit string.

14. The apparatus of any of claims 9 to 11, wherein the MAC block (124, 224) is configured to:

receive the universal number or posit bit strings (331, 431) in a streaming manner; and

perform the respective recursive operation using the bit strings as the bit strings are streamed through the MAC block (124, 224).

15. A system for posit tensor processing, comprising:

a plurality of posit multiply-accumulator (MAC) units (226-1, 226-2, 226-3, 226-N) deployed on an application specific integrated circuit or a field programmable gate array, the MAC units (226-1, 226-2, 226-3, 226-N) being provided with processing resources (120, 220) and coupled to each other in a grid configuration; and

a respective accumulation register (227-1, 227-2, 227-3, 227-N) coupled to each of the MAC units (226-1, 226-2, 226-3, 226-N), wherein each MAC unit (226-1, 226-2, 226-3, 226-N) of the plurality of MAC units (226-1, 226-2, 226-3, 226-N) is configured to:

receive a bit string (331, 431) formatted according to a universal number format or a posit format and organized into a matrix;

perform a recursive operation using the received bit string (331, 431) as an operand of the recursive operation;

accumulate the iteration results of the recursive operation in the respective accumulation register (227-1, 227-2, 227-3, 227-N);

determine that the recursive operation is complete; and

generate a final result of the recursive operation using a final iteration result of the recursive operation, iteration results stored in the respective accumulation register (227-1, 227-2, 227-3, 227-N), or both.

16. The system of claim 15, further comprising a respective data arbiter (218-1, 218-2, 218-3, 218-N) coupled to each of the MAC units (226-1, 226-2, 226-3, 226-N), wherein each of the respective data arbiters (218-1, 218-2, 218-3, 218-N) is configured to perform an operation to organize the bit strings to optimize a bandwidth parameter corresponding to provision of the bit strings (331, 431) to the MAC units (226-1, 226-2, 226-3, 226-N).

17. The system of claim 15, wherein each of the respective MAC units (226-1, 226-2, 226-3, 226-N) is configured as a 64-bit by 64-bit MAC unit (226-1, 226-2, 226-3, 226-N).

18. The system of any of claims 15 to 17, wherein the plurality of MAC units (226-1, 226-2, 226-3, 226-N) are configured to:

receive a command to perform the recursive operation via an interface (216-1, 216-M) residing along a first lateral dimension of the MAC units (226-1, 226-2, 226-3, 226-N); and

receive the bit strings (331, 431) along a second lateral dimension of the MAC units (226-1, 226-2, 226-3, 226-N), the second lateral dimension being substantially orthogonal to the first lateral dimension.

19. The system of any of claims 15 to 17, wherein at least one of the processing resources (120, 220) is configured to:

determining a size of the matrix; and

in response to the size of the matrix being greater than a threshold size, perform operations to reduce the matrix prior to the plurality of MAC units (226-1, 226-2, 226-3, 226-N) performing the recursive operation, such that a resulting matrix is less than the threshold size.

20. The system of any of claims 15 to 17, wherein the plurality of MAC units (226-1, 226-2, 226-3, 226-N) are configured to perform a plurality of iterations of the recursive operation using the same at least one bit string (331, 431) as a previous iteration of the recursive operation.

Technical Field

The present disclosure relates generally to semiconductor memories and methods, and more particularly, to apparatus, systems, and methods for posit tensor processing.

Background

Memory devices are typically provided as internal, semiconductor, integrated circuits in computers or other electronic systems. There are many different types of memory, including volatile and non-volatile memory. Volatile memory may require power to maintain its data (e.g., host data, error data, etc.) and includes Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Synchronous Dynamic Random Access Memory (SDRAM), and Thyristor Random Access Memory (TRAM), among others. Non-volatile memory may provide persistent data by preserving stored data when not powered, and may include NAND flash memory, NOR flash memory, and resistance variable memory, such as Phase Change Random Access Memory (PCRAM), Resistive Random Access Memory (RRAM), and Magnetoresistive Random Access Memory (MRAM), such as spin-torque-transfer random access memory (STT-RAM), among others.

The memory device may be coupled to a host (e.g., a host computing device) to store data, commands, and/or instructions for use by the host when the computer or electronic system is in operation. For example, data, commands, and/or instructions may be transferred between a host and a memory device during operation of a computing or other electronic system.

Disclosure of Invention

In one aspect, the present application provides a method for posit tensor processing, comprising: organizing a plurality of universal number (unum) or posit bit strings in a matrix based at least in part on a threshold bandwidth ratio or a threshold latency or both associated with recursive operations of a plurality of multiply-accumulator (MAC) units coupled to one another; receiving, by the MAC units, the plurality of unum or posit bit strings organized into the matrix; performing a plurality of respective recursive operations using the MAC units, wherein iterations of the respective recursive operations are performed using at least one bit string that is the same as a bit string used for a previous iteration of the respective recursive operation; and storing the iteration results of the plurality of respective recursive operations in respective registers coupled to respective ones of the plurality of MAC units.

In another aspect, the present application further provides an apparatus for posit tensor processing, comprising a posit multiply-accumulator (MAC) block comprising a plurality of MAC units organized in a grid configuration and configured to: receive a command via an interface residing along a first lateral dimension of the posit MAC block, the command to perform a respective recursive operation using universal number or posit bit strings organized into a matrix; receive the universal number or posit bit strings along a second lateral dimension of the posit MAC block, the second lateral dimension being substantially orthogonal to the first lateral dimension; and perform the respective recursive operation using the received unum or posit bit strings, wherein an iteration of the respective recursive operation is performed using at least one bit string that is the same as a bit string used for a previous iteration of the respective recursive operation.

In yet another aspect, the present application further provides a system for posit tensor processing, comprising: a plurality of posit multiply-accumulator (MAC) units deployed on an application specific integrated circuit or a field programmable gate array, the MAC units being provided with processing resources and coupled to each other in a grid configuration; and a respective accumulation register coupled to each of the MAC units, wherein each MAC unit of the plurality of MAC units is configured to: receive a bit string formatted according to a universal number format or a posit format and organized into a matrix; perform a recursive operation using the received bit string as an operand of the recursive operation; accumulate the iteration results of the recursive operation in the respective accumulation register; determine that the recursive operation is complete; and generate a final result of the recursive operation using the final iteration result of the recursive operation, the iteration results stored in the respective accumulation register, or both.

Drawings

FIG. 1 is a functional block diagram in the form of an apparatus including a host and a memory device, the memory device including a processing unit, according to several embodiments of the present disclosure.

Fig. 2A is a functional block diagram in the form of a processing unit including a multiply-accumulator block, according to several embodiments of the present disclosure.

Fig. 2B is another functional block diagram in the form of a processing unit including a multiply-accumulator block, according to several embodiments of the present disclosure.

Fig. 3 is an example of an n-bit posit with es exponent bits.

Fig. 4A is an example of positive values for a 3-bit posit.

Fig. 4B is an example of posit construction using two exponent bits.

Fig. 5 is a flow diagram representing an example method for posit tensor processing according to several embodiments of the present disclosure.

Detailed Description

Systems, apparatuses, and methods related to posit tensor processing are described. A method related to posit tensor processing may comprise: receiving, by a plurality of multiply-accumulator (MAC) units coupled to one another, a plurality of universal number (unum) or posit bit strings that are organized into a matrix and are to be used as operands in a plurality of respective recursive operations performed using the plurality of MAC units; and performing, by the MAC units, the plurality of respective recursive operations, wherein iterations of the respective recursive operations are performed using at least one bit string that is the same as a bit string used for a previous iteration of the respective recursive operation. The method may further comprise: prior to receiving the plurality of unum or posit bit strings, performing an operation to organize the plurality of unum or posit bit strings so as to achieve a threshold bandwidth ratio, a threshold latency, or both during performance of the plurality of respective recursive operations.
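For purposes of illustration only, the recursive multiply-accumulate behavior summarized above can be sketched in software. The function below is a hypothetical software analogy, not the hardware MAC units described herein: each iteration adds one product to a running accumulator, reusing the accumulated bit string from the previous iteration.

```python
def mac_recursive(a_row, b_col):
    """Dot product expressed as the recursion acc_k = acc_{k-1} + a_k * b_k.

    Each loop iteration corresponds to one iteration of a "respective
    recursive operation": the accumulator from the previous iteration is
    reused as an operand of the current iteration.
    """
    acc = 0
    for a, b in zip(a_row, b_col):
        acc = acc + a * b  # iteration result held in an accumulator register
    return acc
```

As a usage example, `mac_recursive([1, 2, 3], [4, 5, 6])` performs three iterations and returns 32.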

Computing systems may perform a wide range of operations that may include various calculations, which may require different accuracies. However, computing systems have a limited amount of memory in which to store operands upon which to perform computations. To facilitate performing operations on operands stored by a computing system within constraints imposed by limited memory resources, the operands may be stored in a particular format. For simplicity, one such format is referred to as a "floating point" format or a "float" (e.g., IEEE754 floating point format).

According to the floating point standard, a bit string (e.g., a bit string that can represent a number), such as a binary string, is represented in terms of three sets of integers or bits: one set of bits referred to as a "base," one set referred to as an "exponent," and one set referred to as a "mantissa" (or significand). The sets of integers or bits define the format in which a binary bit string is stored, and for simplicity may be referred to herein as a "numeric format" or "format." For example, the three sets of integers or bits described above (e.g., base, exponent, and mantissa) that define a floating point bit string may be referred to as a format (e.g., a first format). As described in more detail below, a posit bit string may include four sets of integers or bits (e.g., sign, regime, exponent, and mantissa), which may also be referred to as a "numeric format" or "format" (e.g., a second format). In addition, according to the floating point standard, two infinite values (e.g., +∞ and −∞) and/or two kinds of "not-a-number" (NaN) values (a quiet NaN and a signaling NaN) may be included in a bit string.
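As a purely illustrative sketch of the first format described above, the three bit sets of an IEEE 754 single-precision float can be extracted in software as follows (the helper name is hypothetical and is not part of the disclosed apparatus):

```python
import struct

def float32_fields(x):
    """Split an IEEE 754 single-precision float into its three bit sets:
    sign (1 bit), exponent (8 bits), and mantissa/significand (23 bits)."""
    # Reinterpret the float's 32-bit pattern as an unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased exponent (bias 127)
    mantissa = bits & 0x7FFFFF       # fraction bits; hidden leading 1 omitted
    return sign, exponent, mantissa
```

For example, 1.0 is stored with sign 0, biased exponent 127, and an all-zero mantissa.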

Floating point standards have been used in computing systems for years and define arithmetic formats, interchange formats, rounding rules, operations, and exception handling for computations carried out by many computing systems. Arithmetic formats may include binary and/or decimal floating point data, which may include finite numbers, infinite values, and/or special NaN values. Interchange formats may include encodings (e.g., bit strings) that may be used to exchange floating point data. Rounding rules may include a set of properties that may be satisfied when rounding numbers during arithmetic operations and/or conversion operations. Floating point operations may include arithmetic operations and/or other computational operations, such as trigonometric functions. Exception handling may include indications of exceptional conditions, such as division by zero, overflow, and the like.

An alternative format to floating point is referred to as the "universal number" (unum) format. There are several forms of unum format: Type I unum, Type II unum, and Type III unum, which may be referred to as "posits" and/or "valids." Type I unum is a superset of the IEEE 754 standard floating point format that uses a "ubit" at the end of the mantissa to indicate whether a real number is an exact float or lies in an interval between adjacent floats. The sign, exponent, and mantissa bits in a Type I unum take their definitions from the IEEE 754 floating point format; however, the lengths of the exponent and mantissa fields of a Type I unum can vary dramatically, from a single bit to a maximum user-definable length. By taking its sign, exponent, and mantissa bits from the IEEE 754 standard floating point format, a Type I unum can behave similarly to a floating point number; however, the variable bit length exhibited in the exponent and fraction bits of a Type I unum can require additional management compared to floats.

Type II unums are generally incompatible with floats; however, Type II unums can permit a clean, mathematically sound design based on projective real numbers. A Type II unum can contain n bits and can be described in terms of a "u-lattice" in which quadrants of a circular projection are populated with an ordered set of 2^(n−3) − 1 real numbers. The values of a Type II unum can be reflected about an axis bisecting the circular projection such that positive values lie in the upper right quadrant of the circular projection and their negative counterparts lie in the upper left quadrant. The lower half of the circular projection representing a Type II unum can contain the reciprocals of the values located in the upper half of the circular projection. Type II unums generally rely on a look-up table for most operations; as a result, the size of the look-up table can limit the efficacy of Type II unums in some circumstances. However, Type II unums can provide improved computational functionality compared with floats under some conditions.

The Type III unum format is referred to herein as a "posit format" or, for simplicity, a "posit." In contrast to floating point bit strings, posits can, under certain conditions, allow for higher precision (e.g., a broader dynamic range, greater resolution, and/or greater accuracy) than floating point numbers of the same bit width. This can allow operations performed by a computing system to be performed at a higher rate (e.g., faster) when using posits than when using floating point numbers, which in turn can improve the performance of the computing system by, for example, reducing the number of clock cycles used in performing operations, thereby reducing processing time and/or power consumed in performing such operations. In addition, the use of posits in a computing system can allow for greater accuracy and/or precision in computations than floating point numbers, which can further improve the functionality of the computing system in comparison to some approaches (e.g., approaches that rely on floating point format bit strings).

Posits can be highly variable in precision and accuracy based on the total number of bits and/or the number of sets of integers or bits included in a posit. In addition, posits can generate a wide dynamic range. The accuracy, precision, and/or dynamic range of a posit can be greater than that of a float or other numerical format, under certain conditions, as described in more detail herein. The variable accuracy, precision, and/or dynamic range of a posit can be manipulated, for example, based on the application in which a posit is to be used. In addition, posits can reduce or eliminate the overflow, underflow, NaN, and/or other corner cases associated with floats and other numerical formats. Further, the use of posits can allow a numerical value (e.g., a number) to be represented using fewer bits than a float or other numerical format.
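As a non-limiting software sketch of how the four bit sets of a posit (sign, regime, exponent, mantissa) combine into a value, an n-bit posit with es exponent bits can be decoded as follows. The helper is hypothetical and illustrative only; it omits rounding and other behavior of a hardware implementation, and maps the NaR ("not a real") pattern to a Python NaN:

```python
def decode_posit(bits, n, es):
    """Decode an n-bit posit with es exponent bits into a Python float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")          # NaR pattern: 1 followed by all zeros
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask        # negative posits: two's complement
    body = bits & ((1 << (n - 1)) - 1)
    # Regime: a run of identical bits following the sign bit.
    first = (body >> (n - 2)) & 1
    run, i = 0, n - 2
    while i >= 0 and ((body >> i) & 1) == first:
        run += 1
        i -= 1
    k = run - 1 if first else -run
    i -= 1                            # skip the regime terminator bit
    # Exponent: the next es bits (zero-padded if the posit runs out of bits).
    exp = 0
    for _ in range(es):
        exp <<= 1
        if i >= 0:
            exp |= (body >> i) & 1
            i -= 1
    # Fraction: whatever bits remain, with a hidden leading 1.
    frac_bits = max(i + 1, 0)
    frac = body & ((1 << frac_bits) - 1)
    value = (1.0 + frac / (1 << frac_bits)) * 2.0 ** (k * (1 << es) + exp)
    return -value if sign else value
```

For example, with n = 8 and es = 2, the bit pattern 0b01000000 decodes to 1.0 and 0b01100000 decodes to 16.0 (one power of useed = 2^(2^es)).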

In some embodiments, these features can allow posits to be highly reconfigurable, which can provide improved application performance in comparison to approaches that rely on floats or other numerical formats. In addition, these features of posits can provide improved performance in machine learning applications in comparison to floats or other numerical formats. For example, posits can be used in machine learning applications in which computing performance is critical to train a network (e.g., a neural network) with the same or greater accuracy and/or precision as floats or other numerical formats while using fewer bits than those formats. In addition, inference operations in machine learning contexts can be achieved using posits with fewer bits (e.g., a smaller bit width) than floats or other numerical formats. By using fewer bits to achieve the same or enhanced outcomes in comparison to floats or other numerical formats, the use of posits can therefore reduce the amount of time spent performing operations and/or reduce the amount of memory space required in an application, which can improve the overall function of a computing system in which posits are employed.

Embodiments herein relate to hardware circuitry (e.g., a processing unit that can include a multiply-accumulator (MAC) block, which can include a plurality of constituent MAC units) configured to perform various operations on bit strings to improve the overall operation of a computing device. For example, embodiments herein relate to hardware circuitry configured to perform recursive operations using bit strings formatted according to a universal number (unum) or posit format. In some embodiments, the bit strings can be organized into matrices or tensors before, during, or after performance of the recursive operations.

For example, a MAC block can be configured such that the constituent MAC units of the MAC block receive unum or posit bit strings organized into a matrix or tensor and operate collectively to perform a recursive operation using the matrix or tensor of bit strings, yielding a resultant matrix or tensor of bit strings. In some embodiments, the recursive operation can be performed by the constituent MAC units as part of performing a multiply-accumulate operation; however, embodiments are not limited to the performance of multiply-accumulate operations, as described in more detail herein.
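The collective behavior described above can be sketched, purely as a software analogy, as a grid of accumulators that together compute a matrix product. Each (i, j) position below stands in for one hypothetical MAC unit with its own register; each pass of the outer loop is one iteration of the recursive operation, and operands are reused across units, as in the streamed operation described herein:

```python
def grid_matmul(A, B):
    """Matrix product computed the way a grid of MAC units would:
    one accumulator register per (i, j) cell, updated once per iteration."""
    rows, inner, cols = len(A), len(B), len(B[0])
    acc = [[0] * cols for _ in range(rows)]   # one register per MAC unit
    for k in range(inner):                    # iterations of the recursion
        for i in range(rows):
            for j in range(cols):
                # A[i][k] is reused across all j, and B[k][j] across all i,
                # so each iteration reuses bit strings from prior streaming.
                acc[i][j] += A[i][k] * B[k][j]
    return acc
```

For example, `grid_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` accumulates over two iterations to produce [[19, 22], [43, 50]].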

In some embodiments, prior to performance of the recursive operation, the bit strings can be converted from a first format (e.g., a floating point format or a fixed point format) that supports arithmetic or logical operations to a first level of precision to a second format (e.g., a universal number format, a posit format, etc.) that supports arithmetic or logical operations to a second level of precision. For example, a host can be configured to use bit strings formatted in a floating point or fixed point format, while the processing unit can be configured to perform arithmetic and/or logical operations using bit strings formatted in a unum or posit format. Accordingly, to perform a recursive operation using the processing unit, the bit strings can be converted from the first format to the second format before the operation is performed.
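The conversion flow described above can be sketched as follows. Note the stand-in: for brevity this hypothetical sketch uses a toy fixed-point encoding in place of a real unum/posit converter, since the point here is the pipeline structure (convert in, operate entirely in the second format, convert out), not the encoding itself:

```python
SCALE = 1 << 16  # toy Q16.16 fixed-point stand-in for the second format

def to_second_format(x):
    """Host (first) format -> processing-unit (second) format."""
    return round(x * SCALE)

def from_second_format(q):
    """Processing-unit (second) format -> host (first) format."""
    return q / SCALE

def fused_mac(qa, qb, q_acc):
    """Multiply-accumulate performed entirely in the second format."""
    return q_acc + (qa * qb) // SCALE
```

For example, converting 1.5 and 2.0 into the second format, performing `fused_mac` with a zero accumulator, and converting back yields 3.0 without the host ever handling second-format operands directly.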

As described in greater detail herein, the processing unit can include circuitry and components that can perform such operations on the converted bit strings when operating according to various control signals. In some embodiments, the operations can be performed within the processing unit without activating input/output circuitry coupled to the processing unit (e.g., prior to activation of such input/output circuitry). Accordingly, in some embodiments, the processing unit can be configured to perform the operations using the converted bit strings without transferring the bit strings out of the processing unit (e.g., to the host) until the requested operation is completed.

By performing such operations within the processing unit using bit strings that have been converted into the second format, improved performance of the computing system can be realized by allowing for improved precision and/or accuracy in the performed operations, improved speed in performing the operations, and/or a reduced required storage space for the bit strings before, during, or after performance of the recursive arithmetic and/or logical operations. Additionally, by performing such operations within the processing unit, the number of signals asserted within the computing system can be reduced, thereby reducing power consumption and/or the amount of time consumed in performing the operations in comparison to approaches that do not include the processing unit described herein.

In some embodiments, the results of operations performed within the processing unit (e.g., resultant bit strings) can be converted back into the first format (e.g., the floating point format) and then transferred to other circuitry of the computing system (e.g., a host, a memory device, etc.).

In the following detailed description of the present disclosure, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration ways in which one or more embodiments of the disclosure may be practiced. These embodiments are described in sufficient detail to enable those of ordinary skill in the art to practice the embodiments of this disclosure, and it is to be understood that other embodiments may be utilized and that process, electrical, and structural changes may be made without departing from the scope of the present disclosure.

As used herein, designators such as "N," "M," "X," and "Y," particularly with respect to reference numerals in the drawings, indicate that a number of the particular feature so designated can be included. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms "a," "an," and "the" include singular and plural referents unless the context clearly dictates otherwise. In addition, "a number of," "at least one," and "one or more" (e.g., a number of memory banks) can refer to one or more memory banks, whereas a "plurality of" is intended to refer to more than one of such things.

Moreover, the word "can/may" is used throughout this application in a permissive sense (i.e., having the potential to, being able to), not in a mandatory sense (i.e., must). The term "include," and derivations thereof, means "including, but not limited to." The term "coupled" means directly or indirectly physically connected, or for access to and movement (transmission) of commands and/or data, as appropriate to the context. The terms "bit strings," "data," and "data values" are used interchangeably herein and can have the same meaning, as appropriate to the context. In addition, the terms "set of bits," "bit subset," and "portion" (in the context of a portion of a bit string) are used interchangeably herein and can have the same meaning, as appropriate to the context.

The drawings herein follow a numbering convention in which the first one or more digits correspond to the drawing number and the remaining digits identify an element or component in the drawing. Similar elements or components between different figures may be identified by the use of similar digits. For example, 120 may refer to element "20" in fig. 1, and a similar element may be represented as 220 in fig. 2. A plurality or group of similar elements or components may be generally referred to herein by a single reference numeral. For example, the plurality of reference elements 226-1 through 226-N (e.g., 226-1, 226-2, 226-3, …, 226-N) may be generally referred to as 226. As will be appreciated, elements shown in the various embodiments herein can be added, exchanged, and/or removed in order to provide a number of additional embodiments of the present disclosure. Additionally, the proportion and/or the relative scale of the elements provided in the figures are intended to illustrate certain embodiments of the present disclosure, and should not be taken in a limiting sense.

Fig. 1 is a functional block diagram in the form of an apparatus including a host 102 and a memory device 104, the memory device 104 including a processing unit 120, according to several embodiments of the present disclosure. As used herein, "apparatus" may refer to, but is not limited to, any of a variety of structures or combinations of structures, such as a circuit or circuitry, one or more dies, one or more modules, one or more devices, or one or more systems. The memory device 104 may include one or more memory modules (e.g., single inline memory modules, dual inline memory modules, etc.). The memory device 104 may include volatile memory and/or non-volatile memory. In a number of embodiments, the memory device 104 may comprise a multi-chip device. A multi-chip device may include a plurality of different memory types and/or memory modules. For example, the memory system may include non-volatile or volatile memory on any type of module.

As shown in fig. 1, apparatus 100 may include a processing unit 120 and a memory array 130, and the processing unit 120 may include logic circuitry 122 and a multiply-accumulator (MAC) block 124. Additionally, each of the components (e.g., the host 102, the processing unit 120, the logic circuitry 122, the MAC block 124, and/or the memory array 130) may be separately referred to herein as a "device." The logic circuitry 122 may be referred to herein as a "processing device" or a "processing unit".

The memory device 104 may provide a main memory for the computing system 100 or may be used as additional memory or storage throughout the computing system 100. The memory device 104 may include one or more memory arrays 130 (e.g., an array of memory cells), which may include volatile and/or nonvolatile memory cells. For example, the memory array 130 may be a flash array having a NAND architecture. Embodiments are not limited to a particular type of memory device. For example, the memory device 104 may include RAM, ROM, DRAM, SDRAM, PCRAM, RRAM, flash memory, and so forth.

In embodiments where the memory device 104 includes non-volatile memory, the memory device 104 may include a flash memory device, such as a NAND or NOR flash memory device. However, embodiments are not so limited, and memory device 104 may include other non-volatile memory devices such as non-volatile random access memory devices (e.g., NVRAM, ReRAM, FeRAM, MRAM, PCM), an "emerging" memory device such as a resistance variable (e.g., 3-D cross point (3D XP)) memory device, a memory device including an array of self-selecting memory (SSM) cells, and the like, or combinations thereof. The resistance variable memory device may perform bit storage based on a change in body resistance in conjunction with a stackable cross-meshed data access array. Additionally, in contrast to many flash-based memories, a resistance variable non-volatile memory may perform a write-in-place operation in which non-volatile memory cells may be programmed without previously erasing the non-volatile memory cells. In contrast to flash-based memory and resistance variable memory, a self-selected memory cell can include a memory cell having a single chalcogenide material acting as both a switch and a storage element of the memory cell.

As illustrated in fig. 1, a host 102 may be coupled to a memory device 104. In a number of embodiments, memory device 104 may be coupled to host 102 via one or more channels (e.g., channel 103). In FIG. 1, memory device 104 is coupled to host 102 via channel 103, and processing unit 120 of memory device 104 is coupled to memory array 130 via channel 107. Host 102 may be a host system such as a personal laptop computer, desktop computer, digital camera, smart phone, memory card reader, and/or internet of things (IoT) enabled device, among various other types of hosts.

The host 102 may include a system motherboard and/or backplane, and may include a memory access device, such as a processor (or processing device). One of ordinary skill in the art will appreciate that a "processor" may be one or more processors, such as a parallel processing system, a plurality of coprocessors, and the like. The system 100 may include separate integrated circuits, or the host 102, memory device 104, and memory array 130 may be on the same integrated circuit. For example, system 100 may be a server system and/or a High Performance Computing (HPC) system and/or a portion thereof. Although the example shown in fig. 1 illustrates a system having a Von Neumann architecture, embodiments of the present disclosure may be implemented in a non-Von Neumann architecture, which may not include one or more components (e.g., CPU, ALU, etc.) typically associated with Von Neumann architectures.

As described herein, the memory device 104, shown in more detail in fig. 2A and 2B, may include a processing unit 120, which may include logic circuitry 122 and a MAC block 124. Processing unit 120 may be provided in the form of an integrated circuit, such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a system on a chip, or the like, while logic circuitry 122 may be provided in the form of a reduced instruction set computing (RISC) device, an advanced RISC machine, or another combination of hardware and/or circuitry configured to perform the operations described in greater detail herein. In some embodiments, the logic circuitry 122 may include one or more processors (e.g., processing devices, processing units, etc.).

The logic circuitry 122 may cause operations to be performed using bit strings received by the processing unit 120. In some embodiments, the logic circuitry may perform operations on the bit string to convert the bit string between various formats and/or to pass the converted bit string to the memory array 130. For example, the conversion operations may include operations to convert a floating-point bit string (e.g., a floating-point number) into a bit string in the hypothetical format, and the inverse operations. Once the floating-point bit string is converted to a bit string in a hypothetical format, processing unit 120 (e.g., MAC block 124 of processing unit 120) may be configured to perform (or cause performance of) arithmetic operations such as addition, subtraction, multiplication, division, fused multiply-add, multiply-accumulate, dot product, greater than or less than, absolute value (e.g., FABS()), fast Fourier transform, inverse fast Fourier transform, sigmoid function, convolution, square root, exponential, and/or logarithm operations; recursive logical operations such as AND, OR, XOR, NOT, etc.; and trigonometric operations such as sine, cosine, tangent, etc., using the hypothetical bit strings. In some embodiments, such operations may be performed recursively. As will be appreciated, the above list of operations is not intended to be exhaustive, nor is it intended to be limiting, and the processing unit 120 may be configured to perform (or cause performance of) other arithmetic, bitwise, and/or logical operations.

The memory device 104 may further include memory resources, which may be communicatively coupled to the processing unit 120. The memory resources may include volatile memory resources, non-volatile memory resources, or a combination of volatile and non-volatile memory resources. In some embodiments, the memory resource may be Random Access Memory (RAM), such as Static Random Access Memory (SRAM). Embodiments are not so limited and the memory resources may be cache, one or more registers, NVRAM, ReRAM, FeRAM, MRAM, PCM, "emerging" memory devices such as resistance variable memory resources, phase change memory devices, memory devices including self-selected memory cell arrays, and the like, or combinations thereof.

The memory resource may store one or more bit strings. After operations are performed using the processing unit 120, the bit strings stored by the memory resource may be stored in a universal number (unum) or hypothetical format. As used herein, a bit string stored in a unum (e.g., Type III unum) or hypothetical format may include several subsets of bits, or "bit subsets." For example, a universal number or hypothetical bit string may include a subset of bits referred to as a "sign" or "sign portion," a subset of bits referred to as a "scheme" or "scheme portion," a subset of bits referred to as an "exponent" or "exponent portion," and a subset of bits referred to as a "mantissa" or "mantissa portion" (or significand). As used herein, a bit subset is intended to refer to a subset of bits included in a bit string. Examples of the sign, scheme, exponent, and mantissa bit subsets are described in more detail herein in connection with fig. 3 and 4A-4B. Embodiments are not so limited, however, and the memory resource may store bit strings in other formats, such as a floating-point format or another suitable format.

In some embodiments, the processing unit 120 may receive data comprising a bit string (e.g., a floating-point bit string) having a first format that provides a first level of precision. The processing unit 120 may receive the data from the memory resource and convert the bit string (e.g., using conversion circuitry 225-1 to 225-N shown in fig. 2B) to a second format (e.g., a universal number or hypothetical format) that provides a second level of precision that is different from the first level of precision. In some embodiments, the first level of precision may be lower than the second level of precision. For example, if the first format is a floating-point format and the second format is a universal number or hypothetical format, the floating-point bit string may, under certain conditions, provide a lower level of precision than the universal number or hypothetical bit string, as described in more detail herein in connection with fig. 3 and 4A-4B.

The first format may be a floating-point format (e.g., the IEEE 754 format) and the second format may be a universal number (unum) format (e.g., a Type I unum format, a Type II unum format, a Type III unum format, a hypothetical format, a valid format, etc.). Accordingly, the first format may include a mantissa, a base, and an exponent portion, and the second format may include a mantissa, a sign, a scheme, and an exponent portion.

The processing unit 120 may be configured to perform arithmetic operations or logical operations, or both, using bit strings having the second format (e.g., a unum or hypothetical format). In some embodiments, the arithmetic operations and/or logical operations may be recursive operations. As used herein, a "recursive operation" generally refers to an operation that is performed a specified number of times, with a result of a previous iteration of the recursive operation being used as an operand for a subsequent iteration of the operation. For example, a recursive multiplication operation may be an operation in which two bit string operands, β and φ, are multiplied together and the result of each iteration of the recursive operation is used as a bit string operand for a subsequent iteration. In other words, a recursive operation may refer to a process in which a first iteration of the recursive operation includes multiplying β and φ to arrive at a result λ (e.g., β × φ = λ). The next iteration of this example recursive operation may include multiplying the result λ by φ to arrive at another result ω (e.g., λ × φ = ω).
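The recursive multiplication pattern described above can be sketched in a few lines. The sketch below is purely illustrative, using plain Python integers in place of unum/hypothetical bit string operands:

```python
def recursive_multiply(beta, phi, iterations):
    # Each iteration multiplies the previous result by the same operand phi:
    # first iteration: lambda = beta * phi; next: omega = lambda * phi; etc.
    result = beta
    for _ in range(iterations):
        result = result * phi
    return result

# With beta = 2 and phi = 3: lambda = 6 after one iteration, omega = 18 after two.
```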

Another illustrative example of a recursive operation may be explained in terms of computing the factorial of a natural number. This example, given by equation 1, may include performing a recursive operation when the factorial of a given number n is greater than zero, and returning one when the number n equals zero:

n! = n × (n − 1)!  if n > 0
n! = 1             if n = 0          (Equation 1)

As shown in equation 1, the recursive operation of determining the factorial of the number n may be carried out until n equals zero, at which point the solution is obtained and the recursive operation terminates. For example, using equation 1, the factorial of the number n may be computed recursively by performing the operation n × (n − 1) × (n − 2) × … × 1.
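Equation 1 maps directly onto a recursive function. A minimal Python sketch (plain integers, not hypothetical bit strings):

```python
def factorial(n):
    # Equation 1: return one when n equals zero; otherwise recurse on n - 1.
    if n == 0:
        return 1
    return n * factorial(n - 1)
```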

Yet another example of a recursive operation is a multiply-accumulate operation, in which an accumulator a is modified at each iteration according to the equation a ← a + (b × c). In a multiply-accumulate operation, each previous iteration of the accumulator a is summed with the product of the two operands b and c. Examples of recursive operations contemplated herein are not limited to these examples. Rather, the above examples of recursive operations are merely illustrative and are provided to clarify the scope of the term "recursive operation" in the context of the present disclosure.
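The multiply-accumulate recursion a ← a + (b × c) can be sketched as follows, again with plain Python values standing in for unum/hypothetical operands:

```python
def multiply_accumulate(pairs):
    # Each iteration sums the previous accumulator value with the product
    # of the two operands b and c: a <- a + (b * c).
    a = 0
    for b, c in pairs:
        a = a + (b * c)
    return a

# Three iterations: 1*2 + 3*4 + 5*6 accumulates to 44.
```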

The embodiment of fig. 1 may include additional circuitry that is not shown so as not to obscure embodiments of the present disclosure. For example, the memory device 104 may include address circuitry for latching address signals provided over I/O connections by the I/O circuitry. Address signals may be received and decoded by a row decoder and a column decoder to access the memory device 104 and/or the memory array 130. Those skilled in the art will appreciate that the number of address input connections may depend on the density and architecture of the memory devices 104 and/or the memory array 130.

Fig. 2A is a functional block diagram in the form of a processing unit 220 including a multiply-accumulator block 224, according to several embodiments of the present disclosure. Processing unit 220 and multiply-accumulator (MAC) block 224 may be similar to processing unit 120 and MAC block 124 shown in fig. 1.

As shown in FIG. 2A, processing unit 220 may include logic circuitry 222-1 through 222-N (which may be similar to logic circuitry 122 shown in FIG. 1), a plurality of data arbiters 218-1 through 218-N, and a plurality of buffers/interfaces 216-1 through 216-M. Additionally, the MAC block 224 may include a plurality of MAC units 226-1 through 226-N. Although shown as being internal to processing unit 220, buffers/interfaces 216-1 through 216-M may be external to processing unit 220. In such embodiments, buffers/interfaces 216-1 through 216-M may be coupled to processing unit 220 via a communication path, such as a Double Data Rate (DDR) bus, a peripheral component interconnect express (PCIe) bus, or other suitable bus.

Logic circuitry 222-1 through 222-N may be communicatively coupled to buffers/interfaces 216-1 through 216-M via respective communication paths 217-1 through 217-N. Communication paths 217-1 through 217-N may allow direct communication (e.g., for command exchange) between logic circuitry 222-1 through 222-N and buffers/interfaces 216-1 through 216-M.

The MAC units 226-1 to 226-N may be configured to perform arithmetic and/or logical operations as part of digital signal processing to perform recursive operations using bit string operands. As described above, the bit string operands may be unum or hypothetical bit string operands and may be organized into matrices or tensors before being received by the MAC units 226-1 to 226-N. MAC units 226-1 through 226-N may include circuitry to facilitate the performance of arithmetic and/or logical operations to perform multiply-accumulate operations (among other operations) using unum or hypothesized bit string operands.

In some embodiments, the MAC units 226-1 to 226-N may each be configured to accommodate 64 bits by 64 bits (e.g., a bit string matrix or tensor having 64 columns and 64 rows of data values), although embodiments are not limited to this particular example. For example, embodiments described herein may include MAC units 226-1 through 226-N configured to accommodate 8 bits by 8 bits, 16 bits by 16 bits, 32 bits by 32 bits, or any other quantity of bits. In some embodiments, the MAC units 226-1 through 226-N may be reconfigurable such that they accommodate a first quantity of bits during performance of some operations and a second quantity of bits during performance of other operations. For example, the MAC units 226-1 through 226-N may be configured to accommodate 8 bits by 8 bits for certain operations, and may be reconfigured at a subsequent time to accommodate 64 bits by 64 bits.

In embodiments in which the MAC units 226-1 through 226-N are configured to accommodate 64 bits by 64 bits, the MAC block 224 may be configured to contain multiple MAC units 226-1 through 226-N such that the MAC block 224 accommodates 128 bits by 128 bits. For example, in the example shown in fig. 2A, the MAC block 224 includes four MAC units 226-1 through 226-N that provide a 128-bit by 128-bit accommodation when spatially arranged in a grid configuration (as shown in fig. 2A). As used herein, a "grid configuration" generally refers to a situation in which components are arranged such that the top and side edges of the respective components are aligned with one another. For example, in the grid configuration shown in fig. 2A, the top edges of MAC units 226-1 and 226-2 are aligned with each other, and the top edges of MAC units 226-3 and 226-N are aligned with each other. In addition, the side edges of MAC units 226-1 and 226-3 are aligned with each other, and the side edges of MAC units 226-2 and 226-N are aligned with each other.
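The way unit-sized MAC tiles combine into a larger block can be illustrated with a block (tiled) matrix multiplication. The sketch below is a software analogy only, using a small tile size in place of the 64-bit by 64-bit units described above:

```python
def tiled_matmul(a, b, unit=2):
    # Multiply n x n matrices by accumulating unit-sized tile products,
    # analogous to the role of the individual MAC units in the grid.
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, unit):
        for j0 in range(0, n, unit):
            for k0 in range(0, n, unit):
                # One tile-level multiply-accumulate step.
                for i in range(i0, i0 + unit):
                    for j in range(j0, j0 + unit):
                        for k in range(k0, k0 + unit):
                            out[i][j] += a[i][k] * b[k][j]
    return out
```

For a 4 x 4 input with unit=2, the four (i0, j0) output tiles correspond to the four units of the 2-by-2 grid, each accumulating its own quadrant of the result.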

In some embodiments, each MAC unit 226-1 to 226-N may include 8-bit inputs and/or outputs, 16-bit inputs and/or outputs, 32-bit inputs and/or outputs, and so on. A bit string (e.g., an unum or hypothetical bit string used as an operand for an operation performed within the processing unit 220) may be streamed through the MAC block 224 during execution of the operation. For example, the bit string may be processed continuously (or nearly continuously) by the MAC block 224 during execution of a recursive operation, such as a multiply-accumulate operation.

The bit string may be processed by the data arbiters 218-1 through 218-N before being received by the MAC block 224. The data arbiters 218-1 through 218-N may be configured to organize bit strings to optimize bandwidth and/or latency associated with performing operations using the MAC block 224. That is, embodiments herein may allow bit strings to be organized to optimize bandwidth and/or latency parameters associated with performing operations, as compared to methods that do not provide a data arbiter to organize bit strings prior to performing operations using the bit strings. This may result in an increase in the throughput of operations performed using the MAC block 224 compared to methods that do not optimize the content of the bit string prior to performing the operation using the bit string. In some embodiments, the data arbiters 218-1 through 218-N may be systolic data arbiters.

As used herein, the term "systolic" refers to a situation in which data, e.g., a bit string, flows or is streamed through a computing device, e.g., the processing unit 220. For example, embodiments herein may include providing bit strings to the MAC block 224, and hence the MAC units 226-1 through 226-N, in a systolic manner such that the bit strings continuously flow or stream into the MAC block 224 without intervening commands after a command that initiates a recursive operation using the MAC block 224 is issued. Accordingly, a "systolic data arbiter" refers to a hardware device that allocates access to a shared computing resource in a systolic manner (e.g., a hardware device that allocates access to a shared computing resource in a flowing or streaming manner).
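The systolic streaming idea can be sketched with a generator feeding a consumer after a single initiating call, with no per-item commands. This is purely illustrative of the data-flow pattern, not of the hardware protocol:

```python
def stream_operands(matrix):
    # Yield operands one after another so the consumer (the MAC block in
    # the text) receives a continuous flow after a single initiating call.
    for row in matrix:
        for value in row:
            yield value

# A single call sets up the stream; the consumer then pulls values as they arrive.
total = 0
for v in stream_operands([[1, 2], [3, 4]]):
    total += v
```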

The operations performed by the MAC block 224 may be controlled by asserting command and/or control signals to the MAC units 226-1 through 226-N of the MAC block 224. Command and/or control signals may be asserted on the MAC block 224 via the buffers/interfaces 216-1 through 216-M. Thus, in some embodiments, command and/or control signals may be asserted on the MAC block 224 along a first lateral dimension of the MAC block 224 (e.g., from the top and/or bottom of the MAC block 224, as shown in fig. 2A), and the bit string may be streamed into the MAC block 224 along a second lateral dimension of the MAC block 224 (e.g., along the sides of the MAC block 224, as shown in fig. 2A).

Fig. 2B is another functional block diagram in the form of a processing unit 220 including a multiply-accumulator block 224, according to several embodiments of the present disclosure. Processing unit 220 and multiply-accumulator (MAC) block 224 may be similar to processing unit 120 and MAC block 124 shown in fig. 1.

As shown in FIG. 2B, processing unit 220 may include logic circuitry 222-1 through 222-N (which may be similar to logic circuitry 122 shown in FIG. 1), a plurality of data arbiters 218-1 through 218-N, and a plurality of buffers/interfaces 216-1 through 216-M. Additionally, the MAC block 224 may include multiple MAC units 226-1 through 226-N, which may each include a corresponding quire register 227-1 through 227-N. Additionally, processing unit 220 may include conversion circuitry 225-1 to 225-N and/or neural network components 223-1 to 223-X.

The quire registers 227-1 to 227-N may alternatively be referred to herein as "accumulator registers" or, for simplicity, "registers," and may include sense amplifiers, flip-flops, or other storage locations configured to store data values. In some embodiments, the quire registers 227-1 to 227-N may be configured to store results of iterations of the recursive operations performed by the MAC block 224. The iteration results may be accumulated at each stage of the recursive operation. This allows results of the recursive operation that have not been rounded or truncated to be stored, in contrast to some other approaches in which intermediate results are rounded or truncated. By storing unrounded iteration results in the quire registers 227-1 to 227-N, the accuracy of the final result of the recursive operation may be greater than in approaches that perform rounding or truncation on the iteration results of the recursive operation.
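The benefit of accumulating unrounded intermediate results can be illustrated by modeling a wide accumulator with exact rational arithmetic. This is a sketch of the principle, not of the register hardware, and the operand values are made up for the demonstration:

```python
from fractions import Fraction

def exact_dot(xs, ys):
    # Accumulate products exactly (no intermediate rounding), analogous to
    # a wide accumulator; only the final result would be rounded back to
    # the working format.
    acc = Fraction(0)
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)
    return acc

# Rounding at every step can lose a small addend entirely:
xs, ys = [1e16, 1.0, -1e16], [1.0, 1.0, 1.0]
rounded = 0.0
for x, y in zip(xs, ys):
    rounded += x * y  # double precision rounds 1e16 + 1 back to 1e16
```

Here the per-step rounded accumulator ends at 0.0, while the exact accumulation yields the correct value 1.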

As shown in FIG. 2B, processing unit 220 may include neural network components 223-1 through 223-X. The neural network components 223-1 to 223-X may be coupled to logic circuitry and may perform neural network operations, such as convolution, activation, pooling, and/or normalization operations, on bit strings (e.g., unum or hypothetical bit strings organized into matrices or tensors) that are to undergo recursive operations performed by the MAC block 224.

Processing unit 220 may further include conversion circuitry 225-1 through 225-N. The conversion circuitry 225-1 through 225-N may be configured to perform conversion operations on bit strings received by the processing unit 220 before the MAC block 224 performs a recursive operation. For example, as described above, a host, such as the host 102 shown in fig. 1, may use bit strings formatted according to a floating-point or fixed-point format, while the processing unit 220 may perform operations using bit strings formatted according to a unum or hypothetical format. To convert the bit strings from the floating-point or fixed-point format to the unum or hypothetical format, the conversion circuitry 225-1 through 225-N may perform one or more conversion operations to convert the bit strings to the unum or hypothetical format before a recursive operation using the MAC block 224 is started.

In a non-limiting example, the MAC block 224 may be a hypothetical MAC block and may contain a plurality of MAC units 226-1 through 226-N organized in a grid configuration. As used herein, a "hypothetical MAC block" refers to a MAC block that is configured to perform operations using operands formatted in a unum or hypothetical format, rather than a floating-point format or other format. The hypothetical MAC block may reside on an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA). As used herein, the term "resident on" means that something is physically located on a particular component. For example, the hypothetical MAC block "residing on an ASIC or FPGA" refers to a situation in which the hypothetical MAC block is physically located on the ASIC or FPGA. The term "resident on" may be used interchangeably herein with other terms such as "disposed on" or "located on".

In some embodiments, the MAC block 224 may be configured to perform a plurality of matrix operations as part of performing respective recursive operations using the received bit string. For example, in embodiments in which the bit strings are provided in the form of a matrix or tensor, the MAC block 224 may be configured to perform addition, subtraction, multiplication, and/or division operations using the matrix or tensor.

MAC units 226-1 through 226-N may be configured to receive commands via interfaces (e.g., buffers/interfaces 216-1 through 216-M) residing along a first lateral dimension of the hypothetical MAC block 224. The commands may include commands to perform the respective recursive operations using universal number or hypothetical bit strings organized into matrices or tensors.

The plurality of MAC units 226-1 to 226-N may be configured to receive bit strings along a second lateral dimension of the hypothetical MAC block, the second lateral dimension being substantially orthogonal to the first lateral dimension, and to perform the respective recursive operations using the received bit strings. In some embodiments, an iteration of a respective recursive operation may be performed using at least one bit string that is the same as a bit string used for a previous iteration of the respective recursive operation. By using at least one identical unum or hypothetical bit string as an operand for multiple iterations of a respective recursive operation, the quantity of signals and/or function calls used to retrieve the unum or hypothetical bit strings may be reduced compared to approaches in which successive iterations of the respective recursive operation use different bit strings.

In some embodiments, the MAC block 224 may be configured to receive the bit string in a streaming manner and perform a corresponding recursive operation using the bit string as it is streamed through the MAC block 224. For example, the bit strings may be serially streamed through the logic circuitry 222-1 through 222-N and/or through the data arbiters 218-1 through 218-N during execution of the respective recursive operations such that the MAC block 224 serially (or nearly serially) receives the bit strings in the absence of intermediate commands (e.g., commands that individually initiate each recursive operation). However, embodiments are not so limited, and in some embodiments, the bit string may be loaded into the MAC block 224 on the rising and/or falling edges of a clock signal that may be asserted on the MAC block 224 to control the timing of operations performed by the MAC block 224.

The data arbiters 218-1 through 218-N may be coupled to the MAC block 224 and may be configured to organize the bit string to achieve a threshold bandwidth ratio, a threshold latency, or both during execution of the plurality of respective recursive operations. For example, the data arbiters 218-1 through 218-N may be configured to perform operations to organize the bit string to optimize the distribution of the bit string, reduce the amount of data associated with the bit string, simplify the bit string, or otherwise organize the bit string to optimize at least one parameter associated with the performance of the recursive operations by the MAC block 224.

In some embodiments, the MAC block 224 may further include an accumulator register (e.g., the quire registers 227-1 to 227-N) configured to store an intermediate result corresponding to at least one iteration of a respective recursive operation. The accumulator register may include two hundred fifty-six (256) storage locations to store intermediate results of respective recursive operations containing two hundred fifty-six bits or fewer, although embodiments are not so limited and the accumulator register may include more or fewer than two hundred fifty-six storage locations. By storing the intermediate results of the respective recursive operations in the accumulator register, the level of precision of the final result of the respective recursive operation may be maintained, in contrast to approaches that round or truncate the intermediate results of the recursive operation.

In another non-limiting example, a plurality of hypothetical multiply-accumulator (MAC) units 226-1 through 226-N may be deployed on an application specific integrated circuit or a field programmable gate array. The MAC units 226-1 through 226-N may be coupled (internally or communicatively) to processing resources and coupled to one another in a grid configuration. The MAC units 226-1 through 226-N may include respective quire registers 227-1 through 227-N coupled to each of the MAC units 226-1 through 226-N. In some embodiments, each MAC unit of the plurality of MAC units 226-1 through 226-N may be configured to receive bit strings formatted according to a universal number format or a hypothetical format and organized into a matrix.

The MAC units 226-1 through 226-N may be further configured to perform a recursive operation using the received bit strings as operands of the recursive operation and to accumulate the results of the iterations of the recursive operation in the respective quire registers 227-1 through 227-N. In some embodiments, the MAC units 226-1 through 226-N may be configured to determine that the recursive operation is complete and to generate a final result of the recursive operation using a final iteration result of the recursive operation, the iteration results stored in the respective quire registers 227-1 through 227-N, or both.

For example, because the results of the recursive operation may be accumulated in the quire registers 227-1 through 227-N, the final result of the recursive operation may be retrieved from the respective quire registers 227-1 through 227-N. In some embodiments, however, the final result of the recursive operation stored in the quire registers 227-1 through 227-N may contain more than a desired quantity of bits due to the size of the respective quire registers 227-1 through 227-N. If so, the MAC units 226-1 through 226-N and/or the logic circuitry 222-1 through 222-N may be configured to round or truncate the final result of the recursive operation to a predetermined quantity of bits.

In some embodiments, a respective systolic data arbiter (e.g., the data arbiters 218-1 through 218-N) may be coupled to each of the MAC units 226-1 through 226-N. In this example, each of the respective systolic data arbiters may be configured to perform operations to organize the bit strings to optimize a bandwidth parameter corresponding to provision of the bit strings to the MAC units 226-1 through 226-N. In some embodiments, the bandwidth parameter may correspond to a maximum rate of data transfer through the MAC units 226-1 through 226-N during performance of the recursive operation.

MAC units 226-1 through 226-N may be configured to receive commands to perform recursive operations via interfaces residing along a first lateral dimension of MAC units 226-1 through 226-N and to receive bit strings along a second lateral dimension of MAC units 226-1 through 226-N. In some embodiments, the second lateral dimension may be substantially orthogonal to the first lateral dimension. For example, commands may be asserted to MAC units 226-1 through 226-N via buffers/interfaces 216-1 through 216-M located on top or bottom portions of MAC units 226-1 through 226-N (as oriented in FIG. 2B), while bit strings may be streamed into MAC units 226-1 through 226-N via logic circuitry 222-1 through 222-N and/or data arbiters 218-1 through 218-N located along side portions of MAC units 226-1 through 226-N (as oriented in FIG. 2B).

In some embodiments, at least one of the processing resources may be configured to determine a matrix size and, in response to the matrix size being greater than a threshold size, perform operations to simplify the matrix before the plurality of MAC units 226-1 to 226-N performs the recursive operation, such that the resulting matrices are smaller than the threshold size. For example, if the MAC units 226-1 through 226-N are each configured to perform operations on a 64-bit by 64-bit matrix or tensor and the bit strings are initially provided in a matrix or tensor that is greater than 64 bits by 64 bits, at least one processing resource may perform operations to reduce the matrix or tensor such that the matrices or tensors operated on contain at most 64 bits by 64 bits.
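One straightforward way such a simplification step might work is plain tiling. The sketch below (a hypothetical helper, with a small limit for readability in place of 64) splits an oversized matrix into tiles no larger than the unit size:

```python
def split_into_tiles(matrix, limit=64):
    # Break a matrix into sub-matrices of at most limit x limit so that
    # each tile fits within a single MAC unit's accommodation.
    tiles = []
    for i in range(0, len(matrix), limit):
        for j in range(0, len(matrix[0]), limit):
            tiles.append([row[j:j + limit] for row in matrix[i:i + limit]])
    return tiles
```

For example, a 128 x 128 matrix with limit=64 would yield four 64 x 64 tiles, one per MAC unit in the grid configuration described above.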

The MAC units 226-1 to 226-N may be configured to perform multiple iterations of the recursive operation using the same at least one bit string as a previous iteration of the recursive operation. For example, the MAC units 226-1 through 226-N may reuse the same bit string for multiple iterations of the recursive operation. By repeatedly using at least one bit string as an operand of a plurality of recursions of a respective recursive operation, the number of signals and/or function calls used to retrieve the bit string may be reduced compared to a method in which successive iterations of the respective recursive operation use different bit strings.

FIG. 3 is an example of an n-bit universal number, or "unum", with es exponent bits. In the example of FIG. 3, the n-bit unum is a hypothetical bit string 331. As shown in FIG. 3, the n-bit hypothesis 331 may include a set of sign bits (e.g., a first bit subset or sign bit subset 333), a set of scheme bits (e.g., a second bit subset or scheme bit subset 335), a set of exponent bits (e.g., a third bit subset or exponent bit subset 337), and a set of mantissa bits (e.g., a fourth bit subset or mantissa bit subset 339). The mantissa bits 339 may alternatively be referred to as the "fraction portion" or "fraction bits", and may represent the portion of the numerical value after the decimal point.

The sign bit 333 may be zero (0) for positive numbers and one (1) for negative numbers. The scheme bits 335 are described below in connection with Table 1, which shows (binary) bit strings and their associated numerical values k. In Table 1, the numerical value k is determined by the run length of the bit string. The letter X in the binary portion of Table 1 indicates that the bit value is irrelevant to determining the scheme, because the (binary) bit string terminates in response to a bit flip or when the end of the bit string is reached. For example, in the (binary) bit string 0010, the bit string terminates in response to the zero flipping to a one and then back to a zero. Accordingly, the last zero is irrelevant to the scheme, and all that is considered for the scheme are the leading identical bits and the first opposite bit that terminates the bit string (if the bit string contains such a bit).

Binary          0000   0001   001X   01XX   10XX   110X   1110   1111
Numerical (k)    -4     -3     -2     -1      0      1      2      3

TABLE 1

In FIG. 3, the scheme bits 335 r correspond to identical bits in the bit string, while the scheme bit 335 r̄ corresponds to the opposite bit that terminates the bit string. For example, for the numerical value k = -2 shown in Table 1, the scheme bits r correspond to the first two leading zeros, while the scheme bit r̄ corresponds to the one that follows them. As noted above, the final bit of the bit string corresponding to the numerical value k, represented by an X in Table 1, is irrelevant to the scheme.

If m corresponds to the number of identical bits in the bit string, then k = -m if those bits are zero, and k = m - 1 if those bits are one. This is illustrated in Table 1 where, for example, the (binary) bit string 10XX has a single one and k = m - 1 = 0. Similarly, the (binary) bit string 0001 contains three zeros, so k = -m = -3. The scheme indicates a scale factor of useed^k, where useed = 2^(2^es). Several example values of useed are shown in Table 2 below.

es       0       1         2          3            4
useed    2    2^2 = 4   4^2 = 16   16^2 = 256   256^2 = 65536

TABLE 2
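The two rules above, the run-length scheme value k from Table 1 and useed from Table 2, are compact enough to sketch directly. The function names are illustrative; the arithmetic follows the definitions in the text.

```python
# Sketch of the scheme (regime) decoding rules from Tables 1 and 2:
# k is derived from the run length of identical leading bits, and
# useed = 2**(2**es).

def regime_k(bits):
    """Return k for scheme bits `bits` (a string such as '0001' or '10').
    A run of m zeros gives k = -m; a run of m ones gives k = m - 1."""
    first = bits[0]
    m = 1
    for b in bits[1:]:
        if b != first:      # the opposite bit terminates the run
            break
        m += 1
    return -m if first == '0' else m - 1

def useed(es):
    return 2 ** (2 ** es)

# Matches Table 1: '0001' -> -3, '10XX' run '10' -> 0, '1111' -> 3.
# Matches Table 2: es = 2 -> 16, es = 4 -> 65536.
```

Note that `regime_k` ignores anything after the terminating bit, mirroring the X ("don't care") entries in Table 1.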

The exponent bits 337 correspond to an exponent e, which is an unsigned number. In contrast to floating-point numbers, the exponent bits 337 described herein may have no bias associated with them. Accordingly, the exponent bits 337 described herein may represent scaling by a factor of 2^e. As shown in FIG. 3, there may be up to es exponent bits (e1, e2, e3, ..., e_es), depending on how many bits remain to the right of the scheme bits 335 of the n-bit hypothesis 331. In some embodiments, this may allow for tapered accuracy of the n-bit hypothesis 331, in which numbers closer in magnitude to one have higher accuracy than the largest or smallest numbers. However, because very large or very small numbers may be used less frequently in certain kinds of operations, the tapered-accuracy behavior of the n-bit hypothesis 331 shown in FIG. 3 may be desirable in a wide range of situations.

The mantissa bits 339 (or fraction bits) represent any additional bits that may be part of the n-bit hypothesis 331 located to the right of the exponent bits 337. Similar to a floating-point bit string, the mantissa bits 339 represent a fraction f, which may be analogous to the fraction 1.f, where f includes one or more bits to the right of the decimal point. In contrast to a floating-point bit string, however, in the n-bit hypothesis 331 shown in FIG. 3 the "hidden bit" may always be one, whereas a floating-point bit string may include subnormal numbers with a "hidden bit" of zero (e.g., 0.f).

As described herein, altering the number of bits in one or more of the sign 333 bit subset, the scheme 335 bit subset, the exponent 337 bit subset, or the mantissa 339 bit subset may change the precision of the n-bit hypothesis 331. For example, changing the total number of bits in the n-bit hypothesis 331 may alter the resolution of the n-bit hypothesis bit string 331. That is, an 8-bit hypothesis may be converted into a 16-bit hypothesis by increasing the number of bits associated with one or more of the constituent bit subsets of the hypothesis bit string, thereby increasing the resolution of the hypothesis bit string. Conversely, the resolution of the hypothesis bit string can be reduced, for example from a 64-bit resolution to a 32-bit resolution, by reducing the number of bits associated with one or more of the constituent bit subsets of the hypothesis bit string.

In some embodiments, altering the number of bits associated with one or more of the scheme 335 bit subset, the exponent 337 bit subset, and/or the mantissa 339 bit subset to change the precision of the n-bit hypothesis 331 may result in a change to at least one other of the scheme 335 bit subset, the exponent 337 bit subset, and/or the mantissa 339 bit subset. For example, when the precision of the n-bit hypothesis 331 is altered to increase the resolution of the n-bit hypothesis bit string 331 (e.g., when an "up-conversion" operation is performed to increase the bit width of the n-bit hypothesis bit string 331), the number of bits associated with one or more of the scheme 335 bit subset, the exponent 337 bit subset, and/or the mantissa 339 bit subset may be altered.

In a non-limiting example in which the resolution of the n-bit hypothetical bit string 331 is increased (e.g., the precision of the n-bit hypothetical bit string 331 is changed to increase the bit width of the n-bit hypothetical bit string 331) but the number of bits associated with the exponent 337 bit subset does not change, the number of bits associated with the mantissa 339 bit subset may be increased. In at least one embodiment, increasing the number of bits of the mantissa 339 bit subset while the exponent 337 bit subset remains unchanged may include appending one or more zero bits to the mantissa 339 bit subset.
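The up-conversion case just described can be sketched at the level of the bit fields. Representing the four subsets as strings is an illustrative simplification, and the function name is an assumption; the point is only that zero bits appended after the fraction leave the value 1.f unchanged while widening the string.

```python
# Sketch of up-conversion with a fixed exponent field: widen a hypothesis
# bit string by appending zero bits to the mantissa subset.

def widen_mantissa(sign, scheme, exponent, mantissa, new_total_bits):
    current = len(sign) + len(scheme) + len(exponent) + len(mantissa)
    extra = new_total_bits - current
    if extra < 0:
        raise ValueError("target width smaller than current width")
    # Trailing zero bits in the fraction do not change the value 1.f
    return sign + scheme + exponent + mantissa + '0' * extra

# An 8-bit string up-converted to 16 bits gains 8 zero mantissa bits.
wide = widen_mantissa('0', '0001', '101', '', 16)
```

The converse down-conversion discussed below would truncate mantissa bits instead, which can lose information when the truncated bits are nonzero.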

In another non-limiting example, where the resolution of the n-bit hypothetical bit string 331 is increased by altering the value and/or amount of bits associated with the exponent 337 bit subset (e.g., the precision of the n-bit hypothetical bit string 331 is changed to increase the bit width of the n-bit hypothetical bit string 331), the value and/or amount of bits associated with the scheme 335 bit subset and/or the mantissa 339 bit subset may be increased or decreased. For example, if the values and/or amounts of bits associated with the exponent 337 bit subset increase or decrease, the values and/or amounts of bits associated with the scheme 335 bit subset and/or the mantissa 339 bit subset may produce a corresponding change. In at least one embodiment, increasing or decreasing the value and/or amount of bits associated with the scheme 335 bit subset and/or the mantissa 339 bit subset may include adding one or more zero bits to the scheme 335 bit subset and/or the mantissa 339 bit subset and/or truncating the value or amount of bits associated with the scheme 335 bit subset and/or the mantissa 339 bit subset.

In another example in which the resolution of the n-bit hypothetical bit string 331 is increased (e.g., the precision of the n-bit hypothetical bit string 331 is changed to increase the bit width of the n-bit hypothetical bit string 331), the number of bits associated with the exponent 337 bit subset may be increased and the number of bits associated with the scheme 335 bit subset may be decreased. Conversely, in some embodiments, the number of bits associated with the exponent 337 bit subset may be decreased and the number of bits associated with the scheme 335 bit subset may be increased.

In a non-limiting example in which the resolution of the n-bit hypothetical bit string 331 is reduced (e.g., the precision of the n-bit hypothetical bit string 331 is changed to reduce the bit width of the n-bit hypothetical bit string 331) but the number of bits associated with the exponent 337 bit subset does not change, the number of bits associated with the mantissa 339 bit subset may be reduced. In at least one embodiment, reducing the number of bits of the mantissa 339 bit subset while the exponent 337 bit subset remains unchanged may include truncating one or more bits from the mantissa 339 bit subset.

In another non-limiting example, where the resolution of the n-bit hypothetical bit string 331 is reduced by altering the value and/or amount of bits associated with the exponent 337 bit subset (e.g., the precision of the n-bit hypothetical bit string 331 is changed to reduce the bit width of the n-bit hypothetical bit string 331), the value and/or amount of bits associated with the scheme 335 bit subset and/or the mantissa 339 bit subset may be increased or decreased. For example, if the values and/or amounts of bits associated with the exponent 337 bit subset increase or decrease, the values and/or amounts of bits associated with the scheme 335 bit subset and/or the mantissa 339 bit subset may produce a corresponding change. In at least one embodiment, increasing or decreasing the value and/or amount of bits associated with the scheme 335 bit subset and/or the mantissa 339 bit subset may include adding one or more zero bits to the scheme 335 bit subset and/or the mantissa 339 bit subset and/or truncating the value or amount of bits associated with the scheme 335 bit subset and/or the mantissa 339 bit subset.

In some embodiments, changing the number of bits in the exponent bit subset may alter the dynamic range of the n-bit hypothesis 331. For example, a 32-bit hypothetical bit string whose exponent bit subset contains zero bits (e.g., a 32-bit hypothetical bit string with es = 0, or a (32,0) hypothetical bit string) may have a dynamic range of approximately 18 decades. However, a 32-bit hypothetical bit string whose exponent bit subset contains 3 bits (e.g., a 32-bit hypothetical bit string with es = 3, or a (32,3) hypothetical bit string) may have a dynamic range of approximately 145 decades.

FIG. 4A is an example of positive values for a 3-bit hypothesis. In FIG. 4A, only the right half of the projection of the real numbers is shown; however, it should be appreciated that negative values corresponding to the positive counterparts shown in FIG. 4A exist on a curve representing a reflection about the y-axis of the curve shown in FIG. 4A.

In the example of FIG. 4A, es = 2, and thus useed = 2^(2^2) = 16. The precision of hypothesis 431-1 can be increased by appending bits to the bit string, as shown in FIG. 4B. For example, appending a bit with a value of one (1) to the bit string of hypothesis 431-1 increases the accuracy of hypothesis 431-1, as shown by hypothesis 431-2 in FIG. 4B. Similarly, appending a bit with a value of one to the bit string of hypothesis 431-2 in FIG. 4B increases the accuracy of hypothesis 431-2, as shown by hypothesis 431-3 in FIG. 4B. The following is an example of interpolation rules that may be used to append bits to the bit string of hypothesis 431-1 shown in FIG. 4A to obtain the hypotheses 431-2, 431-3 illustrated in FIG. 4B.

If maxpos is the largest positive value of the bit string of hypotheses 431-1, 431-2, 431-3 and minpos is the smallest value of the bit string of hypotheses 431-1, 431-2, 431-3, then maxpos may be equal to useed and minpos may be equal to 1/useed. Between maxpos and ±∞, the new bit value may be maxpos × useed, and between zero and minpos, the new bit value may be minpos/useed. These new bit values correspond to new scheme bits 335. Between existing values x = 2^m and y = 2^n, where m and n differ by more than one, the new bit value may be given by the geometric mean √(x × y) = 2^((m+n)/2), which corresponds to a new exponent bit 337. If the new bit value lies midway between the existing x and y values next to it, the new bit value may represent the arithmetic mean (x + y)/2, which corresponds to a new mantissa bit 339.
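The geometric-mean and arithmetic-mean rules for interior values can be illustrated numerically. This is a sketch of the selection rule only, not a bit-level hypothesis constructor, and the function name is an assumption.

```python
import math

# Sketch of the interpolation rule for the value a newly appended bit
# represents, between existing neighbors x = 2**m and y = 2**n.

def new_value(x, y):
    m, n = math.log2(x), math.log2(y)
    if abs(m - n) > 1:
        # exponents differ by more than one: geometric mean (new exponent bit)
        return 2 ** ((m + n) / 2)
    # otherwise: arithmetic mean (new mantissa bit)
    return (x + y) / 2

# Between 1 and 16 (exponents 0 and 4) the new value is the geometric
# mean, 4; between 1 and 2 it is the arithmetic mean, 1.5.
```

This matches the pattern visible in FIG. 4B, where values such as 4 = √(1 × 16) appear between existing powers of useed.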

FIG. 4B is an example of hypothesis construction using two exponent bits. In FIG. 4B, only the right half of the projection of the real numbers is shown; however, it should be appreciated that negative values corresponding to the positive counterparts shown in FIG. 4B exist on a curve representing a reflection about the y-axis of the curve shown in FIG. 4B. The hypotheses 431-1, 431-2, 431-3 shown in FIG. 4B each contain only two exception values: zero (0), when all bits of the bit string are zero, and ±∞, when the bit string is a one (1) followed by all zeros. It should be noted that the numerical values of the hypotheses 431-1, 431-2, 431-3 shown in FIG. 4B are exactly useed^k. That is, the numerical values of the hypotheses 431-1, 431-2, 431-3 shown in FIG. 4B are exactly useed raised to the power of the k value represented by the scheme (e.g., the scheme bits 335 described above in connection with FIG. 3). In FIG. 4B, hypothesis 431-1 has es = 2, and thus useed = 16; hypothesis 431-2 has es = 3, and thus useed = 256; and hypothesis 431-3 has es = 4, and thus useed = 65536.

As an illustrative example of adding a bit to the 3-bit hypothesis 431-1 to form the 4-bit hypothesis 431-2 of FIG. 4B, the useed becomes 256, so the bit string corresponding to 256 has an additional scheme bit appended to it, and the former useed, 16, has a terminating scheme bit r̄ appended to it. As described above, between existing values, the corresponding bit strings have an additional exponent bit appended to them. For example, the numerical values 1/16, 1/4, 1, and 4 will have an exponent bit appended to them. That is, the final one corresponding to the numerical value 4 is an exponent bit, the final zero corresponding to the numerical value 1 is an exponent bit, and so on. This pattern can be further seen in hypothesis 431-3, which is the 5-bit hypothesis generated from the 4-bit hypothesis 431-2 according to the rules above. If another bit were added to hypothesis 431-3 in FIG. 4B to produce a 6-bit hypothesis, mantissa bits 339 would be appended to the numerical values between 1/16 and 16.

The following is a non-limiting example of decoding a hypothesis (e.g., hypothesis 431) to obtain its numerical equivalent. In some embodiments, the bit string corresponding to a hypothesis p is an unsigned integer ranging from -2^(n-1) to 2^(n-1), k is an integer corresponding to the scheme bits 335, and e is an unsigned integer corresponding to the exponent bits 337. If the set of mantissa bits 339 is denoted {f1 f2 ... f_fs} and f is the value represented by 1.f1 f2 ... f_fs (e.g., by a one followed by a decimal point followed by the mantissa bits 339), then p may be given by Equation 1 below.

x = 0, if p = 0
x = ±∞, if p = -2^(n-1)
x = sign(p) × useed^k × 2^e × f, for all other p

Equation 1

Another illustrative example of decoding a hypothetical bit string is provided below in connection with the hypothetical bit string 0000110111011101 shown in Table 3 below.

Sign    Scheme    Exponent    Mantissa
0       0001      101         11011101

TABLE 3

In Table 3, the hypothetical bit string 0000110111011101 is broken down into its constituent sets of bits (e.g., the sign bit 333, the scheme bits 335, the exponent bits 337, and the mantissa bits 339). Since es = 3 in the hypothetical bit string shown in Table 3 (e.g., because there are three exponent bits), useed = 256. Because the sign bit 333 is zero, the value corresponding to the numerical representation of the hypothetical bit string shown in Table 3 is positive. The scheme bits 335 have a run of three consecutive zeros, corresponding to a value of -3 (as described above in connection with Table 1). Thus, the scale factor contributed by the scheme bits 335 is 256^-3 (e.g., useed^k). The exponent bits 337 represent five (5) as an unsigned integer and therefore contribute an additional scale factor of 2^e = 2^5 = 32. Finally, the mantissa bits 339, given as 11011101 in Table 3, represent two hundred twenty-one (221) as an unsigned integer, so the fraction f given above is f = 221/256. Using these values and Equation 1, the numerical value corresponding to the hypothetical bit string given in Table 3 is +256^-3 × 2^5 × (1 + 221/256) = 477/134217728 ≈ 3.55393 × 10^-6.
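The worked decoding example above can be checked with a short sketch of Equation 1. The parsing is simplified: it handles only non-negative bit strings (sign bit 0), assumes es is known in advance, and assumes the scheme, exponent, and mantissa fields are all present, as in Table 3.

```python
# Sketch of the decode procedure of Equation 1 for a non-negative
# hypothesis bit string, with es known in advance.

def decode_posit(bits, es):
    """Decode a hypothesis bit string with sign bit 0. Negative strings
    (two's-complement form) are omitted from this sketch."""
    assert bits[0] == '0'
    body = bits[1:]
    # scheme: run of identical bits, terminated by one opposite bit
    first = body[0]
    m = 1
    while m < len(body) and body[m] == first:
        m += 1
    k = -m if first == '0' else m - 1
    rest = body[m + 1:]                  # skip the terminating scheme bit
    e = int(rest[:es] or '0', 2)         # exponent, unsigned, no bias
    frac_bits = rest[es:]
    f = 1 + (int(frac_bits, 2) / 2 ** len(frac_bits) if frac_bits else 0)
    useed = 2 ** (2 ** es)
    return useed ** k * 2 ** e * f       # Equation 1, positive branch

# The Table 3 example: 0000110111011101 with es = 3 decodes to
# 256**-3 * 2**5 * (1 + 221/256) = 477/134217728.
v = decode_posit('0000110111011101', 3)
```

Running the sketch reproduces the scale factors derived above: k = -3 from the scheme run, e = 5 from the exponent bits, and f = 1 + 221/256 from the mantissa.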

FIG. 5 is a flow diagram representing an example method 550 for operating a processing unit for hypothesis operations, in accordance with a number of embodiments of the present disclosure. At block 552, the method 550 may include organizing a plurality of universal number (unum) or hypothetical bit strings in a matrix based at least in part on a threshold bandwidth ratio, a threshold latency, or both associated with recursive operations of a plurality of multiply-accumulator (MAC) units coupled to one another. The MAC units may be similar to the MAC units 226-1 through 226-N shown in FIGS. 2A and 2B herein. The threshold bandwidth ratio and/or the threshold latency may correspond to a desired throughput and/or processing speed at which the recursive operations are to be performed. In some embodiments, the operations to organize the unum or hypothetical bit strings may be performed by respective data arbiters (e.g., the data arbiters 218-1 through 218-N shown in FIGS. 2A and 2B herein) coupled to respective MAC units.

In some embodiments, the MAC units may include circuitry configured to perform respective recursive operations, and reside on application specific integrated circuits or field programmable gate arrays. Additionally, as described above, the MAC units may be organized in a grid configuration with respect to each other.

At block 554, the method 550 may include receiving, by the MAC units, the plurality of unum or hypothesized bit strings organized into the matrix.

At block 556, the method 550 may include performing a plurality of respective recursive operations using the MAC unit, wherein iterations of the respective recursive operations are performed using at least one bit string that is the same as a bit string used for a previous iteration of the respective recursive operation. For example, the method 550 may include using at least one identical unum or hypothesized bit string as operands for multiple recursions of the respective recursive operation. By using at least one identical unum or assumed bit string as operands of a plurality of recursions of a respective recursive operation, the number of signals and/or function calls for retrieving the unum or assumed bit string may be reduced compared to a method in which successive iterations of the respective recursive operation use different bit strings.

In some embodiments, the method 550 may include performing at least a portion of the recursive operations substantially in parallel. As used herein, the term "substantially" means that the characteristic need not be absolute, but rather close enough to achieve an advantage in the characteristic. For example, "substantially parallel" is not limited to operations that are performed completely in parallel, and may include timing that is intended to be parallel, but may not be precisely parallel due to manufacturing limitations. For example, portions of a recursive operation that are performed "substantially in parallel" may not begin or end at exactly the same time due to read/write delays that may be exhibited by different interfaces of a processing unit. For example, the first and second portions of the recursive operation may be executed such that they are executed concurrently, regardless of whether one of the first and second portions of the recursive operation begins or terminates before the other.

In some embodiments, the method 550 may include performing respective recursive operations by performing sub-operations associated with the respective recursive operations by different MAC units among the plurality of MAC units. For example, a first MAC unit (e.g., MAC unit 226-1 shown in fig. 2A and 2B herein) may perform a first sub-operation associated with the performance of a recursive operation, and a second MAC unit (e.g., MAC unit 226-2 shown in fig. 2A and 2B herein) may perform a second sub-operation associated with the performance of a recursive operation.

At block 558, the method 550 may include storing iteration results of the plurality of respective recursive operations in respective registers coupled to respective MAC units among the plurality of MAC units. The corresponding iteration results may be stored in registers associated with the MAC units, such as overlay registers 227-1 through 227-N shown in fig. 2B herein.

In some embodiments, the method 550 may include receiving, by a first physical portion of the respective MAC units residing along a first lateral dimension of the MAC units, an asserted command, and receiving, by a second physical portion of the respective MAC units residing along a second lateral dimension of the MAC units, the unum or hypothesized bit strings. The first and second lateral dimensions may be substantially orthogonal to each other. For example, commands may be asserted to the top or bottom portions of the MAC units via the buffers/interfaces 216-1 through 216-M shown in FIGS. 2A and 2B, while the unum or hypothesized bit strings may be received at the side portions of the MAC units via the logic circuitry 222-1 through 222-N and/or the data arbiters 218-1 through 218-N shown in FIGS. 2A and 2B herein.

Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that an arrangement calculated to achieve the same results can be substituted for the specific embodiments shown. This disclosure is intended to cover adaptations or variations of one or more embodiments of the present disclosure. It is to be understood that the above description has been made in an illustrative fashion, and not a restrictive one. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description. The scope of one or more embodiments of the present disclosure includes other applications in which the above structures and processes are used. The scope of one or more embodiments of the disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

In the foregoing detailed description, certain features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the disclosed embodiments of the disclosure have to use more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby incorporated into the detailed description, with each claim standing on its own as a separate embodiment.
