Acceleration circuitry for posit operations

Document No.: 153078 | Publication date: 2021-10-26

Inventors: V. S. Ramesh, P. G. Hayes, C. M. Cutler, A. J. Reiss. Created: 2021-04-22.

Abstract: Systems, devices, and methods related to acceleration circuitry for posit operations are described. A first operand formatted in a universal number (unum) or posit format may be received by a first buffer residing on acceleration circuitry. A second operand formatted in the unum or posit format may be received by a second buffer residing on the acceleration circuitry. An arithmetic operation, a logical operation, or both may be performed using the first operand and the second operand using processing circuitry residing on the acceleration circuitry. A result of the arithmetic operation, the logical operation, or both may be received by a third buffer residing on the acceleration circuitry.

1. A method of operating acceleration circuitry for posit operations, comprising:

receiving (1281) a first operand formatted in a universal number (unum) or posit format by a first buffer (545, 546, 662, 762, 862, 962, 1045, 1046, 1171, 1173, 1175) residing on acceleration circuitry (120, 220, 520, 920, 968, 1020, 1170);

receiving (1282), by a second buffer (545, 546, 1045, 1046, 1171, 1173, 1175) residing on the acceleration circuitry, a second operand formatted in the unum or posit format;

performing (1283) an arithmetic operation, a logical operation, or both, using the first operand and the second operand using processing circuitry (222, 223, 548, 665, 765, 766, 865, 866, 965, 966, 1076, 1170) residing on the acceleration circuitry; and

transmitting (1284) results of the arithmetic operation, the logical operation, or both, through a third buffer (547, 1047, 1177) residing on the acceleration circuitry.

2. The method of claim 1, further comprising retrieving the first operand, the second operand, or both from a main memory (130, 230, 530) that is external to the acceleration circuitry (120, 220, 520, 920, 968, 1020, 1170).

3. The method of one of claims 1-2, further comprising retrieving the first operand, the second operand, or both via a Direct Memory Access (DMA) controller (542, 1171, 1173, 1175) coupleable to the acceleration circuitry (120, 220, 520, 920, 968, 1020, 1170).

4. An apparatus providing acceleration circuitry for posit operations, comprising:

a direct memory access (DMA) controller (542, 1171, 1173, 1175);

a first buffer (545, 546, 662, 762, 862, 962, 1045, 1046, 1171, 1173, 1175) coupleable to the DMA controller and configured to store a first operand formatted in a universal number (unum) or posit format;

a second buffer (545, 546, 662, 762, 862, 962, 1045, 1046, 1171, 1173, 1175) coupleable to the DMA controller and configured to store a second operand formatted in the unum or posit format;

processing circuitry (222, 223, 548, 665, 766, 866, 966, 968, 1076, 1170) coupleable to the first buffer and the second buffer, wherein the processing circuitry is configured to perform an arithmetic operation, a logical operation, or both using the first operand and the second operand;

a third buffer (547, 1047, 1177) coupleable to the DMA controller and the processing circuitry, wherein the third buffer is configured to store a result of the arithmetic operation, the logical operation, or both.

5. The apparatus of claim 4, further comprising a memory (130, 230, 530, 624, 724, 824, 924, 1124) coupleable to the DMA controller (542, 1171, 1173, 1175) and configured to store at least one of the first operand, the second operand, and the result.

6. The apparatus of claim 5, further comprising a memory controller (110, 550) coupleable to the memory (130, 230, 530, 624, 724, 824, 924, 1124),

wherein the memory comprises at least two distinct memory components, and

wherein the memory controller is configured to retrieve the first operand, the second operand, or both from a particular one of the at least two distinct memory components of the memory based at least in part on signaling received from the apparatus.

7. The apparatus of claim 5, further comprising a processing device (554, 654, 754, 854, 1154) coupleable to the DMA controller (542, 1171, 1173, 1175), wherein the processing device is configured to:

parse a host-level instruction into machine-level instructions; and

communicate the machine-level instructions to the DMA controller.

8. The apparatus of claim 7, wherein the processing device (554, 654, 754, 854, 1154) comprises a cache coherent reduced instruction set computing device.

9. A system including acceleration circuitry for posit operations, comprising:

a host (102, 202, 502, 602, 702, 802, 902); and

a memory device (104, 204, 504, 1104) coupleable to the host and comprising:

a first memory (762, 862, 962, 1162) configured to store operands formatted in a universal number (unum) or posit format and received from the host;

a second memory (763, 863, 963, 1163) configured to store instructions received from the host;

a processing device (554, 654, 754, 854, 1154) coupleable to the first memory and the second memory and configured to parse instructions stored by the second memory into signals indicative of execution of arithmetic operations, logical operations, or both to be performed by computing circuitry (222, 223, 548, 665, 766, 866, 920, 966, 968, 1076, 1170) coupleable to the processing device,

wherein the computing circuitry is configured to:

receive the signals from the processing device;

retrieve a particular one of the operands stored by the first memory based at least in part on the received signals; and

perform an arithmetic operation, a logical operation, or both using the retrieved operands based at least in part on the received signals.

10. The system of claim 9, wherein the host (102, 202, 502, 602, 702, 802, 902) is configured to mirror instructions stored in a buffer (661, 761, 861, 961) of the host to the second memory (763, 863, 963, 1163).

11. The system of claim 10, wherein the computing circuitry (222, 223, 548, 665, 766, 866, 920, 966, 968, 1076, 1170) comprises:

a buffer (664, 764, 864, 964, 1164) configured to store the machine-level instructions; and

a state machine (623, 723, 823, 923, 1123) coupleable to the buffer and configured to direct execution of the arithmetic operation, the logical operation, or both, in accordance with the machine-level instruction.

12. The system of claim 11, wherein the state machine (623, 723, 823, 923, 1123) comprises:

a plurality of Direct Memory Access (DMA) controllers (1171, 1173, 1175) configured to retrieve respective ones of the retrieved operands; and

a DMA controller (1177) configured to write a result of the arithmetic operation, the logical operation, or both to the first memory.

13. The system of claim 11, wherein the processing device (554, 654, 754, 854, 1154) is further configured to:

determine whether the machine-level instruction has been executed by the state machine (623, 723, 823, 923, 1123); and

in response to determining that the machine-level instruction has been executed, cause the second memory (763, 863, 963, 1163) to store data indicative of the executed machine-level instruction.

14. The system of claim 13, wherein the second memory (762, 862, 962, 1162) is configured to mirror the data indicative of the executed machine-level instructions to a buffer (660, 760, 860, 960) of the host (102, 202, 502, 602, 702, 802, 902).

15. The system of any of claims 9 to 14, wherein the computing circuitry (222, 223, 548, 665, 765, 766, 865, 866, 965, 966, 1076, 1170) comprises:

a Direct Memory Access (DMA) controller (542);

a first buffer (545) coupleable to the DMA controller and configured to store a first operand of the retrieved operands;

a second buffer (546) coupleable to the DMA controller and configured to store a second one of the retrieved operands; and

processing circuitry (544) resident on the computing circuitry, coupleable to the first buffer and the second buffer, and configured to perform an arithmetic operation, a logical operation, or both using the first operand and the second operand.

16. The system of any of claims 9 to 14, wherein the computing circuitry (222, 223, 548, 665, 765, 766, 865, 866, 965, 966, 1076, 1170) comprises a plurality of compute circuits (766-0 to 766-N, 866-0 to 866-N, 966-0 to 966-N) coupleable to the first memory (762, 862, 962, 1162) and the second memory (763, 863, 963, 1163), wherein each compute circuit comprises:

a buffer (764, 864, 964) configured to store a respective machine-level instruction;

multiply-accumulate (MAC) circuitry (548); and

a state machine (723, 823, 923) coupleable to the buffer and configured to direct execution, by the MAC circuitry, of an arithmetic operation, a logical operation, or both using a first one of the retrieved operands and a second one of the retrieved operands in accordance with the respective machine-level instruction,

wherein the second memory comprises respective buffers (763, 863) corresponding to respective ones of the plurality of compute circuits.

17. The system of claim 16, wherein each compute circuit (665, 765, 766-0 to 766-N, 865, 866-0 to 866-N, 965, 966-0 to 966-N) is configured to perform a dot-product operation of a matrix multiplication operation defined by the instructions stored by the second memory (763, 863, 963, 1163).

18. The system of claim 16, further comprising:

acceleration circuitry (820) coupleable to the first memory (862), the second memory (863), and a peripheral component interconnect express (PCIe) interface (852) of the memory device (824), wherein the acceleration circuitry comprises:

a Direct Memory Access (DMA) controller (542);

a first buffer (545) coupleable to the DMA controller and configured to store a first operand formatted in a universal number (unum) or posit format;

a second buffer (546) coupleable to the DMA controller and configured to store a second operand formatted in the unum or posit format; and

processing circuitry (548) coupled to the first buffer and the second buffer and configured to perform a scalar operation using the first operand and the second operand.

19. A system including acceleration circuitry for posit operations, comprising:

a host (102, 202, 502, 602, 702, 802, 902) configured to:

transfer a data vector formatted in a universal number (unum) or posit format to a memory (130, 230, 530, 624, 724, 824, 924, 1124) of a memory device (104, 204, 504, 1104);

parse higher-level instructions into lower-level instructions; and

write respective ones of the lower-level instructions to respective buffers (664, 764, 864, 964, 1164) of respective compute circuits (665, 765, 766, 865, 866, 965, 966, 1165); and

the memory device coupleable to the host via a peripheral component interconnect express (PCIe) interface (652, 752, 852, 952, 1152), wherein the memory device comprises:

the memory configured to store the data vector received from the host; and

a plurality of compute circuits (766-0 to 766-N, 866-0 to 866-N, 966-0 to 966-N) coupleable to the memory, wherein each compute circuit is configured to:

receive a respective lower-level instruction from the host; and

perform an arithmetic operation, a logical operation, or both, in accordance with the respective lower-level instruction.

20. An apparatus providing acceleration circuitry for posit operations, comprising:

a Direct Memory Access (DMA) block (1020) implemented in a Field Programmable Gate Array (FPGA), wherein the DMA block comprises:

a first-in-first-out (FIFO) input buffer (1045) configured to receive a first bit string;

a second FIFO input buffer (1046) configured to receive a second bit string; and

circuitry (1072) coupleable to the first FIFO buffer and the second FIFO buffer, wherein the circuitry is configured to unpack the first bit string and the second bit string;

first circuitry (1073) implemented in the FPGA and coupleable to the DMA block and configured to convert data values formatted in a floating point format to data values formatted in a universal number (unum) or posit format;

second circuitry (1074) implemented in the FPGA and coupleable to the DMA block and configured to convert data values formatted in the unum or posit format to data values formatted in a floating point format;

third circuitry (1076) implemented in the FPGA and coupleable to the DMA block and configured to perform arithmetic operations, logical operations, or both using the first bit string and the second bit string;

fourth circuitry (1076) implemented in the FPGA and coupleable to the first, second, and third circuitry and configured to pack results of the arithmetic operation, the logical operation, or both; and

a FIFO output buffer (1047) coupleable to the fourth circuitry.

21. A method of operating acceleration circuitry for posit operations, comprising:

pre-computing (1386) results of a plurality of arithmetic operations, a plurality of logical operations, or both, using a plurality of operands formatted in a universal number (unum) or posit format;

compiling (1387) the pre-computed results into a look-up table (LUT);

receiving (1388), from a processing device (554, 654, 754, 854, 1154), an instruction to perform a particular operation of the plurality of arithmetic operations or the plurality of logical operations using a particular set of the plurality of operands; and

identifying (1389), via the LUT, one of the pre-computed results that corresponds to the particular operation and the particular set of the plurality of operands.

22. The method of claim 21, wherein the pre-computing (1386) the results comprises pre-computing results of the plurality of arithmetic operations, the plurality of logical operations, or both, using operands of a particular bit length.

23. The method of any one of claims 21-22, wherein identifying the one of the pre-computed results comprises determining an address of the LUT associated with a result of the particular operation and the particular set of the plurality of operands, and

wherein the method further comprises performing additional arithmetic or logical operations using the identified result.

Technical Field

The present disclosure relates generally to semiconductor memories and methods, and more particularly, to apparatus, systems, and methods of acceleration circuitry for posit operations.

Background

Memory devices are typically provided as internal, semiconductor, integrated circuits in computers or other electronic systems. There are many different types of memory, including volatile and non-volatile memory. Volatile memory may require power to maintain its data (e.g., host data, error data, etc.) and includes Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Synchronous Dynamic Random Access Memory (SDRAM), and Thyristor Random Access Memory (TRAM), among others. Non-volatile memory may provide persistent data by preserving stored data when not powered, and may include NAND flash memory, NOR flash memory, and resistance variable memory, such as Phase Change Random Access Memory (PCRAM), Resistive Random Access Memory (RRAM), and Magnetoresistive Random Access Memory (MRAM), such as spin torque transfer random access memory (STT RAM), among others.

The memory device may be coupled to a host (e.g., a host computing device) to store data, commands, and/or instructions for use by the host when the computer or electronic system is in operation. For example, data, commands, and/or instructions may be transferred between a host and a memory device during operation of a computing or other electronic system.

Disclosure of Invention

According to an aspect of the present application, a method of operating acceleration circuitry for posit operations is provided. The method comprises the following steps: receiving, by a first buffer residing on acceleration circuitry, a first operand formatted in a universal number (unum) or posit format; receiving, by a second buffer residing on the acceleration circuitry, a second operand formatted in the unum or posit format; performing an arithmetic operation, a logical operation, or both using the first operand and the second operand using processing circuitry residing on the acceleration circuitry; and transmitting a result of the arithmetic operation, the logical operation, or both, through a third buffer residing on the acceleration circuitry.

According to another aspect of the present application, an apparatus providing acceleration circuitry for posit operations is provided. The apparatus comprises: a direct memory access (DMA) controller; a first buffer coupleable to the DMA controller and configured to store a first operand formatted in a universal number (unum) or posit format; a second buffer coupleable to the DMA controller and configured to store a second operand formatted in the unum or posit format; processing circuitry coupleable to the first buffer and the second buffer, wherein the processing circuitry is configured to perform an arithmetic operation, a logical operation, or both using the first operand and the second operand; and a third buffer coupleable to the DMA controller and the processing circuitry, wherein the third buffer is configured to store a result of the arithmetic operation, the logical operation, or both.

According to yet another aspect of the present application, a system including acceleration circuitry for posit operations is provided. The system comprises: a host; and a memory device coupleable to the host and comprising: a first memory configured to store operands formatted in a universal number (unum) or posit format and received from the host; a second memory configured to store instructions received from the host; and a processing device coupleable to the first memory and the second memory and configured to parse instructions stored by the second memory into signals indicative of execution of arithmetic operations, logical operations, or both to be performed by computing circuitry coupleable to the processing device, wherein the computing circuitry is configured to: receive the signals from the processing device; retrieve a particular one of the operands stored by the first memory based at least in part on the received signals; and perform an arithmetic operation, a logical operation, or both using the retrieved operands based at least in part on the received signals.

According to yet another aspect of the present application, a system including acceleration circuitry for posit operations is provided. The system comprises: a host configured to: transfer a data vector formatted in a universal number (unum) or posit format to a memory of a memory device; parse higher-level instructions into lower-level instructions; and write respective ones of the lower-level instructions to respective buffers of respective compute circuits; and the memory device coupleable to the host via a peripheral component interconnect express (PCIe) interface, wherein the memory device comprises: the memory configured to store the data vector received from the host; and a plurality of compute circuits coupleable to the memory, wherein each compute circuit is configured to: receive a respective lower-level instruction from the host; and perform an arithmetic operation, a logical operation, or both, in accordance with the respective lower-level instruction.

According to yet another aspect of the present application, an apparatus providing acceleration circuitry for posit operations is provided. The apparatus comprises: a direct memory access (DMA) block implemented in a field programmable gate array (FPGA), wherein the DMA block comprises: a first-in-first-out (FIFO) input buffer configured to receive a first bit string; a second FIFO input buffer configured to receive a second bit string; and circuitry coupleable to the first FIFO buffer and the second FIFO buffer, wherein the circuitry is configured to unpack the first bit string and the second bit string; first circuitry implemented in the FPGA and coupleable to the DMA block and configured to convert data values formatted in a floating point format to data values formatted in a universal number (unum) or posit format; second circuitry implemented in the FPGA and coupleable to the DMA block and configured to convert data values formatted in the unum or posit format to data values formatted in a floating point format; third circuitry implemented in the FPGA and coupleable to the DMA block and configured to perform an arithmetic operation, a logical operation, or both using the first bit string and the second bit string; fourth circuitry implemented in the FPGA and coupleable to the first, second, and third circuitry and configured to pack results of the arithmetic operation, the logical operation, or both; and a FIFO output buffer coupleable to the fourth circuitry.

According to yet another aspect of the present application, a method of operating acceleration circuitry for posit operations is provided. The method comprises the following steps: pre-computing results of a plurality of arithmetic operations, a plurality of logical operations, or both, using a plurality of operands formatted in a universal number (unum) or posit format; compiling the pre-computed results into a look-up table (LUT); receiving, from a processing device, an instruction to perform a particular operation of the plurality of arithmetic operations or the plurality of logical operations using a particular set of the plurality of operands; and identifying, via the LUT, one of the pre-computed results that corresponds to the particular operation and the particular set of the plurality of operands.
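As a rough illustration of this final aspect, the following Python sketch pre-computes results for every pair of 8-bit operands and compiles them into a dictionary-backed LUT keyed by (operation, operand, operand). It is a minimal sketch under stated assumptions: operands are treated here as raw 8-bit patterns rather than decoded unum/posit values, and `build_lut` is a hypothetical helper, not part of the disclosed circuitry.

```python
from itertools import product

def build_lut(ops, bit_len=8):
    """Pre-compute results for every operand pair of a given bit length and
    compile them into a look-up table keyed by (op name, operand a, operand b)."""
    mask = (1 << bit_len) - 1
    lut = {}
    for name, fn in ops.items():
        for a, b in product(range(1 << bit_len), repeat=2):
            lut[(name, a, b)] = fn(a, b) & mask  # result truncated to bit_len bits
    return lut

ops = {"add": lambda a, b: a + b, "xor": lambda a, b: a ^ b}
lut = build_lut(ops)
# A request for a particular operation and operand pair becomes a lookup
# rather than a computation:
print(lut[("add", 200, 100)])   # 44, i.e., 300 mod 256
```

For an 8-bit operand width, each operation's table already holds 2^16 entries, which suggests why the method pre-computes results only for operands of a particular bit length.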

Drawings

Fig. 1 is a functional block diagram in the form of a computing system including an apparatus including a host and a memory device, according to several embodiments of the present disclosure.

Fig. 2A is a functional block diagram in the form of a computing system including an apparatus including a host and a memory device, according to several embodiments of the present disclosure.

Fig. 2B is another functional block diagram in the form of a computing system including an apparatus including a host and a memory device, according to several embodiments of the present disclosure.

FIG. 3 is an example of an n-bit posit with es exponent bits.

FIG. 4A is an example of a positive value for a 3-bit posit.

FIG. 4B is an example of a posit construction using two exponent bits.

Fig. 5 is another functional block diagram in the form of a computing system including an apparatus including a host and a memory device, according to several embodiments of the present disclosure.

Fig. 6 is another functional block diagram in the form of a computing system including an apparatus including a host and memory resources, according to several embodiments of the present disclosure.

Fig. 7 is another functional block diagram in the form of a computing system including an apparatus including a host and memory resources, in accordance with several embodiments of the present disclosure.

Fig. 8 is another functional block diagram in the form of a computing system including an apparatus including a host and memory resources, according to several embodiments of the present disclosure.

Fig. 9 is another functional block diagram in the form of a computing system including an apparatus including a host and memory resources, in accordance with several embodiments of the present disclosure.

Fig. 10 is a functional block diagram in the form of acceleration circuitry according to several embodiments of the present disclosure.

Fig. 11 is another functional block diagram in the form of a memory device, according to several embodiments of the present disclosure.

Fig. 12 is a flow diagram representing an example method of acceleration circuitry for posit operations, in accordance with several embodiments of the present disclosure.

Fig. 13 is another flow diagram representing an example method of acceleration circuitry for posit operations, in accordance with several embodiments of the present disclosure.

Detailed Description

Systems, devices, and methods related to acceleration circuitry for posit operations are described. Universal number (unum) bit strings, such as posit bit string operands and posit bit strings representing the results of arithmetic and/or logical operations performed using the posit bit string operands, may be stored in a memory resource, such as a memory array or other memory resource. Circuitry deployed in the memory device may access the memory resource to retrieve the unum bit string operands and/or the results of the arithmetic and/or logical operations performed using the unum bit string operands. For example, arithmetic operations and/or logical operations may be performed using a first unum bit string stored in the memory resource and a second unum bit string stored in the memory resource. The results of the arithmetic operations and/or logical operations may be stored in the memory resource and subsequently retrieved.

Computing systems may perform a wide range of operations that may include various calculations, which may require different accuracies. However, computing systems have a limited amount of memory in which to store operands upon which to perform computations. To facilitate performing operations on operands stored by a computing system within constraints imposed by limited memory resources, in some approaches, the operands may be stored in a particular format. For simplicity, one such format is referred to as a "floating point" format or "floating point number (float)" (e.g., IEEE 754 floating point format).

According to the floating point standard, a bit string (e.g., a bit string that can represent a number), such as a binary string, is represented in terms of three sets of integers or bits: one set of bits is referred to as a "base," one set of bits is referred to as an "exponent," and one set of bits is referred to as a "mantissa" (or significand). The sets of integers or bits define the format in which the binary string is stored, and for simplicity may be referred to herein as the "numerical format" or "format." For example, the three sets of bits described above (e.g., base, exponent, and mantissa) that define a floating-point bit string may be referred to as a format (e.g., a first format). As described in more detail below, a posit bit string can contain four sets of integers or bits (e.g., sign, regime, exponent, and mantissa), which can also be referred to as a "numerical format" or a "format" (e.g., a second format). In addition, according to the floating-point standard, two kinds of infinite values (e.g., +∞ and -∞) and/or two kinds of "not-a-number" (NaN) values (a quiet NaN and a signaling NaN) may be included in a bit string.
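To make the four posit fields concrete, the following Python function sketches standard posit decoding: negative encodings are two's-complemented, the regime is a run-length-encoded field terminated by its complement, truncated exponent bits are zero-padded on the right, and the fraction carries a hidden leading 1. This illustrates the posit format in general, not the circuitry disclosed here, and `decode_posit` is a hypothetical helper name.

```python
def decode_posit(p, n=8, es=2):
    """Decode an n-bit posit with es exponent bits into its four fields.

    value = (-1)**sign * (2**2**es)**k * 2**e * (1 + f)
    Returns None for the two special encodings (zero and NaR)."""
    mask = (1 << n) - 1
    p &= mask
    if p == 0 or p == 1 << (n - 1):
        return None                              # zero or NaR (not a real)
    sign = p >> (n - 1)
    if sign:
        p = (-p) & mask                          # negative posits are two's-complemented
    first = (p >> (n - 2)) & 1                   # leading regime bit
    run = 1
    while run < n - 1 and ((p >> (n - 2 - run)) & 1) == first:
        run += 1
    k = run - 1 if first else -run               # regime value
    nrem = max(n - 2 - run, 0)                   # bits left after regime + terminator
    e_bits = min(es, nrem)
    field = p & ((1 << nrem) - 1)                # exponent and fraction bits
    e = (field >> (nrem - e_bits)) << (es - e_bits)  # zero-pad a truncated exponent
    f_bits = nrem - e_bits
    f = (field & ((1 << f_bits) - 1)) / (1 << f_bits)
    return sign, k, e, f

# 0b01000000 encodes +1.0 for an 8-bit posit: sign 0, regime "10" (k = 0), e = 0, f = 0
print(decode_posit(0b01000000))   # (0, 0, 0, 0.0)
```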

The floating point standard has been used in computing systems for years and defines the arithmetic formats, interchange formats, rounding rules, operations, and exception handling used for computation by many computing systems. Arithmetic formats may include binary and/or decimal floating point data, which may include finite numbers, infinite values, and/or special NaN values. Interchange formats may include encodings (e.g., bit strings) that may be used to exchange floating point data. Rounding rules may include a set of properties that may be satisfied when rounding numbers during arithmetic operations and/or conversion operations. Floating point operations may include arithmetic operations and/or other computational operations, such as trigonometric functions. Exception handling may include indications of exceptional conditions, such as division by zero, overflow, and the like.

An alternative format to the floating point format is referred to as the "universal number" (unum) format. There are several forms of the unum format: Type I unums, Type II unums, and Type III unums, which may be referred to as "posits" and/or "valids." Type I unums are a superset of the IEEE 754 standard floating point format that use a "ubit" at the end of the fraction to indicate whether a real number is an exact float or lies in the interval between adjacent floats. The sign, exponent, and fraction bits in a Type I unum take their definitions from the IEEE 754 floating point format; however, the lengths of the exponent and fraction fields of Type I unums can vary dramatically, from a single bit to a maximum user-definable length. By taking the sign, exponent, and fraction bits from the IEEE 754 standard floating point format, Type I unums can behave similarly to floating point numbers; however, the variable bit length exhibited in the exponent and fraction bits of a Type I unum can require additional management in comparison to floats.

Type II unums are generally incompatible with floats, which permits a clean, mathematical design based on projective real numbers. A Type II unum can include n bits and can be described in terms of a "u-lattice" in which the quadrants of a circular projection are populated with an ordered set of 2^(n-3) - 1 real numbers. The values of a Type II unum can be reflected about an axis bisecting the circular projection, such that positive values lie in an upper right quadrant of the circular projection while their negative counterparts lie in an upper left quadrant of the circular projection. The lower half of the circular projection representing a Type II unum can include the reciprocals of the values located in the upper half of the circular projection. Type II unums generally rely on a look-up table (LUT) for most operations. For example, the size of the LUT can limit the efficacy of Type II unums in some circumstances. However, under some conditions, Type II unums can provide improved computational functionality in comparison with floats.

The type III unum format is referred to herein as the "posit format" or for simplicity as "posit". In contrast to floating point bit strings, posit may, under certain conditions, allow for wider dynamic range and higher accuracy (e.g., precision) than floating point numbers having the same bit width. This may allow operations performed by the computing system to be performed at a higher rate (e.g., faster) when using posit than when using floating point numbers, which in turn may improve the performance of the computing system by, for example, reducing the number of clock cycles used in performing operations, thereby reducing processing time and/or power consumed in performing such operations. Additionally, the use of posit in a computing system may enable higher accuracy and/or precision than floating point numbers, which may further improve the functionality of the computing system over some approaches (e.g., approaches that rely on floating point format bit strings).

The precision and accuracy of posits can be varied based on the total number of bits and/or the quantities of the sets of integers or bits included in a posit. In addition, posits can produce a wide dynamic range. Under certain conditions, the accuracy, precision, and/or dynamic range of a posit can be greater than the accuracy, precision, and/or dynamic range of a float or other numerical formats, as described in greater detail herein. The variable accuracy, precision, and/or dynamic range of a posit can be manipulated, for example, based on the application in which the posit is to be used. In addition, posits can reduce or eliminate the overflow, underflow, NaN, and/or other corner cases associated with floats and other numerical formats. Further, the use of posits can allow a numerical value (e.g., a number) to be represented using fewer bits than a float or other numerical formats.
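The dynamic range claim is easy to check numerically. In the posit format, useed = 2^(2^es), and the largest and smallest positive magnitudes of an n-bit posit are useed^(n-2) and useed^-(n-2). The sketch below (illustrative only; `posit_dynamic_range` is a hypothetical helper) computes these bounds: a 16-bit posit with es = 1 spans roughly 10^-8.4 to 10^8.4, whereas a 16-bit IEEE 754 half-precision float spans roughly 6e-8 to 6.5e4.

```python
from math import log10

def posit_dynamic_range(n, es):
    """Smallest and largest positive magnitudes of an n-bit posit with es exponent bits."""
    useed = 2 ** (2 ** es)
    maxpos = useed ** (n - 2)     # regime of all 1s after the sign bit
    minpos = useed ** -(n - 2)    # regime of all 0s after the sign bit
    return minpos, maxpos

lo, hi = posit_dynamic_range(16, 1)
print(f"minpos ~ 1e{log10(lo):.1f}, maxpos ~ 1e{log10(hi):.1f}")  # ~1e-8.4 .. 1e8.4
```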

In some embodiments, these features can allow posits to be highly reconfigurable, which can provide improved application performance in comparison to approaches that rely on floats or other numerical formats. In addition, these features of posits can provide improved performance in machine learning applications in comparison to floats or other numerical formats. For example, in machine learning applications, in which computational performance is critical, posits can be used to train a network (e.g., a neural network) with the same or greater accuracy and/or precision as floats or other numerical formats, but using fewer bits than floats or other numerical formats. In addition, inference operations in machine learning contexts can be achieved using posits with fewer bits (e.g., a smaller bit width) than floats or other numerical formats. By using fewer bits to achieve the same or enhanced outcomes in comparison to floats or other numerical formats, the use of posits can therefore reduce the amount of time required to perform operations and/or reduce the amount of memory space required in an application, which can improve the overall functionality of a computing system in which posits are employed.

Embodiments herein are directed to acceleration circuitry for posit operations. In some embodiments, the acceleration circuitry may be implemented on a Field Programmable Gate Array (FPGA). Control logic may be coupled to, or be a component of, the acceleration circuitry. A peripheral component interconnect express (PCIe) interface may be coupled to the acceleration circuitry to communicate with a host. For example, the PCIe interface may be used to communicate data (e.g., a data array) to a local memory coupled to the acceleration circuitry. The local memory may be internal to the memory device including the acceleration circuitry, external to the memory device including the acceleration circuitry, or both. Commands from the host may be communicated to the acceleration circuitry via the PCIe interface. For example, a command from the host may control processing of the data array (e.g., performance of an arithmetic operation, a logical operation, or both). Results from processing the data array may be written to the local memory. Commands from the host may be interpreted by a processing device coupled to the acceleration circuitry. The processing device may be configured to control the retrieval of operands, e.g., formatted in a unum or posit format, and the communication of results via a Direct Memory Access (DMA) controller. The processing device may be configured to control the processing of data by the acceleration circuitry. In some embodiments, the processing device may be a cache coherent reduced instruction set computing device, such as a RISC-V device.

Commands from the host may be placed in a queue (e.g., a command buffer as described further herein) of the processing device. Non-limiting examples of commands from the host to the processing device may include communicating blocks of data from the host's memory to the local memory, or vice versa, and processing the data in the local memory. The processing device may be configured to process data "in place".
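The command flow just described can be modeled in a few lines. The sketch below is a toy software model with hypothetical names throughout (`cmd_q`, `local_mem`, `processing_device_step`): host commands are queued to the processing device, which either copies a block from host memory into local memory or processes a local block "in place."

```python
from queue import Queue

cmd_q = Queue()     # stand-in for the processing device's command buffer
local_mem = {}      # stand-in for the local memory coupled to the acceleration circuitry

def processing_device_step(host_mem):
    """Pop one queued host command and carry it out."""
    kind, *args = cmd_q.get()
    if kind == "copy_in":                       # host memory -> local memory
        src, dst, n = args
        local_mem[dst] = host_mem[src][:n]
    elif kind == "process":                     # operate on local memory "in place"
        dst, op = args
        local_mem[dst] = [op(x) for x in local_mem[dst]]

host_mem = {"vec": [1.5, -2.0, 3.25]}
cmd_q.put(("copy_in", "vec", "buf0", 3))
cmd_q.put(("process", "buf0", abs))
processing_device_step(host_mem)
processing_device_step(host_mem)
print(local_mem["buf0"])   # [1.5, 2.0, 3.25]
```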

In some embodiments, a first operand formatted in a unum or posit format may be received by a first buffer residing on acceleration circuitry. As used herein, the term "resident on" refers to something that is physically located on a particular component. For example, a processing unit "resident on" a memory device refers to a situation in which the processing unit is physically located on the memory device. The term "resident on" may be used interchangeably herein with other terms such as "deployed on" or "located on." A second operand formatted in the unum or posit format may be received by a second buffer residing on the acceleration circuitry. An arithmetic operation, a logical operation, or both may be performed using the first operand and the second operand using processing circuitry residing on the acceleration circuitry. A result of the arithmetic operation, the logical operation, or both may be transmitted by a third buffer residing on the acceleration circuitry.

In some embodiments, the acceleration circuitry may include a DMA controller. The acceleration circuitry may include a first buffer coupleable to the DMA controller and configured to store a first operand formatted in a unum or posit format. The acceleration circuitry may include a second buffer coupleable to the DMA controller and configured to store a second operand formatted in the unum or posit format. The acceleration circuitry may include processing circuitry coupleable to the first buffer and the second buffer. The processing circuitry may be configured to perform an arithmetic operation, a logical operation, or both, using the first operand and the second operand. The acceleration circuitry may include a third buffer coupleable to the DMA controller and the processing circuitry. The third buffer may be configured to store results of the arithmetic operation, the logical operation, or both.
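Putting the pieces of the preceding two paragraphs together, the following toy Python model mimics the disclosed dataflow: a DMA stand-in fills two operand buffers, processing logic drains them through a selected operation, and results collect in a third buffer. The class, its buffer names, and the operation set are illustrative assumptions, not the disclosed implementation.

```python
from collections import deque
from operator import add, mul, and_, or_, xor

class AccelerationBlock:
    """Toy model: two operand buffers feed processing logic; results drain
    through a third buffer (cf. the first, second, and third buffers above)."""

    OPS = {"add": add, "mul": mul, "and": and_, "or": or_, "xor": xor}

    def __init__(self):
        self.buf_a = deque()     # first operand buffer
        self.buf_b = deque()     # second operand buffer
        self.buf_out = deque()   # result buffer

    def dma_in(self, a_words, b_words):
        """Stand-in for the DMA controller writing operands into the buffers."""
        self.buf_a.extend(a_words)
        self.buf_b.extend(b_words)

    def run(self, op):
        """Drain the operand buffers elementwise through the selected operation."""
        fn = self.OPS[op]
        while self.buf_a and self.buf_b:
            self.buf_out.append(fn(self.buf_a.popleft(), self.buf_b.popleft()))

    def dma_out(self):
        """Stand-in for the DMA controller reading results back out."""
        out, self.buf_out = list(self.buf_out), deque()
        return out

blk = AccelerationBlock()
blk.dma_in([1, 2, 3], [10, 20, 30])
blk.run("add")
print(blk.dma_out())   # [11, 22, 33]
```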

In the following detailed description of the present disclosure, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration ways in which one or more embodiments of the disclosure may be practiced. These embodiments are described in sufficient detail to enable those of ordinary skill in the art to practice the embodiments of this disclosure, and it is to be understood that other embodiments may be utilized and that process, electrical, and structural changes may be made without departing from the scope of the present disclosure.

As used herein, designators such as "X," "Y," "N," "J," "R," etc., particularly with respect to reference numerals in the drawings, indicate that a number of the particular feature so designated can be included. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms "a," "an," and "the" can include both singular and plural referents, unless the context clearly dictates otherwise. In addition, "a number of," "at least one," and "one or more" (e.g., a number of memory banks) can refer to one or more memory banks, whereas a "plurality of" is intended to refer to more than one of such things. Furthermore, the words "can" and "may" are used throughout this application in a permissive sense (i.e., having the potential to, being able to), not in a mandatory sense (i.e., must).

The term "comprising" and its derivatives mean "including but not limited to". Depending on the context, the term "coupled" means physically connecting or accessing and moving (transferring) commands and/or data, directly or indirectly. Depending on the context, the terms "bit string," "data," and "data value" are used interchangeably herein and may have the same meaning. The term "resides on" refers to a condition where one or more components are physically disposed on one or more different components. Further, the terms "data," "plurality of bits," and "bit string" are used interchangeably herein, and the term "number" may refer to a bit string or a plurality of bits, depending on the context.

The drawings herein follow a numbering convention in which the first one or more digits correspond to the drawing number and the remaining digits identify an element or component in the drawing. Similar elements or components between different figures may be identified by the use of similar digits. For example, 120 may refer to element "20" in fig. 1, and a similar element may be labeled 220 in fig. 2. A group or plurality of similar elements or components may be generally referred to herein by a single element symbol. For example, the plurality of reference elements 433-1, 433-2, …, 433-N may be collectively referred to as 433. As will be appreciated, elements shown in the various embodiments herein can be added, exchanged, and/or eliminated so as to provide a number of additional embodiments of the present disclosure. Additionally, the proportion and/or the relative scale of the elements provided in the figures are intended to illustrate certain embodiments of the present disclosure, and should not be taken in a limiting sense.

Fig. 1 is a functional block diagram in the form of a computing system including an apparatus 100 including a host 102 and a memory device 104, according to several embodiments of the present disclosure. As used herein, an "apparatus" may refer to, but is not limited to, any of a variety of structures or combinations of structures, such as a circuit or circuitry, one or more dies, one or more modules, one or more devices, or one or more systems. The acceleration circuitry 120 may be similar to the acceleration circuitry 220 illustrated in fig. 2A and 2B. Similarly, the host 102 may be similar to the host 202 illustrated in fig. 2A and 2B, the memory device 104 may be similar to the memory device 204 illustrated in fig. 2A and 2B, and the memory array 130 may be similar to the memory array 230 illustrated in fig. 2A and 2B. Each of the components (e.g., the host 102, the acceleration circuitry 120, the logic circuitry 122, the memory resources 124, and/or the memory array 130, etc.) may be referred to herein individually as an "apparatus."

The host 102 may be communicatively coupled to the memory device 104 via one or more channels 103, 105. The channels 103, 105 may be interfaces or other physical connections that allow bit strings, data, and/or commands to be transferred between the host 102 and the memory device 104. For example, commands that cause the acceleration circuitry 120 to initiate an operation (e.g., an operation to convert a bit string in a floating-point format to a bit string in a posit format, followed by arithmetic and/or logical operations on the bit string in the posit format) may be communicated from the host via the channels 103, 105. It should be noted that, in some examples, the acceleration circuitry 120 may perform the operations in response to an initiation command transmitted from the host 102 via one or more of the channels 103, 105 in the absence of an intervening command from the host 102. That is, once the acceleration circuitry 120 has received a command from the host 102 to initiate performance of an operation, the operation may be performed by the acceleration circuitry 120 without additional commands from the host 102.

In some embodiments, bit strings (e.g., posit bit strings) may be generated and/or stored in the memory array 130 without involving the host 102. For example, the bit strings may be generated and/or stored in the memory array 130 without receiving multiple commands from the host 102. In other words, in some embodiments, the host 102 may send a single command to the memory device to request performance of an operation using one or more bit strings. In response to receiving the command requesting performance of the operation, the memory device 104 (e.g., the controller 110, the acceleration circuitry 120, or other components of the memory device 104) may perform the operation and/or retrieve a stored result of the operation in the absence of additional commands from the host 102. This may reduce traffic across the channels 103/105, which may increase performance of a computing device associated with the host 102 and/or the memory device 104.

As shown in fig. 1, the memory device 104 may include a register access component 106, a High Speed Interface (HSI) 108, a controller 110, one or more extended row address (XRA) components 112, main memory input/output (I/O) circuitry 114, Row Address Strobe (RAS)/Column Address Strobe (CAS) chain control circuitry 116, a RAS/CAS chain component 118, acceleration circuitry 120, and a memory array 130. As shown in fig. 1, the acceleration circuitry 120 is located in an area of the memory device 104 that is physically distinct from the memory array 130. That is, in some embodiments, the acceleration circuitry 120 is located in a peripheral location of the memory array 130.

The register access component 106 can facilitate data transfer and extraction from the host 102 to the memory device 104 and from the memory device 104 to the host 102. For example, the register access component 106 can store an address (or facilitate a lookup of an address), such as a memory address, corresponding to a bit string to be transferred from the memory device 104 to the host 102 or from the host 102 to the memory device 104. In some embodiments, the register access component 106 can facilitate transferring and extracting bit strings to be operated on by the acceleration circuitry 120, and/or the register access component 106 can facilitate transferring and extracting bit strings that have been operated on by the acceleration circuitry 120 for transfer to the host 102.

The HSI 108 may provide an interface between the host 102 and the memory device 104 for bit strings, commands, and/or data traversing the channel 105. The HSI 108 may be a Double Data Rate (DDR) interface, such as a DDR3, DDR4, or DDR5 interface. However, embodiments are not limited to DDR interfaces, and the HSI 108 may be a Quad Data Rate (QDR) interface, a Peripheral Component Interconnect (PCI) interface (e.g., a peripheral component interconnect express (PCIe) interface), or another suitable interface for transferring bit strings, commands, and/or data between the host 102 and the memory device 104.

The controller 110 may be responsible for executing instructions from the host 102 and accessing the acceleration circuitry 120 and/or the memory array 130. The controller 110 may be a state machine, a sequencer, or some other type of controller. The controller 110 may receive commands from the host 102 (e.g., via the HSI 108) and, based on the received commands, control operation of the acceleration circuitry 120 and/or the memory array 130. In some embodiments, the controller 110 may receive a command from the host 102 to cause operations to be performed using the acceleration circuitry 120. In response to receiving such a command, the controller 110 may instruct the acceleration circuitry 120 to begin performing the operations.

In a non-limiting example, the controller 110 may instruct the acceleration circuitry 120 to perform operations to retrieve one or more bit strings stored in the memory array 130 and/or a resultant bit string stored in the memory array 130 that represents a result of an operation performed between the one or more bit strings. For example, the controller may receive a command from the host 102 requesting performance of an operation between one or more bit strings and send the command to the acceleration circuitry 120 to perform the operation. The acceleration circuitry 120 (e.g., a state machine residing on the acceleration circuitry 120, such as the state machine 223 illustrated in fig. 2B herein) may determine whether the result of the requested operation is stored in the memory array 130, determine an address in the memory array 130 at which the result of the requested operation is stored, and/or retrieve the result of the requested operation from the memory array 130. The acceleration circuitry 120 and/or the controller 110 may then cause the result of the requested operation to be communicated to the host 102 or to other circuitry external to the memory array 130.

In some embodiments, the controller 110 may be a global processing controller and may provide power management functions to the memory device 104. Power management functions may include control of power consumed by the memory device 104 and/or the memory array 130. For example, the controller 110 may control power provided to various banks of the memory array 130 to control which banks of the memory array 130 are operational at different times during operation of the memory device 104. This may include shutting off certain banks of the memory array 130 while power is provided to other banks of the memory array 130 to optimize power consumption of the memory device 104. In some embodiments, controlling power consumption of the memory device 104 by the controller 110 may include controlling power provided to various cores of the memory device 104, such as to the acceleration circuitry 120, and the like.

The XRA components 112 may provide additional functionality (e.g., peripheral amplifiers) that senses (e.g., reads, stores, caches) data values of memory cells in the memory array 130 and that is distinct from the memory array 130. The XRA components 112 may include latches and/or registers. For example, additional latches may be included in an XRA component 112. The latches of the XRA components 112 may be located on a periphery of the memory array 130 of the memory device 104 (e.g., on a periphery of one or more banks of memory cells).

Main memory input/output (I/O) circuitry 114 may facilitate transfer of bit strings, data, and/or commands to and from the memory array 130. For example, the main memory I/O circuitry 114 may facilitate transfer of bit strings, data, and/or commands from the host 102 and/or the acceleration circuitry 120 to and from the memory array 130. In some embodiments, the main memory I/O circuitry 114 may include one or more Direct Memory Access (DMA) components that may transfer bit strings, blocks of data, etc. from the acceleration circuitry 120 to the memory array 130 (and vice versa).

In some embodiments, the main memory I/O circuitry 114 may facilitate transfer of one or more bit strings from the memory array 130 to the acceleration circuitry 120 so that the acceleration circuitry 120 may perform operations on the bit strings. Similarly, the main memory I/O circuitry 114 may facilitate transfer of bit strings that have been subjected to one or more operations by the acceleration circuitry 120 to the memory array 130. As described in greater detail herein, the operations may include operations to convert a string of bits formatted according to a floating point standard to a string of bits formatted as posit (and vice versa), arithmetic operations performed on a string of bits formatted as posit, logical operations performed on a string of bits formatted as posit, and the like.

As described above, a posit bit string (e.g., data) can be stored and/or retrieved from the memory array 130. In some embodiments, the main memory I/O circuitry 114 may facilitate storage and/or retrieval of posit bit strings to and/or from the memory array 130. For example, the main memory I/O circuitry 114 may be enabled to transfer the posit bit string to the memory array 130 to be stored, and/or the main memory I/O circuitry 114 may facilitate retrieval of a posit bit string (e.g., a posit bit string representing an executed operation between one or more posit bit string operands) from the memory array 130 in response to a command, for example, from the controller 110 and/or the acceleration circuitry 120.

Row Address Strobe (RAS)/Column Address Strobe (CAS) chain control circuitry 116 and the RAS/CAS chain component 118 may be used in conjunction with the memory array 130 to latch a row address and/or a column address to initiate a memory cycle. In some embodiments, the RAS/CAS chain control circuitry 116 and/or the RAS/CAS chain component 118 may resolve row addresses and/or column addresses of the memory array 130 at which read and write operations associated with the memory array 130 are to be initiated or terminated. For example, upon completion of an operation using the acceleration circuitry 120, the RAS/CAS chain control circuitry 116 and/or the RAS/CAS chain component 118 may latch and/or resolve a particular location in the memory array 130 at which a bit string that has been operated on by the acceleration circuitry 120 is to be stored. Similarly, the RAS/CAS chain control circuitry 116 and/or the RAS/CAS chain component 118 may latch and/or resolve a particular location in the memory array 130 from which a bit string is to be transferred to the acceleration circuitry 120 before the acceleration circuitry 120 performs an operation on the bit string.

In some embodiments, the RAS/CAS chain control circuitry 116 and the RAS/CAS chain component 118 may be used in conjunction with the memory array 130 to latch a row address and/or a column address to initiate an operation to retrieve one or more bit strings from the memory array 130. For example, in response to the acceleration circuitry 120 (e.g., the state machine 223 illustrated in fig. 2B and residing on the acceleration circuitry 220) requesting a bit string stored at a particular address of the memory array 130, the RAS/CAS chain control circuitry 116 and the RAS/CAS chain component 118 may latch a row address and/or a column address corresponding to the bit string to facilitate retrieval of the requested bit string from the memory array 130.

As described in more detail in connection with fig. 2A and 2B and below in connection with fig. 5, the acceleration circuitry 120 may be configured to receive a bit string having a first format (e.g., a bit string in a floating-point format), convert the bit string to a second format (e.g., convert the bit string to a universal number format such as a posit format), and/or cause operations, such as arithmetic and/or logical operations, to be performed using the bit string having the second format. In some embodiments, the acceleration circuitry 120 may cause a bit string having the second format and/or a bit string that includes the result of an arithmetic and/or logical operation to be stored in a data structure (e.g., the data structure 634 illustrated in fig. 6 herein) of the memory array 130.

The acceleration circuitry 120 may include logic circuitry (e.g., the logic circuitry 222 illustrated in fig. 2A and 2B) and memory resources (e.g., the memory resources 224 illustrated in fig. 2A and 2B). Bit strings may be received by the acceleration circuitry 120 from, for example, the host 102 and/or the memory array 130, and stored by the acceleration circuitry 120 in, for example, the memory resources of the acceleration circuitry 120. The acceleration circuitry (e.g., the logic circuitry of the acceleration circuitry 120) may perform operations on the bit strings (or cause such operations to be performed), as described in more detail herein in connection with fig. 5.

As described in more detail in connection with fig. 3 and 4A-4B, posits may provide improved accuracy and may require less storage space (e.g., may contain a smaller number of bits) than corresponding bit strings represented in a floating point format. Thus, by using the acceleration circuitry 120 to convert floating-point bit strings to posit bit strings, performance of the memory device 104 may be improved over approaches that utilize only floating-point bit strings, because operations may be performed more quickly on the posit bit strings (e.g., because data in the posit format is smaller and thus requires less time and/or processing resources to operate on), and because less memory space is needed in the memory device 104 to store the bit strings in the posit format, which may free up additional space in the memory device 104 for other bit strings, data, and/or other operations to be performed. Additionally, by storing and subsequently retrieving bit strings in a data structure of the memory array 130, the bit strings may subsequently be quickly accessed for other operations and/or used by, for example, an application running on the host 102 in a manner that is more efficient than approaches that require arithmetic and/or logical operations to be performed each time the result of such an operation is requested. Furthermore, fewer clock cycles may be required to perform arithmetic and/or logical operations than in some approaches, because results of arithmetic and/or logical operations between bit string operands may be retrieved from the memory array 130, in contrast to approaches that perform operations between bit string operands whenever the results of such operations are requested.

The acceleration circuitry 120 may perform (or cause to be performed) arithmetic and/or logical operations on the resulting posit bit strings. Arithmetic operations may include addition, subtraction, multiplication, division, fused multiply-add, multiply-accumulate, dot product, greater than or less than, absolute value (e.g., FABS()), fast Fourier transforms, inverse fast Fourier transforms, sigmoid functions, convolution, square root, exponent, and/or rectified linear unit (ReLU) operations, as well as trigonometric operations such as sine, cosine, tangent, etc., while logical operations may include AND, OR, XOR, NOT, and the like. As will be appreciated, the foregoing list of operations is not intended to be exhaustive, nor is the foregoing list of operations intended to be limiting, and the acceleration circuitry 120 may be configured to perform (or cause to be performed) other arithmetic and/or logical operations. As described herein, operands used in such operations (e.g., bit string operands) and/or results of such operations may be stored in the memory array 130 and may subsequently be retrieved from the memory array 130 in response to a command from the acceleration circuitry 120.
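Among the listed operations, multiply-accumulate and dot product benefit most visibly from deferred rounding. The sketch below illustrates the idea with an exact rational accumulator, in the spirit of the "quire" accumulation often paired with posit MAC units (the disclosure itself does not name a quire): products are summed without intermediate rounding and rounded once at the end, which rescues a cancellation case that plain float accumulation loses.

```python
from fractions import Fraction

def dot_exact(xs, ys):
    """Dot product with an exact accumulator and a single final rounding."""
    acc = Fraction(0)
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)   # no intermediate rounding
    return float(acc)                      # round once at the end

xs = [1e16, 1.0, -1e16]
ys = [1.0, 1.0, 1.0]
print(sum(x * y for x, y in zip(xs, ys)))  # 0.0: the 1.0 is absorbed, then cancelled
print(dot_exact(xs, ys))                   # 1.0: deferred rounding keeps it
```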

In some embodiments, the acceleration circuitry 120 may perform the operations listed above in conjunction with the execution of one or more machine learning algorithms. For example, the acceleration circuitry 120 may perform operations related to one or more neural networks. A neural network may allow an algorithm to be trained over time to determine an output response based on input signals. This may be advantageous in machine learning applications, because the neural network may be trained over time with new data to improve the likelihood of accomplishing a particular goal, and thereby improve the performance of a particular task.

However, in some approaches, machine learning (e.g., neural network training) may be processing intensive (e.g., may consume large amounts of computer processing resources) and/or time intensive (e.g., may require lengthy computations that consume many clock cycles). In contrast, by performing such operations using the acceleration circuitry 120, for example on bit strings that have been converted by the acceleration circuitry 120 to the posit format, the amount of processing resources and/or the amount of time consumed may be reduced in comparison to approaches that perform such operations using bit strings in a floating-point format. Additionally, by storing the result of an operation on a bit string that has been converted into the posit format in the memory array 130 for subsequent retrieval, the processing resources and/or time consumed by subsequent requests for the result can be reduced even further, since the operation has already been performed.

The memory array 130 may be, for example, a DRAM array, an SRAM array, an STT RAM array, a PCRAM array, a TRAM array, an RRAM array, a NAND flash array, and/or a NOR flash array, although embodiments are not limited to these particular examples. The memory array 130 may serve as the main memory for the computing system 100 shown in FIG. 1. In some embodiments, the memory array 130 may be configured to store bit strings (e.g., resultant bit strings) operated on by the acceleration circuitry 120 and/or to store bit strings to be communicated to the acceleration circuitry 120.

In some embodiments, the memory array 130 is configured to store a plurality of bit strings to be used as operands in the performance of an arithmetic operation, a logical operation, or both. The controller 110 coupled to the memory array 130 may be configured to receive a request to perform an arithmetic operation, a logical operation, or both, using one or more bit strings. The request may be received from circuitry external to the memory device 104, such as the host 102. The controller 110 may be configured to determine a bit size of the one or more bit strings to be used in the operation and, in response to determining that the operation requires the bit size of the one or more bit strings to be less than a predetermined bit size, cause the operation to be performed using bit strings of the plurality of bit strings stored in the memory array 130.

In some embodiments, the controller 110 coupled to the memory array 130 and the acceleration circuitry 120 may be configured to send a command to the acceleration circuitry to cause the acceleration circuitry to perform the arithmetic operation or the logical operation, or both, in response to determining that the execution of the arithmetic operation or the logical operation, or both, requires the bit size of the one or more bit strings to be less than the predetermined bit size. In some embodiments, the predetermined bit size may be 16 bits or less, however, embodiments are not limited to a particular bit size.
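For illustration only, this dispatch decision can be sketched in software. The sketch below assumes a hypothetical controller interface (the send_command method is not part of this disclosure), and the strict "less than" comparison follows the wording of the text:

    PREDETERMINED_BIT_SIZE = 16  # example threshold from the text; the actual value may differ

    def maybe_accelerate(controller, operation: str, bit_strings) -> None:
        """Send the command to the acceleration circuitry only when every operand's
        bit size is less than the predetermined bit size; the behavior in the
        other case is not specified in the text, so this sketch simply declines."""
        if all(len(bs) < PREDETERMINED_BIT_SIZE for bs in bit_strings):
            controller.send_command(operation, bit_strings)  # hypothetical interface
        else:
            raise ValueError("operand bit size exceeds the predetermined bit size")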

As described in more detail herein, the acceleration circuitry 120 may be configured to convert a bit string representing the result of an arithmetic operation, a logical operation, or both, into a format different from the format of the plurality of bit strings stored in the memory array 130. For example, the plurality of bit strings to be used as operands in performing the arithmetic operation, the logical operation, or both, may be stored in a data structure in the memory array 130.

Fig. 2A is a functional block diagram in the form of a computing system 200 including an apparatus including a host 202 and a memory device 204, according to several embodiments of the present disclosure. Memory device 204 may include one or more memory modules (e.g., single inline memory modules, dual inline memory modules, etc.). The memory device 204 may include volatile memory and/or non-volatile memory. In a number of embodiments, the memory device 204 may comprise a multi-chip device. A multi-chip device may include several different memory types and/or memory modules. For example, the memory system may include non-volatile or volatile memory on any type of module. Further, each of the components (e.g., host 202, acceleration circuitry 220, logic circuitry 222, memory resources 224, and/or memory array 230) may be referred to herein individually as an "apparatus". Although not explicitly shown in fig. 2A, memory device 204 may be coupled to one or more media devices, such as solid state drives, flash media devices, and the like.

Memory device 204 may provide a main memory for computing system 200 or may be used as additional memory or storage throughout computing system 200. The memory device 204 may include one or more memory arrays 230 (e.g., an array of memory cells), which may include volatile and/or nonvolatile memory cells. For example, the memory array 230 may be a flash array having a NAND architecture. Embodiments are not limited to a particular type of memory device. For example, memory device 204 may include RAM, ROM, DRAM, SDRAM, PCRAM, RRAM, flash memory, and the like.

In embodiments in which the memory device 204 includes non-volatile memory, the memory device 204 may include a flash memory device, such as a NAND or NOR flash memory device. Embodiments are not so limited, however, and the memory device 204 may include other non-volatile memory devices, such as non-volatile random-access memory devices (e.g., NVRAM, ReRAM, FeRAM, MRAM, PCM), "emerging" memory devices such as variable resistance memory devices (e.g., 3-D Crosspoint (3D XP) memory devices), memory devices including arrays of self-selecting memory (SSM) cells, and the like, or combinations thereof. A variable resistance memory device may perform bit storage based on a change of bulk resistance, in conjunction with a stackable cross-gridded data access array. Additionally, in contrast to many flash-based memories, variable resistance non-volatile memory may perform a write-in-place operation, in which a non-volatile memory cell may be programmed without the non-volatile memory cell being previously erased. In contrast to flash-based memories and variable resistance memories, self-selecting memory cells may include memory cells having a single chalcogenide material that serves as both the switch and the storage element for the memory cell.

As illustrated in fig. 2A, a host 202 may be coupled to a memory device 204. In a number of embodiments, memory device 204 can be coupled to host 202 via one or more channels (e.g., channel 203). In FIG. 2A, the memory device 204 is coupled to the host 202 via a channel 203, and the acceleration circuitry 220 of the memory device 204 is coupled to the memory array 230 via a channel 207. Host 202 may be a host system such as a personal laptop, desktop computer, digital camera, smart phone, memory card reader, and/or internet-of-things enabled device, as well as various other types of hosts, and may include a memory access device such as a processor (or processing device). One of ordinary skill in the art will appreciate that a "processor" may be one or more processors, such as a parallel processing system, a plurality of coprocessors, and the like.

Host 202 may include a system motherboard and/or backplane, and may include a number of processing resources (e.g., one or more processors, microprocessors, or some other type of control circuitry). The system 200 may include separate integrated circuits, or the host 202, memory device 204, and memory array 230 may be on the same integrated circuit. For example, system 200 may be a server system and/or a High Performance Computing (HPC) system and/or a portion thereof. Although the example shown in fig. 2A illustrates a system having a Von Neumann architecture, embodiments of the present disclosure may be implemented in a non-Von Neumann architecture that may not include one or more components (e.g., a Central Processing Unit (CPU), an Arithmetic Logic Unit (ALU), etc.) typically associated with a Von Neumann architecture.

The memory device 204 may include acceleration circuitry 220, which may include logic circuitry 222 and memory resources 224. The logic circuitry 222 may be provided in the form of an integrated circuit, such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a system on a chip, or another combination of hardware and/or circuitry configured to perform the operations described in greater detail herein. For example, the logic circuitry 222 may be configured to receive one or more bit strings in a first format (e.g., a plurality of bits in a floating-point format), convert the one or more bit strings to a second format (e.g., encode the plurality of bits in a posit format), and/or cause operations (e.g., arithmetic and/or logical operations) to be performed using the one or more bit strings having the second format. As used herein, a bit string in the second format (e.g., a bit string in the posit format) includes at least one bit referred to as a "sign," a set of bits referred to as a "regime," a set of bits referred to as an "exponent," and a set of bits referred to as a "mantissa" (or significand). Examples of the sign, regime, exponent, and mantissa bit sets are described in more detail herein in connection with FIGS. 3 and 4A-4B.

The operations may include conversion operations to convert a floating-point bit string (e.g., a floating-point number) to a bit string in the posit format, and vice versa. Once the floating-point bit string is converted to the posit-formatted bit string, the logic circuitry 222 can be configured to perform (or cause to be performed) arithmetic operations and/or logical operations using the posit bit string. The arithmetic operations may include addition, subtraction, multiplication, division, fused multiply-add, multiply-accumulate, dot product, greater-than or less-than comparison, absolute value (e.g., FABS()), fast Fourier transform, inverse fast Fourier transform, sigmoid function, convolution, square root, exponentiation, rectified linear unit (ReLU), and trigonometric operations such as sine, cosine, and tangent, while the logical operations may include logical operations such as AND, OR, XOR, NOT, and the like. As will be appreciated, the foregoing list of operations is neither exhaustive nor limiting, and the logic circuitry 222 may be configured to perform (or cause to be performed) other arithmetic and/or logical operations.

The logic circuitry 222 may include an ALU. The ALU may include circuitry (e.g., hardware, logic, one or more processing devices, etc.) to perform operations (e.g., arithmetic operations, logical operations, bitwise operations, etc.) such as those described above on an integer binary bit string (e.g., a bit string of a posit format).

The acceleration circuitry 220 may further include a memory resource 224 that may be communicatively coupled to the logic circuitry 222. The memory resources 224 may include volatile memory resources, non-volatile memory resources, or a combination of volatile and non-volatile memory resources. In some embodiments, the memory resource may be Random Access Memory (RAM), such as Static Random Access Memory (SRAM). However, embodiments are not so limited, and the memory resources may be cache, one or more registers, NVRAM, ReRAM, FeRAM, MRAM, PCM, "emerging" memory devices such as the variable resistance memory devices described above, or a combination thereof.

The acceleration circuitry 220 may be communicatively coupled to the memory array 230 via one or more channels 207. The memory array 230 may be, for example, a DRAM array, an SRAM array, an STT RAM array, a PCRAM array, a TRAM array, an RRAM array, a NAND flash array, and/or a NOR flash array. The array 230 can include memory cells arranged in rows coupled by access lines (which may be referred to herein as word lines or select lines) and columns coupled by sense lines (which may be referred to herein as data lines or digit lines). Although a single array 230 is shown in FIG. 2A, embodiments are not so limited. For example, the memory device 204 may include a plurality of memory arrays 230 (e.g., a plurality of banks of DRAM cells, NAND flash cells, etc.).

The embodiment of fig. 2A may include additional circuitry not illustrated to avoid obscuring embodiments of the present disclosure. For example, the memory device 204 may include address circuitry to latch address signals provided over I/O connections through I/O circuitry. Address signals may be received and decoded by a row decoder and a column decoder to access the memory device 204 and/or the memory array 230. Those skilled in the art will appreciate that the number of address input connections may depend on the density and architecture of the memory device 204 and/or the memory array 230.

Fig. 2B is another functional block diagram in the form of a computing system including an apparatus 200 including a host 202 and a memory device 204, according to several embodiments of the present disclosure. As shown in FIG. 2B, apparatus 200 may include a host 202 that may be communicatively coupled to a memory device 204 via a communication path 203. The memory device 204 may include acceleration circuitry 220, which may include a state machine 223 and optionally memory resources 224. The acceleration circuitry 220 may be communicatively coupled to a memory array 230 via a communication path 207. Each of the components (e.g., host 202, acceleration circuitry 220, state machine 223, memory resources 224, and/or memory array 230) may be individually referred to herein as an "apparatus".

In the embodiment shown in fig. 2B, where the logic circuitry 222 comprises a state machine 223, the state machine 223 may be configured to execute a specified set of instructions to, for example, write, read, copy, and/or erase a bit string (e.g., data) stored in the memory array 230. For example, as described in more detail herein, the state machine 223 may execute instructions to read data from one or more rows and/or columns of the memory array 230 to retrieve data stored in the memory array 230. As described in more detail in connection with fig. 1, the data may include, among other things, one or more posit bit string operands and/or one or more operation (e.g., arithmetic and/or logical operation) results.

By utilizing the state machine 223 configured to execute a specified set of instructions to write posit bit strings to and/or retrieve posit bit strings from the memory array 230, improved performance of the memory device 204 may be realized in comparison to some approaches, because the time-consuming and/or computationally intensive process of performing operations between bit strings stored in the memory array 230 may be reduced by storing the results of such operations in the memory array 230 and retrieving those results directly from the memory array 230.

In some embodiments, the state machine 223 can determine an address in the memory array 230 at which a relevant posit bit string is stored. For example, the state machine 223 may determine a row and/or column address in the memory array 230 at which one or more posit bit string operands are stored, and/or a row and/or column address at which a resultant posit bit string, representing the performance of an arithmetic and/or logical operation between the one or more posit bit string operands, is stored. The state machine 223 may then send a command or request to retrieve the posit bit string stored at the address in the memory array 230, and/or cause the retrieved posit bit string to be transferred to the host 202, to a media device (e.g., a solid-state drive, a flash memory device, etc.) coupled to the memory device 204, or to other circuitry external to the memory array 230.

FIG. 3 is an example of an n-bit universal number, or "unum," with es exponent bits. In the example of FIG. 3, the n-bit unum is a posit bit string 331. As shown in FIG. 3, the n-bit posit 331 can include a sign bit (e.g., sign bit 333), a set of regime bits (e.g., regime bits 335), a set of exponent bits (e.g., exponent bits 337), and a set of mantissa bits (e.g., mantissa bits 339). The mantissa bits 339 may alternatively be referred to as a "fraction" or "fraction bits," and may represent the portion (e.g., the number) of the bit string that follows the decimal point.

The sign bit 333 may be zero (0) for positive numbers and one (1) for negative numbers. The regime bits 335 are described in connection with Table 1 below, which shows (binary) bit strings and their associated numerical values k. In Table 1, the numerical value k is determined by the run length of the bit string. The letter X in the binary portion of Table 1 indicates that the bit value is irrelevant to the determination of the regime, because the (binary) run terminates in response to a bit flip or upon reaching the end of the bit string. For example, in the (binary) bit string 0010, the run terminates when the zero flips to a one. Accordingly, the final zero is irrelevant to the regime, and all that is considered for the regime is the run of identical leading bits and the first opposite bit that terminates the run (if the bit string contains such a bit).

Binary           0000   0001   001X   01XX   10XX   110X   1110   1111
Numerical (k)     -4     -3     -2     -1      0      1      2      3

TABLE 1

In FIG. 3, the regime bits 335 labeled r correspond to the run of identical bits in the bit string, while the regime bit labeled r̄ corresponds to the opposite bit that terminates the run. For example, for the numerical value k = -2 shown in Table 1, the regime bits r correspond to the first two leading zeros, while the regime bit r̄ corresponds to the one. As noted above, the final bits represented by X in Table 1 are irrelevant to the regime.

If m corresponds to the number of identical bits in the run, then k = -m if those bits are zeros, and k = m - 1 if those bits are ones. This is illustrated in Table 1 where, for example, the (binary) bit string 10XX has a run of a single one, so k = m - 1 = 0. Similarly, the (binary) bit string 0001 contains a run of three zeros, so k = -m = -3. The regime indicates a scale factor of useed^k, where useed = 2^(2^es). Several example values of useed are shown in Table 2 below.

es        0      1         2          3            4
useed     2    2^2 = 4   4^2 = 16   16^2 = 256   256^2 = 65536

TABLE 2
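For illustration, the run-length rule and the useed scale factor can be modeled in a few lines of Python. This is a sketch of the arithmetic described above, not the disclosed circuitry:

    def regime_k(regime: str) -> int:
        """k from the regime run: k = -m for a run of m zeros, m - 1 for m ones."""
        run = regime[0]
        m = len(regime) - len(regime.lstrip(run))  # run length of identical leading bits
        return -m if run == "0" else m - 1

    def useed(es: int) -> int:
        """Scale factor useed = 2**(2**es), as tabulated in Table 2."""
        return 2 ** (2 ** es)

    assert regime_k("0001") == -3 and regime_k("10") == 0   # rows of Table 1
    assert useed(2) == 16 and useed(3) == 256               # rows of Table 2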

The exponent bits 337 correspond to an exponent e, interpreted as an unsigned number. In contrast to floating-point numbers, the exponent bits 337 described herein may have no bias associated with them. As a result, the exponent bits 337 may represent scaling by a factor of 2^e. As shown in FIG. 3, there may be up to es exponent bits (e1, e2, e3, ..., e_es), depending on how many bits remain to the right of the regime bits 335 of the n-bit posit 331. In some embodiments, this allows tapered accuracy for the n-bit posit 331, in which numbers whose magnitude is closer to one have greater accuracy than the largest or smallest numbers. Because very large or very small numbers may be used less frequently in certain kinds of operations, however, the tapered accuracy of the n-bit posit 331 shown in FIG. 3 may be desirable in a wide range of situations.

The mantissa bits 339 (or fraction bits) represent any additional bits of the n-bit posit 331 that lie to the right of the exponent bits 337. Similar to floating-point bit strings, the mantissa bits 339 represent a fraction f, which may be analogous to the fraction 1.f, where f includes one or more bits to the right of the decimal point following the one. In contrast to floating-point bit strings, however, in the n-bit posit 331 shown in FIG. 3 the "hidden bit" may always be one (e.g., 1.f), whereas floating-point bit strings may include subnormal numbers whose "hidden bit" is zero (e.g., 0.f).

FIG. 4A is an example of positive values for a 3-bit posit. In FIG. 4A, only the right half of the projective real numbers is shown; however, it should be appreciated that negative values corresponding to the positive values shown in FIG. 4A exist on a curve representing a reflection about the y-axis of the curve shown in FIG. 4A.

In the example of FIG. 4A, es = 2, and therefore useed = 2^(2^2) = 16. The accuracy of the posit 431-1 may be increased by appending bits to the bit string, as shown in FIG. 4B. For example, appending a bit with a value of one (1) to the bit string of the posit 431-1 increases the accuracy of the posit 431-1, as shown by the posit 431-2 in FIG. 4B. Similarly, appending a bit with a value of one to the bit string of the posit 431-2 in FIG. 4B increases the accuracy of the posit 431-2, as shown by the posit 431-3 in FIG. 4B. The following is an example of interpolation rules that may be used to append bits to the bit string of the posit 431-1 shown in FIG. 4A to obtain the posits 431-2, 431-3 illustrated in FIG. 4B.

If maxpos is the largest positive value of a bit string of the posits 431-1, 431-2, 431-3, and minpos is the smallest positive value of a bit string of the posits 431-1, 431-2, 431-3, maxpos may be equivalent to useed and minpos may be equivalent to 1/useed. Between maxpos and ±∞, the new bit value may be maxpos × useed, and between zero and minpos, the new bit value may be minpos / useed. These new bit values correspond to a new regime bit 335. Between existing values x = 2^m and y = 2^n, where m and n differ by more than one, the new bit value may be given by the geometric mean, sqrt(x·y) = 2^((m+n)/2), which corresponds to a new exponent bit 337. If the new bit value lies midway between the existing x and y values next to it, the new bit value may represent the arithmetic mean, (x + y)/2, which corresponds to a new mantissa bit 339.
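These interpolation rules can be expressed as a small Python sketch. It assumes x and y are existing neighboring values on the posit ring (with x < y) and is illustrative only:

    import math

    def new_bit_value(x: float, y: float, useed: float) -> float:
        """Value inserted between neighboring ring values x < y (illustrative)."""
        if math.isinf(y):              # between maxpos and +infinity:
            return x * useed           # new regime bit, value maxpos * useed
        if x == 0:                     # between zero and minpos:
            return y / useed           # new regime bit, value minpos / useed
        m, n = math.log2(x), math.log2(y)
        if abs(m - n) > 1:             # powers of two more than one apart:
            return math.sqrt(x * y)    # new exponent bit (geometric mean)
        return (x + y) / 2             # otherwise: new mantissa bit (arithmetic mean)

    # With useed = 16 as in FIG. 4A: between 16 and +inf -> 256; between 1 and 16 -> 4
    assert new_bit_value(16, math.inf, 16) == 256
    assert new_bit_value(1, 16, 16) == 4.0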

FIG. 4B is an example of posit construction using two exponent bits. In FIG. 4B, only the right half of the projective real numbers is shown; however, it should be appreciated that negative values corresponding to the positive values shown in FIG. 4B exist on a curve representing a reflection about the y-axis of the curve shown in FIG. 4B. The posits 431-1, 431-2, 431-3 shown in FIG. 4B each contain only two exception values: zero (0), when all bits of the bit string are zero, and ±∞, when the bit string is a one (1) followed by all zeros. It should be noted that the numerical values of the posits 431-1, 431-2, 431-3 shown in FIG. 4B are exactly useed^k. That is, the numerical values of the posits 431-1, 431-2, 431-3 shown in FIG. 4B are exactly useed raised to the power of the value k represented by the regime (e.g., the regime bits 335 described above in connection with FIG. 3). In FIG. 4B, the posit 431-1 has es = 2, so useed = 16; the posit 431-2 has es = 3, so useed = 256; and the posit 431-3 has es = 4, so useed = 65536.

As an illustrative example of adding bits to the 3-bit posit 431-1 to create the 4-bit posit 431-2 of FIG. 4B, the useed is 256, so the bit string corresponding to a useed of 256 has an additional regime bit appended to it, and the former useed, 16, has a terminating regime bit (r̄) appended to it. As described above, between existing values, the corresponding bit strings have an additional exponent bit appended to them. For example, the numerical values 1/16, 1/4, 1, and 4 will have an exponent bit appended to them. That is, the final one corresponding to the numerical value 4 is an exponent bit, the final zero corresponding to the numerical value 1 is an exponent bit, and so on. This pattern can be seen further in the posit 431-3, which is a 5-bit posit generated from the 4-bit posit 431-2 according to the rules above. If another bit were added to the posit 431-3 in FIG. 4B to generate a 6-bit posit, mantissa bits 339 would be appended to the numerical values between 1/16 and 16.

The following is a non-limiting example of decoding a posit (e.g., the posit 431) to obtain its numerical equivalent. In some embodiments, the bit string corresponding to a posit p is interpreted as an integer ranging from -2^(n-1) to 2^(n-1), k is an integer corresponding to the regime bits 335, and e is an unsigned integer corresponding to the exponent bits 337. If the set of mantissa bits 339 is represented as {f1, f2, ..., f_fs} and f is the value represented by 1.f1 f2 ... f_fs (e.g., by a one followed by a decimal point, followed by the mantissa bits 339), then the numerical value x of the posit p may be given by Equation 1 below:

x = 0,                               if p = 0
x = ±∞,                              if p = -2^(n-1)
x = sign(p) × useed^k × 2^e × f,     otherwise          (Equation 1)
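Equation 1 can be exercised with a short software model. The following Python sketch decodes a posit bit string accordingly; the two's-complement handling of negative posits and the zero-padding of truncated exponent bits follow the posit literature and are assumptions rather than text quoted from this disclosure:

    def decode_posit(bits: str, es: int) -> float:
        """Decode an n-bit posit bit string into its numerical value per Equation 1."""
        n = len(bits)
        p = int(bits, 2)
        if p == 0:
            return 0.0                            # all zeros encode zero
        if bits == "1" + "0" * (n - 1):
            return float("inf")                   # one followed by all zeros encodes +/-infinity
        if bits[0] == "1":                        # negative posit: decode the two's
            comp = (1 << n) - p                   # complement, then negate the result
            return -decode_posit(format(comp, f"0{n}b"), es)
        body = bits[1:]                           # regime, exponent, and mantissa fields
        run = body[0]
        m = len(body) - len(body.lstrip(run))     # regime run length
        k = -m if run == "0" else m - 1           # run-length rule of Table 1
        rest = body[m + 1:]                       # skip the terminating regime bit
        exp_bits = rest[:es].ljust(es, "0")       # truncated exponent bits assumed zero
        e = int(exp_bits, 2) if exp_bits else 0   # unsigned, unbiased exponent
        frac = rest[es:]
        f = 1.0 + (int(frac, 2) / (1 << len(frac)) if frac else 0.0)  # hidden bit is one
        useed = 2 ** (2 ** es)
        return useed ** k * 2 ** e * f            # Equation 1: useed^k * 2^e * f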

Another illustrative example of decoding posit bit strings is provided below in connection with posit bit string 0000110111011101 shown in table 3 below.

Sign    Regime    Exponent    Mantissa
0       0001      101         11011101

TABLE 3

In Table 3, the posit bit string 0000110111011101 is broken down into its constituent sets of bits (e.g., the sign bit 333, the regime bits 335, the exponent bits 337, and the mantissa bits 339). Because es = 3 in the posit bit string shown in Table 3 (e.g., because there are three exponent bits), useed = 256. Because the sign bit 333 is zero, the numerical value corresponding to the posit bit string shown in Table 3 is positive. The regime bits 335 have a run of three consecutive zeros, corresponding to a value of k = -3 (as described above in connection with Table 1). As a result, the scale factor contributed by the regime bits 335 is 256^(-3) (e.g., useed^k). The exponent bits 337 represent five (5) as an unsigned integer and therefore contribute an additional scale factor of 2^e = 2^5 = 32. Finally, the mantissa bits 339, given in Table 3 as 11011101, represent two hundred twenty-one (221) as an unsigned integer, so the fraction f described above is f = 1 + 221/256. Using these values and Equation 1, the numerical value corresponding to the posit bit string given in Table 3 is x = 256^(-3) × 2^5 × (1 + 221/256) = 477/2^27 ≈ 3.55393 × 10^(-6).
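Running the decoder sketched after Equation 1 on the bit string of Table 3 reproduces this value:

    x = decode_posit("0000110111011101", es=3)
    print(x)   # 3.553926944732666e-06, i.e., 477 / 2**27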

Fig. 5 is another functional block diagram in the form of a computing system including an apparatus 501 including a host 502 and a memory device 504, according to several embodiments of the present disclosure. The memory device 504 may include circuitry implemented on a Field Programmable Gate Array (FPGA). The components illustrated within dashed box 540 may be circuitry implemented on an FPGA. However, the embodiments of the present disclosure are not limited thereto.

The memory device 504 may include a processing device 554. In at least one embodiment, the processing device 554 may be a cache-coherent reduced instruction set computing (RISC) device. A non-limiting example of a cache-coherent reduced instruction set computing device is a 32-bit RISC-V device. The processing device 554 may be configured to receive commands from the host 502. The processing device 554 may be configured to provide a signal to the host 502 indicating completion and/or execution of a command from the host 502. The processing device 554 may also be configured to provide a signal indicative of an interrupt to the host 502.

The memory device 504 may include logic circuitry 522. The logic circuitry 522 may be similar to the logic circuitry 222 described in connection with fig. 2A. Logic circuitry 522 may be coupled to acceleration circuitry 520. However, the disclosed embodiments are not so limited. For example, logic circuitry 522 may be a component of acceleration circuitry 520 as illustrated in fig. 2A. The acceleration circuitry 520 may be referred to as a Posit Arithmetic Logic Unit (PALU).

The acceleration circuitry 520 may include a plurality of buffers configured to receive data values from the logic circuitry 522. A buffer configured to receive data values from the logic circuitry 522 may be referred to as an "input buffer" (e.g., the input buffer 545). Each input buffer may be configured to store operands formatted in a unum or posit format. The operands may be represented by bit strings.

As illustrated in FIG. 5, input buffer A545 may be configured to store "operand A" and input buffer B546 may be configured to store "operand B". Operand A and/or operand B may be formatted in a unum or posit format. Although fig. 5 illustrates two input buffers 545 and 546, embodiments of the present disclosure are not limited thereto. For example, the acceleration circuitry 520 may include a single input buffer or more than two input buffers. In at least one embodiment, the at least one input buffer may be a first-in-first-out (FIFO) buffer.

The acceleration circuitry 520 may include circuitry (e.g., processing circuitry) configured to perform at least one arithmetic operation, at least one logical operation, or both, on at least one operand stored by at least one input buffer. As illustrated in FIG. 5, the acceleration circuitry 520 may be configured to perform operations including, but not limited to, addition, subtraction, multiplication, division, fused multiply-add, multiply-accumulate, dot product, greater-than or less-than comparison, absolute value (e.g., FABS()), fast Fourier transform, inverse fast Fourier transform, sigmoid function, convolution, square root, exponentiation, trigonometric operations such as sine, cosine, and tangent, logical operations such as AND, OR, XOR, and NOT, and combinations thereof. In at least one embodiment, the acceleration circuitry 520 may include multiply-accumulate (MAC) circuitry 548.

The acceleration circuitry 520 may include at least one buffer configured to receive a data value resulting from the acceleration circuitry 520 performing at least one arithmetic operation, at least one logical operation, or both. A buffer configured to transmit a resultant data value may be referred to as an "output buffer" (e.g., the output buffer 547). The output buffer may be configured to store the resultant data value formatted in the unum or posit format. Although FIG. 5 illustrates a single output buffer 547, embodiments of the present disclosure are not so limited. For example, the acceleration circuitry 520 may include more than one output buffer. In at least one embodiment, the at least one output buffer may be a FIFO buffer.

The logic circuitry 522 may include a direct memory access (DMA) controller 542. The DMA controller 542 may be coupled to the input buffers 545 and 546. The DMA controller 542 can also be coupled to at least one memory resource. As illustrated in FIG. 5, the DMA controller 542 can be coupled to an interconnect 551, such as an Advanced eXtensible Interface (AXI) interconnect. The interconnect 551 may be coupled to a plurality of memory resources. At least one of the memory resources may be internal to the memory device 504, such as the scalable RAM 553 and the block RAM 555. As used herein, "scalable RAM" refers to a dual-port synchronous random-access memory unit deployed within an FPGA that can be coupled to multiple other dual-port synchronous random-access memory units to increase the number of storage locations available to the scalable RAM 553. The interconnect 551 may also be coupled to a memory resource 530 external to the memory device 504. The external memory resource 530 may be similar to the memory array 130 and the memory array 230 described in connection with FIG. 1 and FIGS. 2A-2B, respectively. The interconnect 551 may be coupled to the external memory resource 530 via a memory controller 550. The external memory resource 530 may be a main memory. Non-limiting examples of the external memory resource 530 include double data rate (DDR) memory, such as DDR3 memory or DDR4 memory.

In at least one embodiment, the acceleration circuitry 520 may include a memory controller, such as the DMA controller 542. The memory controller may be coupled to a memory that includes at least two distinct memory components. The memory controller may be configured to retrieve the first operand, the second operand, or both from a particular one of the at least two distinct memory components based at least in part on signaling received by the memory controller. For example, the DMA controller 542 can be configured to retrieve at least one operand from any of the memory resources coupled thereto (e.g., the scalable RAM 553, the block RAM 555, the external memory resource 530 (e.g., via the memory controller 550), and combinations thereof) into the input buffers 545 and 546. Additionally or alternatively to retrieving operands from a memory resource, the DMA controller 542 can be configured to retrieve operands from the host 502.

The DMA controller 542 can be configured to communicate (e.g., transfer) the resultant data value from the output buffer 547 to any of the memory resources coupled thereto, such as the scalable RAM 553, the block RAM 555, the external memory resource 530 (e.g., via the memory controller 550), and combinations thereof. Additionally or alternatively to communicating the resultant data value to a memory resource, the DMA controller 542 can be configured to communicate the resultant data value to the host 502.

Memory device 504 may be configured to communicate with host 502 via a peripheral component interconnect express (PCIe) interface 552.

In at least one embodiment, the logic circuitry 522 may include a microcode controller 544. Instead of using the acceleration circuitry 520 to perform an arithmetic or logical operation, the microcode controller 544 may be used to identify a precomputed result of the arithmetic or logical operation. The results of a plurality of arithmetic operations, a plurality of logical operations, or both may be precomputed using a plurality of operands formatted in a unum or posit format. The precomputed results may be compiled into a lookup table (LUT). The LUT may be stored by the microcode controller 544 or by other circuitry of the logic circuitry 522. The microcode controller 544 may receive an instruction from a processing device, such as the processing device 554 or a CPU of the host 502, to perform a particular operation of the plurality of arithmetic operations or the plurality of logical operations using a particular set of the plurality of operands. The microcode controller 544 may identify, via the LUT, one of the precomputed results corresponding to the particular operation and the particular set of the plurality of operands. An additional arithmetic or logical operation may then be performed using the identified result.
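A software model of such a LUT can be built by exhaustively precomputing one operation over all operand pairs. The sketch below is illustrative only: it reuses the decode_posit sketch that follows Equation 1 and stores Python floats where hardware would store posit-encoded bit strings:

    from itertools import product

    def build_multiply_lut(es: int = 0, nbits: int = 8) -> dict:
        """Precompute multiplication results for every pair of nbits-wide posits,
        keyed by the operation name and the two operand bit patterns."""
        lut = {}
        for a, b in product(range(1 << nbits), repeat=2):
            va = decode_posit(format(a, f"0{nbits}b"), es)
            vb = decode_posit(format(b, f"0{nbits}b"), es)
            lut[("mul", a, b)] = va * vb
        return lut

    # Identifying a result is then a lookup rather than a computation:
    # result = lut[("mul", operand_a, operand_b)]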

The memory device 504 may include multiple instances of the logic circuitry 522 and the acceleration circuitry 520, as represented by the overlapping representations of the logic circuitry 522 and the acceleration circuitry 520 in FIG. 5. The overlapping representation is not, however, intended to indicate a physical orientation and/or overlap of the multiple instances of the logic circuitry 522 and the acceleration circuitry 520.

Fig. 6 is another functional block diagram in the form of a computing system including an apparatus 607 including a host 602 and memory resources 624, according to several embodiments of the present disclosure. Memory resources 624 may be similar to scalable RAM 553, block RAM 555, and/or DDR3 memory 530 described in connection with fig. 5. The memory resources 624 may be configured to communicate with the host 602 via a PCIe interface 652.

A first buffer (e.g., data buffer 660) of the host 602 may be configured to store data values (e.g., data vectors). Data values stored by the data buffer 660 of the host 602 may be communicated (e.g., transferred) to a second buffer, such as the data buffer 662 of the memory resource 624. Data values stored by the data buffers 662 of the memory resources 624 may be communicated to the data buffers 660 of the host 602. A third buffer (command buffer 661) may be maintained by host 602. The command buffer 661 may be configured to store commands communicated from the CPU of the host 602. Commands stored by command buffer 661 of host 602 may be copied (mirrored) to a fourth buffer (command buffer 663) of memory resource 624. Commands stored by command buffer 663 of memory resource 624 may be copied to command buffer 661 of host 602.

The processing device 654, illustrated in FIG. 6 as a RISC-V device, may be configured to monitor the command buffer 663 of the memory resource 624. The processing device 654 may be configured to parse instructions stored by the command buffer 663 of the memory resource 624 into signals indicative of the performance of arithmetic operations, logical operations, or both, to be performed by the computing circuitry 665. The computing circuitry 665 may include processing circuitry configured to perform arithmetic operations, logical operations, or both. For example, in response to a command being mirrored from the command buffer 661 of the host 602 to the command buffer 663 of the memory resource 624, the processing device 654 may be configured to parse the mirrored command (e.g., a host-level command) into one or more machine-level instructions (e.g., finite state machine commands). The processing device 654 may be configured to communicate data indicative of completion of the host-level command to the command buffer 663 of the memory resource 624 in response to a state machine (e.g., the state machine 623) executing the machine-level instructions parsed from the host-level command. The data indicating completion of the host-level command may be copied from the command buffer 663 of the memory resource 624 to the command buffer 661 of the host 602.

The computing circuitry 665 may be configured to receive signals from the processing device 654, retrieve a particular one of the operands stored by the data buffer 662 of the memory resource 624 based at least in part on the received signals, and perform an arithmetic operation, a logical operation, or both using the retrieved operand based at least in part on the received signals. The processing device 654 may be configured to communicate the machine-level instructions to a buffer 664 of the computing circuitry 665. The buffer 664 may be a FIFO buffer. The buffer 664 may enable multiple machine-level instructions to be stored without waiting for completion of each individual machine-level instruction.

The computing circuitry 665 may include a state machine 623, e.g., a Finite State Machine (FSM). The state machine 623 may be similar to the state machine 223 described in connection with fig. 2B. The state machine 623 may retrieve machine-level instructions from the buffer 664. The state machine 623 may be configured to direct the execution of arithmetic operations, logical operations, or both, according to machine-level instructions.

The computing circuitry 665 may include acceleration circuitry 620 coupled to a state machine 623. The acceleration circuitry 620 may be similar to the acceleration circuitry 520 described in conjunction with fig. 5.

A non-limiting example of a host-level command is a general matrix multiply (GEMM) command. An implementation of the GEMM command may include the host 602 transferring bit strings representing the matrices from the data buffer 660 to the data buffer 662 of the memory resource 624. An implementation of the GEMM command may also include the host 602 communicating command data associated with the GEMM command (e.g., the addresses and sizes of the two matrices) to the command buffer 661. The command data may be copied from the command buffer 661 of the host 602 to the command buffer 663 of the memory resource 624. The processing device 654 may detect the command data stored by the command buffer 663 of the memory resource 624 and parse the GEMM command into machine-level instructions, such as a plurality of dot-product operations. The processing device 654 may parse each dot-product operation into a plurality of FSM commands. The FSM commands may include, but are not limited to, clearing the quire, performing a dot product, transferring the result to a buffer of the acceleration circuitry 620 (e.g., the output buffer 547 described in connection with FIG. 5), and writing the result from the buffer to a result location. In response to all of the FSM commands of all of the dot-product operations completing, the processing device 654 may communicate data indicating completion of the GEMM command to the command buffer 663 of the memory resource 624. The data indicating completion of the GEMM command may be copied from the command buffer 663 of the memory resource 624 to the command buffer 661 of the host 602.
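This parsing can be modeled as follows; FsmCommand, the command mnemonics, and the address arithmetic are hypothetical illustrations, not the device's actual instruction set:

    from dataclasses import dataclass

    @dataclass
    class FsmCommand:
        """Hypothetical machine-level (finite state machine) command."""
        op: str
        args: tuple = ()

    def parse_gemm(a_addr: int, b_addr: int, out_addr: int, m: int, n: int, k: int):
        """Parse a host-level GEMM command (C = A x B, A being m x k and B being
        k x n) into one dot-product FSM command sequence per element of C."""
        for i in range(m):
            for j in range(n):
                yield [
                    FsmCommand("CLEAR_QUIRE"),
                    # row i of A at unit stride; column j of B at stride n
                    FsmCommand("DOT_PRODUCT", (a_addr + i * k, b_addr + j, k)),
                    FsmCommand("TRANSFER_RESULT_TO_OUTPUT_BUFFER"),
                    FsmCommand("WRITE_RESULT", (out_addr + i * n + j,)),
                ]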

The PCIe interface 652, the memory resources 624, the processing device 654, and/or the computing circuitry 665 may be implemented on an FPGA.

Fig. 7 is another functional block diagram in the form of a computing system including an apparatus 709 including a host 702 and memory resources 724 in accordance with several embodiments of the present disclosure. Host 702, PCIe interface 752, memory resource 724, processing device 754, and computing circuitry 765 may be similar to host 602, PCIe interface 652, memory resource 624, processing device 654, and computing circuitry 665 described in connection with fig. 6.

In contrast to the apparatus 607 illustrated in FIG. 6, the apparatus 709 includes a plurality of computation circuits 766-0, 766-1, ..., 766-N. Similar to the computing circuitry 765, each of the computation circuits 766 may include a buffer 764 and a state machine 723 (e.g., an FSM). The buffer 764 may be configured to store machine-level instructions (e.g., FSM commands). In addition, each of the computation circuits 766 includes a quire register 767 coupled to the state machine 723. The quire register 767 may be configured to store results (e.g., intermediate results) of iterations of a recursive operation (e.g., an arithmetic or logical operation) performed by the MAC. In some embodiments, each of the computation circuits 766 may include a sense amplifier stripe, a string of flip-flops, or both, in addition to or instead of the state machine 723.

As illustrated in FIG. 7, the command buffer 761 of the host 702 and the command buffer 763 of the memory resource 724 each include storage (e.g., a command buffer) associated with each state machine of the computation circuits 766 and the computing circuitry 765. In other words, the command buffers 761 and 763 include storage for the state machine of each of the N+1 computation circuits 766 and for the state machine 723 of the computing circuitry 765.

As described in connection with FIG. 6, the processing device 754 may parse a host-level command associated with a matrix multiplication (e.g., a GEMM command) into a plurality of machine-level instructions associated with dot-product operations. The dot-product operations may be performed in parallel using the computation circuits 766. The machine-level instructions for the respective dot-product operations may be distributed among the computation circuits 766. The PCIe interface 752, the memory resource 724, the processing device 754, the computing circuitry 765, and/or the computation circuits 766 may be implemented on an FPGA. Including the quire register 767 in each of the computation circuits 766, rather than in the acceleration circuitry 720 of the computing circuitry 765, can reduce the amount of FPGA resources used to implement the computing circuitry 765. The implementation of the acceleration circuitry 520 of the apparatus 501 illustrated in FIG. 5 may utilize a greater amount of FPGA resources than the apparatus 709. Including the quire register 767 in each of the computation circuits 766 may increase the number of arithmetic and/or logical operations that can be performed in parallel while reducing the amount of FPGA resources required for implementation.

Fig. 8 is another functional block diagram in the form of a computing system including an apparatus 811 including a host 802 and memory resources 824 in accordance with several embodiments of the present disclosure. Host 802, PCIe interface 852, memory resources 824, processing device 854, computing circuitry 865, and computing circuitry 866 may be similar to host 702, PCIe interface 752, memory resources 724, processing device 754, computing circuitry 765, and computing circuitry 766 described in connection with fig. 7.

In contrast to the apparatus 709 illustrated in FIG. 7, the apparatus 811 includes acceleration circuitry 868 coupled to the PCIe interface 852, in addition to the acceleration circuitry 820 of the computing circuitry 865. The acceleration circuitry 868 may have the same structure as the acceleration circuitry 820 of the computing circuitry 865, or a different structure. The acceleration circuitry 868 may be directly coupled to the PCIe interface 852. The acceleration circuitry 868 may be mapped into a memory space associated with the PCIe interface 852 (e.g., the indirection registers 870). The indirection registers 870 may be configured to store scalar operands, the result of an arithmetic or logical operation, or both. An output buffer of the acceleration circuitry 868 (e.g., the output buffer 547 described in connection with FIG. 5) may be combined with a busy flag such that a read operation from the output buffer will not complete until the acceleration circuitry 868 has completed performance of the arithmetic or logical operation. A scalar multiplication operation performed using the apparatus 811 and the acceleration circuitry 868 may include writing a first scalar operand (operand A) to a first input buffer of the acceleration circuitry 868 (e.g., the input buffer 545), writing a second scalar operand (operand B) to a second input buffer of the acceleration circuitry 868 (e.g., the input buffer 546), and reading the result of the multiplication of the first and second scalar operands from a multiplication result register. Using the acceleration circuitry 868 to perform scalar operations may be more efficient than using the acceleration circuitry 820 of the computing circuitry 865.
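The scalar-multiplication sequence described above amounts to three register accesses. The sketch below assumes a hypothetical mmio interface and register offsets, since the actual register map is not given in the text:

    # Hypothetical register offsets within the PCIe-mapped space of the
    # acceleration circuitry 868; illustrative only.
    INPUT_BUFFER_A = 0x00
    INPUT_BUFFER_B = 0x08
    MUL_RESULT     = 0x10

    def scalar_multiply(mmio, operand_a: int, operand_b: int) -> int:
        """Write both posit-formatted operands, then read the result register.
        Per the text, the read does not complete while the busy flag is set."""
        mmio.write(INPUT_BUFFER_A, operand_a)  # first input buffer (operand A)
        mmio.write(INPUT_BUFFER_B, operand_b)  # second input buffer (operand B)
        return mmio.read(MUL_RESULT)           # blocks until the result is ready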

In at least one embodiment, the apparatus 811 may provide indirect access to operands stored by the memory resource 824. For example, if a combination of scalar operations and vector operations is to be performed, it may be more efficient to load vector data into the memory resource 824 and maintain storage of the vector data by the memory resource 824, so that the computation circuits 866 and the state machine 823 of the computing circuitry 865 can access the vector data without data having to be communicated to and from the host 802. If a scalar operation is to be performed on the vector data, or on results from a vector operation, the scalar operation can then be performed on data stored by the memory resource 824. An indirect scalar multiplication operation may include writing the address of a first operand (operand A) stored in the memory resource 824 to the indirection registers 870, writing the address of a second operand (operand B) stored in the memory resource 824 to the indirection registers 870, and writing the address at which the result of the scalar operation using the first and second operands is to be stored in the memory resource 824 to the indirection registers 870.

Fig. 9 is another functional block diagram in the form of a computing system including an apparatus 913 that includes a host 902 and memory resources 924 in accordance with several embodiments of the present disclosure. The host 902, PCIe interface 952, memory resources 924, computing circuitry 965, and computing circuitry 966 can be similar to the host 802, PCIe interface 852, memory resources 824, computing circuitry 865, and computing circuitry 866 described in connection with fig. 8.

In contrast to the device 811 illustrated in fig. 8, the device 913 does not include a processing apparatus coupled to the PCIe interface 952 and the memory resources 924. Buffers 964 of computing circuitry 965 may be mapped to memory space of PCIe interface 952. Mapping the buffer 964 to memory space of the PCIe interface 952 may enable a software driver of the host 902 to provide functionality provided by embedded software of a processing device (e.g., the processing device 854 described in connection with fig. 8). The software driver of the host 902 may operate at a higher throughput than the embedded software of the processing device.

In at least one embodiment, the host 902 (e.g., via a CPU of the host 902) may parse higher-level commands, such as GEMM commands, into lower-level commands (e.g., machine-level instructions), such as state machine commands associated with the dot-product operations of a GEMM command. The lower-level commands may be stored by the command buffer 961 of the host 902, which includes storage (e.g., a register) corresponding to each state machine 923 of the computation circuits 966 and the computing circuitry 965. A respective lower-level command may be communicated (e.g., transferred) from the respective storage of the command buffer 961 corresponding to the state machine of one of the computation circuits 966-0, 966-1, ..., 966-N, or of the computing circuitry 965, to the command buffer 964 of that computation circuit 966 or of the computing circuitry 965. For example, machine-level instructions may be transferred from the register of the command buffer 961 corresponding to the state machine 923 of the computation circuit 966-1 to the command buffer 964 of the computation circuit 966-1.

Fig. 10 is a functional block diagram in the form of acceleration circuitry 1020, according to several embodiments of the present disclosure. The acceleration circuitry 1020 may be similar to any of the acceleration circuitry 520, 620, 720, 820, 868, 920, and 968 described in connection with fig. 5-9. The acceleration circuitry 1020 may be implemented on an FPGA.

The acceleration circuitry 1020 may be capable of storing bursts of incoming data, because arithmetic operations may be performed using two or more operands, and the data values corresponding to both operands may need to be supplied to the acceleration circuitry 1020 during the same clock cycle. For example, if two operands are being retrieved from an external memory resource (e.g., the DDR3 memory 530 described in connection with FIG. 5) in 256-byte bursts, there is no guarantee that the data values will arrive at the correct time. In at least one embodiment, the acceleration circuitry 1020 may therefore include AXI stream FIFOs, such as the input buffers 1045 and 1046. The acceleration circuitry 1020 may also include an output buffer 1047 that may be configured to buffer data values representing results from the arithmetic block prior to communicating the data values from the acceleration circuitry 1020 to, for example, the DDR3 memory 530.

In at least one embodiment, the acceleration circuitry 1020 may read sixty-four bits of data from the input buffers and convert the data into a format compatible with the circuitry configured to perform the arithmetic operations. For example, in an 8-bit posit environment, the operands read from the input buffer 1045, the input buffer 1046, or both, may be converted by the circuitry 1072 into eight sequential inputs on the bus. The circuitry 1076 may include a multiplexer configured to select an output. The circuitry 1076 may include an ALU. The ALU may include circuitry (e.g., hardware, logic, one or more processing devices, etc.) that performs operations (e.g., arithmetic operations, logical operations, bitwise operations, etc.) on integer binary bit strings (e.g., bit strings in the posit format). The results of the operations performed by the circuitry 1076 on the eight sequential inputs may be packed into a 64-bit word, which is then written to the output buffer 1047.
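The packing and unpacking of eight 8-bit operands into a 64-bit word can be illustrated as follows (little-endian lane order is an assumption for this sketch):

    def unpack_word(word: int, lanes: int = 8, width: int = 8):
        """Split a 64-bit word into eight sequential 8-bit posit operands."""
        mask = (1 << width) - 1
        return [(word >> (i * width)) & mask for i in range(lanes)]

    def pack_word(values, width: int = 8) -> int:
        """Pack eight 8-bit results back into one 64-bit word."""
        word = 0
        for i, v in enumerate(values):
            word |= (v & ((1 << width) - 1)) << (i * width)
        return word

    assert pack_word(unpack_word(0x0123456789ABCDEF)) == 0x0123456789ABCDEF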

In at least one embodiment, the input buffer 1045, the input buffer 1046, the output buffer 1047, the circuitry 1072, and/or the circuitry 1076 may be implemented in a DMA block.

In at least one embodiment, the acceleration circuitry 1020 may include circuitry 1073 configured to convert data values formatted in a floating-point format to a unum or posit format. The acceleration circuitry 1020 may also include circuitry 1074 configured to convert data values formatted in the unum or posit format to a floating-point format.

FIG. 11 is another functional block diagram in the form of a memory device 1104 according to several embodiments of the present disclosure. The PCIe interface 1152, the memory resource 1124, and the computing circuitry 1165 can be similar to the PCIe interface 652, the memory resource 624, and the computing circuitry 665 described in connection with FIG. 6. FIG. 11 also illustrates components of a state machine 1123 of the computing circuitry 1165.

State machine 1123 may include multiple DMA controllers. Each respective DMA controller may be coupled to a respective buffer. Each buffer may be a FIFO buffer. As illustrated by fig. 11, the state machine 1123 may include a first set 1171 of DMA controllers and FIFO buffers configured to retrieve (e.g., from the memory resources 1124) and store a first operand (operand a (opa)). The state machine 1123 may include a second set 1173 of DMA controllers and FIFO buffers configured to retrieve (e.g., from the memory resources 1124) and store a second operand (operand B (OPB)). The state machine 1123 may include a third set 1175 of DMA controllers and FIFO buffers configured to retrieve (e.g., from the memory resources 1124) and store a third operand (operand C (OPC)). The FIFO buffers of sets 1171, 1173, and 1175 between the output of the respective DMA controller and the acceleration circuitry 1120 may enable an increase in the speed at which the acceleration circuitry 1120 performs operations.

The acceleration circuitry 1120 of the computing circuitry 1165 may retrieve the first operand, the second operand, the third operand, or a combination thereof from the respective FIFO buffers of the state machine 1123. Although FIG. 11 illustrates three sets 1171, 1173, and 1175 of DMA controllers and FIFO buffers for retrieving and storing operands, embodiments are not so limited. For example, the state machine 1123 may include fewer than three, or more than three, sets of DMA controllers and FIFO buffers.

The state machine 1123 may include a set 1177 of DMA controllers and buffers configured to retrieve, store, and communicate results from the acceleration circuitry 1120. The buffer of set 1177 may be a FIFO buffer.

The processing device 1158 may communicate commands, such as counts, addresses, sizes, strides, and the operations to be performed by the acceleration circuitry 1120, to the command buffer 1164. As illustrated in FIG. 11, the computing circuitry 1165 may include a memory 1178 configured to store data indicative of control signals, states, or both. The DMA controllers of the sets 1171, 1173, and 1175 may retrieve the respective operands based at least in part on commands from the processing device 1158 that include a start address, the amount of data to be retrieved (e.g., start address + N × stride × operand size), and an operation count. As used herein, a "stride" refers to the address increment between processed samples. For example, a stride of -1 may be used to read data from a buffer in reverse order. As another example, a stride of 2 may be used to read data from a buffer at address offsets of 0, 2, 4, 6, 8, and so on.
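The stride-based address generation described above can be illustrated directly; the following address generator is a sketch of the arithmetic described in the text:

    def dma_addresses(start: int, count: int, stride: int, operand_size: int):
        """Addresses visited by a DMA transfer: start + n * stride * operand_size."""
        return [start + n * stride * operand_size for n in range(count)]

    # A stride of 2 with 1-byte operands reads offsets 0, 2, 4, 6, 8, ...
    assert dma_addresses(0, 5, 2, 1) == [0, 2, 4, 6, 8]
    # A stride of -1 walks a buffer in reverse order
    assert dma_addresses(8, 5, -1, 1) == [8, 7, 6, 5, 4]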

Fig. 12 is a flow diagram representing an example method 1280 of accelerating circuitry for posit operations, in accordance with several embodiments of the present disclosure. At block 1281, the method 1280 may include receiving, by a first buffer residing on acceleration circuitry, a first operand. The first operand may be formatted in a unim or posit format. The first buffer may be a FIFO buffer. The acceleration circuitry may be similar to the acceleration circuitry 120, 220, 520, 620, 720, 820, 868, 920, 968, 1020, and 1120 described in connection with fig. 1, 2, and 5-11, respectively.

At block 1282, the method 1280 may include receiving, by a second buffer residing on the acceleration circuitry, the second operand. The second operand may be formatted in a unim or posit format. The second buffer may be a FIFO buffer.

At block 1283, the method 1280 may include performing an arithmetic operation, a logical operation, or both, using the first operand and the second operand. As described above, the arithmetic operations may include addition, subtraction, multiplication, division, fused multiply-add, multiply-accumulate, dot product, greater-than or less-than comparison, absolute value (e.g., FABS()), fast Fourier transform, inverse fast Fourier transform, sigmoid function, convolution, square root, exponentiation, rectified linear unit (ReLU), and trigonometric operations such as sine, cosine, and tangent, while the logical operations may include logical operations such as AND, OR, XOR, NOT, and the like.

At block 1284, the method 1280 may include transmitting the results of the arithmetic operation, the logical operation, or both, through a third buffer residing on the acceleration circuitry. The results may be formatted in the unum or posit format. The third buffer may be a FIFO buffer.

In some embodiments, the acceleration circuitry may be implemented on an FPGA. The method 1280 may further include retrieving the first operand, the second operand, or both from memory residing on the FPGA. The method 1280 may further include retrieving the first operand, the second operand, or both from a main memory external to the acceleration circuitry. The method 1280 may further include retrieving the first operand, the second operand, or both via a DMA controller that may be coupled to the acceleration circuitry. The method 1280 may further include transferring the first operand, the second operand, or both from a host that may be coupled to the acceleration circuitry to at least one of a memory implemented on the FPGA and a main memory that may be coupled to the acceleration circuitry.
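A host-side retrieval sequence under these variants might look like the following C sketch, assuming a memory-mapped register block for a DMA controller; every field name and the control-bit layout are hypothetical.

```c
#include <stdint.h>

/* Hypothetical memory-mapped register block for one DMA controller;
 * the layout is an assumption for illustration only. */
typedef struct {
    volatile uint32_t start_addr;   /* where the operands live            */
    volatile int32_t  stride;       /* signed, in operands                */
    volatile uint32_t count;        /* number of operands to fetch        */
    volatile uint32_t control;      /* bit 0: start; bit 1: busy          */
} dma_regs_t;

/* Program the DMA controller to stream 'count' operands from main
 * memory (or FPGA-resident memory) into its FIFO buffer. */
static void dma_fetch(dma_regs_t *dma, uint32_t addr,
                      int32_t stride, uint32_t count) {
    dma->start_addr = addr;
    dma->stride     = stride;
    dma->count      = count;
    dma->control   |= 1u;               /* kick the transfer */
    while (dma->control & 2u) { }       /* spin until no longer busy */
}
```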

Fig. 13 is another flow diagram representing an example method 1385 of accelerating circuitry for posit operations, in accordance with several embodiments of the present disclosure. In at least one embodiment, the method 1385 may be implemented at least in part by the microcode controller 544 described in conjunction with fig. 5. At block 1386, the method 1385 may include pre-computing results of multiple arithmetic operations, multiple logical operations, or both, using multiple operands formatted in the unum or posit format. In some embodiments, pre-computing the results may include pre-computing the results of the multiple arithmetic operations, the multiple logical operations, or both using operands of a particular bit length.

At block 1387, the method 1385 may include compiling the pre-computed results into a LUT. As described herein, a LUT may be used to identify the result of a particular arithmetic or logical operation within the LUT without performing the particular arithmetic or logical operation. For example, in response to an instruction to perform a particular arithmetic or logical operation, a microcode controller (e.g., the microcode controller 544) may identify the result of the particular arithmetic or logical operation in the LUT rather than performing the particular arithmetic or logical operation.
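The pre-compute-then-look-up flow of blocks 1386 and 1387 can be demonstrated end to end in C. Because a self-contained example needs a computable stand-in for the posit ALU, the sketch below pre-computes a saturating 8-bit add; in the described system, the table entries would instead be posit-formatted results.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

/* Placeholder for the real posit ALU so the example runs on its own;
 * a saturating 8-bit add stands in for, e.g., posit8 addition. */
static uint8_t placeholder_op(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + b;
    return (uint8_t)(s > 0xFF ? 0xFF : s);
}

int main(void) {
    /* Pre-compute all 256 x 256 results into a LUT indexed by the
     * concatenated operand bit patterns. */
    uint8_t *lut = malloc(256u * 256u);
    if (!lut) return 1;
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            lut[(a << 8) | b] = placeholder_op((uint8_t)a, (uint8_t)b);

    /* A later "operation" is then a single table read, no arithmetic. */
    uint8_t a = 0x42, b = 0x17;
    printf("result: 0x%02x\n", lut[(a << 8) | b]);

    free(lut);
    return 0;
}
```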

At block 1388, the method 1385 may include receiving an instruction from a processing device to perform a particular operation of a plurality of arithmetic operations or a plurality of logical operations using a particular set of a plurality of operands. In some embodiments, receiving the instruction from the processing device may include receiving the instruction from a host processing device.

At block 1389, the method 1385 may include identifying, via the LUT, one of the pre-computed results that corresponds to the particular operation and the particular set of the plurality of operands. In some embodiments, identifying one of the pre-computed results may include determining an address of the LUT associated with the result of the particular operation and the particular set of the plurality of operands. In some embodiments, the method 1385 may further include performing additional arithmetic or logical operations using the identified result.
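One way the address determination of block 1389 might be realized is to concatenate an operation selector with the two operand bit patterns, so that each (operation, operand, operand) triple indexes a unique LUT entry; this layout is an assumption for illustration.

```c
#include <stdint.h>

/* Hypothetical LUT address layout for 8-bit operands: the operation
 * selector occupies bits 16 and up, operand A bits 15..8, and
 * operand B bits 7..0. */
static uint32_t lut_address(uint8_t op, uint8_t a, uint8_t b) {
    return ((uint32_t)op << 16) | ((uint32_t)a << 8) | (uint32_t)b;
}
```

With such a layout, identifying a pre-computed result reduces to a single indexed read, and the identified result can itself be fed back as an operand for the additional operations mentioned above.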

Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that an arrangement calculated to achieve the same results may be substituted for the specific embodiments shown. This disclosure is intended to cover adaptations or variations of one or more embodiments of the disclosure. It is to be understood that the above description has been made in an illustrative fashion, and not a restrictive one. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description. The scope of one or more embodiments of the present disclosure includes other applications in which the above structures and processes are used. The scope of one or more embodiments of the disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

In the foregoing detailed description, certain features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the disclosed embodiments of the disclosure require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description, with each claim standing on its own as a separate embodiment.
