Method and apparatus for performing video processing matrix operations within a memory array

Publication No.: 575036    Publication date: 2021-05-21

Note: This technology, "Method and apparatus for performing video processing matrix operations within a memory array", was created by Fa-Long Luo (罗法隆) on 2020-11-11. Its main content is summarized as follows: Methods and apparatus for performing video processing matrix operations within a memory fabric are disclosed. Various embodiments of the present disclosure are directed to converting a memory array into a matrix fabric for Discrete Cosine Transform (DCT) matrix transformations and to performing DCT matrix operations therein. The exemplary embodiments described herein perform DCT matrix-matrix multiplication operations within a memory device that includes a matrix fabric and a Matrix Multiplication Unit (MMU). In one embodiment, the matrix-matrix multiplication is obtained from separate matrix-vector products. In one exemplary embodiment, the matrix fabric uses a "crossbar" construction of resistive elements. Each resistive element stores an impedance level representing the value of a corresponding matrix coefficient. The crossbar may be driven with electrical signals that represent the input vector as analog voltages. The resulting signals may be converted from analog voltages to digital values by the MMU to produce a vector-matrix product. In some cases, the MMU may additionally perform various other logical operations within the digital domain.

1. A non-transitory computer-readable apparatus, comprising:

an array of memory cells, wherein each memory cell of the array of memory cells is configured to store a digital value as an analog value in an analog medium;

a memory sensing component, wherein the memory sensing component is configured to read the analog value of a first memory cell as a first digital value;

logic configured to:

receive a Discrete Cosine Transform (DCT) matrix transform opcode;

cause the array of memory cells to operate as a Matrix Multiplication Unit (MMU) based on the DCT matrix transform opcode;

wherein each memory cell of the MMU modifies the analog value in the analog medium according to the DCT matrix transform opcode and a matrix transform operand;

configure the memory sensing component to convert the analog value of the first memory cell to a second digital value according to the DCT matrix transform opcode and the matrix transform operand; and

responsive to reading the matrix transform operand into the MMU, write a matrix transform result based on the second digital value.

2. The non-transitory computer-readable apparatus of claim 1, wherein the DCT matrix transform opcode indicates a size of the MMU.

3. The non-transitory computer-readable apparatus of claim 2, wherein the DCT matrix transform opcode corresponds to a frequency domain transform operation.

4. The non-transitory computer-readable apparatus of claim 3, wherein the frequency domain transform operation spans at least one other MMU.

5. The non-transitory computer-readable apparatus of claim 1, wherein the DCT matrix transform opcode identifies one or more analog values corresponding to one or more memory cells.

6. The non-transitory computer-readable apparatus of claim 5, wherein the one or more analog values corresponding to the one or more memory cells are stored within a lookup table (LUT) data structure.

7. The non-transitory computer-readable apparatus of claim 1, wherein each memory cell of the MMU comprises a resistive random access memory (ReRAM) cell; and

wherein each memory cell of the MMU multiplies the analog value in the analog medium according to the DCT matrix transform opcode and the matrix transform operand.

8. The non-transitory computer-readable apparatus of claim 7, wherein each memory cell of the MMU further accumulates the analog value in the analog medium with a previous analog value.

9. The non-transitory computer-readable apparatus of claim 1, wherein the first digital value is characterized by a first radix of two (2); and

wherein the second digital value is characterized by a second radix greater than two (2).

10. An apparatus, comprising:

a processor in data communication with a non-transitory computer-readable apparatus;

wherein the non-transitory computer-readable apparatus comprises one or more instructions configured to, when executed by the processor, cause the apparatus to:

write a Discrete Cosine Transform (DCT) matrix transform opcode and a matrix transform operand to the non-transitory computer-readable apparatus;

wherein the DCT matrix transform opcode causes the non-transitory computer-readable apparatus to operate an array of memory cells as a matrix structure;

wherein the matrix transform operand modifies one or more analog values of the matrix structure; and

read a matrix transform result from the matrix structure.

11. The apparatus of claim 10, wherein the non-transitory computer-readable apparatus further comprises one or more instructions that, when executed by the processor, cause the apparatus to:

receive video data comprising one or more image blocks;

wherein the matrix transform operand comprises the one or more image blocks and the matrix transform result comprises one or more DCT frequency-domain image coefficients; and

wherein the one or more analog values of the matrix structure accumulate the one or more DCT frequency-domain image coefficients from the video data over time.

12. The apparatus of claim 10, wherein the DCT matrix transform opcode causes the non-transitory computer-readable apparatus to operate another array of memory cells as another matrix structure; and

wherein the matrix transform result associated with the matrix structure is logically combined with another matrix transform result associated with the other matrix structure.

13. The apparatus of claim 10, wherein the one or more analog values of the matrix structure are stored within a lookup table (LUT) data structure.

14. A method of performing a matrix transform operation for video processing, comprising:

receiving a matrix transform opcode, the matrix transform opcode comprising instructions to perform a Discrete Cosine Transform (DCT) operation;

configuring a memory cell array of a memory into a matrix structure based on the matrix transform opcode;

configuring a memory sensing component based on the matrix transform opcode; and

writing a matrix transform result from the memory sensing component in response to reading a matrix transform operand into the matrix structure.

15. The method of claim 14, wherein the configuring the memory cell array comprises connecting a plurality of word lines and a plurality of bit lines corresponding to row and column dimensions associated with the matrix structure.

16. The method of claim 15, further comprising determining the row dimension and the column dimension from the matrix transform opcode.

17. The method of claim 14, wherein the configuring the array of memory cells comprises setting one or more analog values of the matrix structure based on a look-up table (LUT) data structure.

18. The method of claim 17, further comprising identifying an entry from the LUT data structure based on the matrix transform opcode.

19. The method of claim 14, wherein the configuring of the memory sensing component causes a matrix transform result to have a radix greater than two (2).

20. The method of claim 14, wherein configuring the array of memory cells into the matrix structure comprises configuring impedance values of individual memory cells based on values of a DCT transform matrix.

Technical Field

The following relates generally to the field of data processing and device architectures. In particular, in one exemplary aspect, a processor-memory architecture is disclosed that converts a memory array into a matrix structure for matrix transformations and performs video processing matrix operations therein.

Background

Memory devices are widely used to store information in various electronic devices such as computers, wireless communication devices, cameras, digital displays, and the like. Information is stored by programming different states of the memory device. For example, binary devices have two states, typically represented by a logical "1" or a logical "0". To access the stored information, the memory device may read (or sense) the stored state in the memory device. To store information, the memory device may write (or program) the state into the memory device. So-called volatile memory devices may require power to maintain this stored information, while non-volatile memory devices may persistently store information even after the memory device itself has been, for example, power cycled. Different memory manufacturing methods and configurations enable different capabilities. For example, Dynamic Random Access Memory (DRAM) inexpensively provides high-density volatile storage. Emerging research is directed to resistive random access memory (ReRAM), which offers performance similar to DRAM along with non-volatility.

Processor devices are commonly used in conjunction with memory devices to perform a myriad of different tasks and functions. During operation, the processor executes computer readable instructions (often referred to as "software") from the memory. Computer readable instructions define basic arithmetic, logic, control, input/output (I/O) operations, and the like. As is well known in the computing arts, relatively basic computer readable instructions, when combined in a sequence, can perform a variety of complex actions. Processors tend to emphasize different circuit architectures and fabrication techniques from memory devices. For example, processing performance is generally related to clock rate, so most processor fabrication methods and configurations emphasize very high rate transistor switch architectures, and the like.

Over time, the speed and power consumption of both processors and memory has increased. Typically, these improvements are a result of device scaling, as electrical signaling is physically limited by the dielectric and distance of the transmission medium. As previously mentioned, most processors and memories are manufactured using different materials and techniques of manufacture. Thus, even though processors and memory continue to improve, the physical interface between the processor and memory is a "bottleneck" in overall system performance. More directly, the performance of a combined processor and memory system, no matter how fast the processor or memory can run independently, is limited by the transfer rate allowed by the interface. This phenomenon has several common names, such as "processor-memory wall", "von neumann bottleneck", and the like.

Video processing (e.g., compression) methods are a class of processing applications that are known to require frequent communication between memory and a processor. Thus, the speed of conventional video processing is particularly affected by the processor-memory interface limitations described above. Accordingly, there is a clear need for improved apparatus and methods to reduce the impact of these limitations, including in such video processing applications.

Disclosure of Invention

The present invention provides, among other things, methods and apparatus for converting a memory array into a matrix structure for matrix transformation and performing matrix operations therein.

In one aspect of the disclosure, a non-transitory computer-readable medium is disclosed. In one exemplary embodiment, the non-transitory computer-readable medium includes: an array of memory cells, wherein each memory cell of the array of memory cells is configured to store a digital value as an analog value in an analog medium; a memory sensing component, wherein the memory sensing component is configured to read an analog value of a first memory cell as a first digital value; and logic. In one embodiment, the memory cell array includes a stack of at least two sub-arrays of memory cells. In one variation, at least some of the sub-arrays of memory cells are configured to run in parallel with each other.

In one implementation, at least some of the sub-arrays of memory cells are connected to a single memory sensing component.

In another implementation, individual sub-arrays of memory cells are connected to individual memory sensing components.

In another variation, the memory cell is configured to store the analog value as an impedance or conductance in the memory cell.

In another exemplary embodiment, the logic is further configured to: receive a matrix transformation opcode; cause the array of memory cells to operate as a Matrix Multiplication Unit (MMU) based on the matrix transformation opcode; wherein each memory cell of the MMU modifies the analog value in the analog medium according to the matrix transformation opcode and a matrix transformation operand; configure the memory sensing component to convert the analog value of the first memory cell to a second digital value according to the matrix transformation opcode and the matrix transformation operand; and write a matrix transformation result based on the second digital value in response to reading the matrix transformation operand into the MMU.

In one variation, the matrix transformation opcode indicates the size of the MMU. In one such variant, the matrix transformation opcode corresponds to a frequency domain transform operation. In one embodiment, the frequency domain transform operation is a DCT (discrete cosine transform) operation, and at least some aspects of the size of the MMU correspond to the size of the DCT operation (e.g., the size of the MMU may correspond to the size of the DCT operation, or a multiple or fraction thereof). In one exemplary variant, the frequency domain transform operation spans at least one other MMU.

In yet another variant, the matrix transformation operand includes a vector derived from the pixel values of a pixel image block. In one embodiment, the pixel values correspond to red, green, blue, luminance (Y), chrominance red (Cr), or chrominance blue (Cb) pixel values. In one exemplary configuration, the pixel image blocks are NxN image blocks (e.g., 8x8 pixels, 16x16 pixels, etc.).

In one variation, the matrix transform opcode identifies one or more analog values corresponding to one or more memory cells. In one such variation, one or more analog values corresponding to one or more memory cells are stored in a look-up table (LUT) data structure. In one implementation, the one or more analog values include DCT transform matrix values.

In one variation, each memory cell of the MMU comprises a resistive random access memory (ReRAM) cell; and each memory cell of the MMU multiplies the analog value in the analog medium according to the matrix transformation opcode and the matrix transformation operand.

In one variation, each memory cell of the MMU further accumulates the analog value in the analog medium with a previous analog value.

In one variation, the first digital value is characterized by a first radix of two (2); and the second digital value is characterized by a second radix greater than two (2).

In one aspect of the disclosure, an apparatus is disclosed. In one embodiment, the apparatus includes a processor coupled to a non-transitory computer-readable medium; wherein the non-transitory computer-readable medium includes one or more instructions that, when executed by the processor, cause the processor to: write a matrix transformation opcode and a matrix transformation operand to the non-transitory computer-readable medium; wherein the matrix transformation opcode causes the non-transitory computer-readable medium to operate an array of memory cells as a matrix structure; wherein the matrix transformation operand modifies one or more analog values of the matrix structure; and read a matrix transformation result from the matrix structure.

In one variation, the non-transitory computer-readable medium also includes one or more instructions that, when executed by the processor, cause the processor to: capture image data comprising one or more captured color values; and wherein the matrix transformation operand includes the one or more captured color values and the matrix transformation result includes one or more shifted color values.

In one variation, the non-transitory computer-readable medium further comprises one or more instructions that, when executed by the processor, cause the processor to: receive video data comprising one or more image blocks; wherein the matrix transformation operand comprises the one or more image blocks and the matrix transformation result comprises one or more frequency-domain image coefficients; and wherein the one or more analog values of the matrix structure accumulate the one or more frequency-domain image coefficients from the video data over time. In one implementation, the one or more frequency-domain image coefficients comprise DCT image coefficients.

In one variation, the matrix transformation opcode causes the non-transitory computer-readable medium to operate another array of memory cells as another matrix structure; and the matrix transformation result associated with the matrix structure is logically combined with another matrix transformation result associated with the other matrix structure.

In one variant, one or more analog values of the matrix structure are stored in a look-up table (LUT) data structure. In one variant, at least some values of the matrix structure are stored in a portion of the memory cells that are not configured as a matrix structure.

In one aspect of the disclosure, a method of performing a transform matrix operation is disclosed. In one embodiment, the method includes: receiving a matrix transformation opcode; configuring a memory cell array of a memory into a matrix structure based on the matrix transformation opcode; configuring a memory sensing component based on the matrix transformation opcode; and writing a matrix transformation result from the memory sensing component in response to reading a matrix transformation operand into the matrix structure. In one variation, configuring the matrix structure includes configuring the array of memory cells with the values of a DCT transform matrix, and the matrix transformation result includes a matrix of DCT image coefficients. In one embodiment, the matrix structure is configured with two matrices: one of positive DCT transform matrix values and one of negative DCT transform matrix values. In one variation, a plurality of memory cell arrays are configured with the values of the DCT transform matrix.

In one variation, configuring the memory cell array includes connecting a plurality of word lines and a plurality of bit lines corresponding to row and column dimensions associated with the matrix structure.

In one variation, the method further includes determining a row dimension and a column dimension from the matrix transform opcode.

In one variation, configuring the memory cell array includes setting one or more analog values of a matrix structure based on a look-up table (LUT) data structure.

In one variation, the method includes identifying an entry from the LUT data structure based on the matrix transform opcode.

In one variation, the memory sensing component is configured such that the matrix transformation result has a radix greater than two (2).

In one aspect, an apparatus configured to configure a memory device into a matrix structure is disclosed. In one embodiment, the apparatus includes: a memory; a processor configured to access the memory; and pre-processor logic configured to allocate one or more memory portions to be used as a matrix structure.

In another aspect of the disclosure, a computer image processing device apparatus configured to dynamically configure memory into a matrix structure is disclosed. In one embodiment, a computer image processing device includes: a camera interface; a digital processor device in data communication with the camera interface; and a memory in data communication with the digital processor apparatus and containing at least one computer program.

In another aspect of the disclosure, a computer video processing device apparatus is disclosed that is configured to dynamically configure a memory into a matrix structure. In one embodiment, a computer video processing device includes: a camera interface; a digital processor device in data communication with the camera interface; and a memory in data communication with the digital processor apparatus and containing at least one computer program.

In another aspect of the disclosure, a computer wireless access node apparatus is disclosed that is configured to dynamically configure a memory into a matrix structure. In one embodiment, a computer wireless access node comprises: a wireless interface configured to transmit and receive RF waveforms in a portion of a spectrum; a digital processor device in data communication with the wireless interface; and a memory in data communication with the digital processor apparatus and containing at least one computer program.

In additional aspects of the disclosure, a computer-readable apparatus is described. In one embodiment, the apparatus includes a storage medium configured to store one or more computer programs within or in conjunction with the memory characterized above. In one embodiment, the apparatus includes a program memory, HDD, or SSD on a computerized controller device. In another embodiment, the apparatus comprises a program memory, HDD, or SSD on a computerized access node.

These and other aspects will become apparent when considered in light of the disclosure provided herein.

Drawings

FIG. 1A is a graphical depiction of a processor-memory architecture and associated matrix operations.

FIG. 1B is a graphical depiction of a processor-PIM architecture and associated matrix operations.

FIG. 2 is a logical block diagram of one exemplary implementation of a memory device according to various principles of the present disclosure.

FIG. 3 is an exemplary side-by-side illustration of a first memory device configuration and a second memory device configuration.

Fig. 4 is a graphical depiction of a half-matrix operation performed in accordance with the principles of the present disclosure.

FIG. 5A is a block diagram of a first embodiment of a memory device configured for DCT matrix operations, according to aspects of the present disclosure.

FIG. 5B is a block diagram of a second embodiment of a memory device configured for DCT matrix operations, according to aspects of the present disclosure.

FIG. 6A is a block diagram of a third embodiment of a memory device configured for DCT matrix operations, according to aspects of the invention.

FIG. 6B is a block diagram of a fourth embodiment of a memory device configured for DCT matrix operations, according to aspects of the present disclosure.

FIG. 7A is a logical block diagram of one exemplary implementation of a processor-memory architecture.

Fig. 7B is a ladder diagram illustrating one exemplary embodiment of performing a set of matrix operations in accordance with the principles of the present disclosure.

Fig. 7C is a ladder diagram illustrating another exemplary embodiment for performing a set of matrix operations in accordance with the principles of the present disclosure.

FIG. 8 is a logic flow diagram of one exemplary method of converting a memory array to a matrix structure and performing matrix operations therein.

FIG. 8A is a logic flow diagram of one exemplary method of performing a DCT matrix operation using a matrix structure.

All drawings © 2019 Micron Technology, Inc. All rights reserved.

Detailed Description

Referring now to the drawings, in which like numerals refer to like parts throughout.

As used herein, the term "application" (or "app") refers generally and without limitation to a unit of executable software that implements a certain functionality or theme. The themes of applications vary broadly across any number of disciplines and functions (e.g., on-demand content management, e-commerce transactions, brokerage transactions, home entertainment, calculators, etc.), and one application may have more than one theme. The unit of executable software generally runs in a predetermined environment; for example, the unit could comprise a downloadable application that runs within an operating system environment.

As used herein, the terms "computer program" or "software" are meant to encompass any sequence of human- or machine-recognizable steps for performing a function. Such programs may be presented in virtually any programming language or environment, including, for example, C/C++, Fortran, COBOL, PASCAL, assembly languages, markup languages (e.g., HTML, SGML, XML, VoXML), and the like, as well as object-oriented environments such as the Common Object Request Broker Architecture (CORBA), Java™ (including J2ME, Java Beans, etc.), Register Transfer Language (RTL), VHSIC (Very High Speed Integrated Circuit) Hardware Description Language (VHDL), Verilog, and the like.

As used herein, the terms "decentralized" or "distributed" refer without limitation to a configuration or network architecture involving multiple computerized devices that are able to perform data communication with one another, rather than requiring a given device to communicate through a designated (e.g., central) network entity, such as a server device. For example, a decentralized network enables direct peer-to-peer data communication among multiple UEs (e.g., wireless user devices) making up the network.

As used herein, the term "distributed unit" (DU) refers without limitation to a distributed logical node within a wireless network infrastructure. For example, a DU may be embodied as a Next Generation NodeB (gNB) DU (gNB-DU) that is controlled by a gNB CU described above. One gNB-DU may support one or more cells; a given cell is supported by only one gNB-DU.

As used herein, the terms "internet" and "Internet" are used interchangeably to refer to inter-networks including, without limitation, the Internet. Other common examples include, but are not limited to: networks of external servers, "cloud" entities (e.g., memory or storage not local to a device, storage generally accessible at any time via a network connection, and the like), service nodes, access points, controller devices, and client devices. A 5G serving core network and network components (e.g., DUs, CUs, gNBs, small or femto cells, 5G-capable external nodes) residing in the backhaul, fronthaul, crosshaul, or an "edge" thereof proximate to residential, enterprise, and other occupied areas may also be included in the "internet".

As used herein, the term "memory" includes any type of integrated circuit or other memory device suitable for storing digital data, including, but not limited to, Random Access Memory (RAM), pseudo-static RAM (PSRAM), Dynamic RAM (DRAM), Synchronous Dynamic RAM (SDRAM) including Double Data Rate (DDR)-class memory and graphics DDR (GDDR) and variants thereof, ferroelectric RAM (FeRAM), Magnetic RAM (MRAM), resistive RAM (ReRAM), read-only memory (ROM), Programmable ROM (PROM), electrically erasable PROM (EEPROM or E2PROM), DDR/2 SDRAM, EDO/FPMS, Reduced Latency DRAM (RLDRAM), Static RAM (SRAM), "flash" memory (e.g., NAND/NOR), Phase Change Memory (PCM), 3-dimensional cross-point memory (3D XPoint), stacked memory such as HBM/HBM2, and Magnetoresistive RAM (MRAM) such as spin-torque transfer RAM (STT RAM).

As used herein, the terms "microprocessor" and "processor" or "digital processor" are generally meant to include all types of digital processing devices, including, but not limited to, Digital Signal Processors (DSPs), Reduced Instruction Set Computers (RISCs), General Purpose Processors (GPPs), microprocessors, gate arrays (e.g., FPGAs), PLDs, reconfigurable computer architectures (RCFs), array processors, secure microprocessors, and Application Specific Integrated Circuits (ASICs). Such a digital processor may be contained on a single unitary IC die, or distributed across multiple components.

As used herein, the term "server" refers to any computer component, system or entity, regardless of the form suitable for providing data, files, applications, content, or other services to one or more other devices or entities on a computer network.

As used herein, the term "storage" refers without limitation to a computer hard drive (e.g., hard disk drive (HDD), solid state drive (SSD)), flash drive, DVR device, memory, RAID device or array, optical media (e.g., CD-ROM, Laserdisc, Blu-ray, etc.), or any other device or medium capable of storing content or other information, including semiconductor devices (e.g., those described herein as memory) capable of maintaining data in the absence of a power supply. Common examples of memory devices used for storage include, but are not limited to: ReRAM, DRAM (e.g., SDRAM, DDR2 SDRAM, DDR3 SDRAM, DDR4 SDRAM, GDDR, RLDRAM, LPDDR, etc.), DRAM modules (e.g., RDIMM, VLP RDIMM, UDIMM, VLP UDIMM, SODIMM, SORDIMM, Mini-DIMM, VLP Mini-DIMM, LRDIMM, NVDIMM, etc.), managed NAND, NAND flash memory (e.g., SLC NAND, MLC NAND, TLC NAND, serial NAND, 3D NAND, etc.), NOR flash memory (e.g., parallel NOR, serial NOR, etc.), multichip packages, hybrid memory cubes, memory cards, solid state storage (SSS), and any number of other memory devices.

As used herein, the term "Wi-Fi" is not limited to, and when applicable, refers to any variation of: IEEE Standard 802.11 or related standards including 802.11a/b/g/n/s/v/ac/ad/av or 802.11-2012/2013, 802.11-2016, and Wi-Fi Direct (including in particular the "Wi-Fi Peer-to-Peer (P2P) Specification," which is incorporated herein by reference in its entirety).

As used herein, the term "wireless" means any wireless signal, data, communication, or other interface, including, but not limited to, Wi-Fi, Bluetooth/BLE, 3G/4G/4.5G/5G/B5G (3GPP/3GPP2), HSDPA/HSUPA, TDMA, CBRS, CDMA (e.g., IS-95A, WCDMA, etc.), FHSS, DSSS, GSM, PAN/802.15, WiMAX (802.16), 802.20, Z-Wave, narrowband/FDMA, OFDM, PCS/DCS, LTE/LTE-A/LTE-U/LTE-LAA, analog cellular, CDPD, satellite systems, millimeter wave or microwave systems, acoustic, and infrared (i.e., IrDA).

SUMMARY

The "processor-memory wall" performance limitations described above can be especially severe for processor-memory architectures that repeat similar operations over large data sets. In such cases, the processor-memory architecture must separately transfer, operate on, and store each element of the data set. For example, a 4x4 matrix multiplication (sixteen (16) elements) requires four (4) times as many operations as a 2x2 matrix multiplication (four (4) elements). In other words, the cost of a matrix operation grows with the square of the matrix dimension.

Matrixing is commonly used in many different applications and may take up a disproportionate amount of processing and/or memory bandwidth. For example, many Image Signal Processing (ISP) techniques typically use matrix transformations for, e.g., color interpolation, white balancing, color correction, color conversion, and the like. Many communication techniques employ Fast Fourier Transforms (FFTs) and matrix multiplication for beamforming and/or massive multiple-input multiple-output (MIMO) channel processing.

In particular, video, audio, and image compression typically use a Discrete Cosine Transform (DCT) to identify video image data that can be removed with minimal loss of fidelity. DCT-based compression is used, for example, in the JPEG, MP3, AAC, MPEG, H.264/AVC, and H.265/HEVC image/audio/video standards, as well as in HDR (high dynamic range) applications. DCT-based image/video compression involves several matrix multiplication operations. Various embodiments of the present disclosure are directed to converting a memory array into a matrix structure for matrix transformations and performing video processing matrix operations therein.
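To make the role of matrix multiplication concrete, the following is a minimal sketch (assuming NumPy and SciPy are available; the 8x8 block size and the use of scipy.fft are illustrative choices, not part of this disclosure) showing that the 2-D DCT of an 8x8 pixel block is exactly two matrix-matrix products, Y = D·X·Dᵀ, which is the kind of operation the matrix structure described below is intended to absorb:

import numpy as np
from scipy.fft import dct, dctn

D = dct(np.eye(8), axis=0, norm="ortho")                       # 8x8 DCT-II transform matrix
X = np.random.randint(0, 256, (8, 8)).astype(float) - 128.0    # level-shifted pixel block
coeffs = D @ X @ D.T                                           # 2-D DCT as two matrix products
assert np.allclose(coeffs, dctn(X, norm="ortho"))              # matches the library's 2-D DCT

In a codec, these coefficients would then be quantized and entropy coded; the point here is only that the transform itself reduces to matrix multiplications.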

The exemplary embodiments described herein perform matrix transformations within a memory device that includes a matrix structure and a Matrix Multiplication Unit (MMU). In one exemplary embodiment, the matrix structure uses a "crossbar" configuration of resistive elements. Each resistive element stores an impedance level representing a value of the corresponding matrix coefficient. The crossbar may be driven with electrical signals representing the input vector as analog voltages. The resulting signals may be converted from analog voltages to digital values by the MMU to produce a vector-matrix product. In some cases, the MMU may also perform various other logical operations within the digital domain.
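The following is a minimal numerical sketch of the crossbar idea just described, under the assumption of an idealized, linear array with no wire resistance or device variation: input voltages drive the rows, each cell contributes a current proportional to its programmed conductance (Ohm's law), and each column wire sums its cell currents (Kirchhoff's current law), so the column currents form a vector-matrix product.

import numpy as np

def crossbar_vmm(voltages: np.ndarray, conductances: np.ndarray) -> np.ndarray:
    # Ideal crossbar: column current i[c] = sum over rows r of voltages[r] * conductances[r, c]
    return voltages @ conductances

v = np.array([0.3, 0.7])                  # input vector encoded as row-line voltages
G = np.array([[1.0, 0.5, 0.2, 0.0],       # 2x4 array of programmed cell conductances
              [0.1, 0.9, 0.4, 0.6]])
print(crossbar_vmm(v, G))                 # four column currents = the vector-matrix product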

Unlike prior solutions that iterate over each element to compute a single element of the result, the crossbar matrix structure described below computes multiple elements of the matrix operation "atomically" (i.e., within a single processing cycle). For example, at least a portion of the vector-matrix product may be computed in parallel. The "atomicity" of matrix-structure-based computation yields a significant processing improvement over iterative alternatives. In particular, whereas iterative techniques grow with matrix size, atomic matrix-structure calculations are independent of the matrix dimensions. In other words, an NxN vector-matrix product can be completed in a single atomic instruction.

Various embodiments of the present disclosure internally derive and/or use matrix coefficient values to further minimize interface transactions. As described in more detail herein, many useful matrix transformations can be characterized by, and performed with, "structurally defined dimensions". "Structurally defined" refers to those aspects of a matrix computation that are defined for a particular matrix structure (e.g., the rank and/or size of the matrix); in other words, the matrix coefficients can be inferred from the matrix structure and need not be explicitly provided via the processor-memory interface. For example, as described in greater detail below, the DCT transform matrix values (i.e., the DCT "twiddle factors") are a function of the matrix size. The "twiddle factors" of the Fast Fourier Transform (FFT) are likewise a function of the matrix size, while ISP filtering and/or massive MIMO channel coding techniques may use, for example, predefined matrices and/or codebooks of matrices with known structure and weighting.
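As an illustration of the "structurally defined" point (the helper names below are hypothetical, not taken from this disclosure): given only the transform size N, e.g., carried by the opcode, the DCT-II coefficients can be derived locally and quantized onto a small set of discrete conductance codes of the kind a LUT might hold, so no coefficient data needs to cross the processor-memory interface.

import numpy as np

def dct_coefficients(n: int) -> np.ndarray:
    # Orthonormal DCT-II matrix: D[k, i] = sqrt(2/n) * c_k * cos(pi * (2*i + 1) * k / (2*n))
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)               # c_0 = 1/sqrt(2) scaling for the DC row
    return d

def to_lut_codes(coeffs: np.ndarray, levels: int = 16) -> np.ndarray:
    # Map coefficients onto `levels` evenly spaced conductance codes (an assumed quantization)
    lo, hi = coeffs.min(), coeffs.max()
    return np.round((coeffs - lo) / (hi - lo) * (levels - 1)).astype(int)

for n in (4, 8, 16):                      # the size alone determines every coefficient
    print(n, to_lut_codes(dct_coefficients(n)).shape)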

As a brief aside, practical limitations in component fabrication constrain the precision achievable by each element within an individual memory device. For example, most memory arrays are only designed to discern two (2) states (logic "1", logic "0"). While existing memory sensing components may be extended to discern higher levels of precision (e.g., four (4) states, eight (8) states, etc.), extending a memory sensing component to the precision required to support the large transforms typically used in, for example, video compression and other mathematical transforms may be impractical.

For these purposes, various embodiments of the present disclosure logically combine one or more matrix structures and/or MMUs to provide greater precision and/or processing complexity than would otherwise be possible. In one such embodiment, a first matrix structure and/or MMU may be used to compute positive vector-matrix products, while a second matrix structure and/or MMU may be used to compute negative vector-matrix products. The positive and negative vector-matrix products may be summed to determine a net vector-matrix product.
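A minimal sketch of this signed-coefficient workaround follows, under the assumption that each crossbar can only hold non-negative conductances: the signed matrix M is split into a positive part and a negative part, two vector-matrix products are taken (conceptually one per matrix structure/MMU), and the results are subtracted in the digital domain to recover the signed product.

import numpy as np

M = np.array([[ 0.5, -0.25],
              [-0.75,  1.0]])              # signed coefficient matrix
M_pos = np.clip(M, 0.0, None)              # positive coefficients, zeros elsewhere
M_neg = np.clip(-M, 0.0, None)             # magnitudes of the negative coefficients

a = np.array([2.0, 3.0])                   # input vector
b = a @ M_pos - a @ M_neg                  # net product combined digitally (e.g., by the MMU)
assert np.allclose(b, a @ M)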

In another such embodiment, a larger matrix transformation may be implemented from multiple simpler matrix transformations. For example, to perform a DCT operation, a matrix-matrix multiplication of two NxN matrices may be decomposed into N separate matrix-vector multiplications (i.e., the NxN matrix multiplied by each of N vectors of size N). The various portions of the DCT operation processed in the matrix structure may additionally be sequenced or parallelized in accordance with any number of design considerations. Other logically combined matrix operations may be substituted with equal success (e.g., decomposition, common matrix multiplication, etc.) given the contents of the present disclosure.
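The decomposition just mentioned can be sketched as follows (an illustration under stated assumptions, not code from this disclosure): an NxN matrix-matrix product D·X is computed as N separate matrix-vector products, one per column of X, each of which could be issued to the matrix structure as one atomic vector-matrix operation.

import numpy as np

def matmul_by_columns(D: np.ndarray, X: np.ndarray) -> np.ndarray:
    n = X.shape[1]
    out = np.empty_like(X, dtype=float)
    for col in range(n):                   # N matrix-vector products
        out[:, col] = D @ X[:, col]
    return out

D = np.random.rand(8, 8)
X = np.random.rand(8, 8)
assert np.allclose(matmul_by_columns(D, X), D @ X)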

Some applications may save a lot of power by shutting down system components when not in use. For example, video compression may benefit from "sleep" during video blanking intervals (when no video data is active), and the like. However, hibernation programs typically require a processor and/or memory to transfer data from an operating volatile memory to a non-volatile storage memory so that the data is not lost when power is removed. A wake-up process is also required to retrieve the stored information from the non-volatile memory. Transferring data back and forth between memories is an inefficient use of processor-memory bandwidth. Thus, various embodiments disclosed herein take advantage of the "non-volatile" nature of the matrix structure. In such an embodiment, the matrix structure may retain its matrix coefficient values even when the memory is without power. More directly, the non-volatile nature of the matrix structure enables the processor and memory to transition to a sleep/low power mode or perform other tasks without shuffling data from volatile memory to non-volatile memory, and vice versa.

Various other combinations and/or variations of the foregoing will be readily apparent to those of ordinary skill in the art given the present disclosure.

Detailed description of exemplary embodiments

Exemplary embodiments of the apparatus and methods of the present disclosure are now described in detail. While these exemplary embodiments are described in the context of specific processor and/or memory configurations, the general principles and advantages of the present disclosure may be extended to other types of processor and/or memory technologies; the following is merely exemplary in nature.

It will also be understood that while generally described in the context of consumer devices (e.g., within camera devices, video codecs, cellular phones, and/or network base stations), the present disclosure may be readily applied to other types of devices, including, for example, server devices and Internet of Things (IoT) devices, and to personal, corporate, or even governmental uses, such as those of "incumbent" users such as the United States Department of Defense. Other applications are also possible.

Other features and advantages of the present disclosure will be readily apparent to those of ordinary skill in the art by reference to the figures and the detailed description of the exemplary embodiments set forth below.

Processor memory architecture

FIG. 1A shows one common processor-memory architecture 100 for illustrating matrix operations. As shown in fig. 1A, processor 102 is coupled to memory 104 through interface 106. In an illustrative example, a processor multiplies elements of an input vector a with a matrix M to calculate a vector-matrix product b. Mathematically, the input vector a is considered as a single column matrix having a number of elements equal to the number of rows in the matrix M.

To calculate the first element of the vector-matrix product, b0, the processor must iterate over each element of the input vector a paired with the corresponding element within a row of the matrix M. During the first iteration, the first element of the input vector, a0, is read, the current value of the vector-matrix product element b0 is read, and the corresponding matrix coefficient value M0,0 is read. These three (3) reads feed a multiply-accumulate operation that produces an "intermediate" vector-matrix product value b0. Specifically, the multiply-accumulate operation calculates (a0·M0,0) + b0 and writes the result back to b0. Notably, this b0 is an "intermediate value": after the first iteration but before the second iteration, the intermediate value b0 may not correspond to the final value of the vector-matrix product element b0.

During the second iteration, the second element of the input vector, a1, is read, the previously calculated intermediate value b0 is retrieved, and the second matrix coefficient value M1,0 is read. These three (3) reads feed a multiply-accumulate operation that produces the first element of the vector-matrix product, b0. The calculation of b0 is completed with the second iteration.

Although not explicitly shown, the iterative process described above is also performed to generate the second element of the vector-matrix product, b1. Additionally, while the foregoing example is a 2x2 vector-matrix product, the technique described extends generally to vector-matrix computations of any size. For example, a 3x3 vector-matrix product iterates over an input vector of three (3) elements for each of the three (3) rows of the matrix; therefore, nine (9) iterations are required. A 1024x1024 matrix operation, which is not uncommon in many applications, requires over a million iterations. More directly, the number of iterations in the process described above grows with the square of the matrix dimension.

Although the foregoing discussion has been presented in the context of vector-matrix products, one of ordinary skill in the art will readily appreciate that matrix-matrix products may be performed as a series of vector-matrix products. For example, a first vector-matrix product is calculated for a first single-column matrix of the input matrix, a second vector-matrix product is calculated for a second single-column matrix, and so forth. Thus, a 2x2 matrix-matrix product requires two (2) vector-matrix calculations (i.e., a total of 2x4 = 8 iterations), and a 3x3 matrix-matrix product requires three (3) vector-matrix calculations (i.e., a total of 3x9 = 27 iterations).
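For illustration only (this is a sketch of the iterative flow just described, not code from this disclosure), the routine below performs the vector-matrix product element by element and counts the interface traffic: each multiply-accumulate needs three reads (an input element, a matrix coefficient, and the intermediate result) plus one write.

import numpy as np

def iterative_vmm(a: np.ndarray, M: np.ndarray):
    rows, cols = M.shape
    b = np.zeros(cols)
    reads = writes = 0
    for c in range(cols):
        for r in range(rows):
            # read a[r], read M[r, c], read the intermediate b[c]; then write b[c] back
            b[c] = b[c] + a[r] * M[r, c]
            reads += 3
            writes += 1
    return b, reads, writes

a = np.array([1.0, 2.0])
M = np.array([[3.0, 4.0],
              [5.0, 6.0]])
b, reads, writes = iterative_vmm(a, M)     # 2x2 case: 4 iterations, 12 reads, 4 writes
assert np.allclose(b, a @ M)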

One of ordinary skill in the relevant art will readily appreciate that each iteration of the process described in FIG. 1A is bottlenecked by the bandwidth limitations of the interface 106 (the "processor-memory wall"). Even though the processor and the memory may each have extremely high-bandwidth internal busses, the processor-memory system can only communicate as fast as the interface 106 can support electrical signaling, which depends on the dielectric properties and transmission distance (e.g., 1-2 centimeters) of the interface material, typically copper. In addition, the interface 106 may also contain various additional signal conditioning, amplification, noise correction, error correction, parity computation, and/or other interface-based logic that further adds to transaction times.

One common way to improve the performance of matrix operations is to perform the matrix operations within a local processor cache. Unfortunately, local processor cache takes up processor die space and has a much higher per-bit manufacturing cost than, for example, a comparable memory device. As a result, a processor's local cache size is typically much smaller (e.g., a few megabytes) than its memory (which may be several gigabytes). From a practical standpoint, the smaller local cache places a hard limit on the maximum size of matrix operation that can be performed locally within the processor. As another disadvantage, large matrix operations result in poor cache utilization, because only one row and one column are accessed at a time (e.g., for a 1024x1024 vector-matrix product, only 1/1024th of the cache is in active use during a single iteration). Thus, while processor cache implementations are acceptable for small matrices, this technique becomes less and less desirable as the complexity of the matrix operations increases.

Another common approach is the so-called processor-in-memory (PIM). FIG. 1B illustrates one such processor-PIM architecture 150. As shown, the processor 152 is coupled to the memory 154 through an interface 156. The memory 154 further includes a PIM 162 and a memory array 164; the PIM 162 is tightly coupled to the memory array 164 through an internal interface 166.

Similar to the process described above in FIG. 1A, the processor-PIM architecture 150 of FIG. 1B multiplies the elements of an input vector a by a matrix M to calculate the vector-matrix product b. However, the PIM 162 performs the reads, multiply-accumulates, and writes internally to the memory array 164 via the internal interface 166. The internal interface 166 is much shorter than the external interface 156; additionally, the internal interface 166 can operate natively without, for example, signal conditioning, amplification, noise correction, error correction, and parity computation.

Although the processor-PIM architecture 150 provides a substantial improvement in performance over, for example, the processor-memory architecture 100, the processor-PIM architecture 150 may have other drawbacks. For example, the fabrication technology ("silicon process") differs substantially between processors and memory devices because each silicon process is optimized for different design criteria. For example, a processor silicon process may use a thinner transistor structure than a memory silicon process; thinner transistor structures provide faster switching (which improves performance) but suffer from larger leakage (which is undesirable for memory retention). As a result, fabricating the PIM 162 and the memory array 164 in the same wafer results in at least one of being implemented in a sub-optimal silicon process. Alternatively, the PIM 162 and the memory array 164 may be implemented and connected together within separate dies; die-to-die communication typically increases manufacturing cost and complexity, and may suffer from various other impairments (e.g., introduced by process discontinuities, etc.).

Further, one of ordinary skill in the relevant art will readily appreciate that the PIM 162 and the memory array 164 are "hardened" components; the PIM 162 cannot store data, nor can the memory array 164 perform computations. Indeed, once the memory 154 is manufactured, it cannot be changed to, for example, store more data and/or increase/decrease PIM performance or power consumption. Such memory devices are typically custom-built for their application; this is expensive to design and modify, and in many cases the memory is "proprietary" and/or customer/manufacturer specific. Moreover, such devices quickly become obsolete due to rapid technological change.

For various reasons, improved solutions for matrix operations within processors and/or memories are needed. Ideally, such a solution would enable matrix operations within the memory device in a manner that minimizes performance bottlenecks of the processor-memory wall. Furthermore, such a solution should be flexible to accommodate a variety of different matrix operations and/or matrix sizes.

Exemplary memory device

FIG. 2 is a logical block diagram of one exemplary implementation of a memory device 200 manufactured according to various principles of the present disclosure. Memory device 200 may include a plurality of partitioned memory cell arrays 220. In some implementations, each partitioned memory cell array 220 can be partitioned at device fabrication. In other implementations, the memory cell arrays 220 can be partitioned dynamically (i.e., after device fabrication). Each memory cell array 220 may include a plurality of banks, where each bank includes a plurality of word lines, a plurality of bit lines, and a plurality of memory cells arranged, for example, at intersections of the plurality of word lines and the plurality of bit lines. Selection of word lines may be performed by row decoder 216 and selection of bit lines may be performed by column decoder 218.

The plurality of external terminals included in the semiconductor device 200 may include an address terminal 260, a command terminal 262, a clock terminal 264, a data terminal 240, and a power supply terminal 250. Address terminals 260 may be provided with address signals and bank address signals. The address signal and the bank address signal supplied to the address terminal 260 are transmitted to the address decoder 204 via the address input circuit 202. Address decoder 204 receives, for example, address signals and provides decoded row address signals to a row decoder 216 and decoded column address signals to a column decoder 218. The address decoder 204 may also receive bank address signals and provide the bank address signals to a row decoder 216 and a column decoder 218.

Command terminal 262 is provided with a command signal to command input circuit 206. The command terminal 262 may include one or more individual signals, such as a Row Address Strobe (RAS), a Column Address Strobe (CAS), a read/write (R/W). The command signal input to the command terminal 262 is supplied to the command decoder 208 through the command input circuit 206. The command decoder 208 may decode the command signal 262 to generate various control signals. For example, RAS may be asserted to specify a row for which data is to be read/written, while CAS may be asserted to specify a column for which data is to be read/written. In some variations, the R/W command signal determines whether the contents of the data terminal 240 are written to or read from the memory cell 220.

During a read operation, read data may be externally output from the data terminal 240 through the read/write amplifier 222 and the input/output circuit 224. Similarly, when a write command is issued and a row address and a column address are provided in time with the write command, a write data command may be provided to the data terminal 240. Write data commands may be provided to a given memory cell array 220 through input/output circuitry 224 and read/write amplifiers 222 and written into the memory cells specified by the row and column addresses. According to some embodiments, the input/output circuitry 224 may include an input buffer.

The clock terminal 264 may be provided with an external clock signal for synchronous operation. In one variant, the clock signal is a single-ended signal; in other variations, the external clock signals may be complementary to each other (differential signaling) and provided to the clock input circuit 210. Clock input circuit 210 receives an external clock signal and adjusts the clock signal to ensure that the resulting internal clock signal has sufficient amplitude and/or frequency for subsequent locked loop operation. The conditioned internal clock signal is provided to a feedback mechanism (internal clock generator 212) that provides a stable clock for the internal memory logic. Common examples of internal clock generation logic 212 include, but are not limited to: digital or analog Phase Locked Loops (PLLs), Delay Locked Loops (DLLs), and/or Frequency Locked Loops (FLLs) operate.

In an alternative variant (not shown), the memory 200 may rely on external clocking (i.e., without its own internal clock). For example, the phase control clock signal may be externally supplied to the input/output (IO) circuit 224. This external clock may be used to clock in write data and clock out read data. In this variation, the IO circuit 224 provides a clock signal to each respective logic block (e.g., the address input circuit 202, the address decoder 204, the command input circuit 206, the command decoder 208, etc.).

The power supply terminals 250 may be supplied with power supply potentials. In some variations (not shown), these supply potentials may be provided via the input/output (I/O) circuit 224. In some embodiments, the supply potentials may be isolated from the I/O circuit 224 so that power supply noise generated by the I/O circuit 224 does not propagate to the other circuit blocks. These power supply potentials are conditioned by an internal power supply circuit 230. For example, the internal power supply circuit 230 may generate various internal potentials that, e.g., remove noise and/or parasitic activity, as well as boosted or stepped-down potentials derived from the power supply potential. The internal potentials may be used in, for example, the address circuitry (202, 204), the command circuitry (206, 208), the row and column decoders (216, 218), the read/write amplifier 222, and/or any of various other circuit blocks.

A power-on reset circuit (PON)228 provides a power-on signal when the internal power circuit 230 is able to adequately provide internal voltages for a power-on sequence. The temperature sensor 226 may sense a temperature of the semiconductor device 200 and provide a temperature signal; the temperature of the semiconductor device 200 may affect some memory operations.

In one exemplary embodiment, the memory arrays 220 may be controlled by one or more configuration registers. In other words, the use of these configuration registers selectively configures one or more memory arrays 220 into one or more matrix structures and/or Matrix Multiplication Units (MMUs), as described in greater detail herein. More directly, the configuration registers may enable the memory cell architecture within a memory array to dynamically change, for example, its structure, operation, and functionality. These and other variations will be apparent to those of ordinary skill in the art given the present disclosure.

FIG. 3 provides a more detailed side-by-side illustration of a memory array and matrix structure circuit configuration. The memory array and matrix architecture circuit configuration of fig. 3 both use the same array of memory cells, where each memory cell is comprised of a resistive element 302, which is coupled to a word line 304 and a bit line 306. In the first configuration 300, the memory array circuitry is configured to operate as a row decoder 316, a column decoder 318, and an array of memory cells 320. In the second configuration 350, the matrix structure circuitry is configured to operate as row drivers 317, Matrix Multiplication Units (MMUs) 319, and analog crossbar (matrix structure) 321. In one exemplary embodiment, a look-up table (LUT) and associated logic 315 may be used to store and configure different matrix multiplication unit coefficient values.

In one exemplary embodiment of the present disclosure, the memory array 320 is composed of resistive random access memory (ReRAM). ReRAM is a non-volatile memory that changes the resistance of a memory cell across a dielectric solid-state material, sometimes referred to as a "memristor". Current ReRAM technology can be implemented within a two-dimensional (2D) layer or a three-dimensional (3D) stack of layers; however, higher-order dimensions may be used in future iterations. The Complementary Metal Oxide Semiconductor (CMOS) compatibility of crossbar ReRAM technology may enable logic (data processing) and memory (storage) to be integrated within a single chip. In other possible configurations, a crossbar ReRAM array may be formed in a one-transistor/one-resistor (1T1R) configuration and/or a configuration in which one transistor drives n resistive memory cells (1TnR).

A variety of inorganic and organic material systems can achieve thermal and/or ionic resistive switching. In various embodiments, such systems may comprise: phase change chalcogenides (e.g., Ge2Sb2Te5, AgInSbTe, etc.); binary transition metal oxides (e.g., NiO, TiO2, etc.); perovskites (e.g., Sr(Zr)TiO3, PCMO, etc.); solid electrolytes (e.g., GeS, GeSe, SiOx, Cu2S, etc.); organic charge transfer complexes (e.g., Cu tetracyanoquinodimethane (TCNQ), etc.); organic charge acceptor systems (e.g., Al amino-dicyanoimidazole (AIDCN), etc.); and/or 2D (layered) insulating materials (e.g., hexagonal BN, etc.); among other possible systems for resistive switching.

In the illustrated embodiment, the resistive element 302 is a non-linear passive two-terminal electrical component that can change its resistance based on the history of applied current (e.g., hysteresis or memory). In at least one exemplary embodiment, the resistive element 302 may form or break a conductive filament in response to currents of different polarities applied to a first terminal (connected to the word line 304) and a second terminal (connected to the bit line 306). The presence or absence of the conductive filament between the two terminals changes the conductance between the terminals. While the present operation is presented in the context of a resistive element, one of ordinary skill in the relevant art will readily appreciate that the principles described herein may be implemented in any circuit characterized by a variable impedance (e.g., resistance and/or reactance). The variable impedance may be implemented by various linear and/or non-linear elements (e.g., resistors, capacitors, inductors, diodes, transistors, thyristors, etc.).

For purposes of illustration, the operation of the memory array 320 in the first configuration 300 is briefly summarized. During operation in the first configuration, a memory "write" may be achieved by applying current to memory cells corresponding to rows and columns of the memory array. The row decoder 316 may selectively drive the various row terminals to select a particular row of the memory array circuitry 320. Column decoder 318 may selectively sense/drive various column terminals to "read" and/or "write" to corresponding memory cells uniquely identified by the selected row and column (as highlighted by the thicker line width and darkened cell elements in fig. 3). As described above, the application of the electrical current causes the formation (or destruction) of a conductive filament within the dielectric solid material. In such a case, the low resistance state (on state) is used to represent a logic "1" and the high resistance state (off state) is used to represent a logic "0". To switch the ReRAM cell, a first current having a particular polarity, amplitude, and duration is applied to the dielectric solid state material. Subsequently, a memory "read" may be achieved by applying a second current to the resistive element and sensing whether the resistive element is in an on state or an off state based on the respective impedance. The memory read may or may not be destructive (e.g., the second current may or may not be sufficient to form or destroy the conductive filament).

One of ordinary skill in the relevant art will readily appreciate that the above discussion of the memory array 320 in the first configuration 300 is consistent with existing memory operations according to, for example, ReRAM memory technology. In contrast, the second configuration 350 performs a matrix multiplication operation using memory cells as the analog crossbar (matrix structure) 321. Although the exemplary embodiment of fig. 3 corresponds to a 2x4 Matrix Multiplication Unit (MMU), other variations may be substituted with equivalent success. For example, matrices of arbitrarily large size (e.g., 3x3, 4x4, 8x8, etc.) may be implemented (depending on the accuracy achieved by the digital-to-analog conversion (DAC) 308 and analog-to-digital conversion (ADC) 310 components).

In analog crossbar (matrix) 321 operation, each of the row terminals is driven simultaneously by an analog input signal, and each of the column terminals is sensed simultaneously for an analog output (which is an analog sum of voltage potentials across corresponding resistive elements for each row/column combination). Notably, in the second configuration 350, all of the row and column terminals associated with the matrix multiplication are active (as highlighted by the thicker line width and the darkened cell elements in fig. 3). In other words, the ReRAM crossbar (matrix structure) 321 uses the matrix structure to perform "analog computation" that computes vector-matrix products (or scalar-matrix products, matrix-matrix products, etc.).

Notably, the concurrent vector-matrix product computation within the crossbar is atomic. In particular, the analog computation of the vector-matrix product can be done in a single access cycle. As previously described, atomic operations are not affected by data race conditions. In addition, the vector-matrix product calculation simultaneously computes all rows and all columns of the matrix operation; in other words, the complexity of the vector-matrix product computation does not scale with the matrix dimensions. While manufacturing constraints (e.g., ADC/DAC granularity, manufacturing tolerances, etc.) may limit the precision and complexity that a single matrix structure can yield, multiple matrix operations may be mathematically combined to provide much higher precision and complexity.
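
As a concrete (if idealized) illustration of this analog computation, the following Python sketch models a 2x4 crossbar forming all of its column sums in a single step; the conductance matrix G and the driving voltages v are hypothetical values chosen only for illustration, and real devices would add DAC/ADC quantization and other non-idealities not modeled here.

```python
import numpy as np

# Hypothetical conductances (arbitrary units): one value per resistive element.
# Rows correspond to word lines (driven inputs), columns to bit lines (sensed outputs).
G = np.array([[0.9, 0.1, 0.5, 0.3],
              [0.2, 0.8, 0.4, 0.6]])

v = np.array([1.2, 0.7])   # analog voltages driving the two word lines

# Each bit line sums current contributions (voltage x conductance) from every
# word line, so all four dot products of the 2x4 matrix are formed concurrently.
i_out = v @ G
print(i_out)               # one analog sum per column terminal
```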

For example, in one exemplary embodiment of the present disclosure, the DAC 308 converts the input to the analog domain for analog computation, and the ADC 310 may convert the result back to the digital domain for subsequent digital and/or logic operations. In other words, the arithmetic logic unit 312 is capable of performing complex digital operations on the outputs of the matrix structure 321. This capability may be used in cases where the analog domain cannot achieve the required calculations due to practical implementation constraints (e.g., manufacturing costs, etc.).

Consider the illustrative example of fig. 4, where a simple "FFT butterfly" calculation 400 may be performed by a 2x4 matrix structure. While the conductance can be increased or decreased, it cannot be made "negative." As a result, subtraction may need to be performed in the digital domain. The FFT butterfly (eqn.1) is expressed as the multiplication of the following matrix (M) and vector (a):

EQN.1: M·a = [[1, 1], [1, -1]]·[a0, a1]' = [a0 + a1, a0 - a1]'

This simple FFT butterfly 400 of eqn.1 can be decomposed into two different matrices (eqn.2 and eqn.3) representing positive and negative coefficients.

EQN.2: M_pos·a = [[1, 1], [1, 0]]·[a0, a1]' = [a0 + a1, a0]'

EQN.3: M_neg·a = [[0, 0], [0, 1]]·[a0, a1]' = [0, a1]'

Eqn.2 and eqn.3 can be implemented as analog calculations with a matrix structure circuit. Once calculated, the resulting analog value can be converted back to the digital domain by the aforementioned ADC. Existing ALU operations can be used to perform subtraction in the digital domain (eqn.4):

EQN.4: M·a = M_pos·a - M_neg·a = [a0 + a1, a0 - a1]'

In other words, as shown in fig. 4, the 2x2 matrix may be further subdivided into a 2x2 positive matrix and a 2x2 negative matrix. The ALU may add/subtract the results of the 2x2 positive matrix and the 2x2 negative matrix to generate a single 2x2 matrix result. One of ordinary skill in the relevant art will readily appreciate the wide variety of operations and/or capabilities that an ALU may enable. For example, the ALU may provide arithmetic operations (e.g., add, subtract, add with carry, subtract with borrow, negate, increment, decrement, pass-through, etc.), bitwise operations (e.g., AND, OR, XOR, complement), and shift operations (e.g., arithmetic shift, logical shift, rotate, rotate through carry, etc.) to enable, for example, multiple-precision arithmetic, complex operations, and/or any extension of the MMU's capabilities to any level of precision, size, and/or complexity.
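
The following sketch (hypothetical values; Python used purely for illustration) mirrors the decomposition of fig. 4 and EQN.2-EQN.4: the negative coefficient is moved into a separate non-negative matrix, each half is evaluated as its own vector-matrix product, and the final subtraction is performed in the digital domain.

```python
import numpy as np

M = np.array([[1.0,  1.0],
              [1.0, -1.0]])            # 2x2 FFT butterfly (contains a negative coefficient)
a = np.array([3.0, 5.0])               # hypothetical input vector

# Split into non-negative "positive" and "negative" coefficient matrices, since a
# conductance cannot be negative (EQN.2 and EQN.3).
M_pos = np.where(M > 0,  M, 0.0)
M_neg = np.where(M < 0, -M, 0.0)

p = M_pos @ a                          # analog vector-matrix product of the positive half
n = M_neg @ a                          # analog vector-matrix product of the negative half

result = p - n                         # digital-domain subtraction in the ALU (EQN.4)
assert np.allclose(result, M @ a)      # matches the direct butterfly [a0 + a1, a0 - a1]
print(result)
```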

As used herein, the terms "digital" and/or "logic" in the context of computing refer to processing logic that uses quantized values (e.g., "0" and "1") to represent symbolic values (e.g., "on state," "off state"). Conversely, the term "analog" in the context of computing refers to processing logic that performs computation using continuously variable aspects of physical signaling phenomena, such as electrical, chemical, and/or mechanical quantities. Various embodiments of the present disclosure may represent analog input and/or output signals as continuous electrical signals. For example, the voltage potential may have different possible values (e.g., any value between a minimum voltage (0V) and a maximum voltage (1.8V), etc.). Combining analog calculations with digital components may be performed with digital-to-analog converters (DACs), analog-to-digital converters (ADCs), Arithmetic Logic Units (ALUs), and/or variable gain amplification/attenuation.

Referring again to fig. 3, to configure the memory cells into the crossbar (matrix structure) 321 of the second configuration 350, each resistive element may be written with a corresponding matrix coefficient value. Unlike the first configuration 300, the second configuration 350 can write different degrees of impedance (representing coefficient values) into each ReRAM cell using an amount of current having a polarity, magnitude, and duration selected to set a particular conductance. In other words, by forming/destroying conductive filaments having different conductivities, a plurality of different conductivity states can be established. For example, applying a first magnitude may result in a first conductance, applying a second magnitude may result in a second conductance, applying the first magnitude for a longer duration may result in a third conductance, and so on. Any permutation of the foregoing write parameters may be substituted with equivalent success. More directly, rather than using two (2) resistance states (on state, off state) to represent two (2) digital states (logic "1", logic "0"), a varying conductance may use multiple states (e.g., three (3), four (4), eight (8), etc.) to approximate a continuous range of values and/or a quantized range of values (e.g., [0, 0.33, 0.66, 1], [0, 0.25, 0.50, 0.75, 1], [0, 0.125, 0.250, …, 1], etc.).
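
A brief sketch of the multi-level idea follows; the level sets mirror the examples in the preceding paragraph, and the nearest-level mapping is an assumption about how coefficients might be quantized for illustration, not a description of an actual cell-programming algorithm.

```python
import numpy as np

def quantize_to_levels(coeffs, levels):
    """Map each matrix coefficient to the nearest programmable conductance level."""
    coeffs = np.asarray(coeffs, dtype=float)
    levels = np.asarray(levels, dtype=float)
    idx = np.abs(coeffs[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

# Four-level cell, as in the [0, 0.33, 0.66, 1] example above.
levels = [0.0, 0.33, 0.66, 1.0]
coeffs = [0.05, 0.40, 0.70, 0.95]
print(quantize_to_levels(coeffs, levels))   # -> [0.   0.33 0.66 1.  ]
```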

In one embodiment of the present disclosure, the matrix coefficient values are stored in advance in a look-up table (LUT) and configured by associated control logic 315. In the initial configuration phase, the matrix structure 321 is written with matrix coefficient values from the LUT by control logic 315. One of ordinary skill in the relevant art will readily appreciate that certain memory technologies may also implement write-once, use-many operation. For example, even though the conductive filament of a ReRAM cell may require a particular duration, magnitude, polarity, and/or direction of current flow to form (or destroy), subsequent use of the memory cell may be repeated many times (so long as the conductive filament is not substantially formed or destroyed during that use). In other words, the initial construction time may be amortized over many subsequent uses of the same matrix structure 321.

In addition, some memory technologies (e.g., ReRAM) are non-volatile. Thus, once the matrix structure circuit is programmed, it can enter a low power state (or even power down) when not in use to save power. In some cases, the non-volatility of the matrix structure may be exploited to further improve power consumption. In particular, unlike the prior art, where matrix coefficient values may be reloaded from non-volatile memory for subsequent processing, the exemplary matrix structure may store matrix coefficient values even when the memory device is powered down. In subsequent wake-ups, the matrix structure can be used directly.

In one exemplary embodiment, the matrix coefficient values may be derived from the characteristics of the matrix operation. For example, the coefficients for a particular matrix operation may be derived in advance based on its "size" (or other structurally defined parameters) and stored within the LUT. As just two such examples, the Fast Fourier Transform (FFT) (eqn.5) and the Discrete Cosine Transform (DCT) (eqn.6) are reproduced below:

EQN.5: X(k) = Σ_{n=0}^{N-1} x(n)·e^(-j·2π·k·n/N), k = 0, 1, …, N-1

EQN.6: T(i, j) = 1/√N for i = 0; T(i, j) = √(2/N)·cos[(2j + 1)·i·π/(2N)] for i = 1, …, N-1; j = 0, 1, …, N-1

Note that eqn.6 represents the DCT-II variant of the DCT and is commonly used for video/image compression. As can be determined mathematically from the foregoing equation, the matrix coefficient values (also referred to as "twiddle factors") are determined by the size of the transformation. For example, once the size of the DCT matrix is known, the values of the matrix can be set a priori (using, for example, eqn.6).
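
For reference, one common orthonormal form of the DCT-II transform matrix T (consistent with the D = TMT' usage later in this disclosure) can be generated as follows; this is a standard construction shown as an illustrative sketch, not the only possible normalization.

```python
import numpy as np

def dct_matrix(N):
    """N x N orthonormal DCT-II transform matrix T (one common form of EQN.6)."""
    T = np.empty((N, N))
    T[0, :] = 1.0 / np.sqrt(N)
    for i in range(1, N):
        for j in range(N):
            T[i, j] = np.sqrt(2.0 / N) * np.cos((2 * j + 1) * i * np.pi / (2 * N))
    return T

T8 = dct_matrix(8)                        # "twiddle factors" for an 8x8 DCT
assert np.allclose(T8 @ T8.T, np.eye(8))  # orthonormal: T * T' = identity
```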

In another exemplary embodiment, the matrix coefficient values may be stored in advance. For example, the coefficients of certain matrix multiplication operations may be known or defined by, for example, an application or user. For example, image processing calculations, such as those described in commonly owned and co-pending U.S. patent application No. 16/002,644 entitled "AN IMAGE PROCESSOR FORMED IN AN ARRAY OF MEMORY CELLS", filed on 7/6/2018, previously incorporated above, may define various matrix coefficient values to implement, for example, defect correction, color interpolation, white balance, color adjustment, gamma brightness, contrast adjustment, color conversion, down sampling, and/or other image signal processing operations.

In another example, the coefficients for certain matrix multiplication operations may be determined or defined by, for example, user considerations, environmental considerations, other devices, and/or other network entities. For example, wireless devices often experience different multipath effects that may interfere with operation. Various embodiments of the present disclosure determine multipath effects and correct them with matrix multiplication. In some cases, the wireless device may calculate each of the separate different channel effects based on degradation of known signaling. The difference between the expected reference channel signal and the actual reference channel signal may be used to determine the noise contribution it experiences (e.g., attenuation, reflection, scattering, and/or other noise contributions over a particular frequency range). In other embodiments, the wireless device may be instructed to use a predetermined "codebook" of beamforming configurations. The codebook of beamforming coefficients may be less accurate but may be preferred for other reasons (e.g., speed, simplicity, etc.).

As mentioned previously, the matrix coefficient values are stored in advance in a look-up table (LUT) and configured by the associated control logic 315. In one exemplary embodiment, the matrix structure may be configured by dedicated hardware logic. Such internal hardware logic may not be limited by processor word size; thus, the matrix coefficient values for any dimension may be configured simultaneously (e.g., 4x4, 8x8, 16x16, etc.). While the present disclosure is presented in the context of internal control logic 315, external embodiments may be substituted with equivalent success. For example, in other embodiments, the logic includes an internal processor-in-memory (PIM) that can set the matrix coefficient values based on the LUT values in a series of reads and writes. In other examples, an external processor may perform the LUT and/or logic functions.

A memory array configured to perform a particular matrix multiplication operation is described below with reference to fig. 5A-5B and 6A-6B.

In various embodiments, aspects of the present disclosure may be applied to video/image compression or other operations, whether lossy or not. Discrete Cosine Transform (DCT) is used by various image/video processing and compression techniques, including JPEG, to represent pixel values (e.g., Y, Cb or Cr values) of an image as a sum of different cosine functions. For example, an image may be decomposed into 8x8 pixel blocks, where each of the 64 pixels in the block is represented by values for luminance (Y) and chrominance (Cb, Cr). A DCT may be applied to each 8x8 block to obtain a matrix D of DCT frequency coefficients for that block of pixels. This matrix of DCT frequency coefficients is then used to perform the actual compression of the image block. Note that DCT can be similarly applied to other sized pixel blocks (e.g., 16x16 pixel blocks).

The DCT of an NxN pixel block may be performed using the matrix multiplication operation D = TMT', where T denotes the DCT transform matrix (whose values may be calculated using eqn.6), T' denotes the transpose of the T matrix, M denotes the matrix of pixel values (e.g., Y values, Cb values, Cr values, etc.), and D denotes the matrix of DCT frequency coefficients for the pixel block. This operation can be decomposed into two separate matrix-matrix multiplication stages/operations: a first operation (eqn.7) and a second operation (eqn.8).

EQN.7: B = TM

EQN.8: D = BT'

The individual values of the matrices in the first and second matrix-matrix multiplication operations are explicitly written below.

Each matrix-matrix multiplication operation may be performed using a separate matrix-vector multiplication operation, and there are various ways to deconstruct the two matrix-matrix multiplication operations of eqn.7 and eqn.8 into a matrix-vector operation, as will be understood by those of ordinary skill in the art given the present disclosure. And as previously described, matrix-vector multiplication operations may be performed using matrix structure(s) and MMU(s).

In one exemplary embodiment, the individual columns of the B matrix may be calculated using eqn.7a.

EQN.7A: b_cn = T·m_cn

where the vector b_cn represents column n of the B matrix, and the vector m_cn represents column n of the M matrix.

For example, the matrix-vector multiplication operation b_c0 = T·m_c0 may be used to calculate the first column of the B matrix (b_00, b_10, b_20 … b_n0), as follows.

By performing the b_cn = T·m_cn matrix-vector operation n times (once for each of the n columns of B), the entire B matrix can be obtained. Similarly, to obtain D = BT', the rows of the D matrix can be calculated using eqn.8a:

EQN.8A: d_rn = T·b_rn

where the vector d_rn represents row n of the D matrix, and the vector b_rn represents row n of the B matrix.

For example, the first row of the D matrix (d_00, d_01, d_02 … d_0n) may be calculated using matrix-vector multiplication as shown below. Note that the matrix used in this operation is T, not T'.
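
The following sketch ties EQN.7A and EQN.8A together numerically: stage one builds B column by column, stage two builds D row by row, and the result is checked against the direct product D = TMT'. The pixel values and the dct_matrix helper are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II matrix T (same construction as the earlier sketch)."""
    j = np.arange(N)
    T = np.sqrt(2.0 / N) * np.cos((2 * j[None, :] + 1) * j[:, None] * np.pi / (2 * N))
    T[0, :] = 1.0 / np.sqrt(N)
    return T

def dct_two_stage(M, T):
    """Compute D = T M T' using only matrix-vector products (EQN.7A, then EQN.8A)."""
    N = T.shape[0]
    B = np.empty((N, N))
    for n in range(N):                 # stage 1: b_cn = T . m_cn (columns of B)
        B[:, n] = T @ M[:, n]
    D = np.empty((N, N))
    for n in range(N):                 # stage 2: d_rn = T . b_rn (rows of D)
        D[n, :] = T @ B[n, :]
    return D

rng = np.random.default_rng(0)
M = rng.integers(0, 256, size=(8, 8)).astype(float)   # hypothetical 8x8 pixel block
T = dct_matrix(8)
assert np.allclose(dct_two_stage(M, T), T @ M @ T.T)
```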

Fig. 5A and 5B illustrate one exemplary embodiment of a matrix structure and MMU circuit 500 that may be used to perform a DCT on an NxN matrix M using the matrix-vector operations described above in eqn.7a and 8A. Note that for simplicity, the matrix structure 521 of fig. 5A is shown as a 2x4 cell array. However, the matrix structure 521 represents an array of size Nx2N, wherein a first NxN matrix structure portion 521A may be configured with the positive values of the matrix T and a second NxN matrix structure portion 521B may be configured with the negative values of the matrix T.

In one embodiment, the matrix structure 521 is an 8x16 cell array, the first matrix structure portion 521A is configured with the positive values of an 8x8 T matrix, and the second matrix structure portion 521B is configured with the negative values of the 8x8 T matrix. The 8x16 matrix structure 521 may be used to perform a DCT operation on an 8x8 matrix M representing the pixel values of an 8x8 pixel block.

In another implementation, the matrix structure 521 is a 16x32 cell array, programmed with the positive and negative values of a 16x16 T matrix, and configured to perform DCT operations on a 16x16 matrix M.

In other implementations, the size of the matrix structure 521 may be scaled up or down as appropriate in order to perform a DCT operation on a matrix M representing pixel blocks of various other sizes. As previously mentioned, the matrix multiplication operations of the present disclosure can be performed on very large matrices because they do not face the time exponential scaling problems of conventional methods. Thus, according to aspects of the present disclosure, a matrix structure-based DCT operation may be performed on very large blocks of pixels (e.g., 64x64 pixels) in a reasonable amount of time.

In the system 500 of fig. 5A, the matrix structure 521 (including the first portion 521A and the second portion 521B) is programmed with positive and negative values of the DCT transformation matrix T using control logic 515. The value of the T matrix depends on the size of the matrix and may be calculated using eqn.6, or pre-stored in a look-up table (LUT), as desired. In one embodiment, the T values for commonly used matrix sizes (e.g., 8x8 and 16x16) are stored in the LUT. The values of the matrix M represent values (e.g., luminance values) of individual pixels of a block of pixels, and may be stored in the memory array 522.

To perform the first phase of the DCT operation (EQN.7 or 7A), a vector m_cn representing an individual column of the matrix M can be extracted from the memory array 522 and fed sequentially by the control logic 515 to the row drivers 517, where it is converted into analog input signals by the DACs 508 of the row drivers 517. These converted analog input signals are then used to drive all of the row terminals of the matrix structure 521 simultaneously. The matrix structure results are output to the MMU 519, which converts the results to digital values via its ADC 510. The digital values are then used by the MMU's Arithmetic Logic Unit (ALU) 512 to obtain the individual b_cn vectors (from eqn.7a), which represent the columns of the B matrix. If necessary, the values of each b_cn vector may be temporarily stored in the memory array 522 until the entire B matrix is obtained.

Once the entire B matrix is computed, the second phase of the DCT operation (eqn.8 or 8A) may be performed using the same matrix structure 521 (configured with the T matrix values). Each vector b_rn representing an individual row of the matrix B is fed back by the control logic 515 to the row drivers 517, converted by the DACs 508, and used to drive all row inputs of the matrix structure 521 simultaneously. The MMU 519 (containing the ADC 510 and ALU 512) is then used to obtain the d_rn vectors, which are stored in the memory array 522 until all values of the D matrix are calculated. Thus, the system 500 of fig. 5A may be used to compute the DCT frequency matrix coefficients of an NxN matrix M using a single Nx2N matrix structure 521 configured with the values of the DCT transformation matrix T, by first running a series of matrix-vector operations through the matrix structure 521 and the MMU 519 to obtain a first matrix-matrix product B, and then running a second series of matrix-vector operations through the matrix structure 521 and the MMU 519 to obtain a second matrix-matrix product D.

As shown in FIG. 5B, another system 550 may be used to perform matrix-matrix multiplication operations by running several matrix-vector multiplication operations in parallel using two or more matrix structures 521. In one embodiment, the system 550 of FIG. 5B includes a plurality of matrix structures 521, the matrix structures 521 connected to their own row drivers 517 and MMU 519. In one variation, the plurality of matrix structures 521 includes a vertical stack of, for example, crossbar matrix arrays, each crossbar matrix array configured as a matrix structure 521. In another variant, the plurality of matrix structures may be implemented as a matrix array disposed on a single level, or even a common semiconductor die.

It will also be appreciated that although each matrix structure 521 of FIG. 5B is connected to its own MMU 519, multiple matrix structures 521 may be connected to a single MMU 519. For example, a single MMU of sufficient capacity may be used to process the output of each matrix structure 521 in parallel. Furthermore, a common row driver arrangement 517 may be used (alternatively, as shown, each matrix structure 521 may be driven independently using discrete row driver circuits).

In one embodiment, using the memory structure system 550, the entire matrix-matrix multiplication stage of eqn.7 can be performed in the time it takes to drive one matrix structure 521 once, since all n matrix-vector operations of eqn.7a can be performed simultaneously. In this embodiment of the apparatus, the control logic 515 is configured to obtain the n separate vectors m_cn and feed them to the n individual row drivers simultaneously. Each individual row driver uses its m_cn vector to simultaneously drive the row terminals of its respective matrix structure 521. The matrix structures 521 and their respective MMUs 519 output the values of the b_cn vectors to the memory array 522. Thus, the individual columns of matrix B are all computed in parallel and stored simultaneously in the memory array 522. Similarly, in the second stage of the DCT operation, all rows of matrix D may be obtained by performing the matrix-vector multiplications of eqn.8a in parallel using the multi-matrix architecture system 550 of fig. 5B.

Eqns.7a and 8A show one way to deconstruct the two matrix-matrix multiplications required for a DCT operation. However, the DCT matrix-matrix multiplication operations (eqns.7 and 8) may be deconstructed using matrix-vector multiplication operations other than eqns.7a and 8A, as will be recognized by one of ordinary skill in the art given this disclosure.

In one exemplary embodiment, the values of matrix B may be calculated using the following equation:

EQN.7B: b_rn = M'·t_rn

where the vector b_rn represents row n of the B matrix, the vector t_rn represents row n of the T matrix, and M' represents the transpose of the M matrix. For example, the first row of the B matrix (b_00, b_01, b_02 … b_0n) may be calculated using matrix-vector multiplication as shown below.
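
As a quick numerical check of EQN.7B (with arbitrary illustrative matrices, not values from the disclosure), driving M' with the rows of T recovers the rows of B = TM one at a time:

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.standard_normal((4, 4))        # stand-in transform matrix
M = rng.standard_normal((4, 4))        # stand-in data matrix

B_ref = T @ M                          # direct matrix-matrix product for comparison
for n in range(4):
    b_rn = M.T @ T[n, :]               # EQN.7B: one matrix-vector product per row of B
    assert np.allclose(b_rn, B_ref[n, :])
```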

Fig. 6A and 6B illustrate another embodiment of a matrix structure and MMU circuitry that may be used to perform a DCT on an NxN matrix M using the matrix-vector operations described in eqn.7b and 8A. In one configuration, during the first phase of the DCT operation, the matrix-vector multiplication of eqn.7b may be implemented using the first (left) portion of the circuit 600 shown in figure 6A. The first matrix structure 621 is configured by control logic 615 to hold the values of the M' matrix (divided into a positive M' matrix 621A and a negative M' matrix 621B). The values of the M' matrix correspond to the pixel values of a particular image block; they are not known or pre-computed in advance, and thus are not stored in a look-up table (LUT). Instead, the values of the M' matrix may be stored in a memory array 622 of the device. The control logic 615 may access the memory array 622 and configure the first matrix structure 621 using the M' values stored in the memory array 622.

After configuring the first matrix structure 621, the control logic 615 provides the individual vectors t_rn (e.g., vectors t_r0, t_r1, etc.) to the first row driver 617, which converts the vector values from digital to analog (using the DAC 608) and uses the analog signals to simultaneously drive the row terminals of the first matrix structure 621. The vector values t_rn depend on the size of the matrix T and may be stored in advance in a look-up table (LUT). The first matrix structure 621 and the first MMU 619 are used to sequentially obtain all of the b_rn vectors (representing the rows of matrix B).

In one variant, the values of the b_rn vectors may be temporarily stored in the memory array 622 and then provided from the memory array 622 to the second row driver. In another variant, the b_rn vector values are fed directly from the first MMU 619 into the second row driver 637. The second phase of the DCT operation (eqn.8) may be implemented using the second (right) portion of the memory system 600, including the second row driver 637, the second matrix structure 641, and the second MMU 639.

During the second phase of the DCT operation, each vector b_rn is fed to the second row driver 637 via the control logic 615, converted by the DACs 628, and used to drive all row inputs of the second matrix structure 641 simultaneously. The respective vector outputs of the first operation (eqn.7b) are used as the respective vector inputs of the second operation (eqn.8a). In other words, unlike the embodiment in fig. 5A and 5B, the values of the entire B matrix need not be accumulated in the memory array 622 before the second operation is performed. The second matrix structure 641 and the second MMU 639 use the b_rn vectors to obtain the d_rn vectors, which are stored in the memory array 622 until all values of the D matrix are calculated. Thus, the system 600 of fig. 6A may be used to compute a DCT for an NxN matrix M using two consecutive Nx2N matrix structures 621 and 641, without having to accumulate the intermediate product B between the first and second operations of the DCT.

FIG. 6B shows yet another embodiment, wherein the system 650 is configured such that all matrix-vector multiplication operations of eqn.7b may be performed in parallel, and all matrix-vector multiplication operations of eqn.8a may be performed in parallel. All rows of matrix B may be computed in parallel using multiple (e.g., stacked or common stage/die) first row driver/first matrix structure/first MMU circuits, as shown in fig. 6B. The b_rn vectors are then input to the respective second row driver/second matrix structure/second MMU circuits such that all rows of the D matrix are computed in parallel using eqn.8a. All of the individual rows of the D matrix are obtained from the second MMUs 639 and stored to the memory array 622 simultaneously. Thus, the entire two-stage DCT operation may be performed in approximately the time it takes to pass through the matrix structure/MMU circuit twice.

Fig. 7A is a logical block diagram of one exemplary implementation of a processor-memory architecture 700, according to various principles described herein. As shown in fig. 7A, the processor 702 is coupled to the memory 704; the memory contains a look-up table (LUT)706, control logic 708, a matrix structure and corresponding Matrix Multiplication Unit (MMU)710, and a memory array 712.

In one embodiment, the LUT 706 stores a plurality of matrix value coefficients, dimensions, and/or other parameters associated with different matrix operations. In one exemplary embodiment, the LUT 706 stores a plurality of Discrete Cosine Transform (DCT) coefficients (e.g., a "twiddle factor") associated with different DCT dimensions.

In another exemplary embodiment, the LUT 706 stores a plurality of Fast Fourier Transform (FFT) "twiddle factors," where various subsets of the twiddle factors are associated with different FFT dimensions. For example, the LUT 706 storing the twiddle factors for a 64-point FFT has 64 coefficient values that contain all 32 coefficients used in a 32-point FFT, and all 16 coefficients used in a 16-point FFT, and so on.
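
This nesting property is easy to verify numerically; the sketch below is illustrative and assumes the usual definition of the DFT twiddle factors W_N^k = e^(-j2πk/N).

```python
import numpy as np

def twiddles(N):
    """Twiddle factors W_N^k = exp(-2*pi*j*k/N) for k = 0 .. N-1."""
    return np.exp(-2j * np.pi * np.arange(N) / N)

w64, w32, w16 = twiddles(64), twiddles(32), twiddles(16)

# The 32-point (and 16-point) twiddle factors all appear among the 64-point ones.
assert np.allclose(w64[::2], w32)
assert np.allclose(w64[::4], w16)
```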

In other embodiments, the LUT 706 stores a plurality of different matrix coefficient values for Image Signal Processing (ISP), such as defect correction, color interpolation, white balance, color adjustment, gamma brightness, contrast adjustment, color conversion, down sampling, and/or other image signal processing operations. In yet another embodiment of the LUT 706, the LUT 706 may contain various channel matrix codebooks that may be predefined and/or empirically determined based on radio channel measurements.

In one embodiment, control logic 708 controls the operation of matrix structure and MMU 710 based on instructions received from processor 702. In one exemplary embodiment, the control logic 708 may form/destroy conductive filaments having different conductivities within each memory cell of the matrix structure in accordance with the aforementioned matrix dimensions and/or matrix value coefficients provided by the LUT 706. In addition, control logic 708 may configure the respective MMUs to perform any additional arithmetic and/or logical operations of the matrix structure. In addition, the control logic 708 may select one or more digital vectors to drive the matrix structure and select one or more digital vectors to store the logical output of the MMU.

In processing technology, "instructions" typically contain different types of "instruction syllables," such as opcodes, operands, and/or other associated data structures (e.g., registers, scalars, vectors).

As used herein, the term "opcode" (opcode) refers to an instruction that may be interpreted by processor logic, memory logic, or other logic circuitry to effect an operation. More directly, an opcode identifies an operation to be performed on one or more operands (inputs) to generate one or more results (outputs). Both operands and results may be embodied as data structures. Common examples of data structures include, but are not limited to: scalars, vectors, arrays, lists, records, unions, objects, graphs, trees, and/or any number of other forms of data. Some data structures may contain, in whole or in part, reference data ("data pointing to" other data). Common examples of reference data structures include, for example, pointers, indices, and/or descriptors.

In one exemplary embodiment, the opcode may identify one or more of the following: a matrix operation, a dimension of the matrix operation, and/or a row and/or a column of memory cells. In one such variation, the operand is an encoded identifier that specifies one or more number vectors to be operated on.

For example, an instruction that performs DCT processing on an input digital 8x8 matrix and stores the result in an output digital matrix may contain an opcode and operands: DCT8X8($input, $output), where DCT8X8 identifies the size and nature of the operation, $input identifies the base address of the input digital matrix, and $output identifies the base address of the output digital matrix. In another such example, the DCT process may be divided into several different atomic operations, such as: (i) DCT8X8($address), which converts the memory array at $address into an 8x8 (64-point) matrix structure; (ii) ROW8($input, $output), which decomposes the 8x8 digital matrix at $input into 8 separate digital vectors ($output); and (iii) MULT($address, $input, $output), which stores the vector-matrix product of the vector at $input and the matrix structure at $address to $output.
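
Purely as an illustration of how such opcodes and operands might be dispatched by control logic, the sketch below reuses the opcode names from the examples above; the handler bodies, signatures, and address types are hypothetical placeholders, not an actual memory-device API.

```python
# Hypothetical opcode handlers; in a real device these would configure the matrix
# fabric, drive the row DACs, and collect the MMU output.
def op_dct8x8(memory, in_addr, out_addr):
    """Placeholder: run an 8x8 DCT on the matrix at in_addr, store the result at out_addr."""
    return ("DCT8X8", in_addr, out_addr)

def op_mult(memory, fabric_addr, in_addr, out_addr):
    """Placeholder: vector-matrix product of the vector at in_addr with the fabric at fabric_addr."""
    return ("MULT", fabric_addr, in_addr, out_addr)

OPCODES = {"DCT8X8": op_dct8x8, "MULT": op_mult}

def dispatch(memory, opcode, *operands):
    try:
        handler = OPCODES[opcode]
    except KeyError:
        raise ValueError(f"unknown opcode: {opcode}") from None
    return handler(memory, *operands)

print(dispatch({}, "DCT8X8", 0x1000, 0x2000))
```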

In one embodiment, the memory-processor architecture 700 may be used to perform DCT video processing on successive video frames by computing one video frame (e.g., an I-frame) in the memory 704, powering down the processor 702 at least in part (or allocating it to other tasks), and then waking up the processor to compute the next frame (e.g., a B-frame). Since each DCT matrix operation is performed in memory and video frame data always resides in memory, memory bandwidth load is greatly reduced.

FIG. 7A illustrates an instruction interface that is functionally independent and distinct from an input/output (I/O) memory interface. In one such embodiment, the instruction interfaces may be physically distinct (e.g., have distinct dies and/or connections). In other embodiments, the instruction interface may be multiplexed with the I/O memory interface (e.g., sharing the same control signaling, and address and/or data buses, but in a different communication mode). In other embodiments, the instruction interface may be virtually accessible through the I/O memory interface (e.g., as registers located within an address space addressable through the I/O interface). Other variations may be substituted by one of ordinary skill in the art given the present disclosure.

In one embodiment, the matrix structure and MMU 710 are tightly coupled to the memory array 712 to read and write digital vectors (operands). In one exemplary embodiment, operands are identified for dedicated data transfer hardware (e.g., Direct Memory Access (DMA)) to and from the matrix structure and MMU 710. In one exemplary variation, the digital vector of data may be of any dimension and is not limited by processor word size. For example, the operand may specify an N-bit operand (e.g., 2, 4, 8, 16 …, etc.). In other embodiments, the DMA logic 708 may read/write to the matrix structure 710 using an existing memory row/column bus interface. In other embodiments, the DMA logic 708 may use existing address/data within the internal memory interface and read/write control signaling to read/write to the matrix structure 710.

FIG. 7B provides a logic flow diagram of an exemplary set of matrix operations 750 in the context of the exemplary embodiment 700 depicted in FIG. 7A. As shown therein, the processor 702 writes an opcode (e.g., a token identifying the matrix M_x,y) and operands (e.g., digital vectors a, b) to the memory 704 via the interface 707.

Control logic 708 determines whether the matrix structure and/or Matrix Multiplication Unit (MMU) should be configured/reconfigured. For example, a portion of the memory array is converted into one or more matrix structures and weighted with the associated matrix coefficient values defined by the matrix M_x,y. The digital-to-analog (DAC) row drivers and analog-to-digital (ADC) sense amplifiers associated with the matrix structure may need to be adjusted for dynamic range and/or amplification. Additionally, one or more MMU ALU components may be coupled to the one or more matrix structures.

When the matrix structure and/or Matrix Multiplication Unit (MMU) is appropriately configured, the input operand a is converted by the digital-to-analog converter (DAC) and applied to the matrix structure M_x,y for analog computation. The analog results may then be converted back using analog-to-digital (ADC) conversion for subsequent logic operations by the MMU ALU. The output is written to the output operand b.

Fig. 7C provides an alternative logic flow diagram of an exemplary set of matrix operations 760 in the context of the exemplary embodiment 700 depicted in fig. 7A. In contrast to the flow diagram of FIG. 7B, the system of FIG. 7C uses explicit instructions to convert the memory array into a matrix structure. Providing a further degree of atomicity in instruction behavior may realize various related benefits, including, for example, pipeline design and/or reduced instruction set complexity.

More directly, once the matrix structure contains the appropriate matrix value coefficients M_x,y, the matrix operation can be repeated efficiently. For example, image processing calculations, such as those described in commonly owned and co-pending U.S. patent application No. 16/002,644 entitled "AN IMAGE PROCESSOR FORMED IN AN ARRAY OF MEMORY CELLS", filed on 7/6/2018, previously incorporated above, may configure a plurality of matrix structures and MMU processing elements to pipeline, for example, defect correction, color interpolation, white balancing, color adjustment, gamma brightness, contrast adjustment, color conversion, downsampling, and/or other image signal processing operations. Each pipeline stage may be configured once and used repeatedly for each pixel (or group of pixels) of the image. For example, a white balance pipeline stage may operate on each pixel of data using the same matrix structure with matrix coefficient values set for white balance; the color adjustment pipeline stage may operate on each pixel of data using the same matrix structure with matrix coefficient values set for color adjustment, and so on. In another such example, a DCT of an 8x8 matrix may be processed, for example, in two stages, each stage containing, for example, eight atomic MMU computations using the same DCT transformation matrix T (as described above).
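
A toy sketch of the configure-once/reuse-many-times pipelining idea follows. The stage matrices (per-channel white balance gains and a color-adjustment mixing matrix) are made-up values standing in for coefficients that would actually be loaded into matrix structures.

```python
import numpy as np

white_balance = np.diag([1.10, 1.00, 0.90])        # hypothetical per-channel gains
color_adjust = np.array([[1.00, 0.05, 0.00],
                         [0.02, 1.00, 0.03],
                         [0.00, 0.04, 1.00]])       # hypothetical 3x3 mixing matrix

# Each "fabric" is configured once; every pixel vector then streams through both stages.
pipeline = [white_balance, color_adjust]

def process_pixel(rgb):
    v = np.asarray(rgb, dtype=float)
    for stage in pipeline:                          # same coefficients reused per pixel
        v = stage @ v
    return v

print(process_pixel([200.0, 128.0, 64.0]))
```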

Moreover, one of ordinary skill in the relevant art will further appreciate that some matrix structures may have additional versatility and/or utility beyond their original configuration. For example, as previously described, a 64-point FFT has 64 coefficient values that encompass all 32 coefficients used in a 32-point FFT. Thus, a matrix structure configured for 64-point operation can be reused for 32-point operation, with the 32-point input operand a being applied appropriately on the appropriate row of the 64-point FFT matrix structure. Similarly, the FFT twiddle factors are a superset of Discrete Cosine Transform (DCT) twiddle factors; therefore, the DCT result can also be computed using the FFT matrix structure (applying the input operand a appropriately).

Other permutations and/or variations of the foregoing examples will be apparent to those of ordinary skill in the relevant art given the present disclosure.

Method-

Referring now to FIG. 8, a logic flow diagram of one exemplary method 800 of converting a memory array to a matrix structure and performing matrix operations therein is shown.

At step 802 of method 800, a memory device receives one or more instructions. In one embodiment, a memory device receives an instruction from a processor. In one such variation, the processor is an Application Processor (AP) commonly used in consumer electronics. In other such variations, the processor is a baseband processor (BB) commonly used in wireless devices.

In short, a so-called "application processor" is a processor configured to execute an Operating System (OS) and one or more applications, firmware, and/or software. The term "operating system" refers to software that controls and manages access to hardware. The OS typically supports processing functions such as task scheduling, application execution, input and output management, memory management, security, and peripheral access.

A so-called "baseband processor" is a processor configured to communicate with a wireless network through a communication protocol stack. The term "communication protocol stack" refers to the software and hardware components that control and manage access to wireless network resources. The communication protocol stack typically includes, but is not limited to: physical layer protocols, data link layer protocols, medium access control protocols, network and/or transport protocols, and the like.

Other peripheral and/or coprocessor configurations may be similarly replaced with equivalent success. For example, server devices typically include multiple processors that share common memory resources. Similarly, many common device architectures pair a general-purpose processor with a dedicated coprocessor and a shared memory resource, such as a graphics engine or Digital Signal Processor (DSP). Common examples of such processors include, but are not limited to: a Graphics Processing Unit (GPU), a Video Processing Unit (VPU), a Tensor Processing Unit (TPU), a neural Network Processing Unit (NPU), a Digital Signal Processor (DSP), and an Image Signal Processor (ISP). In other embodiments, the memory device receives instructions from an Application Specific Integrated Circuit (ASIC) or other form of processing logic, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a camera sensor, an audio/video processor, and/or a media codec (e.g., image, video, audio, and/or any combination thereof).

In one exemplary embodiment, the memory device is a resistive random access memory (ReRAM) arranged in a "crossbar" row-column configuration. Although various embodiments described herein assume particular memory technologies and particular memory structures, one of ordinary skill in the relevant art given the present disclosure will readily appreciate that the principles described herein may be broadly extended to other technologies and/or structures. For example, certain programmable logic structures, such as those typically used in Field Programmable Gate Arrays (FPGAs) and Programmable Logic Devices (PLDs), may have similar features to memory in terms of capabilities and topology. Similarly, some processors and/or other memory technologies may vary resistance, capacitance, and/or inductance; in this case, the analog calculation may be performed using the changed impedance characteristic. In addition, while the "crossbar" based architecture provides a physical structure that is well suited to two-dimensional (2D) matrix structures, other topologies may also be well suited to higher order mathematical operations (e.g., matrix-matrix products stacked via three-dimensional (3D) memory, etc.).

In one exemplary embodiment, the memory device further includes a controller. The controller receives one or more instructions and parses each instruction into one or more instruction components (also commonly referred to as "instruction syllables"). In one exemplary embodiment, an instruction syllable includes at least one opcode and one or more operands. For example, an instruction may be parsed into an opcode, a first source operand, and a destination operand. Other common examples of instruction components may include, but are not limited to, a second source operand (for binary operations), a shift amount, an absolute/relative address, a register (or other reference to a data structure), an immediate data structure (i.e., a data structure provided within the instruction itself), a dependent function, and/or a branch/link value (e.g., executed depending on whether the instruction completed or failed).
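
The following sketch shows one way such parsing could look for the textual instruction format used in the examples earlier (e.g., DCT8X8($input, $output)); the format and the helper are hypothetical, shown only to make the opcode/operand split concrete.

```python
import re

def parse_instruction(text):
    """Split an instruction like 'DCT8X8($input, $output)' into an opcode and an operand list."""
    m = re.fullmatch(r"\s*([A-Za-z0-9_]+)\s*\(([^)]*)\)\s*", text)
    if m is None:
        raise ValueError(f"malformed instruction: {text!r}")
    opcode = m.group(1)
    operands = [op.strip() for op in m.group(2).split(",") if op.strip()]
    return opcode, operands

print(parse_instruction("DCT8X8($input, $output)"))
# -> ('DCT8X8', ['$input', '$output'])
```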

In one embodiment, each received instruction corresponds to an atomic memory controller operation. As used herein, an "atomic" instruction is an instruction that completes within a single access cycle. In contrast, a "non-atomic" instruction is an instruction that may or may not complete within a single access cycle. Even though a non-atomic instruction might complete in a single cycle, it must be treated as non-atomic to prevent data race conditions. A race condition occurs where data accessed by one processor instruction (read or write) can be accessed by another processor instruction before the first processor instruction has an opportunity to complete; race conditions may unpredictably lead to data read/write errors. In other words, an atomic instruction guarantees that its data cannot be observed in an incomplete state.

In one exemplary embodiment, the atomic instruction may identify a portion of a memory array to be converted into a matrix structure. In some cases, the atomic instructions may identify characteristic attributes of the matrix structure. For example, the atomic instruction may identify the portion of the memory array based on, for example, a location within the memory array (e.g., by offset, row, column), a size (number of rows, number of columns, and/or other dimensional parameters), a granularity (e.g., precision and/or sensitivity). Notably, atomic instructions can provide very fine control over memory device operation; this may be desirable where memory device operation may be optimized in view of various application specific considerations.

In other embodiments, a non-atomic instruction may specify the portion of the memory array to be converted into a matrix structure. For example, the non-atomic instructions may specify various requirements and/or constraints for the matrix structure. The memory controller may internally allocate resources to accommodate the requirements and/or constraints. In some cases, the memory controller may additionally prioritize and/or de-prioritize instructions based on current memory usage, memory resources, controller bandwidth, and/or other considerations. Such implementations may be particularly useful where memory device management would otherwise unnecessarily burden the processor.

In one embodiment, the instruction specifies a matrix operation. In one such variation, the matrix operation may be a vector-matrix product. In another variant, the matrix operation may be a matrix-matrix product. Other variations may also be substituted by one of ordinary skill in the relevant art, given the present disclosure. Such variations may include, for example, scalar-matrix products, higher order matrix products, and/or other transformations including, for example, linear shifts, rotations, reflections, and translations.

As used herein, the term "transform" or the like refers to a mathematical operation that converts an input from a first domain to a second domain. The transformation may be "injective" (each element of the first domain has a unique element in the second domain), "flood" (each element of the second domain has a unique element in the first domain), or "bijective" (a unique one-to-one mapping of elements from the first domain to the second domain).

More complex mathematically defined transforms that are often used in the field of computing include Fourier transforms (and derivatives thereof, such as Discrete Cosine Transforms (DCTs)), Hilbert transforms, Laplace transforms, and Legendre transforms. In one exemplary embodiment of the present disclosure, matrix coefficient values for the mathematically defined transform may be calculated in advance and stored in a look-up table (LUT) or other data structure. For example, twiddle factors for Fast Fourier Transforms (FFTs) and/or DCTs may be calculated and stored in the LUT. In other embodiments, matrix coefficient values for the mathematically defined transform may be calculated by the memory controller during the matrix structure conversion process (or in preparation therefor).

Other transformations may not be based on a mathematical definition per se, but may instead be defined based on, for example, an application, another device, and/or a network entity. Such transformations may be commonly used for encryption, decryption, geometric modeling, mathematical modeling, neural networks, network management, and/or other graph-theory based applications. For example, a wireless network may use a codebook of predetermined antenna weight matrices to signal the most common beamforming configurations. In other instances, certain types of encryption may agree upon and/or negotiate between different encryption matrices. In such embodiments, the codebook or matrix coefficient values may be agreed upon in advance, exchanged out-of-band, exchanged in-band, or even arbitrarily determined or negotiated.

Empirically determined transformations may also be substituted with equivalent success given the present disclosure. For example, empirically derived transforms often used in the computing arts include the environmental effects of radio channel coding, image signal processing, and/or other mathematical modeling. For example, a multipath radio environment may be characterized by measuring the effect of a channel on, for example, a reference signal. The resulting channel matrix may be used to constructively interfere with signal reception (e.g., increase signal strength) while destructively interfering with interference (e.g., reduce noise). Similarly, the overall color balance of an image with skewed tones can be evaluated and mathematically corrected. In some cases, the image may be intentionally skewed based on, for example, user input in order to impart an aesthetic "warmth" to the image.

Various embodiments of the present invention may implement "unary" operations within a memory device. Other embodiments may implement "binary" or even higher order "N-ary" matrix operations. As used herein, the terms "unary," "binary," and "N-ary" refer to operations that take one, two, or N input data structures, respectively. In some embodiments, binary and/or N-ary operations may be subdivided into one or more unary matrix in-place operations. As used herein, an "in-place" operator refers to a matrix operation that stores or converts its result into its own state (e.g., its own matrix coefficient values). For example, a binary operation may be decomposed into two (2) unary operations: a first in-place unary operation is performed (the result is stored "in place"); thereafter, a second unary operation may be performed on the matrix structure to produce the binary result (e.g., a multiply-accumulate operation).

Other embodiments may serialize and/or parallelize matrix operations based on various considerations. For example, sequentially related operations may be performed in a "serial" pipeline. For example, image processing calculations such as those described in commonly owned and co-pending U.S. patent application No. 16/002,644 entitled "AN IMAGE PROCESSOR FORMED IN AN ARRAY OF MEMORY CELLS", filed on 7/6/2018, previously incorporated above, may configure a plurality of matrix structures and MMU processing elements to pipeline, e.g., defect correction, color interpolation, white balancing, color adjustment, gamma brightness, contrast adjustment, color conversion, downsampling, and the like. Pipeline processing can typically produce very high throughput data with minimal matrix structure resources. Conversely, unrelated operations may be performed "in parallel" with separate resources. For example, the first phase of a DCT operation performed on an 8x8 matrix may be processed with eight separate matrix structures, as described with respect to fig. 5B and 6B. Highly parallelized operation can greatly reduce latency; however, overall memory fabric resource utilization may be very high.

In one exemplary embodiment, the instruction is received from the processor via a dedicated interface. The dedicated interface is particularly useful in case the matrix computation structure is processed like a coprocessor or a hardware accelerator. Notably, the dedicated interface does not require arbitration and can operate at very high speeds (in some cases, at local processor speeds). In other embodiments, the instruction is received via a shared interface.

The shared interface may be multiplexed in time, resources (e.g., channels, etc.), or other ways with other concurrently active memory interface functionality. Common examples of other memory interface functionality include, but are not limited to: data input/output, memory configuration, processor-in-memory (PIM) communication, direct memory access, and/or any other form of blocking memory access. In some variations, the shared interface may include one or more queuing and/or pipelining mechanisms. For example, some memory technologies may implement a pipelined interface in order to maximize memory throughput.

In some embodiments, the instructions may be received from any entity capable of accessing the memory interface. For example, a camera co-processor (image signal processor (ISP)) may communicate directly with the memory device to, for example, write captured data. In some implementations, the camera co-processor is capable of offloading its processing tasks to the matrix structure of the memory device. For example, the ISP may accelerate/offload/parallelize operations such as color interpolation, white balance, color correction, color conversion, etc. In other examples, a baseband coprocessor (BB) is capable of communicating directly with the memory device to read/write data for transactions, such as through a network interface. The BB processor can offload, for example, FFT/IFFT, channel estimation, beamforming computations, and/or any number of other networking tasks to the matrix structure of the memory device. Similarly, video and/or audio codecs typically utilize DCT/IDCT transforms and would benefit from matrix structure operations. Given the present disclosure, one of ordinary skill in the relevant art will readily recognize variations of the foregoing.

Various embodiments of the present invention may support a queue of multiple instructions. In one exemplary embodiment, the matrix operations may be queued together. For example, multiple vector-matrix multiplications may be queued together to implement a matrix-matrix multiplication. For example, as described elsewhere in this disclosure, multiple matrix-vector multiplications may be used to efficiently perform the matrix-matrix multiplication operations required for the DCT. Similarly, as previously described, higher order transforms (e.g., FFT1024) may be implemented by queuing multiple iterations of lower order constituent transforms (e.g., FFT512, etc.). In yet another example, ISP processing of an image may involve multiple iterations over an iteration space (each iteration may be pre-queued). Given the present disclosure, one of ordinary skill in the relevant art can readily substitute other queuing schemes with the same success.
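
As a minimal sketch of the queueing idea (illustrative only), each queue entry below stands for one vector-matrix access cycle; draining the queue against a single fabric configured with T yields the full matrix-matrix product:

```python
import numpy as np
from collections import deque

def queued_matrix_matrix(T, M):
    """Emulate queued vector-matrix products: one queued operation per column of M."""
    queue = deque(M[:, n] for n in range(M.shape[1]))
    columns = []
    while queue:
        v = queue.popleft()
        columns.append(T @ v)          # stands in for one crossbar access cycle
    return np.column_stack(columns)

rng = np.random.default_rng(2)
T = rng.standard_normal((4, 4))
M = rng.standard_normal((4, 4))
assert np.allclose(queued_matrix_matrix(T, M), T @ M)
```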

In some cases, matrix operations may be cascaded together to achieve higher order matrix operations. For example, a high-order FFT (e.g., 1024x1024) may be decomposed into multiple iterations of a low-order FFT (e.g., four (4) iterations of 512x512 FFT, sixteen (16) iterations of 256x256 FFT, etc.). In other examples, an arbitrary size N-point DFT (e.g., not a power of 2) may be implemented by concatenating other sizes of DFTs. Other examples of cascaded and/or chained matrix transforms may be substituted with equivalent success, the above being purely illustrative.

As previously described, the non-volatile nature of ReRAM enables the memory content to be retained even when the ReRAM is not powered. Thus, certain variations of the processor-memory architecture may enable one or more processors to independently power memory. In some cases, the processor may power the memory when the processor is inactive (e.g., keep the memory active when the processor is at low power). Independent power management of memory is particularly useful, for example, for performing matrix operations in memory, even when the processor is asleep. For example, the memory may receive a plurality of instructions to be executed; the processor may transition to sleep mode until multiple instructions are completed. Other embodiments may use the non-volatile nature of ReRAM to retain memory contents when the memory is powered down; for example, certain video and/or image processing calculations may remain within the ReRAM during periods of inactivity.

At step 804 of method 800, a memory array (or a portion thereof) may be converted to a matrix structure based on the instruction. As used herein, the term "matrix structure" refers to a plurality of memory cells having configurable impedances that, when driven with an input vector, produce an output vector and/or matrix. In one embodiment, the matrix structure may be associated with a portion of a memory map. In some such variations, the portion is configurable in its size and/or position. For example, a configurable memory register may determine whether the bank is configured as a memory or a matrix structure. In other variations, the matrix structure may be reused and/or even block memory interface operations. For example, the memory device may allow the memory interface to be GPIO based (e.g., in one configuration, the pins of the memory interface may selectively operate as ADDR/DATA during normal operation, or as FFT16, for example, during matrix operations).

In one embodiment, the instructions identify a matrix structure characterized by structurally defined coefficients. In one exemplary embodiment, the matrix structure contains coefficients for a structurally defined matrix operation. For example, the matrix structure for an 8x8 DCT is an 8x8 matrix structure that has been pre-filled with the structurally defined coefficients of the DCT. In some variations, the matrix structure may be pre-populated with coefficients of a particular sign (positive, negative) or a particular radix portion (most significant, least significant, or intermediate bits).

As used herein, the term "structurally defined coefficients" refers to the fact that the coefficients of a matrix multiplication are defined by the structure of the matrix operation (e.g., the size of the matrix), rather than by the operands of the operation. For example, structurally defined matrix operations (e.g., 8x8, 16x16, 32x32, 64x64, 128x128, 256x256, etc.) may be identified by, for example, row and column designations. Although the foregoing discussion is presented in the context of full-rank matrix operations, rank-deficient matrix operations may be substituted with equivalent success. For example, the matrix operations may have asymmetric columns and/or rows (e.g., 8x16, 16x8, etc.). In fact, vector-based operations may be treated as matrices with a single column or a single row (e.g., 8x1, 1x8).

In some hybrid hardware/software embodiments, control logic (e.g., a memory controller, a processor, a PIM, etc.) may determine whether resources are available to provide a matrix structure. In one such embodiment, the matrix operations may be evaluated by a preprocessor to determine whether they should be processed within software or within a dedicated matrix structure. For example, if the use of existing memory and/or matrix structures would consume all memory device resources, it may be necessary to process matrix operations within software rather than through the matrix structure. In this case, the instruction may not complete in hardware (falling back to conventional matrix operations executed by processor instructions). In another such example, configuring a temporary matrix structure to handle a simple matrix operation may not be worthwhile, such that the matrix operation should be handled within software.

Various considerations may be used in determining whether a matrix structure should be used. For example, memory management may allocate portions of a memory array for memory and/or a matrix structure. In some implementations, portions of a memory array can be statically allocated. Static allocation may be preferred to reduce memory management overhead and/or to simplify operational overhead (wear leveling, etc.). In other implementations, portions of the memory array may be dynamically allocated. For example, wear leveling may be required to ensure that the performance of the memory degrades evenly (rather than wearing out heavily used areas). Other variations may statically and/or dynamically allocate different portions; for example, subsets of memory and/or matrix structure portions may be allocated dynamically and/or statically.

In short, wear leveling of memory cells can be performed on any discrete unit of memory (e.g., a memory bank, a memory block, etc.). Similar techniques may be used for wear leveling of the matrix structure; for example, in one variation, wear leveling of portions of the matrix structure may require that the entire matrix structure be moved as a whole (the crossbar structure cannot be moved in segments). Alternatively, wear leveling of a matrix structure portion may be performed by first decomposing the matrix structure into constituent matrix computations and dispersing those constituent matrix computations elsewhere. More directly, matrix structure wear leveling may indirectly benefit from the "logical" matrix operations used in other matrix operations (e.g., decomposition, concatenation, parallelization, etc.). In particular, decomposing the matrix structure into its constituent matrix structures may enable better wear-leveling management at the cost of only slightly more complex operations (e.g., additional steps through logical combinations in the MMU).

In one exemplary embodiment, the converting includes reconfiguring the row decoder to operate as a matrix structure driver that variably drives a plurality of rows of the memory array. In one variant, the row driver converts the digital values to analog signals. In one variation, the digital-to-analog conversion includes varying a conductance associated with the memory cell as a function of the matrix coefficient value. Additionally, the converting may include reconfiguring the column decoders to perform analog decoding. In one variation, a column decoder is reconfigured to sense analog signals corresponding to a column of varying-conductance cells driven by corresponding rows of varying signaling. The column decoder converts the analog signals into digital values. While the foregoing is presented in a particular row-column configuration, other arrangements may be substituted with equivalent success. For example, the column drivers may convert digital values to analog signals, and the row decoders may convert analog signals to digital values. In another such example, a three-dimensional (3D) row-column-depth memory may implement a 2D matrix in any permutation (e.g., row driver/column decoder, row driver/depth decoder, column driver/depth decoder, etc.) and/or a 3D matrix permutation (e.g., row driver/column decoder/depth decoder).
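
A minimal behavioral model of the drive/sense path described above may help clarify the signal chain. This is a software sketch only; the voltage and gain constants and the function name are arbitrary illustrative assumptions, not parameters of the disclosed device.

```python
import numpy as np

def crossbar_vmm(x_digital, G, volts_per_lsb=0.01, counts_per_amp=1e5):
    # Row drivers (DACs) convert digital inputs to row voltages, each cell conducts
    # I = V * G, the column lines sum the cell currents (Kirchhoff's current law),
    # and the column ADCs quantize the summed currents back into digital values.
    v_rows = x_digital * volts_per_lsb        # digital-to-analog conversion per row
    i_cols = v_rows @ G                       # analog multiply-accumulate per column
    return np.round(i_cols * counts_per_amp)  # analog-to-digital conversion per column

G = np.random.uniform(0.0, 1e-4, size=(8, 8))   # programmed cell conductances (coefficients)
x = np.random.randint(0, 16, size=8)            # digital input vector
y = crossbar_vmm(x, G)                          # digital estimate of the vector-matrix product
```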

In one exemplary embodiment, the matrix coefficient values correspond to structurally determined values. The structurally determined value may be based on a characteristic of the operation. For example, a Discrete Cosine Transform (DCT) performed on an NxN matrix uses an NxN DCT matrix T, where the values of the DCT matrix are calculated as a function of N. In fact, the various transformations are similar in this respect. For example, both Discrete Fourier Transform (DFT) and Fast Fourier Transform (FFT) use structurally defined coefficients.
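
For reference, such a structurally determined DCT matrix can be generated from N alone. The sketch below uses the common orthonormal type-II convention, which may differ in scaling from the eqn. 6 referenced elsewhere in this disclosure.

```python
import numpy as np

def dct_matrix(N):
    # Orthonormal type-II DCT matrix T: every entry depends only on N, i.e. the
    # coefficients are structurally determined rather than operand-dependent.
    i = np.arange(N)[:, None]      # row (frequency) index
    j = np.arange(N)[None, :]      # column (sample) index
    scale = np.where(i == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return scale * np.cos((2 * j + 1) * i * np.pi / (2 * N))

T = dct_matrix(8)
assert np.allclose(T @ T.T, np.eye(8))   # orthonormal, so the inverse is simply T.T
```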

In one exemplary embodiment, the matrix structure itself has structurally defined dimensions. The structurally determined dimensions may be based on the characteristics of the operation; for example, the ISP white balance process may use a 3x3 matrix (corresponding to different values of red (R), green (G), blue (B), luminance (Y), chroma red (Cr), chroma blue (Cb), etc.). In another such example, the channel matrix estimate and/or beamforming codebook is typically defined in terms of the number of multiple-input multiple-output (MIMO) paths. For example, a 2x2 MIMO channel has a corresponding 2x2 channel matrix and corresponding 2x2 beamforming weights. Given the present disclosure, one of ordinary skill in the relevant art may substitute various other structurally defined values and/or dimensions useful for matrix operations.
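
As a software-only illustration of such a structurally small matrix, the sketch below applies a 3x3 color transform to RGB pixels; the gain values and names are arbitrary and purely illustrative, not calibrated ISP parameters.

```python
import numpy as np

def apply_color_matrix(rgb_pixels, M):
    # Per-pixel 3x3 color transform (e.g., white balance or color correction):
    # every RGB triple is multiplied by the same small, fixed 3x3 matrix.
    return rgb_pixels @ M.T

M_wb = np.diag([1.8, 1.0, 1.4])         # illustrative (uncalibrated) white-balance gains
pixels = np.random.rand(16, 3)          # 16 RGB pixels
balanced = apply_color_matrix(pixels, M_wb)
```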

Some variations may further subdivide the matrix coefficient values in order to handle operations that might otherwise be impractical. In this case, the matrix structure may contain only a part of the matrix coefficient values (and only a part of the matrix operation is performed). For example, performing signed arithmetic and/or higher-radix calculations directly may require prohibitively tight manufacturing tolerances. Signed matrix operations may be split into positive and negative matrix operations (which are summed later by the Matrix Multiplication Units (MMUs) described elsewhere herein). Similarly, high-radix matrix operations may be divided into, for example, a Most Significant Bit (MSB) portion, a Least Significant Bit (LSB) portion, and/or any intermediate bits (which may be bit-shifted and summed by the aforementioned MMU). Other variations will be readily apparent to those of ordinary skill in the art given the present disclosure.
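
The positive/negative split can be illustrated in software as follows; this is a minimal sketch in which the digital-domain subtraction stands in for the MMU combination step, and the function names are assumptions for illustration.

```python
import numpy as np

def split_signed(T):
    # Separate a signed coefficient matrix into two non-negative matrices that can
    # each be programmed as (unsigned) impedance/conductance levels.
    T_pos = np.where(T > 0, T, 0.0)
    T_neg = np.where(T < 0, -T, 0.0)
    return T_pos, T_neg

def mmu_signed_product(x, T_pos, T_neg):
    # The MMU subtracts the negative-plane product from the positive-plane product
    # in the digital domain to recover the signed result.
    return x @ T_pos - x @ T_neg

T = np.random.randn(8, 8)
x = np.random.randn(8)
T_pos, T_neg = split_signed(T)
assert np.allclose(mmu_signed_product(x, T_pos, T_neg), x @ T)
```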

In one exemplary embodiment, the matrix coefficient values are determined in advance and stored in a look-up table for later reference. For example, matrix operations with structurally defined dimensions and structurally defined values may be stored in advance. As just one such example, the 8x8 DCT matrix T has a structurally determined dimension (8x8) and structurally determined values (e.g., calculated with eqn. 6). The DCT8X8 instruction may result in the configuration of an 8x8 matrix structure that is pre-populated with the corresponding structurally determined values of DCT8X8. As another such example, antenna beamforming coefficients are typically defined in advance within a codebook; the wireless network may identify a corresponding index within the codebook to configure antenna beamforming. For example, a MIMO codebook may identify the possible configurations of a 4x4 MIMO system; during operation, the selected configuration may be retrieved from the codebook based on its index.
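
A hypothetical software analogue of such a lookup table is sketched below: coefficients for structurally defined operations are generated once, ahead of time, and only retrieved when an instruction arrives. The opcode strings and LUT layout are assumptions for illustration, not the actual instruction encoding.

```python
import numpy as np

def dct_matrix(N):
    # Structurally determined orthonormal type-II DCT coefficients (function of N only).
    i, j = np.arange(N)[:, None], np.arange(N)[None, :]
    scale = np.where(i == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return scale * np.cos((2 * j + 1) * i * np.pi / (2 * N))

# Pre-computed coefficient LUT keyed by opcode; entries are looked up at
# configuration time rather than recomputed.
COEFF_LUT = {
    "DCT8X8": dct_matrix(8),
    "DCT16X16": dct_matrix(16),
}

def configure_matrix_fabric(opcode):
    return COEFF_LUT[opcode]
```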

While the foregoing examples are presented in the context of structurally defined dimensions and/or values, other embodiments may use dimensions and/or values defined based on one or more other system parameters. For example, low-power operation may require less granularity. Similarly, as previously described, various processing considerations may favor (or disfavor) performing matrix operations within the matrix structure. In addition, matrix operations may affect other memory considerations, including but not limited to: wear leveling, memory bandwidth, in-memory process bandwidth, power consumption, row, column, and/or depth decoding complexity, and the like. Given the present disclosure, various other considerations may be substituted by one of ordinary skill in the relevant art, the foregoing being merely illustrative.

At step 806 of method 800, one or more matrix multiplication units may be configured based on the instruction. As previously mentioned, some matrix structures may implement logic (mathematical identities) to handle a single stage of a matrix operation; however, multiple matrix structures may be cascaded together to achieve more complex matrix operations. In one exemplary embodiment, a first matrix is used to calculate the positive product of a matrix operation, and a second matrix is used to calculate the negative product of the matrix operation. The resulting positive and negative products can be combined within the MMU to provide signed matrix multiplication. In one exemplary embodiment, a first matrix is used to compute a first radix portion of a matrix operation and a second matrix is used to compute a second radix portion of the matrix operation. The resulting radix portions may be bit-shifted and/or summed within the MMU to provide a larger-radix product.
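
The radix-splitting path can likewise be illustrated in software. The 8-bit operand width and 4-bit split below are arbitrary choices for the sketch, and the shift-and-sum stands in for the MMU combination step.

```python
import numpy as np

def split_radix(x, lsb_bits=4):
    # Split 8-bit unsigned operands into a 4-bit MSB portion and a 4-bit LSB portion
    # so that each pass through the matrix structure needs only low-precision levels.
    msb = x >> lsb_bits
    lsb = x & ((1 << lsb_bits) - 1)
    return msb, lsb

def mmu_combine(msb_product, lsb_product, lsb_bits=4):
    # The MMU bit-shifts the MSB partial product and sums it with the LSB partial product.
    return (msb_product << lsb_bits) + lsb_product

T = np.random.randint(0, 8, size=(8, 8))
x = np.random.randint(0, 256, size=8)
msb, lsb = split_radix(x)
assert np.array_equal(mmu_combine(msb @ T, lsb @ T), x @ T)
```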

Briefly, logical matrix operations are distinguished from analog matrix operations. The exemplary matrix structure converts analog voltages or currents into digital values that are read by a Matrix Multiplication Unit (MMU). Logical operations may manipulate digital values by mathematical properties (e.g., by matrix decomposition, etc.); analog voltages or currents cannot be manipulated in this manner.

More generally, different logical operations may be performed with a set of matrices. For example, a matrix may be decomposed or factored into one or more constituent matrices. Similarly, multiple constituent matrices may be aggregated or combined into a single matrix. In addition, a matrix may be expanded in rows and/or columns to create a rank-deficient matrix of larger dimension (but the same rank). This logic can be used to implement many higher-order matrix operations. For example, multiplying two matrices together may be decomposed into a plurality of vector-matrix multiplications, as explained elsewhere in this disclosure. These vector-matrix multiplications may further be implemented as multiply-accumulate logic within a Matrix Multiplication Unit (MMU). In other words, even non-unary operations may be handled as a series of piecewise unary matrix operations. More generally, one of ordinary skill in the relevant art will readily appreciate that any matrix operation that may be represented, in whole or in part, as a unary operation may greatly benefit from the various principles described herein.
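
As a minimal software sketch of that decomposition (with illustrative names), a matrix-matrix product can be reproduced by issuing one vector-matrix pass per row and stacking the results:

```python
import numpy as np

def matmul_via_vmm(A, B):
    # Each row of A drives one vector-matrix pass through B; stacking the per-pass
    # results reproduces the full matrix-matrix product A @ B.
    return np.stack([row @ B for row in A])

A = np.random.randn(4, 8)
B = np.random.randn(8, 8)
assert np.allclose(matmul_via_vmm(A, B), A @ B)
```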

Various embodiments of the present disclosure use Matrix Multiplication Units (MMUs) as glue logic between multiple constituent matrix structures. In addition, MMU operations may be selectively switched to connect to various rows and/or columns. Not all matrix structures may be used simultaneously; thus, the matrix structure may be selectively connected to the MMU depending on the current processing and/or memory usage. For example, a single MMU may be dynamically connected to different matrix structures.

In some embodiments, control logic (e.g., a memory controller, processor, PIM, etc.) may determine whether resources are present to provide MMU operations, e.g., at a column decoder or elsewhere. For example, the current MMU load may be evaluated by a preprocessor to determine whether the additional MMU load would be too high. It is noted that MMUs are mainly used for logical operations, and therefore any processing entity with equivalent logical functionality can assist with the tasks of the MMU. For example, a processor-in-memory (PIM) may offload MMU operations. Similarly, the matrix structure results may be provided directly to the host processor (which may perform the logical operations in software).

More generally, various embodiments of the present disclosure contemplate sharing MMU logic between multiple different matrix structures. The sharing may be based on, for example, a time-sharing scheme. For example, the MMU may be assigned to a first matrix structure during one time slot and to a second matrix structure during another time slot. In other words, unlike the physical matrix structure (which is statically allocated for the duration of the matrix operation), the MMU performs logical operations that may be scheduled, subdivided, allocated, reserved, and/or partitioned in any number of ways. More generally, various embodiments of the matrix structure are based on non-volatile memory. As a result, the matrix structure can be preconfigured and read when needed; the non-volatile nature ensures that the matrix structure retains its contents without requiring processing overhead even if, for example, the memory device is powered down.
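
A toy round-robin schedule illustrates the time-sharing idea in software; the fabric names and slot granularity below are hypothetical and not drawn from the disclosure.

```python
from itertools import cycle

# Hypothetical round-robin time sharing of one MMU across several matrix structures:
# the fabric allocations are static, but the MMU assignment changes per time slot.
matrix_structures = ["DCT8x8_fabric", "FFT256_fabric", "WB3x3_fabric"]

def mmu_schedule(num_slots):
    assignment = cycle(matrix_structures)
    return [(slot, next(assignment)) for slot in range(num_slots)]

for slot, fabric in mmu_schedule(6):
    print(f"time slot {slot}: MMU serves {fabric}")
```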

If both the matrix structure and the corresponding Matrix Multiplication Unit (MMU) have been successfully converted and configured, then at step 808 of method 800 the matrix structure is driven based on the instruction, and at step 810 a logical result is computed with one or more matrix multiplication units. In one embodiment, one or more operands are converted to electrical signals for analog computation by the matrix structure. The analog computation results from driving the electrical signals through the matrix structure elements; for example, each voltage drop is a function of the corresponding matrix structure coefficient. The analog result is sensed and converted back into a digital domain signal. Thereafter, the one or more digital domain values are processed with one or more Matrix Multiplication Units (MMUs) to create a logical result.

Fig. 8A illustrates a logic flow diagram of an exemplary method 850 of implementing steps 808 and 810 of fig. 8 to perform DCT matrix operations. In one embodiment, the exemplary method 850 is implemented using the memory structure described in FIG. 5A, although this is merely exemplary.

Referring back to fig. 8, in the context of a DCT, at step 802 a memory device receives one or more instructions from, for example, a processor. The instruction may specify a DCT operation to be performed on an NxN matrix M.

At step 804 of the method 800, a portion of a memory array is converted into a matrix structure configured to perform a DCT matrix operation. In one embodiment, at least one Nx2N portion of the memory array is configured as a matrix structure (divided into positive T and negative T matrix values) having values of a DCT matrix T, as depicted in fig. 5A. The values of matrix T may be retrieved from a look-up table (LUT) by control logic within the memory device.

At step 806, the one or more matrix multiplication units may be configured to, for example, combine matrix-vector products from the positive T matrix structure and the negative T matrix structure to provide signed matrix multiplication (similar to that described in fig. 4).

Referring now to FIG. 8A, a method 850 is used to perform a DCT operation on the matrix M (i.e., as in steps 808 and 810 of FIG. 8).

In step 811 of method 850, one or more vectors m_c representing columns of matrix M are obtained from the memory array. In one embodiment, the control logic accesses one vector at a time based on the instruction received in step 802 of method 800.

In step 812, a vector m_cn is used to drive the matrix structure that has been configured with the positive and negative T matrix values.

In step 814, the MMU obtains the results of the matrix structure and computes the b_cn vector by subtracting the matrix-vector product obtained on the negative T side of the matrix structure from the matrix-vector product obtained on the positive T side of the matrix structure.

In step 816, the values of the resulting b_cn vector are stored in the memory array.

Steps 811 through 816 are then repeated until the entire B matrix is calculated and accumulated in the memory array. Additional details may be found in the discussion of fig. 5A elsewhere in this disclosure.

Once the entire B matrix has been computed in accordance with steps 811-816, at step 817 of FIG. 8A, the control logic obtains from the memory array one or more vectors b_r representing rows of matrix B.

In step 818, a vector b_rn is used to drive the matrix structure that has been configured with positive and negative T matrix values, as described with respect to fig. 5A.

In step 820, the MMU obtains the results of the matrix structure and computes the d_rn vector by subtracting the matrix-vector product obtained on the negative T side of the matrix structure from the matrix-vector product obtained on the positive T side of the matrix structure.

In step 822, the values of the resulting d_rn vector are stored in the memory array.

Steps 817 through 822 are then repeated until the entire D matrix is calculated and the DCT operation is complete. It will be appreciated that the foregoing method may be readily modified by one of ordinary skill in the art for the multi-matrix architecture of fig. 5B and 6B, given this disclosure.
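
The overall two-pass flow of steps 811 through 822 can be mirrored in software as follows. This is a behavioral sketch only: it assumes the orthonormal DCT convention and that the first pass accumulates B = T @ M column by column while the second pass accumulates D = B @ T.T row by row; the exact orientation in hardware depends on the fabric wiring of fig. 5A.

```python
import numpy as np

def dct_matrix(N):
    # Structurally determined orthonormal type-II DCT coefficients (function of N only).
    i, j = np.arange(N)[:, None], np.arange(N)[None, :]
    scale = np.where(i == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return scale * np.cos((2 * j + 1) * i * np.pi / (2 * N))

def dct2_two_pass(M):
    # First pass (steps 811-816): drive the fabric once per column of M to build B = T @ M.
    # Second pass (steps 817-822): drive it once per row of B to build D = B @ T.T,
    # yielding the 2-D DCT D = T @ M @ T.T.
    N = M.shape[0]
    T = dct_matrix(N)
    B = np.column_stack([T @ M[:, c] for c in range(N)])
    D = np.vstack([B[r, :] @ T.T for r in range(N)])
    return D

M = np.random.randn(8, 8)
assert np.allclose(dct2_two_pass(M), dct_matrix(8) @ M @ dct_matrix(8).T)
```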

It will also be appreciated that while certain aspects of the disclosure have been described in terms of a particular sequence of steps of a method, these descriptions are merely illustrative of the broader methods of the disclosure and may be modified as desired for particular applications. Certain steps may be unnecessary or optional in certain circumstances. In addition, certain steps or functions may be added to the disclosed embodiments or the order of execution of two or more steps may be permuted. Furthermore, features from two or more methods may be combined. All such variations are considered to be encompassed within the disclosure disclosed and claimed herein.

The description set forth herein, in connection with the appended drawings, describes example configurations and is not intended to represent all examples that may be implemented or that are within the scope of the claims. The term "exemplary" as used herein means "serving as an example, instance, or illustration," and not "preferred" or "superior to other instances." The detailed description includes specific details that provide an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described examples.

Information and signals described herein may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

While the above detailed description has shown, described, and pointed out novel features of the disclosure as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or process illustrated may be made by those skilled in the art without departing from the disclosure. This description is in no way meant to be limiting, but should be taken as illustrative of the general principles of the disclosure. The scope of the disclosure should be determined with reference to the claims.

It should also be understood that while certain steps and aspects of the various methods and apparatus described herein may be performed by a human, the disclosed aspects and various methods and apparatus are generally computerized/computer-implemented. Computerized devices and methods are needed to fully implement these aspects for a number of reasons, including but not limited to commercial viability, utility, and even availability (i.e., a person cannot simply perform certain steps/processes in any feasible manner).

The functions described herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium (e.g., a storage medium). Computer-readable media include both non-transitory computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A non-transitory storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media.
