Vertical mapping and computation of deep neural networks in non-volatile memory

Document No.: 170900 · Publication date: 2021-10-29

Note: This patent, "Vertical mapping and computation of deep neural networks in non-volatile memory," was designed and created by T.T. Hoang, M. Lueker-Boden, and A. Kulkarni on 2021-04-06. Its main content is summarized below.

A non-volatile memory structure is provided that is capable of storing the layers of a deep neural network (DNN) and performing inference operations within the structure. Stacks of bonded die pairs are connected by through-silicon vias. Each bonded die pair includes: a memory die having one or more memory arrays onto which the layers of the neural network are mapped; and a peripheral circuit die including the control circuitry for performing the convolutions or multiplications of the bonded die pair. The multiplication may be done within an array on the memory die or within logic on the peripheral circuit die. The arrays may be formed into columns along the vias, allowing inference operations to be performed by propagating inputs up and down the columns, with the output of one level serving as the input of the subsequent level.

1. A non-volatile memory device, comprising:

a plurality of stacked bond die pairs, each bond die pair comprising:

a memory die having one or more arrays of non-volatile memory cells, each array configured to store one or more weights for a layer of a neural network; and

a corresponding peripheral circuit die formed separately from the memory die and bonded thereto and having peripheral circuit elements, including one or more control circuits connected to the one or more arrays and configured to perform memory operations on the one or more arrays, the one or more control circuits configured to:

receive a set of input values for a layer of the neural network, the layer having weights stored on one of the arrays of the memory die; and

perform a multiplication operation between the set of input values and the weights of the layer of the neural network to generate a set of output values of the layer of the neural network; and

one or more vias each connecting a plurality of the bond die pairs and configured to transfer data between the bond die pairs connected to the one or more vias, the vias configured to transfer a set of output values of a multiplication operation in a first of the bond die pairs as a set of input values of a multiplication operation in a second of the bond die pairs.

2. The non-volatile memory device of claim 1, wherein in generating a set of output values for the layer of the neural network, the one or more control circuits are configured to:

perform an activation operation on the set of output values after performing the multiplication operation between the set of input values and the weights of the layer of the neural network.

3. The non-volatile memory device of claim 2, wherein:

the plurality of stacked bond die pairs includes a first bond die pair and a second bond die pair, wherein one or more control circuits of a peripheral circuit die of the first bond die pair are further configured to perform a different set of activation operations than the peripheral circuit die of the second bond die pair.

4. The non-volatile memory device of claim 1, wherein to perform the multiplication operation between the set of input values and the weights of the layer of the neural network, the one or more control circuits are configured to:

perform the multiplication operation in an array that stores the weights of the layer of the neural network.

5. The non-volatile memory device of claim 1, wherein to perform the multiplication operation between the set of input values and the weights of the layer of the neural network, the one or more control circuits are configured to:

read out the weights of the layer of the neural network from the memory die into the peripheral circuit die; and

perform the multiplication operation in logic of the peripheral circuit die.

6. The non-volatile memory device of claim 1, wherein for one or more of the bond die pairs, the one or more control circuits of the peripheral circuit die are further configured to:

perform a subsequent multiplication operation between the output values of the layer of the neural network and the weights of an additional layer of the neural network stored in the one or more arrays of the corresponding memory die.

7. The non-volatile memory device of claim 1, wherein the peripheral circuit die of the bond die pair is a CMOS die.

8. The non-volatile memory device of claim 7, wherein the memory die of the bonded die pair is formed in an NMOS process.

9. The non-volatile memory device of claim 1, wherein the memory die of the bonded die pair is a three-dimensional NAND flash memory die.

10. The non-volatile memory device of claim 1, wherein the memory die of the bond die pair is a storage class memory die.

11. The non-volatile memory device of claim 1, wherein the memory die of the bond die pair is a phase change memory die.

12. The non-volatile memory device of claim 1, wherein the layer of the neural network is a convolutional layer of the neural network.

13. The non-volatile memory device of claim 1, wherein the layer of the neural network is a fully connected layer of the neural network.

14. A method, comprising:

receiving an initial input of a neural network at a peripheral circuit die of a first bond die pair of a non-volatile memory fabric, the non-volatile memory fabric including a plurality of bond die pairs, including the first bond die pair, each of the bond die pairs including a peripheral circuit die and a non-volatile memory die, each of the non-volatile memory dies storing weights for one or more layers of the neural network;

propagating the input of the neural network through the sequence of bonded die pairs, the bonded die pairs forming a three-dimensional stack, wherein the bonded die pairs are connected by through-silicon vias, the propagating comprising:

generating, at the first bond die pair, an output for a layer of the neural network stored on the memory die of the first bond die pair using the initial input; and

at each of one or more subsequent bond die pairs in the sequence:

receiving, along the vias, an output from a previous bond die pair in the sequence; and

using the output from the previous bond die pair as an input to generate an output for a layer of the neural network stored on the memory die of the subsequent bond die pair; and

providing an output of the neural network from the peripheral circuit die of one of the subsequent bond die pairs.

15. The method of claim 14, wherein generating outputs for a layer of the neural network stored on the memory die in one or more of the bond die pairs comprises:

performing a multiplication between the input and the weights of the layer of the neural network in the memory die of the bond die pair.

16. The method of claim 14, wherein generating outputs for a layer of the neural network stored on the memory die in one or more of the bond die pairs comprises:

reading the weights of the layer of the neural network from the memory die of the bond die pair; and

performing, in logic of the peripheral circuit die of the bond die pair, a multiplication between the input and the weights of the layer of the neural network.

17. A non-volatile memory device, comprising:

a stack of a plurality of bonded die pairs, each bonded die pair having a memory die with one or more arrays of non-volatile memory cells and a peripheral circuit die having one or more control circuits configured to access data stored on the memory die of the bonded die pair; and

a plurality of vias through the stack of bond die pairs, the plurality of vias configured to transfer data between the bond die pairs, the arrays of the memory dies being grouped into columns along the vias,

wherein the memory dies of the plurality of bond die pairs are configured to store weights for a plurality of layers of a neural network, sequences of layers of the neural network being mapped to corresponding array sequences along the columns, and

wherein the control circuitry of the peripheral circuit die of the plurality of bond die pairs is further configured to perform inference operations for the neural network by: receiving an input of the neural network at a first bond die pair of the bond die pairs; propagating the input through the layers of the neural network by propagating the input along the corresponding array sequences of the columns; and performing a multiplication operation within each of the bond die pairs using the weights stored in the corresponding array.

18. The non-volatile memory device of claim 17, wherein the multiplication operations performed in one or more bond die pairs using the weights stored in the corresponding arrays are performed within the memory dies of the bond die pairs.

19. The non-volatile memory device of claim 17, wherein the multiplication operations performed in one or more bonded die pairs using the weights stored in the corresponding array are performed within the peripheral circuit dies of the bonded die pairs.

20. The non-volatile memory device of claim 17, wherein the control circuitry of the peripheral circuit die of the plurality of bond die pairs is further configured to:

propagating the input through a plurality of the corresponding array sequences along a first one of the columns in a first direction; and

subsequently propagating the input through a plurality of the corresponding array sequences along a second one of the columns in a second direction.

Background

Artificial neural networks are increasingly used in artificial intelligence and machine learning applications. In an artificial neural network, a set of inputs is propagated through one or more intermediate or hidden layers to generate an output. The layers connecting the input to the output are connected by sets of weights that are generated in a training or learning phase by determining a set of mathematical transformations that convert the input into the output, moving through the layers, and calculating a probability for each output. Once the weights are established, they can be used in the inference phase to determine the output from a set of inputs. While such neural networks can provide highly accurate results, they are extremely computationally intensive, and the data transfers involved in reading the weights connecting the different layers out of memory and transferring these weights into some processing unit can be quite intensive.

Drawings

Like numbered elements refer to common features in different figures.

FIG. 1 is a block diagram of one embodiment of a memory system connected to a host.

FIG. 2 is a block diagram of one embodiment of a front-end processor circuit. In some embodiments, the front-end processor circuit is part of the controller.

FIG. 3 is a block diagram of one embodiment of a back-end processor circuit. In some embodiments, the back-end processor circuit is part of the controller.

FIG. 4 is a block diagram of one embodiment of a memory package.

FIG. 5 is a block diagram of one embodiment of a memory die.

FIGS. 6A-6B illustrate an array workpiece including pads for bonding.

FIGS. 7A-7B illustrate a control circuit workpiece including pads for bonding.

FIGS. 8A-8C illustrate wafer-to-wafer bonding and the resulting structure.

FIG. 9 shows an example of control circuitry coupled to a memory structure by wafer-to-wafer bonding.

FIG. 10 shows a simple example of a Convolutional Neural Network (CNN).

FIG. 11 shows a simple example of a fully connected layer in an artificial neural network.

FIG. 12A is a flow chart describing one embodiment of a process for training a neural network to generate a set of weights.

FIG. 12B is a flow chart describing one embodiment of a process for inference using a neural network.

FIG. 13 is a schematic diagram of convolution operations in a convolutional neural network.

FIG. 14 is a schematic diagram of the use of matrix multiplication in the fully-connected layer of a neural network.

FIG. 15 is a simplified illustration of a non-volatile memory structure formed by a stack of bonded die pairs connected by vias.

FIGS. 16 and 17 schematically illustrate the inference operation performed by the structure of FIG. 15 when the convolution or fully connected multiplication of a layer is performed within the memory die and within the CMOS die, respectively.

FIG. 18 illustrates the mapping of a 3D deep neural network onto the architecture of FIG. 15.

FIGS. 19 and 20 are flowcharts of embodiments for propagating an input of a neural network through the layers of the network as mapped onto the structure of FIG. 15.

Detailed Description

When a neural network performs inference or training operations, large numbers of computations involving large amounts of data are performed, particularly in the case of deep neural networks, or DNNs, which involve a large number of layers through which the inputs must be propagated. To avoid moving large amounts of data into and out of the memory device, the weights of the layers of the neural network are stored in the non-volatile memory arrays of the memory device, and the computations for each layer are performed on the device. To improve performance and capability, the embodiments described below use a bonded die pair in which a first die includes one or more memory arrays and the peripheral circuitry (including control circuitry for performing memory operations) is moved onto a second die. The memory die of the bonded die pair may have a higher capacity, since all of its area is dedicated to memory cells and its processing may be optimized for the memory cell technology. The peripheral circuit die of the bonded die pair may use a process optimized for the control circuit elements, such as a CMOS process, and the additional area available on the CMOS die may be used for additional logic elements and functions. When used for a neural network, the weights of the layers may be stored in the arrays of the memory die of the bonded die pair, and the activations of the neural network may be performed in the logic elements of the CMOS die of the bonded die pair. Depending on the embodiment, the calculation (multiplication, convolution) of a layer's inputs with its weights may be performed in memory, by converting the inputs to voltage levels applied along the word lines of the array, or in logic, by reading the weights out into the CMOS die and performing the calculations in logic elements there.
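
As a rough illustration of these two options (and not part of the patent disclosure itself), the following Python sketch models a layer's weights as an array of conductance values: the "in-array" path applies the inputs as word-line voltages and sums the resulting currents along the bit lines, while the "in-logic" path first reads the weights out and forms the same multiply-accumulate results in the peripheral die's logic. The function names and values are hypothetical; both paths produce the same layer output.

import numpy as np

def multiply_in_array(inputs, conductances):
    # In-memory compute: the inputs drive the word lines as voltages, the
    # weights are stored as cell conductances, and each bit line sums the
    # resulting currents (I = G * V), giving one multiply-accumulate per
    # bit line.
    voltages = np.asarray(inputs, dtype=float)
    return voltages @ conductances              # one value per bit line

def multiply_in_logic(inputs, conductances):
    # Peripheral-die compute: the weights are first read out of the array,
    # then the same dot products are formed in the CMOS die's logic.
    weights = np.array(conductances, dtype=float)     # "read out" the array
    return sum(x * row for x, row in zip(inputs, weights))

rng = np.random.default_rng(0)
weights = rng.random((4, 3))          # 4 word lines x 3 bit lines
layer_input = [0.0, 1.0, 0.5, 1.0]
print(multiply_in_array(layer_input, weights))
print(multiply_in_logic(layer_input, weights))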

The bonded die pairs may be stacked in a three-dimensional structure, where the bonded die pairs of one level may be connected to the bonded die pairs on other levels of the stack through the use of through-silicon vias, or TSVs. The outputs of one level of the stack may then be transferred to another level of the stack, where they are used as inputs. The arrays of the memory dies may be grouped into columns along the vias. By mapping the neural network onto this 3D structure of stacked bonded die pairs, the weights of each layer, whether convolutional or fully connected, may be stored in arrays along the columns, and the inference operation may be performed by receiving the network inputs at the first layer and propagating the inputs through the layers of the structure.
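
A minimal sketch of this column-wise mapping is given below, under the simplifying assumptions that each level of the stack applies one stored weight matrix followed by an activation and that the TSVs simply hand one level's output to the next level as its input; the class and function names are illustrative only.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class BondedDiePair:
    """One level of the stack: an array holding a layer's weights plus the
    peripheral-die activation applied to the multiplication result."""
    def __init__(self, weights, activation=relu):
        self.weights = np.asarray(weights, dtype=float)
        self.activation = activation

    def compute(self, layer_input):
        # Multiply (in-array or in-logic) and activate on the CMOS die.
        return self.activation(layer_input @ self.weights)

def propagate_along_column(network_input, column):
    # The vias carry each level's output up or down to the next level,
    # where it becomes that level's input.
    values = np.asarray(network_input, dtype=float)
    for pair in column:
        values = pair.compute(values)
    return values

rng = np.random.default_rng(1)
column = [BondedDiePair(rng.standard_normal((8, 8))) for _ in range(3)]
column.append(BondedDiePair(rng.standard_normal((8, 4))))
print(propagate_along_column(rng.random(8), column))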

FIG. 1 is a block diagram of one embodiment of a memory system 100 connected to a host 120. Memory system 100 may implement the techniques presented herein, where neural network inputs or other data are received from the host 120. Depending on the embodiment, the inputs may be received from the host 120 and then provided to the memory packages 104 for use in inference with weights previously programmed into the memory arrays of the memory packages 104. Many different types of memory systems may be used with the techniques presented herein. Exemplary memory systems include solid state drives ("SSDs"), memory cards, and embedded memory devices; however, other types of memory systems may also be used.

The memory system 100 of FIG. 1 includes a controller 102, a non-volatile memory 104 for storing data, and a local memory (e.g., DRAM/ReRAM) 106. The controller 102 includes a front-end processor (FEP) circuit 110 and one or more back-end processor (BEP) circuits 112. In one embodiment, the FEP circuit 110 is implemented on an ASIC. In one embodiment, each BEP circuit 112 is implemented on a separate ASIC. In other embodiments, a unified controller ASIC may combine both the front-end and back-end functions. In another embodiment, the ASICs for each of the BEP circuits 112 and the FEP circuit 110 are implemented on the same semiconductor, such that the controller 102 is fabricated as a system on a chip ("SoC"). Both the FEP circuit 110 and the BEP circuits 112 include their own processors. In one embodiment, the FEP circuit 110 and the BEP circuits 112 operate in a master-slave configuration, where the FEP circuit 110 is the master and each BEP circuit 112 is a slave. For example, the FEP circuit 110 implements a Flash Translation Layer (FTL) or a Media Management Layer (MML) that performs memory management (e.g., garbage collection, wear leveling, etc.), logical-to-physical address translation, communication with the host, management of the DRAM (local volatile memory), and management of the overall operation of the SSD (or other non-volatile storage system). The BEP circuits 112 manage memory operations in the memory packages/dies as requested by the FEP circuit 110. For example, the BEP circuits 112 may perform read, erase, and program processes. In addition, the BEP circuits 112 may perform buffer management, set the particular voltage levels required by the FEP circuit 110, perform error correction (ECC), control the toggle mode (TM) interfaces to the memory packages, and so forth. In one embodiment, each BEP circuit 112 is responsible for its own set of memory packages.

In one embodiment, the non-volatile memory 104 includes a plurality of memory packages. Each memory package includes one or more memory dies. Thus, the controller 102 is connected to one or more non-volatile memory dies. In one embodiment, each memory die in the memory package 104 utilizes NAND flash memory (including two-dimensional NAND flash memory and/or three-dimensional NAND flash memory). In other embodiments, the memory package may include other types of memory, such as storage class memory (SCM) based on resistive random access memory (such as ReRAM, MRAM, FeRAM, or RRAM) or phase change memory (PCM).

Controller 102 communicates with host 120 via interface 130, which implements NVM Express (NVMe) over PCI Express (PCIe). To operate with memory system 100, host 120 includes a host processor 122, a host memory 124, and a PCIe interface 126 connected along a bus 128. Host memory 124 is the physical memory of the host and may be DRAM, SRAM, non-volatile memory, or another type of storage device. The host 120 is external to and separate from the memory system 100. In one embodiment, the memory system 100 is embedded in the host 120.

FIG. 2 is a block diagram of one embodiment of the FEP circuit 110. Fig. 2 shows a PCIe interface 150 in communication with host 120, and a host processor 152 in communication with the PCIe interface. Host processor 152 may be any type of processor known in the art suitable for implementation. The host processor 152 communicates with a Network On Chip (NOC) 154. NOCs are communication subsystems on integrated circuits, typically between cores in a SoC. NOCs may span both synchronous and asynchronous clock domains, or use asynchronous logic that is not clocked. NOC technology applies network theory and methodology to on-chip communications and represents a significant improvement over conventional bus and crossbar interconnects. NOCs improve SoC scalability and power efficiency of complex socs over other designs. The wires and links of the NOC are shared by many signals. Since all links in the NOC can run simultaneously on different data packets, a high degree of parallelism is achieved. Thus, as the complexity of integrated subsystems continues to increase, NOCs provide enhanced performance (such as throughput) and scalability compared to previous communication architectures (e.g., dedicated point-to-point signal lines, shared buses, or segmented buses with bridges). Connected to the NOC 154 and in communication with the NOC 154 are a memory processor 156, SRAM 160, and DRAM controller 162. DRAM controller 162 is used to operate and communicate with a DRAM (e.g., DRAM 106). SRAM 160 is a local RAM memory used by memory processor 156. The memory processor 156 is used to run the FEP circuit and perform various memory operations. Also in communication with the NOC are two PCIe interfaces 164 and 166. In the embodiment of fig. 2, the SSD controller would include two BEP circuits 112; thus, there are two PCIe interfaces 164/166. Each PCIe interface communicates with one of the BEP circuits 112. In other embodiments, there may be more or less than two BEP circuits 112; thus, there may be more than two PCIe interfaces.

The FEP circuit 110 may also include a Flash Translation Layer (FTL), or more generally a Media Management Layer (MML) 158, that performs memory management (e.g., garbage collection, wear leveling, load balancing, etc.), logical-to-physical address translation, communication with the host, management of DRAM (local volatile memory), and management of the overall operation of the SSD or other non-volatile storage system. The Media Management Layer (MML) 158 may be integrated as part of the memory management that can handle memory errors and interface with the host. In particular, the MML may be a module in the FEP circuit 110 and may be responsible for the internals of memory management. In particular, the MML 158 may include an algorithm in the memory device firmware that translates writes from the host into writes to the memory structure of the die (e.g., 326 in FIG. 5, below). The MML 158 may be needed because: 1) the memory may have limited endurance; 2) the memory structure may only be written in multiples of pages; and/or 3) the memory structure may not be written unless it is erased as a block. The MML 158 understands these potential limitations of the memory structure, which may not be visible to the host. Accordingly, the MML 158 attempts to translate writes from the host into writes to the memory structure.
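
Purely for illustration, the toy sketch below shows the flavor of translation the MML performs: logical writes are redirected to fresh physical pages because the underlying memory cannot simply be overwritten in place. The class, page counts, and policy are hypothetical and far simpler than any real FTL/MML.

class ToyFTL:
    """Toy logical-to-physical translation: every host write lands on a
    fresh physical page and the old copy is only marked invalid, since
    the memory structure is erased in blocks rather than overwritten."""
    def __init__(self, num_pages=1024):
        self.l2p = {}            # logical page -> physical page
        self.pages = {}          # physical page -> data
        self.invalid = set()     # stale pages awaiting garbage collection
        self.next_free = 0
        self.num_pages = num_pages

    def write(self, logical_page, data):
        if logical_page in self.l2p:
            self.invalid.add(self.l2p[logical_page])   # old copy is now stale
        if self.next_free >= self.num_pages:
            raise RuntimeError("no free pages; garbage collection needed")
        self.l2p[logical_page] = self.next_free
        self.pages[self.next_free] = data
        self.next_free += 1
        return self.l2p[logical_page]

ftl = ToyFTL()
print(ftl.write(7, b"hello"))    # first write of logical page 7 -> page 0
print(ftl.write(7, b"world"))    # update is redirected to a new page -> page 1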

FIG. 3 is a block diagram of one embodiment of the BEP circuit 112. FIG. 3 illustrates a PCIe interface 200 for communicating with the FEP circuit 110 (e.g., communicating with one of the PCIe interfaces 164 and 166 of FIG. 2). PCIe interface 200 communicates with two NOCs 202 and 204. In one embodiment, the two NOCs may be combined into one large NOC. Each NOC (202/204) is connected to SRAM (230/260), buffer (232/262), processor (220/250), and data path controller (222/252) through an XOR engine (224/254) and an ECC engine (226/256). The ECC engine 226/256 is used to perform error correction, as is known in the art. The XOR engine 224/254 is used to XOR the data so that the data may be combined and stored in a manner that can be recovered in the presence of a programming error. The data path controller 222 is connected to an interface module for communicating with the memory packages via four channels. Thus, the top NOC 202 is associated with four channel interfaces 228 for communicating with the memory packages, and the bottom NOC 204 is associated with four additional channel interfaces 258 for communicating with the memory packages. Each interface 228/258 includes four toggle mode interfaces (TM interfaces), four buffers, and four schedulers. There is one scheduler, buffer, and TM interface for each of the channels. The processor may be any standard processor known in the art. The data path controllers 222/252 may be a processor, FPGA, microprocessor, or other type of controller. The XOR engines 224/254 and ECC engines 226/256 are dedicated hardware circuits, referred to as hardware accelerators. In other embodiments, the XOR engines 224/254 and ECC engines 226/256 may be implemented in software. The schedulers, buffers, and TM interfaces are hardware circuits.

FIG. 4 is a block diagram of one embodiment of a memory package 104 including a plurality of memory dies 292 connected to a memory bus (data lines and chip enable lines) 294. The memory bus 294 is connected to a toggle mode interface 296 for communicating with a TM interface of the BEP circuit 112 (see, e.g., FIG. 3). In some embodiments, the memory package may include a small controller connected to the memory bus and the TM interface. The memory package may have one or more memory dies. In one embodiment, each memory package includes eight or 16 memory dies; however, other numbers of memory dies may also be implemented. The techniques described herein are not limited to any particular number of memory dies.

FIG. 5 is a functional block diagram of one embodiment of a memory die 300. The components depicted in FIG. 5 are electronic circuits. In one embodiment, each memory die 300 includes a memory structure 326, control circuitry 310, and read/write circuits 328. The memory structure 326 is addressable by word lines via a row decoder 324 and by bit lines via a column decoder 332. The read/write circuits 328 include multiple sense blocks 350, including SB1, SB2, …, SBp (sense circuits), and allow a page of memory cells to be read or programmed in parallel. Commands and data are transferred between the controller and the memory die 300 via lines 318. In one embodiment, memory die 300 includes a set of input and/or output (I/O) pins connected to lines 318.

The control circuitry 310 cooperates with the read/write circuits 328 to perform memory operations (e.g., writing, reading, etc.) on the memory structure 326 and includes a state machine 312, an on-chip address decoder 314, and power control circuitry 316. The state machine 312 provides die-level control of memory operations. In one embodiment, the state machine 312 may be programmed by software. In other embodiments, the state machine 312 is implemented without software and entirely in hardware (e.g., electronic circuitry). In another embodiment, the state machine 312 is replaced by a microcontroller. In one embodiment, control circuit 310 includes buffers such as registers, ROM fuses, and other storage devices for storing default values such as reference voltages and other parameters.

The on-chip address decoder 314 provides an address interface between that used by the controller 102 and the hardware address used by the decoders 324 and 332. The power control module 316 controls the power and voltages supplied to the word lines and bit lines during memory operations. The power control module 316 may include a charge pump for generating the voltage. The sense block includes a bit line driver.

For purposes of this document, the phrase "one or more control circuits" may include a controller, state machine, microcontroller and/or control circuit 310, or other similar circuit for controlling non-volatile memory.

In one implementation, memory structure 326 includes a three-dimensional memory array of non-volatile memory cells in which multiple memory levels are formed above a single substrate, such as a wafer. The memory structure may include any type of non-volatile memory that is integrally formed in one or more physical levels of memory cells having an active area disposed above a silicon (or other type) substrate. In one example, the non-volatile memory cells include vertical NAND strings having charge trapping material, such as described, for example, in U.S. patent 9,721,662, which is incorporated herein by reference in its entirety.

In another implementation, memory structure 326 includes a two-dimensional memory array of non-volatile memory cells. In one example, the non-volatile memory cell is a NAND flash memory cell utilizing floating gates, such as described, for example, in U.S. patent 9,082,502, which is incorporated herein by reference in its entirety. Other types of memory cells (e.g., NOR-type flash memory) may also be used.

The exact type of memory array architecture or memory cells included in memory structure 326 is not limited to the examples described above. Many different types of memory array architectures or memory technologies may be used to form memory structure 326. No particular non-volatile memory technology is required to implement the new embodiments claimed herein. Other examples of suitable technologies for the memory cells of memory structure 326 include ReRAM memory (resistive random access memory), magnetoresistive memory (e.g., MRAM, spin transfer torque MRAM, spin orbit torque MRAM), FeRAM, phase change memory (e.g., PCM), and the like. Examples of suitable technologies for the memory cell architecture of memory structure 326 include two-dimensional arrays, three-dimensional arrays, cross-point arrays, stacked two-dimensional arrays, vertical bit line arrays, and so forth.

One example of a ReRAM cross-point memory includes reversible resistance-switching elements arranged in a cross-point array accessed by X-lines and Y-lines (e.g., word lines and bit lines). In another embodiment, the memory cell may include a conductive bridge memory element. The conductive bridge memory elements may also be referred to as programmable metallization cells. The conductive bridge memory element may function as a state-change element based on physical repositioning of ions within the solid electrolyte. In some cases, a conductive bridge memory element may include two solid metal electrodes, one being relatively inert (e.g., tungsten) and the other being electrochemically active (e.g., silver or copper), with a thin film of solid electrolyte between the two electrodes. As temperature increases, the mobility of the ions also increases, which results in a lower programming threshold for the conductive bridge memory cell. Thus, the conductive bridge memory element can have a wide range of programming thresholds over the entire temperature range.

Magnetoresistive memories (MRAMs) store data through magnetic storage elements. The element is formed from two ferromagnetic plates, each of which can be kept magnetized, separated by a thin insulating layer. One of the two plates is a permanent magnet set to a particular polarity; the magnetization of the other plate can be changed to match the magnetization of the external magnetic field to store the memory. The memory device is constructed from a grid of such memory cells. In one embodiment for programming, each memory cell is located between a pair of write lines arranged at right angles to each other, parallel to the cell, one above the cell and one below the cell. When a current is passed through them, an induced magnetic field is generated.

Phase change memory (PCM) takes advantage of the unique properties of chalcogenide glasses. One embodiment uses a GeTe-Sb2Te3 superlattice to achieve a non-thermal phase change by merely changing the coordination state of the germanium atoms with a laser pulse (or a light pulse from another source). Thus, the programming doses are laser pulses. The memory cells may be inhibited by preventing the memory cells from receiving the light. In other PCM implementations, the memory cells are programmed by current pulses. It should be noted that the use of "pulse" in this document does not require a rectangular pulse, but includes a (continuous or non-continuous) vibration or burst of sound, current, voltage, light, or other wave.

One of ordinary skill in the art will recognize that the techniques described herein are not limited to a single particular memory structure, memory configuration, or material composition, but encompass many related memory structures within the spirit and scope of the techniques as described herein and as understood by one of ordinary skill in the art.

The elements of FIG. 5 can be divided into two parts: the memory structure 326 of the memory cells, and the peripheral circuitry, which includes all of the other components. An important characteristic of a memory circuit is its capacity, which can be increased by increasing the area of the memory die 300 that is given over to the memory structure 326; however, this reduces the area of the memory die 300 available for the peripheral circuitry. This can place considerable restrictions on these peripheral components. For example, the need to fit the sense amplifier circuitry within the available area can be a significant limitation on the sense amplifier design architecture. With respect to the on-die control circuitry 310, the reduced availability of area can limit the functions that can be implemented on-chip. Thus, a substantial tradeoff must be made in the design of memory die 300 between the amount of area dedicated to the memory structure 326 and the amount of area dedicated to the peripheral circuitry.

Another area in which the memory structure 326 and the peripheral circuitry can be at odds is the processing involved in forming these regions, since these regions often involve different processing technologies and there is a trade-off in having different technologies on a single die. For example, when memory structure 326 is a NAND flash memory, this is an NMOS structure, while the peripheral circuitry is often CMOS based. For example, the sense amplifier circuitry in the sense blocks 350, the charge pumps in the power control block 316, the logic elements in the state machine 312, and other peripheral circuitry typically employ PMOS devices. The processing operations used to fabricate a CMOS die will differ in many respects from the processing operations optimized for an NMOS flash NAND memory or other memory cell technology.

To address these limitations, the embodiments described below separate the elements of FIG. 5 onto separately formed dies, which are then bonded together. More specifically, the memory structure 326 may be formed on one die, and some or all of the peripheral circuit elements (including one or more control circuits) may be formed on a separate die. For example, a memory die may be formed of just the memory elements, such as the memory cell array of a flash NAND memory, PCM memory, ReRAM memory, or other memory type. Some or all of the peripheral circuitry (even including elements such as decoders and sense amplifiers) may then be moved onto a separate die. This allows each of the dies to be optimized individually according to its technology. For example, a NAND memory die can be optimized for the memory array architecture without concern for the CMOS elements, which have now been moved onto a separate peripheral circuit die that can be optimized for CMOS processing. This provides more space for the peripheral elements, which can now incorporate additional capabilities that could not easily be incorporated were they restricted to the margins of the same die that holds the memory cell array. The two dies can then be bonded together in a bonded multi-die memory circuit, with the array on one die connected to the peripheral elements on the other die. Although the following will focus on a bonded memory circuit of one memory die and one peripheral circuit die, other embodiments can use more dies, such as two memory dies and one peripheral circuit die.

Fig. 6A shows some layers of a non-volatile memory die (such as memory structure 326) after forming interconnect and bonding features. Fig. 6A shows a cross-section of a workpiece 500 comprising a substrate 501, e.g., a silicon substrate, such as a silicon wafer (e.g., a 300mm wafer) or a portion of a wafer. The term workpiece may refer to a substrate (e.g., a wafer or a portion of a wafer) as well as components formed in or on the substrate, including electronic components such as active components, interconnect layers, dielectric layers, and bond pads. Above and over the substrate 501 is a middle region 502, which may include blocks of non-volatile memory cells. In an upper region 503 (interconnect region) of the workpiece 500, one or more metal layers are patterned in the conductive paths to carry electrical signals. These layers may include bit lines and vias. The patterned metal layers are separated by dielectric layers (e.g., silicon dioxide or other insulators) that provide electrical isolation. Above the upper region 503 is a bonding region 504 that includes bonding pads that are connected to conductive paths of the upper region 503 and/or components of the intermediate region 502 and/or the substrate 501 to provide electrical coupling to the non-volatile memory cells in the intermediate region 502 through conductive traces of the upper region 503. Bonding region 504 may include a dielectric layer with bond pads formed in or on the dielectric layer such that the pads are exposed at surface 506 (the major surface) and may be bonded to opposing bond pads in a wafer-to-wafer bonding process.

Fig. 6B illustrates a top view of a portion of a major surface 506 of workpiece 500 showing bare pads including pads 508a-508d (which may include active pads coupled to bit lines, word lines, dummy word lines, selected gate lines, and test pads provided for testing purposes). Although a small number of pads are shown in this example, it should be understood that a large number of pads may be disposed on a surface of a memory die, such as workpiece 500. A single die may include, for example, millions of pads, with a wafer including hundreds or thousands of dies. The bonding region 504 may include a dielectric layer that extends around the pads and isolates adjacent pads. The bond pad may be made of a suitable electrical conductor, such as copper or a copper alloy containing copper, or other suitable metal or metal alloy.

Fig. 7A shows an example of a peripheral circuit die workpiece 600 containing control circuitry, such as the on-die control circuitry 310 and other peripheral elements in fig. 5. The workpiece 600 includes a substrate 601, such as a silicon wafer (e.g., a 300mm wafer) or a portion of a wafer. Above and over the substrate 601 is an intermediate region 602 that may include control circuitry formed on (and/or in) the substrate 601. For example, the intermediate region may include CMOS (complementary metal oxide semiconductor) circuitry including the control circuitry 310 formed in the active area of the substrate 601. In an upper region 603 (interconnect region) of the workpiece 600, one or more metal layers are patterned in the conductive paths to carry electrical signals. These may include: laterally extending traces and vertically extending vias with respect to the major surface of the substrate. Such conductive paths form electrical connections between different circuits (e.g., connection components of controller 122). The patterned metal layers are separated by dielectric layers (e.g., silicon dioxide or other insulators) that provide electrical isolation. Above the upper region 603 is a bonding region 604 comprising bonding pads connected to the conductive paths of the upper region 603 and/or the intermediate region 602 and/or components of the substrate 601 to provide electrical coupling to the circuitry in the intermediate region 602 through the conductive paths of the upper region 603. The bonding region 604 may include a dielectric layer, with bond pads formed in or on the dielectric layer such that the bond pads are exposed at the surface 606 (major surface) and may be bonded to opposing bond pads in a wafer-to-wafer bonding process.

Fig. 7B illustrates a top view of a portion of the major surface 606 of the workpiece 600 showing exposed pads including pads 608a-608d (which may include active pads and test pads for testing purposes) and pads 616. The bonding region 604 may include a dielectric layer that extends around the pads and isolates adjacent pads. The bond pad may be made of a suitable electrical conductor, such as copper or a copper alloy containing copper, or other suitable metal or metal alloy.

The pads 608a-608d and 616 may be arranged to connect with corresponding pads of another workpiece for a wafer-to-wafer bonding process. For example, the pads 616 may be positioned such that when the workpieces 500 and 600 are aligned and joined, the pads 616 are joined to the pads 416 and then to the array bit lines, and may be controlled by control circuitry in the middle region 602 of the workpiece 600. The workpiece 500 may include a plurality of memory dies, and the workpiece 600 may include an equal number of control circuit dies in a corresponding pattern (e.g., a one-to-one pattern such that for each memory die of the workpiece 500 there is a corresponding control circuit die of the workpiece 600). The opposing pair of dies have corresponding pad patterns such that when the workpiece is precisely aligned, each pad is aligned with a corresponding pad of the opposing die such that the control circuitry of the workpiece 600 is electrically connected to the memory cells of the workpiece 500.

Fig. 8A illustrates a wafer-to-wafer bonding process of workpiece 500 and workpiece 600. Substrate 501 is processed to fabricate memory arrays (e.g., memory structure 326), interconnect structures, and pads for bonding, as discussed above with reference to fig. 6A-6B, to form workpiece 500. The substrate 601 is processed to fabricate peripheral circuits including control circuitry (e.g., logic circuitry formed as CMOS circuitry), interconnect structures, and pads for bonding, as discussed above with reference to fig. 7A-7B, to form the workpiece 600. Then, in this example, the workpiece 500 is flipped over (either workpiece may be flipped) so that the major surface 506 of the workpiece 500 is opposite the major surface 606 of the workpiece 600. The workpieces 500, 600 are aligned such that corresponding die pairs are aligned and pads on such die pairs are aligned for bonding. Subsequently, with the workpieces 500, 600 aligned, pressure and/or heat or other conditions are applied to the workpieces 500, 600 to bond the respective pads together, thereby forming electrical connections between the memory array of the workpiece 500 and the control circuitry of the workpiece 600 (i.e., bonding along the interface between the major surfaces 506, 606). The joined workpieces 500 and 600 form a composite workpiece 700 that includes die pairs, where each pair includes a memory array die and a control circuit die that form a memory system.

Fig. 8B shows a combined workpiece 700 comprising workpiece 600 and workpiece 500. As can be seen, the bonding region 504 of the workpiece 500 is in contact with the bonding region 604 of the workpiece 600 along the interface 701 (i.e., the interface between the major surfaces of the workpieces 500, 600).

Fig. 8C shows a portion of a composite workpiece 700 comprising a substrate 601, an intermediate region 602 (comprising active components formed as CMOS chip components in and on the substrate 601), an upper region 603 (comprising conductive paths formed by traces and vias separated by a dielectric), and a bonding region 604 comprising pads bonded to corresponding pads of bonding region 504. The bond pads of the bonding region 504 are coupled to the memory array elements formed on the substrate 501 in the intermediate region 502 by conductive paths of the upper region 503.

Although the pads along the major surfaces of the workpieces 500, 600 may be joined together and may not be exposed after joining, additional pads may be provided for connection to other circuitry (e.g., to external circuitry of the composite workpiece 700) and may be exposed for joining and/or probing. The exposed pads 720 in the workpiece 600 are exposed by openings 722 that extend through the workpiece 500. One or more such openings may be provided in each die of the workpiece 500 such that designated pads of the workpiece 600 remain exposed after wafer-to-wafer bonding of the workpieces 500, 600. For example, an opening etch may be performed through the workpiece 500 before (or after) wafer-to-wafer bonding. Although one exposed pad (pad 720) is shown, it should be understood that any number of exposed pads may be provided. For example, pads may be provided for interacting with a host, receiving power, testing, and other purposes.

The manufacturing process used to form the workpiece 500 may be completely different from the manufacturing process used to form the workpiece 600. For example, although the memory structure 326 of the workpiece 500 may include multiple active areas over a substrate surface in a 3-D structure, the logic circuitry of the workpiece 600 may be formed in a single active area. The logic circuitry may be formed by a CMOS process, while the memory structure 326 uses process steps suitable for 3D memory fabrication, which may be different from the CMOS process steps. Thus, the number of layers and the materials used may vary, the equipment dimensions and tolerances may vary, and the thermal budget may be different. Embedding logic circuitry in a workpiece such as workpiece 500 may require additional process steps, as the same process steps used to form the memory structure 326 may not be compatible with the process steps used to form the logic circuitry. Accordingly, the logic circuits provided in a peripheral region (e.g., CMOS or other logic circuits in the peripheral region 104) may be formed separately. This increases the number of process steps, thereby increasing the risk of damage and creating thermal budget issues.

Although fig. 5 illustrates an example in which memory die 300 includes various circuitry (e.g., control circuitry 310, read/write circuits 328, and row decoders 324) other than memory structures 326 that may be formed in peripheral areas, reducing or eliminating logic circuitry in the array workpiece may reduce the number of processing steps required and reduce thermal budget issues. Wafer-to-wafer bonding provides a relatively large area for bonding a large number of pads, so that a wide interface may be provided between the array workpiece and the control circuitry workpiece. This may be advantageous for different circuit arrangements than the circuit shown in fig. 5. For example, in some examples, some or all of the peripheral circuitry, including control circuitry 310, read/write circuits 328, and row decoder 324 shown on memory die 300, may be located on the peripheral or control circuitry workpiece, rather than on the same die as memory structure 326. This may allow the fabrication of an array workpiece for memory cells with little or no logic circuitry (e.g., no CMOS components) so that the fabrication process may be significantly simplified and thermal budget issues reduced.

Fig. 9 shows an alternative arrangement to that of fig. 5, which may be achieved by wafer-to-wafer bonding. Fig. 9 shows an example of peripheral circuitry including control circuitry formed in a control circuitry die 808 (e.g., a peripheral circuitry die in workpiece 600) coupled to a memory structure 326 formed in an array die 810 (e.g., an array die in workpiece 500). The common components are numbered as in fig. 5. As can be seen, the control circuit 310, the read/write circuits 328, and the row decoder 324 (which may be formed by CMOS processes) are located in a control circuit die 808. Additional elements, such as functions from controller 102, may also be moved into control circuit die 808. Control circuitry 310, read/write circuits 328, row decoder 324, and controller 322 may be formed from conventional processes (e.g., CMOS processes) such that few or no additional process steps may be required to add elements and functions more commonly found on memory controller 102 (i.e., the same process steps used to manufacture controller 102 may also be used to manufacture control circuitry 310, read/write circuits 328, and row decoder 324). Thus, although removing such circuitry in a die (such as memory die 300) may reduce the number of steps required to manufacture such a die, adding such circuitry to a die (such as control circuitry die 808) may not require any additional process steps.

FIG. 9 shows read/write circuits 328 coupled to memory structure 326 by electrical paths 812. For example, electrical path 812 may provide an electrical connection between read/write circuit 328 and a bit line of memory structure 326. An electrical path may extend from read/write circuit 328 in control circuit die 808 through a pad (e.g., pad 616) on the control circuit die that is bonded to a corresponding pad (e.g., pad 416) on array die 810 that is connected to a bitline of memory structure 326. Each bit line of memory structure 326 may have a corresponding one of electrical paths 812, including a pair of bond pads connected to read/write circuitry 328. Similarly, row decoder circuitry 324 is coupled to memory structure 326 by electrical path 814. Each of electrical paths 814 may correspond to a word line, a dummy word line, or a selected gate line. Additional electrical paths may also be provided between the control circuit die 808 and the array die 810.

In the following, the state machine 312 and/or the controller 102 (or equivalently functioning circuits) may be considered part of the one or more control circuits that perform the functions described herein, in combination with all or a subset of the other circuits shown in FIG. 9. The control circuits may include hardware only, or a combination of hardware and software, including firmware. For example, a controller programmed by firmware to perform the functions described herein is one example of a control circuit. A control circuit may include a processor, FPGA, ASIC, integrated circuit, or other type of circuit.

Turning now to the types of data that can be stored on a non-volatile memory device, the specific example of a data type of interest in the following discussion is the weights used by an artificial neural network, such as a convolutional neural network, or CNN. The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution, which is a specialized kind of linear operation. Convolutional networks are neural networks that use convolution in place of general matrix multiplication in at least one of their layers. A CNN is formed of an input layer and an output layer, with a number of intermediate hidden layers. The hidden layers of a CNN are typically a series of convolutional layers that "convolve" with a multiplication or other dot product. Although the layers are commonly referred to as convolutions, technically they are often a sliding dot product or cross-correlation, such as described below with respect to FIG. 13.

Each neuron in the neural network computes an output value by applying a specific function to the input values from the receptive field in the previous layer. The function that is applied to the input values is determined by a vector of weights and a bias. Learning in a neural network progresses by making iterative adjustments to these biases and weights. The vector of weights and the bias are called a filter and represent a particular feature of the input (e.g., a particular shape). A distinguishing feature of CNNs is that many neurons can share the same filter.

FIG. 10 is a schematic representation of an example of a CNN. The CNN starts with an input image of an array of pixel values, followed by a number of convolutional layers, which are in turn followed by a number of fully connected layers, the last of which provides the output. Each neuron in the first convolutional layer takes as input data from an n x n pixel sub-region of the input image. The neuron's learned weights, which are collectively referred to as its convolution filter, determine the neuron's single-valued output in response to the input. In the convolution, a neuron's filter is applied to the input image by sliding the input region along the x and y dimensions of the image to generate the values of the convolutional layer. In practice, the equivalent convolution is normally implemented by statically identical copies of the neuron applied to different input regions. The process is repeated through each of the convolutional layers using each layer's learned weights, after which it is propagated through the fully connected layers using their learned weights.

FIG. 11 shows several fully connected layers of a neural network in more detail. In FIG. 11, the three-layer artificial neural network shown is represented as an interconnected group of nodes, or artificial neurons, shown as circles, and a set of connections from the output of one artificial neuron to the input of another. The example shows three input nodes (I1, I2, I3) and two output nodes (O1, O2), with an intermediate layer of four hidden or intermediate nodes (H1, H2, H3, H4). The nodes, or artificial neurons/synapses, of the artificial neural network are implemented by a logic element of a host or other processing system as a mathematical function that receives one or more inputs and sums them to produce an output. Usually, each input is separately weighted and the sum is passed through the node's mathematical function to provide the node's output.

In a common artificial neural network implementation, the signal at a connection between nodes (artificial neurons/synapses) is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. Nodes and their connections typically have a weight that adjusts as the learning process proceeds. The weight increases or decreases the strength of the signal at a connection. Nodes may have a threshold such that a signal is only sent if the aggregate signal crosses that threshold. Typically, the nodes are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer) to the last layer (the output layer), possibly after traversing the layers multiple times. Although FIG. 11 shows only a single intermediate or hidden layer, a complex deep neural network (DNN) can have many such intermediate layers.
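
As a concrete (and purely hypothetical) example of the node computation just described, the short sketch below forms one hidden node's output as the weighted sum of three inputs plus a bias, passed through a sigmoid non-linearity:

import math

def node_output(inputs, weights, bias):
    # Weighted sum of the inputs plus a bias, passed through a non-linear
    # function (a sigmoid here) to give the node's output.
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# One hidden node H1 fed by inputs I1, I2, I3 (values are illustrative).
print(node_output([0.5, -1.0, 2.0], [0.1, 0.4, -0.3], bias=0.2))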

Supervised artificial neural networks are "trained" by supplying inputs, then examining and correcting the outputs. For example, a neural network trained to recognize a breed of dog will process a set of images and calculate the probability that the dog in the image is of a breed. The user can view the results and select which probabilities the network should display (above a certain threshold, etc.) and return the proposed label. Each such mathematical transformation is considered to be a layer, and complex neural networks have many layers. Because of the depth provided by the large number of intermediate or hidden layers, neural networks can model complex nonlinear relationships when trained.

FIG. 12A is a flow chart describing one embodiment of a process for training a neural network to generate a set of weights. This training process is often performed in the cloud, enabling additional or more powerful processing to be used. At step 1201, an input, such as a set of images (e.g., the image input in FIG. 10), is received. At step 1203, the input is propagated through the layers connecting the input to the next layer (e.g., CON1 in fig. 10) using the current filter or set of weights. Then in step 1205, the output of the neural network is received at the next layer (e.g., CON2 in fig. 10), such that the values received as output from one layer are used as input to the next layer. The inputs from the first layer propagate in this way through all intermediate or hidden layers until they reach the output. In the dog breed example of the previous paragraph, the input will be image data for a number of dogs, and the intermediate layer uses the current weight values to calculate the probability that a dog in the image is a certain breed, and returns a proposed dog breed label at step 1205. The user may then review the results at step 1207 to select which probabilities the neural network should return and determine if the current weight set provides sufficiently accurate labels, and if so, training is complete (step 1211). If the results are not accurate enough, the neural network adjusts the weights at step 1209 based on the probability selected by the user, and then loops back to step 1203 to run the input data again using the adjusted weights. Once the set of weights for the neural network has been determined, the weights can be used to "infer", which is the process of using the determined weights to generate an output from the data input to the neural network. Once the weights are determined at step 1211, the weights may be stored in non-volatile memory for later use, wherein the storing of the weights in non-volatile memory is discussed in further detail below.
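
The loop of FIG. 12A can be sketched roughly as below. The mean-squared-error test stands in for the user's accuracy review at step 1207, and a gradient step stands in for the weight adjustment at step 1209; the data, learning rate, and network (a single linear layer) are all illustrative assumptions.

import numpy as np

def train(inputs, targets, weights, lr=0.5, max_epochs=2000, tol=1e-6):
    # Steps 1201-1211 of FIG. 12A in miniature: propagate, check the
    # output accuracy, adjust the weights, and repeat until good enough.
    for _ in range(max_epochs):
        outputs = inputs @ weights                        # steps 1203/1205
        error = outputs - targets
        if np.mean(error ** 2) < tol:                     # step 1207
            break                                         # step 1211: done
        weights -= lr * inputs.T @ error / len(inputs)    # step 1209: adjust
    return weights

rng = np.random.default_rng(0)
x = rng.random((32, 4))
true_w = rng.standard_normal((4, 2))
w = train(x, x @ true_w, rng.standard_normal((4, 2)))
print(np.round(w - true_w, 3))    # learned weights approach the true ones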

FIG. 12B is a flowchart describing a process for the inference phase of supervised learning using a neural network, which predicts the "meaning" of the input data using an estimated accuracy. Depending on the case, the inference may be performed in the cloud or by the processor of an edge device (e.g., smartphone, automobile processor, hardware accelerator). At step 1221, the input is received, such as the image of a dog in the example used above. If the previously determined weights are not present in the device running the neural network application, they are loaded at step 1222. For example, on a host processor executing the neural network, the weights could be read out of an SSD in which they are stored and loaded into RAM on the host device. At step 1223, the input data is then propagated through the layers of the neural network. Step 1223 will be similar to step 1203 of FIG. 12A, but now using the weights established at the end of the training process at step 1211. After propagating the input through the intermediate layers, the output is then provided at step 1225.
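For illustration only, the inference phase of FIG. 12B reduces to loading the stored weights and propagating the input through the layers; the weight files and the layer-by-layer loop below are hypothetical stand-ins for steps 1222 and 1223, not an interface defined by this disclosure.

```python
import numpy as np

def infer(x, weight_files, activation=np.tanh):
    """Steps 1222/1223: load previously determined weights, then propagate."""
    weights = [np.load(f) for f in weight_files]  # step 1222: load from storage
    for w in weights:                             # step 1223: propagate layer by layer
        x = activation(x @ w)
    return x                                      # step 1225: provide the output
```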

FIG. 13 is a schematic representation of a convolution operation between an input image and a filter, or set of weights. In this example, the input image is a 6 × 6 array of pixel values and the filter is a 3 × 3 array of weights. The convolution operation is performed by multiplying the 3 × 3 filter with 3 × 3 blocks of the input image. For example, multiplying the upper-left 3 × 3 block of the image by the filter yields the top-left value of the output matrix. The filter can then be slid over by one pixel on the image to generate the next entry of the output, and so on to generate the four elements of the top row of the output. This process is repeated by sliding the filter down one pixel at a time, generating a 4 × 4 output matrix. Similar operations are performed for each of the layers. In an actual CNN, the size of the data sets and the number of convolutions performed mean that a very large number of such operations are performed, involving very large amounts of data.
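The sliding-window convolution of FIG. 13 can be sketched as follows (illustrative Python only, not the in-memory implementation described later); a 6 × 6 input and a 3 × 3 filter produce the 4 × 4 output described above.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the filter one pixel at a time and sum the element-wise products."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1            # 6x6 input, 3x3 filter -> 4x4 output
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)  # example 6x6 array of pixel values
kernel = np.ones((3, 3)) / 9.0                    # example 3x3 filter of weights
print(convolve2d(image, kernel).shape)            # (4, 4)
```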

FIG. 14 is a schematic representation of the use of matrix multiplication in a fully connected layer of a neural network. Matrix multiplication, or MatMul, is a commonly used approach in both the training and inference phases of neural networks and is used in kernel methods for machine learning. The top portion of FIG. 14 is similar to FIG. 11, where only a single hidden layer is shown between the input and output layers. The input data is represented as a vector whose length corresponds to the number of input nodes. The weights are represented in a weight matrix, where the number of columns corresponds to the number of intermediate nodes in the hidden layer and the number of rows corresponds to the number of input nodes. The output is determined by a matrix multiplication of the input vector with the weight matrix, where each element of the output vector is a dot product of the vector of input data with a column of the weight matrix.
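A minimal sketch of the fully connected computation of FIG. 14: each output element is the dot product of the input vector with one column of the weight matrix. The dimensions used here are arbitrary examples chosen for illustration.

```python
import numpy as np

n_inputs, n_hidden = 8, 4                 # example sizes, chosen for illustration
x = np.random.rand(n_inputs)              # input vector (one value per input node)
W = np.random.rand(n_inputs, n_hidden)    # rows = input nodes, columns = hidden nodes

h = x @ W                                 # MatMul: h[j] = dot(x, W[:, j])
assert np.allclose(h, [np.dot(x, W[:, j]) for j in range(n_hidden)])
```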

A common technique for performing the matrix multiplications is to use a multiplier-accumulator (MAC, or MAC unit). However, this approach has a number of issues. Referring back to FIG. 12B, the inference phase loads the neural network weights at step 1222 and then performs the matrix multiplications by propagation at step 1223. However, since the amount of data involved can be very large, using a multiplier-accumulator for inferencing raises several issues related to the loading of weights. One of these is high energy dissipation due to having to use large MAC arrays with the required bit width. Another is high energy dissipation due to the limited size of MAC arrays, resulting in large amounts of data movement between logic and memory and an energy dissipation that can be much higher than that used in the logic computations themselves.

To help avoid these limitations, other memory technologies can be used in place of a multiplier-accumulator array. For example, the matrix multiplication can be computed within a memory array by taking advantage of the characteristics of NAND memory and Storage Class Memory (SCM), such as memory based on ReRAM, PCM, FeRAM, or MRAM memory cells. This allows the neural network inputs to be provided via read commands and allows the neural weights to be preloaded for inferencing. By using in-memory computation, this can remove the need for logic to perform the matrix multiplication in a MAC array and the need to move data between the memory and the MAC array.

While storing the weights of a neural network in the non-volatile memory cells of a non-volatile array can save on the data transfers involved in neural network computations, and in-memory computation can eliminate the need to move data between the memory and a MAC array for a neural network layer, a neural network can have a very large number of layers. As the initial input is propagated through these many layers, the output of a layer computed on one memory die will often need to serve as the input to a layer stored on another memory die. Using bonded die pairs, such as described above with respect to FIGS. 6A-9, can improve this situation, since the increased capacity allows more arrays, larger arrays, or both to be formed on the memory die, so that the computations for more layers can be performed on a given die. Additionally, the peripheral circuit die can have greater capabilities than when the control elements are confined to the margins of a die that also holds the array. Even with the improvements obtained by using bonded die pairs, the entirety of a neural network, or even a substantial portion of one, may require many such bonded die pairs.

To increase capacity, multiple such bonded die pairs can be stacked and interconnected. When multiple individual memory chips are connected, such as in a memory package, the individual integrated circuit chips are typically connected by wires bonded to pads along the chip edges. This limits the number of such interconnections, since in practice only so many pads can be formed along the edge of a memory chip in the available space. The nature of the bonding wires can also limit the speed at which signals can be transferred between the individual integrated circuit chips, where there is also a trade-off between the number of available connections (favoring smaller wires) and the speed of the connections (favoring larger wires). To improve upon this situation, embodiments are described below in which multiple bonded die pairs are stacked on top of one another and connected by through-silicon vias (TSVs). This allows the neural network output from a layer in one bonded die pair of the stack to be transferred over a via to serve as the input for another layer of the neural network whose weights are stored in a different bonded die pair of the stack. This allows data to be propagated efficiently, with high bandwidth, through multiple layers of the neural network, all within the stack of bonded die pairs. FIG. 15 illustrates such an arrangement.

FIG. 15 is a simplified illustration of a non-volatile memory structure 1500 formed of a stack of bonded die pairs connected by vias. In the example shown in FIG. 15, a stack of four bonded die pairs (1501, 1503, 1505, 1507) is shown, where each bonded die pair is here represented as a memory die MEM (e.g., 1501a) on top of a CMOS logic die (e.g., 1501b). Actual embodiments can include many more such bonded die pairs. The stack of bonded die pairs sits on a set of "bumps" 1513, which in turn sit on top of the package substrate 1515. A set of through-silicon vias, such as 1511, can serve as channels for transferring data up and down the stack of bonded die pairs. As with the number of bonded die pairs, the number of vias can vary depending on the embodiment. The use of the via structure provides high bandwidth and low interconnect latency relative to a wire-bonded interconnection approach.

Each of the memory dies 1501a, 1503a, 1505a, 1507a can hold multiple memory arrays, and the arrays of different memory dies can have different structures. As described below, when used for the inference operations of a neural network, the results of the computations can be propagated down or up through the different bonded die pairs, where the operation at each bonded die pair corresponds to a layer of the neural network. Different types or sizes of memory arrays may be better suited to different stages of the neural network, and the arrangement of the memory arrays can reflect this. For example, convolutional layers may be better implemented with storage class memory arrays and fully connected layers with NAND memory arrays, so that the three-dimensional stack of FIG. 15 could be structured so that the input initially propagates through a series of storage class memory arrays followed by a series of NAND memory arrays. The TSV channels, such as 1511, can be used to group the arrays of the different bonded die pairs into columns for the propagation through the layers of the neural network. (In addition to the three-dimensional structure arising from the vertical stacking of the horizontally arranged bonded die pairs, it should be noted that the memory cell arrays on a memory die can also have a 3D structure, such as 3D flash NAND; however, in the following discussion, a 3D structure will refer to the stacking of bonded die pairs unless otherwise specified.)

The CMOS dies (1501b, 1503b, 1505b, 1507b) of the different bonded die pairs can also be formed differently. Moving the control circuitry and other peripheral circuitry onto a separate die provides a large amount of additional area for adding logic elements and functionality, so that different operations can be performed on the data as it is passed through the different levels of a column. Additionally, one or more of the CMOS levels can be configured to move data laterally between columns so that, for example, after an input has propagated down a via through the layers of a neural network to the CMOS die 1507b at the bottom of the stack, the values can be shifted over to another column of arrays and propagated back up through it. Although not shown in FIG. 15, each of the CMOS levels can also include a number of pins, corresponding to the I/O pins 318 of FIG. 9, so that data and commands can be transferred into and out of the stack at the different levels of the bonded die pairs.

The following discussion focuses primarily on inferencing using the structure of FIG. 15, where the weights have previously been programmed into the memory cells of the structure; however, the structure can also be applied to the training process used to determine the weight values. In some embodiments, an additional task for training is to compute suitable floating-point gradients after each training epoch and then update the weight values. The updating of the weight values can be accomplished either by rewriting the updated weights in their current locations in the memory arrays or by writing them to other available locations in the structure, with the logical addresses updated accordingly. The extra space available on the CMOS dies can be used to implement additional functionality on one or more of these dies, such as by introducing floating-point processing on them.

With respect to FIG. 5, the on-die control circuits of FIG. 9 can also include additional functionalities within their logic elements, including both the more general-purpose capabilities typically found in the memory controller 102 and some CPU capabilities, as well as application-specific features. In the embodiments described here, this can include various activation functions that determine the amplitude of the output for the neural network operations described below, as represented by the activation logic block 313 of the on-die control circuit. For example, these activation functions can include the addition and threshold determination operations used in the accumulation portion of multiply and accumulate (MAC) operations, as well as more advanced operations such as sigmoid or tanh functions. The additional area available on the CMOS (or, more generally, peripheral) circuitry of FIG. 9 can also include volatile or non-volatile buffer memory 317 to store data, such as the weights of a neural network layer, for logic memory operations.
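For illustration only, the kinds of activation functions mentioned above might be expressed along the following lines (a Python sketch under the assumption of simple element-wise operations; the actual activation logic block 313 is hardware, and these function names are not defined by this disclosure).

```python
import numpy as np

def accumulate(partial_sums):
    """Accumulation portion of a multiply and accumulate (MAC) operation."""
    return np.sum(partial_sums, axis=0)

def threshold(x, t=0.0):
    """Simple threshold determination: pass a value only if it exceeds t."""
    return np.where(x > t, x, 0.0)

def sigmoid(x):
    """Higher-level activation: sigmoid function."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Higher-level activation: tanh function."""
    return np.tanh(x)
```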

Neural networks, such as large-scale Deep Neural Networks (DNNs), can have very deep network structures (i.e., very large numbers of layers) and use very large models (i.e., their layers have very large numbers of weights), which presents challenges in supporting large-scale DNNs. A large-scale DNN model cannot fit into the typical on-chip memory (SRAM), or even into off-chip volatile memory (DRAM), that a processor would use to perform inference operations with the model. As a result, earlier implementations of large-scale DNNs suffer from costly data movement between DRAM, SRAM, cache, and the CPU/GPU cores during both the training and inference phases. Although portions of a large-scale DNN could be mapped onto a 2D planar architecture, moving data laterally between arrays across a planar structure introduces large and inconsistent latencies simply due to the topological constraints of moving large amounts of data laterally around the chip structure.

The architecture of FIG. 15 can address many of the issues related to large-scale DNNs. A structure in which multiple memory array dies are stacked can store very large models. The multiple CMOS dies can be used to implement heterogeneous functionality at the different levels, thereby reducing or eliminating the need to move data back and forth across the memory hierarchy to the CPU/GPU cores, or even to the controller 102. The structure can also provide high bandwidth and low interconnect latency by leveraging the via structure, significantly reducing the cost of data movement. Additionally, the structure is highly scalable.

FIGS. 16 and 17 schematically illustrate the inference operations performed by the structure of FIG. 15 when the convolution or fully connected multiplication of a layer is performed within the memory die and within the CMOS die, respectively. In both approaches, a layer's weights can be stored in an array of the memory die, but in the embodiment of FIG. 16 the inputs for the layer are applied within the array, while in the embodiment of FIG. 17 the weights are read out into the CMOS die and the layer's inputs are applied there. Although shown separately in FIGS. 16 and 17, the two approaches can be combined, with in-array multiplication used for some layers and the weights of other layers read out into the CMOS die.

When an inference operation for the neural network is performed in the embodiments of FIG. 16 or 17, the weights predetermined during training have already been stored in the arrays of the structure. In the arrangement of FIG. 16, an in-memory-die computation approach is used, in which the memory die serves for both weight storage and computation, where the computation can be either a vector multiplication or a matrix multiplication. An activation function (such as a threshold determination, sigmoid, or tanh function) can then be performed in the corresponding CMOS level. For example, the initial input for a neural network whose weights are stored in the structure 1500 can be received at CMOS die 1501b of bonded die pair 1501. The input is applied to the array of memory die 1501a storing the weights to perform the in-array multiplication operation. The result of the in-array multiplication is received back at CMOS die 1501b, where an activation function can be applied, with the resulting output passed along the vias to another of the bonded die pairs (such as 1503) to serve as the input for the next layer of the neural network.
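A behavioral sketch (Python, for illustration only) of the in-array approach of FIG. 16: the layer's weights are modeled as an array of cell conductances, the inputs as word line voltages, and the bit line outputs as the resulting column sums, with the activation applied afterward in the CMOS die. The names and the linear conductance model are assumptions for illustration, not details of the embodiments.

```python
import numpy as np

def in_array_multiply(wordline_voltages, conductances):
    """Model of an in-array multiplication: each bit line collects the sum of
    (voltage x conductance) contributions from the cells in its column."""
    return wordline_voltages @ conductances      # bit line outputs

def cmos_activation(bitline_values, threshold=0.0):
    """Activation performed back on the CMOS die (e.g., 1501b)."""
    return np.maximum(bitline_values - threshold, 0.0)

weights = np.random.rand(16, 8)                  # layer weights stored as cell states
layer_input = np.random.rand(16)                 # inputs driven onto the word lines
layer_output = cmos_activation(in_array_multiply(layer_input, weights))
```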

In the arrangement of FIG. 17, the computations are performed in the logic of the CMOS die. For example, the initial input for a neural network whose weights are stored in the structure 1500 can again be received at CMOS die 1501b of bonded die pair 1501. The weights for the layer are read out of the array of memory die 1501a, either before or after the input is received, and the multiplication between the input and the weights is performed in the logic elements of CMOS die 1501b. CMOS die 1501b can then apply an activation function to the result of the in-logic multiplication, with the resulting output passed along the vias to another of the bonded die pairs to serve as the input for the next layer of the neural network.
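By contrast, the in-logic approach of FIG. 17 can be sketched as a read of the stored weights followed by an ordinary multiplication in the CMOS die's logic; read_weights below is a hypothetical placeholder for the array read, and the sketch is illustrative only.

```python
import numpy as np

def in_logic_layer(layer_input, read_weights, activation=np.tanh):
    """FIG. 17 flow: read the layer's weights out of the memory die, then
    multiply and activate within the CMOS die's logic."""
    weights = read_weights()            # weights read from the memory die's array
    return activation(layer_input @ weights)
```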

With respect to the peripheral circuitry on the CMOS die of a bonded die pair (e.g., 1501b of pair 1501), a number of functions can be mapped onto the CMOS die. As discussed above with respect to FIGS. 6A-9, the typical peripheral circuitry that supports a memory array (e.g., sense amplifiers, decoders, charge pumps, and so on) can be moved onto the CMOS die of the bonded die pair. Additionally, in the bonded die pair arrangement, the extra space available on the CMOS die can be used for dedicated logic or core elements to accelerate domain-specific applications, such as ECC, data searching, compression, filtering, and the like. The CMOS dies can also incorporate activation functions, such as a sigmoid or tanh function, for use with the neural network layers. Some or all of the CMOS dies in the stack can also include some degree of CPU functionality to support domain-general applications. The different CMOS dies in the stack can also differ from one another in their capabilities; for example, different activation functions may be needed at different levels of a neural network, and in some embodiments the CMOS dies can be tailored to these needs based on their positions in the stack of bonded die pairs.

For example, as discussed above, the output computed within one bonded die pair for one layer of the neural network can be passed down or up over the vias (such as 1511) to another bonded die pair to serve as the input for a subsequent layer of the neural network. By enabling switch box interconnections on the CMOS dies to transfer data streams from one column to another, the sequence of layers of the neural network can be extended: as a set of inputs propagates down one column of memory arrays in the stack, the CMOS die of the bottom bonded die pair (such as CMOS die 1507b in FIG. 15) can have a switch box interconnection to transfer the data stream laterally to another via/column, where it is propagated back up the stack. Similarly, when the data stream reaches the top of the stack, it can again be transferred laterally and sent back down the stack. FIG. 18 illustrates such an arrangement.

FIG. 18 illustrates the mapping of a 3D deep neural network onto the architecture of FIG. 15. For ease of presentation, FIG. 18 again shows a stack of four bonded die pairs, numbered as in FIG. 15. Each of the memory dies 1501a, 1503a, 1505a, 1507a is represented as having four arrays, or planes, that are vertically grouped into four columns, or "iplanes," 1831, 1833, 1835, and 1837, where each of the iplanes can run in parallel. In some embodiments, each of the iplanes can be used to map one or several neural network layers. In other embodiments, several iplanes can be used to map a larger neural network layer to achieve a high level of parallelism.

In the embodiment shown here, each column of arrays is shown with two through-silicon vias: (1841, 1842), (1843, 1844), (1845, 1846), and (1847, 1848), respectively. Depending on the embodiment, the number of bonded die pairs in the stack can be smaller or larger, as can the number of columns and the number of memory arrays or planes within each memory die.

Referring back to the neural networks represented in FIGS. 10 and 11, the weights for each layer are stored in an array of the memory die at the corresponding level's bonded die pair. The input for a layer is received at the corresponding CMOS die and applied to the layer's weights either by an in-array operation, such as described with respect to FIG. 16, or by reading out the weights and performing the layer's multiplication in the logic of the CMOS die, such as described with respect to FIG. 17. For example, an initial input can be received from the memory controller at CMOS die 1501b and applied to the weights of a first convolutional layer in the column 1831 array to generate the output of the first convolutional layer, which is then transferred down one or both of the via sets 1841 and 1842 to bonded die pair 1503. The output from the preceding layer serves as the input for the second convolutional layer, whose weights are stored in the column 1831 array of memory die 1503a, with CMOS die 1503b generating the output of the second convolutional layer. This process can propagate down the levels of bonded die pairs until it reaches the bottom of the stack (1507 in this example), where it can be moved laterally by the switch box interconnection in CMOS die 1507b over to column 1833, where it propagates back up the levels of the network to the top of the stack, continuing down and up the columns as it works through the layers. Depending on the implementation, inputs and outputs can be made at the various levels. The one or more control circuits for the inference operation can be the various elements shown for the peripheral circuit die 808 of FIG. 9, where the controller 102 can also form part of the control circuitry configured to perform the inference operations of these embodiments.
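The serpentine propagation just described, with layer outputs moving down one column, laterally at the bottom of the stack, and back up the next column, can be sketched as follows; the numbering of levels and columns follows FIG. 18, but the mapping function itself is only an illustrative assumption.

```python
# Columns (iplanes) and levels (bonded die pairs) as numbered in FIG. 18.
columns = [1831, 1833, 1835, 1837]
levels = [1501, 1503, 1505, 1507]

def serpentine_order(columns, levels):
    """Yield (column, level) pairs in the down/up order in which successive
    neural network layers are visited: down column 1831, up 1833, and so on."""
    order = []
    for i, col in enumerate(columns):
        col_levels = levels if i % 2 == 0 else list(reversed(levels))
        order.extend((col, lvl) for lvl in col_levels)
    return order

# First few layers map to: (1831, 1501), (1831, 1503), (1831, 1505), (1831, 1507),
# then laterally to (1833, 1507), (1833, 1505), ...
print(serpentine_order(columns, levels)[:6])
```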

FIGS. 19 and 20 are flowcharts for embodiments of performing an inference operation of a neural network by propagating the network's inputs through the layers of the network as mapped onto the structure of FIG. 15. More specifically, FIG. 19 corresponds to the arrangement of FIG. 16 and FIG. 20 corresponds to the arrangement of FIG. 17. Although described in the following discussion as two separate embodiments, hybrid embodiments can also be implemented, with some layers using in-memory computation (as illustrated by FIG. 19) and other layers using computation in the logic die (as illustrated by FIG. 20).

FIG. 19 begins at step 1901 with the memory structure of FIG. 18 receiving an input for the neural network. In FIG. 18, this is shown as being received at the I/O pins (318 in FIG. 9) of CMOS die 1501b of bonded die pair 1501. The input can be the initial input for the neural network, as at the far left of FIG. 10, or an input for an intermediate level of the network. Although shown as an input at the uppermost bonded die pair 1501 of the stack, the input can also be received at other levels, depending on the embodiment and on the level at which the weights of the relevant network layer are located. In this example, the input will initially propagate down column 1831.

At step 1903, the input is applied to the column 1831 array of non-volatile memory cells in the corresponding memory die 1501a to perform an in-array multiplication of the input with the weights stored in the array. For example, a vector of input values can be converted into word line voltages in the one or more control circuits of CMOS die 1501b (such as by the row decode circuit 324) and applied over the lines 814 to the word lines of the column 1831 array of memory die 1501a. The weights of the layer are stored in the non-volatile memory array, and the in-array multiplication operation can correspond to the operations illustrated above with respect to FIG. 13 or FIG. 14. The result of the in-array operation between the input matrix or vector and the layer's matrix or vector of weights can correspond to the values on the bit lines of the array, which can then be supplied over the lines 812 to the read/write circuits 328 on CMOS die 1501b. On CMOS die 1501b, at step 1905, the activation logic 313 can be used to perform an activation operation on the in-memory multiplication result of step 1903. Step 1905 can be an accumulation operation or a higher-level activation operation, and generates the output for the first layer.

Step 1907 determines whether the output of step 1905 is the final output of the neural network computation. If not, the output of step 1905 is propagated at step 1909 to the next layer of the neural network, or, more specifically, to the bonded die pair storing the weights corresponding to that layer, to serve as the input for the next convolution or multiplication operation at step 1903. As illustrated in FIG. 18, the output of one level (e.g., 1501) is propagated down a column (e.g., 1831) along the vias (e.g., 1841, 1842) to the next level (e.g., 1503), where it serves as the input. In the embodiment of FIG. 18, the results of the convolutions or multiplications propagate down the vias 1841 and 1842 of column 1831 to the bottom level 1507, where they are transferred over to column 1833 and propagated up the vias 1843 and 1844. Once at the topmost level 1501, the data again moves laterally and propagates down column 1835, and so on through the layers of the neural network as mapped onto the structure of FIG. 18.

If step 1907 determines that the output of step 1905 is the output of the last layer in the sequence, it is provided as the output of the operation at step 1911. Depending on the embodiment and the implementation, this can be the final output (as at the far right of FIG. 10) or an intermediate result. Depending on the depth and size of the network (the number of layers and the number of arrays per layer), the full propagation from the initial input to the final output may be possible on a single device as illustrated in FIG. 18, or it may require several such devices. If the propagation does not exhaust the available levels of the structure of FIG. 18, it can continue until the final output is generated (such as at level 1505 of column 1837 in FIG. 18) and provided from the I/O pins 318 of the CMOS die at that level.

The embodiment of FIG. 20 also stores the network's weights in the memory dies of the structure of FIG. 18, but performs the computations in the logic of the CMOS (or, more generally, peripheral circuit) die of the bonded die pairs. The flow of FIG. 20 begins at step 2001 with the memory structure of FIG. 18 receiving an input for the neural network. Step 2001 can be as described above for the corresponding step 1901 of FIG. 19. Rather than performing the computation within the memory array, as at step 1903 of FIG. 19, at step 2003 of FIG. 20 the layer's weights are read out of the memory die (e.g., the column 1831 array of memory die 1501a) into the corresponding CMOS die (e.g., 1501b), so that the layer's inputs can be applied to the layer's weights at step 2005 to perform the in-logic multiplication.

More specifically, referring to FIG. 9, at step 2003 the control circuitry on CMOS die 808 reads the weight values out of the array structure 326 over the lines 812 into the read/write circuits 328, which can include the buffer memory 317 to store the layer's weight values. The convolution or multiplication of the input with the weight values, such as described above with respect to FIG. 13 or FIG. 14, can then be performed within the logic on the CMOS die. At step 2007, an activation operation can be performed on the result of the multiplication of step 2005, similarly to the process described above with respect to step 1905 of FIG. 19. The flow of FIG. 20 can then continue on to steps 2009, 2011, and 2013, which can respectively correspond to steps 1907, 1909, and 1911 as described above.

For either of the embodiments of FIG. 19 or FIG. 20, the propagation of the input through the structure of FIG. 18 can be performed with differing degrees of autonomy, with the control circuitry within the CMOS dies of the bonded die pairs (as illustrated in FIG. 9) performing the propagation of the initial neural network input through the layers independently of the controller 102 or with varying degrees of controller supervision. Additionally, although FIGS. 19 and 20 describe the propagation of a single set of inputs through the network, multiple sets of inputs can be propagated sequentially through the structures of FIGS. 15-18: once one set of inputs has moved from the first layer on to the second layer, a subsequent set of inputs can be received at the first layer, allowing a high degree of parallelism.
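As a rough illustration of the pipelining just described, successive input sets can occupy successive levels of the stack at the same time; the cycle-by-cycle model below is only a conceptual sketch under simplifying assumptions, not a timing model of the hardware.

```python
def pipeline_schedule(num_inputs, num_layers):
    """At each cycle, input set t occupies layer (cycle - t); once an input
    set moves on to the second layer, the next set can enter the first layer."""
    schedule = []
    for cycle in range(num_inputs + num_layers - 1):
        active = [(t, cycle - t) for t in range(num_inputs)
                  if 0 <= cycle - t < num_layers]
        schedule.append(active)       # list of (input set, layer) pairs this cycle
    return schedule

# With 3 input sets and 4 layers, cycle 2 has set 0 in layer 2, set 1 in
# layer 1, and set 2 in layer 0 -- several bonded die pairs busy in parallel.
print(pipeline_schedule(3, 4)[2])
```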

While FIG. 18 shows, and FIGS. 19 and 20 describe, the inputs being received at the CMOS die of a first bonded die pair, applied to an array in a first column of the corresponding memory die (such as the column 1831 array of memory die 1501a), and propagated sequentially through the layers to an output (such as at CMOS die 1507b in column 1837), many variations are possible in alternative embodiments. For example, some layers can be skipped. Such skipping can be used when multiple sets of weights are stored for the same layer of the network: for example, the different sets of weights could correspond to different models of the same network, and one of these can be selected and the others skipped based on the currently selected model. In other variations, additional inputs can be received at different levels along the way; intermediate outputs can be provided from intermediate levels; and data can be moved laterally between the columns at a given level to provide more general computations.

The structure described above can provide a 3D mapping of a deep neural network onto a non-volatile memory device formed of a stack of bonded die pairs. By grouping the multiple arrays, or planes, of the bonded die pairs' memory dies into columns, a high degree of parallelism can be achieved and high performance provided by avoiding the need to move large amounts of data into and out of the memory. The use of through-silicon vias (TSVs) improves inter-plane data propagation. Separating the memory arrays and the peripheral circuitry onto the distinct dies of a bonded die pair increases the memory capacity available for storing weights. The increased area available on the bonded die pairs' CMOS dies allows more complex activation and other functions to be applied to the data as it moves through the layers. Although applied here to 3D deep neural networks, the architecture can also be used to implement accelerator-centric in-memory or near-memory computing systems for other application domains, such as database applications.

According to a first set of aspects, a non-volatile memory device includes a plurality of stacked bond die pairs and one or more vias, each via connecting a plurality of the bond die pairs. Each bonded die pair includes: a memory die having one or more arrays of non-volatile memory cells, each array configured to store one or more weights of a neural network layer; and a corresponding peripheral circuit die formed separately from and bonded to the memory die and having peripheral circuit elements including one or more control circuits connected to the one or more arrays and configured to perform memory operations on the one or more arrays. The one or more control circuits are configured to: receiving a set of input values for a layer of the neural network, the layer having weights stored in one of the arrays of memory dies; and performing a multiplication operation between the set of input values and weights of the neural network layer to generate a set of output values of the neural network layer. One or more vias are configured to transfer data between a connected pair of bond dies connected to the via, the via configured to transfer a set of output values of a multiplication operation in a first pair of bond dies of the pair of bond dies as a set of input values of a multiplication operation in a second pair of bond dies of the pair of bond dies.

In a further aspect, a method includes receiving, at a peripheral circuit die of a first bonded die pair of a non-volatile memory structure, an initial input for a neural network. The non-volatile memory structure includes a plurality of bonded die pairs, including the first bonded die pair, each bonded die pair including a peripheral circuit die and a non-volatile memory die, and each non-volatile memory die storing weights for one or more layers of the neural network. The method also includes propagating the input for the neural network through a sequence of the bonded die pairs, the bonded die pairs being connected by through-silicon vias to form a three-dimensional stack. The propagating includes: generating, at the first bonded die pair, an output for a layer of the neural network stored on the memory die of the first bonded die pair using the initial input; and, at each of one or more subsequent bonded die pairs in the sequence, receiving, along the vias, the output from the preceding bonded die pair in the sequence and using the output from the preceding bonded die pair as the input to generate an output for a layer of the neural network stored on the memory die of the subsequent bonded die pair. The method further includes providing the output of the neural network from the peripheral circuit die of one of the subsequent bonded die pairs.

Further aspects include a non-volatile memory device comprising a stack of a plurality of bonded die pairs and a plurality of vias through the stack of bonded die pairs. Each bonded die pair has a memory die with one or more arrays of non-volatile memory cells and a peripheral circuit die with one or more control circuits configured to access data stored on the memory die of the bonded die pair. The plurality of vias is configured to transfer data between the bonded die pairs, with the arrays of the memory dies grouped into columns along the vias. The memory dies of the plurality of bonded die pairs are configured to store weights for a plurality of layers of a neural network, a sequence of the layers of the neural network being mapped into a corresponding sequence of the arrays along one of the columns. The control circuits of the peripheral circuit dies of the plurality of bonded die pairs are further configured to perform an inference operation for the neural network by: receiving an input for the neural network at a first of the bonded die pairs; and propagating the input through the layers of the neural network by propagating the input along the corresponding sequence of arrays of the column, performing, within each of the bonded die pairs, a multiplication operation with the weights stored in the corresponding array.

For the purposes of this document, references in the specification to "an embodiment," "one embodiment," "some embodiments," or "another embodiment" may be used to describe different embodiments or the same embodiment.

For purposes of this document, a connection may be a direct connection or an indirect connection (e.g., via one or more other components). In some cases, when an element is referred to as being connected or coupled to another element, the element may be directly connected to the other element or indirectly connected to the other element via an intervening element. When an element is referred to as being directly connected to another element, there are no intervening elements between the element and the other element. Two devices are "in communication" if they are directly or indirectly connected, such that they can communicate electronic signals therebetween.

For the purposes of this document, the term "based on" may be understood as "based at least in part on".

For purposes of this document, without additional context, the use of numerical terms such as a "first" object, a "second" object, and a "third" object may not imply an ordering of objects but may instead be used for identification purposes to identify different objects.

For purposes of this document, the term "set" of objects may refer to a "set" of one or more of the objects.

The foregoing detailed description has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. The described embodiments were chosen in order to best explain the principles of the proposed technology and its practical application to thereby enable others skilled in the art to best utilize it in various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto.
