Output array neuron conversion and calibration for analog neural memory in deep learning artificial neural network

Document No.: 1958022 | Publication date: 2021-12-10

Note: This technology, "Output array neuron conversion and calibration for analog neural memory in deep learning artificial neural network," was created by H. V. Tran, T. Vu, S. Hong, S. Trinh, V. Tiwari, M. Reiten, and N. Do on 2019-11-18. Summary: Configurable input and output blocks and physical layouts of analog neural memory systems utilizing non-volatile memory cells are disclosed. The input block may be configured to support a different number of arrays arranged in a horizontal direction, and the output block may be configured to support a different number of arrays arranged in a vertical direction. Adjustable components for use in configurable input and output blocks are disclosed. Systems and methods are used to compensate for leakage and offset in input and output blocks in analog neural memory systems.

1. A method of operating an output circuit block of an analog neural memory cell, comprising:

measuring leakage and/or offset;

storing the measured quantity as a first value; and

determining the Least Significant Bit (LSB) of a memory cell using the following equation: LSB = said first value rounded up to the next level.

2. The method of claim 1, further comprising:

determining the Most Significant Bit (MSB) of the memory cell range using the following formula: MSB = LSB + (N-1) × incremental level amount, where N is the total number of levels and the incremental level amount is the difference between two consecutive levels.

3. The method of claim 2, wherein the incremental level amount is equal to the LSB.

4. The method of claim 2, wherein the incremental level amount is predetermined.

5. The method of claim 2, wherein the LSB comprises an incremental LSB amount.

6. The method of claim 2, wherein the leakage comprises one or more of array leakage and circuit leakage.

7. The method of claim 6, wherein the array leakage comprises one or more of memory cell leakage and leakage from one or more of a decoding circuit and a column write circuit.

8. The method of claim 2, wherein the offset comprises one or more of an array offset and a circuit offset.

9. The method of claim 8, wherein the array offset comprises an offset from array variations due to one or more of memory cell capacitance and cell junctions during a read operation.

10. The method of claim 8, wherein the circuit offset comprises an offset from one or more of a decoding circuit and a column write circuit.

11. The method of claim 2, wherein the memory cell is a split gate memory cell.

12. The method of claim 2, wherein the measuring step is performed by an output block comprising an analog-to-digital converter.

13. The method of claim 2, wherein the determining the LSB and the determining the MSB steps are performed by an output block.

14. The method of claim 12, wherein the analog-to-digital converter is an integrating analog-to-digital converter.

15. The method of claim 2, wherein the measuring step comprises performing an analog-to-digital conversion and the storing step comprises storing the digital output in a counter.

16. The method of claim 15, further comprising:

counting down from the stored digital output until the counter reaches zero; and

counting up the digital bits received in the output circuit block.

17. A method of operating an output circuit block of an analog neural memory cell, comprising:

measuring the offset or leakage and storing the measured quantity as a first value;

recalling the first value;

performing temperature adjustment; and

determining the Least Significant Bit (LSB) of a memory cell range using the following formula: LSB = said first value rounded up to the next level.

18. The method of claim 17, further comprising:

determining the Most Significant Bit (MSB) of the memory cell range using the following formula: MSB = LSB + (N-1) × incremental level amount, where N is the total number of levels and the incremental level amount is the difference between two consecutive levels.

19. The method of claim 18, wherein the incremental level amount is equal to the LSB.

20. The method of claim 18, wherein the incremental level amount is predetermined.

21. The method of claim 18, wherein the LSB comprises an incremental LSB amount.

22. The method of claim 17, wherein the leakage comprises one or more of array leakage and circuit leakage.

23. The method of claim 22, wherein the array leakage comprises one or more of memory cell leakage and leakage from one or more of a decoding circuit and a column write circuit.

24. The method of claim 17, wherein the offset comprises one or more of an array offset and a circuit offset.

25. The method of claim 24, wherein the array offset comprises an offset from array variations due to one or more of memory cell capacitance and cell junctions during a read operation.

26. The method of claim 24, wherein the circuit offset comprises an offset from one or more of a decoding circuit and a column write circuit.

27. The method of claim 17, wherein the memory cell is a split gate memory cell.

28. The method of claim 17, wherein the measuring step is performed by an output block comprising an analog-to-digital converter.

29. The method of claim 17, wherein the determining the LSB and the determining the MSB steps are performed by the output block.

30. The method of claim 28, wherein the analog-to-digital converter is an integrating analog-to-digital converter.

31. A method of compensating for leakage or offset while generating an output from an output circuit block coupled to an array of analog neural memory cells, comprising:

measuring leakage or offset in the output circuit block using an analog-to-digital converter to generate a digital leakage or offset value;

storing the digital leakage or offset value as a first value in a counter; and

compensating for the leakage or offset.

32. The method of claim 31, wherein the compensating is performed by: measuring an output of the output circuit block using the counter, counting down from the stored first value until the counter reaches zero, and then counting up to generate the output.

33. The method of claim 31, wherein the compensating is performed by: measuring an output of the output circuit block using the counter and then subtracting the stored first value from the measured output to generate the output.

34. The method of claim 31, wherein the analog-to-digital converter comprises an integrating analog-to-digital converter.

35. The method of claim 34, wherein the analog-to-digital converter comprises a ramp analog-to-digital converter.

36. The method of claim 31, wherein the analog-to-digital converter comprises an algorithmic analog-to-digital converter.

37. The method of claim 31, wherein the analog-to-digital converter comprises a sigma-delta type analog-to-digital converter.

38. The method of claim 31, wherein the analog-to-digital converter comprises a successive approximation register analog-to-digital converter.

39. The method of claim 31, wherein the output circuit block further comprises:

a digital data-to-voltage converter to convert the output to a voltage.

40. The method of claim 31, wherein the output circuit block further comprises:

a digital data-to-pulse width converter to convert the output into one or more pulses, wherein a width of the one or more pulses is proportional to a value of the digital data.

41. The method of claim 31, wherein the analog neural memory cell is a split gate flash memory cell.

42. The method of claim 31, wherein the analog neural memory cell is a stacked gate flash memory cell.

43. The method of claim 31, further comprising: performing calibration by the output circuit block to compensate for temperature.

44. The method of claim 31, further comprising: performing calibration by the output circuit block to compensate for process or voltage supply variations.

Technical Field

Configurable input and output blocks and associated physical layouts of analog neural memory systems utilizing non-volatile memory cells are disclosed.

Background

Artificial neural networks mimic biological neural networks (the central nervous system of animals, particularly the brain), and are used to estimate or approximate functions that may depend on a large number of inputs and are generally unknown. Artificial neural networks typically include layers of interconnected "neurons" that exchange messages with each other.

FIG. 1 illustrates an artificial neural network, where circles represent the inputs or layers of neurons. Connections (called synapses) are indicated by arrows and have a numerical weight that can be adjusted empirically. This enables the neural network to adapt to the input and to learn. Typically, a neural network includes a layer of multiple inputs. There are typically one or more intermediate layers of neurons, and an output layer of neurons that provides the output of the neural network. Neurons at each level make decisions based on data received from synapses, either individually or collectively.

One of the major challenges in developing artificial neural networks for high-performance information processing is the lack of adequate hardware technology. In practice, practical neural networks rely on a very large number of synapses to achieve high connectivity between neurons, i.e., very high computational parallelism. In principle, such complexity can be achieved with digital supercomputers or dedicated clusters of graphics processing units. However, in addition to their high cost, these approaches are also poor in energy efficiency compared to biological networks, which consume much less energy primarily because they perform low-precision analog computation. CMOS analog circuits have been used for artificial neural networks, but most CMOS-implemented synapses are too bulky given the large number of neurons and synapses required.

Applicant previously disclosed an artificial (analog) neural network that utilizes one or more non-volatile memory arrays as synapses in U.S. patent application 15/594,439 (published as U.S. patent publication 2017/0337466), which is incorporated herein by reference. The non-volatile memory arrays operate as an analog neural memory. The neural network device includes a first plurality of synapses configured to receive a first plurality of inputs and to generate therefrom a first plurality of outputs, and a first plurality of neurons configured to receive the first plurality of outputs. The first plurality of synapses includes a plurality of memory cells, wherein each of the memory cells includes: spaced-apart source and drain regions formed in a semiconductor substrate, with a channel region extending between the source and drain regions; a floating gate disposed over and insulated from a first portion of the channel region; and a non-floating gate disposed over and insulated from a second portion of the channel region. Each of the plurality of memory cells is configured to store a weight value corresponding to a number of electrons on the floating gate. The plurality of memory cells is configured to multiply the first plurality of inputs by the stored weight values to generate the first plurality of outputs.

Each non-volatile memory cell used in analog neural memory systems must be erased and programmed to maintain a very specific and precise amount of charge (i.e., number of electrons) in the floating gate. For example, each floating gate must hold one of N different values, where N is the number of different weights that can be indicated by each cell. Examples of N include 16, 32, 64, 128, and 256.

One challenge in vector-by-matrix multiplication (VMM) systems is the ability to quickly and accurately deliver output from a VMM as input to another VMM, and to do so while efficiently utilizing physical space within the semiconductor die.

What is needed is a configurable input block and output block and physical layout for an analog neural memory system that utilizes non-volatile memory cells. What is also needed is a system and method for compensating for leakage and offset in the input and output blocks of such a system.

Disclosure of Invention

Systems and methods for compensating for leakage and offset in input and output blocks in analog neural memory systems are disclosed.

One embodiment of a method of operating an output circuit block of an analog neural memory cell includes: measuring leakage and/or offset; storing the measured quantity as a first value; and determining a Least Significant Bit (LSB) of the memory cell using the following formula: LSB = said first value rounded up to the next level.

Another embodiment of a method of operating an output circuit block of an analog neural memory cell includes: measuring the offset or leakage and storing the measured quantity as a first value; recalling the first value; performing temperature adjustment; and determining a Least Significant Bit (LSB) of the memory cell range using the formula: LSB = said first value rounded up to the next level.
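The LSB/MSB determination summarized above can be sketched in software; `calibrate_range` and its parameter names are illustrative, and "rounded to the next level" is assumed to mean rounding up to the next level boundary:

```python
import math

def calibrate_range(first_value: float, level_step: float, n_levels: int):
    """Derive the usable range of a memory cell from a measured
    leakage/offset value (the "first value").

    LSB: the first value rounded up to the next level boundary.
    MSB: LSB + (N - 1) * incremental level amount.
    """
    lsb = math.ceil(first_value / level_step) * level_step
    msb = lsb + (n_levels - 1) * level_step
    return lsb, msb
```

For example, with a measured value of 0.3 (arbitrary units), a level step of 0.5, and N = 16 levels, the LSB lands on 0.5 and the MSB on 8.0.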

An embodiment of a method of compensating for leakage or offset while generating an output from an output circuit block coupled to an array of analog neural memory cells includes: measuring leakage or offset in the output circuit block using an analog-to-digital converter to generate a digital leakage or offset value; storing the digital leakage or offset value as a first value in a counter; and compensating for the leakage or offset.
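The counter-based compensation can be sketched as follows; the loop stands in for ADC clock ticks, and all names are illustrative:

```python
def compensated_output(leakage_counts: int, signal_counts: int) -> int:
    """Cancel a stored leakage/offset count from an ADC conversion.

    The digitized leakage/offset is first stored in a counter. While the
    real output is converted, the counter counts down through the stored
    value until it reaches zero, and only then counts up, so the final
    count excludes the leakage/offset contribution.
    """
    counter = leakage_counts
    output = 0
    for _ in range(signal_counts):  # one iteration per conversion clock tick
        if counter > 0:
            counter -= 1            # still cancelling the stored value
        else:
            output += 1             # net counts above the leakage baseline
    return output
```

Equivalently, the stored first value can simply be subtracted from the converted output after the fact, as in the subtraction variant of the third embodiment.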

Drawings

Fig. 1 is a schematic diagram illustrating a prior art artificial neural network.

Figure 2 shows a prior art split gate flash memory cell.

Figure 3 illustrates another prior art split gate flash memory cell.

Figure 4 illustrates another prior art split gate flash memory cell.

Figure 5 illustrates another prior art split gate flash memory cell.

Figure 6 shows another prior art split gate flash memory cell.

Fig. 7 illustrates a prior art stacked gate flash memory cell.

FIG. 8 is a schematic diagram illustrating different levels of an exemplary artificial neural network using one or more non-volatile memory arrays.

Fig. 9 is a block diagram illustrating a vector-matrix multiplication system.

FIG. 10 is a block diagram illustrating an exemplary artificial neural network using one or more vector-matrix multiplication systems.

FIG. 11 illustrates another embodiment of a vector-matrix multiplication system.

FIG. 12 illustrates another embodiment of a vector-matrix multiplication system.

FIG. 13 illustrates another embodiment of a vector-matrix multiplication system.

FIG. 14 illustrates another embodiment of a vector-matrix multiplication system.

FIG. 15 illustrates another embodiment of a vector-matrix multiplication system.

FIG. 16 illustrates a prior art long short-term memory system.

FIG. 17 shows an exemplary unit used in a long short-term memory system.

Fig. 18 shows one embodiment of the exemplary unit of fig. 17.

Fig. 19 shows another embodiment of the exemplary unit of fig. 17.

FIG. 20 shows a prior art gated recurrent unit system.

FIG. 21 illustrates an exemplary cell used in a gated recurrent unit system.

Fig. 22 illustrates one embodiment of the exemplary unit of fig. 21.

Fig. 23 shows another embodiment of the exemplary unit of fig. 21.

FIG. 24 illustrates another embodiment of a vector-matrix multiplication system.

FIG. 25 illustrates another embodiment of a vector-matrix multiplication system.

FIG. 26 illustrates another embodiment of a vector-matrix multiplication system.

FIG. 27 shows another embodiment of a vector-matrix multiplication system.

FIG. 28 illustrates another embodiment of a vector-matrix multiplication system.

FIG. 29 shows another embodiment of a vector-matrix multiplication system.

FIG. 30 illustrates another embodiment of a vector-matrix multiplication system.

FIG. 31 shows another embodiment of a vector-matrix multiplication system.

FIG. 32 illustrates another embodiment of a vector-matrix multiplication system.

FIG. 33 illustrates an exemplary block diagram of a vector-matrix multiplication system.

FIG. 34 shows an exemplary block diagram of a vector-matrix multiplication system in which vertically adjacent arrays share an output block.

FIG. 35A illustrates an exemplary block diagram of a vector-matrix multiplication system in which horizontally adjacent arrays share an input block.

FIG. 35B illustrates an exemplary block diagram of a vector-matrix multiplication system in which horizontally adjacent arrays share an input block.

FIG. 36 illustrates an exemplary block diagram of a vector-matrix multiplication system.

Fig. 37A shows an embodiment of an integrating analog-to-digital converter.

Fig. 37B illustrates an exemplary output waveform of the system of fig. 37A.

Fig. 38 shows an exemplary input that may be provided to the integrating analog-to-digital converter of fig. 37A.

Fig. 39 shows an embodiment of a current-to-voltage converter.

Fig. 40 shows a digital data-to-voltage converter.

Fig. 41 shows a configurable analog to digital converter.

Fig. 42 shows a configurable current-voltage input circuit.

Fig. 43A shows a digital-to-pulse row converter.

Fig. 43B illustrates exemplary waveforms of the digital-to-pulse row converter of fig. 43A.

Fig. 44A shows a digital-to-pulse row converter.

Fig. 44B illustrates exemplary waveforms of the digital-to-pulse row converter of fig. 44A.

Fig. 44C shows a row digital pulse generator for use with the digital-to-pulse row converter of fig. 44A.

Fig. 45A shows a ramp analog-to-digital converter output converter.

Fig. 45B and 45C illustrate exemplary reference voltages for use by the ramp analog-to-digital converter output converter of fig. 45A.

Fig. 46 shows an algorithmic analog-to-digital converter output converter.

Fig. 47A shows a SAR ADC.

Fig. 47B shows an algorithm for generating digital output bits for the SAR ADC of fig. 47A.

Fig. 48 shows a sigma-delta type analog-to-digital converter.

Fig. 49 shows a calibration method.

Fig. 50A illustrates a real-time calibration method.

Fig. 50B illustrates a background calibration method.

Fig. 51A illustrates a calibration and conversion method.

FIG. 51B illustrates another calibration and conversion method.

Fig. 52 shows a comparator for optional use with the embodiments of fig. 37A-37B and 45A-45C.

Detailed Description

The artificial neural network of the present invention utilizes a combination of CMOS technology and a non-volatile memory array.

Non-volatile memory cell

Digital non-volatile memories are well known. For example, U.S. patent 5,029,130 ("the '130 patent"), which is incorporated herein by reference, discloses an array of split gate non-volatile memory cells, which are a type of flash memory cells. Such a memory cell 210 is shown in fig. 2. Each memory cell 210 includes a source region 14 and a drain region 16 formed in a semiconductor substrate 12, with a channel region 18 therebetween. A floating gate 20 is formed over and insulated from (and controls the conductivity of) a first portion of the channel region 18, and over a portion of the source region 14. A word line terminal 22 (which is typically coupled to a word line) has a first portion disposed over and insulated from (and controls the conductivity of) a second portion of the channel region 18, and a second portion that extends up and over the floating gate 20. The floating gate 20 and word line terminal 22 are insulated from the substrate 12 by a gate oxide. A bit line 24 is coupled to the drain region 16.

The memory cell 210 is erased (with electrons removed from the floating gate) by placing a high positive voltage on the word line terminal 22, which causes electrons on the floating gate 20 to tunnel through the intermediate insulator from the floating gate 20 to the word line terminal 22 via Fowler-Nordheim tunneling.

Memory cell 210 (in which electrons are placed on the floating gate) is programmed by placing a positive voltage on word line terminal 22 and a positive voltage on source region 14. Electron current will flow from the source region 14 to the drain region 16. When the electrons reach the gap between the word line terminal 22 and the floating gate 20, the electrons will accelerate and heat up. Some of the heated electrons will be injected onto the floating gate 20 through the gate oxide due to electrostatic attraction from the floating gate 20.

Memory cell 210 is read by placing a positive read voltage on drain region 16 and word line terminal 22 (which turns on the portion of channel region 18 under the word line terminal). If the floating gate 20 is positively charged (i.e., electrons are erased), the portion of the channel region 18 under the floating gate 20 is also turned on and current will flow through the channel region 18, which is sensed as an erased or "1" state. If the floating gate 20 is negatively charged (i.e., programmed by electrons), the portion of the channel region under the floating gate 20 is mostly or completely turned off and no (or little) current will flow through the channel region 18, which is sensed as a programmed or "0" state.
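The read decision above amounts to a threshold comparison on the sensed cell current; this sketch uses illustrative names and an arbitrary 1 μA threshold, not a value taken from this disclosure:

```python
def sense_state(cell_current_ua: float, threshold_ua: float = 1.0) -> str:
    """Classify a sensed read current: an erased cell (positively charged
    floating gate) conducts and reads "1", while a programmed cell
    (negatively charged floating gate) passes little or no current and
    reads "0"."""
    return "1" if cell_current_ua > threshold_ua else "0"
```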

Table 1 shows typical voltage ranges that may be applied to the terminals of memory cell 210 for performing read, erase, and program operations:

table 1: operation of flash memory cell 210 of FIG. 2

"read 1" is a read mode in which the cell current is output on the bit line. "read 2" is a read mode in which the cell current is output on the source line.

Fig. 3 shows a memory cell 310, which is similar to memory cell 210 of fig. 2, but with the addition of Control Gate (CG) 28. The control gate 28 is biased at a high voltage (e.g., 10V) during programming, a low or negative voltage (e.g., 0V/-8V) during erasing, and a low or medium voltage (e.g., 0V/2.5V) during reading. The other terminals are biased similarly to fig. 2.

Fig. 4 shows a four-gate memory cell 410 that includes a source region 14, a drain region 16, a floating gate 20 over a first portion of the channel region 18, a select gate 22 (typically coupled to a word line WL) over a second portion of the channel region 18, a control gate 28 over the floating gate 20, and an erase gate 30 over the source region 14. This configuration is described in U.S. patent 6,747,310, which is incorporated herein by reference for all purposes. Here, all gates except the floating gate 20 are non-floating gates, meaning that they are or can be electrically connected to a voltage source. Programming is performed by heated electrons from the channel region 18 injecting themselves onto the floating gate 20. Erasing is performed by electrons tunneling from the floating gate 20 to the erase gate 30.

Table 2 shows typical voltage ranges that may be applied to the terminals of memory cell 410 for performing read, erase, and program operations:

table 2: operation of flash memory cell 410 of FIG. 4

WL/SG BL CG EG SL
Reading 1 0.5-2V 0.1-2V 0-2.6V 0-2.6V 0V
Reading 2 0.5-2V 0-2V 0-2.6V 0-2.6V 2-0.1V
Erasing -0.5V/0V 0V 0V/-8V 8-12V 0V
Programming 1V 1μA 8-11V 4.5-9V 4.5-5V

"read 1" is a read mode in which the cell current is output on the bit line. "read 2" is a read mode in which the cell current is output on the source line.

Fig. 5 shows a memory cell 510, which is similar to memory cell 410 of fig. 4 except that it does not include an erase gate EG. Erasing is performed by biasing the substrate 12 to a high voltage and the control gate CG 28 to a low or negative voltage. Alternatively, erasing is performed by biasing the word line 22 to a positive voltage and the control gate 28 to a negative voltage. Programming and reading are similar to those of fig. 4.

Fig. 6 shows a tri-gate memory cell 610, which is another type of flash memory cell. Memory cell 610 is identical to memory cell 410 of fig. 4, except that memory cell 610 does not have a separate control gate. The erase operation (whereby erasing occurs through use of the erase gate) and the read operation are similar to those of fig. 4, except that no control gate bias is applied. The programming operation is also done without control gate bias, and as a result, a higher voltage must be applied on the source line during the programming operation to compensate for the lack of control gate bias.

Table 3 shows typical voltage ranges that may be applied to the terminals of memory cell 610 for performing read, erase, and program operations:

table 3: operation of flash memory cell 610 of FIG. 6

WL/SG BL EG SL
Reading 1 0.5-2.2V 0.1-2V 0-2.6V 0V
Reading 2 0.5-2.2V 0-2V 0-2.6V 2-0.1V
Erasing -0.5V/0V 0V 11.5V 0V
Programming 1V 2-3μA 4.5V 7-9V

"read 1" is a read mode in which the cell current is output on the bit line. "read 2" is a read mode in which the cell current is output on the source line.

Fig. 7 shows a stacked gate memory cell 710, which is another type of flash memory cell. The memory cell 710 is similar to the memory cell 210 of fig. 2, except that the floating gate 20 extends over the entire channel region 18, and the control gate 22 (which here will be coupled to a word line) extends over the floating gate 20, separated by an insulating layer (not shown). The erase, program, and read operations operate in a similar manner as previously described for memory cell 210.

Table 4 shows typical voltage ranges that may be applied to the terminals of memory cell 710 and substrate 12 for performing read, erase and program operations:

table 4: operation of flash memory cell 710 of FIG. 7

CG BL SL Substrate
Reading 1 0-5V 0.1–2V 0-2V 0V
Reading 2 0.5-2V 0-2V 2-0.1V 0V
Erasing -8 to -10V/0V FLT FLT 8-10V/15-20V
Programming 8-12V 3-5V/0V 0V/3-5V 0V

"read 1" is a read mode in which the cell current is output on the bit line. "read 2" is a read mode in which the cell current is output on the source line. Optionally, in an array comprising rows and columns of memory cells 210, 310, 410, 510, 610, or 710, the source line may be coupled to one row of memory cells or two adjacent rows of memory cells. That is, the source line may be shared by adjacent rows of memory cells.

In order to utilize a memory array comprising one of the above types of non-volatile memory cells in an artificial neural network, two modifications are made. First, the circuitry is configured so that each memory cell can be programmed, erased, and read individually without adversely affecting the memory state of other memory cells in the array, as explained further below. Second, continuous (analog) programming of the memory cells is provided.

In particular, the memory state (i.e., the charge on the floating gate) of each memory cell in the array can be continuously changed from a fully erased state to a fully programmed state independently and with minimal disturbance to other memory cells. In another embodiment, the memory state (i.e., the charge on the floating gate) of each memory cell in the array can be continuously changed from a fully programmed state to a fully erased state, or vice versa, independently and with minimal disturbance to other memory cells. This means that the cell storage device is analog, or at least can store one of many discrete values (such as 16 or 64 different values), which allows very precise and individual tuning of all cells in the memory array, and which makes the memory array ideal for storing and fine-tuning synaptic weights for neural networks.
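The multi-level cell storage described above amounts to quantizing each synaptic weight onto one of N discrete levels; a minimal sketch, assuming weights normalized to [0, 1] (the function and parameter names are illustrative):

```python
def weight_to_level(weight: float, n_levels: int = 16) -> int:
    """Map a normalized weight in [0, 1] to one of N discrete floating-gate
    charge levels (e.g. N = 16 or 64), clamping out-of-range inputs."""
    level = round(weight * (n_levels - 1))
    return max(0, min(n_levels - 1, level))
```

Tuning a cell then means programming or erasing its floating gate until the sensed level matches the target returned here.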

The methods and apparatus described herein may be applied to other non-volatile memory technologies such as, but not limited to, SONOS (silicon-oxide-nitride-oxide-silicon, charge trapping in nitride), MONOS (metal-oxide-nitride-oxide-silicon, metal charge trapping in nitride), ReRAM (resistive RAM), PCM (phase change memory), MRAM (magnetoresistive RAM), FeRAM (ferroelectric RAM), OTP (bi-level or multi-level one-time programmable), and CeRAM (correlated electron RAM), among others. The methods and apparatus described herein may be applied to volatile memory technologies for neural networks, such as, but not limited to, SRAM, DRAM, and/or volatile synapse cells.

Neural network employing non-volatile memory cell array

Figure 8 conceptually illustrates a non-limiting example of a neural network of the present embodiment that uses a non-volatile memory array. This example uses a non-volatile memory array neural network for facial recognition applications, but any other suitable application may also be implemented using a non-volatile memory array based neural network.

For this example, S0 is the input layer, which is a 32 × 32 pixel RGB image with 5-bit precision (i.e., three 32 × 32 pixel arrays, one for each color R, G, and B, each pixel having 5-bit precision). Synapse CB1 going from input layer S0 to layer C1 applies different sets of weights in some cases and shared weights in other cases, and scans the input image with 3 × 3 pixel overlapping filters (kernels), shifting the filter by 1 pixel (or more than 1 pixel, as dictated by the model). Specifically, the values of 9 pixels in a 3 × 3 portion of the image (referred to as a filter or kernel) are provided to synapse CB1, where these 9 input values are multiplied by the appropriate weights and, after summing the outputs of that multiplication, a single output value is determined by the first synapse of CB1 and provided for generating a pixel of one of the feature maps of layer C1. The 3 × 3 filter is then shifted one pixel to the right within input layer S0 (i.e., adding the column of three pixels on the right and dropping the column of three pixels on the left), whereby the 9 pixel values in this newly positioned filter are provided to synapse CB1, where they are multiplied by the same weights and a second single output value is determined by the associated synapse. This process continues until the 3 × 3 filter scans over the entire 32 × 32 pixel image of input layer S0, for all three colors and for all bits (precision values). The process is then repeated using different sets of weights to generate a different feature map of C1, until all the feature maps of layer C1 have been calculated.
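The filter scan described above can be sketched in NumPy; `scan_filter` is an illustrative name, and a plain sum-of-products (no bias, no activation) is assumed:

```python
import numpy as np

def scan_filter(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Scan a kernel over an image: each output pixel is the sum of the
    input pixels in the kernel window multiplied by the kernel weights,
    with the window shifted by `stride` pixels between positions."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)  # multiply-and-sum per position
    return out
```

A 32 × 32 input scanned with a 3 × 3 kernel at stride 1 yields a 30 × 30 feature map, matching the layer C1 dimensions of this example.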

At level C1, in this example, there are 16 feature maps, each having 30 × 30 pixels. Each pixel is a new feature pixel extracted from the product of the input and the kernel, so each feature map is a two-dimensional array, so in this example, layer C1 consists of a 16-layer two-dimensional array (bearing in mind that the layers and arrays referred to herein are logical relationships, not necessarily physical relationships, i.e., the arrays are not necessarily oriented in a physical two-dimensional array). Each of the 16 feature maps in level C1 is generated by one of sixteen different sets of synaptic weights applied to the filter scan. The C1 feature maps may all relate to different aspects of the same image feature, such as boundary identification. For example, a first map (generated using a first set of weights, shared for all scans used to generate the first map) may identify rounded edges, a second map (generated using a second set of weights different from the first set of weights) may identify rectangular edges, or aspect ratios of certain features, and so on.

Before going from layer C1 to layer S1, an activation function P1 (pooling) is applied that pools values from consecutive, non-overlapping 2 × 2 regions in each feature map. The purpose of the pooling function is to average out nearby locations (or a max function can also be used), for example to reduce the dependence on edge location and to reduce the data size before going to the next stage. At layer S1, there are 16 feature maps of 15 × 15 pixels (i.e., sixteen different arrays of 15 × 15 pixels each). Synapse CB2 going from layer S1 to layer C2 scans the maps in S1 with 4 × 4 filters, with a filter shift of 1 pixel. At layer C2, there are 22 feature maps of 12 × 12 pixels. Before going from layer C2 to layer S2, an activation function P2 (pooling) is applied that pools values from consecutive, non-overlapping 2 × 2 regions in each feature map. At layer S2, there are 22 feature maps of 6 × 6 pixels. An activation function (pooling) is applied at synapse CB3 going from layer S2 to layer C3, where every neuron in layer C3 connects to every map in layer S2 via a respective synapse connection of CB3. At layer C3, there are 64 neurons. Synapse CB4 going from layer C3 to output layer S3 fully connects C3 to S3, i.e., every neuron in layer C3 is connected to every neuron in layer S3. The output at S3 includes 10 neurons, where the highest output neuron determines the class. For example, the output may indicate an identification or classification of the contents of the original image.
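The pooling step can be sketched as follows; `pool_2x2` is an illustrative name, with average pooling by default and max pooling as the alternative mentioned above:

```python
import numpy as np

def pool_2x2(fmap: np.ndarray, mode: str = "avg") -> np.ndarray:
    """Pool consecutive, non-overlapping 2 x 2 regions of a feature map,
    halving each dimension (e.g. 30 x 30 -> 15 x 15)."""
    h = fmap.shape[0] // 2 * 2   # trim odd edges so regions tile exactly
    w = fmap.shape[1] // 2 * 2
    blocks = fmap[:h, :w].reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))
```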

The synapses of each layer are implemented using an array or a portion of an array of non-volatile memory cells.

FIG. 9 is a block diagram of an array that can be used for this purpose. The vector-matrix multiplication (VMM) system 32 includes non-volatile memory cells and serves as synapses between one layer and the next (such as CB1, CB2, CB3, and CB4 in fig. 6). In particular, the VMM system 32 includes a VMM array 33 (comprising non-volatile memory cells arranged in rows and columns), erase and word line gate decoders 34, control gate decoders 35, bit line decoders 36 and source line decoders 37, which decode respective inputs to the non-volatile memory cell array 33. Inputs to the VMM array 33 may come from the erase gate and word line gate decoders 34 or from the control gate decoder 35. In this example, the source line decoder 37 also decodes the output of the VMM array 33. Alternatively, the bit line decoder 36 may decode the output of the VMM array 33.

The VMM array 33 serves two purposes. First, it stores the weights to be used by the VMM system 32. Second, the VMM array 33 effectively multiplies the inputs by the weights stored in the VMM array 33 and adds them per output line (source line or bit line) to produce an output that will be the input of the next layer or the input of the final layer. By performing multiply and add functions, VMM array 33 eliminates the need for separate multiply and add logic circuits and is also power efficient due to its in-situ memory computations.
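The multiply-and-add operation described above can be sketched in plain Python. The weight and input values here are illustrative numbers, not device data; in the hardware, each product is a cell current and the per-line sum happens on the bit line or source line itself:

```python
# Sketch of a VMM: each stored weight multiplies its shared row input, and
# the products sum per output line (as cell currents do on a bit/source line).
weights = [
    [0.2, 0.5],   # row 0: weights feeding output lines 0 and 1
    [0.1, 0.3],   # row 1
    [0.4, 0.7],   # row 2
]
inputs = [1.0, 0.5, 0.25]   # one input per row

outputs = [
    sum(inputs[r] * weights[r][col] for r in range(len(inputs)))
    for col in range(2)
]
print([round(o, 3) for o in outputs])  # [0.35, 0.825]
```

These summed outputs are what would be passed to the summer and activation circuits to form the next layer's inputs.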

The output of the VMM array 33 is provided to a differential summer (such as a summing op-amp or summing current mirror) 38 that sums the output of the VMM array 33 to create a single value for the convolution. The differential summer 38 is arranged for performing a summation of the positive and negative weights.

The output values of differential summer 38 are then provided to an activation function circuit 39, which modifies the output. The activation function circuit 39 may provide a sigmoid, tanh, ReLU function, or any other non-linear function. The modified output values of the activation function circuit 39 become elements of the feature map of the next layer (e.g., layer C1 in fig. 8) and are then applied to the next synapse to produce the next feature map layer or final layer. Thus, in this example, VMM array 33 constitutes a plurality of synapses (which receive their inputs from an existing neuron layer or from an input layer, such as an image database), and summer 38 and activation function circuitry 39 constitute a plurality of neurons.

The inputs to the VMM system 32 in fig. 9 (WLx, EGx, CGx, and optionally BLx and SLx) may be analog levels, binary levels, digital pulses (in which case a pulse-to-analog converter PAC may be required to convert the pulses to the appropriate input analog levels), or digital bits (in which case a DAC is provided to convert the digital bits to the appropriate input analog levels); the output may be an analog level, a binary level, a digital pulse or a digital bit (in which case an output ADC is provided to convert the output analog level to a digital bit).

Fig. 10 is a block diagram illustrating the use of multiple layers of VMM systems 32 (labeled here as VMM systems 32a, 32b, 32c, 32d, and 32e). As shown in fig. 10, the input (denoted Inputx) is converted from digital to analog by digital-to-analog converter 31 and provided to input VMM system 32a. The converted analog input may be a voltage or a current. The input D/A conversion for the first layer may be accomplished by using a function or a LUT (look-up table) that maps the input Inputx to the appropriate analog levels for the matrix multiplier of input VMM system 32a. Input conversion may also be accomplished by an analog-to-analog (A/A) converter to convert an external analog input to a mapped analog input to input VMM system 32a. Input conversion may also be accomplished by a digital-to-digital-pulse (D/P) converter to convert an external digital input into one or more digital pulses that are mapped to input VMM system 32a.

The output produced by the input VMM system 32a is provided as input to the next VMM system (hidden level 1) 32b, which in turn generates output provided as input to the next VMM system (hidden level 2) 32c, and so on. The layers of the VMM system 32 serve as different layers of synapses and neurons of the Convolutional Neural Network (CNN). Each VMM system 32a, 32b, 32c, 32d, and 32e may be a separate physical non-volatile memory array, or multiple VMM systems may utilize different portions of the same non-volatile memory array, or multiple VMM systems may utilize overlapping portions of the same physical non-volatile memory array. Each VMM system 32a, 32b, 32c, 32d, and 32e may also be time-multiplexed for different portions of its array or neurons. The example shown in fig. 10 comprises five layers (32a, 32b, 32c, 32d, 32e): one input layer (32a), two hidden layers (32b, 32c), and two fully connected layers (32d, 32e). Those of ordinary skill in the art will appreciate that this is merely exemplary and that a system may instead include more than two hidden layers and more than two fully connected layers. Furthermore, different layers may use different combinations of n-bit memory cells (supporting multiple different levels per cell), including 2-level memory cells (meaning only 2 levels, "0" and "1").

VMM array

FIG. 11 illustrates a neuron VMM array 1100 that is particularly suited for use with the memory cell 310 shown in FIG. 3 and serves as a synapse and component for neurons between an input layer and a next layer. The VMM array 1100 includes a memory array 1101 of non-volatile memory cells and a reference array 1102 of non-volatile reference memory cells (at the top of the array). Alternatively, another reference array may be placed at the bottom.

In the VMM array 1100, control gate lines (such as control gate line 1103) extend in a vertical direction (so the reference array 1102 is orthogonal to the control gate line 1103 in the row direction), and erase gate lines (such as erase gate line 1104) extend in a horizontal direction. Here, the inputs of the VMM array 1100 are set on the control gate lines (CG0, CG1, CG2, CG3), and the outputs of the VMM array 1100 appear on the source lines (SL0, SL 1). In one embodiment, only even rows are used, and in another embodiment, only odd rows are used. The current placed on each source line (SL0, SL1, respectively) performs a summation function of all currents from the memory cells connected to that particular source line.

As described herein with respect to neural networks, the non-volatile memory cells of the VMM array 1100 (i.e., the flash memory of the VMM array 1100) are preferably configured to operate in a sub-threshold region.

Biasing the non-volatile reference memory cell and the non-volatile memory cell described herein in weak inversion:

Ids = Io * e^((Vg - Vth)/(n*Vt)) = w * Io * e^(Vg/(n*Vt))

wherein w = e^(-Vth/(n*Vt))

Here, Vg is the gate voltage on the memory cell; Vth is the threshold voltage of the memory cell; Vt is the thermal voltage k*T/q, where k is the Boltzmann constant, T is the temperature in Kelvin, and q is the electronic charge; n is the slope factor 1 + (Cdep/Cox), where Cdep is the capacitance of the depletion layer and Cox is the capacitance of the gate oxide layer; and Io is the memory cell current at a gate voltage equal to the threshold voltage. Io is proportional to (Wt/L) * u * Cox * (n - 1) * Vt^2, where u is the carrier mobility, and Wt and L are the width and length, respectively, of the memory cell.

For an I-to-V logarithmic converter that converts an input current to an input voltage using a memory cell (such as a reference memory cell or a peripheral memory cell) or transistor:

Vg = n * Vt * log[Ids / (wp * Io)]

here, wp is w of the reference memory cell or the peripheral memory cell.

For a memory array used as a vector matrix multiplier VMM array, the output current is:

Iout = wa * Io * e^(Vg/(n*Vt)), i.e.,

Iout = (wa/wp) * Iin = W * Iin

W = e^((Vthp - Vtha)/(n*Vt))

Here, wa = w for each memory cell in the memory array.
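A quick numerical check that the two expressions for Iout above agree: with w = e^(-Vth/(n*Vt)), the ratio wa/wp equals W = e^((Vthp - Vtha)/(n*Vt)). All parameter values below are illustrative assumptions, not device data:

```python
import math

# Illustrative (assumed) parameters: slope factor, thermal voltage (V),
# and the reference-cell and array-cell threshold voltages (V).
n, Vt = 1.5, 0.026
Vthp, Vtha = 0.70, 0.65

wp = math.exp(-Vthp / (n * Vt))          # w of the reference/peripheral cell
wa = math.exp(-Vtha / (n * Vt))          # w of the array cell
W = math.exp((Vthp - Vtha) / (n * Vt))   # weight as defined in the text

Iin = 1e-9                               # amps, illustrative
Iout = (wa / wp) * Iin                   # Iout = (wa/wp) * Iin = W * Iin
print(abs(Iout - W * Iin) < 1e-20)       # True: the two forms are equivalent
```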

The word line or control gate may be used as an input to the memory cell for an input voltage.

Alternatively, the flash memory cells of the VMM array described herein may be configured to operate in the linear region:

Ids = β * (Vgs - Vth) * Vds; β = u * Cox * Wt / L,

where Wt and L are the width and length of the transistor, respectively

W = α * (Vgs - Vth),

meaning that the weight W is proportional to (Vgs - Vth).

A word line or control gate or bit line or source line may be used as an input to a memory cell operating in the linear region. A bit line or source line may be used as the output of the memory cell.

For an I-to-V linear converter, a memory cell operating in the linear region (e.g., a reference memory cell or a peripheral memory cell) or a transistor or resistor may be used to linearly convert the input/output current to an input/output voltage.

Alternatively, the flash memory cells of the VMM array described herein may be configured to operate in the saturation region:

Ids = 1/2 * β * (Vgs - Vth)^2; β = u * Cox * Wt / L

W = α * (Vgs - Vth)^2, meaning the weight W is proportional to (Vgs - Vth)^2

A word line, control gate, or erase gate may be used as an input to a memory cell operating in a saturation region. A bit line or source line may be used as the output of the output neuron.

Alternatively, the flash memory cells of the VMM array described herein may be used in all regions or combinations thereof (sub-threshold, linear, or saturated regions).
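The three regional current models above can be summarized in one sketch. All parameter values (Io, Vth, n, Vt, β) are assumed for illustration only, not device data:

```python
import math

# Illustrative cell-current models for the three operating regions discussed
# above: sub-threshold (weak inversion), linear, and saturation.
Io, Vth, n, Vt = 1e-9, 0.7, 1.5, 0.026   # assumed parameters
beta = 1e-4                              # beta = u * Cox * Wt / L (assumed)

def ids_subthreshold(Vg):
    # Ids = Io * e^((Vg - Vth)/(n*Vt))
    return Io * math.exp((Vg - Vth) / (n * Vt))

def ids_linear(Vgs, Vds):
    # Ids = beta * (Vgs - Vth) * Vds
    return beta * (Vgs - Vth) * Vds

def ids_saturation(Vgs):
    # Ids = 1/2 * beta * (Vgs - Vth)^2
    return 0.5 * beta * (Vgs - Vth) ** 2
```

At Vg = Vth the sub-threshold model returns Io, matching the definition of Io given earlier in the text.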

Other embodiments of VMM array 32 of fig. 9 are described in U.S. patent application 15/826,345, which is incorporated herein by reference. As described herein, a source line or bit line may be used as the neuron output (current summation output).

FIG. 12 illustrates a neuronal VMM array 1200 that is particularly suited for use in the memory cell 210 shown in FIG. 2 and serves as a synapse between an input layer and a next layer. The VMM array 1200 includes a memory array 1203 of non-volatile memory cells, a reference array 1201 of first non-volatile reference memory cells, and a reference array 1202 of second non-volatile reference memory cells. The reference arrays 1201 and 1202 arranged along the column direction of the array are used to convert the current inputs flowing into the terminals BLR0, BLR1, BLR2 and BLR3 into voltage inputs WL0, WL1, WL2 and WL 3. In practice, the first and second non-volatile reference memory cells are diode connected through a multiplexer 1214 (only partially shown) into which the current input flows. The reference cell is tuned (e.g., programmed) to a target reference level. The target reference level is provided by a reference microarray matrix (not shown).

The memory array 1203 serves two purposes. First, it stores the weights that the VMM array 1200 will use on its corresponding memory cells. Second, memory array 1203 effectively multiplies the inputs (i.e., the current inputs provided in terminals BLR0, BLR1, BLR2, and BLR3, which reference arrays 1201 and 1202 convert to input voltages to provide to word lines WL0, WL1, WL2, and WL3) by the weights stored in memory array 1203, and then adds all the results (memory cell currents) to produce an output on the corresponding bit line (BL0-BLN) that will be the input of the next layer or the input of the final layer. By performing the multiply and add functions, the memory array 1203 eliminates the need for separate multiply and add logic circuits and is also power efficient. Here, voltage inputs are provided on the word lines (WL0, WL1, WL2, and WL3), and outputs appear on the respective bit lines (BL0-BLN) during a read (infer) operation. The current placed on each of the bit lines BL0-BLN performs a summation function of the currents from all the non-volatile memory cells connected to that particular bit line.

Table 5 shows the operating voltages for the VMM array 1200. The columns in the table indicate the voltages placed on the word line for the selected cell, the word lines for the unselected cells, the bit line for the selected cell, the bit lines for the unselected cells, the source line for the selected cell, and the source line for the unselected cells, where FLT indicates floating, i.e., no voltage applied. The rows indicate read, erase, and program operations.

Table 5: operation of the VMM array 1200 of FIG. 12

Operation  | WL         | WL-unselected | BL                | BL-unselected | SL     | SL-unselected
Read       | 0.5-3.5V   | -0.5V/0V      | 0.1-2V (Ineuron)  | 0.6V-2V/FLT   | 0V     | 0V
Erase      | ~5-13V     | 0V            | 0V                | 0V            | 0V     | 0V
Program    | 1-2V       | -0.5V/0V      | 0.1-3uA           | Vinh ~2.5V    | 4-10V  | 0-1V/FLT

FIG. 13 illustrates a neuron VMM array 1300 that is particularly suited for use with the memory cell 210 shown in FIG. 2 and serves as a synapse and component for neurons between an input layer and a next layer. The VMM array 1300 includes a memory array 1303 of non-volatile memory cells, a reference array 1301 of first non-volatile reference memory cells, and a reference array 1302 of second non-volatile reference memory cells. The reference arrays 1301 and 1302 extend in the row direction of the VMM array 1300. VMM array 1300 is similar to VMM array 1100 except that in VMM array 1300, the word lines extend in a vertical direction. Here, inputs are provided on the word lines (WLA0, WLB0, WLA1, WLB1, WLA2, WLB2, WLA3, WLB3), and outputs appear on the source lines (SL0, SL1) during a read operation. The current placed on each source line performs a summation function of all currents from the memory cells connected to that particular source line.

Table 6 shows the operating voltages of VMM array 1300. The columns in the table indicate the voltages placed on the word line for the selected cell, the word lines for the unselected cells, the bit line for the selected cell, the bit lines for the unselected cells, the source line for the selected cell, and the source line for the unselected cells. The rows indicate read, erase, and program operations.

Table 6: operation of the VMM array 1300 of FIG. 13

Fig. 14 illustrates a neuron VMM array 1400 that is particularly suitable for the memory cell 310 illustrated in fig. 3 and serves as a synapse and component for neurons between the input layer and the next layer. The VMM array 1400 includes a memory array 1403 of non-volatile memory cells, a reference array 1401 of first non-volatile reference memory cells, and a reference array 1402 of second non-volatile reference memory cells. The reference arrays 1401 and 1402 are used to convert current inputs flowing into the terminals BLR0, BLR1, BLR2, and BLR3 into voltage inputs CG0, CG1, CG2, and CG 3. In practice, the first and second non-volatile reference memory cells are diode connected through a multiplexer 1412 (only partially shown) with current inputs flowing therein through BLR0, BLR1, BLR2, and BLR 3. The multiplexers 1412 each include a respective multiplexer 1405 and cascode transistor 1404 to ensure a constant voltage on the bit line (such as BLR0) of each of the first and second non-volatile reference memory cells during a read operation. The reference cell is tuned to a target reference level.

Memory array 1403 serves two purposes. First, it stores the weights to be used by the VMM array 1400. Second, memory array 1403 effectively multiplies the inputs (current inputs provided to terminals BLR0, BLR1, BLR2, and BLR3, which reference arrays 1401 and 1402 convert to input voltages to be provided to control gates CG0, CG1, CG2, and CG3) by the weights stored in the memory array, and then adds all the results (cell currents) to produce an output that appears at BL0-BLN and will be the input of the next layer or the input of the final layer. By performing the multiply and add functions, the memory array eliminates the need for separate multiply and add logic circuits and is also power efficient. Here, inputs are provided on control gate lines (CG0, CG1, CG2, and CG3), and outputs appear on bit lines (BL0-BLN) during read operations. The currents placed on each bit line perform a summation function of all currents from the memory cells connected to that particular bit line.

VMM array 1400 implements unidirectional tuning for the non-volatile memory cells in memory array 1403. That is, each non-volatile memory cell is erased and then partially programmed until the desired charge on the floating gate is reached. This may be performed, for example, using the precise programming techniques described below. If too much charge is placed on the floating gate (so that the wrong value is stored in the cell), the cell must be erased and the sequence of partial program operations must be restarted. As shown, two rows sharing the same erase gate (such as EG0 or EG1) need to be erased together (which is referred to as page erase) and, thereafter, each cell is partially programmed until the desired charge on the floating gate is reached.

Table 7 shows the operating voltages for the VMM array 1400. The columns in the table indicate the voltages placed on the word line for the selected cell, the word lines for the unselected cells, the bit lines for the selected cell, the bit lines for the unselected cells, the control gate for the selected cell, the control gate for the unselected cells in the same sector as the selected cell, the control gate for the unselected cells in a different sector than the selected cell, the erase gate for the unselected cells, the source line for the selected cell, the source line for the unselected cells. The rows indicate read, erase, and program operations.

Table 7: operation of the VMM array 1400 of FIG. 14

FIG. 15 shows a neuron VMM array 1500 that is particularly suited for the memory cell 310 shown in FIG. 3 and serves as a synapse and component for neurons between the input layer and the next layer. The VMM array 1500 includes a memory array 1503 of non-volatile memory cells, a reference array 1501 of first non-volatile reference memory cells, and a reference array 1502 of second non-volatile reference memory cells. EG lines EGR0, EG0, EG1, and EGR1 extend vertically, while CG lines CG0, CG1, CG2, and CG3 and word lines WL0, WL1, WL2, and WL3 extend horizontally. VMM array 1500 is similar to VMM array 1400 except that VMM array 1500 implements bidirectional tuning, where each individual cell can be fully erased, partially programmed, and partially erased as needed to achieve the desired amount of charge on the floating gate, due to the use of separate EG lines. As shown, the reference arrays 1501 and 1502 convert the input currents in terminals BLR0, BLR1, BLR2, and BLR3 into control gate voltages CG0, CG1, CG2, and CG3 (through the action of the diode-connected reference cells via multiplexer 1514) to be applied to the memory cells in the row direction. The current outputs (neurons) are on bit lines BL0-BLN, where each bit line sums all currents from the non-volatile memory cells connected to that particular bit line.

Table 8 shows the operating voltages for VMM array 1500. The columns in the table indicate the voltages placed on the word line for the selected cell, the word lines for the unselected cells, the bit lines for the selected cell, the bit lines for the unselected cells, the control gate for the selected cell, the control gate for the unselected cells in the same sector as the selected cell, the control gate for the unselected cells in a different sector than the selected cell, the erase gate for the unselected cells, the source line for the selected cell, the source line for the unselected cells. The rows indicate read, erase, and program operations.

Table 8: operation of VMM array 1500 of FIG. 15

Fig. 24 shows a neuron VMM array 2400 that is particularly suitable for use in the memory cell 210 shown in fig. 2, and that serves as a synapse and component for neurons between the input layer and the next layer. In VMM array 2400, inputs INPUT0, …, INPUTN are received on bit lines BL0, …, BLN, respectively, and outputs OUTPUT1, OUTPUT2, OUTPUT3, and OUTPUT4 are generated on source lines SL0, SL1, SL2, and SL3, respectively.

Fig. 25 illustrates a neuron VMM array 2500 that is particularly suitable for use in the memory cell 210 illustrated in fig. 2 and serves as a synapse and component for neurons between the input layer and the next layer. In this example, inputs INPUT0, INPUT1, INPUT2, and INPUT3 are received on source lines SL0, SL1, SL2, and SL3, respectively, and outputs OUTPUT0, …, OUTPUTN are generated on bit lines BL0, …, BLN.

FIG. 26 illustrates a neuron VMM array 2600 that is particularly suited for use in the memory cell 210 shown in FIG. 2, and serves as a synapse and component for neurons between the input layer and the next layer. In this example, inputs INPUT0, …, INPUTM are received on word lines WL0, …, WLM, respectively, and outputs OUTPUT0, …, OUTPUTN are generated on bit lines BL0, …, BLN.

FIG. 27 shows a neuron VMM array 2700 that is particularly well suited for use in the memory cell 310 shown in FIG. 3 and serves as a synapse and component for neurons between an input layer and a next layer. In this example, inputs INPUT0, …, INPUTM are received on word lines WL0, …, WLM, respectively, and outputs OUTPUT0, …, OUTPUTN are generated on bit lines BL0, …, BLN.

FIG. 28 illustrates a neuron VMM array 2800 that is particularly suited for use with the memory cell 410 illustrated in FIG. 4 and serves as a synapse and component for neurons between the input layer and the next layer. In this example, inputs INPUT0, …, INPUTN are received on vertical control gate lines CG0, …, CGN, respectively, and outputs OUTPUT1 and OUTPUT2 are generated on source lines SL0 and SL1.

FIG. 29 shows a neuron VMM array 2900 that is particularly suited for the memory cell 410 shown in FIG. 4 and serves as a synapse and component for neurons between the input layer and the next layer. In this example, inputs INPUT0 through INPUTN are received on the gates of bit line control gates 2901-1, 2901-2, through 2901-(N-1) and 2901-N, respectively, which are coupled to bit lines BL0 through BLN, respectively. Exemplary outputs OUTPUT1 and OUTPUT2 are generated on source lines SL0 and SL1.

Fig. 30 illustrates a neuron VMM array 3000 that is particularly suitable for use in the memory cell 310 shown in fig. 3, the memory cell 510 shown in fig. 5, and the memory cell 710 shown in fig. 7, and serves as a synapse and component for neurons between an input layer and a next layer. In this example, inputs INPUT0, …, INPUTM are received on word lines WL0, …, WLM, and outputs OUTPUT0, …, OUTPUTN are generated on bit lines BL0, …, BLN, respectively.

Fig. 31 illustrates a neuron VMM array 3100 that is particularly suited for the memory cell 310 shown in fig. 3, the memory cell 510 shown in fig. 5, and the memory cell 710 shown in fig. 7, and serves as a synapse and component for neurons between an input layer and a next layer. In this example, inputs INPUT0 through INPUTM are received on control gate lines CG0 through CGM. Outputs OUTPUT0, …, OUTPUTN are generated on vertical source lines SL0, …, SLN, respectively, where each source line SLi is coupled to the source line terminals of all memory cells in column i.

Fig. 32 illustrates a neuron VMM array 3200 that is particularly suitable for use in the memory cell 310 shown in fig. 3, the memory cell 510 shown in fig. 5, and the memory cell 710 shown in fig. 7, and serves as a synapse and component for neurons between an input layer and a next layer. In this example, inputs INPUT0 through INPUTM are received on control gate lines CG0 through CGM. Outputs OUTPUT0, …, OUTPUTN are generated on vertical bit lines BL0, …, BLN, respectively, where each bit line BLi is coupled to the bit line terminals of all memory cells in column i.

Long Short-Term Memory

The prior art includes a concept known as long short-term memory (LSTM). LSTM units are commonly used in neural networks. LSTM allows a neural network to remember information over predetermined arbitrary time intervals and to use that information in subsequent operations. A conventional LSTM unit comprises a cell, an input gate, an output gate, and a forget gate. The three gates regulate the flow of information into and out of the cell and the time interval over which information is remembered in the LSTM. VMMs are particularly useful in LSTM units.

Fig. 16 shows an exemplary LSTM 1600. LSTM 1600 in this example comprises cells 1601, 1602, 1603, and 1604. Cell 1601 receives input vector x0 and generates output vector h0 and cell state vector c0. Cell 1602 receives input vector x1, the output vector (hidden state) h0 from cell 1601, and the cell state c0 from cell 1601, and generates output vector h1 and cell state vector c1. Cell 1603 receives input vector x2, the output vector (hidden state) h1 from cell 1602, and the cell state c1 from cell 1602, and generates output vector h2 and cell state vector c2. Cell 1604 receives input vector x3, the output vector (hidden state) h2 from cell 1603, and the cell state c2 from cell 1603, and generates output vector h3. Additional cells may be used; an LSTM with four cells is merely an example.

Fig. 17 shows an exemplary implementation of an LSTM unit 1700 that may be used for cells 1601, 1602, 1603, and 1604 in fig. 16. LSTM unit 1700 receives input vector x(t), cell state vector c(t-1) from the preceding unit, and output vector h(t-1) from the preceding unit, and generates cell state vector c(t) and output vector h(t).

LSTM unit 1700 includes sigmoid function devices 1701, 1702, and 1703, each of which applies a number between 0 and 1 to control how much of each component in the input vector is allowed to pass through to the output vector. The LSTM unit 1700 further comprises tanh devices 1704 and 1705 for applying a hyperbolic tangent function to an input vector, multiplier devices 1706, 1707, and 1708 for multiplying two vectors together, and an addition device 1709 for adding two vectors together. The output vector h(t) may be provided to the next LSTM unit in the system, or it may be accessed for other purposes.
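A minimal scalar sketch of this gate/multiplier/adder dataflow. The VMM weight matrices are omitted for brevity (the gates here act directly on x + h_prev; a real unit applies learned weights), and the assignment of devices 1701-1703 to specific gates is an assumption made for the comments:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev):
    # Sigmoid devices (1701-1703, gate roles assumed) and tanh device 1704:
    f = sigmoid(x + h_prev)        # forget gate
    i = sigmoid(x + h_prev)        # input gate
    u = math.tanh(x + h_prev)      # candidate value
    o = sigmoid(x + h_prev)        # output gate
    # Multipliers (1706-1708) and adder (1709), plus tanh device 1705:
    c = f * c_prev + i * u         # new cell state c(t)
    h = o * math.tanh(c)           # new output vector h(t)
    return h, c

h, c = lstm_step(0.0, 0.0, 0.0)
print(h, c)  # 0.0 0.0
```

With zero inputs and zero state, every gate outputs 0.5 and the tanh terms are zero, so both h(t) and c(t) remain zero.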

FIG. 18 shows an LSTM unit 1800, which is an example of a specific implementation of LSTM unit 1700. For the reader's convenience, the same numbering is used in LSTM unit 1800 as in LSTM unit 1700. Sigmoid function devices 1701, 1702, and 1703 and tanh device 1704 each include multiple VMM arrays 1801 and activation function blocks 1802. Thus, it can be seen that VMM arrays are particularly useful in LSTM units used in some neural network systems. Multiplier devices 1706, 1707, and 1708 and addition device 1709 are implemented in a digital or analog manner. Activation function blocks 1802 may be implemented in a digital or analog manner.

An alternative to LSTM unit 1800 (and another example of a specific implementation of LSTM unit 1700), LSTM unit 1900, is shown in fig. 19. In fig. 19, sigmoid function devices 1701, 1702, and 1703 and tanh device 1704 share the same physical hardware (VMM array 1901 and activation function block 1902) in a time-multiplexed fashion. LSTM unit 1900 further includes multiplier device 1903 that multiplies two vectors together, addition device 1908 that adds two vectors together, tanh device 1705 (which includes activation function block 1902), register 1907 that stores the value i(t) when i(t) is output from sigmoid function block 1902, register 1904 that stores the value f(t)*c(t-1) when it is output from multiplier device 1903 through multiplexer 1910, register 1905 that stores the value i(t)*u(t) when it is output from multiplier device 1903 through multiplexer 1910, register 1906 that stores the value o(t)*c(t) when it is output from multiplier device 1903 through multiplexer 1910, and multiplexer 1909.

LSTM unit 1800 includes multiple sets of VMM arrays 1801 and corresponding activation function blocks 1802, while LSTM unit 1900 includes only one set of VMM array 1901 and activation function block 1902, which are used to represent the multiple layers in the embodiment of LSTM unit 1900. LSTM unit 1900 will require less space than LSTM unit 1800, because LSTM unit 1900 requires only 1/4 as much space for VMMs and activation function blocks as LSTM unit 1800.

It will also be appreciated that an LSTM unit will typically include multiple VMM arrays, each requiring functionality provided by certain circuit blocks outside the VMM array, such as the summer and activation circuit blocks and the high voltage generation block. Providing separate circuit blocks for each VMM array would require a large amount of space within the semiconductor device and would be somewhat inefficient.

Gated Recurrent Units

An analog VMM implementation may be used in gated recurrent unit (GRU) systems. GRUs are a gating mechanism in recurrent neural networks. A GRU is similar to an LSTM, except that GRU units generally contain fewer components than LSTM units.

Fig. 20 shows an exemplary GRU 2000. GRU 2000 in this example comprises cells 2001, 2002, 2003, and 2004. Cell 2001 receives input vector x0 and generates output vector h0. Cell 2002 receives input vector x1 and the output vector h0 from cell 2001, and generates output vector h1. Cell 2003 receives input vector x2 and the output vector (hidden state) h1 from cell 2002, and generates output vector h2. Cell 2004 receives input vector x3 and the output vector (hidden state) h2 from cell 2003, and generates output vector h3. Additional cells may be used; a GRU with four cells is merely an example.

Fig. 21 shows an exemplary implementation of a GRU unit 2100 that may be used for units 2001, 2002, 2003, and 2004 in fig. 20. GRU unit 2100 receives input vector x(t) and output vector h(t-1) from a preceding GRU unit and generates output vector h(t). GRU unit 2100 includes sigmoid function devices 2101 and 2102, each of which applies a number between 0 and 1 to components from output vector h(t-1) and input vector x(t). GRU unit 2100 further comprises a tanh device 2103 for applying a hyperbolic tangent function to an input vector, a plurality of multiplier devices 2104, 2105, and 2106 for multiplying two vectors together, an addition device 2107 for adding two vectors together, and a complementary device 2108 for subtracting an input from 1 to generate an output.
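A minimal scalar sketch of this dataflow, with the weight matrices omitted and the assignment of devices 2101 and 2102 to the update and reset gates assumed for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev):
    z = sigmoid(x + h_prev)             # update gate (sigmoid device, role assumed)
    r = sigmoid(x + h_prev)             # reset gate (sigmoid device, role assumed)
    h_cand = math.tanh(x + r * h_prev)  # candidate output (tanh device 2103)
    # Complementary device forms (1 - z); multipliers and adder combine:
    return z * h_prev + (1.0 - z) * h_cand

print(gru_step(0.0, 0.0))  # 0.0
```

Compared with the LSTM sketch earlier, there is no separate cell state and one fewer gate, which is why GRU units generally contain fewer components.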

Fig. 22 shows a GRU unit 2200, which is an example of a specific implementation of a GRU unit 2100. For the convenience of the reader, the same numbering is used in GRU unit 2200 as in GRU unit 2100. As shown in fig. 22, the sigmoid function devices 2101 and 2102 and the tanh device 2103 each include a plurality of VMM arrays 2201 and an activation function block 2202. Thus, it can be seen that VMM arrays are particularly useful in GRU units used in certain neural network systems. Multiplier devices 2104, 2105 and 2106, summing device 2107 and complementary device 2108 are implemented in a digital or analog manner. The activation function block 2202 may be implemented in a digital or analog manner.

An alternative to GRU unit 2200 (and another example of a specific implementation of GRU unit 2100) is shown in fig. 23. In fig. 23, GRU unit 2300 utilizes VMM array 2301 and activation function block 2302, which, when configured as a sigmoid function, applies a number between 0 and 1 to control how much of each component in the input vector is allowed to pass through to the output vector. In fig. 23, sigmoid function devices 2101 and 2102 and tanh device 2103 share the same physical hardware (VMM array 2301 and activation function block 2302) in a time-multiplexed fashion. GRU unit 2300 further includes multiplier device 2303 to multiply two vectors together, addition device 2305 to add two vectors together, complementary device 2309 to subtract an input from 1 to generate an output, multiplexer 2304, register 2306 to hold the value h(t-1)*r(t) when it is output from multiplier device 2303 through multiplexer 2304, register 2307 to hold the value h(t-1)*z(t) when it is output from multiplier device 2303 through multiplexer 2304, and register 2308 to hold the value h^(t)*(1-z(t)) when it is output from multiplier device 2303 through multiplexer 2304.

The GRU unit 2200 includes multiple sets of VMM arrays 2201 and activation function blocks 2202, while GRU unit 2300 includes only one set (VMM array 2301 and activation function block 2302), which is used to represent all of the layers in an embodiment of GRU unit 2300. GRU unit 2300 requires less space than GRU unit 2200, because it requires only 1/3 of the space for VMMs and activation function blocks compared to GRU unit 2200.

It will also be appreciated that a GRU system will typically include multiple VMM arrays, each requiring functionality provided by certain circuit blocks outside the VMM array, such as the summer and activation circuit blocks and the high voltage generation block. Providing separate circuit blocks for each VMM array would require a large amount of space within the semiconductor device and would be somewhat inefficient.

The input to the VMM array may be an analog level, a binary level, or a digital bit (in which case a DAC is required to convert the digital bit to the appropriate input analog level), and the output may be an analog level, a binary level, or a digital bit (in which case an output ADC is required to convert the output analog level to a digital bit).

For each memory cell in the VMM array, each weight W may be implemented by a single memory cell, by a differential cell, or by two hybrid memory cells (the average of the 2 cells). In the differential case, two memory cells are needed to implement the weight W as a differential weight (W = W+ - W-). In the two-hybrid-cell case, two memory cells are needed to implement the weight W as the average of the two cells.
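As an illustration of the three weight mappings just described, the following sketch (not from the source; the function names are ours) computes the effective weight W for each case:

```python
# Illustrative sketch of the three weight-to-cell mappings described above.
# Cell values are in arbitrary normalized units.

def weight_single(w_cell):
    # one memory cell directly stores W
    return w_cell

def weight_differential(w_plus, w_minus):
    # differential pair: W = W+ - W-
    return w_plus - w_minus

def weight_hybrid(w_a, w_b):
    # two hybrid cells: W is the average of the two cells
    return (w_a + w_b) / 2

print(weight_differential(0.75, 0.25))  # 0.5
print(weight_hybrid(0.5, 1.0))          # 0.75
```

Note that the differential mapping can represent negative weights (when W- exceeds W+), which a single cell conducting a positive current cannot.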

Configurable input-output system for VMM arrays

Fig. 33 shows a VMM system 3300. VMM system 3300 includes VMM array 3301 (which may be based on any of the previously discussed VMM array designs, such as VMM arrays 1100, 1200, 1300, 1400, 1500, 2400, 2500, 2600, 2700, 2800, 2900, 3000, 3100, and 3200, or other VMM array designs), low voltage row decoder 3302, high voltage row decoder 3303, column decoder 3304, column driver 3305, control logic 3306, bias circuit 3307, neuron output circuit block 3308, input VMM circuit block 3309, algorithm controller 3310, high voltage generator block 3311, analog circuit block 3315, and control logic 3316.

The input circuit block 3309 serves as an interface input from the outside to the input terminal of the memory array 3301. The input circuit block 3309 may include, but is not limited to, a DAC (digital-to-analog converter), DPC (digital-to-pulse converter), APC (analog-to-pulse converter), IVC (current-to-voltage converter), AAC (analog-to-analog converter, such as a voltage-to-voltage scaler), or FAC (frequency-to-analog converter). The neuron output block 3308 serves as an interface (not shown) for output from the memory array to an external interface. The neuron output block 3308 may include, but is not limited to, an ADC (analog-to-digital converter), APC (analog-to-pulse converter), DPC (digital-to-pulse converter), IVC (current-to-voltage converter), or IFC (current-to-frequency converter). Neuron output block 3308 may include, but is not limited to, activation functions, normalization circuitry, and/or rescaling circuitry.

Fig. 34 shows a VMM system 3400 including VMM arrays 3401, 3402, 3403 and 3404; high-voltage row decoders 3405 and 3406; low-voltage row decoders 3407 and 3408; input blocks 3409 and 3410 (each similar to input block 3309 in fig. 33); and output blocks 3411 and 3412. In this configuration, VMM arrays 3401 and 3403 share a set of bit lines and output block 3411, and VMM arrays 3402 and 3404 share a set of bit lines and output block 3412. VMM arrays 3401 and 3403 may be read simultaneously, which would effectively combine them into a single larger array, or they may be read at different times. Output blocks 3411 and 3412 (similar to output block 3308 in fig. 33) may be configured to be able to process read operations from one array at a time (such as reads from only array 3401 or 3403) or from multiple arrays at a time (such as reads from both arrays 3401 and 3403).

Fig. 35A illustrates a VMM system 3500, which includes VMM arrays 3503, 3504, and 3505; shared global high-voltage row decoder 3506; local high-voltage row decoders 3507 and 3508; shared low-voltage row decoder 3509; and an input block 3510. In this configuration, VMM arrays 3503, 3504, and 3505 share input block 3510. The VMM arrays 3503, 3504, and 3505 may receive inputs (e.g., voltages or pulses on word lines, control gate lines, erase gate lines, or source lines) simultaneously through input block 3510, which effectively combines them into a single larger VMM array, or they may receive inputs through input block 3510 at different times, which effectively causes them to operate as three different VMM arrays sharing the same input block. Input block 3510 may be configured to provide input to one array at a time or to multiple arrays at a time.

Fig. 35B shows a VMM system 3550, including VMM arrays 3511, 3512, 3513, and 3514; a global high-voltage decoder 3515; local high-voltage row decoders 3516, 3517, and 3518; shared low-voltage row decoder 3519; and an input block 3520. In this configuration, VMM arrays 3511, 3512, 3513, and 3514 share input block 3520. The VMM arrays 3511, 3512, 3513, and 3514 may receive inputs (e.g., voltages or pulses on word lines, control gate lines, erase gate lines, or source lines) simultaneously through input block 3520, which effectively combines them into a single larger array, or they may receive inputs through input block 3520 at different times, which effectively causes them to operate as four different VMM arrays sharing the same input block 3520. Input block 3520 may be configured to provide input to one array at a time or to multiple arrays at a time. For example, input block 3510 of fig. 35A is configured to provide input to 3 arrays, while input block 3520 is configured to provide input to 4 arrays.

Fig. 36 shows a VMM system 3600 including a horizontal group 3601 and a horizontal group 3611. Horizontal group 3601 includes VMM arrays 3602 and 3603; a shared global high-voltage row decoder 3604; a local high-voltage row decoder 3605; a shared low-voltage row decoder 3606; and an input block 3607. VMM arrays 3602 and 3603 share input block 3607. Input block 3607 may be configured to provide input to one array at a time or to multiple arrays at a time.

Horizontal group 3611 includes VMM arrays 3612 and 3613; shared global high voltage decoder 3614; local high voltage row decoder 3615; shared low-voltage row decoder 3616; and an input block 3617. VMM arrays 3612 and 3613 share input block 3617. The input block 3617 may be configured to be able to provide input to one array at a time or to multiple arrays at a time.

In a first configuration, horizontal group 3601 utilizes output blocks 3608 and 3609, and horizontal group 3611 utilizes output blocks 3618 and 3619. Output blocks 3608, 3609, 3618, and 3619 may output as outputs a current, a digital pulse, or a digital bit. In one implementation where the digital bits are output, output blocks 3608, 3609, 3618, and 3619 each output 8 digital output bits.

In a second configuration, output blocks 3608 and 3609 are disabled, VMM arrays 3602 and 3612 share output block 3618, and VMM arrays 3603 and 3613 share output block 3619. VMM arrays 3602 and 3612 may be read simultaneously, which effectively combines them into a single larger vertical array (meaning more rows per bit line), or they may be read at different times. If VMM arrays 3602 and 3612 are read simultaneously, then in one embodiment in which each output block outputs a value with an 8-bit range when coupled to only one array, output blocks 3618 and 3619 will each output a value with a 9-bit range. This is because the dynamic range of the output neurons is doubled by using the two arrays as a single large array. In this case, if the next array requires only 8 bits of dynamic range, the output may need to be rescaled or normalized (e.g., reduced from 9 bits to 8 bits). In another embodiment, the number of output bits may remain the same as the number of vertical arrays increases.

Similarly, VMM arrays 3603 and 3613 may be read simultaneously, which effectively combines them into a single larger array, or they may be read at different times. Output blocks 3618 and 3619 may be configured to be able to process read operations from one array at a time or from multiple arrays at a time.

In VMM systems 3400, 3500, 3550, and 3600, when the system may be configured to use a different number of arrays with each input block and/or output block, the input block or output block itself must also be configurable. For example, in VMM system 3600, if output blocks 3608, 3609, 3618, and 3619 each provide 8-bit outputs when coupled to a single array, then output blocks 3618 and 3619 would each need to be configured to provide 9-bit outputs when coupled to two arrays (e.g., arrays 3602 and 3612, and arrays 3603 and 3613, respectively). If these outputs are to be subsequently provided to an input block of another VMM system, they would first need to be normalized if that input block expects 8-bit inputs rather than 9-bit inputs. A variety of analog and digital techniques are known for converting an N-bit value into an M-bit value. In the foregoing example, N would be 9 and M would be 8, but one of ordinary skill in the art will appreciate that N and M may be any positive integers.
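A minimal sketch of such N-bit to M-bit normalization, assuming simple truncation of the extra LSBs (one of several known digital techniques; the source does not specify which is used):

```python
# Normalize an N-bit output value down to an M-bit input range by dropping
# the (N - M) least significant bits. This is a sketch, not the source circuit.

def normalize(value, n_bits, m_bits):
    """Rescale an n_bits-wide value down to m_bits by truncation."""
    assert 0 <= value < (1 << n_bits) and m_bits <= n_bits
    return value >> (n_bits - m_bits)

# A 9-bit output (0..511) feeding an 8-bit input (0..255):
print(normalize(511, 9, 8))  # 255
print(normalize(256, 9, 8))  # 128
```

Rounding (adding half an LSB before shifting) could be used instead of truncation to reduce the quantization error, at slightly higher cost.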

Additional arrays may be coupled to the input blocks and output blocks in the VMM systems 3400, 3500, 3550, and 3600. For example, in VMM system 3400, more than two arrays may be coupled to input block 3409 and more than two arrays may be coupled to input block 3410; in the VMM system 3500, more than three arrays may be coupled to the input block 3510; in VMM system 3550, more than four arrays may be coupled to input block 3520; and in VMM system 3600, more than two arrays may be coupled to input block 3607, more than two arrays may be coupled to input block 3617, more than two arrays may be coupled to output block 3618, and more than two arrays may be coupled to output block 3619. In those cases, the relevant input and output blocks need to be further configured to accommodate the additional arrays.

Output blocks 3411 and 3412 in VMM system 3400, and output blocks 3618 and 3619, need to be configurable for verify operations after a programming operation, since verification is affected by the number of arrays connected to the output blocks. Furthermore, for program/erase verification (for tuning, meaning that a particular charge is placed on the floating gate of a memory cell to produce the desired cell current), the accuracy of the output block circuitry (e.g., 10 bits) needs to be greater than the accuracy required for an inference read (e.g., 8 bits). For example, verification accuracy may exceed inference accuracy by 1 bit (or more generally by 1-5 bits, without limitation). This is necessary to ensure that there is sufficient margin between one level and the next, such as to accommodate the verify result distribution, data retention drift, and temperature or process variation.

Furthermore, the input blocks 3409, 3410, 3510, 3520, 3607, and 3617 and the output blocks 3411, 3412, 3608, 3609, 3618, and 3619 in fig. 34, 35A, 35B, and 36 need to be configurable for a calibration process, as the calibration will be affected by the number of arrays connected to the output blocks. Examples of calibration processes include processes for compensating for offsets, leakage, manufacturing processes, and variations due to temperature variations.

In the next section, various adjustable components are disclosed for use in the input and output blocks to enable configuration of the input and output blocks based on the number of arrays coupled to the input or output blocks.

Component of input block and output block

Fig. 37A shows an integrating dual-hybrid-slope analog-to-digital converter (ADC) 3700 that may be used in an output block, such as output blocks 3411, 3412, 3608, 3609, 3618, and 3619 in figs. 34 and 36, in which an output neuron current (INEU 3706) is the output current received by the output block from the VMM array. Integrating dual-hybrid-slope ADC 3700 converts INEU 3706 into a series of digital/analog pulses or digital output bits. Fig. 37B shows operation waveforms of integrating ADC 3700 in fig. 37A. Output waveforms 3710, 3711, and 3714 are for one current level; output waveforms 3712, 3713, and 3715 are for another, higher current level. Waveforms 3710 and 3712 have pulse widths proportional to the value of the output current. Waveforms 3711 and 3713 have a number of pulses proportional to the value of the output current. Waveforms 3714 and 3715 have digital output bits proportional to the value of the output current.

In one embodiment, ADC 3700 converts INEU 3706 (which is the analog output current received by the output block from the VMM array) into a digital pulse whose width varies in proportion to the magnitude of the analog output current in the neuron output block, as shown in the example of fig. 38. ADC 3700 includes an integrator, formed by integrating operational amplifier 3701 and adjustable integrating capacitor 3702, that integrates INEU 3706 against an adjustable reference current IREF 3707. Optionally, IREF 3707 may be derived from a bandgap reference with a temperature coefficient of 0, or with a temperature coefficient that tracks the neuron current INEU 3706. The latter temperature coefficient optionally may be obtained from a reference array of values determined during the test phase. During the initialization phase, switch 3708 is closed; Vout 3703 and the input at the negative terminal of operational amplifier 3701 then become equal to the value VREF. Thereafter, switch 3708 is opened, and during a fixed time period tref, switch S1 is closed and the neuron current INEU 3706 is integrated up. During the fixed time period tref, Vout rises, and its slope varies with the neuron current. Thereafter, by opening switch S1 and closing switch S2, the constant reference current IREF is integrated down for a time period tmeas (during which Vout drops), where tmeas is the time required to integrate Vout back down to VREF.

Output EC 3705 is high whenever VOUT > VREF, and low otherwise. EC 3705 thus generates a pulse whose width reflects the time period tmeas, which in turn is proportional to the current INEU 3706 (pulses 3710 and 3712 in fig. 37B).

Optionally, output pulse EC 3705 may be converted into a series of pulses of uniform period for transmission to the next stage of circuitry, such as the input block of another VMM array. At the beginning of time period tmeas, output EC 3705 is input into AND gate 3740 along with reference clock 3741. During the period when VOUT > VREF, the output will be pulse series 3742 (where the frequency of the pulses in pulse series 3742 is the same as the frequency of clock 3741). The number of pulses is proportional to the time period tmeas, which is proportional to the current INEU 3706 (waveforms 3711 and 3713 in fig. 37B).

Optionally, pulse series 3742 may be input to counter 3720, which counts the number of pulses in pulse series 3742 and generates count value 3721, a digital count of the number of pulses in pulse series 3742 that is directly proportional to neuron current INEU 3706. Count value 3721 comprises a set of digital bits (waveforms 3714 and 3715 in fig. 37B).
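The conversion described above can be modeled behaviourally. Assuming ideal components, the down-integration time is tmeas = tref x INEU / IREF, and the count equals the number of reference clock periods within tmeas. The function below is an illustrative sketch of that relationship, not the circuit itself:

```python
# Behavioural sketch (ideal components assumed) of the dual-slope conversion:
# Vout ramps up from VREF at slope INEU/C for a fixed time tref, then ramps
# down at slope IREF/C until it returns to VREF, so tmeas = tref * INEU / IREF.
# Gating reference clock 3741 with the tmeas-wide pulse yields a digital count.

def dual_slope_count(i_neu, i_ref, t_ref, f_clk):
    t_meas = t_ref * i_neu / i_ref   # down-integration time
    return int(t_meas * f_clk)       # clock pulses counted during tmeas

# Doubling the neuron current doubles the count (currents in amps,
# tref = 1 ms, 1 MHz reference clock):
print(dual_slope_count(1.0e-6, 4.0e-6, 1e-3, 1e6))
print(dual_slope_count(2.0e-6, 4.0e-6, 1e-3, 1e6))
```

Note that the integrating capacitor C cancels out of tmeas, which is why the dual-slope approach is insensitive to the absolute capacitor value.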

In another embodiment, integrating dual-slope ADC 3700 may convert neuron current INEU 3706 into a pulse whose width is inversely proportional to the magnitude of neuron current INEU 3706. This inversion may be done in a digital or analog manner, and the result converted into a series of pulses or digital bits for output to follow-on circuitry.

Adjustable integrating capacitor 3702 and adjustable reference current IREF 3707 are adjusted in response to the number N of arrays connected to integrating dual-hybrid-slope analog-to-digital converter (ADC) 3700. For example, when N arrays are connected to ADC 3700, adjustable integrating capacitor 3702 is adjusted by a factor of 1/N, or adjustable reference current IREF 3707 is adjusted by a factor of N.

Optionally, a calibration step may be performed when the VMM array and ADC 3700 are at or above operating temperature to offset any leakage current present in the VMM array or the control circuitry, and then this offset value may be subtracted from Ineu in fig. 37A. Calibration steps may also be performed to compensate for process or voltage source variations other than temperature variations.

The method of operation of the output circuit block is as follows. First, calibration is performed to compensate for offset and voltage source variation. Next, output conversion is performed (such as converting neuron currents into pulses or digital bits), and then data normalization is performed to align the output range with the input range of the next VMM array. Data normalization may include data compression or output data quantization (such as reducing the number of bits from 10 bits to 8 bits). Activation may be performed after output conversion or after data normalization, compression, or quantization. Examples of calibration algorithms are discussed below with reference to figs. 49, 50A, 50B, and 51.
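A minimal sketch of the offset-calibration step described above, assuming the offset is measured once with all inputs disabled and then subtracted from each subsequent reading (the class and method names are illustrative, not from the source):

```python
# Sketch of offset/leakage calibration for an output block: with inputs off,
# the residual output is measured as an offset, stored, and later subtracted
# from every measured neuron current. Names and units are illustrative.

class OutputBlock:
    def __init__(self):
        self.offset = 0.0

    def calibrate(self, measure_idle):
        # measure_idle() returns the leakage/offset current with inputs disabled
        self.offset = measure_idle()

    def read(self, i_raw):
        # return the offset-compensated neuron current
        return i_raw - self.offset

blk = OutputBlock()
blk.calibrate(lambda: 0.125)   # e.g. 0.125 uA of leakage measured at idle
print(blk.read(2.125))         # 2.0
```

In practice the calibration would be repeated at or above operating temperature, as noted earlier, so that temperature-dependent leakage is captured in the stored offset.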

Fig. 39 shows a current-to-voltage converter 3900 that is optionally available to convert neuron output currents to voltages that may be applied, for example, as inputs to a VMM memory array (e.g., on WL or CG lines). Thus, when the input blocks 3409, 3410, 3510, 3520, 3607, and 3617 in fig. 34, 35A, 35B, and 36 receive analog currents (rather than pulses or digital data) as inputs, a current-to-voltage converter 3900 may be used in these input blocks.

The current-to-voltage converter 3900 includes an operational amplifier 3901, an adjustable capacitor 3902, a switch 3903, a switch 3904, and a current source 3905, which here represents the neuron current INEU received by the input block. During current-voltage operation, switch 3903 will open and switch 3904 will close. The amplitude of the output Vout will increase in proportion to the magnitude of the neuron current INEU 3905.

Fig. 40 shows a digital data-to-voltage converter 4000 that optionally may be used to convert digital data (received as DIN) to a voltage that may be applied, for example, as an input to a VMM memory array (e.g., on a WL or CG line). When switch 4002 is closed, the data input of signal DIN will enable IREF _ u reference current 4001 to enter capacitor 4003, thereby generating a voltage across its terminals. Thus, when the input blocks 3409, 3410, 3510, 3520, 3607, and 3617 in fig. 34, 35A, 35B, and 36 receive digital data (instead of pulses or analog currents) as inputs, a digital data-to-voltage converter 4000 may be used in these input blocks. Furthermore, the digital data-to-voltage converter 4000 may be configured such that digital data received at the input as the signal DIN is fed directly to the output OUT by opening switches 4002 and 4004 and closing switch 4005. Switches 4002, 4004, and 4005 are thus configured to enable output OUT to receive the voltage on capacitor 4003 or to directly receive digital data received as signal DIN. In the illustrated embodiment, the signal DIN is received as a data pulse.

The digital data-to-voltage pulse converter 4000 includes an adjustable reference current 4001, a switch 4002, a variable capacitor 4003, a switch 4004, and a switch 4005. The adjustable reference current 4001 and the variable capacitor 4003 may be configured with different values to adjust for differences in the size of the array to which digital data-to-voltage pulse converter 4000 is attached. During operation, the digital data controls switch 4002 such that switch 4002 is closed whenever the digital data is high. While the switch is closed, the adjustable reference current 4001 charges the variable capacitor 4003. Switch 4004 is closed whenever an output is to be provided at node OUT, such as when the array is ready to be read. Alternatively, switch 4004 may be opened and switch 4005 closed, so that the data input passes directly through as the output.

Fig. 41 shows a configurable analog-to-digital converter 4100, which optionally can be used to convert analog neuron currents into digital data. Configurable analog-to-digital converter 4100 may be used in output blocks (such as output blocks 3411, 3412, 3608, 3609, 3618, and 3619 in figs. 34 and 36) in which an output neuron current (INEU 4101) is the output current received by the output block.

Configurable analog-to-digital converter 4100 includes a current source 4101, a variable resistor 4102, and an analog-to-digital converter 4103. The current INEU 4101 drops across variable resistor 4102 (Rneu) to generate a voltage Vneu = INEU x Rneu. ADC 4103 (such as, but not limited to, an integrating ADC, SAR ADC, flash ADC, or sigma-delta ADC) converts this voltage into digital bits.

Fig. 42 shows a configurable current-to-voltage converter 4200 that optionally can be used to convert analog neuron currents into voltages that may be applied as inputs to the VMM memory array (e.g., on WL or CG lines). Thus, when input blocks 3409, 3410, 3510, 3520, 3607, and 3617 in figs. 34, 35A, 35B, and 36 receive analog currents (rather than pulses or digital data) as inputs, configurable current-to-voltage converter 4200 may be used in these input blocks. Configurable current-to-voltage converter 4200 includes an adjustable resistor Rin 4202, receives an input current Iin 4201 (the received input current), and generates Vin 4203 = Iin x Rin.

Fig. 43A and 43B show a digital bit-to-pulse width converter 4300 to be used within an input block, a row decoder, or an output block. The pulse width output from the digital bit to pulse width converter 4300 is proportional to the value of the digital bits.

The digital bit-to-pulse width converter includes a binary counter 4301. The state Q [ N:0] of the binary counter 4301 may be loaded by serial or parallel data in a load sequence. Row control logic 4310 outputs a voltage pulse WLEN having a pulse width proportional to the value of the digital data input provided from a block such as the integrating ADC in fig. 37.

Fig. 43B shows a waveform of an output pulse width in which the width is proportional to the digital bit value. First, the received digital bits (data in) are inverted, and the inverted digital bits are loaded serially or in parallel into counter 4301. The row pulse width is then generated by row control logic 4310 by counting up in binary fashion until the maximum counter value is reached, as shown by waveform 4320.

An example of using a 4-bit value for DIN is shown in table 9:

table 9: digital input bit to output pulse width

DIN<3:0> | Count | Inverted DIN<3:0> loaded into counter | Output pulse width (# clks)
0000     |   0   | 1111                                  |  0
0001     |   1   | 1110                                  |  1
0010     |   2   | 1101                                  |  2
0011     |   3   | 1100                                  |  3
0100     |   4   | 1011                                  |  4
0101     |   5   | 1010                                  |  5
0110     |   6   | 1001                                  |  6
0111     |   7   | 1000                                  |  7
1000     |   8   | 0111                                  |  8
1001     |   9   | 0110                                  |  9
1010     |  10   | 0101                                  | 10
1011     |  11   | 0100                                  | 11
1100     |  12   | 0011                                  | 12
1101     |  13   | 0010                                  | 13
1110     |  14   | 0001                                  | 14
1111     |  15   | 0000                                  | 15
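The Table 9 scheme above can be sketched as follows; the code mirrors the invert-load-and-count-up behavior and reproduces the table rows:

```python
# Sketch of the Table 9 scheme: the 4-bit input is inverted, loaded into an
# up-counter, and the pulse stays high until the counter reaches its maximum
# value, so the pulse width in clocks equals the original input value.

def pulse_width_clks(din, n_bits=4):
    max_count = (1 << n_bits) - 1
    counter = din ^ max_count        # load the inverted bits
    clks = 0
    while counter < max_count:       # count up until the maximum value
        counter += 1
        clks += 1
    return clks

# Reproduces the Table 9 rows, e.g. DIN = 0101 -> 5 clocks:
print(pulse_width_clks(0b0101))  # 5
print(pulse_width_clks(0b1111))  # 15
```

Inverting before loading is what makes the distance to the maximum count equal to DIN itself, which is the whole trick of this converter.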

Optionally, a pulse train-to-pulse converter may be used to convert an output comprising a pulse train serving as an input to the VMM array to be applied to a word line or control gate within the VMM array into a single pulse of a width that varies in proportion to the number of pulses in the pulse train. An example of a pulse train-to-pulse converter is a binary counter with control logic.

Another embodiment utilizes an up binary counter and digital comparison logic. That is, the output pulse width is generated by counting up the binary counter until the digital output of the binary counter is the same as the digital input bit.

Another embodiment utilizes a down binary counter. First, the down binary counter is loaded with the digital data input pattern serially or in parallel. Next, an output pulse width is generated by counting down a down binary counter until the digital output of the binary counter reaches a minimum value (i.e., a "0" logic state).

Fig. 44A shows a digital data-to-pulse row converter 4400 comprising binary index pulse stages 4401-i, where i ranges from 0 to N (i.e., least significant bit LSB to most significant bit MSB). Row converter 4400 is used to provide row inputs to the array. Each stage 4401-i includes a latch 4402-i, a switch 4403-i, and a row digital binary index pulse input 4404-i (RDIN_Ti). For example, binary index pulse input 4404-0 (RDIN_T0) has a pulse width equal to one time unit (i.e., 1 x tpls1 unit); binary index pulse input 4404-1 (RDIN_T1) has a width equal to two time units (i.e., 2 x tpls1 units); binary index pulse input 4404-2 (RDIN_T2) has a width equal to four time units (i.e., 4 x tpls1 units); and binary index pulse input 4404-3 (RDIN_T3) has a width equal to eight time units (i.e., 8 x tpls1 units). The digital data (from the neuron outputs) in pattern DINi for each row is stored in latch 4402-i. If the output Qi of latch 4402-i is "1", it transfers binary index pulse input 4404-i (RDIN_Ti) to time-summing converter node 4408 through switch 4403-i. Time-summing converter node 4408 is connected to a respective input of NAND gate 4404, and the output of NAND gate 4404 generates the output of the row converter, WLIN/CGIN 4409, through level-shifting inverter 4405. Time-summing converter node 4408 sums the binary index pulse inputs 4404-i sequentially in time in response to a common clock control signal CLK, in that the binary index pulse inputs 4404-i (RDIN_Ti) are enabled one digital bit at a time in a sequential manner, such as from LSB to MSB, from MSB to LSB, or in any random bit order.

Fig. 44B illustrates an exemplary waveform 4420. Shown are exemplary signals for the row digital binary index pulse inputs 4404-i (specifically 4404-0, 4404-1, 4404-2, and 4404-3), and exemplary outputs from level-shifting inverters 4405, labeled WL0 and WL3, where WL0 and WL3 are generated by the row converter 4400 circuitry. In this example, WL0 is generated from the row stages whose latch outputs are asserted for that row (WL0: Q0 = "1", Q3 = "1"), and WL3 is generated from the row stages whose latch outputs are asserted for that row (WL3: Q1 = "1", Q2 = "1"). If none of the latch outputs for a row is asserted, there are no pulses on that row's word line, such as WL0 or WL3 (the control logic for this case is not shown in fig. 44A). It is assumed that the inputs from the other rows of digital data-to-pulse row converter 4400 (i.e., the other inputs of NAND gate 4404) are high during this cycle.
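The summation of binary-index pulses can be sketched as follows, assuming each asserted latch output Qi contributes a pulse of 2^i time units (with tpls1 = 1 unit); this reproduces the WL0 and WL3 examples above:

```python
# Sketch of the binary-index pulse summation of fig. 44A: each latched "1"
# bit passes its binary-weighted pulse (2^i time units) to the time-summing
# node, so the total row pulse time equals the digital value in tpls1 units.

def row_pulse_units(latch_outputs):
    # latch_outputs[i] is the latch output Qi for stage i (LSB first)
    return sum((1 << i) for i, q in enumerate(latch_outputs) if q == 1)

# Q0 = 1, Q3 = 1 (the WL0 example): 1 + 8 = 9 time units
print(row_pulse_units([1, 0, 0, 1]))  # 9
# Q1 = 1, Q2 = 1 (the WL3 example): 2 + 4 = 6 time units
print(row_pulse_units([0, 1, 1, 0]))  # 6
```

Because the binary-weighted pulses are enabled one bit at a time on the shared node, the total on-time per row equals the row's digital input value times one time unit.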

Fig. 44C shows a row digital pulse generator 4410 that generates the row digital binary index pulse inputs 4404-i (RDIN_Ti), wherein the width of each pulse is proportional to the binary weight of its digital bit, as described above with respect to fig. 44A.

Fig. 45A shows a ramp analog-to-digital converter 4400, including a current source 4401 (which represents the received neuron current Ineu), a switch 4402, a variable configurable capacitor 4403, and a comparator 4404 that receives the voltage developed across variable configurable capacitor 4403 (denoted Vneu) as its non-inverting input and a configurable reference voltage Vreframp as its inverting input, and generates output Cout. Vreframp ramps up in discrete levels with each comparison clock cycle. Comparator 4404 compares Vneu with Vreframp; the output Cout will be "1" when Vneu > Vreframp and "0" otherwise. The output Cout is thus a pulse whose width varies in response to Ineu: a larger Ineu keeps Cout at "1" for a longer period of time, resulting in a wider pulse. Digital counter 4420 converts each pulse of output Cout into digital output bits, as shown in fig. 45B for two different Ineu currents (denoted OT1A and OT2A). Alternatively, the ramp voltage Vreframp may be a continuous ramp voltage 4455, as shown in graph 4450 of fig. 45B. A multi-ramp embodiment that reduces the conversion time by using a coarse-fine ramp conversion algorithm is shown in fig. 45C. First, a coarse reference ramp voltage 4471 ramps in a fast manner to find the sub-range for each Ineu. Next, a fine reference ramp voltage 4472 (i.e., Vreframp1 or Vreframp2) is used within the respective sub-range to convert the Ineu current. As shown, there are two sub-ranges of fine reference ramp voltages; more than two coarse/fine steps or more than two sub-ranges are possible.
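The coarse-fine ramp algorithm can be illustrated with a behavioural sketch; the sub-range counts, voltage range, and step sizes below are illustrative assumptions, not values from the source:

```python
# Behavioural sketch of coarse-fine ramp conversion: a coarse ramp first finds
# the sub-range containing Vneu, then a fine ramp sweeps only that sub-range.
# Assumed: 1.0 V full scale, 4 coarse sub-ranges, 16 fine steps per sub-range.

def coarse_fine_convert(vneu, vmax=1.0, coarse_steps=4, fine_steps=16):
    coarse_lsb = vmax / coarse_steps
    sub = min(int(vneu / coarse_lsb), coarse_steps - 1)    # coarse ramp result
    fine_lsb = coarse_lsb / fine_steps
    fine = min(int((vneu - sub * coarse_lsb) / fine_lsb), fine_steps - 1)
    return sub * fine_steps + fine                         # 6-bit code here

print(coarse_fine_convert(0.5))   # 32 (start of the third sub-range)
```

With these assumed parameters, conversion takes at most 4 coarse plus 16 fine comparisons (20) instead of the 64 a single full-resolution ramp would need, which is the speed advantage the text describes.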

Fig. 52 shows a comparator 5200 that optionally may replace comparators 3704 and 4404 in figs. 37A and 45A. Comparator 5200 may be a static comparator (which does not require a clock signal) or a dynamic comparator (which utilizes a comparison clock signal). If comparator 5200 is a dynamic comparator, it may comprise a clocked cross-coupled inverter comparator, a StrongARM comparator, or another known dynamic comparator. Comparator 5200 operates as a coarse comparator when coarse enable 5203 is asserted and as a fine comparator when fine enable 5204 is asserted. Select signal 5206 optionally may be used to indicate coarse comparator mode or fine comparator mode, or it optionally may be used to configure comparator 5200 to operate as a static comparator or a dynamic comparator. When comparator 5200 acts as a dynamic comparator, it receives clock signal 5205. When operating as a dynamic comparator, comparison clock signal 5205 is a first clock signal of a first frequency when the comparator is a coarse comparator, and a second clock signal of a second, higher frequency when the comparator is a fine comparator. When operating as a coarse comparator, comparator 5200 has lower accuracy and slower speed, but uses less power, than when operating as a fine comparator. Thus, during a conversion ramping period, the dynamic comparator used for coarse comparison may utilize a slow comparison clock, while the dynamic comparator used for fine comparison may utilize a fast comparison clock.

Comparator 5200 compares array output 5201 to a reference voltage 5202, as was the case with comparators 3704 and 4404 in fig. 37A and 45A, and generates output 5205. When comparator 5200 operates as a coarse comparator, reference voltage 5202 may be an offset voltage.

During conversion cycles in which digital output bits are generated, such as those shown in figs. 37B and 45B/45C, comparator 5200 may act as a coarse comparator and as a fine comparator during coarse and fine comparison cycles, respectively. At the start of the digital output bit conversion, a fine compare cycle, or a hybrid coarse-fine (coarse and fine in parallel) compare cycle, is performed within a fixed time period. Next, a coarse compare cycle is performed, followed by a final fine compare to complete the conversion.

Fig. 46 shows an algorithmic analog-to-digital output converter 4600 including switch 4601, switch 4602, a sample-and-hold (S/H) circuit 4603, a 1-bit analog-to-digital converter (ADC) 4604, a 1-bit digital-to-analog converter (DAC) 4605, an adder 4606, and a gain-of-two residue operational amplifier (2x opamp) 4607. Algorithmic analog-to-digital output converter 4600 generates a converted digital output 4608 in response to an analog input Vin and control signals applied to switches 4601 and 4602. The input received at analog input Vin (e.g., Vneu in fig. 45A) is first sampled by S/H circuit 4603 through switch 4602, and the conversion then proceeds for N bits over N clock cycles. In each conversion clock cycle, the 1-bit ADC 4604 compares the S/H voltage 4609 with a reference voltage (e.g., VREF/2, where VREF is the N-bit full-scale voltage) and outputs a digital bit (e.g., "0" if the input is <= VREF/2 and "1" if the input is > VREF/2). This digital bit, which forms digital output signal 4608, is then converted back to an analog voltage (e.g., to VREF/2 or 0 V) by the 1-bit DAC 4605 and fed to adder 4606 to be subtracted from the S/H voltage 4609. The 2x residue opamp 4607 then amplifies the adder's difference voltage to produce a conversion residue voltage 4610, which is fed through switch 4601 to S/H circuit 4603 for the next clock cycle. Instead of this 1-bit (i.e., 2-level) algorithmic ADC, a 1.5-bit (i.e., 3-level) algorithmic ADC may be used to reduce the effect of offsets, such as those of ADC 4604 and residue opamp 4607; a 1.5-bit or 2-bit (i.e., 4-level) DAC is required for the 1.5-bit algorithmic ADC.
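The 1-bit algorithmic conversion can be modeled as follows, assuming ideal comparator, DAC, and residue amplifier (in this sketch a tie at exactly VREF/2 resolves to "1"):

```python
# Behavioural sketch of the 1-bit algorithmic (cyclic) conversion: each clock
# cycle compares the held voltage with VREF/2, subtracts the DAC voltage, and
# doubles the residue for the next cycle. Ideal components are assumed.

def algorithmic_adc(vin, vref, n_bits):
    bits = []
    v = vin
    for _ in range(n_bits):
        if v >= vref / 2:
            bits.append(1)
            v = 2 * (v - vref / 2)   # subtract VREF/2, then gain of 2
        else:
            bits.append(0)
            v = 2 * v                # subtract 0, then gain of 2
    return bits                      # MSB first

print(algorithmic_adc(0.8125, 1.0, 4))  # [1, 1, 0, 1], i.e. 13/16 = 0.8125
```

The same hardware (one comparator, one DAC, one amplifier) is reused N times, which is why the algorithmic ADC trades conversion time for area.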

Fig. 47A shows a Successive Approximation Register (SAR) analog-to-digital converter 4700 applied to an output neuron to convert a cell current representing the output neuron into digital output bits. SAR ADC 4700 includes SAR 4701, digital-to-analog converter 4702, and comparator 4703. The cell current may be dropped across a resistor to generate a voltage VCELL, which is applied to the inverting input of comparator 4703. Alternatively, the cell current may charge a sample-and-hold capacitor to generate the voltage VCELL (such as Vneu as shown in fig. 45A). SAR 4701 then uses a binary search to compute each bit, starting from the MSB (most significant bit) down to the LSB (least significant bit). Based on the digital bits (DN-D0) from SAR 4701, DAC 4702 sets the appropriate analog reference voltage for comparator 4703. The output of comparator 4703 is then fed back to SAR 4701 to select the next analog level of the reference voltage for comparator 4703. As shown in fig. 47B, for the example of a 4-bit digital output, there are 4 evaluation periods: a first pulse evaluates DOUT3 by setting the analog reference voltage of comparator 4703 at the midpoint of the range, followed by a second pulse that evaluates DOUT2 by setting the analog reference voltage of comparator 4703 either midway between the midpoint and the maximum of the range or midway between the midpoint and the minimum of the range, depending on the result of the first comparison. This is followed by additional steps, each of which further refines the analog reference voltage level of comparator 4703. The successive outputs of SAR 4701 are the output digital bits. An alternative SAR ADC circuit is a switched-capacitor (SC) circuit with only one reference level and a local SC ratio that successively generates ratioed reference levels for the successive comparisons.
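The binary search performed by SAR 4701 can be sketched as follows. This is a behavioral model only: the DAC is modeled as an ideal ratio of VREF, and the component numbers in the comments refer to fig. 47A.

```python
def sar_adc(vcell, vref, n_bits):
    """Behavioral sketch of SAR ADC 4700: each trial sets the bit under
    test, DAC 4702 produces the corresponding reference level, and
    comparator 4703 decides whether the bit is kept or cleared."""
    code = 0
    for i in range(n_bits - 1, -1, -1):        # MSB first, down to LSB
        trial = code | (1 << i)                # SAR 4701 sets the trial bit
        vdac = vref * trial / (1 << n_bits)    # DAC 4702 reference level
        if vcell >= vdac:                      # comparator 4703 decision
            code = trial                       # keep the bit
    return code

# e.g. VCELL = 0.4 V, VREF = 1.0 V, 4 bits -> code 6 (6/16 = 0.375 <= 0.4)
```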

Fig. 48 shows a sigma-delta type analog-to-digital converter 4800 applied to an output neuron to convert a cell current 4806 (ICELL or Ineu) into digital output bits 4807. An integrator comprising an operational amplifier 4801 and a configurable capacitor 4805 (Cint) integrates the sum of the cell current 4806 and a configurable reference current generated by a 1-bit current DAC 4804, which converts the digital output 4807 into a current. The comparator 4802 compares the integrated output voltage Vint from the operational amplifier 4801 to a reference voltage VREF2, and the output of the comparator 4802 is fed to the D input of the clocked DFF 4803. The clocked DFF 4803 provides a digital output stream 4807 in response to the output of the comparator 4802. The digital output stream 4807 may be fed to a digital filter before being output as digital output bits 4807. The clock period of the clocked DFF 4803 may be configured for different Ineu ranges.
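The loop behavior can be sketched as a first-order modulator: the integrator accumulates the difference between the cell current and the 1-bit DAC feedback, and the ones density of the output stream encodes the ratio Icell/Iref. This is an idealized sketch (currents as floats, VREF2 taken as 0); the component numbers in the comments refer to fig. 48.

```python
def sigma_delta_adc(icell, iref, n_cycles):
    """Behavioral sketch of converter 4800: per clock, integrate
    (Icell - feedback), compare against the reference, and feed the
    decision back through the 1-bit current DAC."""
    vint = 0.0       # integrator 4801/4805 state
    feedback = 0.0   # 1-bit current DAC 4804 output
    stream = []
    for _ in range(n_cycles):
        vint += icell - feedback          # integrator accumulates difference
        bit = 1 if vint > 0.0 else 0      # comparator 4802 + clocked DFF 4803
        feedback = iref if bit else 0.0   # DAC 4804 converts bit to current
        stream.append(bit)
    return stream

# Icell/Iref = 0.25 -> ones density of the stream is 1/4
```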

The calibration methods 4900, 5010, 5020, and 5100 will now be discussed with reference to fig. 49, 50A, 50B, and 51, respectively. Methods 4900, 5010, 5020, and 5100 compensate for leakage and/or offset. The leakage may include one or more of array leakage and circuit leakage. The array leakage may include one or more of memory cell leakage and leakage from one or more of the decoding circuitry and the column write circuitry. The offset may include one or more of an array offset and a circuit offset. The array offset may include an offset from array variations due to one or more of memory cell capacitance and cell junction capacitance. The circuit offset may comprise an offset from one or more of the decoding circuit and the column write circuit.

Fig. 49 illustrates a calibration method 4900 for compensating for leakage and/or offset. A leakage and/or offset calibration step is performed (step 4901). The leakage and/or offset is measured and the measured quantity is stored as leakage_value and/or offset_value (step 4902). The LSB is determined using the following formula: LSB = leakage_value and/or offset_value + deltaLmin (step 4903). Optionally, deltaLmin is a current value that compensates for variations between levels due to process, temperature, noise, or usage degradation and ensures that the separation between levels is sufficient. deltaLmin may optionally be determined from sample data characterization. The MSB is determined using the following formula: MSB = LSB + (N-1) × deltaL, where N is the number of levels and deltaL is an incremental level amount equal to the average or ideal difference between two consecutive levels (step 4904). In one embodiment, deltaL is equal to the LSB. In another embodiment, deltaL is determined from sample data characterization. deltaL may have a uniform or non-uniform value for different pairs of consecutive levels.
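Steps 4902-4904 reduce to two arithmetic formulas, sketched below. Values are currents in pA; in practice deltaLmin and deltaL would come from sample data characterization, as the method describes.

```python
def calibrate_levels(measured, delta_l_min, delta_l, n_levels):
    """Sketch of method 4900: the measured leakage and/or offset plus the
    safety margin deltaLmin sets the LSB level; the MSB level is then
    LSB + (N-1) * deltaL for N levels."""
    lsb = measured + delta_l_min            # step 4903
    msb = lsb + (n_levels - 1) * delta_l    # step 4904
    return lsb, msb

# e.g. measured = 200 pA, deltaLmin = 300 pA, deltaL = 400 pA, N = 16
# -> LSB = 500 pA, MSB = 500 + 15 * 400 = 6500 pA
```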

For example, for a 6-bit memory cell, there are 64 current levels, where each level is associated with a weight in the neural network application, so N = 64. During this step of calibration, as well as during the measurement step, a minimum offset current may be injected to produce a baseline value.

Table 10 contains exemplary values for a 4-bit cell:

Table 10: exemplary levels for a 4-bit cell (16 levels)

Level number | Ideal level | Adjusted level for offset/leakage
      0      |      0      |      -
      1      |      1      |      -
      2      |      2      |      -
      3      |      3      |      3
      4      |      4      |      4
      5      |      5      |      5
      6      |      6      |      6
      7      |      7      |      7
      8      |      8      |      8
      9      |      9      |      9
     10      |     10      |     10
     11      |     11      |     11
     12      |     12      |     12
     13      |     13      |     13
     14      |     14      |     14
     15      |     15      |     15
     16      |      -      |     16
     17      |      -      |     17
     18      |      -      |     18

Fig. 50A and 50B illustrate a calibration method 5000 that includes one or more of a real-time calibration method 5010 and a background calibration method 5020.

In the real-time calibration method 5010, leakage and/or offset calibration is performed, including measuring the leakage and/or offset and storing the measured values as leakage_value and/or offset_value (step 5011). The LSB is determined using the following formula: LSB = leakage_value and/or offset_value + deltaLmin (step 5012). The MSB is determined using the following formula: MSB = LSB + (N-1) × deltaL, where N is the number of levels (step 5013). The description of deltaLmin and deltaL with respect to fig. 49 also applies to fig. 50A. Example values are as follows: leakage and/or offset = 200 pA, deltaLmin = 300 pA, LSB = 200 pA + 300 pA = 500 pA, deltaL = 400 pA, N = 16, MSB = 500 pA + (16-1) × 400 pA = 6500 pA.

In background calibration method 5020, the offset_value and/or leakage_value + temperature data is stored in a fuse (e.g., a look-up table of offset and/or leakage versus temperature) (step 5021). This is done once or periodically in a background calibration step. The offset_value and/or leakage_value + temperature data is recalled (step 5022). A temperature adjustment of the offset_value and/or leakage_value is performed according to the look-up table or by device transistor equations (step 5023). The LSB is then determined using the following formula: LSB = offset_value and/or leakage_value + deltaLmin (step 5024). The MSB is then determined using the following formula: MSB = LSB + (N-1) × deltaL (step 5025). The description of deltaLmin and deltaL with respect to fig. 49 also applies to fig. 50B. The temperature adjustment may be accomplished by a look-up table or extrapolated from device equations (e.g., subthreshold, linear, or saturation equations).
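The look-up-table variant can be sketched as follows. The table entries, the nearest-temperature selection, and all numeric values here are illustrative assumptions, not from the source; the method equally allows interpolation or device-equation extrapolation.

```python
def background_calibrate(temp_c, lut, delta_l_min, delta_l, n_levels):
    """Sketch of method 5020: a fuse-stored look-up table maps temperature
    to the stored offset/leakage value (steps 5021-5023); LSB and MSB then
    follow the same formulas as method 4900 (steps 5024-5025)."""
    # recall the LUT entry nearest to the operating temperature (hypothetical
    # nearest-neighbor policy; the source only requires a temperature adjustment)
    _temp, measured = min(lut, key=lambda entry: abs(entry[0] - temp_c))
    lsb = measured + delta_l_min              # step 5024
    msb = lsb + (n_levels - 1) * delta_l      # step 5025
    return lsb, msb

# illustrative fuse table: (temperature in C, measured leakage/offset in pA)
fuse_lut = [(25, 200), (85, 350)]
```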

Fig. 51A shows a calibration and conversion method 5100 with automatic leakage and/or offset cancellation. A leak and/or offset calibration is performed (step 5101). The leakage and/or offset is measured, such as by ADC conversion, and the measured digital output is stored in a counter (step 5102). The conversion of the neuron output is enabled and a countdown is performed in the counter until the counter reaches zero (which compensates for the leakage and/or offset initially stored in the counter) and then an up-count is performed on the digital output bits (step 5103).

Fig. 51B shows a calibration and conversion method 5110 with automatic leakage and/or offset cancellation, which is a variation of method 5100. A leak and/or offset calibration is performed (step 5111). The leakage and/or offset is measured, such as by ADC conversion, and the measured digital output is stored in a register (step 5112). The conversion of the neuron output is enabled and an incrementing count is performed on the digital output bits and then the stored digital output is subtracted (step 5113).
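The two cancellation schemes can be sketched numerically as follows. The function name and count values are illustrative; the point is that preloading the counter (method 5100) and subtracting the stored register afterwards (method 5110) give the same compensated result.

```python
def convert_with_cancellation(neuron_counts, leak_counts):
    """Behavioral sketch of methods 5100 and 5110: leak_counts is the ADC
    result for the leakage and/or offset alone, neuron_counts the raw
    neuron conversion result."""
    # method 5100: the counter first counts down through the stored
    # leakage/offset value, then counts up on the digital output bits
    counter = -leak_counts
    counter += neuron_counts
    # method 5110: count up on the digital output bits first, then
    # subtract the stored register value
    register = neuron_counts - leak_counts
    assert counter == register  # both schemes yield the same output
    return counter
```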

It should be noted that as used herein, the terms "above …" and "above …" both inclusively encompass "directly on …" (with no intermediate material, element, or space disposed therebetween) and "indirectly on …" (with intermediate material, element, or space disposed therebetween). Similarly, the term "adjacent" includes "directly adjacent" (no intermediate material, element, or space disposed therebetween) and "indirectly adjacent" (intermediate material, element, or space disposed therebetween), "mounted to" includes "directly mounted to" (no intermediate material, element, or space disposed therebetween) and "indirectly mounted to" (intermediate material, element, or space disposed therebetween), and "electrically coupled to" includes "directly electrically coupled to" (no intermediate material or element therebetween that electrically connects the elements together) and "indirectly electrically coupled to" (intermediate material or element therebetween that electrically connects the elements together). For example, forming an element "over a substrate" can include forming the element directly on the substrate with no intervening materials/elements therebetween, as well as forming the element indirectly on the substrate with one or more intervening materials/elements therebetween.
