Accelerated deep learning apparatus, method and system

Document No.: 1964397    Publication date: 2021-12-14

Note: This invention, Accelerated Deep Learning Apparatus, Method and System, was created on 2018-02-23 by Sean Lie, Michael Morrison, Michael Edwin James, Gary R. Lauterbach, and Srikanth Arekapudi. Abstract: The application relates to an apparatus, method, and system for accelerated deep learning. Techniques in advanced deep learning improve one or more of accuracy, performance, and energy efficiency, e.g., learning accuracy, prediction accuracy, learning speed, learning performance, and learning energy efficiency. An array of processing elements performs flow-based computations on wavelet data. Each processing element has a respective computing element and a respective routing element. Each computing element has processing resources and memory resources. Each router is capable of communicating via wavelets with at least its nearest neighbors in a 2D mesh. Stochastic gradient descent, mini-batch gradient descent, and continuous propagation gradient descent are techniques that may be used to train the weights of a neural network modeled by the processing elements. Reverse checkpointing may be used to reduce memory usage during training.

1. An accelerated deep learning device comprising:

a plurality of processing elements;

a training workload comprising a set of machine code selected from a predefined set of native instruction code for performing training of a neural network comprising a plurality of ordered connected layers;

wherein the order identifies for each respective layer which other layers precede the respective layer and which other layers follow the respective layer;

wherein each layer includes one or more neurons, each neuron including a weight and connected to at least one of at least one preceding neuron of a preceding layer and at least one subsequent neuron of a succeeding layer; and

wherein each neuron is implemented by one or more of the processing elements, each processing element comprising:

at least one coupling to a fabric via which the processing element is capable of communicating via a plurality of virtual channels,

a first memory capable of storing instructions corresponding to at least the calculations of the respective neuron,

a second memory capable of storing weights of the corresponding neuron, and

a compute engine capable of executing a predefined set of base operations in response to receiving a corresponding base instruction selected from the predefined set of native instruction code, the executing comprising executing instructions from the respective first memory and accessing data from the respective second memory.

2. The apparatus of claim 1, wherein each of the layers is a respective inner layer of the neural network, and the neural network further comprises an input layer and an output layer.

3. The apparatus of claim 1, wherein the training workload comprises a respective set of machine code to:

determining a second activation based on a first activation and a first weight;

determining and saving a second weight based on a first increment and the first weight;

determining a fourth activation based on a third activation and a selected weight, wherein the selected weight is dynamically selected from the first weight and the second weight; and

determining and saving a third weight based on a second increment and the selected weight.

4. The apparatus of claim 3, wherein the selected weight is dynamically selected according to which of the first weight and the second weight was most recently stored.

5. The device of claim 3, wherein the determination of the second activation comprises:

receiving the first activation from the at least one preceding neuron via the fabric;

calculating the second activation based at least in part on the first activation and the first weight by at least executing a first instruction stored in the first memory and accessing the first weight in the second memory; and

selectively transmitting the second activation to the at least one subsequent neuron via the fabric.

6. The device of claim 5, wherein the determination of the fourth activation comprises:

receiving the third activation from the at least one preceding neuron via the fabric;

calculating the fourth activation based at least in part on the third activation and the selected weight by executing at least the first instruction and accessing the selected weight in the second memory; and

selectively transmitting the fourth activation to the at least one subsequent neuron via the fabric.

7. The device of claim 6, wherein the determination and preservation of the second weight comprises:

receiving, via the fabric, the first increment from the at least one subsequent neuron based in part on the second activation;

calculating a first gradient based at least in part on the first increment and the second activation by executing at least a second instruction stored in the first memory;

calculating the second weight based at least in part on the first gradient, a learning rule, and the first weight by at least executing a third instruction stored in the first memory and accessing the first weight in the second memory; and

storing the second weight in the second memory.

8. The device of claim 7, wherein the determination and preservation of the third weight comprises:

receiving, via the fabric, the second increment from the at least one subsequent neuron based in part on the fourth activation;

calculating a second gradient based at least in part on the second increment and the fourth activation by executing at least the second instruction stored in the first memory;

calculating the third weight based at least in part on the second gradient, the learning rule, and the selected weight by at least executing the third instruction stored in the first memory and accessing the selected weight in the second memory; and

storing the third weight in the second memory.

9. The apparatus of claim 8, wherein the calculation of the second gradient further comprises optionally recalculating the fourth activation based at least in part on the selected weight.

10. The apparatus of claim 8, wherein the calculation of the first gradient further comprises optionally recalculating the second activation based at least in part on the first weight.

11. The apparatus of claim 1, wherein the apparatus is implemented via an entire wafer including the processing elements.

12. A method of accelerated deep learning, comprising:

training a neural network comprising a plurality of ordered connected layers;

wherein the order identifies for each respective layer which other layers precede the respective layer and which other layers follow the respective layer;

wherein each layer includes one or more neurons, each neuron including a weight and connected to at least one of at least one preceding neuron of a preceding layer and at least one subsequent neuron of a succeeding layer; and

wherein each neuron is implemented by one or more processing elements, each processing element comprising:

at least one coupling to a fabric via which the processing element is capable of communicating via a plurality of virtual channels,

a first memory capable of storing instructions corresponding to at least the calculations of the respective neuron,

a second memory capable of storing weights of the corresponding neuron, and

hardware execution resources capable of executing instructions from the respective first memory and accessing data from the respective second memory.

13. The method of claim 12, wherein each of the layers is a respective inner layer of the neural network, and the neural network further comprises an input layer and an output layer.

14. The method of claim 12, wherein the training comprises:

determining a second activation based on a first activation and a first weight;

determining and saving a second weight based on a first increment and the first weight;

determining a fourth activation based on a third activation and a selected weight, wherein the selected weight is dynamically selected from the first weight and the second weight; and

determining and saving a third weight based on a second increment and the selected weight.

15. The method of claim 14, wherein the selected weight is dynamically selected according to which of the first weight and the second weight was most recently stored.

16. The method of claim 14, wherein the determination of the second activation comprises:

receiving the first activation from the at least one preceding neuron via the fabric;

calculating the second activation based at least in part on the first activation and the first weight by at least executing a first instruction stored in the first memory and accessing the first weight in the second memory; and

selectively transmitting the second activation to the at least one subsequent neuron via the fabric.

17. The method of claim 16, wherein the determination of the fourth activation comprises:

receiving the third activation from the at least one preceding neuron via the fabric;

calculating the fourth activation based at least in part on the third activation and the selected weight by executing at least the first instruction and accessing the selected weight in the second memory; and

selectively transmitting the fourth activation to the at least one subsequent neuron via the fabric.

18. The method of claim 17, wherein the determining and saving of the second weight comprises:

receiving, via the fabric, the first increment from the at least one subsequent neuron based in part on the second activation;

calculating a first gradient based at least in part on the first increment and the second activation by executing at least a second instruction stored in the first memory;

calculating the second weight based at least in part on the first gradient, a learning rule, and the first weight by at least executing a third instruction stored in the first memory and accessing the first weight in the second memory; and

storing the second weight in the second memory.

19. The method of claim 18, wherein the determining and saving of the third weight comprises:

receiving, via the fabric, the second increment from the at least one subsequent neuron based in part on the fourth activation;

calculating a second gradient based at least in part on the second increment and the fourth activation by executing at least the second instruction stored in the first memory;

calculating the third weight based at least in part on the second gradient, the learning rule, and the selected weight by at least executing the third instruction stored in the first memory and accessing the selected weight in the second memory; and

storing the third weight in the second memory.

20. The method of claim 19, wherein the calculating of the second gradient further comprises optionally recalculating the fourth activation based at least in part on the selected weight.

21. The method of claim 19, wherein the calculation of the first gradient further comprises optionally recalculating the second activation based at least in part on the first weight.

22. The method of claim 12, wherein the method is performed via an entire wafer including the processing elements.

23. A system of accelerated deep learning, comprising:

means for training a neural network, the neural network comprising a plurality of sequentially connected layers;

wherein the order identifies for each respective layer which other layers precede the respective layer and which other layers follow the respective layer;

wherein each layer includes one or more neurons, each neuron including a weight and connected to at least one of at least one preceding neuron of a preceding layer and at least one subsequent neuron of a succeeding layer; and

wherein each neuron is implemented by one or more processing elements, each processing element comprising:

at least one coupling to a fabric via which the processing element is capable of communicating via a plurality of virtual channels,

a first memory capable of storing instructions corresponding to at least the calculations of the respective neuron,

a second memory capable of storing weights of the corresponding neuron, and

hardware execution resources capable of executing instructions from the respective first memory and accessing data from the respective second memory.

24. The system of claim 23, wherein each of the layers is a respective inner layer of the neural network, and the neural network further comprises an input layer and an output layer.

25. The system of claim 23, wherein the means for training comprises:

means for determining a second activation based on the first activation and the first weight;

means for determining and saving a second weight based on a first increment and the first weight;

means for determining a fourth activation based on a third activation and a selected weight, wherein the selected weight is dynamically selected from the first weight and the second weight; and

means for determining and saving a third weight based on a second increment and the selected weight.

26. The system of claim 25, wherein the selected weight is dynamically selected according to which of the first weight and the second weight was most recently stored.

27. The system of claim 25, wherein the means for determining the second activation comprises:

means for receiving the first activation from the at least one preceding neuron via the fabric;

means for calculating the second activation based at least in part on the first activation and the first weight by at least executing a first instruction stored in the first memory and accessing the first weight in the second memory; and

means for selectively transmitting the second activation to the at least one subsequent neuron via the fabric.

28. The system of claim 27, wherein the means for determining the fourth activation comprises:

means for receiving the third activation from the at least one preceding neuron via the fabric;

means for calculating the fourth activation based at least in part on the third activation and the selected weight by at least executing the first instruction and accessing the selected weight in the second memory; and

means for selectively transmitting the fourth activation to the at least one subsequent neuron via the fabric.

29. The system of claim 28, wherein the means for determining and saving the second weight comprises:

means for receiving, via the fabric, the first increment from the at least one subsequent neuron based in part on the second activation;

means for calculating a first gradient based at least in part on the first increment and the second activation, implemented by executing at least a second instruction stored in the first memory;

means for calculating the second weight based at least in part on the first gradient, a learning rule, and the first weight, implemented by executing at least a third instruction stored in the first memory; and

means for storing the second weight in the second memory.

30. The system of claim 29, wherein the means for determining and saving the third weight comprises:

means for receiving, via the fabric, the second increment from the at least one subsequent neuron based in part on the fourth activation;

means for calculating a second gradient based at least in part on the second increment and the fourth activation by executing at least the second instruction stored in the first memory;

means for calculating the third weight based at least in part on the second gradient, the learning rule, and the selected weight by at least executing the third instruction stored in the first memory and accessing the selected weight in the second memory; and

means for storing the third weight in the second memory.

31. The system of claim 30, wherein the means for calculating the second gradient further comprises means for optionally recalculating the fourth activation based at least in part on the selected weight.

32. The system of claim 30, wherein the means for calculating the first gradient further comprises means for optionally recalculating the second activation based at least in part on the first weight.

33. The system of claim 23, wherein an entire wafer comprises the system.

Technical Field

Advances in accelerated deep learning are needed to improve one or more of accuracy, performance, and energy efficiency.

Background

Unless explicitly stated to be public or well known, reference herein to techniques and concepts (including techniques and concepts for context, definition, or comparison purposes) should not be construed as an admission that such techniques and concepts were previously publicly known or otherwise part of the prior art. All references, if any, cited herein, including patents, patent applications, and publications, are hereby incorporated by reference in their entirety for all purposes, whether or not specifically done so.

Disclosure of Invention

The invention can be implemented in numerous ways, e.g., as a process, an article of manufacture, an apparatus, a system, a composition of matter, and a computer readable medium such as a computer readable storage medium (e.g., a medium in an optical and/or magnetic mass storage device such as a disk, or an integrated circuit having non-volatile memory such as flash memory), or a computer network wherein program instructions are sent over optical or electronic communication links. The detailed description provides an exposition of one or more embodiments of the invention that enable improvements in cost, profitability, performance, efficiency, and utility of use in the field identified above. The detailed description includes an introduction to facilitate understanding of the remainder of the detailed description. The introduction includes example embodiments of one or more of systems, methods, articles of manufacture, and computer readable media in accordance with the concepts described herein. As discussed in more detail in the conclusions, the invention encompasses all possible modifications and variations within the scope of the issued claims.

Drawings

FIG. 1 illustrates selected details of an embodiment of a system for neural network training and inference using a deep learning accelerator.

FIG. 2 illustrates selected details of an embodiment of software elements associated with neural network training and inference using a deep learning accelerator.

FIG. 3 illustrates selected details of an embodiment of processing associated with training a neural network and performing inference with the trained neural network, using a deep learning accelerator.

FIG. 4 illustrates selected details of an embodiment of a deep learning accelerator.

FIG. 5 illustrates selected details of an embodiment of a processing element of a deep learning accelerator.

FIG. 6 illustrates selected details of an embodiment of a router of a processing element.

FIG. 7 illustrates selected details of an embodiment of processing associated with a router of a processing element.

FIG. 8 illustrates selected details of an embodiment of a computing element of the processing element.

FIG. 9 illustrates selected details of an embodiment of processing wavelets for task start-up.

FIG. 10 illustrates selected details of an embodiment of instruction processing associated with a computing element of a processing element.

FIG. 11 illustrates selected details of an embodiment of a flow associated with dependency management via epilogue.

FIG. 12 illustrates selected details of an embodiment of a flow associated with activation accumulation and epilogue, followed by partial sum computation and epilogue.

Figure 13A illustrates selected details of an embodiment of a sparse wavelet.

Figure 13B illustrates selected details of an embodiment of a dense wavelet.

Figure 14 illustrates selected details of an embodiment of creating and transmitting a wavelet.

Figure 15A shows selected details of an embodiment of receiving a wavelet.

Figure 15B illustrates selected details of an embodiment of consuming a wavelet.

FIG. 16 illustrates selected details of an embodiment of blocking and unblocking execution of an instruction.

FIG. 17 illustrates selected details of an embodiment of a neural network.

FIG. 18A illustrates selected details of a first embodiment of assigning processing elements to neurons.

FIG. 18B illustrates selected details of a second embodiment of assigning processing elements to neurons.

FIG. 19 illustrates selected details of an embodiment of smearing neurons on a plurality of processing elements.

FIG. 20 illustrates selected details of an embodiment of communicating between portions of segmented neurons.

FIG. 21A illustrates selected details of an embodiment of a fabric input data structure descriptor.

FIG. 21B illustrates selected details of an embodiment of a fabric output data structure descriptor.

FIG. 21C illustrates selected details of an embodiment of a 1D memory vector data structure descriptor.

FIG. 21D illustrates selected details of an embodiment of a 4D memory vector data structure descriptor.

FIG. 21E illustrates selected details of an embodiment of a circular memory buffer data structure descriptor.

FIG. 22A illustrates selected details of an embodiment of a circular memory buffer extended data structure descriptor.

FIG. 22B illustrates selected details of an embodiment of a 4D memory vector extension data structure descriptor.

FIG. 23 illustrates selected details of accessing operands according to a data structure descriptor.

FIG. 24 illustrates selected details of an embodiment of a decoded data structure descriptor.

FIG. 25A illustrates selected details of an embodiment of a multiple operand instruction.

FIG. 25B illustrates selected details of an embodiment of a source non-destination operand instruction.

FIG. 25C illustrates selected details of an embodiment of an immediate instruction.

FIG. 26A illustrates an embodiment of a pipeline flow for stochastic gradient descent (SGD).

FIG. 26B illustrates an embodiment of a pipeline flow for mini-batch gradient descent (MBGD).

FIG. 26C illustrates an embodiment of a pipeline flow for continuous propagation gradient descent (CPGD).

FIG. 26D illustrates an embodiment of a pipeline flow for continuous propagation gradient descent (CPGD) with reverse checkpointing (RCP).

FIGS. 27A-27E illustrate various aspects of forward pass and backward pass embodiments in accordance with SGD, MBGD, CPGD, and RCP processing.

FIG. 28A illustrates a generalized operation of multiplying a matrix (m) by a vector (v).

FIG. 28B illustrates various representations of memory structures used in a forward pass, a delta pass, and a chain pass.

FIG. 29 illustrates an embodiment of tasks used in the forward pass state machine.

List of reference numerals in the figures

Detailed Description

The following provides a detailed description of one or more embodiments of the invention and the accompanying drawings that show selected details of the invention. The invention is described in connection with the embodiments. The embodiments herein are to be considered merely illustrative, and the invention is not to be explicitly limited to any or all embodiments herein, and the invention includes many alternatives, modifications, and equivalents. To avoid monotony in illustration, various word labels (e.g., first, last, some, various, further, other, specific, selected, some, and noteworthy) may be applied to different sets of embodiments; as used herein, such labels are expressly not meant to convey quality or any form of preference or prejudice, but merely to distinguish between different sets for convenience. The order of some of the operations of disclosed processes may be varied within the scope of the invention. Wherever multiple embodiments are used to describe variations in process, system, and/or program instruction features, other embodiments are contemplated in which static and/or dynamic selection of one of multiple operating modes, each corresponding to multiple embodiments, is performed according to predetermined or dynamically determined criteria. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

Introduction

This introduction is included only to facilitate a faster understanding of the detailed description; the invention is not limited to the concepts presented in the introduction (including explicit examples, if any), as any paragraph of the introduction is necessarily an abridged view of the entire subject matter and is not meant to be an exhaustive or limiting description. For example, the introduction that follows provides overview information limited by space and organization to only certain embodiments. Numerous other embodiments, including those to which the claims will ultimately be directed, are discussed throughout the remainder of the description.

In one aspect conceptually related to continuous propagation for accelerated deep learning, techniques in advanced deep learning improve one or more of accuracy, performance, and energy efficiency, e.g., learning accuracy, prediction accuracy, learning speed, learning performance, and learning energy efficiency. An array of processing elements performs flow-based computations on wavelet data. Each processing element has a respective computing element and a respective routing element. Each computing element has processing resources and memory resources. Each router is capable of communicating via wavelets with at least its nearest neighbors in a 2D mesh. Stochastic gradient descent, mini-batch gradient descent, and continuous propagation gradient descent are techniques that may be used to train the weights of a neural network modeled by the processing elements. Reverse checkpointing may be used to reduce memory usage during training.
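The following is a minimal Python sketch contrasting two of the weight-update schedules named above, stochastic and mini-batch gradient descent, on a tiny linear least-squares model; the model, loss, learning rate, and batch size are illustrative assumptions and not the accelerator's machine code.

```python
# Contrast SGD (update per sample) with mini-batch GD (update per batch)
# on a small linear model fit with mean squared error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))            # input samples
y = X @ np.array([1.5, -2.0, 0.5])      # targets from a known weight vector
lr = 0.01

def gradient(w, xb, yb):
    # Gradient of mean squared error for the linear model xb @ w.
    return 2.0 * xb.T @ (xb @ w - yb) / len(yb)

# Stochastic gradient descent: update the weights after every sample.
w_sgd = np.zeros(3)
for xi, yi in zip(X, y):
    w_sgd -= lr * gradient(w_sgd, xi[None, :], np.array([yi]))

# Mini-batch gradient descent: update the weights after every batch of 8.
w_mbgd = np.zeros(3)
for i in range(0, len(X), 8):
    w_mbgd -= lr * gradient(w_mbgd, X[i:i+8], y[i:i+8])

print(w_sgd, w_mbgd)
```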

In one aspect conceptually related to fabric vectors for accelerated deep learning, techniques in advanced deep learning improve one or more of accuracy, performance, and energy efficiency. The array of processing elements performs flow-based computations on wavelet data. Each processing element has a respective computing element and a respective routing element. Each computing element has memory. Each router is capable of communicating via wavelets with at least its nearest neighbors in a 2D mesh. Routing is controlled by a respective virtual channel specifier in each wavelet and routing configuration information in each router. Instructions executed by a compute element include one or more operand specifiers, some of which specify a data structure register storing a data structure descriptor. The data structure descriptor describes an operand as a fabric vector or a memory vector. The data structure descriptor further describes the length of the fabric vector, whether the fabric vector is eligible for micro-threading, and the number of data elements of the fabric vector to receive, transmit, and/or process in parallel. The data structure descriptor further specifies virtual channel and task identification information relating to processing the fabric vector, whether to terminate upon receiving a control wavelet, and whether to mark an outgoing wavelet as a control wavelet.

In one aspect conceptually related to data structure descriptors for accelerated deep learning, techniques in advanced deep learning improve one or more of accuracy, performance, and energy efficiency. The array of processing elements performs flow-based computations on wavelet data. Each processing element has a respective computing element and a respective routing element. Each computing element has memory. Each router is capable of communicating via wavelets with at least its nearest neighbors in a 2D mesh. Routing is controlled by a respective virtual channel specifier in each wavelet and routing configuration information in each router. Instructions executed by a compute element include one or more operand specifiers, some of which specify a data structure register storing a data structure descriptor. The data structure descriptor describes an operand as a fabric vector or a memory vector. The data structure descriptor further describes a memory vector as one of a one-dimensional vector, a four-dimensional vector, or a circular buffer vector. Optionally, the data structure descriptor specifies an extended data structure register storing an extended data structure descriptor. The extended data structure descriptor specifies parameters relating to a four-dimensional vector or a circular buffer vector.
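As a rough illustration of the kind of information such a descriptor might carry, the hypothetical sketch below models a data structure descriptor and an extended descriptor as Python dataclasses; every field name and type here is an assumption for exposition, not the register encoding described in this document.

```python
# Hypothetical model of a data structure descriptor and its optional
# extended descriptor; field names are illustrative only.
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Tuple

class OperandKind(Enum):
    FABRIC_VECTOR = "fabric"
    MEMORY_1D = "mem_1d"
    MEMORY_4D = "mem_4d"
    CIRCULAR_BUFFER = "circular"

@dataclass
class DataStructureDescriptor:
    kind: OperandKind
    length: int                            # number of data elements
    microthread_eligible: bool = False     # fabric vectors only
    virtual_channel: Optional[int] = None  # fabric vectors only
    extended_dsr: Optional[int] = None     # 4D / circular-buffer vectors point
                                           # at an extended descriptor register

@dataclass
class ExtendedDescriptor:
    # Parameters for 4D or circular-buffer vectors, e.g. per-dimension
    # strides or buffer start/end addresses.
    strides: Tuple[int, ...] = field(default_factory=tuple)
    base_address: int = 0
    end_address: int = 0

desc = DataStructureDescriptor(OperandKind.MEMORY_4D, length=1024, extended_dsr=2)
print(desc)
```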

In one aspect conceptually related to neuron smearing for accelerated deep learning, techniques in advanced deep learning improve one or more of accuracy, performance, and energy efficiency. The array of processing elements performs flow-based computations on wavelet data. Each processing element has a respective computing element and a respective routing element. Each computing element has memory. Each router is capable of communicating via wavelets with at least its nearest neighbors in a 2D mesh. Routing is controlled by a respective virtual channel specifier in each wavelet and routing configuration information in each router. At least a first single neuron is implemented using resources of a plurality of the processing elements of the array. At least a portion of a second neuron is implemented using resources of one or more of the plurality of processing elements. In some usage scenarios, the foregoing neuron implementation enables higher performance by enabling a single neuron to use the computational resources of multiple processing elements and/or by load balancing computation across the processing elements, while maintaining locality of incoming activations for the processing elements.
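A minimal sketch of the smearing idea follows, assuming a simple dot-product neuron: the neuron's incoming weights are split across several processing elements, each computes a partial sum locally, and the partial sums are combined into the neuron's activation. This is illustrative only and not the hardware's actual mapping.

```python
# One neuron spread ("smeared") across several processing elements,
# each holding a slice of the weights and producing a partial sum.
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(size=12)        # one neuron's incoming weights
activations_in = rng.normal(size=12)

num_pes = 4
weight_slices = np.array_split(weights, num_pes)
input_slices = np.array_split(activations_in, num_pes)

partial_sums = [w @ x for w, x in zip(weight_slices, input_slices)]  # per-PE work
neuron_output = np.tanh(sum(partial_sums))   # combine, then apply a nonlinearity

assert np.isclose(neuron_output, np.tanh(weights @ activations_in))
```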

In one aspect conceptually related to task synchronization for accelerated deep learning, techniques in advanced deep learning improve one or more of accuracy, performance, and energy efficiency. The array of processing elements performs flow-based computations on wavelet data. Each processing element has a respective computing element and a respective routing element. Each computing element has memory. Each router is capable of communicating via wavelets with at least its nearest neighbors in a 2D mesh. Routing is controlled by a respective virtual channel specifier in each wavelet and routing configuration information in each router. A particular one of the computing elements conditionally selects, for task initiation, a previously received wavelet specifying a particular one of the virtual channels. The conditional selection excludes the previously received wavelet from selection until at least the block/unblock state maintained for the particular virtual channel is in an unblocked state. The computing element executes block/unblock instructions to modify the block/unblock state.
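The hypothetical sketch below models the conditional selection just described: a picker skips queued wavelets whose virtual channel is in the blocked state, and block/unblock operations flip that state. The class and method names are assumptions made for illustration.

```python
# Conditional wavelet selection: blocked channels are skipped until unblocked.
from collections import deque

class Picker:
    def __init__(self, num_channels):
        self.blocked = [False] * num_channels
        self.queues = [deque() for _ in range(num_channels)]

    def enqueue(self, channel, wavelet):
        self.queues[channel].append(wavelet)

    def block(self, channel):    # effect of a "block" instruction
        self.blocked[channel] = True

    def unblock(self, channel):  # effect of an "unblock" instruction
        self.blocked[channel] = False

    def pick(self):
        # Select a wavelet only from channels in the unblocked state.
        for ch, q in enumerate(self.queues):
            if q and not self.blocked[ch]:
                return ch, q.popleft()
        return None

p = Picker(num_channels=4)
p.enqueue(2, "wavelet-A")
p.block(2)
print(p.pick())      # None: channel 2 is blocked
p.unblock(2)
print(p.pick())      # (2, 'wavelet-A')
```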

In one aspect conceptually related to data-flow-triggered tasks for accelerated deep learning, techniques in advanced deep learning improve one or more of accuracy, performance, and energy efficiency. The array of processing elements performs flow-based computations on wavelet data. Each processing element has a respective computing element and a respective routing element. Each computing element has memory. Each router is capable of communicating via wavelets with at least its nearest neighbors in a 2D mesh. Routing is controlled by a respective virtual channel specifier in each wavelet and routing configuration information in each router. A particular one of the computing elements receives a particular wavelet comprising a particular virtual channel specifier and a particular data element. Instructions are read from the memory of the particular computing element based at least in part on the particular virtual channel specifier. The particular data element is used as an input operand in executing at least one of the instructions.
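As a software analogy (not the hardware mechanism), the sketch below maps a virtual channel specifier to a task entry point, "fetches" the task from a memory stand-in, and passes the wavelet's data element as an input operand; the table contents and handlers are made up for illustration.

```python
# Data-flow-triggered task dispatch: virtual channel -> task entry point,
# wavelet data element -> input operand of the triggered task.
task_entry_points = {0: "accumulate_task", 1: "scale_task"}  # channel -> task

memory = {
    "accumulate_task": lambda state, x: state + x,   # stand-ins for fetched
    "scale_task": lambda state, x: state * x,        # instruction sequences
}

def on_wavelet(state, virtual_channel, data_element):
    entry = task_entry_points[virtual_channel]   # "read instructions based on
    task = memory[entry]                         #  the virtual channel specifier"
    return task(state, data_element)             # data element as input operand

state = 1.0
state = on_wavelet(state, virtual_channel=0, data_element=3.0)  # accumulate
state = on_wavelet(state, virtual_channel=1, data_element=2.0)  # scale
print(state)  # 8.0
```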

In one aspect conceptually related to control wavelets for accelerated deep learning, techniques in advanced deep learning improve one or more of accuracy, performance, and energy efficiency. The array of processing elements performs flow-based computations on the wavelet data. Each processing element has a respective computing element and a respective routing element. Each computing element has memory. Each router is capable of communicating with at least the nearest neighbors in the 2D mesh via wavelets. A particular one of the computing elements receives a wavelet. If the control specifier for the wavelet is a first value, an instruction is read from the memory of the particular compute element based on the index specifier for the wavelet. If the control specifier is a second value, instructions are read from the memory of the particular compute element based on the virtual channel specifier of the wavelet. The particular computing element then initiates execution of the instruction.
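Continuing the analogy, a hypothetical sketch of the control-specifier decision: when the control specifier takes the first value, the task address is derived from the index specifier, otherwise from the virtual channel specifier. The tables and addresses here are invented for illustration.

```python
# Choose the instruction address from the index specifier (control wavelet)
# or the virtual channel specifier (data wavelet).
def task_address(wavelet, channel_table, index_table):
    if wavelet["control"]:
        return index_table[wavelet["index"]]          # control wavelet
    return channel_table[wavelet["virtual_channel"]]  # data wavelet

channel_table = {3: 0x2000}
index_table = {7: 0x2400}
print(hex(task_address({"control": True, "index": 7, "virtual_channel": 3},
                       channel_table, index_table)))   # 0x2400
print(hex(task_address({"control": False, "index": 7, "virtual_channel": 3},
                       channel_table, index_table)))   # 0x2000
```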

In one aspect conceptually related to wavelet representation for accelerated deep learning, techniques in advanced deep learning improve one or more of accuracy, performance, and energy efficiency. The array of processing elements performs flow-based computations on wavelet data. Each processing element has a respective computing element and a respective routing element. Each computing element has dedicated memory. Each router is capable of communicating with at least its nearest neighbors in a 2D mesh. Communication is via wavelets in accordance with a representation comprising a virtual channel specifier, an index specifier, a data element specifier, and an optional control/data specifier. The virtual channel specifier and the index specifier are associated with one or more instructions. The index specifier is associated with at least a first instruction operand of the one or more instructions. The data element is associated with at least a second instruction operand of the one or more instructions.
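A hypothetical sketch of the wavelet fields named above as a simple record follows; the field widths and packing are assumptions rather than the encoding used by the processing elements.

```python
# Simple record form of a wavelet, for illustration only.
from dataclasses import dataclass

@dataclass
class Wavelet:
    virtual_channel: int   # selects routing and is associated with instructions
    index: int             # associated with at least a first instruction operand
    data: float            # associated with at least a second instruction operand
    control: bool = False  # optional control/data specifier

w = Wavelet(virtual_channel=5, index=12, data=0.25)
print(w)
```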

A first example of accelerated deep learning is using a deep learning accelerator to train a neural network. A second example of accelerated deep learning is using a deep learning accelerator to operate a trained neural network to perform inference. A third example of accelerated deep learning is using a deep learning accelerator to train a neural network and subsequently performing inference with any one or more of the trained neural network, information derived from the trained neural network, and a variant of the trained neural network.

Examples of neural networks include fully connected neural networks (FCNNs), recurrent neural networks (RNNs), convolutional neural networks (CNNs), long short-term memory (LSTM) networks, autoencoders, deep belief networks, and generative adversarial networks (GANs).

One example of training a neural network is determining one or more weights associated with the neural network, e.g., through hardware acceleration via a deep learning accelerator. One example of making inferences is using a trained neural network to compute results by processing input data based on weights associated with the trained neural network.

The neural network processes data according to a dataflow graph comprising layers of neurons. Stimuli (e.g., input data) are received by an input layer of neurons, and the computed results of the dataflow graph (e.g., output data) are provided by an output layer of neurons. Example layers of neurons include an input layer, an output layer, a rectified linear unit (ReLU) layer, a fully connected layer, a recurrent layer, a long short-term memory layer, a convolutional layer, a kernel layer, a dropout layer, and a pooling layer. The neural network is conditionally and/or selectively trained, subject to hardware acceleration. After being trained, the neural network is conditionally and/or selectively used for inference, subject to hardware acceleration.
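A minimal sketch of layer-by-layer processing through such a dataflow graph follows, assuming small dense layers and a ReLU nonlinearity on the inner layers; the shapes and the choice of nonlinearity are illustrative.

```python
# Stimulus enters at the input layer, flows through inner layers, and the
# result is produced by the output layer.
import numpy as np

rng = np.random.default_rng(2)
layers = [rng.normal(size=(4, 8)), rng.normal(size=(8, 8)), rng.normal(size=(8, 2))]

def forward(stimulus):
    activation = stimulus                      # received by the input layer
    for i, w in enumerate(layers):
        activation = activation @ w
        if i < len(layers) - 1:
            activation = np.maximum(activation, 0.0)   # ReLU on inner layers
    return activation                          # provided by the output layer

print(forward(rng.normal(size=4)))
```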

One example of a deep learning accelerator is one or more relatively specialized hardware elements operating in conjunction with one or more software elements to train a neural network and/or perform inference with a neural network relatively more efficiently than using relatively less specialized hardware elements. Some implementations of the relatively specialized hardware elements include one or more hardware logic circuitry elements, such as transistors, resistors, inductors, capacitors, wire interconnects, combinational logic (e.g., NAND, NOR), latches, register files, memory arrays, tags for memory arrays, content-addressable memories, flash, ROM, DRAM, SRAM, serializers/deserializers (SerDes), I/O drivers, and the like, e.g., implemented via custom logic, synthesized logic, ASICs, and/or FPGAs. Some relatively less specialized hardware elements include conventional CPUs and conventional GPUs.

Example implementations of a deep learning accelerator are capable of processing dataflow in accordance with computations performed for training a neural network and/or performing inference with a neural network. Some deep learning accelerators comprise processing elements coupled via a fabric and capable of communicating with each other via the fabric. The processing elements and the fabric are sometimes collectively referred to as a processing element fabric.

Example implementations of a processing element are capable of communicating and processing wavelets. In various circumstances, the wavelets correspond to dataflow and/or instruction flow such that, in accordance with the communicating and/or the processing, computations are performed for training of and/or inference with a neural network.

An example processing element comprises a router to communicate wavelets via the fabric and a computing element to process the wavelets. The example router is coupled to a plurality of elements: the fabric, an off ramp to the computing element, and an on ramp from the computing element. An example coupling between the router and the fabric enables communication between the router and, e.g., four logically and/or physically adjacent processing elements. The router variously receives wavelets from the fabric and from the on ramp. The router variously transmits wavelets to the fabric and to the off ramp.

Example implementations of a computing element are capable of processing wavelets by initiating tasks and executing instructions associated with the wavelets, and by accessing data associated with the wavelets and/or the instructions. The instructions are in accordance with an instruction set architecture comprising arithmetic instructions, control flow instructions, data type conversion instructions, configuration instructions, fabric management instructions, and load/store instructions. The instructions operate on operands comprising various data types, e.g., integer data types and floating-point data types of various widths. The operands comprise scalar operands and vector operands. In various embodiments and/or usage scenarios, a vector variously represents weights of a neural network, inputs or stimuli of a neural network, activations of a neural network, and/or partial sums of a neural network. In some scenarios, a vector is a sparse vector (e.g., a vector of neuron activations) and comprises sparse data elements (e.g., only the non-zero elements). In some other scenarios, a vector is a dense vector (e.g., pixel values) and comprises dense data elements (e.g., all elements of the vector, including zero elements).
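The sketch below illustrates the sparse versus dense vector distinction using a plain (index, value) representation for the sparse form; this representation is an assumption chosen for clarity, not the accelerator's operand format.

```python
# Dense vector: every element, including zeros. Sparse vector: only the
# non-zero elements, each paired with its index.
import numpy as np

dense = np.array([0.0, 0.7, 0.0, 0.0, 1.3, 0.0])          # e.g. pixel values

# Sparse form: (index, value) pairs for the non-zero activations only.
sparse = [(i, v) for i, v in enumerate(dense) if v != 0.0]
print(sparse)                      # [(1, 0.7), (4, 1.3)]

# Reconstructing the dense vector from the sparse pairs.
restored = np.zeros_like(dense)
for i, v in sparse:
    restored[i] = v
assert np.array_equal(restored, dense)
```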

Example computing elements include hardware elements that collectively execute instructions associated with a wavelet by performing operations specified by the instructions (e.g., arithmetic operations, control flow operations, and load/store operations). Examples of hardware elements include a selector queue, a selector, a task definition table, an instruction sequencer, an instruction decoder, a data sequencer, a register file, a memory, a pseudo-random number generator, and an ALU. Some implementations of the hardware elements are consistent with hardware logic circuit elements described elsewhere herein. Sometimes, the computing element is referred to as a compute engine. Sometimes, the compute scheduler is referred to as a selector and the compute scheduler queue is referred to as a selector queue.

An example fabric is a collection of logical and/or physical couplings between processing elements and/or within a single processing element. The fabric is usable to implement logical and/or physical communication topologies such as a mesh, a 2D mesh, a 3D mesh, a hypercube, a torus, a ring, a tree, or any combination thereof. An example of a physical coupling between processing elements is a set of physical interconnects (including optional and/or selective buffering) between the physically coupled processing elements. A first example of a physically coupled processing element is a directly physically adjacent processing element, e.g., a first processing element located directly beside (such as "north", "south", "east", or "west" of) a second processing element. A second example of a physically coupled processing element is a relatively physically nearby processing element, e.g., a first processing element located within a relatively small number of intermediate processing elements, e.g., one or two "rows" and/or "columns" away from a second processing element. A third example of a physically coupled processing element is a relatively physically far-away processing element, e.g., a first processing element located physically relatively far from a second processing element, such as a distance limited by signal propagation (with or without optional and/or selective buffering) within a clock cycle and/or clock sub-cycle associated with the processing elements. Examples of physical couplings within a single processing element (having, e.g., a computing element and a router) are an on ramp coupling information from the computing element to the router and an off ramp coupling information from the router to the computing element. In some situations, the router routes information from the on ramp to the off ramp.

One example of a logical coupling between processing elements is a virtual channel implemented by a router within a processing element. The routing between the first processing element and the second processing element is implemented, for example, by a router within the processing element forwarding along the route according to the virtual channel and the routing configuration information. An example of a logical coupling within a single particular processing element (having, for example, a router) is a virtual channel implemented by the router such that the particular processing element is able to send information to the particular processing element via the virtual channel. The router forwards "internally" with respect to a particular processing element based on the virtual channel and routing configuration information.
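As an illustrative software analogy of routing by virtual channel, the sketch below keeps a per-router table mapping a color to the set of outputs a wavelet should be forwarded on (neighboring directions, the off ramp into the local computing element, or both for multicast); the table contents are invented for this example.

```python
# Per-router routing configuration: color -> set of outputs at this router.
ROUTE_TABLE = {
    3: {"east"},                    # pass-through toward a remote element
    5: {"off_ramp"},                # deliver to the local computing element
    9: {"off_ramp", "south"},       # multicast: local delivery plus forwarding
}

def route(color, payload):
    outputs = ROUTE_TABLE.get(color, set())
    for out in outputs:
        print(f"color {color}: forwarding {payload!r} to {out}")

route(9, "partial-sum")
```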

An example wavelet is a bundle of information passed between processing elements via a structure. Example wavelets include wavelet payload and color. The wavelet payload includes data and is associated with instructions. The first response to the wavelet received by the computing element of the processing element includes the computing element initiating a task, e.g., processing corresponding to an instruction associated with the wavelet. The second response to the wavelet received by the computing element of the processing element includes the computing element processing data of the wavelet. Example types of wavelets include dense wavelets and sparse wavelets as well as data wavelets and control wavelets.

For example, wavelets are used for communication between processing elements. In the first case, the first processing element transmits the wavelet to the second processing element. In the second case, an external device (e.g., an FPGA) transmits the wavelet to the processing element. In a third case, the processing element transmits the wavelet to an external device (e.g., an FPGA).

An example virtual channel is one or more communication pathways specified by a color and enabled by, e.g., the fabric and one or more routers. A wavelet comprising a particular color is sometimes referred to as being associated with the particular virtual channel associated with the particular color. A first example of a color is a fabric color specifying a virtual channel between two different processing elements. In some embodiments, a fabric color is a 5-bit integer. A second example of a color is a local color specifying a virtual channel from a processing element to itself. In some embodiments, a color is a 6-bit integer and specifies one of a fabric color and a local color.

An example task includes a set of instructions executed in response to a wavelet. An example instruction includes an operation and optionally one or more operands that specify a location of a data element to be processed in accordance with the operation. A first example of an operand specifies a data element in memory. A second example of an operand specifies a data element that is transmitted (e.g., received or transmitted) via the fabric. One example of a data sequencer determines the location of a data element. One example of an instruction sequencer determines the address in memory of the instruction associated with the wavelet.

An example selector queue is capable of holding wavelets received via the off ramp of the fabric for processing by the computing element. An example selector selects a wavelet from the selector queue for processing.

One example of an integrated circuit (IC) is a collection of circuitry implemented on a single portion of semiconductor material. One example of an application-specific integrated circuit (ASIC) is an IC designed for a particular use. One example of wafer-level integration is implementing a system using all or a substantial portion of a wafer, whole or substantially whole, as an element of the system.

In some embodiments and/or usage scenarios, wafer-level integration enables connecting multiple elements of a system via wafer interconnects formed using silicon fabrication processes instead of via inter-chip interconnects, thereby improving any one or more of performance, cost, reliability, and energy efficiency. As a specific example, a system implemented using wafer-level integration techniques is capable of implementing 300 million PEs on a single wafer, each PE having bandwidth to its nearest physical neighbors that is greater than in a comparable system implemented using techniques other than wafer-level integration. The greater bandwidth enables the system implemented using wafer-level integration techniques to train and/or perform inference with larger neural networks relatively more efficiently than a system implemented using techniques other than wafer-level integration.

Acronyms

At least some of the various shorthand abbreviations (e.g., acronyms) defined herein refer to certain elements used herein.

Example embodiments

At the conclusion of the introduction to the detailed description, the following is a collection of example embodiments, including at least some explicitly enumerated as "ECs" (example combinations), that provide additional description of various embodiment types according to the concepts described herein; these examples are not meant to be mutually exclusive, exhaustive, or limiting; and the invention is not limited to these exemplary embodiments but includes all possible modifications and variations within the scope of the issued claims and their equivalents.

EC100) a system, comprising:

a fabric of processor elements, each processor element including a fabric router and a compute engine, the compute engine capable of performing data flow-based and instruction-based processing;

wherein each processor element selectively communicates fabric packets with other processor elements; and

wherein each compute engine selectively performs processing based on the virtual channel specifier and the task specifier of each fabric packet received by the compute engine.

EC100b) a system, comprising:

a fabric of processor elements, each processor element comprising a fabric router and a compute engine;

wherein each processor element selectively communicates fabric packets with other processor elements; and

wherein each compute engine selectively performs dataflow processing and instruction processing based on the dataflow field and instruction field, respectively, of each fabric packet received by the compute engine.

EC100c) the system of EC100, wherein the processing is according to a dataflow graph.

EC100d) the system of EC100, wherein an executed workload comprises mainly dataflow-based processing and minimal instruction-based processing.

EC100e) the system of EC100d, wherein the system implements a long short-term memory (LSTM) neural network model.

EC100f) the system of EC100, wherein an executed workload comprises mainly instruction-based processing and minimal dataflow-based processing.

EC100g) the system of EC100, wherein the system is implemented at least in part using wafer-level integration.

EC100h) the system of EC100, wherein the structure of the processor elements is implemented at least in part using VLSI manufacturing.

EC101) the system of EC100, wherein the virtual channel specifier selects respective routing paths that are independent in the fabric.

EC101b) the system of EC100, wherein the virtual channel specifier selects a routing path in a fabric to perform the multicast.

EC101c) the system of EC100, wherein the virtual channel specifier selects a routing path in a fabric to perform load splitting.

EC102) the system of EC100, wherein the task specifier selects one or more operations to perform.

EC103) the system of EC100, wherein the fabric comprises a 2D array of the processor elements.

EC103b) the system of EC100, wherein the fabric comprises a processor element interconnection topology selected from the group consisting of fully connected, star, ring, array, mesh, hypercube, torus, and tree.

EC103c) the system of EC100, wherein the fabric comprises processor element interconnect topology dimensions selected from the group consisting of 1D, 2D, 3D, and dimensions greater than 3D.

EC104) the system of EC100, wherein the system is capable of executing a machine learning workload.

EC105) the system of EC100, wherein the system is trained to perform inference applications.

EC105b) the system of EC100, wherein the system executes the inference application.

EC106) the system of EC100, wherein the system implements a deep neural network trained to perform object classification and/or detection.

EC107) the system of EC100, wherein the system implements a deep neural network trained to perform inference applications selected from text translation, optical character recognition, image classification, facial recognition, scene recognition for autonomous vehicles, voice recognition, data analysis for high-energy physics, and drug discovery.

EC108) the system of EC100, wherein the fabric is organized into a plurality of peripheral processor elements and a plurality of internal processor elements, and each internal processor element is respectively coupled to at least four other processor elements of the plurality of processor elements in at least four logical directions.

EC109) the system of EC100, wherein each compute engine comprises a memory, a data path, and a hybrid data flow and instruction execution controller.

EC110) the system of EC109, wherein each compute engine operates according to a multi-stage compute engine pipeline having a plurality of compute engine pipeline stages.

EC111) the system of EC109, wherein the instruction execution controller comprises an instruction sequencer implemented using one or more of microcode, PLA, one or more counters, and a gate-level state machine.

EC112) the system of EC109, wherein each compute engine further comprises a register file, an instruction decoder, an instruction cache, and a data cache.

EC112b) the system of EC109, wherein each compute engine further comprises a register file, an instruction decoder, an instruction buffer, and a data buffer.

EC113) the system of EC100, wherein:

each compute engine is configured to execute a predefined set of base operations in response to receiving a corresponding base instruction selected from a predefined set of native instruction code; and the system further comprises:

a training workload comprising:

a first set of machine code selected from the native instruction set for mapping at least a part of a neuron onto the compute engine of the processor element, the mapping comprising managing at least one partial neuron weight,

a second set of machine code selected from the native instruction set for performing a forward pass to propagate activations in a forward logical direction based at least in part on the at least one partial neuron weight, the forward pass initiated in response to an input sample,

a third set of machine code selected from the native instruction set for performing a delta pass in a backward logical direction to generate deltas, the delta pass initiated in response to completion of the forward pass,

a fourth set of machine code selected from the native instruction set for performing a chain pass to calculate gradients based on the deltas, and

a fifth set of machine code selected from the native instruction set for performing a selective update of the at least one partial neuron weight in accordance with a predetermined learning rule and based at least in part on the deltas; and

wherein each compute engine comprises storage for the at least one partial neuron weight.

EC113a) the system of EC113, wherein each primitive instruction is executed according to the task specifiers of a respective one of the fabric packets.

EC113b) the system of EC113, wherein the fabric comprises a 2D array of the processor elements having first, second, third, and fourth physical directions, the first and second physical directions being co-linear and opposite, the third and fourth physical directions being co-linear and opposite, the first and third physical directions being orthogonal, the forward logical direction being in the first physical direction, and the backward logical direction being in the second physical direction.

EC113c) the system of EC113, wherein the training workload further comprises a sixth set of machine code selected from the native instruction set for executing the non-linear activation function.

EC113d) the system of EC113c, wherein the nonlinear activation function is selected from the group consisting of sigmoid, tanh, and ReLU.

EC114) the system of EC113, wherein the mapping is in accordance with initializing the fabric to partition the neural network into a plurality of layers, the neuron is a first neuron of a plurality of neurons of the neural network, the first neuron is included in a first layer of the plurality of layers, and each of the plurality of neurons is mapped in a distributed manner across a plurality of the processor elements of the fabric.

EC115) the system of EC114, wherein the mapping completes all passes for each layer, for each input sample of the training set, in a same amount of time.

EC115b) the system of EC114, wherein the mapping completes all passes for each layer, for each input sample of the training set, in a same predetermined amount of time.

EC115c) the system of EC114, wherein the mapping completes all passes for each layer, for each input sample of the training set, within a same time period determined in real time.

EC116) the system of EC114, wherein the plurality of layers operates as a logical fabric pipeline comprising logical fabric pipeline stages, each logical fabric pipeline stage comprising completing all passes for each layer, the completing for each layer taking a time step comprising a same amount of time.

EC116b) the system of EC114, wherein each of the plurality of layers operates as a logical fabric pipeline stage of a respective logical fabric pipeline for each pass, the completing for each layer taking a time step comprising a same amount of time.

EC117) the system of EC114, wherein the neuron weights are selectively updated in the first plurality of processor elements on the multiple levels as each input sample of the training set flows through at least the first plurality of processor elements on the multiple levels.

EC117b) the system of EC118, wherein the neuron weights are selectively updated in the first plurality of processor elements on the multiple levels as each input sample of the training set flows through at least the first plurality of processor elements on the multiple levels, and the streaming and updating are performed for each time step of the multiple time steps.

EC119) the system of EC120, further comprising a digital clock, and wherein the time step is an integer multiple of a clock period of the digital clock.

EC118b) the system of EC120, further comprising a digital clock, and wherein the time step is a variable amount of time.

EC118c) the system of EC121 or EC118b, wherein the time step is determined in real time.

EC122) the system of EC114, further comprising:

wherein each compute engine operates according to a multi-stage compute engine pipeline having a plurality of compute engine pipeline stages, a compute engine machine cycle being the time to complete each compute engine pipeline stage, and a compute engine pipeline cycle being the time to complete the plurality of compute engine pipeline stages;

wherein the compute engine machine cycle comprises a first multiple of a clock cycle of the digital clock;

wherein the plurality of layers operate as a logical structure pipeline comprising logical structure pipeline stages, each logical structure pipeline stage comprising a time to complete all transfers for each layer, the time step comprising a time to complete each logical structure pipeline stage; and is

wherein the time step comprises a second multiple of the compute engine pipeline cycle.
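As a hedged numeric illustration of the relationship just described (assuming, purely for example, a 1 GHz digital clock, a first multiple of 1, a five-stage compute engine pipeline, a second multiple of 1000, and that the pipeline cycle is simply the number of stages times the machine cycle; none of these values are specified above):

    clock_period_ns = 1.0            # 1 GHz digital clock (assumed)
    first_multiple = 1               # machine cycle = 1 clock cycle
    pipeline_stages = 5              # assumed number of compute engine pipeline stages
    second_multiple = 1000           # time step = 1000 compute engine pipeline cycles

    machine_cycle_ns = first_multiple * clock_period_ns
    pipeline_cycle_ns = pipeline_stages * machine_cycle_ns
    time_step_ns = second_multiple * pipeline_cycle_ns
    print(time_step_ns)  # 5000.0 ns per logical structure pipeline stage, under these assumptions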

EC123) the system of EC122, wherein the first multiple is 1.

EC124) the system of EC122, wherein the second multiple is hundreds to thousands.

EC125) the system of EC120, wherein the at least one partial neuron weights are selectively updated within the first plurality of processor elements in response to changes in backward propagated data within the first plurality of processor elements for each time step of the plurality of time steps while forward propagation of activations is ongoing.

EC126) the system of EC120, wherein the at least one partial neuron weights are selectively updated at each time step within the plurality of time steps.

EC123b) the system of EC117, EC117b, EC122, or EC123, wherein the selectively updating is consistent with a continuous propagation gradient descent process.

EC127) the system of EC114, wherein the neural network comprises a thousand layers.

EC128) the system of EC114, wherein the plurality of neurons comprises billions of neurons.

EC125b) the system of EC114, wherein the plurality of neurons comprises millions of neurons.

EC125c) the system of EC114, wherein the neural network comprises, for at least some of the plurality of neurons, at least 10 weights per neuron.

EC125d) the system of EC114, wherein the neural network comprises, for at least some of the plurality of neurons, at least 1000 weights per neuron.

EC129) the system of EC114, wherein the neural network comprises billions of weights per layer.

EC126b) the system of EC114, wherein the neural network comprises millions of weights per layer.

EC130) the system of EC114, wherein, for each layer of the neural network, the input activations are weighted to create partial sums, the partial sums are accumulated to generate output activations for the layer, and the accumulated weighted partial sums represent neurons and associated synapses of the neural network.

EC127b) the system of EC127, wherein each weight corresponds to a synapse, each partial sum corresponds to a stimulus, the cumulative weighted partial sum corresponds to a total stimulus, and each output activation of the layer corresponds to a neuron output.
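The relationship described in the two entries above can be illustrated with a minimal sketch (dense weights, a tanh nonlinearity, and NumPy arrays are assumptions made only for illustration): each weight scales an input activation into a partial sum (a stimulus), the partial sums accumulate into a total stimulus per neuron, and the layer's output activations are the neuron outputs.

    import numpy as np

    def layer_forward(input_activations, weights, nonlinearity=np.tanh):
        # Each weights[j, i] * input_activations[i] is one partial sum (stimulus).
        partial_sums = weights * input_activations             # one term per synapse
        total_stimulus = partial_sums.sum(axis=1)               # accumulate per neuron
        return nonlinearity(total_stimulus)                     # output activations

    x = np.array([0.5, -1.0, 0.25])
    W = np.array([[0.1, 0.2, -0.3],
                  [0.4, -0.5, 0.6]])
    print(layer_forward(x, W))   # two neuron outputs for this layer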

EC131) the system of EC113, wherein the iteration of the training workload is performed on each of a plurality of input samples that collectively comprise a training set.

EC132) the system of EC131, wherein the predetermined learning rule specifies updating the at least one partial neuron weights after all transfers of the entire training set are completed.

EC129b) the system of EC129, wherein the predetermined learning rule conforms to a stochastic gradient descent process.

EC129c) the system of EC129, wherein the predetermined learning rule conforms to a mini-batch gradient descent process.

EC129d) the system of EC129, wherein the predetermined learning rule conforms to a continuous propagation gradient descent process.

EC133) the system of EC131, wherein the training set is divided into a plurality of so-called mini-batches, and the predetermined learning rule specifies updating the at least one partial neuron weights after completing all transfers of the input samples included in each of the mini-batches.

EC134) the system of EC131, wherein the training set is divided into a plurality of so-called mini-batches, and the predetermined learning rule specifies updating the at least one partial neuron weight after all transfers of each input sample of each mini-batch are completed.

EC131b) the system of EC131, wherein the predetermined learning rule conforms to a continuous propagation gradient descent process.

EC135) the system of EC134, wherein the forward transfer includes weight updates within the first plurality of processor elements while mini-batch learning is ongoing within the first plurality of processor elements.

EC136) the system of EC113, wherein the memory is contained in memory local to the compute engine.

EC133b) the system of EC113, wherein the memory is included in the compute engine.

EC133c) the system of EC113, wherein the memory is a respective memory connected to each compute engine.

EC137) the system of EC113, wherein the memory is capable of storing a 2D matrix data structure.

EC134b) the system of EC113, wherein the memory is capable of storing a multidimensional data structure.

EC134c) the system of EC113, wherein the memory is capable of storing a tensor data structure comprising dimensions selected from the group consisting of 2D, 3D, 4D, 5D, and 6D.

EC138) the system of EC113, wherein each compute engine further comprises memory for gradient accumulation, forward partial sums, delta partial sums, and forward pass activations.

EC139) the system of EC114, wherein data propagates toward the logical end of the neural network during forward transfers and loops back in the backward logical direction during incremental transfers and chained transfers.

EC140) the system of EC113, wherein the forward transfer saves activations for use by the incremental transfer and the chained transfer.

EC141) the system of EC113, wherein each processor element is time-shared in forward, incremental, and chained transfers.

EC142) the system of EC131, wherein for each input sample, the system is capable of selectively updating at least one partial neuron weight according to a predetermined learning rule in response to completion of forward transfer, incremental transfer, and chained transfer corresponding to the input sample.

EC139b) the system of EC139, wherein the predetermined learning rule conforms to a continuous propagation gradient descent process.

EC143) the system of EC142, wherein for each forward pass the system is capable of using weight information provided by a most recent selective update of the at least one partial neuron weight.

EC144) the system of EC143, wherein the system is capable of initiating a forward pass of a particular iteration of the training workload regardless of whether a selective update of at least one partial neuron weight corresponding to a previous iteration of the training workload has occurred.

EC145) the system of EC143, wherein the system is capable of initiating a forward pass of a particular iteration of the training workload regardless of whether an incremental pass of a previous iteration of the training workload has begun.

EC146) the system of EC143, wherein the at least one computing engine is capable of performing at least a portion of the forward pass for a subsequent iteration of the training workload after performing at least a portion of the forward pass for a previous iteration of the training workload and before performing a portion of the selective updating of the at least one partial neuron weights corresponding to the previous iteration of the training workload.

EC147) the system of EC143, wherein the system is capable of performing an incremental transfer and a chained transfer for each input sample based at least in part on the recalculated activations based at least in part on the first portion of neuron weights.

EC148) the system of EC147, wherein the first partial neuron weights are partial neuron weights resulting from a most recent selective update.

EC145b) the system of EC145, wherein the activation of the recalculation does not require storage between calculations, thereby reducing the total memory required for a given system training configuration.
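The memory-saving recomputation referred to above can be sketched as follows (a minimal, hedged illustration in NumPy; the layer structure, the choice of layer inputs as the only saved values, and all function names are assumptions): instead of storing every forward activation for the incremental and chained transfers, only the inputs to each layer are kept and the activations are recomputed on demand from the most recently updated weights, trading compute for memory.

    import numpy as np

    def forward(x, weights):
        # Forward pass that stores only the inputs to each layer (checkpoints),
        # not the full set of intermediate activations.
        checkpoints = []
        a = x
        for W in weights:
            checkpoints.append(a)
            a = np.tanh(W @ a)
        return a, checkpoints

    def backward_with_recompute(delta, weights, checkpoints):
        # Incremental/chained transfers recompute each layer's activation from its
        # checkpoint instead of reading a stored activation.
        grads = [None] * len(weights)
        for i in reversed(range(len(weights))):
            a_in = checkpoints[i]
            a_out = np.tanh(weights[i] @ a_in)        # recomputed activation
            delta = delta * (1.0 - a_out * a_out)     # through the tanh derivative
            grads[i] = np.outer(delta, a_in)          # chain transfer: weight gradient
            delta = weights[i].T @ delta              # propagate delta backward
        return grads

    rng = np.random.default_rng(1)
    weights = [rng.standard_normal((4, 4)) * 0.1 for _ in range(3)]
    y, ckpts = forward(rng.standard_normal(4), weights)
    grads = backward_with_recompute(y - np.zeros(4), weights, ckpts)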

EC145c) the system of EC139, EC140, EC141, or EC142, wherein the parallel layer training is capable of achieving a predetermined accuracy goal with faster convergence speed, thereby reducing the total training time required for a given system training configuration.

EC145d) the system of EC139, EC140, EC141, or EC142, wherein the concurrent layer training is capable of improving the accuracy of a given total training time and system training configuration.

EC149) the system of EC143, wherein each computing element is capable of performing a portion of the incremental transfer and a portion of the chain transfer on each input sample based at least in part on the recalculated activations based at least in part on the first portion neuron weights.

EC150) the system of EC149, wherein the first partial neuron weights are partial neuron weights resulting from a most recent selective update.

EC200) a method comprising:

in each processor element of a fabric of processor elements, selectively communicating fabric packets with other processor elements of the fabric, each processor element including a fabric router and a compute engine capable of performing data flow-based and instruction-based processing; and

In each compute engine, processing is selectively performed according to the virtual channel specifier and the task specifier of each fabric packet received by the compute engine.

EC200b) a method comprising:

in each processor element of a fabric of processor elements, selectively communicating fabric packets with other processor elements of the fabric, each processor element including a fabric router and a compute engine; and

In each compute engine, data stream processing and instruction processing are selectively performed according to the virtual channel specifier and task specifier, respectively, of each fabric packet received by the compute engine.
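A hedged, purely illustrative sketch of the dispatch behavior just described (the packet fields, handler names, and the dictionary-based dispatch are assumptions, not the enumerated hardware): the virtual channel specifier selects a routing path or receive queue, while the task specifier selects which processing is performed on the packet's payload.

    from dataclasses import dataclass

    @dataclass
    class FabricPacket:
        virtual_channel: int   # selects the routing path / receive queue
        task: str              # selects the processing to perform
        payload: list

    def task_dataflow(payload):
        # Data-flow-style processing: e.g., accumulate the payload into a partial sum.
        return sum(payload)

    def task_instruction(payload):
        # Instruction-style processing: e.g., treat the payload as an opcode sequence.
        return [f"exec {op}" for op in payload]

    TASK_TABLE = {"dataflow": task_dataflow, "instruction": task_instruction}

    def compute_engine_receive(packet: FabricPacket):
        # Selectively perform processing according to the packet's task specifier;
        # the virtual channel would steer routing/queueing in a real fabric router.
        handler = TASK_TABLE[packet.task]
        return packet.virtual_channel, handler(packet.payload)

    print(compute_engine_receive(FabricPacket(3, "dataflow", [1.0, 2.0, 3.0])))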

EC200c) the method of EC200, wherein the processing is according to a dataflow graph.

EC200d) the method of EC200, further comprising executing a workload, the workload comprising primarily data stream based processing and minimally instruction based processing.

EC200e) the method of EC200d, wherein the method is performed implementing a Long Short Term Memory (LSTM) neural network model.

EC200f) the method of EC200, further comprising executing a workload, primarily comprising instruction-based processing and minimal data-stream-based processing.

EC200g) the method of EC200, wherein the structure of the processor element is implemented at least in part using wafer-level integration.

EC200h) the method of EC200, wherein the structure of the processor element is implemented at least in part using VLSI manufacturing.

EC201) the method of EC200, wherein the virtual channel specifier selects respective routing paths that are independent in the fabric.

EC201b) the method of EC200, wherein the virtual channel specifier selects a routing path in a fabric to perform the multicast.

EC201c) the method of EC200, wherein the virtual channel specifier selects a routing path in a fabric to perform load splitting.

EC202) the method of EC200, wherein the task specifier selects one or more operations to be performed.

EC203) the method of EC200, wherein the structure comprises a 2D array of processor elements.

EC203b) the method of EC200, wherein the structure comprises a processor element interconnection topology selected from the group consisting of fully connected, star, ring, array, mesh, hypercube, torus, and tree.

EC203c) the method of EC200, wherein the fabric comprises processor element interconnect topology dimensions selected from the group consisting of 1D, 2D, 3D, and dimensions greater than 3D.

EC204) the method of EC200, wherein the method, when performed, executes a machine learning workload.

EC205) the method of EC200, wherein the method, when performed, trains an inference application.

EC205b) the method of EC200, wherein the method, when performed, executes the inference application.

EC206) the method of EC200, wherein the method, when performed, implements a deep neural network trained to perform object classification and/or detection.

EC207) the method of EC200, wherein the method, when performed, implements a deep neural network trained to perform an inference application selected from the group consisting of text translation, optical character recognition, image classification, face recognition, scene recognition for autonomous vehicles, voice recognition, data analysis for high-energy physics, and drug discovery.

EC208) the method of EC200, wherein the fabric is organized into a plurality of peripheral processor elements and a plurality of internal processor elements, and each internal processor element is respectively coupled to at least four other processor elements of the plurality of processor elements in at least four logical directions.

EC209) the method of EC200, wherein each compute engine comprises a memory, a data path, and a mixed data flow and instruction execution controller.

EC210) the method of EC209, wherein each compute engine operates according to a multi-stage compute engine pipeline having a plurality of compute engine pipeline stages.

EC211) the method of EC209, wherein the instruction execution controller comprises an instruction sequencer implemented using one or more of microcode, PLA, one or more counters, and a gate-level state machine.

EC212) the method of EC209, wherein each compute engine further comprises a register file, an instruction decoder, an instruction cache, and a data cache.

EC212b) the method of EC209, wherein each compute engine further comprises a register file, an instruction decoder, an instruction buffer, and a data buffer.

EC213) the method of EC200, wherein:

each compute engine is configured to execute a predefined set of base operations in response to receiving a corresponding base instruction selected from a predefined set of native instruction code; and the method further comprises:

processing a training workload comprising:

A first set of machine code selected from a native instruction set for mapping at least a portion of the neurons onto a compute engine of a processor element, the mapping comprising managing at least one partial neuron weight,

A second set of machine code selected from the native instruction set to perform a forward pass based at least in part on the at least one partial neuron weight to propagate activations in a forward logical direction, the forward pass initiated in response to the input sample,

a third set of machine code selected from the native instruction set for performing an incremental pass in a backward logical direction to generate an increment, the incremental pass initiated in response to completion of the forward pass,

a fourth set of machine code selected from the native instruction set for performing chain passing to compute gradients based on the deltas, an

A fifth set of machine code selected from the native instruction set for performing selective updating of at least one partial neuron weights according to a predetermined learning rule and based at least in part on the deltas; and is

Wherein each compute engine includes memory for at least one partial neuron weight.

EC213a) the method of EC213, wherein each primitive instruction is executed according to the task specifiers of a respective one of the fabric packets.

EC213b) the method of EC213, wherein the structure comprises a 2D array of processor elements, the processor elements comprising first, second, third, and fourth physical directions, the first and second physical directions being collinear and opposite, the third and fourth physical directions being collinear and opposite, the first and third physical directions being orthogonal, and the forward logical direction being in the first physical direction and the backward logical direction being in the second physical direction.

EC213c) the method of EC213, wherein the training workload further comprises a sixth set of machine code selected from the native instruction set for executing the non-linear activation function.

EC213d) the method of EC213c, wherein the nonlinear activation function is selected from the group consisting of sigmoid, tanh, and ReLU.

EC214) the method of EC213, wherein the mapping is according to the initialization structure to enable partitioning of the neural network into a plurality of layers, the neuron is a first neuron of a plurality of neurons of the neural network, the first neuron is included in a first layer of the plurality of layers, and each of the plurality of neurons is mapped in a distributed manner on a plurality of processor elements of the structure.

EC215) the method of EC214, wherein the mapping completes all transfers for each layer in a same amount of time from each input sample of the training set.

EC215b) the method of EC214, wherein the mapping completes all transfers for each layer in a same predetermined amount of time from each input sample of the training set.

EC215c) the method of EC214, wherein the mapping completes all passes for each layer within the same time period determined in real time from each input sample of the training set.

EC216) the method of EC214, wherein the multiple layers operate as a logical structure pipeline comprising logical structure pipeline stages, each logical structure pipeline stage comprising completing all transfers of each layer, completion of each layer taking a time step comprising the same amount of time.

EC216b) the method of EC214, wherein each of the multiple layers operates as a logical structure pipeline stage of a respective logical structure pipeline for each transfer, completion of each layer taking a time step comprising the same amount of time.

EC217) the method of EC214, wherein the neuron weights are selectively updated in the first plurality of processor elements on the multiple levels as each input sample of the training set flows through at least the first plurality of processor elements on the multiple levels.

EC217b) the method of EC216, wherein the neuron weights are selectively updated in the first plurality of processor elements on the multiple levels as each input sample of the training set flows through at least the first plurality of processor elements on the multiple levels, and the streaming and updating are performed for each time step of the multiple time steps.

EC218) the method of EC216, wherein the at least one processor element comprises a digital clock and the time step is an integer multiple of a clock period of the digital clock.

EC218b) the method of EC216, wherein the at least one processor element has a digital clock and the time step is a variable amount of time.

EC218c) the method of EC218 or EC218b, wherein the time step is determined in real time.

EC219) the method of EC214, further comprising:

operating each compute engine according to a multi-stage compute engine pipeline having a plurality of compute engine pipeline stages, a compute engine machine cycle being the time to complete each compute engine pipeline stage, and a compute engine pipeline cycle being the time to complete the plurality of compute engine pipeline stages;

wherein the compute engine machine cycle comprises a first multiple of a clock cycle of the digital clock;

wherein the plurality of layers operate as a logical structure pipeline comprising logical structure pipeline stages, each logical structure pipeline stage comprising a time to complete all transfers for each layer, the time step comprising a time to complete each logical structure pipeline stage; and is

wherein the time step comprises a second multiple of the compute engine pipeline cycle.

EC220) the method of EC219, wherein the first multiple is 1.

EC221) the method of EC219, wherein the second multiple is hundreds to thousands.

EC222) the method of EC216, further comprising, for each time step of the plurality of time steps, selectively updating at least one partial neuron weight within the first plurality of processor elements in response to a change in backward propagated data within the first plurality of processor elements while forward propagation of activations is ongoing.

EC223) the method of EC216, further comprising selectively updating at least one partial neuron weight at each time step in the plurality of time steps.

EC223b) the method of EC217, EC217b, EC222, or EC223, wherein selectively updating is consistent with a continuous propagation gradient descent process.

EC224) the method of EC214, wherein the neural network comprises a thousand layers.

EC225) the method of EC214, wherein the plurality of neurons comprises billions of neurons.

EC225b) the method of EC214, wherein the plurality of neurons comprises millions of neurons.

EC225c) the method of EC214, wherein, for at least some of the plurality of neurons, the neural network comprises at least 10 weights for each neuron.

EC225d) the method of EC214, wherein, for at least some of the plurality of neurons, the neural network comprises at least 1000 weights per neuron.

EC226) the method of EC214, wherein the neural network comprises billions of weights per layer.

EC226b) the method of EC214, wherein the neural network comprises millions of weights per layer.

EC227) the method of EC214, further comprising, for each layer of the neural network, weighting the input activations to create partial sums, accumulating the partial sums to generate output activations for the layer, and wherein the accumulated weighted partial sums represent neurons and associated synapses of the neural network.

EC227b) the method of EC227, wherein each weight corresponds to a synapse, each partial sum corresponds to a stimulus, the cumulative weighted partial sum corresponds to a total stimulus, and each output activation of the layer corresponds to a neuron output.

EC228) the method of EC213, further comprising performing an iteration of a training workload on each of a plurality of input samples that collectively comprise a training set.

EC229) the method of EC228, wherein the predetermined learning rule specifies updating at least one partial neuron weight after completing all transfers of the entire training set.

EC229b) the method of EC229, wherein the predetermined learning rule conforms to a stochastic gradient descent process.

EC229c) the method of EC229, wherein the predetermined learning rule conforms to a mini-batch gradient descent process.

EC229d) the method of EC229, wherein the predetermined learning rule conforms to a continuous propagation gradient descent process.

EC230) the method of EC228, further comprising dividing the training set into a plurality of so-called mini-batches, and the predetermined learning rule specifies updating the at least one partial neuron weight after completing all transfers of input samples included in each mini-batch.

EC231) the method of EC228, further comprising dividing the training set into a plurality of so-called mini-batches, and the predetermined learning rule specifies updating the at least one partial neuron weight after all transfers of each input sample of each mini-batch are completed.

EC231b) the method of EC231, wherein the predetermined learning rule conforms to a continuous propagation gradient descent process.

EC232) the method of EC231, wherein the forward transfer includes weight updates within the first plurality of processor elements while mini-batch learning is ongoing within the first plurality of processor elements.

EC233) the method of EC213, wherein the memory is contained in memory local to the compute engine.

EC233b) the method of EC213, wherein the memory is included in the compute engine.

EC233c) the method of EC213, wherein the memory is a respective memory connected to each compute engine.

EC234) the method of EC213, wherein the memory is capable of storing a 2D matrix data structure.

EC234b) the method of EC213, wherein the memory is capable of storing a multidimensional data structure.

EC234c) the method of EC213, wherein the memory is capable of storing a tensor data structure comprising dimensions selected from the group consisting of 2D, 3D, 4D, 5D, and 6D.

EC235) the method of EC213, wherein each compute engine further comprises memory for gradient accumulation, forward partial sums, delta partial sums, and forward pass activations.

EC236) the method of EC214, wherein data propagates toward the logical end of the neural network during the forward transfer and loops back in the backward logical direction during the incremental transfer and the chained transfer.

EC237) the method of EC213, wherein the forward transfer saves activations for use by the incremental transfer and the chained transfer.

EC238) the method of EC213, further comprising time sharing each processor element over forward, incremental, and chained transfers.

EC239) the method of EC228, further comprising, for each input sample, selectively updating at least one partial neuron weight according to a predetermined learning rule in response to completion of forward transfer, incremental transfer, and chained transfer corresponding to the input sample.

EC239b) the method of EC239, wherein the predetermined learning rule conforms to a continuous propagation gradient descent process.

EC240) the method of EC239, further comprising, for each forward pass, selectively using weight information provided by a most recent selective update of the at least one partial neuron weight.

EC241) the method of EC240, further comprising initiating a forward pass of a particular iteration of the training workload regardless of whether a selective update of at least one partial neuron weight corresponding to a previous iteration of the training workload has occurred.

EC242) the method of EC240, further comprising selectively initiating a forward pass of a particular iteration of the training workload regardless of whether an incremental pass of a previous iteration of the training workload has begun.

EC243) the method of EC240, further comprising, in the at least one compute engine, after performing at least a portion of the forward pass for a previous iteration of the training workload and before performing a portion of the selective updating of the at least one partial neuron weights corresponding to the previous iteration of the training workload, performing at least a portion of the forward pass for a subsequent iteration of the training workload.

EC244) the method of EC240, further comprising selectively performing an incremental transfer and a chained transfer on each input sample based at least in part on the recalculated activations based at least in part on the first portion of neuron weights.

EC245) the method of EC244, wherein the first partial neuron weights are partial neuron weights resulting from a most recent selective update.

EC245b) the method of EC245, wherein the activation of the recalculation does not need to be stored between calculations, thereby reducing the total memory required for a given system training configuration.

EC245c) the method of EC239, EC240, EC241, or EC242, wherein the parallel layer training is capable of achieving a predetermined accuracy goal with faster convergence speed, thereby reducing the total training time required for a given system training configuration.

EC245d) the method of EC239, EC240, EC241, or EC242, wherein the concurrent layer training is capable of improving the accuracy of a given total training time and system training configuration.

EC246) the method of EC240, further comprising, in each computing element, selectively performing a portion of incremental transfer and a portion of chain transfer on each input sample based at least in part on the recalculated activations based at least in part on the first portion neuron weights.

EC247) the method of EC246, wherein the first partial neuron weights are partial neuron weights resulting from a most recent selective update.

EC300) a system comprising:

means for selectively communicating fabric packets, in each processor element of a fabric of processor elements, with other processor elements of the fabric, each processor element including a fabric router and a compute engine capable of performing data flow-based and instruction-based processing; and

Means for selectively performing processing in each compute engine based on the virtual channel specifier and the task specifier for each fabric packet received by the compute engine.

EC300b) a system comprising:

means for selectively communicating fabric packets, in each processor element of a fabric of processor elements, with other processor elements of the fabric, each processor element including a fabric router and a compute engine; and

Means, in each compute engine, for selectively performing data flow processing and instruction processing based on the virtual channel specifier and the task specifier of each fabric packet received by the compute engine, respectively.

EC300c) the system of EC300, wherein the processing is according to a dataflow graph.

EC300d) the system of EC300, further comprising means for executing the workload, including primarily data stream based processing and minimally instruction based processing.

EC300e) the system of EC300d, wherein the system implements a Long Short Term Memory (LSTM) neural network model.

EC300f) the system of EC300, further comprising means for executing a workload, comprising primarily instruction-based processing and minimally data-stream based processing.

EC300g) the system of EC300, wherein the system is implemented at least in part using wafer-level integration.

EC300h) the system of EC300, wherein the structure of the processor elements is implemented at least in part using VLSI manufacturing.

EC301) the system of EC300, wherein the virtual channel specifier selects respective routing paths that are independent in the fabric.

EC301b) the system of EC300, wherein the virtual channel specifier selects a routing path in a fabric to perform the multicast.

EC301c) the system of EC300, wherein the virtual channel specifier selects a routing path in a fabric to perform load splitting.

EC302) the system of EC300, wherein the task specifier selects one or more operations to perform.

EC303) the system of EC300, wherein the structure comprises a 2D array of processor elements.

EC303b) the system of EC300, wherein the fabric comprises a processor element interconnection topology selected from the group consisting of fully connected, star, ring, array, mesh, hypercube, torus, and tree.

EC303c) the system of EC300, wherein the fabric comprises processor element interconnect topology dimensions selected from the group consisting of 1D, 2D, 3D, and dimensions greater than 3D.

EC304) the system of EC300, wherein the system is capable of executing a machine learning workload.

EC305) the system of EC300, wherein the system is trained to execute inference applications.

EC305b) the system of EC300, wherein the system executes an inference application.

EC306) the system of EC300, wherein the system implements a deep neural network, the deep neural network trained to perform object classification and/or detection.

EC307) the system of EC300, wherein the system implements a deep neural network, trains the deep neural network to perform an inference application selected from the group consisting of text translation, optical character recognition, image classification, facial recognition, scene recognition of an autonomous automobile, voice recognition, data analysis of high-energy physics, and drug discovery.

EC308) the system of EC300, wherein the fabric is organized into a plurality of peripheral processor elements and a plurality of internal processor elements, and each internal processor element is respectively coupled to at least four other processor elements of the plurality of processor elements in at least four logical directions.

EC309) the system of EC300, wherein each compute engine comprises a memory, a data path, and a mixed data stream and instruction execution controller.

EC310) the system of EC309, wherein each compute engine operates according to a multi-stage compute engine pipeline having a plurality of compute engine pipeline stages.

EC311) the system of EC309, wherein the instruction execution controller comprises an instruction sequencer implemented using one or more of microcode, a PLA, one or more counters, and a gate-level state machine.

EC312) the system of EC309, wherein each compute engine further comprises a register file, an instruction decoder, an instruction cache, and a data cache.

EC312b) the system of EC309, wherein each compute engine further comprises a register file, an instruction decoder, an instruction buffer, and a data buffer.

EC313) the system of EC300, wherein:

each compute engine is configured to execute a predefined set of base operations in response to receiving a corresponding base instruction selected from a predefined set of native instruction code; and the system further comprises:

a training workload comprising:

A first set of machine code selected from a native instruction set for mapping at least a portion of the neurons onto a compute engine of a processor element, the mapping comprising managing at least one partial neuron weight,

a second set of machine code selected from the native instruction set to perform a forward pass based at least in part on the at least one partial neuron weight to propagate activations in a forward logical direction, the forward pass initiated in response to the input sample,

a third set of machine code selected from the native instruction set for performing an incremental pass in a backward logical direction to generate an increment, the incremental pass initiated in response to completion of the forward pass,

a fourth set of machine code selected from the native instruction set for performing chain passing to compute gradients based on the deltas, an

A fifth set of machine code selected from the native instruction set for performing selective updating of at least one partial neuron weights according to a predetermined learning rule and based at least in part on the deltas; and is

Wherein each compute engine includes memory for at least one partial neuron weight.

EC313a) the system of EC313, wherein each primitive instruction is executed according to the task specifiers of a respective one of the fabric packets.

EC313b) the system of EC313, wherein the structure comprises a 2D array of processor elements, the processor elements comprising first, second, third, and fourth physical directions, the first and second physical directions being co-linear and opposite, the third and fourth physical directions being co-linear and opposite, the first and third physical directions being orthogonal, and the forward logical direction being in the first physical direction and the backward logical direction being in the second physical direction.

EC313c) the system of EC313, wherein the training workload further comprises a sixth set of machine code selected from the native instruction set for executing the non-linear activation function.

EC313d) the system of EC313c, wherein the nonlinear activation function is selected from the group consisting of sigmoid, tanh, and ReLU.

EC314) the system of EC313, wherein the mapping is according to the initialization structure to enable partitioning of the neural network into a plurality of layers, the neuron is a first neuron of a plurality of neurons of the neural network, the first neuron is included in a first layer of the plurality of layers, and each of the plurality of neurons is mapped in a distributed manner on a plurality of processor elements of the structure.

EC315) the system of EC314, wherein the mapping completes all transfers for each layer in a same amount of time from each input sample of the training set.

EC315b) the system of EC314, wherein the mapping completes all transfers for each layer in a same predetermined amount of time from each input sample of the training set.

EC315c) the system of EC314, wherein the mapping completes all transfers for each layer within the same time period determined in real time from each input sample of the training set.

EC316) the system of EC314, wherein the multiple layers operate as a logical structure pipeline comprising logical structure pipeline stages, each logical structure pipeline stage comprising completing all transfers for each layer, completion for each layer taking a time step comprising the same amount of time.

EC316b) the system of EC314, wherein each of the multiple layers operates as a logical structure pipeline stage of a respective logical structure pipeline for each transfer, completion of each layer taking a time step comprising the same amount of time.

EC317) the system of EC314, wherein the neuron weights are selectively updated in the first plurality of processor elements on the multiple levels as each input sample of the training set flows through at least the first plurality of processor elements on the multiple levels.

EC317b) the system of EC316, wherein the neuron weights are selectively updated in the first plurality of processor elements on the multiple levels as each input sample of the training set flows through at least the first plurality of processor elements on the multiple levels, and the streaming and updating are performed for each time step of the multiple time steps.

EC318) the system of EC316, further comprising a digital clock, and wherein the time step is an integer multiple of a clock period of the digital clock.

EC318b) the system of EC316, further comprising a digital clock, and wherein the time step is a variable amount of time.

EC318c) the system of EC318 or EC318b, wherein the time step is determined in real time.

EC319) the system of EC314, further comprising:

means for operating each compute engine according to a multi-stage compute engine pipeline having a plurality of compute engine pipeline stages, a compute engine machine cycle being the time to complete each compute engine pipeline stage, and a compute engine pipeline cycle being the time to complete the plurality of compute engine pipeline stages;

wherein the compute engine machine cycle comprises a first multiple of a clock cycle of the digital clock;

wherein the plurality of layers operate as a logical structure pipeline comprising logical structure pipeline stages, each logical structure pipeline stage comprising a time to complete all transfers for each layer, the time step comprising a time to complete each logical structure pipeline stage; and is

wherein the time step comprises a second multiple of the compute engine pipeline cycle.

EC320) the system of EC319, wherein the first multiple is 1.

EC321) the system of EC319, wherein the second multiple is hundreds to thousands.

EC322) the system of EC316, further comprising means for selectively updating at least one partial neuron weight within the first plurality of processor elements in response to a change in backward propagated data within the first plurality of processor elements for each time step of the plurality of time steps while forward propagation of activations is ongoing.

EC323) the system of EC316, further comprising means for selectively updating at least one partial neuron weight at each time step in the plurality of time steps.

EC323b) the system according to EC317, EC317b, EC322, or EC323, wherein the selectively updating is in accordance with a continuous propagation gradient descent process.

EC324) the system of EC314, wherein the neural network comprises a thousand layers.

EC325) the system of EC314, wherein the plurality of neurons comprises billions of neurons.

EC325b) the system of EC314, wherein the plurality of neurons comprises millions of neurons.

EC325c) the system of EC314, wherein the neural network comprises, for at least some of the plurality of neurons, at least 10 weights per neuron.

EC325d) the system of EC314, wherein the neural network comprises, for at least some of the plurality of neurons, at least 1000 weights per neuron.

EC326) the system of EC314, wherein the neural network comprises billions of weights per layer.

EC326b) the system of EC314, wherein the neural network comprises millions of weights per layer.

EC327) the system of EC314, further comprising, for each layer of the neural network, weighting the input activations to create partial sums, accumulating the partial sums to generate output activations for the layer, and wherein the accumulated weighted partial sums represent neurons and associated synapses of the neural network.

EC327b) the system of EC327, wherein each weight corresponds to a synapse, each partial sum corresponds to a stimulus, the cumulative weighted partial sum corresponds to a total stimulus, and each output activation of the layer corresponds to a neuron output.

EC328) the system of EC313, further comprising means for performing an iteration of a training workload on each of a plurality of input samples that collectively comprise a training set.

EC329) the system of EC328, wherein the predetermined learning rule specifies updating the at least one partial neuron weight after all transfers of the entire training set are completed.

EC329b) the system of EC329, wherein the predetermined learning rule conforms to a stochastic gradient descent process.

EC329c) the system of EC329, wherein the predetermined learning rule conforms to a small batch gradient descent process.

EC329d) the system of EC329, wherein the predetermined learning rule conforms to a continuous propagation gradient descent process.

EC330) the system of EC328, further comprising means for dividing the training set into a plurality of so-called mini-batches, and the predetermined learning rule specifies that the at least one partial neuron weight is updated after all transfers of input samples included in each mini-batch are completed.

EC331) the system of EC328, further comprising means for dividing the training set into a plurality of so-called mini-batches, and the predetermined learning rule specifies that the at least one partial neuron weight is updated after all transfers of each input sample of each mini-batch are completed.

EC331b) the system of EC331, wherein the predetermined learning rule conforms to a continuous propagation gradient descent process.

EC332) the system of EC331, wherein the forward transfer comprises weight updates within the first plurality of processor elements while mini-batch learning is ongoing within the first plurality of processor elements.

EC333) the system of EC313, wherein the memory is contained in memory local to the compute engine.

EC333b) the system of EC313, wherein the memory is included in the compute engine.

EC333c) the system of EC313, wherein the memory is a respective memory connected to each compute engine.

EC334) the system of EC313, wherein the memory is capable of storing a 2D matrix data structure.

EC334b) the system of EC313, wherein the memory is capable of storing a multidimensional data structure.

EC334c) the system of EC313, wherein the memory is capable of storing a tensor data structure comprising dimensions selected from the group consisting of 2D, 3D, 4D, 5D, and 6D.

EC335) the system of EC313, wherein each compute engine further comprises memory for gradient accumulation, forward partial sums, delta partial sums, and forward pass activations.

EC336) the system of EC314, wherein data propagates toward the logical end of the neural network during forward transfers and loops back in the backward logical direction during incremental transfers and chained transfers.

EC337) the system of EC313, wherein the forward transfer saves activations for use by the incremental transfer and the chained transfer.

EC338) the system of EC313, further comprising means for time sharing each processor element over forward, incremental, and chained transfers.

EC339) the system of EC328, further comprising, for each input sample, means for selectively updating at least one partial neuron weight according to a predetermined learning rule in response to completion of the forward pass, the incremental pass, and the chained pass corresponding to the input sample.

EC339b) the system of EC339, wherein the predetermined learning rule is in accordance with a continuous propagation gradient descent process.

EC340) the system of EC339, further comprising, for each forward pass, means for selectively using weight information provided by a most recent selective update of at least one partial neuron weight.

EC341) the system of EC340, further comprising means for initiating a forward pass of a particular iteration of the training workload regardless of whether a selective update of at least one partial neuron weight corresponding to a previous iteration of the training workload has occurred.

EC342) the system of EC340, further comprising means for selectively initiating a forward pass of a particular iteration of the training workload, regardless of whether an incremental pass of a previous iteration of the training workload has begun.

EC343) the system of EC340, further comprising, in the at least one compute engine, means for performing at least a portion of the forward pass for a subsequent iteration of the training workload after performing at least a portion of the forward pass for a previous iteration of the training workload and before performing a portion of the selective update of the at least one partial neuron weights corresponding to the previous iteration of the training workload.

EC344) the system of EC340, further comprising means for selectively performing an incremental transfer and a chained transfer for each input sample based at least in part on the recalculated activations based at least in part on the first portion neuron weights.

EC345) the system of EC344, wherein the first partial neuron weights are partial neuron weights resulting from a most recent selective update.

EC345b) the system of EC345, wherein activation of recalculations does not require storage between calculations, thereby reducing the total memory required for a given system training configuration.

EC345c) the system of EC339, EC340, EC341, or EC342, wherein the parallel layer training is capable of achieving a predetermined accuracy goal with faster convergence speed, thereby reducing the total training time required for a given system training configuration.

EC345d) the system of EC339, EC340, EC341, or EC342, wherein concurrent layer training can improve the accuracy of a given total training time and system training configuration.

EC346) the system of EC340, further comprising, in each computing element, means for selectively performing, for each input sample, a portion of the incremental transfer and a portion of the chain transfer based at least in part on the recalculated activations based at least in part on the first portion neuron weights.

EC347) the system of EC346, wherein the first partial neuron weights are partial neuron weights resulting from a most recent selective update.

EC400) a method comprising:

training a neural network comprising a plurality of ordered connected layers;

wherein the order identifies for each respective layer which other layers precede the respective layer and which other layers follow the respective layer;

wherein each layer includes one or more neurons, each neuron including a weight and connected to at least one of at least one preceding neuron of a preceding layer and at least one subsequent neuron of a succeeding layer; and is

Wherein each neuron is implemented by one or more processing elements, each processing element comprising

at least one coupling to a fabric, via which the processing element is capable of communicating via a plurality of virtual channels,

a first memory capable of storing instructions corresponding to at least a computation of a neuron,

a second memory capable of storing the weight, an

Hardware execution resources capable of executing instructions from a respective first memory and accessing data from a respective second memory.

EC401) the method of EC400, wherein the training comprises:

determining a second activation based on the first activation and the first weight;

determining and saving a second weight based on the first increment and the first weight;

determining a fourth activation based on the third activation and the selected weight, wherein the selected weight is dynamically selected from the first weight and the second weight; and is

Based on the second increment and the selected weight, a third weight is determined and saved.
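The four determinations above can be illustrated with a small, hedged sketch (scalar weights and activations, a toy learning rule, and selection of the most recently stored weight, as in the later entry on dynamic selection, are assumptions used only to make the sequence concrete):

    def f(w, a):                   # toy neuron: activation from weight and input activation
        return w * a

    def update(w, delta, lr=0.1):  # toy learning rule: new weight from delta and old weight
        return w - lr * delta

    first_weight = 0.5
    first_activation = 1.0

    # 1) second activation from the first activation and the first weight (forward)
    second_activation = f(first_weight, first_activation)

    # 2) second weight from the first increment and the first weight (update, then save)
    first_increment = 0.2
    second_weight = update(first_weight, first_increment)

    # 3) fourth activation from the third activation and a dynamically selected weight;
    #    here the selection picks whichever weight was most recently stored.
    third_activation = 0.8
    selected_weight = second_weight          # most recently stored in this sketch
    fourth_activation = f(selected_weight, third_activation)

    # 4) third weight from the second increment and the selected weight (update, then save)
    second_increment = 0.1
    third_weight = update(selected_weight, second_increment)
    print(second_activation, second_weight, fourth_activation, third_weight)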

EC402) the method of EC401, wherein determining the second activation comprises:

receiving a first activation from at least one previous neuron via the structure;

calculating a second activation based at least in part on the first activation and the first weight by executing at least a first instruction stored in a first memory and accessing the first weight in a second memory; and is

Selectively transmitting the second activation to at least one subsequent neuron via the structure.

EC403) the method of EC401, wherein determining and saving the second weight comprises:

receiving, via the structure, a first increment from at least one subsequent neuron based in part on the second activation;

calculating a first gradient based at least in part on the first increment and the second activation by executing at least a second instruction stored in the first memory;

calculating a second weight based at least in part on the first gradient, the learning rule, and the first weight by executing at least a third instruction stored in the first memory and accessing the first weight in the second memory; and is

The second weight is stored in a second memory.

EC404) the method of EC402, wherein determining the fourth activation comprises:

receiving a third activation from at least one previous neuron via the structure;

calculating a fourth activation based at least in part on the third activation and the selected weight by executing at least the first instruction and accessing the selected weight in the second memory; and is

Selectively transmitting the fourth activation to at least one subsequent neuron via the structure.

EC405) the method of EC403, wherein determining and saving the third weight comprises:

Receiving, via the fabric, a second increment from the at least one subsequent neuron based in part on the fourth activation;

calculating a second gradient based at least in part on the second increment and the fourth activation by executing at least a second instruction stored in the first memory;

calculating a third weight based at least in part on the second gradient, the learning rule, and the selected weight by executing at least a third instruction stored in the first memory and accessing the selected weight in the second memory; and is

Storing the third weight in a second memory.

EC406) the method of EC404, wherein determining and saving the second weight comprises:

receiving, via the structure, a first increment from at least one subsequent neuron based in part on the second activation;

calculating a first gradient based at least in part on the first increment and the second activation by executing at least a second instruction stored in the first memory;

calculating a second weight based at least in part on the first gradient, the learning rule, and the first weight by executing at least a third instruction stored in the first memory and accessing the first weight in the second memory; and is

The second weight is stored in a second memory.

EC407) the method of EC406, wherein determining and saving the third weight comprises:

Receiving, via the fabric, a second increment from the at least one subsequent neuron based in part on the fourth activation;

calculating a second gradient based at least in part on the second increment and the fourth activation by executing at least a second instruction stored in the first memory;

calculating a third weight based at least in part on the second gradient, the learning rule, and the selected weight by executing at least a third instruction stored in the first memory and accessing the selected weight in the second memory; and is

Storing the third weight in a second memory.

EC408) the method of EC403, wherein the selected weight is dynamically selected according to which of the first weight and the second weight was most recently stored.

EC409) the method of EC401, wherein determining the fourth activation can be performed after determining the second activation and before determining and saving the second weight.

EC410) the method of EC404, wherein selectively transmitting the second activation and the fourth activation is selectively based on respective values of the second activation and the fourth activation.

EC411) the method of EC404, wherein selectively transmitting the second activations and the fourth activations is selectively based on respective absolute values of the second activations and the fourth activations exceeding respective first and second thresholds.
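A minimal sketch of the selective transmission just described, assuming (not prescribed above) a scalar activation, a send function standing in for fabric transmission, and a per-activation threshold: the activation is forwarded to subsequent neurons only when its absolute value exceeds the threshold, suppressing fabric traffic otherwise.

    def selectively_transmit(activation, threshold, send):
        # Transmit the activation to subsequent neurons only if its absolute
        # value exceeds the threshold; otherwise suppress the fabric traffic.
        if abs(activation) > threshold:
            send(activation)
            return True
        return False

    sent = []
    selectively_transmit(0.7, 0.5, sent.append)    # transmitted
    selectively_transmit(0.1, 0.5, sent.append)    # suppressed
    print(sent)                                    # [0.7]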

EC412) the method of EC400, wherein the at least one neuron is implemented by a plurality of processing elements.

EC413) the method of EC405, wherein determining the fourth activation further comprises storing the fourth activation in a second memory, and calculating the second gradient further comprises accessing the fourth activation in the second memory.

EC414) the method of EC407, wherein calculating the second gradient further comprises optionally recalculating the fourth activation based at least in part on the selected weight.

EC415) the method of EC407, wherein calculating the first gradient further comprises optionally recalculating the second activation based at least in part on the first weight.

EC416) the method of EC400, wherein each processing element is capable of performing data stream based processing.

EC417) the method of EC400, wherein each processing element comprises a fabric router.

EC418) the method of EC400, wherein each processing element is capable of selectively communicating fabric packets with other processing elements.

EC419) the method of EC418, wherein each processing element is capable of performing processing according to the virtual channel specifier and the task specifier of each fabric packet received by the processing element.

Selected implementation details

Described herein are embodiments related to neural network training and inference, including deep learning accelerator hardware elements and software elements (see, e.g., fig. 1-4 and the "deep learning accelerator overview" section). The deep learning accelerator includes hardware processing elements (see, e.g., fig. 5-8 and the "processing elements: compute elements and routers" section). The deep learning accelerator implements and/or uses various techniques, such as task launch and terminate (see, e.g., fig. 9-12 and the "task" section), wavelet processing (see, e.g., fig. 13A-15B and the "wavelet" section), task block and unblock (see, e.g., fig. 16 and the "block and unblock" section), neuron smearing (see, e.g., fig. 17-20 and the "neuron smearing" section), fabric vectors, memory vectors, and associated data structure descriptors (see, e.g., fig. 21A-24 and the "vector and data structure descriptor" section), and instruction formats (see, e.g., fig. 21A-24 and the "instruction format" section). The deep learning accelerator may be used in various scenarios (see, e.g., fig. 26A-27E and the "deep learning accelerator example use" section, and fig. 28A-29 and the "example workload map" section). The deep learning accelerator is contemplated in various embodiments (see, e.g., the "other embodiments details" section) and may be implemented in various ways (see, e.g., the "example implementation techniques" section).

Deep learning Accelerator overview

FIG. 1 illustrates selected details of an embodiment of a neural network training and reasoning system using a deep learning accelerator as the neural network system 100. Conceptually, a deep learning accelerator is used to train the neural network. One or more results (e.g., weights) of the training are then used for reasoning. For example, training includes mapping neurons of a neural network onto PEs of a deep learning accelerator. The training data is then applied to the PE. The PEs process the training data (e.g., via forward, incremental, and chain transfers) and update the weights until training is complete. The weights are then used for reasoning.

Referring to the figure, deep learning accelerator 120 includes FPGA 121 and PE 122, which are enabled to communicate with each other as shown by coupling 123. The placement server 150 (including CPU 151 and CRM 152) is coupled to a connection server 160 (including CPU 161, CRM 162, and NIC 164). Connection server 160 is capable of communicating with FPGA 121 via NIC 164 and 100Gb 112. Autonomous vehicle 130 includes CPU 131, CRM 132, IE 133, and camera 135. The handset 140 includes a CPU 141, CRM 142, IE 143, and camera 145.

According to various embodiments and/or usage scenarios, internet 180 provides coupling (not explicitly shown) between any combination of placement server 150, connection server 160, autonomous vehicle 130, and/or cell phone 140.

Dashed arrow placement 113 conceptually indicates placement information communicated from placement server 150 to PE 122 (e.g., via LAN 111, connection server 160/NIC 164, 100Gb 112, FPGA 121, and coupling 123). In some embodiments and/or usage scenarios, placement 113 is implicit, reflected in initialization information provided to the router elements of PE 122 and the computing elements of PE 122. In some embodiments and/or usage scenarios, a portion of the initialization information of placement 113 is provided to FPGA 121 to configure elements of FPGA 121 for operation with PE 122.

Dashed arrow weights 114 and dashed arrow weights 115 conceptually indicate weight information transmitted from PE 122 to autonomous vehicle 130 and handset 140, respectively (e.g., via coupling 123, FPGAs 121, 100Gb 112, connection server 160/NIC 164, and internet 180). In some embodiments and/or usage scenarios, the weight information is any one or more of all or any portion of the weight information that is directly generated as a result of the training, sub-sampling thereof, quantization thereof, and/or other transformations thereof.

The deep learning accelerator 120 can perform training of the neural network, e.g., by computing weights in response to placement information and training information received via the 100Gb 112. When training is complete, the deep learning accelerator 120 can also provide the weights, as a result of the training, via the 100Gb 112. The weights may then be used for inference, for example, in autonomous vehicle 130 and/or cell phone 140. PE 122 includes a relatively large number of PEs (e.g., 10,000 or more), each capable of independently performing routing and computations associated with training. In some embodiments and/or usage scenarios, PEs 122 are implemented via wafer-level integration, e.g., a respective plurality of PEs implemented on respective dies of a single wafer. FPGA 121 can interface PE 122 to information provided via the 100Gb 112. The interfacing includes conversion between modified Ethernet frames (as conveyed via the 100Gb 112) and wavelets (as conveyed via coupling 123).

Placement server 150 can programmatically determine placement of neurons (e.g., as shown by placement 113) via one or more placement programs. The placing program is stored in the CRM 152 and executed by the CPU 151. The placement information is transmitted to the connection server 160 via the LAN 111. One example of placement is mapping logical neurons of a neural network onto physical memory and execution hardware resources (e.g., PEs 122).

Connection server 160 is capable of communicating with FPGA 121 and indirectly with PE 122 via FPGA 121/coupling 123, via NIC 164, and via programmed control of drivers. In various embodiments and/or usage scenarios, the communication includes placement information (e.g., from placement server 150), training information (e.g., from a source not shown but accessible via internet 180), and/or training results (e.g., weights from PEs 122). The driver is stored in the CRM 162 and executed by the CPU 161.

Autonomous vehicle 130 can use weights 114 to perform inferences using IE 133 that are programmatically controlled and/or assisted by CPU 131 executing programs stored in CRM 132. Inference is optionally and/or selectively performed using information obtained from the camera 135. For example, an automobile may operate as an autonomous vehicle. The car includes a camera that can provide video to the inference engine. The inference engine is capable of identifying objects related to car navigation, such as lanes, obstacles, and other objects. The car can be navigated using the results of the object recognition. Any combination of providing, identifying, and navigating is controlled and/or performed, at least in part, via one or more CPUs executing programs stored in the CRM.

Handset 140 can use weights 115 to perform inferences using IE 143 that are programmatically controlled and/or assisted by CPU 141 executing programs stored in CRM 142. Inference is optionally and/or selectively performed using information obtained from the camera 145. For example, a cell phone may be operable to post a tagged photograph on a social networking website. The handset includes a camera capable of providing image data to the inference engine. The inference engine can mark objects in the image (e.g., by type such as "cat," "dog," etc., or by name such as "Bob," "Mary," etc.). The mobile phone can post the image and the tagging result to a social networking website. Any combination of providing, marking, and posting is controlled and/or executed, at least in part, via one or more CPUs executing programs stored in the CRM.

In various embodiments and/or usage scenarios, all or any portion of the weight information determined via the deep learning accelerator is post-processed outside of the accelerator prior to inference use. For example, all or any portion of the information represented by weights 114 and/or weights 115 is processed in whole or in part by placement server 150 before inference use by autonomous vehicle 130 and/or cell phone 140. In various embodiments and/or usage scenarios, examples of post-processing include quantizing weights 114 and/or weights 115 (e.g., converting from a floating point number format to a fixed point number format). In various embodiments and/or usage models, cameras 135 and 145 are respective examples of sensors that provide input to IEs 133 and 143. Other examples of sensors are position sensors, orientation sensors, magnetic sensors, light sensors, and pressure sensors.

CPU 151 includes one or more CPUs compatible with a corresponding instruction set architecture. CPU 151 is capable of fetching and executing instructions from CRM 152 according to an instruction set architecture. The CPU 161 includes one or more CPUs compatible with the respective instruction set architecture. The CPU 161 is capable of fetching and executing instructions from the CRM 162 according to an instruction set architecture. In some embodiments, at least one instruction set architecture of CPU 151 is compatible with at least one instruction set architecture of CPU 161.

The CPU 131 includes one or more CPUs compatible with the respective instruction set architecture. The CPU 131 is capable of fetching and executing instructions from the CRM 132 according to an instruction set architecture. CPU 141 includes one or more CPUs compatible with a corresponding instruction set architecture. CPU 141 is capable of fetching and executing instructions from CRM 142 according to an instruction set architecture. In some embodiments, at least one instruction set architecture of CPU 131 is compatible with at least one instruction set architecture of CPU 141. In some embodiments, any one or more of CPU 151, CPU 161, CPU 131, and CPU 141 have an instruction set architecture that is compatible with each other.

At least respective portions of each of CRM 152, CRM 162, CRM 132, and CRM 142 are non-volatile and include any one or more of flash memory, magnetic memory, optical memory, phase change memory, and other non-volatile memory technology elements.

In various embodiments and/or usage scenarios, the IEs 133 and/or 143 include one or more inference engines that can use weight information determined by the deep learning accelerator 120 (and conceptually indicated by the weights 114 and/or the weights 115). In various embodiments and/or usage scenarios, IE 133 operates in conjunction with and/or under the control of programs executed by CPU 131 and stored in CRM 132. In various embodiments and/or usage scenarios, IE 143 operates in conjunction with and/or under the control of programs executed by CPU 141 and stored in CRM 142. In various implementations and/or usage scenarios, all or any portion of IEs 133 and/or IEs 143 are implemented via various combinations of HW and/or SW techniques. In some embodiments, all or any portion of the functionality provided by IEs 133 and/or IEs 143 is implemented using techniques implemented by deep learning accelerator 120 and/or associated with deep learning accelerator 120. In various embodiments and/or usage scenarios, all or any portion of IEs 133 and/or IEs 143 are implemented differently via techniques including various combinations of conventional CPUs, conventional GPUs, conventional DSPs, conventional FPGAs, and dedicated hardware.

In various embodiments, the 100Gb 112 is variously a 100Gb ethernet coupling for transmitting standard ethernet frames, a 100Gb ethernet coupling for transmitting modified ethernet frames, a 100Gb modified ethernet coupling for transmitting modified ethernet frames, a 100Gb serial coupling other than ethernet technology, or some other relatively high speed serial coupling.

In some embodiments and/or usage scenarios, coupling 123 conveys information in the form of wavelets.

In various embodiments, LAN 111 is implemented using technologies such as Ethernet, fibre channel, and/or other suitable interconnection technologies.

In some embodiments and/or usage scenarios, placement server 150 and connection server 160 are implemented and/or operated as a combined element (e.g., sharing CPU, CRM, and/or NIC resources), as conceptually illustrated by combined server 110. In some embodiments and/or usage scenarios, placement server 150 and connection server 160 are coupled via the internet 180 rather than (or in addition to) LAN 111.

FIG. 2 illustrates selected details of an embodiment of software elements associated with neural network training and reasoning using a deep learning accelerator as the neural network SW 200. According to an embodiment, the placement server SW 210 includes a neuron-to-PE mapping SW 212 and other elements not shown. In various embodiments and/or usage scenarios, all or any portion of placement server SW 210 is stored in CRM 152 and is executable by CPU 151 of fig. 1. One or more programs of the neuron-to-PE mapping SW 212 enable determination of placement of neurons of the neural network onto particular ones of the PEs 122 of fig. 1.

According to an embodiment, connection server SW 220 includes a 100Gb NIC driver 224, a training information provider SW 225 and a weight receiver SW 226, among other elements not shown. In various embodiments and/or usage scenarios, all or any portion of connection server SW 220 is stored in CRM 162 and is executable by CPU 161 of fig. 1. One or more programs of the 100Gb NIC driver 224 enable communication between the connection server 160 and the deep learning accelerator 120 of fig. 1 (via the NICs 164 and 100Gb 112, also fig. 1). One or more programs of the training information provider SW 225 enable determination of training information for an application under control of the 100Gb NIC driver 224 for communication with the deep learning accelerator 120 of fig. 1 (via the NICs 164 and 100Gb 112). In various embodiments and/or usage scenarios, the training information is variously determined from, for example, non-volatile memory accessible to connection server 160 and/or internet 180 of fig. 1. One or more programs of the weight receiver SW 226 can receive weight information (via the NICs 164 and 100Gb 112) under the control of the 100Gb NIC driver 224 as determined by the deep learning accelerator 120.

In various embodiments and/or usage scenarios, miscellaneous SW on FPGA 250 conceptually represents SW executed by one or more CPUs included in FPGA 121 (fig. 1). The CPU of the FPGA is hard coded, for example, during manufacture of one or more elements of the FPGA 121, and/or soft coded during initialization of one or more elements of the FPGA 121. In various embodiments and/or usage scenarios, all or any portion of miscellaneous SW on FPGA 250 and/or representations thereof are stored in non-volatile memory included in FPGA 121 and/or accessible by connection server 160. In various embodiments and/or usage scenarios, miscellaneous SW on FPGA 250 can perform various housekeeping functions, such as functions related to initialization and/or debugging of PE 122 of fig. 1.

In various embodiments and/or usage scenarios, task SW on PE 260 conceptually represents a distributed SW performed as tasks on various PEs of PE 122. In various embodiments and/or usage scenarios, all or any portion of tasks SW on PE 260 and/or representations thereof are stored in non-volatile memory included in PE 122 and/or are accessible by connection server 160. In various embodiments and/or usage scenarios, task SW on PE 260 can perform processing of training data, e.g., determining weights of a neural network (e.g., via forward, incremental, and chained transfers).

According to an embodiment, the autonomous vehicle SW 230 includes a camera SW 232, an inference engine SW 233, and a navigation SW 234, as well as other elements not shown. In various embodiments and/or usage scenarios, all or any portion of autonomous vehicle SW 230 is stored in CRM 132 and is executable by CPU 131 of fig. 1. One or more programs of camera SW 232 can control and/or operate the camera 135 of fig. 1 to provide video information to the inference engine SW 233. One or more programs of inference engine SW 233 can control and/or operate IE 133 of fig. 1 to determine navigation information, e.g., objects to avoid and/or lanes to follow, from the video information. One or more programs of the navigation SW 234 are capable of navigating the autonomous vehicle 130 in response to the navigation information.

According to an embodiment, the handset SW 240 includes a still camera SW 242, an inference engine SW 243, a publication SW 244, and other elements not shown. In various embodiments and/or usage scenarios, all or any portion of the cell phone SW 240 is stored in the CRM 142 and is executable by the CPU 141 of fig. 1. One or more programs of the still camera SW 242 can control and/or operate the camera 145 of fig. 1 to provide still image information to the inference engine SW 243. One or more programs of inference engine SW 243 can control and/or operate IE 143 of fig. 1 to determine tag information from still image information. One or more programs of publication SW 244 can post to a social networking website in response to the still image information and/or the tag information.

In various embodiments and/or usage scenarios, any one or more of SW set placement server SW 210, connection server SW 220, autonomous vehicle SW 230, and/or cell phone SW 240 optionally and/or selectively include one or more operating system elements, e.g., one or more real-time operating systems, one or more non-real-time operating systems, and/or one or more other control programs, to coordinate the elements of each respective SW set.

FIG. 3 illustrates selected details of an embodiment of processing associated with training a neural network and performing inference using the trained neural network using a deep learning accelerator as neural network training/inference 300. As shown, in act 310, neurons of a neural network are placed, e.g., assigned and/or associated with particular PE resources. Then, in act 320, FPGA resources are initialized in preparation for training of the neural network. The PE resources are then initialized in act 330 in preparation for training of the neural network.

After the FPGA resources and PE resources are initialized in preparation for training, training data is applied to the PEs in act 340. In act 350, the PE resources process the training data. Then, in act 360, a check is made to determine whether training is complete, e.g., because application of the training data is complete and/or one or more completion criteria are met (e.g., an inference error below a predetermined limit). If not, flow returns to act 340 to apply further training data. In some cases, training is deemed incomplete, and control instead passes to another action (not shown) to enable changing the neural network (e.g., adding a layer of neurons, removing a layer of neurons). The altered neural network is then trained according to acts 310, 320, 330, 340, 350, and 360.

If training is complete, flow continues at act 370 to provide the weights, as a result of the training, for inference. In some implementations and/or usage scenarios, the weights are quantized, e.g., converted to an integer data format. In some implementations and/or usage scenarios, the integer data format is a reduced precision number format (e.g., 8-bit or 16-bit). The weights are then provided to one or more inference engines and used to make inferences in act 380.
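As a concrete illustration of the quantization mentioned for act 370, the following sketch converts trained floating-point weights to a reduced-precision integer format. It is a minimal example assuming symmetric linear quantization to 8 bits; the document does not specify the quantization scheme, so the function name and scaling rule are illustrative only.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric linear quantization of trained FP weights to int8.

    One plausible form of the float-to-integer conversion described for act 370;
    the actual scheme used by the accelerator is not specified here.
    """
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale  # an inference engine would use q * scale to approximate the weights
```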

In various embodiments and/or usage scenarios, the inference engine corresponds to one or more inference applications, such as text translation, optical character recognition, image classification, facial recognition, scene recognition for auto-driven vehicles, voice recognition, data analysis for high-energy physics, and drug discovery.

In various embodiments and/or usage scenarios, PE resources correspond to, for example, PE 122 of fig. 1, and FPGA resources correspond to, for example, FPGA 121 of fig. 1.

In various embodiments and/or usage scenarios, any one or more of all or any portion of the actions of the neural network training/reasoning 300 are performed by and/or associated with all or any portion of any one or more elements of the neural network system 100 of fig. 1 and/or the neural network SW 200 of fig. 2. For example, all or any portion of act 310 is performed by placement server 150 via performing neuron-to-PE mapping SW 212. As another example, all or any portion of act 320 is performed by placement server 150 via performing neuron-to-PE mapping SW 212. As another example, all or any portion of act 330 is performed by placement server 150 via performing neuron-to-PE mapping SW 212. As yet another example, all or any portion of act 330 is performed by PE 122 via execution of task SW on PE 260. As another example, all or any portion of act 340 is performed by connection server 160 via executing training information provider SW 225. As yet another example, all or any portion of act 350 is performed by PE 122 by executing task SW on PE 260. As another example, all or any portion of act 350 may be performed by combined server 110, placement server 150, and/or connection server 160. As another example, all or any portion of act 370 is performed by connection server 160 via executing weight receiver SW 226. As another example, all or any portion of act 370 is performed by FPGA 121 via execution of miscellaneous SW on FPGA 250. As another example, all or any portion of act 380 is performed by IE 133, e.g., under the control of inference engine SW 233. As another example, all or any portion of act 380 is performed by IE 143, e.g., under the control of inference engine SW 243.

In various embodiments and/or usage scenarios, any one or more of all or any portion of the actions of the neural network training/reasoning 300 are performed in conjunction with the communication of information between the various elements of the neural network system 100 of fig. 1. For example, the various actions of neural network training/reasoning 300 are performed at least in part via NICs 164 and 100Gb 112, which communicate information between connection server 160 and FPGA 121. As another example, the various actions of the neural network training/reasoning 300 are performed in conjunction with the FPGA 121 and coupling 123 communicating information between the connection server 160 and the PEs 122. As another example, the various actions of the neural network training/reasoning 300 are performed in conjunction with any one or more of the placement server 150, the connection server 160, the autonomous vehicle 130, and the mobile phone 140, communicating information enabled at least in part by the internet 180.

FIG. 4 illustrates selected details of an implementation of a deep learning accelerator as deep learning accelerator 400. Each of the PE 499 elements is coupled to other elements of the PE 499 elements. Two of the PE elements (PE 497 and PE 498) are shown with unique identifiers and are otherwise respectively identical to instances of PE 499. PE 497 is shown with identifiers for each of its four couplings: three to other PEs (north coupling 430, east coupling 431 with PE 498, and south coupling 432) and one to an I/O FPGA (west coupling 433); PE 497 is otherwise identical to the other PE elements shown. In some implementations and/or usage scenarios, the couplings are logical and/or physical. In various embodiments and/or usage scenarios, the couplings may be used to convey wavelets, backpressure information, or both. In various embodiments and/or usage scenarios, all or any portion of the physical couplings are to physically adjacent PEs. In some embodiments and/or usage scenarios, the PEs are physically implemented in a 2D mesh. In some embodiments and/or usage scenarios, PEs are physically implemented in a 2D grid of aligned rectangles, and physically adjacent PEs correspond to PEs sharing a horizontal boundary (north/south PEs relative to each other) and PEs sharing a vertical boundary (east/west PEs relative to each other).

In some embodiments and/or usage scenarios, an array of identical instances of the same ASIC is formed on a wafer, and each of the identical ASICs includes multiple identical instances of the same PE (e.g., PE 499), forming a wafer (e.g., wafer 412) usable in wafer-level integration techniques. In some embodiments and/or usage scenarios, a peripheral portion of the PEs is coupled to I/O FPGAs 420. Example ASICs are shown as: ASIC 410, which includes a column-organized section of PEs (e.g., replicated in one dimension to form a wafer); and ASIC 411, which includes a square- or rectangle-organized section of PEs (e.g., replicated in two dimensions to form a wafer). Other organizations of ASICs on a wafer are contemplated.

In some embodiments and/or usage scenarios, neurons associated with a layer in a neural network are generally placed on the PE 499 elements in a left-to-right manner, with earlier layers (e.g., the input layer) on the left and subsequent layers (e.g., the output layer) on the right. Accordingly, the data flow during training is conceptually shown as dashed arrows forward 401, delta 402, and chain 403. During forward 401, stimuli are applied to the input layer, and activations flow from each layer to the next, eventually reaching the output layer and producing a forward result. During delta 402, deltas (e.g., differences between the forward results and the training output data) are propagated in the reverse direction. During chain 403, gradients are computed based on the deltas generated during delta 402 (e.g., with respect to the weights in the neurons). In some embodiments and/or usage scenarios, the processing of delta 402 substantially overlaps the processing of chain 403.
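To make the three passes concrete, the sketch below walks a small fully connected network through forward 401, delta 402, and chain 403 for tanh units and a squared-error loss. It is an illustrative restatement of standard backpropagation, not the accelerator's wavelet-based implementation; the layer sizes, activation function, and loss are assumptions.

```python
import numpy as np

def forward(x, Ws):
    """Forward 401: activations flow layer to layer (input on the left)."""
    acts = [x]
    for W in Ws:
        acts.append(np.tanh(W @ acts[-1]))
    return acts

def delta_and_chain(acts, Ws, target):
    """Delta 402 propagates errors backward; chain 403 forms the weight gradients."""
    grads = [None] * len(Ws)
    d = (acts[-1] - target) * (1.0 - acts[-1] ** 2)   # delta at the output layer
    for l in reversed(range(len(Ws))):
        grads[l] = np.outer(d, acts[l])               # chain: gradient w.r.t. Ws[l]
        if l > 0:
            d = (Ws[l].T @ d) * (1.0 - acts[l] ** 2)  # delta for the preceding layer
    return grads

# Example: one training step for a 4-3-2 network.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
acts = forward(rng.standard_normal(4), Ws)
grads = delta_and_chain(acts, Ws, target=np.zeros(2))
Ws = [W - 0.01 * g for W, g in zip(Ws, grads)]        # gradient-descent weight update
```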

In some implementations and/or usage scenarios, deep learning accelerator 400 is an implementation of deep learning accelerator 120 of fig. 1. In some embodiments and/or usage scenarios, an individual PE 499 element corresponds to an individual PE of PE 122 of fig. 1. In some embodiments and/or usage scenarios, each ASIC 410 element, or alternatively each ASIC 411 element, corresponds to all or any portion of PE 122 implemented as a separate integrated circuit. In some embodiments and/or usage scenarios, each ASIC 410 element or, optionally, each ASIC 411 element corresponds to a portion of (optionally the same as) PE 122 implemented via a respective die of a wafer. In some embodiments and/or usage scenarios, the I/O FPGA 420 elements collectively correspond to the FPGA 121 of fig. 1.

In some embodiments and/or usage scenarios, placement of neurons (e.g., associated with a layer in a neural network) onto PE 499 elements is performed by all or any portion, all or part, of placement server SW 210 of fig. 2.

Processing element: computing element and router

FIG. 5 illustrates selected details of an embodiment of a PE of a deep learning accelerator, as PE 500. The PE 500 includes a router 510 and a computing element 520. The router 510 selectively and/or conditionally transfers wavelets between other PEs (e.g., logically adjacent and/or physically adjacent PEs) and the instant PE via couplings 511-516. Router 510 selectively and/or conditionally delivers wavelets to the instant PE via exit ramp 521 and accepts wavelets from the instant PE via entrance ramp 522. The computation element 520 performs computations on the data contained in the wavelets based on instruction address information derivable from the wavelets. The instruction address information is used to identify a start address of a task implemented as instructions stored in a memory of the compute element.

In various embodiments, any one or more of 511-516 are omitted.

In some embodiments and/or usage scenarios, PE 500 is an embodiment of PE 499 of FIG. 4, and/or elements of PE 500 correspond to an implementation of PE 499. In some implementations and/or usage scenarios, north 513, east 515, south 516, and west 511 correspond to north coupling 430, east coupling 431, south coupling 432, and west coupling 433, respectively, of fig. 4.

FIG. 6 illustrates selected details of an embodiment of a router of a PE, as router 600. Consider a plurality of PEs, each comprising a respective router and a respective CE. Router 600 is an instance of one of the respective routers. Router 600 routes wavelets, in accordance with the color information of the wavelets and routing configuration information, to the CE of the PE in which the instant router is included, as well as to other routers. The routed wavelets are variously received by the instant router from the fabric and/or generated by the CE of the PE in which the instant router is included. The routing supports communication between PEs. Stall information is transmitted to prevent overflow of wavelet storage resources in router 600.

Router 600 includes four sets of interfaces, namely data input 610, data output 620, stall output 630, and source 640. Data input 610, data output 620, stall output 630, and source 640 comprise interface elements 611-617, 621-627, 631-637, and 641-647, respectively. Router 600 also includes write decode 651, output 652, generate stall 656, and stall 657, coupled to data input 610, data output 620, stall output 630, and source 640, respectively. Router 600 also includes source 653, which includes Src 670 coupled to generate stall 656. Router 600 also includes data queues 650, control information 660, and router schedule 654. The control information 660 includes Dest 661 and set 662.

The data queue 650 is coupled to write decode 651 to receive input wavelet information and to output 652 to provide output wavelet information. Data queue 650 is also coupled to generate stall 656 to provide data queue validity information. The router schedule 654 is coupled to control information 660 to receive control information related to scheduling the queued wavelets. The router schedule 654 is also coupled to stall 657 to receive stall information associated with scheduling the queued wavelets. The router schedule 654 is further coupled to output 652 to direct the presentation of the queued wavelets at one or more of 621-627. The router schedule 654 is also coupled to generate stall 656 to partially direct the generation of stall information.

In various embodiments, each of interface elements 611-617, 621-627, 631-637, and 641-647 is variously implemented via passive interconnect (e.g., wires without buffering), active interconnect (e.g., wires with selective and/or optional buffering), and couplings with logic to accommodate additional functionality between one instance of router 600 and another instance of router 600.

In some embodiments and/or usage scenarios, router 600 is an implementation of router 510 of fig. 5.

In some embodiments, ones of data inputs 610 and ones of data outputs 620 correspond to portions of west 511, skip west 512, north 513, skip east 514, east 515, south 516, exit ramp 521, and entrance ramp 522. For example, on-ramp 617 corresponds to entrance ramp 522 and off-ramp 627 corresponds to exit ramp 521. As another example, Y+ 615 includes the portion of north 513 that is capable of receiving data, and Y+ 625 includes the portion of north 513 that is capable of transmitting data.

FIG. 7 illustrates selected details of an embodiment of processing associated with a router of a processing element, as wavelet entry 710, stall information 720, and wavelet exit 730. Conceptually, the router accepts as many wavelets as possible from the ingress ports, queues them as needed, and routes as many wavelets as possible per unit time (e.g., per clock cycle) to the egress ports, provided queue space is available. For each respective queue, wavelet entry 710 comprises actions 711-713, corresponding to wavelet entry from (logically and/or physically) neighboring PEs and/or the instant PE. Stall information 720 comprises actions 721-723, corresponding to providing stall information for each respective queue. For each respective queue, wavelet exit 730 comprises actions 731-734, corresponding to wavelet exit to (logically and/or physically) neighboring PEs and/or the instant PE. In some cases, send wavelet 734 sends a wavelet from a single queue entry to a single destination (e.g., unicast) based on the wavelet's color information and the routing configuration information. In some cases, send wavelet 734 sends a wavelet from a single queue entry to multiple destinations (e.g., multicast) based on the wavelet's color information and the routing configuration information. In various embodiments and/or usage scenarios, any one or more of all or any portion of actions 710, 720, and/or 730 correspond to actions performed by and/or associated with all or any portion of any one or more elements of router 600 of fig. 6.
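The per-queue routing decision described above can be summarized with a small sketch: a wavelet's color indexes a routing table that yields one egress port (unicast) or several (multicast). The table contents and port names here are hypothetical; the real routing configuration of router 600 is established per color during initialization.

```python
# Hypothetical per-color routing configuration: color -> list of egress ports.
# One entry models unicast; several entries model multicast (e.g., to a neighbor
# and to the local CE via the off ramp). Port names loosely follow Fig. 6/7.
ROUTE_CONFIG = {
    3: ["X+"],                # unicast: forward east
    7: ["Y-", "off_ramp"],    # multicast: south neighbor and the instant PE's CE
}

def route_one(color, payload, route_config=ROUTE_CONFIG):
    """Return (egress_port, payload) pairs for a single queued wavelet."""
    return [(port, payload) for port in route_config.get(color, [])]

print(route_one(7, 0x1234))   # [('Y-', 4660), ('off_ramp', 4660)]
```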

FIG. 8 illustrates selected details of an embodiment of a computing element of the processing element as CE 800.

In various embodiments, the CE 800 is coupled to the router via an egress ramp 820 and an ingress ramp 860. CE 800 includes a Qdistr 824 that is coupled to receive wavelets via the egress ramp 820. Qdistr 824 is coupled to transmit the wavelets to scheduling information 896. Scheduling information 896 includes Qs 897, active bit 898, and block bit 899.

In various embodiments, Qs 897 includes a queue for each structural color (e.g., to hold wavelets created by other processing elements and associated with the respective color) and for each local color (e.g., to hold wavelets created by CE 800 and associated with the respective color), e.g., Q0 897.0 through QN 897.N. Each of Qs 897 (e.g., Q0 897.0) is associated with a respective one of active bit 898 (e.g., active bit 0 898.0) and a respective one of block bit 899 (e.g., block bit 0 899.0). Each active bit 898 and each block bit 899 contain information about a corresponding one of Qs 897; e.g., block bit N 899.N indicates whether QN 897.N is blocked.

In various embodiments, there are variously a different physical Q for each color, one or more physical Qs for a predetermined subset of colors, and one or more physical Qs for a dynamically determined subset of colors. In various embodiments, there are variously one or more physical Qs of the same size (e.g., each capable of holding the same number of wavelets) and one or more physical Qs of differing sizes (e.g., each capable of holding a different number of wavelets). In various embodiments, one or more physical Qs are variously mapped to virtual Qs, each virtual Q being associated with one or more colors. For example, there are N logical Qs and fewer than N physical Qs. As another example, some of Qs 897 can hold 8 wavelets, while others of Qs 897 can hold 3 wavelets. In some embodiments, the flow rate of one or more colors associated with a particular one of Qs 897 is estimated and/or measured, and the particular one of Qs 897 is able to hold a particular number of wavelets based on the flow rate.

The hash 822 is coupled to Qdistr 824 and, based at least in part on the color of the wavelet (e.g., by applying a hash function to the color), selects a physical queue to store the wavelet. In some implementations, the color associated with the wavelet payload is stored explicitly in the queue with the wavelet payload so that entries in the queue hold the entire wavelet (colored payload). In some implementations, the color associated with the wavelet payload is not explicitly stored in the queue with the wavelet payload such that entries in the queue store the wavelet payload without storing the associated color. The color of the wavelet payload is inferred, for example, from a particular queue in which the wavelet payload is stored.
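A minimal sketch of the color-to-queue mapping performed by hash 822, under the assumption that there are fewer physical queues than colors and that a simple modulo stands in for the (unspecified) hash function:

```python
NUM_PHYSICAL_QUEUES = 8          # assumption: fewer physical Qs than the 32 possible colors

def queue_for_color(color, num_queues=NUM_PHYSICAL_QUEUES):
    """Map a 5-bit wavelet color to a physical queue index.

    The actual hash used by hash 822 is not described; modulo is a stand-in.
    Note that if the color is not stored alongside the payload, it must be
    recoverable from the queue choice, which a many-to-one mapping like this
    one does not by itself guarantee.
    """
    return color % num_queues
```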

In some embodiments, one or more of the active bit 898 and the block bit 899 are implemented as a corresponding bit vector having N entries, one entry per color. In various embodiments, one or more of the activity bit 898 and the block bit 899 are implemented as respective bit fields in a table that includes one entry for each color.

Selector 830 is coupled to scheduling information 896, RF 842, Dec 840, base 890, PC 834, I-Seq 836, and D-Seq 844. The selector 830 can select a wavelet for processing from one of Qs 897. In some embodiments, selector 830 selects a wavelet by selecting one of Qs 897 and selecting the earliest wavelet in the selected queue. In some cases, when Dec 840 indicates that a termination instruction has been decoded, selector 830 selects a new wavelet for processing. In some other cases (e.g., an instruction accessing a structure input), selector 830 selects a new wavelet from one of Qs 897 for processing in response to a queue identifier received from D-Seq 844.

Selector 830 receives the selected wavelet from one of Qs 897 and is able to send one or more of the data values and the index from the selected wavelet to RF 842. In some embodiments, Qs 897 is coupled to data path 852, and the data path is capable of receiving data directly from one of the Qs. Selector 830 is able to read the base address from base 890 and calculate the instruction address to be sent to PC 834 and I-Seq 836. Base 890 stores a base address and is also coupled to D-Seq 844. PC 834 stores the address of the next instruction to fetch. In various embodiments, base 890 and PC 834 are implemented as registers. In some embodiments, D-Seq 844 can request data at one or more addresses of memory 854 and data store 848, based at least in part on the value read from base 890.

I-Seq 836 is coupled to PC 834 and is capable of reading and modifying PC 834 (e.g., incrementing for sequential instructions or modifying non-sequentially for branch instructions). I-Seq 836 is also coupled to memory 854 and is capable of providing an instruction fetch address to memory 854 (e.g., based on PC 834).

Memory 854 is also coupled to Dec 840, data path 852, and D-Seq 844. In response to an instruction fetch address from I-Seq 836, memory 854 can provide the instructions at the instruction fetch address to Dec 840 (an instruction decoder). In various embodiments, memory 854 is capable of providing up to three instructions in response to each instruction fetch address. In some embodiments, the instructions are formatted according to one or more of fig. 25A, 25B, and 25C.

Dec 840 can determine one or more characteristics of the instruction according to various embodiments and/or usage scenarios. For example, Dec 840 can parse the instruction into an opcode (e.g., opcode 2512 of fig. 25A) and zero or more operands (e.g., source and/or destination operands). As another example, Dec 840 can identify instructions based on instruction type (e.g., branch instructions or multiply accumulate instructions, etc.). For yet another example, Dec 840 can determine that an instruction is a particular instruction and activate one or more signals accordingly.

Dec 840 is coupled to selector 830 via terminator 812 and can indicate that a decoded instruction is a terminate instruction that ends a task (e.g., the last instruction of the instructions executed in response to a task launched in response to the selected wavelet).

In some cases, Dec 840 is able to decode branch instructions. Examples of branch instructions include: conditional branch instructions of the PC 834 are conditionally modified and jump instructions of the PC 834 are unconditionally modified. The branch instruction is executed by I-Seq 836 and optionally and/or conditionally modifies PC 834. In some cases, branch instructions implement software control flow (e.g., looping) by conditionally modifying PC 834.

In response to decoding of an instruction (e.g., a multiply accumulate instruction), Dec 840 can send an opcode to data path 852. Dec 840 is coupled to DSR 846 and is capable of sending one or more operand identifiers to DSR 846. Dec 840 is also coupled to D-Seq 844 and is capable of transmitting one or more operand type identifiers to D-Seq 844.

DSR 846 includes registers that hold Data Structure Descriptors (DSDs), and is coupled to and capable of transmitting one or more DSDs to D-Seq 844. In some implementations, the DSR includes a source DSR, a destination DSR, an extended DSR, and a stride register. In response to receiving the operand identifier from Dec 840, DSR 846 is able to read the DSD specified by the operand identifier and send the DSD to D-Seq 844. In various embodiments, DSR 846 is capable of receiving up to two source operand identifiers and one destination operand identifier, reading two source DSRs and one destination DSR, and sending two source DSDs and one destination DSD to D-Seq 844. In some implementations, the CE can explicitly write the DSD to the DSR from memory in response to loading the DSR instruction, and the CE can explicitly write the DSD to memory from the DSR in response to storing the DSR instruction. In some embodiments, DSR 846 is coupled to memory 854 and is capable of receiving data from memory 854 and transmitting data to memory 854.

In some embodiments, DSR 846 includes three sets of DSRs: 12 DSRs for source 0 operands (sometimes referred to as S0DSRs), 12 DSRs for source 1 operands (sometimes referred to as S1DSRs), and 12 DSRs for destination operands (sometimes referred to as DDSRs). In addition, DSR 846 also includes six extended DSRs (sometimes referred to as XDSRs) and six stride registers. In some embodiments, a DSR comprises 48 bits, an XDSR comprises 51 bits, and a stride register comprises 15 bits. In various embodiments, corresponding instructions load 48 bits of data from storage (e.g., data store 848 or memory 854) into a corresponding DSR (e.g., LDS0WDS, LDS1WDS, and LDDWDS instructions load a source 0, source 1, and destination DSR, respectively). In various embodiments, corresponding instructions store 48 bits of data from the respective DSR into storage (e.g., STS0WDS, STS1WDS, and STDWDS instructions store a source 0, source 1, and destination DSR, respectively, into storage). In some embodiments, instructions (e.g., LDXDS) load data from memory into an XDSR, while other instructions (e.g., STXDS) store data from an XDSR into memory. Instructions that move data between memory and XDSRs (e.g., LDXDS and STXDS) access 64 bits of memory and use only the lower 51 bits. In some implementations, instructions (e.g., LDSR) load data from memory into a stride register, while other instructions (e.g., STSR) store data from a stride register into memory. In some embodiments, instructions that move data between memory and the stride registers access 16 bits of memory and use only the lower 15 bits.

D-Seq 844 is also coupled to data store 848, RF 842, and selector 830, and is capable of initiating access to vector data of various sources in response to a DSD received from DSR 846. In some cases (e.g., in response to receiving a DSD describing one of a 1D memory vector, a 4D memory vector, and a circular memory buffer), D-Seq 844 can calculate a sequence of memory addresses to access (e.g., in memory 854 and/or data store 848). In some other cases (e.g., in response to receiving a DSD describing a structure input), D-Seq 844 can read structure data from one of Qs 897 via selector 830. In other cases (e.g., in response to receiving the DSD describing the structure output), D-Seq 844 can begin converting data to wavelets and transmitting the wavelets to the structure via the on-ramp 860. In some implementations, D-Seq 844 can simultaneously access vector data at three sources (e.g., read vector data from memory, read vector data from structure inputs, and write vector data to structure outputs).
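As an illustration of the address sequencing D-Seq 844 performs for a memory-vector DSD, the generator below yields the addresses for a 1D vector described by a base address, a length, and a stride. The field names and the element size are assumptions; 4D vectors and circular buffers would extend the same idea with additional dimensions or wrap-around.

```python
def memory_vector_addresses(base_addr, length, stride, elem_bytes=2):
    """Yield the address sequence for a hypothetical 1D memory-vector DSD.

    Assumes 16-bit (2-byte) data elements and a stride expressed in elements;
    the real 48-bit DSD encoding is not reproduced here.
    """
    for i in range(length):
        yield base_addr + i * stride * elem_bytes

print(list(memory_vector_addresses(0x1000, length=4, stride=3)))
# [4096, 4102, 4108, 4114]
```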

In some embodiments, D-Seq 844 is capable of accessing data (e.g., instructions having one or more input operands and/or one output operand) in one or more registers in RF 842. In some cases, D-Seq 844 can request operands from registers in RF 842. In still other cases, D-Seq 844 can request data from a register (e.g., index) in RF 842 as input to compute a sequence of memory addresses to access from the DSD.

The data path 852 is coupled to the RF 842 and the data store 848. In various embodiments, any one or more of memory 854, RF 842, Qs 897, and data store 848 can provide data to data path 852 (e.g., in response to a request from D-Seq 844) and receive data (e.g., operational results) from data path 852. The data path 852 is also coupled to the router via an ingress ramp 860 and is capable of sending data to the router via the ingress ramp 860. Datapath 852 includes execution resources (e.g., an ALU) that are capable of performing operations (e.g., as specified by the opcode decoded and/or provided by Dec 840, according to an embodiment). In some embodiments, RF 842 includes sixteen general purpose registers, sometimes referred to as GPRs 0-GPRs 15. Each GPR is 16 bits wide and can store integer or floating point data.

In some embodiments, data store 848 is a type of memory that is smaller and more efficient (e.g., lower joules per bit of data read) than memory 854. In some embodiments, data store 848 is a type of storage that has a relatively lower capacity (e.g., retains less information) and relatively lower access latency and/or relatively higher throughput than memory 854. In some cases, more frequently used data is stored in data store 848 and less frequently used data is stored in memory 854. In some embodiments, data storage 848 includes a first address range and memory 854 includes a second non-overlapping address range.

In some embodiments and/or usage scenarios, the elements of this figure correspond to an implementation of computing element 520 of fig. 5, and egress ramp 820 and ingress ramp 860 correspond to exit ramp 521 and entrance ramp 522, respectively, of fig. 5.

The separation and coupling shown in fig. 8 is merely illustrative, as other embodiments are contemplated having different separations and/or couplings. For example, in other embodiments, RF 842 and DSR 846 are combined into one module. In other implementations, DSR 846 and data path 852 are coupled.

Task

FIG. 9 illustrates selected details of an embodiment of processing a wavelet for task start-up as a flow 900. Conceptually, the process includes initiating a task by determining an address at which to begin fetching and executing a task instruction. The address is determined based at least in part on information contained in the wavelet.

In some embodiments, processing of the wavelet for task start is started (start 901) by selecting a ready wavelet from, for example, one or more queues for processing (select ready wavelet for task start 905). In some embodiments, the wavelet is selected based on one or more of: a blocked/unblocked state associated with each queue, an active/inactive state associated with each queue, a color of a previously selected wavelet, and a scheduling algorithm.

After the ready wavelet is selected, the wavelet is examined to determine whether the wavelet is a control wavelet or a data wavelet (control/data? 908). If the wavelet is a control wavelet, the start address of the task associated with the control wavelet is calculated by adding the lower six bits of the wavelet index to the base register (add lower index bits to base register to form instruction address 930). If the wavelet is not a control wavelet, then the wavelet is a data wavelet, and the start address of the task associated with the data wavelet is calculated by adding the base register to the color of the wavelet multiplied by 4 (add color x 4 to base register to form instruction address 920). In either case, the start address of the task corresponds to the start address of the instructions of the task.
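The two address calculations of acts 920 and 930 reduce to a few lines; the sketch below assumes the base register value and the wavelet fields are already available as plain integers.

```python
def task_start_address(base, color, index, is_control):
    """Instruction fetch address for task start (flow 900).

    Control wavelet (act 930): base register + low 6 bits of the wavelet index.
    Data wavelet    (act 920): base register + 4 * color.
    """
    if is_control:
        return base + (index & 0x3F)
    return base + 4 * color

# Example: a data wavelet of color 5 with base register 0x2000 starts at 0x2014.
assert task_start_address(0x2000, color=5, index=0, is_control=False) == 0x2014
```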

Once the starting address of the instruction is calculated, the instruction is fetched from the starting instruction address (the instruction is fetched from memory at instruction address 950). The one or more fetched instructions are decoded and executed (execute fetched instruction 960). Fetch and execute (as shown in acts 950 and 960) continues (without termination 961) until a termination instruction is executed (termination 962), and then processing associated with the initiated task is completed (end 990). In some embodiments, the termination instruction is the last instruction associated with processing the wavelet. After completion of the initiated task, the process optionally and/or selectively continues to process another wavelet for task initiation, beginning at start 901.

According to various usage scenarios, executing (executing the fetched instructions 960) includes executing sequential and/or control flow instructions, and the instruction address for fetching changes accordingly (fetching instructions from memory at instruction address 950).

The ready wavelet selected for task initiation has a particular color. In some embodiments and/or usage scenarios, once a ready wavelet has been selected for task start (select ready wavelet for task start 905), subsequently received wavelets of the particular color (if any) are consumed as operands for executing instructions (execute fetched instruction 960). The consumption of wavelets of the particular color as operands continues until a terminate instruction is fetched and executed (terminate 962).

In some implementations and/or usage scenarios, the actions of all or any portion of flow 900 conceptually correspond to and/or are conceptually related to operations performed by and/or elements of the CE of a PE (e.g., CE 800 of fig. 8). As one example, block bit 899 corresponds to the blocked/unblocked state associated with each queue, and active bit 898 corresponds to the active/inactive state associated with each queue. As another example, portions of act 905 are performed by selector 830. Selector 830 selects the earliest wavelet from one of the ready Qs 897 (e.g., one whose associated block bit 899 is not set and whose associated active bit 898 is set) according to a scheduling policy such as round-robin or pick-from-last. The wavelet selected by selector 830 includes a color and a wavelet payload formatted according to one of fig. 13A and 13B.
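One plausible reading of the selection policy for act 905 is sketched below: skip queues that are blocked, inactive, or empty, and rotate the starting point after each pick (round-robin). The exact policy of selector 830 (round-robin versus pick-from-last) is implementation dependent, so this is illustrative only.

```python
def pick_ready_queue(queues, active, blocked, last_picked):
    """Round-robin choice among ready queues (active, not blocked, non-empty).

    queues: list of lists of wavelets; active/blocked: lists of booleans.
    Returns the chosen queue index, or None if no queue is ready.
    """
    n = len(queues)
    for offset in range(1, n + 1):
        q = (last_picked + offset) % n
        if active[q] and not blocked[q] and queues[q]:
            return q
    return None

# Example: queue 0 is blocked, queue 1 is empty, so queue 2 is picked.
print(pick_ready_queue([[1], [], [7]], [True, True, True], [True, False, False], 0))  # 2
```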

As another example, act 908 is performed by elements of CE 800. If the control bit of the wavelet payload (e.g., control bits 1320 of fig. 13A) is asserted (e.g., as determined by selector 830), then the wavelet is a control wavelet. Action 930 is then performed by CE 800, e.g., by selector 830 adding the contents of base 890 to the 6 least significant bits of low index bits 1321.1 of fig. 13A to form the instruction fetch address for the instructions of the task associated with the control wavelet. The selector 830 then provides the instruction fetch address to PC 834. If the control bit of the wavelet payload (e.g., control bits 1320 of fig. 13A) is de-asserted (e.g., as determined by selector 830), then the wavelet is a data wavelet. Action 920 is then performed by CE 800, e.g., by selector 830 adding the contents of base 890 to the color of the wavelet multiplied by 4 (e.g., corresponding to color 1324 of fig. 13A and color 1344 of fig. 13B) to form the instruction fetch address for the instructions of the task associated with the data wavelet. The selector 830 then provides the instruction fetch address to PC 834.

As another example, act 950 is performed by elements of CE 800, such as PC 834, I-Seq 836, and memory 854. Act 960 is performed by elements of CE 800, e.g., Dec 840, D-Seq 844, memory 854, RF 842, and data path 852, among others. Execution includes execution of a terminate instruction. An example of a terminate instruction is an instruction with a terminate bit asserted. In the context of this example, when Dec 840 decodes the terminate instruction, Dec 840 signals wavelet completion to selector 830 via terminator 812, and selector 830 selects another wavelet for processing, e.g., corresponding to act 905.

In various embodiments and/or usage scenarios, all or any portion of the elements that process wavelets for task start 900 conceptually correspond to all or any portion of the instructions that execute task SW on PE 260 of fig. 2.

In various embodiments and/or usage scenarios, the actions comprising all or any portion of flow 900 conceptually correspond differently to all or any portion of flow 1500 of fig. 15A and/or flow 1550 of fig. 15B. For example, act 905 includes all or any portion of act 1552, and acts 908, 920, 930, 950, and 960 include all or any portion of act 1553.

FIG. 10 illustrates selected details of an embodiment of instruction processing associated with a compute element of a processing element as instruction processing 1000.

In some embodiments and/or usage scenarios, the actions of all or any portion of instruction processing 1000 correspond to or are conceptually related to the operations and/or elements performed by the CE of the PE (e.g., CE 800 of fig. 8).

FIG. 11 illustrates selected details of an embodiment of a flow associated with dependency management via endings, as dependency management 1100.

In some embodiments and/or usage scenarios, all or any portion of the actions of dependency management 1100 correspond to or are conceptually related to the operations and/or elements performed by PEs 122 of fig. 1. In some embodiments and/or usage scenarios, all or any portion of the elements of dependency management 1100 conceptually correspond to all or any portion of the instructions that execute task SW on PE 260 of fig. 2.

FIG. 12 illustrates selected details of an embodiment of a flow associated with activation accumulation and epilogue, followed by partial sum computation and epilogue, as activation accumulation/epilogue and partial sum computation/epilogue 1200.

In some embodiments and/or usage scenarios, all or any portion of the actions of activation accumulation/epilogue and partial sum computation/epilogue 1200 correspond to or are conceptually related to operations performed by and/or elements of PE 122 of fig. 1. In some embodiments and/or usage scenarios, all or any portion of the elements of activation accumulation/epilogue and partial sum computation/epilogue 1200 conceptually correspond to all or any portion of the instructions of task SW on PE 260. In various implementations and/or usage scenarios, an epilogue (e.g., associated with act 1210) is an example of a control wavelet.

Wavelet

Figure 13A illustrates selected details of a sparse wavelet implementation as sparse wavelet 1301. The sparse wavelet 1301 comprises a sparse wavelet payload 1302 and a color 1324. Sparse wavelet payload 1302 includes index 1321, sparse data 1322, and control bits 1320. Index 1321 includes low index bits 1321.1 and high index bits 1321.2.

In some embodiments, sparse data 1322 includes fields that are 16-bit floating point numbers or 16-bit integers. In various cases, sparse data 1322 variously represents weights of the neural network, inputs or stimuli to the neural network, activations of the neural network, or partial sums of the neural network.

In some embodiments, index 1321 comprises a 16-bit field. In some cases, index 1321 is an integer and is an index that explicitly indicates a particular neuron of the neural network. In some embodiments, lower index bits 1321.1 are 6 bits and upper index bits 1321.2 are 10 bits.

In some embodiments, control bits 1320 are 1-bit fields. In some cases, control bits 1320 indicate whether the sparse wavelet payload 1302 triggers control activity or data activity. In some cases, the control activity includes calculating a last activation of a neuron and the data activity includes calculating an activation of a neuron that was not the last activation. In some implementations and/or usage scenarios, the control activity includes an end activity, e.g., associated with any one or more of fig. 11 ending from a previous layer 1110 and/or ending to a next layer 1122 and any one or more of fig. 12 receiving an activation epilogue 1204 and/or a transmission epilogue 1210.

In some embodiments, color 1324 comprises a 5-bit field. In some implementations, the color corresponds to a virtual channel on a shared physical channel, e.g., via routing according to the color. In some cases, the color is used for a specific purpose, e.g., to send configuration information to the processing element, or to send the input of a neural network to the neurons mapped to the processing element.
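Using the field widths given above (5-bit color, 1-bit control, 16-bit index, 16-bit sparse data), a sparse wavelet's payload can be packed into a 33-bit integer as sketched below. The bit ordering within the payload is not specified in the text, so the layout chosen here is illustrative.

```python
def pack_sparse_wavelet(color, control, index, data):
    """Pack Fig. 13A fields; returns (payload, color). The layout is an assumption."""
    assert 0 <= color < 1 << 5 and control in (0, 1)
    assert 0 <= index < 1 << 16 and 0 <= data < 1 << 16
    payload = (control << 32) | (index << 16) | data    # 33-bit sparse wavelet payload
    return payload, color

def unpack_sparse_wavelet(payload, color):
    """Recover (color, control, high index bits, low index bits, data)."""
    data = payload & 0xFFFF
    index = (payload >> 16) & 0xFFFF
    control = (payload >> 32) & 0x1
    return color, control, index >> 6, index & 0x3F, data
```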

Fig. 13B shows selected details of an embodiment of a dense wavelet as dense wavelet 1331. Dense wavelet 1331 includes dense wavelet payload 1332 and colors 1344. Dense wavelet payload 1332 includes dense data 1343.1, dense data 1343.2, and control bits 1340.

In some embodiments, control bit 1340 is a 1-bit field and is functionally identical to control bit 1320.

In some embodiments, color 1344 comprises a 5-bit field and is functionally identical to color 1324.

In some cases, dense data 1343.1 and dense data 1343.2 include fields for respective 16-bit floating point numbers or respective 16-bit integers. In various instances, dense data 1343.1 and dense data 1343.2 variously represent weights of the neural network, inputs or stimuli to the neural network, activations of the neural network, or partial sums of the neural network. In some cases, dense data 1343.1 and dense data 1343.2 collectively hold a 32-bit floating point number (e.g., dense data 1343.1 includes a first portion of the 32-bit floating point number and dense data 1343.2 includes a second portion of the 32-bit floating point number).
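A small sketch of the last point: carrying one 32-bit floating point number across the two 16-bit dense data fields. Which field holds which half is not stated, so the ordering here is an assumption.

```python
import struct

def split_fp32(value):
    """Split an FP32 value into two 16-bit halves (assumed: 1343.1 = low half)."""
    raw = struct.unpack("<I", struct.pack("<f", value))[0]
    return raw & 0xFFFF, (raw >> 16) & 0xFFFF

def join_fp32(lo16, hi16):
    """Reassemble the FP32 value from the two 16-bit halves."""
    return struct.unpack("<f", struct.pack("<I", (hi16 << 16) | lo16))[0]

lo, hi = split_fp32(3.14159)
assert abs(join_fp32(lo, hi) - 3.14159) < 1e-6
```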

In various embodiments and/or usage scenarios, the use of sparse wavelets versus dense wavelets is predetermined and/or dynamically determined differently. In various embodiments and/or usage scenarios, the use of sparse wavelets versus dense wavelets is determined by software.

Figure 14 illustrates selected details of an embodiment of creating and transmitting a wavelet, as wavelet creation flow 1400. The actions of wavelet creation flow 1400 are performed by various agents. A transmitting PE comprises a CE that performs acts 1403-1407, as shown by CE of transmitting PE 1420. The transmitting PE further comprises a router that performs act 1408, as shown by router of transmitting PE 1430. A receiving PE comprises a router that performs act 1409, as shown by router of receiving PE 1440.

The creation and transmission of wavelets begins by initializing at least one transmitting PE and one or more receiving PEs, as well as any PE that includes a router implementing a fabric coupling the transmitting PE and the receiving PE (initializing PE 1402) (start 1401). Each PE includes a respective router (e.g., router 510 of fig. 5) and a respective CE (e.g., compute element 520 of fig. 5). In some cases, initializing a PE enables the CE of the PE to perform computations and enables the router of the PE to structurally send, receive, and/or forward wavelets.

In various implementations, the DSR holds a DSD that contains information about operands, e.g., a location of a data element (e.g., memory, structure input, and/or structure output), a number of data elements (e.g., length), one or more addresses of the data elements (e.g., starting address and stride in memory). For structure output operands (e.g., wavelets sent via the structure), DSR includes the color, control bits, and optionally index values or positions of the wavelets on the structure.

In some embodiments, the CE of the transmitting PE configures the source (set source 1403). In some cases, the source is a source DSD that describes source operands. In various implementations, the source DSD describes one or more data elements stored in one of a cache and a memory. In other implementations, the source DSD describes one or more data elements received via the structure (e.g., the data elements are the payload of a wavelet arriving via the structure). In some other cases, the source includes a source register (e.g., one of the RFs 842). In other cases, the source includes an immediate specified in the instruction.

The CE also configures the destination DSD in the destination DSR describing the structured destination operand (set destination (structure) DSR 1404). In some implementations, the destination DSD describes one or more data elements transmitted via the fabric. In various embodiments, the source and destination DSDs are configured via one or more instructions.

Subsequently, the CE fetches and decodes instructions (e.g., FMACH, MOV, LT16) that contain the destination operand specified by the DSD in the destination DSR (the fetch/decode instruction with destination DSR 1404.5). In some embodiments, an operand type field of an instruction specifies whether an operand is specified by a DSD.

The CE reads the destination DSD from the destination DSR and reads any source DSD in the source DSR (read DSR 1404.6). Based on the DSD, the CE determines the type of data structure, the source of the data elements, whether to read multiple data elements together (e.g., for SIMD operations), and the total number of data elements per operand. In some cases, the DSR is read for one or more of: a source 0 operand, a source 1 operand, and a destination operand. In some implementations and/or usage scenarios, DSRs are read in parallel, in whole or in part, and in other implementations and/or usage scenarios DSRs are read in sequence, in whole or in part.

The CE of the transmitting PE then reads the data elements described by the source (e.g., a source DSD or a register) and creates a wavelet that includes the data elements based on the destination DSD. The CE reads (e.g., from memory) the first data element specified by the source (read (next) data element from queue/memory 1405). The data elements are used to form a wavelet payload. The control bits of the wavelet payload and the color of the wavelet are specified by the destination DSD. The wavelet payload and color are provided to the router of the transmitting PE (the data element is provided to router 1406 as a wavelet). In some implementations and/or usage scenarios, a single data element is used to create the payload of a sparse wavelet. In other embodiments and/or usage scenarios, two data elements are used to create the payload of a dense wavelet.

The CE of the transmitting PE determines whether the destination DSD specifies additional data elements (more data elements? 1407). If the destination DSD specifies additional data elements, the CE creates additional wavelets by repeating reading the (next) data element from queue/memory 1405, providing the data element to router 1406 as a wavelet, and more data elements? 1407, until no additional data elements are specified by the destination DSD. If the destination DSD does not specify additional data elements, flow ends (end 1410). In some implementations, the wavelets created via act 1406 are of the same color as specified by the destination DSR.
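
A minimal sketch of the loop formed by acts 1405, 1406, and 1407 follows; the DSD field names (length, sparse, index, control, color) and the helper callbacks are hypothetical stand-ins for the source and destination descriptors described above, and the handling of odd lengths for dense wavelets is simplified.

```python
# Illustrative sketch of acts 1405-1407: read data elements per the source DSD,
# wrap them as wavelet payloads per the destination DSD, and hand them to the
# router until the destination length is satisfied. Field and helper names are
# hypothetical; dense wavelets are assumed to carry an even number of elements.

def create_and_send_wavelets(source_dsd, dest_dsd, read_element, send_to_router):
    sent = 0
    while sent < dest_dsd.length:                      # more data elements? (1407)
        if dest_dsd.sparse:                            # one element per sparse wavelet
            data = read_element(source_dsd)            # read (next) data element (1405)
            payload = (data, dest_dsd.index, dest_dsd.control)
            sent += 1
        else:                                          # two elements per dense wavelet
            d0 = read_element(source_dsd)
            d1 = read_element(source_dsd)
            payload = (d0, d1, dest_dsd.control)
            sent += 2
        send_to_router(payload, dest_dsd.color)        # provide as wavelet to router (1406)
```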

The router of the transmitting PE transmits the wavelet formed by the CE of the transmitting PE according to the color of the wavelet (transmit wavelet to fabric 1408). In some embodiments and/or usage scenarios, the transmission is directly to the router of the receiving PE. In some embodiments and/or usage scenarios, the wavelet is sent indirectly to the router of the receiving PE, e.g., via one or more intermediate PEs that forward the wavelet according to color. The router of the receiving PE receives the wavelet according to color (receive wavelet from fabric 1409).

In various implementations, act 1408 is performed asynchronously with respect to any one or more of acts 1405, 1406, and 1407. For example, act 1406 generates a plurality of wavelets prior to transmitting any generated wavelets as shown in act 1408.

In various embodiments, receiving the wavelet from the structure 1409 corresponds in various aspects to receiving the wavelet at the router 1503 of fig. 15A.

In various embodiments and/or usage scenarios, all or any portion of any one or more elements of wavelet creation stream 1400 conceptually correspond to and/or are conceptually related to the operations performed by a PE (e.g., PE 499 of fig. 4).

In various embodiments and/or usage scenarios, all or any portion of any one or more elements of wavelet creation flow 1400 (e.g., any one or more of acts 1403-1407) conceptually correspond to and/or are conceptually related to the operations performed by and/or elements of a computing element, e.g., all or any portion of the CE of the PE, e.g., computing element 520 of fig. 5 and/or CE 800 of fig. 8. As one example, the destination DSR (associated with set destination (structure) DSR 1404) is one of the DSRs 846. In some cases, the source DSR (associated with set source 1403) is one of DSRs 846; in other cases, the source register (associated with set source 1403) is one of the RFs 842.

As another example, CE 800, as the CE of the transmitting PE, performs act 1403 in response to a load DSR instruction that copies information from memory 854 to a source DSR (e.g., one of DSRs 846). In various embodiments, the source DSR specifies the location of the data element as one of memory 854, data store 848, and RF 842. In some cases, the source DSR specifies the address of the first data element in memory 854 (e.g., address 0x0008), a number of data elements (e.g., nine data elements), and a stride between subsequent data elements (e.g., 12 bytes). As another example, CE 800 performs act 1403 by writing data to a register of RF 842.

As another example, CE 800, as the CE of the transmitting PE, performs act 1404 in response to a load DSR instruction that copies information from memory 854 to a destination DSR (e.g., one of DSRs 846). In various implementations, the destination DSR specifies that one or more data elements are converted to one or more wavelets and transmitted by router 510 via a fabric-coupled egress port (e.g., north 513). The destination DSR specifies the color of the wavelet, the control bits of the wavelet, the number of data elements (e.g., length), and information about the wavelet index. In some cases, the destination DSR specifies an index value, while in other cases, the destination DSR specifies the location of the index value (e.g., in a register of the RF 842).

As another example, the CE 800, being a CE of a transmitting PE, performs acts 1404.6, 1405, 1406, and 1407 in response to extracting and decoding an instruction that specifies a destination DSR as a destination operand (act 1404.5). In some implementations and/or usage scenarios, D-Seq 844 reads the source DSR and accesses one or two data elements specified by the source DSR, e.g., from memory 854 or data store 848, to perform act 1405. In various embodiments, memory 854 and/or data store 848 provides one or two data elements to data path 852. The data path converts the data into a wavelet and sends the wavelet via ingress ramp 860, e.g., for storage into an element of data queue 650 (of router 600 of fig. 6), thereby performing act 1406. In some embodiments, the on-ramp 860 includes memory that buffers one or more wavelets. In some implementations, the CE 800 of the transmitting PE reads the color from the destination DSR. Based on the color, CE 800 sends the wavelet payload via ingress ramp 860, e.g., for storage into an element of data queue 650, thereby completing act 1406. In some implementations, the CE 800 of the transmitting PE performs act 1407 by comparing the number (e.g., length) of data elements specified in the destination DSR to the number of data elements transmitted via act 1406 (e.g., tracked by a counter).

As another example, CE 800, as the CE of the transmitting PE, performs act 1406. The CE converts one or two data elements into a wavelet payload in accordance with the destination DSR. In some embodiments and/or usage scenarios, the CE converts a single data element into a wavelet payload formatted according to the sparse wavelet 1301 of fig. 13A. The single data element is converted into an instantiation of sparse data 1322, the index value specified by the destination DSR is converted into an instantiation of index 1321, and the control bits from the destination DSR are converted into instantiations of control bits 1320, forming an instantiation of sparse wavelet payload 1302.

As another example, CE 800, which is the CE of the transmitting PE, converts two data elements into a wavelet payload formatted according to dense wavelet 1331 of fig. 13B. The first data element is converted into an instantiation of the dense data 1343.1 and the second data element is converted into an instantiation of the dense data 1343.2. The control bits from the destination DSR are converted into instantiations of control bits 1340, forming instantiations of dense wavelet payload 1332.

In various embodiments and/or usage scenarios, all or any portion of any one or more elements of wavelet creation stream 1400 (e.g., any one or more of acts 1408 and 1409) conceptually correspond to and/or are conceptually related to the operations performed by a router and/or elements of a router, e.g., all or any portion of a router of a PE, e.g., router 510 of fig. 5 and/or router 600 of fig. 6.

As an example, router 600, as the router of transmitting PE 1430, performs the transmission of a wavelet to fabric 1408 as follows. Router 600 determines the destination of the wavelet in data queue 650, for example, by reading Dest 661. For each color, Dest 661 indicates an output destination, e.g., one or more data outputs 620. Router 600 transmits the wavelet payload and color (collectively, the wavelet) to the fabric via output 652 and one or more data outputs 620. In various embodiments, router 600 of the transmitting PE performs act 1408 asynchronously with any one or more of acts 1405, 1406, and 1407.

As another example, receiving a wavelet from fabric 1409 is performed by router 600, which is the router receiving PE 1440, as follows. Router 600 receives the transmitted wavelet at data queue 650 via one of data input 610 and write decode 651. The received wavelets are stored in one or more locations of a data queue 650.

In some embodiments and/or usage scenarios, all or any portion of the elements of wavelet creation flow 1400 conceptually correspond to all or any portion of the execution instructions of task SW on PE 260 of fig. 2.

Figure 15A illustrates selected details of an embodiment of receiving a wavelet as a wavelet reception flow 1500. The actions of the wavelet reception flow 1500 are performed by various agents. The receiving PE includes a router that performs acts 1503-1506, as shown by the router of receiving PE 1520. The receiving PE also includes a CE that performs act 1507, as shown by the CE of receiving PE 1530.

Receiving a wavelet begins (start 1501) by initializing at least one transmitting PE and one or more receiving PEs, as well as any PE comprising a router implementing a fabric coupling the transmitting PE and the receiving PEs (initializing PE 1502). Each PE includes a respective router (e.g., router 510 of fig. 5) and a respective CE (e.g., computing element 520 of fig. 5). In some cases, initializing a PE enables the CE of the PE to perform computations and enables the router of the PE to send, receive, and/or forward wavelets over the fabric.

The following description assumes that there is one receiving PE. In a usage scenario where there are multiple receiving PEs, the respective routers and CEs of each receiving PE perform the process according to fig. 15A.

The router of the receiving PE receives the wavelet 'on a color' (e.g., the wavelet includes the color) via the fabric, as sent by the transmitting PE (the wavelet is received at router 1503). The router checks the destination of the wavelet based on the color, e.g., by reading a configuration register. If the destination of the wavelet includes other PEs (to other PEs 1504), the router transmits the wavelet to the destination PEs: the router sends the wavelet to an output of the router (transmit the wavelet to output 1505), from which the wavelet is transmitted over the fabric to the destination PEs. If the destination of the wavelet does not include other PEs, the transmission is omitted.

If the destination of the wavelet does not include the local CE (for local CE 1506), no further action is taken (end 1510). If one of the destinations of the wavelet is the local CE, the router provides the wavelet to the local CE via the exit ramp and the wavelet is written to the selector queue associated with the color of the received wavelet (write the wavelet to selector queue 1507), thereby receiving the wavelet (end 1510).
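
A minimal sketch of acts 1503 through 1507 follows; the per-color destination table stands in for the routing configuration (cf. Dest 661), and the callback names are hypothetical.

```python
# Illustrative sketch of acts 1503-1507: the receiving router looks up the
# destination set configured for the wavelet's color and forwards the wavelet
# to other PEs and/or to the local CE via the off-ramp (selector queue write).

def receive_wavelet(wavelet, color, dest_table, forward_to_output, write_selector_queue):
    for destination in dest_table[color]:              # per-color routing (cf. Dest 661)
        if destination == "off_ramp":                  # destination is the local CE (1506)
            write_selector_queue(color, wavelet)       # write wavelet to selector queue (1507)
        else:                                          # destination includes other PEs (1504)
            forward_to_output(destination, wavelet, color)  # transmit to output (1505)
```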

In various embodiments and/or usage scenarios, all or any portion of any one or more elements of the wavelet reception flow 1500 (e.g., any one or more of acts 1503-1506) conceptually correspond to and/or are conceptually related to the operations performed by the router and/or the elements of the router, e.g., all or any portion of the router of the PE, e.g., router 510 of fig. 5 and/or router 600 of fig. 6.

As an example, router 600, as the router of receiving PE 1520, performs the receive wavelet at router 1503 when a wavelet is received on one of data inputs 610. Subsequently, router 600 performs to other PEs 1504 and for local CE 1506 using the color of the wavelet to determine the destination(s) of the wavelet, e.g., by reading Dest 661. For each input color, Dest 661 indicates an output destination, e.g., one or more data outputs 620. If Dest 661 indicates that the output includes other PEs (e.g., via one of Skipx +621, Skipx-622, X +623, X-624, Y +625, and Y-626), then the wavelet is sent to the other PEs by router schedule 654. If Dest 661 indicates that the output includes the CE of the PE (e.g., exit ramp 627), then the wavelet is sent to the CE by router schedule 654. The wavelet remains in one of the data queues 650 until act 1505 is performed by scheduling the wavelet to be sent to one or more data outputs 620 (e.g., by router schedule 654).

In various embodiments and/or usage scenarios, all or any portion of any one or more elements of wavelet reception flow 1500 (e.g., act 1507) conceptually correspond to and/or are conceptually related to operations performed by and/or elements of a computing element, e.g., all or any portion of a CE of a PE, e.g., computing element 520 of fig. 5 and/or CE 800 of fig. 8. Writing the wavelet to selector queue 1507 is performed, as one example, by sending the wavelet to CE 800 via exit ramp 820 and writing the wavelet to one of Qs 897.

In some embodiments and/or usage scenarios, the wavelet is received, queued, and routed by the router to the router output port without any specific determination that the wavelet is for a local CE. Instead, the wavelet destined for the local CE is routed to the off-ramp and then written to the selector queue. Wavelets not destined for the local CE are routed to the non-off-ramp router output.

Figure 15B illustrates selected details of an embodiment of consuming a wavelet as a wavelet consumption flow 1550. The actions of the wavelet consumption flow 1550 are performed by the CE of the PE.

Consumption of the wavelet begins (start 1551) when the wavelet is selected from a queue for processing by the selector (selector selects wavelet for processing 1552), and then the wavelet is processed by the CE. The CE fetches and executes instructions associated with the wavelet (fetch, execute instructions 1553), thereby consuming the wavelet (end 1554). In some embodiments and/or usage scenarios, fetching and executing instructions associated with the wavelet ends with fetching and executing a termination instruction.

In some implementations, selector selection 1552 of wavelets for processing is performed by selector 830 of fig. 8. In various instances, selector 830 selects one of the prepared Qs 897 (e.g., block bit 899 and active bit 898 set to particular values) according to a scheduling policy such as round robin or last pick. In some embodiments, the portion of wavelet consumption flow 1550 corresponds to the portion of processing wavelets for task start 900 of fig. 9. As one example, act 1552 corresponds to act 905. As another example, act 1553 corresponds to acts 908, 920, 930, 950, and 960.
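
A minimal sketch of a round-robin selection over the per-color queues follows; the eligibility test (queue non-empty, block bit clear, active bit set) is taken from the description of block bits 899 and active bits 898, while the data structures are illustrative.

```python
# Illustrative sketch of the selector: scan the per-color queues round-robin,
# starting after the last pick, and return the first wavelet whose color is
# active and not blocked. The queue representation is an assumption.

def select_wavelet(queues, block_bits, active_bits, last_picked):
    n = len(queues)
    for offset in range(1, n + 1):
        color = (last_picked + offset) % n
        if queues[color] and active_bits[color] and not block_bits[color]:
            return color, queues[color].pop(0)   # wavelet selected for processing (1552)
    return None, None                            # no queue is currently eligible
```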

In some other cases, an instruction executing on a CE (e.g., FMACH) accesses the wavelet as an operand, and the wavelet is consumed by the CE during instruction execution, e.g., as shown in fig. 23.

Blocking and unblocking

FIG. 16 illustrates, as a flow 1600, selected details of an embodiment of blocking and unblocking execution of an instruction. Conceptually, executing a block instruction that specifies a particular color prevents execution of instructions associated with that particular color, at least until an unblock instruction specifying the particular color is executed.

Referring to the figure, execution of an instruction begins by fetching the instruction from memory and decoding the instruction (fetch, decode instruction 1602) (start 1601). If the instruction decodes into a block instruction (block instruction? 1603), then a block operation is performed (block color 1604). The source operands of the block instruction specify one or more colors to block with respect to instruction processing associated with the blocked/unblocked colors. In various embodiments and/or usage scenarios, the block operation is performed by setting one or more block indicators of the one or more colors specified by the source operands to a block state, and the execution is completed (end 1630). In various cases, the source operands variously specify blocking a single color, blocking all colors, and blocking any of a plurality of colors. In subsequent operation, wavelets comprising a color for which the corresponding block indicator is set to the block state are not selected for processing.

If the instruction decodes into an unblock instruction (unblock instruction? 1610), then an unblock operation is performed (unblock color 1611). The source operands of the unblock instruction specify one or more colors to unblock with respect to instruction processing associated with the blocked/unblocked colors. In various embodiments and/or usage scenarios, the unblock operation is performed by resetting the blocking indicator of the one or more colors specified by the source operands to an unblocked state, and the execution is completed (end 1630). In various cases, the source operands variously specify unblocking a single color, unblocking all colors, and unblocking any of a plurality of colors. In subsequent operation, a wavelet comprising a color with the corresponding blocking indicator set to the unblocked state may be selected for processing.

If the instruction decodes as an instruction that is not a block instruction and is not an unblock instruction, the instruction is executed (execute instruction 1620) and execution completes (end 1630).

In some embodiments, if the source operand of the block operation is an immediate (e.g., an 8-bit immediate), the value of the immediate specifies the color to be blocked. If the source operand is not immediate, all colors will be blocked.

In some embodiments, the source operand of the unblocking operation is an immediate (e.g., an 8-bit immediate), and the value of the immediate specifies the color to be unblocked. In various embodiments, an unblocking operation with a particular operand unblocks multiple colors.

In various embodiments and/or usage scenarios, all or any portion of any one or more elements of block and unblock instruction processing flow 1600 conceptually correspond to and/or are conceptually related to operations performed by and/or elements of a computing element (e.g., all or any portion of a CE of a PE, such as computing element 520 of fig. 5 and/or CE 800 of fig. 8).

As one example, the block bits 899 include one bit per color (e.g., as an entry in a table, or as a bitmask). The block operation (block color 1604) is performed by setting block bit 899 to a particular block value (e.g., '1') for one or more colors specified by the source operand. In some embodiments, selector 830 selects a wavelet to process from the color in which blocking bit 899 matches the unblocking value (e.g., '0'). As another example, the unblocking operation (unblocking color 1611) is performed by setting the block bit 899 to a particular unblocking value (e.g., '0') for the color specified by the source operand. In some implementations, the selector 830 selects a wavelet that includes a color for which the blocking bit 899 matches the unblocking value (e.g., '0').
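
A minimal sketch of the block/unblock semantics over a per-color block bit (cf. block bits 899) follows; the handling of a missing immediate is taken from the description above, and everything else is illustrative.

```python
# Illustrative sketch of flow 1600: block sets the block bit of the specified
# color (or of all colors when no immediate is given), unblock clears it, and
# the selector only considers colors whose block bit is clear.

def execute_block(block_bits, immediate=None):
    if immediate is not None:
        block_bits[immediate] = 1                # block color 1604 (single color)
    else:
        for color in range(len(block_bits)):
            block_bits[color] = 1                # no immediate: block all colors

def execute_unblock(block_bits, immediate):
    block_bits[immediate] = 0                    # unblock color 1611

def color_selectable(color, block_bits):
    return block_bits[color] == 0                # only unblocked colors are eligible
```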

In some embodiments, the portions of block and unblock instruction processing flow 1600 correspond to portions of task start 900 of fig. 9 that process wavelets. As one example, acts 1602, 1603, 1604, 1610, 1611, and 1620 correspond to portions of acts 950 and 960 of fig. 9.

In various embodiments and/or usage scenarios, all or any portion of the elements of block and unblock instruction processing flow 1600 conceptually correspond to all or any portion of the execution instructions of task SW on PE 260 of fig. 2.

Neuron smearing

FIG. 17 illustrates selected details of an embodiment of a neural network as neural network 1700. Network 1700 includes three parts: input layer 1710, inner layer 1720, and output layer 1740. Each layer includes a plurality of neurons. Input layer 1710 includes neurons N111711, N121712, and N131713. Inner layer 1720 includes a first layer of neurons N211721, N221722, N231723, and N241724, followed by a second layer of neurons N311731, N321732, and N331733. Output layer 1740 includes neurons N411741 and N421742.

Selected neurons (N211721, N221722, N231723 and N241724 as well as N311731 and N321732) and communication between selected neurons (1791, 1792 and 1793) are highlighted. Selected neurons and pathways are discussed in more detail below.

FIG. 18A illustrates selected details of a first embodiment of assigning processing elements to neurons. Sometimes, assigning a processing element to a neuron is referred to as placing the neuron in the processing element, or alternatively placing the neuron. Like-numbered elements in fig. 18A correspond to like-numbered elements in fig. 17. A first assignment of processing elements to the subset of neurons of fig. 17 (highlighted neurons N211721, N221722, N231723 and N241724 and N311731 and N321732) is conceptually illustrated. The vertical distances in the figure represent the relative usage of computing resources of each of the six processing elements PE 01820, PE 11821, PE 21822, PE 31823, PE 41824, and PE 51825.

Each of neurons N211721, N221722, N231723, and N241724 represents approximately equal amounts of computational resources, e.g., M operations, K storage capacity, and J bandwidth to and from storage. Each of neurons N311731 and N321732 represents approximately equal amounts of computational resources, e.g., M/2 operations, K/2 storage, and J/2 bandwidth. Thus, each of N311731 and N321732 represents approximately half the computing resources of each of N211721, N221722, N231723, and N241724. In various embodiments, examples of computing resources include computing operations, storage capacity, read bandwidth from memory, write bandwidth to memory, input connections from other neurons, and output connections to other neurons.

In the illustrated embodiment, the neuron processing is assigned such that each of the aforementioned neurons is assigned to an entire PE. More specifically, N211721 is assigned to PE 01820, N221722 is assigned to PE 11821, N231723 is assigned to PE 21822, N241724 is assigned to PE 31823, N311731 is assigned to PE 41824, and N321732 is assigned to PE 51825. Thus, four of the six processing elements are fully subscribed (PE 01820, PE 11821, PE 21822, and PE 31823), while two of the six processing elements are only half subscribed (PE 41824 and PE 51825).

FIG. 18B illustrates selected details of a second embodiment of assigning processing elements to neurons. Like-numbered elements in fig. 18B correspond to like-numbered elements in fig. 17 and 18A. A second distribution of processing elements to the subset of neurons of fig. 17 (highlighted neurons N211721, N221722, N231723 and N241724 and N311731 and N321732) is conceptually illustrated. As shown in FIG. 18A, the vertical distances in the figure represent the relative usage of computing resources for each of the six processing elements PE 01820, PE 11821, PE 21822, PE 31823, PE 41824, and PE 51825. As also shown in fig. 18A, each of N311731 and N321732 represents approximately half of the computing resources of each of N211721, N221722, N231723, and N241724.

In the illustrated embodiment, the neuron processing is assigned such that the processing of each respective neuron is "smeared" across processing elements. Conceptually, neurons are "segmented" into portions suitable for the processing elements to which they are to be assigned. As shown, the neurons are partitioned and processing elements are allocated such that four of the six processing elements are subscribed equally (and completely) (PE 01820, PE 11821, PE 21822, and PE 31823), while two of the six processing elements are completely unsubscribed and are therefore available for other purposes (PE 41824 and PE 51825). In some implementations and/or usage scenarios, the unsubscribed processing elements remain unused and consume little or no active and/or static power (e.g., via one or more of clock gating and power gating). More specifically, N211721 is split in half (1/2N 211721.1 and 1/2N 211721.2) assigned to two respective processing elements (PE 01820 and PE 21822). Similarly, N221722 is split in half (1/2N 221722.1 and 1/2N 221722.2) assigned to two respective processing elements (PE 01820 and PE 21822). N231723 is divided in half (1/2N 231723.1 and 1/2N 231723.2) and assigned to two respective processing elements (PE 11821 and PE 31823), and N241724 is divided in half (1/2N 241724.1 and 1/2N 241724.2) and assigned to two respective processing elements (PE 11821 and PE 31823). N311731 is divided into four quarters (1/4N 311731.1, 1/4N 311731.2, 1/4N 311731.3, and 1/4N 311731.4) allocated to four respective processing elements (PE 01820, PE 11821, PE 21822, and PE 31823). Likewise, N321732 is divided into four quarters (1/4N 321732.1, 1/4N 321732.2, 1/4N 321732.3, and 1/4N 321732.4) allocated to four respective processing elements (PE 01820, PE 11821, PE 21822, and PE 31823). In various implementations, neurons are partitioned and processing elements are allocated based on one or more computing resources associated with the neurons. In some embodiments, neurons are partitioned and processing elements are allocated based on hardware resources available in the processing elements (e.g., some neurons require specific hardware resources, e.g., PRNGs).
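
For illustration, the sketch below shows one way neuron portions of the sizes described above could be packed onto processing elements so that subscription is balanced; this greedy packer is a hypothetical stand-in, since an actual placement would also weigh storage, bandwidth, connectivity, and required hardware resources.

```python
# Illustrative sketch only: split each neuron's compute cost into equal
# portions and pack portions greedily onto the least-loaded PE. A real
# placement also considers storage, bandwidth, and connectivity.

def smear(neuron_costs, portions_per_neuron, num_pes):
    load = [0.0] * num_pes
    placement = {}                                           # (neuron, portion) -> PE index
    for neuron, cost in neuron_costs.items():
        portion_cost = cost / portions_per_neuron[neuron]
        for portion in range(portions_per_neuron[neuron]):
            pe = min(range(num_pes), key=lambda i: load[i])  # least-loaded PE
            load[pe] += portion_cost
            placement[(neuron, portion)] = pe
    return placement, load

# Mirroring the example above (M = 1.0): N21..N24 cost M split in halves,
# N31/N32 cost M/2 split in quarters, packed onto four PEs -> 1.25*M per PE.
placement, load = smear(
    {"N21": 1.0, "N22": 1.0, "N23": 1.0, "N24": 1.0, "N31": 0.5, "N32": 0.5},
    {"N21": 2, "N22": 2, "N23": 2, "N24": 2, "N31": 4, "N32": 4},
    num_pes=4,
)
assert all(abs(x - 1.25) < 1e-9 for x in load)
```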

FIG. 19 illustrates selected details of an embodiment of smearing neurons on a plurality of processing elements. The segmentation results in a portion of the segmented neurons being smeared on the processing element. Like-numbered elements in fig. 19 correspond to like-numbered elements in fig. 17, 18A, and 18B. As shown in fig. 18B, N211721 is divided into two portions 1/2N 211721.1 and 1/2N 211721.2 implemented by PE 01820 and PE 21822, respectively.

Conceptually, N211721 is considered to include local computation and local storage as well as inputs and outputs. Each element of N211721 is partitioned separately. The local compute of N21 is split into 1/2 local compute 1930.1 and 1/2 local compute 1930.2. The local storage of N21 is divided into 1/2 local storage 1940.1 and 1/2 local storage 1940.2. The inputs of N21 are divided into a first half In 01910, In 11911, and In 21912 and a second half In 31913, In 41914, and In 51915. The outputs of N21 are divided into a first half out 01920, out 11921, and out 21922 and a second half out 31923, out 41924, and out 51925.

1/2 local compute 1930.1, 1/2 local store 1940.1, in 01910 and in 11911, and out 01920 are implemented by PE 01820. 1/2 local compute 1930.2, 1/2 local store 1940.2, in 21912 and in 31913, and out 11921 are implemented by PE 21822.

In some embodiments and/or usage scenarios, smearing neurons on more than one processing element is achieved, at least in part, by additional computations, additional storage, and/or additional communications that the neurons do not otherwise perform/use. Additional computations, additional storage, and/or additional communications can, for example, combine partial results from the neuron portions into a result corresponding to the overall neuron result. Extra compute 1950.1 and extra storage 1960.1 represent extra compute and extra storage for 1/2N 211721.1 and are implemented by PE 01820. Extra compute 1950.2 and extra storage 1960.2 represent extra compute and extra storage for 1/2N 211721.2 and are implemented by PE 21822.

Additional communication 1970 represents additional communication between 1/2N 211721.1 and 1/2N 211721.2 and is implemented through a fabric connection between PE 01820 and PE 21822. In some embodiments and/or usage scenarios, all or any portion of the additional communication 1970 represents communication that would occur within a single processing element if the single processing element fully implemented N211721.
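
To illustrate the role of the extra compute, extra storage, and additional communication 1970, the sketch below shows one hypothetical way two halves of a split neuron could each form a partial sum and then be combined into the whole-neuron result; the function names and the ReLU-style activation are assumptions made only for this example.

```python
# Illustrative sketch: each half of a split neuron computes a partial dot
# product over its share of the inputs and weights; one half communicates its
# partial sum to the other (cf. additional communication 1970), which adds the
# bias and applies the activation to produce the whole-neuron result.

def half_neuron_partial(inputs_half, weights_half):
    return sum(x * w for x, w in zip(inputs_half, weights_half))

def combine_halves(partial_a, partial_b, bias, activation=lambda s: max(0.0, s)):
    return activation(partial_a + partial_b + bias)

# e.g., combine_halves(half_neuron_partial([1, 2], [0.5, 0.5]),
#                      half_neuron_partial([3, 4], [0.25, 0.25]),
#                      bias=-1.0) == 2.25
```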

FIG. 20 illustrates selected details of an embodiment of segmenting communications between neuron portions. Like-numbered elements in fig. 20 correspond to like-numbered elements in fig. 17, 18A, 18B, and 19. Assignment of PE 01820, PE 11821, PE 21822, and PE 31823 to the neuron portion is shown in fig. 18B. For clarity, only allocations specific to PE 01820 and PE 11821 are shown.

Wafer portion 2000 includes PE 01820, PE 11821, PE 21822, and PE 31823. The coupling between the PEs of wafer portion 2000 is shown (coupling between adjacent PEs) as 2040 coupling PEs 01820 and 11821, 2041 coupling PEs 11821 and 31823, 2043 coupling PEs 31823 and 21822, and 2044 coupling PEs 21822 and 01820. The coupling of the PEs adjacent to wafer portion 2000 is shown as (coupling portions between adjacent PEs) 2050, 2051, 2052, 2053, 2054, 2055, 2056, and 2057. Coupling with neighboring PEs is "part" in that in some embodiments and/or usage scenarios, all or any part of the coupling is contained in a wafer portion adjacent to wafer portion 2000, rather than being completely contained in wafer portion 2000.

As a first example, communication portion 1791.1 conceptually represents a portion of communication 1791 between N111711 and N211721 (of fig. 17), e.g., from the input layer to the inner layer, with portions of the split neurons in the respective processing elements. More specifically, recall that N211721 was divided into two parts (1/2N 211721.1 and 1/2N 211721.2; see FIG. 18B). Thus, the communication 1791 is split into two parts. The communication portion 1791.1 specifically illustrates the portion relative to 1/2N 211721.1. The communication portion 1791.1 is transmitted to PE 01820 (assigned to 1/2N 211721.1) via (coupling portion between adjacent PEs) 2057 from a PE adjacent to wafer portion 2000. In some embodiments and/or usage scenarios, communication 1791 is split into two portions, namely, communication portion 1791.1 (shown) and communication portion 1791.2 (not shown). In some embodiments and/or usage scenarios, communication portion 1791.1 and communication portion 1791.2 are transmitted via the same virtual channel. In some embodiments and/or usage scenarios, communication portion 1791.1 and communication portion 1791.2 are transmitted via respective unique virtual channels.

As a second example, communication portion 1792.1 conceptually represents a portion of communication 1792 between N211721 and N311731 (of fig. 17), e.g., from a first inner layer to a second inner layer, with portions of the split neurons in the respective processing elements. More specifically, recall that N211721 is divided into two parts (1/2N 211721.1 and 1/2N 211721.2; see FIG. 18B). Recall further that N311731 is divided into four portions (1/4N 311731.1, 1/4N 311731.2, 1/4N 311731.3, and 1/4N 311731.4; see FIG. 18B). Thus, the communication 1792 is divided into multiple portions. The communication portion 1792.1 specifically illustrates the portion relative to 1/2N 211721.1 and 1/4N 311731.2. The communication portion 1792.1 is transmitted between PE 01820 (assigned to 1/2N 211721.1) and PE 11821 (assigned to 1/4N 311731.2) via (coupling between adjacent PEs) 2040. In various embodiments and/or usage scenarios, communication portion 1792.1 (shown) and other portions of communication 1792 (not shown) are transmitted via the same virtual channel, via a unique virtual channel for each portion, via a virtual channel for each portion associated with a particular neuron, and/or via a virtual channel for each portion associated with a particular processing element.

As a third example, communication portion 1793.1 conceptually represents a portion of communication 1793 between N231723 and N311731 (fig. 17), e.g., from a first inner layer to a second inner layer, with portions of the split neurons in the same processing element. More specifically, recall that N231723 was split into two parts (1/2N 231723.1 and 1/2N 231723.2; see FIG. 18B). Recall further that N311731 is divided into four portions (1/4N 311731.1, 1/4N 311731.2, 1/4N 311731.3, and 1/4N 311731.4; see FIG. 18B). Thus, the communication 1793 is divided into multiple parts. The communication portion 1793.1 specifically illustrates the portion relative to 1/2N 231723.1 and 1/4N 311731.2. The communication portion 1793.1 is transmitted via one or more mechanisms internal to PE 11821 (assigned to 1/2N 231723.1 and 1/4N 311731.2). For example, PE 11821 uses internal resources (e.g., routers) to internally feed back outputs as inputs, and/or internally provide inputs from outputs. In some embodiments and/or usage scenarios, communication portion 1793.1 is transmitted via a virtual channel that causes an output to be used as an input, and/or provides an input from an output.

As a fourth example, communication 2060 conceptually represents all or any portion of additional communication 1970 (of fig. 19), such as communication within a neuron that is partitioned across processing elements. More specifically, communication 2060 specifically illustrates communication between two of the four portions into which N321732 is divided (1/4N 321732.1 and 1/4N 321732.2; see FIG. 18B). Communications 2060 are communicated between PE 01820 (assigned to 1/4N 321732.1) and PE 11821 (assigned to 1/4N 321732.2) via (coupling between adjacent PEs) 2040. In various embodiments and/or usage scenarios, communication 2060 is via a virtual channel dedicated to communication 2060, a virtual channel shared with communications 2060 and other portions of N321732, and a virtual channel shared with communication 2060 and all or any portion of neurons partitioned on the processing elements.

In some embodiments and/or usage scenarios, all or any portion of wafer portion 2000 comprises PE 122 of fig. 1. In some embodiments and/or usage scenarios, any of PE01820, PE 11821, PE 21822, and PE 31823 corresponds to PE 497 of fig. 4. In some embodiments and/or usage scenarios, any one or more of a portion of the coupling between adjacent PEs 2041, 2042, 2043, and 2044 and/or the coupling between adjacent PEs 2050, 2051, 2052, 2053, 2054, 2055, 2056, and 2057 corresponds to any one or more of north coupler 430, east coupler 431, south coupler 432, and west coupler 433 of fig. 4.

The concepts related to neuron smearing (e.g., described and illustrated with reference to fig. 17, 18A, 18B, 19, and 20) are applicable to neural networks of various topologies and types, e.g., FCNNs, RNNs, CNNs, LSTM networks, autoencoders, deep belief networks, and generative adversarial networks.

In various embodiments and/or usage scenarios, neurons are divided into equally sized portions, e.g., half, quarter, eighth, etc. In various embodiments and/or usage scenarios, the neurons are divided into different sized portions, e.g., the first portion is half, and the second and third portions are each a quarter. In various embodiments and/or usage scenarios, neurons are divided into arbitrarily sized portions.

In various embodiments and/or usage scenarios, multiple PEs are assigned to a single neuron. In various embodiments and/or usage scenarios, a single PE is assigned to a respective ensemble of multiple neurons.

In various embodiments and/or usage scenarios, PEs are assigned to neurons wholly or partially in response to static and/or dynamic measurements of computational and/or storage requirements. In various embodiments and/or usage scenarios, PEs are assigned to neurons wholly or partially in response to dimensions of the data to be processed.

In various embodiments and/or usage scenarios, the data flow represented by the arrow direction is unidirectional (as indicated by the drawn arrow), bidirectional, and/or reversed (with respect to the drawn arrow). As a specific example, in various embodiments and/or usage scenarios, communication 1792 (of fig. 17) represents a data flow from N211721 to N311731 (e.g., during forward propagation) or, in reverse, from N311731 to N211721 (e.g., during backward propagation). Thus, communication over communication portion 1792.1, and thus over (coupling between adjacent PEs) 2040, occurs in a direction from PE 01820 to PE 11821 (e.g., during forward propagation) and in a direction from PE 11821 to PE 01820 (e.g., during backward propagation).

Vector and data structure descriptors

In various embodiments and/or usage scenarios, processing of one or more vectors, each vector comprising a respective one or more data elements, is performed. The vectors are variously read from storage (e.g., storage of the CE of the PE, such as memory 854 or data store 848 of fig. 8), written to storage, received from the fabric, or transmitted to the fabric. A vector that is read from or written to memory is sometimes referred to as a 'memory vector'. Vectors received from or transmitted to the structure (e.g., as wavelets) are sometimes referred to as 'structure vectors'. The DSDs in the DSRs (and the XDSDs in the XDSRs) may be used to determine the addressing pattern of memory vectors and the access pattern of structure vectors.

In the description of fig. 21A to 21E, 22A to 22B, and 23 to 24, each element identifier having a first number "8" refers to an element of fig. 8, which is not otherwise specifically identified as the element of fig. 8 for the sake of brevity.

FIG. 21A illustrates selected details of an embodiment of a structure input data structure descriptor (also known as structure input DSD) as structure input data structure descriptor 2100. In some embodiments, the structure input data structure descriptor 2100 describes the structure vectors received by the PE from the structure and various parameters related to the processing of the structure vectors. In various embodiments and/or usage scenarios, the source 0 operand or the source 1 operand of an instruction refers to a DSR that includes a DSD instance according to the structure input data structure descriptor 2100.

The structure input data structure descriptor 2100 includes a length 2101, a UTID (micro-thread identifier) 2102, a UE (micro-thread enable) 2103, a SW (SIMD width) 2104, an AC (active color) 2105, a Term (terminate micro-threads on control wavelets) 2106, a CX (control wavelet transform enable) 2107, an US (micro-thread sparse mode) 2108, a type 2109, an SS (single step) 2110, an SA (save address/conditional single step mode) 2111, an SC (color designation/normal mode) 2112, an SQ (queue designation/normal mode) 2113, and a CH (color high) 2114.

In some embodiments, length 2101 comprises a 15-bit integer specifying the length of the vector, e.g., the number of data elements in the vector.

In some embodiments, UE (micro-thread enabled) 2103 includes a 1-bit field that indicates whether, under at least some conditions, micro-threading is enabled during processing of a structure vector, sometimes referred to as 'micro-thread enabled'. If at least one operand (source or destination) of an instruction is a micro-thread enabled structure vector, then during instruction processing, on an input or output stall, processing is capable of switching to another instruction of another task (assuming sufficient micro-thread resources are available). When the stall is cleared, processing (eventually) returns to the previously stalled instruction. An example input stall is when at least one element of an input structure vector operand is unavailable. An example output stall is when there is not enough room to buffer the results associated with the elements of the output structure vector. In some cases, a structure vector without micro-threading enabled is processed synchronously, and processing stalls on an input or output stall. In some cases, a structure vector with micro-threading enabled is processed asynchronously, and stalling of the processing element on input or output stalls is reduced or avoided. If the structure vector enables micro-threading, the processing element can conditionally switch to processing a different instruction (rather than stalling), and then resume processing the structure vector at a later point in time (e.g., when data is available).

In some embodiments, UTID (micro-thread identifier) 2102 includes a 3-bit field that identifies one of the plurality of micro-threads and/or a resource associated with one of the plurality of micro-threads. The micro-threads and/or resources are associated with, for example, micro-thread enabled structure vectors. In some embodiments, the hardware provides resources for eight micro-threads. In some implementations and/or usage scenarios, the UTID 2102 identifies or partially identifies one of the Qs 897.

In some embodiments, SW (SIMD width) 2104 includes a 2-bit field that specifies the number of operations (e.g., one, two, or four) to be performed in parallel in some implementations. For example, FMACH, FADDH, FMULH, or MOV16 instructions perform multiple (up to four) operations on corresponding operands in parallel. In some implementations, the SW field is used to determine how to parse the wavelet into data and index information. For example, when the SW field is 4, two wavelets, each having two data values (no index values), provide four operands, e.g., in parallel. Continuing with the example, when the SW field is 2, a single wavelet with two data values (and no index values) provides two operands, e.g., in parallel. Continuing with the example, when the SW field is 1, a single wavelet with a single data value and a single index value provides a single operand.

In some implementations, the AC (activate color) 2105 includes a 6-bit field that specifies the color to be activated (e.g., via an activation operation). In some cases, when the processing of the micro-thread enabled structure vector is complete, the color specified by the AC field is activated and the task is initiated based on the activated color. Processing is completed, for example, when all elements of the structure vector have been processed, or when Term 2106 indicates that a control wavelet is encountered, and when the structure vector is processed, a control wavelet is encountered. In some implementations, the AC 2105 can specify one of: native color and structural color.

In some embodiments, Term (terminate micro-thread on control wavelet) 2106 includes a 1-bit field that specifies whether to terminate upon receipt of a control wavelet. If the wavelet at the head of the queue specified by the structure input data structure descriptor 2100 (e.g., one of Qs 897 specified differently by various functions of any combination of UTID 2102, SC 2112, and/or SQ 2113, as described elsewhere herein) is a control wavelet (e.g., control bit 1320 of fig. 13A or control bit 1340 of fig. 13B is set), and Term 2106 is set, the instruction terminates, and the color is activated by AC 2105.

In some embodiments, CX (control wavelet transform enable) 2107 comprises a 1-bit field that specifies whether control wavelets are transformed. If CX 2107 is set, then in response to receiving a control wavelet in the structure vector, bits 15:6 of the index register are set to all '1's. In some embodiments and/or usage scenarios, if bits 15:6 of the index register are all '1's, then the control bits of any output wavelets associated with an output structure vector that references the index register are set.

In some embodiments, US (micro-thread sparse mode) 2108 includes a 1-bit field that specifies whether a structure vector of enabled micro-threads (e.g., via a UE field) is to be processed in sparse mode. If US 2108 is set, the structure vector comprises a vector of sparse data elements and the corresponding wavelet indices of the operands described by the structure input data structure descriptor 2100. According to WLI 2152 (of fig. 21C), the index is optionally and/or selectively used for address calculation of memory operands.

In some embodiments, type 2109 includes a 3-bit field that specifies the data structure type and/or how to interpret other fields of the structure input data structure descriptor 2100. For all instances of the structure input data structure descriptor 2100, type 2109 is "0".

In some embodiments, SS (single step) 2110 includes a 1-bit field that specifies whether single step mode operation is enabled for operations that use DSD as an operand, under at least some conditions. In some cases, an instruction having one or more operands that enable single step mode operates in single step mode.

In some embodiments, SA (Save Address/conditional Single step mode) 2111 includes a 1-bit field that specifies whether save address mode operation is enabled under at least some conditions for operations that use DSD as an operand.

In some implementations and/or usage scenarios, a color is activated and, in response, a task is initiated at an address based at least in part on the color. Once started, the task is executed. In some cases, the input structure vector is provided from a queue associated with the color of the task currently being performed. In some embodiments, SC (color designation/normal mode) 2112 includes a 1-bit field that, if set, specifies that the input structure vector is to be provided from a particular queue (e.g., one of Qs 897) associated with a particular structure color. The particular structure color is specified (e.g., as a 5-bit color) as the concatenation of CH 2114 (a 2-bit field forming the upper bits) and UTID 2102 (a 3-bit field forming the lower bits). In some embodiments, SQ (queue designation/normal mode) 2113 includes a 1-bit field that, if set, specifies that the input structure vector is to be provided from a particular queue (e.g., one of Qs 897). If SQ 2113 is set, the input structure vector is provided from the one of Qs 897 specified by UTID 2102.
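
A minimal sketch of the queue-selection logic implied by SC 2112, SQ 2113, UTID 2102, and CH 2114 follows; the 5-bit color formed from CH (upper 2 bits) and UTID (lower 3 bits) is as described above, while the mapping helpers are hypothetical.

```python
# Illustrative sketch: choose which of Qs 897 supplies an input structure
# vector. SC selects the queue of the structure color formed from CH and UTID,
# SQ selects the queue named directly by UTID, and otherwise the queue of the
# currently executing task's color is used. Helper names are assumptions.

def select_input_queue(sc, sq, utid, ch, current_task_color,
                       queue_for_color, queue_by_id):
    if sc:
        color = (ch << 3) | utid       # 5-bit color: CH upper 2 bits, UTID lower 3 bits
        return queue_for_color(color)
    if sq:
        return queue_by_id(utid)       # queue specified directly by UTID
    return queue_for_color(current_task_color)
```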

FIG. 21B illustrates selected details of an embodiment of a structure output data structure descriptor (also known as structure output DSD) as structure output data structure descriptor 2120. In some embodiments, the structure output data structure descriptor 2120 describes the structure vectors created by the PEs and transmitted over the structure and various parameters related to the processing of the structure vectors. In various implementations and/or usage scenarios, the destination operand of an instruction refers to the DSR that includes an instance of the DSD according to the structure output data structure descriptor 2120.

The structure output data structure descriptor 2120 includes a length 2121, a UTID (micro thread identifier) 2122, a UE (micro thread enable) 2123, a SW (SIMD width) 2124, a color 2126, a C (output control bit) 2127, an index low 2128.1, a type 2129, a SS (single step) 2130, an SA (save address/conditional single step mode) 2131, a WLI (wavelet index select) 2132, an index high 2128.2, and an AC (active color) 2125.

In some embodiments, the elements of the structure output data structure descriptor 2120 (lengths 2121, UTID 2122, UE 2123, SW 2124, SS 2130, SA 2131, and AC 2125) are similar in function and/or operation to the elements of the structure input data structure descriptor 2100 (lengths 2101, UTID 2102, UE 2103, SW 2104, SS 2110, SA 2111, and AC 2105), respectively.

In some implementations, color 2126 includes a 5-bit field that specifies a structure color for transmitting the wavelet associated with the structure vector.

In some embodiments, C (output control bit) 2127 includes a 1-bit field that specifies whether the wavelet is a control wavelet. If C2127 is set, then any wavelet created based on the DSD is a control wavelet (e.g., control bits 1320 of FIG. 13A are set).

In some embodiments, index low 2128.1 includes a 3-bit field and index high 2128.2 includes a 3-bit field. The concatenation of index low 2128.1 and index high 2128.2 is collectively referred to as index 2128. In some cases, index 2128 is used to form an index for the wavelet (e.g., index 1321 of fig. 13A).

In some embodiments, type 2129 comprises a 3-bit field that specifies the data structure type and/or other fields of how to interpret structure output data structure descriptor 2120. Type 2129 is "0" for all instances of structure output data structure descriptor 2120.

In some embodiments, WLI (wavelet index selection) 2132 includes a 1-bit field that specifies, in part, the index of the structure vector. In some cases, if WLI 2132 is "1," the index is a value from a register (e.g., GPR4 for RF 842). In some cases, if WLI 2132 is "0," the index is a 16-bit zero extension to index 2128.

Fig. 21C illustrates selected details of an embodiment of a 1D memory vector data structure descriptor (also known as a 1D memory vector DSD) as 1D memory vector data structure descriptor 2140. In some embodiments, the 1D memory vector data structure descriptor 2140 describes a one-dimensional memory vector stored in memory and various parameters related to the processing of the memory vector. In various embodiments and/or usage scenarios, any one or more of the instruction's source 0 operand, source 1 operand, and destination operand refer to the respective DSR that includes the respective DSD instance according to the 1D memory vector data structure descriptor 2140.

1D memory vector data structure descriptor 2140 includes length 2141, base address 2142, type 2149, SS (single step) 2150, SA (save address/conditional single step mode) 2151, WLI (wavelet index select) 2152, and stride 2153.

In some embodiments, some elements of the 1D memory vector data structure descriptor 2140 (length 2141, SS 2150, and SA 2151) are respectively similar in function and/or operation to some elements of the structure input data structure descriptor 2100 (length 2101, SS 2110, and SA 2111). In some cases, if the length of the memory vector requires more than 15 bits, then the 4D memory vector data structure descriptor 2160 is used instead.

In some embodiments, the base address 2142 comprises a 15-bit integer that specifies the base address of the memory vector.

In some embodiments, the type 2149 includes a 3-bit field that specifies a data structure type and/or other fields of how to interpret the 1D memory vector data structure descriptor 2140. Type 2149 is "1" for all instances of 1D memory vector data structure descriptor 2140.

In some embodiments, WLI (wavelet index select) 2152 includes a 1-bit field that partially specifies the index of the vector. If WLI 2152 is "0", the index is 0. In some cases, if WLI 2152 is "1," the index is a value from a register (e.g., GPR4 for RF 842) or an index of a sparse wavelet (e.g., index 1321 of fig. 13A).

In some embodiments, stride 2153 comprises a 9-bit signed integer that specifies a vector stride. In some cases, base address 2142, the index specified by WLI 2152, and stride 2153 enable the calculation of the addresses of data elements in a 1D memory vector. The address of the first data element in the 1D memory vector is base address 2142 plus the index specified by WLI 2152. The address of each subsequent data element in the 1D vector is the address of the previous data element plus stride 2153. For example, base address 2142 is 136, WLI 2152 is 1, GPR4 holds the value 6, stride 2153 is -2, length 2141 is 10, and the memory vector includes data at addresses {142, 140, 138, ..., 124}. In some cases, if the stride of the memory vector requires more than 9 bits, then the 4D memory vector data structure descriptor 2160 is used instead.
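
The address calculation just described can be summarized by the small sketch below, reproducing the worked example from this paragraph; the function name is illustrative.

```python
# Address generation for a 1D memory vector: the first element is at
# base + index (the index comes from a register or a sparse-wavelet index when
# WLI is 1, else 0), and each subsequent element advances by the signed stride.

def memory_vector_addresses(base, index, stride, length):
    return [base + index + k * stride for k in range(length)]

# Example from the text: base 136, index 6 (from GPR4), stride -2, length 10.
assert memory_vector_addresses(136, 6, -2, 10) == [
    142, 140, 138, 136, 134, 132, 130, 128, 126, 124]
```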

FIG. 21D illustrates selected details of an embodiment of a 4D memory vector data structure descriptor (also known as a 4D memory vector DSD) as 4D memory vector data structure descriptor 2160. In some embodiments, 4D memory vector data structure descriptor 2160, in conjunction with 4D memory vector extension data structure descriptor 2240 of fig. 22B, describes the 4-dimensional memory vectors stored in memory and various parameters related to the processing of the memory vectors. In some embodiments, 4D memory vector data structure descriptor 2160, in conjunction with 4D memory vector extension data structure descriptor 2240 of fig. 22B, describes a two-dimensional or three-dimensional memory vector stored in memory and various parameters related to the processing of the memory vector. In various embodiments and/or usage scenarios, according to the 4D memory vector data structure descriptor 2160, any one or more of the instruction's source 0 operand, source 1 operand, and destination operand refer to the respective DSR that includes the respective instance of the DSD.

The 4D memory vector data structure descriptor 2160 includes length lower bits 2161.1, base address 2162, type 2169, SS (single step) 2170, SA (save address/conditional single step mode) 2171, WLI (wavelet index select) 2172, and length upper bits 2161.2.

In some embodiments, some elements of 4D memory vector data structure descriptor 2160 (base addresses 2162, SS 2170, SA 2171, and WLI 2172) are similar in function and/or operation with respect to 1D memory vector data structure descriptor 2140 (base addresses 2142, SS 2150, SA 2151, and WLI 2152), respectively.

In some embodiments, length lower bits 2161.1 comprise a 15-bit field and length upper bits 2161.2 comprise a 9-bit field. The concatenation of length lower bits 2161.1 and length upper bits 2161.2 is collectively referred to as (and shown as) length 2161 (a 24-bit field), which is interpreted in conjunction with the 4D memory vector extension data structure descriptor 2240.

In some embodiments, type 2169 includes a 3-bit field specifying an extended DSR (XDSR), storing, e.g., an extended DSD (XDSD). The XDSD specifies and describes one of the following: a circular memory buffer (e.g., circular memory buffer extension data structure descriptor 2210 of FIG. 22A) and a four-dimensional memory vector (e.g., 4D memory vector extension data structure descriptor 2240 of FIG. 22B).

FIG. 21E illustrates selected details of an embodiment of a circular memory buffer data structure descriptor (also known as a circular memory buffer DSD) as a circular memory buffer data structure descriptor 2180. In some embodiments, in conjunction with the circular memory buffer extension data structure descriptor 2210, the circular memory buffer data structure descriptor 2180 describes one of: a circular buffer of data elements stored in memory and a FIFO of data elements stored in memory; as well as various parameters related to the processing of the data elements. In various embodiments and/or usage scenarios, any one or more of a source 0 operand, a source 1 operand, and a destination operand of an instruction refers to a respective DSR that includes a respective DSD instance, according to the circular memory buffer data structure descriptor 2180.

The circular memory buffer data structure descriptor 2180 includes a length 2181, a base address 2182, FW (FIFO surround bit) 2188, a type 2189, SS (single step) 2190, SA (save address/conditional single step mode) 2191, WLI (wavelet index select) 2192, and SW (SIMD width) 2184. In some embodiments, circular memory buffer accesses always use an index of 0 and a stride of 1.

In some embodiments, some elements of the circular memory buffer data structure descriptor 2180 (length 2181, base address 2182, SS 2190, and SA 2191) are similar in function and/or operation, respectively, with respect to some elements of the 1D memory vector data structure descriptor 2140 (length 2141, base address 2142, SS 2150, and SA 2151). In some implementations, the type 2189 is similar in function and/or operation to the type 2169 of the 4D memory vector data structure descriptor 2160. In some embodiments, SW 2184 of circular memory buffer data structure descriptor 2180 is similar in function and/or operation to SW 2104 of structure input data structure descriptor 2100.

In some embodiments, FW (FIFO surround bit) 2188 comprises a 1-bit field that enables distinguishing a full FIFO from an empty FIFO. FW (FIFO surround bit) 2188 is toggled when an access wraps around the address range of the FIFO.

In some embodiments, WLI 2192 has no effect on the index of the circular buffer.

FIG. 22A illustrates selected details of an embodiment of a circular memory buffer extension data structure descriptor, as circular memory buffer extension data structure descriptor 2210. Circular memory buffer extension data structure descriptor 2210 includes type 2211, start address 2212, end address 2213, FIFO 2214, push (active) color 2215, and pop (active) color 2216.

In some implementations, type 2211 includes a 1-bit field that specifies the type of the data structure. For all instances of circular memory buffer extension data structure descriptor 2210, type 2211 is "1".

In some implementations, start address 2212 comprises a 15-bit field that specifies the start address of a circular buffer in memory. In some implementations, end address 2213 comprises a 15-bit integer that specifies the end address of a circular buffer in memory. When an address is incremented (e.g., by a stride to begin the next access) and equals end address 2213, the address is reset to start address 2212, thereby providing circular access behavior.

In some implementations, FIFO 2214 includes a 1-bit field that specifies whether the circular buffer is a FIFO. If FIFO 2214 is "0," then the circular buffer is not a FIFO. If FIFO 2214 is a "1," the circular buffer is a FIFO.

In some implementations, push (active) color 2215 and pop (active) color 2216 comprise 6-bit fields that specify a color to activate (e.g., via an activate operation). In some implementations, push (active) color 2215 and pop (active) color 2216 can specify one of a native color and a structural color.

In various embodiments, two circular memory buffer DSRs describe a FIFO of data elements stored in the same region of memory. The destination DSR (e.g., DDSR8) describes the write pointer of the FIFO, and the source DSR (e.g., S1DSR8) describes the read pointer of the FIFO. In some embodiments, the destination and source DSRs have the same identifier. In various embodiments, only some of DSRs 846 are capable of describing FIFOs (e.g., DDSR8-DDSR11 and S1DSR8-S1DSR11).

FW (FIFO surround bit) 2188 of the two DSRs enables detecting whether the FIFO is full or empty. When the FIFO is used as a destination, base address 2182 and FW 2188 of the associated S1DSR are read and compared with the values from the DDSR. If the base addresses 2182 of the two DSRs are the same but the FW 2188 values differ, then the FIFO is full. When the FIFO is used as a source, base address 2182 and FW 2188 of the associated DDSR are read and compared with the values from the S1DSR. If the base addresses 2182 of the two DSRs are the same and the FW 2188 values are the same, then the FIFO is empty. In some cases (e.g., micro-threading), in response to a read access of an empty FIFO or a write access of a full FIFO, processing switches to an instruction in another task until the FIFO is respectively not empty or not full.
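The full/empty test above can be sketched as follows (Python; the parameter names are illustrative), assuming each of the two DSRs exposes its base address 2182 and its FW 2188 bit:

```
def fifo_full(ddsr_base, ddsr_fw, s1dsr_base, s1dsr_fw):
    # Full: write and read pointers coincide, but the write pointer has
    # wrapped one more time than the read pointer (FW bits differ).
    return ddsr_base == s1dsr_base and ddsr_fw != s1dsr_fw

def fifo_empty(ddsr_base, ddsr_fw, s1dsr_base, s1dsr_fw):
    # Empty: write and read pointers coincide and have wrapped the same
    # number of times (FW bits equal).
    return ddsr_base == s1dsr_base and ddsr_fw == s1dsr_fw
```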

Fig. 22B illustrates selected details of an embodiment of a 4D memory vector extension data structure descriptor, as 4D memory vector extension data structure descriptor 2240. In some implementations, the 4D memory vector extension data structure descriptor 2240 partially describes a four-dimensional vector of data elements stored in memory. The 4D memory vector extension data structure descriptor 2240 includes type 2241, dimensions 2242, DF (dimension format) 2243, stride select 1 2244.1, stride select 2 2244.2, stride select 3 2244.3, stride select 4 2244.4, and stride 2245. In some embodiments, the 4D memory vector extension data structure descriptor 2240 comprises 51 bits.

In some implementations, the type 2241 includes a 1-bit field that specifies the type of the data structure. For all instances of the 4D memory vector extension data structure descriptor 2240, the type 2241 is "0".

In some implementations, dimensions 2242 comprises a 20-bit field used to initialize the length of the next dimension of the vector.

In some embodiments, DF (dimension format) 2243 comprises a 5-bit field that, in conjunction with length 2161 of FIG. 21D, specifies the length of each dimension of an N-dimensional vector. Conceptually, length 2161 is divided into six consecutive 4-bit nibbles, and each dimension is represented using one or more of the nibbles. Bits are set in DF 2243 to indicate the demarcations between the dimensions in length 2161. For example, DF 2243 of "01110" (binary) specifies that two nibbles, e.g., bits [7:0], are used to represent the first dimension, with a length between 1 and 128. Similarly, one nibble, e.g., bits [11:8], is used to represent the second dimension, with a length between 1 and 4. An N-dimensional vector is represented by setting (N-1) bits in DF 2243, and more than four nibbles are used only for the last dimension. In some implementations and/or usage scenarios, this format is used to describe a one-dimensional vector, e.g., if the vector is too long to be described by length 2141 (FIG. 21C). In some embodiments and/or usage scenarios, this format is used to describe a two-dimensional or three-dimensional vector.
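One way to read the DF encoding is sketched below (Python). The sketch assumes that bit i of DF 2243 marks a dimension boundary after nibble i of length 2161 and that the last dimension takes all remaining nibbles, which is consistent with the "01110" example above; the exact hardware bit ordering is an assumption made only for illustration.

```
def split_dimension_lengths(length_24bit, df_5bit):
    """Partition the 24-bit length field into per-dimension lengths.

    Each set bit i of the 5-bit DF field is treated as a boundary after
    nibble i; the last dimension takes all remaining nibbles."""
    nibbles = [(length_24bit >> (4 * i)) & 0xF for i in range(6)]
    boundaries = [i for i in range(5) if (df_5bit >> i) & 1]
    dims, start = [], 0
    for b in boundaries + [5]:               # final group runs to the last nibble
        group = nibbles[start:b + 1]
        dims.append(sum(n << (4 * k) for k, n in enumerate(group)))
        start = b + 1
    return dims

# DF = 0b01110 splits the six nibbles as [0:1], [2], [3], [4:5] -> a 4D vector.
print(split_dimension_lengths(0x654321, 0b01110))  # [33, 3, 4, 101], i.e., 0x21, 0x3, 0x4, 0x65
```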

In some embodiments, stride select 1 2244.1 comprises a 1-bit field specifying the stride for the first dimension of the vector. If stride select 1 2244.1 is "0", then the stride is 1. If stride select 1 2244.1 is "1", then the stride is specified by stride 2245.

In some embodiments, stride select 2 2244.2 comprises a 3-bit field and encodes the stride for the second dimension of the vector. If stride select 2 2244.2 is "0", then the stride is 1. If stride select 2 2244.2 is "1", then the stride is specified by stride 2245. If stride select 2 2244.2 is 2-7, then the stride is specified by a corresponding (DSR) stride register (e.g., one of the six stride registers of DSRs 846).

In some embodiments, stride select 3 2244.3 and stride select 4 2244.4 comprise respective 3-bit fields. In some embodiments, stride select 3 2244.3 and stride select 4 2244.4 are, with respect to the third dimension and the fourth dimension respectively, similar in function and/or operation to stride select 2 2244.2 with respect to the second dimension.

In some embodiments, stride 2245 comprises a 15-bit field that specifies a vector stride in memory. In some cases, stride 2245 enables a longer stride for a one-dimensional vector than stride 2153 (FIG. 21C).
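For illustration, a sketch (Python) of iterating the addresses of an N-dimensional memory vector once the per-dimension lengths and strides have been resolved (from stride 2245, the stride registers, or an implied stride of 1). The iteration order shown (first dimension fastest) and the names are assumptions made only for illustration:

```
from itertools import product

def gen_nd_addresses(base_address, dim_lengths, dim_strides):
    """Yield the addresses of an N-dimensional memory vector; dimension 0
    varies fastest, and each dimension contributes index * stride."""
    for idx in product(*(range(n) for n in reversed(dim_lengths))):
        idx = tuple(reversed(idx))  # idx[d] is the coordinate in dimension d
        yield base_address + sum(i * s for i, s in zip(idx, dim_strides))

# A 2x3 vector at base 100: dimension 0 has stride 1, dimension 1 has stride 16.
print(list(gen_nd_addresses(100, [2, 3], [1, 16])))
# [100, 101, 116, 117, 132, 133]
```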

FIG. 23 illustrates selected details of an embodiment for accessing operands in accordance with a data structure descriptor as a data structure descriptor stream 2300. In some implementations, the actions of the data structure descriptor stream 2300 are performed by a CE (e.g., CE 800).

Accessing an operand via a data structure descriptor begins (start 2301) by initializing one or more DSRs of a CE of a PE with respective DSDs (set DSR 2302), and optionally initializing respective XDSDs and/or stride values of the CE ((optional) set XDSR 2305). In some implementations, the DSRs (and optionally the XDSRs and the stride registers holding stride values) are initialized by instructions that move data from memory to the DSRs. Subsequently, the CE fetches and decodes an instruction (e.g., FMACH, MOV, or LT16) comprising one or more operands specified by the initialized DSRs and optionally by one or more XDSRs and/or stride registers (fetch/decode instruction with DSRs 2303). In some embodiments, the operand type fields of the instruction specify whether an operand is specified by a DSR.

The CE reads one or more DSDs from the DSR (reads DSR 2304), and determines one or more of the following: the type of data structure, the source of the data element, whether multiple data elements are read together (e.g., for SIMD operations), and the total number of data elements per operand. Depending on the determination, for each DSD read, the XDSR and one or more stride registers are also optionally read ((optional) read XDSR 2306), as described with reference to fig. 24. In some cases, the DSR is read for one or more of the source 0 operand, the source 1 operand, and the destination operand and is identified by the corresponding operand field of the instruction obtained in act 2303. In some implementations and/or usage scenarios, any one or more of DSR, XDSR, and stride register are read in parallel, in whole or in part, and in other implementations and/or usage scenarios, any one or more of DSR, XDSR, and stride register are read in sequence, in whole or in part.

Based on the DSD obtained in act 2304 (and the optional XDSR and stride value obtained in act 2306), the CE reads one or more source data elements from the structure and/or memory (reads the (next) source data element from queue/memory 2310). For each source (e.g., each of source 0 and source 1) specified by the instruction obtained in act 2303, the CE reads enough elements for the iteration of the operation specified in the instruction according to the SIMD width information in the DSD. Data elements from the structure (e.g., the source data structure is a structure vector) are accessed via one or more queues of the CE. In some implementations and/or usage scenarios, the CE also reads data elements from the registers.

After reading the source data element, the CE performs an operation using the data element as input (perform (next) operation 2311 on the data element). This operation is specified by the instruction obtained in act 2303 (e.g., a multiply accumulate operation for a FMACH instruction, a move operation for a MOV instruction, or a less than integer comparison for LT 16).

In some cases, an operation (e.g., a multiply-accumulate operation or a move operation) produces one or more output data elements. Based on the DSD obtained in act 2304 (and the optional XDSR and stride value obtained in act 2306), the CE writes the output data element to structure or storage (writes the (next) destination data element to queue/memory 2312). The data elements sent to the fabric (e.g., the destination data structure is a fabric vector) are formed into wavelets and transmitted to the fabric via the routers of the PEs. In some other scenarios, no data elements are output (e.g., some comparison operations).

After writing the results of the operation, the CE determines whether there are additional data elements to process (more data elements? 2313). In some implementations, the DSD specifies the total number of data elements to access (e.g., the length of the vector), and the CE compares the number of data elements that have been accessed (e.g., tracked via a counter) to the total number specified by the length. If there are additional data elements to process, the CE repeats acts 2310-2313 until all data elements have been processed, and the flow concludes (end 2316).
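The per-instruction loop of acts 2310-2313 can be sketched as follows (Python). The queue/memory reads and writes of the described hardware are stood in for by iterators and a writer callback; all names are illustrative:

```
def execute_dsd_operation(op, src0_elems, src1_elems, dst_writer, length):
    """Read source data elements, apply the decoded operation, write any
    result, and repeat until the vector length from the DSD is exhausted."""
    src0 = iter(src0_elems)      # stands in for reads per the source 0 DSD
    src1 = iter(src1_elems)      # stands in for reads per the source 1 DSD
    for _ in range(length):      # length comes from the DSD (e.g., length 2141)
        result = op(next(src0), next(src1))
        if result is not None:   # some operations (e.g., compares) write nothing
            dst_writer(result)   # stands in for a write per the destination DSD

# Example: an elementwise multiply written into a destination list.
dst = []
execute_dsd_operation(lambda a, b: a * b, [1, 2, 3], [4, 5, 6], dst.append, 3)
print(dst)  # [4, 10, 18]
```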

In various embodiments and/or usage scenarios, all or any portion of any one or more elements of the data structure descriptor stream 2300 (e.g., any one or more actions of 2302 and 2312) conceptually correspond to and/or are conceptually related to the operations performed by a CE (e.g., CE 800).

As one example, the source DSR that holds the source DSD (associated with setting DSR 2302 and reading DSR 2304) is one or more of DSRs 846 (e.g., S0DSR, S1DSR, DDSR, XDSR, and stride registers). In some implementations, CE 800 performs setting DSR 2302 in response to an instruction to write DSD to DSR (e.g., LDS0WDS, LDS1WDS, LDXDS, and LDSR).

As another example, CE 800 executes a fetch/decode instruction with DSR 2303. In various embodiments, PC 834 and I-Seq 836 fetch instructions from memory 854, and Dec 840 decodes the fetched instructions. In some embodiments, the instructions are formatted according to one of the following: the multi-operand instruction 2510 of FIG. 25A, the one-source no-destination operand instruction 2520 of FIG. 25B, and the immediate instruction 2530 of FIG. 25C. In some embodiments, decoding includes detecting that an instruction operand is specified by the DSD, e.g., the operand 1 type 2514.1 has a value of "1".

As another example, CE 800 executes read DSR 2304 in response to an instruction having one or more operands specified by DSRs. In various embodiments, D-Seq 844 reads, from DSRs 846, the DSRs specified by the instruction obtained in act 2303. In some implementations, the DSD read from a DSR is formatted according to one or more of the following: the structure input data structure descriptor 2100 of fig. 21A, the structure output data structure descriptor 2120 of fig. 21B, the 1D memory vector data structure descriptor 2140 of fig. 21C, the 4D memory vector data structure descriptor 2160 of fig. 21D, and the circular memory buffer data structure descriptor 2180 of fig. 21E. In some implementations and/or usage scenarios, for example, in response to a DSD having a type 2169 or a type 2189 specifying an XDSR, D-Seq 844 performs (optional) read XDSR 2306. In various embodiments, the XDSD read from the XDSR is formatted according to one of: the circular memory buffer extension data structure descriptor 2210 of FIG. 22A and the 4D memory vector extension data structure descriptor 2240 of FIG. 22B.

As another example, CE 800 performs a read (next) source data element from queue/memory 2310 based on the source DSD read in act 2304 and optionally the XDSD read in act 2306. In some cases, source DSD specifies (e.g., via type 2149) that the operand originates from memory, and D-Seq 844 reads the data element from data store 848 or memory 854 at the DSD specified address (e.g., based in part on one or more of base address 2142, WLI 2152, and stride 2153). In some cases, the source DSD specifies (e.g., via type 2109) that the operand originates from the fabric, and the CE 800 reads the data element from one of the Qs 897. In some implementations and/or usage scenarios, the data element is transmitted directly from one of Qs 897 to data path 852. In other implementations and/or usage scenarios, the data elements are transmitted from one of Qs 897 to RF 842, and from RF to data path 852. In some implementations, one of Qs 897 is implicitly specified by a portion of the DSD (e.g., one or more of UTID 2102, SC 2112, and SQ 2113). In some cases, the CE reads from the queue associated with the color of the current task (e.g., the task associated with the instruction obtained in act 2303). In some cases (e.g., SQ 2113 is "1"), the CE reads from the queue specified by UTID 2102. In some cases (e.g., SC 2112 is "1"), the CE reads from the queue associated with the color specified by UTID 2102 concatenated with CH 2114. In some cases, the CE reads one, two, or four data elements from the designated queue based on SW 2104.

In some embodiments and/or usage scenarios, the CE 800 stalls when the CE 800 attempts to read more data elements than are available in the designated queue of Qs 897, or alternatively attempts to read from an empty FIFO (e.g., implemented according to the DSD of fig. 21E). In some embodiments and/or usage scenarios (e.g., micro-threading), the selector 830 can select a different task from Qs 897 while waiting for a data element, thereby enabling the CE 800 to avoid stalling.

As another example, the CE 800 performs a (next) operation on the data element 2311. In some implementations, the data path 852 uses the data elements read in act 2310 as input to the operation specified by the instruction obtained in act 2303. In some cases (e.g., a computing operation), act 2311 generates an output data element, while in other cases (e.g., a comparison operation), act 2311 does not generate an output data element. In some implementations, the datapath 852 is capable of performing more than one operation simultaneously, e.g., performing two or four multiply-accumulate operations simultaneously using SIMD execution resources.

As another example, CE 800 performs write (next) destination data element to queue/memory 2312 based on the destination DSD read in act 2304 and optionally the XDSD read in act 2306. In some cases, the destination DSD specifies (e.g., via type 2149) that the operand is in memory, and D-Seq 844 writes the data element to data store 848 or memory 854 at the address specified by the destination DSD (e.g., based in part on one or more of base address 2142, WLI 2152, and stride 2153).

In various embodiments and/or usage scenarios, portions of act 2312 (e.g., writing the destination data element to the structure) conceptually correspond to and/or are conceptually related to providing the data element as a wavelet to router 1406 of fig. 14. In some cases, the destination DSD specifies (e.g., via type 2129) that the operand is to be sent to the fabric, and the CE 800 creates a wavelet from the data elements (e.g., based in part on the fabric output data structure descriptor 2120) and transmits it to the fabric via the ingress ramp 860 to router 600 (of fig. 6). In some cases, the CE transmits one, two, or four data elements as a wavelet based on SW 2124 of the destination DSD.

In some embodiments and/or usage scenarios, the CE 800 stalls when the CE 800 attempts to transmit more wavelets than are available in the router 600 (e.g., there are insufficient resources in the data queue 650 of fig. 6), or alternatively, attempts to write to a full FIFO (e.g., implemented according to the DSD of fig. 21E). In some embodiments and/or usage scenarios (e.g., micro-threading), the selector 830 can select a different task from Qs 897 while waiting for more resources, thereby enabling the CE 800 to avoid stalling.

As another example, CE 800 performs act 2313. In some embodiments, D-Seq 844 determines how many data elements have been processed (e.g., by incrementing a counter for each data element) and compares it to the length of the vector (e.g., length 2101).

FIG. 24 illustrates selected details of an embodiment of decoding a data structure descriptor, as data structure descriptor decoding stream 2400. In various embodiments and/or usage scenarios, data structure descriptor decoding stream 2400 is a conceptual representation of all or any portion of acts 2304, 2306, 2310, and 2312 (of FIG. 23), as performed for each DSR describing a structure or memory vector. In summary, FIG. 23 illustrates fetching and decoding an instruction comprising one or more operands specified by initialized DSRs, reading the DSRs to obtain and decode corresponding DSDs, reading (next) source data elements in accordance with the DSDs, performing an operation on the source data elements, writing output data elements of the operation in accordance with the DSDs, and iterating back to reading the next source data elements until complete. FIG. 24 illustrates further details of decoding a DSD obtained from a DSR, and of optionally reading one or more XDSRs and stride registers to obtain and decode corresponding XDSDs and stride values, for a structure vector (structure vector 2410) or a memory vector (memory vector 2420), to determine the memory access pattern used to access the data elements (e.g., any one or more of source 0, source 1, and destination) of the instruction. Conceptually, the actions illustrated in FIG. 24 are performed for each DSD obtained via act 2304 of FIG. 23. In some implementations, the actions of data structure descriptor decoding stream 2400 are performed by a CE (e.g., CE 800).

Decoding the DSD (e.g., obtained via act 2304 of FIG. 23) begins (start 2401) with the CE determining whether the DSD corresponds to a structure vector (type 2411). If so, then accesses of the operand described by the DSD proceed using the DSD as a structure vector (access via DSD 2412); e.g., if the operand is a source (FIG. 21A), then act 2310 (of FIG. 23) reads from the structure in accordance with the DSD, and if the operand is a destination (FIG. 21B), then act 2312 (of FIG. 23) writes to the structure in accordance with the DSD.

If the DSD does not correspond to a structure vector, then the DSD corresponds to a memory vector. The CE then determines whether the DSD corresponds to a 1D memory vector (type XDSR 2421), e.g., according to FIG. 21C. If so, then accesses of the operand described by the DSD proceed using the DSD as a 1D memory vector (access 1D via DSD 2427). For example, if the operand is a source, then act 2310 reads the source from memory in accordance with the 1D memory vector described by the DSD, and if the operand is a destination, then act 2312 writes to memory in accordance with the 1D memory vector described by the DSD. Each iteration over the data elements in FIG. 23 (acts 2310-2313) advances the operand memory address in accordance with the 1D memory vector described by the DSD.

If the DSD does not correspond to a 1D memory vector, then the DSD corresponds to a 4D memory vector (e.g., according to FIG. 21D) or a circular buffer (e.g., according to FIG. 21E). The CE reads the XDSR specified by the DSD (the read XDSR specified via DSD 2422, also conceptually corresponding to the (optional) read XDSR 2306 of fig. 23) to obtain the XDSD. The XDSR is specified by type 2169 (FIG. 21D) or type 2189 (FIG. 21E).

The CE then determines whether the XDSD specifies a 4D memory vector (e.g., according to FIG. 22B). If so, CE optionally reads one or more stride registers ((optionally) reads stride register 2424, conceptually also corresponding to (optionally) read XDSR 2306 of fig. 23), as optionally specified by XDSD. Accesses to operands described by DSD, XDSD, and any optional stride value (obtained from stride registers) are made using DSD, XDSD, and optional stride value (4D accessed via XDSD 2428) as a 4D memory vector. For example, if the operand is a source, act 2310 reads the source from memory according to a 4D memory vector, and if the operand is a destination, act 2312 writes to memory according to the 4D memory vector. Each iteration of the data elements in FIG. 23 (act 2310-2313) advances the operand memory addresses based on the 4D memory vector described by the DSD.

If the XDSD does not correspond to a 4D memory vector, then the XDSD corresponds to a circular buffer (e.g., according to FIG. 22A). Accesses of the operand described by the DSD and the XDSD proceed using the DSD and the XDSD as a circular buffer (accessing the circular buffer via XDSD 2429). For example, if the operand is a source, then act 2310 reads the source from memory in accordance with the circular buffer, and if the operand is a destination, then act 2312 writes to memory in accordance with the circular buffer. Each iteration over the data elements in FIG. 23 (acts 2310-2313) advances the operand memory address in accordance with the circular buffer described by the DSD.
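The dispatch of FIG. 24 can be summarized by the following sketch (Python), which classifies a DSD as a structure vector, a 1D memory vector, or an extended descriptor resolved via an XDSR to either a 4D memory vector or a circular buffer; the dictionary fields and type values are illustrative assumptions, not the described hardware encodings:

```
def decode_descriptor(dsd, read_xdsr):
    """Classify a DSD and, if needed, fetch the XDSD it names."""
    if dsd["type"] == "structure":
        return ("structure_vector", dsd)
    if dsd["type"] == "1d":
        return ("1d_memory_vector", dsd)
    xdsd = read_xdsr(dsd["type"])          # e.g., type values 2-7 name an XDSR
    if xdsd["type"] == "4d":
        return ("4d_memory_vector", dsd, xdsd)
    return ("circular_buffer", dsd, xdsd)

xdsrs = {3: {"type": "4d"}, 4: {"type": "circular"}}
print(decode_descriptor({"type": "1d"}, xdsrs.get))  # ('1d_memory_vector', ...)
print(decode_descriptor({"type": 3}, xdsrs.get))     # ('4d_memory_vector', ...)
```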

In various embodiments, D-Seq 844 performs type? 2411 and/or type XDSD 2421 based on the DSD. In some embodiments, the type field of the DSD (e.g., type 2109 of FIG. 21A, type 2129 of FIG. 21B, type 2149 of FIG. 21C, type 2169 of FIG. 21D, and type 2189 of FIG. 21E) determines whether the data structure is one of: a structure vector (e.g., type = 0), a 1D vector (e.g., type = 1), and an XDSD type (e.g., type = 2-7). In various embodiments (e.g., type = 2-7), the value of the type field specifies which XDSR of DSRs 846 is read by act 2422. In some implementations, D-Seq 844 performs act 2422 and receives the XDSD from DSRs 846. In some other implementations, DSRs 846 perform acts 2421 and 2422 and transmit the DSD and the XDSD to D-Seq 844.

As another example, D-Seq 844 performs type? 2423 based on the XDSD of act 2422. In some implementations, the type field of the XDSD read from the XDSR (e.g., type 2211 of FIG. 22A or type 2241 of FIG. 22B) determines whether the data structure is one of a 4D vector (e.g., XDSD type = 0) and a circular buffer (e.g., XDSD type = 1).

As another example, according to act 2427, D-Seq 844 generates memory accesses by computing memory addresses based on the DSD (e.g., of act 2304) using, e.g., base address 2142, WLI 2152, length 2141, and stride 2153 of the DSD, as described elsewhere herein. Similarly, according to act 2428, D-Seq 844 generates memory accesses by computing memory addresses based on the DSD (e.g., of act 2304) and the XDSD of act 2422 using, e.g., base address 2162, length 2161, WLI 2172, stride 2245, stride select 1 2244.1, and DF 2243 of the DSD and the XDSD. Similarly, according to act 2429, D-Seq 844 generates memory accesses by computing memory addresses based on the DSD (e.g., of act 2304) and the XDSD of act 2422 using, e.g., base address 2182, length 2181, WLI 2192, start address 2212, and end address 2213 of the DSD and the XDSD, as described elsewhere herein.

In some embodiments, D-Seq 844 sends each calculated address to one of data store 848 and memory 854. In response to receiving the computed address, the data store and/or memory accesses two bytes of data at the computed address.

Instruction format

In the description of figs. 25A through 25C, each element identifier having a leading digit of "8" refers to an element of fig. 8 and, for brevity, is not otherwise specifically identified as being an element of fig. 8.

FIG. 25A illustrates selected details of an embodiment of a multi-operand instruction, as multi-operand instruction 2510. The multi-operand instruction 2510 is one of: a two/three source, one destination operand instruction (e.g., a multiply-add such as FMACH), a two source, no destination operand instruction (e.g., a comparison such as LT16), and a one source, one destination operand instruction (e.g., a move such as MOV16).

The multi-operand instruction 2510 includes various fields: instruction type 2511, opcode 2512, operand 0 encoding 2513, operand 1 encoding 2514, and terminate 2515. Operand 0 encoding 2513 includes operand 0 type 2513.1 and operand 0 2513.2. Operand 1 encoding 2514 includes operand 1 type 2514.1 and operand 1 2514.2. In some embodiments, the multi-operand instruction 2510 comprises 20 bits.

In some embodiments, the value of instruction type 2511 distinguishes between different types of instructions (e.g., the two/three source, one destination; two source, no destination; and one source, one destination instruction types) according to the following table. In various embodiments, the value of opcode 2512 specifies a particular operation (e.g., multiply, add, or subtract). The length of opcode 2512 varies between the instruction types, as described in the following table.

Instruction family                       Value of instruction type 2511    Length of opcode 2512
Two/three source, one destination        10                                5 bits
Two source, no destination               1110                              4 bits
One source, one destination              110                               5 bits
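The variable-length instruction type prefix in the table can be decoded as in the following sketch (Python). The bit-ordering convention (most significant bit first, represented as a string) is an illustrative assumption:

```
def decode_instruction_family(bits):
    """Classify a multi-operand instruction by its instruction type 2511
    prefix and report the opcode 2512 width, per the table above."""
    if bits.startswith("1110"):
        return "two source, no destination", 4       # 4-bit opcode
    if bits.startswith("110"):
        return "one source, one destination", 5      # 5-bit opcode
    if bits.startswith("10"):
        return "two/three source, one destination", 5
    raise ValueError("not a multi-operand instruction type prefix")

print(decode_instruction_family("10" + "0" * 18))    # two/three source, one destination
print(decode_instruction_family("1110" + "0" * 16))  # two source, no destination
```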

In some embodiments, operand 0 encoding 2513 describes a source and/or destination operand according to the following tables. In some embodiments, operand 1 encoding 2514 describes a source operand.

In some embodiments, operand 0 2513.2 and operand 1 2514.2 comprise respective 4-bit fields. In some embodiments, operand 0 type 2513.1 and operand 1 type 2514.1 comprise respective 2-bit fields and determine how to interpret operand 0 2513.2 and operand 1 2514.2, respectively. For a two/three source, one destination operand instruction, operand 0 type 2513.1 is interpreted according to the table below.

For example, if operand 0 type 2513.1 has a value of "1" and operand 0 2513.2 has a value of "4", then operand 0 encoding 2513 specifies that the source 0 operand is a vector described by S0DSR[4] and the destination operand is a vector described by DDSR[4].

For a two source, no destination operand instruction, operand 0 type 2513.1 is interpreted according to the table below.

Value of operand 0 type 2513.1    Operand 0 encoding 2513
0                                 Source 0 is S0DSR[operand 0 2513.2]
1                                 Source 0 is GPR[operand 0 2513.2]

For example, if operand 0 type 2513.1 has a value of "0" and operand 0 2513.2 has a value of "4", then operand 0 encoding 2513 specifies that the source 0 operand is a vector described by S0DSR[4].

For a one source, one destination operand instruction, operand 0 type 2513.1 is interpreted according to the table below.

Value of operand 0 type 2513.1    Operand 0 encoding 2513
0                                 Destination is DDSR[operand 0 2513.2]
1                                 Destination is GPR[operand 0 2513.2]

For example, if operand 0 type 2513.1 has a value of "0" and operand 0 2513.2 has a value of "4", then operand 0 encoding 2513 specifies that the destination operand is a vector described by DDSR[4].

For the multi-operand instruction 2510, operand 1 type 2514.1 is interpreted according to the following table.

Value of operand 1 type 2514.1    Operand 1 encoding 2514
0                                 Source 1 is S1DSR[operand 1 2514.2]
1                                 Source 1 is data in memory at the address specified by GPR[6]
2                                 Source 1 is GPR[operand 1 2514.2]
3                                 Source 1 is an immediate

For example, if operand 1 type 2514.1 has a value of "0" and operand 1 2514.2 has a value of "4", then operand 1 encoding 2514 specifies that the source 1 operand is a vector described by S1DSR[4].

In various embodiments, an immediate source operand specifies one of several predetermined values (e.g., 0, 1, and -1) or a pseudo-random number generated by an LFSR. For example, if operand 1 type 2514.1 has a value of "3" and operand 1 2514.2 has a value of "8", then operand 1 encoding 2514 specifies a pseudo-random number generated by an LFSR.

In some embodiments, terminate 2515 comprises a 1-bit field specifying that the instruction is the last instruction in a task. When the instruction completes execution, the task is terminated, enabling selection and execution of a new task (e.g., via terminate 812 and selector 830).

FIG. 25B illustrates selected details of an embodiment of a one source, no destination operand instruction, as one source, no destination instruction 2520. The one source, no destination instruction 2520 includes instruction type 2521, opcode 2522, operand 1 encoding 2523, immediate high 2524, and terminate 2525. Operand 1 encoding 2523 describes a source operand and includes operand 1 type 2523.1 and operand 1 2523.2. In some embodiments, the one source, no destination instruction 2520 comprises 20 bits.

In some embodiments, instruction type 2521 comprises four bits, "1111", specifying that the instruction is a one source, no destination operand instruction, and opcode 2522 comprises a 4-bit field specifying a particular operation (e.g., block, unblock, activate, set active PRNG, data filter, conditional branch, or jump).

In some embodiments, immediate high 2524 comprises a 4-bit field. In some cases, immediate high 2524 is concatenated with operand 1 2523.2 to form an 8-bit immediate.

In some embodiments, operand 1 type 2523.1 comprises a 2-bit field that determines how operand 1 2523.2 is interpreted. If operand 1 type 2523.1 is "0", then operand 1 encoding 2523 specifies a vector (e.g., a structure vector of data elements from Qs 897, or a memory vector of data elements in one of memory 854 and data store 848), and the value of operand 1 2523.2 identifies which of the 12 S1DSRs of DSRs 846 describes the vector. If operand 1 type 2523.1 is "1", then operand 1 encoding 2523 describes a value in memory (e.g., one of memory 854 and data store 848) at an 8-bit address formed by the concatenation of immediate high 2524 and operand 1 2523.2. If operand 1 type 2523.1 is "2", then operand 1 encoding 2523 describes a value in a register (e.g., one of RF 842) identified by the value of operand 1 2523.2. If operand 1 type 2523.1 is "3", then operand 1 encoding 2523 describes an immediate. If opcode 2522 specifies an operation (e.g., block, unblock, or activate) on a 16-bit integer operand, then the immediate comprises 8 bits and is the concatenation of immediate high 2524 and operand 1 2523.2.

In some embodiments, terminate 2525 comprises a 1-bit field specifying that the instruction is the last instruction in a task. When the instruction completes execution, the task is terminated, enabling selection and execution of a new task (e.g., via terminate 812 and selector 830). If the one source, no destination instruction 2520 is a conditional branch, then the task is terminated only if the conditional branch is not taken.

FIG. 25C illustrates selected details of an embodiment of an immediate instruction, as immediate instruction 2530. The immediate instruction 2530 includes instruction type 2531, opcode 2532, operand 0 2533.2, and immediate 2534. In some embodiments, immediate low 2534.1 comprises a 9-bit field and immediate high 2534.2 comprises a 1-bit field. The concatenation of immediate low 2534.1 and immediate high 2534.2 is collectively referred to as (and illustrated as) immediate 2534. In some embodiments, the immediate instruction 2530 comprises 20 bits.

In some implementations, instruction type 2531 comprises a 1-bit field, "0", specifying that the instruction is an immediate instruction, and opcode 2532 comprises a 5-bit field specifying a particular operation (e.g., load source 0 DSR, load source 1 DSR, load destination DSR, store source 0 DSR, store source 1 DSR, or store destination DSR). In some cases, execution of an immediate instruction 2530 (e.g., a load DSR instruction or a load XDSR instruction) loads data from one of memory 854 and data store 848 into a DSR of DSRs 846. In other cases, execution of an immediate instruction 2530 (e.g., a store DSR instruction or a store XDSR instruction) stores data from a DSR of DSRs 846 to one of memory 854 and data store 848.

In some embodiments, operand 0 2533.2 comprises a 4-bit field, and opcode 2532 determines how operand 0 2533.2 is interpreted. In some cases (e.g., if opcode 2532 specifies an operation without a register operand, such as a jump operation), immediate low 2534.1, operand 0 2533.2, and immediate high 2534.2 are concatenated to form a 14-bit immediate. In some other cases, immediate 2534 is sign-extended to form a 16-bit immediate. In yet other cases, immediate 2534 is sign-extended to form a 15-bit address. In still other cases, immediate 2534 is shifted left by one bit and sign-extended to form a 15-bit address (e.g., for 32-bit data).
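A sketch (Python) of the sign-extension and shift-by-one cases above; the exact concatenation order of immediate high 2534.2 and immediate low 2534.1 is an assumption made only for illustration:

```
def sign_extend(value, from_bits, to_bits):
    """Sign-extend `value` from from_bits wide to to_bits wide."""
    sign = 1 << (from_bits - 1)
    return ((value ^ sign) - sign) & ((1 << to_bits) - 1)

def immediate_2534(imm_hi_1bit, imm_lo_9bit):
    # Assumed concatenation: the single high bit above the nine low bits.
    return (imm_hi_1bit << 9) | imm_lo_9bit

imm = immediate_2534(1, 0x1FF)                 # all-ones 10-bit value, i.e., -1
print(hex(sign_extend(imm, 10, 16)))           # 0xffff: -1 as a 16-bit immediate
print(hex(sign_extend(imm << 1, 11, 15)))      # 0x7ffe: shifted left one bit, then
                                               # sign-extended to a 15-bit address
```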

Deep learning accelerator example usage

In various embodiments and/or usage scenarios, as described elsewhere herein, a deep learning accelerator (e.g., a structure of PEs, e.g., implemented via wafer-level integration and as illustrated, for example, in fig. 4) is usable to train a neural network and/or to perform inference with a trained neural network. In some cases, training comprises determining weights of the neural network in response to training stimuli. Various techniques are usable for training, such as Stochastic Gradient Descent (SGD), small batch gradient descent (MBGD), Continuous Propagation Gradient Descent (CPGD), and Reverse Checkpoint (RCP). Next, CPGD is compared with the other techniques, and then each of SGD, MBGD, CPGD, and RCP is described in more detail.

Previous deep neural network training approaches (e.g., SGD and MBGD) have used what is termed anchored-delta learning. That is, the delta-derived weight updates are 'anchored' or held fixed until processing of all activations of a batch or mini-batch of the training set is completed. In some cases, the layer-sequential nature of anchored-delta learning results in high-latency sequential parameter updates (including, e.g., weight updates), which in turn leads to slow convergence. In some cases, anchored-delta learning limits layer parallelism and thus limits concurrency.

In contrast, in some cases, using a continuous-propagation (also termed immediate-delta) learning rule for deep neural network training, as taught herein, provides faster convergence, decreases the latency of parameter updates, and increases concurrency by enabling layer parallelism. Deltas computed from the immediate network parameters use updated information corresponding to the current parameter slope. Continuous propagation enables layer parallelism by enabling each layer to learn concurrently with the other layers, without explicit synchronization. As a result, parallelization along the depth of a network enables more computational resources to be applied to training. In some usage scenarios, the parallelism available in continuous propagation realizes up to a 1x improvement in wall clock time compared to MBGD techniques. The continuous-propagation approach also avoids using additional memory to store the model parameter values for multiple activation vectors.

In some embodiments and/or usage scenarios, a neural network is trained using continuous propagation of stimuli to perform SGD. In some embodiments of training via CPGD, RCP reduces the number of activations saved in memory (thereby reducing the memory footprint) by recomputing selected activations. In some cases, recomputing activations also improves the accuracy of the training estimates of the weights. In training without RCP, each layer of neurons receives activations during one or more forward passes and saves the activations for reuse in computations performed during the one or more backward passes associated with the forward passes (e.g., the delta, chain, and weight update passes associated with the forward passes). In some cases (e.g., relatively deep neural networks), the time between saving an activation and the associated backward pass is relatively long, and saving all activations uses relatively more memory than saving fewer than all activations.

For example, only some layers of neurons (e.g., every even layer) save their respective activations, while the other layers (e.g., every odd layer) discard their respective activations. The layers with saved activations (e.g., the even layers) use their most recent weights to recompute the discarded activations and transmit the recomputed activations to the layers that discarded them (e.g., the odd layers). In some cases, a recomputed activation differs from the discarded activation because the most recent weights differ from the weights that were available during the forward pass (e.g., one or more weight updates occurred between the forward pass and the associated backward pass). In various embodiments, the number and type of layers selected to save and to discard activations is chosen to optimize the desired balance between reduced memory usage and increased computation. As one example, every fourth layer saves activations and all other layers discard activations. As another example, convolutional layers are selected to save activations and other layers are selected to discard activations.
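The even/odd example can be sketched as follows (Python). The layers are stand-in callables; in a real network they are parameterized, and the weights used for recomputation may have been updated since the forward pass, as noted above:

```
def forward_with_checkpoints(x, layers, save_every=2):
    """Forward pass that saves the input activation only of every
    save_every-th layer; the other activations are discarded."""
    saved = {}
    for i, f in enumerate(layers):
        if i % save_every == 0:
            saved[i] = x          # checkpoint this layer's input
        x = f(x)
    return x, saved

def recompute_activation(target_layer, layers, saved, save_every=2):
    """Recompute the discarded input activation of target_layer by replaying
    the layers from the nearest earlier checkpoint (using current weights)."""
    start = (target_layer // save_every) * save_every
    x = saved[start]
    for i in range(start, target_layer):
        x = layers[i](x)
    return x

layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]
out, saved = forward_with_checkpoints(5, layers)
print(recompute_activation(3, layers, saved))  # input of layer 3, replayed from layer 2
```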

In various embodiments and/or usage scenarios, any one or more of SGD, MBGD, and CPGD with or without RCP is implemented via one or more of the following: an arrangement of processing elements (e.g., as shown in fig. 4), one or more GPUs, one or more CPUs, one or more DSPs, one or more FPGAs, and one or more ASICs.

For example, SGD with back-propagation can be used (as described elsewhere herein) to train a neural network. However, learning via gradient descent is inherently sequential, because each weight update uses information from a gradient measurement made after completion of a full forward pass through the neural network. Further, the weight updates are made during the corresponding backward pass through the neural network (following and corresponding to the forward pass), and thus the last weight update occurs after completion of the entire corresponding backward pass.

MBGD enables more parallelism than SGD by averaging gradients over a small batch, processing several ('small batch' size) activations in parallel. However, the speed of sequential updates is unchanged relative to SGD, and, as in SGD, weight updates are completed only after all of the corresponding backward passes through the neural network are completed. As the small batch grows, with more activations processed in parallel, gradient noise is reduced. In some cases, beyond a certain point the reduction in gradient noise results in poor generalization.
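The difference between per-sample updates and mini-batch averaging can be sketched as follows (Python; a single scalar weight and a quadratic loss are used purely for illustration):

```
def sgd_updates(w, samples, grad, lr):
    # SGD: one weight update per sample, inherently sequential.
    for x in samples:
        w -= lr * grad(w, x)
    return w

def mbgd_update(w, samples, grad, lr):
    # MBGD: per-sample gradients are averaged (computable in parallel),
    # then a single weight update is applied.
    g = sum(grad(w, x) for x in samples) / len(samples)
    return w - lr * g

grad = lambda w, x: 2 * (w - x)              # gradient of (w - x)**2
samples = [1.0, 2.0, 3.0]
print(sgd_updates(0.0, samples, grad, 0.1))  # three sequential updates
print(mbgd_update(0.0, samples, grad, 0.1))  # one averaged update
```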

CPGD supports parallel processing and updating of weights in all layers of a neural network, while activation propagates through the layers in the stream. Thus, in some embodiments and/or usage scenarios, CPGD overcomes the sequential processing limitations of SGD and MBGD.

RCP reduces memory usage via (re) computing the activation that would otherwise be stored and can be used in conjunction with SGD, MBGD and CPGD.

Pipeline flow diagrams can be used to compare and contrast the SGD, MBGD, CPGD, and CPGD-with-RCP techniques. The flow and concurrency of information in the training techniques are visible in the pipeline flow diagrams. FIGS. 26A-26D illustrate embodiments of pipeline flows for layers of a neural network flowing from left to right, e.g., activations enter on the left and forward propagation of layer computations flows to the right. A gradient computation is performed in the rightmost layer to begin the backward propagation of layer computations, including right-to-left weight updates. Time advances from top to bottom.

FIG. 26A illustrates an embodiment of a pipeline flow for an SGD. The updating of the weights of the layers of the neural network is done after the respective full forward pass and the respective full backward pass through all layers of the neural network. The next forward pass is started only after the weight update corresponding to the immediately previous forward pass is completed. As shown, a first forward pass 2611 is performed (from the first tier to the last tier, shown from left to right in the figure). Then, a first backward pass 2621 is performed (from the last layer to the first layer, shown from right to left in the figure). During the first backward pass 2621, the weights are updated from the last layer to the first layer. The last weight update (of the first layer) is completed with the completion of the first backward pass 2621. Then a second forward pass 2612 is performed (using the weights updated during the first backward pass 2621), followed by a second backward pass 2622, with the weight updates performed during the second backward pass 2622.

FIG. 26B shows an embodiment of a pipeline flow for MBGD. Multiple activations are processed with the same weights. Coordinated quiet times are used to synchronize the weight updates. In some embodiments and/or usage scenarios, MBGD processing is characterized by a small batch size (N) 2631, overhead 2632, and an update interval (U) 2633.

Unlike gradient descent techniques (e.g., SGD and MBGD), which use full forward and full backward passes through the network to compute gradient estimates, resulting in order dependencies, CPGD uses a differential structure to replace order dependencies with continuous models of continuous gradient generation. In some implementations and/or usage scenarios, CPGD achieves layer parallelism without explicit synchronization by enabling each layer of the neural network to train (e.g., "learn") simultaneously with other ones of the layers. Thus, parallelization along the depth of the neural network can apply more computational resources to training. In various embodiments and/or usage scenarios, CPGD provides comparable accuracy and improved convergence speed in terms of training epochs compared to other techniques.

FIG. 26C shows an embodiment of a pipeline flow for CPGD. The CPGD process maintains a flow model. The hidden representation and deltas enter each layer at each time step, and the weights are updated at each time step. The CPGD process is a coordinated synchronous operation. In some embodiments and/or usage scenarios, the CPGD processing is characterized by a forward pass 2651 and a corresponding backward pass 2661, representing one of a plurality of forward passes and one of a plurality of corresponding backward passes, respectively. In operation, respective ones of the plurality of forward transfers operate in parallel with one another, respective ones of the plurality of backward transfers operate in parallel with one another, and the plurality of forward transfers and the plurality of backward transfers operate in parallel with one another. Weight updates (made during backward pass) are used by forward pass and backward pass when weight updates are available.

As a specific example, forward pass 2665 begins, followed by forward pass 2666. At least a portion of forward pass 2665 operates in parallel with at least a portion of forward pass 2666. At least a portion of the respective backward pass of the forward pass 2665 operates in parallel with at least a portion of the forward pass 2666. Further, as shown in the weight update use example 2667, the respective backward pass completes at least some of the weight updates used by the forward pass 2666.

Fig. 26D shows an embodiment of a pipeline flow for CPGD with RCP. CPGD with RCP omits saving selected activations and instead recomputes the selected activations. In some implementations and/or usage scenarios, the recomputation is performed with updated weights. Thus, reverse checkpointing reduces memory (illustrated as the reduced area covered by vertical lines advancing in time through the saved hidden representations) and reduces the time disparity between the computed hidden representations and the corresponding deltas.

As a specific example, CPGD with RCP processing is characterized by forward pass 2671 and corresponding backward pass 2681. A first activation is computed during the forward pass and stored in the layer for the corresponding backward pass, as illustrated by activation storage 2685. Activation storage 2685 is occupied during portions of the forward pass and the backward pass and is unavailable for other uses. Recomputed activation storage 2686 illustrates a specific example of memory reduction. A second activation is computed during the forward pass, but is discarded and requires no storage. During the backward pass, the second activation is recomputed and stored in the layer for use in the backward pass, as illustrated by recomputed activation storage 2686. Recomputed activation storage 2686 is unoccupied throughout the forward pass and is available for other uses (e.g., other forward passes or other backward passes), thereby reducing the memory required.

Considering parallelization more generally, in some embodiments and/or usage scenarios, parallelizing a computation (e.g., neural network training) spreads the computation over separate computation units operating simultaneously. In a model-parallel regime, separate units simultaneously evaluate the same neural network using distinct model parameters. In a data-parallel regime, separate workers simultaneously evaluate distinct network inputs using the same formal model parameters. Some scaling techniques use fine-grained data parallelism across layers and among units in a cluster.

In some embodiments and/or usage scenarios, MBGD improves the accuracy of the gradient estimate as a function of the small batch size n. However, the computation to perform MBGD for small batch size n is approximately equal to the computation to perform SGD for n steps. In some cases, SGD for n steps is more efficient than MBGD with small batch size n by approximately the square root of n. Thus, higher parallelism (e.g., as in MBGD) and higher efficiency (e.g., as in SGD) are sometimes mutually exclusive.

In some embodiments and/or usage scenarios, a deep neural network is a high-dimensional parameterized function, sometimes expressed as a directed acyclic graph. Back-propagation techniques are sometimes expressed as a cyclic graph. The cycle in the graph is a feedback iteration. The gradients produced by a first full network evaluation change the weights used in the next iteration, because the iteration is a discrete approximation of a continuous differential system. The discrete approximation comprises an unbiased continuous-noise process with time-varying statistics. The noise process provides regularization, enabling the continuous system to model phenomena observed in discrete-time learning systems. In the discrete case, regularization is provided by a sampling procedure (e.g., SGD), by the learning rate, and/or by other explicit mechanisms. A time-dependent noise process enables using a learning-rate schedule that erases local high-frequency contours in parameter space. As a correct region is approached, regularization is reduced, leading in some cases to a better final solution.

CPGD, in a conceptual framework of an arbitrary feed-forward neural network, expresses all nodes as functions of time and applies functional composition to formulate the internal state and the representation of the stimuli to which the internal state is subjected. The result is a factorization in which the individual layers act as systems with independent local dynamics. The two dimensions are the depth of the network and the time evolution of the parameters. In some embodiments and/or usage scenarios in which acceleration is achieved by mapping network layers to computation units separated in space, there is communication latency between the network layers; thus, there is a time delay in communication between the layers. Some implementations of CPGD are synchronous implementations that account for the time delays.

During CPGD processing, an activation vector and its associated hidden representations are combined with model parameters at different time steps during the forward pass of the activation vector. Whether the model parameters combined with the activation vector are from different time steps or from the same time step is not detectable by the forward-propagating activation vector. It is conceptually as if a fixed set of parameters from successive time steps were used to form an aggregate parameter state that is then used for learning.

During the backward pass (e.g., delta propagation), there is a choice of either using the immediate parameters (e.g., weights) after the update, or retrieving historical parameters anchored to when the corresponding forward pass was performed. Deltas computed from the immediate parameters use updated information corresponding to the current parameter slope. Some embodiments and/or usage scenarios use the immediate parameters. Some embodiments and/or usage scenarios use the historical parameters.

Some implementations of CPGD use memory on the same order as SGD. Reverse checkpointing (as described elsewhere herein) is usable with CPGD, e.g., to reduce memory usage. Some embodiments and/or usage scenarios of reverse checkpointing use the immediate parameters (e.g., weights) to recompute activations. Some embodiments and/or usage scenarios of reverse checkpointing use the historical parameters to recompute activations. In some embodiments and/or usage scenarios using the immediate parameters to recompute activations, the time disparity between the parameters used to compute the forward-propagating activations and the backward-propagating deltas is reduced in the aligned wavefronts.

Continuous propagation techniques are usable in conjunction with small batch processing (streaming MBGD). In some embodiments and/or usage scenarios, a subsequent batch is started before an immediately preceding batch is completed, conceptually similar to asynchronous SGD. Parameter inconsistency within the pipeline is limited to no more than one batch boundary.

In some embodiments and/or usage scenarios, enabling data to stream through a neural network and to perform computations without a global synchronization boundary enables extracting learning information that is not otherwise extracted. In some embodiments and/or usage scenarios, a lower learning rate dominates when larger batches are used. In some embodiments and/or usage scenarios, the hidden activation and/or delta arcs are conceptually interpreted as individual vectors or, alternatively, as batch matrices. The batch matrix interpretation enables implementing the techniques described herein directly on GPUs, CPUs, DSPs, FPGAs, and/or ASICs.

Figs. 27A-27E illustrate various aspects of embodiments of forward and backward passes according to SGD, MBGD, CPGD, and RCP processing. In the figures, two layers of neurons are illustrated, representing respective layers of, e.g., a portion of a deep neural network. In various embodiments and/or usage scenarios, a deep neural network comprises thousands or more layers and thousands or more neurons per layer. In various embodiments and/or usage scenarios, the first layer is an input layer that receives activations for training from an agent external to the deep neural network. In various embodiments and/or usage scenarios, the second layer is an output layer, where the forward pass completes and the backward pass begins. In various embodiments and/or usage scenarios, the first layer and the second layer are interior layers.

Figs. 27A and 27B illustrate embodiments of the forward and backward passes according to SGD, MBGD, and CPGD, without RCP. The two layers are illustrated as previous layer 2701 and next layer 2702. Previous layer 2701 comprises computation 2710 and storage 2715. Next layer 2702 comprises computation 2720 and storage 2725. Computation 2710 and computation 2720 are examples of computation resources, and storage 2715 and storage 2725 are examples of storage resources.

Figs. 27C-27E illustrate embodiments of the forward and backward passes, with RCP, according to SGD, MBGD, and CPGD. The two layers are illustrated as previous layer 2703 and next layer 2704. Previous layer 2703 comprises computation 2730 and storage 2735. Next layer 2704 comprises computation 2740 and storage 2745. Computation 2730 and computation 2740 are examples of computation resources, and storage 2735 and storage 2745 are examples of storage resources.

Identically numbered elements in figs. 27A-27E have identical structure and operation, although the computation resources produce different results given different inputs, and the storage resources store and subsequently provide different values given different values stored. Other embodiments are contemplated with different computation resources and/or different storage resources usable for forward pass and backward pass computation and storage; e.g., a backward pass uses a transposed-weight storage not used by a forward pass. Other embodiments are contemplated with different computation and/or storage resources usable for implementations with and without RCP; e.g., an implementation with RCP uses additional computation resources (not illustrated) beyond those used for forward pass or backward pass processing without RCP.

With respect to fig. 27A, computation 2710 is enabled to perform computations, such as forward pass computation F2711. Storage 2715 is enabled to store activations, e.g., in A 2716. Storage 2715 is further enabled to store weights, e.g., in W2717. In various embodiments and/or usage scenarios, computation 2720, F2721, storage 2725, A 2726, and W2727 are substantially similar or identical in structure and/or operation to computation 2710, F2711, storage 2715, A 2716, and W2717, respectively.

In a forward pass operation of SGD or MBGD, activation A1,t 2781 is received by previous layer 2701 and stored in A 2716 (for later use during the backward pass). A1,t 2781 and a weight W1,t previously stored in W2717 are then processed in accordance with F2711 to produce activation A2,t 2782. A2,t 2782 is then passed to next layer 2702. Like the previous layer, next layer 2702 receives A2,t 2782 and stores it in A 2726 (for later use during the backward pass). A2,t 2782 and a weight W2,t previously stored in W2727 are then processed in accordance with F2721 to produce activation A3,t 2783. A3,t 2783 is then provided to the next subsequent layer (if present) for processing, and so forth, until the forward pass is complete and the backward pass begins. If next layer 2702 is the output layer, then the forward pass is completed and the backward pass corresponding to the forward pass is initiated.

With respect to Fig. 27B, elements of computation 2710 and computation 2720 dedicated to forward pass processing (F 2711 and F 2721) are omitted for clarity. With respect to the structure and operation illustrated and described with respect to Fig. 27A, Fig. 27B illustrates that computation 2710 can also perform additional computations, e.g., backward pass computation B 2712, and computation 2720 can also perform additional computations, e.g., backward pass computation B 2722. Storage 2715 can also store a computed weight, e.g., in W 2718, and storage 2725 can also store a computed weight, e.g., in W 2728. In various embodiments and/or usage scenarios, B 2722 and W 2728 are substantially similar or identical in structure and/or operation respectively to B 2712 and W 2718.

In backward pass operation of SGD or MBGD, delta Δ3,t 2793 is received from the next subsequent layer (if present) during backward pass processing. If next layer 2702 is the output layer, then next layer 2702 computes Δ3,t in accordance with the delta rule, e.g., as a function of the difference between the output of the next layer (e.g., the estimated output) and the training output (e.g., the desired output). Δ3,t 2793, the weight W2,t previously stored in W 2727, and the activation A2,t previously stored in A 2726 are then processed in accordance with B 2722 (e.g., in accordance with the delta rule) to produce delta Δ2,t 2792 and a new weight W2,t+1 that is then stored in W 2728 for use in the next forward pass. Δ2,t 2792 is then passed to previous layer 2701. Similarly to the next layer, Δ2,t 2792, the weight W1,t previously stored in W 2717, and the activation A1,t previously stored in A 2716 are then processed in accordance with B 2712 to produce delta Δ1,t 2791 and a new weight W1,t+1 that is then stored in W 2718 for use in the next forward pass. Δ1,t 2791 is then passed to the next preceding layer (if present) for processing, and so forth, until the backward pass is complete and the next forward pass commences. If previous layer 2701 is the input layer, then the backward pass is complete and the next forward pass commences.
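Continuing the illustrative sketch above, one way a single backward pass step per Fig. 27B could be expressed follows. The delta-rule details (tanh derivative, learning rate lr) and the function name backward_step are assumptions, and the separate W 2717 / W 2718 storage of the figure is collapsed here into returning a new weight value.

```python
import numpy as np

def backward_step(w, saved_a, delta_out, lr=0.01):
    """One layer's backward-pass step per Fig. 27B (sketch; names and delta-rule
    details are assumptions).

    delta_out plays the role of, e.g., Δ3,t arriving from the following layer;
    saved_a is the activation stored during the forward pass (e.g., in A 2726);
    the returned delta plays the role of Δ2,t passed to the preceding layer,
    and the returned weight plays the role of the new weight, e.g., W2,t+1.
    """
    pre_act = w @ saved_a
    local = delta_out * (1.0 - np.tanh(pre_act) ** 2)   # derivative of the tanh used for F
    delta_in = w.T @ local                              # delta for the previous layer
    w_new = w - lr * np.outer(local, saved_a)           # stored for the next forward pass
    return delta_in, w_new

# Example: one backward step through the 'next layer' of the earlier sketch.
w2 = np.random.randn(2, 4)
a2_saved = np.random.randn(4)
delta3 = np.random.randn(2)
delta2, w2_next = backward_step(w2, a2_saved, delta3)
```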

In SGD and MBGD (unlike CPGD), commencement of the next forward pass is delayed until completion of the previous backward pass, e.g., W 2717 and W 2727 are respectively updated with W 2718 and W 2728 only after W 2717 and W 2727 have been used for a same forward pass and a same corresponding backward pass. Therefore, the next forward pass is performed using weights produced by the backward pass corresponding to the immediately preceding forward pass.

Fig. 27A also illustrates, in addition to SGD and MBGD forward pass processing, CPGD forward pass processing. However, operation of CPGD differs from SGD and MBGD in that weight updates and the next forward pass are performed as soon as possible, rather than being delayed until completion of the previous backward pass. For example, W 2717 and W 2727 are respectively updated with W 2718 and W 2728 as soon as possible. Therefore, the next forward pass selectively has access to weights from prior iterations, and thus selectively produces activations differing from those that SGD and MBGD would produce under the same conditions.

More specifically, in previous layer 2701, A1,t 2781 is received and stored in A 2716, identically to SGD and MBGD. A1,t 2781 and a weight W1,t-k-j previously stored in W 2717 are then processed in accordance with F 2711 to produce activation A2,t 2782. The weight W1,t-k-j was produced and stored by the backward pass corresponding to the forward pass preceding the instant forward pass by k-j forward passes. A2,t 2782 is then passed to next layer 2702, and, similarly to the previous layer, A2,t 2782 is received and stored in A 2726, identically to SGD and MBGD. A2,t 2782 and a weight W2,t-k previously stored in W 2727 are then processed in accordance with F 2721 to produce activation A3,t 2783. The weight W2,t-k was produced and stored by the backward pass corresponding to the forward pass preceding the instant forward pass by k forward passes. Note that the previous layer and the next layer use weights from different backward passes to process the same forward pass. As in SGD and MBGD, A3,t 2783 is then provided to the next subsequent layer (if present) for processing, and so forth, until the forward pass is complete and the backward pass commences. If next layer 2702 is the output layer, then the forward pass is completed and the backward pass corresponding to the forward pass is initiated. In some embodiments and/or usage scenarios, the value of j is 0, and (k-j) and (k) are equal. In various embodiments and/or usage scenarios, the previous layer and the next layer simultaneously process one of: differing forward passes, differing backward passes, and a forward pass together with a differing backward pass.
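The weight-timing relationship just described can be pictured with the following toy sketch; it is an assumption-laden illustration, not the described embodiments. Forward passes are launched every step, while each layer absorbs backward-pass updates only after a per-layer delay, so layers closer to the input apply older weight versions (k-j versus k) to the same forward pass.

```python
# Toy sketch (assumed, not the described embodiments) of CPGD weight timing:
# forward pass t is launched every step, but each layer only reflects the weight
# updates whose backward passes have completed by step t, so different layers
# apply different weight versions (e.g., t-k-j vs. t-k) to the same forward pass.
def cpgd_weight_versions(num_forward_passes, update_delays):
    """Return, per forward pass t, the backward-pass index each layer's weights reflect."""
    versions = []
    for t in range(num_forward_passes):
        versions.append([max(-1, t - 1 - d) for d in update_delays])  # -1: initial weights
    return versions

# Earlier layers are farther from the output, so their updates arrive later (larger delay).
for t, v in enumerate(cpgd_weight_versions(6, update_delays=[3, 1])):
    print(f"forward pass {t}: layer weights reflect backward passes {v}")
```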

Fig. 27B also illustrates, in addition to SGD and MBGD backward pass processing, CPGD backward pass processing. Backward pass processing in CPGD is identical to that of SGD and MBGD. However, selected results (e.g., selected weights) are used earlier than in SGD and MBGD. For example, W1,t-k-j, produced by backward pass t-k-j, and W2,t-k, produced by backward pass t-k, are used earlier than they would be in SGD and MBGD, e.g., in forward pass t.

Fig. 27C illustrates an embodiment of forward pass processing of any of SGD, MBGD, and CPGD, in combination with RCP. In various embodiments and/or usage scenarios, computation 2730 and storage 2735 are substantially similar or identical in structure and/or operation respectively to computation 2710 and storage 2715. In various embodiments and/or usage scenarios, computation 2740 and storage 2745 are substantially similar or identical in structure and/or operation respectively to computation 2720 and storage 2725, except that storage 2745 has no counterpart of the activation storage A 2726 of storage 2725.

In forward pass operation, with respect to previous layer 2703, activation A1,t 2781 is received, processed in accordance with the forward pass processing of computation 2730, and stored in storage 2735, as described with respect to Fig. 27A. However, with respect to next layer 2704, activation A2,t 2782 is received and processed in accordance with the forward pass processing of computation 2740, but is not stored (instead it is recomputed in accordance with RCP during backward pass processing).

Figs. 27D and 27E respectively illustrate first and second portions of an embodiment of backward pass processing of any of SGD, MBGD, and CPGD, in combination with RCP. Elements of computation 2730 and computation 2740 dedicated to forward pass processing (F 2711 and F 2721) are omitted for clarity. With respect to the structure and operation illustrated and described with respect to Fig. 27C, Figs. 27D and 27E illustrate that computation 2730 can also perform additional computations, e.g., backward pass computation B 2712, and computation 2740 can also perform additional computations, e.g., backward pass computation B 2722. Storage 2735 can also store a computed weight, e.g., in W 2718, and storage 2745 can also store a computed weight (e.g., in W 2728) as well as a recomputed activation (e.g., in A 2729).

In the first portion of backward pass operation, activations that were not stored during the corresponding forward pass are recomputed. In the SGD and MBGD scenarios, the recomputed activation is formed in previous layer 2703 by processing the activation stored from the forward pass in A 2716 and the weight stored in W 2717, in accordance with F 2711, to produce activation A'2,t 2784, which is then stored in A 2729 of next layer 2704. Because SGD and MBGD delay weight updates and commencement of the next forward pass until both the forward pass and the corresponding backward pass are complete, A'2,t 2784 is identical to the value A2,t 2782 that was discarded during the forward pass.

In the CPGD scenario, the recomputed activation is formed in accordance with the same topology as in the SGD and MBGD scenarios. However, CPGD performs weight updates without delay and enables commencement of the next forward pass without regard to completion of previous backward passes. Therefore, according to embodiment and/or usage scenario, the weight values stored at the time of the backward pass (e.g., in W 2717) selectively differ from the weight values stored during the corresponding forward pass. As a specific example, in accordance with Fig. 27C, W 2717 stored W1,t-k-j during the forward pass. However, during the backward pass, additional weight updates have occurred, e.g., corresponding to m iterations, and W 2717 now stores W1,t-k-j+m. Therefore, A'2,t 2784 selectively differs from the value A2,t 2782 that was discarded during the forward pass.

In the second portion of backward pass operation, computations are performed using the recomputed activations. In the SGD and MBGD scenarios, since the recomputed activations are identical to the discarded activations (e.g., conceptually the value stored in A 2729 is the same as the value stored in A 2726), the backward processing produces results identical to those described with respect to Fig. 27B. E.g., Δ'3,t 2796, Δ'2,t 2795, and Δ'1,t 2794 are respectively identical to Δ3,t 2793, Δ2,t 2792, and Δ1,t 2791. In the CPGD scenario, since the recomputed activations selectively differ from the discarded activations, the backward processing produces results selectively differing from those described with respect to Fig. 27B. E.g., Δ'3,t 2796, Δ'2,t 2795, and Δ'1,t 2794 respectively selectively differ from Δ3,t 2793, Δ2,t 2792, and Δ1,t 2791.
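The recomputation distinction between the SGD/MBGD and CPGD scenarios can be summarized in the following illustrative sketch; the NumPy formulation, the tanh form of F, and the helper name rcp_backward_recompute are assumptions rather than elements of the figures.

```python
import numpy as np

def rcp_backward_recompute(prev_saved_a, prev_w):
    """Recompute, at backward-pass time, the activation discarded in the forward
    pass (sketch; the tanh form of F and the helper name are assumptions)."""
    return np.tanh(prev_w @ prev_saved_a)

prev_saved_a = np.random.randn(3)                    # activation saved in, e.g., A 2716
w_at_forward = np.random.randn(4, 3)                 # weights in W 2717 during the forward pass
a2_discarded = np.tanh(w_at_forward @ prev_saved_a)  # A2,t: computed, used, then dropped

# SGD/MBGD: weights unchanged at backward time, so A'2,t equals the discarded A2,t.
assert np.allclose(rcp_backward_recompute(prev_saved_a, w_at_forward), a2_discarded)

# CPGD: m further updates may have been applied to W 2717, so A'2,t may differ.
w_at_backward = w_at_forward - 0.01 * np.random.randn(4, 3)
a2_recomputed = rcp_backward_recompute(prev_saved_a, w_at_backward)
```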

In some embodiments and/or usage scenarios, W 2717 is distinct from W 2718 (as illustrated), and in some embodiments and/or usage scenarios, W 2718 and W 2717 are a same portion of storage (not illustrated), such that saving a new value in W 2718 overwrites a previously saved value in W 2717. Similarly, W 2727 is variously distinct from or the same as W 2728. In various embodiments and/or usage scenarios, A 2729 is variously implemented to use fewer memory locations than A 2726, and/or to use a same number of memory locations for a shorter time than A 2726.

In various embodiments and/or usage scenarios, activations and/or weights are implemented and/or represented by any one or more of scalars, vectors, matrices, and higher-dimensional data structures. For example, any one or more of A 2716, A 2726, A 2729, W 2717, W 2727, W 2718, and W 2728 can store any one or more of one or more scalars, one or more vectors, one or more matrices, and one or more higher-dimensional arrays.

In various embodiments and/or usage scenarios, one or more elements of previous layer 2701 and next layer 2702 are implemented by respective PEs, e.g., a portion of PE 499 or similar elements of Fig. 4. For example, PE 497 implements previous layer 2701 and PE 498 implements next layer 2702. Activation A2,t 2782 and delta Δ2,t 2792 are communicated via east coupler 431. In some embodiments and/or usage scenarios, one or more elements of previous layer 2701 and next layer 2702 are implemented by one or more of a CPU, GPU, DSP, and FPGA.

In various embodiments and/or usage scenarios, all or any portions of the elements F 2711, F 2721, B 2712, and B 2722 conceptually correspond to all or any portions of executions of instructions of task SW on PE 260 of Fig. 2.

Example workload mapping

Conceptually, deep learning accelerator 400 (Fig. 4) is a programmable compute fabric (see, e.g., Figs. 5-8 and the "processing element: compute element and router" sections). For example, the compute element of each PE 499 element is capable of executing sequences of instructions of tasks (e.g., conceptually corresponding to all or any portions of executions of instructions of task SW on PE 260 of Fig. 2), and the router element of each PE 499 is configurable to route wavelets between PEs. Workloads are mappable onto the programmable compute fabric in a variety of ways. Described below is an example high-level mapping of a workload onto the compute fabric, to illustrate various techniques and mechanisms implemented by the compute fabric.

The workload is deep neural network training, implemented via SGD. The deep neural network comprises a plurality of layers of neurons. The workload has three major phases: a forward pass, a delta pass, and a chain pass. The forward pass propagates activations forward. The delta pass propagates deltas backward. The chain pass computes gradients based on the deltas as the deltas are generated during the delta pass. The amounts of computation in the three major phases are approximately equal.
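As a purely illustrative sketch, not the fabric implementation, the three major phases for a small fully connected network can be written as follows; the tanh nonlinearity, the squared-error objective, the learning rate lr, and the function name train_step are assumptions.

```python
import numpy as np

def train_step(weights, a_in, target, lr=0.01):
    """One SGD step organized as the three major phases (illustrative sketch)."""
    # Forward pass: propagate activations, saving each layer's input activation.
    acts = [a_in]
    for w in weights:
        acts.append(np.tanh(w @ acts[-1]))

    # Delta pass: back-propagate deltas from the output toward the input.
    deltas = [None] * len(weights)
    delta = (acts[-1] - target) * (1.0 - acts[-1] ** 2)   # output delta (squared error)
    for i in reversed(range(len(weights))):
        deltas[i] = delta
        if i > 0:
            delta = (weights[i].T @ delta) * (1.0 - acts[i] ** 2)

    # Chain pass: gradients from the deltas and the activations saved in the forward pass.
    grads = [np.outer(deltas[i], acts[i]) for i in range(len(weights))]
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [np.random.randn(4, 3), np.random.randn(2, 4)]
weights = train_step(weights, np.random.randn(3), np.random.randn(2))
```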

Fig. 4 illustrates an example mapping of the major phases to PEs. Each layer is implemented by blocks of PEs allocated back-to-back (e.g., along the horizontal dimension) from the compute fabric (also referred to as "placement"). Data movement propagates to the end of the fabric during the forward pass (forward 401) and then returns in the reverse direction during the delta pass (delta 402) and the chain pass (chain 403). The placement is directed to reducing data movement, since the forward pass saves activations that are used by both the delta pass and the chain pass. In this example, all of the PEs are time-shared three ways among the three major phases, each phase using approximately the same amount of computation. In some circumstances, the entire chain of PEs performing the passes operates as a pipeline, such that each layer is a pipeline stage (taking approximately the same amount of time to complete) and each activation of a mini-batch fills the pipeline.

In some embodiments and/or usage scenarios, within a group of PEs mapped to a single layer, the weights of the single layer are partitioned across the PEs such that a single neuron is mapped to multiple PEs. Partitioning a single neuron across multiple PEs, in some circumstances, provides a load-balancing benefit and a communication-partitioning benefit (see, e.g., Figs. 17-20 and the "neuron smearing" section).

Conceptually, processing proceeds as follows (see forward 401 of Fig. 4). Activations are broadcast into the layer along the horizontal axis. Activations are received by the PEs and trigger a lookup of the associated weights (corresponding to the neurons mapped to the PEs) stored locally in the PEs. Only non-zero activations are broadcast, so no computation is wasted on zero activations (an example of activation sparsity harvesting). Each PE performs a local multiply and accumulate of the incoming activation with all of its neuron weights, producing local partial sums. Since the weights of each neuron are distributed across multiple PEs, the partial sums are then accumulated across the PEs in the vertical direction, in accordance with the neuron weight distribution. After the partial sums are accumulated to produce a final sum, the activation function is performed and all new non-zero activations are broadcast to the next layer.
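A hedged sketch of this per-layer flow follows; the partitioning of weight columns across PEs, the tanh activation function, and the names layer_forward_on_pes and pe_weight_slices are illustrative assumptions rather than the fabric mechanism.

```python
import numpy as np

def layer_forward_on_pes(pe_weight_slices, activations):
    """Sketch (assumed) of the per-layer forward flow described above.

    Only non-zero activations are 'broadcast'; each PE multiply-accumulates them
    against its locally stored weight slice into local partial sums, and the
    partial sums are then accumulated across the PEs holding the same neurons.
    """
    # Activation sparsity harvesting: skip zero activations entirely.
    nonzero = [(j, a) for j, a in enumerate(activations) if a != 0.0]

    partial_sums = []
    for cols, w_slice in pe_weight_slices:          # (column indices, local weight slice)
        psum = np.zeros(w_slice.shape[0])
        for j, a in nonzero:
            if j in cols:
                psum += w_slice[:, cols.index(j)] * a
            # zero activations were never broadcast, so no work is spent on them
        partial_sums.append(psum)

    # Accumulate partial sums 'vertically' across PEs, then apply the activation.
    return np.tanh(sum(partial_sums))

# Usage: two PEs each holding half of a layer's weight columns.
w = np.random.randn(4, 6)
pe_slices = [([0, 1, 2], w[:, :3]), ([3, 4, 5], w[:, 3:])]
a = np.array([0.0, 1.5, 0.0, -2.0, 0.0, 0.3])
assert np.allclose(layer_forward_on_pes(pe_slices, a), np.tanh(w @ a))
```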

The delta pass (see delta 402 of Fig. 4) and the chain pass (see chain 403 of Fig. 4) follow a data flow similar to that of the forward pass. In some embodiments and/or usage scenarios, the delta pass and the chain pass are placed offset by one layer, so that activations are stored in the same layer as the weights that use them in the backward direction. Activations are stored by the receiving layer, such that in both the delta pass and the chain pass the activations are used directly, without additional communication. In addition to storing activations, a weight transposition is performed to implement the delta pass. In some embodiments and/or usage scenarios, the weight transposition is implemented by replicating the weights, using additional memory capacity and additional communication when updating the weights. In some embodiments and/or usage scenarios, the weight transposition is implemented by transposing the delta broadcast in the vertical dimension.
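The two transposition options just mentioned can be contrasted with a small sketch; NumPy and einsum are chosen purely for illustration, and neither expression is the fabric's mechanism.

```python
import numpy as np

# (a) transposition via a duplicated, explicitly transposed copy of the weights,
#     which costs extra storage and must be re-copied whenever the weights change;
# (b) transposition via changing the broadcast/accumulation dimension while keeping
#     only the forward-layout weights.
w = np.random.randn(4, 3)          # forward weights: out x in
delta = np.random.randn(4)         # delta arriving from the following layer

w_t_copy = w.T.copy()              # option (a): replicated transposed copy
delta_prev_a = w_t_copy @ delta

delta_prev_b = np.einsum('oi,o->i', w, delta)   # option (b): accumulate along the other axis

assert np.allclose(delta_prev_a, delta_prev_b)
```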

Fig. 28A illustrates a generic operation of multiplying a matrix (m) by a vector (v). Fig. 28B illustrates, in the form of Fig. 28A, various representations of the memory structures used in the three major phases in some embodiments (e.g., for a fully connected neural network). In various embodiments, the weight (w) and gradient accumulation (g) data structures are two-dimensional matrices, and the forward partial sum (fpsum), delta partial sum (δpsum), and forward pass activation (a) data structures are one-dimensional vectors. The two-dimensional matrices are stored in memory (e.g., memory 854 of Fig. 8), since in some embodiments and/or usage scenarios the two-dimensional matrices are relatively large. In some embodiments, the one-dimensional vectors are stored in a higher-throughput memory (e.g., data store 848 of Fig. 8) to enable, in a usage scenario, full data path performance of the multiply-accumulate vector operations in each of the three passes.
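A minimal sketch of the three multiply-accumulate forms implied by Fig. 28B follows, with the 2-D versus 1-D distinction noted in comments; the NumPy formulation is an assumption for illustration only.

```python
import numpy as np

# 2-D structures (held in the larger memory): weights w and gradient accumulation g.
w = np.random.randn(4, 3)
g = np.zeros_like(w)

# 1-D structures (held in the higher-throughput data store): activation, partial sums.
a = np.random.randn(3)             # forward-pass activation

fpsum = w @ a                      # forward pass: matrix * vector -> forward partial sums
delta = np.random.randn(4)
dpsum = w.T @ delta                # delta pass: transposed matrix * vector -> delta partial sums
g += np.outer(delta, a)            # chain pass: gradient accumulation from delta and activation
```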

Fig. 29 illustrates an embodiment of the tasks used in a forward pass state machine (see, e.g., Figs. 9-12 and the "tasks" section). In some embodiments and/or usage scenarios, each PE implements an instantiation of the state machine. In some embodiments and/or usage scenarios, various portions of the state machine are implemented by respective PEs (see, e.g., Figs. 17-20 and the "neuron smearing" section). There are four tasks in the state machine: f_rxact:acc 2901, f_rxact:close 2902, f_psum:prop 2903, and f_txact:tx 2904. Conceptually, activations arrive from a PE to the "left" of the instant PE (corresponding to the previous layer). An incoming (non-closeout) activation on the broadcast line (activation 2911) triggers f_rxact:acc 2901. The instant PE executes instructions of the task, looking up the weights associated with the activation (e.g., from memory local to the instant PE) and performing the local weight multiply and accumulate into partial sums. A control flow dependency exists between f_rxact:acc 2901 and f_psum:prop 2903 (flow 2913). Example data structures referenced by the task are wrow, fpsum, and fact.

An incoming activation closeout on the broadcast line (closeout 2912) triggers f_rxact:close 2902. The closeout signals the end of all activations for the current wavefront. The instant PE executes instructions of the task, starting the partial sum accumulation ring with the partial sums in the start list of the instant PE (start Psums 2916). Example data structures referenced by the task are fpsum_acc_mem and fpsum_acc_fab.

An incoming partial sum (Prop Psums 2930) triggers f_psum:prop 2903. The instant PE executes instructions of the task, adding the incoming partial sum to the partial sum of the instant PE, and then forwarding the result to the next hop on the ring (Prop Psums 2931). If the instant PE is the end of the ring, then the final sum is generated. In some embodiments and/or usage scenarios, additional processing is performed to prevent deadlock. Example data structures referenced by the task are fpsum_acc_mem, fpsum_acc_fab, and f_txact_wake.

When there is a queued activation to transmit, f_txact:tx 2904 is self-triggered (wake 2914). The instant PE executes instructions of the task, de-queuing an activation and transmitting the activation on the broadcast line to the next layer (activate 2921). When more items remain in the queue, the instant PE reschedules the task (reschedule 2915). When the queue is empty, the instant PE transmits a closeout wavelet to close the wavefront (end 2922).
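The four tasks can be pictured, purely as an illustrative sketch rather than the PE microcode, as event handlers on a single PE; the class name ForwardPassPE, the queue and ring simplifications, and the tanh activation function are assumptions.

```python
from collections import deque
import numpy as np

class ForwardPassPE:
    """Illustrative sketch of the four forward-pass tasks as event handlers."""
    def __init__(self, wrow):
        self.wrow = wrow                       # locally stored weights (cf. wrow)
        self.fpsum = np.zeros(wrow.shape[0])   # local forward partial sums (cf. fpsum)
        self.tx_queue = deque()                # activations queued for transmission

    def f_rxact_acc(self, act_index, act_value):
        # Incoming (non-closeout) activation: look up the associated weight column
        # and multiply-accumulate into the local partial sums.
        self.fpsum += self.wrow[:, act_index] * act_value

    def f_rxact_close(self):
        # Closeout: all activations of the wavefront received; the ring-start PE
        # injects its local partial sums onto the accumulation ring.
        return self.fpsum.copy()

    def f_psum_prop(self, incoming_psum, is_ring_end):
        # Add the incoming partial sums to the local ones; forward on the ring,
        # or, at the ring end, form the final sums and apply the activation function.
        total = incoming_psum + self.fpsum
        if not is_ring_end:
            return total                       # forwarded to the next hop on the ring
        for j, v in enumerate(np.tanh(total)):
            if v != 0.0:                       # only non-zero activations are queued
                self.tx_queue.append((j, v))
        return None

    def f_txact_tx(self):
        # Self-triggered while activations are queued: transmit one, reschedule if
        # more remain, and send a closeout once the queue is empty.
        if self.tx_queue:
            return ('activation', self.tx_queue.popleft())
        return ('closeout', None)
```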

Activations (incoming and outgoing), partial sums (incoming and outgoing), and closeout wavelets are communicated as wavelets (see, e.g., Figs. 13A-15B and the "wavelet" section). In some embodiments and/or usage scenarios, one or more of the wavelets correspond to one or more elements of fabric vectors as described by one or more DSDs and/or XDSRs.

The data structures of the various state machines are referenced via a plurality of DSDs stored in respective DSRs (see, e.g., Figs. 21A-24 and the "vector and data structure descriptor" section), as set forth in the following tables.

The example workload mapping described above is with respect to SGD. However, the techniques are readily applicable to MBGD and CPGD, with or without RCP.

Details of other embodiments

The embodiments and usage scenarios described with respect to Figs. 1-29 are conceptually with respect to PEs comprising programmable CEs that process data, e.g., in accordance with instructions. Other embodiments are contemplated in which one or more of the CEs are partially or entirely hardwired, e.g., processing data in accordance with one or more fixed-circuit processing elements operable without instructions. As a specific example, a particular CE comprises hardware logic unit circuitry that implements all or a portion of an LSTM unit. The particular PE comprising the particular CE also comprises a router, and the particular PE is operable in a fabric with other PEs. Some of the other PEs are similar or identical to the particular PE, and some of the other PEs are similar or identical to PE 499 of Fig. 4.

Example implementation techniques

In some embodiments, various combinations of all or portions of operations performed for and/or structures associated with any of: accelerated deep learning; SGD, MBGD, and CPGD, with or without RCP, for accelerated deep learning; data structure descriptors and fabric vectors for accelerated deep learning; neuron smearing for accelerated deep learning; task synchronization for accelerated deep learning; dataflow-triggered tasks for accelerated deep learning; control wavelets for accelerated deep learning; and/or wavelet representation for accelerated deep learning; as well as portions of a processor, microprocessor, system-on-a-chip, application-specific integrated circuit, hardware accelerator, or other circuitry providing all or portions of the aforementioned operations, are specified by a specification compatible with processing by a computer system. The specification is in accordance with various descriptions, such as hardware description languages, circuit descriptions, netlist descriptions, mask descriptions, or layout descriptions. Example descriptions include: Verilog, VHDL, SPICE, SPICE variants such as PSpice, IBIS, LEF, DEF, GDS-II, OASIS, or other descriptions. In various embodiments, the processing includes any combination of interpretation, compilation, simulation, and synthesis to produce, to verify, or to specify logic and/or circuitry suitable for inclusion on one or more integrated circuits. According to various embodiments, each integrated circuit is compatible with design and/or manufacture according to a variety of techniques. The techniques include programmable techniques (e.g., field or mask programmable gate array integrated circuits), semi-custom techniques (e.g., wholly or partially cell-based integrated circuits), and full-custom techniques (e.g., substantially application-specific integrated circuits), any combination thereof, or any other technique compatible with the design and/or manufacture of integrated circuits.

In some embodiments, various combinations of all or portions of operations described by a computer-readable medium having a set of instructions stored thereon are performed by execution and/or interpretation of one or more program instructions, by interpretation and/or compilation of one or more source and/or script language statements, or by execution of binary instructions produced by compiling, translating, and/or interpreting information expressed in programming and/or scripting language statements. The statements are compatible with any standard programming or scripting language (e.g., C, C++, Fortran, Pascal, Ada, Java, VBscript, and Shell). One or more of the program instructions, the language statements, or the binary instructions are optionally stored on one or more computer-readable storage medium elements. In various embodiments, some, all, or various portions of the program instructions are realized as one or more functions, routines, subroutines, in-line routines, procedures, macros, or portions thereof.

Conclusion

Certain choices have been made in the description merely for convenience in preparing the text and drawings, and unless there is an indication to the contrary, the choices should not be construed per se as conveying additional information regarding the structure or operation of the described embodiments. Examples of the choices include: the particular organization or assignment of the designations used for numbering, and the particular organization or assignment of the element identifiers (e.g., callouts or numerical designators) used to identify and reference the features and elements of the embodiments.

The various forms of the words "include" and "comprise" are specifically intended to be construed as abstractions describing logical sets of open-ended scope, and are not meant to convey physical containment unless described explicitly (e.g., followed by the word "within").

Although the foregoing embodiments have been described in some detail for purposes of clarity of description and understanding, the invention is not limited to the details provided. There are many embodiments of the invention. The disclosed embodiments are illustrative and not restrictive.

It is to be understood that many variations in construction, arrangement, and use are possible consistent with the description and are within the scope of the claims of the issued patent. For example, interconnect and functional-unit bit-widths, clock speeds, and the type of technology used are variable according to various embodiments in each component block. The names given to interconnect and logic are merely exemplary and should not be construed as limiting the concepts described. The order and arrangement of flowchart and flow-diagram processes, actions, and functional elements are variable according to various embodiments. Also, unless specifically stated to the contrary, the value ranges specified, the maximum and minimum values used, or other particular specifications (e.g., file types, and the number of entries or stages in registers and buffers) are merely those of the described embodiments, are expected to track improvements and changes in implementation technology, and should not be construed as limitations.

Functionally equivalent techniques known in the art are employable instead of those described to implement various components, subsystems, operations, functions, routines, subroutines, in-line routines, procedures, macros, or portions thereof. It is also understood that many functional aspects of the embodiments are realizable selectively in either hardware (e.g., generally dedicated circuitry) or software (e.g., via some manner of programmed controller or processor), as a function of embodiment-dependent design constraints and technology trends of faster processing (facilitating migration of functions previously in hardware into software) and higher integration density (facilitating migration of functions previously in software into hardware). Specific variations in various embodiments include, but are not limited to: differences in partitioning; different form factors and configurations; use of different operating systems and other system software; use of different interface standards, network protocols, or communication links; and other variations to be expected when implementing the concepts described herein in accordance with the unique engineering and business constraints of a particular application.

The embodiments have been described with detail and environmental context well beyond that required for a minimal implementation of many aspects of the described embodiments. Those of ordinary skill in the art will recognize that some embodiments omit disclosed components or features without altering the basic cooperation among the remaining elements. It is thus understood that much of the disclosed detail is not required to implement various aspects of the described embodiments. To the extent that the remaining elements are distinguishable from the prior art, components and features that are omitted are not limiting on the concepts described herein.

All such variations in design are not material to the teachings conveyed by the described embodiments. It should also be understood that the embodiments described herein have broad applicability to other computing and networking applications and are not limited to the particular application or industry of the described embodiments. Therefore, the invention should be construed as including all possible modifications and variations, which are included within the scope of the issued patent claims.
