Statically generated compiled representation for processing data in a neural network

Document No.: 169817    Publication date: 2021-10-29

Note: This technology, Statically generated compiled representation for processing data in a neural network, was designed and created by 李修宇 and 杨建� on 2020-04-28. Its main content includes the following: An electronic device includes: a memory storing input matrices A and B; a cache memory; and a processor. The processor generates a compiled representation comprising values for obtaining data from input matrix A when processing an input data instance through the neural network, the values comprising, for each of a plurality of threads, a base address in input matrix A and a relative offset, the relative offset being a distance between elements of input matrix A to be processed by the thread. The processor then stores the compiled representation, including the base address and the relative offset for each thread, in the local cache memory.

1. An electronic device, comprising:

a local cache memory;

a memory storing an input matrix A having values to be used in processing input data instances through a neural network; and

a processor configured to:

generate a compiled representation comprising values for obtaining data from input matrix A when processing an input data instance through the neural network, the values comprising a base address and a relative offset in input matrix A for each of a plurality of threads, the relative offset being a distance between elements of input matrix A to be processed by the thread; and

store the compiled representation comprising the base address and the relative offset for each thread in the local cache memory.

2. The electronic device of claim 1, wherein when generating the compiled representation, the processor is configured to:

calculate the base address in input matrix A for each thread from some or all of a thread identifier (ID) of the thread, the dimensions of input matrix A and/or output matrix C, the properties of the elements of input matrix A and/or output matrix C, and the convolution filter properties.

3. The electronic device of claim 1, wherein when generating the compiled representation, the processor is configured to:

calculate the relative offset from some or all of the dimensions of input matrix A and/or output matrix C, the properties of the elements of input matrix A and/or output matrix C, and the filter properties.

4. The electronic device of claim 1, wherein:

the memory stores an input matrix B having values to be used in processing input data instances through the neural network; and

the processor is further configured to:

process input matrix A using each of the threads, the processing comprising identifying a location in memory using the compiled representation in the local cache memory and using values in elements of input matrix A obtained from the location, together with values from elements at corresponding locations of input matrix B, as inputs to one or more general purpose matrix multiplication (GEMM) operations by the thread.

5. The electronic device of claim 4, wherein in processing the input matrix A using each of the threads, the processor is configured to:

obtain the base address for each thread of input matrix A from the compiled representation of input matrix A in the local cache memory;

while elements of input matrix A remain to be processed, for an active one of the threads:

obtain a next relative offset from the compiled representation;

increase a base address of the active thread by the next relative offset;

read a first value from an element of input matrix A at a location in the memory identified by the base address of the active thread;

read a second value from an element of input matrix B at a memory location associated with the element of input matrix A;

use the first value and the second value as inputs to a GEMM operation; and

store a result of the GEMM operation in an output matrix C in the memory.

6. The electronic device of claim 5, wherein the processor is further configured to use an index used to obtain the relative offset, in combination with a base address of input matrix B, to read the second value from input matrix B.

7. The electronic device of claim 5, wherein, when storing the result of the GEMM operation in an output matrix C, the processor is further configured to:

store each result in the output matrix C at a position in the output matrix C determined based on the position in input matrix A at which the first value was read.

8. The electronic device of claim 4, wherein the values in input matrix A and input matrix B comprise input values and weights, respectively, associated with an input data instance, the input values and weights to be used in processing the input data instance through the neural network.

9. The electronic device of claim 4, wherein input matrix A and input matrix B have dimensions larger than dimensions for the GEMM operation.

10. The electronic device of claim 1, wherein the local cache memory is coupled to processing circuitry through a fast access interface that enables faster access than accesses of the memory.

11. A method for generating a compiled representation in an electronic device, the electronic device comprising: a local cache memory; a memory storing an input matrix A having values to be used in processing input data instances through a neural network; and a processor, the method comprising:

generating, by the processor, a compiled representation comprising values for obtaining data from input matrix A when processing an input data instance through the neural network, the values comprising a base address and a relative offset in input matrix A for each of a plurality of threads, the relative offset being a distance between elements of input matrix A to be processed by the thread; and

storing, by the processor, the compiled representation comprising the base address and the relative offset for each thread in the local cache memory.

12. The method of claim 11, wherein generating the compiled representation comprises:

calculating, by the processor, the base address in input matrix A for each thread according to some or all of a thread Identifier (ID) of the thread, dimensions of input matrix A and/or output matrix C, properties of elements of input matrix A and/or output matrix C, and convolution filter properties.

13. The method of claim 11, wherein generating the compiled representation comprises:

calculating, by the processor, the relative offset from some or all of the dimensions of the input matrix A and/or the output matrix C, properties of the elements of the input matrix A and/or the output matrix C, and filter properties.

14. The method of claim 11, wherein:

the memory stores an input matrix B having values to be used in processing input data instances through the neural network; and

the method further comprises:

processing, by the processor, input matrix A using each of the threads, the processing including identifying a location in memory using the compiled representation in the local cache memory and using values in elements of input matrix A obtained from the location, together with values from elements at corresponding locations of input matrix B, as inputs to one or more general purpose matrix multiplication (GEMM) operations by the thread.

15. The method of claim 14, wherein processing the input matrix A using each of the threads comprises:

obtaining, by the processor, the base address and the relative offset for each thread of input matrix A from the compiled representation of input matrix A in the local cache memory;

while elements of input matrix A remain to be processed, for an active one of the threads:

reading, by the processor, a first value from an element of input matrix A at a location in the memory identified by the base address of the active thread for input matrix A adjusted by the corresponding relative offset;

reading, by the processor, a second value from an element of input matrix B at a memory location associated with the element of input matrix A;

using, by the processor, the first value and the second value as inputs to a GEMM operation; and

storing, by the processor, a result of the GEMM operation in an output matrix C in the memory.

16. The method of claim 15, further comprising:

using, by the processor, the index used to obtain the relative offset, in combination with the base address of input matrix B, to read the second value from input matrix B.

17. The method of claim 15, wherein storing the result of the GEMM operation in an output matrix C comprises:

storing, by the processor, each result at a location in the output matrix C that is determined based on the location in input matrix A at which the first value was read.

18. The method of claim 14, wherein the values in input matrix A and input matrix B comprise input values and weights, respectively, associated with an input data instance to be used in processing the input data instance through the neural network.

19. The method of claim 14, wherein the input matrix A and the input matrix B have dimensions larger than dimensions for the GEMM operation.

20. An electronic device, comprising:

a memory storing an input matrix A and an input matrix B, the input matrix A and the input matrix B having respective values to be used when processing an input data instance through a neural network;

a local cache memory storing a compiled representation comprising values for obtaining data from input matrix A when processing an input data instance through the neural network, the values comprising a base address in input matrix A and a relative offset for each of a plurality of threads, the relative offset being a distance between elements of input matrix A to be processed by the thread; and

a processor configured to:

retrieve the compiled representation from the local cache memory; and

for each of the plurality of threads, process input matrix A by using the compiled representation to identify a location in memory, using values in elements of input matrix A taken from the location, together with values from elements at corresponding locations of input matrix B, as inputs to one or more general matrix multiplication (GEMM) operations by the thread.

21. The electronic device of claim 20, wherein when processing the input matrix A, the processor is configured to:

obtain the base address for each thread of input matrix A from the compiled representation of input matrix A in the local cache memory;

while elements of input matrix A remain to be processed, for an active one of the threads:

obtain a next relative offset from the compiled representation;

increase a base address of the active thread by the next relative offset;

read a first value from an element of input matrix A at a location in the memory identified by the base address of the active thread;

read a second value from an element of input matrix B at a memory location associated with the element of input matrix A;

use the first value and the second value as inputs to a GEMM operation; and

store a result of the GEMM operation in an output matrix C in the memory.

22. The electronic device of claim 21, wherein the processor is further configured to use an index used to obtain the relative offset, in combination with a base address of input matrix B, to read the second value from input matrix B.

23. The electronic device of claim 21, wherein, when storing the result of the GEMM operation in an output matrix C, the processor is further configured to:

store each result in the output matrix C at a position in the output matrix C determined based on the position in input matrix A at which the first value was read.

24. A method for generating a compiled representation in an electronic device, the electronic device comprising: a local cache memory; a memory storing an input matrix A and an input matrix B having respective values to be used in processing an input data instance through a neural network; and a processor, the method comprising:

obtaining, by the processor, the compiled representation from the local cache memory, the compiled representation including values for obtaining data from input matrix A as input data instances are processed through the neural network, the values including a base address in input matrix A and a relative offset for each of a plurality of threads, the relative offset being a distance between elements of input matrix A to be processed by the thread; and

for each thread of the plurality of threads, processing, by the processor, input matrix A by using the compiled representation to identify a location in memory, using values in elements of input matrix A taken from the location, together with values from elements at corresponding locations of input matrix B, as inputs to one or more general matrix multiplication (GEMM) operations by the thread.

25. The method of claim 24, wherein processing the input matrix A comprises:

obtaining, by the processor, the base address for each thread of input matrix A from the compiled representation of input matrix A in the local cache memory;

while elements of input matrix A remain to be processed, for an active one of the threads:

obtaining, by the processor, a next relative offset from the compiled representation;

increasing, by the processor, a base address of the active thread by the next relative offset;

reading, by the processor, a first value from an element of input matrix A at a location in the memory identified by the base address of the active thread;

reading, by the processor, a second value from an element of input matrix B at a memory location associated with the element of input matrix A;

using, by the processor, the first value and the second value as inputs to a GEMM operation; and

storing, by the processor, a result of the GEMM operation in an output matrix C in the memory.

26. The method of claim 25, further comprising using, by the processor, an index used to obtain the relative offset, in combination with a base address of input matrix B, to read the second value from input matrix B.

27. The method of claim 25, wherein storing the result of the GEMM operation in an output matrix C comprises:

storing, by the processor, each result at a location in the output matrix C that is determined based on the location in input matrix A at which the first value was read.

Background

Prior Art

Some electronic devices perform operations for artificial neural networks, or more simply "neural networks". Generally, a neural network is a computational structure that includes internal elements that have similarities to a biological neural network, such as those associated with the brain of a living being. A neural network may be trained to perform a specified task by configuring the internal elements of the neural network using known instances of training data, so that the neural network can then perform the specified task on unknown instances of input data. For example, one task performed by a neural network is to identify whether an image includes an image element such as a face or a vehicle. When training a neural network to perform such image recognition, images known to include (or not to include) the image element are processed through the neural network to configure the internal elements to generate appropriate outputs when an unknown image is subsequently processed to identify whether the image element is present in the unknown image.

One type of neural network is a "fully connected" neural network. A fully connected neural network includes, in its internal elements, a set of artificial neurons or "nodes" interconnected with one another in an arrangement that has some similarity to the way neurons are interconnected by synapses in a biological brain. A fully connected neural network can be visualized as a form of weighted graph structure, where the nodes include input nodes, intermediate (or "hidden") nodes, and output nodes. Fig. 1 presents a block diagram illustrating a fully connected neural network 100, the neural network 100 including input nodes 102, intermediate nodes 104 in layers 110 and 112, output nodes 106, and directed edges 108 (only two directed edges and two layers are labeled for clarity). Within a fully connected neural network, each node except the output nodes 106 is connected to one or more downstream nodes by a directed edge with an associated weight. During operation, the input nodes 102 in a first layer of the fully connected neural network 100 receive inputs from external sources and process the inputs to generate input values. The input nodes 102 forward the input values to the intermediate nodes 104 in the next layer 110 of the fully connected neural network 100. Each receiving intermediate node 104 weights, i.e., adjusts, such as by multiplying, the received input based on the weight of the corresponding directed edge. Each intermediate node 104 sums the corresponding weighted received inputs and a possible bias value to generate an internal value and evaluates an activation function of that intermediate node 104 using the internal value to produce a result value. The intermediate nodes 104 then forward the result values as input values to the intermediate nodes 104 in the next layer 112 of the fully connected neural network 100, where the input values are used to generate internal values and evaluate activation functions, as described above. In this manner, values advance through the intermediate nodes 104 in each layer of the fully connected neural network 100 until the intermediate nodes 104 of the last layer forward result values to the output nodes 106 of the fully connected neural network 100, the output nodes 106 generating outputs for the fully connected neural network 100. Continuing with the example above, the output produced by the output nodes 106, and thus the output produced by the fully connected neural network 100, may be in the form of a number, e.g., between 0 and 1, indicating whether the image is likely or unlikely to include the specified image element.

As described above, the values forwarded along the directed edges between nodes in a fully connected neural network (e.g., fully connected neural network 100) are weighted according to the weights associated with the directed edges. By setting the weights associated with the directed edges during a training process such that desired outputs are generated by the fully connected neural network, the fully connected neural network may be trained to generate desired outputs, such as the above-described identification of image elements in images. When training a fully connected neural network, numerous training data instances having expected outputs are processed through the fully connected neural network to produce actual outputs from the output nodes. Continuing with the example above, the training data instances include digital images that are known to include (or not include) a particular image element, and thus have expected outputs for the fully connected neural network that indicate that the image element is likely present (or not present) in the image. After each training data instance is processed through the fully connected neural network to produce an actual output, an error value or "loss" between the actual output and the corresponding expected output is calculated using mean squared error, log loss, or another algorithm. The loss is then worked back through the fully connected neural network, or "backpropagated" through the fully connected neural network, and used to adjust the weights associated with the directed edges in the fully connected neural network in order to reduce the error for the training data instance. The backpropagation operation adjusts the response of the fully connected neural network for that particular training data instance and all subsequent instances of input data. For example, one backpropagation technique (which may be referred to as gradient descent) involves calculating a gradient of the loss with respect to the weight for each directed edge in the fully connected neural network. Each gradient is then multiplied by a training coefficient or "learning rate" to calculate a weight adjustment value. The weight adjustment value is then used to calculate an updated value for the corresponding weight, e.g., by being added to the existing value of the corresponding weight.

Another type of neural network is a "convolutional" neural network. Fig. 2 presents a block diagram illustrating a convolutional neural network 200. As can be seen in fig. 2, the internal elements of convolutional neural network 200 can be grouped into feature processing element 202 and classification element 204. The feature processing element 202 processes features in an input data instance 216 (e.g., a digital image, a digital audio recording, etc.) in preparation for feature classification in the classification element 204. The feature processing element 202 includes internal elements for convolution, normalization, and pooling. In the convolution 208 internal element, a feature map is generated from the input data instance using a set of filters. The feature map is then normalized (e.g., using a rectified linear unit) in the normalization 210 internal element. After processing in the normalization 210 internal element, the feature map is further processed (e.g., sub-sampled, down-sampled, etc.) in the pooling 212 internal element to generate a reduced-dimension feature map. The flatten 214 internal element next prepares the reduced-dimension feature map from the pooling 212 internal element for input into the fully connected 206 internal element. The classification element 204 includes a fully connected 206 neural network (similar to the fully connected neural network described above) that classifies the input (i.e., the flattened reduced-dimension feature map) as including the specified element (or not) and produces an output 218 that represents the classification. As with a fully connected neural network, backpropagation (e.g., gradient descent, etc.) may be used to train the convolution 208 internal element by adjusting values in the set of filters and possibly other values in the internal elements of the feature processing element 202.

The examples of fully connected neural networks and convolutional neural networks shown in fig. 1 and 2 are simplified. In almost all cases, a fully connected neural network includes more layers and/or nodes. In practice, for many applications, a fully connected neural network may have thousands of nodes arranged in a large number of layers. Furthermore, the feature processing element for the convolutional neural network may have multiple/repeated layers of convolution, normalization, and pooling internal elements. The examples in fig. 1 and 2 are also generic; a fully connected neural network and/or convolutional neural network may include different arrangements of internal elements and/or internal elements not shown in fig. 1-2. In general, processing input data instances through a fully connected neural network or a convolutional neural network involves a large number of internal elements and is therefore computationally intensive.

Due to the arrangement of elements (e.g., feature processing elements, nodes, etc.) in many neural networks, operations such as computing gradients in fully connected neural networks for training, computing internal values for activation functions, and using filters in convolutional neural networks to generate feature maps are typically performed using matrix mathematical operations. In other words, input data, weights, filter values, and the like are arranged in matrices, and the above-described computations are then performed using matrix mathematical operations. In many cases, matrix mathematical operations are performed using general-purpose matrix multiplication (GEMM) operations, which are part of the well-known Basic Linear Algebra Subroutine (BLAS) specification. According to the BLAS specification, GEMM operations require two-dimensional inputs. Because matrices used for neural network computations typically have higher dimensions (i.e., three or more dimensions) and/or have data that can be organized in a number of different ways, such matrices must be converted or translated from the higher-dimensional format stored in memory to the two-dimensional format used for GEMM operations. To this end, designers have developed customized program code, sometimes referred to as "kernels," for handling higher-dimensional input matrices and/or data organization within input matrices. The program code/kernel converts or translates the matrix of values into the two-dimensional format required for the GEMM operation, and then performs the GEMM operation. Developing, maintaining, and properly using customized program code/kernels for neural network processing is time consuming, difficult, and error prone. Furthermore, in the program code/kernel, converting or translating the matrix from a higher-dimensional format to a two-dimensional format for GEMM operations greatly slows down the GEMM operations. This is true because the program code/kernel and GEMM operations must dynamically compute the addresses of the locations in memory where the matrix data is stored. Furthermore, the program code/kernel may be inefficient with respect to memory access locality, particularly between sequential executions of the program code/kernel, which can make the overall process of executing the program code/kernel inefficient. In other words, between multiple sequential executions of the program code/kernel (e.g., by different processing threads in a multithreaded processing subsystem), the program code/kernel may repeatedly load cache lines into the cache memory, and thus overwrite other cache lines, when only a small portion of the data in each loaded cache line is actually used.

Drawings

FIG. 1 presents a block diagram illustrating a fully connected neural network.

FIG. 2 presents a block diagram illustrating a convolutional neural network.

FIG. 3 presents a block diagram illustrating an electronic device according to some embodiments.

FIG. 4 presents a block diagram illustrating a processing subsystem according to some embodiments.

FIG. 5 presents a block diagram illustrating a compiled representation in accordance with some embodiments.

FIG. 6 presents a block diagram illustrating an input matrix stored in a memory according to some embodiments.

FIG. 7 presents a flowchart illustrating a process for generating a compiled representation, in accordance with some embodiments.

FIG. 8 presents keywords describing factors or values used in expressions according to some embodiments.

FIG. 9 presents a pseudo-code example for the computation of a base address of a compiled representation, in accordance with some embodiments.

FIG. 10 presents a pseudo-code example for the calculation of a relative offset of a compiled representation, in accordance with some embodiments.

FIG. 11 presents a flowchart illustrating a process for using a compiled representation, in accordance with some embodiments.

FIG. 12 presents an example of pseudo code using compiled representations, in accordance with some embodiments.

FIG. 13 presents an example of using a compiled representation in accordance with some embodiments.

FIG. 14 presents an example of a set of elements of an input matrix in accordance with some embodiments.

FIG. 15 presents a pseudo-code example of generating a compiled representation according to some embodiments.

Like reference numerals refer to like elements throughout the drawings and the description.

Detailed Description

The following description is presented to enable any person skilled in the art to make and use the described embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the described embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments and applications. Thus, the described embodiments are not limited to the embodiments shown, but are to be accorded the widest scope consistent with the principles and features disclosed herein.

Terminology

In the following description, various terms are used to describe embodiments. The following is a simplified and general description of some of these terms. Note that these terms may have important additional aspects that are not enumerated herein for the sake of clarity and brevity, and thus the description is not intended to limit these terms.

Functional block: A functional block refers to a group, collection, and/or set of one or more interrelated circuit elements (such as integrated circuit elements/circuits, discrete circuit elements/circuits, etc.). Circuit elements are "interrelated" in that the circuit elements share at least one property. For example, interrelated circuit elements may be included in, fabricated on, or otherwise coupled to a particular integrated circuit chip, substrate, circuit board, or portion thereof, may be involved in the performance of a given function (a computation or processing function, a memory function, etc.), may be controlled by a common control element and/or a common clock, and/or the like. Functional blocks may include circuits having any number of circuit elements, from a single circuit element (e.g., a single integrated circuit logic gate or discrete circuit element) to millions or billions of circuit elements (e.g., an integrated circuit memory).

Thread: A thread is an abstract entity that represents or identifies the execution of a portion of program code or a "kernel" for performing tasks on processing hardware in a processing subsystem (e.g., a central processing unit or a graphics processing unit). The processing hardware may include separate circuitry (e.g., registers, flags, etc.) for each of a plurality (and possibly a large number) of supported threads, and thus may store and track the respective operating state of each thread. In some embodiments, one or more of the multiple supported threads are simultaneously active and thus execute corresponding program code. The remaining threads are inactive and thus queued, waiting for availability of processing hardware resources, data returned from memory, etc. In some embodiments, a computational job, such as processing data instances through a neural network, is divided into portions and the portions of the computational job are assigned to individual threads from among a plurality of threads for processing. For example, multiple threads may be used to perform various matrix operations, such as matrix mathematical operations for processing a multidimensional input matrix of input data instances through a neural network.

General matrix multiplication

In the described embodiments, a general matrix multiplication (GEMM) operation is used to compute values when processing input data instances through a neural network. Generally, GEMM operations are implemented according to a specification, such as the well-known Basic Linear Algebra Subroutine (BLAS) specification, and include (or are used for) matrix-matrix operations, such as multiplication, fused multiply-add (FMA), and the like. In some embodiments, the program code for a GEMM subroutine is included in a library, such as a math library, and thus may be called by an operating system, application programs, and other software routines executed by a processing subsystem in the electronic device.

In some embodiments, the subroutine for the GEMM operation is configured to accept a particular format and/or dimensionality of the input matrices. For example, in some embodiments, GEMM operations use, and may require, a specified two-dimensional input format for the input matrices. For example, one well-known version of GEMM in the BLAS specification has the format C = α·op(A)·op(B) + β·C, where α and β are scalars and A, B, and C are matrices, op(A) is an M by K matrix, op(B) is a K by N matrix, and C is an M by N matrix, where K, M, and N are integers.
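As a concrete illustration of this two-dimensional interface, the following sketch calls the single-precision GEMM routine through a CBLAS binding of the BLAS specification; the row-major layout, matrix sizes, and scalar values are illustrative assumptions rather than requirements of the described embodiments.

#include <cblas.h>

/* Computes C = alpha * op(A) * op(B) + beta * C with op() = identity.
 * A is M x K, B is K x N, and C is M x N, all row-major. */
void gemm_example(int M, int N, int K,
                  const float *A, const float *B, float *C)
{
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                alpha, A, K,   /* lda = K for a row-major M x K matrix */
                       B, N,   /* ldb = N for a row-major K x N matrix */
                beta,  C, N);  /* ldc = N for a row-major M x N matrix */
}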

In the described embodiment, the GEMM operation may be decomposed or divided into a plurality of separate sub-operations for execution on the multi-threaded processing subsystem. In these embodiments, some or all of the threads in the set of threads supported by the processing subsystem are responsible for performing one or more sub-operations of the entire GEMM operation. For example, if the GEMM operation is used as part of a Fused Multiply Add (FMA) in which the elements of two matrices are multiplied together and then the results of the multiplications are summed to generate the final result of the FMA, then each separate thread may be used to perform one or more of the GEMM multiplication and FMA addition operations.
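The following is a minimal sketch of such a decomposition, assuming each thread is mapped to a single element of the M by N output matrix and performs the K multiply-accumulate (FMA) sub-operations for that element; the thread-to-element mapping and the parameter names are assumptions for illustration only.

/* One thread's share of a GEMM decomposed across threads: the thread
 * computes a single element of the M x N output via K FMA sub-operations.
 * A is M x K, B is K x N, C is M x N, all row-major. */
void gemm_thread_suboperation(int thread_id, int M, int N, int K,
                              const float *A, const float *B, float *C)
{
    int m = thread_id / N;        /* output row handled by this thread    */
    int n = thread_id % N;        /* output column handled by this thread */
    if (m >= M) return;           /* thread has no assigned output element */

    float acc = 0.0f;
    for (int k = 0; k < K; k++) {
        acc += A[m * K + k] * B[k * N + n];   /* one multiply-add step */
    }
    C[m * N + n] = acc;
}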

Neural network

As described above, a neural network is a computational structure that includes internal elements (e.g., nodes, directed edges, feature processing elements, etc.) that are trained to perform specified tasks using or on input data instances. In the described embodiment, as part of the operation of processing an input data instance through a neural network, a thread performs a GEMM operation to compute a specified value. For example, in some embodiments, a processing thread in the processing subsystem uses input data values from a first input matrix and weight values from a second input matrix as inputs to a GEMM operation for computing internal values of nodes in a fully connected neural network. As another example, in some embodiments, the thread uses values from the input matrix as inputs to the GEMM operation for convolution calculations in the feature processing elements of the convolutional neural network. As described in more detail below, when processing input data instances in a neural network, a compiled representation is used to translate or convert a higher-dimensional input matrix into a format that can be used in a two-dimensional GEMM operation.

Overview

In the described embodiments, an electronic device includes a processing subsystem (e.g., a graphics processing unit or a central processing unit), a memory (e.g., a "main" memory), and a local cache memory for the processing subsystem. The processing subsystem performs operations for processing the input data instance through the neural network using a plurality of threads supported by the processing subsystem. As part of the operation of processing an input data instance through the neural network, the thread performs a generic matrix multiplication (GEMM) operation to calculate a specified value.

As described above, GEMM operations may use, and may require, two-dimensional inputs. However, when processing input data instances in a neural network, the described embodiments may operate on higher-dimensional input matrices (i.e., three or more dimensions). For example, in some embodiments, the input matrix is organized in the well-known four-dimensional NCHW (where N is batch, C is channel, H is height, W is width) or CNHW format, the five-dimensional NCHWVector format, or the like. Because the data in the input matrix is stored in memory according to a given higher-dimensional format, the address in memory where the data to be processed by each thread is located must be determined before the data can be used for GEMM operations. In other words, the higher-dimensional format must be converted or translated into the two-dimensional format used by the GEMM operation. For this operation, the described embodiments generate a "compiled representation" of the data in one of the two input matrices used for the GEMM operation (referred to herein as input matrix A), which is used to determine the locations in memory of the data in input matrix A. Generally, the compiled representation is a record (i.e., a list, index, etc.) that includes information to be used to identify, on a per-thread basis, the locations in memory where the elements in input matrix A are stored. The compiled representation may thus be used to retrieve data of input matrix A from memory for processing in a GEMM operation by each of the threads.
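For reference, the element at position (n, c, h, w) of a four-dimensional NCHW-format matrix sits at a linear element offset that can be computed from the channel, height, and width extents, as in the following sketch (the function name is illustrative):

#include <stddef.h>

/* Linear element offset of element (n, c, h, w) in an NCHW-format matrix
 * with C channels, height H, and width W. */
static inline size_t nchw_index(size_t n, size_t c, size_t h, size_t w,
                                size_t C, size_t H, size_t W)
{
    return ((n * C + c) * H + h) * W + w;
}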

In some embodiments, the information in the compiled representation has a specified relationship with the second input matrix for the GEMM operation, referred to herein as input matrix B. Generally, when processing an input data instance through a neural network, values from input matrix A (e.g., input data) and values from input matrix B (e.g., weights or convolution filter values) are processed together in a GEMM operation. The specified relationship between the compiled representation and input matrix B is such that each location in the compiled representation located at a respective offset from the base address of the compiled representation includes information for determining the address of the corresponding location in input matrix A, while the corresponding element of input matrix B is located at the same offset from the base address of input matrix B. In other words, if location X in the compiled representation includes information for determining the address in memory of a value in input matrix A to be used in the GEMM operation, the matching value in input matrix B will be located at an offset X from the base address of input matrix B. Thus, the compiled representation may not include information about input matrix B (with the processor instead being configured to recognize and use the specified relationship), or may include simple information about input matrix B, such as the base address or location of input matrix B in memory.

In the described embodiments, the compiled representation is generated "statically" and is thus prepared in advance for use in determining the addresses of memory locations during the GEMM operation. As used herein, statically means that the compiled representation is generated at some time before the input data instance is processed using the GEMM operation, and possibly a relatively long time before the input data instance is processed. In some embodiments, the statically generated compiled representation is separately generated and stored in preparation for subsequent processing of input data instances by the neural network.

In some embodiments, the compiled representation includes two types of information that are used in combination to determine the locations in memory of elements in input matrix A. The two types of information are: (1) per-thread base addresses and (2) relative offsets. Each base address is a memory address in input matrix A that serves as a starting point or initial address for the corresponding thread. Each thread has a separate and distinct base address, and thus the compiled representation includes a list of the respective base addresses for the threads. The relative offsets are memory offsets (e.g., numbers of elements, identifiers of elements, bytes, etc.) that are used in conjunction with the base addresses to determine the locations in memory of the elements in input matrix A that each thread will process. For example, if the base addresses of a first thread and a second thread are A1 and A2, and the relative offsets are X, Y, and Z, then the locations in memory of the elements that the first thread will process are A1+X, A1+Y, and A1+Z, and the locations in memory of the elements that the second thread will process are A2+X, A2+Y, and A2+Z.
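The following sketch shows how the two kinds of information combine to form element addresses; the arrays base[] and rel_off[] stand in for the per-thread base addresses (A1, A2, ...) and the shared relative offsets (X, Y, Z) from the example above, offsets are expressed in elements, and all names are illustrative assumptions.

#include <stddef.h>

/* Address of the i-th element of input matrix A to be processed by a given
 * thread: the thread's base address plus the i-th shared relative offset. */
const float *element_for(const float *matrix_a,
                         const size_t *base,      /* one entry per thread  */
                         const size_t *rel_off,   /* shared by all threads */
                         int thread_id, int i)
{
    return matrix_a + base[thread_id] + rel_off[i];
}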

In some embodiments, the compiled representation is generated using a set of values associated with input matrix A, the output matrix (referred to herein as output matrix C), properties of the neural network, and/or other values. For example, in some embodiments, computing the base address of input matrix A for each thread includes computing the base address from some or all of a thread identifier for the thread, dimensions of input matrix A and/or output matrix C, properties of the elements of input matrix A and/or output matrix C, and neural network properties. As another example, in some embodiments, calculating the relative offsets for input matrix A includes calculating the relative offsets from some or all of the dimensions of input matrix A and/or output matrix C, the properties of the elements of input matrix A and/or output matrix C, and the properties of the neural network. Some example expressions for calculating the base addresses and relative offsets are described below.
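As a hedged illustration of the general kind of computation involved (and not the example expressions described below), the following sketch assumes a two-dimensional convolution over an NCHW input with an R by S filter, unit stride, and no padding; each thread handles one output position (n, oh, ow), its base address is the element offset of the top-left corner of that thread's receptive field, and the relative offsets, which are the same for every thread, walk the receptive field across the channels. All parameter and function names are assumptions.

/* Fill per-thread base addresses and the shared relative offsets for an
 * NCHW input of size N x C x H x W, an R x S filter, and an OH x OW output,
 * assuming unit stride and no padding. Offsets are in elements. */
void build_compiled_representation(int N, int C, int H, int W,
                                   int R, int S, int OH, int OW,
                                   size_t *base,      /* N*OH*OW entries */
                                   size_t *rel_off)   /* C*R*S entries   */
{
    /* Per-thread base addresses: thread_id maps to (n, oh, ow). */
    for (int tid = 0; tid < N * OH * OW; tid++) {
        int n  = tid / (OH * OW);
        int oh = (tid / OW) % OH;
        int ow = tid % OW;
        base[tid] = (((size_t)n * C) * H + oh) * W + ow;
    }

    /* Relative offsets, identical for every thread: one per (c, r, s) tap,
     * so that base + offset addresses element (n, c, oh + r, ow + s). */
    int i = 0;
    for (int c = 0; c < C; c++)
        for (int r = 0; r < R; r++)
            for (int s = 0; s < S; s++)
                rel_off[i++] = ((size_t)c * H + r) * W + s;
}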

In some embodiments, processing the input data instance through the neural network includes using the compiled representation to obtain the input data from input matrix A, and separately obtaining the corresponding data from input matrix B. For this operation, the processing subsystem obtains the base address of input matrix A for each thread. Then, while elements of input matrix A remain to be processed, the processing subsystem uses the relative offsets from the compiled representation to obtain the values for each thread to process in the GEMM operation. More specifically, for the next thread, the processing subsystem reads a first value from an element of input matrix A at a location in memory identified by the base address of the next thread adjusted by the corresponding relative offset from the compiled representation. The processing subsystem then reads a second value from the element of input matrix B at the memory location associated with the element of input matrix A (i.e., at a location in input matrix B corresponding to the base address of the next thread adjusted by the corresponding relative offset). The processing subsystem then uses the first and second values as inputs to the GEMM operation of the thread. Next, the processing subsystem stores the result of the GEMM operation in output matrix C in memory.
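A minimal sketch of this per-thread flow, under the same assumptions as the earlier sketches: the thread's base address is adjusted by each relative offset to locate an element of input matrix A, the corresponding element of input matrix B is obtained by simply advancing an index, and each pair of values feeds a multiply-accumulate whose result is written to output matrix C. The mapping of one accumulated result per thread is an assumption for illustration only.

/* Per-thread processing using the compiled representation: base[] holds the
 * per-thread base addresses, rel_off[] the shared relative offsets. */
void process_thread(int thread_id,
                    const float *matrix_a, const float *matrix_b,
                    float *matrix_c,
                    const size_t *base, const size_t *rel_off,
                    int num_offsets)
{
    float acc = 0.0f;
    for (int i = 0; i < num_offsets; i++) {
        /* First value: element of A at the base address adjusted by the
         * corresponding relative offset from the compiled representation. */
        float a_val = matrix_a[base[thread_id] + rel_off[i]];
        /* Second value: corresponding element of B, accessed sequentially
         * from B's base address by incrementing an index. */
        float b_val = matrix_b[i];
        acc += a_val * b_val;                 /* GEMM (FMA) sub-operation */
    }
    matrix_c[thread_id] = acc;                /* store result in output C */
}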

In some embodiments, the compiled representation is organized so that data is obtained from input matrix A and/or input matrix B in a manner that facilitates the reuse of data in cache lines in the local cache memory. In these embodiments, data values used sequentially by two or more threads (i.e., during adjacent sequential GEMM operations) may be fetched from memory together (e.g., in a sequential memory read operation) and stored in respective portions of a single cache line. In this manner, the data is available in the cache for all of the two or more threads, which may help avoid evicting useful data from the cache because multiple cache lines need not be loaded into the cache memory to fulfill the data needs of the two or more threads.

By using a statically generated compiled representation to fetch data from the input matrix A, the described embodiments may avoid calculating address information during the execution of the GEMM operations, which may significantly speed up the GEMM operations. The use of a compiled representation may also mean that the same kernel (or other program code) may be used to perform GEMM operations for different formats and arrangements of input matrices. Simplified, optimized, and general-purpose program code may be used in the kernel, which may speed up the kernel (and thus GEMM operations). The described embodiments may also more efficiently use space/cache lines in the cache memory, which may improve the performance of the cache memory. Improving the performance of the GEMM operations and cache memory when processing input data instances through the neural network may improve the overall performance of the electronic device, which results in increased user satisfaction.

Electronic device

Fig. 3 presents a block diagram illustrating an electronic device 300 according to some embodiments. The electronic device 300 includes a processor 302 and a memory 304. Generally, the processor 302 and the memory 304 are functional blocks implemented in hardware (i.e., using various circuits, circuit elements, and devices). For example, the processor 302 and the memory 304 may be implemented entirely on one or more semiconductor chips, including on one or more separate semiconductor chips, may be implemented using semiconductor chips in combination with discrete circuit elements, may be implemented using discrete circuit elements alone, and so forth. The processor 302 and memory 304 perform operations associated with processing input (or training) data instances through a neural network during training or while performing tasks associated with the neural network (e.g., classification, etc.). For example, the processor 302 and/or the memory 304 perform operations for statically generating a compiled representation of an input matrix that is used for determining the locations of data in the input matrix for subsequent GEMM operations.

The processor 302 is a functional block that performs computations, memory accesses, and other operations in the electronic device 300. The processor 302 includes a processing subsystem 306 and a cache memory 308. The processing subsystem 306 includes one or more functional blocks that perform general-purpose computing, control, and/or other operations, such as a Central Processing Unit (CPU) core, a Graphics Processing Unit (GPU) core, a neural network processor/accelerator, an embedded processor, and/or an Application Specific Integrated Circuit (ASIC). Fig. 4 presents a block diagram illustrating the processing subsystem 306 according to some embodiments. Generally, in fig. 4, the processing subsystem 306 is shown as including some of the same internal functional blocks that may be found in a general-purpose GPU. Although a particular arrangement and number of elements are shown in the processing subsystem 306, in some embodiments a different number and/or arrangement of elements is used in the processing subsystem 306. For example, in some embodiments, the internal functional blocks in the processing subsystem 306 are similar to those that might be found in a CPU. As another example, in some embodiments, the internal functional blocks in the processing subsystem 306 include a combination of functional blocks similar to those found in both CPUs and general-purpose GPUs.

As can be seen in fig. 4, the processing subsystem 306 includes a plurality of compute units 400, each compute unit including a plurality of stream processors 402 and a register file 404 (these are shown only in the topmost compute unit 400, and only three of the stream processors 402 are labeled for clarity). Each stream processor 402 is a functional block that includes processing circuitry (e.g., floating point or integer arithmetic logic units, logic circuitry, etc.) to be used to perform processing operations for one or more processing threads (or, more simply, "threads"). In some embodiments, each compute unit 400 supports multiple threads, e.g., 25, 50, or another number of threads, and the stream processors 402 in each compute unit 400 may be used (e.g., on demand, time-sliced, in a round-robin fashion, etc.) to process the operations of multiple different threads. The register file 404 is a functional block that includes memory circuitry for storing information and data for the stream processors 402. In some embodiments, the register file 404 includes separate copies of the memory circuitry for storing information and data for each of the various threads. The front end 406 is a functional block that performs operations for scheduling threads executing in the stream processors 402; receiving and decoding/interpreting commands, requests, messages, etc. from other entities in the electronic device 300; communicating with other entities in the electronic device 300; and the like. The cache memory 408 is a functional block that includes memory circuitry for storing local copies of instructions, information, and/or data for the compute unit 400 (e.g., copies of instructions, information, and/or data fetched from the cache memory 308, the memory 304, and/or another source). The cache memory 408 is a higher-speed memory in which copies of instructions, information, and/or data (e.g., operands for compute operations, results from operations, instructions, configuration values, and/or other information) are stored so that the instructions, information, and/or data can be accessed faster than data in the cache memory 308 or the memory 304. In some embodiments, the cache memory 408 is located near the processing circuitry in the compute unit 400 that accesses copies of data in the cache memory 408 and is connected to the processing circuitry via a high-speed interface, thereby enabling fast access to stored data. In some embodiments, the cache memory 408 is used to store compiled representations of input matrices to be used as described herein. The memory interface (MEM INT) 410 is a functional block that includes circuitry for accessing (e.g., reading, writing, invalidating, etc.) instructions, information, and data in the cache memory 308, the memory 304, and the cache memory 408.

Cache memory 308 is a functional block that performs operations to store copies of instructions, information, and data for access by processing subsystem 306 and possibly other entities in electronic device 300. Cache memory 308 is a higher speed memory in which copies of data are stored to enable faster access to data than data in memory 304. In some embodiments, the cache memory 308 is located near the processing circuitry in the processing subsystem 306 that accesses copies of data in the cache memory 308 and is connected to the processing circuitry via a high speed interface, thereby enabling fast access to the stored data. In some embodiments, cache memory 308 includes volatile memory circuitry, such as Static Random Access Memory (SRAM) circuitry, for storing data.

The memory 304 is a functional block that performs the operations of a memory (e.g., a "main" memory) in the electronic device 300. The memory 304 includes memory circuitry for storing instructions, information, and data for access by functional blocks in the electronic device 300, and control circuitry for handling accesses of data in the memory circuitry. The memory circuits in memory 304 include volatile memory circuits such as fourth-generation double data rate synchronous dynamic random access memory (DDR4 SDRAM), Static Random Access Memory (SRAM), and/or other types of memory circuits. In some embodiments, copies of data are retrieved from memory 304 and stored in cache memory 308 and cache memory 408 to enable faster access to the copies of the data.

The electronic device 300 is shown to include a particular number and arrangement of elements (e.g., functional blocks and devices such as a processor 302, memory 304, etc.). However, the electronic device 300 is simplified for illustrative purposes. In some embodiments, there are different numbers or arrangements of elements in the electronic device 300. For example, electronic device 300 may include a power subsystem, a human-machine interface system, and the like. As another example, in some embodiments, electronic device 300 includes a mass storage device, such as a disk drive or higher capacity integrated circuit memory, that acts as a non-volatile storage for data in electronic device 300, and from which copies of instructions, information, and data are retrieved for storage in memory 304, cache memory 308, and/or cache memory 408. Generally, electronic device 300 includes sufficient elements to perform the operations described herein.

The electronic device 300 may be or may be included in any electronic device that performs neural network processing using an input matrix. For example, the electronic apparatus 300 may be or may be included in an electronic apparatus, such as a desktop computer, a laptop computer, a wearable electronic apparatus, a tablet computer, a smartphone, a server, an artificial intelligence device, virtual or augmented reality equipment, a network appliance, a toy, audiovisual equipment, a home appliance, a controller, a vehicle, and/or the like, and/or combinations thereof.

Compiled representation

In the described embodiments, a processing subsystem (e.g., processing subsystem 306) generates a compiled representation for an input matrix to be used to process input data instances through a neural network. The compiled representation is a reference, i.e., a list, index, etc., that may be used by the processing subsystem to determine the locations/addresses in memory (e.g., memory 304) of elements in a higher-dimensional input matrix (i.e., three or more dimensions), referred to for convenience as input matrix A, that will be used as inputs for GEMM operations. In other words, the compiled representation serves as a transformation or mapping from the higher-dimensional input matrix A to a two-dimensional input matrix for GEMM operations (i.e., for matrix-matrix multiplication operations, etc.).

FIG. 5 presents a block diagram illustrating a compiled representation 500 according to some embodiments. The compiled representation 500 includes a plurality of base addresses 502. Each base address is or includes an address (or other identifier) of a location in memory in input matrix A that serves as the base address for a respective different thread. The base address of each thread is used as a starting point in memory for determining the addresses or positions of elements in input matrix A from which values are obtained for the thread to process in GEMM operations. For the example in fig. 5, the processing subsystem supports N threads, and thus the compiled representation 500 includes N separate base addresses 502. The compiled representation 500 also includes a plurality of relative offsets 504. Each relative offset 504 is a displacement or offset value that will be used with a base address to determine the address or location of an element in input matrix A from which a value is to be taken.
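One possible in-memory layout for such a compiled representation is sketched below, with the per-thread base addresses 502 followed by the shared list of relative offsets 504; the field names, the counts, and the use of element-granularity offsets are assumptions for illustration.

#include <stddef.h>

#define NUM_THREADS 64   /* illustrative thread count ("N" in FIG. 5)      */
#define NUM_OFFSETS 27   /* illustrative number of relative offsets        */

/* Sketch of a compiled representation: per-thread base addresses followed
 * by relative offsets shared by all threads, both in element units. */
struct compiled_representation {
    size_t base_address[NUM_THREADS];     /* base addresses 502, per thread */
    size_t relative_offset[NUM_OFFSETS];  /* relative offsets 504, shared   */
};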

FIG. 6 presents a block diagram illustrating input matrix A and input matrix B in a memory according to some embodiments. As can be seen in fig. 6, each of the input matrices A and B includes a plurality of elements 600 (only one of which is labeled for clarity), each element 600 storing a value to be used in the GEMM operations for processing an input data instance through the neural network. For example, the elements 600 of input matrix A may store input data values for a node, a feature extraction element, or some other portion of a neural network, and the elements 600 of input matrix B may store weight values, filter values, or other values to be used as inputs to a GEMM operation. For example, the input data may include values from pixels in an image to be classified, audio samples from a sound to be classified, portions of an image to be filtered in a feature extraction element, and so forth. The input matrices A and B are stored in memory according to a format such as NCHW, with the elements and values for the input data instance (or a group thereof) organized to meet the specification of the format.

Using the compiled representation in fig. 5, the base address of thread_0 plus a relative offset is used to determine the elements in input matrix A to be processed by the first thread (i.e., thread_0 or T0) in GEMM operations. For example, as shown in FIG. 6, the first element (i.e., the value therein) that thread_0 will process is the leftmost element in input matrix A, shown as T0+RO0 (i.e., base address thread_0 plus relative offset [0]). Thus, when processing an input data instance through a neural network, thread_0 first takes the value in the leftmost element of input matrix A and processes the value in a GEMM operation. Note that the selection of the "leftmost" element is for the example in FIG. 6 only, and in some embodiments, thread_0 starts from a different location in input matrix A, depending on the format of the data in input matrix A, the identity of the thread or other threads, and so forth. Continuing the example, the second element that thread_0 will process is the element shown as T0+RO1, which is the base address thread_0 plus relative offset [1], and the third element that thread_0 will process is the element shown as T0+RO2, which is the base address thread_0 plus relative offset [2]. In this manner, with the compiled representation 500, for each thread, the described embodiments advance through input matrix A in relative-offset order starting at the base address of the thread, fetching data and processing the data in GEMM operations.

In some embodiments, the compiled representation 500 is arranged such that the elements in input matrix B (i.e., the values therein) to be processed in GEMM operations have a specified relationship with the elements from input matrix A. In general, the relationship is specified such that minimal computation is required to obtain values from elements in input matrix B, e.g., incrementing or otherwise increasing an element pointer for input matrix B. For the example in FIG. 6, the elements of input matrix B are accessed by the thread in a sequential or linear manner starting from the base address of input matrix B. The base address of input matrix B is not a per-thread address as in compiled representation 500, but is the base address in memory of input matrix B itself. In these embodiments, therefore, and as shown in FIG. 6, the element of input matrix B that is obtained for processing with element T0+RO0 of input matrix A is element B0, the element of input matrix B that is obtained for processing with element T0+RO1 of input matrix A is element B1, and the element of input matrix B that is obtained for processing with element T0+RO2 of input matrix A is element B2. Thus, in the example of FIG. 6, as the processing subsystem determines the addresses of elements in input matrix A by advancing through the compiled representation 500, the processing subsystem simply increments the element pointer for input matrix B to obtain a value from the corresponding element of input matrix B.
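For illustration, the following C sketch contrasts the two addressing schemes just described: elements of A are located via the per-thread base address plus the k-th relative offset, while elements of B are simply read in order from B's single base address. The function and parameter names are hypothetical.

    #include <stdint.h>

    void fetch_pair(const char *matrix_a,        /* byte pointer to matrix A    */
                    const float *matrix_b,       /* element pointer to matrix B */
                    const uint64_t *base_addr,   /* per-thread base addresses   */
                    const uint32_t *rel_offset,  /* shared relative offsets     */
                    int tid, int k,
                    float *a_out, float *b_out)
    {
        /* A element: per-thread base address plus the k-th relative offset
         * (both expressed in bytes in this sketch).                          */
        *a_out = *(const float *)(matrix_a + base_addr[tid] + rel_offset[k]);
        /* B element: no per-thread bookkeeping; the index simply tracks k.   */
        *b_out = matrix_b[k];
    }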

In some embodiments, the compiled representation 500 is stored in a cache memory 506, the cache memory 506 being a local cache memory for the processing subsystem (e.g., cache memory 408 in the processing subsystem 306, etc.). Because the compiled representation 500 is stored in the cache memory 506, the base addresses 502 and the relative offsets 504 can be retrieved more quickly (than from, for example, memory 304) and used for determining the addresses of elements in memory.

Process for generating a compiled representation

In the described embodiments, when an input data instance is to be processed through the neural network, a processing subsystem (e.g., processing subsystem 306) generates a compiled representation of the input matrix that includes information to be used to locate elements of the input matrix in memory for GEMM operations. FIG. 7 presents a flowchart illustrating a process for generating a compiled representation, in accordance with some embodiments. Note that the operations illustrated in FIG. 7 are presented as a general example of operations performed by some embodiments. Operations performed by other embodiments may include different operations, operations performed in a different order, and/or operations performed by different entities or functional blocks. For example, FIG. 7 is described using a convolutional neural network with feature processing elements, filters, etc., but appropriately adjusted operations may be performed for other types of neural networks (such as fully connected neural networks, etc.).

For the operations in FIG. 7, it is assumed that input matrix A is stored in a memory (e.g., memory 304) and is thus available for the static generation of a compiled representation. It is also assumed that input matrix A has a plurality of elements (e.g., elements 600) in which data is stored that will be used for processing an input data instance through a neural network. For example, the elements in input matrix A may store data, such as values of pixels or pixel regions in an image, for processing in a neural network that classifies the image. It is further assumed that the data in input matrix A is organized in more than two dimensions, such as in the well-known four-dimensional NCHW (where N is batch, C is channel, H is height, and W is width) or CNHW format, a five-dimensional NCHWVector format, and so on.
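As background for the address calculations that follow, the NCHW format places element (n, c, h, w) at the flat index ((n*C + c)*H + h)*W + w. The following C helper, which is an illustrative sketch rather than code from the embodiments, computes that index.

    #include <stddef.h>

    /* Flat element index of (n, c, h, w) in an NCHW-ordered tensor with C
     * channels, height H, and width W.                                      */
    static inline size_t nchw_index(size_t n, size_t c, size_t h, size_t w,
                                    size_t C, size_t H, size_t W)
    {
        return ((n * C + c) * H + h) * W + w;
    }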

FIG. 8 presents a key describing factors or values used in expressions that describe the operations in FIG. 7 and FIGS. 9-10, in accordance with some embodiments. In general, the factors or values are properties or characteristics of the neural network, of the input data instances (or groups thereof), of the input matrices used to process the input data instances through the neural network, and of the output matrix to be generated when an input data instance is processed through the neural network. As can be seen in FIG. 8, the factors or values include: WI: width of input matrix A; HI: height of input matrix A; R: width of the filter used in the feature processing element; S: height of the filter used in the feature processing element; SU: filter U stride (span) used in the feature processing element; SV: filter V stride (span) used in the feature processing element; DU: filter U dilation used in the feature processing element; DV: filter V dilation used in the feature processing element; WO: width of output matrix C; HO: height of output matrix C; INC: input channels; and ONC: output channels. As described above, these factors and values apply to neural networks having feature processing elements, such as convolutional neural networks, but the principles described with respect to FIGS. 7-10 are similar for the generation of compiled representations for other types of neural networks.

The operations in FIG. 7 begin when the processing subsystem computes a corresponding base address, in input matrix A stored in memory, to be associated with each of a plurality of threads (step 700). For this operation, the processing subsystem computes a base address (e.g., base address 502) for each thread, which is subsequently used to determine the starting point of the elements in input matrix A from which values are obtained for processing by that thread. FIG. 9 presents a pseudo-code example of the base address computation for a compiled representation, in accordance with some embodiments. As can be seen in FIG. 9, when calculating the base addresses, the processing subsystem calculates intermediate values PIX and IBS as respective functions of the thread identifier (TID) of each thread and a combination of the factors and values described with respect to FIG. 8. The processing subsystem then calculates the base address (BASE_ADDR_A[TID]) for each thread using PIX and IBS and a combination of the factors and values described for FIG. 8. Note that SIZEOF(ELEM_TYPE) in the expressions in FIG. 9 is the size of an element in input matrix A (e.g., in units of bytes, etc.). Further, although FIG. 9 is presented as a general example, in some embodiments, different operations and/or different orders of operations are performed to calculate the base addresses.
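Because the exact expressions of FIG. 9 are not reproduced in this text, the following C sketch shows only the kind of calculation involved: the thread identifier is mapped to an output pixel and a batch image (standing in for PIX and IBS), and the byte offset of the first input element of that pixel's receptive field is returned. The mapping, the parameter names, and the layout assumptions (channel-0, top-left element of the receptive field) are assumptions made for illustration.

    #include <stdint.h>
    #include <stddef.h>

    uint64_t base_addr_a(uint32_t tid,             /* thread identifier (TID) */
                         uint32_t WO, uint32_t HO, /* output width / height   */
                         uint32_t WI, uint32_t HI, /* input width / height    */
                         uint32_t SU, uint32_t SV, /* filter strides U / V    */
                         uint32_t INC,             /* input channels          */
                         size_t elem_size)         /* SIZEOF(ELEM_TYPE)       */
    {
        uint32_t pix = tid % (WO * HO);  /* output pixel handled by the thread */
        uint32_t ibs = tid / (WO * HO);  /* batch image handled by the thread  */
        uint32_t ho  = pix / WO;         /* output row                         */
        uint32_t wo  = pix % WO;         /* output column                      */
        /* Element index of the channel-0, top-left input element of the
         * receptive field that produces output pixel (ho, wo) of image ibs.  */
        uint64_t elem = (uint64_t)ibs * INC * HI * WI
                      + (uint64_t)ho * SV * WI
                      + (uint64_t)wo * SU;
        return elem * elem_size;         /* byte offset into input matrix A    */
    }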

In some embodiments, the processing subsystem also stores the base address of a second input matrix, input matrix B, in the compiled representation. As described above, in some of these embodiments, the compiled representation does not include per-thread base address information for input matrix B, and thus the base address is simply the location (e.g., address, pointer, etc.) in memory from which the elements of input matrix B are found in order. However, in some embodiments, the base address of input matrix B is a variable, pointer, or other value managed separately from the compiled representation.

The processing subsystem then calculates relative offsets for the threads, which are the distances between the elements of input matrix A to be processed by a thread (step 702). For this operation, the processing subsystem computes a set of relative offsets (e.g., relative offsets 504), which are subsequently used in combination with the base address of each thread to determine the elements in input matrix A that the thread will process. FIG. 10 presents a pseudo-code example of the relative offset calculation for a compiled representation, according to some embodiments. As can be seen in FIG. 10, when calculating the relative offsets, the processing subsystem uses counters C, V, and U, which are bounded based on certain of the factors and values described for FIG. 8, to calculate each of the relative offsets (RELO). Note that SIZEOF(ELEM_TYPE) in the expressions in FIG. 10 is the size of an element in input matrix A. Further, although FIG. 10 is presented as a general example, in some embodiments, different operations and/or different orders of operations are performed to calculate the relative offsets.
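The following C sketch illustrates one plausible form of this calculation. The loop bounds and the offset expression are assumptions: the counters C, V, and U are taken to range over the input channels (INC), the filter height (S), and the filter width (R), respectively, and the dilations DV and DU and the input dimensions HI and WI from FIG. 8 convert each filter tap into a distance from the thread's base address. The offsets produced here are measured from the base address (matching the description of FIG. 6) rather than between consecutive elements.

    #include <stdint.h>
    #include <stddef.h>

    void compute_rel_offsets(uint32_t *relo,           /* out: INC*S*R offsets */
                             uint32_t INC,             /* input channels       */
                             uint32_t S, uint32_t R,   /* filter height/width  */
                             uint32_t DV, uint32_t DU, /* filter dilations     */
                             uint32_t HI, uint32_t WI, /* input height/width   */
                             size_t elem_size)         /* SIZEOF(ELEM_TYPE)    */
    {
        uint32_t idx = 0;
        for (uint32_t c = 0; c < INC; c++)
            for (uint32_t v = 0; v < S; v++)
                for (uint32_t u = 0; u < R; u++)
                    /* Distance, in bytes, from the thread's base address to
                     * the (c, v, u) tap of its receptive field.              */
                    relo[idx++] = (uint32_t)((c * HI * WI + v * DV * WI + u * DU)
                                             * elem_size);
    }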

In some embodiments, the compiled representation is organized so that data is obtained from input matrix A and/or input matrix B in a manner that facilitates reuse of the data in cache lines in the local cache memory. In these embodiments, data values used by two or more threads sequentially (i.e., during adjacent sequential GEMM operations) or temporally close to one another can be fetched from memory together (e.g., in a sequential memory read operation) and stored in respective portions of a single cache line. In this manner, the data is available in the cache for all of the two or more threads, which can help avoid evicting useful data from the cache because multiple cache lines need not be loaded into the cache memory to satisfy the data needs of the two or more threads.

The processing subsystem then stores the compiled representation of input matrix A, including the base addresses and the relative offsets, into a local cache memory (e.g., cache memory 308 or cache memory 408) (step 704). For this operation, in some embodiments, the processing subsystem stores the compiled representation into the cache memory in a format or arrangement such as that shown in FIG. 5. Generally, the processing subsystem stores the compiled representation into the cache memory for use in future processing of input data instances through the neural network. In some embodiments, in addition to or instead of storing the compiled representation in the cache memory, the processing subsystem stores the compiled representation in a non-volatile memory or transfers the compiled representation (e.g., via a communication network or the like) to another electronic device for storage thereby. In these embodiments, the compiled representation is prepared in advance for future use in processing input data instances through the neural network, possibly including by a processing subsystem in a separate/different electronic device.

In the described embodiments, the compiled representation is generated "statically" and is thus prepared in advance for use in determining the addresses of memory locations during GEMM operations. As used herein, statically means that the compiled representation is generated at some time before the input data instance is processed using GEMM operations, possibly a relatively long time (e.g., hours, days, etc.) before the input data instance is processed. For example, a compiled representation may be generated for the input matrix as part of a data preparation step prior to processing the data itself through the neural network. In other words, the statically generated compiled representation may be separately generated and stored (e.g., in a local cache memory, in memory, etc.) in preparation for subsequent processing of input data instances through the neural network. Static generation of compiled representations may be contrasted with dynamic or real-time computation of the memory addresses of elements in input matrices A and/or B.

In the described embodiments, the compiled representation is configured such that the program code used to perform the GEMM operations, e.g., in the kernel (or other arrangement of program code), is generic to the dimensions and/or format of the data in the input matrix. Thus, the kernel need only perform the simple address calculations described herein (e.g., base address plus relative offset) for using the compiled representation. By using a generic kernel with a compiled representation, many dimensions and/or formats of the input matrix can be used without requiring multiple kernels and/or program code in the kernels to handle different dimensions and/or formats of the input matrix. In some embodiments, a single kernel using a compiled representation may be developed. In some of these embodiments, the program code for statically generating the compiled representation, as described herein, handles various dimensions and/or formats of the input matrix.

Although the particular calculations of base addresses and relative offsets presented in FIGS. 9-10 are described as examples, in some embodiments, different calculations are used. For example, different calculations are used to generate base addresses and/or relative offsets for different dimensions and/or formats of input data, for different types of neural networks, and so on. Generally, in the described embodiments, when generating a compiled representation of an input matrix, base addresses and relative offsets having the properties described herein are computed.

Process for using a compiled representation

In the described embodiments, when an input data instance is processed through the neural network, the processing subsystem uses the compiled representation to obtain data from the input matrix to be used in GEMM operations. FIG. 11 presents a flowchart illustrating a process for using a compiled representation, in accordance with some embodiments. Note that the operations illustrated in FIG. 11 are presented as a general example of operations performed by some embodiments. Operations performed by other embodiments may include different operations, operations performed in a different order, and/or operations performed by different entities or functional blocks. For example, FIG. 11 is described using a convolutional neural network with feature processing elements, filters, etc., but appropriately adjusted operations may be performed for other types of neural networks (such as fully connected neural networks, etc.).

For the operations in FIG. 11, it is assumed that the compiled representation is stored in a local cache memory (e.g., cache memory 308 or cache memory 408) for the processing subsystem and is thus available for locating the elements of input matrix A for GEMM operations. In other words, in some embodiments, the compiled representation was previously generated statically using the process described for FIG. 7 and stored in the cache memory (possibly a relatively long time before the use shown in FIG. 11). Further, it is assumed that both input matrices A and B are stored in a memory (e.g., memory 304). Input matrices A and B each have a plurality of elements (e.g., elements 600) in which data is stored that will be used for processing an input data instance through a neural network. For example, the elements in input matrix A may store data, such as values of pixels or pixel regions in an image, for processing in a neural network that classifies the image, and the elements in input matrix B may store filter values or weights. The data of each of input matrices A and B is organized in more than two dimensions, such as in the well-known four-dimensional NCHW (where N is batch, C is channel, H is height, and W is width) or CNHW format, a five-dimensional NCHWVector format, and so forth.

FIG. 8 presents a key describing factors or values used in expressions that describe the operations in FIGS. 11-12, in accordance with some embodiments. The factors and values shown in FIG. 8 apply to neural networks having feature processing elements, such as convolutional neural networks, but the principles described with respect to FIG. 8 and FIGS. 11-12 are similar for the use of compiled representations with other types of neural networks.

The operations in FIG. 11 begin when the processing subsystem obtains the base address of an active thread from the compiled representation of input matrix A stored in the local cache memory (step 1100). For this operation, the processing subsystem reads the list of base addresses (e.g., base addresses 502) for the set of threads from the compiled representation in the local cache memory to obtain the corresponding base address of the active thread. In some embodiments, the processing subsystem performs this operation when a thread becomes active (i.e., is to process values from elements of input matrix A) and then retains the base address of the active thread (e.g., in a register or memory element). FIG. 12 presents an example of pseudo code for using a compiled representation, in accordance with some embodiments. The first few lines of the pseudo code in FIG. 12 illustrate the acquisition of the base address of an active thread. Note that although FIG. 12 is presented as an example, in some embodiments, different operations and/or different orders of operations are performed to use the compiled representation.

When an element is to be processed by the active thread (step 1102), the processing subsystem obtains the next relative offset from the compiled representation (step 1104). For this operation, the processing subsystem advances to the next relative offset in the set of relative offsets stored sequentially in the compiled representation and reads that relative offset. For example, the processing subsystem may increment a counter used to determine which relative offset to use and then retrieve the associated relative offset (or vice versa). In this manner, the processing subsystem processes the relative offsets in the compiled representation one at a time, thereby obtaining each available relative offset for processing the corresponding value from input matrix A. The acquisition of the relative offset is shown in FIG. 12, where URO is set equal to the Kth element of the RELO (relative offset) matrix or vector (note that, in some embodiments, the relative offsets are stored in the elements of a matrix or vector).

Note that although only one thread is described with respect to FIG. 11, in some embodiments, the relative offsets are used by the processing subsystem in the same manner for all threads. Thus, the position of each element in input matrix A to be processed by a given thread is calculated using the base address of the given thread and the corresponding one of the relative offsets in the relative offset matrix, vector, etc. As can be seen in FIG. 12, in some embodiments, the position of each element in input matrix A to be processed by a given thread is calculated as a running sum that starts from the original value of the base address of the given thread and adds each relative offset (URO) to the current value of the base address, as described herein.

The processing subsystem then increments the base address of the active thread by the next relative offset (step 1106). During this operation, the processing subsystem advances the base address used to retrieve data to the next location in memory, and thus to the next element in input matrix A, based on the next relative offset. For example, if the previous value of the base address is PV and the relative offset is RO (e.g., 15 elements, 100 bytes, etc.), the base address PV is incremented by RO, i.e., PV + RO is calculated. The increase of the base address is shown in FIG. 12 as AVO being incremented by URO.

The processing subsystem then reads a first value from an element of input matrix A at the location in memory indicated by the base address of the active thread (step 1108). During this operation, the processing subsystem performs a memory read operation using the base address of the active thread (or another value determined using the base address of the active thread), which causes the memory to return the value of the element of input matrix A indicated by the base address of the active thread. FIG. 12 shows this operation as A = SCALAR_BASE_A[AVO]. The processing subsystem also reads a second value from the element of input matrix B at the location in memory associated with that element of input matrix A (step 1110). During this operation, the processing subsystem performs a memory read using an offset within input matrix B that is based on the next relative offset used to obtain the value from input matrix A. For example, if the 13th relative offset is being used by the processing subsystem, then the offset within input matrix B is 13. This is shown in FIG. 12 as B = SCALAR_BASE_B[K] (recall that the Kth relative offset is the one used for acquiring data from input matrix A).

The active thread then uses the first and second values as inputs to a GEMM operation (step 1112). For this operation, the active thread, executing on the underlying processor hardware (e.g., pipeline, stream processor, etc.), performs a corresponding portion of the GEMM operation. For the example in FIG. 12, the GEMM operation is shown as a fused multiply-accumulate (FMA) operation, C += A*B. The active thread then stores the result of the GEMM operation in output matrix C (step 1114). The storage of the result of the GEMM operation is shown in the final if statement of FIG. 12, which includes C[OUTPUT_IDX] = C.
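The following C sketch, modeled loosely on the pseudo code of FIG. 12, gathers steps 1100-1114 into a single per-thread loop. The function and variable names are hypothetical, and the sketch uses the base-address-plus-kth-offset form described for FIG. 6; the pseudo code of FIG. 12 instead accumulates the offsets into a running address, which is equivalent when the stored offsets are deltas between consecutive elements.

    #include <stdint.h>
    #include <stddef.h>

    void gemm_thread(const char *scalar_base_a,   /* byte pointer to matrix A  */
                     const float *scalar_base_b,  /* element pointer, matrix B */
                     float *output_c,             /* output matrix C           */
                     const uint64_t *base_addr_a, /* per-thread base addresses */
                     const uint32_t *relo,        /* relative offsets          */
                     uint32_t num_offsets,
                     uint32_t tid, uint32_t output_idx)
    {
        uint64_t base = base_addr_a[tid];                     /* step 1100      */
        float acc = 0.0f;
        for (uint32_t k = 0; k < num_offsets; k++) {          /* steps 1102-1106 */
            float a = *(const float *)(scalar_base_a + base + relo[k]); /* 1108 */
            float b = scalar_base_b[k];                       /* step 1110      */
            acc += a * b;             /* fused multiply-accumulate, step 1112   */
        }
        output_c[output_idx] = acc;                           /* step 1114      */
    }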

The processing subsystem then returns to step 1102 to determine whether an element remains to be processed. If no elements remain to be processed, the process ends. Otherwise, if more elements remain to be processed, the processing subsystem continues to step 1104.

FIG. 13 presents an example of using a compiled representation in accordance with some embodiments. For the example in FIG. 13, each of four threads on the processing subsystem is used to compute a separate output, each of which is a convolution of the values of elements from a set of elements of input matrix A (A0, A1, etc.) with the values in input matrix B. As can be seen in FIG. 13, the factors and values (described for FIG. 8) are assumed to have corresponding values, e.g., WS = 4, HS = 4, etc. With respect to the values in the compiled representation, the base addresses (START_ADDR_A[]) of the four threads are 0, 1, 4, and 5, and the relative offsets (RELO) are 0, 1, 2, 4, and so on. Further, it is assumed that the elements in input matrix A (A[HS][WS]) store the values 1, 3, 2, 7, etc., and that the elements in input matrix B (B[R][S]) store the values 00, 01, 02, etc. FIG. 14 presents an example of sets of elements (or blocks) of input matrix A, in accordance with some embodiments. As can be seen in FIG. 14, each thread processes a different set of elements from input matrix A, where each set of elements starts in input matrix A at the thread's respective starting address (or, more generally, position), with the elements located at the corresponding relative offsets. In a first set of computations by thread 0 (i.e., the active thread) in FIG. 13, an FMA is computed in which values from input matrix A (e.g., A0[RELO[0]], A0[RELO[1]], etc.) are multiplied by corresponding values from input matrix B (e.g., B[0][0], B[0][1], etc.) and the sum of the products is written/stored to/in C[0][0], a location in output matrix C. In the remaining three sets of computations, for threads 1-3, FMAs are computed in which the corresponding values from input matrix A are multiplied by the corresponding values from input matrix B and the sums of the products are written to/stored at the corresponding locations in output matrix C.
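The following self-contained C example works through a scenario of this shape: four threads convolve a 4x4 input matrix A with a 3x3 filter (matrix B) to produce a 2x2 output matrix C, using the base addresses 0, 1, 4, and 5 and the relative offsets 0, 1, 2, 4, 5, 6, 8, 9, 10 described above. The data values are hypothetical placeholders (only the first few values of A appear in the text, and the values of B are not fully listed), so the printed results are illustrative only.

    #include <stdio.h>

    int main(void)
    {
        /* 4x4 input matrix A, stored row-major; values are placeholders.    */
        const float A[16] = { 1, 3, 2, 7,
                              4, 0, 5, 1,
                              2, 6, 3, 8,
                              9, 1, 4, 2 };
        /* 3x3 filter (matrix B), stored row-major; values are placeholders. */
        const float B[9]  = { 0.0f, 0.1f, 0.2f,
                              0.3f, 0.4f, 0.5f,
                              0.6f, 0.7f, 0.8f };
        const int base[4] = { 0, 1, 4, 5 };                 /* START_ADDR_A[] */
        const int relo[9] = { 0, 1, 2, 4, 5, 6, 8, 9, 10 }; /* RELO[]         */
        float C[4];

        for (int tid = 0; tid < 4; tid++) {  /* one output element per thread */
            float acc = 0.0f;
            for (int k = 0; k < 9; k++)
                acc += A[base[tid] + relo[k]] * B[k];  /* FMA over the block  */
            C[tid] = acc;
        }
        printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
        return 0;
    }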

Note that the operations and values shown in FIGS. 13-14 are presented as examples of operations performed and values used by some embodiments. Operations performed by other embodiments may include different operations, operations performed in a different order, and/or operations performed by different entities or functional blocks. For example, FIGS. 13-14 are described using a convolutional neural network with feature processing elements, filters, etc., but appropriately adjusted operations may be performed for other types of neural networks (such as fully connected neural networks, etc.).

Alternate generation of compiled representations

As noted above, the examples in FIGS. 7-14 are provided to illustrate the generation and use of compiled representations in some embodiments. However, in some other embodiments, different operations, values, etc. are used to generate and/or use the compiled representation. FIG. 15 presents a pseudo-code example of generating a compiled representation according to some embodiments. Note that the operations shown in FIG. 15 are presented as a general example of operations performed by some embodiments. Operations performed by other embodiments may include different operations, operations performed in a different order, and/or operations performed by different entities or functional blocks. For example, FIG. 15 is described using a convolutional neural network with feature processing elements, filters, etc., but appropriately adjusted operations may be performed for other types of neural networks (such as fully connected neural networks, etc.).

For the example in FIG. 15, the factors and values used in the generation of the compiled representation are shown in the upper left corner of the figure. As can be seen, N = batch size, C = input channels, and so on. Although the factors and values have some similarities to those shown in FIG. 8, they are defined and used separately and differently in FIG. 15. As can be seen in FIG. 15, the base address of the compiled representation is calculated as M_BASE = N × C × H × W + W, where the factors N, H, and W are themselves calculated as shown. Further, the relative offset is calculated as K_MEMORY_OFFSET = C × H × W + S × W + R. As shown in FIG. 15, the base address and the relative offset together are used to determine the address in memory where the data is located in input matrix A as the global address M_BASE + K_MEMORY_OFFSET. Note also that the elements of input matrix B are accessed using its base address N_BASE and a corresponding memory offset.
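As a hedged C sketch of this alternative calculation: in the figure's expressions, the letters on the right-hand side stand partly for tensor dimensions and partly for per-thread or per-tap indices computed earlier in the pseudo code; because those earlier calculations are not reproduced here, the helpers below simply take the indices as parameters, and the (h0 × W + w0) term in m_base is one plausible reading of the trailing "+ W" in the figure's M_BASE expression.

    #include <stddef.h>

    /* Flat NCHW element index in input matrix A of the first element read by
     * a thread; roughly M_BASE = N*C*H*W + W from FIG. 15, with n, h0, w0
     * being assumed per-thread indices.                                      */
    size_t m_base(size_t n, size_t h0, size_t w0,
                  size_t C, size_t H, size_t W)
    {
        return n * C * H * W + h0 * W + w0;
    }

    /* Displacement of filter tap (c, s, r), matching
     * K_MEMORY_OFFSET = C*H*W + S*W + R from FIG. 15, with c, s, r being
     * per-tap indices and H, W tensor dimensions.                            */
    size_t k_memory_offset(size_t c, size_t s, size_t r, size_t H, size_t W)
    {
        return c * H * W + s * W + r;
    }

    /* The element read for this tap sits at the global address
     * M_BASE + K_MEMORY_OFFSET.                                              */
    size_t global_addr_a(size_t base, size_t offset)
    {
        return base + offset;
    }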

In some embodiments, at least one electronic device (e.g., electronic device 300) performs some or all of the operations described herein using code and/or data stored on a non-transitory computer-readable storage medium. More specifically, when performing the described operations, at least one electronic device reads code and/or data from the computer-readable storage medium and executes the code and/or uses the data. A computer-readable storage medium may be any device, medium, or combination thereof that stores code and/or data for use by an electronic device. For example, a computer-readable storage medium may include, but is not limited to, volatile and/or nonvolatile memory including flash memory, random access memory (e.g., eDRAM, RAM, SRAM, DRAM, DDR4SDRAM, etc.), nonvolatile RAM (e.g., phase change memory, ferroelectric random access memory, spin-transfer torque random access memory, magnetoresistive random access memory, etc.), Read Only Memory (ROM), and/or magnetic or optical storage media (e.g., disk drives, magnetic tape, CDs, DVDs, etc.).

In some embodiments, one or more hardware modules perform the operations described herein. For example, a hardware module may include, but is not limited to, one or more Central Processing Units (CPUs)/CPU cores, Graphics Processing Units (GPUs)/GPU cores, Application Specific Integrated Circuit (ASIC) chips, Field Programmable Gate Arrays (FPGAs), compressors or encoders, computing units, embedded processors, Accelerated Processing Units (APUs), controllers, neural network processors, and/or other functional blocks. When circuitry (e.g., integrated circuit elements, discrete circuit elements, etc.) in such a hardware module is activated, the circuitry performs some or all of the operations. In some embodiments, a hardware module includes general-purpose circuitry, such as an execution pipeline, a computing or processing unit, etc., that performs operations when executing instructions (program code, firmware, etc.). In some embodiments, a hardware module includes dedicated or specialized circuitry to perform operations, possibly including circuitry to perform some or all of the operations "in hardware" and not to execute instructions.

In some embodiments, data structures representing some or all of the functional blocks and circuit elements described herein (e.g., electronic device 300 or some portion thereof) are stored on a non-transitory computer-readable storage medium including a database or other data structure, which can be read by an electronic device and used, directly or indirectly, to fabricate hardware including the functional blocks and circuit elements. For example, the data structure may be a behavioral level description or a Register Transfer Level (RTL) description of hardware functionality in a high level design language (HDL) such as Verilog or VHDL. The description may be read by a synthesis tool that may synthesize the description to produce a netlist that includes a list of transistors/circuit elements from a synthesis library that represents the functionality of the hardware including the functional blocks and circuit elements described above. The netlist can then be placed and routed to produce a data set describing the geometry to be applied to the mask. The masks may then be used in various semiconductor fabrication steps to produce one or more semiconductor circuits (e.g., integrated circuits) corresponding to the functional blocks and circuit elements described above. Alternatively, the database on the computer-accessible storage medium may be a desired netlist (with or without a synthesis library) or dataset, or Graphic Data System (GDS) II data.

In this description, variables or unspecified values (i.e., general descriptions of values without instances of specific values) are represented by letters such as N, M, and X. As used herein, although similar letters may be used in different locations in the description, the variables and unspecified values need not be the same in each case, i.e., different amounts and values may be intended for some or all of the general variables and unspecified values. In other words, the particular instances of N and any other letters used to represent variables and unspecified values in this description are not necessarily related to one another.

The expression "etc. (et cetera)" or "etc.", as used herein, is intended to present one and/or the case, i.e., the equivalent of "at least one" of the elements in the list associated with the etc. For example, in the sentence "the electronic apparatus performs the first operation, the second operation, and so on", the electronic apparatus performs at least one of the first operation, the second operation, and the other operation. Moreover, the elements in the list associated with and/or the like are merely examples in the example set-and at least some of the examples may not be present in some embodiments.

The foregoing description of the embodiments has been presented for the purposes of illustration and description only. It is not intended to be exhaustive or to limit the embodiments to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Furthermore, the above disclosure is not intended to limit the embodiments. The scope of the embodiments is defined by the appended claims.
