Statically generated compiled representation for processing data in a neural network

Document No.: 169817    Publication date: 2021-10-29

Note: This technology, Statically generated compiled representation for processing data in a neural network, was designed and created by 李修宇 and 杨建� on 2020-04-28. Its main content includes the following: An electronic device includes: a memory storing input matrices A and B; a cache memory; and a processor. The processor generates a compiled representation comprising values for obtaining data from input matrix A when processing an input data instance through the neural network, the values comprising, for each of a plurality of threads, a base address in input matrix A and a relative offset, the relative offset being a distance between elements of input matrix A to be processed by the thread. The processor then stores the compiled representation, including the base address and the relative offset for each thread, in the local cache memory.

1. An electronic device, comprising:

a local cache memory;

a memory storing an input matrix A having values to be used in processing input data instances through a neural network; and

a processor configured to:

generate a compiled representation comprising values for obtaining data from input matrix A when processing an input data instance through the neural network, the values comprising a base address and a relative offset in input matrix A for each of a plurality of threads, the relative offset being a distance between elements of input matrix A to be processed by the thread; and

store the compiled representation comprising the base address and the relative offset for each thread in the local cache memory.

2. The electronic device of claim 1, wherein when generating the compiled representation, the processor is configured to:

calculate the base address in input matrix A for each thread from some or all of a thread identifier (ID) of the thread, the dimensions of input matrix A and/or output matrix C, the properties of the elements of input matrix A and/or output matrix C, and the convolution filter properties.

3. The electronic device of claim 1, wherein when generating the compiled representation, the processor is configured to:

calculate the relative offset from some or all of the dimensions of input matrix A and/or output matrix C, the properties of the elements of input matrix A and/or output matrix C, and the filter properties.

4. The electronic device of claim 1, wherein:

the memory stores an input matrix B having values to be used in processing input data instances through the neural network; and

the processor is further configured to:

process input matrix A using each of the threads, the processing comprising identifying a location in memory using the compiled representation in the local cache memory and using values in elements of input matrix A obtained from the location, together with values from elements at corresponding locations of input matrix B, as inputs to one or more general purpose matrix multiplication (GEMM) operations by the thread.

5. The electronic device of claim 4, wherein in processing the input matrix A using each of the threads, the processor is configured to:

obtain the base address for each thread of input matrix A from the compiled representation of input matrix A in the local cache memory;

while elements of input matrix A remain to be processed, for an active one of the threads:

obtain a next relative offset from the compiled representation;

increase a base address of the active thread by the next relative offset;

read a first value from an element of input matrix A at a location in the memory identified by the base address of the active thread;

read a second value from an element of input matrix B at a memory location associated with the element of input matrix A;

use the first value and the second value as inputs to a GEMM operation; and

store a result of the GEMM operation in an output matrix C in the memory.

6. The electronic device of claim 5, wherein the processor is further configured to use an index used to obtain the relative offset, in combination with a base address of input matrix B, to read the second value from input matrix B.

7. The electronic device of claim 5, wherein, when storing the result of the GEMM operation in an output matrix C, the processor is further configured to:

store each result in the output matrix C at a position in the output matrix C determined based on the position in input matrix A at which the first value was read.

8. The electronic device of claim 4, wherein the values in input matrix A and input matrix B comprise input values and weights, respectively, associated with an input data instance, the input values and weights to be used in processing the input data instance through the neural network.

9. The electronic device of claim 4, wherein input matrix A and input matrix B have dimensions larger than dimensions for the GEMM operation.

10. The electronic device of claim 1, wherein the local cache memory is coupled to processing circuitry through a fast access interface that enables faster access than accesses of the memory.

11. A method for generating a compiled representation in an electronic device, the electronic device comprising: a local cache memory; a memory storing an input matrix A having values to be used in processing input data instances through a neural network; and a processor, the method comprising:

generating, by the processor, a compiled representation comprising values for obtaining data from input matrix A when processing an input data instance through the neural network, the values comprising a base address and a relative offset in input matrix A for each of a plurality of threads, the relative offset being a distance between elements of input matrix A to be processed by the thread; and

storing, by the processor, the compiled representation comprising the base address and the relative offset for each thread in the local cache memory.

12. The method of claim 11, wherein generating the compiled representation comprises:

calculating, by the processor, the base address in input matrix A for each thread according to some or all of a thread Identifier (ID) of the thread, dimensions of input matrix A and/or output matrix C, properties of elements of input matrix A and/or output matrix C, and convolution filter properties.

13. The method of claim 11, wherein generating the compiled representation comprises:

calculating, by the processor, the relative offset from some or all of the dimensions of the input matrix A and/or the output matrix C, properties of the elements of the input matrix A and/or the output matrix C, and filter properties.

14. The method of claim 11, wherein:

the memory stores an input matrix B having values to be used in processing input data instances through the neural network; and

the method further comprises:

processing, by the processor, input matrix A using each of the threads, the processing including identifying a location in memory using the compiled representation in the local cache memory and using values in elements of input matrix A obtained from the location, together with values from elements at corresponding locations of input matrix B, as inputs to one or more general purpose matrix multiplication (GEMM) operations by the thread.

15. The method of claim 14, wherein processing the input matrix A using each of the threads comprises:

obtaining, by the processor, the base address and the relative offset for each thread of input matrix A from the compiled representation of input matrix A in the local cache memory;

while elements of input matrix A remain to be processed, for an active one of the threads:

reading, by the processor, a first value from an element of input matrix A at a location in the memory identified by the base address of the active thread for input matrix A adjusted by the corresponding relative offset;

reading, by the processor, a second value from an element of input matrix B at a memory location associated with the element of input matrix A;

using, by the processor, the first value and the second value as inputs to a GEMM operation; and

storing, by the processor, a result of the GEMM operation in an output matrix C in the memory.

16. The method of claim 15, further comprising:

using, by the processor, the index used to obtain the relative offset, in combination with the base address of input matrix B, to read the second value from input matrix B.

17. The method of claim 15, wherein storing the result of the GEMM operation in an output matrix C comprises:

storing, by the processor, each result at a location in the output matrix C that is determined based on the location in input matrix A at which the first value was read.

18. The method of claim 14, wherein the values in input matrix A and input matrix B comprise input values and weights, respectively, associated with an input data instance to be used in processing the input data instance through the neural network.

19. The method of claim 14, wherein the input matrix A and the input matrix B have dimensions larger than dimensions for the GEMM operation.

20. An electronic device, comprising:

a memory storing an input matrix A and an input matrix B, the input matrix A and the input matrix B having respective values to be used when processing an input data instance through a neural network;

a local cache memory storing a compiled representation comprising values for obtaining data from input matrix A when processing an input data instance through the neural network, the values comprising a base address in input matrix A and a relative offset for each of a plurality of threads, the relative offset being a distance between elements of input matrix A to be processed by the thread; and

a processor configured to:

retrieve the compiled representation from the local cache memory; and

for each of the plurality of threads, process input matrix A by using the compiled representation to identify a location in memory, using values in elements of input matrix A taken from the location, together with values from elements at corresponding locations of input matrix B, as inputs to one or more general matrix multiplication (GEMM) operations by the thread.

21. The electronic device of claim 20, wherein when processing the input matrix A, the processor is configured to:

obtain the base address for each thread of input matrix A from the compiled representation of input matrix A in the local cache memory;

while elements of input matrix A remain to be processed, for an active one of the threads:

obtain a next relative offset from the compiled representation;

increase a base address of the active thread by the next relative offset;

read a first value from an element of input matrix A at a location in the memory identified by the base address of the active thread;

read a second value from an element of input matrix B at a memory location associated with the element of input matrix A;

use the first value and the second value as inputs to a GEMM operation; and

store a result of the GEMM operation in an output matrix C in the memory.

22. The electronic device of claim 21, wherein the processor is further configured to use an index used to obtain the relative offset, in combination with a base address of input matrix B, to read the second value from input matrix B.

23. The electronic device of claim 21, wherein, when storing the result of the GEMM operation in an output matrix C, the processor is further configured to:

store each result in the output matrix C at a position in the output matrix C determined based on the position in input matrix A at which the first value was read.

24. A method for generating a compiled representation in an electronic device, the electronic device comprising: a local cache memory; a memory storing an input matrix A and an input matrix B having respective values to be used in processing an input data instance through a neural network; and a processor, the method comprising:

obtaining, by the processor, the compiled representation from the local cache memory, the compiled representation including values for obtaining data from input matrix A as input data instances are processed through the neural network, the values including a base address in input matrix A and a relative offset for each of a plurality of threads, the relative offset being a distance between elements of input matrix A to be processed by the thread; and

for each thread of the plurality of threads, processing, by the processor, input matrix A by using the compiled representation to identify a location in memory, using values in elements of input matrix A taken from the location, together with values from elements at corresponding locations of input matrix B, as inputs to one or more general matrix multiplication (GEMM) operations by the thread.

25. The method of claim 24, wherein processing the input matrix A comprises:

obtaining, by the processor, the base address for each thread of input matrix A from the compiled representation of input matrix A in the local cache memory;

while elements of input matrix A remain to be processed, for an active one of the threads:

obtaining, by the processor, a next relative offset from the compiled representation;

increasing, by the processor, a base address of the active thread by the next relative offset;

reading, by the processor, a first value from an element of input matrix A at a location in the memory identified by the base address of the active thread;

reading, by the processor, a second value from an element of input matrix B at a memory location associated with the element of input matrix A;

using, by the processor, the first value and the second value as inputs to a GEMM operation; and

storing, by the processor, a result of the GEMM operation in an output matrix C in the memory.

26. The method of claim 25, further comprising using, by the processor, an index used to obtain the relative offset, in combination with a base address of input matrix B, to read the second value from input matrix B.

27. The method of claim 25, wherein storing the result of the GEMM operation in an output matrix C comprises:

storing, by the processor, each result at a location in the output matrix C that is determined based on the location in input matrix A at which the first value was read.

Background

Prior Art

Some electronic devices perform operations for artificial neural networks, or more simply "neural networks". Generally, a neural network is a computational structure that includes internal elements that have similarities to a biological neural network, such as those associated with the brain of a living being. A neural network may be trained to perform a specified task by configuring the internal elements of the neural network using known instances of training data, so that the neural network can then perform the specified task on unknown instances of input data. For example, one task performed by a neural network is to identify whether an image includes an image element such as a face or a vehicle. When training a neural network to perform such image recognition, images known to include (or not to include) the image element are processed through the neural network to configure the internal elements to generate appropriate outputs when an unknown image is subsequently processed to identify whether the image element is present in the unknown image.

One type of neural network is a "fully connected" neural network. A fully connected neural network includes, in its internal elements, a set of artificial neurons or "nodes" interconnected with one another in an arrangement that has some similarity to the way neurons are interconnected by synapses in a biological brain. A fully connected neural network can be visualized as a form of weighted graph structure, where the nodes include input nodes, intermediate (or "hidden") nodes, and output nodes. Fig. 1 presents a block diagram illustrating a fully connected neural network 100, the neural network 100 including input nodes 102, intermediate nodes 104 in layers 110 and 112, output nodes 106, and directed edges 108 (only two directed edges and two layers are labeled for clarity). Within a fully connected neural network, each node except the output nodes 106 is connected to one or more downstream nodes by a directed edge with an associated weight. During operation, the input nodes 102 in a first layer of the fully connected neural network 100 receive inputs from external sources and process the inputs to generate input values. The input nodes 102 forward the input values to the intermediate nodes 104 in the next layer 110 of the fully connected neural network 100. Each receiving intermediate node 104 weights, i.e., adjusts, such as by multiplying, the received input based on the weight of the corresponding directed edge. Each intermediate node 104 sums the corresponding weighted received inputs and a possible bias value to generate an internal value and evaluates an activation function of that intermediate node 104 using the internal value to produce a result value. The intermediate nodes 104 then forward the result values as input values to the intermediate nodes 104 in the next layer 112 of the fully connected neural network 100, where the input values are used to generate internal values and evaluate activation functions, as described above. In this manner, values advance through the intermediate nodes 104 in each layer of the fully connected neural network 100 until the intermediate nodes 104 of the last layer forward result values to the output nodes 106 of the fully connected neural network 100, the output nodes 106 generating outputs for the fully connected neural network 100. Continuing with the example above, the output produced by the output nodes 106, and thus the output produced by the fully connected neural network 100, may be in the form of a number, e.g., between 0 and 1, indicating whether the image is likely or unlikely to include the specified image element.

As described above, the values forwarded along the directed edges between nodes in a fully connected neural network (e.g., fully connected neural network 100) are weighted according to the weights associated with the directed edges. By setting the weights associated with the directed edges during a training process such that desired outputs are generated by the fully connected neural network, the fully connected neural network may be trained to generate desired outputs, such as the above-described identification of image elements in images. When training a fully connected neural network, numerous training data instances having expected outputs are processed through the fully connected neural network to produce actual outputs from the output nodes. Continuing with the example above, the training data instances include digital images that are known to include (or not include) a particular image element, and thus have expected outputs for the fully connected neural network that indicate that the image element is likely present (or not present) in the image. After each training data instance is processed through the fully connected neural network to produce an actual output, an error value or "loss" between the actual output and the corresponding expected output is calculated using mean squared error, log loss, or another algorithm. The loss is then worked back through the fully connected neural network, or "backpropagated" through the fully connected neural network, and used to adjust the weights associated with the directed edges in the fully connected neural network in order to reduce the error for the training data instance. The backpropagation operation adjusts the response of the fully connected neural network for that particular training data instance and all subsequent instances of input data. For example, one backpropagation technique (which may be referred to as gradient descent) involves calculating a gradient of the loss with respect to the weight for each directed edge in the fully connected neural network. Each gradient is then multiplied by a training coefficient or "learning rate" to calculate a weight adjustment value. The weight adjustment value is then used to calculate an updated value for the corresponding weight, e.g., by being added to the existing value of the corresponding weight.

Another type of neural network is a "convolutional" neural network. Fig. 2 presents a block diagram illustrating a convolutional neural network 200. As can be seen in fig. 2, the internal elements of convolutional neural network 200 can be grouped into feature processing element 202 and classification element 204. The feature processing element 202 processes features in an input data instance 216 (e.g., a digital image, a digital audio recording, etc.) in preparation for feature classification in the classification element 204. The feature processing element 202 includes internal elements for convolution, normalization, and pooling. In the convolution 208 internal element, a feature map is generated from the input data instance using a set of filters. The feature map is then normalized (e.g., using a rectified linear unit) in the normalization 210 internal element. After processing in the normalization 210 internal element, the feature map is further processed (e.g., sub-sampled, down-sampled, etc.) in the pooling 212 internal element to generate a reduced-dimension feature map. The flatten 214 internal element next prepares the reduced-dimension feature map from the pooling 212 internal element for input into the fully connected 206 internal element. The classification element 204 includes a fully connected 206 neural network (similar to the fully connected neural network described above) that classifies the input (i.e., the flattened reduced-dimension feature map) as including the specified element (or not) and produces an output 218 that represents the classification. As with a fully connected neural network, backpropagation (e.g., gradient descent, etc.) may be used to train the convolution 208 internal element by adjusting values in the set of filters and possibly other values in the internal elements of the feature processing element 202.

The examples of fully connected neural networks and convolutional neural networks shown in fig. 1 and 2 are simplified. In almost all cases, a fully connected neural network includes more layers and/or nodes. In practice, for many applications, a fully connected neural network may have thousands of nodes arranged in a large number of layers. Furthermore, the feature processing element for the convolutional neural network may have multiple/repeated layers of convolution, normalization, and pooling internal elements. The examples in fig. 1 and 2 are also generic; a fully connected neural network and/or convolutional neural network may include different arrangements of internal elements and/or internal elements not shown in fig. 1-2. In general, processing input data instances through a fully connected neural network or a convolutional neural network involves a large number of internal elements and is therefore computationally intensive.

Due to the arrangement of elements (e.g., feature processing elements, nodes, etc.) in many neural networks, operations such as computing gradients in fully connected neural networks for training, computing internal values for activation functions, and using filters in convolutional neural networks to generate feature maps are typically performed using matrix mathematical operations. In other words, input data, weights, filter values, and the like are arranged in matrices, and the above-described computations are then performed using matrix mathematical operations. In many cases, matrix mathematical operations are performed using general-purpose matrix multiplication (GEMM) operations, which are part of the well-known Basic Linear Algebra Subroutine (BLAS) specification. According to the BLAS specification, GEMM operations require two-dimensional inputs. Because matrices used for neural network computations typically have higher dimensions (i.e., three or more dimensions) and/or have data that can be organized in a number of different ways, such matrices must be converted or translated from the higher-dimensional format stored in memory to the two-dimensional format used for GEMM operations. To this end, designers have developed customized program code, sometimes referred to as "kernels," for handling higher-dimensional input matrices and/or data organization within input matrices. The program code/kernel converts or translates the matrix of values into the two-dimensional format required for the GEMM operation, and then performs the GEMM operation. Developing, maintaining, and properly using customized program code/kernels for neural network processing is time consuming, difficult, and error prone. Furthermore, in the program code/kernel, converting or translating the matrix from a higher-dimensional format to a two-dimensional format for GEMM operations greatly slows down the GEMM operations. This is true because the program code/kernel and GEMM operations must dynamically compute the addresses of the locations in memory where the matrix data is stored. Furthermore, the program code/kernel may be inefficient with respect to memory access locality, particularly between sequential executions of the program code/kernel, which can make the overall process of executing the program code/kernel inefficient. In other words, between multiple sequential executions of the program code/kernel (e.g., by different processing threads in a multithreaded processing subsystem), the program code/kernel may repeatedly load cache lines into the cache memory, and thus overwrite other cache lines, when only a small portion of the data in each loaded cache line is actually used.

Drawings

FIG. 1 presents a block diagram illustrating a fully connected neural network.

FIG. 2 presents a block diagram illustrating a convolutional neural network.

FIG. 3 presents a block diagram illustrating an electronic device according to some embodiments.

FIG. 4 presents a block diagram illustrating a processing subsystem according to some embodiments.

FIG. 5 presents a block diagram illustrating a compiled representation in accordance with some embodiments.

FIG. 6 presents a block diagram illustrating an input matrix stored in a memory according to some embodiments.

FIG. 7 presents a flowchart illustrating a process for generating a compiled representation, in accordance with some embodiments.

FIG. 8 presents keywords describing factors or values used in expressions according to some embodiments.

FIG. 9 presents a pseudo-code example for the computation of a base address of a compiled representation, in accordance with some embodiments.

FIG. 10 presents a pseudo-code example for the calculation of a relative offset of a compiled representation, in accordance with some embodiments.

FIG. 11 presents a flowchart illustrating a process for using a compiled representation, in accordance with some embodiments.

FIG. 12 presents an example of pseudo code using compiled representations, in accordance with some embodiments.

FIG. 13 presents an example of using a compiled representation in accordance with some embodiments.

FIG. 14 presents an example of a set of elements of an input matrix in accordance with some embodiments.

FIG. 15 presents a pseudo-code example of generating a compiled representation according to some embodiments.

Like reference numerals refer to like elements throughout the drawings and the description.

Detailed Description

The following description is presented to enable any person skilled in the art to make and use the described embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the described embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments and applications. Thus, the described embodiments are not limited to the embodiments shown, but are to be accorded the widest scope consistent with the principles and features disclosed herein.

Terminology

In the following description, various terms are used to describe embodiments. The following is a simplified and general description of some of these terms. Note that these terms may have important additional aspects that are not enumerated herein for the sake of clarity and brevity, and thus the description is not intended to limit these terms.

Functional block: A functional block refers to a group, collection, and/or set of one or more interrelated circuit elements (such as integrated circuit elements/circuits, discrete circuit elements/circuits, etc.). Circuit elements are "interrelated" in that the circuit elements share at least one property. For example, interrelated circuit elements may be included in, fabricated on, or otherwise coupled to a particular integrated circuit chip, substrate, circuit board, or portion thereof, may be involved in the performance of a given function (a computation or processing function, a memory function, etc.), may be controlled by a common control element and/or a common clock, and/or the like. Functional blocks may include circuits having any number of circuit elements, from a single circuit element (e.g., a single integrated circuit logic gate or discrete circuit element) to millions or billions of circuit elements (e.g., an integrated circuit memory).

Thread: A thread is an abstract entity that represents or identifies the execution of a portion of program code or a "kernel" for performing tasks on processing hardware in a processing subsystem (e.g., a central processing unit or a graphics processing unit). The processing hardware may include separate circuitry (e.g., registers, flags, etc.) for each of a plurality (and possibly a large number) of supported threads, and thus may store and track the respective operating state of each thread. In some embodiments, one or more of the multiple supported threads are simultaneously active and thus execute corresponding program code. The remaining threads are inactive and thus queued, waiting for availability of processing hardware resources, data returned from memory, etc. In some embodiments, a computational job, such as processing data instances through a neural network, is divided into portions and the portions of the computational job are assigned to individual threads from among a plurality of threads for processing. For example, multiple threads may be used to perform various matrix operations, such as matrix mathematical operations for processing a multidimensional input matrix of input data instances through a neural network.

General matrix multiplication

In the described embodiments, a general matrix multiplication (GEMM) operation is used to compute values when processing input data instances through a neural network. Generally, GEMM operations are implemented according to a specification, such as the well-known Basic Linear Algebra Subroutine (BLAS) specification, and include (or are used for) matrix-matrix operations, such as multiplication, fused multiply-add (FMA), and the like. In some embodiments, the program code for a GEMM subroutine is included in a library, such as a math library, and thus may be called by an operating system, application programs, and other software routines executed by a processing subsystem in the electronic device.

In some embodiments, the subroutine for the GEMM operation is configured to accept a particular format and/or dimensionality of the input matrices. For example, in some embodiments, GEMM operations use, and may require, a specified two-dimensional input format for the input matrices. For example, one well-known version of GEMM in the BLAS specification has the format C = α·op(A)·op(B) + β·C, where α and β are scalars and A, B, and C are matrices, op(A) is an M by K matrix, op(B) is a K by N matrix, and C is an M by N matrix, where K, M, and N are integers.
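As a concrete illustration of this two-dimensional interface, the following sketch calls the single-precision GEMM routine through a CBLAS binding of the BLAS specification; the row-major layout, matrix sizes, and scalar values are illustrative assumptions rather than requirements of the described embodiments.

#include <cblas.h>

/* Computes C = alpha * op(A) * op(B) + beta * C with op() = identity.
 * A is M x K, B is K x N, and C is M x N, all row-major. */
void gemm_example(int M, int N, int K,
                  const float *A, const float *B, float *C)
{
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                alpha, A, K,   /* lda = K for a row-major M x K matrix */
                       B, N,   /* ldb = N for a row-major K x N matrix */
                beta,  C, N);  /* ldc = N for a row-major M x N matrix */
}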

In the described embodiment, the GEMM operation may be decomposed or divided into a plurality of separate sub-operations for execution on the multi-threaded processing subsystem. In these embodiments, some or all of the threads in the set of threads supported by the processing subsystem are responsible for performing one or more sub-operations of the entire GEMM operation. For example, if the GEMM operation is used as part of a Fused Multiply Add (FMA) in which the elements of two matrices are multiplied together and then the results of the multiplications are summed to generate the final result of the FMA, then each separate thread may be used to perform one or more of the GEMM multiplication and FMA addition operations.
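The following is a minimal sketch of such a decomposition, assuming each thread is mapped to a single element of the M by N output matrix and performs the K multiply-accumulate (FMA) sub-operations for that element; the thread-to-element mapping and the parameter names are assumptions for illustration only.

/* One thread's share of a GEMM decomposed across threads: the thread
 * computes a single element of the M x N output via K FMA sub-operations.
 * A is M x K, B is K x N, C is M x N, all row-major. */
void gemm_thread_suboperation(int thread_id, int M, int N, int K,
                              const float *A, const float *B, float *C)
{
    int m = thread_id / N;        /* output row handled by this thread    */
    int n = thread_id % N;        /* output column handled by this thread */
    if (m >= M) return;           /* thread has no assigned output element */

    float acc = 0.0f;
    for (int k = 0; k < K; k++) {
        acc += A[m * K + k] * B[k * N + n];   /* one multiply-add step */
    }
    C[m * N + n] = acc;
}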

Neural network

As described above, a neural network is a computational structure that includes internal elements (e.g., nodes, directed edges, feature processing elements, etc.) that are trained to perform specified tasks using or on input data instances. In the described embodiment, as part of the operation of processing an input data instance through a neural network, a thread performs a GEMM operation to compute a specified value. For example, in some embodiments, a processing thread in the processing subsystem uses input data values from a first input matrix and weight values from a second input matrix as inputs to a GEMM operation for computing internal values of nodes in a fully connected neural network. As another example, in some embodiments, the thread uses values from the input matrix as inputs to the GEMM operation for convolution calculations in the feature processing elements of the convolutional neural network. As described in more detail below, when processing input data instances in a neural network, a compiled representation is used to translate or convert a higher-dimensional input matrix into a format that can be used in a two-dimensional GEMM operation.

Overview

In the described embodiments, an electronic device includes a processing subsystem (e.g., a graphics processing unit or a central processing unit), a memory (e.g., a "main" memory), and a local cache memory for the processing subsystem. The processing subsystem performs operations for processing the input data instance through the neural network using a plurality of threads supported by the processing subsystem. As part of the operation of processing an input data instance through the neural network, the thread performs a generic matrix multiplication (GEMM) operation to calculate a specified value.

As described above, GEMM operations may use, and may require, two-dimensional inputs. However, when processing input data instances in a neural network, the described embodiments may operate on higher-dimensional input matrices (i.e., three or more dimensions). For example, in some embodiments, the input matrix is organized in the well-known four-dimensional NCHW (where N is batch, C is channel, H is height, W is width) or CNHW format, the five-dimensional NCHWVector format, or the like. Because the data in the input matrix is stored in memory according to a given higher-dimensional format, the address in memory where the data to be processed by each thread is located must be determined before the data can be used for GEMM operations. In other words, the higher-dimensional format must be converted or translated into the two-dimensional format used by the GEMM operation. For this operation, the described embodiments generate a "compiled representation" of the data in one of the two input matrices used for the GEMM operation (referred to herein as input matrix A), which is used to determine the locations in memory of the data in input matrix A. Generally, the compiled representation is a record (i.e., a list, index, etc.) that includes information to be used to identify, on a per-thread basis, the locations in memory where the elements in input matrix A are stored. The compiled representation may thus be used to retrieve data of input matrix A from memory for processing in a GEMM operation by each of the threads.
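For reference, the element at position (n, c, h, w) of a four-dimensional NCHW-format matrix sits at a linear element offset that can be computed from the channel, height, and width extents, as in the following sketch (the function name is illustrative):

#include <stddef.h>

/* Linear element offset of element (n, c, h, w) in an NCHW-format matrix
 * with C channels, height H, and width W. */
static inline size_t nchw_index(size_t n, size_t c, size_t h, size_t w,
                                size_t C, size_t H, size_t W)
{
    return ((n * C + c) * H + h) * W + w;
}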

In some embodiments, the information in the compiled representation has a specified relationship with the second input matrix for the GEMM operation, referred to herein as input matrix B. Generally, when processing an input data instance through a neural network, values from input matrix A (e.g., input data) and values from input matrix B (e.g., weights or convolution filter values) are processed together in a GEMM operation. The specified relationship between the compiled representation and input matrix B is such that each location in the compiled representation located at a respective offset from the base address of the compiled representation includes information for determining the address of the corresponding location in input matrix A, while the corresponding element of input matrix B is located at the same offset from the base address of input matrix B. In other words, if location X in the compiled representation includes information for determining the address in memory of a value in input matrix A to be used in the GEMM operation, the matching value in input matrix B will be located at an offset X from the base address of input matrix B. Thus, the compiled representation may not include information about input matrix B (with the processor instead being configured to recognize and use the specified relationship), or may include simple information about input matrix B, such as the base address or location of input matrix B in memory.

In the described embodiments, the compiled representation is generated "statically" and is thus prepared in advance for use in determining the addresses of memory locations during the GEMM operation. As used herein, statically means that the compiled representation is generated at some time before the input data instance is processed using the GEMM operation, and possibly a relatively long time before the input data instance is processed. In some embodiments, the statically generated compiled representation is separately generated and stored in preparation for subsequent processing of input data instances by the neural network.

In some embodiments, the compiled representation includes two types of information that are used in combination to determine the locations in memory of elements in input matrix A. The two types of information are: (1) per-thread base addresses and (2) relative offsets. Each base address is a memory address in input matrix A that serves as a starting point or initial address for the corresponding thread. Each thread has a separate and distinct base address, and thus the compiled representation includes a list of the respective base addresses for the threads. The relative offsets are memory offsets (e.g., numbers of elements, identifiers of elements, bytes, etc.) that are used in conjunction with the base addresses to determine the locations in memory of the elements in input matrix A that each thread will process. For example, if the base addresses of a first thread and a second thread are A1 and A2, and the relative offsets are X, Y, and Z, then the locations in memory of the elements that the first thread will process are A1+X, A1+Y, and A1+Z, and the locations in memory of the elements that the second thread will process are A2+X, A2+Y, and A2+Z.
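The following sketch shows how the two kinds of information combine to form element addresses; the arrays base[] and rel_off[] stand in for the per-thread base addresses (A1, A2, ...) and the shared relative offsets (X, Y, Z) from the example above, offsets are expressed in elements, and all names are illustrative assumptions.

#include <stddef.h>

/* Address of the i-th element of input matrix A to be processed by a given
 * thread: the thread's base address plus the i-th shared relative offset. */
const float *element_for(const float *matrix_a,
                         const size_t *base,      /* one entry per thread  */
                         const size_t *rel_off,   /* shared by all threads */
                         int thread_id, int i)
{
    return matrix_a + base[thread_id] + rel_off[i];
}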

In some embodiments, the compiled representation is generated using a set of values associated with input matrix A, the output matrix (referred to herein as output matrix C), properties of the neural network, and/or other values. For example, in some embodiments, computing the base address of input matrix A for each thread includes computing the base address from some or all of a thread identifier for the thread, dimensions of input matrix A and/or output matrix C, properties of the elements of input matrix A and/or output matrix C, and neural network properties. As another example, in some embodiments, calculating the relative offsets for input matrix A includes calculating the relative offsets from some or all of the dimensions of input matrix A and/or output matrix C, the properties of the elements of input matrix A and/or output matrix C, and the properties of the neural network. Some example expressions for calculating the base addresses and relative offsets are described below.
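As a hedged illustration of the general kind of computation involved (and not the example expressions described below), the following sketch assumes a two-dimensional convolution over an NCHW input with an R by S filter, unit stride, and no padding; each thread handles one output position (n, oh, ow), its base address is the element offset of the top-left corner of that thread's receptive field, and the relative offsets, which are the same for every thread, walk the receptive field across the channels. All parameter and function names are assumptions.

/* Fill per-thread base addresses and the shared relative offsets for an
 * NCHW input of size N x C x H x W, an R x S filter, and an OH x OW output,
 * assuming unit stride and no padding. Offsets are in elements. */
void build_compiled_representation(int N, int C, int H, int W,
                                   int R, int S, int OH, int OW,
                                   size_t *base,      /* N*OH*OW entries */
                                   size_t *rel_off)   /* C*R*S entries   */
{
    /* Per-thread base addresses: thread_id maps to (n, oh, ow). */
    for (int tid = 0; tid < N * OH * OW; tid++) {
        int n  = tid / (OH * OW);
        int oh = (tid / OW) % OH;
        int ow = tid % OW;
        base[tid] = (((size_t)n * C) * H + oh) * W + ow;
    }

    /* Relative offsets, identical for every thread: one per (c, r, s) tap,
     * so that base + offset addresses element (n, c, oh + r, ow + s). */
    int i = 0;
    for (int c = 0; c < C; c++)
        for (int r = 0; r < R; r++)
            for (int s = 0; s < S; s++)
                rel_off[i++] = ((size_t)c * H + r) * W + s;
}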

In some embodiments, processing the input data instance through the neural network includes using the compiled representation to obtain the input data from input matrix A, and separately obtaining the corresponding data from input matrix B. For this operation, the processing subsystem obtains the base address of input matrix A for each thread. Then, while elements of input matrix A remain to be processed, the processing subsystem uses the relative offsets from the compiled representation to obtain the values for each thread to process in the GEMM operation. More specifically, for the next thread, the processing subsystem reads a first value from an element of input matrix A at a location in memory identified by the base address of the next thread adjusted by the corresponding relative offset from the compiled representation. The processing subsystem then reads a second value from the element of input matrix B at the memory location associated with the element of input matrix A (i.e., at a location in input matrix B corresponding to the base address of the next thread adjusted by the corresponding relative offset). The processing subsystem then uses the first and second values as inputs to the GEMM operation of the thread. Next, the processing subsystem stores the result of the GEMM operation in output matrix C in memory.
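A minimal sketch of this per-thread flow, under the same assumptions as the earlier sketches: the thread's base address is adjusted by each relative offset to locate an element of input matrix A, the corresponding element of input matrix B is obtained by simply advancing an index, and each pair of values feeds a multiply-accumulate whose result is written to output matrix C. The mapping of one accumulated result per thread is an assumption for illustration only.

/* Per-thread processing using the compiled representation: base[] holds the
 * per-thread base addresses, rel_off[] the shared relative offsets. */
void process_thread(int thread_id,
                    const float *matrix_a, const float *matrix_b,
                    float *matrix_c,
                    const size_t *base, const size_t *rel_off,
                    int num_offsets)
{
    float acc = 0.0f;
    for (int i = 0; i < num_offsets; i++) {
        /* First value: element of A at the base address adjusted by the
         * corresponding relative offset from the compiled representation. */
        float a_val = matrix_a[base[thread_id] + rel_off[i]];
        /* Second value: corresponding element of B, accessed sequentially
         * from B's base address by incrementing an index. */
        float b_val = matrix_b[i];
        acc += a_val * b_val;                 /* GEMM (FMA) sub-operation */
    }
    matrix_c[thread_id] = acc;                /* store result in output C */
}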

In some embodiments, the compiled representation is organized so that data is obtained from input matrix A and/or input matrix B in a manner that facilitates the reuse of data in cache lines in the local cache memory. In these embodiments, data values used sequentially by two or more threads (i.e., during adjacent sequential GEMM operations) may be fetched from memory together (e.g., in a sequential memory read operation) and stored in respective portions of a single cache line. In this manner, the data is available in the cache for all of the two or more threads, which may help avoid evicting useful data from the cache because multiple cache lines need not be loaded into the cache memory to fulfill the data needs of the two or more threads.

By using a statically generated compiled representation to fetch data from the input matrix A, the described embodiments may avoid calculating address information during the execution of the GEMM operations, which may significantly speed up the GEMM operations. The use of a compiled representation may also mean that the same kernel (or other program code) may be used to perform GEMM operations for different formats and arrangements of input matrices. Simplified, optimized, and general-purpose program code may be used in the kernel, which may speed up the kernel (and thus GEMM operations). The described embodiments may also more efficiently use space/cache lines in the cache memory, which may improve the performance of the cache memory. Improving the performance of the GEMM operations and cache memory when processing input data instances through the neural network may improve the overall performance of the electronic device, which results in increased user satisfaction.

Electronic device

Fig. 3 presents a block diagram illustrating an electronic device 300 according to some embodiments. The electronic device 300 includes a processor 302 and a memory 304. Generally, the processor 302 and the memory 304 are functional blocks implemented in hardware (i.e., using various circuits, circuit elements, and devices). For example, the processor 302 and the memory 304 may be implemented entirely on one or more semiconductor chips, including on one or more separate semiconductor chips, may be implemented using semiconductor chips in combination with discrete circuit elements, may be implemented using discrete circuit elements alone, and so forth. The processor 302 and memory 304 perform operations associated with processing input (or training) data instances through a neural network during training or while performing tasks associated with the neural network (e.g., classification, etc.). For example, the processor 302 and/or the memory 304 perform operations for statically generating a compiled representation of an input matrix that is used for determining the locations of data in the input matrix for subsequent GEMM operations.

The processor 302 is a functional block that performs computations, memory accesses, and other operations in the electronic device 300. The processor 302 includes a processing subsystem 306 and a cache memory 308. The processing subsystem 306 includes one or more functional blocks that perform general-purpose computing, control, and/or other operations, such as a Central Processing Unit (CPU) core, a Graphics Processing Unit (GPU) core, a neural network processor/accelerator, an embedded processor, and/or an Application Specific Integrated Circuit (ASIC). Fig. 4 presents a block diagram illustrating the processing subsystem 306 according to some embodiments. Generally, in fig. 4, the processing subsystem 306 is shown as including some of the same internal functional blocks that may be found in a general-purpose GPU. Although a particular arrangement and number of elements are shown in the processing subsystem 306, in some embodiments a different number and/or arrangement of elements is used in the processing subsystem 306. For example, in some embodiments, the internal functional blocks in the processing subsystem 306 are similar to those that might be found in a CPU. As another example, in some embodiments, the internal functional blocks in the processing subsystem 306 include a combination of functional blocks similar to those found in both CPUs and general-purpose GPUs.

As can be seen in fig. 4, the processing subsystem 306 includes a plurality of compute units 400, each compute unit including a plurality of stream processors 402 and a register file 404 (these are shown only in the topmost compute unit 400, and only three of the stream processors 402 are labeled for clarity). Each stream processor 402 is a functional block that includes processing circuitry (e.g., floating point or integer arithmetic logic units, logic circuitry, etc.) to be used to perform processing operations for one or more processing threads (or, more simply, "threads"). In some embodiments, each compute unit 400 supports multiple threads, e.g., 25, 50, or another number of threads, and the stream processors 402 in each compute unit 400 may be used (e.g., on demand, time-sliced, in a round-robin fashion, etc.) to process the operations of multiple different threads. The register file 404 is a functional block that includes memory circuitry for storing information and data for the stream processors 402. In some embodiments, the register file 404 includes separate copies of the memory circuitry for storing information and data for each of the various threads. The front end 406 is a functional block that performs operations for scheduling threads executing in the stream processors 402; receiving and decoding/interpreting commands, requests, messages, etc. from other entities in the electronic device 300; communicating with other entities in the electronic device 300; and the like. The cache memory 408 is a functional block that includes memory circuitry for storing local copies of instructions, information, and/or data for the compute unit 400 (e.g., copies of instructions, information, and/or data fetched from the cache memory 308, the memory 304, and/or another source). The cache memory 408 is a higher-speed memory in which copies of instructions, information, and/or data (e.g., operands for compute operations, results from operations, instructions, configuration values, and/or other information) are stored so that the instructions, information, and/or data can be accessed faster than data in the cache memory 308 or the memory 304. In some embodiments, the cache memory 408 is located near the processing circuitry in the compute unit 400 that accesses copies of data in the cache memory 408 and is connected to the processing circuitry via a high-speed interface, thereby enabling fast access to stored data. In some embodiments, the cache memory 408 is used to store compiled representations of input matrices to be used as described herein. The memory interface (MEM INT) 410 is a functional block that includes circuitry for accessing (e.g., reading, writing, invalidating, etc.) instructions, information, and data in the cache memory 308, the memory 304, and the cache memory 408.

Cache memory 308 is a functional block that performs operations to store copies of instructions, information, and data for access by processing subsystem 306 and possibly other entities in electronic device 300. Cache memory 308 is a higher speed memory in which copies of data are stored to enable faster access to data than data in memory 304. In some embodiments, the cache memory 308 is located near the processing circuitry in the processing subsystem 306 that accesses copies of data in the cache memory 308 and is connected to the processing circuitry via a high speed interface, thereby enabling fast access to the stored data. In some embodiments, cache memory 308 includes volatile memory circuitry, such as Static Random Access Memory (SRAM) circuitry, for storing data.

The memory 304 is a functional block that performs the operations of a memory (e.g., a "main" memory) in the electronic device 300. The memory 304 includes memory circuitry for storing instructions, information, and data for access by functional blocks in the electronic device 300, and control circuitry for handling accesses of data in the memory circuitry. The memory circuits in memory 304 include volatile memory circuits such as fourth-generation double data rate synchronous dynamic random access memory (DDR4 SDRAM), Static Random Access Memory (SRAM), and/or other types of memory circuits. In some embodiments, copies of data are retrieved from memory 304 and stored in cache memory 308 and cache memory 408 to enable faster access to the copies of the data.

The electronic device 300 is shown to include a particular number and arrangement of elements (e.g., functional blocks and devices such as a processor 302, memory 304, etc.). However, the electronic device 300 is simplified for illustrative purposes. In some embodiments, there are different numbers or arrangements of elements in the electronic device 300. For example, electronic device 300 may include a power subsystem, a human-machine interface system, and the like. As another example, in some embodiments, electronic device 300 includes a mass storage device, such as a disk drive or higher capacity integrated circuit memory, that acts as a non-volatile storage for data in electronic device 300, and from which copies of instructions, information, and data are retrieved for storage in memory 304, cache memory 308, and/or cache memory 408. Generally, electronic device 300 includes sufficient elements to perform the operations described herein.

The electronic device 300 may be or may be included in any electronic device that performs neural network processing using an input matrix. For example, the electronic apparatus 300 may be or may be included in an electronic apparatus, such as a desktop computer, a laptop computer, a wearable electronic apparatus, a tablet computer, a smartphone, a server, an artificial intelligence device, virtual or augmented reality equipment, a network appliance, a toy, audiovisual equipment, a home appliance, a controller, a vehicle, and/or the like, and/or combinations thereof.

Compiled representation

In the described embodiments, a processing subsystem (e.g., processing subsystem 306) generates a compiled representation for an input matrix to be used to process input data instances through a neural network. The compiled representation is a reference, i.e., a list, index, etc., that may be used by the processing subsystem to determine the locations/addresses in memory (e.g., memory 304) of elements in a higher-dimensional input matrix (i.e., three or more dimensions), referred to for convenience as input matrix A, that will be used as inputs for GEMM operations. In other words, the compiled representation serves as a transformation or mapping from the higher-dimensional input matrix A to a two-dimensional input matrix for GEMM operations (i.e., for matrix-matrix multiplication operations, etc.).

FIG. 5 presents a block diagram illustrating a compiled representation 500 according to some embodiments. The compiled representation 500 includes a plurality of base addresses 502. Each base address is or includes an address (or other identifier) of a location in memory in input matrix A that serves as the base address for a respective different thread. The base address of each thread is used as a starting point in memory for determining the addresses or positions of elements in input matrix A from which values are obtained for the thread to process in GEMM operations. For the example in fig. 5, the processing subsystem supports N threads, and thus the compiled representation 500 includes N separate base addresses 502. The compiled representation 500 also includes a plurality of relative offsets 504. Each relative offset 504 is a displacement or offset value that will be used with a base address to determine the address or location of an element in input matrix A from which a value is to be taken.
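One possible in-memory layout for such a compiled representation is sketched below, with the per-thread base addresses 502 followed by the shared list of relative offsets 504; the field names, the counts, and the use of element-granularity offsets are assumptions for illustration.

#include <stddef.h>

#define NUM_THREADS 64   /* illustrative thread count ("N" in FIG. 5)      */
#define NUM_OFFSETS 27   /* illustrative number of relative offsets        */

/* Sketch of a compiled representation: per-thread base addresses followed
 * by relative offsets shared by all threads, both in element units. */
struct compiled_representation {
    size_t base_address[NUM_THREADS];     /* base addresses 502, per thread */
    size_t relative_offset[NUM_OFFSETS];  /* relative offsets 504, shared   */
};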

FIG. 6 presents a block diagram illustrating input matrix A and input matrix B in a memory according to some embodiments. As can be seen in fig. 6, each of the input matrices A and B includes a plurality of elements 600 (only one of which is labeled for clarity), each element 600 storing a value to be used in the GEMM operations for processing an input data instance through the neural network. For example, the elements 600 of input matrix A may store input data values for a node, a feature extraction element, or some other portion of a neural network, and the elements 600 of input matrix B may store weight values, filter values, or other values to be used as inputs to a GEMM operation. For example, the input data may include values from pixels in an image to be classified, audio samples from a sound to be classified, portions of an image to be filtered in a feature extraction element, and so forth. The input matrices A and B are stored in memory according to a format such as NCHW, with the elements and values for the input data instance (or a group thereof) organized to meet the specification of the format.

Using the compiled representation in fig. 5, the base address of thread_0 plus a relative offset is used to determine the elements in input matrix A to be processed by the first thread (i.e., thread_0 or T0) in GEMM operations. For example, as shown in FIG. 6, the first element (i.e., the value therein) that thread_0 will process is the leftmost element in input matrix A, shown as T0+RO0 (i.e., base address thread_0 plus relative offset [0]). Thus, when processing an input data instance through a neural network, thread_0 first takes the value in the leftmost element of input matrix A and processes the value in a GEMM operation. Note that the selection of the "leftmost" element is for the example in FIG. 6 only, and in some embodiments, thread_0 starts from a different location in input matrix A, depending on the format of the data in input matrix A, the identity of the thread or other threads, and so forth. Continuing the example, the second element that thread_0 will process is the element shown as T0+RO1, which is the base address thread_0 plus relative offset [1], and the third element that thread_0 will process is the element shown as T0+RO2, which is the base address thread_0 plus relative offset [2]. In this manner, with the compiled representation 500, for each thread, the described embodiments advance through input matrix A in relative-offset order starting at the base address of the thread, fetching data and processing the data in GEMM operations.

In some embodiments, the compiled representation 500 is arranged such that the elements in input matrix B (i.e., the values therein) to be processed in GEMM operations have a specified relationship with the elements from input matrix A. In general, the relationship is specified such that minimal computation is required to obtain values from elements in input matrix B, e.g., incrementing or otherwise increasing an element pointer for input matrix B. For the example in FIG. 6, the elements of input matrix B are accessed by the thread in a sequential or linear manner starting from the base address of input matrix B. The base address of input matrix B is not a per-thread address as in compiled representation 500, but is the base address in memory of input matrix B itself. In these embodiments, therefore, and as shown in FIG. 6, the element of input matrix B that is obtained for processing with element T0+RO0 of input matrix A is element B0, the element of input matrix B that is obtained for processing with element T0+RO1 of input matrix A is element B1, and the element of input matrix B that is obtained for processing with element T0+RO2 of input matrix A is element B2. Thus, in the example of FIG. 6, as the processing subsystem determines the addresses of elements in input matrix A by advancing through the compiled representation 500, the processing subsystem simply increments the element pointer for input matrix B to obtain a value from the corresponding element of input matrix B.
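For illustration, the following C sketch contrasts the two addressing schemes just described: elements of A are located via the per-thread base address plus the k-th relative offset, while elements of B are simply read in order from B's single base address. The function and parameter names are hypothetical.

    #include <stdint.h>

    void fetch_pair(const char *matrix_a,        /* byte pointer to matrix A    */
                    const float *matrix_b,       /* element pointer to matrix B */
                    const uint64_t *base_addr,   /* per-thread base addresses   */
                    const uint32_t *rel_offset,  /* shared relative offsets     */
                    int tid, int k,
                    float *a_out, float *b_out)
    {
        /* A element: per-thread base address plus the k-th relative offset
         * (both expressed in bytes in this sketch).                          */
        *a_out = *(const float *)(matrix_a + base_addr[tid] + rel_offset[k]);
        /* B element: no per-thread bookkeeping; the index simply tracks k.   */
        *b_out = matrix_b[k];
    }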

In some embodiments, the compiled representation 500 is stored in a cache memory 506, the cache memory 506 being a local cache memory for the processing subsystem (e.g., cache memory 408 in the processing subsystem 306, etc.). Because the compiled representation 500 is stored in the cache memory 506, the base addresses 502 and the relative offsets 504 can be retrieved more quickly (than from, for example, memory 304) and used for determining the addresses of elements in memory.

Process for generating a compiled representation

In the described embodiments, when an input data instance is to be processed through the neural network, a processing subsystem (e.g., processing subsystem 306) generates a compiled representation of the input matrix that includes information to be used to locate elements of the input matrix in memory for GEMM operations. FIG. 7 presents a flowchart illustrating a process for generating a compiled representation, in accordance with some embodiments. Note that the operations illustrated in FIG. 7 are presented as a general example of operations performed by some embodiments. Operations performed by other embodiments may include different operations, operations performed in a different order, and/or operations performed by different entities or functional blocks. For example, FIG. 7 is described using a convolutional neural network with feature processing elements, filters, etc., but appropriately adjusted operations may be performed for other types of neural networks (such as fully connected neural networks, etc.).

For the operations in FIG. 7, it is assumed that input matrix A is stored in a memory (e.g., memory 304) and is thus available for the static generation of a compiled representation. It is also assumed that input matrix A has a plurality of elements (e.g., elements 600) in which data is stored that will be used for processing an input data instance through a neural network. For example, the elements in input matrix A may store data, such as values of pixels or pixel regions in an image, for processing in a neural network that classifies the image. It is further assumed that the data in input matrix A is organized in more than two dimensions, such as in the well-known four-dimensional NCHW (where N is batch, C is channel, H is height, and W is width) or CNHW format, a five-dimensional NCHWVector format, and so on.
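As background for the address calculations that follow, the NCHW format places element (n, c, h, w) at the flat index ((n*C + c)*H + h)*W + w. The following C helper, which is an illustrative sketch rather than code from the embodiments, computes that index.

    #include <stddef.h>

    /* Flat element index of (n, c, h, w) in an NCHW-ordered tensor with C
     * channels, height H, and width W.                                      */
    static inline size_t nchw_index(size_t n, size_t c, size_t h, size_t w,
                                    size_t C, size_t H, size_t W)
    {
        return ((n * C + c) * H + h) * W + w;
    }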

FIG. 8 presents a key describing factors or values used in expressions that describe the operations in FIG. 7 and FIGS. 9-10, in accordance with some embodiments. In general, the factors or values are properties or characteristics of the neural network, of the input data instances (or groups thereof), of the input matrices used to process the input data instances through the neural network, and of the output matrix to be generated when an input data instance is processed through the neural network. As can be seen in FIG. 8, the factors or values include: WI: width of input matrix A; HI: height of input matrix A; R: width of the filter used in the feature processing element; S: height of the filter used in the feature processing element; SU: filter U stride (span) used in the feature processing element; SV: filter V stride (span) used in the feature processing element; DU: filter U dilation used in the feature processing element; DV: filter V dilation used in the feature processing element; WO: width of output matrix C; HO: height of output matrix C; INC: input channels; and ONC: output channels. As described above, these factors and values apply to neural networks having feature processing elements, such as convolutional neural networks, but the principles described with respect to FIGS. 7-10 are similar for the generation of compiled representations for other types of neural networks.

The operations in FIG. 7 begin when the processing subsystem computes a corresponding base address, in input matrix A stored in memory, to be associated with each of a plurality of threads (step 700). For this operation, the processing subsystem computes a base address (e.g., base address 502) for each thread, which is subsequently used to determine the starting point of the elements in input matrix A from which values are obtained for processing by that thread. FIG. 9 presents a pseudo-code example of the base address computation for a compiled representation, in accordance with some embodiments. As can be seen in FIG. 9, when calculating the base addresses, the processing subsystem calculates intermediate values PIX and IBS as respective functions of the thread identifier (TID) of each thread and a combination of the factors and values described with respect to FIG. 8. The processing subsystem then calculates the base address (BASE_ADDR_A[TID]) for each thread using PIX and IBS and a combination of the factors and values described for FIG. 8. Note that SIZEOF(ELEM_TYPE) in the expressions in FIG. 9 is the size of an element in input matrix A (e.g., in units of bytes, etc.). Further, although FIG. 9 is presented as a general example, in some embodiments, different operations and/or different orders of operations are performed to calculate the base addresses.
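Because the exact expressions of FIG. 9 are not reproduced in this text, the following C sketch shows only the kind of calculation involved: the thread identifier is mapped to an output pixel and a batch image (standing in for PIX and IBS), and the byte offset of the first input element of that pixel's receptive field is returned. The mapping, the parameter names, and the layout assumptions (channel-0, top-left element of the receptive field) are assumptions made for illustration.

    #include <stdint.h>
    #include <stddef.h>

    uint64_t base_addr_a(uint32_t tid,             /* thread identifier (TID) */
                         uint32_t WO, uint32_t HO, /* output width / height   */
                         uint32_t WI, uint32_t HI, /* input width / height    */
                         uint32_t SU, uint32_t SV, /* filter strides U / V    */
                         uint32_t INC,             /* input channels          */
                         size_t elem_size)         /* SIZEOF(ELEM_TYPE)       */
    {
        uint32_t pix = tid % (WO * HO);  /* output pixel handled by the thread */
        uint32_t ibs = tid / (WO * HO);  /* batch image handled by the thread  */
        uint32_t ho  = pix / WO;         /* output row                         */
        uint32_t wo  = pix % WO;         /* output column                      */
        /* Element index of the channel-0, top-left input element of the
         * receptive field that produces output pixel (ho, wo) of image ibs.  */
        uint64_t elem = (uint64_t)ibs * INC * HI * WI
                      + (uint64_t)ho * SV * WI
                      + (uint64_t)wo * SU;
        return elem * elem_size;         /* byte offset into input matrix A    */
    }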

In some embodiments, the processing subsystem also stores the base address of a second input matrix, input matrix B, in the compiled representation. As described above, in some of these embodiments, the compiled representation does not include per-thread base address information for input matrix B, and thus the base address is simply the location (e.g., address, pointer, etc.) in memory from which the elements of input matrix B are found in order. However, in some embodiments, the base address of input matrix B is a variable, pointer, or other value managed separately from the compiled representation.

The processing subsystem then calculates relative offsets for the threads, which are the distances between the elements of input matrix A to be processed by a thread (step 702). For this operation, the processing subsystem computes a set of relative offsets (e.g., relative offsets 504), which are subsequently used in combination with the base address of each thread to determine the elements in input matrix A that the thread will process. FIG. 10 presents a pseudo-code example of the relative offset calculation for a compiled representation, according to some embodiments. As can be seen in FIG. 10, when calculating the relative offsets, the processing subsystem uses counters C, V, and U, which are bounded based on certain of the factors and values described for FIG. 8, to calculate each of the relative offsets (RELO). Note that SIZEOF(ELEM_TYPE) in the expressions in FIG. 10 is the size of an element in input matrix A. Further, although FIG. 10 is presented as a general example, in some embodiments, different operations and/or different orders of operations are performed to calculate the relative offsets.
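The following C sketch illustrates one plausible form of this calculation. The loop bounds and the offset expression are assumptions: the counters C, V, and U are taken to range over the input channels (INC), the filter height (S), and the filter width (R), respectively, and the dilations DV and DU and the input dimensions HI and WI from FIG. 8 convert each filter tap into a distance from the thread's base address. The offsets produced here are measured from the base address (matching the description of FIG. 6) rather than between consecutive elements.

    #include <stdint.h>
    #include <stddef.h>

    void compute_rel_offsets(uint32_t *relo,           /* out: INC*S*R offsets */
                             uint32_t INC,             /* input channels       */
                             uint32_t S, uint32_t R,   /* filter height/width  */
                             uint32_t DV, uint32_t DU, /* filter dilations     */
                             uint32_t HI, uint32_t WI, /* input height/width   */
                             size_t elem_size)         /* SIZEOF(ELEM_TYPE)    */
    {
        uint32_t idx = 0;
        for (uint32_t c = 0; c < INC; c++)
            for (uint32_t v = 0; v < S; v++)
                for (uint32_t u = 0; u < R; u++)
                    /* Distance, in bytes, from the thread's base address to
                     * the (c, v, u) tap of its receptive field.              */
                    relo[idx++] = (uint32_t)((c * HI * WI + v * DV * WI + u * DU)
                                             * elem_size);
    }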

In some embodiments, the compiled representation is organized so that data is obtained from input matrix A and/or input matrix B in a manner that facilitates reuse of the data in cache lines in the local cache memory. In these embodiments, data values used by two or more threads sequentially (i.e., during adjacent sequential GEMM operations) or temporally close to one another can be fetched from memory together (e.g., in a sequential memory read operation) and stored in respective portions of a single cache line. In this manner, the data is available in the cache for all of the two or more threads, which can help avoid evicting useful data from the cache because multiple cache lines need not be loaded into the cache memory to satisfy the data needs of the two or more threads.

The processing subsystem then stores the compiled representation of input matrix A, including the base addresses and the relative offsets, into a local cache memory (e.g., cache memory 308 or cache memory 408) (step 704). For this operation, in some embodiments, the processing subsystem stores the compiled representation into the cache memory in a format or arrangement such as that shown in FIG. 5. Generally, the processing subsystem stores the compiled representation into the cache memory for use in future processing of input data instances through the neural network. In some embodiments, in addition to or instead of storing the compiled representation in the cache memory, the processing subsystem stores the compiled representation in a non-volatile memory or transfers the compiled representation (e.g., via a communication network or the like) to another electronic device for storage thereby. In these embodiments, the compiled representation is prepared in advance for future use in processing input data instances through the neural network, possibly including by a processing subsystem in a separate/different electronic device.

In the described embodiments, the compiled representation is generated "statically" and is thus prepared in advance for use in determining the addresses of memory locations during GEMM operations. As used herein, statically means that the compiled representation is generated at some time before the input data instance is processed using GEMM operations, possibly a relatively long time (e.g., hours, days, etc.) before the input data instance is processed. For example, a compiled representation may be generated for the input matrix as part of a data preparation step prior to processing the data itself through the neural network. In other words, the statically generated compiled representation may be separately generated and stored (e.g., in a local cache memory, in memory, etc.) in preparation for subsequent processing of input data instances through the neural network. Static generation of compiled representations may be contrasted with dynamic or real-time computation of the memory addresses of elements in input matrices A and/or B.

In the described embodiments, the compiled representation is configured such that the program code used to perform the GEMM operations, e.g., in the kernel (or other arrangement of program code), is generic to the dimensions and/or format of the data in the input matrix. Thus, the kernel need only perform the simple address calculations described herein (e.g., base address plus relative offset) for using the compiled representation. By using a generic kernel with a compiled representation, many dimensions and/or formats of the input matrix can be used without requiring multiple kernels and/or program code in the kernels to handle different dimensions and/or formats of the input matrix. In some embodiments, a single kernel using a compiled representation may be developed. In some of these embodiments, the program code for statically generating the compiled representation, as described herein, handles various dimensions and/or formats of the input matrix.

Although the particular calculations of base addresses and relative offsets presented in FIGS. 9-10 are described as examples, in some embodiments, different calculations are used. For example, different calculations are used to generate base addresses and/or relative offsets for different dimensions and/or formats of input data, for different types of neural networks, and so on. Generally, in the described embodiments, when generating a compiled representation of an input matrix, base addresses and relative offsets having the properties described herein are computed.

Process for using a compiled representation

In the described embodiments, when an input data instance is processed through the neural network, the processing subsystem uses the compiled representation to obtain data from the input matrix to be used in GEMM operations. FIG. 11 presents a flowchart illustrating a process for using a compiled representation, in accordance with some embodiments. Note that the operations illustrated in FIG. 11 are presented as a general example of operations performed by some embodiments. Operations performed by other embodiments may include different operations, operations performed in a different order, and/or operations performed by different entities or functional blocks. For example, FIG. 11 is described using a convolutional neural network with feature processing elements, filters, etc., but appropriately adjusted operations may be performed for other types of neural networks (such as fully connected neural networks, etc.).

For the operations in FIG. 11, it is assumed that the compiled representation is stored in a local cache memory (e.g., cache memory 308 or cache memory 408) for the processing subsystem and is thus available for locating the elements of input matrix A for GEMM operations. In other words, in some embodiments, the compiled representation was previously generated statically using the process described for FIG. 7 and stored in the cache memory (possibly a relatively long time before the use shown in FIG. 11). Further, it is assumed that both input matrices A and B are stored in a memory (e.g., memory 304). Input matrices A and B each have a plurality of elements (e.g., elements 600) in which data is stored that will be used for processing an input data instance through a neural network. For example, the elements in input matrix A may store data, such as values of pixels or pixel regions in an image, for processing in a neural network that classifies the image, and the elements in input matrix B may store filter values or weights. The data of each of input matrices A and B is organized in more than two dimensions, such as in the well-known four-dimensional NCHW (where N is batch, C is channel, H is height, and W is width) or CNHW format, a five-dimensional NCHWVector format, and so forth.

FIG. 8 presents a key describing factors or values used in expressions that describe the operations in FIGS. 11-12, in accordance with some embodiments. The factors and values shown in FIG. 8 apply to neural networks having feature processing elements, such as convolutional neural networks, but the principles described with respect to FIG. 8 and FIGS. 11-12 are similar for the use of compiled representations with other types of neural networks.

The operations in FIG. 11 begin when the processing subsystem obtains the base address of an active thread from the compiled representation of input matrix A stored in the local cache memory (step 1100). For this operation, the processing subsystem reads the list of base addresses (e.g., base addresses 502) for the set of threads from the compiled representation in the local cache memory to obtain the corresponding base address of the active thread. In some embodiments, the processing subsystem performs this operation when a thread becomes active (i.e., is to process values from elements of input matrix A) and then retains the base address of the active thread (e.g., in a register or memory element). FIG. 12 presents an example of pseudo code for using a compiled representation, in accordance with some embodiments. The first few lines of the pseudo code in FIG. 12 illustrate the acquisition of the base address of an active thread. Note that although FIG. 12 is presented as an example, in some embodiments, different operations and/or different orders of operations are performed to use the compiled representation.

When an element is to be processed by the active thread (step 1102), the processing subsystem obtains the next relative offset from the compiled representation (step 1104). For this operation, the processing subsystem advances to the next relative offset in the set of relative offsets stored sequentially in the compiled representation and reads that relative offset. For example, the processing subsystem may increment a counter used to determine which relative offset to use and then retrieve the associated relative offset (or vice versa). In this manner, the processing subsystem processes the relative offsets in the compiled representation one at a time, thereby obtaining each available relative offset for processing the corresponding value from input matrix A. The acquisition of the relative offset is shown in FIG. 12, where URO is set equal to the Kth element of the RELO (relative offset) matrix or vector (note that, in some embodiments, the relative offsets are stored in the elements of a matrix or vector).

Note that although only one thread is described with respect to FIG. 11, in some embodiments, the relative offsets are used by the processing subsystem in the same manner for all threads. Thus, the position of each element in input matrix A to be processed by a given thread is calculated using the base address of the given thread and the corresponding one of the relative offsets in the relative offset matrix, vector, etc. As can be seen in FIG. 12, in some embodiments, the position of each element in input matrix A to be processed by a given thread is calculated as a running sum that starts from the original value of the base address of the given thread and adds each relative offset (URO) to the current value of the base address, as described herein.

The processing subsystem then increments the base address of the active thread by the next relative offset (step 1106). During this operation, the processing subsystem advances the base address used to retrieve data to the next location in memory, and thus to the next element in input matrix A, based on the next relative offset. For example, if the previous value of the base address is PV and the relative offset is RO (e.g., 15 elements, 100 bytes, etc.), the base address PV is incremented by RO, i.e., PV + RO is calculated. The increase of the base address is shown in FIG. 12 as AVO being incremented by URO.

The processing subsystem then reads a first value from an element of input matrix A at the location in memory indicated by the base address of the active thread (step 1108). During this operation, the processing subsystem performs a memory read operation using the base address of the active thread (or another value determined using the base address of the active thread), which causes the memory to return the value of the element of input matrix A indicated by the base address of the active thread. FIG. 12 shows this operation as A = SCALAR_BASE_A[AVO]. The processing subsystem also reads a second value from the element of input matrix B at the location in memory associated with that element of input matrix A (step 1110). During this operation, the processing subsystem performs a memory read using an offset within input matrix B that is based on the next relative offset used to obtain the value from input matrix A. For example, if the 13th relative offset is being used by the processing subsystem, then the offset within input matrix B is 13. This is shown in FIG. 12 as B = SCALAR_BASE_B[K] (recall that the Kth relative offset is the one used for acquiring data from input matrix A).

The active thread then uses the first and second values as inputs to a GEMM operation (step 1112). For this operation, the active thread, executing on the underlying processor hardware (e.g., pipeline, stream processor, etc.), performs a corresponding portion of the GEMM operation. For the example in FIG. 12, the GEMM operation is shown as a fused multiply-accumulate (FMA) operation, C += A*B. The active thread then stores the result of the GEMM operation in output matrix C (step 1114). The storage of the result of the GEMM operation is shown in the final if statement of FIG. 12, which includes C[OUTPUT_IDX] = C.
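The following C sketch, modeled loosely on the pseudo code of FIG. 12, gathers steps 1100-1114 into a single per-thread loop. The function and variable names are hypothetical, and the sketch uses the base-address-plus-kth-offset form described for FIG. 6; the pseudo code of FIG. 12 instead accumulates the offsets into a running address, which is equivalent when the stored offsets are deltas between consecutive elements.

    #include <stdint.h>
    #include <stddef.h>

    void gemm_thread(const char *scalar_base_a,   /* byte pointer to matrix A  */
                     const float *scalar_base_b,  /* element pointer, matrix B */
                     float *output_c,             /* output matrix C           */
                     const uint64_t *base_addr_a, /* per-thread base addresses */
                     const uint32_t *relo,        /* relative offsets          */
                     uint32_t num_offsets,
                     uint32_t tid, uint32_t output_idx)
    {
        uint64_t base = base_addr_a[tid];                     /* step 1100      */
        float acc = 0.0f;
        for (uint32_t k = 0; k < num_offsets; k++) {          /* steps 1102-1106 */
            float a = *(const float *)(scalar_base_a + base + relo[k]); /* 1108 */
            float b = scalar_base_b[k];                       /* step 1110      */
            acc += a * b;             /* fused multiply-accumulate, step 1112   */
        }
        output_c[output_idx] = acc;                           /* step 1114      */
    }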

The processing subsystem then returns to step 1102 to determine whether an element remains to be processed. If no elements remain to be processed, the process ends. Otherwise, if more elements remain to be processed, the processing subsystem continues to step 1104.

FIG. 13 presents an example of using a compiled representation in accordance with some embodiments. For the example in FIG. 13, each of four threads on the processing subsystem is used to compute a separate output, each of which is a convolution of the values of elements from a set of elements of input matrix A (A0, A1, etc.) with the values in input matrix B. As can be seen in FIG. 13, the factors and values (described for FIG. 8) are assumed to have corresponding values, e.g., WS = 4, HS = 4, etc. With respect to the values in the compiled representation, the base addresses (START_ADDR_A[]) of the four threads are 0, 1, 4, and 5, and the relative offsets (RELO) are 0, 1, 2, 4, and so on. Further, it is assumed that the elements in input matrix A (A[HS][WS]) store the values 1, 3, 2, 7, etc., and that the elements in input matrix B (B[R][S]) store the values 00, 01, 02, etc. FIG. 14 presents an example of sets of elements (or blocks) of input matrix A, in accordance with some embodiments. As can be seen in FIG. 14, each thread processes a different set of elements from input matrix A, where each set of elements starts in input matrix A at the thread's respective starting address (or, more generally, position), with the elements located at the corresponding relative offsets. In a first set of computations by thread 0 (i.e., the active thread) in FIG. 13, an FMA is computed in which values from input matrix A (e.g., A0[RELO[0]], A0[RELO[1]], etc.) are multiplied by corresponding values from input matrix B (e.g., B[0][0], B[0][1], etc.) and the sum of the products is written/stored to/in C[0][0], a location in output matrix C. In the remaining three sets of computations, for threads 1-3, FMAs are computed in which the corresponding values from input matrix A are multiplied by the corresponding values from input matrix B and the sums of the products are written to/stored at the corresponding locations in output matrix C.
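The following self-contained C example works through a scenario of this shape: four threads convolve a 4x4 input matrix A with a 3x3 filter (matrix B) to produce a 2x2 output matrix C, using the base addresses 0, 1, 4, and 5 and the relative offsets 0, 1, 2, 4, 5, 6, 8, 9, 10 described above. The data values are hypothetical placeholders (only the first few values of A appear in the text, and the values of B are not fully listed), so the printed results are illustrative only.

    #include <stdio.h>

    int main(void)
    {
        /* 4x4 input matrix A, stored row-major; values are placeholders.    */
        const float A[16] = { 1, 3, 2, 7,
                              4, 0, 5, 1,
                              2, 6, 3, 8,
                              9, 1, 4, 2 };
        /* 3x3 filter (matrix B), stored row-major; values are placeholders. */
        const float B[9]  = { 0.0f, 0.1f, 0.2f,
                              0.3f, 0.4f, 0.5f,
                              0.6f, 0.7f, 0.8f };
        const int base[4] = { 0, 1, 4, 5 };                 /* START_ADDR_A[] */
        const int relo[9] = { 0, 1, 2, 4, 5, 6, 8, 9, 10 }; /* RELO[]         */
        float C[4];

        for (int tid = 0; tid < 4; tid++) {  /* one output element per thread */
            float acc = 0.0f;
            for (int k = 0; k < 9; k++)
                acc += A[base[tid] + relo[k]] * B[k];  /* FMA over the block  */
            C[tid] = acc;
        }
        printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
        return 0;
    }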

Note that the operations and values shown in FIGS. 13-14 are presented as examples of operations performed and values used by some embodiments. Operations performed by other embodiments may include different operations, operations performed in a different order, and/or operations performed by different entities or functional blocks. For example, FIGS. 13-14 are described using a convolutional neural network with feature processing elements, filters, etc., but appropriately adjusted operations may be performed for other types of neural networks (such as fully connected neural networks, etc.).

Alternate generation of compiled representations

As noted above, the examples in FIGS. 7-14 are provided to illustrate the generation and use of compiled representations in some embodiments. However, in some other embodiments, different operations, values, etc. are used to generate and/or use the compiled representation. FIG. 15 presents a pseudo-code example of generating a compiled representation according to some embodiments. Note that the operations shown in FIG. 15 are presented as a general example of operations performed by some embodiments. Operations performed by other embodiments may include different operations, operations performed in a different order, and/or operations performed by different entities or functional blocks. For example, FIG. 15 is described using a convolutional neural network with feature processing elements, filters, etc., but appropriately adjusted operations may be performed for other types of neural networks (such as fully connected neural networks, etc.).

For the example in FIG. 15, the factors and values used in the generation of the compiled representation are shown in the upper left corner of the figure. As can be seen, N = batch size, C = input channels, and so on. Although the factors and values have some similarities to those shown in FIG. 8, they are defined and used separately and differently in FIG. 15. As can be seen in FIG. 15, the base address of the compiled representation is calculated as M_BASE = N × C × H × W + W, where the factors N, H, and W are themselves calculated as shown. Further, the relative offset is calculated as K_MEMORY_OFFSET = C × H × W + S × W + R. As shown in FIG. 15, the base address and the relative offset together are used to determine the address in memory where the data is located in input matrix A as the global address M_BASE + K_MEMORY_OFFSET. Note also that the elements of input matrix B are accessed using its base address N_BASE and a corresponding memory offset.
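As a hedged C sketch of this alternative calculation: in the figure's expressions, the letters on the right-hand side stand partly for tensor dimensions and partly for per-thread or per-tap indices computed earlier in the pseudo code; because those earlier calculations are not reproduced here, the helpers below simply take the indices as parameters, and the (h0 × W + w0) term in m_base is one plausible reading of the trailing "+ W" in the figure's M_BASE expression.

    #include <stddef.h>

    /* Flat NCHW element index in input matrix A of the first element read by
     * a thread; roughly M_BASE = N*C*H*W + W from FIG. 15, with n, h0, w0
     * being assumed per-thread indices.                                      */
    size_t m_base(size_t n, size_t h0, size_t w0,
                  size_t C, size_t H, size_t W)
    {
        return n * C * H * W + h0 * W + w0;
    }

    /* Displacement of filter tap (c, s, r), matching
     * K_MEMORY_OFFSET = C*H*W + S*W + R from FIG. 15, with c, s, r being
     * per-tap indices and H, W tensor dimensions.                            */
    size_t k_memory_offset(size_t c, size_t s, size_t r, size_t H, size_t W)
    {
        return c * H * W + s * W + r;
    }

    /* The element read for this tap sits at the global address
     * M_BASE + K_MEMORY_OFFSET.                                              */
    size_t global_addr_a(size_t base, size_t offset)
    {
        return base + offset;
    }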

In some embodiments, at least one electronic device (e.g., electronic device 300) performs some or all of the operations described herein using code and/or data stored on a non-transitory computer-readable storage medium. More specifically, when performing the described operations, at least one electronic device reads code and/or data from the computer-readable storage medium and executes the code and/or uses the data. A computer-readable storage medium may be any device, medium, or combination thereof that stores code and/or data for use by an electronic device. For example, a computer-readable storage medium may include, but is not limited to, volatile and/or nonvolatile memory including flash memory, random access memory (e.g., eDRAM, RAM, SRAM, DRAM, DDR4SDRAM, etc.), nonvolatile RAM (e.g., phase change memory, ferroelectric random access memory, spin-transfer torque random access memory, magnetoresistive random access memory, etc.), Read Only Memory (ROM), and/or magnetic or optical storage media (e.g., disk drives, magnetic tape, CDs, DVDs, etc.).

In some embodiments, one or more hardware modules perform the operations described herein. For example, a hardware module may include, but is not limited to, one or more Central Processing Units (CPUs)/CPU cores, Graphics Processing Units (GPUs)/GPU cores, Application Specific Integrated Circuit (ASIC) chips, Field Programmable Gate Arrays (FPGAs), compressors or encoders, computing units, embedded processors, Accelerated Processing Units (APUs), controllers, neural network processors, and/or other functional blocks. When circuitry (e.g., integrated circuit elements, discrete circuit elements, etc.) in such a hardware module is activated, the circuitry performs some or all of the operations. In some embodiments, a hardware module includes general-purpose circuitry, such as an execution pipeline, a computing or processing unit, etc., that performs operations when executing instructions (program code, firmware, etc.). In some embodiments, a hardware module includes dedicated or specialized circuitry to perform operations, possibly including circuitry to perform some or all of the operations "in hardware" and not to execute instructions.

In some embodiments, data structures representing some or all of the functional blocks and circuit elements described herein (e.g., electronic device 300 or some portion thereof) are stored on a non-transitory computer-readable storage medium including a database or other data structure, which can be read by an electronic device and used, directly or indirectly, to fabricate hardware including the functional blocks and circuit elements. For example, the data structure may be a behavioral level description or a Register Transfer Level (RTL) description of hardware functionality in a high level design language (HDL) such as Verilog or VHDL. The description may be read by a synthesis tool that may synthesize the description to produce a netlist that includes a list of transistors/circuit elements from a synthesis library that represents the functionality of the hardware including the functional blocks and circuit elements described above. The netlist can then be placed and routed to produce a data set describing the geometry to be applied to the mask. The masks may then be used in various semiconductor fabrication steps to produce one or more semiconductor circuits (e.g., integrated circuits) corresponding to the functional blocks and circuit elements described above. Alternatively, the database on the computer-accessible storage medium may be a desired netlist (with or without a synthesis library) or dataset, or Graphic Data System (GDS) II data.

In this description, variables or unspecified values (i.e., general descriptions of values without instances of specific values) are represented by letters such as N, M, and X. As used herein, although similar letters may be used in different locations in the description, the variables and unspecified values need not be the same in each case, i.e., different amounts and values may be intended for some or all of the general variables and unspecified values. In other words, the particular instances of N and any other letters used to represent variables and unspecified values in this description are not necessarily related to one another.

The expression "etc. (et cetera)" or "etc.", as used herein, is intended to present one and/or the case, i.e., the equivalent of "at least one" of the elements in the list associated with the etc. For example, in the sentence "the electronic apparatus performs the first operation, the second operation, and so on", the electronic apparatus performs at least one of the first operation, the second operation, and the other operation. Moreover, the elements in the list associated with and/or the like are merely examples in the example set-and at least some of the examples may not be present in some embodiments.

The foregoing description of the embodiments has been presented for the purposes of illustration and description only. It is not intended to be exhaustive or to limit the embodiments to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Furthermore, the above disclosure is not intended to limit the embodiments. The scope of the embodiments is defined by the appended claims.
