Neural network processing device based on systolic array

Document No.: 1525398  Publication date: 2020-02-11

Reading note: This technology, "Neural network processing device based on systolic array" (基于脉动阵列的神经网络处理装置), was designed and created by 高明明, 杨康, and 谷骞 on 2018-08-14. Its main content is as follows: Provided is a neural network processing apparatus, the apparatus including: a systolic array; a control unit for sending a control signal indicating a convolution operation, an average pooling operation, or a maximum pooling operation; each computing unit in the systolic array is configured to receive the control signal, input feature data, weights, and a first intermediate processing result from the previous-stage computing unit along the second dimension of the systolic array; to perform, according to the control signal, the relevant operation on the input feature data, the weights, and the first intermediate processing result to obtain a second intermediate processing result; to send the second intermediate processing result to the next-stage computing unit along the second dimension; and to send the input feature data to the next-stage computing unit along the first dimension. A processing unit is configured to receive the control signal and the second intermediate processing results output by the computing units at the output end of the systolic array, and to perform the relevant operation on those results according to the control signal to obtain the neural network output result. Hardware overhead can thereby be reduced.

1. A neural network processing apparatus, comprising:

a systolic array comprising a plurality of computing units;

a control unit for sending a control signal indicating any one of the following neural network operations: a convolution operation, an average pooling operation, and a maximum pooling operation;

each computing unit in the systolic array is configured to: receive the control signal, input feature data, a weight, and a first intermediate processing result from a previous-stage computing unit along a second dimension of the systolic array; perform, according to the control signal, a neural network operation on the input feature data, the weight, and the first intermediate processing result to obtain a second intermediate processing result; send the second intermediate processing result to a next-stage computing unit along the second dimension of the systolic array; and send the input feature data to a next-stage computing unit along a first dimension of the systolic array;

and a processing unit connected to the computing units at the output end of the systolic array, the processing unit being configured to receive the control signal and the second intermediate processing results output by the computing units at the output end of the systolic array, and to perform, according to the control signal, a neural network operation on the second intermediate processing results to obtain an output result of the neural network.

2. The apparatus of claim 1, wherein the weight has a value of 1 when the control signal indicates a maximum pooling operation or an average pooling operation.

3. The apparatus of claim 1, wherein the distribution of the weights in the systolic array coincides with or mirrors the distribution of the weights in the convolution kernel when the control signal indicates a convolution operation.

4. The apparatus according to any one of claims 1 to 3, further comprising:

a feature data memory for storing the input feature data;

a weight memory for storing weights;

the control unit is configured to load the input feature data in the feature data memory into the systolic array and to load the weights in the weight memory into the systolic array.

5. The apparatus of claim 4, further comprising:

and the other weight memory is used for prestoring the weight of the next convolution operation when the control signal indicates the convolution operation.

6. The apparatus according to any one of claims 1 to 5, wherein each of the computing units comprises:

a weight register for storing weights;

a feature register for storing the input feature data and for forwarding the input feature data to a next-stage computing unit along the first dimension of the systolic array;

a multiplier coupled to the weight register and the feature register, the multiplier configured to output a product of the weight and the input feature data;

an adder connected to the multiplier;

when the control signal indicates a convolution operation or an average pooling operation, the adder is configured to receive the product and a first intermediate processing result from a previous-stage calculation unit along the second dimension, accumulate the product and the first intermediate processing result to obtain a second intermediate processing result, and send the second intermediate processing result to a subsequent-stage calculation unit along the second dimension.

7. The apparatus of claim 6,

the adder is configured to receive the product and the first intermediate processing result and output a subtraction result of the product and the first intermediate processing result when the control signal indicates a maximum pooling operation;

the computing unit further comprises:

and a selector connected to the adder and the multiplier, the selector being configured to: when the control signal indicates the maximum pooling operation, receive the product, the subtraction result, and the first intermediate processing result; select a maximum value from the product and the first intermediate processing result according to the subtraction result; and send a second intermediate processing result to a next-stage computing unit along the second dimension, wherein the second intermediate processing result is the maximum value.

8. The apparatus of claim 6, wherein the computing unit further comprises:

a comparator connected to the multiplier;

when the control signal indicates a maximum pooling operation, the comparator is configured to receive the product and the first intermediate processing result, output a maximum value of the product and the first intermediate processing result, and send a second intermediate processing result to a next-stage computing unit along the second dimension, where the second intermediate processing result is the maximum value.

9. The apparatus according to any one of claims 1 to 8, wherein the processing unit is configured to:

when the control signal indicates the convolution operation, accumulate the received second intermediate processing results to obtain an output result of the convolution operation; or

when the control signal indicates the average pooling operation, average the received second intermediate processing results based on the size of a pooling window to obtain an output result of the average pooling operation; or

when the control signal indicates the maximum pooling operation, compare the received second intermediate processing results to obtain an output result of the maximum pooling operation.

10. The apparatus of any of claims 1 to 9, wherein the first dimension corresponds to a row of the systolic array and the second dimension corresponds to a column of the systolic array.

11. A pooling processing apparatus, comprising:

a systolic array comprising a plurality of computing units;

a control unit for sending a control signal indicating an average pooling operation or a maximum pooling operation;

each computing unit in the systolic array is configured to: receive the control signal, input feature data, and a first intermediate processing result from a previous-stage computing unit along a second dimension of the systolic array; perform, according to the control signal, a pooling operation on the input feature data and the first intermediate processing result to obtain a second intermediate processing result; send the second intermediate processing result to a next-stage computing unit along the second dimension of the systolic array; and send the input feature data to a next-stage computing unit along a first dimension of the systolic array;

and a processing unit connected to the computing units at the output end of the systolic array, the processing unit being configured to receive the control signal and the second intermediate processing results output by the computing units at the output end of the systolic array, and to perform, according to the control signal, a pooling operation on the second intermediate processing results to obtain a pooling output result.

12. The apparatus of claim 11, further comprising:

a feature data memory for storing the input feature data;

the control unit is configured to load the input feature data in the feature data memory into the systolic array.

13. The apparatus according to claim 11 or 12, wherein each of the computing units comprises:

a feature register for storing the input feature data and for forwarding the input feature data to a next-stage computing unit along the first dimension of the systolic array;

an adder connected to the feature register;

and when the control signal indicates an average pooling operation, the adder is configured to receive the input feature data and a first intermediate processing result from a previous-stage computing unit along the second dimension, accumulate the input feature data and the first intermediate processing result to obtain a second intermediate processing result, and send the second intermediate processing result to a next-stage computing unit along the second dimension.

14. The apparatus of claim 13,

the adder is configured to receive the input feature data and the first intermediate processing result, and output a subtraction result of the input feature data and the first intermediate processing result, when the control signal indicates a maximum pooling operation;

the computing unit further comprises:

a selector connected to the feature register and the adder;

when the control signal indicates a maximum pooling operation, the selector is configured to receive the input feature data, the subtraction result, and the first intermediate processing result, select a maximum value from the input feature data and the first intermediate processing result according to the subtraction result, and send a second intermediate processing result to a subsequent-stage computing unit along the second dimension, where the second intermediate processing result is the maximum value.

15. The apparatus of claim 13, wherein the computing unit further comprises:

a comparator connected to the feature register;

when the control signal indicates a maximum pooling operation, the comparator is configured to receive the input feature data and the first intermediate processing result, output a maximum value of the input feature data and the first intermediate processing result, and send a second intermediate processing result to a next-stage computing unit along the second dimension, where the second intermediate processing result is the maximum value.

16. The apparatus according to any one of claims 11 to 15, wherein the processing unit is configured to:

when the control signal indicates the average pooling operation, average the received second intermediate processing results based on the size of a pooling window to obtain an output result of the average pooling operation; or

when the control signal indicates the maximum pooling operation, compare the received second intermediate processing results to obtain an output result of the maximum pooling operation.

17. The apparatus of any of claims 11 to 16, wherein the first dimension corresponds to a row of the systolic array and the second dimension corresponds to a column of the systolic array.

Technical Field

The present application relates to the field of neural networks, and more particularly, to a systolic array-based neural network processing device.

Background

As the complexity of neural networks continues to increase, neural network technology faces problems in practical applications such as high resource occupation, low operation speed, and high energy consumption. Using hardware accelerators instead of traditional software computation has become an effective way to improve the computational efficiency of neural networks; such accelerators are implemented, for example, with general-purpose graphics processors, application-specific processor chips, and field-programmable gate arrays (FPGAs).

In the prior art, one set of independent hardware is designed for the convolution operations in a neural network and another set for the pooling operations; in some cases, separate sets of independent hardware are even designed for the average pooling operation and the maximum pooling operation, respectively.

The existing scheme increases the hardware overhead and power consumption of the neural network to a certain extent.

Disclosure of Invention

The application provides a neural network processing device based on a systolic array, which can effectively reduce hardware overhead and power consumption of a neural network.

In a first aspect, a neural network processing apparatus is provided, the apparatus comprising: a systolic array comprising a plurality of computing units; a control unit for sending a control signal indicating any one of the following neural network operations: a convolution operation, an average pooling operation, and a maximum pooling operation; each computing unit in the systolic array being configured to receive the control signal, input feature data, weights, and a first intermediate processing result from a previous-stage computing unit along a second dimension of the systolic array, to perform, according to the control signal, a neural network operation on the input feature data, the weights, and the first intermediate processing result to obtain a second intermediate processing result, to send the second intermediate processing result to a next-stage computing unit along the second dimension of the systolic array, and to send the input feature data to a next-stage computing unit along a first dimension of the systolic array; and a processing unit connected to the computing units at the output end of the systolic array and configured to receive the control signal and the second intermediate processing results output by those computing units, and to perform, according to the control signal, a neural network operation on the second intermediate processing results to obtain an output result of the neural network.

In a second aspect, a pooling processing apparatus is provided, comprising: a systolic array comprising a plurality of computing units; a control unit for sending a control signal indicating an average pooling operation or a maximum pooling operation; each computing unit in the systolic array being configured to receive the control signal, input feature data, and a first intermediate processing result from a previous-stage computing unit along a second dimension of the systolic array, to perform, according to the control signal, a pooling operation on the input feature data and the first intermediate processing result to obtain a second intermediate processing result, to send the second intermediate processing result to a next-stage computing unit along the second dimension of the systolic array, and to send the input feature data to a next-stage computing unit along a first dimension of the systolic array; and a processing unit connected to the computing units at the output end of the systolic array and configured to receive the control signal and the second intermediate processing results output by those computing units, and to perform, according to the control signal, a pooling operation on the second intermediate processing results to obtain a pooling output result.

In a third aspect, a chip is provided, on which the apparatus provided in the first aspect or the apparatus provided in the second aspect is integrated.

In a fourth aspect, a control method is provided, where the control method is executed by the control unit in the apparatus provided in the first aspect, and the method includes a related flow executed by the control unit in the first aspect.

In a fifth aspect, a control apparatus is provided, which includes a memory for storing instructions and a processor for executing the instructions stored in the memory, and the execution of the instructions stored in the memory causes the processor to execute the method provided in the fourth aspect.

In a sixth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a computer, causes the computer to carry out the method provided in the fourth aspect.

In a seventh aspect, a computer program product is provided that contains instructions which, when executed by a computer, cause the computer to carry out the method provided in the fourth aspect.

Drawings

Fig. 1 is a schematic block diagram of a systolic array-based neural network processing device according to an embodiment of the present application.

Fig. 2 is a schematic flow chart of implementing a neural network operation by using the systolic array-based neural network processing apparatus provided in the embodiment of the present application.

Fig. 3 to 6 are schematic diagrams of internal structures of the computing unit in the embodiment of the present application.

Fig. 7 to 14 are timing diagrams illustrating a convolution operation or an average pooling operation implemented by the neural network processing apparatus according to the embodiment of the present application.

Fig. 15 to fig. 21 are timing diagrams illustrating a maximum pooling operation achieved by the neural network processing apparatus according to the embodiment of the present application.

Fig. 22 is a schematic diagram of implementing a convolution operation for a plurality of convolution kernels by using the neural network processing apparatus provided in the embodiment of the present application.

Fig. 23 is a schematic block diagram of a systolic array-based pooling processing device provided in accordance with another embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.

Fig. 1 is a schematic block diagram of a neural network processing device 100 according to an embodiment of the present application. The apparatus 100 includes a control unit 110, a systolic array 120, and a processing unit 130.

A control unit 110 for sending control signals to the systolic array 120 and the processing unit 130, the control signals indicating any of the following neural network operations: convolution operations, average pooling operations, and maximum pooling operations.

Systolic array 120 includes a plurality of computational cells 200. The internal structure of each computing unit 200 in systolic array 120 is the same, as will be described in detail below.

Each computation unit 200 in the systolic array 120 is configured to receive a control signal, input feature data, weights, and first intermediate processing results from a previous level of computation units along a second dimension of the systolic array; according to the control signal, carrying out related operation of the neural network on the input characteristic data, the weight and the first intermediate processing result to obtain a second intermediate processing result; sending a second intermediate processing result to a next-stage computational unit along a second dimension of the systolic array; the input feature data is sent to a next-stage computational unit along a first dimension of the systolic array.

When the control signal indicates the convolution operation, the neural network operations performed by the computing unit 200 on the input feature data, the weights, and the first intermediate processing result include multiplication and addition. When the control signal indicates the average pooling operation, these operations include addition. When the control signal indicates the maximum pooling operation, these operations include comparison.
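Grouped by control signal, the per-cycle behavior of a single computing unit 200 can be sketched as follows. This is a hedged behavioral model, not the hardware itself; the constant and function names are illustrative assumptions.

```python
# Behavioral sketch of one computing unit's operation per control signal.
# CONV / AVG_POOL / MAX_POOL and compute_unit_step are illustrative names.
CONV, AVG_POOL, MAX_POOL = "conv", "avg_pool", "max_pool"

def compute_unit_step(ctrl, feature, weight, first_intermediate):
    """Combine the inputs into the second intermediate processing result."""
    product = feature * weight            # weight is configured to 1 for pooling
    if ctrl in (CONV, AVG_POOL):          # multiply-and-accumulate path
        return first_intermediate + product
    if ctrl == MAX_POOL:                  # comparison path
        return max(first_intermediate, product)
    raise ValueError(f"unknown control signal: {ctrl}")
```

For example, `compute_unit_step(CONV, 3, 2, 4)` yields 10, while `compute_unit_step(MAX_POOL, 3, 1, 5)` yields 5.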

The first dimension may correspond to the rows of the systolic array and the second dimension to its columns, or vice versa: the first dimension may correspond to the columns and the second dimension to the rows.

For ease of understanding and description, certain embodiments are described below with the first dimension corresponding to rows of the systolic array and the second dimension corresponding to columns of the systolic array. For example, in the embodiment shown in FIG. 1, the first dimension corresponds to a row of systolic array 120 and the second dimension corresponds to a column of systolic array 120.

The descriptions of "left-to-right" and "top-to-bottom" also appear in some embodiments below, where "left-to-right" means left-to-right along a row of the systolic array and "top-to-bottom" means top-to-bottom along a column of the systolic array.

As shown in FIG. 1, the computing units at the input end of the first dimension of systolic array 120 may receive input feature values from a feature data memory 140 external to systolic array 120. The remaining computing units receive input feature values from the previous-stage computing unit along the first dimension.

As shown in FIG. 1, each computing unit in systolic array 120 may receive weights from a weight memory 150 external to systolic array 120.

Alternatively, the input feature data and weights to be loaded into the systolic array 120 are determined by a main processor; the input feature data are then loaded into the systolic array 120 from the feature data memory 140, and the weights are loaded into the systolic array 120 from the weight memory 150. The main processor may be the control unit 110, or another processor in the neural network system that knows the computation rules of the systolic array.

Optionally, the neural network processing device 100 provided by the present application comprises a feature data memory 140 and a weight memory 150.

Optionally, the neural network processing device 100 provided by the present application does not comprise the feature data memory 140 and the weight memory 150, but comprises a communication interface for communicating with the feature data memory 140 and the weight memory 150.

In this embodiment, the neural network processing apparatus 100 may further include a buffer unit, and the buffer unit is configured to buffer data transmitted from the communication interface. That is, the data in the feature data memory 140 and the weight memory 150 passes through the buffer unit and enters the systolic array 120. It will be appreciated that buffering data communicated from the communication interface with the buffer unit helps to ensure the correctness of the timing of the data entering the systolic array 120.

And the processing unit 130 is connected to the computing unit 200 at the output end of the systolic array 120, and the processing unit 130 is configured to receive the control signal and a second intermediate processing result output by the computing unit at the output end of the systolic array, and perform a neural network related operation on the second intermediate processing result according to the control signal to obtain a neural network output result.

The connection between units described herein includes, but is not limited to, a wired communication connection and a wireless communication connection.

The calculation unit at the output of the systolic array 120 refers to the calculation unit in the systolic array 120 that ultimately outputs the intermediate processing result, and for example, in fig. 1, refers to the calculation unit in the last row of the systolic array 120.

For example, a transmit interface of a compute unit at the output of systolic array 120 is communicatively coupled to a receive interface of processing unit 130.

When the control signal indicates the convolution operation, the processing unit 130 performs an accumulation operation on the second intermediate processing result. When the control signal indicates an averaging pooling operation, the processing unit 130 performs an averaging operation on the second intermediate processing result. When the control signal indicates the maximum pooling operation, the processing unit 130 performs a comparison operation on the second intermediate processing result.
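The post-processing performed by processing unit 130 can likewise be sketched behaviorally. This is a hedged illustration; the function name and the `pool_window_size` parameter are assumptions, not from the patent.

```python
# Hedged sketch of processing unit 130's post-processing of the second
# intermediate processing results received from the array's output end.
def post_process(ctrl, intermediates, pool_window_size=None):
    if ctrl == "conv":        # accumulate the partial sums
        return sum(intermediates)
    if ctrl == "avg_pool":    # average over the pooling window
        return sum(intermediates) / pool_window_size
    if ctrl == "max_pool":    # compare to keep the maximum
        return max(intermediates)
    raise ValueError(f"unknown control signal: {ctrl}")
```

For instance, `post_process("avg_pool", [4, 8], pool_window_size=4)` yields 3.0, i.e., the window sum divided by the window size.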

As can be seen from the above description, the control unit 110 controls which type of neural network operation the systolic array 120 and the processing unit 130 perform, i.e., the convolution operation, the average pooling operation, or the maximum pooling operation, so that the apparatus 100 provided by the present application can implement both convolution and pooling operations.

Therefore, according to the scheme provided by the application, the convolution operation and the pooling operation are multiplexed by one set of hardware device, so that the hardware utilization rate can be improved, and the cost is saved.

In the present application, the data forwarded between the computing units of the systolic array includes the input feature data and the intermediate processing results computed by the computing units, but does not include the weights. In other words, for a given neural network operation, the weights are loaded into the computing units and are not forwarded.

It should be understood that the first intermediate processing result and the second intermediate processing result mentioned herein are only for distinguishing the intermediate processing result received and transmitted by the same computing unit, and other descriptions may also be used for distinguishing the intermediate processing result received and transmitted by the same computing unit, which is not limited in this application.

It should also be understood that when a computing unit X is a computing unit of the systolic array input along the first dimension, the computing unit X does not have a previous stage computing unit along the first dimension, i.e., the computing unit X retrieves the input feature data from the external memory. When a computing unit Y is a computing unit of the systolic array input end along the second dimension, the computing unit Y does not have a previous computing unit along the second dimension, that is, the input of the port of the computing unit Y for receiving the first intermediate processing result of the previous computing unit is empty or zero. When a computing unit Z is a computing unit of the output end of the systolic array along the second dimension, the computing unit Z does not have a computing unit at the next stage along the second dimension, and the computing unit Z is connected to the processing unit 130 and is configured to send the obtained second intermediate processing result to the processing unit 130.

It should be appreciated that in implementing the convolution operation using the apparatus 100 provided herein, the weights loaded into the systolic array 120 are the convolution kernels.

Optionally, when the control signal indicates a convolution operation, the distribution of the weights in systolic array 120 is consistent with or mirrors the distribution of the weights in the convolution kernel.

For example, suppose the convolution kernel corresponds to a two-dimensional weight matrix (matrix figure not reproduced). The weights loaded into systolic array 120 may be distributed in systolic array 120 in the same layout as that matrix (matrix figure not reproduced), or in a mirrored layout.

The corresponding weights are read from the weight memory by a main processor (e.g., the control unit 110 or another processor in the neural network system) and loaded into the systolic array 120. The addressing of the weights and the distribution of the weights in the systolic array 120 can be determined according to actual requirements; this is not limited in this application, as long as the convolution or pooling operation can be implemented.
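Since the exact matrices from the original figures are not reproduced here, the following illustrates one plausible reading of a "mirrored" weight layout: the kernel flipped along both dimensions before loading. This is an assumption for illustration, not the patented layout.

```python
# Hypothetical illustration: "mirroring" a kernel matrix by flipping it
# along both dimensions (rows reversed, then each row reversed).
def mirror_layout(kernel):
    return [row[::-1] for row in kernel[::-1]]

kernel = [[1, 2],
          [3, 4]]
# flipping twice restores the original layout
assert mirror_layout(mirror_layout(kernel)) == kernel
```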

A pooling operation acts only on the input feature data and does not involve weights. Therefore, to make the computing unit 200 compatible with both the convolution operation and the pooling operations, the value of each weight is configured to be 1 when the control signal indicates the maximum pooling operation or the average pooling operation.

Optionally, in some embodiments, the apparatus 100 further comprises the following units.

A feature data memory (such as feature data memory 140 shown in FIG. 1) for storing input feature data.

A first weight memory (such as weight memory 150 shown in FIG. 1) and a second weight memory (not shown in FIG. 1). When the control signal indicates the convolution operation, the first weight memory stores the convolution kernel of the current convolution operation, and the second weight memory pre-stores the convolution kernel of the next convolution operation.

The control unit 110 is configured to load the weights in the first weight memory into the computing units in the systolic array 120, i.e. into the computing units at the inputs along the first dimension of the systolic array 120.

The apparatus provided in this embodiment pre-stores the convolution kernel of the next convolution operation while the current convolution operation is being performed, so that as soon as the systolic array completes the current convolution operation, the next convolution operation can start immediately, improving the efficiency of convolution operations.
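The two weight memories behave like a double-buffered ("ping-pong") pair: one feeds the array while the other is pre-loaded. A hedged sketch follows; the class and method names are assumptions, not from the patent.

```python
# Hedged sketch of double-buffered weight memories: while the systolic
# array uses the kernel in the active buffer, the next kernel is written
# into the standby buffer, so back-to-back convolutions avoid a load stall.
class PingPongWeights:
    def __init__(self):
        self.buffers = [None, None]   # first and second weight memories
        self.active = 0               # index of the buffer feeding the array

    def preload(self, next_kernel):
        """Store the next convolution's kernel in the standby buffer."""
        self.buffers[1 - self.active] = next_kernel

    def swap(self):
        """Switch buffers when the current convolution finishes."""
        self.active = 1 - self.active
        return self.buffers[self.active]
```

Typical use: while the array computes with kernel `k1`, call `preload(k2)`; when the operation completes, `swap()` immediately yields `k2` for loading into the array.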

As an example, as shown in FIG. 2, the process of processing a neural network using the systolic array-based device 100 shown in FIG. 1 is as follows.

S210, the control unit 110 determines a neural network operation type, where the neural network operation type is any one of the following operations: convolution operations, average pooling operations, and maximum pooling operations.

S220, the control unit 110 addresses the storage units (such as the feature data memory and the weight memory shown in FIG. 1) according to the neural network operation type, and reads and parses the instruction to be executed next.

S230, the control unit 110 obtains storage addresses from the parsed instruction, obtains the input data (input feature data and weights) from the storage units, and loads the input feature data and the weights into the systolic array.

S240, the control unit 110 sends a control signal to each computing unit 200 in the systolic array 120, the control signal indicating the type of neural network operation.

S250, each computing unit 200 in the systolic array 120 performs a correlation operation on the received input feature data and the weight based on the control signal to obtain a second intermediate processing result, and sends the second intermediate processing result to the computing unit at the next stage along the second dimension. Step S250 will be described in detail below.

S260, the control unit 110 sends a control signal to the processing unit 130, the control signal indicating the neural network operation type.

S270, the processing unit 130 obtains a second intermediate processing result from the computing unit 200 at the output end of the systolic array 120, and performs a correlation operation on the second intermediate processing result based on the control signal to obtain an output result of the neural network. Step S270 will be described in detail below.

Optionally, the output result of the neural network output by the processing unit 130 may be stored in an output storage unit of the neural network system.

Taking one of the calculation units 200 in the systolic array as an example, the step S250 includes the following steps.

When the control signal indicates the convolution operation or the average pooling operation, the calculation unit 200 multiplies the input feature data by the weight to obtain a product, and accumulates the product and the first intermediate processing result to obtain an accumulated result, which is a second intermediate processing result.

When the control signal indicates the maximum pooling operation, the calculation unit 200 compares the input feature data with the first intermediate processing result to obtain the maximum value of the two, which is the second intermediate processing result. Alternatively, the calculation unit 200 multiplies the input feature data by the weight to obtain a product, and compares the product with the first intermediate processing result to obtain the maximum value of the product and the first intermediate processing result, where the maximum value is the second intermediate processing result.

It should be noted that when the control signal indicates the pooling operation (average pooling or maximum pooling), the weight loaded into the systolic array has a value of 1.
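The per-cell behavior described above can be summarized in a short Python sketch. This is an illustrative model only; the function name and the control-signal encodings ("conv", "avg_pool", "max_pool") are hypothetical, not from the device:

```python
def cell_step(ctrl, x, w, acc_in):
    """One update of a systolic-array computing unit.
    x: input feature value; w: weight; acc_in: first intermediate processing
    result from the previous-stage cell along the second dimension.
    Returns the second intermediate processing result passed downward."""
    if ctrl in ("conv", "avg_pool"):
        # multiply-accumulate; for average pooling the loaded weight is 1
        return x * w + acc_in
    if ctrl == "max_pool":
        # weight is fixed to 1, so x * w == x; keep the running maximum
        return max(x * w, acc_in)
    raise ValueError(f"unknown control signal: {ctrl}")
```

For example, `cell_step("conv", 2, 3, 4)` yields 10, while `cell_step("max_pool", 5, 1, 7)` keeps the larger incoming partial result, 7.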

It should be understood that when the computing unit executing step S250 is a computing unit at the output end of the systolic array along the second dimension, the second intermediate processing result is sent to the processing unit 130.

Taking as an example the case where the second intermediate processing results received by the processing unit 130 are intermediate results of the convolution operation corresponding to a certain convolution kernel, or intermediate results of the pooling operation corresponding to a certain pooling window, step S270 specifically includes the following.

When the control signal indicates the convolution operation, the processing unit 130 accumulates the received plurality of second intermediate processing results to obtain a calculation result of a convolution kernel.

When the control signal indicates an average pooling operation, the processing unit 130 averages the received plurality of second intermediate processing results based on the size (k × k) of the pooling window to obtain an average pooling calculation result for one pooling window.

When the control signal indicates the maximum pooling operation, the processing unit 130 compares the received plurality of second intermediate processing results to obtain their maximum value, thereby obtaining the maximum pooling calculation result of one pooling window.
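The final reduction performed by the processing unit over the bottom-row partial results can be sketched as follows; the function name and control-signal strings are illustrative assumptions:

```python
def post_process(ctrl, partials, k=None):
    """Final reduction by the processing unit over the second intermediate
    processing results ("partials") emitted by the systolic array's output row.
    k is the pooling-window side length (pooling window of size k x k)."""
    if ctrl == "conv":
        return sum(partials)             # accumulator path
    if ctrl == "avg_pool":
        return sum(partials) / (k * k)   # accumulator followed by divider
    if ctrl == "max_pool":
        return max(partials)             # comparator path
    raise ValueError(f"unknown control signal: {ctrl}")
```

For a 3 × 3 pooling window, `post_process("avg_pool", [6, 12, 9], k=3)` divides the accumulated sum 27 by 9, giving 3.0.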

In the embodiment of the application, each computing unit in the systolic array adopts the same internal structure. In the present application, the internal structure of the computing unit may have various implementations.

Fig. 3 and 4 are schematic diagrams of an internal structure of the computing unit 200. Fig. 3 and 4 show two operation modes of the computing unit 200 based on the same internal structure.

Fig. 5 and 6 are schematic diagrams of another internal structure of the calculation unit 200. Fig. 5 and 6 show two operation modes of the computing unit 200 based on the same internal structure.

As one implementation, the internal structure of the computing unit 200 is shown in fig. 3 (or fig. 4).

The feature register 201 is used to store input feature data and also to forward the input feature data to a subsequent stage of computational units along the first dimension of the systolic array.

The weight register 202 is used to store weights.

A multiplier 203 is connected to the feature register 201 and the weight register 202, the multiplier 203 being for outputting the product of the weight and the input feature data.

The adder 204 is connected to the multiplier 203.

The comparator 205 is connected to the multiplier 203.

Based on this internal structure, the computing unit 200 has two modes of operation.

Specifically, the computing unit 200 receives the control signal, and when the parsed control signal indicates a convolution operation or an average pooling operation, the first operation mode is turned on, as shown in fig. 3. The adder 204 is configured to receive the product output by the multiplier 203 and a first intermediate processing result from a previous-stage computing unit along the second dimension, accumulate the product and the first intermediate processing result to obtain a second intermediate processing result, and send the second intermediate processing result to a next-stage computing unit along the second dimension. In operation mode one, the circuit in which comparator 205 is located is inactive (e.g., the circuit marked by the dashed line in FIG. 3).

The computing unit 200 receives the control signal, and when the parsed control signal indicates the maximum pooling operation, the second operation mode is activated, as shown in fig. 4. The comparator 205 is configured to receive the product output by the multiplier 203 and the first intermediate processing result from the previous-stage computing unit along the second dimension, output the maximum value of the product and the first intermediate processing result, and send a second intermediate processing result to the next-stage computing unit along the second dimension, where the second intermediate processing result is the maximum value. In the second operation mode, the circuit in which the adder 204 is located is inactive (e.g., the circuit marked by the dashed line in fig. 4).

The computing unit 200 further comprises a control subunit for switching the operating mode of the computing unit 200 in dependence on the control signal.

As another implementation, the internal structure of the computing unit 200 is shown in fig. 5 (or fig. 6).

The feature register 401 is used to store input feature data and also to forward the input feature data to a subsequent stage of computational cells along the first dimension of the systolic array.

The weight register 402 is used to store weights.

A multiplier 403 is connected to the feature register 401 and the weight register 402, the multiplier 403 being used to output the product of the weight and the input feature data.

The adder 404 is connected to the multiplier 403.

The selector 405 is connected to the multiplier 403 and the adder 404.

Based on this internal structure, the computing unit 200 has two modes of operation.

Specifically, the computing unit 200 receives the control signal, and when the parsed control signal indicates a convolution operation or an average pooling operation, the first operation mode is turned on, as shown in fig. 5. The adder 404 is configured to receive the product output by the multiplier 403 and a first intermediate processing result from a previous-stage computing unit along the second dimension, accumulate the product and the first intermediate processing result to obtain a second intermediate processing result, and send the second intermediate processing result to a next-stage computing unit along the second dimension. In operation mode one, the circuit in which selector 405 is located is inactive (e.g., the circuit marked by the dashed line in FIG. 5).

The computing unit 200 receives the control signal, and when the parsed control signal indicates the maximum pooling operation, the second operation mode is activated, as shown in fig. 6. The adder 404 is configured to receive the product output by the multiplier 403 and a first intermediate processing result from a previous-stage computing unit along the second dimension, and output the subtraction result of the product and the first intermediate processing result; the selector 405 is configured to receive the product output by the multiplier 403, the subtraction result output by the adder 404, and the first intermediate processing result, select the maximum of the product and the first intermediate processing result according to the subtraction result, and send a second intermediate processing result to the next-stage computing unit along the second dimension, where the second intermediate processing result is that maximum value. In operation mode two, part of the circuit in which adder 404 is located is inactive (e.g., the circuits marked by dashed lines in fig. 6).
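A minimal sketch of this adder-plus-selector trick, assuming the selector keys off the sign of the subtraction result (the function name is hypothetical):

```python
def max_via_adder_and_selector(product, acc_in):
    """Max selection without a dedicated comparator: the adder is reused as a
    subtractor, and the selector picks an input based on the sign of the
    difference, as in the second operation mode of fig. 6."""
    diff = product - acc_in              # adder 404 acting as a subtractor
    return product if diff >= 0 else acc_in   # selector 405 keys off the sign
```

This reuses the multiply-accumulate datapath's adder, so maximum pooling needs only one extra selector rather than a full comparator.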

For example, the computing unit 200 further comprises a control subunit for switching the operation mode of the computing unit 200 in response to the control signal.

It should be understood that the multipliers mentioned herein may be implemented using multiplication circuits, adders may be implemented using addition circuits, comparators may be implemented using comparison circuits, and selectors may be implemented using selection circuits (e.g., multiplexers).

It should be understood that fig. 3-6 are exemplary only and not limiting. The internal structure of the computing unit 200 may also adopt other implementations, and the present application is not limited in this respect, as long as the computing unit 200 can switch its working mode according to the neural network operation type indicated by the control signal.

The processing unit 130 may switch between the various operating modes depending on the type of neural network operation indicated by the control signal.

When the control signal indicates a convolution operation, the processing unit 130 is configured to accumulate the received second intermediate processing result to obtain an output result of the convolution operation.

When the control signal indicates an average pooling operation, the processing unit 130 is configured to average the received second intermediate processing result based on the size of the pooling window to obtain an output result of the average pooling operation.

When the control signal indicates a maximum pooling operation, the processing unit 130 is configured to compare the received second intermediate processing results to obtain an output result of the maximum pooling operation.

As an example, the internal structure of the processing unit 130 includes: an accumulator, a divider (or multiplier), and a comparator.

The accumulator internal to processing unit 130 operates when the control signal indicates a convolution operation. Specifically, the accumulator is configured to output an accumulated result of the second intermediate processing results of the plurality of calculation units, and obtain an output result of the convolution operation.

The accumulator and divider internal to the processing unit 130 operate when the control signal indicates an average pooling operation. Specifically, the accumulator is configured to output the accumulated result of the second intermediate processing results of the plurality of computing units, and the divider is configured to average the accumulated result based on the size of the pooling window; for example, if the size of the pooling window is k × k, the divider divides the accumulated result by k × k to obtain the output result of the average pooling operation. It should be understood that averaging may also be implemented using a multiplier: for example, where the size of the pooling window is k × k, the multiplier multiplies the accumulated result by 1/(k × k) to obtain the output result of the average pooling operation.

The comparator inside the processing unit 130 operates when the control signal indicates a maximum pooling operation. Specifically, the comparator is configured to output a maximum value of the second intermediate processing results of the plurality of calculation units, obtaining an output result of the maximum pooling operation.

Therefore, according to the scheme provided by the application, the convolution operation and the pooling operation are multiplexed by one set of hardware device, so that the hardware utilization rate can be improved, and the cost is saved.

For better understanding of the solution provided by the present application, the following describes the process of performing the convolution operation, the average pooling operation, and the maximum pooling operation by using the systolic array of 3 × 3, the weight matrix of 3 × 3, and the input feature matrix of 3 × 3 as examples in conjunction with fig. 7 to 21.

As shown in fig. 7, assuming that the systolic array 120 shown in fig. 1 is composed of computing units C00, C01, C02, C10, C11, C12, C20, C21 and C22, the processing unit 130 is connected with the computing units C20, C21 and C22 to obtain the final calculation result from the intermediate calculation results they output. The internal structure of each computing unit may be as shown in figs. 3 and 4, or as shown in figs. 5 and 6.

In this example, the description proceeds with the 3 × 3 weight matrix

    | W11 W12 W13 |
W = | W21 W22 W23 |
    | W31 W32 W33 |

and the 3 × 3 input feature matrix

    | a11 a12 a13 |
a = | a21 a22 a23 |
    | a31 a32 a33 |

In the first case, the control signal indicates a convolution operation.

In this case, the 3 × 3 weight matrix W shown above is the convolution kernel.

The result of the convolution operation of the data block a and the weight W is:

a11*W11 + a12*W12 + a13*W13 + a21*W21 + a22*W22 + a23*W23 + a31*W31 + a32*W32 + a33*W33.
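As a quick numeric sanity check of this element-wise sum of products, with arbitrarily chosen illustrative values:

```python
# Hypothetical 3x3 data block and convolution kernel (values are arbitrary).
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]

# Element-wise multiply-and-sum, exactly as in the formula above.
conv = sum(a[i][j] * w[i][j] for i in range(3) for j in range(3))
# Only the corner and center elements survive: 1 + 3 + 5 + 7 + 9 = 25
```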

as shown in fig. 7 to 14, with the neural network processing device of the present application, the process of performing the loading of the input feature data and the weight and the convolution calculation is as follows.

Referring to FIG. 7, the weight matrix W is loaded into the systolic array in advance.

The left to right flow described in the following embodiments corresponds to the flow along the first dimension (row) of the systolic array described in the above embodiments. The top-to-bottom flow described in the following embodiments corresponds to the flow along the second dimension (column) of the systolic array described in the above embodiments.

In the first cycle (T1), referring to fig. 8, input feature data a11 enters the computing unit C00, where the input feature data a11 is loaded from the left side of the computing unit C00 and flows from left to right. At the end of the first cycle, the calculation result of the computing unit C00 is a11 × W11. The calculation result a11 × W11 of the computing unit C00 flows from top to bottom.

In the second cycle (T2), referring to fig. 9, the input feature data a11 flows rightward into the computing unit C01, and the calculation result a11 × W11 flows downward into the computing unit C10; meanwhile, the input feature data a12 is loaded into the computing unit C00, and the input feature data a21 is loaded into the computing unit C10. At the end of the second cycle, the calculation result of computing unit C00 is a12 × W11, the calculation result of computing unit C01 is a11 × W12, and the calculation result of computing unit C10 is a11 × W11 + a21 × W21. The calculation results of the respective computing units flow from top to bottom.

In the third cycle (T3), referring to fig. 10, input feature data a11 flows rightward into computing unit C02, a12 flows rightward into computing unit C01, a21 flows rightward into computing unit C11, the calculation result a12 × W11 flows downward into computing unit C10, the calculation result a11 × W12 flows downward into computing unit C11, and the calculation result a11 × W11 + a21 × W21 flows downward into computing unit C20. Meanwhile, a13 is loaded into computing unit C00, a22 is loaded into computing unit C10, and a31 is loaded into computing unit C20. At the end of the third cycle, the calculation result of computing unit C00 is a13 × W11, the calculation result of computing unit C01 is a12 × W12, the calculation result of computing unit C02 is a11 × W13, the calculation result of computing unit C10 is a12 × W11 + a22 × W21, the calculation result of computing unit C11 is a11 × W12 + a21 × W22, and the calculation result of computing unit C20 is a11 × W11 + a21 × W21 + a31 × W31. The calculation results of the respective computing units flow from top to bottom.

In the fourth cycle (T4), referring to fig. 11, the input feature data a12 flows rightward into computing unit C02, a13 flows rightward into computing unit C01, a21 flows rightward into computing unit C12, a22 flows rightward into computing unit C11, and a31 flows rightward into computing unit C21. After the third cycle is completed, the calculation results of the computing units C00, C01, C02, C10, and C11 flow downward into the respective next-stage computing units. Meanwhile, a23 is loaded into computing unit C10 and a32 is loaded into computing unit C20. At the end of the fourth cycle, the calculation result of computing unit C01 is a13 × W12, the calculation result of computing unit C02 is a12 × W13, the calculation result of computing unit C10 is a13 × W11 + a23 × W21, the calculation result of computing unit C11 is a12 × W12 + a22 × W22, the calculation result of computing unit C12 is a11 × W13 + a21 × W23, the calculation result of computing unit C20 is a12 × W11 + a22 × W21 + a32 × W31, and the calculation result of computing unit C21 is a11 × W12 + a21 × W22 + a31 × W32. The calculation results of the respective computing units flow from top to bottom.

In the fifth cycle (T5), referring to fig. 12, a13 flows rightward into computing unit C02, a22 flows rightward into computing unit C12, a23 flows rightward into computing unit C11, a31 flows rightward into computing unit C22, and a32 flows rightward into computing unit C21. After the fourth cycle is completed, the calculation results of the computing units C01, C02, C10, C11, and C12 flow downward into the respective next-stage computing units. At the same time, a33 is loaded into computing unit C20. At the end of the fifth cycle, the calculation result of computing unit C02 is a13 × W13, the calculation result of computing unit C11 is a13 × W12 + a23 × W22, the calculation result of computing unit C12 is a12 × W13 + a22 × W23, the calculation result of computing unit C20 is a13 × W11 + a23 × W21 + a33 × W31, the calculation result of computing unit C21 is a12 × W12 + a22 × W22 + a32 × W32, and the calculation result of computing unit C22 is a11 × W13 + a21 × W23 + a31 × W33. The calculation results of the respective computing units flow from top to bottom.

In the sixth cycle (T6), referring to fig. 13, a23 flows rightward into computing unit C12, a32 flows rightward into computing unit C22, and a33 flows rightward into computing unit C21. After the fifth cycle, the calculation results of the computing units C02, C11, and C12 flow downward into the respective next-stage computing units. At the end of the sixth cycle, the calculation result of computing unit C12 is a13 × W13 + a23 × W23, the calculation result of computing unit C21 is a13 × W12 + a23 × W22 + a33 × W32, and the calculation result of computing unit C22 is a12 × W13 + a22 × W23 + a32 × W33. The calculation results of the respective computing units flow from top to bottom.

In the seventh cycle (T7), referring to fig. 14, a33 flows rightward into computing unit C22, and the calculation result a13 × W13 + a23 × W23 flows downward into computing unit C22. At the end of the seventh cycle, the calculation result of computing unit C22 is a13 × W13 + a23 × W23 + a33 × W33, and the calculation results of the respective computing units flow from top to bottom.

It can be understood that the accumulated value of the calculation result a11 × W11 + a21 × W21 + a31 × W31 of the computing unit C20 at the end of the third cycle, the calculation result a12 × W12 + a22 × W22 + a32 × W32 of the computing unit C21 at the end of the fifth cycle, and the calculation result a13 × W13 + a23 × W23 + a33 × W33 of the computing unit C22 at the end of the seventh cycle is the result of the convolution operation of the data block a and the weight W.

The processing unit 130 is configured to receive the intermediate calculation results output by the computing units C20, C21, and C22, and accumulate the calculation result of the computing unit C20 at the end of the third cycle, the calculation result of the computing unit C21 at the end of the fifth cycle, and the calculation result of the computing unit C22 at the end of the seventh cycle to obtain the result of the convolution operation of the data block a and the weight W.
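The seven-cycle schedule above can be reproduced in a small cycle-level simulation. This is an illustrative sketch under the assumptions stated in the walkthrough (features enter each array row with a one-cycle skew, partial sums flow downward, and the bottom-row results are tapped at T3, T5, and T7 for C20, C21, and C22 respectively); it is not the device's implementation:

```python
def systolic_conv3x3(a, w):
    """Cycle-level sketch of the 3x3 systolic convolution walkthrough:
    features flow left-to-right, partial sums flow top-to-bottom, and the
    processing unit sums the bottom-row taps C20@T3, C21@T5, C22@T7."""
    n = 3
    feat = [[None] * n for _ in range(n)]   # feature value held in each cell
    res = [[0] * n for _ in range(n)]       # each cell's result this cycle
    taps = []                               # bottom-row results kept by the processing unit
    for t in range(1, 8):                   # cycles T1..T7
        prev_feat = [row[:] for row in feat]
        prev_res = [row[:] for row in res]
        for r in range(n):
            for c in range(n):
                if c == 0:
                    k = t - 1 - r           # row r is fed with a one-cycle skew
                    x = a[r][k] if 0 <= k < n else None
                else:
                    x = prev_feat[r][c - 1]  # feature from the left neighbour
                feat[r][c] = x
                above = prev_res[r - 1][c] if r > 0 else 0
                res[r][c] = x * w[r][c] + above if x is not None else 0
        for c in range(n):
            if t == 3 + 2 * c:              # tap C20 at T3, C21 at T5, C22 at T7
                taps.append(res[n - 1][c])
    return sum(taps)

# The tapped column partial sums add up to the full 3x3 convolution.
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
assert systolic_conv3x3(a, w) == sum(
    a[i][j] * w[i][j] for i in range(3) for j in range(3))
```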

In the second case, the control signal indicates an average pooling operation.

In this case, the 3 × 3 weight matrix indicates that the size of the pooling window is 3 × 3; here, every element of the so-called weight matrix has the value 1.

The calculation result of the average pooling operation on the data block a should be: (a11 + a12 + a13 + a21 + a22 + a23 + a31 + a32 + a33)/9.
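As a hypothetical numeric check of this formula (values chosen arbitrarily), decomposed the way the array computes it, with one partial sum per column followed by a final division:

```python
# Hypothetical 3x3 data block (arbitrary values).
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# With all weights equal to 1, the bottom row emits one partial sum per column.
col_sums = [sum(a[r][c] for r in range(3)) for c in range(3)]  # [12, 15, 18]

# The processing unit accumulates the partials and divides by the window size.
avg = sum(col_sums) / (3 * 3)  # (1 + 2 + ... + 9) / 9 = 5.0
```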

With the neural network processing device of the present application, the process of loading the input feature data and the weights and performing the average pooling calculation is similar to the convolution process described above, except that every element of the weight matrix has the value 1, and the processing unit 130 averages the calculation result of the computing unit C20 at the end of the third cycle, the calculation result of the computing unit C21 at the end of the fifth cycle, and the calculation result of the computing unit C22 at the end of the seventh cycle based on the 3 × 3 pooling window, to obtain the calculation result of the average pooling operation on the data block a: (a11 + a12 + a13 + a21 + a22 + a23 + a31 + a32 + a33)/9.

In a third case, the control signal indicates maximum pooling operation.

In this case, the 3 × 3 weight matrix represents a pooling window of size 3 × 3, and every element of the weight matrix has the value 1.

The calculation result of the maximum pooling operation on the data block a should be: Max(a11, a12, a13, a21, a22, a23, a31, a32, a33), where Max() denotes taking the maximum value.
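As a hypothetical numeric check (arbitrary values), decomposed as the array computes it, with one running maximum per column followed by a final comparison:

```python
# Hypothetical 3x3 data block (arbitrary values).
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# The bottom row emits one running maximum per column.
col_max = [max(a[r][c] for r in range(3)) for c in range(3)]  # [7, 8, 9]

# The processing unit's comparator reduces the column maxima to one value.
result = max(col_max)  # Max of all nine elements
```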

As shown in fig. 15 to 21, with the neural network processing device of the present application, the process of performing the loading of the input feature data and the weight and the maximum pooling calculation is as follows.

The weight matrix (every element of which is 1) is loaded into the systolic array in advance (not shown in figs. 15 to 21).

In the first cycle (T1), referring to fig. 15, input feature data a11 enters the computing unit C00, where the input feature data a11 is loaded from the left side of the computing unit C00 and flows from left to right. At the end of the first cycle, the calculation result of the computing unit C00 is Max(a11), i.e., a11. The calculation result Max(a11) of the computing unit C00 flows from top to bottom.

In the second cycle (T2), see fig. 16. The loading and flowing of the input feature data and the flowing of the calculation results are similar to fig. 9; for the specific description, please refer to the description above in conjunction with fig. 9, which for brevity is not repeated here. At the end of the second cycle, the calculation result of computing unit C00 is Max(a12), the calculation result of computing unit C01 is Max(a11), and the calculation result of computing unit C10 is Max(a11, a21). The calculation results of the respective computing units flow from top to bottom.

In the third cycle (T3), see fig. 17. The loading and flowing of the input feature data and the flowing of the calculation results are similar to fig. 10; for brevity, the description is not repeated here. At the end of the third cycle, the calculation result of computing unit C00 is Max(a13), the calculation result of computing unit C01 is Max(a12), the calculation result of computing unit C02 is Max(a11), the calculation result of computing unit C10 is Max(a12, a22), the calculation result of computing unit C11 is Max(a11, a21), and the calculation result of computing unit C20 is Max(a11, a21, a31). The calculation results of the respective computing units flow from top to bottom.

In the fourth cycle (T4), see fig. 18. The loading and flowing of the input feature data and the flowing of the calculation results are similar to fig. 11; for brevity, the description is not repeated here. At the end of the fourth cycle, the calculation result of computing unit C01 is Max(a13), the calculation result of computing unit C02 is Max(a12), the calculation result of computing unit C10 is Max(a13, a23), the calculation result of computing unit C11 is Max(a12, a22), the calculation result of computing unit C12 is Max(a11, a21), the calculation result of computing unit C20 is Max(a12, a22, a32), and the calculation result of computing unit C21 is Max(a11, a21, a31). The calculation results of the respective computing units flow from top to bottom.

In the fifth cycle (T5), see fig. 19. The loading and flowing of the input feature data and the flowing of the calculation results are similar to fig. 12; for brevity, the description is not repeated here. At the end of the fifth cycle, the calculation result of computing unit C02 is Max(a13), the calculation result of computing unit C11 is Max(a13, a23), the calculation result of computing unit C12 is Max(a12, a22), the calculation result of computing unit C20 is Max(a13, a23, a33), the calculation result of computing unit C21 is Max(a12, a22, a32), and the calculation result of computing unit C22 is Max(a11, a21, a31). The calculation results of the respective computing units flow from top to bottom.

In the sixth cycle (T6), see fig. 20. The loading and flowing of the input feature data and the flowing of the calculation results are similar to fig. 13; for brevity, the description is not repeated here. At the end of the sixth cycle, the calculation result of computing unit C12 is Max(a13, a23), the calculation result of computing unit C21 is Max(a13, a23, a33), and the calculation result of computing unit C22 is Max(a12, a22, a32). The calculation results of the respective computing units flow from top to bottom.

In the seventh cycle (T7), see fig. 21. The loading and flowing of the input feature data and the flowing of the calculation results are similar to fig. 14; for brevity, the description is not repeated here. At the end of the seventh cycle, the calculation result of computing unit C22 is Max(a13, a23, a33), and the calculation results of the respective computing units flow from top to bottom.

It can be seen that the maximum value of the calculation result Max(a11, a21, a31) of the computing unit C20 at the end of the third cycle, the calculation result Max(a12, a22, a32) of the computing unit C21 at the end of the fifth cycle, and the calculation result Max(a13, a23, a33) of the computing unit C22 at the end of the seventh cycle is the result of the maximum pooling operation on the data block a.

The processing unit 130 is configured to receive the intermediate calculation results output by the computing units C20, C21, and C22, and compare the calculation result of the computing unit C20 at the end of the third cycle, the calculation result of the computing unit C21 at the end of the fifth cycle, and the calculation result of the computing unit C22 at the end of the seventh cycle, taking their maximum value to obtain the result of the maximum pooling operation on the data block a.

As can be seen from the above, the neural network processing apparatus based on systolic array provided in the present application can implement both convolution operation and pooling operation (average pooling and maximum pooling). Specifically, the convolution operation and the pooling operation may share a set of data channels, and the same systolic array and processing unit may implement both the convolution operation and the pooling operation by indicating and/or modifying the weight values through the control signal. Therefore, in the application, the pooling operation and the convolution operation are multiplexed into one hardware device, so that the hardware utilization rate is improved, and the cost is saved.

It should be understood that the above-described fig. 7-21 are only examples and are not limiting. In practice, the size of the convolution kernel or pooling window is not necessarily 3 x 3; the dimensions of the input feature matrix (i.e., the input feature map) are not necessarily two-dimensional, but may be three-dimensional; the dimensionality of the convolution kernel may also be 3-dimensional. In addition, in practical applications, in one convolution operation, it is possible to perform calculation for a plurality of 2-dimensional convolution kernels, or it is possible to perform calculation for a plurality of 3-dimensional convolution kernels. The neural network processing device provided by the application can also realize the neural network operation under the above scene.

For example, as shown in fig. 22, the neural network processing apparatus provided in the present application is used to perform a convolution operation for 9 convolution kernels of 3 × 3.

The weight mapping mode is as follows: and (6) tiling and mapping.

As shown in fig. 22, a systolic array of size 9 × 9 can map exactly nine 3 × 3 filter weight matrices, and each weight matrix is mapped directly without flipping.

The input feature data in the input feature map is loaded into the systolic array as follows: each row of input feature data in the input feature map is fed in sequence into a row of the systolic array. In each cycle, one input feature value per row enters the leftmost column of computing units (MACs) of the systolic array, while each computing unit propagates its stored input feature data rightward into the adjacent computing unit. Thus, within a row, each computing unit receives input feature data from the computing unit to its left; and between two adjacent rows, the lower row always lags the upper row by one cycle.
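The staggered (skewed) feeding just described can be sketched as a hypothetical schedule table; the function name and cycle numbering are illustrative only:

```python
def feed_schedule(n_rows, n_cols):
    """Map (array row, cycle) -> which input feature element that row injects
    at the array's left edge. Row r is delayed by r cycles relative to row 0,
    so adjacent rows are skewed by exactly one cycle."""
    sched = {}
    for r in range(n_rows):
        for t in range(n_cols):
            sched[(r, t + r)] = (r, t)   # at cycle t + r, row r injects a[r][t]
    return sched

# For a 3x3 input: row 0 starts at cycle 0, row 1 at cycle 1, row 2 at cycle 2.
sched = feed_schedule(3, 3)
```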

For brevity, the calculation timing can be referred to fig. 7 to 14 described above, and is not described again here.

As shown in fig. 23, the present embodiment also provides a pooling processing device 2300 based on a systolic array. The apparatus 2300 is configured to perform either average pooling or maximum pooling. The device 2300 includes a control unit 2310, a systolic array 2320, and a processing unit 2330.

A control unit 2310 for sending control signals to the systolic array 2320 and the processing unit 2330, the control signals indicating an average pooling operation or a maximization operation.

Systolic array 2320 includes a plurality of computational cells 3000. Each computing unit 3000 in systolic array 2320 has the same internal structure.

Each computing unit 3000 in the systolic array 2320 is configured to receive a control signal, input feature data, and a first intermediate processing result from a previous computing unit along a second dimension of the systolic array 2320; according to the control signal, performing pooling related operation on the input characteristic data and the first intermediate processing result to obtain a second intermediate processing result; sending a second intermediate processing result to a next-stage computational unit along a second dimension of systolic array 2320; the input feature data is sent to the next stage of computational units along the first dimension of systolic array 2320.
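The chained behavior of the computing units along the second dimension can be modeled as a simple software sketch. Here the first unit in the chain has no upstream first intermediate result and simply passes its own feature value on; whether the hardware instead injects an initial value is an assumption, as are all names below.

```python
def pooling_column(feature_column, mode):
    """Model one chain of computing units along the second dimension.
    Each unit combines its stored feature value with the intermediate
    result from the previous-stage unit and forwards the new result.
    mode 'avg' accumulates a running sum; mode 'max' a running maximum."""
    result = None
    for x in feature_column:      # one computing unit per element
        if result is None:
            result = x            # first unit: no upstream result
        elif mode == 'avg':
            result = result + x   # second intermediate result (sum)
        else:
            result = max(result, x)  # second intermediate result (max)
    return result                 # what the last unit sends onward
```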

As shown in fig. 23, the computational units at the input end of the first dimension of systolic array 2320 may receive input feature values from a feature data store 2340 outside the systolic array 2320. The remaining computing units receive input feature values from the previous-stage computing unit along the first dimension.

Optionally, the input feature data to be loaded into systolic array 2320 are determined by a host processor and then loaded into systolic array 2320 from feature data store 2340. The host processor may be the control unit 2310, or may be another processor in the neural network system.

Optionally, the neural network processing device 2300 provided herein includes a feature data store 2340.

Optionally, the neural network processing device 2300 provided herein does not include the feature data store 2340 but is connected to the feature data store 2340.

The processing unit 2330 is connected to the computing unit 3000 at the output end of the systolic array 2320. The processing unit 2330 is configured to receive the control signal and the second intermediate processing result output by the computing unit at the output end of the systolic array, and to perform the pooling-related operation on the second intermediate processing result according to the control signal to obtain a pooled output result.

The computing unit at the output end of systolic array 2320 refers to the computing unit that finally outputs the intermediate processing result in systolic array 2320; in fig. 23, for example, it refers to the computing units in the last row of systolic array 2320.

For example, the transmit interface of the compute unit at the output of systolic array 2320 is communicatively coupled to the receive interface of processing unit 2330.

When the control signal indicates an average pooling operation, the processing unit 2330 performs an averaging operation on the second intermediate processing result. When the control signal indicates a maximum pooling operation, the processing unit 2330 performs a comparison operation on the second intermediate processing result.

As can be seen from the above, the control unit 2310 is used for controlling the systolic array 2320 and the processing unit 2330 to perform an average pooling operation or a maximum pooling operation, which can improve the hardware utilization to some extent.

In addition, using the systolic array to implement the pooling operation can effectively reduce the number of data reads during the pooling operation.

Therefore, the pooling processing device provided in the present application allows the average pooling operation and the maximum pooling operation to multiplex one set of hardware, thereby improving hardware utilization, saving cost, and effectively reducing the number of data reads during the pooling operation.

The internal structure of the computing unit 3000 may be implemented in various ways.

Alternatively, as an implementation, the internal structure of each computing unit 3000 is as follows.

A feature register for storing input feature data and for forwarding the input feature data to a subsequent stage of computational cells along the first dimension of systolic array 2320.

An adder connected to the feature register.

A comparator connected to the feature register.

When the control signal indicates the average pooling operation, the adder receives the input feature data and a first intermediate processing result from the previous-stage computing unit along the second dimension, accumulates them to obtain a second intermediate processing result, and sends the second intermediate processing result to the next-stage computing unit along the second dimension.

When the control signal indicates the maximum pooling operation, the comparator receives the input feature data and the first intermediate processing result, outputs the maximum of the two, and sends a second intermediate processing result, namely that maximum value, to the next-stage computing unit along the second dimension.
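The adder/comparator variant of the computing unit can be sketched as a small class; all method and attribute names here are illustrative assumptions, not taken from the source.

```python
class ComputeUnit:
    """Sketch of the adder/comparator variant of the computing unit:
    a feature register plus an adder path (average pooling) and a
    comparator path (maximum pooling)."""

    def __init__(self):
        self.feature_reg = None          # feature register

    def load(self, x):
        self.feature_reg = x             # store the input feature datum

    def forward_feature(self):
        return self.feature_reg          # forwarded along the first dimension

    def step(self, control, first_intermediate):
        # produce the second intermediate result for the second dimension
        if control == 'avg':
            return self.feature_reg + first_intermediate   # adder path
        return max(self.feature_reg, first_intermediate)   # comparator path
```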

Alternatively, as another implementation, the internal structure of each computing unit 3000 is as follows.

A feature register for storing input feature data and for forwarding the input feature data to a subsequent stage of computational cells along the first dimension of systolic array 2320.

An adder connected to the feature register.

A selector connected to the feature register and the adder.

When the control signal indicates the average pooling operation, the adder receives the input feature data and a first intermediate processing result from the previous-stage computing unit along the second dimension, accumulates them to obtain a second intermediate processing result, and sends the second intermediate processing result to the next-stage computing unit along the second dimension.

When the control signal indicates the maximum pooling operation, the adder receives the input feature data and the first intermediate processing result and outputs the result of subtracting the first intermediate processing result from the input feature data; the selector receives the input feature data, the subtraction result, and the first intermediate processing result, selects the maximum of the input feature data and the first intermediate processing result according to the subtraction result, and sends a second intermediate processing result, namely that maximum value, to the next-stage computing unit along the second dimension.
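The selector variant reuses the adder as a subtractor and decides from the sign of the difference, so no dedicated comparator is needed. A minimal sketch (function name assumed):

```python
def select_max(x, first_intermediate):
    """Selector variant of maximum pooling: the adder computes
    x - first_intermediate, and the selector picks the larger
    operand from the sign of that difference."""
    diff = x - first_intermediate        # adder output (subtraction)
    return x if diff > 0 else first_intermediate
```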

Optionally, in this embodiment, the processing unit 2330 is configured to: when the control signal indicates the average pooling operation, averaging the received second intermediate processing result based on the size of the pooling window to obtain an output result of the average pooling operation; or when the control signal indicates the maximum pooling operation, comparing the received second intermediate processing results to obtain an output result of the maximum pooling operation.
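The final step performed by processing unit 2330 can be sketched as below for a single pooling window; treating the received second intermediate results as per-window partial results, and all names, are assumptions for illustration.

```python
def finalize(control, partial_results, window_elems):
    """Sketch of processing unit 2330's final step for one pooling
    window. partial_results: second intermediate results received from
    the output-end computing units; window_elems: number of elements in
    the pooling window (e.g. 9 for a 3x3 window)."""
    if control == 'avg':
        return sum(partial_results) / window_elems   # averaging operation
    return max(partial_results)                      # comparison operation
```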

Therefore, the neural network processing device 2300 provided in the present application allows the average pooling operation and the maximum pooling operation to multiplex one set of hardware, thereby improving hardware utilization, saving cost, and effectively reducing the number of data reads during the pooling operation.

It should be noted that, although the steps are described in a specific order, the steps are not necessarily executed in the specific order, and in fact, some of the steps may be executed concurrently or even in a changed order as long as the required functions are achieved.

The neural network processing device provided by the embodiment of the present application can be implemented on a field-programmable gate array (FPGA) or an Application Specific Integrated Circuit (ASIC).

The embodiment of the invention also provides a control method of the computing unit. The computing unit may be, for example, the computing unit 200 described in any of the embodiments above. The control method is executed by the control unit 110 described in the above embodiment. The control method may include the processing flow performed by the control unit 110 as described above, and is not described in detail here to avoid redundancy.

In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that incorporates one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Video Disk (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one computing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
