CNN coprocessor

Document No.: 190312    Publication date: 2021-11-02

Note: this technology, "A CNN coprocessor" (一种CNN协处理器), was designed and created by 刘冬生, 魏来, 陆家昊, 卢楷文 and 朱令松 on 2021-07-16. Abstract: The invention discloses a CNN coprocessor, belonging to the field of artificial-intelligence algorithm circuit implementation, comprising a storage module, a convolution calculation module and a data cache control module. The storage module is used for storing the data required for forward inference; the convolution calculation module is used for performing the convolution calculations of the forward inference process; the data cache control module is used for reading and caching the data required by the current forward inference from the storage module and, when the CNN inference mode is continuous forward inference, for reading and caching the data required by the next forward inference from the storage module after the data required by the current forward inference has been read and before the current forward inference is finished. By executing the current round of forward inference in parallel with the data reading for the next round, the invention greatly reduces the time occupied by data transfer and achieves a higher computation speed; in addition, the invention introduces a multiply-add circuit based on sign-bit judgment, which effectively reduces redundant computation and lowers the system power consumption.

1. A CNN coprocessor, characterized in that the CNN coprocessor is used for implementing the forward inference process of a CNN network; the inference modes of the CNN coprocessor comprise continuous forward inference and discontinuous forward inference;

the CNN coprocessor comprises: a storage module, a convolution calculation module and a data cache control module;

the storage module is used for storing the data required for forward inference;

the data cache control module is used for reading the data required by the current forward inference from the storage module and caching it;

the convolution calculation module is used for obtaining the data required by the current forward inference from the data cache control module, performing the convolution calculations of the forward inference process based on that data, and outputting the convolution calculation results to the data cache control module for caching;

the data cache control module is further used for, when the CNN inference mode is continuous forward inference, reading the data required by the next forward inference from the storage module and caching it, after the data required by the current forward inference has been read and before the current forward inference is finished;

and the data cache control module is further used for outputting the convolution calculation result last input by the convolution calculation module, as the calculation result of the current forward inference, to the storage module for storage after the current forward inference is finished.

2. The CNN coprocessor of claim 1, wherein the data required for forward inference comprises pixel data, bias data and weight data.

3. The CNN coprocessor of claim 2, wherein the data cache control module comprises a direct memory access controller unit (DMAC) and a data cache unit;

the DMAC is used for reading the pixel data, bias data and weight data required by the current forward inference from the storage module and caching them in the data cache unit, from which they are output to the convolution calculation module; the DMAC is further used for, when the CNN inference mode is continuous forward inference, reading the pixel data, bias data and weight data required by the next forward inference from the storage module and caching them in the data cache unit;

the data cache unit is further used for caching the convolution calculation results output by the convolution calculation module, and, when the convolution calculation module executes the convolution calculation of an intermediate layer of the CNN network, inputting the most recently cached convolution calculation result into the convolution calculation module as new pixel data;

and the data cache unit is further used for outputting the convolution calculation result last input by the convolution calculation module, as the calculation result of the current forward inference, to the DMAC after the current forward inference is finished, to be stored in the storage module through the DMAC.

4. The CNN coprocessor of claim 3, wherein the DMAC comprises: a data distributor, a pixel FIFO, a weight FIFO, a bias FIFO, an AND circuit and a read controller;

the data distributor is used for reading the pixel data, part of the bias data and part of the weight data required by the current forward inference from the storage module and distributing them to the corresponding pixel FIFO, bias FIFO and weight FIFO; each kind of data is output to the data cache unit through its corresponding FIFO;

the weight FIFO and the bias FIFO each output a status signal indicating whether it is full to the AND circuit; the AND circuit performs an AND operation on these signals and sends the result to the read controller;

the read controller is used for continuing to read the bias data and weight data required by the current forward inference from the storage module whenever the weight FIFO and the bias FIFO are not full, and inputting them through the data cache unit into the convolution calculation module for calculation, until all bias data and weight data required by the current forward inference have been read; for writing the calculation result of the current forward inference into the storage module after the current forward inference is finished; and, when the CNN inference mode is continuous forward inference, for reading the pixel data, bias data and weight data required by the next forward inference from the storage module and caching them in the data cache unit, after all bias data and weight data required by the current forward inference have been read and before the current forward inference is finished.

5. The CNN coprocessor of claim 4, wherein, when the CNN inference mode is continuous forward inference, the bias data and weight data required by the next forward inference are read from the storage module and cached in the data cache unit after all bias data and weight data required by the current forward inference have been read and before the current forward inference proceeds to the fully connected layer calculation stage; and the pixel data required by the next forward inference is read from the storage module and cached in the data cache unit after the current forward inference has proceeded to the fully connected layer calculation stage and before the current forward inference is finished.

6. The CNN coprocessor of claim 1, wherein the convolution calculation module comprises a MAC array;

the MAC array is used for performing the convolution calculation of pixel data and weight data and then adding the bias data to obtain a convolution calculation result;

the MAC array comprises a plurality of MAC multiply-add units distributed in an array, arranged as follows: the MAC multiply-add units are repeated nx times along a first direction, and these nx multiply-add units are cascaded together to form a multiply-add submodule; the multiply-add submodules are repeated ny times along a second direction to form a multiply-add kernel module; the multiply-add kernel modules are repeated nz times along a third direction to form an array circuit of nx*ny*nz MAC multiply-add units; wherein nx, ny and nz are integers not less than 2, and the first, second and third directions are all different;

each MAC multiply-add unit comprises: a condition judgment circuit, a multiplier, a two-way selector and an adder;

the condition judgment circuit judges the pixel data: if the pixel data is positive, the enable signal is set high and output to the multiplier and the two-way selector, the multiplier is enabled, and the two-way selector gates the multiplier output, so that the product of the pixel data and the weight data computed by the multiplier is output through the two-way selector to the adder; otherwise the enable signal is set low and output to the multiplier and the two-way selector, the multiplier is disabled, and the two-way selector gates the value 0 to the adder;

the adder is used for summing the output of the two-way selector with psum; when the MAC multiply-add unit is not the first-stage MAC multiply-add unit in the first direction, psum is the output of the preceding MAC multiply-add unit along the first direction; when the MAC multiply-add unit is the first-stage MAC multiply-add unit in the first direction, psum is 0;

the MAC array further comprises an accumulator; the accumulator is used for calculating the sum of the data output by each multiply-add submodule and the corresponding bias data.

7. The CNN coprocessor according to claim 6, wherein said convolution computation module further comprises a pooling processing unit; the pooling processing unit is used for pooling the calculation results output by the MAC array.

8. The CNN coprocessor of claim 1, wherein the storage module is a spin-transfer torque magnetic random access memory (STT-MRAM).

9. The CNN coprocessor according to any one of claims 2-8, further comprising a register configuration module and a central control unit module;

the register configuration module is used for configuring, in the initial state, the register group required by the CNN coprocessor; the register group is used for storing the inference information; the inference information comprises: the CNN inference mode, the storage address AI_PIX_ADR of the pixel data required by the current forward inference in the storage module, the storage address AI_WGT_ADR of the weight data required by forward inference in the storage module, and the storage address AI_BIAS_ADR of the bias data required by forward inference in the storage module; when the CNN inference mode is continuous forward inference, the inference information further comprises the storage address AI_NPIX_ADR of the pixel data required by the next forward inference in the storage module;

the central control unit module is used for reading the inference information stored in the register group of the register configuration module, sending the CNN inference mode, AI_PIX_ADR, AI_WGT_ADR and AI_BIAS_ADR to the data cache control module, and, when the CNN inference mode is continuous forward inference, also sending AI_NPIX_ADR to the data cache control module; for monitoring whether the current forward inference process is finished and notifying the data cache control module and the register configuration module when it is; and for generating the enable signal of the convolution calculation module and outputting it to the convolution calculation module;

the register configuration module is further used for, when the CNN inference mode is continuous forward inference, reconfiguring the register group required by the CNN coprocessor after the current forward inference process is finished and updating AI_PIX_ADR to AI_NPIX_ADR; if the next forward inference is not the last forward inference, AI_NPIX_ADR is updated to the storage address of the pixel data required by the forward inference after the next; otherwise, the CNN inference mode is updated to discontinuous forward inference.

10. The CNN coprocessor of claim 9, wherein AI_PIX_ADR is the first address of the storage addresses of the pixel data required by the current forward inference in the storage module, AI_WGT_ADR is the first address of the storage addresses of the weight data required by the current forward inference in the storage module, and AI_BIAS_ADR is the first address of the storage addresses of the bias data required by the current forward inference in the storage module;

in the storage module, the addresses storing the pixel data, weight data and bias data required by the different layers of the current forward inference are contiguous; the pixel data, weight data and bias data required by the different layers of the current forward inference are all read out in sequence starting from the corresponding first addresses.

Technical Field

The invention belongs to the field of artificial intelligence algorithm circuit implementation, and particularly relates to a CNN coprocessor.

Background

With the continuous development of System-on-Chip (SoC) technology and artificial-intelligence algorithms, AI algorithms now offer high accuracy, robustness and self-learning capability; among them, the Convolutional Neural Network (CNN), thanks to its local connectivity and weight sharing, can effectively prevent overfitting while greatly reducing the number of computation parameters. Because the implementation of a convolutional neural network involves a large amount of data computation and memory access, computational efficiency, energy consumption and data storage cannot be neglected.

In terms of storage, the inference process of a convolutional neural network requires a large amount of weight and bias data, so the processor must exchange a large amount of data with memory, which places high demands on transfer efficiency. In a conventional CNN coprocessor, all data to be computed (weights, biases and pixel data) must be read out of memory and buffered before the subsequent convolution operations can proceed; during inference, weight and bias data must be continuously fetched from the peripheral storage module, so the data required for the next inference cannot be fully loaded during continuous inference, the continuous inference process stalls, data transfer occupies a long portion of the computation time, and the computation speed is slow. In addition, the storage structure of a conventional CNN coprocessor mostly consists of static random access memory (SRAM) and non-volatile memory (NVM, e.g. ROM), whose read/write speed is slow; the speed at which the memory supplies data lags far behind the speed at which data is processed, creating a memory-wall problem, which in turn causes the transfer of weights, biases and pixel data between the computation module and the memory to occupy a large amount of time during the implementation of the convolutional neural network algorithm.

In terms of computational efficiency and energy consumption, the inference process of a convolutional neural network requires a large amount of convolution calculation, which places high demands on both. Because the general-purpose CPU core integrated in an SoC uses a general instruction architecture, the amount of data it can process in parallel during CNN forward inference is very limited. Moreover, the convolution calculation of a conventional CNN coprocessor contains a large amount of redundant computation: for example, the RELU function converts negative numbers to 0, and these 0 values are still multiplied with weights as inputs to the next layer, wasting a large amount of resources and causing unnecessary power consumption. The result is low computational efficiency and high energy consumption.

Disclosure of Invention

In view of the above defects or improvement requirements of the prior art, the present invention provides a CNN coprocessor to solve the technical problem of slow computation speed caused by long data transmission time in the prior art.

To achieve the above object, the present invention provides a CNN coprocessor for implementing the forward inference process of a CNN network; the inference modes of the CNN coprocessor comprise continuous forward inference and discontinuous forward inference;

the CNN coprocessor comprises: a storage module, a convolution calculation module and a data cache control module;

the storage module is used for storing the data required for forward inference;

the data cache control module is used for reading the data required by the current forward inference from the storage module and caching it;

the convolution calculation module is used for obtaining the data required by the current forward inference from the data cache control module, performing the convolution calculations of the forward inference process based on that data, and outputting the convolution calculation results to the data cache control module for caching;

the data cache control module is further used for, when the CNN inference mode is continuous forward inference, reading the data required by the next forward inference from the storage module and caching it, after the data required by the current forward inference has been read and before the current forward inference is finished;

and the data cache control module is further used for outputting the convolution calculation result last input by the convolution calculation module, as the calculation result of the current forward inference, to the storage module for storage after the current forward inference is finished.

Further preferably, the data required for forward inference includes pixel data, bias data, and weight data.

Further preferably, the data cache control module comprises a direct memory access controller unit (DMAC) and a data cache unit;

the DMAC is used for reading the pixel data, bias data and weight data required by the current forward inference from the storage module and caching them in the data cache unit, from which they are output to the convolution calculation module; the DMAC is further used for, when the CNN inference mode is continuous forward inference, reading the pixel data, bias data and weight data required by the next forward inference from the storage module and caching them in the data cache unit;

the data cache unit is further used for caching the convolution calculation results output by the convolution calculation module, and, when the convolution calculation module executes the convolution calculation of an intermediate layer of the CNN network, inputting the most recently cached convolution calculation result into the convolution calculation module as new pixel data;

and the data cache unit is further used for outputting the convolution calculation result last input by the convolution calculation module, as the calculation result of the current forward inference, to the DMAC after the current forward inference is finished, to be stored in the storage module through the DMAC.

Further preferably, the DMAC comprises: a data distributor, a pixel FIFO, a weight FIFO, a bias FIFO, an AND circuit and a read controller;

the data distributor is used for reading the pixel data, part of the bias data and part of the weight data required by the current forward inference from the storage module and distributing them to the corresponding pixel FIFO, bias FIFO and weight FIFO; each kind of data is output to the data cache unit through its corresponding FIFO;

the weight FIFO and the bias FIFO each output a status signal indicating whether it is full to the AND circuit; the AND circuit performs an AND operation on these signals and sends the result to the read controller;

the read controller is used for continuing to read the bias data and weight data required by the current forward inference from the storage module whenever the weight FIFO and the bias FIFO are not full, and inputting them through the data cache unit into the convolution calculation module for calculation, until all bias data and weight data required by the current forward inference have been read; for writing the calculation result of the current forward inference into the storage module after the current forward inference is finished; and, when the CNN inference mode is continuous forward inference, for reading the pixel data, bias data and weight data required by the next forward inference from the storage module and caching them in the data cache unit after all bias data and weight data required by the current forward inference have been read and before the current forward inference is finished.

Further preferably, when the CNN inference mode is continuous forward inference, the bias data and weight data required by the fully connected layer of the current forward inference, together with the bias data and weight data required by the next forward inference, are read from the storage module and cached in the data cache unit after all other bias data and weight data required by the current forward inference have been read and before the current forward inference proceeds to the fully connected layer calculation stage; and the pixel data required by the next forward inference is read from the storage module and cached in the data cache unit after the current forward inference has proceeded to the fully connected layer calculation stage and before the current forward inference is finished.

Further preferably, the convolution calculation module comprises a MAC array;

the MAC array is used for performing the convolution calculation of pixel data and weight data and then adding the bias data to obtain a convolution calculation result;

the MAC array comprises a plurality of MAC multiply-add units distributed in an array, arranged as follows: the MAC multiply-add units are repeated nx times along a first direction, and these nx multiply-add units are cascaded together to form a multiply-add submodule; the multiply-add submodules are repeated ny times along a second direction to form a multiply-add kernel module; the multiply-add kernel modules are repeated nz times along a third direction to form an array circuit of nx*ny*nz MAC multiply-add units; wherein nx, ny and nz are integers not less than 2, and the first, second and third directions are all different;

each MAC multiply-add unit comprises: a condition judgment circuit, a multiplier, a two-way selector and an adder;

the condition judgment circuit judges the pixel data: if the pixel data is positive, the enable signal is set high and output to the multiplier and the two-way selector, the multiplier is enabled, and the two-way selector gates the multiplier output, so that the product of the pixel data and the weight data computed by the multiplier is output through the two-way selector to the adder; otherwise the enable signal is set low and output to the multiplier and the two-way selector, the multiplier is disabled, and the two-way selector gates the value 0 to the adder;

the adder is used for summing the output of the two-way selector with psum, where psum is the output of the preceding MAC multiply-add unit along the first direction; when the MAC multiply-add unit is the first-stage MAC multiply-add unit in the first direction, psum is 0;

the MAC array further comprises an accumulator; the accumulator is used for calculating the sum of the data output by each multiply-add submodule and the corresponding bias data.

Further preferably, the convolution calculation module further includes a pooling processing unit; and the pooling processing unit is used for pooling the calculation results output by the MAC array.

Further preferably, the storage module is a spin-transfer torque magnetic random access memory (STT-MRAM).

Further preferably, the CNN coprocessor further comprises a register configuration module and a central control unit module;

the register configuration module is used for configuring, in the initial state, the register group required by the CNN coprocessor; the register group is used for storing the inference information; the inference information includes: the CNN inference mode, the storage address AI_PIX_ADR of the pixel data required by the current forward inference in the storage module, the storage address AI_WGT_ADR of the weight data required by forward inference in the storage module, and the storage address AI_BIAS_ADR of the bias data required by forward inference in the storage module; when the CNN inference mode is continuous forward inference, the inference information further includes the storage address AI_NPIX_ADR of the pixel data required by the next forward inference in the storage module;

the central control unit module is used for reading the inference information stored in the register group of the register configuration module, sending the CNN inference mode, AI_PIX_ADR, AI_WGT_ADR and AI_BIAS_ADR to the data cache control module, and, when the CNN inference mode is continuous forward inference, also sending AI_NPIX_ADR to the data cache control module; for monitoring whether the current forward inference process is finished and notifying the data cache control module and the register configuration module when it is; and for generating the enable signal of the convolution calculation module and outputting it to the convolution calculation module;

the register configuration module is further used for, when the CNN inference mode is continuous forward inference, reconfiguring the register group required by the CNN coprocessor after the current forward inference process is finished and updating AI_PIX_ADR to AI_NPIX_ADR; if the next forward inference is not the last forward inference, AI_NPIX_ADR is updated to the storage address of the pixel data required by the forward inference after the next; otherwise, the CNN inference mode is updated to discontinuous forward inference.

Further preferably, AI_PIX_ADR is the first address of the storage addresses of the pixel data required by the current forward inference in the storage module, AI_WGT_ADR is the first address of the storage addresses of the weight data required by the current forward inference in the storage module, and AI_BIAS_ADR is the first address of the storage addresses of the bias data required by the current forward inference in the storage module;

in the storage module, the addresses storing the pixel data, weight data and bias data required by the different layers of the current forward inference are contiguous; the pixel data, weight data and bias data required by the different layers of the current forward inference are all read out in sequence starting from the corresponding first addresses.

Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:

1. The invention provides a CNN coprocessor comprising a storage module, a convolution calculation module and a data cache control module. The coprocessor has two modes, continuous forward inference and discontinuous forward inference; in the continuous forward inference mode, the data cache control module reads and caches the data required by the next forward inference from the storage module after the data required by the current forward inference has been read and before the current forward inference is finished. By executing the current round of forward inference in parallel with the data reading for the next round, the time occupied by data transfer is greatly reduced and the computation speed is higher.

2. The CNN coprocessor provided by the invention realizes an efficient data-movement scheme through the DMAC: data read from the bus is distributed to the corresponding FIFOs by the data distributor; the weight FIFO and the bias FIFO cache in advance the weight and bias data that do not participate in the current convolution calculation as well as those of the next inference; and the pixel FIFO caches the pixel data required by the next inference, because at the beginning of the fully connected layer calculation the data cache unit does not yet have enough space, so this data must first be held in the pixel FIFO and is read into the data cache unit by the read controller once free space becomes available. This structure greatly improves the computational efficiency of the continuous inference process, prevents stalls in data access from degrading efficiency, and can effectively improve the overall computational efficiency of the circuit and the throughput of the CNN processor.

3. In the CNN continuous inference mode, the CNN coprocessor provided by the invention caches, before the fully connected layer, the bias and weight data required by the fully connected layer of the current forward inference together with the weight and bias data required by the next round of inference. During the fully connected layer calculation, the stored multi-channel input data of the numerous convolution layers is flattened into a single channel, which greatly reduces the amount of data that needs to be cached; the pixel data of the next round of inference can therefore be cached, effectively reducing the overhead of internal storage resources.

4. The CNN coprocessor provided by the invention further comprises a MAC multiply-add unit based on sign-bit judgment. The MAC multiply-add unit comprises a condition judgment circuit, a multiplier, a two-way selector and an adder; the condition judgment circuit controls the multiplier and the two-way selector by judging whether the input pixel data is positive or negative, the multiplier performs multiplication only when the pixel data is positive, and only then does the two-way selector gate the multiplication result to the output. This provides an efficient circuit implementation of the RELU activation function and avoids the hardware overhead of adding a separate activation-function circuit after pooling. At the same time it removes the large resource waste and unnecessary power consumption incurred in prior implementations, where negative numbers are converted to 0 and still multiplied with weights as inputs to the next layer. The hardware resource overhead of the circuit is thus effectively reduced, redundant multiply-add operations are markedly decreased, and the power consumption of the whole circuit is lowered.

5. In the CNN coprocessor provided by the invention, the storage module adopts spin-transfer torque magnetic random access memory (STT-MRAM), which is non-volatile, high-performance, high-density, endures a large number of write cycles and is compatible with the CMOS process; its read/write speed is high, avoiding the memory-wall problem, further reducing the time occupied by transferring weights, biases and pixel data between the calculation module and the memory during the implementation of the convolutional neural network algorithm, and improving the computation speed.

Drawings

Fig. 1 is a schematic structural diagram of a CNN coprocessor provided in the present invention;

fig. 2 is a detailed structural diagram of a CNN coprocessor provided in the embodiment of the present invention;

fig. 3 is a schematic circuit diagram of a MAC array according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a circuit structure of a DMAC according to an embodiment of the present invention;

fig. 5 is a schematic flow chart of CNN coprocessor forward inference provided in the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

To achieve the above object, the present invention provides a CNN coprocessor for implementing the forward inference process of a CNN network, suitable for accelerating the convolution operations in convolutional neural network forward inference. The inference modes of the CNN coprocessor comprise continuous forward inference and discontinuous forward inference. In continuous inference mode, all pixel data and part of the weight and bias data required by the next inference are cached during the current inference, ensuring that the subsequent inference does not stall and the process can be fully pipelined. In discontinuous inference mode (i.e. no next forward inference follows), a single inference is completed, and part of the data must be cached before each inference is carried out.

Specifically, as shown in fig. 1, the CNN coprocessor comprises: a storage module, a convolution calculation module and a data cache control module;

the storage module is used for storing the data required for forward inference;

the data cache control module is used for reading the data required by the current forward inference from the storage module and caching it;

the convolution calculation module is used for obtaining the data required by the current forward inference from the data cache control module, performing the convolution calculations of the forward inference process based on that data, and outputting the convolution calculation results to the data cache control module for caching;

the data cache control module is further used for, when the CNN inference mode is continuous forward inference, reading the data required by the next forward inference from the storage module and caching it, after the data required by the current forward inference has been read and before the current forward inference is finished;

the data cache control module is further used for outputting the convolution calculation result last input by the convolution calculation module, as the calculation result of the current forward inference, to the storage module for storage after the current forward inference is finished;

wherein the data required for forward inference comprises pixel data, bias data and weight data.
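
For illustration, the parallelism between the current round's computation and the next round's data loading can be captured in a small behavioral model. The sketch below is illustrative only and is not part of the claimed circuit: the names load_data and run_inference and the two-buffer scheme are assumptions standing in for the DMAC and the MAC array, and Python threads stand in for hardware concurrency.

    # Minimal behavioral sketch of continuous forward inference with prefetch.
    # Assumption: data loading and computation can proceed concurrently, as the
    # DMAC and the MAC array do in the coprocessor; threads model that overlap.
    import threading

    def load_data(storage, round_idx):
        # Stand-in for the DMAC reading the pixel/bias/weight data of one round.
        return storage[round_idx]

    def run_inference(data):
        # Stand-in for the convolution computations of one forward inference.
        return sum(data)  # placeholder computation

    def continuous_inference(storage):
        results = []
        buf = load_data(storage, 0)          # the first round must wait for its data
        for i in range(len(storage)):
            nxt = {}
            def prefetch():
                if i + 1 < len(storage):     # read the next round's data ...
                    nxt["data"] = load_data(storage, i + 1)
            t = threading.Thread(target=prefetch)
            t.start()                        # ... while the current round computes
            results.append(run_inference(buf))
            t.join()
            buf = nxt.get("data")
        return results

    print(continuous_inference([[1, 2], [3, 4], [5, 6]]))   # [3, 7, 11]

Only the first round pays the full data-transfer latency; every later round finds its data already cached, which is the effect the data cache control module achieves in continuous forward inference.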

Preferably, the data cache control module comprises a direct memory access controller unit (DMAC) and a data cache unit;

the DMAC is used for reading the pixel data, bias data and weight data required by the current forward inference from the storage module and caching them in the data cache unit, from which they are output to the convolution calculation module; the DMAC is further used for, when the CNN inference mode is continuous forward inference, reading the pixel data, bias data and weight data required by the next forward inference from the storage module and caching them in the data cache unit;

the data cache unit is further used for caching the convolution calculation results output by the convolution calculation module, and, when the convolution calculation module executes the convolution calculation of an intermediate layer of the CNN network, inputting the most recently cached convolution calculation result into the convolution calculation module as new pixel data;

and the data cache unit is further used for outputting the pooled convolution calculation result last input by the pooling processing module, as the calculation result of the current forward inference, to the DMAC after the current forward inference is finished, to be stored in the storage module through the DMAC.

Preferably, the convolution calculation module comprises a MAC array and a pooling processing unit.

In an optional embodiment, the CNN coprocessor further comprises a register configuration module and a central control unit module; the data cache control module comprises a direct memory access controller unit (DMAC) and a data cache unit; the convolution calculation module comprises a MAC array and a pooling processing unit; the DMAC is connected with the storage module through a host bus interface, and the register configuration module is connected with the storage module through a slave bus interface. Specifically, fig. 2 shows the detailed structure of the CNN coprocessor, comprising the bus slave interface module AHB_slave_Interface, the bus host interface module AHB_master_Interface, the register configuration module AI_CFG, the central control unit module Central Controller, the MAC array, the pooling processing unit, the direct memory access controller unit DMAC, and the storage module.

The DMAC is connected with the storage module through the bus host interface module AHB_master_Interface, and the register configuration module is connected with the storage module through the bus slave interface module AHB_slave_Interface. The first input end, first output end, second output end and third output end of the central control unit module are respectively connected with the first output end of the register configuration module AI_CFG, the first input end of the MAC array, the second input end of the data cache unit and the third input end of the DMAC. The first input end and first output end of the DMAC are respectively connected with the first output end and first input end of the host interface module AHB_master_Interface; the first input end and first output end of the data cache unit are respectively connected with the second output end and second input end of the DMAC; the second output end of the data cache unit is connected with the second input end of the MAC array; and the first output end of the MAC array is connected with the first input end of the pooling processing unit.

the bus slave Interface module AHB _ slave _ Interface and the bus host Interface module AHB _ master _ Interface communicate with other modules by adopting an AMBA 2.0 protocol bus; and the slave Interface module AHB _ slave _ Interface is responsible for carrying out data transmission with a main processor outside the CNN coprocessor to complete the control of the operation mode of the CNN coprocessor and the reading of a calculation result. Specifically, the slave Interface module AHB _ slave _ Interface and the AHB bus are connected to 12 AHB bus signals, including 9 input signals and 3 output signals, where the input signal with 1-bit width is a clock signal ai _ clk, a reset signal ai _ rst, a chip select signal hsel _ ai, a write enable signal hwrite, and an idle signal hread _ in, the input signal with 2-bit width is a transmission size signal hsize, a transmission type signal htrans, and the input signal with 32-bit width is address data haddr and write data hwdata, and the output signals are a response signal with 1-bit width, an idle signal hread _ out with 1-bit width, and read data hrdata with 32-bit width. The host Interface module AHB _ master _ Interface is connected with an AHB bus to form 12 AHB bus signals, including 6 input signals and 8 output signals, wherein the input signals with 1-bit width are respectively a clock signal ai _ clk, a reset signal ai _ rst, an idle signal hread _ in and a bus grant signal hgrant, the input signals with 2-bit width are respectively a transmission size signal hsize and a transmission type signal htrans, the input signals with 32-bit width are write data hrdata, the output signals with 1-bit width are respectively a chip select signal hsel, a write enable signal hwrite, an idle signal hread _ out and a bus request signal hburseq, the output signals with 2-bit width are respectively a transmission size signal hsize, a transmission type signal htrans and the input signals with 32-bit width are respectively an address data haddr and write data hdata.

The storage module serves as the RAM and ROM portion of the CNN coprocessor and is used for storing the pixel data, bias data and weight data to be input. In this embodiment the storage module is a spin-transfer torque magnetic random access memory (STT-MRAM), specifically a 64K x 32-bit STT-MRAM. STT-MRAM is non-volatile, high-performance (its read/write speed is comparable to DRAM and close to SRAM), high-density, endures a large number of write cycles (more than 3x10^16), and is compatible with the CMOS process; in artificial-intelligence applications, STT-MRAM meets the read/write speed requirements that such applications place on memory.

The register configuration module AI_CFG is used for configuring, in the initial state, the register group required by the CNN coprocessor; the register group stores the inference information. The inference information includes: the flag AI_CNN_DONE indicating whether the current forward inference is finished, the CNN inference mode AI_CNN_MODE, the storage address AI_PIX_ADR of the pixel data required by the current forward inference in the storage module, the storage address AI_WGT_ADR of the weight data required by forward inference in the storage module, and the storage address AI_BIAS_ADR of the bias data required by forward inference in the storage module. The CNN inference modes comprise continuous forward inference and discontinuous forward inference; when the CNN inference mode is continuous forward inference, the inference information further includes the storage address AI_NPIX_ADR of the pixel data required by the next forward inference in the storage module. It should be noted that the weights and biases used in different rounds of forward inference are the same, so AI_WGT_ADR and AI_BIAS_ADR apply not only to the current forward inference but also to the next.
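
For concreteness, the inference information held in the register group can be modeled as a small record. The sketch below is illustrative, with field names mirroring the registers described above; the value 0 used as the discontinuous-mode code is an assumption of this example, since only the continuous-mode value 2 is given in fig. 5.

    # Sketch of the register-group contents and the reconfiguration rule.
    from dataclasses import dataclass

    @dataclass
    class InferenceInfo:
        AI_CNN_DONE: bool       # flag: current forward inference finished
        AI_CNN_MODE: int        # inference mode (2 = continuous, per fig. 5)
        AI_PIX_ADR: int         # first address of current-round pixel data
        AI_WGT_ADR: int         # first address of weight data (shared across rounds)
        AI_BIAS_ADR: int        # first address of bias data (shared across rounds)
        AI_NPIX_ADR: int = 0    # first address of next-round pixel data (continuous mode)

    def reconfigure(info: InferenceInfo, next_pix_adr: int, next_is_last: bool) -> None:
        # After the current round, the prefetched pixel base becomes the current one.
        info.AI_PIX_ADR = info.AI_NPIX_ADR
        if not next_is_last:
            info.AI_NPIX_ADR = next_pix_adr   # base of the round after next
        else:
            info.AI_CNN_MODE = 0              # assumed code for discontinuous mode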

The central control unit module provides control signals to the other circuit modules and generates cache-data address signals with which the convolution calculation module reads the corresponding data for convolution calculation. Specifically, the central control unit module is used for reading the inference information stored in the register group of the register configuration module and sending the CNN inference mode, AI_PIX_ADR, AI_WGT_ADR and AI_BIAS_ADR to the DMAC, and, when the CNN inference mode is continuous forward inference, also sending AI_NPIX_ADR to the DMAC; for monitoring whether the current forward inference process is finished, and notifying the data cache unit and the register configuration module (to configure AI_CNN_DONE) when it is; for generating the MAC array computation enable signal and outputting it to the MAC multiply-add units; and for generating the address data of the data cache unit.

The DMAC is used for reading the pixel data, bias data and weight data required by the current forward inference from the storage module based on AI_PIX_ADR, AI_WGT_ADR and AI_BIAS_ADR, caching them in the data cache unit, and outputting them through the data cache unit to the MAC array module. It should be noted that AI_PIX_ADR is the first address of the storage addresses of the pixel data required by the current forward inference in the storage module, AI_WGT_ADR is the first address of the storage addresses of the weight data, and AI_BIAS_ADR is the first address of the storage addresses of the bias data. In the storage module, the addresses storing the pixel data, weight data and bias data required by the different layers of the current forward inference are contiguous; the pixel data, weight data and bias data required by the different layers are all read out in sequence starting from the corresponding first addresses.
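
Because each kind of data is stored contiguously from its first address, reading the per-layer blocks reduces to sequential reads from a base pointer. A minimal sketch under assumed conditions (word-addressed storage and known per-layer sizes; layer_sizes is an illustrative parameter, not from the patent):

    # Sketch: sequential per-layer reads from a contiguous region of storage.
    def read_layers(memory, first_addr, layer_sizes):
        """Return one data block per layer, read in order starting at first_addr."""
        blocks, addr = [], first_addr
        for size in layer_sizes:          # layer blocks are stored back to back
            blocks.append(memory[addr:addr + size])
            addr += size                  # the next layer starts where this one ends
        return blocks

    mem = list(range(100))
    # e.g. weights of three layers of 4, 8 and 2 words, with AI_WGT_ADR = 10:
    print(read_layers(mem, 10, [4, 8, 2]))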

The MAC array comprises a plurality of MAC multiply-add units distributed in an array; the MAC array module is used for controlling each MAC multiply-add unit based on the MAC array computation enable signal, performing the convolution calculation of pixel data and weight data, adding the bias data to obtain the convolution calculation result, and outputting it to the pooling processing unit;

the pooling processing unit is used for pooling the convolution calculation result output by the MAC array and outputting the pooled result to the data cache unit for caching;

the data cache unit is further used for outputting the pooled convolution calculation result last input by the pooling processing unit, as the calculation result of the current forward inference, to the DMAC after the current forward inference is finished, to be stored in the storage module through the DMAC;

the DMAC is further used for, when the CNN inference mode is continuous forward inference, reading the pixel data, bias data and weight data required by the next forward inference from the storage module based on AI_NPIX_ADR, AI_WGT_ADR and AI_BIAS_ADR, and caching them in the data cache unit;

the register configuration module is further used for, when the CNN inference mode is continuous forward inference, reconfiguring the register group required by the CNN coprocessor after the current forward inference process is finished and updating AI_PIX_ADR to AI_NPIX_ADR; if the next forward inference is not the last forward inference, AI_NPIX_ADR is updated to the storage address of the pixel data required by the forward inference after the next; otherwise, the CNN inference mode is updated to discontinuous forward inference.

In an optional implementation, the MAC multiply-add units in the MAC array are arranged as follows: the MAC multiply-add units are repeated nx times along a first direction, and these nx multiply-add units are cascaded together to form a multiply-add submodule; the multiply-add submodules are repeated ny times along a second direction to form a multiply-add kernel module; the multiply-add kernel modules are repeated nz times along a third direction to form an array circuit of nx*ny*nz MAC multiply-add units, where nx, ny and nz are integers not less than 2 and the first, second and third directions are all different. As shown in fig. 3, in this embodiment the first direction is the x-axis, the second direction is the y-axis and the third direction is the z-axis; nx, ny and nz denote the number of MAC multiply-add units of the MAC array along the x-axis, y-axis and z-axis respectively.
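
The psum cascade can be sketched in a few lines. The behavioral model below is illustrative (pure software, with assumed names): each of the nx cascaded units adds its own product to the partial sum received from the previous stage, and an nx*ny*nz array is then ny such submodules per kernel module and nz kernel modules.

    # Behavioral sketch of one multiply-add submodule: nx cascaded MAC units.
    def mac_unit(pixel, weight, psum):
        # Sign-bit judgment: non-positive pixels contribute 0 (see the RELU circuit below).
        return psum + (pixel * weight if pixel > 0 else 0)

    def submodule(pixels, weights):
        """nx MAC units cascaded in one direction; the first stage gets psum = 0."""
        psum = 0
        for p, w in zip(pixels, weights):
            psum = mac_unit(p, w, psum)
        return psum

    # The accumulator adds the corresponding bias to the submodule outputs; summing
    # the submodules of one kernel module as below is an illustrative interpretation.
    def kernel_output(pixel_rows, weight_rows, bias):
        return sum(submodule(p, w) for p, w in zip(pixel_rows, weight_rows)) + bias

    print(kernel_output([[1, -2, 3]], [[4, 5, 6]], bias=7))   # 1*4 + 0 + 3*6 + 7 = 29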

Specifically, the MAC multiply-add unit comprises: a condition judgment circuit, a two-input multiplier, a two-way selector, a two-input adder and a D flip-flop.

First, the pixel data and weight data are input to the multiplier, whose output is connected to the two-way selector. The condition judgment circuit judges the input pixel data: if its sign bit is 0 (in this embodiment the sign bit of a positive number is 0 and that of a negative number is 1, so a sign bit of 0 means the pixel data is positive), the enable signal is set high and output to the multiplier and the two-way selector, so that the two-way selector gates the product of the pixel data and the weight data; otherwise the enable signal is set low and output to the multiplier and the two-way selector, no multiplication is performed, and the two-way selector directly gates the value 0. The output of the two-way selector is connected to the first input of the adder and psum is connected to the second input; the adder sums the output of the two-way selector with psum to obtain the multiply-add result of the current MAC multiply-add unit. Here psum is the result of the preceding MAC multiply-add unit of the MAC array along the y-axis; if the MAC multiply-add unit is the first MAC multiply-add unit along the y-axis, the input psum is 0. As shown in fig. 3, the input psum of MAC units 000, 001, 002 and 003 is 0, the input psum of MAC units 010, 011, 012 and 013 is the output of MAC units 000, 001, 002 and 003 respectively, and so on. The D flip-flop outputs the multiply-add result of the current MAC multiply-add unit on a rising or falling clock edge. By placing a condition judgment circuit before the multiplier and a two-way selector after its output, the MAC multiply-add unit of the invention realizes the RELU activation function. The RELU function, in full the rectified linear unit, is an activation function commonly used in artificial neural networks, with formula f(x) = max(0, x).
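
In other words, the condition judgment circuit folds the RELU activation into the multiply-add itself: a negative pixel never reaches the multiplier, and the selector feeds 0 to the adder. A bit-level behavioral sketch of one unit (illustrative; the 16-bit width is an assumption, and sign bit 0 = positive as in this embodiment):

    # One MAC multiply-add unit with sign-bit judgment (behavioral sketch).
    def mac_unit_signbit(pixel, weight, psum, width=16):
        """Two's-complement pixel of the given width; returns psum + RELU-gated product."""
        sign = (pixel >> (width - 1)) & 1   # condition judgment circuit reads the sign bit
        if sign == 0:                       # enable high: multiplier runs, selector gates it
            product = pixel * weight
        else:                               # enable low: multiplier idle, selector gates 0
            product = 0
        return psum + product               # two-input adder sums with the incoming psum

    # Equivalent to psum + max(0, pixel) * weight, i.e. f(x) = max(0, x) folded in:
    assert mac_unit_signbit(5, 3, 10) == 10 + 5 * 3
    assert mac_unit_signbit(-2 & 0xFFFF, 3, 10) == 10   # negative pixel contributes nothing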

Further, the MAC array further includes an accumulator for calculating the sum of the data output by each multiply-add submodule and the corresponding bias data.

Preferably, as shown in fig. 4, the DMAC comprises: a data distributor, a pixel FIFO (Pixel FIFO), a weight FIFO (Weight FIFO), a bias FIFO (Bias FIFO), an AND circuit, and a read controller (Controller);

the input end of the data distributor is connected with the AHB_master_Interface, and its output ends are respectively connected with the input ends of the Pixel FIFO, the Weight FIFO and the Bias FIFO; the full status signals of the Weight FIFO and the Bias FIFO are connected to the two input ends of the AND circuit, and the output signal of the AND circuit is sent to the read controller;

the data distributor is used for reading the pixel data, part of the bias data and part of the weight data required by the current forward inference from the storage module and distributing them to the corresponding pixel FIFO, bias FIFO and weight FIFO; each kind of data is output to the data cache unit through its corresponding FIFO;

the weight FIFO and the bias FIFO each output a status signal indicating whether it is full to the AND circuit; the AND circuit performs an AND operation on these to obtain the signal WB_FIFO_FULL, which indicates the full status of the weight FIFO and the bias FIFO, and sends it to the read controller. When WB_FIFO_FULL is 0, the weight FIFO and the bias FIFO are not full; when WB_FIFO_FULL is 1, the weight FIFO and the bias FIFO are full.

The read controller is used for continuing to read the bias data and weight data required by the current forward inference from the storage module whenever the weight FIFO and the bias FIFO are not full, and inputting them through the data cache unit into the convolution calculation module for calculation, until all bias data and weight data required by the current forward inference have been read; for writing the calculation result of the current forward inference into the storage module after the current forward inference is finished; and, when the CNN inference mode is continuous forward inference, for reading the pixel data, bias data and weight data required by the next forward inference from the storage module and caching them in the data cache unit after all bias data and weight data required by the current forward inference have been read and before the current forward inference is finished. Preferably, when the CNN inference mode is continuous forward inference, the bias data and weight data required by the fully connected layer of the current forward inference, together with the bias data and weight data required by the next forward inference, are read from the storage module and cached in the data cache unit after all other bias data and weight data required by the current forward inference have been read and before the current forward inference proceeds to the fully connected layer calculation stage; and the pixel data required by the next forward inference is read from the storage module and cached in the data cache unit after the current forward inference has proceeded to the fully connected layer calculation stage and before the current forward inference is finished.
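
The read controller's behavior thus amounts to: stall while the FIFO status says there is no room, otherwise keep fetching, and once the current round's weights and biases are in, prefetch the next round's data in the order described above. A simplified sketch under those assumptions (the FIFO depth and helper names are illustrative; the gate is written directly over the two not-full conditions, which is what WB_FIFO_FULL conveys to the controller):

    # Simplified sketch of the read controller's FIFO gating.
    from collections import deque

    class Fifo:
        def __init__(self, depth):
            self.q, self.depth = deque(), depth
        def full(self):
            return len(self.q) >= self.depth
        def push(self, x):
            self.q.append(x)

    def read_controller_step(weight_fifo, bias_fifo, fetch_weight, fetch_bias):
        # Read only while both FIFOs can accept data (WB_FIFO_FULL deasserted).
        if not weight_fifo.full() and not bias_fifo.full():
            weight_fifo.push(fetch_weight())
            bias_fifo.push(fetch_bias())

    wf, bf = Fifo(2), Fifo(2)
    for _ in range(3):                 # the third attempt stalls: both FIFOs are full
        read_controller_step(wf, bf, lambda: "w", lambda: "b")
    print(len(wf.q), len(bf.q))        # 2 2

    # Prefetch order in continuous mode (per the description above):
    #   1) current round's weights/biases loaded -> fetch next round's weights/biases
    #      before the fully connected stage;
    #   2) during the fully connected stage -> fetch next round's pixel data.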

Further, the forward inference flow when using the CNN coprocessor of the invention is shown in fig. 5. In step 001, the main processor outside the CNN coprocessor first inputs address data adr and an enable signal en to the register configuration module AI_CFG through the bus slave interface module AHB_slave_Interface. In step 002, the register configuration module AI_CFG configures the AI_CNN_MODE, AI_PIX_ADR, AI_NPIX_ADR, AI_WGT_ADR and AI_BIAS_ADR registers based on the address data adr and the enable signal en. The function table of the configured registers is shown in Table 1.

Table 1

    Register      Function
    AI_CNN_MODE   CNN inference mode (a value of 2 denotes continuous forward inference)
    AI_PIX_ADR    First address of the pixel data required by the current forward inference
    AI_NPIX_ADR   First address of the pixel data required by the next forward inference
    AI_WGT_ADR    First address of the weight data required by forward inference
    AI_BIAS_ADR   First address of the bias data required by forward inference

In step 003, the DMAC transfers the pixel, bias and weight data through the host interface module AHB_master_Interface; the data read from the STT-MRAM is distributed by the data distributor to the Pixel FIFO, Weight FIFO and Bias FIFO, and each kind of data is output to the data cache unit through its corresponding FIFO.

In step 004, during the convolution calculation, the read controller in the DMAC judges the WB_FIFO_FULL signal: if it is 1, the flow returns to step 004; if it is 0, step 005 is performed (the DMAC continues to read weight and bias data), after which the controller checks whether the weight and bias data have been loaded completely, and if so, the register AI_CNN_MODE is judged.

The CNN inference mode AI_CNN_MODE is then judged. If AI_CNN_MODE is 2, i.e. the inference mode is continuous inference, step 006 is executed: the DMAC loads the weight and bias data required by the next inference, and after loading finishes the FC_layer signal is judged; FC_layer indicates whether the CNN inference has proceeded to the fully connected layer, and if it is 1 the inference has reached the fully connected layer and step 007 is executed. In step 007 the DMAC moves the pixel data needed for the next inference, and after the move completes it checks whether the inference is finished; if so, step 008 is executed to write the calculation result of this inference into the STT-MRAM. After step 008, step 009 is executed to configure the relevant registers: the register group required by the CNN coprocessor is reconfigured, AI_PIX_ADR is updated to AI_NPIX_ADR, and if the next forward inference is not the last forward inference, AI_NPIX_ADR is updated to the storage address of the pixel data required by the forward inference after the next; if the next forward inference is the last one, the CNN inference mode AI_CNN_MODE is updated to the discontinuous inference mode (its value is then no longer 2). When step 009 finishes, the flow jumps back to step 004 and the convolution calculation of the next forward inference proceeds directly. If AI_CNN_MODE is not 2, the inference mode is discontinuous inference: the coprocessor checks whether the inference is finished, and if so executes step 008 (writing the calculation result into the STT-MRAM), after which the operation of the CNN coprocessor ends.
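
The flow of fig. 5 thus amounts to the loop sketched below. The step labels follow the description; the generator form and the handling of the last round via a flag are simplifications of this example, not part of the patent.

    # Sketch of the forward-inference control flow of fig. 5.
    def coprocessor_flow(ai_cnn_mode, rounds):
        """Yield the fig. 5 steps in order; AI_CNN_MODE == 2 means continuous inference."""
        yield "001-002: main processor configures registers via AI_CFG"
        yield "003: DMAC moves pixel, bias and weight data into the FIFOs/cache"
        for r in range(rounds):
            yield "004-005: gate on WB_FIFO_FULL; finish loading weights and biases"
            last = (r == rounds - 1)
            if ai_cnn_mode == 2 and not last:
                yield "006: load next round's weights and biases"
                yield "007: at the FC layer (FC_layer == 1), move next round's pixels"
            yield "008: write this round's result to STT-MRAM"
            if ai_cnn_mode == 2 and not last:
                yield "009: reconfigure registers, AI_PIX_ADR <- AI_NPIX_ADR"

    for step in coprocessor_flow(2, 2):
        print(step)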

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
