Convolutional neural network accelerator based on FPGA

Document No. 1938651 · Published 2021-12-07

Reading note: this technology, "Convolutional neural network accelerator based on FPGA", was designed by 葛志来 (Ge Zhilai), 陈智萍 (Chen Zhiping) and 朱晓梅 (Zhu Xiaomei) on 2021-10-12. Abstract: The invention discloses an FPGA-based convolutional neural network accelerator. The network structure of the convolutional neural network comprises an input layer, a first convolutional layer, a second convolutional layer, a first pooling layer, a second pooling layer, a first fully-connected layer, a second fully-connected layer and an output layer. The input layer receives an image; the image is processed in turn by the first convolutional layer, the first pooling layer, an activation function, the second convolutional layer, the second pooling layer, an activation function, the first fully-connected layer and the second fully-connected layer to obtain a plurality of feature values; the feature values are then normalized into probabilities in a Softmax classification layer, and the index corresponding to the maximum probability is the classification result. The invention realizes a high-speed FPGA accelerator and strikes a good compromise between the number of weights and the accuracy.

1. An FPGA-based convolutional neural network accelerator, characterized in that the network structure of the convolutional neural network comprises an input layer, a first convolutional layer, a second convolutional layer, a first pooling layer, a second pooling layer, a first fully-connected layer, a second fully-connected layer and an output layer,

the input layer receives an image; the image is processed in turn by the first convolutional layer, the first pooling layer, an activation function, the second convolutional layer, the second pooling layer, an activation function, the first fully-connected layer and the second fully-connected layer to obtain a plurality of feature values; the feature values are then normalized into probabilities in a Softmax classification layer, and the index corresponding to the maximum probability is the classification result;

the first convolutional layer and the second convolutional layer both adopt a convolution unrolling scheme that is parallel within a channel and serial between channels; the convolution result of a single channel is output to a buffer corresponding to that convolutional layer, and the buffer obtains the final convolution result of the corresponding layer by repeatedly reading, summing and storing;

one convolutional layer, one pooling layer and one activation function are together treated as a hierarchical layer; a buffer is arranged between two hierarchical layers, and the feature map output by a hierarchical layer, together with the corresponding bias and weight parameters, is stored in the buffer for cyclic reading as the input of the next hierarchical layer;

after the output of the previous hierarchical layer has been stored, the fully-connected layer begins to read the feature map output by that layer together with the corresponding weights and bias; the feature map and the weights are multiplied by DSP multipliers, the products belonging to the current neuron are summed, and the bias is added when the summation finishes to give the final neuron output.

2. The FPGA-based convolutional neural network accelerator of claim 1, wherein the weight parameters of the convolutional layers and the pooling layers are quantized and dequantized with a float32-to-int8 quantization algorithm, the method comprising the following steps:

a. Calculate the scale parameter s and the zero-point offset parameter z:

according to the mutual conversion relation between a floating-point number x and its fixed-point representation:

where x is the floating-point number to be quantized, q(x) is the fixed-point value after quantizing x, floor() truncates the fractional part, s is the scale, whose role is to scale the floating-point number into a fixed interval, and z is the zero point, i.e. the fixed-point offset that the floating-point value 0 is quantized to;

the scale parameter s and the zero-point offset parameter z required for quantization are obtained as follows:

where x_max and x_min are respectively the maximum and minimum of the floating-point number x, and p_max and p_min are respectively the maximum and minimum of the quantized value range;

b. When there is no bias, the convolution or pooling operation is:

where N is the number of convolution kernel parameters, x_i is the input data, w_i is the weight, and y is the convolution output of the layer; x_i, w_i and y are all float32 floating-point numbers;

quantizing x_i and w_i gives:

by inverse quantization, x_i and w_i can be expressed as:

substituting equation (5) into equation (3) gives:

the convolution output y is a floating-point number and must also be quantized before being fed into the next convolutional layer; the quantization and inverse quantization of y are:

substituting equation (7) into equation (6) gives:

the data that each layer must pass on to the next layer is the quantized value q(y); transforming equation (8) gives:

this yields the quantized data needed by the next layer and completes the function of the current layer;

a floating-point factor remains in equation (9); let M = s_w·s_x/s_y, so that M is a floating-point number, and approximate it as 2^(-n)·M_0, where n and M_0 are both positive integers and n lies between 0 and 15, such that the error between M and 2^(-n)·M_0 is on the order of 2^(-16); equation (9) is then rewritten as:

where M_0·(q(w_i)-z_w)·(q(x_i)-z_x) and z_y involve only integer arithmetic, and the factor 2^(-n) is realized in the FPGA as a right shift by n bits;

c. When a bias b is added, equation (9) becomes:

where q(b) is the quantized value of b, s_b is the scale of b and z_b is the zero point of b;

q(b) is stored as int32, and letting s_b = s_x·s_w, the quantized result needed by the next layer is expressed as:

3. The FPGA-based convolutional neural network accelerator of claim 2, wherein the maximum and minimum of the values to be quantized are required when calculating the scale; the maximum and minimum of the feature map of each layer are measured with at least 100 samples of data, and the resulting scale is used at inference time;

after M is obtained, the 2^(-n)·M_0 closest to M is found: n is taken between 0 and 15, and M_0 is taken as whichever of the two integers nearest to 2^n·M gives the smaller error; the second fully-connected layer is the last layer, so there is no need to find 2^(-n)·M_0 for it, and the factor is simply dropped during calculation.

4. The FPGA-based convolutional neural network accelerator of claim 1, wherein the convolutional layers use 5 × 5 convolution kernels, the pipeline generates the 5 × 5 region to be convolved, and shift-RAM shift registers are used as the buffer to generate the 5 × 5 region to be convolved and the convolution kernel;

when a single shift RAM is enabled by the module and a rising clock edge arrives, the data at the input is stored into the shift RAM, the existing data in the shift RAM are shifted left in turn, and the last datum is discarded; 4 shift RAMs are connected end to end to achieve an overall data shift, and the outputs of the 4 shift RAMs plus the current input form one column of the 5 × 5 matrix; obtaining the full 5 × 5 matrix requires 25 registers to receive the data output by the five taps, and through this shifted reception the pipeline generates the 5 × 5 region to be convolved together with the convolution kernel;

after a 5 × 5 convolution kernel and a region to be convolved are received, the convolution kernel is unrolled in parallel: 25 multiplications are performed in parallel by 25 instantiated DSP fixed-point multipliers, the products are obtained after a delay of one clock, and the 25 products, whose bit width is 16 bits, are then summed; the summation of the convolution is decomposed into a 6-stage pipeline, with all padding data set to 0: the 25 values are first extended to 26, and the 26 values are summed in pairs to give 13 values of 17 bits, which is the first pipeline stage; the 13 values are extended to 14 and summed in pairs to give 7 values of 18 bits, the second pipeline stage; the 7 values are extended to 8 and summed in pairs to give 4 values of 19 bits, the third pipeline stage; the 4 values are summed in pairs to give 2 values of 20 bits, the fourth pipeline stage; the 2 values are summed to give 1 value of 21 bits, the fifth pipeline stage; finally the 32-bit bias is added to obtain the final convolution result.

5. The FPGA-based convolutional neural network accelerator of claim 4, wherein the pooling layer uses 2 × 2 MaxPooling; a shift RAM with a width of 32 bits and a depth of half the channel length of the previous layer is first set up, and a row of the matrix is continuously generated through this shift RAM; the row of data from the shift RAM is shifted into and stored by four registers, generating the 2 × 2 pooling window of the pipeline; since the pooling stride is set to 2, the 2 × 2 windows generated by the pipeline are valid only at alternate positions; after a 2 × 2 window is obtained, the four numbers are compared in pairs by two combinational-logic comparators to obtain two maxima, the two outputs are compared by a further combinational-logic comparator to output the maximum, and the result is the output of the pooling layer.

6. The FPGA-based convolutional neural network accelerator of claim 1, wherein the data set used for training the convolutional neural network is the MNIST data set; the MNIST data set is first downloaded through torchvision, the number of epochs is set to 15, the batch size to 64 and the learning rate to 0.0001; the error uses cross entropy, and gradient descent is performed by stochastic gradient descent.

Technical Field

The invention belongs to the technical field of neural networks, and particularly relates to a convolutional neural network accelerator based on an FPGA (field programmable gate array).

Background

A Convolutional Neural Network (CNN) is a feed-forward neural network that mainly comprises convolutional layers, pooling layers and fully-connected layers; its weight sharing reduces the number of parameters required by a conventional fully-connected network. A CNN can extract deep features from an image, avoid excessive data preprocessing, and maintain a high recognition rate. In recent years, convolutional neural networks have achieved remarkable success in speech recognition, object detection, face recognition and other fields.

The convolutional neural network is a computation-intensive model: the convolution operations at its core involve an extremely large amount of computation, which the computing power of portable embedded devices can hardly cope with, so accelerating neural networks with low-power hardware is a current research hotspot. As a programmable device, a Field Programmable Gate Array (FPGA) contains abundant logic resources, offers high performance, low power consumption and reconfigurability, and can realize the large number of independent convolution operations in a CNN in a multi-path parallel manner. The first neural network accelerator built on an FPGA appeared in 1994, but because neural networks received little attention at the time, FPGA-based accelerator technology was likewise overlooked. With the milestone network AlexNet at the 2012 ILSVRC challenge, neural networks set off a new wave of interest. As the computation and parameter counts of neural networks grow day by day, researchers have turned to reprogrammable, low-power hardware platforms; FPGA deployments of CNNs now appear widely at international conferences and in journals, and in 2018 the number of papers on FPGA-based neural network accelerators published on IEEE Xplore reached 69.

However, the storage space and resources on an FPGA development board are limited. Taking the classical convolutional neural network LeNet for recognizing the MNIST handwritten digit data set as an example, the recognition rate can exceed 98%, but the total number of weight parameters exceeds 430,000, which consumes a great deal of storage space and resources on the FPGA development board.

Disclosure of Invention

Purpose of the invention: aiming at the shortcomings of the prior art, the invention provides a lightweight convolutional neural network acceleration system based on an FPGA platform, in order to reduce the number of CNN weight parameters and save resource consumption on the FPGA chip.

The technical scheme is as follows: the network structure of the FPGA-based convolutional neural network accelerator comprises an input layer, a first convolutional layer, a second convolutional layer, a first pooling layer, a second pooling layer, a first fully-connected layer, a second fully-connected layer and an output layer,

the input layer receives an image; the image is processed in turn by the first convolutional layer, the first pooling layer, an activation function, the second convolutional layer, the second pooling layer, an activation function, the first fully-connected layer and the second fully-connected layer to obtain a plurality of feature values; the feature values are then normalized into probabilities in a Softmax classification layer, and the index corresponding to the maximum probability is the classification result;

the first convolutional layer and the second convolutional layer both adopt a convolution unrolling scheme that is parallel within a channel and serial between channels; the convolution result of a single channel is output to a buffer corresponding to that convolutional layer, and the buffer obtains the final convolution result of the corresponding layer by repeatedly reading, summing and storing;

one convolutional layer, one pooling layer and one activation function are together treated as a hierarchical layer; a buffer is arranged between two hierarchical layers, and the feature map output by a hierarchical layer, together with the corresponding bias and weight parameters, is stored in the buffer for cyclic reading as the input of the next hierarchical layer;

after the output of the previous hierarchical layer has been stored, the fully-connected layer begins to read the feature map output by that layer together with the corresponding weights and bias; the feature map and the weights are multiplied by DSP multipliers, the products belonging to the current neuron are summed, and the bias is added when the summation finishes to give the final neuron output.

In a further preferred technical scheme of the invention, the weight parameters of the convolutional layers and the pooling layers are quantized and dequantized with a float32-to-int8 quantization algorithm; the specific method comprises the following steps:

a. Calculate the scale parameter s and the zero-point offset parameter z:

according to the mutual conversion relation between a floating-point number x and its fixed-point representation:

where x is the floating-point number to be quantized, q(x) is the fixed-point value after quantizing x, floor() truncates the fractional part, s is the scale, whose role is to scale the floating-point number into a fixed interval, and z is the zero point, i.e. the fixed-point offset that the floating-point value 0 is quantized to;

the scale parameter s and the zero-point offset parameter z required for quantization are obtained as follows:

where x_max and x_min are respectively the maximum and minimum of the floating-point number x, and p_max and p_min are respectively the maximum and minimum of the quantized value range;

b. When there is no bias, the convolution or pooling operation is:

where N is the number of convolution kernel parameters, x_i is the input data, w_i is the weight, and y is the convolution output of the layer; x_i, w_i and y are all float32 floating-point numbers;

quantizing x_i and w_i gives:

by inverse quantization, x_i and w_i can be expressed as:

substituting equation (5) into equation (3) gives:

the convolution output y is a floating-point number and must also be quantized before being fed into the next convolutional layer; the quantization and inverse quantization of y are:

substituting equation (7) into equation (6) gives:

the data that each layer must pass on to the next layer is the quantized value q(y); transforming equation (8) gives:

this yields the quantized data needed by the next layer and completes the function of the current layer;

a floating-point factor remains in equation (9); let M = s_w·s_x/s_y, so that M is a floating-point number, and approximate it as 2^(-n)·M_0, where n and M_0 are both positive integers and n lies between 0 and 15, such that the error between M and 2^(-n)·M_0 is on the order of 2^(-16); equation (9) is then rewritten as:

where M_0·(q(w_i)-z_w)·(q(x_i)-z_x) and z_y involve only integer arithmetic, and the factor 2^(-n) is realized in the FPGA as a right shift by n bits;

c. When a bias b is added, equation (9) becomes:

where q(b) is the quantized value of b, s_b is the scale of b and z_b is the zero point of b;

q(b) is stored as int32, and letting s_b = s_x·s_w, the quantized result needed by the next layer is expressed as:

Preferably, the maximum and minimum of the values to be quantized are required when calculating the scale; the maximum and minimum of the feature map of each layer are measured with at least 100 samples of data, and the resulting scale is used at inference time;

after M is obtained, the 2^(-n)·M_0 closest to M is found: n is taken between 0 and 15, and M_0 is taken as whichever of the two integers nearest to 2^n·M gives the smaller error; the second fully-connected layer is the last layer, so there is no need to find 2^(-n)·M_0 for it, and the factor is simply dropped during calculation.

Preferably, the convolutional layers use 5 × 5 convolution kernels, the pipeline generates the 5 × 5 region to be convolved, and shift-RAM shift registers are used as the buffer to generate the 5 × 5 region to be convolved and the convolution kernel;

when a single shift RAM is enabled by the module and a rising clock edge arrives, the data at the input is stored into the shift RAM, the existing data in the shift RAM are shifted left in turn, and the last datum is discarded; 4 shift RAMs are connected end to end to achieve an overall data shift, and the outputs of the 4 shift RAMs plus the current input form one column of the 5 × 5 matrix; obtaining the full 5 × 5 matrix requires 25 registers to receive the data output by the five taps, and through this shifted reception the pipeline generates the 5 × 5 region to be convolved together with the convolution kernel, as modelled in the sketch below;

after a 5 × 5 convolution kernel and a region to be convolved are received, the convolution kernel is unrolled in parallel: 25 multiplications are performed in parallel by 25 instantiated DSP fixed-point multipliers, the products are obtained after a delay of one clock, and the 25 products, whose bit width is 16 bits, are then summed; the summation of the convolution is decomposed into a 6-stage pipeline, with all padding data set to 0: the 25 values are first extended to 26, and the 26 values are summed in pairs to give 13 values of 17 bits, which is the first pipeline stage; the 13 values are extended to 14 and summed in pairs to give 7 values of 18 bits, the second pipeline stage; the 7 values are extended to 8 and summed in pairs to give 4 values of 19 bits, the third pipeline stage; the 4 values are summed in pairs to give 2 values of 20 bits, the fourth pipeline stage; the 2 values are summed to give 1 value of 21 bits, the fifth pipeline stage; finally the 32-bit bias is added to obtain the final convolution result.
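As a rough software model of this window generator (a minimal sketch only: the Python deques stand in for the chained shift RAMs and the 25 receiving registers, `width` is the row length of the input feature map, and edge masking at row boundaries is omitted), the stream-driven generation of 5 × 5 windows can be written as:

```python
from collections import deque

def sliding_5x5_windows(pixels, width):
    """Model of the shift-RAM window generator: four chained row buffers
    (the shift RAMs) plus a 5x5 register array; once the buffers are primed,
    every new input pixel yields one 5x5 window."""
    rows = [deque([0] * width, maxlen=width) for _ in range(4)]   # 4 shift RAMs
    window = [deque([0] * 5, maxlen=5) for _ in range(5)]         # 25 receiving registers
    for count, p in enumerate(pixels, start=1):
        outs = [r[0] for r in rows]                       # oldest value in each shift RAM
        column = [outs[3], outs[2], outs[1], outs[0], p]  # one column of the 5x5 matrix
        rows[3].append(outs[2]); rows[2].append(outs[1])  # end-to-end chaining of the RAMs
        rows[1].append(outs[0]); rows[0].append(p)
        for r in range(5):
            window[r].append(column[r])                   # shift the window registers
        if count >= 4 * width + 5:                        # buffers primed: window is valid
            yield [list(w) for w in window]
```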

Preferably, the pooling layer uses 2 × 2 MaxPooling; a shift RAM with a width of 32 bits and a depth of half the channel length of the previous layer is first set up, and a row of the matrix is continuously generated through this shift RAM; the row of data from the shift RAM is shifted into and stored by four registers, generating the 2 × 2 pooling window of the pipeline; since the pooling stride is set to 2, the 2 × 2 windows generated by the pipeline are valid only at alternate positions; after a 2 × 2 window is obtained, the four numbers are compared in pairs by two combinational-logic comparators to obtain two maxima, the two outputs are compared by a further combinational-logic comparator to output the maximum, and the result is the output of the pooling layer.

Preferably, the data set used for training the convolutional neural network is the MNIST data set; the MNIST data set is first downloaded through torchvision, the number of epochs is set to 15, the batch size to 64 and the learning rate to 0.0001; the error uses cross entropy, and gradient descent is performed by stochastic gradient descent.
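A minimal PyTorch training sketch matching these settings (epochs = 15, batch size = 64, learning rate = 0.0001, cross-entropy loss, stochastic gradient descent); the `model` argument stands for the lightweight network described in this document and is not defined here:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train(model: nn.Module, device: str = "cpu") -> None:
    # MNIST downloaded through torchvision, as stated above
    train_set = datasets.MNIST("data", train=True, download=True,
                               transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=64, shuffle=True)

    criterion = nn.CrossEntropyLoss()                         # cross-entropy error
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # stochastic gradient descent

    model.to(device).train()
    for epoch in range(15):                                   # epoch set to 15
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch + 1}: last batch loss {loss.item():.4f}")
```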

Beneficial effects: (1) in the convolutional neural network acceleration system based on the FPGA platform, a lightweight convolutional neural network is first constructed on the basis of LeNet at the software level in PyTorch; a convolution unrolling scheme with better generality and performance is selected, which facilitates DSP reuse; finally a high-speed FPGA accelerator is realized that can be applied to handwritten digit recognition;

(2) In the PyTorch framework, the weight parameters of each layer of the CNN network are stored and operated on in float32 format; the FPGA cannot perform floating-point operations directly and its DSP units are better suited to fixed-point operations, so, considering both computation and storage, the parameters are quantized and M is approximated by 2^(-n)·M_0. Under 500 samples the error before and after quantization is small, and the accuracy of the network trained with PyTorch differs from that of the network finally deployed on the FPGA by only 0.2%, which is negligible.

Drawings

FIG. 1 is a network architecture diagram of a convolutional neural network of the present invention;

FIG. 2 is a hardware framework diagram of the convolutional neural network accelerator of the present invention;

FIG. 3 is a shift ram schematic;

FIG. 4 is a shift ram connection;

FIG. 5 is a flowchart illustrating the operation of the buffer corresponding to a convolutional layer;

FIG. 6 is a flowchart of the operation of the cache between two hierarchical layers;

FIG. 7 is a full connectivity layer workflow diagram;

FIG. 8 is a power consumption parameter graph of a convolutional neural network of an embodiment.

Detailed Description

The technical solution of the present invention is described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the embodiments.

Example: the network structure of the convolutional neural network comprises an input layer, a first convolutional layer, a second convolutional layer, a first pooling layer, a second pooling layer, a first fully-connected layer, a second fully-connected layer and an output layer.

The convolution unrolling methods can be mainly divided into the following three:

1. Parallelism within a convolution.

2. Parallelism between different input channels.

3. Parallelism between different convolution kernels.

The ideal state of the accelerator is parallelism within the convolution, between different input channels and between different convolution kernels, with a pipeline built on top to reach a fully global pipeline. However, the higher the degree of parallel unrolling, the more DSP resources are needed: in that case the first convolutional layer alone would need 250 DSPs and the second convolutional layer 5000 DSPs, i.e. 5250 DSPs for the two layers. The ZYNQ-7020 series has only 220 DSPs, and a larger network would need even more, so global parallelism is not feasible.

In summary, the three kinds of parallelism cannot all be achieved at the same time, so one or two of them must be given up. Because keeping two of them would still require more than 220 DSPs, the invention adopts a convolution unrolling scheme that is parallel within a channel and serial between channels; the hardware framework is shown in FIG. 2. Since the channels are fed in serially, convolution is performed one channel at a time and the final convolution is the sum of the convolutions of all channels, so a single-channel convolution accumulation and buffering module is added.

A 1 × 28 × 28 pixel image is input; the image is processed in turn by the first convolutional layer, the first pooling layer, an activation function, the second convolutional layer, the second pooling layer, an activation function, the first fully-connected layer and the second fully-connected layer to obtain a plurality of feature values; the feature values are then normalized into probabilities in a Softmax classification layer, and the index corresponding to the maximum probability is the classification result.
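The layer ordering described above can be sketched in PyTorch as follows; this is an illustrative reconstruction only, and the channel widths (6 and 16) and the hidden size of fc1 (84) are assumptions, not values given in the text:

```python
import torch
import torch.nn as nn

class LightCNN(nn.Module):
    """Sketch of the described structure: conv1 -> pool1 -> activation ->
    conv2 -> pool2 -> activation -> fc1 -> fc2 -> Softmax. Channel counts
    and the fc1 width are assumed for illustration."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)    # first convolutional layer (5x5)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)   # second convolutional layer (5x5)
        self.pool = nn.MaxPool2d(2, stride=2)          # 2x2 MaxPooling
        self.fc1 = nn.Linear(16 * 4 * 4, 84)           # first fully-connected layer
        self.fc2 = nn.Linear(84, num_classes)          # second fully-connected layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.pool(self.conv1(x)))       # conv -> pool -> activation
        x = torch.relu(self.pool(self.conv2(x)))       # conv -> pool -> activation
        x = torch.flatten(x, 1)
        return self.fc2(self.fc1(x))                   # feature values (logits)

    def classify(self, x: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(self.forward(x), dim=1)  # Softmax probability normalization
        return probs.argmax(dim=1)                     # index of the max probability = class
```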

Convolutional layers:

the CNN network of the invention has two convolution layers, the hardware design of the convolution module is shown as figure 3, and the convolution is realized by performing convolution serial on a single channel of each convolution core and a corresponding characteristic diagram channel.

The convolution data are read from Block RAM. Since the invention uses 5 × 5 convolution kernels, a pipeline is needed to generate the 5 × 5 regions to be convolved; because the data arrive as a stream, one value at a time, generating a 5 × 5 region to be convolved and a convolution kernel requires four or five rows of data to be held in a buffer, and the invention uses shift-RAM shift registers to generate the 5 × 5 matrix. The shift principle of a single shift RAM is shown in FIG. 3: when the module is enabled and a rising clock edge arrives, the data at the input is stored into the shift RAM, the existing data in the shift RAM are shifted left in turn, and the last datum is discarded. The invention uses 4 shift RAMs as the buffer to generate the 5 × 5 matrix; the 4 shift RAMs are connected as shown in FIG. 4, end to end, so that an overall data shift is achieved, and one column of the 5 × 5 matrix is obtained from the outputs of the 4 shift RAMs plus the current input. Four shift RAMs can provide at most one column of the 5 × 5 matrix, so obtaining the full 5 × 5 matrix requires 25 registers to receive the data output by the five taps, again by shifted reception, so that the pipeline can generate the 5 × 5 region to be convolved together with the convolution kernel. After a 5 × 5 convolution kernel and a region to be convolved are received, the convolution kernel is unrolled in parallel: 25 multiplications are performed in parallel by instantiating 25 DSP fixed-point multipliers, and the products are obtained after a delay of one clock. The 25 products then need to be accumulated; at this point the data bit width is 16 bits, and accumulating 25 values of 16 bits is a complex operation whose timing cannot close at high clock frequencies, so the operation is decomposed in a pipelined manner, allowing the system to run stably under a high-frequency system clock. The invention decomposes the summation of the convolution into a 6-stage pipeline, with all padding data set to 0, as sketched below. The 25 values are first extended to 26, and the 26 values are summed in pairs to give 13 values of 17 bits, the first pipeline stage; the 13 values are extended to 14 and summed in pairs to give 7 values of 18 bits, the second pipeline stage; the 7 values are extended to 8 and summed in pairs to give 4 values of 19 bits, the third pipeline stage; continuing the pairwise summation gives the final 21-bit result, which covers the fourth and fifth pipeline stages. Finally the 32-bit bias is added to obtain the final convolution result. However, because the parallel unrolling of this design is parallel within a single convolution-kernel channel and serial between channels, the bias is not added in this per-channel module, which prevents the bias from being accumulated repeatedly.
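The 6-stage pairwise summation can be modelled in software as below (a behavioural sketch only; the hardware additionally registers each stage and lets the bit width grow from 16 to 21 bits):

```python
def adder_tree_sum(products, stages=5):
    """Pairwise pipelined reduction of the 25 products: pad an odd count with
    a zero, sum neighbouring pairs, and repeat; after 5 stages the 25 values
    collapse to a single sum, to which the 32-bit bias is added separately."""
    values = list(products)
    for _ in range(stages):
        if len(values) % 2:                                   # extend with a zero
            values.append(0)
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    assert len(values) == 1
    return values[0]

# e.g. adder_tree_sum([1] * 25) == 25
```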

The convolution unrolling of the invention is parallel within a channel and serial between channels, so the output of the convolution module is the convolution result of a single channel, while the final output of the convolutional layer is the sum over all channels plus the bias; therefore a buffer must be provided for the convolutional layer. The simplest approach would be to buffer the convolution results of all channels output by the convolutional layer and then read and accumulate them, but this would occupy a large amount of storage, and when the number of convolution channels is too large the on-chip RAM would be insufficient. The invention therefore provides a buffer only one channel deep and obtains the final convolution result by repeatedly reading, summing and storing; the implementation is shown schematically in FIG. 5 and modelled in the sketch below. When the first channel of the current convolution kernel produces its result through the convolution module, the result is stored directly in the buffer. When the result of a subsequent channel is output by the convolution module to be cached, the current content of the buffer is read; because reading data from on-chip RAM has a latency of two clocks, the output result and the enable of the current convolution are buffered for two stages, the content of the buffer is read at that moment, and the value read from the buffer is accumulated with the result of the current channel's convolution and stored back into the buffer. When the convolution kernel outputs the convolution of its last channel, the value read from the buffer is accumulated to give the final convolution output; this result is not stored back into the cache but output directly, and after the 32-bit bias is added the result passes through the ReLU activation function and is output to the pooling layer.
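A minimal software model of this single-channel-deep accumulation buffer (the `conv_single_channel` argument is a placeholder for the per-channel convolution module; the two-clock read latency of the on-chip RAM is not modelled):

```python
import numpy as np

def channel_serial_conv(conv_single_channel, feature_map, kernel, bias):
    """Single-channel-deep accumulation as in FIG. 5: the first channel is
    stored directly, the following channels are read-accumulated-stored, and
    after the last channel the bias is added (only once) and ReLU is applied
    instead of writing back to the buffer."""
    buffer = None                                            # buffer one channel deep
    for ch in range(feature_map.shape[0]):
        partial = conv_single_channel(feature_map[ch], kernel[ch])
        buffer = partial if buffer is None else buffer + partial
    out = buffer + bias                                      # 32-bit bias added once
    return np.maximum(out, 0)                                # ReLU, then to the pooling layer
```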

Pooling layer:

the pooling of the invention adopts 2 × 2 MaxPooling, the operation essence of the pooling layer is similar to that of the convolution layer, and the operation is matrix operation, only the 2 × 2 matrix is generated here, firstly a shift ram with the width of 32bit and the depth of half of the channel length of the previous layer is set, a row of data of the matrix is continuously generated through the shift ram, a row of data obtained by shifting and storing the shift ram by four registers can generate a 2 × 2 pooling window of the production line, and the step length of the pooling is set to be 2, so the 2 × 2 window generated by the production line is not continuously effective, but is effective at intervals. After a 2 × 2 window is obtained, the four numbers are compared pairwise through two combinational logics to obtain a maximum value, the obtained two outputs are compared through one combinational logic to output the maximum value, and the obtained result is the output of the pooling layer.

Interlayer caching:

The convolutional neural network may treat convolutional layer + pooling layer + activation function as one hierarchical layer. Each hierarchical layer needs to read the feature map cyclically several times, so the feature map must have a buffer for cyclic reading; and since the output of one hierarchical layer is the input of the next, the output of each hierarchical layer also needs a buffer for its data. The flow of the buffer design is shown in FIG. 6: between the hierarchical layers, block RAM is used to buffer the output of each layer, and the weights and bias are stored there as well. When a layer produces output, each datum is stored into the block RAM; when the convolution of the last convolution kernel finishes, the read enable is set to 1 and the next layer begins to read the feature map from the block RAM, reading the weights and bias at the same time.

Fully-connected layer:

After the output of the previous hierarchical layer has been stored, the read signal is enabled and the fully-connected layer begins to read the feature map, the weights and the bias; the feature map and the weights are multiplied by DSP multipliers, the products belonging to the current neuron are summed, and the bias is added when the summation finishes to give the final neuron output. The design flow of the fully-connected layer is shown in FIG. 7, and a software model is sketched below.
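A minimal sketch of the multiply-accumulate behaviour of the fully-connected layer (one DSP multiply per product; the names are illustrative):

```python
def fully_connected(feature, weights, biases):
    """For each neuron: multiply the feature values by that neuron's weights
    (DSP multiplies), accumulate the products, and add the bias only when the
    summation has finished, giving the final neuron output."""
    outputs = []
    for neuron_w, neuron_b in zip(weights, biases):
        acc = 0
        for x, w in zip(feature, neuron_w):
            acc += x * w                        # DSP multiply-accumulate
        outputs.append(acc + neuron_b)          # bias added at the end of the sum
    return outputs
```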

Quantization:

In the PyTorch framework, the weight parameters of each layer of the CNN network are stored and operated on in float32 format; the FPGA cannot perform floating-point operations directly and its DSP units are better suited to fixed-point operations, so, considering both computation and storage, the parameters of the convolutional neural network need to be quantized.

The specific method comprises the following steps:

a. Calculate the scale parameter s and the zero-point offset parameter z:

according to the mutual conversion relation between a floating-point number x and its fixed-point representation:

where x is the floating-point number to be quantized, q(x) is the fixed-point value after quantizing x, floor() truncates the fractional part, s is the scale, whose role is to scale the floating-point number into a fixed interval, and z is the zero point, i.e. the fixed-point offset that the floating-point value 0 is quantized to;

the scale parameter s and the zero-point offset parameter z required for quantization are obtained as follows:

where x_max and x_min are respectively the maximum and minimum of the floating-point number x, and p_max and p_min are respectively the maximum and minimum of the quantized value range;
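Following the definitions in step a, a minimal Python sketch of the scale/zero-point computation and of quantization and inverse quantization; the int8 limits (127, -128) and the exact zero-point rounding are assumptions, since the original formulas are not reproduced here:

```python
import math

def quant_params(x_max, x_min, p_max=127, p_min=-128):
    """Scale s maps the floating-point range onto the quantized range;
    zero point z is the fixed-point value that the float 0 maps to."""
    s = (x_max - x_min) / (p_max - p_min)
    z = p_max - math.floor(x_max / s)       # one common choice of zero point (assumed)
    return s, z

def quantize(x, s, z):
    return math.floor(x / s) + z            # q(x) = floor(x / s) + z

def dequantize(q, s, z):
    return s * (q - z)                      # x is recovered approximately as s * (q(x) - z)
```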

b. When there is no bias, the convolution or pooling operation is:

where N is the number of convolution kernel parameters, x_i is the input data, w_i is the weight, and y is the convolution output of the layer; x_i, w_i and y are all float32 floating-point numbers;

quantizing x_i and w_i gives:

by inverse quantization, x_i and w_i can be expressed as:

substituting equation (5) into equation (3) gives:

the convolution output y is a floating-point number and must also be quantized before being fed into the next convolutional layer; the quantization and inverse quantization of y are:

substituting equation (7) into equation (6) gives:

the data that each layer must pass on to the next layer is the quantized value q(y); transforming equation (8) gives:

this yields the quantized data needed by the next layer and completes the function of the current layer;

a floating-point factor remains in equation (9); let M = s_w·s_x/s_y, so that M is a floating-point number, and approximate it as 2^(-n)·M_0, where n and M_0 are both positive integers and n lies between 0 and 15, such that the error between M and 2^(-n)·M_0 is on the order of 2^(-16); equation (9) is then rewritten as:

where M_0·(q(w_i)-z_w)·(q(x_i)-z_x) and z_y involve only integer arithmetic, and the factor 2^(-n) is realized in the FPGA as a right shift by n bits;

c. When a bias b is added, equation (9) becomes:

where q(b) is the quantized value of b, s_b is the scale of b and z_b is the zero point of b;

q(b) is stored as int32, and letting s_b = s_x·s_w, the quantized result needed by the next layer is expressed as:
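Putting equations (9)-(12) together, a hedged sketch of how one quantized output value is computed with integer arithmetic only (assuming, as above, that the bias is stored as int32 with s_b = s_x·s_w and that M is replaced by M_0·2^(-n), i.e. an integer multiply followed by an n-bit right shift):

```python
def quantized_output(qx, qw, qb, zx, zw, zb, zy, M0, n):
    """Integer-only evaluation of one output value: integer multiply-accumulate
    of the zero-point-corrected inputs and weights, the bias in the same scale,
    then the M0 multiply and right shift that replace the floating-point M."""
    acc = sum((w - zw) * (x - zx) for x, w in zip(qx, qw))   # integer MACs
    acc += qb - zb                                           # int32 bias, s_b = s_x * s_w
    return ((M0 * acc) >> n) + zy                            # q(y) = M0 * 2^-n * acc + z_y
```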

when calculating the maximum value and the minimum value of the value to be quantized required by scale, testing the maximum value and the minimum value of the characteristic diagram of each layer by using at least 100 parts of data, and obtaining scale results which are shown in the following table and used for predicting scale;

after obtaining M, find 2 closest to M-nM0Let n be between 0 and 15, M0GetAndthe results and errors obtained are as follows:

Type n M0 error
conv1 15 27 6.94e-6
conv2 14 15 3.07e-6
fc1 14 19 1.25e-5

Since fc2 is the last layer, there is no need to find 2^(-n)·M_0 for it; the factor is simply dropped during calculation, because scaling the outputs by a positive constant does not change which class has the maximum value.
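A small sketch of the search for the 2^(-n)·M_0 approximation described above (n from 0 to 15, M_0 chosen as the better of the two integers nearest to M·2^n):

```python
def approximate_M(M: float):
    """Return the (n, M0) pair, with n in [0, 15], that minimises
    |M - M0 * 2**-n|, trying the floor and ceiling of M * 2**n as M0."""
    best = None
    for n in range(16):
        for M0 in (int(M * 2 ** n), int(M * 2 ** n) + 1):    # floor and ceiling candidates
            err = abs(M - M0 * 2 ** -n)
            if best is None or err < best[2]:
                best = (n, M0, err)
    return best   # (n, M0, error)
```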

The quality of quantization is measured by the precision loss, i.e. the difference between the test-set accuracy after quantization and before quantization. The errors of this method come mainly from two sources: the error of inverse quantization and the error of approximating M by 2^(-n)·M_0. Under 500 samples, the accuracy error of the design is as follows. As the table shows, the precision error before and after quantization is very small, and the accuracy of the network trained with PyTorch differs from that of the network finally deployed on the FPGA by only 0.2%, which is negligible.

                     Accuracy   Error
Before quantization  97%
After quantization   97%        0%
M approximation      96.8%      0.2%

And (3) performance testing:

the CNN network in the embodiment is a lightweight convolutional neural network designed based on LeNet, the used data set is an MNIST data set, and the used FPGA platform is a ZYNQ-7020 series development board which comprises an FPGA chip and two ARM-A9 processors. The EDA (electronic design automation) tool used was vivado2018.3 from Xilinx corporation; the software tool used anaconda + python3.6 and the deep learning framework used was pytorch1.7.0.

In this embodiment, the resources consumed by the convolutional neural network accelerator designed on the PL side are shown in the table below. The two convolutional layers each use 25 DSPs for the parallel unrolling of a convolution channel, the three dequantization stages use 6 DSPs in total and the multiplications of the two fully-connected layers use 2 DSPs, giving 58 DSPs in total; the three intermediate-layer stores occupy 9 BRAMs in total and the buffers assisting the convolution a further 2 BRAMs, giving 11 BRAMs in total. As the table shows, the accelerator designed by the invention uses only a very small amount of resources, meeting the initial design expectation.

Resource   Used   Available   Utilization %
LUT        2110   53200       3.97
LUTRAM     151    17400       0.87
FF         3555   106400      3.34
BRAM       11     140         7.86
DSP        58     220         26.36
IO         6      125         4.8
MMCM       1      4           25

The power consumption estimate of this embodiment is produced by the Xilinx EDA tool Vivado and is shown in FIG. 8. The total power of the accelerator on the PL side is 0.402 W, of which the main part is dynamic power, i.e. the consumption of the FPGA switching states. The MMCM entry is the resource consumption of clock multiplication; multiplying a low-frequency clock up to a high-frequency clock takes a relatively large share of the power. Apart from that, the highest consumption comes from the DSPs and the BRAMs: the DSPs are mainly used for the parallel unrolling of the products and the BRAMs for intermediate parameter storage, and these are the core of the CNN network. As the figure shows, the power consumption of the accelerator is low, and the junction temperature during operation is 29.6 °C, which is also a suitable operating state for the chip.

The performance evaluation of a CNN accelerator consists mainly of two aspects, precision and speed; the precision has been compared above, with a drop of only 0.2% relative to the PyTorch model. For speed, in order to show the advantages of the CNN accelerator, this embodiment compares its inference speed with a CPU platform; the specifics of the CPU are as follows:

embedded CPU platform: ARM-A9 embedded CPU, the operating frequency is 1 Ghz.

The inference speed of the accelerator in this embodiment of the invention is as follows: the FPGA-designed accelerator takes 0.267 ms to infer one frame, while the ARM-A9 embedded CPU takes 1310 ms to infer one frame, so the inference speed of the FPGA is about 4906 times that of the CPU.

Comparison with CPU

Device ARM-A9 FPGA
Clock(Hz) 1G 200M
Memory(MB) 1024 4.9
Latency per img(ms) 1310 0.267
FPS (1/s) 0.76 3748

By comparison, the low-power, low-resource CNN accelerator designed on the FPGA has a structure that readily allows DSP reuse; its resource consumption, power and precision show that the design is fully applicable to embedded platforms with limited resources and power budgets. At the same time, the comparison with ARM-A9 embedded CPU inference fully demonstrates that the design has a very good acceleration effect on the convolutional neural network.

As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
