Multi-channel convolution FPGA (field programmable gate array) architecture based on combined storage and computation, and working method thereof

Document No.: 1862177  Publication date: 2021-11-19

Reading note: This technology, "Multi-channel convolution FPGA (field programmable gate array) architecture based on combined storage and computation, and working method thereof" (基于存算结合的多通道卷积fpga架构及其工作方法), was designed and created by 师皓, 徐�明, 陈亮, and 张娜 on 2021-08-17. Its main content is as follows: the invention provides a multi-channel convolution FPGA architecture based on combined storage and computation and a working method thereof, the architecture comprising: an external memory, a feature map on-chip cache module, a convolution weight on-chip cache module, and a convolution calculation module. The external memory comprises a first external memory and a second external memory; the first external memory outputs feature map data to the feature map on-chip cache module, and the second external memory outputs convolution weights to the convolution weight on-chip cache module. The feature map on-chip cache module and the convolution weight on-chip cache module are both connected with the convolution calculation module and respectively output a plurality of feature map data windows and a plurality of convolution weight windows to it. The convolution calculation module calculates according to the feature map data windows and the convolution weight windows and outputs a calculation result. By outputting data streams over multiple channels in parallel, the invention combines storage with computation and improves the data transmission efficiency of the FPGA architecture.

1. A multi-channel convolution FPGA architecture based on combined storage and computation, comprising: an external memory, a feature map on-chip cache module, a convolution weight on-chip cache module, and a convolution calculation module; wherein

there are two external memories, namely a first external memory and a second external memory; the first external memory is connected with the feature map on-chip cache module and outputs feature map data to it, and the second external memory is connected with the convolution weight on-chip cache module and outputs convolution weights to it;

the feature map on-chip cache module is connected with the convolution calculation module and outputs a plurality of feature map data windows to the convolution calculation module in parallel;

the convolution weight on-chip cache module is connected with the convolution calculation module and outputs a plurality of convolution weight windows to the convolution calculation module in parallel;

and the convolution calculation module performs calculation according to the feature map data windows and the convolution weight windows and outputs a calculation result.

2. The multi-channel convolution FPGA architecture of claim 1, wherein the first external memory stores the feature map data in a cross-cached (interleaved) manner.

3. The multi-channel convolution FPGA architecture of claim 1, wherein the feature map on-chip cache module includes Conv_num FIFO storage units and an LBUF line cache unit; the FIFO storage units are connected to the first external memory, receive the feature map data sent by the first external memory, and transmit the feature map data to the LBUF line cache unit; the LBUF line cache unit is connected with the convolution calculation module and transmits the feature map data to the convolution calculation module.

4. The multi-channel convolution FPGA architecture of claim 1, wherein the convolution weight on-chip cache module is implemented with the FPGA's internal lookup tables and flip-flop logic, and is used for caching the convolution weights.

5. The multi-channel convolution FPGA architecture of claim 1, wherein the convolution calculation module includes Unit_num convolution calculation submodules for performing convolution calculation.

6. The multi-channel convolution FPGA architecture of claim 5, wherein the convolution calculation submodule includes Conv_num convolution calculation engine units, an output cache unit, and a channel-by-channel convolution calculation unit; each convolution calculation engine unit comprises 9 DSP48E subunits, completes the calculation of a feature map window with a convolution weight window, and zeroes redundant convolution weights; the output cache unit comprises Unit_num BRAMs for caching output data; the channel-by-channel convolution calculation unit completes the channel-by-channel convolution calculation process.

7. A working method of a multi-channel convolution FPGA architecture based on combined storage and computation, characterized by comprising the following steps:

acquiring feature map data and convolution weights from external memory, and transmitting them to a feature map on-chip cache module and a convolution weight on-chip cache module, respectively;

the feature map on-chip cache module assembles a feature map data window from the feature map data and sends it to a convolution calculation module; meanwhile, the convolution weight on-chip cache module assembles a convolution weight window from the convolution weights and sends it to the convolution calculation module;

and the convolution calculation module performs calculation according to the feature map data window and the convolution weight window and outputs a calculation result.

8. The working method of claim 7, wherein after the feature map data and the convolution weights acquired from external memory are transmitted to the feature map on-chip cache module and the convolution weight on-chip cache module respectively, and before the feature map on-chip cache module sends a feature map data window to the convolution calculation module, the method further comprises:

when the convolution kernel size is K x K, reading the feature map data stream in AXI Stream form from the external memory, and inputting the feature map data sequentially into all FIFO storage units for buffering;

after all FIFO storage units are filled, transmitting the feature map data to the LBUF line cache unit;

and storing the adjacent (K-1) lines of feature map data in the LBUF line cache unit, so that when the K-th line of feature map data is received, K x K data windows are output in sequence to the convolution calculation module to await convolution operation.

9. The working method of claim 7, wherein before the convolution weight on-chip cache module sends the convolution weight windows to the convolution calculation module for calculation against the feature map data windows, the method further comprises:

inputting the convolution weights into the convolution weight on-chip cache module in sequence;

and after all convolution weights are cached in the convolution weight on-chip cache module, outputting Unit_num large groups of convolution weights in sequence, each large group containing Conv_num groups of convolution weights.

Technical Field

The invention relates to the technical field of FPGA (field programmable gate array) architectures, and in particular to a multi-channel convolution FPGA architecture based on combined storage and computation and a working method thereof.

Background

Convolutional neural networks (CNNs) are applied effectively in image detection and recognition, but when a mobile-side system deploys a CNN for real-time target recognition, completing the convolution calculation on the mobile central processing unit alone imposes a very heavy computational load. Therefore, using an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) accelerator, or a graphics processing unit (GPU) as a coprocessor to improve recognition speed, reduce recognition latency, and relieve the computational burden of the central processing unit is of great significance. A target recognition system on the mobile side also needs low power consumption, which favors an FPGA-based convolutional neural accelerator over a GPU.

Google proposed the EfficientNet convolutional neural network in 2019, obtaining a series of EfficientNet models by scaling up the EfficientNet-B0 base model. However, EfficientNet contains residual branches and attention-mechanism modules, so its network structure is complex, which hinders deploying the EfficientNet network on the mobile or embedded side. The reconfigurability of the FPGA can adapt to changes in the convolutional neural network structure, so deploying an EfficientNet convolutional neural network accelerator on an FPGA has considerable value.

At present, the single central computing array of an FPGA-based convolutional neural network accelerator suits traditional standard convolution, with its large parameter count and few distinct layer types. For the newer depthwise-separable convolution structure, and for the depthwise convolution in particular, the computation is sparser than in traditional standard convolution, so a traditional single central convolution computing array is inefficient.

Disclosure of Invention

In view of the foregoing, it is necessary to provide a multi-channel convolution FPGA architecture based on storage and computation combination and a working method thereof.

A multi-channel convolution FPGA architecture based on combined storage and computation, comprising: an external memory, a feature map on-chip cache module, a convolution weight on-chip cache module, and a convolution calculation module. There are two external memories, namely a first external memory and a second external memory; the first external memory is connected with the feature map on-chip cache module and outputs feature map data to it, and the second external memory is connected with the convolution weight on-chip cache module and outputs convolution weights to it. The feature map on-chip cache module is connected with the convolution calculation module and outputs a plurality of feature map data windows to it in parallel; the convolution weight on-chip cache module is connected with the convolution calculation module and outputs a plurality of convolution weight windows to it in parallel; and the convolution calculation module performs calculation according to the feature map data windows and the convolution weight windows and outputs a calculation result.

In one embodiment, the first external memory stores the feature map data in a cross-cached (interleaved) manner.

In one embodiment, the feature map on-chip cache module includes Conv_num FIFO storage units and an LBUF line cache unit; the FIFO storage units are connected to the first external memory, receive the feature map data sent by the first external memory, and transmit it to the LBUF line cache unit; the LBUF line cache unit is connected with the convolution calculation module and transmits the feature map data to the convolution calculation module.

In one embodiment, the convolution weight on-chip cache module is implemented with the FPGA's internal lookup tables and flip-flop logic, and caches the convolution weights.

In one embodiment, the convolution calculation module includes Unit_num convolution calculation submodules for performing convolution calculation.

In one embodiment, the convolution calculation submodule includes Conv_num convolution calculation engine units, an output cache unit, and a channel-by-channel convolution calculation unit; each convolution calculation engine unit comprises 9 DSP48E subunits, completes the calculation of a feature map window with a convolution weight window, and zeroes redundant convolution weights; the output cache unit comprises Unit_num BRAMs for caching output data; the channel-by-channel convolution calculation unit completes the channel-by-channel convolution calculation process.

A working method of a multi-channel convolution FPGA architecture based on combined storage and computation comprises the following steps: acquiring feature map data and convolution weights from external memory, and transmitting them to a feature map on-chip cache module and a convolution weight on-chip cache module, respectively; the feature map on-chip cache module assembles feature map data windows from the feature map data and sends them to a convolution calculation module, while the convolution weight on-chip cache module assembles convolution weight windows from the convolution weights and sends them to the convolution calculation module; and the convolution calculation module performs calculation according to the feature map data windows and the convolution weight windows and outputs a calculation result.

In one embodiment, after the feature map data and convolution weights acquired from external memory are transmitted to the feature map on-chip cache module and the convolution weight on-chip cache module respectively, and before the feature map on-chip cache module sends a feature map data window to the convolution calculation module, the method further includes: when the convolution kernel size is K x K, reading the feature map data stream in AXI Stream form from the external memory and inputting the feature map data sequentially into all FIFO storage units for buffering; after all FIFO storage units are filled, transmitting the feature map data to the LBUF line cache unit; and storing the adjacent (K-1) lines of feature map data in the LBUF line cache unit, so that when the K-th line of feature map data is received, K x K data windows are output in sequence to the convolution calculation module to await convolution operation.

In one embodiment, before the convolution weight on-chip cache module sends the convolution weight windows to the convolution calculation module for calculation against the feature map data windows, the method further includes: inputting the convolution weights into the convolution weight on-chip cache module in sequence; and after all convolution weights are cached in the convolution weight on-chip cache module, outputting Unit_num large groups of convolution weights in sequence, each large group containing Conv_num groups of convolution weights.

Compared with the prior art, the advantages and beneficial effects of the invention are: by outputting multiple feature map data and convolution weights in parallel as multi-channel data streams, the invention improves the data transmission and processing efficiency of the FPGA architecture, realizes the combination of storage and computation, and reduces the burden on the central processing unit.

Drawings

FIG. 1 is a schematic diagram of a multi-channel convolution FPGA architecture based on combined storage and computation according to an embodiment;

FIG. 2 is a schematic structural diagram of the feature map on-chip cache module of FIG. 1;

FIG. 3 is a schematic diagram of the convolution calculation module of FIG. 1;

FIG. 4 is a schematic flowchart of a working method of a multi-channel convolution FPGA architecture based on combined storage and computation in an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings by way of specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In one embodiment, as shown in FIG. 1, a multi-channel convolution FPGA architecture based on combined storage and computation is provided, including: an external memory 10, a feature map on-chip cache module 20, a convolution weight on-chip cache module 30, and a convolution calculation module 40. There are two external memories 10, namely a first external memory 11 and a second external memory 12; the first external memory 11 is connected with the feature map on-chip cache module 20 and outputs feature map data to it, and the second external memory 12 is connected with the convolution weight on-chip cache module 30 and outputs convolution weights to it. The feature map on-chip cache module 20 is connected with the convolution calculation module 40 and outputs a plurality of feature map data windows to it in parallel; the convolution weight on-chip cache module 30 is connected with the convolution calculation module 40 and outputs a plurality of convolution weight windows to it in parallel; and the convolution calculation module 40 performs calculation according to the feature map data windows and the convolution weight windows and outputs a calculation result.

In this embodiment, the external memory 10 includes a first external memory 11 and a second external memory 12; the first external memory 11 outputs the feature map data to the feature map on-chip cache module 20, and the second external memory 12 outputs the convolution weights to the convolution weight on-chip cache module 30. The feature map on-chip cache module 20 and the convolution weight on-chip cache module 30 are both connected with the convolution calculation module 40 and respectively output a plurality of feature map data windows and a plurality of convolution weight windows to it. The convolution calculation module 40 calculates from these windows and outputs the result; by outputting multiple feature map data and convolution weights in parallel as multi-channel data streams, the architecture improves the data transmission and processing efficiency of the FPGA, realizes the combination of storage and computation, and reduces the burden on the central processing unit.

The first external memory 11 stores the feature map data in a cross-cached (interleaved) manner.

Specifically, cross-caching means that in the first external memory 11, data belonging to different feature maps are stored at adjacent addresses, with every Unit_num addresses forming one cycle. For example, if the data bit width after fixed-point quantization is B, each 256-bit word stored in the first external memory 11 contains 256/B values of the same feature map; here Unit_num denotes the number of logical storage units in the logical storage unit set.
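A minimal software sketch may help picture this interleaved ("cross-cache") layout: words of Unit_num different feature maps sit at adjacent addresses, one cycle per Unit_num addresses, so a sweep of consecutive addresses feeds all parallel units at once. The address formula, the value of `unit_num`, and the tuple payloads below are illustrative assumptions, not the patent's actual memory map.

```python
# Illustrative model of the cross-cached (interleaved) layout described above.

def interleaved_address(map_idx, word_idx, unit_num):
    """Address of word `word_idx` of feature map `map_idx` (0 <= map_idx < unit_num)."""
    return word_idx * unit_num + map_idx

def read_cycle(memory, word_idx, unit_num):
    """One address cycle at `word_idx`: returns one word from each feature map."""
    return [memory[interleaved_address(m, word_idx, unit_num)] for m in range(unit_num)]

if __name__ == "__main__":
    unit_num, words_per_map = 4, 3
    memory = [None] * (unit_num * words_per_map)
    for m in range(unit_num):
        for w in range(words_per_map):
            # Tag each stored word with (feature map index, word index) for clarity.
            memory[interleaved_address(m, w, unit_num)] = (m, w)
    print(read_cycle(memory, 0, unit_num))  # [(0, 0), (1, 0), (2, 0), (3, 0)]
```

Each read cycle thus delivers the same word position from all Unit_num feature maps, matching one burst serving all parallel compute units.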

Specifically, in the second external memory 12, the weights are stored sequentially in order, and each 256-bit-wide weight word may contain the weight parameters and bias parameters of one or more convolution kernels.

The feature map on-chip cache module 20 includes Conv_num FIFO storage units 21 and an LBUF line cache unit 22; the FIFO storage units 21 are connected to the first external memory 11, receive the feature map data it sends, and transmit the data to the LBUF line cache unit 22; the LBUF line cache unit 22 is connected to the convolution calculation module 40 and transmits the feature map data to it.

Specifically, the feature map on-chip cache module 20 receives and caches input feature map data and outputs a plurality of feature map data windows in parallel. After receiving the feature map data, it splits the data into Conv_num groups, completes caching with the Conv_num FIFO storage units 21 and the LBUF line cache unit 22, replicates each group Unit_num times, and transmits the Unit_num copies of the same data to the convolution calculation module 40 as data streams.

Specifically, for feature map input, the feature map on-chip cache module 20 adopts a Conv_num-way interleaved data input scheme. When the convolution kernel size is K x K, since data are streamed row by row, the storage capacity of each FIFO storage unit 21 is 1 row, and the storage capacity of each LBUF line cache unit 22 is (K-1) rows of data.

In an embodiment, the procedure by which the feature map on-chip cache module 20 caches feature map data is as follows: after the data stream in AXI Stream format is read from the first external memory 11, the data are input sequentially into all FIFO storage units 21 for buffering; once all FIFO storage units 21 are full, the data are transmitted to the LBUF line cache unit 22, which stores the adjacent (K-1) rows; when row K arrives, K x K data windows are output in sequence to the convolution calculation module 40 to await convolution operation.
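The FIFO-plus-LBUF windowing just described can be sketched in software as a line buffer holding the previous (K-1) rows that emits one K x K window per column position once the K-th row streams in. The pure-Python structure and names below are an illustrative model, not the hardware implementation.

```python
# Software sketch of the LBUF line-buffer behaviour: keep the previous (K-1)
# rows; as the K-th row streams in, emit a K x K window at each column.
from collections import deque

def sliding_windows(feature_map, k):
    """Yield (row, col, window) for each K x K window, streaming row by row."""
    line_buffer = deque(maxlen=k - 1)   # holds the previous K-1 rows (the LBUF)
    for r, row in enumerate(feature_map):
        if len(line_buffer) == k - 1:   # the K-th row has arrived
            rows = list(line_buffer) + [row]
            for c in range(len(row) - k + 1):
                window = [rr[c:c + k] for rr in rows]
                yield r, c, window
        line_buffer.append(row)

if __name__ == "__main__":
    fmap = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 ramp
    wins = list(sliding_windows(fmap, 3))
    print(len(wins))    # 4 windows for a 4x4 map with K=3
    print(wins[0][2])   # top-left 3x3 window
```

Note how no window is produced until K rows have been seen, mirroring the hardware's initial fill of the FIFOs and line cache.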

Specifically, when the feature map data reaches the feature map on-chip cache module 20, edge zero padding is performed first, so that the feature map size is unchanged between the input and output of the convolution, and the padded data stream is then cached in the FIFO storage units 21. Because the input to the feature map on-chip cache module 20 is a data stream on the AXI Stream bus, the control logic gates the stream's status signals so that '0' values are inserted at the appropriate moments of the transfer, completing the edge zero padding.
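As a software analogue of this edge zero padding (in hardware the zeros are injected into the AXI Stream by control logic rather than materialized in memory), a same-size padding helper might look like the following; the function name and list-of-lists representation are assumptions for illustration.

```python
# Software analogue of edge zero padding: add `pad` rings of zeros so that a
# K x K, stride-1 convolution with pad = (K - 1) // 2 preserves the map size.

def zero_pad(feature_map, pad):
    h, w = len(feature_map), len(feature_map[0])
    out = [[0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            out[r + pad][c + pad] = feature_map[r][c]
    return out

if __name__ == "__main__":
    print(zero_pad([[1, 2], [3, 4]], 1))  # 2x2 map becomes a 4x4 padded map
```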

The convolution weight on-chip cache module 30 is implemented with the FPGA's internal lookup tables and flip-flop logic, and caches the convolution weights.

Specifically, the input of the convolution weight on-chip cache module 30 is a 256-bit data stream containing 16 values; when the convolution kernel size is 3 x 3, the last 7 values are invalid data. The output is Unit_num x Conv_num windows of 3 x 3 16-bit data.

In one embodiment, the convolution kernel size may be 1 x 1 or 3 x 3. When the kernel size is 1 x 1, each 256-bit input contains 16 groups of convolution weights; inputting Unit_num x Conv_num groups of weights in sequence therefore takes Unit_num x Conv_num / 16 clock cycles, and each output 3 x 3 weight window contains 1 valid kernel value and 8 invalid values.

When the kernel size is 3 x 3, each 256-bit input contains 1 group of convolution weights; inputting Unit_num x Conv_num groups in sequence takes Unit_num x Conv_num clock cycles, and each output 3 x 3 weight window contains 9 valid values.

After all convolution weights are cached in the convolution weight on-chip cache module, Unit_num large groups of weights are output in sequence, each large group containing Conv_num groups of weights; since the Conv_num groups of a large group are output simultaneously, outputting all convolution weights takes Unit_num clock cycles.
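One way to picture the weight-window formatting above: a 1 x 1 kernel is delivered in the same 3 x 3 window shape as a 3 x 3 kernel, with the unused positions zeroed, so the same 9-multiplier engine serves both sizes. The sketch below is illustrative; in particular, placing the single valid weight at the window centre is an assumption, since the text does not state which of the 9 positions carries the valid value.

```python
# Illustrative formatting of a kernel into the uniform 3x3 weight window;
# the centre placement of the lone 1x1 weight is an assumption.

def to_3x3_window(weights):
    if len(weights) == 9:        # native 3x3 kernel: reshape row by row
        return [weights[i * 3:(i + 1) * 3] for i in range(3)]
    if len(weights) == 1:        # 1x1 kernel: 1 valid value, 8 zeroed positions
        win = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
        win[1][1] = weights[0]
        return win
    raise ValueError("only 1x1 and 3x3 kernels are supported in this sketch")

if __name__ == "__main__":
    print(to_3x3_window([5]))              # 1x1 kernel padded with zeros
    print(to_3x3_window(list(range(9))))   # 3x3 kernel reshaped
```

Zeroing the redundant positions means the extra multiplications contribute nothing, which is exactly why one engine geometry can cover both kernel sizes.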

The convolution calculation module 40 includes Unit_num convolution calculation submodules 41 for performing convolution calculation.

Specifically, the number Unit_num of convolution calculation submodules 41 can be configured according to the available hardware resources: when porting to an FPGA with fewer resources, submodules 41 can be removed; on an FPGA with more resources, submodules 41 can be added.

Specifically, after receiving the feature map data windows and convolution weight windows, each convolution calculation submodule 41 takes the group of data needed for its own calculation and performs the convolution operation, temporarily storing the result in an accumulation register. If the convolution layer has many channels and all operations cannot be completed in one input pass, data transfer, computation, and result accumulation are performed multiple times, and the output time of the accumulation register is then controlled according to the configured channel parameters.
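The multi-pass accumulation just described can be modelled as summing per-pass partial dot products into an accumulator that is only read out after the configured number of passes. The function below is an illustrative sketch of that behaviour, not the register-level design; all names are assumptions.

```python
# Illustrative model of multi-pass accumulation: when a layer has more input
# channels than one pass can feed, each pass contributes a partial dot
# product added into an accumulator, read out only after the final pass.

def accumulate_passes(windows_per_pass, weights_per_pass):
    acc = 0  # the accumulation register
    for wins, wts in zip(windows_per_pass, weights_per_pass):
        acc += sum(x * w for x, w in zip(wins, wts))  # this pass's partial sum
    return acc

if __name__ == "__main__":
    # A 3-channel dot product split across two passes as 2 + 1 channels.
    print(accumulate_passes([[1, 2], [3]], [[10, 10], [100]]))  # 330
```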

The convolution calculation submodule 41 includes Conv_num convolution calculation engine units 411, an output cache unit 412, and a channel-by-channel convolution calculation unit 413; each convolution calculation engine unit 411 comprises 9 DSP48E subunits, completes the calculation of a feature map window with a convolution weight window, and zeroes redundant convolution weights; the output cache unit 412 comprises Unit_num BRAMs for caching output data; the channel-by-channel convolution calculation unit 413 completes the channel-by-channel convolution calculation process.

Specifically, the 9 DSP48E subunits inside each convolution calculation engine unit 411 complete the multiply-add operations between a feature map data window and a convolution weight window; by zeroing redundant weights, the engine supports two convolution kernel sizes, widening the architecture's range of application.

Specifically, the output cache unit 412 stores the output feature map data being accumulated by invoking several BRAMs, whose storage capacity is set by the maximum data volume to be stored: when the maximum output feature map size is F_x x F_y and the data bit width is B, the storage capacity of a single BRAM is F_x x F_y x B bits. The output cache unit 412 also caches the intermediate results of the convolution calculation; when the output feature maps of all channels have been cached in the output cache unit 412, the convolution calculation result is output.
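As a worked instance of this sizing rule (capacity = F_x x F_y x B bits), with an assumed 56 x 56 maximum output feature map at 16-bit width, each BRAM would need 56 x 56 x 16 = 50176 bits; these particular numbers are illustrative, not taken from the patent.

```python
# Worked instance of the BRAM sizing rule above: capacity = Fx * Fy * B bits.
# The 56x56 map size and 16-bit width are illustrative assumptions.

def bram_capacity_bits(fx, fy, b):
    return fx * fy * b

if __name__ == "__main__":
    print(bram_capacity_bits(56, 56, 16))  # 50176 bits per BRAM
```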

Specifically, the channel-by-channel convolution calculation unit 413 completes the channel-by-channel (depthwise) convolution calculation process. Using a data-parallel pipelining mechanism, this module supports the two convolution kernel sizes 3 x 3 and 5 x 5, and the convolution stride can be configured as 1 or 2. Moreover, when the channel-by-channel convolution calculation unit 413 is enabled, the convolution weight on-chip cache module 30 can cache the weight parameters of the channel-by-channel convolution and continue outputting the weight parameters required by the channel-by-channel convolution calculation unit 413 while the control signal is valid.
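A minimal software model of this channel-by-channel (depthwise) convolution follows: each input channel is convolved with its own single kernel, producing one output channel per input channel. The shapes, stride handling, and names are illustrative; the hardware pipelines this computation as described above.

```python
# Software model of depthwise convolution: one kernel per input channel,
# one output channel per input channel; stride configurable (e.g. 1 or 2).

def depthwise_conv(channels, kernels, stride=1):
    out = []
    for fmap, ker in zip(channels, kernels):     # one kernel per channel
        k = len(ker)
        h = (len(fmap) - k) // stride + 1
        w = (len(fmap[0]) - k) // stride + 1
        och = [[sum(fmap[r * stride + i][c * stride + j] * ker[i][j]
                    for i in range(k) for j in range(k))
                for c in range(w)] for r in range(h)]
        out.append(och)
    return out

if __name__ == "__main__":
    fmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # picks the centre element
    print(depthwise_conv([fmap], [identity]))     # [[[5]]]
```

Unlike standard convolution, no summation across channels occurs, which is precisely the sparser computation pattern the Background section notes a single central array handles poorly.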

As shown in FIG. 4, a working method of a multi-channel convolution FPGA architecture based on combined storage and computation is provided, comprising the following steps:

Step S401: feature map data and convolution weights are acquired from external memory and transmitted to the feature map on-chip cache module and the convolution weight on-chip cache module, respectively.

Specifically, the feature map data are acquired from one external memory and transmitted to the feature map on-chip cache module; at the same time, the convolution weights are acquired from the other external memory and transmitted to the convolution weight on-chip cache module, so the two transfers proceed in parallel.

There may be two external memories, one storing the feature map data and the other storing the convolution weights.

Step S402: the feature map on-chip cache module assembles feature map data windows from the feature map data and sends them to the convolution calculation module; meanwhile, the convolution weight on-chip cache module assembles convolution weight windows from the convolution weights and sends them to the convolution calculation module.

Specifically, after receiving the feature map data, the feature map on-chip cache module splits the data into Conv_num groups, stores them in the FIFO storage units and the LBUF line cache unit, and, once the LBUF line cache unit is full, outputs K x K feature map data windows in sequence to the convolution calculation module;

meanwhile, after receiving the convolution weights, the convolution weight on-chip cache module outputs Unit_num large groups of convolution weight windows in sequence to the convolution calculation module, each large group containing Conv_num groups of convolution weights.

In this way, the feature map data and the convolution weights are delivered to the convolution calculation module simultaneously, reducing data transfer time and improving working efficiency.

Step S403: the convolution calculation module performs calculation according to the feature map data windows and the convolution weight windows and outputs a calculation result.

Specifically, the convolution calculation module computes from the feature map data windows and convolution weight windows and outputs the result, completing the convolution of the feature maps; combining storage with computation in this way greatly reduces data exchange time and data-access energy consumption during calculation.

In this embodiment, the feature map data and convolution weights are fetched from external memory into the feature map on-chip cache module and the convolution weight on-chip cache module, respectively; the feature map on-chip cache module assembles feature map data windows and sends them to the convolution calculation module; meanwhile, the convolution weight on-chip cache module assembles convolution weight windows and sends them to the convolution calculation module; the convolution calculation module then computes from these windows and outputs the result. By outputting multiple feature map data and convolution weights in parallel as multi-channel data streams, the method improves the data transmission and processing efficiency of the FPGA architecture, realizes the combination of storage and computation, and reduces the burden on the central processing unit.

After step S401 and before step S402, the method further includes: when the convolution kernel size is K × K, reading the feature map data stream in AXI Stream form from the external memory and sequentially inputting the feature map data into all FIFO storage units for buffering; after all FIFO storage units are filled, transmitting the feature map data to the LBUF line cache unit; and storing the feature map data of the adjacent (K-1) lines in the LBUF line cache unit, so that when the K-th line of feature map data is received, K × K feature map data windows are output in sequence to the convolution calculation module to await the convolution operation.

Specifically, after the feature map on-chip cache module receives the feature map data, and given a convolution kernel size of K × K, the feature map data stream is read from the external memory in AXI Stream form and input sequentially into all FIFO storage units, then transmitted to the LBUF line cache unit. The LBUF line cache unit stores the data of the adjacent (K-1) lines; when it receives the K-th line of data, it outputs K × K feature map data windows in sequence to the convolution calculation module to await the convolution operation.
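The LBUF line-buffer behaviour above can be modelled in a few lines. This is a software sketch under assumed names, not the RTL: the deque stands in for the (K-1) line caches, and windows are emitted one per valid column position once the K-th line streams in.

```python
from collections import deque

def line_buffer_windows(rows, K):
    """Yield K x K sliding windows from a row-by-row feature map stream,
    keeping only the (K-1) most recent rows buffered, as the LBUF does."""
    lbuf = deque(maxlen=K - 1)          # line caches for the previous K-1 rows
    for row in rows:
        if len(lbuf) == K - 1:          # the K-th row has arrived
            block = list(lbuf) + [row]  # K complete rows are now available
            for col in range(len(row) - K + 1):
                yield [r[col:col + K] for r in block]
        lbuf.append(row)                # oldest row is evicted automatically
```

For a 3 × 3 feature map with K = 2 this yields four 2 × 2 windows, matching the sliding-window order a stride-1 convolution expects.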

After step S402 and before step S403, the method further includes: the convolution weight on-chip cache module inputs the convolution weights in sequence; after all convolution weights are cached in the convolution weight on-chip cache module, Unit_num large groups of convolution weights are output in sequence, where each large group comprises Conv_num groups of convolution weights.

Specifically, after receiving the convolution weight data, the convolution weight on-chip cache module inputs the convolution weights in sequence. Once all convolution weights are cached, it outputs Unit_num large groups of convolution weights in sequence, each large group containing Conv_num groups. Because the Conv_num groups within a large group are output simultaneously, outputting all the convolution weight data takes Unit_num clock cycles in total.
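The weight-streaming order can be illustrated with a short sketch. The function name and flat-list weight layout are assumptions; the point is only the grouping: one large group is emitted per clock cycle, each carrying Conv_num kernel groups in parallel, so the full transfer takes Unit_num cycles.

```python
def stream_weight_groups(weights, unit_num, conv_num):
    """Split cached weights into unit_num emissions (one per clock cycle),
    each emission carrying conv_num weight groups output in parallel."""
    assert len(weights) == unit_num * conv_num
    for cycle in range(unit_num):
        # all conv_num groups of this large group leave in the same cycle
        yield weights[cycle * conv_num:(cycle + 1) * conv_num]
```

With Unit_num = 3 and Conv_num = 2, six weight groups leave the cache in three cycles, two at a time.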

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented in a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of computing devices. Optionally, they may be implemented in program code executable by a computing device, so that they may be stored on a computer storage medium (ROM/RAM, magnetic disk, optical disk) and executed by a computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. Alternatively, they may be separately fabricated as individual integrated circuit modules, or multiple of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The foregoing is a more detailed description of the present invention presented in conjunction with specific embodiments, and the practice of the invention is not to be considered limited to these descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions may be made without departing from the spirit of the invention, all of which shall be considered to fall within the protection scope of the invention.
