Hardware implementation method and hardware system for batch processing FIR algorithm

Document No. 490287 · Published: 2022-01-04

Note: This technology, "Hardware implementation method and hardware system for batch processing FIR algorithm", was created by Li Li, Shen Sirui, Song Wenqing, Fu Yuxiang, He Shuzhuan and Chen Jian on 2021-09-28. The invention discloses a hardware implementation method and a hardware system for a batch processing FIR algorithm, belonging to the technical field of the FIR algorithm and its hardware implementation in digital signal processing. Addressing the insufficient performance of existing FIR accelerators under batch processing and the resulting low efficiency of digital signal processing, the system comprises a control module, a source data address generation module, a coefficient address generation module, a result address generation module, a storage unit, a data carrying module and a calculation module; the calculation module executes the multiply-accumulate operations of the FIR algorithm. By adding a batch processing capability, the hardware system reduces the clock cycles spent on configuration information and on starting data transfers, improves FIR accelerator performance, and optimizes the digital signal processing hardware design.

1. A hardware system for batch processing FIR algorithm is characterized in that the hardware system comprises a control module, a source data address generating module, a coefficient address generating module, a result address generating module, a calculating module, a storage unit and a data carrying module; the control module is respectively connected with the source data address generation module, the coefficient address generation module, the result address generation module, the calculation module and the data carrying module, the source data address generation module and the coefficient address generation module are respectively connected with the calculation module, the calculation module is also connected with the result address generation module, the source data address generation module, the coefficient address generation module and the result address generation module are respectively connected with the storage unit, and the data carrying module is connected with the storage unit; the calculation module comprises an m-path multiply accumulator, and m is an integer larger than zero.

2. The hardware system for batch processing FIR algorithm according to claim 1, wherein said storage unit comprises a source data area, a coefficient area and a result area, with 4 × m + 1 memories in total; the source data area includes 2 × m memories, the coefficient area includes 1 memory, and the result area includes 2 × m memories.

3. The hardware system for batch FIR algorithm according to claim 2, wherein said source data address generating module comprises m vector lanes, said coefficient address generating module comprises 1 coefficient lane, and said result address generating module comprises m result data lanes; the vector channel and the coefficient channel are both connected with the input end of the calculation module, and the output end of the calculation module is connected with the result data channel.

4. The hardware system for batch processing FIR algorithms according to claim 1, wherein said multiply accumulator comprises a multiplier and adders: the output of the multiplier is connected to the input of the first adder, the output of the first adder is connected to the input of the second adder, and the output of the second adder is connected to the input of the third adder; the outputs of the first, second and third adders are also fed back to their own inputs; the inputs and outputs of the multiplier and the adders each have a one-beat delay.

5. The hardware system for batch FIR algorithm according to claim 1, wherein said control module parses the external configuration into the respective batch numbers of the m channels.

6. A hardware implementation method of batch FIR algorithm is characterized by comprising the following steps:

setting parameters of an FIR algorithm, wherein the parameters comprise sequence length of a single batch, filter coefficient length and batch number of batch processing, and organizing FIR accelerator hardware according to the set parameters;

storing source data according to a data sequence to be processed and a processing batch number;

and executing the calculation after the source data is stored, and writing the result data into the SRAMs of the result area until all result data have been produced.

7. The hardware implementation method of a batch FIR algorithm of claim 6, characterized in that zero padding is not performed on the source data during storage; instead, zero padding is performed during the computation.

8. The hardware implementation method of a batch processing FIR algorithm according to claim 7, wherein when storing the source data, the vector [ m × n +0] of the source data is stored in the first memory of the storage unit, the vector [ m × n +1] of the source data is stored in the second memory of the storage unit, and so on, there are m storage channels, and each batch of the source data is stored in turn, and n is an integer greater than zero.

9. The hardware implementation method of a batch FIR algorithm according to claim 6, characterized in that when the calculating module calculates the filtering results of multiple batches of sequences, all sequences of the batch process share the same filtering coefficient and length.

10. The hardware implementation method of a batch FIR algorithm as claimed in claim 8, wherein the storage unit stores the source data and the calculation result in a left-right ping-pong manner, and the storage unit stores the coefficients in a deep ping-pong manner.

Technical Field

The invention relates to the technical field of the FIR (Finite Impulse Response) algorithm and its hardware implementation in digital signal processing, and in particular to a hardware implementation method and a hardware system for batch processing of the FIR algorithm.

Background

Digital signal processing techniques are widely used in speech signal processing, radar imaging, aerospace and other fields. They extract characteristic parameters from signals algorithmically and compare and match them against known features. Digital signal processing plays a very important role in modern engineering.

A Finite Impulse Response (FIR) filter is a filter whose impulse response has finite length: its response to an impulse input settles to zero in finite time. Batch processing is an important concept in modern hardware acceleration, from parallel computing to Single Instruction Multiple Data (SIMD) execution. Processing multiple batches of common data at the same time greatly reduces control and data-communication overhead in hardware and improves operational efficiency.

However, existing FIR accelerator designs often emphasize single-run parallelism and the compute pipeline while ignoring the low utilization of on-chip storage by multi-batch, small-scale data. Moreover, in multi-batch data processing scenarios the processor must issue accelerator configuration instructions frequently, so a large number of cycles are consumed by control logic and the efficiency of digital signal processing drops.

The Chinese patent application "Computing module for an efficient FFT and FIR hardware accelerator", application No. CN200980139643.1, published on 2014-05-07, discloses a hardware accelerator capable of operating in an FFT mode and an FIR mode. The hardware accelerator may receive input data and coefficient data and perform calculations for the selected mode. In the FFT mode, a radix-2 FFT is computed, producing four real outputs corresponding to two complex numbers. In the FIR mode, a real output is generated. The hardware accelerator may use three multiplexers to transition from FFT mode to FIR mode. All FIR components can be used in FFT mode. Registers may be added to provide pipelining support. The hardware accelerator can support several numerical representation systems. This method focuses on accelerating the FIR algorithm, but still cannot achieve high performance in application scenarios involving batch processing.

Disclosure of Invention

1. Technical problem to be solved

Aiming at the problems of insufficient performance of existing FIR (Finite Impulse Response) accelerators under batch processing and low digital signal processing efficiency, the invention provides a hardware implementation method and a hardware system for the FIR algorithm. Weighing the requirements of digital signal processing scenarios, area and power consumption, and system operating speed, it provides an FIR algorithm hardware implementation with a batch processing function and a corresponding hardware system, improving FIR accelerator performance and optimizing the digital signal processing hardware design.

2. Technical scheme

The purpose of the invention is realized by the following technical scheme.

A hardware system for the batch processing FIR algorithm comprises a control module, a source data address generation module, a coefficient address generation module, a result address generation module, a calculation module, a storage unit and a data carrying module; the control module is respectively connected with the source data address generation module, the coefficient address generation module, the result address generation module, the calculation module and the data carrying module; the source data address generation module and the coefficient address generation module are respectively connected with the calculation module; the calculation module is also connected with the result address generation module; the source data address generation module, the coefficient address generation module and the result address generation module are respectively connected with the storage unit; and the data carrying module is connected with the storage unit. The calculation module comprises m multiply-accumulator lanes, where m is an integer greater than zero.

Furthermore, the storage unit comprises a source data area, a coefficient area and a result area, with 4 × m + 1 memories in total; the source data area includes 2 × m memories, the coefficient area includes 1 memory, and the result area includes 2 × m memories. Taking m = 16 as an example, the storage unit comprises 65 SRAMs, where SRAMs 0-31 form the source data area, SRAM 32 the coefficient area, and SRAMs 33-64 the result area. The data carrying module carries data between the DDR and the SRAMs.

Furthermore, the source data address generation module comprises m vector channels, the coefficient address generation module comprises 1 coefficient channel, and the result address generation module comprises m result data channels; the vector channels and the coefficient channel are connected to the inputs of the calculation module, and the outputs of the calculation module are connected to the result data channels. Taking m = 16 as an example, the source data address generation module includes 16 vector channels and the result address generation module includes 16 result data channels; the 16 vector channels are connected to the 16 first input interfaces of the calculation module, the coefficient channel is simultaneously connected to all 16 second input interfaces of the calculation module, and the 16 output interfaces of the calculation module are connected to the 16 result data channels.

Furthermore, the multiply-accumulator comprises a multiplier and three adders: the output of the multiplier is connected to the input of the first adder, the output of the first adder to the input of the second adder, and the output of the second adder to the input of the third adder; the output of each adder is also fed back to its own input. The inputs and outputs of the multiplier and the adders each have a one-beat delay.

Further, the control module parses the configuration into the batch numbers of each of the m channels. The control module parses the external configuration and controls the calculation modules and storage units, mapping the data to be processed onto the m channels to realize multi-batch data processing.

A hardware implementation method of batch FIR algorithm comprises the following steps:

setting parameters of an FIR algorithm, wherein the parameters comprise sequence length of a single batch, filter coefficient length and batch number of batch processing, and organizing FIR accelerator hardware according to the set parameters;

storing source data according to a data sequence to be processed and a processing batch number;

and executing calculation after the source data is stored until all result data operations are completed, and storing the result data into the SRAM in the result area.

Furthermore, zero padding is not performed on the source data during storage; it is performed during the computation instead. Padding zeros during the computation saves source data storage space and improves storage efficiency.

Furthermore, when the source data is stored, a vector [ m × n +0] of the source data is stored in the first memory of the storage unit, a vector [ m × n +1] of the source data is stored in the second memory of the storage unit, and so on, m storage channels are provided in total, and each batch of source data is stored in turn, and n is an integer greater than zero. The source data is stored according to the number of channels, and batch processing calculation of the system is supported.

Further, the calculation module calculates the filtering results of the plurality of batches of sequences, wherein all the sequences of the batches share the same filtering coefficient and length.

Furthermore, the storage unit stores the source data and the calculation result in a left-right ping-pong mode, and the storage unit stores the coefficient in a deep ping-pong mode. Here, the stored source data is the filter vector and the stored coefficients are the filter coefficients.

3. Advantageous effects

Compared with the prior art, the invention has the advantages that:

the hardware implementation method of the batch FIR algorithm adds a batch processing function to the traditional FIR algorithm hardware structure. On top of multi-path parallel computation over multi-batch data, zero padding is completed in the address generation module, which reduces the source data storage space and the number of memory accesses and improves data storage efficiency.

According to the invention, a single start-up computes multiple batches of data, reducing the clock cycles consumed by configuration information and by starting data transfers and improving the execution performance of the FIR algorithm hardware; the filter weights are shared among the batches of data, so FIR computation performs well in small-point batch processing scenarios.

Drawings

FIG. 1 is a schematic diagram of the hardware system of the present invention;

FIG. 2 is a schematic diagram of the computation process of the FIR algorithm of the present invention;

FIG. 3 is a schematic diagram of a data storage distribution of memory cells of the present invention;

FIG. 4 is a schematic diagram of the structure of the calculation module multiply accumulator of the present invention;

FIG. 5 is a ping-pong implementation of the memory cell of the present invention;

FIG. 6 is a simulation diagram of a zero-filled waveform of the source data address generation module of the present invention.

Detailed Description

The invention is described in detail below with reference to the drawings and specific examples.

Example 1

The FIR filter is an important digital signal processing method, and for better understanding of the present embodiment, the FIR filter will be explained first.

For an N-point M order FIR filter, the N-point vector is represented as:

X=[x(0) x(1) x(2) … x(N-2) x(N-1)],

the M-order filtering system is represented as:

H=[h(0) h(1) h(2) … h(M-2) h(M-1)],

the calculation formula of the N-point M-order FIR filter is (taking x(i) = 0 for i < 0 or i ≥ N):

y(n) = h(0)x(n) + h(1)x(n-1) + … + h(M-1)x(n-M+1), n = 0, 1, …, M+N-2,

when the formula is evaluated, M + N - 1 operations are needed to obtain the M + N - 1 results, where each operation is a multiply-accumulate of two M-point vectors.
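The windowed multiply-accumulate above can be sketched in Python as a reference model (an illustrative sketch only, not the hardware implementation; the function name is ours):

```python
def fir_direct(x, h):
    """Reference N-point, M-order FIR: y(n) = sum_k h(k) * x(n - k).

    x is treated as zero outside [0, N-1], so the output has
    M + N - 1 samples, each produced by one M-point multiply-accumulate.
    """
    n_pts, order = len(x), len(h)
    y = []
    for n in range(n_pts + order - 1):
        acc = 0.0
        for k in range(order):
            i = n - k
            if 0 <= i < n_pts:      # implicit zero padding outside the vector
                acc += h[k] * x[i]
        y.append(acc)
    return y
```

For example, `fir_direct([1, 2, 3], [1, 1])` yields the M + N - 1 = 4 samples `[1.0, 3.0, 5.0, 3.0]`.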

As shown in fig. 2, a common FIR calculation method is the windowing method. When the filter coefficient window slides beyond the range of the filter vector, M - 1 zeros must be padded before and after the filter vector. In prior-art filter designs, the source data memory must store not only the filter vector of length N but also an additional 2 × (M - 1) zeros. With batch processing, assuming the number of batches is B, 2 × B × (M - 1) zeros must be stored in addition, wasting a large amount of space and lengthening the time needed to write the filter vectors into the storage unit during data transfer.

The present embodiment discloses a hardware system of FIR algorithm, as shown in fig. 1. The system comprises a control module, a source data address generating module, a coefficient address generating module, a result address generating module, a calculating module, a storage unit and a data carrying module; the control module is respectively connected with the source data address generation module, the coefficient address generation module, the result address generation module, the calculation module and the data carrying module, the source data address generation module and the coefficient address generation module are respectively connected with the calculation module, the calculation module is also connected with the result address generation module, the source data address generation module, the coefficient address generation module and the result address generation module are respectively connected with the storage unit, and the data carrying module is connected with the storage unit.

The calculation module supports batch processing of m parallel data paths, where m is an integer greater than 0; this embodiment is described with m = 16. The calculation module comprises 16 multiply-accumulate units operating on single-precision floating-point data. The multiply-accumulator structure, shown in fig. 4, comprises a multiplier and three adders: the output of the multiplier is connected to the input of the first adder, the output of the first adder to the input of the second adder, and the output of the second adder to the input of the third adder; each adder's output is also fed back to its own input. The inputs and outputs of the multiplier and the adders each have a one-beat delay.

Compared with a design using one multiplier and two adders, the multiply-accumulator of the calculation module of the invention adds one more adder. A one-multiplier, two-adder multiply-accumulator requires an accumulation length greater than 8, whereas the one-multiplier, three-adder multiply-accumulator reduces the registers needed for beat-delay alignment and supports accumulation lengths from 1 to 7. The smaller accumulation length allows the whole FIR accelerator to support smaller point counts, better showing its performance advantage in small-point batch processing scenarios.

The storage unit comprises a source data area for storing filter vectors, a coefficient area for storing filter coefficients, and a result area for storing filter results. In this embodiment, the storage unit includes 65 SRAMs, each 64 bits wide and 2K words deep.

The storage unit stores source data (filter vectors) and filter results in a left-right ping-pong manner, and stores coefficients (filter coefficients) in a deep ping-pong manner. In left-right ping-pong, source data of the same type is partitioned across two SRAMs: one SRAM serves data reads for the calculation of the current operation instruction, while the other is written by the data transfer of the next operation instruction. In deep ping-pong, source data of the same type is partitioned across different address segments of a single SRAM: while the current instruction's calculation reads from the first (or second) half of the address space, the next instruction's data transfer writes into the other half.
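The ping-pong alternation described above can be modeled with a small Python sketch (the class and method names are ours, for illustration only; real hardware swaps banks at instruction boundaries under control-module supervision):

```python
class PingPongBuffer:
    """Left-right ping-pong model: two banks alternate between the
    compute read path and the DMA write path, so the next instruction's
    data transfer overlaps the current instruction's computation."""

    def __init__(self):
        self.banks = [[], []]
        self.compute_side = 0          # bank currently read by the datapath

    @property
    def dma_side(self):                # bank being filled for the next op
        return 1 - self.compute_side

    def load_next(self, data):
        """DMA writes the next instruction's source data."""
        self.banks[self.dma_side] = list(data)

    def swap(self):
        """Instruction boundary: roles of the two banks exchange."""
        self.compute_side = self.dma_side

    def read(self):
        """Datapath reads the current instruction's data."""
        return self.banks[self.compute_side]
```

Deep ping-pong is the same idea applied to the two halves of one SRAM's address space instead of two separate SRAMs.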

As shown in FIG. 5, SRAMs 0-31 of the storage unit form the source data area, SRAM 32 the coefficient area, and SRAMs 33-64 the result area. SRAMs 0-15, addresses 0-1023 of SRAM 32, and SRAMs 33-48 form the ping area; SRAMs 16-31, addresses 1024-2047 of SRAM 32, and SRAMs 49-64 form the pong area. While an operation executes in the ping area, if the next instruction has been configured externally, the data carrying module moves the source data required by the next instruction from the DDR into the pong area.

The source data address generation module comprises 16 vector channels and the coefficient address generation module comprises 1 coefficient channel; the 16 vector channels are connected to the 16 first input interfaces of the calculation module, and the coefficient channel is simultaneously connected to all 16 second input interfaces. The result address generation module comprises 16 result data channels, which are connected to the 16 output interfaces of the calculation module. As shown in the EDA-tool simulation waveform of FIG. 6, the source data address generation module performs zero padding for out-of-range indices when generating address indices and supplying data to the calculation module. Because the FIR filter operates with fixed-length multiply-accumulates, zeros are required before and after the vector. The address generation module compares each generated address with the start and end addresses of the current data; if the address is out of range, the zero-enable signal in the address generation module is asserted, and after a delay matching the storage unit's read latency, a zero data value is output to the calculation module.
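The zero-enable read path can be modeled roughly as follows (a behavioral sketch; the function and signal names are ours, not from the patent, and pipeline delays are ignored):

```python
def read_with_zero_padding(sram, start, end, index):
    """Model of the source-data read path: the address generator compares
    the generated index against the start/end addresses of the current
    vector and, when the index falls outside that range, raises a
    zero-enable flag so a literal 0 (rather than stale SRAM contents)
    is delivered to the multiply-accumulator."""
    if start <= index <= end:
        return sram[index]
    return 0.0   # zero-enable path: no SRAM access is needed
```

Because the zeros are synthesized here, they never occupy space in the source data SRAMs.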

The control module parses the configuration into the respective batch numbers of the 16 channels.

The hardware system of the batch processing FIR algorithm described in this embodiment may be implemented as an application-specific integrated circuit (ASIC) or as a field-programmable gate array (FPGA).

Based on the above hardware system for batch processing of FIR algorithm, the present embodiment discloses a hardware implementation method for batch processing of FIR algorithm, which includes the following steps:

step S1: setting parameters of batch FIR algorithm and organizing FIR accelerator hardware

The parameters of the batch FIR algorithm include the single-batch sequence length (fir_num), the filter coefficient length (fir_order) and the batch number (batch_num). Assuming the SRAM depth is d, the parameters of the batch FIR algorithm satisfy the following constraints:

1. (fir_num + fir_order - 1) * ceil(batch_num / 16) < d;

2. 0 < fir_order < d / 2;

3. fir_num > 4.
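The three constraints can be checked with a small helper (an illustrative sketch; the function name is ours, and d = 2048 and 16 lanes follow the 2K-deep SRAMs and 16 channels of this embodiment):

```python
import math

def check_fir_params(fir_num, fir_order, batch_num, sram_depth=2048, lanes=16):
    """Return True iff the batch FIR parameters satisfy the three
    constraints quoted above: the padded sequences of all batch rounds
    fit in one SRAM, the coefficient length fits in half an SRAM, and
    the single-batch length exceeds the minimum supported points."""
    return (
        (fir_num + fir_order - 1) * math.ceil(batch_num / lanes) < sram_depth
        and 0 < fir_order < sram_depth / 2
        and fir_num > 4
    )
```

For instance, the embodiment's fir_num = 16, fir_order = 4, batch_num = 16 passes, while fir_num = 3 violates constraint 3.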

In this embodiment, the sequence length of a single batch is 16 and the number of batches is also 16; that is, when the filter operates, 16 batches of data are calculated in parallel across the 16 channels, and execution is a sliding-window multiply-accumulate. After zero padding, the sequence length of each batch equals the coefficient length, so a fixed-order multiply-accumulate can be realized with the multiply-accumulator.

Step S2: storing source data according to data sequence and processing batch number to be processed

When source data is stored, vector [16n + 0] is stored in SRAM 0, vector [16n + 1] in SRAM 1, and so on; there are 16 SRAM storage channels in total, and each batch of data is stored in turn. Zeros do not need to be written into the storage unit in advance: zero padding runs during the computation. During address indexing, the filter compares the current index against the range of the source data, and when the index falls outside the source data held in the storage unit, a zero element is output to the calculation module, saving source data storage space.
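The round-robin distribution of batch vectors over the 16 source-data SRAMs can be sketched as follows (a layout model only, ignoring SRAM width and timing; the function name is ours):

```python
def interleave_batches(batches, lanes=16):
    """Round-robin layout model: batch b is written whole into SRAM
    (b mod lanes), so vectors [lanes*n + k] all land in SRAM k, with
    successive rounds of batches appended in turn."""
    srams = [[] for _ in range(lanes)]
    for b, batch in enumerate(batches):
        srams[b % lanes].extend(batch)
    return srams
```

With 32 batches, SRAM 0 holds batches 0 and 16 back to back, SRAM 1 holds batches 1 and 17, and so on, matching the distribution of fig. 3.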

This storage method stores all batches of source data in the source data SRAMs at once, allowing the 16 multiply-accumulators to process data simultaneously in parallel.

Step S3: after the source data is stored, execute the calculation; write the result data into the SRAMs of the result area until all results have been produced, completing the operation.

The batch FIR filter of this embodiment can calculate multiple batches of source data at once. The batches of source data are divided equally into 16 groups and stored in turn in the 16 memories of the storage unit. When source data is moved from the DDR to the memory cells by the data carrying module, it is laid out in the storage unit as shown in fig. 3. Likewise, when the result data is moved out of the storage unit to the DDR after the calculation completes, it follows the same fig. 3 layout.

When the filtering results of multiple batches of sequences are calculated, all sequences in the batch share the same filter coefficients and length.

By analogy, the above data arrangement and zero padding methods apply to FIR filter designs with any number of parallel paths.

The simulation experiment compares the cycle counts of a single-batch FIR filter and a batch FIR filter with the same number of parallel paths on the same data; the results are shown in Table 1.

TABLE 1

Design                   Vector length   Coefficient length   Number of batches   Cycle count
Single-batch FIR filter  29              4                    4096                122880
Batch FIR filter         29              4                    4096                32805

Table 1 shows the performance advantage of the batch FIR filter under many-batch workloads: by eliminating a large amount of per-batch configuration and start-up time, it greatly reduces the cycle count and improves FIR accelerator performance. The invention solves the problems of low data storage efficiency after zero padding in the FIR algorithm and of frequent start-ups in the data transfer stage of traditional FIR accelerators under small-data, multi-batch conditions, improving the performance of FIR algorithm hardware execution.

As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made without departing from the spirit and scope of the invention as defined by the appended claims; similar structures and embodiments derived from the above technical solution by a person skilled in the art without inventive effort fall within the protection scope of this patent. Furthermore, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Several of the elements recited in the product claims may also be implemented by a single element in software or hardware. The terms first, second, etc. denote names and do not imply any particular order.
