Sparse matrix-vector multiplication computation unit for a permuted block diagonal weight matrix
Note: This invention, "Sparse matrix-vector multiplication computation unit for a permuted block diagonal weight matrix", was created by Zheng Yong, Chen Zhujia and Shu Yi on 2019-11-06. Its main content is as follows: the invention relates to a sparse matrix-vector multiplication computation unit for a permuted block diagonal weight matrix, comprising a plurality of processing units and accumulators, with the outputs of the processing units connected to the accumulators. The unit fully exploits the sparsity of the pruned weight matrix, avoiding multiplications between zero-valued weights and the corresponding input excitation elements. Zero-skipping can be enabled dynamically according to the sparsity of the input excitation. The sparsity of the intermediate products obtained by multiplying weights with the corresponding input excitations is likewise exploited, avoiding accumulations involving zero-valued products. The designed pointer generator eliminates the storage overhead of pointers recording non-zero positions.
1. A sparse matrix-vector multiplication computation unit for a permuted block diagonal weight matrix, comprising: a plurality of processing units and accumulators; the output of each processing unit is connected to an accumulator, and the accumulators accumulate the outputs of all the processing units;
if the length of the input excitation vector is m, the number of processing units is m;
each processing unit comprises a memory for storing weight data, a multiplier array, a pointer generator, a sparse accumulator array, a first two-to-one selector and a second two-to-one selector; the inputs of the memory and the pointer generator are connected to the address signal, and the output of the memory is connected to the multiplier array, …
2. The sparse matrix-vector multiplication computation unit for a permuted block diagonal weight matrix according to claim 1, wherein: when the sparsity of the input excitation is greater than 50%, the configuration signal is 1, and when the sparsity of the input excitation is less than 50%, the configuration signal is 0.
3. The sparse matrix-vector multiplication computation unit for a permuted block diagonal weight matrix according to claim 1, wherein: the sparse accumulator array is formed by a plurality of sparse accumulators in parallel, each sparse accumulator comprising a decoder, an accumulation unit, a bit splicing unit, a D flip-flop and a third two-to-one selector; the inputs of the decoder are connected to the output of the pointer generator and the output of the second two-to-one selector; the output of the decoder is connected to the first input of the third two-to-one selector; the output of the third two-to-one selector is the output of the sparse accumulator and is also connected to the D terminal of the D flip-flop; the Q terminal of the D flip-flop is connected to the accumulation unit and the bit splicing unit, and the input of the accumulation unit is also connected to the output of the second two-to-one selector.
4. The sparse matrix-vector multiplication computation unit for a permuted block diagonal weight matrix according to claim 1, wherein: the processing units form the first stage and the accumulators form the second through N-th stages; the outputs of every two processing units are connected to one accumulator in the second stage, and the outputs of every two accumulators in the (n+1)-th stage are connected to one accumulator in the (n+2)-th stage, where n ∈ [1, N-2]; the N-th stage contains a single accumulator.
Technical Field
The invention relates to a sparse matrix-vector multiplication computation unit for a permuted block diagonal weight matrix, and belongs to the technical field of integrated circuit design.
Background
In recent years, deep learning algorithms have begun to replace conventional algorithms as the mainstream approach in many fields, by virtue of their excellent performance. However, current mainstream processors (CPUs, GPUs, DSPs, etc.) are poorly suited to their data-intensive computing characteristics, which has spurred strong research interest in deep learning processors in both academia and industry.
The fully-connected operation is a very important operation in the neural network algorithm and is also a very computationally intensive operation. Therefore, efficiently performing this type of operation becomes a key to improving the performance of the deep learning processor.
The computational load of the fully-connected operation is enormous because of its numerous connection parameters. It has been found that not all connections are necessary: there is substantial redundancy in full connections, and reasonably deleting these redundant connections to obtain partial connections hardly affects network performance.
The method of deleting redundant connections is called pruning. Researchers have proposed a pruning method using a permuted block diagonal mask matrix, defined as follows: a mask matrix is divided into p × p sub-matrices, and the elements a_(i,j) of each sub-matrix are defined by the following equation:

a_(i,j) = 1 if j = (i + b) mod p, and a_(i,j) = 0 otherwise;

wherein the bias signal b is the offset assigned to that sub-matrix, and i, j ∈ {0, 1, …, p-1}.
The sub-matrices so defined form the permuted block diagonal mask matrix. Multiplying this mask matrix element-wise with the original weight matrix completes the pruning of the original weight matrix. This pruning method avoids both the irregular post-pruning structure and the huge pointer overhead of existing pruning methods, and the pruned network performs well, which makes it very suitable for hardware implementation.

The authors of this pruning method also proposed a matrix-vector multiplication computation unit for this sparse structure, but that unit has three disadvantages. (1) It only reduces the storage cost of zero-valued weights; it does not eliminate their computation cost. (2) It supports only static exploitation of input-excitation sparsity; in many scenarios the input excitation is in fact barely sparse, so adding zero-skipping saves no power there and, because of the extra comparison operations, actually increases the power consumption of the whole system. (3) It does not exploit the sparsity of the intermediate products: since a zero-valued weight or a zero-valued excitation yields a zero product, the intermediate products are at least as sparse as the weight matrix. Fully exploiting this product sparsity can greatly reduce the number of additions and further improve the energy efficiency of the whole system.
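For concreteness, the mask construction and pruning described above can be sketched in software. This is an illustrative model only: the sub-matrix formula is assumed here to be a cyclically shifted diagonal (one non-zero per row, at column (i + bias) mod p), and the function name and block-size choices are invented for the example.

```python
import numpy as np

def permuted_block_diagonal_mask(rows, cols, p, biases):
    """Build a (rows x cols) mask from p x p sub-blocks.

    Each sub-block (s, t) keeps only the entries on the diagonal
    cyclically shifted by biases[s][t]; all other entries are zero.
    (Block definition assumed for illustration.)
    """
    mask = np.zeros((rows, cols), dtype=np.int8)
    for s in range(rows // p):
        for t in range(cols // p):
            b = biases[s][t]
            for i in range(p):
                j = (i + b) % p          # position of the block's non-zero in row i
                mask[s * p + i, t * p + j] = 1
    return mask

# Pruning: element-wise multiplication of the mask with the dense weights.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
mask = permuted_block_diagonal_mask(4, 4, 2, [[0, 1], [1, 0]])
W_pruned = W * mask                      # zeroed weights need no storage
```

Note that each block stores only p non-zero values plus a single bias, so the position information per block is one offset rather than per-element index pointers.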
Disclosure of Invention
The technical problem the invention aims to solve is: to provide a sparse matrix-vector multiplication computation unit for a permuted block diagonal weight matrix that fully exploits the sparsity of the weight matrix and of the intermediate products, and dynamically exploits the sparsity of the input excitation.
In order to solve the technical problems, the technical scheme provided by the invention is as follows: a sparse matrix vector multiplication computation unit for a permuted block diagonal weight matrix, comprising: a plurality of processing units and accumulators; the output of the processing unit is connected with the accumulator, and the accumulator is used for accumulating the outputs of all the processing units;
If the length of the input excitation vector is m, the number of processing units is m. Each processing unit computes the multiplication of one element of the input excitation vector with its corresponding column of weights. Each processing unit comprises a memory for storing weight data, a multiplier array, a pointer generator, a sparse accumulator array, a first two-to-one selector and a second two-to-one selector. The inputs of the memory and the pointer generator are connected to the address signal, and the output of the memory is connected to the multiplier array. An element of the input excitation vector is input to the multiplier array; the output of the multiplier array is connected to the first input of the first two-to-one selector and the first input of the second two-to-one selector; a second input of the first two-to-one selector …

The scheme is further improved in that: when the sparsity of the input excitation is greater than 50%, the configuration signal is 1, and when the sparsity of the input excitation is less than 50%, the configuration signal is 0.
The scheme is further improved in that: the sparse accumulator array is formed by a plurality of sparse accumulators in parallel, each sparse accumulator comprising a decoder, an accumulation unit, a bit splicing unit, a D flip-flop and a third two-to-one selector; the inputs of the decoder are connected to the output of the pointer generator and the output of the second two-to-one selector; the output of the decoder is connected to the first input of the third two-to-one selector; the output of the third two-to-one selector is the output of the sparse accumulator and is also connected to the D terminal of the D flip-flop; the Q terminal of the D flip-flop is connected to the accumulation unit and the bit splicing unit, and the input of the accumulation unit is also connected to the output of the second two-to-one selector.
The scheme is further improved in that: the processing units form the first stage and the accumulators form the second through N-th stages; the outputs of every two processing units are connected to one accumulator in the second stage, and the outputs of every two accumulators in the (n+1)-th stage are connected to one accumulator in the (n+2)-th stage, where n ∈ [1, N-2]; the N-th stage contains a single accumulator.
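The staged arrangement described above amounts to a binary reduction tree over the processing-unit outputs: each stage halves the number of partial sums until one accumulator remains. A behavioral sketch (illustrative only; the function name and the power-of-two fan-in assumption are ours, not from the patent):

```python
def adder_tree(pe_outputs):
    """Reduce processing-unit outputs through successive accumulator stages.

    Stage 1 holds the processing units; each subsequent stage pairs up
    the partial sums of the previous stage, so a tree over m inputs
    needs log2(m) accumulator stages.
    """
    stage = list(pe_outputs)                       # outputs of the PE stage
    assert (len(stage) & (len(stage) - 1)) == 0, "power-of-two fan-in assumed"
    while len(stage) > 1:
        # One accumulator per pair of outputs from the previous stage.
        stage = [stage[i] + stage[i + 1] for i in range(0, len(stage), 2)]
    return stage[0]

print(adder_tree([1, 2, 3, 4, 5, 6, 7, 8]))        # prints 36
```

With m = 8 processing units this uses three accumulator stages (4, 2, then 1 accumulator), matching the claim's N-th stage containing a single accumulator.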
The sparse matrix-vector multiplication computation unit for a permuted block diagonal weight matrix provided by the invention fully exploits the sparsity of the pruned weight matrix, avoiding multiplications between zero-valued weights and the corresponding input excitation elements. Zero-skipping can be enabled dynamically according to the sparsity of the input excitation. The sparsity of the intermediate products obtained by multiplying weights with the corresponding input excitations is likewise exploited, avoiding accumulations of zero-valued products. The designed pointer generator eliminates the storage overhead of pointers recording non-zero position information.
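The combined effect of zero-skipping, sparse accumulation and on-the-fly pointer generation can be modeled in software. The sketch below is a behavioral approximation, not the hardware itself, under the same assumed block convention as before (one non-zero weight per block row, at column (i + bias) mod p); the function name and data layout are illustrative:

```python
def sparse_block_matvec(values, biases, x, p, skip_zero_activations=True):
    """Behavioral model of the computation y = W_pruned @ x.

    values[s][t][i] is the single non-zero weight in row i of block (s, t);
    the pointer generator recomputes its column index as
    (i + biases[s][t]) % p, so no per-element position pointers are stored.
    Zero weights, zero activations (when enabled) and zero products are
    skipped, mirroring the multiplier and sparse-accumulator behavior.
    """
    n_block_rows, n_block_cols = len(values), len(values[0])
    y = [0.0] * (n_block_rows * p)
    for s in range(n_block_rows):
        for t in range(n_block_cols):
            b = biases[s][t]
            for i in range(p):
                j = (i + b) % p                    # generated pointer
                w = values[s][t][i]
                a = x[t * p + j]
                if w == 0 or (skip_zero_activations and a == 0):
                    continue                       # zero-skip: no multiply
                prod = w * a
                if prod != 0:                      # sparse accumulator:
                    y[s * p + i] += prod           # skip zero products
    return y

# One 2x2 block with bias 1: weights 2.0 and 3.0 on the shifted diagonal.
print(sparse_block_matvec([[[2.0, 3.0]]], [[1]], [10.0, 20.0], 2))
# prints [40.0, 30.0]
```

The `skip_zero_activations` flag plays the role of the configuration signal: it is worth enabling only when the input excitation is sufficiently sparse, since otherwise the extra comparison costs more than it saves.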
Drawings
The invention will be further explained with reference to the drawings.
Fig. 1 is a schematic structural diagram of a preferred embodiment of the present invention.
Fig. 2 is a schematic diagram of the processing unit of fig. 1.
Fig. 3 is a schematic diagram of the sparse accumulator of fig. 2.
FIG. 4 is a schematic diagram of a data access mode in a weight memory.
Fig. 5 is a schematic diagram of decoder data decoding in a sparse accumulator.
Detailed Description