Compression calculation unit for the sparse weights of a ternary neural network

Document No.: 515583    Published: 2021-05-28

Reading note: the technology "Compression calculation unit for the sparse weights of a ternary neural network" was designed and created by Liu Bo, Yang Haichuan, Qian Junyi, Wang Ziyu, Cai Hao, Gong Yu and Yang Jun on 2021-01-12. Abstract: The invention discloses a compression calculation unit for the sparse weights of a ternary neural network, relating to the field of neural network hardware acceleration and comprising a weight approximation processing unit, a Huffman coding unit, a weight integration unit, a sequence detection decoding module and a calculation optimization configuration module connected in sequence. The approximate compression calculation unit obtains higher sparsity by approximating the weights, compresses and encodes the weights that recur frequently throughout the network, and reduces the computational workload through reuse of convolution calculation results and approximate calculation tailored to the ternary network, thereby reducing the overall area and power consumption of the system.

1. A compression calculation unit for the sparse weights of a ternary neural network, characterized by comprising a weight approximation processing unit and a calculation optimization configuration module;

the output end of the weight approximation processing unit is connected with the input end of the calculation optimization configuration module.

The weight approximation processing unit is used for approximating the weight matrix to form an approximate weight matrix;

a data matrix to be processed and the processed weight matrix are input to the calculation optimization configuration module, which comprises an approximate calculation unit; the approximate calculation unit reads the input weight matrix: when a weight read from the weight matrix is '0', the corresponding data in the data matrix to be processed is not read, and '0' is written directly into the corresponding calculation result storage unit; when the weight read is '1', the corresponding data in the input data matrix to be processed is read and written directly into the corresponding calculation result storage unit; and when the weight read is '-1', the corresponding data in the input data matrix to be processed is read, negated, and then written into the corresponding storage unit.

2. The unit according to claim 1, wherein a Huffman coding unit, a weight integration unit and a sequence detection decoding module are connected in sequence between the weight approximation processing unit and the calculation optimization configuration module;

the output end of the weight approximate processing unit is connected with the input end of the Huffman coding unit, and the output end of the sequence detection decoding module is connected with the input end of the calculation optimization configuration module;

the Huffman coding unit is used for performing Huffman coding on the repeated weight data in the approximate weight matrix; this Huffman coding is a variable-length coding of the repeated weight data;

the weight integration unit is used for integrating the variable-length Huffman coding result of the repeated weights and the fixed-length coding result of the non-repeated weights into a single string of codes, forming a coded weight matrix that is stored in the sequence detection decoding module;

and the sequence detection decoding module is used for decoding the coded weight matrix, recovering both the fixed-length and the variable-length codes.

3. The compression calculation unit for the sparse weights of a ternary neural network according to claim 1, wherein, in the weight approximation processing unit, if the weight data contain first continuous weight data, interval weight data and second continuous weight data arranged in order, the interval weight data are modified into data of the same value as the first continuous weight data, yielding an approximate weight matrix;

the first continuous weight data refer to three or more consecutive identical weights, the interval weight data refer to at most two consecutive weights differing from the continuous weight data, and the second continuous weight data refer to three or more consecutive identical weights; the first and second continuous weight data have the same value but may differ in run length.

4. The unit according to claim 2, wherein the sequence detection decoding module comprises a lookup table, a flag judgment module and a sequence detection module;

the method comprises the steps of carrying out Huffman coding on repeated weight data in a Huffman coding unit, recording coding rules in a lookup table, recording coding flag bits and judging the coding flag bits by a flag judging module, and detecting an input coding weight matrix by a sequence checking module.

5. The unit according to claim 1, wherein the calculation optimization configuration module further comprises a calculation result multiplexing unit;

the calculation result multiplexing unit is used for judging whether a multiplexing unit exists in the input weight matrix; a multiplexing unit exists when at least two identical rows of weight data appear in the weight matrix, each such row of weight data being called a multiplexing unit;

in the calculation result multiplexing unit, when the input data matrix to be processed is convolved with the weight matrix, a given row of the data matrix is operated on only once with a given multiplexing unit of the weight matrix, and the operation result is stored in the cache.

Technical Field

The invention relates to the field of neural network hardware acceleration, and in particular to a compression calculation unit for the sparse weights of a ternary neural network.

Background

At present, in the technical field of keyword recognition, deep neural networks, recurrent neural networks and convolutional neural networks are the mainstream approaches. Convolutional neural networks are the most popular because their characteristics best match the requirements of speech keyword recognition: to improve the recognition rate, the various interferences present in a speech signal must be overcome, and the invariance that convolution provides in both time and space helps the network cope with the variability of speech signals.

A major problem of convolutional neural networks is the large number of multiplications, which makes the amount of computation excessive. Binary-weight convolutional neural networks were therefore proposed: by quantizing the weights to 1 bit, the multiplications in the network can be replaced by negation and shift operations, which greatly simplifies hardware implementation and saves many execution cycles and much power compared with multiplication. However, because the weights are quantized so aggressively, binary-weight networks adapt poorly to diverse scenarios and high-noise environments. Ternary-weight neural networks were therefore proposed, adding a 0 weight to the binary scheme; the increased information content of the weights improves adaptability to diverse scenarios and high-noise environments, and the added 0 weight introduces no multiplication. It does bring a new problem, however: storing a ternary weight requires at least 2 bits, twice the storage of a binary weight.

Disclosure of Invention

Purpose of the invention: the invention aims to solve the problem that the storage space required for the weights of a ternary-weight neural network is too large, and to optimize the calculation unit of such a network. The compression calculation unit for the sparse weights of a ternary-weight neural network mainly compresses the stored weights to reduce the static power consumption of the storage unit as a whole, and reduces the dynamic power consumption of the calculation unit by methods such as reducing data reads and approximate storage.

The compression calculation unit for the sparse weights of a ternary neural network comprises a weight approximation processing unit and a calculation optimization configuration module; the output end of the weight approximation processing unit is connected with the input end of the calculation optimization configuration module.

The weight approximation processing unit is used for approximating the weight matrix to form an approximate weight matrix. A data matrix to be processed and the processed weight matrix are input to the calculation optimization configuration module, which comprises an approximate calculation unit. The approximate calculation unit reads the input weight matrix: when a weight read from the weight matrix is '0', the corresponding data in the data matrix to be processed is not read, and '0' is written directly into the corresponding calculation result storage unit; when the weight read is '1', the corresponding data in the input data matrix to be processed is read and written directly into the corresponding calculation result storage unit; and when the weight read is '-1', the corresponding data is read, negated, and then written into the corresponding storage unit. The approximate calculation unit thus realizes a skip-on-zero operation that reduces the number of data reads and of adder toggles, and adopts approximate calculation for the negative value of the ternary weights to reduce calculation power consumption and signal toggling.

Preferably, a Huffman coding unit, a weight integration unit and a sequence detection decoding module are connected in sequence between the weight approximation processing unit and the calculation optimization configuration module; the output end of the weight approximation processing unit is connected with the input end of the Huffman coding unit, and the output end of the sequence detection decoding module is connected with the input end of the calculation optimization configuration module. The weight approximation processing unit approximates the weight matrix to form an approximate weight matrix with higher sparsity; the Huffman coding unit performs Huffman coding, a variable-length coding, on the repeated weight data in the approximate weight matrix.

The weight integration unit integrates the variable-length Huffman coding result of the repeated weight data with the fixed-length coding result of the non-repeated weight data into a coded weight matrix, which is stored in the sequence detection decoding module; the sequence detection decoding module decodes the coded weight matrix, recovering both the fixed-length and the variable-length codes.

Preferably, in the weight approximation processing unit, if the weight data contain first continuous weight data, interval weight data and second continuous weight data arranged in order, the interval weight data are modified into data of the same value as the first continuous weight data, yielding an approximate weight matrix and improving the sparsity of the weights.

The first continuous weight data refer to three or more consecutive identical weights, the interval weight data refer to at most two consecutive weights differing from the continuous weight data, and the second continuous weight data refer to three or more consecutive identical weights; the first and second continuous weight data have the same value but may differ in run length.

The Huffman coding unit is used for performing Huffman coding on the repetition counts of identical weights, reducing the overall bit width of the weights. The specific method is as follows: let the repeated weights to be coded have 5 symbols u1, u2, u3, u4, u5, each representing a number of repetitions of the same weight, with corresponding probabilities p1 = 0.4, p2 = 0.1, p3 = p4 = 0.2, p5 = 0.1. Assuming these 5 types occur 10 times in total, u1 repeats 4 times, u2 once, u3 and u4 twice each, and u5 once. First the symbols are sorted by probability in descending order. During coding, starting from the two symbols with the smallest probabilities, the upper branch is labelled 0 and the lower branch 1. The probabilities of the two merged branches are then combined and the list is re-sorted. This procedure is repeated until the merged probability reaches 1. Since the codewords of a Huffman code form a prefix code — no codeword is the prefix of another — the codewords can be transmitted consecutively and decoded uniquely without inserting separator symbols between them.
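The merging procedure described above can be sketched in software. The following is a minimal illustration (not part of the patent's circuitry), using the five symbols u1–u5 and their probabilities from the example; the heap-based formulation and names are our own.

```python
import heapq
from itertools import count

def huffman(probs):
    """Build a Huffman code for {symbol: probability}: repeatedly merge
    the two least-probable nodes, prefixing 0/1 onto their codewords."""
    tie = count()  # tie-breaker so heapq never compares the code dicts
    heap = [(p, next(tie), {s: ''}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in c1.items()}   # upper branch: 0
        merged.update({s: '1' + c for s, c in c2.items()})  # lower: 1
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

codes = huffman({'u1': 0.4, 'u2': 0.1, 'u3': 0.2, 'u4': 0.2, 'u5': 0.1})
```

Because Huffman ties are resolved arbitrarily, the exact codewords can vary between implementations; only the prefix-free property (and shorter codes for more probable symbols) is guaranteed.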

Preferably, the sequence detection decoding module comprises a lookup table, a flag judgment module and a sequence detection module. The approximated weight matrix is Huffman-coded in the Huffman coding unit, the coding rules are recorded in the lookup table, the flag judgment module records the coding flag bits and judges them, and the sequence detection module scans the input coded weight matrix.

Preferably, the decoding of the input coding weight matrix by the sequence detection decoding module includes the following steps:

step 201: the unused '10' in the coding weight matrix is used as a coding flag bit, the sequence detection module scans the weight data in the weight matrix, the scanned weight data is compared with the recorded coding flag bit through the flag judgment module, and when the coding flag bit is scanned by the sequence detection module, the Huffman decoding state is entered.

Step 202: after the Huffman decoding state is entered, the 2 bits of data following the scanned coding flag bit indicate which weight value the subsequently decoded run contains.

Step 203: the sequence detection module, working with the lookup table, continues to scan onward bit by bit; when the scanned data matches a code in the lookup table, it is decoded, scanning stops, the Huffman decoding state ends, and the ordinary decoding state — decoding directly with the lookup table — is entered.

Preferably, the calculation optimization configuration module includes a calculation result multiplexing unit.

The calculation result multiplexing unit judges whether a multiplexing unit exists in the input weight matrix. If one exists, the calculation result multiplexing unit is used preferentially for the convolution of the input data matrix with the weight matrix, since multiplexing reduces the amount of network computation; if no multiplexing unit exists in the current weight matrix, the approximate calculation unit is used for the convolution instead.

A multiplexing unit exists when at least two identical rows of weight data appear in the weight matrix; each such row of weight data is called a multiplexing unit.

In the calculation result multiplexing unit, when the input data matrix to be processed is convolved with the weight matrix, a given row of the data matrix is operated on only once with a given multiplexing unit of the weight matrix, and the operation result is stored in the cache.

Advantageous effects: the approximate compression calculation unit obtains higher sparsity by approximating the weights, compresses and encodes the weights that recur frequently throughout the network, and reduces the computational workload through reuse of convolution calculation results and approximate calculation tailored to the ternary network, thereby reducing the overall area and power consumption of the system.

Drawings

FIG. 1 is a flow chart of a method of weight approximation compression encoding;

FIG. 2 is a diagram illustrating a weight approximation encoding method;

FIG. 3 is a diagram illustrating a method of weight compression encoding;

FIG. 4 is a circuit diagram of a sequence detection decoding module;

FIG. 5 is a schematic method of a computation result multiplexing unit;

FIG. 6 is a flow chart of a ternary neural network approximation calculation unit.

Detailed Description

The present invention is further illustrated by the following examples. It should be understood that these examples are intended only to illustrate the invention, not to limit its scope; after reading the present disclosure, various equivalent modifications made by those skilled in the art fall within the scope defined by the appended claims.

The compression calculation unit for the sparse weights of a ternary neural network comprises a weight approximation processing unit, a Huffman coding unit, a weight integration unit, a sequence detection decoding module and a calculation optimization configuration module, connected in sequence.

as shown in fig. 2, the method for approximating a weight matrix in a weight approximation processing unit includes the following steps:

step 101: traversing the weight matrix, wherein if a mode of sequentially arranging first continuous weight data, interval weight data and second weight data occurs, the first continuous weight data refers to three or more continuous same weight data, the interval weight data refers to two or less data different from the continuous weight data, the second continuous weight data refers to three or more continuous same weight data, and the first continuous weight data and the second continuous weight data have the same data size and different data quantity; step 102 is entered. As can be seen from fig. 2, the first continuous weight data is 5 '-1', the interval data is '0' and '1', and the second continuous weight data is 3 '-1'.

Step 102: the interval weight data is modified into the data with the same size as the first continuous weight data, namely, the data with the same size as the second continuous weight data, so as to improve the sparsity of the weight. In fig. 2, the interval weight data '0', '1' are approximated to two '-1'.

Step 103: obtain the approximate weight matrix and store it.
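As a sketch only (this is our reading of steps 101–103 as a software model, not circuitry from the patent), the approximation rule applied to a flattened weight sequence might look like:

```python
def approximate(w):
    """Overwrite an interval of at most two differing weights that is
    sandwiched between two runs (>= 3 each) of the same value."""
    w = list(w)
    n = len(w)
    i = 0
    while i < n:
        j = i
        while j < n and w[j] == w[i]:
            j += 1                          # run w[i:j] of equal weights
        changed = False
        if j - i >= 3:                      # first continuous run found
            for gap in (1, 2):              # interval of 1 or 2 weights
                k = j + gap
                if (k + 3 <= n
                        and all(x != w[i] for x in w[j:k])
                        and all(x == w[i] for x in w[k:k + 3])):
                    w[j:k] = [w[i]] * gap   # approximate the interval
                    changed = True
                    break
        if not changed:
            i = j   # otherwise rescan the (now longer) run from i
    return w
```

On the fig. 2 example, `approximate([-1]*5 + [0, 1] + [-1]*3)` returns ten '-1's, matching the described result.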

Specifically, taking the weight matrix of a ternary network as an example, the weight matrix is input into the weight approximation processing unit; the overall flow is shown in fig. 1. The weight matrix is input, the weight data are read, and it is determined whether the pattern of first continuous weight data, interval weight data and second continuous weight data exists in the matrix; if so, approximate coding can be performed, and steps 101 to 103 produce the approximate weight matrix.

Taking specific weight matrix data as an example: a small amount of weight data is first modified to improve the sparsity of the weights, so that after Huffman coding the data can be compressed to a higher degree. The weight matrix of the ternary network is as follows (the matrix itself is not reproduced in the source):

the specific operation for the weight matrix is as follows:

Traversing the weight matrix, it is found that three consecutive '0's and another three consecutive '0's are separated by a single '1'; according to the set approximation condition, the intervening '1' is modified to '0'.

Likewise, the traversal finds that three consecutive '1's and another three consecutive '1's are separated by a '0' and a '1'; according to the set approximation condition, both intervening weights are modified to '1'.

When no further such cases are found and the traversal is finished, the operation of the weight approximation processing unit ends, and the weight matrix has been modified into the approximate weight matrix (not reproduced in the source):

after the weight matrix is processed in the weight approximation processing unit, whether the weight matrix after approximation processing is compressible is judged, when repeated weight data exists, the weight matrix is input into the Huffman coding unit to be compressed by Huffman indefinite length coding, and if the repeated weight data does not exist, only fixed length coding is used, and a schematic method is shown in FIG. 3.

Taking the above approximate weight matrix as an example, variable-length Huffman compression proceeds as follows.

Traversing the data yields the following run statistics:

runs of 6 '0', 3 '0', 7 '0', 6 '1', 4 '1', 8 '1' and 3 '-1' each occur once, and a run of 4 '-1' occurs twice.
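These run statistics can be gathered with a simple run-length scan; the following helper is illustrative only (the function name and the use of `Counter` are ours):

```python
from collections import Counter

def run_lengths(w):
    """Count (run length, value) pairs in a flat weight sequence."""
    runs = []
    i = 0
    while i < len(w):
        j = i
        while j < len(w) and w[j] == w[i]:
            j += 1                  # extend the current run
        runs.append((j - i, w[i]))  # (length, value)
        i = j
    return Counter(runs)
```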

In this embodiment, the approximate weight matrix is compressed by the Huffman coding unit, and the resulting coding rules are:

6 '0': 101;

3 '0': 100;

7 '0': 111;

6 '1': 110;

4 '1': 0001;

8 '1': 0000;

3 '-1': 001;

4 '-1': 01.

After the weight matrix has been approximated by the weight approximation processing unit and compressed by the Huffman coding unit, the weight integration unit integrates the variable-length Huffman coding result of the repeated weights and the fixed-length coding result of the non-repeated weights into a single string of codes, forming a coded weight matrix that is stored in the sequence detection decoding module.

When the neural network begins to operate, the sequence detection decoding module decodes the fixed-length and variable-length codes. The circuit of the sequence detection decoding module is shown in fig. 4 and comprises a lookup table, a flag judgment module and a sequence detection module: the lookup table records the Huffman coding rules, the flag judgment module records the coding flag bits and judges them, and the sequence detection module scans the input coded weight matrix.

The sequence detection decoding module decodes the input coded weight matrix by the following steps:

step 201: as shown in fig. 3, the unused '10' in the weight data of the ternary neural network is used as an encoding flag bit, the sequence detection module scans the weight data in the encoding weight matrix, the scanned weight data is compared with the recording encoding flag bit through the flag judgment module, and the sequence detection module enters a huffman decoding state when detecting the '10'.

Step 202: after the decoding state is entered, the 2 bits of data detected next by the sequence detection module indicate which weight value the subsequently decoded run contains; for example, '11' in the figure means that a run count of '-1' is being decoded.

Step 203: the sequence detection module, working with the lookup table, scans onward bit by bit and stops as soon as the scanned code can be decoded; the Huffman decoding state then ends and the ordinary decoding state is entered. For example, '100' in the figure means that '-1' occurs 5 times.

Taking the coding rules above as an example:

6 '0': 101;

3 '0': 100;

7 '0': 111;

6 '1': 110;

4 '1': 0001;

8 '1': 0000;

3 '-1': 001;

4 '-1': 01.

A '0' is detected, which cannot yet be decoded; a further '1' is detected, and '01' is decoded as four '-1's according to the lookup table. Next a '1' is detected, which cannot be decoded; a further '1', still undecodable; then a '0', and '110' is decoded as six '1's.
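This walkthrough amounts to greedy prefix-code decoding. A hypothetical software model (the table name `RUNS` and the function `decode_runs` are ours; the real module is a scanning circuit) is:

```python
# Run-length Huffman table from the example above: code -> (count, value).
RUNS = {'101': (6, 0), '100': (3, 0), '111': (7, 0), '110': (6, 1),
        '0001': (4, 1), '0000': (8, 1), '001': (3, -1), '01': (4, -1)}

def decode_runs(bits):
    """Scan bit by bit until the accumulated prefix matches a codeword,
    then emit that run of weights; repeat until the stream is consumed."""
    out, i = [], 0
    while i < len(bits):
        j = i + 1
        while bits[i:j] not in RUNS:
            j += 1                          # extend the scanned prefix
            if j > len(bits):
                raise ValueError('undecodable stream')
        count, value = RUNS[bits[i:j]]
        out.extend([value] * count)
        i = j
    return out
```

On the stream of the walkthrough, `decode_runs('01110')` yields four '-1's followed by six '1's. Because the table is a prefix code, no separator bits are needed between codewords.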

The decoded weight matrix then enters the operation stage of the ternary network. The weight matrix is first input into the calculation result multiplexing unit of the ternary-weight neural network, which judges whether the result multiplexing unit can currently be used to reduce the amount of network computation; the method is illustrated in fig. 5. The input data matrix to be processed and the 3 x 3 weight matrix are as shown in fig. 5 (the matrices themselves are not reproduced in the source).

The calculation result multiplexing unit judges that two identical rows — the multiplexing unit (1 1 1) — exist in the weight matrix, and therefore performs the convolution of the input data matrix with the weight matrix using the calculation result multiplexing unit.

The weight matrix is aligned with the first group of the input matrix and a convolution operation is performed to compute the first group of results, which are written into the result storage unit; the first group of operations then ends, and the operation result (0.5 0.25 1.25) of the multiplexing unit (1 1 1) is written into the buffer.

The weight matrix slides down one row to align with the second group of the input matrix, and a convolution with this second group yields the second group of results. According to the multiplexing judgment, the first row of the second group of results is not computed explicitly: the cached operation result (0.5 0.25 1.25) of the multiplexing unit (1 1 1) is read directly from the buffer and written into the first row of the second group of results, while the remaining results are computed explicitly from the remaining data and weights and written into their storage units. With the weights unchanged, the first row of every group of results is read directly from the second row of the previous group of results, reducing the computation of the weight operation by one third.
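The row-reuse scheme can be modelled in software as a 'valid' 2-D convolution whose per-row partial sums are cached, so that a weight row identical to another reuses the result already computed for the same data row; this is an illustrative model (function name and caching strategy are ours), not the patent's circuit:

```python
import numpy as np

def conv2d_row_cached(data, w):
    """'Valid' 2-D correlation; partial sums are cached per
    (weight row, data row) pair so identical weight rows reuse them."""
    kh, kw = w.shape
    oh, ow = data.shape[0] - kh + 1, data.shape[1] - kw + 1
    cache = {}
    out = np.zeros((oh, ow))
    for r in range(oh):
        for i in range(kh):
            key = (tuple(w[i]), r + i)      # (weight row, data row index)
            if key not in cache:
                row = data[r + i]
                cache[key] = np.array([row[c:c + kw] @ w[i]
                                       for c in range(ow)])
            out[r] += cache[key]            # reuse cached partial sums
    return out
```

When the kernel slides down one row, every (weight row, data row) pair already seen — in particular the repeated (1 1 1) rows — is fetched from the cache instead of recomputed.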

When the multiplexing unit judges that multiplexing is impossible, the data matrix to be processed and the weight matrix are input into the ternary neural network approximate calculation unit; the flow is shown in fig. 6. When the weight read is '0', the input data is not read and '0' is written directly into the corresponding storage unit; when the weight read is '1', the input data is written into the corresponding storage unit; when the weight read is '-1', an approximate measure is adopted: the data is only bit-inverted, without adding 1, and the inverted result is written into the corresponding storage unit. All network operations are then completed and the final result is output.
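The flow of fig. 6 can be sketched element-wise as follows. The 8-bit word width and the function name are assumptions; the '-1' branch shows the approximation described above — bitwise inversion without the +1 of two's-complement negation (one's complement), so the approximate result of negating d is -(d+1):

```python
def ternary_apply(data, weights):
    """Approximate calculation unit: skip reads for weight 0, pass data
    through for weight 1, and invert bits (omitting the +1) for weight -1."""
    BITS = 8                                 # assumed fixed-point word width
    out = []
    for d, w in zip(data, weights):
        if w == 0:
            out.append(0)                    # no data read; write 0 directly
        elif w == 1:
            out.append(d)                    # pass the input through
        else:  # w == -1
            inv = (~d) & ((1 << BITS) - 1)   # invert bits, omit the +1
            # reinterpret the word as a signed value
            out.append(inv - (1 << BITS) if inv >= (1 << (BITS - 1)) else inv)
    return out
```

For example, `ternary_apply([5, 3, 7], [0, 1, -1])` yields `[0, 3, -8]`: the '-1' branch approximates -7 as -8, trading a bias of one for the skipped increment.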
