Quantization method and device

Document No.: 1414406    Publication date: 2020-03-10

Reading note: this technology, "Quantization method and device" (一种量化方法及装置), was designed and created by Guo Qinghai (郭青海), Cheng Jie (程捷) and Jiang Lei (蒋磊) on 2018-09-03. Its main content is as follows: a quantization method and a quantization device are provided, which offer a universal quantization method that keeps the precision loss after quantization small on the premise of a hardware-friendly design. In the method, the weights respectively corresponding to N channels in a neural network are read, the N channels are divided into F groups, the quantization coefficient of each group is determined according to the weights corresponding to the channels contained in that group, and the weights corresponding to the group are quantized according to the quantization coefficient; each group comprises at least one channel, the channels contained in at least one group belong to at least two layers of the neural network, and F is a positive integer smaller than N. In this way, the channels of all layers of the neural network are grouped as a whole, and some of the resulting groups contain channels from different layers, which breaks the limitation of the prior art that only single-layer grouping can be considered and thus improves the quantization precision; moreover, the number of groups is much smaller than in existing grouping schemes, so the hardware cost during quantization can be reduced.

1. A method of quantization, comprising:

reading weights respectively corresponding to N channels in a neural network, wherein each layer of M layers of the neural network comprises at least one channel, each channel corresponds to at least one weight, N is an integer greater than 1, and M is an integer greater than 1;

dividing the N channels into F groups, wherein each group comprises at least one channel, a plurality of channels contained in at least one group belong to at least two layers of the neural network, and F is a positive integer less than N;

and determining the quantization coefficients of each group according to the weights respectively corresponding to the channels contained in each group, and quantizing the weights corresponding to the group according to the quantization coefficients.

2. The method of claim 1, wherein dividing the N channels into F groups comprises:

determining F-1 dividing points in the N channels, wherein any dividing point is any one of two adjacent channels at the boundary between two adjacent groups;

and grouping the N channels according to the F-1 dividing points to obtain the F groups.

3. The method of claim 2, wherein determining F-1 dividing points in the N channels comprises:

under the condition that p takes any one integer from 1 to N, when p takes each value, respectively:

determining a corresponding grouped sequence, a sequence to be grouped and a dividing point sequence, wherein the grouped sequence comprises the channel identifications respectively corresponding to the first p-1 channels, which have been grouped, the sequence to be grouped comprises the channel identifications respectively corresponding to the p-th to the N-th channels, which have not been grouped, and the dividing point sequence comprises the channel identifications corresponding to the channels serving as dividing points among the first p-1 grouped channels;

taking the (r+1)-th to the p-th channels as one group and the 1st to the r-th channels as another group to form one grouping result, wherein p-1 grouping results are obtained as r takes each integer from p-1 down to 1;

calculating the weight loss degree corresponding to each grouping result according to a preset weight loss function, and selecting the grouping result with the minimum weight loss degree from the p-1 grouping results;

updating the channel identification corresponding to the dividing point at the boundary between the two groups in the selected grouping result into the dividing point sequence;

and taking the channels corresponding to the channel identifications in the dividing point sequence finally obtained after p has taken every integer from 1 to N as the F-1 dividing points.

4. The method of claim 3, wherein the preset weight loss function conforms to the following equation:

f(I) = \sum_{i \in I} g(\omega_i) \left( \omega_i - \frac{\mathrm{round}(\theta \omega_i)}{\theta} \right)^2

wherein f() is the weight loss function of a channel group, I is the set of channel identifications respectively corresponding to the channels included in the channel group, A_i is the i-th channel included in the channel group, ω_i is the weight corresponding to A_i, θ is the quantization coefficient corresponding to the channel group, g() is a weight-related function for adjusting the precision of the weight loss function, and round() is a rounding function.

5. The method of claim 4, wherein the weight loss degree conforms to the following equation:

D(P) = \sum_{I \in P} f(I)

wherein D() is the weight loss degree of the channel grouping and P is the grouped sequence.

6. The method of any one of claims 1-5, wherein prior to reading the weights corresponding to the respective N channels in the neural network, the method further comprises:

and training the neural network to obtain all weights in the neural network.

7. A quantization apparatus, comprising:

the communication unit is used for reading weights respectively corresponding to N channels in a neural network, wherein each layer of M layers of the neural network comprises at least one channel, each channel corresponds to at least one weight, N is an integer larger than 1, and M is an integer larger than 1;

the processing unit is used for dividing the N channels into F groups, wherein each group comprises at least one channel, a plurality of channels contained in at least one group belong to at least two layers of the neural network, and F is a positive integer smaller than N; and

and determining the quantization coefficients of each group according to the weights respectively corresponding to the channels contained in each group, and quantizing the weights corresponding to the group according to the quantization coefficients.

8. The apparatus as claimed in claim 7, wherein said processing unit, when dividing said N channels into F groups, is specifically configured to:

determining F-1 dividing points in the N channels, wherein any dividing point is any one of two adjacent channels at the boundary between two adjacent groups;

and grouping the N channels according to the F-1 dividing points to obtain the F groups.

9. The apparatus as claimed in claim 8, wherein said processing unit, when determining F-1 dividing points in said N channels, is specifically configured to:

under the condition that p takes any one integer from 1 to N, when p takes each value, respectively:

determining a corresponding grouped sequence, a sequence to be grouped and a dividing point sequence, wherein the grouped sequence comprises the channel identifications respectively corresponding to the first p-1 channels, which have been grouped, the sequence to be grouped comprises the channel identifications respectively corresponding to the p-th to the N-th channels, which have not been grouped, and the dividing point sequence comprises the channel identifications corresponding to the channels serving as dividing points among the first p-1 grouped channels;

taking the (r+1)-th to the p-th channels as one group and the 1st to the r-th channels as another group to form one grouping result, wherein p-1 grouping results are obtained as r takes each integer from p-1 down to 1;

calculating the weight loss degree corresponding to each grouping result according to a preset weight loss function, and selecting the grouping result with the minimum weight loss degree from the p-1 grouping results;

updating the channel identification corresponding to the dividing point at the boundary between the two groups in the selected grouping result into the dividing point sequence;

and taking the channels corresponding to the channel identifications in the dividing point sequence finally obtained after p has taken every integer from 1 to N as the F-1 dividing points.

10. The apparatus of claim 9, wherein the preset weight loss function conforms to the following equation:

f(I) = \sum_{i \in I} g(\omega_i) \left( \omega_i - \frac{\mathrm{round}(\theta \omega_i)}{\theta} \right)^2

wherein f() is the weight loss function of a channel group, I is the set of channel identifications respectively corresponding to the channels included in the channel group, A_i is the i-th channel included in the channel group, ω_i is the weight corresponding to A_i, θ is the quantization coefficient corresponding to the channel group, g() is a weight-related function for adjusting the precision of the weight loss function, and round() is a rounding function.

11. The apparatus of claim 10, wherein the weight loss degree conforms to the following equation:

D(P) = \sum_{I \in P} f(I)

wherein D() is the weight loss degree of the channel grouping and P is the grouped sequence.

12. The apparatus of any of claims 7-11, wherein the processing unit is further configured to:

before the communication unit reads the weights corresponding to the N channels in the neural network, the neural network is trained to obtain all the weights in the neural network.

13. A computer program product comprising instructions for causing a computer to perform the method of any one of claims 1 to 6 when the computer program product is run on a computer.

14. A computer storage medium, in which a computer program is stored, which, when executed by a computer, causes the computer to perform the method as provided in any one of claims 1 to 6.

15. A chip, wherein the chip is connected to a memory for reading and executing program instructions stored in the memory to implement the method of any one of claims 1 to 6.

Technical Field

The present application relates to the field of computer technologies, and in particular, to a quantization method and apparatus.

Background

The arrival of the big data and artificial intelligence era is driving revolutionary changes in data processing: in addition to high accuracy, requirements such as real-time performance, low power consumption and intelligence are increasingly demanded. Along with this development, data processing through neural networks is more and more widely applied.

From the storage perspective, existing neural networks are stored as floating-point data, and a neural network model generally needs tens to hundreds of megabytes of storage resources; at this model size, it is difficult to port a neural network model to terminal equipment such as a mobile phone. From the calculation perspective, a neural network needs to perform a large number of multiplication and addition operations, and in application scenarios with high real-time requirements, such as autonomous driving where multiple neural networks must compute simultaneously, these demands are difficult to meet. From the hardware perspective, existing neural networks mainly run on CPUs and GPUs that operate on floating-point data; to obtain lower power consumption and faster operation, the neural network algorithm can be implemented on a customizable FPGA platform, but hardware resource constraints require that floating-point operations be converted into fixed-point operations with lower storage cost. Therefore, quantizing the floating-point data of a model into fixed-point integer data is an important research direction.

At present, the commonly used quantization method of the neural network is mainly to count the weight value of each layer of the neural network, and each layer determines the quantization coefficient corresponding to the weight value of the layer according to the maximum value of the weight. When calculating the feature map (feature map) output by each layer, firstly multiplying the weight matrix by the corresponding quantization coefficient to obtain a quantization weight matrix, then convolving the feature map of the previous layer with the quantization weight, and dividing the result by the corresponding quantization coefficient to restore the original data value so as to finish quantization.

Obviously, this method only mechanically considers the weight distribution of each layer, i.e. each layer corresponds to one quantization scheme. However, because the variation among the many weights within a layer is uncertain, applying one quantization scheme to the whole layer cannot guarantee the precision after quantization; and because a neural network may have a huge number of layers, possibly hundreds, implementing one quantization scheme per layer may be costly in hardware. Therefore, the usability of the above quantization method is poor.

In summary, there is an urgent need for a quantization method that ensures a small precision loss after quantization on the premise of a hardware-friendly design.

Disclosure of Invention

The application provides a quantization method and a quantization apparatus, which provide a universal quantization method that ensures a small precision loss after quantization on the premise of a hardware-friendly design.

In a first aspect, the present application provides a quantization method, wherein weights corresponding to N channels in a neural network are read, the N channels are divided into F groups, quantization coefficients of each group are determined according to the weights corresponding to the channels included in the group, and the weights corresponding to the group are quantized according to the quantization coefficients; wherein each of the M layers of the neural network comprises at least one channel, each channel corresponds to at least one weight, N is an integer greater than 1, and M is an integer greater than 1; each group comprises at least one channel, a plurality of channels contained in at least one group belong to at least two layers of the neural network, and F is a positive integer smaller than N.

By the method, the channels of all layers in the neural network are grouped as a whole, and some of the resulting groups contain channels from different layers, so the limitation of the prior art that only single-layer grouping can be considered is broken and the quantization precision can be improved; in addition, the number of groups produced by the method is much smaller than in existing grouping schemes, i.e. far fewer quantization schemes are needed than in the prior art, which in turn reduces the hardware cost during quantization.

In one possible design, when the N channels are divided into F groups, a specific method may be as follows: F-1 dividing points are determined in the N channels, and the N channels are grouped according to the F-1 dividing points to obtain the F groups, wherein any dividing point is either one of the two adjacent channels at the boundary between two adjacent groups.

By the method, the N channels can be accurately divided into F groups.

In one possible design, the F-1 dividing points may be determined in the N channels as follows: under the condition that p takes any one integer from 1 to N, when p takes each value, respectively:

determining a corresponding grouped sequence, a sequence to be grouped and a dividing point sequence, wherein the grouped sequence comprises the channel identifications respectively corresponding to the first p-1 channels, which have been grouped, the sequence to be grouped comprises the channel identifications respectively corresponding to the p-th to the N-th channels, which have not been grouped, and the dividing point sequence comprises the channel identifications corresponding to the channels serving as dividing points among the first p-1 grouped channels; taking the (r+1)-th to the p-th channels as one group and the 1st to the r-th channels as another group to form one grouping result, wherein p-1 grouping results are obtained as r takes each integer from p-1 down to 1; calculating the weight loss degree corresponding to each grouping result according to a preset weight loss function, and selecting the grouping result with the minimum weight loss degree from the p-1 grouping results; updating the channel identification corresponding to the dividing point at the boundary between the two groups in the selected grouping result into the dividing point sequence; and taking the channels corresponding to the channel identifications in the dividing point sequence finally obtained after p has taken every integer from 1 to N as the F-1 dividing points.

By this method, dividing points meeting the requirements can be obtained, so that grouping can be carried out according to the determined dividing points.

In one possible design, the preset weight loss function may conform to the following equation:

f(I) = \sum_{i \in I} g(\omega_i) \left( \omega_i - \frac{\mathrm{round}(\theta \omega_i)}{\theta} \right)^2

wherein f() is the weight loss function of a channel group, I is the set of channel identifications respectively corresponding to the channels included in the channel group, A_i is the i-th channel included in the channel group, ω_i is the weight corresponding to A_i, θ is the quantization coefficient corresponding to the channel group, g() is a weight-related function for adjusting the precision of the weight loss function, and round() is a rounding function.

In one possible design, the degree of weight loss may be in accordance with the following equation:

D(P) = \sum_{I \in P} f(I)

wherein D() is the weight loss degree of the channel grouping and P is the grouped sequence.

In one possible design, before the weights corresponding to the N channels in the neural network are read, the neural network is trained to obtain all the weights in the neural network. Thus, channel grouping can be carried out subsequently according to the corresponding weight, and quantization can be carried out.

In a second aspect, the present application also provides a quantization apparatus having a function of implementing the above method. The functions can be realized by hardware, and the functions can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above-described functions.

In a possible design, the structure of the quantization apparatus includes a communication unit and a processing unit, and these units may perform corresponding functions in the foregoing method example, which is specifically referred to the detailed description in the method example, and is not described herein again.

In one possible design, the structure of the quantization apparatus includes a communication module and a processor, and optionally a memory; the communication module is configured to acquire data and to communicate and interact with other devices, and the processor is configured to execute the above-mentioned method. The memory is coupled to the processor and holds the program instructions and data necessary for the quantization apparatus.

In a third aspect, the present application also provides a computer storage medium storing computer-executable instructions which, when invoked by a computer, cause the computer to perform any of the methods mentioned in the first aspect above.

In a fourth aspect, the present application also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods mentioned above in relation to the first aspect.

In a fifth aspect, the present application further provides a chip, connected to the memory, for reading and executing the program instructions stored in the memory to implement any one of the methods mentioned in the first aspect.

Drawings

FIG. 1 is a schematic diagram of a neural network provided herein;

FIG. 2 is a schematic diagram illustrating a training process of a neural network provided herein;

FIG. 3 is a schematic representation of a before and after quantization data flow provided herein;

FIG. 4 is a schematic diagram of a quantized hardware implementation provided herein;

FIG. 5 is a flow chart of a quantization method provided herein;

fig. 5a is a schematic flow chart of determining a segmentation point according to the present application;

FIG. 6 is a schematic diagram of an amplifier provided herein;

FIG. 7 is a schematic view of a shifter provided herein;

FIG. 8 is a schematic diagram of a channel grouping provided herein;

fig. 9 is a schematic structural diagram of a quantization apparatus provided in the present application;

fig. 10 is a structural diagram of a quantization apparatus provided in the present application.

Detailed Description

The present application will be described in further detail below with reference to the accompanying drawings.

The embodiment of the application provides a quantization method and a quantization apparatus, which provide a general quantization method ensuring a small precision loss after quantization on the premise of a hardware-friendly design. The method and the apparatus are based on the same inventive concept; because the principles by which they solve the problem are similar, the implementations of the apparatus and the method may refer to each other, and repeated parts are not described again.

Hereinafter, some terms in the present application are explained to facilitate understanding by those skilled in the art.

1) The neural network is used to simulate the behavior characteristics of animal neural networks and processes data in a manner similar to the structure of synaptic connections in the brain. As a mathematical operation model, a neural network is formed by a large number of interconnected nodes (or neurons). A neural network is composed of an input layer, a hidden layer and an output layer, for example as shown in fig. 1. The input layer receives the input data of the neural network; the output layer produces the output data of the neural network; the hidden layer is formed by many nodes connected between the input layer and the output layer and performs the operation processing on the input data. The hidden layer may consist of one or more layers. The number of hidden layers and the number of nodes per hidden layer are directly related to the complexity of the problem the neural network actually solves, as well as to the numbers of input-layer and output-layer nodes. Among neural networks, the Deep Neural Network (DNN) is common, and the Convolutional Neural Network (CNN) is one common kind of DNN.

2) A channel of a neural network is a set of convolution kernels and a bias used in a convolutional neural network to compute a feature map. Each layer (called a convolutional layer in a convolutional neural network) has multiple channels.

3) At least one means one or more, and a plurality means two or more.

At present, neural networks are applied in many scenarios. For example, in an autonomous driving scenario, a deep learning model (i.e. a neural network model) is needed to handle multiple tasks such as target recognition, target classification and target tracking. In these tasks, a very effective model can be obtained through a deep convolutional neural network and a large amount of supervised training. On the other hand, however, as the depth and the number of parameters of the neural network increase, the time and resources required for one computation also increase greatly and can no longer meet the resource-configuration and response-time requirements of autonomous driving. A quantization method can therefore greatly reduce the computation amount and computation time of the model on the premise that the precision remains basically unchanged.

For example, when image recognition is performed through a neural network, the input unit obtains a picture from a camera, the picture is transmitted to the processing unit in the form of pixel values, and the processing unit performs matrix operations on the pixel values and the trained neural network (for image recognition, the training flow of the neural network, or of the neural network model, may be as shown in fig. 2), finally obtaining the output of a certain label (for example, determining the category of the picture). Because the main computing and storage resources are consumed in the processing unit, a quantization method can be used to reduce this part of the overhead by converting a complex data type (such as 32-bit floating point, Float32) into a simpler data type with less storage (such as 8-bit integer, Int8), thereby reducing resource consumption. For example, in image recognition, a comparison of the data flow before and after quantization can be as shown in fig. 3, from which it can be seen that the data before quantization is Float32 and the data after quantization is Int8.
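The storage saving behind this Float32 to Int8 conversion can be illustrated with a trivial sketch (NumPy is used here purely for illustration; the array contents and the max-based scale factor are made up and are not prescribed by the embodiment):

import numpy as np

weights_fp32 = np.random.randn(1000).astype(np.float32)   # original Float32 data
scale = 127.0 / np.abs(weights_fp32).max()                 # map the largest magnitude to the Int8 limit
weights_int8 = np.round(weights_fp32 * scale).astype(np.int8)

print(weights_fp32.nbytes)   # 4000 bytes before quantization
print(weights_int8.nbytes)   # 1000 bytes after quantization, a 4x storage reduction
restored = weights_int8.astype(np.float32) / scale         # dividing by the same scale approximately restores the data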

In specific quantization, in order to ensure that the precision of the neural network is not affected, different quantization schemes need to be set for different characteristics of the data, and the quantization is then realized through hardware settings. For example, when quantization is implemented using a resistive random access memory (ReRAM), different amplifiers may be set in the ReRAM according to the different quantization schemes when the hardware is configured. Fig. 4 shows one possible quantized hardware implementation. Therefore, it is important to balance the precision loss (i.e. ensuring the precision of the neural network) against the hardware implementation. Based on this, the present application provides a general quantization method that ensures a small precision loss after quantization on the premise of a hardware-friendly design.

In the embodiment of the present application, the quantization method may be performed by, but not limited to, a processor, where when the quantization apparatus is a processor, the processor may be a processor in a computer apparatus, a processor in another device (e.g., a quantization chip system, a ReRAM), or a separate processor. In the embodiments of the present application, the execution main body is taken as a processor for detailed description.

The quantization method provided in the embodiment of the present application is applicable to the neural network shown in fig. 1, and referring to fig. 5, a specific process of the method includes:

step 501, a processor reads weights corresponding to N channels in a neural network, wherein each of M layers of the neural network includes at least one channel, each channel corresponds to at least one weight, N is an integer greater than 1, and M is an integer greater than 1.

In an optional implementation manner, before the processor reads the weights corresponding to the N channels in the neural network, the neural network needs to be trained to obtain all the weights in the neural network. Training the neural network to obtain all weights in the neural network, which may specifically be: and obtaining the structure of the neural network and all weight values in the neural network through data input and neural network model construction. For example, training of the neural network can be achieved by the following three steps, and all weights in the neural network are obtained:

step a 1: signals such as pictures and sounds are obtained through input equipment (such as a camera, a microphone and the like), and are expressed by tensor formed by a plurality of two-dimensional matrixes.

Step a2: the parameters of the neural network, i.e. the weight values in the neural network, are trained using the labelled training data set. This specifically comprises the following steps. Information forward propagation: initial weight values are set, and the output of each layer of the neural network is calculated in turn from the input by matrix multiplication and addition, so as to obtain the final output result. Error back propagation: a gradient descent method is adopted, and the weights and biases of the output layer and the hidden layers are updated in turn so as to minimize the overall error.

The above-mentioned training process of the neural network can also be illustrated by fig. 2. By the method, the neural network can be accurately trained, and all weights in the neural network are obtained.
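As a concrete illustration of steps a1 and a2 and of the subsequent weight reading in step 501, the following minimal sketch trains a small convolutional model and then collects the weights of every channel of every convolutional layer. It assumes a PyTorch model with toy dimensions; the framework, layer sizes and variable names are illustrative only and are not prescribed by the embodiment.

import torch
import torch.nn as nn

# A toy convolutional network standing in for the neural network of fig. 1
# (hypothetical layer sizes, chosen only for illustration).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),   # layer 1: 8 channels
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # layer 2: 16 channels
    nn.ReLU(),
)

# Step a2: information forward propagation plus error back propagation (gradient descent).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
x = torch.randn(4, 3, 32, 32)          # step a1: input tensor (e.g. pictures)
target = torch.randn(4, 16, 32, 32)    # stand-in for labelled training data
for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()
    optimizer.step()

# Step 501: read the weights respectively corresponding to the N channels of all M layers.
channel_weights = []                    # one flat weight vector per channel
for layer in model:
    if isinstance(layer, nn.Conv2d):
        for c in range(layer.out_channels):
            channel_weights.append(layer.weight[c].detach().reshape(-1).numpy())
N = len(channel_weights)                # here N = 8 + 16 = 24 channels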

Step 502, the processor divides the N channels into F groups, where each group includes at least one channel, and a plurality of channels included in at least one group belong to at least two layers of the neural network, and F is a positive integer smaller than N.

In an optional implementation manner, when the processor divides the N channels into F groups, a specific method may be: the processor determines F-1 dividing points in the N channels, and groups the N channels according to the F-1 dividing points to obtain the F groups, wherein any dividing point is either one of the two adjacent channels at the boundary between two adjacent groups. For example, if the divided groups are {channel 1, channel 2, channel 3} and {channel 4, channel 5}, the two adjacent channels at the boundary between the two groups are channel 3 and channel 4, and either channel 3 or channel 4 may be used as the dividing point. That is, when channel 3 is determined to be the dividing point, channel 3 lies at the boundary of the two groups and is counted as a channel of the former group. Of course, a dividing point may instead be counted as a channel of the latter group, as in the case where channel 4 is the dividing point in the above example; this is not described in detail here. Therefore, once the F-1 dividing points are determined, the F groups are obtained.

In one particular implementation, the processor may determine the F-1 dividing points in the N channels by performing the following procedure:

under the condition that p takes any one integer from 1 to N, when p takes each value, respectively:

determining a corresponding grouped sequence, a sequence to be grouped and a dividing point sequence, wherein the grouped sequence comprises the channel identifications respectively corresponding to the first p-1 channels, which have been grouped, the sequence to be grouped comprises the channel identifications respectively corresponding to the p-th to the N-th channels, which have not been grouped, and the dividing point sequence comprises the channel identifications corresponding to the channels serving as dividing points among the first p-1 grouped channels;

taking the (r+1)-th to the p-th channels as one group and the 1st to the r-th channels as another group to form one grouping result, wherein p-1 grouping results are obtained as r takes each integer from p-1 down to 1;

calculating the weight loss degree corresponding to each grouping result according to a preset weight loss function, and selecting the grouping result with the minimum weight loss degree from the p-1 grouping results;

updating the channel identification corresponding to the dividing point at the boundary between the two groups in the selected grouping result into the dividing point sequence;

and taking the channels corresponding to the channel identifications in the dividing point sequence finally obtained after p has taken every integer from 1 to N as the F-1 dividing points.

The above flow is actually a cyclic process; when it is finished, the finally obtained grouped sequence includes the identifications of all channels of the neural network, i.e. the identifications respectively corresponding to the N channels, and the final sequence to be grouped is empty. That is, after the F-1 dividing points are obtained by the above method, the F groups are obtained. For example, the flow of determining the F-1 dividing points may be as shown in fig. 5a.

For example, the process when p is 8 in the above flow is described in detail as a representative example:

At this time, the currently determined grouped sequence includes the channel identifiers corresponding to the first 7 grouped channels (for example, recorded as channel 1 to channel 7), and the currently determined sequence to be grouped includes the channel identifiers corresponding to the 8th to the N-th channels (for example, recorded as channel 8 to channel N).

Taking the (r+1)-th to the 8th channels as one group and the 1st to the r-th channels as another group to form one grouping result, 7 grouping results are obtained as r takes each integer from 7 down to 1. Specifically, the 7 grouping results can be recorded as:

first grouping result: {channel 1, channel 2, ..., channel 7} and {channel 8};

second grouping result: {channel 1, channel 2, ..., channel 6} and {channel 7, channel 8};

third grouping result: {channel 1, channel 2, ..., channel 5} and {channel 6, channel 7, channel 8};

fourth grouping result: {channel 1, channel 2, channel 3, channel 4} and {channel 5, ..., channel 8};

fifth grouping result: {channel 1, channel 2, channel 3} and {channel 4, channel 5, ..., channel 8};

sixth grouping result: {channel 1, channel 2} and {channel 3, channel 4, ..., channel 8};

seventh grouping result: {channel 1} and {channel 2, channel 3, ..., channel 8}.

Then, the weight loss degree corresponding to each grouping result is calculated according to the preset weight loss function. If the third grouping result has the smallest weight loss degree among the 7 grouping results, it can be known from the third grouping result that the dividing point at the boundary of its two groups is channel 5 or channel 6, so the dividing point obtained when p is 8 is either channel 5 or channel 6, and the determined dividing point is updated into the dividing point sequence.

Through the above steps, channel 8 is also added to the grouped sequence; the above steps are then repeated from channel 9 until channel N has been added to the grouped sequence, resulting in the F-1 dividing points.

It should be noted that the channels corresponding to the channel identifiers in the grouped sequence have already been grouped, i.e. the grouped sequence actually contains several sets of channel identifiers. Specifically, for each grouping result in the above flow, the group formed by the 1st to the r-th channels can be regarded as the collection of channel groups into which those channels have already been divided in the grouped sequence, while the group formed by the (r+1)-th to the p-th channels is one whole channel group, i.e. the channel group currently being divided. Accordingly, when the loss degree corresponding to one grouping result is calculated, the values of the weight loss function for the channel groups formed by the 1st to the r-th channels and the value of the weight loss function for the channel group formed by the (r+1)-th to the p-th channels are calculated respectively, and the sum of these values is used as the weight loss degree.

It should be noted that, each time the above flow is executed for one channel, the determined dividing point may already exist in the dividing point sequence, i.e. a repeated dividing point is determined; in that case the updated dividing point sequence is the same as the dividing point sequence before the update.

In an alternative embodiment, the preset weight loss function involved in the above process may conform to the following formula one:

f(I) = \sum_{i \in I} g(\omega_i) \left( \omega_i - \frac{\mathrm{round}(\theta \omega_i)}{\theta} \right)^2    (formula one)

wherein, in the above formula one, f() is the weight loss function of a channel group, I is the set of channel identifiers respectively corresponding to the channels included in the channel group, A_i is the i-th channel included in the channel group, ω_i is the weight corresponding to A_i, θ is the quantization coefficient corresponding to the channel group, g() is a weight-related function for adjusting the precision of the weight loss function, and round() is a rounding function.

The preset weight loss function can be defined as the weighted sum of squared differences between the weights before and after quantization; it represents the difference between the quantized neural network and the original neural network, and the smaller its value, the better the quantization scheme.

In an example, the weight loss degree involved in the above process may be in accordance with the following formula two:

D(P) = \sum_{I \in P} f(I)    (formula two)

wherein, in formula two, D() is the weight loss degree of the channel grouping, and P is the grouped sequence.

For example, when p is 8 in the above example, to calculate the weight loss degree (formula two) corresponding to each grouping result according to the preset weight loss function (formula one), the weight losses of the two groups in the grouping result may first be obtained according to formula one, and then summed according to formula two to obtain the weight loss degree of the grouping result. Specifically, the set of identifiers of the channels included in one group (i.e. a channel group) of the grouping result corresponds to I, and A_i is the i-th channel included in that group. It should be noted that the weight loss degree corresponding to the former group (i.e. the group consisting of the 1st to the r-th channels) is the sum of the weight loss degrees corresponding to the one or more channel groups included in it.
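As an illustration of formulas one and two, the following minimal sketch computes the loss of one channel group and the weight loss degree of a grouping. It assumes that each channel's weights are given as a NumPy vector and that g() is simply the constant 1; both assumptions are illustrative only, since the embodiment merely requires g() to be some weight-related precision-adjusting function.

import numpy as np

def group_loss(channel_weights, group, theta, g=lambda w: 1.0):
    # Formula one: weighted squared error between the original weights and the
    # weights quantized with coefficient theta and then restored, summed over
    # the channels (set of channel indices I) of one channel group.
    loss = 0.0
    for i in group:
        w = channel_weights[i]
        w_quant = np.round(w * theta) / theta      # quantize, then restore
        loss += np.sum(g(w) * (w - w_quant) ** 2)
    return loss

def grouping_loss_degree(channel_weights, grouping, thetas):
    # Formula two: the weight loss degree of a grouped sequence P is the sum of
    # the group losses of all channel groups it contains.
    return sum(group_loss(channel_weights, grp, th)
               for grp, th in zip(grouping, thetas))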

The above method may be called a dynamic programming algorithm, or may be given another name, which is not limited in this application.

By the method, an optimal grouping scheme can be obtained. Specifically, by the method, the weight of each channel can be fully considered, the limitation of a layer is broken through, and the cross-layer grouping of the channels is realized, so that the grouping number of the channels can be reduced as much as possible, the number of quantization schemes can be reduced, and the aim of reducing hardware overhead can be achieved.
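The above flow, together with the recurrence of formulas three and four given later, can be sketched in simplified form as follows. The sketch assumes that the per-channel weights are NumPy vectors, that the penalty coefficient γ of formula three is given, and that the quantization coefficient of a candidate group is chosen with the simple max-based rule of step 503; the embodiment does not fix these details, so this is an illustration rather than a definitive implementation.

import numpy as np

def best_dividing_points(channel_weights, gamma):
    # Returns the F-1 dividing points (channel indices) that minimise the
    # penalised weight loss degree B(p) over the first p channels.
    N = len(channel_weights)

    def segment_loss(r, p):
        # f[r, p]: loss of treating channels r..p-1 (0-based, half-open) as one group,
        # with the group's quantization coefficient chosen by the max-based rule of step 503.
        w = np.concatenate([channel_weights[i] for i in range(r, p)])
        theta = max(np.floor(127.0 / (np.abs(w).max() + 1e-12)), 1.0)
        return np.sum((w - np.round(w * theta) / theta) ** 2)

    B = [0.0] + [np.inf] * N           # B[p]: minimal penalised loss of the first p channels
    last_start = [0] * (N + 1)         # where the last group of the best scheme for p starts
    for p in range(1, N + 1):
        for r in range(p):             # channels r..p-1 form the last group
            candidate = B[r] + segment_loss(r, p) + gamma
            if candidate < B[p]:
                B[p], last_start[p] = candidate, r
    # Walk back through the recorded boundaries to recover the dividing points.
    points, p = [], N
    while p > 0:
        r = last_start[p]
        if r > 0:
            points.append(r)           # channel index at the boundary between two groups
        p = r
    return sorted(points)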

Step 503, the processor determines the quantization coefficients of each group according to the weights respectively corresponding to the channels included in each group, and quantizes the weights corresponding to the group according to the quantization coefficients.

There are various methods for determining the quantization coefficient of each group and quantizing that group, which are not described in detail here. For example, in an alternative embodiment, the most common fixed-point shift quantization method may be employed:

firstly, the maximum value of the weights corresponding to the channels in each group is counted, and the quantization coefficient of the group is obtained by dividing the quantization range (Int8) by this maximum weight value and then rounding. After the quantization coefficient of the group is obtained, each Float32 weight is multiplied by the corresponding quantization coefficient, the result is limited to the upper and lower bounds of the Int8 data range, and the Int8 integer data is obtained directly by rounding. After the corresponding operation is finished, the result is divided by the quantization coefficient to restore the data. This completes the quantization of the weights of the group.
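A minimal sketch of this fixed-point quantization of one group might look as follows (the Int8 range, the NumPy representation and the function name are illustrative assumptions, not part of the embodiment):

import numpy as np

def quantize_group(group_weights):
    # Quantizes all weights of one channel group with a single coefficient.
    # After the convolution with the quantized weights, dividing the result by
    # theta restores the original data scale, completing the quantization.
    w_max = max(np.abs(w).max() for w in group_weights) + 1e-12
    theta = max(np.floor(127.0 / w_max), 1.0)          # quantization range (Int8) / max weight, rounded
    quantized = [np.clip(np.round(w * theta), -128, 127).astype(np.int8)
                 for w in group_weights]               # limited to the Int8 upper and lower bounds
    return quantized, theta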

In an alternative embodiment, after the N channels are divided into F groups in step 502, one amplifier may be provided for each group of the F groups of channels to implement quantization, and the amplification factor of the amplifier corresponding to each group is the same as the quantization coefficient of that group. That is, the amplification factor of each group's amplifier is set to the quantization coefficient of that group; for example, a schematic diagram of the amplifiers corresponding to the groups may be as shown in fig. 6. This completes a hardware implementation of the quantization method described above. Because the number of groups produced by the quantization method provided in this embodiment of the application is much smaller than the existing number of groups, the number of amplifiers that need to be provided is much smaller, and the hardware cost consumption can be greatly reduced.

In another alternative implementation, in the hardware implementation, the quantization of each group of weights may also be implemented by setting a shifter. For example, as shown in the schematic diagram of the shifter in fig. 7, the weights of the channels in each group are shifted by the shifter, where the number of bits j shifted by the shifter is related to the quantization coefficient corresponding to the group.
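A minimal sketch of this shifter-based variant is given below; it assumes, for illustration only, that restricting the quantization coefficient to a power of two 2^j is what makes the multiplication realizable as a j-bit shift, and it reuses the Int8 range of the previous sketch.

import numpy as np

def shift_quantize_group(group_weights):
    # Power-of-two variant: the quantization coefficient is 2**j, so multiplying
    # by it can be realised in hardware by shifting j bits.
    w_max = max(np.abs(w).max() for w in group_weights) + 1e-12
    j = int(np.floor(np.log2(127.0 / w_max)))          # number of shifted bits, related to the group's coefficient
    theta = float(2.0 ** j)
    quantized = [np.clip(np.round(w * theta), -128, 127).astype(np.int8)
                 for w in group_weights]
    return quantized, j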

By adopting the quantization method provided by the embodiment of the application, the weights respectively corresponding to N channels in a neural network having M layers are read, the N channels are divided into F groups, the quantization coefficient of each group is determined according to the weights corresponding to the channels contained in that group, and the weights corresponding to the group are quantized according to the quantization coefficient; each group comprises at least one channel, the channels contained in at least one group belong to at least two layers of the neural network, and F is a positive integer smaller than N. By this method, the channels of all layers in the neural network are grouped as a whole, and some of the resulting groups contain channels from different layers, so the limitation of the prior art that only single-layer grouping can be considered is broken and the quantization precision can be improved; in addition, the number of groups is much smaller than in existing grouping schemes, i.e. far fewer quantization schemes are needed than in the prior art, which in turn reduces the hardware cost during quantization.

The comparison between the existing groupings and the grouping of the method of this embodiment of the application is described in detail below with reference to the schematic diagram of channel grouping shown in fig. 8:

The first layer, the second layer, ..., the (M-1)-th layer and the M-th layer shown in fig. 8 represent the M layers of the neural network. Fig. 8 shows the grouping results of five different methods for grouping the channels of a neural network, of which the first four are existing grouping schemes and the fifth is the grouping of the method provided by this embodiment of the present application. The different shapes (e.g. circles, rectangles, squares, triangles) in fig. 8 represent channels in different layers of the neural network. Specifically:

method 1 is a grouping situation of a conventional single-layer grouping quantization method, i.e. channels in each layer are grouped, so that how many layers of the neural network are divided into how many groups, here M groups according to the M-layer neural network. Where different shapes (e.g., circles, rectangles, squares, triangles) in method 1 represent groupings of different layers.

Method 2 is a grouping case of a conventional quantization method of intra-layer grouping, equally grouping channels in each layer, where different shapes (e.g., rectangles, squares, triangles) in each layer in method 2 represent different groupings of channels. As can be seen from the figure, in method 2, each layer is divided into a plurality of channel groups.

Method 3 is the grouping of a clustering grouping quantization method: all channels are put together and clustered by a clustering algorithm, and the channels in each layer are finally divided into different groups, where the different shapes (e.g. circles, rectangles, squares and triangles) in each layer in method 3 represent different channel groupings. It can be seen from the figure that each layer of method 3 is divided into a plurality of channel groups.

Method 4 is a grouping situation of a clustering rearrangement grouping quantization method, in one layer, channels belonging to the same category are redistributed and arranged together. Where different shapes (e.g., rectangles, squares, triangles) in each layer in method 4 represent different groupings of channels. It can be seen from the figure that each layer of method 4 is divided into a plurality of channel groups.

Method 5 is the grouping of the quantization method provided in this embodiment of the application, in which all channels are grouped as a whole; the different shapes (e.g. circle, rectangle, square, triangle) in method 5 represent different channel groupings. As can be seen from the figure, some channel groups include channels of more than one layer, i.e. channels of two or more layers, so cross-layer grouping is implemented.

As can be seen from the above, the first four existing methods can only group whole layers or channels within a layer and cannot achieve cross-layer grouping, so the final number of groups is large, which makes the subsequent hardware implementation of the groups costly (for example, a corresponding amplifier is provided for each group, and the large number of amplifiers leads to a large hardware cost); method 5 provided in this embodiment of the present application can implement cross-layer grouping, and the number of groups obtained in this way is much smaller than in the existing methods, so the hardware cost can be reduced (for example, when a corresponding amplifier is likewise provided for each group, the number of amplifiers is much smaller than in the existing methods, so the hardware cost is reduced).

Based on the above embodiment, the grouping method (the dynamic programming algorithm) involved in the quantization method can ensure that the obtained grouping result minimizes the loss function. For example, the loss degree of the finally determined grouping is defined to conform to the following formula three:

B(r) = \min_{P,\, \Theta_P} \left( D(P) + \gamma \lvert P \rvert \right)    (formula three)

the loss degree can be further proved to be minimum by the formula four:

B(p) = \inf_{r} \left( B(r) + f[r+1,\, p] + \gamma \right)    (formula four)

wherein, in formula three, B(r) represents the loss degree, with the penalty term added, of the final grouping scheme when the number of channels is r, γ is the penalty coefficient (which ensures that the number of groups does not exceed a certain value, thereby avoiding overfitting), D() is the weight loss degree in formula two, |P| is the number of groups in the grouping scheme P, and Θ_P represents the set of quantization coefficients of P. In formula four, B(p) represents the loss degree, with the penalty term added, of the grouping scheme when the number of channels is p; inf denotes taking the infimum, and f[r, p] represents the value of the weight loss function from the r-th channel to the p-th channel.

Through this verification, the minimum B(n) is obtained, where n is the total number of channels in the neural network. Based on this method, each iteration obtains the grouping scheme that minimizes the overall loss degree, so the optimum is achieved. Therefore, the precision loss after quantization is ensured to be small on the premise of a hardware-friendly design.

Based on the above embodiments, the embodiments of the present application further provide a quantization apparatus, which is used to implement the quantization method provided by the embodiment shown in fig. 5. Referring to fig. 9, the quantization apparatus 900 includes: a communication unit 901 and a processing unit 902. Wherein:

the communication unit 901 is configured to read weights respectively corresponding to N channels in a neural network, where each of M layers of the neural network includes at least one channel, each channel corresponds to at least one weight, N is an integer greater than 1, and M is an integer greater than 1;

the processing unit 902 is configured to divide the N channels into F groups, determine quantization coefficients of each group according to weights corresponding to the channels included in each group, and quantize the weights corresponding to the group according to the quantization coefficients, where each group includes at least one channel, and a plurality of channels included in at least one group belong to at least two layers of the neural network, and F is a positive integer smaller than N.

In an optional implementation manner, the processing unit 902, when dividing the N channels into F groups, is specifically configured to: determine F-1 dividing points in the N channels, and group the N channels according to the F-1 dividing points to obtain the F groups, wherein any dividing point is either one of the two adjacent channels at the boundary between two adjacent groups.

In an optional implementation manner, when determining the F-1 dividing points in the N channels, the processing unit 902 is specifically configured to: under the condition that p takes any one integer from 1 to N, when p takes each value, respectively:

determining a corresponding grouped sequence, a sequence to be grouped and a dividing point sequence, wherein the grouped sequence comprises the channel identifications respectively corresponding to the first p-1 channels, which have been grouped, the sequence to be grouped comprises the channel identifications respectively corresponding to the p-th to the N-th channels, which have not been grouped, and the dividing point sequence comprises the channel identifications corresponding to the channels serving as dividing points among the first p-1 grouped channels;

taking the (r+1)-th to the p-th channels as one group and the 1st to the r-th channels as another group to form one grouping result, wherein p-1 grouping results are obtained as r takes each integer from p-1 down to 1;

calculating the weight loss degree corresponding to each grouping result according to a preset weight loss function, and selecting the grouping result with the minimum weight loss degree from the p-1 grouping results;

updating the channel identification corresponding to the dividing point at the boundary between the two groups in the selected grouping result into the dividing point sequence;

and taking the channels corresponding to the channel identifications in the dividing point sequence finally obtained after p has taken every integer from 1 to N as the F-1 dividing points.

In an alternative embodiment, the preset weight loss function conforms to the following formula:

f(I) = \sum_{i \in I} g(\omega_i) \left( \omega_i - \frac{\mathrm{round}(\theta \omega_i)}{\theta} \right)^2

wherein f() is the weight loss function of a channel group, I is the set of channel identifications respectively corresponding to the channels included in the channel group, A_i is the i-th channel included in the channel group, ω_i is the weight corresponding to A_i, θ is the quantization coefficient corresponding to the channel group, g() is a weight-related function for adjusting the precision of the weight loss function, and round() is a rounding function.

In an alternative embodiment, the weight loss degree complies with the following formula:

D(P) = \sum_{I \in P} f(I)

wherein D() is the weight loss degree of the channel grouping and P is the grouped sequence.

In an optional implementation manner, the processing unit 902 is further configured to train the neural network before the communication unit 901 reads the weights corresponding to the N channels in the neural network, so as to obtain all the weights in the neural network.

By adopting the quantization apparatus provided by the embodiment of the application, the weights respectively corresponding to N channels in a neural network having M layers are read, the N channels are divided into F groups, the quantization coefficient of each group is determined according to the weights corresponding to the channels contained in that group, and the weights corresponding to the group are quantized according to the quantization coefficient; each group comprises at least one channel, the channels contained in at least one group belong to at least two layers of the neural network, and F is a positive integer smaller than N. In this way, the channels of all layers in the neural network are grouped as a whole, and some of the resulting groups contain channels from different layers, so the limitation of the prior art that only single-layer grouping can be considered is broken and the quantization precision can be improved; in addition, the number of groups is much smaller than in existing grouping schemes, i.e. far fewer quantization schemes are needed than in the prior art, which in turn reduces the hardware cost during quantization.

It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation. The functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Based on the above embodiments, the present application further provides a quantization apparatus, which is used for implementing the quantization method shown in fig. 5. Referring to fig. 10, the apparatus 1000 includes: the communication module 1001, the processor 1002, and optionally the memory 1003, where the processor 1002 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of the CPU and the NP. The processor 1002 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof. The processor 1002 may be implemented by hardware when implementing the above functions, or may be implemented by hardware executing corresponding software.

The communication module 1001, the processor 1002, and the memory 1003 are connected to each other. Optionally, the communication module 1001, the processor 1002 and the memory 1003 are connected to each other through a bus 1004; the bus 1004 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.

The communication module 1001 is configured to perform communication interaction with other devices. In an alternative embodiment, the communication module 1001 may communicate with other devices through a wireless connection, for example, the communication module 1001 may be an RF circuit, a WiFi module, or the like. The communication module 1001 may also communicate with other devices through physical connection, for example, the communication module 1001 may be a communication interface.

The processor 1002 is configured to implement the quantization method shown in fig. 5; for the specific process, refer to the detailed description in the above embodiment, which is not repeated here.

The memory 1003 is used for storing programs, data, and the like. In particular, the program may include program code comprising computer operation instructions. The memory 1003 may include a random access memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The processor 1002 executes the program stored in the memory 1003 to implement the above functions, thereby implementing the quantization method shown in fig. 5.

In summary, the embodiments of the present application provide a quantization method and apparatus: the weights respectively corresponding to N channels in a neural network having M layers are read, the N channels are divided into F groups, the quantization coefficient of each group is determined according to the weights corresponding to the channels contained in that group, and the weights corresponding to the group are quantized according to the quantization coefficient; each group comprises at least one channel, the channels contained in at least one group belong to at least two layers of the neural network, and F is a positive integer smaller than N. In this way, the channels of all layers in the neural network are grouped as a whole, and some of the resulting groups contain channels from different layers, so the limitation of the prior art that only single-layer grouping can be considered is broken and the quantization precision can be improved; in addition, the number of groups is much smaller than in existing grouping schemes, i.e. far fewer quantization schemes are needed than in the prior art, which in turn reduces the hardware cost during quantization.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
