Quantization and computation acceleration method and apparatus

Document No.: 762301 · Publication date: 2021-04-06

Reading note: This technique, "Quantization and computation acceleration method and apparatus", was designed and created by 陈福松 and 蒋泳森 on 2020-12-28. Summary: The invention discloses a quantization and computation acceleration method and apparatus, wherein the quantization and computation acceleration method comprises: performing 4-bit quantization on data, and offsetting the quantized values to eliminate the negative numbers produced by the 4-bit quantization; and, when storing the quantized data, interleaving it according to the maximum parallel width of the SIMD unit. Introducing the bias, on one hand, removes the sign problem from shift operations; on the other hand, the parameter interval is no longer limited to (-1, 1) but may be an interval centered on a non-zero value. Removing this assumption gives model training greater freedom and allows a better-performing model to be trained. The interleaved data saves memory, is advantageous for the data cache, and can be loaded in a single operation.

1. A quantization and computation acceleration method, comprising:

performing 4-bit quantization on data, and offsetting the quantized values to eliminate the negative numbers produced by the 4-bit quantization;

and, when storing the quantized data, interleaving it according to the maximum parallel width of the SIMD unit.

2. The method of claim 1, further comprising:

when calculating the multiplication of two vectors, performing only one load, plus one right shift and two & (bitwise AND) operations.

3. The method of claim 2, further comprising:

subtracting a constant value at the end of the calculation.

4. The method of claim 3, wherein:

when the SIMD unit supports 8-way parallelism, a vector a1 = (a1[1], a1[2], a1[3], ..., a1[15], a1[16]) is stored interleaved in the following format:

(a1[1]<<4)|a1[9], (a1[2]<<4)|a1[10], (a1[3]<<4)|a1[11], ..., (a1[8]<<4)|a1[16], ...

5. The method of claim 4, wherein the constant value subtracted is expressed as: ∑ bias × b_i.

6. The method of claim 5, wherein, when calculating the multiplication of two vectors, performing only one load plus one right shift and two & operations comprises:

loading one int64 value a at a time, which is equivalent to loading a1[1], a1[9], a1[2], a1[10], a1[3], a1[11], ..., a1[8], a1[16] at once; loading (a >> 4) & 0x0F0F0F0F0F0F0F0F into SIMD register reg_a, and a & 0x0F0F0F0F0F0F0F0F into another SIMD register reg_b; and calculating reg_a × reg_b using 8-bit SIMD.

7. The method of any one of claims 1-6, wherein the 4-bit quantization range is [-8, 7], and shifting the quantized values comprises:

increasing the quantized value by a bias of +8, so that the offset quantization range is [0, 15].

8. A quantization and computation acceleration apparatus comprising:

a quantization module configured to perform 4-bit quantization on data and to offset the quantized values to eliminate the negative numbers produced by the 4-bit quantization;

and a storage module configured to interleave the quantized data according to the maximum parallel width of the SIMD unit when storing it.

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 7.

10. A storage medium having a computer program stored thereon, wherein the program, when executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

Technical Field

The invention belongs to the technical field of storage, and particularly relates to a quantization and calculation acceleration method and device.

Background

In the related art, each chip vendor performs computation and memory optimization to improve the performance of its self-developed chips, reduce power consumption, and save cost, thereby improving the competitiveness of the self-developed chips relative to general-purpose chips.

Currently, mainstream approaches use either 8-bit quantization or binary quantization.

With 8-bit quantization, the acoustic model that can be loaded on-chip is relatively small.

Binary quantization is not yet mature and suffers from degraded or unstable performance.

The inventors found, in the course of implementing the present application, that these problems in the related art arise for the following reasons. To speed up computation in a chip system, the model is usually placed in SRAM to avoid the performance loss of fetching from flash to memory. Due to cost constraints, a self-developed chip usually has little on-chip memory; the memory of existing mainstream voice chips is below 3M, and this limited space leaves little room for the acoustic neural network. For example, if 1M of space is available for the neural network, then 8-bit weight quantization allows only 1M parameters, whereas 4-bit quantization allows 2M parameters.
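The capacity arithmetic above can be checked directly. As a small illustrative sketch (treating "1M space" as 1 MiB, which is an assumption; the variable names are hypothetical):

```python
# SRAM available for the neural network, assumed to be 1 MiB
sram_bytes = 1 * 1024 * 1024

bits = sram_bytes * 8                # total bit budget
params_8bit = bits // 8              # one parameter per byte
params_4bit = bits // 4              # two parameters per byte

# halving the bit width doubles the parameter capacity
assert params_4bit == 2 * params_8bit
```

The same ratio holds for any memory size: moving from 8-bit to 4-bit weights always doubles the number of parameters that fit in the same SRAM.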

Disclosure of Invention

Embodiments of the present invention provide a quantization and computation acceleration method and apparatus, which are used to solve at least one of the above technical problems.

In a first aspect, an embodiment of the present invention provides a quantization and computation acceleration method, including: performing 4-bit quantization on data, and offsetting the quantized values to eliminate the negative numbers produced by the 4-bit quantization; and, when storing the quantized data, interleaving it according to the maximum parallel width of the SIMD unit.

In a second aspect, an embodiment of the present invention provides a quantization and computation acceleration apparatus, including: a quantization module configured to perform 4-bit quantization on data and to offset the quantized values to eliminate the negative numbers produced by the 4-bit quantization; and a storage module configured to interleave the quantized data according to the maximum parallel width of the SIMD unit when storing it.

In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the quantization and computation acceleration method of any of the embodiments of the present invention.

In a fourth aspect, the present invention also provides a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions, which, when executed by a computer, cause the computer to perform the steps of the quantization and computation acceleration method according to any one of the embodiments of the present invention.

According to the method and apparatus of the present application, introducing the offset (such as the bias in the embodiments below) on one hand eliminates the sign problem in shift operations, and on the other hand means the parameter interval is no longer limited to (-1, 1) but can be an interval centered on a non-zero value; removing this assumption gives model training more freedom, so a better-performing model can be trained. Furthermore, interleaving the stored data saves memory, benefits the data cache, and allows the data to be loaded in a single operation.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a quantization and computation acceleration method according to an embodiment of the present invention;

FIG. 2 is a block diagram of a quantization and computation acceleration device according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, a flow chart of a quantization and computation acceleration method is shown, which is mainly applied to a chip.

As shown in fig. 1, in step 101, 4-bit quantization is performed on data, and the quantized value is shifted to eliminate the negative number generated by the 4-bit quantization;

then, in step 102, when storing the quantized data, the data is interleaved according to the maximum parallel number of SIMDs.

For step 101, 4-bit quantization is adopted and the quantized values are offset, which effectively eliminates the negative numbers produced by the 4-bit quantization. As a result, the sign of the values need not be considered when shifting during computation; otherwise, handling positive and negative numbers would require extra instructions in each calculation and lower the computational efficiency.
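The quantize-and-offset step above can be sketched as follows. This is an illustrative sketch, not code from the embodiments; the scale value and function names are hypothetical:

```python
def quantize_4bit(values, scale):
    """Quantize floats to signed 4-bit integers, clamped to [-8, 7]."""
    return [max(-8, min(7, round(v / scale))) for v in values]

def add_bias(quantized, bias=8):
    """Offset the quantized values into [0, 15] so no negatives remain."""
    return [q + bias for q in quantized]

# example with a hypothetical scale of 1/16
q = quantize_4bit([-0.5, 0.0, 0.4375, -1.0], scale=0.0625)  # [-8, 0, 7, -8]
u = add_bias(q)                                             # [0, 8, 15, 0]
```

After the offset, every stored value is an unsigned 4-bit number, so later shift-and-mask unpacking never has to deal with sign extension.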

In step 102, when storing the quantized data, the data is stored interleaved, rather than sequentially, according to the maximum parallel width of the SIMD unit.

According to the method and apparatus of the present application, introducing the offset (such as the bias in the embodiments below) on one hand eliminates the sign problem in shift operations, and on the other hand means the parameter interval is no longer limited to (-1, 1) but can be an interval centered on a non-zero value; removing this assumption gives model training more freedom, so a better-performing model can be trained. Furthermore, interleaving the stored data saves memory, benefits the data cache, and allows the data to be loaded in a single operation.

In a further embodiment, the method further comprises: when calculating the multiplication of two vectors, performing only one load, plus one right shift and two & operations. By contrast, with 8-bit quantization, computing 16 values requires two data loads, so this scheme saves one load operation.

In a further embodiment, the method further comprises: subtracting a constant value at the end of the calculation, so that the offset introduced in step 101 is cancelled out.
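The constant subtracted here follows directly from expanding the offset dot product. With bias the offset added in step 101, a_i the original signed quantized values, and b_i the elements of the second (unoffset) vector:

∑ (a_i + bias) · b_i = ∑ a_i · b_i + bias · ∑ b_i

so subtracting the constant bias · ∑ b_i (equivalently ∑ bias × b_i) recovers the true dot product ∑ a_i · b_i.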

In a further embodiment of the method described above:

when the SIMD unit supports 8-way parallelism, a vector a1 = (a1[1], a1[2], a1[3], ..., a1[15], a1[16]) is stored interleaved in the following format:

(a1[1]<<4)|a1[9], (a1[2]<<4)|a1[10], (a1[3]<<4)|a1[11], ..., (a1[8]<<4)|a1[16], ...

In this way, parallel computation can be performed using SIMD after interleaving.
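The interleaved layout can be sketched in Python as follows (the function and variable names are illustrative, not from the embodiments; element i shares a byte with element i+8, exactly as in the format above):

```python
def interleave_pack(a):
    """Pack 16 biased 4-bit values (each 0..15) pairwise: element i goes in
    the high nibble and element i+8 in the low nibble of byte i. The 8 bytes
    form one little-endian int64, so a single 64-bit load fetches all 16."""
    assert len(a) == 16 and all(0 <= v <= 15 for v in a)
    packed = bytes((a[i] << 4) | a[i + 8] for i in range(8))
    return int.from_bytes(packed, "little")

a1 = list(range(16))         # already-biased 4-bit values 0..15
word = interleave_pack(a1)   # one int64 holding the whole vector
```

A byte-wise shift and mask then split the word back into the two halves of the vector, which is what makes the single-load computation in the later embodiment possible.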

In a further embodiment, the constant value is expressed as ∑ bias × b_i,

so that the effect of the step 101 offset is eliminated by subtracting this value.

In a further embodiment, when calculating the multiplication of two vectors, performing only one load plus one right shift and two & operations comprises:

when calculating the multiplication of two vectors, one int64 value a is loaded at a time, which is equivalent to loading a1[1], a1[9], a1[2], a1[10], a1[3], a1[11], ..., a1[8], a1[16] at once; (a >> 4) & 0x0F0F0F0F0F0F0F0F is loaded into SIMD register reg_a, a & 0x0F0F0F0F0F0F0F0F into another SIMD register reg_b, and reg_a × reg_b is calculated using 8-bit SIMD.

Therefore, compared with the 8-bit case, where a 16-value calculation needs two data loads, the scheme of this embodiment saves one load operation.
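The unpack-and-multiply step can be sketched as follows, simulating the eight 8-bit SIMD lanes with a plain Python loop (names are illustrative; real SIMD would do the per-lane multiplies in one instruction):

```python
MASK = 0x0F0F0F0F0F0F0F0F

def dot16(word, b):
    """One 64-bit 'load' (word), one right shift and two & operations split
    the packed vector into two registers of eight 8-bit lanes, which are
    then multiplied against b and accumulated."""
    reg_a = ((word >> 4) & MASK).to_bytes(8, "little")  # elements a1[1..8]
    reg_b = (word & MASK).to_bytes(8, "little")         # elements a1[9..16]
    return (sum(x * w for x, w in zip(reg_a, b[:8]))
            + sum(x * w for x, w in zip(reg_b, b[8:])))

# pack the biased values 0..15 in the interleaved layout described above
a1 = list(range(16))
word = int.from_bytes(bytes((a1[i] << 4) | a1[i + 8] for i in range(8)),
                      "little")
acc = dot16(word, [1] * 16)  # biased accumulation; the constant is subtracted later
```

The accumulated value still contains the bias contribution, which is removed at the end by subtracting the constant ∑ bias × b_i described earlier.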

In a further embodiment, the 4-bit quantization range is [-8, 7], and shifting the quantized values comprises: increasing the quantized value by a bias of +8, so that the offset quantization range is [0, 15].

Therefore, during computation, the sign of the values need not be considered when shifting; otherwise, handling positive and negative numbers would require extra instructions in each calculation and lower the computational efficiency.

It should be noted that the above method steps are not intended to limit the execution order of the steps, and in fact, some steps may be executed simultaneously or in the reverse order of the steps, which is not limited herein.

The following description is provided to enable those skilled in the art to better understand the present disclosure by describing some of the problems encountered by the inventors in implementing the present disclosure and by describing one particular embodiment of the finally identified solution.

The inventors found, in the course of implementing the present application, that a DSP (digital signal processor) chip generally has no 4-bit acceleration unit. To avoid increased compute cost, most practitioners in the field therefore adopt 8-bit quantization; where both compute and memory matter, they mainly consider binary quantization to avoid the increase in compute cost.

According to the embodiments of the present application, the hit rate of the data cache is improved by optimizing the storage order and the calculation method, thereby reducing the compute cost of 4-bit calculation.

The 4-bit quantization range is [-8, 7].

First, a bias of +8 is added to the quantized value, so that the offset quantization range is [0, 15].

The data is then stored interleaved according to the maximum parallel width of the SIMD (Single Instruction, Multiple Data) unit.

For example:

when the SIMD unit supports 8-way parallelism, a vector a1 = (a1[1], a1[2], a1[3], ..., a1[15], a1[16]) is stored interleaved in the following format:

(a1[1]<<4)|a1[9], (a1[2]<<4)|a1[10], (a1[3]<<4)|a1[11], ..., (a1[8]<<4)|a1[16], ...

When calculating the multiplication of two vectors, one int64 value a is loaded at a time, which is equivalent to loading a1[1], a1[9], a1[2], a1[10], a1[3], a1[11], ..., a1[8], a1[16] at once; (a >> 4) & 0x0F0F0F0F0F0F0F0F is loaded into SIMD register reg_a, a & 0x0F0F0F0F0F0F0F0F into another SIMD register reg_b, and reg_a × reg_b may be calculated using 8-bit SIMD.

As can be seen from the above, the computation of 16 values requires only one load, one right shift and two & operations.

Compared with the 8-bit case, where a 16-value calculation needs two data loads, one load operation is saved.

Finally, the constant value ∑ bias × b_i is subtracted.
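A small worked example of this final correction (the concrete numbers are illustrative, not from the embodiments):

```python
bias = 8
a_true = [-8, -3, 0, 7]              # original signed 4-bit values
b      = [2, 5, 1, 3]                # the second, unoffset vector
a_off  = [v + bias for v in a_true]  # stored offset values, all in [0, 15]

raw = sum(x * y for x, y in zip(a_off, b))  # 78, computed on offset values
correction = bias * sum(b)                  # the constant sum(bias * b_i) = 88
true_dot = raw - correction                 # 78 - 88 = -10

# subtracting the constant recovers the true (signed) dot product
assert true_dot == sum(x * y for x, y in zip(a_true, b))
```

Because the correction depends only on bias and ∑ b_i, it can be precomputed once per vector b and applied as a single subtraction after the SIMD accumulation.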

In the course of implementing the invention, the inventors also considered the following alternative schemes:

a. storing sequentially: when the parameters are loaded, SIMD cannot be used directly for parallel computation;

b. omitting the bias: the sign of positive and negative numbers must be handled when shifting, so each calculation requires more instructions to process them and the computational efficiency is very low.

The scheme of the embodiments of the present application achieves the following technical effects. The introduction of the bias, on one hand, removes the sign problem from shift operations; on the other hand, the parameter interval is no longer limited to (-1, 1) but can be an interval centered on a non-zero value, and removing this assumption gives model training greater freedom, so a better-performing model can be trained. The interleaved data saves memory, is advantageous for the data cache, and can be loaded in a single operation.

Referring to fig. 2, a block diagram of a quantization and computation acceleration apparatus according to an embodiment of the present invention is shown.

As shown in fig. 2, the quantization and computation acceleration apparatus 200 includes: a quantization module 210 and a storage module 220.

The quantization module 210 is configured to perform 4-bit quantization on the data, and offset the quantized value to eliminate a negative number generated by the 4-bit quantization; and a storage module 220 configured to perform cross storage according to the maximum parallel number of SIMDs when storing the quantized data.

It should be understood that the modules depicted in fig. 2 correspond to various steps in the method described with reference to fig. 1. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 2, and are not described again here.

It should be noted that the names of the modules in the embodiments of the present disclosure do not limit the scheme; for example, the quantization module may equally be described as "a module that performs 4-bit quantization on data and offsets the quantized values to eliminate the negative numbers produced by the 4-bit quantization". In addition, the related functional modules may also be implemented by a hardware processor; for example, the quantization module may be implemented by a processor, which is not described further here.

In other embodiments, the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions can execute the quantization and computation acceleration methods in any of the above method embodiments;

as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

performing 4-bit quantization on data, and offsetting the quantized values to eliminate the negative numbers produced by the 4-bit quantization;

and, when storing the quantized data, interleaving it according to the maximum parallel width of the SIMD unit.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the quantization and calculation acceleration means, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the quantification and computation acceleration apparatus through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the above-described quantization and computation acceleration methods.

Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 3, the electronic device includes one or more processors 310 and a memory 320, with one processor 310 illustrated in fig. 3. The device performing the quantization and computation acceleration method may further include an input device 330 and an output device 340. The processor 310, the memory 320, the input device 330 and the output device 340 may be connected by a bus or by other means; a bus connection is illustrated in fig. 3. The memory 320 is a non-volatile computer-readable storage medium as described above. The processor 310 implements the quantization and computation acceleration methods of the above method embodiments by running the non-volatile software programs, instructions and modules stored in the memory 320, thereby executing the various functional applications and data processing of the server. The input device 330 may receive input numeric or character information and generate key-signal inputs related to user settings and function control of the device. The output device 340 may include a display device such as a display screen.

The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.

As an embodiment, the electronic device is applied to a quantization and computation acceleration device, and is used for a client, and includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:

performing 4-bit quantization on data, and offsetting the quantized values to eliminate the negative numbers produced by the 4-bit quantization;

and, when storing the quantized data, interleaving it according to the maximum parallel width of the SIMD unit.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) a mobile communication device: such devices are characterized by mobile communications capabilities and are primarily targeted at providing voice, data communications.

(2) Ultra mobile personal computer device: the equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include: PDA, MID, and UMPC devices, etc., such as ipads.

(3) A portable entertainment device: such devices can display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.

(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.

(5) And other electronic devices with data interaction functions.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
