Method, device and storage medium for acquiring word vector of natural language processing model

Document No.: 361613 | Publication date: 2021-12-07

Note: This technology, "Method, device and storage medium for acquiring word vector of natural language processing model", was designed and created by Zhang Fan, Tu Mei, and Wang Lijie on 2020-06-01. Abstract: Disclosed is a method for obtaining a word vector, comprising: obtaining a code corresponding to a target word from coding information, and obtaining a first partial word vector corresponding to the code from first vector information; obtaining, according to the target word, a second partial word vector corresponding to the target word from second vector information; and merging the first partial word vector and the second partial word vector to obtain the word vector of the target word. Also, the above-described method performed by the electronic device may be performed using an artificial intelligence model.

1. A method of obtaining a word vector, comprising:

obtaining a code corresponding to a target word from coding information, and obtaining a first partial word vector corresponding to the code from first vector information;

obtaining, according to the target word, a second partial word vector corresponding to the target word from second vector information; and

merging the first partial word vector and the second partial word vector to obtain the word vector of the target word.

2. The method of claim 1, wherein the coding information comprises two or more pieces of coding information, and the first partial word vector comprises two or more first partial word vectors corresponding respectively to the two or more pieces of coding information; and

wherein the merging the first partial word vector and the second partial word vector to obtain the word vector of the target word comprises:

merging the two or more first partial word vectors and the second partial word vector to obtain the word vector of the target word.

3. The method of claim 1, further comprising:

inputting the word vector into a decoder to obtain context information;

cutting the context information into a first part of context information and a second part of context information;

obtaining a first probability from the first part of context information and the first partial word vector, and obtaining a second probability from the second part of context information and the second partial word vector; and

obtaining the probability of the target word from the first probability and the second probability.

4. The method of claim 1, wherein the coding information and the first vector information are obtained by performing two or more rounds of compression training on an initial word vector matrix.

5. An apparatus for obtaining a word vector, comprising:

a module for obtaining a code corresponding to a target word from coding information and obtaining a first partial word vector corresponding to the code from first vector information;

a module for obtaining, according to the target word, a second partial word vector corresponding to the target word from second vector information; and

a module for merging the first partial word vector and the second partial word vector to obtain the word vector of the target word.

6. The apparatus according to claim 5, wherein the coding information comprises two or more pieces of coding information, and the first partial word vector comprises two or more first partial word vectors corresponding respectively to the two or more pieces of coding information; and

wherein the merging the first partial word vector and the second partial word vector to obtain the word vector of the target word comprises:

merging the two or more first partial word vectors and the second partial word vector to obtain the word vector of the target word.

7. The apparatus of claim 5, further comprising:

a module for inputting the word vector into a decoder to obtain context information;

a module for cutting the context information into a first part of context information and a second part of context information;

a module for obtaining a first probability from the first part of context information and the first partial word vector, and obtaining a second probability from the second part of context information and the second partial word vector; and

a module for obtaining the probability of the target word from the first probability and the second probability.

8. The apparatus of claim 5, wherein the coding information and the first vector information are obtained by performing two or more rounds of compression training on an initial word vector matrix.

9. An electronic device for obtaining a word vector, comprising:

a memory configured to store instructions; and

a processor configured to execute the instructions to perform the method of any one of claims 1-4.

10. A non-transitory computer-readable storage medium having instructions stored thereon which, when executed by a processor, cause the processor to perform the method of any one of claims 1-4.

Technical Field

The present invention relates to the field of natural language processing. It addresses the problem of excessively large word vector matrices, optimizes the storage space of a model, and achieves model acceleration for certain word generation models to which a weight sharing strategy can be applied.

Background

There are many branches of natural language processing tasks, such as machine translation, abstract generation, dialog systems, etc., which are all intended to solve the natural language understanding problem with machines.

The word generation model generally includes an encoder section, a decoder section, and an output probability calculation section.

The encoder part encodes the input received by the task, the decoder part decodes the output of the encoder, and the probabilities of all candidate output words are finally computed from the decoded context information vector.

After a word is generated, its word vector is looked up in the word vector table (also called the word vector matrix), and the word then re-enters the decoder together with the output of the encoder to generate the next word, until a stop condition is reached.
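A purely illustrative sketch of this generate-look-up-decode loop follows; the encoder output and decoder step are random stand-ins (assumptions for illustration), and only the control flow mirrors the description above.

```python
import numpy as np

# Toy sketch of the word generation loop described above.
V, d, EOS = 100, 16, 0
rng = np.random.default_rng(0)
word_vectors = rng.standard_normal((V, d))  # word vector table (matrix)
enc_out = rng.standard_normal(d)            # stand-in encoder output

def decoder_step(prev_vec, enc_out):
    """Stand-in decoder: returns a context vector."""
    return np.tanh(prev_vec + enc_out)

prev, output = np.zeros(d), []
for _ in range(20):                         # length cap as one stop condition
    ctx = decoder_step(prev, enc_out)
    scores = word_vectors @ ctx             # weight-shared output scores
    word = int(np.argmax(scores))
    if word == EOS:                         # stop condition: end-of-sequence
        break
    output.append(word)
    prev = word_vectors[word]               # look up the generated word's vector
```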

Disclosure of Invention

In the existing word generation task, the word vector matrix occupies excessive storage space because the vocabulary is too large;

and when computing the word output, the probability computation matrix likewise occupies a large amount of storage, because the output probabilities of all words must be computed.

At the same time, the computational complexity of computing the output probabilities of all words is very high.

In the prior art, exploiting the fact that the word vector matrix and the probability computation matrix are transposes of each other, a weight sharing strategy (hereinafter, "weight sharing strategy" refers to this method) is adopted to reuse the parameters of the two matrices, greatly reducing the model storage space.

However, the parameter count of the single shared matrix is still huge, and the computational complexity problem remains unsolved.

According to an aspect of the present disclosure, there is provided a method of obtaining a word vector, including: obtaining a code corresponding to a target word from coding information, and obtaining a first partial word vector corresponding to the code from first vector information; obtaining, according to the target word, a second partial word vector corresponding to the target word from second vector information; and merging the first partial word vector and the second partial word vector to obtain the word vector of the target word.

According to an embodiment of the present disclosure, the coding information includes two or more pieces of coding information, and the first partial word vector includes two or more first partial word vectors corresponding respectively to the two or more pieces of coding information; the merging of the first partial word vector and the second partial word vector to obtain the word vector of the target word includes: merging the two or more first partial word vectors and the second partial word vector to obtain the word vector of the target word.

According to an embodiment of the present disclosure, the method further includes: inputting the word vector into a decoder to obtain context information; cutting the context information into a first part of context information and a second part of context information; obtaining a first probability from the first part of context information and the first partial word vector, and obtaining a second probability from the second part of context information and the second partial word vector; and obtaining the probability of the target word from the first probability and the second probability.

According to an embodiment of the present disclosure, the coding information and the first vector information are obtained by performing two or more rounds of compression training on an initial word vector matrix.

According to another aspect of the present disclosure, there is provided an apparatus for obtaining a word vector, including: a module for obtaining a code corresponding to a target word from coding information and obtaining a first partial word vector corresponding to the code from first vector information; a module for obtaining, according to the target word, a second partial word vector corresponding to the target word from second vector information; and a module for merging the first partial word vector and the second partial word vector to obtain the word vector of the target word.

According to an embodiment of the present disclosure, the coding information includes two or more pieces of coding information, and the first partial word vector includes two or more first partial word vectors corresponding respectively to the two or more pieces of coding information; the merging of the first partial word vector and the second partial word vector to obtain the word vector of the target word includes: merging the two or more first partial word vectors and the second partial word vector to obtain the word vector of the target word.

According to an embodiment of the present disclosure, the apparatus further includes: a module for inputting the word vector into a decoder to obtain context information; a module for cutting the context information into a first part of context information and a second part of context information; a module for obtaining a first probability from the first part of context information and the first partial word vector, and obtaining a second probability from the second part of context information and the second partial word vector; and a module for obtaining the probability of the target word from the first probability and the second probability.

According to an embodiment of the present disclosure, the coding information and the first vector information are obtained by performing two or more rounds of compression training on an initial word vector matrix.

According to another aspect of the present disclosure, there is provided an electronic device for obtaining a word vector, including: a memory configured to store instructions; and a processor configured to execute the instructions to perform a method according to an embodiment of the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having instructions thereon, which when executed by a processor, cause the processor to perform a method of an embodiment of the present disclosure.

The above problems with the prior art and other problems are at least addressed by the various embodiments presented in this disclosure.

Drawings

The above and other aspects, features and advantages of the present disclosure will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings in which:

FIG. 1 is a block diagram of a word generation model according to an embodiment of the present disclosure;

FIG. 2 is an overall algorithm flow diagram according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a method 1 according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of method 2 according to an embodiment of the present disclosure; and

FIG. 5 is a schematic diagram of method 3 according to an embodiment of the present disclosure.

Detailed Description

In the prior art, the following schemes are adopted:

1. For the vocabulary size problem, existing schemes reduce the vocabulary by using finer-grained vocabulary segmentation; these operations take place during vocabulary preprocessing, i.e., before model training.

2. For the high complexity of probability computation, existing schemes adopt a hierarchical prediction technique that classifies words: the word category is predicted first, and then only the few words within that category are predicted, which reduces the complexity of probability computation but has some negative impact on model quality.

3. For the excessive storage occupied by the word vector matrix, existing schemes compress the word vector matrix; on top of a weight sharing strategy, some schemes can also reduce the complexity of probability computation, but again at a cost in model quality.

To address the excessive space occupation of the word vector matrix and the excessive complexity of probability computation, the present disclosure provides a word vector matrix compression technique. On the basis of effectively compressing the word vector matrix, if a weight sharing strategy is adopted, the technique can reduce the complexity of probability computation while ensuring that model quality is not noticeably degraded.

FIG. 1 is a block diagram of a word generation model according to an embodiment of the present disclosure. The word generation model generally includes an encoder section, a decoder section, and an output probability calculation section.

FIG. 2 is an overall algorithm flow diagram according to an embodiment of the disclosure.

In FIG. 2, partial vector compression is first performed on a given word vector matrix using a dynamic learning method, yielding a retained part and the compression results of a VQ part: a vector lookup table and a vocabulary code.

At the input end, upon receiving a word vector request for a word, the VQ partial word vector is first retrieved via the vocabulary code and the vector lookup table, and is then merged with the retained partial word vector to obtain the complete word vector of the word.

At the probability computation end, the probability values of all words are computed quickly using the efficient computation scheme described below, and the final output is obtained by integrating the probabilities through softmax.

The invention relates to a word vector matrix compression method for natural language processing models. For models whose decoding end can implement a full weight sharing strategy, combining this method with the weight sharing strategy achieves acceleration of the model (a word generation model). The invention comprises the following aspects:

1. word vector matrix compression

According to an embodiment of the present disclosure, there is provided a method of obtaining a word vector, including: obtaining a code corresponding to a target word from coding information, and obtaining a first partial word vector corresponding to the code from first vector information; obtaining, according to the target word, a second partial word vector corresponding to the target word from second vector information; and merging the first partial word vector and the second partial word vector to obtain the word vector of the target word.

In the following, the above method is described through a specific embodiment, in which the coding information corresponds to the vocabulary code of the specific embodiment, the first vector information corresponds to the compressed part of the word vector matrix, and the second vector information corresponds to the retained part of the word vector matrix.

To address the problem of excessively large word vector matrices, the invention provides a blockwise word vector matrix compression method. The scheme compresses the word vector matrix while keeping the word vectors of individual words mutually independent, which makes the scheme compatible with a weight sharing strategy and thus achieves both matrix compression and model acceleration. The core scheme is as follows:

first, the word vector matrix is split along the hidden-unit dimension into two or more matrices; one matrix is retained (called the retained part of the word vector matrix), and the other matrix or matrices are compressed by vector quantization (called the compressed part of the word vector matrix). After compression, each compressed matrix consists of two parts: a vocabulary code (each word has a corresponding code) and a vector lookup table (a word's vocabulary code indexes a vector in the lookup table). When a word vector lookup is performed, for a compressed part, the word's vocabulary code is obtained first and the corresponding vector is then looked up in the vector lookup table; for the retained part, the word vector is read directly from the corresponding position; finally, the retrieved partial vectors are concatenated to obtain the final complete word vector.
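A minimal runnable sketch of this lookup path is shown below. The array names (I, C, W_p) follow the notation used later in this disclosure; all sizes and the random initialization are illustrative assumptions.

```python
import numpy as np

# Sketch of partial vector quantization lookup: VQ part + retained part.
V, d = 32000, 512   # vocabulary size, word vector length (assumed values)
w, k = 384, 4096    # compressed width, number of codebook vectors (assumed)

rng = np.random.default_rng(0)
I = rng.integers(0, k, size=V)          # vocabulary code: one code per word
C = rng.standard_normal((k, w))         # vector lookup table (codebook)
W_p = rng.standard_normal((V, d - w))   # retained (exclusive) part

def lookup_word_vector(word_id: int) -> np.ndarray:
    """Reassemble the full word vector from the VQ part and the retained part."""
    vq_part = C[I[word_id]]        # code -> codebook row (compressed part)
    retained_part = W_p[word_id]   # direct read at the word's position
    return np.concatenate([vq_part, retained_part])  # shape (d,)

vec = lookup_word_vector(42)
assert vec.shape == (d,)
```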

2. Reducing probability computation complexity

According to an embodiment of the present disclosure, the method further comprises inputting the word vector into a decoder to obtain context information; cutting the context information into a first part of context information and a second part of context information; obtaining a first probability from the first part of context information and the first partial word vector, and obtaining a second probability from the second part of context information and the second partial word vector; and obtaining the probability of the target word from the first probability and the second probability.

In the following, the above method is described through a specific embodiment, in which the first part of context information corresponds to the compressed part of the context information, and the second part of context information corresponds to the retained part of the context information.

In some word generation models, the invention, combined with a weight sharing strategy, can reduce the complexity of probability computation and thereby accelerate the model. The core scheme is as follows:

first, the decoded context information vector is cut, corresponding to the way the word vector matrix was compressed (i.e., split at the same boundary);

then, each part obtained by the cut (e.g., the retained part and the compressed part of the context information) is combined with the corresponding part of the word vectors. The retained part is multiplied directly with the retained part of the word vector matrix to obtain the retained-part probability result. The compressed part is multiplied with the vector lookup table, and the product is then indexed through the vocabulary code to obtain the compressed-part probability result of all words. Finally, the partial probability results are added to obtain the final probability output.
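The sketch below illustrates this shared-weight probability path; it re-creates I, C, and W_p for self-containment, and all names and sizes are assumptions. Softmax normalization is omitted for brevity.

```python
import numpy as np

# Sketch of the split probability computation: lookup for the VQ part,
# dense product for the retained part, then elementwise addition.
V, d, w, k = 32000, 512, 384, 4096
rng = np.random.default_rng(0)
I = rng.integers(0, k, size=V)          # vocabulary code
C = rng.standard_normal((k, w))         # vector lookup table
W_p = rng.standard_normal((V, d - w))   # retained part
b = np.zeros(V)                         # softmax bias

def logits(h: np.ndarray) -> np.ndarray:
    """Unnormalized word scores from the decoder context vector h."""
    h_c, h_p = h[:w], h[w:]       # cut h at the same boundary as the matrix
    y_vq = (C @ h_c)[I]           # k dot-products, then one lookup per word
    y_p = W_p @ h_p               # dense product with the retained part
    return y_vq + y_p + b         # add the partial results (plus bias)

scores = logits(rng.standard_normal(d))
assert scores.shape == (V,)
```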

3. Dynamic compression training

According to the embodiment of the present disclosure, the coding information and the first vector information are obtained by performing two or more rounds of compression training on an initial word vector matrix.

In the following, the above method is described through a specific embodiment, in which the dynamic compression training method is applied to train the vocabulary code and the compressed part of the word vector matrix; the coding information corresponds to the vocabulary code of the specific embodiment, and the first vector information corresponds to the compressed part of the word vector matrix.

The dynamic training method addresses the loss of model performance that vector quantization compression may cause. The core scheme is as follows:

step 1: and setting a training hyper-parameter. Respectively recording as follows: the total step number Dst _ max of dynamic training; the compression frequency st _ c; a starting compression value k _ begin; a target compression value k; the reduction step st _ k is compressed. And assume that the current compression value is k _ curr and the current training step number is Dst _ curr.

step 2: initializing, and enabling k _ curr to be k _ begin; dst _ curr ═ 0.

step 3: and compressing the word vector matrix once every st _ c step. The compression operation includes: 1. obtaining a word vector matrix; 2. partitioning, and then compressing each block; 3. modifying the value of the word vector matrix according to the compression result; 4. updating the compression ratio according to the compression reduction step size: k _ curr ═ k _ curr-st _ k;

step 4: and finishing the training when the total training steps are reached.Method 1. binary word vector matrix compression and probability computation complexity reduction Is low in

FIG. 3 is a schematic diagram of method 1 according to an embodiment of the present disclosure. As shown in FIG. 3, the word vector matrix is first divided into two parts: the second part is retained (the retained part), and the first part is vector-quantized (the compressed part, or VQ part), yielding the vocabulary code (I) and the vector lookup table (C). To look up a word vector, the vector corresponding to the word's vocabulary code is first retrieved from the vector lookup table; then the retained partial vector at the word's position is taken; finally, the two vectors are concatenated to obtain the final complete word vector.

FIG. 4 is a schematic diagram of method 2 according to an embodiment of the present disclosure. As shown in FIG. 4, the decoded context information vector is first cut, following the word vector matrix compression, into the context information of the compressed part and of the retained part (h_c and h_p);

then, the context information of the compressed part is multiplied with the vector lookup table, and the product is indexed through the vocabulary code to obtain the compressed-part probability result of all words; the context information of the retained part is multiplied directly with the retained part to obtain the retained-part probability result of all words; finally, the two results are added to obtain the final probability output.

Method 2. Multi-segment word vector matrix compression and reduced probability computation complexity

Extending the above example, the word vector matrix may be divided into multiple parts, only one of which is retained; compressing the remaining parts achieves the same compression and acceleration effects.

Method 3. dynamic compression

As in the dynamic compression method described above, FIG. 5 shows the dynamic compression steps for a two-segment word vector matrix. Dynamic compression may also be applied to multi-segment word vector matrix compression.

Because the vocabulary is too large, the current word generation model still faces huge memory overhead and computational complexity problems.

For the memory problem, the strategy of sharing weights between the word vector matrix and the softmax layer is widely applied.

Building on the weight sharing strategy, some word vector matrix compression methods can solve the computational complexity problem at the same time.

In the present disclosure, a new method, a partial vector quantization method, is proposed to achieve model acceleration while compressing the word vector matrix.

The word vector matrix is divided into two parts using a window; one part is compressed into a vector lookup table by vector quantization, while the other part is kept unchanged to maintain word independence.

For each word, a long word vector may be decomposed into one code and one short word vector.

When used in conjunction with a weight sharing strategy, the method of the present disclosure may reduce the computational complexity of the word probability in the softmax layer by converting the matrix multiplication operation to a lookup operation.

In addition, by dynamically learning vocabulary coding, the method of the present disclosure may achieve higher model quality.

Experimental results on a neural machine translation model show that the method of the present disclosure can reduce the word vector matrix parameters by 74.51%, reduce the FLOPs in the softmax layer by 74.41%, while maintaining a high translation quality on the WMT test set.

The weight sharing strategy is very useful in word generation tasks: it reduces memory cost and makes model acceleration possible when the word vector matrix is compressed.

The only requirement for using the weight sharing strategy is to ensure independence between the word vectors.

Based on the above considerations, the present disclosure proposes a partial vector quantization method, which is well compatible with a weight sharing strategy and achieves model acceleration while compressing the word vector matrix.

The present disclosure may include two parts:

Partial vector quantization. First, the word vector matrix is divided into two parts; one part is kept unchanged, and the other part is compressed.

Dynamic learning. Unlike previous methods, the vocabulary code and the vector lookup table are learned using a dynamic learning approach.

Partial vector quantization:

In partial vector quantization, the word vector matrix is first divided into two parts using a window: a compressed portion and an exclusive portion. The compressed portion is then compressed by vector quantization, while the exclusive portion is kept unchanged.

Partial vector quantization uses two hyper-parameters: $w$ and $k$. Let $V$ denote the vocabulary size and $d$ the length of each word vector; $W_e \in \mathbb{R}^{V \times d}$ denotes the word vector matrix, $W_w \in \mathbb{R}^{d \times V}$ denotes the parameter matrix in the softmax layer, and $b \in \mathbb{R}^{V}$ denotes the bias vector. Suppose $I = [i_1, \ldots, i_V]$ is the vocabulary code, where $i_n \in [1, \ldots, k]$ is the code of the $n$-th word, $C \in \mathbb{R}^{k \times w}$ is the vector lookup table, and $W_p \in \mathbb{R}^{V \times (d - w)}$ is the exclusive portion. The total word vector matrix parameter size is

$$P_{PVQ} = O(k \times w + V \times (d - w))$$

which, compared with the uncompressed size $P = O(V \times d)$, saves

$$\Delta P_{PVQ} = O((V - k) \times w)$$

It can thus be seen that the larger $w$, the higher the compression ratio; and the smaller $k$, the higher the compression ratio.
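As a worked illustration with assumed values not taken from the disclosure, let $V = 32000$, $d = 512$, $w = 384$, and $k = 4096$:

$$P = V \times d = 16{,}384{,}000, \qquad P_{PVQ} = 4096 \times 384 + 32000 \times 128 = 5{,}668{,}864$$

so under these assumed settings the saving is $\Delta P_{PVQ} = (V - k) \times w = 27904 \times 384 = 10{,}715{,}136$, i.e., roughly $65\%$ of the word vector matrix parameters are removed.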

At the softmax layer, a more efficient method is used to compute the probability of each word.

Since

$$W_w^T = W_e = f_{concat}(f_{lookup}(C, I), W_p)$$

the given context information is first decomposed as

$$h = f_{concat}(h_c, h_p)$$

Then the VQ-part information is computed first:

$$Y_{VQ} = f_{lookup}(C h_c, I)$$

The number of FLOPs here is

$$F_{VQ} = (2w - 1) \times k$$

Then the exclusive-part information is computed:

$$Y_P = W_p h_p$$

The number of FLOPs here is

$$F_P = (2(d - w) - 1) \times V$$

The probability of each word can finally be expressed as

$$Y_l = f_{lookup}(C h_c, I) + W_p h_p + b$$

and the total number of FLOPs is

$$F_{PVQ} = F_{VQ} + F_P + 2V$$

Compared with the uncompressed version,

$$\Delta F_{PVQ} = F - F_{PVQ} = (2w - 1) \times (V - k)$$

It follows that the computational complexity is positively correlated with $k$ and negatively correlated with $w$.
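Continuing the illustrative setting above ($V = 32000$, $d = 512$, $w = 384$, $k = 4096$), and taking the uncompressed cost as $F = (2d - 1) \times V + V = 2dV$:

$$F = 32{,}768{,}000, \qquad F_{PVQ} = 767 \times 4096 + 255 \times 32000 + 64000 = 11{,}365{,}632$$

so $\Delta F_{PVQ} = (2w - 1) \times (V - k) = 767 \times 27904 = 21{,}402{,}368$, a reduction of roughly $65\%$ in softmax-layer FLOPs under these assumed values.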

Dynamic learning

A dynamic learning method is introduced, by means of which the quality of the model can be improved.

Vector quantization divides the vectors into different groups, with the center point of a group representing all vectors in that group. When the vectors are fixed, vector quantization is effectively a clustering algorithm. In existing work, some approaches directly apply k-means clustering for vector quantization and then train the vector lookup table by fine-tuning; others use a direct training method to train the vocabulary code, the vector lookup table, and the other parameters jointly. The authors of the second approach also point out that for models whose word vector matrix is trained jointly with other parameters, the direct training method may not be usable: the vocabulary code and vector lookup table must be trained first and then fine-tuned.

When learning the vocabulary code from a word vector matrix, a problem was found: in some models, the distances between word vectors are small. When different clustering algorithms were tried, the clustering results either failed to converge, or the words were divided into a few large groups and many small groups. A good vocabulary code is very important in vector quantization, and it is difficult to obtain one if the word vector matrix is quantized only once. Furthermore, the larger the compression ratio, the more information is lost. In view of the above, dynamic learning is proposed to better learn the vocabulary code and the vector lookup table.

At the beginning, a large number of clusters k_begin is set as the current vector quantization number to reduce information loss; the number of clusters is then dynamically reduced by st_k until k_curr equals the target number k.

Partial vector quantization is performed every st_c steps.

In each partial vector quantization, the words are clustered using k_curr and the values of the word vectors in the VQ part, yielding the vocabulary code and the vector lookup table.

All word vectors in the VQ part are then replaced by vectors looked up from the vector lookup table through the vocabulary code.

The training steps in the dynamic learning process are the same as in pre-training, and the word vector matrix remains in uncompressed form.

In other words, the vocabulary code and the vector lookup table are learned without any other change to training; only at the quantization step are the values of the word vector matrix forcibly overwritten according to the vector lookup table and the vocabulary code.

After the vocabulary code has been learned, it is fixed, and the vector lookup table is fine-tuned together with the other parameters.
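A minimal sketch of this dynamic compression schedule follows, assuming k-means (via SciPy) as the clustering algorithm and a placeholder train_one_step() for the model's ordinary training update; all hyper-parameter values are illustrative assumptions, not values from the disclosure.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Sketch of dynamic compression training (steps 1-4 above).
V, d, w = 32000, 512, 384
Dst_max, st_c = 2000, 200           # total steps, compression frequency (assumed)
k_begin, k, st_k = 8192, 4096, 512  # start/target cluster counts, reduction step

rng = np.random.default_rng(0)
W_e = rng.standard_normal((V, d))   # word vector matrix (trained elsewhere)

def train_one_step(W_e):
    """Placeholder for one ordinary training update of the model."""
    return W_e + 0.0

k_curr = k_begin
for Dst_curr in range(1, Dst_max + 1):
    W_e = train_one_step(W_e)
    if Dst_curr % st_c == 0:
        # Cluster the VQ part with k_curr centroids to obtain C (lookup
        # table) and I (vocabulary code), then overwrite the VQ part.
        C, I = kmeans2(W_e[:, :w], k_curr, minit='points', seed=0)
        W_e[:, :w] = C[I]
        k_curr = max(k_curr - st_k, k)  # decay the cluster count toward k
```

Clamping k_curr at the target k mirrors the schedule above, in which the cluster count decays from k_begin to k while training otherwise proceeds unchanged.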

At least one of the plurality of units or modules according to an exemplary embodiment of the present invention may be implemented by an AI model. The functions associated with the AI may be performed by the non-volatile memory, the volatile memory, and the processor.

The processor may include one or more processors. The one or more processors may be general-purpose processors such as a Central Processing Unit (CPU) or an Application Processor (AP), graphics-only processors such as a Graphics Processing Unit (GPU) or a Vision Processing Unit (VPU), and/or AI-specific processors such as a Neural Processing Unit (NPU).

The one or more processors control the processing of the input data according to predefined operating rules or Artificial Intelligence (AI) models stored in the non-volatile memory and the volatile memory. The predefined operating rules or artificial intelligence models are provided through training or learning.

Here, the provision by learning means that a predefined operation rule or AI model having a desired characteristic is formed by applying a learning algorithm to a plurality of learning data. The learning may be performed in the device itself performing the AI according to the embodiment, and/or may be implemented by a separate server/system.

The artificial intelligence model may be composed of multiple neural network layers. Each layer has a plurality of weight values, and a layer operation is performed on the output of the previous layer using the plurality of weights. Examples of neural networks include, but are not limited to, Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Bidirectional Recurrent Deep Neural Networks (BRDNNs), Generative Adversarial Networks (GANs), and Deep Q-Networks.

A learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.

The user's intention can be obtained by interpreting text using a Natural Language Understanding (NLU) model. The NLU model may be an artificial intelligence model. The artificial intelligence model can be processed by an artificial intelligence specific processor designed with a hardware architecture that specifies the processing for the artificial intelligence model. The artificial intelligence model may be obtained through training. Here, "obtained by training" means that a basic artificial intelligence model is trained by a training algorithm using a plurality of training data to obtain a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose). The artificial intelligence model can include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values, and neural network calculation is performed by calculation between a calculation result of a previous layer and the plurality of weight values.

Language understanding is a technique to recognize and apply/process human language/text, including natural language processing, machine translation, dialog systems, question answering, or speech recognition/synthesis.
