Compression method and system of multi-language BERT sequence labeling model


Abstract: The invention, created by 撖朝润, 李琦, 傅洛伊, and 王新兵 on 2020-12-16, provides a compression method and system for a multilingual BERT sequence labeling model and relates to the technical field of knowledge distillation for BERT-class models. The method comprises: Step 1: extracting a vocabulary from a multilingual corpus based on the Wordpiece algorithm; Step 2: pre-training multi-/single-language BERT teacher models and a multilingual BERT student model; Step 3: fine-tuning the multi-/single-language BERT teacher models on manually labeled downstream task data; Step 4: performing residual knowledge distillation on the pre-trained multilingual BERT student model using the multi-/single-language BERT teacher models; Step 5: fine-tuning the distilled multilingual BERT student model on manually labeled downstream task data. Through residual learning and many-to-one knowledge distillation, the invention improves the accuracy and generalization of the student model and reduces the hardware resources required to deploy BERT-class sequence labeling models in a multilingual environment.

1. A compression method for a multilingual BERT sequence labeling model, comprising:

Step 1: extracting a vocabulary from a multilingual corpus based on the Wordpiece algorithm;

Step 2: pre-training multi-/single-language BERT teacher models and a multilingual BERT student model on training data tokenized with the vocabulary;

Step 3: fine-tuning the multi-/single-language BERT teacher models on manually labeled downstream task data;

Step 4: performing residual knowledge distillation on the pre-trained multilingual BERT student model using the multi-/single-language BERT teacher models;

Step 5: fine-tuning the distilled multilingual BERT student model on manually labeled downstream task data.

2. The method of claim 1, wherein step 1 comprises:

Step 1.1: initializing the vocabulary to the set of all characters appearing in the multilingual corpus;

Step 1.2: tokenizing the multilingual corpus and counting the occurrence frequency of all element pairs formed from the vocabulary;

Step 1.3: calculating the likelihood values of all sentences in the corpus according to the likelihood formula

$\log P(s)=\sum_{i=1}^{n}\log P(c_i)$,

where $s$ denotes a natural sentence in the corpus, $c_i$ denotes the $i$-th element in the sentence, and $n$ denotes the number of elements in the sentence;

traversing the set of element pairs obtained in Step 1.2 and adding to the vocabulary, as a new element, the element pair whose merge maximizes

$\sum_{k=1}^{m}\log P(s_k)$,

where $s_k$ denotes the $k$-th sentence in the corpus and $m$ denotes the total number of sentences in the corpus;

Step 1.4: repeating Steps 1.2 and 1.3 until the vocabulary size reaches a preset value, which is related to the number of languages covered by the corpus.

3. The method of claim 1, wherein step 2 comprises:

Step 2.1: determining the BERT model structures of the teacher models and the student model, including the number of Transformer layers L, the hidden layer dimension H, and the number of attention heads A of each model, and randomly initializing all model parameters;

Step 2.2: tokenizing the pre-training corpus and splitting it into segments no longer than the preset maximum segment length, wherein the pre-training corpus of the multilingual BERT teacher model and of the multilingual BERT student model is the full multilingual corpus, and the pre-training corpus of each single-language BERT teacher model is the subset of the multilingual corpus in the corresponding language;

Step 2.3: randomly masking words in each segment;

Step 2.4: mapping each word in a segment to its semantic vector and its position vector, adding the two vectors, and feeding the result into the BERT model for the forward pass;

Step 2.5: mapping the output vectors of the BERT model into a vocabulary-sized vector space through a fully connected prediction layer to obtain the predicted probability distribution of the masked words, and computing the cross-entropy loss;

Step 2.6: computing the gradient of the loss with respect to the BERT model parameters and updating all parameters of the BERT model by stochastic gradient descent;

Step 2.7: repeating Steps 2.3 to 2.6 until the preset number of iterations is reached, and saving the parameters of each teacher model and of the student model at the end of pre-training.

4. The method of claim 1, wherein step 3 comprises:

Step 3.1: loading the pre-training parameters of each teacher model saved in Step 2;

Step 3.2: tokenizing the manually labeled downstream task data and splitting it into segments no longer than the preset maximum segment length, wherein the training data of the multilingual BERT teacher model is the full downstream task data, and the training data of each single-language BERT teacher model is the subset of the downstream task data in the corresponding language;

Step 3.3: taking an original natural sentence of the training data as input to obtain the corresponding output vectors of the BERT model, and mapping the output vector of each word to the tag space of the downstream task through a fully connected prediction layer to obtain the labeling result of the input sentence;

Step 3.4: computing the cross-entropy loss between the labeling result of the BERT model and the manual annotation, and fine-tuning all parameters of the BERT model according to the gradient of the loss with respect to the model parameters;

Step 3.5: repeating Steps 3.3 to 3.4 until the preset number of iterations is reached, and saving the parameters of each teacher model at the end of training.

5. The method of claim 1, wherein the step 4 comprises:

Step 4.1: loading the fine-tuned parameters of each teacher model saved in Step 3 and the pre-training parameters of the student model saved in Step 2;

Step 4.2: selecting a suitable multilingual distillation corpus;

Step 4.3: inputting the multilingual distillation corpus into the multilingual BERT teacher model to obtain the corresponding model output $Z_T$, and then obtaining the soft labels predicted by the multilingual BERT teacher model as $Z'_T=\mathrm{Softmax}(Z_T/T)$, where $\mathrm{Softmax}(\cdot)$ denotes the Softmax function and $T$ is a smoothing parameter;

for each single-language BERT teacher model, inputting the part of the multilingual distillation corpus in the corresponding language into the model to obtain the corresponding model output $Z_{T_i}$, where $i$ denotes the $i$-th language;

then obtaining the soft labels predicted by the single-language BERT teacher models as $Z'_{T_i}=\mathrm{Softmax}(Z_{T_i}/T)$;

Step 4.4: initializing the student model queue to be empty, and initializing the learning objective of the student model as $L=\mathrm{KL}(Z'_S\,\|\,Z'_T)$, where $\mathrm{KL}(\cdot)$ denotes the KL divergence, $Z'_S$ denotes the soft labels produced by the multilingual BERT student model on the multilingual distillation corpus, computed as in Step 4.3, and $Z'_T$ denotes the soft labels output by the teacher models;

training a multilingual BERT student model $S_0$ with the learning objective as the loss function, and adding the trained model to the student model queue;

updating $Z'_T$ in the learning objective of the student model to $Z'_T-\sum_{j}Z'_{S_j}$, i.e., the residual between the soft labels output by the teacher models and the sum of the soft labels output by all student models currently in the queue;

continuing to train a multilingual BERT student model $S_1$ with the updated learning objective as the loss function, and adding the trained model to the student model queue;

repeatedly updating the learning objective of the student model and adding new models to the student model queue in this way until the length of the queue reaches a preset maximum;

Step 4.5: saving the parameters of all student models in the student model queue after residual knowledge distillation.

6. The method of claim 1, wherein the step 5 comprises:

Step 5.1: loading the parameters of all student models $\{S_0,S_1,\dots,S_k\}$ after residual knowledge distillation saved in Step 4;

Step 5.2: fine-tuning the student model queue obtained in Step 4 on the manually labeled downstream task data:

first, feeding the training samples to all student models in parallel to obtain the model output $O=\sum_i O_i$, where $O_i$ denotes the output of student model $S_i$;

then, fine-tuning the whole student model queue with the cross entropy between $O$ and the manual annotation as the loss function;

Step 5.3: saving the parameters of all student models in the fine-tuned student model queue; this student model queue is the compressed multilingual BERT sequence labeling model finally output by the invention.

7. A compression system for a multilingual BERT sequence labeling model, said system comprising:

a vocabulary module: extracting a vocabulary from a multilingual corpus based on the Wordpiece algorithm;

a pre-training module: pre-training multi-/single-language BERT teacher models and a multilingual BERT student model on training data tokenized with the vocabulary;

an adjusting module: fine-tuning the multi-/single-language BERT teacher models on manually labeled downstream task data;

a distillation module: performing residual knowledge distillation on the pre-trained multilingual BERT student model using the multi-/single-language BERT teacher models;

a result module: fine-tuning the distilled multilingual BERT student model on manually labeled downstream task data.

8. The system of claim 7, wherein the pre-training module comprises:

determining the BERT model structures of the teacher models and the student model, including the number of Transformer layers L, the hidden layer dimension H, and the number of attention heads A of each model, and randomly initializing all model parameters;

tokenizing the pre-training corpus and splitting it into segments no longer than the preset maximum segment length, wherein the pre-training corpus of the multilingual BERT teacher model and of the multilingual BERT student model is the full multilingual corpus, and the pre-training corpus of each single-language BERT teacher model is the subset of the multilingual corpus in the corresponding language;

randomly masking words in each segment;

mapping each word in a segment to its semantic vector and its position vector, adding the two vectors, and feeding the result into the BERT model for the forward pass;

mapping the output vectors of the BERT model into a vocabulary-sized vector space through a fully connected prediction layer to obtain the predicted probability distribution of the masked words, and computing the cross-entropy loss;

computing the gradient of the loss with respect to the BERT model parameters and updating all parameters of the BERT model by stochastic gradient descent;

repeating the steps from randomly masking words in each segment to updating all parameters of the BERT model until the preset number of iterations is reached, and saving the parameters of each teacher model and of the student model at the end of pre-training.

9. The system of claim 7, wherein the adjustment module comprises:

loading the pre-training parameters of each teacher model saved by the pre-training module;

tokenizing the manually labeled downstream task data and splitting it into segments no longer than the preset maximum segment length, wherein the training data of the multilingual BERT teacher model is the full downstream task data, and the training data of each single-language BERT teacher model is the subset of the downstream task data in the corresponding language;

taking an original natural sentence of the training data as input to obtain the corresponding output vectors of the BERT model, and mapping the output vector of each word to the tag space of the downstream task through a fully connected prediction layer to obtain the labeling result of the input sentence;

computing the cross-entropy loss between the labeling result of the BERT model and the manual annotation, and fine-tuning all parameters of the BERT model according to the gradient of the loss with respect to the model parameters;

repeating the preceding two steps until the preset number of iterations is reached, and saving the parameters of each teacher model at the end of training.

10. The system of claim 7, wherein the distillation module comprises:

loading the fine-tuned parameters of each teacher model saved by the adjusting module and the pre-training parameters of the student model saved by the pre-training module;

selecting a suitable multilingual distillation corpus;

inputting the multilingual distillation corpus into the multilingual BERT teacher model to obtain the corresponding model output $Z_T$, and then obtaining the soft labels predicted by the multilingual BERT teacher model as $Z'_T=\mathrm{Softmax}(Z_T/T)$, where $\mathrm{Softmax}(\cdot)$ denotes the Softmax function and $T$ is a smoothing parameter;

for each single-language BERT teacher model, inputting the part of the multilingual distillation corpus in the corresponding language into the model to obtain the corresponding model output $Z_{T_i}$, where $i$ denotes the $i$-th language;

then obtaining the soft labels predicted by the single-language BERT teacher models as $Z'_{T_i}=\mathrm{Softmax}(Z_{T_i}/T)$;

initializing the student model queue to be empty, and initializing the learning objective of the student model as $L=\mathrm{KL}(Z'_S\,\|\,Z'_T)$, where $\mathrm{KL}(\cdot)$ denotes the KL divergence, $Z'_S$ denotes the soft labels produced by the multilingual BERT student model on the multilingual distillation corpus, computed as above, and $Z'_T$ denotes the soft labels output by the teacher models;

training a multilingual BERT student model $S_0$ with the learning objective as the loss function, and adding the trained model to the student model queue;

updating $Z'_T$ in the learning objective of the student model to $Z'_T-\sum_{j}Z'_{S_j}$, i.e., the residual between the soft labels output by the teacher models and the sum of the soft labels output by all student models currently in the queue;

continuing to train a multilingual BERT student model $S_1$ with the updated learning objective as the loss function, and adding the trained model to the student model queue;

repeatedly updating the learning objective of the student model and adding new models to the student model queue in this way until the length of the queue reaches a preset maximum;

saving the parameters of all student models in the student model queue after residual knowledge distillation.

Technical Field

The invention relates to the technical field of knowledge distillation of BERT models, in particular to a compression method and a compression system of a multilingual BERT sequence labeling model.

Background

BERT is a large-scale pre-trained language model based on Transformer encoders. In recent years, BERT has shown great strength on many downstream tasks. Sequence labeling is a class of tasks that classify the elements of a sequence; common sequence labeling tasks include named entity recognition, part-of-speech tagging, and so on. In a multilingual environment, using a separate single-language BERT model for each language consumes enormous computing resources; at the same time, for languages with scarce training corpora, neither BERT nor traditional models achieve good results. Multilingual BERT can model hundreds of languages simultaneously through vocabulary sharing and joint training, saving resources while improving the overall performance of BERT in a multilingual environment.

Although the multilingual BERT model achieves excellent results on sequence labeling tasks, the inference speed of even a single BERT model is limited by its huge size. To apply multilingual BERT to sequence labeling in low-latency scenarios, the industry commonly compresses BERT models with methods such as knowledge distillation. Knowledge distillation transfers the knowledge a teacher model has learned on a downstream task into a student model: the teacher first runs inference on an unlabeled distillation corpus to produce soft labels, and the student is then trained to fit the teacher's outputs on the same data, improving its prediction accuracy so that, in deployment, the smaller and faster student model can replace the teacher.

The prior art has the following technical drawbacks: existing knowledge distillation techniques for multilingual BERT sequence labeling models all adopt a one-to-one training scheme, i.e., distillation from a multilingual BERT teacher model to a multilingual BERT student model, which ignores the fact that a multilingual BERT model is not superior to single-language BERT models in every language; in addition, the student model and the teacher model still differ greatly in structural complexity, so a single student model cannot effectively fit the teacher's outputs.

Disclosure of Invention

To address the defects of the prior art, the invention aims to provide a compression method and system for a multilingual BERT sequence labeling model that improve the knowledge distillation of BERT sequence labeling models in a multilingual setting by mixing multi- and single-language teacher models and by residual learning.

The compression method of the multilingual BERT sequence labeling model provided by the invention is as follows:

in a first aspect, a compression method of a multilingual BERT sequence annotation model is provided, the method comprising:

extracting a vocabulary from a multilingual corpus based on the Wordpiece algorithm;

pre-training multi-/single-language BERT teacher models and a multilingual BERT student model on training data tokenized with the vocabulary;

fine-tuning the multi-/single-language BERT teacher models on manually labeled downstream task data;

performing residual knowledge distillation on the pre-trained multilingual BERT student model using the multi-/single-language BERT teacher models;

fine-tuning the distilled multilingual BERT student model on manually labeled downstream task data.

Preferably, the extracting of the vocabulary from the multilingual corpus based on the Wordpiece algorithm includes:

initializing the vocabulary to the set of all characters appearing in the multilingual corpus;

tokenizing the multilingual corpus and counting the occurrence frequency of all element pairs formed from the vocabulary;

calculating the likelihood values of all sentences in the corpus according to the likelihood formula

$\log P(s)=\sum_{i=1}^{n}\log P(c_i)$,

where $s$ denotes a natural sentence in the corpus, $c_i$ denotes the $i$-th element in the sentence, and $n$ denotes the number of elements in the sentence;

traversing the set of element pairs and adding to the vocabulary, as a new element, the element pair whose merge maximizes $\sum_{k=1}^{m}\log P(s_k)$, where $s_k$ denotes the $k$-th sentence in the corpus and $m$ denotes the total number of sentences in the corpus;

repeating the above two steps until the vocabulary size reaches a preset value, which is related to the number of languages covered by the corpus.

Preferably, the pre-training of the multi-/single-language BERT teacher models and the multilingual BERT student model includes:

determining the BERT model structures of the teacher models and the student model, including the number of Transformer layers L, the hidden layer dimension H, and the number of attention heads A of each model, and randomly initializing all model parameters;

tokenizing the pre-training corpus and splitting it into segments no longer than the preset maximum segment length, wherein the pre-training corpus of the multilingual BERT teacher model and of the multilingual BERT student model is the full multilingual corpus, and the pre-training corpus of each single-language BERT teacher model is the subset of the multilingual corpus in the corresponding language;

randomly masking words in each segment;

mapping each word in a segment to its semantic vector and its position vector, adding the two vectors, and feeding the result into the BERT model for the forward pass;

mapping the output vectors of the BERT model into a vocabulary-sized vector space through a fully connected prediction layer to obtain the predicted probability distribution of the masked words, and computing the cross-entropy loss;

computing the gradient of the loss with respect to the BERT model parameters and updating all parameters of the BERT model by stochastic gradient descent;

repeating the steps from randomly masking words in each segment to updating all parameters of the BERT model until the preset number of iterations is reached, and saving the parameters of each teacher model and of the student model at the end of pre-training.

Preferably, the fine-tuning of the multi-/single-language BERT teacher models on the manually labeled downstream task data includes:

loading the pre-training parameters of each teacher model saved in the step of pre-training the multi-/single-language BERT teacher models and the multilingual BERT student model;

tokenizing the manually labeled downstream task data and splitting it into segments no longer than the preset maximum segment length, wherein the training data of the multilingual BERT teacher model is the full downstream task data, and the training data of each single-language BERT teacher model is the subset of the downstream task data in the corresponding language;

taking an original natural sentence of the training data as input to obtain the corresponding output vectors of the BERT model, and mapping the output vector of each word to the tag space of the downstream task through a fully connected prediction layer to obtain the labeling result of the input sentence;

computing the cross-entropy loss between the labeling result of the BERT model and the manual annotation, and fine-tuning all parameters of the BERT model according to the gradient of the loss with respect to the model parameters;

repeating the preceding two steps until the preset number of iterations is reached, and saving the parameters of each teacher model at the end of training.

Preferably, the residual knowledge distillation of the pre-trained multilingual BERT student model using the multi-/single-language BERT teacher models includes:

loading the fine-tuned parameters of each teacher model saved in the step of fine-tuning the multi-/single-language BERT teacher models and the pre-training parameters of the student model saved in the step of pre-training the multi-/single-language BERT teacher models and the multilingual BERT student model;

selecting a suitable multilingual distillation corpus;

inputting the multilingual distillation corpus into the multilingual BERT teacher model to obtain the corresponding model output $Z_T$, and then obtaining the soft labels predicted by the multilingual BERT teacher model as $Z'_T=\mathrm{Softmax}(Z_T/T)$, where $\mathrm{Softmax}(\cdot)$ denotes the Softmax function and $T$ is a smoothing parameter;

for each single-language BERT teacher model, inputting the part of the multilingual distillation corpus in the corresponding language into the model to obtain the corresponding model output $Z_{T_i}$, where $i$ denotes the $i$-th language;

then obtaining the soft labels predicted by the single-language BERT teacher models as $Z'_{T_i}=\mathrm{Softmax}(Z_{T_i}/T)$;

initializing the student model queue to be empty, and initializing the learning objective of the student model as $L=\mathrm{KL}(Z'_S\,\|\,Z'_T)$, where $\mathrm{KL}(\cdot)$ denotes the KL divergence, $Z'_S$ denotes the soft labels produced by the multilingual BERT student model on the multilingual distillation corpus, computed as above, and $Z'_T$ denotes the soft labels output by the teacher models;

training a multilingual BERT student model $S_0$ with the learning objective as the loss function, and adding the trained model to the student model queue;

updating $Z'_T$ in the learning objective of the student model to $Z'_T-\sum_{j}Z'_{S_j}$, i.e., the residual between the soft labels output by the teacher models and the sum of the soft labels output by all student models currently in the queue;

continuing to train a multilingual BERT student model $S_1$ with the updated learning objective as the loss function, and adding the trained model to the student model queue;

repeatedly updating the learning objective of the student model and adding new models to the student model queue in this way until the length of the queue reaches a preset maximum;

saving the parameters of all student models in the student model queue after residual knowledge distillation.

Preferably, the fine-tuning of the distilled multilingual BERT student model on the manually labeled downstream task data includes:

loading the parameters of all student models $\{S_0,S_1,\dots,S_k\}$ after residual knowledge distillation, saved in the residual knowledge distillation step;

fine-tuning the student model queue obtained in the residual knowledge distillation step on the manually labeled downstream task data:

first, feeding the training samples to all student models in parallel to obtain the model output $O=\sum_i O_i$, where $O_i$ denotes the output of student model $S_i$;

then, fine-tuning the whole student model queue with the cross entropy between $O$ and the manual annotation as the loss function;

saving the parameters of all student models in the fine-tuned student model queue; this student model queue is the compressed multilingual BERT sequence labeling model finally output by the invention.

In a second aspect, there is provided a compression system for a multilingual BERT sequence annotation model, said system comprising:

a vocabulary module: extracting a vocabulary from a multilingual corpus based on the Wordpiece algorithm;

a pre-training module: pre-training multi-/single-language BERT teacher models and a multilingual BERT student model on training data tokenized with the vocabulary;

an adjusting module: fine-tuning the multi-/single-language BERT teacher models on manually labeled downstream task data;

a distillation module: performing residual knowledge distillation on the pre-trained multilingual BERT student model using the multi-/single-language BERT teacher models;

a result module: fine-tuning the distilled multilingual BERT student model on manually labeled downstream task data.

Preferably, the pre-training module comprises:

determining the BERT model structures of the teacher models and the student model, including the number of Transformer layers L, the hidden layer dimension H, and the number of attention heads A of each model, and randomly initializing all model parameters;

tokenizing the pre-training corpus and splitting it into segments no longer than the preset maximum segment length, wherein the pre-training corpus of the multilingual BERT teacher model and of the multilingual BERT student model is the full multilingual corpus, and the pre-training corpus of each single-language BERT teacher model is the subset of the multilingual corpus in the corresponding language;

randomly masking words in each segment;

mapping each word in a segment to its semantic vector and its position vector, adding the two vectors, and feeding the result into the BERT model for the forward pass;

mapping the output vectors of the BERT model into a vocabulary-sized vector space through a fully connected prediction layer to obtain the predicted probability distribution of the masked words, and computing the cross-entropy loss;

computing the gradient of the loss with respect to the BERT model parameters and updating all parameters of the BERT model by stochastic gradient descent;

repeating the steps from randomly masking words in each segment to updating all parameters of the BERT model until the preset number of iterations is reached, and saving the parameters of each teacher model and of the student model at the end of pre-training.

Preferably, the adjusting module includes:

loading the pre-training parameters of each teacher model saved by the pre-training module;

tokenizing the manually labeled downstream task data and splitting it into segments no longer than the preset maximum segment length, wherein the training data of the multilingual BERT teacher model is the full downstream task data, and the training data of each single-language BERT teacher model is the subset of the downstream task data in the corresponding language;

taking an original natural sentence of the training data as input to obtain the corresponding output vectors of the BERT model, and mapping the output vector of each word to the tag space of the downstream task through a fully connected prediction layer to obtain the labeling result of the input sentence;

computing the cross-entropy loss between the labeling result of the BERT model and the manual annotation, and fine-tuning all parameters of the BERT model according to the gradient of the loss with respect to the model parameters;

repeating the preceding two steps until the preset number of iterations is reached, and saving the parameters of each teacher model at the end of training.

Preferably, the distillation module comprises:

loading the fine-tuned parameters of each teacher model saved by the adjusting module and the pre-training parameters of the student model saved by the pre-training module;

selecting a suitable multilingual distillation corpus;

inputting the multilingual distillation corpus into the multilingual BERT teacher model to obtain the corresponding model output $Z_T$, and then obtaining the soft labels predicted by the multilingual BERT teacher model as $Z'_T=\mathrm{Softmax}(Z_T/T)$, where $\mathrm{Softmax}(\cdot)$ denotes the Softmax function and $T$ is a smoothing parameter;

for each single-language BERT teacher model, inputting the part of the multilingual distillation corpus in the corresponding language into the model to obtain the corresponding model output $Z_{T_i}$, where $i$ denotes the $i$-th language;

then obtaining the soft labels predicted by the single-language BERT teacher models as $Z'_{T_i}=\mathrm{Softmax}(Z_{T_i}/T)$;

initializing the student model queue to be empty, and initializing the learning objective of the student model as $L=\mathrm{KL}(Z'_S\,\|\,Z'_T)$, where $\mathrm{KL}(\cdot)$ denotes the KL divergence, $Z'_S$ denotes the soft labels produced by the multilingual BERT student model on the multilingual distillation corpus, computed as above, and $Z'_T$ denotes the soft labels output by the teacher models;

training a multilingual BERT student model $S_0$ with the learning objective as the loss function, and adding the trained model to the student model queue;

updating $Z'_T$ in the learning objective of the student model to $Z'_T-\sum_{j}Z'_{S_j}$, i.e., the residual between the soft labels output by the teacher models and the sum of the soft labels output by all student models currently in the queue;

continuing to train a multilingual BERT student model $S_1$ with the updated learning objective as the loss function, and adding the trained model to the student model queue;

repeatedly updating the learning objective of the student model and adding new models to the student model queue in this way until the length of the queue reaches a preset maximum;

saving the parameters of all student models in the student model queue after residual knowledge distillation.

Compared with the prior art, the invention has the following beneficial effects:

1. By mixing single-language and multilingual teacher models, the invention enriches the information sources of the knowledge distillation process, improving both the prediction accuracy of the student model on individual languages and its generalization ability;

2. Through residual learning, the invention strengthens the modeling capacity of the student model and improves the distillation effect without slowing down the inference of any single student model.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a flow chart of the compression method of the present invention;

FIG. 2 is a schematic structural diagram of a multilingual BERT sequence labeling model;

FIG. 3 is a schematic of a residual knowledge distillation process used in the present invention;

FIG. 4 is a schematic diagram of reasoning using the compression model obtained by the present invention.

Detailed Description

The present invention will be described in detail with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit it in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention; all such changes and modifications fall within the scope of the present invention.

The embodiment of the invention provides a compression method of a multilingual BERT sequence annotation model, which is shown in figure 1 and comprises the following steps:

Step 1: extracting a vocabulary from a multilingual corpus based on the Wordpiece algorithm;

Step 2: pre-training multi-/single-language BERT teacher models and a multilingual BERT student model;

Step 3: fine-tuning the multi-/single-language BERT teacher models on manually labeled downstream task data;

Step 4: performing residual knowledge distillation on the pre-trained multilingual BERT student model using the multi-/single-language BERT teacher models;

Step 5: fine-tuning the distilled multilingual BERT student model on manually labeled downstream task data.

The present invention is further described in detail by the following preferred examples:

Taking a BERT named entity recognition model for Chinese, English, and French as an example, the compression method provided in this embodiment involves: pre-training the multilingual, Chinese, English, and French BERT teacher models and the multilingual BERT student model; fine-tuning the multilingual, Chinese, English, and French BERT teacher models on the manually annotated data of the named entity recognition task; residual knowledge distillation of the multilingual BERT student models; fine-tuning the multilingual BERT student models on the manually annotated data of the named entity recognition task; and inference with the multilingual BERT student models.

Specifically, step 1 comprises:

Step 1.1: downloading the Chinese, English, and French Wikipedia datasets as the training corpus, and initializing the vocabulary to all characters appearing in the three-language corpus;

Step 1.2: tokenizing the three-language corpus and counting the occurrence frequency of all element pairs; for example, if the element pair "ab", formed from the elements "a" and "b", occurs 2000 times in the corpus, the frequency of "ab" is 2000;

Step 1.3: calculating the likelihood values of all sentences in the corpus according to the likelihood formula $\log P(s)=\sum_{i=1}^{n}\log P(c_i)$, where $s$ denotes a natural sentence in the corpus, $c_i$ denotes the $i$-th element in the sentence, and $n$ denotes the number of elements in the sentence; traversing the set of element pairs obtained in Step 1.2 and adding to the vocabulary, as a new element, the element pair whose merge maximizes $\sum_{k=1}^{m}\log P(s_k)$, where $s_k$ denotes the $k$-th sentence in the corpus and $m$ denotes the total number of sentences in the corpus; for example, if merging "ab" yields the largest value among all element pairs, "ab" is added to the vocabulary as a new element; if the frequency of "ab" equals the frequency of "a", "a" is deleted from the vocabulary, and similarly "b" is deleted when the frequency of "ab" equals the frequency of "b";

Step 1.4: repeating Steps 1.2 and 1.3 until the vocabulary size reaches a preset value, which is related to the number of languages covered by the corpus.

The step 2 comprises the following steps:

Step 2.1: determining the BERT model structures of the teacher models and the student model; as shown in FIG. 2, the parameters to be determined include the number of Transformer layers L, the hidden layer dimension H, and the number of attention heads A of each model; for example, the BERT teacher models are set to L=24, H=1024, A=16, and the BERT student model is set to L=4, H=512, A=8; all model parameters are randomly initialized;

Step 2.2: tokenizing the Chinese, English, and French Wikipedia corpus and splitting it into segments with a maximum length of 512; Chinese is tokenized character by character, while English and French are tokenized by spaces and punctuation marks; the pre-training corpus of the multilingual BERT teacher model and of the multilingual BERT student model is the full three-language corpus, and the pre-training corpus of each of the Chinese, English, and French BERT teacher models is the subset of the three-language corpus in the corresponding language;

Step 2.3: randomly masking each segment: first, 20% of the words in the segment are randomly selected; of these, 80% are replaced with "[MASK]", 10% are replaced with random words, and 10% are kept unchanged;
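A minimal sketch of the masking in Step 2.3 follows, assuming each segment is a list of tokens and a vocabulary set is available; the "[MASK]" token string and the label convention (original token at masked positions, None elsewhere) are illustrative assumptions.

```python
import random

def mask_segment(tokens, vocab, mask_rate=0.2, mask_token="[MASK]"):
    """Step 2.3: randomly cover a segment. Returns masked tokens and MLM labels
    (the original token at selected positions, None everywhere else)."""
    tokens = list(tokens)
    labels = [None] * len(tokens)
    if not tokens:
        return tokens, labels
    n_select = max(1, int(len(tokens) * mask_rate))    # select 20% of positions
    for i in random.sample(range(len(tokens)), n_select):
        labels[i] = tokens[i]
        r = random.random()
        if r < 0.8:                                    # 80%: replace with [MASK]
            tokens[i] = mask_token
        elif r < 0.9:                                  # 10%: replace with a random vocabulary word
            tokens[i] = random.choice(list(vocab))
        # remaining 10%: keep the original token unchanged
    return tokens, labels
```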

Step 2.4: mapping each word in the segment to its semantic vector and its position vector, adding the two vectors, and feeding the result into the BERT model for the forward pass;

Step 2.5: mapping the output vectors of the BERT model into a vocabulary-sized vector space through a fully connected prediction layer to obtain the predicted probability distribution of the masked words, and computing the cross-entropy loss;

Step 2.6: computing the gradient of the loss with respect to the BERT model parameters and updating all parameters of the BERT model by stochastic gradient descent;

Step 2.7: applying Steps 2.3 to 2.6 to the multilingual BERT teacher model, the Chinese BERT teacher model, the English BERT teacher model, the French BERT teacher model, and the multilingual BERT student model, respectively, until the preset number of iterations is reached, and saving the parameters of each teacher model and of the student model at the end of pre-training.
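One pre-training update (Steps 2.4 to 2.6) can be sketched in PyTorch as below. This is a hedged illustration rather than the patented implementation: `encoder` stands for any stack of Transformer layers returning hidden vectors of dimension H, the embedding tables and plain SGD optimizer in the comments are assumptions, and unmasked positions carry the label -100 so that they are ignored by the cross-entropy loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_step(encoder, tok_emb, pos_emb, prediction_layer, optimizer,
                  token_ids, mlm_labels):
    """Steps 2.4-2.6: embed, forward, predict masked words, cross-entropy, SGD update."""
    positions = torch.arange(token_ids.size(1), device=token_ids.device)
    x = tok_emb(token_ids) + pos_emb(positions)          # semantic vector + position vector
    hidden = encoder(x)                                  # forward pass through the BERT layers
    logits = prediction_layer(hidden)                    # map to vocabulary-sized space
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           mlm_labels.view(-1),
                           ignore_index=-100)            # only masked positions contribute
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                     # stochastic gradient descent update
    return loss.item()

# Hypothetical wiring for the L=4, H=512, A=8 student of the embodiment:
# tok_emb = nn.Embedding(vocab_size, 512)
# pos_emb = nn.Embedding(512, 512)                       # maximum segment length 512
# encoder = nn.TransformerEncoder(
#     nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=4)
# prediction_layer = nn.Linear(512, vocab_size)
# optimizer = torch.optim.SGD([p for m in (tok_emb, pos_emb, encoder, prediction_layer)
#                              for p in m.parameters()], lr=1e-3)
```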

The step 3 comprises the following steps:

Step 3.1: loading the pre-training parameters of each teacher model saved in Step 2;

Step 3.2: obtaining publicly available, manually annotated Chinese, English, and French named entity recognition datasets, tokenizing them, and splitting the data into segments with the preset maximum length of 512; the training data of the multilingual BERT teacher model is the full three-language named entity recognition data, and the training data of each of the Chinese, English, and French BERT teacher models is the subset of the three-language data in the corresponding language;

Step 3.3: taking an original natural sentence of the training data as input to obtain the corresponding output vectors of the BERT model, and mapping the output vector of each word to the tag space of the downstream task through a fully connected prediction layer to obtain the labeling result of the input sentence; suppose the dataset contains three kinds of named entities, namely person names, place names, and organization names; if the output of the fully connected prediction layer for the word "Shanghai" is [0.1, 0.7, 0.1, 0.1], the model predicts a 10% probability each that the word is a person name, an organization name, or none of these, and a 70% probability that it is a place name;

Step 3.4: computing the cross-entropy loss between the labeling result of the BERT model and the manual annotation, and fine-tuning all parameters of the BERT model according to the gradient of the loss with respect to the model parameters;

Step 3.5: repeating Steps 3.3 to 3.4 until the preset number of iterations is reached, and saving the parameters of each teacher model at the end of training.
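A sketch of one fine-tuning update of Steps 3.3 and 3.4 is given below, assuming `bert` returns a sequence of output vectors and `tag_head` is the fully connected prediction layer over the task's tag set; the -100 convention for padded positions and the example sizes are assumptions, not part of the described method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def finetune_teacher_step(bert, tag_head, optimizer, token_ids, tag_ids):
    """Steps 3.3-3.4: forward the sentence, map each word's output vector to the
    downstream tag space, and fine-tune all parameters with cross-entropy."""
    hidden = bert(token_ids)                         # (batch, seq_len, H)
    logits = tag_head(hidden)                        # (batch, seq_len, num_tags)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           tag_ids.view(-1),
                           ignore_index=-100)        # padded positions carry label -100
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical head for the embodiment's four classes (person, place, organization, other):
# tag_head = nn.Linear(1024, 4)                      # H=1024 teacher model
```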

As shown in FIG. 3, Step 4 comprises:

Step 4.1: loading the fine-tuned parameters of the multilingual BERT teacher model and of the Chinese, English, and French BERT teacher models saved in Step 3, and the pre-training parameters of the student model saved in Step 2;

Step 4.2: selecting a suitable multilingual distillation corpus, which may be part of the pre-training corpus or come from another source; this corpus carries no manually annotated downstream task labels; for example, 10% of each of the Chinese, English, and French Wikipedia data is extracted as the multilingual distillation corpus;

Step 4.3: inputting the multilingual distillation corpus into the multilingual BERT teacher model to obtain the corresponding model output $Z_T$, and then obtaining the soft labels predicted by the multilingual BERT teacher model as $Z'_T=\mathrm{Softmax}(Z_T/T)$, where $\mathrm{Softmax}(\cdot)$ denotes the Softmax function and $T$ is a smoothing parameter that can be adjusted as needed; for each single-language BERT teacher model, the part of the multilingual distillation corpus in the corresponding language is input into the model to obtain the corresponding model output $Z_{T_i}$, where $Z_{T_1}$ is the output of the Chinese BERT model, $Z_{T_2}$ is the output of the English BERT model, and $Z_{T_3}$ is the output of the French BERT model; the soft labels predicted by the single-language BERT teacher models are then obtained as $Z'_{T_i}=\mathrm{Softmax}(Z_{T_i}/T)$;
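The temperature-smoothed soft labels of Step 4.3 reduce to a single operation; a minimal sketch follows (the value T=2.0 is an arbitrary example of the adjustable smoothing parameter, and the teacher calls in the comments are hypothetical):

```python
import torch
import torch.nn.functional as F

def soft_labels(logits, T):
    """Z' = Softmax(Z / T): temperature-smoothed soft labels from teacher outputs."""
    return F.softmax(logits / T, dim=-1)

# T = 2.0
# Z_T   = multilingual_teacher(batch)            # multilingual teacher output Z_T
# Zp_T  = soft_labels(Z_T, T)                    # Z'_T
# Z_T1  = chinese_teacher(chinese_batch)         # single-language teacher output Z_{T_1}
# Zp_T1 = soft_labels(Z_T1, T)                   # Z'_{T_1}
```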

Step 4.4: initializing a student model queue to be empty; the learning target for initializing the student model is L ═ KL (Z'S|Z′T) Wherein KL (·) represents KL divergence; z'SThe soft label of the multilingual distillation corpus output by the multilingual BERT student model is represented, and the calculation process is the same as the step 4.3; z'TSoft label representing teacher model output:

training multilingual BERT student model S by taking learning target as loss function0Adding the trained models into a student model queue; updating Z 'in learning objectives of student model'TIs composed ofNamely the residual error between the soft label output by the teacher model and the sum of the soft labels output by all the student models in the current student model queue; using the learning objective as a loss functionMulti-language BERT student model S for digital continuous training1Adding the trained models into a student model queue; repeatedly updating the learning target of the student model and adding a new model into the student model queue according to the method until the length of the student model queue reaches a preset maximum value;
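A minimal sketch of the residual distillation loop of Step 4.4 follows. It assumes `teacher_soft` is a list of teacher soft-label tensors $Z'_T$ aligned with `distill_batches`, and that `make_student()` builds a fresh multilingual BERT student whose output lives in the same space as the teacher soft labels. The KL objective is implemented through its cross-entropy part only (the target's entropy term is constant with respect to the student), which also remains well defined once the target becomes a residual; this choice, the SGD settings, and the single-pass inner loop are implementation assumptions, not part of the described method.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, target, T):
    """L = KL(Z'_S || target), kept only up to the target's entropy term."""
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return -(target * log_p_s).sum(dim=-1).mean()

def residual_distill(make_student, teacher_soft, distill_batches, queue_size, T=2.0, lr=1e-3):
    """Step 4.4: train students one by one; each new student fits the residual between
    the teacher soft labels and the sum of soft labels of the students already trained."""
    queue = []
    targets = [t.clone() for t in teacher_soft]             # initial targets: Z'_T
    for _ in range(queue_size):
        student = make_student()
        optimizer = torch.optim.SGD(student.parameters(), lr=lr)
        for batch, target in zip(distill_batches, targets):
            loss = kd_loss(student(batch), target, T)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        queue.append(student)
        with torch.no_grad():                                # subtract the new student's soft labels
            for i, batch in enumerate(distill_batches):
                targets[i] = targets[i] - F.softmax(student(batch) / T, dim=-1)
    return queue
```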

Step 4.5: saving the parameters of all student models in the student model queue after residual knowledge distillation.

The step 5 comprises the following steps:

Step 5.1: loading the parameters of all student models $\{S_0,S_1,\dots,S_k\}$ after residual knowledge distillation saved in Step 4;

Step 5.2: fine-tuning the student model queue obtained in Step 4 on the manually annotated Chinese, English, and French named entity recognition datasets; first, the training samples are fed to all student models in parallel to obtain the model output $O=\sum_i O_i$, where $O_i$ denotes the output of student model $S_i$; then, the whole student model queue is fine-tuned with the cross entropy between $O$ and the manual annotation as the loss function;

Step 5.3: saving the parameters of all student models in the fine-tuned student model queue; this queue is the compressed multilingual BERT sequence labeling model finally output by the invention.
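One fine-tuning update of Step 5.2 can be sketched as follows, assuming each student already ends in its own prediction layer and therefore outputs tag-space scores; summing logits rather than probabilities, and the -100 padding convention, are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def finetune_queue_step(students, optimizer, token_ids, tag_ids):
    """Step 5.2: feed the sample to every student in parallel, sum the outputs
    O = sum_i O_i, and fine-tune the whole queue with cross-entropy against
    the manual annotation."""
    O = sum(student(token_ids) for student in students)     # (batch, seq_len, num_tags)
    loss = F.cross_entropy(O.view(-1, O.size(-1)),
                           tag_ids.view(-1),
                           ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The optimizer covers the parameters of every student in the queue, e.g.:
# params = [p for s in students for p in s.parameters()]
# optimizer = torch.optim.SGD(params, lr=1e-4)
```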

The process of named entity recognition using the multilingual BERT student models obtained by the proposed method is as follows, as shown in FIG. 4:

First, the sentence to be labeled is fed to all student models in parallel to obtain the model output $O=\sum_i O_i$, where $O_i$ denotes the output of student model $S_i$; for each word, the tag with the highest predicted probability in the corresponding output is selected as the labeling result.
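Inference with the compressed model then amounts to summing the student outputs and taking an argmax per word; a minimal sketch, with an illustrative tag list, follows.

```python
import torch

@torch.no_grad()
def label_sentence(students, token_ids, tag_names):
    """Feed the sentence to all students, sum the outputs O = sum_i O_i, and pick
    the highest-scoring tag for each word."""
    O = sum(student(token_ids) for student in students)    # (1, seq_len, num_tags)
    best = O.argmax(dim=-1).squeeze(0)                      # best tag index per word
    return [tag_names[i] for i in best.tolist()]

# Example tag set matching the embodiment (hypothetical ordering):
# tag_names = ["PER", "LOC", "ORG", "O"]
```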

The embodiment of the invention provides a compression method for a multilingual BERT sequence labeling model. Mixing multilingual and single-language teacher models enriches the information sources of knowledge distillation, giving the student models a more accurate fitting target; residual training performs ensemble learning over multiple student models and improves their fitting capability.

Those skilled in the art will appreciate that, besides implementing the system and its devices, modules, and units as pure computer-readable program code, the method steps can be logically programmed so that the system and its devices, modules, and units are realized in hardware such as logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers. Therefore, the system and its devices, modules, and units may be regarded as a hardware component; the devices, modules, and units included for realizing the various functions may be regarded as structures within that hardware component; and the means, modules, and units for performing the various functions may also be regarded both as software modules implementing the method and as structures within the hardware component.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
