Language identification method based on multiple tasks

Document No. 1955161 · Published 2021-12-10

Reading note: this technique, "Language identification method based on multiple tasks" (一种基于多任务的语种识别方法), was designed and created by 陈玮, 冯少辉 and 张建业 on 2021-09-29. Its main content is as follows: The invention relates to a multitask-based language identification method comprising the following steps: constructing and training a multilingual recognition model based on multi-task learning, the tasks comprising a language identification task and a valid-speech identification task; acquiring speech data to be recognized and preprocessing it to obtain a first speech queue to be recognized; batching the speech data in the first speech queue to obtain a plurality of batches of second speech queues in which the data within each batch are of similar length; and sequentially importing the batches of second speech queues into the multilingual recognition model, performing language identification and valid-speech identification simultaneously in the model, and outputting recognition results batch by batch. The invention realizes simultaneous recognition of the two tasks of language identification and valid-speech identification, improving both the accuracy and the efficiency of language identification.

1. A multitask-based language identification method, characterized by comprising the following steps:

constructing and training a multilingual recognition model based on multi-task learning, wherein the tasks comprise a language identification task and a valid-speech identification task;

acquiring speech data to be recognized and preprocessing it to obtain a first speech queue to be recognized;

batching the speech data in the first speech queue to obtain a plurality of batches of second speech queues, the data within each batch being of similar length;

and sequentially importing the batches of second speech queues into the multilingual recognition model, performing language identification and valid-speech identification simultaneously in the model, and outputting recognition results batch by batch.

2. The language identification method according to claim 1, wherein the method for preprocessing the speech data to be recognized comprises:

1) resampling the acquired speech data to be recognized so as to unify its sampling rate, encoding, precision and header format;

2) splitting speech data having more than one channel into single-channel speech data;

3) performing voice activity detection on the single-channel speech data: dividing the speech into a plurality of segments of fixed duration, judging for each segment whether it is silent according to the energy of each sub-band within the speech frequency range, removing silent segments and keeping the rest, thereby obtaining speech data with silence removed;

4) cutting the de-silenced speech data into segments whose lengths fall within a fixed range, according to the thresholds for speech cutting;

5) resampling the cut speech segments a second time and outputting them in sequence to obtain the first speech queue.

3. The language identification method according to claim 1, wherein batching the speech data in the first speech queue to obtain a plurality of batches of second speech queues with data of similar length in each batch comprises:

taking the speech data segments out of the first speech queue and sorting them by length to obtain a sorted speech queue;

repeatedly taking speech data segments from one end of the sorted speech queue, and once the total length taken reaches a preset length, forming the taken segments into one batch of the second speech queue;

continuing to take segments from the sorted speech queue and batch them, obtaining a plurality of batches of second speech queues;

and returning any speech data amounting to less than one batch to the first speech queue after batching.

4. The language identification method of claim 1, wherein the multilingual recognition model comprises a feature extraction layer, a context coding layer and an output layer; the feature extraction layer adopts the convolutional network of the wav2vec model and serves as a shallow feature extraction module for extracting frame-level shallow features of the input speech sample; the context coding layer adopts the self-attention-based transformer network of the wav2vec model and is used for extracting the weight and features of each speech frame; the output layer is obtained by adjusting the output layer and loss function of the wav2vec model to the requirement of completing the language identification and valid-data identification tasks simultaneously; the output layer maps the extracted weights and features through a fully connected network to the label dimensions of the language identification task and the data-validity identification task, and outputs the recognition results of both tasks simultaneously.

5. The language identification method as claimed in claim 4, wherein the output Y_i of the output layer for the language task is:

Y_i = argmax(P(X_i, h_i^y))

and the output Z_i of the output layer for the data-validity identification task is:

Z_i = argmax(P(X_i, h_i^z))

wherein P(X_i, h_i^y) is the normalized language-class probability output, P(X_i, h_i^z) is the normalized data-validity probability output, and X_i is the sample data of the i-th sample input to the multilingual recognition model;

h_i^y = (h_i1^y, h_i2^y, …, h_ij^y, …, h_iT^y) are the hidden-layer features obtained for the language task after the convolutional network and the self-attention network;

h_i^z = (h_i1^z, h_i2^z, …, h_ij^z, …, h_iT^z) are the hidden-layer features obtained for the valid-data identification task after the convolutional network and the self-attention network; j = 1, 2, …, T, where T is the number of frames after the convolution operation in the multilingual recognition model.

6. The language identification method of claim 5, wherein a weighting factor α is introduced into the loss function of the multilingual recognition model to balance the outputs of the language identification and valid-data identification tasks, and a weighting factor β is used to scale the losses of the different language categories.

7. The language identification method as claimed in claim 6, wherein the loss function of the multilingual recognition model is L_fine_tuning = (1-α)·L_y + α·L_z,

wherein α is the weighting factor balancing the outputs of the language identification task and the valid-data identification task; L_y is the softmax cross-entropy loss of the language identification task and L_z is the sigmoid cross-entropy loss of the valid-data identification task; y_i and z_i are respectively the true language category and the true valid-speech category, and N is the total number of samples.

8. The language identification method as claimed in claim 6, wherein the loss function of the multilingual recognition model is L_fine_tuning = (1-α)·L_y + α·L_z,

wherein α is the weighting factor balancing the outputs of the language identification task and the valid-data identification task; L_y is the softmax cross-entropy loss of the language identification task and L_z is the sigmoid cross-entropy loss of the valid-data identification task; y_i and z_i are respectively the true language category and the true valid-speech category, and N is the total number of samples;

and wherein β is the weighting factor scaling the losses of the different language categories, and Reject is the label marking invalid data.

9. The language identification method according to any one of claims 1-8, wherein the training process of the multilingual recognition model comprises:

performing a first pre-training of the wav2vec model in an unsupervised pre-training mode;

establishing a training sample set containing multilingual sample data according to the requirements of the language identification and valid-data identification tasks;

performing a second pre-training of the once-pre-trained wav2vec model with the sample data of the training sample set;

adaptively fine-tuning the output layer and loss function of the twice-pre-trained wav2vec model for the requirements of the language identification and valid-data identification tasks, thereby constructing the final multilingual recognition model;

and training the multilingual recognition model again with the sample data of the training sample set, so that it can simultaneously recognize the language and the speech validity of input speech data.

10. The language identification method of claim 9, wherein the data source of the training sample set comprises call-recording speech data; the call-recording speech data comprises valid speech data of a plurality of languages; the valid speech data is processed by data resampling, silence removal and data cutting and then labeled with its language category, yielding valid-speech sample data;

the training sample set further comprises invalid speech data whose audio range and features are distinguishable from those of valid speech data; the invalid speech data is processed by the same data resampling, silence removal and data cutting and then labeled as invalid, yielding invalid-speech sample data.

Technical Field

The invention relates to the technical field of speech recognition, and in particular to a multitask-based language identification method.

Background

In recent years, speech recognition technology has made remarkable progress. Language identification, as one of the key technologies of speech recognition, plays an important role in many fields such as military affairs, medical care and education. In a multilingual speech recognition system, language identification serves as an upstream task and is important for improving both the accuracy of multilingual speech recognition and the user experience of the system.

A traditional language identification method generally comprises three stages: speech-signal feature extraction, language-model construction and a decision rule. Commonly used speech features include MFCC, Fbank, spectrograms and i-vectors; the language model is usually one or more classifiers such as an SVM, a decision tree or a random forest; and the decision rule depends on the chosen classifier, usually computing the probability or confidence of a sample to be classified from prior information in order to predict its language category.

Language identification methods based on deep learning generally require no manual feature extraction: the speech is mapped into a separable vector space by deep models (nonlinear feature extractors such as CNNs, LSTMs and Transformers), an optimization objective is defined, and the model parameters are updated by gradient descent until the predicted classes agree with the true classes. In such methods the quality of the model depends on the quality and quantity of the labeled data, and it remains difficult for the model to extract deep characteristics of the speech, such as the speaker's gender and age or the pronunciation differences between languages.

Disclosure of Invention

In view of the above analysis, the present invention aims to provide a multitask-based language identification method that uses a trained multi-task multilingual recognition model to perform language identification and valid-speech identification on speech data simultaneously.

The technical solution provided by the invention is as follows:

the invention discloses a language identification method based on multitask, which comprises the following steps:

constructing and training a multilingual recognition model based on multi-task learning, wherein the tasks comprise a language identification task and a valid-speech identification task;

acquiring speech data to be recognized and preprocessing it to obtain a first speech queue to be recognized;

batching the speech data in the first speech queue to obtain a plurality of batches of second speech queues, the data within each batch being of similar length;

and sequentially importing the batches of second speech queues into the multilingual recognition model, performing language identification and valid-speech identification simultaneously in the model, and outputting recognition results batch by batch.

Further, the method for preprocessing the speech data to be recognized comprises the following steps:

1) resampling the acquired speech data to be recognized so as to unify its sampling rate, encoding, precision and header format;

2) splitting speech data having more than one channel into single-channel speech data;

3) performing voice activity detection on the single-channel speech data: dividing the speech into a plurality of segments of fixed duration, judging for each segment whether it is silent according to the energy of each sub-band within the speech frequency range, removing silent segments and keeping the rest, thereby obtaining speech data with silence removed;

4) cutting the de-silenced speech data into segments whose lengths fall within a fixed range, according to the thresholds for speech cutting;

5) resampling the cut speech segments a second time and outputting them in sequence to obtain the first speech queue.

Further, batching the speech data in the first speech queue to obtain a plurality of batches of second speech queues with data of similar length in each batch comprises:

taking the speech data segments out of the first speech queue and sorting them by length to obtain a sorted speech queue;

repeatedly taking speech data segments from one end of the sorted speech queue, and once the total length taken reaches a preset length, forming the taken segments into one batch of the second speech queue;

continuing to take segments from the sorted speech queue and batch them, obtaining a plurality of batches of second speech queues;

and returning any speech data amounting to less than one batch to the first speech queue after batching.

Further, the multilingual recognition model comprises a feature extraction layer, a context coding layer and an output layer; the feature extraction layer adopts the convolutional network of the wav2vec model and serves as a shallow feature extraction module for extracting frame-level shallow features of the input speech sample; the context coding layer adopts the self-attention-based transformer network of the wav2vec model and is used for extracting the weight and features of each speech frame; the output layer is obtained by adjusting the output layer and loss function of the wav2vec model to the requirement of completing the language identification and valid-data identification tasks simultaneously; the output layer maps the extracted weights and features through a fully connected network to the label dimensions of the language identification task and the data-validity identification task, and outputs the recognition results of both tasks simultaneously.

Further, the output Y_i of the output layer for the language task is:

Y_i = argmax(P(X_i, h_i^y))

and the output Z_i of the output layer for the data-validity identification task is:

Z_i = argmax(P(X_i, h_i^z))

wherein P(X_i, h_i^y) is the normalized language-class probability output, P(X_i, h_i^z) is the normalized data-validity probability output, and X_i is the sample data of the i-th sample input to the multilingual recognition model;

h_i^y = (h_i1^y, h_i2^y, …, h_ij^y, …, h_iT^y) are the hidden-layer features obtained for the language task after the convolutional network and the self-attention network;

h_i^z = (h_i1^z, h_i2^z, …, h_ij^z, …, h_iT^z) are the hidden-layer features obtained for the valid-data identification task after the convolutional network and the self-attention network; j = 1, 2, …, T, where T is the number of frames after the convolution operation in the multilingual recognition model.

Further, a weighting factor α is introduced into the loss function of the multilingual recognition model to balance the outputs of the language identification and valid-data identification tasks, and a weighting factor β is used to scale the losses of the different language categories.

Further, the loss function of the multilingual recognition model is L_fine_tuning = (1-α)·L_y + α·L_z,

wherein α is the weighting factor balancing the outputs of the language identification task and the valid-data identification task; L_y is the softmax cross-entropy loss of the language identification task and L_z is the sigmoid cross-entropy loss of the valid-data identification task; y_i and z_i are respectively the true language category and the true valid-speech category, and N is the total number of samples.

Further, the loss function of the multilingual recognition model is L_fine_tuning = (1-α)·L_y + α·L_z,

wherein α is the weighting factor balancing the outputs of the language identification task and the valid-data identification task; L_y is the softmax cross-entropy loss of the language identification task and L_z is the sigmoid cross-entropy loss of the valid-data identification task; y_i and z_i are respectively the true language category and the true valid-speech category, and N is the total number of samples;

and wherein β is the weighting factor scaling the losses of the different language categories, and Reject is the label marking invalid data.

Further, the training process of the multilingual recognition model comprises:

performing a first pre-training of the wav2vec model in an unsupervised pre-training mode;

establishing a training sample set containing multilingual sample data according to the requirements of the language identification and valid-data identification tasks;

performing a second pre-training of the once-pre-trained wav2vec model with the sample data of the training sample set;

adaptively fine-tuning the output layer and loss function of the twice-pre-trained wav2vec model for the requirements of the language identification and valid-data identification tasks, thereby constructing the final multilingual recognition model;

and training the multilingual recognition model again with the sample data of the training sample set, so that it can simultaneously recognize the language and the speech validity of input speech data.

Further, the data source of the training sample set comprises call-recording speech data; the call-recording speech data comprises valid speech data of a plurality of languages; the valid speech data is processed by data resampling, silence removal and data cutting and then labeled with its language category, yielding valid-speech sample data;

the training sample set further comprises invalid speech data whose audio range and features are distinguishable from those of valid speech data; the invalid speech data is processed by the same data resampling, silence removal and data cutting and then labeled as invalid, yielding invalid-speech sample data.

The invention can realize at least one of the following beneficial effects:

the multi-language recognition model adopted by the invention directly inputs voice data, can judge the unvoiced sound while recognizing the language, can simultaneously meet two tasks of effective voice detection and language recognition in an actual scene, and saves time and space cost.

After the acquired speech data to be recognized is preprocessed and batched, each batch contains speech segments of similar length and a fixed total length; these are fed into the recognition model together, which improves the recognition efficiency of the model.

The multilingual recognition model is trained three times: the model produced by the two pre-trainings is used to initialize all parameters of the multilingual recognition model except the output layer, so that the fine-tuning task starts from a point closer to the optimum; after the third training, the model performs better on both the valid-speech detection and language identification tasks, with a shorter training time.

Only a small amount of language-labeled data is used in the training sample set, reducing the labor and time costs of acquiring large amounts of labeled data; moreover, labeled invalid speech data is introduced to improve the generalization ability of the model for language identification.

Drawings

The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.

FIG. 1 is a flowchart illustrating a language identification method based on multitasking according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for constructing and training a language identification model according to an embodiment of the present invention;

FIG. 3 is a diagram of a wav2vec pre-training task model structure in an embodiment of the present invention;

FIG. 4 is a diagram of a multi-language identification model according to an embodiment of the present invention.

Detailed Description

The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, which form a part hereof and, together with the embodiments of the invention, serve to explain its principles.

In this embodiment, as shown in FIG. 1, a multitask-based language identification method includes the following steps:

s101, constructing and training a multi-language recognition model based on multi-task learning; the tasks comprise language identification tasks and whether the tasks are effective voice identification tasks or not;

step S102, voice data to be recognized are obtained, and preprocessing is carried out to obtain a first voice queue to be recognized;

step S103, performing batch processing on the voice data in the first voice queue to obtain a plurality of batches of second voice queues with data in the batches of the second voice queues which are similar in length;

and step S104, sequentially importing the second voice queues of multiple batches into the multi-language recognition model, and outputting recognition results batch by batch after simultaneously performing language recognition and whether the speech recognition is effective in the language recognition model.

Specifically, the multilingual recognition model based on multi-task learning in step S101 comprises a feature extraction layer, a context coding layer and an output layer. The feature extraction layer adopts the convolutional network of the wav2vec model and serves as a shallow feature extraction module for extracting frame-level shallow features of the input speech sample; the context coding layer adopts the self-attention-based transformer network of the wav2vec model and is used for extracting the weight and features of each speech frame; the output layer is obtained by adjusting the output layer and loss function of the wav2vec model to the requirement of completing the language identification and valid-data identification tasks simultaneously, and maps the extracted weights and features through a fully connected network to the label dimensions of the two tasks, outputting both recognition results simultaneously.

More specifically, the output Y_i of the output layer for the language task is:

Y_i = argmax(P(X_i, h_i^y))

and the output Z_i of the output layer for the data-validity identification task is:

Z_i = argmax(P(X_i, h_i^z))

wherein P(X_i, h_i^y) is the normalized language-class probability output, P(X_i, h_i^z) is the normalized data-validity probability output, and X_i is the sample data of the i-th sample input to the multilingual recognition model;

h_i^y = (h_i1^y, h_i2^y, …, h_ij^y, …, h_iT^y) are the hidden-layer features obtained for the language task after the convolutional network and the self-attention network;

h_i^z = (h_i1^z, h_i2^z, …, h_ij^z, …, h_iT^z) are the hidden-layer features obtained for the valid-data identification task after the convolutional network and the self-attention network; j = 1, 2, …, T, where T is the number of frames after the convolution operation in the multilingual recognition model.

Preferably, a weighting factor α is introduced into the loss function of the multilingual recognition model to balance the outputs of the language identification and valid-data identification tasks, and a weighting factor β is used to scale the losses of the different language categories.

Specifically, the loss function of the multilingual recognition model is L_fine_tuning = (1-α)·L_y + α·L_z, where L_y is the softmax cross-entropy loss of the language task and L_z is the sigmoid cross-entropy loss of the valid-speech task:

L_y = -(1/N) Σ_{i=1..N} y_i · log P(X_i, h_i^y)

L_z = -(1/N) Σ_{i=1..N} [ z_i · log P(X_i, h_i^z) + (1 - z_i) · log(1 - P(X_i, h_i^z)) ]

and therefore

L_fine_tuning = -((1-α)/N) Σ_{i=1..N} y_i · log P(X_i, h_i^y) - (α/N) Σ_{i=1..N} [ z_i · log P(X_i, h_i^z) + (1 - z_i) · log(1 - P(X_i, h_i^z)) ]

wherein y_i and z_i are respectively the true language category and the true valid-speech category, Y_i and Z_i are respectively the language category and valid-speech category predicted by the model,

and N is the total number of samples.

In practice, the different speech streams are generally routed after language identification to different transcription engines for processing, which demands high accuracy from the language identification;

for this purpose, a factor β is further used to scale the losses of the different language categories: the per-sample language loss is weighted according to its category, the invalid-data category Reject being weighted differently from the genuine language categories.

specifically, the multilingual recognition model of the embodiment is finely adjusted based on the fairseq open source framework, the main structure of the model is still formed by 7 layers of convolution and 12 layers of transformers, and the model parameters are not adjusted.

Specifically, in step S102, the method for preprocessing the speech data to be recognized is as follows:

1) resampling the acquired speech data to be recognized so as to unify its sampling rate, encoding, precision and header format;

Specifically, through the initial resampling, all speech is converted to a format with a sampling rate of 8000 Hz, one channel, PCM encoding and 16-bit precision, and a header is added to each file. The header is the block of data at the beginning of the speech file that describes the payload; it occupies 44 bytes in total and contains format information such as the sampling rate, number of channels and encoding. (A consolidated code sketch of steps 1)-5) follows this list.)

2) splitting speech data having more than one channel into single-channel speech data;

3) performing voice activity detection on the single-channel speech data: dividing the speech into a plurality of segments of fixed duration, judging for each segment whether it is silent according to the energy of each sub-band within the speech frequency range, removing silent segments and keeping the rest, thereby obtaining speech data with silence removed;

specifically, in the embodiment, a webbtc voice endpoint is adopted to detect a voice segment;

firstly, the input voice is segmented at 20ms intervals to obtain a series of voice segments,

secondly, whether the segment is silent or not is detected for each voice segment, if so, the segment is removed, and if not, the segment is kept.

Detection uses the "very aggressive" mode of the WebRTC VAD. The spectrum of an input segment is divided into six sub-bands (80 Hz-250 Hz, 250 Hz-500 Hz, 500 Hz-1 kHz, 1 kHz-2 kHz, 2 kHz-3 kHz and 3 kHz-4 kHz); the energy of each sub-band is computed as its feature, and a Gaussian probability-density model yields, for each of the six sub-bands, the log-likelihood ratio of speech versus silence.

The log-likelihood ratio of each sub-band serves as the condition for a local decision. The sub-bands are preferably weighted according to the spectral range of human voice (roughly 80 Hz to 1 kHz), using (0.25, 0.25, 0.25, 0.08, 0.08, 0.08) as the weights of the six sub-bands; the weighted sum over the six sub-bands is used as the global feature.

When judging whether a segment is silent, the local decision is made first: the log-likelihood ratio of each sub-band is compared against a decision threshold, with separate local and global thresholds of 94 and 1100 respectively. If the log-likelihood ratio of any sub-band exceeds its threshold, the segment is judged to contain speech. If no local test fires, the global decision is made: the weighted sum over the six sub-bands is compared against the global threshold, and the segment is judged to contain speech if the sum exceeds it and to be silent otherwise.

4) cutting the de-silenced speech data into segments whose lengths fall within a fixed range, according to the thresholds for speech cutting;

the threshold value of the voice cutting comprises a minimum length min _ len and a maximum length max _ len; cutting the voice into lengths within a fixed length range according to min _ len and max _ len; filtering short voice by using the minimum voice min _ len so as to remove some noise data in a training set and accelerate the convergence speed of the model; because the model cannot process overlong data, setting max _ len to intercept overlong voice, and accordingly training efficiency is improved. In the invention, min _ len and max _ len are respectively taken as 1 second and 30 seconds, and the scheme is regarded as an experience that voice data below 1 second is difficult to determine the language type due to the limitation of expression content and speaking speed of a speaker, and voice data within 30 seconds is enough to judge the language related to the speaking content.

5) resampling the cut speech segments a second time and outputting them in sequence to obtain the first speech queue.

In the second resampling, the cut speech is uniformly converted to a format with a sampling rate of 16000 Hz and a sampling precision of 16 bits, and used as the sample data for model training and recognition.

The speech data to be recognized is thus processed with two resampling passes. The first resampling uses a lower sampling rate, which unifies the speech formats and reduces the data volume during channel splitting, silence removal and cutting, speeding up processing and lowering the hardware requirements. After these steps, the data is resampled again at a high sampling rate so that the sampling rate and precision of the speech to be recognized meet the requirements of the model.
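A minimal end-to-end sketch of preprocessing steps 1)-5) above, assuming the torchaudio and webrtcvad Python libraries; the patent does not mandate particular libraries, and the function names and normalization details are illustrative:

```python
# Sketch of the preprocessing pipeline: unify the format at 8 kHz / 16-bit
# PCM, split channels, drop silence with the WebRTC VAD in very-aggressive
# mode, cut to 1-30 s segments, and resample the survivors to 16 kHz.
import torch
import torchaudio
import torchaudio.functional as AF
import webrtcvad

def drop_silence(pcm16: bytes, sr: int, frame_ms: int = 20) -> bytes:
    vad = webrtcvad.Vad(3)                       # mode 3 = very aggressive
    n = sr * frame_ms // 1000 * 2                # bytes per 20 ms int16 frame
    frames = (pcm16[i:i + n] for i in range(0, len(pcm16) - n + 1, n))
    return b"".join(f for f in frames if vad.is_speech(f, sr))

def cut(x: torch.Tensor, sr: int, min_len: float = 1.0, max_len: float = 30.0):
    max_n, min_n = int(max_len * sr), int(min_len * sr)
    pieces = [x[i:i + max_n] for i in range(0, len(x), max_n)]
    return [p for p in pieces if len(p) >= min_n]   # filter out short noise

def preprocess(path: str, sr_low: int = 8000, sr_high: int = 16000):
    wav, sr = torchaudio.load(path)              # (channels, samples); step 1)
    wav = AF.resample(wav, sr, sr_low)           # first resampling to 8 kHz
    first_queue = []
    for channel in wav:                          # step 2): one channel at a time
        pcm16 = (channel * 32767).to(torch.int16).numpy().tobytes()
        voiced = drop_silence(pcm16, sr_low)     # step 3): silence removal
        samples = torch.frombuffer(bytearray(voiced), dtype=torch.int16)
        samples = samples.float() / 32767.0
        for seg in cut(samples, sr_low):         # step 4): 1-30 s cutting
            first_queue.append(AF.resample(seg.unsqueeze(0), sr_low, sr_high))
    return first_queue                           # step 5): the first speech queue
```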

Specifically, in step S103, the speech data in the first speech queue is batched to obtain a plurality of batches of second speech queues with data of similar length in each batch, as follows:

1) taking the speech data segments out of the first speech queue and sorting them by length to obtain a sorted speech queue;

2) repeatedly taking speech data segments from one end of the sorted speech queue, and once the total length taken reaches a preset length, forming the taken segments into one batch of the second speech queue;

3) continuing to take segments from the sorted speech queue and batch them, obtaining a plurality of batches of second speech queues;

4) returning any speech data amounting to less than one batch to the first speech queue after batching.

The preprocessed and resampled speech segments that were returned to wait are combined with the next speech data to be recognized to form a new first speech queue. A sketch of this batching scheme follows.
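The per-batch length budget below is the preset length mentioned above, and the name budget_samples is illustrative:

```python
# Sort segments by length, greedily fill a batch until the summed length
# reaches the preset budget, and hand any leftover tail back to queue 1.
def make_batches(first_queue: list, budget_samples: int):
    queue = sorted(first_queue, key=len)         # the sorted speech queue
    batches, current, current_len = [], [], 0
    for seg in queue:                            # take from one end in order
        current.append(seg)
        current_len += len(seg)
        if current_len >= budget_samples:        # one batch of the 2nd queue
            batches.append(current)
            current, current_len = [], 0
    return batches, current   # `current` (< one batch) returns to queue 1
```

Because the segments in a batch are adjacent in the sorted order, their lengths are similar, which keeps padding waste low when a batch is fed to the model.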

This embodiment also discloses the process of constructing and training the multilingual recognition model based on multi-task learning, which, as shown in FIG. 2, comprises the following steps:

Step S201, performing a first pre-training of the wav2vec model in an unsupervised pre-training mode, obtaining a representation of the speech data;

Step S202, establishing a training sample set containing multilingual sample data according to the requirements of the language identification and valid-data identification tasks;

Step S203, performing a second pre-training of the once-pre-trained wav2vec model with the sample data of the training sample set;

Step S204, adaptively fine-tuning the output layer and loss function of the twice-pre-trained wav2vec model for the requirements of the language identification and valid-data identification tasks, thereby constructing the final multilingual recognition model; and training the multilingual recognition model again with the sample data of the training sample set, so that it can simultaneously recognize the language and the speech validity of input speech data.

Specifically, in step S201 this embodiment uses an open-source pre-training model, the wav2vec model, as the initial model, with mask-based unsupervised pre-training: the pre-training task predicts the masked segments, yielding an enhanced representation of the speech context, which is then used as the initial parameters for fine-tuning on the downstream tasks, giving them better performance.

FIG. 3 shows the structure of the wav2vec pre-training model. As can be seen from the figure:

firstly, the original voice is input into the Feature encoder and sequentially passes through convolution of 7 layers for Feature extraction, the output of each layer is used as the input of the next layer, the step length of each layer is (5, 2,2,2,2,2, 2) respectively, and the width of a convolution kernel is (10, 3,3,3,3,2, 2) respectively. For example, a piece of (1, L) -dimensional speech is input, where L is the length of the speech, and a 3-dimensional vector with dimension (1, L/320,512) is generated after feature encoding.

Second, the fixed 512-dimensional speech feature vectors from the previous step are passed through 12 blocks, each a transformer structure with 768 hidden units, in which multi-head self-attention computes the attention weights among the L/320 feature frames to obtain deep contextual features of the whole utterance; after the 12 transformer layers, the speech is encoded as a tensor of dimension (L/320, 1, 768).
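As a check on these dimensions, the following sketch computes the number of encoder output frames from the strides and kernel widths given above; the helper name is illustrative:

```python
# Frame count after the 7 convolutional layers with strides (5,2,2,2,2,2,2)
# and kernel widths (10,3,3,3,3,2,2); the strides multiply to 320, so the
# result is approximately L/320 for an input waveform of length L.
def encoder_frames(L: int) -> int:
    for kernel, stride in zip((10, 3, 3, 3, 3, 2, 2), (5, 2, 2, 2, 2, 2, 2)):
        L = (L - kernel) // stride + 1   # standard conv output-length formula
    return L

print(encoder_frames(16000 * 10))  # 10 s at 16 kHz -> 499 frames (~L/320)
```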

In the pre-training stage, to construct a prediction target, wav2vec masks fixed-length spans of the feature sequence output by the CNN convolution, and the goal of the training task is to predict the masked feature spans. To compute the loss on the masked features, wav2vec introduces a quantization module that discretizes the representation of the context-coded output into vectors close to one-hot, so that the quantized representations can be compared against the ground truth. wav2vec provides both Gumbel-softmax and k-means clustering quantization methods; the pre-training task of the invention uses the former. The loss function of the pre-training task is:

L_pre = L_m + α_1·L_d, where L_m is the contrastive loss for mask prediction, L_d is the diversity loss, and α_1 is set to 0.1. Specifically:

L m c t as the current timetThe output of the transform network of (2),q t is a secret(feature encoder after multilayer convolution) quantized output, and cosine similarity is calculated for the above two outputs, i.e. sim (a, b) = aTb/| a | | | b |. Here, wav2vec introduces a negative sampling technique: at the current momenttThe model is to be included inq t In whichκ+1 quantization candidates identifiedq t WhereinQ t Indicate thisκ+1 of the quantization candidates for the quantization parameter,κthe number of interference items which are uniformly sampled in other shielding items.

L_d = (1/(G·V)) Σ_{g=1..G} Σ_{v=1..V} p̄_{g,v} · log p̄_{g,v}

where G is the number of codebooks, taken as 2; V is the number of entries per codebook, taken as 320; the codebook dimension is 128; and p̄_{g,v} is the average Gumbel-softmax probability of a batch of speech segments on entry v of codebook g. Specifically,

p_{g,v} = exp((l_{g,v} + n_v)/τ) / Σ_{k=1..V} exp((l_{g,k} + n_k)/τ)

where τ is the non-negative Gumbel-softmax temperature, n = -log(-log(u)) with u drawn from the uniform distribution U(0, 1), and l_{g,v} is the logit of each codebook entry, i.e. l ∈ R^(G×V).
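A compact PyTorch sketch of the two pre-training loss terms following the formulas above; tensor shapes and names are illustrative assumptions, not the fairseq implementation:

```python
# L_pre = L_m + 0.1 * L_d, with L_m a contrastive loss over K+1 quantized
# candidates and L_d the (negative-entropy) diversity loss over codebooks.
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, kappa: float = 0.1):
    # c_t: (B, D) transformer outputs at masked steps; q_t: (B, D) quantized
    # targets; distractors: (B, K, D) negatives from other masked positions.
    candidates = torch.cat([q_t.unsqueeze(1), distractors], dim=1)  # (B, K+1, D)
    sims = F.cosine_similarity(c_t.unsqueeze(1), candidates, dim=-1) / kappa
    targets = torch.zeros(sims.size(0), dtype=torch.long)  # true q_t at index 0
    return F.cross_entropy(sims, targets)    # = -log softmax of the true pair

def diversity_loss(avg_probs):
    # avg_probs: (G, V) average Gumbel-softmax probabilities per codebook.
    G, V = avg_probs.shape
    return (avg_probs * torch.log(avg_probs + 1e-7)).sum() / (G * V)
```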

The training data for the first pre-training of the wav2vec model is large-scale speech data; this can be speech data outside the target languages to be recognized, requires no task-specific labeling, and adds no extra workload.

Pre-training proceeds on the processed speech with the optimization goal above, until L_pre falls below a preset maximum or the pre-training learning rate LR_pre falls below a preset minimum value.

Specifically, the process of creating the training sample set containing multilingual sample data in step S202 is as follows:

in the multi-task learning of the present embodiment, the language recognition task and the valid voice recognition task are each formulated as a multi-classification task, i.e., input as voice data, output as a language category and whether the voice is valid voice. Different from a general language identification task, the task of the embodiment is to determine a language category label and predict whether a voice is valid voice, by introducing the valid voice identification task, the premise of judging the language category of the voice is that the voice is necessary to be valid voice data, but an actual intelligent voice identification system is difficult to ensure that all inputs are valid sounds and often contain a large number of invalid sounds, the introduction of the valid voice identification task has actual significance for language identification, and in addition, compared with the valid voice, the invalid sounds have certain difference between the audio frequency range and characteristics and the voice data, so that the generalization capability of the language identification can be improved. Thus, the sample data in the training sample set includes sample data of valid voice and invalid voice data.

The data source of the training sample set is call-recording speech data, which comprises valid speech data of a plurality of languages; the valid speech data is processed by data resampling, silence removal and data cutting to obtain valid-speech sample data, which is then labeled with its language category;

the unvoiced speech data in the training sample set is audio data having a different audio range and characteristics from the voiced speech data, such as noisy data or machine-synthesized speech sounds. And carrying out the same processing including data resampling, mute removal and data cutting on the invalid voice data as the valid voice data to obtain sample data of the invalid voice, and carrying out data invalidity marking. In the present embodiment, the sample data of the null voice is tagged with the tag "Reject", and is represented as the language category of the null voice.

Specifically, the valid speech data covers 15 languages: Russian, Hindi, Bengali, German, Japanese, Chinese, French, Persian, Tamil, Thai, English, Spanish, Vietnamese, Arabic and Korean, plus a portion of invalid speech data. In the data sets of Table 1, it is randomly divided into a training set, a development set and a test set according to the amount of each type of data, in order to cross-validate the performance of the model.
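A hypothetical sketch of the per-language random split; the text does not give the split proportions, so the 80/10/10 ratio below is an assumption:

```python
# Randomly split one language's utterances into train/dev/test subsets.
import random

def split_language(utterances: list, seed: int = 0):
    rng = random.Random(seed)
    utts = utterances[:]
    rng.shuffle(utts)
    n_train = int(0.8 * len(utts))          # assumed proportions: 80/10/10
    n_dev = int(0.1 * len(utts))
    return (utts[:n_train],                 # training set
            utts[n_train:n_train + n_dev],  # development set
            utts[n_train + n_dev:])         # test set
```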

More specifically, to achieve a better training effect and match the trained model, the processing of valid and invalid speech data in this embodiment comprises the following steps:

1) performing a first resampling of the valid and invalid speech data of the various languages to unify the sampling rate, encoding, precision and header format;

2) splitting speech data having more than one channel into single-channel speech data;

3) performing voice activity detection on the single-channel speech data: dividing the speech into a plurality of segments of fixed duration, judging for each segment whether it is silent according to the energy of each sub-band within the speech frequency range, removing silent segments and keeping the rest, thereby obtaining de-silenced speech data;

4) cutting the de-silenced speech data into segments whose lengths fall within a fixed range, according to the thresholds for speech cutting;

5) resampling the cut speech segments a second time to obtain the valid- and invalid-speech sample data used for model training and recognition.

The sample data is processed by the same method as the preprocessing of the speech to be recognized, so that the trained model matches the recognition task better.

Preferably, for a language whose sample data is too scarce, the precision, recall and F1 obtained in training will be poor, i.e. the recognition effect will be poor; to improve the recognition of such a language, this embodiment enlarges the training samples of languages with little speech data through data augmentation.

Specifically, the speech data of a low-resource language is speed-perturbed at each of several set speed multiples to increase its quantity, which alleviates the overfitting caused by sample imbalance and improves the accuracy of the model.

Preferably, speed perturbations of 0.9x, 1.1x and 1.2x are applied using the sox tool, increasing the data volume of the language threefold; the perturbed data is then processed by the same method as the valid speech data, greatly increasing the sample volume of the language and alleviating the overfitting and poor model accuracy caused by sample imbalance. (A sketch of the perturbation follows.)
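A sketch of the perturbation using the sox command-line tool named above; file paths are illustrative:

```python
# Produce 0.9x, 1.1x and 1.2x speed copies of one utterance with sox;
# the `speed` effect changes tempo and pitch together.
import subprocess

def perturb_speed(in_wav: str, out_prefix: str, factors=(0.9, 1.1, 1.2)):
    for f in factors:
        out_wav = f"{out_prefix}_speed{f}.wav"
        subprocess.run(["sox", in_wav, out_wav, "speed", str(f)], check=True)
```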

Specifically, the model structure and loss function are not changed during the second pre-training in step S203; training ends when the loss reaches a relatively balanced state;

preferably, the learning rate of the first pre-training can be adjustedL RpreAdding 0.5, during learning, until lossL preAgain, the value of (A) is less than the preset lossL preThe maximum value ends the pre-training; in the second pre-training process, since the sample data in the training sample set is different from the first pre-training data, a loss function may be caused in the second pre-training processL preBecome larger, but become smaller after several rounds, and the model parameters can be made closer to the characteristics of the data by continuing training, which is beneficial to the fine tuning task and improves the accuracy.

The second pre-training finishes preferably after the iteration exceeds 10000 steps.

Specifically, after the adaptive fine-tuning of step S204, the final multilingual recognition model is constructed, comprising a feature extraction layer, a context coding layer and an output layer. The feature extraction layer adopts the convolutional network of the twice-pre-trained wav2vec model and serves as a shallow feature extraction module for extracting frame-level shallow features of the input speech; the context coding layer uses the self-attention-based transformer network of the twice-pre-trained wav2vec model to extract the weight and features of each speech frame; and, for the language identification and valid-speech identification tasks, the output layer and loss function of the wav2vec model are adjusted so that a fully connected network maps the extracted weights and features to the label dimensions of the two tasks and outputs both recognition results simultaneously.

To train the language identification task and the valid-speech identification task together, the output layer of the model in this embodiment comprises 1 softmax layer and 1 sigmoid layer, predicting the language category and the valid-speech category respectively: for the i-th sample, the model predicts its language category Y_i and its valid-speech category Z_i.

The structure of the fine-tuned model is shown in FIG. 4. The i-th sample of speech data, X_i = (x_i1, x_i2, …, x_iL), is the input, where L is the length of the speech (L is related to the sampling rate; the training samples in this scheme are sampled at 16 kHz). After the convolutional network and the self-attention network, the hidden-layer features h_i = (h_i1, h_i2, …, h_ij, …, h_iT) are obtained, where T is the number of frames after the 7-layer convolution, preferably T = L/320; the improved output layer then yields the language-task output Y_i and the valid-speech-category output Z_i.

In the model, the hidden-layer features for the language task obtained after the convolutional network and the self-attention network are h_i^y = (h_i1^y, h_i2^y, …, h_ij^y, …, h_iT^y), and the hidden-layer features for the valid-data identification task obtained after the convolutional network and the self-attention network are h_i^z = (h_i1^z, h_i2^z, …, h_ij^z, …, h_iT^z), with j = 1, 2, …, T, where T is the number of frames after the convolution operation in the multilingual recognition model.

Output of the language task:

Y_i = argmax(P(X_i, h_i^y))

Output of the valid-data identification task:

Z_i = argmax(P(X_i, h_i^z))

wherein P(X_i, h_i^y) is the normalized language-class probability output, P(X_i, h_i^z) is the normalized data-validity probability output, and X_i is the sample data of the i-th sample input to the multilingual recognition model.

The softmax in this embodiment uses the log_softmax function, i.e. softmax followed by a log, to alleviate the overflow and underflow problems in computing the softmax function.

H_i is computed as H_i = (1/T) Σ_{j=1..T} h_ij; that is, the features of the speech are averaged at the frame level: the features of all frames of an utterance are averaged and used as the input of the probability transformation function, yielding the target output category.

Furthermore, since valid-speech recognition has only the two categories "valid" and "invalid", the sigmoid function is used as the activation function of that output layer. A sketch of this two-headed output layer follows.
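A minimal PyTorch sketch: mean-pool the frame-level features to H_i, then a log-softmax head over the language categories (including Reject) and a sigmoid head for validity. The hidden size 768 follows the description; the class count and names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    def __init__(self, hidden: int = 768, n_languages: int = 16):
        super().__init__()            # 16 = 15 languages + Reject (assumed)
        self.lang_fc = nn.Linear(hidden, n_languages)
        self.valid_fc = nn.Linear(hidden, 1)

    def forward(self, h):                      # h: (B, T, hidden) frame features
        pooled = h.mean(dim=1)                 # H_i = (1/T) * sum_j h_ij
        lang_logp = F.log_softmax(self.lang_fc(pooled), dim=-1)
        valid_p = torch.sigmoid(self.valid_fc(pooled)).squeeze(-1)
        return lang_logp, valid_p              # argmax gives Y_i; threshold gives Z_i
```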

A weighting factor α is introduced into the loss function of the multilingual recognition model to balance the outputs of the language identification and valid-data identification tasks, and a weighting factor β is used to scale the losses of the different language categories.

Specifically, the loss function of the whole fine-tuning task is as follows:

L_fine_tuning = (1-α)·L_y + α·L_z, where L_y is the softmax cross-entropy loss of the language task and L_z is the sigmoid cross-entropy loss of the valid-speech task:

L_y = -(1/N) Σ_{i=1..N} y_i · log P(X_i, h_i^y)

L_z = -(1/N) Σ_{i=1..N} [ z_i · log P(X_i, h_i^z) + (1 - z_i) · log(1 - P(X_i, h_i^z)) ]

and therefore

L_fine_tuning = -((1-α)/N) Σ_{i=1..N} y_i · log P(X_i, h_i^y) - (α/N) Σ_{i=1..N} [ z_i · log P(X_i, h_i^z) + (1 - z_i) · log(1 - P(X_i, h_i^z)) ]

wherein y_i and z_i are respectively the true language category and the true valid-speech category, Y_i and Z_i are respectively the language category and valid-speech category predicted by the model,

and N is the total number of samples.

In practice, the different speech streams are generally routed after language identification to different transcription engines for processing, which demands high accuracy from the language identification;

for this purpose, a factor β is further used to scale the losses of the different language categories: the per-sample language loss is weighted according to its category, the invalid-data category Reject being weighted differently from the genuine language categories.
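A sketch of the fine-tuning loss with the values α = 0.2 and β = 1.5 reported below; since the published β-weighted formula is not reproduced here, its exact placement (a per-class weight on the language loss, with the Reject class left at weight 1) is an assumption:

```python
# L_fine_tuning = (1 - alpha) * L_y + alpha * L_z, with class weights beta
# scaling the genuine language categories in the softmax cross-entropy.
import torch
import torch.nn.functional as F

def fine_tuning_loss(lang_logp, valid_p, y, z,
                     alpha=0.2, beta=1.5, reject_idx=15):
    # lang_logp: (B, C) log-softmax outputs; valid_p: (B,) sigmoid outputs;
    # y: (B,) true language labels; z: (B,) 1.0 = valid speech, 0.0 = invalid.
    class_w = torch.ones(lang_logp.size(1))
    class_w[:reject_idx] = beta            # assumed: scale real languages only
    L_y = F.nll_loss(lang_logp, y, weight=class_w)   # softmax cross-entropy
    L_z = F.binary_cross_entropy(valid_p, z)         # sigmoid cross-entropy
    return (1 - alpha) * L_y + alpha * L_z
```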

specifically, the multilingual recognition model of the embodiment is subjected to fine tuning based on a fair seq open source framework, and in the whole fine tuning process, the main structure of the model is still formed by 7 layers of convolution and 12 layers of transformers, and model parameters are not adjusted.

During training, the training samples are fed into fairseq; the effect is most stable when α and β in the loss function are taken as 0.2 and 1.5 respectively. The change in loss is recorded, and training stops when the magnitude of the change is within 0.001.
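A small sketch of this stopping rule, with illustrative names:

```python
# Stop when the recorded change in loss between consecutive evaluations
# is within 0.001.
def train_until_stable(step_fn, tol: float = 1e-3, max_steps: int = 100_000):
    prev = float("inf")
    for _ in range(max_steps):
        loss = step_fn()               # one training step returning the loss
        if abs(prev - loss) < tol:
            break
        prev = loss
    return loss
```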

The statistical results after language identification training on the data of the 15 languages are shown in Table 1.

The data covers the 15 languages, each randomly divided into a training set, a development set and a test set. In the experiment on data set 1, apart from the languages whose precision, recall and F1 are poor because of scarce data, cross-validation on the other languages achieved good recognition results; in data set 2, speed perturbations of 0.9x, 1.1x and 1.2x were applied in turn to the categories with few samples, and the effect of this data augmentation by speed perturbation is evident, with better precision, recall and F1.

TABLE 1. Language identification implementation results

In summary, the multilingual recognition model adopted in this embodiment takes speech data directly as input and can judge invalid sound while recognizing the language, satisfying the two practical tasks of valid-speech detection and language identification at the same time and saving time and space costs.

After the acquired speech data to be recognized is preprocessed and batched, each batch contains speech segments of similar length and a fixed total length; these are fed into the recognition model together, which improves the recognition efficiency of the model.

The multilingual recognition model is trained three times: the model produced by the two pre-trainings is used to initialize all parameters of the multilingual recognition model except the output layer, so that the fine-tuning task starts from a point closer to the optimum; after the third training, the model performs better on both the valid-speech detection and language identification tasks, with a shorter training time.

Only a small amount of language-labeled data is used in the training sample set, reducing the labor and time costs of acquiring large amounts of labeled data; moreover, labeled invalid speech data is introduced to improve the generalization ability of the model for language identification.

Data augmentation by speed perturbation increases the data volume of languages with few samples, improving the model's training precision, recall and F1 value.

The above description covers only preferred embodiments of the present invention, but the scope of the invention is not limited thereto; any changes or substitutions that can readily be conceived by those skilled in the art within the technical scope disclosed by the invention are included in its scope of protection.
