End-to-end long-term speech recognition method

Document No.: 116973    Published: 2021-10-19

Reading note: this technology, an end-to-end long-term speech recognition method (一种端到端长时语音识别方法), was created by 明悦, 邹俊伟, 温志刚, 李泽瑞 and 吕柏阳 on 2021-06-07. Abstract: The invention provides an end-to-end long-term speech recognition method. The method comprises: selecting a corpus as a training data set, and performing data preprocessing and feature extraction on the speech data in the training data set to generate speech features; constructing an improved RNN-T model that fuses an external language model and a long-term speech recognition algorithm, and inputting the speech features into the improved RNN-T model for training to obtain a trained improved RNN-T model; and taking the trained improved RNN-T model as the teacher model in a mutual learning knowledge distillation algorithm, training the student model with that algorithm, recognizing the long-term speech data to be recognized with the trained and verified student model, and outputting a speech recognition result. By fusing the external language model, the long-term speech recognition algorithm module and the RNN-T model, the invention improves the robustness and generalization capability of the model for long-term speech recognition while optimizing the time and space complexity of the algorithm.

1. An end-to-end long-term speech recognition method, comprising:

selecting a corpus as a training data set, performing data preprocessing and feature extraction on voice data in the training data set to generate voice features, and forming a test and verification data set;

constructing an improved RNN-T model fusing an external language model and a long-term speech recognition algorithm, and inputting the speech features into the improved RNN-T model for training to obtain a trained improved RNN-T model;

taking the trained improved RNN-T model as a teacher model in a mutual learning knowledge distillation algorithm, training a student model with the mutual learning knowledge distillation algorithm, and testing and verifying the student model with the test and verification data set to obtain a trained and verified student model;

and recognizing the long-term voice data to be recognized by utilizing the trained and verified student model, and outputting a voice recognition result.

2. The method of claim 1, wherein the selecting a corpus as a training dataset, performing data preprocessing and feature extraction on speech data in the training dataset to generate speech features, and forming a test and verification dataset comprises:

selecting the AISHELL-1 corpus as the training data set, synthesizing long-term speech data from AISHELL-1 with the SoX audio processing tool, calling the Kaldi toolkit to perform feature extraction on the synthesized long-term speech data, generating the speech features used to verify and test the student network, and forming the test and verification data set from these speech features.

3. The method of claim 1, wherein constructing an improved RNN-T model that fuses an external language model and a long-term speech recognition algorithm comprises:

constructing an improved RNN-T model that performs a speech recognition task, a language modeling task, and a knowledge distillation task that guides the language modeling, wherein in the speech recognition task the speech feature x_t of the training data is input into the encoding network to obtain a high-level representation of the acoustic information, and this high-level representation, the output c_k of the long-term speech recognition algorithm and the output of the prediction network are fused in the joint network to calculate the speech recognition task loss L_{RNN-T};

the language modeling task adds a trained external language model on top of the language modeling that the RNN-T model performs through the prediction network; based on the previous non-blank label y_{u-1}, the trained external language model provides soft labels for the prediction network and guides its language modeling, and the weighted sum of the distillation loss function L_kd and the prediction network language modeling loss function L_LM is used as the optimization function of the prediction network language model.

4. The method of claim 3, wherein the long-term speech recognition algorithm comprises a simulated long-term audio training module and a cross-sentence context module, the simulated long-term audio training module simulates long-term audio training by modifying an initial state of a model hidden layer to realize equivalent long-term speech recognition model training, and the cross-sentence context module is used for retaining cross-sentence historical context information.

5. The method of claim 4, wherein:

the simulated long-term audio training module is specifically used for initializing the hidden states of the model when the k-th sentence is trained: the final hidden state E_{k-1}(T) of the encoding network and the final hidden state P_{k-1}(U) of the prediction network, obtained after training on the (k-1)-th sentence, are used to initialize E_k(0) and P_k(0); a random seed is set during simulated long-term audio training to control the length of the simulated long-term audio; during RNN-T training the LSTM state is carried over with probability P = 0.5, and otherwise the initial LSTM state is set to the zero vector, as shown in formula (2-4):

(E_k(0), P_k(0)) = (E_{k-1}(T), P_{k-1}(U)) with probability P = 0.5, and (0, 0) otherwise    (2-4)

the cross-sentence context module is specifically used for feature-encoding the historical sentence text predicted by the model to obtain text features; an attention mechanism calculates the attention scores α_{u,i} between these text features and the historical context vector c_{k-1}, and the text features and the attention scores are used to obtain the historical context vector c_k of the current sentence, as shown in formulas (2-5) to (2-6);

the whole model then inputs the acoustic features x_t, the text features, the cross-sentence context vector c_k output by the cross-sentence context module, and the prediction network output into the joint network to calculate z_{t,u}, as shown in formula (2-7),

wherein U and V denote projection matrices and ψ_z denotes the bias.

6. The method of claim 3, wherein:

the language modeling task integrates an external language model on the basis of the RNN-T model; the previous non-blank label y_{u-1} predicted by the model is input into the RNN-T prediction network, which generates a high-level representation, and a fully connected layer followed by Softmax classifies this representation and calculates probabilities to perform language modeling; the external language model is a recurrent neural network language model (RNNLM) trained on text data; the previous non-blank label y_{u-1} predicted by the RNN-T prediction network and the history hidden state h_{t-1} are input into the trained RNNLM, and the probability that the RNNLM outputs soft label k is calculated as shown in formula (2-1):

P_k = exp(z_k / T) / Σ_i exp(z_i / T)    (2-1)

wherein z_i denotes the output of the RNNLM model and T denotes the temperature coefficient used for label smoothing;

the knowledge distillation task that guides the language modeling inputs the previous non-blank label y_{u-1} predicted by the RNN-T prediction network into the trained external language model, and uses the KL divergence as the distillation loss L_kd to minimize the difference between the posterior probabilities of the external language model and of the RNN-T prediction network, as shown in formula (2-2);

the weighted sum of the distillation loss function L_kd and the prediction network language modeling loss function L_LM is used as the optimization function of the prediction network language model; in the training stage, the total loss function of the model, shown in formula (2-3), comprises three parts: the RNN-T model loss function L_{RNN-T}, the prediction network language model loss function L_LM, and the external language model distillation loss function L_kd:

L_total = L_{RNN-T} + α((1-β)L_LM + βL_kd)    (2-3)

wherein α and β are the prediction network language model weighting coefficient and the external language model knowledge distillation weighting coefficient, respectively, and are used to balance the scale differences between the loss functions.

7. The method of claim 1, wherein taking the trained improved RNN-T model as a teacher model in a mutual learning knowledge distillation algorithm, training a student model with the mutual learning knowledge distillation algorithm, and testing and verifying the student model with the test and verification data set to obtain a trained and verified student model comprises:

(1) training a teacher model: taking the trained improved RNN-T model as a teacher model in a mutual learning knowledge distillation algorithm, taking the voice characteristics as training data, and performing supervision training on the teacher model by using the training data;

(2) teacher model output acquisition: storing the prediction labels output by the teacher model as pseudo labels, and guiding the student model to learn by using the pseudo labels;

(3) training the student models: the student baseline model is a standard RNN-T model, student model A is an RNN-T model with an added external language model, and student model B is an RNN-T model with an added cross-sentence context algorithm module; the student models are trained under supervision with both the pseudo labels output by the teacher model and the real labels, and the several student models are trained simultaneously from scratch so that they learn from one another during training;

(4) testing and verification: inputting the test and verification data set into the teacher model and the trained student models to obtain the trained and verified student model.

Technical Field

The invention relates to the technical field of voice recognition, in particular to an end-to-end long-term voice recognition method.

Background

Speech is the most direct and effective way to transmit information, and the most important way for people to communicate emotions and ideas. Automatic Speech Recognition (ASR) refers to correctly recognizing a speech signal as the corresponding text or command, so that a machine can understand human language and perform the related operations. With the wide use of computers, ASR has become a key technology for simple and convenient human-computer interaction and a popular research field. With the progress of deep learning and speech recognition technology, end-to-end speech recognition models, by virtue of their strong modeling and learning capability, achieve markedly higher recognition accuracy than traditional speech recognition technology. Unlike a traditional speech recognition system, an end-to-end model removes the need to align and preprocess the speech data, directly learns the mapping between the input speech waveform or features and the output text, and simplifies model training. Speech recognition technology is therefore widely applied in fields such as smart homes, autonomous driving and security monitoring, and has very broad application prospects.

With the rapid development of speech recognition technology, speech recognition must handle a large number of long-term speech scenarios in specific application environments, such as recognition of long speeches and conversations, or uninterrupted long utterances spoken by a user during a telephone interaction with a robot. The current solution for long-utterance recognition is to cut the utterance into fixed-length segments and recognize each segment independently, but words can be split at the segment boundaries, and the original words cannot be effectively recovered from the split segments. Overlapping inference, which decodes the utterance as multiple overlapping segments, solves the word-splitting problem but adds significant computation in the decoding stage. How to construct a speech recognition model better suited to long-term speech, based on the characteristics of the long-term speech recognition task, is therefore an urgent problem.
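The trade-off between fixed-length segmentation and overlapping inference can be illustrated with a small sketch. The frame counts, segment length and overlap below are hypothetical values chosen for illustration, and the helper names are not taken from the patent.

```python
def fixed_segments(num_frames, seg_len):
    """Split [0, num_frames) into back-to-back segments of seg_len frames."""
    return [(s, min(s + seg_len, num_frames)) for s in range(0, num_frames, seg_len)]


def overlapping_segments(num_frames, seg_len, overlap):
    """Split [0, num_frames) into segments that share `overlap` frames with the next one."""
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must satisfy 0 <= overlap < seg_len")
    step = seg_len - overlap
    segments = []
    start = 0
    while True:
        end = min(start + seg_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start += step
    return segments


def decoded_frames(segments):
    """Total number of frames the recognizer has to process."""
    return sum(end - start for start, end in segments)


# Hypothetical utterance of 3000 frames, 800-frame segments, 400-frame overlap.
fixed = fixed_segments(3000, 800)
overlapped = overlapping_segments(3000, 800, 400)
print(len(fixed), decoded_frames(fixed))
print(len(overlapped), decoded_frames(overlapped))
```

With overlap, every boundary of one segment falls inside a neighboring segment, so a word split at one boundary is seen whole elsewhere, but the recognizer processes more frames in total than the utterance contains.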

Research on speech recognition technology began in the 1950s. The automatic digit recognizer developed at Bell Laboratories marked the true beginning of this research and set off a wave of interest in speech recognition. The development of speech recognition can be divided into three stages: traditional speech recognition, deep-learning-based speech recognition, and end-to-end speech recognition. Traditional speech recognition used the GMM-HMM model as its main framework and achieved remarkable results. In the 21st century, the development of deep learning greatly advanced speech recognition and substantially improved recognition accuracy, and deep-learning-based speech recognition developed rapidly. An end-to-end model can directly establish the mapping between the input speech waveform and the output text through a single neural network, without training each module of the system separately, which simplifies the speech recognition pipeline. At present there are three main ways to implement an end-to-end model: Connectionist Temporal Classification (CTC), the attention-based encoder-decoder model (Attention-based Model), and the Recurrent Neural Network Transducer (RNN-T). The RNN-T model was designed to address the shortcomings of the CTC model: it takes both acoustic and linguistic information into account and removes the output independence assumption of CTC. It is one of the key parts that the invention studies and improves.

In recent years, deep learning has received great attention and has been successfully applied to speech recognition, computer vision and other fields. As deep learning has developed, deep neural network designs have become increasingly complex, and model compression techniques have developed rapidly. Effectively reducing the computation and storage of a deep neural network while preserving its performance has become a research hotspot. Current model compression techniques mainly include network pruning, quantization, low-rank decomposition, compact network design, and Knowledge Distillation (KD).

Among these, knowledge distillation can compress a deeper model into a shallower one: by imitating the output of a large model, it transfers the useful knowledge extracted by a complex large model to a simple small model, which helps reduce the time and space complexity of the model. It is also one of the key parts that the invention studies and improves.

Among existing speech recognition methods, the RNN-T model addresses the defect of CTC that the output of the current frame is assumed to be conditionally independent of the previous outputs: by introducing a prediction network, it compensates for the lack of language modeling capability caused by the CTC conditional independence assumption. The RNN-T model integrates linguistic and acoustic information and optimizes them jointly; its structure is shown in FIG. 1. The RNN-T model mainly consists of an encoding network (Encoder Network), a prediction network (Prediction Network) and a joint network (Joint Network).

The above-mentioned voice recognition method in the prior art has the following disadvantages:

(1) The RNN-T model is difficult to train. RNN-T training needs a large amount of speech-text data for the model to converge and reach good recognition performance. In practice, building a speech-text corpus is costly, so labeled data is scarce. Because linguistic knowledge is lacking during training, the language modeling capability is insufficient (that is, the prediction network is insufficiently trained), which makes the RNN-T model difficult to train.

(2) Long-term speech recognition is not robust. Speech recognition must handle many long-term speech scenarios, such as speeches lasting more than half a minute and conversation-level recognition. Because of the limits of computing hardware, the RNN-T model is usually trained at the sentence level, with long utterances cut into short audio segments so that training is computationally feasible. This, however, causes a mismatch between training data and test data when long-term speech is recognized: a model trained on short sentence-level data is not robust on long-term speech, and its recognition performance drops sharply. Recognition of long speeches and conversations therefore remains a challenge.

As mentioned above, knowledge distillation is a widely used model compression method based on teacher-student training. Its core idea is knowledge transfer: a pre-trained large network serves as the teacher model and guides a faster network with fewer parameters, the student model; the soft labels (Soft Targets) output by the teacher model are passed to the student model as knowledge, which improves the performance of the student model. FIG. 2 is a schematic diagram of a prior-art implementation of frame-level and sequence-level knowledge distillation algorithms.

The disadvantages of the prior-art model compression methods are as follows: the number of model parameters is large and the computational complexity is high. To further improve recognition performance, existing algorithms keep increasing the number of network layers and designing more complex network structures. While these strategies improve recognition performance, they lead to large parameter counts and high computational complexity, so the models are computationally inefficient and hard to use in environments with strict real-time requirements. For the heavy computation and parameter redundancy of speech recognition models, sequence-level knowledge distillation plays an important role in removing redundant parameters, reducing the parameter count, and reducing the time and space complexity of the model. However, the effect of sequence-level knowledge distillation is easily influenced by factors such as parameter settings and model initialization, and the generalization capability of the resulting model is poor.

Disclosure of Invention

The embodiment of the invention provides an end-to-end long-term voice recognition method, which is used for effectively recognizing end-to-end long-term voice data.

In order to achieve the purpose, the invention adopts the following technical scheme.

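An end-to-end long-term speech recognition method, comprising:

selecting a corpus as a training data set, performing data preprocessing and feature extraction on the speech data in the training data set to generate speech features, and forming a test and verification data set;

constructing an improved RNN-T model that fuses an external language model and a long-term speech recognition algorithm, and inputting the speech features into the improved RNN-T model for training to obtain a trained improved RNN-T model;

taking the trained improved RNN-T model as a teacher model in a mutual learning knowledge distillation algorithm, training a student model with the mutual learning knowledge distillation algorithm, and testing and verifying the student model with the test and verification data set to obtain a trained and verified student model;

and recognizing the long-term speech data to be recognized with the trained and verified student model, and outputting a speech recognition result.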

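Preferably, selecting a corpus as a training data set, performing data preprocessing and feature extraction on the speech data in the training data set to generate speech features, and forming a test and verification data set comprises:

selecting the AISHELL-1 corpus as the training data set, synthesizing long-term speech data from AISHELL-1 with the SoX audio processing tool, calling the Kaldi toolkit to perform feature extraction on the synthesized long-term speech data, generating the speech features used to verify and test the student network, and forming the test and verification data set from these speech features.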

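Preferably, constructing the improved RNN-T model that fuses the external language model and the long-term speech recognition algorithm comprises:

constructing an improved RNN-T model that performs a speech recognition task, a language modeling task, and a knowledge distillation task that guides the language modeling, wherein in the speech recognition task the speech feature x_t of the training data is input into the encoding network to obtain a high-level representation of the acoustic information, and this high-level representation, the output c_k of the long-term speech recognition algorithm and the output of the prediction network are fused in the joint network to calculate the speech recognition task loss L_{RNN-T};

the language modeling task adds a trained external language model on top of the language modeling that the RNN-T model performs through the prediction network; based on the previous non-blank label y_{u-1}, the trained external language model provides soft labels for the prediction network and guides its language modeling, and the weighted sum of the distillation loss function L_kd and the prediction network language modeling loss function L_LM is used as the optimization function of the prediction network language model.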

Preferably, the long-term speech recognition algorithm includes a simulated long-term audio training module and a cross-sentence context module, the simulated long-term audio training module simulates long-term audio training by modifying the initial state of the model hidden layer to realize equivalent long-term speech recognition model training, and the cross-sentence context module is used for retaining cross-sentence historical context information.

Preferably, the simulated long-term audio training module is specifically used for initializing the hidden states of the model when the k-th sentence is trained: the final hidden state E_{k-1}(T) of the encoding network and the final hidden state P_{k-1}(U) of the prediction network, obtained after training on the (k-1)-th sentence, are used to initialize E_k(0) and P_k(0). A random seed is set during simulated long-term audio training to control the length of the simulated long-term audio. During RNN-T training the LSTM state is carried over with probability P = 0.5, and otherwise the initial LSTM state is set to the zero vector, as shown in formula (2-4):

(E_k(0), P_k(0)) = (E_{k-1}(T), P_{k-1}(U)) with probability P = 0.5, and (0, 0) otherwise    (2-4)
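The state carry-over rule of formula (2-4) can be sketched in plain Python. This is a minimal illustration, not the patented implementation: the function names, the list-based state vectors, the use of a single random draw for both networks, and the stand-in `run_sentence` step are all assumptions made for the example.

```python
import random


def init_states(prev_enc_final, prev_pred_final, hidden_size, rng, carry_prob=0.5):
    """Choose E_k(0) and P_k(0) for sentence k.

    With probability carry_prob, reuse E_{k-1}(T) and P_{k-1}(U); otherwise,
    or when there is no previous sentence, start from zero vectors.
    Returns (enc_init, pred_init, carried).
    """
    has_prev = prev_enc_final is not None and prev_pred_final is not None
    if has_prev and rng.random() < carry_prob:
        return list(prev_enc_final), list(prev_pred_final), True
    return [0.0] * hidden_size, [0.0] * hidden_size, False


def run_sentence(enc_init, pred_init, k):
    """Hypothetical stand-in for training on sentence k; returns fake final states."""
    return [v + k for v in enc_init], [v - k for v in pred_init]


rng = random.Random(1234)  # fixed seed: the simulated long-audio lengths are reproducible
hidden_size = 4
enc_final, pred_final = None, None
run_lengths, current = [], 1
for k in range(1, 21):
    enc0, pred0, carried = init_states(enc_final, pred_final, hidden_size, rng)
    if carried:
        current += 1            # sentence k extends the current simulated long audio
    elif k > 1:
        run_lengths.append(current)
        current = 1             # sentence k starts a new simulated long audio
    enc_final, pred_final = run_sentence(enc0, pred0, k)
run_lengths.append(current)
print(run_lengths)  # lengths, in sentences, of the simulated long recordings
```

Each run of consecutive carried states behaves like one long recording, so the seed and the carry probability together determine how long the simulated recordings are.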

The cross-sentence context module is specifically used for feature-encoding the historical sentence text predicted by the model to obtain text features; an attention mechanism calculates the attention scores α_{u,i} between these text features and the historical context vector c_{k-1}, and the text features and the attention scores are used to obtain the historical context vector c_k of the current sentence, as shown in formulas (2-5) to (2-6).

The whole model then inputs the acoustic features x_t, the text features, the cross-sentence context vector c_k output by the cross-sentence context module, and the prediction network output into the joint network to calculate z_{t,u}, as shown in formula (2-7),

wherein U and V denote projection matrices and ψ_z denotes the bias.
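Formulas (2-5) to (2-7) are not reproduced in this text, so the following is only a sketch of one way such a module could be wired. It assumes dot-product attention with c_{k-1} as the query, a weighted sum of the text features as c_k, and a joint network of the form tanh(U·[h_enc; c_k] + V·h_pred + bias). The scoring function, the concatenation, the tanh activation and all names and toy values are assumptions, not the patent's definitions.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def update_context(history_feats, prev_context):
    """Attend over encoded history tokens, using c_{k-1} as the query (assumed scoring)."""
    weights = softmax([dot(h, prev_context) for h in history_feats])
    dim = len(prev_context)
    context = [sum(w * h[d] for w, h in zip(weights, history_feats)) for d in range(dim)]
    return context, weights


def joint(h_enc, c_k, h_pred, U, V, bias):
    """Fuse encoder output, context vector and prediction output (assumed form)."""
    fused = h_enc + c_k                      # list concatenation [h_enc; c_k]
    a = matvec(U, fused)
    b = matvec(V, h_pred)
    return [math.tanh(x + y + z) for x, y, z in zip(a, b, bias)]


# Toy values, chosen only so the shapes line up.
history_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # encoded history tokens
c_prev = [0.5, -0.5]                                   # c_{k-1}
c_k, alpha = update_context(history_feats, c_prev)

h_enc = [0.2, -0.1]
h_pred = [0.3, 0.4, 0.5]
U = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.1], [0.5, 0.5, -0.5, -0.5]]
V = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
b_z = [0.0, 0.1, -0.1]
z = joint(h_enc, c_k, h_pred, U, V, b_z)
print(alpha, c_k, z)
```

For the first sentence of a recording, a zero vector for c_{k-1} gives uniform attention weights in this sketch, which is one simple way to handle the missing history.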

Preferably, the language modeling task integrates an external language model on the basis of the RNN-T model. The previous non-blank label y_{u-1} predicted by the model is input into the RNN-T prediction network, which generates a high-level representation, and a fully connected layer followed by Softmax classifies this representation and calculates probabilities to perform language modeling. The external language model is a recurrent neural network language model (RNNLM) trained on text data. The previous non-blank label y_{u-1} predicted by the RNN-T prediction network and the history hidden state h_{t-1} are input into the trained RNNLM, and the probability that the RNNLM outputs soft label k is calculated as shown in formula (2-1):

P_k = exp(z_k / T) / Σ_i exp(z_i / T)    (2-1)

wherein z_i denotes the output of the RNNLM model and T denotes the temperature coefficient used for label smoothing;

the knowledge distillation task that guides the language modeling inputs the previous non-blank label y_{u-1} predicted by the RNN-T prediction network into the trained external language model, and uses the KL divergence as the distillation loss L_kd to minimize the difference between the posterior probabilities of the external language model and of the RNN-T prediction network, as shown in formula (2-2);

the weighted sum of the distillation loss function L_kd and the prediction network language modeling loss function L_LM is used as the optimization function of the prediction network language model. In the training stage, the total loss function of the model, shown in formula (2-3), comprises three parts: the RNN-T model loss function L_{RNN-T}, the prediction network language model loss function L_LM, and the external language model distillation loss function L_kd:

L_total = L_{RNN-T} + α((1-β)L_LM + βL_kd)    (2-3)

wherein α and β are the prediction network language model weighting coefficient and the external language model knowledge distillation weighting coefficient, respectively, and are used to balance the scale differences between the loss functions.
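The loss combination in formula (2-3) can be sketched for a single prediction step as follows. The temperature softmax and the KL divergence are the standard definitions. Applying the same temperature to the prediction network output, the toy logits, the placeholder RNN-T loss value and the choices α = 0.5, β = 0.3 are assumptions made for the example.

```python
import math


def softmax_t(logits, temperature=1.0):
    """Softmax with temperature: higher T gives a flatter (softer) distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(probs, label, eps=1e-12):
    return -math.log(probs[label] + eps)


def total_loss(l_rnnt, l_lm, l_kd, alpha, beta):
    """L_total = L_RNN-T + alpha * ((1 - beta) * L_LM + beta * L_kd)."""
    return l_rnnt + alpha * ((1.0 - beta) * l_lm + beta * l_kd)


# Toy logits over three labels for one step u (hypothetical values).
rnnlm_logits = [2.0, 1.0, 0.1]     # external language model output z_i
pred_logits = [1.5, 1.2, 0.3]      # prediction network language-model head
T = 2.0

soft_labels = softmax_t(rnnlm_logits, T)                 # soft labels from the RNNLM
l_kd = kl_divergence(soft_labels, softmax_t(pred_logits, T))
l_lm = cross_entropy(softmax_t(pred_logits), label=0)    # true next label assumed to be 0
l_rnnt = 12.5                                            # placeholder transducer loss
print(total_loss(l_rnnt, l_lm, l_kd, alpha=0.5, beta=0.3))
```

Setting β = 0 in this sketch removes the external language model from training, and α = 0 reduces the objective to the plain RNN-T loss, which makes the two coefficients easy to ablate.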

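Preferably, taking the trained improved RNN-T model as a teacher model in the mutual learning knowledge distillation algorithm, training a student model with the mutual learning knowledge distillation algorithm, and testing and verifying the student model with the test and verification data set to obtain the trained and verified student model comprises:

(1) training the teacher model: taking the trained improved RNN-T model as the teacher model in the mutual learning knowledge distillation algorithm, taking the speech features as training data, and training the teacher model under supervision with the training data;

(2) obtaining the teacher model output: storing the predicted labels output by the teacher model as pseudo labels, and using the pseudo labels to guide the learning of the student models;

(3) training the student models: the student baseline model is a standard RNN-T model, student model A is an RNN-T model with an added external language model, and student model B is an RNN-T model with an added cross-sentence context algorithm module; the student models are trained under supervision with both the pseudo labels output by the teacher model and the real labels, and the several student models are trained simultaneously from scratch so that they learn from one another during training;

(4) testing and verification: inputting the test and verification data set into the teacher model and the trained student models to obtain the trained and verified student model.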
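Step (3) can be illustrated with a per-token sketch of one student's loss. The text does not give the exact objective, so this sketch assumes a common mutual learning form: cross-entropy on the real label, cross-entropy on the teacher's pseudo label, and the average KL divergence from each peer student to the current student. The weights, the toy distributions, and the reduction of the sequence-level RNN-T loss to a single token are all simplifying assumptions.

```python
import math


def cross_entropy(probs, label, eps=1e-12):
    return -math.log(probs[label] + eps)


def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def student_loss(own, peers, true_label, pseudo_label, w_pseudo=0.5, w_mutual=0.5):
    """Loss of one student given its own and its peers' output distributions."""
    supervised = cross_entropy(own, true_label)        # real label
    from_teacher = cross_entropy(own, pseudo_label)    # teacher pseudo label
    mutual = sum(kl_divergence(p, own) for p in peers) / len(peers) if peers else 0.0
    return supervised + w_pseudo * from_teacher + w_mutual * mutual


# Toy output distributions over three labels for the three students.
students = {
    "baseline": [0.6, 0.3, 0.1],   # standard RNN-T
    "A": [0.7, 0.2, 0.1],          # RNN-T with external language model
    "B": [0.5, 0.4, 0.1],          # RNN-T with cross-sentence context module
}
true_label, pseudo_label = 0, 0

losses = {}
for name, own in students.items():
    peers = [p for other, p in students.items() if other != name]
    losses[name] = student_loss(own, peers, true_label, pseudo_label)
print(losses)
```

In a training loop, all students would be updated in the same iteration from their own losses, so each one is pulled toward the labels and toward the current predictions of the others.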

According to the technical scheme provided by the embodiment of the invention, the external language model is fused with the RNN-T model so that language modeling is performed directly on the prediction network inside the model; a language-model auxiliary task is added, and the trained external language model provides soft labels for the prediction network language model and guides its language modeling, realizing joint optimization of acoustic and linguistic information. The invention also adds a long-term speech recognition algorithm module to the RNN-T model: the cross-sentence context module stores context information between sentences, provides cross-sentence historical context information to the RNN-T model, and addresses the problem of establishing long-term dependencies in long-term speech recognition.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a diagram of an RNN-Transducer model according to the prior art;

FIG. 2 is a schematic diagram of a prior-art implementation of frame-level and sequence-level knowledge distillation algorithms;

FIG. 3 is a schematic diagram of an implementation of an end-to-end long-term speech recognition method according to an embodiment of the present invention;

FIG. 4 is a block diagram of an improved RNN-T model according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating the overall architecture of a long-term speech recognition algorithm module according to an embodiment of the present invention;

FIG. 6 is a block diagram of an RNN-T model with an added cross-sentence context module according to an embodiment of the present invention;

FIG. 7 is a structural diagram of a knowledge distillation framework based on mutual learning according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.

The invention mainly aims at the problems of insufficient language modeling capability, poor robustness in a long-term speech recognition scene, large model parameter quantity, high calculation complexity and the like of the existing end-to-end speech recognition algorithm, improves the defects and the deficiencies of the existing speech recognition algorithm and the existing model compression algorithm in the long-term audio scene, and promotes the wider application of the existing end-to-end speech recognition algorithm in practice. On the basis of improving the language modeling capability of the end-to-end speech recognition model, the generalization capability and robustness of the model to a long-term speech recognition scene are further improved, and the compression and acceleration of the algorithm model are realized through a knowledge distillation technology so as to achieve an ideal recognition effect.

On the basis of an analysis of the current state of research on speech recognition and its open problems, the embodiment of the invention studies end-to-end speech recognition algorithms and provides an improved RNN-T speech recognition model. The improved RNN-T model integrates an external language model and a long-term speech recognition algorithm, and alleviates the drop in recognition accuracy in long-utterance recognition scenarios. An implementation schematic of the end-to-end long-term speech recognition method provided by the embodiment of the invention is shown in FIG. 3. First, speech-text pairs are input, and data preprocessing yields the Fbank features of the training data set and of the test/verification data sets. Second, the Fbank features of the training data set are input into the improved RNN-T model to train the speech recognition model, giving the trained improved RNN-T model. Third, the trained improved RNN-T model is used as the teacher model in the mutual learning knowledge distillation algorithm to train the student models, giving the trained student models. Finally, the verification set and the test set are input into the teacher model and the trained student models to obtain the verified and tested student model. The teacher model uses, as its baseline, an RNN-Transducer model combined with an external language model and the cross-sentence context algorithm. The student baseline model is a standard RNN-Transducer model, and student model A and student model B are RNN-Transducer models with an added external language model and an added cross-sentence context module, respectively.

By combining the improved RNN-T model with the mutual learning knowledge distillation algorithm, the resulting student model removes redundant parameters while limiting the loss of accuracy, which reduces the parameter count and the time and space complexity of the model, so that the model achieves a better balance among size, computation and recognition performance.

The data preprocessing of the speech-text pairs in the embodiment of the invention comprises two steps: generating the training data set features and generating the verification/test data set features. The AISHELL-1 corpus serves as the training data set; the Kaldi toolkit is called to extract features from the AISHELL-1 speech data, generating the speech features used to train the teacher network and the student networks. For the verification/test data sets, the SoX (Sound eXchange) audio processing tool is used to synthesize long-term speech data from the AISHELL-1 and aidatatang_200zh Mandarin Chinese corpora, and the Kaldi toolkit is called to extract features from the synthesized long-term speech data, generating the speech features used to verify and test the student networks.
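As an illustration of how long-term utterances might be synthesized from short ones, the following minimal sketch greedily concatenates consecutive short utterances until a target length is reached. It is not the SoX-based procedure of the embodiment; the function name `synthesize_long_utterances`, the in-memory sample lists, and the `target_len` threshold are all hypothetical and serve only to show the grouping logic.

```python
from typing import List, Tuple

# (waveform samples, transcript); a hypothetical in-memory representation
Utterance = Tuple[List[float], str]


def synthesize_long_utterances(utts: List[Utterance], target_len: int) -> List[Utterance]:
    """Greedily concatenate consecutive utterances until each synthetic
    item contains at least `target_len` samples (the last one may be shorter)."""
    out: List[Utterance] = []
    wave: List[float] = []
    text: List[str] = []
    for samples, transcript in utts:
        wave.extend(samples)
        text.append(transcript)
        if len(wave) >= target_len:
            # Chinese transcripts are joined without a separator
            out.append((wave, "".join(text)))
            wave, text = [], []
    if wave:  # keep the remainder as a final, shorter item
        out.append((wave, "".join(text)))
    return out
```

Joining transcripts without a separator suits character-level Chinese labels; a word-level corpus would need a space between transcripts.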

An improved RNN-T model framework structure provided by an embodiment of the present invention is shown in fig. 4. The improvement mainly comprises two parts:

1: In the speech recognition task, the output of the encoding network and the output c_k of the long-term speech recognition algorithm module are input into the joint network together to obtain L_RNN-T. This adds cross-sentence context information to the speech recognition task and improves the robustness of the RNN-T model in long-term speech recognition;

2: An external language model is added. The trained external language model provides soft labels for the prediction network language model and guides the language modeling of the prediction network, which improves the language modeling capability of the RNN-T model.

The improved RNN-T model fuses an external language model and a long-term speech recognition algorithm into the original RNN-T model. As shown in fig. 4, the improved RNN-T model contains three branches, used respectively for the speech recognition task, the language modeling task, and the knowledge distillation task that guides language modeling. In the speech recognition task, the speech feature x_t of the training data is input into the encoding network to obtain a high-level representation of the acoustic information. This representation, the output c_k of the long-term speech recognition algorithm, and the output of the prediction network are then fused in the joint network, and the speech recognition task loss L_RNN-T is computed. The joint network in the RNN-T model is a feedforward neural network, and the fusion of the encoding network output with c_k may be either concatenation or direct addition.

Fig. 5 is an overall architecture diagram of the long-term speech recognition algorithm module according to an embodiment of the present invention. To address the poor robustness of RNN-T in long-term speech recognition scenarios, the invention provides a long-term speech recognition algorithm comprising simulated long-term audio training and a cross-sentence context module. The algorithm has two goals: obtaining the historical context information c_k by adding the cross-sentence context module, and simulating long-term audio training by modifying the initial hidden-layer state of the model. Simulated long-term audio training makes training on short sentences equivalent to training a long-term speech recognition model, which improves the robustness of a model trained on short sentences when it recognizes long sentences.

Simulated long-term audio training imitates long-term data by carrying over the internal recurrent state of the model. When training of the whole RNN-T model reaches the k-th sentence, the hidden-layer states are initialized from the previous sentence: the final hidden-layer states E_{k-1}(T) and P_{k-1}(U) of the encoding network and the prediction network after training on the (k-1)-th sentence are used to initialize E_k(0) and P_k(0). To prevent training on simulated long-term audio from degrading the model's generalization on normal short sentences, a random seed is set during simulated long-term audio training to control the length of the simulated long sentences. During RNN-T training, the LSTM state is carried over with probability P = 0.5; otherwise the initial LSTM state is set to the zero vector, as shown in formula (2-4):
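One way to write the state initialization described above is given below. This is a reconstruction from the surrounding text rather than the original typeset formula; the case layout and the lower-case symbol p for the carry-over probability are assumptions.

```latex
\bigl(E_k(0),\; P_k(0)\bigr) =
\begin{cases}
\bigl(E_{k-1}(T),\; P_{k-1}(U)\bigr), & \text{with probability } p = 0.5 \\[4pt]
(\mathbf{0},\; \mathbf{0}), & \text{otherwise}
\end{cases}
\tag{2-4}
```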

Fig. 6 is a structure diagram of an RNN-T model with the added cross-sentence context module according to an embodiment of the present invention. The cross-sentence context module retains cross-sentence historical context information, so that the model can better learn conversation-level context and the performance of long-term speech recognition improves. The module first feeds the historical sentence text predicted by the model into the cross-sentence context module for feature encoding, obtaining text features. An attention mechanism then computes the attention scores α_{u,i} from these text features and the historical context vector c_{k-1}. Finally, the text features are weighted by the attention scores to obtain the historical context vector c_k of the current sentence. The calculation is shown in formulas (2-5) to (2-6):

where c_k represents the output of the cross-sentence context module. The acoustic features, the cross-sentence context vector c_k, and the output of the prediction network are then input into the joint network to compute z_{t,u}, as shown in formula (2-7):

where U and V denote projection matrices and ψ_z denotes the bias.
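The cross-sentence context and joint-network computations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulas of the embodiment: dot-product scoring for the attention, the tanh activation, and all function names (`context_vector`, `joint`, `matvec`, `softmax`) are hypothetical choices. Only the roles of U, V, the bias, and the two fusion options (concatenation or addition) come from the text.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(text_feats: Sequence[Vector], c_prev: Vector) -> Vector:
    """Attend over encoded history-text features with c_{k-1} as the query.
    Dot-product scoring is an assumption, not taken from the source."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, c_prev)) for h in text_feats]
    alpha = softmax(scores)  # attention weights over history tokens
    dim = len(c_prev)
    return [sum(a * h[d] for a, h in zip(alpha, text_feats)) for d in range(dim)]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def joint(h_enc: Vector, c_k: Vector, h_pred: Vector,
          U: Matrix, V: Matrix, bias: Vector, fuse: str = "concat") -> Vector:
    """Fuse encoder output with c_k, project both branches, add the bias.
    The tanh non-linearity is an assumption."""
    if fuse == "concat":
        acoustic = list(h_enc) + list(c_k)  # U needs len(h_enc) + len(c_k) columns
    elif fuse == "add":
        acoustic = [a + c for a, c in zip(h_enc, c_k)]  # requires equal dimensions
    else:
        raise ValueError("fuse must be 'concat' or 'add'")
    ua = matvec(U, acoustic)
    vp = matvec(V, h_pred)
    return [math.tanh(a + p + b) for a, p, b in zip(ua, vp, bias)]
```

Concatenation keeps the acoustic and context information in separate dimensions at the cost of a wider U, whereas addition keeps U small but requires c_k to have the same dimension as the encoder output.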

The language modeling task and the knowledge distillation task of the improved RNN-T model use an added external language model to guide the prediction network of the RNN-T model in language modeling, which makes the prediction network easier to train and lets the model learn language knowledge better. In the language modeling task, the prediction network performs language modeling directly: the previous non-blank label y_{u-1} predicted by the model is input into the prediction network, and a fully connected layer followed by a softmax layer then performs classification and probability calculation. In parallel, in the knowledge distillation task that guides language modeling, y_{u-1} is input into a trained external language model, which provides soft labels for the prediction network language model and guides its language modeling. Finally, the three loss functions are combined by weighting to train and optimize the model as a whole.

The invention fuses the external language model during training of the original RNN-T model and passes language information to the prediction network language model, which realizes the fusion of the external language model with the RNN-T model and yields L_{total_LM}. The external language model provides soft labels for the prediction network language model, and the knowledge distillation loss is computed from them. Specifically, the posterior probability P_LM, which carries the language information of the external language model, is fed into a loss function that computes the KL divergence. Language information here means the probability of the classification result obtained by inputting the previous non-blank label y_{u-1} into the external language model or the prediction network and classifying the network output.

L_RNN-T, obtained by inputting the encoding network output and the long-term speech recognition algorithm module output c_k into the joint network, and L_{total_LM} together form the objective function L_total, with which the improved RNN-T model is trained as a whole. This improves the language modeling capability of the RNN-T model, its ability to correct homophone errors, and its accuracy in long-term speech recognition.

The specific steps for integrating the external language model into the RNN-T model are as follows. First, the previous non-blank label y_{u-1} predicted by the model is input into the RNN-T prediction network, which generates a high-level representation. A fully connected layer and Softmax then classify this representation and compute probabilities, so that the prediction network performs language modeling; this prediction network language modeling is added as an auxiliary task to help training. For the external language model, a Recurrent Neural Network Language Model (RNNLM) is first trained on large-scale text data. The previous non-blank label y_{u-1} predicted by the model and the history hidden state h_{t-1} (which the trained RNNLM maintains itself) are input into the trained RNNLM, and the probability that the RNNLM outputs soft label k is computed as shown in formula (2-1):
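The soft-label probability described here is the standard temperature-scaled softmax, which can be written as below. This is a reconstruction from the symbols the text defines (z_i, T, P_LM) rather than the original typeset formula; the summation index is an assumption.

```latex
P_{LM}(k) = \frac{\exp\left(z_k / T\right)}{\sum_{i} \exp\left(z_i / T\right)}
\tag{2-1}
```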

where z_i represents the output of the RNNLM model and T represents the temperature coefficient used to soften the labels. So that the prediction network in the RNN-T model learns linguistic knowledge from the soft labels, the KL divergence is used as the distillation loss L_kd, which minimizes the difference between the posterior probabilities of the external language model and the prediction network language model, as shown in formula (2-2):
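A KL-divergence distillation loss of this kind is commonly written as below. This is a reconstruction, not the original typeset formula: the symbol P_pred for the prediction network posterior and the direction of the divergence (external language model relative to prediction network) are assumptions.

```latex
L_{kd} = \mathrm{KL}\left(P_{LM} \,\|\, P_{pred}\right)
       = \sum_{k} P_{LM}(k) \, \log \frac{P_{LM}(k)}{P_{pred}(k)}
\tag{2-2}
```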

The distillation loss L_kd and the prediction network language modeling loss L_LM are added with weights to form the optimization function of the prediction network language model. In the training phase, the total loss function of the model is shown in formula (2-3) and comprises three parts: the RNN-T model loss L_RNN-T, the prediction network language model loss L_LM, and the external language model distillation loss L_kd.

L_total = L_RNN-T + α((1-β)L_LM + βL_kd)    (2-3)
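The weighting in formula (2-3), together with the temperature softening and KL divergence it relies on, can be sketched as follows. This is a minimal illustration in plain Python; the function names (`softened_probs`, `kl_divergence`, `total_loss`) are hypothetical, and the source does not state which values of α, β or T were used.

```python
import math
from typing import List, Sequence


def softened_probs(logits: Sequence[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax; a larger temperature gives a flatter distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_loss(l_rnnt: float, l_lm: float, l_kd: float,
               alpha: float, beta: float) -> float:
    """Formula (2-3): L_total = L_RNN-T + alpha * ((1 - beta) * L_LM + beta * L_kd)."""
    return l_rnnt + alpha * ((1.0 - beta) * l_lm + beta * l_kd)
```

In this form β trades the hard-label language modeling loss against the soft-label distillation loss, while α sets how strongly both auxiliary terms weigh against the recognition loss.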

The most common evaluation metric in speech recognition is the error rate. The recognition result is aligned to the reference transcript by deletions, insertions, and substitutions until the two sequences are identical; the number of such operations is the Edit Distance between the two sequences, from which the recognition error rate is obtained. The Character Error Rate (CER) is commonly used for Chinese data sets and the Word Error Rate (WER) for English data sets. The two formulas have the same form; taking the word error rate as an example, the calculation is shown in formula (2-8):
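The standard definition of the word error rate, consistent with the symbols S, D, I and N explained next, is given below as a reconstruction of the formula.

```latex
\mathrm{WER} = \frac{S + D + I}{N} \times 100\%
\tag{2-8}
```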

That is, the word error rate is the ratio of the number of erroneous words in the recognition result to the total number of words in the reference sequence, where S is the number of substitution errors, D the number of deletion errors, I the number of insertion errors, and N the total number of words in the reference sequence. The recognition results of the RNN-T model fused with the long-term speech recognition algorithm module and the external language model are shown in Table 1:

TABLE 1 AISHELL-1 and synthetic long sentence data set comparative experiment results

Fusing the long-term speech recognition algorithm module and the external language model effectively improves the recognition performance of the RNN-T model on long audio while maintaining its accuracy on normal data. This alleviates the severe degradation that end-to-end models trained at sentence level suffer on long-sentence speech and gives the model better robustness and generalization capability.

The improved RNN-T model alleviates the severe degradation of long-term speech recognition performance and effectively improves the generalization and robustness of a model trained at sentence level in long-audio recognition scenarios. In addition, considering the trade-off between model size and recognition performance, a knowledge distillation algorithm based on mutual learning is provided. The student models are trained with the mutual learning knowledge distillation algorithm, using the trained improved RNN-T model described above as the teacher model. While keeping the loss of accuracy small, the resulting student model removes redundant parameters, which reduces the model's parameter count and its time and space complexity, so that the model strikes a better balance among size, amount of computation, and recognition performance.

To address the heavy computation and parameter redundancy of speech recognition models, the Mutual Learning Knowledge Distillation algorithm (MLKD) lets student models with different structures learn from one another. Each student learns from the structural differences of the others, which compensates for the structural shortcomings of each student model and introduces more diversity. The knowledge distillation framework based on mutual learning provided by the embodiment of the invention is shown in fig. 7.

(1) Teacher model training: the trained improved RNN-T model is taken as the teacher model in the mutual learning knowledge distillation algorithm, the speech features are taken as training data, and the teacher model is trained with the training data;

(2) Teacher model output acquisition: the predicted labels output by the teacher model are stored as pseudo labels that later guide the learning of the student models, which reduces the computational overhead during training;

(3) Student model training: the student baseline model is a standard RNN-T model, student model A is an RNN-T model with an added external language model, and student model B is an RNN-T model with an added cross-sentence context algorithm module. The student models are trained under supervision using the pseudo labels output by the teacher model together with the ground-truth labels. Several student models are trained simultaneously, so that they learn from one another during training.
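The three steps above can be illustrated with a per-student loss that combines ground-truth supervision, the stored teacher pseudo label, and a term pulling each student toward its peer. This is a minimal sketch under stated assumptions, not the loss of the embodiment: the source does not give the exact form, so the KL-based mutual term, the weights `w_teacher` and `w_mutual`, and all function names are hypothetical, and a single classification step stands in for the sequence-level RNN-T loss.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Negative log-likelihood of one label under a predicted distribution."""
    return -math.log(probs[label])


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def student_loss(own_probs: Sequence[float], peer_probs: Sequence[float],
                 true_label: int, pseudo_label: int,
                 w_teacher: float = 0.5, w_mutual: float = 0.5) -> float:
    """Loss for one student at one output step (weights are illustrative)."""
    supervised = cross_entropy(own_probs, true_label)   # ground-truth label
    distill = cross_entropy(own_probs, pseudo_label)    # stored teacher pseudo label
    mutual = kl_divergence(peer_probs, own_probs)       # agreement with the peer student
    return supervised + w_teacher * distill + w_mutual * mutual
```

Calling `student_loss(a, b, ...)` for student A and `student_loss(b, a, ...)` for student B in the same step, then updating both, gives the simultaneous training the steps describe; storing the pseudo labels beforehand means the teacher does not have to be run during student training.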

Subsequently, the trained and verified student models are each used to recognize the long-term speech data to be recognized; obtaining the final recognition result therefore introduces no additional computation into the model.

Tests with the mutual learning knowledge distillation algorithm on the open-source AISHELL-1 data set and the synthesized long-term speech data set show that the CER of student model A and of student model B both decrease after compression with the algorithm (student model A + MLKD and student model B + MLKD). The best CER of student model B + MLKD on the verification set and the test set of the synthesized long-term speech data reaches 15.53% and 18.10% respectively, further narrowing the performance gap to the teacher baseline model.

TABLE 2 RNN-Transducer model compression identification Performance results

The parameter counts and amounts of computation of the encoding network and the prediction network are compared between the teacher model and the student model, that is, before and after compression. The results are shown in Table 3. After compression, the encoding network has 41.02% fewer parameters and 43.62% fewer floating-point operations, and the prediction network has 63.03% fewer parameters and 77.98% fewer floating-point operations. Both networks are therefore substantially reduced in parameter count and computation, which achieves the purpose of compression, while the model still maintains high recognition accuracy.

TABLE 3 comparison of the values of the parameters and the calculated quantities before and after compression of the coding network and the prediction network

In summary, the embodiment of the present invention solves three problems that existing long-term end-to-end speech recognition algorithms do not solve:

(1) Insufficient language modeling capability of the RNN-T model, which makes model training difficult. To address this, the invention fuses the external language model with the RNN-T model: the prediction network in the model performs language modeling directly, and an auxiliary language model task is added in which a trained external language model provides soft labels for the prediction network language model and guides its language modeling, achieving joint optimization of acoustic and language information. The language modeling capability of the model is improved and its training difficulty reduced without increasing the parameter count or computation of the model in the decoding stage.

(2) Poor robustness in long-term speech recognition scenarios. The invention adds a long-term speech recognition algorithm module to the RNN-T model. First, a cross-sentence context module stores context information between sentences and provides cross-sentence historical context to the RNN-T model, which addresses the problem of establishing long-term dependencies in long-term speech recognition. Second, simulated long-audio training is added, which imitates long-term data by carrying over the internal recurrent state of the model and thereby improves the robustness of a model trained on short sentences when it recognizes long sentences. Combining the cross-sentence context module with simulated long-term audio training further improves recognition accuracy in long-term speech recognition scenarios, effectively mitigates the performance loss caused by the large mismatch between the lengths of training and test sentences, and gives the model stronger robustness and generalization capability.

(3) Large parameter count and high computational complexity. To address the excessive parameter count and high computational complexity of speech recognition models, the invention provides the mutual learning knowledge distillation algorithm. It overcomes the sensitivity of traditional knowledge distillation to factors such as parameter settings and model initialization, removes redundant parameters of the speech recognition model, and reduces the model's parameter count and its time and space complexity. The algorithm is based on mutual learning among student models with different structures: each student learns from the structural differences of the others, which compensates for the structural shortcomings of each student model and introduces more diversity. The rich and correct information in the teacher model is transferred to the student models so that they recognize speech better, and the model achieves a good balance among size, amount of computation, and recognition performance.

Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.

From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.

The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, apparatus or system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the description of the method embodiments. The above-described embodiments of the apparatus and system are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
