End-to-end voice-to-text rare word optimization method

Document No.: 139068    Publication date: 2021-10-22

Abstract: This technique, an end-to-end voice-to-text rare word optimization method, was designed by Hu Yan (胡燕) on 2021-01-29. The invention discloses an end-to-end voice-to-text rare word optimization method that constructs a rare word list from the annotated text of the training set corpus. First, the annotated text of the training set corpus is collected and analyzed; then, the annotated text is segmented into words with a word segmentation tool and word frequencies are counted with the SRILM language model toolkit; finally, words whose frequency is lower than the set frequency threshold are defined as rare words and added to the rare word list. By statistically analyzing the text annotations of the training set corpus and constructing a text corpus list containing rare words, the proposed technique can effectively find the text on which the end-to-end speech recognition model has not been sufficiently trained.

1. An end-to-end speech-to-text rare word optimization method, characterized by comprising the construction of a rare word list from the annotated text of the training set corpus: first, the annotated text of the training set corpus is collected and analyzed; then, the annotated text is segmented into words with a word segmentation tool and word frequencies are counted with the SRILM language model toolkit; finally, words whose frequency is lower than the set frequency threshold are defined as rare words and added to the rare word list.

2. The method of claim 1, wherein the method comprises: extracting sentences containing rare words from a large-scale language model training corpus according to the constructed rare word list. Because the language model training corpus generally contains far more text than the acoustic model training set, multiple sentences containing the rare words can be extracted from the language model text corpus and used to synthesize speech data containing the rare words.

3. The method of claim 1, wherein the method comprises synthesizing speech data for the sentences containing rare words: first, a multi-speaker speech synthesis model based on Tacotron2 is constructed; then, Tacotron2 is trained with multi-speaker speech data; finally, the trained multi-speaker speech synthesis model is used to perform text-to-speech synthesis on the sentences containing rare words, yielding a synthesized rare word speech corpus.

4. The method of claim 1, wherein the method comprises model optimization with a small amount of heavily accented speech data: a general speech recognition model trained on standard Mandarin speech data is optimized using a small amount of accented speech data.

5. The method of claim 1, wherein the method comprises expanding the training corpus used to construct the end-to-end acoustic model: the synthesized multi-speaker speech data containing rare words is merged with the speech data of the original training set to obtain an expanded training set corpus.

6. The method of claim 1, wherein the method comprises constructing an end-to-end speech recognition model LAS: the end-to-end speech model constructed by the method is the attention-based sequence-to-sequence speech recognition model LAS, which mainly consists of an encoder module, an attention module and a decoder module.

7. The method of claim 1, wherein the method comprises training the end-to-end speech recognition model LAS: the LAS is trained with the expanded training set speech data and its three modules are jointly optimized, so that the end-to-end model can use a function closely related to the final evaluation criterion as the objective function of global optimization, which helps in reaching a globally optimal result.

8. The method of claim 1, wherein the method comprises speech decoding and language model rescoring for the end-to-end speech recognition model: first, beam search decoding is run on the jointly optimized speech recognition model to obtain acoustic model scores; then, an existing large-scale language model is used to compute language model scores for the decoding paths found by the search; finally, the acoustic model scores are rescored with the language model scores to obtain the final decoding search scores, from which the decoded text corresponding to the accented speech is obtained.

9. The method of claim 1, comprising the steps of:

S1: Count the word frequencies of the annotated text in the training corpus and obtain the rare word list. First, the text corpus corresponding to the training set is statistically analyzed; then, the text is segmented into words with a word segmentation tool, and the SRILM language model toolkit is used to count word frequencies over the segmented corpus, giving a word frequency list for the training corpus; finally, the word frequency list is analyzed and rare words are extracted to build the rare word list. That is, for the frequency $n_{word}$ of each word, a frequency threshold $n_{threshold}$ is set; when $n_{word} \le n_{threshold}$, the word is a low-frequency or rare word in the training set corpus and is added to the rare word list. After processing finishes, the rare word list corresponding to the current training set corpus is obtained;

S2: Extract sentences containing rare words from the large-scale text corpus and build the rare word text corpus. For the rare word list extracted in step S1, multiple sentences containing rare words are extracted from the large-scale language model text corpus and added to the rare word text corpus. After processing finishes, a text corpus containing the rare words in the rare word list is obtained;

S3: Synthesize speech data for the sentences containing rare words. For the rare word text corpus extracted in step S2, text-to-speech synthesis is performed with a mature speech synthesis model, Tacotron2. Tacotron2 is a neural-network-based speech synthesis method that mainly consists of a spectrogram prediction network, a vocoder and an intermediate connection module;

The spectrogram prediction network is an attention-based sequence-to-sequence network. Its input is the character sequence to be synthesized and its output is a sequence of Mel-spectrogram acoustic feature frames. The encoder module consists of a character embedding layer with 512 neuron nodes, 3 convolutional neural network (CNN) layers each containing 512 convolution kernels of size 5 x 1, and a bidirectional long short-term memory (LSTM) layer with 256 hidden neuron nodes. The computation of the encoder module is shown in formulas (1) to (2);

$$F_e = \mathrm{ReLU}(K_3 * \mathrm{ReLU}(K_2 * \mathrm{ReLU}(K_1 * E(Ch)))) \tag{1}$$

$$H = \mathrm{BLSTM}(F_e) \tag{2}$$

where $K_1$, $K_2$ and $K_3$ denote the convolution kernels of the three convolutional layers, ReLU is the nonlinear activation function, $E(\cdot)$ denotes the embedding of the input character sequence $Ch$, $F_e$ is the high-level semantic encoding of the characters output by the convolutional layers, and $H$ is the output of the bidirectional long short-term memory (BLSTM) layer.

The attention module uses a location-sensitive attention mechanism, which adds location features to the alignment process so that both the content information of the input and the location information of the input elements can be extracted. Its formal expression is shown in formula (3);

where $v_a$, $W$, $V$, $U$ and $b$ are trainable parameters, $s_i$ is the current decoder hidden state, $h_j$ is the current encoder hidden state, and $f_{i,j}$ is the location feature encoding obtained by applying a convolution to the previous attention weights $\alpha_{i-1}$;

The decoder module is an autoregressive recurrent neural network. Each decoding step starts by feeding the spectrogram frame predicted at the previous step (or the ground-truth frame of the previous step) into a preprocessing network, PreNet. The PreNet output is concatenated with the context vector computed at the previous decoding step and fed into the decoder network, which has an RNN structure; the context vector of the current step is computed from the decoder output. The current context vector and the current decoder output are then concatenated and fed into a linear projection network to compute the output. After decoding finishes, the predicted Mel spectrogram is passed through a post-processing network to improve its quality;

The vocoder is a modified WaveNet model that converts the generated frequency-domain Mel-spectrogram acoustic features into a time-domain speech waveform file;

The method uses a well-trained Tacotron2 multi-speaker speech synthesis model to synthesize speech for the rare word text corpus. In addition, to reduce the influence of speaker information on the synthesized speech, the experiments use the information of multiple speakers when synthesizing text that is not covered by the training annotations, which expands the diversity of the target text corpus;

S4: Add the synthesized rare word corpus to the training set and train the end-to-end acoustic model. First, the multi-speaker speech corpus $X_{synthesis}$ synthesized in step S3 is merged with the training corpus $X_{train}$ to obtain the extended training set corpus $X_{extension}$, i.e. $X_{extension} = X_{train} \cup X_{synthesis}$.

Then, the end-to-end speech recognition model LAS is constructed and trained. The LAS model mainly consists of three parts, an encoder module, an attention module and a decoder module; the model structure is shown in FIG. 2. The encoder uses a bidirectional long short-term memory network to model the temporal relationships of the input sequence features $X_{extension}$, as formalized in equation (4);

Three pBLSTM layers are stacked after the BLSTM layer; the computation of a pBLSTM layer is shown in equation (5);

Introducing a context-dependent attention mechanism lets the model focus on learning the context-related, semantically salient features in the sequence, which improves the accuracy of model inference. For the intermediate semantic features $H = (h_1, h_2, ..., h_u, ..., h_U)$ output by the encoder, the attention layer first computes the weight $\alpha_{i,u}$ of each feature $h_u$ at the $i$-th time step; the calculation is shown in formulas (6) to (7);

where exp is the exponential function with the natural constant $e$ as its base, and $\phi$ and $\psi$ are fully connected neural networks with trainable parameters. The context-dependent semantic feature $c_i$ is a weighted sum of the input sequence and represents the overall semantics of a speech segment; the weighted sum is shown in equation (8);

The decoder network consists of two unidirectional LSTM layers, each with 512 neural nodes; its formal expression is shown in formulas (9) to (10);

$$s_i = \mathrm{LSTM}(s_{i-1}, y_{i-1}, c_{i-1}) \tag{9}$$

$$P(y_i \mid x, y_{<i}) = \mathrm{MLP}(s_i, c_i) \tag{10}$$

where MLP is a fully connected neural network with a Softmax activation function whose output is the posterior probability of the modeling units;

The encoder, attention layer and decoder of the LAS model constructed by the invention can be trained jointly end to end; the objective function is shown in formula (11);

where $\theta_e$, $\theta_a$ and $\theta_d$ are the model parameters of the LAS encoder module, attention module and decoder module, respectively, and $y^{*}_{<i}$ denotes the ground-truth labels of the characters at the time steps before the $i$-th time step;

S5: Because a greedy decoding strategy directly takes the best path at the current position at each step, the probability of the whole generated sequence is not guaranteed to be optimal. In addition, the vocabulary in practical applications is generally very large, and a decoding search over all possible paths cannot be completed within a limited search time. Therefore, beam search is generally used for speech decoding in practice. Meanwhile, to correct the decoding result with a language model, the technique rescores the searched paths with a language model; the formal expression is shown in formula (12);

where $|y|_c$ is the number of characters, $\log P_{LM}(y)$ is the language model score, and $\lambda$ is the language model score weight, which can be determined on a validation set. In practice, decoding uses beam search with a beam width of 32 and a language model score weight $\lambda$ of 0.008.

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to an end-to-end voice-to-text rare word optimization method.

Background

For a long time, speech recognition methods based on the Hidden Markov Model (HMM) have been the mainstream approach to Large Vocabulary Continuous Speech Recognition (LVCSR). Even now, hybrid models based on the Deep Neural Network-Hidden Markov Model (DNN-HMM) can still achieve the best recognition accuracy. Generally speaking, an HMM-based speech recognition model consists of three modules: an acoustic model, a pronunciation dictionary and a language model. The acoustic model mainly models the mapping between input speech and phoneme or sub-phoneme sequences; the pronunciation dictionary mainly implements the mapping between phonemes (or sub-phonemes) and characters, and is usually constructed by professional linguists; the language model maps character sequences to fluent transcribed text. However, these three relatively independent components must be designed and trained separately, then fused by constructing a Weighted Finite State Transducer (WFST) before speech decoding can be carried out. Designing and training each component requires accumulated specialist expertise, the training procedure is complex, and a globally optimal solution is difficult to reach. Furthermore, the conditional independence assumptions made when building the model mean that the method does not exactly match LVCSR in real scenarios. As a result, the usability, maintainability and portability of HMM-based speech recognition methods are greatly limited.

The advent of deep learning has greatly improved the recognition accuracy of speech recognition models. In view of the limitations of conventional HMM-based methods, more and more research institutions have begun studying LVCSR based on end-to-end acoustic models. An end-to-end speech recognition model integrates the three major components of a traditional speech recognition system into one network model that directly maps an input audio sequence to a word sequence or other character sequence. Merging the modules has two advantages. First, there is no need to design multiple modules to map between various intermediate states, which greatly simplifies building and training the speech recognition model. Second, joint training lets the end-to-end model use a function closely related to the final evaluation criterion as the objective of global optimization, which makes a globally optimal solution easier to find and further improves recognition accuracy. An end-to-end voice-to-text rare word optimization method is therefore proposed to address the problems above.

Disclosure of Invention

The invention aims to provide an end-to-end speech-to-text rare word optimization method to solve the problems in the background art.

To achieve this aim, the invention provides the following technical solution: an end-to-end voice-to-text rare word optimization method comprising the construction of a rare word list from the annotated text of the training set corpus. First, the annotated text of the training set corpus is collected and analyzed; then, the annotated text is segmented into words with a word segmentation tool and word frequencies are counted with the SRILM language model toolkit; finally, words whose frequency is lower than the set frequency threshold are defined as rare words and added to the rare word list.

Sentences containing rare words are extracted from a large-scale language model training corpus according to the constructed rare word list. Because the language model training corpus generally contains far more text than the acoustic model training set, multiple sentences containing the rare words can be extracted from the language model text corpus and used to synthesize speech data containing the rare words.

Speech data is synthesized for the sentences containing rare words: first, a multi-speaker speech synthesis model based on Tacotron2 is constructed; then, Tacotron2 is trained with multi-speaker speech data; finally, the trained multi-speaker speech synthesis model is used to perform text-to-speech synthesis on the sentences containing rare words, yielding a synthesized rare word speech corpus.

The model is optimized with a small amount of heavily accented speech data: a general speech recognition model trained on standard Mandarin speech data is optimized using a small amount of accented speech data.

The training corpus used to construct the end-to-end acoustic model is expanded: the synthesized multi-speaker speech data containing rare words is merged with the speech data of the original training set to obtain an expanded training set corpus.

An end-to-end speech recognition model LAS is constructed: the end-to-end speech model constructed by the method is the attention-based sequence-to-sequence speech recognition model LAS, which mainly consists of an encoder module, an attention module and a decoder module.

The end-to-end speech recognition model LAS is trained: the LAS is trained with the expanded training set speech data and its three modules are jointly optimized, so that the end-to-end model can use a function closely related to the final evaluation criterion as the objective function of global optimization, which helps in reaching a globally optimal result.

Speech decoding and language model rescoring are performed for the end-to-end speech recognition model: first, beam search decoding is run on the jointly optimized speech recognition model to obtain acoustic model scores; then, an existing large-scale language model is used to compute language model scores for the decoding paths found by the search; finally, the acoustic model scores are rescored with the language model scores to obtain the final decoding search scores, from which the decoded text corresponding to the accented speech is obtained.

An end-to-end speech-to-text rare word optimization method comprises the following steps:

S1: Count the word frequencies of the annotated text in the training corpus and obtain the rare word list. First, the text corpus corresponding to the training set is statistically analyzed; then, the text is segmented into words with a word segmentation tool, and the SRILM language model toolkit is used to count word frequencies over the segmented corpus, giving a word frequency list for the training corpus; finally, the word frequency list is analyzed and rare words are extracted to build the rare word list. That is, for the frequency $n_{word}$ of each word, a frequency threshold $n_{threshold}$ is set; when $n_{word} \le n_{threshold}$, the word is a low-frequency or rare word in the training set corpus and is added to the rare word list. After processing finishes, the rare word list corresponding to the current training set corpus is obtained;
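Step S1 can be illustrated with a minimal sketch. It assumes the annotated text has already been segmented into space-separated words, and it uses Python's `collections.Counter` as a stand-in for the SRILM frequency count; the function name `build_rare_word_list`, the toy corpus and the threshold value are illustrative assumptions, not part of the invention.

```python
from collections import Counter


def build_rare_word_list(segmented_sentences, n_threshold):
    """Return the words whose frequency n_word satisfies n_word <= n_threshold."""
    counts = Counter()
    for sentence in segmented_sentences:
        # Each sentence is assumed to be pre-segmented, with words separated by spaces.
        counts.update(sentence.split())
    return sorted(word for word, n_word in counts.items() if n_word <= n_threshold)


# Hypothetical, already-segmented annotation text.
corpus = [
    "今天 天气 很 好",
    "今天 我们 去 公园",
    "天气 预报 说 明天 有 雨",
]
rare_words = build_rare_word_list(corpus, n_threshold=1)
```

With this toy corpus, only the two words that occur twice are excluded from `rare_words`; in practice the threshold would be tuned to the size of the training set.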

S2: Extract sentences containing rare words from the large-scale text corpus and build the rare word text corpus. For the rare word list extracted in step S1, multiple sentences containing rare words are extracted from the large-scale language model text corpus and added to the rare word text corpus. After processing finishes, a text corpus containing the rare words in the rare word list is obtained;
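A possible sketch of step S2 follows. It assumes the language model corpus is also pre-segmented into space-separated words; the function name `extract_rare_word_sentences` and the cap `max_per_word` on the number of sentences collected per rare word are hypothetical choices made here for illustration, since the text only says that multiple sentences are extracted.

```python
def extract_rare_word_sentences(lm_sentences, rare_words, max_per_word=2):
    """Select sentences that contain at least one rare word still needing examples."""
    rare = set(rare_words)
    hits = {word: 0 for word in rare}
    selected = []
    for sentence in lm_sentences:
        # Rare words in this sentence that have not yet reached the per-word cap.
        needed = [w for w in set(sentence.split()) if w in rare and hits[w] < max_per_word]
        if needed:
            selected.append(sentence)
            for w in needed:
                hits[w] += 1
    return selected


# Hypothetical, already-segmented language model corpus.
lm_corpus = [
    "公园 里 有 很多 人",
    "今天 天气 不错",
    "明天 去 公园 散步",
    "周末 公园 人 很多",
    "预报 说 有 雨",
]
rare_text_corpus = extract_rare_word_sentences(lm_corpus, ["公园", "预报"], max_per_word=2)
```

The cap keeps very common contexts of one rare word from dominating the synthesized data; whether such a cap is used at all is an assumption of this sketch.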

S3: Synthesize speech data for the sentences containing rare words. For the rare word text corpus extracted in step S2, text-to-speech synthesis is performed with a mature speech synthesis model, Tacotron2. Tacotron2 is a neural-network-based speech synthesis method that mainly consists of a spectrogram prediction network, a vocoder and an intermediate connection module;

The spectrogram prediction network is an attention-based sequence-to-sequence network. Its input is the character sequence to be synthesized and its output is a sequence of Mel-spectrogram acoustic feature frames. The encoder module consists of a character embedding layer with 512 neuron nodes, 3 convolutional neural network (CNN) layers each containing 512 convolution kernels of size 5 x 1, and a bidirectional long short-term memory (LSTM) layer with 256 hidden neuron nodes. The computation of the encoder module is shown in formulas (1) to (2);

$$F_e = \mathrm{ReLU}(K_3 * \mathrm{ReLU}(K_2 * \mathrm{ReLU}(K_1 * E(Ch)))) \tag{1}$$

$$H = \mathrm{BLSTM}(F_e) \tag{2}$$

where $K_1$, $K_2$ and $K_3$ denote the convolution kernels of the three convolutional layers, ReLU is the nonlinear activation function, $E(\cdot)$ denotes the embedding of the input character sequence $Ch$, $F_e$ is the high-level semantic encoding of the characters output by the convolutional layers, and $H$ is the output of the bidirectional long short-term memory (BLSTM) layer.

The attention module uses a location-sensitive attention mechanism, which adds location features to the alignment process so that both the content information of the input and the location information of the input elements can be extracted. Its formal expression is shown in formula (3);

where $v_a$, $W$, $V$, $U$ and $b$ are trainable parameters, $s_i$ is the current decoder hidden state, $h_j$ is the current encoder hidden state, and $f_{i,j}$ is the location feature encoding obtained by applying a convolution to the previous attention weights $\alpha_{i-1}$;
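Using the symbols defined above, the location-sensitive attention energy is commonly written in the following form. This is a reconstruction that assumes the method follows the standard Tacotron 2 formulation; the convolution filter $F$ is a name introduced here for illustration only.

```latex
e_{i,j} = v_a^{\top} \tanh\left( W s_i + V h_j + U f_{i,j} + b \right),
\qquad f_i = F * \alpha_{i-1}
```

The energies $e_{i,j}$ would then be normalized over $j$ with a softmax to give the attention weights $\alpha_{i,j}$ for the current decoding step.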

The decoder module is an autoregressive recurrent neural network. Each decoding step starts by feeding the spectrogram frame predicted at the previous step (or the ground-truth frame of the previous step) into a preprocessing network, PreNet. The PreNet output is concatenated with the context vector computed at the previous decoding step and fed into the decoder network, which has an RNN structure; the context vector of the current step is computed from the decoder output. The current context vector and the current decoder output are then concatenated and fed into a linear projection network to compute the output. After decoding finishes, the predicted Mel spectrogram is passed through a post-processing network to improve its quality;

The vocoder is a modified WaveNet model that converts the generated frequency-domain Mel-spectrogram acoustic features into a time-domain speech waveform file;

The method uses a well-trained Tacotron2 multi-speaker speech synthesis model to synthesize speech for the rare word text corpus. In addition, to reduce the influence of speaker information on the synthesized speech, the experiments use the information of multiple speakers when synthesizing text that is not covered by the training annotations, which expands the diversity of the target text corpus;

S4: Add the synthesized rare word corpus to the training set and train the end-to-end acoustic model. First, the multi-speaker speech corpus $X_{synthesis}$ synthesized in step S3 is merged with the training corpus $X_{train}$ to obtain the extended training set corpus $X_{extension}$, i.e. $X_{extension} = X_{train} \cup X_{synthesis}$.

Then, the end-to-end speech recognition model LAS is constructed and trained. The LAS model mainly consists of three parts, an encoder module, an attention module and a decoder module; the model structure is shown in FIG. 2. The encoder uses a bidirectional long short-term memory network to model the temporal relationships of the input sequence features $X_{extension}$, as formalized in equation (4);

Three pBLSTM layers are stacked after the BLSTM layer; the computation of a pBLSTM layer is shown in equation (5);
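For reference, the BLSTM and pyramidal BLSTM recurrences are usually written as follows. This reconstruction assumes the encoder follows the original Listen, Attend and Spell formulation; the notation $h_i^j$ for the hidden state at time step $i$ of layer $j$ is an assumption introduced here.

```latex
h_i^{j} = \mathrm{BLSTM}\left( h_{i-1}^{j},\; h_i^{j-1} \right)

h_i^{j} = \mathrm{pBLSTM}\left( h_{i-1}^{j},\; \left[ h_{2i}^{j-1},\, h_{2i+1}^{j-1} \right] \right)
```

Under this form, each pBLSTM layer concatenates two adjacent outputs of the layer below, so three stacked layers would reduce the time resolution by a factor of $2^3 = 8$.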

Introducing a context-dependent attention mechanism lets the model focus on learning the context-related, semantically salient features in the sequence, which improves the accuracy of model inference. For the intermediate semantic features $H = (h_1, h_2, ..., h_u, ..., h_U)$ output by the encoder, the attention layer first computes the weight $\alpha_{i,u}$ of each feature $h_u$ at the $i$-th time step; the calculation is shown in formulas (6) to (7);
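A common way to write the energy and the normalized weight is given here, assuming the attention layer follows the content-based attention of the original LAS model. In this sketch $s_i$ is the decoder state at step $i$, and $\phi$ and $\psi$ are two fully connected networks (the symbol $\psi$ is an assumption taken from that formulation).

```latex
e_{i,u} = \left\langle \phi(s_i),\; \psi(h_u) \right\rangle

\alpha_{i,u} = \frac{\exp(e_{i,u})}{\sum_{u'=1}^{U} \exp(e_{i,u'})}
```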

where exp is the exponential function with the natural constant $e$ as its base, and $\phi$ and $\psi$ are fully connected neural networks with trainable parameters. The context-dependent semantic feature $c_i$ is a weighted sum of the input sequence and represents the overall semantics of a speech segment; the weighted sum is shown in equation (8);
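The attention computation for one decoder step can be sketched in plain Python. The sketch assumes that the energy is the inner product of $\phi(s_i)$ and $\psi(h_u)$, that each network is a single tanh layer without bias, and that the context vector is $c_i = \sum_u \alpha_{i,u} h_u$; the helper names `dense` and `attention_context` and the toy weights are hypothetical.

```python
import math


def dense(x, W):
    """Single fully connected layer tanh(x W); W is a list of rows (assumed form)."""
    return [math.tanh(sum(xk * W[k][j] for k, xk in enumerate(x)))
            for j in range(len(W[0]))]


def attention_context(s_i, H, W_phi, W_psi):
    """Return (alpha_i, c_i) for decoder state s_i and encoder features H."""
    phi = dense(s_i, W_phi)
    # Energy e_{i,u}: inner product of phi(s_i) and psi(h_u).
    energies = [sum(p * q for p, q in zip(phi, dense(h_u, W_psi))) for h_u in H]
    # Softmax over u, shifted by the maximum for numerical stability.
    m = max(energies)
    exps = [math.exp(e - m) for e in energies]
    z = sum(exps)
    alpha = [x / z for x in exps]
    # Context vector c_i: attention-weighted sum of the encoder features.
    c_i = [sum(a * h_u[d] for a, h_u in zip(alpha, H)) for d in range(len(H[0]))]
    return alpha, c_i


# Toy example with identity weights (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, c_i = attention_context([0.5, -0.5], H, identity, identity)
```

In this toy case the first encoder feature is most aligned with the decoder state, so it receives the largest weight.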

The decoder network consists of two unidirectional LSTM layers, each with 512 neural nodes; its formal expression is shown in formulas (9) to (10);

$$s_i = \mathrm{LSTM}(s_{i-1}, y_{i-1}, c_{i-1}) \tag{9}$$

$$P(y_i \mid x, y_{<i}) = \mathrm{MLP}(s_i, c_i) \tag{10}$$

where MLP is a fully connected neural network with a Softmax activation function whose output is the posterior probability of the modeling units;

The encoder, attention layer and decoder of the LAS model constructed by the invention can be trained jointly end to end; the objective function is shown in formula (11);

where $\theta_e$, $\theta_a$ and $\theta_d$ are the model parameters of the LAS encoder module, attention module and decoder module, respectively, and $y^{*}_{<i}$ denotes the ground-truth labels of the characters at the time steps before the $i$-th time step;
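Consistent with formula (10) and the parameters just listed, the joint training objective is typically the log-likelihood of each correct output given the ground-truth history. The form here is a reconstruction under the assumption that training follows the original LAS objective.

```latex
\max_{\theta_e,\,\theta_a,\,\theta_d} \; \sum_{i} \log P\left( y_i \mid x,\; y^{*}_{<i};\; \theta_e, \theta_a, \theta_d \right)
```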

S5: Because a greedy decoding strategy directly takes the best path at the current position at each step, the probability of the whole generated sequence is not guaranteed to be optimal. In addition, the vocabulary in practical applications is generally very large, and a decoding search over all possible paths cannot be completed within a limited search time. Therefore, beam search is generally used for speech decoding in practice. Meanwhile, to correct the decoding result with a language model, the technique rescores the searched paths with a language model; the formal expression is shown in formula (12);

where $|y|_c$ is the number of characters, $\log P_{LM}(y)$ is the language model score, and $\lambda$ is the language model score weight, which can be determined on a validation set. In practice, decoding uses beam search with a beam width of 32 and a language model score weight $\lambda$ of 0.008.
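The rescoring step can be sketched as follows, assuming the score takes the length-normalized form used in the original LAS work, $s(y \mid x) = \log P(y \mid x) / |y|_c + \lambda \log P_{LM}(y)$. The function name `rescore`, the tuple layout of a hypothesis and the toy log-probabilities are hypothetical; a real system would obtain them from the beam search and the language model.

```python
def rescore(hypotheses, lm_weight=0.008):
    """Pick the best beam hypothesis after language model rescoring.

    hypotheses: list of (text, log_p_acoustic, log_p_lm) tuples with non-empty text.
    """
    def score(hypothesis):
        text, log_p_acoustic, log_p_lm = hypothesis
        # Length-normalized acoustic score plus weighted language model score.
        return log_p_acoustic / len(text) + lm_weight * log_p_lm

    return max(hypotheses, key=score)


# Two hypothetical beam candidates: the second has a slightly better acoustic
# score but a much worse language model score.
beam = [
    ("今天天气", -4.0, -20.0),
    ("今天天器", -3.9, -60.0),
]
best = rescore(beam)
```

With $\lambda = 0.008$ the language model term is enough to prefer the first candidate, whereas with $\lambda = 0$ the acoustically stronger second candidate would be chosen.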

Compared with the prior art, the invention has the following beneficial effects:

The end-to-end voice-to-text rare word optimization technique provided by the invention statistically analyzes the text annotations of the training set corpus and constructs a text corpus list containing rare words, which makes it possible to effectively find the text on which the end-to-end speech recognition model has not been sufficiently trained. Then, a mature multi-speaker speech synthesis model is used to synthesize speech for the rare words that are insufficiently trained in the training set, which effectively expands the speech containing rare words and thus the training set corpus. Finally, after the synthesized speech is merged with the original training corpus, the end-to-end speech recognition model is trained and optimized. This can significantly improve the generalization of the end-to-end voice-to-text model to rare words, which addresses the poor rare word recognition caused by insufficient training data and effectively improves the accuracy of end-to-end speech recognition.

Drawings

FIG. 1 is a schematic flow chart of an end-to-end speech-to-text rare word optimization technique of the present invention;

FIG. 2 is a diagram of an end-to-end speech recognition model LAS model architecture of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1-2, the present invention provides a technical solution: an end-to-end voice-to-text rare word optimization method comprises the steps that a training set corpus corresponds to a structure of a rare word list in a labeled text, and firstly, the labeled text of the training set corpus is sorted and analyzed; then, segmenting the labeled text by using a segmentation tool and counting word frequency of the words by using an SRILM language model tool; and finally, defining the words with the word frequency smaller than the set word frequency threshold value as the rare words, and adding the rare words into a rare word list.

And extracting sentences containing the rare words from the large-scale language model training corpora according to the constructed rare word list, wherein generally speaking, the text corpora contained in the language model training corpora are far more than the text corpora contained in the acoustic model training set, so that a plurality of sentences containing the rare words can be extracted from the language model text corpora and are used for synthesizing the voice data containing the rare words.

Speech data is synthesized for the sentences containing rare words. First, a multi-speaker speech synthesis model based on Tacotron2 is constructed; then, Tacotron2 is trained on multi-speaker speech data; finally, the trained multi-speaker speech synthesis model is used to synthesize speech for the sentences containing rare words, yielding a synthesized rare word speech corpus.

Model optimization with a small amount of heavily accented speech data. A general speech recognition model trained on standard Mandarin speech data is optimized with a small amount of accented speech data.

The training corpus is expanded and an end-to-end acoustic model is constructed. The synthesized multi-speaker speech data containing rare words is merged with the speech data of the original training set to obtain an extended training set corpus.

The end-to-end speech recognition model LAS is constructed. The end-to-end model built by the method is LAS, an attention-based sequence-to-sequence speech recognition model that consists mainly of an encoder module, an attention module, and a decoder module.

The end-to-end speech recognition model LAS is trained on the extended training set speech data, and its three modules are optimized jointly. The end-to-end model can therefore use a function closely related to the final evaluation criterion as the objective of global optimization, which helps in reaching a globally optimal result.

Speech decoding and language model rescoring are performed with the end-to-end speech recognition model. First, the jointly optimized speech recognition model is decoded with beam search to obtain acoustic model scores; then, an existing large-scale language model is used to compute language model scores for the decoding paths found by the search; finally, the acoustic model scores are rescored with the language model scores to obtain the final scores of the decoding search, and these scores are used to obtain the decoded text corresponding to the accented speech.

An end-to-end speech-to-text rare word optimization method comprises the following steps:

S1. Count the word frequencies of the annotated text in the training corpus and obtain the rare word list. First, the text corpus corresponding to the training set is statistically analyzed; then, the text corpus is segmented with a word segmentation tool, and word frequency statistics are computed on the segmented corpus with the SRILM language model toolkit to obtain a word frequency list for the training corpus. The word frequency list is then analyzed and the rare words are extracted to build the rare word list: given the word frequency $n_{word}$ and a word frequency threshold $n_{threshold}$, a word with $n_{word} \le n_{threshold}$ is considered a low-frequency or rare word in the training set corpus and is added to the rare word list. Once all words have been processed, the rare word list for the current training set corpus is obtained;
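The thresholding in step S1 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes the transcripts are already segmented into word lists, and it uses `collections.Counter` in place of the SRILM frequency count. The function name `build_rare_word_list` and the toy transcripts are hypothetical.

```python
from collections import Counter


def build_rare_word_list(segmented_sentences, n_threshold):
    """Return, in sorted order, the words with n_word <= n_threshold."""
    # Count every word occurrence across all segmented training transcripts.
    counts = Counter(word for sentence in segmented_sentences for word in sentence)
    # Keep the words whose frequency does not exceed the threshold.
    return sorted(word for word, n_word in counts.items() if n_word <= n_threshold)


# Hypothetical, already-segmented training transcripts.
train_transcripts = [
    ["play", "music"],
    ["play", "radio"],
    ["stop", "music"],
    ["play", "zydeco"],
]

rare_words = build_rare_word_list(train_transcripts, n_threshold=1)
print(rare_words)
```

With a threshold of 1, only the words seen once in the toy transcripts are returned; raising the threshold widens the list.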

S2. Extract sentences containing the rare words from a large-scale text corpus and build a rare word text corpus. For each word in the rare word list obtained in step S1, multiple sentences containing the word are extracted from the large-scale language model text corpus and added to the rare word text corpus. Once all words have been processed, the text corpus corresponding to the rare word list is obtained;
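Step S2 amounts to filtering a large text corpus by the rare word list. The sketch below assumes pre-segmented sentences. The per-word cap `max_per_word` is a hypothetical parameter introduced only to keep the extracted corpus bounded; the source says only that multiple sentences are extracted per rare word.

```python
def extract_rare_word_sentences(lm_sentences, rare_words, max_per_word=2):
    """Collect sentences that contain a rare word still under its cap."""
    rare = set(rare_words)
    picked_per_word = {word: 0 for word in rare}
    rare_word_corpus = []
    for sentence in lm_sentences:
        # Rare words present in this sentence that still need more examples.
        hits = [w for w in rare if w in sentence and picked_per_word[w] < max_per_word]
        if hits:
            rare_word_corpus.append(sentence)
            for w in hits:
                picked_per_word[w] += 1
    return rare_word_corpus


# Hypothetical, already-segmented language model text corpus.
lm_sentences = [
    ["please", "play", "zydeco", "tonight"],
    ["the", "radio", "is", "loud"],
    ["play", "music"],
    ["zydeco", "radio", "hour"],
    ["more", "zydeco", "please"],
]

rare_corpus = extract_rare_word_sentences(lm_sentences, ["radio", "zydeco"])
print(rare_corpus)
```

In this toy run the last sentence is skipped because the cap for its only rare word has already been reached.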

S3. Synthesize speech data for the sentences containing rare words. For the rare word text corpus extracted in step S2, text-to-speech synthesis is performed with a mature speech synthesis model, Tacotron2. Tacotron2 is a neural-network-based speech synthesis method that consists mainly of a spectrogram prediction network, a vocoder, and an intermediate connection module;

The spectrogram prediction network is an attention-based sequence-to-sequence network. Its input is the character sequence to be synthesized and its output is a sequence of Mel spectrogram acoustic feature frames. The encoder module consists of a character embedding layer with 512 neuron nodes, 3 convolutional neural network (CNN) layers each containing 512 convolution kernels of size 5 x 1, and a bidirectional long short-term memory (LSTM) layer with 256 hidden neuron nodes. The computation of the encoder module is given by equations (1) and (2);

$$F_e = \mathrm{ReLU}(K_3 * \mathrm{ReLU}(K_2 * \mathrm{ReLU}(K_1 * E(Ch)))) \tag{1}$$

$$H = \mathrm{BLSTM}(F_e) \tag{2}$$

where $K_1$, $K_2$ and $K_3$ denote the convolution kernels of the three convolutional neural network layers, ReLU is the nonlinear activation function, $E(\cdot)$ denotes the embedding of the input character sequence $Ch$, $F_e$ is the high-level semantic encoding of the characters output by the convolutional layers, and $H$ is the output of the bidirectional long short-term memory (BLSTM) layer.

The attention module uses a location-sensitive attention mechanism that adds location features during alignment, so it can extract both the content information of the input and the position information of the input elements. Its formal expression is given by equation (3);
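Assuming the standard location-sensitive attention formulation used with Tacotron2, equation (3) can be written in the following form. The energy symbol $e_{i,j}$ is a name chosen here for illustration; the remaining symbols are those defined in the text.

```latex
e_{i,j} = v_a^{\top} \tanh\left( W s_i + V h_j + U f_{i,j} + b \right) \tag{3}
```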

where $v_a$, $W$, $V$, $U$ and $b$ are trainable parameters, $s_i$ is the current decoder hidden state, $h_j$ is the current encoder hidden state, and $f_{i,j}$ is the location feature obtained by applying a convolution to the attention weights $\alpha_{i-1}$;

The decoder module is an autoregressive recurrent neural network. Decoding starts by feeding the spectrogram frame output at the previous step, or the ground-truth spectrogram frame of the previous step, into a preprocessing network, PreNet. The PreNet output is concatenated with the context vector computed from the output of the previous decoding step, and the result is fed into the decoder network. The context vector of the current step is computed from the output of the RNN-based decoder, then concatenated with the decoder output of the current step and fed into a linear projection network to compute the output. After decoding finishes, the predicted Mel spectrogram is passed through a post-processing network to improve its quality;

The vocoder is a modified WaveNet model that converts the generated frequency-domain Mel spectrogram acoustic features into a time-domain speech waveform;

The method uses a trained Tacotron2 multi-speaker speech synthesis model to synthesize speech for the text corpus containing rare words. To reduce the influence of speaker information on the synthesized speech, the experiments synthesize each annotated text with the voice information of multiple speakers, which increases the diversity of the target text corpus;

S4. Add the synthesized rare word corpus to the training set and train the end-to-end acoustic model. First, the multi-speaker speech corpus $X_{synthesis}$ synthesized in step S3 is merged with the training set corpus $X_{train}$ to obtain the extended training set corpus $X_{extension}$, i.e. $X_{extension} = X_{train} \cup X_{synthesis}$.

An end-to-end speech recognition model, LAS, is then constructed and trained. The LAS model consists mainly of an encoder module, an attention module, and a decoder module; its structure is shown in FIG. 2. The encoder uses a bidirectional long short-term memory network to model the temporal relations of the input feature sequence $X_{extension}$, formalized in equation (4);
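One way to write equation (4), borrowing the layer and time indexing of the original LAS formulation, is given below. This notation is an assumption: $h_i^{j}$ denotes the hidden state at time step $i$ of layer $j$, with the layer-0 sequence taken to be the input features $X_{extension}$.

```latex
h_i^{j} = \mathrm{BLSTM}\left( h_{i-1}^{j},\; h_i^{j-1} \right) \tag{4}
```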

three pBLSTM layers are stacked after the BLSTM layer, and the pBLSTM layer is calculated as shown in equation (5);
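Under the same assumed indexing (time step $i$, layer $j$), the pyramid BLSTM of the LAS formulation concatenates two consecutive outputs of the layer below before each step, which halves the time resolution per layer. Equation (5) can then be written as:

```latex
h_i^{j} = \mathrm{pBLSTM}\left( h_{i-1}^{j},\; \left[ h_{2i}^{j-1},\, h_{2i+1}^{j-1} \right] \right) \tag{5}
```

With three such layers stacked, the sequence passed to the attention layer is shorter than the input by a factor of $2^3 = 8$.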

Introducing a context-dependent attention mechanism lets the model focus on learning the semantically salient, context-related features in the sequence, which improves inference accuracy. For the intermediate semantic features $H = (h_1, h_2, \ldots, h_u, \ldots, h_U)$ output by the encoder, the attention layer first computes the weight $\alpha_{i,u}$ of each feature $h_u$ at the $i$-th output time step, as shown in equations (6) and (7);
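Assuming the content-based attention of the original LAS formulation, equations (6) and (7) take the following form. The energy symbol $e_{i,u}$ and the network symbol $\psi$ are assumptions introduced here; $s_i$ is the decoder state of equation (9).

```latex
e_{i,u} = \left\langle \phi(s_i),\, \psi(h_u) \right\rangle \tag{6}

\alpha_{i,u} = \frac{\exp(e_{i,u})}{\sum_{u'=1}^{U} \exp(e_{i,u'})} \tag{7}
```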

where exp is the exponential function with the natural constant $e$ as its base, and $\phi$ and $\psi$ are fully connected neural networks with trainable parameters. The context-dependent semantic feature $c_i$ is a weighted sum over the input sequence and represents the overall semantics of a segment of speech; the weighted sum is given by equation (8);
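Given the description of $c_i$ as a weighted sum over the sequence, and assuming the weights are the $\alpha_{i,u}$ and the summands are the features $h_u$, equation (8) can be written as:

```latex
c_i = \sum_{u=1}^{U} \alpha_{i,u}\, h_u \tag{8}
```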

The decoder network consists of two unidirectional LSTM layers with 512 neural nodes each; its formal expression is given by equations (9) and (10);

$$s_i = \mathrm{LSTM}(s_{i-1}, y_{i-1}, c_{i-1}) \tag{9}$$

$$P(y_i \mid x, y_{<i}) = \mathrm{MLP}(s_i, c_i) \tag{10}$$

where MLP is a fully connected neural network with a Softmax activation function, whose output is the posterior probability of the modeling units;

The encoder, attention layer, and decoder of the LAS model constructed by the invention can be trained jointly end to end; the objective function of the LAS model is given by equation (11);
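Assuming the maximum-likelihood objective of the original LAS formulation, with the posterior of equation (10) conditioned on the ground-truth history, equation (11) can be written as follows. The notation $y^{*}_{<i}$ for the ground-truth tokens is an assumption.

```latex
\max_{\theta_e,\, \theta_a,\, \theta_d} \; \sum_{i} \log P\left( y_i \mid x,\, y^{*}_{<i};\; \theta_e, \theta_a, \theta_d \right) \tag{11}
```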

where $\theta_e$, $\theta_a$ and $\theta_d$ are the model parameters of the encoder module, the attention module, and the decoder module of LAS respectively, and $y^{*}_{<i}$ denotes the ground-truth tokens of the time steps before the $i$-th time step;

S5. Perform speech decoding and language model rescoring. A greedy decoding strategy takes the locally best choice at each position, so the probability of the generated sequence as a whole is not guaranteed to be optimal. In addition, the vocabulary is generally very large in practice, so a search over all possible paths cannot be completed within a limited search time. Beam search is therefore generally used for speech decoding in practice. To let a language model correct the decoding result, the technique also rescores the searched paths with a language model; the formal expression is given by equation (12);
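Assuming the length-normalized rescoring of the original LAS formulation, equation (12) can be written as follows. The score symbol $s(y \mid x)$ and the acoustic term $\log P(y \mid x)$ are names assumed here; the other symbols are those defined in the text.

```latex
s(y \mid x) = \frac{\log P(y \mid x)}{|y|_c} + \lambda \log P_{LM}(y) \tag{12}
```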

where $|y|_c$ denotes the number of characters, $\log P_{LM}(y)$ denotes the language model score, and $\lambda$ denotes the language model score weight, which can be determined on a validation set. In practice, the decoding search uses beam search with a beam width of 32 and a language model score weight $\lambda$ of 0.008.
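The rescoring step can be illustrated with a short Python sketch. It assumes the score has the length-normalized form $\log P(y \mid x) / |y|_c + \lambda \log P_{LM}(y)$, and that beam search has already produced a list of hypotheses with their acoustic and language model log-probabilities. The function name `rescore`, the tuple layout, and the toy numbers are hypothetical.

```python
def rescore(hypotheses, lm_weight=0.008):
    """Pick the best hypothesis after language model rescoring.

    hypotheses: list of (text, acoustic_log_prob, lm_log_prob) tuples,
    e.g. the paths kept by a beam search.
    """
    def final_score(hypothesis):
        text, log_p_acoustic, log_p_lm = hypothesis
        n_chars = max(len(text), 1)  # |y|_c, guarded against empty text
        # Length-normalized acoustic score plus weighted language model score.
        return log_p_acoustic / n_chars + lm_weight * log_p_lm

    return max(hypotheses, key=final_score)[0]


# Hypothetical beam: the second path is slightly better acoustically,
# but the language model finds it much less likely.
beam = [
    ("play zydeco", -4.4, -20.0),
    ("play psycho", -4.2, -70.0),
]

print(rescore(beam))                 # language model weight 0.008
print(rescore(beam, lm_weight=0.0))  # acoustic score only
```

In the toy beam the language model term reverses the ranking that the acoustic score alone would give, which is the correction the rescoring is meant to provide.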

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
