End-to-end multi-channel speech recognition method using advanced feature fusion

Document No. 1364299 · Published 2020-08-11

Abstract: This technique, "End-to-end multi-channel speech recognition method using advanced feature fusion" (采用高级特征融合的端到端多通道语音识别方法), was designed by Guo Wu (郭武) and Liu Tan (刘谭) on 2020-05-28. Its main content: the invention discloses an end-to-end multi-channel speech recognition method using high-level feature fusion, comprising: for multi-channel speech input, encoding each channel's speech input independently with a separate encoder, using as many encoders as there are channels; each encoder is a neural network with a multi-layer pyramid structure, and the acoustic feature sequence output by the last layer of the network is called a high-level feature sequence; for the high-level feature sequence of each channel, computing the corresponding attention weight with a scoring function, thereby fusing the high-level feature sequences of all channels into one enhanced high-level feature sequence; and inputting the enhanced high-level feature sequence into a decoder, which computes the probability distribution of the current character from the previously predicted characters and the input enhanced high-level feature sequence, finally obtaining the recognition result. The method achieves a higher recognition rate than single-channel speech input.

1. An end-to-end multi-channel speech recognition method using advanced feature fusion, comprising:

for multi-channel speech input, encoding each channel's speech input independently with a separate encoder, using as many encoders as there are channels; each encoder is a neural network with a multi-layer pyramid structure, and the acoustic feature sequence output by the last layer of the network is called a high-level feature sequence; for the high-level feature sequence of each channel, computing the corresponding attention weight with a scoring function, thereby fusing the high-level feature sequences of all channels into one enhanced high-level feature sequence;

and inputting the enhanced high-level feature sequence into a decoder, which computes the probability distribution of the current character from the previously predicted characters and the input enhanced high-level feature sequence, finally obtaining the recognition result of the enhanced high-level feature sequence.

2. The method of claim 1, wherein the encoder is a bidirectional long short-term memory network with a multi-layer pyramid structure in which each layer halves the number of frames, so that the top layer operates at 1/8 of the input frame rate;

hidden state at ith time of jth layer of any encoderFrom j (th)Hidden state at layer time i-1And hidden state at 2i time of j-1 th layerHidden state from the 2i +1 th time

The hidden state sequence output by the last layer is written $h_1, h_2, \ldots, h_U$, where $U$ is the sequence length, i.e., the total number of time steps, and the high-level feature sequence is $H = \{h_1, h_2, \ldots, h_U\}$.

3. The end-to-end multi-channel speech recognition method using advanced feature fusion of claim 1, wherein computing the corresponding attention weight with a scoring function for the high-level feature sequence of each channel, thereby fusing the high-level feature sequences of all channels into one enhanced high-level feature sequence, comprises:

denoting the high-level feature sequence of the $l$-th channel as $H^l = \{h_1^l, h_2^l, \ldots, h_U^l\}$; computing an attention weight for each high-level feature $h_u^l$, and then weight-summing the high-level features of all channels to obtain the enhanced high-level feature sequence:

$$e_u^l = Z(h_u^l, \alpha_{u-1}^l), \qquad \alpha_u^l = \frac{\exp(e_u^l)}{\sum_{c=1}^{C} \exp(e_u^c)}, \qquad m_u = \sum_{l=1}^{C} \alpha_u^l\, h_u^l$$

where $C$ is the total number of channels, $Z$ is a scoring function, and $e_u^l$ is the score of the high-level feature $h_u^l$ computed by $Z$; $u = 1, 2, \ldots, U$, with $U$ the length of the enhanced high-level feature sequence; $m_u$ is a high-level feature of the enhanced high-level feature sequence $M = \{m_1, m_2, \ldots, m_U\}$.

4. The end-to-end multi-channel speech recognition method using advanced feature fusion of claim 3, wherein the scoring function is implemented by a neural network, expressed as:

$$e_u^l = W_f\, \sigma\!\big(W_h h_u^l + W_a \alpha_{u-1}^l\big)$$

where $\alpha_{u-1}^l$ is the attention weight of the previous high-level feature, $\sigma$ is a nonlinear function, and $W_*$ denotes a weight parameter, $* \in \{f, h, a\}$.

Technical Field

The invention relates to the field of speech signal processing, and in particular to an end-to-end multi-channel speech recognition method using high-level feature fusion.

Background

In recent years, with the widespread use of neural networks in speech recognition, the performance of speech recognition systems has improved significantly. At present there are two main types of systems: HMM-based speech recognition systems and end-to-end speech recognition systems. Compared with an HMM-based system, an end-to-end system has a simpler structure: it converts the input speech feature sequence directly into a character sequence through a neural network, and needs none of the pronunciation dictionary, decision trees, or character-level alignment information of the HMM system. Owing to its simple implementation and excellent performance, it is a hotspot of current research.

The attention-based encoder-decoder framework is the mainstream structure in end-to-end speech recognition. It comprises an encoding network, a decoding network, and an attention network. The encoding network first converts the input acoustic feature sequence into a high-level feature sequence; the attention network then computes attention weights, i.e., degrees of correlation, between the current decoder position and each element of the high-level feature sequence, and weight-sums the elements into a context vector; finally, the decoding network predicts the label distribution probability of the current position from the previous prediction result and the context vector.

Speech recognition systems have already achieved very high accuracy on near-field clean speech and have entered practical use, but they perform poorly on far-field speech recognition tasks. A multi-channel speech recognition system exploits the information collected by every microphone to enhance the signal and improve far-field recognition accuracy, and is therefore widely applied to far-field tasks. The traditional approach to combining multi-channel speech is based on speech enhancement: a beamforming algorithm, such as delay-and-sum or minimum variance distortionless response (MVDR), enhances the multi-channel speech signal. However, these algorithms require prior knowledge of the microphone array, such as the array geometry and the distance to the sound source, and their objective is not speech recognition accuracy.

Attention-based multi-channel speech fusion has been applied to speech recognition, e.g., in Braun S., Neil D., Anumula J., et al., "Multi-channel attention for end-to-end speech recognition," Interspeech 2018: 17-21. There, fusion is performed at the acoustic feature level: each channel's speech is assigned a weight according to the quality of its acoustic features, and the acoustic features of all channels are weight-summed into an enhanced acoustic feature, which is input into an end-to-end speech recognition system. Compared with training and recognizing each channel independently, this improves recognition accuracy to some extent. However, deep networks suffer from internal covariate shift: the differences between channels' features change as the network deepens. Simply fusing the features of different channels at the input feature layer therefore cannot exploit the deep-feature information of each channel.

Disclosure of Invention

The invention aims to provide an end-to-end multi-channel speech recognition method using high-level feature fusion, which takes multi-channel speech signals as input within an end-to-end recognition framework to perform speech recognition, achieving a higher recognition rate than single-channel speech input.

The purpose of the invention is realized by the following technical scheme:

an end-to-end multi-channel speech recognition method using advanced feature fusion, comprising:

for multi-channel speech input, encoding each channel's speech input independently with a separate encoder, using as many encoders as there are channels; each encoder is a neural network with a multi-layer pyramid structure, and the acoustic feature sequence output by the last layer of the network is called a high-level feature sequence; for the high-level feature sequence of each channel, computing the corresponding attention weight with a scoring function, thereby fusing the high-level feature sequences of all channels into one enhanced high-level feature sequence;

and inputting the enhanced high-level feature sequence into a decoder, which computes the probability distribution of the current character from the previously predicted characters and the input enhanced high-level feature sequence, finally obtaining the recognition result of the enhanced high-level feature sequence.

According to the technical scheme provided by the invention, the attention mechanism dynamically assigns an attention weight to the high-level features of each channel and weight-sums the high-level features of all channels, so that the channels' high-level features complement one another to form an enhanced high-level feature sequence and recognition performance is improved.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a diagram of a typical LAS architecture provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of an end-to-end multi-channel speech recognition method using advanced feature fusion according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of multi-channel high-level feature fusion provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

Owing to noise and echo, the high-level features that single-channel far-field speech produces through the encoder are corrupted, and recognition accuracy drops. Acoustic modeling for speech recognition trains a probability model on a large amount of speech data; test speech entering the model is decoded into the corresponding text. In practical applications a microphone array collects speech from several channels simultaneously. Using the multi-channel speech signal for recognition modeling can in theory improve recognition accuracy, and the key is how to exploit the multi-channel signal in the modeling.

At present, deep-learning-based speech recognition systems are the mainstream, and the end-to-end approach, i.e., recognition that directly takes speech in and puts text out, is simple to implement, roughly equal or superior to traditional methods in performance, and fast to decode, making it a research hotspot; end-to-end recognizers with single-channel speech input have largely matured. The invention works within the end-to-end recognition framework and takes the multi-channel speech signal as input to perform speech recognition, thereby achieving a higher recognition rate than single-channel input.

The end-to-end recognition framework here is the encoder-decoder-based end-to-end framework, specifically an end-to-end system using the attention mechanism, also known as the LAS (Listen, Attend and Spell) framework. The method fuses the multi-channel speech input inside the encoder of the LAS framework: through the attention mechanism, the different channels are weighted and combined into a better encoded input, yielding a signal superior to any single-channel input and a higher recognition accuracy.

The 'encoding-decoding' frame (Encoder-Decoder) is composed of an Encoder (Encoder) and a Decoder (Decoder), is an end-to-end frame structure and directly performs sequence conversion. In the training of a speech recognition model, inputting acoustic characteristic parameters corresponding to a section of speech, and outputting texts corresponding to the section of speech; in the recognition and decoding, the trained model is input into acoustic characteristic parameters corresponding to the voice, and a corresponding text can be obtained through a search algorithm. LSTM networks are commonly used as encoders and decoders in speech recognition.

In the encoder-decoder framework, the input to the Encoder is the acoustic feature sequence $X = \{x_1, x_2, \ldots, x_T\}$; the Encoder encodes this raw acoustic feature sequence into the high-level feature sequence $H = \{h_1, h_2, \ldots, h_U\}$:

$$H = \mathrm{Encoder}(X)$$

At each time step, the Decoder predicts the probability distribution of the current label from the Encoder output $H$ and the previous label $y_{i-1}$:

$$c_i = \mathrm{AttentionContext}(s_i, H)$$

$$P(y_i \mid X, y_{<i}) = \mathrm{Decoder}(y_{i-1}, c_i)$$

where $c_i$ is the context vector and $s_i$ is the hidden state of the Decoder at the current time; the AttentionContext function computes attention weights between $s_i$ and each element of the Encoder output $H$, and weight-sums the $h_u$ into $c_i$:

$$e_{i,u} = \langle s_i, h_u \rangle, \qquad \alpha_{i,u} = \frac{\exp(e_{i,u})}{\sum_{u'} \exp(e_{i,u'})}, \qquad c_i = \sum_u \alpha_{i,u}\, h_u$$

where $\langle \cdot, \cdot \rangle$ is a function computing the correlation between $s_i$ and $h_u$, and $\alpha_{i,u}$ is the attention weight corresponding to $h_u$.
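
For concreteness, the following is a minimal PyTorch sketch of a dot-product AttentionContext. The function name and tensor shapes are illustrative assumptions, not the patent's prescribed implementation:

```python
import torch
import torch.nn.functional as F

def attention_context(s, H):
    """Dot-product attention: score each h_u against s, softmax, weight-sum.

    s: (batch, hidden)     decoder hidden state s_i
    H: (batch, U, hidden)  encoder high-level feature sequence
    returns c: (batch, hidden), the context vector c_i
    """
    e = torch.bmm(H, s.unsqueeze(-1)).squeeze(-1)    # (batch, U): e_{i,u} = <s_i, h_u>
    alpha = F.softmax(e, dim=-1)                     # (batch, U): attention weights
    c = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)  # (batch, hidden): sum_u alpha_{i,u} h_u
    return c
```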

The LAS (Listen, Attend and Spell) architecture is a typical encoder-decoder framework that can be used for many sequence transduction tasks, typically speech recognition and machine translation. As shown in FIG. 1, LAS comprises two components: the Listener corresponds to the Encoder and the Speller to the Decoder. The input is the speech feature sequence $X = \{x_1, x_2, \ldots, x_T\}$ and the output is the corresponding text sequence $Y = \{y_1, y_2, \ldots, y_S\}$.

The Listener uses three layers of pyramid-structured BLSTM (pBLSTM); each layer halves the number of frames, so the top layer operates at only 1/8 of the input frame rate. The hidden state $h_i^j$ at time $i$ of layer $j$ is computed from the hidden state $h_{i-1}^j$ at time $i-1$ of layer $j$ and the hidden states $h_{2i}^{j-1}$ and $h_{2i+1}^{j-1}$ at times $2i$ and $2i+1$ of layer $j-1$:

$$h_i^j = \mathrm{pBLSTM}\big(h_{i-1}^j,\ [h_{2i}^{j-1}, h_{2i+1}^{j-1}]\big)$$
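
As an illustration, here is a minimal PyTorch sketch of a single pBLSTM layer under assumed shapes; stacking three such layers yields the 1/8 frame rate described above:

```python
import torch.nn as nn

class PBLSTMLayer(nn.Module):
    """One pyramid-BLSTM layer: concatenate each pair of adjacent frames
    from the layer below (halving the frame count), then run a BLSTM."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim,
                             bidirectional=True, batch_first=True)

    def forward(self, x):                    # x: (batch, T, input_dim)
        b, t, d = x.shape
        if t % 2:                            # drop a trailing odd frame
            x, t = x[:, :-1, :], t - 1
        x = x.reshape(b, t // 2, d * 2)      # [h_{2i}, h_{2i+1}] pairs
        out, _ = self.blstm(x)               # (batch, T/2, 2 * hidden_dim)
        return out
```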

The Speller uses two layers of BLSTM, at each step computing the distribution probability $P(y_i)$ of the current character from the previously output character and the output of the Listener:

$$c_i = \mathrm{AttentionContext}(s_i, H)$$

$$s_i = \mathrm{RNN}(s_{i-1}, y_{i-1}, c_{i-1})$$

$$P(y_i \mid X, y_{<i}) = \mathrm{CharacterDistribution}(s_i, c_i)$$

where $s_i$ is the hidden state of the Speller at the current time and $y_{i-1}$ is the previously predicted character. The CharacterDistribution function is a multi-layer perceptron with a softmax output layer, and the RNN function is a two-layer LSTM.
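
A minimal PyTorch sketch of one Speller step follows. For brevity it uses a single LSTMCell instead of the two-layer LSTM, and all names and dimensions are illustrative assumptions (the context dimension is assumed equal to the hidden dimension):

```python
import torch
import torch.nn as nn

class SpellerStep(nn.Module):
    """One decoding step: s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1}),
    then P(y_i) = CharacterDistribution(s_i, c_i)."""

    def __init__(self, hidden_dim, embed_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.char_dist = nn.Linear(2 * hidden_dim, vocab_size)  # MLP + softmax

    def forward(self, y_prev, c_prev, state, H, attend):
        emb = self.embed(y_prev)                         # (batch, embed_dim)
        s, cell = self.rnn(torch.cat([emb, c_prev], -1), state)
        c = attend(s, H)                                 # AttentionContext(s_i, H)
        logits = self.char_dist(torch.cat([s, c], -1))
        return torch.log_softmax(logits, -1), c, (s, cell)
```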

The end-to-end multi-channel speech recognition method using high-level feature fusion is implemented on top of the LAS structure. Specifically, the method comprises the following steps:

for multi-channel speech input, encoding each channel's speech input independently with a separate encoder, using as many encoders as there are channels; each encoder is a neural network with a multi-layer pyramid structure, and the acoustic feature sequence output by the last layer of the network is called a high-level feature sequence; for the high-level feature sequence of each channel, computing the corresponding attention weight with a scoring function, thereby fusing the high-level feature sequences of all channels into one enhanced high-level feature sequence;

and inputting the enhanced high-level feature sequence into a decoder, which computes the probability distribution of the current character from the previously predicted characters and the input enhanced high-level feature sequence, finally obtaining the recognition result of the enhanced high-level feature sequence.

FIG. 2 is a schematic diagram of the scheme. The encoder is a bidirectional long short-term memory network with a multi-layer pyramid structure; its implementation and principles are the same as those of the encoder (the Listener component) in FIG. 1 and are not repeated here.

In the embodiment of the invention, the hidden state sequence output by the last layer is written $h_1, h_2, \ldots, h_U$, where $U$ is the sequence length, i.e., the total number of time steps, and the high-level feature sequence is $H = \{h_1, h_2, \ldots, h_U\}$. Since the embodiment considers multi-channel input, the high-level feature sequence of the $l$-th channel is denoted $H^l = \{h_1^l, h_2^l, \ldots, h_U^l\}$, and by automatically selecting the weights, the channels' high-level feature sequences are fused to produce a more robust high-level feature sequence (the enhanced high-level feature sequence).

FIG. 3 is a schematic diagram of multi-channel high-level feature fusion; it shows only two channels as an example. In practical applications the number of channels $C$ is chosen according to the actual situation, and multi-channel high-level feature fusion is implemented according to the principle shown in FIG. 3.

During multi-channel high-level feature fusion, the speech features $X^l$ of each channel are input into the corresponding Encoder to obtain the corresponding high-level feature sequence $H^l$:

$$H^l = \mathrm{Encoder}(X^l)$$

For each high-level feature $h_u^l$, an attention weight is computed, and then the high-level features of all channels are weight-summed to obtain the enhanced high-level feature sequence:

$$e_u^l = Z(h_u^l, \alpha_{u-1}^l), \qquad \alpha_u^l = \frac{\exp(e_u^l)}{\sum_{c=1}^{C} \exp(e_u^c)}, \qquad m_u = \sum_{l=1}^{C} \alpha_u^l\, h_u^l$$

where $C$ is the total number of channels, $Z$ is a scoring function, and $e_u^l$ is the score of the high-level feature $h_u^l$ computed by $Z$; the attention weight $\alpha_u^l$ is obtained by passing the scores of all channels through the softmax function, so that $\sum_{l=1}^{C} \alpha_u^l = 1$; $u = 1, 2, \ldots, U$, with $U$ the length of the enhanced high-level feature sequence; $m_u$ is a high-level feature of the enhanced high-level feature sequence $M = \{m_1, m_2, \ldots, m_U\}$.
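
The fusion can be sketched in PyTorch as below. Because the score at step u depends on the attention weights of step u-1, the loop runs sequentially over u; the uniform initialisation of the weights and the function interface are assumptions:

```python
import torch
import torch.nn.functional as F

def fuse_channels(H_all, score_fn):
    """Fuse per-channel high-level features into the enhanced sequence M.

    H_all:    (C, U, dim)  high-level feature sequences of all C channels
    score_fn: scoring function Z(h_u, alpha_prev) -> (C,) scores
    returns M: (U, dim), the enhanced high-level feature sequence
    """
    C, U, dim = H_all.shape
    alpha_prev = H_all.new_full((C,), 1.0 / C)      # assumed initial weights
    M = []
    for u in range(U):
        e = score_fn(H_all[:, u, :], alpha_prev)    # (C,) scores e_u^l
        alpha = F.softmax(e, dim=0)                 # weights sum to 1 over channels
        M.append((alpha.unsqueeze(-1) * H_all[:, u, :]).sum(dim=0))
        alpha_prev = alpha
    return torch.stack(M)                           # (U, dim): m_1 ... m_U
```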

In an embodiment of the present invention, the scoring function may be implemented by a neural network, and the neural network may include three linear layers and one non-linear layer:

$$e_u^l = W_f\, \sigma\!\big(W_h h_u^l + W_a \alpha_{u-1}^l\big)$$

where $\alpha_{u-1}^l$ is the attention weight of the previous high-level feature; adjacent high-level features are related, and introducing the previous feature's attention weight makes the computation of the current attention weight more accurate. $W_*$ denotes a weight parameter, $* \in \{f, h, a\}$, and $\sigma$ is a nonlinear function. As can be seen, the scoring function first maps $\alpha_{u-1}^l$ and $h_u^l$ into the same Dms-dimensional space (Dms: dimension of mapping space) and adds them, then applies the nonlinear function and maps the result to a score.
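
A PyTorch sketch of this scoring network, matching the fuse_channels interface above, might look as follows; the choice of tanh as the nonlinearity and the bias-free linear layers are assumptions:

```python
import torch
import torch.nn as nn

class ScoreZ(nn.Module):
    """Scoring function Z: three linear maps and one nonlinearity,
    e_u^l = W_f * tanh(W_h h_u^l + W_a alpha_{u-1}^l)."""

    def __init__(self, feat_dim, dms=300):
        super().__init__()
        self.W_h = nn.Linear(feat_dim, dms, bias=False)  # h_u^l -> Dms space
        self.W_a = nn.Linear(1, dms, bias=False)         # alpha_{u-1}^l -> Dms space
        self.W_f = nn.Linear(dms, 1, bias=False)         # Dms -> scalar score

    def forward(self, h_u, alpha_prev):
        # h_u: (C, feat_dim); alpha_prev: (C,)
        z = torch.tanh(self.W_h(h_u) + self.W_a(alpha_prev.unsqueeze(-1)))
        return self.W_f(z).squeeze(-1)                   # (C,) scores
```

With these pieces, `M = fuse_channels(torch.stack([H1, H2]), ScoreZ(feat_dim=1024))` would fuse two channels' 1024-dimensional encoder outputs (all names hypothetical).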

Then the enhanced high-level feature sequence $M$ is input to the Decoder for decoding to obtain the corresponding text; the principle is the same as that of the Decoder in the LAS structure described above, namely:

the distribution probability of the current character is computed from the previously output character and the enhanced high-level feature sequence $M$:

$$c_u = \mathrm{AttentionContext}(s_u, M)$$

$$s_u = \mathrm{RNN}(s_{u-1}, y_{u-1}, c_{u-1})$$

$$P(y_u \mid X, y_{<u}) = \mathrm{CharacterDistribution}(s_u, c_u)$$

where $c_u$ is the context vector, $s_u$ is the hidden state of the decoder at the current time, and $y_{u-1}$ is the previously predicted character; the CharacterDistribution function is a multi-layer perceptron with a softmax output layer, and the RNN function is a two-layer LSTM. When $u = 1$, the initial $s_0$ and $c_0$ take random values, and $X$ contains the speech features of all channel inputs, $X = \{X^1, X^2, \ldots, X^C\}$.
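
To illustrate the decoding loop, here is a greedy-search sketch built on the hypothetical SpellerStep and attention_context above; a real system would use beam search, and the zero initialisation of c_0 stands in for the random initial values mentioned above:

```python
import torch

def greedy_decode(step, attend, M, sos_id, eos_id, max_len=200):
    """Greedily decode the enhanced sequence M into character ids."""
    M = M.unsqueeze(0)                  # add batch dim: (1, U, dim)
    y = torch.tensor([sos_id])          # y_0 = <sos>
    c = torch.zeros(1, M.size(-1))      # c_0 (assumes dim == hidden_dim)
    state, hyp = None, []
    for _ in range(max_len):
        log_p, c, state = step(y, c, state, M, attend)
        y = log_p.argmax(dim=-1)        # most probable current character
        if y.item() == eos_id:          # stop at <eos>
            break
        hyp.append(y.item())
    return hyp
```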

The sos and eos in the decoders of FIGs. 1 and 2 abbreviate start of sequence and end of sequence, respectively, and must be marked at the beginning and end of each sentence during training; furthermore, since the input and output sequence lengths are not necessarily equal, different indices are used for them.

Compared with the traditional end-to-end speech recognition modeling method, the scheme provided by the embodiment of the invention mainly has the following advantages:

1) Compared with traditional beamforming algorithms, the method dynamically assigns an attention weight to each channel according to the quality of its high-level features, selectively extracting the high-quality high-level features, so that the fused high-level features are of higher quality and the recognition performance of the system improves. At the same time, the weights of each channel's high-level features are derived automatically by the attention mechanism, without any prior information about the microphone array.

2) Compared with a multi-channel speech recognition system based on acoustic feature fusion, the method exploits the high-level feature information of each channel, which is more robust than the bottom-layer feature information.

Those skilled in the art will understand that high-level and low-level feature information are relative concepts: the neural network is a multi-layer structure, the traditional scheme fuses the features output by the low layers, and high-level feature information is the feature information output by the high layers.

As described above, in the scheme provided by the embodiment of the invention, the high-level features generated by the encoders are fused rather than the bottom-layer acoustic features, so the fused high-level features are more stable; for the fusion weights, an attention mechanism dynamically generates the weight of each channel, achieving automatic channel selection without any information about the microphone array. To verify the effectiveness of the proposed method, the following experiments were designed.

1. Experimental setup

Experiments were performed on the Chinese dataset King-ASR-120, selecting the speech from two microphones. All speech data are stored at a 16 kHz sampling rate in 16-bit quantized format. Chinese characters are used as the modeling unit; the dictionary built from the transcription texts contains 3896 units in total. 66318 utterances were chosen as the training set, 4319 as the development set, and 5200 as the test set.

The acoustic feature used in this experiment is a 108-dimensional MFCC feature, formed from a 36-dimensional MFCC feature combined with its first- and second-order differences. PyTorch and Kaldi were used as the experimental platform, and the performance of the proposed method was explored by comparing the experimental results of different models.
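
For reference, the 108-dimensional feature could be produced as in the following sketch (librosa is used here for brevity; the window and hop parameters are left at library defaults, which is an assumption since the text does not specify them):

```python
import librosa
import numpy as np

wav, sr = librosa.load("utt.wav", sr=16000)           # 16 kHz, as in the dataset
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=36)  # (36, frames)
d1 = librosa.feature.delta(mfcc, order=1)             # first-order differences
d2 = librosa.feature.delta(mfcc, order=2)             # second-order differences
feats = np.concatenate([mfcc, d1, d2], axis=0)        # (108, frames)
```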

2. Experimental results

A total of 4 system models were tested: LAS, LAS-AF (LAS based on acoustic feature fusion), LAS-AVG (LAS based on acoustic feature fusion using average weights), and LAS-HLF (LAS based on high-level feature fusion). All models contain an LAS structure with the same parameters: the Listener consists of three layers of bidirectional LSTM with 512 hidden nodes per layer; the Speller consists of two layers of bidirectional LSTM and a fully connected layer, with 1024 hidden nodes per LSTM layer and 3898 output nodes in the fully connected layer. All learnable parameters are optimized with the ADAM optimizer.

1) The LAS model uses the standard encoder-decoder mechanism for speech recognition on single-channel data; its results serve as the baseline.

2) LAS-AF is the structure proposed in Braun S., Neil D., Anumula J., et al., "Multi-channel attention for end-to-end speech recognition," Interspeech 2018: 17-21, a multi-channel speech recognition system that fuses at the acoustic feature level and consists of a front-end feature enhancement part and a back-end recognition part. The front end assigns a weight to each channel's acoustic features with an attention mechanism and weight-sums the acoustic features of all channels into an enhanced acoustic feature, which is then sent to the back end for recognition. The back-end recognition part uses the LAS architecture.

3) LAS-AVG has the same structure as LAS-AF, except that the front-end feature enhancement part fixes the attention weight of each channel's acoustic features to the same value 1/C, with C the total number of channels; it serves simply as a comparison system.

4) LAS-HLF is the system corresponding to the multi-channel speech recognition method based on high-level feature fusion proposed in the invention (i.e., the structure shown in FIG. 2).

The experimental results of the system models are shown in Table 1, where "CH1" and "CH2" denote the data of the first and second channels, respectively; the parameters of every model were tuned to their optimum. The character error rate (CER%) measures system performance; a smaller value indicates better recognition.

Model     Training data   Test data   CER (%)
LAS       CH1             CH1         17.75
LAS       CH2             CH2         15.32
LAS-AVG   CH1, CH2        CH1, CH2    15.86
LAS-AF    CH1, CH2        CH1, CH2    14.09
LAS-HLF   CH1, CH2        CH1, CH2    13.47

TABLE 1. Experimental results of different system models

The LAS-AVG model performs poorly because it simply averages the acoustic features of the two channels, a naive algorithm. The LAS-AF model fuses at the acoustic feature level, exploiting the multi-channel speech information and dynamically assigning an attention weight to each channel's acoustic features; its character error rate drops to 14.09%, better than training and recognizing each channel's data independently. Compared with LAS-AF, LAS-HLF exploits the information in the high-level features and further improves recognition performance, reducing the CER by a further 0.62 percentage points.

In this experiment the attention weight was computed with the formula given above: the attention weight of the previous high-level feature and the current high-level feature are mapped into the same Dms-dimensional space, added, and then mapped to a one-dimensional score. As a hyperparameter, Dms has a direct impact on the experimental results; Table 2 shows the results for different values of Dms.

Dms   CER (%)
250   14.19
300   13.47
384   13.68
512   13.81
768   14.17

TABLE 2. Experimental results for different Dms values

It can be seen that the system performs best with Dms = 300, giving a CER of 13.47%.

Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.

It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
