Multimodal speech recognition method, system and computer readable storage medium

Document No.: 344510; publication date: 2021-12-03

Reading note: this technology, "Multimodal speech recognition method, system and computer readable storage medium" (多模态语音识别方法、系统及计算机可读存储介质), was created by 林峰, 刘天天, 高铭, 王超, 巴钟杰, 韩劲松, 许文曜 and 任奎 on 2021-08-10. Main content: the invention discloses a multimodal speech recognition method, system and computer-readable storage medium. The method comprises: when the target millimeter wave signal and the target audio signal both contain voice information corresponding to the target user, calculating a first and a second logarithmic Mel spectral coefficient and inputting them into a fusion network to determine a target fusion feature; the fusion network comprises at least a calibration module and a mapping module, the calibration module performing mutual feature calibration between the target audio signal and the target millimeter wave signal, and the mapping module fusing the calibrated millimeter wave features with the calibrated audio features; and inputting the target fusion feature into a semantic feature network to determine the speech recognition result corresponding to the target user. The invention achieves high-accuracy speech recognition.

1. A method of multimodal speech recognition, comprising:

acquiring a target millimeter wave signal and a target audio signal;

when the target millimeter wave signal and the target audio signal both contain the voice information corresponding to the target user, calculating a first logarithmic Mel spectral coefficient and a second logarithmic Mel spectral coefficient; the first logarithmic mel-frequency spectral coefficient is determined according to the target millimeter wave signal, and the second logarithmic mel-frequency spectral coefficient is determined according to the target audio signal;

inputting the first logarithmic Mel spectral coefficient and the second logarithmic Mel spectral coefficient into a fusion network to determine a target fusion feature; the fusion network at least comprises a calibration module and a mapping module; the calibration module is used for carrying out feature calibration processing on the target millimeter wave signal according to the target audio signal and carrying out feature calibration processing on the target audio signal according to the target millimeter wave signal, so as to obtain calibrated millimeter wave features and calibrated audio features; the mapping module is used for carrying out fusion processing on the calibrated millimeter wave features and the calibrated audio features to obtain the target fusion feature;

and inputting the target fusion feature into a semantic feature network to determine a voice recognition result corresponding to the target user.

2. The multi-modal speech recognition method according to claim 1, wherein the obtaining of the target millimeter wave signal and the target audio signal specifically comprises:

acquiring a target millimeter wave signal acquired by a millimeter wave radar;

and acquiring a target audio signal collected by a microphone.

3. The method according to claim 1, wherein when the target millimeter wave signal and the target audio signal each contain human voice information corresponding to a target user, the calculating a first logarithmic mel-frequency spectrum coefficient and a second logarithmic mel-frequency spectrum coefficient specifically comprises:

judging whether the target millimeter wave signal and the target audio signal both comprise voice information to obtain a first judgment result;

if the first judgment result shows that the target millimeter wave signal and the target audio signal both comprise voice information, judging whether the target millimeter wave signal and the target audio signal both come from a target user to obtain a second judgment result;

and if the second judgment result shows that the target millimeter wave signal and the target audio signal are both from a target user, respectively performing short-time Fourier transform processing on the target millimeter wave signal and the target audio signal to determine a first logarithmic Mel spectral coefficient and a second logarithmic Mel spectral coefficient.

4. The multi-modal speech recognition method according to claim 3, wherein the determining whether the target millimeter wave signal and the target audio signal both include voice information to obtain a first determination result specifically comprises:

respectively preprocessing the target millimeter wave signal and the target audio signal;

performing fast Fourier transform processing on the preprocessed target millimeter wave signal to extract a millimeter wave phase signal;

carrying out differential processing on the millimeter wave phase signals to extract millimeter wave phase difference signals;

multiplying the preprocessed target audio signal by the millimeter wave phase difference signal to obtain a target product component;

calculating the spectral entropy of the target product component;

judging whether the spectral entropy is larger than a set threshold value;

when the spectral entropy is larger than a set threshold value, the target millimeter wave signal and the target audio signal both include human voice information.

5. The multi-modal speech recognition method according to claim 4, wherein the determining whether the target millimeter wave signal and the target audio signal are both from a target user specifically comprises:

processing the target product component to extract a target linear prediction coding component;

inputting the target linear prediction coding component into a trained one-class support vector machine to judge whether the target millimeter wave signal and the target audio signal are both from the target user;

wherein the trained one-class support vector machine is obtained by training a one-class support vector machine on training data; the training data comprises a plurality of calibration product components and a label corresponding to each calibration product component; the label is the calibration user; and each calibration product component is a product component determined according to the millimeter wave signal and the audio signal corresponding to the calibration user.

6. A multimodal speech recognition method according to claim 1, wherein the fusion network further comprises two identical branch networks, a first branch network and a second branch network; each branch network comprises a first ResECA block, a second ResECA block, a third ResECA block, a fourth ResECA block and a fifth ResECA block;

wherein an input end of the calibration module is respectively connected with an output end of a third ResECA block of the first branch network and an output end of a third ResECA block of the second branch network; the output end of the calibration module is respectively connected with the input end of a fourth ResECA block of the first branch network and the input end of a fourth ResECA block of the second branch network;

an input of a first ResECA block of the first branch network is for inputting the first logarithmic mel-frequency spectral coefficients; the output end of a first ResECA block of the first branch network is connected with the input end of a second ResECA block of the first branch network, the output end of the second ResECA block of the first branch network is connected with the input end of a third ResECA block of the first branch network, and the output end of a fourth ResECA block of the first branch network is connected with the input end of a fifth ResECA block of the first branch network;

the input end of the first ResECA block of the second branch network is used for inputting the second logarithmic Mel frequency spectrum coefficient; the output end of the first ResECA block of the second branch network is connected with the input end of the second ResECA block of the second branch network, the output end of the second ResECA block of the second branch network is connected with the input end of the third ResECA block of the second branch network, and the output end of the fourth ResECA block of the second branch network is connected with the input end of the fifth ResECA block of the second branch network;

the input end of the mapping module is respectively connected with the output end of the fifth ResECA block of the first branch network and the output end of the fifth ResECA block of the second branch network.

7. The method according to claim 6, wherein the processing procedure of the calibration module comprises:

calculating a first channel feature distribution according to a first intermediate feature; the first intermediate feature is the signal output by the third ResECA block of the first branch network;

calculating a second channel feature distribution according to a second intermediate feature; the second intermediate feature is the signal output by the third ResECA block of the second branch network;

calibrating the first intermediate feature according to the second channel feature distribution;

calibrating the second intermediate feature according to the first channel feature distribution.

8. The multi-modal speech recognition method of claim 1, wherein the mapping module processes:

calculating a first similarity matrix according to the calibrated millimeter wave features;

calculating a second similarity matrix according to the calibrated audio features;

respectively carrying out normalization processing on the first similarity matrix and the second similarity matrix;

calculating a first attention feature according to the first similarity matrix after normalization processing;

calculating a second attention feature according to the second similarity matrix after normalization processing;

and calculating a target fusion feature according to the first attention feature and the second attention feature.

9. A multimodal speech recognition system, comprising:

the signal acquisition module is used for acquiring a target millimeter wave signal and a target audio signal;

the logarithmic Mel spectral coefficient calculating module is used for calculating a first logarithmic Mel spectral coefficient and a second logarithmic Mel spectral coefficient when the target millimeter wave signal and the target audio signal both contain voice information corresponding to a target user; the first logarithmic mel-frequency spectral coefficient is determined according to the target millimeter wave signal, and the second logarithmic mel-frequency spectral coefficient is determined according to the target audio signal;

the target fusion feature determining module is used for inputting the first logarithmic Mel spectral coefficient and the second logarithmic Mel spectral coefficient into a fusion network so as to determine a target fusion feature; the fusion network at least comprises a calibration module and a mapping module; the calibration module is used for carrying out feature calibration processing on the target millimeter wave signal according to the target audio signal and carrying out feature calibration processing on the target audio signal according to the target millimeter wave signal, so as to obtain calibrated millimeter wave features and calibrated audio features; the mapping module is used for carrying out fusion processing on the calibrated millimeter wave features and the calibrated audio features to obtain the target fusion feature;

and the voice recognition result extraction module is used for inputting the target fusion feature into a semantic feature network so as to determine a voice recognition result corresponding to the target user.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the multimodal speech recognition method according to any of the claims 1-8.

Technical Field

The present invention relates to the field of speech recognition technologies, and in particular, to a method, a system, and a computer-readable storage medium for multimodal speech recognition.

Background

Voice interaction plays a crucial role in intelligent interaction scenarios such as the smart home, providing contactless human-computer interaction between people and Internet of Things devices. With the development of deep learning and natural language processing, automatic speech recognition technology enables a voice interaction device to accurately acquire the speech content of a user. In recent years, commercial voice interaction products have become popular, such as smart speakers (e.g., Amazon Echo and Google Home), voice assistants in smartphones (e.g., Siri), and in-vehicle voice control (e.g., voice interaction in the Tesla Model S/X/3/Y).

However, beyond home scenarios, today's voice interaction must also cope with more diverse environmental noise (such as traffic noise, business noise, and nearby voices) in public places (such as streets, stations, halls, or parties). Speech recognition based on microphone arrays requires a clear audio signal with a high signal-to-noise ratio; in noisy environments, audio signals buried in unpredictable noise become difficult to recognize. In addition, as the recognition distance increases, speech quality gradually deteriorates, which degrades recognition accuracy. To address these difficulties, researchers now use multi-sensor information fusion for speech enhancement and recognition. For example, audiovisual methods combine lip motion captured by a camera with the noisy sound, but are limited by lighting conditions, line-of-sight requirements, and occlusion. Ultrasound-assisted speech enhancement techniques have an extremely short working distance (within 20 cm) and require a specific gesture.

Disclosure of Invention

The invention aims to provide a multimodal speech recognition method, system and computer-readable storage medium that achieve high-accuracy speech recognition.

In order to achieve the purpose, the invention provides the following scheme:

a multi-modal speech recognition method, comprising:

acquiring a target millimeter wave signal and a target audio signal;

when the target millimeter wave signal and the target audio signal both contain the voice information corresponding to the target user, calculating a first logarithmic Mel spectral coefficient and a second logarithmic Mel spectral coefficient; the first logarithmic mel-frequency spectral coefficient is determined according to the target millimeter wave signal, and the second logarithmic mel-frequency spectral coefficient is determined according to the target audio signal;

inputting the first logarithmic Mel spectral coefficient and the second logarithmic Mel spectral coefficient into a fusion network to determine a target fusion feature; the fusion network at least comprises a calibration module and a mapping module; the calibration module is used for carrying out feature calibration processing on the target millimeter wave signal according to the target audio signal and carrying out feature calibration processing on the target audio signal according to the target millimeter wave signal, so as to obtain calibrated millimeter wave features and calibrated audio features; the mapping module is used for carrying out fusion processing on the calibrated millimeter wave features and the calibrated audio features to obtain the target fusion feature;

and inputting the target fusion feature into a semantic feature network to determine a voice recognition result corresponding to the target user.

Optionally, the acquiring a target millimeter wave signal and a target audio signal specifically includes:

acquiring a target millimeter wave signal acquired by a millimeter wave radar;

and acquiring a target audio signal collected by a microphone.

Optionally, when both the target millimeter wave signal and the target audio signal contain voice information corresponding to a target user, calculating a first logarithmic mel-frequency spectrum coefficient and a second logarithmic mel-frequency spectrum coefficient specifically includes:

judging whether the target millimeter wave signal and the target audio signal both comprise voice information to obtain a first judgment result;

if the first judgment result shows that the target millimeter wave signal and the target audio signal both comprise voice information, judging whether the target millimeter wave signal and the target audio signal both come from a target user to obtain a second judgment result;

and if the second judgment result shows that the target millimeter wave signal and the target audio signal are both from a target user, respectively performing short-time Fourier transform processing on the target millimeter wave signal and the target audio signal to determine a first logarithmic Mel spectral coefficient and a second logarithmic Mel spectral coefficient.

Optionally, determining whether the target millimeter wave signal and the target audio signal both include voice information to obtain the first judgment result specifically includes:

respectively preprocessing the target millimeter wave signal and the target audio signal;

performing fast Fourier transform processing on the preprocessed target millimeter wave signal to extract a millimeter wave phase signal;

carrying out differential processing on the millimeter wave phase signals to extract millimeter wave phase difference signals;

multiplying the preprocessed target audio signal by the millimeter wave phase difference signal to obtain a target product component;

calculating the spectral entropy of the target product component;

judging whether the spectral entropy is larger than a set threshold value;

when the spectral entropy is larger than a set threshold value, the target millimeter wave signal and the target audio signal both include human voice information.

Optionally, the determining whether the target millimeter wave signal and the target audio signal are both from a target user specifically includes:

processing the target product component to extract a target linear prediction coding component;

inputting the target linear prediction coding component into a trained one-class support vector machine to judge whether the target millimeter wave signal and the target audio signal are both from the target user;

wherein the trained one-class support vector machine is obtained by training a one-class support vector machine on training data; the training data comprises a plurality of calibration product components and a label corresponding to each calibration product component; the label is the calibration user; and each calibration product component is a product component determined according to the millimeter wave signal and the audio signal corresponding to the calibration user.

Optionally, the fusion network further includes two identical branch networks, a first branch network and a second branch network; each branch network comprises a first ResECA block, a second ResECA block, a third ResECA block, a fourth ResECA block and a fifth ResECA block;

wherein an input end of the calibration module is respectively connected with an output end of a third ResECA block of the first branch network and an output end of a third ResECA block of the second branch network; the output end of the calibration module is respectively connected with the input end of a fourth ResECA block of the first branch network and the input end of a fourth ResECA block of the second branch network;

an input of a first ResECA block of the first branch network is for inputting the first logarithmic mel-frequency spectral coefficients; the output end of a first ResECA block of the first branch network is connected with the input end of a second ResECA block of the first branch network, the output end of the second ResECA block of the first branch network is connected with the input end of a third ResECA block of the first branch network, and the output end of a fourth ResECA block of the first branch network is connected with the input end of a fifth ResECA block of the first branch network;

the input end of the first ResECA block of the second branch network is used for inputting the second logarithmic Mel frequency spectrum coefficient; the output end of the first ResECA block of the second branch network is connected with the input end of the second ResECA block of the second branch network, the output end of the second ResECA block of the second branch network is connected with the input end of the third ResECA block of the second branch network, and the output end of the fourth ResECA block of the second branch network is connected with the input end of the fifth ResECA block of the second branch network;

the input end of the mapping module is respectively connected with the output end of the fifth ResECA block of the first branch network and the output end of the fifth ResECA block of the second branch network.

Optionally, the processing procedure of the calibration module is as follows:

calculating a first channel feature distribution according to a first intermediate feature; the first intermediate feature is the signal output by the third ResECA block of the first branch network;

calculating a second channel feature distribution according to a second intermediate feature; the second intermediate feature is the signal output by the third ResECA block of the second branch network;

calibrating the first intermediate feature according to the second channel feature distribution;

calibrating the second intermediate feature according to the first channel feature distribution.

Optionally, the processing procedure of the mapping module is as follows:

calculating a first similarity matrix according to the calibrated millimeter wave features;

calculating a second similarity matrix according to the calibrated audio features;

respectively carrying out normalization processing on the first similarity matrix and the second similarity matrix;

calculating a first attention feature according to the first similarity matrix after normalization processing;

calculating a second attention feature according to the second similarity matrix after normalization processing;

and calculating a target fusion feature according to the first attention feature and the second attention feature.

A multimodal speech recognition system comprising:

the signal acquisition module is used for acquiring a target millimeter wave signal and a target audio signal;

the logarithmic Mel spectral coefficient calculating module is used for calculating a first logarithmic Mel spectral coefficient and a second logarithmic Mel spectral coefficient when the target millimeter wave signal and the target audio signal both contain voice information corresponding to a target user; the first logarithmic mel-frequency spectral coefficient is determined according to the target millimeter wave signal, and the second logarithmic mel-frequency spectral coefficient is determined according to the target audio signal;

the target fusion feature determining module is used for inputting the first logarithmic Mel spectral coefficient and the second logarithmic Mel spectral coefficient into a fusion network so as to determine a target fusion feature; the fusion network at least comprises a calibration module and a mapping module; the calibration module is used for carrying out feature calibration processing on the target millimeter wave signal according to the target audio signal and carrying out feature calibration processing on the target audio signal according to the target millimeter wave signal, so as to obtain calibrated millimeter wave features and calibrated audio features; the mapping module is used for carrying out fusion processing on the calibrated millimeter wave features and the calibrated audio features to obtain the target fusion feature;

and the voice recognition result extraction module is used for inputting the target fusion feature into a semantic feature network so as to determine a voice recognition result corresponding to the target user.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the multimodal speech recognition method described above.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

Because the millimeter wave signal is not affected by acoustic noise and can sense throat vibration information while the user speaks, when the audio signal is polluted by noise the fusion network performs mutual feature calibration and fusion of the millimeter wave signal and the audio signal: the features of the two signals calibrate each other, the vibration information in the millimeter wave signal is fused into the audio features to obtain the target fusion feature, and the semantic feature network is guided to capture the semantic information in the target fusion feature with high precision.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.

FIG. 1 is a schematic flow chart of a multimodal speech recognition method of the present invention;

FIG. 2 is a flow chart of a multimodal speech recognition method of the present invention that combines millimeter wave signals and audio signals;

FIG. 3 is a schematic diagram of the multi-modal speech recognition system of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

English abbreviations and their expansions:

FFT: Fast Fourier Transform.

LPC: Linear Predictive Coding.

OC-SVM: One-Class Support Vector Machine.

STFT: Short-Time Fourier Transform.

ResECA: Residual Block with Efficient Channel Attention.

ReLU: Rectified Linear Unit.

LAS: Listen, Attend and Spell.

pBLSTM: pyramidal Bidirectional Long Short-Term Memory.

LSTM: Long Short-Term Memory.

Research has shown that millimeter wave signals are helpful for voice information recovery and have excellent resistance to environmental noise as well as good penetration. Based on the problems raised in the background section, the invention adopts a millimeter wave radar as a supplement to speech recognition. The millimeter wave radar can sense a distant target user, and even if the user wears a mask in a noisy environment, the reflected signal received by the radar still contains throat vibration information. However, the performance of the millimeter wave radar is not always satisfactory: because its wavelength is short (about 4 mm), the millimeter wave signal is sensitive not only to vocal vibration but also to the body motion of the user. Fortunately, microphone-based speech acquisition can compensate for this loss of information to some extent. The invention therefore exploits the complementary cooperation between the millimeter wave radar and the microphone and fuses the signals of the two modalities for speech recognition: the millimeter wave signal supports noise-resistant voice perception, while the audio signal collected by the microphone serves as a guide for calibrating the millimeter wave features under motion interference.

In view of this, the invention provides a multimodal speech recognition method and system that fuse millimeter wave signals and audio signals. First, voice activity detection and user locking are both performed based on the correlation between the millimeter wave signal and the audio signal, to obtain the millimeter wave signal and audio signal of the target user. The millimeter wave signal and the audio signal are then input into a fusion network for thorough fusion to obtain the fusion feature. Finally, the fusion feature is input into a semantic extraction network to obtain the semantic text, i.e. the speech recognition result. The invention combines and enhances the advantages of millimeter wave signals and audio signals, and realizes high-accuracy speech recognition under severe conditions such as high noise, long distance, and wide angles.

Example one

Referring to fig. 1, the multi-modal speech recognition method provided in this embodiment specifically includes the following steps.

Step 10: and acquiring a target millimeter wave signal and a target audio signal.

Step 20: when the target millimeter wave signal and the target audio signal both contain the voice information corresponding to the target user, calculating a first logarithmic Mel spectral coefficient and a second logarithmic Mel spectral coefficient; the first logarithmic mel-frequency spectral coefficient is determined from the target millimeter wave signal, and the second logarithmic mel-frequency spectral coefficient is determined from the target audio signal.

Step 30: inputting the first logarithmic Mel spectral coefficient and the second logarithmic Mel spectral coefficient into a fusion network to determine a target fusion characteristic; the fusion network at least comprises a calibration module and a mapping module; the calibration module is used for carrying out characteristic calibration processing on the target millimeter wave signal according to the target audio signal and carrying out characteristic calibration processing on the target audio signal according to the target millimeter wave signal so as to obtain a calibrated millimeter wave signal and a calibrated audio characteristic; the mapping module is used for carrying out fusion processing on the calibrated millimeter wave features and the calibrated audio features to obtain target fusion features.

Step 40: and inputting the target fusion feature into a semantic feature network to determine a voice recognition result corresponding to the target user.

As a preferred embodiment, the step 10 specifically includes:

and acquiring a target millimeter wave signal acquired by the millimeter wave radar.

And acquiring a target audio signal collected by a microphone.

As a preferred embodiment, the step 20 specifically includes:

and judging whether the target millimeter wave signal and the target audio signal both comprise voice information to obtain a first judgment result.

And if the first judgment result shows that the target millimeter wave signal and the target audio signal both comprise voice information, judging whether the target millimeter wave signal and the target audio signal both come from a target user, and obtaining a second judgment result.

And if the second judgment result shows that the target millimeter wave signal and the target audio signal are both from a target user, respectively performing short-time Fourier transform processing on the target millimeter wave signal and the target audio signal to determine a first logarithmic Mel spectral coefficient and a second logarithmic Mel spectral coefficient.

Wherein, judge whether target millimeter wave signal and target audio signal all include the people's voice information, obtain first judged result, specifically include:

and respectively preprocessing the target millimeter wave signal and the target audio signal.

And carrying out fast Fourier transform processing on the preprocessed target millimeter wave signal to extract a millimeter wave phase signal.

And carrying out differential processing on the millimeter wave phase signals to extract millimeter wave phase difference signals.

And multiplying the preprocessed target audio signal by the millimeter wave phase difference signal to obtain a target product component.

Spectral entropy of the target product component is calculated.

And judging whether the spectral entropy is larger than a set threshold value.

When the spectral entropy is larger than a set threshold value, the target millimeter wave signal and the target audio signal both include human voice information.
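A compact sketch of this decision, assuming NumPy. The patent does not give the exact entropy formulation, so the normalized Shannon spectral entropy below is an assumption (a common choice), and the 0.83 threshold is the value used in embodiment two.

```python
import numpy as np

def spectral_entropy(x: np.ndarray, n_fft: int = 512) -> float:
    """Normalized spectral entropy in [0, 1]; assumed formulation."""
    spectrum = np.abs(np.fft.rfft(x, n_fft)) ** 2
    p = spectrum / (spectrum.sum() + 1e-12)   # power spectrum as a probability mass
    h = -np.sum(p * np.log2(p + 1e-12))       # Shannon entropy of the spectrum
    return h / np.log2(len(p))                # normalize by the maximum entropy

def contains_voice(product_component: np.ndarray, threshold: float = 0.83) -> bool:
    """Voice decision per claim 4: entropy of the product component vs. a set threshold."""
    return spectral_entropy(product_component) > threshold
```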

Further, the determining whether the target millimeter wave signal and the target audio signal are both from a target user specifically includes:

processing the target product component to extract a target linear prediction coding component.

And inputting the target linear prediction coding component into a trained one-class support vector machine to judge whether the target millimeter wave signal and the target audio signal are both from the target user.

The trained one-class support vector machine is obtained by training a one-class support vector machine on training data; the training data comprises a plurality of calibration product components and a label corresponding to each calibration product component; the label is the calibration user; and each calibration product component is a product component determined according to the millimeter wave signal and the audio signal corresponding to the calibration user.

As a preferred embodiment, the fusion network further includes two identical branch networks, a first branch network and a second branch network; each branch network includes a first ResECA block, a second ResECA block, a third ResECA block, a fourth ResECA block, and a fifth ResECA block.

Wherein an input end of the calibration module is respectively connected with an output end of a third ResECA block of the first branch network and an output end of a third ResECA block of the second branch network; the output end of the calibration module is respectively connected with the input end of the fourth ResECA block of the first branch network and the input end of the fourth ResECA block of the second branch network.

An input of a first ResECA block of the first branch network is for inputting the first logarithmic mel-frequency spectral coefficients; the output of the first ResECA block of the first branching network is connected to the input of the second ResECA block of the first branching network, the output of the second ResECA block of the first branching network is connected to the input of the third ResECA block of the first branching network, and the output of the fourth ResECA block of the first branching network is connected to the input of the fifth ResECA block of the first branching network.

The input end of the first ResECA block of the second branch network is used for inputting the second logarithmic Mel frequency spectrum coefficient; the output end of the first ResECA block of the second branch network is connected with the input end of the second ResECA block of the second branch network, the output end of the second ResECA block of the second branch network is connected with the input end of the third ResECA block of the second branch network, and the output end of the fourth ResECA block of the second branch network is connected with the input end of the fifth ResECA block of the second branch network;

the input end of the mapping module is respectively connected with the output end of the fifth ResECA block of the first branch network and the output end of the fifth ResECA block of the second branch network.

Further, the processing procedure of the calibration module is as follows:

calculating a first channel feature distribution according to a first intermediate feature; the first intermediate feature is the signal output by the third ResECA block of the first branch network.

Calculating a second channel feature distribution according to a second intermediate feature; the second intermediate feature is the signal output by the third ResECA block of the second branch network.

Calibrating the first intermediate feature according to the second channel feature distribution.

Calibrating the second intermediate feature according to the first channel feature distribution.

Further, the processing procedure of the mapping module is as follows:

and calculating a first similar matrix according to the calibrated millimeter wave characteristics.

And calculating a second similarity matrix according to the calibrated audio features.

And respectively carrying out normalization processing on the first similarity matrix and the second similarity matrix.

And calculating the first attention feature according to the normalized first similarity matrix.

And calculating a second attention feature according to the second similarity matrix after the normalization processing.

And calculating a target fusion feature according to the first attention feature and the second attention feature.

Example two

As shown in fig. 2, the present embodiment provides a multimodal speech recognition method that fuses a millimeter wave signal and an audio signal, including the following steps:

Step 1: The target user stands approximately 7 meters from the millimeter wave radar and the microphone and speaks the wake-up word and a voice command; meanwhile, the millimeter wave radar collects the millimeter wave signal and the microphone collects the audio signal.

Both signals are first cut into 3-second segments, then normalized and down-sampled to 16 kHz. The down-sampled millimeter wave signal is processed with a fast Fourier transform (FFT) to extract the millimeter wave phase signal, which is then differenced to extract the millimeter wave phase difference signal. The down-sampled audio signal is multiplied by the millimeter wave phase difference signal to obtain the product component, which is used to judge whether the millimeter wave signal and the audio signal contain human voice information, as sketched below.
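A minimal sketch of this preprocessing, assuming NumPy and SciPy. The single whole-segment FFT used here for phase recovery is a deliberate simplification (a real FMCW pipeline would first select the target's range bin chirp by chirp), so treat the function as illustrative only.

```python
import numpy as np
from scipy.signal import resample_poly

def product_component(mmwave: np.ndarray, audio: np.ndarray,
                      fs_in: int, fs_out: int = 16_000) -> np.ndarray:
    """Sketch of step 1: normalize, down-sample, recover phase, and multiply."""
    # Normalize and down-sample both 3-second segments to 16 kHz.
    mm = resample_poly(mmwave / (np.max(np.abs(mmwave)) + 1e-12), fs_out, fs_in)
    au = resample_poly(audio / (np.max(np.abs(audio)) + 1e-12), fs_out, fs_in)
    # FFT processing to extract the millimeter wave phase signal (simplified).
    phase = np.unwrap(np.angle(np.fft.fft(mm)))
    # Differencing yields the phase difference signal (throat-vibration proxy).
    phase_diff = np.diff(phase, prepend=phase[0])
    # Element-wise product of the audio and the phase difference signal.
    n = min(len(au), len(phase_diff))
    return au[:n] * phase_diff[:n]
```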

The spectral entropy of the product component is calculated; if the spectral entropy is greater than a set threshold (here 0.83), both the millimeter wave signal and the audio signal contain human voice information; otherwise, at least one of the two signals perceives no human voice information. Step 2 is then performed on millimeter wave and audio signals that do perceive human voice information, to judge whether they come from the target user rather than from the interference of others.

Step 2: and extracting a linear predictive coding LPC component from the product component, and inputting the LPC component into a trained support vector machine OC-SVM to judge whether the millimeter wave signal and the audio signal come from a target user. If the LPC component is from the target user, proceed to step 3, otherwise continue with step 1 and step 2. The well trained OC-SVM is well trained in advance based on millimeter wave signals and audio signals of a calibration user.

The training process is as follows: the calibration user speaks the wake-up word 30 times towards the millimeter wave radar and the microphone; the collected millimeter wave and audio signals are preprocessed as in step 1 to obtain calibration product components; and the OC-SVM is trained on the calibration LPC components extracted from these calibration product components, labelled with the calibration user, so that it can judge whether an LPC component comes from the target user.
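A sketch of the enrollment and user check, assuming librosa for LPC extraction and scikit-learn's OneClassSVM; the LPC order and the SVM hyperparameters are assumptions not given in the patent.

```python
import numpy as np
import librosa
from sklearn.svm import OneClassSVM

LPC_ORDER = 16  # assumed order; the patent does not specify it

def lpc_features(product_comp: np.ndarray) -> np.ndarray:
    """LPC coefficients of a product component (leading 1.0 dropped)."""
    return librosa.lpc(product_comp.astype(float), order=LPC_ORDER)[1:]

def train_user_model(calibration_components: list) -> OneClassSVM:
    """Enrollment: 30 wake-word utterances preprocessed into product components."""
    X = np.stack([lpc_features(c) for c in calibration_components])
    return OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X)  # assumed hyperparameters

def is_target_user(model: OneClassSVM, product_comp: np.ndarray) -> bool:
    return model.predict(lpc_features(product_comp)[None, :])[0] == 1
```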

Step 3: Short-time Fourier transform (STFT) processing is performed on the millimeter wave and audio signals containing the user's voice, and the logarithmic Mel spectral coefficients of each are calculated from the STFT result. The logarithmic Mel spectral coefficients are input into the fusion network to obtain the fusion feature.
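The log-Mel computation for either modality can be sketched with librosa; the frame, hop and filter-bank sizes below are assumptions, since the patent only names the STFT and log-Mel steps.

```python
import numpy as np
import librosa

def log_mel(signal: np.ndarray, sr: int = 16_000,
            n_fft: int = 512, hop: int = 160, n_mels: int = 64) -> np.ndarray:
    """Logarithmic Mel spectral coefficients via STFT (assumed sizes)."""
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel)  # logarithmic compression

# One call per modality yields the first (millimeter wave) and second (audio)
# log-Mel coefficient maps fed to the two branches of the fusion network.
```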

The fusion network is composed of two branch networks: one receives the logarithmic Mel spectral coefficients of the millimeter wave signal and the other those of the audio signal, and each branch consists of 5 ResECA blocks. The fusion network has two modules. One is the calibration module, which recalibrates the two input features; it is located after the 3rd ResECA block, and the output of each 3rd ResECA block flows through the calibration module into the corresponding 4th ResECA block. The other is the mapping module, which maps the two features into the same feature space to obtain the final fusion feature; it is located after the 5th ResECA block and receives the millimeter wave and audio features from the two branch networks.
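The patent names the ResECA block but does not spell out its internals. The sketch below assumes a standard two-convolution residual block wrapped with ECA-Net-style efficient channel attention (global average pooling, a small 1-D convolution across channels, Sigmoid gating); all layer sizes and the ECA kernel width are assumptions.

```python
import torch
import torch.nn as nn

class ResECA(nn.Module):
    """Residual block with efficient channel attention (assumed internals)."""
    def __init__(self, c_in: int, c_out: int, k_eca: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        self.skip = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()
        self.eca = nn.Conv1d(1, 1, k_eca, padding=k_eca // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.body(x)
        w = y.mean(dim=(2, 3))                                   # GAP -> (B, C)
        w = torch.sigmoid(self.eca(w.unsqueeze(1))).squeeze(1)   # 1-D conv across channels
        y = y * w[:, :, None, None]                              # channel re-weighting
        return torch.relu(y + self.skip(x))
```

Stacking five such blocks per branch, splicing the calibration module between blocks 3 and 4 of both branches, and attaching the mapping module after block 5 reproduces the wiring described above.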

Here the mathematical principle of the calibration module is introduced. Let $X_W \in \mathbb{R}^{H \times W \times C}$ and $X_S \in \mathbb{R}^{H \times W \times C}$ be the two intermediate features from the respective branch networks, where $\mathbb{R}$ denotes the real number domain, $H$, $W$ and $C$ are the height, width and channel dimensions, and the subscripts $W$ and $S$ denote the millimeter wave signal and the audio signal respectively. The channel feature distributions $Y_W$ and $Y_S$ of the two are calculated as

$$Y_W = \sigma\bigl(W_W\,\mathrm{ReLU}(\mathrm{GAP}(X_W))\bigr), \quad Y_W \in \mathbb{R}^{1 \times 1 \times C} \tag{1}$$

$$Y_S = \sigma\bigl(W_S\,\mathrm{ReLU}(\mathrm{GAP}(X_S))\bigr), \quad Y_S \in \mathbb{R}^{1 \times 1 \times C} \tag{2}$$

where ReLU is the rectified linear unit, $W_W$ and $W_S$ are learnable parameter matrices, $\sigma$ is the Sigmoid function, and GAP is global average pooling. The channel feature distributions $Y_W$ and $Y_S$ can be viewed as feature detectors and filters. Mutual feature calibration is achieved by equations (3) and (4), in which each intermediate feature is re-scaled channel-wise by the other modality's distribution:

$$\hat{X}_W = Y_S \odot X_W \tag{3}$$

$$\hat{X}_S = Y_W \odot X_S \tag{4}$$

$\hat{X}_W$ and $\hat{X}_S$ are the final calibrated millimeter wave and audio features respectively. Because the two signals are correlated (both contain the user's voice information), mutual calibration strengthens the important information in each feature map and suppresses irrelevant interference.
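As a concrete reading of equations (1) through (4), the sketch below renders the calibration module in PyTorch. The cross-application of each modality's channel distribution to the other follows claim 7, but the layer shapes and the channel-wise re-scaling form of equations (3) and (4) are assumptions, not the patent's definitive implementation.

```python
import torch
import torch.nn as nn

class CalibrationModule(nn.Module):
    """Mutual feature calibration between the two branches, eqs. (1)-(4)."""
    def __init__(self, channels: int):
        super().__init__()
        self.w_w = nn.Linear(channels, channels, bias=False)  # W_W in eq. (1)
        self.w_s = nn.Linear(channels, channels, bias=False)  # W_S in eq. (2)

    @staticmethod
    def _gap(x: torch.Tensor) -> torch.Tensor:
        return x.mean(dim=(2, 3))  # global average pooling: (B, C, H, W) -> (B, C)

    def forward(self, x_w: torch.Tensor, x_s: torch.Tensor):
        y_w = torch.sigmoid(self.w_w(torch.relu(self._gap(x_w))))  # eq. (1)
        y_s = torch.sigmoid(self.w_s(torch.relu(self._gap(x_s))))  # eq. (2)
        # Cross-calibration: each modality is re-weighted by the *other*
        # modality's channel feature distribution (assumed form of eqs. (3)-(4)).
        x_w_cal = x_w * y_s[:, :, None, None]  # eq. (3)
        x_s_cal = x_s * y_w[:, :, None, None]  # eq. (4)
        return x_w_cal, x_s_cal
```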

In order to map the two features from different feature spaces, namely the millimeter wave features and the audio features, into the same feature space, a mapping module is designed and inserted at the end of the fusion network to generate the final fusion feature. Suppose $M \in \mathbb{R}^{H \times W \times C}$ and $V \in \mathbb{R}^{H \times W \times C}$ are the millimeter wave features and audio features from the two branch networks; $M$ and $V$ are flattened to size $\mathbb{R}^{C \times HW}$. The similarity matrix of $M$ and $V$ is calculated as

$$S = M^{T} W_{MV} V, \quad S \in \mathbb{R}^{HW \times HW} \tag{5}$$

where $W_{MV}$ is a learnable parameter matrix and each element of $S$ reveals the correlation between the corresponding columns of $M$ and $V$. Softmax normalization is applied to the similarity matrix and to its transpose respectively:

$$S_M = \mathrm{softmax}(S), \quad S_M \in \mathbb{R}^{HW \times HW} \tag{6}$$

$$S_V = \mathrm{softmax}(S^{T}), \quad S_V \in \mathbb{R}^{HW \times HW} \tag{7}$$

The matrix $S_M$ can convert the millimeter wave feature space into the audio feature space; in the same way, $S_V$ can convert the audio feature space into the millimeter wave feature space. The corresponding attention features are calculated as

$$C_M = V \otimes S_M, \quad C_M \in \mathbb{R}^{C \times HW} \tag{8}$$

$$C_V = M \otimes S_V, \quad C_V \in \mathbb{R}^{C \times HW} \tag{9}$$

where $\otimes$ represents matrix multiplication. Finally, based on the two attention features, the final fusion feature $Z$ is obtained:

$$Z = W_Z\{\sigma(C_M) \odot M + \sigma(C_V) \odot V\}, \quad Z \in \mathbb{R}^{C \times HW} \tag{10}$$

where $W_Z$ is a learnable parameter matrix and $\odot$ denotes element-wise multiplication. $Z$ selectively integrates the information of both modalities, and the fine-grained elements in $Z$ related to vocal vibration and acoustic properties dominate. The final fusion feature output by the fusion network is input into the semantic extraction network for speech recognition.
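The mapping module of equations (5) through (10) can likewise be sketched in PyTorch on features already flattened to shape (batch, C, HW). Equations (8) and (9) follow the reconstruction above, and applying $W_Z$ along the HW axis is an assumption; treat this as a sketch, not the patent's definitive implementation.

```python
import torch
import torch.nn as nn

class MappingModule(nn.Module):
    """Cross-modal attention fusion, eqs. (5)-(10)."""
    def __init__(self, channels: int, hw: int):
        super().__init__()
        self.w_mv = nn.Parameter(torch.randn(channels, channels) * 0.01)  # W_MV, eq. (5)
        self.w_z = nn.Linear(hw, hw, bias=False)                          # W_Z, eq. (10)

    def forward(self, m: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # m, v: calibrated millimeter wave / audio features, shape (B, C, HW).
        s = torch.einsum('bci,cd,bdj->bij', m, self.w_mv, v)  # eq. (5): S = M^T W_MV V
        s_m = torch.softmax(s, dim=-1)                        # eq. (6)
        s_v = torch.softmax(s.transpose(1, 2), dim=-1)        # eq. (7)
        c_m = torch.bmm(v, s_m)                               # eq. (8), as reconstructed
        c_v = torch.bmm(m, s_v)                               # eq. (9), as reconstructed
        # eq. (10): gate each modality by its attention feature, then project.
        z = self.w_z(torch.sigmoid(c_m) * m + torch.sigmoid(c_v) * v)
        return z                                              # (B, C, HW)
```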

Step 4: The final fusion feature is input into the semantic feature network to obtain the semantic text, i.e. the speech recognition result. The semantic feature network in this method is a classical LAS, which consists of two components: an encoder called the Listener and a decoder called the Speller. The Listener maps the fusion feature to hidden features through pBLSTM layers. The Speller is a stacked recurrent neural network that computes the probability of the output character sequence and uses a multi-head attention mechanism to generate context vectors. In this LAS, the Listener consists of two successive pBLSTM layers, and the Speller contains two LSTM layers and one Softmax output layer. The LAS receives the fusion feature from step 3 and outputs the speech recognition result.
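For concreteness, one pyramidal BLSTM layer of the Listener can be sketched in PyTorch: adjacent frames are concatenated, halving the time resolution before a bidirectional LSTM. Layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class PBLSTM(nn.Module):
    """One pyramidal BLSTM layer (LAS Listener component); assumed sizes."""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.lstm = nn.LSTM(d_in * 2, d_hidden, batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D); fold pairs of adjacent frames to halve the time axis.
        b, t, d = x.shape
        x = x[:, : t - t % 2, :].reshape(b, t // 2, d * 2)
        out, _ = self.lstm(x)
        return out  # (B, T // 2, 2 * d_hidden)

# Listener = two stacked PBLSTM layers; Speller = two LSTM layers with
# multi-head attention and a Softmax output, as described above.
```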

Example three

Referring to fig. 3, the multi-modal speech recognition system provided in the present embodiment includes:

and a signal obtaining module 50, configured to obtain the target millimeter wave signal and the target audio signal.

A logarithmic mel-frequency spectrum coefficient calculating module 60, configured to calculate a first logarithmic mel-frequency spectrum coefficient and a second logarithmic mel-frequency spectrum coefficient when the target millimeter-wave signal and the target audio-frequency signal both include the vocal information corresponding to the target user; the first logarithmic mel-frequency spectral coefficient is determined from the target millimeter wave signal, and the second logarithmic mel-frequency spectral coefficient is determined from the target audio signal.

A target fusion feature determining module 70, configured to input the first logarithmic Mel spectral coefficient and the second logarithmic Mel spectral coefficient into a fusion network to determine a target fusion feature; the fusion network at least comprises a calibration module and a mapping module; the calibration module is used for carrying out feature calibration processing on the target millimeter wave signal according to the target audio signal and carrying out feature calibration processing on the target audio signal according to the target millimeter wave signal, so as to obtain calibrated millimeter wave features and calibrated audio features; the mapping module is used for carrying out fusion processing on the calibrated millimeter wave features and the calibrated audio features to obtain the target fusion feature.

A speech recognition result extraction module 80, configured to input the target fusion feature into a semantic feature network to determine the speech recognition result corresponding to the target user.

Example four

The present embodiment provides a computer-readable storage medium having a computer program stored thereon.

The computer program, when executed by a processor, implements the steps of the multimodal speech recognition method of embodiment one or embodiment two.

Compared with the prior art, the invention has the following effects:

1. Noise resistance: the millimeter wave signal is not affected by acoustic noise and can sense the throat vibration information of the user during speech. When the audio signal is polluted by noise, the millimeter wave features and the audio features can be mutually calibrated and fused by the fusion network; that is, the vibration information in the millimeter wave features is fused into the audio features, guiding the network to capture the semantic information in the audio features rather than the noise interference.

2. Long recognition distance and wide angle: the millimeter wave radar has a long sensing distance but a limited sensing angle, while the microphone can capture omnidirectional sound but has a short sensing distance. The features of the two modalities are input into the fusion network, where they selectively calibrate and mutually enhance each other, finally generating a fusion feature that combines the advantages of both: the long-range throat vibration information from the millimeter wave radar and the omnidirectional speech information from the microphone.

3. Suitability for multi-person scenes: the signals collected by the millimeter wave radar and the microphone do not necessarily contain human voice, nor do they necessarily come from the target user. The designed voice activity detection and user detection are therefore both based on the correlation between the millimeter wave signal and the audio signal, i.e. both signals must relate to the same human voice information; this makes it possible to detect whether the signals derive from human voice and to further judge whether they come from the target user, so the method suits noisy multi-person scenes.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.
