Training method and system for context information prediction model of video scene

Document No.: 972884  Publication date: 2020-11-03

Description: This technique, "Training method and system for context information prediction model of video scene" (用于视频场景的上下文信息预测模型的训练方法及系统), was created by Qian Yanmin (钱彦旻) and Li Chenda (李晨达) on 2020-07-15. Its main content is as follows: an embodiment of the invention provides a training method for a context information prediction model for video scenes. The method comprises: extracting a first ideal context feature and a second ideal context feature from a first clean audio of a first speaker and a second clean audio of a second speaker through an end-to-end speech recognition encoder; taking the magnitude spectrum of the mixed audio, together with first visual representation information of the first speaker and second visual representation information of the second speaker, as inputs of a context information prediction model, and outputting a first predicted context feature and a second predicted context feature; and training the context information prediction model based on the errors between the first and second ideal context features and the first and second predicted context features. An embodiment of the invention also provides a training system for a context information prediction model for video scenes. The embodiments of the invention improve the performance of speech separation.

1. A method for training a context information prediction model for a video scene, comprising:

extracting a first ideal context feature and a second ideal context feature from a first clean audio of a first speaker and a second clean audio of a second speaker, respectively, through a single-speaker end-to-end speech recognition encoder;

taking a magnitude spectrum of mixed audio generated from the first clean audio and the second clean audio, together with first visual representation information of the first speaker and second visual representation information of the second speaker, as inputs of a context information prediction model, and outputting a first predicted context feature and a second predicted context feature;

training the context information prediction model based on the errors between the first and second ideal context features and the first and second predicted context features, until the first and second predicted context features approach the first and second ideal context features.

2. The method of claim 1, wherein the number of end-to-end speech recognition encoders is the same as the number of speakers, and the end-to-end speech recognition encoders share weights.

3. The method of claim 1, wherein the first visual representation information and the second visual representation information comprise: features extracted from video frames of the speaker's mouth region.

4. The method of claim 1, wherein the context information prediction model comprises: a two-dimensional VGG-like convolutional network, a weight-sharing one-dimensional deep residual network, and a bidirectional long short-term memory (BLSTM) recurrent neural network.

5. A method of context information prediction, comprising:

inputting the magnitude spectrum of the mixed speech to be separated into a context information prediction model trained according to the training method of claim 1, performing feature extraction in the time-frequency dimensions through a VGG-like convolutional network, and performing down-sampling in the time dimension;

inputting the feature-extracted and down-sampled magnitude spectrum into a deep residual network to obtain a high-dimensional audio modality representation;

inputting first visual representation information of a first speaker and second visual representation information of a second speaker in the mixed speech to be separated into the context information prediction model, and determining a first high-dimensional visual modality representation and a second high-dimensional visual modality representation through the deep residual network;

concatenating the high-dimensional audio modality representation, the first high-dimensional visual modality representation and the second high-dimensional visual modality representation to determine a concatenated modality representation;

and inputting the concatenated modality representation into two different bidirectional long short-term memory (BLSTM) recurrent neural networks to obtain a first context information representation of the first speaker and a second context information representation of the second speaker.

6. A method of speech separation comprising:

inputting the first high-dimensional visual modality representation, the second high-dimensional visual modality representation, the magnitude spectrum of the mixed speech to be separated, the first context information representation, and the second context information representation determined in claim 5 into a speech separation system, and determining a high-dimensional feature representation of the mixed speech to be separated;

determining a first magnitude spectrum mask for the first speaker and a second magnitude spectrum mask for the second speaker based on the high-dimensional feature representation;

and predicting the magnitude spectra of the separated speech by applying the first magnitude spectrum mask and the second magnitude spectrum mask to the magnitude spectrum of the mixed speech to be separated, thereby determining the separated voice of the first speaker and the separated voice of the second speaker.

7. The method of claim 6, wherein the speech separation system includes an attention mechanism for assisting the speech separation prediction.

8. A training system for a context information prediction model for a video scene, comprising:

an ideal context feature determination program module, configured to extract a first ideal context feature and a second ideal context feature from a first clean audio of a first speaker and a second clean audio of a second speaker, respectively, through a single-speaker end-to-end speech recognition encoder;

a predicted context feature determination program module, configured to take a magnitude spectrum of mixed audio generated from the first clean audio and the second clean audio, together with first visual representation information of the first speaker and second visual representation information of the second speaker, as inputs of a context information prediction model, and to output a first predicted context feature and a second predicted context feature;

a training program module, configured to train the context information prediction model based on the errors between the first and second ideal context features and the first and second predicted context features, until the first and second predicted context features approach the first and second ideal context features.

9. A contextual information prediction system comprising:

an extraction and sampling program module, configured to input the magnitude spectrum of the mixed speech to be separated into a context information prediction model trained according to the training method of claim 8, perform feature extraction in the time-frequency dimensions through a VGG-like convolutional network, and perform down-sampling in the time dimension;

a high-dimensional audio modality representation determination program module, configured to input the feature-extracted and down-sampled magnitude spectrum into the deep residual network to obtain a high-dimensional audio modality representation;

a high-dimensional visual modality representation determination program module, configured to input first visual representation information of a first speaker and second visual representation information of a second speaker in the mixed speech to be separated into the context information prediction model, and determine a first high-dimensional visual modality representation and a second high-dimensional visual modality representation through the deep residual network;

a concatenated modality representation determination program module, configured to concatenate the high-dimensional audio modality representation, the first high-dimensional visual modality representation and the second high-dimensional visual modality representation to determine a concatenated modality representation;

and a context information representation determination program module, configured to input the concatenated modality representation into two different bidirectional long short-term memory (BLSTM) recurrent neural networks to obtain a first context information representation of the first speaker and a second context information representation of the second speaker.

10. A speech separation system comprising:

a high-dimensional feature representation determination program module, configured to input the first high-dimensional visual modality representation, the second high-dimensional visual modality representation, the magnitude spectrum of the mixed speech to be separated, the first context information representation and the second context information representation determined in claim 9 into a speech separation system, and determine a high-dimensional feature representation of the mixed speech to be separated;

a magnitude spectrum mask determination program module, configured to determine a first magnitude spectrum mask for the first speaker and a second magnitude spectrum mask for the second speaker based on the high-dimensional feature representation;

and a speech separation program module, configured to predict the magnitude spectra of the separated speech by applying the first magnitude spectrum mask and the second magnitude spectrum mask to the magnitude spectrum of the mixed speech to be separated, thereby determining the separated voice of the first speaker and the separated voice of the second speaker.

Technical Field

The invention relates to the field of intelligent speech, and in particular to a training method and system for a context information prediction model for video scenes.

Background

It has become popular to tackle the cocktail party problem with multi-modal approaches. To solve the speech separation problem at the cocktail party, two techniques are commonly used: traditional deep-neural-network speech separation and speech separation based on audio-visual information.

Traditional deep neural network speech separation. Such a system uses a neural network to process audio in which multiple speakers are mixed. Taking a system with two target speakers as an example, the network input is the mixed audio and the output is the separated audio of each of the two speakers.

Speech separation based on audio-visual information. When separating speech, such a system incorporates video information of the target speaker (the video contains important cues such as the speaker's mouth movements). The neural network combines the video information of the target speaker to separate the speech corresponding to the target speaker from the mixed audio.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:

in traditional deep neural network speech separation, since the two separated target signals are interchangeable, the problem of matching the outputs to the training labels arises during training. Training with the permutation invariance criterion is used to handle this, which incurs a high training cost.

Speech separation based on audio-visual information fuses the video information of the target speaker into the neural network, which removes the ambiguity between the symmetric audio outputs and introduces more usable information. However, how to further exploit the video information has not been well explored.

Disclosure of Invention

Embodiments of the present invention at least solve the problems in the prior art that deep speech separation techniques use no extra information, that the label-matching overhead during training is high, and that further applications of video information to speech separation have not been considered.

In a first aspect, an embodiment of the present invention provides a method for training a context information prediction model for a video scene, including:

extracting a first ideal context feature and a second ideal context feature from a first clean audio of a first speaker and a second clean audio of a second speaker, respectively, through a single-speaker end-to-end speech recognition encoder;

taking a magnitude spectrum of mixed audio generated from the first clean audio and the second clean audio, together with first visual representation information of the first speaker and second visual representation information of the second speaker, as inputs of a context information prediction model, and outputting a first predicted context feature and a second predicted context feature;

training the context information prediction model based on the errors between the first and second ideal context features and the first and second predicted context features, until the first and second predicted context features approach the first and second ideal context features.

In a second aspect, an embodiment of the present invention provides a method for predicting context information, including:

inputting the magnitude spectrum of the mixed speech to be separated into a context information prediction model trained according to the training method of claim 1, performing feature extraction in the time-frequency dimensions through a VGG-like convolutional network, and performing down-sampling in the time dimension;

inputting the feature-extracted and down-sampled magnitude spectrum into a deep residual network to obtain a high-dimensional audio modality representation;

inputting first visual representation information of a first speaker and second visual representation information of a second speaker in the mixed speech to be separated into the context information prediction model, and determining a first high-dimensional visual modality representation and a second high-dimensional visual modality representation through the deep residual network;

concatenating the high-dimensional audio modality representation, the first high-dimensional visual modality representation and the second high-dimensional visual modality representation to determine a concatenated modality representation;

and inputting the concatenated modality representation into two different bidirectional long short-term memory (BLSTM) recurrent neural networks to obtain a first context information representation of the first speaker and a second context information representation of the second speaker.

In a third aspect, an embodiment of the present invention provides a speech separation method, including:

inputting the first high-dimensional visual modality representation, the second high-dimensional visual modality representation, the magnitude spectrum of the mixed speech to be separated, the first context information representation, and the second context information representation determined in claim 5 into a speech separation system, and determining a high-dimensional feature representation of the mixed speech to be separated;

determining a first magnitude spectrum mask for the first speaker and a second magnitude spectrum mask for the second speaker based on the high-dimensional feature representation;

and predicting the magnitude spectra of the separated speech by applying the first magnitude spectrum mask and the second magnitude spectrum mask to the magnitude spectrum of the mixed speech to be separated, thereby determining the separated voice of the first speaker and the separated voice of the second speaker.

In a fourth aspect, an embodiment of the present invention provides a training system for a context information prediction model of a video scene, including:

an ideal context feature determination program module, configured to extract a first ideal context feature and a second ideal context feature from a first clean audio of a first speaker and a second clean audio of a second speaker, respectively, through a single-speaker end-to-end speech recognition encoder;

a predicted context feature determination program module, configured to take a magnitude spectrum of mixed audio generated from the first clean audio and the second clean audio, together with first visual representation information of the first speaker and second visual representation information of the second speaker, as inputs of a context information prediction model, and to output a first predicted context feature and a second predicted context feature;

a training program module, configured to train the context information prediction model based on the errors between the first and second ideal context features and the first and second predicted context features, until the first and second predicted context features approach the first and second ideal context features.

In a fifth aspect, an embodiment of the present invention provides a context information prediction system, including:

an extraction and sampling program module, configured to input the magnitude spectrum of the mixed speech to be separated into a context information prediction model trained according to the training method of claim 8, perform feature extraction in the time-frequency dimensions through a VGG-like convolutional network, and perform down-sampling in the time dimension;

a high-dimensional audio modality representation determination program module, configured to input the feature-extracted and down-sampled magnitude spectrum into the deep residual network to obtain a high-dimensional audio modality representation;

a high-dimensional visual modality representation determination program module, configured to input first visual representation information of a first speaker and second visual representation information of a second speaker in the mixed speech to be separated into the context information prediction model, and determine a first high-dimensional visual modality representation and a second high-dimensional visual modality representation through the deep residual network;

a concatenated modality representation determination program module, configured to concatenate the high-dimensional audio modality representation, the first high-dimensional visual modality representation and the second high-dimensional visual modality representation to determine a concatenated modality representation;

and a context information representation determination program module, configured to input the concatenated modality representation into two different bidirectional long short-term memory (BLSTM) recurrent neural networks to obtain a first context information representation of the first speaker and a second context information representation of the second speaker.

In a sixth aspect, an embodiment of the present invention provides a speech separation system, including:

a high-dimensional feature representation determination program module, configured to input the first high-dimensional visual modality representation, the second high-dimensional visual modality representation, the magnitude spectrum of the mixed speech to be separated, the first context information representation and the second context information representation determined in claim 9 into a speech separation system, and determine a high-dimensional feature representation of the mixed speech to be separated;

a magnitude spectrum mask determination program module, configured to determine a first magnitude spectrum mask for the first speaker and a second magnitude spectrum mask for the second speaker based on the high-dimensional feature representation;

and a speech separation program module, configured to predict the magnitude spectra of the separated speech by applying the first magnitude spectrum mask and the second magnitude spectrum mask to the magnitude spectrum of the mixed speech to be separated, thereby determining the separated voice of the first speaker and the separated voice of the second speaker.

In a seventh aspect, an electronic device is provided, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a context information prediction model for a video scene, the method for context information prediction, and the method for speech separation of any of the embodiments of the present invention.

In an eighth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, implement the steps of the training method for a context information prediction model for a video scene, the context information prediction method, and the speech separation method according to any embodiment of the present invention.

The embodiments of the invention have the following beneficial effects: contextual information is extracted from the mixed audio signal and the visual information of the target speakers, and is incorporated into the speech separation task. The method models the human mechanism, in a cocktail-party scene, of filling in missed speech and correcting misheard speech by understanding the speaker's context. Experiments show that, compared with an audio-visual speech separation baseline, audio-visual speech separation that incorporates context information achieves a clear improvement in separation performance.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a flowchart of a training method for a context information prediction model of a video scene according to an embodiment of the present invention;

fig. 2 is a training structure diagram of a context information prediction model of a method for training a context information prediction model of a video scene according to an embodiment of the present invention;

FIG. 3 is a flowchart of a method for predicting context information according to an embodiment of the present invention;

fig. 4 is a diagram of a context information prediction model structure of a context information prediction method according to an embodiment of the present invention;

FIG. 5 is a flow chart of a method for separating speech according to an embodiment of the present invention;

FIG. 6 is a diagram of a voice separation system incorporating context information according to a voice separation method provided in an embodiment of the present invention;

FIG. 7 is a table of detailed parameters of the deep residual networks used in the speech separation method according to an embodiment of the present invention;

FIG. 8 is a graph of audio-visual context speech separation model result comparison data for a speech separation method according to an embodiment of the present invention;

FIG. 9 is a data diagram illustrating the results of an attention mechanism of a speech separation method according to an embodiment of the present invention;

fig. 10 is a schematic structural diagram of a training system for a context information prediction model of a video scene according to an embodiment of the present invention;

FIG. 11 is a block diagram illustrating a context information prediction system according to an embodiment of the present invention;

fig. 12 is a schematic structural diagram of a speech separation system according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of a method for training a context information prediction model for a video scene according to an embodiment of the present invention, including the following steps:

S11: extracting a first ideal context feature and a second ideal context feature from a first clean audio of a first speaker and a second clean audio of a second speaker, respectively, through a single-speaker end-to-end speech recognition encoder;

S12: taking a magnitude spectrum of mixed audio generated from the first clean audio and the second clean audio, together with first visual representation information of the first speaker and second visual representation information of the second speaker, as inputs of a context information prediction model, and outputting a first predicted context feature and a second predicted context feature;

S13: training the context information prediction model based on the errors between the first and second ideal context features and the first and second predicted context features, until the first and second predicted context features approach the first and second ideal context features.

In this embodiment, in order to distinguish the target speaker's voice from the mixed speech in a real cocktail-party scene, people not only listen carefully and attend to the visual information of the target speaker, but also try to understand what the target speaker is talking about. Research on the human auditory mechanism shows that relevant neural centers exist in the human brain and that sound masked by noise can be restored from the contextual information of the speech.

For step S11, it is difficult to extract context information directly from the mixed audio and the corresponding target-speaker information. A simpler case is therefore considered first: the encoder part of an end-to-end speech recognition system is used to extract context information from the clean label data, referred to as ideal context information. Ideal context information obtained from clean audio cannot be used in real scenes, but it can serve as a training label for training the context information prediction model.

FIG. 2 illustrates the training process of the context prediction model for the two-target-speaker case. The single-speaker end-to-end speech recognition encoder takes the magnitude spectrum |X_A| of the first speaker's first clean audio X_A, where fbank(|X_A|) denotes the filter-bank audio features extracted from the clean audio, and likewise the magnitude spectrum |X_B| of the second speaker's second clean audio X_B with filter-bank features fbank(|X_B|), and finally produces the first ideal context feature E_A and the second ideal context feature E_B. In one embodiment, the number of end-to-end speech recognition encoders is the same as the number of speakers, and the encoders share weights; that is, one encoder is provided per speaker, so the number of speakers is not limited to two and the method can be applied to more speakers. The structure of the context information prediction model includes: a two-dimensional VGG-like convolutional network, a weight-sharing one-dimensional deep residual network, and a bidirectional long short-term memory (BLSTM) recurrent neural network.

For step S12, |Y| denotes the magnitude spectrum of the mixed audio Y generated from X_A and X_B, and V_A and V_B are the visual representations of the target speakers. The context information prediction model receives the mixed magnitude spectrum |Y|, V_A and V_B as inputs and predicts context features Ê_A and Ê_B for speakers A and B separately. The visual representation information comprises features extracted from video frames of the speaker's mouth region.

For step S13, the error between the ideal context features and the predicted context features is computed as the training loss of the context information prediction model:

L_ctx = Dist(E_A, Ê_A) + Dist(E_B, Ê_B)

where Dist(·, ·) denotes the distance (error) between an ideal and a predicted context feature. In this way, the first and second predicted context features are driven towards the first and second ideal context features.
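For illustration only, a minimal PyTorch-style sketch of one training step is given below. The module interfaces (asr_encoder, context_predictor) and the use of a mean-squared-error distance are assumptions made for this sketch; the embodiment only requires training on the error between the ideal and predicted context features.

    import torch
    import torch.nn.functional as F

    # Assumed (hypothetical) interfaces: asr_encoder maps clean filter-bank features to
    # ideal context features; context_predictor maps (|Y|, V_A, V_B) to predicted features.
    def training_step(asr_encoder, context_predictor, optimizer,
                      fbank_a, fbank_b, mix_mag, visual_a, visual_b):
        with torch.no_grad():                       # the pre-trained ASR encoder is not updated
            e_a = asr_encoder(fbank_a)              # ideal context feature of speaker A
            e_b = asr_encoder(fbank_b)              # ideal context feature of speaker B

        e_a_hat, e_b_hat = context_predictor(mix_mag, visual_a, visual_b)

        # MSE is an assumed distance; the text only specifies "the error".
        loss = F.mse_loss(e_a_hat, e_a) + F.mse_loss(e_b_hat, e_b)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()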

It can be seen from this embodiment that a similar capability is added to the deep-learning-based audio-visual speech separation system: the contextual modality of speech is fused into the separation system, realizing a speech separation system that integrates the audio, visual and contextual modalities and thereby helping to improve the separation performance.

Fig. 3 is a flowchart of a context information prediction method according to an embodiment of the present invention, which includes the following steps:

S21: inputting the magnitude spectrum of the mixed speech to be separated into a context information prediction model trained according to the training method of claim 1, performing feature extraction in the time-frequency dimensions through a VGG-like convolutional network, and performing down-sampling in the time dimension;

S22: inputting the feature-extracted and down-sampled magnitude spectrum into a deep residual network to obtain a high-dimensional audio modality representation;

S23: inputting first visual representation information of a first speaker and second visual representation information of a second speaker in the mixed speech to be separated into the context information prediction model, and determining a first high-dimensional visual modality representation and a second high-dimensional visual modality representation through the deep residual network;

S24: concatenating the high-dimensional audio modality representation, the first high-dimensional visual modality representation and the second high-dimensional visual modality representation to determine a concatenated modality representation;

S25: inputting the concatenated modality representation into two different bidirectional long short-term memory (BLSTM) recurrent neural networks to obtain a first context information representation of the first speaker and a second context information representation of the second speaker.

For step S21, the specific structure of the context information prediction model is shown in FIG. 4. When performing context information prediction, the magnitude spectrum |Y| of the mixed audio is input into the context information prediction model trained according to the training method of claim 1. First, feature extraction is performed in the time-frequency dimensions using a two-dimensional VGG-like convolutional network (the VGG network is a neural network architecture proposed by the Visual Geometry Group at the University of Oxford; a similar, "VGG-like" structure is used here), and down-sampling is performed in the time dimension.

For step S22, the output of step S21 is processed by a one-dimensional deep residual network ResNet_M′ to obtain the high-dimensional audio modality representation Y_R.

For step S23, the visual representations of the target speakers are processed by a weight-sharing one-dimensional deep residual network ResNet_V′ to obtain the first and second high-dimensional visual modality representations.

For step S24, the high-dimensional audio modality representation and the high-dimensional visual modality representations determined in steps S22 and S23 are concatenated to obtain the corresponding concatenated modality representation.

For step S25, the concatenated representation is processed by the deep residual network ResNet_F′ and then fed into two different bidirectional long short-term memory (BLSTM) recurrent neural networks, BLSTM_SA and BLSTM_SB, to obtain intermediate representations of the context information. Each intermediate representation is then passed through a weight-sharing BLSTM_E, which finally generates the first context information representation of the first speaker and the second context information representation of the second speaker.
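A minimal sketch of this forward pass is given below. The layer widths and the simple stand-ins for the VGG-like front end, the one-dimensional ResNets and the BLSTMs are assumptions for illustration; only the overall data flow follows the description above.

    import torch
    import torch.nn as nn

    class ContextPredictor(nn.Module):
        """Illustrative context prediction model; widths and stand-in layers are assumptions."""
        def __init__(self, n_freq=321, vis_dim=512, hid=256, emb=512):
            super().__init__()
            # 2-D VGG-like front end: feature extraction in time-frequency, 4x down-sampling in time.
            self.vgg = nn.Sequential(
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            freq_after = n_freq // 4
            # 1x1 convolutions stand in for the one-dimensional ResNets.
            self.resnet_m = nn.Conv1d(128 * freq_after, hid, 1)   # ResNet_M' stand-in
            self.resnet_v = nn.Conv1d(vis_dim, hid, 1)            # ResNet_V' stand-in, shared
            self.resnet_f = nn.Conv1d(3 * hid, hid, 1)            # ResNet_F' stand-in
            # Speaker-specific BLSTMs and a shared encoder BLSTM.
            self.blstm_sa = nn.LSTM(hid, hid, batch_first=True, bidirectional=True)
            self.blstm_sb = nn.LSTM(hid, hid, batch_first=True, bidirectional=True)
            self.blstm_e = nn.LSTM(2 * hid, emb // 2, batch_first=True, bidirectional=True)

        def forward(self, mix_mag, vis_a, vis_b):
            # mix_mag: (B, 4*Tv, F); vis_a / vis_b: (B, Tv, vis_dim).
            x = self.vgg(mix_mag.unsqueeze(1))                    # (B, 128, Tv, F // 4)
            b, c, t, f = x.shape
            y_r = self.resnet_m(x.permute(0, 2, 1, 3).reshape(b, t, c * f).transpose(1, 2))
            v_a = self.resnet_v(vis_a.transpose(1, 2))
            v_b = self.resnet_v(vis_b.transpose(1, 2))
            fused = self.resnet_f(torch.cat([y_r, v_a, v_b], dim=1)).transpose(1, 2)
            h_a, _ = self.blstm_sa(fused)                         # intermediate, speaker A
            h_b, _ = self.blstm_sb(fused)                         # intermediate, speaker B
            e_a_hat, _ = self.blstm_e(h_a)                        # predicted context, speaker A
            e_b_hat, _ = self.blstm_e(h_b)                        # predicted context, speaker B
            return e_a_hat, e_b_hat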

The specific implementation steps will be described in detail in the following experimental modes.

It can be seen from this embodiment that, for the cocktail-party problem, the deep-learning-based audio-visual speech separation system is extended with an ability similar to that of humans, who attend to the target speaker's mouth movements, listen to the voice, and understand what the target speaker is saying, and thereby better distinguish the target speaker's voice from the mixed scene. That is, the contextual modality of speech is fused into the speech separation system, realizing a separation system that integrates the audio, visual and contextual modalities and helping to improve the separation performance.

Fig. 5 is a flowchart of a speech separation method according to an embodiment of the present invention, which includes the following steps:

S31: inputting the first high-dimensional visual modality representation, the second high-dimensional visual modality representation, the magnitude spectrum of the mixed speech to be separated, the first context information representation, and the second context information representation determined in claim 5 into a speech separation system, and determining a high-dimensional feature representation of the mixed speech to be separated;

S32: determining a first magnitude spectrum mask for the first speaker and a second magnitude spectrum mask for the second speaker based on the high-dimensional feature representation;

S33: predicting the magnitude spectra of the separated speech by applying the first magnitude spectrum mask and the second magnitude spectrum mask to the magnitude spectrum of the mixed speech to be separated, thereby determining the separated voice of the first speaker and the separated voice of the second speaker.

In this embodiment, once the predicted context information is obtained, it can be integrated into the speech separation system as shown in FIG. 6, and the speech separation system can use an attention mechanism to assist the separation prediction.

For step S31, the visual features V_A and V_B, the mixed magnitude spectrum |Y|, and the context information E_A and E_B are each processed by their corresponding deep residual networks to obtain the corresponding high-dimensional feature representation of the mixed speech to be separated.

For step S32, based on the result of step S31, a magnitude spectrum mask is estimated for each target speaker (M_A and M_B) by the subsequent network.

For step S33, the predicted magnitude spectrum masks act on the original mixed magnitude spectrum |Y| to predict the magnitude spectra of the target speech, from which the separated voice of the first speaker and the separated voice of the second speaker are determined.
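A compact sketch of this separation flow is shown below. The plain 1x1 convolutions stand in for the stacked one-dimensional ResNets described later, all widths are assumptions, and the inputs are assumed to have already been brought to a common frame rate.

    import torch
    import torch.nn as nn

    class AVCSeparator(nn.Module):
        """Illustrative audio-visual-contextual mask estimator; all widths are assumptions."""
        def __init__(self, n_freq=321, vis_dim=512, ctx_dim=512, hid=256):
            super().__init__()
            self.audio_net = nn.Conv1d(n_freq, hid, 1)      # stands in for ResNet_M
            self.visual_net = nn.Conv1d(vis_dim, hid, 1)    # shared, stands in for ResNet_V
            self.context_net = nn.Conv1d(ctx_dim, hid, 1)   # shared, stands in for ResNet_E
            self.mask_a = nn.Sequential(nn.Conv1d(5 * hid, n_freq, 1), nn.Sigmoid())
            self.mask_b = nn.Sequential(nn.Conv1d(5 * hid, n_freq, 1), nn.Sigmoid())

        def forward(self, mix_mag, vis_a, vis_b, ctx_a, ctx_b):
            # mix_mag: (B, F, T); vis_* / ctx_*: (B, D, T), already at a common frame rate.
            y_r = self.audio_net(mix_mag)
            fused = torch.cat([y_r,
                               self.visual_net(vis_a), self.visual_net(vis_b),
                               self.context_net(ctx_a), self.context_net(ctx_b)], dim=1)
            m_a, m_b = self.mask_a(fused), self.mask_b(fused)
            # The masks act on the mixture magnitude by element-wise multiplication.
            return m_a * mix_mag, m_b * mix_mag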

It can be seen from this embodiment that contextual information is extracted from the mixed audio signal and the visual information of the target speakers and is incorporated into the speech separation task. The method models the human mechanism, in a cocktail-party scene, of filling in missed speech and correcting misheard speech by understanding the speaker's context. Experiments show that, compared with an audio-visual speech separation baseline, audio-visual speech separation that incorporates context information achieves a clear improvement in separation performance. The method also has application value in practical scenarios.

The steps of the method are now explained in detail. The mixed speech is separated in the time-frequency (T-F) domain. Consider a linear mixture of the speech of two speakers A and B:

y = x_A + x_B

After the short-time Fourier transform (STFT), a single frame of the mixture in the T-F domain can be written as

Y_t ∈ C^(N/2+1)

where N is the STFT window size. The mixture of T frames can then be written as

Y = [Y_1, Y_2, …, Y_T]

and the magnitude spectrum |Y| of Y is obtained.

Then, the visual representations of the two target speakers A and B are denoted V_A and V_B, where D is the dimensionality of each visual frame. The audio-visual speech separation network can be abstractly represented as

M_A, M_B = Net(|Y|, V_A, V_B)

where M_A and M_B are the estimated magnitude masks.

As shown in FIG. 6, the audio-visual separation network takes the magnitude spectrum |Y| of the mixed speech Y and the visual representations V_A and V_B of the two speakers as input. The input representations are processed by different one-dimensional ResNets. Each ResNet consists of a stack of basic blocks, each containing a one-dimensional convolutional layer with a residual connection, a ReLU activation layer, and a batch normalization layer; some basic blocks contain an additional up-sampling or down-sampling layer. First, a weight-sharing ResNet_V processes V_A and V_B to obtain high-level visual representations. The magnitude spectrum of the mixed speech Y is processed by ResNet_M to obtain a high-level audio representation Y_R. ResNet_M contains two down-sampling layers with a down-sampling factor of 2 each, because in our setup each visual representation frame corresponds to 4 audio frames. The high-level representations are then concatenated along the channel dimension to obtain a fused representation, which is passed to ResNet_FA and ResNet_FB and then through a sigmoid activation to estimate the magnitude masks M_A and M_B. The estimated masks are applied to the mixed magnitude spectrum by element-wise multiplication to obtain the predicted magnitude spectra:

|X̂_A| = M_A ⊙ |Y|,  |X̂_B| = M_B ⊙ |Y|
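A minimal version of such a basic block, sketched under the description above (the exact layer order and hyper-parameters of the networks in FIG. 7 may differ), could look like this:

    import torch.nn as nn

    class BasicBlock1d(nn.Module):
        """1-D convolution with a residual connection, ReLU and batch normalization;
        an optional extra layer down-samples on the time axis."""
        def __init__(self, channels, kernel_size=3, downsample=1):
            super().__init__()
            self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
            self.relu = nn.ReLU()
            self.bn = nn.BatchNorm1d(channels)
            self.pool = nn.AvgPool1d(downsample) if downsample > 1 else nn.Identity()

        def forward(self, x):                  # x: (B, C, T)
            out = self.bn(self.relu(self.conv(x)))
            out = out + x                      # residual connection
            return self.pool(out)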

the L1 loss is used for training, with the optimization objective

L_sep = ‖ |X_A| − |X̂_A| ‖_1 + ‖ |X_B| − |X̂_B| ‖_1

where |X_A| and |X_B| are the target magnitude spectra of the two speakers in the mixed speech.

In the separation stage, the estimated magnitude spectra and the phase spectrum of Y are used to reconstruct the predicted STFT spectra, and the predicted speech is then recovered using the inverse short-time Fourier transform (iSTFT).
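As a sketch of this reconstruction step using torch.stft/torch.istft (the Hann window is an assumption; the window and hop lengths follow the experimental settings given later):

    import torch

    def reconstruct(mix_wave, est_mag, n_fft=640, hop=160):
        """Recover a predicted waveform from an estimated magnitude spectrum using the
        phase of the mixture; est_mag must have the same shape as the mixture STFT."""
        window = torch.hann_window(n_fft)
        mix_spec = torch.stft(mix_wave, n_fft, hop_length=hop, window=window,
                              return_complex=True)           # (F, T), complex
        phase = torch.angle(mix_spec)
        est_spec = torch.polar(est_mag, phase)                # estimated magnitude + mixture phase
        return torch.istft(est_spec, n_fft, hop_length=hop, window=window)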

Audio-visual-contextual separation. In addition to the visual modality, the contextual language modality is further explored for speech separation.

In an attention-based end-to-end speech recognition model, the encoder is considered to encode the contextual information of the speech signal; explicitly incorporating this contextual information, which covers the acoustic and linguistic information of each speaker, can help improve speech separation. In previous work, however, this is a two-stage process: the first stage performs ordinary speech separation without context and extracts context information from the separated speech, and a second separation stage then incorporates that context. This approach has clear constraints: it depends heavily on the performance of the first-stage separation module, which affects the accuracy of the extracted context information; moreover, in real scenes it is usually impossible to obtain clean speech of the target speaker in advance.

The method provides a more direct and effective way of extracting contextual language embeddings, and further integrates them, together with the audio and visual modalities, into speech separation. FIGS. 2 and 4 illustrate the overall framework of the proposed contextual language embedding learning. First, an end-to-end single-speaker speech recognition model based on the CTC-attention mechanism is trained on single-speaker data using the ESPnet toolkit. Using this pre-trained single-speaker ASR model, the encoder can generate ideal contextual language embeddings E_A and E_B for the two speakers A and B in the mixture. These ideal context embeddings E_A and E_B can be used directly by the later separation module, or further used as training labels for the context embedding prediction module.

In the context prediction model, the spectral features of the mixed speech and the visual representations of the two speakers are taken as input. The visual representations are processed by a weight-sharing one-dimensional ResNet_V′, and the mixed magnitude spectrum |Y| is processed by a two-dimensional VGG-like block followed by a one-dimensional ResNet_M′. The high-level representations are then concatenated into a fused representation, which is processed by a one-dimensional ResNet_F′. Two separate bidirectional long short-term memory (BLSTM) layers, BLSTM_SA and BLSTM_SB (one per speaker), together with a shared encoder BLSTM layer BLSTM_E, are used to predict the per-speaker context embeddings, generating Ê_A and Ê_B, the predicted context embeddings of the two speakers in the mixed speech. The training criterion can be written as

L_ctx = Dist(E_A, Ê_A) + Dist(E_B, Ê_B)

where Dist(·, ·) denotes the training distance between an ideal and a predicted context embedding.

Audio-visual-contextual speech separation. The predicted (or ideal) contextual language embeddings can then be integrated with the audio and visual modalities to build an audio-visual-contextual speech separation system, as shown in FIG. 6. A weight-sharing ResNet_E is added, which converts the context embeddings Ê_A and Ê_B into high-level representations used for separating the speech. Then, similarly to the audio-visual system, all the high-level representations are concatenated together into a fused representation.

Attention over the multi-modal embeddings. In the audio-visual-contextual speech separation system proposed by this method, an attention mechanism is developed to make better use of the multi-modal information. Before the fusion step, the high-level visual and contextual representations of speaker A are first concatenated and projected through a shallow network ResNet_VE to obtain a fused cue representation C_A; the same procedure applied to speaker B yields C_B. C_A and C_B can be regarded as cue information of the target speakers.

A scaled dot-product attention matrix A is computed between C_A and C_B:

A = C_A · C_B^T / sqrt(D)

where D is the dimensionality of C_A and C_B. The attention score matrix A is then converted into attention features for speakers A and B, obtained as W·A^T and W·A respectively, through a shared learnable fully connected layer W. W projects each row of the attention matrix to a feature vector; its input dimension is L, the maximum frame length in the data set. In the implementation, the padded positions of the attention features are masked out. Finally, all the high-level representations are combined together into F = [C_A; C_B; Y_R].
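A sketch of this attention step is given below; the scaling by sqrt(D), the shared projection W over the padded time axis, and the masking of padded positions follow the description above, while the concrete dimensions and the module name are assumptions.

    import math
    import torch
    import torch.nn as nn

    class CrossSpeakerAttention(nn.Module):
        """Scaled dot-product attention between the two speakers' cue representations C_A and
        C_B, followed by a shared learnable projection W (illustrative sizes)."""
        def __init__(self, d_model=256, max_len=1000):
            super().__init__()
            self.w = nn.Linear(max_len, d_model, bias=False)  # projects attention rows to features
            self.max_len = max_len

        def forward(self, c_a, c_b):
            # c_a, c_b: (B, T, D) cue representations; T must not exceed max_len.
            d = c_a.size(-1)
            att = torch.matmul(c_a, c_b.transpose(1, 2)) / math.sqrt(d)   # (B, T, T)
            t = att.size(1)
            # Pad the time axes to max_len so the shared projection W has a fixed input size;
            # in a full implementation the padded positions would be masked out.
            att = nn.functional.pad(att, (0, self.max_len - t, 0, self.max_len - t))
            feat_a = self.w(att.transpose(1, 2)[:, :t, :])    # attention feature for speaker A (W·A^T)
            feat_b = self.w(att[:, :t, :])                    # attention feature for speaker B (W·A)
            return feat_a, feat_b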

Experiments: the speech separation model and the context embedding prediction model were trained on the LRS2 data set, an audio-visual data set collected from BBC television programmes. The LibriSpeech corpus was also used for end-to-end single-speaker automatic speech recognition training. Visual representation: a pre-trained lip-reading network is used to extract visual representations from the LRS2 data set. For each video frame, the speaker's face region is first cropped and then processed by the pre-trained model to generate 512-dimensional features.

Audio representation: in the LRS2 data set, the audio is recorded at a sampling rate of 16 kHz and the video frame rate is 25 fps. For the STFT, the window size is set to 40 ms and the hop length to 10 ms; with this setting each frame of the magnitude spectrum is 321-dimensional, and every 4 frames of the magnitude spectrum correspond to a single frame of the visual representation.
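These settings correspond to an STFT with n_fft = 640 samples (40 ms at 16 kHz) and hop = 160 samples (10 ms), giving 640/2 + 1 = 321 frequency bins per frame; a quick check (the Hann window is an assumption):

    import torch

    sr = 16000
    n_fft = int(0.040 * sr)      # 40 ms window  -> 640 samples
    hop = int(0.010 * sr)        # 10 ms hop     -> 160 samples

    wave = torch.randn(sr * 3)   # 3 s of dummy audio
    spec = torch.stft(wave, n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    print(spec.abs().shape[0])   # 321 frequency bins, i.e. 640 // 2 + 1
    # Four 10 ms magnitude-spectrum frames cover one 25 fps (40 ms) video frame.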

Context learning: in end-to-end single-speaker ASR training, the input features are converted into 80-dimensional log-mel filter-bank coefficients. The ideal and predicted context embeddings are 512-dimensional. The ASR encoder sub-samples the input features by a factor of 4 in time, so the length of the ideal context embedding is the same as that of the visual representation.

Synthetic audio: the mixed audio is generated from two target utterances randomly chosen from the LRS2 data set. The target utterances are linearly mixed, with the shorter one padded to the same length as the longer one.
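A sketch of this mixing step (plain linear addition with zero-padding of the shorter utterance; any gain scaling is not specified in the text and is omitted here):

    import torch
    import torch.nn.functional as F

    def mix_pair(wave_a: torch.Tensor, wave_b: torch.Tensor) -> torch.Tensor:
        """Linearly mix two target utterances, zero-padding the shorter one."""
        n = max(wave_a.numel(), wave_b.numel())
        return F.pad(wave_a, (0, n - wave_a.numel())) + F.pad(wave_b, (0, n - wave_b.numel()))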

An end-to-end single-speaker ASR model based on the CTC/attention mechanism was trained on the LibriSpeech 960 h corpus. The training procedure follows the recipe in the ESPnet toolkit. After convergence on the LibriSpeech data set, the model was fine-tuned on the LRS2 training set. On the LRS2 test set, the word error rate of the well-trained ASR model reached 8.2%. The ASR encoder used to extract the ideal context embeddings is a 5-layer BLSTM with projection, each layer containing 512 units; the encoder sub-samples by a factor of 4 in time.

The VGG-like block of the context embedding prediction model comprises four two-dimensional convolutional layers. In each convolution the kernel size is 3, and the numbers of channels are 64-64-128-128. The VGG-like block contains two max-pooling layers and sub-samples by a factor of 4 in time. The separate (speaker-specific) BLSTM networks consist of 2 layers of 512 units each, and the weight-sharing encoder BLSTM consists of 1 layer of 512 units. The dropout rate of the BLSTMs is set to 0.2. Details of the ResNets in the context embedding prediction model are shown in FIG. 7. Training uses the Adam optimizer with a weight decay of 10^-6. The learning rate is initially set to 10^-4 and is then reduced by a factor of 3 every 3 stages. The batch size is set to 16, and 4 GTX-2080Ti GPUs are used for data-parallel training.
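A sketch of these optimizer settings in PyTorch; interpreting "every 3 stages" as every 3 epochs is an assumption:

    import torch

    model = torch.nn.Linear(10, 10)   # placeholder for the context embedding prediction model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-6)
    # Reduce the learning rate by a factor of 3 every 3 epochs ("every 3 stages").
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=1.0 / 3.0)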

FIG. 7 lists the details of the ResNets in the audio-visual and audio-visual-contextual speech separation networks, where N is the number of residual blocks; C is the number of convolution channels; O is the output size (if different from C, an additional projection layer is included); K is the kernel size; and D/U is the down-sampling or up-sampling factor in time. The training procedure is almost the same as in previous work except for the data length: to maintain the consistency of the context information, the input data is not clipped to a fixed length in this method. The speech separation model is trained with data parallelism on 4 GTX-2080Ti GPUs with a batch size of 32. A bucket sampler is used during training so that the lengths within each batch do not differ much.

Results and analysis: the method uses signal-to-distortion ratio (SDR), short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) as evaluation metrics.

To evaluate the upper bound of incorporating context embeddings, ideal context embeddings are first used in training and evaluation. As shown in FIG. 8, the speech separation system using ideal context embeddings improves considerably over the audio-visual speech separation system in all respects. The method then evaluates the new audio-visual-contextual model using predicted context embeddings, since ideal context embeddings are not available in real applications. The use of different context embeddings in training and testing is compared and listed in FIG. 8. The experimental results show that the context embeddings extracted by the model also bring a clear improvement to the speech separation of a strong audio-visual bi-modal system.

The method further evaluates the multi-modal attention mechanism described above, with the results shown in FIG. 9. The results show that the proposed attention yields a consistent additional improvement when the multi-modal embeddings are used.

The method provides a novel multi-modal speech separation architecture that comprises three modalities: audio, visual and contextual. Specific models are designed to extract contextual linguistic information directly from multi-speaker mixed speech, and this contextual linguistic knowledge is combined with the other modality embeddings for speech separation through an appropriate attention mechanism. With the proposed audio-visual-contextual architecture, significant improvements in speech separation are achieved.

Fig. 10 is a schematic structural diagram of a training system for a context information prediction model for a video scene according to an embodiment of the present invention, which can execute the training method for a context information prediction model for a video scene according to any of the above embodiments and is configured in a terminal.

The embodiment provides a training system for a context information prediction model of a video scene, which comprises: an ideal context feature determination program module 11, a predicted context feature determination program module 12 and a training program module 13.

The ideal context feature determination program module 11 is configured to extract a first ideal context feature and a second ideal context feature from a first clean audio of a first speaker and a second clean audio of a second speaker, respectively, through a single-speaker end-to-end speech recognition encoder; the predicted context feature determination program module 12 is configured to take the magnitude spectrum of mixed audio generated from the first clean audio and the second clean audio, together with the first visual representation information of the first speaker and the second visual representation information of the second speaker, as inputs of a context information prediction model, and to output a first predicted context feature and a second predicted context feature; and the training program module 13 is configured to train the context information prediction model based on the errors between the first and second ideal context features and the first and second predicted context features, until the first and second predicted context features approach the first and second ideal context features.

The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the training method for the context information prediction model of the video scene in any method embodiment;

as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

extracting a first ideal context feature and a second ideal context feature from a first clean audio of a first speaker and a second clean audio of a second speaker, respectively, through a single-speaker end-to-end speech recognition encoder;

taking a magnitude spectrum of mixed audio generated from the first clean audio and the second clean audio, together with first visual representation information of the first speaker and second visual representation information of the second speaker, as inputs of a context information prediction model, and outputting a first predicted context feature and a second predicted context feature;

training the context information prediction model based on the errors between the first and second ideal context features and the first and second predicted context features, until the first and second predicted context features approach the first and second ideal context features.

Fig. 11 is a schematic structural diagram of a context information prediction system according to an embodiment of the present invention, which can execute the context information prediction method for a video scene according to any of the above embodiments and is configured in a terminal.

The embodiment provides a context information prediction system, which comprises: an extraction and sampling program module 21, a high-dimensional audio modality representation determination program module 22, a high-dimensional visual modality representation determination program module 23, a concatenated modality representation determination program module 24 and a context information representation determination program module 25.

The extraction and sampling program module 21 is configured to input the magnitude spectrum of the mixed speech to be separated into the context information prediction model trained according to the training method of claim 8, perform feature extraction in the time-frequency dimensions through a VGG-like convolutional network, and perform down-sampling in the time dimension; the high-dimensional audio modality representation determination program module 22 is configured to input the feature-extracted and down-sampled magnitude spectrum into the deep residual network to obtain a high-dimensional audio modality representation; the high-dimensional visual modality representation determination program module 23 is configured to input first visual representation information of a first speaker and second visual representation information of a second speaker in the mixed speech to be separated into the context information prediction model, and determine a first high-dimensional visual modality representation and a second high-dimensional visual modality representation through the deep residual network; the concatenated modality representation determination program module 24 is configured to concatenate the high-dimensional audio modality representation, the first high-dimensional visual modality representation and the second high-dimensional visual modality representation to determine a concatenated modality representation; and the context information representation determination program module 25 is configured to input the concatenated modality representation into two different bidirectional long short-term memory (BLSTM) recurrent neural networks to obtain a first context information representation of the first speaker and a second context information representation of the second speaker.

The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the context information prediction method in any method embodiment;

as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

inputting the magnitude spectrum of the mixed speech to be separated into a context information prediction model trained according to the training method of claim 1, performing feature extraction in the time-frequency dimensions through a VGG-like convolutional network, and performing down-sampling in the time dimension;

inputting the feature-extracted and down-sampled magnitude spectrum into a deep residual network to obtain a high-dimensional audio modality representation;

inputting first visual representation information of a first speaker and second visual representation information of a second speaker in the mixed speech to be separated into the context information prediction model, and determining a first high-dimensional visual modality representation and a second high-dimensional visual modality representation through the deep residual network;

concatenating the high-dimensional audio modality representation, the first high-dimensional visual modality representation and the second high-dimensional visual modality representation to determine a concatenated modality representation;

and inputting the concatenated modality representation into two different bidirectional long short-term memory (BLSTM) recurrent neural networks to obtain a first context information representation of the first speaker and a second context information representation of the second speaker.

Fig. 12 is a schematic structural diagram of a voice separation system according to an embodiment of the present invention, which can execute the voice separation method according to any of the above embodiments and is configured in a terminal.

The embodiment provides a speech separation system, which comprises: a high-dimensional feature representation determining program module 31, a magnitude spectral mask determining program module 32 and a speech separating program module 33.

The high-dimensional feature representation determining program module 31 is configured to input the first high-dimensional visual modality representation, the second high-dimensional visual modality representation, the magnitude spectrum of the mixed speech to be separated, the first context information representation and the second context information representation determined in claim 9 into a speech separation system, and determine a high-dimensional feature representation of the mixed speech to be separated; the magnitude spectral mask determining program module 32 is configured to determine a first magnitude spectrum mask for the first speaker and a second magnitude spectrum mask for the second speaker based on the high-dimensional feature representation; and the speech separating program module 33 is configured to predict the magnitude spectra of the separated speech by applying the first magnitude spectrum mask and the second magnitude spectrum mask to the magnitude spectrum of the mixed speech to be separated, and determine the separated voice of the first speaker and the separated voice of the second speaker.

An embodiment of the present invention further provides a non-volatile computer storage medium, which stores computer-executable instructions capable of executing the speech separation method for a video scene in any of the above method embodiments.

As one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

inputting the first high-dimensional visual modality representation, the second high-dimensional visual modality representation, the magnitude spectrum of the mixed speech to be separated, the first context information representation and the second context information representation determined in claim 5 into a speech separation system, and determining a high-dimensional feature representation of the mixed speech to be separated;

determining a first magnitude spectrum mask for the first speaker and a second magnitude spectrum mask for the second speaker based on the high-dimensional feature representation;

and predicting the magnitude spectrum of the mixed speech to be separated through the first magnitude spectrum mask and the second magnitude spectrum mask, and determining the separated speech of the first speaker and the separated speech of the second speaker.
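Below is a hypothetical end-to-end use of the MaskSeparator sketch: the masks are applied to the mixture magnitude spectrum and the separated waveforms are resynthesised with the mixture phase via the inverse STFT, a common practice for magnitude-mask separation that the text itself does not spell out. The STFT parameters and the random placeholder visual and context features are assumptions.

```python
import torch

n_fft, hop = 512, 160
window = torch.hann_window(n_fft)
mixture = torch.rand(1, 32000)                         # 2 s of 16 kHz mixed audio (placeholder)
spec = torch.stft(mixture, n_fft, hop_length=hop,
                  window=window, return_complex=True)  # (1, F, T)
mag, phase = spec.abs().transpose(1, 2), spec.angle()  # magnitude as (1, T, F)

separator = MaskSeparator(n_freq=mag.shape[-1])        # MaskSeparator from the sketch above
T = mag.shape[1]
vis1 = vis2 = torch.rand(1, T, 512)                    # placeholder visual representations
ctx1 = ctx2 = torch.rand(1, T, 512)                    # placeholder context representations
est1, est2 = separator(mag, vis1, vis2, ctx1, ctx2)    # estimated magnitude spectra

# resynthesise each speaker's waveform with the mixture phase
for est in (est1, est2):
    complex_spec = torch.polar(est.transpose(1, 2), phase)
    wav = torch.istft(complex_spec, n_fft, hop_length=hop, window=window)
```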

The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the method for training a context information prediction model for a video scene in any of the method embodiments described above.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

An embodiment of the present invention further provides an electronic device, which includes: the apparatus includes at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a context information prediction model for a video scene of any of the embodiments of the present invention.

The client of the embodiment of the present application exists in various forms, including but not limited to:

(1) Mobile communication devices, which are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include smart phones, multimedia phones, feature phones and low-end phones, among others.

(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also support mobile Internet access. Such terminals include PDA, MID and UMPC devices, such as tablet computers.

(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players, handheld game consoles, e-book readers, intelligent toys and portable in-vehicle navigation devices.

(4) Other electronic devices with data processing capabilities.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
