Target speaker speech enhancement method based on a conditional variational autoencoder


Note: This technology, "Target speaker speech enhancement method based on a conditional variational autoencoder" (基于条件变分自编码器的目标人语音增强方法), was created by Le Xiaohuai and Lu Jing on 2020-06-18. Abstract: The invention discloses a target speaker speech enhancement method based on a conditional variational autoencoder. The method comprises the following steps: (1) perform a short-time Fourier transform on clean speech data of the target speaker to obtain the magnitude spectrum; (2) train a conditional variational autoencoder as the speech model, using the target speaker's clean speech magnitude spectrum and identity encoding vector; (3) perform a short-time Fourier transform on the noisy speech signal to obtain its magnitude and phase spectra; (4) input the noisy speech magnitude spectrum and the target speaker identity encoding vector into the speech model, fix the weights of the speech model decoder, and jointly and iteratively optimize the speech model and a non-negative matrix factorization model to obtain magnitude spectrum estimates of the speech and the noise; (5) combine the magnitude spectrum estimate with the noisy speech phase spectrum into a complex spectrum, and obtain the enhanced time-domain speech signal through an inverse short-time Fourier transform. The method can enhance the target speaker's speech under a variety of complex noises with high robustness.

1. A target speaker speech enhancement method based on a conditional variational autoencoder, characterized by comprising the following steps:

step 1, performing a short-time Fourier transform on clean speech data of the target speaker to obtain a short-time magnitude spectrum;

step 2, constructing an identity encoding vector for the target speaker, and training a conditional variational autoencoder as the speech model using the identity encoding vector and the short-time magnitude spectrum obtained in step 1; the inputs of the conditional variational autoencoder are the target speaker's speech magnitude spectrum and identity encoding vector, and its output is the logarithm of the target speaker's speech magnitude spectrum;

step 3, performing a short-time Fourier transform on the noisy speech signal to obtain its short-time magnitude spectrum, and retaining the phase spectrum of the noisy speech signal;

step 4, inputting the short-time magnitude spectrum of the noisy speech signal obtained in step 3 into the speech model, using the target speaker's identity encoding vector as the condition term of the speech model, and fixing the weights of the speech model decoder; jointly and iteratively optimizing the speech model and a non-negative matrix factorization model to obtain magnitude spectrum estimates of the speech and the noise;

step 5, combining the magnitude spectrum estimate obtained in step 4 with the phase spectrum of the noisy speech signal retained in step 3 into a complex spectrum, and obtaining the enhanced time-domain speech signal through an inverse short-time Fourier transform.

2. The target speaker speech enhancement method based on a conditional variational autoencoder according to claim 1, characterized in that in step 2 the conditional variational autoencoder uses deep neural networks as the encoder and the decoder; the encoder maps the speech magnitude spectrum to a random variable z, and the decoder maps from the random variable z back to the clean speech.

3. The method of claim 1, wherein in step 4 the specific steps of jointly and iteratively optimizing the speech model and the non-negative matrix factorization model are as follows:

1) the encoder and decoder of the conditional variational autoencoder can be represented as follows:

$$z_t \sim q_\phi(z_t \mid x_t, c)$$

$$x_t \sim p_\theta(x_t \mid z_t, c)$$

where $x_t$ is the magnitude spectrum of the t-th frame of the input speech, $z_t$ is the latent variable of the t-th frame output by the encoder, $c$ denotes the speaker identity vector, $\phi$ and $\theta$ denote the weights of the encoder and the decoder respectively, and $q_\phi$ and $p_\theta$ denote the distribution from which the encoder generates the latent variables and the distribution from which the decoder generates the speech magnitude spectrum estimate, respectively;

after the above encoder and decoder have been trained, the decoder $p_\theta(x_t \mid z_t, c)$ is fixed during speech enhancement; only the encoder weights are trained by back-propagation, the speech magnitude spectrum output by the speech model is estimated as $\sigma(z_t, c)$, and the power spectrum estimate is $\sigma^2(z_t, c)$;

2) the non-negative matrix factorization can be expressed in the form:

$$V = WH$$

where $V \in \mathbb{R}_+^{F \times T}$ is the noise variance matrix, and $W \in \mathbb{R}_+^{F \times K}$ and $H \in \mathbb{R}_+^{K \times T}$ are low-rank non-negative matrices;

3) during optimization, the noisy speech magnitude spectrum $x_t$ and the target speaker identity vector $c$ are input, and the non-negative matrix factorization parameters $W$, $H$ and the gain vector $a = (a_1, \ldots, a_T)$ over the $T$ frames are initialized, where the estimated noisy speech variance is

$$\hat{v}_{ft}^{(r)} = a_t\, \sigma_f^2(z_t^{(r)}, c) + (WH)_{ft}$$

with $z_t^{(r)}$ the $r$-th sample drawn from $q_\phi(z_t \mid x_t, c)$;

the non-negative matrix factorization parameters $W$, $H$ and $a_t$ are then updated with the following iterative formulas:

$$H \leftarrow H \odot \left[\frac{W^\top\!\left(|X|^{\odot 2} \odot \sum_{r=1}^R \big(\hat{V}^{(r)}\big)^{\odot -2}\right)}{W^\top \sum_{r=1}^R \big(\hat{V}^{(r)}\big)^{\odot -1}}\right]^{\odot \frac{1}{2}}$$

$$W \leftarrow W \odot \left[\frac{\left(|X|^{\odot 2} \odot \sum_{r=1}^R \big(\hat{V}^{(r)}\big)^{\odot -2}\right) H^\top}{\left(\sum_{r=1}^R \big(\hat{V}^{(r)}\big)^{\odot -1}\right) H^\top}\right]^{\odot \frac{1}{2}}$$

$$a_t \leftarrow a_t \left[\frac{\sum_f |x_{ft}|^2 \sum_{r=1}^R \sigma_f^2(z_t^{(r)}, c)\, \big(\hat{v}_{ft}^{(r)}\big)^{-2}}{\sum_f \sum_{r=1}^R \sigma_f^2(z_t^{(r)}, c)\, \big(\hat{v}_{ft}^{(r)}\big)^{-1}}\right]^{\frac{1}{2}}$$

where $\hat{V}^{(r)}$ is the matrix with entries $\hat{v}_{ft}^{(r)}$, and $\odot$ denotes element-wise multiplication (element-wise powers being written $\cdot^{\odot p}$);

after several iterations, the resulting clean speech estimate is expressed as:

$$\hat{s}_{ft} = \left(\frac{1}{R}\sum_{r=1}^R \frac{a_t\, \sigma_f^2(z_t^{(r)}, c)}{a_t\, \sigma_f^2(z_t^{(r)}, c) + (WH)_{ft}}\right) x_{ft}$$

where $x_{ft}$ and $\hat{s}_{ft}$ denote the $f$-th spectral components of the $t$-th frame of the noisy speech and of the clean speech estimate, respectively.

4. The target speaker speech enhancement method based on a conditional variational autoencoder according to claim 3, wherein in step 3) the objective function is optimized using the following equation:

$$\min_\phi \sum_{f,t} \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\!\left[d_{IS}\!\left(|x_{ft}|^2,\ a_t\, \sigma_f^2(z_t, c) + (WH)_{ft}\right)\right]$$

where $d_{IS}(x, y) = \frac{x}{y} - \log\frac{x}{y} - 1$ denotes the Itakura-Saito divergence.

Technical Field

The invention belongs to the field of speech enhancement, and particularly relates to a target speaker speech enhancement method based on a conditional variational autoencoder.

Background

When a microphone is used to collect a speaker's speech signal in a real environment, various interference signals, such as background noise and room reverberation, are collected at the same time. These interferences degrade speech quality and, at low signal-to-noise ratios, severely degrade speech recognition accuracy. Techniques that extract the target speech from such interference are called speech enhancement techniques.

Spectral subtraction can be used to achieve speech enhancement (Boll, S.F. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27, 113-120.). In Chinese patent CN103594094A, the speech is transformed into the time-frequency domain by a short-time Fourier transform; an adaptive-threshold spectral subtraction method then subtracts the estimated noise power spectrum from the power spectrum of the current frame of the speech signal to obtain the power spectrum of the enhanced signal, and the time-domain enhanced signal is finally obtained by an inverse short-time Fourier transform, as sketched below. However, because this enhancement method makes unrealistic assumptions about speech and noise, it incurs a large penalty on speech quality.
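An illustrative sketch of the spectral-subtraction idea just described, not code from CN103594094A; the function name and flooring constant are assumptions:

```python
# Illustrative spectral subtraction: subtract an estimated noise power
# spectrum from each frame and floor the result to keep it non-negative.
import numpy as np

def spectral_subtract(noisy_power: np.ndarray, noise_power: np.ndarray,
                      floor: float = 1e-3) -> np.ndarray:
    """noisy_power: F x T power spectrogram; noise_power: F-dim noise estimate."""
    enhanced = noisy_power - noise_power[:, None]       # per-frame subtraction
    return np.maximum(enhanced, floor * noisy_power)    # spectral floor
```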

Non-negative matrix factorization algorithms are also used for speech enhancement (Wilson, K.W., Raj, B., Smaragdis, P., et al. Speech denoising using nonnegative matrix factorization with priors. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2008.). Applying non-negative matrix factorization separately to the short-time power spectra of speech and of noise yields dictionaries of speech and noise, through which enhancement is then performed (see the sketch below). Chinese patent CN104505100A uses a non-negative matrix factorization algorithm combining spectral subtraction and the minimum mean square error criterion for speech enhancement. However, non-negative matrix factorization models speech features only linearly and models the non-linear characteristics of speech poorly, which limits its performance.
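A minimal NMF dictionary-learning sketch, illustrative of the prior art rather than of the patented method; Euclidean-distance multiplicative updates are assumed, and the dictionary size K is an arbitrary choice:

```python
# Factor a non-negative F x T spectrogram V into a dictionary W (F x K)
# and activations H (K x T) with multiplicative updates.
import numpy as np

def learn_dictionary(V: np.ndarray, K: int = 32, iters: int = 100,
                     eps: float = 1e-10):
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], K)) + eps
    H = rng.random((K, V.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative update for H
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for W
    return W, H
```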

Recently, a variety of deep-learning-based generative models have been used for speech modeling. Among them, the variational autoencoder is a method that explicitly learns the data distribution and can be used for non-linear modeling of speech. The literature (S. Leglaive, L. Girin and R. Horaud, "A variance modeling framework based on variational autoencoders for speech enhancement," 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), Aalborg, 2018, pp. 1-6, doi:10.1109/MLSP.2018.8516711.) uses a speech enhancement algorithm combining a variational autoencoder with non-negative matrix factorization, in which the variational autoencoder is trained beforehand on clean short-time speech power spectra and the non-negative matrix model is learned at enhancement time; this gives a good enhancement effect on non-stationary noise without impairing speech quality. However, since the variational autoencoder is trained on clean speech, the enhancement model copes poorly with interfering human voices.

In practical applications the types of noise vary widely; besides non-human-voice noise, extracting the target speaker's voice from interfering human voices is also very meaningful.

Disclosure of Invention

Drawings

FIG. 1 is a processing flow chart of the target speaker speech enhancement method based on a conditional variational autoencoder according to the invention.

FIG. 2 is a schematic diagram of the conditional variational autoencoder model employed in an embodiment of the present invention; the deep neural network used is a frame-independent fully-connected network, $|s_t|$ denotes the input clean speech magnitude spectrum, $c$ denotes the one-hot identity vector of the speaker corresponding to the speech, Embedding denotes the network that reduces the speaker identity vector to 10 dimensions, and $\sigma(z_t, c)$ denotes the magnitude spectrum of the output speech.

FIG. 3 compares the SDR values of enhanced speech under different noise types for the prior variational autoencoder/non-negative matrix factorization algorithm and the method of the present invention.

FIG. 4 compares the enhancement effect on the target speech of the method of the present invention and the existing variational autoencoder/non-negative matrix factorization algorithm under a multi-speaker mixture; panel (a) shows the short-time magnitude spectrum of the speech to be enhanced.

Detailed Description

The invention relates to a target speaker speech enhancement method based on a conditional variational autoencoder, which mainly comprises the following parts:

1. Target speaker speech model training

1) Short-time Fourier transform of the target speaker's clean speech signal

Let the target speaker's clean speech signal be $x(t)$. Applying a short-time Fourier transform with an $N$-point FFT yields a complex spectrum of $T$ frames and $F = N/2 + 1$ dimensions, $X = \{x_1, \ldots, x_T\}$, where

$$x_t = [x_{1t}, \ldots, x_{Ft}]^\top \in \mathbb{C}^F \qquad (1)$$

and $|x_{ft}|$ denotes the magnitude of the $f$-th spectral component of the $t$-th frame. A minimal sketch of this step follows.
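A minimal sketch of this step, assuming PyTorch, a mono signal and an $N = 1024$ point FFT; the window and hop length are illustrative choices not specified here:

```python
import torch

def magnitude_spectrum(x: torch.Tensor, n_fft: int = 1024, hop: int = 256):
    """Return the F x T magnitude and phase spectra, F = n_fft // 2 + 1."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True)    # complex spectrum X
    return spec.abs(), torch.angle(spec)      # |x_ft| and the phase

# mag, phase = magnitude_spectrum(torch.randn(16000))  # 1 s of dummy audio
```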

2) Constructing the target speaker identity vector

Given clean speech data from $M$ speakers, the identity of each speaker is labeled as an $M$-dimensional one-hot vector: if a target speaker is the $i$-th speaker in the data set, the $i$-th dimension of its identity vector is 1 and all other dimensions are 0, as in the snippet below.
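A two-line sketch of the identity vector, with $M = 4$ speakers and target speaker index $i = 2$ as arbitrary illustrative values:

```python
import torch
c = torch.nn.functional.one_hot(torch.tensor(2), num_classes=4).float()
print(c)  # tensor([0., 0., 1., 0.])
```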

3) Training the conditional variational autoencoder

The conditional variational autoencoder model consists of an encoder (Encoder) and a decoder (Decoder). The goal of the encoder is to map the speech magnitude spectrum $|x_t|$ to a random variable $z_t$; the goal of the decoder is to map back to the speech magnitude spectrum from this random variable, which is generally assumed to follow a Gaussian distribution.

The models of the encoder and decoder can thus be expressed as:

$$z_t \sim q_\phi(z_t \mid x_t, c) \qquad (2)$$

$$x_t \sim p_\theta(x_t \mid z_t, c) \qquad (3)$$

where $c$ is the condition term, i.e. the identity vector of the target speaker from step 2); the coupling of the encoder and the decoder is shown in FIG. 2. In this embodiment, a neural network reduces the $M$-dimensional identity vector to 10 dimensions, and the reduced output is concatenated with each hidden-layer output of the encoder and decoder, as in the sketch below.
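A minimal frame-wise PyTorch sketch consistent with FIG. 2: a fully-connected encoder and decoder, a linear Embedding reducing the $M$-dimensional one-hot identity vector to 10 dimensions, and concatenation of that embedding with the hidden-layer outputs. The hidden width, latent size and tanh activation are assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, n_freq=513, n_spk=4, emb_dim=10, hid=128, zdim=16):
        super().__init__()
        self.emb = nn.Linear(n_spk, emb_dim)            # one-hot -> 10-d identity
        self.enc = nn.Linear(n_freq + emb_dim, hid)
        self.enc_mu = nn.Linear(hid, zdim)              # encoder mean
        self.enc_logvar = nn.Linear(hid, zdim)          # encoder log-variance
        self.dec = nn.Linear(zdim + emb_dim, hid)
        self.dec_out = nn.Linear(hid + emb_dim, n_freq) # outputs log sigma^2(z, c)

    def encode(self, mag, c):
        e = self.emb(c)
        h = torch.tanh(self.enc(torch.cat([mag, e], dim=-1)))
        return self.enc_mu(h), self.enc_logvar(h)

    def decode(self, z, c):
        e = self.emb(c)
        h = torch.tanh(self.dec(torch.cat([z, e], dim=-1)))
        return self.dec_out(torch.cat([h, e], dim=-1))

    def forward(self, mag, c):
        mu, logvar = self.encode(mag, c)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # resampling trick
        return self.decode(z, c), mu, logvar
```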

The training goal of a conditional variational autoencoder is to maximize the likelihood of the decoder output, i.e. the closer the speech spectrum output by the decoder is to the true speech spectrum, the better. Its objective function can therefore be written as the log-likelihood:

$$\mathcal{L} = \log p_\theta(x_t \mid c) \qquad (4)$$

Using variational inference, this objective can be decomposed into the following, easier-to-compute formula:

$$\log p_\theta(x_t \mid c) \geq \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\!\left[\log p_\theta(x_t \mid z_t, c)\right] - D_{KL}\!\left(q_\phi(z_t \mid x_t, c)\, \|\, p(z_t)\right) \qquad (5)$$

where $D_{KL}(q \,\|\, p)$ denotes the Kullback-Leibler (K-L) divergence between two distributions. The second term of the above equation describes the K-L divergence between the latent variable output by the encoder and the normal distribution; specifically, the encoder outputs the parameters of $z_t$, and $z_t$ is then obtained with the resampling method shown in FIG. 2 and fed into the decoder. The first term means that $x_t$ is input to the encoder to obtain the latent variable $z_t$, and $z_t$ is fed into the decoder to obtain the estimate of $x_t$; the expectation is that the network output is as close as possible to its input $x_t$. Specifically, the Itakura-Saito (I-S) divergence between the network input and output is calculated:

$$d_{IS}(x, y) = \frac{x}{y} - \log\frac{x}{y} - 1 \qquad (6)$$

The objective function can therefore be rewritten as:

$$\mathcal{L}(\phi, \theta) = -\sum_{f,t} d_{IS}\!\left(|x_{ft}|^2,\ \sigma_f^2(z_t, c)\right) + \frac{1}{2}\sum_{d,t}\left[\log\tilde{\sigma}_d^2(x_t, c) - \tilde{\mu}_d^2(x_t, c) - \tilde{\sigma}_d^2(x_t, c)\right] \qquad (7)$$

where $\tilde{\mu}_d(x_t, c)$ and $\tilde{\sigma}_d^2(x_t, c)$ are the mean and variance of the $d$-th component of the encoder output $z_t$, respectively. Optimizing this model yields the target speaker speech model. Speech is generally assumed to follow the complex Gaussian distribution below, which can serve as the speech model:

$$s_{ft} \sim \mathcal{N}_c\!\left(0,\ \sigma_f^2(z_t, c)\right) \qquad (8)$$
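A sketch of objective (7) as a minimization loss, assuming the CVAE sketch above whose decoder outputs $\log \sigma^2(z_t, c)$; the `eps` guard is an implementation assumption:

```python
import torch

def cvae_loss(mag, log_var_out, mu, logvar, eps=1e-10):
    """mag: batch of |x_t| frames; log_var_out, mu, logvar: CVAE outputs."""
    ratio = mag.pow(2) / (log_var_out.exp() + eps)
    is_div = (ratio - torch.log(ratio + eps) - 1).sum(dim=-1)       # I-S term of (7)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1) # K-L term of (7)
    return (is_div + kl).mean()
```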

2. Speech enhancement using the iterative algorithm

1) Short-time Fourier transform of noisy speech signal

Let the noisy speech signal be $x(t)$; the short-time Fourier transform with an $N$-point FFT likewise yields the complex spectrum $X = \{x_1, \ldots, x_T\}$.

2) Modeling noise using non-negative matrix factorization models

Similar to the speech model (8), the noise model can also be described by the following distribution:

$$n_{ft} \sim \mathcal{N}_c\!\left(0,\ v_{ft}\right) \qquad (9)$$

where the matrix

$$V = [v_{ft}] \in \mathbb{R}_+^{F \times T}$$

is decomposed into the product of two low-rank matrices:

$$V = WH \qquad (10)$$

Assuming the noise is additive, the noisy speech spectrum $x_{ft}$ can be expressed as:

$$x_{ft} = \sqrt{a_t}\, s_{ft} + n_{ft} \qquad (11)$$

where $x_{ft}$, $s_{ft}$ and $n_{ft}$ denote the noisy speech, clean speech and noise spectra, respectively, and $a_t$ denotes the $t$-th element of the gain vector.

3) Iterative optimization

The parameters optimized in this embodiment are the noise-model parameters $\{W, H, a\}$ and the speech model; the optimization goal is to maximize the noisy speech likelihood induced by equation (11), which can generally be done with an expectation-maximization (E-M) algorithm. For the speech model, maximizing the noisy speech likelihood yields the following objective function:

$$\min_\phi \sum_{f,t} \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\!\left[d_{IS}\!\left(|x_{ft}|^2,\ a_t\, \sigma_f^2(z_t, c) + v_{ft}\right)\right] \qquad (12)$$

This resembles the training objective (7), except that the I-S divergence is now computed between the noisy speech power spectrum $|x_{ft}|^2$ and the estimated noisy speech power spectrum $a_t\, \sigma^2(z_t, c) + v_{ft}$; during this optimization the decoder weights $\theta$ are fixed and only $\phi$ is optimized (see the sketch below).

Since the decoder is already a suitable speech generation model, this embodiment fixes the decoder weights and optimizes only the encoder weights when optimizing the above objective. The noise model is optimized with the majorization-minimization (MM) algorithm, specifically accomplished by the following iterative equations:

$$H \leftarrow H \odot \left[\frac{W^\top\!\left(|X|^{\odot 2} \odot \sum_{r=1}^R \big(\hat{V}^{(r)}\big)^{\odot -2}\right)}{W^\top \sum_{r=1}^R \big(\hat{V}^{(r)}\big)^{\odot -1}}\right]^{\odot \frac{1}{2}}$$

$$W \leftarrow W \odot \left[\frac{\left(|X|^{\odot 2} \odot \sum_{r=1}^R \big(\hat{V}^{(r)}\big)^{\odot -2}\right) H^\top}{\left(\sum_{r=1}^R \big(\hat{V}^{(r)}\big)^{\odot -1}\right) H^\top}\right]^{\odot \frac{1}{2}}$$

$$a_t \leftarrow a_t \left[\frac{\sum_f |x_{ft}|^2 \sum_{r=1}^R \sigma_f^2(z_t^{(r)}, c)\, \big(\hat{v}_{ft}^{(r)}\big)^{-2}}{\sum_f \sum_{r=1}^R \sigma_f^2(z_t^{(r)}, c)\, \big(\hat{v}_{ft}^{(r)}\big)^{-1}}\right]^{\frac{1}{2}}$$

where $\hat{V}^{(r)}$ is the matrix with entries $\hat{v}_{ft}^{(r)} = a_t\, \sigma_f^2(z_t^{(r)}, c) + (WH)_{ft}$, $\odot$ denotes element-wise multiplication (with $\cdot^{\odot p}$ denoting the element-wise power), and $z_t^{(r)}$ denotes the $r$-th sample drawn from $q_\phi(z_t \mid x_t, c)$. A NumPy sketch of one such MM step follows.
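A NumPy sketch of one MM iteration as reconstructed above; `P` is the $F \times T$ noisy power spectrum $|X|^{\odot 2}$, `S2` holds $\sigma_f^2(z_t^{(r)}, c)$ for $R$ posterior samples (shape $R \times F \times T$), `W` is $F \times K$, `H` is $K \times T$, and `a` has length $T$:

```python
import numpy as np

def mixture_var(S2, W, H, a):
    """R x F x T estimated noisy variance a_t * sigma^2 + (WH)_ft."""
    return a[None, None, :] * S2 + (W @ H)[None, :, :]

def mm_step(P, S2, W, H, a, eps=1e-10):
    V = mixture_var(S2, W, H, a)
    inv1 = (1.0 / (V + eps)).sum(0)                  # sum_r V^(-1)
    inv2 = (P[None] / (V + eps) ** 2).sum(0)         # sum_r |x|^2 V^(-2)
    H *= np.sqrt((W.T @ inv2) / (W.T @ inv1 + eps))  # update H
    V = mixture_var(S2, W, H, a)
    inv1 = (1.0 / (V + eps)).sum(0)
    inv2 = (P[None] / (V + eps) ** 2).sum(0)
    W *= np.sqrt((inv2 @ H.T) / (inv1 @ H.T + eps))  # update W
    V = mixture_var(S2, W, H, a)
    num = (P[None] * S2 / (V + eps) ** 2).sum(axis=(0, 1))
    den = (S2 / (V + eps)).sum(axis=(0, 1))
    a *= np.sqrt(num / (den + eps))                  # update gain a_t
    return W, H, a
```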

Through the above iterations, converged speech and noise parameters are finally obtained. The final goal of this embodiment is to compute the clean speech estimate, which can be expressed as the following expectation:

$$\hat{s}_{ft} = \mathbb{E}_{q_\phi(z_t \mid x_t, c)}\!\left[\frac{a_t\, \sigma_f^2(z_t, c)}{a_t\, \sigma_f^2(z_t, c) + (WH)_{ft}}\right] x_{ft}$$

approximated in practice by averaging over the $R$ posterior samples $z_t^{(r)}$; the resulting magnitude estimate is combined with the noisy phase spectrum and transformed back to the time domain by an inverse short-time Fourier transform, as sketched below.
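A sketch of the final reconstruction under the same assumptions as the MM sketch: all arrays are NumPy, `noisy_mag` and `phase` are the $F \times T$ magnitude and phase kept in step 2.1, and the inverse STFT matches the analysis parameters:

```python
import numpy as np
import torch

def reconstruct(noisy_mag, phase, S2, W, H, a, n_fft=1024, hop=256):
    V = a[None, None, :] * S2 + (W @ H)[None, :, :]
    gain = (a[None, None, :] * S2 / V).mean(axis=0)   # MC-averaged Wiener-type gain
    spec = torch.as_tensor(gain * noisy_mag * np.exp(1j * phase)).to(torch.complex64)
    window = torch.hann_window(n_fft)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window)
```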
