Speech enhancement method, electronic device, and computer-readable storage medium

Document No.: 1876907  Publication date: 2021-11-23

Reading note: This technology, "Speech enhancement method, electronic device, and computer-readable storage medium", was designed by 陈庭威, 黄景标, 林聚财, and 殷俊 on 2021-07-26. Main content: The invention discloses a speech enhancement method, an electronic device, and a computer-readable storage medium. The speech enhancement method comprises: acquiring speech to be enhanced; determining, based on the speech to be enhanced, the inverse of the signal covariance matrix of the speech to be enhanced; determining a target signal covariance matrix of a target speech by using a mask matrix corresponding to the target speech in the speech to be enhanced; and performing speech enhancement on the speech to be enhanced through the inverse of the signal covariance matrix and the target signal covariance matrix. In this way, the speech to be enhanced can be enhanced and the enhancement effect improved.

1. A method of speech enhancement, the method comprising:

acquiring a voice to be enhanced;

determining an inverse matrix of a signal covariance matrix of the voice to be enhanced based on the voice to be enhanced;

determining a target signal covariance matrix of the target voice by using a mask matrix corresponding to the target voice in the voice to be enhanced;

and performing voice enhancement on the voice to be enhanced through the inverse matrix of the signal covariance matrix and the target signal covariance matrix.

2. The speech enhancement method of claim 1 wherein the step of determining the inverse of the signal covariance matrix for the speech to be enhanced based on the speech to be enhanced comprises:

transforming the voice to be enhanced to obtain a matrix corresponding to the current frame of the voice to be enhanced; and

obtaining an inverse matrix of a signal covariance matrix of the initial frame of the voice to be enhanced;

obtaining an inverse matrix of a signal covariance matrix of the current speech frame to be enhanced based on a first recursion relational expression by using the matrix of the current frame, a conjugate transpose matrix of the matrix and an inverse matrix of the signal covariance matrix of the initial frame;

and the first recursion relational expression represents the corresponding relation between the inverse matrix of the signal covariance matrix of the current frame and the inverse matrix of the signal covariance matrix of the previous frame.

3. The speech enhancement method of claim 2,

the first recursion relational expression is obtained by constructing a first corresponding relation through the matrix of the current frame of the speech to be enhanced and the conjugate transpose matrix and performing inverse operation on the first corresponding relation.

4. The method of claim 1, wherein the step of determining the covariance matrix of the target signal of the target speech by using the mask matrix corresponding to the target speech in the speech to be enhanced comprises:

obtaining the probability of the target voice existing in the current frame of the voice to be enhanced by utilizing the matrix corresponding to the voice to be enhanced;

acquiring a mask matrix of an initial frame;

obtaining a mask matrix of a current frame of the speech signal to be enhanced by using the mask matrix of the initial frame and the probability;

and obtaining a target signal covariance matrix of the current frame of the target voice by using the mask matrix of the current frame, the matrix corresponding to the current frame and the conjugate transpose matrix.

5. The speech enhancement method of claim 4, wherein the step of obtaining the mask matrix of the current frame of the speech signal to be enhanced by using the mask matrix of the initial frame and the probability comprises:

obtaining a mask matrix of the current frame based on a second recursion relational expression by utilizing the probability and the mask matrix of the initial frame; the second recursion relational expression represents the corresponding relation between the mask matrix of the current frame and the mask matrix of the previous frame;

the step of obtaining the target signal covariance matrix of the target speech by using the mask matrix of the current frame, the matrix corresponding to the current frame, and the conjugate transpose matrix includes:

obtaining, based on a third recursion relational expression, a target signal covariance matrix of the current frame by using the mask matrix of the current frame, the matrix corresponding to the current frame and its conjugate transpose matrix, and the target signal covariance matrix of the initial frame;

and the third recursion relational expression represents the corresponding relation between the target signal covariance matrix of the current frame and the target signal covariance matrix of the previous frame.

6. The speech enhancement method of claim 5,

and the third recursion relational expression is obtained by constructing a second corresponding relation through the matrix of the current frame of the voice to be enhanced, the conjugate transpose matrix, the target signal covariance matrix and the mask matrix of the current frame and then transforming the second corresponding relation.

7. The speech enhancement method according to any one of claims 4 to 6, wherein the step of obtaining the mask matrix of the initial frame comprises:

acquiring an identity matrix, a random matrix with a value range of 0-1 or a probability matrix which obeys normal distribution;

and determining the unit matrix, the random matrix with the value range of 0-1 or the probability matrix subject to normal distribution as the mask matrix of the initial frame.

8. The method of claim 1, wherein the step of performing speech enhancement on the speech to be enhanced by the inverse of the signal covariance matrix and the target signal covariance matrix comprises:

calculating to obtain a beam former coefficient of the current frame through an inverse matrix of the signal covariance matrix of the current frame and the target signal covariance matrix of the current frame;

and multiplying the beam former coefficient by the voice to be enhanced of the current frame so as to enhance the voice to be enhanced of the current frame.

9. The speech enhancement method according to claim 1, wherein the step of obtaining the speech to be enhanced comprises:

acquiring initial voice in a time domain form;

and sequentially carrying out windowing, framing and Fourier transformation on the initial voice to obtain the voice to be enhanced in a time-frequency domain signal form.

10. The method of claim 9, wherein the step of speech enhancing the speech to be enhanced by the inverse of the signal covariance matrix and the target signal covariance matrix is further followed by the steps of:

and performing inverse Fourier transform on the voice after voice enhancement to obtain a voice signal in a time domain form after voice enhancement.

11. An electronic device, characterized in that the electronic device comprises: a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the speech enhancement method of any of claims 1 to 10.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program data executable to implement the speech enhancement method according to any one of claims 1-10.

Technical Field

The present invention relates to the field of speech processing, and more particularly, to a speech enhancement method, an electronic device, and a computer-readable storage medium.

Background

In fields such as telephone and video conferencing and artificial intelligence, voice transmission often plays an important role. In practical scenarios, however, the target speech signal is often corrupted by various noises or background sounds, so it needs to be enhanced to improve how clearly the target speech is conveyed.

Conventional speech enhancement usually relies on beamforming. Beamforming requires the direction of the target speech signal to be estimated in advance; a beamformer then filters out signals arriving from other directions, thereby enhancing the speech.

In practice, however, it is difficult to obtain the direction of the target speech signal accurately, so the enhancement effect is poor.

Disclosure of Invention

The invention provides a voice enhancement method, electronic equipment and a computer-readable storage medium, aiming at improving the voice enhancement effect.

To solve the above technical problem, the present invention provides a speech enhancement method, including: acquiring a voice to be enhanced; determining an inverse matrix of a signal covariance matrix of the voice to be enhanced based on the voice to be enhanced; determining a target signal covariance matrix of a target voice by using a mask matrix corresponding to the target voice in the voice to be enhanced; and performing voice enhancement on the voice to be enhanced through the inverse matrix of the signal covariance matrix and the target signal covariance matrix.

The step of determining the inverse of the signal covariance matrix of the speech to be enhanced, based on the speech to be enhanced, comprises: transforming the speech to be enhanced to obtain a matrix corresponding to its current frame; acquiring the inverse of the signal covariance matrix of the initial frame of the speech to be enhanced; and obtaining the inverse of the signal covariance matrix of the current frame, based on a first recursion relational expression, by using the matrix of the current frame, the conjugate transpose of that matrix, and the inverse of the signal covariance matrix of the initial frame. The first recursion relational expression represents the correspondence between the inverse of the signal covariance matrix of the current frame and that of the previous frame.

The first recursion relational expression is obtained by constructing a first corresponding relation through a matrix of a current frame of the speech to be enhanced and a conjugate transpose matrix and performing inverse operation on the first corresponding relation.

The step of determining the target signal covariance matrix of the target speech by using the mask matrix corresponding to the target speech in the speech to be enhanced comprises: obtaining, from the matrix corresponding to the speech to be enhanced, the probability that target speech is present in the current frame; acquiring a mask matrix of an initial frame; obtaining a mask matrix of the current frame of the speech to be enhanced by using the mask matrix of the initial frame and the probability; and obtaining the target signal covariance matrix of the current frame of the target speech by using the mask matrix of the current frame, the matrix corresponding to the current frame, and its conjugate transpose.

The step of obtaining the mask matrix of the current frame by using the mask matrix of the initial frame and the probability comprises: obtaining the mask matrix of the current frame, based on a second recursion relational expression, by using the probability and the mask matrix of the initial frame. The second recursion relational expression represents the correspondence between the mask matrix of the current frame and the mask matrix of the previous frame. The step of obtaining the target signal covariance matrix of the target speech by using the mask matrix of the current frame, the matrix corresponding to the current frame, and its conjugate transpose comprises: obtaining the target signal covariance matrix of the current frame, based on a third recursion relational expression, by using the mask matrix of the current frame, the matrix corresponding to the current frame and its conjugate transpose, and the target signal covariance matrix of the initial frame. The third recursion relational expression represents the correspondence between the target signal covariance matrix of the current frame and that of the previous frame.

And the third recursion relational expression is obtained by constructing a second corresponding relation through the matrix of the current frame of the voice to be enhanced, the conjugate transpose matrix, the target signal covariance matrix and the mask matrix of the current frame and then transforming the second corresponding relation.

The step of acquiring the mask matrix of the initial frame includes: acquiring an identity matrix, a random matrix with a value range of 0-1 or a probability matrix which obeys normal distribution; and determining a unit matrix, a random matrix with the value ranging from 0 to 1 or a probability matrix subject to normal distribution as a mask matrix of the initial frame.

The step of performing speech enhancement on the speech to be enhanced through the inverse of the signal covariance matrix and the target signal covariance matrix comprises: calculating the beamformer coefficient of the current frame from the inverse of the signal covariance matrix of the current frame and the target signal covariance matrix of the current frame; and multiplying the beamformer coefficient by the speech to be enhanced of the current frame, so as to enhance it.

The step of acquiring the voice to be enhanced comprises the following steps: acquiring initial voice in a time domain form; and sequentially carrying out windowing, framing and Fourier transformation on the initial voice to obtain the voice to be enhanced in a time-frequency domain signal form.

After the step of performing speech enhancement on the speech to be enhanced through the inverse of the signal covariance matrix and the target signal covariance matrix, the method further comprises: performing an inverse Fourier transform on the enhanced speech to obtain the enhanced speech signal in time-domain form.

In order to solve the above technical problem, the present invention further provides an electronic device, including: a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the speech enhancement method of any of the above.

To solve the above technical problem, the present invention also provides a computer-readable storage medium storing program data that can be executed to implement the speech enhancement method according to any one of the above.

The invention has the beneficial effects that: different from the situation in the prior art, the voice enhancement method determines the inverse matrix of the signal covariance matrix of the voice to be enhanced, determines the target signal covariance matrix of the target voice by using the mask matrix corresponding to the target voice in the voice to be enhanced, and performs voice enhancement on the voice to be enhanced by using the inverse matrix of the signal covariance matrix and the target signal covariance matrix.

Drawings

FIG. 1 is a flow chart of an embodiment of a speech enhancement method provided by the present invention;

FIG. 2 is a flow chart of another embodiment of a speech enhancement method provided by the present invention;

FIG. 3 is a schematic structural diagram of an embodiment of an electronic device provided in the present invention;

FIG. 4 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a flowchart illustrating a speech enhancement method according to an embodiment of the present invention.

Step S11: Acquire the speech to be enhanced.

A voice receiver may capture the speech to be enhanced directly, or may capture speech to be enhanced that is played back by a voice player. Voice receivers include wired microphones, wireless microphones, telephone receivers, and other voice receivers. Voice players include smart-device players, telephone players, and other voice players.

In a specific application scenario, multiple microphones may be used to acquire multiple channels of speech to be enhanced. In another specific application scenario, a single channel of speech to be enhanced may also be acquired through a single handset.

Step S12: Determine the inverse of the signal covariance matrix of the speech to be enhanced, based on the speech to be enhanced.

Each element of the covariance matrix is the covariance between a pair of elements of the signal vector.

In a specific application scenario, a matrix of the speech to be enhanced may be obtained based on the speech to be enhanced, then matrix transformation is performed on the matrix to obtain a signal covariance matrix of the speech to be enhanced, and inversion operation is performed on the signal covariance matrix to obtain an inverse matrix of the signal covariance matrix of the speech to be enhanced.

In another specific application scenario, an adjoint matrix of a signal covariance matrix of the speech to be enhanced may also be determined based on the speech to be enhanced, and then an inverse matrix is solved based on the adjoint matrix to obtain an inverse matrix of the signal covariance matrix of the speech to be enhanced.

The method for specifically calculating the inverse of the signal covariance matrix of the speech to be enhanced is not limited herein.
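The direct approach described above (form the signal covariance matrix, then invert it) can be sketched as follows for a single frequency bin. This is only an illustrative sketch: the function name `direct_inverse_covariance`, the averaging over frames, and the small diagonal loading term `eps` are assumptions made for the example and are not taken from this document.

```python
import numpy as np

def direct_inverse_covariance(frames, eps=1e-6):
    """Invert the averaged signal covariance matrix of one frequency bin.

    frames: complex array of shape (T, J), one J-channel observation per frame.
    eps:    diagonal loading (an assumption) that keeps the matrix invertible.
    """
    num_frames, num_mics = frames.shape
    # Average of the outer products y y^H over all frames.
    cov = np.einsum("tj,tk->jk", frames, frames.conj()) / num_frames
    cov = cov + eps * np.eye(num_mics)
    return np.linalg.inv(cov)
```

Each call performs a full J × J inversion, which is the per-frame cost that a recursive update of the inverse is meant to avoid.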

Step S13: Determine a target signal covariance matrix of the target speech by using a mask matrix corresponding to the target speech in the speech to be enhanced.

The target speech is the speech within the speech to be enhanced that needs to be enhanced. In a specific application scenario, when the speech to be enhanced is received by a conference recording microphone, the target speech is the voice of the conference speaker and everything else is background sound.

The mask matrix is used to mask the speech to be enhanced so as to suppress the background sound and highlight the target speech. In a specific application scenario, the probability that target speech is present may be estimated for each element of the matrix of the speech to be enhanced, a higher value meaning that target speech is more likely to be present at that element; the resulting estimates form the mask matrix corresponding to the target speech. In other application scenarios, the matrix of the speech to be enhanced may instead be filtered by a deep neural network to obtain the mask matrix corresponding to the target speech.

And the target signal covariance matrix is a signal covariance matrix corresponding to the target speech. The target voice is highlighted through the mask matrix corresponding to the target voice, the target signal covariance matrix of the target voice is further determined, and the accuracy and the reliability of the target signal covariance matrix can be improved.
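One common way to turn a mask into a target signal covariance matrix is a mask-weighted average of outer products. The sketch below illustrates that idea for a single frequency bin; the function name `masked_target_covariance`, the use of one scalar mask value per frame, and the normalisation by the mask sum are assumptions for illustration, not the formula of this document.

```python
import numpy as np

def masked_target_covariance(frames, mask, eps=1e-10):
    """Mask-weighted covariance estimate for one frequency bin.

    frames: complex array of shape (T, J), one J-channel observation per frame.
    mask:   real array of shape (T,), values in [0, 1]; larger values mean the
            frame is more likely to contain target speech.
    """
    # Sum over frames of mask[t] * y_t y_t^H.
    weighted = np.einsum("t,tj,tk->jk", mask, frames, frames.conj())
    # Normalise by the total mask weight (eps avoids division by zero).
    return weighted / (mask.sum() + eps)
```

Frames with a mask value near zero contribute almost nothing, so the estimate is dominated by frames in which target speech is likely present.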

Step S14: Perform speech enhancement on the speech to be enhanced through the inverse of the signal covariance matrix and the target signal covariance matrix.

Speech enhancement is performed on the speech to be enhanced by using the inverse of the signal covariance matrix and the target signal covariance matrix obtained in the preceding steps.

In a specific application scenario, a beamformer coefficient can be calculated from the inverse of the signal covariance matrix and the target signal covariance matrix, and the speech to be enhanced is then enhanced with this coefficient. The beamformer is an MVDR (Minimum Variance Distortionless Response) beamformer. The MVDR criterion minimizes the variance of the output subject to the constraint that the clean speech signal remains unchanged, so processing the speech to be enhanced with the MVDR beamformer coefficient minimizes the background sound.
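As an illustration of how a beamformer coefficient might be computed from these two matrices, the sketch below uses a commonly cited reference-microphone formulation of MVDR, w = (Φ⁻¹ Φ_target) u / tr(Φ⁻¹ Φ_target), where u selects a reference microphone. Whether this document uses exactly this expression is not stated here, so the formula, the names `mvdr_coefficients` and `apply_beamformer`, and the parameters `ref_mic` and `eps` should all be read as assumptions.

```python
import numpy as np

def mvdr_coefficients(inv_cov, target_cov, ref_mic=0, eps=1e-10):
    """Beamformer weights for one frequency bin (assumed formulation).

    inv_cov:    (J, J) inverse of the signal covariance matrix.
    target_cov: (J, J) target signal covariance matrix.
    ref_mic:    index of the reference microphone (hypothetical parameter).
    """
    numerator = inv_cov @ target_cov
    # Column ref_mic equals numerator @ u for the one-hot selector u.
    return numerator[:, ref_mic] / (np.trace(numerator) + eps)

def apply_beamformer(weights, observation):
    """Enhanced output w^H y for one time-frequency bin."""
    # np.vdot conjugates its first argument, which gives w^H y.
    return np.vdot(weights, observation)
```

With a rank-one target covariance built from a steering vector d, these weights satisfy w^H d = d[ref_mic], which is the distortionless property: the target component at the reference microphone passes through unchanged.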

In another specific application scenario, the inverse matrix of the signal covariance matrix and the target signal covariance matrix may also be directly combined with the matrix of the speech to be enhanced to perform speech enhancement on the speech to be enhanced.

In another specific application scenario, speech enhancement can also be performed on the speech to be enhanced by machine learning based on the inverse matrix of the signal covariance matrix and the target signal covariance matrix. The specific manner of enhancement is not limited herein.

By the method, the voice enhancement method of the embodiment firstly determines the inverse matrix of the signal covariance matrix of the voice to be enhanced, then determines the target signal covariance matrix of the target voice by using the mask matrix corresponding to the target voice in the voice to be enhanced, and finally performs voice enhancement on the voice to be enhanced through the inverse matrix of the signal covariance matrix and the target signal covariance matrix.

Referring to fig. 2, fig. 2 is a flowchart illustrating a speech enhancement method according to another embodiment of the present invention.

Step S21: the method comprises the steps of obtaining initial voice in a time domain form, and sequentially carrying out windowing, framing and Fourier transform on the initial voice to obtain voice to be enhanced in a time-frequency domain signal form.

The initial speech is acquired in time-domain form and then windowed, framed, and Fourier transformed to obtain the speech to be enhanced as a time-frequency domain signal. In a specific application scenario, multiple microphones may be used to acquire multi-channel initial speech in time-domain form, which is then windowed, framed, and transformed by a fast Fourier transform (FFT), in that order. The resulting time-frequency domain signal comprises multiple frames of speech to be enhanced.

In other embodiments, the speech to be enhanced may also be directly obtained in the form of a time-frequency domain signal, for example: and acquiring the speech to be enhanced in the form of the time-frequency domain signal output by a processor or other processing equipment.
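The windowing, framing, and Fourier transform of multi-channel time-domain speech can be sketched as a short-time Fourier transform. The frame length, hop size, Hann window, and the function name `stft_multichannel` are assumptions chosen for the example; this document does not specify them.

```python
import numpy as np

def stft_multichannel(signal, frame_len=512, hop=256):
    """Short-time Fourier transform of a multi-channel signal.

    signal: real array of shape (J, N), J microphones, N >= frame_len samples.
    Returns a complex array of shape (F, T, J) with F = frame_len // 2 + 1,
    so that result[f, t] is the J-element observation vector y(f, t).
    """
    num_mics, num_samples = signal.shape
    window = np.hanning(frame_len)
    num_frames = 1 + (num_samples - frame_len) // hop
    out = np.empty((frame_len // 2 + 1, num_frames, num_mics), dtype=complex)
    for t in range(num_frames):
        start = t * hop
        # Cut one frame per channel and apply the window.
        segment = signal[:, start:start + frame_len] * window
        # rfft along the sample axis gives (J, F); transpose to (F, J).
        out[:, t, :] = np.fft.rfft(segment, axis=1).T
    return out
```

Indexing the result as `out[f, t]` gives one J × 1 observation vector per time-frequency bin, the quantity that the per-frame recursions operate on.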

Step S22: and transforming the voice to be enhanced to obtain a matrix corresponding to the current frame of the voice to be enhanced and an inverse matrix of the signal covariance matrix of the initial frame of the voice to be enhanced, and obtaining the inverse matrix of the signal covariance matrix of the current frame of the voice to be enhanced based on a first recursion relational expression by utilizing the matrix of the current frame, the conjugate transpose matrix of the matrix and the inverse matrix of the signal covariance matrix of the initial frame.

The speech to be enhanced, in time-frequency domain form, is transformed to obtain the matrix corresponding to its current frame. In a specific application scenario, this embodiment may enhance each frame in real time while the microphone is capturing speech, or enhance the frames one by one after the microphone has captured the complete speech.

In this embodiment, a matrix corresponding to a current frame of speech to be enhanced in the form of a time-frequency domain signal may be represented as:

$$y(f,t) = [y_{1,f,t},\ y_{2,f,t},\ \ldots,\ y_{J,f,t}]^{T}$$

where $y(f,t)$ is the matrix corresponding to the current frame in time-frequency domain form; $y_{f,t}$ in the formulas below denotes the same quantity. It is the J × 1 observation vector at time t (the t-th frame) and frequency f. J is the number of microphones, so $y_{1,f,t}, y_{2,f,t}, \ldots, y_{J,f,t}$ are the speech signals of the J microphones, respectively. The superscript T denotes the matrix transpose, t is the current (or any) frame time, and f is the current (or any) frequency.

Next, the inverse of the signal covariance matrix of the initial frame of the speech to be enhanced is obtained; the initial frame is the speech frame at t = 0. The inverse of the signal covariance matrix of the current frame is then obtained, based on the first recursion relational expression, by using the matrix of the current frame, the conjugate transpose of that matrix, and the inverse of the signal covariance matrix of the initial frame. The first recursion relational expression represents the correspondence between the inverse of the signal covariance matrix of the current frame and that of the previous frame.

Specifically, the first recurrence relation is as follows:

where $Y_{f,t}^{-1}$ is the inverse of the signal covariance matrix of the t-th frame of the speech to be enhanced, $Y_{f,t-1}^{-1}$ is the inverse of the signal covariance matrix of the (t-1)-th frame, $y_{f,t}$ is the matrix of the t-th frame of the speech to be enhanced, and $y_{f,t}^{H}$ is the conjugate transpose of that matrix. t is the current frame time; when the speech to be enhanced has s frames in total, t takes a value in (0, 1, 2, ..., s) according to the current speech frame.

The first recursion relational expression represents the correspondence between the inverse of the signal covariance matrix of the current frame, $Y_{f,t}^{-1}$, and the inverse of the signal covariance matrix of the previous frame, $Y_{f,t-1}^{-1}$, that is, between the inverses of the signal covariance matrices of every two adjacent frames. Therefore, once the inverse of the signal covariance matrix of the initial frame has been obtained, it is substituted into the first recursion relational expression with t = 0 to calculate the inverse of the signal covariance matrix of the first frame; that result is then substituted into the expression with t = 1 to calculate the inverse for the second frame, and so on. In this way, the inverses of the signal covariance matrices of all frames of the speech to be enhanced can be obtained from the matrices of all frames, the conjugate transpose of each frame matrix, and the inverse of the signal covariance matrix of the initial frame.

This way of obtaining the inverses of the signal covariance matrices of all frames only requires the matrix of each frame, its conjugate transpose, and the inverse of the signal covariance matrix of the initial frame, followed by calculation based on the first recursion relational expression. The signal covariance matrix of each frame does not have to be inverted one frame at a time, which greatly reduces the computational load and complexity of the speech enhancement process and improves its efficiency.

The first recursion relational expression is obtained by constructing a first corresponding relation through a matrix of a current frame of the speech to be enhanced and a conjugate transpose matrix and performing inverse operation on the first corresponding relation. Specifically, a first corresponding relationship between the signal covariance matrix of the current frame and the signal covariance matrix of the previous frame may be constructed based on the matrix of the current frame of the speech to be enhanced and the conjugate transpose matrix, where the first corresponding relationship is as follows:

where $Y_{f,t}$ is the signal covariance matrix of the t-th frame of the speech to be enhanced and $Y_{f,t-1}$ is the signal covariance matrix of the (t-1)-th frame. Inverting the first correspondence, formula (2), yields the first recursion relational expression, formula (1).
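How inverting a frame-to-frame covariance update leads to a recursion on the inverse can be illustrated under a simple assumption. Suppose (this form is an assumption made for illustration and is not taken from the formulas of this document, which may for instance include a forgetting factor) that the covariance is updated by adding the outer product of the current observation. The Sherman-Morrison identity then gives the inverse of the updated matrix directly from the previous inverse:

```latex
% Assumed rank-one form of the covariance update:
Y_{f,t} = Y_{f,t-1} + y_{f,t}\, y_{f,t}^{H}

% Applying the Sherman-Morrison identity to the right-hand side:
Y_{f,t}^{-1} = Y_{f,t-1}^{-1}
  - \frac{Y_{f,t-1}^{-1}\, y_{f,t}\, y_{f,t}^{H}\, Y_{f,t-1}^{-1}}
         {1 + y_{f,t}^{H}\, Y_{f,t-1}^{-1}\, y_{f,t}}
```

Under this assumption the right-hand side involves only $Y_{f,t-1}^{-1}$, $y_{f,t}$, and $y_{f,t}^{H}$, which matches the description of the quantities the recursion uses, and it requires only matrix-vector products rather than a fresh inversion.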

The first correspondence describes how the signal covariance matrix is iteratively updated between adjacent frames of the speech to be enhanced. Inverting the first correspondence only once yields the first recursion relational expression; the inverse of the signal covariance matrix of every frame is then obtained recursively from that expression and the inverse of the signal covariance matrix of the initial frame.

The inverse of the signal covariance matrix of the initial frame may be a simple matrix such as an identity matrix, a random matrix with values in the range 0 to 1, or a probability matrix that follows a normal distribution. Substituting such a matrix into the first recursion relational expression further reduces the amount of recursive computation and improves computational efficiency.
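A recursion of this kind can be sketched in code under one stated assumption: that the covariance of each frame equals the previous covariance plus the outer product y yᴴ of the current observation. The exact update used in this document is not reproduced here, so the update rule and the names `update_inverse_covariance` and `recursive_inverse` are hypothetical. The initial inverse is taken to be the identity matrix, one of the options mentioned above.

```python
import numpy as np

def update_inverse_covariance(inv_prev, y):
    """Sherman-Morrison update of (Y_prev + y y^H)^{-1}.

    inv_prev: (J, J) Hermitian inverse of the previous covariance matrix.
    y:        (J,) complex observation vector of the current frame.
    """
    v = inv_prev @ y                      # Y_prev^{-1} y
    denom = 1.0 + np.vdot(y, v).real      # 1 + y^H Y_prev^{-1} y (real for Hermitian input)
    # For Hermitian inv_prev, Y_prev^{-1} y y^H Y_prev^{-1} equals v v^H.
    return inv_prev - np.outer(v, v.conj()) / denom

def recursive_inverse(frames):
    """Run the update over all frames, starting from the identity matrix."""
    inv_cov = np.eye(frames.shape[1], dtype=complex)
    for y in frames:
        inv_cov = update_inverse_covariance(inv_cov, y)
    return inv_cov
```

Each update costs on the order of J² operations, compared with roughly J³ for inverting a J × J matrix from scratch at every frame.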

Step S23: obtaining the probability of the target voice of the current frame of the voice to be enhanced by using the matrix corresponding to the voice to be enhanced, obtaining the mask matrix of the initial frame, obtaining the mask matrix of the current frame of the voice signal to be enhanced by using the mask matrix and the probability of the initial frame, and obtaining the target signal covariance matrix of the current frame of the target voice by using the mask matrix of the current frame, the matrix corresponding to the current frame and the conjugate transpose matrix.

In other words, the probability that the target speech is present in the current frame is computed first from the matrix corresponding to the speech to be enhanced. The mask matrix of the initial frame is then obtained and combined with this probability to give the mask matrix of the current frame. Finally, the target signal covariance matrix of the current frame of the target speech is obtained from the mask matrix of the current frame, the matrix corresponding to the current frame, and its conjugate transpose.

In a specific application scenario, the probability of the target speech existing in the current frame of the speech to be enhanced can be calculated by the following formula:

where $p(y_{f,t})$ is the probability that the target speech is present in the t-th frame of the speech to be enhanced, e is the natural constant, J is the number of microphones, tr(·) denotes the trace of a matrix, and the covariance term in the formula is the target signal covariance matrix. The target signal covariance matrix can be obtained from the second correspondence; for the second correspondence, refer to formula (5).

After the probability that the target speech is present in the current frame of the speech to be enhanced has been obtained, the mask matrix of the current frame is obtained from the second recurrence relation by using this probability and the mask matrix of the initial frame. The second recurrence relation is as follows:

where the two mask terms denote the mask matrix of the t-th frame and the mask matrix of the (t-1)-th frame, respectively. The hyperparameters α and β satisfy α + β = 1.

The second recurrence relation expresses the correspondence between the mask matrix of the current frame and the mask matrix of the previous frame. Therefore, once the mask matrix of the initial frame has been obtained, substituting it into the second recurrence relation gives the mask matrix of the first frame; substituting the mask matrix of the first frame into the second recurrence relation gives the mask matrix of the second frame; and so on, until the mask matrices of all frames of the speech to be enhanced have been obtained.
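The frame-by-frame unrolling described above can be illustrated with a short sketch. The exact form of the second recurrence relation is not reproduced in this text; the sketch assumes a convex combination, mask_t = α · mask_{t-1} + β · p_t with α + β = 1, and treats the mask as a single scalar per time-frequency bin for brevity. The names `update_mask` and `recurse_masks` and the default α = 0.9 are hypothetical.

```python
def update_mask(prev_mask, prob, alpha=0.9):
    """One step of the assumed recurrence: alpha * previous mask + beta * probability."""
    beta = 1.0 - alpha  # enforces alpha + beta = 1, as stated in the text
    return alpha * prev_mask + beta * prob


def recurse_masks(initial_mask, probs, alpha=0.9):
    """Unroll the recurrence over all frames, starting from the initial-frame mask.

    probs: per-frame probabilities that the target speech is present.
    Returns the list of masks for frames 1..T.
    """
    masks = []
    mask = initial_mask
    for prob in probs:
        mask = update_mask(mask, prob, alpha)
        masks.append(mask)
    return masks


# Usage: an initial mask of 1.0 and two frames of presence probability.
masks = recurse_masks(1.0, [0.0, 1.0], alpha=0.5)  # [0.5, 0.75]
```

Because each mask depends only on the previous mask and the current probability, no frame's mask has to be estimated from scratch.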

The mask matrix of the initial frame may be a simple matrix whose elements lie between 0 and 1, such as an identity matrix, a random matrix with values in the range 0 to 1, or a probability matrix following a normal distribution. Substituting such a simple initial mask matrix into the second recurrence relation further reduces the amount of recursive computation and improves computational efficiency.

After the mask matrices of all frames of the speech to be enhanced have been obtained, the target signal covariance matrix of the target speech is obtained by using the mask matrix of the current frame, the matrix corresponding to the current frame, and its conjugate transpose. Specifically, the target signal covariance matrix of the current frame is obtained from a third recurrence relation by using the mask matrix of the current frame, the matrix corresponding to the current frame and its conjugate transpose, and the target signal covariance matrix of the initial frame. The third recurrence relation is as follows:

where the two covariance terms denote the target signal covariance matrix of the t-th frame and the target signal covariance matrix of the (t-1)-th frame, respectively, and the mask term is the mask matrix of the t-th frame. The third recurrence relation thus expresses the correspondence between the target signal covariance matrix of the current frame and that of the previous frame. By substituting the target signal covariance matrix of the initial frame into the third recurrence relation, the target signal covariance matrices of all frames can be obtained recursively, one after another.
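A minimal sketch of such a recursion is given below. Since the third recurrence relation is not reproduced in this text, the sketch assumes the simplest form consistent with the description: the previous target covariance plus the current frame's outer product $y_{f,t} y_{f,t}^H$ weighted by the current mask. The function name `update_target_cov` and the scalar mask are illustrative assumptions.

```python
def update_target_cov(prev_cov, mask, y):
    """Assumed update: cov_t = cov_{t-1} + mask_t * (y y^H).

    prev_cov: J x J target signal covariance of the previous frame (list of rows).
    mask:     scalar mask value of the current frame, between 0 and 1.
    y:        list of J complex samples of the current frame.
    """
    n = len(y)
    return [[prev_cov[i][j] + mask * y[i] * y[j].conjugate()
             for j in range(n)]
            for i in range(n)]


# Usage: start from a zero matrix and accumulate one masked frame.
zero_cov = [[0j, 0j], [0j, 0j]]
cov = update_target_cov(zero_cov, 0.5, [2 + 0j, 1j])
```

A mask close to 1 lets the frame contribute fully to the target statistics, while a mask close to 0 effectively excludes it.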

The third recurrence relation is obtained by constructing a second correspondence from the matrix of the current frame of the speech to be enhanced, its conjugate transpose, the target signal covariance matrix, and the mask matrix of the current frame, and then transforming the second correspondence.

The second correspondence relationship is specifically as follows:

where the mask term is the mask matrix of the t-th frame and the covariance term is the target signal covariance matrix of the entire target speech. Masking the matrix of the t-th frame of the speech to be enhanced and its conjugate transpose with the mask matrix of the t-th frame yields the target signal covariance matrix of the target speech.

After the target signal covariance matrix of the target speech has been obtained, the target signal covariance matrix of the initial frame is obtained by transforming formula (6). Substituting the target signal covariance matrix of the initial frame into formula (5) then completes the calculation of the target signal covariance matrices of all frames.

Step S24: calculate the beamformer coefficients of the current frame from the inverse of the signal covariance matrix of the current frame and the target signal covariance matrix of the current frame, and multiply the beamformer coefficients by the speech to be enhanced of the current frame so as to enhance it.

After the inverses of the signal covariance matrices of all frames and the target signal covariance matrices of all frames have been obtained, the beamformer coefficients of each frame are calculated from the inverse of the signal covariance matrix of that frame and the target signal covariance matrix of that frame.

In a specific application scenario, the beamformer coefficients of the current frame are calculated from the inverse of the signal covariance matrix of the current frame and the target signal covariance matrix of the current frame as follows:

where $w_{f,t}$ denotes the beamformer coefficients of the t-th frame, also called the MVDR filter coefficients, tr(·) denotes the trace of a matrix, and d is an M × 1 vector of zeros and ones; in this embodiment, each element of d may be 1 or 0.
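For illustration, the coefficient calculation can be sketched as below. The formula itself is not reproduced in this text; the sketch assumes the common trace-normalised MVDR form $w_{f,t} = \frac{Y_{f,t}^{-1}\,\Phi_{f,t}}{\mathrm{tr}(Y_{f,t}^{-1}\,\Phi_{f,t})}\, d$, where $\Phi_{f,t}$ is a symbol introduced here for the target signal covariance matrix. The function name `mvdr_coefficients` is hypothetical.

```python
def mvdr_coefficients(inv_signal_cov, target_cov, d):
    """Assumed form: w = (Y^{-1} Phi / tr(Y^{-1} Phi)) d.

    inv_signal_cov: inverse of the signal covariance matrix (list of rows).
    target_cov:     target signal covariance matrix (list of rows).
    d:              0-1 selection vector, e.g. [1, 0, ..., 0] for a reference microphone.
    The trace is assumed to be non-zero.
    """
    n = len(d)
    # a = inv_signal_cov @ target_cov
    a = [[sum(inv_signal_cov[i][k] * target_cov[k][j] for k in range(n))
          for j in range(n)]
         for i in range(n)]
    trace = sum(a[i][i] for i in range(n))
    # w = (a @ d) / trace
    return [sum(a[i][j] * d[j] for j in range(n)) / trace for i in range(n)]


# Usage: identity inverse covariance, diagonal target covariance,
# first microphone selected as the reference.
w = mvdr_coefficients([[1, 0], [0, 1]], [[2, 0], [0, 2]], [1, 0])  # [0.5, 0.0]
```

With a selection vector such as `[1, 0]`, the product simply picks out the first column of the normalised matrix, so no steering vector has to be estimated separately.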

The beamformer coefficients $w_{f,t}$ of the t-th frame are converted to obtain the overall beamformer coefficient matrix $W_f$, and $W_f$ is in turn converted to obtain its conjugate transpose $W_f^H$.

The beamformer coefficients are multiplied by the speech to be enhanced for each frame, so that the speech to be enhanced for each frame can be enhanced by the beamformer coefficients.

In a specific application scenario, taking a current frame as an example for calculation, a specific calculation process is as follows:

where the left-hand side is the enhanced speech of the t-th frame, obtained by applying the conjugate transpose of the overall beamformer coefficient matrix to the speech to be enhanced $y_{f,t}$ of the t-th frame. In this embodiment, the t-th frame is the current frame.

The speech to be enhanced of each frame is enhanced by the above formula (7), and the speech enhancement of the whole speech to be enhanced can be realized.
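The per-frame enhancement step can be sketched as an inner product between the conjugated beamformer coefficients and the multichannel observation. The function name `apply_beamformer` is hypothetical, and the sketch handles a single frequency bin of a single frame; a full implementation would repeat it over all bins and frames.

```python
def apply_beamformer(w, y):
    """Return the enhanced sample w^H y for one frequency bin and frame.

    w: list of J complex beamformer coefficients.
    y: list of J complex microphone samples (speech to be enhanced).
    """
    return sum(wi.conjugate() * yi for wi, yi in zip(w, y))


# Usage: two microphones, one time-frequency bin.
enhanced = apply_beamformer([0.5 + 0j, 0.5j], [2 + 0j, 2j])  # (2+0j)
```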

After the entire speech to be enhanced has been enhanced, an inverse Fourier transform is applied to the enhanced speech to obtain the enhanced speech signal in time-domain form, so that the enhanced signal can be used conveniently, for example to match the format expected by an audio player.
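As a rough illustration of this last step, the sketch below applies a direct inverse discrete Fourier transform to the spectrum of one enhanced frame. It is a textbook O(N²) formulation rather than an FFT, the function name `inverse_dft` is hypothetical, and the windowing and overlap-add needed to stitch frames back into a continuous signal are left out, since the source does not specify them.

```python
import cmath


def inverse_dft(spectrum):
    """Return the time-domain samples of one frame from its DFT coefficients."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


# Usage: a spectrum with only a DC component maps to a constant frame.
samples = inverse_dft([4, 0, 0, 0])
```

For real-valued audio, the imaginary parts of the result are numerically negligible and only the real parts are kept.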

Through the above steps, the speech enhancement method of this embodiment first constructs the correspondence between the mask matrix of the current frame and the mask matrix of the previous frame, and the correspondence between the inverse of the signal covariance matrix of the current frame and that of the previous frame. It then obtains the mask matrix and the inverse of the signal covariance matrix of the initial frame, derives the mask matrices and the inverses of the signal covariance matrices of all frames in sequence through these correspondences, calculates the beamformer coefficients from these matrices, and finally uses the beamformer coefficients to enhance the speech to be enhanced of each frame. Moreover, the mask matrix of each frame does not have to be calculated independently; it is obtained recursively from the mask matrix of the initial frame, which further reduces the amount and complexity of computation for the mask matrices. Because the mask matrix and the inverse of the signal covariance matrix of the initial frame take simple values, the recursive calculation becomes easier and more efficient. The speech enhancement method of this embodiment can therefore greatly reduce the amount and complexity of computation, improve computational efficiency, reduce calculation errors, and improve the speech enhancement effect.

Based on the same inventive concept, the present invention further provides an electronic device capable of implementing the speech enhancement method of any of the above embodiments. Referring to fig. 3, which is a schematic structural diagram of an embodiment of the electronic device provided by the present invention, the electronic device includes a processor 31 and a memory 32.

The processor 31 is adapted to execute program instructions stored in the memory 32 to implement the steps of any of the speech enhancement method embodiments described above. In one specific implementation scenario, the electronic device may include, but is not limited to, mobile devices such as a notebook computer or a tablet computer, which is not limited herein.

Specifically, the processor 31 is adapted to control itself and the memory 32 to implement the steps of any of the speech enhancement method embodiments described above. The processor 31 may also be referred to as a CPU (Central Processing Unit). The processor 31 may be an integrated circuit chip having signal processing capabilities. The processor 31 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor or any conventional processor. In addition, the processor 31 may be jointly implemented by a plurality of integrated circuit chips.

By the scheme, the voice enhancement of the voice to be enhanced can be realized.

Based on the same inventive concept, the present invention further provides a computer-readable storage medium. Referring to fig. 4, which is a schematic structural diagram of an embodiment of the computer-readable storage medium provided by the present invention, the computer-readable storage medium 40 stores at least one piece of program data 41, and the program data 41 is used to implement any of the methods described above. In one embodiment, the computer-readable storage medium 40 includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In the embodiments provided in the present invention, it should be understood that the disclosed method and apparatus can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.

The above description is only an embodiment of the present invention and is not intended to limit the scope of the present invention. All equivalent structures or equivalent process modifications made by using the contents of the present specification and drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present invention.
