Audio signal processing method and device and storage medium

Document No.: 1339714    Publication date: 2020-07-17

Reading note: this technology, "Audio signal processing method and device and storage medium", was designed and created by 侯海宁, 李炯亮 and 李晓明 on 2020-03-13. Its main content is as follows: The disclosure relates to an audio signal processing method and apparatus, and a storage medium. The method comprises the following steps: acquiring audio signals emitted by at least two sound sources respectively by at least two microphones to obtain original noisy signals of the at least two microphones respectively; acquiring respective frequency domain estimation signals of the at least two sound sources according to the respective original noisy signals of the at least two microphones; dividing a preset frequency point range into a plurality of frequency point sub-bands; determining, according to the frequency domain estimation signal of each frequency point sub-band and a preset first state probability that each of the at least two sound sources is in a preset state, a second state probability that the at least two sound sources are in the preset state on each frequency point sub-band; determining a separation matrix of each frequency point corresponding to each frequency point sub-band according to the second state probability; and obtaining, based on the separation matrix and the original noisy signals, the audio signals emitted by the at least two sound sources. According to the technical solution of the embodiments of the present disclosure, the system delay can be reduced.

1. An audio signal processing method, comprising:

acquiring audio signals emitted by at least two sound sources respectively by at least two microphones to obtain original noisy signals of the at least two microphones respectively;

for each frame in the time domain, acquiring respective frequency domain estimation signals of the at least two sound sources according to the respective original noisy signals of the at least two microphones;

dividing a preset frequency point range into a plurality of frequency point sub-bands, wherein each frequency point sub-band comprises a plurality of frequency point data;

determining, according to the frequency domain estimation signal of each frequency point sub-band and a preset first state probability that each of the at least two sound sources is in a preset state, a second state probability that the at least two sound sources are in the preset state on each frequency point sub-band;

determining a separation matrix of each frequency point corresponding to each frequency point sub-band according to the second state probability;

and obtaining audio signals sent by at least two sound sources respectively based on the separation matrix and the original noisy signals.

2. The method according to claim 1, further comprising:

if the second state probability or the first state probability is not converged, updating the first state probability according to the second state probability;

and updating the second state probability according to the frequency domain estimation signal of each frequency point sub-band and the updated first state probability.

3. The method of claim 2, wherein updating the first state probability based on the second state probability comprises:

and updating the first state probability according to the sum of the second state probabilities of the frequency point sub-bands and the number of the frequency point sub-bands.

4. The method according to claim 2, wherein the updating the second state probability according to the frequency domain estimation signal of each frequency point sub-band and the updated first state probability comprises:

determining a state probability distribution function according to the frequency domain estimation signal of each frequency point sub-band;

and updating the second state probability according to the state probability distribution function and the updated first state probability.

5. The method according to claim 2, wherein the determining the separation matrix of each frequency point corresponding to each frequency point sub-band according to the second state probability comprises:

determining the alternative separation matrix of each frequency point corresponding to each frequency point sub-band according to the updated second state probability;

if the alternative separation matrix is not converged, determining the alternative separation matrix of each frequency point corresponding to each frequency point sub-band again according to the updated second state probability;

and if the alternative separation matrix is converged, determining the alternative separation matrix as the separation matrix.

6. The method according to claim 5, wherein the determining the alternative separation matrix of each frequency point corresponding to each frequency point sub-band according to the updated second state probability comprises:

determining a covariance matrix of each frequency point of the at least two sound sources on each frequency point subband according to the updated second state probability;

and determining the alternative separation matrix according to the covariance matrix.

7. An audio signal processing apparatus, comprising:

a first acquisition module, configured to acquire, by at least two microphones, audio signals emitted respectively by at least two sound sources, so as to obtain original noisy signals of the at least two microphones respectively;

a second obtaining module, configured to obtain, for each frame in a time domain, frequency domain estimation signals of the at least two sound sources according to the original noisy signals of the at least two microphones, respectively;

a dividing module, configured to divide a preset frequency point range into a plurality of frequency point sub-bands, wherein each frequency point sub-band comprises a plurality of frequency point data;

a first determining module, configured to determine, according to the frequency domain estimation signal of each frequency point subband and a preset first state probability that each of the at least two sound sources is in a preset state, a second state probability that the at least two sound sources are in the preset state on each frequency point subband;

a second determining module, configured to determine, according to the second state probability, a separation matrix of each frequency point corresponding to each frequency point subband;

and the third acquisition module is used for acquiring audio signals sent by at least two sound sources respectively based on the separation matrix and the original noisy signals.

8. The apparatus of claim 7, further comprising:

a first updating module, configured to update the first state probability according to the second state probability if the second state probability or the first state probability is not converged;

and the second updating module is used for updating the second state probability according to the frequency domain estimation signal of each frequency point sub-band and the updated first state probability.

9. The apparatus of claim 8, wherein the first update module comprises:

and the first updating submodule is used for updating the first state probability according to the sum of the second state probabilities of the frequency point sub-bands and the number of the frequency point sub-bands.

10. The apparatus of claim 8, wherein the second update module comprises:

the first determining submodule is used for determining a state probability distribution function according to the frequency domain estimation signal of each frequency point sub-band;

and the second updating submodule is used for updating the second state probability according to the state probability distribution function and the updated first state probability.

11. The apparatus of claim 8, wherein the second determining module comprises:

a second determining submodule, configured to determine, according to the updated second state probability, an alternative separation matrix of each frequency point corresponding to each frequency point subband;

a third determining submodule, configured to determine, according to the updated second state probability, an alternative separation matrix of each frequency point corresponding to each frequency point sub-band again if the alternative separation matrix is not converged;

a fourth determining submodule, configured to determine the candidate separation matrix as the separation matrix if the candidate separation matrix converges.

12. The apparatus of claim 11, wherein the second determining submodule comprises:

a fifth determining submodule, configured to determine, according to the updated second state probability, a covariance matrix of each frequency point on each frequency point subband of the at least two sound sources;

and the sixth determining submodule is used for determining the alternative separation matrix according to the covariance matrix.

13. An apparatus for processing an audio signal, the apparatus comprising at least: a processor and a memory for storing executable instructions operable on the processor, wherein:

the processor is adapted to execute the executable instructions, which when executed perform the steps of the audio signal processing method as provided in any of the preceding claims 1 to 6.

14. A non-transitory computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the steps in the audio signal processing method provided in any one of claims 1 to 6.

Technical Field

The present disclosure relates to the field of signal processing, and in particular, to an audio signal processing method and apparatus, and a storage medium.

Background

Disclosure of Invention

The present disclosure provides an audio signal processing method and apparatus, and a storage medium.

According to a first aspect of embodiments of the present disclosure, there is provided an audio signal processing method, including:

acquiring audio signals emitted by at least two sound sources respectively by at least two microphones to obtain original noisy signals of the at least two microphones respectively;

for each frame in the time domain, acquiring respective frequency domain estimation signals of the at least two sound sources according to the respective original noisy signals of the at least two microphones;

dividing a preset frequency point range into a plurality of frequency point sub-bands, wherein each frequency point sub-band comprises a plurality of frequency point data;

determining, according to the frequency domain estimation signal of each frequency point sub-band and a preset first state probability that each of the at least two sound sources is in a preset state, a second state probability that the at least two sound sources are in the preset state on each frequency point sub-band;

determining a separation matrix of each frequency point corresponding to each frequency point sub-band according to the second state probability;

and obtaining audio signals sent by at least two sound sources respectively based on the separation matrix and the original noisy signals.

In some embodiments, the method comprises:

if the second state probability or the first state probability is not converged, updating the first state probability according to the second state probability;

and updating the second state probability according to the frequency domain estimation signal of each frequency point sub-band and the updated first state probability.

In some embodiments, said updating said first state probability in accordance with said second state probability comprises:

and updating the first state probability according to the sum of the second state probabilities of the frequency point sub-bands and the number of the frequency point sub-bands.

In some embodiments, the updating the second state probability according to the frequency domain estimation signal of each frequency point sub-band and the updated first state probability includes:

determining a state probability distribution function according to the frequency domain estimation signal of each frequency point sub-band;

and updating the second state probability according to the state probability distribution function and the updated first state probability.

In some embodiments, the determining the separation matrix of each frequency point corresponding to each frequency point sub-band according to the second state probability includes:

determining alternative separation matrixes of the frequency points corresponding to the frequency point sub-bands according to the updated second state probability;

if the alternative separation matrix is not converged, determining the alternative separation matrix of each frequency point corresponding to each frequency point sub-band again according to the updated second state probability;

and if the alternative separation matrix is converged, determining the alternative separation matrix as the separation matrix.

In some embodiments, the determining the alternative separation matrix of each frequency point corresponding to each frequency point sub-band according to the updated second state probability includes:

determining a covariance matrix of each frequency point of the at least two sound sources on each frequency point subband according to the updated second state probability;

and determining the alternative separation matrix according to the covariance matrix.

According to a second aspect of the embodiments of the present disclosure, there is provided an audio signal processing apparatus including:

a first acquisition module, configured to acquire, by at least two microphones, audio signals emitted respectively by at least two sound sources, so as to obtain original noisy signals of the at least two microphones respectively;

a second obtaining module, configured to obtain, for each frame in a time domain, frequency domain estimation signals of the at least two sound sources according to the original noisy signals of the at least two microphones, respectively;

a dividing module, configured to divide a preset frequency point range into a plurality of frequency point sub-bands, wherein each frequency point sub-band comprises a plurality of frequency point data;

a first determining module, configured to determine, according to the frequency domain estimation signal of each frequency point subband and a preset first state probability that each of the at least two sound sources is in a preset state, a second state probability that the at least two sound sources are in the preset state on each frequency point subband;

a second determining module, configured to determine, according to the second state probability, a separation matrix of each frequency point corresponding to each frequency point subband;

and the third acquisition module is used for acquiring audio signals sent by at least two sound sources respectively based on the separation matrix and the original noisy signals.

In some embodiments, the apparatus further comprises:

a first updating module, configured to update the first state probability according to the second state probability if the second state probability or the first state probability is not converged;

and the second updating module is used for updating the second state probability according to the frequency domain estimation signal of each frequency point sub-band and the updated first state probability.

In some embodiments, the first update module comprises:

and the first updating submodule is used for updating the first state probability according to the sum of the second state probabilities of the frequency point sub-bands and the number of the frequency point sub-bands.

In some embodiments, the second update module comprises:

the first determining submodule is used for determining a state probability distribution function according to the frequency domain estimation signal of each frequency point sub-band;

and the second updating submodule is used for updating the second state probability according to the state probability distribution function and the updated first state probability.

In some embodiments, the second determining module comprises:

a second determining submodule, configured to determine, according to the updated second state probability, an alternative separation matrix of each frequency point corresponding to each frequency point subband;

a third determining submodule, configured to determine, according to the updated second state probability, an alternative separation matrix of each frequency point corresponding to each frequency point sub-band again if the alternative separation matrix is not converged;

a fourth determining submodule, configured to determine the candidate separation matrix as the separation matrix if the candidate separation matrix converges.

In some embodiments, the second determining sub-module includes:

a fifth determining submodule, configured to determine, according to the updated second state probability, a covariance matrix of each frequency point on each frequency point subband of the at least two sound sources;

and the sixth determining submodule is used for determining the alternative separation matrix according to the covariance matrix.

According to a third aspect of the embodiments of the present disclosure, there is provided an apparatus for processing an audio signal, the apparatus at least comprising: a processor and a memory for storing executable instructions operable on the processor, wherein:

the processor is configured to execute the executable instructions, and the executable instructions perform the steps of any of the audio signal processing methods described above.

According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the steps in any of the audio signal processing methods described above.

The technical solution provided by the embodiments of the present disclosure can have the following beneficial effects: with the technical solution of the embodiments of the present disclosure, the activation state of the sound source is taken into account when performing audio signal separation, and the actual state of the sound source is estimated by determining the probability of activation of the sound source. Compared with the prior art, in which separation is performed under the assumption that the sound source is always in the activated state, the method is closer to the audio signals actually emitted by the sound sources, which further improves the separated speech quality and increases the signal-to-noise ratio and the recognition rate.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow chart illustrating a method of audio signal processing according to an exemplary embodiment;

fig. 2 is a block diagram illustrating an application scenario of an audio signal processing method according to an exemplary embodiment;

FIG. 3 is a flow chart illustrating a method of audio signal processing according to an exemplary embodiment;

fig. 4 is a block diagram illustrating a structure of an audio signal processing apparatus according to an exemplary embodiment;

fig. 5 is a block diagram illustrating a physical structure of an audio signal processing apparatus according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

Fig. 1 is a flowchart illustrating an audio signal processing method according to an exemplary embodiment, as shown in fig. 1, including the steps of:

step S101, acquiring audio signals sent by at least two sound sources by at least two microphones respectively to obtain original noisy signals of the at least two microphones respectively;

step S102, for each frame in a time domain, acquiring respective frequency domain estimation signals of the at least two sound sources according to the respective original noisy signals of the at least two microphones;

step S103, dividing a preset frequency point range into a plurality of frequency point sub-bands, wherein each frequency point sub-band comprises a plurality of frequency point data;

step S104, determining second state probabilities of the at least two sound sources in the preset state on each frequency point sub-band according to the frequency domain estimation signal of each frequency point sub-band and the preset first state probabilities of the at least two sound sources in the preset state respectively;

step S105, determining a separation matrix of each frequency point corresponding to each frequency point sub-band according to the second state probability;

and S106, obtaining audio signals sent by at least two sound sources respectively based on the separation matrix and the original noisy signals.

The method disclosed by the embodiment of the disclosure is applied to the terminal. Here, the terminal is an electronic device into which two or more microphones are integrated. For example, the terminal may be a vehicle-mounted terminal, a computer, a server, or the like.

In an embodiment, the terminal may also be an electronic device connected to a predetermined device in which two or more microphones are integrated; based on the connection, the electronic device receives the audio signals collected by the predetermined device and, based on the connection, sends the processed audio signals back to the predetermined device. For example, the predetermined device is a loudspeaker box or the like.

In practical application, the terminal includes at least two microphones, and the at least two microphones simultaneously detect the audio signals emitted by the at least two sound sources respectively, so as to obtain the original noisy signals of the at least two microphones respectively. It should be understood that, in the present embodiment, the at least two microphones detect the audio signals emitted by the at least two sound sources synchronously.

In the audio signal processing method according to the embodiment of the present disclosure, after the original noisy signal of the audio frame in the predetermined time is acquired, the audio signal of the audio frame in the predetermined time is separated.

In the embodiment of the present disclosure, the number of the microphones is 2 or more, and the number of the sound sources is 2 or more.

In the embodiment of the present disclosure, the original noisy signal is a mixed signal comprising the sounds emitted by the at least two sound sources. For example, the number of microphones is 2, namely microphone 1 and microphone 2, and the number of sound sources is 2, namely sound source 1 and sound source 2; the original noisy signal of the microphone 1 is an audio signal comprising both sound source 1 and sound source 2, and the original noisy signal of the microphone 2 is likewise an audio signal comprising both sound source 1 and sound source 2.

For example, the number of the microphones is 3, namely a microphone 1, a microphone 2 and a microphone 3; the number of the sound sources is 3, namely a sound source 1, a sound source 2 and a sound source 3; the original noisy signal of the microphone 1 is an audio signal comprising a sound source 1, a sound source 2 and a sound source 3; the original noisy signals of said microphone 2 and said microphone 3 are likewise audio signals each comprising a sound source 1, a sound source 2 and a sound source 3.

It will be appreciated that if the signal produced in a given microphone by the sound from one sound source is the desired audio signal, the signals produced in that microphone by the other sound sources are noise signals. The disclosed embodiments require recovering the audio signals emitted by the at least two sound sources from the original noisy signals of the at least two microphones. The number of sound sources is generally the same as the number of microphones, although in some embodiments the number of sound sources may differ from the number of microphones.

It will be understood that when the microphones collect the audio signals from the sound sources, at least one audio frame of audio signal may be collected, and the collected audio signals constitute the original noisy signal of each microphone. The original noisy signal may be either a time domain signal or a frequency domain signal. If the original noisy signal is a time domain signal, the time domain signal can be converted into a frequency domain signal by a time-frequency conversion operation.

Here, the time-frequency conversion refers to the mutual conversion between a time-domain signal and a frequency-domain signal, and the time-domain signal may be subjected to frequency-domain conversion based on Fast Fourier Transform (FFT). Alternatively, the time-domain signal may be frequency-domain transformed based on a short-time Fourier transform (STFT). Alternatively, the time domain signal may also be frequency domain transformed based on other fourier transforms.

For example, let x_p(m, n) denote the time domain signal of the p-th microphone in the n-th frame. The time domain signal of the n-th frame is transformed into a frequency domain signal, and the original noisy signal of the n-th frame is determined as X_p(k, n) = FFT(x_p(m, n)), where m is the discrete time point index within the n-th frame of the time domain signal and k is a frequency point. Thus, the present embodiment can obtain the original noisy signal of each frame through the time domain to frequency domain transformation. Of course, the acquisition of the original noisy signal of each frame may also be based on other fast Fourier transform equations, which are not limited herein.
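As an illustration only, a minimal sketch of this per-frame time-to-frequency conversion is shown below (Python with NumPy); the frame length, window choice and 0-based indexing are assumptions for the example, not values prescribed by this disclosure.

```python
import numpy as np

def frame_to_frequency(frame_td, nfft=2048):
    """Convert one time-domain frame of a microphone signal into its
    frequency-domain representation X_p(k, n) via a windowed FFT.

    frame_td : 1-D array of length nfft, the n-th frame of microphone p.
    Returns nfft complex frequency-point values.
    """
    window = np.hanning(len(frame_td))        # assumed analysis window
    return np.fft.fft(frame_td * window, n=nfft)

# Usage: stack the per-microphone spectra into the observation vector X(k, n).
# x_frames[p] holds the n-th time-domain frame of microphone p (p = 0, 1):
# X = np.stack([frame_to_frequency(x_frames[p]) for p in range(2)])  # shape (2, nfft)
```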

In the embodiment of the present disclosure, the predetermined frequency point range may be all frequency points included in each audio frame. For example, if the FFT point number of the system is Nfft, the number of frequency points included in each audio frame is Nfft. All Nfft frequency points are divided into D frequency point sub-bands, and the frequency points included in each frequency point sub-band F_d, d = 1, ..., D, are as follows:

Exemplarily, the number of FFT points of the system is 2048 and the frequency points are divided into D = 4 frequency point sub-bands, so that the first frequency point sub-band is F_1 = {1, 2, ..., 1024}, the second frequency point sub-band is F_2 = {1025, 1026, ..., 1536}, the third frequency point sub-band is F_3 = {1537, 1538, ..., 1792}, and the fourth frequency point sub-band is F_4 = {1793, 1794, ..., 2048}.
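For illustration, the sketch below divides the Nfft frequency points into the four sub-bands of this example; the particular band edges simply reproduce the 1024/512/256/256 split above and are otherwise an assumption about how the boundaries would be chosen.

```python
import numpy as np

def split_subbands(nfft=2048, edges=(0, 1024, 1536, 1792, 2048)):
    """Divide the nfft frequency points into D frequency-point sub-bands.

    With the default edges this reproduces the example above:
    F1 = {1..1024}, F2 = {1025..1536}, F3 = {1537..1792}, F4 = {1793..2048}
    (1-based indices in the text, 0-based here).
    """
    return [np.arange(edges[d], edges[d + 1]) for d in range(len(edges) - 1)]

subbands = split_subbands()
print([len(band) for band in subbands])   # [1024, 512, 256, 256]
```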

In the disclosed embodiment, each sound source may be in one of two activation states on a frequency point sub-band F_d at different times: active or inactive, i.e. the sound source either emits an audio signal or does not emit an audio signal. The activation state probabilities are therefore represented here by the first state probability and the second state probability, where an activation state probability refers to the probability that a sound source emits (or does not emit) an audio signal at each frequency point. Here, the first state probability is an estimated prior activation state probability, and the second state probability is the posterior activation state probability of each frequency point of the sound source on each frequency point sub-band, determined according to the first state probability.

Here, the first state probability is preset as the prior state probability, and may be initially preset to be a uniform distribution, for example. The a posteriori state probabilities, i.e. the above-mentioned second state probabilities, are then determined on the basis of the first state probabilities and the frequency domain estimation signal.

For example, the frequency domain estimation signal may be obtained by separating the frequency domain noisy signal according to an initial separation matrix or a separation matrix of a previous frame, and according to the distribution of the frequency domain estimation signal, a signal distribution model in two different activation states on each frequency point subband may be determined. Based on the signal distribution model and the first state probability, a posterior activation state probability, i.e., the second state probability, can be obtained. And updating the separation matrix according to the second state probability so as to separate the original signal with noise.

Therefore, the activation state of the sound source is taken into account when the audio signals are separated. Compared with the prior art, in which separation is performed under the assumption that the sound source is always in the activated state, this separation method is closer to the audio signals actually emitted by the sound sources, so that the separated speech quality is improved and the signal-to-noise ratio and the recognition rate are increased.

In some embodiments, the method comprises:

if the second state probability or the first state probability is not converged, updating the first state probability according to the second state probability;

and updating the second state probability according to the frequency domain estimation signal of each frequency point sub-band and the updated first state probability.

In the embodiment of the present disclosure, the first state probability and the second state probability may be updated repeatedly according to their convergence conditions until both converge. The finally obtained first state probability and second state probability approach fixed values, i.e. they approach the distribution of the actual sound source state probabilities.

Here, the first state probability and the second state probability are each a sequence of values over the corresponding frequency point sub-bands, and convergence means that, with repeated updating, the final first state probability and the final second state probability approach the distribution of the actual sound source state probabilities.
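As a hedged illustration, a simple element-wise criterion such as the following could serve as the convergence test for these quantities; the disclosure does not specify a particular threshold or norm, so both are assumptions.

```python
import numpy as np

def has_converged(prev, curr, tol=1e-6):
    """Element-wise convergence test between two successive estimates
    (an assumed criterion; the disclosure does not fix tol or the norm)."""
    return np.max(np.abs(np.asarray(curr) - np.asarray(prev))) < tol
```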

In some embodiments, said updating said first state probability in accordance with said second state probability comprises:

and updating the first state probability according to the sum of the second state probabilities of the frequency point sub-bands and the number of the frequency point sub-bands.

Illustratively, the a priori activation state probability, i.e. the first state probability, is estimated here using the following equation (1):

wherein π_{p,n,c} is the prior probability that the p-th sound source is in state c at time n, i.e. the first state probability; the quantity summed over the sub-bands is the posterior probability that the p-th sound source is in state c at time n on frequency point sub-band F_d, i.e. the second state probability; D is the number of the divided frequency point sub-bands; and φ_c is a parameter, illustratively φ_c = 5, c = 0, 1.
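Because the image of equation (1) is not reproduced in this text, the sketch below only illustrates the relationship stated above and in claim 3: the prior (first) state probability is formed from the sum of the per-sub-band posterior (second) state probabilities, the number of sub-bands D, and the parameter φ_c. The Dirichlet-style normalization and the array names are assumptions.

```python
import numpy as np

def update_prior(mu, phi=(5.0, 5.0)):
    """Update the first (prior) state probability pi[p, c] from the
    second (posterior) state probabilities mu[p, d, c].

    mu  : array of shape (P, D, 2), posterior of source p being in state c
          (c = 0 inactive, c = 1 active) on sub-band d at the current frame.
    phi : smoothing parameters phi_c (the text gives phi_c = 5).
    The normalization below is an assumption; the image of equation (1)
    is not available in this text.
    """
    counts = mu.sum(axis=1) + np.asarray(phi)          # sum over the D sub-bands
    return counts / counts.sum(axis=1, keepdims=True)  # normalize over the two states
```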

In some embodiments, the updating the second state probability according to the frequency domain estimation signal of each frequency point sub-band and the updated first state probability includes:

determining a state probability distribution function according to the frequency domain estimation signal of each frequency point sub-band;

and updating the second state probability according to the state probability distribution function and the updated first state probability.

In the embodiment of the present disclosure, a probability distribution model, that is, a state probability distribution function, may be determined according to the frequency domain estimation signal, and a distribution situation of probabilities of a sound source in different states may be determined.

Thus, according to the state probability distribution function and the prior activation state probability, namely the first state probability, the second state probability can be determined, and the second state probability is closer to the real state probability of the sound source.

Illustratively, the second state probability may be updated by the following equation (2):

wherein π_{p,n,c} is the updated first state probability, the remaining factor is the state probability distribution function, expressed in terms of a contrast function, and α_c and β_c are coefficients of the contrast function, illustratively (α_0, β_0) = (0.09, 0.1) and (α_1, β_1) = (1, 0.1).
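The exact state probability distribution and contrast functions of equation (2) are likewise not reproduced here; the following sketch only shows the Bayes-style combination of the updated prior with a per-sub-band state likelihood, which is assumed to be computed elsewhere from the frequency domain estimation signal.

```python
import numpy as np

def update_posterior(pi, likelihood):
    """Update the second (posterior) state probability mu[p, d, c] from the
    updated prior pi[p, c] and a per-sub-band state likelihood.

    pi         : array (P, 2), updated first state probability per source.
    likelihood : array (P, D, 2), value of the state probability distribution
                 function for each sub-band and state, computed from the
                 frequency-domain estimate of that sub-band (the exact contrast
                 function with coefficients alpha_c, beta_c is treated as given).
    Returns mu of shape (P, D, 2), normalized over the two states.
    """
    joint = pi[:, None, :] * likelihood            # prior times likelihood
    return joint / joint.sum(axis=2, keepdims=True)
```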

In some embodiments, the determining the separation matrix of each frequency point corresponding to each frequency point sub-band according to the second state probability includes:

determining alternative separation matrixes of the frequency points corresponding to the frequency point sub-bands according to the updated second state probability;

if the alternative separation matrix is not converged, determining the alternative separation matrix of each frequency point corresponding to each frequency point sub-band again according to the updated second state probability;

and if the alternative separation matrix is converged, determining the alternative separation matrix as the separation matrix.

In the embodiment of the present disclosure, the above update is repeated using the second state probability, and the separation matrix may be updated repeatedly until it converges. Convergence of the separation matrix means that each element of the separation matrix approaches a fixed value as the updates are repeated. A more accurate separation matrix is thereby finally obtained, which improves the accuracy of signal separation.

In some embodiments, the determining the alternative separation matrix of each frequency point corresponding to each frequency point sub-band according to the updated second state probability includes:

determining a covariance matrix of each frequency point of the at least two sound sources on each frequency point subband according to the updated second state probability;

and determining the alternative separation matrix according to the covariance matrix.

In the embodiment of the present disclosure, a weighted covariance matrix may be determined according to the frequency-domain original noisy signal and the weighting coefficient, as shown in the following equation (3):

wherein the weighting coefficient is determined from the updated second state probability and the frequency domain estimation signal; Y(k, n) = [Y_1(k, n), Y_2(k, n)]^T = W(k)X(k, n), X(k, n) is the original noisy signal in the frequency domain, X(k, n)^H is the conjugate transpose of X(k, n), Y(k, n) is the frequency domain estimation signal, and W(k) is the initialized separation matrix or the last determined alternative separation matrix.

Based on the covariance matrix, the separation matrix can then be updated to obtain an alternative separation matrix:

The updated separation matrix is W(k) = [w_1(k), w_2(k)]^H, where w_p(k) = (W^H(k) R_{p,k})^{-1} e_p, and p denotes the sound source index, p = 1, 2.

Thus, an alternative separation matrix is obtained by this update, and whether it is the final separation matrix can be determined by checking whether it has converged. If the alternative separation matrix has not converged, the covariance matrix is re-determined by substituting the current alternative separation matrix, and the alternative separation matrix is determined again, until the alternative separation matrix converges.
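A hedged sketch of one pass of this update at a single frequency point is given below, following the standard auxiliary-function (AuxIVA-style) form that equation (3) and the matrix update above appear to describe. The weighting coefficients are passed in precomputed because the image of their defining equation is not reproduced in this text, and the explicit normalization of w_p is an assumption.

```python
import numpy as np

def update_separation_matrix(W, X, weights):
    """One pass of the separation-matrix update at a single frequency point k.

    W       : (P, P) current separation matrix at frequency point k.
    X       : (P, N) frequency-domain noisy observations X(k, n) over frames.
    weights : (P, N) weighting coefficients per source and frame (derived from
              the posterior state probabilities and the frequency-domain
              estimates; their exact formula is not reproduced here).
    Returns the updated (P, P) separation matrix.
    """
    P, N = X.shape
    W_new = W.copy()
    for p in range(P):
        # weighted covariance matrix R_{p,k} of the observations
        R_p = (weights[p][None, :] * X) @ X.conj().T / N
        # AuxIVA-style update: w_p = (W R_p)^{-1} e_p, then normalize (assumed)
        w_p = np.linalg.solve(W_new @ R_p, np.eye(P)[:, p])
        w_p /= np.sqrt(np.real(w_p.conj() @ R_p @ w_p))
        W_new[p, :] = w_p.conj()          # row p of W is w_p^H
    return W_new
```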

After the separation matrix is determined, the frequency domain noisy signals can be separated by the separation matrix to obtain the final frequency domain signal of each sound source, and the separated time domain sound source signals can be obtained by applying an ISTFT and overlap-add processing to each frame.
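A minimal sketch of this final reconstruction step might look as follows; the hop size and synthesis window are assumptions, and a plain inverse FFT stands in for the ISTFT of each frame.

```python
import numpy as np

def overlap_add(frames_fd, nfft=2048, hop=1024):
    """Reconstruct a time-domain source signal from its per-frame
    frequency-domain estimates Y_p(k, n) by inverse FFT and overlap-add.

    frames_fd : (N, nfft) complex array, one row per frame.
    Returns the reconstructed 1-D time-domain signal.
    """
    n_frames = frames_fd.shape[0]
    out = np.zeros(hop * (n_frames - 1) + nfft)
    window = np.hanning(nfft)                     # assumed synthesis window
    for n in range(n_frames):
        frame_td = np.real(np.fft.ifft(frames_fd[n], n=nfft))
        out[n * hop : n * hop + nfft] += window * frame_td
    return out
```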

Embodiments of the present disclosure also provide the following examples:

FIG. 3 is a flow chart illustrating a method of audio signal processing according to an exemplary embodiment; in the audio signal processing method, as shown in fig. 2, the sound source includes a sound source 1 and a sound source 2, and the microphone includes a microphone 1 and a microphone 2. Based on the audio signal processing method, the audio signals of the sound source 1 and the sound source 2 are recovered from the original noisy signals of the microphone 1 and the microphone 2. As shown in fig. 3, the method comprises the steps of:

step S301, initializing the separation matrix of each frequency point as a unit matrix:wherein, K is 1.

Let the system FFT point number be Nfft. All Nfft frequency points are divided into the following D frequency point sub-bands; illustratively, D = 4. The divided frequency point sub-bands are as follows:

by usingRepresentative sound source p in subband FdState of activation or non-activation at last n time, i.e.Let Pip,n,cRepresents the prior probability that sound source p is in the c state at time n, i.e. the first state probability in the above embodiment. Exemplarily, np,n,cInitialization is to a uniform distribution.

Step S302, determining a frequency domain noisy signal;

to be provided withTime domain signal representing the nth frame of the pth microphone, p being 1, 2; m is 1, … Nfft. N is 1, NT. Windowing and carrying out Nfft point FFT to obtain corresponding frequency domain signal Xp(k,n):k=1,..,K。n=1,..,NTThen the observed signal matrix, i.e. the frequency domain noisy signal, is: x (k, n) ═ X1(k,n),X2(k,n)]T。 k=1,..,K。n=1,..,NT

The separation matrix W(k) and the prior probability π_{p,n,c} are estimated by the EM algorithm to obtain a posteriori estimates of the separated signal Y(k, n), i.e. the frequency domain estimation signal obtained with the initial separation matrix.

The above-mentioned EM algorithm, i.e. the Expectation-Maximization algorithm, is often used in statistics to find maximum likelihood estimates (or maximum a posteriori estimates) of the parameters of a probabilistic model that depends on unobservable latent variables. The EM algorithm alternately iterates an E (expectation) step and an M (maximization) step: the E step computes the expectation of the log-likelihood using the current estimates of the latent variables, and the M step maximizes that expectation to obtain new parameter values. The parameter estimates found in the M step are used in the next E step, thereby realizing the alternating iteration.
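As a purely illustrative outline of this description, the generic EM skeleton below alternates the two steps; the concrete E and M steps used by this method are the ones detailed in steps S303 and S304.

```python
def em(init_params, e_step, m_step, n_iter=50):
    """Generic EM skeleton: alternate the E step and the M step.

    e_step(params) returns the expected values of the latent variables given
    the current parameters; m_step(latent) returns new parameter values that
    maximize the expected likelihood. Illustrative only.
    """
    params = init_params
    latent = None
    for _ in range(n_iter):
        latent = e_step(params)   # E step: expectation over the latent variables
        params = m_step(latent)   # M step: maximize the expected likelihood
    return params, latent
```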

In an embodiment of the present disclosure, the EM algorithm includes the following steps:

step S303 and step E: estimating the posterior activation state probability of the sound source;

First, the prior frequency domain estimates of the two sound source signals in the current frame are obtained using the last separation matrix W(k). Let Y(k, n) = [Y_1(k, n), Y_2(k, n)]^T, k = 1, ..., K, where Y_1(k, n) and Y_2(k, n) are the estimated values of sound sources s1 and s2 at the time-frequency point (k, n), respectively, obtained by separating the observation matrix X(k, n) with the separation matrix W(k), as shown in equation (4):

Y(k, n) = W(k) X(k, n), k = 1, ..., K, n = 1, ..., N_T.    (4)

The frequency domain estimate of the p-th sound source in the n-th frame is then:

where p = 1, 2.

Then, the posterior activation state probability of the sound source, i.e., the above-mentioned second state probability, is estimated as shown in the following equation (6):

wherein, the state probability distribution function is shown in the following formula (7):

wherein the contrast function is determined by the following equation (8) and equation (9).

wherein α_c and β_c are coefficients, illustratively (α_0, β_0) = (0.09, 0.1) and (α_1, β_1) = (1, 0.1).

Step S304, M step: estimating the prior activation state probability π_{p,n,c}.

According to the posterior activation state probability, the prior activation state probability may be updated, as shown in equation (10):

wherein φ_c is a parameter, illustratively φ_c = 5, c = 0, 1. After the updated first state probability is obtained, the above procedure may be repeated to update the second state probability.

Step S305, updating the separation matrix according to the updated posterior activation state probability: W(k) = [w_1(k), w_2(k)]^H, k = 1, ..., K, which specifically comprises the following steps:

a) computing a weighted covariance matrix R_{p,k} as in the following equation (11):

wherein the weighting coefficient is given by the following equation (12):

Y(k, n) = [Y_1(k, n), Y_2(k, n)]^T = W(k) X(k, n)    (13)

b) updating the separation matrix W(k) = [w_1(k), w_2(k)]^H:

w_p(k) = (W^H(k) R_{p,k})^{-1} e_p    (15)

Repeating the above equations (11) to (16) can continuously optimize the separation matrix, and finally obtain the convergent separation matrix.

If the first state probability (the prior activation state probability) and the second state probability (the posterior activation state probability) have still not converged, the E step and the M step may be repeated until W(k), π_{p,n,c} and the posterior activation state probability all converge.

Step S306, separating the original noisy signal using W(k) to obtain the posterior frequency domain estimate of the sound source signals, as shown in the following equation (17):

Y(k, n) = [Y_1(k, n), Y_2(k, n)]^T = W(k) X(k, n)    (17)

Step S307, applying an ISTFT and overlap-add to Y_p(k, n), k = 1, ..., K, for each sound source, to obtain the separated time domain sound source signals, as shown in the following equation (18):

where n is the frame index, m is the discrete time point index within the frame, m = 1, ..., Nfft, and p = 1, 2.
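To tie steps S301 through S306 together, the following hedged outline shows how the alternating updates might be iterated over a batch of frames. The helpers update_posterior, update_prior and update_separation_matrix are the hypothetical sketches given earlier; the per-state likelihood and the weighting coefficients are crude energy-based stand-ins for the equations whose images are not reproduced in this text, and the convergence test is a simple placeholder rather than the patent's exact criterion.

```python
import numpy as np

def separate_batch(X, subbands, n_iter=20, tol=1e-6):
    """Alternate the E step (S303), M step (S304) and separation-matrix
    update (S305) over a batch of frames, then separate (S306).

    X        : (P, K, N) complex array -- P microphones, K frequency points, N frames.
    subbands : list of index arrays F_d dividing the K frequency points.
    """
    P, K, N = X.shape
    W = np.stack([np.eye(P, dtype=complex)] * K)            # S301: identity init
    pi = np.full((N, P, 2), 0.5)                            # uniform prior per frame
    for _ in range(n_iter):
        Y = np.einsum('kpq,qkn->pkn', W, X)                 # Y(k,n) = W(k) X(k,n)
        # stand-in likelihood from per-sub-band energy, shape (N, P, D, 2)
        energy = np.stack([np.mean(np.abs(Y[:, f, :]) ** 2, axis=1)
                           for f in subbands], axis=1)      # (P, D, N)
        active = 1.0 - np.exp(-energy)                      # assumed, not the patent's form
        lik = np.stack([1.0 - active, active], -1).transpose(2, 0, 1, 3)
        mu = np.stack([update_posterior(pi[n], lik[n]) for n in range(N)])   # S303
        pi_new = np.stack([update_prior(mu[n]) for n in range(N)])           # S304
        # stand-in weighting coefficients, shape (P, K, N)
        weights = 1.0 / (np.sqrt(np.sum(np.abs(Y) ** 2, axis=1, keepdims=True)) + 1e-9)
        weights = np.broadcast_to(weights, Y.shape).copy()
        W = np.stack([update_separation_matrix(W[k], X[:, k, :], weights[:, k, :])
                      for k in range(K)])                   # S305
        if np.max(np.abs(pi_new - pi)) < tol:               # placeholder convergence test
            break
        pi = pi_new
    return np.einsum('kpq,qkn->pkn', W, X)                  # S306: separated spectra
```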

Fig. 4 is a block diagram illustrating an audio signal processing apparatus according to an exemplary embodiment. Referring to fig. 4, the apparatus includes a first obtaining module 401, a second obtaining module 402, a dividing module 403, a first determining module 404, a second determining module 405, and a third obtaining module 406.

A first obtaining module 401, configured to obtain, by at least two microphones, audio signals emitted by at least two sound sources, respectively, so as to obtain original noisy signals of the at least two microphones, respectively;

a second obtaining module 402, configured to, for each frame in a time domain, obtain frequency domain estimation signals of the at least two sound sources according to the original noisy signals of the at least two microphones, respectively;

a dividing module 403, configured to divide a predetermined frequency point range into multiple frequency point sub-bands, where each frequency point sub-band includes multiple frequency point data;

a first determining module 404, configured to determine, according to the frequency domain estimation signal of each frequency point subband and a preset first state probability that each of the at least two sound sources is in a preset state, a second state probability that the at least two sound sources are in the preset state on each frequency point subband;

a second determining module 405, configured to determine, according to the second state probability, a separation matrix of each frequency point corresponding to each frequency point subband;

a third obtaining module 406, configured to obtain, based on the separation matrix and the original noisy signal, audio signals sent by at least two sound sources respectively.

In some embodiments, the apparatus further comprises:

a first updating module, configured to update the first state probability according to the second state probability if the second state probability or the first state probability is not converged;

and the second updating module is used for updating the second state probability according to the frequency domain estimation signal of each frequency point sub-band and the updated first state probability.

In some embodiments, the first update module comprises:

and the first updating submodule is used for updating the first state probability according to the sum of the second state probabilities of the frequency point sub-bands and the number of the frequency point sub-bands.

In some embodiments, the second update module comprises:

the first determining submodule is used for determining a state probability distribution function according to the frequency domain estimation signal of each frequency point sub-band;

and the second updating submodule is used for updating the second state probability according to the state probability distribution function and the updated first state probability.

In some embodiments, the second determining module comprises:

a second determining submodule, configured to determine, according to the updated second state probability, an alternative separation matrix of each frequency point corresponding to each frequency point subband;

a third determining submodule, configured to determine, according to the updated second state probability, an alternative separation matrix of each frequency point corresponding to each frequency point sub-band again if the alternative separation matrix is not converged;

a fourth determining submodule, configured to determine the candidate separation matrix as the separation matrix if the candidate separation matrix converges.

In some embodiments, the second determining sub-module includes:

a fifth determining submodule, configured to determine, according to the updated second state probability, a covariance matrix of each frequency point on each frequency point subband of the at least two sound sources;

and the sixth determining submodule is used for determining the alternative separation matrix according to the covariance matrix.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Fig. 5 is a block diagram illustrating a physical structure of an audio signal processing apparatus 500 according to an exemplary embodiment. For example, the apparatus 500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and so forth.

Referring to fig. 5, the apparatus 500 may include one or more of the following components: a processing component 501, a memory 502, a power supply component 503, a multimedia component 504, an audio component 505, an input/output (I/O) interface 506, a sensor component 507, and a communication component 508.

The processing component 501 generally controls the overall operation of the device 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 501 may include one or more processors 510 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 501 may also include one or more modules that facilitate interaction between the processing component 501 and other components. For example, the processing component 501 may include a multimedia module to facilitate interaction between the multimedia component 504 and the processing component 501.

The memory 502 is configured to store various types of data to support operations at the apparatus 500. Examples of such data include instructions for any application or method operating on the apparatus 500, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 502 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power supply component 503 provides power to the various components of the device 500. The power supply component 503 may include: a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the apparatus 500.

The multimedia component 504 includes a screen that provides an output interface between the device 500 and a user. In some embodiments, the screen may include a liquid crystal display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.

The audio component 505 is configured to output and/or input audio signals. For example, the audio component 505 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 500 is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may further be stored in the memory 502 or transmitted via the communication component 508. In some embodiments, the audio component 505 further comprises a speaker for outputting audio signals.

The I/O interface 506 provides an interface between the processing component 501 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 507 includes one or more sensors for providing various aspects of status assessment for the device 500. For example, the sensor component 507 may detect the open/closed status of the device 500 and the relative positioning of components such as the display and keypad of the device 500. The sensor component 507 may also detect a change in the position of the device 500 or of a component of the device 500, the presence or absence of user contact with the device 500, the orientation or acceleration/deceleration of the device 500, and a change in the temperature of the device 500. The sensor assembly 507 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 507 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 507 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 508 is configured to facilitate wired or wireless communication between the apparatus 500 and other devices. The apparatus 500 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 508 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 508 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.

In an exemplary embodiment, the apparatus 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the methods described above.

In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 502 comprising instructions, executable by the processor 510 of the apparatus 500 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform any of the methods provided in the above embodiments.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
