Voice separation method, device, equipment and storage medium

Document number: 1955201; publication date: 2021-12-10; original language: Chinese

Reading note: this technology, "Voice separation method, device, equipment and storage medium", was designed and created by 戴玮 (Dai Wei), 关海欣 (Guan Haixin) and 梁家恩 (Liang Jiaen) on 2021-09-06. Its main content: the invention relates to a voice separation method, device, equipment and storage medium. After a time-domain mixed voice signal is separated into a time domain signal of a first channel and a time domain signal of a second channel, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the first-channel time domain signal are selected in order of signal energy from high to low and their mode is taken to obtain the direction estimation information of the first channel; likewise, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the second-channel time domain signal are selected and their mode is taken to obtain the direction estimation of the second channel. The pitch angle deviation and azimuth angle deviation of the first channel are calculated from the direction estimation information of the first channel, and those of the second channel from the direction estimation information of the second channel. The deviations of the two channels are compared, and the target sound source corresponding to each channel is determined from the comparison result.

1. A method of speech separation, comprising:

carrying out short-time Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;

separating the mixed voice signals of the time-frequency domain to obtain a separation signal of a first channel and a separation signal of a second channel;

respectively carrying out short-time Fourier inverse transformation on the separation signal of the first channel and the separation signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;

selecting, in order of signal energy from high to low, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel and taking their mode to obtain the direction estimation information of the first channel, and selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel and taking their mode to obtain the direction estimation of the second channel;

calculating the pitch angle deviation of the first channel and the azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculating the pitch angle deviation of the second channel and the azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel;

if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel, and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the first target sound source, and the second channel is the voice information of the second target sound source;

and if the pitch angle deviation of the first channel is greater than the pitch angle deviation of the second channel and the azimuth angle deviation of the first channel is greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source.

2. The speech separation method according to claim 1, wherein before performing short-time inverse fourier transform on the separated signal of the first channel and the separated signal of the second channel to obtain the time-domain signal of the first channel and the time-domain signal of the second channel, the method further comprises:

processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;

comparing the energy of the primary noise reduction signal of the first channel with the energy of the time-domain mixed voice signal, and processing the higher-energy signal together with the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;

correspondingly, the performing short-time inverse fourier transform on the separation signal of the first channel and the separation signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel includes:

and respectively carrying out short-time Fourier inverse transformation on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel.

3. The method of separating speech according to claim 2, wherein before performing short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel, the method further comprises:

performing single-channel noise reduction on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel, respectively, to eliminate background noise and obtain a final noise reduction signal of the first channel and a final noise reduction signal of the second channel;

correspondingly, the performing short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel includes:

and respectively carrying out short-time Fourier inverse transformation on the final noise reduction signal of the first channel and the final noise reduction signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel.

4. The speech separation method of claim 1, further comprising:

and when the pitch angle deviation is greater than the angular deviation threshold of the pitch angle, or the azimuth angle deviation is greater than the angular deviation threshold of the azimuth angle, updating the weight of the filter corresponding to the adaptive filtering algorithm.

5. The speech separation method of claim 4, further comprising:

and when the pitch angle deviation is smaller than or equal to the angle deviation threshold of the pitch angle, and the azimuth angle deviation is smaller than or equal to the angle deviation threshold of the azimuth angle, maintaining the weight of the filter corresponding to the adaptive filtering algorithm unchanged.

6. The speech separation method of claim 1, wherein the adaptive filtering algorithm is any one of the least mean square (LMS) algorithm, the normalized least mean square (NLMS) algorithm, and the recursive least squares (RLS) algorithm.

7. A speech separation apparatus, comprising:

the first transformation module is used for carrying out short-time Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;

the separation module is used for separating the mixed voice signals of the time-frequency domain to obtain separation signals of a first channel and separation signals of a second channel;

the second transformation module is used for respectively carrying out short-time Fourier inverse transformation on the separation signal of the first channel and the separation signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;

the direction estimation module is used for selecting, in order of signal energy from high to low, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel and taking their mode to obtain the direction estimation information of the first channel, and for selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel and taking their mode to obtain the direction estimation of the second channel;

the deviation estimation module is used for calculating pitch angle deviation of the first channel and azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculating pitch angle deviation of the second channel and azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel;

a determining module, configured to determine that the first channel is voice information of a first target sound source and the second channel is voice information of a second target sound source if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel; and if the pitch angle deviation of the first channel is greater than the pitch angle deviation of the second channel and the azimuth angle deviation of the first channel is greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source.

8. The speech separation device of claim 7, wherein the separation module is further configured to:

processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;

comparing the energy of the primary noise reduction signal of the first channel with the energy of the time-domain mixed voice signal, and processing the higher-energy signal together with the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;

correspondingly, the second transform module is further configured to perform short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel, respectively, to obtain a time domain signal of the first channel and a time domain signal of the second channel.

9. A speech separation device, comprising: a processor and a memory;

the processor is configured to execute a program of the voice separation method stored in the memory to implement the voice separation method of any one of claims 1 to 6.

10. A storage medium storing one or more programs which, when executed, implement the speech separation method of any one of claims 1-6.

Technical Field

The invention relates to the technical field of voice processing, in particular to a voice separation method, a voice separation device, voice separation equipment and a storage medium.

Background

In recent years, the rapid development of speech recognition technology has created an urgent demand for real-time speech separation in multi-channel speech recognition scenarios. For example, in one-to-one teaching, the voice of the student and the voice of the teacher need to be separated.

In the related art, blind source separation is usually adopted to separate mixed speech, but the order of the output channels corresponding to the separated speech signals is indeterminate, so the user must further determine which speech signal corresponds to each channel, which reduces speech separation efficiency.

Disclosure of Invention

The invention provides a voice separation method, device, equipment and storage medium to solve the technical problem in the prior art that the output channel order produced by blind source separation is indeterminate, requiring the user to further determine the speech signal corresponding to each channel and thereby reducing speech separation efficiency.

The technical scheme for solving the technical problems is as follows:

a method of speech separation comprising:

carrying out short-time Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;

separating the mixed voice signals of the time-frequency domain to obtain a separation signal of a first channel and a separation signal of a second channel;

respectively carrying out short-time Fourier inverse transformation on the separation signal of the first channel and the separation signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;

selecting, in order of signal energy from high to low, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel and taking their mode to obtain the direction estimation information of the first channel, and selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel and taking their mode to obtain the direction estimation of the second channel;

calculating the pitch angle deviation of the first channel and the azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculating the pitch angle deviation of the second channel and the azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel;

if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel, and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the first target sound source, and the second channel is the voice information of the second target sound source;

and if the pitch angle deviation of the first channel is greater than the pitch angle deviation of the second channel and the azimuth angle deviation of the first channel is greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source.

Further, in the voice separation method, before performing short-time inverse fourier transform on the separation signal of the first channel and the separation signal of the second channel, respectively, to obtain a time domain signal of the first channel and a time domain signal of the second channel, the method further includes:

processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;

comparing the energy of the primary noise reduction signal of the first channel with the energy of the time-domain mixed voice signal, and processing the higher-energy signal together with the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;

correspondingly, the performing short-time inverse fourier transform on the separation signal of the first channel and the separation signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel includes:

and respectively carrying out short-time Fourier inverse transformation on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel.

Further, in the voice separation method, before performing short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel, respectively, to obtain a time domain signal of the first channel and a time domain signal of the second channel, the method further includes:

performing single-channel noise reduction on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel, respectively, to eliminate background noise and obtain a final noise reduction signal of the first channel and a final noise reduction signal of the second channel;

correspondingly, the performing short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel includes:

and respectively carrying out short-time Fourier inverse transformation on the final noise reduction signal of the first channel and the final noise reduction signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel.

Further, the voice separation method further includes:

and when the pitch angle deviation is greater than the angular deviation threshold of the pitch angle, or the azimuth angle deviation is greater than the angular deviation threshold of the azimuth angle, updating the weight of the filter corresponding to the adaptive filtering algorithm.

Further, the voice separation method further includes:

and when the pitch angle deviation is smaller than or equal to the angle deviation threshold of the pitch angle, and the azimuth angle deviation is smaller than or equal to the angle deviation threshold of the azimuth angle, maintaining the weight of the filter corresponding to the adaptive filtering algorithm unchanged.

Further, in the speech separation method, the adaptive filtering algorithm is any one of the least mean square (LMS) algorithm, the normalized least mean square (NLMS) algorithm, and the recursive least squares (RLS) algorithm.
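To make the filter choice concrete, here is a minimal Python sketch of the NLMS weight update; the function name, filter order, and step size are illustrative assumptions and not part of the patent:

```python
import numpy as np

def nlms_filter(reference, desired, order=8, mu=0.5, eps=1e-8):
    """Normalized LMS adaptive filter (illustrative sketch).

    reference : signal fed to the adaptive filter (interference estimate)
    desired   : signal from which the filtered reference is subtracted
    Returns the error signal e[n], i.e. the noise-reduced output.
    """
    w = np.zeros(order)              # filter weights
    x_buf = np.zeros(order)          # most recent `order` reference samples
    out = np.zeros(len(desired))
    for n in range(len(desired)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = reference[n]
        y = w @ x_buf                # filter output: interference estimate
        e = desired[n] - y           # error = noise-reduced sample
        # NLMS update: step size normalized by the input power
        w += mu * e * x_buf / (x_buf @ x_buf + eps)
        out[n] = e
    return out
```

With `reference` equal to `desired`, the filter quickly learns an identity tap and the output energy collapses, which is the cancellation behaviour the adaptive-filtering steps above rely on.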

The invention also provides a voice separation device, comprising:

the first transformation module is used for carrying out short-time Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;

the separation module is used for separating the mixed voice signals of the time-frequency domain to obtain separation signals of a first channel and separation signals of a second channel;

the second transformation module is used for respectively carrying out short-time Fourier inverse transformation on the separation signal of the first channel and the separation signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;

the direction estimation module is used for selecting, in order of signal energy from high to low, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel and taking their mode to obtain the direction estimation information of the first channel, and for selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel and taking their mode to obtain the direction estimation of the second channel;

the deviation estimation module is used for calculating pitch angle deviation of the first channel and azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculating pitch angle deviation of the second channel and azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel;

a determining module, configured to determine that the first channel is voice information of a first target sound source and the second channel is voice information of a second target sound source if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel; and if the pitch angle deviation of the first channel is greater than the pitch angle deviation of the second channel and the azimuth angle deviation of the first channel is greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source.

Further, in the above speech separation apparatus, the separation module is further configured to:

processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;

comparing the energy of the primary noise reduction signal of the first channel with the energy of the time-domain mixed voice signal, and processing the higher-energy signal together with the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;

correspondingly, the second transform module is further configured to perform short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel, respectively, to obtain a time domain signal of the first channel and a time domain signal of the second channel.

The present invention also provides a voice separating apparatus, comprising: a processor and a memory;

the processor is configured to execute the program of the voice separation method stored in the memory to implement the voice separation method described in any one of the above.

The present invention also provides a storage medium, wherein the storage medium stores one or more programs that when executed implement any of the above-described speech separation methods.

The invention has the beneficial effects that:

After voice separation is performed on the time-domain mixed voice signal and the time domain signal of the first channel and the time domain signal of the second channel are obtained, energy judgment is integrated: the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the first-channel time domain signal are selected and their mode is taken to obtain the direction estimation information of the first channel, and the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the second-channel time domain signal are selected and their mode is taken to obtain the direction estimation of the second channel; then the pitch angle deviation and azimuth angle deviation of the first channel are calculated according to the direction estimation information of the first channel, and the pitch angle deviation and azimuth angle deviation of the second channel are calculated according to the direction estimation information of the second channel; if the pitch angle deviation of the first channel is not greater than that of the second channel, and/or the azimuth angle deviation of the first channel is not greater than that of the second channel, the first channel is determined to be the voice information of the first target sound source and the second channel the voice information of the second target sound source; and if the pitch angle deviation of the first channel is greater than that of the second channel and the azimuth angle deviation of the first channel is greater than that of the second channel, the first channel is determined to be the voice information of the second target sound source and the second channel the voice information of the first target sound source.
Therefore, the voice signals are output according to the determined channel sequence, the user is prevented from further determining the voice signal corresponding to each channel, and the voice separation efficiency is improved.

Drawings

FIG. 1 is a flow chart of an embodiment of a speech separation method of the present invention;

FIG. 2 is a schematic diagram of a microphone array according to the present invention;

FIG. 3 is a schematic structural diagram of an embodiment of a speech separation apparatus according to the present invention;

fig. 4 is a schematic structural diagram of the voice separating apparatus of the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.

Fig. 1 is a flowchart of an embodiment of a speech separation method of the present invention, and as shown in fig. 1, the speech separation method of the present embodiment may specifically include the following steps:

100. carrying out short-time Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;

fig. 2 is a schematic diagram of a microphone array according to the present invention. As shown in fig. 2, a reference direction may be set for the first sound source signal of the time-domain mixed speech signal received by the microphone array, for example a reference pitch angle of 30 degrees and a reference azimuth angle of 60 degrees. The second sound source signal of the time-domain mixed speech signal received by the microphone array may come from any direction.

In a specific implementation process, the microphone array can receive the time-domain mixed voice signal. Because a voice signal is short-time stationary, it is generally converted to the short-time frequency domain for analysis, so short-time Fourier transform is performed on the time-domain mixed voice signal to obtain the time-frequency domain mixed voice signal, which can be expressed as x(t, k), where t represents the frame index and k represents the frequency bin.
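The transform in step 100 can be sketched with a minimal numpy-only short-time Fourier transform; the frame length, hop size, and Hann window below are illustrative assumptions, since the patent does not fix these parameters:

```python
import numpy as np

def stft_frames(x, frame_len=512, hop=256):
    """Minimal short-time Fourier transform: split x into Hann-windowed
    frames and FFT each one. Returns X[t, k] -- frame index t,
    frequency bin k, matching the x(t, k) notation above."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    X = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for t in range(n_frames):
        frame = x[t * hop : t * hop + frame_len] * win
        X[t] = np.fft.rfft(frame)    # one-sided spectrum of the frame
    return X

# placeholder "mixture": one second of noise at an assumed 16 kHz rate
mix = np.random.default_rng(0).standard_normal(16000)
X = stft_frames(mix)
```

The inverse step (step 102) would apply the matching overlap-add inverse FFT to return to the time domain.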

101. Separating the mixed voice signals of the time-frequency domain to obtain a separation signal of a first channel and a separation signal of a second channel;

in a specific implementation process, the time-frequency domain mixed speech signal may be separated by using a blind source separation algorithm to obtain a first channel separation signal and a second channel separation signal. For a specific separation method, reference may be made to the related art, which is not described herein again.

102. Respectively carrying out short-time Fourier inverse transformation on the separation signal of the first channel and the separation signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;

in a specific implementation process, the separation signal of the first channel and the separation signal of the second channel may be subjected to short-time inverse fourier transform, respectively, to obtain a time domain signal of the first channel and a time domain signal of the second channel.

103. Selecting, in order of signal energy from high to low, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel and taking their mode to obtain the direction estimation information of the first channel, and selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel and taking their mode to obtain the direction estimation of the second channel;

In one specific implementation, the pitch angle and the azimuth angle of each frame of each channel can be obtained by two-dimensional direction-of-arrival estimation, and the energy of each frame of the speech signal can be obtained from the energy formula E_i = Σ_{t=1}^{N} x_i(t)^2, where E_i represents the speech signal energy of frame i, x_i(t) represents the time domain signal of the channel in that frame, and N represents the number of samples in a frame.

In a specific implementation process, the two-dimensional direction-of-arrival estimates corresponding to the frames of the first-channel time domain signal in the top 30% by energy may be selected and their mode taken to obtain the direction estimation information of the first channel, and likewise the two-dimensional direction-of-arrival estimates corresponding to the frames of the second-channel time domain signal in the top 30% by energy may be selected and their mode taken to obtain the direction estimation of the second channel.

Specifically, after the two-dimensional direction-of-arrival estimates (pitch angle and azimuth angle) of all frames are obtained, the frame energies are sorted from high to low and the pitch angles and azimuth angles of the 30% of frames with the highest energy are selected, giving an array of pitch angles and an array of azimuth angles. Three angle ranges may be set in advance, such as 0-50, 50-100, and 100-180 degrees; the mode is the range into which the most values in the array fall. For example, if values in 0-50 occur most often in the pitch angle array, the mode of the pitch angle of the channel is taken to lie in 0-50.
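The frame-energy ranking and mode-over-angle-bins procedure above can be sketched as follows; the helper name, the 30% fraction, and the use of the bin midpoint as the representative angle are illustrative assumptions:

```python
import numpy as np

def dominant_direction(frames, pitch, azimuth, top_frac=0.3,
                       bins=(0, 50, 100, 180)):
    """Direction estimate of one channel, as described above.

    frames         : (n_frames, frame_len) time domain frames of the channel
    pitch, azimuth : per-frame two-dimensional DOA estimates in degrees
    Frames are ranked by energy E_i = sum_t x_i(t)^2, the top `top_frac`
    are kept, and the mode is taken over the coarse angle bins."""
    energy = np.sum(frames ** 2, axis=1)        # E_i per frame
    n_keep = max(1, int(len(energy) * top_frac))
    top = np.argsort(energy)[::-1][:n_keep]     # highest-energy frames
    def mode_bin(angles):
        counts, _ = np.histogram(angles, bins=bins)
        lo, hi = bins[np.argmax(counts)], bins[np.argmax(counts) + 1]
        return (lo + hi) / 2   # representative angle of the modal bin
    return mode_bin(pitch[top]), mode_bin(azimuth[top])
```

Running the same helper on each channel yields the per-channel (pitch, azimuth) pair used in the deviation comparison below the text describes.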

104. Calculating the pitch angle deviation of the first channel and the azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculating the pitch angle deviation of the second channel and the azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel;

In one implementation, the direction estimation information of the first channel can be recorded as (θ1, φ1) and the direction estimation information of the second channel as (θ2, φ2). The pitch angle deviation of the first channel is Δθ1 = |θ1 − θ|, and the azimuth angle deviation of the first channel is Δφ1 = |φ1 − φ|, where θ represents the reference pitch angle and φ represents the reference azimuth angle; the deviations of the second channel are calculated in the same way.

105. Detecting whether the pitch angle deviation of the first channel is larger than the pitch angle deviation of the second channel or not, and whether the azimuth angle deviation of the first channel is larger than the azimuth angle deviation of the second channel or not; if yes, go to step 106, if no, go to step 107;

106. Determine that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source;

If the pitch angle deviation of the first channel is greater than the pitch angle deviation of the second channel and the azimuth angle deviation of the first channel is greater than the azimuth angle deviation of the second channel, it is determined that the first channel is the voice information of the second target sound source and the second channel is the voice information of the first target sound source.

107. And determining that the first channel is the voice information of a first target sound source, and the second channel is the voice information of a second target sound source.

And if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel, and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the first target sound source, and the second channel is the voice information of the second target sound source.
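The decision logic of steps 104-107 can be sketched as follows; the function name, the tuple representation of the direction estimates, and the reference direction passed in are illustrative assumptions:

```python
def assign_channels(est1, est2, ref):
    """Decide which separated channel corresponds to which target sound source.

    est1, est2 : (pitch, azimuth) mode estimates of channels 1 and 2, in degrees
    ref        : (pitch, azimuth) of the reference/target direction
    Returns the (channel1, channel2) -> source assignment as a tuple of labels.
    """
    d_pitch1, d_az1 = abs(est1[0] - ref[0]), abs(est1[1] - ref[1])
    d_pitch2, d_az2 = abs(est2[0] - ref[0]), abs(est2[1] - ref[1])
    # Channel 1 is reassigned to the second source only when it deviates more
    # than channel 2 in BOTH pitch and azimuth (steps 105-106); otherwise the
    # default assignment of step 107 holds.
    if d_pitch1 > d_pitch2 and d_az1 > d_az2:
        return ("source2", "source1")
    return ("source1", "source2")
```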

In the voice separation method of this embodiment, after the time-domain mixed speech signal is separated into a time-domain signal of a first channel and a time-domain signal of a second channel, an energy criterion is incorporated: the two-dimensional direction-of-arrival estimates corresponding to a specified number of the highest-energy frames of the first channel's time-domain signal are selected and their mode is taken to obtain the direction estimation information of the first channel, and likewise for the second channel. The pitch angle deviation and the azimuth angle deviation of each channel are then computed from its direction estimation information. If the pitch angle deviation of the first channel is not larger than that of the second channel, and/or the azimuth angle deviation of the first channel is not larger than that of the second channel, the first channel is determined to be the voice information of the first target sound source and the second channel the voice information of the second target sound source; if the pitch angle deviation of the first channel is greater than that of the second channel and the azimuth angle deviation of the first channel is greater than that of the second channel, the first channel is determined to be the voice information of the second target sound source and the second channel the voice information of the first target sound source.
In this way, the voice signals are output in the determined channel order, sparing the user from having to further identify the voice signal corresponding to each channel and improving voice separation efficiency.

In a specific implementation process, before step 102 "performing short-time inverse Fourier transform on the separated signal of the first channel and the separated signal of the second channel respectively to obtain the time-domain signal of the first channel and the time-domain signal of the second channel" in the above embodiment, the following steps may also be performed:

(1) processing the separation signal of the first channel and the separation signal of the second channel through a self-adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;

(2) comparing the energy of the preliminary noise reduction signal of the first channel with the energy of the time-domain mixed voice signal, and processing the higher-energy voice signal together with the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;

specifically, after the primary noise reduction signal of the first channel is obtained, energy comparison may be performed between the primary noise reduction signal of the first channel and the time-domain mixed speech signal, and a speech signal with high energy is selected. And if the energy of the primary noise reduction signal of the first channel is lower than that of the mixed voice signal of the time domain, the mixed voice signal of the time domain is taken as a voice signal with high energy. And taking the time-domain mixed voice signal as a reference, and filtering by using a self-adaptive filtering algorithm to obtain a primary noise reduction signal of the second channel. The adaptive filtering algorithm is any one of least mean square algorithm LMS, NLMS algorithm and least square method RLS.

Correspondingly, the performing short-time inverse fourier transform on the separation signal of the first channel and the separation signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel includes: and respectively carrying out short-time Fourier inverse transformation on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel.

In a specific implementation process, before "performing short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel", the following steps may be further performed:

(11) and respectively carrying out single-channel noise reduction and background noise elimination on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain a final noise reduction signal of the first channel and a final noise reduction signal of the second channel.

Correspondingly, performing short-time inverse fourier transform on the separation signal of the first channel and the separation signal of the second channel respectively to obtain a time domain signal of the first channel and a time domain signal of the second channel, including: and respectively carrying out short-time Fourier inverse transformation on the final noise reduction signal of the first channel and the final noise reduction signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel.

In this embodiment, the energy determination and the adaptive filtering technology are combined to further perform denoising processing on the separated voice signals of each channel, so that the separated voice is cleaner.

In one specific implementation procedure, after step 104 "calculating the pitch angle deviation and the azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculating the pitch angle deviation and the azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel", the following steps may be further performed: when the pitch angle deviation is greater than the angle deviation threshold of the pitch angle, or the azimuth angle deviation is greater than the angle deviation threshold of the azimuth angle, the weight of the filter corresponding to the adaptive filtering algorithm is updated; when the pitch angle deviation is less than or equal to the angle deviation threshold of the pitch angle and the azimuth angle deviation is less than or equal to the angle deviation threshold of the azimuth angle, the weight of the filter corresponding to the adaptive filtering algorithm is kept unchanged.
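The weight-update gate described above can be sketched as follows; the threshold values and the function name are hypothetical:

```python
def should_update_weights(d_pitch, d_az, pitch_thresh=15.0, az_thresh=15.0):
    """Return True when the adaptive filter weights should be updated:
    either angular deviation (degrees) exceeds its threshold.
    Otherwise the current weights are kept unchanged."""
    return d_pitch > pitch_thresh or d_az > az_thresh
```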

In a specific implementation process, a weight-update fitting function of the filter can be obtained by fitting the historical weights produced by the update procedure above. Before the filter is used, its weights are set from this fitting function until the number of updates reaches a preset count m. At the m-th update, the actual weight computed by the update procedure is compared with the weight predicted by the fitting function. If the error between the two is within a preset range, the fitting function continues to set the filter weights from the m-th to the 2m-th update. Otherwise, the weights are again set by the update procedure (updating whenever the pitch angle deviation exceeds the angle deviation threshold of the pitch angle or the azimuth angle deviation exceeds the angle deviation threshold of the azimuth angle) for n further updates, after which the fitting function is corrected with those n computed values and then used again to set the weights. In this way, repeatedly computing the pitch angle deviation and the azimuth angle deviation between the angles in the time-frequency domain mixed voice signal and the target direction can be avoided, improving efficiency and accuracy.
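One way to realize the weight-fitting idea, assuming a simple polynomial fit over a single weight's history (the polynomial degree and function name are assumptions; a real filter would fit each tap's history separately):

```python
import numpy as np

def fit_weight_schedule(weight_history, degree=2):
    """Fit a polynomial to the history of one filter weight so that future
    weights can be pre-set from the fit instead of re-running the full
    deviation-gated update every time.
    Returns a callable mapping update index -> predicted weight."""
    t = np.arange(len(weight_history))
    coeffs = np.polyfit(t, weight_history, degree)
    return np.poly1d(coeffs)
```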

It should be noted that the method of the embodiment of the present invention may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In the case of such a distributed scenario, one device of the multiple devices may only perform one or more steps of the method according to the embodiment of the present invention, and the multiple devices interact with each other to complete the method.

Fig. 3 is a schematic structural diagram of an embodiment of the speech separation apparatus of the present invention, and as shown in fig. 3, the speech separation apparatus of this embodiment may include a first transformation module 20, a separation module 21, a second transformation module 22, an orientation estimation module 23, a deviation estimation module 24, and a determination module 25.

The first transformation module 20 is used for performing Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;

a separation module 21, configured to separate the time-frequency domain mixed speech signal to obtain a first channel separation signal and a second channel separation signal;

a second transform module 22, configured to perform short-time inverse fourier transform on the separated signal of the first channel and the separated signal of the second channel, respectively, to obtain a time-domain signal of the first channel and a time-domain signal of the second channel;

the direction estimation module 23 is configured to select, according to the sequence of signal energy from high to low, two-dimensional direction-of-arrival estimation corresponding to the time domain signal of the first channel with the specified number of frames, and solve a mode to obtain direction estimation information of the first channel, and select two-dimensional direction-of-arrival estimation information corresponding to the time domain signal of the second channel with the specified number of frames, and solve a mode to obtain direction estimation of the second channel;

a deviation estimation module 24, configured to calculate a pitch angle deviation of the first channel and an azimuth angle deviation of the first channel according to the azimuth estimation information of the first channel, and calculate a pitch angle deviation of the second channel and an azimuth angle deviation of the second channel according to the azimuth estimation information of the second channel;

a determining module 25, configured to determine that the first channel is the voice information of the first target sound source and the second channel is the voice information of the second target sound source if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel; and if the pitch angle deviation of the first channel is greater than the pitch angle deviation of the second channel and the azimuth angle deviation of the first channel is greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source.

In a specific implementation process, the separation module 21 is further configured to:

processing the separation signal of the first channel and the separation signal of the second channel through a self-adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;

and comparing the energy of the preliminary noise reduction signal of the first channel with the energy of the time-domain mixed voice signal, and processing the higher-energy voice signal together with the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel. The adaptive filtering algorithm is any one of the least mean square (LMS) algorithm, the normalized LMS (NLMS) algorithm, and the recursive least squares (RLS) algorithm.

Correspondingly, the second transform module 22 is further configured to perform short-time inverse fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel, respectively, to obtain a time domain signal of the first channel and a time domain signal of the second channel.

In a specific implementation process, the separation module 21 is further configured to: perform single-channel noise reduction and background noise elimination on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel respectively, to obtain a final noise reduction signal of the first channel and a final noise reduction signal of the second channel;

correspondingly, the second transform module 22 is further configured to perform short-time inverse fourier transform on the final noise reduction signal of the first channel and the final noise reduction signal of the second channel, respectively, to obtain a time domain signal of the first channel and a time domain signal of the second channel.

In a specific implementation process, the deviation estimation module 24 is further configured to update the weight of the filter corresponding to the adaptive filtering algorithm when the pitch angle deviation is greater than the angular deviation threshold of the pitch angle, or the azimuth angle deviation is greater than the angular deviation threshold of the azimuth angle. And when the pitch angle deviation is smaller than or equal to the angle deviation threshold value and the azimuth angle deviation is smaller than or equal to the angle deviation threshold value of the azimuth angle, maintaining the weight value of the filter corresponding to the adaptive filtering algorithm unchanged.

The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and specific implementation schemes thereof may refer to the method described in the foregoing embodiment and relevant descriptions in the method embodiment, and have beneficial effects of the corresponding method embodiment, which are not described herein again.

Fig. 4 is a schematic structural diagram of the voice separating device of the present invention, and as shown in fig. 4, the voice separating device of this embodiment may include: a processor 1010 and a memory 1020. Those skilled in the art will appreciate that the device may also include an input/output interface 1030, a communication interface 1040, and a bus 1050, wherein the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 are communicatively coupled to each other within the device via the bus 1050.

The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.

The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.

The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component in the device (not shown in the drawings) or may be external to the device to provide the corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.

The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).

Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.

It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.

In one specific implementation, the processor 1010 is configured to execute the program for speech separation stored in the memory 1020 to implement the speech separation method of the above-described embodiment.

The present invention also provides a storage medium storing one or more programs that when executed implement the voice separation method of the above embodiments.

Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, also features in the above embodiments or in different embodiments may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.

In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.

While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.

While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
