Speech sound source localization method using a microphone array in interfering and highly reverberant environments
Reading note: this technology, "Speech sound source localization method using a microphone array in interfering and highly reverberant environments", was designed by 王浩, 卢晶, 刘晓峻, 狄敏 and 邵治英 on 2019-10-21. The invention discloses a speech sound source localization method using a microphone array in interfering and highly reverberant environments, comprising the following steps: (1) set the parameters; (2) apply a short-time Fourier transform to obtain the time-frequency domain signal; (3) for each time-frequency point of the time-frequency domain signal, compute the logarithmic cross-spectral amplitude mean to obtain the "energy" envelope; (4) for each time-frequency point, compute the "rate of change" of the "energy" envelope; (5) use the characteristics of transient noise to detect and locate it; (6) select the time-frequency points that correspond to the direct sound, ignoring the transient-noise parts; (7) apply the weighted SRP-PHAT method to the selected time-frequency points to obtain the localization result. The method still yields accurate and robust results in environments with high reverberation and interference.
1. A speech sound source localization method using a microphone array in an interfering and highly reverberant environment, characterized in that it localizes accurately under high reverberation and effectively prevents impact noise from degrading the localization, comprising the following steps:
Step 1, direct sound selection
Step 1.1, one sound source is arranged in a room and a microphone array formed by I microphones collects the signals; the cross-spectral amplitude mean of the collected signals is expressed as:
(1)
and expressed logarithmically as:
(2)
where x_i(k, l) denotes the signal of the i-th microphone in frequency band k at frame l;
Step 1.2, the "rate of change" of the logarithmic cross-spectral amplitude mean is obtained from the power envelope of the signal in frequency:
(3)
Step 1.3, the time-frequency points whose rate of change calculated by formula (3) is greater than a preset rate-of-change threshold K are selected; they are considered to pass the direct sound selection and form the direct sound candidate set:
(4)
where formula (4) defines the direct sound candidate set;
Step 2, determining and eliminating transient noise
Step 2.1, transient noise is determined according to the following two criteria:
1) calculate the "energy" of each frame:
(5)
2) judge whether:
(6)
(7)
Step 2.2, if both criteria of step 2.1 are met, the part corresponding to frame n_v is determined to be transient noise, the "local" range centered on n_v is ignored in the direct sound selection, and formula (4) is rewritten as:
(8)
where
(9)
Step 3, localizing the speech sound source using the selected direct sound
The selected time-frequency points are localized with a weighted SRP-PHAT method, expressed as:
(10)
where
(11)
(12)
2. The method of claim 1, wherein the microphone array is a line array or a ring array.
3. The method of claim 2, wherein, if the microphone array is a line array, g(k, θ) is expressed as:
(13)
Technical Field
The invention relates to a speech sound source localization method using a microphone array in interfering and highly reverberant environments, and belongs to the technical field of speech signal processing.
Background
The purpose of speech sound source localization (SSL) is to estimate the direction of arrival (DOA) of the speech signal at the microphone array. Sound source localization, or DOA estimation, of speech signals with a microphone array is an important and active topic in acoustic signal processing. It plays an important role in sound capture in many application scenarios, such as human-machine voice interaction, camera tracking and intelligent monitoring on smart devices. The difficulty is that the speech signal is a broadband, non-stationary random process, and that background noise, reverberation and other interfering sources are present.
Classical sound source localization methods can be divided into TDOA (Time Delay Of Arrival), SRP (Steered Response Power) and spatial spectrum methods. Many application scenarios contain both reverberation and noise interference, and most current methods cannot maintain high accuracy and robustness in such complex environments.
Disclosure of Invention
Purpose of the invention: to overcome the shortcomings of the prior art, the invention provides a speech sound source localization method using a microphone array in interfering and highly reverberant environments, which still yields accurate and robust results in environments with high reverberation and interference.
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the technical scheme that:
A speech sound source localization method using a microphone array in interfering and highly reverberant environments comprises the following steps:
Step 1, direct sound selection
Step 1.1, one sound source is arranged in a room and a microphone array formed by I microphones collects the signals. The cross-spectral amplitude mean of the acquired signals is expressed as:
(1)
and expressed logarithmically as:
(2)
where x_i(k, l) denotes the signal of the i-th microphone in frequency band k at frame l; formula (1) gives the cross-spectral amplitude mean of the acquired signals, computed over a given number of frames; ξ is a regularization term that reduces the effect of background noise; |·| denotes the absolute value of a complex number; * denotes complex conjugation; and P(n, k) is the power envelope of the signal in frequency.
Step 1.2, the rate of change of the logarithmic cross-spectral amplitude mean is obtained from the power envelope of the signal in frequency:
(3)
where formula (3) gives the rate of change of the logarithmic cross-spectral amplitude mean, computed over a range of frames, and P(n − t, k) is the power envelope at frequency k in frame n − t, that is, t frames earlier than P(n, k).
Step 1.3, the time-frequency points whose rate of change calculated by formula (3) is greater than a preset rate-of-change threshold K are selected; they are considered to pass the direct sound selection (DPD) test and form the direct sound candidate set:
(4)
where formula (4) defines the direct sound candidate set, whose elements are the time-frequency points corresponding to frame n and frequency band k.
Step 2, determining and eliminating transient noise
Step 2.1, transient noise is determined according to the following two criteria:
1) calculate the "energy" of each frame and find the frames at which the "energy" has a local maximum:
(5)
2) judge whether:
(6)
(7)
where n_v denotes a frame at which the per-frame "energy" has a local maximum; dn is the range over which the "energy rate of change" is calculated; Δn is the "local" range; and V_1 and V_2 are the thresholds for the rise and the fall of the "energy", respectively.
Step 2.2, if both criteria of step 2.1 are met, the part corresponding to frame n_v is determined to be transient noise, the "local" range centered on n_v is ignored in the direct sound selection, and formula (4) is rewritten as:
(8)
where
(9)
Step 3, localizing the speech sound source using the selected direct sound
The selected time-frequency points are localized with a weighted SRP-PHAT method, expressed as:
(10)
where
(11)
(12)
In these formulas, the quantity to be estimated is the direction of arrival of the sound wave; θ is a candidate value of the direction of arrival, that is, the independent variable; argmax returns the value of the independent variable at which the expression is maximal; W(n, k) is 1 when (n, k) is in the set Π and 0 otherwise; the remaining symbols denote the cross-spectrum of the signal and the frequency-domain signal; the superscripts "H" and "T" denote the complex conjugate transpose and the transpose, respectively; and g(k, θ) is the steering vector for direction θ.
Preferably: the microphone array may be any suitable array; typically a line array or a ring array is used.
Preferably: if the microphone array is a line array, g(k, θ) is expressed as:
(13)
where e is the base of the natural logarithm, j is the imaginary unit, c is the speed of sound, d is the spacing vector of the microphone array, and ω_k is the angular frequency corresponding to frequency band k.
Compared with the prior art, the invention has the following beneficial effects:
The speech sound source localization method still yields accurate and robust results in environments with high reverberation and interference.
Drawings
FIG. 1 is a comparison of RMSE for different methods in simulation.
Detailed Description
The present invention is further illustrated by the following description in conjunction with the accompanying drawings and the specific embodiments, it is to be understood that these examples are given solely for the purpose of illustration and are not intended as a definition of the limits of the invention, since various equivalent modifications will occur to those skilled in the art upon reading the present invention and fall within the limits of the appended claims.
A speech sound source localization method using a microphone array, suited to interfering and highly reverberant environments and requiring less computation than comparable algorithms, comprises the following steps:
1. Direct sound selection (DPD)
One sound source is arranged in a room and I microphones collect the signals. A line array, a ring array or another array can be used; the invention is not limited to a particular array shape. With x_i(k, l) denoting the signal of the i-th microphone in frequency band k at frame l, the cross-spectral amplitude mean of the acquired signals can be expressed as:
(1)
and expressed logarithmically as:
(2)
where formula (1) gives the cross-spectral amplitude mean of the acquired signals, computed over a given number of frames; ξ is a regularization term that reduces the effect of background noise; |·| denotes the absolute value of a complex number; * denotes complex conjugation; and P(n, k) is the power envelope of the signal in frequency.
Inspired by the precedence effect (Litovsky R Y, Colburn H S, Yost W A, et al. The precedence effect[J]. The Journal of the Acoustical Society of America, 1999, 106(4): 1633-1654.), the time-frequency points at the onset of speech can be considered to consist mainly of direct sound, which carries accurate source location information. The power envelope increases rapidly in this portion, so we define the rate of change of the logarithmic cross-spectral amplitude mean as:
(3)
where formula (3) gives the rate of change of the logarithmic cross-spectral amplitude mean, computed over a range of frames, and P(n − t, k) is the power envelope at frequency k in frame n − t, that is, t frames earlier than P(n, k).
The time-frequency points whose rate of change calculated by formula (3) is greater than K (a preset threshold) are selected; they are considered to pass the direct sound selection (DPD) test and form the direct sound candidate set:
(4)
where formula (4) defines the direct sound candidate set, whose elements are the time-frequency points corresponding to frame n and frequency band k. Clearly, a shorter frame shift lets more points be selected, which helps improve the accuracy of the DOA estimate.
2. Determination and elimination of transient noise
A real scenario always contains some environmental interference. Common interfering noise falls into three categories: stationary noise, such as fan noise and electrical noise; transient noise, such as door slams, tapping and keyboard sounds; and other non-stationary noise, such as music and television sound. Stationary noise can be neglected because its acoustic power does not change rapidly. The average sound power of the target speech is usually larger than that of the environmental interference, so under general conditions the direct sound is expected to be the main component of time-frequency points whose power rises quickly. Transient noise, however, has the largest effect on the direct sound decision: because it also has a high rate of power change in the time-frequency domain, it greatly increases the misjudgment rate. Transient noise is characterized by high power and short duration, and can be identified with the following two criteria.
(1) Calculate the "energy" of each frame and find the frames at which the "energy" has a local maximum:
(5)
(2) Judge whether:
(6)
(7)
where n_v denotes a frame at which the per-frame "energy" has a local maximum; dn is the range over which the "energy rate of change" is calculated; Δn is the "local" range; and V_1 and V_2 are the thresholds for the rise and the fall of the "energy", respectively.
If both criteria are met, the part corresponding to frame n_v is determined to be transient noise and the "local" range centered on n_v is ignored in the direct sound selection; formula (4) can then be rewritten as:
(8)
where
(9)
3. Speech sound source localization using the selected direct sound
The selected time-frequency points can be localized with a common localization method, the SRP-PHAT method. Because the time-frequency points have to be screened, a weighted SRP-PHAT method is adopted here, expressed as:
(10)
where
(11)
(12)
In these formulas, the quantity to be estimated is the direction of arrival of the sound wave; θ is a candidate value of the direction of arrival, that is, the independent variable; argmax returns the value of the independent variable at which the expression is maximal; W(n, k) is 1 when (n, k) is in the set Π and 0 otherwise; the remaining symbols denote the cross-spectrum of the signal and the frequency-domain signal; the superscripts "H" and "T" denote the complex conjugate transpose and the transpose, respectively; and g(k, θ) is the steering vector for direction θ. If the array is a line array, the steering vector can be expressed as:
(13)
where d is the spacing vector of the microphone array and ω_k is the angular frequency corresponding to frequency band k. For other types of array, the steering vector can be given according to the specific shape.
At this point, the speech sound source localization result is obtained.
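As a rough illustration of how a steering vector might be built for an arbitrary array shape, the sketch below assumes a far-field plane wave and microphone coordinates in a 2-D plane. The function name `steering_vector`, the sign convention, the 5 cm ring radius and the 343 m/s speed of sound are assumptions made for illustration; they are not taken from formula (13).

```python
import numpy as np

def steering_vector(mic_xy, theta, omega, c=343.0):
    """Far-field steering vector for microphones at mic_xy (shape (I, 2), metres).

    theta is the candidate direction of arrival in radians and omega the angular
    frequency in rad/s. Sign convention (an assumption): a microphone displaced
    toward the source receives the wavefront earlier, i.e. with a phase advance.
    """
    u = np.array([np.cos(theta), np.sin(theta)])   # unit vector toward the source
    advance = mic_xy @ u / c                       # time advance per microphone, s
    return np.exp(1j * omega * advance)

# Line array: 6 microphones spaced 3.5 cm apart along the x axis.
line = np.column_stack([0.035 * np.arange(6), np.zeros(6)])
# Ring array: 6 microphones on a circle (the 5 cm radius is a hypothetical value).
phi = 2 * np.pi * np.arange(6) / 6
ring = 0.05 * np.column_stack([np.cos(phi), np.sin(phi)])

omega_k = 2 * np.pi * 1000.0                        # 1 kHz band
g_line = steering_vector(line, np.pi / 2, omega_k)  # broadside: all phases equal
g_ring = steering_vector(ring, 0.0, omega_k)
```

Writing the steering vector in terms of microphone coordinates rather than a fixed spacing means the same function covers line, ring and other layouts; only `mic_xy` changes.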
Simulation example
1. Simulated hybrid speech generation
The invention is illustrated by localizing simulated signals. In the simulation, the Image model generates room impulse responses, which are convolved with clean speech to produce speech in a reverberant environment; room impulse responses generated by the Image model for a different source position, with the same room parameters, are convolved with clean interference, and the two results are superposed to obtain the mixed signal. In the Image-model simulation, the spacing between microphone array elements is 3.5 cm and the room size is 7 × 5 × 3 m³. The target sound source is moved in a full circle around the array at 2 m from the array center, and the angle between the interfering source and the target source, measured from the array center, is not less than 120°. Two room reverberation times are used, 0.4 s and 1.0 s. Each speech sample is 2 s long. For each of the reverberation times 0.4 s and 1.0 s, 2300 mixed utterances are generated. The sampling frequency of the signals is 16 kHz.
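The convolve-and-superpose step described above can be sketched as follows. This is a minimal illustration, not the simulation code behind the reported results: `toy_rir` is a hypothetical stand-in (exponentially decaying noise) for an Image-model impulse response, the signals are random placeholders rather than speech, and they are shortened to 0.5 s to keep the example fast.

```python
import numpy as np

def reverberant_mixture(speech, interference, rirs_speech, rirs_interf):
    """Convolve each source with its per-microphone impulse response and add them.

    speech, interference: 1-D arrays of equal length.
    rirs_speech, rirs_interf: arrays of shape (mics, rir_len).
    Returns an array of shape (mics, len(speech) + rir_len - 1).
    """
    mics = rirs_speech.shape[0]
    return np.stack([
        np.convolve(speech, rirs_speech[m]) + np.convolve(interference, rirs_interf[m])
        for m in range(mics)
    ])

def toy_rir(rt60, fs, rng, length_s=0.1):
    """Hypothetical stand-in for an Image-model RIR: noise decaying 60 dB over rt60."""
    t = np.arange(int(round(length_s * fs))) / fs
    return rng.standard_normal(t.size) * 10 ** (-3 * t / rt60)

fs = 16000
rng = np.random.default_rng(0)
speech = rng.standard_normal(fs // 2)          # placeholder for a clean speech sample
interf = 0.3 * rng.standard_normal(fs // 2)    # placeholder for a clean interference
rirs_s = np.stack([toy_rir(0.4, fs, rng) for _ in range(6)])
rirs_i = np.stack([toy_rir(0.4, fs, rng) for _ in range(6)])
mix = reverberant_mixture(speech, interf, rirs_s, rirs_i)
```

In a real simulation the two sets of impulse responses would come from an Image-model implementation evaluated at the target and interference positions in the same room.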
2. Method process flow
a) Parameter setting
The parameters of the proposed method are first given in table 1. It is noted that the proposed method does not require adjustment of parameters in different environments, and that the parameters given can be applied in various environments.
TABLE 1 respective parameters
b) Short time Fourier transform
Perform a discrete short-time Fourier transform on the time-domain signals acquired by the microphones to obtain the time-frequency domain signals. The window function is a Hanning window, the window length is 32 ms, and the window shift is 0.5 ms.
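A minimal sketch of this transform with NumPy is given below, using the stated Hanning window, 32 ms window length and 0.5 ms shift at a 16 kHz sampling rate (512 and 8 samples). The function name `stft` and the choice to drop the incomplete final frame are assumptions made for illustration.

```python
import numpy as np

def stft(x, fs=16000, win_ms=32.0, hop_ms=0.5):
    """Return the STFT of a 1-D signal as a complex array of shape (frames, bins)."""
    win_len = int(round(fs * win_ms / 1000))   # 512 samples at 16 kHz
    hop = int(round(fs * hop_ms / 1000))       # 8 samples at 16 kHz
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop   # incomplete last frame is dropped
    frames = np.stack([x[m * hop : m * hop + win_len] * window
                       for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

fs = 16000
t = np.arange(fs // 4) / fs                    # 0.25 s test tone at 1 kHz
X = stft(np.sin(2 * np.pi * 1000 * t), fs)
peak_bin = int(np.argmax(np.abs(X[0])))        # bin spacing is fs / 512 = 31.25 Hz
```

With a 0.5 ms shift, consecutive frames overlap by more than 98 %, which yields many frames per second at the cost of more computation per second of audio.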
c) Computing an "energy" envelope
For each time-frequency point of the time-frequency domain signal, calculate the logarithmic cross-spectral amplitude mean using formulas (1) and (2).
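Formulas (1) and (2) are not reproduced in this text, so the sketch below is one plausible reading, not the patented expression. It assumes that the cross-spectrum of every microphone pair is averaged over the current and the L − 1 preceding frames, that the magnitudes are averaged over all pairs, and that the logarithm is taken after adding ξ. The function name and the default values of `L` and `xi` are hypothetical.

```python
import numpy as np
from itertools import combinations

def log_cross_spectrum_envelope(X, L=4, xi=1e-6):
    """X: complex STFT of shape (mics, frames, bins). Returns P of shape (frames, bins)."""
    n_mics, n_frames, n_bins = X.shape
    pairs = list(combinations(range(n_mics), 2))
    acc = np.zeros((n_frames, n_bins))
    for i, j in pairs:
        cross = X[i] * np.conj(X[j])                 # x_i(k, l) times conj of x_j(k, l)
        # Running mean over the current and the L - 1 preceding frames
        # (frames before the start of the signal are treated as zeros).
        padded = np.concatenate([np.zeros((L - 1, n_bins), dtype=complex), cross])
        csum = np.cumsum(
            np.concatenate([np.zeros((1, n_bins), dtype=complex), padded]), axis=0)
        smoothed = (csum[L:] - csum[:-L]) / L
        acc += np.abs(smoothed)
    return np.log(acc / len(pairs) + xi)             # xi keeps the log finite in silence

X = np.ones((3, 10, 5), dtype=complex)               # toy input: fully coherent channels
P = log_cross_spectrum_envelope(X, L=4, xi=1e-6)
```

Averaging the complex cross-spectrum before taking the magnitude favors components that are coherent across frames, while ξ bounds the envelope from below where the signal is close to silent.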
d) Estimating the "energy Change Rate"
For each time-frequency point of the time-frequency domain signal, calculate the "rate of change" of the "energy" envelope using formula (3).
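Formula (3) is not reproduced in this text. As a hedged sketch, the code below assumes the rate of change is the mean difference between P(n, k) and the envelope in each of the T preceding frames, P(n − t, k) for t = 1..T. The function name, the value of `T` and this exact averaging are assumptions.

```python
import numpy as np

def rate_of_change(P, T=3):
    """P: log envelope of shape (frames, bins). Returns R of the same shape.

    R(n, k) is the mean of P(n, k) - P(n - t, k) over t = 1..T (assumed form).
    """
    R = np.zeros_like(P, dtype=float)
    for t in range(1, T + 1):
        R[t:] += P[t:] - P[:-t]
    R[:T] = 0.0            # frames without a full history of T frames are left at zero
    return R / T

# Toy envelope that rises by 1 per frame in every band.
P = np.arange(10.0)[:, None] * np.ones((1, 2))
R = rate_of_change(P, T=3)
```

A sharp onset gives a large positive value for the first few frames after it, whereas a steady or decaying envelope gives values near zero or below.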
e) Determining and locating transient noise
For each frame of the time-frequency domain signal:
1. Calculate the "energy" of each frame and use formula (5) to find the frames at which the "energy" has a local maximum.
2. For each frame at a local "energy" maximum, use formulas (6) and (7) to judge how quickly the energy appears and dissipates; if both rates exceed their thresholds, the frame is taken to correspond to transient noise.
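Formulas (5) to (7) are not reproduced in this text, so the following is only a sketch of the two criteria under stated assumptions: the per-frame "energy" is taken as the sum of the log envelope over frequency, a rise is measured against the frame dn earlier and a fall against the frame dn later, and all parameter values are hypothetical.

```python
import numpy as np

def transient_frames(P, dn=2, delta_n=3, V1=1.0, V2=1.0):
    """Return a boolean mask over frames; True marks frames treated as transient noise.

    P: log envelope of shape (frames, bins). dn must be at least 1.
    """
    E = P.sum(axis=1)                              # assumed per-frame "energy"
    n_frames = len(E)
    mask = np.zeros(n_frames, dtype=bool)
    for n in range(dn, n_frames - dn):
        is_peak = E[n] > E[n - 1] and E[n] >= E[n + 1]
        rises_fast = E[n] - E[n - dn] > V1         # criterion on the rise
        falls_fast = E[n] - E[n + dn] > V2         # criterion on the fall
        if is_peak and rises_fast and falls_fast:
            lo, hi = max(0, n - delta_n), min(n_frames, n + delta_n + 1)
            mask[lo:hi] = True                     # "local" range centered on n_v
    return mask

click = np.zeros((20, 4)); click[10] = 5.0         # one-frame burst, like a tap
onset = np.zeros((20, 4)); onset[5:] = 5.0         # sustained onset, like speech
mask_click = transient_frames(click)
mask_onset = transient_frames(onset)
```

The burst is flagged because its energy both rises and falls quickly, while the sustained onset rises quickly but does not fall, so it is kept for direct sound selection.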
f) Selecting time-frequency points corresponding to the direct sound and neglecting transient noise part
For each time-frequency point of the time-frequency domain signal, use formulas (8) and (9) to select the points whose "energy" envelope rate of change is greater than K, excluding the transient-noise parts, as the direct sound screening result, and record the result as the set Π.
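The screening can be pictured as a binary mask over the time-frequency plane. The sketch below assumes the rate of change and a per-frame transient flag are already available as arrays; the function name and the toy values are hypothetical.

```python
import numpy as np

def select_direct_sound(R, K, transient_mask=None):
    """R: rate of change, shape (frames, bins). Returns W with 1.0 at selected points.

    A point (n, k) is selected when R(n, k) > K and frame n is not flagged as
    transient noise; W then plays the role of the weight W(n, k).
    """
    W = (R > K).astype(float)
    if transient_mask is not None:
        W[np.asarray(transient_mask, dtype=bool), :] = 0.0
    return W

R = np.array([[0.1, 2.0],
              [3.0, 0.5],
              [4.0, 4.0]])
W = select_direct_sound(R, K=1.0, transient_mask=[False, False, True])
```

Here the third frame is discarded entirely even though both of its bands exceed the threshold, because it lies inside a flagged transient range.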
g) Applying a weighted SRP-PHAT method to the selected time-frequency points to obtain a positioning result
Over the time-frequency points of the time-frequency domain signal, estimate the final localization result using formula (10). Note that W(n, k) is 1 when the time-frequency point (n, k) is in the set Π and 0 otherwise.
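Formulas (10) to (12) are not reproduced in this text. The sketch below shows a generic weighted SRP-PHAT search under stated assumptions: a far-field source, a line array along one axis, PHAT weighting implemented as per-channel phase normalization, and a steered power summed over the points where W(n, k) is 1. The function name, the delay convention and the synthetic test signal are all illustrative.

```python
import numpy as np

def weighted_srp_phat(X, W, freqs, mic_pos, thetas, c=343.0):
    """X: (mics, frames, bins) STFT; W: (frames, bins) weights; freqs: (bins,) in Hz;
    mic_pos: (mics,) positions along the array axis in metres; thetas: candidate
    angles in radians. Returns the best angle and the steered power per angle."""
    Xn = X / np.maximum(np.abs(X), 1e-12)             # PHAT: keep the phase only
    omega = 2 * np.pi * freqs
    power = np.zeros(len(thetas))
    for idx, theta in enumerate(thetas):
        delays = mic_pos * np.cos(theta) / c          # assumed far-field delays
        g = np.exp(-1j * omega[None, :] * delays[:, None])     # (mics, bins)
        beam = np.einsum('mk,mnk->nk', np.conj(g), Xn)         # g^H x per point
        power[idx] = np.sum(W * np.abs(beam) ** 2)
    return thetas[int(np.argmax(power))], power

# Synthetic check: a plane wave arriving from 60 degrees on a 6-microphone line array.
rng = np.random.default_rng(0)
fs, n_bins, n_frames, c = 16000, 257, 20, 343.0
freqs = np.linspace(0, fs / 2, n_bins)
mic_pos = 0.035 * np.arange(6)
s = rng.standard_normal((n_frames, n_bins)) + 1j * rng.standard_normal((n_frames, n_bins))
tau = mic_pos * np.cos(np.deg2rad(60.0)) / c
X = np.exp(-1j * 2 * np.pi * freqs[None, None, :] * tau[:, None, None]) * s[None]
W = np.ones((n_frames, n_bins))
thetas = np.deg2rad(np.arange(0, 181))
est, power = weighted_srp_phat(X, W, freqs, mic_pos, thetas)
```

Because W enters only as a multiplicative mask, discarding a time-frequency point costs nothing extra in the search; a line array cannot distinguish θ from −θ, which is why the candidate grid is limited to 0 to 180 degrees.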
To illustrate the advantages of the algorithm, the proposed method is compared with conventional methods in simulation and experiment.
In the comparisons, DPD-D-FR (PHAT) is the method proposed in the present invention; DPD-D-FR (MUSIC) is the proposed method with the weighted SRP-PHAT localization of the third step replaced by a weighted MUSIC method; and DPD-MUSIC is the DPD-test method based on eigenvalue decomposition of a matrix proposed by Rafaely et al. (Rafaely B, Kolossa D. Speaker localization in reverberant rooms based on direct path dominance test statistics[C]//Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017: 6120-.).
In the simulation, a 6-channel ring array is used; 50 speech samples and 46 samples of common indoor noise and non-stationary interference are mixed in pairs and recorded. A 6-channel ring array is easy to mount on top of a smart device. The test room measures 7 × 5 × 3 m³ and different reverberation conditions are included:
Table 2. Comparison of P_s and R_s for different methods in the simulation
In the experiment, we tested in three rooms:
Table 3. Comparison of RMSE (°) for different methods in the experiment
Simulation and experiment show that the proposed method is more accurate and more robust than most other common methods. The DPD-D-FR (PHAT) method is more stable under high reverberation: its maximum RMSE in the experiment without interference is 1.2°, and interference, when present, affects the result less, so robustness is also higher. The method has some advantage over the DPD-MUSIC method, and its computational demand is far smaller than that of direct sound decision methods based on matrix space decomposition.
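For reference, an RMSE in degrees over a set of DOA estimates can be computed as in the sketch below. The wrap-around of angular differences into the range −180° to 180° is an assumption that matters for ring arrays covering a full circle; the function name and the sample values are hypothetical and are not results from the tables.

```python
import numpy as np

def doa_rmse_deg(estimates, truths):
    """RMSE in degrees between estimated and true directions of arrival."""
    diff = np.asarray(estimates, dtype=float) - np.asarray(truths, dtype=float)
    diff = (diff + 180.0) % 360.0 - 180.0      # wrap errors into [-180, 180)
    return float(np.sqrt(np.mean(diff ** 2)))

# Errors are -2 and -12 degrees (350 versus 2 wraps across 0 degrees).
rmse = doa_rmse_deg([10.0, 350.0], [12.0, 2.0])
```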
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.