Voice sound source positioning method using microphone array under interference and high reverberation environment

Document No.: 1566592; publication date: 2020-01-24

Reading note: this technology, "Voice sound source positioning method using microphone array under interference and high reverberation environment" (干扰及高混响环境下使用传声器阵列的语音声源定位方法), was designed by 王浩, 卢晶, 刘晓峻, 狄敏 and 邵治英 on 2019-10-21. Abstract: The invention discloses a speech sound source localization method using a microphone array in an interference and high-reverberation environment, comprising the following steps: (1) set the parameters; (2) apply a short-time Fourier transform to obtain the time-frequency-domain signal; (3) for each time-frequency point of the time-frequency-domain signal, calculate the logarithmized cross-spectral amplitude mean to obtain the "energy" envelope; (4) for each time-frequency point, calculate the "rate of change" of the "energy" envelope; (5) use the characteristics of transient noise to detect and locate transient noise; (6) select the time-frequency points corresponding to the direct sound, ignoring the transient-noise part; (7) apply the weighted SRP-PHAT method to the selected time-frequency points to obtain the localization result. The method still yields accurate and robust results in environments with strong reverberation and interference.

1. A speech sound source localization method using a microphone array in an interference and high-reverberation environment, characterized in that it localizes accurately under high reverberation and effectively prevents impulsive noise from degrading the localization, comprising the following steps:

Step 1, direct sound selection

Step 1.1, one sound source is placed in a room and a microphone array formed by I microphones collects the signals; the cross-spectral amplitude mean of the collected signals is expressed as:

Figure 824510DEST_PATH_IMAGE001

and expressed logarithmically as:

(2)

where x_i(k,l) denotes the signal of the i-th microphone in the k-th frequency band and the l-th frame,

Figure 313053DEST_PATH_IMAGE003

Step 1.2, the "rate of change" of the logarithmized cross-spectral amplitude mean is obtained from the power envelope of the signal in frequency:

Figure 291297DEST_PATH_IMAGE006

where

Figure 769158DEST_PATH_IMAGE007

Step 1.3, the time-frequency points whose rate of change calculated by formula (3) exceeds the preset rate-of-change threshold K are selected and regarded as passing the direct sound selection test, forming the direct sound candidate set:

(4)

where [symbol missing] denotes the direct sound candidate set,

Figure 968747DEST_PATH_IMAGE011

Step 2, detecting and eliminating transient noise

Step 2.1, transient noise is detected according to the following two criteria:

1) Calculate the "energy" of each frame

Figure 210241DEST_PATH_IMAGE012

Figure 801539DEST_PATH_IMAGE013

2) Check whether

(6)

Figure 193654DEST_PATH_IMAGE015

where [symbol missing] denotes the "energy" of each frame,

Figure 264302DEST_PATH_IMAGE017

Step 2.2, if both criteria of step 2.1 are met, the part corresponding to n_v is judged to be transient noise, the "local" range centered on n_v is ignored in the direct sound selection, and formula (4) is rewritten as:

Figure 540879DEST_PATH_IMAGE019

where

Figure 366753DEST_PATH_IMAGE020

Step 3, localizing the speech sound source using the selected direct sound

The selected time-frequency points are localized with the weighted SRP-PHAT method, expressed as:

Figure 826815DEST_PATH_IMAGE021

where

Figure 043033DEST_PATH_IMAGE022

(12)

where

Figure 137077DEST_PATH_IMAGE024

2. The method of claim 1, wherein the microphone array is a line array or a ring array.

3. The method of claim 2, wherein, if the microphone array is a line array, g(k,θ) is expressed as:

Figure 407018DEST_PATH_IMAGE027

where

Figure 565074DEST_PATH_IMAGE028

Technical Field

The invention relates to a voice sound source positioning method using a microphone array in an interference and high reverberation environment, belonging to the technical field of voice signal processing.

Background

The purpose of speech sound source localization (SSL) is to estimate the direction of arrival (DOA) of the speech signal at the microphone array. Sound source localization, or DOA estimation, of speech signals with a microphone array is an important and active topic in acoustic signal processing. It plays a key role in sound capture in many application scenarios, such as human-machine voice interaction, camera tracking and intelligent monitoring on smart devices. The difficulty is that the speech signal is a broadband, non-stationary random process, accompanied by background noise, reverberation and other interfering sources.

Classical sound source localization methods can be divided into TDOA (Time Delay Of Arrival), SRP (Steered Response Power) and spatial-spectrum methods. Many application scenarios contain both reverberation and noise interference, and most current methods cannot maintain high accuracy and robustness in such complex environments.

Disclosure of Invention

Purpose of the invention: to overcome the shortcomings of the prior art, the invention provides a speech sound source localization method using a microphone array in an interference and high-reverberation environment, so that accurate and robust results can still be obtained in environments with strong reverberation and interference.

Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme:

A speech sound source localization method using a microphone array in an interference and high-reverberation environment, comprising the following steps:

Step 1, direct sound selection

Step 1.1, one sound source is placed in a room and a microphone array formed by I microphones collects the signals; the cross-spectral amplitude mean of the acquired signals is expressed as:

Figure 980738DEST_PATH_IMAGE001

(1)

and expressed logarithmically as:

Figure 299724DEST_PATH_IMAGE002

(2)

where x_i(k,l) denotes the signal of the i-th microphone in the k-th frequency band and the l-th frame,

Figure 546029DEST_PATH_IMAGE003

represents the cross-spectral amplitude mean of the acquired signal,

Figure 557847DEST_PATH_IMAGE004

denotes the number of frames, ξ is a regularization term that reduces the effect of background noise, |·| denotes the absolute value of a complex number, * denotes complex conjugation, and P(n,k) is the power envelope of the signal in frequency;

Step 1.2, the rate of change of the logarithmized cross-spectral amplitude mean is obtained from the power envelope of the signal in frequency:

Figure 317173DEST_PATH_IMAGE006

(3)

where

Figure 542618DEST_PATH_IMAGE007

denotes the rate of change of the logarithmized cross-spectral amplitude mean,

Figure 600704DEST_PATH_IMAGE008

denotes the range of frames used to calculate the "rate of change", and P(n−t,k) is the power envelope at frequency k of frame n−t, i.e. t frames earlier than P(n,k);

Step 1.3, the time-frequency points whose rate of change calculated by formula (3) exceeds the preset rate-of-change threshold K are selected and regarded as passing the direct sound selection (DPD) test, forming the direct sound candidate set:

Figure 177179DEST_PATH_IMAGE009

(4)

where

Figure 733841DEST_PATH_IMAGE010

denotes the direct sound candidate set,

Figure 813793DEST_PATH_IMAGE011

denotes the time-frequency point corresponding to the n-th frame and the k-th frequency band;

Step 2, detecting and eliminating transient noise

Step 2.1, transient noise is detected according to the following two criteria:

1) Calculate the "energy" of each frame

Figure 308359DEST_PATH_IMAGE012

and find the frames at which the "energy" has a local maximum

Figure 106551DEST_PATH_IMAGE013

(5)

2) Check whether

Figure 711976DEST_PATH_IMAGE014

(6)

Figure 646434DEST_PATH_IMAGE015

(7)

where

Figure 311901DEST_PATH_IMAGE016

denotes the "energy" of each frame, n_v denotes a frame at which the "energy" has a local maximum, dn denotes the range over which the "energy rate of change" is calculated, Δn denotes the extent of the "local" range, and V_1 and V_2 are the thresholds for the rise and fall of the "energy", respectively;

Step 2.2, if both criteria of step 2.1 are met, the part corresponding to n_v is judged to be transient noise, the "local" range centered on n_v is ignored in the direct sound selection, and formula (4) is rewritten as

Figure 597389DEST_PATH_IMAGE017

(8)

where

Figure 740926DEST_PATH_IMAGE018

(9)

Step 3, localizing the speech sound source using the selected direct sound

The selected time-frequency points are localized with the weighted SRP-PHAT method, expressed as:

Figure 795469DEST_PATH_IMAGE019

(10)

where

Figure 631838DEST_PATH_IMAGE020

(11)

Figure 139043DEST_PATH_IMAGE021

(12)

where

Figure 883008DEST_PATH_IMAGE022

denotes the direction of arrival of the sound wave to be estimated, θ denotes the candidate values of the direction of arrival, i.e. the independent variable, argmax denotes the value of the independent variable at which the expression attains its maximum, W(n,k) is 1 when (n,k) is in the set Π and 0 otherwise,

Figure 933004DEST_PATH_IMAGE023

denotes the cross-spectrum of the signal,

Figure 799328DEST_PATH_IMAGE024

denotes the frequency-domain signal, the superscripts "H" and "T" denote the conjugate transpose and the transpose, respectively, and g(k,θ) denotes the steering vector for direction θ.

Preferably, the microphone array may be any suitable array; a line array or a ring array is typically used.

Preferably, if the microphone array is a line array, g(k,θ) is expressed as:

Figure 931845DEST_PATH_IMAGE025

(13)

where

Figure 541818DEST_PATH_IMAGE026

denotes the exponential function with base e,

Figure 180741DEST_PATH_IMAGE027

denotes the imaginary unit,

Figure 483546DEST_PATH_IMAGE028

denotes the speed of sound, d is the spacing vector of the microphone array, and ω_k is the angular frequency corresponding to frequency band k.

Compared with the prior art, the invention has the following beneficial effects:

The speech sound source localization method still yields accurate and robust results in environments with strong reverberation and interference.

Drawings

FIG. 1 is a comparison of RMSE for different methods in simulation.

Detailed Description

The invention is further illustrated below with reference to the drawing and specific embodiments. These examples are given solely for illustration and do not limit the scope of the invention; equivalent modifications that occur to those skilled in the art upon reading the present disclosure fall within the scope of the appended claims.

A speech sound source localization method using a microphone array, suitable for interference and high-reverberation environments and requiring less computation than comparable algorithms, comprises the following steps:

1. Direct sound selection (DPD)

One sound source is placed in a room and I microphones collect the signal. A line array, a ring array or other geometries may be used; the invention is not limited to a particular array shape. With x_i(k,l) denoting the signal of the i-th microphone in frequency band k and frame l, the cross-spectral amplitude mean of the acquired signal can be expressed as:

(1)

and expressed logarithmically as:

Figure 519952DEST_PATH_IMAGE030

(2)

where x_i(k,l) denotes the signal of the i-th microphone in the k-th frequency band and the l-th frame,

Figure 13382DEST_PATH_IMAGE031

represents the cross-spectral amplitude mean of the acquired signal,

Figure 487088DEST_PATH_IMAGE032

denotes the number of frames, ξ is a regularization term that reduces the effect of background noise, |·| denotes the absolute value of a complex number, * denotes complex conjugation, and P(n,k) is the power envelope of the signal in frequency.
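The images for equations (1) and (2) are not reproduced in this text, so the following Python sketch is only one plausible reading of the description, not the patented formula itself. It assumes that the cross-spectrum of every microphone pair is averaged over the current and the previous L−1 frames, that the magnitudes are then averaged over all pairs, and that the logarithm is base 10 with ξ added beforehand. The function name, the array layout (microphones, frames, bands) and the default values of `L` and `xi` are hypothetical.

```python
import numpy as np
from itertools import combinations

def log_cross_spectral_envelope(X, L=4, xi=1e-6):
    """X: complex STFT of shape (I, N, K) = (microphones, frames, bands).

    Returns P of shape (N, K), an assumed form of the logarithmized
    cross-spectral amplitude mean.
    """
    I, N, K = X.shape
    pairs = list(combinations(range(I), 2))
    counts = np.minimum(np.arange(1, N + 1), L)[:, None]  # frames actually averaged
    acc = np.zeros((N, K))
    for i, j in pairs:
        cross = X[i] * np.conj(X[j])            # per-frame cross-spectrum, (N, K)
        csum = np.cumsum(cross, axis=0)
        smooth = csum.copy()
        smooth[L:] = csum[L:] - csum[:-L]       # sum over frames n-L+1 .. n
        acc += np.abs(smooth / counts)          # magnitude of the averaged cross-spectrum
    return np.log10(acc / len(pairs) + xi)      # xi keeps the log finite in silence
```

Averaging the complex cross-spectrum before taking the magnitude means that frames with inconsistent inter-microphone phase partly cancel, which is one reason such a quantity can act as an "energy" envelope that favors coherent sound.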

Inspired by the precedence effect (Litovsky R Y, Colburn H S, Yost W A, et al. The precedence effect[J]. The Journal of the Acoustical Society of America, 1999, 106(4): 1633-1654.), the time-frequency points at the onset of speech can be considered to consist mainly of direct sound, which carries accurate source location information. The power envelope rises rapidly in this portion, so we define the rate of change of the logarithmized cross-spectral amplitude mean as:

(3)

where

Figure 162417DEST_PATH_IMAGE034

denotes the rate of change of the logarithmized cross-spectral amplitude mean,

Figure 541446DEST_PATH_IMAGE035

denotes the range of frames used to calculate the "rate of change", and P(n−t,k) is the power envelope at frequency k of frame n−t, i.e. t frames earlier than P(n,k). The time-frequency points whose rate of change calculated by equation (3) exceeds K (a preset threshold) are selected and regarded as passing the direct sound selection (DPD) test, forming the direct sound candidate set

Figure 873201DEST_PATH_IMAGE036

(4)

where

Figure 300772DEST_PATH_IMAGE037

denotes the direct sound candidate set,

Figure 159006DEST_PATH_IMAGE038

denotes the time-frequency point corresponding to the n-th frame and the k-th frequency band. Clearly, a shorter frame shift allows more points to be selected, which helps improve the accuracy of the DOA estimate.
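The selection rule can be sketched in Python as follows. Since the image for equation (3) is missing, the rate of change is assumed here to be the mean difference between P(n,k) and the envelopes of the previous T frames; the name `T`, the function names and the handling of the first T frames are hypothetical. Only the final comparison against the threshold K follows directly from the text.

```python
import numpy as np

def change_rate(P, T=3):
    """Assumed rate of change: mean of P(n,k) - P(n-t,k) for t = 1..T.

    P has shape (N, K) = (frames, bands). The first T frames lack a full
    history and are set to -inf so that they are never selected.
    """
    N, K = P.shape
    D = np.full((N, K), -np.inf)
    diffs = [P[T:] - P[T - t:N - t] for t in range(1, T + 1)]
    D[T:] = np.mean(diffs, axis=0)
    return D

def direct_sound_candidates(P, threshold, T=3):
    """Boolean mask of time-frequency points whose rate of change exceeds K."""
    return change_rate(P, T) > threshold
```

For a band whose envelope steps from 0 to 1 at frame 5, this assumed measure is 1 at frame 5, 2/3 at frame 6 and 1/3 at frame 7, so a threshold of 0.5 keeps only the two frames right at the onset.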

2. Determination and elimination of transient noise

In a real scenario there is always some environmental interference. Common interfering noise falls into the following categories: stationary noise, such as fan noise and electrical noise; transient noise, such as door slams, knocks and keyboard clicks; and other non-stationary noise, such as music and television sound. Stationary noise can be neglected because its acoustic power does not change rapidly. The average sound power of the target speech is usually higher than that of the environmental interference, so direct sound is generally expected to be the main component of time-frequency points with rapidly increasing power. Transient noise, however, has the largest impact on direct sound detection: its high rate of power change in the time-frequency domain greatly increases the misjudgment rate. Transient noise is characterized by high power and short duration, and can be detected according to the following two criteria.

(1) Calculate the "energy" of each frame

Figure 846952DEST_PATH_IMAGE039

and find the frames at which the "energy" has a local maximum

Figure 525058DEST_PATH_IMAGE040

(5)

(2) Check whether

Figure 694002DEST_PATH_IMAGE041

(6)

Figure 406743DEST_PATH_IMAGE042

(7)

where

Figure 268520DEST_PATH_IMAGE043

denotes the "energy" of each frame, n_v denotes the frame number of a local maximum of the "energy",

Figure 699502DEST_PATH_IMAGE044

denotes the range over which the "energy rate of change" is calculated, Δn denotes the extent of the "local" range, and V_1 and V_2 are the thresholds for the rise and fall of the "energy", respectively.

If both criteria are met, the part corresponding to frame n_v is judged to be transient noise, the "local" range centered on n_v is ignored in the direct sound selection, and equation (4) can be rewritten as:

Figure 468874DEST_PATH_IMAGE045

(8)

where

(9)
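The two criteria and the exclusion of the "local" range can be illustrated with the following Python sketch. The images for equations (5) to (9) are missing, so several points are assumptions: the per-frame "energy" E(n) is taken as a given 1-D array, a local maximum is a frame that is the largest within ±Δn frames, and the rise and fall are measured against the frames dn before and dn after the peak. All function names and default values are hypothetical.

```python
import numpy as np

def transient_frames(E, dn=2, delta_n=3, V1=1.0, V2=1.0):
    """Return a boolean mask over frames that should be ignored as transient noise."""
    N = len(E)
    ignore = np.zeros(N, dtype=bool)
    for n in range(dn, N - dn):
        lo, hi = max(0, n - delta_n), min(N, n + delta_n + 1)
        is_peak = E[n] == E[lo:hi].max()       # local maximum within +/- delta_n
        fast_rise = E[n] - E[n - dn] > V1      # assumed form of criterion (6)
        fast_fall = E[n] - E[n + dn] > V2      # assumed form of criterion (7)
        if is_peak and fast_rise and fast_fall:
            ignore[lo:hi] = True               # drop the "local" range around n_v
    return ignore

def remove_transients(candidates, E, **kwargs):
    """Clear direct sound candidates (frames x bands) that fall in ignored frames."""
    return candidates & ~transient_frames(E, **kwargs)[:, None]
```

Under these assumptions a short click (energy that rises and falls within a few frames) is flagged, while a speech-like onset that rises quickly but then stays high is kept, because it fails the fall criterion.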

3. Speech sound source localization using the selected direct sound

The selected time-frequency points can be localized with a common localization method, the SRP-PHAT method. Because the time-frequency points have to be screened, a weighted SRP-PHAT method is used here, expressed as:

Figure 68800DEST_PATH_IMAGE047

(10)

where

Figure 862444DEST_PATH_IMAGE048

(11)

Figure 497824DEST_PATH_IMAGE049

(12)

where [symbol missing] denotes the direction of arrival of the sound wave to be estimated, θ denotes the candidate values of the direction of arrival, i.e. the independent variable, argmax denotes the value of the independent variable at which the expression attains its maximum, W(n,k) is 1 when (n,k) is in the set Π and 0 otherwise,

Figure 795262DEST_PATH_IMAGE051

denotes the cross-spectrum of the signal,

Figure 935256DEST_PATH_IMAGE052

denotes the frequency-domain signal, and the superscripts "H" and "T" denote the conjugate transpose and the transpose, respectively. g(k,θ) denotes the steering vector for direction θ; for a linear array it can be expressed as:

(13)

where d is the spacing vector of the microphone array and ω_k is the angular frequency corresponding to frequency band k. For other array types, the steering vector can be derived from the specific geometry.

At this point, a voice sound source localization result is obtained.
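A weighted SRP-PHAT search over candidate directions might look like the Python sketch below. It is written under stated assumptions, because the images for equations (10) to (13) are missing: the array is a far-field linear array with element positions given along its axis, θ is measured from the array axis, the steering vector is taken as exp(jω_k p cos θ / c), and PHAT weighting is applied by normalizing each microphone spectrum to unit magnitude. With that normalization, g^H R g for the phase-only cross-spectrum matrix equals |g^H u|², which the code uses directly. Function names and the argument layout are hypothetical.

```python
import numpy as np

def steering_vector(omega, theta, positions, c=344.0):
    """Assumed far-field steering vector of a linear array (sign is a convention)."""
    return np.exp(1j * omega * positions * np.cos(theta) / c)

def weighted_srp_phat(X, W, freqs, positions, thetas, c=344.0):
    """X: (I, N, K) STFT; W: (N, K) 0/1 weights; freqs: (K,) in Hz;
    positions: (I,) in metres; thetas: (Q,) candidate angles in radians.
    Returns the angle with the largest weighted steered power, and all powers."""
    U = X / np.maximum(np.abs(X), 1e-12)                  # PHAT: keep phase only
    power = np.zeros(len(thetas))
    for q, theta in enumerate(thetas):
        # G[i, k]: steering vector of every band for this candidate direction
        G = steering_vector(2 * np.pi * freqs[None, :], theta, positions[:, None], c)
        beam = np.einsum("ik,ink->nk", np.conj(G), U)     # g^H u per time-frequency point
        power[q] = np.sum(W * np.abs(beam) ** 2)          # only points with W = 1 count
    return thetas[np.argmax(power)], power
```

Passing the direct sound mask (after transient removal) as `W` restricts the sum to the selected time-frequency points, which is the role the weight W(n,k) plays in the text.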

Simulation example

1. Generation of simulated mixed speech

The implementation is illustrated by localizing simulated signals. In the simulation, the Image model is used to generate room impulse responses, which are convolved with clean speech to produce reverberant speech; room impulse responses generated by the Image model for a different source position in the same room are convolved with clean interference and superposed to obtain the mixed signal. In the Image-model simulation, the spacing between microphone array elements is 3.5 cm and the room size is 7 × 5 × 3 m³. The target source positions form a full circle around the array at a distance of 2 m from the array center, and the angle between the interfering source and the target source relative to the array center is not less than 120°. Two room reverberation times are used, 0.4 s and 1.0 s. Each speech sample is 2 s long. For each of the reverberation times 0.4 s and 1.0 s, 2300 mixed utterances are generated. The sampling frequency of the signal is 16 kHz.
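The mixing step can be sketched in Python as follows. The sketch does not implement the Image model; it assumes the room impulse responses from each source to each microphone are already available as arrays. The interference is scaled to a target signal-to-interference ratio, using the 5 dB value quoted for the simulation as the default. Function and argument names are hypothetical.

```python
import numpy as np

def mix_at_sir(speech, interference, rir_s, rir_i, sir_db=5.0):
    """rir_s, rir_i: (I, taps) impulse responses from the speech source and the
    interfering source to each of the I microphones.
    Returns the mixture and its two reverberant components, each (I, samples)."""
    rev_s = np.stack([np.convolve(speech, h) for h in rir_s])
    rev_i = np.stack([np.convolve(interference, h) for h in rir_i])
    n = min(rev_s.shape[1], rev_i.shape[1])        # common length of both components
    rev_s, rev_i = rev_s[:, :n], rev_i[:, :n]
    # Scale the interference so that 10*log10(P_speech / P_interf) equals sir_db
    gain = np.sqrt(np.mean(rev_s ** 2) / (np.mean(rev_i ** 2) * 10 ** (sir_db / 10)))
    rev_i = gain * rev_i
    return rev_s + rev_i, rev_s, rev_i
```

Here the SIR is measured on the reverberant signals averaged over all microphones; measuring it on the dry sources or at a reference microphone would be an equally valid convention, and the text does not say which one was used.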

2. Method process flow

a) Parameter setting

The parameters of the proposed method are given in Table 1. The proposed method does not require parameter tuning for different environments; the given parameters can be applied in various environments.

TABLE 1 Parameters of the proposed method

Figure 54183DEST_PATH_IMAGE054

b) Short time Fourier transform

A discrete short-time Fourier transform is applied to the time-domain signals acquired by the microphones to obtain the time-frequency-domain signals; the window function is a Hanning window, the window length is 32 ms and the window shift is 0.5 ms.
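At the 16 kHz sampling rate used here, a 32 ms window corresponds to 512 samples and a 0.5 ms shift to 8 samples. A minimal NumPy sketch of such a transform for one channel is given below; the function name and the choice of a one-sided spectrum without padding are assumptions, not details taken from the text.

```python
import numpy as np

def stft(x, fs=16000, win_ms=32.0, hop_ms=0.5):
    """Return the one-sided STFT of x as an array of shape (n_frames, n_bins)."""
    win_len = int(round(fs * win_ms / 1000.0))   # 512 samples at 16 kHz
    hop = int(round(fs * hop_ms / 1000.0))       # 8 samples at 16 kHz
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop     # only full windows, no padding
    frames = np.stack([x[m * hop:m * hop + win_len] * window
                       for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)
```

With these settings consecutive frames overlap by 504 of 512 samples, so a 1 s signal already yields 1937 frames; this dense frame grid is what makes many onset time-frequency points available for selection.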

c) Computing an "energy" envelope

For each time-frequency point of the time-frequency-domain signal, the logarithmized cross-spectral amplitude mean is calculated using equations (1) and (2).

d) Estimating the "energy" rate of change

For each time-frequency point of the time-frequency-domain signal, the "rate of change" of the "energy" envelope is calculated using equation (3).

e) Detecting and locating transient noise

For each frame of the time-frequency-domain signal:

1. Calculate the "energy" of each frame

Figure 428664DEST_PATH_IMAGE055

and use equation (5) to find the frames at which the "energy" has a local maximum;

2. For each frame with a local "energy" maximum, use equations (6) and (7) to evaluate how quickly the energy appears and dissipates; a frame whose rise and fall both exceed the thresholds corresponds to transient noise.

f) Selecting the time-frequency points corresponding to the direct sound and ignoring the transient noise part

For each time-frequency point of the time-frequency-domain signal, equations (8) and (9) are used to select the points whose "energy" envelope rate of change exceeds K as the direct sound selection result, recorded as the set Π.

g) Applying the weighted SRP-PHAT method to the selected time-frequency points to obtain the localization result

For the time-frequency-domain signal, the final localization result is estimated using equation (10). Note that W(n,k) is 1 when the time-frequency point (n,k) is in the set Π and 0 otherwise.

To illustrate the advantages of the algorithm, the proposed method is compared with conventional methods in simulations and experiments.

In the comparisons, DPD-D-FR (PHAT) is the method proposed in the present invention; DPD-D-FR (MUSIC) replaces the weighted SRP-PHAT localization in the third step of the proposed method with a weighted MUSIC method; DPD-MUSIC is the DPD-test method based on matrix eigenvalue decomposition proposed by Rafaely et al. (Rafaely B, Kolossa D. Speaker localization in reverberant rooms based on direct path dominance test statistics[C]//Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017: 6120-.

In the simulation, a 6-channel ring array is used; 50 speech samples and 46 samples of common indoor noise and non-stationary interference are mixed in pairs. A 6-channel ring array is easy to mount on top of a smart device. The test room measures 7 × 5 × 3 m³ with two reverberation conditions: Room 1, T60 = 0.4 s; Room 2, T60 = 1.0 s. The array center is at (3.5 m, 2.2 m, 1.5 m); the speech source is placed in 10 directions around the array at 36° intervals; the angle between the interfering source and the speech source relative to the array center is not less than 120°; the interfering source is 2 m from the microphone array at the same height; and the signal-to-interference ratio (SIR) is 5 dB. The speed of sound is 344 m/s. A comparison of the root-mean-square error (RMSE) of the different methods without interference is shown in FIG. 1. Two indices are defined for the comparison with interference: P_s, the probability that the localization estimate is closer to the interference; and R_s, the root-mean-square error over the estimates that are closer to the target speaker. The comparison of P_s and R_s for the different methods with interference is shown in Table 2.

TABLE 2 Comparison of P_s and R_s for different methods in the simulation

Figure 55954DEST_PATH_IMAGE056

In the experiments, we tested in three rooms: Room 1 is an audio-visual room with dimensions 4.5 × 7.4 × 3 m³ and T60 = 0.32 s; Room 2 is a small classroom with dimensions 3.6 × 5.2 × 3 m³ and T60 = 1.20 s; Room 3 is a reverberation chamber with dimensions 7.35 × 5.9 × 5.22 m³ and T60 ≈ 5 s. 35 speech samples were recorded with a 4-channel line array while interference samples containing 20 different common noises were played in a loop in the recording environment; the target source and the interfering source were both 2 m from the microphone array and at the same height. The sampling rate is 16 kHz. The speech source is at 30° and 60°, respectively, and the interfering source is at −45°. The root-mean-square errors of the different methods are shown in Table 3.

TABLE 3 Comparison of RMSE (°) for different methods in the experiment

Simulation and experiments show that the proposed method outperforms most other common methods in accuracy and robustness. The DPD-D-FR (PHAT) method is more stable under high reverberation: in the experiment without interference its maximum RMSE is 1.2°, and when interference is present the results are less affected, indicating higher robustness. It also has some advantage over the DPD-MUSIC method, and its computational cost is far lower than that of direct sound detection based on matrix subspace decomposition.

The above describes only the preferred embodiments of the present invention. Those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these also fall within the scope of the invention.
