Microphone array speech enhancement method and implementation device

Document No.: 1757175 Publication date: 2019-11-29

Reading note: this technology, "Microphone array speech enhancement method and implementation device", was designed and created by 张军, 梁晟, 宁更新, 冯义志, 余华 and 季飞 on 2019-07-25. Main content: The invention discloses a microphone array speech enhancement method and implementation device. Branch three suppresses the signals from the speaker and interference-source directions to obtain the spatially incoherent noise spectral vector. A deep neural network performs the mapping from noisy speech and noise to clean speech, which effectively exploits the nonlinear characteristics and temporal correlation of speech signals and makes the estimate more accurate and closer to human auditory characteristics. The deep neural network takes both noisy speech and noise as input and achieves better enhancement than conventional deep-neural-network speech enhancement techniques that take only noisy speech as input. The invention combines microphone-array-based and deep-neural-network-based speech enhancement, and it outperforms both conventional microphone array speech enhancement methods and single-microphone deep neural network speech enhancement methods. It can be widely used in speech communication applications with noisy backgrounds, such as video conferencing, in-vehicle communication, conference venues and multimedia classrooms.

1. A microphone array speech enhancement method based on a deep neural network, characterized in that the input speech signal is enhanced using the following steps:

S1: using a clean speech library and a noise library, train a deep neural network that maps noisy speech and noise to clean speech;

S2: using the microphone array, estimate the direction of arrival of the speaker, the number of interference sources, and the directions of arrival of the interference sources;

S3: split the signal received by the microphone array into three branches. Branch one uses a fixed beamformer to enhance the signal from the speaker direction and yields the speech spectrum $S^{(f)}(\omega, t)$ output by branch one, where $t$ is the frame index. Branch two uses a blocking matrix $B_1$ to suppress the signal from the speaker direction and passes the blocking-matrix output through an adaptive filter to obtain the noise-component spectrum output by branch two. Branch three uses a blocking matrix $B_2$ to suppress the signals from the speaker and all interference-source directions and yields the spectral vector of the spatially incoherent noise output by branch three;

S4: use the noise-component spectrum of branch two and the incoherent-noise spectral vector of branch three to estimate the noise spectrum contained in $S^{(f)}(\omega, t)$;

S5: input $S^{(f)}(\omega, t)$ and the estimated noise spectrum into the deep neural network trained in step S1 to obtain the enhanced speech.

2. The microphone array speech enhancement method according to claim 1, characterized in that the deep neural network in step S1 is trained using the following steps:

S1.1: superimpose speech from the clean speech library and noise from the noise library to obtain noisy speech; take the short-term spectrum of the noisy speech and the short-term spectrum of the corresponding noise as input and the short-term spectrum of the corresponding clean speech as the target output, thereby obtaining the training data set;

S1.2: set the structural parameters of the deep neural network, and use the following cost function:

where $X(\omega, t)$ is the short-term spectrum of the $t$-th frame of clean speech, $Y(\omega, t)$ is the input sample composed of the $t$-th frame noisy-speech short-term spectrum $S^{(f)}(\omega, t)$ and the corresponding noise short-term spectrum, $f(Y(\omega, t))$ is the output of the neural network, and $T$ is the number of speech frames used for training;

S1.3: train the deep neural network until the change in the cost function $\Phi$ is less than a preset value.

3. The microphone array speech enhancement method according to claim 1, characterized in that, in steps S3 and S4, the input signal is first decomposed into $K$ subbands; after the signal of each subband has been processed by the three branches, the full-band $S^{(f)}(\omega, t)$ and the full-band noise spectrum are synthesized.

4. The microphone array speech enhancement method according to claim 1, characterized in that, in step S3, for the $i$-th subband, $i = 1, 2, \ldots, 24$, the weight matrix $w_{q,i}$ of branch one is calculated as follows:

where $C_{1i} = d(\omega_i, \theta_0)$ is the constraint matrix, $M$ is the number of elements of the microphone array, $\omega_i$ is the centre frequency of the $i$-th subband, $\theta_0$ is the direction of arrival of the speaker, $\tau_{0,m}$, $0 \le m \le M-1$, is the difference between the time at which the speaker's sound reaches the $m$-th element and the time at which it reaches the 0-th element, and $f$ is the response vector.

5. The microphone array speech enhancement method according to claim 1, characterized in that, in step S3, for the $i$-th subband, $i = 1, 2, \ldots, 24$, the blocking matrix $B_{1i}$ of branch two is calculated as follows:

Perform a singular value decomposition of the matrix $C_{1i} = d(\omega_i, \theta_0)$:

where $\Sigma_{1ir}$ is an $r_1 \times r_1$ diagonal matrix and $r_1$ is the rank of $C_{1i}$. Let $U_{1i}$ be partitioned into $U_{1ir}$, its first $r_1$ rows, and its remaining rows; $B_{1i}$ then follows from this partition.

6. The microphone array speech enhancement method according to claim 1, characterized in that, in step S3, for the $i$-th subband, $i = 1, 2, \ldots, 24$, the blocking matrix $B_{2i}$ of branch three is calculated as follows:

Perform a singular value decomposition of the matrix $C_{2i} = [d(\omega_i, \theta_0), d(\omega_i, \theta_1), \ldots, d(\omega_i, \theta_J)]$:

where $M$ is the number of elements of the microphone array; $\omega_i$ is the centre frequency of the $i$-th subband; $\theta_0$ is the direction of arrival of the speaker; $\tau_{0,m}$, $0 \le m \le M-1$, is the difference between the time at which the speaker's sound reaches the $m$-th element and the time at which it reaches the 0-th element; $J$ is the number of interference sources; $\theta_j$, $1 \le j \le J$, is the direction of arrival of the $j$-th interference source; $\tau_{j,m}$, $0 \le m \le M-1$, is the difference between the time at which the sound of the $j$-th interference source reaches the $m$-th element and the time at which it reaches the 0-th element; $\Sigma_{2ir}$ is an $r_2 \times r_2$ diagonal matrix; and $r_2$ is the rank of $C_{2i}$. Let $U_{2i}$ be partitioned into $U_{2ir}$, its first $r_2$ rows, and its remaining rows; $B_{2i}$ then follows from this partition.

7. The microphone array speech enhancement method according to claim 6, characterized in that, in step S4, for the $i$-th subband, $i = 1, 2, \ldots, 24$, the noise spectrum contained in the speech spectrum output by branch one is calculated using the following formula:

where $w_{q,i}$ and $w_{a,i}$ are the weight vectors of the fixed beamformer of branch one and of the adaptive filter of branch two, respectively, and $B_{1i}$ is the blocking matrix of branch two. The remaining quantities are the spectral vector of the spatially incoherent noise output by branch three in the $i$-th subband and the noise-component spectrum output by branch two in the $i$-th subband.

8. An implementation device for the microphone array speech enhancement method based on a deep neural network, characterized in that the device comprises a microphone array receiving module, a subband decomposition module, a subband synthesis module, 24 improved subband GSCs and a deep neural network. The microphone array receiving module and the subband decomposition module are connected in sequence and are used, respectively, to receive the multi-channel audio signal and to divide it into subbands; the subband synthesis module and the deep neural network are connected in sequence and are used, respectively, to synthesize the full-band signal and to train the neural network used for filtering; the 24 improved subband GSC modules are each connected to the subband decomposition module and to the subband synthesis module and perform GSC filtering on the subbands of the signal;

wherein the microphone array receiving module uses a linear array containing 8 microphones uniformly distributed along a straight line, each element being isotropic; the subband decomposition module decomposes the audio signal captured by each microphone element into 24 subbands and sends each subband to the corresponding improved subband GSC for processing; the subband synthesis module synthesizes the outputs of the 24 improved subband GSCs into a full-band signal and sends it to the deep neural network for enhancement.

9. The implementation device for the microphone array speech enhancement method according to claim 8, characterized in that the $i$-th improved subband GSC, $i = 1, 2, \ldots, 24$, comprises 3 branches: branch one uses a fixed beamformer $w_{q,i}$ to enhance the signal from the speaker direction; branch two uses a blocking matrix $B_{1i}$ to suppress the signal from the speaker direction and passes the blocking-matrix output through an adaptive filter $w_{a,i}$ to obtain the noise-component spectrum; branch three uses a blocking matrix $B_{2i}$ to suppress the signals from the speaker and all interference-source directions and obtains the spectral vector of the spatially incoherent noise.

Technical field

The present invention relates to the field of speech signal processing, and in particular to a microphone array speech enhancement method based on a deep neural network (DNN) and a device implementing it.

Background

In real life, speech is almost inevitably corrupted by external noise while it is being transmitted. This interference degrades speech quality and impairs both speech communication and speech recognition. Speech enhancement is a technique for extracting the useful speech signal from speech corrupted by noise while suppressing and reducing the noise, that is, for recovering the original speech from noisy speech as cleanly as possible. It is widely used in areas such as speech communication.

Depending on the number of microphones used, existing speech enhancement algorithms fall into two classes. The first class comprises single-microphone algorithms such as spectral subtraction, Wiener filtering, MMSE and Kalman filtering. These algorithms receive the speech signal with a single microphone, so the hardware is small and structurally simple, but their noise reduction capability is limited: most can handle only stationary noise, and their performance on non-stationary noise is unsatisfactory. The second class comprises microphone-array-based methods, in which multiple microphones in the acquisition system receive sound from different spatial directions, and spatial filtering amplifies the signal from the speaker direction while suppressing noise and interference from other directions. Compared with the single-microphone methods, they offer higher signal gain and stronger interference suppression and can address a variety of acoustic estimation problems, such as sound source localization, dereverberation, speech enhancement and blind source separation; their drawbacks are larger size and higher algorithmic complexity. Existing microphone array speech enhancement techniques can be roughly divided into three categories: fixed beamforming, adaptive beamforming and adaptive post-filtering. Adaptive beamforming uses an adaptive algorithm to adjust and optimize the array weights under some optimality criterion and adapts well to changes in the environment, so it is the most widely used in practice.

The generalized sidelobe canceller (GSC) is a common structure for implementing adaptive beamforming. It consists mainly of two branches. Branch one uses a fixed beamformer to enhance the signal from the look direction. Branch two first uses a blocking matrix to prevent the look-direction signal from passing, then filters the blocking-matrix output with an adaptive filter to estimate the noise remaining in the output of branch one, which is then cancelled by subtraction. The GSC converts the linearly constrained minimum variance (LCMV) optimization problem into an unconstrained optimization problem, so it is computationally efficient and simpler to implement than other adaptive beamforming algorithms. The conventional GSC nevertheless has shortcomings: it suppresses spatially incoherent noise poorly, and it neither uses prior knowledge of speech signals nor is optimized for their characteristics.
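The two-branch structure described above can be sketched for a single frequency-domain snapshot as follows. This is a minimal illustration, not the patented method: the function name `gsc_frame`, the NLMS update rule and the step size `mu` are assumptions, since the text only says that an adaptive filter is used.

```python
import numpy as np

def gsc_frame(x, w_q, B, w_a, mu=0.1, eps=1e-8):
    """Process one snapshot x (shape (M,)) with a conventional two-branch GSC.

    w_q: fixed beamformer weights, shape (M,)
    B:   blocking matrix, shape (M, M - r); columns orthogonal to the look direction
    w_a: adaptive filter weights, shape (M - r,), complex; updated in place
         by an NLMS step (the update rule is an assumption of this sketch)
    """
    y_fixed = np.vdot(w_q, x)      # branch one: w_q^H x
    u = B.conj().T @ x             # branch two: reference with the look direction blocked
    y_noise = np.vdot(w_a, u)      # adaptive estimate of the noise left in branch one
    e = y_fixed - y_noise          # GSC output after subtraction
    w_a += mu * u * np.conj(e) / (np.vdot(u, u).real + eps)  # NLMS update
    return e, y_noise
```

A signal arriving exactly from the look direction is removed by the blocking matrix, so it passes through branch one unchanged and does not drive the adaptive filter.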

To address these problems, Chinese invention patent 201711201341.5 provides a microphone array speech enhancement method based on statistical models. That method constructs an optimal speech filter from a clean speech model and a noise model estimated from the output of GSC branch two, and uses the filter to enhance the output signal of GSC branch one. It effectively improves the suppression of incoherent noise and, by using prior knowledge of speech signals, makes the output speech better match human auditory characteristics. The method nevertheless has the following drawbacks: (1) it adjusts the update rate of the incoherent noise estimate using the ratio of the adaptive filter's output signal energy to the sum of the energies of the adaptive filter's $M-1$ input channels, so when coherent and incoherent noise are present at the same time it is difficult to estimate and track the incoherent noise accurately, which degrades noise suppression; (2) it uses a linear filter to enhance the output of the fixed beamforming part, which distorts the speech signal while removing noise and considerably limits the enhancement achieved; (3) during enhancement, successive speech frames are processed independently of one another, so the temporal correlation of the speech signal cannot be exploited.

Summary of the invention

The purpose of the present invention is to overcome the above drawbacks of the prior art by providing a microphone array speech enhancement method based on a deep neural network, together with an implementation device. The method differs from the prior art in two respects: (1) a third branch is added to the conventional GSC to estimate the incoherent noise, so the noise remaining in the output of branch one can be estimated more accurately; (2) a deep neural network is trained with noisy speech and noise as input and clean speech as output, and this network is used to enhance the output of branch one, which makes better use of the nonlinear characteristics and temporal correlation of speech signals and maps the output of branch one to clean speech more accurately. The invention can be widely used in speech communication applications with noisy backgrounds, such as video conferencing, in-vehicle communication, conference venues and multimedia classrooms.

The first object of the invention can be achieved by the following technical solution:

A microphone array speech enhancement method based on a deep neural network, which enhances the input speech signal using the following steps:

S1: using a clean speech library and a noise library, train a deep neural network that maps noisy speech and noise to clean speech.

S2: using the microphone array, estimate the direction of arrival of the speaker, the number of interference sources, and the directions of arrival of the interference sources.

S3: split the signal received by the microphone array into three branches. Branch one uses a fixed beamformer to enhance the signal from the speaker direction and yields the speech spectrum $S^{(f)}(\omega, t)$ output by branch one, where $t$ is the frame index. Branch two uses a blocking matrix $B_1$ to suppress the signal from the speaker direction and passes the blocking-matrix output through an adaptive filter to obtain the noise-component spectrum output by branch two. Branch three uses a blocking matrix $B_2$ to suppress the signals from the speaker and all interference-source directions and yields the spectral vector of the spatially incoherent noise output by branch three.

S4: use the noise-component spectrum of branch two and the incoherent-noise spectral vector of branch three to estimate the noise spectrum contained in $S^{(f)}(\omega, t)$.

S5: input $S^{(f)}(\omega, t)$ and the estimated noise spectrum into the deep neural network trained in step S1 to obtain the enhanced speech.

Further, in step S1 above, the deep neural network is trained using the following steps:

Step S1.1: superimpose speech from the clean speech library and noise from the noise library to obtain noisy speech; take the short-term spectrum of the noisy speech and the short-term spectrum of the corresponding noise as input and the short-term spectrum of the corresponding clean speech as the target output, thereby obtaining the training data set.

Step S1.2: set the structural parameters of the deep neural network, and use the following cost function:

where $X(\omega, t)$ is the short-term spectrum of the $t$-th frame of clean speech, $Y(\omega, t)$ is the input sample composed of the $t$-th frame noisy-speech short-term spectrum $S^{(f)}(\omega, t)$ and the corresponding noise short-term spectrum, $f(Y(\omega, t))$ is the output of the neural network, and $T$ is the number of speech frames used for training.

Step S1.3: train the deep neural network until the change in the cost function $\Phi$ is less than a preset value.
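The data layout and stopping rule of steps S1.1 to S1.3 can be sketched as below. The cost function itself is not reproduced in this text, so the mean squared error used here is an assumption, as are the helper names `make_input`, `mse_cost` and `has_converged` and the use of real-valued (for example magnitude) spectra of shape (T, F).

```python
import numpy as np

def make_input(noisy_spec, noise_spec):
    """Stack the noisy-speech and noise spectra frame by frame: Y has shape (T, 2F)."""
    return np.concatenate([noisy_spec, noise_spec], axis=1)

def mse_cost(pred, clean_spec):
    """Assumed cost: squared error per frame, averaged over the T training frames."""
    return float(np.mean(np.sum((pred - clean_spec) ** 2, axis=1)))

def has_converged(costs, tol):
    """Stopping rule of S1.3: the change in the cost between two epochs is below tol."""
    return len(costs) >= 2 and abs(costs[-1] - costs[-2]) < tol
```

In a training loop, `mse_cost` would be evaluated once per epoch and appended to a list, and training would stop as soon as `has_converged` returns true.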

In steps S3 and S4 above, the input signal is first decomposed into $K$ subbands; after the signal of each subband has been processed by the three branches, the full-band $S^{(f)}(\omega, t)$ and the full-band noise spectrum are synthesized.

In step S3 above, for the $i$-th subband, the weight matrix $w_{q,i}$ of branch one is calculated as follows:

where $C_{1i} = d(\omega_i, \theta_0)$ is the constraint matrix, $M$ is the number of elements of the microphone array, $\omega_i$ is the centre frequency of the $i$-th subband, $\theta_0$ is the direction of arrival of the speaker, $\tau_{0,m}$, $0 \le m \le M-1$, is the difference between the time at which the speaker's sound reaches the $m$-th element and the time at which it reaches the 0-th element, and $f$ is the response vector.
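The formula for $w_{q,i}$ is not reproduced in this text. As an illustration only, the sketch below uses the textbook minimum-norm solution of the constraint $C^H w = f$, namely $w = C (C^H C)^{-1} f$, and a steering vector with elements $e^{-j\omega\tau_m}$; both the closed form and the sign convention are assumptions, and the function names are hypothetical.

```python
import numpy as np

def steering_vector(omega, delays):
    """d(omega, theta) built from the delays tau_m relative to element 0 (tau_0 = 0)."""
    return np.exp(-1j * omega * np.asarray(delays, dtype=float))

def fixed_beamformer(C, f):
    """Minimum-norm weights satisfying C^H w = f: w = C (C^H C)^{-1} f."""
    C = np.asarray(C, dtype=complex)
    if C.ndim == 1:                 # a single steering vector becomes an (M, 1) column
        C = C[:, None]
    f = np.atleast_1d(f).astype(complex)
    return C @ np.linalg.solve(C.conj().T @ C, f)
```

With a single constraint and $f = 1$, this reduces to a delay-and-sum beamformer: the weights are the steering vector divided by $M$, and the response in the speaker direction is exactly one.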

In step S3 above, for the $i$-th subband, the blocking matrix $B_{1i}$ of branch two is calculated as follows:

Perform a singular value decomposition of the matrix $C_{1i} = d(\omega_i, \theta_0)$:

where $\Sigma_{1ir}$ is an $r_1 \times r_1$ diagonal matrix and $r_1$ is the rank of $C_{1i}$. Let $U_{1i}$ be partitioned into $U_{1ir}$, its first $r_1$ rows, and its remaining rows; $B_{1i}$ then follows from this partition.

In step S3 above, for the $i$-th subband, the blocking matrix $B_{2i}$ of branch three is calculated as follows:

Perform a singular value decomposition of the matrix $C_{2i} = [d(\omega_i, \theta_0), d(\omega_i, \theta_1), \ldots, d(\omega_i, \theta_J)]$:

where $M$ is the number of elements of the microphone array; $\omega_i$ is the centre frequency of the $i$-th subband; $\theta_0$ is the direction of arrival of the speaker; $\tau_{0,m}$, $0 \le m \le M-1$, is the difference between the time at which the speaker's sound reaches the $m$-th element and the time at which it reaches the 0-th element; $J$ is the number of interference sources; $\theta_j$, $1 \le j \le J$, is the direction of arrival of the $j$-th interference source; $\tau_{j,m}$, $0 \le m \le M-1$, is the difference between the time at which the sound of the $j$-th interference source reaches the $m$-th element and the time at which it reaches the 0-th element; $\Sigma_{2ir}$ is an $r_2 \times r_2$ diagonal matrix; and $r_2$ is the rank of $C_{2i}$. Let $U_{2i}$ be partitioned into $U_{2ir}$, its first $r_2$ rows, and its remaining rows; $B_{2i}$ then follows from this partition.
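The SVD-based construction can be sketched as follows. The closing formula for the blocking matrix is not reproduced in this text, so this sketch assumes the usual choice: keep the left singular vectors beyond the rank of the constraint matrix (columns of `U` in NumPy's convention), which gives $B^H C = 0$. The function name and tolerance are hypothetical.

```python
import numpy as np

def blocking_matrix(C, tol=1e-10):
    """Return B whose columns span the orthogonal complement of the columns of C."""
    C = np.asarray(C, dtype=complex)
    if C.ndim == 1:                      # single steering vector: make it a column
        C = C[:, None]
    U, s, _ = np.linalg.svd(C, full_matrices=True)
    r = int(np.sum(s > tol * s.max()))   # numerical rank of C
    return U[:, r:]                      # left singular vectors beyond the rank
```

The same function covers both blocking matrices under this assumption: passing the speaker steering vector alone blocks only the speaker direction, while passing all $J + 1$ steering vectors as columns blocks the speaker and every interference source, leaving $M - r_2$ output channels.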

In step S4 above, for the $i$-th subband, the noise spectrum contained in the speech spectrum output by branch one is calculated using the following formula:

where $w_{q,i}$ and $w_{a,i}$ are the weight vectors of the fixed beamformer of branch one and of the adaptive filter of branch two, respectively, and $B_{1i}$ is the blocking matrix of branch two. The remaining quantities are the spectral vector of the spatially incoherent noise output by branch three in the $i$-th subband and the noise-component spectrum output by branch two in the $i$-th subband.

The other object of the invention can be achieved by the following technical solution:

An implementation device for the microphone array speech enhancement method based on a deep neural network. The device comprises a microphone array receiving module, a subband decomposition module, a subband synthesis module, 24 improved subband GSCs and a deep neural network. The microphone array receiving module and the subband decomposition module are connected in sequence and are used, respectively, to receive the multi-channel audio signal and to divide it into subbands; the subband synthesis module and the deep neural network are connected in sequence and are used, respectively, to synthesize the full-band signal and to train the neural network used for filtering; the 24 improved subband GSC modules are each connected to the subband decomposition module and to the subband synthesis module and perform GSC filtering on the subbands of the signal.

The microphone array receiving module uses a linear array containing 8 microphones uniformly distributed along a straight line, each element being isotropic. The subband decomposition module decomposes the audio signal captured by each microphone element into 24 subbands and sends each subband to the corresponding improved subband GSC for processing. The subband synthesis module synthesizes the outputs of the 24 improved subband GSCs into a full-band signal and sends it to the deep neural network for enhancement.
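For a uniform linear array like the one described, the inter-element delays for a far-field source can be sketched as below. The element spacing, the speed of sound (343 m/s), the far-field model and the convention of measuring the angle from broadside are all assumptions of this sketch; the text specifies only 8 uniformly spaced isotropic microphones.

```python
import math

def ula_delays(num_mics, spacing_m, theta_rad, c=343.0):
    """Delay of each element relative to element 0 for a far-field source.

    theta_rad is measured from broadside (the normal to the array axis);
    spacing_m is the distance between adjacent microphones in metres.
    """
    return [m * spacing_m * math.sin(theta_rad) / c for m in range(num_mics)]
```

For example, `ula_delays(8, 0.04, math.radians(30))` returns eight delays that grow linearly with the element index, and a broadside source gives all-zero delays.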

Further, the $i$-th improved subband GSC, $i = 1, 2, \ldots, 24$, comprises 3 branches: branch one uses a fixed beamformer $w_{q,i}$ to enhance the signal from the speaker direction; branch two uses a blocking matrix $B_{1i}$ to suppress the signal from the speaker direction and passes the blocking-matrix output through an adaptive filter $w_{a,i}$ to obtain the noise-component spectrum; branch three uses a blocking matrix $B_{2i}$ to suppress the signals from the speaker and all interference-source directions and obtains the spectral vector of the spatially incoherent noise.

Compared with the prior art, the present invention has the following advantages and effects:

1. The invention uses branch three to suppress the signals from the speaker and interference-source directions and thereby obtains the spatially incoherent noise spectral vector, so it can estimate and track spatially incoherent noise more accurately than Chinese invention patent 201711201341.5.

2. The invention uses a deep neural network to map noisy speech and noise to clean speech. Compared with the direct subtraction of the conventional GSC, or with the linear filter constructed from statistical models such as GMM and HMM in Chinese invention patent 201711201341.5, this effectively exploits the nonlinear characteristics and temporal correlation of speech signals and makes the estimate more accurate and closer to human auditory characteristics.

3. The deep neural network used in the invention takes both noisy speech and noise as input and achieves better enhancement than conventional deep-neural-network speech enhancement techniques that take only noisy speech as input.

4. The invention combines microphone-array-based and deep-neural-network-based speech enhancement, and it outperforms both conventional microphone array speech enhancement methods and single-microphone deep neural network speech enhancement methods.

Description of the drawings

Fig. 1 is a structural block diagram of the system implementing the microphone array speech enhancement method in an embodiment of the present invention;

Fig. 2 is a structural block diagram of the $i$-th improved subband GSC in an embodiment of the present invention;

Fig. 3 is a flow chart of the microphone array speech enhancement method in an embodiment of the present invention;

Fig. 4 is a structural block diagram of the deep neural network used in an embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
