Voice enhancement method, device and system and computer readable storage medium

Document No.: 617742. Publication date: 2021-05-07.

Reading note: This technology, 一种语音增强方法、装置、系统及计算机可读存储介质 (Voice enhancement method, device and system and computer readable storage medium), was designed by 陈国明 on 2021-01-28. Its main content is as follows. The invention discloses a voice enhancement method, device, system and computer readable storage medium. The method comprises: acquiring a time domain microphone signal and a time domain bone conduction signal at the current moment; judging whether the time domain microphone signal and the time domain bone conduction signal are voice signals; if so, performing noise elimination on the time domain microphone signal through a pre-established DNN noise elimination model, and performing frequency domain noise elimination on the time domain bone conduction signal; if not, setting the output signal corresponding to the current moment to zero; performing high-pass filtering on the noise-eliminated time domain microphone signal to obtain a first output time domain signal, and performing low-pass filtering on the noise-eliminated time domain bone conduction signal to obtain a second output time domain signal; and obtaining the output time domain signal corresponding to the current moment according to the first output time domain signal and the second output time domain signal. The invention can better eliminate background noise, which helps improve sound quality and user experience.

1. A method of speech enhancement, comprising:

acquiring a time domain microphone signal and a time domain bone conduction signal at the current moment;

judging whether the time domain microphone signal and the time domain bone conduction signal are voice signals, if so, carrying out noise elimination processing on the time domain microphone signal through a pre-established DNN noise elimination model to obtain a time domain microphone signal after noise elimination, and carrying out frequency domain noise elimination processing on the time domain bone conduction signal to obtain a time domain bone conduction signal after noise elimination; if not, setting the output signal corresponding to the current moment to be zero;

carrying out high-pass filtering processing on the time domain microphone signal subjected to noise elimination to obtain a first output time domain signal, and carrying out low-pass filtering processing on the time domain bone conduction signal subjected to noise elimination to obtain a second output time domain signal;

and obtaining an output time domain signal corresponding to the current moment according to the first output time domain signal and the second output time domain signal.

2. The speech enhancement method according to claim 1, wherein the process of performing the frequency domain noise cancellation process on the time domain bone conduction signal to obtain the noise-cancelled time domain bone conduction signal comprises:

converting the time domain bone conduction signal into a frequency domain bone conduction signal through time-frequency conversion;

carrying out frequency domain noise elimination processing on the frequency domain bone conduction signal to obtain a frequency domain bone conduction signal subjected to noise elimination;

judging whether the bandwidth of the frequency domain bone conduction signal after the noise elimination reaches a preset bandwidth; if so, directly carrying out time-frequency inverse transformation on the frequency domain bone conduction signal after the noise elimination to obtain a time domain bone conduction signal after the noise elimination; if not, performing bandwidth expansion on the frequency domain bone conduction signal after the noise elimination by adopting a pre-established DNN bandwidth expansion model so that the expanded bandwidth reaches the preset bandwidth, and performing time-frequency inverse transformation on the expanded frequency domain bone conduction signal to obtain a time domain bone conduction signal after the noise elimination.

3. The speech enhancement method according to claim 1, wherein the process of performing noise elimination processing on the time domain microphone signal through the pre-established DNN noise elimination model to obtain the time domain microphone signal after noise elimination comprises:

performing time-frequency transformation on the time domain microphone signals to obtain corresponding frequency domain microphone signals;

extracting first signal characteristics of the frequency domain microphone signals, and processing the first signal characteristics by adopting a pre-established DNN noise elimination model to obtain first gains corresponding to each first frequency point of the frequency domain microphone signals respectively;

calculating the product of the frequency spectrum signal corresponding to each first frequency point in the frequency domain microphone signal and the corresponding first gain to obtain a frequency spectrum signal which corresponds to each first frequency point and is subjected to noise elimination, so as to obtain a frequency domain microphone signal subjected to noise elimination;

and performing time-frequency inverse transformation on the frequency domain microphone signal subjected to noise elimination to obtain a time domain microphone signal subjected to noise elimination.

4. The method of claim 1, wherein the determining whether the time-domain microphone signal and the time-domain bone conduction signal are speech signals comprises:

performing voice activation detection on the time domain bone conduction signal to judge whether the time domain bone conduction signal is a voice signal;

and when the time domain bone conduction signal is a voice signal, the time domain microphone signal is a voice signal.

5. The speech enhancement method according to claim 4, wherein the performing of the speech activity detection on the time-domain bone conduction signal and the determining whether the time-domain bone conduction signal is a speech signal comprises:

calculating the zero crossing rate and the pitch period corresponding to the time domain bone conduction signal;

performing time-frequency transformation on the time domain bone conduction signal to obtain a frequency domain bone conduction signal;

calculating the corresponding spectral energy and spectral centroid of the frequency domain bone conduction signal;

performing fusion judgment on the zero crossing rate, the pitch period, the spectrum energy and the spectrum centroid, and obtaining a voice activation detection flag bit corresponding to the time domain bone conduction signal;

and judging whether the time domain bone conduction signal is a voice signal or not according to the voice activation detection mark bit.

6. The speech enhancement method according to claim 5, wherein the process of performing the fusion judgment on the zero crossing rate, the pitch period, the spectral energy and the spectral centroid and obtaining the voice activation detection flag bit corresponding to the time-domain bone conduction signal comprises:

judging whether the frequency spectrum energy is smaller than a first preset value, if so, setting a voice activation detection mark bit corresponding to the time domain bone conduction signal to be 0; if not, the next step of judgment is carried out;

judging whether the zero crossing rate is greater than a second preset value or not, if so, setting a voice activation detection flag bit corresponding to the time domain bone conduction signal to be 0, and if not, entering the next judgment;

judging whether the pitch period is greater than a third preset value or less than a fourth preset value, if so, setting a voice activation detection flag bit corresponding to the time domain bone conduction signal to be 0; otherwise, entering the next judgment;

judging whether the spectrum centroid is larger than a fifth preset value, if so, setting a voice activation detection mark bit corresponding to the time domain bone conduction signal to be 0; otherwise, the voice activation detection mark bit corresponding to the time domain bone conduction signal is 1;

then, the process of determining whether the time domain bone conduction signal is a voice signal according to the voice activation detection flag bit is as follows:

when the voice activation detection flag bit is 1, the time domain bone conduction signal is a voice signal;

and when the voice activation detection mark bit is 0, the current time domain bone conduction signal is a noise signal.

7. The speech enhancement method according to claim 1, wherein the step of obtaining the output time domain signal corresponding to the current time according to the first output time domain signal and the second output time domain signal comprises:

fusing the first output time domain signal and the second output time domain signal according to a first weight coefficient and a second weight coefficient to obtain a fused time domain signal;

and dynamically adjusting the fused time domain signal to enable the adjusted time domain signal to be in a preset range, and taking the adjusted time domain signal as an output time domain signal corresponding to the current moment.

8. A speech enhancement apparatus, comprising:

the acquisition module is used for acquiring a time domain microphone signal and a time domain bone conduction signal at the current moment;

the judging module is used for judging whether the time domain microphone signal and the time domain bone conduction signal are voice signals or not, and if so, the noise reduction module is triggered; if not, triggering a zero setting module;

the noise reduction module is used for carrying out noise elimination processing on the time domain microphone signal through a pre-established DNN noise elimination model to obtain a time domain microphone signal after noise elimination, and carrying out frequency domain noise elimination processing on the time domain bone conduction signal to obtain a time domain bone conduction signal after noise elimination;

the zero setting module is used for setting the output signal corresponding to the current moment to be zero;

the filtering module is used for carrying out high-pass filtering processing on the time domain microphone signal subjected to noise elimination to obtain a first output time domain signal, and carrying out low-pass filtering processing on the time domain bone conduction signal subjected to noise elimination to obtain a second output time domain signal;

and the fusion module is used for obtaining an output time domain signal corresponding to the current moment according to the first output time domain signal and the second output time domain signal.

9. A speech enhancement system, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the speech enhancement method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the speech enhancement method according to any one of claims 1 to 7.

Technical Field

Embodiments of the present invention relate to the field of speech processing technologies, and in particular, to a speech enhancement method, apparatus, system, and computer-readable storage medium.

Background

Speech enhancement is an effective way to deal with noise pollution, and is therefore widely used in civil and military settings such as digital mobile phones, hands-free phone systems in automobiles, teleconferencing, and reducing background interference for hearing-impaired people. The main goal of speech enhancement is to extract, at the receiving end, a speech signal that is as clean as possible from the noisy speech signal, reducing listener fatigue and improving intelligibility.

Under normal circumstances, sound waves travel into the inner ear through two paths, as shown in Fig. 1: air conduction and bone conduction. Air conduction is the familiar path in which sound waves pass from the auricle through the external auditory canal to the middle ear and then through the ossicular chain to the inner ear; speech conducted this way has a rich spectrum. However, because of environmental noise, speech signals conducted through the air are inevitably contaminated by noise.

Bone conduction means that sound waves are transmitted to the inner ear through vibrations of the skull, the jaw bone and the like, so that they reach the inner ear without passing through the outer ear and the middle ear. A bone voiceprint sensor picks up only vibrations from what is in direct contact with it; in theory it cannot pick up voice transmitted through the air, so it is free from environmental noise interference and well suited to voice transmission in noisy environments. However, owing to process limitations, the bone voiceprint sensor can only collect and transmit the lower-frequency part of the voice signal, so the voice sounds muffled, which degrades sound quality and user experience.

In view of the above, how to provide a speech enhancement method, apparatus, system and computer readable storage medium that solve the above technical problems becomes a problem to be solved by those skilled in the art.

Disclosure of Invention

Embodiments of the present invention provide a method, an apparatus, a system, and a computer-readable storage medium for speech enhancement, which can make an output sound signal more audible, improve sound quality, and improve user experience.

To solve the foregoing technical problem, an embodiment of the present invention provides a speech enhancement method, including:

acquiring a time domain microphone signal and a time domain bone conduction signal at the current moment;

and obtaining an output time domain signal corresponding to the current moment according to the first output time domain signal and the second output time domain signal.

Optionally, the process of performing frequency domain noise cancellation processing on the time domain bone conduction signal to obtain a time domain bone conduction signal after noise cancellation is as follows:

converting the time domain bone conduction signal into a frequency domain bone conduction signal through time-frequency conversion;

carrying out frequency domain noise elimination processing on the frequency domain bone conduction signal to obtain a frequency domain bone conduction signal subjected to noise elimination;

Optionally, the process of performing noise cancellation processing on the time domain microphone signal through the pre-established DNN noise cancellation model to obtain the time domain microphone signal after noise cancellation is as follows:

performing time-frequency transformation on the time domain microphone signals to obtain corresponding frequency domain microphone signals;

and performing time-frequency inverse transformation on the frequency domain microphone signal subjected to noise elimination to obtain a time domain microphone signal subjected to noise elimination.

Optionally, the process of determining whether the time-domain microphone signal and the time-domain bone conduction signal are voice signals is as follows:

performing voice activation detection on the time domain bone conduction signal to judge whether the time domain bone conduction signal is a voice signal;

and when the time domain bone conduction signal is a voice signal, the time domain microphone signal is a voice signal.

Optionally, the process of performing voice activation detection on the time domain bone conduction signal and determining whether the time domain bone conduction signal is a voice signal is as follows:

calculating the zero crossing rate and the pitch period corresponding to the time domain bone conduction signal;

performing time-frequency transformation on the time domain bone conduction signal to obtain a frequency domain bone conduction signal;

calculating the corresponding spectral energy and spectral centroid of the frequency domain bone conduction signal;

and judging whether the time domain bone conduction signal is a voice signal or not according to the voice activation detection mark bit.
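The four features named above can be sketched for a single frame as follows. This is a minimal illustration only: the source does not give formulas for any of the features, so the definitions used here (sign-change counting, a one-sided DFT magnitude spectrum, a magnitude-weighted mean frequency, and an autocorrelation peak search for the pitch period) are common textbook choices assumed for the example, and all function names are hypothetical.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) DFT, fine for a sketch)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_energy(mags):
    """Sum of squared bin magnitudes."""
    return sum(m * m for m in mags)


def spectral_centroid(mags, sample_rate, n):
    """Magnitude-weighted mean frequency in Hz (assumed definition)."""
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total


def pitch_period(frame, min_lag, max_lag):
    """Lag in samples that maximizes the autocorrelation in [min_lag, max_lag]."""
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag
```

In a real-time implementation an FFT would replace the naive DFT; it is written out here only so that the sketch has no dependencies.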

Optionally, the process of performing fusion judgment on the zero-crossing rate, the pitch period, the spectral energy, and the spectral centroid, and obtaining the voice activation detection flag bit corresponding to the time-domain bone conduction signal is as follows:

then, the process of determining whether the time domain bone conduction signal is a voice signal according to the voice activation detection flag bit is as follows:

when the voice activation detection flag bit is 1, the time domain bone conduction signal is a voice signal;

and when the voice activation detection mark bit is 0, the current time domain bone conduction signal is a noise signal.
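The fusion judgment recited in claim 6 is a four-stage cascade: spectral energy first, then zero crossing rate, then pitch period, then spectral centroid. A minimal sketch is given below. The class and field names are hypothetical, and the threshold values are placeholders, since the source refers only to a "first" through "fifth" preset value without giving numbers.

```python
from dataclasses import dataclass


@dataclass
class VadThresholds:
    # Hypothetical container; the source gives no numeric values.
    min_energy: float        # first preset value
    max_zcr: float           # second preset value
    max_pitch_period: float  # third preset value
    min_pitch_period: float  # fourth preset value
    max_centroid: float      # fifth preset value


def vad_flag(energy, zcr, pitch_period, centroid, th):
    """Return the voice activation detection flag: 1 for voice, 0 for noise."""
    if energy < th.min_energy:
        return 0  # too little spectral energy
    if zcr > th.max_zcr:
        return 0  # too many zero crossings
    if pitch_period > th.max_pitch_period or pitch_period < th.min_pitch_period:
        return 0  # pitch period outside the allowed range
    if centroid > th.max_centroid:
        return 0  # spectral centroid too high
    return 1
```

Because each stage returns as soon as it fails, the cheaper checks placed first can reject a noise frame before the later features are consulted.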

Optionally, the process of obtaining the output time domain signal corresponding to the current time according to the first output time domain signal and the second output time domain signal is as follows:

fusing the first output time domain signal and the second output time domain signal according to a first weight coefficient and a second weight coefficient to obtain a fused time domain signal;

An embodiment of the present invention further provides a speech enhancement apparatus, including:

the acquisition module is used for acquiring a time domain microphone signal and a time domain bone conduction signal at the current moment;

the noise reduction module is used for carrying out noise elimination processing on the time domain microphone signal through a pre-established DNN noise elimination model to obtain a time domain microphone signal after noise elimination, and is used for carrying out frequency domain noise elimination processing on the time domain bone conduction signal to obtain a time domain bone conduction signal after noise elimination;

the zero setting module is used for setting the output signal corresponding to the current moment to be zero;

and the fusion module is used for obtaining an output time domain signal corresponding to the current moment according to the first output time domain signal and the second output time domain signal.

An embodiment of the present invention further provides a speech enhancement system, including:

a memory for storing a computer program;

a processor for implementing the steps of the speech enhancement method as described above when executing the computer program.

An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the speech enhancement method are implemented as described above.

The embodiments of the invention provide a voice enhancement method, device, system and computer readable storage medium. The method picks up a time domain microphone signal and a time domain bone conduction signal and judges whether they are voice signals, which determines whether the user is speaking at the current moment. When they are voice signals, noise elimination is performed on the time domain microphone signal through a pre-established DNN noise elimination model, and frequency domain noise elimination is performed on the time domain bone conduction signal, so that background noise is eliminated more effectively. The noise-eliminated time domain microphone signal is then high-pass filtered to obtain a first output time domain signal covering the high-frequency part, and the noise-eliminated time domain bone conduction signal is low-pass filtered to obtain a second output time domain signal covering the low-frequency part. An output time domain signal containing both the high-frequency part and the low-frequency part is obtained from the first output time domain signal and the second output time domain signal. The invention can better eliminate background noise, which helps improve sound quality and user experience.

Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the prior art and the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic view of a conventional bone conduction principle;

Fig. 2 is a flowchart illustrating a speech enhancement method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 2, fig. 2 is a flowchart illustrating a speech enhancement method according to an embodiment of the present invention. The method comprises the following steps:

S110: acquiring a time domain microphone signal and a time domain bone conduction signal at the current moment;

specifically, in practical applications, the time domain microphone signal can be picked up by a microphone and the time domain bone conduction signal can be collected by a bone voiceprint sensor; the time domain microphone signal and the time domain bone conduction signal acquired at each moment are then processed with the voice enhancement method provided by this embodiment of the invention.

S120: judging whether the time domain microphone signal and the time domain bone conduction signal are voice signals, if so, entering S130; if not, entering S140;

it should be noted that, after the time domain microphone signal and the time domain bone conduction signal at the current moment are acquired, it can be judged whether they are voice signals. Because the time domain bone conduction signal accurately reflects whether the user is currently speaking, judging whether the time domain bone conduction signal is a voice signal also determines whether the time domain microphone signal picked up at the same moment is a voice signal. That is, since the two signals are collected at the same time, when the time domain bone conduction signal at the current moment is determined to be a voice signal, the time domain microphone signal at the current moment is also a voice signal; and when the time domain bone conduction signal at the current moment is determined to be a noise signal, the time domain microphone signal at the current moment is also a noise signal.

S130: carrying out noise elimination processing on the time domain microphone signal through a pre-established DNN noise elimination model to obtain a time domain microphone signal subjected to noise elimination, and carrying out frequency domain noise elimination processing on the time domain bone conduction signal to obtain a time domain bone conduction signal subjected to noise elimination;

it should be noted that, in this embodiment, in order to better eliminate noise, a DNN noise elimination model may be established in advance, and then noise elimination processing may be performed on the time-domain microphone signal by using the DNN noise elimination model, where the DNN noise elimination model is established by:

A time domain noise signal n' and a time domain microphone voice signal s are actually recorded, and a mixed signal s_mix of n' and s is calculated. Time-frequency transformation (such as FFT) is performed on n', s and s_mix respectively to obtain the frequency domain signals N'(k), S(k) and S_mix(k), where k is the frequency bin index. Feature extraction is then performed on S_mix(k) to calculate a first feature parameter.

The time domain microphone voice signal s and the mixed signal s_mix are each divided into a plurality of first sub-bands (for example, 18 first sub-bands) in the frequency domain. The first sub-bands may be divided on the Mel scale or on the Bark scale; which is used may be determined according to actual needs.

After the division, the voice signal energy and the mixed signal energy are calculated on each sub-band, each according to its own calculation relation (the two formulas are not reproduced in this text), where b denotes the sub-band index, b = 0, 1, ...

A first sub-band gain is then calculated according to a further calculation relation (likewise not reproduced in this text), where g(b) denotes the gain of the b-th first sub-band.

Specifically, in the training process of the deep neural network (DNN) noise elimination model, the first feature parameter calculated from the real mixed signal is used as the input and the calculated real first sub-band gain g is used as the target output. The weight coefficients W and U and the biases of the deep neural network are trained and adjusted continuously so that the first gain g' output each time approaches the real first gain g. When the error between g' and g is smaller than the corresponding preset value, the network has been trained successfully, and the final DNN noise elimination model is obtained from the network parameters at that point.
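The training targets described above can be sketched as follows. The energy and gain formulas are missing from this text, so the sketch assumes a common choice: the sub-band energy is the sum of squared bin magnitudes, and the target gain is the square root of the ratio of clean-speech energy to mixed-signal energy, clipped to 1. Both formulas, the function names and the `band_edges` parameter (a list of bin indices delimiting the sub-bands) are assumptions, not the author's confirmed definitions.

```python
import math


def band_energies(spectrum, band_edges):
    """Sum |X(k)|^2 over the bins of each sub-band [edges[b], edges[b+1])."""
    return [
        sum(abs(x) ** 2 for x in spectrum[lo:hi])
        for lo, hi in zip(band_edges, band_edges[1:])
    ]


def target_gains(speech_spectrum, mix_spectrum, band_edges, eps=1e-12):
    """Assumed training target: g(b) = sqrt(E_s(b) / E_mix(b)), clipped to 1."""
    e_s = band_energies(speech_spectrum, band_edges)
    e_mix = band_energies(mix_spectrum, band_edges)
    # eps keeps the division defined for silent sub-bands.
    return [min(1.0, math.sqrt(es / (em + eps))) for es, em in zip(e_s, e_mix)]
```

With Mel or Bark division, `band_edges` would hold the bin indices of the corresponding band boundaries; 18 sub-bands would need 19 edges.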

In addition, when the judgment determines that the time domain bone conduction signal is not a voice signal, the method may further include:

updating the bone conduction noise signal power spectrum according to the time domain bone conduction signal. Specifically, the time domain bone conduction signal is converted into a frequency domain bone conduction signal through time-frequency transformation, and the bone conduction noise signal power spectrum is then updated according to the calculation relation $P_n(k,t) = \beta\,P_n(k,t-1) + (1-\beta)\,|Y(k,t)|^2$, where $P_n(k,t)$ denotes the power of the noise signal received by the bone conduction sensor at time $t$, $P_n(k,t-1)$ denotes the power of the noise signal received by the bone conduction sensor at time $t-1$, $Y(k,t)$ denotes the $k$-th frequency domain bone conduction signal at time $t$, $k$ denotes the frequency bin index, and $\beta$ denotes an iteration factor. $\beta$ may be 0.9; its specific value may be determined according to actual needs and is not particularly limited in this embodiment.
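The recursive update of the noise power spectrum can be written in a few lines. The sketch below applies the relation above independently to every frequency bin; the function name and the list-based representation of the spectrum are assumptions made for illustration.

```python
def update_noise_power(prev_power, frame_spectrum, beta=0.9):
    """P_n(k,t) = beta * P_n(k,t-1) + (1 - beta) * |Y(k,t)|^2 for each bin k."""
    return [
        beta * p + (1.0 - beta) * abs(y) ** 2
        for p, y in zip(prev_power, frame_spectrum)
    ]
```

A value of beta close to 1 makes the estimate change slowly, so a single loud non-speech frame shifts the stored noise power only slightly.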

Correspondingly, the process of performing frequency-domain noise cancellation processing on the time-domain bone conduction signal to obtain a time-domain bone conduction signal after noise cancellation may specifically be:

noise elimination is carried out on the frequency domain bone conduction signal according to a calculation relation (not reproduced in this text) to obtain the noise-eliminated frequency domain bone conduction signal, where $Y_t(k)$ denotes the spectral signal at time $t$, $H_t(k)$ denotes a gain function, $\lambda$ denotes an over-subtraction factor that is a constant (for example 0.9), and $\gamma_t(k)$ denotes the posterior signal-to-noise ratio; the relation also defines a symbol for the noise-eliminated spectral signal, which is missing from this text together with the formula.
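Since the exact relation is missing here, the following is only a sketch of one standard form of power spectral subtraction that uses the same symbols. It assumes that the posterior signal-to-noise ratio is the frame power divided by the estimated noise power, that the gain is the square root of 1 minus λ divided by the posterior SNR (floored at a hypothetical `gain_floor`), and that the noise-eliminated spectrum is the gain times the input spectrum. None of these three forms is confirmed by the source.

```python
import math


def spectral_subtraction(spectrum, noise_power, lam=0.9,
                         gain_floor=0.0, eps=1e-12):
    """Apply an assumed gain H = sqrt(max(1 - lam / gamma, gain_floor)) per bin."""
    out = []
    for y, p_n in zip(spectrum, noise_power):
        gamma = abs(y) ** 2 / (p_n + eps)  # assumed posterior SNR
        h = math.sqrt(max(1.0 - lam / max(gamma, eps), gain_floor))
        out.append(h * y)  # phase of the input bin is kept
    return out
```

Bins whose power is below λ times the noise estimate are driven to the floor, while bins well above the noise are passed almost unchanged.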

S140: setting an output signal corresponding to the current moment to be zero;

specifically, after the time domain bone conduction signal at the current moment is determined to be a noise signal, the corresponding time domain microphone signal is also a noise signal, so that the output signal corresponding to the current moment can be directly set to be zero.

S150: carrying out high-pass filtering processing on the time domain microphone signal subjected to noise elimination to obtain a first output time domain signal, and carrying out low-pass filtering processing on the time domain bone conduction signal subjected to noise elimination to obtain a second output time domain signal;

it should be noted that, because the high frequency of the sound signal collected by the microphone is relatively rich, and the low frequency of the sound signal collected by the bone conduction sensor is relatively clear and complete, the embodiment of the present invention may perform high-pass filtering processing on the time domain microphone signal after noise elimination to obtain a first output time domain signal of the high frequency portion, and perform low-pass filtering processing on the time domain bone conduction signal after noise elimination to obtain a second output time domain signal of the low frequency portion.
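The complementary filtering in S150 can be illustrated with a pair of first-order filters. The source does not specify the filter type, order or cutoff frequency, so the one-pole design, the function names and any cutoff passed in are assumptions; a practical system would likely use steeper filters.

```python
import math


def one_pole_lowpass(signal, cutoff_hz, sample_rate):
    """First-order low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, state = [], 0.0
    for x in signal:
        state += alpha * (x - state)
        out.append(state)
    return out


def one_pole_highpass(signal, cutoff_hz, sample_rate):
    """Complementary high-pass: the input minus its low-pass component."""
    low = one_pole_lowpass(signal, cutoff_hz, sample_rate)
    return [x - l for x, l in zip(signal, low)]
```

Under these assumptions the first output time domain signal would be `one_pole_highpass(mic_denoised, fc, fs)` and the second would be `one_pole_lowpass(bone_denoised, fc, fs)`, where the two input names, `fc` and `fs` are again hypothetical.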

S160: and obtaining an output time domain signal corresponding to the current moment according to the first output time domain signal and the second output time domain signal.

Specifically, the first output time domain signal and the second output time domain signal may be fused. A first weight coefficient k1 corresponding to the first output time domain signal and a second weight coefficient k2 corresponding to the second output time domain signal may be predetermined, and the fused time domain signal is then obtained as a weighted sum according to the relation out = k1*out1 + k2*out2, where out is the fused time domain signal, out1 is the first output time domain signal and out2 is the second output time domain signal.

In addition, to keep the fused time domain signal from overflowing, it can be dynamically adjusted: an overly large signal is compressed and an overly small signal is suitably amplified. The adjusted time domain signal is then used as the output time domain signal corresponding to the current moment.
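The weighted fusion and the dynamic adjustment can be sketched as below. The fusion follows the relation out = k1*out1 + k2*out2 directly. The adjustment is a deliberately crude frame-wise peak scaling chosen for illustration; the source does not describe the actual dynamic range algorithm, and the weights and the `min_peak` and `max_peak` bounds are hypothetical.

```python
def fuse(out1, out2, k1=0.5, k2=0.5):
    """Weighted sum of the two filtered signals: out = k1*out1 + k2*out2."""
    return [k1 * a + k2 * b for a, b in zip(out1, out2)]


def adjust_range(frame, min_peak=0.1, max_peak=0.9):
    """Scale the frame so that its peak lies within [min_peak, max_peak]."""
    peak = max((abs(x) for x in frame), default=0.0)
    if peak > max_peak:
        gain = max_peak / peak  # compress an overly large frame
    elif 0.0 < peak < min_peak:
        gain = min_peak / peak  # amplify an overly small frame
    else:
        gain = 1.0              # already in range, or silent
    return [gain * x for x in frame]
```

Applying an independent gain to each frame can cause audible gain jumps between frames, so a real implementation would normally smooth the gain over time.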

Further, the process of performing frequency domain noise cancellation processing on the time domain bone conduction signal to obtain a time domain bone conduction signal after noise cancellation may specifically be:

converting the time domain bone conduction signal into a frequency domain bone conduction signal through time-frequency conversion;

carrying out frequency domain noise elimination processing on the frequency domain bone conduction signal to obtain a frequency domain bone conduction signal subjected to noise elimination;

judging whether the bandwidth of the frequency domain bone conduction signal subjected to noise elimination reaches a preset bandwidth or not, if so, directly performing time-frequency inverse transformation on the frequency domain bone conduction signal subjected to noise elimination to obtain a time domain bone conduction signal subjected to noise elimination; if the frequency domain bone conduction signal does not meet the preset bandwidth requirement, performing bandwidth expansion on the frequency domain bone conduction signal subjected to noise elimination by adopting a pre-established DNN bandwidth expansion model to enable the expanded bandwidth to reach the preset bandwidth, and performing time-frequency inverse transformation on the expanded frequency domain bone conduction signal to obtain the time domain bone conduction signal subjected to noise elimination.

It should be noted that, after the noise-eliminated frequency domain bone conduction signal is obtained, it may further be determined whether its bandwidth reaches a preset bandwidth (the preset bandwidth may be 1 kHz). If the bandwidth reaches the preset bandwidth, time-frequency inverse transformation is performed directly on the noise-eliminated frequency domain bone conduction signal to obtain the noise-eliminated time domain bone conduction signal. If it does not, bandwidth expansion may be performed on the noise-eliminated frequency domain bone conduction signal by using a pre-established DNN bandwidth expansion model so that the expanded bandwidth reaches the preset bandwidth, and time-frequency inverse transformation is then performed on the expanded frequency domain bone conduction signal to obtain the noise-eliminated time domain bone conduction signal.

The DNN bandwidth extension model is established in the following process:

The bone conduction noise signal n_g remaining after noise reduction and the bone conduction speech signal s_g are actually acquired, and the mixed signal s_g_mix of the bone conduction noise signal n_g and the bone conduction speech signal s_g is calculated. Time-frequency transformation (such as FFT) is performed separately on the bone conduction noise signal n_g, the bone conduction speech signal s_g and the bone conduction mixed signal s_g_mix to obtain the frequency domain signals N_g(k), S_g(k) and S_g_mix(k). Features are then extracted from N_g(k), S_g(k) and S_g_mix(k) respectively, and the respective second feature parameters are calculated.

The bone conduction speech signal s_g and the mixed signal s_g_mix are also divided into a plurality of second sub-bands (for example, 5 second sub-bands) in the frequency domain. The second sub-bands may be divided on the mel frequency scale or as Bark sub-bands; which mode is adopted may be determined according to actual needs. The energy of the bone conduction voice signal and the energy of the bone conduction mixed signal on each second sub-band are then calculated:

wherein the energy of the bone conduction voice signal and the energy of the bone conduction mixed signal on each second sub-band are each calculated by a corresponding calculation relation, and b' denotes the second sub-band sequence number, b' = 0, 1, ...

A second sub-band gain is then calculated by a corresponding calculation relation, wherein g(b') represents the gain of the b'-th second sub-band.
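As a hedged illustration of the sub-band energy and gain calculation, the sketch below uses one common definition: the energy of a sub-band is the sum of squared spectral magnitudes over its frequency points, and the gain is the square root of the ratio of speech energy to mixed-signal energy. These formulas are assumptions and are not taken from the text; the names `band_energies`, `band_gains`, `band_edges` and `eps` are hypothetical, and plain index ranges stand in for a mel or Bark division.

```python
import math


def band_energies(spectrum, band_edges):
    """Sum |X(k)|^2 over each band [band_edges[b], band_edges[b + 1])."""
    return [
        sum(abs(x) ** 2 for x in spectrum[lo:hi])
        for lo, hi in zip(band_edges[:-1], band_edges[1:])
    ]


def band_gains(speech_spec, mix_spec, band_edges, eps=1e-12):
    """Per-band gain sqrt(E_speech / E_mix) (assumed definition).

    `eps` avoids division by zero when a band of the mixture is empty.
    """
    e_speech = band_energies(speech_spec, band_edges)
    e_mix = band_energies(mix_spec, band_edges)
    return [math.sqrt(es / (em + eps)) for es, em in zip(e_speech, e_mix)]
```

With this definition the gain lies close to 1 in bands dominated by speech and close to 0 in bands dominated by residual noise, which is the kind of target a gain-predicting network is usually trained on.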

Specifically, in the training process of the deep neural network (DNN) bandwidth extension model, the true second characteristic parameters obtained by the above calculation are used as the input signal and the true second sub-band gain g obtained by the above calculation is used as the output signal. The weight coefficients W and U and the offsets in the deep neural network are continuously trained and adjusted so that the second gain output each time approaches the true value. When the error between the output second gain and the true value is smaller than the corresponding preset value, the network training is successful, and the final DNN bandwidth expansion model is obtained from the network parameters at that moment.

The process of performing bandwidth extension on the frequency domain bone conduction signal after noise elimination by using a pre-established DNN bandwidth extension model may specifically be: performing feature extraction on the frequency domain bone conduction signal to obtain a second signal feature; processing the second signal feature by using the pre-established DNN bandwidth expansion model to obtain a second gain corresponding to each second frequency point of the frequency domain bone conduction signal;

and calculating the product of the spectrum signal corresponding to each second frequency point in the frequency domain bone conduction signal and the corresponding second gain to obtain the spectrum signal corresponding to each second frequency point and subjected to noise elimination, so as to obtain the frequency domain bone conduction signal subjected to noise elimination.

Further, the process of obtaining the time domain microphone signal after noise elimination by performing noise elimination processing on the time domain microphone signal through the pre-established DNN noise elimination model may specifically be:

carrying out time-frequency transformation on the time domain microphone signals to obtain corresponding frequency domain microphone signals;

calculating the product of the frequency spectrum signal corresponding to each first frequency point in the frequency domain microphone signal and the corresponding first gain to obtain the frequency spectrum signal which corresponds to each first frequency point and is subjected to noise elimination so as to obtain the frequency domain microphone signal subjected to noise elimination;

and performing time-frequency inverse transformation on the frequency domain microphone signal subjected to noise elimination to obtain a time domain microphone signal subjected to noise elimination.
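The three steps above (transform, multiply each frequency point by its gain, inverse transform) can be sketched as below. This is a minimal illustration under stated assumptions: a naive DFT stands in for the FFT so the example needs no external library, and the `gains` list is a hypothetical stand-in for the first gains that the DNN noise elimination model would supply.

```python
import cmath


def dft(frame):
    """Naive discrete Fourier transform of a real or complex frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft_real(spectrum):
    """Inverse DFT, keeping only the real part of each output sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]


def apply_gains(frame, gains):
    """Multiply each frequency point by its gain and return to time domain.

    `gains` holds one real gain per frequency point; for a real-valued
    output they should be mirrored around the Nyquist point.
    """
    spectrum = dft(frame)
    if len(gains) != len(spectrum):
        raise ValueError("one gain per frequency point is required")
    return idft_real([g * y for g, y in zip(gains, spectrum)])
```

A gain of 1 at every frequency point returns the frame unchanged, and a gain of 0 at every point silences it; the model's job is to choose values in between for each point.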

Further, the process of determining whether the time-domain bone conduction signal is a voice signal in S120 may specifically be:

and carrying out voice activation detection on the time domain bone conduction signal so as to judge whether the time domain bone conduction signal is a voice signal.

The above-mentioned process of performing voice activation detection on the time domain bone conduction signal and determining whether the time domain bone conduction signal is a voice signal may specifically be:

calculating a zero crossing rate and a pitch period corresponding to the time domain bone conduction signal;

performing time-frequency transformation on the time domain bone conduction signal to obtain a frequency domain bone conduction signal; specifically, the time domain bone conduction signal can be processed by adopting FFT (fast Fourier transform) to obtain a frequency domain bone conduction signal;

calculating the corresponding spectral energy and spectral centroid of the frequency domain bone conduction signal;

performing fusion judgment on the zero crossing rate, the pitch period, the spectrum energy and the spectrum mass center, and obtaining a voice activation detection mark bit corresponding to the time domain bone conduction signal;

and judging whether the time domain bone conduction signal is a voice signal or not according to the voice activation detection mark bit.

Specifically, the process of calculating the zero crossing rate corresponding to the time-domain bone conduction signal includes:

calculating the zero crossing rate corresponding to the time domain bone conduction signal according to a first calculation relation, wherein the first calculation relation is as follows:

where Z_n represents the number of zero crossings, x(m) represents the time domain signal corresponding to the time variable m, x(m-1) represents the time domain signal corresponding to the time variable m-1, x(n) represents the time domain signal corresponding to the time variable n, and x(n-1) represents the time domain signal corresponding to the time variable n-1; n ≤ N, where N represents the length of the current time domain signal x(n);

ZCR = Z_n / (m2 - m1 + 1), where ZCR represents the zero crossing rate, m1 represents the m1-th point in the current frame of the time domain signal sequence, and m2 represents the m2-th point in the current frame of the time domain signal sequence.
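A minimal sketch of the zero crossing rate follows. It assumes that a zero crossing is a sign change between consecutive samples (a common definition, with zero treated as positive) and that the rate is the count divided by the number of points m2 - m1 + 1; the function names are hypothetical.

```python
def _sign(value):
    """Sign of a sample, treating zero as positive (assumed convention)."""
    return 1 if value >= 0 else -1


def zero_crossings(x, m1, m2):
    """Count sign changes between x[m - 1] and x[m] for m1 < m <= m2."""
    return sum(
        1 for m in range(m1 + 1, m2 + 1) if _sign(x[m]) != _sign(x[m - 1])
    )


def zero_crossing_rate(x, m1=0, m2=None):
    """Zero crossings between points m1 and m2 divided by (m2 - m1 + 1)."""
    if m2 is None:
        m2 = len(x) - 1
    if m2 < m1:
        raise ValueError("the frame must contain at least one point")
    return zero_crossings(x, m1, m2) / (m2 - m1 + 1)
```

Voiced speech picked up by a bone conduction sensor is dominated by low frequencies and so changes sign rarely, whereas broadband noise changes sign often; that is why a high rate counts against speech in the later decision.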

The process of calculating the pitch period corresponding to the time domain bone conduction signal comprises the following steps:

the autocorrelation function is denoted R_m, where R_m represents the autocorrelation function of the speech signal and x(n + m) represents the time domain signal corresponding to the time variable n + m;

the pitch period is: Pitch = max{R_m}, where Pitch denotes the pitch period.
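One way to read the pitch calculation is sketched below, assuming the usual short-time autocorrelation R_m = Σ x(n)·x(n + m) and taking the pitch period as the lag m at which R_m is largest within a search range. Both the summation limits and the search range (`min_lag`, `max_lag`) are assumptions introduced for illustration.

```python
import math


def autocorrelation(x, m):
    """R_m = sum of x[n] * x[n + m] over all n for which both samples exist."""
    return sum(x[n] * x[n + m] for n in range(len(x) - m))


def pitch_lag(x, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the largest autocorrelation.

    Raises ValueError if the frame is too short for the requested range.
    """
    lags = range(min_lag, min(max_lag, len(x) - 1) + 1)
    return max(lags, key=lambda m: autocorrelation(x, m))


# Example: a sine wave that repeats every 8 samples.
frame = [math.sin(2 * math.pi * n / 8) for n in range(64)]
lag = pitch_lag(frame, min_lag=4, max_lag=20)
```

Restricting the search to a lag range keeps the trivial maximum at lag 0 out of the result and limits the estimate to plausible pitch periods.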

The process of calculating the frequency spectrum energy corresponding to the frequency domain bone conduction signal comprises the following steps:

specifically, the spectral energy of a specified bandwidth is used; for example, after the time domain bone conduction signal is subjected to FFT, the 8 kHz bandwidth is divided into 128 sub-bands, and the energy of the lowest 24 sub-bands is taken:

where E_g represents the logarithmic energy of the lowest 24 sub-bands, j represents the sub-band sequence number within the lowest 24 sub-bands, and Y(j) represents the frequency domain signal; the lowest 24 sub-bands are the first 24 of the 128 sub-bands counted from low frequency to high frequency.
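The low-band logarithmic energy can be sketched as follows, assuming one spectral value Y(j) per sub-band and E_g = log10 of the summed squared magnitudes. The logarithm base, the small constant `eps` and the function name are assumptions, not details given in the text.

```python
import math


def low_band_log_energy(spectrum, num_bands=24, eps=1e-12):
    """Log energy of the lowest `num_bands` sub-bands.

    `spectrum` holds one complex value per sub-band, ordered from low
    to high frequency. `eps` keeps the logarithm finite for silence.
    """
    energy = sum(abs(y) ** 2 for y in spectrum[:num_bands])
    return math.log10(energy + eps)
```

Because bone-conducted speech is concentrated at low frequencies, restricting the energy measure to the lowest sub-bands makes it less sensitive to high-frequency sensor noise.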

The process of calculating the spectral centroid corresponding to the frequency domain bone conduction signal comprises the following steps:

brightness = Σ_k f(k)·E(k) / Σ_k E(k), with E(k) = |Y(k)|², where brightness represents the spectrum centroid, f(k) represents the frequency of the k-th frequency point, E(k) represents the spectrum energy of the k-th frequency point, U represents the number of frequency points, and the sums run over all U frequency points.
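Reading the spectrum centroid as the energy-weighted mean frequency (the usual definition, assumed here), a minimal sketch is given below. The function name, the `freqs` argument and the choice to return 0.0 for a silent frame are assumptions.

```python
def spectral_centroid(spectrum, freqs, eps=1e-12):
    """Energy-weighted mean frequency of a spectrum.

    `spectrum[k]` is Y(k) and `freqs[k]` is f(k); E(k) = |Y(k)|^2.
    Returns 0.0 when the total energy is below `eps`.
    """
    energies = [abs(y) ** 2 for y in spectrum]
    total = sum(energies)
    if total < eps:
        return 0.0
    return sum(f * e for f, e in zip(freqs, energies)) / total
```

The result has the same unit as `freqs`, so any threshold compared against it has to be expressed in that unit as well.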

Further, the above process of performing fusion judgment on the zero-crossing rate, the pitch period, the spectral energy and the spectral centroid and obtaining the voice activation detection flag bit corresponding to the time domain bone conduction signal may specifically be:

judging whether the frequency spectrum energy is smaller than a first preset value, if so, setting a voice activation detection flag bit corresponding to the time domain bone conduction signal to be 0; if not, the next step of judgment is carried out;

judging whether the zero crossing rate is greater than a second preset value, if so, setting the voice activation detection flag bit corresponding to the time domain bone conduction signal to be 0, and if not, entering the next judgment;

judging whether the spectrum centroid is larger than a fifth preset value, if so, setting the voice activation detection mark bit corresponding to the time domain bone conduction signal to be 0; otherwise, the voice activation detection flag bit corresponding to the time domain bone conduction signal is 1;

It should be noted that, in practical application, the first preset value may be -9, the second preset value may be 03.6, the third preset value may be 143, the fourth preset value may be 8, and the fifth preset value may be 3. The specific value of each preset value may of course be determined according to the actual situation, and this embodiment is not particularly limited in this respect.

Then, the corresponding process of determining whether the time domain bone conduction signal is a voice signal according to the voice activation detection flag bit may specifically be:

when the voice activation detection mark bit is 1, the time domain bone conduction signal is a voice signal;

when the voice activation detection flag bit is 0, the current time domain bone conduction signal is a noise signal.
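The cascade of threshold checks and the mapping from flag bit to decision can be sketched as follows. Only the three checks spelled out above (spectrum energy, zero crossing rate, spectrum centroid) are included; the thresholds are passed in as parameters with hypothetical names rather than fixed to particular values.

```python
def vad_flag(spectral_energy, zero_crossing_rate, spectral_centroid,
             energy_min, zcr_max, centroid_max):
    """Return the voice activation detection flag bit (1 or 0).

    The frame is rejected as soon as one check fails:
    too little energy, too many zero crossings, or too high a centroid.
    """
    if spectral_energy < energy_min:
        return 0
    if zero_crossing_rate > zcr_max:
        return 0
    if spectral_centroid > centroid_max:
        return 0
    return 1


def is_voice(flag):
    """Flag bit 1 means voice signal, flag bit 0 means noise signal."""
    return flag == 1
```

Ordering the checks from cheapest and most decisive (energy) to the more specific ones lets most silent frames exit at the first comparison.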

Further, the process of performing noise elimination processing on the time-domain microphone signal and the time-domain bone conduction signal in S130 may specifically be:

noise elimination processing is carried out on the time domain microphone signal through a pre-established DNN noise elimination model, and the time domain microphone signal after noise elimination is obtained;

and carrying out frequency domain noise elimination processing on the time domain bone conduction signal to obtain the time domain bone conduction signal after noise elimination.

Therefore, in the embodiment of the invention, the time domain microphone signal is picked up by the microphone and the time domain bone conduction signal is collected by the bone voiceprint sensor. By judging whether the time domain microphone signal and the time domain bone conduction signal are voice signals, it can be determined whether the user is speaking at the current time. When the signals are voice signals, noise elimination processing is performed on the time domain microphone signal through a pre-established DNN noise elimination model, and frequency domain noise elimination processing is performed on the time domain bone conduction signal, so that background noise is better eliminated. High-pass filtering is then performed on the noise-eliminated time domain microphone signal to obtain a first output time domain signal of the high-frequency part, and low-pass filtering is performed on the noise-eliminated time domain bone conduction signal to obtain a second output time domain signal of the low-frequency part. An output time domain signal comprising both the high-frequency part and the low-frequency part can then be obtained from the first output time domain signal and the second output time domain signal. The invention can thus better eliminate background noise, which helps improve sound quality and user experience.

On the basis of the foregoing embodiments, an embodiment of the present invention further provides a speech enhancement apparatus, as shown in Fig. 3. The apparatus includes:

an obtaining module 21, configured to obtain a time-domain microphone signal and a time-domain bone conduction signal at a current time;

the judging module 22 is configured to judge whether the time-domain microphone signal and the time-domain bone conduction signal are voice signals, and if yes, trigger the noise reduction module 23; if not, triggering a zero setting module 24;

the noise reduction module 23 is configured to perform noise reduction processing on the time-domain microphone signal through a pre-established DNN noise reduction model to obtain a time-domain microphone signal after noise reduction, and perform frequency-domain noise reduction processing on the time-domain bone conduction signal to obtain a time-domain bone conduction signal after noise reduction;

a zero setting module 24, configured to set an output signal corresponding to the current time to zero;

the filtering module 25 is configured to perform high-pass filtering on the time-domain microphone signal after the noise is removed to obtain a first output time-domain signal, and perform low-pass filtering on the time-domain bone conduction signal after the noise is removed to obtain a second output time-domain signal;

and the fusion module 26 is configured to obtain an output time domain signal corresponding to the current time according to the first output time domain signal and the second output time domain signal.

It should be noted that the speech enhancement apparatus provided in the embodiment of the present invention has the same beneficial effects as the speech enhancement method provided in the above embodiment, and for the specific description of the speech enhancement method related in the embodiment, please refer to the above embodiment, which is not described herein again.

On the basis of the above embodiment, an embodiment of the present invention further provides a speech enhancement system, including:

a memory for storing a computer program;

a processor for implementing the steps of the speech enhancement method as described above when executing the computer program.

It should be noted that the processor in the embodiment of the present invention may be specifically configured to receive a time-domain microphone signal and a time-domain bone conduction signal at a current time, where the time-domain microphone signal is picked up by a microphone, and the time-domain bone conduction signal is collected by a bone voiceprint sensor; judging whether the time domain microphone signal and the time domain bone conduction signal are voice signals, if so, performing noise elimination processing on the time domain microphone signal through a pre-established DNN noise elimination model to obtain a time domain microphone signal after noise elimination, and performing frequency domain noise elimination processing on the time domain bone conduction signal to obtain a time domain bone conduction signal after noise elimination; if not, setting the output signal corresponding to the current moment to be zero; carrying out high-pass filtering processing on the time domain microphone signal subjected to noise elimination to obtain a first output time domain signal, and carrying out low-pass filtering processing on the time domain bone conduction signal subjected to noise elimination to obtain a second output time domain signal; and obtaining an output time domain signal corresponding to the current moment according to the first output time domain signal and the second output time domain signal.

On the basis of the foregoing embodiments, the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the speech enhancement method as described above.

The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant details can be found in the description of the method.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
