Earplug voice estimation
Reading note: Earplug voice estimation was created by D. L. Watts, B. R. Steele, T. I. Harvey and V. Sapozhnykov on 2018-06-15. Abstract: Embodiments of the present invention use a bone conduction sensor or accelerometer to determine a speech estimate, without applying voice activity detection gating to the speech estimate. The speech estimation is performed either entirely on the basis of the bone conduction signal or in combination with the microphone signal. The speech estimate is then used to condition the output signal of the microphone. There are multiple use cases for such speech processing in audio devices.
1. A signal processing device for earbud speech estimation, the device comprising:
at least one input for receiving a microphone signal from a microphone of an earbud;
at least one input for receiving a bone conduction sensor signal from a bone conduction sensor of the earbud; and
a processor configured to determine from the bone conduction sensor signal at least one characteristic of speech of a user of the earbud, the at least one characteristic being a non-binary variable, the processor further configured to derive at least one signal conditioning parameter from the at least one characteristic of the speech, and the processor further configured to condition the microphone signal using the at least one signal conditioning parameter.
2. The signal processing apparatus of claim 1 wherein the ear bud is a wireless ear bud.
3. A signal processing apparatus according to claim 1 or 2, wherein the non-binary variable characteristic of speech determined by the processor from the bone conduction sensor signal is a speech estimate derived from the bone conduction sensor signal.
4. A signal processing apparatus according to claim 3, wherein the processor is configured such that the conditioning of the microphone signal comprises non-stationary noise reduction controlled by a speech estimate derived from the bone conduction sensor signal.
5. The signal processing apparatus of claim 4, wherein the non-stationary noise reduction is further controlled by a speech estimate derived from the microphone signal.
6. The signal processing device of any of claims 1-5, wherein the processor is configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a speech level of the bone conduction sensor signal.
7. The signal processing device of any of claims 1-6, wherein the processor is configured such that a non-binary variable characteristic of speech determined from the bone conduction sensor signal is an observed spectrum of the bone conduction sensor signal.
8. The signal processing apparatus according to claim 7, wherein the processor is configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a parametric representation of a spectral envelope of the bone conduction sensor signal.
9. The signal processing apparatus according to claim 8, wherein the processor is configured such that the parametric representation of the spectral envelope of the bone conduction sensor signal comprises at least one of linear prediction cepstral coefficients, autoregressive coefficients, and line spectral frequencies.
10. The signal processing apparatus of any of claims 1-9, wherein the processor is configured such that adjustment of an output signal from the microphone occurs regardless of voice activity.
11. The signal processing apparatus of any of claims 1-10, wherein the processor is configured such that the at least one signal conditioning parameter includes a band-specific gain derived from the bone conduction sensor signal, and wherein conditioning the microphone signal includes applying the band-specific gain to the microphone signal.
12. The signal processing apparatus of any of claims 1-11, wherein the processor is configured such that conditioning of the microphone signal includes applying a Kalman filter process in which the bone conduction sensor signal acts as a prior for a speech estimation process.
13. The signal processing apparatus of claim 12, wherein a speech estimate derived from the bone conduction sensor signal is used to modify a decision-directed weighting factor for an a priori SNR estimate.
14. The signal processing apparatus of claim 12, wherein a speech estimate derived from the bone conduction sensor signal is used to inform a Causal Recursive Speech Enhancement (CRSE) update step.
15. The signal processing apparatus of any of claims 1-14, wherein the non-binary variable characteristic of speech determined by the processor from the bone conduction sensor signal is a signal-to-noise ratio of the bone conduction sensor signal.
16. The signal processing device of any of claims 1-15, wherein the processor is configured such that no component of the bone conduction sensor signal is passed to a signal output of the earbud, the bone conduction sensor signal serving only as the basis for determining the at least one characteristic of the speech.
17. The signal processing device of any of claims 1-16, wherein the processor is configured such that the bone conduction sensor signal is corrected for observed conditions prior to determining the non-binary variable characteristic of the speech from the bone conduction sensor signal.
18. A signal processing apparatus according to claim 17, wherein the processor is configured such that the bone conduction sensor signal is corrected for phonemes.
19. A signal processing apparatus according to claim 17 or 18, wherein the processor is configured such that the bone conduction sensor signal is corrected for bone conduction coupling.
20. The signal processing device of any of claims 17-19, wherein the processor is configured to cause the bone conduction sensor signal to be corrected for bandwidth.
21. The signal processing device of any of claims 17-20, wherein the processor is configured such that the bone conduction sensor signal is corrected for distortion.
22. The signal processing device of any of claims 17-21, wherein the processor is configured to perform correction of the bone conduction sensor signal by applying a mapping process.
23. A signal processing apparatus according to claim 22, wherein the mapping process comprises a linear mapping comprising a series of corrections, one associated with each spectral bin of the bone conduction sensor signal.
24. The signal processing apparatus of claim 23, wherein the correction comprises a multiplication and an offset applied to respective spectral bin values of the bone conduction sensor signal.
25. The signal processing device of any of claims 17-24, wherein the processor is configured to perform correction of the bone conduction sensor signal by applying offline learning.
26. The signal processing device of any of claims 1-25, wherein the processor is configured such that adjustment of the microphone signal is based only on non-binary variable characteristics of speech determined from the bone conduction sensor signal.
27. A signal processing apparatus according to any of claims 1 to 26, wherein the bone conduction sensor comprises an accelerometer that is coupled, in use, to a surface of the user's ear canal or outer ear to detect bone conducted signals from the user's speech.
28. A signal processing apparatus according to any of claims 1-27, wherein the bone conduction sensor comprises an in-ear microphone positioned, in use, to detect acoustic sound produced within the ear canal by bone conduction of the user's speech.
29. A signal processing apparatus according to claims 27 and 28, wherein both the accelerometer and the in-ear microphone are used to detect the at least one characteristic of the user's speech.
30. The signal processing device of any of claims 1-29, wherein the processor is configured to apply at least one matched filter to the bone conduction sensor signal, the matched filter configured to match the user's voice in the bone conduction sensor signal to the user's voice in the microphone signal.
31. The signal processing apparatus of claim 30, wherein the at least one matched filter has a training set based design.
32. The signal processing apparatus of any of claims 1-31, wherein the processor is configured to adjust the microphone signal unilaterally without input from any contralateral sensor on the user's other ear.
33. A method of conditioning an earbud microphone signal, the method comprising:
receiving a bone conduction sensor signal from a bone conduction sensor of an earbud;
receiving a microphone signal from a microphone of the earbud;
determining from the bone conduction sensor signal at least one characteristic of speech of a user of the earbud, the at least one characteristic being a non-binary variable;
deriving at least one signal conditioning parameter from the at least one characteristic of the speech; and
conditioning an output signal from the microphone using the at least one signal conditioning parameter.
34. The method of claim 33, wherein the earbud is a wireless earbud.
35. The method of claim 33 or claim 34, wherein the non-binary variable characteristic of speech determined by the processor from the bone conduction sensor signal is a speech estimate derived from the bone conduction sensor signal.
36. The method of claim 35, wherein the processor is configured such that the conditioning of the microphone signal comprises non-stationary noise reduction controlled by a speech estimate derived from the bone conduction sensor signal.
37. The method of claim 36, wherein the non-stationary noise reduction is further controlled by a speech estimate derived from the microphone signal.
38. The method of any of claims 33-37, wherein the processor is configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a speech level of the bone conduction sensor signal.
39. The method of any of claims 33-38, wherein the processor is configured such that a non-binary variable characteristic of speech determined from the bone conduction sensor signal is an observed spectrum of the bone conduction sensor signal.
40. The method according to claim 39, wherein the processor is configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a parametric representation of a spectral envelope of the bone conduction sensor signal.
41. The method according to claim 40, wherein the processor is configured such that the parametric representation of the spectral envelope of the bone conduction sensor signal includes at least one of linear prediction cepstral coefficients, autoregressive coefficients, and line spectral frequencies.
42. The method of any of claims 33-41, wherein the processor is configured such that adjustments to output signals from the microphone occur regardless of voice activity.
43. The method of any of claims 33-42, wherein the processor is configured such that the at least one signal conditioning parameter includes a frequency band-specific gain derived from the bone conduction sensor signal, and wherein conditioning of the microphone signal includes applying the frequency band-specific gain to the microphone signal.
44. The method of any of claims 33-43, wherein the processor is configured such that the conditioning of the microphone signals includes applying a Kalman filter process in which the bone conduction sensor signals act as priors for a speech estimation process.
45. The method of claim 44, wherein a speech estimate derived from the bone conduction sensor signal is used to modify a decision-directed weighting factor for an a priori SNR estimate.
46. The method of claim 44, wherein a speech estimate derived from the bone conduction sensor signal is used to inform a Causal Recursive Speech Enhancement (CRSE) update step.
47. The method of any of claims 33-46, wherein the non-binary variable characteristic of speech determined by the processor from the bone conduction sensor signal is a signal-to-noise ratio of the bone conduction sensor signal.
48. The method of any of claims 33-47, wherein the processor is configured such that no component of the bone conduction sensor signal is passed to a signal output of the earbud, the bone conduction sensor signal serving only as a basis for determining the at least one characteristic of the speech.
49. The method of any of claims 33-48, wherein the processor is configured such that the bone conduction sensor signal is corrected for observed conditions prior to determining non-binary variable characteristics of the speech from the bone conduction sensor signal.
50. The method of claim 49, wherein the processor is configured such that the bone conduction sensor signal is corrected for phonemes.
51. The method of claim 49 or claim 50, wherein the processor is configured such that the bone conduction sensor signal is corrected for bone conduction coupling.
52. The method of any of claims 49-51, wherein the processor is configured to cause the bone conduction sensor signal to be corrected for bandwidth.
53. The method of any of claims 49-52, wherein the processor is configured to cause correction of the bone conduction sensor signal for distortion.
54. The method of any of claims 49-53, wherein the processor is configured to perform correction of the bone conduction sensor signal by applying a mapping process.
55. The method of claim 54, wherein the mapping process comprises a linear mapping including a series of corrections, one associated with each spectral bin of the bone conduction sensor signal.
56. The method of claim 55, wherein the correction comprises a multiplication and an offset applied to respective spectral bin values of the bone conduction sensor signal.
57. The method of any of claims 49-56, wherein the processor is configured to perform correction of the bone conduction sensor signal by applying offline learning.
58. The method of any of claims 33-57, wherein the processor is configured such that the adjustment of the microphone signal is based only on non-binary variable characteristics of speech determined from the bone conduction sensor signal.
59. The method of any of claims 33-58, wherein the bone conduction sensor comprises an accelerometer that is coupled, in use, to a surface of the user's ear canal or outer ear to detect bone conducted signals from the user's speech.
60. The method of any of claims 33-59, wherein the bone conduction sensor comprises an in-ear microphone positioned, in use, to detect acoustic sound generated within an ear canal due to bone conduction of the user's speech.
61. The method of claims 59 and 60, wherein both the accelerometer and the in-ear microphone are used to detect the at least one characteristic of the user's speech.
62. The method of any of claims 33-61, wherein the processor is configured to apply at least one matched filter to the bone conduction sensor signal, the matched filter configured to match the user's voice in the bone conduction sensor signal to the user's voice in the microphone signal.
63. The method of claim 62, wherein the at least one matched filter has a training set based design.
64. The method of any of claims 33-63, wherein the processor is configured to adjust the microphone signal unilaterally without input from any contralateral sensor on the user's other ear.
65. A non-transitory computer-readable medium for conditioning an earbud microphone signal, the non-transitory computer-readable medium comprising instructions that, when executed by one or more processors, cause performance of:
receiving a bone conduction sensor signal from a bone conduction sensor of an earbud;
receiving a microphone signal from a microphone of the earbud;
determining from the bone conduction sensor signal at least one characteristic of speech of a user of the earbud, the at least one characteristic being a non-binary variable;
deriving at least one signal conditioning parameter from the at least one characteristic of the speech; and
conditioning an output signal from the microphone using the at least one signal conditioning parameter.
66. The non-transitory computer-readable medium of claim 65, further configured to perform the method of any of claims 34-64.
Technical Field
The present invention relates to earbud headsets configured to perform speech estimation for functions such as voice capture, and in particular to earbud speech estimation based on bone conduction sensor signals.
Background
The in-ear position of an earbud severely limits the geometry of the device and greatly restricts the ability to place microphones far apart (as required by functions such as beamforming or sidelobe cancellation). Moreover, the small form factor significantly limits the battery size, and thus the power budget, of wireless earbuds.
Speech capture generally refers to the situation where the headset user's voice is captured and any ambient noise, including other people's voices, is minimized. A common scenario for this use case is when the user makes a voice call or interacts with a speech recognition system. Both scenarios place stringent requirements on the underlying algorithm. For voice calls, telephony standards and user expectations demand a high level of noise reduction with excellent sound quality. Similarly, speech recognition systems typically require the audio signal to be modified as little as possible while as much noise as possible is removed. In many signal processing algorithms it is important that the operation of the algorithm changes depending on whether the user is speaking. Voice activity detection, which processes an input signal to determine the presence or absence of speech in the signal, is therefore an important aspect of speech capture and other such algorithms. However, even in larger headsets, such as boom, pendant, and on-ear headsets, it is very difficult to reliably ignore speech from other people located within the beam of the device's beamformer, with the result that such speech can disrupt processing that is meant to capture only the user's voice. These and other aspects of voice capture are particularly difficult to achieve with earbuds, not least because earbuds do not place a microphone near the mouth of the user and thus do not benefit from the significantly improved signal-to-noise ratio that results from such microphone positioning.
Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is solely for the purpose of providing a context for the present invention and it is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present invention as it existed before the priority date of each claim of this application.
Throughout this specification the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.
In this specification, a statement that an element may be "at least one of" a list of options is to be understood to mean that the element may be any one of the listed options, or may be any combination of two or more of the listed options.
Disclosure of Invention
According to a first aspect, the invention provides a signal processing apparatus for earbud speech estimation, the apparatus comprising:
at least one input for receiving a microphone signal from a microphone of an earbud;
at least one input for receiving a bone conduction sensor signal from a bone conduction sensor of the earbud; and
a processor configured to determine from the bone conduction sensor signal at least one characteristic of speech of a user of the earbud, the at least one characteristic being a non-binary variable, the processor further configured to derive at least one signal conditioning parameter from the at least one characteristic of the speech, and the processor further configured to condition the microphone signal using the at least one signal conditioning parameter.
According to a second aspect, the present invention provides a method of conditioning an earbud microphone signal, the method comprising:
receiving a bone conduction sensor signal from a bone conduction sensor of an earbud;
receiving a microphone signal from a microphone of the earbud;
determining from the bone conduction sensor signal at least one characteristic of speech of a user of the earbud, the at least one characteristic being a non-binary variable;
deriving at least one signal conditioning parameter from the at least one characteristic of the speech; and
conditioning an output signal from the microphone using the at least one signal conditioning parameter.
According to a third aspect, the present invention provides a non-transitory computer-readable medium for conditioning an earbud microphone signal, the non-transitory computer-readable medium comprising instructions that, when executed by one or more processors, result in performance of:
receiving a bone conduction sensor signal from a bone conduction sensor of an earbud;
receiving a microphone signal from a microphone of the earbud;
determining from the bone conduction sensor signal at least one characteristic of speech of a user of the earbud, the at least one characteristic being a non-binary variable;
deriving at least one signal conditioning parameter from the at least one characteristic of the speech; and
conditioning an output signal from the microphone using the at least one signal conditioning parameter.
In embodiments, the earbud is a wireless earbud.
In embodiments, the non-binary variable characteristic of speech determined by the processor from the bone conduction sensor signal is a speech estimate derived from the bone conduction sensor signal. In embodiments, the processor may be configured such that the conditioning of the microphone signal includes non-stationary noise reduction controlled by the speech estimate derived from the bone conduction sensor signal. In embodiments, the non-stationary noise reduction may be further controlled by a speech estimate derived from the microphone signal.
In embodiments, the processor may be configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a speech level of the bone conduction sensor signal.
In embodiments, the processor may be configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is an observed spectrum of the bone conduction sensor signal.
In embodiments, the processor may be configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a parametric representation of a spectral envelope of the bone conduction sensor signal.
In embodiments, the processor may be configured such that the parametric representation of the spectral envelope of the bone conduction sensor signal includes at least one of linear prediction cepstral coefficients, autoregressive coefficients, and line spectral frequencies, for example so as to model the human vocal tract and thereby derive the speech envelope.
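As an illustration of one such parametric representation, the following minimal Python sketch computes autoregressive coefficients for a single frame using the autocorrelation method and the Levinson-Durbin recursion. The function names, the model order, and the choice to apply it to a frame of bone conduction sensor samples are assumptions made for illustration; the specification does not prescribe a particular algorithm.

```python
def autocorrelation(frame, order):
    """Biased autocorrelation r[0..order] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for the prediction-error filter.

    Returns (a, err) with a[0] == 1 so that the predictor is
    x[n] ~ -(a[1]*x[n-1] + ... + a[order]*x[n-order]),
    and err is the residual prediction error power.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep what we have
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err


def ar_envelope_params(frame, order=10):
    """Hypothetical helper: AR coefficients describing the frame's spectral envelope."""
    return levinson_durbin(autocorrelation(frame, order), order)
```

The coefficients are non-binary, continuously varying quantities, which is the property the text requires of the speech characteristic.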
In embodiments, the processor may be configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a non-parametric representation of the spectral envelope of the bone conduction sensor signal, such as mel-frequency cepstral coefficients (MFCCs) derived from a model of human sound perception, or logarithmically spaced spectral magnitudes derived from a short-time Fourier transform, which is a preferred method.
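A minimal sketch of logarithmically spaced spectral magnitudes for one frame is given below. It uses a naive DFT so that it depends only on the Python standard library; the frame length, the number of bands, the Hann window, and all function names are assumptions chosen for illustration rather than details taken from the specification.

```python
import cmath
import math


def frame_magnitudes(frame):
    """Magnitudes of bins 0..N/2 of one Hann-windowed frame (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(w * cmath.exp(-2j * math.pi * k * i / n)
                  for i, w in enumerate(windowed))
        mags.append(abs(acc))
    return mags


def log_band_edges(n_bins, n_bands):
    """Band edges (bin indices) spaced geometrically from bin 1 to n_bins.

    Band b covers bins edges[b] .. edges[b + 1] - 1; the DC bin is skipped.
    Assumes n_bands is small enough for the edges to stay below n_bins.
    """
    edges = [1]
    for b in range(1, n_bands + 1):
        edge = round(n_bins ** (b / n_bands))
        edges.append(max(edge, edges[-1] + 1))
    return edges


def log_band_levels_db(frame, n_bands=4):
    """Mean power per logarithmically spaced band, in dB."""
    mags = frame_magnitudes(frame)
    edges = log_band_edges(len(mags), n_bands)
    levels = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        power = sum(m * m for m in mags[lo:hi]) / (hi - lo)
        levels.append(10.0 * math.log10(power + 1e-12))
    return levels
```

Applied to successive frames of the bone conduction sensor signal, this would yield a small vector of band levels per frame that can serve as a non-parametric envelope.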
In embodiments, the processor may be configured such that the adjustment of the output signal from the microphone occurs regardless of voice activity.
In embodiments, the processor may be configured such that the at least one signal conditioning parameter includes a frequency band-specific gain derived from the bone conduction sensor signal, and wherein conditioning the microphone signal includes applying the frequency band-specific gain to the microphone signal.
In embodiments, the processor may be configured such that the conditioning of the microphone signal includes applying a Kalman filter process in which the bone conduction sensor signal acts as a prior for a speech estimation process. In embodiments, a speech estimate may be derived from the bone conduction sensor signal and used to modify a decision-directed weighting factor for an a priori SNR estimate. In embodiments, a speech estimate derived from the bone conduction sensor signal may be used to inform a Causal Recursive Speech Enhancement (CRSE) update step.
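The decision-directed a priori SNR estimate mentioned above can be sketched per frequency bin as follows. The recursion itself is the standard decision-directed form; how the bone conduction speech estimate modifies the weighting factor is not specified in the text, so the `alpha_from_bc` mapping, its range limits, and the normalised `bc_speech_level` input are hypothetical choices made only for illustration.

```python
def decision_directed_snr(noisy_power, noise_power, prev_clean_power, alpha):
    """Decision-directed a priori SNR estimate for one frequency bin.

    alpha weights the previous frame's clean-speech estimate against the
    current maximum-likelihood estimate max(gamma - 1, 0).
    """
    gamma = noisy_power / noise_power  # a posteriori SNR
    return (alpha * prev_clean_power / noise_power
            + (1.0 - alpha) * max(gamma - 1.0, 0.0))


def alpha_from_bc(bc_speech_level, alpha_min=0.90, alpha_max=0.98):
    """Hypothetical mapping from a bone conduction speech level in [0, 1]
    to the weighting factor: more own-voice evidence lowers alpha so the
    estimate tracks speech onsets faster; less evidence smooths more.
    """
    level = min(max(bc_speech_level, 0.0), 1.0)
    return alpha_max - (alpha_max - alpha_min) * level
```

For example, `decision_directed_snr(4.0, 1.0, 0.0, alpha_from_bc(0.5))` blends the two terms with a weighting factor midway between the assumed limits. Because the weighting varies continuously with the bone conduction estimate, no binary speech/non-speech decision is involved.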
In embodiments, the non-binary variable characteristic of speech determined by the processor from the bone conduction sensor signal may be a signal-to-noise ratio of the bone conduction sensor signal.
In embodiments, the processor may be configured such that no component of the bone conduction sensor signal is passed to the signal output of the earbud, the bone conduction sensor signal serving only as the basis for determining the at least one characteristic of the speech.
In embodiments, the processor may be configured such that the bone conduction sensor signal is corrected for observed conditions prior to determining the non-binary variable characteristic of the speech from the bone conduction sensor signal. In embodiments, the bone conduction sensor signal may be corrected for phonemes. In embodiments, the bone conduction sensor signal may be corrected for bone conduction coupling. In embodiments, the bone conduction sensor signal may be corrected for bandwidth. In embodiments, the bone conduction sensor signal may be corrected for distortion. In embodiments, the processor may be configured to perform the correction of the bone conduction sensor signal by applying a mapping process. In embodiments, the mapping process may include a linear mapping comprising a series of corrections, one associated with each spectral bin of the bone conduction sensor signal; for example, the corrections may include a multiplication and an offset applied to the respective spectral bin values. In embodiments, the processor may be configured to perform the linear mapping, including the respective per-bin offset corrections, on the bone conduction sensor signal.
In embodiments, the processor may be configured such that the adjustment to the microphone signal is based only on non-binary variable characteristics of the speech determined from the bone conduction sensor signal.
In embodiments, the bone conduction sensor may include an accelerometer that, in use, is coupled to a surface of the ear canal or outer ear of the user to detect signals from bone conduction of the user's speech.
In embodiments, the bone conduction sensor may include an in-ear microphone positioned, in use, to detect acoustic sound generated within the ear canal by bone conduction of the user's speech. In embodiments, both the accelerometer and the in-ear microphone may be used to detect the at least one characteristic of the user's speech.
In embodiments, the processor may be configured to apply at least one matched filter to the bone conduction sensor signal, the matched filter configured to match the user's voice in the bone conduction sensor signal with the user's voice in the microphone signal. In embodiments, the matched filter may have a training set based design.
In embodiments, the processor may be configured to adjust the microphone signal unilaterally without input from any contralateral sensor on the user's other ear.
An earpiece is defined herein as an audio headphone device, whether wired or wireless, which in use is supported solely or substantially by the ear on which it is placed, and which comprises an earpiece body which in use is located substantially or entirely within the ear canal and/or within the concha of the pinna.
Drawings
Embodiments of the invention will now be described with reference to the accompanying drawings, in which:
FIG. 1 illustrates the use of a wireless earbud for telephony and/or audio playback;
FIG. 2 is a system schematic of an earbud according to an embodiment of the invention;
FIGS. 3a and 3b are detailed system schematics of the earbud of FIG. 2;
FIG. 4 is a flow chart of the earbud speech estimation process of the embodiment of FIG. 3;
FIG. 5 illustrates a noise suppressor for telephony according to another embodiment of the invention;
FIG. 6 illustrates an embodiment including a speech estimator that uses a statistical model based estimation process;
FIG. 7 illustrates a microphone-accelerometer mixing method based on a mixing factor using SNR estimation;
FIG. 8 illustrates the configuration of another embodiment of the invention;
FIG. 9 illustrates an embodiment that applies speech estimation from bone conduction sensor signals to a telephony use case; and
FIG. 10 shows objective Mean Opinion Score (MOS) results for an embodiment of the present invention.
Detailed Description
The
Fig. 2 illustrates a system of
The microphone signal from the
The
In alternative embodiments, the bone conduction sensor may be coupled to the outer ear, or mounted on any part of the headphone body that reliably contacts the ear canal or ear within the outer ear. The use of an earplug allows a reliable direct contact with the ear canal and thus a mechanical coupling to the vibrational model of bone-conducted speech as measured at the ear canal wall. This is in contrast to the external temples, cheeks or skull where a mobile device (such as a telephone) may make contact. The present invention recognizes that a speech model of bone conduction derived from portions of the anatomy outside the ear yields a signal with greatly reduced reliability of speech estimation compared to the embodiments described herein. The present invention recognizes that the use of a bone conduction sensor in a wireless ear bud is sufficient to perform speech estimation. This is because, unlike headphones outside of the earpiece or ear, the nature of the bone conduction sensor signal from the wireless earbud is largely static with respect to user fit, user motion, and user movement. For example, the present invention recognizes that compensation of the bone conduction sensor is not required for fit or proximity. Therefore, the choice of the ear canal or outer ear as the location of the bone conduction sensor is a key enabler of the present invention. In turn, the invention then turns to a transformation of the signal that derives the temporal and spectral characteristics that best recognize the user's speech.
The
Unlike an earpiece or corded headset, in which the primary voice microphone is closer to the mouth and the difference in the way the user holds the phone/lanyard can result in a wide range of SNRs, in this embodiment, the SNR on the
A sufficient condition for contact between the
The
It is worth noting that, for a headset form factor comprising a wireless earbud provided with at least one microphone and at least one accelerometer, the present embodiment provides noise reduction that is applied in a controlled, graded manner rather than in a binary on-off manner, based on a speech estimate derived from the bone conduction sensor signal.
Voice Activity Detection (VAD) is a method of improving speech estimation, but inherently relies on an imperfect concept of binary recognition of the presence or absence of speech in noisy signals.
Fig. 3a and 3b illustrate in more detail the configuration of the
In more detail, in fig. 3, the microphone signal from
In such an embodiment, the selection of the
The
In particular, embodiments employing multiple
The signal from
Notably, the configuration of fig. 3 omits any Voice Activity Detection (VAD). Many speech enhancement methods rely on various estimates of the speech signal, and obtaining these estimates becomes challenging when the microphone speech signal is degraded by ambient noise: their accuracy typically decreases as the ambient noise level rises. Uses of speech estimation include wind noise suppression, a priori SNR estimation for noise suppression, gain function biasing for noise suppression, beamforming adaptation (blocking matrix update), adaptive control for acoustic echo cancellation, a priori speech-to-echo ratio estimation for echo suppression, adaptive thresholding (level difference and cross-correlation) for VAD, and adaptive windowing (minima controlled recursive averaging; MCRA) for stationary noise estimation.
In this embodiment of the invention, the processing and subsequent adjustment of the
The embodiment of fig. 3 recognizes that speech estimation using a microphone and bone conduction sensor can improve speech estimation for such purposes. The speech estimate may be derived from a bone conduction sensor (e.g., accelerometer 230) or a combination of both
The bone conduction sensor signal is conditioned on the observed conditions and may be corrected for phonemes, sensor bandwidth, and/or distortion. For example, the correction may involve a linear mapping that applies a series of corrections, one associated with each spectral bin, such as applying a multiplier or an offset to each bin value.
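A per-bin linear mapping of this kind might look like the sketch below. The function name, the parameter names, and the choice to apply both a multiplier and an offset to every bin are assumptions for illustration; the text does not say whether the bin values are magnitudes, powers, or log values, so the sketch leaves the domain open.

```python
from typing import List, Sequence


def correct_bc_spectrum(bins: Sequence[float],
                        multipliers: Sequence[float],
                        offsets: Sequence[float]) -> List[float]:
    """Apply an independent linear correction (m * x + b) to each spectral bin.

    A multiplier of 1.0 and an offset of 0.0 leave a bin unchanged.
    """
    if not (len(bins) == len(multipliers) == len(offsets)):
        raise ValueError("bins, multipliers and offsets must have equal length")
    return [m * x + b for x, m, b in zip(bins, multipliers, offsets)]


if __name__ == "__main__":
    # Hypothetical three-bin spectrum with hand-picked correction values.
    print(correct_bc_spectrum([1.0, 2.0, 4.0], [2.0, 1.0, 0.5], [0.0, 1.0, 0.0]))
```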
The speech estimate may be derived from the
Fig. 3b provides more detail of the earbud
Notably, fig. 3a and 3b depict a
A noise suppressor for a telephone as shown in fig. 5 may use an estimator to produce a clean speech signal that is transmitted over a telephone network to a remote recipient. Embodiments of the noise suppressor include spectral subtraction, Wiener filtering methods, and statistical model-based methods.
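For orientation, the textbook per-bin forms of two of these suppressor families are sketched below. These are generic formulations, not necessarily the ones used in the embodiment of fig. 5; the function names and the spectral floor value are assumptions.

```python
def wiener_gain(a_priori_snr: float) -> float:
    """Classic Wiener gain for one bin: G = xi / (1 + xi), with xi >= 0."""
    if a_priori_snr < 0.0:
        raise ValueError("a priori SNR must be non-negative")
    return a_priori_snr / (1.0 + a_priori_snr)


def spectral_subtract_power(noisy_power: float,
                            noise_power: float,
                            floor: float = 0.05) -> float:
    """Power spectral subtraction for one bin.

    The noise power is subtracted from the noisy power, and the result is
    kept above a small fraction (floor) of the noisy power so that it never
    becomes negative.
    """
    return max(noisy_power - noise_power, floor * noisy_power)
```

In both cases the quality of the output depends on how good the speech or noise estimate fed into the gain rule is, which is where the bone conduction speech estimate is intended to help.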
An example of an implementation of a speech estimator using a statistical model-based estimation process is shown in FIG. 6. The air-conduction (microphone) speech estimate, the bone-conduction speech estimate, and the corresponding SNR estimates are each derived from a causal recursive speech enhancement process. The a priori SNR estimates from each process are then combined to derive mixing coefficients, which adjust the user speech estimate to arrive at the final speech estimator. It is important to note that in this process, neither the microphone nor the accelerometer sensor signals are used to derive the noise model. Instead, the information content of the signals, which is shaped by the wireless earbud form factor, allows a direct speech estimation process.
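The text does not specify which causal recursive estimator is used. One widely known example is the decision-directed a priori SNR rule, sketched here for a single frequency bin purely as an assumption. The smoothing factor alpha, the function names, and the way the noise power is supplied as an input (how it is obtained is outside this sketch) are all hypothetical.

```python
from typing import List, Sequence


def a_priori_snr(noisy_power: float,
                 noise_power: float,
                 prev_clean_power: float,
                 alpha: float = 0.98) -> float:
    """Decision-directed a priori SNR for one bin and one frame.

    Blends the previous frame's clean-speech estimate with the current
    frame's instantaneous (posterior SNR minus one) estimate.
    """
    if noise_power <= 0.0:
        raise ValueError("noise_power must be positive")
    instantaneous = max(noisy_power / noise_power - 1.0, 0.0)
    return alpha * (prev_clean_power / noise_power) + (1.0 - alpha) * instantaneous


def track_bin(noisy_powers: Sequence[float],
              noise_power: float,
              alpha: float = 0.98) -> List[float]:
    """Run the recursion over successive frames of one bin (causal: each
    frame uses only the current and previous frames)."""
    snrs: List[float] = []
    prev_clean = 0.0
    for power in noisy_powers:
        xi = a_priori_snr(power, noise_power, prev_clean, alpha)
        gain = xi / (1.0 + xi)             # Wiener gain from the SNR estimate
        prev_clean = (gain ** 2) * power   # clean-speech power fed to next frame
        snrs.append(xi)
    return snrs
```

The same recursion could be run separately on the microphone and bone conduction spectra to obtain the two a priori SNR estimates that are later combined.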
In another embodiment, the application can be used to generate a signal that provides a latent representation of speech suitable for an Automatic Speech Recognition (ASR) system.
This differs from using the same dynamics for speech detection (which has general application in the field of voice activity detectors), in that the temporal and spectral dynamics of the bone conduction signal are used to derive a speech model in the presence of stationary noise signals.
The bone conduction spectral envelope in the earplug can be corrected to weight feature importance, but a matched signal is not necessary for designing the conditioning parameters.
The method of deriving a speech estimator using a bone conduction sensor, as opposed to a speech detector (VAD), may be elaborated further in the context of the present invention. Traditionally, the quality of a noise suppressor depends on an estimate of the noise spectrum, which is typically derived from measurements made during speech gaps, as identified by a binary decision device such as a VAD.
The present invention does not use the bone conduction sensor in the process of building the noise model. Therefore, the construction of the noise model does not require a Voice Activity Detector (VAD) derived from the bone conduction sensor. This is an important difference from other proposals that use bone conduction sensors in place of microphones: in such proposals, the noise model typically must be accurate for speech enhancement to work, so the bone conduction sensor is used to help derive that model.
In contrast, the basic assumption of a bone conduction sensor in an earplug is that the bone conduction sensor signal representing speech contains sufficient time and spectral content to derive a non-binary signal representing the user's speech.
In embodiments (FIG. 6), the speech model from a noisy microphone may be completed with a causal recursive speech estimator that needs to estimate the noise varianceThe rate spectrum is treated as a priori (prior) of the user's speech by means of a representation of the vibrations of their ear canal. The net speech microphone signal can be roughly estimated without transformation. In this case, it is regarded as Sbc bone conduction speech estimation instead of the net speech estimation adjusted on the bone conduction sensor, i.e.
In embodiments, Sbc
It is noted that these embodiments do not use off-line processing to derive a transformation from the bone conduction signal to a clean air conduction microphone signal, nor do they use, for example, a synthesized signal as a hypothetical estimate.
FIG. 7 illustrates a microphone-accelerometer mixing method based on mixing factors derived from SNR estimates, providing a method that combines a priori SNR estimates from a microphone and an accelerometer (BC sensor). This may be particularly applicable to low-SNR environments, using the best speech estimate among the SNR estimates. Thus, the clean speech estimate and the a priori SNR estimate derived from a bone conduction sensor signal are applications of the speech estimation techniques controlled by a bone conduction sensor signal according to the present invention.
Secondary noise reduction is then performed on this mixed signal.
This is in contrast to using the VAD to derive the noise estimate and then determining the blend ratio.
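A mixing step driven by the two a priori SNR estimates could be sketched as below. The specific weighting rule (each sensor weighted by its share of the total SNR) and all names are assumptions for illustration; the text states only that the mixing factors are derived from the SNR estimates. Note that the weight is continuous rather than a binary choice between sensors.

```python
from typing import List, Sequence


def mixing_weight(snr_mic: float, snr_bc: float) -> float:
    """Weight given to the bone conduction signal for one bin (0.0 to 1.0).

    Hypothetical rule: the bone conduction share of the combined a priori
    SNR. If both SNRs are zero, the two sensors are weighted equally.
    """
    total = snr_mic + snr_bc
    if total <= 0.0:
        return 0.5
    return snr_bc / total


def mix_bins(mic_bins: Sequence[float], bc_bins: Sequence[float],
             snr_mic: Sequence[float], snr_bc: Sequence[float]) -> List[float]:
    """Blend microphone and bone conduction spectra bin by bin."""
    if not (len(mic_bins) == len(bc_bins) == len(snr_mic) == len(snr_bc)):
        raise ValueError("all inputs must have the same number of bins")
    mixed = []
    for mic, bc, s_mic, s_bc in zip(mic_bins, bc_bins, snr_mic, snr_bc):
        w = mixing_weight(s_mic, s_bc)
        mixed.append((1.0 - w) * mic + w * bc)
    return mixed
```

The mixed spectrum would then be passed to the secondary noise reduction stage described above.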
Other embodiments of the present invention may expand on this idea by discarding the speech estimates from the speech enhancement blocks 710, 720 and instead mixing the noisy signals according to the SNR estimates and then performing a secondary noise reduction.
Fig. 8 illustrates the configuration of the
Fig. 9 illustrates another embodiment according to the present invention, in which speech estimation from bone conduction sensor signals is applied to phone use cases.
Embodiments of the present invention recognize that, although the frequency response of an in-ear accelerometer is poor compared with a microphone, or even with a bone conduction sensor or the like mounted at the temple, the in-ear accelerometer signal can be used not only for speech estimation but also for graded, non-binary control based on that estimate, such as controlling non-stationary noise reduction in a multi-step or graded manner. In more detail, the low-pass frequency response and relatively poor sensitivity of the earbud inertial sensor are limitations of the bone conduction model at the external ear canal. Bone conduction sensors for vibration are usually of the magnetic type and are usually mounted on other parts of the head, such as the temporal bone or mastoid bone, with the elastic force of a headband or the like maintaining firm contact. However, this mounting location and technique is poorly suited to headphones for audio applications and is incompatible with the preferred headphone form factor. By using the inertial sensor of an earplug, the present invention remains compatible with the preferred headphone form factor.
The time-frequency model of estimating speech in the ear canal is therefore a different problem, as the inventors have found that the observable frequency range of ear canal bone conduction signals is typically below 1 kHz.
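As a worked example of what the sub-1 kHz range means in a time-frequency representation, the sketch below lists the FFT bins whose centre frequency lies below a cutoff. The 16 kHz sample rate and 256-point FFT are assumptions chosen for illustration and are not taken from the described embodiments.

```python
from typing import List


def bins_below(cutoff_hz: float, sample_rate_hz: float, fft_size: int) -> List[int]:
    """Indices of one-sided FFT bins whose centre frequency is below cutoff_hz."""
    bin_width_hz = sample_rate_hz / fft_size
    n_bins = fft_size // 2 + 1  # bins from DC up to the Nyquist frequency
    return [k for k in range(n_bins) if k * bin_width_hz < cutoff_hz]


if __name__ == "__main__":
    usable = bins_below(1000.0, 16000.0, 256)
    print(len(usable), "of", 256 // 2 + 1, "bins lie below 1 kHz")
```

Under these assumed settings each bin is 62.5 Hz wide, so only bins 0 to 15 (16 of 129 bins) fall below 1 kHz, which illustrates how little of the spectrum the bone conduction signal covers directly.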
Fig. 10 shows objective Mean Opinion Score (MOS) results for the embodiment of fig. 9, which shows the improvement when adjusting the a priori speech envelope from the
In other applications, such as earpieces, the contribution of the bone conduction estimate or of the microphone spectrum estimate to the combined estimate may drop to zero, in time and frequency, if the use case causes either sensor's signal quality to be poor. This is not the case in the wireless earbud application of this embodiment.
While the described embodiment provides for the voice estimation/
The
Unlike
Wireless communication is understood to mean any type of communication, monitoring, or control system in which electromagnetic or acoustic waves carry signals through the atmosphere or free space, rather than along wires.
Corresponding reference characters indicate corresponding parts throughout the drawings.
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.