Earplug voice estimation

Document No. 1581274 · Published 2020-01-31

Filing note: This technology, "Earbud voice estimation", was devised by D. L. Watts, B. R. Steele, T. I. Harvey, and V. Sapozhnikov on 2018-06-15. Abstract: Embodiments of the present invention use bone conduction sensors or accelerometers to determine speech estimates, rather than voice activity detection gating of speech estimates. The speech estimation is performed either entirely on the basis of the bone conduction signal or in combination with the microphone signal. The speech estimate is then used to adjust the output signal of the microphone. There are a number of example uses for speech processing in audio devices.

1. A signal processing device for earbud speech estimation, the device comprising:

at least one input for receiving a microphone signal from a microphone of the earbud;

at least one input for receiving a bone conduction sensor signal from a bone conduction sensor of the earbud;

a processor configured to determine from the bone conduction sensor signal at least one characteristic of a voice of a user of the earbud, the at least one characteristic being a non-binary variable, the processor further configured to derive at least one signal conditioning parameter from the at least one characteristic of the voice, and the processor further configured to condition the microphone signal using the at least one signal conditioning parameter.

2. The signal processing apparatus of claim 1, wherein the earbud is a wireless earbud.

3. A signal processing apparatus according to claim 1 or 2, wherein the non-binary variable characteristic of speech determined by the processor from the bone conduction sensor signal is a speech estimate derived from the bone conduction sensor signal.

4. A signal processing apparatus according to claim 3, wherein the processor is configured such that the conditioning of the microphone signal comprises non-stationary noise reduction controlled by a speech estimate derived from the bone conduction sensor signal.

5. The signal processing apparatus of claim 4, wherein the non-stationary noise reduction is further controlled by a speech estimate derived from the microphone signal.

6. The signal processing device of any of claims 1-5, wherein the processor is configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a speech level of the bone conduction sensor signal.

7. The signal processing device of any of claims 1-6, wherein the processor is configured such that a non-binary variable characteristic of speech determined from the bone conduction sensor signal is an observed spectrum of the bone conduction sensor signal.

8. The signal processing apparatus according to claim 7, wherein the processor is configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a parametric representation of a spectral envelope of the bone conduction sensor signal.

9. The signal processing apparatus according to claim 8, wherein the processor is configured such that the parametric representation of the spectral envelope of the bone conduction sensor signal comprises at least one of linear prediction cepstral coefficients, autoregressive coefficients, and line spectral frequencies.

10. The signal processing apparatus of any of claims 1-9, wherein the processor is configured such that adjustment of an output signal from the microphone occurs regardless of voice activity.

11. The signal processing apparatus of any of claims 1-10, wherein the processor is configured such that the at least one signal conditioning parameter includes a band-specific gain derived from the bone conduction sensor signal, and wherein conditioning the microphone signal includes applying the band-specific gain to the microphone signal.

12. The signal processing apparatus of any of claims 1-11, wherein the processor is configured such that the conditioning of the microphone signal includes applying a Kalman filter process in which the bone conduction sensor signal acts as a prior for a speech estimation process.

13. The signal processing apparatus of claim 12, wherein a speech estimate derived from the bone conduction sensor signal is used to modify a decision-directed weighting factor for an a priori SNR estimate.

14. The signal processing apparatus of claim 12, wherein a speech estimate derived from the bone conduction sensor signal is used to inform a Causal Recursive Speech Enhancement (CRSE) update step.

15. The signal processing apparatus of any of claims 1-14, wherein the non-binary variable characteristic of speech determined by the processor from the bone conduction sensor signal is a signal-to-noise ratio of the bone conduction sensor signal.

16. The signal processing device of any of claims 1-15, wherein the processor is configured such that no component of the bone conduction sensor signal is passed to a signal output of the earbud, other than the bone conduction sensor signal serving as the basis for determining the at least one characteristic of the speech.

17. The signal processing device of any of claims 1-16, wherein the processor is configured such that the bone conduction sensor signal is corrected for observed conditions prior to determining non-binary variable characteristics of the speech from the bone conduction sensor signal.

18. A signal processing apparatus according to claim 17, wherein the processor is configured such that the bone conduction sensor signal is corrected for phonemes.

19. A signal processing apparatus according to claim 17 or 18, wherein the processor is configured such that the bone conduction sensor signal is corrected for bone conduction coupling.

20. The signal processing device of any of claims 17-19, wherein the processor is configured to cause the bone conduction sensor signal to be corrected for bandwidth.

21. The signal processing device of any of claims 17-20, wherein the processor is configured such that the bone conduction sensor signal is corrected for distortion.

22. The signal processing device of any of claims 17-21, wherein the processor is configured to perform correction of the bone conduction sensor signal by applying a mapping process.

23. A signal processing apparatus according to claim 22, wherein the mapping process comprises a linear mapping comprising a series of corrections associated with each spectral bin of the bone conduction sensor signal.

24. The signal processing apparatus of claim 23, wherein the correction comprises a multiplication and an offset applied to respective spectral bin values of the bone conduction sensor signal.

25. The signal processing device of any of claims 17-24, wherein the processor is configured to perform correction of the bone conduction sensor signal by applying offline learning.

26. The signal processing device of any of claims 1-25, wherein the processor is configured such that adjustment of the microphone signal is based only on non-binary variable characteristics of speech determined from the bone conduction sensor signal.

27. A signal processing apparatus according to any of claims 1 to 26, wherein the bone conduction sensor comprises an accelerometer that is coupled, in use, to a surface of the user's ear canal or outer ear to detect bone conducted signals from the user's speech.

28. A signal processing apparatus according to any of claims 1-27, wherein the bone conduction sensor comprises an in-ear microphone positioned, in use, to detect acoustic sound produced within the ear canal by bone conduction of the user's speech.

29. A signal processing apparatus according to claims 27 and 28, wherein both the accelerometer and the in-ear microphone are used to detect the at least one characteristic of the user's voice.

30. The signal processing device of any of claims 1-29, wherein the processor is configured to apply at least one matched filter to the bone conduction sensor signal, the at least one matched filter configured to match the user's voice in the bone conduction sensor signal to the user's voice in the microphone signal.

31. The signal processing apparatus of claim 30, wherein the at least one matched filter has a training-set-based design.

32. The signal processing apparatus of any of claims 1-31, wherein the processor is configured to adjust the microphone signal unilaterally without input from any contralateral sensor on the user's other ear.

33. A method of conditioning an earbud microphone signal, the method comprising:

receiving a bone conduction sensor signal from a bone conduction sensor of an earbud;

receiving a microphone signal from a microphone of the earbud;

determining from the bone conduction sensor signal at least one characteristic of a voice of a user of the earbud, the at least one characteristic being a non-binary variable;

deriving at least one signal conditioning parameter from the at least one characteristic of the speech; and

adjusting an output signal from the microphone using the at least one signal conditioning parameter.

34. The method of claim 33, wherein the earbud is a wireless earbud.

35. The method of claim 33 or claim 34, wherein the non-binary variable characteristic of speech determined by the processor from the bone conduction sensor signal is a speech estimate derived from the bone conduction sensor signal.

36. The method of claim 35, wherein the processor is configured such that the adjustment of the microphone signal comprises non-static noise reduction controlled by a speech estimate derived from the bone conduction sensor signal.

37. The method of claim 36, wherein the non-static noise reduction is further controlled by a speech estimate derived from the microphone signal.

38. The method of any of claims 33-37, wherein the processor is configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a speech level of the bone conduction sensor signal.

39. The method of any of claims 33-38, wherein the processor is configured such that a non-binary variable characteristic of speech determined from the bone conduction sensor signal is an observed spectrum of the bone conduction sensor signal.

40. The method according to claim 39, wherein the processor is configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a parametric representation of a spectral envelope of the bone conduction sensor signal.

41. The method according to claim 40, wherein the processor is configured such that the parametric representation of the spectral envelope of the bone conduction sensor signal includes at least one of linear prediction cepstral coefficients, autoregressive coefficients, and line spectral frequencies.

42. The method of any of claims 33-41, wherein the processor is configured such that adjustments to output signals from the microphone occur regardless of voice activity.

43. The method of any of claims 33-42, wherein the processor is configured such that the at least one signal conditioning parameter includes a frequency band-specific gain derived from the bone conduction sensor signal, and wherein conditioning of the microphone signal includes applying the frequency band-specific gain to the microphone signal.

44. The method of any of claims 33-43, wherein the processor is configured such that the conditioning of the microphone signals includes applying a Kalman filter process in which the bone conduction sensor signals act as priors for a speech estimation process.

45. The method of claim 44, wherein a speech estimate derived from the bone conduction sensor signal is used to modify a decision-directed weighting factor for an a priori SNR estimate.

46. The method of claim 44, wherein a speech estimate derived from the bone conduction sensor signal is used to inform a Causal Recursive Speech Enhancement (CRSE) update step.

47. The method of any of claims 33-46, wherein the non-binary variable characteristic of speech determined by the processor from the bone conduction sensor signal is a signal-to-noise ratio of the bone conduction sensor signal.

48. The method of any of claims 33-47, wherein the processor is configured such that no component of the bone conduction sensor signal is passed to a signal output of the earbud, other than the bone conduction sensor signal serving as the basis for determining the at least one characteristic of the speech.

49. The method of any of claims 33-48, wherein the processor is configured such that the bone conduction sensor signal is corrected for observed conditions prior to determining non-binary variable characteristics of the speech from the bone conduction sensor signal.

50. The method of claim 49, wherein the processor is configured such that the bone conduction sensor signal is corrected for phonemes.

51. The method of claim 49 or claim 50, wherein the processor is configured such that the bone conduction sensor signal is corrected for bone conduction coupling.

52. The method of any of claims 49-51, wherein the processor is configured to cause the bone conduction sensor signal to be corrected for bandwidth.

53. The method of any of claims 49-52, wherein the processor is configured to cause correction of the bone conduction sensor signal for distortion.

54. The method of any of claims 49-53, wherein the processor is configured to perform correction of the bone conduction sensor signal by applying a mapping process.

55. The method of claim 54, wherein the mapping process comprises a linear mapping including a series of corrections associated with each spectral bin of the bone conduction sensor signal.

56. The method of claim 55, wherein the correction comprises a multiplication and an offset applied to respective spectral bin values of the bone conduction sensor signal.

57. The method of any of claims 49-56, wherein the processor is configured to perform correction of the bone conduction sensor signal by applying offline learning.

58. The method of any of claims 33-57, wherein the processor is configured such that the adjustment of the microphone signal is based only on non-binary variable characteristics of speech determined from the bone conduction sensor signal.

59. The method of any of claims 33-58, wherein the bone conduction sensor comprises an accelerometer that is coupled, in use, to a surface of the user's ear canal or outer ear to detect bone conducted signals from the user's speech.

60. The method of any of claims 33-59, wherein the bone conduction sensor comprises an in-ear microphone positioned, in use, to detect acoustic sound generated within an ear canal due to bone conduction of the user's speech.

61. The method of claims 59 and 60, wherein both the accelerometer and the in-ear microphone are used to detect the at least one characteristic of the user's voice.

62. The method of any of claims 33-61, wherein the processor is configured to apply at least one matched filter to the bone conduction sensor signal, the at least one matched filter configured to match the user's voice in the bone conduction sensor signal to the user's voice in the microphone signal.

63. The method of claim 62, wherein the at least one matched filter has a training-set-based design.

64. The method of any of claims 33-63, wherein the processor is configured to adjust the microphone signal unilaterally without input from any contralateral sensor on the user's other ear.

65. A non-transitory computer-readable medium for conditioning an earbud microphone signal, the non-transitory computer-readable medium comprising instructions that, when executed by one or more processors, cause performance of:

receiving a bone conduction sensor signal from a bone conduction sensor of an earbud;

receiving a microphone signal from a microphone of an earbud;

determining from the bone conduction sensor signal at least one characteristic of speech of a user of the earbud, the at least one characteristic being a non-binary variable;

deriving at least one signal conditioning parameter from the at least one characteristic of the speech; and

adjusting an output signal from the microphone using the at least one signal conditioning parameter.

66. The non-transitory computer-readable medium of claim 65, further configured to perform the method of any of claims 34-64.

Technical Field

The present invention relates to earbud headsets configured to perform voice (speech) estimation for functions such as voice capture, and in particular relates to earbud voice estimation based on bone conduction sensor signals.

Background

The earbud's in-ear position severely limits the geometry of the device and greatly limits the ability to place microphones far apart (as required by functions such as beamforming or sidelobe cancellation). Moreover, the small form factor places a significant limitation on battery size, and thus on the power available to wireless earbuds.

Speech capture generally refers to the situation where the headset user's voice is captured while any ambient noise, including other people's voices, is minimized. A common scenario for this use case is when the user makes a voice call or interacts with a speech recognition system. Both scenarios place stringent requirements on the underlying algorithm. For voice calls, telephony standards and user expectations require that a high level of noise reduction be achieved with excellent sound quality. Similarly, speech recognition systems typically require that the audio signal be minimally modified while as much noise as possible is eliminated. There are many signal processing algorithms whose operation must change depending on whether the user is speaking. Voice activity detection is an important component of speech capture and other such signal processing algorithms: it processes an input signal to determine the presence or absence of speech in the signal. However, even in larger headsets, such as boom, pendant, and over-ear headsets, it is very difficult to reliably ignore speech from other people located within the beam of the device's beamformer, with the result that those other people's speech disrupts the processing of the user's voice capture. These and other aspects of voice capture are particularly difficult to achieve with earbuds, in part because earbuds do not place a microphone near the user's mouth, and thus do not benefit from the significantly improved signal-to-noise ratio that such microphone positioning provides.

Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is solely for the purpose of providing a context for the present invention and it is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present invention as it existed before the priority date of each claim of this application.

Throughout this specification the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.

In this specification, it is to be understood that where an element is stated to be "at least one" of a list of options, the element may be any one of the listed options, or any combination of two or more of the listed options.

Disclosure of Invention

According to a first aspect, the invention provides a signal processing apparatus for earbud speech estimation, the apparatus comprising:

at least one input for receiving a microphone signal from a microphone of an earbud;

at least one input for receiving a bone conduction sensor signal from a bone conduction sensor of the earbud;

a processor configured to determine from the bone conduction sensor signal at least one characteristic of a voice of a user of the earbud, the at least one characteristic being a non-binary variable, the processor further configured to derive at least one signal conditioning parameter from the at least one characteristic of the voice, and the processor further configured to condition the microphone signal using the at least one signal conditioning parameter.

According to a second aspect, the present invention provides a method of conditioning an earbud microphone signal, the method comprising:

receiving a bone conduction sensor signal from a bone conduction sensor of an earbud;

receiving a microphone signal from a microphone of the earbud;

determining from the bone conduction sensor signal at least one characteristic of a voice of a user of the earbud, the at least one characteristic being a non-binary variable;

deriving at least one signal conditioning parameter from the at least one characteristic of the speech; and

adjusting an output signal from the microphone using the at least one signal conditioning parameter.

According to a third aspect, the present invention provides a non-transitory computer-readable medium for conditioning an earbud microphone signal, the non-transitory computer-readable medium comprising instructions that, when executed by one or more processors, result in performance of:

receiving a bone conduction sensor signal from a bone conduction sensor of an earbud;

receiving a microphone signal from a microphone of the earbud;

determining from the bone conduction sensor signal at least one characteristic of a voice of a user of the earbud, the at least one characteristic being a non-binary variable;

deriving at least one signal conditioning parameter from the at least one characteristic of the speech; and

adjusting an output signal from the microphone using the at least one signal conditioning parameter.

In some embodiments, the earbud is a wireless earbud.

In some embodiments, the non-binary variable characteristic of speech determined by the processor from the bone conduction sensor signal is an estimate of speech derived from the bone conduction sensor signal. In some embodiments, the processor may be configured such that the conditioning of the microphone signal includes non-stationary noise reduction controlled by the estimate of speech derived from the bone conduction sensor signal. In some embodiments, the non-stationary noise reduction may be further controlled by an estimate of speech derived from the microphone signal.

In some embodiments, the processor may be configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a speech level of the bone conduction sensor signal.

In some embodiments, the processor may be configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is an observed spectrum of the bone conduction sensor signal.

In some embodiments, the processor may be configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a parametric representation of a spectral envelope of the bone conduction sensor signal.

In some embodiments, the processor may be configured such that the parametric representation of the spectral envelope of the bone conduction sensor signal includes at least one of linear prediction cepstral coefficients, autoregressive coefficients, and line spectral frequencies, for example to model a human vocal tract in order to derive the speech envelope.

In some embodiments, the processor may be configured such that the non-binary variable characteristic of speech determined from the bone conduction sensor signal is a non-parametric representation of the spectral envelope of the bone conduction sensor signal, such as mel-frequency cepstral coefficients (MFCCs) derived from a model of human sound perception, or logarithmically spaced spectral magnitudes derived from a short-time Fourier transform, which are preferred methods.
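As a concrete illustration of the non-parametric option, logarithmically spaced spectral magnitudes can be computed from a Fourier transform of a windowed frame. This is a minimal sketch, not the patented implementation; the frame length, band count, frequency range, and windowing are illustrative assumptions:

```python
import numpy as np

def spectral_envelope(frame, sr=16000, n_fft=256, n_bands=8):
    """Non-parametric spectral envelope: log-spaced band magnitudes (dB)
    of a single windowed frame. Band layout is illustrative."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    # Logarithmically spaced band edges between ~100 Hz and Nyquist.
    edges = np.geomspace(100.0, sr / 2, n_bands + 1)
    bins = np.clip((edges / (sr / 2) * (n_fft // 2)).astype(int), 1, n_fft // 2)
    # Mean magnitude per band, in dB (small floor avoids log of zero).
    return np.array([
        20 * np.log10(np.mean(spec[bins[i]:max(bins[i] + 1, bins[i + 1])]) + 1e-12)
        for i in range(n_bands)
    ])

frame = np.sin(2 * np.pi * 440 * np.arange(256) / 16000)  # 440 Hz tone
env = spectral_envelope(frame)
```

A frame of bone conduction samples in, a short vector of band magnitudes out; such a vector is one example of a non-binary variable characteristic of speech.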

In some embodiments, the processor may be configured such that the conditioning of the output signal from the microphone occurs regardless of voice activity.

In some embodiments, the processor may be configured such that the at least one signal conditioning parameter includes a frequency band-specific gain derived from the bone conduction sensor signal, and wherein conditioning the microphone signal includes applying the frequency band-specific gain to the microphone signal.
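A band-specific gain of this kind might, for example, attenuate microphone bands in which the bone conduction signal shows little speech energy. The following is a hypothetical sketch: the gain rule, floor value, and band-to-bin mapping are illustrative assumptions, not the claimed method:

```python
import numpy as np

def band_gains(bc_band_energy, mic_band_energy, floor=0.1):
    """Hypothetical band-specific gains: bands where the bone conduction
    sensor shows little speech energy relative to the microphone are
    attenuated toward a spectral floor."""
    ratio = bc_band_energy / (mic_band_energy + 1e-12)
    return np.clip(ratio, floor, 1.0)

def condition(mic_spectrum, gains, band_of_bin):
    """Apply per-band gains to a microphone magnitude spectrum,
    given each bin's band index."""
    return mic_spectrum * gains[band_of_bin]

gains = band_gains(np.array([0.9, 0.05, 0.5]), np.array([1.0, 1.0, 1.0]))
out = condition(np.ones(6), gains, np.array([0, 0, 1, 1, 2, 2]))
```

The middle band, where the bone conduction energy is near zero, is pulled down to the floor while speech-dominated bands pass nearly unchanged.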

In some embodiments, the processor may be configured such that the conditioning of the microphone signal includes applying a Kalman filter process in which the bone conduction sensor signal acts as a prior for a speech estimation process. In some embodiments, a speech estimate may be derived from the bone conduction sensor signal and used to modify a decision-directed weighting factor for an a priori SNR estimate. In some embodiments, a speech estimate derived from the bone conduction sensor signal may be used to inform a Causal Recursive Speech Enhancement (CRSE) update step.

In some embodiments, the non-binary variable characteristic of speech determined by the processor from the bone conduction sensor signal may be a signal-to-noise ratio of the bone conduction sensor signal.

In some embodiments, the processor may be configured such that no component of the bone conduction sensor signal is passed to the signal output of the earbud, other than the bone conduction sensor signal serving as the basis for determining the at least one characteristic of the speech.

In some embodiments, the processor may be configured to cause the bone conduction sensor signal to be corrected for observed conditions prior to determining non-binary variable characteristics of the speech from the bone conduction sensor signal. In some embodiments, the processor may be configured to cause the bone conduction sensor signal to be corrected for phonemes. In some embodiments, the processor may be configured to cause the bone conduction sensor signal to be corrected for bone conduction coupling. In some embodiments, the processor may be configured to cause the bone conduction sensor signal to be corrected for bandwidth. In some embodiments, the processor may be configured to cause the bone conduction sensor signal to be corrected for distortion. In some embodiments, the processor may be configured to perform the correction of the bone conduction sensor signal by applying a mapping process. In some embodiments, the mapping process may include a linear mapping that includes a series of corrections associated with each spectral bin of the bone conduction sensor signal; for example, the series of corrections may comprise a multiplication and an offset applied to the respective spectral bin values of the bone conduction sensor signal. In some embodiments, the processor may be configured to perform the correction of the bone conduction sensor signal by applying offline learning.
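The per-bin multiplication and offset, learned offline, can be sketched as an ordinary least-squares fit per spectral bin. The fitting procedure and synthetic data below are illustrative assumptions, not the patented training method:

```python
import numpy as np

def fit_bin_mapping(bc_frames, mic_frames):
    """Offline least-squares fit of a per-bin linear correction
    mic ~= a * bc + b, one (a, b) pair per spectral bin.
    Inputs are (n_frames, n_bins) arrays of bin values."""
    n_bins = bc_frames.shape[1]
    a = np.empty(n_bins)
    b = np.empty(n_bins)
    for k in range(n_bins):
        a[k], b[k] = np.polyfit(bc_frames[:, k], mic_frames[:, k], 1)
    return a, b

def apply_bin_mapping(bc_spectrum, a, b):
    """Apply the learned multiplication and offset to each bin value."""
    return a * bc_spectrum + b

# Synthetic check: bins generated with a known slope/offset are recovered.
rng = np.random.default_rng(0)
bc = rng.uniform(0.0, 1.0, size=(200, 4))
mic = 2.0 * bc + 0.5
a, b = fit_bin_mapping(bc, mic)
```

Once fitted offline, `apply_bin_mapping` is a cheap per-frame operation suitable for a power-constrained earbud processor.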

In some embodiments, the processor may be configured such that the conditioning of the microphone signal is based only on non-binary variable characteristics of the speech determined from the bone conduction sensor signal.

In some embodiments, the bone conduction sensor may include an accelerometer that, in use, is coupled to a surface of the ear canal or outer ear of the user to detect bone-conducted signals from the user's speech.

In some embodiments, the bone conduction sensor may include an in-ear microphone positioned, in use, to detect acoustic sound generated within the ear canal by bone conduction of the user's voice. In some embodiments, both the accelerometer and the in-ear microphone may be used to detect the at least one characteristic of the user's voice.

In some embodiments, the processor may be configured to apply at least one matched filter to the bone conduction sensor signal, the matched filter configured to match the user's voice in the bone conduction sensor signal with the user's voice in the microphone signal. In some embodiments, the matched filter may have a training-set-based design.
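A training-set-based matched filter of this kind could, for instance, be a least-squares FIR filter fitted to map the bone conduction signal onto the microphone's rendering of the same voice. The sketch below illustrates one such training procedure under that assumption; it is not the patented design:

```python
import numpy as np

def train_matched_filter(bc, mic, n_taps=8):
    """Training-set-based matched filter: least-squares FIR filter
    mapping the bone conduction signal onto the microphone signal."""
    # Matrix of lagged bone conduction samples: column k is bc delayed by k.
    X = np.stack([np.concatenate([np.zeros(k), bc[:len(bc) - k]])
                  for k in range(n_taps)], axis=1)
    h, *_ = np.linalg.lstsq(X, mic, rcond=None)
    return h

# Synthetic check: a known FIR relationship is recovered from the data.
rng = np.random.default_rng(1)
bc = rng.standard_normal(1000)
true_h = np.array([0.5, 0.3, 0.0, 0.1])
mic = np.convolve(bc, true_h)[:1000]
h = train_matched_filter(bc, mic, n_taps=4)
```

The filter taps would be learned offline from paired recordings and then applied to the bone conduction signal at run time.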

In some embodiments, the processor may be configured to adjust the microphone signal unilaterally, without input from any contralateral sensor on the user's other ear.

An earbud is defined herein as an audio headphone device, whether wired or wireless, which in use is supported solely or substantially by the ear on which it is placed, and which comprises an earbud body which in use is located substantially or entirely within the ear canal and/or within the concha of the pinna.

Drawings

Embodiments of the invention will now be described with reference to the accompanying drawings, in which:

fig. 1 illustrates the use of a wireless ear bud in telephone and/or audio playback;

fig. 2 is a system schematic of an earbud according to an embodiment of the invention;

fig. 3a and 3b are detailed system schematic diagrams of the earbud of fig. 2;

FIG. 4 is a flow chart of an earbud voice estimation process of the embodiment of FIG. 3;

fig. 5 illustrates a noise suppressor for a telephone according to another embodiment of the invention;

FIG. 6 illustrates an embodiment including a speech estimator that uses a statistical-model-based estimation process;

FIG. 7 illustrates a microphone-accelerometer mixing method based on a mixing factor using SNR estimation;

FIG. 8 illustrates the configuration of another embodiment of the invention;

FIG. 9 illustrates an embodiment of applying speech estimation from bone conduction sensor signals to a phone use case; and

fig. 10 shows the objective Mean Opinion Score (MOS) results for embodiments of the present invention.

Detailed Description

The earbuds 120, 130 are shown external to the ears for illustrative purposes; in use, however, each earbud is positioned such that its body is substantially or entirely within the concha and/or ear canal of the respective ear. The earbuds 120, 130 may take any suitable form to fit comfortably on or within, and be supported by, a user's ears. In embodiments within the scope of the present invention, the body of the earbud may further be supported by a hook or support member that extends beyond the concha, such as partially or entirely around the exterior of the respective pinna.

Fig. 2 illustrates a system schematic of the earbud 120. The earbud 130 may be similarly configured and is not separately described. The microphone 210 is positioned on the earbud 120 to receive external acoustic signals when the earbud is in place. Multiple microphones may be provided, for example to enable beamforming noise reduction by the earbud 120, but the small size of the earbud 120 imposes difficult limits on the maximum microphone spacing that can be achieved, and positioning of the earbud where sound is partially obscured or blocked by the pinna further limits beamforming efficacy compared to, for example, a boom-mounted microphone.

The microphone signal from the microphone 210 is passed to a suitable processor 220 of the ear bud 120. Due to the size of the ear bud 120, a limited battery power is available, which dictates that the processor 220 only performs low-power and computationally simple audio processing functions.

The earbud 120 further includes an accelerometer 230, which is mounted on the earbud 120 in a position that is inserted into the ear canal and presses against the ear canal wall when in use. Alternatively, the accelerometer 230 may be mounted within the body of the earbud 120, optionally mechanically coupled to the ear canal wall. The accelerometer 230 is thereby configured to detect bone-conducted signals, particularly the user's own voice as conducted by the bone and tissue between the vocal tract and the ear canal.

In alternative embodiments, the bone conduction sensor may be coupled to the outer ear, or mounted on any part of the headphone body that reliably contacts the ear canal or the ear within the outer ear. The use of an earbud allows reliable direct contact with the ear canal, and thus mechanical coupling to the vibrational modes of bone-conducted speech as measured at the ear canal wall. This is in contrast to the external temples, cheeks, or skull, where a mobile device (such as a telephone) may make contact. The present invention recognizes that a bone conduction speech model derived from portions of the anatomy outside the ear yields a signal with greatly reduced reliability of speech estimation compared to the embodiments described herein. The present invention further recognizes that a bone conduction sensor in a wireless earbud is sufficient to perform speech estimation. This is because, unlike headsets outside the earbud or ear, the nature of the bone conduction sensor signal from a wireless earbud is largely static with respect to user fit, user motion, and user movement. For example, the present invention recognizes that no compensation of the bone conduction sensor is required for fit or proximity. The choice of the ear canal or outer ear as the location of the bone conduction sensor is therefore a key enabler of the present invention. The invention then turns to transformations of the signal that derive the temporal and spectral characteristics that best recognize the user's speech.

The device 120 is a wireless earbud, which is important because the accessory cable attached to wired personal audio equipment is a significant source of external vibration at the bone conduction sensor 230. The accessory cable also increases the effective mass of the device 120, which can suppress ear canal vibration caused by bone-conducted speech. Removing the cable also reduces the need for a compliant medium in which the bone conduction sensor 230 is housed.

Unlike a handset or corded headset, in which the primary voice microphone is closer to the mouth and differences in the way the user holds the phone or lanyard can result in a wide range of SNRs, in this embodiment the SNR at the primary voice microphone 210 is far less variable for a given ambient noise level, because the geometry between the user's mouth and the ear containing the earbud is fixed.

Because the ear plug 120 is light enough, a sufficient condition for contact between the bone conduction transducer 230 and the ear canal is that the vibrational forces caused by speech exceed the minimum sensitivity of the commercial accelerometer 230. This is in contrast to external headphones or telephone handsets, whose large mass prevents bone-conducted vibrations from coupling easily into the device.

The processor 220 is a signal processing device configured to determine at least one characteristic of the speech of the user of the earbud 120 from the bone conduction sensor signal of the accelerometer 230, and to derive at least one signal conditioning parameter from that at least one characteristic of the speech. The processor 220 is further configured to condition the microphone signal from the microphone 210 using the at least one signal conditioning parameter, and to wirelessly transmit the conditioned signal to the host device 110 for use as a transmit signal for a voice call and/or for Automatic Speech Recognition (ASR). Communication between the earbud 120 and the host device 110 may be, for example, by Bluetooth Low Energy.

It is worth noting that the present embodiment provides noise reduction applied in a controlled, hierarchical manner, rather than in a binary on-off manner, for a headset form factor comprising a wireless ear bud provided with at least one microphone and at least one accelerometer, based on a speech estimate derived from the bone conduction sensor signal.

Voice Activity Detection (VAD) is one method of improving speech estimation, but it inherently relies on an imperfect binary decision about the presence or absence of speech in a noisy signal.

Figs. 3a and 3b illustrate in more detail the configuration of the processor 220 within the system of the earplug 120 according to an embodiment of the invention. The embodiment of figs. 3a and 3b recognises that under moderate signal-to-noise ratio (SNR) conditions, improved non-stationary noise reduction can be achieved with speech estimation alone, without requiring a VAD. This is unlike methods in which voice activity detection is used to distinguish between the presence and absence of voice, and a discrete binary decision signal from the VAD is used to gate (i.e. turn on and off) a noise suppressor acting on the audio signal. The embodiment of fig. 3 recognises that it is possible to rely on the accelerometer signal, or some signal derived from it, to obtain a sufficiently accurate speech estimate even under acoustic conditions in which an accurate speech estimate cannot be obtained from the microphone signal.

In more detail, in fig. 3, the microphone signal from microphone 210 is conditioned by the noise suppressor 310 and then passed to an output, such as for wireless communication to the device 110. The noise suppressor 310 is continuously controlled by the speech estimation/characterization module 320, without any gating by a VAD. The speech estimation/characterization module 320 takes input from the accelerometer 230 and optionally also from other accelerometers, the microphone 210, and/or other microphones.
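A minimal sketch of this gating-free control follows (the function name, parameter values, and the mapping itself are illustrative assumptions, not the patented implementation): the suppressor's attenuation depth is a continuous function of the speech estimate, rather than being switched by a binary VAD decision.

```python
def suppression_gain(speech_estimate, max_atten_db=18.0):
    """Map a continuous speech estimate in [0, 1] to a per-frame
    suppression gain. With no speech evidence the full attenuation
    is applied; with strong speech evidence the frame passes through.
    There is no binary on/off decision anywhere in the path."""
    s = min(max(speech_estimate, 0.0), 1.0)   # clamp to [0, 1]
    atten_db = max_atten_db * (1.0 - s)       # deeper attenuation as evidence falls
    return 10.0 ** (-atten_db / 20.0)         # convert dB to linear gain

# The gain varies smoothly between full suppression and unity.
g_silence = suppression_gain(0.0)   # strongest attenuation
g_speech = suppression_gain(1.0)    # unity gain
```

Any smooth monotonic mapping would serve; the point is that intermediate speech estimates yield intermediate attenuation, which a binary VAD gate cannot provide.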

In such an embodiment, the selection of the accelerometer 230 as the bone conduction sensor is particularly useful because, to a first approximation, the noise floor in commercial accelerometers is spectrally flat. These devices are acoustically transparent below their resonant frequency and therefore do not register any signal due to ambient acoustic noise. Accordingly, the noise profile of the sensor 230 can be established a priori, before the speech estimation process. This is an important difference, because it allows the temporal and spectral properties of the true speech signal to be modeled without dynamic interference from complex noise models.
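Because the sensor's noise floor is flat and static, it can be captured once (for example at calibration) and treated as a fixed profile thereafter. A minimal sketch under that assumption (hypothetical helper names):

```python
def calibrate_noise_floor(silent_frames):
    """Average per-bin magnitude over frames captured with no speech.
    With a spectrally flat, static sensor noise floor, this profile
    need never be re-estimated at run time."""
    n_frames = len(silent_frames)
    floor = [0.0] * len(silent_frames[0])
    for frame in silent_frames:
        floor = [f + x for f, x in zip(floor, frame)]
    return [f / n_frames for f in floor]

def denoise(frame, floor):
    """Subtract the fixed a priori noise profile, clamping at zero."""
    return [max(x - f, 0.0) for x, f in zip(frame, floor)]

floor = calibrate_noise_floor([[0.1, 0.1], [0.1, 0.1]])
clean = denoise([0.6, 0.05], floor)
```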

The speech estimation 320 relies on certain signal guarantees from the microphone 210 and the accelerometer 230, in particular those guaranteed in the wireless ear bud use case. The bone conduction spectral envelope in the earplug can be corrected to trade off feature importance, but a matched signal is not necessary for designing the tuning parameters. Sensor non-idealities and non-linearities in the bone conduction model of the ear canal are further reasons for which corrections may be applied.

In particular, embodiments employing multiple bone conduction sensors 230 in the ear may be configured to take advantage of the orthogonal vibration modes caused by bone-conducted speech in the ear canal, in order to extract more information about the user's speech. Importantly, the bone-conducted signals couple reliably into sensors within the wireless earbud, in a way that differs to some extent from wired earbuds and from headphones external to the ear.

The signal from accelerometer 230 is high pass filtered and then used by module 320 to determine a speech estimate output, which may include a single channel representation or a multi-channel representation of the user's speech, such as a net speech estimate, a priori SNR, and/or model coefficients.
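The high-pass front-end step just described can be sketched as follows (the filter order and coefficient are assumed values, not from the source): a one-pole high-pass filter removes DC offset and slow motion content from the accelerometer channel before the speech estimate is formed.

```python
def highpass(samples, alpha=0.98):
    """First-order IIR high-pass: y[n] = alpha * (y[n-1] + x[n] - x[n-1]).
    Removes DC offset and slow motion artefacts from the accelerometer
    channel while keeping the bone-conducted speech band."""
    y = []
    prev_x, prev_y = 0.0, 0.0
    for x in samples:
        out = alpha * (prev_y + x - prev_x)
        y.append(out)
        prev_x, prev_y = x, out
    return y

# A constant (DC) input produces an initial step that decays toward zero.
dc = highpass([1.0] * 50)
```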

Notably, the configuration of fig. 3 omits any Voice Activity Detection (VAD). Many methods of speech enhancement rely on various estimates of the speech signal and become challenging when the microphone speech signal is degraded by ambient noise. The accuracy of these estimates typically decreases with the ambient noise level. Uses of speech estimation include wind noise suppression, a priori SNR estimation for noise suppression, gain function biasing for noise suppression, beamforming adaptation (block matrix update), adaptive control for acoustic echo cancellation, a priori speech echo ratio (speech to echo) estimation for echo suppression, adaptive thresholding (level difference and cross correlation) for VAD, and adaptive windowing (minimum control recursive average; MCRA) for static noise estimation.

In this embodiment of the invention, the processing and subsequent adjustment of the bone conduction sensor 230 signal occurs regardless of the voice activity in the accelerometer signal. Thus, it relies on neither speech detection (VAD) processing nor noise modeling when deriving the speech estimate for noise reduction processing. Unlike the handset use case, the accelerometer sensor 230 in the wireless ear bud 120 measures noise statistics of ear canal vibrations with a well-defined distribution. The present invention recognizes that this justifies continuous speech estimation based on the signal from the accelerometer 230. Although the SNR of the microphone 210 is lower in the earbud, due to the distance of the microphone 210 from the mouth, the distribution of speech samples will have a lower variance than in the handset or lanyard case, due to the fixed position of the earbud and the microphone 210 relative to the mouth. Together this forms a priori knowledge of the user's speech signal, to be used in the tuning parameter design and the speech estimation process 320.

The embodiment of fig. 3 recognizes that speech estimation using a microphone and bone conduction sensor can improve speech estimation for such purposes. The speech estimate may be derived from a bone conduction sensor (e.g., accelerometer 230) or a combination of both bone conduction sensor 230 and microphone 210. The speech estimate from the bone conduction sensor 230 may include any combination of signals from the individual axes of a single device. The speech estimate may be derived from a time domain signal or a frequency domain signal. By processing within the earplugs 120 rather than in the main device 110, the processor 220 may be configured at the time of manufacture or configuration to ensure that the described processing has access to all appropriate signals and is based on accurate knowledge of the earplug geometry.

The bone conduction sensor signal is corrected for the observed conditions, and may be corrected, for example, for phonemes, sensor bandwidth, and/or distortion. The correction may comprise a linear mapping that applies a series of corrections to each spectral bin, such as applying a multiplication or offset to each bin value.
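The per-bin linear mapping described above can be sketched as follows (the gain and offset values here are placeholders for illustration, not calibration data from the source):

```python
def correct_spectrum(bc_bins, gains, offsets):
    """Apply a linear correction (multiply, then offset) to each
    spectral magnitude bin of the bone conduction signal, e.g. to
    compensate for sensor bandwidth and spectral-envelope distortion."""
    return [g * b + o for b, g, o in zip(bc_bins, gains, offsets)]

raw = [0.2, 0.5, 0.1]
corrected = correct_spectrum(raw, gains=[1.5, 1.0, 2.0], offsets=[0.0, 0.1, 0.0])
# corrected is approximately [0.3, 0.6, 0.2]
```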

The speech estimate may be derived from the bone conduction sensor 230 at 320 by any of the following techniques: exponential filtering of the signal (leaky integrator); a gain function of the signal value; a fixed matched filter (FIR or spectral gain function); adaptive matching (LMS or input-signal-driven adaptation); a mapping function (codebook); and updating the estimate using second order statistics.
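The first listed technique, exponential filtering (a leaky integrator), can be sketched per spectral bin as follows (the smoothing constant is an assumed value):

```python
def leaky_integrator(frames, leak=0.9):
    """Track the bone-conducted speech envelope per spectral bin with
    an exponential (leaky) average: est = leak * est + (1 - leak) * frame."""
    est = [0.0] * len(frames[0])
    for frame in frames:
        est = [leak * e + (1.0 - leak) * f for e, f in zip(est, frame)]
    return est

# Three identical frames: the estimate moves smoothly toward the frame values.
est = leaky_integrator([[1.0, 2.0]] * 3)
```

The leak constant trades responsiveness for smoothness; a value near 1 yields a slowly varying envelope estimate.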

Fig. 3b provides more detail of the earbud speech estimation process 320 of fig. 3 a. Fig. 4 is a flow chart of an earbud speech estimation process.

Notably, fig. 3a and 3b depict a speech estimator 320 that conditions the bone conduction speech signal from 230. This estimate may take the form of a time and/or frequency domain signal representing the user's speech signal. This is different from the net speech signal which may be the result of applying this estimator 320.

A noise suppressor for a telephone, as shown in fig. 5, may use the estimator to produce a net voice signal that will be transmitted over a telephone network to a remote recipient. Embodiments of the noise suppressor include spectral subtraction, Wiener filtering methods, and statistical modeling methods.
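One of the listed suppressor families, a Wiener-type gain driven by an a priori SNR estimate, can be sketched as follows (hypothetical helpers, not the patented implementation; the continuous SNR estimate could itself be derived from the bone conduction sensor):

```python
def wiener_gain(snr_prior):
    """Classic Wiener filter gain per bin: G = SNR / (1 + SNR).
    A continuous a priori SNR estimate drives the gain directly --
    there is no VAD gating anywhere in the chain."""
    return [s / (1.0 + s) for s in snr_prior]

def suppress(spectrum, snr_prior):
    """Scale each noisy spectral bin by its Wiener gain."""
    return [x * g for x, g in zip(spectrum, wiener_gain(snr_prior))]

# High-SNR bin largely preserved; zero-SNR bin fully attenuated.
clean_bins = suppress([1.0, 1.0], snr_prior=[9.0, 0.0])
```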

An example implementation of a speech estimator using a statistical model-based estimation process is shown in FIG. 6. The air conduction (microphone) speech estimate, the bone conduction speech estimate, and the SNR are each derived from a causal recursive speech enhancement process. The a priori SNR estimates from each process are then combined to derive mixing coefficients that adjust the user speech estimates to arrive at the final speech estimate. It is important to note that in this process, neither the microphone nor the accelerometer sensor signal is used to derive a noise model. Rather, the information content within the signals, afforded by the wireless ear bud form factor, allows a direct speech estimation process.
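A sketch of the mixing step described for FIG. 6 (the relative-SNR weighting shown here is an illustrative choice, not taken from the source): per-bin a priori SNRs from the microphone and bone conduction paths are converted into mixing coefficients that weight the two speech estimates.

```python
def mix_estimates(s_mic, s_bc, snr_mic, snr_bc):
    """Combine air- and bone-conduction speech estimates per bin,
    weighting each estimate by its relative a priori SNR."""
    out = []
    for sm, sb, rm, rb in zip(s_mic, s_bc, snr_mic, snr_bc):
        w = rm / (rm + rb) if (rm + rb) > 0 else 0.5  # microphone weight
        out.append(w * sm + (1.0 - w) * sb)
    return out

# Where the microphone SNR dominates, the mic estimate dominates,
# and vice versa for the bone conduction estimate.
mixed = mix_estimates([1.0, 1.0], [0.0, 0.0], snr_mic=[3.0, 1.0], snr_bc=[1.0, 3.0])
```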

In another embodiment, the estimator can be used to generate a signal representing a latent representation of speech suitable for an Automatic Speech Recognition (ASR) system.

This is different from using the same dynamics for speech detection (which has general application in the field of voice activity detectors), in that here the temporal and spectral dynamics of the bone conduction signal are used to derive a speech model in the presence of a static noise signal.


The method of deriving a speech estimator using a bone conduction sensor, as opposed to a speech detector (VAD), may be elaborated further in the context of the present invention. Traditionally, the quality of a noise suppressor depends on an estimate of the noise spectrum, which is typically derived from measurements made during speech gaps identified by a binary decision device such as a VAD.

The present invention does not use the bone conduction sensor in the process of building a noise model. Therefore, the construction of a noise model does not require a Voice Activity Detector (VAD) derived from the bone conduction sensor. This is an important difference from other proposals that use bone conduction sensors instead of microphones, since in such alternative proposals the noise model must typically be modeled accurately in order to perform speech enhancement, and the bone conduction sensor serves to help derive that model.

In contrast, the basic assumption of a bone conduction sensor in an earplug is that the bone conduction sensor signal representing speech contains sufficient time and spectral content to derive a non-binary signal representing the user's speech.

In one embodiment (FIG. 6), the speech model from a noisy microphone may be completed with a causal recursive speech estimator that requires an estimate of the noise variance. The spectrum is treated as a prior of the user's speech, by means of a representation of the vibrations of their ear canal. The net speech microphone signal can be roughly estimated without transformation. In this case, the bone conduction speech estimate Sbc is used as such, rather than a net speech estimate conditioned on the bone conduction sensor, i.e.:

(The equation is reproduced only as an image in the original publication.)

In one embodiment, Sbc is used directly as the speech estimate. It is noted that these embodiments do not use off-line processing to derive a bone-conduction-to-clean-air-conduction-microphone transformation, nor do they use, for example, a synthesized signal as a hypothetical estimate.

FIG. 7 illustrates a microphone-accelerometer mixing method based on a mixing factor derived from SNR estimates; a method is provided that combines a priori SNR estimates from a microphone and an accelerometer (BC sensor). This may be particularly applicable in low SNR environments, in which the speech estimate with the best SNR estimate is used. Thus, the net speech estimate and the a priori SNR estimate derived from a bone conduction sensor signal are both applications of the speech estimation techniques controlled by a bone conduction sensor signal according to the present invention.

(The mixing equations are reproduced only as images in the original publication.)

Secondary noise reduction is then performed on this mixed signal.

This is in contrast to using the VAD to derive the noise estimate and then determining the blend ratio.

Other embodiments of the present invention may extend this idea by discarding the speech estimates from the speech enhancement blocks 710, 720, and instead mixing the noisy signals according to the SNR estimates and performing a secondary noise reduction.

Fig. 8 illustrates the configuration of the processor 220 within the system of the ear bud 120 according to another embodiment of the invention. The elements of fig. 8 that are not described are the same as in fig. 3. However, in the embodiment of fig. 8, the speech estimate output by the speech estimation/characterization module is conveyed not only to the noise suppressor, but also to an auxiliary output path for use by other modules, which may reside within the ear bud 120 or the host device 110, and which may include, for example, an Automatic Speech Recognition (ASR) module or a voice trigger module.

Fig. 9 illustrates another embodiment according to the present invention, which illustrates the application of speech estimation from bone conduction sensor signals to phone use cases.

Embodiments of the present invention note that although the frequency response of an in-ear accelerometer is poor compared to a microphone, or even compared to a bone sensor mounted at the temple or the like, the in-ear accelerometer signal can not only be used for speech estimation, but can also be used for graduated or non-binary control of speech estimation, such as by controlling non-stationary noise reduction in a multi-step or hierarchical manner. In more detail, the low-pass frequency response and relatively poor sensitivity of the earbud inertial sensor are limitations of the bone conduction model at the external ear canal. Bone conduction sensors for vibration are usually of the magnetic type and are usually mounted to other parts of the head, such as the temporal bone or mastoid bone, with the elastic force of a headband or the like maintaining firm contact. However, this mounting location and technique is largely incompatible with headphones for audio applications, and incompatible with the preferred headphone form factor. The present invention enables conformance with the preferred headphone form factor by utilizing the inertial sensor of an earplug.

Estimating a time-frequency model of speech at the ear canal is therefore a distinct problem, as the inventors have found that the observable frequency range of ear canal bone conduction signals is typically below 1 kHz.

Fig. 10 shows objective Mean Opinion Score (MOS) results for the embodiment of fig. 9, which show the improvement obtained when adjusting the a priori speech envelope from the microphone 210 with parameters derived from the spectral envelope of the bone conduction sensor 230. Measurements were performed in a number of different stationary and non-stationary noise types using the 3QUEST method to obtain speech MOS (S-MOS) and noise MOS (N-MOS) values.

In other applications, such as handsets, the contributions of the bone conduction estimate and the microphone spectral estimate to the combined estimate, in time and frequency, may drop to zero if the use case causes either sensor signal to be of poor quality; this is not the case in the wireless earbud application of this embodiment.

While the described embodiment provides for the speech estimation/characterization module 320 and the noise suppressor module 310 to reside within the ear buds 120, alternative embodiments may alternatively or additionally provide such functionality in the host device 110. Such an embodiment may take advantage of the significantly higher processing power and power budget of the host device 110 compared to the earplugs 120, 130.

The ear bud 120 may further include other elements not shown, such as an additional digital signal processor, flash memory, microcontroller, bluetooth radio chip, or equivalent.

Unlike the accelerometer 230, such an in-ear microphone would receive acoustic reverberation of bone-conducted signals within the ear canal, and would also receive external noise leaking into the ear canal past the ear-bud.

Wireless communication is understood to mean any type of communication, monitoring, or control system in which electromagnetic or acoustic waves carry the signal through the atmosphere or free space, rather than along wires.

Corresponding reference characters indicate corresponding parts throughout the drawings.

It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
