Speech recognition using multiple sensors

Document No.: 144568    Publication date: 2021-10-22

Reader's note: This technology, "Speech recognition using multiple sensors," was designed and created by Luka John Campbell and Dragan Petrovic on 2019-12-20. Abstract: Described herein are systems and methods for increasing speech recognition accuracy using multiple sensors placed at multiple speech transmission regions, such as a user's lips, throat, ear canal, etc. Different speech transmission regions better transmit certain phonemes, and a sensor placed close to a particular speech transmission region can more accurately detect the phonemes transmitted through that region. For example, a microphone placed close to the lips may detect labial phonemes, such as m, n, p, and b, better than a microphone placed close to the throat. Further, a method of reducing power consumption while performing speech recognition is disclosed herein.

1. A system, comprising:

a first microphone disposed at an entrance or within an ear canal of a user to measure a first sound transmitted through the ear canal of the user;

a second microphone disposed adjacent to a user's lips to measure a second sound transmitted through the user's lips;

a processor configured to:

receive a first measurement of the first sound and a second measurement of the second sound; and

improve the accuracy of a speech recognition algorithm by determining a difference between a portion of the first sound and a portion of the second sound, the difference including a difference in amplitude and phase between the first sound and the second sound recorded by each microphone, and by modifying a probability of phoneme prediction based on the difference.

2. The system of claim 1, wherein the processor configured to modify the probability of phoneme prediction comprises the processor configured to:

reconstruct the user's speech by selecting labial phonemes from the second microphone and non-labial phonemes from the first microphone.

3. The system of claim 1, comprising:

a third microphone positioned proximate to a throat of a user to measure a third sound transmitted through the throat of the user.

4. The system of claim 3, comprising the processor configured to:

reconstruct the user's speech by selecting a laryngeal phoneme from the third microphone and a non-labial, non-laryngeal phoneme from the first microphone.

5. The system of claim 1, comprising the processor configured to:

identify an activation utterance based on the first measurement and the second measurement; and

upon recognizing the activation utterance, facilitate recognition of the user's speech.

6. The system of claim 1, comprising the processor configured to:

reduce energy consumption associated with the second microphone by operating the second microphone in a low energy mode until the first microphone detects the first sound and, upon detection of the first sound, switching the second microphone to a high energy mode to measure the second sound.

7. A system, comprising:

a plurality of sensors, a first sensor of the plurality of sensors being disposed at an entrance to or within an ear canal of a user, a second sensor of the plurality of sensors being disposed proximate a voice transmission region associated with a user's voice system, the first sensor sensing a first sound within the user's ear canal, the second sensor sensing a second sound transmitted by the voice transmission region;

a processor configured to:

receive one or more measurements of the first sound and the second sound; and

improve the accuracy of a speech recognition algorithm by determining a difference between a portion of the first sound and a portion of the second sound and modifying a probability of phoneme prediction based on the difference.

8. The system of claim 7, wherein the processor configured to improve the accuracy of the speech recognition algorithm comprises the processor configured to:

determine that the portion of the first sound has a higher amplitude than the portion of the second sound; and

rely more on the portion of the first sound than on the portion of the second sound in the speech recognition algorithm.

9. The system of claim 7, comprising the processor configured to:

reconstruct the user's speech by selecting a first speech portion from the first sound and a second speech portion from the second sound based on a criterion indicating which of the first and second sensors better senses the first speech portion and the second speech portion.

10. The system of claim 9, the processor configured to reconstruct speech of the user, comprising the processor configured to:

select a phoneme from sound recorded by a sensor placed closer to the phoneme's transmission source than the other sensors.

11. The system of claim 10, the processor configured to select the phoneme, comprising the processor configured to:

select a labial phoneme from the second sound when the second sensor is placed closer to the user's lips than the first sensor.

12. The system of claim 9, the processor configured to reconstruct speech of the user, comprising the processor configured to:

obtain the criterion, the criterion specifying a frequency range; and

select a low frequency sound from the first sensor and a high frequency sound from the second sensor.

13. The system of claim 7, comprising the processor configured to:

identify an activation utterance based on the one or more measurements; and

upon recognizing the activation utterance, facilitate recognition of the user's speech.

14. The system of claim 13, comprising the processor configured to:

determine, based on one or more phonemes contained in the activation utterance, a sensor of the plurality of sensors that is likely to sense the one or more phonemes; and

continuously operate the sensor in a high energy mode.

15. The system of claim 14, comprising the processor configured to recognize the activation utterance when the sensor senses the one or more phonemes.

16. The system of claim 13, the processor comprising a dual-mode processor configured to operate in a low-energy mode prior to recognizing the activation utterance and to operate in a high-energy mode when facilitating the recognition of the user's speech.

17. The system of claim 7, comprising one or more sensors of the plurality of sensors disposed outside of the user's head and exposed to ambient sounds, the one or more sensors extracting the user's speech by removing the ambient sounds from the sensed audio.

18. The system of claim 7, comprising:

a housing surrounding one of the plurality of sensors, the housing attenuating ambient sound reaching the sensor;

the processor configured to reduce power consumption by:

receiving a notification from the sensor surrounded by the housing, the notification indicating detection of the user's voice; and

upon receiving the notification, activating remaining sensors of the plurality of sensors to sense the voice of the user.

19. The system of claim 7, wherein one of the plurality of sensors is positioned proximate the user's lips or proximate the user's throat.

20. The system of claim 7, comprising the processor configured to:

determine a sensor of the plurality of sensors that is likely to detect a phoneme; and

rely more on the sensor to detect the phoneme than on the remaining sensors of the plurality of sensors.

21. The system of claim 7, comprising:

the processor configured to reduce power consumption by receiving the user's voice from the plurality of sensors and transmitting the user's voice to a remote processor to perform speech recognition.

22. A method, comprising:

measuring, by a plurality of sensors, a plurality of sounds transmitted by a plurality of voice transmission regions associated with a user's voice system, a first sensor of the plurality of sensors being disposed at an entrance of or within an ear canal of a user and a second sensor of the plurality of sensors being disposed proximate to a voice transmission region, the first sensor measuring a first sound at the entrance of or within the ear canal of the user, the second sensor measuring a second sound transmitted by the voice transmission region; and

improving the accuracy of a speech recognition algorithm by determining a difference between a portion of the first sound and a portion of the second sound and modifying a probability of phoneme prediction based on the difference.

23. The method of claim 22, comprising:

reconstructing the user's speech by selecting a first speech portion from the first sound and a second speech portion from the second sound based on a criterion indicating which of the first and second sensors better senses the first speech portion and the second speech portion.

24. The method of claim 23, wherein said selecting comprises:

selecting a phoneme from sound recorded by a sensor placed closer to the phoneme's transmission source than the other sensors.

25. The method of claim 24, wherein the selecting comprises:

selecting a labial phoneme from the second sound when the second sensor is placed closer to the user's lips than the first sensor.

26. The method of claim 23, wherein said selecting comprises:

selecting a low frequency sound from the first sensor and a high frequency sound from the second sensor.

27. The method of claim 22, comprising:

identifying an activation utterance based on a plurality of measurements of the first sound and the second sound; and

upon recognizing the activation utterance, facilitating recognition of the user's speech.

28. The method of claim 27, comprising:

determining, based on one or more phonemes included in the activation utterance, a sensor of the plurality of sensors that is likely to sense the one or more phonemes; and

continuously operating the sensor in a high energy mode.

29. The method of claim 28, comprising:

reducing power consumption by switching a processor to the high energy mode upon recognition of the activation utterance, the processor configured to operate in a low energy mode when not activated and to operate in the high energy mode when activated; and

recognizing speech of the user while operating in the high energy mode.

30. The method of claim 22, comprising:

determining time delays for receiving the same sound at each of the plurality of sensors;

identifying a source of the sound based on the time delays; and

extracting from the sound a portion of the sound belonging to a desired source.

31. The method of claim 22, comprising:

reducing power consumption by receiving a notification from a sensor surrounded by a housing that attenuates ambient sound reaching the sensor, the notification indicating detection of a user's voice; and

upon receiving the notification, activating remaining sensors of the plurality of sensors to sense a voice of the user.

32. The method of claim 22, comprising:

determining a sensor of the plurality of sensors that is likely to detect a phoneme; and

relying more on the sensor to detect the phoneme than on the remaining sensors of the plurality of sensors.

Technical Field

The present application relates to sensors used in speech recognition, and more particularly to methods and systems for recognizing speech using multiple sensors.

Background

Today, audio interaction with computers is becoming ubiquitous, and speech recognition plays a central role. However, speech recognition is fraught with inaccuracies due to poor acoustics or speaker idiosyncrasies such as accents, speech patterns, etc. Moreover, speech recognition tends to consume a significant amount of processing time and energy.

Summary of the invention

Described herein are systems and methods for increasing speech recognition accuracy by utilizing multiple sensors placed at multiple speech transmission regions, such as the user's lips, throat, ear canal, etc. Different speech transmission regions are better at transmitting certain phonemes, and a sensor placed close to a particular speech transmission region can more accurately detect phonemes transmitted through that particular region. For example, a microphone placed near the lips may better detect labial phonemes, such as m, n, p, and b, than a microphone placed near the throat. Further, a method of reducing power consumption while performing speech recognition is disclosed herein.

Drawings

These and other objects, features, and characteristics of the present embodiments will become more apparent to those skilled in the art from a study of the following detailed description when taken in conjunction with the appended claims and the accompanying drawings, all of which form a part of this specification. Although the drawings comprise illustrations of various embodiments, the drawings are not intended to limit the claimed subject matter.

FIGS. 1A-1B illustrate a plurality of sensors surrounding a user and recording the user's voice.

Fig. 2 illustrates voice transmission regions associated with a user's voice system.

Fig. 3A-3B illustrate a hearing instrument according to various embodiments.

FIG. 4 is a flow chart of a method of performing speech recognition using multiple sensors.

Fig. 5 is a diagrammatic representation of a machine in the example form of a computer system within which a set of instructions, for causing the machine to perform any one or more of the methodologies or modules discussed herein, may be executed.

Detailed Description

Terminology

Brief definitions of terms, abbreviations, and phrases used in this application are given below.

Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Also, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not other embodiments.

Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, in the sense of "including, but not limited to". As used herein, the terms "connected," "coupled," or any variant thereof, refer to any direct or indirect connection or coupling between two or more elements. The coupling or connection between the elements may be physical, logical, or a combination thereof. For example, two devices may be coupled directly, or via one or more intervening channels or devices. As another example, devices may be coupled in such a way that information may be passed between them without sharing any physical connection between each other. Furthermore, the words "herein," "above," "below," and words of similar import, when used in this application, shall refer to this application as a whole and not to any particular portions of this application. Words in the detailed description that use the singular or plural number may also include the plural or singular number, respectively, as the context permits. The word "or" in reference to a list of two or more items covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.

If the specification states that a component or feature may be "included" or have a property, that particular component or feature need not be included or have a property.

The term "module" broadly refers to a software, hardware, or firmware component (or any combination thereof). A module is generally a functional component that can use specified inputs to generate useful data or other outputs. A module may or may not be self-contained. An application (also referred to as an "application") may include one or more modules, or a module may include one or more applications.

The terminology used in the detailed description is intended to be interpreted in its broadest reasonable manner, even though it is being used in conjunction with certain examples. The terms used in this specification generally have their ordinary meanings in the art, in the context of the present disclosure, and in the specific context in which each term is used. For convenience, certain terms may be highlighted, such as using capitalization, italics, and/or quotation marks. The use of highlighting has no effect on the scope and meaning of the term; in the same context, the terms are used in the same sense, whether or not highlighted. It should be appreciated that the same elements may be described in more than one way.

Thus, alternative language and synonyms may be used for any one or more of the terms discussed herein, but no special significance is made as to whether or not a term is elaborated or discussed herein. The use of one or more synonyms does not preclude the use of other synonyms. The use of examples anywhere in this specification, including examples of any terms discussed herein, is illustrative only and is not intended to further limit the scope and meaning of the disclosure or any exemplary terms. Also, the present disclosure is not limited to the various embodiments presented in this specification.

Speech recognition using multiple sensors

Described herein are systems and methods for increasing speech recognition accuracy by utilizing multiple sensors placed at multiple speech transmission regions, such as the user's lips, throat, ear canal, etc. Various voice transmission regions are better at transmitting certain phonemes, and a sensor placed close to a particular voice transmission region can more accurately detect phonemes transmitted through that particular region. For example, a microphone placed near the lips may detect labial phonemes such as m, n, p, and b better than a microphone placed near the throat. Further, a method of reducing power consumption while performing speech recognition is disclosed herein.

One of the most accurate speech recognition systems today is Google's speech recognizer for native speakers, which achieves an accuracy of about 95%. However, speech recognition accuracy for speakers with accents drops significantly, down to 59%. By using multiple sensors placed along multiple speech transmission regions, speech recognition accuracy can exceed 95% for both native and non-native speakers.

FIGS. 1A-1B illustrate a plurality of sensors surrounding a user and sensing the user's voice. The sensors 100, 110, 120, 150, and/or 160 may be associated with a hearing device 130, such as an ear plug, ear piece, hearing aid, or the like. The sensors 100, 110, 120, 150, and/or 160 may be in wired or wireless communication with a processor 140 associated with the hearing instrument 130. The sensors 100, 110, 120, 150, and/or 160 may be microphones, piezoelectric sensors, capacitive sensors, dry electrodes, accelerometers, lasers, infrared sensors, and the like.

Sensors 100, 110, 120, 150, and/or 160 may be positioned proximate to a plurality of voice transmission regions associated with a user's voice system. Sensors 100, 110, 120, 150, and/or 160 may sense sounds associated with voice transmission zones. The voice transmission region may be a location along the user's voice system where sounds associated with the user's voice may be heard, as described below.

The first sensor 100 may be disposed at an entrance of or in the ear canal to measure a first sound transmitted through the ear canal of a user. The first sound transmitted through the ear canal of the user is typically low frequency sound transmitted using bone conduction. The first sensor 100 may be physically attached to the hearing device 130 or may be in wireless communication with the hearing device 130. For example, the first sensor 100 may be encapsulated within the hearing device 130, as described below.

The second sensor 110 may be disposed adjacent to the lips of the user to measure a second sound transmitted through the lips of the user. The second sensor 110 may be physically attached to the hearing device 130 as shown in fig. 1 or may be in wireless communication with the hearing device 130.

Third sensor 120 may be positioned proximate to the throat of the user to measure a third sound transmitted through the throat of the user. The third sensor 120 may be physically attached to the hearing instrument 130 as shown in fig. 1A, or the third sensor 120 may be in wireless communication with the hearing instrument 130. As shown in fig. 1B, third sensor 120 may be attached to the throat of the user with a sticker 170. As described below, the sensors 150 and/or 160 may be placed adjacent to the coronal (tongue tip) and dorsal (tongue body) voice transmission regions.

The first sensor 100 can measure the low frequencies of the user's voice because the sound heard in or at the entrance to the user's ear canal is transmitted through the user's head using bone conduction. The sensors 110, 120, 150, and/or 160 may measure the high frequencies of the user's voice. The fundamental frequency of the human voice ranges between roughly 80 Hz and 300 Hz. The low frequency range detected by the first sensor may be between 80 Hz and 200 Hz, while the high frequency range detected by the second sensor may be between 180 Hz and 300 Hz. The processor 140 may receive the low frequencies recorded by the first sensor and the high frequencies recorded by the second sensor and combine them into a recording of the human voice.
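
As a concrete illustration of combining the two bands, the following Python sketch low-passes an in-ear (bone-conducted) recording, high-passes an external recording, and sums them. It is only a minimal example of the idea described above; the 200 Hz crossover, the filter order, and the assumption of time-aligned, equal-length inputs are illustrative choices rather than values taken from this disclosure.

    # Sketch: combine a bone-conducted in-ear recording (low band) with an
    # external microphone recording (high band) into one voice signal.
    import numpy as np
    from scipy.signal import butter, sosfilt

    def combine_bands(in_ear, external, fs, crossover_hz=200.0):
        """Low-pass the in-ear signal, high-pass the external signal, and sum."""
        low_sos = butter(4, crossover_hz, btype="lowpass", fs=fs, output="sos")
        high_sos = butter(4, crossover_hz, btype="highpass", fs=fs, output="sos")
        low_band = sosfilt(low_sos, in_ear)      # bone-conducted low frequencies
        high_band = sosfilt(high_sos, external)  # airborne high frequencies
        return low_band + high_band

    # Usage with synthetic signals:
    fs = 16000
    t = np.arange(fs) / fs
    in_ear = np.sin(2 * np.pi * 120 * t)     # low-frequency component
    external = np.sin(2 * np.pi * 250 * t)   # higher-frequency component
    voice = combine_bands(in_ear, external, fs)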

Further, multiple sensors may be placed outside of a person's head, for example, near the person's mouth or throat, such as sensors 110, 120, 150, 160, forming a sensor array. Each sensor in the sensor array is spatially separated from the other sensors, and each sensor can be a known distance from the person's voice source. When a person speaks, each sensor 110, 120, 150, 160 receives the person's voice at a different time than the rest of the sensors in the sensor array due to the different distances between the source of the voice and the sensor locations. Thus, the time delay for receiving the speech of the person wearing the hearing device 130 at each of the sensors 110, 120, 150, 160 is known.

In order to accurately detect the source of sound based on the time delay, the distance between each sensor in the sensor array needs to be smaller than the wavelength of the detected sound. To detect high frequencies, the sensors need to be closer together than they would be to detect low frequencies. In order to accurately detect the source of human speech, the distance between the sensors needs to be less than 1 m.

If the sensors 110, 120, 150, 160 receive ambient speech from a person other than the person wearing the hearing device 130, the time delay for receiving the speech of the other person between the different sensors 110, 120, 150, 160 is different than when the sensors 110, 120, 150, 160 receive the speech of the person wearing the hearing device 130. The sensors 110, 120, 150, 160 may send the received speech to the processor 140. Based on the different times of received speech, the processor 140 may filter out ambient speech and noise, and may detect the speech of the person wearing the hearing device 130 even in a crowded room.
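
The following sketch illustrates one way such delay-based filtering could be implemented: cross-correlate each external sensor against a reference sensor and accept a frame as the wearer's speech only when the measured delays match the delays expected for the wearer's mouth. The sensor geometry, tolerance, and frame handling are illustrative assumptions rather than details specified in this disclosure.

    # Sketch: decide whether a sound frame comes from the wearer by comparing
    # measured inter-sensor delays against the known delays for the wearer's mouth.
    import numpy as np

    def measured_delay(ref, other, fs):
        """Delay (seconds) of `other` relative to `ref`, via cross-correlation."""
        corr = np.correlate(other, ref, mode="full")
        lag = np.argmax(corr) - (len(ref) - 1)
        return lag / fs

    def is_wearer_speech(frames, expected_delays, fs, tol=1e-4):
        """frames: list of per-sensor sample arrays for the same time window.
        expected_delays: per-sensor delays relative to sensor 0, precomputed
        from the known mouth-to-sensor distances."""
        ref = frames[0]
        for frame, expected in zip(frames[1:], expected_delays[1:]):
            if abs(measured_delay(ref, frame, fs) - expected) > tol:
                return False   # delay pattern does not match the wearer's mouth
        return True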

The sensors 100, 110, 120, 150, 160 may be used to measure sound away from their primary target recording locations. For example, the sensors 100, 110, 120, 150, 160 may be considered proximate to the user's lips and may measure the sound transmitted by the user's lips. In a more specific example, two sensor arrays in the ear bud may be used to measure signals from the user's lips, even though the primary target recording location of the sensor 100 may not be the user's lips.

Fig. 2 illustrates voice transmission regions associated with a user's voice system. As shown in fig. 2, there are multiple voice transmission regions associated with the user's voice system. The four main voice transmission regions are the labial, coronal (tongue tip), dorsal (tongue body), and laryngeal regions. The labial voice transmission region may include the bilabial, labiodental, and linguolabial regions. The coronal voice transmission region may include the linguolabial, dental, alveolar, postalveolar, and retroflex regions. The dorsal voice transmission region may include the palatal, velar, and uvular regions. The laryngeal voice transmission region may include the epiglottal and glottal regions.

When the sensor 110 is placed near the labial voice transmission region in figs. 1A-1B, phonemes such as m, n, p, b, t, and v are well detected. When sensors 110, 150, and/or 160 are placed near the coronal voice transmission region in figs. 1A-1B, phonemes such as r, s, z, and t∫ are well detected. When sensors 110, 150, and/or 160 are placed near the dorsal voice transmission region, phonemes such as k, g, and j are well detected. When the sensors 110 and/or 120 are placed close to the laryngeal voice transmission region, phonemes such as h, u:, and a: are well detected.

The speech transmission region may comprise the ear canal of the user, as the user's speech is transmitted by bone conduction through the user's head. Thus, the ear canal of the user may be used to detect the user's speech, in particular the low frequencies of the user's speech. A sensor, such as a microphone 100 in fig. 1A-1B, may be placed at the entrance or in the ear canal of the user to detect and record the user's voice. A single sensor may detect multiple phonemes. In addition, a single sensor may detect phonemes generated from a plurality of speech transmission regions.

Different phonemes have their sound components generated at different positions. This means that the transfer function between the generation site and each recording instrument is different. There may be different amplitudes and phases/delays/latencies across frequency. By comparing data recorded from different locations around the head/body/room, the phonemes can be identified more accurately. For example, sound generated near the lips is louder near the lips than near the throat. Thus, if a sound is recorded and a sensor on the lips shows a much louder signal than a throat sensor, then the phoneme being identified is more likely to have been generated near the lips. Thus, the processor may select the measurements of the sensor near the lips to perform speech recognition and/or may rely more on the sound recorded by the sensor near the lips to perform speech recognition.

This additional information may be used to improve the accuracy of the speech recognition algorithm by determining a difference between a portion of the first sound and a portion of the second sound and selecting either the portion of the first sound or the portion of the second sound based on the difference. For example, to improve the accuracy of the speech recognition algorithm, the processor may modify the probability of which phoneme was spoken that was generated by any other speech recognition algorithm, such as a neural network-based approach. The processor may modify the probabilities of the phoneme predictions based on the difference information recorded from each sensor. The difference information may include the difference in amplitude, and/or in phase/delay/latency, between two or more sounds measured by two or more different sensors. Additionally, or alternatively, the additional information may be extracted implicitly by constructing a neural network model with inputs from multiple sensor locations, the neural network being constructed in a manner that benefits from the difference information.
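
As an illustration of modifying phoneme probabilities with difference information, the sketch below boosts labial phoneme probabilities when a lip-adjacent sensor records a louder signal than a throat-adjacent sensor, and boosts laryngeal phonemes otherwise. The phoneme sets, the boost factor, and the use of RMS amplitude are illustrative assumptions; any baseline recognizer could supply the initial posterior.

    # Sketch: nudge a phoneme posterior produced by a baseline recognizer using
    # the amplitude difference between a lip-adjacent and a throat-adjacent sensor.
    LABIAL = {"m", "n", "p", "b"}
    LARYNGEAL = {"h", "u:", "a:"}

    def reweight(posterior, lip_rms, throat_rms, boost=1.5):
        """posterior: dict mapping phoneme -> probability from a baseline recognizer."""
        adjusted = dict(posterior)
        favored = LABIAL if lip_rms > throat_rms else LARYNGEAL
        for ph in adjusted:
            if ph in favored:
                adjusted[ph] *= boost        # favor phonemes made near the louder sensor
        total = sum(adjusted.values())
        return {ph: p / total for ph, p in adjusted.items()}  # renormalize

    # Usage: lip sensor louder, so labial phonemes gain probability mass.
    print(reweight({"m": 0.4, "h": 0.6}, lip_rms=0.9, throat_rms=0.3))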

The difference between the portions of the first sound and the second sound may be reflected in amplitude and/or phase/delay/latency. Based on the difference, e.g., one sensor detecting a higher volume of sound, or one sensor detecting the sound sooner than the other sensor, the processor may determine that one sensor is closer to the voice transmission region. Thus, the processor may determine the likely speech transmission region, as well as the likely set of phonemes being spoken. Further, the processor may select a measurement of sound recorded by one sensor to perform speech recognition and/or may rely more on the sensor closer to the speech transmission region than on another sensor to perform speech recognition.

The analysis may be done locally or, as is more common today, by streaming the recorded sound to a cloud-based provider. In that case, multiple recorded sensor signal channels would be sent to the cloud-based provider, rather than the single channel used today.

Fig. 3A-3B illustrate a hearing instrument according to various embodiments. The hearing device 300 may be an earpiece, such as a wired or wireless ear bud, a hearing aid, a headphone, or the like. The hearing device 300 may include an ear cup 302 and an earpiece 304, as shown in fig. 3B, in wired or wireless communication with each other. The ear cup 302 and the earpiece 304 may be part of a hearing device 300, such as a headphone. The hearing device 300 may include one or more processors 310, 315, 320, and/or 325; one or more sensors 330, 335, 337, 340, 345, and/or 347; a transceiver 350, 355, or 357; an audio transmitter 360, 365, or 367; a housing 370, 375, or 377; etc.

The sensors 330, 335, and/or 345 may be microphones for recording sound. The sensors 337 and/or 347 may be electrodes and/or capacitive sensors to detect auditory evoked potential (AEP) signals. An AEP is an electroencephalographic (EEG) signal emanating from the brain through the scalp in response to an acoustic stimulus. The sensors 337 and/or 347 may measure any AEP, such as an auditory brainstem response, mid-latency response, cortical response, acoustic change complex, auditory steady-state response, complex auditory brainstem response, cochlear electrogram, cochlear microphonic, or cochlear nerve AEP. The sensor 320 may also measure the acoustic reflex (also known as the stapedius reflex, middle ear muscle (MEM) reflex, attenuation reflex, or auditory reflex). The acoustic reflex is an involuntary muscle contraction that occurs in the middle ear in response to high-intensity sound stimuli or when a person starts to produce sound.

The first processor 310 or 315 may be configured to consume a small amount of energy while waiting to receive an activation utterance. The first processor 310 or 315 may be configured to always consume a small amount of power and transmit the user's voice to the second processor 320 or 325, or to a remote processor, for speech recognition. The first processor 310 or 315 may be configured to operate as a dual-mode processor, i.e., in both a low energy mode and a high energy mode. For example, when the first processor 310 or 315 is waiting to receive an activation utterance, the first processor 310 or 315 may operate in the low energy mode, and when the first processor 310 or 315 is performing speech recognition, it may operate in the high energy mode.

The first processor 310 or 315 may receive signals from one or more sensors: 330, 335, 337, 340, 345, and/or 347. The first processor 310 or 315 may recognize the activation utterance based on one or more recordings. The activation utterance may be a wake phrase or wake word, such as "Nura," "okay Nura," "wake up Nura," or the like. Upon recognition of the activation utterance, the first processor 310 or 315 facilitates recognition of the user's speech.

To facilitate recognition of the user's voice, the first processor 310 or 315 may switch to a high energy mode to perform speech recognition, may activate the second processor 320 or 325 to perform speech recognition, or may transmit one or more recordings of the user's voice to a remote processor, such as a cloud processor.
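
A minimal sketch of this wake-word-gated flow is shown below: a first processor stays in a low-energy mode until the activation utterance is detected, then either recognizes speech itself in a high-energy mode, delegates to a second processor, or streams the recording to a remote processor. The class and method names are illustrative and not part of this disclosure.

    # Sketch: dual-mode processor that reacts to the activation utterance.
    from enum import Enum

    class Mode(Enum):
        LOW_ENERGY = 1
        HIGH_ENERGY = 2

    class FirstProcessor:
        def __init__(self, recognize_locally=False, second_processor=None, uplink=None):
            self.mode = Mode.LOW_ENERGY      # idle, listening for the wake word
            self.recognize_locally = recognize_locally
            self.second_processor = second_processor
            self.uplink = uplink

        def on_audio(self, recording, detected_wake_word):
            if not detected_wake_word:
                return None                  # stay in the low-energy mode
            if self.recognize_locally:
                self.mode = Mode.HIGH_ENERGY  # switch modes and recognize here
                return "recognize:" + recording
            if self.second_processor is not None:
                return self.second_processor.recognize(recording)  # delegate
            return self.uplink.send(recording)   # offload to a remote processor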

The transmission may be performed using transceivers 350, 355, and/or 357. Transceivers 350, 355, and/or 357 may transmit the recording of the user's voice to a remote processor via a cellular network, wireless network, or the like. The transceivers 350, 355, and/or 357 may send the records to an intermediary such as a cell phone, smart watch, home device, etc., which then forwards the records to a remote processor, or the transceivers 350, 355, and/or 357 may communicate directly with the remote processor.

Sensors 330, 335, 337, 340, 345, and/or 347 may be any sensor capable of recording a signal representative of the user's voice. For example, sensors 330, 335, 337, 340, 345, and/or 347 may be microphones, electrodes, capacitive sensors, or any combination thereof. The first processor 310 or 315 may reduce the power consumption of the hearing instrument 300 by keeping only one of the sensors 330, 335, 337, 340, 345, or 347 operating to detect whether the user is speaking, and keeping the remaining sensors in a low energy mode or off until the active sensor detects the user's voice.

For example, the sensors 330 and/or 335 may be isolated from ambient sound by the enclosure 370 or 375 and may better detect the user's voice because they are isolated from ambient sound. The sensor 330 may be placed at the entrance of or in the ear canal of the user so that the user's head, in addition to the housing 370 or 375, also attenuates ambient sounds. The sensor 335 in fig. 3B may be placed within a housing 375, the housing 375 being placed at the entrance of or in the ear canal of the user. Sensor 335 may detect user speech conducted through a cavity defined by housing 375. Once the sensor 330 receives a signal, such as a sound, the sensor 330 may send a notification to the first processor 310 or 315 to activate the remaining sensors, such as sensor 340. By activating the remaining sensors only at selected times, the power consumption of the hearing instrument 300 is reduced.

Even when operating in the high-energy mode, the sensors 330 and/or 335 do not consume much energy because they do not detect ambient sound and do not expend energy recording it. However, the sensors 330 and/or 335 are poor at detecting phonemes transmitted through the lips, and thus another sensor, such as one placed near the lips, needs to be used. In general, even if two sensors are used, one in the ear canal and the other near the lips, the two sensors consume less energy than if only one sensor near the lips were used, because the sensor near the lips is exposed to ambient sound and consumes more energy to detect an activation utterance than does the sensor 330 and/or 335, which is isolated from the ambient sound.

The sensor 330 may measure otoacoustic emissions generated within the user's ear canal in response to the received sound. Based on the measured otoacoustic emissions, the processors 310, 315, 320, and/or 325 may obtain a user hearing profile that represents how the user perceives the received sound. In other words, the hearing profile may correlate the received frequencies and amplitudes with the perceived frequencies and amplitudes.

Based on the hearing profile, the processors 310, 315, 320, and/or 325 may modify the sound delivered to the user. For example, when the hearing instrument 300 is playing music to the user, the processors 310, 315, 320, and/or 325 may automatically equalize the music before the audio transmitters 360, 365, and/or 367 emit the music to the user. Further, based on the hearing profile, the processors 310, 315, 320, and/or 325 may identify the user. For example, the processors 310, 315, 320, and/or 325 may measure a hearing profile of the user and search a hearing profile database for a matching profile. If the processors 310, 315, 320, and/or 325 find a match, they may identify the user.
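
The following sketch illustrates one way a measured hearing profile could be matched against a stored database to identify the user: treat each profile as a vector of per-frequency-band gains and pick the nearest stored profile within a tolerance. The band representation, distance metric, and threshold are illustrative assumptions, not details of this disclosure.

    # Sketch: nearest-neighbor match of a hearing profile against a database.
    import numpy as np

    def identify_user(measured_profile, profile_db, max_distance=3.0):
        """measured_profile: array of per-band gains (dB).
        profile_db: dict mapping user id -> stored gain array."""
        best_user, best_dist = None, float("inf")
        for user, stored in profile_db.items():
            dist = np.linalg.norm(measured_profile - stored)  # profile distance
            if dist < best_dist:
                best_user, best_dist = user, dist
        return best_user if best_dist <= max_distance else None  # no confident match

    # Usage with two stored users and a fresh measurement:
    db = {"alice": np.array([0.0, 2.0, 5.0]), "bob": np.array([1.0, 1.0, 8.0])}
    print(identify_user(np.array([0.5, 2.0, 5.5]), db))   # -> "alice"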

Creating a hearing profile based on otoacoustic emissions can consume a significant amount of energy. Thus, processors 310 and/or 315 may create a hearing profile while operating in the high-energy mode, or processors 310 and/or 315 may activate processors 320 and/or 325 to operate in the high-energy mode while creating a hearing profile.

Based on the one or more phonemes included in the activation utterance, processor 310, 315, 320, and/or 325 may determine which of sensors 330, 335, 337, 340, 345, and/or 347 is likely to record the one or more phonemes. Processors 310, 315, 320, and/or 325 may rely more on the sensors that are more likely to detect the phonemes than on the other sensors. The one or more sensors that are likely to record the one or more phonemes in the activation utterance may be operated continuously in a high energy mode, while the remaining sensors are operated in a low energy mode until the activation utterance is received. Upon receiving the activation utterance, all of the sensors 330, 335, 337, 340, 345, and/or 347 may operate in the high-energy mode to measure the voice of the user. The processor may select certain phonemes in the activation utterance that are more distinct and/or less frequently used in speech, and may continuously operate in the high energy mode only the one or more sensors that are likely to detect the selected phonemes.

For example, sensor 340 is more likely to record labial phonemes, while sensor 330 is more likely to record laryngeal phonemes. To detect "okay Nura," the sensor 330 may detect the user's voice when the user speaks "okay" and send a signal to the processors 310, 315, 320, and/or 325 to activate the sensor 340. The processors 310, 315, 320, and/or 325 may activate the sensor 340 and receive a recording of the user's voice "Nura" from the sensors 330, 335, 337, 340, 345, and/or 347. Processors 310, 315, 320, and/or 325 may rely on sensors 330 and/or 335 to detect the phonemes u: and a:, while sensors 340 and/or 345 may detect the phonemes n and r. The sensors 330 and/or 335 may pick up the low frequency sounds transmitted by the laryngeal voice transmission region because they can pick up low frequency sounds transmitted using bone conduction. If there is a conflict between the sensors 330, 335 and the sensors 340, 345, where the sensors 340, 345 do not indicate the presence of the phoneme u: and the sensors 330, 335 indicate the presence of the phoneme u:, then the processors 310, 315, 320, 325 may rely on the recordings of the sensors 330, 335 to resolve the conflict, because the sensors 330, 335 are more likely to detect the phoneme u:.
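
The conflict-resolution rule described above can be sketched as a small lookup that records which sensor is most reliable for each phoneme and defers to that sensor when detections disagree. The sensor labels and the phoneme-to-sensor table below are illustrative assumptions rather than part of this disclosure.

    # Sketch: resolve disagreements between sensors by deferring to the sensor
    # best placed to detect the phoneme in question.
    RELIABLE_SENSOR = {
        "u:": "in_ear", "a:": "in_ear",   # bone-conducted laryngeal phonemes
        "n": "lip", "r": "lip",           # phonemes best heard near the mouth
    }

    def resolve(phoneme, detections):
        """detections: dict sensor_name -> bool (did this sensor detect the phoneme?)."""
        trusted = RELIABLE_SENSOR.get(phoneme)
        if trusted in detections:
            return detections[trusted]     # trust the sensor best placed for it
        return any(detections.values())    # otherwise fall back to any detection

    # Example: the lip sensor misses u: but the in-ear sensor hears it -> accept u:.
    print(resolve("u:", {"in_ear": True, "lip": False}))   # True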

Recordings of user speech by sensors 330, 335, 337, 340, 345, and/or 347 may be tagged with the phonemes that each recording is more likely to have detected. The tags may assist in speech recognition. For example, a processor performing speech recognition may receive a recording together with a list of phonemes that the recording is likely to represent correctly. Speech recognition may be performed using artificial intelligence (such as neural networks, statistical modeling systems, etc.).

The hearing instrument 300 may reduce power consumption by switching the second processor 320, 325 to a high-energy mode after the first processor 310, 315 recognizes the activation utterance. The second processor 320, 325 may operate in a low energy mode when not activated and may operate in a high energy mode when activated. When operating in the high energy mode, the second processor 320, 325 may perform speech recognition.

The second processor, which may be processor 325 in fig. 3B, is associated with the ear cup 302. When the second processor 325 is associated with the ear cup 302, the second processor 325 can access the energy source 380, which is larger than the energy source 390 associated with the earpiece 304. The energy source 380 may be larger than the energy source 390 because the ear cup 302 has a larger volume than the earpiece 304, since the ear cup does not have to fit into the user's ear.

The first processor 310, 315 may reduce power consumption by receiving the user's voice from the sensors 330, 335, 340, 345 and transmitting the user's voice to the remote processor to perform voice recognition. The remote processor may be a processor associated with a laptop, a home appliance, a mobile device, an internet server (such as a cloud computer), and the like.

FIG. 4 is a flow chart of a method of performing speech recognition using multiple sensors. In step 400, a plurality of sensors may measure and record a plurality of sounds transmitted by a plurality of voice transmission regions associated with a user's voice system. The sensors may be placed adjacent to multiple voice transmission areas. The speech transmission region is a location along the user's speech system where one of a plurality of sounds associated with the user's speech is audible, as shown in fig. 2. The speech transmission region may include a point of articulation in the user's speech system and the user's ear canal.

The first sensor may be placed at or within an ear canal of the user and may measure the first sound. The first sound may include low frequency speech, as the low frequency speech may be transmitted into the ear canal of the user by bone conduction. The second sensor may be placed outside the user's head, for example near the user's mouth or throat. The second sensor may measure a second sound, which may include high frequency speech. The processor may combine the low and high frequency speech into a recording of the user's audio. The first sound and the second sound may be different aspects of the same sound, where the first sound is the sound detected by the first sensor at the first location and the second sound is the same sound detected by the second sensor at the second location. For example, when the user speaks, the first sound may be the user's voice detected at the entrance of or within the ear canal, and the second sound may be the user's voice detected at the user's mouth or throat.

In step 410, the processor may enhance the accuracy of the speech recognition algorithm by determining a difference between a portion of the first sound and a portion of the second sound, and selecting a portion of the first sound or a portion of the second sound based on the difference, or modifying a probability of phoneme prediction based on difference information recorded from each sensor, as described herein. The difference information may include differences between the amplitude and/or phase/delay/latency between two or more sounds measured by two or more different sensors.

The processor may reconstruct the user's voice by selecting a first voice portion from the first sound and a second voice portion from the second sound based on criteria indicating which of the first and second sensors better senses the first voice portion and the second voice portion. For example, the criteria may specify a frequency range and/or phonemes to be used in selecting the appropriate sensor.

When the criteria specify a phoneme, the processor may select the phoneme from sounds recorded by the sensor that is placed closer to the transmission source of the phoneme than the other sensors. For example, the processor may select a labial phoneme from the second sound when the second sensor is placed closer to the user's lips than the first sensor. In another example, the processor may select a laryngeal phoneme from the sensor placed closest to the user's throat.

When the criteria specify a frequency range, the processor may select low frequency sounds from a first sensor placed at the entrance of or within the user's ear canal and high frequency sounds from a second sensor placed outside the user's head.

The processor may identify an activation utterance based on a plurality of recordings of a plurality of sounds. The activation utterance may be a word or phrase, such as "Nura," "hey Nura," "okay Nura," or the like.

The processor may facilitate recognition of the user's speech when recognizing the activation utterance. The processor may save energy by operating in a low energy mode while waiting for the activation utterance. The processor may switch to the high energy mode while performing user speech recognition, or may send one or more recordings of the user's speech to another processor operating in the high energy mode.

Multiple sensors outside the user's head may form a sensor array, where each sensor receives the same sound with a unique time delay. The processor may determine the time delay for receiving the same sound at each of the plurality of sensors. The processor may identify the source of the sound based on the time delays and may extract from the sound the portion that belongs to the desired source. For example, the processor may filter out ambient speech and/or noise from the sound to isolate the desired source of sound, i.e., the user's speech.

Energy consumption may be reduced by having only a subset of the plurality of sensors operate in a high energy mode while the remaining sensors operate in a low energy mode or are completely shut down. The subset of sensors is better at detecting user speech than the remaining sensors because the sensors may be surrounded by a housing that attenuates ambient sound reaching the subset of sensors. The subset of sensors, upon detecting the user's voice, may send a notification to the processor indicating detection of the user's voice. Upon receiving the notification, the processor may activate the remaining sensors of the plurality of sensors to record the voice of the user.

The one or more sensors may measure otoacoustic emissions generated within the ear canal of the user in response to the received sound. The sensor may be placed at the entrance or in the ear canal of the user. The processor may obtain a hearing profile of the user based on the measured otoacoustic emissions. The profile may indicate how the user perceives the sound by correlating the received frequencies and amplitudes with the perceived frequencies and amplitudes.

Based on the user's hearing profile, the processor may modify the sound delivered to the user. For example, the processor may increase the user's enjoyment of music by matching the user's perception of the music to the intended perception (e.g., as intended by the artist who recorded the song). Further, based on the user's hearing profile, the processor may verify the user's identity, because the user's hearing profile is unique to the user.

To detect the activation utterance, the processor may determine a sensor of the plurality of sensors that is more likely to record one or more phonemes included in the activation utterance. For example, if the activation word is "okay Nura," the phoneme n is more likely to be detected by a sensor placed near the mouth. The processor may identify the activation utterance when the selected sensor records the one or more phonemes. In another example, the processor may identify the activation utterance by weighting the recorded phonemes from the multiple streams such that sensors that are likely to record the phonemes are weighted more heavily than sensors that are unlikely to record them. In other words, the processor may rely more on that sensor to detect the phonemes than on the remaining sensors of the plurality of sensors.

Computer

Fig. 5 is a diagrammatic representation of machine in the example form of a computer system 500 within which a set of instructions, for causing the machine to perform any one or more of the methodologies or modules discussed herein, may be executed.

In the example of FIG. 5, computer system 500 includes a processor, memory, non-volatile storage, and interface devices. Various general-purpose components (e.g., cache memory) are omitted for simplicity of illustration. Computer system 500 is intended to illustrate a hardware device on which any of the components described in the examples of fig. 1-4 (as well as any other components described in this specification) may be implemented. Computer system 500 may be of any suitable known or convenient type. The components of computer system 500 may be coupled together via a bus or by some other known or convenient means.

The processor of the computer system 500 may be the processor associated with the hearing instrument 300 in fig. 3A-3B. The processor of the computer system 500 may perform the various methods described herein. The non-volatile memory and/or drive unit may store a database containing a variety of hearing profiles as described herein. The network interface device of computer system 500 may be transceiver 350, 355, and/or 357 in fig. 3A-3B.

The present disclosure encompasses computer system 500 taking any suitable physical form. By way of example and not limitation, computer system 500 may be an embedded computer system, a system on a chip (SOC), a single-board computer system (SBC) (such as, for example, a computer-on-module (COM) or a system-on-module (SOM)), a desktop computer system, a laptop or notebook computer system, an interactive kiosk, a mainframe, a mesh of computer systems, a mobile phone, a personal digital assistant (PDA), a server, or a combination of two or more of these. Where appropriate, computer system 500 may include one or more computer systems 500, be unitary or distributed, span multiple locations, span multiple machines, or reside in a cloud, which may include one or more cloud components in one or more networks. Where appropriate, one or more computer systems 500 may perform without substantial spatial or temporal limitation one or more steps of one or more methods described or illustrated herein. By way of example, and not limitation, one or more computer systems 500 may perform in real time or in batch mode one or more steps of one or more methods described or illustrated herein. One or more computer systems 500 may perform at different times or at different locations one or more steps of one or more methods described or illustrated herein, where appropriate.

The processor may be, for example, a conventional microprocessor, such as an Intel Pentium microprocessor or a Motorola PowerPC microprocessor. One skilled in the relevant art will recognize that the terms "machine-readable (storage) medium" or "computer-readable (storage) medium" include any type of device that is accessible by the processor.

The memory is coupled to the processor by, for example, a bus. The memory may include, by way of example and not limitation, random access memory (RAM), such as dynamic RAM (DRAM) and static RAM (SRAM). The memory may be local, remote, or distributed.

The bus also couples the processor to the non-volatile memory and the drive unit. Non-volatile memory is typically a magnetic floppy disk or hard disk, a magneto-optical disk, an optical disk, a Read Only Memory (ROM) (such as a CD-ROM, EPROM, or EEPROM), a magnetic or optical card, or another form of storage of large amounts of data. During execution of software in the computer 500, some of this data is often written to memory by direct memory access programs. The non-volatile memory may be local, remote, or distributed. Non-volatile memory is optional, as the system can be created using all applicable data available in the memory. A typical computer system will usually include at least a processor, memory, and a device (e.g., a bus) coupling the memory to the processor.

The software is typically stored in a non-volatile memory and/or in a drive unit. In fact, it may not even be possible to store the entire large program in memory. However, it should be understood that in order for the software to run, it is moved to a computer readable location suitable for processing, if necessary, and for purposes of illustration, this location is referred to herein as memory. Even if software is moved into storage for execution, processors typically use hardware registers to store values associated with the software, and ideally local caches to speed up execution. As used herein, when a software program is referred to as being "implemented on a computer-readable medium," the software program is assumed to be stored in any known or convenient location (from non-volatile memory to hardware registers). A processor is considered "configured to execute a program" when at least one value associated with the program is stored in a register readable by the processor.

The bus also couples the processor to a network interface device. The interface may include one or more of a modem or a network interface. It should be appreciated that a modem or network interface can be considered to be part of computer system 500. The interface may include an analog modem, an ISDN modem, a cable modem, a token ring interface, a satellite transmission interface (e.g., "direct PC"), or other interfaces for coupling a computer system to other computer systems. An interface may include one or more input and/or output devices. I/O devices can include, by way of example and not limitation, a keyboard, a mouse or other pointing device, a disk drive, a printer, a scanner, and other input and/or output devices, including a display device. The display device may include, by way of example and not limitation, a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), or some other suitable known or convenient display device. For simplicity, it is assumed that the controller of any device not depicted in the example of FIG. 5 resides at the interface.

In operation, computer system 500 may be controlled by operating system software, including a file management system, such as a disk operating system. One example of operating system software with its associated file management system software is the Windows® family of operating systems from Microsoft Corporation of Redmond, Washington, and its associated file management system. Another example of operating system software and its associated file management system software is the Linux™ operating system and its associated file management system. The file management system is typically stored in the non-volatile memory and/or drive unit and causes the processor to perform the various actions required by the operating system to input and output data and to store data in memory, including storing files in the non-volatile memory and/or drive unit.

Some portions of the detailed description may be presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, considered to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the description, discussions utilizing terms such as "processing" or "computing" or "calculating" or "determining" or "displaying" or "generating" or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the method of some embodiments. The required structure for a variety of these systems will appear from the description below. Moreover, these techniques are not described with reference to any particular programming language, and thus various embodiments may be implemented using a variety of programming languages.

In alternative embodiments, the machine operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine may operate in the capacity of a server or a client machine in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.

The machine may be a server computer, a client computer, a Personal Computer (PC), a tablet PC, a laptop computer, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, an iPhone, a BlackBerry, a processor, a telephone, a network appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.

While the machine-readable medium or machine-readable storage medium is shown in an exemplary embodiment to be a single medium, the terms "machine-readable medium" and "machine-readable storage medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The terms "machine-readable medium" and "machine-readable storage medium" shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the machine and that causes the machine to perform any one or more of the methodologies or modules of the techniques and innovations of the present disclosure.

In general, the routines executed to implement the embodiments of the disclosure may be implemented as part of an operating system or as a specific application, component, program, object, module, or sequence of instructions referred to as a "computer program". The computer programs typically comprise one or more instructions disposed in various memory and storage devices in the computer at various times and, when read and executed by one or more processing units or processors in the computer, cause the computer to perform operations to execute elements relating to aspects of the present disclosure.
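
Purely as an illustrative sketch (recognize_from_measurements and the variable names below are hypothetical placeholders, not functions defined by this disclosure), such a computer program could package the speech-processing routines behind a single entry point whose instructions the operating system loads into memory and one or more processors then execute:

```python
# Hypothetical sketch of a "computer program" entry point: instructions stored
# on disk are loaded into memory and executed by one or more processors.
# recognize_from_measurements is a placeholder for the disclosure's routines.
from typing import Sequence


def recognize_from_measurements(first_measurement: Sequence[float],
                                second_measurement: Sequence[float]) -> str:
    """Placeholder for routines that combine the two sensor measurements."""
    # A real implementation would process the measurements as described
    # elsewhere in the disclosure; this stub simply returns an empty result.
    return ""


def main() -> None:
    # In practice these would be measurements produced by the sensors
    # described in the disclosure; here they are empty placeholders.
    first: list = []
    second: list = []
    print(recognize_from_measurements(first, second))


if __name__ == "__main__":
    main()
```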

Moreover, while embodiments have been described in the context of fully functioning computers and computer systems, those skilled in the art will appreciate that the various embodiments are capable of being distributed as a program product in a variety of forms, and that the disclosure applies equally regardless of the particular type of machine or computer readable media used to actually effect the distribution.

Machine-readable storage media, machine-readable media, or other examples of computer-readable (storage) media include, but are not limited to, recordable-type media such as volatile and non-volatile memory devices, floppy and other removable disks, hard disk drives, and optical disks (e.g., Compact Disk Read-Only Memories (CD-ROMs), Digital Versatile Disks (DVDs), etc.), among others, and transmission-type media such as digital and analog communication links.

In some cases, the operation of a memory device, such as a change in state from a binary one to a binary zero or vice versa, may include a transition, such as a physical transition. For a particular type of memory device, such a physical transition may involve a physical transformation of an article to a different state or thing. For example, and without limitation, for some types of memory devices, the change in state may involve the accumulation and storage of charge or the release of stored charge. Likewise, in other memory devices, the change in state may include a physical change or transition in magnetic orientation, or a physical change or transition in molecular structure, such as a transition from crystalline to amorphous or vice versa. The foregoing is not intended to be an exhaustive list of all examples in which a change in state from a binary one to a binary zero or vice versa in a memory device may include a transition, such as a physical transition. Rather, the foregoing is intended as an illustrative example.

The storage medium may typically be non-transitory or contain a non-transitory device. In this context, a non-transitory storage medium may comprise a tangible device, meaning that the device has a concrete physical form, although the device may change its physical state. Thus, for example, non-transitory means that the device remains tangible despite such a change in state.

Remarks

The foregoing description of various embodiments of the claimed subject matter has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the claimed subject matter to the precise forms disclosed. Many modifications and variations will be apparent to practitioners skilled in the art. The embodiments were chosen and described in order to best describe the principles of the invention and its practical application, thereby enabling others skilled in the relevant art to understand the claimed subject matter, the various embodiments, and the various modifications that are suited to the particular use contemplated.

While the above detailed description describes certain embodiments and the best mode contemplated, no matter how detailed the above appears in text, the embodiments can be practiced in many ways. The systems and methods may vary considerably in their implementation details while still being encompassed by this description. As noted above, particular terminology used when describing certain features or aspects of various embodiments should not be taken to imply that the terminology is being redefined herein to be restricted to any specific characteristics, features, or aspects of the invention with which that terminology is associated. In general, the terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification, unless those terms are explicitly defined herein. Accordingly, the actual scope of the invention encompasses not only the disclosed embodiments, but also all equivalent ways of practicing or implementing the embodiments under the claims.

The language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, it is intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of various embodiments is intended to be illustrative, but not limiting, of the scope of embodiments, which is set forth in the following claims.
