Dual pipeline architecture for wake phrase detection with voice onset detection

Document No. 958604, published 2020-10-30

Reading note: This technology, "Dual pipeline architecture for wake phrase detection with voice onset detection", was created by 维克托·西米利斯基 and 罗伯特·措普夫 on 2019-02-22. Its main content is as follows. The phrase detection apparatus includes a high-latency pipeline that communicates a first portion of audio data from an audio data source to a processing unit, where the high-latency pipeline includes a history buffer that stores the first portion of the audio data, and a low-latency pipeline that communicates a second portion of the audio data from the audio data source to the processing unit with a lower latency than the high-latency pipeline. A sound onset detector coupled with the audio data source detects a sound onset event based on the audio data. In response to the sound onset event, a synchronization circuit coupled with the high-latency pipeline and the low-latency pipeline synchronizes the output of the first portion of the audio data from the history buffer to the processing unit with the output of the second portion of the audio data to the processing unit via the low-latency pipeline.
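The synchronization step in the abstract can be illustrated with a small software model. The Python sketch below is a hypothetical model, not the patented circuit: the class and method names are invented for illustration, both pipelines are assumed to carry the same sample rate (claims 3 and 17 allow the history path to run at a lower rate), the history buffer is assumed to stop accepting samples once the onset event fires, and a deque stands in for the low-latency stream. The recorded buffer position plays the role of the "recorded memory location" in claims 6 and 19.

```python
from collections import deque
from typing import Deque, List, Optional


class DualPipelineModel:
    """Hypothetical software model of the two paths described in the abstract.

    Assumptions: both paths use the same sample rate, the circular history
    buffer stops accepting samples at the onset event, and a deque stands in
    for the low-latency stream to the processing unit.
    """

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be >= 1")
        self.capacity = capacity
        self.history = [0] * capacity          # circular history buffer
        self.write_pos = 0                     # next slot to overwrite
        self.filled = 0                        # number of valid samples
        self.handoff: Optional[int] = None     # recorded memory location
        self.live: Deque[int] = deque()        # low-latency stream

    def on_sample(self, sample: int) -> None:
        """Route one sample from the audio data source."""
        if self.handoff is None:
            # Before onset: keep only the most recent `capacity` samples.
            self.history[self.write_pos] = sample
            self.write_pos = (self.write_pos + 1) % self.capacity
            self.filled = min(self.filled + 1, self.capacity)
        else:
            # After onset: samples go out over the low-latency stream.
            self.live.append(sample)

    def on_onset(self) -> None:
        """Synchronization: record the buffer slot matching the time of the
        first sample that will travel over the low-latency stream."""
        self.handoff = self.write_pos

    def drain(self) -> List[int]:
        """Processing unit: read history from the oldest valid slot until the
        recorded location is reached, then continue with the live stream."""
        if self.handoff is None:
            raise RuntimeError("no sound onset event has been recorded")
        out: List[int] = []
        pos = (self.write_pos - self.filled) % self.capacity
        for _ in range(self.filled):
            out.append(self.history[pos])
            pos = (pos + 1) % self.capacity
            if pos == self.handoff:
                break                          # reached the recorded location
        while self.live:
            out.append(self.live.popleft())
        return out
```

Feeding samples 1 to 6 into a four-slot model, signalling onset, and then feeding 7 to 9 makes `drain()` return `[3, 4, 5, 6, 7, 8, 9]`. Samples 1 and 2 were overwritten in the circular buffer, and stopping at the recorded location lets the history and live paths join without a gap or a repeated sample.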

1. An apparatus, comprising:

a high-latency pipeline configured to communicate a first portion of audio data from an audio data source to a processing unit, wherein the high-latency pipeline includes a history buffer configured to store the first portion of audio data;

a low-latency pipeline configured to communicate a second portion of the audio data from the audio data source to the processing unit with a lower latency than the high-latency pipeline;

a sound onset detector coupled with the audio data source and configured to detect a sound onset event based on the audio data; and

a synchronization circuit coupled with the high-latency pipeline and the low-latency pipeline and configured to synchronize an output of the first portion of the audio data from the history buffer to the processing unit and an output of the second portion of the audio data to the processing unit via the low-latency pipeline in response to the sound onset event.

2. The apparatus of claim 1, wherein the high-latency pipeline further comprises a decimator coupled with the audio data source and configured to reduce a sampling rate of the first portion of audio data, wherein the sound onset detector detects the sound onset event based on the first portion of audio data at the reduced sampling rate.

3. The apparatus of claim 1, wherein the first portion of audio data has a lower sampling rate than the second portion of audio data.

4. The apparatus of claim 1, wherein:

the history buffer comprises a circular buffer configured to store the first portion of the audio data; and

the first portion of audio data comprises a fixed number of most recent data samples from the audio data source.

5. The apparatus of claim 1, wherein:

the history buffer is further configured to asynchronously provide samples of the first portion of the audio data to the processing unit; and

the low-latency pipeline is further configured to transmit the second portion of the audio data as a data stream.

6. The apparatus of claim 1, wherein the synchronization circuit is coupled with the processing unit and is further configured to synchronize output by recording a memory location in the history buffer corresponding to a time of an initial data sample of the second portion of the audio data transferred to the processing unit.

7. The apparatus of claim 1, wherein:

the sound onset detector and the history buffer are supplied with power in a first power domain;

the processing unit is supplied with power in a second power domain that is isolated from the first power domain; and

the apparatus further comprises a power management unit coupled with the sound onset detector and configured to enable the second power domain in response to the sound onset detector detecting the sound onset event.

8. The apparatus of claim 1, further comprising the audio data source, wherein the audio data source comprises:

a pulse density modulator configured to generate a pulse density modulated bitstream based on a transducer signal; and

a decimator coupled with the pulse density modulator and configured to generate the audio data by reducing a sampling rate of the pulse density modulated bitstream, wherein an output of the decimator is coupled with the high-latency pipeline and with the low-latency pipeline.

9. The apparatus of claim 1, wherein:

the sound onset detector comprises a speech onset detector; and

the sound onset event comprises a speech onset event.

10. A method, comprising:

transmitting a first portion of audio data from an audio data source to a processing unit through a high-latency pipeline;

storing the first portion of the audio data in a history buffer in the high-latency pipeline;

transmitting a second portion of the audio data from the audio data source to the processing unit through a low-latency pipeline with a lower latency than the high-latency pipeline;

detecting a speech onset event based on the audio data; and

in response to the speech onset event, synchronizing an output of the first portion of the audio data from the history buffer to the processing unit and an output of the second portion of the audio data to the processing unit via the low-latency pipeline.

11. The method of claim 10, further comprising reducing a sampling rate of the first portion of audio data, wherein detecting the speech onset event is based on the first portion of audio data at the reduced sampling rate, and wherein the first portion of audio data has a lower sampling rate than the second portion of audio data.

12. The method of claim 10, wherein storing the first portion of the audio data in the history buffer comprises storing a fixed number of most recent data samples from the audio data source in the history buffer, wherein the history buffer is a circular buffer.

13. The method of claim 10, further comprising:

asynchronously providing samples of the first portion of the audio data from the history buffer to the processing unit; and

transmitting the second portion of the audio data as a data stream via the low-latency pipeline.

14. The method of claim 10, further comprising:

synchronizing output by recording a memory location in the history buffer corresponding to a time of an initial data sample of the second portion of the audio data transferred to the processing unit.

15. The method of claim 10, wherein:

detecting the speech onset event is performed by a speech onset detector; and

the method further comprises:

supplying power to the speech onset detector and the history buffer in a first power domain;

supplying power to the processing unit in a second power domain isolated from the first power domain; and

enabling the second power domain in response to detecting the speech onset event.

16. The method of claim 10, further comprising generating the audio data by:

generating a pulse density modulated bitstream based on a transducer signal; and

reducing a sampling rate of the pulse density modulated bitstream.

17. A system, comprising:

an audio transducer for capturing audio data;

a processing unit configured to identify a sound pattern in the audio data;

a low sample rate pipeline configured to communicate a first portion of the audio data to the processing unit, wherein the low sample rate pipeline includes a history buffer configured to store the first portion of the audio data;

a high sample rate pipeline configured to communicate a second portion of the audio data to the processing unit at a higher sample rate than the low sample rate pipeline;

a sound onset detector coupled with the audio transducer and configured to detect a sound onset event based on the audio data; and

a synchronization circuit coupled with the low sample rate pipeline and the high sample rate pipeline and configured to synchronize an output of the first portion of the audio data from the history buffer to the processing unit and an output of the second portion of the audio data to the processing unit via the high sample rate pipeline in response to the sound onset event.

18. The system of claim 17, wherein:

the sound onset detector and the history buffer are supplied with power in a first power domain;

the processing unit is supplied with power in a second power domain that is isolated from the first power domain; and

the system further comprises a power management unit coupled with the sound onset detector and configured to enable the second power domain in response to the sound onset detector detecting the sound onset event.

19. The system of claim 17, wherein:

the synchronization circuit is further configured to synchronize the output by recording a memory location in the history buffer corresponding to a time of an initial data sample of the second portion of the audio data transferred to the processing unit; and

the processing unit is further configured to:

retrieve samples of the first portion of the audio data from a sequence of memory locations in the history buffer until the recorded memory location is reached; and

retrieve the initial data sample of the second portion of the audio data in response to reaching the recorded memory location.

20. The system of claim 17, wherein:

the sound pattern comprises a wake phrase;

the processing unit is further configured to start a speech recognition engine in response to recognizing the wake phrase; and

the speech recognition engine is configured to recognize one or more voice commands from the second portion of the audio data communicated via the high sample rate pipeline.

21. The system of claim 20, further comprising:

a network adapter configured to transmit one or more network messages in response to the recognized one or more voice commands; and

a speaker device configured to generate an acoustic output in response to the recognized one or more voice commands.
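Claims 2, 7, and 8 describe the always-on front end: a pulse density modulator and decimator produce the audio data, a further decimation lowers the sampling rate of the history path, the sound onset detector runs on that reduced-rate data, and a power management unit enables the processing unit's power domain on onset. The Python sketch below chains these stages under loudly stated assumptions: the first-order sigma-delta modulator, the boxcar-averaging decimators, the fixed energy-threshold onset detector, and every function name and default parameter are hypothetical illustrative choices, not the techniques or values disclosed in this document.

```python
from typing import List, Sequence


def pdm_modulate(signal: Sequence[float]) -> List[int]:
    """First-order sigma-delta modulator (assumed design): maps samples in
    [-1, 1] to a 1-bit stream of +1/-1 whose pulse density tracks amplitude."""
    bits: List[int] = []
    integrator = 0.0
    feedback = 0.0
    for x in signal:
        integrator += x - feedback
        bit = 1 if integrator >= 0 else -1
        bits.append(bit)
        feedback = float(bit)
    return bits


def decimate(samples: Sequence[float], factor: int) -> List[float]:
    """Lower the sampling rate by `factor` by averaging each block of
    `factor` inputs (a boxcar filter; practical designs use CIC/FIR stages)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(samples) - len(samples) % factor
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]


def detect_onset(samples: Sequence[float], frame_len: int, threshold: float) -> int:
    """Energy-threshold stand-in for the sound onset detector: returns the
    index of the first frame whose mean energy exceeds `threshold`, else -1."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if sum(x * x for x in frame) / frame_len > threshold:
            return start // frame_len
    return -1


class PowerManagementUnit:
    """Hypothetical PMU: the first (always-on) domain powers the onset detector
    and history buffer; the isolated second domain starts disabled."""

    def __init__(self) -> None:
        self.second_domain_on = False

    def on_sound_onset(self) -> None:
        # Enable the processing unit's power domain in response to onset.
        self.second_domain_on = True


def front_end(signal: Sequence[float], pmu: PowerManagementUnit,
              pdm_factor: int = 8, low_rate_factor: int = 2,
              frame_len: int = 4, threshold: float = 0.1) -> int:
    """PDM bitstream -> audio data -> reduced-rate history path -> onset
    detection -> enable second power domain. Returns the onset frame or -1."""
    audio = decimate(pdm_modulate(signal), pdm_factor)
    low_rate = decimate(audio, low_rate_factor)
    frame = detect_onset(low_rate, frame_len, threshold)
    if frame >= 0:
        pmu.on_sound_onset()
    return frame
```

With `[0.0] * 256 + [0.5] * 256` as input and the default parameters, the quiet half yields zero-energy frames, so `front_end` returns 4 (the first low-rate frame of the louder half) and switches on the second power domain; an all-zero input returns -1 and leaves that domain off. Running the detector on the decimated stream mirrors claim 2, which places onset detection on the reduced-rate first portion.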

Technical Field

The present disclosure relates to the field of speech recognition, and in particular to speech onset detection and wake phrase detection.

Background

More and more modern computing devices offer speech recognition, allowing users to perform a wide variety of computing tasks through voice commands and natural speech. Devices such as mobile phones and smart speakers provide integrated virtual assistants that respond to a user's commands or natural-language requests by communicating over a local area network and/or a wide area network to retrieve requested information or to control other devices (e.g., lights, heating and air-conditioning controllers, or audio and video equipment). Devices with speech recognition capabilities often remain in a low-power mode until a particular word or phrase (i.e., the wake phrase) is spoken; once activated in this way, the device lets the user control it with voice commands.

However, because part of the device (including the microphone and some voice detection circuitry) remains powered on for long periods, wake phrase detection increases power consumption. Furthermore, the additional circuitry for wake phrase detection may add latency, which appears as slower response times once general speech recognition is under way.
