Audio processing in low bandwidth networked systems
Abstract: Audio processing in low bandwidth networked systems, by Jeremy Payne and Tomer Amarilio, dated 2018-08-01. The present disclosure relates generally to a system for detecting an activation phrase within an input audio signal transmitted over a low bandwidth network. The system may use a two-stage activation phrase detection process. First, a sensing device, which may include multiple microphones for detecting an input audio signal, may detect an input audio signal that includes a candidate activation phrase. Second, the sensing device may send a recording of the input audio signal to a client device to confirm that the input audio signal includes the activation phrase.
1. A system for detecting an activation phrase in a remote device, comprising:
a natural language processor component executed by a first client device to:
receive a first instance of a first input audio signal detected by a first microphone of a sensing device;
parse the first instance of the first input audio signal to identify a first candidate activation phrase in the first instance of the first input audio signal;
determine that the first candidate activation phrase does not comprise a predetermined activation phrase;
receive a second instance of the first input audio signal detected by a second microphone of the sensing device;
parse the second instance of the first input audio signal to identify a second candidate activation phrase in the second instance of the first input audio signal;
determine that the second candidate activation phrase comprises the predetermined activation phrase; and
an interface of the first client device to:
based on determining that the second candidate activation phrase includes the predetermined activation phrase, send an audio signal associated with at least one of the first instance of the first input audio signal and the second instance of the first input audio signal to a data processing system, the data processing system including a second natural language processor component to identify a request in at least one of the first instance of the first input audio signal and the second instance of the first input audio signal.
2. The system of claim 1, comprising:
the interface to send, from the first client device to the sensing device, a request for the second instance of the first input audio signal based on a determination that the first candidate activation phrase does not include the predetermined activation phrase.
3. The system of claim 1 or 2, comprising:
the natural language processor component to:
receive a first instance of a second input audio signal detected by the first microphone of the sensing device;
parse the first instance of the second input audio signal to identify a third candidate activation phrase;
determine that the third candidate activation phrase comprises the predetermined activation phrase; and
the interface to send a request for a third input audio signal to the sensing device based on determining that the third candidate activation phrase comprises the predetermined activation phrase.
4. The system of any preceding claim, comprising:
the natural language processor component to:
receive a first instance of a second input audio signal detected by the first microphone of the sensing device;
parse the first instance of the second input audio signal to identify a third candidate activation phrase;
determine that the third candidate activation phrase comprises the predetermined activation phrase; and
the interface to terminate receiving the second instance of the second input audio signal based on determining that the third candidate activation phrase comprises the predetermined activation phrase.
5. The system of any preceding claim, comprising:
the interface to establish a Bluetooth connection between the first client device and the sensing device.
6. The system of any preceding claim, comprising the natural language processor component to:
receive a third instance of the first input audio signal from a sensor of the first client device;
parse the third instance of the first input audio signal to identify a third candidate activation phrase in the third instance of the first input audio signal; and
determine that the first input audio signal includes the predetermined activation phrase based at least on the third candidate activation phrase and the second candidate activation phrase.
7. The system of any preceding claim, comprising:
the natural language processor component to:
receive a first instance of a second input audio signal detected by a first microphone of a second sensing device;
parse the first instance of the second input audio signal to identify a third candidate activation phrase in the first instance of the second input audio signal;
determine that the third candidate activation phrase does not comprise a predetermined activation phrase;
receive a second instance of the second input audio signal detected by the first microphone of the second sensing device, the second instance of the second input audio signal having a lower compression rate than the first instance of the second input audio signal;
parse the second instance of the second input audio signal to identify a fourth candidate activation phrase in the second instance of the second input audio signal;
determine that the fourth candidate activation phrase comprises the predetermined activation phrase; and
the interface of the first client device to:
based on determining that the fourth candidate activation phrase includes the predetermined activation phrase, send at least one of the first instance of the second input audio signal and the second instance of the second input audio signal to the data processing system including the second natural language processor component to identify a second request in at least one of the first instance of the second input audio signal and the second instance of the second input audio signal.
8. A system for transmitting data in a voice activated network, comprising:
a first microphone of a sensing device for receiving a first instance of a first input audio signal;
a second microphone of the sensing device to receive a second instance of the first input audio signal;
a natural language processor component executed by the sensing device to parse the first instance of the first input audio signal to identify an activation phrase;
an interface of the sensing device to send, at a first point in time, the first instance of the first input audio signal to a client device based on the identification of the activation phrase in the first instance of the first input audio signal, the client device including a second natural language processor component;
the interface of the sensing device to send the second instance of the first input audio signal to the client device at a second point in time after the first point in time; and
the interface of the sensing device to send an audio signal associated with at least one of the first instance of the first input audio signal and the second instance of the first input audio signal to the client device based on an acknowledgement message from the client device of the identification of the activation phrase in the second instance of the first input audio signal.
9. The system of claim 8, comprising the interface to:
at the first point in time, send the first instance of the first input audio signal to the client device at a first compression level; and
at the second point in time, send the second instance of the first input audio signal to the client device at a second compression level that is lower than the first compression level.
10. The system of claim 8 or 9, comprising the interface to:
at the first point in time, send the second instance of the first input audio signal to the client device at a first compression level; and
at the second point in time, send the first instance of the first input audio signal and the second instance of the input audio signal to the client device at a second compression level that is lower than the first compression level.
11. The system of any of claims 8, 9 or 10, comprising the interface to:
at the second point in time, send the second instance of the first input audio signal to the client device based on a confirmation message that the activation phrase is not in the first instance of the input audio signal.
12. The system of any of claims 8 to 11, comprising the interface to:
at the second point in time, send the second instance of the first input audio signal to the client device prior to receiving a confirmation message that the activation phrase is not in the first instance of the input audio signal.
13. The system of claim 12, comprising:
the interface to terminate the sending of the second instance of the first input audio signal based on an acknowledgement message of the activation phrase in the first instance of the input audio signal.
14. The system of any of claims 8 to 13, comprising the interface to:
establish a Bluetooth connection with the client device; and
transmit the first instance of the first input audio signal and the second instance of the first input audio signal over the Bluetooth connection.
15. A method of transmitting data in a voice activated network, comprising:
receiving, by a first microphone of a sensing device, a first instance of a first input audio signal;
receiving, by a second microphone of the sensing device, a second instance of the first input audio signal;
parsing, by a natural language processor component executed by the sensing device, the first instance of the first input audio signal to identify an activation phrase;
sending, by an interface of the sensing device, the first instance of the first input audio signal to a client device at a first point in time based on the identification of the activation phrase in the first instance of the first input audio signal, the client device comprising a second natural language processor component;
sending, by the interface of the sensing device, the second instance of the first input audio signal to the client device at a second point in time after the first point in time; and
sending, by the interface of the sensing device, an audio signal associated with at least one of the first instance of the first input audio signal and the second instance of the first input audio signal based on a confirmation message from the client device of the identification of the activation phrase in the second instance of the first input audio signal.
16. The method of claim 15, comprising:
sending, by the interface, the first instance of the first input audio signal to the client device at the first point in time at a first compression level; and
sending, by the interface, the second instance of the first input audio signal to the client device at a second compression level lower than the first compression level at the second point in time.
17. The method according to claim 15 or 16, comprising:
sending, by the interface, the second instance of the first input audio signal to the client device at the first point in time at a first compression level; and
sending, by the interface, the first instance of the first input audio signal and the second instance of the input audio signal to the client device at a second compression level lower than the first compression level at the second point in time.
18. The method of any of claims 15 to 17, comprising:
sending, by the interface, the second instance of the first input audio signal to the client device at the second point in time based on a confirmation message that the activation phrase is not in the first instance of the input audio signal.
19. The method of any of claims 15 to 18, comprising:
sending, by the interface, the second instance of the first input audio signal to the client device at the second point in time prior to receiving a confirmation message that the activation phrase is not in the first instance of the input audio signal.
20. The method of claim 19, comprising:
terminating, by the interface, the sending of the second instance of the first input audio signal based on an acknowledgement message of the activation phrase in the first instance of the input audio signal.
21. A system for detecting an activation phrase in a remote device, comprising:
a natural language processor component executed by a first client device to:
receive a first instance of a first input audio signal detected by a first microphone of a sensing device;
parse the first instance of the first input audio signal to identify a first candidate activation phrase in the first instance of the first input audio signal;
determine that the first candidate activation phrase does not comprise a predetermined activation phrase;
receive a second instance of the first input audio signal detected by a second microphone of the sensing device;
parse the second instance of the first input audio signal to identify a second candidate activation phrase in the second instance of the first input audio signal;
determine that the second candidate activation phrase comprises the predetermined activation phrase; and
an interface of the first client device to:
based on determining that the second candidate activation phrase includes the predetermined activation phrase, send at least one of the first instance of the first input audio signal and the second instance of the first input audio signal to a data processing system comprising a second natural language processor component to identify a request in at least one of the first instance of the first input audio signal and the second instance of the first input audio signal.
22. The system of claim 21, comprising:
the interface to send, from the first client device to the sensing device, a request for the second instance of the first input audio signal based on a determination that the first candidate activation phrase does not include the predetermined activation phrase.
23. The system of claim 21, comprising:
the natural language processor component to:
receive a first instance of a second input audio signal detected by the first microphone of the sensing device;
parse the first instance of the second input audio signal to identify a third candidate activation phrase;
determine that the third candidate activation phrase comprises the predetermined activation phrase; and
the interface to send a request for a third input audio signal to the sensing device based on determining that the third candidate activation phrase comprises the predetermined activation phrase.
24. The system of claim 21, comprising:
the natural language processor component to:
receive a first instance of a second input audio signal detected by the first microphone of the sensing device;
parse the first instance of the second input audio signal to identify a third candidate activation phrase;
determine that the third candidate activation phrase comprises the predetermined activation phrase; and
the interface to terminate receiving the second instance of the second input audio signal based on determining that the third candidate activation phrase comprises the predetermined activation phrase.
25. The system of claim 21, comprising:
the interface to establish a Bluetooth connection between the first client device and the sensing device.
26. The system of claim 21, comprising the natural language processor component to:
receive a third instance of the first input audio signal from a sensor of the first client device;
parse the third instance of the first input audio signal to identify a third candidate activation phrase in the third instance of the first input audio signal; and
determine that the first input audio signal includes the predetermined activation phrase based at least on the third candidate activation phrase and the second candidate activation phrase.
27. The system of claim 21, comprising:
the natural language processor component to:
receive a first instance of a second input audio signal detected by a first microphone of a second sensing device;
parse the first instance of the second input audio signal to identify a third candidate activation phrase in the first instance of the second input audio signal;
determine that the third candidate activation phrase does not comprise a predetermined activation phrase;
receive a second instance of the second input audio signal detected by the first microphone of the second sensing device, the second instance of the second input audio signal having a lower compression rate than the first instance of the second input audio signal;
parse the second instance of the second input audio signal to identify a fourth candidate activation phrase in the second instance of the second input audio signal;
determine that the fourth candidate activation phrase comprises the predetermined activation phrase; and
an interface of the first client device to:
based on determining that the fourth candidate activation phrase includes the predetermined activation phrase, send at least one of the first instance of the second input audio signal and the second instance of the second input audio signal to the data processing system including the second natural language processor component to identify a second request in at least one of the first instance of the second input audio signal and the second instance of the second input audio signal.
28. A system for transmitting data in a voice activated network, comprising:
a first microphone of a sensing device to receive a first instance of a first input audio signal and a first instance of a second input audio signal;
a second microphone of the sensing device to receive a second instance of the first input audio signal and a second instance of the second input audio signal;
a natural language processor component executed by the sensing device to parse the first instance of the first input audio signal to identify an activation phrase;
an interface of the sensing device to send the first instance of the first input audio signal to a client device at a first point in time based on identification of the activation phrase in the first instance of the first input audio signal, the client device including a second natural language processor component;
the interface of the sensing device to send the second instance of the first input audio signal to the client device at a second point in time after the first point in time; and
the interface of the sensing device to send the first instance of the second input audio signal to the client device based on a confirmation message from the client device of the identification of the activation phrase in the second instance of the first input audio signal.
29. The system of claim 28, comprising the interface to:
at the first point in time, send the first instance of the first input audio signal to the client device at a first compression level; and
at the second point in time, send the second instance of the first input audio signal to the client device at a second compression level that is lower than the first compression level.
30. The system of claim 28, comprising the interface to:
at the first point in time, send the second instance of the first input audio signal to the client device at a first compression level; and
at the second point in time, send the first instance of the first input audio signal and the second instance of the input audio signal to the client device at a second compression level that is lower than the first compression level.
31. The system of claim 28, comprising the interface to:
send the second instance of the first input audio signal to the client device at the second point in time based on a confirmation message that the activation phrase is not in the first instance of the input audio signal.
32. The system of claim 28, comprising the interface to:
send the second instance of the first input audio signal to the client device at the second point in time prior to receiving a confirmation message that the activation phrase is not in the first instance of the input audio signal.
33. The system of claim 32, comprising:
the interface to terminate the sending of the second instance of the first input audio signal based on an acknowledgement message of the activation phrase in the first instance of the input audio signal.
34. The system of claim 28, comprising the interface to:
establish a Bluetooth connection with the client device; and
transmit the first instance of the first input audio signal and the second instance of the first input audio signal over the Bluetooth connection.
35. A method for transmitting data in a voice activated network, comprising:
receiving, by a first microphone of a sensing device, a first instance of a first input audio signal and a first instance of a second input audio signal;
receiving, by a second microphone of the sensing device, a second instance of the first input audio signal and a second instance of the second input audio signal;
parsing, by a natural language processor component executed by the sensing device, the first instance of the first input audio signal to identify an activation phrase;
sending, by an interface of the sensing device, the first instance of the first input audio signal to a client device at a first point in time based on the identification of the activation phrase in the first instance of the first input audio signal, the client device comprising a second natural language processor component;
sending, by the interface of the sensing device, the second instance of the first input audio signal to the client device at a second point in time after the first point in time; and
sending, by the interface of the sensing device, the first instance of the second input audio signal to the client device based on a confirmation message from the client device of the identification of the activation phrase in the second instance of the first input audio signal.
36. The method of claim 35, comprising:
sending, by the interface, the first instance of the first input audio signal to the client device at the first point in time at a first compression level; and
transmitting, by the interface, the second instance of the first input audio signal to the client device at a second compression level lower than the first compression level at the second point in time.
37. The method of claim 35, comprising:
sending, by the interface, the second instance of the first input audio signal to the client device at the first point in time at a first compression level; and
sending, by the interface, the first instance of the first input audio signal and the second instance of the input audio signal to the client device at a second compression level that is lower than the first compression level at the second point in time.
38. The method of claim 35, comprising:
sending, by the interface, the second instance of the first input audio signal to the client device at the second point in time based on a confirmation message that the activation phrase is not in the first instance of the input audio signal.
39. The method of claim 35, comprising:
sending, by the interface, the second instance of the first input audio signal to the client device at the second point in time prior to receiving a confirmation message that the activation phrase is not in the first instance of the input audio signal.
40. The method of claim 39, comprising:
terminating, by the interface, the sending of the second instance of the first input audio signal based on an acknowledgement message of the activation phrase in the first instance of the input audio signal.
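By way of non-limiting illustration, the speculative sending and termination recited in claims 39 and 40 above might be sketched as follows. The chunked transfer, the function name `send_until_acknowledged`, and the `transmit` and `acknowledged` callbacks are assumptions introduced for this sketch and are not taken from the disclosure.

```python
from typing import Callable, Iterable


def send_until_acknowledged(
    chunks: Iterable[bytes],
    transmit: Callable[[bytes], None],
    acknowledged: Callable[[], bool],
) -> int:
    """Send chunks of the second instance until an acknowledgement arrives.

    Sending starts before any message about the first instance is received.
    Once `acknowledged()` reports that the activation phrase was found in the
    first instance, the remaining chunks are not sent. Returns the number of
    chunks actually transmitted.
    """
    sent = 0
    for chunk in chunks:
        if acknowledged():
            break  # the first instance already sufficed; stop using bandwidth
        transmit(chunk)
        sent += 1
    return sent


# Usage with stand-in callbacks: the acknowledgement "arrives" after two chunks.
log = []
second_instance = [b"c0", b"c1", b"c2", b"c3", b"c4"]
count = send_until_acknowledged(
    second_instance,
    transmit=log.append,
    acknowledged=lambda: len(log) >= 2,
)
```

In this sketch the check is made before each chunk, so at most one chunk is in flight when the acknowledgement is noticed.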
Background
Networked devices may process audio-based signals. The ability of a device to process an audio-based signal may depend on the quality of the audio-based signal. High quality audio-based signals may have a relatively large file size. Excessive network transmissions, packet-based or otherwise, of network traffic data between computing devices may prevent the computing devices from properly processing audio-based signals, performing operations related to the audio-based signals, or responding to the audio-based signals in a timely manner.
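As a rough illustration of why file size matters on a low bandwidth link, the arithmetic can be sketched as below. The sample rate, bit depth, 8:1 compression ratio, and 100 kbit/s link throughput are hypothetical values chosen for the example and do not come from the disclosure.

```python
def pcm_size_bytes(seconds: float, sample_rate_hz: int = 16_000,
                   bits_per_sample: int = 16, channels: int = 1) -> int:
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return int(seconds * sample_rate_hz * (bits_per_sample // 8) * channels)


def transfer_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Ideal transfer time over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


clip = pcm_size_bytes(2.0)        # 2 s of 16 kHz, 16-bit mono audio: 64000 bytes
compressed = clip // 8            # hypothetical 8:1 compression: 8000 bytes
link = 100_000                    # hypothetical usable throughput in bit/s

uncompressed_time = transfer_seconds(clip, link)      # 5.12 s
compressed_time = transfer_seconds(compressed, link)  # about 0.64 s
```

Under these assumed numbers, the uncompressed clip takes longer to transfer than it took to speak, which is the kind of delay the Background describes.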
Disclosure of Invention
In accordance with at least one aspect of the present disclosure, a system for detecting an activation phrase in a remote device may include a natural language processor component executed by a first client device. The system may receive a first instance of a first input audio signal detected by a sensing device. The system may parse the first instance of the first input audio signal to identify a first candidate activation phrase in the first instance of the first input audio signal. The system may determine that the first candidate activation phrase does not comprise a predetermined activation phrase. The system may receive a second instance of the first input audio signal obtained by the sensing device. The system may parse the second instance of the first input audio signal to identify a second candidate activation phrase in the second instance of the first input audio signal. The system may determine that the second candidate activation phrase comprises the predetermined activation phrase. The system may include an interface to transmit, to a data processing system, an audio signal associated with at least one of the first instance of the first input audio signal and the second instance of the first input audio signal, based on a determination that the second candidate activation phrase comprises the predetermined activation phrase. The data processing system may include a second natural language processor component to identify a request in at least one of the first instance of the first input audio signal and the second instance of the first input audio signal.
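The confirmation flow described in the preceding paragraph might, purely as a sketch, look like the following on the first client device. The phrase `"ok device"`, the `Instance` type, the `transcribe` callback standing in for the natural language processor component, and the `send` callback standing in for the interface are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

ACTIVATION_PHRASE = "ok device"  # hypothetical predetermined activation phrase


@dataclass
class Instance:
    microphone: int  # which microphone of the sensing device recorded it
    audio: bytes     # recorded samples, treated as opaque in this sketch


def confirm_activation(
    instances: Sequence[Instance],
    transcribe: Callable[[bytes], str],
) -> Optional[Instance]:
    """Return the first instance whose candidate phrase contains the activation phrase."""
    for instance in instances:
        candidate = transcribe(instance.audio).lower()
        if ACTIVATION_PHRASE in candidate:
            return instance
    return None


def forward_if_confirmed(
    instances: Sequence[Instance],
    transcribe: Callable[[bytes], str],
    send: Callable[[bytes], None],
) -> bool:
    """Forward audio to the data processing system only after confirmation."""
    confirmed = confirm_activation(instances, transcribe)
    if confirmed is None:
        return False
    send(confirmed.audio)
    return True


# Usage with a stand-in transcriber: the first microphone's copy is misheard,
# the second microphone's copy contains the phrase.
fake_transcripts = {b"mic1": "okay the vice", b"mic2": "ok device what time is it"}
sent = []
handled = forward_if_confirmed(
    [Instance(1, b"mic1"), Instance(2, b"mic2")],
    transcribe=lambda audio: fake_transcripts[audio],
    send=sent.append,
)
```

Here the second instance is examined only when the first fails, which mirrors the order of the determinations in the paragraph above; a real implementation could equally combine evidence from several instances.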
According to at least one aspect of the present disclosure, a system for transmitting data in a voice activated network may include a client device for receiving a first instance of a first input audio signal. The system may include a client device to receive a second instance of the first input audio signal. The system may include a natural language processor component executed by the client device to parse a first instance of the first input audio signal to identify the activation phrase. The system may include an interface of the client device to transmit a first instance of the first input audio signal to the data processing system at a first point in time based on the identification of the activation phrase in the first instance of the first input audio signal. The data processing system may include a second natural language processor component. The interface of the client device may transmit a second instance of the first input audio signal to the data processing system at a second point in time after the first point in time. The interface of the client device may send an audio signal associated with at least one of the first instance of the first input audio signal and the second instance of the first input audio signal to the data processing system based on a confirmation message from the data processing system identifying the activation phrase in the second instance of the first input audio signal.
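One way to picture the sending side described in the preceding paragraph is as a small state machine: the first instance goes out at the first point in time, the second instance at a later point in time, and the associated audio signal only after a confirmation message. The class `InstanceSender`, its method names, the state names, and the `transmit` callback are assumptions made for this sketch.

```python
from enum import Enum, auto
from typing import Callable, Optional


class State(Enum):
    IDLE = auto()
    FIRST_SENT = auto()
    SECOND_SENT = auto()
    CONFIRMED = auto()


class InstanceSender:
    """Sends two instances in order, then associated audio once confirmed."""

    def __init__(self, transmit: Callable[[str, bytes], None]) -> None:
        self.transmit = transmit  # hypothetical transport callback
        self.state = State.IDLE
        self._second: Optional[bytes] = None

    def activation_detected(self, first: bytes, second: bytes) -> None:
        """First point in time: local detection triggers sending the first instance."""
        if self.state is not State.IDLE:
            return
        self.transmit("first_instance", first)
        self._second = second
        self.state = State.FIRST_SENT

    def send_second(self) -> None:
        """Second point in time: send the second instance."""
        if self.state is State.FIRST_SENT and self._second is not None:
            self.transmit("second_instance", self._second)
            self.state = State.SECOND_SENT

    def confirmation_received(self, associated_audio: bytes) -> None:
        """A confirmation for the second instance unlocks the associated audio."""
        if self.state is State.SECOND_SENT:
            self.transmit("associated_audio", associated_audio)
            self.state = State.CONFIRMED


# Usage: record what would be transmitted, in order.
log = []
sender = InstanceSender(lambda kind, payload: log.append((kind, payload)))
sender.activation_detected(b"mic1", b"mic2")
sender.send_second()
sender.confirmation_received(b"request audio")
```

A confirmation that arrives in any other state is ignored in this sketch, so the associated audio is never sent without the two preceding transmissions.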
According to at least one aspect of the present disclosure, a method of transmitting data in a voice-activated network may include receiving, by a client device, a first instance of a first input audio signal. The method may include obtaining, by the client device, a second instance of the first input audio signal. The method may include parsing, by a natural language processor component executed by the client device, the first instance of the first input audio signal to identify an activation phrase. The method may include sending, by an interface of the client device, the first instance of the first input audio signal to a data processing system at a first point in time based on the identification of the activation phrase in the first instance of the first input audio signal. The data processing system may include a second natural language processor component. The method may include transmitting, by the interface of the client device, the second instance of the first input audio signal to the data processing system at a second point in time after the first point in time. The method may include sending, by the interface of the client device, an audio signal associated with at least one of the first instance of the first input audio signal and the second instance of the first input audio signal to the data processing system based on a confirmation message from the data processing system identifying the activation phrase in the second instance of the first input audio signal.
According to at least one aspect of the present disclosure, a system for detecting an activation phrase in a remote device may include a natural language processor component executed by a first client device. The system may receive a first instance of a first input audio signal detected by a first microphone of a sensing device. The system may parse the first instance of the first input audio signal to identify a first candidate activation phrase in the first instance of the first input audio signal. The system may determine that the first candidate activation phrase does not comprise a predetermined activation phrase. The system may receive a second instance of the first input audio signal detected by a second microphone of the sensing device. The system may parse the second instance of the first input audio signal to identify a second candidate activation phrase in the second instance of the first input audio signal. The system may determine that the second candidate activation phrase comprises the predetermined activation phrase. The system may include an interface to send at least one of the first instance of the first input audio signal and the second instance of the first input audio signal to a data processing system based on a determination that the second candidate activation phrase comprises the predetermined activation phrase. The data processing system may include a second natural language processor component to identify a request in at least one of the first instance of the first input audio signal and the second instance of the first input audio signal.
According to at least one aspect of the present disclosure, a system for transmitting data in a voice-activated network may include a first microphone of a client device for receiving a first instance of a first input audio signal and a first instance of a second input audio signal. The system may include a second microphone of the client device for receiving a second instance of the first input audio signal and a second instance of the second input audio signal. The system may include a natural language processor component executed by the client device to parse a first instance of the first input audio signal to identify the activation phrase. The system may include an interface of the client device to transmit, at a first point in time, a first instance of a first input audio signal to the data processing system based on identification of an activation phrase in the first instance of the first input audio signal. The data processing system may include a second natural language processor component. The interface of the client device may transmit a second instance of the first input audio signal to the data processing system at a second point in time after the first point in time. The interface of the client device may send the first instance of the second input audio signal to the data processing system based on a confirmation message from the data processing system identifying the activation phrase in the second instance of the first input audio signal.
According to at least one aspect of the present disclosure, a method of transmitting data in a voice-activated network may include receiving, by a first microphone of a client device, a first instance of a first input audio signal and a first instance of a second input audio signal. The method may include receiving, by a second microphone of the client device, a second instance of the first input audio signal and a second instance of the second input audio signal. The method may include parsing, by a natural language processor component executed by the client device, the first instance of the first input audio signal to identify an activation phrase. The method may include sending, by an interface of the client device at a first point in time, the first instance of the first input audio signal to a data processing system based on identification of the activation phrase in the first instance of the first input audio signal. The data processing system may include a second natural language processor component. The method may include transmitting, by the interface of the client device, the second instance of the first input audio signal to the data processing system at a second point in time after the first point in time. The method may include sending, by the interface of the client device, the first instance of the second input audio signal to the data processing system based on a confirmation message from the data processing system identifying the activation phrase in the second instance of the first input audio signal.
Each aspect may optionally include one or more of the following features. The system may include first and second microphones. A first instance of the first input audio signal may be detected by a first microphone and a second instance of the first input audio signal may be detected by a second microphone. A first instance of a second input audio signal may be received and a second instance of the second input audio signal may be received. The audio signal associated with at least one of the first instance of the first input audio signal and the second instance of the first input audio signal may be at least one of the first instance of the second input audio signal and the second instance of the second input audio signal. Alternatively, the audio signal associated with at least one of the first instance of the first input audio signal and the second instance of the first input audio signal may be at least one of the first instance of the first input audio signal and the second instance of the first input audio signal or a portion thereof.
The interface may send a request for a second instance of the first input audio signal from the first client device to the sensing device based on determining that the first candidate activation phrase does not include the predetermined activation phrase. The natural language processor component may: receive a first instance of a second input audio signal detected by the first microphone of the sensing device; parse the first instance of the second input audio signal to identify a third candidate activation phrase; and determine that the third candidate activation phrase comprises the predetermined activation phrase. The interface may then send a request for a third input audio signal to the sensing device based on determining that the third candidate activation phrase comprises the predetermined activation phrase. Alternatively, after the natural language processor component determines that the third candidate activation phrase comprises the predetermined activation phrase, the interface may terminate reception of the second instance of the second input audio signal based on that determination. The interface may establish a Bluetooth connection between the first client device and the sensing device.
The natural language processor component may: receive a third instance of the first input audio signal from a sensor of the first client device; parse the third instance of the first input audio signal to identify a third candidate activation phrase in the third instance of the first input audio signal; and determine that the first input audio signal contains the predetermined activation phrase based on at least the third candidate activation phrase and the second candidate activation phrase. The natural language processor component may: receive a first instance of a second input audio signal detected by a first microphone of a second sensing device; parse the first instance of the second input audio signal to identify a third candidate activation phrase in the first instance of the second input audio signal; determine that the third candidate activation phrase does not comprise the predetermined activation phrase; receive a second instance of the second input audio signal detected by the first microphone of the second sensing device, the second instance of the second input audio signal having a lower compression rate than the first instance of the first input audio signal; parse the second instance of the second input audio signal to identify a fourth candidate activation phrase in the second instance of the second input audio signal; and determine that the fourth candidate activation phrase comprises the predetermined activation phrase. The interface of the first client device may, based on a determination that the fourth candidate activation phrase comprises the predetermined activation phrase, send at least one of the first instance of the second input audio signal and the second instance of the second input audio signal to a data processing system comprising a second natural language processor component to identify a second request in at least one of the first instance of the second input audio signal and the second instance of the second input audio signal.
The interface may: send, at a first point in time, a first instance of a first input audio signal to a client device at a first compression level; and send, at a second point in time, a second instance of the first input audio signal to the client device at a second compression level lower than the first compression level. The interface may transmit the second instance of the first input audio signal to the client device at a first point in time at a first compression level and, at a second point in time, transmit the first instance of the first input audio signal and the second instance of the first input audio signal to the client device at a second compression level that is lower than the first compression level. The interface may send, at a second point in time, the second instance of the first input audio signal to the client device based on a confirmation message that the activation phrase is not in the first instance of the first input audio signal. The interface may send the second instance of the first input audio signal to the client device at a second point in time before receiving a confirmation message that the activation phrase is not in the first instance of the first input audio signal. The interface may terminate transmission of the second instance of the first input audio signal based on a confirmation message identifying the activation phrase in the first instance of the first input audio signal. The interface may establish a Bluetooth connection with the client device and transmit the first instance of the first input audio signal and the second instance of the first input audio signal over the Bluetooth connection.
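As an illustration only, the staged transmission described above might be sketched as follows. The names `Transmission`, `build_plan`, and `send_until_confirmed`, and the integer compression scale (higher means more compressed, 0 means uncompressed), are assumptions made for this sketch and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Transmission:
    instance: str     # which microphone's recording is sent (hypothetical label)
    compression: int  # hypothetical scale: higher = more compressed, 0 = none


def build_plan(instances: List[str], first_level: int) -> List[Transmission]:
    """Schedule each later instance at a lower compression level (floor of 0)."""
    return [
        Transmission(name, max(first_level - i, 0))
        for i, name in enumerate(instances)
    ]


def send_until_confirmed(
    plan: List[Transmission],
    client_confirms: Callable[[Transmission], bool],
) -> List[Transmission]:
    """Send instances in order and stop once the client confirms the phrase.

    `client_confirms` stands in for the confirmation message returned by the
    client device; instances remaining in the plan are never transmitted.
    """
    sent: List[Transmission] = []
    for transmission in plan:
        sent.append(transmission)
        if client_confirms(transmission):
            break  # terminate transmission of the remaining instances
    return sent


plan = build_plan(["mic1", "mic2", "mic3"], first_level=3)
# Assume the client can only recognize the phrase at compression level 2 or lower.
sent = send_until_confirmed(plan, lambda t: t.compression <= 2)
```

In this sketch the third instance is never sent, which is the bandwidth saving the staged approach is meant to provide.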
These and other aspects and embodiments are discussed in detail below. The foregoing information and the following detailed description include illustrative examples of various aspects and embodiments, and provide an overview or framework for understanding the nature and character of the claimed aspects and embodiments. The accompanying drawings provide an illustration and a further understanding of the various aspects and embodiments, and are incorporated in and constitute a part of this specification.
Drawings
The drawings are not intended to be drawn to scale. Like reference numbers and designations in the various drawings indicate like elements. For purposes of clarity, not every component may be labeled in every drawing. In the drawings:
FIG. 1 illustrates an example system for detecting an activation phrase in an input audio signal transmitted in a low bandwidth network.
FIG. 2 illustrates a top view of the vehicle shown in FIG. 1 and shows the interior compartment of the vehicle.
FIG. 3 illustrates a block diagram of an example method for detecting an activation phrase in a networked system with limited bandwidth.
FIG. 4 is a block diagram of an example computer system.
Detailed Description
The following are more detailed descriptions of various concepts related to and embodiments of methods, apparatus, and systems for multi-modal transmission of packet data in a computer network environment based on voice activated data packets. The various concepts introduced above and discussed in more detail below may be implemented in any of a number of ways.
The present disclosure relates generally to a system for detecting an activation phrase within an input audio signal transmitted over a low bandwidth network. For example, one or more devices of the system may be communicatively coupled together via a Bluetooth connection. The system may use a two-stage activation phrase detection process. First, a sensing device, which may include multiple microphones for detecting an input audio signal, may detect an input audio signal that includes a candidate activation phrase. The language processing of the sensing device may be inaccurate when determining whether the input audio signal includes an activation phrase, and may trigger false positives. Second, the sensing device may send a recording of the input audio signal to the client device to confirm that the input audio signal includes the activation phrase. Because the link between the sensing device and the client device has limited bandwidth, sending all of the recordings made by the sensing device at once introduces a delay while the data transfer completes.
To reduce the delay in receiving a confirmation from the client device, the sensing device may transmit a first recording of the input audio signal to the client device. If the client device determines that an activation phrase is present in the first recording, the other recordings made by the sensing device may be discarded and not sent to the client device, thereby saving bandwidth. If the client device is unable to recognize the activation phrase in the first recording, the client device may receive a second recording from the sensing device. The client device may then process the second recording to confirm whether the input audio signal includes the activation phrase. Thus, the client device may use the recordings made by the sensing device to provide improved detection of the activation phrase without introducing delays due to additional information sent between the sensing device and the client device.
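The client-side confirmation flow described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: each recording is represented by a text transcript standing in for the output of a speech recognizer, and the phrase `"ok device"` and the function names are hypothetical. A generator models the sensing device so that a recording is only "transferred" when the client asks for it.

```python
from typing import Iterable, Iterator, Tuple

ACTIVATION_PHRASE = "ok device"  # hypothetical phrase, not from the disclosure


def contains_activation_phrase(transcript: str) -> bool:
    """Stand-in for the client's natural language processing of one recording."""
    return ACTIVATION_PHRASE in transcript.lower()


def confirm_activation(recordings: Iterable[str]) -> Tuple[bool, int]:
    """Return (confirmed, number of recordings transferred).

    Recordings are pulled one at a time; iteration stops at the first
    recording in which the phrase is recognized, so later recordings are
    never requested from the sensing device.
    """
    transferred = 0
    for transcript in recordings:
        transferred += 1
        if contains_activation_phrase(transcript):
            return True, transferred
    return False, transferred


def sensing_device_recordings() -> Iterator[str]:
    """Hypothetical recordings from three microphones of the sensing device."""
    yield "okay devise play music"   # first microphone: phrase misrecognized
    yield "OK device play music"     # second microphone: phrase recognized
    yield "ok device play music"     # third microphone: never transferred


confirmed, transferred = confirm_activation(sensing_device_recordings())
```

Here the first recording fails, the second confirms the phrase, and the third is never pulled across the link.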
FIG. 1 illustrates an
The data processing system 102 may include an
The
The components of
The
The
The
The
The
As shown in fig. 1, the
The
The
The
The
The
The
For example, and in response to determining, by the
In response to a determination by the
The data processing system 102 of the system may include at least one server having at least one processor. For example, the data processing system 102 may include a plurality of servers located in at least one data center or server farm. The data processing system 102 may determine the request and a trigger key associated with the request from the audio input signal. Based on the request and the trigger key, the data processing system 102 may generate or select response data. The response data may be audio-based or text-based. For example, the response data may include one or more audio files that, when rendered, provide an audio output or sound wave. The data within the response data may also be referred to as content items. In addition to audio content, the response data may include other content (e.g., text, video, or image content).
The data processing system 102 may include a number of logically grouped servers and facilitate distributed computing techniques. A logical group of servers may be referred to as a data center, a server farm, or a machine farm. The servers may be geographically dispersed. The data center or machine farm may be managed as a single entity, or the machine farm may include multiple machine farms. The servers within each machine farm may be heterogeneous: one or more servers or machines may operate in accordance with one or more types of operating system platforms. The data processing system 102 may include servers stored in a data center in one or more high-density rack systems and associated storage systems located, for example, in an enterprise data center. In this manner, the data processing system 102 with integrated servers may improve system manageability, data security, physical security of the system, and system performance by locating servers and high performance storage systems on a localized high performance network. Centralizing all or some of the data processing system 102 components, including servers and storage systems and coupling them with advanced system management tools allows for more efficient use of server resources, which saves power and processing requirements and reduces bandwidth usage. Each component of the data processing system 102 may each include at least one processing unit, server, virtual server, circuit, engine, agent, device, or other logic device, such as a programmable logic array configured to communicate with the
The data processing system 102 may include a
Applications, scripts, programs, or other components associated with the data processing system 102 may be installed at the
The
The
From the input audio signal, the
The
The audio
The
The direct action API112 of the data processing system 102 may generate an action data structure based on, for example, a request. The action data structure may include data or instructions for performing the specified action to satisfy the request. The action data structure may be a JSON-formatted data structure or an XML-formatted data structure.
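For illustration, a JSON-formatted action data structure might look like the output of the sketch below. The field names (`request`, `attributes`) and the example values are assumptions made for this sketch; the disclosure does not specify a schema.

```python
import json
from typing import Any, Dict


def build_action_data_structure(request: str, attributes: Dict[str, Any]) -> str:
    """Serialize a hypothetical action data structure to JSON.

    `request` names the action to perform and `attributes` carries the
    data needed to fulfill it; both keys are illustrative only.
    """
    action = {"request": request, "attributes": attributes}
    return json.dumps(action, sort_keys=True)


payload = build_action_data_structure(
    "play_music", {"artist": "example artist", "volume": 5}
)
decoded = json.loads(payload)
```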
The action data structure may include information for completing the request. For example, the action data structure may be an XML (extensible markup language) or JSON (JavaScript object notation) formatted data structure that includes attributes for completing or otherwise fulfilling a request. The attributes may include a location of the
The direct action API112 may retrieve the
The direct action API112 may populate the fields with data from the input audio signal. The direct action API112 may also populate the fields with data from the
The direct action API112 may obtain response data 124 (or
Fig. 2 shows a top view of the
The user 202 may generate the request in the form of an input audio signal 204. The input audio signal 204 may be recorded or detected by the
Referring also to fig. 1, wherein the
The
Upon receipt, the
Prior to receiving the request for the second instance of the input audio signal 204 from the
The
FIG. 3 illustrates a block diagram of an
The
The
The
The
The
The
The
The
In response to the request from the
When the
The
The
The second input audio signal may be an input audio signal detected or recorded by one of the
Returning to the recognition step at the
The
Fig. 4 is a block diagram of an
The processes, systems, and methods described herein may be implemented by computing
Although an example computing system has been described in fig. 4, the subject matter including the operations described in this specification can be implemented in other types of digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
Where the systems discussed herein collect or may make use of personal information about a user, the user may be provided with an opportunity to control whether programs or features may collect personal information (e.g., information about the user's social network, social activities, the user's preferences, or the user's location), or whether or how to receive content from a content server or other data processing system that may be more relevant to the user. In addition, certain data may be anonymized in one or more ways prior to storage or use, in order to remove personally identifiable information when generating the parameters. For example, the identity of the user may be anonymized so that no personally identifiable information can be determined for the user, or the geographic location of the user may be generalized when location information is obtained (e.g., to a city, zip code, or state level), so that no particular location of the user can be determined. Thus, the user may control how information is collected about him or her and used by the content server.
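One way such anonymization could be approximated is sketched below. This is an assumption-laden illustration, not the disclosed method: the record keys (`user_id`, `lat`, `lon`) are hypothetical, and rounding coordinates to one decimal place (roughly 11 km of latitude) is used here only as a simple stand-in for generalizing a location to a coarser level.

```python
from typing import Any, Dict


def anonymize_record(record: Dict[str, Any], decimals: int = 1) -> Dict[str, Any]:
    """Return a copy of `record` without the user identifier and with
    coordinates coarsened by rounding.

    The key names are hypothetical; a real system would apply whatever
    generalization (city, zip code, state) its policy requires.
    """
    cleaned = {k: v for k, v in record.items() if k != "user_id"}
    if "lat" in cleaned and "lon" in cleaned:
        cleaned["lat"] = round(cleaned["lat"], decimals)
        cleaned["lon"] = round(cleaned["lon"], decimals)
    return cleaned


original = {"user_id": "u-123", "lat": 37.4219, "lon": -122.084, "query": "weather"}
anonymized = anonymize_record(original)
```

The function returns a new dictionary, so the original record is left unmodified for any processing that legitimately needs it before storage.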
The subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. The subject matter described in this specification can be implemented as one or more computer programs (e.g., one or more circuits of computer program instructions) encoded on one or more computer storage media for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal (e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by the data processing apparatus). The computer storage medium may be or be included in a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Although a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially generated propagated signal. The computer storage medium may also be or be included in one or more separate components or media (e.g., multiple CDs, disks, or other storage devices). The operations described in this specification may be implemented as operations performed by data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.
The terms "data processing system," "computing device," "component," or "data processing apparatus" include various devices, apparatuses, and machines for processing data, including by way of example programmable processors, computers, systems on a chip, or multiple systems or combinations of the foregoing. The apparatus can comprise special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them. The apparatus and execution environment may implement a variety of different computing model infrastructures, such as web services, distributed computing, and grid computing infrastructures. The components of
A computer program (also known as a program, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. The computer program may correspond to a file in a file system. A computer program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable processors (e.g., components of data processing system 102) executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example: semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices); magnetic disks (e.g., internal hard disks or removable disks); magneto-optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
The subject matter described herein can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface), or a web browser through which a user can interact with an implementation of the subject matter described in this specification, or a combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks ("LANs") and wide area networks ("WANs"), internetworks (e.g., the internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).
A computing system, such as
Although operations are depicted in the drawings in a particular order, these operations need not be performed in the particular order shown or in sequential order, and all illustrated operations need not be performed. The actions described herein may be performed in a different order.
The separation of various system components is not required in all embodiments, and the described program components may be included in a single hardware or software product. For example, the
Having now described some illustrative embodiments, it will be apparent that the foregoing is illustrative and not limiting, having been presented by way of example. In particular, although many of the examples presented herein involve specific combinations of method acts or system elements, those acts and those elements may be combined in other ways to accomplish the same objectives. Acts, elements and features discussed in connection with one implementation are not intended to be excluded from a similar role in other implementations.
The phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of "including," "comprising," "having," "containing," "involving," "characterized by," and variations thereof herein, is meant to encompass the items listed thereafter, equivalents thereof, and other items, as well as alternative embodiments that consist exclusively of the items listed thereafter. In one embodiment, the systems and methods described herein consist of one, each combination of more than one, or all of the described elements, acts, or components.
Any reference to an embodiment or element or act of the systems and methods referred to herein in the singular may also encompass embodiments comprising a plurality of such elements, and any plural reference to any implementation or element or act herein may also encompass embodiments comprising only one element. References in the singular or plural form are not intended to limit the presently disclosed systems or methods, their components, acts or elements to a single or multiple configurations. A reference to any action or element based on any information, action, or element may include an implementation in which the action or element is based, at least in part, on any information, action, or element.
Any embodiment disclosed herein may be combined with any other embodiment or examples, and references to "an embodiment," "some embodiments," or "one embodiment," etc., are not necessarily mutually exclusive and are intended to indicate that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment or example. The terms used herein do not necessarily all refer to the same embodiment. Any embodiment may be included in or exclusively combined with any other embodiment in any manner consistent with aspects and embodiments disclosed herein.
References to "or" may be construed as inclusive, such that any terms described using "or" may indicate any of a single, more than one, and all of the described terms. For example, a reference to "at least one of 'A' and 'B'" may include only "A", only "B", or both "A" and "B". Such references, used in conjunction with "including" or other open-ended terms, may include additional items.
Where technical features in the figures, detailed description or any claims are followed by reference signs, the reference signs are included to increase the intelligibility of the figures, detailed description and claims. Accordingly, neither the reference signs nor their absence should have any limiting effect on the scope of any claim elements.
The systems and methods described herein may be embodied in other specific forms without departing from the characteristics thereof. The foregoing embodiments are illustrative and not limiting of the described systems and methods. The scope of the systems and methods described herein is, therefore, indicated by the appended claims rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are intended to be embraced therein.