Selectively activating on-device speech recognition, and using recognized text in selectively activating on-device NLU and/or on-device fulfillment

Document No.: 1958017    Publication date: 2021-12-10

Reading note: This application, "Selectively activating on-device speech recognition and using recognized text in selectively activating on-device NLU and/or on-device fulfillment," was created on 2019-05-31 by 迈克尔·戈利科夫, 扎希德·萨布尔, 丹尼斯·布拉科夫, 贝沙德·贝扎迪, 谢尔盖·纳扎罗夫, 丹尼. Its main content is as follows: Implementations may reduce the time required to obtain a response from an automated assistant, for example by eliminating the need to provide an explicit invocation of the automated assistant, such as by speaking a hotword/phrase or performing a particular user input, before speaking a command or query. Further, the automated assistant can selectively receive, understand, and/or respond to commands or queries without communicating with a server, thereby further reducing the time in which a response can be provided. Implementations selectively initiate on-device speech recognition only in response to determining that one or more conditions are satisfied. Moreover, in some implementations, on-device NLU, on-device fulfillment, and/or the resulting execution occur only in response to determining, based on the recognized text, that such further processing should occur. Resources are thereby conserved through the selective activation of on-device speech processing and/or the selective activation of on-device NLU and/or on-device fulfillment.

1. A method performed by an automated assistant application of a client device, the method being performed using one or more processors of the client device, and the method comprising:

determining to activate speech recognition on a device, wherein determining to activate speech recognition on the device is in response to determining satisfaction of one or more conditions, the determining satisfaction of the one or more conditions comprising determining the satisfaction based on processing of both:

hotword-free audio data detected by one or more microphones of the client device, and

additional sensor data based on output from at least one non-microphone sensor of the client device;

generating, using speech recognition on the device, recognized text from a spoken utterance captured by the audio data and/or by additional hotword-free audio data detected by one or more of the microphones subsequent to the audio data, the generating of the recognized text comprising performing speech recognition on the device on the audio data and/or the additional audio data;

based on the recognized text, determining whether to activate natural language understanding, on the device, of the recognized text and/or to activate fulfillment, on the device, that is based on the natural language understanding on the device;

when it is determined to activate natural language understanding on the device and/or activate fulfillment on the device:

performing natural language understanding on the device and/or initiating the fulfillment on the device;

when it is determined not to activate natural language understanding on the device and/or not to activate fulfillment on the device:

disabling speech recognition on the device.

2. The method of claim 1, wherein the at least one non-microphone sensor on which the additional sensor data is based comprises an accelerometer, magnetometer, and/or gyroscope.

3. The method of any preceding claim, wherein the at least one non-microphone sensor on which the additional sensor data is based comprises a laser-based vision sensor.

4. The method of any preceding claim, wherein determining satisfaction of the one or more conditions based on processing the hotword-free audio data comprises:

processing the hotword-free audio data using an acoustic model to generate a directed speech metric, the acoustic model being trained to distinguish between spoken utterances directed to the client device and spoken utterances not directed to the client device; and

determining satisfaction of the one or more conditions based in part on the directed speech metric.

5. The method of any preceding claim, wherein determining satisfaction of the one or more conditions based on processing the hotword-free audio data comprises:

processing the hotword-free audio data using a voice activity detector to detect the presence of human speech; and

determining satisfaction of the one or more conditions based in part on detecting the presence of human speech.

6. The method of any preceding claim, wherein determining satisfaction of the one or more conditions based on processing the hotword-free audio data comprises:

processing the hotword-free audio data using a text-independent speaker recognition model to generate a voice embedding;

comparing the voice embedding to a recognized voice embedding stored locally on the client device; and

determining satisfaction of the one or more conditions based in part on the comparison.

7. The method of any preceding claim, wherein determining whether to activate natural language understanding on a device and/or activate fulfillment on the device based on the recognized text comprises:

determining whether the text matches one or more phrases stored in a locally stored assistant language model, the locally stored assistant language model including a plurality of phrases, each of the phrases interpretable by the automated assistant.

8. The method of any preceding claim, wherein determining whether to activate natural language understanding on a device and/or activate fulfillment on the device based on the recognized text comprises:

determining whether the text conforms to a predefined assistant query pattern.

9. The method of any preceding claim, wherein determining whether to activate natural language understanding on a device and/or activate fulfillment on the device based on the recognized text comprises:

determining one or more related action phrases based on the one or more related action phrases each having a defined correspondence to a most recent action performed at the client device in response to a previous user input;

determining whether at least a portion of the text matches at least one of the one or more related action phrases.

10. The method of any preceding claim, wherein determining whether to activate natural language understanding on a device and/or activate fulfillment on the device based on the recognized text comprises:

determining whether at least a portion of the recognized text conforms to content rendered at the client device during the spoken utterance.

11. The method of claim 10, wherein the content rendered at the client device includes suggested automated assistant actions that are graphically rendered.

12. The method of any preceding claim, wherein initiating the fulfillment on the device comprises determining the fulfillment on the device, and further comprising:

performing the fulfillment on the device.

13. The method of claim 12, wherein performing the fulfillment on a device comprises providing a command to a separate application on the client device.

14. The method of any preceding claim, wherein disabling speech recognition on the device comprises: deactivating speech recognition on the device when it is determined not to activate natural language understanding on the device and/or the fulfillment on the device, and further based on at least a threshold duration of time elapsing without further detected voice activity and/or further recognized text.

15. The method of any preceding claim, wherein performing natural language understanding on the device and/or fulfillment on the device comprises:

performing natural language understanding on the device to generate natural language understanding data; and

performing fulfillment on the device using the natural language understanding data.

16. The method of any preceding claim, further comprising, during generating the recognized text using speech recognition on the device:

causing a streaming transcription of the recognized text to be rendered in a graphical interface at a display of the client device.

17. The method of claim 16, further comprising rendering a selectable interface element in the graphical interface with the streaming transcription, the selectable interface element, when selected, causing speech recognition on the device to cease.

18. The method of claim 16 or claim 17, further comprising changing the graphical interface when it is determined to activate natural language understanding on the device and/or activate fulfillment on the device.

19. A method performed by an automated assistant application of a client device, the method being performed using one or more processors of the client device, and the method comprising:

determining to activate speech recognition on a device, wherein determining to activate speech recognition on the device is in response to determining satisfaction of one or more conditions, the determining satisfaction of the one or more conditions comprising determining the satisfaction based on processing of one or both of:

hotword-free audio data detected by one or more microphones of the client device, and

additional sensor data based on output from at least one non-microphone sensor of the client device;

generating, using speech recognition on the device, recognized text from a spoken utterance captured by the audio data and/or by additional hotword-free audio data detected by one or more of the microphones subsequent to the audio data, the generating of the recognized text comprising performing speech recognition on the device on the audio data and/or the additional audio data;

determining, based on the recognized text, to activate natural language understanding, on the device, of the recognized text;

performing the activated natural language understanding, on the device, on the recognized text; and

initiating fulfillment of the spoken utterance on a device based on natural language understanding on the device.

20. The method of claim 19, wherein determining, based on the recognized text, to activate natural language understanding of the recognized text on the device comprises:

determining whether at least a portion of the recognized text conforms to content rendered on the client device during the spoken utterance, and/or

determining whether at least a portion of the text matches one or more related action phrases, each of the related action phrases having a defined correspondence to a most recent action performed at the client device in response to a previous user input.

21. A client device, comprising:

at least one microphone;

at least one display;

one or more processors executing locally stored instructions to cause the processors to perform the method of any one of claims 1 to 20.

22. A computer program product comprising instructions which, when executed by one or more processors, cause the one or more processors to carry out the method according to any one of claims 1 to 20.

Background

Humans may engage in human-to-computer dialogs with an interactive software application referred to herein as an "automated assistant" (also referred to as a "digital agent," "interactive personal assistant," "intelligent personal assistant," "assistant application," "conversational agent," etc.). For example, a human (who may be referred to as a "user" when interacting with the automated assistant) may provide commands and/or requests to the automated assistant using spoken natural language input (i.e., utterances), which may in some cases be converted into text and then processed, and/or by providing textual (e.g., typed) natural language input. The automated assistant responds to a request by providing responsive user interface output, which can include audible and/or visual user interface output.

As described above, many automated assistants are configured to be interacted with via spoken utterances. To protect user privacy and/or to conserve resources, a user must often explicitly invoke an automated assistant before the automated assistant will fully process a spoken utterance. The explicit invocation of an automated assistant typically occurs in response to certain user interface input being received at a client device. The client device includes an assistant interface that provides, to a user of the client device, an interface for interacting with the automated assistant (e.g., receives input from the user and provides audible and/or graphical responses), and that interfaces with one or more additional components that implement the automated assistant (e.g., on-device components that process user input and generate appropriate responses, and/or a remote server device).

Some user interface inputs that can invoke the automated assistant via the client device include a hardware and/or virtual button at the client device (e.g., a tap of a hardware button, a selection of a graphical interface element displayed by the client device). Many automated assistants can additionally or alternatively be invoked in response to one or more specific spoken invocation phrases, also known as "hotwords/phrases" or "trigger words/phrases". For example, a specific spoken invocation phrase such as "Hey Assistant," "OK Assistant," and/or "Assistant" can be spoken to invoke the automated assistant. When an automated assistant is invoked using such user interface input, detected audio data is typically streamed from the client device to remote automated assistant components, which typically indiscriminately perform each of speech recognition, natural language understanding, and fulfillment (or at least attempt fulfillment).

Disclosure of Invention

Various implementations disclosed herein may be used to reduce the time required to obtain a response/fulfillment from an automated assistant. This is in particular because such implementations can avoid the need for the user to provide an explicit invocation of the automated assistant (such as by speaking a hotword/phrase or performing a particular user input) before speaking a command or query. Further, in some implementations, the automated assistant can receive, understand, and in some cases respond to and/or fulfill commands or queries without communicating with a server, thereby further reducing the time in which a response/fulfillment can be provided.

Implementations disclosed herein are directed to client devices (e.g., smartphones and/or other client devices) that include at least one or more microphones and an automated assistant application. The automated assistant application may be installed "on top of" the operating system of the client device and/or may itself form a portion (or all) of the operating system of the client device. The automated assistant application includes and/or has access to speech recognition on the device, natural language understanding on the device, and fulfillment on the device. For example, speech recognition on the device may be performed using an on-device speech recognition module that processes the audio data (detected by the microphone(s)) using an end-to-end speech recognition machine learning model stored locally at the client device. Speech recognition on the device generates recognized text for spoken utterances (if any) present in the audio data. Optionally, speech recognition on the device may also verify that the recognized text corresponds to the currently active (or only) profile of the client device (e.g., using the text-independent speaker identification/recognition described below). Further, natural language understanding (NLU) on the device may be performed, for example, using an on-device NLU module that processes the recognized text (generated using speech recognition on the device) and optionally context data to generate NLU data. The NLU data can include an intent of the spoken utterance and optionally parameters for the intent (e.g., slot values). Fulfillment on the device may be performed using an on-device fulfillment module that utilizes the NLU data (from the NLU on the device) and optionally other local data to determine an action to take to resolve the intent of the spoken utterance (and optionally the parameters for the intent). This may include determining local and/or remote responses (e.g., answers) to the spoken utterance, interactions with locally installed applications to perform based on the spoken utterance, commands to send to Internet of Things (IoT) devices (directly or via a corresponding remote system) based on the spoken utterance, and/or other resolution actions to perform based on the spoken utterance. Fulfillment on the device may then initiate local and/or remote performance/execution of the determined action to resolve the spoken utterance.
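As a rough illustration of the on-device flow just described, the following Python sketch wires stubbed speech recognition, NLU, and fulfillment stages together. The module names, the NLUData structure, and the hard-coded intent/slot values are hypothetical placeholders, not the patent's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class NLUData:
    intent: str
    slots: dict = field(default_factory=dict)

# Hypothetical stand-ins for the on-device modules described above.
def on_device_speech_recognition(audio: bytes) -> str:
    """Run a locally stored end-to-end ASR model over the audio (stubbed)."""
    return "turn on the kitchen lights"  # placeholder recognized text

def on_device_nlu(text: str, context: Optional[dict] = None) -> NLUData:
    """Map recognized text (plus optional context) to an intent and slot values."""
    if "lights" in text:
        return NLUData(intent="smart_home.toggle",
                       slots={"device": "kitchen lights", "state": "on"})
    return NLUData(intent="unknown")

def on_device_fulfillment(nlu: NLUData) -> Callable[[], None]:
    """Resolve the intent into a local and/or remote action to execute."""
    def execute():
        print(f"executing {nlu.intent} with {nlu.slots}")  # e.g., send an IoT command
    return execute

if __name__ == "__main__":
    text = on_device_speech_recognition(b"\x00" * 16000)
    action = on_device_fulfillment(on_device_nlu(text))
    action()  # local execution resolves the spoken utterance
```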

In various implementations, remote speech processing, remote NLU, and/or remote fulfillment can at least be selectively utilized. For example, the recognized text may at least selectively be transmitted to remote automated assistant component(s) for remote NLU and/or remote fulfillment. For instance, the recognized text may optionally be transmitted for remote processing in parallel with the on-device processing, or in response to failure of the NLU on the device and/or of the fulfillment on the device. However, speech processing on the device, NLU on the device, fulfillment on the device, and/or execution on the device may be prioritized, at least because of the reduction in latency they provide when resolving the spoken utterance (due to no client-server round trip being needed to resolve the spoken utterance). Further, the on-device functionality may be the only functionality available when there is no, or only limited, network connectivity.

While speech recognition on the device, NLU on the device, and fulfillment on the device provide advantages, continuously performing all of this on-device processing burdens the resources of the client device, and/or such continuous execution may compromise the security/privacy of user data. For example, continuously running all of the on-device processes increases the burden on processor resources and power resources (e.g., on the battery when the client device is battery powered). Moreover, if NLU and/or fulfillment were performed indiscriminately on recognized text from all detected spoken utterances, fulfillment (and resulting execution) of certain spoken utterances could occur inadvertently, despite there being no user intent for the automated assistant to perform any responsive action for those spoken utterances. In addition to potentially compromising the security of user data, such inadvertent fulfillment and resulting execution can result in excessive consumption of various client device resources.

In view of these and/or other considerations, implementations disclosed herein only selectively activate speech recognition on the device. For example, various implementations initiate speech recognition on the device only in response to determining that one or more conditions are satisfied. Moreover, in some of those implementations, the NLU on the device and/or fulfillment on the device (and/or the resulting execution) occur only in response to determining, based on recognized text from speech recognition on the device, that such further processing should occur. Thus, through the selective activation of speech processing on the device, and/or the further selective activation of the NLU on the device and/or fulfillment on the device, various client device resources are conserved and/or the security of user data is maintained.

In various implementations, speech recognition on a device is activated in response to detecting the occurrence of an explicit assistant invocation prompt. An explicit invocation prompt is a prompt that when detected alone will always cause at least speech recognition on the device to be activated. Some non-limiting examples of explicit invocation cues include detecting spoken hotwords having at least a threshold confidence, activation of an explicit assistant interface element (e.g., a hardware button or a graphical button on a touch screen display), a "phone squeeze" having at least a threshold intensity (e.g., detected by a sensor in a bezel of a mobile phone), and/or other explicit invocation cues.

However, other prompts are implicit, in that speech recognition on the device will only be activated in response to certain occurrences of those prompts, such as occurrences in certain contexts (e.g., occurrences following, or in conjunction with, other implicit prompts). For example, speech recognition on the device may optionally not be activated in response to detecting voice activity alone, but may be activated in response to detecting voice activity along with detecting the presence of the user at the client device and/or detecting the presence of the user within a threshold distance of the client device. One or more non-microphone sensors, such as passive infrared (PIR) sensors and/or laser-based sensors, may optionally be used to determine the presence of the user and/or the distance of the user. Further, for example, sensor data from non-microphone sensors (e.g., gyroscopes, accelerometers, magnetometers, and/or other sensors) indicating that the user has picked up the client device and/or is currently holding the client device may optionally not, on its own, activate speech recognition on the device. However, speech recognition on the device may be activated in response to such an indication along with detection of voice activity and/or of directional speech (described in more detail herein) in hotword-free audio data. Hotword-free audio data is audio data that lacks any spoken utterance that includes a hotword, a hotword being an explicit assistant invocation prompt. As yet another example, a "phone squeeze" with less than a threshold intensity may optionally be insufficient, on its own, to activate speech recognition on the device. However, speech recognition on the device may be activated in response to such a low-intensity "phone squeeze" along with detection of voice activity and/or directional speech in hotword-free audio data. As yet another example, speech recognition on the device may optionally not be activated in response to detecting voice activity alone, but may be activated in response to detecting voice activity along with text-independent speaker identification/recognition (also described below). As yet another example, speech recognition on the device may optionally not be activated in response to detecting a directional gaze alone (as described below), but may be activated in response to detecting a directional gaze along with voice activity, directional speech, and/or text-independent speaker identification/recognition. As yet another example, speech recognition on the device may optionally not be activated in response to detecting a directional gaze over less than a threshold quantity (and/or percentage) of consecutive image frames (i.e., an instantaneous directional gaze), but may be activated in response to detecting a directional gaze over at least the threshold quantity and/or percentage of consecutive image frames (i.e., a persistent directional gaze). While several examples are provided above, additional and/or alternative implicit invocation prompts may be provided for. Further, in various implementations, one or more of the implicit prompts described above can alternatively be explicit invocation prompts. For example, in some implementations, sensor data from non-microphone sensors (such as gyroscopes, magnetometers, and/or accelerometers) indicating that a user has picked up the client device and/or is currently holding the client device may optionally, on its own, activate speech recognition on the device.
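The following is a minimal Python sketch of the kind of rule-based combination of implicit cues described above, in which no single implicit cue activates recognition on its own but qualifying combinations do. The cue names, thresholds, and specific rules are illustrative assumptions rather than rules taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class ImplicitCues:
    voice_activity: bool = False       # from VAD over hotword-free audio
    directed_speech_prob: float = 0.0  # from the directional-speech acoustic model
    user_present: bool = False         # e.g., PIR / laser-based presence sensor
    device_picked_up: bool = False     # e.g., accelerometer/gyroscope abstraction
    squeeze_intensity: float = 0.0     # normalized "phone squeeze" strength
    persistent_gaze: bool = False      # directional gaze over enough image frames

def should_activate_asr(c: ImplicitCues,
                        directed_speech_threshold: float = 0.7,
                        squeeze_threshold: float = 0.9) -> bool:
    """Illustrative rules: no single implicit cue activates recognition alone;
    qualifying combinations do. Thresholds are assumptions, not from the patent."""
    if c.squeeze_intensity >= squeeze_threshold:      # strong squeeze acts as explicit cue
        return True
    speechy = c.voice_activity or c.directed_speech_prob >= directed_speech_threshold
    if c.voice_activity and c.user_present:
        return True
    if c.device_picked_up and speechy:
        return True
    if 0 < c.squeeze_intensity < squeeze_threshold and speechy:
        return True
    if c.persistent_gaze and speechy:
        return True
    return False

print(should_activate_asr(ImplicitCues(voice_activity=True, device_picked_up=True)))  # True
print(should_activate_asr(ImplicitCues(voice_activity=True)))                         # False
```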

Some implementations disclosed herein relate to determining whether to activate speech recognition on the device based on one or more implicit prompts, such as those described above. In some of these implementations, the determination is made based on on-device processing of hotword-free audio data and/or of additional sensor data that is based on output from one or more non-microphone sensors of the client device. These implementations enable user interaction with an automated assistant to be initiated and/or guided without the user having to begin the interaction with an utterance of a hotword and/or with another explicit invocation prompt. This results in a reduction in the user input provided by the user (at least due to the omission of the hotword or other explicit invocation prompt), which directly reduces the duration of the interaction, which in turn can reduce the time to fulfillment and conserve various local processing resources that would otherwise be used in a prolonged interaction.

On-device processing performed on the hotword-free audio data in determining whether to activate speech recognition on the device may include, for example, voice activity processing, directional speech processing, and/or text-independent speaker identification.

The voice activity processing processes audio data (e.g., raw audio data or mel-frequency cepstral coefficients (MFCCs) or other representations of the audio data) to monitor for the occurrence of any human speech and may output a voice activity metric indicating whether voice activity is present. The voice activity metric may be a binary metric or may be a probability that human speech is present in the audio data.
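Below is a small, hedged stand-in for such voice activity processing: a crude energy-based heuristic that emits a soft voice-activity metric in [0, 1]. A real implementation would use a trained VAD model rather than an RMS-energy threshold; the frame size and threshold here are arbitrary.

```python
import math

def voice_activity_metric(samples, frame_size=160, energy_threshold=0.02):
    """Crude energy-based stand-in for a learned VAD model: returns the fraction
    of frames whose RMS energy exceeds a threshold, as a rough probability that
    human speech is present in the audio data."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    if not frames:
        return 0.0
    active = 0
    for frame in frames:
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms > energy_threshold:
            active += 1
    return active / len(frames)

# Example: a soft metric in [0, 1]; a binary decision could threshold it at 0.5.
silence = [0.0] * 1600
print(voice_activity_metric(silence))  # 0.0
```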

The directional speech processing may utilize a trained acoustic model that is trained to distinguish between spoken utterances that are directed to the client device and spoken utterances that are not directed to the client device. For example, instances of a user speaking to an automated assistant may be tagged with a first label (e.g., "1"), while instances of the user speaking to other humans may be tagged with a second label (e.g., "0"). This is effective because the audio data itself may indicate whether the input is intended as assistant input, since users often speak with different voice attributes (e.g., pitch, timbre, rhythm) when talking to an assistant device than when talking to another human. Thus, in addition to distinguishing between human speech and non-human speech, the directional speech processing seeks to distinguish between human speech that is directed to the client device and human speech that is not directed to the client device (e.g., human speech directed to another human, human speech emanating from a television or other source). Directional speech processing using the acoustic model may result in a directional speech metric indicating whether human speech that is directed to the client device is detected; the metric may be a binary metric or a probability.

Text-independent speaker identification/recognition (TI-SID) processes the audio data using a TI-SID model to generate an embedding of the spoken utterance captured by the audio data, and compares that embedding to locally stored embedding(s) for one or more user accounts/profiles registered on the client device. A TI-SID metric may then be generated based on the comparison, where the TI-SID metric indicates whether the generated embedding matches one of the stored embeddings and/or the extent to which they match.
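A minimal sketch of the TI-SID comparison step follows, assuming cosine similarity between the utterance embedding and locally enrolled profile embeddings and an arbitrary match threshold; the actual model, similarity measure, and threshold are not specified by the disclosure and would differ in practice.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ti_sid_metric(utterance_embedding, enrolled_embeddings, threshold=0.8):
    """Compare a TI-SID embedding of the spoken utterance to embeddings stored
    locally for registered profiles. Returns (best_score, matched_profile or None).
    The cosine-similarity comparison and 0.8 threshold are assumptions."""
    best_profile, best_score = None, -1.0
    for profile, enrolled in enrolled_embeddings.items():
        score = cosine_similarity(utterance_embedding, enrolled)
        if score > best_score:
            best_profile, best_score = profile, score
    return best_score, (best_profile if best_score >= threshold else None)

enrolled = {"primary_user": [0.1, 0.9, 0.3]}
print(ti_sid_metric([0.12, 0.88, 0.31], enrolled))  # high score, matches "primary_user"
```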

Additional sensor data that is based on output from one or more non-microphone sensors of the client device and that is processed in determining whether to activate speech recognition on the device may include sensor data from a gyroscope, an accelerometer, a laser-based vision sensor, a camera, and/or other sensor components of the client device. The raw sensor data itself or an abstraction or other representation of the raw sensor data (such as an abstraction provided by the operating system of the client device) may be processed. For example, a representation of accelerometer sensor data and/or an indication from the operating system that the client device has been picked up may be provided and utilized (where the indication may be determined by the operating system based on sensor data from one or more sensors).

In some implementations, an attention processor may process various metrics (e.g., from the processing of hotword-free audio data) and/or sensor data (e.g., representations or abstractions thereof) in determining whether to activate speech recognition on the device. The attention processor may utilize one or more rules and/or an attention model to determine whether to activate speech recognition on the device. The attention model may be, for example, a machine learning model trained based on supervised and/or semi-supervised training examples. For example, semi-supervised training examples may have training example inputs that are based on audio data and/or additional sensor data from actual interactions of participating users, with the permission of those participating users. Further, with the permission of a participating user, a semi-supervised training example may be labeled as "positive" (i.e., speech recognition on the device should occur) in response to detecting that a directional gaze occurred concurrently with such data, and may be labeled as "negative" (i.e., speech recognition on the device should not occur) in response to detecting that no directional gaze occurred concurrently with such data. A directional gaze is a gaze that is directed at the client device (e.g., for a threshold duration and/or for at least a threshold percentage of sensor frames), determined based on sensor frames from a vision sensor of the client device (e.g., image frames from a camera). The presence of a directional gaze does not necessarily indicate that the user intends to interact with the automated assistant (e.g., the user may only intend to interact with the client device generally). However, using directional gaze (or another signal) as a supervision signal can ensure that speech recognition on the device is activated often enough to ensure that hotword-free spoken utterances are recognized, while recognizing that the other techniques described herein will prevent NLU on the device and/or fulfillment on the device when there is no intent to interact with the automated assistant. In these and other manners, by utilizing a trained attention model, speech recognition on the device can be selectively activated in response to the occurrence of one or more implicit invocation prompts, in addition to in response to the occurrence of explicit invocation prompts. In various implementations, on-device training of the attention model may occur to personalize the attention model to the client device and/or to provide gradients (from the training) for federated learning (e.g., further training the attention model based on gradients from multiple client devices and providing the further trained model for use). When on-device training occurs, directional gaze and/or other signals may be used as supervision signals for the training.
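The sketch below illustrates, under stated assumptions, how a semi-supervised training example for such an attention model might be assembled, with persistent directional gaze used as the supervision signal. The feature set, frame counts, and thresholds are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class AttentionExample:
    """One semi-supervised training example for a hypothetical attention model."""
    voice_activity: float      # from VAD
    directed_speech: float     # from the directional-speech acoustic model
    ti_sid_match: float        # speaker-match score
    device_picked_up: float    # 1.0 if an accelerometer abstraction says "picked up"
    label: int                 # 1 if a persistent directional gaze co-occurred, else 0

def gaze_supervised_label(gaze_frames: int, total_frames: int,
                          min_frames: int = 10, min_fraction: float = 0.6) -> int:
    """Label an example positive only for a persistent (not instantaneous) gaze."""
    if total_frames == 0:
        return 0
    persistent = gaze_frames >= min_frames and gaze_frames / total_frames >= min_fraction
    return 1 if persistent else 0

example = AttentionExample(
    voice_activity=0.9, directed_speech=0.7, ti_sid_match=0.85, device_picked_up=1.0,
    label=gaze_supervised_label(gaze_frames=14, total_frames=20),
)
print(example.label)  # 1: gaze persisted, so ASR activation is treated as correct
```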

Once speech recognition on the device is activated, whether in response to implicit or explicit prompts such as those described herein, the audio data is processed using the speech recognition on the device to determine the recognized terms in the spoken utterance (if any) captured by the audio data. The processed audio data may include audio data captured after speech recognition on the device is activated, and optionally locally buffered recent audio data (e.g., the most recent 3 seconds, or another duration, of locally buffered audio data). In some implementations, when speech recognition on the device is activated, a human-perceptible cue is rendered to notify the user that such activation has occurred and/or to render the stream of recognized text as recognition occurs. The human-perceptible cue may include, for example, a visual rendering of at least the stream of recognized text on a touchscreen display of the client device (e.g., a visual rendering at the bottom of the display), optionally overlaid over any active content (and optionally rendered semi-transparently). The visual rendering may also include a selectable "cancel" element that, when selected via touch input on the touchscreen display, stops speech recognition on the device. As described herein, the human-perceptible cue may optionally be further adapted when the NLU on the device and/or fulfillment on the device is activated, and/or in response to performance of the fulfillment.

Various implementations described herein also relate to determining whether to activate an NLU on a device and/or fulfillment (and/or resulting execution) on a device. In some of those implementations, the NLU on the device and/or fulfillment on the device only occurs in response to determining that such further processing should occur based at least in part on recognized text from speech recognition on the device. Through this selective activation of speech processing on the device, and/or further selective activation of NLUs on the device and/or fulfillment on the device, various client device resources are conserved and/or security of user data is maintained.

In some implementations, whether to activate an NLU on the device and/or fulfillment on the device may be determined based on determining whether the recognized text matches one or more phrases stored in a locally stored assistant language model. When the recognized text matches a phrase stored in the locally stored assistant language model, the NLU on the device and/or fulfillment on the device are more likely to be activated. Soft or exact matching may be used. The locally stored assistant language model may include a plurality of phrases, each of which can be interpreted and acted upon by the automated assistant, and may exclude any phrases that cannot be interpreted and acted upon by the automated assistant. A phrase can be interpreted and acted upon by the automated assistant if the fulfillment it results in is not a "punt," such as "sorry, I can't help with that," an error tone, or another insubstantial response. The assistant language model can be generated to include phrases previously issued to, and successfully acted upon by, the automated assistant. Optionally, the assistant language model may be limited to a certain quantity of phrases in view of storage limitations of the client device, and the included phrases may be selected for inclusion in the assistant language model based on frequency of use and/or other considerations.
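A small sketch of the assistant-language-model check follows, assuming the locally stored model is simply a list of phrases and that soft matching is approximated with difflib string similarity; the phrase list and cutoff are made up.

```python
import difflib

# Illustrative locally stored assistant language model: phrases the assistant has
# previously interpreted and successfully acted on (contents are made up here).
ASSISTANT_LANGUAGE_MODEL = [
    "turn on the lights",
    "set an alarm for 7 am",
    "what's the weather tomorrow",
]

def matches_assistant_language_model(recognized_text: str,
                                     soft_match_cutoff: float = 0.85) -> bool:
    """Exact match first, then a soft match via string similarity. The cutoff
    value is an assumption; any fuzzy-matching scheme could be substituted."""
    text = recognized_text.lower().strip()
    if text in ASSISTANT_LANGUAGE_MODEL:
        return True
    close = difflib.get_close_matches(text, ASSISTANT_LANGUAGE_MODEL,
                                      n=1, cutoff=soft_match_cutoff)
    return bool(close)

print(matches_assistant_language_model("turn on the light"))   # soft match -> True
print(matches_assistant_language_model("tell bob i said hi"))  # no match  -> False
```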

In some implementations, determining whether to activate the NLU on the device and/or fulfillment on the device may additionally or alternatively be based on: determining one or more related action phrases, based on the one or more related action phrases each having a defined correspondence to a most recent action performed at and/or via the client device in response to a previous user input; and determining whether at least a portion of the recognized text matches at least one of the one or more related action phrases. The NLU on the device and/or fulfillment on the device are more likely to be activated when the recognized text matches a related action phrase. Soft or exact matching may be used. For example, if the most recent action was turning on a smart light, related action phrases such as "dim" and/or "turn off" may be determined, to facilitate fulfillment on the device and/or NLU on the device for a subsequent hotword-free spoken utterance such as "dim them to 50%". As another example, if the most recent action was setting an alarm for a particular time, the related action phrases may include times, to facilitate fulfillment on the device and/or NLU on the device for a follow-up hotword-free spoken utterance such as "actually, make it for 8:05 AM".
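The following sketch illustrates the related-action-phrase check under the assumption that the client keeps a simple mapping from the most recent action to likely follow-up phrases and uses substring matching; the mapping and matching scheme are illustrative only.

```python
# Hypothetical mapping from the most recent assistant action to phrases that a
# hotword-free follow-up utterance is likely to contain.
RELATED_ACTION_PHRASES = {
    "turn_on_light": ["dim", "turn off", "brighter", "set to"],
    "set_alarm": ["make it", "am", "pm", "cancel it", "o'clock"],
}

def follow_up_matches(recognized_text: str, most_recent_action: str) -> bool:
    """Return True if any related-action phrase for the most recent action
    appears in the recognized text (a simple substring check stands in for
    whatever soft matching a real system would use)."""
    text = recognized_text.lower()
    phrases = RELATED_ACTION_PHRASES.get(most_recent_action, [])
    return any(phrase in text for phrase in phrases)

print(follow_up_matches("dim them to 50%", "turn_on_light"))            # True
print(follow_up_matches("actually, make it for 8:05 am", "set_alarm"))  # True
print(follow_up_matches("what a nice day", "set_alarm"))                # False
```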

In some implementations, whether to activate an NLU on the device and/or fulfillment on the device may additionally or alternatively be determined based on determining whether at least a portion of the recognized text conforms to content rendered at the client device during the spoken utterance. For example, if a suggested automated assistant action of "turn up the volume" is visually displayed during the spoken utterance, it may be determined that the recognized text including "turn up" and/or "volume" conforms to the content of the visual rendering, and that NLUs on the device and/or fulfillment on the device are more likely to be activated. As another example, if an image and/or text description of a given entity is being rendered, it may be determined that the identified text (which includes aliases of the given entity, attributes of the given entity, and/or attributes of additional entities related to the given entity) conforms to the content of the visual rendering. For example, if content related to a particular network router is being rendered, it may be determined that the identified text, including aliases of the router (e.g., brand and/or model), attributes of the router (e.g., 802.11ac compliant), and/or attributes of related entities (e.g., modem), conforms to the visually rendered content.

In some implementations, whether to activate an NLU on the device and/or fulfillment on the device can additionally or alternatively be determined based on determining whether at least a portion of the recognized text conforms to a non-automated-assistant application executing during the spoken utterance. For example, if the recognized text conforms to an action performable by an application executing in the foreground of the client device during the spoken utterance, the NLU on the device and/or fulfillment on the device are more likely to be activated. For instance, if a messaging application is executing in the foreground, recognized text including "reply with", "send", and/or other text related to the messaging application may be considered to conform to actions performable by that non-automated-assistant application. Optionally, when multiple actions can be performed by the non-automated-assistant application but only a subset of those actions can be performed in the current state of the non-automated-assistant application, the subset may be the only actions considered, or may be weighted more heavily in the determination (than actions not in the subset). As described herein, in some implementations in which the recognized text relates to an executing non-automated-assistant application, the fulfillment determined based on the spoken utterance may include the automated assistant application interfacing (directly or via the operating system) with the non-automated-assistant application to cause the non-automated-assistant application to perform an action that conforms to the spoken utterance. For example, a spoken utterance of "reply with sounds great" may cause the automated assistant to send a command to the messaging application (optionally via the operating system) that causes the messaging application to reply, with "sounds great", to a recently and/or currently rendered message.
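Below is a hedged sketch of checking recognized text against actions performable by the foreground application and building a command the assistant could send to it. The ForegroundApp structure, the keyword-to-action mapping, and the command payload format are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class ForegroundApp:
    name: str
    available_actions: set  # actions performable in the app's *current* state

# Illustrative keyword -> app-action mapping; names are assumptions.
ACTION_KEYWORDS = {
    "reply with": "send_reply",
    "send": "send_reply",
    "forward": "forward_message",
}

def fulfillment_for_foreground_app(recognized_text: str, app: ForegroundApp):
    """If the recognized text conforms to an action the foreground app can
    currently perform, return a command the assistant could send to that app
    (directly or via the operating system); otherwise return None."""
    text = recognized_text.lower()
    for keyword, action in ACTION_KEYWORDS.items():
        if keyword in text and action in app.available_actions:
            payload = text.split(keyword, 1)[1].strip()
            return {"app": app.name, "action": action, "payload": payload}
    return None

messaging = ForegroundApp("messaging", {"send_reply"})
print(fulfillment_for_foreground_app("reply with sounds great", messaging))
# {'app': 'messaging', 'action': 'send_reply', 'payload': 'sounds great'}
```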

In some implementations, whether to activate an NLU on the device and/or fulfillment on the device may additionally and/or alternatively be determined based on processing the recognized text using an on-device semantic model to determine a probability of whether the recognized text is directed to the automated assistant. When the probability is more indicative of the recognized text being directed to the automated assistant, the NLU on the device and/or fulfillment on the device are more likely to be activated. The semantic model may be trained to distinguish between text that is directed to the automated assistant and text that is not directed to the automated assistant (e.g., text that is instead directed to another human and/or that comes from a television or other source). The semantic model may be used to process the text on a token-by-token basis, or may instead process an embedding of the recognized text, such as a generated Word2Vec embedding or other semantic embedding.

In some implementations, whether to activate an NLU on the device and/or fulfillment on the device may additionally and/or alternatively be determined based on TI-SID (as described above), directional speech (as described above), and/or other considerations. For example, if directional speech is detected and/or determined with a higher probability, the NLU on the device and/or fulfillment on the device are more likely to be activated. Further, for example, if the TI-SID indicates that the spoken utterance is from a currently active profile of the client device, the NLU on the device and/or fulfillment on the device are more likely to be activated.

In some implementations, one or more of the above considerations may be processed by a query classifier in determining whether to activate the NLU on the device and/or fulfillment on the device. The query classifier may utilize one or more rules and/or a query model to determine whether to activate the NLU on the device and/or fulfillment on the device. For example, a rule may specify that if a given condition exists, alone or in combination with other particular conditions, the NLU on the device and/or fulfillment on the device should be activated. For instance, a rule may specify that if the recognized text matches a phrase in the assistant language model and matches currently rendered content, the NLU on the device and the fulfillment on the device should be activated. In implementations that use a query model, it may be a machine learning model trained, for example, based on supervised and/or semi-supervised training examples. For example, with the permission of participating users, the training examples may have training example inputs that are based on the various determinations described above, from actual interactions of those participating users. Further, a training example can be labeled as "positive" (i.e., the spoken utterance was intended as an assistant request) in response to the respective user interacting with the resulting response and/or providing positive feedback to a prompt asking whether the spoken utterance was intended as an assistant request. A training example can be labeled as "negative" (i.e., the spoken utterance was not intended as an assistant request) in response to the respective user quickly dismissing the resulting response (e.g., before it is fully rendered and/or before it can be fully perceived) and/or providing negative feedback to a prompt asking whether the spoken utterance was intended as an assistant request.
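A minimal rule-based sketch of such a query classifier follows; the signal names, thresholds, and rules are illustrative assumptions, and a learned query model could replace the rules entirely.

```python
from dataclasses import dataclass

@dataclass
class QuerySignals:
    matches_assistant_lm: bool = False      # phrase in the local assistant language model
    matches_related_action: bool = False    # follow-up to the most recent action
    matches_rendered_content: bool = False  # conforms to content on the screen
    matches_foreground_app: bool = False    # conforms to a foreground-app action
    semantic_assistant_prob: float = 0.0    # semantic model: text is assistant-directed
    ti_sid_active_profile: bool = False     # utterance is from the active profile
    directed_speech_prob: float = 0.0       # directional speech metric

def should_activate_nlu_and_fulfillment(s: QuerySignals) -> bool:
    """Illustrative rule set for the query classifier: any strong textual match,
    or a semantic/directional-speech score backed by a speaker match, activates
    on-device NLU and fulfillment. Rules and thresholds are assumptions."""
    if s.matches_assistant_lm and s.matches_rendered_content:
        return True
    if s.matches_related_action or s.matches_foreground_app:
        return True
    if s.semantic_assistant_prob >= 0.8 and s.ti_sid_active_profile:
        return True
    if s.directed_speech_prob >= 0.9 and s.matches_assistant_lm:
        return True
    return False

print(should_activate_nlu_and_fulfillment(QuerySignals(matches_related_action=True)))  # True
print(should_activate_nlu_and_fulfillment(QuerySignals(semantic_assistant_prob=0.5)))  # False
```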

Note that the supervision signals or explicit labels used in training the query model seek to ensure that the user actually intends to interact with the automated assistant, in contrast to the supervision signals or explicit labels used in training the attention model (described above). Thus, while the attention model may be trained to purposefully cause some "over-triggering" of speech recognition on the device, the query model is trained to reduce false-positive occurrences of triggering the NLU on the device and/or fulfillment on the device. In these and other ways, by using the on-device speech recognition activation techniques disclosed herein in conjunction with the on-device NLU and/or fulfillment activation techniques, hotword-free spoken utterances that are intended for the automated assistant are fully processed and acted upon, while occurrences of under-triggering are reduced. This can reduce the occurrence of situations in which the user needs to provide the spoken utterance again, optionally preceded by an explicit invocation prompt, which would prolong the user's interaction with the automated assistant and result in excessive consumption of resources. In various implementations, on-device training of the query model may occur to personalize the query model to the client device and/or to provide gradients (from the training) for federated learning (e.g., further training the query model based on gradients from multiple client devices and providing the further trained model for use). When on-device training occurs, signals such as, but not limited to, those described above may be used as supervision signals for the training.

When the NLU on the device is activated, natural language understanding is performed on the device on the recognized text to generate natural language understanding data. Further, when fulfillment on the device is activated, the fulfillment on the device is determined using the natural language understanding data. As described herein, fulfillment on the device may be performed using an on-device fulfillment module that utilizes the NLU data (from the NLU on the device) and optionally other local data to determine an action to take to resolve the intent of the spoken utterance (and optionally the parameters for the intent). This may include determining local and/or remote responses (e.g., answers) to the spoken utterance, interactions with a locally installed application to perform based on the spoken utterance, commands to send to Internet of Things (IoT) devices (directly or via a corresponding remote system) based on the spoken utterance, and/or other resolution actions to perform based on the spoken utterance. Fulfillment on the device may then initiate local and/or remote performance/execution of the determined action to resolve the spoken utterance.

Some implementations disclosed herein include one or more computing devices including one or more processors, such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), and/or a Tensor Processing Unit (TPU). The one or more processors are operable to execute instructions stored in the associated memory and the instructions are configured to perform any of the methods described herein. The computing device may include, for example, a client assistant device having a microphone, at least one display, and/or other sensor components. Some implementations also include one or more non-transitory computer-readable storage media storing computer instructions executable by one or more processors to implement any of the methods described herein.

Drawings

Fig. 1A depicts an example process flow demonstrating aspects of the present disclosure, in accordance with various implementations.

FIG. 1B is a block diagram of an example environment that includes various components from FIG. 1A and in which implementations disclosed herein may be implemented.

Fig. 2 depicts a flowchart illustrating an example method in accordance with implementations disclosed herein.

Fig. 3A depicts an example of a client device, a user providing a hotword-free spoken utterance, and activated speech recognition on the device causing a streaming transcription of the spoken utterance to be rendered on a display of the client device.

Fig. 3B depicts an example of a client device after the spoken utterance of fig. 3A has been provided, after the NLU on the device and the fulfillment on the device have been activated and the corresponding performance of fulfillment has been implemented.

Fig. 4A depicts an example of a client device with a messaging application in the foreground, a user providing a hotword-free spoken utterance, and activated speech recognition on the device causing a streaming transcription of the spoken utterance to be rendered on a display of the client device.

Fig. 4B depicts an example of the client device after the spoken utterance of fig. 4A has been provided, and after the NLU on the device and fulfillment on the device have been activated and the corresponding on-device execution of the fulfillment has been performed, causing the messaging application to send a reply based on the spoken utterance of fig. 4A.

Fig. 5A depicts an example of a client device with a lighting control application in the foreground, a user providing a hotword-free spoken utterance, and activated speech recognition on the device causing a streaming transcription of the spoken utterance to be rendered on a display of the client device.

Fig. 5B depicts an example of a client device after the spoken utterance of fig. 5A has been provided, and after determining not to activate an NLU on the device and/or fulfillment on the device.

FIG. 6 illustrates an example architecture of a computing device.

Detailed Description

Turning first to fig. 1A, an example process flow is shown demonstrating aspects of the present disclosure, in accordance with various implementations. The components shown in fig. 1A may be implemented on client device 160 (fig. 1B). In fig. 1A, attention processor 115 receives audio data 110 detected via one or more microphones (165, fig. 1B) of client device 160 and/or sensor data from one or more non-microphone sensors 105 of client device 160. As described herein, the audio data 110 received and/or utilized by the attention processor 115 may include raw audio data and/or representations thereof. The audio data 110 may be provided in a streaming manner as new audio data is detected. Further, the sensor data received from the non-microphone sensors 105 may be raw sensor data and/or representations and/or abstractions thereof (e.g., abstractions provided by an operating system of the client device 160). The sensor data may also be provided in a streaming manner as new sensor data is detected. The non-microphone sensors 105 may include, for example, gyroscopes, accelerometers, laser-based vision sensors, cameras, and/or other sensor components of the client device.

The attention processor 115 processes the audio data 110 and/or sensor data from the non-microphone sensor 105 to determine whether to activate the on-device speech recognition engine 120. In addition to activating the speech recognition engine 120 on the device in response to detecting one or more explicit invocation prompts, in various implementations, the attention processor 115 may additionally or alternatively activate the speech recognition engine 120 on the device in response to various implicit prompts. This results in a reduction in user input provided by the user (at least due to the omission of hotwords or other explicit invocation cues), which directly reduces the duration of the interaction, thereby saving various local processing resources that would otherwise be used in long-term interactions.

The attention processor 115 processes various metrics (e.g., from the audio data 110) and/or sensor data (e.g., representations or abstractions thereof) to determine whether to activate the speech recognition engine 120 on the device. The attention processor 115 may utilize one or more rules and/or attention models 1151 to determine whether to activate the speech recognition engine 120 on the device. The attention model 1151 may be a trained machine learning model, for example, based on supervised and/or semi-supervised training examples.

In some implementations, the attention processor 115 includes a TI-SID module 115A, a directional speech module 115B, and/or a Voice Activity Detection (VAD) module 115C, each for processing audio data and providing one or more metrics to the attention processor 115. The attention processor 115 uses the provided metrics to determine whether to activate speech recognition 120 on the device.

The TI-SID module 115A processes the audio data 110 using the TI-SID model 115A1 to generate an embedding for all or part of a spoken utterance captured by the audio data. The TI-SID model 115A1 may be, for example, a recurrent neural network model and/or another model trained to process a sequence of audio data to generate a rich embedding of the audio data that is usable for text-independent speaker identification. This is in contrast to text-dependent speaker recognition models, which can only be used for a very limited set of words (e.g., hotwords).

The TI-SID module 115A compares the generated embedding to locally stored embeddings for one or more user accounts/profiles registered with the client device 160 (e.g., an embedding for a unique and/or primary user registered with the client device 160). A TI-SID metric may then be generated based on the comparison, where the TI-SID metric indicates whether the generated embedding matches one of the stored embeddings and/or the extent to which they match. In some implementations or situations, the attention processor 115 may optionally activate the on-device speech recognition engine 120 only when the TI-SID metric indicates a match (i.e., the distance between the embeddings satisfies a threshold). For example, the attention processor 115 may always require that the TI-SID metric indicate a match in order to activate the speech recognition engine 120 on the device. As another example, the attention processor 115 may require that the TI-SID metric indicate a match to activate the speech recognition engine 120 on the device when the TI-SID metric is the only metric being relied upon and/or only in combination with one or more implicit invocation prompts (i.e., when no explicit prompt is detected).

The directional speech module 115B may utilize a trained acoustic model 115B1, the acoustic model 115B1 being trained to distinguish spoken utterances that are directed to the client device from spoken utterances that are not directed to the client device. In addition to distinguishing between human speech and non-human speech, the directional speech module 115B seeks to distinguish between human speech that is directed to the client device and human speech that is not directed to the client device (e.g., human speech directed to another person, human speech originating from a television or other source). The directional speech module 115B, by processing the audio data 110 using the acoustic model 115B1, may generate a directional speech metric indicating whether human speech that is directed to the client device is detected; the metric may be a binary metric or a probability. In some implementations or scenarios, the attention processor 115 may optionally activate the on-device speech recognition engine 120 only when the directional speech module 115B indicates directional speech (e.g., a directional speech metric that satisfies a threshold). For example, the attention processor 115 may always require that the directional speech module 115B indicate directional speech in order to activate the speech recognition engine 120 on the device. As another example, the attention processor 115 may require that the directional speech module 115B indicate directional speech to activate the speech recognition engine 120 on the device when directional speech is the only metric being relied upon and/or only in combination with one or more implicit invocation prompts (i.e., when no explicit prompt is detected).

The VAD module 115C processes the audio data 110 to monitor for the occurrence of any human speech and may output a voice activity metric that indicates whether voice activity is present. The voice activity metric may be a binary metric or may be a probability that human speech is present in the audio data. The VAD module 115C may optionally utilize a VAD model 115C1 to process the audio data and determine whether voice activity is present. The VAD model 115C1 may be a machine learning model trained to enable discrimination between audio data without any human utterances and audio data with human utterances. In some implementations or situations, the attention processor 115 may optionally activate the speech recognition engine 120 on the device only when the VAD module 115C indicates voice activity. For example, the attention processor 115 may always require that the VAD module 115C indicate voice activity in order to activate the speech recognition engine 120 on the device. As another example, the attention processor 115 may require that the VAD module 115C indicate voice activity to activate the speech recognition engine 120 on the device when the voice activity metric is the only metric being relied upon and/or only in combination with one or more implicit invocation prompts (i.e., when no explicit prompt is detected). In some implementations, the TI-SID module 115A and/or the directional speech module 115B may optionally be activated only when voice activity is detected by the VAD module 115C, although they may optionally process buffered audio data once activated.

In some implementations or scenarios, the attention processor 115 may activate the on-device speech recognition engine 120 based on processing of the audio data 110 alone. However, in other implementations or scenarios, the attention processor 115 may additionally or alternatively activate the on-device speech recognition engine 120 based on processing of sensor data from the non-microphone sensors 105. The raw sensor data itself, or an abstraction or other representation of the raw sensor data (such as an abstraction provided by the operating system of the client device), may be processed. For example, representations of sensor data from accelerometers, gyroscopes, cameras, and/or laser-based vision sensors may be utilized. Further, for example, an indication from the operating system and/or another component, determined based on the raw sensor data, may be utilized and may indicate whether the client device has been picked up, is currently being held, is in the user's pocket, and/or is in another state. The attention processor 115 may optionally prevent the speech recognition engine 120 on the device from being activated in response to certain sensor data (e.g., an abstraction indicating that the client device 160 is in a user's pocket or other storage location) and/or may require certain sensor data before activating the speech recognition engine 120 on the device.
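The following sketch illustrates gating activation on sensor-derived device state, where some states (e.g., "in pocket") block activation and at least one qualifying state is required before audio-based cues are even considered. The state names and the gating policy are assumptions for illustration.

```python
# Hypothetical operating-system abstractions of raw non-microphone sensor data.
BLOCKING_STATES = {"in_pocket", "in_bag", "face_down"}
REQUIRED_STATES = {"picked_up", "held", "on_and_unlocked"}

def sensor_state_allows_activation(device_states: set) -> bool:
    """Gate on-device speech recognition on sensor-derived device state: certain
    states block activation outright, and at least one qualifying state is
    required before the audio-based cues are considered."""
    if device_states & BLOCKING_STATES:
        return False
    return bool(device_states & REQUIRED_STATES)

print(sensor_state_allows_activation({"in_pocket"}))          # False: blocked
print(sensor_state_allows_activation({"picked_up", "held"}))  # True
print(sensor_state_allows_activation(set()))                  # False: nothing qualifying
```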

As described herein, the attention processor 115 may optionally utilize an attention model 1151 (alone or in combination with one or more rules) to determine whether to activate the on-device speech recognition engine 120. The attention model 1151 may be, for example, a machine learning model trained on supervised and/or semi-supervised training examples. For example, semi-supervised training examples may have training example inputs based on audio data and/or additional sensor data from actual interactions of participating users, with the permission of those users. Further, such training examples may be labeled as "positive" or "negative" based on directional gaze detection and/or other semi-supervised (or even supervised) techniques. As also described herein, the labeling can err toward activating the on-device speech recognition engine 120 often enough to ensure that spoken utterances are recognized, recognizing that the additional techniques described herein will prevent on-device NLU and/or on-device fulfillment when there is no intent to interact with the automated assistant.

Once the attention processor 115 activates the on-device speech recognition engine 120, the on-device speech recognition engine 120 processes the audio data 110 using an on-device speech recognition model (not shown in fig. 1A for simplicity) to determine the recognized text 125 of any spoken utterance captured by the audio data 110. The on-device speech recognition model may optionally be an end-to-end model, and may optionally be supplemented by one or more techniques that generate additional recognized-text hypotheses and select the best hypothesis based on various considerations. The processed audio data 110 may include audio data captured after on-device speech recognition is activated, and optionally recent audio data buffered locally (e.g., audio data at least some of which was processed by the attention processor 115 prior to activation of the on-device speech recognition engine 120). In some implementations, when the on-device speech recognition engine 120 is activated, a human-perceptible cue is rendered to inform the user that such activation has occurred, and/or a stream of the recognized text 125 is rendered as recognition occurs. The visual rendering may also include a selectable "cancel" element that, when selected via touch input at the touchscreen display, stops the on-device speech recognition engine 120. As used herein, activating the speech recognition engine 120 or another component means at least causing it to perform processing beyond the processing it performed prior to activation. This may mean activating it from a fully dormant state.
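As an illustration of how recently buffered audio can be folded in on activation, here is a minimal, hypothetical sketch. The ring-buffer size, frame rate, and the `asr` object interface are all assumptions for illustration, not details from the source.

```python
from collections import deque

# Hypothetical rolling buffer of recent audio frames, so that speech
# captured shortly before activation can still be recognized.
BUFFER_SECONDS = 2
FRAMES_PER_SECOND = 50
recent_audio = deque(maxlen=BUFFER_SECONDS * FRAMES_PER_SECOND)

def on_audio_frame(frame, asr_active: bool, asr) -> None:
    """Always buffer; feed frames to the recognizer only once activated."""
    recent_audio.append(frame)
    if asr_active:
        asr.feed(frame)

def activate_asr(asr) -> None:
    """On activation, flush the buffered frames to the recognizer first so
    the beginning of the utterance is not lost, then continue streaming."""
    asr.start()
    for frame in list(recent_audio):
        asr.feed(frame)
```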

The query classifier 135 determines, based on the recognized text 125 and optionally the context data 130, whether to activate the on-device NLU engine 140 and/or the on-device fulfillment engine 145 (and/or whether to cause the resulting execution 150 based on output from the fulfillment engine 145). The query classifier 135 may activate the on-device NLU engine 140 and/or the on-device fulfillment engine 145 only in response to determining, using one or more of the techniques described herein, that such further processing should occur. In many implementations, the processing performed by the query classifier 135 is more computationally efficient than the processing that occurs through execution of the on-device NLU engine 140, the on-device fulfillment engine 145, and/or any resulting fulfillment. Through this selective activation of on-device speech processing, and the further selective activation of on-device NLU and/or on-device fulfillment, various resources of the client device 160 are conserved and/or the security of user data is maintained.

In some implementations, the query classifier 135 includes an assistant Language Model (LM) module 135A, a semantics module 135B, a recent actions module 135C, a rendered content module 135D, an application module 135E, and/or an entity matcher 135F. Each module utilizes the recognized text 125, and optionally the context data 130 and/or an associated model, to provide one or more metrics to the query classifier 135. The query classifier 135 utilizes the provided metrics to determine whether to activate the on-device NLU engine 140 and/or the on-device fulfillment engine 145.

The assistant Language Model (LM) module 135A may determine whether the recognized text (in whole or in part) matches one or more phrases in the locally stored assistant LM 135A1. When the recognized text matches a phrase in the locally stored assistant LM 135A1, the query classifier 135 is more likely to activate the on-device NLU engine 140 and/or the on-device fulfillment engine 145. The locally stored assistant LM 135A1 can include a plurality of phrases, each of which can be interpreted and acted upon by the automated assistant, and can exclude any phrases that cannot be interpreted and acted upon by the automated assistant. Optionally, the assistant LM 135A1 may be limited to a certain number of phrases in view of storage limitations of the client device, and phrases may be selected for inclusion in the assistant LM 135A1 based on considerations such as frequency of use (e.g., globally, at the client device 160, and/or by a user of the client device 160 across multiple client devices).
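A minimal sketch of such phrase matching follows; the phrase set, matching rule, and function names are hypothetical. A production implementation could instead use prefix matching or a trie to keep lookups cheap as the phrase list grows.

```python
# Hypothetical on-device assistant language model: a small set of phrases
# the assistant can interpret and act on, kept compact to respect storage
# limits of the client device.
ASSISTANT_LM_PHRASES = {
    "turn on the lights",
    "set a timer",
    "what's the weather",
    "play some music",
}

def matches_assistant_lm(recognized_text: str) -> bool:
    """True if the recognized text (or a portion of it) matches a phrase."""
    text = recognized_text.lower().strip()
    return any(phrase in text for phrase in ASSISTANT_LM_PHRASES)

print(matches_assistant_lm("Hey, turn on the lights please"))  # True
print(matches_assistant_lm("I'll call you back later"))        # False
```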

The semantics module 135B processes the recognized text 125 using the semantic model 135B1 to determine a probability that the recognized text is directed to the automated assistant. The semantic model 135B1 may be trained to distinguish text that is directed to the automated assistant from text that is not (e.g., text directed to another person and/or originating from a television or other source). The semantic model 135B1 may process the text on a token-by-token basis (e.g., it may be a recurrent neural network model), or may instead process an embedding of the recognized text, such as a generated Word2Vec embedding or other semantic embedding. When the semantics module 135B indicates that the recognized text is directed to the automated assistant, the query classifier 135 is more likely to activate the on-device NLU engine 140 and/or the on-device fulfillment engine 145.

The recent actions module 135C may optionally reference a related action model 135C1 to determine one or more related action phrases, each having a defined correspondence with a recent action. The recent action is an action performed at and/or via the client device 160 in response to a previous user input (i.e., prior to the current spoken utterance). For example, the recent actions module 135C may determine the recent action from the context data 130. Further, the recent actions module 135C may determine the related action phrases using the related action model 135C1, which may include a mapping between each of a plurality of actions and one or more related action phrases for that action. The recent actions module 135C may further determine whether at least a portion of the recognized text 125 matches at least one of the one or more related action phrases. When the recognized text 125 matches a related action phrase, the query classifier 135 is more likely to activate the on-device NLU engine 140 and/or the on-device fulfillment engine 145.
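The following sketch shows one hypothetical shape such an action-to-phrase mapping could take; the action names, phrases, and matching rule are illustrative assumptions, not content from the source.

```python
# Hypothetical mapping from recently performed actions to phrases that
# plausibly follow them, loosely mirroring the related action model above.
RELATED_ACTION_PHRASES = {
    "set_alarm": ["cancel it", "make it earlier", "change the alarm"],
    "play_music": ["turn it up", "next song", "pause"],
    "turn_on_light": ["dim it", "turn it off", "make it brighter"],
}

def matches_recent_action(recognized_text: str, recent_action: str) -> bool:
    """Check whether the recognized text matches a phrase related to the
    most recent action performed at the client device."""
    text = recognized_text.lower()
    phrases = RELATED_ACTION_PHRASES.get(recent_action, [])
    return any(phrase in text for phrase in phrases)

print(matches_recent_action("Turn it up a bit", "play_music"))  # True
print(matches_recent_action("Turn it up a bit", "set_alarm"))   # False
```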

The rendered content module 135D may optionally reference a rendered content model 135D1 to determine whether at least a portion of the recognized text 125 conforms to content that was rendered at the client device 160 during the spoken utterance. The content rendered at the client device may be determined from the context data 130, and may optionally be supplemented with "related" content using the rendered content model 135D1. For example, if the suggested automated assistant action "show me weather for [city]" is visually displayed during the spoken utterance, recognized text that includes "show me weather" and/or a city name may be determined to conform to the visually rendered content. The name of the city (indicated by the placeholder [city] in the suggestion) may be determined with reference to the rendered content model 135D1. As another example, content that is audibly rendered at the client device 160 may also be considered by the rendered content module 135D. When the recognized text 125 conforms to the rendered content, the query classifier 135 is more likely to activate the on-device NLU engine 140 and/or the on-device fulfillment engine 145.
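As a small, hypothetical illustration of checking recognized text against a rendered suggestion with a placeholder, the sketch below expands a [city] placeholder against a stand-in list of city names; the suggestion string, city list, and regex approach are all assumptions.

```python
import re

# Hypothetical suggestion rendered on screen during the utterance, with a
# [city] placeholder resolved against a small list of city names that
# stands in for the rendered content model described above.
RENDERED_SUGGESTION = "show me weather for [city]"
KNOWN_CITIES = {"paris", "tokyo", "chicago"}

def conforms_to_rendered_content(recognized_text: str) -> bool:
    """True if the recognized text conforms to the rendered suggestion,
    allowing any known city to fill the [city] placeholder."""
    # Turn the suggestion into a regex, replacing [city] with alternatives.
    pattern = re.escape(RENDERED_SUGGESTION).replace(
        re.escape("[city]"), "(" + "|".join(KNOWN_CITIES) + ")")
    return re.search(pattern, recognized_text.lower()) is not None

print(conforms_to_rendered_content("Show me weather for Tokyo"))  # True
print(conforms_to_rendered_content("Call mom"))                   # False
```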

The application module 135E determines whether at least a portion of the recognized text 125 conforms to a non-automated-assistant application executing during the spoken utterance. For example, if a lighting control application is executing in the foreground, recognized text that includes "turn on", "adjust", "light", and/or other text related to the lighting control application may be determined to conform to actions that can be performed by that non-automated-assistant application. Optionally, the foreground application and/or recently foregrounded applications may be actively queried to determine actions and/or text related to the respective current state of the application and/or to the application overall. Such a request may optionally occur via the operating system of the client device 160. When the recognized text 125 conforms to an application on the client device 160 (such as an application executing in the foreground and/or recently executing in the foreground), the query classifier 135 is more likely to activate the on-device NLU engine 140 and/or the on-device fulfillment engine 145.

The entity matcher 135F may determine whether the recognized text 125 conforms to an entity rendered by the client device 160 and/or to a locally stored entity that has been determined to be of interest to a user of the client device 160. The entity matcher 135F can utilize an entity database 135F1, which can include a subset of global entities that are stored locally in response to being determined to be relevant based on past interactions (assistant or otherwise) at the client device 160, the geographic location of the client device 160, and/or other considerations. If the recognized text 125 matches, for example, an alias of an entity, an attribute of the entity, and/or an attribute of an additional entity related to the entity (i.e., having at least a threshold degree of relatedness), the entity matcher 135F may determine that the recognized text 125 conforms to that entity. For example, if a particular sports team is stored in the entity database 135F1, recognized text that includes an alias of that sports team may be determined to conform to the entity. When the recognized text 125 conforms to an entity being rendered and/or an entity in the entity database 135F1, the query classifier 135 is more likely to activate the on-device NLU engine 140 and/or the on-device fulfillment engine 145.
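A minimal, hypothetical sketch of such entity matching against a locally stored subset of entities follows; the entity records, aliases, and matching rule are illustrative assumptions only.

```python
# Hypothetical locally stored entity database: a subset of global entities
# determined to be of interest to the user, each with aliases and a few
# related entities/attributes.
LOCAL_ENTITIES = {
    "golden_state_warriors": {
        "aliases": {"warriors", "golden state", "dubs"},
        "related": {"stephen curry", "chase center"},
    },
    "eiffel_tower": {
        "aliases": {"eiffel tower"},
        "related": {"paris", "champ de mars"},
    },
}

def matches_local_entity(recognized_text: str) -> bool:
    """True if the recognized text mentions an alias of a locally stored
    entity, or a related entity/attribute with sufficient relatedness."""
    text = recognized_text.lower()
    for entity in LOCAL_ENTITIES.values():
        if any(alias in text for alias in entity["aliases"]):
            return True
        if any(term in text for term in entity["related"]):
            return True
    return False

print(matches_local_entity("When do the Warriors play next"))  # True
print(matches_local_entity("Remind me to buy milk"))           # False
```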

In some implementations, one or more of the above considerations may be processed by the query classifier 135 in determining whether to activate on-device NLU and/or on-device fulfillment. The query classifier 135 may utilize one or more rules and/or a query model to make that determination. For example, a rule may specify that, if a given condition exists alone or in combination with other particular conditions, on-device NLU should (or should not) be activated and/or on-device fulfillment should (or should not) occur. For instance, a rule may specify that on-device NLU and on-device fulfillment should be activated only if the recognized text matches a phrase in the assistant language model or conforms to content currently being rendered. In implementations where a query model is used, it may be a machine learning model trained, for example, on supervised and/or semi-supervised training examples. For example, with the permission of participating users, the training examples may have training example inputs based on actual interactions of those users. Notably, the supervision signals or explicit labels used in training the query model seek to ensure that the user intends to interact with the automated assistant, in contrast to the supervision signals or explicit labels used in training the attention model (as described above).
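To make the rule-based combination concrete, here is a minimal sketch that combines the per-module signals described above into a single decision. The rules, threshold, and signal names are hypothetical; an actual implementation could instead feed these signals into a trained query model.

```python
# Illustrative rule-based combination of the per-module signals described
# above. All rules and the 0.8 threshold are assumptions for illustration.
def should_activate_nlu_and_fulfillment(assistant_lm_match: bool,
                                        semantic_prob: float,
                                        recent_action_match: bool,
                                        rendered_content_match: bool,
                                        app_match: bool,
                                        entity_match: bool) -> bool:
    # Rule 1: an assistant-LM match or a rendered-content match is
    # sufficient on its own.
    if assistant_lm_match or rendered_content_match:
        return True
    # Rule 2: a confident semantic signal combined with any contextual
    # match (recent action, foreground app, or local entity) suffices.
    if semantic_prob >= 0.8 and (recent_action_match or app_match or entity_match):
        return True
    # Otherwise, leave on-device NLU and fulfillment inactive.
    return False

print(should_activate_nlu_and_fulfillment(
    assistant_lm_match=False, semantic_prob=0.85,
    recent_action_match=False, rendered_content_match=False,
    app_match=True, entity_match=False))  # True
```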

When the on-device NLU engine 140 is activated, it performs on-device natural language understanding on the recognized text 125 to generate NLU data 141. The NLU engine 140 may optionally generate the NLU data 141 using one or more on-device NLU models (not shown in fig. 1A for simplicity). The NLU data 141 can include, for example, an intent corresponding to the spoken utterance and optionally parameters for the intent (e.g., slot values).

Further, when the on-device fulfillment engine 145 is activated, it generates fulfillment data 146 using the natural language understanding data. The fulfillment engine 145 may optionally utilize one or more on-device fulfillment models (not shown in fig. 1A for simplicity) to generate the fulfillment data 146. The fulfillment data 146 may define a local and/or remote response (e.g., an answer) to the spoken utterance, an interaction to perform with a locally installed application based on the spoken utterance, a command to transmit (directly or via a corresponding remote system) to an Internet of Things (IoT) device based on the spoken utterance, and/or another action to perform to resolve the spoken utterance. The fulfillment data 146 is then provided for local and/or remote performance/execution of the determined action(s) to resolve the spoken utterance. Execution can include, for example, rendering local and/or remote responses (e.g., visually and/or audibly, optionally using a local text-to-speech module), interacting with locally installed applications, transmitting commands to IoT devices, and/or other actions.

Turning now to FIG. 1B, a block diagram of an example environment is illustrated that includes various components from FIG. 1A and in which implementations disclosed herein may be implemented. The client device 160 at least selectively executes the automated assistant client 170. The term "assistant device" is also used herein to refer to a client device 160 that at least selectively executes an automated assistant client 170. The automated assistant client 170 includes, in the example of fig. 1B, the attention processor 115, the on-device speech recognition engine 120, the query classifier 135, the on-device NLU engine 140, and the on-device fulfillment engine 145 described above with respect to fig. 1A. The automated assistant client 170 also includes a speech capture engine 172 and a visual capture engine 174, which are described in more detail below.

The one or more cloud-based automated assistant components 180 can optionally be implemented on one or more computing systems (collectively "cloud" computing systems) communicatively coupled to the client device 160 via one or more local and/or wide area networks (e.g., the internet), indicated generally at 190. The cloud-based automated assistant component 180 may be implemented, for example, via a cluster of high-performance servers.

In various implementations, through its interaction with the one or more cloud-based automated assistant components 180, an instance of the automated assistant client 170 can form a logical instance that appears from the perspective of the user to be the automated assistant 195 with which the user can engage in human-machine interactions (e.g., verbal interactions, gesture-based interactions, and/or touch-based interactions).

One or more client devices 160 may include, for example, one or more of the following: a desktop computing device, a laptop computing device, a tablet computing device, a mobile phone computing device, a computing device of a user vehicle (e.g., an in-vehicle communication system, an in-vehicle entertainment system, an in-vehicle navigation system), a standalone interactive speaker, a smart appliance such as a smart television (or a standard television equipped with a networked dongle with automated assistant capabilities), and/or a wearable apparatus of a user that includes a computing device (e.g., a watch of a user with a computing device, glasses of a user with a computing device, a virtual or augmented reality computing device).

Client device 160 may optionally be equipped with one or more visual components 163 having one or more fields of view. The vision component 163 may take various forms, such as a single-image camera, a stereo camera, a LIDAR component (or other laser-based component), a radar component, and so forth. One or more vision components 163 may be used, for example, by a vision capture engine 174, to capture visual frames (e.g., image frames, laser-based visual frames) of the environment in which the client device 160 is deployed. In some implementations, such visual frames can be utilized to determine whether a user is present near the client device 160 and/or a distance of the user (e.g., the user's face) relative to the client device. The attention processor 115 may utilize such determinations to determine whether to activate the speech recognition engine 120 on the device, and/or the query classifier 135 may utilize such determinations to determine whether to activate the NLU engine 140 on the device and/or the fulfillment engine 145 on the device.

Client device 160 may also be equipped with one or more microphones 165. The speech capture engine 172 may be configured to capture the user's speech and/or other audio data captured via the microphone 165. Such audio data may be used by the attention processor 115 and/or the speech recognition engine 120 on the device, as described herein.

The client device 160 may also include one or more presence sensors 167 and/or one or more displays 169 (e.g., touch-sensitive displays). The display 169 may be used to render the streaming text transcription from the on-device speech recognition engine 120 and/or to render an assistant response generated when performing some fulfillments from the on-device fulfillment engine 145. The display 169 may also be one of the user interface output components through which visual portions of responses from the automated assistant client 170 are rendered. The presence sensors 167 may include, for example, PIR sensors and/or other passive presence sensors. In various implementations, one or more components and/or functions of the automated assistant client 170 may be initiated in response to a detection of human presence based on output from the presence sensors 167. For example, the attention processor 115, the visual capture engine 174, and/or the speech capture engine 172 may optionally be activated only in response to detecting human presence. Further, those and/or other components (e.g., the on-device speech recognition engine 120, the on-device NLU engine 140, and/or the on-device fulfillment engine 145) may optionally be deactivated in response to human presence no longer being detected. In implementations where initiating components and/or functions of the automated assistant client 170 is contingent on first detecting the presence of one or more users, power resources can be conserved.

In some implementations, the cloud-based automated assistant components 180 can include a remote ASR engine 182 that performs speech recognition, a remote NLU engine 183 that performs natural language understanding, and/or a remote fulfillment engine 184 that generates fulfillment. A remote execution module may also optionally be included that performs remote execution based on locally or remotely determined fulfillment data. Additional and/or alternative remote engines may be included. As described herein, in various implementations on-device speech processing, on-device NLU, on-device fulfillment, and/or on-device execution may be prioritized, at least because of the latency and/or network-usage reductions they provide when resolving a spoken utterance (no client-server round trip is needed to resolve the utterance). However, one or more of the cloud-based automated assistant components 180 can be utilized at least selectively. For example, such components can be used in parallel with on-device components, and their output used when a local component fails. For example, the on-device fulfillment engine 145 may fail in certain situations (e.g., due to the relatively limited resources of the client device 160), and in such situations the remote fulfillment engine 184 may generate fulfillment data using the more robust resources of the cloud. The remote fulfillment engine 184 may operate in parallel with the on-device fulfillment engine 145, with its results utilized upon an on-device fulfillment failure, or it may instead be invoked only in response to determining an on-device fulfillment failure.

In various implementations, the NLU engine (on-device and/or remote) may generate annotated output that includes one or more annotations of the recognized text and one or more (e.g., all) of the terms of the natural language input. In some implementations, the NLU engine is configured to identify and annotate various types of grammatical information in the natural language input. For example, the NLU engine may include a morphological module that can separate individual words into morphemes and/or annotate the morphemes, e.g., with their classes. The NLU engine may also include a part-of-speech tagger configured to annotate terms with their grammatical roles. Further, in some implementations, the NLU engine may additionally and/or alternatively include a dependency parser configured to determine syntactic relationships between terms in the natural language input.

In some implementations, the NLU engine may additionally and/or alternatively include an entity tagger configured to annotate entity references in one or more segments, such as references to people (including, for example, literary characters, celebrities, public personalities, etc.), organizations, places (real and imaginary), and so forth. In some implementations, the NLU engine may additionally and/or alternatively include a co-reference parser (not depicted) configured to group or "cluster" references to the same entity based on one or more context hints. In some implementations, one or more components of the NLU engine may rely on annotations from one or more other components of the NLU engine.

The NLU engine may also include an intent matcher configured to determine the intent of a user engaged in an interaction with the automated assistant 195. The intent matcher may use various techniques to determine the intent of the user. In some implementations, the intent matcher may have access to one or more local and/or remote data structures that include, for example, a plurality of mappings between grammars and responsive intents. The grammars included in the mappings may be selected and/or learned over time, and may represent common intents of users. For example, a grammar "play <artist>" may be mapped to an intent that invokes a responsive action causing music by <artist> to be played on the client device 160. Another grammar, "[weather|forecast] today", may match user queries such as "what's the weather today" and "what's the forecast for today?". In addition to or instead of grammars, in some implementations the intent matcher may use one or more trained machine learning models, alone or in combination with one or more grammars. These trained machine learning models may be trained to identify intents, for example, by embedding recognized text of a spoken utterance into a reduced-dimensional space and then determining, e.g., using techniques such as Euclidean distance, cosine similarity, etc., which other embeddings (and thus intents) are closest. As seen in the "play <artist>" example grammar above, some grammars have slots (e.g., <artist>) that can be filled with slot values (or "parameters"). Slot values may be determined in various ways. Often the user will provide the slot value proactively. For example, for the grammar "Order me a <topping> pizza", the user may speak the phrase "Order me a sausage pizza", in which case the slot <topping> is filled automatically. Other slot values may be inferred based on, for example, the user's location, content currently being rendered, user preferences, and/or other cues.
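The sketch below shows one simple way grammar-to-intent mappings with slot placeholders could be implemented, using regular expressions with named capture groups. The grammars, intent names, and omission of any embedding-based fallback are assumptions made only for illustration.

```python
import re

# Hypothetical grammar-to-intent mappings; angle-bracket slots are written
# directly as named regex capture groups.
GRAMMARS = {
    r"play (?P<artist>.+)": "play_music",
    r"order me a (?P<topping>\w+) pizza": "order_pizza",
    r"what's the (weather|forecast)( for)? today": "weather_today",
}

def match_intent(recognized_text: str):
    """Return (intent, slot_values) for the first grammar that matches the
    recognized text, or (None, {}) when no grammar applies."""
    text = recognized_text.lower().strip()
    for pattern, intent in GRAMMARS.items():
        m = re.fullmatch(pattern, text)
        if m:
            return intent, m.groupdict()
    return None, {}

print(match_intent("Order me a sausage pizza"))
# ('order_pizza', {'topping': 'sausage'})
print(match_intent("What's the forecast today"))
# ('weather_today', {})
```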

The fulfillment engine (local and/or remote) may be configured to receive the predicted/estimated intent output by the NLU engine, along with any associated slot values, and to fulfill (or "resolve") that intent. In various implementations, fulfillment (or "resolution") of the user's intent may cause various fulfillment information (also referred to as fulfillment data) to be generated/obtained, e.g., by the fulfillment engine. This can include determining a local and/or remote response (e.g., an answer) to the spoken utterance, an interaction with a locally installed application to perform based on the spoken utterance, a command to transmit (directly or via a corresponding remote system) to an Internet of Things (IoT) device based on the spoken utterance, and/or another action to perform to resolve the spoken utterance. On-device fulfillment can then initiate local and/or remote performance/execution of the determined action(s) to resolve the spoken utterance.

In some implementations, the on-device fulfillment engine 145 may utilize various local data in determining the fulfillment information, such as locally pre-cached fulfillments for various intents, information obtained locally from locally installed applications, and/or other local data. For example, the on-device fulfillment engine 145 (or another component) may maintain a local cache that includes mappings between various intents (and optionally slot values) and associated fulfillment data. At least some of the local cache may be populated with fulfillment data based on fulfillment data previously provided to the assistant application from the cloud-based automated assistant components 180 in response to the on-device fulfillment engine 145 having failed to locally fulfill a previous request at the client device 160. The fulfillment data can be mapped to the intent (and optional slot values) of that request, and/or to the recognized text from which the intent (and optional slot values) was generated. For example, a previous request may have been "what is the default IP address for the hypothetical router", and the response (text and/or voice) of "192.168.1.1" may have been retrieved from the cloud-based automated assistant components 180 in response to the on-device fulfillment engine 145 failing to fulfill the request locally. The response may then optionally be cached locally, in response to an indication that the response is static, and mapped to the recognized text and/or the corresponding NLU data of the previous request. Thereafter, while the response remains cached locally, a subsequent request of "what is the default IP address for the hypothetical router" can be fulfilled locally using the cache (i.e., using the cached "192.168.1.1" response). As another example, some fulfillment data, along with mappings to NLU data (and/or corresponding queries), may be proactively pushed to the automated assistant client 170, even though the corresponding response was not previously rendered by the automated assistant client 170 and/or the corresponding query was not previously submitted at the automated assistant client 170. For example, today's weather forecast and/or tomorrow's weather forecast may be proactively pushed along with mappings to the corresponding intent (e.g., "weather request") and slot values (e.g., "today", "tomorrow", respectively), even though those forecasts were not previously rendered at the automated assistant client 170 (although the forecast for a previous day may have been rendered in response to a related request). While those responses remain cached locally, requests of "what is today's weather" or "what is tomorrow's weather" can be fulfilled locally using the cache.
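A minimal sketch of such a local fulfillment cache is shown below; the key structure, function names, and example entries are hypothetical and not taken from the source.

```python
# Hypothetical local fulfillment cache keyed by (intent, frozenset of slot
# items). Entries could be populated from prior cloud responses indicated
# to be static, or proactively pushed.
fulfillment_cache = {}

def cache_key(intent: str, slots: dict) -> tuple:
    return intent, frozenset(slots.items())

def store_fulfillment(intent: str, slots: dict, response: str) -> None:
    fulfillment_cache[cache_key(intent, slots)] = response

def fulfill_locally(intent: str, slots: dict):
    """Return cached fulfillment data, or None so the caller can fall back
    to a remote fulfillment engine."""
    return fulfillment_cache.get(cache_key(intent, slots))

# A previously fetched static answer is cached and reused locally.
store_fulfillment("default_ip_query", {"device": "router"}, "192.168.1.1")
print(fulfill_locally("default_ip_query", {"device": "router"}))  # 192.168.1.1
print(fulfill_locally("weather_request", {"day": "today"}))       # None -> remote
```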

Fig. 2 depicts a flowchart illustrating an example method 200 in accordance with implementations disclosed herein. For convenience, the operations of method 200 are described with reference to a system performing the operations. The system may include various components of various computer systems, such as one or more components of a client device (e.g., client device 160 of FIG. 1). Further, while the operations of method 200 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted, or added. It will be understood that the operations shown in fig. 2 may correspond to various operations described herein (e.g., operations described in the summary of the invention and fig. 1A, 1B, 3A, 3B, 4A, 4B, 5A, and 5B).

At block 252, the system processes the audio data and/or additional sensor data. The audio data is detected by one or more microphones of the client device. The additional sensor data may be from one or more non-microphone sensors of the client device. As described herein, processing the audio data and/or the additional sensor data may include processing the raw audio data and/or the raw additional sensor data, and/or representations and/or abstractions thereof.

At block 254, the system determines whether to activate speech recognition on the device based on the processing of block 252. If the determination at block 254 is negative, the system continues processing audio data and/or additional sensor data at block 252.

If the determination at block 254 is yes, the system proceeds to block 256 and generates recognized text using the now-activated on-device speech recognition. The recognized text may be generated based on buffered audio data (e.g., audio data buffered during the processing of block 252) and/or subsequently received audio data. Optionally, at block 256A, the system also provides, via a display of the client device, a streaming transcription of the recognized text as it is recognized by the activated on-device speech recognition.

At block 258, the system processes the recognized text (generated at block 256) and/or other data. In some implementations, block 258 may begin in response to detecting the end of the spoken utterance using endpointing and/or other techniques. In some other implementations, block 258 may begin as recognized text is generated, or otherwise before endpointing, to reduce latency.

At block 260, the system optionally determines whether to activate the NLU on the device based on the processing of block 258.

If the determination at block 260 is no, the system proceeds to block 262 and determines whether to stop on-device speech recognition. In some implementations, determining whether to stop on-device speech recognition may include determining whether a threshold amount of time has elapsed without detection of voice activity, directional speech, and/or any recognized text, and/or may be based on other considerations.

If the decision at block 262 is yes, the system proceeds to block 262A and stops speech recognition on the device and then returns to block 252.

If the decision at block 262 is no, the system proceeds to block 256 and continues to recognize text for any spoken utterances in the new audio data using speech recognition on the device.

If the decision at block 260 is yes, then the system proceeds to block 264 and generates an NLU output based on the recognized text using the NLU on the device.

At block 266, the system uses the fulfillment engine on the device to determine whether to generate fulfillment data. In some implementations, if the system generates NLU output at block 264, the system determines to generate fulfillment data. In some implementations, the system determines to generate fulfillment data based on the processing of block 258. As described herein, in some implementations, block 266 may further include, if fulfillment on the device is unsuccessful, determining to utilize remote fulfillment data from the remote fulfillment engine. The NLU data and/or recognized text may be provided to a remote fulfillment engine to obtain remote fulfillment data. Such provision of data may occur in response to determining that fulfillment on the device is unsuccessful, or may occur preemptively to reduce latency in receiving remote fulfillment data if fulfillment on the device is unsuccessful.

If the decision at block 266 is no, the system proceeds to block 262. If the determination at block 266 is yes, the system proceeds to block 268 and performs fulfillment in accordance with the fulfillment data generated at block 266. Execution of fulfillment may occur on the device and/or remotely.
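To make the control flow of method 200 concrete, here is a minimal, hypothetical Python skeleton of the loop formed by blocks 252-268. Every dependency (audio source, sensor reader, activation check, ASR, classifier, NLU, fulfillment) is injected as a placeholder callable or object; none of these names comes from the source, and the timeout is an arbitrary illustrative value.

```python
import time

def run_assistant_loop(read_audio, read_sensors, should_activate_asr,
                       asr, classify, nlu_parse, fulfill,
                       idle_timeout_s: float = 5.0) -> None:
    """Skeleton of method 200's control flow (block numbers per FIG. 2).
    All dependencies are injected; nothing here is a real API."""
    while True:
        chunk = read_audio()                                  # block 252
        if not should_activate_asr(chunk, read_sensors()):    # block 254
            continue
        asr.start()
        last_activity = time.monotonic()
        while True:
            text = asr.recognize(read_audio())                # block 256 (256A: stream text)
            if text:
                last_activity = time.monotonic()
            if text and classify(text):                       # blocks 258/260
                nlu_output = nlu_parse(text)                  # block 264
                fulfillment = fulfill(nlu_output)             # block 266
                if fulfillment:
                    fulfillment()                             # block 268: execute
                    break
            # Block 262: stop on-device ASR after an idle timeout.
            if time.monotonic() - last_activity > idle_timeout_s:
                asr.stop()                                    # block 262A
                break
```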

Fig. 3A depicts an example of a client device 360 where the user 302 provides a hotword-free spoken utterance 302A of "Turn on light 1", and activated on-device speech recognition causes a streaming transcription 362 of the spoken utterance 302A to be rendered on a display 369 of the client device 360. Also shown is a cancel button 364 that, when selected, cancels on-device speech recognition and prevents on-device NLU of the spoken utterance and on-device fulfillment. Note that the streaming transcription 362 and optional cancel button 364 occupy very little (i.e., less than 5%) of the space on the display 369, in order to minimally obscure the content currently being rendered and to mitigate the risk of diverting (and potentially prolonging) any current touch interaction of the user with the client device 360. The streaming transcription 362 and optional cancel button 364 may optionally overlay any currently rendered content, and may optionally be rendered semi-transparently. Further, at least the streaming transcription 362 may optionally include only text, without any background elements. Such optional features can further mitigate the risk of diverting (and potentially prolonging) any current touch interaction of the user with the client device 360.

As described herein, various considerations may be considered in determining to activate speech recognition on a device. For example, speech recognition on the device may be activated based on a non-microphone sensor signal indicating that the user 302 is holding the client device 360 and/or has picked up the client device 360. Further, for example, speech recognition on the device may additionally or alternatively be activated based on audio data (from a microphone of the client device 360) indicating the presence of voice activity, directional speech, and/or speech recognized as belonging to the profile of the user 302.

Fig. 3B depicts an example of the client device 360 after the spoken utterance 302A of fig. 3A has been provided, and after on-device NLU and on-device fulfillment have been activated and the corresponding execution of the fulfillment has been carried out. For example, as described herein, on-device NLU and/or on-device fulfillment may be activated in response to various considerations based on the recognized text and/or contextual data. Further, the recognized text may be processed with on-device NLU to generate NLU data, and fulfillment data may be generated based on the NLU data with on-device fulfillment. In the example of figs. 3A and 3B, the fulfillment data defines that a command should be sent (directly or indirectly) to "light 1" to turn it on, and that graphical interface 363 should be rendered. Execution of the fulfillment may occur via on-device and/or remote components. Graphical interface 363 both informs the user that "light 1" has been turned on and provides the user 302 with the ability to touch-interact with a dimming element to adjust the brightness of "light 1". Note that graphical interface 363 occupies relatively little (i.e., less than 25%) of the space on the display 369, in order to minimally occlude the content currently being rendered and to reduce the risk of diverting (and potentially prolonging) any current touch interaction of the user with the client device 360. Graphical interface 363 may optionally overlay any currently rendered content and may optionally be rendered semi-transparently.

Fig. 4A depicts an example of a client device 460 with a messaging application 408 in the foreground, where the user 402 provides a hotword-free spoken utterance 402A of "Reply to it, 'sounds good'", and activated on-device speech recognition causes a streaming transcription 462 of the spoken utterance 402A to be displayed on a display 469 of the client device 460. Also shown is a cancel button 464 that, when selected, cancels on-device speech recognition and prevents on-device NLU of the spoken utterance and on-device fulfillment. Note that the streaming transcription 462 and optional cancel button 464 occupy very little (i.e., less than 3%) of the space on the display 469, in order to minimally occlude the currently rendered content and to mitigate diverting (and potentially prolonging) any current interaction of the user with the messaging application 408 (e.g., reading the message 408A from "Bob") rendered in the foreground of the client device 460. Note further that the streaming transcription 462 and cancel button 464 are rendered by an assistant application that is separate from the messaging application 408. As described herein, various considerations may be taken into account in determining to activate on-device speech recognition.

Fig. 4B depicts an example of the client device 460 after the spoken utterance 402A of fig. 4A has been provided, and after on-device NLU and on-device fulfillment have been activated and the corresponding execution has caused the messaging application 408 to send a reply based on the spoken utterance of fig. 4A. For example, on-device NLU and/or on-device fulfillment may be activated based at least in part on determining that the recognized text conforms to the messaging application 408, e.g., based on determining that the recognized text conforms to an action that can be performed by the messaging application 408. In some cases, this may be further based on the messaging application 408 being in the foreground and/or on the recognized text conforming to an action that can be performed in the current state of the messaging application 408.

The recognized text may be processed with on-device NLU to generate NLU data, and fulfillment data may be generated based on the NLU data with on-device fulfillment. In the example of figs. 4A and 4B, the fulfillment data defines that the assistant application should send a command (directly or via the operating system) to the messaging application 408 to cause a "sounds good" reply 408B (fig. 4B) to be generated and sent as a reply in the currently rendered message thread. Execution of the fulfillment can occur by sending such a command.

Fig. 5A depicts an example of a client device 560 in which, with a lighting control application 510 in the foreground, a user 502 provides a hotword-free spoken utterance 502A that is intended for another user 504 (e.g., the utterance 502A may be in response to the other user 504 asking "what should I say in response to Jane's email?"). As described herein, various considerations may be taken into account in determining whether to activate on-device speech recognition. Although a streaming transcription 562 from on-device speech recognition is shown in fig. 5A, in various implementations on-device speech recognition may optionally not be activated and/or the streaming transcription 562 may not be generated. Determining not to activate on-device speech recognition may be based on one or more considerations. For example, since the user 502 actually directed the utterance 502A to the other user 504, on-device speech recognition may not be activated based at least in part on determining that directional speech did not occur (e.g., did not occur with at least a threshold probability). As another example, on-device speech recognition may additionally or alternatively not be activated based at least in part on determining, using TI-SID and/or speaker diarization techniques, that two or more users are talking to each other. As another example, on-device speech recognition may additionally or alternatively not be activated based at least in part on the lighting application 510 being in the foreground and/or on no message notification having recently been received and/or being present in the notification bar.

Fig. 5B depicts an example of the client device 560 after the spoken utterance 502A of fig. 5A has been provided, and after it has been determined not to activate on-device NLU and/or on-device fulfillment. As a result of that determination, fulfillment is not performed and the streaming transcription is removed from the display 569. Note that utterance 502A is the same as utterance 402A of fig. 4A. However, unlike in fig. 4B, fulfillment is not performed (or even generated) in the example of fig. 5B. This may be based at least in part on determining that the recognized text does not conform to the executing lighting application 510 (whereas it does conform to the messaging application 408 in figs. 4A and 4B). For example, the recognized text may be determined not to conform to any action that can be performed by the lighting application 510. In some cases, this may be further based on the lighting application 510 being in the foreground and/or on the messaging application 408 (or another messaging application) not executing and/or not having been recently accessed.

FIG. 6 is a block diagram of an example computing device 610 that may optionally be used to perform one or more aspects of the techniques described herein. In some implementations, one or more of the client device, cloud-based automated assistant component, and/or other components may include one or more components of the example computing device 610.

Computing device 610 typically includes at least one processor 614, which communicates with a number of peripheral devices via a bus subsystem 612. These peripheral devices may include: storage subsystems 624 including, for example, a memory subsystem 625 and a file storage subsystem 626; a user interface output device 620; user interface input devices 622; and a network interface subsystem 616. The input and output devices allow a user to interact with the computing device 610. Network interface subsystem 616 provides an interface to external networks and couples to corresponding interface devices in other computing devices.

User interface input devices 622 may include: a keyboard; a pointing device such as a mouse, trackball, touchpad, or tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems, microphones, and the like; and/or other types of input devices. In general, use of the term "input device" is intended to include all possible types of devices and ways to input information onto computing device 610 or a communication network.

User interface output device 620 may include a display subsystem, a printer, a facsimile machine, or a non-visual display such as an audio output device. The display subsystem may include a Cathode Ray Tube (CRT), a flat panel device such as a Liquid Crystal Display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide a non-visual display, such as via an audio output device. In general, use of the term "output device" is intended to include all possible types of devices as well as ways to output information from computing device 610 to a user or to another machine or computing device.

Storage subsystem 624 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, storage subsystem 624 may include logic to perform selected aspects of the method of fig. 2 and to implement various components depicted in figs. 1A and 1B.

These software modules are typically executed by processor 614 alone or in combination with other processors. Memory 625 used in storage subsystem 624 may include a number of memories, including a main Random Access Memory (RAM) 630 for storing instructions and data during program execution and a Read Only Memory (ROM) 632 in which fixed instructions are stored. File storage subsystem 626 may provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. Modules implementing the functionality of some implementations may be stored by file storage subsystem 626 in storage subsystem 624, or in other machines accessible to processor 614.

Bus subsystem 612 provides a mechanism for various components and subsystems of computing device 610 to communicate with one another as intended. Although bus subsystem 612 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple buses.

Computing device 610 may be of various types, including a workstation, a server, a computing cluster, a blade server, a server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 610 depicted in FIG. 6 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 610 are possible with more or fewer components than the computing device depicted in fig. 6.

Where the systems described herein collect or otherwise monitor personal information about users, or may make use of personal and/or monitored information, users may be provided with an opportunity to control whether programs or features collect user information (e.g., information about a user's social network, social actions or activities, profession, preferences, or current geographic location), or to control whether and/or how to receive content from a content server that may be more relevant to the user. In addition, certain data may be processed in one or more ways before being stored or used, so that personally identifiable information is removed. For example, a user's identity may be processed so that no personally identifiable information can be determined for the user, or a user's geographic location may be generalized where geographic location information is obtained (such as to a city, ZIP code, or state level), so that a particular geographic location of the user cannot be determined. Thus, the user may have control over how information about the user is collected and/or used.

The above description is provided as an overview of some implementations of the present disclosure. Those and other implementations are described in additional detail below.

In some implementations, a method is provided that is performed by an automated assistant application of a client device using one or more processors of the client device. The method includes determining to activate on-device speech recognition in response to determining that one or more conditions are satisfied. Determining satisfaction of the one or more conditions includes determining the satisfaction based on processing of both: hotword-free audio data detected by one or more microphones of the client device; and additional sensor data that is based on output from at least one non-microphone sensor of the client device. The method further includes generating, using on-device speech recognition, recognized text of a spoken utterance captured by the audio data and/or by additional hotword-free audio data detected by one or more of the microphones subsequent to the audio data. Generating the recognized text includes performing on-device speech recognition on the audio data and/or the additional audio data. The method further includes determining, based on the recognized text, whether to activate on-device natural language understanding of the recognized text and/or to activate on-device fulfillment that is based on the on-device natural language understanding. The method further includes, when it is determined to activate on-device natural language understanding and/or to activate on-device fulfillment, performing the on-device natural language understanding and/or initiating the on-device fulfillment. Further, the method includes deactivating on-device speech recognition when it is determined not to activate on-device natural language understanding and/or not to activate on-device fulfillment.

These and other implementations of the technology may include one or more of the following features.

In some implementations, the at least one non-microphone sensor on which the additional sensor data is based may include an accelerometer, magnetometer, gyroscope, and/or laser-based visual sensor.

In some implementations, determining satisfaction of the one or more conditions based on processing the hotword-free audio data includes processing the hotword-free audio data using an acoustic model to generate a directional speech metric. The acoustic model may be trained to distinguish spoken utterances that are directed to the client device from spoken utterances that are not directed to the client device. In some of those implementations, determining satisfaction of the one or more conditions based on processing the hotword-free audio data may further include determining the satisfaction based in part on the directional speech metric.

In some implementations, determining satisfaction of the one or more conditions based on processing the hotword-free audio data may additionally or alternatively include processing the hotword-free audio data using a voice activity detector to detect the presence of human speech, and determining satisfaction of the one or more conditions based in part on detecting the presence of human speech.

In some implementations, determining satisfaction of the one or more conditions based on processing the hotword-free audio data may additionally or alternatively include: processing the hotword-free audio data using a text-independent speaker recognition model to generate a voice embedding; comparing the voice embedding to a recognized voice embedding stored locally on the client device; and determining satisfaction of the one or more conditions based in part on the comparison.
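As a small illustration of such a comparison, the sketch below scores a voice embedding against locally stored embeddings with cosine similarity. The embedding values, the similarity measure, and the threshold are assumptions for illustration only.

```python
import math

def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def speaker_matches(utterance_embedding, enrolled_embeddings,
                    threshold: float = 0.75) -> bool:
    """Compare a voice embedding of the current audio against voice
    embeddings stored locally for enrolled users (all values hypothetical)."""
    return any(cosine_similarity(utterance_embedding, enrolled) >= threshold
               for enrolled in enrolled_embeddings)

enrolled = [[0.1, 0.9, 0.3], [0.7, 0.2, 0.6]]   # locally stored embeddings
print(speaker_matches([0.12, 0.88, 0.31], enrolled))  # True (close to first)
print(speaker_matches([-0.9, 0.1, -0.2], enrolled))   # False
```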

In some implementations, determining whether to activate natural language understanding on the device and/or activate fulfillment on the device based on the recognized text may include: a determination is made whether the text matches one or more phrases stored in the locally stored assistant language model. The locally stored assistant language model may include a plurality of phrases, each of which may be interpreted by the automated assistant.

In some implementations, determining whether to activate natural language understanding on the device and/or activate fulfillment on the device based on the recognized text may additionally or alternatively include determining whether the text conforms to a predefined assistant query pattern.

In some implementations, determining whether to activate on-device natural language understanding and/or on-device fulfillment based on the recognized text may additionally or alternatively include: determining one or more related action phrases, each having a defined correspondence to a most recent action performed at the client device in response to a previous user input; and determining whether at least a portion of the text matches at least one of the one or more related action phrases.

In some implementations, determining whether to activate natural language understanding on the device and/or activate fulfillment on the device based on the recognized text may additionally or alternatively include determining whether at least a portion of the recognized text conforms to content that is rendered at the client device during the spoken utterance. In some of these implementations, the content rendered at the client device includes suggested automated assistant actions that are graphically rendered.

In some implementations, determining whether to activate on-device natural language understanding and/or on-device fulfillment based on the recognized text may additionally or alternatively include determining to activate on-device fulfillment, and the method may further include performing the on-device fulfillment. In some of those implementations, performing the on-device fulfillment includes providing a command to a separate application on the client device.

In some implementations, deactivating on-device speech recognition may include deactivating it when it is determined not to activate on-device natural language understanding and/or on-device fulfillment, and further based on at least a threshold duration of time having elapsed without further detected voice activity and/or further recognized text.

In some implementations, performing natural language understanding on a device and/or fulfillment on a device may include: performing natural language understanding on a device to generate natural language understanding data; and performing fulfillment on the device using the natural language understanding data.

In some implementations, the method may further include, during generation of the recognized text using speech recognition on the device, causing a streaming transcription of the recognized text to be rendered in a graphical interface at a display of the client device. In some of those implementations, the method may further include rendering a selectable interface element in the graphical interface with the streaming transcription, the selectable interface element, when selected, causing speech recognition on the device to cease. In some of those implementations, the method may further include changing the graphical interface when it is determined to activate natural language understanding on the device and/or to activate fulfillment on the device.

In some implementations, a method is provided that is performed by an automated assistant application of a client device using one or more processors of the client device. The method includes determining to activate on-device speech recognition in response to determining that one or more conditions are satisfied. Determining satisfaction of the one or more conditions includes determining the satisfaction based on: hotword-free audio data detected by one or more microphones of the client device; and/or additional sensor data that is based on output from at least one non-microphone sensor of the client device. The method further includes generating, using on-device speech recognition, recognized text of a spoken utterance captured by the audio data and/or by additional hotword-free audio data detected by one or more of the microphones subsequent to the audio data. Generating the recognized text includes performing on-device speech recognition on the audio data and/or the additional audio data. The method further includes: determining, based on the recognized text, to activate on-device natural language understanding of the recognized text; performing the activated on-device natural language understanding of the recognized text; and initiating on-device fulfillment of the spoken utterance based on the on-device natural language understanding.

These and other implementations of the technology may include one or more of the following features.

In some implementations, determining, based on the recognized text, to activate on-device natural language understanding of the recognized text may include: determining whether at least a portion of the recognized text conforms to content rendered at the client device during the spoken utterance, and/or determining whether at least a portion of the text matches one or more related action phrases, each having a defined correspondence to a most recent action performed at the client device in response to a previous user input.

Other implementations may include a computer program including instructions that may be used by one or more processors (e.g., a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and/or a Tensor Processing Unit (TPU)) to perform a method such as one or more of the methods described above and/or elsewhere herein. Still other implementations may include a client device having at least one microphone, at least one display, and one or more processors operable to execute stored instructions to perform a method, such as one or more of the methods described above and/or elsewhere herein.

It should be understood that all combinations of the foregoing concepts and additional concepts described in greater detail herein are contemplated as being part of the subject matter disclosed herein. For example, all combinations of claimed subject matter appearing at the end of this disclosure are considered to be part of the subject matter disclosed herein.
