Voice trigger of digital assistant

Document No.: 344512    Publication date: 2021-12-03

This disclosure, "Voice trigger of a digital assistant," was created by J. G. Binder, O. Tackin, S. D. Post, and T. R. Gruber on 2014-02-07. The invention relates to a voice trigger for a digital assistant and provides a method for operating the voice trigger. In some implementations, the method is performed on an electronic device that includes one or more processors and memory storing instructions for execution by the one or more processors. The method includes receiving a sound input. The sound input may correspond to a spoken word or phrase, or a portion thereof. The method includes determining whether at least a portion of the sound input corresponds to a predetermined sound type, such as a human voice. The method includes, upon determining that at least a portion of the sound input corresponds to the predetermined type, determining whether the sound input includes predetermined content, such as a predetermined trigger word or phrase. The method also includes initiating a voice-based service, such as a voice-based digital assistant, upon determining that the sound input includes the predetermined content.

1. A non-transitory computer-readable storage medium storing one or more programs for execution by one or more processors of an electronic device, the one or more programs comprising instructions for:

determining whether the electronic device is in a predetermined orientation, wherein the predetermined orientation corresponds to a display of the electronic device being horizontal and facing down;

upon determining that the electronic device is in the predetermined orientation, generating an instruction to enable a predetermined mode of a voice trigger of the electronic device, wherein the predetermined mode is a standby mode; and

enabling the predetermined mode of the voice trigger based on the instruction.

2. The non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions for:

determining whether the electronic device is in a second predetermined orientation, wherein the second predetermined orientation corresponds to the display of the electronic device being horizontal and right side up; and

enabling a second predetermined mode of the voice trigger upon determining that the electronic device is in the second predetermined orientation, wherein the second predetermined mode is a listening mode.

3. The non-transitory computer-readable storage medium of claim 1, wherein a trigger phrase associated with the voice trigger is based on the predetermined orientation.

4. The non-transitory computer-readable storage medium of claim 1, wherein determining whether the electronic device is in the predetermined orientation comprises: determining an orientation of the electronic device using one or more light sensors.

5. A method for operating a voice trigger, the method being performed on an electronic device comprising one or more processors and memory storing instructions for execution by the one or more processors, the method comprising:

determining whether the electronic device is in a predetermined orientation, wherein the predetermined orientation corresponds to a display of the electronic device being horizontal and facing down;

upon determining that the electronic device is in the predetermined orientation, generating an instruction to enable a predetermined mode of the voice trigger, wherein the predetermined mode is a standby mode; and

enabling the predetermined mode of the voice trigger based on the instruction.

6. The method of claim 5, further comprising:

determining whether the electronic device is in a second predetermined orientation, wherein the second predetermined orientation corresponds to the display of the electronic device being horizontal and right side up; and

enabling a second predetermined mode of the voice trigger upon determining that the electronic device is in the second predetermined orientation, wherein the second predetermined mode is a listening mode.

7. The method of claim 5, wherein a trigger phrase associated with the voice trigger is based on the predetermined orientation.

8. The method of claim 5, wherein determining whether the electronic device is in the predetermined orientation comprises: determining an orientation of the electronic device using one or more light sensors.

9. An electronic device, comprising:

one or more processors;

a memory; and

one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:

determining whether the electronic device is in a predetermined orientation, wherein the predetermined orientation corresponds to a display of the electronic device being horizontal and facing down;

upon determining that the electronic device is in the predetermined orientation, generating an instruction to enable a predetermined mode of a voice trigger of the electronic device, wherein the predetermined mode is a standby mode; and

enabling the predetermined mode of the voice trigger based on the instruction.

10. The electronic device of claim 9, wherein the one or more programs further comprise instructions for:

determining whether the electronic device is in a second predetermined orientation, wherein the second predetermined orientation corresponds to the display of the electronic device being horizontal and right side up; and

enabling a second predetermined mode of the voice trigger upon determining that the electronic device is in the second predetermined orientation, wherein the second predetermined mode is a listening mode.

11. The electronic device of claim 9, wherein a trigger phrase associated with the voice trigger is based on the predetermined orientation.

12. The electronic device of claim 9, wherein determining whether the electronic device is in the predetermined orientation comprises: determining an orientation of the electronic device using one or more light sensors.

Technical Field

Implementations disclosed herein relate generally to digital assistants and, more particularly, to a method and system for voice triggers for a digital assistant.

Background

Recently, voice-based digital assistants, such as Apple's SIRI, have been introduced into the market to handle various tasks such as web searching and navigation. One advantage of such a voice-based digital assistant is that a user can interact with the device in a hands-free manner without touching or even viewing the device. Hands-free operation may be particularly advantageous in situations where a person is unable or not permitted to physically manipulate the device, such as when driving a car. However, to launch a voice-based assistant, the user typically must press a button or select an icon on the touch screen. This tactile input detracts from the hands-free experience. It would therefore be advantageous to provide a method and system for enabling a voice-based digital assistant (or other voice-based service) using speech inputs or signals rather than tactile inputs.

Enabling a voice-based assistant using speech input requires monitoring an audio channel to detect speech input. This monitoring consumes power, which is a limited resource on a handheld or portable device that relies on a battery and on which such voice-based digital assistants are often run. It would therefore be advantageous to provide an energy efficient voice trigger that can be used to initiate voice-based services on a device.

Disclosure of Invention

Thus, there is a need for a low-power voice trigger that can provide "always-listening" voice trigger functionality without unduly consuming limited power resources. The following detailed description provides systems and methods for initiating a voice-based assistant using a voice trigger located on an electronic device. Interaction with a voice-based digital assistant (or other voice-based service, such as a voice-to-text transcription service) often begins when a user presses an affordance (e.g., a button or icon) on the device to enable the digital assistant, after which the device provides the user with some indication that the digital assistant is active and listening, such as a light, a sound (e.g., a beep), or a vocalized output (e.g., "what can I do for you?"). As described herein, a voice trigger may also be implemented such that it is enabled in response to a particular predetermined word, phrase, or sound, and without physical interaction by the user. For example, a user can enable a SIRI digital assistant on an IPHONE (both provided by Apple Inc., the assignee of the present application) by speaking the phrase "hey, SIRI". In response, the device outputs a beep, sound, or speech output (e.g., "what can I do for you?") indicating to the user that the listening mode is active. Thus, a user may initiate an interaction with the digital assistant without having to physically touch the device that provides the digital assistant functionality.

One technique for initiating a voice-based service through a voice trigger is for the voice-based service to continuously listen for a predetermined trigger word, phrase, or sound (any of which may be referred to herein as a "trigger sound"). However, continuous operation of the voice-based service (e.g., the voice-based digital assistant) requires significant audio processing and battery power. To reduce the power consumed by providing voice activation functionality, several techniques may be employed. In some implementations, the main processor (i.e., the "application processor") of the electronic device remains in a low-power or no-power state, while one or more sound detectors that use lower power (e.g., because they do not depend on the application processor) remain active. (An application processor or any other processor, program, or module may be described as inactive or in a standby mode when it is in a low-power or no-power state.) For example, a low-power sound detector is used to monitor an audio channel for a trigger sound even when the application processor is inactive. This sound detector is sometimes referred to herein as a trigger sound detector. In some implementations, it is configured to detect particular sounds, phonemes, and/or words. The trigger sound detector (including its hardware and/or software components) is designed to recognize specific words, sounds, or phrases, but is generally not capable of providing full speech-to-text functionality, and is not optimized for that task, which requires greater computational and power resources. Thus, in some implementations, the trigger sound detector recognizes whether a sound input includes a predefined pattern (e.g., a sonic pattern matching the words "hey, SIRI"), but it cannot (or is not configured to) convert the sound input to text or recognize a significant number of other words. Upon detection of the trigger sound, the digital assistant is brought out of the standby mode so that the user can provide a voice command.
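
The pattern-matching step can be pictured with a small sketch. The following Python is illustrative only (the patent does not disclose an algorithm): it compares incoming feature frames against a stored template of the trigger phrase using dynamic time warping, the kind of lightweight, vocabulary-free matching a trigger sound detector could perform without full speech-to-text. The feature representation, threshold, and function names are assumptions.

    # Illustrative sketch (not from the patent): match candidate feature frames
    # against a stored template of the trigger phrase with dynamic time warping.
    import numpy as np

    def dtw_distance(template: np.ndarray, candidate: np.ndarray) -> float:
        """Classic DTW over per-frame feature vectors (rows = frames)."""
        n, m = len(template), len(candidate)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(template[i - 1] - candidate[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[n, m] / (n + m)  # length-normalized alignment cost

    def matches_trigger(candidate_frames: np.ndarray,
                        reference_frames: np.ndarray,
                        threshold: float = 0.5) -> bool:
        """Return True when the candidate is close enough to the stored trigger."""
        return dtw_distance(reference_frames, candidate_frames) < threshold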

In some implementations, the trigger sound detector is configured to detect several different trigger sounds, such as a set of words, phrases, sounds, and/or combinations thereof. The user can then use any of those sounds to initiate the voice-based service. For example, a voice trigger is preconfigured to respond to the phrases "hey, SIRI", "wake up, SIRI", "invoke my digital assistant", or "hello, HAL, do you hear me, HAL?". In some implementations, the user must select one of the preconfigured trigger sounds as the sole trigger sound. In some implementations, the user selects a subset of the preconfigured trigger sounds, so that the user can initiate the voice-based service with different trigger sounds. In some implementations, all of the preconfigured trigger sounds remain valid trigger sounds.

In some implementations, another sound detector is used so that even the trigger sound detector can remain in a low-power or no-power mode for most of the time. For example, a different type of sound detector (e.g., one that uses less power than the trigger sound detector) monitors the audio channel to determine whether the sound input corresponds to a certain type of sound. Sounds are categorized as different "types" based on certain identifiable characteristics of the sound. For example, sounds that belong to the "human voice" type have certain spectral content, periodicity, fundamental frequency, and so on. Other types of sounds (e.g., whistles, hand claps, etc.) have different characteristics. Sounds of different types are identified using the audio and/or signal processing techniques described herein. This sound detector is sometimes referred to herein as a "sound type detector". For example, if the predetermined trigger phrase is "hey, SIRI," the sound type detector determines whether the input is likely to correspond to a human voice. If the trigger sound is a non-speech sound, such as a whistle, the sound type detector determines whether the sound input is likely to correspond to a whistle. When the appropriate sound type is detected, the sound type detector activates the trigger sound detector to further process and/or analyze the sound. And because the sound type detector requires less power than the trigger sound detector (e.g., because it uses circuitry with lower power requirements and/or more efficient audio processing algorithms than the trigger sound detector), the voice trigger functionality consumes less power than if the trigger sound detector were used alone.
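
As a rough illustration of what such a "human voice" check might look like, the sketch below classifies a short audio frame using its energy, zero-crossing rate, and an autocorrelation-based periodicity estimate in the typical speech fundamental-frequency range. It is a minimal sketch under assumed conventions (audio normalized to [-1, 1], 16 kHz sampling, frames of at least about 20 ms); all thresholds are illustrative, not values from the patent.

    # Illustrative sketch: decide whether a frame of audio is likely human voice.
    import numpy as np

    def looks_like_voice(frame: np.ndarray, sample_rate: int = 16000) -> bool:
        frame = frame - np.mean(frame)
        if np.sqrt(np.mean(frame ** 2)) < 1e-3:        # too quiet to be speech
            return False
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
        if zcr > 0.35:                                  # noise-like / fricative-only
            return False
        # Autocorrelation peak inside the speech fundamental range (~75-300 Hz).
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = sample_rate // 300, sample_rate // 75
        peak_lag = lo + int(np.argmax(ac[lo:hi]))
        periodicity = ac[peak_lag] / (ac[0] + 1e-9)
        return bool(periodicity > 0.3)                  # periodic enough to be voiced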

In some implementations, yet another sound detector is used so that both the sound type detector and the trigger sound detector described above can remain in a low-power or no-power mode for most of the time. For example, a sound detector that uses less power than the sound type detector monitors the audio channel to determine whether the sound input satisfies a predetermined condition, such as a magnitude (e.g., volume) threshold. This sound detector is sometimes referred to herein as a "noise detector". When the noise detector detects a sound that satisfies the predetermined threshold, the noise detector activates the sound type detector to further process and/or analyze the sound. And because the noise detector requires less power than either the sound type detector or the trigger sound detector (e.g., because it uses circuitry with lower power requirements and/or more efficient audio processing algorithms), the voice trigger functionality consumes less power than a combination of the sound type detector and the trigger sound detector operating without the noise detector.
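
A noise detector of this kind can be extremely simple. The sketch below is a hedged illustration, not the patent's implementation: a time-domain root-mean-square check against a fixed magnitude threshold, with the threshold value assumed (audio normalized to [-1, 1]).

    # Illustrative sketch: the simplest "noise detector" -- a time-domain
    # magnitude (volume) check against a wake threshold.
    import numpy as np

    def exceeds_magnitude_threshold(frame: np.ndarray, threshold: float = 0.02) -> bool:
        """Return True when the frame's RMS amplitude crosses the wake threshold."""
        rms = float(np.sqrt(np.mean(np.square(frame))))
        return rms >= threshold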

In some implementations, any one or more of the sound detectors described above operate according to a duty cycle, in which they cycle between "on" and "off" states. This further helps to reduce the power consumption of the voice trigger. For example, in some implementations, the noise detector is "on" (i.e., actively monitoring the audio channel) for 10 milliseconds and "off" for the subsequent 90 milliseconds. In this way, the noise detector is "off" 90% of the time while still providing effectively continuous noise detection. In some implementations, the on and off durations of the sound detectors are selected so that the detectors are activated while the trigger sound is still being input. For example, for the trigger phrase "hey, SIRI," the sound detectors can be configured so that, no matter where within the duty cycle(s) the trigger phrase begins, the trigger sound detector is activated in time to analyze a sufficient amount of the input. For example, the trigger sound detector is activated in time to receive, process, and analyze enough of the sound "hey, SIRI" to determine that the sound matches the trigger phrase. In some implementations, the sound input is stored in memory as it is received and is passed to an upstream detector so that a larger portion of the sound input can be analyzed. Thus, even if the trigger sound detector is not activated until after the trigger phrase has been uttered, it can still analyze the entire recorded trigger phrase.
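
The duty-cycle idea can be pictured as a loop that listens only during the "on" window of each cycle. The code below is illustrative; read_audio_frame and handle_frame are hypothetical stand-ins for the platform's audio capture call and the downstream detector, and the 10 ms / 90 ms split mirrors the example above.

    # Illustrative sketch of duty-cycled monitoring (10 ms on / 90 ms off).
    import time

    ON_MS, OFF_MS = 10, 90

    def duty_cycled_monitor(read_audio_frame, handle_frame):
        while True:
            deadline = time.monotonic() + ON_MS / 1000.0
            while time.monotonic() < deadline:          # "on" window: listen
                handle_frame(read_audio_frame())
            time.sleep(OFF_MS / 1000.0)                 # "off" window: sleep / low power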

Some implementations provide a method for operating a voice trigger. The method is performed on an electronic device comprising one or more processors and a memory storing instructions for execution by the one or more processors. The method includes receiving a sound input. The method also includes determining whether at least a portion of the sound input corresponds to a predetermined sound type. The method also includes determining whether the sound input includes predetermined content upon determining that at least a portion of the sound input corresponds to the predetermined type. The method also includes initiating a voice-based service upon determining that the sound input includes predetermined content. In some implementations, the voice-based service is a voice-based digital assistant. In some implementations, the voice-based service is a dictation service.

In some implementations, determining whether the sound input corresponds to a predetermined sound type is performed by a first sound detector, and determining whether the sound input includes predetermined content is performed by a second sound detector. In some implementations, the first sound detector consumes less power when operating than the second sound detector. In some implementations, the first sound detector performs a frequency domain analysis of the sound input. In some implementations, determining whether the sound input corresponds to a predetermined sound type is performed upon determining that the sound input satisfies a predetermined condition (e.g., determined by a third sound detector described below).

In some implementations, the first sound detector periodically monitors the audio channel according to a duty cycle. In some implementations, the duty cycle includes an on time of about 20 milliseconds and an off time of about 100 milliseconds.

In some implementations, the predetermined type is voice and the predetermined content is one or more words. In some implementations, determining whether at least a portion of the sound input corresponds to a predetermined sound type includes determining whether at least a portion of the sound input includes a frequency characteristic of a human voice.

In some implementations, the second sound detector is activated in response to the first sound detector determining that the sound input corresponds to a predetermined type. In some implementations, the second sound detector operates for at least a predetermined amount of time after the first sound detector determines that the sound input corresponds to the predetermined type. In some implementations, the predetermined amount of time corresponds to a duration of the predetermined content.

In some implementations, the predetermined content is one or more predetermined phonemes. In some implementations, the one or more predetermined phonemes constitute at least one word.

In some implementations, the method includes determining whether the sound input satisfies a predetermined condition before determining whether the sound input corresponds to a predetermined sound type. In some implementations, the predetermined condition is a magnitude threshold. In some implementations, determining whether the sound input satisfies the predetermined condition is performed by a third sound detector, wherein the third sound detector consumes less power when in operation than the first sound detector. In some implementations, the third sound detector periodically monitors the audio channel according to a duty cycle. In some implementations, the duty cycle includes an on time of about 20 milliseconds and an off time of about 500 milliseconds. In some implementations, the third sound detector performs a time domain analysis of the sound input.

In some implementations, the method includes storing at least a portion of the sound input in a memory and providing the portion of the sound input to the voice-based service upon initiation of the voice-based service. In some implementations, a portion of the sound input is stored in memory using direct memory access.
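
One way to picture this buffering is a bounded pre-roll buffer that keeps the most recent audio frames and hands them to the voice-based service when it starts. The sketch below uses a Python deque as a stand-in for the DMA-backed memory buffer described above; the capacity is an assumed value.

    # Illustrative sketch: keep recent audio so the voice-based service can be
    # given the sound captured before (and during) the trigger.
    from collections import deque

    class PreRollBuffer:
        def __init__(self, max_frames: int = 200):       # capacity is assumed
            self._frames = deque(maxlen=max_frames)

        def append(self, frame) -> None:
            self._frames.append(frame)

        def hand_off(self):
            """Return buffered audio to the newly started voice-based service."""
            frames = list(self._frames)
            self._frames.clear()
            return frames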

In some implementations, the method includes determining whether the sound input corresponds to speech of a particular user. In some implementations, the voice-based service is initiated upon determining that the sound input includes predetermined content and that the sound input corresponds to a particular user's voice. In some implementations, the voice-based service is initiated in the limited-access mode upon determining that the sound input includes predetermined content and that the sound input does not correspond to a voice of a particular user. In some implementations, the method includes outputting a voice prompt including a name of the particular user upon determining that the sound input corresponds to voice of the particular user.
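
The gating logic in this paragraph can be sketched as follows. The helper callables has_trigger_content and speaker_score, and the 0.8 verification threshold, are hypothetical placeholders for the content check and the speaker-verification step; the sketch only shows how their results could combine to choose between full and limited access.

    # Illustrative sketch: start the service normally only when both the trigger
    # content and the particular user's voice are recognized.
    def start_service(sound_input, has_trigger_content, speaker_score,
                      speaker_threshold: float = 0.8):
        if not has_trigger_content(sound_input):
            return None                                  # stay in standby
        if speaker_score(sound_input) >= speaker_threshold:
            return "full_access"                         # recognized user's voice
        return "limited_access"                          # trigger ok, unknown voice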

In some implementations, determining whether the sound input includes the predetermined content includes comparing a representation of the sound input to a reference representation and determining that the sound input includes the predetermined content if the representation of the sound input matches the reference representation. In some implementations, a match is determined if the representation of the sound input matches the reference representation with a predetermined confidence. In some implementations, the method includes receiving a plurality of sound inputs, the plurality of sound inputs including the sound input; and iteratively adjusting the reference representation using a respective sound input of the plurality of sound inputs in response to determining that the respective sound input includes the predetermined content.
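
A minimal sketch of such reference matching and adaptation is shown below, assuming the sound input has already been reduced to a fixed-length vector representation. The cosine-similarity score, confidence threshold, and learning rate are assumptions made for illustration; the patent does not specify them.

    # Illustrative sketch: match against a stored reference and, on a confident
    # match, nudge the reference toward the new input so it adapts over time.
    import numpy as np

    class TriggerReference:
        def __init__(self, reference: np.ndarray, confidence: float = 0.85,
                     learning_rate: float = 0.1):
            self.reference = reference.astype(float)
            self.confidence = confidence
            self.learning_rate = learning_rate

        def matches(self, representation: np.ndarray) -> bool:
            a, b = self.reference, representation.astype(float)
            score = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
            return score >= self.confidence

        def adapt(self, representation: np.ndarray) -> None:
            """Iteratively adjust the reference using a matching input."""
            if self.matches(representation):
                self.reference = ((1 - self.learning_rate) * self.reference
                                  + self.learning_rate * representation)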

In some implementations, the method includes determining whether the electronic device is in a predetermined orientation, and enabling a predetermined mode of the voice trigger upon determining that the electronic device is in the predetermined orientation. In some implementations, the predetermined orientation corresponds to a substantially horizontal and face-down display screen of the device, and the predetermined mode is a standby mode. In some implementations, the predetermined orientation corresponds to a substantially horizontal and right-side-up display screen of the device, and the predetermined mode is a listening mode.
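
As an illustration of the orientation rule, the sketch below maps a 3-axis accelerometer reading to a voice trigger mode, assuming the z axis points out of the display and readings are in g. The ±0.8 g cutoff and the function name are assumptions.

    # Illustrative sketch: face-down -> standby, face-up -> listening.
    def voice_trigger_mode(accel_x: float, accel_y: float, accel_z: float) -> str:
        if accel_z <= -0.8:        # display roughly horizontal, facing down
            return "standby"
        if accel_z >= 0.8:         # display roughly horizontal, facing up
            return "listening"
        return "unchanged"         # any other orientation: leave the mode as-is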

Some implementations provide a method for operating a voice trigger. The method is performed on an electronic device comprising one or more processors and a memory storing instructions for execution by the one or more processors. The method includes operating a voice trigger in a first mode. The method also includes determining whether the electronic device is in a substantially enclosed space by detecting that one or more of a microphone and a camera of the electronic device is occluded. The method also includes switching the voice trigger to a second mode when it is determined that the electronic device is in the substantially enclosed space. In some implementations, the second mode is a standby mode.
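
A hedged sketch of the enclosed-space rule: given hypothetical occlusion flags for the microphone and camera (however the device derives them), the voice trigger is switched to standby when the device appears to be enclosed.

    # Illustrative sketch: switch to the second (standby) mode when the device
    # seems to be in a substantially enclosed space.
    def update_mode_for_enclosure(current_mode: str,
                                  microphone_occluded: bool,
                                  camera_occluded: bool) -> str:
        if microphone_occluded or camera_occluded:       # likely in a pocket or bag
            return "standby"
        return current_mode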

Some implementations provide a method for operating a voice trigger. The method is performed on an electronic device comprising one or more processors and a memory storing instructions for execution by the one or more processors. The method includes determining whether the electronic device is in a predetermined orientation, and enabling a predetermined mode of the voice trigger upon determining that the electronic device is in the predetermined orientation. In some implementations, the predetermined orientation corresponds to a substantially horizontal and face-down display screen of the device, and the predetermined mode is a standby mode. In some implementations, the predetermined orientation corresponds to a substantially horizontal and right-side-up display screen of the device, and the predetermined mode is a listening mode.

According to some implementations, an electronic device includes a sound receiving unit configured to receive a sound input and a processing unit coupled to the sound receiving unit. The processing unit is configured to determine whether at least a portion of the sound input corresponds to a predetermined sound type; upon determining that at least a portion of the sound input corresponds to a predetermined type, determining whether the sound input includes predetermined content; and initiating a voice-based service upon a determination that the sound input includes the predetermined content. In some implementations, the processing unit is further configured to, in determining whether the sound input corresponds to a predetermined sound type, determine whether the sound input satisfies a predetermined condition. In some implementations, the processing unit is further configured to determine whether the sound input corresponds to speech of a particular user.

According to some implementations, an electronic device includes a voice trigger unit configured to operate a voice trigger in a first mode of a plurality of modes; and a processing unit coupled to the voice trigger unit. In some implementations, the processing unit is configured to: determine whether the electronic device is in a substantially enclosed space by detecting that one or more of a microphone and a camera of the electronic device is occluded; and switch the voice trigger to a second mode upon determining that the electronic device is in the substantially enclosed space. In some implementations, the processing unit is configured to determine whether the electronic device is in a predetermined orientation; and upon determining that the electronic device is in the predetermined orientation, enable a predetermined mode of the voice trigger.

According to some implementations, a computer-readable storage medium (e.g., a non-transitory computer-readable storage medium) is provided that stores one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing any of the methods described herein.

According to some implementations, an electronic device (e.g., a portable electronic device) is provided that includes means for performing any of the methods described herein.

According to some implementations, an electronic device (e.g., a portable electronic device) is provided that includes a processing unit configured to perform any of the methods described herein.

According to some implementations, an electronic device (e.g., a portable electronic device) is provided that includes one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for performing any of the methods described herein.

According to some implementations, there is provided an information processing apparatus for use in an electronic device, the information processing apparatus comprising means for performing any of the methods described herein.

Drawings

FIG. 1 is a block diagram illustrating an environment in which a digital assistant operates according to some implementations.

FIG. 2 is a block diagram illustrating a digital assistant client system in accordance with some implementations.

FIG. 3A is a block diagram illustrating a standalone digital assistant system or digital assistant server system, according to some implementations.

Fig. 3B is a block diagram illustrating functionality of the digital assistant shown in fig. 3A according to some implementations.

FIG. 3C is a network diagram illustrating a portion of an ontology according to some implementations.

FIG. 4 is a block diagram illustrating components of a voice trigger system according to some implementations.

Fig. 5-7 are flow diagrams illustrating methods for operating a voice trigger system according to some implementations.

Fig. 8-9 are functional block diagrams of electronic devices according to some embodiments.

Like reference numerals refer to corresponding parts throughout the several views of the drawings.

Detailed Description

Fig. 1 is a block diagram of an operating environment 100 of a digital assistant in accordance with some implementations. The terms "digital assistant," "virtual assistant," "intelligent automated assistant," "voice-based digital assistant," and "automatic digital assistant" refer to any information processing system that interprets natural language input in spoken and/or textual form to infer user intent (e.g., identify a task type corresponding to the natural language input) and performs actions based on the inferred user intent (e.g., performs a task corresponding to the identified task type). For example, to act on the inferred user intent, the system may perform one or more of the following operations: identifying a task flow with steps and parameters designed to accomplish the inferred user intent (e.g., identifying a task type); inputting specific requirements from the inferred user intent into the task flow; executing the task flow by invoking a program, method, service, API, or the like (e.g., sending a request to a service provider); and generating an output response to the user in audible (e.g., speech) and/or visual form.

In particular, once started, the digital assistant system is capable of accepting user requests at least partially in the form of natural language commands, requests, statements, narratives, and/or inquiries. Typically, a user request seeks either an informational answer from the digital assistant system or performance of a task by the digital assistant system. A satisfactory response to a user request is generally to provide the requested informational answer, to perform the requested task, or a combination of the two. For example, the user may ask the digital assistant system a question such as "Where am I right now?" Based on the user's current location, the digital assistant may answer, "You are in Central Park, near the west gate." The user may also request the performance of a task, for example by stating, "Please invite my friends to my girlfriend's birthday party." In response, the digital assistant may acknowledge the request by generating the speech output "OK, right away," and then send an appropriate calendar invitation from the user's email address to each of the user's friends listed in the user's electronic address book or contact list. There are many other ways to interact with a digital assistant to request information or the performance of various tasks. In addition to providing spoken responses and taking programmed actions, the digital assistant may provide responses in other visual or audio forms (e.g., as text, alerts, music, video, animation, etc.).

As shown in fig. 1, in some implementations, the digital assistant system is implemented according to a client-server model. The digital assistant system includes a client-side portion (e.g., 102a and 102b) (hereinafter "Digital Assistant (DA) client 102") executing on a user device (e.g., 104a and 104b), and a server-side portion 106 (hereinafter "Digital Assistant (DA) server 106") executing on a server system 108. The DA client 102 communicates with the DA server 106 over one or more networks 110. The DA client 102 provides client-side functions such as user-oriented input and output processing and communication with the DA server 106. The DA server 106 provides server-side functionality for any number of DA clients 102, each of which resides on a respective user device 104 (also referred to as a client device or electronic device).

In some implementations, DA server 106 includes a client-facing I/O interface 112, one or more processing modules 114, data and models 116, an I/O interface 118 to external services, a photo and tag database 130, and a photo-tag module 132. The client-facing I/O interface facilitates client-facing input and output processing by the digital assistant server 106. The one or more processing modules 114 utilize the data and models 116 to determine the user's intent based on natural language input and to perform task execution based on the inferred user intent. The photo and tag database 130 stores fingerprints of digital photos, and optionally the digital photos themselves, as well as tags associated with the digital photos. The photo-tag module 132 creates tags, stores tags and/or fingerprints associated with photos, automatically tags photos, and links tags to locations within photos.

In some implementations, the DA Server 106 communicates with external services 120 (e.g., one or more navigation services 122-1, one or more messaging services 122-2, one or more information services 122-3, a calendar service 122-4, a telephony service 122-5, one or more photo services 122-6, etc.) over one or more networks 110 to accomplish tasks or collect information. An I/O interface 118 to external services facilitates such communication.

Examples of the user device 104 include, but are not limited to, handheld computers, personal digital assistants (PDAs), tablet computers, laptop computers, desktop computers, cellular telephones, smartphones, enhanced general packet radio service (EGPRS) mobile phones, media players, navigation devices, game controllers, televisions, remote controls, a combination of any two or more of these, or any other suitable data processing device. More details regarding the user device 104 are provided with reference to the exemplary user device 104 shown in fig. 2.

Examples of the one or more communication networks 110 include a Local Area Network (LAN) and a Wide Area Network (WAN) such as the Internet. The one or more communication networks 110 may be implemented using any known network protocol, including various wired or wireless protocols such as Ethernet, Universal Serial Bus (USB), FireWire, Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi, Voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.

The server system 108 may be implemented on one or more stand-alone data processing devices or on a distributed network of computers. In some implementations, the server system 108 also employs various virtual devices and/or services of third-party service providers (e.g., third-party cloud service providers) to provide the underlying computing resources and/or infrastructure resources of the server system 108.

Although the digital assistant system shown in fig. 1 includes both a client-side portion (e.g., DA client 102) and a server-side portion (e.g., DA server 106), in some implementations, the digital assistant system refers to only the server-side portion (e.g., DA server 106). In some implementations, the functionality of the digital assistant may be implemented as a standalone application installed on the user device. Moreover, the division of functionality between the client portion and the server portion of the digital assistant may vary in different implementations. For example, in some implementations, DA client 102 is a thin client that provides only user-oriented input and output processing functions and delegates all other functions of the digital assistant to DA server 106. In some other implementations, the DA client 102 is configured to perform or assist one or more functions of the DA server 106.

Fig. 2 is a block diagram of a user device 104 according to some implementations. The user device 104 includes a memory interface 202, one or more processors 204, and a peripheral interface 206. The various components in the user equipment 104 are coupled by one or more communication buses or signal lines. The user device 104 includes various sensors, subsystems, and peripherals coupled to the peripheral interface 206. The sensors, subsystems, and peripherals collect information and/or facilitate various functions of the user device 104.

For example, in some implementations, a motion sensor 210 (e.g., an accelerometer), a light sensor 212, a GPS receiver 213, a temperature sensor, and a proximity sensor 214 are coupled to the peripheral interface 206 to facilitate orientation, lighting, and proximity sensing functions. In some implementations, other sensors 216, such as biosensors, barometers, etc., are connected to the peripheral interface 206 to facilitate related functions.

In some implementations, the user device 104 includes a camera subsystem 220 coupled to the peripheral interface 206. In some implementations, the optical sensor 222 of the camera subsystem 220 facilitates camera functions, such as taking pictures and recording video clips. In some implementations, the user device 104 includes one or more wired and/or wireless communication subsystems 224 that provide communication functions. Communication subsystem 224 typically includes various communication ports, radio frequency receivers and transmitters, and/or optical (e.g., infrared) receivers and transmitters. In some implementations, the user device 104 includes an audio subsystem 226 coupled to one or more speakers 228 and one or more microphones 230 to facilitate voice-enabled functions, such as voice recognition, voice replication, digital recording, and telephony functions. In some specific implementations, the audio subsystem 226 is coupled to the voice trigger system 400. In some implementations, the voice trigger system 400 and/or the audio subsystem 226 include low-power audio circuitry and/or programming (i.e., including hardware and/or software) for receiving and/or analyzing sound input, including, for example, one or more analog-to-digital converters, Digital Signal Processors (DSPs), sound detectors, memory buffers, codecs, and so forth. In some implementations, the low power audio circuitry (alone or in conjunction with other components of the user device 104) provides voice (or sound) triggering functionality of one or more aspects of the user device 104, such as a voice-based digital assistant or other voice-based service. In some implementations, the low power audio circuitry provides voice-triggered functionality even when other components of the user device 104, such as the one or more processors 204, I/O subsystem 240, memory 250, etc., are off and/or in standby mode. The voice trigger system 400 is described in further detail with reference to fig. 4.
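
To make the relationship between the detectors concrete, the sketch below shows one assumed way the stages described in this disclosure could be cascaded inside the voice trigger system 400: each stage wakes the next, more expensive stage only when its own check passes. The detector callables are placeholders, not APIs of any real audio framework.

    # Illustrative sketch (assumed structure, not the patent's implementation) of
    # cascading the noise, sound type, and trigger sound detectors.
    def voice_trigger_pipeline(frame, noise_detector, sound_type_detector,
                               trigger_sound_detector, start_voice_based_service):
        if not noise_detector(frame):            # cheapest: magnitude check
            return
        if not sound_type_detector(frame):       # mid-cost: is this a human voice?
            return
        if trigger_sound_detector(frame):        # most expensive: phrase match
            start_voice_based_service()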

In some implementations, the I/O subsystem 240 is also coupled to the peripheral interface 206. In some implementations, the user device 104 includes a touchscreen 246, and the I/O subsystem 240 includes a touchscreen controller 242 coupled to the touchscreen 246. When the user device 104 includes the touchscreen 246 and the touchscreen controller 242, the touchscreen 246 and the touchscreen controller 242 are typically configured to detect contact and movement or breaking thereof using, for example, any of a variety of touch sensitivity technologies, such as capacitive, resistive, infrared, and surface acoustic wave technologies, proximity sensor arrays, and the like. In some implementations, the user device 104 includes a display without a touch-sensitive surface. In some implementations, the user device 104 includes a separate touch-sensitive surface. In some implementations, the user device 104 includes one or more other input controllers 244. When the user device 104 includes one or more other input controllers 244, the one or more other input controllers 244 are typically coupled to other input/control devices 248, such as one or more buttons, rocker switches, thumbwheels, infrared ports, USB ports, and/or a pointing device such as a stylus.

The memory interface 202 is coupled to the memory 250. In some implementations, the memory 250 includes a non-transitory computer-readable medium, such as high-speed random access memory and/or non-volatile memory (e.g., one or more magnetic disk storage devices, one or more flash memory devices, one or more optical storage devices, and/or other non-volatile solid-state memory devices). In some implementations, the memory 250 stores an operating system 252, a communication module 254, a graphical user interface module 256, a sensor processing module 258, a telephony module 260, and application programs 262, as well as subsets or supersets thereof. The operating system 252 includes instructions for handling basic system services and for performing hardware-dependent tasks. The communication module 254 facilitates communication with one or more additional devices, one or more computers, and/or one or more servers. The graphical user interface module 256 facilitates graphical user interface processing. The sensor processing module 258 facilitates sensor-related processing and functions (e.g., processing voice input received with the one or more microphones 230). The telephony module 260 facilitates telephony-related processes and functions. The application programs 262 facilitate various functionalities of user applications, such as electronic messaging, web browsing, media processing, navigation, imaging, and/or other processes and functions. In some implementations, the user device 104 stores in the memory 250 one or more software applications 270-1 and 270-2, each associated with at least one of the external service providers.

As described above, in some implementations, the memory 250 also stores client-side digital assistant instructions (e.g., in the digital assistant client module 264) as well as various user data 266 (e.g., user-specific vocabulary data, preference data, and/or other data such as a user's electronic address book or contact list, to-do list, shopping list, etc.) to provide client-side functionality of the digital assistant.

In various implementations, the digital assistant client module 264 can accept voice input, text input, touch input, and/or gesture input through various user interfaces of the user device 104 (e.g., the I/O subsystem 244). The digital assistant client module 264 can also provide output in audio, visual, and/or tactile forms. For example, the output may be provided as voice, sound, alarm, text message, menu, graphics, video, animation, vibration, and/or a combination of two or more of the foregoing. During operation, the digital assistant client module 264 uses the communication subsystem 224 to communicate with a digital assistant server (e.g., the digital assistant server 106, fig. 1).

In some implementations, the digital assistant client module 264 utilizes various sensors, subsystems, and peripherals to gather additional information from the surrounding environment of the user device 104 to establish a context associated with the user input. In some implementations, the digital assistant client module 264 provides the contextual information or a subset thereof along with the user input to a digital assistant server (e.g., the digital assistant server 106, fig. 1) to help infer the user's intent.

In some implementations, contextual information that may accompany the user input includes sensor information, such as lighting, ambient noise, ambient temperature, images or video of the surrounding environment, and so forth. In some implementations, the contextual information also includes physical states of the device, such as device orientation, device location, device temperature, power level, velocity, acceleration, motion pattern, cellular signal strength, and the like. In some implementations, information related to the software state of the user device 104, e.g., running processes, installed programs, past and current network activities, background services, error logs, resource usage, etc., is also provided to a digital assistant server (e.g., the digital assistant server 106, fig. 1) as contextual information associated with the user input.

In some implementations, the DA client module 264 selectively provides information (e.g., at least a portion of the user data 266) stored on the user device 104 in response to a request from the digital assistant server. In some implementations, the digital assistant client module 264 also elicits additional input from the user via a natural language dialog or other user interface upon request by the digital assistant server 106 (fig. 1). The digital assistant client module 264 communicates additional input to the digital assistant server 106 to help the digital assistant server 106 make intent inferences and/or satisfy the user intent expressed in the user request.

In some implementations, memory 250 may include additional instructions or fewer instructions. Further, various functions of the user device 104 may be implemented in hardware and/or in firmware (including in one or more signal processing integrated circuits and/or application specific integrated circuits), and thus, the user device 104 need not include all of the modules and applications shown in fig. 2.

Fig. 3A is a block diagram of an exemplary digital assistant system 300 (also referred to as a digital assistant) according to some implementations. In some implementations, the digital assistant system 300 is implemented on a stand-alone computer system. In some implementations, the digital assistant system 300 is distributed across multiple computers. In some implementations, some of the modules and functionality of the digital assistant are divided into a server portion and a client portion, where the client portion resides on a user device (e.g., user device 104) and communicates with the server portion (e.g., server system 108) over one or more networks, for example as shown in fig. 1. In some implementations, the digital assistant system 300 is an embodiment of the server system 108 (and/or the digital assistant server 106) shown in fig. 1. In some implementations, the digital assistant system 300 is implemented in a user device (e.g., user device 104, fig. 1), thereby eliminating the need for a client-server system. It should be noted that the digital assistant system 300 is only one example of a digital assistant system, and that the digital assistant system 300 may have more or fewer components than shown, may combine two or more components, or may have a different configuration or layout of components. The various components shown in fig. 3A may be implemented in hardware, software, firmware (including one or more signal processing integrated circuits and/or application specific integrated circuits), or a combination thereof.

The digital assistant system 300 includes memory 302, one or more processors 304, input/output (I/O) interfaces 306, and a network communication interface 308. These components communicate with each other via one or more communication buses or signal lines 310.

In some implementations, the memory 302 includes non-transitory computer-readable media such as high-speed random access memory and/or non-volatile computer-readable storage media (e.g., one or more magnetic disk storage devices, one or more flash memory devices, one or more optical storage devices, and/or other non-volatile solid-state memory devices).

The I/O interface 306 couples input/output devices 316 of the digital assistant system 300, such as a display, a keypad, a touch screen, and a microphone to a user interface module 322. The I/O interface 306, in conjunction with the user interface module 322, receives user inputs (e.g., voice inputs, keyboard inputs, touch inputs, etc.) and processes those inputs accordingly. In some implementations, when the digital assistant is implemented on a standalone user device, the digital assistant system 300 includes any of the components and I/O interfaces and communication interfaces described with respect to the user device 104 in fig. 2 (e.g., the one or more microphones 230). In some implementations, the digital assistant system 300 represents the server portion of a digital assistant implementation and interacts with the user through the client-side portion located on a user device (e.g., the user device 104 shown in FIG. 2).

In some implementations, the network communication interface 308 includes one or more wired communication ports 312 and/or wireless transmit and receive circuitry 314. The one or more wired communication ports receive and transmit communication signals via one or more wired interfaces, such as ethernet, Universal Serial Bus (USB), firewire, and the like. The radio circuit 314 typically receives and transmits RF and/or optical signals from and to communication networks and other communication devices. The wireless communication may use any of a number of communication standards, protocols, and technologies, such as GSM, EDGE, CDMA, TDMA, Bluetooth, Wi-Fi, VoIP, Wi-MAX, or any other suitable communication protocol. Network communication interface 308 enables digital assistant system 300 to communicate with other devices via a network, such as the internet, an intranet, and/or a wireless network, such as a cellular telephone network, a wireless Local Area Network (LAN), and/or a Metropolitan Area Network (MAN).

In some implementations, the non-transitory computer-readable storage medium of the memory 302 stores programs, modules, instructions, and data structures including all or a subset of the following: an operating system 318, a communication module 320, a user interface module 322, one or more application programs 324, and a digital assistant module 326. The one or more processors 304 execute the programs, modules, and instructions and read data from, or write data to, the data structures.

An operating system 318 (e.g., Darwin, RTXC, LINUX, UNIX, OS X, iOS, WINDOWS, or embedded operating systems such as VxWorks) includes various software components and/or drivers for controlling and managing general system tasks (e.g., memory management, storage device control, power management, etc.) and facilitates communication between various hardware, firmware, and software components.

The communication module 320 facilitates communication between the digital assistant system 300 and other devices via the network communication interface 308. For example, the communication module 320 may communicate with the communication module 254 of the device 104 shown in fig. 2. The communication module 320 also includes various software components for processing data received by the wireless circuitry 314 and/or the wired communication port 312.

In some implementations, the user interface module 322 receives commands and/or input from a user via the I/O interface 306 (e.g., from a keyboard, touch screen, and/or microphone) and provides user interface objects on the display.

The application programs 324 include programs and/or modules configured to be executed by the one or more processors 304. For example, if the digital assistant system is implemented on a standalone user device, the applications 324 may include user applications, such as games, calendar applications, navigation applications, or mail applications. If the digital assistant system 300 is implemented on a server farm, the applications 324 may include, for example, a resource management application, a diagnostic application, or a scheduling application.

The memory 302 also stores a digital assistant module (or the server portion of a digital assistant) 326. In some implementations, the digital assistant module 326 includes the following sub-modules, or a subset or superset thereof: an input/output processing module 328, a Speech To Text (STT) processing module 330, a natural language processing module 332, a conversation flow processing module 334, a task flow processing module 336, a service processing module 338, and a photos module 132. Each of these processing modules has access to one or more of the following data and models of the digital assistant 326, or a subset or superset thereof: an ontology 360, a vocabulary index 344, user data 348, a classification module 349, a disambiguation module 350, task flow models 354, service models 356, a photo tagging module 358, a search module 360, and a local tag/photo store 362.

In some implementations, using the processing modules (e.g., the input/output processing module 328, the STT processing module 330, the natural language processing module 332, the conversation flow processing module 334, the task flow processing module 336, and/or the service processing module 338), data, and models implemented in the digital assistant module 326, the digital assistant system 300 performs at least some of the following operations: identifying a user's intent expressed in a natural language input received from the user; actively eliciting and obtaining the information needed to fully infer the user's intent (e.g., by disambiguating words, names, intentions, etc.); determining a task flow for fulfilling the inferred intent; and executing the task flow to fulfill the inferred intent. In some implementations, the digital assistant also takes appropriate action when a satisfactory response is not or cannot be provided to the user for various reasons.

In some implementations, as described below, the digital assistant system 300 identifies a user intent to tag the digital photograph from the natural language input and processes the natural language input to tag the digital photograph with the appropriate information. In some implementations, the digital assistant system 300 also performs other tasks related to photos, such as searching for digital photos using natural language input, automatically tagging photos, and so forth. As shown in fig. 3B, in some implementations, I/O processing module 328 interacts with a user through I/O device 316 in fig. 3A or with a user device (e.g., user device 104 in fig. 1) through network communication interface 308 in fig. 3A to obtain user input (e.g., voice input) and provide a response to the user input. The I/O processing module 328 optionally obtains contextual information associated with the user input from the user device along with or shortly after receiving the user input. Contextual information includes user-specific data, vocabulary, and/or preferences related to user input. In some implementations, the context information also includes software and hardware states of a device (e.g., user device 104 in fig. 1) at the time the user request is received, and/or information related to the user's surroundings at the time the user request is received. In some implementations, the I/O processing module 328 also sends follow-up questions to the user regarding the user's request and receives answers from the user. In some implementations, when a user request is received by the I/O processing module 328 and the user request includes voice input, the I/O processing module 328 forwards the voice input to a Speech To Text (STT) processing module 330 for speech to text conversion.

In some implementations, the speech to text processing module 330 receives speech input (e.g., a user utterance captured in a voice recording) through the I/O processing module 328. In some implementations, the speech to text processing module 330 uses various acoustic and language models to recognize the speech input as a sequence of phonemes, and ultimately as a sequence of words or symbols written in one or more languages. The speech to text processing module 330 is implemented using any suitable speech recognition techniques, acoustic models, and language models, such as hidden Markov models, Dynamic Time Warping (DTW) based speech recognition, and other statistical and/or analytical techniques. In some implementations, the speech to text processing can be performed at least in part by a third-party service or on the user's device. Once the speech to text processing module 330 obtains the result of the speech to text processing (e.g., a sequence of words or symbols), it passes the result to the natural language processing module 332 for intent inference. The natural language processing module 332 ("natural language processor") of the digital assistant 326 takes the sequence of words or symbols ("symbol sequence") generated by the speech to text processing module 330 and attempts to associate the symbol sequence with one or more "actionable intents" recognized by the digital assistant. As used herein, an "actionable intent" represents a task that can be performed by the digital assistant 326 and/or the digital assistant system 300 (fig. 3A) and that has an associated task flow implemented in the task flow models 354. The associated task flow is a series of programmed actions and steps that the digital assistant system 300 takes in order to perform the task. The scope of a digital assistant system's capabilities depends on the number and variety of task flows that have been implemented and stored in the task flow models 354, or, in other words, on the number and variety of "actionable intents" that the digital assistant system 300 recognizes. However, the effectiveness of the digital assistant system 300 also depends on its ability to infer the correct "actionable intent(s)" from a user request expressed in natural language.

In some implementations, the natural language processor 332 receives context information associated with the user request (e.g., from the I/O processing module 328) in addition to the sequence of words or symbols obtained from the speech-to-text processing module 330. The natural language processor 332 optionally uses the context information to clarify, supplement, and/or further define the information contained in the sequence of symbols received from the speech to text processing module 330. Contextual information includes, for example, user preferences, hardware and/or software states of the user device, sensor information collected before, during, or shortly after a user request, previous interactions (e.g., conversations) between the digital assistant and the user, and so forth.

In some implementations, the natural language processing is based on an ontology 360. The ontology 360 is a hierarchical structure that contains a plurality of nodes, each node representing either an "actionable intent" or a "property" that is related to one or more of the "actionable intents" or to other "properties." As described above, an "actionable intent" represents a task that the digital assistant system 300 is capable of performing (i.e., a task that is "actionable" or can be acted upon). A "property" represents a parameter associated with an actionable intent or a sub-aspect of another property. The connection between an actionable intent node and a property node in the ontology 360 defines how the parameter represented by the property node pertains to the task represented by the actionable intent node. In some implementations, the ontology 360 is composed of actionable intent nodes and property nodes. Within the ontology 360, each actionable intent node is connected to one or more property nodes either directly or through one or more intermediate property nodes. Similarly, each property node is connected to one or more actionable intent nodes either directly or through one or more intermediate property nodes. For example, the ontology 360 shown in FIG. 3C includes a "restaurant reservation" node, which is an actionable intent node. The property nodes "restaurant," "date/time" (for the reservation), and "number of diners" are each directly connected to the "restaurant reservation" node (i.e., the actionable intent node). Further, the property nodes "cuisine," "price range," "phone number," and "location" are child nodes of the property node "restaurant," and are each connected to the "restaurant reservation" node (i.e., the actionable intent node) through the intermediate property node "restaurant." As another example, the ontology 360 shown in FIG. 3C also includes a "set reminder" node, which is another actionable intent node. The property nodes "date/time" (for setting the reminder) and "subject" (for the reminder) are each connected to the "set reminder" node. Since the property "date/time" is related to both the task of making a restaurant reservation and the task of setting a reminder, the property node "date/time" is connected to both the "restaurant reservation" node and the "set reminder" node in the ontology 360.

An actionable intent node, along with the concept nodes to which it is connected, may be described as a "domain." In the present discussion, each domain is associated with a respective actionable intent and refers to the set of nodes (and the relationships between them) that are related to that particular actionable intent. For example, the ontology 360 shown in FIG. 3C includes an instance of a restaurant reservation domain 362 and an instance of a reminder domain 364 within the ontology 360. The restaurant reservation domain includes the actionable intent node "restaurant reservation," the property nodes "restaurant," "date/time," and "number of diners," and the child property nodes "cuisine," "price range," "phone number," and "location." The reminder domain 364 includes the actionable intent node "set reminder" and the property nodes "subject" and "date/time." In some implementations, the ontology 360 is composed of a number of domains. Each domain may share one or more property nodes with one or more other domains. For example, in addition to the restaurant reservation domain 362 and the reminder domain 364, the "date/time" property node may be associated with a number of other domains (e.g., a scheduling domain, a travel reservation domain, a movie tickets domain, etc.). Although FIG. 3C illustrates two exemplary domains within the ontology 360, the ontology 360 may include other domains (or actionable intents), such as "initiate a phone call," "find directions," "schedule a meeting," "send a message," "provide an answer to a question," "tag a photo," and so forth. For example, a "send a message" domain is associated with a "send a message" actionable intent node, and may also include property nodes such as "one or more recipients," "message type," and "message body." The property node "recipient" may be further defined, for example, by child property nodes such as "recipient name" and "message address."
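
As a rough illustration of this domain structure, the following Python sketch builds the "restaurant reservation" and "set reminder" domains as linked nodes. It is not the patent's implementation; the class name, field names, and the choice of Python are assumptions made purely for readability.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node in the ontology: either an actionable intent or a property."""
    name: str
    kind: str                                      # "intent" or "property"
    children: list = field(default_factory=list)   # linked property nodes

# Property nodes in the "restaurant reservation" domain.
restaurant = Node("restaurant", "property", children=[
    Node("cuisine", "property"),
    Node("price range", "property"),
    Node("phone number", "property"),
    Node("location", "property"),
])
date_time = Node("date/time", "property")   # shared with "set reminder"

restaurant_reservation = Node("restaurant reservation", "intent",
                              children=[restaurant, date_time,
                                        Node("number of diners", "property")])
set_reminder = Node("set reminder", "intent",
                    children=[Node("subject", "property"), date_time])

# A "domain" is an actionable intent node plus the property nodes it reaches.
ontology = {"restaurant reservation": restaurant_reservation,
            "set reminder": set_reminder}
```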

In some implementations, the ontology 360 includes all of the domains (and thus actionable intents) that the digital assistant is able to understand and act upon. In some implementations, the ontology 360 can be modified, such as by adding or removing domains or nodes, or by modifying relationships between nodes within the ontology 360.

In some implementations, nodes associated with multiple related actionable intents can be clustered under a "super domain" in the ontology 360. For example, a "travel" super domain may include a cluster of property nodes and actionable intent nodes related to travel. The actionable intent nodes related to travel may include "airline reservation," "hotel reservation," "car rental," "route planning," "finding points of interest," and so forth. Actionable intent nodes under the same super domain (e.g., the "travel" super domain) may have many property nodes in common. For example, the actionable intent nodes for "airline reservation," "hotel reservation," "car rental," "route planning," and "finding points of interest" may share one or more of the property nodes "starting location," "destination," "departure date/time," "arrival date/time," and "number of diners."

In some implementations, each node in the ontology 360 is associated with a set of words and/or phrases that are related to the property or actionable intent represented by the node. The respective set of words and/or phrases associated with each node is the so-called "vocabulary" associated with the node. The respective set of words and/or phrases associated with each node may be stored in the vocabulary index 344 (FIG. 3B) in association with the property or actionable intent represented by the node. For example, returning to fig. 3B, the vocabulary associated with the node for the "restaurant" property may include words such as "food," "drinks," "cuisine," "hunger," "eating," "pizza," "fast food," "meal," and so forth. As another example, the vocabulary associated with the node for the "initiate a phone call" actionable intent may include words and phrases such as "call," "make a call," "dial," "call…," "call the number," "call to," and the like. The vocabulary index 344 optionally includes words and phrases in different languages. In some implementations, the natural language processor 332 shown in fig. 3B receives the symbol sequence (e.g., a text string) from the speech-to-text processing module 330 and determines which nodes are implicated by the words in the symbol sequence. In some implementations, if a word or phrase in the symbol sequence is found (via the vocabulary index 344) to be associated with one or more nodes in the ontology 360, that word or phrase will "trigger" or "activate" those nodes. When multiple nodes are "triggered," the natural language processor 332 selects one of the actionable intents as the task (or task type) that the user intends the digital assistant to perform, based on the number and/or relative importance of the activated nodes. In some implementations, the domain with the most "triggered" nodes is selected. In some implementations, the domain with the highest confidence (e.g., based on the relative importance of its respective triggered nodes) is selected. In some implementations, the domain is selected based on a combination of the number and the importance of the triggered nodes. In some implementations, additional factors are also considered in selecting the node, such as whether the digital assistant system 300 has previously correctly interpreted a similar request from the user.
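
The node-triggering and domain-selection behavior described above might be approximated as follows. This sketch assumes a flat vocabulary index mapping words to node names and simple per-node importance weights; both the data and the scoring rule are illustrative assumptions, not the natural language processor's actual logic.

```python
from collections import defaultdict

# Hypothetical vocabulary index: word or phrase -> nodes it triggers.
vocab_index = {
    "food": ["restaurant"], "hunger": ["restaurant"], "sushi": ["cuisine"],
    "call": ["initiate a phone call"], "remind": ["set reminder"],
}

# Hypothetical mapping from nodes to the domains containing them,
# with a relative importance weight for each node.
node_domains = {
    "restaurant": [("restaurant reservation", 1.0)],
    "cuisine": [("restaurant reservation", 0.5)],
    "set reminder": [("set reminder", 2.0)],
    "initiate a phone call": [("initiate a phone call", 2.0)],
}

def select_domain(tokens):
    """Score each domain by the number and weight of triggered nodes."""
    scores = defaultdict(float)
    for token in tokens:
        for node in vocab_index.get(token, []):
            for domain, weight in node_domains.get(node, []):
                scores[domain] += weight
    return max(scores, key=scores.get) if scores else None

print(select_domain(["hunger", "for", "sushi"]))   # -> "restaurant reservation"
```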

In some implementations, the digital assistant system 300 also stores the names of particular entities in the vocabulary index 344, so that when one of these names is detected in a user request, the natural language processor 332 is able to recognize that the name refers to a particular instance of a property or sub-property in the ontology. In some implementations, the names of particular entities are names of businesses, restaurants, people, movies, and the like. In some implementations, the digital assistant system 300 can search for and identify particular entity names from other data sources, such as the user's address book or contact list, a movie database, a musician database, and/or a restaurant database. In some implementations, when the natural language processor 332 recognizes that a word in the symbol sequence is the name of a particular entity (such as a name in the user's address book or contact list), that word is given additional importance when selecting the actionable intent within the ontology for the user request. For example, when the words "Mr. Santo" are recognized in a user request and the last name "Santo" is found in the vocabulary index 344 as one of the contacts in the user's contact list, then the user request is likely to correspond to the "send a message" or "initiate a phone call" domain. As another example, when the words "ABC cafe" are found in a user request, and the term "ABC cafe" is found in the vocabulary index 344 as the name of a particular restaurant in the city where the user is located, then the user request is likely to correspond to the "restaurant reservation" domain.

The user data 348 includes user-specific information such as user-specific vocabulary, user preferences, user address, the user's default and second languages, the user's contact list, and other short-term or long-term information for each user. The natural language processor 332 may use the user-specific information to supplement the information contained in the user input to further define the user intent. For example, for a user request "invite my friends to my birthday party," the natural language processor 332 can access the user data 348 to determine which people the "friends" are and where and when the "birthday party" will be held without the user explicitly providing such information in their request.

In some implementations, the natural language processor 332 includes a classification module 349. In some implementations, as discussed in more detail below, the classification module 349 determines whether each of one or more words in a text string (e.g., corresponding to a speech input associated with a digital photograph) is one of an entity, an activity, or a location. In some implementations, the classification module 349 classifies each of the one or more words as one of an entity, an activity, or a location. Once the natural language processor 332 identifies an actionable intent (or domain) based on the user request, the natural language processor 332 generates a structured query to represent the identified actionable intent. In some implementations, the structured query includes parameters for one or more nodes within the domain of the actionable intent, and at least some of these parameters are populated with the specific information and requirements specified in the user request. For example, the user may say "help me reserve a seat at 7 pm in a sushi shop." In this case, the natural language processor 332 can correctly recognize the actionable intent as "restaurant reservation" based on the user input. According to the ontology, a structured query for the "restaurant reservation" domain may include parameters such as {cuisine}, {time}, {date}, {number of diners}, and so forth. Based on the information contained in the user utterance, the natural language processor 332 may generate a partial structured query for the restaurant reservation domain, where the partial structured query includes the parameters {cuisine = "sushi"} and {time = "7 pm"}. However, in this example, the user utterance contains insufficient information to complete the structured query associated with the domain. Thus, other necessary parameters such as {number of diners} and {date} are not specified in the structured query based on the currently available information. In some implementations, the natural language processor 332 populates some parameters of the structured query with the received contextual information. For example, if the user requests a "sushi shop near me," the natural language processor 332 may populate the {location} parameter in the structured query with GPS coordinates from the user device 104.
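
A partially populated structured query of the kind described here could be sketched as below. The parameter names follow the "restaurant reservation" example in the text; the helper function, its keyword-spotting rules, and the regular expression are hypothetical simplifications of what the natural language processor would do.

```python
import re

def build_structured_query(utterance, gps_coordinates=None):
    """Build a partial structured query for the 'restaurant reservation'
    domain from a user utterance, leaving unknown parameters unset."""
    text = utterance.lower()
    query = {"intent": "restaurant reservation",
             "cuisine": None, "time": None, "date": None,
             "number of diners": None, "location": None}

    if "sushi" in text:
        query["cuisine"] = "sushi"
    time_match = re.search(r"(\d{1,2}\s*(?:am|pm))", text)
    if time_match:
        query["time"] = time_match.group(1)
    if "near me" in text and gps_coordinates:
        query["location"] = gps_coordinates   # filled from context information

    return query

print(build_structured_query("help me reserve a seat at 7 pm in a sushi shop"))
# {'intent': 'restaurant reservation', 'cuisine': 'sushi', 'time': '7 pm',
#  'date': None, 'number of diners': None, 'location': None}
```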

In some implementations, the natural language processor 332 passes the structured query (including any completed parameters) to a task flow processing module 336 ("task flow processor"). The task flow processor 336 is configured to perform one or more of the following: receive the structured query from the natural language processor 332, complete the structured query, and perform the actions required to "complete" the user's ultimate request. In some implementations, the various procedures necessary to accomplish these tasks are provided in the task flow model 354. In some implementations, the task flow model 354 includes procedures for obtaining additional information from the user, as well as task flows for performing the actions associated with the actionable intent. As described above, in order to complete a structured query, the task flow processor 336 may need to initiate an additional dialog with the user to obtain additional information and/or to disambiguate potentially ambiguous utterances. When such interaction is necessary, the task flow processor 336 invokes the dialog processing module 334 ("dialog processor") to engage in a dialog with the user. In some implementations, the dialog processing module 334 determines how (and/or when) to ask the user for the additional information, and receives and processes the user's responses. In some implementations, the questions are provided to the user and the answers are received from the user through the I/O processing module 328. For example, the dialog processing module 334 presents dialog output to the user via audio and/or visual output and receives input from the user via spoken or physical (e.g., touch gesture) responses. Continuing with the example above, when the task flow processor 336 invokes the dialog processor 334 to determine the "number of diners" and "date" information for the structured query associated with the "restaurant reservation" domain, the dialog processor 334 generates questions such as "For how many people?" and "On which day?" to pass to the user. Once answers are received from the user, the dialog processing module 334 populates the structured query with the missing information, or passes the information to the task flow processor 336 to complete the missing information in the structured query.
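
The hand-off in which the dialog processor asks for missing parameters and fills in the answers might be sketched as follows. The prompt wording mirrors the example above; the `PROMPTS` table and the `ask_user` callback are hypothetical stand-ins for the I/O processing path.

```python
# Hypothetical prompts for missing 'restaurant reservation' parameters.
PROMPTS = {
    "number of diners": "For how many people?",
    "date": "On which day?",
}

def complete_structured_query(query, ask_user):
    """Ask the user for each missing parameter and fill in the answers.

    `ask_user` is any callable that presents a question (via audio and/or
    visual output) and returns the user's response.
    """
    for parameter, prompt in PROMPTS.items():
        if query.get(parameter) is None:
            query[parameter] = ask_user(prompt)
    return query

# Usage sketch with canned answers standing in for the dialog processor.
canned_answers = {"For how many people?": "2", "On which day?": "Friday"}
completed = complete_structured_query(
    {"intent": "restaurant reservation", "cuisine": "sushi",
     "time": "7 pm", "date": None, "number of diners": None},
    ask_user=lambda prompt: canned_answers[prompt],
)
print(completed)
```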

In some cases, the task flow processor 336 may receive a structured query that has one or more ambiguous properties. For example, a structured query for the "send a message" domain may indicate that the intended recipient is "Bob," and the user may have multiple contacts named "Bob." The task flow processor 336 will request that the dialog processor 334 disambiguate this property of the structured query. In turn, the dialog processor 334 may ask the user "Which Bob?", and display (or read) a list of contacts named "Bob" from which the user can select.

In some implementations, the dialog processor 334 includes a disambiguation module 350. In some implementations, the disambiguation module 350 disambiguates one or more ambiguous terms (e.g., one or more ambiguous terms in a text string corresponding to a speech input associated with a digital photograph). In some implementations, the disambiguation module 350 identifies that a first word of the one or more words has multiple candidate meanings, prompts the user for additional information about the first word, receives the additional information from the user in response to the prompt, and identifies the entity, activity, or location associated with the first word in accordance with the additional information.

In some implementations, the disambiguation module 350 disambiguates pronouns. In such implementations, the disambiguation module 350 identifies one of the one or more words as a pronoun and determines the noun to which the pronoun refers. In some implementations, the disambiguation module 350 determines the noun to which the pronoun refers by using a contact list associated with a user of the electronic device. Alternatively or in addition, the disambiguation module 350 also determines the noun to which the pronoun refers as the name of the entity, activity, or location identified in the previous voice input associated with the previously labeled digital photograph. Alternatively or in addition, disambiguation module 350 also determines the noun to which the pronoun refers as the name of the person recognized based on the previous voice input associated with the previously labeled digital photograph. In some implementations, the disambiguation module 350 accesses information obtained from one or more sensors (e.g., the proximity sensor 214, the light sensor 212, the GPS receiver 213, the temperature sensor 215, and the motion sensor 210) of a handheld electronic device (e.g., the user device 104) to determine the meaning of one or more words. In some implementations, the disambiguation module 350 identifies two words that are each associated with one of an entity, activity, or location. For example, the first of the two words refers to a person and the second of the two words refers to a location. In some implementations, the disambiguation module 350 identifies three words that are each associated with one of an entity, activity, or location.

Once the task flow processor 336 has completed the structured query for the actionable intent, the task flow processor 336 proceeds to perform the ultimate task associated with the actionable intent. Thus, the task flow processor 336 executes the steps and instructions in the task flow model according to the specific parameters contained in the structured query. For example, the task flow model for the actionable intent "restaurant reservation" may include steps and instructions for contacting a restaurant and actually requesting a reservation for a particular number of diners at a particular time. For example, using a structured query such as {restaurant reservation, restaurant = ABC cafe, date = 2012/3/12, time = 7 pm, number of diners = 5}, the task flow processor 336 may perform the following steps: (1) logging onto a server of the ABC cafe or a restaurant reservation system configured to accept reservations for multiple restaurants, such as the ABC cafe, (2) entering the date, time, and number-of-diners information in a form on the website, (3) submitting the form, and (4) making a calendar entry for the reservation in the user's calendar. As another example, the task flow processor 336 executes steps and instructions associated with tagging or searching for digital photographs in response to a voice input, for example in conjunction with the photograph module 132, as described in detail below. In some implementations, the task flow processor 336 completes the task requested in the user input, or provides the informational answer requested in the user input, with the assistance of a service processing module 338 ("service processor"). For example, the service processor 338 can initiate phone calls, set calendar entries, invoke map searches, invoke or interact with other applications installed on the user device, and invoke or interact with third-party services (e.g., restaurant reservation portals, social networking sites or services, banking portals, etc.) on behalf of the task flow processor 336. In some implementations, the protocols and Application Programming Interfaces (APIs) required by each service are specified by a respective service model among the service models 356. The service processor 338 accesses the appropriate service model for a service and generates requests for the service in accordance with that service model, according to the protocols and APIs required by the service.

For example, if a restaurant has enabled an online reservation service, the restaurant may submit a service model that specifies the necessary parameters for making a reservation and APIs for communicating the values of those parameters to the online reservation service. When requested by the task flow processor 336, the service processor 338 can establish a network connection with the online reservation service using the web address stored in the service model 356, and send the necessary parameters for the reservation (e.g., time, date, number of diners) to the online reservation interface in a format according to the API of the online reservation service.

In some implementations, the natural language processor 332, the dialog processor 334, and the task flow processor 336 are used collectively and iteratively to infer and define the user's intent, to obtain information to further clarify and refine the user's intent, and to ultimately generate a response (e.g., provide output to the user, or complete a task) to satisfy the user's intent.

In some implementations, after all tasks required to satisfy the user request have been performed, the digital assistant 326 formulates a confirmation response and sends the response back to the user through the I/O processing module 328. If the user requests an informational answer, the confirmation response presents the requested information to the user. In some implementations, the digital assistant also requests that the user indicate whether the user is satisfied with the response generated by the digital assistant 326.

Attention is now directed to FIG. 4, which is a block diagram illustrating components of a voice trigger system 400 according to some implementations. (The voice trigger system 400 is not limited to voice, and the implementations described herein apply equally to non-voice sounds.) The voice trigger system 400 is composed of various components, modules, and/or software programs within the electronic device 104. In some implementations, the voice trigger system 400 includes a noise detector 402, a sound type detector 404, a trigger sound detector 406, and a voice-based service 408, each coupled to an audio bus 401, as well as the audio subsystem 226. In some implementations, more or fewer of these modules are used. The sound detectors 402, 404, and 406 may be referred to as modules and may include hardware (e.g., circuitry, memory, processors, etc.), software (e.g., programs, on-chip software, firmware, etc.), and/or any combination thereof for performing the functions described herein. In some implementations, as illustrated by the dashed lines in fig. 4, the sound detectors are communicatively, programmatically, physically, and/or operationally coupled to each other (e.g., via a communication bus). (For ease of illustration, FIG. 4 shows each sound detector coupled only to the adjacent sound detectors; it should be understood that each sound detector may also be coupled to any of the other sound detectors.)

In some implementations, the audio subsystem 226 includes a codec 410, an audio digital signal processor (DSP) 412, and a memory buffer 414. In some implementations, the audio subsystem 226 is coupled to one or more microphones 230 (fig. 2) and one or more speakers 228 (fig. 2). The audio subsystem 226 provides sound input to the sound detectors 402, 404, 406 and the voice-based service 408 (and to other components or modules, such as a phone and/or baseband subsystem) for processing and/or analysis. In some implementations, the audio subsystem 226 is coupled to an external audio system 416 that includes at least one microphone 418 and at least one speaker 420.

In some implementations, the voice-based service 408 is a voice-based digital assistant and corresponds to one or more components or functions of the digital assistant system described above with reference to fig. 1-3C. In some implementations, the voice-based service is a voice-to-text service, a dictation service, or the like. In some implementations, the noise detector 402 monitors the audio channels to determine whether the sound input from the audio subsystem 226 satisfies a predetermined condition, such as a magnitude threshold. The audio channels correspond to streams of audio information received by one or more sound pickup devices, such as one or more microphones 230 (fig. 2). An audio channel refers to audio information regardless of its processing state or the particular hardware processing and/or transmitting the audio information. For example, the audio channels may refer to analog electrical impulses from the microphone 230 (and/or the circuitry over which the electrical impulses are propagated), as well as digitally encoded audio streams produced by processing the analog electrical impulses (e.g., by the audio subsystem 226 and/or any other audio processing system of the electronic device 104).

In some implementations, the predetermined condition is whether the sound input exceeds a certain volume for a predetermined amount of time. In some implementations, the noise detector uses time-domain analysis of the sound input, which requires relatively few computational and battery resources compared to other types of analysis (e.g., those performed by the sound type detector 404, the trigger sound detector 406, and/or the voice-based service 408). In some implementations, other types of signal processing and/or audio analysis are used, including, for example, frequency-domain analysis. If the noise detector 402 determines that the sound input satisfies the predetermined condition, it activates an upstream sound detector, such as the sound type detector 404 (e.g., by providing a control signal to activate one or more processing routines and/or by providing power to the upstream sound detector). In some implementations, the upstream sound detector is activated in response to other conditions being met as well. For example, in some implementations, the upstream sound detector is activated in response to a determination that the device is not being stored in an enclosed space (e.g., based on a light detector detecting a threshold level of light).
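
The time-domain check performed by the noise detector might be approximated as below. The sample rate, window length, and amplitude threshold are illustrative assumptions; the text only requires that the sound input exceed some volume for a predetermined amount of time.

```python
import numpy as np

def noise_condition_met(samples, threshold=0.1, window_ms=100,
                        sample_rate=16000):
    """Return True if the sound input stays above an amplitude threshold
    for an entire predetermined window (a simple time-domain test)."""
    window = int(sample_rate * window_ms / 1000)
    if len(samples) < window:
        return False
    recent = np.abs(np.asarray(samples[-window:], dtype=float))
    return bool(np.all(recent > threshold))

# Example: a synthetic 100 ms burst that satisfies the condition.
burst = 0.5 * np.ones(1600)
print(noise_condition_met(burst))   # True
```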

The sound type detector 404 monitors the audio channel to determine whether the sound input corresponds to a certain type of sound, such as sound characteristic of a human voice, a whistle, a clap, and so forth. The type of sound that the sound type detector 404 is configured to recognize corresponds to the particular trigger sound or sounds that the voice trigger is configured to recognize. In implementations where the trigger sound is a spoken word or phrase, the sound type detector 404 includes a "voice activity detector" (VAD). In some implementations, the sound type detector 404 uses frequency-domain analysis of the sound input. For example, the sound type detector 404 generates a spectrogram of the received sound input (e.g., using a Fourier transform) and analyzes the spectral components of the sound input to determine whether the sound input is likely to correspond to a particular type or class of sound (e.g., human speech). Thus, in implementations where the trigger sound is a spoken word or phrase, the VAD will not activate the trigger sound detector 406 if the audio channel is picking up ambient sound (e.g., traffic noise) rather than human speech. In some implementations, the sound type detector 404 remains active as long as the predetermined condition of any downstream sound detector (e.g., the noise detector 402) is met. For example, in some implementations, the sound type detector 404 remains active as long as the sound input includes sound above a predetermined amplitude threshold (as determined by the noise detector 402), and is deactivated when the sound falls below the predetermined threshold. In some implementations, once activated, the sound type detector 404 remains active until a condition is met, such as the expiration of a timer (e.g., 1, 2, 5, or 10 seconds in duration, or any other suitable duration), the completion of a certain number of on/off cycles of the sound type detector 404, or the occurrence of an event (e.g., the amplitude of the sound falling below a second threshold, as determined by the noise detector 402 and/or the sound type detector 404).
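
A very rough sketch of the kind of frequency-domain test a sound type detector might apply (checking whether most of the signal energy lies in a typical speech band) is shown below. The band limits and energy-ratio threshold are assumptions for illustration and do not reflect any particular VAD algorithm.

```python
import numpy as np

def looks_like_speech(samples, sample_rate=16000,
                      band=(80.0, 3000.0), ratio_threshold=0.6):
    """Crude voice-activity test: return True if most of the signal's
    spectral energy falls inside a typical speech band."""
    samples = np.asarray(samples, dtype=float)
    spectrum = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    total = spectrum.sum() + 1e-12
    in_band = spectrum[(freqs >= band[0]) & (freqs <= band[1])].sum()
    return (in_band / total) > ratio_threshold

# Example: a 200 Hz tone (inside the band) versus broadband noise.
t = np.arange(0, 0.2, 1 / 16000)
print(looks_like_speech(np.sin(2 * np.pi * 200 * t)))                    # True
print(looks_like_speech(np.random.default_rng(0).normal(size=t.size)))   # likely False
```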

As described above, if the sound type detector 404 determines that the sound input corresponds to a predetermined sound type, it activates an upstream sound detector, such as the trigger sound detector 406 (e.g., by providing a control signal to activate one or more processing routines and/or by providing power to the upstream sound detector).

The trigger sound detector 406 is configured to determine whether the sound input includes at least a portion of certain predetermined content (e.g., at least a portion of a trigger word, phrase, or sound). In some implementations, the trigger sound detector 406 compares a representation of the sound input (the "input representation") to one or more reference representations of the trigger word. If the input representation matches at least one of the one or more reference representations with an acceptable degree of confidence, the trigger sound detector 406 initiates the voice-based service 408 (e.g., by providing a control signal to initiate one or more processing routines and/or by providing power to the voice-based service 408). In some implementations, the input representation and the one or more reference representations are spectrograms (or mathematical representations thereof) that show how the spectral density of a signal varies over time. In some implementations, the representations are other types of audio features or voiceprints. In some implementations, initiating the voice-based service 408 includes bringing one or more circuits, programs, and/or processors out of a standby mode and invoking the voice-based service. The voice-based service is then ready to provide more comprehensive speech recognition, speech-to-text processing, and/or natural language processing. In some implementations, the voice trigger system 400 includes voice authentication functionality that enables it to determine whether the sound input corresponds to the voice of a particular person, such as the owner/user of the device. For example, in some implementations, the sound type detector 404 uses voiceprinting techniques to determine that the sound input was uttered by an authorized user. Voice authentication and voiceprints are described in more detail in U.S. patent application 13/053,144, assigned to the assignee of the present application, which is hereby incorporated by reference in its entirety. In some implementations, voice authentication is included in any of the sound detectors described herein (e.g., the noise detector 402, the sound type detector 404, the trigger sound detector 406, and/or the voice-based service 408). In some implementations, voice authentication is implemented as a module separate from the sound detectors listed above (e.g., as voice authentication module 428, fig. 4) and may be operationally positioned after the noise detector 402, after the sound type detector 404, after the trigger sound detector 406, or at any other suitable location.
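
One common way to compare an input representation against reference representations, consistent with (though not required by) the description above, is to compute spectrogram-style features and measure their distance with dynamic time warping. The feature extraction, distance metric, and confidence threshold below are all assumptions made for illustration.

```python
import numpy as np

def log_spectrogram(samples, frame=400, hop=160):
    """A simple spectrogram-style representation of a sound input
    (log-magnitude FFT of overlapping frames)."""
    samples = np.asarray(samples, dtype=float)
    frames = [samples[i:i + frame]
              for i in range(0, len(samples) - frame + 1, hop)]
    return np.array([np.log1p(np.abs(np.fft.rfft(f))) for f in frames])

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m] / (n + m)

def matches_trigger(input_samples, reference_representations, threshold=5.0):
    """Return True if the input matches any reference representation with an
    acceptable degree of confidence (i.e., a small enough distance)."""
    rep = log_spectrogram(input_samples)
    return any(dtw_distance(rep, ref) < threshold
               for ref in reference_representations)
```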

In some implementations, the trigger sound detector 406 remains active as long as the conditions of any one or more downstream sound detectors (e.g., the noise detector 402 and/or the sound type detector 404) are met. For example, in some implementations, the trigger sound detector 406 remains active as long as the sound input includes sound above a predetermined threshold (as detected by the noise detector 402). In some implementations, the trigger sound detector 406 remains active as long as the sound input includes a certain type of sound (as detected by the sound type detector 404). In some implementations, the trigger sound detector 406 remains active as long as both of the above conditions are met simultaneously.

In some implementations, once activated, the trigger sound detector 406 remains active until a condition is met, such as the expiration of a timer (e.g., 1, 2, 5, or 10 seconds in duration, or any other suitable duration), the completion of a certain number of on/off cycles of the trigger sound detector 406, or the occurrence of an event (e.g., the amplitude of the sound falling below a second threshold). In some implementations, when one sound detector activates another, both remain active. However, the sound detectors may be active or inactive at different times, and it is not necessary for all of the downstream (e.g., lower-power and/or lower-complexity) sound detectors to be active (or for their respective conditions to be met) in order for an upstream sound detector to be active. For example, in some implementations, after the noise detector 402 and the sound type detector 404 determine that their respective conditions are met and the trigger sound detector 406 is activated, one or both of the noise detector 402 and the sound type detector 404 are deactivated and/or enter a standby mode while the trigger sound detector 406 operates. In other implementations, both the noise detector 402 and the sound type detector 404 (or one or the other) remain active while the trigger sound detector 406 operates. In various implementations, different combinations of sound detectors are active at different times, and whether one sound detector is active or inactive may depend on the state of other sound detectors, or may be independent of the state of other sound detectors.

Although fig. 4 depicts three separate sound detectors, each configured to detect a different aspect of the sound input, more or fewer sound detectors may be used in various implementations of the voice trigger. For example, in some implementations, only the trigger sound detector 406 is used. In some implementations, the trigger sound detector 406 is used in conjunction with the noise detector 402 or the sound type detector 404. In some implementations, all of the detectors 402, 404, and 406 are used. In some implementations, additional sound detectors are included.

Furthermore, different combinations of sound detectors may be used at different times. For example, the particular combination of sound detectors and the manner in which they interact may depend on one or more conditions, such as the context or operational state of the device. As a specific example, if the device is plugged into external power (and thus not relying solely on battery power), the trigger sound detector 406 is active, while the noise detector 402 and the sound type detector 404 remain inactive. As another example, if the device is in a pocket or backpack, all of the sound detectors are inactive. By cascading sound detectors as described above, so that detectors requiring less power invoke detectors requiring more power only when necessary, energy-efficient voice trigger functionality can be provided. As described above, additional power efficiency is achieved by operating one or more of the sound detectors according to a duty cycle. For example, in some implementations, the noise detector 402 operates according to a duty cycle so that it effectively performs continuous noise detection even though it is off for at least part of the time. In some implementations, the noise detector 402 is on for 10 milliseconds and off for 90 milliseconds. In some implementations, the noise detector 402 is on for 20 milliseconds and off for 500 milliseconds. Other on and off durations are also possible.

In some implementations, if the noise detector 402 detects noise during its "on" period, the noise detector 402 remains on to further process and/or analyze the sound input. For example, the noise detector 402 may be configured to activate the upstream sound detector if it detects sound above a predetermined amplitude for a predetermined amount of time (e.g., 100 milliseconds). Thus, if the noise detector 402 detects sound above the predetermined amplitude during its 10 millisecond "on" period, it does not immediately enter the "off" period. Instead, the noise detector 402 remains active and continues to process the sound input to determine whether it exceeds the threshold for the entire predetermined duration (e.g., 100 milliseconds).
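
A simplified duty-cycle loop for the noise detector, including the behavior of remaining on to confirm a detection before activating the upstream detector, might look like the following. The timing values mirror the 10 ms on / 90 ms off and 100 ms confirmation examples in the text; `read_audio`, `amplitude_exceeds_threshold`, and `activate_upstream_detector` are hypothetical callables.

```python
import time

ON_MS, OFF_MS, CONFIRM_MS = 10, 90, 100   # example durations from the text

def noise_detector_duty_cycle(read_audio, amplitude_exceeds_threshold,
                              activate_upstream_detector):
    """Run the noise detector on a duty cycle: listen briefly, and if sound
    is detected during the 'on' period, stay on for CONFIRM_MS to confirm
    before activating the upstream (higher-power) sound detector."""
    while True:
        chunk = read_audio(ON_MS)                 # the 'on' period
        if amplitude_exceeds_threshold(chunk):
            confirm = read_audio(CONFIRM_MS)      # remain active to confirm
            if amplitude_exceeds_threshold(confirm):
                activate_upstream_detector()
        else:
            time.sleep(OFF_MS / 1000.0)           # the 'off' period

# Usage sketch with stand-in callables (all hypothetical):
# noise_detector_duty_cycle(read_audio=my_audio_source,
#                           amplitude_exceeds_threshold=lambda c: max(c) > 0.1,
#                           activate_upstream_detector=wake_sound_type_detector)
```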

In some implementations, the sound type detector 404 operates according to a duty cycle. In some implementations, the sound type detector 404 is on for 20 milliseconds and off for 100 milliseconds. Other on and off durations are also possible. In some implementations, the sound type detector 404 is able to determine whether the sound input corresponds to a predetermined type of sound within the "on" period of its duty cycle. Thus, if the sound type detector 404 determines during its "on" period that the sound is of a certain type, it activates the trigger sound detector 406 (or any other upstream sound detector). Alternatively, in some implementations, if the sound type detector 404 detects, during an "on" period, sound that may correspond to the predetermined type, the detector does not immediately enter the "off" period. Instead, the sound type detector 404 remains active and continues to process the sound input to determine whether it corresponds to the predetermined sound type. In some implementations, if the sound type detector determines that the predetermined sound type has been detected, it activates the trigger sound detector 406 to further process the sound input and determine whether the trigger sound has been detected. Similar to the noise detector 402 and the sound type detector 404, in some implementations, the trigger sound detector 406 operates according to a duty cycle. In some implementations, the trigger sound detector 406 is on for 50 milliseconds and off for 50 milliseconds. Other on and off durations are also possible. If the trigger sound detector 406 detects, during its "on" period, sound that may correspond to a trigger sound, the detector does not immediately enter the "off" period. Instead, the trigger sound detector 406 remains active and continues to process the sound input to determine whether it includes the trigger sound. In some implementations, if such sound is detected, the trigger sound detector 406 remains active to process the audio for a predetermined duration, such as 1, 2, 5, or 10 seconds, or any other suitable duration. In some implementations, the duration is selected based on the length of the particular trigger word or sound that the detector is configured to detect. For example, if the trigger phrase is "hey, SIRI," the trigger sound detector remains active for about 2 seconds to determine whether the sound input includes the trigger phrase.

In some implementations, some of the sound detectors operate according to a duty cycle, while others operate continuously when active. For example, in some implementations, only the first sound detector (e.g., the noise detector 402 in fig. 4) operates according to a duty cycle, and the upstream sound detectors operate continuously once activated. In some other implementations, the noise detector 402 and the sound type detector 404 operate according to duty cycles, while the trigger sound detector 406 operates continuously. Whether a particular sound detector operates continuously or according to a duty cycle depends on one or more conditions, such as the context or operating state of the device. In some implementations, if the device is plugged into external power and not relying solely on battery power, all of the sound detectors operate continuously once activated. In other implementations, the noise detector 402 (or any of the sound detectors) operates according to a duty cycle if the device is in a pocket or backpack (e.g., as determined by sensor and/or microphone signals), but operates continuously if it is determined that the device is likely not stored. In some implementations, whether a particular sound detector operates continuously or according to a duty cycle depends on the battery charge level of the device. For example, the noise detector 402 operates continuously when the battery charge is above 50%, and operates according to a duty cycle when the battery charge is below 50%.

In some implementations, the voice trigger includes noise, echo, and/or sound cancellation functionality (collectively referred to as noise cancellation). In some implementations, noise cancellation is performed by the audio subsystem 226 (e.g., by the audio DSP 412). Noise cancellation reduces or removes unwanted noise or sounds from the sound input before it is processed by the sound detectors. In some cases, the unwanted noise is background noise from the user's environment, such as noise from a fan or the clicking of a keyboard. In some implementations, the unwanted noise is any sound above, below, or at a predetermined amplitude or frequency. For example, in some implementations, sound above the typical range of human speech (e.g., above 3,000 Hz) is filtered out or removed from the signal. In some implementations, multiple microphones (e.g., the microphones 230) are used to help determine which components of the received sound should be reduced and/or removed. For example, in some implementations, the audio subsystem 226 uses beamforming techniques to identify sounds, or portions of the sound input, that appear to originate from a single point in space (e.g., the user's mouth). The audio subsystem 226 then focuses on this sound by removing from the sound input the sounds that are received equally by all microphones (e.g., ambient sound that does not appear to originate from any particular direction).

In some implementations, the DSP 412 is configured to cancel or remove from the sound input the sounds that are being output by the device on which the digital assistant operates. For example, if the audio subsystem 226 is outputting music, a broadcast, a podcast, speech output, or any other audio content (e.g., through the speaker 228), the DSP 412 removes any of this outgoing sound that is picked up by a microphone and included in the sound input. Thus, the sound input is free of (or at least contains less of) the output audio. As a result, the sound input provided to the sound detectors is cleaner, and the trigger is more accurate. Aspects of noise cancellation are described in more detail in U.S. patent 7,272,224, assigned to the assignee of the present application, which is hereby incorporated by reference in its entirety.

In some implementations, different sound detectors require the sound input to be filtered and/or preprocessed in different ways. For example, in some implementations, the noise detector 402 is configured to analyze a time-domain audio signal between 60 Hz and 20,000 Hz, and the sound type detector is configured to perform frequency-domain analysis of audio between 60 Hz and 3,000 Hz. Thus, in some implementations, the audio DSP 412 (and/or other audio DSPs of the device 104) preprocesses the received audio according to the respective needs of the sound detectors. In some implementations, on the other hand, the sound detectors are configured to filter and/or preprocess the audio from the audio subsystem 226 according to their particular needs. In such cases, the audio DSP 412 may still perform noise cancellation before providing the sound input to the sound detectors.

In some implementations, the context of the electronic device is used to help determine whether and how to operate the voice trigger. For example, when the device is stored in a user's pocket, purse, or backpack, the user is unlikely to invoke a voice-based service, such as a voice-based digital assistant. The user is also unlikely to invoke a voice-based service while listening to a loud rock concert. For some users, it is unlikely that they will invoke a voice-based service at certain times of the day (e.g., late at night). On the other hand, there are also contexts in which a user is likely to use a voice trigger to invoke a voice-based service. For example, some users are likely to use a voice trigger while driving, when alone, while at work, and the like. Various techniques are used to determine the context of the device. In various implementations, the device uses information from any one or more of the following components or information sources to determine its context: GPS receivers, light sensors, microphones, proximity sensors, orientation sensors, inertial sensors, cameras, communication circuitry and/or antennas, charging and/or power circuitry, switch positions, temperature sensors, compasses, accelerometers, calendars, user preferences, and the like. The context of the device can then be used to adjust whether and how the voice trigger operates. For example, in certain contexts, the voice trigger is deactivated (or operated in a different mode) as long as the context is maintained. For example, in some implementations, the voice trigger is deactivated when the phone is in a predetermined orientation (e.g., lying face down on a surface), during a predetermined time period (e.g., between 10:00 pm and 8:00 am), when the phone is in a "silent" or "do not disturb" mode (e.g., based on a switch position, mode setting, or user preference), when the device is in a substantially enclosed space (e.g., a pocket, bag, purse, drawer, or glove box), when the device is near other devices that have voice triggers and/or voice-based services (e.g., based on proximity sensors, acoustic/wireless/infrared communications), and so forth. In some implementations, instead of being deactivated, the voice trigger system 400 is operated in a low-power mode (e.g., by operating the noise detector 402 according to a duty cycle with a 10 millisecond "on" period and a 5 second "off" period). In some implementations, the audio channel is monitored at a lower frequency when the voice trigger system 400 operates in the low-power mode.
In some implementations, when the voice trigger is in the low-power mode, it uses a different sound detector or combination of sound detectors than it does in normal mode. (The voice trigger may be capable of many different modes or operating states, each of which may use a different amount of power, and different implementations will use these modes or states according to their specific designs.)

On the other hand, when the device is in certain other contexts, the voice trigger is activated (or operated in a different mode) as long as the context is maintained. For example, in some implementations, the voice trigger remains active when the device is plugged into external power, when the phone is in a predetermined orientation (e.g., lying face up on a surface), during a predetermined time period (e.g., between 8:00 am and 10:00 pm), when the device is traveling and/or in a vehicle (e.g., based on GPS signals, a Bluetooth connection, or docking with the vehicle), and so forth. Detecting that a device is in a vehicle is described in more detail in U.S. provisional patent application 61/657,744, assigned to the assignee of the present application, which is hereby incorporated by reference in its entirety. Several specific examples of how certain contexts are determined are provided below. In various implementations, these and other contexts are detected using different techniques and/or information sources.

As described above, whether the voice trigger system 400 is active (e.g., listening) may depend on the physical orientation of the device. In some implementations, the voice trigger is active when the device is placed "face up" on a surface (e.g., with the display and/or touch screen surface visible), and/or inactive when the device is placed "face down." This provides the user with an easy way to activate and/or deactivate the voice trigger without having to manipulate a settings menu, switch, or button. In some implementations, the device detects whether it is placed face up or face down on a surface using light sensors (e.g., based on the difference in incident light reaching the front and back of the device 104), proximity sensors, magnetic sensors, accelerometers, gyroscopes, tilt sensors, cameras, and the like. In some implementations, other operating modes, settings, parameters, or preferences are affected by the orientation and/or position of the device. In some implementations, the particular trigger sound, word, or phrase that the voice trigger listens for depends on the orientation and/or position of the device. For example, in some implementations, the voice trigger listens for a first trigger word, phrase, or sound when the device is in one orientation (e.g., lying face up on a surface), and for a different trigger word, phrase, or sound when the device is in another orientation (e.g., lying face down). In some implementations, the trigger phrase for the face-down orientation is longer and/or more complex than the trigger phrase for the face-up orientation. Thus, a user can place the device face down when other people are around or in a noisy environment, so that the voice trigger can still operate while also reducing false acceptances, which may occur more frequently with shorter or simpler trigger words. As a specific example, the face-up trigger phrase may be "hey, SIRI," and the face-down trigger phrase may be "hey, SIRI, this is Andrew, please wake up." The longer trigger phrase also provides a larger voice sample for the sound detectors and/or voice authenticator to process and/or analyze, thereby improving the accuracy of the voice trigger and reducing false acceptances.
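
The orientation-dependent choice of trigger phrase could be expressed as a small lookup, as in the sketch below. The example phrases are the ones given above; the sensor inputs, thresholds, and sign convention are illustrative assumptions rather than the device's actual orientation logic.

```python
# Example trigger phrases from the text; longer/more complex when face down.
TRIGGER_PHRASES = {
    "face_up": "hey, SIRI",
    "face_down": "hey, SIRI, this is Andrew, please wake up",
}

def current_trigger_phrase(front_light, back_light, accel_z):
    """Pick the trigger phrase to listen for based on device orientation.

    Here, a front light sensor reading much darker than the back sensor,
    combined with a negative z-axis acceleration, is taken to mean the
    display is facing down; both the comparison and the sign convention
    are assumptions for illustration.
    """
    face_down = accel_z < 0 and front_light < back_light
    return TRIGGER_PHRASES["face_down" if face_down else "face_up"]

print(current_trigger_phrase(front_light=2, back_light=180, accel_z=-9.8))
```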

In some implementations, the device 104 detects whether it is in a vehicle (e.g., an automobile). A voice trigger is particularly beneficial for invoking a voice-based service when the user is in a vehicle, because it helps reduce the physical interaction necessary to operate the device and/or the voice-based service. Indeed, one of the benefits of a voice-based digital assistant is that it can be used to perform tasks in situations where looking at or touching the device would be impractical or unsafe. Thus, the voice trigger may be used while the device is in a vehicle so that the user does not have to touch the device in order to invoke the digital assistant. In some implementations, the device determines that it is in a vehicle by detecting that it has been connected to and/or paired with the vehicle, such as through Bluetooth communication (or other wireless communication) or through a docking interface or cable. In some implementations, the device determines that it is in a vehicle by determining its location and/or velocity (e.g., using a GPS receiver, accelerometer, and/or gyroscope). If it is determined that the device is likely in a vehicle, for example because it is traveling at a speed above 20 miles per hour and is determined to be traveling along a road, the voice trigger remains active and/or in a higher-power or more sensitive state.

In some implementations, the device detects whether it is stored (e.g., in a pocket, purse, bag, drawer, etc.) by determining whether it is in a substantially enclosed space. In some implementations, the device uses light sensors (e.g., a dedicated ambient light sensor and/or a camera) to determine that it is stored. For example, in some implementations, the device is likely to be stored if the light sensor detects little or no light. In some implementations, the time of day and/or the location of the device are also taken into account. For example, if the light sensor detects a low light level when a high light level would be expected (e.g., during the day), the device may be stored and the voice trigger system 400 is not needed. Thus, the voice trigger system 400 is placed in a low-power or standby state. In some implementations, the difference between the light detected by sensors located on opposite faces of the device can be used to determine the position of the device, and thus whether it is stored. In particular, a user is likely to attempt to activate the voice trigger when the device is resting on a table or other surface, rather than when it is stored in a pocket or bag. When the device is placed face down (or face up) on a surface such as a table or desk, however, one face of the device is occluded so that little or no light reaches it, while the other face is exposed to ambient light. Thus, if the light sensors on the front and back of the device detect significantly different light levels, the device determines that it is not stored. On the other hand, if the light sensors on opposite faces detect the same or similar light levels, the device determines that it is stored in a substantially enclosed space. Additionally, if both light sensors detect low light levels during the daytime (or when the device expects to be in a bright environment), the device determines with greater confidence that it is stored.
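
The front/back light-sensor comparison used to decide whether the device is stored might be sketched as follows; the lux thresholds, similarity ratio, and daytime heuristic are assumptions for illustration only.

```python
def is_stored(front_lux, back_lux, hour_of_day,
              dark_lux=5.0, similar_ratio=0.5):
    """Guess whether the device is in a substantially enclosed space.

    Significantly different readings on opposite faces suggest the device
    is lying on a surface (one face occluded, one exposed); similar low
    readings suggest it is stored. Low light during daytime hours raises
    confidence that the device is stored.
    """
    both_dark = front_lux < dark_lux and back_lux < dark_lux
    hi, lo = max(front_lux, back_lux), min(front_lux, back_lux)
    similar = hi == 0 or (lo / hi) > similar_ratio
    daytime = 8 <= hour_of_day <= 20
    if both_dark and daytime:
        return True                 # dark on both faces when light is expected
    return similar and both_dark

print(is_stored(front_lux=1.0, back_lux=2.0, hour_of_day=14))    # True
print(is_stored(front_lux=0.5, back_lux=300.0, hour_of_day=14))  # False
```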

In some implementations, other techniques (instead of or in addition to light sensors) are used to determine whether the device is stored. For example, in some implementations, the device emits one or more sounds (e.g., tones, clicks, taps, etc.) from a speaker or transducer (e.g., speaker 228) and monitors one or more microphones or transducers (e.g., microphone 230) to detect echoes of the emitted sound or sounds. (In some implementations, the device emits an inaudible signal, such as a sound outside the range of human hearing.) From the echoes, the device determines characteristics of the surrounding environment. For example, a relatively large environment (e.g., a room or a vehicle) reflects sound differently than a relatively small, enclosed environment (e.g., a pocket, purse, bag, drawer, etc.).

In some implementations, the voice trigger system 400 operates differently when it is located near other devices (such as other devices with voice triggers and/or voice-based services) than when it is not. For example, where many devices are in close proximity to one another, it may be useful to turn off or desensitize the voice trigger system 400 so that, when a person utters a trigger word, other nearby devices are not activated as well. In some implementations, the device uses RFID, near-field communication, infrared/acoustic signals, and the like to determine its proximity to other devices. As mentioned above, voice triggers are particularly useful when the device is operated in a hands-free mode, such as when the user is driving a vehicle. In such cases, users often use external audio systems, such as wired or wireless headsets, watches with speakers and/or microphones, built-in microphones and speakers in a vehicle, and so forth, so that they do not need to hold the device near their face to make a call or dictate text input. For example, a wireless headset or a vehicle audio system may be connected to the electronic device using Bluetooth communication or any other suitable wireless communication. However, having a wireless audio accessory monitor the received audio may be inefficient because of the power required by the wireless accessory to keep the audio channel open. In particular, a wireless headset may hold enough charge in its battery to provide only several hours of continuous talk time, so it is preferable to reserve that battery for situations where the headset is needed for actual communication, rather than using it simply to monitor ambient audio and wait for a possible trigger sound. Moreover, a wired external headset accessory may require significantly more power than using the on-device microphone alone, and keeping the headset microphone active would drain the device's battery. This is especially true given that the ambient audio received by a wireless or wired headset will typically consist mostly of silence or irrelevant sounds. Thus, in some implementations, the voice trigger system 400 monitors audio from the microphone 230 on the device even when the device is coupled to an external microphone (wired or wireless). Then, when the voice trigger detects the trigger word, the device initializes an active audio link with the external microphone so that subsequent sound input (such as a command to the voice-based digital assistant) is received through the external microphone rather than the on-device microphone 230. However, when certain conditions are met, an active communication link may instead be maintained between the external audio system 416 (which may be communicatively coupled to the device 104 via a wired or wireless connection) and the device, so that the voice trigger system 400 can listen for trigger sounds through the external audio system 416 instead of (or in addition to) the on-device microphone 230. For example, in some implementations, characteristics of the motion of the electronic device and/or the external audio system 416 (e.g., as determined by accelerometers, gyroscopes, and the like on the respective devices) are used to determine whether the voice trigger system 400 should monitor ambient sound using the on-device microphone 230 or the external microphone 418.
In particular, the difference between the motion of the device and the motion of the external audio system 416 provides information about whether the external audio system 416 is actually in use. For example, if both the device and the wireless headset are moving (or stationary) in substantially the same way, it may be determined that the headset is not being used or worn. This may occur, for example, because the two devices are near each other and idle (e.g., sitting on a table or stowed in a pocket, bag, purse, drawer, etc.). Accordingly, in these cases, the voice trigger system 400 monitors the on-device microphone, because the headset is unlikely to be in use. If, however, there is a difference between the motion of the wireless headset and that of the device, it may be determined that the headset is being worn by the user. This may arise, for example, because the device has been set down (e.g., on a surface or in a bag) while the headset is worn on the user's head (so that the headset will likely move at least slightly even when the wearer is relatively still). In these cases, because the headset is likely being worn, the voice trigger system 400 maintains an active communication link and monitors the headset's microphone 418 instead of (or in addition to) the on-device microphone 230. And because the technique focuses on the difference between the motion of the device and that of the headset, motion common to both devices can be cancelled out. This is useful, for example, when the user is wearing the headset in a moving vehicle, where the device (e.g., a cellular telephone) is sitting in a cup holder, on an empty seat, or in the user's pocket, while the headset is worn on the user's head. Once the motion common to both devices is cancelled out (e.g., the motion of the vehicle), the relative motion of the headset compared to the device (if any) can be determined in order to establish whether the headset is likely in use (or whether the headset is not being worn). Although the above discussion refers to wireless headsets, similar techniques may also be applied to wired headsets.
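
The following Python sketch illustrates one way such a motion-difference test could look. It assumes synchronized accelerometer samples from both devices are available as arrays; the variance threshold and the decision rule are illustrative assumptions, not values from the source.

```python
import numpy as np

def headset_in_use(device_accel: np.ndarray, headset_accel: np.ndarray,
                   threshold: float = 0.05) -> bool:
    """Return True to monitor the headset microphone, False for the on-board one.

    Both inputs are (N, 3) arrays of synchronized accelerometer samples.
    Subtracting the device motion from the headset motion cancels motion
    common to both (e.g., a moving vehicle); whatever variance remains is
    the headset moving independently, i.e. probably being worn.
    """
    relative = headset_accel.astype(float) - device_accel.astype(float)
    relative -= relative.mean(axis=0)            # drop constant offsets (gravity, sensor bias)
    residual = np.mean(np.linalg.norm(relative, axis=1) ** 2)
    return float(residual) > threshold
```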

Because human voices vary widely, it may be necessary or advantageous to tune a voice trigger to improve its accuracy in recognizing the voice of a particular user. In addition, a person's voice may change over time, for example, because of illness, natural voice changes associated with aging, hormonal changes, and the like. Thus, in some implementations, the voice trigger system 400 can adapt its voice and/or sound recognition profiles for a particular user or group of users. As described above, a sound detector (e.g., the sound type detector 404 and/or the trigger sound detector 406) may be configured to compare a representation of a sound input (e.g., the sound or utterance provided by a user) to one or more reference representations. For example, if the input representation matches the reference representation with a predetermined confidence, the sound detector determines that the sound input corresponds to a predetermined sound type (e.g., the sound type detector 404) or that the sound input includes predetermined content (e.g., the trigger sound detector 406). To tune the voice trigger system 400, in some implementations, the device adjusts the reference representation to which the input representation is compared. In some implementations, the reference representation is adjusted (or created) as part of a voice enrollment or "training" procedure in which the user utters the trigger sound several times so that the device can adjust (or create) the reference representation. The device can then use the person's actual speech to create the reference representation.
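
A minimal sketch of this kind of enrollment and reference comparison is shown below. It assumes the sound input has already been reduced to a fixed-length feature vector, and it uses cosine similarity as a stand-in for whatever matching measure the detectors actually use; the confidence threshold is illustrative.

```python
import numpy as np

def enroll_reference(training_reprs: list) -> np.ndarray:
    """Create a reference representation from several enrollment utterances."""
    return np.mean(np.stack(training_reprs), axis=0)

def matches_reference(input_repr: np.ndarray, reference_repr: np.ndarray,
                      confidence_threshold: float = 0.8):
    """Compare an input representation to the stored reference.

    Returns (matched, confidence).  Cosine similarity stands in for the
    detector's actual matching measure; the threshold plays the role of the
    'predetermined confidence' mentioned in the text.
    """
    denom = np.linalg.norm(input_repr) * np.linalg.norm(reference_repr) + 1e-12
    confidence = float(np.dot(input_repr, reference_repr) / denom)
    return confidence >= confidence_threshold, confidence
```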

In some implementations, the device uses trigger sounds received under normal use conditions to adjust the reference representation. For example, after a successful voice trigger event (e.g., when a sound input is found to satisfy all of the trigger criteria), the device uses information from the sound input to adjust and/or tune the reference representation. In some implementations, only sound inputs that are determined to satisfy all or some of the trigger criteria with a certain confidence are used to adjust the reference representation. Thus, when the voice trigger has low confidence that a sound input corresponds to or includes the trigger sound, that sound input may be ignored for purposes of adjusting the reference representation. On the other hand, in some implementations, sound inputs that satisfy the voice trigger system 400 with lower confidence are also used to adjust the reference representation.

In some implementations, as more and more sound inputs are received, the device 104 iteratively adjusts the reference representation (using these or other techniques) so that small changes in a user's voice over time can be accommodated. For example, in some implementations, the device 104 (and/or an associated device or service) adjusts the reference representation after each successful trigger event. In some implementations, the device 104 analyzes the sound input associated with each successful trigger event, determines whether the reference representation should be adjusted based on that input (e.g., if certain conditions are met), and adjusts the reference representation only if it is appropriate to do so. In some implementations, the device 104 maintains a moving average of the reference representation over time.

In some implementations, the voice trigger system 400 detects sound inputs that do not satisfy one or more of the trigger criteria (e.g., as determined by one or more of the sound detectors) but that may in fact be valid attempts by an authorized user. For example, the voice trigger system 400 may be configured to respond to a trigger phrase such as "Hey, SIRI," but if the user's voice has changed (e.g., due to illness, age, accent/inflection changes, etc.), the voice trigger system 400 may not recognize the user's attempt to activate the device. (This may also occur if the voice trigger system 400 has not been properly tuned to the user's particular voice, such as when the voice trigger system 400 is set to default conditions and/or the user has not performed an initialization or training procedure to customize the voice trigger system 400 for his or her voice.) If the voice trigger system 400 does not respond to the user's first attempt to activate the voice trigger, the user is likely to repeat the trigger phrase. The device detects that these repeated sound inputs are similar to one another and/or that they are similar to the trigger phrase (though not similar enough to cause the voice trigger system 400 to activate the voice-based service). If such conditions are satisfied, the device determines that the sound inputs correspond to valid attempts to activate the voice trigger system 400. Accordingly, in some implementations, the voice trigger system 400 uses those received sound inputs to adjust one or more aspects of the voice trigger system 400 so that similar utterances by the user will be accepted as valid triggers in the future. In some implementations, these sound inputs are used to adjust the voice trigger system 400 only if certain conditions or combinations of conditions are met. For example, in some implementations, the sound inputs are used to adjust the voice trigger system 400 when a predetermined number of sound inputs is received in succession (e.g., 2, 3, 4, 5, or any other suitable number), when the sound inputs are sufficiently similar to the reference representation, when the sound inputs are sufficiently similar to one another, when the sound inputs are close together (e.g., received within a predetermined period of time and/or at or near a predetermined interval), and/or any combination of these or other conditions. In some cases, the voice trigger system 400 detects one or more sound inputs that do not satisfy one or more of the trigger criteria shortly before the user manually initiates the voice-based service (e.g., by pressing a button or icon).
In some implementations, because the voice-based service was manually initiated shortly after the sound input was received, the voice trigger system 400 determines that the sound input actually corresponded to a failed voice trigger attempt. Accordingly, as described above, the voice trigger system 400 uses those received sound inputs to adjust one or more aspects of the voice trigger system 400 so that such utterances by the user will be accepted as valid triggers in the future.
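
The sketch below illustrates one way such conditions could gate the adjustment, combining the consecutive-count, similarity, and time-proximity checks described above with a moving-average update of the reference representation. The thresholds, the learning rate, and the exponential-moving-average form are assumptions made for illustration only.

```python
import numpy as np

def maybe_adapt_reference(reference: np.ndarray, near_misses: list,
                          min_count: int = 3, max_gap_s: float = 5.0,
                          min_similarity: float = 0.6,
                          learning_rate: float = 0.1) -> np.ndarray:
    """Adapt the reference representation from repeated near-miss attempts.

    `near_misses` holds (timestamp, feature_vector) pairs for sound inputs
    that resembled the trigger phrase but did not satisfy the trigger
    criteria.  The reference is nudged toward their mean only if enough of
    them arrived close together and they are similar enough to the current
    reference; otherwise it is returned unchanged.
    """
    if len(near_misses) < min_count:
        return reference
    recent = near_misses[-min_count:]
    times = [t for t, _ in recent]
    vecs = [v for _, v in recent]
    if max(b - a for a, b in zip(times, times[1:])) > max_gap_s:
        return reference                              # not close enough in time
    mean_vec = np.mean(np.stack(vecs), axis=0)
    denom = np.linalg.norm(mean_vec) * np.linalg.norm(reference) + 1e-12
    if float(np.dot(mean_vec, reference) / denom) < min_similarity:
        return reference                              # too unlike the trigger phrase
    # Exponential moving average: a simple form of the running adjustment.
    return (1.0 - learning_rate) * reference + learning_rate * mean_vec
```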

Although the adjustment techniques described above refer to adjusting the reference representation, other aspects of the trigger sound detection technique may be adjusted in the same or similar ways, in addition to or instead of the reference representation. For example, in some implementations, the device adjusts how and/or which filters are applied to the sound input, for example to emphasize and/or attenuate certain frequencies or frequency ranges of the sound input. In some implementations, the device adjusts the algorithm used to compare the input representation to the reference representation. For example, in some implementations, one or more terms of the mathematical function used to determine the difference between the input representation and the reference representation are changed, added, or removed, or a different mathematical function is substituted.

In some implementations, adjustment techniques such as those described above require more resources than the voice trigger system 400 can provide or is configured to provide. In particular, the sound detectors may not have access to the number or types of processors, data, or memory needed to iteratively adjust the reference representation and/or the sound detection algorithm (or any other suitable aspect of the voice trigger system 400). Thus, in some implementations, one or more of the adjustment techniques described above are performed by a more powerful processor, such as an application processor (e.g., the processor(s) 204), or by a different device (e.g., the server system 108). However, the voice trigger system 400 is designed to operate even while the application processor is in a standby mode, so sound input that could be used to adjust the voice trigger system 400 may be received while the application processor is inactive and unable to process it. Thus, in some implementations, the sound input is stored by the device so that it can be further processed and/or analyzed after it has been received. In some implementations, the sound input is stored in the memory buffer 414 of the audio subsystem 226. In some implementations, the sound input is stored in system memory (e.g., memory 250, fig. 2) using direct memory access (DMA) techniques, including, for example, using a DMA engine to copy or move the data without waking the application processor. Then, once the application processor (or the server system 108, or another suitable device) is awake, the stored sound input is provided to or accessed by it so that it can perform one or more of the adjustment techniques described above.

Figs. 5-7 are flow diagrams illustrating methods for operating a voice trigger, according to some implementations. The methods are optionally controlled by instructions that are stored in a computer memory or non-transitory computer-readable storage medium (e.g., memory 250 of client device 104, memory 302 associated with the digital assistant system 300) and executed by one or more processors of one or more computer systems of a digital assistant system, including, but not limited to, the server system 108 and/or the user device 104a. The computer-readable storage medium may include a magnetic or optical disk storage device, a solid-state storage device such as flash memory, or one or more other non-volatile memory devices. The computer-readable instructions stored on the computer-readable storage medium may include one or more of: source code, assembly language code, object code, or another instruction format that is interpreted by one or more processors. In various implementations, some operations in each method may be combined and/or the order of some operations may be changed from the order shown in the figures. Also, in some implementations, operations shown in separate figures and/or discussed in association with separate methods may be combined to form other methods, and operations shown in the same figure and/or discussed in association with the same method may be separated into different methods. Moreover, in some implementations, one or more operations in the methods are performed by modules of the digital assistant system 300 and/or an electronic device (e.g., the user device 104), including, for example, the natural language processing module 332, the dialog flow processing module 334, the audio subsystem 226, the noise detector 402, the sound type detector 404, the trigger sound detector 406, the speech-based service 408, and/or any sub-modules thereof.

Fig. 5 illustrates a method 500 for operating a voice trigger system (e.g., the voice trigger system 400, fig. 4), according to some implementations. In some implementations, the method 500 is performed on an electronic device (e.g., the electronic device 104) that includes one or more processors and memory storing instructions for execution by the one or more processors. The electronic device receives a sound input (502). The sound input may correspond to a spoken utterance (e.g., a word, phrase, or sentence), a human-generated sound (e.g., a whistle, a tongue click, a finger snap, a hand clap, etc.), or any other sound (e.g., an electronically generated chirp, a mechanical noise source, etc.). In some implementations, the electronic device receives the sound input via the audio subsystem 226 (including, for example, the codec 410, audio DSP 412, and buffer 414, as well as the microphones 230 and 418, described with reference to fig. 4).

In some implementations, the electronic device determines whether the sound input satisfies a predetermined condition (504). In some implementations, the electronic device applies time-domain analysis to the sound input to determine whether the sound input satisfies the predetermined condition. For example, the electronic device analyzes the sound input over a period of time to determine whether the amplitude of the sound reaches a predetermined level. In some implementations, the condition is satisfied if the amplitude (e.g., the volume) of the sound input meets and/or exceeds a predetermined threshold. In some implementations, the condition is satisfied if the sound input meets and/or exceeds a predetermined threshold for a predetermined amount of time. As discussed in more detail below, in some implementations, determining whether the sound input satisfies the predetermined condition (504) is performed by a third sound detector (e.g., the noise detector 402). (Here, "third" is used to distinguish this sound detector from other sound detectors (e.g., the first and second sound detectors discussed below), and does not necessarily indicate any operational position or order of the sound detectors.)
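
A minimal sketch of such a time-domain check is shown below; the amplitude threshold and the required fraction of the frame are placeholder values, not figures from the source.

```python
import numpy as np

def satisfies_predetermined_condition(frame: np.ndarray,
                                      amplitude_threshold: float = 0.02,
                                      min_fraction: float = 0.5) -> bool:
    """Low-power, time-domain check corresponding to step 504.

    `frame` is a short window of samples normalized to [-1, 1].  The
    condition is met when a large enough fraction of the frame stays above
    the amplitude threshold, i.e. the sound is loud enough for long enough.
    """
    above = np.abs(frame) >= amplitude_threshold
    return float(np.mean(above)) >= min_fraction
```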

The electronic device determines whether the sound input corresponds to a predetermined sound type (506). As noted above, sounds are categorized into different "types" based on certain identifiable features of the sound. Determining whether the sound input corresponds to the predetermined type includes determining whether the sound input includes or exhibits the features of that particular type. In some implementations, the predetermined sound type is a human voice. In such implementations, determining whether the sound input corresponds to a human voice includes determining whether the sound input includes frequencies characteristic of a human voice (508). As discussed in more detail below, in some implementations, determining whether the sound input corresponds to the predetermined sound type (506) is performed by a first sound detector (e.g., the sound type detector 404). Upon determining that the sound input corresponds to the predetermined sound type, the electronic device determines whether the sound input includes predetermined content (510). In some implementations, the predetermined content corresponds to one or more predetermined phonemes (512). In some implementations, the one or more predetermined phonemes constitute at least one word. In some implementations, the predetermined content is a sound (e.g., a whistle, a click, or a clap). In some implementations, determining whether the sound input includes the predetermined content (510) is performed by a second sound detector (e.g., the trigger sound detector 406), as described below.

Upon determining that the sound input includes the predetermined content, the electronic device initiates a voice-based service (514). In some implementations, as detailed above, the voice-based service is a voice-based digital assistant. In some implementations, the speech-based service is a dictation service in which speech input is converted to text and included and/or displayed in a text entry field (e.g., of an email, text message, word processing or note taking application, etc.). In implementations where the voice-based service is a voice-based digital assistant, once the voice-based digital assistant is activated, a prompt (e.g., a sound or voice prompt) is issued to the user indicating that the user may provide voice input and/or commands to the digital assistant. In some implementations, launching the voice-based digital assistant includes enabling an application processor (e.g., one or more processors 204, fig. 2), launching one or more programs or modules (e.g., digital assistant client module 264, fig. 2), and/or establishing a connection to a remote server or device (e.g., digital assistant server 106, fig. 1).
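
Putting steps 504, 506, 510, and 514 together, the cascade might be sketched as follows. The four callables are hypothetical stand-ins for the noise detector, sound type detector, trigger sound detector, and service launcher; the point is only that each later, more expensive stage runs when the earlier one succeeds.

```python
def process_sound_input(frame, noise_ok, is_voice, has_trigger, start_service) -> bool:
    """Cascade sketch of steps 504, 506, 510, and 514 of method 500."""
    if not noise_ok(frame):        # step 504: amplitude/duration condition
        return False
    if not is_voice(frame):        # step 506: predetermined sound type (e.g., human voice)
        return False
    if not has_trigger(frame):     # step 510: predetermined content (trigger word/phrase)
        return False
    start_service(frame)           # step 514: initiate the voice-based service
    return True
```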

In some implementations, the electronic device determines whether the sound input corresponds to the voice of a particular user (516). For example, one or more voice authentication techniques are applied to the sound input to determine whether it corresponds to the voice of an authorized user of the device. Voice authentication techniques are described in greater detail above. In some implementations, voice authentication is performed by one of the sound detectors (e.g., the trigger sound detector 406). In some implementations, voice authentication is performed by a dedicated voice authentication module (including any suitable hardware and/or software). In some implementations, the voice-based service is initiated in response to determining both that the sound input includes the predetermined content and that the sound input corresponds to the voice of the particular user. Thus, for example, the voice-based service (e.g., a voice-based digital assistant) is initiated only when the trigger word or phrase is spoken by an authorized user. This reduces the possibility that the service can be invoked by an unauthorized user, and is especially useful where multiple electronic devices are in close proximity, because a trigger sound uttered by one user will not activate another user's voice trigger.

In some implementations, where the voice-based service is a voice-based digital assistant, the voice-based digital assistant is initiated in a limited-access mode in response to determining that the sound input includes the predetermined content but does not correspond to the voice of the particular user. In some implementations, the limited-access mode allows the digital assistant to access only a subset of the data, services, and/or functionality that the digital assistant can otherwise provide. In some implementations, the limited-access mode corresponds to a write-only mode (e.g., so that an unauthorized user of the digital assistant cannot access data from calendars, task lists, contacts, photographs, emails, text messages, etc.). In some implementations, the limited-access mode corresponds to a sandboxed instance of the voice-based service, so that the voice-based service does not read from or write to the user's data, such as user data 266 (fig. 2) on the device 104 or user data on any other device (e.g., user data 348, fig. 3A, which may be stored on a remote server, such as the server system 108, fig. 1).

In some implementations, in response to determining that the sound input includes the predetermined content and that the sound input corresponds to the voice of a particular user, the voice-based digital assistant issues a prompt that includes the name of the particular user. For example, when a particular user is identified through voice authentication, the voice-based digital assistant may issue a prompt such as "What can I help you with, Peter?" rather than a more generic prompt such as a tone, beep, or non-personalized voice prompt.

As described above, in some implementations, a first sound detector determines whether the sound input corresponds to a predetermined sound type (at step 506), and a second sound detector determines whether the sound input includes predetermined content (at step 510). In some implementations, the first sound detector consumes less power when operating than the second sound detector, for example because the first sound detector uses a less processor-intensive technique than the second sound detector. In some implementations, the first sound detector is the sound type detector 404 and the second sound detector is the trigger sound detector 406, both discussed above with respect to fig. 4. In some implementations, as described above with reference to fig. 4, in operation the first sound detector and/or the second sound detector periodically monitor an audio channel according to a duty cycle.

In some implementations, the first sound detector and/or the second sound detector perform a frequency-domain analysis of the sound input. For example, these sound detectors perform a Laplace transform, a Z-transform, or a Fourier transform to generate or determine a spectral density of the sound input or a portion thereof. In some implementations, the first sound detector is a voice activity detector configured to determine whether the sound input includes frequencies that are characteristic of a human voice (or other features, aspects, or properties of the sound input that are characteristic of a human voice).
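
A simple frequency-domain voice-activity check along these lines might look like the following sketch, which uses an FFT to estimate how much of the frame's energy falls in a rough voice band. The band limits and the energy ratio are illustrative assumptions.

```python
import numpy as np

def looks_like_human_voice(frame: np.ndarray, sample_rate: int = 16000,
                           band=(85.0, 3000.0), band_ratio: float = 0.6) -> bool:
    """Frequency-domain check for frequencies characteristic of a human voice.

    Computes the frame's power spectrum with an FFT (one of the transforms
    mentioned above) and tests whether most of the energy falls inside a
    rough voice band.
    """
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    in_band = spectrum[(freqs >= band[0]) & (freqs <= band[1])].sum()
    total = spectrum.sum() + 1e-12
    return float(in_band / total) >= band_ratio
```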

In some implementations, the second sound detector is off or inactive until the first sound detector detects a sound input of the predetermined type. Thus, in some implementations, the method 500 includes initiating the second sound detector in response to determining that the sound input corresponds to the predetermined type. (In other implementations, the second sound detector is initiated in response to other conditions, or operates continuously regardless of any determination by the first sound detector.) In some implementations, initiating the second sound detector includes activating hardware and/or software (including, for example, circuits, processors, programs, memory, etc.). In some implementations, the second sound detector operates (e.g., is active and monitoring the audio channel) for at least a predetermined amount of time after it is initiated. For example, when the first sound detector determines that the sound input corresponds to the predetermined type (e.g., includes a human voice), the second sound detector operates to determine whether the sound input also includes the predetermined content (e.g., the trigger word). In some implementations, the predetermined amount of time corresponds to the duration of the predetermined content. Thus, if the predetermined content is the phrase "Hey, SIRI," the predetermined amount of time will be long enough to determine whether that phrase was uttered (e.g., 1 or 2 seconds, or any other suitable duration). If the predetermined content is longer, such as the phrase "Hey, SIRI, please wake up and help me," the predetermined amount of time will be longer (e.g., 5 seconds, or another suitable duration). In some implementations, the second sound detector operates whenever the first sound detector detects a sound corresponding to the predetermined type. In such implementations, for example, whenever the first sound detector detects human speech in the sound input, the second sound detector processes the sound input to determine whether it includes the predetermined content.
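
The timed activation window could be sketched as below; the window length stands in for the duration of the predetermined content, and both callables are hypothetical stand-ins for the trigger sound detector and the audio source.

```python
import time

def run_trigger_detector_window(has_trigger, read_frame, window_s: float = 2.0) -> bool:
    """Keep the second (trigger sound) detector active for a fixed window.

    `window_s` plays the role of the predetermined amount of time (longer
    for longer trigger phrases).  `read_frame` yields successive audio
    frames; `has_trigger` is the trigger sound detector's check.
    """
    deadline = time.monotonic() + window_s
    while time.monotonic() < deadline:
        if has_trigger(read_frame()):
            return True                 # predetermined content found within the window
    return False                        # window elapsed without detecting the trigger
```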

As described above, in some implementations, a third sound detector (e.g., the noise detector 402) determines whether the sound input satisfies a predetermined condition (at step 504). In some implementations, the third sound detector consumes less power when operating than the first sound detector. In some implementations, the third sound detector periodically monitors the audio channel according to a duty cycle, as discussed above with respect to fig. 4. Also, in some implementations, the third sound detector performs a time-domain analysis of the sound input. In some implementations, the third sound detector consumes less power than the first sound detector because time-domain analysis is less processor intensive than the frequency-domain analysis applied by the first sound detector.

Similar to the discussion above regarding initiating the second sound detector (e.g., the trigger sound detector 406) in response to a determination by the first sound detector (e.g., the sound type detector 404), in some implementations the first sound detector is initiated in response to a determination by the third sound detector (e.g., the noise detector 402). For example, in some implementations, the sound type detector 404 is initiated in response to the noise detector 402 determining that the sound input satisfies a predetermined condition (e.g., is above a certain volume for a sufficient duration). In some implementations, initiating the first sound detector includes activating hardware and/or software (including, for example, circuits, processors, programs, memory, etc.). In other implementations, the first sound detector is initiated in response to other conditions, or operates continuously.

In some implementations, the device stores at least a portion of the sound input in memory (518). In some implementations, the memory is the buffer 414 of the audio subsystem 226 (fig. 4). The stored sound input allows the device to perform non-real-time processing of the sound input. For example, in some implementations, one or more of the sound detectors read and/or receive the stored sound input in order to process it. This is especially useful where an upstream sound detector (e.g., the trigger sound detector 406) is not initiated until the sound input has already been partially received by the audio subsystem 226. In some implementations, the stored portion of the sound input is provided to the voice-based service once the voice-based service is initiated (520). Thus, the voice-based service can transcribe, process, or otherwise operate on the stored portion of the sound input even though the voice-based service was not fully operational until after that portion of the sound input had been received. In some implementations, the stored portion of the sound input is provided to an adjustment module of the electronic device.
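
A minimal sketch of the buffering described in steps (518) and (520) is shown below; the ring-buffer capacity and the frame-based interface are assumptions made for illustration.

```python
from collections import deque

class SoundInputBuffer:
    """Buffer sound input so late-starting detectors or the service can use it."""

    def __init__(self, max_frames: int = 200):
        self._frames = deque(maxlen=max_frames)   # oldest frames are dropped automatically

    def append(self, frame) -> None:
        """Store a newly received audio frame (step 518)."""
        self._frames.append(frame)

    def stored_portion(self) -> list:
        """Hand the stored audio to a detector or to the voice-based service (step 520)."""
        return list(self._frames)
```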

In various implementations, steps (516) - (520) are performed at different locations in method 500. For example, in some implementations, one or more of steps (516) - (520) are performed between steps (502) and (504), between steps (510) and (514), or at any other suitable location.

Fig. 6 illustrates a method 600 for operating a voice trigger system (e.g., the voice trigger system 400, fig. 4), according to some implementations. In some implementations, the method 600 is performed on an electronic device (e.g., the electronic device 104) that includes one or more processors and memory storing instructions for execution by the one or more processors. The electronic device determines whether it is in a predetermined orientation (602). In some implementations, the electronic device detects its orientation using light sensors (including cameras), microphones, proximity sensors, magnetic sensors, accelerometers, gyroscopes, tilt sensors, and the like. For example, the electronic device determines whether it is lying face-down or face-up on a surface by comparing the amount or brightness of light incident on the sensor of a front-facing camera with the amount or brightness of light incident on the sensor of a rear-facing camera. If the amount and/or brightness of light detected by the front-facing camera is substantially greater than that detected by the rear-facing camera, the electronic device determines that it is face-up. On the other hand, if the amount and/or brightness detected by the rear-facing camera is substantially greater than that detected by the front-facing camera, the device determines that it is face-down. Upon determining that the electronic device is in the predetermined orientation, the electronic device activates a predetermined mode of the voice trigger (604). In some implementations, the predetermined orientation corresponds to the display screen of the device being substantially horizontal and facing down, and the predetermined mode is a standby mode (606). For example, in some implementations, if a smartphone or tablet is placed on a table or desk with its screen facing down, the voice trigger is placed in the standby mode (e.g., turned off) to prevent inadvertent activation of the voice trigger.
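
As an illustration of the light-sensor comparison, the following sketch maps front and rear camera light readings to a trigger mode. The dominance factor and the mode names are assumptions, not values from the source.

```python
def select_trigger_mode(front_light: float, rear_light: float,
                        dominance: float = 4.0) -> str:
    """Map front/rear camera light readings to a voice trigger mode.

    Much more light on the front sensor suggests the screen is facing up
    (listening mode, step 608); much more on the rear sensor suggests the
    screen is facing down (standby mode, step 606).
    """
    if front_light > dominance * rear_light:
        return "listening"
    if rear_light > dominance * front_light:
        return "standby"
    return "unchanged"                  # ambiguous reading: keep the current mode
```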

On the other hand, in some implementations, the predetermined orientation corresponds to the display screen of the device being substantially horizontal and right side up, and the predetermined mode is a listening mode (608). Thus, for example, if a smartphone or tablet is placed on a table or desk with its screen facing up, the voice trigger is placed in the listening mode so that it can respond to the user when the trigger sound is detected.

Fig. 7 illustrates a method 700 for operating a voice trigger (e.g., voice trigger system 400, fig. 4), according to some implementations. In some implementations, the method 700 is performed on an electronic device (e.g., the electronic device 104) that includes one or more processors and memory storing instructions for execution by the one or more processors. The electronic device operates a voice trigger (e.g., voice trigger system 400) in a first mode (702). In some implementations, the first mode is a normal listening mode.

The electronic device determines whether it is in a substantially enclosed space by detecting that one or more of a microphone and a camera of the electronic device is occluded (704). In some implementations, the substantially enclosed space includes a pocket, purse, bag, drawer, glove box, briefcase, and the like.

As described above, in some implementations, the device detects that a microphone is occluded by emitting one or more sounds (e.g., tones, clicks, pings, etc.) from a speaker or transducer and monitoring one or more microphones or transducers to detect echoes of the emitted sound(s). For example, a relatively large environment (e.g., a room or vehicle) reflects the sound differently than a relatively small, substantially enclosed environment (e.g., a purse or pocket). Thus, if the device detects, based on the echoes (or lack of echoes), that the microphone (or the speaker that emitted the sound) is occluded, the device determines that it is in a substantially enclosed space. In some implementations, the device detects that the microphone is occluded by detecting that the microphone is picking up sound that is characteristic of an enclosed space. For example, when the device is in a pocket, the microphone may pick up a characteristic rustling noise caused by the microphone touching or being close to the fabric of the pocket. In some implementations, the device detects that the camera is occluded based on the light level received by its sensor, or by determining whether it can obtain a focused image. For example, if the camera sensor detects a low light level at a time when a high light level would be expected (e.g., during the day), the device determines that the camera is occluded and that the device is in a substantially enclosed space. As another example, the camera may attempt to obtain a focused image with its sensor. This will usually be difficult if the camera is in an extremely dark place (e.g., a pocket or backpack) or if it is too close to the object it is trying to focus on (e.g., the inside of a purse or backpack). Thus, if the camera cannot obtain a focused image, the device determines that it is in a substantially enclosed space.
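
One way these occlusion cues could be combined is sketched below; requiring two of the three cues, and the specific light-level fraction, are illustrative policies rather than anything specified in the source.

```python
def in_enclosed_space(mic_occluded_by_echo: bool, light_level: float,
                      expected_light_level: float, camera_can_focus: bool) -> bool:
    """Combine the occlusion cues of step 704 into a single decision.

    Each argument is the outcome of one of the checks described above
    (echo test, measured vs. expected light level, focus test).
    """
    cues = [
        mic_occluded_by_echo,                           # echo suggests the microphone is covered
        light_level < 0.25 * expected_light_level,      # far darker than expected for the time of day
        not camera_can_focus,                           # camera cannot obtain a focused image
    ]
    return sum(cues) >= 2                               # two of three cues -> treat as enclosed
```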

Upon determining that the electronic device is in the substantially enclosed space, the electronic device switches the voice trigger to a second mode (706). In some implementations, the second mode is a standby mode (708). In some implementations, while in the standby mode, the voice trigger system 400 continues to monitor ambient audio but does not respond to received sounds, even if they would otherwise trigger the voice trigger system 400. In some implementations, in the standby mode, the voice trigger system 400 is deactivated and does not process audio to detect trigger sounds. In some implementations, the second mode includes operating one or more of the sound detectors of the voice trigger system 400 according to a different duty cycle than in the first mode. In some implementations, the second mode includes operating a different combination of sound detectors than in the first mode.

In some implementations, the second mode instead corresponds to a more sensitive monitoring mode, so that the voice trigger system 400 can detect and respond to a trigger sound even though it is in a substantially enclosed space. In some implementations, once the voice trigger has been switched to the second mode, the device periodically determines whether the electronic device is still in the substantially enclosed space by detecting whether one or more of the microphone and the camera of the electronic device is occluded (e.g., using any of the techniques described above with respect to step (704)). If the device remains in the substantially enclosed space, the voice trigger system 400 stays in the second mode. If the device is removed from the substantially enclosed space, the electronic device returns the voice trigger to the first mode.

According to some implementations, fig. 8 illustrates a functional block diagram of an electronic device 800 configured in accordance with the principles of the invention as described above. The functional blocks of the device may be implemented by hardware, software, or a combination of hardware and software to carry out the principles of the invention. Those skilled in the art will appreciate that the functional blocks described in fig. 8 may be combined or separated into sub-blocks to implement the principles of the present invention as described above. Thus, the description herein may support any possible combination or separation or further definition of the functional blocks described herein.

As shown in fig. 8, the electronic device 800 includes a sound receiving unit 802 configured to receive sound inputs. The electronic device 800 also includes a processing unit 806 coupled to the sound receiving unit 802. In some implementations, the processing unit 806 includes a noise detection unit 808, a sound type detection unit 810, a trigger sound detection unit 812, a service initiation unit 814, and a voice authentication unit 816. In some implementations, the noise detection unit 808 corresponds to the noise detector 402 described above and is configured to perform any of the operations described above with reference to the noise detector 402. In some implementations, the sound type detection unit 810 corresponds to the sound type detector 404 described above and is configured to perform any of the operations described above with reference to the sound type detector 404. In some implementations, the trigger sound detection unit 812 corresponds to the trigger sound detector 406 described above and is configured to perform any of the operations described above with reference to the trigger sound detector 406. In some implementations, the voice authentication unit 816 corresponds to the voice authentication module 428 described above and is configured to perform any of the operations described above with reference to the voice authentication module 428. The processing unit 806 is configured to: determine whether at least a portion of the sound input corresponds to a predetermined sound type (e.g., with the sound type detection unit 810); upon determining that at least a portion of the sound input corresponds to the predetermined type, determine whether the sound input includes predetermined content (e.g., with the trigger sound detection unit 812); and, upon determining that the sound input includes the predetermined content, initiate a voice-based service (e.g., with the service initiation unit 814).

In some implementations, the processing unit 806 is further configured to determine whether the sound input satisfies a predetermined condition (e.g., with the noise detection unit 808) before determining whether the sound input corresponds to a predetermined sound type. In some implementations, the processing unit 806 is further configured to determine whether the sound input corresponds to a voice of a particular user (e.g., with the voice authentication unit 816).

Fig. 9 illustrates a functional block diagram of an electronic device 900 configured in accordance with the principles of the invention as described above, according to some implementations. The functional blocks of the device may be implemented by hardware, software, or a combination of hardware and software to carry out the principles of the invention. Those skilled in the art will appreciate that the functional blocks described in fig. 9 may be combined or separated into sub-blocks to implement the principles of the present invention as described above. Thus, the description herein may support any possible combination or separation or further definition of the functional blocks described herein.

As shown in fig. 9, the electronic device 900 includes a voice trigger unit 902. The voice trigger unit 902 may operate in different modes. In a first mode, the voice trigger unit receives sound inputs and determines whether they meet certain criteria (e.g., a listening mode). In the second mode, the voice trigger unit 902 does not receive and/or process a sound input (e.g., standby mode). The electronic device 900 further comprises a processing unit 906 coupled to the voice trigger unit 902. In some implementations, the processing unit 906 includes an environment detection unit 908, which may include and/or interact with one or more sensors (e.g., including a microphone, camera, accelerometer, gyroscope, etc.), and a mode switching unit 910. In some implementations, the processing unit 906 is configured to: determining whether the electronic device is in a substantially enclosed space by detecting that one or more of a microphone and a camera of the electronic device is occluded (e.g., with the environment detection unit 908); and upon determining that the electronic device is in the substantially enclosed space, switching the voice trigger to a second mode (e.g., with mode switching unit 910).

In some implementations, the processing unit is configured to: determining whether the electronic device is in a predetermined orientation (e.g., with the environment detection unit 908); and upon determining that the electronic device is in the predetermined orientation, enabling a predetermined mode of the voice trigger (e.g., with mode switching unit 910).

The foregoing description, for purposes of explanation, has been presented with reference to specific implementations. However, the illustrative discussions above are not intended to be exhaustive or to limit the disclosed implementations to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The implementations were chosen and described in order to best explain the principles of the disclosed ideas and their practical applications, to thereby enable others skilled in the art to best utilize them with various modifications as are suited to the particular use contemplated.

It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, without changing the descriptive intent, a first sound detector may be referred to as a second sound detector, and similarly, a second sound detector may be referred to as a first sound detector, so long as all occurrences of the "first sound detector" are renamed uniformly and all occurrences of the "second sound detector" are renamed uniformly. The first sound detector and the second sound detector are both sound detectors, but they are not the same sound detector.

The terminology used herein is for the purpose of describing particular implementations only and is not intended to limit the claims. As used in the description of the implementations and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or," as used herein, refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "if" may be understood, depending on the context, to mean "when," "upon," or "in response to determining." Similarly, the phrase "if it is determined [that a stated prerequisite is true]," "if [a stated prerequisite is true]," or "when [a stated prerequisite is true]" may be understood, depending on the context, to mean "upon determining," "in response to determining," or "in accordance with a determination" that the stated prerequisite is true.
