Hot word aware speech synthesis

Document No.: 1436444 | Published: 2020-03-20

Description: This technology, "Hotword aware speech synthesis," was designed and created by A. Kracun and M. Sharifi on 2018-06-25. Its main content comprises: A method (400) comprising: receiving text input data (302) for conversion into synthesized speech (160), and determining, using a hotword perception model (320) trained to detect the presence of a hotword (130) assigned to a user device (110), whether a pronunciation of the text input data includes the hotword. The hotword is configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in audio input data. When the pronunciation of the text input data includes the hotword, the method further includes generating an audio output signal (304) from the text input data and providing the audio output signal to an audio output device (118) to output the audio output signal. The audio output signal, when captured by an audio capture device of the user device, is configured to prevent initiation of a wake-up process on the user device.

1. A method (400) comprising:

receiving text input data (302) for conversion into synthesized speech (160) at data processing hardware (112) of a speech synthesis apparatus (300);

determining, by the data processing hardware (112) and using a hotword perception model (320) trained to detect the presence of at least one hotword (130) assigned to a user device (110), whether a pronunciation of the text input data (302) includes a hotword (130), the hotword (130), when included in audio input data received by the user device (110), being configured to initiate a wake-up process on the user device (110) for processing the hotword (130) in the audio input data and/or one or more other terms following the hotword (130); and

when the pronunciation of the text input data (302) includes the hotword (130):

generating an audio output signal (304) from the text input data (302); and

providing, by the data processing hardware (112), the audio output signal (304) to an audio output device (118) to output the audio output signal (304), the audio output signal (304), when captured by an audio capture device (116) of the user device (110), being configured to prevent initiation of the wake-up process on the user device (110).

2. The method (400) of claim 1, wherein determining whether the pronunciation of the text input data (302) includes the hotword (130) comprises determining that a pronunciation of at least one of a word, a subword, or a text-to-speech sequence of the text input data (302) is associated with the hotword (130).

3. The method (400) of claim 1 or 2, wherein the hotword perception model (320) is trained on a text-to-speech sequence or audio representation of hotwords (130) assigned to user devices (110).

4. The method (400) according to any one of claims 1-3, wherein the text input data (302) includes a first language and the audio output signal (304) includes a translation of the text input data (302) in a different language.

5. The method (400) according to any one of claims 1-4, further including:

detecting, by data processing hardware (112), a presence of a user device (110) within an operating environment of the speech synthesis device (300); and

querying, by data processing hardware (112), the user device (110) for hotwords (130) assigned to the user device (110) for training the hotword perception model (320).

6. The method (400) according to any one of claims 1-5, further comprising querying, by the data processing hardware (112), a remote hotword repository (142) to obtain at least hotwords (130) assigned to the user device (110) for training the hotword perception model (320).

7. The method (400) according to any one of claims 1-6, wherein generating the audio output signal (304) from the text input data (302) includes inserting a watermark (352) into the audio output signal (304), the watermark (352) indicating that the audio output signal (304) corresponds to synthesized speech (160) and instructing a hotword detector (200) of a user device (110) to ignore detection of hotwords (130) in the synthesized speech (160).

8. The method (400) according to any one of claims 1-6, wherein generating the audio output signal (304) from the text input data (302) includes:

determining a speech waveform representing a text-to-speech output for the text input data (302); and

altering the speech waveform by removing or altering any sounds associated with the hotword (130) to circumvent detection of the hotword (130) by a hotword detector (200) of the user device (110).

9. The method (400) according to any one of claims 1-6, wherein generating the audio output signal (304) from the text input data (302) includes:

determining a speech waveform representing the text input data (302); and

filtering the speech waveform to circumvent detection of the hotword (130) by a hotword detector (200) of the user device (110).

10. A method (500), comprising:

receiving, at a hotword detector (200) of a user device (110), audio input data comprising a hotword (130), the hotword (130) being configured to initiate a wake-up process on the user device (110) for processing the hotword (130) and/or one or more other terms following the hotword (130) in the audio input data;

determining, by the hotword detector (200), whether the audio input data includes synthesized speech (160) using a hotword detector model (220), the hotword detector model (220) configured to detect the presence of hotwords (130) and the synthesized speech (160) in the audio input data; and

preventing, by the hotword detector (200), when the audio input data comprises synthesized speech (160), a wake-up process from being initiated on the user device (110) for processing the hotword (130) and/or one or more other terms following the hotword (130) in the audio input data.

11. The method (500) of claim 10, wherein the hotword detector model (220) is trained on a plurality of training samples, the plurality of training samples including:

positive training samples (212b) comprising human-generated audio data corresponding to one or more users speaking hotwords (130) assigned to the user device (110); and

negative training samples (212a) comprising synthesized speech utterances (160) output from one or more speech synthesizer devices (300).

12. The method (500) of claim 11, wherein at least one of the synthesized speech utterances (160) of the negative training samples (212a) pronounces the hotword (130) assigned to the user device (110).

13. The method (500) of claim 11, wherein none of the synthesized speech utterances (160) of the negative training samples (212a) pronounces the hotword (130) assigned to the user device (110).

14. The method (500) according to any one of claims 10-13, wherein determining whether the audio input data includes the synthesized speech (160) includes detecting a presence of synthesized speech (160) in the audio input data by analyzing acoustic features of the audio input data using the hotword detector model (220) without transcribing or semantically interpreting the audio input data.

15. A system (100) comprising:

data processing hardware (112) of the speech synthesis apparatus (300); and

memory hardware (114) in communication with the data processing hardware (112), the memory hardware (114) storing instructions that, when executed by the data processing hardware (112), cause the data processing hardware (112) to perform operations comprising:

receiving text input data (302) for conversion into synthesized speech (160);

determining, using a hotword perception model (320) trained to detect the presence of at least one hotword (130) assigned to a user device (110), whether a pronunciation of the text input data (302) includes a hotword (130), the hotword (130), when included in audio input data received by the user device (110), being configured to initiate a wake-up process on the user device (110) for processing the hotword (130) in the audio input data and/or one or more other terms following the hotword (130); and

when the pronunciation of the text input data (302) includes the hotword (130):

generating an audio output signal (304) from the text input data (302); and

providing the audio output signal (304) to an audio output device (118) to output the audio output signal (304), the audio output signal (304), when captured by an audio capture device (116) of the user device (110), being configured to prevent initiation of the wake-up process on the user device (110).

16. The system (100) of claim 15, wherein determining whether the pronunciation of the text input data (302) includes the hotword (130) comprises determining that at least one of a word, a subword, or a text-to-speech sequence of the text input data (302) is associated with the hotword (130).

17. The system (100) according to claim 15 or 16, wherein the hotword perception model (320) is trained on a text-to-speech sequence or audio representation of hotwords (130) assigned to user devices (110).

18. The system (100) according to any one of claims 15-17, wherein the text input data (302) includes a first language and the audio output signal (304) includes a translation of the text input data (302) in a different language.

19. The system (100) according to any one of claims 15-18, wherein the operations further include:

detecting a presence of a user device (110) within an operating environment of the speech synthesis device (300); and

querying the user device (110) for hotwords (130) assigned to the user device (110) for training the hotword perception model (320).

20. The system (100) according to any one of claims 15-19, wherein the operations further include querying a remote hotword repository (142) to obtain at least hotwords (130) assigned to user devices (110) for training the hotword perception model (320).

21. The system (100) according to any one of claims 15-20, wherein generating the output audio signal from the text input data (302) includes inserting a watermark (352) into the output audio signal, the watermark (352) indicating that the output audio signal corresponds to synthesized speech (160) and instructing a hotword detector (200) of a user device (110) to ignore detection of hotwords (130) in the synthesized speech (160).

22. The system (100) according to any one of claims 15-20, wherein generating the output audio signal from the text input data (302) includes:

determining a speech waveform representing a text-to-speech output for the text input data (302); and

altering the speech waveform by removing or altering any sounds associated with the hotword (130) to circumvent detection of the hotword (130) by a hotword detector (200) of the user device (110).

23. The system (100) according to any one of claims 15-20, wherein generating the output audio signal from the text input data (302) includes:

determining a speech waveform representing the text input data (302); and

filtering the speech waveform to circumvent detection of the hotword (130) by a hotword detector (200) of the user device (110).

24. A system (100) comprising:

data processing hardware (112) of a user device (110); and

memory hardware (114) in communication with the data processing hardware (112), the memory hardware (114) storing instructions that, when executed by the data processing hardware (112), cause the data processing hardware (112) to perform operations comprising:

receiving, at a hotword detector (200) of a user device (110), audio input data comprising a hotword (130), the hotword (130) being configured to initiate a wake-up process on the user device (110) for processing the hotword (130) and/or one or more other terms following the hotword (130) in the audio input data;

determining, by the hotword detector (200), whether the audio input data includes synthesized speech (160) using a hotword detector model (220), the hotword detector model (220) configured to detect the presence of hotwords (130) and the synthesized speech (160) in the audio input data; and

preventing, by the hotword detector (200), when the audio input data comprises synthesized speech (160), a wake-up process from being initiated on the user device (110) for processing the hotword (130) and/or one or more other terms following the hotword (130) in the audio input data.

25. The system (100) according to claim 24, wherein the hotword detector model (220) is trained on a plurality of training samples including:

positive training samples (212b) comprising human-generated audio data corresponding to one or more users speaking hotwords (130) assigned to the user device (110); and

negative training samples (212a) comprising synthesized speech utterances (160) output from one or more speech synthesizer devices (300).

26. The system (100) of claim 25, wherein at least one of the synthesized speech utterances (160) of the negative training samples (212a) pronounces the hotword (130) assigned to the user device (110).

27. The system (100) of claim 25, wherein none of the synthesized speech utterances (160) of the negative training samples (212a) pronounces the hotword (130) assigned to the user device (110).

28. The system (100) according to any one of claims 24-27, wherein determining whether the audio input data includes synthesized speech (160) includes detecting a presence of synthesized speech (160) in the audio input data by analyzing acoustic features of the audio input data using the hotword detector model (220) without transcribing or semantically interpreting the audio input data.

Technical Field

The present disclosure relates to hotword-aware speech synthesis.

Background

Voice-enabled environments (e.g., home, workplace, school, automobile, etc.) allow a user to speak a query or command aloud to a computer-based system that fields and answers the query and/or performs a function based on the command. A voice-enabled environment may be implemented using a network of connected microphone devices distributed in different rooms or areas of the environment. These devices may use hotwords to help discern when a given utterance is directed to the system, as opposed to an utterance directed to another individual present in the environment. Thus, the devices may operate in a sleep state or a dormant state and only wake up when a detected utterance includes a hotword. Once a device is awakened by the hotword within the detected utterance, the device performs further processing on the hotword and/or one or more terms following the hotword. In other words, the hotword and/or the one or more terms form a query or voice command to be executed by the device. As speech synthesizers become more prevalent in speech-enabled environments, synthesized utterances that contain hotwords, or include other words/subwords that sound like hotwords, can cause a device to wake up from a sleep/hibernate state and begin processing the synthesized utterance even though the synthesized utterance is not directed at the device. In other words, synthesized speech can inadvertently activate the device, which is often frustrating to users of speech synthesizers. Thus, a system that receives speech within an environment must have some way to distinguish between utterances of human speech directed at the system and utterances of synthesized speech output from nearby devices that are not directed at the system.

Disclosure of Invention

One aspect of the present disclosure provides a method for preventing initiation of a wake-up process on a user device. The method includes: receiving, at data processing hardware of a speech synthesis device, text input data for conversion into synthesized speech; and determining, by the data processing hardware and using a hotword perception model trained to detect the presence of at least one hotword assigned to the user device, whether a pronunciation of the text input data includes a hotword, which, when included in audio input data received by the user device, is configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data. When the pronunciation of the text input data includes a hotword, the method further includes generating an audio output signal from the text input data, and providing, by the data processing hardware, the audio output signal to an audio output device to output the audio output signal. The audio output signal, when captured by an audio capture device of the user device, is configured to prevent initiation of the wake-up process on the user device.

Implementations of the disclosure may include one or more of the following optional features. In some implementations, determining whether the pronunciation of the text input data includes a hotword includes determining that a pronunciation of at least one of a word, a subword, or a text-to-speech sequence of the text input data is associated with the hotword. The hotword perception model may be trained on text-to-speech sequences or audio representations of hotwords assigned to the user device. Further, the text input data may include a first language, and the audio output signal may include a translation of the text input data in a different language.

In some examples, the method further includes detecting, by the data processing hardware, a presence of the user device within an operating environment of the speech synthesis device, and querying, by the data processing hardware, the user device for hotwords assigned to the user device for training the hotword perception model. Additionally or alternatively, the method may include querying a remote hotword repository to obtain at least the hotwords assigned to the user device for training the hotword perception model.

In some implementations, generating the audio output signal from the text input data includes inserting a watermark into the audio output signal, the watermark indicating that the audio output signal corresponds to synthesized speech and instructing a hotword detector of the user device to ignore detection of hotwords in the synthesized speech. In other implementations, generating the audio output signal includes determining a speech waveform representing a text-to-speech output for the text input data, and altering the speech waveform by removing or altering any sounds associated with the hotword to circumvent detection of the hotword by a hotword detector of the user device. In yet other implementations, generating the audio output signal includes determining a speech waveform representing the text input data and filtering the speech waveform to circumvent detection of the hotword by a hotword detector of the user device.

Another aspect of the present disclosure provides a method for preventing initiation of a wake-up process on a user device. The method includes: receiving, at a hotword detector of the user device, audio input data containing a hotword, the hotword configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data; determining, by the hotword detector and using a hotword detector model, whether the audio input data includes synthesized speech, the hotword detector model configured to detect the presence of hotwords and synthesized speech in the audio input data; and when the audio input data includes synthesized speech, preventing, by the hotword detector, initiation of a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data.

This aspect may include one or more of the following optional features. In some implementations, the hotword detector model is trained on a plurality of training samples including positive training samples and negative training samples. The positive training samples include human-generated audio data corresponding to one or more users speaking the hotword assigned to the user device. The negative training samples include synthesized speech utterances output from one or more speech synthesizer devices. In some examples, at least one of the synthesized speech utterances of the negative training samples pronounces the hotword assigned to the user device. In other examples, none of the synthesized speech utterances of the negative training samples pronounces the hotword assigned to the user device. Determining whether the audio input data includes synthesized speech may include detecting the presence of synthesized speech in the audio input data by analyzing acoustic features of the audio input data using the hotword detector model, without transcribing or semantically interpreting the audio input data.

Another aspect of the present disclosure provides a system for preventing initiation of a wake-up process on a user device. The system includes data processing hardware of a speech synthesis device and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that, when executed by the data processing hardware, cause the data processing hardware to perform operations including: receiving text input data for conversion into synthesized speech, and determining, using a hotword perception model trained to detect the presence of at least one hotword assigned to the user device, whether a pronunciation of the text input data includes a hotword, which, when included in audio input data received by the user device, is configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data. When the pronunciation of the text input data includes a hotword, the operations further include generating an audio output signal from the text input data and providing the audio output signal to an audio output device to output the audio output signal. The audio output signal, when captured by an audio capture device of the user device, is configured to prevent initiation of the wake-up process on the user device.

Implementations of the disclosure may include one or more of the following optional features. In some implementations, determining whether the pronunciation of the text input data includes a hotword includes determining that at least one of a word, a subword, or a text-to-speech sequence of the text input data is associated with the hotword. The hotword perception model may be trained on text-to-speech sequences or audio representations of hotwords assigned to the user device. Further, the text input data may include a first language and the audio output signal may include a translation of the text input data in a different language.

In some examples, the operations further include detecting a presence of the user device within an operating environment of the speech synthesis device, and querying the user device for hotwords assigned to the user device for training the hotword perception model. Additionally or alternatively, the operations may further include querying a remote hotword repository to obtain at least the hotwords assigned to the user device for training the hotword perception model.

In some implementations, generating the audio output signal from the text input data includes inserting a watermark into the audio output signal, the watermark indicating that the audio output signal corresponds to synthesized speech and instructing a hotword detector of the user device to ignore detection of hotwords in the synthesized speech. In other implementations, generating the audio output signal includes determining a speech waveform representing a text-to-speech output for the text input data, and altering the speech waveform by removing or altering any sounds associated with the hotword to circumvent detection of the hotword by a hotword detector of the user device. In yet other implementations, generating the audio output signal includes determining a speech waveform representing the text input data and filtering the speech waveform to circumvent detection of the hotword by a hotword detector of the user device.

Another aspect of the present disclosure provides a system for preventing initiation of a wake-up process on a user device. The system includes data processing hardware of the user device and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that, when executed by the data processing hardware, cause the data processing hardware to perform operations including: receiving, at a hotword detector of the user device, audio input data containing a hotword configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data; determining, by the hotword detector and using a hotword detector model configured to detect the presence of hotwords and synthesized speech in the audio input data, whether the audio input data includes synthesized speech; and when the audio input data includes synthesized speech, preventing, by the hotword detector, initiation of a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data.

This aspect may include one or more of the following optional features. In some implementations, the hotword detector model is trained on a plurality of training samples including positive training samples and negative training samples. The positive training samples include human-generated audio data corresponding to one or more users speaking the hotword assigned to the user device. The negative training samples include synthesized speech utterances output from one or more speech synthesizer devices. In some examples, at least one of the synthesized speech utterances of the negative training samples pronounces the hotword assigned to the user device. In other examples, none of the synthesized speech utterances of the negative training samples pronounces the hotword assigned to the user device. Determining whether the audio input data includes synthesized speech may include detecting the presence of synthesized speech in the audio input data by analyzing acoustic features of the audio input data using the hotword detector model, without transcribing or semantically interpreting the audio input data.

The details of one or more embodiments of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.

Drawings

FIG. 1 is a schematic diagram of an example voice-enabled environment.

FIG. 2 is a schematic diagram of an example hotword detector from a speech-enabled environment.

FIGS. 3A and 3B are schematic diagrams of an example speech synthesis system incorporating a hotword perception trainer.

FIG. 4 is a flow diagram of an example arrangement of operations of a method for detecting the presence of hotwords in text input data for conversion to synthesized speech at a speech synthesis device.

FIG. 5 is a flow diagram of an example arrangement of operations of a method for preventing initiation of a wake-up procedure on a user device when audio input data includes synthesized speech.

FIG. 6 is a schematic diagram of an example computing device that may be used to implement the systems and methods described herein.

Like reference symbols in the various drawings indicate like elements.

Detailed Description

In a speech-enabled environment, the manner in which users interact with a computer-based system, which may be implemented using a network of connected microphone devices distributed throughout the environment (e.g., rooms or other areas of a home, workplace, school, etc.), is designed to be primarily, if not exclusively, by voice input (i.e., audio commands). More and more devices use audio commands to direct the operation of the user device. Through the use of "hotwords" (also referred to as "attention words", "wake-up phrases/words", "trigger phrases", or "voice action initiation commands"), in which, by agreement, a predetermined term (i.e., keyword) is reserved that is spoken to invoke the attention of the system, the system is able to discern between utterances directed to the system (i.e., for initiating a wake-up process for processing one or more terms following the hotword in the utterance) and utterances directed to individuals in the environment. In other words, the user device may operate in a low-power mode, but upon detection of a hotword, the user device may switch to a full-power mode in order to detect, process, and analyze all audio data captured by the microphone. However, as output of synthesized speech from speech synthesizers (e.g., text-to-speech (TTS) systems) becomes more prevalent within speech-enabled environments, synthesized speech that includes hotwords assigned to nearby user devices, or words or subwords that constitute or sound like hotwords, may inadvertently cause a hotword detector on a user device to detect the presence of a hotword and initiate a wake-up process for processing the terms in the synthesized speech. As used herein, the terms "synthesized speech" and "synthesized utterance" are used interchangeably. As used herein, synthesized speech output from a TTS system or speech synthesis device includes machine output originating from a non-audible data input. The machine output may inform the user of an operation being performed by a device associated with the TTS system or confirm an instruction provided by the user to the device associated with the TTS system. Thus, synthesized speech may be distinguished from broadcast audio output from a television, multimedia set-top box, stereo system, radio, computer system, or other type of device capable of outputting broadcast audio.

For example, in a voice-enabled environment (such as a user's home), a user may have one or more mobile devices (e.g., a smartphone and/or a tablet computer) and a smart speaker/display device. The smart speaker/display device may serve as a digital assistant that outputs synthesized speech and triggers the processing of a voice query or voice command to be executed when preceded by the hotword assigned to the respective user device. A scenario may occur in which synthesized speech, output from one of the devices (e.g., the smart speaker) and directed to the user, contains one or more words or subwords that constitute a hotword assigned to one of the other devices in the environment (e.g., the user's tablet). For example, the term "dog" may be designated as the hotword for the user's tablet, and a portion of the synthesized speech may recite the term "hotdog". As a result, the microphone of the other device may capture the synthesized speech, and the hotword detector may detect the term "dog" within the term "hotdog" and trigger the user's tablet to inadvertently initiate a wake-up process. Thus, pronunciation of a hotword in synthesized speech may inadvertently cause a nearby speech-enabled device to transition from a sleep/hibernate state to an active state in which it begins processing (i.e., transcribing and/or semantically interpreting) the synthesized speech.

It is an object of the present disclosure to avoid initiating a wake-up process on one or more other user devices due to hotwords, or other terms that sound like hotwords, present in generated TTS audio (e.g., synthesized speech). This prevents accidental initiation of the wake-up process, allowing the user devices to remain in a low-power state for longer to save power.

To prevent inadvertent initiation of a wake-up process in response to detecting pronunciation of a hotword in a synthesized utterance, embodiments herein are directed to injecting hotwords assigned to proximate devices into the training pipeline of a TTS system to generate a hotword-aware model (hotword perception model) for detecting the presence of the hotwords. The hotword perception model may be trained on any combination of hotwords assigned to proximate devices, a list of hotwords associated with one or more devices that a particular user owns and/or controls, and/or a list of all potential hotwords that may be assigned to any given device for initiating a wake-up process. For example, the speech synthesizer device may use the hotword perception model to determine whether the pronunciation of text input data for conversion into synthesized speech includes a hotword. In some examples, the hotword perception model is trained on audio representations (e.g., acoustic features) of the hotwords, such as TTS sequences or strings of the hotwords. Thus, a speech synthesis device receiving text input data (i.e., textual content) for conversion into synthesized speech may pre-process the text input data to obtain individual TTS sequences, and use the hotword perception model to recognize sequences that, when audibly pronounced, constitute hotwords or sound-alike phrases of hotwords, by identifying matches or similarities between the TTS sequences and the hotword sequences obtained from the hotword perception model. For example, text input data including the phrase "dawg" will, when audibly pronounced, constitute a sound-alike phrase of the hotword "dog". Thus, the hotword perception model is trained to detect whether the pronunciation of the text input data includes a hotword (i.e., constitutes the hotword or a sound-alike phrase of the hotword). The TTS system may include a multilingual TTS system trained on multiple languages, such that the hotword perception model is trained to detect hotwords, or sound-alike phrases of the hotwords, in the multiple languages.
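
A minimal sketch of this sequence matching follows, assuming a hypothetical grapheme-to-phoneme front end (`to_phonemes`) standing in for the TTS system's linguistic preprocessing; the toy lexicon and phoneme symbols are illustrative only, not the disclosure's actual model.

```python
def to_phonemes(text: str) -> list[str]:
    # Toy pronunciation lexicon standing in for a real grapheme-to-phoneme model.
    lexicon = {
        "dog": ["D", "AO", "G"],
        "dawg": ["D", "AO", "G"],  # misspelling, same pronunciation
        "hotdog": ["HH", "AA", "T", "D", "AO", "G"],
    }
    phonemes: list[str] = []
    for word in text.lower().split():
        phonemes.extend(lexicon.get(word, []))
    return phonemes

def pronunciation_includes_hotword(text: str, hotword: str) -> bool:
    """True if the hotword's phoneme sequence occurs anywhere (including as
    a subword) within the phoneme sequence of the text input data."""
    seq, hot = to_phonemes(text), to_phonemes(hotword)
    return any(seq[i:i + len(hot)] == hot for i in range(len(seq) - len(hot) + 1))

assert pronunciation_includes_hotword("dawg", "dog")    # sound-alike spelling
assert pronunciation_includes_hotword("hotdog", "dog")  # hotword as a subword
assert not pronunciation_includes_hotword("cat", "dog")
```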

If, as opposed to using a hotword perception model, the speech synthesis device simply maintained a list or whitelist of known hotwords in text form, the speech synthesis device would be unable to recognize misspelled words in the text input data that constitute hotwords, and unable to recognize subwords within words that constitute hotwords. For example, if the speech synthesis device simply referenced a whitelist of known hotwords, it would not recognize that text input data containing the phrase "dawg" constitutes the hotword "dog" (unless the spelling "dawg" were included in the whitelist), and would not recognize the subword "dog" in text input data containing the phrase "hotdog" (unless "hotdog" were included in the whitelist).

Once the speech synthesis device determines that the pronunciation of the text input data includes a hotword, embodiments further include a waveform generator of the speech synthesis device that generates the audio output signal of the synthesized speech, the waveform generator being configured to prevent initiation of a wake-up process on a proximate user device when the audio output signal is captured by a microphone of the proximate user device. In some examples, the waveform generator uses unit selection logic for generating the output audio signal. In these examples, the waveform generator may emit a known watermark over the audio sequence, where the known watermark is identifiable by a hotword detector on a proximate user device; thus, the hotword detector on the proximate user device will simply ignore an audio output signal bearing the known watermark, even if the audio output signal pronounces the hotword. Alternatively, the unit selection logic may select alternative variants of the units (or a subset of the units) used to generate the synthesized speech (e.g., the audio output signal) that are known to be adverse to the hotword detection model used by the hotword detector of the proximate user device. Here, the hotword detection model may be trained on these same adverse units so that, during inference, the hotword detector knows to ignore any utterances that include these units, thereby preventing initiation of the wake-up process even if the utterance contains the hotword. Further, the waveform generator may distort the synthesized speech using a filter trained against the hotword detectors of proximate user devices such that the hotword detectors ignore or fail to detect the synthesized speech.
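
The disclosure does not fix a particular watermark scheme. As one hedged illustration only, a single low-amplitude tone can be mixed over the synthesized waveform and checked for on the detector side; the marker frequency, strength, and threshold below are arbitrary assumptions.

```python
import numpy as np

SAMPLE_RATE = 16000
WATERMARK_HZ = 7600  # assumed marker frequency; any agreed value works

def add_watermark(audio: np.ndarray, strength: float = 0.01) -> np.ndarray:
    """Mix a low-amplitude pure tone over the synthesized waveform."""
    t = np.arange(len(audio)) / SAMPLE_RATE
    return audio + strength * np.sin(2 * np.pi * WATERMARK_HZ * t)

def has_watermark(audio: np.ndarray, threshold: float = 2e-3) -> bool:
    """Detector-side check: look for energy concentrated at the marker bin."""
    magnitudes = np.abs(np.fft.rfft(audio)) / len(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / SAMPLE_RATE)
    return magnitudes[np.argmin(np.abs(freqs - WATERMARK_HZ))] > threshold

speech = np.random.default_rng(0).uniform(-0.1, 0.1, SAMPLE_RATE)  # 1 s stand-in
assert has_watermark(add_watermark(speech)) and not has_watermark(speech)
```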

In other examples, the waveform generator may generate the output audio signal by using a neural network (e.g., based on WaveNet) to output an audio sequence of synthesized phonemes representing the text input data. In these examples, when a portion of the synthesized phonemes forms a hotword, the waveform generator may provide additional conditioning information that causes the neural network to emit a known watermark over the audio sequence that is identifiable by a hotword detector on a proximate user device, such that the proximate user device may simply ignore the audio output signal even if it pronounces the hotword. In other words, the presence of the watermark instructs the proximate user device to ignore the pronounced hotword. Alternatively, the segments of synthesized speech output from the neural network that constitute the hotword (or a sound-alike phrase of the hotword) may be modified (e.g., distorted) to generate an output audio signal in a manner adverse to detection by a hotword detector of a proximate user device.
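
As a hedged illustration of the distortion alternative, assuming the synthesizer's alignment exposes the time span in which the hotword is pronounced, that span could be band-stop filtered so that a detector trained on ordinary speech no longer matches it; the band edges and filter order here are arbitrary.

```python
import numpy as np
from scipy.signal import butter, sosfilt

SAMPLE_RATE = 16000

def distort_hotword_span(audio: np.ndarray, start_s: float, end_s: float) -> np.ndarray:
    """Band-stop filter the core speech band over the hotword's time span,
    leaving the rest of the synthesized utterance untouched."""
    sos = butter(4, [300, 3000], btype="bandstop", fs=SAMPLE_RATE, output="sos")
    start, end = int(start_s * SAMPLE_RATE), int(end_s * SAMPLE_RATE)
    out = audio.copy()
    out[start:end] = sosfilt(sos, audio[start:end])
    return out

# Suppose TTS alignment says the hotword is pronounced at 0.8 s - 1.3 s.
utterance = np.random.default_rng(0).uniform(-0.1, 0.1, 2 * SAMPLE_RATE)
protected = distort_hotword_span(utterance, 0.8, 1.3)
```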

Additionally or alternatively, embodiments may also include injecting synthesized speech utterances into the training pipeline of the hotword detector to generate a hotword detector model. The hotword detector model is configured to detect the presence of synthesized speech in audio input data received by the hotword detector. For example, a hotword detector trainer may train the hotword detector to detect hotwords in an utterance and further determine whether the utterance includes synthesized speech, such as audio data output from a speech synthesis device (e.g., a TTS system). Thus, when a microphone on the user device captures an utterance containing a hotword assigned to the user device, if the hotword detector detects that the utterance includes synthesized speech, the hotword detector will simply ignore the presence of the hotword in the captured utterance, thereby preventing the wake-up process from being initiated on the user device. In some examples, the hotword detector model is trained on positive training examples including human-generated audio data corresponding to one or more users speaking a hotword assigned to the user device, and negative training examples including synthesized speech utterances output from one or more speech synthesizer devices. By training the hotword detector model to detect the presence of synthesized speech in the audio input data, the hotword detector may advantageously use the hotword detector model to detect the presence of synthesized speech by analyzing acoustic features of the received audio input data without transcribing or semantically interpreting the audio input data.
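
Functionally, the trained detector gates wake-up on two model outputs. A minimal sketch of that gating logic follows, with hypothetical score inputs and thresholds (in practice the hotword detector model would produce the scores).

```python
def should_wake(hotword_score: float, synthesized_score: float,
                hotword_threshold: float = 0.8,
                synthesized_threshold: float = 0.5) -> bool:
    if hotword_score < hotword_threshold:
        return False  # no hotword detected: stay asleep
    if synthesized_score >= synthesized_threshold:
        return False  # hotword came from synthesized speech: ignore it
    return True

assert should_wake(0.95, 0.1)       # a human speaks the hotword
assert not should_wake(0.95, 0.9)   # a nearby TTS device spoke the hotword
assert not should_wake(0.2, 0.1)    # no hotword at all
```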

Referring to FIG. 1, in some implementations, a speech-enabled system 100 includes one or more user devices 110, 110a-b. For example, the speech-enabled system 100 includes two user devices 110a, 110b in close proximity to each other and connected to a remote server 140 (e.g., a cloud computing environment) via a network 120. The user devices 110a, 110b may or may not communicate with each other. Each user device 110 is configured to capture sound corresponding to an utterance 150 from the user 10. The user 10 may speak the utterance 150 aloud as a query or a command. The speech-enabled system 100 may field the query or command by answering the query and/or causing the command to be performed. Each user device 110 includes data processing hardware 112 and memory hardware 114 in communication with the data processing hardware 112 and storing instructions that, when executed by the data processing hardware 112, cause the data processing hardware 112 to perform one or more operations. Each user device 110 also includes an audio capture device (e.g., a microphone) 116 for capturing and converting spoken utterances 150 within the speech-enabled system 100 into electrical signals, and a speech output device (e.g., a speaker) 118 for communicating an audible audio signal (e.g., as output audio data from the user device 110).

Each user device 110 may be associated with a user 10 and capable of processing an utterance 150 from the associated user 10 when the utterance 150 begins with a hotword 130. The hotword 130 may be a spoken phrase that causes any user device 110 to treat the subsequent spoken phrase as a voice input to the system. In other words, the hotword 130 may be a spoken phrase that specifically indicates that the spoken input is to be treated as a voice command. That is, the hotword 130 may be a spoken phrase that triggers endpointing, automatic speech recognition, or semantic interpretation on the hotword 130 or one or more terms following the hotword 130. Reference herein to a "hotword" refers to a word or phrase that is the designated hotword, or to a word or phrase that sounds similar to at least a portion of the designated hotword (e.g., sounds similar to the hotword in other languages).

To detect the presence of the hotword 130 within an utterance 150, each user device 110 includes a hotword detector 200. The hotword detector 200 may receive sound corresponding to the utterance 150 and determine whether the utterance 150 includes a term that has been designated or assigned as a hotword 130. In some examples, the hotword detector 200 detects acoustic features of the captured sound from the utterance 150. Here, when the acoustic features are characteristic of the hotword 130, the hotword detector 200 identifies the hotword 130. Upon detecting the hotword 130, the hotword detector 200 may initiate a wake-up process and further processing for the user device 110. In other configurations, the hotword detector 200 communicates the detection of the hotword 130 to other components of the user device 110. In some implementations, to efficiently and effectively detect the hotword 130, the hotword detector 200 is trained, via a hotword detector model 220, on data or examples of speech utterances to learn how to identify whether an utterance 150 includes the hotword 130. For example, the hotword detector 200 is taught by a machine learning model to identify the hotword 130.

In some examples, the user 10 or the user device 110 generates a hotword query 132 to identify hotwords 130 of interest to the user 10 and/or the user device 110. In some implementations, the user device 110 communicates with the remote server 140 via the network 120 to identify and/or receive the hotwords 130 from a hotword repository 142 in communication with the remote server 140. In some examples, the hotword query 132 may include a user identifier that maps to all hotwords 130 assigned to the user devices 110 owned by the user 10 associated with the user identifier. Additionally or alternatively, the user device 110 may obtain an identifier (e.g., a media access control (MAC) identifier) associated with each proximate user device 110 and provide the identifiers in a query 132 to obtain all hotwords 130 associated with each identifier from the repository 142. The hotword repository 142 may include any combination of hotwords 130 assigned to proximate devices 110, a list of hotwords 130 associated with one or more devices 110 owned and/or controlled by a particular user 10, and/or a list of all potential hotwords 130 that may be assigned to any given device 110 for initiating a wake-up process (e.g., global hotwords associated with a particular type(s) of device(s) 110). By generating hotword queries 132, the hotword(s) 130 may be obtained to form a robust hotword training process for the hotword detector 200. Referring to FIG. 1, each user device 110 is configured to send and/or receive hotword queries 132 to/from one or more other user devices 110 to learn and/or compile the hotword(s) 130 assigned to the other user devices 110.
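
A hedged sketch of what such a hotword query (132) against a remote repository (142) might look like follows; the endpoint, payload fields, and response shape are illustrative assumptions, not an API defined by this disclosure.

```python
import json
import urllib.request

def query_hotwords(repository_url: str, user_id: str,
                   device_macs: list[str]) -> list[str]:
    """Ask a (hypothetical) hotword repository for every hotword assigned
    to the given user's devices and to the listed proximate devices."""
    payload = json.dumps({"user_id": user_id, "device_ids": device_macs}).encode()
    request = urllib.request.Request(
        repository_url, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["hotwords"]

# Example usage (illustrative URL and identifiers):
# hotwords = query_hotwords("https://repository.example/hotwords",
#                           user_id="user-10",
#                           device_macs=["aa:bb:cc:dd:ee:01", "aa:bb:cc:dd:ee:02"])
```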

Each user device 110 may also be configured as a speech synthesis device. As a speech synthesis device, the user device 110 may also include a speech synthesizer 300, such as a text-to-speech (TTS) system, that generates synthesized speech 160. For example, the synthesized speech 160 may audibly convey an answer to a query received from the user 10. In some examples, all of the functionality of the speech synthesizer 300 resides on the user device 110. In other examples, a portion of the speech synthesizer 300 resides on the user device 110 and the remaining portion of the speech synthesizer 300 resides in a distributed environment (e.g., the cloud computing environment 140).

In some examples, the speech synthesizer 300 on one device 110 (e.g., a first user device 110, 110a) is trained on a text-to-speech sequence or audio representation of the hotword 130 assigned to another user device 110 (e.g., a second user device 110b). For example, a training pipeline (e.g., hotword perception trainer 310) of the speech synthesizer 300 (e.g., a TTS system) associated with one device 110 may generate a hotword perception model 320 for detecting the presence of hotwords 130. The hotword perception model 320 may be trained on any combination of hotwords 130 assigned to proximate devices 110, a list of hotwords 130 associated with one or more devices 110 owned and/or controlled by a particular user 10, and/or a list of all potential hotwords 130 that may be assigned to any given device 110 for initiating a wake-up process (e.g., global hotwords associated with a particular type(s) of device(s) 110). Additionally or alternatively, a hotword query 132 may be used to obtain the hotword(s) 130 for the hotword perception model 320. The speech synthesizer 300 of each user device 110 may also include a waveform generator 312 for generating the synthesized speech 160. The waveform generator 312 may use unit selection logic for generating the synthesized speech 160 in the form of output audio data. In some examples, the waveform generator 312 uses a neural network for generating the output audio data. Although the examples are directed to using the hotword perception model 320 for detecting the presence of hotwords 130 in synthesized speech, the hotword perception model 320 may be similarly trained for detecting hotwords 130 in other types of specified audio, such as, but not limited to, broadcast audio.

In the illustrated example, the speech-enabled system 100 includes a first user device 110a and a second user device 110b. The second user device 110b may be considered a proximate device of the first user device 110a, or vice versa. Here, the user devices 110a, 110b are considered "proximate" to one another when the respective audio capture device 116 on one user device 110 is capable of capturing an utterance 150 directed to the other user device 110. In other words, "proximate" user devices 110a, 110b are within overlapping audio reception proximity such that the speech output device 118 of one user device 110, 110a is within detectable range of the audio capture device 116 of the proximate user device 110, 110b. Although the speech-enabled system 100 is shown as including two user devices 110a, 110b, in other examples the speech-enabled system 100 includes additional user devices 110 without departing from the scope of the present disclosure. Some examples of user devices 110 are portable computers, smartphones, tablet computing devices, smart speakers, smart displays, and wearable computing devices.

In some examples, the user devices 110, 110a-b each detect words or subwords spoken by the user 10. For example, the user 10 may speak a first utterance 150a detectable by the first user device 110a that includes "OK Google: remind me first thing in the morning to restart the computer at work." Here, the phrase "OK Google" is the hotword 130 assigned to the user device 110a, such that the hotword detector 200 triggers the user device 110a to initiate a wake-up process for processing the hotword 130 in the audio input data and/or one or more other terms following the hotword 130 (e.g., the remainder of the first utterance 150a, "remind me first thing in the morning to restart the computer at work"). In this example, the first user device 110a responds to the first utterance 150a with synthesized speech 160 reciting "OK Jim, setting a reminder for tomorrow morning."

Similarly, the second user device 110, 110b may be assigned the hotword 130 "start computer". In this configuration, when the user 10 uses the hotword 130 "start computer", the user 10 expects the second user device 110, 110b to initiate a wake-up process. Thus, when the user 10 speaks a second utterance 150b detectable by the second user device 110b that includes "start computer: music from the 70s music playlist," the phrase "start computer" causes the hotword detector 200 to trigger the second user device 110b to initiate a wake-up process for processing the hotword 130 in the audio input data and/or the one or more other terms "music from the 70s music playlist."

When the two user devices 110 are in proximity, synthesized speech 160 that includes the hotword 130 as output data from the first user device 110a may be inadvertently received by the audio capture device 116, 116b of the second user device 110b. When inadvertently received synthesized speech 160 contains the hotword 130, the user 10 does not intend for the hotword detector 200, 200b of the second user device 110, 110b to wake up the device and/or initiate further processing based on the inadvertently received synthesized speech 160. To prevent the hotword detector 200 from activating the second user device 110, 110b, the hotword detector 200 may be configured to identify the synthesized speech 160 and ignore the synthesized speech 160 containing the hotword 130.

FIG. 2 is an example of the hotword detector 200 within a user device 110 of the speech-enabled system 100. The hotword detector 200 is configured to determine whether audio input data, such as an utterance 150, includes the hotword 130 (e.g., based on detecting that some or all of the acoustic features of the captured sound are similar to the acoustic features of the hotword 130). For example, the hotword detector 200 determines that the utterance 150 starts with the hotword 130 and then initiates a wake-up process for the user device 110 of the hotword detector 200.

In some examples, the hotword detector 200 includes a hotword detector trainer 210 and a hotword detector model 220. In addition to training on positive training samples 212, 212b containing audio representations of the hotword, the hotword detector trainer 210 trains on negative training samples 212, 212a of synthesized speech 160 to generate the hotword detector model 220, teaching the hotword detector 200 to distinguish between human-generated utterances 150 (i.e., non-synthesized speech) and synthesized utterances 160 (generated by the speech synthesizer 300). In other words, the hotword detector model 220 is a synthesized speech perception model generated by the hotword detector trainer 210 based on the training examples 212, 212a-b.

In some implementations, the hotword detector trainer 210 trains the hotword detector model 220 with negative training examples 212a and positive training examples 212b. The negative training examples 212a are audio samples that the hotword detector trainer 210 teaches the hotword detector model 220 to ignore. Here, to prevent unintentional initiation of a wake-up of the user device 110 based on synthesized speech 160, the negative training examples 212a are audio samples corresponding to synthesized speech 160. The synthesized speech 160 of the one or more negative training examples 212a may be synthesized speech 160 that includes the hotword 130 (i.e., pronounces the hotword 130) or synthesized speech that does not include the hotword 130. In either scenario, the hotword detector 200 is taught to ignore the synthesized speech 160 so that a wake-up process based on an utterance 150 is not inadvertently initiated by synthesized speech 160 containing the hotword 130 or one or more words/subwords that sound like the hotword 130. By disregarding the synthesized speech 160, the hotword detector 200 prevents a wake-up process from being initiated on the user device 110 for processing the hotword 130 and/or one or more other terms following the hotword 130 in the audio input data.

Optionally, the hotword detector trainer 210 may additionally or alternatively train the hotword detector model 220 through negative training examples 212a that include samples of other types of audio (e.g., broadcast audio). Accordingly, hotword detector 200 may similarly be taught to ignore these other types of audio, such that the wake-up process based on utterance 150 is not inadvertently initiated by these other types of audio that contain hotwords or one or more words/subwords that sound like hotwords 130.

In contrast, a positive training example 212b is an audio sample of an utterance 150 of human speech that includes the hotword 130. The hotword detector trainer 210 feeds the hotword detector model 220 with the positive training examples 212b to teach examples of when the hotword detector 200 should initiate a wake-up process. Additionally or alternatively, the hotword detector trainer 210 may train the hotword detector model 220 with training examples 212 that are audio samples of utterances 150 of human speech without the hotword 130, in order to expose the hotword detector 200 to further scenarios that may occur during operation of the hotword detector 200. In some implementations, the more training examples 212 taught to the hotword detector model 220 by the hotword detector trainer 210, the more robust and/or computationally efficient the hotword detector 200 becomes when implementing the hotword detector model 220. Moreover, by training the hotword detector 200 with the hotword detector model 220 taught with the training examples 212 from the hotword detector trainer 210, the hotword detector model 220 allows detection of the presence of synthesized speech in the audio input data of an utterance 150 by analyzing the acoustic features of the utterance 150, without transcribing or semantically interpreting the utterance 150.
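
To make the "acoustic features only" point concrete, here is a minimal sketch of a feature front end, assuming log-mel filterbank energies (computed here with librosa); the exact feature choice and parameters are illustrative assumptions, and no transcription or semantic step appears anywhere in the path.

```python
import numpy as np
import librosa

def acoustic_features(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Log-mel filterbank energies: the kind of acoustic features a hotword
    detector model can score directly, with no ASR or semantic step."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sample_rate, n_fft=400, hop_length=160, n_mels=40)
    return librosa.power_to_db(mel)  # shape: (40 mel bands, num frames)

audio = np.random.default_rng(0).standard_normal(16000).astype(np.float32)
features = acoustic_features(audio)  # feeds the detector model directly
```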

With continued reference to FIG. 2, the hotword detector 200 of the user device 110 implements the hotword detector model 220 to determine whether the received audio input data, "reminder to restart computer when you arrive at work this morning," includes the hotword 130. For example, the first user device 110, 110a generates the audio input data as synthesized speech 160. The second user device 110, 110b, being a proximate user device of the first user device 110, 110a, may inadvertently capture the synthesized speech 160 at the audio capture device 116, 116b of the second user device 110, 110b. Here, rather than the hotword detector 200, 200b initiating the wake-up process because the acoustic features of the synthesized speech 160 include the hotword 130 "start computer," the hotword detector 200, 200b implements the hotword detector model 220 to identify the audio input data as synthesized speech 160, ignoring the presence of the assigned hotword 130 "start computer" within the phrase "reminder to restart computer when you arrive at work this morning."

In some configurations, the hotword detector trainer 210 is configured to split the training examples 212 into a training set and an evaluation set (e.g., 90% training and 10% evaluation). With these sets, the hotword detector trainer 210 trains the hotword detector model 220 on the audio samples until the performance of the hotword detector model 220 on the evaluation set stops improving. Once performance on the evaluation set stops improving, the hotword detector model 220 is ready for inference, in which the hotword detector model 220 allows the hotword detector 200 to accurately detect hotwords 130 received at the user device 110 that do not correspond to synthesized speech 160.
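
A minimal sketch of this split-and-stop training loop follows; `train_step` and `evaluate` are hypothetical placeholders for the real training and scoring routines, and the patience value is an assumption.

```python
import random

def train_with_early_stopping(samples, train_step, evaluate, patience=3):
    """Split 90/10, then train until evaluation performance stops improving."""
    random.shuffle(samples)
    split = int(0.9 * len(samples))
    train_set, eval_set = samples[:split], samples[split:]
    best, stale = float("-inf"), 0
    while stale < patience:
        train_step(train_set)    # one pass over the training set
        score = evaluate(eval_set)  # e.g., accuracy on held-out samples
        if score > best:
            best, stale = score, 0
        else:
            stale += 1           # stop once evaluation stops improving
    return best
```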

Additionally or alternatively, the hotword detector model 220 is a neural network. The hotword detector model 220 may be a convolutional neural network (CNN) or a deep neural network (DNN). In some examples, the hotword detector model 220 is a combination of a convolutional neural network and a deep neural network, such that the convolutional neural network filters, pools, and then flattens the information for transmission to the deep neural network. Much like when the hotword detector model 220 is a machine learning model, the neural network is trained (e.g., by the hotword detector trainer 210) to generate meaningful outputs that can be used for accurate hotword detection. In some examples, when the hotword detector model 220 is a neural network, it is trained using a mean squared error loss function.
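To make the CNN-into-DNN arrangement concrete, the following PyTorch sketch filters and pools acoustic features, flattens them, and passes them to fully connected layers, trained with the mean squared error loss named above. All layer sizes and the input feature shape (log-mel spectrograms) are illustrative assumptions, not parameters from this disclosure.

```python
import torch
import torch.nn as nn

class HotwordDetectorNet(nn.Module):
    """Sketch of a CNN that filters, pools, and flattens, feeding a DNN."""
    def __init__(self, n_mels: int = 40, n_frames: int = 100):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
        )
        flat = 32 * (n_mels // 4) * (n_frames // 4)
        self.dnn = nn.Sequential(
            nn.Linear(flat, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),  # wake / do-not-wake score
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dnn(self.cnn(x))

model = HotwordDetectorNet()
loss_fn = nn.MSELoss()                         # mean squared error loss
features = torch.randn(8, 1, 40, 100)          # batch of spectrograms
targets = torch.randint(0, 2, (8, 1)).float()  # should-wake labels
loss = loss_fn(model(features), targets)
loss.backward()
```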

FIGS. 3A and 3B are examples of a speech synthesizer 300 of the user device 110. By way of example, the speech synthesizer 300 is a TTS system, where the input to the speech synthesizer is text input data 302. The speech synthesizer 300 may be configured to generate synthesized speech 160 from the text input data 302 by converting the text input data 302 into the synthesized speech 160. As shown in FIGS. 3A and 3B, the speech synthesizer 300 may generate the synthesized speech 160 through different processes, such as a unit selection process (FIG. 3A) or a neural network process (FIG. 3B). In either process, the speech synthesizer 300 includes a hotword perception trainer 310 and a hotword perception model 320 to provide an audio output signal 304 that may be identified by the proximate user device(s) 110 to prevent a wake-up process from being initiated on the proximate user device 110. In other words, while the audio output signal 304 may include a hotword 130 that would otherwise initiate a wake-up process on a proximate user device 110, the speech synthesizer 300 identifies the audio output signal 304 as synthesized speech 160 to avoid initiating wake-up processes on other proximate user devices 110. In these examples, the speech synthesizer 300 uses the hotword perception model 320 to detect the presence of hotwords 130 assigned to the user device 110 and to determine whether the pronunciation of the text input data 302 (e.g., the audio output signal 304 of the synthesized speech 160) includes a hotword 130. When the pronunciation includes a hotword 130, the speech synthesizer 300 generates the audio output signal 304 such that it cannot trigger the hotword detector(s) 200 of a different user device 110.

In some examples, the hotword perception trainer 310 utilizes the hotword query 132 to obtain the hotword 130 or a list of hotwords 130 (e.g., from the hotword repository 142 or directly from the proximate user device 110). As previously described, the hotword query 132 may obtain any combination of the hotwords 130 assigned to proximate devices 110, a list of hotwords 130 associated with one or more devices 110, 110a-n that a particular user 10 owns/controls, and/or a list of all potential hotwords 130 that may be assigned to any given device 110 for initiating a wake-up process. In other examples, the user 10 of the speech-enabled system 100 or an administrator of the user device 110 pre-programs and/or updates the hotword perception trainer 310 with the hotword(s) 130. The hotword perception trainer 310 trains the hotword perception model 320 based on the received and/or obtained hotwords 130. In some examples, the hotword perception trainer 310 trains the hotword perception model 320 based on a TTS sequence or audio representation of at least one hotword 130.

The speech synthesizer 300 may use the hotword perception model 320 at any time during the speech synthesis process. In some examples, the speech synthesizer 300 first generates a text-to-speech output and then analyzes the synthesized speech 160 for hotwords 130 or phonemic phrases using the hotword perception model 320. In other examples, the speech synthesizer 300 uses the hotword perception model 320 during generation of the synthesized speech 160 to analyze the text-to-speech output for the hotword 130.
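As a sketch of the post-generation check, and assuming an external grapheme-to-phoneme tool (here a phonemize callable that is not part of this disclosure), the pronunciation test might be as simple as a phoneme-level substring search:

```python
from typing import Callable, List

def pronunciation_includes_hotword(text: str,
                                   hotwords: List[str],
                                   phonemize: Callable[[str], str]) -> bool:
    # Convert both the TTS input text and each assigned hotword 130 to a
    # phoneme string and look for the hotword's pronunciation inside the
    # utterance. This catches cases such as "restart computer" containing
    # the pronunciation of "start computer", which word-level matching
    # would miss.
    text_phones = phonemize(text)
    return any(phonemize(hw) in text_phones for hw in hotwords)
```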

When the hotword perception model 320 identifies the hotword 130 during the speech synthesis process or within the synthesized speech 160, the speech synthesizer 300 provides an indication (e.g., within the audio output signal 304) that the synthesized speech 160 includes the hotword 130. In some examples, the speech synthesizer 300 embeds a known watermark 352 over the audio sequence of the synthesized speech 160, the known watermark 352 being identifiable to the hotword detector 200. The speech synthesizer 300 may insert the watermark 352 into or over the synthesized speech 160 in any manner identifiable to the hotword detector 200. For example, the speech synthesizer 300 may insert the watermark 352 by appending, prepending, or overlaying it, or by encoding it within the synthesized speech 160. The speech synthesizer 300 may insert unique features, such as the known watermark 352, at discrete intervals across the audio sequence of the synthesized speech 160. These discrete intervals may range from milliseconds to spans of several seconds. Smaller intervals, such as millisecond intervals, allow even partial portions of the synthesized speech 160 received at the proximate user device 110 to be identified, preventing unwanted wake-up initiation. Inserting the watermark 352 at intervals may further prevent unwanted speech recognition in the event that the user device is already active and awake. In some implementations, the speech synthesizer 300 uses a filter 354 trained for a given hotword detector 200 to distort the synthesized speech 160. In other words, the hotword detector 200 on the proximate device 110 is trained with the filter 354 to ignore filtered synthesized speech 160. In some examples, the filter 354 masks the hotword 130 within the synthesized speech 160 from the hotword detector 200. Similarly, the speech synthesizer 300 may alter the speech waveform corresponding to the audio output signal 304 associated with the synthesized speech 160 by removing or altering any sounds associated with the hotword 130 in order to circumvent hotword detection by the hotword detector 200.
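A minimal sketch of interval-based watermark insertion follows, using NumPy. The interval length, the low overlay amplitude, and the watermark signal itself are all illustrative assumptions, since the disclosure leaves the watermark's form to the implementation.

```python
import numpy as np

def insert_watermark(speech: np.ndarray, watermark: np.ndarray,
                     sample_rate: int, interval_s: float = 0.5) -> np.ndarray:
    """Overlay a known watermark 352 at discrete intervals over the
    synthesized speech 160, so that even a partial capture at a proximate
    user device 110 carries at least one copy of the watermark."""
    out = speech.astype(np.float32).copy()
    step = int(interval_s * sample_rate)
    for start in range(0, len(out) - len(watermark), step):
        out[start:start + len(watermark)] += 0.01 * watermark  # low-level overlay
    return np.clip(out, -1.0, 1.0)
```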

Referring to FIG. 3A, the speech synthesizer 300, 300a uses unit selection logic to generate the synthesized speech 160. Here, the speech synthesizer 300 is a TTS system in which a unit extender 330 receives the text input data 302 and parses the text input data 302 into components compatible with the phonetic units of a unit database 340. The unit selector 350 is configured to interpret the parsed text input data from the unit extender 330 and to select, from the unit database 340 in communication with the unit selector 350, a speech unit corresponding to the parsed text input data. The unit database 340 is a database that typically includes units of parsed text and a collection of corresponding audio signal forms (i.e., speech units) for those units. The unit selector 350 constructs a unit sequence 360 from the speech units associated with the parsed text input data to form the synthesized speech 160 for the text input data 302. In some configurations, when the synthesized speech 160 includes the hotword 130, the speech synthesizer 300, 300a is configured to select an alternative variant of the speech unit to form the synthesized speech 160 such that the hotword detector 200 will not be able to detect the hotword 130.
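The unit selection path might be sketched as follows; the parse_units callable (standing in for the unit extender 330), the dictionary layout of the unit database, and the byte concatenation are all simplifying assumptions.

```python
from typing import Callable, Dict, List, Optional

def synthesize_by_unit_selection(
        text: str,
        unit_database: Dict[str, bytes],          # stands in for unit database 340
        parse_units: Callable[[str], List[str]],  # stands in for unit extender 330
        alternative_variants: Optional[Dict[str, bytes]] = None) -> bytes:
    # Parse the text input data 302 into phonetic units, then select a
    # corresponding speech unit for each from the unit database 340.
    sequence = []
    for unit in parse_units(text):
        if alternative_variants and unit in alternative_variants:
            # When the unit belongs to a hotword 130, pick an alternative
            # variant so the hotword detector 200 will not detect it.
            sequence.append(alternative_variants[unit])
        else:
            sequence.append(unit_database[unit])
    return b"".join(sequence)  # unit sequence 360 -> synthesized speech 160
```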

FIG. 3B is an example of a speech synthesizer 300, 300b similar to that of FIG. 3A, except that the speech synthesizer 300, 300b receives the text input data 302 and generates an input text sequence 370 to be input into a waveform neural network model 380. Unlike the unit selection process, the waveform neural network model 380 does not require the unit database 340. By forgoing the unit database 340, the waveform neural network model 380 may achieve higher computational efficiency and a reduced computational load compared to the speech synthesizer 300, 300a.
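For orientation only, the following PyTorch sketch shows the core idea behind such waveform models: a stack of dilated 1-D convolutions whose receptive field doubles per layer, mapping an input feature sequence directly to waveform samples with no unit database. The channel counts, depth, and the omission of any text conditioning are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TinyWaveModel(nn.Module):
    """Heavily reduced sketch of a dilated-convolution waveform model."""
    def __init__(self, channels: int = 32, layers: int = 6):
        super().__init__()
        self.embed = nn.Conv1d(1, channels, kernel_size=1)
        blocks = []
        for i in range(layers):
            d = 2 ** i  # dilation doubles each layer, widening the receptive field
            blocks += [nn.Conv1d(channels, channels, kernel_size=2,
                                 dilation=d, padding=d), nn.Tanh()]
        self.stack = nn.Sequential(*blocks)
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.stack(self.embed(x))
        return self.out(h[..., :x.shape[-1]])  # trim padding growth

model = TinyWaveModel()
features = torch.randn(1, 1, 400)  # placeholder input text sequence 370
waveform = model(features)         # (1, 1, 400) waveform samples
```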

Similar to the hotword detector model 220, the hotword perception model 320 and/or the waveform neural network model 380 may be machine learning models that first undergo model training (e.g., via the hotword perception trainer 310 in the case of the hotword perception model 320) and, once trained, may continue to be implemented by the speech synthesizer 300. During model training, the models 320, 380 receive a data set and a result set in order to learn to predict outputs for input data similar to the data set. In the case of the hotword perception model 320, the data set and the result set may be audio samples or text samples associated with the hotword 130, such as phrases, words, subwords, text-to-speech sequences, language variants, language translations, and so forth. In the case of the waveform neural network model 380, the data set and the result set may be text samples configured to train the waveform neural network model 380 to generate the synthesized speech 160 from the input text sequence 370. In some examples, for training purposes, the data is divided into a training set and an evaluation set (e.g., 90% training and 10% evaluation). With these sets, the models 320, 380 are trained until performance on the evaluation set stops improving. Once performance on the evaluation set stops improving, each respective model 320, 380 is ready for use (e.g., identifying the hotword 130 in the case of the hotword perception model 320, or generating the synthesized speech 160 in the case of the waveform neural network model 380).

Additionally or alternatively, each respective model 320, 380 is a neural network. The models 320, 380 may be convolutional neural networks (CNNs) (e.g., a modified WaveNet) or deep neural networks (DNNs). In some examples, the models 320, 380 are a combination of convolutional and deep neural networks, such that the convolutional neural network filters, pools, and then flattens the information for transmission to the deep neural network. Much like when the models 320, 380 are machine learning models, the neural networks are trained to generate meaningful audio output signals 304. In some examples, when the models 320, 380 are neural networks, they are trained using a mean squared error loss function.

FIG. 4 is a flow diagram of an example arrangement of operations of a method 400 for determining that a pronunciation of text input data 302 includes a hotword 130 assigned to a proximate device 110. The data processing hardware 112 may perform the operations of the method 400 by executing instructions stored on the memory hardware 114. At operation 402, the method 400 includes receiving, at the data processing hardware 112 of the speech synthesis apparatus 300, the text input data 302 for conversion into the synthesized speech 160. At operation 404, the method 400 includes determining, by the data processing hardware 112 and using a hotword perception model 320 trained to detect the presence of hotwords 130 assigned to the user device 110, whether a pronunciation of the text input data 302 includes a hotword 130, the hotword 130, when included in audio input data received by the user device 110, being configured to initiate a wake-up process on the user device 110 for processing the hotword 130 in the audio input data and/or one or more other terms following the hotword 130.

At operation 406, when the pronunciation of the text input data 302 includes the hotword 130, the method 400 includes generating an audio output signal 304 from the text input data 302. At operation 408, when the pronunciation of the text input data 302 includes the hotword 130, the method 400 includes providing, by the data processing hardware 112, the audio output signal 304 to the audio output device 118 for output, wherein the audio output signal 304, when captured by the audio capture device 116 of the user device 110, is configured to prevent initiation of a wake-up process on the user device 110.
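Read as code, the flow of method 400 might be sketched as below; the synthesizer, model, and output device objects and their method names are hypothetical stand-ins for the speech synthesis apparatus 300, hotword perception model 320, and audio output device 118.

```python
def method_400(text_input_data, synthesizer, hotword_perception_model,
               audio_output_device):
    # Operation 402: receive the text input data 302 for conversion into
    # synthesized speech 160 (here, as the function argument).
    # Operation 404: use the hotword perception model 320 to determine
    # whether the pronunciation of the text includes a hotword 130.
    includes_hotword = hotword_perception_model.pronunciation_includes(
        text_input_data)
    # Operations 406-408: generate the audio output signal 304 and provide
    # it for output; when a hotword is present, the signal is marked (e.g.,
    # watermarked) so a capturing user device 110 will not wake up.
    signal = synthesizer.generate(text_input_data,
                                  suppress_wake=includes_hotword)
    audio_output_device.play(signal)
```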

Fig. 5 is a flow diagram of an example arrangement of operations of a method 500 for preventing initiation of a wake-up process on a user device 110 for processing audio input data when the audio input data includes synthesized speech 160. The data processing hardware 112 may perform the operations of the method 500 by executing instructions stored on the memory hardware 114. At operation 502, the method 500 includes receiving audio input data containing the hotword 130 at the hotword detector 200 of the user device 110. The hotword 130 is configured to initiate a wake-up process on the user device 110 for processing the hotword 130 and/or one or more other terms following the hotword 130 in the audio input data. At operation 504, the method 500 includes determining, by the hotword detector 200, whether the audio input data includes synthesized speech 160 using a hotword detector model 220 configured to detect the presence of hotwords 130 and synthesized speech 160 in the audio input data. At operation 506, when the audio input data includes the synthesized speech 160, the method 500 includes preventing, by the hotword detector 200, initiation of a wake-up process on the user device 110 for processing the hotword 130 and/or one or more other terms following the hotword 130 in the audio input data.
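The detection-side flow of method 500 might likewise be sketched as follows; the detector interface is a hypothetical stand-in for the hotword detector 200 running the hotword detector model 220.

```python
def method_500(audio_input_data, hotword_detector):
    # Operation 502: audio input data containing the hotword 130 arrives
    # at the hotword detector 200 of the user device 110.
    if not hotword_detector.contains_hotword(audio_input_data):
        return
    # Operation 504: use the hotword detector model 220 to determine
    # whether the audio input data includes synthesized speech 160.
    if hotword_detector.is_synthesized_speech(audio_input_data):
        # Operation 506: prevent initiation of the wake-up process.
        return
    hotword_detector.initiate_wake_up(audio_input_data)
```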

A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform tasks. In some examples, a software application may be referred to as an "application," app, "or" program. Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.

The non-transitory memory may be a physical device for temporarily or permanently storing programs (e.g., sequences of instructions) or data (e.g., program state information) for use by the computing device. The non-transitory memory may be volatile and/or non-volatile addressable semiconductor memory. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electrically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware such as boot programs). Examples of volatile memory include, but are not limited to, Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Phase Change Memory (PCM), and magnetic disks or tapes.

FIG. 6 is a schematic diagram of an example computing device 600 that may be used to implement the systems and methods described in this document. Computing device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.

Computing device 600 includes a processor 610, memory 620, a storage device 630, a high-speed interface/controller 640 connected to memory 620 and high-speed expansion ports 650, and a low-speed interface/controller 660 connected to low-speed bus 670 and storage device 630. Each of the components 610, 620, 630, 640, 650, and 660 are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 610 may process instructions for execution within the computing device 600, including instructions stored in the memory 620 or storage device 630 to display graphical information for a Graphical User Interface (GUI) on an external input/output device, such as display 680 coupled to high speed interface 640. In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Moreover, multiple computing devices 600 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

The memory 620 stores information non-transitorily within the computing device 600. The memory 620 may be a computer-readable medium, volatile memory unit(s), or non-volatile memory unit(s). The non-transitory memory 620 may be a physical device for temporarily or permanently storing programs (e.g., sequences of instructions) or data (e.g., program state information) for use by the computing device 600. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electrically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as a boot program). Examples of volatile memory include, but are not limited to, Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Phase Change Memory (PCM), and magnetic disks or tape.

The storage device 630 can provide mass storage for the computing device 600. In some implementations, the storage device 630 is a computer-readable medium. In various different embodiments, the storage device 630 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state storage device, or an array of devices, including devices in a storage area network or other configurations. In further embodiments, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions which, when executed, perform one or more methods, such as those described above. The information carrier is a computer-or machine-readable medium, such as the memory 620, the storage device 630, or memory on processor 610.

The high speed controller 640 manages bandwidth-intensive operations for the computing device 600, while the low speed controller 660 manages lower bandwidth-intensive operations. Such allocation of duties is exemplary only. In some implementations, the high-speed controller 640 is coupled to memory 620, display 680 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 650, which may accept various expansion cards (not shown). In some implementations, the low-speed controller 660 is coupled to the storage device 630 and the low-speed expansion port 690. The low-speed expansion port 690, which may include various communication ports (e.g., USB, bluetooth, ethernet, wireless ethernet), may be coupled to one or more input/output devices (such as a keyboard, pointing device, scanner) or network devices (such as a switch or router), for example, through a network adapter.

As shown, the computing device 600 may be implemented in a number of different forms. For example, it may be implemented as a standard server 600a, or multiple times in a group of such servers 600a, as a laptop computer 600b, or as part of a rack server system 600c.

Various implementations of the systems and techniques described here can be implemented in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, non-transitory computer-readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

The processes and logic flows described in this specification can be performed by one or more programmable processors (also known as data processing hardware) executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. However, a computer need not have such devices. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, one or more aspects of the disclosure may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube), LCD (liquid crystal display) display, or touch screen) for displaying information to the user and, optionally, a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user may provide input to the computer. Other types of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. Further, the computer may interact with the user by sending and receiving documents to and from the device used by the user; for example, by sending a web page to a web browser on the user's client device in response to a request received from the web browser.

A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.
