Hot word aware speech synthesis
Reading note: This technology, Hot word aware speech synthesis, was designed and created by A. Kracun and M. Sharifi on 2018-06-25. Its main content includes: A method (400) comprising: receiving text input data (302) for conversion into synthesized speech (160), and determining, using a hotword perception model (320) trained to detect the presence of a hotword (130) assigned to a user device (110), whether a pronunciation of the text input data includes the hotword. The hotword is configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in audio input data. When the pronunciation of the text input data includes the hotword, the method further includes generating an audio output signal (304) from the text input data and providing the audio output signal to an audio output device (118) to output the audio output signal. The audio output signal, when captured by an audio capture device of the user device, is configured to prevent initiation of the wake-up process on the user device.
1. A method (400) comprising:
receiving text input data (302) for conversion into synthesized speech (160) at data processing hardware (112) of a speech synthesis apparatus (300);
determining, by the data processing hardware (112) and using a hotword perception model (320) trained to detect the presence of at least one hotword (130) assigned to a user device (110), whether a pronunciation of the text input data (302) includes a hotword (130), the hotword (130), when included in audio input data received by the user device (110), being configured to initiate a wake-up process on the user device (110) for processing the hotword (130) in the audio input data and/or one or more other terms following the hotword (130); and
when the pronunciation of the text input data (302) includes the hotword (130):
generating an audio output signal (304) from the text input data (302); and
providing, by the data processing hardware (112), the audio output signal (304) to an audio output device (118) to output the audio output signal (304), the audio output signal (304), when captured by an audio capture device (116) of the user device (110), being configured to prevent initiation of the wake-up process on the user device (110).
2. The method (400) of claim 1, wherein determining whether the pronunciation of the text input data (302) includes the hotword (130) comprises determining that a pronunciation of at least one of a word, a subword, or a text-to-speech sequence of the text input data (302) is associated with the hotword (130).
3. The method (400) of claim 1 or 2, wherein the hotword perception model (320) is trained on a text-to-speech sequence or audio representation of hotwords (130) assigned to user devices (110).
4. The method (400) according to any one of claims 1-3, wherein the text input data (302) includes a first language and the audio output signal (304) includes a translation of the text input data (302) in a different language.
5. The method (400) according to any one of claims 1-4, further including:
detecting, by the data processing hardware (112), a presence of the user device (110) within an operating environment of the speech synthesis device (300); and
querying, by the data processing hardware (112), the user device (110) for hotwords (130) assigned to the user device (110) for training the hotword perception model (320).
6. The method (400) according to any one of claims 1-5, further comprising querying, by the data processing hardware (112), a remote hotword repository (142) to obtain at least hotwords (130) assigned to the user device (110) for training the hotword perception model (320).
7. The method (400) according to any one of claims 1-6, wherein generating the audio output signal (304) from the text input data (302) includes inserting a watermark (352) into the audio output signal (304), the watermark (352) indicating that the audio output signal (304) corresponds to synthesized speech (160) and instructing a hotword detector (200) of a user device (110) to ignore detection of hotwords (130) in the synthesized speech (160).
8. The method (400) according to any one of claims 1-6, wherein generating the audio output signal (304) from the text input data (302) includes:
determining a speech waveform representing a text-to-speech output for the text input data (302); and
altering the speech waveform by removing or altering any sounds associated with the hotword (130) to circumvent detection of the hotword (130) by a hotword detector (200) of the user device (110).
9. The method (400) according to any one of claims 1-6, wherein generating the audio output signal (304) from the text input data (302) includes:
determining a speech waveform representing the text input data (302); and
filtering an audio waveform to circumvent detection of the hotword (130) by a hotword detector (200) of the user device (110).
10. A method (500), comprising:
receiving, at a hotword detector (200) of a user device (110), audio input data comprising a hotword (130), the hotword (130) being configured to initiate a wake-up process on the user device (110) for processing the hotword (130) and/or one or more other terms following the hotword (130) in the audio input data;
determining, by the hotword detector (200), whether the audio input data includes synthesized speech (160) using a hotword detector model (220), the hotword detector model (220) configured to detect the presence of hotwords (130) and the synthesized speech (160) in the audio input data; and
preventing, by the hotword detector (200), when the audio input data comprises synthesized speech (160), a wake-up process from being initiated on a user device (110) for processing the hotword (130) and/or one or more other terms following the hotword (130) in the audio input data.
11. The method (500) of claim 10, wherein the hotword detector model (220) is trained on a plurality of training samples, the plurality of training samples including:
a positive training sample (212b) comprising artificially generated audio data corresponding to one or more users speaking hotwords (130) assigned to the user device (110); and
negative training samples (212a) comprising synthesized speech utterances (160) output from one or more speech synthesizer devices (300).
12. The method (500) of claim 11, wherein at least one of the synthesized speech utterances (160) of the negative training sample (212a) pronounces the hotword (130) assigned to the user device (110).
13. The method (500) of claim 11, wherein none of the synthesized speech utterances (160) of the negative training sample (212a) pronounces the hotword (130) assigned to the user device (110).
14. The method (500) according to any one of claims 10-13, wherein determining whether the audio input data includes the synthesized speech (160) includes detecting a presence of synthesized speech (160) in the audio input data by analyzing acoustic features of the audio input data using the hotword detector model (220) without transcribing or semantically interpreting the audio input data.
15. A system (100) comprising:
data processing hardware (112) of a speech synthesis apparatus (300); and
memory hardware (114) in communication with the data processing hardware (112), the memory hardware (114) storing instructions that, when executed by the data processing hardware (112), cause the data processing hardware (112) to perform operations comprising:
receiving text input data (302) for conversion into synthesized speech (160);
determining, using a hotword perception model (320) trained to detect the presence of at least one hotword (130) assigned to a user device (110), whether a pronunciation of the text input data (302) includes a hotword (130), the hotword (130), when included in audio input data received by the user device (110), being configured to initiate a wake-up process on the user device (110) for processing the hotword (130) in the audio input data and/or one or more other terms following the hotword (130); and
when the pronunciation of the text input data (302) includes the hotword (130):
generating an audio output signal (304) from the text input data (302); and
providing the audio output signal (304) to an audio output device (118) to output the audio output signal (304), the audio output signal (304), when captured by an audio capture device (116) of the user device (110), being configured to prevent initiation of the wake-up process on the user device (110).
16. The system (100) of claim 15, wherein determining whether the pronunciation of the text input data (302) includes the hotword (130) comprises determining that at least one of a word, a subword, or a text-to-speech sequence of the text input data (302) is associated with the hotword (130).
17. The system (100) according to claim 15 or 16, wherein the hotword perception model (320) is trained on a text-to-speech sequence or audio representation of hotwords (130) assigned to user devices (110).
18. The system (100) according to any one of claims 15-17, wherein the text input data (302) includes a first language and the audio output signal (304) includes a translation of the text input data (302) in a different language.
19. The system (100) according to any one of claims 15-18, wherein the operations further include:
detecting a presence of a user device (110) within an operating environment of the speech synthesis device (300); and
querying the user device (110) for hotwords (130) assigned to the user device (110) for training the hotword perception model (320).
20. The system (100) according to any one of claims 15-19, wherein the operations further include querying a remote hotword repository (142) to obtain at least hotwords (130) assigned to user devices (110) for training the hotword perception model (320).
21. The system (100) according to any one of claims 15-20, wherein generating the audio output signal (304) from the text input data (302) includes inserting a watermark (352) into the audio output signal (304), the watermark (352) indicating that the audio output signal (304) corresponds to synthesized speech (160) and instructing a hotword detector (200) of the user device (110) to ignore detection of hotwords (130) in the synthesized speech (160).
22. The system (100) according to any one of claims 15-20, wherein generating the audio output signal (304) from the text input data (302) includes:
determining a speech waveform representing a text-to-speech output for the text input data (302); and
altering the speech waveform by removing or altering any sounds associated with the hotword (130) to circumvent detection of the hotword (130) by a hotword detector (200) of the user device (110).
23. The system (100) according to any one of claims 15-20, wherein generating the audio output signal (304) from the text input data (302) includes:
determining a speech waveform representing the text input data (302); and
filtering an audio waveform to circumvent detection of the hotword (130) by a hotword detector (200) of the user device (110).
24. A system (100) comprising:
data processing hardware (112) of a user device (110); and
memory hardware (114) in communication with the data processing hardware (112), the memory hardware (114) storing instructions that, when executed by the data processing hardware (112), cause the data processing hardware (112) to perform operations comprising:
receiving, at a hotword detector (200) of a user device (110), audio input data comprising a hotword (130), the hotword (130) being configured to initiate a wake-up process on the user device (110) for processing the hotword (130) and/or one or more other terms following the hotword (130) in the audio input data;
determining, by the hotword detector (200), whether the audio input data includes synthesized speech (160) using a hotword detector model (220), the hotword detector model (220) configured to detect the presence of hotwords (130) and the synthesized speech (160) in the audio input data; and
preventing, by the hotword detector (200), when the audio input data comprises synthesized speech (160), a wake-up process from being initiated on a user device (110) for processing the hotword (130) and/or one or more other terms following the hotword (130) in the audio input data.
25. The system (100) according to claim 24, wherein the hotword detector model (220) is trained on a plurality of training samples including:
a positive training sample (212b) comprising artificially generated audio data corresponding to one or more users speaking hotwords (130) assigned to the user device (110); and
negative training samples (212a) comprising synthesized speech utterances (160) output from one or more speech synthesizer devices (300).
26. The system (100) of claim 25, wherein at least one of the synthesized speech utterances (160) of the negative training sample (212a) pronounces the hotword (130) assigned to the user device (110).
27. The system (100) of claim 25, wherein none of the synthesized speech utterances (160) of the negative training sample (212a) pronounces the hotword (130) assigned to the user device (110).
28. The system (100) according to any one of claims 24-27, wherein determining whether the audio input data includes synthesized speech (160) includes detecting a presence of synthesized speech (160) in the audio input data by analyzing acoustic features of the audio input data using the hotword detector model (220) without transcribing or semantically interpreting the audio input data.
Technical Field
The present disclosure relates to hotword-aware speech synthesis.
Background
Voice-enabled environments (e.g., a home, workplace, school, or automobile) allow a user to speak a query or command aloud to a computer-based system that fields and answers the query and/or performs a function based on the command. A voice-enabled environment may be implemented using a network of connected microphone devices distributed across different rooms or areas of the environment. These devices may use hotwords to help discern when a given utterance is directed at the system, as opposed to an utterance directed at another individual present in the environment. Accordingly, a device may operate in a sleep or dormant state and wake up only when a detected utterance includes a hotword. Once the device is awakened by the hotword within the detected utterance, the device performs further processing on the hotword and/or one or more terms following the hotword. In other words, the hotword and/or the one or more terms form a query or voice command to be executed by the device. As speech synthesizers become more prevalent in speech-enabled environments, synthesized utterances that contain hotwords, or that include other words/subwords that sound like hotwords, can cause a device to wake up from a sleep/dormant state and begin processing the synthesized utterance even when the synthesized utterance is not directed at the device. In other words, synthesized speech can inadvertently activate the device, which is often frustrating for users of speech synthesizers. Thus, a system that receives speech within an environment must have some way to distinguish between utterances of human speech directed at the system and utterances of synthesized speech output from nearby devices that are not directed at the system.
Disclosure of Invention
One aspect of the present disclosure provides a method for preventing initiation of a wake-up process on a user device. The method includes receiving, at data processing hardware of a speech synthesis device, text input data for conversion into synthesized speech, and determining, by the data processing hardware and using a hotword perception model trained to detect the presence of at least one hotword assigned to the user device, whether a pronunciation of the text input data includes a hotword. The hotword, when included in audio input data received by the user device, is configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data. When the pronunciation of the text input data includes the hotword, the method further includes generating an audio output signal from the text input data, and providing, by the data processing hardware, the audio output signal to an audio output device to output the audio output signal. The audio output signal, when captured by an audio capture device of the user device, is configured to prevent initiation of the wake-up process on the user device.
Implementations of the disclosure may include one or more of the following optional features. In some implementations, determining whether the pronunciation of the text input data includes the hotword includes determining that a pronunciation of at least one of a word, a subword, or a text-to-speech sequence of the text input data is associated with the hotword. The hotword perception model may be trained on text-to-speech sequences or audio representations of hotwords assigned to the user device. Further, the text input data may include a first language, and the audio output signal may include a translation of the text input data in a different language.
In some examples, the method further comprises detecting, by the data processing hardware, a presence of a user device within an operating environment of the speech synthesis device; and querying, by the data processing hardware, the user device for hotwords assigned to the user device for training a hotword perception model. Additionally or alternatively, the method may include querying a remote hotword repository to obtain at least hotwords assigned to the user device for training a hotword perception model.
In some implementations, generating the audio output signal from the text input data includes inserting a watermark into the audio output signal, the watermark indicating that the audio output signal corresponds to synthesized speech and instructing a hotword detector of the user device to ignore detection of hotwords in the synthesized speech. In other implementations, generating the audio output signal includes determining a speech waveform representing a text-to-speech output for the text input data, and altering the speech waveform by removing or altering any sounds associated with the hotword to circumvent detection of the hotword by a hotword detector of the user device. In yet other implementations, generating the audio output signal includes determining a speech waveform representing the text input data and filtering the audio waveform to circumvent detection of the hotword by a hotword detector of the user device.
Another aspect of the present disclosure provides a method for preventing initiation of a wake-up process on a user device. The method includes receiving, at a hotword detector of the user device, audio input data containing a hotword, the hotword configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data; determining, by the hotword detector and using a hotword detector model, whether the audio input data includes synthesized speech, the hotword detector model configured to detect the presence of hotwords and synthesized speech in the audio input data; and, when the audio input data includes synthesized speech, preventing, by the hotword detector, initiation of the wake-up process on the user device for processing the hotword and/or the one or more other terms following the hotword in the audio input data.
This aspect may include one or more of the following optional features. In some embodiments, the hotword detector model is trained on a plurality of training samples including positive training samples and negative training samples. The training samples include artificially generated audio data corresponding to one or more users speaking hotwords assigned to the user devices. The negative training samples include synthesized speech utterances output from one or more speech synthesizer devices. In some examples, at least one of the synthesized speech utterances of the negative training sample pronounces a hotword assigned to the user device. In other examples, none of the synthesized speech utterances of the negative training samples pronounces the hotword assigned to the user device. Determining whether the audio input data includes synthesized speech may include using a hotword detector model to detect the presence of synthesized speech in the audio input data by analyzing acoustic features of the audio input data without transcribing or semantically interpreting the audio input data.
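The split between positive and negative training samples described above can be illustrated with a short sketch. The clip identifiers, helper name, and labeling scheme are hypothetical and not taken from the disclosure; this only shows how such a training set might be assembled:

```python
from dataclasses import dataclass

@dataclass
class TrainingSample:
    audio_id: str
    is_synthesized: bool   # True for negative (synthesized speech) samples
    contains_hotword: bool

def build_training_set(human_hotword_clips, tts_clips):
    """Label positive samples (users speaking the hotword) and negative
    samples (synthesized utterances, which may or may not pronounce it)."""
    samples = [TrainingSample(cid, is_synthesized=False, contains_hotword=True)
               for cid in human_hotword_clips]
    for cid, says_hotword in tts_clips:
        samples.append(TrainingSample(cid, is_synthesized=True,
                                      contains_hotword=says_hotword))
    return samples
```

Note that, matching the optional features above, negative samples are negative because they are synthesized, regardless of whether they pronounce the hotword.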
Another aspect of the present disclosure provides a system for preventing initiation of a wake-up process on a user device. The system includes data processing hardware of a speech synthesis device and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that, when executed by the data processing hardware, cause the data processing hardware to perform operations including receiving text input data for conversion into synthesized speech, and determining, using a hotword perception model trained to detect the presence of at least one hotword assigned to the user device, whether a pronunciation of the text input data includes a hotword. The hotword, when included in audio input data received by the user device, is configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data. When the pronunciation of the text input data includes the hotword, the operations further include generating an audio output signal from the text input data and providing the audio output signal to an audio output device to output the audio output signal. The audio output signal, when captured by an audio capture device of the user device, is configured to prevent initiation of the wake-up process on the user device.
Implementations of the disclosure may include one or more of the following optional features. In some implementations, determining whether the pronunciation of the text input data includes the hotword includes determining that at least one of a word, a subword, or a text-to-speech sequence of the text input data is associated with the hotword. The hotword perception model may be trained on text-to-speech sequences or audio representations of hotwords assigned to the user device. Further, the text input data may include a first language and the audio output signal may include a translation of the text input data in a different language.
In some examples, the operations further include detecting a presence of a user device within an operating environment of the speech synthesis device, and querying the user device for hotwords assigned to the user device for training a hotword perception model. Additionally or alternatively, the operations may further include querying a remote hotword repository to obtain at least hotwords assigned to the user device for training a hotword perception model.
In some implementations, generating the audio output signal from the text input data includes inserting a watermark into the audio output signal, the watermark indicating that the audio output signal corresponds to synthesized speech and instructing a hotword detector of the user device to ignore detection of hotwords in the synthesized speech. In other implementations, generating the audio output signal includes determining a speech waveform representing a text-to-speech output for the text input data, and altering the speech waveform by removing or altering any sounds associated with the hotword to circumvent detection of the hotword by a hotword detector of the user device. In yet other implementations, generating the audio output signal includes determining a speech waveform representing the text input data and filtering the audio waveform to circumvent detection of the hotword by a hotword detector of the user device.
Another aspect of the present disclosure provides a system for preventing initiation of a wake-up process on a user device. The system includes data processing hardware of the user device and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that, when executed by the data processing hardware, cause the data processing hardware to perform operations including: receiving, at a hotword detector of the user device, audio input data containing a hotword configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data; determining, by the hotword detector and using a hotword detector model, whether the audio input data includes synthesized speech, the hotword detector model configured to detect the presence of hotwords and synthesized speech in the audio input data; and, when the audio input data includes synthesized speech, preventing, by the hotword detector, initiation of the wake-up process on the user device.
This aspect may include one or more of the following optional features. In some embodiments, the hotword detector model is trained on a plurality of training samples including positive training samples and negative training samples. The training samples include artificially generated audio data corresponding to one or more users speaking hotwords assigned to the user devices. The negative training samples include synthesized speech utterances output from one or more speech synthesizer devices. In some examples, at least one of the synthesized speech utterances of the negative training sample pronounces a hotword assigned to the user device. In other examples, none of the synthesized speech utterances of the negative training samples pronounces the hotword assigned to the user device. Determining whether the audio input data includes synthesized speech may include using a hotword detector model to detect the presence of synthesized speech in the audio input data by analyzing acoustic features of the audio input data without transcribing or semantically interpreting the audio input data.
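Classifying audio from acoustic features alone, without transcription or semantic interpretation, can be sketched as follows. The feature (frame-energy regularity) and threshold are hypothetical stand-ins for the trained hotword detector model; a real detector would use learned acoustic features:

```python
import statistics

def frame_energies(frames):
    """Per-frame energy; the only information consumed is acoustic."""
    return [sum(s * s for s in frame) / len(frame) for frame in frames]

def looks_synthesized(frames, variance_threshold=1e-4):
    """Heuristic stand-in for a trained model: flag audio whose frame-energy
    variation is unnaturally regular. Note that the audio is never
    transcribed or semantically interpreted."""
    energies = frame_energies(frames)
    return statistics.pvariance(energies) < variance_threshold
```

In this toy version, perfectly uniform frames are flagged as synthesized while frames with natural energy variation are not; an actual implementation would learn the decision boundary from the positive and negative training samples described above.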
The details of one or more embodiments of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.
Drawings
FIG. 1 is a schematic diagram of an example voice-enabled environment.
FIG. 2 is a schematic diagram of an example hotword detector from a speech-enabled environment.
FIGS. 3A and 3B are schematic diagrams of an example synthesized speech system incorporating a hotword perception trainer.
FIG. 4 is a flow diagram of an example arrangement of operations of a method for detecting the presence of hotwords in text input data for conversion to synthesized speech at a speech synthesis device.
FIG. 5 is a flow diagram of an example arrangement of operations of a method for preventing initiation of a wake-up process on a user device when audio input data includes synthesized speech.
FIG. 6 is a schematic diagram of an example computing device that may be used to implement the systems and methods described herein.
Like reference symbols in the various drawings indicate like elements.
Detailed Description
In a speech-enabled environment, the manner in which users interact with a computer-based system is designed primarily, if not exclusively, through voice input (i.e., audio commands). The system may be implemented using a network of connected microphone devices distributed throughout the environment (e.g., rooms or other areas of a home, workplace, school, etc.), and more and more devices use audio commands to direct the operation of a user device. By using "hotwords" (also referred to as "attention words", "wake-up phrases/words", "trigger phrases", or "voice action initiation commands"), in which a predetermined term (i.e., keyword) is reserved by agreement to be spoken to draw the attention of the system, the system is able to discern between utterances directed at the system (i.e., for initiating a wake-up process for processing one or more terms following the hotword in the utterance) and utterances directed at individuals in the environment. In other words, the user device may operate in a low-power mode but, upon detection of a hotword, may switch to a full-power mode in order to detect, process, and analyze all audio data captured by the microphone. However, as the output of synthesized speech from speech synthesizers (e.g., text-to-speech (TTS) systems) becomes more prevalent within speech-enabled environments, synthesized speech that includes a hotword assigned to a nearby user device, or words or subwords that make up or sound like the hotword, may inadvertently cause a hotword detector on the user device to detect the presence of the hotword and initiate a wake-up process for processing terms in the synthesized speech. As used herein, the terms "synthesized speech" and "synthesized utterance" are used interchangeably. As used herein, synthesized speech output from a TTS system or speech synthesis device refers to machine output originating from a non-audible data input.
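The low-power/full-power transition described above can be sketched as a toy state machine. This is an illustrative sketch only; the class and method names are hypothetical and not part of this disclosure:

```python
from enum import Enum

class PowerState(Enum):
    LOW_POWER = "low_power"    # only the hotword detector runs
    FULL_POWER = "full_power"  # full capture/processing pipeline runs

class UserDevice:
    """Minimal sketch of a hotword-gated wake-up process."""

    def __init__(self, hotword: str):
        self.hotword_terms = hotword.lower().split()
        self.state = PowerState.LOW_POWER

    def on_utterance(self, transcript: str) -> bool:
        """Wake up only if the utterance begins with the assigned hotword;
        the terms following the hotword are then processed as a command."""
        words = transcript.lower().split()
        if words[:len(self.hotword_terms)] == self.hotword_terms:
            self.state = PowerState.FULL_POWER
            return True
        return False

device = UserDevice("hey assistant")
assert not device.on_utterance("what time is it")   # stays asleep
assert device.on_utterance("hey assistant set a timer")
assert device.state is PowerState.FULL_POWER
```

The problem the disclosure addresses is precisely that `on_utterance` here cannot tell whether the transcript came from a human or from a nearby speech synthesizer.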
The machine output may inform the user of an operation being performed by a device associated with the TTS system or confirm an instruction provided by the user to the device associated with the TTS system. Thus, synthesized speech may be distinguished from broadcast audio output from a television, multimedia set-top box, stereo system, radio, computer system, or other type of device capable of outputting broadcast audio.
For example, in a voice-enabled environment (such as a user's home), a user may have one or more mobile devices (e.g., a smartphone and/or a tablet) and a smart speaker/display device. The smart speaker/display device may serve as a digital assistant that outputs synthesized speech and triggers the processing of a voice query or voice command to be executed when preceded by the hotword assigned to the respective user device. A scenario may occur in which synthesized speech output from one of the devices (e.g., the smart speaker) and directed to the user contains one or more words or subwords that constitute a hotword assigned to another device in the environment (e.g., the user's tablet). For example, the term "dog" may be designated as the hotword for the user's tablet, and a portion of the synthesized speech may recite the phrase "hot dog". As a result, the microphone of the other device may capture the synthesized speech, and the hotword detector may detect the term "dog" following the term "hot" and trigger the user's tablet to inadvertently initiate a wake-up process. Thus, pronunciation of hotwords in synthesized speech may inadvertently cause a nearby speech-enabled device to transition from a sleep/dormant state to an active state in which it begins processing (i.e., transcribing and/or semantically interpreting) the synthesized speech.
It is an object of the present disclosure to avoid initiating a wake-up process on one or more other user devices when TTS audio (e.g., synthesized speech) contains hotwords or other terms that sound like hotwords. This prevents accidental initiation of the wake-up process, allowing a user device to remain in a low-power state for longer and thereby save power.
To prevent inadvertent initiation of a wake-up process in response to detecting the pronunciation of a hotword in a synthesized utterance, embodiments herein are directed to injecting hotwords assigned to proximate devices into the training pipeline of a TTS system to generate a hotword-aware model for detecting the presence of the hotwords. The hotword perception model may be trained on any combination of hotwords assigned to proximate devices, a list of hotwords associated with one or more devices that a particular user owns or controls, and/or a list of all potential hotwords that may be assigned to any given device for initiating a wake-up process. For example, the speech synthesizer device may use the hotword perception model to determine whether the pronunciation of text input data for conversion into synthesized speech includes a hotword. In some examples, the hotword perception model is trained on audio representations (e.g., acoustic features) of the hotwords, such as sequences or strings representing the hotwords. Thus, a speech synthesis device receiving text input data (textual content) for conversion into synthesized speech may pre-process the text input data to obtain individual text-to-speech (TTS) sequences, and use the hotword perception model to recognize sequences that, when audibly pronounced, constitute hotwords or sound-alike phrases of hotwords, by identifying matches or similarities between the TTS sequences and hotword sequences obtained from the hotword perception model. For example, text input data comprising the phrase "dawg", when audibly pronounced, constitutes a sound-alike phrase of the hotword "dog". Thus, the hotword perception model is trained to detect whether the pronunciation of the text input data includes a hotword (i.e., constitutes the hotword or a sound-alike phrase of the hotword).
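The sequence-matching idea can be illustrated with a toy phoneme lexicon. This is a sketch only: a real system would obtain phoneme sequences from a grapheme-to-phoneme model rather than the hard-coded table assumed here, and would score similarity with a trained model rather than exact subsequence matching:

```python
# Toy phoneme lexicon (ARPAbet-style symbols); a real system would use a
# grapheme-to-phoneme model covering the whole vocabulary.
LEXICON = {
    "dog":  ("D", "AO", "G"),
    "dawg": ("D", "AO", "G"),  # sound-alike spelling maps to the same phones
    "hot":  ("HH", "AA", "T"),
}

def to_phonemes(text: str):
    """Flatten a text into a phoneme sequence; unknown words contribute none."""
    phones = []
    for word in text.lower().split():
        phones.extend(LEXICON.get(word, ()))
    return tuple(phones)

def pronunciation_includes_hotword(text: str, hotword: str) -> bool:
    """True if the hotword's phoneme sequence occurs inside the text's."""
    target, source = to_phonemes(hotword), to_phonemes(text)
    n = len(target)
    return n > 0 and any(source[i:i + n] == target
                         for i in range(len(source) - n + 1))

assert pronunciation_includes_hotword("dawg", "dog")     # sound-alike spelling
assert pronunciation_includes_hotword("hot dog", "dog")  # embedded hotword
```

Because matching happens on phoneme sequences rather than spellings, "dawg" is caught even though the string "dog" never appears in the text.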
The TTS system may include a multilingual TTS system trained on multiple languages, such that the hotword perception model is trained to detect hotwords, or sound-alike phrases of the hotwords, in the multiple languages.
If the speech synthesis apparatus simply saved a log or whitelist of known hotwords in text form, as opposed to using a hotword perception model, the speech synthesis apparatus would be unable to recognize misspelled words in the text input data that constitute hotwords, and unable to recognize subwords that constitute hotwords within larger words. For example, if the speech synthesis device simply referenced a whitelist of known hotwords, the speech synthesis device could not recognize that text input data for the phrase "dawg" constitutes the hotword "dog" (unless the spelling "dawg" were included in the whitelist), and could not recognize the subword "dog" in text input data for the phrase "hotdog" (unless "hotdog" were included in the whitelist).
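The advantage of phoneme-level matching over a text whitelist, as described above, can be illustrated with a minimal sketch. The toy lexicon and ARPAbet-style phoneme strings below are illustrative stand-ins for a real grapheme-to-phoneme front end; they are not part of this disclosure:

```python
# Toy sketch: phoneme-level hotword matching vs. a text whitelist.
# Lexicon entries map a spelled word to a hypothetical phoneme string.
LEXICON = {
    "dawg": "D AO G",            # sound-alike spelling of "dog"
    "hotdog": "HH AA T D AO G",  # contains "dog" as a subword
    "cat": "K AE T",
}
HOTWORD_PHONEMES = "D AO G"      # pronunciation of the hotword "dog"

def whitelist_hit(word, whitelist=("dog",)):
    """Text-only check: misses misspellings and subwords."""
    return word in whitelist

def phoneme_hit(word):
    """Phoneme-substring check: catches sound-alikes and subwords."""
    return HOTWORD_PHONEMES in LEXICON.get(word, "")

for word in ("dawg", "hotdog", "cat"):
    print(word, whitelist_hit(word), phoneme_hit(word))
```

Here "dawg" and "hotdog" both trip the phoneme check while the whitelist misses them, matching the two failure cases discussed above.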
Once the speech synthesis device determines that the pronunciation of the text input data includes a hotword, embodiments further include a waveform generator of the speech synthesis device that generates an audio output signal of the synthesized speech, the waveform generator configured to prevent initiation of a wake-up process on a proximate user device when the audio output signal is captured by a microphone of the proximate user device. In some examples, the waveform generator uses unit selection logic for generating the output audio signal. In these examples, the waveform generator may transmit a known watermark over the audio sequence, where the known watermark is identifiable to a hotword detector on a proximate user device; thus, the hotword detector on the nearby user device will simply ignore the audio output signal carrying the known watermark, even if the audio output signal pronounces the hotword. Alternatively, the unit selection logic may select alternative variants of the units (or a subset of the units) used to generate the synthesized speech (e.g., the audio output signal) that are known to be adversarial to the hotword detection model used by the hotword detector of the proximate user device. Here, the hotword detection model may be trained on these same adversarial units, so that during inference the hotword detector knows to ignore any utterances that include these units, thereby preventing initiation of the wake-up process even if the utterances contain hotwords. Further, the waveform generator may distort the synthesized speech using a filter trained against the hotword detectors of proximate user devices, such that the hotword detectors ignore or fail to detect the synthesized speech.
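One way to realize the known-watermark idea above is a simple spread-spectrum marker: the synthesizer adds a low-amplitude pseudo-random sequence derived from a shared key, and a cooperating hotword detector correlates the captured audio against the same sequence and ignores audio that scores high. The following is a minimal sketch under assumed parameters; the key, gain, and threshold are illustrative, and the disclosure does not prescribe a specific watermarking scheme:

```python
import random

# Sketch of a spread-spectrum watermark a TTS waveform generator might
# embed so a cooperating hotword detector can ignore synthesized speech.
KEY = 1234      # shared secret between synthesizer and detector (assumed)
WM_LEN = 256    # watermark length in samples
GAIN = 0.01     # keep the watermark far below speech amplitude

def watermark_sequence():
    """Deterministic +/-1 sequence derived from the shared key."""
    rng = random.Random(KEY)
    return [rng.choice((-1.0, 1.0)) for _ in range(WM_LEN)]

def embed(samples):
    """Add the low-amplitude watermark to an audio sample list."""
    wm = watermark_sequence()
    return [s + GAIN * wm[i % WM_LEN] for i, s in enumerate(samples)]

def detect(samples, threshold=0.5):
    """Correlate against the known watermark; high score => synthesized."""
    wm = watermark_sequence()
    score = sum(s * wm[i % WM_LEN] for i, s in enumerate(samples))
    return score / (GAIN * len(samples)) > threshold

speech = [0.0] * 1024            # stand-in for synthesized speech samples
print(detect(embed(speech)))     # watermarked audio is flagged
print(detect(speech))            # ordinary audio is processed normally
```

A detector that flags the watermark would then suppress its wake-up trigger even when the audio pronounces the hotword.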
In other examples, the waveform generator may generate the output audio signal by using a neural network (e.g., based on WaveNet) to output an audio sequence of synthesized phonemes representing the text input data. In these examples, when a portion of the synthesized phonemes forms a hotword, the waveform generator may provide additional conditioning information that causes the neural network to emit a known watermark over the audio sequence that is identifiable to a hotword detector on a neighboring user device, such that the neighboring user device may simply ignore the audio output signal even if the hotword is pronounced. In other words, the presence of the watermark instructs the nearby user device to ignore the pronounced hotword. Alternatively, the synthesized speech segments output from the neural network that constitute the hotword (or a sound-alike phrase of the hotword) may be modified (e.g., distorted) to generate an output audio signal in a manner adversarial to detection by the hotword detector of the nearby user device.
Additionally or alternatively, embodiments may also include injecting synthesized speech utterances into a training pipeline of the hotword detector to generate a hotword detector model. The hotword detector model is configured to detect the presence of synthesized speech in the audio input data received by the hotword detector. For example, the hotword detector trainer may train the hotword detector to detect hotwords in an utterance and further determine whether the utterance includes synthesized speech, such as audio data output from a speech synthesis device (e.g., a TTS system). Thus, when a microphone on the user device captures an utterance containing a hotword assigned to the user device, if the hotword detector detects that the utterance includes synthesized speech, the hotword detector will simply ignore the presence of the hotword in the captured utterance, thereby preventing the wake-up process from being initiated on the user device. In some examples, the hotword detector model is trained on positive training examples including human-generated audio data corresponding to one or more users speaking a hotword assigned to the user device, and negative training examples including synthesized speech utterances output from one or more speech synthesizer devices. By training the hotword detector model to detect the presence of synthesized speech in the audio input data, the hotword detector may advantageously use the hotword detector model to detect the presence of synthesized speech by analyzing acoustic features of the received audio input data, without transcribing or semantically interpreting the audio input data.
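The positive/negative training setup described above can be sketched as follows. The function and label names are illustrative, and the file-name strings stand in for audio clips; a real trainer would operate on acoustic features:

```python
# Sketch: assembling labeled examples for a hotword detector model that
# rejects synthesized speech. Strings stand in for audio clips.
def make_training_examples(human_hotword_clips, synthesized_clips):
    """Label 1: a human speaking the hotword (should trigger wake-up).
    Label 0: synthesized speech (must not trigger wake-up)."""
    examples = [(clip, 1) for clip in human_hotword_clips]
    examples += [(clip, 0) for clip in synthesized_clips]
    return examples

examples = make_training_examples(["human_1.wav", "human_2.wav"],
                                  ["tts_1.wav"])
print(examples)
# [('human_1.wav', 1), ('human_2.wav', 1), ('tts_1.wav', 0)]
```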
Referring to FIG. 1, in some implementations, a speech-enabled system 100 includes one or more user devices 110, 110a-b. For example, the speech-enabled system 100 includes two user devices 110a, 110b in close proximity to each other and connected to a remote server 140 (e.g., a cloud computing environment) via a network 120. The user devices 110a, 110b may or may not communicate with each other. Each user device 110 is configured to capture sound corresponding to an utterance 150.
Each user device 110 may be associated with a user 10 and capable of processing an
To detect the presence of hotword 130 within
In some examples, the user 10 or the user device 110 generates a hotword query 132 to identify hotwords 130 of interest to the user 10 and/or the user device 110. In some implementations, the user device 110 communicates with the remote server 140 via the network 120 to identify and/or receive the hotword 130 from a hotword repository 142 in communication with the remote server 140. In some examples, the hotword query 132 may include a user identifier that maps to all hotwords 130 assigned to the user devices 110 owned by the user 10 associated with the user identifier. Additionally or alternatively, the user device 110 may obtain an identifier (e.g., a Media Access Control (MAC) identifier) associated with each neighboring user device 110 and provide the identifier in a query 132 to obtain from the repository 142 all hotwords 130 associated with each identifier. The hotword repository 142 may include any combination of hotwords 130 assigned to proximate devices 110, a list of hotwords 130 associated with one or more devices 110 owned and/or controlled by a particular user 10, and/or a list of all potential hotwords 130 that may be assigned to any given device 110 for initiating a wake-up process (e.g., global hotwords associated with a particular type(s) of device(s) 110). By generating the hotword query 132, the hotword(s) 130 may be received to form a robust hotword training process for the hotword detector 200. Still referring to FIG. 1, each user device 110 is configured to send and/or receive hotword queries 132 for one or more other user devices 110 to learn and/or compile the hotword(s) 130 assigned to the other user devices 110.
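A hotword query 132 carrying a user identifier and/or proximate-device identifiers, as described above, might be composed as in the following sketch. The field names and wire format are hypothetical; the disclosure does not specify a query encoding:

```python
# Sketch: composing a hotword query 132 for the hotword repository 142.
# Field names ("user_id", "device_ids") are illustrative assumptions.
def build_hotword_query(user_id=None, device_macs=()):
    """user_id maps to all hotwords of devices the user owns/controls;
    device_macs are MAC identifiers of proximate user devices."""
    query = {}
    if user_id is not None:
        query["user_id"] = user_id
    if device_macs:
        query["device_ids"] = list(device_macs)
    return query

q = build_hotword_query(user_id="user-10",
                        device_macs=["aa:bb:cc:dd:ee:ff"])
print(q)  # {'user_id': 'user-10', 'device_ids': ['aa:bb:cc:dd:ee:ff']}
```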
Each user device 110 may also be configured as a speech synthesis device. As a speech synthesis device, user device 110 may also include a
In some examples, the
In the illustrated example, the speech-enabled system 100 includes a first user device 110a and a second user device 110b. The second user device 110b may be considered a proximate device to the first user device 110a, or vice versa. Here, the user devices 110a, 110b are considered "proximate" to one another when the respective audio capture device 116 on one
In some examples, the user devices 110, 110a-b each correspond to a user 10 speaking a word or sub-word over one or more networks 120. For example, the user 10 may speak a first utterance 150a detectable by the first user device 110a that includes "OK Google: remind me that the first thing in the morning is to restart the computer at work." Here, the phrase "OK Google" is the hotword 130 assigned to the user device 110a, such that the hotword detector 200 triggers the user device 110a to initiate a wake-up process for processing the hotword 130 in the audio input data and/or one or more other terms following the hotword 130 (e.g., the remainder of the first utterance 150a, "remind me that the first thing in the morning is to restart the computer at work"). In this example, the first user device 110a responds to the first utterance 150a with a
Similarly, the second user device 110, 110b may be assigned the hotword 130 "start computer". In this configuration, the user 10 expects the second user device 110, 110b to initiate a wake-up process when the user 10 speaks the hotword 130 "start computer". Thus, when the user 10 speaks a second utterance 150b detectable by the second user device 110b that includes "start computer: music from the 70s music playlist," the phrase "start computer" causes the hotword detector 200 to trigger the second user device 110b to initiate a wake-up process for processing the hotword 130 in the audio input data and/or the one or more other terms "music from the 70s music playlist."
When two user devices 110 are in proximity, the
FIG. 2 is an example of a hotword detector 200 within the user device 110 of the speech-enabled system 100. The hotword detector 200 is configured to determine whether audio input data, such as the
In some examples, hotword detector 200 includes hotword detector trainer 210 and
In some implementations, the hotword detector trainer 210 trains the
Optionally, the hotword detector trainer 210 may additionally or alternatively train the
In contrast, positive training example 212b is an audio sample of
With continued reference to FIG. 2, hotword detector 200 of user device 110 implements
In some configurations, the hotword detector trainer 210 is configured to separate the training examples 212 into a training set and an evaluation set (e.g., 90% training and 10% evaluation). With these sets, the hotword detector trainer 210 trains the
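The 90%/10% training/evaluation split of the training examples 212 described above can be sketched as follows; the seeded shuffle is an assumption added for reproducibility:

```python
import random

# Sketch: splitting labeled training examples 212 into a training set
# and an evaluation set (e.g., 90% training, 10% evaluation).
def split_examples(examples, train_frac=0.9, seed=0):
    """Shuffle the examples, then split them at the train fraction."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, evaluation = split_examples(list(range(100)))
print(len(train), len(evaluation))  # 90 10
```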
Additionally or alternatively, the
Fig. 3A and 3B are examples of a
In some examples, the hotword perception trainer 310 utilizes the hotword query 132 to obtain the hotword 130 or the list of hotwords 130 (e.g., from the hotword repository 142 or directly from the proximate user device 110). As previously described, the hotword query 132 may obtain any combination of the hotwords 130 assigned to neighboring devices 110, a list of hotwords 130 associated with one or more devices 110, 110a-n that a particular user 10 owns and/or controls, and/or a list of all potential hotwords 130 that may be assigned to any given device 110 for initiating a wake-up process. In other examples, the user 10 of the speech-enabled system 100 or an administrator of the user device 110 pre-programs and/or updates the hotword perception trainer 310 with the hotword(s) 130. The hotword perception trainer 310 trains a
The
When the
Referring to fig. 3A,
Fig. 3B is an example of a
Similar to the
Additionally or alternatively, each
FIG. 4 is a flow diagram of an example arrangement of operations of a
At
Fig. 5 is a flow diagram of an example arrangement of operations of a
A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform tasks. In some examples, a software application may be referred to as an "application," an "app," or a "program." Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.
The non-transitory memory may be a physical device for temporarily or permanently storing programs (e.g., sequences of instructions) or data (e.g., program state information) for use by the computing device. The non-transitory memory may be volatile and/or non-volatile addressable semiconductor memory. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electrically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware such as boot programs). Examples of volatile memory include, but are not limited to, Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Phase Change Memory (PCM), and magnetic disks or tapes.
FIG. 6 is a schematic diagram of an
The
The
The
As shown, the
Various implementations of the systems and techniques described here can be implemented in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, non-transitory computer-readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
The processes and logic flows described in this specification can be performed by one or more programmable processors (also known as data processing hardware) executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and in particular by, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not require such a device. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, one or more aspects of the disclosure may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube), LCD (liquid crystal display) display, or touch screen) for displaying information to the user and, optionally, a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user may provide input to the computer. Other types of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. Further, the computer may interact with the user by sending and receiving documents to and from the device used by the user; for example, by sending a web page to a web browser on the user's client device in response to a request received from the web browser.
A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.