Linear filtering for noise-suppressed voice detection

Document No.: 914643  Publication date: 2021-02-26

Reading note: This technique, Linear filtering for noise-suppressed voice detection, was designed and created by 塞义德·巴盖里·塞雷斯基 and 丹妮亚·贾科贝洛 on 2019-05-17. Its main content includes: Systems and methods for suppressing noise and detecting speech input in a multi-channel audio signal captured by multiple microphones include: (i) capturing a first audio signal via a first microphone and a second audio signal via a second microphone, wherein the first and second audio signals comprise first and second noise content, respectively, from a noise source; (ii) identifying the first noise content in the first audio signal; (iii) determining an estimated noise content captured by the plurality of microphones using the identified first noise content; (iv) suppressing the first and second noise content in the first and second audio signals using the estimated noise content; (v) combining the suppressed first and second audio signals into a third audio signal; and (vi) determining that the third audio signal includes a speech input that includes a wake word.

1. A method for a network device (700) comprising a plurality of microphones (702, 702a-g, 802a-g), the plurality of microphones including at least a first microphone and a second microphone, the method comprising:

capturing a plurality of respective audio signals via at least a portion of the plurality of microphones (702a, 702b);

identifying first noise content at least in a first audio signal captured by the first microphone;

determining estimated noise content captured by the plurality of microphones (702, 702a-g, 802a-g) based on the identified first noise content;

generating respective adjusted audio signals by suppressing estimated noise content in the plurality of respective audio signals;

combining the adjusted audio signals into a second audio signal;

determining whether the second audio signal includes a speech input including a wake word; and

when it is determined that the second audio signal includes a speech input that includes the wake word, sending at least a portion of the speech input to a remote computing device for speech processing to recognize a speech utterance that is different from the wake word.

2. The method of claim 1, wherein identifying noise content in the audio signal captured by the microphone device comprises:

determining a probability that the audio signal includes voice content; and identifying the audio signal as providing an estimate of noise present in other audio signals captured by another of the plurality of microphone devices when the determined probability is below a threshold probability.

3. The method of claim 2, wherein determining estimated noise content captured by a plurality of microphones comprises: updating a power spectral density matrix of the noise content with the audio content of the microphone based on determining that the probability is below the threshold probability.

4. The method of any preceding claim, further comprising:

capturing, via a third microphone (702c) of the plurality of microphones, a third audio signal comprising third noise content from the noise source;

identifying the third noise content in the third audio signal; and

updating estimated noise content captured by the plurality of microphones (702, 702a-g, 802a-g) using the identified third noise content.

5. The method of any preceding claim, further comprising simultaneously capturing the plurality of respective audio signals via respective ones of the plurality of microphones.

6. The method of any preceding claim, wherein generating respective adjusted audio signals by suppressing estimated noise content in the plurality of respective audio signals comprises: passing one or more of the audio signals through at least one linear filter h_i(f) (808).

7. The method of claim 6 when dependent on claim 3, wherein passing the audio signal through the at least one linear filter h_i(f) (808) comprises: passing the updated power spectral density matrix of the noise content through the linear filter (808).

8. The method of claim 6, wherein the at least one linear filter h_i(f) depends on:

a power spectral density matrix of the captured audio signal, and

a power spectral density matrix of a noise component of the captured audio signal, and

is independent of the voice portion of the captured audio signal.

9. The method of claim 6, wherein the linear filter h_i(f) is represented as an expression in terms of:

P_yy(f), the power spectral density matrix of the audio content captured by the respective microphones;

P_vv(f), the power spectral density matrix of the noise portion of the audio content captured by the respective microphones; and

β, the inverse of a Lagrangian multiplier factor, which can be modified to tune the trade-off between signal distortion and noise reduction.

10. The method according to any one of claims 5-9, further comprising: applying first-order exponential smoothing to the captured audio signal.

11. The method of any of claims 2-10, wherein determining the probability that the first audio signal includes speech content comprises using at least one of:

a voice activity detection algorithm for detecting the presence of speech in the first audio signal; and

a speech presence probability algorithm for determining a probability of speech being present in the first audio signal.

12. The method of any preceding claim, further comprising performing an initialization over an initial time frame of a predetermined period of time by:

estimating P_yy(f) over the initial time frame of the predetermined period of time, and then

initializing P_vv^-1(f) using the inverse of the estimated P_yy(f),

wherein:

P_vv(f) is the power spectral density matrix of the noise portion of the audio content captured by the respective microphones; and

P_yy(f) is the power spectral density matrix of the audio content captured by the respective microphones.

13. The method of claim 12, wherein the predetermined period of time is 500 ms.

14. The method of claim 12 or 13 as dependent on any of claims 5-7, wherein determining the estimated noise content captured by the plurality of microphones (702, 702a-g, 802a-g) comprises:

updating P_yy(n) for all frequency bins;

calculating the speech presence probability for all frequency bins;

updating P_vv^-1(n) for all frequency bins when the calculated speech presence probability is below a threshold; and

computing the linear filter h_i(n) (808) for all frequency bins.

15. The method of any preceding claim, wherein sending at least a portion of the speech input to a remote computing device comprises: sending a portion of the speech input after the wake word to a separate computing system for speech analysis.

16. The method according to any of the preceding claims,

wherein the plurality of microphones (702, 702a-g, 802a-g) are arranged along a housing (704) of the network device (700) and are separated from each other by a distance of greater than about five centimeters, the housing at least partially enclosing components of the network device (700).

17. A tangible, non-transitory computer-readable medium storing instructions executable by one or more processors to cause a network device to perform the method of any preceding claim.

18. A network device (700) comprising a housing (704) having a plurality of microphones (702, 702a-g, 802a-g) disposed thereon, the network device configured to perform the method of any preceding claim.

Technical Field

The present disclosure relates to consumer products, and more particularly, to methods, systems, products, features, services, and other elements directed to media playback and aspects thereof.

Background

Options for accessing and listening to digital audio in an out-loud setting were limited until 2003, when SONOS, Inc. filed one of its first patent applications, entitled "Method for Synchronizing Audio Playback between Multiple Networked Devices," and began offering a media playback system for sale in 2005. The Sonos Wireless HiFi System allows people to experience music from many sources through one or more networked playback devices. Through a software control application installed on a smartphone, tablet computer, or computer, a person is able to play his or her desired content in any room with a networked playback device. In addition, using the controller, for example, different songs can be streamed to each room having a playback device, rooms can be grouped together for synchronous playback, or the same song can be listened to simultaneously in all rooms.

In view of the growing interest in digital media, there remains a need to develop a technology that is easy for consumers to use to further enhance the listening experience.

Disclosure of Invention

The present disclosure describes, among other things, systems and methods for processing audio content captured by a plurality of networked microphones to suppress noise content from the captured audio and detect speech input in the captured audio.

Some example embodiments relate to capturing, via a plurality of microphones of a network device, (i) a first audio signal via a first microphone of the plurality of microphones, and (ii) a second audio signal via a second microphone of the plurality of microphones. The first audio signal comprises a first noise content from a noise source and the second audio signal comprises a second noise content from the same noise source. The network device identifies a first noise content in the first audio signal and determines an estimated noise content captured by the plurality of microphones using the identified first noise content. The network device then uses the estimated noise content to suppress a first noise content in the first audio signal and a second noise content in the second audio signal. The network device combines the suppressed first audio signal and the suppressed second audio signal into a third audio signal. Finally, the network device determines that the third audio signal includes a speech input that includes a wake word, and in response to the determination, sends at least a portion of the speech input to the remote computing device for speech processing to recognize a speech utterance that is different from the wake word.
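As an illustration only, the sequence of operations just described can be sketched in Python; the helper objects and their methods (noise_estimator, suppressor, detector, and so on) are hypothetical placeholders rather than components defined in this disclosure.

def process_capture(first_signal, second_signal, noise_estimator, suppressor, detector):
    # Identify the noise content present in the first audio signal.
    first_noise = noise_estimator.identify(first_signal)
    # Use it to update the estimate of noise captured by all of the microphones.
    estimated_noise = noise_estimator.update(first_noise)
    # Suppress the estimated noise in both captured signals.
    adjusted_first = suppressor.apply(first_signal, estimated_noise)
    adjusted_second = suppressor.apply(second_signal, estimated_noise)
    # Combine the adjusted signals into a third (single) audio signal.
    third_signal = suppressor.combine(adjusted_first, adjusted_second)
    # Check the combined signal for a spoken wake word; if present, at least a
    # portion of the speech input would be sent on for remote voice processing.
    return detector.contains_wake_word(third_signal), third_signal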

Some embodiments include an article of manufacture comprising a tangible, non-transitory computer-readable medium storing program instructions that, upon execution by one or more processors of a network device, cause the network device to perform operations in accordance with example embodiments disclosed herein.

Some embodiments include a network device comprising one or more processors, and a tangible, non-transitory computer-readable medium storing program instructions that, upon execution by the one or more processors, cause the network device to perform operations in accordance with example embodiments disclosed herein.

This summary is illustrative only and is not intended to be limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.

Drawings

The features, aspects, and advantages of the disclosed technology will become better understood with regard to the following description, appended claims, and accompanying drawings where:

FIG. 1 illustrates an example media playback system configuration in which certain embodiments may be practiced;

FIG. 2 shows a functional block diagram of an example playback device;

FIG. 3 shows a functional block diagram of an example control device;

FIG. 4 illustrates an example controller interface;

FIG. 5 illustrates an example plurality of network devices;

FIG. 6 shows a functional block diagram of an example network microphone apparatus;

fig. 7A illustrates an example network device with microphones arranged in a beamforming array in accordance with some embodiments.

Fig. 7B illustrates an example network device with microphones arranged in an unordered (non-array) manner, in accordance with some embodiments.

Fig. 7C illustrates two example network devices with microphones distributed between the two devices, in accordance with some embodiments.

FIG. 8 illustrates an example media network configuration in which certain embodiments may be practiced.

FIG. 9 illustrates an example method according to some embodiments.

FIG. 10 illustrates an example speech input, according to some embodiments.

Fig. 11 shows experimental results of wake-word detection improvement relative to a static beamforming technique.

The drawings are for purposes of illustrating example embodiments, and it is to be understood that the invention is not limited to the arrangements and instrumentality shown in the drawings.

Detailed Description

I. Overview

This disclosure describes, among other things, systems and methods for performing noise suppression using a network of microphones. In some embodiments, one or more microphones of the microphone network are components of a network device (e.g., a voice-enabled device ("VED")). In operation, a microphone-equipped VED (or other network device) listens for a "wake word" or wake phrase that prompts the VED to capture speech for voice command processing. In some embodiments, a wake phrase includes a wake word, and vice versa.

Some examples of a "wake word" (or wake phrase) may include "Sonos" for a Sonos VED, "Alexa" for an Amazon VED, or "Siri" for an Apple VED. Other VEDs from other manufacturers may use different wake words and/or wake phrases. In operation, a microphone-equipped VED listens for its wake word, and in response to detecting it, the VED (alone or in combination with one or more other computing devices) records the speech following the wake word, analyzes the recorded speech to determine a voice command, and then implements the voice command. Typical examples of voice commands include "play my Beatles playlist," "turn on my living room lights," "set my thermostat to 75 degrees," "add milk and bananas to my shopping list," and so on.

FIG. 10 shows an example of speech input 1090 that may be provided to a VED. The speech input 1090 may include a wake word 1092, a speech utterance 1094, or both. The speech utterances 1094 may include: for example, one or more spoken commands 1096 (identified as first command 1096a and second command 1096b, respectively) and one or more spoken keywords 1098 (identified as first keyword 1098a and second keyword 1098b, respectively). In one example, the first command 1096a may be a command to play music, e.g., a particular song, album, playlist, etc. In this example, the keywords 1098 may be one or more words that identify one or more regions (e.g., living rooms and restaurants as shown in fig. 1) in which music is to be played. In some examples, the speech utterances 1094 may include other information, such as detected pauses (e.g., periods of non-speech) between words spoken by the user, as shown in fig. 10. Pauses can distinguish the location within the speech utterance portion 1094 of individual commands, keywords, or other information spoken by the user.
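Purely as an illustration of the structure of speech input 1090 described above, the following Python sketch models its parts; the class and field names are hypothetical and do not appear in the disclosure.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SpeechUtterance:
    # Corresponds to speech utterance 1094.
    commands: List[str]   # e.g., first command 1096a and second command 1096b
    keywords: List[str]   # e.g., keywords 1098a and 1098b naming playback zones
    pauses: List[int] = field(default_factory=list)  # detected non-speech gaps between words

@dataclass
class SpeechInput:
    # Corresponds to speech input 1090.
    wake_word: str                                # wake word 1092
    utterance: Optional[SpeechUtterance] = None   # speech utterance 1094, if any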

As further shown in fig. 10, the VED may instruct the playback device to temporarily reduce the amplitude of (or "duck") audio content playback during capture of the wake word and/or the speech utterance that includes the commands 1096. Ducking can reduce audio interference and improve speech processing accuracy. Various examples of wake words, voice commands, and related voice input capture techniques, processes, devices, and systems can be found, for example, in U.S. patent application No. 15/721,141 entitled "Media Playback System with Voice Assistance," filed on September 27, 2017, the entire contents of which are incorporated herein by reference.

One challenge in determining a voice command is obtaining a high-quality recording of the speech that includes the voice command for analysis. A higher-quality recording of speech that includes a voice command is easier for voice algorithms to analyze than a lower-quality recording. Obtaining a high-quality recording of speech that includes voice commands can be challenging in environments where multiple people may be talking, where household appliances (e.g., televisions, stereos, air conditioners, dishwashers, etc.) emit noise, and where other extraneous sounds are present.

One way to improve the quality of sound recordings that include voice commands is to employ a microphone array and use beamforming to (i) amplify sound coming from the direction of the speech containing the voice command relative to the microphone array, and (ii) attenuate sound coming from other directions relative to the microphone array. In a beamforming system, a plurality of microphones arranged in a structured array can perform spatial localization of sound (i.e., determine the direction from which sound originates) relative to the microphone array. However, while beamforming can effectively suppress unwanted noise in sound recordings, it has limitations. For example, because beamforming requires the microphones to be arranged in a specific array configuration, it is feasible only in scenarios where such an array can be implemented. Some network devices may be unable to support such microphone arrays due to hardware or other design constraints. As described in more detail below, network devices and associated systems and methods configured in accordance with various embodiments of the present technology can address these and other challenges associated with conventional techniques (e.g., conventional beamforming) for suppressing noise content from captured audio.

This disclosure describes the use of multi-microphone noise suppression techniques that do not necessarily rely on the geometric arrangement of the microphones. Rather, techniques for suppressing noise in accordance with various embodiments involve linear time-invariant filtering of an observed noisy process, assuming known stationary signal and noise spectra and additive noise. In some embodiments, the present technology uses first audio content captured by one or more respective microphones within a microphone network to estimate the noise in second audio content simultaneously captured by one or more other respective microphones of the microphone network. The estimated noise from the first audio content may then be used to filter out the noise, and preserve the speech, in the second audio content.
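The sketch below illustrates, for a single frequency bin, how the noise statistics can be tracked from frames that appear noise-dominated and then used to filter every channel. It is a minimal sketch assuming NumPy, a placeholder speech_presence_probability function, and illustrative smoothing and threshold constants; none of these names or values come from the disclosure.

import numpy as np

def update_bin_statistics(y, P_yy, P_vv, speech_presence_probability,
                          alpha=0.9, threshold=0.5):
    # y: complex STFT coefficients of one frequency bin across all microphones.
    # P_yy, P_vv: smoothed power spectral density (PSD) matrices of the observed
    # audio and of the noise, respectively, for this bin.
    outer = np.outer(y, y.conj())
    # First-order exponential smoothing of the observed PSD.
    P_yy = alpha * P_yy + (1.0 - alpha) * outer
    # Update the noise PSD only when speech is unlikely to be present, so that
    # noise-dominated captures serve as the noise estimate for all channels.
    if speech_presence_probability(y) < threshold:
        P_vv = alpha * P_vv + (1.0 - alpha) * outer
    return P_yy, P_vv

def filter_bin(y, h):
    # Apply a linear filter h (one complex weight per microphone) and combine
    # the channels into a single noise-suppressed output coefficient (h^H y).
    return np.vdot(h, y)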

In various embodiments, the present techniques may involve aspects of Wiener filtering. Conventional Wiener filtering techniques have been used for image filtering and noise cancellation, but typically involve trade-offs in the fidelity of the resulting filtered signal. The inventors have recognized, however, that related techniques based on Wiener filtering can be applied to speech input detection (e.g., wake-word detection) in a manner that enhances detection accuracy compared to speech input detection using conventional beamforming techniques.
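The filter expression referenced in claims 8 and 9 is not reproduced in this text. As a hedged illustration only, a standard speech-distortion-weighted multichannel Wiener filter with the same dependencies (the observed PSD matrix P_yy(f), the noise PSD matrix P_vv(f), and a tuning parameter β, but not the clean speech) takes the form

\[ h_i(f) = \left[ P_{yy}(f) - P_{vv}(f) + \beta^{-1} P_{vv}(f) \right]^{-1} \left[ P_{yy}(f) - P_{vv}(f) \right] e_i , \]

where e_i is the selection vector for the i-th microphone channel; decreasing β weights noise reduction more heavily at the cost of additional speech distortion, consistent with β acting as the inverse of a Lagrangian multiplier. This is offered as a representative member of the filter family, not as the exact claimed expression.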

In some embodiments, the microphone network implementing the multi-microphone noise suppression techniques of various embodiments is a component of a network device. A network device is any computing device that includes (i) one or more processors, (ii) one or more network interfaces and/or one or more other types of communication interfaces, and (iii) a tangible, non-transitory computer-readable medium including instructions encoded therein, wherein the instructions, when executed at least in part by the one or more processors, cause the network device to perform the functions disclosed and described herein. Network devices are a general class of devices including, but not limited to, voice-enabled devices (VED), Networked Microphone Devices (NMD), audio playback devices (PBD), and Video Playback Devices (VPD). VEDs are a class of devices including, but not limited to, NMDs, PBDs, and VPDs. For example, one type of VED is an NMD, which is a network device that includes one or more processors, a network interface, and one or more microphones. Some NMDs may additionally include one or more speakers and perform media playback functions. Another type of VED is a PBD, which is a network device that includes one or more processors, a network interface, and one or more speakers. Some PBDs may optionally include one or more microphones and perform the functions of an NMD. Another type of VED is a VPD, which is a network device that includes one or more processors, a network interface, one or more speakers, and at least one video display. Some VPDs may optionally include one or more microphones and perform the functions of an NMD. PBDs and VPDs may be generally referred to as media playback devices.

Each of the above-described VEDs may implement at least some voice control functionality that allows the VED (alone or possibly in combination with one or more other computing devices) to act on voice commands received via its microphone, thereby allowing a user to control the VED and, possibly, the other devices.

Other embodiments include tangible, non-transitory computer-readable media having program instructions stored thereon that, when executed by a computing device, cause the computing device to perform the features and functions disclosed and described herein.

Some embodiments include a computing device comprising at least one processor, as well as a data storage device and program instructions. In operation, program instructions are stored in a data storage device and, when executed by at least one processor, cause a computing device (alone or in combination with other components or systems) to perform the features and functions disclosed and described herein.

While some examples described herein may relate to functions performed by a given actor (e.g., a "user" and/or other entity), it should be understood that this is for purposes of explanation only. The claims should not be interpreted to require action by any such example actor unless explicitly required by the language of the claims themselves. One of ordinary skill in the art will appreciate that the present disclosure includes many other embodiments.

II. Example Operating Environment

Fig. 1 illustrates an example configuration of a media playback system 100 in which one or more embodiments disclosed herein may be practiced or implemented. The media playback system 100 as shown is associated with an example home environment having several rooms and spaces (e.g., a master bedroom, a study, a dining room, and a living room). As shown in the example of fig. 1, the media playback system 100 includes playback devices 102-124, control devices 126 and 128, and a wired or wireless network router 130. In operation, any of the playback devices (PBDs) 102-124 may be a voice-enabled device (VED) as previously described.

Further discussion of the different components of the example media playback system 100, and of how the different components may interact to provide a media experience to a user, can be found in the following sections. While the discussion herein may generally refer to the example media playback system 100, the techniques described herein are not limited to applications within, among other things, the home environment shown in fig. 1. For example, the techniques described herein may be useful in environments where multi-zone audio may be desired, e.g., commercial environments such as restaurants, shopping centers, or airports; vehicles such as sport utility vehicles (SUVs), buses, or cars; ships or boats; airplanes; and so forth.

a. Example playback device

Fig. 2 illustrates a functional block diagram of an example playback device 200, which example playback device 200 may be configured as one or more of the playback devices 102-124 of the media playback system 100 of fig. 1. As described above, the playback device (PBD) 200 is a type of voice-enabled device (VED).

The playback device 200 includes one or more processors 202, software components 204, memory 206, audio processing components 208, an audio amplifier 210, a speaker 212, a network interface 214 including a wireless interface 216 and a wired interface 218, and a microphone 220. In one case, the playback device 200 may not include the speaker 212, but rather a speaker interface for connecting the playback device 200 to external speakers. In another case, the playback device 200 may include neither the speaker 212 nor the audio amplifier 210, but an audio interface for connecting the playback device 200 to an external audio amplifier or audiovisual receiver.

In some examples, the one or more processors 202 include one or more clock driven computing components configured to process input data according to instructions stored in the memory 206. The memory 206 may be a tangible, non-transitory, computer-readable medium configured to store instructions executable by the one or more processors 202. For example, the memory 206 may be a data storage device that may be loaded with one or more software components 204 that the one or more processors 202 may execute to implement particular functionality. In one example, these functions may involve the playback device 200 retrieving audio data from an audio source or another playback device. In another example, these functions may involve the playback device 200 sending audio data to another device or playback device on the network. In yet another example, the functions may involve pairing the playback device 200 with one or more playback devices to create a multi-channel audio environment.

The particular functionality may involve the playback device 200 playing back audio content in synchronization with one or more other playback devices. During synchronized playback, the listener will preferably not be able to perceive time-delay differences between playback of the audio content by the playback device 200 and by the one or more other playback devices. Some examples of audio playback synchronization between playback devices are provided in more detail in U.S. Patent No. 8,234,395 entitled "System and method for synchronizing operations among a plurality of independently clocked digital data processing devices," which is incorporated herein by reference.

The memory 206 may also be configured to store data associated with the playback device 200, such as one or more zones and/or zone groups of which the playback device 200 is a part, audio sources accessible to the playback device 200, or a playback queue with which the playback device 200 (or some other playback device) may be associated. The data may be stored as one or more state variables that are periodically updated and used to describe the state of the playback device 200. The memory 206 may also include data associated with the status of other devices of the media system and is shared between the devices from time to time such that one or more of the devices has up-to-date data associated with the system. Other embodiments are also possible.

The audio processing component 208 may include one or more digital-to-analog converters (DACs), audio pre-processing components, audio enhancement components, or Digital Signal Processors (DSPs), among others. In one embodiment, one or more of the audio processing components 208 may be a subcomponent of one or more of the processors 202. In one example, the audio processing component 208 may process and/or intentionally alter audio content to produce an audio signal. The resulting audio signal may then be provided to an audio amplifier 210 for amplification and playback through a speaker 212. In particular, the audio amplifier 210 may include a device configured to amplify an audio signal to a level for driving one or more of the speakers 212. The speaker 212 may include a separate transducer (e.g., a "driver") or a complete speaker system including a housing with one or more drivers. For example, the particular drivers of the speaker 212 may include, for example, a woofer (e.g., for low frequencies), mid-range drivers (e.g., for mid-frequencies), and/or tweeters (e.g., for high frequencies). In some cases, each transducer of the one or more speakers 212 may be driven by a respective corresponding one of the audio amplifiers 210. In addition to producing analog signals for playback by the playback device 200, the audio processing component 208 may also be configured to process audio content to be transmitted to one or more other playback devices for playback.

The audio content to be processed and/or played back by the playback device 200 may be received from an external source, for example, through an audio line-in input connection (e.g., an auto-detect 3.5mm audio line-in connection) or the network interface 214.

The network interface 214 may be configured to facilitate data flow between the playback device 200 and one or more other devices on a data network, including, but not limited to, data to/from other VEDs (e.g., commands to perform SPL measurements, SPL measurement data, commands to set system responses, and other data and/or commands that facilitate performing the features and functions disclosed and described herein). As such, the playback device 200 may be configured to receive audio content over a data network from one or more other playback devices in communication with the playback device 200, a network device within a local area network, or an audio content source on a wide area network (e.g., the internet). The playback device 200 may transmit and/or receive metadata to/from other devices on the network, including but not limited to the components of the networked microphone systems disclosed and described herein. In one example, audio content and other signals (e.g., metadata and other signals) transmitted and received by the playback device 200 may be transmitted in the form of digital packet data containing an Internet Protocol (IP) based source address and an IP based destination address. In this case, the network interface 214 may be configured to parse the digital packet data so that data destined for the playback device 200 is properly received and processed by the playback device 200.

As shown, the network interface 214 may include a wireless interface 216 and a wired interface 218. The wireless interface 216 may provide network interface functionality for the playback device 200 to wirelessly communicate with other devices (e.g., other playback devices, speakers, receivers, network devices, control devices within a data network associated with the playback device 200) according to a communication protocol (e.g., any wireless standard, including IEEE 802.11a, 802.11b, 802.11g, 802.11n, 802.11ac, 802.15, 4G mobile communication standards, etc.). The wired interface 218 may provide network interface functionality for the playback device 200 to communicate with other devices over a wired connection according to a communication protocol (e.g., IEEE 802.3). Although the network interface 214 shown in fig. 2 includes both a wireless interface 216 and a wired interface 218, in some embodiments, the network interface 214 may include only a wireless interface or only a wired interface.

The microphone 220 may be arranged to detect sound in the environment of the playback device 200. For example, the microphone may be mounted on an outer wall of the housing of the playback device. The microphone may be any type of microphone now known or later developed, such as a condenser microphone, an electret condenser microphone, or a dynamic microphone. The microphone may be sensitive to a portion of the frequency range of the speaker 212. One or more of the speakers 212 may operate in reverse as the microphone 220. In some aspects, the playback device 200 may not have a microphone 220.

In one example, the playback device 200 and another playback device may be paired to play two separate audio components of the audio content. For example, the playback device 200 may be configured to play a left channel audio component, while another playback device may be configured to play a right channel audio component, thereby creating or enhancing a stereo effect of the audio content. Paired playback devices (also referred to as "bound playback devices") can also play audio content in synchronization with other playback devices.

In another example, the playback device 200 may be merged with one or more other playback devices to form a single merged playback device. The merged playback device may be configured to process and reproduce sound differently than an unmerged playback device or a paired playback device, because the merged playback device may have additional speaker drivers through which audio content may be rendered. For example, if the playback device 200 is a playback device designed to render low-frequency audio content (i.e., a subwoofer), the playback device 200 can be merged with a playback device designed to render full-frequency-range audio content. In this case, when merged with the low-frequency playback device 200, the full-range playback device may be configured to render only the mid-high frequency components of the audio content, while the low-frequency playback device 200 renders the low-frequency components of the audio content. The merged playback device may also be paired with a single playback device or another merged playback device.

For example, SONOS, Inc. presently offers (or has offered) for sale certain playback devices including "PLAY:1," "PLAY:3," "PLAY:5," "PLAYBAR," "CONNECT:AMP," "CONNECT," and "SUB."

Any other past, present, and/or future playback devices may additionally or alternatively be used to implement the playback devices of the example embodiments disclosed herein. Furthermore, it should be understood that the playback device is not limited to the example shown in fig. 2 or to the SONOS product offerings. For example, the playback device may include a wired or wireless headphone. In another example, the playback device may include or interact with a docking station for personal mobile media playback devices. In yet another example, the playback device may be integrated into another device or component, such as a television, a lighting fixture, or some other device for use indoors or outdoors.

b. Example playback zone configuration

Referring back to the media playback system 100 of fig. 1, the environment may have one or more playback zones, each with one or more playback devices and/or other VEDs. The media playback system 100 may be established with one or more playback zones, after which one or more zones may be added or removed to achieve the example configuration shown in fig. 1. Each region may be named according to a different room or space (e.g., study, restroom, master bedroom, kitchen, dining room, living room, and/or balcony). In one case, a single playback zone may include multiple rooms or spaces. In another case, a single room or space may include multiple playback zones.

As shown in fig. 1, each of the balcony, dining room, kitchen, restroom, study, and bedroom areas has one playback device, and each of the living room and master bedroom areas has a plurality of playback devices. In the living room area, playback devices 104, 106, 108, and 110 may be configured to play audio content synchronously as individual playback devices, as one or more bound playback devices, as one or more consolidated playback devices, or any combination thereof. Similarly, in the case of the master bedroom, playback devices 122 and 124 may be configured to play audio content synchronously as separate playback devices, as bound playback devices, or as a consolidated playback device.

In one example, one or more playback zones in the environment of FIG. 1 may each play different audio content. For example, a user may be grilling and listening to hip-hop music being played by the playback device 102 in a balcony area, while another user may be preparing food and listening to classical music being played by the playback device 114 in a kitchen area. In another example, a playback zone may play the same audio content in synchronization with another playback zone. For example, the user may be in a study area where the playback device 118 is playing the same rock music as the playback device 102 in the balcony area. In this case, the playback devices 102 and 118 may play rock music synchronously so that the user may seamlessly (or at least substantially seamlessly) enjoy the played-out audio content as the user moves between different playback zones. Synchronization between playback zones may be accomplished in a manner similar to synchronization between playback devices as described in the previously referenced U.S. patent No.8,234,395.

As suggested above, the regional configuration of the media playback system 100 may be dynamically modified, and in some embodiments, the media playback system 100 supports multiple configurations. For example, if a user physically moves one or more playback devices into or out of a region, the media playback system 100 may be reconfigured to accommodate the change. For example, if a user physically moves playback device 102 from a balcony area to a study area, the study area may now include both playback device 118 and playback device 102. The playback device 102 may be paired or grouped with the study area via control devices (e.g., control devices 126 and 128), and/or renamed (if desired). On the other hand, if one or more playback devices are moved to a particular zone in the home environment that is not yet a playback zone, a new playback zone may be created for the particular zone.

Further, the different playback zones of the media playback system 100 may be dynamically combined into zone groups or divided into separate playback zones. For example, a dining room region and a kitchen region may be combined into a group of regions for a dinner party such that the playback devices 112 and 114 may synchronously present (e.g., play back) audio content. On the other hand, if a user desires to listen to music in the living room space and another user desires to watch television, the living room area may be divided into a television area including playback device 104 and a listening area including playback devices 106, 108, and 110.

c. Example control device

Fig. 3 illustrates a functional block diagram of an example control device 300, which example control device 300 may be configured as one or both of the control devices 126 and 128 of the media playback system 100. As shown, the control device 300 may include one or more processors 302, memory 304, a network interface 306, a user interface 308, a microphone 310, and software components 312. In one example, the control device 300 may be a dedicated controller for the media playback system 100. In another example, the control device 300 may be a network device on which media playback system controller application software may be installed, e.g., an iPhone™, iPad™, or any other smartphone, tablet computer, or network device (e.g., a networked computer such as a PC or Mac™).

The one or more processors 302 may be configured to perform functions related to facilitating user access, control, and configuration of the media playback system 100. The memory 304 may be a data store that may be loaded with one or more of the software components executable by the one or more processors 302 for performing these functions. The memory 304 may also be configured to store media playback system controller application software and other data associated with the media playback system 100 and the user. In one example, the network interface 306 may be based on an industry standard (e.g., infrared; radio; wired standards including IEEE 802.3; wireless standards including IEEE 802.11a, 802.11b, 802.11g, 802.11n, 802.11ac, 802.15, 3G, 4G, or 5G mobile communication standards; etc.). The network interface 306 may provide a means for the control device 300 to communicate with other devices in the media playback system 100. In one example, data and information (e.g., state variables) may be communicated between the control device 300 and other devices via the network interface 306. For example, the playback zone and zone group configuration in the media playback system 100 may be received by the control device 300 from a playback device or another network device through the network interface 306, or transmitted by the control device 300 to another playback device or network device through the network interface 306. In some cases, the other network device may be another control device.

Playback device control commands (e.g., volume control and audio playback control) may also be communicated from the control device 300 to the playback device through the network interface 306. As suggested above, changes to the configuration of the media playback system 100 may also be performed by the user using the control device 300. The configuration change may include: adding/deleting one or more playback devices to/from the zone; adding/deleting one or more regions to/from a region group; forming a bound or merged player; separating one or more playback devices from a bound or consolidated player, and the like. Thus, the control device 300 may sometimes be referred to as a controller, whether the control device 300 is a dedicated controller or a network device on which the media playback system controller application software is installed.

The control device 300 may include a microphone 310. The microphone 310 may be arranged to detect sound in the environment of the control device 300. The microphone 310 may be any type of microphone now known or later developed, such as a condenser microphone, an electret condenser microphone, or a dynamic microphone. The microphone may be sensitive to a portion of the frequency band. Two or more microphones 310 may be arranged to capture location information of an audio source (e.g., speech, audible sound) and/or to help filter background noise.

The user interface 308 of the control device 300 may be configured to facilitate user access and control of the media playback system 100 by providing a controller interface (e.g., the example controller interface 400 shown in fig. 4). Controller interface 400 includes a playback control region 410, a playback zone region 420, a playback status region 430, a playback queue region 440, and an audio content source region 450. The illustrated user interface 400 is merely one example of a user interface that may be provided on a network device (e.g., the control device 300 of fig. 3 (and/or the control devices 126 and 128 of fig. 1)) and accessed by a user to control a media playback system (e.g., the media playback system 100). Alternatively, other user interfaces of varying formats, styles, and interaction sequences may be implemented on one or more network devices to provide similar control access to the media playback system.

The playback control area 410 may include selectable (e.g., by touching or by using a cursor) icons to cause playback devices in a selected playback zone or group of zones to play or pause, fast forward, fast rewind, skip next, skip previous, enter/exit a random play mode, enter/exit a repeat mode, enter/exit a cross-fade mode. The playback control area 410 may also include selectable icons for modifying equalization settings, playback volume, and the like.

The playback zone region 420 may include a representation of a playback zone within the media playback system 100. In some embodiments, the graphical representation of the playback zone may be selectable to recall additional selectable icons to manage or configure the playback zone in the media playback system, e.g., create a bound zone, create a zone group, split zone group, rename a zone group, and so forth.

For example, as shown, a "grouping" icon may be provided within each graphical representation of the playback zone. The "group" icon provided within the graphical representation of a particular zone may be selectable to invoke an option for selecting one or more other zones in the media playback system that are to be grouped with the particular zone. Once grouped, playback devices in the zones that have been grouped with the particular zone will be configured to play audio content in synchronization with the playback devices in the particular zone. Similarly, a "grouping" icon may be provided within the graphical representation of a zone group. In this case, the "group" icon may be selectable to invoke an option to deselect one or more regions in the region group to be removed from the region group. Other interactions and implementations of grouping and ungrouping regions via a user interface (e.g., user interface 400) are also possible. The representation of the playback zone in the playback zone region 420 may be dynamically updated as the playback zone or zone group configuration is modified.

The playback status region 430 may include a graphical representation of audio content currently playing, previously playing, or scheduled to play next in a selected playback zone or group of zones. The selected playback zone or group of zones may be visually distinguished on the user interface, for example, within playback zone region 420 and/or playback status region 430. The graphical representation may include track name, artist name, album year, track length, and other relevant information that a user knows will be useful when controlling the media playback system via the user interface 400.

The playback queue zone 440 may include a graphical representation of the audio content in the playback queue associated with the selected playback zone or group of zones. In some embodiments, each playback zone or group of zones may be associated with a playback queue containing information corresponding to zero or more audio items played back by that playback zone or group of zones. For example, each audio item in the playback queue may include a Uniform Resource Identifier (URI), a Uniform Resource Locator (URL), or some other identifier that may be used by the playback devices in the playback zone or group of zones to find and/or retrieve audio items from a local audio content source or a networked audio content source, possibly for playback by the playback devices.

In one example, a playlist may be added to the playback queue, in which case information corresponding to each audio item in the playlist may be added to the playback queue. In another example, the audio items in the playback queue may be saved as a playlist. In another example, the playback queue may be empty or filled but "unused" when the playback zone or group of zones is continuously playing streaming audio content (e.g., an internet radio, which may continue to play until stopped) rather than a separate audio item having a playback duration. In alternative embodiments, the playback queue may include internet radio and/or other streaming audio content items and be "in use" when the playback zone or group of zones is playing those items. Other examples are possible.

When a playback zone or group of zones is "grouped" or "ungrouped," the playback queue associated with the affected playback zone or group of zones may be cleared, or re-associated. For example, if a first playback zone that includes a first playback queue is grouped with a second playback zone that includes a second playback queue, the established zone group may have an associated playback queue that is initially empty, contains audio items from the first playback queue (e.g., if the second playback zone is added to the first playback zone), or contains audio items from the second playback queue (e.g., if the first playback zone is added to the second playback zone), or contains a combination of audio items from both the first playback queue and the second playback queue. Subsequently, if the established zone group is ungrouped, the resulting first playback zone may be re-associated with the previous first playback queue, or associated with a new playback queue that is empty, or contains audio items from a playback queue associated with the zone group established before the established zone group was ungrouped. Similarly, the resulting second playback zone may be re-associated with the previous second playback queue, or associated with a new playback queue that is empty, or contains audio items from a playback queue associated with the zone group established before the established zone group was ungrouped. Other examples are possible.
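A minimal sketch of this re-association logic, with illustrative policy names not drawn from the disclosure, might look like the following.

def queue_for_new_zone_group(first_queue, second_queue, policy):
    # When two playback zones are grouped, the zone group's queue may start
    # empty, adopt either prior queue, or combine both (see the cases above).
    if policy == "empty":
        return []
    if policy == "first":      # e.g., the second zone was added to the first
        return list(first_queue)
    if policy == "second":     # e.g., the first zone was added to the second
        return list(second_queue)
    if policy == "combined":
        return list(first_queue) + list(second_queue)
    raise ValueError("unknown policy: %s" % policy)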

Referring back to the user interface 400 of fig. 4, the graphical representation of the audio content in the playback queue region 440 may include the track title, artist name, track length, and other relevant information associated with the audio content in the playback queue. In one example, the graphical representation of the audio content may be selectable to bring up additional selectable icons to manage and/or manipulate the playback queue and/or the audio content represented in the playback queue. For example, the represented audio content may be removed from the playback queue, moved to a different location within the playback queue, or selected to play immediately, or played after any currently playing audio content, and so forth. The playback queue associated with a playback zone or group of zones may be stored in memory on one or more playback devices in the playback zone or group of zones, on playback devices not in the playback zone or group of zones, and/or on some other designated device.

Audio content source region 450 may include a graphical representation of a selectable audio content source from which audio content may be obtained and played by a selected playback zone or group of zones. For a discussion of the audio content sources, see the following section.

d. Example audio content sources

As previously described, one or more playback devices in a region or group of regions may be configured to retrieve played audio content from various available audio content sources (e.g., according to corresponding URIs or URLs for the audio content). In one example, the playback device may retrieve audio content directly from a corresponding audio content source (e.g., a line-in connection). In another example, audio content may be provided to a playback device over a network via one or more other playback devices or network devices.

Example audio content sources may include: a memory of one or more playback devices in a media playback system (e.g., media playback system 100 of fig. 1), a local music library on one or more network devices (e.g., a controller device, a network-enabled personal computer, or a Network Attached Storage (NAS), etc.), a streaming audio service that provides audio content over the internet (e.g., the cloud), or an audio source connected to the media playback system through a line-in connection on a playback device or network device, etc.

In some embodiments, audio content sources may be added to or removed from a media playback system (e.g., media playback system 100 of fig. 1) periodically. In one example, indexing audio items may be performed each time one or more audio content sources are added, removed, or updated. Indexing the audio item may include: identifiable audio items in all folders/directories shared on a network accessible by playback devices in the media playback system are scanned, and an audio content database containing metadata (e.g., title, artist, album, track length, etc.) and other associated information (e.g., URL or URI for each identifiable audio item found) is generated or updated. Other examples for managing and maintaining audio content sources are possible.
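A minimal sketch of such an indexing pass is shown below; the file extensions, field names, and tag-reading placeholder are illustrative assumptions, not details from the disclosure.

import os

AUDIO_EXTENSIONS = {".mp3", ".flac", ".m4a", ".wav"}

def index_shared_folders(shared_roots):
    # Scan every shared folder/directory for identifiable audio items and build
    # a simple audio-content database keyed by the item's URI (here, its path).
    database = {}
    for root in shared_roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if os.path.splitext(name)[1].lower() in AUDIO_EXTENSIONS:
                    uri = os.path.join(dirpath, name)
                    # Metadata (title, artist, album, track length) would be read
                    # from the file's tags here; placeholders are used instead.
                    database[uri] = {"title": name, "artist": None,
                                     "album": None, "track_length": None}
    return database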

The above discussion of playback devices, controller devices, playback zone configurations, and media content sources provides but a few examples of operating environments in which the functions and methods described below may be implemented. Configurations of media playback systems, playback devices, and network devices and other operating environments not explicitly described herein may also be applicable and suitable for implementation of the functions and methods.

e. Example multiple network devices

Fig. 5 illustrates an example plurality of network devices 500 that may be configured to provide an audio playback experience with voice control. Those of ordinary skill in the art will appreciate that the devices shown in fig. 5 are for illustration purposes only, and that variations including different and/or additional (or fewer) devices are possible. As shown, the plurality of network devices 500 includes computing devices 504, 506, and 508; network Microphone Devices (NMDs) 512, 514, 516, and 518; playback devices (PBDs) 532, 534, 536, and 538; and a controller device 522. As previously described, any one or more (or all) of the NMDs 512-16, PBDs 532-38, and/or controller device 522 can be VEDs. For example, in some embodiments, PBDs 532 and 536 may be VEDs, while PBDs 534 and 538 may not be VEDs.

Each of the plurality of network devices 500 is a network-enabled device that may establish communication with one or more other devices of the plurality of devices in accordance with one or more network protocols (e.g., NFC, Bluetooth™, Ethernet, IEEE 802.11, etc.) over one or more types of networks (e.g., a wide area network (WAN), a local area network (LAN), a personal area network (PAN), etc.).

As shown, computing devices 504, 506, and 508 are part of cloud network 502. Cloud network 502 may include additional computing devices (not shown). In one example, computing devices 504, 506, and 508 may be different servers. In another example, two or more of computing devices 504, 506, and 508 may be modules of a single server. Similarly, each of computing devices 504, 506, and 508 may include one or more modules or servers. For ease of illustration herein, each of computing devices 504, 506, and 508 may be configured to perform particular functions within cloud network 502. For example, computing device 508 may be a source of audio content for a streaming music service, while computing device 506 may be associated with a voice assistant service (e.g., Google or another voice service) for processing voice input that has been captured after the wake word is detected. As an example, the VED may send the captured speech input (e.g., the speech utterance and the wake word), or a portion thereof (e.g., the speech utterance immediately following the wake word), over the data network to the computing device 506 for voice processing. The computing device 506 may employ a speech-to-text engine to convert the speech input into text, which may be processed to determine an underlying intent of the speech utterance. The computing device 506 or another computing device may send a corresponding response to the voice input back to the VED, e.g., a response that includes as its payload one or more audible outputs (e.g., a voice response to a query and/or an acknowledgement) and/or instructions intended for one or more network devices of the local system. The instructions may include, for example, commands to initiate, pause, resume, or stop playback of audio content on one or more network devices, increase/decrease playback volume, retrieve a track or playlist corresponding to an audio queue via a particular URI or URL, or the like. Additional examples of speech processing to determine intent and respond to speech input can be found, for example, in the previously referenced U.S. patent application No. 15/721,141.
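The exchange described above can be sketched as follows; the response fields and player methods are hypothetical illustrations, not part of any actual voice service or playback device API.

def handle_voice_service_response(response, player):
    # 'response' is a hypothetical dict-shaped reply from the voice assistant
    # service: an optional audible payload plus zero or more device instructions.
    audible = response.get("audible_output")
    if audible is not None:
        player.play_clip(audible)  # e.g., a spoken answer to a query or an acknowledgement
    for instruction in response.get("instructions", []):
        command = instruction.get("command")
        if command in ("play", "pause", "resume", "stop"):
            player.transport(command)
        elif command == "set_volume":
            player.set_volume(instruction["level"])
        elif command == "enqueue":
            player.add_to_queue(instruction["uri"])  # track or playlist URI/URL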

As shown, the computing device 504 may be configured to interface with the NMDs 512, 514, and 516 via a communication path 542. The NMDs 512, 514, and 516 may be components of one or more "smart home" systems. In one case, the NMDs 512, 514, and 516 may be physically distributed throughout the home, similar to the distribution of devices shown in fig. 1. In another case, two or more of the NMDs 512, 514, and 516 may be physically located relatively close to each other. The communication path 542 may include one or more types of networks, such as a WAN, a LAN, and/or a PAN, including the internet, among others.

In one example, one or more of the NMDs 512, 514, and 516 are devices configured primarily for audio detection. In another example, one or more of the NMDs 512, 514, and 516 may be components of a device having various primary utilities. For example, as discussed above in connection with fig. 2 and 3, one or more of the NMDs 512, 514, and 516 may be (or may at least include) the microphone 220 of the playback device 200 or the microphone 310 of the network device 300 (or may at least be a component thereof). Further, in some cases, one or more of the NMDs 512, 514, and 516 may be (or at least may include) the playback device 200 or the network device 300 (or at least may be a component thereof). In an example, one or more of the NMDs 512, 514, and/or 516 can include a plurality of microphones arranged in a microphone array. In some embodiments, one or more of the NMDs 512, 514, and/or 516 may be a microphone on a mobile computing device (e.g., a smartphone, tablet, or other computing device).

As shown, computing device 506 is configured to interface with controller device 522 and PBDs 532, 534, 536, and 538 via communication path 544. In one example, controller device 522 may be a network device, such as the network device 300 of fig. 3. Accordingly, the controller device 522 may be configured to provide the controller interface 400 of fig. 4. Similarly, the PBDs 532, 534, 536, and 538 may be playback devices, such as the playback device 200 of fig. 2. Thus, the PBDs 532, 534, 536, and 538 may be physically distributed throughout the home, as shown in fig. 1. For purposes of illustration, PBDs 536 and 538 are shown as members of the binding region 530, while PBDs 532 and 534 are members of their own respective regions. As described above, the PBDs 532, 534, 536, and 538 may be dynamically bound, grouped, unbound, and ungrouped. The communication path 544 may include one or more types of networks, such as a WAN including the Internet, a LAN, and/or a PAN, among other possibilities.

In one example, as with the NMDs 512, 514, and 516, the controller device 522 and the PBDs 532, 534, 536, and 538 may also be components of one or more "smart home" systems. In one case, the PBDs 532, 534, 536, and 538 are distributed in the same home as the NMDs 512, 514, and 516. Further, as suggested above, one or more of the PBDs 532, 534, 536, and 538 may be one or more of the NMDs 512, 514, and 516. For example, any one or more (or possibly all) of the NMDs 512-516, PBDs 532-538, and/or controller device 522 may be voice-enabled devices (VEDs).

The NMDs 512, 514, and 516 can be part of a local area network, and the communication path 542 can include an access point that links the local area network of the NMDs 512, 514, and 516 to the computing device 504 over a WAN (communication path, not shown). Likewise, each of the NMDs 512, 514, and 516 may communicate with each other via the access point.

Similarly, the controller device 522 and the PBDs 532, 534, 536, and 538 can be part of a local area network and/or a local playback network (as discussed in previous sections), and the communication path 544 can include access points that link the local area network and/or the local playback network of the controller device 522 and the PBDs 532, 534, 536, and 538 to the computing device 506 over the WAN. As such, the controller device 522 and each of the PBDs 532, 534, 536, and 538 may also communicate with each other through the access point.

In one example, communication paths 542 and 544 can include the same access point. In an example, each of the NMDs 512, 514, and 516, the controller device 522, and the PBDs 532, 534, 536, and 538 can access the cloud network 502 via the same access point of the home.

As shown in fig. 5, each of the NMDs 512, 514, and 516, the controller device 522, and the PBDs 532, 534, 536, and 538 may also communicate directly with one or more other devices via the communication means 546. The communication means 546 as described herein may relate to and/or include one or more forms of communication between devices over one or more types of networks according to one or more network protocols, and/or may relate to communication via one or more other network devices. For example, communication means 546 may comprise Bluetooth™ (IEEE 802.15), NFC, wireless direct, and/or proprietary wireless, among others.

In one example, the controller device 522 may communicate with the NMD 512 via Bluetooth™ and may communicate with the PBD 534 via another local area network. In another example, the NMD 514 may communicate with the controller device 522 over another local area network and may communicate with the PBD 536 via Bluetooth™. In yet another example, each of the PBDs 532, 534, 536, and 538 can communicate with one another over a local playback network in accordance with the spanning tree protocol while each communicates with the controller device 522 over a local area network different from the local playback network. Other examples are possible.

In some cases, the manner of communication between the NMDs 512, 514, and 516, the controller device 522, and the PBDs 532, 534, 536, and 538 may be different (or may change) depending on the type of communication required, network conditions, and/or latency requirements between the devices. For example, the communication means 546 may be used when the NMD 516 is first introduced into a home containing the PBDs 532, 534, 536, and 538. In one case, the NMD 516 can send identification information corresponding to the NMD 516 to the PBD 538 via NFC, and in response, the PBD 538 can send local network information to the NMD 516 via NFC (or some other form of communication). However, once the NMD 516 is configured in the home, the communication means between the NMD 516 and the PBD 538 may change. For example, the NMD 516 may subsequently communicate with the PBD 538 via the communication path 542, the cloud network 502, and the communication path 544. In another example, the NMD and the PBD may never communicate via the local communication means 546. In a further example, the NMD and the PBD may communicate primarily via the local communication means 546. Other examples are possible.

In an illustrative example, the NMDs 512, 514, and 516 may be configured to receive voice input for controlling the PBDs 532, 534, 536, and 538. The available control commands may include any of the media playback system controls previously discussed, such as playback volume controls, playback transport controls, music source selection, and grouping, among others. In one example, the NMD 512 can receive a voice input for controlling one or more of the PBDs 532, 534, 536, and 538. In response to receiving the voice input, the NMD 512 can send the voice input over the communication path 542 to the computing device 504 for processing. In one example, the computing device 504 may convert the voice input into an equivalent text command and parse the text command to identify the command. The computing device 504 may then send the text command to the computing device 506, and the computing device 506 may then control one or more of the PBDs 532-538 to execute the command. In another example, the computing device 504 may convert the voice input into an equivalent text command and then send the text command to the computing device 506. The computing device 506 can then parse the text command to identify one or more playback commands, and the computing device 506 can then control one or more of the PBDs 532-538 to execute the command.

For example, if the textual command is "play track 1 from artist 1 of streaming media service 1 in region 1," the computing device 506 may identify (i) the URL of track 1 of artist 1 available from streaming media service 1, and (ii) at least one playback device in region 1. In this example, the URL of track 1 from artist 1 of streaming media service 1 may be a URL that points to computing device 508, and region 1 may be binding region 530. As such, when a URL and one or both of PBDs 536 and 538 are identified, computing device 506 may send the identified URL to one or both of PBDs 536 and 538 via communication path 544 for playback. In response, one or both of PBDs 536 and 538 may retrieve audio content from computing device 508 based on the received URL and begin playing track 1 from artist 1 of streaming media service 1.

Those of ordinary skill in the art will appreciate that the above are merely some illustrative examples, and that other implementations are possible. In one case, as described above, the operations performed by one or more of the plurality of network devices 500 may be performed by one or more other of the plurality of network devices 500. For example, the conversion from speech input to text commands may alternatively, partially, or completely be performed by another device or devices, such as the controller device 522, the NMD 512, the computing device 506, the PBD 536, and/or the PBD 538. Similarly, the identification of the URL may alternatively, partially, or completely be performed by another device or devices, such as NMD 512, computing device 504, PBD 536, and/or PBD 538.

f. Example network microphone device

Fig. 6 illustrates a functional block diagram of an example network microphone device 600, which may be configured as one or more of the NMDs 512, 514, and 516 of fig. 5 and/or as any of the VEDs disclosed and described herein. As shown, the network microphone device 600 includes one or more processors 602, tangible, non-transitory computer-readable memory 604, a microphone array 606 (e.g., one or more microphones), a network interface 608, a user interface 610, software components 612, and speakers 614. One of ordinary skill in the art will appreciate that other network microphone device configurations and arrangements are possible. For example, in alternative arrangements, the network microphone device may not include the speaker 614 or may have a single microphone instead of the microphone array 606.

The one or more processors 602 may include one or more processors and/or controllers, which may take the form of general-purpose or special-purpose processors or controllers. For example, the one or more processors 602 may include a microprocessor, a microcontroller, an application-specific integrated circuit, a digital signal processor, or the like. The tangible, non-transitory computer-readable memory 604 may be a data storage device that can be loaded with one or more of the software components executable by the one or more processors 602 to perform those functions. Thus, the memory 604 may include one or more non-transitory computer-readable storage media, examples of which may include volatile storage media (e.g., random access memory, registers, cache memory, etc.) and non-volatile storage media (e.g., read-only memory, hard disk drives, solid-state drives, flash memory, and/or optical storage, etc.).

The microphone array 606 may be a plurality of microphones arranged to detect sound in the environment of the network microphone device 600. The microphone array 606 may include any type of microphone now known or later developed, such as a condenser microphone, an electret condenser microphone, or a dynamic microphone, among others. In one example, the microphone array may be arranged to detect audio from one or more directions relative to the network microphone device. The microphone array 606 may be sensitive to a portion of a frequency range. In one example, a first subset of the microphone array 606 may be sensitive to a first frequency band, and a second subset of the microphone array may be sensitive to a second frequency band. The microphone array 606 may also be arranged to capture location information of an audio source (e.g., speech, audible sound) and/or to help filter background noise. Notably, in some embodiments the microphone array may consist of only a single microphone rather than a plurality of microphones.

The network interface 608 may be configured to facilitate wireless and/or wired communication between various network devices (e.g., with reference to fig. 5, the controller device 522, the PBDs 532-538, the computing device 504 in the cloud network 502, and other network microphone devices, among others). As such, the network interface 608 may take any suitable form for performing these functions, examples of which may include an Ethernet interface, a serial bus interface (e.g., FireWire, USB 2.0, etc.), a chipset and antenna adapted to facilitate wireless communication, and/or any other interface that provides wired and/or wireless communication. In one example, the network interface 608 may be based on one or more industry standards (e.g., infrared, radio, wired standards including IEEE 802.3, and wireless standards including IEEE 802.11a, 802.11b, 802.11g, 802.11n, 802.11ac, 802.15, and 4G mobile communication standards, among others).

The user interface 610 of the network microphone device 600 may be configured to facilitate user interaction with the network microphone device. In one example, the user interface 610 may include one or more of physical buttons, graphical interfaces disposed on a touch-sensitive screen and/or surface, and the like, for a user to provide input directly to the network microphone apparatus 600. The user interface 610 may also include one or more of lights and speakers 614 to provide visual and/or audio feedback to the user. In one example, the network microphone device 600 may also be configured to play back audio content through the speaker 614.

Example noise suppression systems and methods

Figs. 7A-7C depict network devices 700 (identified individually as network devices 700a-700d). Each network device 700 includes a housing 704, the housing 704 at least partially enclosing certain components of the network device (not shown), such as amplifiers, transducers, processors, and antennas, within the housing. Network device 700 also includes microphones 702 (identified individually as microphones 702a-g) disposed at various locations of the housing 704. For example, network device 700a includes a structured array of microphones 702. In some embodiments, the microphones 702 may be located within and/or exposed through apertures in the housing 704. The network device 700a may be configured as one or more of the NMDs 512, 514, and 516 of fig. 5 and/or any of the VEDs disclosed and described herein.

As described above, embodiments described herein facilitate suppressing noise from audio content captured by multiple microphones to help detect the presence of wake words in the captured audio content. Some noise suppression processes involve single microphone techniques for suppressing frequencies at which noise dominates over the speech content. However, these techniques can result in significant distortion of the voice content. Other noise suppression processes involve beamforming techniques in which a structured array of microphones is used to capture audio content from a particular direction in which speech is dominant over noise content, rather than from a direction in which noise is dominant over speech content.

While effective at suppressing unwanted noise when capturing audio content, beamforming has limitations. For example, conventional beamforming may generally be suboptimal for detecting speech input as compared to the enhanced suppression techniques described below. For example, fig. 11 shows that, under the same conditions, use of the multi-channel Wiener filter (MCWF) algorithm described below significantly improves wake-word detection relative to conventional static beamforming. The test conditions involved (1) detecting the wake word in noisy sound samples with an SNR of −15 dB, (2) playing back the same sample track ("Relax" by Frankie Goes to Hollywood), and (3) using the same NMD for each test (the NMD used for the tests has an array of six microphones that are spaced apart from each other within an appropriate distance for conventional beamforming). Fig. 11 depicts the results of three different test cases under these test conditions: (1) graph 1110 depicts the sound samples detected without using beamforming or the MCWF algorithm, (2) graph 1120 depicts the sound samples detected using beamforming, and (3) graph 1130 depicts the sound samples detected using the MCWF-based algorithm. In each of the graphs 1110, 1120, and 1130, the x-axis represents time, the y-axis corresponds to frequency, and the darkness of the graph represents the intensity of the detected sound sample in dB (with intensity increasing with increasing darkness). Further, in each of the graphs 1110, 1120, and 1130, the wake word (identified by the arrow in fig. 11) begins approximately halfway along the x-axis and ends approximately three-quarters of the way along the x-axis. Comparing graph 1120 to graph 1110, it can be seen that beamforming removes some of the noise but still retains a significant amount of noise. Comparing graph 1130 to graph 1120, however, it can be seen that the MCWF algorithm filters out substantially more noise than beamforming, and thus the wake word can be identified more easily from the MCWF-filtered sound samples than from the beamforming-filtered sound samples.

Additionally, beamforming typically requires a known array configuration, with the network device 700 selectively capturing audio from particular directions relative to the array. Beamforming is therefore only possible if such a microphone array 702 can be implemented. For example, if the microphones 702 and the processing components of the network device 700a of fig. 7A are configured for conventional beamforming, then, to avoid spatial aliasing at frequencies up to 4 kHz, the spacing or distance d₁ between adjacent microphones 702 would be limited to a theoretical maximum of about 4.25 cm. However, some network devices may not be able to support such a closely spaced array of microphones 702 due to hardware or other design constraints. When using the enhanced noise suppression techniques described herein, the distance d₁ between microphones 702 in various embodiments need not be restricted to such a theoretical maximum.

For example, fig. 7B depicts a network device 700b in which the microphones 702 are arranged in an unordered manner. As used herein, the term "unordered manner" refers to any arrangement of microphones that is not used as a beamforming array. Microphones arranged in an unordered manner may thus be positioned in any arrangement with respect to one another; may be located along the housing wherever convenient, for example between speakers, electronics, buttons, and/or other components; and/or may be arranged in an ordered fashion that nevertheless does not (or at least may not) support beamforming. For example, as shown in fig. 7B, the microphones 702 appear to be arranged according to a particular geometric configuration, with microphones 702a, 702b, 702f, and 702g arranged in a first horizontal plane and microphones 702c, 702d, and 702e arranged in a second horizontal plane. However, even though the microphone 702 arrangement in fig. 7B includes some degree of order, the arrangement is referred to as "unordered" because the microphones 702 are too far apart from one another to perform beamforming, or at least too far apart to effectively perform beamforming for the type of voice applications disclosed and described herein. In some embodiments, the minimum distance between two given microphones is greater than 5 cm. For example, the spacing or distance d₂ between microphones 702c and 702d, or between any other set of two or more microphones, may be between 5 cm and 60 cm.

Fig. 7C depicts microphones 702 distributed across multiple network devices, according to an example embodiment. In particular, microphones 702c, 702d, and 702e are disposed in the housing 704 of network device 700c, and microphones 702a, 702b, 702f, and 702g are disposed in the housing 704 of network device 700d. In some embodiments, network devices 700c and 700d are located in the same room (e.g., as separate devices in a home theater configuration) but in different areas of the room. In such embodiments, the separation or distance between microphones 702 on network devices 700c and 700d (e.g., the distance d₃ between microphones 702d and 702f) may exceed 60 cm. For example, the distance d₃ between microphones 702d and 702f, or between any other set of microphones disposed on separate network devices, may be between 1 and 5 meters.

In each of the arrangements depicted in figs. 7A-7C, network device 700 employs a multi-microphone noise suppression technique that does not necessarily rely on the geometric arrangement of the microphones 702. Instead, techniques for suppressing noise according to various embodiments involve applying linear time-invariant filtering to the observed noisy signal, assuming additive noise and known, stationary signal and noise spectra. Network device 700 uses first audio content captured by one or more of the microphones 702 to estimate noise in second audio content simultaneously captured by one or more other ones of the microphones 702. For example, microphone 702a captures first audio content while microphone 702g simultaneously captures second audio content. If a user near network device 700 speaks a voice command, the voice content in both the first audio content captured by microphone 702a and the second audio content captured by microphone 702g includes the same voice command. Further, if a noise source is near the network device 700, both the first audio content captured by microphone 702a and the second audio content captured by microphone 702g include noise content from that noise source.

However, since the microphones 702a and 702g are spaced apart from each other, the intensity of the speech content and the noise content may vary between the first audio content and the second audio content. For example, if microphone 702a is closer to a noise source and microphone 702g is closer to a speaking user, noise content may dominate first audio content captured by microphone 702a, while voice content may dominate second audio content captured by microphone 702 g. Also, if the noise content dominates the first audio content, the network device 700 may use the first audio content to generate an estimate of the noise content present in the second audio content. The estimated noise from the first audio content may then be used to filter out the noise and preserve the speech in the second audio content.

In some embodiments, network device 700 performs this process for all microphones 702 simultaneously, such that the noise content captured by each microphone is used to estimate the noise content captured by each other microphone. The network device 700 filters the respective audio signals captured by each microphone 702 using the estimated noise content to suppress the respective noise content in each audio signal, and then combines the filtered audio signals. In the case of suppressing the noise content of each audio signal, the main content of each audio signal is the speech content, and thus the combined audio signal is also speech dominant.

An example MCWF algorithm for performing these processes is described in further detail below in conjunction with FIG. 8.

Fig. 8 depicts an example environment 800 in which such a noise suppression process is performed. The environment 800 includes a plurality of microphones 802 (identified as microphones 802a-g, respectively) for capturing audio content. Microphone 802 may be configured as one or more of microphones 702 in fig. 7A-7C. As shown, environment 800 includes seven microphones 802, but in other embodiments, environment 800 includes additional or fewer microphones. In some embodiments, microphone 802 is disposed on or within a single network device, such as network device 700. In other embodiments, one or more microphones 802 are disposed on or within one network device, while the remaining microphones are disposed on or within one or more other network devices.

In practice, the microphones 802 capture audio content that reaches them. As shown, when a person 804 speaks near the microphones 802, the person 804 generates a voice signal s(t). As the voice signal s(t) propagates throughout the environment 800, at least some of the voice signal s(t) reflects off walls or other nearby objects in the environment 800. These reflections can distort the speech signal s(t), such that the version of the speech signal captured by the microphones 802 is a reverberant speech signal x(t) that differs from the original speech signal s(t).

Further, the environment 800 includes one or more noise sources 806, for example, noise from nearby traffic or buildings, noise from people moving throughout the environment, noise from one or more playback devices in the environment 800, or any other environmental noise. In some embodiments, the noise source 806 includes speech content from a person other than the person 804. In any case, the noise source 806 produces a noise signal v(t) that is captured by some or all of the microphones 802. In this regard, the audio signal captured by the microphones 802 is denoted y(t), which is the sum of the reverberant voice signal x(t) and the noise signal v(t). For each individual one of the microphones 802, the captured audio signal may thus be characterized as:

$y_n(t) = x_n(t) + v_n(t), \quad n = 1, 2, \ldots, N$ (equation 1)

where n is the index of the reference microphone and N is the total number of microphones. Transforming from the time domain to the frequency domain, the above equation can be expressed as:

$Y_n(f) = X_n(f) + V_n(f), \quad n = 1, 2, \ldots, N$ (equation 2)

Alternatively, expressed in vector form as:

$\mathbf{y}(f) = \mathbf{x}(f) + \mathbf{v}(f)$ (equation 3)

Furthermore, Power Spectral Density (PSD) matrices P_yy(f), P_xx(f), and P_vv(f) are defined, where P_yy(f) is the PSD matrix of the total captured audio content, P_xx(f) is the PSD matrix of the speech portion of the total captured audio content, and P_vv(f) is the PSD matrix of the noise portion of the total captured audio content. These PSD matrices are determined using the following equations:

$P_{yy}(f) = E\{\mathbf{y}(f)\,\mathbf{y}^H(f)\}$ (equation 4)

$P_{xx}(f) = E\{\mathbf{x}(f)\,\mathbf{x}^H(f)\}$ (equation 5)

$P_{vv}(f) = E\{\mathbf{v}(f)\,\mathbf{v}^H(f)\}$ (equation 6)

where E{·} denotes the expected value operator and H denotes the Hermitian transpose operator. Assuming the speech portion and the noise portion of the total captured audio content are uncorrelated (which is typically the case), the PSD matrix of the speech portion of the total captured audio content can be written as:

$P_{xx}(f) = P_{yy}(f) - P_{vv}(f)$ (equation 7)
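
For illustration only, the short sketch below shows one way the PSD matrices of equations 4-7 could be approximated in practice, by averaging per-bin outer products over STFT frames (the frame-averaging approximation, the array shapes, and all variable names are assumptions made for this example, not part of the original disclosure):

```python
import numpy as np

def estimate_psd_matrices(Y_frames, V_frames):
    """Approximate the per-bin PSD matrices of equations 4-7.

    Y_frames: complex STFT frames of all captured audio, shape (num_frames, num_bins, num_mics).
    V_frames: frames judged to be noise-only, with the same last two dimensions.
    The expectation E{.} is approximated by an average over frames.
    """
    num_frames, num_bins, num_mics = Y_frames.shape
    P_yy = np.zeros((num_bins, num_mics, num_mics), dtype=complex)
    P_vv = np.zeros((num_bins, num_mics, num_mics), dtype=complex)
    for f in range(num_bins):
        # E{y(f) y^H(f)} approximated by a frame average (equation 4)
        P_yy[f] = (Y_frames[:, f, :].T @ Y_frames[:, f, :].conj()) / num_frames
        # E{v(f) v^H(f)} approximated from noise-only frames (equation 6)
        P_vv[f] = (V_frames[:, f, :].T @ V_frames[:, f, :].conj()) / V_frames.shape[0]
    # Speech PSD via equation 7, relying on the assumption that speech and noise are uncorrelated
    P_xx = P_yy - P_vv
    return P_yy, P_vv, P_xx
```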

To reduce the noise content v(f) and recover the speech content x(f) of the captured multi-channel audio content y(f), the captured multi-channel audio content y(f) is passed through a filter 808. In some embodiments, the filter 808 comprises instructions stored on a tangible, non-transitory computer-readable medium that, when executed by one or more processors of a network device, cause the network device to perform the multi-channel filtering functions disclosed and described herein.

The filter 808 may filter the captured multi-channel audio content y(f) in various ways. In some embodiments, the filter 808 applies a linear filter h_i(f) (where i = 1, 2, ..., N is the index of the reference microphone) to the vector y(f) of the captured multi-channel audio content. In this way, N linear filters h_i(f) (one for each microphone 802) are applied to the audio content vector y(f). Applying these filters produces the filtered output

$Z_i(f) = h_i^H(f)\,\mathbf{y}(f)$ (equation 8)

The filtered output Z_i(f) comprises a filtered speech component D_i(f) and a residual noise component V_i(f), where

$D_i(f) = h_i^H(f)\,\mathbf{x}(f)$ (equation 9)

and

$V_i(f) = h_i^H(f)\,\mathbf{v}(f)$ (equation 10)

To determine the linear filters h_i(f), a set of optimization constraints is defined. In some embodiments, the optimization constraints are defined to maximize the degree of noise reduction while limiting the degree of signal distortion, for example by limiting the degree of signal distortion to be less than or equal to a threshold. A noise reduction factor ξ_nr(h_i(f)) is defined as:

and a signal distortion index ν_sd(h_i(f)) is defined as:

where u_i is the i-th standard basis vector, defined as:

Thus, to maximize noise reduction while limiting signal distortion, the optimization problem in some implementations is to maximize ξ_nr(h_i(f)) subject to ν_sd(h_i(f)) ≤ σ²(f). To find the solution of this optimization problem, the derivative of the associated Lagrangian function with respect to h_i(f) is set to zero, and the resulting closed-form solution is:

$h_i(f) = \left[P_{xx}(f) + \beta\,P_{vv}(f)\right]^{-1} P_{xx}(f)\,u_i$ (equation 14)

where β (a positive value equal to the inverse of the Lagrange multiplier) is a factor that allows the trade-off between signal distortion and noise reduction at the output of h_i(f) to be tuned.
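
The images for equations 11-13 are not reproduced in this text. For orientation only, the forms commonly used for these quantities in the multi-channel Wiener filtering literature (and consistent with the closed-form solution of equation 14) are sketched below; they are stated here as assumptions, not as the exact expressions of the original figures:

```latex
% Assumed standard forms for the quantities referenced as equations 11-13:
\begin{aligned}
\xi_{\mathrm{nr}}\bigl(h_i(f)\bigr) &= \frac{u_i^{H}\,P_{vv}(f)\,u_i}{h_i^{H}(f)\,P_{vv}(f)\,h_i(f)}
  && \text{(noise reduction factor)}\\[4pt]
\nu_{\mathrm{sd}}\bigl(h_i(f)\bigr) &= \frac{\bigl(h_i(f)-u_i\bigr)^{H} P_{xx}(f)\,\bigl(h_i(f)-u_i\bigr)}{u_i^{H}\,P_{xx}(f)\,u_i}
  && \text{(speech distortion index)}\\[4pt]
u_i &= [\,0,\ \ldots,\ 0,\ 1,\ 0,\ \ldots,\ 0\,]^{T}
  && \text{(1 in the $i$-th position)}
\end{aligned}
```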

Computing such a linear filter h_i(f) directly may be computationally demanding. To reduce the computational complexity of the filter h_i(f), a more simplified form can be obtained by using the fact that the matrix P_xx(f) is a matrix of rank 1. And, because P_xx(f) is a matrix of rank 1, the rank of P_vv^{-1}(f)P_xx(f) is also 1. Additionally, the matrix inversion can be further simplified using the Woodbury matrix identity. Applying all of these concepts, the linear filter h_i(f) can be expressed as:

where

is computed from P_vv^{-1}(f)P_xx(f) and is used as a normalization factor.

One advantage of the linear filter h_i(f) is that it relies only on the PSD matrix of the total captured audio and the PSD matrix of the noise portion of the total captured audio, and thus does not rely on the speech portion of the total captured audio. Another advantage is that the parameter β allows the degree of noise reduction and signal distortion to be customized. For example, increasing β increases noise reduction at the cost of increased signal distortion, while decreasing β decreases signal distortion at the cost of increased residual noise.
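
As an illustration of how the closed form of equation 14 could be evaluated for a single frequency bin (using equation 7 for P_xx), a minimal sketch follows; a practical implementation would instead use the simplified rank-1/Woodbury form discussed above, and all names here are illustrative:

```python
import numpy as np

def mcwf_filters_one_bin(P_yy_f, P_vv_f, beta):
    """Return an (N, N) matrix whose i-th column is h_i(f) from equation 14.

    P_yy_f, P_vv_f: (N, N) PSD matrices for one frequency bin.
    """
    P_xx_f = P_yy_f - P_vv_f                 # equation 7
    A = P_xx_f + beta * P_vv_f
    # h_i(f) = A^{-1} P_xx(f) u_i; stacking all basis vectors u_i gives the identity,
    # so all N filters are obtained from a single linear solve.
    return np.linalg.solve(A, P_xx_f)

# Example use for one bin y_f of shape (N,):
# z_f = mcwf_filters_one_bin(P_yy_f, P_vv_f, beta=1.0).conj().T @ y_f
```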

Because the linear filter h_i(f) depends on the PSD matrix P_yy(f) of the total captured audio and the PSD matrix P_vv(f) of the noise portion of the total captured audio, these PSD matrices are estimated in order to apply the filters. In some embodiments, first-order exponential smoothing is used to estimate P_yy as:

$P_{yy}(n) = \alpha_y\,P_{yy}(n-1) + (1-\alpha_y)\,\mathbf{y}\mathbf{y}^H$ (equation 17)

where α_y is a smoothing coefficient and n denotes the time frame index. In addition, the frequency index (f) has been omitted from this equation and the following equations for simplicity of notation, but it should be understood that the processes disclosed herein are performed for each frequency bin. The smoothing coefficient ranges from 0 to 1 and can be adjusted to tune the estimation of P_yy. Increasing α_y increases the smoothness of the P_yy estimate by reducing the extent to which P_yy changes between successive time frame indices, while decreasing α_y reduces the smoothness of the P_yy estimate by increasing the extent to which P_yy can vary between successive time frame indices.
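
A short illustration of the recursive update of equation 17, applied to every frequency bin at once (the array shapes and the value of α_y are assumptions made for this example):

```python
import numpy as np

def update_P_yy(P_yy, y, alpha_y=0.9):
    """First-order exponential smoothing of P_yy per equation 17.

    P_yy: (num_bins, N, N) running estimate; y: (num_bins, N) current STFT frame.
    alpha_y in [0, 1] trades smoothness against responsiveness, as described above.
    """
    outer = y[:, :, None] * y[:, None, :].conj()     # y y^H for every bin
    return alpha_y * P_yy + (1.0 - alpha_y) * outer
```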

To estimate P_vv, in some embodiments the filter 808 determines whether speech content is present in each frequency bin. If the filter 808 determines that speech content is present or likely to be present in a particular frequency bin, the filter 808 treats that frequency bin as not representative of noise content and does not use the frequency bin to estimate P_vv. On the other hand, if the filter 808 determines that speech content is not present or is unlikely to be present in a particular frequency bin, the filter 808 treats that frequency bin as composed primarily or entirely of noise content, and the filter 808 then uses that content to estimate P_vv.

The filter 808 may determine whether voice content is present in a frequency bin in various ways. In some embodiments, the filter 808 makes this determination using a hard Voice Activity Detection (VAD) algorithm. In other embodiments, the filter 808 makes this determination using a soft speech presence probability algorithm. For example, assuming a Gaussian signal model, the speech presence probability is calculated as follows:

where n is the time frame index, where

And wherein

is the a priori probability of speech absence. The derivation of this speech presence probability is described in Souden et al., "Gaussian Model-Based Multichannel Speech Presence Probability," IEEE Transactions on Audio, Speech, and Language Processing (2010), the entire contents of which are incorporated herein by reference.

Notably, the speech presence probability calculation depends on the speech content PSD matrix P_xx. However, because P_xx(f) = P_yy(f) − P_vv(f), this dependency can be eliminated by rewriting γ as follows:

Further, the variable ξ may be written as:

wherein

Wherein

And wherein

By defining the following vector, the computational complexity of the speech presence probability computation can be further reduced:

so that ψ can be written as:

and γ can be written as:

Thus, by calculating y_temp before attempting to calculate ψ or γ, repeated calculations can be avoided when the filter 808 determines the speech presence probability.

Once the speech presence probability is determined for a given time frame, the filter 808 updates the estimate of the noise covariance matrix, by employing the expected value operator, according to the following equation:

wherein

is an effective, frequency-dependent smoothing factor.
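
The images for equations 30-31 are not reproduced in this text. As a hedged illustration of the idea described above — the noise PSD update is smoothed more heavily (effectively frozen) in bins where speech is likely — one common formulation is sketched below; the exact expressions of the original figures may differ:

```python
import numpy as np

def update_noise_psd(P_vv, y, p_speech, alpha_v=0.92):
    """Speech-presence-weighted noise PSD update (assumed form of equations 30-31).

    P_vv: (num_bins, N, N); y: (num_bins, N); p_speech: (num_bins,) speech presence probability.
    When p_speech is near 1 the effective smoothing factor is near 1, so the noise
    estimate barely changes; when p_speech is near 0 the bin contributes fully.
    """
    alpha_eff = alpha_v + (1.0 - alpha_v) * p_speech
    outer = y[:, :, None] * y[:, None, :].conj()
    return alpha_eff[:, None, None] * P_vv + (1.0 - alpha_eff)[:, None, None] * outer
```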

To obtain the updated P_vv^{-1}(n) for use in h_i(f), the Sherman-Morrison formula is used as follows:

wherein

Once the updated P_vv^{-1}(n) is determined, the filter 808 may determine the linear filters h_i(n) for all f values and all i values and apply them to the captured audio content. The output of filter 808 is then denoted as y_{o,i}(n) = h_i^H(n) y(n). In some embodiments, filter 808 computes the output in parallel for all i using a matrix H(n) whose columns are the h_i(n), such that

and $\mathbf{y}_{\mathrm{out}} = H^H\,\mathbf{y}$, (equation 36)

Wherein

and $\xi = \lambda(n) - N$. (equation 38)

In some embodiments, the filter 808 does not compute H directly, which would require matrix multiplications. Instead, the computational complexity is significantly reduced by having the filter 808 compute the output as follows:

and is

With the above concepts, the filter 808 suppresses noise and retains speech content in the multi-channel audio signal captured by the microphones 802. In simplified form, this may include:

A. updating P_yy(n) for all f;

B. calculating the speech presence probability P(H_1 | y(n));

C. updating P_vv^{-1}(n) for all f using the speech presence probability; and

D. calculating the linear filters h_i(n) for all f and all i, and calculating the output as

$y_{o,i}(n) = h_i^H(n)\,\mathbf{y}(n)$

A more detailed example may include performing the following steps.

Step 1: The parameters and state variables are initialized at time frame 0. In some embodiments, P_yy and P_vv^{-1} are initialized by estimating P_yy over a certain period of time (e.g., 500 ms) and then initializing P_vv^{-1} to the inverse of the estimated P_yy.

Step 2: At each time frame n, the following steps 3-13 are performed.

Step 3: For each frequency index f = {1, ..., K}, the estimate of P_yy(n) is updated according to equation 17, y_temp is calculated according to equation 27, and ψ is calculated according to equation 28.

Step 4: For each frequency index f = {1, ..., K}, ψ is calculated using vector operations according to equation 24.

Step 5: For each frequency index f = {1, ..., K}, ξ is calculated according to equation 23 using vector operations.

Step 6: For each frequency index f = {1, ..., K}, γ is calculated according to equation 29.

Step 7: The speech presence probability over all frequency bins is calculated using vector operations according to equation 18.

Step 8: The effective smoothing factor used for updating P_vv(n) is calculated according to equations 30 and 31.

Step 9: w is calculated according to equation 34.

Step 10: For each frequency index f = {1, ..., K}, k(n) is updated according to equation 32, and P_vv^{-1}(n) is updated according to equation 33.

Step 11: For each frequency index f = {1, ..., K}, λ(n) is updated according to equation 37.

Step 12: ξ is calculated according to equation 38.

Step 13: For each frequency index f = {1, ..., K}, the output is calculated according to equation 39, and the output vector y_out of size N × 1 is calculated according to equation 40.
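
For illustration only, the sketch below strings the per-frame steps together. Because the images containing the speech presence probability expressions (equations 18-29) and the update/output expressions (equations 30-40) are not reproduced in this text, the sketch substitutes a caller-supplied probability function, a Sherman-Morrison rank-1 inverse update in an assumed form, and the direct closed form of equation 14 for the output; it is an assumption-laden outline, not the algorithm exactly as originally specified:

```python
import numpy as np

def sherman_morrison_update(P_inv, y, alpha):
    """Update P_vv^{-1} when P_vv(n) = alpha * P_vv(n-1) + (1 - alpha) * y y^H
    (the role played by equations 32-33; assumed form)."""
    u = P_inv @ y                                       # P_vv^{-1}(n-1) y
    denom = alpha / (1.0 - alpha) + np.vdot(y, u).real  # y^H P_vv^{-1}(n-1) y is real
    return (P_inv - np.outer(u, u.conj()) / denom) / alpha

def process_frame(Y, state, spp_fn, alpha_y=0.9, alpha_v=0.92, beta=1.0):
    """Steps 2-13 for one time frame, per frequency bin.

    Y: (num_bins, N) complex STFT frame; state holds per-bin "P_yy" and "P_vv_inv".
    spp_fn(y, state, f) stands in for the speech presence probability of equation 18.
    Returns the (num_bins, N) filtered multi-channel output.
    """
    num_bins, N = Y.shape
    Y_out = np.empty_like(Y)
    for f in range(num_bins):
        y = Y[f]
        # Step 3: update P_yy (equation 17)
        state["P_yy"][f] = alpha_y * state["P_yy"][f] + (1 - alpha_y) * np.outer(y, y.conj())
        # Steps 4-7: speech presence probability (equations 18-29, not reproduced here)
        p = spp_fn(y, state, f)
        # Steps 8-10: speech-presence-weighted noise update and its inverse (assumed forms)
        a_eff = alpha_v + (1 - alpha_v) * p
        state["P_vv_inv"][f] = sherman_morrison_update(state["P_vv_inv"][f], y, a_eff)
        # Steps 11-13: filters and output, here via the direct closed form of equation 14
        P_vv = np.linalg.inv(state["P_vv_inv"][f])
        P_xx = state["P_yy"][f] - P_vv
        H = np.linalg.solve(P_xx + beta * P_vv, P_xx)   # columns are h_i(n)
        Y_out[f] = H.conj().T @ y                       # y_{o,i}(n) = h_i^H(n) y(n)
    return Y_out
```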

In addition to the advantages already described, the MCWF-based processing described above provides further advantages. For example, the captured audio signals are filtered in a distributed manner, such that the audio signals do not need to be aggregated at a central node for processing. Furthermore, the MCWF algorithm may be executed at each separate node where a microphone is present, and each node may then share its output from the MCWF algorithm with some or all of the other nodes in the networked system. For example, each of the microphones 702 in fig. 7C is part of a respective node capable of executing the MCWF algorithm. As such, the node comprising microphone 702a processes audio captured by microphone 702a according to the MCWF algorithm and then provides MCWF outputs to the nodes associated with microphones 702b-g. Similarly, the node comprising microphone 702a receives MCWF outputs from each of the nodes associated with microphones 702b-g. Thus, each node can use the MCWF outputs from the other nodes when estimating and filtering out noise content according to the MCWF algorithm.

Referring again to fig. 8, once the filter 808 has suppressed the noise content and retained the speech content in the respective audio signals captured by the microphones 802, for example using the MCWF algorithm described above, the filter 808 combines the filtered audio signals into a single signal. With the noise content of each audio signal suppressed and the speech content preserved, the combined signal similarly has suppressed noise content and preserved speech content.

The filter 808 provides the combined signal to a voice processing block 810 for further processing. The voice processing block 810 runs a wake-word detection process on the output of the filter 808 to determine whether the speech content output by the filter includes a wake word. In some embodiments, the voice processing block 810 is implemented as software executed by one or more processors of the network device 700. In other embodiments, the voice processing block 810 is a separate computing system, e.g., one or more of the computing devices 504, 506, and/or 508 shown and described with reference to fig. 5.

In response to determining that the output of the filter 808 includes a wake word, the voice processing block 810 performs further speech processing on the output of the filter 808 to recognize a voice command following the wake word. In response to the voice processing block 810 recognizing a voice command after the wake word, the network device 700 performs a task corresponding to the recognized voice command. For example, as described above, in particular embodiments, the network device 700 may send the voice input, or a portion thereof, to a remote computing device associated with, for example, a voice assistant service.
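
A minimal sketch of this gating step is shown below; `detect_wake_word` and `transcribe_remote` are hypothetical placeholders standing in for the wake-word detector of block 810 and the remote voice assistant service, not actual APIs:

```python
def handle_filtered_signal(combined_signal, detect_wake_word, transcribe_remote):
    """If a wake word is found in the filtered, combined signal, forward only the
    portion following the wake word to the remote service for speech processing."""
    hit = detect_wake_word(combined_signal)       # hypothetical detector; returns None or a match
    if hit is None:
        return None
    utterance = combined_signal[hit.end_sample:]  # hypothetical field marking the wake word's end
    return transcribe_remote(utterance)
```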

In some embodiments, the robustness and performance of the MCWF may be enhanced based on one or more of the following adjustments to the aforementioned algorithm.

1) The parameter β may be time-frequency dependent. There are various methods for designing a time-frequency dependent β based on the speech presence probability, the signal dispersion ratio (SDR), and the like. The idea is to use smaller values to reduce speech distortion when the SDR is high and speech is present, and larger values to increase noise reduction when the SDR is low or speech is absent. The value thus trades off noise reduction against speech distortion based on the conditional speech presence probability. One simple and effective method is to define β as:

$\beta(\mathbf{y}) = \beta_0\,/\,\bigl(\alpha_\beta + (1-\alpha_\beta)\,P(H_1\,|\,\mathbf{y})\bigr)$

where the conditional speech presence probability is incorporated to adjust the parameter β based on the input vector y. The parameter α_β provides a compromise between a fixed trade-off parameter and a parameter that depends entirely on the speech presence probability. In one implementation, α_β = 0.5.
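
A two-line illustration of this adaptive β and its behavior at the extremes of the speech presence probability (β₀ = 1.0 is an arbitrary example value):

```python
def beta_adaptive(p_speech, beta0=1.0, alpha_beta=0.5):
    """Time-frequency dependent beta driven by the conditional speech presence probability."""
    return beta0 / (alpha_beta + (1.0 - alpha_beta) * p_speech)

# With alpha_beta = 0.5: speech present (p = 1) gives beta0 (less distortion);
# speech absent (p = 0) gives 2 * beta0 (more noise reduction).
```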

2) An MMSE estimate of the desired voice signal may be obtained according to the following equation:

$\mathbf{y}_{\mathrm{out}} = P(H_1\,|\,\mathbf{y})\,H^H(n)\,\mathbf{y}(n) + \bigl(1 - P(H_1\,|\,\mathbf{y})\bigr)\,G_{\min}\,\mathbf{y}$

where the gain factor G_min determines the maximum amount of noise reduction applied when the speech presence probability indicates that speech is not present. The importance of this model is that it can mitigate speech distortion in the event of an erroneous decision about the speech presence probability, which improves robustness. This can be implemented after step 13 of the algorithm by modifying y_out as follows:

$\mathbf{y}_{\mathrm{out}} = P(H_1\,|\,\mathbf{y})\,\mathbf{y}_{\mathrm{out}} + \bigl(1 - P(H_1\,|\,\mathbf{y})\bigr)\,G_{\min}\,\mathbf{y}$

where the speech presence probability is used to generate the output and to control how G_min is applied.
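
A corresponding one-line modification after step 13, with G_min = 0.1 (about −20 dB) as an illustrative floor:

```python
def mmse_output(y_out, y_in, p_speech, G_min=0.1):
    """Blend the filtered output with an attenuated copy of the input according to the
    speech presence probability, limiting the damage of a wrong speech/noise decision."""
    return p_speech * y_out + (1.0 - p_speech) * G_min * y_in
```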

3) The algorithm is adapted and implemented with two supported modes: A) Noise Suppression (NS); and B) Residual Echo Suppression (RES). If the speaker is playing back content, the algorithm may run in the RES mode; otherwise, the algorithm runs in the NS mode. The mode may be determined using an internal condition indicating whether audio playback is present.

4) The covariance matrices are initialized in step 1 of the algorithm. The algorithm contains an initialization period in which the input signals of the microphone array are used to estimate the initial input and noise covariance matrices. It may be assumed that no speech is present during this initialization. These covariance matrices are initialized with diagonal matrices to simplify the implementation. The initialization time may be adjusted in the algorithm to, for example, 0.5 seconds. This method provides a more robust solution that is insensitive to input level and noise type. Thus, very similar convergence speeds can be achieved at all SNR levels and loudness levels.
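
A sketch of such a diagonal initialization is shown below (the loading value and dictionary layout are illustrative assumptions; the 0.5-second estimation period described above would then refine P_yy before normal operation begins):

```python
import numpy as np

def initialize_state(num_bins, num_mics, diag_load=1e-3):
    """Step-1 style initialization with diagonal covariance matrices."""
    eye = np.tile(np.eye(num_mics, dtype=complex), (num_bins, 1, 1))
    return {
        "P_yy": diag_load * eye,
        "P_vv_inv": (1.0 / diag_load) * eye,   # inverse of the diagonal initialization
    }
```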

5) To improve the multi-channel speech presence probability in view of the statistical characteristics of the speech signal, the following recursively smoothed multi-channel speech presence probability may be used:

where the smoothing coefficient α_P is between 0 and 1 and may be adjusted during the parameter tuning stage to adjust the estimate of the speech presence probability.
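
The image containing the smoothed-probability expression is not reproduced in this text; assuming the usual first-order recursion, it can be illustrated as:

```python
def smooth_speech_presence(p_prev, p_inst, alpha_P=0.7):
    """Recursively smoothed multi-channel speech presence probability (assumed first-order
    recursion; alpha_P = 0.7 is an illustrative value in (0, 1))."""
    return alpha_P * p_prev + (1.0 - alpha_P) * p_inst
```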

V. Example noise suppression method

Fig. 9 illustrates an example embodiment of a method 900 that may be implemented by a network device, such as the network device 700 or any of the PBDs, NMDs, controller devices, or other VEDs disclosed and/or described herein, or any other voice-enabled device now known or later developed.

Various embodiments of method 900 include one or more of the operations, functions, and actions shown in blocks 902-914. Although the blocks are shown sequentially, the blocks may also be performed in parallel and/or in a different order than that disclosed and described herein. Moreover, the various blocks may be combined into fewer blocks, divided into additional blocks, and/or removed based on the desired implementation.

Additionally, with respect to the method 900 and other processes and methods disclosed herein, the flow diagram illustrates the functionality and operation of one possible implementation of some embodiments. In this regard, each block may represent a module, segment, or portion of program code that comprises one or more instructions executable by one or more processors for implementing the specified logical function or step in the process. The program code may be stored on any type of computer-readable medium, such as a storage device including a disk or hard drive. The computer-readable medium may include a non-transitory computer-readable medium, for example, a tangible non-transitory computer-readable medium that stores data for short periods of time, such as register memory, processor cache, and Random Access Memory (RAM). The computer-readable medium may also include non-transitory media such as secondary or persistent long-term storage devices, e.g., Read Only Memory (ROM), optical or magnetic disks, compact disc read-only memory (CD-ROM), and so on. The computer-readable medium may also be any other volatile or non-volatile storage system. The computer-readable medium may be considered a computer-readable storage medium, such as a tangible storage device. Additionally, for the method 900 and other processes and methods disclosed herein, each block in fig. 9 may represent circuitry that is wired to perform the specific logical functions in the process.

The method 900 begins at block 902, which includes a network device capturing (i) a first audio signal via a first microphone of a plurality of microphones and (ii) a second audio signal via a second microphone of the plurality of microphones, wherein the first audio signal includes first noise content from a noise source and the second audio signal includes second noise content from the same noise source. In some embodiments, the multiple microphones, including the first microphone and the second microphone, are components of the same network device (e.g., network device 700a or 700b depicted in figs. 7A-7B). In other embodiments, at least some of the plurality of microphones are components of different network devices, for example, as shown in fig. 7C. In an example implementation, the first microphone is a component of a first network device (e.g., network device 700c) and the second microphone is a component of a second network device (e.g., network device 700d).

Next, the method 900 proceeds to block 904, which includes identifying first noise content in the first audio signal. In some embodiments, the step of identifying the first noise content in the first audio signal involves one or more of: (i) the network device uses a VAD algorithm to detect the absence of speech in the first audio signal, or (ii) the network device uses a speech presence probability algorithm to determine a probability that speech is present in the first audio signal. An example of a speech presence probability algorithm is described above with reference to equation 18. If the VAD algorithm detects that no speech is present in the first audio signal, or the speech presence probability algorithm indicates that the probability of speech being present in the first audio signal is below a threshold probability, this may indicate that the first audio signal is noise-dominant and includes little or no speech content.

Next, method 900 proceeds to block 906, which includes determining an estimated noise content captured by the plurality of microphones using the identified first noise content. In some embodiments, the step of using the identified first noise content to determine an estimated noise content captured by the plurality of microphones involves the network device updating a noise content PSD matrix for the MCWF algorithm described above with reference to equations 30-34.

In some embodiments, the following steps are performed based on the probability of speech being present in the first audio signal being below a threshold probability: first noise content in the first audio signal is identified at block 904, and estimated noise content captured by the plurality of microphones is determined at block 906 using the identified first noise content. As described above, the speech presence probability algorithm indicating that the probability of speech being present in the first audio signal is below the threshold probability indicates that: the first audio signal is noise dominant and includes little or no speech content. Such noise-dominated signals are more likely to provide an accurate estimate of the noise present in other signals (e.g., the second audio signal) captured by the microphone than non-noise-dominated signals. Thus, in some embodiments, the step of using the identified first noise content to determine an estimated noise content captured by the plurality of microphones is performed in response to determining that the probability of speech being present in the first audio signal is below a threshold probability. The threshold probability may take on various values, and in some embodiments, may be adjusted to tune the noise filtering method described herein. In some embodiments, the threshold probability is set as low as 1%. In other embodiments, the threshold probability is set to a higher value, for example between 1% and 10%.
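
A minimal sketch of this gating of blocks 904-906 by the threshold probability (the threshold value and the `noise_estimator` interface are illustrative assumptions):

```python
SPEECH_PROB_THRESHOLD = 0.05   # illustrative value in the 1%-10% range discussed above

def maybe_update_noise_estimate(p_speech, audio_signal, noise_estimator):
    """Treat a signal as noise-dominant, and use it to update the estimated noise content,
    only when its speech presence probability is below the threshold."""
    if p_speech < SPEECH_PROB_THRESHOLD:
        noise_estimator.update(audio_signal)   # e.g., the PSD update sketched earlier
        return True
    return False
```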

Next, the method 900 proceeds to block 908, which includes suppressing a first noise content in the first audio signal and a second noise content in the second audio signal using the estimated noise content. In some embodiments, the step of suppressing the first noise content in the first audio signal and the second noise content in the second audio signal using the estimated noise content involves the network device using the updated noise content PSD matrix to apply a linear filter, as described above with reference to equations 35-40, to each audio signal captured by the plurality of microphones.

Next, method 900 proceeds to block 910, which includes combining the suppressed first audio signal and the suppressed second audio signal into a third audio signal. In some embodiments, the step of combining the suppressed first audio signal and the suppressed second audio signal into the third audio signal involves the network device combining the suppressed audio signals from all of the plurality of microphones into the third audio signal.

Next, the method 900 proceeds to block 912, which includes determining that the third audio signal includes a speech input including a wake-up word. In some embodiments, the step of determining that the third audio signal includes a speech input including a wake-up word involves the network device performing one or more speech processing algorithms on the third audio signal to determine whether any portion of the third audio signal includes a wake-up word. In operation, the step of determining that the third audio signal comprises a speech input comprising a wake-up word may be performed in accordance with any wake-up word detection method disclosed and described herein and/or any wake-up word detection method now known or later developed.

Finally, the method 900 proceeds to block 914, which includes: in response to determining that the third audio signal includes speech content that includes a wake word, at least a portion of the speech input is sent to the remote computing device for speech processing to recognize a speech utterance that is different from the wake word. As described above, the voice input may include a wake word and a voice utterance following the wake word. The voice utterance may include a spoken command and one or more spoken keywords. Thus, in some embodiments, the step of sending at least a portion of the voice input to the remote computing device for voice processing to identify a different voice utterance than the wake word includes sending a portion of the voice input (which may include spoken commands and/or spoken keywords) after the wake word to a separate computing system for voice analysis.

VII. Conclusion

The above description discloses, among other things, various example systems, methods, apparatus, and articles of manufacture including, among other things, firmware and/or software executed on hardware. It should be understood that these examples are illustrative only and should not be considered as limiting. For example, it is contemplated that any or all of these firmware, hardware, and/or software aspects or components could be embodied exclusively in hardware, exclusively in software, exclusively in firmware, or in any combination of hardware, software, and/or firmware. Accordingly, the examples provided are not the only way to implement such systems, methods, apparatus, and/or articles of manufacture.

(feature 1) a network device comprising: (i) a plurality of microphones including a first microphone and a second microphone; (ii) one or more processors; and (iii) a tangible, non-transitory computer-readable medium storing instructions executable by the one or more processors to cause the network device to perform operations comprising: (a) capturing (i) a first audio signal via the first microphone, and (ii) a second audio signal via the second microphone, wherein the first audio signal comprises first noise content from a noise source, and the second audio signal comprises second noise content from the noise source; (b) identifying a first noise content in the first audio signal; (c) determining estimated noise content captured by the plurality of microphones using the identified first noise content; (d) using the estimated noise content to suppress a first noise content in the first audio signal and a second noise content in the second audio signal; (e) combining the suppressed first audio signal and the suppressed second audio signal into a third audio signal; (f) determining that the third audio signal includes a speech input including a wake-up word; and (g) in response to the determination, sending at least a portion of the speech input to a remote computing device for speech processing to recognize a speech utterance that is different from the wake word.

(feature 2) the network device of feature 1, the operations further comprising: (i) determining a probability that the first audio signal comprises speech content, (ii) wherein the following steps are performed based on the determined probability being below a threshold probability: (a) identifying first noise content in the first audio signal, and (b) using the identified first noise content to determine estimated noise content captured by the plurality of microphones.

(feature 3) the network device of feature 1, further comprising a housing at least partially enclosing components of the network device within the housing, wherein the first and second microphones are arranged along the housing and separated from each other by a distance greater than about five centimeters.

(feature 4) the network device of feature 1, the operations further comprising: (i) capturing a fourth audio signal via a third microphone of the plurality of microphones, wherein the fourth audio signal includes third noise content from the noise source; (ii) identifying a third noise content in the fourth audio signal; and (iii) updating the estimated noise content captured by the plurality of microphones using the identified third noise content.

(feature 5) the network device of feature 4, wherein the network device captures the fourth audio signal simultaneously with the first and second audio signals.

(feature 6) the network device of feature 4, further comprising a housing at least partially enclosing components of the network device within the housing, wherein the first, second, and third microphones are arranged along the housing and are separated from each other by a distance greater than about five centimeters.

(feature 7) the network device of feature 1, wherein sending at least a portion of the speech input to a remote computing device for speech processing to recognize a speech utterance that is different from the wake word comprises: sending a portion of the speech input after the wake word to a separate computing system for speech analysis.

(feature 8) a tangible, non-transitory computer-readable medium storing instructions executable by one or more processors to cause a network device to perform operations comprising: (i) capturing, via a plurality of microphones of a network device, (a) a first audio signal via a first microphone of the plurality of microphones, and (b) a second audio signal via a second microphone of the plurality of microphones, wherein the first audio signal includes first noise content from a noise source and the second audio signal includes second noise content from the noise source; (ii) identifying a first noise content in a first audio signal; (iii) determining an estimated noise content captured by a plurality of microphones using the identified first noise content; (iv) using the estimated noise content to suppress a first noise content in the first audio signal and a second noise content in the second audio signal; (v) combining the suppressed first audio signal and the suppressed second audio signal into a third audio signal; (vi) determining that the third audio signal comprises a voice input, the voice input comprising a wake-up word; and (vii) in response to the determination, sending at least a portion of the voice input to a remote computing device for voice processing to recognize a different voice utterance than the wake word.

(feature 9) the tangible, non-transitory computer-readable medium of feature 8, the operations further comprising: (i) determining a probability that the first audio signal comprises speech content, (ii) wherein the following steps are performed based on the determined probability being below a threshold probability: (a) identifying first noise content in the first audio signal, and (b) using the identified first noise content to determine estimated noise content captured by the plurality of microphones.

(feature 10) the tangible, non-transitory computer-readable medium of feature 8, wherein the network device includes a housing at least partially enclosing components of the network device within the housing, and wherein the first and second microphones are arranged along the housing and separated from each other by a distance greater than about five centimeters.

(feature 11) the tangible, non-transitory computer-readable medium of feature 8, the operations further comprising: (i) capturing a fourth audio signal via a third microphone of the plurality of microphones, wherein the fourth audio signal includes third noise content from the noise source; (ii) identifying a third noise content in the fourth audio signal; and (iii) updating the estimated noise content captured by the plurality of microphones using the identified third noise content.

(feature 12) the tangible, non-transitory computer-readable medium of feature 11, wherein the fourth audio signal is captured simultaneously with the first audio signal and the second audio signal.

(feature 13) the tangible, non-transitory computer-readable medium of feature 11, wherein the network device includes a housing at least partially enclosing components of the network device within the housing, wherein the first, second, and third microphones are arranged along the housing and separated from each other by a distance greater than about five centimeters.

(feature 14) the tangible, non-transitory computer-readable medium of feature 8, wherein sending at least a portion of the voice input to a remote computing device for voice processing to recognize a voice utterance different from the wake word comprises: sending a portion of the voice input after the wake word to a separate computing system for voice analysis.

(feature 15) a method comprising: (i) capturing, via a plurality of microphones of a network device, (a) a first audio signal via a first microphone of the plurality of microphones, and (b) a second audio signal via a second microphone of the plurality of microphones, wherein the first audio signal includes first noise content from a noise source and the second audio signal includes second noise content from the noise source; (ii) identifying the first noise content in the first audio signal; (iii) determining estimated noise content captured by the plurality of microphones using the identified first noise content; (iv) using the estimated noise content to suppress the first noise content in the first audio signal and the second noise content in the second audio signal; (v) combining the suppressed first audio signal and the suppressed second audio signal into a third audio signal; (vi) determining that the third audio signal includes a voice input, the voice input including a wake word; and (vii) in response to the determination, sending at least a portion of the voice input to a remote computing device for voice processing to recognize a voice utterance different from the wake word.

(feature 16) the method of feature 15, further comprising: determining a probability that the first audio signal includes speech content, wherein (a) identifying the first noise content in the first audio signal and (b) using the identified first noise content to determine the estimated noise content captured by the plurality of microphones are performed based on the determined probability being below a threshold probability.

(feature 17) the method of feature 15, wherein the network device includes a housing at least partially enclosing components of the network device within the housing, and wherein the first and second microphones are arranged along the housing and separated from each other by a distance greater than about five centimeters.

(feature 18) the method of feature 15, further comprising: (i) capturing a fourth audio signal via a third microphone of the plurality of microphones, wherein the fourth audio signal includes third noise content from the noise source; (ii) identifying the third noise content in the fourth audio signal; and (iii) updating the estimated noise content captured by the plurality of microphones using the identified third noise content.

(feature 19) the method of feature 18, wherein the fourth audio signal is captured simultaneously with the first audio signal and the second audio signal.

(feature 20) the method of feature 18, wherein the network device includes a housing at least partially enclosing components of the network device within the housing, wherein the first, second, and third microphones are arranged along the housing and separated from each other by a distance greater than about five centimeters.

Furthermore, references herein to "an embodiment" mean that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one example embodiment of the invention. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Accordingly, those skilled in the art will appreciate, both explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.

The description is presented primarily in terms of illustrative environments, systems, processes, steps, logic blocks, processing, and other symbolic representations that are directly or indirectly analogous to the operation of data processing devices coupled to a network. These process descriptions and representations are generally used by those skilled in the art to convey the substance of their work to others skilled in the art. Numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. However, it will be understood by those skilled in the art that the present disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments. For example, in some embodiments, other techniques may be employed to determine the probability that an audio signal includes speech content. Accordingly, the scope of the disclosure is defined by the appended claims rather than by the description of the embodiments above.

When any of the appended claims is read to cover a purely software and/or firmware implementation, at least one element in at least one example is hereby expressly defined to include a non-transitory, tangible medium such as a memory, a DVD, a CD, a Blu-ray disc, and so on, storing the software and/or firmware.
