Adaptive energy limiting for transient noise suppression

Document No.: 1804303    Publication Date: 2021-11-05

Note: This technology, Adaptive energy limiting for transient noise suppression, was designed and created by 约翰·弗雷德里克·林德斯特伦 and 卡尔·塞缪尔·索宁 on 2020-10-13. The present disclosure describes aspects of adaptive energy limiting for transient noise suppression. In some aspects, an adaptive energy limiter sets a limiter upper limit for an audio signal to full scale and receives a portion of the audio signal. For the portion of the audio signal, the adaptive energy limiter determines a maximum amplitude and evaluates the portion with a neural network to provide a speech likelihood estimate. Based on the maximum amplitude and the speech likelihood estimate, the adaptive energy limiter determines that the portion of the audio signal includes noise. In response to determining that the portion of the audio signal includes noise, the adaptive energy limiter lowers the limiter upper limit and provides the limiter upper limit to a limiter module to effectively limit the amount of energy of the audio signal. This can prevent the audio signal from carrying full-energy transient noise into the conference audio.

1. A method, comprising:

setting a limiter upper limit of an audio signal to a full scale;

receiving a portion of the audio signal;

determining a maximum amplitude of the portion of the audio signal;

evaluating the portion of the audio signal with a neural network to provide a speech likelihood estimate of the portion of the audio signal;

determining that the portion of the audio signal includes noise based on the maximum amplitude and the speech likelihood estimate;

in response to determining that the portion of the audio signal includes noise, lowering the limiter upper limit; and

providing the limiter upper limit to a limiter module through which the audio signal is to pass to limit an amount of energy of the audio signal.

2. The method of claim 1, wherein the portion of the audio signal comprises a frame of audio, and the method further comprises:

converting the frame of audio from a time domain to a frequency domain prior to evaluating the frame of audio.

3. The method of claim 2, wherein the frame of audio is a frame of first audio, and the method further comprises:

receiving a frame of second audio corresponding to a second portion of the audio signal;

evaluating the frame of second audio with the neural network to provide a respective speech likelihood estimate for the frame of second audio;

determining that the frame of second audio includes speech based on the respective speech likelihood estimate; and

resetting the limiter upper limit to the full scale.

4. The method of claim 2, wherein the frame of audio is a frame of first audio, and the method further comprises:

receiving a frame of second audio corresponding to a second portion of the audio signal;

determining a respective maximum amplitude of the frame of second audio;

comparing the respective maximum amplitude of the frame of second audio to a threshold value corresponding to an average of respective maximum amplitudes of a plurality of frames of audio corresponding to a plurality of respective portions of the audio signal; and

maintaining the limiter upper limit at a current level in response to the respective maximum amplitude of the frame of second audio not exceeding the threshold value.

5. The method of any of claims 2 to 4, wherein the frame of audio corresponds to an audio duration ranging from about 10 milliseconds to about 50 milliseconds of the audio.

6. The method of any of claims 1-5, wherein evaluating the portion of the audio signal with the neural network to provide the speech likelihood estimate comprises: analyzing the portion of the audio signal with a neural network-enabled Voice Activity Detector (VAD) to provide an Instantaneous Voice Likelihood (IVL) of the portion of the audio signal.

7. The method of claim 6, wherein the limiter upper limit is lowered by a predefined amount, and the method further comprises:

determining an aggregate voice likelihood estimate (ASLE) based on a plurality of IVLs provided by the neural network-enabled VAD;

updating the ASLE based on the IVL by:

increasing the ASLE in response to the IVL exceeding the ASLE and exceeding a speech detection threshold; or

decreasing the ASLE in response to the IVL not exceeding the ASLE or not exceeding the speech detection threshold; and

setting the predefined amount by which the limiter upper limit is lowered based on the ASLE.

8. The method of claim 7, wherein the limiter upper limit has a minimum value, and the method further comprises: configuring the minimum value of the limiter upper limit based on the ASLE.

9. The method of claim 8, further comprising configuring the minimum value of the limiter upper limit based on the ASLE and one of:

an average of respective amplitudes of a plurality of portions of the audio signal; or

an average of respective maximum amplitudes of a plurality of portions of the audio signal.

10. An apparatus, comprising:

a network interface for receiving or transmitting audio signals over a data network;

a limiter module to limit energy of the audio signal;

a hardware-based processor associated with the data interface; and

a storage medium storing processor-executable instructions that, in response to execution by the hardware-based processor, implement an adaptive energy limiter to:

setting a limiter upper limit of the audio signal to a full scale;

providing frames of audio from the audio signal corresponding to a duration of audio from the audio signal;

determining a maximum amplitude of the audio signal for a frame of the audio;

evaluating the frames of audio with a neural network to provide speech likelihood estimates for the frames of audio;

determining that a frame of the audio includes noise based on the maximum amplitude and the speech likelihood estimate;

in response to determining that the frame of audio includes noise, lowering the limiter upper limit; and

providing the limiter upper limit to the limiter module to reduce the energy of the audio signal.

11. The apparatus of claim 10, wherein the adaptive energy limiter is further implemented to:

capturing a frame of audio as a portion of the audio signal; and

converting the frame of audio from a time domain to a frequency domain for evaluation by the neural network.

12. The apparatus of claim 11, wherein the frame of audio is a frame of first audio, and the adaptive energy limiter is further implemented to:

capturing a frame of second audio corresponding to a second portion of the audio signal;

converting the frame of second audio from the time domain to the frequency domain;

evaluating the frame of second audio with the neural network to provide a respective speech likelihood estimate for the frame of second audio;

determining that the frame of second audio includes speech based on the respective speech likelihood estimate; and

resetting the limiter upper limit to the full scale.

13. The apparatus of claim 11, wherein the frame of audio is a frame of first audio, and the adaptive energy limiter is further implemented to:

capturing a frame of second audio corresponding to a second portion of the audio signal;

determining a respective maximum amplitude of the frame of second audio;

comparing the respective maximum amplitude of the frame of second audio to a threshold value corresponding to an average of respective maximum amplitudes of a plurality of frames of audio corresponding to a plurality of respective portions of the audio signal; and

maintaining the limiter upper limit at a current level in response to the respective maximum amplitude of the frame of second audio not exceeding the threshold value.

14. The apparatus of any of claims 11 to 13, wherein the frames of audio correspond to a duration of audio information from the audio signal ranging from about 5 milliseconds to about 50 milliseconds of audio information.

15. The apparatus of any of claims 10 to 14, wherein the neural network comprises a voice activity detector (VAD), and the adaptive energy limiter is further implemented to provide the speech likelihood estimate as an instantaneous voice likelihood (IVL) of the portion of the audio signal using the VAD of the neural network.

16. The apparatus of claim 15, wherein the adaptive energy limiter lowers the limiter upper limit by a predefined amount, and the adaptive energy limiter is further implemented to:

determining an aggregate voice likelihood estimate (ASLE) based on a plurality of IVLs provided by the VAD of the neural network;

updating the ASLE based on the IVL by:

increasing the ASLE in response to the IVL exceeding the ASLE and exceeding a speech detection threshold; or

decreasing the ASLE in response to the IVL not exceeding the ASLE or not exceeding the speech detection threshold; and

setting the predefined amount by which the limiter upper limit is lowered based on the ASLE.

17. The apparatus of claim 16, wherein the limiter upper limit has a minimum value, and the adaptive energy limiter is further implemented to configure the minimum value of the limiter upper limit based on the ASLE.

18. The apparatus of claim 17, wherein the adaptive energy limiter is further implemented to configure the minimum value of the limiter upper limit based on the ASLE and one of:

an average of respective amplitudes of a plurality of portions of the audio signal; or

an average of respective maximum amplitudes of a plurality of portions of the audio signal.

19. A system, comprising:

a hardware-based processor operably associated with an audio interface or a data interface through which audio signals are received; and

a storage medium storing processor-executable instructions that, in response to execution by the hardware-based processor, implement an adaptive energy limiter for:

setting a limiter upper limit of the audio signal to a full scale;

generating, based on the audio signal, a frame of audio corresponding to a duration of audio from the audio signal;

determining a maximum amplitude of the audio signal for the frame of audio;

evaluating the frame of audio with a neural network to provide a speech likelihood estimate for the frame of audio;

determining that the frame of audio includes noise based on the maximum amplitude and the speech likelihood estimate;

in response to determining that the frame of audio includes noise, lowering the limiter upper limit; and

providing the limiter upper limit to a limiter module to reduce the energy of the audio signal.

20. The system of claim 19, wherein the system is implemented as one of: an audio conferencing system, a video conferencing system, an application specific integrated circuit, an application specific standard product, a system on a chip, a system in package, a complex programmable logic device, an audio codec or an audio processor.

Background

An audio conference or video conference typically includes a number of participants, with one or a few of the participants actively speaking at any given time. When not speaking, the other participants often generate noise that may be picked up by their microphones and fed into the conference audio for all participants to hear. Examples of noise generated by conference participants may include typing on a keyboard, placing a coffee cup on a table, flipping paper, moving a chair, closing a door, and so forth. Some of these noises have transient characteristics; unlike stationary or recurring noises, noises with transient characteristics cannot be suppressed by conventional noise reduction techniques. Furthermore, the audio energy of transient noise is typically as high as, or higher than, the energy level associated with the speech of a conference participant. Thus, these transient noises are typically fed into the conference audio as unsuppressed energy, resulting in noise that can disturb speakers and listeners, drown out a speaker's voice, trigger residual echo suppression, falsely trigger audio or video switching schemes, and the like.

Disclosure of Invention

This disclosure describes apparatus and techniques for adaptive energy limiting for transient noise suppression. In some aspects, a method for adaptive energy limiting includes setting a limiter upper limit of an audio signal to a full scale and receiving a portion of the audio signal. The method then determines a maximum amplitude of the portion of the audio signal and evaluates the portion of the audio signal with a neural network to provide a speech likelihood estimate for the portion of the audio signal. Based on the maximum amplitude and the speech likelihood estimate, the method determines that the portion of the audio signal includes noise. In response to determining that the portion of the audio signal includes noise, the method lowers the limiter upper limit. The limiter upper limit is then provided to a limiter module through which the audio signal is to pass to limit the amount of energy of the audio signal. By doing so, the audio signal can be prevented from carrying full-energy transient noise into the conference audio or subsequent audio processes (e.g., speaker selection for video conferencing).

In other aspects, an apparatus includes a network interface to receive or transmit an audio signal over a data network and a limiter module to limit energy of the audio signal. The apparatus also includes a hardware-based processor associated with the data interface and a storage medium storing processor-executable instructions for the adaptive energy limiter. The adaptive energy limiter is implemented to set a limiter upper limit of the audio signal to a full scale and to provide frames of audio from the audio signal corresponding to an audio duration from the audio signal. The adaptive energy limiter then determines a maximum amplitude of the audio signal for the frame of audio and evaluates the frame of audio with a neural network to provide a speech likelihood estimate for the frame of audio. Based on the maximum amplitude and the speech likelihood estimate, the adaptive energy limiter determines that the frame of audio includes noise. The adaptive energy limiter then reduces the limiter upper limit in response to a determination that the frame of audio includes noise and provides the limiter upper limit to the limiter module to reduce the energy of the audio signal.

In other aspects, a system includes a hardware-based processor operatively associated with an audio interface or a data interface, and a storage medium storing processor-executable instructions for an adaptive energy limiter, wherein an audio signal is received through the audio interface or the data interface. The adaptive energy limiter is implemented to set a limiter upper limit of the audio signal to a full scale and generate frames of audio corresponding to an audio duration from the audio signal based on the audio signal. The adaptive energy limiter then determines a maximum amplitude of the audio signal for the frame of audio and evaluates the frame of audio with a neural network to provide a speech likelihood estimate for the frame of audio. Based on the maximum amplitude and the speech likelihood estimate, the adaptive energy limiter determines that the frame of audio includes noise. The adaptive energy limiter then reduces the limiter upper limit in response to a determination that the frame of audio includes noise and provides the limiter upper limit to the limiter module to reduce the energy of the audio signal.

The details of one or more implementations of adaptive energy limiting for transient noise suppression are set forth in the accompanying drawings and the description below. Other features and advantages will be apparent from the description and drawings, and from the claims. This summary is provided to introduce a selection of subject matter that is further described in the detailed description and the accompanying drawings. Accordingly, this summary should not be considered to describe essential features nor should it be taken as limiting the scope of the claimed subject matter.

Drawings

This specification describes apparatus and techniques for adaptive energy limiting for transient noise suppression with reference to the following drawings. Throughout the drawings, the same reference numerals are used to refer to the same features and components:

FIG. 1 illustrates an example conferencing environment in which aspects of adaptive energy limiting for transient noise suppression may be implemented.

Fig. 2 illustrates an example device diagram of a user device and a conference device including respective instances of adaptive energy limiters in accordance with one or more aspects.

FIG. 3 illustrates an example configuration of components capable of implementing various aspects of adaptive energy limiting.

Fig. 4 illustrates an example method for adaptively limiting energy of an audio signal in accordance with one or more aspects.

Fig. 5A and 5B illustrate an example method of scaling an audio signal based on the instantaneous voice likelihood provided by a neural network-enabled voice activity detector.

FIG. 6 illustrates an example graph of limiting the energy of an audio signal in accordance with one or more aspects.

Fig. 7 illustrates a system diagram of components for implementing adaptive energy limiting for transient noise suppression in accordance with one or more aspects.

Detailed Description

Overview

An audio conference or video conference typically includes a number of participants, with one or a few of the participants actively speaking at any given time. When not speaking, the other participants often generate noise that may be picked up by their microphones and fed into the conference audio for all participants to hear. Examples of noise generated by conference participants may include typing on a keyboard, placing a coffee cup on a table, flipping paper, moving a chair, closing a door, and so forth. Some of these noises have transient characteristics; unlike stationary or recurring noises, noises with transient characteristics cannot be suppressed by conventional noise reduction techniques. Furthermore, the audio energy of transient noise is typically as high as, or higher than, the energy level associated with the speech of a conference participant. Thus, these transient noises are typically fed into the conference audio as raw, unsuppressed energy, resulting in noise that can disturb speakers and listeners, drown out the speaker's voice, trigger residual echo suppression, falsely trigger audio or video switching schemes, and so forth.

Because conventional noise reduction techniques cannot mitigate transient noise, conference call participants face a number of negative consequences. Typically, the unchecked noise passes to the other end of the call, interfering with the speaker and other listeners. When this unsuppressed noise is passed to the current speaker in the call, it may also trigger residual echo suppression, which attenuates (dampens) the speaker's voice, or it may affect back-end speaker selection schemes such as top-3 filtering (e.g., passing the respective audio of the three call participants with the greatest energy). In addition, the conferencing system may erroneously prioritize noisy participants over actively speaking participants, or disrupt the video switching scheme by switching the video feed from the speaker to a noisy participant.

Some conventional techniques involve having participants who are not currently speaking manually mute their respective microphones. However, muting solutions are undesirable and inconvenient because they can lead to unnatural conversational flow and often cause problems when a participant forgets to unmute the microphone before speaking. Manually muting a microphone is particularly frustrating in large conference rooms where many participants take turns speaking, as muting and unmuting can occur very frequently. For example, whenever a person wants to respond to another participant, the person needs to reach for a button on the remote control or device to unmute the microphone and then remember to mute again later. Thus, manual muting, which relies on timely manual interaction from all participants, is inconvenient and generally ineffective in suppressing transient noise.

Other conventional techniques also typically fail to prevent transient noise from entering the conference audio, or do so only at the expense of other impairments to call flow or quality. For example, some phones include a noise gate that automatically mutes the microphone unless there is strong energy in the audio stream. However, these noise gates can result in audio of unstable quality and often still allow high-energy noise to be transmitted into the conference audio. Other noise reduction techniques apply only to stationary or slightly non-stationary noise (e.g., fans, traffic, background babble) and do not apply to bursty, non-constant, high-energy transient noise. In other cases, keyboard-noise suppression predicts when keystroke sounds will occur and selectively suppresses those sounds; this suppression is limited to typing on the same laptop that hosts the conference and applies only to keyboard noise. Thus, conventional noise suppression techniques for conference calls cannot suppress or limit transient noise, which often interferes with call flow and quality.

This document describes apparatus and techniques for adaptive energy limiting for transient noise suppression. As described above, participants in a conference call may generate transient noise that, when allowed into the conference audio, typically disturbs the speaker and other participants. Transient noise may also interfere with or degrade conference service processes for audio and video features, such as selection of the audio stream or video stream (e.g., of the active speaker) presented to other participants. In general, aspects of adaptive energy limiting manage or control the maximum energy level that a participant is allowed to contribute based on the participant's history of recently generated noise or speech. In various aspects, an adaptive energy limiter of a user device or conferencing system sets a limiter upper limit of an audio signal to full scale and receives a portion of the audio signal. For the portion of the audio signal, the adaptive energy limiter determines a maximum amplitude and evaluates the portion with a neural network to provide a speech likelihood estimate. Based on the maximum amplitude and the speech likelihood estimate, the adaptive energy limiter determines that the portion of the audio signal includes noise. In response to determining that the portion of the audio signal includes noise, the adaptive energy limiter reduces the limiter upper limit and provides the limiter upper limit to the limiter module to effectively limit the amount of energy of the audio signal. By doing so, the adaptive energy limiter may prevent the audio signal from carrying full-energy transient noise into the conference audio or subsequent audio processes (e.g., speaker selection for a video conference).

For example, if the participant makes noise, the adaptive energy limiter will gradually lower the upper limit on the energy allowed to pass through. Typically, this will result in future sudden noises generated by the participant becoming less intrusive and more easily ignored by other conference service algorithms, such as speaker selection for video conferencing. In some aspects, after about 10 to 15 seconds of moderate- or high-energy noise, the upper limit on audio energy is reduced to a minimum level, after which the audio energy (e.g., noise energy) from the participant will be very limited. When the participant does begin speaking, the adaptive energy limiter may reset the upper limit on audio energy to a maximum level (e.g., voice level or full scale) to allow the participant's voice audio to pass to other conference participants. The adaptive energy limiter does so quickly, so that the transient noise suppression provided by the adaptive energy limiter has little detrimental effect on the voice audio of the conference call. Alternatively or additionally, if the participant is silent, quiet, or emits low-energy background noise, the adaptive energy limiter may maintain or keep the upper limit of the audio energy high so as not to affect the voice audio when the participant begins to speak.
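
As a rough sketch of this timing behavior, a per-frame decrement sized so that roughly 10 to 15 seconds of sustained noise drives the upper limit to its minimum, together with an immediate reset on detected speech, might look like the following (the frame size, decay period, and minimum value are assumed numbers chosen only for illustration):

    # Illustrative sketch only; the constants are assumptions, not specified values.
    FRAME_SECONDS = 0.010      # 10 ms audio frames
    DECAY_SECONDS = 12.0       # ~10-15 s of sustained noise reaches the minimum ceiling
    MIN_CEILING = 0.05         # assumed minimum limiter upper limit (full scale = 1.0)

    FRAMES_TO_MINIMUM = DECAY_SECONDS / FRAME_SECONDS            # 1200 frames
    DECREMENT_PER_FRAME = (1.0 - MIN_CEILING) / FRAMES_TO_MINIMUM

    def update_ceiling(ceiling, frame_is_noise, frame_is_speech):
        """Lower the ceiling gradually on noise; restore full scale quickly on speech."""
        if frame_is_speech:
            return 1.0                                           # reset to full scale
        if frame_is_noise:
            return max(MIN_CEILING, ceiling - DECREMENT_PER_FRAME)
        return ceiling                                           # silence: leave unchanged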

Generally, aspects of adaptive energy limiting for transient noise suppression limit the energy of transient noise without compromising the voice audio quality of a conference call or voice call. For example, by using the long-term statistical nature of noise and/or speech in the context of an audio or video conferencing scenario, the adaptive energy limiter can significantly reduce the amount or impact of transient noise while minimally impacting speech. In other words, the adaptive energy limiter does not attempt to remove noise from concurrent noise and speech, which is a typical failure mode of conventional noise reduction techniques when attempting to remove noise (particularly noise that may be confused with speech).

In various aspects of adaptive energy limiting, the amplitude of the audio signal is measured over a period of time and, along with other statistical properties described herein, is used to configure the limiter upper limit on the audio energy to prevent or suppress transient noise from entering the conference call. In some cases, a neural network is implemented to provide statistical properties of the audio signal. According to various aspects, a small neural network has sufficient accuracy for such tasks, such that no special acceleration hardware is required and voice quality is not affected by the accuracy limitations of the neural network or an associated Voice Activity Detector (VAD). Alternatively or additionally, an adaptive energy manager may be implemented to adjust or manage the gain or subband gains of an audio signal based on the audio signal evaluation described herein.

In this way, various aspects of energy limiting (or energy management) may be implemented to limit or reduce the amount of energy that an audio signal can carry into a conference service, or out to conference call participants, during a conference call. In other words, for each participant, the adaptive energy limiter may track a noise liability that accumulates as the participant continues to make noise. As the noise liability accumulates (and the energy limit decreases), the adaptive energy limiter prevents the participant from sending large amounts of energy into the call until the participant demonstrates that they are sending voice (e.g., by sending a statistically significant amount of voice audio). The adaptive energy limiter may also effectively suppress transient noise by using (e.g., via a neural network) the statistical energy differences between transient noise (e.g., high energy), vowels (e.g., medium energy), and consonants (e.g., low energy), allowing speech (e.g., consonants) to pass through perceptually unaffected even when transient noise is reduced by 20 dB or more. Aspects of adaptive energy limiting may achieve this effect by using a limiter upper limit on the audio signal energy and/or by managing the subband gains used to process the participants' audio signals.
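
For reference, a 20 dB reduction corresponds to scaling the noise amplitude by a factor of 10^(-20/20) = 0.1 (one tenth of the original amplitude), or the noise energy by a factor of 10^(-20/10) = 0.01, while consonant energy that already lies below the limiter upper limit passes through unchanged.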

Although any number of different environments, systems, devices, and/or various configurations may implement the features and concepts of the described techniques and apparatus for adaptive energy limiting for transient noise suppression, aspects of adaptive energy limiting for transient noise suppression are described in the context of the following example environments, devices, configurations, methods, and systems.

Example Environment

Fig. 1 illustrates an example environment 100 in which aspects of adaptive energy limiting for transient noise suppression may be implemented. In the example environment 100, the user devices 102 may communicate audio and/or video through a conferencing system 104, where access to the system is provided by a conferencing service 106 (e.g., a cloud-based conference or conferencing service). The user devices 102 in this example include a smartphone 102-1, a laptop computer 102-2, a tablet computer 102-3, a smart watch 102-4, a phone 102-5, a conference bridge 102-6, and a video conference display 102-7. Although shown as particular devices, the user devices 102 may be implemented as any suitable computing or electronic device, such as a mobile communication device, a computing device, a client device, an entertainment device, a gaming device, a mobile gaming console, a personal media device, a media playback device, a charging station, an Advanced Driver Assistance System (ADAS), a point of sale (POS) transaction system, a health monitoring device, a drone, a camera, a wearable smart device, a navigation device, a Mobile Internet Device (MID), an internet home appliance with wireless internet access and browsing capabilities, an internet of things (IoT) device, fifth generation new radio (5G NR) user equipment, and/or other types of user devices.

In general, respective users of the user devices 102 may interact with other users through audio and/or video data exchanged over a data or voice connection of the conferencing service 106. In some aspects, each user device 102 participating in a conference call instance facilitated by the conference service 106 provides audio signals 108 and/or video signals over a respective connection to the conference service. For example, any or all of the user devices 102 may provide channels of audio signals 108 (or audio data) corresponding to audio captured by a microphone of the device. During a conference call, participants typically take turns speaking, while other inactive or non-speaking participants listen or watch. However, some of these participants may choose to move a chair, write an email, or take notes on a computer. Such movement and typing activity may generate transient noise, which may include sounds or sound waves having short, pulse-like signal characteristics. Other potential sources of transient noise may include click noise from a computer mouse, moving an item on a table or work surface, closing a door, a telephone key press or telephone ring, and so forth. For example, if two participants, each at a respective endpoint of a conference or voice call, are located close to each other in an open office, one using a smartphone 102-1 and the other using a laptop computer 102-2, transient noise may be generated at both endpoints when the participant using the laptop computer 102-2 begins typing.

In terms of adaptive energy limiting for transient noise suppression, the conference service 106 includes an instance of adaptive energy limiter 110 (adaptive limiter 110) that may limit or manage the energy of an audio signal to suppress various forms of transient noise. Although illustrated with reference to the conferencing service 106, any or all of the user devices 102 may also include an instance of the adaptive energy limiter 110. Thus, the adaptive energy limiter 110 may limit or manage the energy of audio signals sent to the conference service 106, processed by the conference service 106, or sent by the conference service to other user devices 102. The adaptive energy limiter 110 is associated with or has access to a neural network 112, which may be implemented as a Recurrent Neural Network (RNN). In this example, the neural network 112 includes a voice activity detector 114 (VAD 114) that may be configured to provide an indication of the voice likelihood of an audio signal or frame of audio. For example, the adaptive energy limiter 110 may use the voice activity detector 114 to obtain an indication of the speech likelihood of a frame of audio. Such an indication may be useful for determining whether an audio signal or frame of audio is more likely to be speech or noise. Alternatively or additionally, the voice activity detector 114 may be implemented as a neural network-enabled voice activity detector that uses a neural network to determine or provide a voice likelihood measure of a sample of an audio signal or audio frame.

Fig. 2 illustrates, generally at 200, an example device diagram of a user device 102 and a conference device 202 that may provide the conferencing service 106. Although each device is shown with an instance of the adaptive energy limiter, aspects of adaptive energy limiting may be implemented on one device, on two devices, or cooperatively between devices. For example, the adaptive energy limiter 110 of the user device 102 may interact with the adaptive energy limiter 110 of the conference device 202 or the neural network 112 to set a limiter upper limit value at the user device 102. As shown in this example configuration, the user device 102 or the conference device 202 may also include additional functionality, components, or interfaces that have been omitted from fig. 2 for clarity or visual brevity. Alternatively or additionally, any of the respective components of the user device 102 or the conference device 202 may be implemented, in whole or in part, as hardware logic or circuitry integrated with or separate from the other components.

In this example, the user device 102 includes a network interface 204 for exchanging data, such as audio signals or video streams, over various types of networks or communication protocols. In general, network interface 204 may be implemented as any one or more of a serial and/or parallel interface, a wireless interface, a wired interface, or a modem to transmit or receive data or signals. In some cases, the network interface 204 provides a connection and/or communication link between the user device 102 and a communication network by which the other user devices 102 and the conference device 202 communicate audio signals 108, video data, etc. for conference media communications. The user device 102 also includes at least one microphone 206 for capturing audio (e.g., speech, sound, or noise) from the environment of the user device 102, and at least one speaker 208 for generating audio or sound based on audio data of the user device 102. In some aspects, the microphone captures audio generated by the user, e.g., speech, and provides the audio signal to audio circuitry (not shown) of the user device 102 for encoding or other signal processing.

The user device 102 also includes a processor 210 and a computer-readable storage medium 212 (CRM 212). The processor 210 may be a single-core processor or a multi-core processor composed of a variety of materials, such as silicon, polysilicon, high-dielectric-constant dielectric, and the like. The computer-readable storage medium 212 is configured as a memory and therefore does not include a transitory signal or carrier wave. The CRM 212 may include any suitable memory or storage device, such as Random Access Memory (RAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), non-volatile random access memory (NVRAM), read-only memory (ROM), or flash memory, that can be used to store device data 214 for the user device 102.

Device data 214 may include user data, multimedia data (e.g., audio data or video data), applications 216 (e.g., media conferencing client applications), user interfaces 218, and/or an operating system of the user device 102 that are accessible or executable by the processor 210 to enable audio or video conferencing and/or interaction with other users of user devices 102. The user interface 218 may be configured to receive input from a user of the user device 102 that may define and/or facilitate one or more aspects of adaptive energy limiting for transient noise suppression. The user interface 218 may include a Graphical User Interface (GUI) that receives input information via touch input. In other cases, the user interface 218 includes an intelligent assistant that receives input information via audible input. Alternatively or additionally, the operating system of the user device 102 may be maintained as firmware or an application on the CRM 212 and executed by the processor 210.

The CRM 212 also includes an adaptive energy limiter 110, a neural network 112, and a voice activity detector 114. In various aspects, the adaptive energy limiter 110 uses the neural network 112 and/or the voice activity detector 114 (VAD 114) to determine whether the audio signal includes voice or noise. Based on this determination, the adaptive energy limiter 110 may lower the limiter upper limit to limit the energy of noise that would otherwise disturb the conference call or voice call if the noise were allowed to pass at full energy. The implementation and use of the adaptive energy limiter 110, the neural network 112, and/or the voice activity detector 114 vary and are described throughout the disclosure.

Aspects and functionality of the user device 102 may be managed via operating system controls presented through at least one application programming interface 220 (API 220). In some aspects, the adaptive energy limiter 110 or an application of the user device 102 accesses the API 220 or an API service of the user device 102 to control aspects and functions of the audio or video conferencing application. For example, the adaptive energy limiter 110 may access low-level audio processor settings of the user device 102 to implement aspects of adaptive energy limiting, such as setting a minimum limiter upper limit level, adjusting audio gain settings, managing respective signal levels of incoming and outgoing audio signals, and so forth. The CRM 212 of the user device 102 may also include a user device manager 222, which may be implemented in whole or in part as hardware logic or circuitry integrated with or separate from other components of the user device 102. In at least some aspects, the user device manager 222 configures the microphone 206 and other audio circuitry of the user device 102 to implement the transient noise suppression techniques described herein.

User device 102 also includes a display 224 for displaying and/or providing information or video feeds to a user. For example, through display 224, user device 102 may provide a video feed to the user from a video conference enabled by conference service 106. Alternatively or additionally, the user device 102 may also include a camera (not shown) to enable generation of a video feed for the multimedia conference from the user device 102.

The conferencing device 202 may be implemented as a computing device, server, cloud-based hardware, or other resource used to provide the conferencing service 106 to the user devices 102. In general, the conferencing device 202 may act as a collector and/or arbitrator of multimedia data or streams for a conference call instance. In this way, the conferencing device 202 can implement aspects of adaptive energy limiting with respect to inbound audio data received from the user devices 102, internal multimedia processing operations, or outbound audio data sent to the user devices 102 as part of a conference or voice call.

In this example, conferencing device 202 includes a network interface 226 for exchanging data, such as audio signals or video streams, over various types of networks or communication protocols. In general, the network interface 226 may be implemented as any one or more of a serial and/or parallel interface, a wireless interface, a wired interface, or a modem to transmit or receive data or signals. In some cases, network interface 226 provides a connection and/or communication link between conferencing device 202 and a communication network through which user device 102 communicates audio signals 108, video data, and the like for conferencing media communications.

In this example, the conferencing device 202 also includes a processor 228 or computing resource, and a computer-readable storage medium 230 (CRM 230). The computer-readable storage medium 230 is configured as a memory and thus does not include a transitory signal or carrier wave. The CRM 230 may include any suitable memory or storage device, such as RAM, SRAM, DRAM, NVRAM, ROM, or flash memory, that can be used to store multimedia data 232 for the conferencing device 202.

The multimedia data 232 of the conferencing device 202 may include audio data, audio signals, or video data that facilitate the conference call through the instance of the conferencing service 106. Multimedia data 232 and conferencing service 106, as well as other applications (e.g., media conferencing server applications) and/or operating systems of conferencing device 202, can be accessed or executed by processor 228 to enable audio or video conferencing with multiple user devices 102.

In this example, CRM 230 also includes instances of adaptive energy limiter 110, neural network 112, and voice activity detector 114. As mentioned above, aspects of the adaptive energy limitation may be implemented by the user device 102, the conference device 202, or a combination of the two devices. In various aspects, the adaptive energy limiter 110 uses the neural network 112 and/or the voice activity detector 114 to determine whether one or more audio signals include speech or noise. Based on this determination, the adaptive energy limiter 110 of the conference device 202 may lower the limiter ceiling of the respective audio signal or audio feed to limit the noise energy that would otherwise disturb the conference call or voice call if the noise were allowed to pass at full energy. The implementation and use of the adaptive energy limiter 110, the neural network 112, and/or the voice activity detector 114 vary and are described throughout the disclosure.

Aspects and functionality of conferencing device 202 may be managed via system controls presented through at least one Application Programming Interface (API) of API library 234. In some aspects, an application of adaptive energy limiter 110 or conferencing device 202 accesses an API or library of API library 234 to implement aspects of transient noise limitation. For example, the adaptive energy limiter 110 may be implemented as part of or in conjunction with a network-based real-time communication library.

FIG. 3 illustrates, generally at 300, an example configuration of components capable of implementing various aspects of adaptive energy limiting. In general, the components of FIG. 3 may be embodied on the user device 102, the conference device 202, or a combination thereof. In some aspects, the components shown at 300 are implemented as integrated components of a device (e.g., a system on a chip) and/or in conjunction with a memory storing processor-executable instructions to provide respective functionality of one or more components. Thus, the configuration of components shown in FIG. 3 is non-limiting and may be implemented in any suitable device, combination of devices, and/or as hardware (e.g., logic circuitry) in combination with firmware or software to provide the described functionality.

In some aspects, the audio signal 108 is segmented or divided into audio frames 302 corresponding to respective portions of the audio signal. For example, each of the audio frames 302 may correspond to a portion, segment, or duration of audio (e.g., speech and/or noise) of the audio signal 108. In some cases, audio frame 302 or frame of audio corresponds to a range of approximately 5 milliseconds to 50 milliseconds of audio (e.g., 10 milliseconds of audio). Alternatively or additionally, the audio frame 302 may be converted from the time domain to the frequency domain, e.g., to enable spectral analysis or other frequency-domain-based processing.
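
By way of illustration only, the framing described above can be sketched as follows, assuming 16 kHz single-channel audio and 10-millisecond frames (assumed values; the description allows frames of approximately 5 to 50 milliseconds):

    import numpy as np

    SAMPLE_RATE = 16000                      # assumed sample rate
    FRAME_SAMPLES = SAMPLE_RATE // 100       # 10 ms frames -> 160 samples per frame

    def frames_from_signal(signal):
        """Split a 1-D audio signal into consecutive, non-overlapping 10 ms frames."""
        usable = len(signal) - (len(signal) % FRAME_SAMPLES)
        return signal[:usable].reshape(-1, FRAME_SAMPLES)

    def to_frequency_domain(frame):
        """Convert one time-domain frame to frequency-domain magnitudes with an FFT."""
        return np.abs(np.fft.rfft(frame))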

As shown in fig. 3, the example components include an amplitude detector 304 and a neural network 112, the neural network 112 including or providing the voice activity detector 114 for processing the audio frame 302. In general, the amplitude detector 304 measures or determines the amplitude of the audio signal 108 corresponding to an audio frame. For example, the amplitude detector 304 may generate or provide an indication of a maximum amplitude 306 of an audio frame or audio signal portion. In some aspects, the adaptive energy limiter 110 determines or updates an average amplitude 308 (e.g., a moving average) of the audio signal 108 or the audio frame 302 based on the plurality of maximum amplitudes 306.
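
A minimal sketch of the amplitude detector 304 and the average amplitude 308 follows; the exponential smoothing factor is an assumption, as the description calls only for a moving average of the per-frame maximum amplitudes 306:

    import numpy as np

    def max_amplitude(frame):
        """Maximum absolute sample value of one audio frame (maximum amplitude 306)."""
        return float(np.max(np.abs(frame)))

    def update_average_amplitude(average, frame_max, smoothing=0.05):
        """Exponentially weighted moving average of per-frame maximum amplitudes (308)."""
        if average is None:
            return frame_max                  # first frame seeds the average
        return (1.0 - smoothing) * average + smoothing * frame_max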

The neural network 112 may be implemented as a network operating on a processor of the user device 102 to provide a speech likelihood estimate of the audio frame 302. Alternatively or additionally, the neural network 112 may be implemented as a Recurrent Neural Network (RNN) or a machine learning model with memory (e.g., RNNoise). In some aspects, the voice activity detector 114 provides instantaneous voice likelihoods 310 (IVL 310) for one or more of the audio frames. Although described as a neural network-enabled voice activity detector, other types of voice activity detection or voice classification may be used.

For example, the neural network 112 and/or the voice activity detector 114 may be implemented as a neural network (e.g., a Deep Neural Network (DNN)) that includes an input layer, an output layer, and one or more hidden intermediate layers located between the input layer and the output layer of the neural network. Any or all of the nodes of the neural network may then be fully or partially connected between the layers of the neural network. The voice activity detector 114 may be implemented with or through any type of neural network, such as a convolutional neural network including GoogleNet or similar convolutional networks. Alternatively or additionally, the voice activity detector 114 or machine-learned voice activity detection model may include any suitable Recurrent Neural Network (RNN) or any variant thereof. In general, the neural network 112 and/or the voice activity detector 114 employed by the adaptive energy limiter may also include any other supervised learning, unsupervised learning, reinforcement learning algorithms, and the like.

In various aspects, the neural network 112 and/or the voice activity detector 114 associated with the adaptive energy limiter 110 may be implemented as a Recurrent Neural Network (RNN) with connections between nodes that form a loop to retain information from a previous portion of the input data sequence for a subsequent portion of the input data sequence (e.g., a frame of previous audio of speech or noise generated by a participant). In other cases, the neural network 112 is implemented as a feed-forward neural network having connections between nodes that do not form loops between input data sequences. Alternatively, the neural network 112 may be implemented as a Convolutional Neural Network (CNN) with multi-layer perceptrons, where each neuron in a particular layer is connected to all neurons of an adjacent layer. In various aspects of adaptive energy limiting, the neural network 112 and/or the speech activity detector 114 may use previous determinations of noise or speech by the participant to predict or determine whether a subsequent frame of the audio signal includes speech or noise that may be suppressed.
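
By way of illustration only, and not as the actual network described above, a minimal recurrent VAD of this kind could be expressed as follows; the feature dimension, hidden size, and use of PyTorch are assumptions, and such a model would still need to be trained on labeled speech and noise data:

    import torch
    from torch import nn

    class TinyRecurrentVad(nn.Module):
        """Illustrative recurrent VAD: per-frame features in, speech likelihood (0..1) out."""

        def __init__(self, feature_dim=40, hidden_dim=24):
            super().__init__()
            self.gru = nn.GRU(feature_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, 1)

        def forward(self, features, state=None):
            # features: (batch, frames, feature_dim); state carries memory across calls,
            # so earlier frames of a participant's audio inform later estimates.
            out, state = self.gru(features, state)
            ivl = torch.sigmoid(self.head(out)).squeeze(-1)   # one likelihood per frame
            return ivl, state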

In general, the neural network 112 may enable determination of a speech likelihood estimate that converges quickly to a high statistical confidence, especially in the presence of vowels. Recall that transient noise typically has stronger full-band energy than vowels and is stronger than consonants in speech. Thus, the adaptive energy limiter, while utilizing the statistical confidence provided by the neural network 112, can exploit the participant's historical noise or voice patterns to distinguish among noise, the vowels of speech, and the consonants of speech. In other words, speech and noise tend to be bursty; that is, participants who have recently spoken are more likely to continue speaking in the near future (e.g., within less than a second). Similarly, participants who have recently generated noise are more likely to generate additional noise in the future. In some cases, any lag introduced by the adaptive energy limiter is imperceptible to conference call participants, yet the neural network 112 is able to determine whether the audio of a frame or signal is noise or speech more accurately in retrospect (e.g., looking back a few hundred milliseconds) than in real time.

Based on one or more of the instantaneous voice likelihoods 310, the adaptive energy limiter 110 may determine an aggregate voice likelihood estimate 312 (ASLE 312) for the audio signal 108 or the audio frames 302. The aggregate voice likelihood estimate 312 may be configured or updated based on the current aggregate voice likelihood estimate 312 and/or a threshold for detecting speech or noise. For example, in some cases, the adaptive energy limiter 110 increases the aggregate voice likelihood estimate 312 in response to the instantaneous voice likelihood 310 exceeding the current aggregate voice likelihood estimate 312 and exceeding a threshold for voice detection. In other cases, the adaptive energy limiter 110 may decrease the aggregate voice likelihood estimate 312 in response to the instantaneous voice likelihood 310 not exceeding the current aggregate voice likelihood estimate 312 or not exceeding the threshold for voice detection.
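
A sketch of this update rule for the aggregate voice likelihood estimate 312 follows; the step sizes and the 0.5 detection threshold are illustrative assumptions, and likelihoods are assumed to lie in the range 0 to 1:

    SPEECH_THRESHOLD = 0.5        # assumed speech detection threshold

    def update_asle(asle, ivl, up_step=0.05, down_step=0.01):
        """Raise the aggregate estimate on convincing speech frames; otherwise decay it."""
        if ivl > asle and ivl > SPEECH_THRESHOLD:
            return min(1.0, asle + up_step)
        return max(0.0, asle - down_step)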

The adaptive energy limiter 110 further comprises or provides a limiter upper limit 314, by which the energy of the audio signal 108 may be limited, e.g., to suppress the energy of transient noise. Typically, the limiter upper limit 314 is provided to an audio signal limiter module 316 through which the audio signal 108 passes before being transmitted to other audio components or processes. The audio signal limiter module 316 may deliver the audio signal at full scale (e.g., unreduced or unrestricted) or at a reduced scale or reduced amplitude as dictated by the limiter upper limit 314 set by the adaptive energy limiter 110. In the context of FIG. 3, the audio signal limiter module 316 limits or reduces the energy of the audio signal 108 to provide or generate an energy-limited audio signal 318 based on the limiter upper limit 314 provided by the adaptive energy limiter 110. In various aspects, the adaptive energy limiter 110 limits the energy of an audio signal that is determined to be, or to include, noise in order to suppress that noise and possibly future noise. The energy-limited audio signal 318 may then be sent to an audio-based process 320 for subsequent processing or for other features (e.g., speaker selection) before being included in conference audio 322 shared with other participants of the audio or video conference call.
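
The audio signal limiter module 316 itself can be sketched as scaling any frame whose peak exceeds the current limiter upper limit 314 (a minimal sketch; a production limiter would typically also smooth gain changes across frames to avoid audible artifacts):

    import numpy as np

    def limit_frame(frame, ceiling):
        """Scale a frame so its peak amplitude does not exceed the limiter upper limit."""
        peak = float(np.max(np.abs(frame)))
        if peak <= ceiling or peak == 0.0:
            return frame                        # already within the allowed energy
        return frame * (ceiling / peak)         # reduce the frame to the ceiling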

Example method

In accordance with one or more aspects of adaptive energy limiting for transient noise suppression, example methods 400 and 500 are described with reference to FIGS. 4, 5A, and 5B. In general, the methods 400 and 500 illustrate sets of operations (or actions) that may be performed in, but are not necessarily limited to, the order or combination of operations shown herein. Moreover, any one or more of the operations may be repeated, combined, reorganized, skipped, or linked to provide a wide variety of additional and/or alternative methods. In portions of the following discussion, reference may be made to the example conferencing environment 100 of FIG. 1, the example devices of FIG. 2, the example components of FIG. 3, the example system of FIG. 7, and/or the entities detailed in FIG. 1, by way of example only. The techniques and apparatus described in this disclosure are not limited to performance by one entity or by multiple entities operating on one device.

The method 400 is a method performed by the user device 102 or the conferencing device 202. The method 400 limits the amount of energy of the audio signal to mitigate the effects associated with transient noise in the conference environment or other audio process (e.g., speaker selection for a video conference). In some aspects, the operations of method 400 are implemented by or with adaptive limiter 110, neural network 112, and/or voice activity detector 114 of user device 102 or conference device 202.

At 402, the limiter upper limit for the audio signal is set to full scale. In some cases, the limiter upper limit or limit value is set to full scale at initialization of the adaptive energy limiter, or in response to voice from the participant whose audio signal is being processed for noise suppression.

At 404, a portion of an audio signal is received. The portion of the audio signal may include a frame of audio, an audio frame, a segment of the audio signal, and so forth. In some cases, an audio signal is received and divided into frames of audio for analysis by an adaptive energy limiter. For example, a frame of audio may correspond to a range of approximately 5 milliseconds to 50 milliseconds of audio. Alternatively or additionally, frames of audio may be converted from the time domain to the frequency domain to enable spectral analysis or other frequency domain-based processing.

At 406, a maximum amplitude of the portion of the audio signal is determined. The maximum amplitude may be determined for the portion of the audio signal that corresponds to a frame or audio duration of the audio (e.g., 10 milliseconds). In some cases, the maximum amplitude of the audio signal is compared to a threshold to determine that the participant is silent, quiet, or otherwise not producing noise. Alternatively, if the audio signal is quiet or silent, the method 400 may return from operation 406 to operation 404. Thus, if and when a silent participant begins to speak, the voice energy of the silent participant will not decrease.

At 408, the portion of the audio signal is evaluated with a neural network to provide a speech likelihood estimate. In some aspects, the portion of the audio signal or the frame of audio is evaluated with a neural network or a neural network-enabled voice activity detector to provide an instantaneous speech likelihood of the portion of the audio signal or the frame of audio. In general, the instantaneous speech likelihood may indicate whether the audio stream is more likely to be speech or to be noise, which the adaptive energy limiter will suppress.

At 410, it is determined whether the portion of the audio signal includes speech or noise based on the maximum amplitude and the speech likelihood estimate. For example, if the maximum amplitude of the portion of the audio signal exceeds a moving average of the maximum amplitude (e.g., the maximum average plus a small correction value) and the instantaneous speech likelihood is less than 0.5 or 50% (indicating noise), then the portion of the audio may be determined to include or be noise. Alternatively, if the maximum amplitude of the portion of the audio signal does not exceed the moving average of the maximum amplitude (e.g., the maximum average plus a small correction value), or the instantaneous speech likelihood is greater than 0.5 or 50%, then the portion of audio may be determined not to be noise, or to be speech (e.g., when it exceeds the maximum average and the IVL is greater than 50%). Alternatively, if it is determined that the portion of the audio signal is or includes the voice of the participant, the method 400 may return from operation 410 to operation 402.

At 412, in response to determining that the portion of the audio signal includes noise, the limiter upper limit of the audio signal is lowered. In some aspects, the limiter upper limit is lowered at a particular rate or by a particular amount based on the aggregate voice likelihood estimate. For example, if the aggregate voice likelihood estimate is high, the upper limit is lowered by a small amount, or slowly, toward the minimum limiter upper limit value. In other cases, when the aggregate voice likelihood estimate is low, the upper limit may be reduced by a large amount, or quickly, toward the minimum limiter upper limit value. Alternatively or additionally, the minimum limiter upper limit may be configured based on the aggregate voice likelihood estimate, an average of respective amplitudes of a plurality of portions of the audio signal, or an average of respective maximum amplitudes of a plurality of portions of the audio signal, e.g., representing the portion of current energy estimated to be voice.
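
A sketch of operation 412 follows; the step sizes and the floor computation are assumptions, as the description specifies only that the rate of reduction and the minimum value depend on the aggregate voice likelihood estimate and, optionally, on the amplitude averages:

    def lower_ceiling(ceiling, asle, average_max_amplitude,
                      slow_step=0.001, fast_step=0.01, floor=0.05):
        """Lower the limiter upper limit; a high ASLE slows the decay and raises the floor."""
        step = slow_step if asle > 0.5 else fast_step
        # The minimum ceiling preserves the portion of current energy estimated as speech.
        min_ceiling = max(floor, asle * average_max_amplitude)
        return max(min_ceiling, ceiling - step)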

At 414, the limiter upper limit is provided to the limiter module through which the audio signal is to pass. The limiter module limits an amount of energy of the audio signal based on the limiter upper limit. By limiting the energy that the audio signal is allowed to transmit or carry into the conference audio environment, aspects of adaptive energy limiting may prevent full-energy transient noise from entering the conference audio and disturbing participants and/or other audio-based processes.
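
A hard-clipping sketch of the limiter module is shown below for illustration; an actual limiter module may instead apply smoother gain reduction.

    import numpy as np

    def apply_limiter(frame, ceiling):
        # Clamp each sample so that none exceeds the limiter upper limit,
        # preventing full-energy transients from reaching the conference audio.
        return np.clip(frame, -ceiling, ceiling)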

The method 500 of fig. 5A and 5B is a method performed by the user device 102 or the conferencing device 202. Method 500 scales the audio signal to not exceed the limiter upper limit, which may effectively prevent the audio signal from carrying full-energy transient noise into the conference audio environment. In some aspects, the operations of method 500 are implemented by or with adaptive limiter 110, neural network 112, and/or voice activity detector 114 of user device 102 or conference device 202.

At 502, the limiter upper limit for the audio signal is set to full scale (e.g., 1.0 or 100%). The limiter upper limit, or limit value, may be set to full scale when the adaptive energy limiter is initialized, or reset to full scale in response to detecting the voice of the participant whose audio signal is being processed for noise suppression.

At 504, a frame of audio corresponding to a portion of an audio signal is generated. In some cases, an audio signal is received and/or separated, segmented, or otherwise divided into frames of audio for analysis by a voice activity detector and/or an adaptive energy limiter. In other cases, the audio frames may be received from an audio codec or other entity configured to provide frames from an audio signal. For example, a frame of audio may correspond to a range of approximately 5 milliseconds to 50 milliseconds of audio (e.g., 10 milliseconds). Alternatively or additionally, frames of audio may be converted from the time domain to the frequency domain to enable spectral analysis or other frequency domain-based processing.

At 506, the frame of audio is evaluated with a neural-network-enabled voice activity detector to provide an instantaneous voice likelihood (IVL). In some aspects, the portion of the audio signal, or the frame of audio, is evaluated with a neural network or a neural-network-enabled voice activity detector to provide an instantaneous voice likelihood of the portion of the audio signal or the frame of audio. In general, the instantaneous voice likelihood may indicate whether the audio stream is more likely to be speech or noise, the noise being what the adaptive energy limiter will suppress.
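
A sketch of this evaluation is shown below; vad_model stands in for whatever trained neural network is deployed and is assumed to return a value in the range 0 to 1, with values near 1 indicating speech.

    import numpy as np

    def instantaneous_voice_likelihood(frame, vad_model):
        # Evaluate one frame with a neural-network-enabled voice activity detector.
        # The magnitude spectrum is one possible input representation.
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        return float(vad_model(spectrum))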

At 508, the maximum amplitude of the audio signal is recorded from the frame of audio. The maximum amplitude may be determined or recorded for the audio duration corresponding to one frame of audio (e.g., 10 milliseconds). In some cases, the maximum amplitude of the audio signal is compared to a threshold to determine that the participant is silent, quiet, or otherwise not producing noise. In this case, if the audio signal is quiet or silent, the method 500 may return to operation 504.

At 510, a moving average of the maximum amplitude of the audio signal is updated based on the maximum amplitude of the frames of recorded audio. The moving average of the maximum amplitude may correspond to any suitable number of audio frames or audio durations, such as a range of approximately 100 milliseconds to 500 milliseconds.
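
The moving average may, for example, be maintained as an exponential moving average, as in the following sketch; the smoothing factor is an illustrative assumption chosen so that 10-millisecond frames yield an effective window on the order of 100 to 500 milliseconds.

    def update_moving_average(current_average, max_amplitude, alpha=0.05):
        # Exponential moving average of the per-frame maximum amplitude.
        return (1.0 - alpha) * current_average + alpha * max_amplitude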

As shown at 512 in fig. 5B, operation 514 determines an aggregate speech likelihood estimate (ASLE) based on the instantaneous voice likelihood (IVL) of the frame of audio. The aggregate speech likelihood estimate may be determined or configured based on the current aggregate speech likelihood estimate and/or a threshold for detecting speech (or noise). In some cases, the aggregate speech likelihood estimate is increased in response to the instantaneous voice likelihood exceeding the current aggregate speech likelihood estimate and the speech detection threshold. In other cases, the aggregate speech likelihood estimate is decreased in response to the instantaneous voice likelihood not exceeding the current aggregate speech likelihood estimate or the speech detection threshold.
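
A sketch of this update rule follows; the speech detection threshold and step sizes are illustrative assumptions.

    def update_asle(asle, ivl, speech_threshold=0.5, up_step=0.05, down_step=0.01):
        # Raise the aggregate speech likelihood estimate when the instantaneous
        # voice likelihood exceeds both the current estimate and the speech
        # detection threshold; otherwise lower it.
        if ivl > asle and ivl > speech_threshold:
            return min(1.0, asle + up_step)
        return max(0.0, asle - down_step)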

At 516, it is determined whether the maximum amplitude exceeds the moving average and the instantaneous voice likelihood indicates that the frame of audio is noise. For example, if the maximum amplitude of the portion of the audio signal exceeds the moving average of the maximum amplitude (e.g., the moving average plus a small correction value) and the instantaneous voice likelihood is less than 0.5 or 50% (indicating noise), the frame of audio may include or be noise. Alternatively, if the maximum amplitude of the portion of the audio signal does not exceed the moving average of the maximum amplitude (e.g., the moving average plus a small correction value), or the instantaneous voice likelihood is greater than 0.5 or 50%, the frame of audio may be determined not to include noise, or to be primarily speech.

Optionally, at 518, the limiter upper limit is not lowered in response to the maximum amplitude not exceeding the moving average and/or the instantaneous voice likelihood not indicating that the frame of audio is noise. Optionally, at 520, the limiter upper limit is lowered based on the aggregate speech likelihood estimate (ASLE). The limiter upper limit is lowered in response to the maximum amplitude exceeding the moving average and the IVL indicating that the frame of audio is noise. Typically, the amount or rate by which the limiter upper limit is lowered is determined based on the aggregate speech likelihood estimate.

At 522, the current value of the limiter upper limit is provided to the limiter module to scale the audio signal so that it does not exceed the current value. The limiter module scales an amount of energy of the audio signal passing through the limiter module based on the limiter upper limit. By scaling or limiting the energy that the audio signal is allowed to transmit or carry into the conference audio environment, aspects of adaptive energy limiting may prevent full-energy transient noise from entering the conference audio and disturbing participants and/or other audio-based processes. From operation 522, the method 500 may return to operation 504 to perform another iteration of the method 500 to further limit the energy of the audio signal, reset the limiter upper limit, or maintain the limiter upper limit. In some aspects, the method 500 or process for adaptive energy limiting iterates or repeats approximately every 5 to 50 milliseconds (e.g., every 10 milliseconds) to provide responsive suppression of transient noise.
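
Tying the sketches above together, one iteration of this loop might look as follows; the state keys and the simplified reset-on-speech branch are illustrative and collapse several of the described refinements into a single pass.

    import numpy as np

    def process_frame(frame, state, vad_model):
        # One pass of operations 504-522, using the helper sketches above;
        # state holds "ceiling", "moving_average", and "asle".
        ivl = instantaneous_voice_likelihood(frame, vad_model)
        max_amp = frame_max_amplitude(frame)
        state["moving_average"] = update_moving_average(state["moving_average"], max_amp)
        state["asle"] = update_asle(state["asle"], ivl)
        if frame_is_noise(max_amp, state["moving_average"], ivl):
            state["ceiling"] = lower_limiter_ceiling(state["ceiling"], state["asle"])
        elif ivl >= 0.5:
            state["ceiling"] = 1.0  # speech detected: reset the ceiling to full scale
        return apply_limiter(frame, state["ceiling"])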

By way of example, consider fig. 6, where a graph 600 illustrates aspects of adaptive energy limiting. Through the limiter module, the energy of the audio signal may be delivered at the full scale 602 or limited to a minimum value 604 of the limiter upper limit. In this example, it is assumed that the audio signal 606 is received from a participant who is continuously producing a medium to high level of noise (no speech). Here, the adaptive energy limiter 110 may quickly limit the energy of the audio signal delivered to the conference audio environment to prevent the noise of the audio signal 606 from disturbing other participants of the conference call.

As another example, consider graph 608, which includes an audio signal 610 of another participant of the conference call. Here, it is assumed that the participant is not speaking, but is also not producing much noise. The adaptive energy limiter 110 gradually limits the audio signal 610 until the participant begins to speak at 612. In response to detecting speech, the adaptive energy limiter 110 resets the limiter upper limit to the full scale 602 at 614 and does not begin limiting the energy of the audio signal 610 again until the participant stops speaking at 616.

System

Fig. 7 illustrates various components of an example system 700, which may be implemented as any of the types of user device 102 or conferencing device 202 described with reference to figs. 1-6 to implement adaptive energy limiting for transient noise suppression. In some aspects, system 700 is implemented as a component of, or embodied on, a user equipment device or a base station. For example, system 700 may be implemented as a system of hardware-based components, such as, but not limited to, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SoCs), systems-in-package, Complex Programmable Logic Devices (CPLDs), audio codecs, audio processors, co-processors, context hubs, communication co-processors, sensor co-processors, and the like.

The system 700 includes a communication device 702 that enables wired and/or wireless communication of system data 704 (e.g., encoded audio data or audio signals). The system data 704 or other system content may include configuration settings for the system, media content stored on the device, and/or information associated with a user of the device. Media content stored on system 700 can include any type of audio, video, and/or image data. The system 700 includes one or more data inputs 706 via which any type of data, media content, and/or inputs can be received, such as human utterances, speech, interactions with radar fields, user selectable inputs (explicit or implicit), messages, music, television media content, recorded video content, and any other type of audio, video, and/or image data received from any content and/or data source.

System 700 also includes communication interfaces 708 that can be implemented as any one or more of a serial and/or parallel interface, a wireless interface, a network interface, a modem, and as any other type of communication interface. The communication interfaces 708 provide a connection and/or communication links between the system 700 and a communication network by which other electronic, computing, and communication devices communicate data with the system 700.

The system 700 includes one or more processors 710 (e.g., any of microprocessors, controllers, and the like), which process various computer-executable instructions to control the operation of the system 700 and to enable, or embody, the techniques for adaptive energy limiting for transient noise suppression. Alternatively or in addition, system 700 can be implemented with any one or combination of hardware, firmware, or fixed logic circuitry that is implemented in connection with processing and control circuits, which are generally identified at 712. Although not shown, system 700 can include a system bus or data transfer system that couples the various components within the device. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures.

System 700 also includes computer-readable media 714 (CRM 714), such as one or more storage devices capable of persistent and/or non-transitory data storage, which therefore do not include transitory signals or carriers. Examples of CRM 714 include Random Access Memory (RAM), non-volatile memory (e.g., any one or more of read-only memory (ROM), flash memory, EPROM, EEPROM, etc.), or a disk storage device. A disk storage device may be implemented as a magnetic or optical storage device, such as a hard disk drive, a recordable and/or rewriteable Compact Disc (CD), any type of a Digital Versatile Disc (DVD), and the like. The system 700 may also include a mass storage media device (storage media) 716 or a mass storage device interface. In this example, the system 700 also includes, or may be implemented with, an audio codec 722 to support encoding or decoding of audio signals or audio data, such as encoding audio from a microphone to provide audio signals or audio data for a conference service or voice call.

Computer-readable media 714 provides data storage mechanisms to store the system data 704, as well as various system applications 718 and any other types of information and/or data related to operational aspects of system 700. For example, an operating system 720 can be maintained as a computer application on the computer-readable media 714 and executed on the processors 710. The system applications 718 can include a system manager, such as any form of a control application, software application, signal processing and control module, code that is native to a particular device, an abstraction module, a gesture module, and so forth. The system applications 718 also include system components and utilities that implement adaptive energy limiting for transient noise suppression, such as the adaptive limiter 110, the neural network 112, and the voice activity detector 114. Although not shown, one or more elements of the adaptive limiter 110, the neural network 112, or the voice activity detector 114 may be implemented in whole or in part by hardware or firmware.

According to one example, the present disclosure describes aspects of adaptive energy limiting for transient noise suppression. In some aspects, the adaptive energy limiter sets a limiter upper limit for the audio signal to a full scale and receives a portion of the audio signal. For the portion of the audio signal, the adaptive energy limiter determines a maximum amplitude and evaluates the portion with a neural network to provide a speech likelihood estimate. Based on the maximum amplitude and the speech likelihood estimate, the adaptive energy limiter determines that the portion of the audio signal includes noise. In response to determining that the portion of the audio signal includes noise, the adaptive energy limiter lowers an upper limiter limit and provides the upper limiter limit to a limiter module that effectively limits an amount of energy of the audio signal. This may effectively prevent the audio signal from carrying full energy transient noise into the conference audio.

Although the above-described devices, systems, and methods are described in the context of adaptive energy limiting for transient noise suppression in an audio/video conferencing environment, the described devices, systems, or methods are non-limiting and may be applied to other contexts, user device deployments, or audio-based communication environments.

In addition to the above, controls may be provided to the user allowing the user to make selections as to whether and when the systems, programs, and/or features described herein are able to collect user information (e.g., audio, sound, speech, voice, user preferences, or a user's current location) and whether content and/or communications are sent to the user from a server. In addition, some data may be treated in one or more ways before it is stored or used, so that personally identifiable information is removed. For example, the identity of a user may be treated such that no personally identifiable information can be determined for the user. As another example, where location information is obtained, a user's geographic location may be generalized (e.g., to a city, zip code, or state/province level) such that the particular location of the user cannot be determined. Thus, the user may have control over what information (e.g., audio) is collected about the user, how that information is used, and what information is provided to the user.

Several examples are presented below:

example 1: a method, comprising: setting a limiter upper limit of the audio signal to a full scale; receiving a portion of the audio signal; determining a maximum amplitude of the portion of the audio signal; evaluating the portion of the audio signal with a neural network to provide a speech likelihood estimate of the portion of the audio signal; determining that the portion of the audio signal includes noise based on the maximum amplitude and the speech likelihood estimate; in response to determining that the portion of the audio signal includes noise, lowering the limiter upper limit; and providing the limiter upper limit to a limiter module through which the audio signal is to pass to limit an amount of energy of the audio signal.

Example 2: the method of example 1, wherein the portion of the audio signal is a frame of audio corresponding to the portion of the audio signal, and the method further comprises: converting the frame of audio from a time domain to a frequency domain prior to evaluating the frame of audio.

Example 3: the method of example 2, wherein the frame of audio is a frame of first audio, and the method further comprises: receiving a frame of second audio corresponding to a second portion of the audio signal; evaluating the frames of the second audio with the neural network to provide respective speech likelihood estimates for the frames of the second audio; determining that frames of the second audio include speech based on the respective speech likelihood estimates; and resetting the limiter upper limit to the full scale.

Example 4: the method of example 2, wherein the frame of audio is a frame of first audio, and the method further comprises: receiving a frame of second audio corresponding to a second portion of the audio signal; determining respective maximum amplitudes of frames of the second audio; comparing the respective maximum amplitudes of the frames of the second audio to a threshold value, the threshold value corresponding to an average of respective maximum amplitudes of a plurality of frames of audio corresponding to a plurality of respective portions of the audio signal; and maintaining a current limiter ceiling in response to the respective amplitude of frames of the second audio not exceeding the threshold.

Example 5: the method of any of examples 2 to 4, wherein the frame of audio corresponds to an audio duration ranging from about 10 milliseconds of the audio to about 50 milliseconds of the audio.

Example 6: the method of any of examples 1-5, wherein evaluating the portion of the audio signal with the neural network to provide the speech likelihood estimate comprises analyzing the portion of the audio signal with a neural network-enabled Voice Activity Detector (VAD) to provide an instantaneous speech likelihood (IVL) of the portion of the audio signal.

Example 7: the method of example 6, wherein the limiter upper limit is lowered by a predefined amount, and the method further comprises: determining an aggregate voice likelihood estimate (ASLE) based on a plurality of IVLs provided by the neural network enabled VAD; updating the ASLE based on the IVL by: increasing the ASLE in response to the IVL exceeding the ASLE and exceeding a speech detection threshold; or decreasing the ASLE in response to the IVL not exceeding the ASLE or not exceeding the speech detection threshold; and setting the predefined amount by which the limiter upper limit is lowered based on the ASLE.

Example 8: the method of example 7, wherein the limiter upper limit has a minimum value, and the method further comprises configuring the minimum value of the limiter upper limit based on the ASLE.

Example 9: the method of example 8, further comprising configuring a minimum value of the limiter upper limit based on the ASLE and one of: an average of respective amplitudes of a plurality of portions of the audio signal; or an average of respective maximum amplitudes of a plurality of portions of the audio signal.

Example 10: an apparatus, comprising: a network interface for receiving or transmitting audio signals over a data network; a limiter module to limit energy of the audio signal; a hardware-based processor associated with the network interface; and a storage medium storing processor-executable instructions that, in response to execution by the hardware-based processor, implement an adaptive energy limiter to: setting a limiter upper limit of the audio signal to a full scale; providing frames of audio corresponding to audio durations from the audio signal; determining a maximum amplitude of the audio signal for a frame of the audio; evaluating the frames of audio with a neural network to provide speech likelihood estimates for the frames of audio; determining that a frame of the audio includes noise based on the maximum amplitude and the speech likelihood estimate; in response to determining that the frame of audio includes noise, lowering the limiter upper limit; and providing the limiter upper limit to the limiter module to reduce the energy of the audio signal.

Example 11: the apparatus of example 10, wherein the adaptive energy limiter is further implemented to: capturing frames of audio as part of the audio signal; and converting the frames of audio from the time domain to the frequency domain for evaluation by the neural network.

Example 12: the apparatus of example 11, wherein the frame of audio is a frame of first audio, and the adaptive energy limiter is further implemented to: capturing a frame of second audio corresponding to a second portion of the audio signal; converting a frame of the second audio from the time domain to the frequency domain; evaluating frames of the second audio of the audio signal with the neural network to provide respective speech likelihood estimates for the frames of the second audio; determining that frames of the second audio include speech based on the respective speech likelihood estimates; and resetting the limiter upper limit to the full scale.

Example 13: the apparatus of example 11, wherein the frame of audio is a frame of first audio, and the adaptive energy limiter is further implemented to: capturing a frame of second audio corresponding to a second portion of the audio signal; determining respective maximum amplitudes of frames of the second audio; comparing the respective maximum amplitudes of the frames of the second audio to a threshold value, the threshold value corresponding to an average of respective maximum amplitudes of a plurality of frames of audio corresponding to a plurality of respective portions of the audio signal; and maintaining the upper limiter limit at a current level in response to the respective amplitude of the frames of the second audio not exceeding the threshold.

Example 14: the apparatus of any of examples 11 to 13, wherein the frames of audio correspond to a duration of audio information from the audio signal, the duration ranging from approximately 5 milliseconds of audio information to approximately 50 milliseconds of audio information.

Example 15: the apparatus according to any of examples 10 to 14, wherein the neural network comprises a Voice Activity Detector (VAD), and the adaptive energy limiter is further implemented to provide the voice likelihood estimate as an Instantaneous Voice Likelihood (IVL) of the portion of the audio signal using the VAD of the neural network.

Example 16: the apparatus of example 15, wherein the adaptive energy limiter lowers the limiter upper limit by a predefined amount, and the adaptive energy limiter is further implemented to: determining an aggregate voice likelihood estimate (ASLE) based on a plurality of IVLs provided by the VAD of the neural network; updating the ASLE based on the IVL by: increasing the ASLE in response to the IVL exceeding the ASLE and exceeding a speech detection threshold; or decreasing the ASLE in response to the IVL not exceeding the ASLE or not exceeding the speech detection threshold; and setting the predefined amount by which the limiter upper limit is lowered based on the ASLE.

Example 17: the apparatus of example 16, wherein the limiter upper bound has a minimum value, and the adaptive energy limiter is further implemented to configure the minimum value of the limiter upper bound based on the ASLE.

Example 18: the apparatus of example 17, wherein the adaptive energy limiter is further implemented to configure a minimum value of the limiter upper limit based on the ASLE and one of: an average of respective amplitudes of a plurality of portions of the audio signal; or an average of respective maximum amplitudes of a plurality of portions of the audio signal.

Example 19: a system, comprising: a hardware-based processor operatively associated with an audio interface or a data interface through which audio signals are received; and a storage medium storing processor-executable instructions that, in response to execution by the hardware-based processor, implement an adaptive energy limiter to: setting a limiter upper limit of the audio signal to a full scale; generating frames of audio corresponding to an audio duration from the audio signal based on the audio signal; determining a maximum amplitude of the audio signal for a frame of the audio; evaluating the frames of audio with a neural network to provide speech likelihood estimates for the frames of audio; determining that a frame of the audio includes noise based on the maximum amplitude and the speech likelihood estimate; in response to determining that the frame of audio includes noise, lowering the limiter upper limit; and providing the limiter upper limit to a limiter module to reduce the energy of the audio signal.

Example 20: the system of example 19, wherein the system is implemented as one of: an audio conferencing system, a video conferencing system, an application specific integrated circuit, an application specific standard product, a system on a chip, a system in package, a complex programmable logic device, an audio codec or an audio processor.
