Lightweight full 360 degree audio source location detection using two microphones

Document No.: 574704 | Publication Date: 2021-05-21

Reading note: This technology, Lightweight full 360 degree audio source location detection using two microphones, was designed and created by Hector A. Cordova Maruri, Jose R. Camacho Perez, Paulo Lopez Meyer, and Julio C. Zamora on 2020-09-25. Abstract: The present disclosure relates to lightweight full 360 degree audio source location detection using two microphones. A system is described herein. The system includes at least one hardware processor configured to identify a predetermined acoustic barrier filter, wherein the acoustic barrier filter is consistent with a physical acoustic barrier, and to receive audio signals at a first microphone and a second microphone within a time window. The hardware processor is further configured to calculate a first variability metric, a second variability metric, a third variability metric, and a fourth variability metric. The hardware processor also concatenates the first variability metric, the second variability metric, the third variability metric, and the fourth variability metric to form a feature vector, and inputs the feature vector into a location classifier to obtain an audio source location.

1. A system, comprising:

a physical acoustic barrier;

a microphone array comprising a first microphone and a second microphone;

at least one hardware processor configured to:

identify a predetermined acoustic barrier filter, wherein the acoustic barrier filter is consistent with the physical acoustic barrier;

receive audio signals at the first microphone and the second microphone within a time window;

calculate a first variability metric of a direct difference of the audio signals received at the first microphone and the second microphone;

calculate a second variability metric of a delay difference of the audio signals received at the first microphone and the second microphone;

calculate a third variability metric of a filtered direct difference of the audio signals received at the first microphone and the second microphone, wherein the audio signals are filtered by the predetermined acoustic barrier filter;

calculate a fourth variability metric of a filtered delay difference of the audio signals received at the first microphone and the second microphone, wherein the audio signals are filtered by the predetermined acoustic barrier filter;

concatenate the first, second, third, and fourth variability metrics to form a feature vector; and

input the feature vector into a location classifier to obtain an audio source location.

2. The system of claim 1, wherein the predetermined acoustic barrier filter is consistent with the physical acoustic barrier by replicating a frequency response of the physical acoustic barrier.

3. The system of claim 1, wherein the location classifier is a shallow neural network.

4. The system of claim 1, wherein the first, second, third, and fourth variability metrics are root mean square values.

5. The system of claim 1, wherein the first, second, third, and fourth variability metrics are root mean square values.

6. The system of claim 1, wherein the predetermined acoustic barrier filter is a band pass filter that is consistent with the physical acoustic barrier.

7. The system of claim 1, wherein the physical acoustic barrier is a surface that alters frequency components of an audio signal from an audio source.

8. The system of claim 1, wherein the difference is calculated by: normalizing the audio signals received by the first and second microphones and subtracting the normalized audio signal captured by the first microphone from the normalized audio signal captured by the second microphone.

9. The system of claim 1, wherein the delayed audio signal is generated by delaying the audio signal at the second microphone by a predetermined number of samples.

10. The system of claim 1, wherein the audio source location is an angle of arrival.

11. A method, comprising:

identifying a predetermined acoustic barrier filter, wherein the acoustic barrier filter is consistent with a physical acoustic barrier;

receiving audio signals at a first microphone and a second microphone within a time window;

calculating a first variability metric of a direct difference of the audio signals received at the first and second microphones, a second variability metric of a delay difference of the audio signals received at the first and second microphones, a third variability metric of a filtered direct difference of the audio signals received at the first and second microphones, wherein the audio signals are filtered by the predetermined acoustic barrier filter, and a fourth variability metric of a filtered delay difference of the audio signals received at the first and second microphones, wherein the audio signals are filtered by the predetermined acoustic barrier filter;

concatenating the first, second, third, and fourth variability metrics to form a feature vector; and

inputting the feature vector into a location classifier to obtain an audio source location.

12. The method of claim 11, wherein the predetermined acoustic barrier filter is consistent with the physical acoustic barrier by replicating a frequency response of the physical acoustic barrier.

13. The method of claim 11, wherein the location classifier is a shallow neural network.

14. The method of claim 11, wherein the first, second, third, and fourth variability metrics are root mean square values.

15. The method of claim 11, wherein the first, second, third, and fourth variability metrics are root mean square values.

16. The method of claim 11, wherein the predetermined acoustic barrier filter is a band pass filter that is consistent with the physical acoustic barrier.

17. The method of claim 11, wherein the physical acoustic barrier is a surface that alters frequency components of an audio signal from an audio source.

18. The method of claim 11, wherein the difference is calculated by: normalizing the audio signals received by the first and second microphones and subtracting the normalized audio signal captured by the first microphone from the normalized audio signal captured by the second microphone.

19. The method of claim 11, wherein the delayed audio signal is generated by delaying the audio signal at the second microphone by a predetermined number of samples.

20. The method of claim 11, wherein the audio source location is an angle of arrival.

21. A machine readable medium comprising code that when executed causes a machine to perform the method of any of claims 11 to 20.

Technical Field

The present disclosure relates to the field of artificial intelligence, and more particularly to lightweight full 360 degree audio source location detection using two microphones.

Background

There are many applications for determining the spatial location of an audio source. For example, in a smart environment or smart transportation device, knowing the location of an audio source is the basis for determining whether the sound is coming from the intended user, from some disturbance, or from some additional source that may be used for context awareness. The determination of the spatial location of the audio sources also enables the use of audio enhancement techniques on the selected audio source for Automatic Speech Recognition (ASR), speaker recognition, audio event detection, or even collision avoidance. Typically, real-time audio localization requires multiple microphone arrays or complex signal processing and machine learning techniques.

Disclosure of Invention

Embodiments of the present disclosure provide a system. The system comprises: a physical acoustic barrier; a microphone array including a first microphone and a second microphone; and at least one hardware processor configured to: identify a predetermined acoustic barrier filter, wherein the acoustic barrier filter is consistent with the physical acoustic barrier; receive audio signals at the first microphone and the second microphone within a time window; calculate a first variability metric of a direct difference of the audio signals received at the first microphone and the second microphone; calculate a second variability metric of a delay difference of the audio signals received at the first microphone and the second microphone; calculate a third variability metric of a filtered direct difference of the audio signals received at the first microphone and the second microphone, wherein the audio signals are filtered by the predetermined acoustic barrier filter; calculate a fourth variability metric of a filtered delay difference of the audio signals received at the first microphone and the second microphone, wherein the audio signals are filtered by the predetermined acoustic barrier filter; concatenate the first variability metric, the second variability metric, the third variability metric, and the fourth variability metric to form a feature vector; and input the feature vector into a location classifier to obtain an audio source location.

Embodiments of the present disclosure also provide a method. The method comprises the following steps: identifying a predetermined acoustic barrier filter, wherein the acoustic barrier filter is consistent with the physical acoustic barrier; receiving audio signals at a first microphone and a second microphone within a time window; calculating a first variability metric of a direct difference of the audio signals received at the first and second microphones, a second variability metric of a delay difference of the audio signals received at the first and second microphones, a third variability metric of a filtered direct difference of the audio signals received at the first and second microphones, wherein the audio signals are filtered by the predetermined acoustic barrier filter, and a fourth variability metric of a filtered delay difference of the audio signals received at the first and second microphones, wherein the audio signals are filtered by the predetermined acoustic barrier filter; concatenating the first variability metric, the second variability metric, the third variability metric, and the fourth variability metric to form a feature vector; and inputting the feature vector into a location classifier to obtain an audio source location.

Drawings

FIG. 1 is a graphical representation of the amplitude and frequency content differences heard by a person;

FIG. 2 is a graphical representation of amplitude and frequency content differences in audio received by an electronic device;

FIG. 3 is a block diagram of feature extraction according to the present technique;

FIG. 4 is an illustration of location classification;

FIG. 5 is a graphical representation of an exemplary form factor;

FIG. 6 is an illustration of an exemplary environment in which audio sources may be placed;

FIG. 7 is a process flow diagram of a method;

FIG. 8 is a block diagram of an electronic device that enables lightweight full 360-degree audio source localization using two microphones; and

FIG. 9 is a block diagram illustrating a medium that enables lightweight full 360-degree audio source localization using two microphones.

The same numbers are used throughout the disclosure and figures to reference like components and features. The numbers in the 100 series refer to the features originally found in FIG. 1; the numbers in the 200 series refer to the features originally found in FIG. 2; and so on.

Detailed Description

Traditionally, high quality real-time audio position determination requires multiple microphone arrays or complex signal processing and machine learning techniques. Multiple microphone arrays require additional power. Furthermore, complex signal processing and machine learning techniques consume additional power when processing audio signals. Furthermore, including additional hardware and software for implementing audio source location detection may increase the overall cost of the device.

The present technology enables the use of two microphones to determine the audio source location. The audio source position can be determined within a full 360° around the two microphones. In particular, the present techniques include identifying a predetermined acoustic barrier filter, wherein the acoustic barrier filter is consistent with a physical acoustic barrier, and receiving audio signals at a first microphone and a second microphone within a time window. First, second, third, and fourth variability metrics may be calculated based on the received audio signals. The first, second, third, and fourth variability metrics are concatenated to form a feature vector. The feature vector is input to a location classifier to obtain the audio source location. Thus, the present technique enables the spatial position of a sound source captured by an array of two microphones to be detected with very low computational overhead.

In an embodiment, the present technology uses only one pair of "sensors" to mimic the way the human ear detects the location of a sound source, where the two microphones play the role of the two ears. In particular, the present technology enables detection of a 360° angle of arrival using only a pair of microphones and an acoustic barrier installed in a device (laptop, smart speaker, infotainment center, autonomous vehicle, etc.). The variability metric may be a Root Mean Square (RMS) value. In an embodiment, the RMS values of the differences of the unfiltered and filtered microphone signals may be used as descriptor features, and a machine learning model may take the descriptors as input and estimate the location of the sound source from them. In an embodiment, the machine learning technique used herein is a shallow Neural Network (NN) implemented as a location estimator.

In an embodiment, the location of the sound source may be an angle of arrival estimated or determined according to the present techniques. The present techniques may be implemented with low-cost hardware and low computational overhead at the same time. In this manner, the present techniques do not require a tradeoff between hardware and software, as each component is low cost and incurs little overhead. In particular, the present technique may be implemented using two microphones (which most laptops already have), a small acoustic barrier (which may already be part of the form factor), and a very lightweight algorithm (which does not require computation of an FFT or other type of complex signal processing routine). The present techniques do not require Digital Signal Processing (DSP) blocks or dedicated hardware acceleration. Similar to human hearing, the present technique is capable of detecting a full 360° source position. Furthermore, the present technique is not affected when the two microphones have slightly different gains.

Fig. 1 is a graphical representation of the amplitude and frequency content differences heard by a person 100. As shown, the acoustic source 102 may be located substantially in front of the person 100. The acoustic source 104 may be located substantially behind the person 100. As used herein, substantially in front of person 100 may refer to a location within the person's field of view. In contrast, substantially behind person 100 may refer to a location outside the person's field of view. In an example, as sound waves propagate toward the person's eardrum, audio from a sound source located substantially in front of the person encounters a different physical barrier formed by the person's ear than audio from a sound source located substantially behind the person. In particular, the components of the human ear may act as an acoustic barrier. For example, the outer ear filters the frequency components of the audio according to the angle of arrival of the audio. In particular, audio may be filtered differently by the physical outer ear based on the direction from which the sound arrives. The direction may indicate the location of the sound source.

Thus, the graph 106 represents the perceived spectrum of audio content received from the front sound source 102, illustrated in terms of its frequency content. Note that the audio received from the front sound source 102 is received with full-spectrum audio content. In contrast, the graph 108 represents the perceived spectrum of audio content received from the rear sound source 104, likewise illustrated in terms of its frequency content. Note that in the example of fig. 1, the front and rear sound sources 102 and 104 emit the same audio content, as shown by the spectra 110A and 110B. However, as shown at 112, as the frequency of the audio content increases, the actually received spectrum under the solid line in graph 108 undergoes increasingly strong filtering.

Fig. 1 shows a real-world scene in which the human brain uses differences in frequency content to estimate the location of a sound source. As used herein, a sound source refers to an entity that emits sound. The location of a sound source may be described as its position in space relative to an entity (e.g., a person or a microphone) that hears or captures the sound. As shown in fig. 1, humans and many other animals can estimate the location of a sound source in any direction using only two ears or "sensors". This is possible because the shape of the ear and the presence of the head "filter out" certain audio frequency content (especially high frequencies) for certain arrival directions. The brain uses this difference in frequency content to correctly estimate the sound location.

The determination of the location of the sound source may be used to determine whether the sound is coming from the intended user, from some disturbance, or from some additional source that may be used for context awareness. Furthermore, in smart home environments, office environments, or smart transportation devices (autonomous cars, drones, etc.), real-time detection of the spatial location of audio sources may be an important function that can be used to determine whether the audio is from one or more intended users, from some disturbance, or from some additional audio source that may be used for context awareness. The estimated location also enables different types of audio enhancement techniques to be applied to a selected audio source for ASR, speaker identification, audio event detection, or even collision avoidance.

Traditionally, high quality sound location detection is performed on audio captured by a microphone array, typically comprising about 4 to 8 elements, to allow correct positioning in all directions. The principle behind this is to have a sensor or other audio capture element in the platform pointing generally at any possible audio source location. This conventional technique not only carries the additional cost of multiple microphones, but also requires processing multiple audio channels on the platform, which can impose a heavy computational burden. Such an implementation may also require dedicated DSP hardware.

Fig. 2 is a graphical representation of amplitude and frequency content differences in audio received by an electronic device. As shown in fig. 2, the electronic device 212 may be a laptop computer. The electronic device 212 may include a microphone array 210. A microphone array according to the present technology includes two microphones. As shown, the sound source 202 may be located substantially in front of the laptop computer 212. The sound source 204 may be located substantially behind the laptop 212. As used herein, being substantially in front of the laptop 212 may refer to a location in front of the plane created by the lid or display screen of the laptop 212. In contrast, being substantially behind the laptop computer 212 may refer to a position behind the plane created by the lid or display of the laptop computer 212, where the front of the plane is the side the display screen faces. In an example, as audio propagates toward the microphones 210, audio from a sound source located substantially in front of the laptop 212 encounters a different physical barrier created by the laptop than audio from a sound source located substantially behind the laptop 212. Thus, the components of the laptop computer may act as an acoustic barrier. For example, the combination of the display and the lid filters components of audio content received from various directions. In particular, the filtering may vary based on the spatial location of the sound source relative to the laptop. Thus, audio may be filtered differently by the laptop components based on the direction of arrival of the sound. The direction may indicate the location of the sound source.

A physical acoustic barrier as described herein may be a surface that alters frequency components of an audio signal from an audio source. Sound that encounters an acoustic barrier may be reflected from the surface of the acoustic barrier. In addition, sound may be absorbed by and/or transmitted through the acoustic barrier. Typically, the acoustic barrier is formed of a solid material and is wide enough or large enough to have a measurable effect on the frequency content of the audio signal. The acoustic barrier has a frequency response that changes the frequency spectrum of any audio signal that encounters it. As used herein, an audio signal encounters an acoustic barrier when the waveform comprising the audio signal is reflected, transmitted, or absorbed by the acoustic barrier. In an embodiment, an audio signal that encounters the acoustic barrier is modified, at each frequency, according to the frequency response of the acoustic barrier at that frequency. The frequency response of the acoustic barrier may be determined and used to derive a digital filter. The digital filter simulates the physical frequency response of the acoustic barrier.

The frequency response applied to the audio signal may act as a low pass filter. In particular, when an audio signal encounters an acoustic barrier, the effect on the audio signal is to pass frequencies below a selected cutoff frequency and attenuate frequencies above the cutoff frequency. The particular cutoff frequency associated with the acoustic barrier depends on the material used to fabricate the acoustic barrier, the shape of the acoustic barrier, and other physical properties of the acoustic barrier. In an embodiment, the acoustic barrier may be designed with a predetermined cutoff frequency that can be used to distinguish sound from a source located in front of the microphones from sound from a source located behind the microphones. For example, an audio signal that must pass through the acoustic barrier may undergo stronger filtering than an audio signal that does not pass through the acoustic barrier. In this example, audio signals that must pass through the acoustic barrier may experience more reflections, and therefore the frequency content available for capture by the microphones is reduced. Audio signals that do not pass through the acoustic barrier may experience fewer reflections, thus preserving more frequency content for capture by the microphones.
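
The digital counterpart of the barrier can be derived offline once its frequency response is known. Below is a minimal sketch of this step in Python; the 4000-8000 Hz band (borrowed from the example of fig. 6), the 16 kHz sampling rate, and the Butterworth design are illustrative assumptions rather than values prescribed by the present disclosure.

```python
# Sketch: deriving a digital "acoustic barrier filter" that mimics the spectral
# effect of the physical barrier. Band edges, order, and sampling rate are
# assumed example values, not requirements of the present techniques.
import numpy as np
from scipy.signal import butter, lfilter

FS = 16_000                       # assumed sampling rate, Hz
LOW_HZ, HIGH_HZ = 4_000, 7_900    # assumed pass band; upper edge kept below the 8 kHz Nyquist limit

def design_barrier_filter(fs=FS, low=LOW_HZ, high=HIGH_HZ, order=4):
    """Band pass filter standing in for the barrier's measured frequency response."""
    return butter(order, [low, high], btype="bandpass", fs=fs)

def apply_barrier_filter(signal, coeffs):
    """Filter a 1-D time-domain signal with the pre-designed barrier filter."""
    b, a = coeffs
    return lfilter(b, a, signal)
```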

In an embodiment, the acoustic barrier may be designed such that particular phonemes can be filtered based on their relative frequencies. The relative frequency of a phoneme is the frequency of that phoneme compared to other phonemes spoken by the same user. For example, the /s/ sound from a user may be spoken at a higher frequency relative to other portions of the same user's speech. Thus, the acoustic barrier may be designed to have a corresponding cutoff frequency that filters phonemes that naturally include higher frequency content than other phonemes.

Accordingly, the graph 206 represents the perceived spectrum of audio content received from the front sound source 202, illustrated in terms of its frequency content. Note that the audio received from the front sound source 202 is received with full-spectrum audio content. In contrast, the graph 208 represents the perceived spectrum of audio content received from the rear sound source 204, likewise illustrated in terms of its frequency content. Note that in the example of fig. 2, the front sound source 202 and the rear sound source 204 emit the same audio content, as shown by spectra 214A and 214B. However, as shown at 216, as the frequency of the audio content increases, the actually received spectrum under the solid line in graph 208 undergoes increasingly strong filtering.

The present technique implements a position detection routine that does not require a spectral representation or any other digital transformation, which reduces processing overhead. In particular, the present technology enables full 360° position detection in spaces of different sizes and shapes using simplified hardware (a two-microphone array and an acoustic barrier). In a conventional laptop computer with a microphone array mounted on top of the lid, the difference in frequency content between audio captured from front and rear audio sources can likewise be exploited, together with an acoustic barrier filter, to detect such source locations.

A similar situation can be seen in other platforms (e.g., conventional laptops) where a pair of microphones is located in an orientation in which the difference in frequency content can also be used to detect such a source location. For example, in a conventional laptop computer with a two-microphone array mounted on top of the lid, the lid itself may be considered an acoustic barrier that is acoustically transparent to low frequency sounds and acoustically opaque (similar to the human ear) to high frequency sounds. The opaque band of the barrier can be modeled as an acoustic barrier filter that is "created" by the material of the laptop lid itself. A schematic of this phenomenon is shown in fig. 2.

Fig. 3 is a block diagram of feature extraction 300 in accordance with the present technique. In fig. 3, feature extraction 300 is based on computing the Root Mean Square (RMS) of the difference of the normalized time-domain signals from the microphone pair. The RMS value of a signal may represent the average power or intensity associated with the signal. In an embodiment, the audio signal received by the microphone according to the present technique may be defined by a time frame or window. The time frame or window may be a time period of any length in which the audio signal is captured. In an embodiment, the signals for the same time window are obtained from each microphone pair. As described herein, the first microphone of a microphone pair may be referred to as microphone 1 and the second microphone of the microphone pair may be referred to as microphone 2. Descriptors as described herein may be computed on a per window basis for microphone pairs. As used herein, a descriptor provides a representation of an audio signal during a time window.

At block 302, audio signals during the identified time window are obtained from each microphone. Audio can be digitized by capturing air vibrations of sound and converting the vibrations into electrical signals. During the time window, the air vibration may be sampled at equally spaced times. The sampled audio may be represented as a time vector.

In an embodiment, each microphone detects a change in air pressure and sends a corresponding voltage change based on the air pressure change to an analog-to-digital converter, where the voltage is periodically sampled according to an audio sampling rate. The sampled audio values form a time-domain signal called a time vector. At block 302, audio captured by each of microphone 1 and microphone 2 is converted to a time vector, where the first time vector corresponds to microphone 1 and the second time vector corresponds to microphone 2. Each time vector is normalized to eliminate the effect of each microphone having a slightly different gain. The normalized time vector from the first microphone is subtracted from the normalized time vector from the second microphone to obtain the difference in content between the pair of microphones for the time window. In an embodiment, the subtraction is a vector subtraction performed element by element for each element of the time vector. A first RMS value of the resulting difference is calculated to obtain a first feature coefficient. The first feature coefficient is the RMS value of the direct content difference between the first microphone and the second microphone.
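
The computation at block 302 can be stated in a few lines of code. The following sketch uses peak normalization; the present disclosure only states that normalization removes per-microphone gain differences, so the exact normalization scheme is an assumption.

```python
# Sketch of block 302: normalize each microphone's time vector, subtract
# element by element, and take the RMS of the result as the first feature
# coefficient. Peak normalization is an assumed choice.
import numpy as np

def normalize(x):
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

def rms(x):
    return float(np.sqrt(np.mean(np.square(x))))

def direct_difference_rms(mic1, mic2):
    """First feature coefficient: RMS of the normalized direct difference."""
    return rms(normalize(mic2) - normalize(mic1))
```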

In an embodiment, the RMS value may be calculated as the square root of the arithmetic mean of the squares of the elements in the resulting difference. For a continuous waveform, the RMS value may also be calculated from the square of the function defining the waveform. The calculations performed when computing the RMS value do not include transformations such as the Fast Fourier Transform (FFT), the Laplace Transform, and the like. Thus, the use of RMS values results in lower computational cost when determining the location of a sound source. Furthermore, in addition to being less computationally expensive, the present technique also reduces the power consumed in determining the location of a sound source, due to the limited number of microphones required, compared to other microphone arrays using FFT-based cross-correlation and deep learning algorithms. Moreover, the present techniques do not require the use of any additional hardware, such as optical sensors, cameras, or ultrasonic sensors. In practice, optical devices are often unable to detect whether an object is itself emitting sound. Furthermore, the image processing performed on the output of such optical devices always implies a very large number of operations. Likewise, ultrasonic devices are limited to detecting solid surfaces that may or may not emit sound; in particular, ultrasonic devices do not allow detection of an active sound source.

For ease of description, the RMS value is used to derive the feature coefficients. However, any value proportional to the amplitude or energy of the signal may be used. For example, a Mean Absolute Value (MAV) may be applied to the difference in content to determine the feature coefficients. Further, the RMS values may be calculated in parallel.
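
As a small illustration of this interchangeability, a MAV metric can be dropped in wherever rms() is used in the sketches above; this is just one possible alternative metric, not a requirement of the present techniques.

```python
# Mean Absolute Value (MAV) as a drop-in alternative variability metric.
import numpy as np

def mav(x):
    return float(np.mean(np.abs(x)))

# e.g., feature = mav(normalize(mic2) - normalize(mic1)) instead of rms(...)
```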

At block 304, a second feature coefficient is computed for the microphone pair from the audio captured during the time window. At block 304, a delay is applied to the audio signal captured by the second microphone of the microphone pair. In an embodiment, the samples captured by the second microphone may be delayed by a predetermined number of samples. At block 304, the second channel is delayed by a small and fixed number "D" of samples (about 2 for a sampling frequency of 16 kHz) before the subtraction is performed. The delay is not determined using cross-correlation. In an embodiment, the delay is selected such that the number of samples represented by the delay is a fraction of the number of samples in a single period of the audio captured within the time window. The number of samples in the delay may be 2 to 5 samples.

Each time vector is normalized to eliminate the effect of each microphone having a slightly different gain. Thus, the time vector sampled from the audio captured by the first microphone is normalized and the time vector sampled from the audio captured by the second microphone and delayed is normalized.

The normalized time vector from the first microphone may be subtracted from the normalized, delayed time vector from the second microphone to obtain a difference in content for the time window. In an embodiment, the subtraction is a vector subtraction performed element by element for each element of the time vector. A second RMS value of the resulting difference, which is related to the delay between the two microphone signals, is calculated to obtain a second feature coefficient. The second feature coefficient is the RMS value of the delayed content difference between the first microphone and the second microphone.
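
A minimal sketch of block 304 follows, reusing normalize() and rms() from the earlier sketch. The zero-padded shift used to realize the delay and the default D = 2 (the text's example value for a 16 kHz sampling rate) are implementation assumptions.

```python
# Sketch of block 304: delay the second channel by D samples, then take the
# RMS of the normalized difference. Zero padding at the start is an assumption.
import numpy as np

def delay_samples(x, d):
    """Shift x right by d samples, zero-padding the start and keeping the length."""
    if d <= 0:
        return x
    return np.concatenate([np.zeros(d, dtype=x.dtype), x[:-d]])

def delayed_difference_rms(mic1, mic2, d=2):
    """Second feature coefficient: RMS of the delayed, normalized difference."""
    return rms(normalize(delay_samples(mic2, d)) - normalize(mic1))
```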

At block 306, the audio signals during the identified time window are obtained from each microphone. At block 306, the acoustic barrier filter is applied to the audio captured by each of microphone 1 and microphone 2. In an embodiment, the filter may be a band pass filter consistent with the physical acoustic barrier. This helps ensure that when the sound source is located behind the barrier, the signal has a very different profile than when the source is located in front of the barrier. In particular, the digital filter may simulate the frequency response of the physical acoustic barrier present on the device. The signals from the two time vectors are normalized and subtracted element by element. The RMS value of the resulting difference is then calculated.

The present technology implements a variability metric, such as an RMS value, that distinguishes between captured microphone signals based on the location of the sound source. For example, if the sound source is generally located in front of the microphone array, without an acoustic barrier that substantially obstructs the path from the sound source to the microphone array, a comparison of the digitally filtered audio signal and the unfiltered audio signal shows a very different audio signal. A comparison of a digitally filtered audio signal and an unfiltered audio signal shows a similar audio signal if the sound source is generally located behind the microphone array with an acoustic barrier obstructing the path from the sound source to the microphone array. In an embodiment, the greater the impact of the physical acoustic barrier on the audio signal, the higher the likelihood that the audio source is located where the audio signal is significantly obstructed by the acoustic barrier. In this case, the content of the filtered audio signal and the unfiltered audio signal is similar. However, if the audio signal originates from a sound source located substantially in front of the physical acoustic barrier, the content of the filtered audio signal and the unfiltered audio signal is different, because the unfiltered signal typically contains a larger range of frequency content than the digitally filtered signal. Thus, in an embodiment, a high pass filter having the same cut-off frequency as the acoustic barrier may be implemented to emphasize the difference between the audio signals from the front of the physical acoustic barrier and the back of the physical acoustic barrier.

Thus, at block 306, the filtered audio signals are converted to time vectors, where the first time vector corresponds to microphone 1 and the second time vector corresponds to microphone 2. Each time vector produced from the filtered audio is normalized to eliminate the effect of each microphone having a slightly different gain. The normalized time vector from the first microphone is subtracted from the normalized time vector from the second microphone to obtain the difference in content between the microphone pair for the time window. In an embodiment, the subtraction is a vector subtraction performed element by element for each element of the time vector. A third RMS value of the resulting difference is calculated to obtain a third feature coefficient. The third feature coefficient is the RMS value of the filtered content difference between the first microphone and the second microphone.

At block 308, a fourth feature coefficient is calculated for the microphone pair from the filtered audio captured during the time window. At block 308, a delay is applied to the filtered audio signal captured by the second microphone of the microphone pair. In an embodiment, the samples captured by the second microphone may be delayed by a predetermined number of samples. At block 308, the second channel is delayed by a small and fixed number "D" of samples (about 2 for a sampling frequency of 16 kHz) before the subtraction is performed. Each time vector is normalized to eliminate the effect of each microphone having a slightly different gain. Thus, the time vector sampled from the filtered audio captured by the first microphone is normalized, and the time vector sampled from the filtered audio captured by the second microphone and delayed is normalized. A fourth RMS value of the resulting difference, which is related to the delay between the two microphone signals, is calculated to obtain a fourth feature coefficient. The fourth feature coefficient is the RMS value of the filtered and delayed content difference between the first microphone and the second microphone.

At block 310, all feature coefficients are concatenated into a final feature vector corresponding to the analyzed time window. Specifically, the first feature coefficient, the second feature coefficient, the third feature coefficient, and the fourth feature coefficient are concatenated to form a feature vector representing the time window. The full feature vector includes the RMS values of the direct channel difference, the delayed channel difference, the filtered channel difference, and the filtered and delayed channel difference found at blocks 302, 304, 306, and 308. In an embodiment, the feature vector is input into a trained neural network. The neural network may be trained to determine the location of the audio source that produced the audio captured during the time window.
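
Putting blocks 302-308 together, the per-window feature vector can be assembled as in the sketch below, which reuses normalize(), rms(), delay_samples(), and the barrier-filter helpers from the earlier sketches; parameter values remain illustrative.

```python
# Sketch of blocks 302-310: compute the four RMS feature coefficients for one
# time window and concatenate them into the feature vector.
import numpy as np

def extract_feature_vector(mic1, mic2, barrier_coeffs, d=2):
    f1 = rms(normalize(mic2) - normalize(mic1))                        # block 302: direct difference
    f2 = rms(normalize(delay_samples(mic2, d)) - normalize(mic1))      # block 304: delayed difference
    mic1_f = apply_barrier_filter(mic1, barrier_coeffs)
    mic2_f = apply_barrier_filter(mic2, barrier_coeffs)
    f3 = rms(normalize(mic2_f) - normalize(mic1_f))                    # block 306: filtered difference
    f4 = rms(normalize(delay_samples(mic2_f, d)) - normalize(mic1_f))  # block 308: filtered, delayed difference
    return np.array([f1, f2, f3, f4])                                  # block 310: concatenation
```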

The diagram of fig. 3 is not intended to indicate that the example feature extraction 300 includes all of the components shown in fig. 3. Rather, the example feature extraction 300 may be implemented using fewer or additional components (e.g., additional variability metrics, neural networks, filters, etc.) not shown in fig. 3.

Fig. 4 is an illustration of location classification 400. In fig. 4, a scheme of the full source position detection pipeline is shown. Fig. 4 includes a sound source 402. The laptop 404 includes a microphone array 406 having two microphone sensors. In particular, the microphone array 406 includes a first microphone 406A and a second microphone 406B. The microphones 406A and 406B may capture audio signals emanating from the sound source 402. In addition, the lid of the laptop 404 acts as an acoustic barrier to the audio signal emitted by the sound source 402.

The audio signals from the sound source 402 may be processed as described with respect to fig. 3 to obtain the feature vector 408. The feature vector 408 may be input to a location classifier 410. The classifier can be, for example, a supervised machine learning classifier that outputs the source location 412. The source location may be an angle identifying the location of the sound source relative to the microphone array. For example, the location classifier may output an angle of arrival or azimuth associated with the sound. The classifier may be a feed-forward network with two layers. The location classifier 410 may be constructed as a shallow neural network that generates a location from the input features. The location classifier may also be able to estimate a more general location, such as a distance or an altitude.
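
The following is a minimal sketch of such a classifier's forward pass: a small two-layer feed-forward network mapping the 4-element feature vector to one of eight candidate angles. The layer width, the tanh/softmax choices, and the eight-angle output set are illustrative assumptions; the trained weights would come from a training procedure such as the one described with respect to fig. 6.

```python
# Sketch of a shallow feed-forward location classifier: one hidden layer plus
# a softmax output over eight candidate angles of arrival. Sizes are assumed.
import numpy as np

ANGLES = np.array([0, 45, 90, 135, 180, 225, 270, 315])  # candidate angles, degrees

def classify_location(features, w1, b1, w2, b2):
    """Return the estimated angle of arrival for one 4-element feature vector."""
    hidden = np.tanh(features @ w1 + b1)          # hidden layer
    logits = hidden @ w2 + b2                     # output layer, one logit per angle
    probs = np.exp(logits - np.max(logits))
    probs /= np.sum(probs)                        # softmax
    return int(ANGLES[np.argmax(probs)])
```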

FIG. 5 is an illustration of exemplary form factors. In particular, fig. 5 shows examples of an acoustic barrier combined with a two-microphone array in a laptop 502, a smart speaker 508, and a smart vehicle 514. The laptop 502 may include a microphone array 504. The microphone array 504 includes microphones 504A and 504B. As shown, the acoustic barrier is formed by the lid 506 of the laptop 502. In this manner, sound encountered by microphones 504A and 504B undergoes filtering due to the acoustic barrier 506. The particular filtering created by the acoustic barrier 506 may be replicated digitally to filter the received signals when deriving the full feature vector. The specific frequency response of the digital filtering may be the same as the actual physical filtering provided by the acoustic barrier 506.

The smart speaker 508 may include a microphone array 510. The microphone array 510 includes microphones 510A and 510B. An acoustic barrier 512 is formed in the vicinity of the microphone array 510. As shown, the acoustic barrier defines a semi-circular area within which microphone 510A and microphone 510B are located. In this manner, sound encountered by microphones 510A and 510B may undergo filtering due to the acoustic barrier 512. As described above, the particular filtering created by the acoustic barrier 512 may be replicated digitally to filter the received signals when deriving the full feature vector. The specific frequency response of the digital filtering may be the same as the actual physical filtering provided by the acoustic barrier 512.

Similarly, the vehicle 514 may include a microphone array 516. The microphone array 516 includes microphones 516A and 516B. An acoustic barrier 518 is formed near the microphone array 516. In the example of the smart vehicle 514, the acoustic barrier is formed by the physical enclosure of the frame of the smart vehicle 514. For example, the frame 518A of the vehicle 514 may form a portion of the acoustic barrier. In addition, glass 518B disposed throughout the frame of the vehicle 514 may also form a portion of the acoustic barrier 518. The particular filtering created by the acoustic barrier 518 may be replicated digitally to filter the signals received by the microphones 516A and 516B when deriving the full feature vectors. The specific frequency response of the digital filtering may be the same as the actual physical filtering provided by the acoustic barrier 518. Although specific form factors have been described, the present techniques may be used with any form factor having an acoustic barrier and two microphones. Thus, this concept can be implemented in different form factors or systems, such as conventional laptop computers, smart speakers or other home/office equipment, and vehicles.

Fig. 6 is an illustration of an exemplary environment 600 in which audio sources may be placed. The laptop computer 602 may include a microphone array 604. A spherical coordinate system 606 is shown with the laptop 602 located at its origin, and audio sources placed one meter from the laptop 602. In an embodiment, the location classifier outputs the sound location as an azimuth angle. The azimuth angle may be used to determine a vector from the origin to the location of the sound source. In this way, the location of the sound source can be identified.
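
As a small worked example of that last step, the azimuth can be converted into a direction vector in the horizontal plane of the coordinate system of fig. 6; the axis convention (0° pointing directly in front of the laptop, counter-clockwise positive) and the 1 m radius are assumptions made for illustration.

```python
# Worked example: azimuth angle -> position vector from the origin (laptop).
import numpy as np

def azimuth_to_position(azimuth_deg, radius_m=1.0):
    """Return an (x, y, z) position on the horizontal plane at the given azimuth."""
    a = np.deg2rad(azimuth_deg)
    return radius_m * np.array([np.cos(a), np.sin(a), 0.0])

# azimuth_to_position(90.0) -> approximately [0.0, 1.0, 0.0]: one meter to the side
```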

Consider an exemplary use case with a total of 1500 audio segments, each of one second duration and with a sampling frequency of 44100 Hz. The audio segments may be recorded at eight different angles (0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°) at a distance of one meter around the open laptop 602. In the example of fig. 6, the acoustic barrier filter may be selected to cover 4000 Hz to 8000 Hz.

In this example, 80% of the randomly selected segments are used for training and the rest (20%, 300 samples) are used for validation. Features are obtained from the audio samples using the proposed routine described in fig. 3, where the fixed delay D is 3 samples. A shallow fully connected neural network consisting of 2 inputs, 2 hidden layers, and 6 neurons at the output (22 total neurons) was trained and tested using the features shown in fig. 3, and the classification results were measured and compared against the ground-truth labels of the validation samples.
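
A sketch of this training/validation protocol is shown below, using scikit-learn's MLPClassifier as a stand-in for the shallow fully connected network; the 80/20 split mirrors the example, while the hidden-layer sizes and other hyperparameters are assumptions rather than the exact topology reported above.

```python
# Sketch: 80/20 split over per-window feature vectors and a small MLP classifier.
# X has shape (n_windows, 4) from extract_feature_vector(); y holds angle labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def train_and_validate(X, y, seed=0):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=seed)
    clf = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=2000, random_state=seed)
    clf.fit(X_tr, y_tr)
    accuracy = clf.score(X_va, y_va)   # fraction of correctly classified validation windows
    return clf, accuracy
```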

The results of the present technique applied to the example of fig. 6 are shown below. It can be noted that out of all 300 validation samples, the neural network misclassified only 2, for a correct angle-of-arrival classification rate of 99.3%.

Measured angle      True angle
                 0°    45°   90°   135°  180°  225°  270°  315°
0°               42     0     0     0     0     0     2     0
45°               0    33     0     0     0     0     0     0
90°               0     0    31     0     0     0     0     0
135°              0     0     0    40     0     0     0     0
180°              0     0     0     0    47     0     0     0
225°              0     0     0     0     0    38     0     0
270°              0     0     0     0     0     0    34     0
315°              0     0     0     0     0     0     0    33

TABLE 1

The results in table 1 demonstrate the feasibility of using an array of two microphones together with a human-ear-inspired acoustic barrier to achieve full 360° angle-of-arrival detection. The method relies on only two microphones and a very lightweight neural network for sound source localization, which eliminates the need for a Digital Signal Processor (DSP) to process the incoming signals for this task. In a very simple implementation, it successfully detects audio from all 360° around the array (which conventional techniques cannot achieve with such a small array) and correctly classifies 99.3% of the validation samples.

Fig. 7 is a process flow diagram of a method 700. The example method 700 may be implemented in the feature extraction 300 of fig. 3, the computing device 800 of fig. 8, or the computer-readable medium 900 of fig. 9. In some examples, the method 700 may be implemented using the location classifier 400 of fig. 4. At block 702, variability metrics of the direct difference, the delay difference, the filtered direct difference, and the filtered delay difference are calculated. At block 704, the calculated variability metrics are concatenated to obtain a feature vector. At block 706, the feature vector is input into a location classifier to obtain a source location.

The process flow diagram is not intended to indicate that the blocks of the example method 700 are to be performed in any particular order, or that all of the blocks are included in each case. Further, any number of additional blocks not shown may be included within exemplary method 700 depending on the details of the particular implementation. For example, the audio signals may be captured by a pair of microphones and normalized prior to calculating the variability metric.

Fig. 8 is a block diagram of an electronic device that enables lightweight full 360-degree audio source localization using two microphones. The location of the audio source can be determined in real time. The electronic device 800 may be, for example, a laptop computer, a tablet computer, a mobile phone, a smart phone, a wearable headset, a smart headset, smart glasses, a speaker system, or a vehicle, etc. The electronic device 800 may include a Central Processing Unit (CPU) 802 configured to execute stored instructions, and a memory device 804 that stores instructions executable by the CPU 802. The CPU may be coupled to the memory device 804 through a bus 806. Further, the CPU 802 may be a single-core processor, a multi-core processor, a computing cluster, or any number of other configurations. Further, the electronic device 800 may include more than one CPU 802. The memory device 804 may include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, or any other suitable memory system. For example, the memory device 804 may include Dynamic Random Access Memory (DRAM).

Computing device 800 may also include a Graphics Processing Unit (GPU) 808. As shown, CPU 802 may be coupled to GPU 808 via bus 806. The GPU 808 may be configured to perform any number of graphics operations within the computing device 800. For example, the GPU 808 may be configured to render or manipulate graphical images, graphical frames, videos, and the like to be displayed to a user of the computing device 800.

The memory device 804 may include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, or any other suitable memory system. For example, memory device 804 may include Dynamic Random Access Memory (DRAM). The memory device 804 may include device drivers 810, the device drivers 810 configured to execute instructions for training a plurality of convolutional neural networks to perform sequence-independent processing. The device driver 810 may be software, an application program, application code, or the like.

The CPU 802 may also be connected via the bus 806 to an input/output (I/O) device interface 812, which interface 812 is configured to connect the computing device 800 to one or more I/O devices 814. The I/O devices 814 may include, for example, a keyboard and a pointing device, wherein the pointing device may include a touchpad or a touchscreen, among others. The I/O device 814 may be a built-in component of the computing device 800 or may be a device externally connected to the computing device 800. In some examples, the memory 804 may be communicatively coupled to the I/O device 814 through Direct Memory Access (DMA).

The CPU 802 may also be linked through the bus 806 to a display interface 816, the display interface 816 being configured to connect the computing device 800 to a display device 818. The display device 818 may include a display screen that is a built-in component of the computing device 800. Display device 818 may also include a computer monitor, television, or projector, etc. internal to computing device 800 or externally connected to computing device 800.

Computing device 800 also includes storage 820. Storage 820 is a physical memory, such as a hard disk drive, an optical disk drive, a thumb drive, a drive array, a solid state drive, or any combination thereof. Storage 820 may also include remote storage drives.

Computing device 800 may also include a Network Interface Controller (NIC) 822. The NIC 822 may be configured to connect the computing device 800 to a network 824 via the bus 806. The network 824 may be a Wide Area Network (WAN), a Local Area Network (LAN), the internet, or the like. In some examples, the device may communicate with other devices via wireless technology. For example, the device may communicate with other devices over a wireless local area network connection. In some examples, the device may connect and communicate with other devices via Bluetooth® or a similar technology.

The electronic device 800 may also include a microphone array 826. The microphone array 826 includes two independent microphones. In an embodiment, each microphone may be a microelectromechanical system (MEMS) microphone. Audio from a sound source may be captured by the microphone array 826. A position detector 828 may obtain the electrical signals captured by the microphones and determine the position of the sound source. In particular, a variability metric unit 830 may be used to calculate the feature coefficients associated with the microphone pair. The variability metric may be any value proportional to the amplitude or energy of the signal. For example, an RMS value or Mean Absolute Value (MAV) may be applied to the content differences to determine the feature coefficients. A serializer 832 may concatenate the feature coefficients into a feature vector. A location classifier 834 may take the feature vector as input and determine the location.

The block diagram of fig. 8 is not intended to indicate that the computing device 800 includes all of the components shown in fig. 8. Rather, computing system 800 may include fewer components or additional components not shown in fig. 8 (e.g., sensors, power management integrated circuits, additional network interfaces, etc.). Depending on the details of the particular implementation, computing device 800 may include any number of additional components not shown in fig. 8. Further, any of the functions of the CPU 802 may be partially or fully implemented in hardware and/or in a processor. For example, the functions may be implemented in an application specific integrated circuit, in logic implemented in a processor, in logic implemented in a dedicated graphics processing unit, or in any other device.

Fig. 9 is a block diagram illustrating a medium 900 that enables lightweight full 360-degree audio source localization using two microphones. The medium 900 may be a computer-readable medium, including a non-transitory medium, storing code accessible by a processor 902 over a computer bus 904. The computer-readable medium 900 may be, for example, a volatile or non-volatile data storage device. The medium 900 may also be a logic unit such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or an arrangement of logic gates implemented in one or more integrated circuits.

The medium 900 may include modules 906-910 configured to perform the techniques described herein. For example, the variability metric module 906 may be configured to calculate the feature coefficients associated with the microphone pair. The variability metric may be any value proportional to the amplitude or energy of the signal. For example, an RMS value or Mean Absolute Value (MAV) may be applied to the content differences to determine the feature coefficients. The concatenating module 908 is configured to concatenate the feature coefficients into a feature vector. The classification module 910 may be configured to take the feature vector as input and determine the location. In some embodiments, the modules 906-910 may be modules of computer code configured to direct the operation of the processor 902.

The block diagram of FIG. 9 is not intended to indicate that the medium 900 includes all of the components shown in FIG. 9. Furthermore, depending on the details of the particular implementation, media 900 may include any number of additional components not shown in fig. 9.

Example 1 is a system. The system comprises: a physical acoustic barrier; a microphone array comprising a first microphone and a second microphone; and at least one hardware processor configured to: identify a predetermined acoustic barrier filter, wherein the acoustic barrier filter is consistent with the physical acoustic barrier; receive audio signals at the first microphone and the second microphone within a time window; calculate a first variability metric of a direct difference of the audio signals received at the first microphone and the second microphone; calculate a second variability metric of a delay difference of the audio signals received at the first microphone and the second microphone; calculate a third variability metric of a filtered direct difference of the audio signals received at the first microphone and the second microphone, wherein the audio signals are filtered by the predetermined acoustic barrier filter; calculate a fourth variability metric of a filtered delay difference of the audio signals received at the first microphone and the second microphone, wherein the audio signals are filtered by the predetermined acoustic barrier filter; concatenate the first variability metric, the second variability metric, the third variability metric, and the fourth variability metric to form a feature vector; and input the feature vector into a location classifier to obtain an audio source location.

Example 2 includes the system of example 1, including or excluding the optional feature. In this example, the predetermined acoustic barrier filter is consistent with the physical acoustic barrier by replicating the frequency response of the physical acoustic barrier.

Example 3 includes the system of any of examples 1-2, including or excluding the optional feature. In this example, the location classifier is a shallow neural network.

Example 4 includes the system of any of examples 1-3, including or excluding the optional feature. In this example, the first, second, third, and fourth variability metrics are root mean square values.

Example 5 includes the system of any of examples 1-4, including or excluding the optional feature. In this example, the first, second, third, and fourth variability metrics are root mean square values.

Example 6 includes the system of any of examples 1-5, including or excluding the optional feature. In this example, the predetermined acoustic barrier filter is a band pass filter that is consistent with the physical acoustic barrier.

Example 7 includes the system of any of examples 1-6, including or excluding the optional feature. In this example, the physical acoustic barrier is a surface that alters the frequency components of the audio signal from the audio source.

Example 8 includes the system of any of examples 1-7, including or excluding the optional feature. In this example, the difference is calculated by normalizing the audio signals received by the first and second microphones and subtracting the normalized audio signal captured by the first microphone from the normalized audio signal captured by the second microphone.
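A minimal sketch of this normalize-then-subtract step follows, under the assumption of peak normalization; the disclosure does not specify which normalization is used.

```python
# Sketch of the direct difference: normalize both windows, then subtract the
# first-microphone signal from the second-microphone signal.
import numpy as np

def direct_difference(mic1, mic2):
    n1 = mic1 / (np.max(np.abs(mic1)) + 1e-12)   # peak normalization (assumed)
    n2 = mic2 / (np.max(np.abs(mic2)) + 1e-12)
    return n2 - n1
```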

Example 9 includes the system of any of examples 1-8, including or excluding the optional feature. In this example, the delayed audio signal is generated by delaying the audio signal at the second microphone by a predetermined number of samples.
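The delay difference can be sketched the same way; the 8-sample delay below is an assumed value for the predetermined number of samples.

```python
# Sketch of the delay difference: delay the second-microphone signal by a
# predetermined number of samples before subtracting.
import numpy as np

DELAY = 8   # assumed predetermined number of samples

def delay_difference(mic1, mic2, delay=DELAY):
    delayed_mic2 = np.concatenate([np.zeros(delay), mic2[:-delay]])
    return mic1 - delayed_mic2
```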

Example 10 includes the system of any of examples 1-9, including or excluding the optional feature. In this example, the audio source location is the angle of arrival.

Example 11 is a method. The method comprises: identifying a predetermined acoustic barrier filter, wherein the acoustic barrier filter is consistent with the physical acoustic barrier; receiving audio signals at a first microphone and a second microphone within a time window; calculating a first variability metric of a direct difference of the audio signals received at the first and second microphones, a second variability metric of a delay difference of the audio signals received at the first and second microphones, a third variability metric of a filtered direct difference of the audio signals received at the first and second microphones, wherein the audio signals are filtered by the predetermined acoustic barrier filter, and a fourth variability metric of a filtered delay difference of the audio signals received at the first and second microphones, wherein the audio signals are filtered by the predetermined acoustic barrier filter; concatenating the first, second, third, and fourth variability metrics to form a feature vector; and inputting the feature vector into a location classifier to obtain an audio source location.

Example 12 includes the method of example 11, including or excluding the optional feature. In this example, the predetermined acoustic barrier filter is consistent with the physical acoustic barrier by replicating the frequency response of the physical acoustic barrier.

Example 13 includes the method of any of examples 11 to 12, including or excluding the optional feature. In this example, the location classifier is a shallow neural network.

Example 14 includes the method of any one of examples 11 to 13, including or excluding the optional feature. In this example, the first, second, third, and fourth variability metrics are root mean square values.

Example 15 includes the method of any of examples 11 to 14, including or excluding the optional feature. In this example, the first, second, third, and fourth variability metrics are mean absolute values.

Example 16 includes the method of any of examples 11 to 15, including or excluding the optional feature. In this example, the predetermined acoustic barrier filter is a band pass filter that is consistent with the physical acoustic barrier.

Example 17 includes the method of any one of examples 11 to 16, including or excluding the optional feature. In this example, the physical acoustic barrier is a surface that alters the frequency components of the audio signal from the audio source.

Example 18 includes the method of any one of examples 11 to 17, including or excluding the optional feature. In this example, the difference is calculated by normalizing the audio signals received by the first and second microphones and subtracting the normalized audio signal captured by the first microphone from the normalized audio signal captured by the second microphone.

Example 19 includes the method of any one of examples 11 to 18, including or excluding the optional feature. In this example, the delayed audio signal is generated by delaying the audio signal at the second microphone by a predetermined number of samples.

Example 20 includes the method of any of examples 11 to 19, including or excluding the optional feature. In this example, the audio source location is the angle of arrival.

Example 21 is at least one computer-readable medium for audio source location detection, having instructions stored thereon. The computer-readable medium includes instructions that direct a processor to: identify a predetermined acoustic barrier filter, wherein the acoustic barrier filter is consistent with the physical acoustic barrier; receive audio signals at a first microphone and a second microphone within a time window; calculate a first variability metric of a direct difference of the audio signals received at the first and second microphones, a second variability metric of a delay difference of the audio signals received at the first and second microphones, a third variability metric of a filtered direct difference of the audio signals received at the first and second microphones, wherein the audio signals are filtered by the predetermined acoustic barrier filter, and a fourth variability metric of a filtered delay difference of the audio signals received at the first and second microphones, wherein the audio signals are filtered by the predetermined acoustic barrier filter; concatenate the first variability metric, the second variability metric, the third variability metric, and the fourth variability metric to form a feature vector; and input the feature vector into a location classifier to obtain an audio source location.

Example 22 includes the computer-readable medium of example 21, including or excluding the optional feature. In this example, the predetermined acoustic barrier filter is consistent with the physical acoustic barrier by replicating the frequency response of the physical acoustic barrier.

Example 23 includes the computer-readable medium of any one of examples 21 to 22, including or excluding the optional feature. In this example, the location classifier is a shallow neural network.

Example 24 includes the computer-readable medium of any of examples 21 to 23, including or excluding the optional feature. In this example, the first, second, third, and fourth variability metrics are root mean square values.

Example 25 includes the computer-readable medium of any one of examples 21 to 24, including or excluding the optional feature. In this example, the first, second, third, and fourth variability metrics are mean absolute values.

Some embodiments may be implemented in one or a combination of hardware, firmware, and software. Some embodiments may also be implemented as instructions stored on a tangible, non-transitory, machine-readable medium, which may be read and executed by a computing platform to perform the operations described. Further, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium may include Read Only Memory (ROM); random Access Memory (RAM); a magnetic disk storage medium; an optical storage medium; a flash memory device; or electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, or interfaces that transmit and/or receive signals), and so forth.

An embodiment is an implementation or example. Reference in the specification to "an embodiment," "one embodiment," "some embodiments," "various embodiments," or "other embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the technology. Various appearances of "an embodiment," "one embodiment," or "some embodiments" are not necessarily all referring to the same embodiments.

Not all components, features, structures, characteristics, etc. described and illustrated herein need be included in a particular embodiment or embodiments. For example, if the specification states a component, feature, structure, or characteristic "may", "might", "can", or "could" be included, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claims refer to "a" or "an" element, that does not mean there is only one of the element. If the specification or claims refer to "an additional" element, that does not preclude there being more than one of the additional element.

It is noted that although some embodiments have been described with reference to particular implementations, other implementations are possible according to some embodiments. Moreover, the arrangement and/or order of circuit elements or other features illustrated in the drawings and/or described herein need not be arranged in the particular way illustrated and described. Many other arrangements are possible according to some embodiments.

In each system shown in the figures, the elements may in some cases each have the same reference number or a different reference number to suggest that the elements represented could be different and/or similar. However, an element may be flexible enough to have different implementations and work with some or all of the systems shown or described herein. The various elements shown in the figures may be the same or different. Which one is referred to as a first element and which is called a second element is arbitrary.

It will be appreciated that the details of the foregoing examples may be used anywhere in one or more examples. For example, all optional features of the computing device described above may also be implemented with respect to any of the methods or computer readable media described herein. Further, although flow diagrams and/or state diagrams may have been used herein to describe embodiments, the techniques are not limited to those diagrams or to corresponding descriptions herein. For example, flow need not move through each illustrated block or state or in exactly the same order as illustrated and described herein.

The present technology is not limited to the specific details set forth herein. Indeed, those skilled in the art having the benefit of this disclosure will appreciate that many other variations from the foregoing description and drawings may be made within the scope of the present technology. Accordingly, it is the following claims, including any amendments thereto, that define the scope of the present technology.
