Method and apparatus for fingerprinting audio signals via normalization


Reading note: This technology, "Method and apparatus for fingerprinting audio signals via normalization," was designed and created by R. Coover and Z. Rafii on 2019-09-06. Abstract: Methods, apparatus, systems, and articles of manufacture to fingerprint audio via mean normalization are disclosed. An example apparatus for audio fingerprinting includes: a frequency range separator that transforms an audio signal to a frequency domain, the transformed audio signal comprising a plurality of time-frequency bins, the plurality of time-frequency bins comprising a first time-frequency bin; an audio characteristic determiner to determine a first characteristic of a first set of time-frequency bins of the plurality of time-frequency bins, the first set of time-frequency bins surrounding the first time-frequency bin; and a signal normalizer that normalizes the audio signal, thereby generating normalized energy values, the normalization of the audio signal including normalizing the first time-frequency bin by the first characteristic. The example apparatus also includes a point selector that selects one of the normalized energy values; and a fingerprint generator that generates a fingerprint of the audio signal using the selected one of the normalized energy values.

1. An apparatus for audio fingerprinting, the apparatus comprising:

a frequency range separator that transforms an audio signal to a frequency domain, the transformed audio signal comprising a plurality of time-frequency bins, the plurality of time-frequency bins comprising a first time-frequency bin;

an audio characteristic determiner to determine a first characteristic of a first set of time-frequency bins of the plurality of time-frequency bins, the first set of time-frequency bins surrounding the first time-frequency bin;

a signal normalizer to normalize the audio signal to generate normalized energy values, the normalization of the audio signal including normalizing the first time-frequency bin by the first characteristic;

a point selector that selects one of the normalized energy values; and

a fingerprint generator that generates a fingerprint of the audio signal using the selected one of the normalized energy values.

2. The apparatus of claim 1, wherein the frequency range separator is further to perform a fast Fourier transform of the audio signal.

3. The apparatus of claim 1, wherein the point selector is further to:

determine a category of the audio signal; and

weight the selection of the one of the normalized energy values by the category of the audio signal.

4. The apparatus of claim 3, wherein the category of the audio signal comprises at least one of music, human voice, sound effects, or advertisements.

5. The apparatus of claim 1, wherein the audio characteristic determiner further determines a second characteristic of a second set of the plurality of time-frequency bins surrounding a second time-frequency bin of the plurality of time-frequency bins, and the signal normalizer further normalizes the second time-frequency bin by the second characteristic.

6. The apparatus of claim 1, wherein the point selector selects the one of the normalized energy values based on an energy extremum of the normalized audio signal.

7. The apparatus of claim 1, wherein each time-frequency bin of the plurality of time-frequency bins is a unique combination of: (1) a time segment of the audio signal and (2) a frequency bin of the transformed audio signal.

8. A method for audio fingerprinting, the method comprising the steps of:

transforming an audio signal into a frequency domain, the transformed audio signal comprising a plurality of time-frequency bins, the plurality of time-frequency bins comprising a first time-frequency bin;

determining a first characteristic of a first set of time-frequency bins of the plurality of time-frequency bins, the first set of time-frequency bins surrounding the first time-frequency bin;

normalizing the audio signal, thereby generating normalized energy values, the normalizing of the audio signal comprising normalizing the first time-frequency bin by the first characteristic;

selecting one of the normalized energy values; and

generating a fingerprint of the audio signal using the selected one of the normalized energy values.

9. The method of claim 8, wherein transforming the audio signal to the frequency domain comprises performing a fast Fourier transform of the audio signal.

10. The method of claim 8, wherein selecting one of the normalized energy values comprises:

determining a category of the audio signal; and

weighting the selection of the one of the normalized energy values by the category of the audio signal.

11. The method of claim 10, wherein the category of the audio signal comprises at least one of music, human voice, sound effects, or advertisements.

12. The method of claim 8, further comprising:

determining a second characteristic of a second set of time-frequency bins of the plurality of time-frequency bins, the second set of time-frequency bins surrounding a second time-frequency bin of the plurality of time-frequency bins; and

normalizing the second time-frequency bin by the second characteristic.

13. The method of claim 8, wherein the step of selecting one of the normalized energy values is based on an energy extremum of the normalized audio signal.

14. The method of claim 8, wherein each time-frequency bin of the plurality of time-frequency bins is a unique combination of: (1) a time segment of the audio signal and (2) a frequency bin of the transformed audio signal.

15. A non-transitory computer-readable storage medium comprising instructions that, when executed, cause a processor to at least:

transform an audio signal into a frequency domain, the transformed audio signal comprising a plurality of time-frequency bins, the plurality of time-frequency bins comprising a first time-frequency bin;

determine a first characteristic of a first set of time-frequency bins of the plurality of time-frequency bins, the first set of time-frequency bins surrounding the first time-frequency bin;

normalize the audio signal, thereby generating normalized energy values, the normalizing of the audio signal comprising normalizing the first time-frequency bin by the first characteristic;

select one of the normalized energy values; and

generate a fingerprint of the audio signal using the selected one of the normalized energy values.

16. The non-transitory computer-readable storage medium of claim 15, wherein transforming the audio signal to the frequency domain comprises performing a fast Fourier transform of the audio signal.

17. The non-transitory computer-readable storage medium of claim 15, wherein the instructions, when executed, cause the processor to:

determine a category of the audio signal; and

weight the selection of the one of the normalized energy values by the category of the audio signal.

18. The non-transitory computer-readable storage medium of claim 17, wherein the category of the audio signal comprises at least one of music, human voice, sound effects, or advertisements.

19. The non-transitory computer-readable storage medium of claim 15, wherein the instructions, when executed, cause the processor to:

determine a second characteristic of a second set of time-frequency bins of the plurality of time-frequency bins, the second set of time-frequency bins surrounding a second time-frequency bin of the plurality of time-frequency bins; and

normalize the second time-frequency bin by the second characteristic.

20. The non-transitory computer-readable storage medium of claim 15, wherein each time-frequency bin of the plurality of time-frequency bins is a unique combination of: (1) a time segment of the audio signal and (2) a frequency bin of the transformed audio signal.

Technical Field

The present disclosure relates generally to audio signals, and more particularly, to methods and apparatus for fingerprinting audio signals via normalization.

Background

Audio information (e.g., sound, speech, music, etc.) may be represented as digital data (e.g., electronic, optical, etc.). The captured audio (e.g., via a microphone) may be digitized, electronically stored, processed, and/or classified. One method of classifying audio information is by generating an audio fingerprint. An audio fingerprint is a digital digest of audio information created by sampling a portion of an audio signal. Audio fingerprints have historically been used to identify audio and/or verify the authenticity of audio.

Drawings

FIG. 1 is an example system in which the teachings of the present disclosure may be implemented.

Fig. 2 is an example implementation of the audio processor of fig. 1.

Figs. 3A and 3B depict an example unprocessed spectrogram generated by the example frequency range separator of fig. 2.

Fig. 3C depicts an example of a normalized spectrogram generated by the signal normalizer of fig. 2 from the unprocessed spectrograms of fig. 3A and 3B.

Fig. 4 is the example unprocessed spectrogram of figs. 3A and 3B separated into fixed audio signal frequency components.

Fig. 5 is an example of a normalized spectrogram generated by the signal normalizer of fig. 2 from the fixed audio signal frequency components of fig. 4.

Fig. 6 is an example of a normalized and weighted spectrogram generated by the point selector of fig. 2 from the normalized spectrogram of fig. 5.

Figs. 7 and 8 are flow diagrams representing machine readable instructions that may be executed to implement the audio processor of fig. 2.

Fig. 9 is a block diagram of an example processing platform configured to execute the instructions of fig. 7 and 8 to implement the audio processor of fig. 2.

The figures are not drawn to scale. Generally, the same reference numbers will be used throughout the drawings and the following written description to refer to the same or like parts.

Detailed Description

Fingerprint or signature based media monitoring techniques typically utilize one or more inherent characteristics of the media being monitored during a monitoring interval to generate a substantially unique proxy for the media. Such proxies are referred to as signatures or fingerprints and may take any form (e.g., a series of digital values, a waveform, etc.) that represents any aspect of a media signal (e.g., an audio signal and/or a video signal that forms the media presentation being monitored). A signature may be a series of signatures collected continuously over a time interval. The terms "fingerprint" and "signature" are used interchangeably herein and are defined herein to mean a proxy generated from one or more inherent characteristics of media that may be used to identify the media.

Signature-based media monitoring typically involves determining (e.g., generating and/or collecting) a signature representative of a media signal (e.g., an audio signal and/or a video signal) output by a monitored media device, and comparing the monitored signature to one or more reference signatures corresponding to known (e.g., reference) media sources. Various comparison criteria (such as cross-correlation values, Hamming distances, etc.) may be evaluated to determine whether the monitored signature matches a particular reference signature.
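To make the comparison step concrete, the following Python sketch matches a monitored fingerprint against reference fingerprints using Hamming distance. It is illustrative only: the 32-byte fingerprint layout and the `max_distance` threshold are assumptions, not values from this disclosure.

```python
import numpy as np

def hamming_distance(fp_a: np.ndarray, fp_b: np.ndarray) -> int:
    """Count the differing bits between two fingerprints stored as uint8 arrays."""
    return int(np.unpackbits(fp_a ^ fp_b).sum())

def best_match(monitored: np.ndarray, references: dict, max_distance: int = 50):
    """Return the reference ID whose fingerprint is nearest to the monitored
    fingerprint in Hamming distance, or None if nothing is close enough."""
    best_id, best_dist = None, max_distance + 1
    for ref_id, ref_fp in references.items():
        dist = hamming_distance(monitored, ref_fp)
        if dist < best_dist:
            best_id, best_dist = ref_id, dist
    return best_id

# Usage: two 32-byte (256-bit) fingerprints differing in two bits.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=32, dtype=np.uint8)
mon = ref.copy()
mon[0] ^= 0b00000101                      # flip two bits of the monitored copy
print(best_match(mon, {"song-42": ref}))  # -> "song-42"
```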

When a match is found between the monitored signature and one of the reference signatures, the monitored media may be identified as corresponding to the particular reference media represented by the reference signature that matches the monitored signature. Because attributes such as the media identifier, presentation time, broadcast channel, etc. are collected for the reference signature, these attributes can then be associated with the monitored media whose monitored signature matches the reference signature. Example systems for identifying media based on codes and/or signatures have long been known and were first disclosed in U.S. Patent 5,481,294 to Thomas, the entire contents of which are incorporated herein by reference.

Historically, audio fingerprinting techniques have used the loudest portions of an audio signal (e.g., the portions with the greatest energy, etc.) to create a fingerprint over a period of time. In some cases, however, this approach has serious limitations. In some examples, the loudest portions of the audio signal may be associated with noise (e.g., unwanted audio) rather than with the audio of interest. For example, if a user attempts to fingerprint a song in a noisy restaurant, the loudest portions of the captured audio signal may be conversations between restaurant customers rather than the song or media to be identified. In this example, many sampled portions of the audio signal will contain background noise and no music, which reduces the usefulness of the generated fingerprint.

Another potential limitation of previous fingerprinting techniques is that, particularly in music, the audio in the low audio frequency range tends to be loudest. In some examples, the dominant bass frequency energy results in a sampled portion of the audio signal that is primarily within the bass frequency range. Thus, fingerprints generated using existing methods typically do not include samples from all parts of the audio spectrum that may be used for signature matching, especially samples in higher frequency ranges (e.g., treble ranges, etc.).

Example methods and apparatus disclosed herein overcome the above-described problems by generating fingerprints from audio signals using mean normalization. One example method includes normalizing one or more time-frequency bins of an audio signal by an audio characteristic of the surrounding audio region. As used herein, a "time-frequency bin" is a portion of an audio signal that corresponds to a particular frequency bin (e.g., an FFT bin) at a particular time (e.g., three seconds into the audio signal). In some examples, the normalization is weighted by the audio category of the audio signal. In some examples, the fingerprint is generated by selecting points from the normalized time-frequency bins.
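As a rough illustration of this per-bin scheme, the Python sketch below normalizes each time-frequency bin of an energy spectrogram by the mean energy of a surrounding rectangular region. The region size (11 × 11 bins) and the choice of the mean as the audio characteristic are assumptions for this example; the disclosure also contemplates other characteristics and region shapes.

```python
import numpy as np

def normalize_by_local_mean(spec, half_freq=5, half_time=5, eps=1e-12):
    """Normalize each time-frequency bin of an energy spectrogram
    (shape [freq_bins, time_bins]) by the mean energy of the surrounding
    region, clipping the region at the spectrogram edges."""
    n_freq, n_time = spec.shape
    out = np.empty_like(spec, dtype=float)
    for f in range(n_freq):
        f0, f1 = max(0, f - half_freq), min(n_freq, f + half_freq + 1)
        for t in range(n_time):
            t0, t1 = max(0, t - half_time), min(n_time, t + half_time + 1)
            # eps guards against division by zero in silent regions
            out[f, t] = spec[f, t] / (spec[f0:f1, t0:t1].mean() + eps)
    return out
```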

Another example method disclosed herein includes separating an audio signal into two or more audio signal frequency components. As used herein, an "audio signal frequency component" is a portion of an audio signal that corresponds to a range of frequencies and a time period. In some examples, an audio signal frequency component may comprise a plurality of time-frequency bins. In some examples, audio characteristics are determined for some of the audio signal frequency components. In this example, individual ones of the audio signal frequency components are normalized by an associated audio characteristic (e.g., mean energy, etc.). In some examples, the fingerprint is generated by selecting points from the normalized audio signal frequency components.
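A minimal sketch of this block-wise variant follows, assuming equal-sized components and mean-energy normalization; the block dimensions are illustrative, not prescribed by the disclosure.

```python
import numpy as np

def normalize_by_components(spec, freq_block=128, time_block=24, eps=1e-12):
    """Split an energy spectrogram into fixed frequency x time components and
    normalize every bin in a component by that component's mean energy."""
    out = np.empty_like(spec, dtype=float)
    n_freq, n_time = spec.shape
    for f0 in range(0, n_freq, freq_block):
        for t0 in range(0, n_time, time_block):
            block = spec[f0:f0 + freq_block, t0:t0 + time_block]
            out[f0:f0 + freq_block, t0:t0 + time_block] = block / (block.mean() + eps)
    return out
```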

Fig. 1 is an example system 100 in which the teachings of the present disclosure may be implemented. The example system 100 includes an example audio source 102 and an example microphone 104 that captures sound from the audio source 102 and converts the captured sound into an example audio signal 106. The example audio processor 108 receives the audio signal 106 and generates an example fingerprint 110.

The example audio source 102 emits audible sound. Example audio sources include speakers (e.g., electro-acoustic transducers, etc.), live performances, conversations, and/or any other suitable audio source. The sound emitted by the example audio source 102 may include desired audio (e.g., audio to be fingerprinted, etc.) and undesired audio (e.g., background noise, etc.). In the illustrated example, the audio source 102 is a speaker. In other examples, the audio source 102 may be any other suitable audio source (e.g., a person, etc.).

The example microphone 104 is a transducer that converts sound emitted by the audio source 102 into the audio signal 106. In some examples, the microphone 104 may be a component of a computer, a mobile device (e.g., a smartphone, a tablet, etc.), a navigation device, or a wearable device (e.g., a smartwatch, etc.). In some examples, the microphone may include analog-to-digital conversion to digitize the audio signal 106. In other examples, the audio processor 108 may digitize the audio signal 106.

The example audio signal 106 is a digitized representation of the sound emitted by the audio source 102. In some examples, the audio signal 106 may be saved on a computer before being processed by the audio processor 108. In some examples, the audio signal 106 may be communicated to the example audio processor 108 over a network. Additionally or alternatively, the audio signal 106 may be generated by any other suitable method (e.g., digital synthesis, etc.).

The example audio processor 108 converts the example audio signal 106 into the example fingerprint 110. In some examples, the audio processor 108 divides the audio signal 106 into frequency bins and/or time segments and then determines the mean energy of one or more of the resulting audio signal frequency components. In some examples, the audio processor 108 may normalize the audio signal 106 using the associated mean energies of the audio regions around the respective time-frequency bins. In other examples, any other suitable audio characteristic may be determined and used to normalize the various time-frequency bins. In some examples, the fingerprint 110 may be generated by selecting the highest-energy points among the normalized audio signal frequency components. Additionally or alternatively, the fingerprint 110 may be generated using any other suitable means. An example implementation of the audio processor 108 is described below in conjunction with fig. 2.

The example fingerprint 110 is a concise digital digest of the audio signal 106 that may be used to identify and/or verify the audio signal 106. For example, the fingerprint 110 may be generated by sampling portions of the audio signal 106 and processing those portions. In some examples, the fingerprint 110 may include samples of the highest-energy portions of the audio signal 106. In some examples, the fingerprint 110 may be indexed in a database and compared against other fingerprints. In some examples, the fingerprint 110 may be used to identify the audio signal 106 (e.g., determine what song is being played, etc.). In some examples, the fingerprint 110 may be used to verify the authenticity of the audio.

Fig. 2 is an example implementation of the audio processor 108 of fig. 1. The example audio processor 108 includes an example frequency range separator 202, an example audio characteristic determiner 204, an example signal normalizer 206, an example point selector 208, and an example fingerprint generator 210.

The example frequency range separator 202 separates an audio signal (e.g., the digitized audio signal 106 of fig. 1) into time-frequency bins and/or audio signal frequency components. For example, the frequency range separator 202 may perform a fast Fourier transform (FFT) on the audio signal 106 to transform the audio signal 106 to the frequency domain. Additionally, the example frequency range separator 202 may separate the transformed audio signal 106 into two or more frequency bins (e.g., using a windowing function such as a Hamming window, a Hann window, etc.). In this example, each audio signal frequency component is associated with a frequency bin of the two or more frequency bins. Additionally or alternatively, the frequency range separator 202 may aggregate the audio signal 106 into one or more time segments (e.g., the full duration of the audio, six-second periods, one-second periods, etc.). In other examples, the frequency range separator 202 may transform the audio signal 106 using any other suitable technique (e.g., a discrete Fourier transform, a sliding-time-window Fourier transform, a wavelet transform, a discrete Hadamard transform, a discrete Walsh-Hadamard transform, a discrete cosine transform, etc.). In some examples, the frequency range separator 202 may be implemented by one or more band-pass filters (BPFs). In some examples, the output of the example frequency range separator 202 may be represented as a spectrogram. Example outputs of the frequency range separator 202 are discussed below in conjunction with figs. 3A-3B and fig. 4.
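One way such a separator could be realized is sketched below using only NumPy: the signal is cut into 64 ms frames (matching the interval used in the spectrograms of figs. 3A-3C), windowed, and transformed with an FFT. The non-overlapping frames and the Hann window are assumptions of this sketch, not requirements of the disclosure.

```python
import numpy as np

def energy_spectrogram(audio, sample_rate, frame_ms=64.0):
    """Cut `audio` into non-overlapping frames (64 ms by default), apply a
    Hann window, FFT each frame, and return an energy spectrogram with
    shape [frequency_bins, time_bins]."""
    win_len = int(sample_rate * frame_ms / 1000)
    window = np.hanning(win_len)
    n_frames = len(audio) // win_len
    frames = np.stack([audio[i * win_len:(i + 1) * win_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)   # one frequency-domain row per frame
    return (np.abs(spectrum) ** 2).T         # energy per time-frequency bin
```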

The example audio characteristic determiner 204 determines audio characteristics of a portion of the audio signal 106 (e.g., an audio signal frequency component, the audio region around a time-frequency bin, etc.). For example, the audio characteristic determiner 204 may determine a mean energy (e.g., average power, etc.) of one or more of the audio signal frequency components. Additionally or alternatively, the audio characteristic determiner 204 may determine other characteristics of a portion of the audio signal (e.g., mode energy, median energy, mode power, median power, mean amplitude, etc.).

The example signal normalizer 206 normalizes one or more time-frequency bins by the associated audio characteristics of the surrounding audio regions. For example, the signal normalizer 206 may normalize a time-frequency bin by the mean energy of the surrounding audio region. In other examples, the signal normalizer 206 normalizes some of the audio signal frequency components by their associated audio characteristics. For example, the signal normalizer 206 may use the mean energy associated with an audio signal frequency component to normalize the various time-frequency bins of that audio signal frequency component. In some examples, the output of the signal normalizer 206 (e.g., normalized time-frequency bins, normalized audio signal frequency components, etc.) may be represented as a spectrogram. Example outputs of the signal normalizer 206 are discussed below in conjunction with figs. 3C and 5.

The example point selector 208 selects one or more points from the normalized audio signal to be used to generate the fingerprint 110. For example, the example point selector 208 may select a plurality of energy maxima of the normalized audio signal. In other examples, the point selector 208 may select any other suitable point of the normalized audio.

Additionally or alternatively, the point selector 208 may weight the selection of points based on the category of the audio signal 106. For example, if the category of the audio signal is music, the point selector 208 may concentrate the selection of points in frequency ranges common to music (e.g., bass, treble, etc.). In some examples, the point selector 208 may determine the category of the audio signal (e.g., music, voice, sound effects, advertisements, etc.). The example fingerprint generator 210 generates a fingerprint (e.g., the fingerprint 110) using the points selected by the example point selector 208. The example fingerprint generator 210 may generate the fingerprint from the selected points using any suitable method.

Although fig. 2 illustrates an example manner of implementing the audio processor 108 of fig. 1, one or more of the elements, processes and/or devices illustrated in fig. 2 may be combined, divided, rearranged, omitted, eliminated and/or implemented in any other way. Further, the example frequency range separator 202, the example audio characteristic determiner 204, the example signal normalizer 206, the example point selector 208, and the example fingerprint generator 210 and/or, more generally, the example audio processor 108 of figs. 1 and 2 may be implemented by hardware, software, firmware, and/or any combination of hardware, software, and/or firmware. Thus, for example, any of the example frequency range separator 202, the example audio characteristic determiner 204, the example signal normalizer 206, the example point selector 208, and the example fingerprint generator 210, and/or, more generally, the example audio processor 108, could be implemented by one or more analog or digital circuits, logic circuits, programmable processors, programmable controllers, graphics processing units (GPUs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), programmable logic devices (PLDs), and/or field-programmable logic devices (FPLDs). When reading any of the apparatus or system claims of this patent to cover a purely software and/or firmware implementation, at least one of the example frequency range separator 202, the example audio characteristic determiner 204, the example signal normalizer 206, the example point selector 208, and the example fingerprint generator 210 is hereby expressly defined to include a non-transitory computer-readable storage device or storage disk (such as a memory, a digital versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc.) including the software and/or firmware. Still further, the example audio processor 108 of figs. 1 and 2 may include one or more elements, processes and/or devices in addition to, or instead of, those illustrated in fig. 2, and/or may include more than one of any or all of the illustrated elements, processes and devices. As used herein, the phrase "in communication," including variations thereof, encompasses direct communication and/or indirect communication through one or more intermediary components, and does not require direct physical (e.g., wired) communication and/or constant communication, but rather additionally includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events.

Figs. 3A-3B depict the example unprocessed spectrogram 300 generated by the example frequency range separator of fig. 2. In the illustrated example of fig. 3A, the example unprocessed spectrogram 300 includes an example first time-frequency bin 304A surrounded by an example first audio region 306A. In the illustrated example of fig. 3B, the example unprocessed spectrogram 300 includes an example second time-frequency bin 304B surrounded by an example second audio region 306B. The example unprocessed spectrogram 300 of figs. 3A and 3B and the example normalized spectrogram 302 of fig. 3C each include an example vertical axis 308 representing frequency bins and an example horizontal axis 310 representing time bins. Figs. 3A and 3B illustrate the example audio regions 306A and 306B from which the audio characteristic determiner 204 obtains the normalizing audio characteristics and which the signal normalizer 206 uses to normalize the first and second time-frequency bins 304A and 304B, respectively. In the illustrated example, the various time-frequency bins of the unprocessed spectrogram 300 are normalized to generate the normalized spectrogram 302. In other examples, any suitable number of time-frequency bins of the unprocessed spectrogram 300 can be normalized to generate the normalized spectrogram 302 of fig. 3C.

The example vertical axis 308 is measured in frequency bins generated by a fast Fourier transform (FFT) and has a length of 1024 FFT bins. In other examples, the example vertical axis 308 may be measured by any other suitable technique for measuring frequency (e.g., hertz, another transform algorithm, etc.). In some examples, the vertical axis 308 encompasses the entire frequency range of the audio signal 106. In other examples, the vertical axis 308 may encompass a portion of that frequency range.

In the illustrated example, the example horizontal axis 310 represents time; the total length of the unprocessed spectrogram 300 is 11.5 seconds. In the illustrated example, the horizontal axis 310 has sixty-four-millisecond (ms) intervals as units. In other examples, the horizontal axis 310 may be measured in any other suitable unit (e.g., 1-second intervals, etc.). In the illustrated example, the horizontal axis 310 encompasses the full duration of the audio. In other examples, the horizontal axis 310 may encompass a portion of the duration of the audio signal 106. In the illustrated example, the size of each time-frequency bin of the spectrograms 300, 302 is 64 ms × 1 FFT bin.

In the illustrated example of fig. 3A, the first time-frequency bin 304A is associated with an intersection of a frequency bin and a time bin of the unprocessed spectrogram 300 and with the portion of the audio signal 106 associated with that intersection. The example first audio region 306A includes the time-frequency bins within a predefined distance from the example first time-frequency bin 304A. For example, the audio characteristic determiner 204 may determine a vertical length of the first audio region 306A (e.g., a length of the first audio region 306A along the vertical axis 308, etc.) based on a set number of FFT bins (e.g., 5 bins, 11 bins, etc.). Similarly, the audio characteristic determiner 204 may determine a horizontal length of the first audio region 306A (e.g., a length of the first audio region 306A along the horizontal axis 310, etc.). In the illustrated example, the first audio region 306A is square. Alternatively, the first audio region 306A may be any suitable size and shape and may contain any suitable combination of time-frequency bins within the unprocessed spectrogram 300 (e.g., any suitable set of time-frequency bins, etc.). The example audio characteristic determiner 204 may then determine an audio characteristic (e.g., mean energy, etc.) of the time-frequency bins contained within the first audio region 306A. Using the determined audio characteristic, the example signal normalizer 206 of fig. 2 may normalize the associated value of the first time-frequency bin 304A (e.g., may normalize the energy of the first time-frequency bin 304A by the mean energy of the various time-frequency bins within the first audio region 306A).

In the illustrated example of fig. 3B, the second time-frequency bin 304B is associated with an intersection of a frequency bin and a time bin of the unprocessed spectrogram 300 and with the portion of the audio signal 106 associated with that intersection. The example second audio region 306B includes the time-frequency bins within a predefined distance from the example second time-frequency bin 304B. Similarly to the first audio region 306A, the audio characteristic determiner 204 may determine the vertical and horizontal lengths of the second audio region 306B (e.g., its lengths along the vertical axis 308 and the horizontal axis 310, etc.). In the illustrated example, the second audio region 306B is square. Alternatively, the second audio region 306B may be any suitable size and shape and may contain any suitable combination of time-frequency bins within the unprocessed spectrogram 300 (e.g., any suitable set of time-frequency bins, etc.). In some examples, the second audio region 306B may overlap the first audio region 306A (e.g., contain some of the same time-frequency bins, be shifted along the horizontal axis 310, be shifted along the vertical axis 308, etc.). In some examples, the second audio region 306B may have the same size and shape as the first audio region 306A. In other examples, the second audio region 306B may have a different size and shape than the first audio region 306A. The example audio characteristic determiner 204 may then determine an audio characteristic (e.g., mean energy, etc.) of the time-frequency bins contained within the second audio region 306B. Using the determined audio characteristic, the example signal normalizer 206 of fig. 2 may normalize the associated value of the second time-frequency bin 304B (e.g., may normalize the energy of the second time-frequency bin 304B by the mean energy of the bins located within the second audio region 306B).

Fig. 3C depicts the example normalized spectrogram 302 generated by the signal normalizer of fig. 2 by normalizing a plurality of time-frequency bins of the unprocessed spectrogram 300 of figs. 3A-3B. For example, some or all of the time-frequency bins of the unprocessed spectrogram 300 can be normalized in the same manner as the time-frequency bins 304A and 304B. An example process 700 for generating a normalized spectrogram is described in connection with fig. 7. Each frequency bin of fig. 3C has now been normalized by the mean energy of the local region surrounding it. As a result, the darker regions are the regions having the greatest energy within their respective local regions. This enables the fingerprint to capture relevant audio features even in regions of lower energy than the generally louder bass frequency region.

Fig. 4 illustrates the example unprocessed spectrogram 300 of figs. 3A and 3B divided into a plurality of fixed audio signal frequency components. The example unprocessed spectrogram 300 is generated by processing the audio signal 106 using a fast Fourier transform (FFT). In other examples, the unprocessed spectrogram 300 can be generated using any other suitable method. In this example, the unprocessed spectrogram 300 is divided into a plurality of example audio signal frequency components 402. The example unprocessed spectrogram 300 includes the example vertical axis 308 and the example horizontal axis 310 of figs. 3A-3C. In the illustrated example, the example audio signal frequency components 402 each have an example frequency range 408 and an example time period 410. The example audio signal frequency components 402 include an example first audio signal frequency component 412A and an example second audio signal frequency component 412B. In the illustrated example, the darker portions of the unprocessed spectrogram 300 represent portions of the audio signal 106 having higher energy.

The example audio signal frequency components 402 are each associated with a unique combination of a continuous frequency range (e.g., frequency bins, etc.) and a continuous time period. In the illustrated example, each of the audio signal frequency components 402 has equally sized frequency bins (e.g., frequency range 408). In other examples, some or all of the audio signal frequency components 402 may have different sized frequency bins. In the illustrated example, each of the audio signal frequency components 402 has a time period (e.g., time period 410) of equal duration. In other examples, some or all of the audio signal frequency components 402 may have time periods of different durations. In the illustrated example, the audio signal frequency components 402 constitute the entire audio signal 106. In other examples, the audio signal frequency component 402 may comprise a portion of the audio signal 106.

In the illustrated example, the first audio signal frequency component 412A is located in the treble range of the audio signal 106 and has no visible energy points. The example first audio signal frequency component 412A is associated with the frequency bins between FFT bin 768 and FFT bin 896 and the time period between 10,024 ms and 11,520 ms. In some examples, there are multiple portions of the audio signal 106 within the first audio signal frequency component 412A. In this example, because audio within the bass spectrum of the audio signal 106 (e.g., audio in the second audio signal frequency component 412B, etc.) has relatively high energy, the portions of the audio signal 106 that are within the first audio signal frequency component 412A are not visible. The second audio signal frequency component 412B is located in the bass range of the audio signal 106 and has visible energy points. The example second audio signal frequency component 412B is associated with the frequency bins between FFT bin 128 and FFT bin 256 and the time period between 10,024 ms and 11,520 ms. In some examples, because portions of the audio signal 106 that are within the bass frequency spectrum (e.g., the second audio signal frequency component 412B, etc.) have relatively high energy, a fingerprint generated from the unprocessed spectrogram 300 will include a disproportionate number of samples from the bass frequency spectrum.

Fig. 5 is an example normalized spectrogram 500 generated by the signal normalizer of fig. 2 from the fixed audio signal frequency components of fig. 4. The example normalized spectrogram 500 includes the example vertical axis 308 and the example horizontal axis 310 of figs. 3A-3C. The example normalized spectrogram 500 is divided into a plurality of example audio signal frequency components 502. In the illustrated example, the audio signal frequency components 502 each have an example frequency range 408 and an example time period 410. The example audio signal frequency components 502 include an example first audio signal frequency component 504A and an example second audio signal frequency component 504B. In some examples, the first audio signal frequency component 504A and the second audio signal frequency component 504B correspond to the same frequency bins and time periods as the first audio signal frequency component 412A and the second audio signal frequency component 412B of fig. 4, respectively. In the illustrated example, the darker portions of the normalized spectrogram 500 represent regions of the audio spectrum having higher energy.

The example normalized spectrogram 500 is generated by normalizing the individual audio signal frequency components 402 of fig. 4 by their associated audio characteristics. For example, the audio characteristic determiner 204 may determine an audio characteristic (e.g., mean energy, etc.) of the first audio signal frequency component 412A. In this example, the signal normalizer 206 may then normalize the first audio signal frequency component 412A by the determined audio characteristic to create the example first audio signal frequency component 504A. Similarly, the example second audio signal frequency component 504B may be generated by normalizing the second audio signal frequency component 412B of fig. 4 by the audio characteristic associated with the second audio signal frequency component 412B. In other examples, the normalized spectrogram 500 may be generated by normalizing a portion of the audio signal frequency components 402. In other examples, any other suitable method may be used to generate the example normalized spectrogram 500.

In the illustrated example of fig. 5, the first audio signal frequency component 504A (e.g., the first audio signal frequency component 412A of fig. 4 after processing by the signal normalizer 206, etc.) has visible energy points on the normalized spectrogram 500. For example, because the first audio signal frequency component 504A has been normalized by the energy of the first audio signal frequency component 412A, previously hidden portions of the audio signal 106 (e.g., when compared to the first audio signal frequency component 412A) are visible on the normalized spectrogram 500. The second audio signal frequency component 504B (e.g., the second audio signal frequency component 412B of fig. 4 after processing by the signal normalizer 206, etc.) corresponds to the bass range of the audio signal 106. For example, because the second audio signal frequency component 504B has been normalized by the energy of the second audio signal frequency component 412B, the number of visible energy points has been reduced (e.g., when compared to the second audio signal frequency component 412B). In some examples, a fingerprint generated from the normalized spectrogram 500 (e.g., the fingerprint 110 of fig. 1) will include samples that are more evenly distributed across the audio spectrum than a fingerprint generated from the unprocessed spectrogram 300 of fig. 4.

Fig. 6 is an example normalized and weighted spectrogram 600 generated by the point selector 208 of fig. 2 from the normalized spectrogram 500 of fig. 5. The example spectrogram 600 includes the example vertical axis 308 and the example horizontal axis 310 of figs. 3A-3C. The example normalized and weighted spectrogram 600 is divided into a plurality of example audio signal frequency components 502. In the illustrated example, the example audio signal frequency components 502 each have an example frequency range 408 and an example time period 410. The example audio signal frequency components 502 include an example first audio signal frequency component 604A and an example second audio signal frequency component 604B. In some examples, the first audio signal frequency component 604A and the second audio signal frequency component 604B correspond to the same frequency bins and time periods as the first audio signal frequency component 412A and the second audio signal frequency component 412B of fig. 4, respectively. In the illustrated example, the darker portions of the normalized and weighted spectrogram 600 represent regions of the audio spectrum having higher energy.

The example normalized and weighted spectrogram 600 is generated by weighting the normalized spectrogram 500 with values ranging from zero to one based on the category of the audio signal 106. For example, if the audio signal 106 is music, the point selector 208 of fig. 2 weights the regions of the audio spectrum associated with music along each column. In other examples, the weighting may be applied to multiple columns, and different ranges from zero to one may be employed.
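The weighting step might look like the following sketch, in which each frequency row of the normalized spectrogram is scaled by a zero-to-one weight chosen per audio category. The weight profiles here are invented placeholders; the disclosure does not specify particular values.

```python
import numpy as np

def category_weights(n_freq, category):
    """Zero-to-one weight per frequency bin; the profiles are placeholders."""
    x = np.linspace(0.0, 1.0, n_freq)            # 0 = lowest bin, 1 = highest
    if category == "music":
        return np.abs(x - 0.5) * 2.0             # favor bass and treble ends
    if category == "voice":
        return np.exp(-((x - 0.2) ** 2) / 0.02)  # favor a low-to-mid speech band
    return np.ones(n_freq)                       # unknown category: no weighting

def weight_spectrogram(norm_spec, category):
    """Scale each frequency row of a normalized spectrogram by its weight."""
    return norm_spec * category_weights(norm_spec.shape[0], category)[:, None]
```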

Figs. 7 and 8 illustrate flow diagrams representing example hardware logic, machine readable instructions, hardware-implemented state machines, and/or any combination thereof for implementing the audio processor 108 of fig. 2. The machine-readable instructions may be executable programs or portions of executable programs for execution by a computer processor, such as the processor 912 shown in the example processor platform 900 discussed below in connection with fig. 9. The programs may be embodied in software stored on a non-transitory computer-readable storage medium such as a CD-ROM, a floppy disk, a hard drive, a DVD, a Blu-ray disk, or a memory associated with the processor 912, but the entire program and/or parts thereof could alternatively be executed by a device other than the processor 912 and/or embodied in firmware or dedicated hardware. Further, although the example programs are described with reference to the flowcharts illustrated in figs. 7 and 8, many other methods of implementing the example audio processor 108 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuitry, etc.) configured to perform the corresponding operations without the execution of software or firmware.

As described above, the example processes of figs. 7 and 8 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on a non-transitory computer and/or machine readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory, and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.

The terms "comprising" and "including" (and all forms and tenses thereof) are used herein as open-ended terms. Thus, whenever a claim recitations in any form, "comprise" or "comprise" (e.g., comprise, include, contain, have, etc.) as a preamble or within any claim recitation of that kind, it will be understood that additional elements, terms, etc. may be present without falling outside the scope of the corresponding claim or recitation. As used herein, the phrase "at least" when used as a transitional term in the claims, such as in the preamble, is open-ended in the same manner that the terms "comprising" and "including" are open-ended. When used, for example, in a format such as A, B and/or C, the term "and/or" refers to any combination or subset of A, B, C, such as (1) a alone, (2) B alone, (3) C alone, (4) a and B, (5) a and C, (6) B and C, and (7) a and B and C. As used herein, in the context of describing structures, components, items, objects, and/or things, the phrase "at least one of a and B" is intended to mean an implementation that includes any one of: (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein, in the context of describing structures, components, items, objects, and/or things, the phrase "at least one of a or B" is intended to mean an implementation that includes any of the following: (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. As used herein, in the context of describing processes, instructions, actions, activities, and/or performance of steps, the phrase "at least one of a and B" is intended to mean an implementation that includes any of the following: (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein, in the context of describing processes, instructions, actions, activities, and/or performance of steps, the phrase "at least one of a or B" is intended to mean an implementation that includes any of the following: (1) at least one A, (2) at least one B, and (3) at least one A and at least one B.

The process of fig. 7 begins at block 702. At block 702, the audio processor 108 receives the digitized audio signal 106. For example, the audio processor 108 may receive audio captured by the microphone 104 (e.g., emitted by the audio source 102 of fig. 1, etc.). In this example, the microphone may include an analog-to-digital converter to convert the audio into a digitized audio signal 106. In other examples, the audio processor 108 may receive audio stored in a database (e.g., the volatile memory 914 of fig. 9, the non-volatile memory 916 of fig. 9, the mass storage device 928 of fig. 9, etc.). In other examples, the digitized audio signal 106 may be sent to the audio processor 108 over a network (e.g., the internet, etc.). Additionally or alternatively, the audio processor 108 may receive the audio signal 106 by any other suitable means.

At block 704, the frequency range separator 202 windows the audio signal 106 and transforms the audio signal 106 to the frequency domain. For example, the frequency range separator 202 may apply a windowing function (e.g., a Hamming window, a Hann window, etc.) and perform a fast Fourier transform to transform the windowed audio signal 106 to the frequency domain. Additionally or alternatively, the frequency range separator 202 may aggregate the audio signal 106 into two or more time bins. In these examples, each time-frequency bin corresponds to an intersection of a frequency bin and a time bin and contains a portion of the audio signal 106.

At block 706, the audio characteristic determiner 204 selects a time-frequency bin for normalization. For example, the audio characteristic determiner 204 may select the first time-frequency bin 304A of fig. 3A. In some examples, the audio characteristic determiner 204 may select a time-frequency bin adjacent to a previously selected time-frequency bin.

At block 708, the audio characteristic determiner 204 determines audio characteristics of the surrounding audio regions. For example, if the audio characteristic determiner 204 selects the first time-frequency bin 304A, the audio characteristic determiner 204 may determine the audio characteristic of the first audio region 306A. In some examples, the audio characteristic determiner 204 may determine a mean energy of the audio regions. In other examples, the audio characteristic determiner 204 may determine any other suitable audio characteristic (e.g., mean amplitude, etc.).

At block 710, the audio characteristic determiner 204 determines whether another time-frequency bin is to be selected. If so, the process 700 returns to block 706; if not, the process 700 proceeds to block 712. In some examples, blocks 706-710 are repeated until each time-frequency bin of the unprocessed spectrogram 300 has been selected. In other examples, blocks 706-710 may be repeated for any suitable number of iterations.

At block 712, the signal normalizer 206 normalizes the various time-frequency bins based on the associated audio characteristics. For example, the signal normalizer 206 may utilize the associated audio characteristics determined at block 708 to normalize various ones of the time-frequency bins selected at block 706. For example, the signal normalizer 206 may normalize the first and second time-frequency bins 304A and 304B by the audio characteristics (e.g., mean energy) of the first and second audio regions 306A and 306B, respectively. In some examples, the signal normalizer 206 generates a normalized spectrogram (e.g., the normalized spectrogram 302 of fig. 3C) based on the normalization of the time-frequency bins.

At block 714, the point selector 208 determines whether fingerprint generation is to be weighted based on the audio category. If so, the process 700 proceeds to block 716; if not, the process 700 proceeds to block 720. At block 716, the point selector 208 determines the audio category of the audio signal 106. For example, the point selector 208 may present a prompt to the user to indicate the category of the audio (e.g., music, voice, sound effects, advertisements, etc.). In other examples, the audio processor 108 may determine the audio category using an audio category determination algorithm. In some examples, the audio category may be a particular person's voice, a general human voice, music, sound effects, and/or advertisements.

At block 718, the point selector 208 weights the time-frequency bins based on the determined audio category. For example, if the audio category is music, the point selector 208 may weight the audio signal frequency components toward the treble and bass ranges with which music is typically associated. In some examples, if the audio category is a particular person's voice, the point selector 208 may weight the audio signal frequency components toward the ranges associated with that person's voice. In some examples, the output of the point selector 208 may be represented as a spectrogram.

At block 720, the fingerprint generator 210 generates a fingerprint of the audio signal 106 (e.g., the fingerprint 110 of fig. 1) by selecting energy extrema of the normalized audio signal. For example, the fingerprint generator 210 may use frequency bins and energies associated with one or more energy extremes (e.g., one extreme, twenty extremes, etc.). In some examples, the fingerprint generator 210 may select an energy maximum of the normalized audio signal 106. In other examples, the fingerprint generator 210 may select any other suitable characteristic of the normalized audio signal frequency components. In some examples, the fingerprint generator 210 may utilize any suitable means (e.g., algorithms, etc.) to generate the fingerprint 110 representative of the audio signal 106. Once the fingerprint 110 is generated, the process 700 ends.
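A minimal sketch of this selection step follows, assuming the fingerprint is built from the k largest normalized energy values; how the selected points are packed into final fingerprint bits is left unspecified here.

```python
import numpy as np

def select_peaks(norm_spec, k=20):
    """Return (frequency_bin, time_bin, energy) triples for the k largest
    normalized energy values; these points feed the fingerprint generator."""
    top = np.argsort(norm_spec, axis=None)[-k:][::-1]   # top-k flat indices
    freqs, times = np.unravel_index(top, norm_spec.shape)
    return [(int(f), int(t), float(norm_spec[f, t]))
            for f, t in zip(freqs, times)]
```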

The process 800 of fig. 8 begins at block 802. At block 802, the audio processor 108 receives a digitized audio signal. For example, the audio processor 108 may receive audio emitted by the audio source 102 of fig. 1 and captured by the microphone 104. In this example, the microphone may include an analog-to-digital converter to convert the captured audio into the digitized audio signal 106. In other examples, the audio processor 108 may receive audio stored in a database (e.g., the volatile memory 914 of fig. 9, the non-volatile memory 916 of fig. 9, the mass storage device 928 of fig. 9, etc.). In other examples, the digitized audio signal 106 may be sent to the audio processor 108 over a network (e.g., the internet, etc.). Additionally or alternatively, the audio processor 108 may receive the audio signal 106 by any suitable means.

At block 804, the frequency range separator 202 separates the audio signal into two or more audio signal frequency components (e.g., the audio signal frequency components 402 of fig. 4, etc.). For example, the frequency range separator 202 may apply a windowing function (e.g., a Hamming window, a Hann window, etc.) and perform a fast Fourier transform to transform the audio signal 106 to the frequency domain and create frequency bins. In these examples, each audio signal frequency component is associated with one or more of the frequency bins. Additionally or alternatively, the frequency range separator 202 may also separate the audio signal 106 into two or more time segments. In these examples, each audio signal frequency component corresponds to a unique combination of a time segment of the two or more time segments and a frequency bin of the two or more frequency bins. For example, the frequency range separator 202 may separate the audio signal 106 into a first frequency bin, a second frequency bin, a first time period, and a second time period. In this example, a first audio signal frequency component corresponds to the portion of the audio signal 106 within the first frequency bin and the first time period, a second audio signal frequency component corresponds to the portion of the audio signal 106 within the first frequency bin and the second time period, a third audio signal frequency component corresponds to the portion of the audio signal 106 within the second frequency bin and the first time period, and a fourth audio signal frequency component corresponds to the portion of the audio signal 106 within the second frequency bin and the second time period. In some examples, the output of the frequency range separator 202 may be represented as a spectrogram (e.g., the unprocessed spectrogram 300 of fig. 4).

At block 806, the audio characteristic determiner 204 determines audio characteristics for the respective audio signal frequency components. For example, the audio characteristic determiner 204 may determine the mean energy of the individual audio signal frequency components. In other examples, the audio characteristic determiner 204 may determine any other suitable audio characteristic (e.g., mean amplitude, etc.).

At block 808, the signal normalizer 206 normalizes the audio signal frequency components based on the determined audio characteristics associated with the respective audio signal frequency components. For example, the signal normalizer 206 may normalize the audio signal frequency components by the mean energy associated with the audio signal frequency components. In other examples, the signal normalizer 206 may use any other suitable audio characteristic to normalize audio signal frequency components. In some examples, the output of the signal normalizer 206 may be represented as a spectrogram (e.g., the normalized spectrogram 500 of fig. 5).

At block 810, the audio characteristic determiner 204 determines whether fingerprint generation is to be weighted based on the audio category. If so, the process 800 proceeds to block 812; if not, the process 800 proceeds to block 816. At block 812, the audio processor 108 determines the audio category of the audio signal 106. For example, the audio processor 108 may present a prompt to the user to indicate the category of the audio (e.g., music, speech, etc.). In other examples, the audio processor 108 may determine the audio category using an audio category determination algorithm. In some examples, the audio category may be a particular person's voice, a general human voice, music, sound effects, and/or advertisements.

At block 814, the signal normalizer 206 weights the audio signal frequency components based on the determined audio category. For example, if the audio category is music, the signal normalizer 206 may weight the audio signal frequency components along each column with scaling values from zero to one at the frequency locations, from bass to treble, associated with an average spectral envelope of music. In some examples, if the audio category is human voice, the signal normalizer 206 may weight the audio signal frequency components in accordance with the spectral envelope of the human voice. In some examples, the output of the signal normalizer 206 may be represented as a spectrogram (e.g., the spectrogram 600 of fig. 6).

At block 816, the fingerprint generator 210 generates a fingerprint of the audio signal 106 (e.g., the fingerprint 110 of fig. 1) by selecting energy extrema of the normalized audio signal frequency components. For example, the fingerprint generator 210 may use the frequency bins and energies associated with one or more energy extrema (e.g., twenty extrema, etc.). In some examples, the fingerprint generator 210 may select the energy maxima of the normalized audio signal. In other examples, the fingerprint generator 210 may select any other suitable characteristic of the normalized audio signal frequency components. In some examples, the fingerprint generator 210 may utilize any suitable means (e.g., an algorithm, etc.) to generate the fingerprint 110 representative of the audio signal 106. Once the fingerprint 110 is generated, the process 800 ends.

Fig. 9 is a block diagram of an example processor platform 900 configured to execute the instructions of figs. 7 and/or 8 to implement the audio processor 108 of fig. 2. For example, the processor platform 900 may be a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cell phone, a smart phone, a tablet such as an iPad™), a personal digital assistant (PDA), an internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a gaming console, a personal video recorder, a set-top box, a headset or other wearable device, or any other type of computing device.

The processor platform 900 of the illustrated example includes a processor 912. The processor 912 of the illustrated example is hardware. For example, the processor 912 may be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer. The hardware processor may be a semiconductor-based (e.g., silicon-based) device. In this example, the processor 912 implements the example frequency range separator 202, the example audio characteristic determiner 204, the example signal normalizer 206, the example point selector 208, and the example fingerprint generator 210.

The processor 912 of the illustrated example includes a local memory 913 (e.g., a cache). The processor 912 of the illustrated example is in communication with a main memory including a volatile memory 914 and a non-volatile memory 916 via a bus 918. The volatile memory 914 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®), and/or any other type of random access memory device. The non-volatile memory 916 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 914, 916 is controlled by a memory controller.

The processor platform 900 of the illustrated example also includes an interface circuit 920. The interface circuit 920 may be implemented by any type of interface standard, such as an Ethernet interface, a Universal Serial Bus (USB) interface, a Bluetooth® interface, a Near Field Communication (NFC) interface, and/or a PCI Express interface.

In the illustrated example, one or more input devices 922 are connected to the interface circuit 920. An input device 922 allows a user to enter data and/or commands into the processor 912. For example, input device 922 may be implemented with an audio sensor, a microphone, a camera (still or video), and/or a voice recognition system.

One or more output devices 924 are also connected to the interface circuit 920 of the illustrated example. The output devices 924 can be implemented, for example, by display devices (e.g., Light Emitting Diodes (LEDs), Organic Light Emitting Diodes (OLEDs), Liquid Crystal Displays (LCDs), cathode ray tube displays (CRTs), in-plane switching (IPS) displays, touch screens, etc.), tactile output devices, printers, and/or speakers. Thus, the interface circuit 920 of the illustrated example generally includes a graphics driver card, a graphics driver chip, and/or a graphics driver processor.

The interface circuit 920 of the illustrated example also includes a communication device (such as a transmitter, receiver, transceiver, modem, residential gateway, wireless access point, and/or network interface) to facilitate exchange of data with external machines (e.g., any kind of computing device) via the network 926. For example, the communication may be via an ethernet connection, a Digital Subscriber Line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a peer-to-peer wireless system, a cellular telephone system, or the like.

The processor platform 900 of the illustrated example also includes one or more mass storage devices 928 for storing software and/or data. Examples of such mass storage devices 928 include floppy disk drives, hard disk drives, optical disk drives, blu-ray disk drives, Redundant Array of Independent Disks (RAID) systems, and Digital Versatile Disk (DVD) drives.

Machine-executable instructions 932 for implementing the methods of figs. 7 and 8 may be stored in the mass storage device 928, in the volatile memory 914, in the non-volatile memory 916, and/or on a removable non-transitory computer-readable storage medium such as a CD or DVD.

From the foregoing, it will be appreciated that example methods and apparatus have been disclosed that create a fingerprint of an audio signal while reducing the amount of noise captured in the fingerprint. In addition, by also sampling audio from regions of the audio signal where the energy is lower, a more robust audio fingerprint can be created compared to previously used audio fingerprinting methods.

Although certain example methods, apparatus, and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus, and articles of manufacture fairly falling within the scope of the appended claims either literally or under the doctrine of equivalents.
