Audio classification method, system, equipment and storage medium


Reading note: this technology, "Audio classification method, system, equipment and storage medium" (一种音频分类方法及系统及设备及存储介质), was designed and created by 陈剑超, 肖龙源, 李稀敏 and 叶志坚 on 2021-08-30. The invention discloses an audio classification method that processes and classifies mixed audio through the following steps: preprocessing, in which pre-emphasis, framing and windowing are performed on the input audio signal; audio frame feature extraction, in which pitch period detection, harmonic-to-noise ratio detection, extraction of the first-order difference of the speech/music harmonicity, harmonicity segmentation, and harmonicity segment feature extraction are performed in sequence on the input audio signal; modeling, in which a CNN-LSTM classification model is built and the extracted audio frame feature samples are fed into it for training until the output error of the model meets a preset requirement; and classification processing, in which the audio frame features of the audio to be processed are input into the trained model for classification and the classification result is output. The advantages of the invention are that a large amount of computation can be saved during audio retrieval processing, the retrieval range is greatly narrowed, and audio classification efficiency is improved.

1. An audio classification method, characterized in that the processing and classification of mixed audio is realized through the following steps:

preprocessing, namely performing pre-emphasis, framing and windowing on an input audio signal to realize the preprocessing of the audio signal;

extracting audio frame features, namely extracting the audio frame features by sequentially performing pitch period detection and harmonic-to-noise ratio detection on the input audio signal, extracting the first-order difference of the speech/music harmonicity, segmenting the harmonicity, and extracting the harmonicity segment features;

modeling, namely establishing a classification model based on CNN-LSTM and inputting the extracted audio frame feature sample data into the model for training until the output error of the model reaches the preset requirement;

and classification processing, namely inputting the audio frame features of the audio information to be processed into the established model for processing and classification and outputting the classification result.

2. The audio classification method according to claim 1, wherein the pre-emphasis step is formulated as y(n) = x(n) − θ·x(n−1), where θ denotes the pre-emphasis coefficient, y(n) denotes the signal obtained after pre-emphasis processing, and x(n) denotes the amplitude of the nth point of the audio signal.

3. The audio classification method according to claim 1, characterized in that the framing procedure selects a 20 ms frame length as the short-time stationary duration and a 10 ms frame overlap.

4. The audio classification method of claim 1, wherein pitch period detection is implemented by one of a time-domain estimation method, a transform method, or a hybrid method.

5. The audio classification method according to claim 1, characterized in that the process of extracting the first order difference of the harmony of speech and music comprises the following steps:

S01, calculating the correlation between each frequency and the frequency a certain step size away by a frequency-domain normalized autocorrelation;

S02, calculating the difference between two adjacent terms of the discrete function from step S01, and extracting the first-order difference of the speech/music harmonicity from the change between the discrete quantities.

6. The audio classification method according to claim 1, wherein the harmonicity segmentation feature extraction process is implemented by the following steps:

A01, identifying the music beats through the steps of obtaining the onset envelope of the audio, estimating the tempo, and tracking the beats;

and A02, finding the lowest harmonicity point within each beat, and calculating the variance and mean of the harmonicity between the lowest points of every two beats.

7. The audio classification method according to claim 1, characterized in that the modeling comprises the following specific steps:

B01, using the CNN to aggregate the audio features into a smaller size and to obtain multiple types of local features in the audio data;

B02, adding an LSTM layer so that the model can combine the high-level features at each moment of a long audio segment and obtain how features such as harmonicity vary at different moments in the audio;

B03, setting two fully connected layers and a classification layer, integrating the features, mapping them to the sample label space, and then classifying;

B04, comparing the classification result with the result of manual classification to obtain the error.

8. An audio classification system, comprising an audio signal input module for inputting an audio signal to be classified; a feature extraction module for extracting audio features from the input audio signal; a classification processing module for classifying the audio signal according to the extracted audio feature data; and an output module for outputting the audio classification result.

9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

Technical Field

The present invention relates to the field of audio processing, and in particular, to an audio classification method, system, device, and storage medium.

Background

Apart from information such as the coding mode and sampling rate, audio data is an unstructured binary stream. It is characterized by large data volume, complex processing, and a high degree of information correlation, which makes audio signal processing particularly complex and creates great difficulty for applications such as audio retrieval.

Audio classification technology is the basis of audio structuring and can alleviate the above problems, so it is regarded as the most common preprocessing technology in audio data processing. However, prior-art audio classification methods generally assign audio to a single category, while mixed speech-and-music data is common audio data on the internet; merely marking such data as a "mixed" category is not fine-grained enough and cannot meet the requirements of some audio information processing systems.

Prior-art audio classification techniques mainly fall into three types: rule-based methods, minimum-distance-based methods, and statistical learning algorithms.

The idea of the rule-based audio classification method is to select features that can distinguish audio categories and to set classification rules. During classification, the computed feature values are compared against preset thresholds according to the rules, and the audio is classified accordingly. This method is simple to operate but can only identify audio types with a single characteristic. Moreover, a wrong decision at an upper layer accumulates into the layers below, so the method depends heavily on prior knowledge and on the threshold settings, and its performance is unstable on massive data.

The minimum-distance-based audio classification method uses the idea of template matching: the algorithm builds a template for each audio category; during classification, the feature vector of the audio to be classified is computed and matched against the template vectors, and the distance between them determines the class.

Statistical-learning-based audio classification is the focus and hotspot of current audio classification research; it provides an effective route to automatically learned classification and is the main direction of future work in this field. However, all of the above techniques sort audio into a few broad categories, such as speech, music, silence and ambient noise, and do not classify audio in which speech and music are mixed. Moreover, the amount of mixed audio data on today's networks is huge, and it can only be distinguished well if fine-grained labels are attached to it.

In summary, prior-art algorithms cannot meet the requirement of attaching fine-grained labels to mixed data in processing tasks such as audio retrieval.

Disclosure of Invention

The technical problem to be solved by the present invention is how to process and classify mixed audio, and an audio classification method is proposed to solve it.

In order to achieve this purpose, the invention provides the following technical scheme: an audio classification method that realizes the processing and classification of mixed audio through the following steps:

preprocessing, namely performing pre-emphasis, framing and windowing on an input audio signal to realize the preprocessing of the audio signal;

extracting audio frame features, namely extracting the audio frame features by sequentially performing pitch period detection and harmonic-to-noise ratio detection on the input audio signal, extracting the first-order difference of the speech/music harmonicity, segmenting the harmonicity, and extracting the harmonicity segment features;

modeling, namely establishing a classification model based on CNN-LSTM and inputting the extracted audio frame feature sample data into the model for training until the output error of the model reaches the preset requirement;

and classification processing, namely inputting the audio frame features of the audio information to be processed into the established model for processing and classification and outputting the classification result.

Further, the pre-emphasis step is formulated as y(n) = x(n) − θ·x(n−1), where θ denotes the pre-emphasis coefficient, y(n) denotes the signal obtained after pre-emphasis processing, and x(n) denotes the amplitude of the nth point of the audio signal.

Further, the framing procedure selects a 20 ms frame length as the short-time stationary duration and a 10 ms frame overlap.

Further, pitch period detection is implemented by one of a time-domain estimation method, a transform method, or a hybrid method.

Further, the process of extracting the first order difference of the harmony of the voice and the music includes the steps of:

s01, calculating the correlation between each frequency and the frequency spanning a certain step, which is realized by the following formula:

s02, a difference between two adjacent items in the discrete function in step S01 is calculated, and a first order difference of the harmony of the voice and the music is extracted by a change between the discrete quantities.

Further, the extraction process of the harmonicity segmentation features is realized by the following steps:

A01, identifying the music beats through the steps of obtaining the onset envelope of the audio, estimating the tempo, and tracking the beats;

and A02, finding the lowest harmonicity point within each beat, and calculating the variance and mean of the harmonicity between the lowest points of every two beats.

Further, the modeling comprises the following specific steps:

B01, using the CNN to aggregate the audio features into a smaller size and to obtain multiple types of local features in the audio data;

B02, adding an LSTM layer so that the model can combine the high-level features at each moment of a long audio segment and obtain how features such as harmonicity vary at different moments in the audio;

B03, setting two fully connected layers and a classification layer, integrating the features, mapping them to the sample label space, and then classifying;

B04, comparing the classification result with the result of manual classification to obtain the error.

It is another object of the present invention to provide an audio classification system, which includes an audio signal input module for inputting an audio signal to be classified; a feature extraction module for extracting audio features from the input audio signal; a classification processing module for classifying the audio signal according to the extracted audio feature data; and an output module for outputting the audio classification result.

The invention also provides a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method according to any one of claims 1 to 7 when executing the computer program.

The invention also provides a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

Compared with the prior art, the invention has the beneficial effects that:

speech-and-music mixed data can be labeled more finely; a large amount of computation can be saved during audio retrieval processing, the retrieval range is greatly narrowed, and audio classification efficiency is improved; and suitable background-music audio can be selected for processing according to the robustness of the recognition algorithm to background music, which increases processing speed and reduces recognition error.

Drawings

FIG. 1 is a schematic process flow diagram of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

As shown in fig. 1, the present embodiment discloses an audio classification method, which realizes processing and classification of mixed audio by the following steps:

First, a binary CNN classifier is trained to detect pure-music segments in the mixed audio; the energy of the music component is estimated from the stability of the audio energy, and the energy ratio of the speech component is calculated. To handle the case where the speech contains no pauses, or the detected pauses are too short, an energy-ratio classification method based on a harmonicity feature group is proposed.

Specifically, the method comprises the following steps:

Preprocessing, namely performing pre-emphasis, framing and windowing on the input audio signal. Because audio signal analysis relies on the short-time stationarity of the signal, the audio signal is preprocessed before any analysis, mainly in three steps: pre-emphasis, framing and windowing. Pre-emphasis compensates for the influence of the vocal organs and oral cavity on the speech signal, as well as for aliasing, higher-harmonic distortion and other factors introduced by the acquisition equipment; framing and windowing preserve the short-time stationarity of the signal during feature computation. These operations ensure that the signal entering subsequent audio processing is smoother, improving processing quality.

Specifically, the preprocessing process is as follows:

Pre-emphasis: the higher the frequency of an audio signal, the weaker its components, so pre-emphasis strengthens the high-frequency part and reduces the influence of the oral cavity on the speech signal. The audio signal x(n) is a discrete signal; after pre-emphasis processing, the signal y(n) is obtained by:

y(n) = x(n) − θ·x(n−1)

where θ is the pre-emphasis coefficient and x(n) is the amplitude of the nth point of the audio signal.
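As a concrete illustration, the pre-emphasis filter above can be sketched in a few lines of Python. The coefficient value 0.97 below is a conventional default and an assumption here, since the embodiment does not fix the value of θ:

```python
import numpy as np

def pre_emphasis(x: np.ndarray, theta: float = 0.97) -> np.ndarray:
    """Apply the pre-emphasis filter y(n) = x(n) - theta * x(n-1).

    theta=0.97 is a conventional default, not a value fixed by this embodiment.
    """
    y = np.empty_like(x, dtype=np.float64)
    y[0] = x[0]                      # first sample has no predecessor
    y[1:] = x[1:] - theta * x[:-1]   # strengthen high-frequency content
    return y
```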

Framing: a long audio signal is non-stationary and varies over time, but over a short period, typically 10-30 ms, it can be considered approximately stationary. Framing can be done either without overlap or with overlap; in this embodiment a 20 ms frame is selected with a 10 ms overlap between frames, which ensures the continuity of the signal between two adjacent frames.

Windowing, which smooths the audio signal, is performed by multiplying each frame with a window function, commonly a Hamming window or a rectangular window. A Hamming window is used in this embodiment, with the window function:

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

where N is the window length.

After pre-emphasis, framing and windowing, the audio signal x(n) becomes:

g(n) = y(n)·w(n)
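A minimal sketch of the framing and windowing steps, assuming a 16 kHz sampling rate (the embodiment does not specify one); the 20 ms frame, 10 ms overlap and Hamming window follow the choices stated above:

```python
import numpy as np

def frame_and_window(y: np.ndarray, sr: int = 16000,
                     frame_ms: float = 20.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split a (pre-emphasized) signal into overlapping Hamming-windowed frames.

    Returns an array of shape (num_frames, frame_len), i.e. g(n) = y(n) * w(n)
    for each frame.
    """
    frame_len = int(sr * frame_ms / 1000)   # 20 ms -> 320 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)       # frames advance by 10 ms (10 ms overlap)
    num_frames = 1 + (len(y) - frame_len) // hop_len
    window = np.hamming(frame_len)          # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    return np.stack([y[i * hop_len : i * hop_len + frame_len] * window
                     for i in range(num_frames)])
```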

specifically, the reason for the selection in this embodiment is based on the alternation of unvoiced and voiced speech in speech, with harmony being lower at higher unvoiced times for voiced sounds, and music being rhythmic and more stable, with harmony fluctuating at higher values in the range of rhythms. Meanwhile, since voiced sounds in speech have a certain periodicity, a pitch period in a signal can be extracted. When a person produces sound, airflow in the lung part impacts the glottis to enable the glottis to be combined one by one, so that vocal cords vibrate, a series of quasi-periodic pulses are generated, the quasi-periodic pulses are emitted after the vocal cords and the oral cavity resonate, a voice signal is formed, and therefore voiced sound has certain periodicity. The frequency of the vibration of the vocal cords becomes the fundamental frequency, and such a period is called a pitch period.

Pitch period detection may be performed using a time-domain estimation method, a transform method, or a hybrid method, as applicable.

Preferably, a time-domain estimation method is used for detection in this embodiment: the fundamental frequency is estimated from the waveform of the speech signal, using methods such as the autocorrelation function method and the average magnitude difference function method. The autocorrelation function is:

R(m) = Σ x(n)·x(n+m), summed over n = 0, ..., N−1−m

The period is obtained from the property that the autocorrelation function takes a maximum when the signal is delayed by an integer multiple of its period, so the pitch period of the signal x(n) can be read off from the position of the maximum of R(m).
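A sketch of the autocorrelation pitch detector just described. The 50-400 Hz search range is an assumption (a typical speech pitch range), not a value given in the text:

```python
import numpy as np

def pitch_period_autocorr(frame: np.ndarray, sr: int = 16000,
                          f_min: float = 50.0, f_max: float = 400.0) -> int:
    """Estimate the pitch period (in samples) of one voiced frame.

    R(m) = sum_n x(n) * x(n+m); the lag m at which R(m) peaks within the
    plausible pitch range is taken as the pitch period.
    """
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:]  # lags 0 .. n-1
    lag_min = int(sr / f_max)                 # shortest period to consider
    lag_max = min(int(sr / f_min), n - 1)     # longest period to consider
    lag = lag_min + int(np.argmax(r[lag_min:lag_max]))
    return lag                                # pitch period in samples; f0 = sr / lag
```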

The transform method obtains the pitch period by converting the signal from the time domain to the frequency domain: homomorphic analysis is first used to convert the nonlinear problem into a linear one, the influence of the vocal tract is removed to isolate the excitation component, and the pitch period is then determined. The most common such method is the cepstrum method.

The hybrid method first removes the influence of the vocal tract to obtain the sound-source sequence of the speech, and then applies a time-domain estimation method to obtain the pitch period.

In other embodiments of the invention, a transform or hybrid approach may be used.

Then, audio frame feature extraction is performed: pitch period detection and harmonic-to-noise ratio detection are carried out on the input audio signal in sequence, the first-order difference of the speech/music harmonicity is extracted, the harmonicity is segmented, and the harmonicity segment features are extracted.

Regarding harmonicity: the human auditory system readily perceives the presence of speech even in very noisy environments. This stems from the characteristics of the human auditory system and of the speech signal, whose harmonic structure is the main difference between human speech and other noise. Harmonics are the components obtained from the Fourier series decomposition of a periodic non-sinusoidal quantity whose frequencies are integer multiples of the fundamental frequency. Harmonicity features are extracted by studying the harmonic structure of the sound. Harmonicity represents the degree of harmonic or periodic structure in the signal, and the harmonic-to-noise ratio of speech reflects the overall aperiodicity of the speech signal.

Since the fundamental frequency and its integer-multiple (harmonic) components account for a large proportion of the energy of the whole audio signal, the frequency-domain normalized autocorrelation method is preferably used in this embodiment to estimate the likelihood that each frequency is the fundamental: R(k) is computed as in equation (1-1), and its value reflects the likelihood that the frequency k·fs/N is the fundamental frequency.

After the likelihood that each frequency in a frame is the fundamental has been calculated, the harmonicity of the frame can be defined as the average value of R(k) over the frame:

H = (1/(k_f2 − k_f1 + 1)) · Σ R(k), summed over k = k_f1, ..., k_f2

where [k_f1, k_f2] corresponds to the frequency range under investigation.

The harmonicity of a frame can also be defined as the maximum value of R(k) in that range:

h = max R(k), k = k_f1, ..., k_f2

in order to use the harmonicity characteristics to better reflect the characteristics of the voice and the music, distinguish the characteristics and improve the formula, the definition of R (k) in the above formula (1-1) is to consider the correlation between frequency multiples of a certain frequency, and the harmonicity can also be defined as (1-2):

regarding the harmonicity first order difference, there is not much regularity because the harmonicity of the voice is high and low. The harmony of music shows similar fluctuation in each street beat and is relatively regular in a certain range, and in order to reflect the overall characteristic of music, the difference is considered to be added to depict the difference of the overall waveform changes of music and voice. The difference is the difference between two adjacent terms in the discrete function, and can reflect the change between discrete quantities, thereby extracting the first-order difference of the harmony of the voice and the music.

The specific process of harmonicity segment feature extraction is as follows:

since the harmony of the music signal exhibits regular and similar fluctuations within a certain range according to the rhythm thereof, the segment statistical characteristics of the harmony, such as variance, mean, and the like, are extracted. The best segmentation mode is to segment the audio according to the beat of the audio and solve the segment characteristics of the harmony degree in the segment.

When the music harmonicity is segmented by beats, the music beats are identified first; the process is as follows:

1) Obtain the onset envelope of the audio. If a piece of music is considered to consist of a number of events, each event can be regarded as an envelope; the envelope is obtained by extracting the Mel spectral energy and passing it through a filter.

2) Estimate the tempo, i.e. the number of beats per minute. Music is a periodic signal, and the average tempo of the whole piece can be estimated from this periodicity: the period of the music is obtained through the autocorrelation function, which is maximal when the music signal is correlated with a copy delayed by an integer multiple of the period; once the average period is obtained, the overall tempo can be calculated.

3) Beat identification: according to the tempo obtained in the previous step, a dynamic programming algorithm together with prior knowledge is used to track the beats in the onset envelope and obtain the positions of the beats in the signal.

The above method detects the beats in a piece of music, but beat points may fail to be detected in the first and last periods of the music signal. Therefore, the maximum spacing max_beat_space between beat points over the beat times of the audio signal is obtained first; a sliding window of length max_beat_space is then slid along the harmonicity curve to find the lowest point within each beat, and the variance and mean of the harmonicity between the lowest points of every two beats are calculated.

Segmenting the audio with this algorithm and extracting the harmonicity statistics reflects the difference between speech and music in harmonicity; at the same time, because beat features are used during segmentation, the rhythmic characteristics of the audio are reflected as well.
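The beat-tracking pipeline of steps 1)-3) matches the onset-envelope / tempo-estimation / dynamic-programming scheme implemented by librosa's beat tracker, so a sketch can lean on it. The conversion between librosa's beat frames and the harmonicity frame rate is glossed over here and is an assumption:

```python
import numpy as np
import librosa

def harmonicity_segment_features(y: np.ndarray, sr: int, harmonicity: np.ndarray):
    """Beat-segmented harmonicity statistics (a sketch of steps 1)-3) above).

    Assumes the harmonicity sequence uses the same frame rate as librosa's
    onset envelope; in practice the two hop lengths must be reconciled.
    """
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)            # step 1)
    tempo, beats = librosa.beat.beat_track(onset_envelope=onset_env,
                                           sr=sr)                   # steps 2)-3)
    # Widest spacing between detected beat points (max_beat_space in the text).
    max_beat_space = int(np.max(np.diff(beats))) if len(beats) > 1 else len(harmonicity)
    # Slide a window of that length over the harmonicity curve and take the
    # lowest point inside each window.
    lows = [i + int(np.argmin(harmonicity[i:i + max_beat_space]))
            for i in range(0, max(1, len(harmonicity) - max_beat_space),
                           max_beat_space)]
    # Variance and mean of the harmonicity between every two adjacent low points.
    return tempo, [(np.var(harmonicity[a:b]), np.mean(harmonicity[a:b]))
                   for a, b in zip(lows[:-1], lows[1:])]
```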

Modeling: a CNN-LSTM-based classification model is established, and the extracted audio frame feature samples are fed into it for training until the output error of the model meets the preset requirement. In this embodiment the CNN is combined with an LSTM network: first the CNN aggregates the audio features into a smaller size and obtains multiple types of local features in the audio data; then an LSTM layer is added so that the model can combine the high-level features at each moment of a long audio segment and capture how features such as harmonicity vary at different moments, giving better classification; finally, two fully connected layers and a classification layer integrate the features, map them to the sample label space, and perform the classification.

The LSTM is a single-layer bidirectional LSTM with 64 hidden units. A unidirectional LSTM can only predict later time steps from earlier information, whereas a bidirectional LSTM uses both the preceding and the following inputs to produce the output at the current time, so its classification result is more accurate. The feature dimension of the network input is 85, and one audio clip is 300 frames; after the CNN, the output dimension is (5 × 64) × 18, where 5 × 64 is the feature dimension and 18 is the number of time steps. After this is fed into the bidirectional LSTM and passes through the 64 hidden nodes in each direction, an output of shape 18 × (64 × 2) is produced; pooling along the time axis yields a vector of length 128, which passes through two fully connected layers with weight matrices of shape (128, 64) and (64, 11); finally, softmax maps the output of the last fully connected layer to 11 categories.
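A sketch of a model with the stated shapes, in PyTorch. Only the 85 × 300 input, the (5 × 64) × 18 CNN output, the 64-unit bidirectional LSTM, the (128, 64) and (64, 11) fully connected layers, the time-axis pooling and the 11-way softmax come from the description; the convolution kernel and pooling sizes are assumptions chosen to reproduce those shapes:

```python
import torch
import torch.nn as nn

class CNNBiLSTMClassifier(nn.Module):
    """CNN + bidirectional LSTM audio classifier following the shapes in the text.

    Input: (batch, 1, 85, 300) -- 85-dim frame features over 300 frames.
    CNN output: (batch, 64, 5, 18), i.e. feature dim 5 * 64 = 320 over 18 steps.
    """
    def __init__(self, n_classes: int = 11):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(4),                        # (32, 85, 300) -> (32, 21, 75)
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(4),                        # (64, 21, 75) -> (64, 5, 18)
        )
        self.lstm = nn.LSTM(input_size=5 * 64, hidden_size=64,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                                nn.Linear(64, n_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.cnn(x)                             # (batch, 64, 5, 18)
        z = z.flatten(1, 2).permute(0, 2, 1)        # (batch, 18, 320): time-major
        out, _ = self.lstm(z)                       # (batch, 18, 64 * 2)
        pooled = out.mean(dim=1)                    # pool over the time axis -> 128
        return self.fc(pooled)                      # logits for the 11 categories

model = CNNBiLSTMClassifier()
logits = model(torch.randn(2, 1, 85, 300))
probs = torch.softmax(logits, dim=-1)               # softmax over the 11 classes
```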

Finally, classification processing is performed: the audio frame features of the audio information to be processed are input into the established model for processing and classification, and the classification result is output.

The advantages of the invention are that speech-and-music mixed data can be labeled more finely; a large amount of computation is saved during audio retrieval processing, the retrieval range is greatly narrowed, and audio classification efficiency is improved; and suitable background-music audio can be selected for processing according to the robustness of the recognition algorithm to background music, increasing processing speed and reducing recognition error.

Example 2

The present embodiment discloses an audio classification system, which includes an audio signal input module for inputting an audio signal to be classified; a feature extraction module for extracting audio features from the input audio signal; a classification processing module for classifying the audio signal according to the extracted audio feature data; and an output module for outputting the audio classification result. The audio classification system specifically performs the method of Embodiment 1.

The above system is carried in a terminal device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the steps of the method as in embodiment 1 when executing said computer program.

The readable storage medium in the terminal device described above stores a computer program which, when executed by a processor, implements the steps of the method as in embodiment 1.

The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. Various changes, modifications, substitutions and alterations that a person skilled in the art can make to these embodiments without departing from the principles and spirit of the invention still fall within the protection scope of the invention.
