Acoustic event detection system based on feature decomposition classifier and self-adaptive post-processing

Document No.: 1891612 Publication date: 2021-11-26

Reading note: This technology, "Acoustic event detection system based on feature decomposition classifier and self-adaptive post-processing", was designed by 龙艳花, 梁芸浩 and 李轶杰 on 2021-08-24. Main content: The invention relates to an acoustic event detection system based on a feature decomposition classifier and adaptive post-processing, comprising a feature extraction network, a feature decomposition classifier and an adaptive post-processing module. The feature extraction network is used for obtaining a high-level feature representation of the input audio features. The feature decomposition classifier is used for decomposing the high-level feature representation, selecting the corresponding sub-feature space for each event type, and outputting a frame-level detection result according to the sub-feature space information. The adaptive post-processing module is used for smoothing the frame-level detection result to obtain the final detection result. The system of the embodiments of the application can learn the feature information of a single event in a targeted manner according to the target event, and the feature decomposition for the target event reduces the interference of other events' features in overlapping events. In addition, the adaptive post-processing method filters out erroneous temporal information contained in the system's predictions and smooths the event distribution in the audio, which greatly improves the accuracy of event timestamp detection.

1. An acoustic event detection system based on a feature decomposition classifier and adaptive post-processing, comprising: a feature extraction network, a feature decomposition classifier and a self-adaptive post-processing module,

the feature extraction network is used for acquiring high-level feature representation of input audio features;

the feature decomposition classifier is used for decomposing the high-level feature representation and selecting corresponding sub-feature spaces according to different event types; outputting a frame level detection result according to the sub-feature space information;

and the self-adaptive post-processing module is used for performing smoothing processing on the frame level detection result to obtain a final detection result.

2. The feature decomposition classifier and adaptive post-processing based acoustic event detection system of claim 1, wherein the feature extraction network comprises: a complex teacher model and a lightweight student model.

3. The feature decomposition classifier and adaptive post-processing based acoustic event detection system of claim 2,

the complex teacher model includes: 5 groups of double-layer convolutional neural network modules and a 2-layer recurrent neural network module, wherein the 2-layer recurrent neural network module is used for extracting temporal information from the output of the convolutional neural network modules;

the lightweight student model includes: 3 groups of single-layer convolutional neural network modules and a 2-layer recurrent neural network module;

wherein each convolutional neural network module comprises: normalization layer, pooling layer, and activation function.

4. The system of claim 3, wherein the number of nodes of the recurrent neural network module is the same as the number of nodes of the last layer of the convolutional neural network module connected to the upper layer.

5. The feature decomposition classifier and adaptive post-processing based acoustic event detection system of claim 2, wherein the feature extraction network is further configured to:

learning characteristic information of audio data of different labeling types in a mode of combining supervised learning and unsupervised learning;

the different annotation type audio data comprises: strongly labeled audio data, weakly labeled audio data, and unlabeled audio data.

6. The feature decomposition classifier and adaptive post-processing based acoustic event detection system of claim 5,

the supervised learning uses a cross entropy loss function;

the unsupervised learning uses a mean square error loss function;

calculating a mean square error loss function between the complex teacher model and the lightweight student model; as training progresses, the lightweight student model tends to stabilize, fine-tuning the complex teacher model using a smaller weight μ.

7. The feature decomposition classifier and adaptive post-processing based acoustic event detection system of claim 6, wherein the overall loss function is expressed as:

[formula not reproduced in the source text]

wherein the overall loss function comprises a classification loss function at the event level and an acoustic event detection loss function at the frame level;

the individual terms represent, respectively, the weak-label loss of the complex teacher model, the weak-label loss of the lightweight student model, the strong-label loss of the complex teacher model and the strong-label loss of the lightweight student model; and two further terms represent the complex teacher model guiding the student model and the student model fine-tuning the teacher model.

8. The feature decomposition classifier and adaptive post-processing based acoustic event detection system of claim 7,

and in each iterative training, minimizing a mean square error loss function between the complex teacher model and the lightweight student model, so that the output characteristics of the prediction results of the lightweight student model and the complex teacher model tend to be consistent.

9. The feature decomposition classifier and adaptive post-processing based acoustic event detection system of claim 1, wherein the feature decomposition classifier is further configured to:

calculating a characteristic subspace dimension reference factor to be reserved for the event type:

and calculating the dimension of the high-level feature representation to be reserved for the single event type according to the feature subspace dimension reference factor to be reserved for the event type.

10. The feature decomposition classifier and adaptive post-processing based acoustic event detection system of claim 1, wherein the feature decomposition classifier further comprises: a classifier section, the classifier section comprising: an acoustic event detection task branch and an acoustic event classification task branch;

the acoustic event detection task branch comprises: a fully connected layer with a large hidden state, followed by a group of fully connected layers, equal in number to the preset event types, each with a sigmoid activation function, wherein each single fully connected layer performs a binary classification operation; the binary classification operation is used for determining whether an event is present in the feature information of each frame;

the acoustic event classification task branch comprises: an attention module.

11. The feature decomposition classifier and adaptive post-processing based acoustic event detection system of claim 5,

the acoustic event detection task branch is used for determining a frame-level detection posterior probability;

the acoustic event classification task branch is used for determining a classification posterior probability of an event level.

12. The system of claim 1, wherein the adaptive post-processing module smooths the frame-level detection result to obtain the final detection result by:

determining the average duration of different events according to the event distribution in the strong label data set in the training set;

and determining a median filtering window according to the characteristics of the target event, and performing post-processing operation on the frame level detection posterior probability to obtain the final detection result.

Technical Field

The invention relates to the technical field of artificial intelligence technology and acoustic event detection, in particular to an acoustic event detection system based on a feature decomposition classifier and self-adaptive post-processing.

Background

With the development of artificial intelligence technology in recent years, intelligent technology has gradually changed the way people live. In the field of intelligent voice technology, a variety of audio technologies, such as remote audio and video communication, intelligent voice interaction systems and smart speakers, are now used in many aspects of daily life. While traditional speech technologies such as speech recognition, voiceprint recognition and speech synthesis continue to develop, emerging audio processing technologies are also becoming a research hotspot. Demand is growing for technologies such as sound scene classification, sound field event localization, abnormal audio event classification and acoustic event detection. The acoustic event detection task simulates the human ability to identify acoustic events occurring in a given environment, and uses audio signal processing and deep learning techniques to classify and identify those events, for example recognizing that the environment contains "pet sounds", "doorbell sounds", "car engine sounds", and the like.

Acoustic Event Detection (AED) refers to identifying the acoustic events that occur in a piece of audio and determining the onset and offset timestamps of each event. At present, acoustic event detection has very broad application prospects, for example in smart home devices, intelligent health monitoring systems, unmanned technology, speech recognition, and remote audio-video communication. For example, in an audio/video conference, acoustic event detection can analyze the environment of the participants and adapt the audio communication to the detected environmental information, for instance by assisting speech enhancement and speech separation techniques to improve call quality. In work such as urban security and patrol inspection, the detected acoustic information can be used to determine whether potential danger exists and to help personnel judge whether measures need to be taken. In addition, environmental information acquired in real time can assist technologies such as intelligent transportation and intelligent driving. In abnormal-sound detection for equipment, acoustic event detection can monitor the working state of the equipment in a timely manner and help workers analyze the equipment in more detail.

In short, with the development of artificial intelligence and deep learning, acoustic event detection has gradually become a research focus in industry, with broad application prospects in civil, national defense and other fields. As a new research direction, acoustic event detection still faces difficulties in terms of technology, equipment and other aspects. In exploring acoustic event detection algorithms, four main problems have been found to affect detection accuracy:

1. in practical application environments, some target events overlap, so the timestamp information of the events cannot be accurately obtained during detection;

2. the acquired training data contains complex event types whose distribution is unbalanced, so the performance of the trained model is also unbalanced across event types;

3. the target events to be detected vary greatly, that is, some target events within an audio segment last very long or very briefly, making it difficult for the system to capture accurate timestamp information;

4. training data is difficult to label. The event distribution in the acquired training data is unbalanced, there is a large amount of interference from non-target events, manual labeling easily introduces errors, and accurate timestamp information is difficult to obtain.

Disclosure of Invention

The invention provides an acoustic event detection system based on a feature decomposition classifier and adaptive post-processing, which can solve the above technical problems.

The technical scheme for solving the technical problems is as follows:

an acoustic event detection system based on a feature decomposition classifier and adaptive post-processing, comprising: a feature extraction network, a feature decomposition classifier and a self-adaptive post-processing module,

the feature extraction network is used for acquiring high-level feature representation of input audio features;

the feature decomposition classifier is used for decomposing high-level feature representation and selecting corresponding sub-feature spaces according to different event types; outputting a frame level detection result according to the sub-feature space information;

the self-adaptive post-processing module is used for performing smoothing processing on the frame level detection result to obtain a final detection result.

In some embodiments, in the above system for detecting acoustic events based on a feature decomposition classifier and adaptive post-processing, the feature extraction network includes: a complex teacher model and a lightweight student model.

In some embodiments, in the above acoustic event detection system based on feature decomposition classifier and adaptive post-processing, the complex teacher model includes: 5 groups of double-layer convolutional neural network modules and a 2-layer recurrent neural network module, wherein the 2-layer recurrent neural network module is used for extracting temporal information from the output of the convolutional neural network modules;

the lightweight student model includes: 3 groups of single-layer convolutional neural network modules and a 2-layer recurrent neural network module;

wherein each convolutional neural network module comprises: normalization layer, pooling layer, and activation function.

In some embodiments, in the above acoustic event detection system based on the feature decomposition classifier and the adaptive post-processing, the number of nodes of the recurrent neural network module is the same as the number of nodes of the last layer of the convolutional neural network module connected to the upper layer.

In some embodiments, in the above acoustic event detection system based on a feature decomposition classifier and adaptive post-processing, the feature extraction network is further configured to:

learning characteristic information of audio data of different labeling types in a mode of combining supervised learning and unsupervised learning;

the different annotation types of audio data include: strongly labeled audio data, weakly labeled audio data, and unlabeled audio data.

In some embodiments, in the above-described acoustic event detection system based on feature decomposition classifier and adaptive post-processing,

supervised learning uses a cross entropy loss function;

unsupervised learning uses a mean square error loss function;

calculating a mean square error loss function between the complex teacher model and the lightweight student model; with the progress of training, the lightweight student model tends to be stable, and the complex teacher model is finely adjusted by using a smaller weight mu.

In some embodiments, in the above acoustic event detection system based on the feature decomposition classifier and the adaptive post-processing, the overall loss function is represented as:

[formula not reproduced in the source text]

wherein the overall loss function comprises a classification loss function at the event level and an acoustic event detection loss function at the frame level;

the individual terms represent, respectively, the weak-label loss of the complex teacher model, the weak-label loss of the lightweight student model, the strong-label loss of the complex teacher model and the strong-label loss of the lightweight student model; and two further terms represent the complex teacher model guiding the student model and the student model fine-tuning the teacher model.

In some embodiments, in the above-described acoustic event detection system based on the feature decomposition classifier and the adaptive post-processing, the consistency loss function between the complex teacher model and the lightweight student model is minimized during each iterative training, so that the predicted result output features of the lightweight student model and the complex teacher model tend to be consistent.

In some embodiments, in the above acoustic event detection system based on a feature decomposition classifier and adaptive post-processing, the feature decomposition classifier is further configured to:

calculating a characteristic subspace dimension reference factor to be reserved for the event type:

and calculating the dimension of the high-level feature representation to be reserved for the single event type according to the feature subspace dimension reference factor to be reserved for the event type.

In some embodiments, in the above acoustic event detection system based on a feature decomposition classifier and adaptive post-processing, the feature decomposition classifier further includes: a classifier section, the classifier section comprising: an acoustic event detection task branch and an acoustic event classification task branch;

the acoustic event detection task branch comprises: a fully connected layer with a large hidden state, followed by a group of fully connected layers, equal in number to the preset event types, each with a sigmoid activation function, wherein each single fully connected layer performs a binary classification operation; the binary classification operation is used for determining whether an event is present in the feature information of each frame;

the acoustic event classification task branch comprises: an attention module.

In some embodiments, in the above acoustic event detection system based on the feature decomposition classifier and the adaptive post-processing, the acoustic event detection task branch is used to determine a frame-level detection posterior probability;

the acoustic event classification task branch is used to determine the classification posterior probability at the event level.

In some embodiments, in the above acoustic event detection system based on the feature decomposition classifier and the adaptive post-processing, the adaptive post-processing module smooths the frame-level detection result to obtain the final detection result by:

determining the average duration of different events according to the event distribution in the strong label data set in the training set;

and determining a median filtering window according to the characteristics of the target event, and performing post-processing operation on the frame-level detection posterior probability to obtain a final detection result.
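The two post-processing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the scaling factor `beta`, the frame hop `hop_s`, the 0.5 decision threshold and all function names are hypothetical, since the text only states that the window is chosen from the characteristics of the target event.

```python
from statistics import mean, median

def average_durations(strong_labels):
    """Average duration in seconds per event type.

    strong_labels: iterable of (event_type, onset_s, offset_s) tuples,
    as would be read from the strongly labeled part of the training set.
    """
    durations = {}
    for event, onset, offset in strong_labels:
        durations.setdefault(event, []).append(offset - onset)
    return {event: mean(values) for event, values in durations.items()}

def window_length(avg_duration_s, hop_s=0.02, beta=1 / 3):
    """Median-filter window in frames (odd, at least 1).

    beta and hop_s are hypothetical values chosen for illustration.
    """
    frames = max(1, round(avg_duration_s * beta / hop_s))
    return frames if frames % 2 == 1 else frames + 1

def median_filter(values, window):
    """1-D median filter with edge replication."""
    half = window // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(values))]

def smooth_and_threshold(posteriors, window, threshold=0.5):
    """Smooth frame-level posteriors, then binarize them."""
    return [int(p > threshold) for p in median_filter(posteriors, window)]

# Toy strong labels: a short event type and a long one.
labels = [("dog", 0.0, 0.3), ("dog", 1.0, 1.3), ("vacuum", 0.0, 9.0)]
windows = {e: window_length(d) for e, d in average_durations(labels).items()}

# Noisy frame-level posteriors for the short event type.
noisy = [0.1, 0.9, 0.1, 0.8, 0.9, 0.2, 0.9, 0.9, 0.1, 0.1]
decisions = smooth_and_threshold(noisy, windows["dog"])
```

With these assumed values the short event type gets a 5-frame window and the long one a 151-frame window, and the isolated spike and gap in `noisy` are merged into a single contiguous detection.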

The beneficial effects of the invention are as follows. An acoustic event detection system based on a feature decomposition classifier and adaptive post-processing comprises a feature extraction network, a feature decomposition classifier and an adaptive post-processing module. The feature extraction network is used for obtaining a high-level feature representation of the input audio features. The feature decomposition classifier is used for decomposing the high-level feature representation, selecting the corresponding sub-feature space for each event type, and outputting a frame-level detection result according to the sub-feature space information. The adaptive post-processing module is used for smoothing the frame-level detection result to obtain the final detection result. The system can learn the feature information of a single event in a targeted manner according to the target event, and the feature decomposition for the target event reduces the interference of other events' features in overlapping events. In addition, the adaptive post-processing method filters out erroneous temporal information contained in the model's predictions and smooths the event distribution in the audio, which greatly improves the accuracy of event timestamp detection.

Drawings

Fig. 1 is a diagram illustrating an acoustic event detection system based on a feature decomposition classifier and adaptive post-processing according to an embodiment of the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.

In order that the above objects, features and advantages of the present application can be more clearly understood, the present disclosure will be further described in detail with reference to the accompanying drawings and examples. It is to be understood that the embodiments described are only a few embodiments of the present disclosure, and not all embodiments. The specific embodiments described herein are merely illustrative of the disclosure and are not limiting of the application. All other embodiments that can be derived by one of ordinary skill in the art from the description of the embodiments are intended to be within the scope of the present disclosure.

It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

Fig. 1 is a diagram illustrating an acoustic event detection system based on a feature decomposition classifier and adaptive post-processing according to an embodiment of the present invention.

An acoustic event detection system based on a feature decomposition classifier and adaptive post-processing, with reference to fig. 1, includes: a feature extraction network 101, a feature decomposition classifier 102 and an adaptive post-processing module 103,

the feature extraction network 101 is used to obtain advanced feature representations of input audio features;

specifically, the feature extraction network 101 in the embodiment of the present application mainly includes a convolutional neural network, a recurrent neural network, and the down-sampling layer, normalization layer and activation function that accompany the convolutional neural network layers, and is used to obtain a high-level feature representation of the input audio features;

the feature decomposition classifier 102 is used for decomposing high-level feature representations and selecting corresponding sub-feature spaces according to different event types; outputting a frame level detection result according to the sub-feature space information;

specifically, in the embodiment of the present application, the feature decomposition classifier 102 selects corresponding sub-feature spaces through a decomposition algorithm for different event types according to the high-level feature representation output by the feature extraction network, and outputs the posterior probability of the predicted event, that is, the frame-level detection result, according to the new sub-space feature information.

The adaptive post-processing module 103 is configured to perform smoothing on the frame level detection result to obtain a final detection result.

Specifically, in the embodiment of the present application, the adaptive post-processing module 103 is configured to perform relevant statistical analysis on the priori knowledge of the data set, and perform smoothing on the posterior probability of the event output by the feature decomposition classifier 102, that is, the frame level detection result, to obtain a final detection result.

In some embodiments, in the above system for detecting acoustic events based on a feature decomposition classifier and adaptive post-processing, the feature extraction network includes: a complex teacher model and a lightweight student model.

In some embodiments, in the above acoustic event detection system based on feature decomposition classifier and adaptive post-processing, the complex teacher model includes: 5 groups of double-layer convolutional neural network modules and a 2-layer recurrent neural network module, wherein the 2-layer recurrent neural network module is used for extracting temporal information from the output of the convolutional neural network modules;

the lightweight student model includes: 3 groups of single-layer convolutional neural network modules and a 2-layer recurrent neural network module;

wherein each convolutional neural network module comprises: normalization layer, pooling layer, and activation function.

Specifically, in the embodiment of the application, the complex teacher model and the lightweight student model are built by combining a convolutional neural network, a recurrent neural network, and the down-sampling layer, normalization layer and activation function that accompany the convolutional neural network layers.

For the complex teacher model, the input audio features first pass through a normalization layer. Normalizing each mini-batch of data fed to the network makes the regularities in the audio data easier to learn and speeds up the training of the complex teacher model. The normalization layer is followed by 5 groups of double-layer convolutional neural network modules, each of which includes a normalization layer, a down-sampling layer and an activation function. In the complex teacher model, the down-sampling layers in the convolutional neural network modules down-sample in both the frequency domain and the time domain; because each down-sampling step uses the same ratio, the complex teacher model can better learn more detailed feature information of different dimensions.

For the lightweight student model, the input features first pass through a normalization layer, which is followed by 3 groups of single-layer convolutional neural network modules. The structure of each convolutional neural network module is the same as in the complex teacher model, except that the down-sampling layer down-samples only in the frequency domain. The time domain keeps its original feature dimension, with no temporal compression, which preserves the integrity of the time-dimension information in the audio features and thereby enables better event boundary detection. In addition, the lightweight student model can learn different feature information while reducing the number of model parameters and improving training efficiency.
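The effect of the two down-sampling schemes on the time axis can be illustrated with simple shape arithmetic. The sketch below is illustrative only: the input size (500 frames by 64 frequency bins) and the per-group pooling factors are assumed values that do not come from the source.

```python
def pooled_shape(frames, bins, pool_factors):
    """Shape after a stack of pooling layers.

    pool_factors: one (time_factor, freq_factor) pair per CNN group.
    """
    for time_factor, freq_factor in pool_factors:
        frames //= time_factor
        bins //= freq_factor
    return frames, bins

# Hypothetical factors: the teacher halves both axes in each of its 5 groups,
# the student pools only along frequency in each of its 3 groups.
teacher_pools = [(2, 2)] * 5
student_pools = [(1, 4)] * 3

# Assumed input: 500 frames x 64 frequency bins.
teacher_shape = pooled_shape(500, 64, teacher_pools)
student_shape = pooled_shape(500, 64, student_pools)
```

Under these assumptions the teacher output has 15 time steps, whereas the student keeps all 500, so each student prediction still corresponds to a single input frame when event boundaries are located.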

A 2-layer recurrent neural network module is added after the convolutional neural network modules of both the complex teacher model and the lightweight student model. Specifically, a bidirectional gated recurrent neural network is selected to extract temporal information from the output of the convolutional neural network modules.

In some embodiments, in the above acoustic event detection system based on the feature decomposition classifier and the adaptive post-processing, the number of nodes of the recurrent neural network module is the same as the number of nodes of the last layer of the convolutional neural network module connected to the upper layer.

Specifically, the number of nodes of the recurrent neural network module in the embodiment of the present application is the same as the number of nodes of the last layer of the convolutional neural network module connected to the upper layer, so that it is advantageous to further obtain time dimension feature information according to the high-level feature representation output by the convolutional neural network module.

In some embodiments, in the above acoustic event detection system based on a feature decomposition classifier and adaptive post-processing, the feature extraction network is further configured to:

learning characteristic information of audio data of different labeling types in a mode of combining supervised learning and unsupervised learning;

specifically, in the embodiment of the application, in the successive iterative learning of a teacher-student model in a feature extraction network, feature information in audio data of different labeling types is fully learned in a mode of combining supervised learning and unsupervised learning, and the performance of the whole acoustic event detection system is greatly improved.

The different annotation types of audio data include: strongly labeled audio data, weakly labeled audio data, and unlabeled audio data.

Specifically, in the embodiment of the present application, the strongly labeled audio data includes an event type and event timestamp information, the weakly labeled audio data includes an event type and does not include event timestamp information, and the unlabeled audio data does not include an event type and event timestamp information.
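The three annotation types can be pictured as one record with optional fields. The following is a minimal sketch; the `Clip` class, its field names and the helper functions are hypothetical and are introduced only to make the distinction concrete.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

@dataclass
class Clip:
    """One training clip; class and field names are hypothetical."""
    path: str
    # Strong label: (event_type, onset_s, offset_s) per event occurrence.
    events: Optional[List[Tuple[str, float, float]]] = None
    # Weak label: event types present in the clip, without timestamps.
    tags: Optional[Set[str]] = None

def label_type(clip: Clip) -> str:
    """Return which of the three annotation types a clip carries."""
    if clip.events is not None:
        return "strong"
    if clip.tags is not None:
        return "weak"
    return "unlabeled"

def weak_tags(clip: Clip) -> Optional[Set[str]]:
    """Clip-level tags, derived from the strong label when one exists."""
    if clip.events is not None:
        return {event for event, _, _ in clip.events}
    return clip.tags

strong = Clip("a.wav", events=[("dog", 0.5, 1.2), ("speech", 0.0, 3.0)])
weak = Clip("b.wav", tags={"dog"})
unlabeled = Clip("c.wav")
```

A strongly labeled clip can therefore also serve as a weakly labeled one, while an unlabeled clip can contribute only to the unsupervised part of training.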

In some embodiments, in the above-described acoustic event detection system based on feature decomposition classifier and adaptive post-processing,

supervised learning uses a cross entropy loss function;

unsupervised learning uses a mean square error loss function;

calculating a consistency loss function between the complex teacher model and the lightweight student model; with the progress of training, the lightweight student model tends to be stable, and the complex teacher model is finely adjusted by using a smaller weight mu.

Specifically, in the embodiment of the application, a consistency loss function, namely a mean square error loss function, is calculated between the complex teacher model and the lightweight student model, so that the complex teacher model guides the training of the lightweight student model in subsequent training iterations. When the models gradually converge, the lightweight student model in turn fine-tunes the complex teacher model through the weighted consistency loss function, so that the complex teacher model is further optimized.

In some embodiments, in the above acoustic event detection system based on the feature decomposition classifier and the adaptive post-processing, the overall loss function is represented as:

[formula not reproduced in the source text]

wherein the overall loss function comprises a classification loss function at the event level and an acoustic event detection loss function at the frame level;

the individual terms represent, respectively, the weak-label loss of the complex teacher model, the weak-label loss of the lightweight student model, the strong-label loss of the complex teacher model and the strong-label loss of the lightweight student model; and two further terms represent the complex teacher model guiding the student model and the student model fine-tuning the teacher model.

In some embodiments, in the above-described acoustic event detection system based on the feature decomposition classifier and the adaptive post-processing, the consistency loss function between the complex teacher model and the lightweight student model is minimized during each iterative training, so that the predicted result output features of the lightweight student model and the complex teacher model tend to be consistent.
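The loss terms named above can be combined as in the following sketch. It is one plausible arrangement and not the formula of the application: the unweighted sum of the supervised terms, the default value of `mu`, the choice of frame-level posteriors as the quantity compared by the consistency terms, and all function names are assumptions made for illustration.

```python
import math

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over paired probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def overall_loss(t_clip, s_clip, clip_target,
                 t_frame, s_frame, frame_target, mu=0.1):
    """Sum of the six terms named in the text (the plain sum is assumed)."""
    # Weak-label (event-level) losses of teacher and student.
    weak = bce(t_clip, clip_target) + bce(s_clip, clip_target)
    # Strong-label (frame-level) losses of teacher and student.
    strong = bce(t_frame, frame_target) + bce(s_frame, frame_target)
    # Teacher guides student: in an autodiff framework the teacher output
    # would be held constant so that only the student is updated.
    guide = mse(s_frame, t_frame)
    # Student fine-tunes teacher: the student output would be held constant,
    # and the term is down-weighted by the small factor mu.
    finetune = mu * mse(t_frame, s_frame)
    return weak + strong + guide + finetune

# Toy posteriors: 2 event types at clip level, 3 frames at frame level.
p, y = [0.9, 0.2], [1, 0]
q, s, z = [0.8, 0.1, 0.7], [0.6, 0.3, 0.5], [1, 0, 1]
agree = overall_loss(p, p, y, q, q, z)
differ = overall_loss(p, p, y, q, s, z)
```

The two consistency terms have the same numerical value in this forward-only sketch; they differ only in which model would receive the gradient. When teacher and student outputs coincide, both terms vanish and only the supervised losses remain.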

In the prior art, high-level feature representations corresponding to audio features can be obtained. For a multi-label classification task, however, when a certain event type often occurs together with other event types, classifying directly from the high-level feature representation makes it difficult to distinguish the individual event types. That is, for event types with insufficient recognizable information in the training set, the formation of the high-level feature subspace is largely disturbed by the event types that co-occur with them. This effect is exacerbated when, in an unbalanced set, the number of audio segments carrying much recognizable information for certain event types is particularly small. To mitigate this effect, the present application uses a feature decomposition classifier to decompose the high-level feature representation and re-model a plurality of feature subspaces for a plurality of event types, as described in detail below.

In some embodiments, in the above acoustic event detection system based on a feature decomposition classifier and adaptive post-processing, the feature decomposition classifier is further configured to:

calculating a feature subspace dimension reference factor to be preserved for the event type;

calculating, according to the feature subspace dimension reference factor $k_c$ to be preserved for the event type, the dimension of the high-level feature representation to be preserved for a single event type c.

Specifically, in the embodiment of the present application, each event type uses a different part of the high-level feature representation rather than the entire feature space, and the high-level feature space is decomposed into feature subspaces in advance according to prior information about each event type. To this end, the feature subspace dimension reference factor $k_c$ to be preserved for event type c is calculated first:

$k_c = [((1 - n) \cdot l_c + n) \cdot d]$

Assume that, for event type c, the larger the proportion of audio segments containing little interference from other event types, the more recognizable information there is for learning the event type, and thus the larger the feature space that is needed. Conversely, the smaller the proportion of these segments, the smaller the feature space required, in order to prevent overfitting. For this reason, $k_c$ increases with the proportion of these class-c audio segments. Because too small a $k_c$ severely diminishes the ability of the model to recognize event type c, the present application mitigates this effect by using a constant factor n (0 ≤ n ≤ 1), where $l_c$ (0 ≤ $l_c$ ≤ 1) is related to the number of audio segments in the training set that contain interference. As n increases to 1, the feature decomposition degenerates to selecting the entire feature space. The level of interference is quantified according to the principle that the more event types an audio segment covers, the more interference the other event types cause to any one of them, namely:

In this application, $N_{ci}$ represents the number of audio segments in the training set containing class i, and $v_i$ are the corresponding constant coefficients representing the importance of these audio segments. The present application determines $v_i$ under the assumption that the less interference other event types cause to any one event type in a segment, the more important the segment is.

Finally, from the calculated feature subspace dimension reference factor $k_c$ to be preserved for event type c, the dimension of the high-level feature representation to be preserved for a single event type c can be obtained:

$D_{fea} = F_{dim} \cdot k_c$
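The behavior of the reference factor formula can be checked with a short sketch. It assumes that the square brackets denote rounding up (ceiling) and treats `d` as a plain number taken directly from the formula, since the source does not define either. The function name `subspace_factor` and the sample values are illustrative only.

```python
import math


def subspace_factor(l_c, n, d):
    """Evaluate k_c = [((1 - n) * l_c + n) * d].

    Assumption: the bracket is read as a ceiling; the source does not say
    which rounding is intended.
    """
    if not 0.0 <= n <= 1.0:
        raise ValueError("n must lie in [0, 1]")
    if not 0.0 <= l_c <= 1.0:
        raise ValueError("l_c must lie in [0, 1]")
    return math.ceil(((1.0 - n) * l_c + n) * d)


# k_c grows with l_c, and n = 1 keeps the full value d regardless of l_c.
print(subspace_factor(l_c=0.25, n=0.0, d=8))  # little clean data, no floor
print(subspace_factor(l_c=0.5, n=0.5, d=8))   # floor raised by n
print(subspace_factor(l_c=0.25, n=1.0, d=8))  # degenerates to the whole space
```

The constant n acts as a lower bound on the retained share: even when $l_c$ is 0, the factor never falls below n · d.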

In some embodiments, in the above acoustic event detection system based on a feature decomposition classifier and adaptive post-processing, the feature decomposition classifier further includes a classifier section, the classifier section comprising: an acoustic event detection task branch and an acoustic event classification task branch;

the acoustic event detection task branch comprises: a fully connected layer with a large hidden state, followed by a plurality of fully connected layers, equal in number to the preset event types, each with a sigmoid activation function, wherein each individual fully connected layer performs a binary classification operation; the binary classification operation is used to determine whether an event is present in the feature information of each frame;

the acoustic event classification task branch comprises: an attention module.

In some embodiments, in the above acoustic event detection system based on the feature decomposition classifier and the adaptive post-processing, the acoustic event detection task branch is used to determine a frame-level detection posterior probability;

the acoustic event classification task branch is used to determine the classification posterior probability at the event level.

Specifically, in the embodiment of the application, the acoustic event detection task and the classification task are split into two independent branches. Based on the plurality of event-specific feature subspaces obtained by the feature decomposition classifier, the acoustic event detection task branch first uses a fully connected layer with a large hidden state, followed by a plurality of fully connected layers, equal in number to the preset event types, each with a sigmoid activation function. Each individual fully connected layer performs a binary classification to determine whether an event is present in the feature information of each frame, which yields the frame-level detection posterior probability. In the acoustic event classification branch, the outputs of the convolutional neural network module and the recurrent neural network module are concatenated as the input features of a "linear" layer, followed by an attention module, after which the classification posterior probability at the event level is obtained.
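The per-event binary heads of the detection branch can be sketched as follows. This is a simplified illustration, not the patented network: it omits the large hidden layer and the attention branch, and it assumes that each event type reads the leading `dims` entries of the frame feature vector as its subspace, which the source does not specify. The names `frame_posteriors`, `heads`, `dims`, `weights` and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def frame_posteriors(frames, heads):
    """Compute a T x C matrix of frame-level posteriors.

    frames: list of T feature vectors (lists of floats).
    heads:  one dict per event type with keys 'dims', 'weights', 'bias'.
    Assumption: the subspace of an event is the first 'dims' feature entries.
    """
    result = []
    for feat in frames:
        row = []
        for head in heads:
            sub = feat[: head["dims"]]  # event-specific feature subspace
            score = sum(w * x for w, x in zip(head["weights"], sub)) + head["bias"]
            row.append(sigmoid(score))  # binary decision: event present or not
        result.append(row)
    return result


frames = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
heads = [
    {"dims": 2, "weights": [1.0, 1.0], "bias": 0.0},
    {"dims": 3, "weights": [1.0, -1.0, 0.0], "bias": 0.0},
]
probs = frame_posteriors(frames, heads)
print(probs)
```

Because every event type has its own sigmoid head, several events can be active in the same frame, which is what overlapping-event detection requires.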

In audio classification and acoustic event detection tasks, the frame-level prediction output of the model is often discontinuous. For example, in real conditions the captured audio may contain a large amount of background noise or abnormal non-target events, and the many outliers that arise during detection may produce too many target events of extremely short duration, resulting in inaccurate timestamp detection. The traditional approach is to apply linear or non-linear filters to smooth the prediction output. For multi-target event detection under complex conditions, however, the duration of each event in an audio segment varies widely, so conventional median filtering with a fixed window size is no longer applicable.

In view of this, in the above acoustic event detection system based on the feature decomposition classifier and the adaptive post-processing, the adaptive post-processing module is configured to perform smoothing processing on the frame-level detection result to obtain a final detection result, and the method includes:

determining the average duration of different events according to the event distribution in the strong label data set in the training set;

and determining a median filtering window according to the characteristics of the target event, and performing post-processing operation on the frame-level detection posterior probability to obtain a final detection result.

Specifically, in the embodiment of the present application, a median filter bank with an adaptive window size is calculated according to the distribution statistical rule of the strong tag training data and the average duration of the target event. Furthermore, given that each event duration is not evenly distributed, it may not be optimal to use the average duration to optimize the median filtering window size. Thus, the design uses an event-specific median filter window size, as follows:

where $W_c$ is the median filter window size for class c (c = 1, 2, 3, ...); $N_c$ is the number of segments of class-c target events, sorted from short to long, over which the cumulative distribution function is computed; $L_i$ is the duration of the i-th segment of event c; and β is a scaling factor, set to 1/3 in the experiments. All strong-label audio data participate in the calculation of the median filter window $W_c$.
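Event-specific median filtering can be sketched as follows. Because the exact window formula is not recoverable from the text, the sketch assumes a simpler rule: the window is β times the mean duration of the class, in frames, forced to an odd integer of at least 1. The function names, the edge-replication padding, and the sample data are likewise assumptions.

```python
from statistics import median


def window_from_durations(durations_in_frames, beta=1 / 3):
    """Assumed rule: window = beta * mean duration, made odd and >= 1."""
    mean_duration = sum(durations_in_frames) / len(durations_in_frames)
    window = max(1, round(beta * mean_duration))
    if window % 2 == 0:
        window += 1  # an odd window keeps the filter centered on each frame
    return window


def median_filter(sequence, window):
    """Median-filter a 1-D sequence, replicating the edge values as padding."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if not sequence:
        return []
    half = window // 2
    padded = [sequence[0]] * half + list(sequence) + [sequence[-1]] * half
    return [median(padded[t:t + window]) for t in range(len(sequence))]


# Hypothetical strong-label durations (in frames) for one event class.
window = window_from_durations([9, 9, 9])
raw = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
smoothed = median_filter(raw, window)
print(window, smoothed)
```

With a window of 3 frames, the isolated spike at index 2 is removed and the one-frame gap at index 8 is filled. A class with longer typical durations would receive a wider window and therefore stronger smoothing.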

In addition, element-wise multiplication is applied between the audio classification prediction posterior probability and the audio event detection frame-level posterior probability to ensure consistency between the audio event detection and classification results.
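The multiplication step can be sketched as follows, assuming the frame-level posteriors form a T x C matrix and the clip-level classification posteriors form a vector of length C that is broadcast over all frames. The function name `gate_frames` and the sample values are illustrative.

```python
def gate_frames(frame_probs, clip_probs):
    """Multiply each frame-level posterior by the clip-level posterior of its class."""
    for row in frame_probs:
        if len(row) != len(clip_probs):
            raise ValueError("each frame needs one value per event class")
    return [[p * q for p, q in zip(row, clip_probs)] for row in frame_probs]


frame_probs = [[0.5, 1.0], [1.0, 0.5]]  # 2 frames, 2 event classes
clip_probs = [0.5, 0.0]                 # class 1 judged absent from the clip
gated = gate_frames(frame_probs, clip_probs)
print(gated)
```

A class that the classification branch considers absent from the clip has its frame-level activity suppressed, so the two branches cannot report contradictory results.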

The F1 score is a statistical index used to measure the accuracy of a classification model. It takes both the precision and the recall of the classification model into account and can be regarded as the harmonic mean of the two, with a maximum value of 1 and a minimum value of 0. It is calculated as follows:

$F_1 = \frac{2 \cdot P \cdot R}{P + R}$, where P denotes the precision and R denotes the recall.
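As a worked example, the standard F1 computation from counts of true positives, false positives and false negatives can be written as below. The function name `f1_score` and the counts are illustrative. The source does not state whether events are matched per segment or per event, so the sketch works on raw counts only.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


score = f1_score(true_pos=8, false_pos=2, false_neg=2)
print(score)
```

Here precision and recall are both 0.8, so the F1 score is 0.8 as well. When one of the two is much lower than the other, the harmonic mean stays close to the lower value.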

Verification shows that the acoustic event detection system based on the feature decomposition classifier and adaptive post-processing significantly improves event detection performance.

Those skilled in the art will appreciate that although some embodiments described herein include some features included in other embodiments instead of others, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments.

Those skilled in the art will appreciate that the description of each embodiment has a respective emphasis, and reference may be made to the related description of other embodiments for those parts of an embodiment that are not described in detail.

Although the embodiments of the present application have been described in conjunction with the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the application, and such modifications and variations fall within the scope defined by the appended claims. The scope of the present invention is not limited to the specific embodiments described; any equivalent modification or substitution that a person skilled in the art can readily conceive within the technical scope of the present disclosure is intended to be included in the scope of the present invention.

While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
