Singing voice detection method based on a squeeze-and-excitation residual network

Document No.: 1310320    Publication date: 2020-07-10

Note: this technology, "A singing voice detection method based on a squeeze-and-excitation residual network", was designed and created by Gui Wenming on 2020-03-11. Summary: The invention provides a singing voice detection method based on a Squeeze-and-Excitation Residual Neural Network (SE-ResNet). The method comprises the following steps: construct squeeze-and-excitation residual networks; construct a music data set; convert the music data set into an image set; train each constructed network with the training image set; test each trained network with the test image set; select the network with the highest test accuracy as the final singing voice detection network; and perform singing voice detection on the audio file under test with the selected network. The invention implicitly extracts singing voice features at different levels through a deep residual network and uses the adaptive attention property of the embedded squeeze-and-excitation modules to judge the importance of these features, thereby identifying the singing voice.

1. A singing voice detection method based on a squeeze-and-excitation residual network, characterized by comprising the following steps:

S1: construct squeeze-and-excitation residual networks of depth d_i for singing voice detection;

S1.1: the squeeze-and-excitation residual network is a combination of two network structures, the residual network and the squeeze-and-excitation network;

the initial input to the network is an image, and the final network output is 2 values o_0, o_1, used to judge whether singing voice is present;

S1.2: let the input image be x, x ∈ R^{H×W}, and the output be o, o ∈ R^{2×1}; representing the constructed squeeze-and-excitation residual network by the function F, the action of the whole network on the input can be expressed as:

o = F(x)    (2)

S2: construct a music data set

S2.1: collect a music data set for singing voice detection; a good data set satisfies the following conditions:

(1) the total duration is not less than 120 minutes;

(2) the total durations of the music segments containing singing voice and of those without singing voice are balanced;

(3) the distribution of music genres covers the genres to be detected and is balanced;

S2.2: annotate the audio file of each piece of music; mark the start and end time of every segment containing singing voice, labelling all time points within a segment as 1 if it contains singing voice and as 0 otherwise; write all annotations into a text file;

S2.3: randomly divide the music data set into a training set, a validation set and a test set, with the training set containing not less than 50% of the samples;

S3: convert the music data set into an image set and a corresponding annotation set

S3.1: convert the music data set into a set of log-mel spectrogram files

Process each music audio file in the music data set and convert it into a file containing a log-mel spectrogram. The calculation is as follows: first compute the spectrogram of the audio signal, with audio sampling rate f_s = 22050 Hz, frame length l = 1024 and frame shift h = 315; then convert the spectrogram into a mel spectrogram, taking 80 mel bands over the frequency interval [27.5, 8000] Hz, the number of mel bands corresponding to the number of rows H of the spectrogram; finally take the logarithm of the magnitudes in the mel spectrogram to obtain the log-mel spectrogram. One log-mel spectrogram is equivalent to a data matrix A(H, L), where L is determined by the length of the audio;

S3.2: convert the set of log-mel spectrogram files into an image set and a corresponding annotation set

S3.2.1: read the log-mel spectrogram files of the training set one by one;

S3.2.2: extract image data x of size 80 × 80 starting from the first column of the log-mel spectrogram, and compute the time t_{W/2} of the 40th column, i.e. the middle position, of the image; query the annotation of that time point in the corresponding audio file: if the annotation p_file(t_{W/2}) = 1, label the image annotation p_x as singing voice, otherwise as not singing voice; put the extracted image into the image set and the corresponding annotation into the annotation set, keeping the sequence numbers of the two sets identical to facilitate retrieval;

p_x = p_file(t_{W/2})    (3)

S3.2.3: move the extraction position in the log-mel spectrogram h_1 columns to the right, read 80 × 80 image data again, compute the annotation, and continue filling the image set and annotation set until the log-mel spectrogram file has been fully read;

S3.2.4: after all spectrogram files of the training set have been processed, they have been converted into an image set and a corresponding annotation set;

S3.2.5: apply the operations of steps 3.2.1 to 3.2.4 on the training set to the validation set and the test set as well, generating their image sets and annotation sets; let the total numbers of images in the validation set and the test set be N_v and N_t respectively;

S4: use the training set images obtained in step S3 to train the 7 squeeze-and-excitation residual networks of depth d_i, i ∈ [14, 18, 34, 50, 101, 152, 200], validating with the validation set during training

S4.1: for the network of depth d_i, start the e-th round of training;

S4.1.1: if this is the first round of training, set e = 0, set the maximum number of training rounds E, initialize the current maximum validation detection accuracy a_imax of network d_i to 0, set the consecutive count s = 0, and set the maximum of the consecutive count to S;

otherwise, proceed to S4.1.2;

S4.1.2: take images and annotations, sequentially or randomly, from the training image set and the corresponding annotation set and input them into the squeeze-and-excitation residual network d_i for training;

S4.1.3: when all images in the training set have been taken out and trained on, the e-th round of training ends;

S4.2: after the e-th round of training, verify the trained network d_i with the validation set; the validation algorithm is as follows:

S4.2.1: sequentially take images and corresponding annotations from the validation image set and annotation set;

S4.2.2: input each image into the squeeze-and-excitation residual network d_i after e rounds of training; each image yields 2 output values o_0, o_1, and the category corresponding to the larger output value is taken as the final classification result;

S4.2.3: if the annotation of the image is the same as the final classification result, the result is counted as correct;

S4.2.4: repeat steps 4.2.1 to 4.2.3 until all N_v images in the validation image set have been processed;

S4.2.5: count the number of image samples in the validation set classified correctly, denoted T_i, and compute the detection accuracy of network d_i, a_ie = T_i / N_v; if a_ie > a_imax, set a_imax = a_ie and reset the consecutive count s = 0,

otherwise, set s = s + 1;

S4.2.6: if the consecutive count s reaches S, i.e. the detection accuracy has not increased for S consecutive rounds, training ends;

S4.2.7: if the consecutive count s is less than S, set e = e + 1; if e >= E, training ends,

otherwise, jump to step S4.1 and continue training;

S4.3: through the above training algorithm, the squeeze-and-excitation residual network d_i with fixed parameters is finally obtained;

S5: test and compare all trained squeeze-and-excitation residual networks d_i with the test set

S5.1: sequentially take images and corresponding annotations from the test image set and annotation set;

S5.2: input each image into the squeeze-and-excitation residual network d_i trained in step S4; each image yields 2 output values, and the category corresponding to the larger value is taken as the final classification result;

S5.3: if the annotation of the image is the same as the final classification result, the test result is correct; count the number of correctly classified image samples in the test set, denoted T_i, and compute the detection accuracy of network d_i, a_i = T_i / N_t;

S5.4: compare the values a_i; the network corresponding to the maximum value is determined as the finally adopted network, denoted d.

2. The singing voice detection method based on a squeeze-and-excitation residual network as claimed in claim 1, wherein for the depths d_i, i ∈ [14, 18, 34, 50, 101, 152, 200, …], 18, 34, 50, 101 and 152 are typical depths of squeeze-and-excitation residual networks, while 14 and 200 are depths constructed by the invention;

the structure of the 7-depth extrusion and excitation residual error network constructed by the invention is as follows:

3. The singing voice detection method based on a squeeze-and-excitation residual network as claimed in claim 1, further comprising:

S6: perform singing voice detection on the music audio file to be detected

S6.1: convert the audio file to be detected into a log-mel spectrogram file and an image set according to the method of step S3; here there is no annotation set;

S6.2: input the images one by one into the trained, selected optimal network d; each image yields 2 output values, and the category corresponding to the larger value is taken as the final classification result;

S6.3: aggregate the detection results of all images; since each image corresponds to one moment of the music, the singing voice detection result of the whole piece is obtained;

S6.4: the time resolution of the singing voice detection result of the invention is t_p = h_1 × h / f_s = 5 × 315 / 22050 ≈ 0.0714 seconds ≈ 71.4 milliseconds, and the detection duration covered by each image is t_x = W × h / f_s = 80 × 315 / 22050 ≈ 1.143 seconds.

Technical Field

The invention relates to the field of music artificial intelligence, and in particular to a singing voice detection method based on a Squeeze-and-Excitation Residual Neural Network (SE-ResNet).

Background

1. Related concepts and application fields of the invention

Singing Voice Detection (SVD), as referred to herein, determines whether each small segment of a piece of music, given as digital audio, contains human singing voice. Besides the human voice, each short segment generally also contains instrument sounds, and judging whether a segment mixing instruments and voice contains singing is a challenging task. Singing voice detection is illustrated schematically in figure 1.

Singing voice detection is an important foundational task in the field of music artificial intelligence; much other research, such as singer identification, singing voice separation and lyrics alignment, requires it as a prerequisite or enhancing technology. For example, in singer identification, detecting the singing voice first is necessary: only the detected singing segments can be used to identify the singer. Singing voice detection is a binary classification problem for each small segment of audio. Denoting a segment of audio by X and the classification function by f, with output 1 if the segment contains singing voice and 0 otherwise, the singing voice detection problem can be expressed in the following form:

f(X) = 1 if X contains singing voice, f(X) = 0 otherwise    (1)

2. General procedure and prior art for singing voice detection

The singing voice detection process generally comprises preprocessing, feature extraction, classification and post-processing. Preprocessing mainly includes audio signal denoising, frequency-band division and the like; applying singing voice separation techniques is also helpful to some extent. Feature extraction and classification are the two key steps in singing voice detection.

Commonly used features include LPC (Linear Predictive Coefficients), PLPC (Perceptual Linear Predictive Coefficients), the zero-crossing rate (ZCR), Mel-Frequency Cepstral Coefficients (MFCCs), and so on.

Classification applies machine learning methods to the extracted features; the main classifiers include the Support Vector Machine (SVM), the Hidden Markov Model (HMM) and the Random Forest (RF), as well as the Deep Neural Network (DNN) methods that have appeared in recent years. Methods using CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks) have improved singing voice detection accuracy to some extent [1], but there is still room for improvement.

Post-processing mainly uses techniques such as smoothing to fine-tune the classification results, further improving the final detection accuracy.

The references cited in the present invention are as follows:

[1] K. Lee, K. Choi, J. Nam. Revisiting Singing Voice Detection: A Quantitative Review and the Future Outlook. arXiv preprint arXiv:1806.01180, 2018.

disclosure of Invention

The invention aims to improve singing voice detection accuracy and provides a singing voice detection algorithm based on a Squeeze-and-Excitation Residual Neural Network (SE-ResNet).

In order to solve the above problems, the technical solution adopted by the present invention includes the following steps, as shown in fig. 2:

1. Construct squeeze-and-excitation residual networks for singing voice detection, with depths d_i, i ∈ [14, 18, 34, 50, 101, 152, 200, …]

Among the above depths d_i, 18, 34, 50, 101 and 152 are typical depths of squeeze-and-excitation residual networks, while 14 and 200 are depths constructed by the present invention; those skilled in the art can construct other depths suited to their singing voice detection data sets to obtain possibly better networks.

1.1 The squeeze-and-excitation residual network is a combination of two network structures: the residual network and the squeeze-and-excitation network. As shown in fig. 3, the dashed box is a block diagram of the squeeze-and-excitation network, and the residual network outside the dashed box uses two types of structures, based on the Basic block and on the Bottleneck block (as shown in fig. 4); which of the two is used is selected according to the number of network layers. The structures of the squeeze-and-excitation residual networks of the 7 depths constructed by the invention are as follows (the structure table is not reproduced in the source):

The networks of depths 14, 18 and 34 consist of residual structures based on basic blocks, and the networks of depths 50, 101, 152 and 200 consist of residual structures based on bottleneck blocks. The initial input of these networks is an image of size H × W = 80 × 80, transformed from the music audio signal; how it is transformed is explained in the subsequent steps. The structure table is set out for an 80 × 80 input, and its output-size column gives the output image size of each layer. Before entering the residual stages, the image passes through a 7 × 7 convolutional layer with stride 2 and a 3 × 3 max-pooling layer with stride 2, resulting in a 40 × 40 feature map. The final network output has 2 values o_0, o_1, from which it can be judged whether there is singing voice.
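The building block just described can be made concrete in a few lines. The sketch below, in PyTorch (an assumed framework; the patent names none), shows the squeeze-and-excitation gate of fig. 3 embedded in a basic residual block of fig. 4; the reduction ratio of 16 is a common default, since the text mentions a scale factor r without fixing its value.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation gate: global average pool -> FC+ReLU -> FC+Sigmoid -> scale."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)        # global average pooling (squeeze)
        self.excite = nn.Sequential(                  # excitation: sigmoid-based gate
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                  # scale: reweight input channels

class SEBasicBlock(nn.Module):
    """Basic residual block (two 3x3 convs, fig. 4 left) with the SE gate embedded."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.se = SEBlock(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.se(self.bn2(self.conv2(out)))      # gate the residual branch
        return self.relu(out + x)                     # residual connection

# shape check on a hypothetical 64-channel feature map:
block = SEBasicBlock(64)
y = block(torch.randn(1, 64, 40, 40))                 # -> torch.Size([1, 64, 40, 40])
```

A full network of any of the 7 depths would stack such blocks (or their bottleneck counterparts) behind the stem described above, ending in a fully connected layer with the 2 outputs o_0, o_1.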

1.2 Let the input image be x, x ∈ R^{H×W}, and the output be o, o ∈ R^{2×1}; representing the constructed squeeze-and-excitation residual network by the function F, the action of the whole network on the input can be expressed as:

o = F(x)    (2)

2. Construct the music data set

2.1 Collect a music data set for singing voice detection. A good data set typically satisfies the following conditions: (1) the more data the better, but the total duration should be not less than 120 minutes; (2) the total durations of the music segments containing singing voice and of those without singing voice are balanced; (3) the distribution of music genres covers the genres to be detected and is balanced.

2.2 Annotate the audio file of each piece of music: mark the start and end times of the singing voice segments, labelling all time points within a segment as 1 if it contains singing voice and as 0 otherwise. All annotations are written to a text file.

2.3 Randomly divide the music data set into a training set, a validation set and a test set, with the training set containing not less than 50% of the samples.

3. Convert the music data set into an image set and a corresponding annotation set

3.1 Convert the music data set into a set of log-mel spectrogram files

Each music audio file in the music data set (including the training, validation and test sets) is processed and converted into a file containing a log-mel spectrogram. The calculation is as follows: first compute the spectrogram of the audio signal, with audio sampling rate f_s = 22050 Hz, frame length l = 1024 and frame shift h = 315; then convert the spectrogram into a mel spectrogram, taking 80 mel bands over the frequency interval [27.5, 8000] Hz, the number of mel bands corresponding to the number of rows H of the spectrogram; finally take the logarithm of the magnitudes in the mel spectrogram to obtain the log-mel spectrogram. One log-mel spectrogram is equivalent to a data matrix A(H, L), where L is determined by the length of the audio.
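As an illustration, this conversion can be written with librosa (an assumed library choice; the patent does not name an implementation) using exactly the parameters above:

```python
import librosa
import numpy as np

def log_mel_spectrogram(path: str) -> np.ndarray:
    """Convert one audio file into the data matrix A(H, L) of step 3.1."""
    y, fs = librosa.load(path, sr=22050)      # audio sampling rate f_s = 22050 Hz
    mel = librosa.feature.melspectrogram(
        y=y, sr=fs,
        n_fft=1024,                           # frame length l = 1024
        hop_length=315,                       # frame shift h = 315
        n_mels=80,                            # 80 mel bands -> H = 80 rows
        fmin=27.5, fmax=8000.0,               # frequency interval [27.5, 8000] Hz
    )
    return np.log(mel + 1e-10)                # log magnitude; small offset avoids log(0)
```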

3.2 Convert the set of log-mel spectrogram files into an image set and a corresponding annotation set

3.2.1 Read the log-mel spectrogram files of the training set one by one.

3.2.2 Extract image data x of size 80 × 80 starting from the first column of the log-mel spectrogram, and compute the time t_{W/2} of the 40th column, i.e. the middle position, of the image. Query the annotation of that time point in the corresponding audio file: if the annotation p_file(t_{W/2}) = 1, label the image annotation p_x as singing voice, otherwise as not singing voice. Put the extracted image into the image set and the corresponding annotation into the annotation set, keeping the sequence numbers of the two sets identical to facilitate retrieval.

p_x = p_file(t_{W/2})    (3)

3.2.3 Move the extraction position in the log-mel spectrogram h_1 = 5 columns to the right, read 80 × 80 image data again, compute the annotation, and continue filling the image set and annotation set until the log-mel spectrogram file has been fully read.

3.2.4 After all spectrogram files of the training set have been processed, they have been converted into an image set and a corresponding annotation set.

3.2.5 Apply the operations of steps 3.2.1 to 3.2.4 on the training set to the validation set and the test set as well, generating their image sets and annotation sets. Let the total numbers of images in the validation set and the test set be N_v and N_t respectively. A sketch of steps 3.2.1 to 3.2.5 is given below.
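The following compact sketch covers the windowing and labelling of steps 3.2.1 to 3.2.5; the annotation lookup label_at is a hypothetical helper standing for the query of step 3.2.2, and the column shift defaults to h_1 = 5 as in step 3.2.3.

```python
import numpy as np

def extract_images(A: np.ndarray, label_at, fs=22050, hop=315,
                   width=80, column_shift=5):
    """Slide an 80-column window over A(H, L); label each window at its centre column."""
    images, labels = [], []
    for start in range(0, A.shape[1] - width + 1, column_shift):
        img = A[:, start:start + width]            # 80 x 80 image data x
        t_mid = (start + width // 2) * hop / fs    # time t_{W/2} of the 40th column
        images.append(img)
        labels.append(label_at(t_mid))             # p_x = p_file(t_{W/2})
    return np.stack(images), np.array(labels)
```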

4. Use the training set images obtained in step 3 to train the 7 squeeze-and-excitation residual networks of depth d_i, i ∈ [14, 18, 34, 50, 101, 152, 200], validating with the validation set during training

4.1 For the network of depth d_i, start the e-th round of training.

4.1.1 If this is the first round of training, set e = 0, set the maximum number of training rounds E, initialize the current maximum validation detection accuracy a_imax of network d_i to 0, set the consecutive count s = 0, and set the maximum of the consecutive count to S;

otherwise, proceed to 4.1.2.

4.1.2 Take images and corresponding annotations, sequentially or randomly, from the training image set and annotation set and input them into the squeeze-and-excitation residual network d_i for training.

4.1.3 When all images in the training set have been taken out and trained on, the e-th round of training ends.

4.2 After the e-th round of training, verify the trained network d_i with the validation set; the validation algorithm is as follows:

4.2.1 Sequentially take images and corresponding annotations from the validation image set and annotation set.

4.2.2 Input the images into the squeeze-and-excitation residual network d_i after e rounds of training; each image yields 2 output values o_0, o_1, and the category corresponding to the larger output value is taken as the final classification result. (For example, o_0 > o_1 means the no-singing-voice value exceeds the singing-voice value, so the final classification result for this image is no singing voice.)

4.2.3 If the annotation of the image is the same as the final classification result, the result is counted as correct.

4.2.4 Repeat steps 4.2.1 to 4.2.3 until all N_v images in the validation image set have been processed.

4.2.5 Count the number of image samples in the validation set classified correctly, denoted T_i, and compute the detection accuracy of network d_i, a_ie = T_i / N_v. If a_ie > a_imax, set a_imax = a_ie and reset the consecutive count s = 0;

otherwise, set s = s + 1.

4.2.6 If the consecutive count s reaches S, i.e. the detection accuracy has not increased for S consecutive rounds, training ends.

4.2.7 If the consecutive count s is less than S, set e = e + 1; if e >= E, training ends,

otherwise, jump to step 4.1 and continue training.

4.3 Through the above training algorithm, the squeeze-and-excitation residual network d_i with fixed parameters is finally obtained. A condensed sketch of this training loop is given below.
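In the sketch below, train_one_epoch and validate are caller-supplied callables standing for the routines steps 4.1 and 4.2 describe in prose; the defaults E = 20 and S = 7 are the values given later in the embodiment.

```python
def train_with_early_stopping(net, train_one_epoch, validate,
                              max_epochs=20, patience=7):
    """Train one network d_i; stop when validation accuracy stalls for `patience` rounds."""
    best_acc, stale = 0.0, 0                 # a_imax and consecutive count s
    for epoch in range(max_epochs):          # e = 0 .. E-1
        train_one_epoch(net)                 # step 4.1: one pass over the training set
        acc = validate(net)                  # step 4.2: a_ie = T_i / N_v
        if acc > best_acc:
            best_acc, stale = acc, 0         # a_imax = a_ie, s = 0
        else:
            stale += 1                       # s = s + 1
        if stale >= patience:                # step 4.2.6: S rounds without improvement
            break
    return net, best_acc
```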

5. Test and compare all trained squeeze-and-excitation residual networks d_i, i ∈ [14, 18, 34, 50, 101, 152, 200], with the test set

5.1 Sequentially take images and corresponding annotations from the test image set and annotation set.

5.2 Input each image into the squeeze-and-excitation residual network d_i trained in step 4; each image yields 2 output values, and the category corresponding to the larger value is taken as the final classification result.

5.3 If the annotation of the image is the same as the final classification result, the test result is correct. Count the number of correctly classified image samples in the test set, denoted T_i, and compute the detection accuracy of network d_i, a_i = T_i / N_t.

5.4 Compare the values a_i for i = 14, 18, 34, 50, 101, 152 and 200; the network corresponding to the maximum value is determined as the finally adopted network, denoted d.
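Step 5 thus reduces to evaluating each trained network once and keeping the argmax. A minimal sketch, where nets maps depth to trained network and test_accuracy is a placeholder callable returning a_i = T_i / N_t:

```python
def select_best_network(nets, test_accuracy):
    """Pick the finally adopted network d from the candidates d_i (step 5.4)."""
    accuracies = {depth: test_accuracy(net) for depth, net in nets.items()}
    best_depth = max(accuracies, key=accuracies.get)
    return nets[best_depth], best_depth, accuracies
```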

6. Singing voice detection for music audio file to be detected

6.1 Following step 3, convert the audio file to be detected into a log-mel spectrogram file and an image set; here there is no annotation set.

6.2 Input the images one by one into the trained, selected optimal network d; each image yields 2 output values, and the category corresponding to the larger value is taken as the final classification result.

6.3 Aggregate the detection results of all images; since each image corresponds to one moment of the music, the singing voice detection result of the whole piece is obtained.

6.4 Time resolution of the singing voice detection result of the invention: t_p = h_1 × h / f_s = 5 × 315 / 22050 ≈ 0.0714 seconds ≈ 71.4 milliseconds; the detection duration covered by each image is t_x = W × h / f_s = 80 × 315 / 22050 ≈ 1.143 seconds.
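Putting steps 6.2 to 6.4 together: each image index k maps back to the time of its centre column, so consecutive decisions are t_p ≈ 71.4 ms apart. A sketch, where net is any callable returning the two scores (o_0, o_1) for one image:

```python
def detect_timeline(net, images, fs=22050, hop=315, column_shift=5, width=80):
    """Map per-image decisions back to time instants (steps 6.2-6.4)."""
    timeline = []
    for k, img in enumerate(images):
        o0, o1 = net(img)                               # o_0: no singing, o_1: singing
        t = (k * column_shift + width // 2) * hop / fs  # time of the image's centre column
        timeline.append((t, o1 > o0))                   # True where singing voice detected
    return timeline
```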

The beneficial effects of the invention are as follows:

the invention provides a singing voice detection method based on an extrusion-excitation residual Neural Network (SE-ResNet). The method is characterized in that the method comprises the steps of designing extrusion and excitation residual error networks with different depths, constructing a music data set, training, verifying, testing and comparing, and finally selecting the trained network with the best effect as a singing voice detection classifier. When singing voice detection is carried out, a simple logarithmic Mel time-frequency diagram is calculated and converted into an image, and the image is input into the selected network, so that the task can be completed. The invention implicitly extracts the characteristics of singing voice of different levels through a deep residual error network, and utilizes the extrusion sum embedded in the residual error networkThe importance of the features is judged by the self-adaptive attention characteristics of the excitation module, so that the singing voice is identified by using the features with high importance degree, and the aim of detecting the singing voice with high accuracy is fulfilled. Document [1]]As a third-party evaluation paper, three methods of representative random forest, CNN and RNN are realized, the singing voice detection accuracy rate of the music data set Jamendo is 0.879, 0.868 and 0.875 respectively, and in the embodiment of the invention, the selected trained extrusion and excitation residual error network d is adopted34The accuracy under jamenda is 0.897, 1.8% higher than the highest reported in this document.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings described below are only part of the drawings of the embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic diagram of singing voice detection (accompanied by a color picture).

FIG. 2 is a schematic flow diagram of the present invention.

Fig. 3 is a schematic diagram of a squeeze and excitation residual network.

Fig. 4 shows the two module types of the residual network architecture.

In fig. 1, the upper half is the audio waveform and the lower half the corresponding spectrogram; the portions coloured yellow are detected as containing singing voice, while the remaining portions contain none.

In fig. 3, H, W and C are the height, width and number of channels of the image. Global is a global average pooling layer, representing the Squeeze operation; the Excitation operation comprises 4 steps, forming a sigmoid-based gating mechanism. The first fully connected layer FC with ReLU reduces the number of channels by a scaling factor r, providing dimensionality reduction and generalization; the second fully connected layer FC with Sigmoid restores the number of channels; finally, a Scale step reweights the original input channels.

The left diagram in fig. 4 is an example of a Basic block with 64 input channels, containing 2 convolutional layers; the right diagram is an example of a Bottleneck block with 256 input channels, containing 3 convolutional layers.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention provides a singing voice detection method based on a squeeze-and-excitation residual network, comprising the following steps:

1. Construct squeeze-and-excitation residual networks for singing voice detection, with depths d_i, i ∈ [14, 18, 34, 50, 101, 152, 200]

In this embodiment, squeeze-and-excitation residual networks of depths d_i, i ∈ [14, 18, 34, 50, 101, 152, 200], are preferably constructed by way of example. The 5 depths 18, 34, 50, 101 and 152 are typical depths of squeeze-and-excitation residual networks, while 14 and 200 are depths constructed in this embodiment; those skilled in the art can construct other depths suited to their singing voice detection data sets to obtain possibly better networks.

1.1 The squeeze-and-excitation residual network is a combination of two network structures: the residual network and the squeeze-and-excitation network. As shown in fig. 3, the dashed box is a block diagram of the squeeze-and-excitation network, and the residual network outside the dashed box uses two types of structures, based on the Basic block and on the Bottleneck block (as shown in fig. 4); which of the two is used is selected according to the number of network layers. The structures of the squeeze-and-excitation residual networks of the 7 depths constructed by the invention are as follows (the structure table is not reproduced in the source):

The networks of depths 14, 18 and 34 consist of residual structures based on basic blocks, and the networks of depths 50, 101, 152 and 200 consist of residual structures based on bottleneck blocks. The initial input of these deep neural networks is an image of size H × W = 80 × 80, transformed from the music audio signal as explained in the subsequent steps; the structure table's output-size column gives the output size of each layer for an 80 × 80 input. The image first passes through a 7 × 7 convolutional layer with stride 2 and a 3 × 3 max-pooling layer with stride 2, giving a 40 × 40 feature map, and then enters the stacked squeeze-and-excitation residual stages. Taking the depth-101 network as an example, each stage stacks squeeze-and-excitation residual bottleneck blocks; the bracketed expressions in the structure table give the convolutions of one block (the last stage outputs 2048 channels), and the multiplier outside the brackets (e.g. × 3) gives how many such blocks are stacked in series. After the residual stages, the row "average pool, 2-d, fc" denotes a 2-dimensional adaptive average pooling layer that reduces the feature map to a one-dimensional vector, followed by a fully connected layer. The final network output is a one-dimensional vector o containing 2 values o_0, o_1, used to judge whether singing voice is present. In this embodiment, o_0 and o_1 correspond to no singing voice and singing voice respectively throughout training, validation and testing.
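Since the structure table itself did not survive extraction, the stage configurations below are given for orientation only; they follow the standard ResNet family for the typical depths, while the depth-14 and depth-200 variants are the authors' own constructions, so their entries here are assumptions (depth 200 shown with the layout known from the ResNet literature, depth 14 with a plausible guess).

```python
# depth: (block type, number of residual blocks per stage)
STAGE_CONFIGS = {
    14:  ("basic",      [1, 2, 2, 1]),   # assumption: not specified in the text
    18:  ("basic",      [2, 2, 2, 2]),
    34:  ("basic",      [3, 4, 6, 3]),
    50:  ("bottleneck", [3, 4, 6, 3]),
    101: ("bottleneck", [3, 4, 23, 3]),  # the depth-101 example described above
    152: ("bottleneck", [3, 8, 36, 3]),
    200: ("bottleneck", [3, 24, 36, 3]), # assumption: standard ResNet-200 layout
}
# depth check: basic -> 2 + 2 * sum(blocks); bottleneck -> 2 + 3 * sum(blocks)
```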

1.2 Let the input image be x, x ∈ R^{H×W}, and the output be o, o ∈ R^{2×1}; representing the constructed squeeze-and-excitation residual network by the function F, the action of the whole network on the input can be expressed as:

o = F(x)    (4)

2. Construct the music data set

2.1 Collect a music data set for singing voice detection. A good data set satisfies the following conditions: (1) the more data the better, but the total duration should be not less than 120 minutes; (2) the total durations of the music segments containing singing voice and of those without singing voice are balanced; (3) the distribution of music genres covers the genres to be detected and is balanced.

2.2 Annotate the audio file of each piece of music: mark the start and end times of the singing voice segments, labelling all time points within a segment as 1 if it contains singing voice and as 0 otherwise. All annotations are written to a text file.

2.3 Randomly divide the music data set into a training set, a validation set and a test set, with the training set containing not less than 50% of the samples.

2.4 In this embodiment, Jamendo is chosen as the experimental data set. Jamendo is an internationally published data set usable for singing voice detection, containing 93 songs with a total duration of 371 minutes. The audio file of each piece has been annotated, and the corpus is divided into a training set, a validation set and a test set containing 61, 16 and 16 songs respectively.

3. Convert the Jamendo music data set into an image set and a corresponding annotation set

3.1 Convert the music data set into a set of log-mel spectrogram files

Each music audio file in the music data set (including the training, validation and test sets) is processed and converted into a file containing a log-mel spectrogram. The calculation is as follows: first compute the spectrogram of the audio signal, with audio sampling rate f_s = 22050 Hz, frame length l = 1024 and frame shift h = 315; then convert the spectrogram into a mel spectrogram, taking 80 mel bands over the frequency interval [27.5, 8000] Hz, the number of mel bands corresponding to the number of rows H of the spectrogram; finally take the logarithm of the magnitudes in the mel spectrogram to obtain the log-mel spectrogram. One log-mel spectrogram is equivalent to a data matrix A(H, L), where L is determined by the length of the audio.

3.2 Convert the set of log-mel spectrogram files into an image set and a corresponding annotation set

3.2.1 Read the log-mel spectrogram files of the training set one by one.

3.2.2 Extract image data x of size 80 × 80 starting from the first column of the log-mel spectrogram, and compute the time t_{W/2} of the 40th column, i.e. the middle position, of the image. Query the annotation of that time point in the corresponding audio file: if the annotation p_file(t_{W/2}) = 1, label the image annotation p_x as singing voice, otherwise as not singing voice. Put the extracted image into the image set and the corresponding annotation into the annotation set, keeping the sequence numbers of the two sets identical to facilitate retrieval.

p_x = p_file(t_{W/2})    (5)

3.2.3 Move the extraction position in the log-mel spectrogram h_1 = 5 columns to the right, read 80 × 80 image data again, compute the annotation, and continue filling the image set and annotation set until the log-mel spectrogram file has been fully read.

3.2.4 After all spectrogram files of the training set have been processed, they have been converted into an image set and a corresponding annotation set.

3.2.5 Apply the operations of steps 3.2.1 to 3.2.4 to the validation set and the test set as well, generating their image sets and annotation sets. Let the total numbers of images in the validation set and the test set be N_v and N_t respectively.

4. Use the samples of the constructed Jamendo training set to train the 7 squeeze-and-excitation residual networks of depth d_i, i ∈ [14, 18, 34, 50, 101, 152, 200], validating with the validation set during training

4.1 For the network of depth d_i, start the e-th round of training.

4.1.1 If this is the first round of training, set e = 0, set the maximum number of training rounds E, initialize the current maximum validation detection accuracy a_imax of network d_i to 0, set the consecutive count s = 0, and set the maximum of the consecutive count to S; in this embodiment, S = 7 and E = 20;

otherwise, proceed to 4.1.2.

4.1.2 In this embodiment, images and corresponding annotations are preferably taken randomly from the training image set and annotation set and input into the squeeze-and-excitation residual network d_i for training.

4.1.3 When all images in the training set have been taken out and trained on, the e-th round of training ends.

4.2 After the e-th round of training, verify the trained network d_i with the validation set; the validation algorithm is as follows:

4.2.1 Sequentially take images and corresponding annotations from the validation image set and annotation set.

4.2.2 Input the images into the squeeze-and-excitation residual network d_i after e rounds of training; each image yields 2 output values o_0, o_1, and the category corresponding to the larger output value is taken as the final classification result. (For example, o_0 > o_1 means the no-singing-voice value exceeds the singing-voice value, so the final classification result for this image is no singing voice.)

4.2.3 If the annotation of the image is the same as the final classification result, the result is counted as correct.

4.2.4 Repeat steps 4.2.1 to 4.2.3 until all N_v images in the validation image set have been processed.

4.2.5 Count the number of image samples in the validation set classified correctly, denoted T_i, and compute the detection accuracy of network d_i, a_ie = T_i / N_v. If a_ie > a_imax, set a_imax = a_ie and reset the consecutive count s = 0;

otherwise, set s = s + 1.

4.2.6 If the consecutive count s reaches S, i.e. the detection accuracy has not increased for S consecutive rounds, training ends.

4.2.7 If the consecutive count s is less than S, set e = e + 1; if e >= E, training ends,

otherwise, jump to step 4.1 and continue training.

4.3 Through the above training algorithm, the squeeze-and-excitation residual network d_i with fixed parameters is finally obtained.

5. Test and compare the trained squeeze-and-excitation residual networks d_i, i ∈ [14, 18, 34, 50, 101, 152, 200], with the Jamendo test set

5.1 Sequentially take images and corresponding annotations from the Jamendo test image set and annotation set.

5.2 Input each image into the squeeze-and-excitation residual network d_i trained in step 4; each image yields 2 output values, and the category corresponding to the larger value is taken as the final classification result.

5.3 If the annotation of the image is the same as the final classification result, the test result is correct. Count the number of correctly classified image samples in the test set, denoted T_i, and compute the detection accuracy of network d_i, a_i = T_i / N_t.

5.4 Compare the values a_i for i = 14, 18, 34, 50, 101, 152 and 200, and take the network corresponding to the maximum value as the finally adopted network. After the above tests, the accuracies of the networks d_i, i ∈ [14, 18, 34, 50, 101, 152, 200], are 0.8904, 0.8772, 0.8970, 0.8779, 0.864, 0.8850 and 0.8818 respectively, so the finally adopted network is d_34. Note that the best network may vary from data set to data set; those skilled in the art should therefore experiment with the data set they construct and select the appropriate network.

6. Singing voice detection for music audio file to be detected

6.1 Following step 3, convert the audio file to be detected into a log-mel spectrogram file and an image set; there is no annotation set.

6.2 Input the images one by one into the trained, selected optimal network d_34; each image yields 2 output values, and the category corresponding to the larger value is taken as the final classification result.

6.3 Aggregate the detection results of all images; since each image corresponds to one moment of the music, the singing voice detection result of the whole piece is obtained.

6.4 Time resolution of the singing voice detection result of this embodiment: t_p = h_1 × h / f_s = 5 × 315 / 22050 ≈ 0.0714 seconds ≈ 71.4 milliseconds; the detection duration covered by each image is t_x = W × h / f_s = 80 × 315 / 22050 ≈ 1.143 seconds.

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
