Audio recognition model training method and tone similarity detection method

Document No.: 70634  Publication date: 2021-10-01  Views: 33  Original language: Chinese

Reading note: this technique, "Audio recognition model training method and tone similarity detection method" (音频识别模型训练方法, 音色相似度检测方法), was designed and created by Tan Zhili (谭志力) and Hu Shichao (胡诗超) on 2021-07-16. Main content: the application discloses an audio recognition model training method and a timbre similarity detection method. The audio recognition model takes two noisy timbre features and the similarity between them as input data, and outputs a correction result for that similarity. The correction process need not assume that noise level, audio duration, and similarity are linearly related, nor estimate the noise level in the audio, so computation cost and time consumption are reduced and the accuracy of timbre similarity correction is improved.

1. An audio recognition model training method, comprising:

acquiring a first noiseless audio and a second noiseless audio from a training sample library;

respectively adding random noise to the first noiseless audio and the second noiseless audio to obtain a first noisy audio corresponding to the first noiseless audio and a second noisy audio corresponding to the second noiseless audio;

inputting the first noiseless audio, the second noiseless audio, the first noisy audio and the second noisy audio into a timbre extraction model respectively, and extracting a first noiseless timbre feature, a second noiseless timbre feature, a first noisy timbre feature and a second noisy timbre feature respectively;

calculating a noisy similarity of the first noisy audio and the second noisy audio based on the first noisy timbre feature and the second noisy timbre feature, and calculating a similarity offset true value of the first noiseless audio and the second noiseless audio based on the first noiseless timbre feature, the second noiseless timbre feature, the first noisy timbre feature and the second noisy timbre feature;

inputting the first noisy timbre feature, the second noisy timbre feature and the noisy similarity into a neural network model to be trained, so that the neural network model outputs a similarity offset prediction value of the first noisy audio and the second noisy audio;

calculating a loss value between the similarity offset prediction value and the similarity offset true value, and adding the loss value to a target loss set;

adjusting model parameters of the neural network model based on each loss value in the target loss set;

and acquiring two noiseless audios from the training sample library again to iteratively train the updated neural network model until a model convergence condition is reached, and outputting an audio recognition model.

2. The method of claim 1, further comprising:

acquiring a noise-free similarity true value of the first noise-free audio and the second noise-free audio;

determining a noise-free similarity prediction value of the first noisy audio and the second noisy audio using the neural network model;

and calculating a loss value between the predicted value of the noise-free similarity and the true value of the noise-free similarity, and adding the loss value to the target loss set.

3. The method of claim 1, further comprising:

acquiring a probability true value that the first noiseless audio and the second noiseless audio belong to the same timbre;

determining a probability prediction value that the first noisy audio and the second noisy audio belong to the same timbre by using the neural network model;

and calculating a loss value between the probability prediction value and the probability true value, and adding the loss value to the target loss set.

4. The method of any of claims 1 to 3, further comprising:

acquiring a first true signal-to-noise ratio of the first noisy audio;

determining a first predicted signal-to-noise ratio for the first noisy audio using the neural network model;

calculating a loss value between the first predicted signal-to-noise ratio and the first true signal-to-noise ratio and adding the loss value to the target loss set;

and/or

acquiring a second true signal-to-noise ratio of the second noisy audio;

determining a second predicted signal-to-noise ratio for the second noisy audio using the neural network model;

calculating a loss value between the second predicted signal-to-noise ratio and the second true signal-to-noise ratio and adding the loss value to the target loss set;

and/or

acquiring a first real length of the first noisy audio;

determining a first predicted length of the first noisy audio using the neural network model;

calculating a loss value between the first predicted length and the first real length and adding the loss value to the target loss set;

and/or

acquiring a second real length of the second noisy audio;

determining a second predicted length of the second noisy audio using the neural network model;

calculating a loss value between the second predicted length and the second real length and adding the loss value to the target loss set.

5. A tone similarity detection method is characterized by comprising the following steps:

acquiring a first audio and a second audio;

inputting the first audio and the second audio into a timbre extraction model respectively, so that the timbre extraction model outputs a first timbre feature corresponding to the first audio and a second timbre feature corresponding to the second audio;

calculating a similarity to be corrected of the first timbre feature and the second timbre feature;

inputting the first timbre feature, the second timbre feature and the similarity to be corrected into an audio recognition model, so that the audio recognition model outputs a similarity detection result; the audio recognition model is obtained by training using the method of any one of claims 1 to 4;

determining a timbre similarity of the first audio and the second audio based on the similarity detection result.

6. The method of claim 5, wherein the calculating the to-be-corrected similarity of the first and second timbre features comprises:

calculating the similarity to be corrected based on PLDA or cosine distance.

7. The method of claim 5, wherein:

if the similarity detection result is a probability value that the first audio and the second audio belong to the same timbre, determining the probability value as the timbre similarity;

or

if the similarity detection result is the noise-free similarity of the first timbre feature and the second timbre feature, determining the noise-free similarity as the timbre similarity;

or

if the similarity detection result is the offset between the similarity to be corrected and the timbre similarity, determining the sum of the similarity to be corrected and the offset as the timbre similarity.

8. The method of claim 5, wherein before inputting the first timbre feature, the second timbre feature and the similarity to be corrected into an audio recognition model, further comprising:

optimizing the similarity to be corrected by using a linear formula; the linear formula is:

S' = W0 + W1·S + W2·SNRx + W3·SNRy + W4·Lx + W5·Ly

wherein S' is the similarity to be corrected after optimization, S is the similarity to be corrected before optimization, SNRx is the signal-to-noise ratio of the first audio, SNRy is the signal-to-noise ratio of the second audio, Lx is the length of the first audio, Ly is the length of the second audio, W0 is a preset bias parameter, and W1, W2, W3, W4 and W5 are preset weights.
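As a sketch, the linear formula above can be coded directly. The weight values below are placeholder assumptions, since the application does not give the preset weights:

```python
def optimize_similarity(s, snr_x, snr_y, l_x, l_y,
                        w=(0.0, 1.0, 0.01, 0.01, 0.001, 0.001)):
    """Return S' = W0 + W1*S + W2*SNRx + W3*SNRy + W4*Lx + W5*Ly.

    The default weights are illustrative placeholders, not values from
    the application; in practice they would be fitted on training data.
    """
    w0, w1, w2, w3, w4, w5 = w
    return w0 + w1 * s + w2 * snr_x + w3 * snr_y + w4 * l_x + w5 * l_y
```

With W1 = 1 and all other parameters zero, the formula degenerates to S' = S, i.e. no optimization is applied.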

9. An electronic device, comprising a processor and a memory; wherein the memory is for storing a computer program which is loaded and executed by the processor to implement the method of any of claims 1 to 8.

10. A storage medium having stored thereon computer-executable instructions which, when loaded and executed by a processor, carry out a method according to any one of claims 1 to 8.

Technical Field

The application relates to the technical field of computers, in particular to an audio recognition model training method and a timbre similarity detection method.

Background

At present, the singer timbre recognition function is widely applied in scenes such as song recommendation and singer identity confirmation. However, limited by non-professional equipment and environments, users easily mix noise (microphone fricatives, environmental background noise, etc.) into their recorded singing, which poses a challenge to the accuracy of timbre recognition.

Currently, the timbre similarity of two songs can be detected and corrected using a linear equation, namely by taking a weighted sum of the original similarity score and information such as the song's noise level and duration. This approach assumes that noise level, duration, and similarity are linearly related, but in practice they are not, so it is difficult to obtain a good correction effect. Moreover, estimating the noise in a song requires separating the noise-free signal from the noise signal, so the estimate is hard to make accurate and increases computation cost and time consumption.

Disclosure of Invention

In view of this, an object of the present application is to provide an audio recognition model training method and a timbre similarity detection method, so as to improve the accuracy of timbre similarity correction. The specific scheme is as follows:

to achieve the above object, in one aspect, the present application provides an audio recognition model training method, including:

acquiring a first noiseless audio and a second noiseless audio from a training sample library;

respectively adding random noise to the first noiseless audio and the second noiseless audio to obtain a first noisy audio corresponding to the first noiseless audio and a second noisy audio corresponding to the second noiseless audio;

inputting the first noiseless audio, the second noiseless audio, the first noisy audio and the second noisy audio into a timbre extraction model respectively, and extracting a first noiseless timbre feature, a second noiseless timbre feature, a first noisy timbre feature and a second noisy timbre feature respectively;

calculating a noisy similarity of the first noisy audio and the second noisy audio based on the first noisy timbre feature and the second noisy timbre feature, and calculating a similarity offset true value of the first noiseless audio and the second noiseless audio based on the first noiseless timbre feature, the second noiseless timbre feature, the first noisy timbre feature and the second noisy timbre feature;

inputting the first noisy timbre feature, the second noisy timbre feature and the noisy similarity into a neural network model to be trained, so that the neural network model outputs a similarity offset prediction value of the first noisy audio and the second noisy audio;

calculating a loss value between the similarity offset prediction value and the similarity offset true value, and adding the loss value to a target loss set;

adjusting model parameters of the neural network model based on each loss value in the target loss set;

and acquiring two noiseless audios from the training sample library again to iteratively train the updated neural network model until a model convergence condition is reached, and outputting an audio recognition model.

In another aspect, the present application further provides a method for detecting timbre similarity, including:

acquiring a first audio and a second audio;

inputting the first audio and the second audio into a timbre extraction model respectively, so that the timbre extraction model outputs a first timbre feature corresponding to the first audio and a second timbre feature corresponding to the second audio;

calculating a similarity to be corrected of the first timbre feature and the second timbre feature;

inputting the first timbre feature, the second timbre feature and the similarity to be corrected into an audio recognition model, so that the audio recognition model outputs a similarity detection result; the audio recognition model is obtained by training with the audio recognition model training method described above;

determining the timbre similarity of the first audio and the second audio based on the similarity detection result.

In yet another aspect, the present application further provides an electronic device comprising a processor and a memory; wherein the memory is for storing a computer program that is loaded and executed by the processor to implement the method of any of the preceding claims.

In yet another aspect, the present application further provides a storage medium having stored therein computer-executable instructions that, when loaded and executed by a processor, implement the method of any one of the preceding claims.

The method and device of the present application can train an audio recognition model. The model takes two noisy timbre features and their noisy similarity as input data and outputs a similarity offset prediction value for the two timbre features. A loss value between the similarity offset prediction value and the similarity offset true value is then calculated and added to a target loss set; the model parameters of the neural network model are adjusted based on each loss value in the target loss set; and two noiseless audios are acquired from the training sample library again to iteratively train the updated neural network model until the model convergence condition is reached, at which point the audio recognition model is output. Because the audio recognition model is obtained by neural network training and corrects the timbre similarity directly, it need not assume a linear relationship among noise level, duration, and similarity, nor estimate the noise level in the audio; this reduces computation cost and time consumption and improves the accuracy of timbre similarity correction.

Correspondingly, the audio recognition model training components and the timbre similarity detection components (namely the apparatus, device and medium) provided by the application also have the above technical effects.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.

FIG. 1 is a diagram illustrating a physical architecture suitable for use in the present application;

FIG. 2 is a flowchart of a first audio recognition model training method provided in the present application;

FIG. 3 is a flowchart of a second method for training an audio recognition model provided in the present application;

fig. 4 is a flow chart of a method for detecting similarity of timbre provided by the present application;

FIG. 5 is a schematic diagram of an audio recognition model training apparatus provided in the present application;

fig. 6 is a schematic diagram of a timbre similarity detection apparatus provided in the present application;

FIG. 7 is a flowchart of a song classification method provided herein;

FIG. 8 is a schematic diagram of training data for preparing an audio recognition model according to the present application;

FIG. 9 is a schematic diagram of the training tasks of the audio recognition model for timbre similarity detection according to the present application;

FIG. 10 is a block diagram of a server provided by the present application;

fig. 11 is a structural diagram of a terminal according to the present application.

Detailed Description

It is difficult to obtain a good correction effect when correcting the timbre similarity of two songs using a linear equation. Moreover, estimating the noise in a song requires separating the noise-free signal from the noise signal, so the estimate is hard to make accurate and increases computation cost and time consumption.

In view of the above problems, the present application provides an audio recognition model training method and a timbre similarity detection method, which can improve the accuracy of timbre similarity correction.

For ease of understanding, a physical framework to which the present application applies will be described.

It should be understood that the audio recognition model training method and the tone similarity detection method provided by the present application may be applied to a system or a program having a tone similarity detection function. Specifically, the system or the program may be executed in a device such as a server, a personal computer, or the like.

As shown in fig. 1, fig. 1 is a schematic diagram of a physical architecture applicable to the present application. In fig. 1, a system or program having the timbre similarity detection function may run on a server. Through a network, the server obtains from other terminal devices both audio for training the model and audio whose similarity needs to be calculated; the audio may be songs, drama, character conversations, and the like. The server can obtain two noiseless audios for training the model; add random noise to each of them to obtain two corresponding noisy audios; input the two noiseless audios and the two noisy audios into a timbre extraction model respectively to extract four timbre features, namely: a first noiseless timbre feature, a second noiseless timbre feature, a first noisy timbre feature and a second noisy timbre feature; calculate the noisy similarity of the first noisy audio and the second noisy audio based on the first noisy timbre feature and the second noisy timbre feature, and calculate a similarity offset true value of the first noiseless audio and the second noiseless audio based on the four timbre features; input the first noisy timbre feature, the second noisy timbre feature and the noisy similarity into a neural network model to be trained, so that the neural network model outputs a similarity offset prediction value of the first noisy audio and the second noisy audio; calculate a loss value between the similarity offset prediction value and the similarity offset true value, and add it to a target loss set; adjust the model parameters of the neural network model based on each loss value in the target loss set; and obtain two noiseless audios from the training sample library again to iteratively train the updated neural network model until the model convergence condition is reached, at which point the audio recognition model is output.

After the audio recognition model is obtained, the two audios whose similarity needs to be calculated (regardless of whether they contain noise) are respectively input into the timbre extraction model so that it outputs two timbre features; the similarity to be corrected of the two timbre features is calculated; the two timbre features and the similarity to be corrected are input into the audio recognition model so that it outputs a similarity detection result; and finally, the timbre similarity of the first audio and the second audio is determined based on the similarity detection result.

As can be seen, the server can establish communication connections with multiple devices and acquire from them audio meeting the training conditions or audio whose similarity needs to be calculated; by collecting the audio uploaded by these devices, the server can train the audio recognition model. The timbre similarity of two audios can then be calculated and corrected according to the timbre similarity detection method provided by the application. Fig. 1 shows various terminal devices; in an actual scene, more or fewer types of terminal devices may participate in timbre similarity detection, the specific number and types being determined by the actual scene and not limited here. In addition, fig. 1 shows one server, but in an actual scene multiple servers may participate, the specific number again being determined by the actual scene.

It should be noted that the timbre similarity detection method and the corresponding audio recognition model training method provided in this embodiment may be performed offline; that is, the server locally stores the audio meeting the training conditions or the audio whose similarity needs to be calculated, and the scheme provided by this application can be used directly to calculate and correct the timbre similarity.

It is understood that the system or the program with the tone similarity detection function may also be run on a personal mobile terminal, or may also be used as one of cloud service programs, and a specific operation mode is determined according to an actual scene, which is not limited herein. Specifically, the tone color recognition function may be used in scenes such as song recommendation, singer identification, and the like.

With reference to fig. 2, fig. 2 is a flowchart of a first method for training an audio recognition model according to an embodiment of the present disclosure. As shown in fig. 2, the audio recognition model training method may include the following steps:

s201, acquiring a first noiseless audio and a second noiseless audio from a training sample library.

S202, random noise is added to the first noiseless audio and the second noiseless audio respectively, and a first noisy audio corresponding to the first noiseless audio and a second noisy audio corresponding to the second noiseless audio are obtained.

In this embodiment, the first and second noiseless audio may be noiseless songs, dramas, or the like. The noise added to the first noiseless audio and the second noiseless audio respectively may be white noise or audio recorded in a quiet environment. A specific method of adding noise may refer to the related art.

It should be noted that the added random noise may itself contain noise or may be of zero magnitude. That is to say: after random noise is added to noiseless audio, the result may be either noisy audio or still noiseless audio. For convenience, this embodiment collectively refers to all audio to which random noise has been added as noisy audio.
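A sketch of one way to add random noise to a clean waveform at a random signal-to-noise ratio. The SNR range, the zero-noise probability, and the use of white Gaussian noise are illustrative assumptions, not details from the application:

```python
import random

def add_random_noise(clean, snr_db_range=(0.0, 30.0), p_no_noise=0.2):
    """Mix white Gaussian noise into a clean waveform at a random SNR.

    With probability p_no_noise no noise is added, reflecting the note
    above that 'noisy' audio may in fact remain noise-free. All parameter
    values are illustrative assumptions.
    """
    if random.random() < p_no_noise:
        return list(clean)  # still called 'noisy audio' in this embodiment
    snr_db = random.uniform(*snr_db_range)
    # Scale the noise so that signal power / noise power matches the SNR.
    sig_power = sum(x * x for x in clean) / len(clean)
    noise = [random.gauss(0.0, 1.0) for _ in clean]
    noise_power = sum(n * n for n in noise) / len(noise)
    scale = (sig_power / (noise_power * 10 ** (snr_db / 10.0))) ** 0.5
    return [x + scale * n for x, n in zip(clean, noise)]
```

The output has the same length as the input, so the "real length" labels used in claim 4 are unchanged by the noise-adding step.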

S203, inputting the first noiseless audio, the second noiseless audio, the first noisy audio and the second noisy audio into a timbre extraction model respectively, and extracting a first noiseless timbre feature, a second noiseless timbre feature, a first noisy timbre feature and a second noisy timbre feature respectively.

The timbre extraction model in step S203 may be a neural network, which may include convolutional layers, pooling layers, fully-connected layers, and the like; of course, other components are possible. The input data of the timbre extraction model is specifically the spectral data of the audio. Because the audios input to the timbre extraction model differ in length, a fixed dimensionality can be set in the model so that all timbre features it outputs keep the same dimensionality, allowing similarity to be calculated in the subsequent process. The size of this dimensionality requires choosing a suitable value: a larger dimensionality lets the timbre features carry more information, but may lead to overfitting, may also include noise, and the extra information is not necessarily favorable for the subsequent similarity calculation; a smaller dimensionality, though more compact, may leave the timbre features with insufficient information.
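As a minimal illustrative sketch (not the application's actual network, which would include convolutional and fully-connected layers), the way a variable-length input is reduced to a fixed-dimension embedding by pooling over time can be shown as:

```python
def fixed_dim_embedding(spectrogram):
    """Average-pool a variable-length spectrogram over time.

    `spectrogram` is a list of frames, each a list of F spectral bins.
    The result is an F-dimensional timbre embedding regardless of audio
    length. A real model would apply convolutional and fully-connected
    layers before pooling; this sketch keeps only the pooling step that
    fixes the output dimensionality.
    """
    n_frames = len(spectrogram)
    n_bins = len(spectrogram[0])
    return [sum(frame[i] for frame in spectrogram) / n_frames
            for i in range(n_bins)]
```

Two audios of different lengths thus yield embeddings of equal dimension, which is what makes the subsequent similarity calculation possible.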

When the timbre extraction model is trained, noise-free audio with known timbre (i.e., training labels) and the noisy audio obtained by adding random noise to that noise-free audio are used together as the training set, so that the timbre extraction model simultaneously learns to extract timbre features from both noise-free and noisy audio.

After the timbre extraction model is trained, its output data are used as training data for PLDA (Probabilistic Linear Discriminant Analysis) or other similar networks for computing timbre similarity, with the known timbre corresponding to each output as the label.

S204, calculating the noisy similarity of the first noisy audio and the second noisy audio based on the first noisy timbre feature and the second noisy timbre feature, and calculating a similarity offset true value of the first noiseless audio and the second noiseless audio based on the first noiseless timbre feature, the second noiseless timbre feature, the first noisy timbre feature and the second noisy timbre feature.

The noisy similarity in step S204 may be calculated using PLDA, cosine distance, deep learning or other methods; for details, refer to the related art. Likewise, the noise-free similarity between the first noiseless audio and the second noiseless audio can be calculated using PLDA, cosine distance, deep learning or other methods; the difference between the noise-free similarity and the noisy similarity is the similarity offset true value.
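A minimal sketch of the cosine-distance option: computing the noisy similarity and the similarity offset true value from four timbre feature vectors. Plain-Python vectors are used for illustration; a real system would use PLDA or an equivalent scorer:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two timbre feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def similarity_offset_true_value(clean1, clean2, noisy1, noisy2):
    """Offset true value = noise-free similarity minus noisy similarity,
    following the definition in the text above."""
    return cosine_similarity(clean1, clean2) - cosine_similarity(noisy1, noisy2)
```

The offset is the supervision target the neural network learns to predict from the two noisy timbre features and their noisy similarity.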

S205, inputting the first noisy timbre feature, the second noisy timbre feature and the noisy similarity into a neural network model to be trained, so that the neural network model outputs a similarity offset prediction value of the first noisy audio and the second noisy audio.

S206, calculating a loss value between the similarity offset prediction value and the similarity offset true value, and adding the loss value to the target loss set.

S207, judging whether a model convergence condition is reached or not based on the target loss set; if yes, go to step S208; if not, S209 is executed.

And S208, outputting the current neural network model as an audio recognition model.

S209, after the model parameters of the neural network model are adjusted based on each loss value in the target loss set, S201 is executed to iteratively train the updated neural network model.

Step S209 obtains two noiseless audios from the training sample library again, so as to iteratively train the updated neural network model until a model convergence condition is reached, and then outputs an audio recognition model.

It should be noted that, if the neural network model outputs the similarity offset prediction value of the first noisy audio and the second noisy audio, the current neural network model is handling a regression task, and the audio recognition model training process is the learning and training of a regression task. Therefore, the loss value can be calculated with the mean square error or a similar error metric, and the model parameters can be updated accordingly by back-propagation.

The Neural Network model may be a Deep Neural Network (DNN), and the structure may include a fully connected layer, and the like.
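A minimal sketch of the regression training loop described above, using a single linear layer in place of the deep fully-connected network and illustrative input values; all names and numbers are assumptions:

```python
def train_step(weights, bias, features, target, lr=0.01):
    """One SGD step of a single linear 'regression head' on the MSE loss.

    The real model is a deep fully-connected network; a one-layer head is
    used here only to illustrate the regression training loop.
    """
    pred = sum(w * x for w, x in zip(weights, features)) + bias
    err = pred - target                        # d(MSE)/d(pred) up to a factor of 2
    new_w = [w - lr * 2 * err * x for w, x in zip(weights, features)]
    new_b = bias - lr * 2 * err
    return new_w, new_b, err * err             # err**2 is the MSE loss value

# Input = first noisy timbre feature + second noisy timbre feature + noisy similarity
f1, f2, noisy_sim = [0.2, 0.4], [0.1, 0.3], 0.8
x = f1 + f2 + [noisy_sim]
w, b = [0.0] * len(x), 0.0
losses = []                                    # plays the role of the 'target loss set'
for _ in range(200):
    w, b, loss = train_step(w, b, x, target=0.1)  # target = similarity offset true value
    losses.append(loss)
```

After repeated iterations the loss shrinks toward zero, mirroring the convergence condition of step S207.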

As can be seen, in this embodiment the audio recognition model is obtained by neural network training and corrects the timbre similarity directly; it need not assume a linear relationship among noise level, duration, and similarity, nor estimate the noise level in the audio, which reduces computation cost and time consumption and improves the correction accuracy of the timbre similarity.

Based on the above embodiments, it should be noted that, during training, the neural network model may further output a noise-free similarity prediction value, a probability prediction value that the first noisy audio and the second noisy audio belong to the same timbre, a first predicted signal-to-noise ratio of the first noisy audio, a second predicted signal-to-noise ratio of the second noisy audio, a first predicted length of the first noisy audio, a second predicted length of the second noisy audio, and the like. Predicting the probability that the first noisy audio and the second noisy audio belong to the same timbre is a binary classification problem and hence a classification task; the other outputs belong to regression tasks.
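The application does not name the loss used for the classification task; binary cross-entropy is the conventional choice for a binary classification problem. A sketch assuming that convention:

```python
import math

def bce_loss(p_pred, y_true, eps=1e-7):
    """Binary cross-entropy between the predicted probability that two
    noisy audios share the same timbre and the 0/1 ground truth.

    The clamp with eps avoids log(0) for saturated predictions.
    """
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))
```

Like the regression losses, each classification loss value would simply be added to the target loss set before the parameter update.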

In one embodiment, the training process of the neural network model further includes: acquiring a noise-free similarity true value of the first noise-free audio and the second noise-free audio; determining a noise-free similarity prediction value of the first noisy audio and the second noisy audio by using a neural network model; and calculating a loss value between the predicted value of the noise-free similarity and the true value of the noise-free similarity, and adding the loss value to the target loss set.

In one embodiment, the training process of the neural network model further includes: acquiring a probability true value that the first noiseless audio and the second noiseless audio belong to the same timbre; determining a probability prediction value that the first noisy audio and the second noisy audio belong to the same timbre by using the neural network model; and calculating a loss value between the probability prediction value and the probability true value, and adding the loss value to the target loss set.

In one embodiment, the training process of the neural network model further includes: acquiring a first true signal-to-noise ratio of the first noisy audio; determining a first predicted signal-to-noise ratio for the first noisy audio using a neural network model; calculating a loss value between the first predicted signal-to-noise ratio and the first true signal-to-noise ratio, and adding the loss value to a target loss set;

and/or

acquiring a second true signal-to-noise ratio of the second noisy audio; determining a second predicted signal-to-noise ratio for the second noisy audio using the neural network model; calculating a loss value between the second predicted signal-to-noise ratio and the second true signal-to-noise ratio, and adding the loss value to the target loss set;

and/or

acquiring a first real length of the first noisy audio; determining a first predicted length of the first noisy audio using a neural network model; calculating a loss value between the first predicted length and the first real length, and adding the loss value to the target loss set;

and/or

acquiring a second real length of the second noisy audio; determining a second predicted length of the second noisy audio using the neural network model; calculating a loss value between the second predicted length and the second real length, and adding the loss value to the target loss set.
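The per-task loss values accumulated in the target loss set must be combined into a single scalar for back-propagation. The application does not specify how the per-task losses are weighted; a sketch assuming uniform weights:

```python
def total_loss(target_loss_set, weights=None):
    """Combine every loss value in the target loss set into one scalar.

    Uniform weights are an assumption; in a real multi-task setup the
    per-task weights would likely be tuned or learned.
    """
    if weights is None:
        weights = [1.0] * len(target_loss_set)
    return sum(w * l for w, l in zip(weights, target_loss_set))
```

Weighting lets the auxiliary tasks (SNR, length, same-timbre probability) be down-weighted relative to the primary similarity-offset regression if desired.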

Therefore, when the model is actually applied, the corresponding result can be selected according to the actual situation and the requirement.

Referring to fig. 3, fig. 3 is a flowchart of a second audio recognition model training method according to an embodiment of the present disclosure. As shown in fig. 3, the audio recognition model training method may include the following steps:

S301, acquiring a first noiseless audio and a second noiseless audio.

S302, noise is added to the first noiseless audio and the second noiseless audio respectively, and a first noisy audio corresponding to the first noiseless audio and a second noisy audio corresponding to the second noiseless audio are obtained.

S303, the first noisy audio and the second noisy audio are respectively input into the tone extraction model, so that the tone extraction model outputs a first noisy color feature corresponding to the first noisy audio and a second noisy color feature corresponding to the second noisy audio.

S304, calculating the noisy similarity of the first noisy color feature and the second noisy color feature.

S305, inputting the first noisy color characteristic, the second noisy color characteristic and the noisy similarity into the neural network model, so that the neural network model outputs a corrected prediction result of the noisy similarity.

S306, determining which contents are included in the corrected prediction result; if the probability prediction value that the first noisy audio and the second noisy audio belong to the same tone color is included, executing S307; if the similarity prediction value of the noiseless similarity of the first noiseless color feature and the second noiseless color feature is included, executing S308; if the offset prediction value between the similarity prediction value and the noisy similarity is included, executing S309.

Wherein the corrected prediction result may include: a probability prediction value that the first noisy audio and the second noisy audio belong to the same tone color, and/or a similarity prediction value of the noiseless similarity of the first noiseless color feature and the second noiseless color feature (namely a noiseless similarity prediction value), and/or an offset prediction value between the similarity prediction value and the noisy similarity (namely a similarity offset prediction value).

The first noiseless tone features are extracted from the first noiseless audio by the tone extraction model, and the second noiseless tone features are extracted from the second noiseless audio by the tone extraction model.

It can be seen that S307, S308, and S309 may be executed individually, all together, or in any combination of two.

S307, calculating a loss value between the probability predicted value and the true value, and adding the loss value to a target loss set.

Wherein the true value is a probability true value that the first noisy audio and the second noisy audio belong to the same tone.

S308, calculating a first error value between the similarity prediction value and the noise-free similarity, and adding the first error value to the target loss set.

S309, calculating a second error value between the offset predicted value and the real offset, and adding the second error value to the target loss set.

The real offset is the difference between the noise-free similarity and the noisy similarity.

And S310, judging whether a model convergence condition is reached or not based on the target loss set. If yes, go to S311. If not, go to S312.

And S311, determining the neural network model as an audio recognition model.

And S312, updating model parameters of the neural network model based on the target loss set, and executing S301 to iteratively train the updated neural network model.

In this embodiment, if the corrected prediction result includes the probability prediction value that the first noisy audio and the second noisy audio belong to the same tone color, the current neural network model is being used to process a binary classification problem, and the audio recognition model training process includes the learning and training of a classification task. Therefore, when calculating the loss value between the probability prediction value and the true value, a cross entropy loss function or another similar loss function can be used, and back propagation is carried out accordingly to update the model parameters.

In this embodiment, if the corrected prediction result includes the similarity prediction value of the noiseless similarity of the first noiseless color feature and the second noiseless color feature, and/or the offset prediction value between the similarity prediction value and the noisy similarity, the current neural network model also processes a regression problem, and the training process of the audio recognition model includes the learning and training of a regression task. Therefore, the first error and the second error can be calculated by using a minimum mean square error or another similar error calculation method, and back propagation is performed accordingly to update the model parameters.

As can be seen, the target loss set may include: a loss value between the probability predicted value and the true value, and/or a first error value, and/or a second error value. Therefore, in the training process of the audio recognition model, the parameters are updated based on the classification task and the regression task, so that the multi-task training is carried out, and the correction capability of the model on the tone similarity can be improved.
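To make the multi-task objective concrete, the following is a minimal Python sketch of assembling a target loss set from a classification loss and two regression losses. All numeric values (predictions and ground truths) are hypothetical placeholders; a real implementation would use a deep-learning framework's loss functions and backpropagation.

```python
import math

def cross_entropy(p_pred, y_true, eps=1e-12):
    # Binary cross-entropy for the classification task: p_pred is the
    # predicted probability that the two noisy audios share the same
    # timbre, y_true is 1 or 0.
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def squared_error(pred, true):
    # Squared error for the regression tasks (noise-free similarity,
    # similarity offset, signal-to-noise ratio, audio length).
    return (pred - true) ** 2

# Hypothetical model outputs and ground truths for one training pair.
target_loss_set = [
    cross_entropy(0.9, 1),        # probability prediction vs. probability truth
    squared_error(0.82, 0.88),    # noise-free similarity prediction vs. truth
    squared_error(-0.05, -0.06),  # similarity offset prediction vs. truth
]
total_loss = sum(target_loss_set)  # the quantity driving the parameter update
```

Each auxiliary output (SNR, length, etc.) would contribute one more squared-error term to the same set.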

Based on any of the above embodiments, it should be noted that, in the case that the corrected prediction result includes any one or combination of the probability prediction value, the similarity prediction value, and the offset prediction value, the corrected prediction result may further include the following other parameters related to the regression task, such as: a first predicted signal-to-noise ratio for the first noisy audio, and/or a second predicted signal-to-noise ratio for the second noisy audio, and/or a first predicted length for the first noisy audio, and/or a second predicted length for the second noisy audio.

Correspondingly, a third error value between the first predicted signal-to-noise ratio and the true signal-to-noise ratio of the first noisy audio needs to be calculated, and the third error value is added to the target loss set; and/or a fourth error value between the second predicted signal-to-noise ratio and the true signal-to-noise ratio of the second noisy audio is calculated, and the fourth error value is added to the target loss set; and/or a fifth error value between the first predicted length and the true length of the first noisy audio is calculated and added to the target loss set; and/or a sixth error value between the second predicted length and the true length of the second noisy audio is calculated and added to the target loss set.

Accordingly, the target loss set includes: in the case of any one or combination of the loss value, the first error value, and the second error value between the probability prediction value and the true value, the method may further include: any one or combination of the third error value, the fourth error value, the fifth error value, the sixth error value.

Of course, speaker information related to the timbre, such as age and gender, can also be added to the corrected prediction result, so that more error-computable parameters are added to the regression task and the corresponding errors are added to the target loss set, thereby improving the correction capability of the model for the timbre similarity.

Referring to fig. 4, fig. 4 is a flowchart of a method for detecting a timbre similarity according to an embodiment of the present disclosure. As shown in fig. 4, the method for detecting timbre similarity may include the steps of:

S401, acquiring a first audio and a second audio.

Wherein the first audio and the second audio may be songs, spoken lines, and the like, performed by two people.

S402, inputting the first audio and the second audio into the tone extraction model respectively, so that the tone extraction model outputs a first tone characteristic corresponding to the first audio and a second tone characteristic corresponding to the second audio.

And S403, calculating the similarity to be corrected of the first tone color characteristic and the second tone color characteristic.

In one embodiment, calculating the similarity to be corrected of the first and second timbre features comprises: and calculating the similarity to be corrected based on the PLDA or the cosine distance.
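As a concrete illustration of the cosine option, the following Python sketch computes the similarity to be corrected from two timbre feature vectors. The feature values are hypothetical; a real system would obtain them from the tone extraction model (and PLDA would be a separate, trained scorer).

```python
import math

def cosine_similarity(x, y):
    # Cosine similarity between two fixed-dimension timbre feature vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical timbre feature vectors output by the tone extraction model.
feature_a = [0.2, 0.7, 0.1, 0.5]
feature_b = [0.25, 0.65, 0.05, 0.55]
similarity_to_correct = cosine_similarity(feature_a, feature_b)
```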

S404, inputting the first tone color characteristic, the second tone color characteristic and the similarity to be corrected into the audio recognition model, so that the audio recognition model outputs a similarity detection result.

The audio recognition model is obtained by training using the audio recognition model training method provided in any of the above embodiments, so that reference may be made to the relevant content of the audio recognition model described in any of the above embodiments.

S405, determining the tone color similarity of the first audio and the second audio based on the similarity detection result.

Since the audio recognition model can output various prediction results in the training process, the selection can be made according to the similarity detection result actually output by the model when the model is used. In one embodiment, if the similarity detection result is a probability value that the first audio and the second audio belong to the same timbre, determining the timbre similarity of the first audio and the second audio based on the similarity detection result includes: determining the probability value as the timbre similarity; or, if the similarity detection result is the noise-free similarity of the first timbre feature and the second timbre feature, determining the noise-free similarity as the timbre similarity; or, if the similarity detection result is the offset between the similarity to be corrected and the timbre similarity, determining the sum of the similarity to be corrected and the offset as the timbre similarity.

If the similarity detection result includes at least two of these kinds, the timbre similarity may be determined for each of them, and the resulting timbre similarities are then weighted and summed to obtain the final timbre similarity.

In a specific embodiment, before inputting the first timbre characteristic, the second timbre characteristic and the similarity to be corrected into the audio recognition model so that the audio recognition model outputs the similarity detection result, the method further includes: optimizing the similarity to be corrected by using a linear formula;

the linear formula is: S' = W0 + W1·S + W2·SNRx + W3·SNRy + W4·Lx + W5·Ly; wherein S' is the similarity to be corrected after optimization, S is the similarity to be corrected before optimization, SNRx is the signal-to-noise ratio of the first audio, SNRy is the signal-to-noise ratio of the second audio, Lx is the length of the first audio, Ly is the length of the second audio, W0 is a preset bias parameter, and W1, W2, W3, W4 and W5 are preset weights. The length is in seconds. The signal-to-noise ratio is typically in dB.

Wherein the magnitude of each preset weight may be determined based on training data. When the signal-to-noise ratio is estimated, the noise in the audio can be detected by using an endpoint detection method, and the calculation is then carried out. According to the linear formula, the similarity output by the PLDA is optimized, and the optimized similarity is then corrected by using the audio recognition model, so that the accuracy of the similarity is improved. In this process, when the audio recognition model is trained, the noisy similarity optimized by the linear formula and the two corresponding timbre features are preferably used as input, so as to improve the performance of the audio recognition model.
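A direct Python transcription of the linear formula is sketched below. The weight values are illustrative placeholders only; in practice W0 through W5 are fitted on training data as described above.

```python
def optimize_similarity(s, snr_x, snr_y, length_x, length_y, weights):
    # S' = W0 + W1*S + W2*SNRx + W3*SNRy + W4*Lx + W5*Ly
    # SNRs in dB, lengths in seconds.
    w0, w1, w2, w3, w4, w5 = weights
    return w0 + w1 * s + w2 * snr_x + w3 * snr_y + w4 * length_x + w5 * length_y

# Illustrative weights; real values are determined from training data.
W = (0.05, 0.9, 0.002, 0.002, 0.001, 0.001)
s_prime = optimize_similarity(0.8, 20.0, 15.0, 30.0, 25.0, W)
```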

It can be seen that the audio recognition model in this embodiment takes two noisy timbre features and their similarity as input data and can output a correction result of the similarity. The correction process does not need to consider whether the noise level or the audio duration has a linear relationship with the similarity, nor does it need to estimate the noise level in the audio, so the calculation cost and time consumption can be reduced, and the accuracy of correcting the timbre similarity can be improved.

Referring to fig. 5, fig. 5 is a schematic diagram of an audio recognition model training apparatus according to an embodiment of the present application, including:

a training data obtaining module 501, configured to obtain a first noiseless audio and a second noiseless audio from a training sample library;

a noise adding module 502, configured to add random noise to the first noiseless audio frequency and the second noiseless audio frequency respectively to obtain a first noisy audio frequency corresponding to the first noiseless audio frequency and a second noisy audio frequency corresponding to the second noiseless audio frequency;

a training feature extraction module 503, configured to input the first noiseless audio frequency, the second noiseless audio frequency, the first noisy audio frequency, and the second noisy audio frequency into a tone extraction model respectively, and extract a first noiseless color feature, a second noiseless color feature, a first noisy color feature, and a second noisy color feature, respectively;

an offset calculation module 504, configured to calculate a noisy similarity between the first noisy audio and the second noisy audio based on the first noisy color feature and the second noisy color feature, and calculate a similarity offset true value between the first noiseless audio and the second noiseless audio based on the first noiseless color feature, the second noiseless color feature, the first noisy color feature, and the second noisy color feature;

a processing module 505, configured to input the first noisy color feature, the second noisy color feature, and the noisy similarity into a neural network model to be trained, so that the neural network model outputs a similarity offset prediction value of the first noisy audio and the second noisy audio;

a loss determining module 506, configured to calculate a loss value between the predicted similarity deviation value and the true similarity deviation value, and add the loss value to the target loss set;

an update module, configured to adjust model parameters of the neural network model based on each loss value in the target loss set, and acquire two noiseless audios from the training sample library again so as to iteratively train the updated neural network model until the model convergence condition is reached, and output the audio recognition model.

In one embodiment, the apparatus further includes a noise-free similarity prediction module configured to:

acquiring a noise-free similarity true value of the first noise-free audio and the second noise-free audio;

determining a noise-free similarity prediction value of the first noisy audio and the second noisy audio by using a neural network model;

and calculating a loss value between the predicted value of the noise-free similarity and the true value of the noise-free similarity, and adding the loss value to the target loss set.

In one embodiment, the apparatus further comprises a probability prediction value output module configured to:

acquiring a probability true value that the first noiseless audio and the second noiseless audio belong to the same tone;

determining probability predicted values of the first noisy audio and the second noisy audio belonging to the same tone by using a neural network model;

and calculating a loss value between the probability predicted value and the probability true value, and adding the loss value to the target loss set.

In one embodiment, the apparatus further comprises an other-information determination module configured to:

Acquiring a first true signal-to-noise ratio of the first noisy audio;

determining a first predicted signal-to-noise ratio for the first noisy audio using a neural network model;

calculating a loss value between the first predicted signal-to-noise ratio and the first true signal-to-noise ratio, and adding the loss value to a target loss set;

and/or

Acquiring a second true signal-to-noise ratio of a second noisy audio;

determining a second predicted signal-to-noise ratio for the second noisy audio using the neural network model;

calculating a loss value between the second predicted signal-to-noise ratio and the second true signal-to-noise ratio, and adding the loss value to the target loss set;

and/or

Acquiring a first real length of a first noisy audio;

determining a first predicted length of the first noisy audio using a neural network model;

calculating a loss value between the first predicted length and the first real length, and adding the loss value to a target loss set;

and/or

Acquiring a second real length of a second noisy audio;

determining a second predicted length of the second noisy audio using the neural network model;

a loss value between the second predicted length and the second real length is calculated and added to the target loss set.

For more specific working processes of each module and unit in this embodiment, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not described here again.

Therefore, this embodiment provides an audio recognition model training apparatus. The audio recognition model trained by this apparatus does not need to consider whether the noise level or the audio duration has a linear relationship with the similarity, nor does it need to estimate the noise level in the audio, so the calculation cost and time consumption can be reduced, and the correction accuracy of the timbre similarity is improved.

Referring to fig. 6, fig. 6 is a schematic diagram of a timbre similarity detection apparatus according to an embodiment of the present application, including:

a to-be-processed data acquisition module 601, configured to acquire a first audio and a second audio;

a to-be-processed feature extraction module 602, configured to input the first audio and the second audio into the tone extraction model respectively, so that the tone extraction model outputs a first tone feature corresponding to the first audio and a second tone feature corresponding to the second audio;

a similarity to be corrected calculation module 603, configured to calculate a similarity to be corrected of the first timbre feature and the second timbre feature;

the similarity correction module 604 is configured to input the first timbre characteristic, the second timbre characteristic and the similarity to be corrected into the audio recognition model, so that the audio recognition model outputs a similarity detection result; the audio recognition model is obtained by training by using the audio recognition model training method provided by any embodiment;

and a timbre similarity determination module 605 configured to determine timbre similarities of the first audio and the second audio based on the similarity detection result.

In a specific embodiment, the to-be-corrected similarity calculation module is specifically configured to:

and calculating the similarity to be corrected based on the PLDA or the cosine distance.

In a specific embodiment, the timbre similarity determination module is specifically configured to:

if the similarity detection result is a probability value that the first audio and the second audio belong to the same tone, determining the probability value as the tone similarity;

or

If the similarity detection result is the noise-free similarity of the first tone color feature and the second tone color feature, determining the noise-free similarity as the tone color similarity;

or

And if the similarity detection result is the offset of the similarity to be corrected and the tone similarity, determining the sum of the similarity to be corrected and the offset as the tone similarity.

In a specific embodiment, the apparatus further includes:

the similarity optimization module to be corrected is used for optimizing the similarity to be corrected by utilizing a linear formula; the linear formula is:

S' = W0 + W1·S + W2·SNRx + W3·SNRy + W4·Lx + W5·Ly

wherein S' is the similarity to be corrected after optimization, S is the similarity to be corrected before optimization, SNRx is the signal-to-noise ratio of the first audio, SNRy is the signal-to-noise ratio of the second audio, Lx is the length of the first audio, Ly is the length of the second audio, W0 is a preset bias parameter, and W1, W2, W3, W4 and W5 are preset weights.

For more specific working processes of each module and unit in this embodiment, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not described here again.

Therefore, this embodiment provides a timbre similarity detection apparatus, which corrects the timbre similarity by using the audio recognition model without needing to consider whether the noise level or the audio duration has a linear relationship with the similarity, or to estimate the noise level in the audio, so that the calculation cost and time consumption can be reduced, and the accuracy of correcting the timbre similarity can be improved.

The scheme provided by the application is described below with a specific application scenario example.

In common music applications, the singer timbre recognition function is widely applied to scenarios such as similar song recommendation, song classification, and singer identity confirmation. Specifically, a user may record a song by himself, and the timbre differences between the recorded song and other songs in the song library of the music application are then determined. Therefore, the scheme provided by the application can perform timbre matching on songs recorded on non-professional mobile devices while guaranteeing the timbre matching accuracy, thereby improving the user experience.

If songs are classified based on this method (namely, songs of the same singer are grouped together), a song classification platform can be constructed, with the trained tone extraction model and audio recognition model deployed in the song classification platform. A massive number of songs are stored on the server.

Referring to fig. 7, the song sorting process includes the following steps:

S701, the song classification platform acquires any two songs from the server;

S702, the song classification platform inputs the two songs into the tone extraction model respectively, so that the tone extraction model outputs the timbre features corresponding to the two songs respectively;

S703, the song classification platform calculates the similarity to be corrected of the two timbre features;

S704, the song classification platform inputs the two timbre features and the similarity to be corrected into the audio recognition model, so that the audio recognition model outputs a similarity detection result;

S705, the song classification platform determines the timbre similarity of the two songs based on the similarity detection result;

S706, the song classification platform judges whether the two songs are sung by the same singer based on the timbre similarity; if so, the two songs are classified into the same song set; if not, corresponding timbre features are labeled for the two songs;

S707, the song classification platform pushes the corresponding result to the management client;

if the timbre similarity obtained in S705 is greater than the threshold, it is determined that the two songs are sung by the same singer, otherwise, it is determined that the two songs are sung by different singers, and the two songs are labeled with corresponding timbre features for subsequent classification.
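The decision rule in S706 can be sketched as below; the 0.7 threshold is a hypothetical value, since the actual threshold would be tuned on held-out data.

```python
def classify_pair(timbre_similarity, threshold=0.7):
    # S706: if the timbre similarity exceeds the threshold, the two songs
    # are judged to be sung by the same singer and go into one song set;
    # otherwise they are labeled with their respective timbre features.
    if timbre_similarity > threshold:
        return "same_singer"
    return "different_singers"
```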

And S708, the management client displays a corresponding result.

In the present embodiment, the audio recognition model is trained based on a deep neural network, and the training data thereof can be prepared with reference to fig. 8. Each neural network in fig. 8 is identical and is a tone extraction model.

In fig. 8, the tone extraction model extracts a vector of fixed dimension as the timbre feature from singing audio of indefinite length, which facilitates the subsequent similarity calculation. For any pair of timbre features (the registered singer's timbre feature and the verified singer's timbre feature), the PLDA may be used to calculate their timbre similarity. Whether they belong to the same singer is determined based on the timbre similarity output by the PLDA. To improve the accuracy of the timbre similarity output by the PLDA, this embodiment continues to correct it using the audio recognition model. Of course, the PLDA may be replaced with other classifiers, such as a cosine similarity classifier (1 minus the cosine distance between two feature vectors) or other neural network classifiers capable of calculating the similarity.

When the clean, noise-free audio is subjected to noise addition, the noise proportion needs to be flexibly controlled, so that the signal-to-noise ratio of the training data is known. Of course, zero noise can also be added during noise addition, so that the audio after noise addition is no different from the audio before noise addition. That is, the input data of the audio recognition model may also be clean, noise-free audio.
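One common way to control the noise proportion so that the training SNR is known by construction is to scale the noise before mixing. The plain-Python sketch below, operating on raw sample lists, is a minimal illustration under that assumption (the exact noise-addition procedure is not specified at this level of detail in the text):

```python
import math
import random

def add_noise_at_snr(signal, noise, snr_db):
    # Scale the noise so that the mixed audio has exactly the requested
    # signal-to-noise ratio: p_signal / (scale^2 * p_noise) = 10^(snr_db/10).
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(signal, noise)]

# Hypothetical clean waveform plus Gaussian noise, mixed at a known 10 dB SNR.
random.seed(0)
clean = [math.sin(0.01 * i) for i in range(1000)]
noise = [random.gauss(0.0, 0.5) for _ in range(1000)]
noisy = add_noise_at_snr(clean, noise, 10.0)
```

Passing a very large snr_db approaches the zero-noise case mentioned above, where the noisy audio is effectively identical to the clean audio.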

According to the process shown in fig. 8, based on clean noise-free training data, corresponding noisy training data can be obtained, and a noise-free similarity score, a noisy similarity score, and a score offset therebetween can be calculated.

The output result of the audio recognition model may refer to fig. 9. The input layer in fig. 9 receives two noisy timbre features and their noisy similarity score, making the model suitable for processing noisy audio. The output layer is provided with a plurality of output nodes: nodes 1 and 2 represent the classification task; nodes 3, 4, 5, 6, 7 and 8 belong to the regression tasks. Of course, it is also possible to add the singer's age, gender, and the like to the regression task, or to delete several of these nodes. A plurality of hidden layers are arranged between the input layer and the output layer. The arrows in fig. 9 indicate the parametric effect of the input data on the layers.

Referring to fig. 9, nodes 1 and 2 are used to output the corrected similarity score. These 2 nodes are normalized by the softmax function so that their outputs sum to one. For example, if the 2nd node outputs x%, the similarity of the two input audio feature vectors is x%, and the value of the 1st node must be 1-x%. This is the main task of the model. The auxiliary tasks are not classification tasks but predict the noise-free similarity score, the score offset, the noise level, the audio length, and the like of the input audio, and are therefore regression tasks.
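The two-node normalization described above can be sketched as follows; the logit values are hypothetical stand-ins for the pre-activation outputs of nodes 1 and 2.

```python
import math

def softmax(logits):
    # Numerically stable softmax: shift by the max before exponentiating,
    # then normalize so the outputs sum to one.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical pre-activation outputs of nodes 1 and 2.
p_different, p_same = softmax([0.3, 1.9])
```

After normalization the two probabilities sum to one, so p_same plays the role of x% and p_different of 1-x% in the text above.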

Of course, when the model is actually used, the 3rd node of the output layer of fig. 9 may be used instead of the 2nd node as the final output, because the noise-free score output by the 3rd node and the probability value output by the 2nd node both represent the similarity of the two timbre features.

In addition, the 4th node of the output layer of fig. 9 may be used instead of the 2nd node as the final output. After the score offset output by the 4th node is obtained, the score offset is used to compensate the noisy score calculated by the PLDA, thereby obtaining the similarity of the two timbre features.
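The compensation step is simple arithmetic, sketched below with hypothetical score values:

```python
def compensate_score(noisy_similarity, predicted_offset):
    # Add the score offset predicted by node 4 back onto the noisy
    # PLDA/cosine score to estimate the noise-free timbre similarity
    # (the offset is defined as noise-free minus noisy similarity).
    return noisy_similarity + predicted_offset

# Hypothetical noisy PLDA score and predicted offset.
corrected_similarity = compensate_score(0.62, 0.21)
```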

It can be seen that, in this embodiment, noise is added to the noiseless audio, the similarity score is then calculated through the tone extraction model and the PLDA, and the offsets of the similarity score under different noise levels are obtained. The deep neural network is then trained to correct the similarity score. The deep neural network can also predict the noise level and the score offset based on the timbre feature vectors, thereby assisting multi-task learning and improving the accuracy of the corrected score. With the score correction method of this embodiment, the score offset is modeled by a nonlinear deep neural network, which breaks the linear assumption of the original method; moreover, information such as the signal-to-noise ratio of the audio does not need to be estimated during application, which overcomes drawbacks such as increased system time consumption and the difficulty of accurate estimation.

In the multi-task learning process, information such as the audio length and the signal-to-noise ratio is used, so the accuracy of timbre recognition is improved, and the corrected similarity scores have a more consistent score distribution under different noise environments and audio lengths, which avoids the difficult problem of demarcating a score threshold. When a linear equation is used to correct the similarity, the obtained timbre similarities have widely differing distributions for different noise environments and audio lengths, making it difficult to perform identity confirmation with a single score threshold.

Further, the embodiment of the application also provides electronic equipment. The electronic device may be the server 50 shown in fig. 10 or the terminal 60 shown in fig. 11. Fig. 10 and 11 are each a block diagram of an electronic device according to an exemplary embodiment, and the contents of the diagrams should not be construed as any limitation to the scope of use of the present application.

Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application. The server 50 may specifically include: at least one processor 51, at least one memory 52, a power supply 53, a communication interface 54, an input output interface 55, and a communication bus 56. The memory 52 is used for storing a computer program, and the computer program is loaded and executed by the processor 51 to implement the relevant steps in the audio recognition model training method and the timbre similarity detection method disclosed in any of the foregoing embodiments.

In this embodiment, the power supply 53 is used to provide operating voltage for each hardware device on the server 50; the communication interface 54 can create a data transmission channel between the server 50 and an external device, and the communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 55 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.

The memory 52 may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like as a carrier for storing resources, the resources stored thereon include an operating system 521, a computer program 522, data 523, and the like, and the storage manner may be a transient storage or a permanent storage.

The operating system 521 is used to manage and control the hardware devices and the computer program 522 on the server 50, so that the processor 51 can operate on and process the data 523 in the memory 52; it may be Windows Server, Netware, Unix, Linux, or the like. In addition to the computer program for performing the audio recognition model training method and the timbre similarity detection method disclosed in any of the foregoing embodiments, the computer program 522 may further include computer programs for performing other specific tasks. The data 523 may include data such as song audio and update information of the application program, and may also include data such as developer information of the application program.

Fig. 11 is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 60 may specifically include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.

In general, the terminal 60 in the present embodiment includes: a processor 61 and a memory 62.

The processor 61 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 61 may be implemented in at least one hardware form among a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 61 may also include a main processor and a coprocessor: the main processor is a processor for processing data in an awake state, also called a Central Processing Unit (CPU); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 61 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 61 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

The memory 62 may include one or more computer-readable storage media, which may be non-transitory. The memory 62 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 62 is at least used to store a computer program 621 which, after being loaded and executed by the processor 61, implements the relevant steps of the audio recognition model training method and the timbre similarity detection method executed on the terminal side as disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 62 may also include an operating system 622, data 623, and the like, which may be stored transiently or persistently. The operating system 622 may include Windows, Unix, Linux, or the like. The data 623 may include, but is not limited to, update information of applications.

In some embodiments, the terminal 60 may also include a display 63, an input/output interface 64, a communication interface 65, a sensor 66, a power supply 67, and a communication bus 68.

Those skilled in the art will appreciate that the configuration shown in fig. 11 does not limit the terminal 60, which may include more or fewer components than those shown.

Further, an embodiment of the present application also discloses a storage medium in which computer-executable instructions are stored. When the computer-executable instructions are loaded and executed by a processor, the audio recognition model training method disclosed in any of the foregoing embodiments is implemented. For the specific steps of the method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated herein.
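Among the steps those stored instructions implement is the noise-addition step of claim 1, in which random noise is added to each noiseless audio to obtain the corresponding noisy audio. A minimal sketch follows; the choice of white Gaussian noise and the 0-20 dB signal-to-noise range are illustrative assumptions, as the embodiments do not specify the noise type or its magnitude:

```python
import math
import random

def add_random_noise(clean, snr_db_range=(0.0, 20.0), rng=None):
    """Add white Gaussian noise to a clean waveform at a random SNR.

    Sketch only: the embodiments specify "random noise" without fixing
    its distribution or level, so both are assumed here.
    """
    rng = rng or random.Random()
    snr_db = rng.uniform(*snr_db_range)
    # Average signal power of the clean waveform.
    signal_power = sum(s * s for s in clean) / max(len(clean), 1)
    # Noise power needed to reach the drawn SNR (in dB).
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in clean]
```

Applying this independently to the first and second noiseless audio would yield the first and second noisy audio used to train the model.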

It should be noted that the above-described embodiments are only preferred embodiments of the present application and are not intended to limit it; any modification, equivalent replacement, improvement, or the like made within the spirit and principle of the present application shall fall within its protection scope.

The embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.

Specific examples are applied herein to explain the principle and implementation of the present application, and the above description of the embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.
