Model training method, audio processing method, device and readable storage medium

Document No.: 344500    Publication date: 2021-12-03

Note: This technology, "Model training method, audio processing method, device and readable storage medium", was designed and created by 江益靓, 姜涛, 赵合 and 胡鹏 on 2021-09-07. Abstract: The application discloses a model training method, an audio processing method, a device and a computer-readable storage medium. The model training method comprises: acquiring training data, the training data comprising training dry sound data and corresponding training accompaniment data; inputting the training dry sound data into a first feature extraction network of an initial model to obtain training dry sound features; inputting the training accompaniment data into a second feature extraction network of the initial model to obtain training accompaniment features; inputting the training dry sound features and the training accompaniment features into a feature processing network of the initial model to obtain training parameters; determining a loss value by using the training parameters and the training labels of the training data, and performing parameter adjustment on the initial model by using the loss value; and if the training completion condition is detected to be met, determining the adjusted model as an audio evaluation model. The method provides richer evaluation modes and evaluates from multiple angles of music theory, so that the processing parameters are credible and reliable.

1. A method of model training, comprising:

acquiring training data; the training data comprises training dry sound data and corresponding training accompaniment data;

inputting the training dry sound data into a first feature extraction network of an initial model to obtain training dry sound features;

inputting the training accompaniment data into a second feature extraction network of the initial model to obtain training accompaniment features;

inputting the training dry sound features and the training accompaniment features into a splicing network of the initial model to obtain features to be processed;

inputting the features to be processed into a feature processing network of the initial model to obtain training parameters;

determining a loss value by using the training parameters and training labels of the training data, and performing parameter adjustment on the initial model by using the loss value;

and if the condition that the training is completed is detected to be met, determining the adjusted model as an audio evaluation model.

2. The model training method of claim 1, wherein the generation process of the training labels comprises:

outputting training audio corresponding to the training data;

acquiring a plurality of groups of label data corresponding to the training audio; each group of label data comprises a plurality of training sub-labels, and different training sub-labels correspond to different singing voice and accompaniment matching evaluation angles;

and generating an initial training label by using each group of the plurality of training sub-labels, and generating the training label by using the plurality of initial training labels.

3. The model training method of claim 1, wherein the initial model is a twin network, and wherein the parameter adjusting the initial model using the loss value comprises:

carrying out parameter adjustment on the first feature extraction network by using the loss value;

utilizing the adjusted first feature extraction network parameters to carry out parameter replacement on the second feature extraction network;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

4. The model training method of claim 1, wherein the initial model is a pseudo-twin network, and wherein the parameter adjusting the initial model using the loss value comprises:

respectively carrying out parameter adjustment on the first feature extraction network and the second feature extraction network by using the loss values;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

5. The model training method of claim 1, wherein the initial model is a semi-twin network, and wherein the parameter adjusting the initial model using the loss value comprises:

carrying out parameter adjustment on the first feature extraction network by using the loss value;

performing parameter replacement on a plurality of corresponding second network layers in the second feature extraction network by using a plurality of adjusted first network layer parameters in the first feature extraction network;

performing parameter adjustment on a non-second network layer in the second feature extraction network by using the loss value;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

6. The model training method of claim 1, wherein the initial model is a varying twin network, and wherein the parameter adjusting the initial model using the loss value comprises:

carrying out parameter adjustment on the first feature extraction network by using the loss value;

performing parameter replacement on the first branch of the second feature extraction network by using the adjusted first feature extraction network parameters;

performing parameter adjustment on a second branch of the second feature extraction network using the loss value or the first feature extraction network;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

7. An audio processing method, comprising:

acquiring target dry sound audio and corresponding target accompaniment audio;

inputting the target dry sound audio into a first feature extraction network of an audio evaluation model to obtain target dry sound features;

inputting the target accompaniment audio into a second feature extraction network of the audio evaluation model to obtain target accompaniment features;

inputting the target dry sound features and the target accompaniment features into a splicing network of the audio evaluation model to obtain target features;

inputting the target features into a feature processing network of the audio evaluation model to obtain a processing result; wherein the processing result is used for representing the matching harmony degree between the target dry sound audio and the target accompaniment audio, and the audio evaluation model is obtained based on the model training method according to any one of claims 1 to 6.

8. The audio processing method according to claim 7, wherein the obtaining of the target dry audio and the corresponding target accompaniment audio comprises:

acquiring initial dry sound audio and corresponding initial accompaniment audio;

identifying and removing a mute blank part in the initial dry audio to obtain an intermediate dry audio;

removing redundant parts in the initial accompaniment audio to obtain an intermediate accompaniment audio; the redundant part corresponds to the mute blank part on a time axis;

performing sliding window segmentation processing with the same parameters on the intermediate dry sound audio and the intermediate accompaniment audio to obtain a plurality of target dry sound audio corresponding to the intermediate dry sound audio and a plurality of target accompaniment audio corresponding to the intermediate accompaniment audio; the parameters include window length and sliding window step length.

9. The audio processing method according to claim 7, wherein the obtaining of the target dry audio and the corresponding target accompaniment audio comprises:

acquiring initial dry sound audio and corresponding initial accompaniment audio;

performing segmentation processing in the same form on the initial dry sound audio and the initial accompaniment audio to obtain a plurality of target dry sound audio and corresponding target accompaniment audio;

the audio processing method further comprises the following steps:

acquiring the processing result corresponding to each target dry sound audio;

and generating an evaluation result corresponding to the initial dry sound audio by using all the processing results.

10. An electronic device comprising a memory and a processor, wherein:

the memory is used for storing a computer program;

the processor is used for executing the computer program to implement the model training method of any one of claims 1 to 6 and/or the audio processing method of any one of claims 7 to 9.

11. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the model training method of any one of claims 1 to 6 and/or the audio processing method of any one of claims 7 to 9.

Technical Field

The present application relates to the field of audio processing technologies, and in particular, to a model training method, an audio processing method, an electronic device, and a computer-readable storage medium.

Background

In karaoke software, the user's singing needs to be evaluated so that the user can play games or clearly know his or her singing level. In the related art, the user's dry sound is generally evaluated with intonation and the like as the evaluation criterion: for example, a fundamental frequency curve of the original singing of a song is obtained, the fundamental frequency curve of the user's dry sound is compared with it, and the matching degree is used as an evaluation parameter of the user's singing level. However, this evaluation mode is single and rigid, which limits the user's free play, and it does not consider other evaluation angles such as rhythm and timbre harmony, so the reliability of the evaluation parameters is low.

Disclosure of Invention

In view of the above, an object of the present application is to provide a model training method, an audio processing method, an electronic device, and a computer-readable storage medium that make the evaluation parameters of audio credible and reliable.

In order to solve the above technical problem, in a first aspect, the present application provides a model training method, including:

acquiring training data; the training data comprises training dry sound data and corresponding training accompaniment data;

inputting the training dry sound data into a first feature extraction network of an initial model to obtain training dry sound features;

inputting the training accompaniment data into a second feature extraction network of the initial model to obtain training accompaniment features;

inputting the training dry sound features and the training accompaniment features into a splicing network of the initial model to obtain features to be processed;

inputting the features to be processed into a feature processing network of the initial model to obtain training parameters;

determining a loss value by using the training parameters and training labels of the training data, and performing parameter adjustment on the initial model by using the loss value;

and if the condition that the training is completed is detected to be met, determining the adjusted model as an audio evaluation model.

Optionally, the generating process of the training label includes:

outputting training audio corresponding to the training data;

acquiring a plurality of groups of label data corresponding to the training audio; each group of label data comprises a plurality of training sub-labels, and different training sub-labels correspond to different singing voice and accompaniment matching evaluation angles;

and generating an initial training label by using each group of the plurality of training sub-labels, and generating the training label by using the plurality of initial training labels.

Optionally, the initial model is a twin network, and the parameter adjustment of the initial model using the loss value includes:

carrying out parameter adjustment on the first feature extraction network by using the loss value;

utilizing the adjusted first feature extraction network parameters to carry out parameter replacement on the second feature extraction network;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

Optionally, the initial model is a pseudo-twin network, and the performing parameter adjustment on the initial model by using the loss value includes:

respectively carrying out parameter adjustment on the first feature extraction network and the second feature extraction network by using the loss values;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

Optionally, the initial model is a semi-twin network, and the parameter adjustment of the initial model using the loss value includes:

carrying out parameter adjustment on the first feature extraction network by using the loss value;

performing parameter replacement on a plurality of corresponding second network layers in the second feature extraction network by using a plurality of adjusted first network layer parameters in the first feature extraction network;

performing parameter adjustment on a non-second network layer in the second feature extraction network by using the loss value;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

Optionally, the initial model is a varying twin network, and the parameter adjustment of the initial model using the loss value includes:

carrying out parameter adjustment on the first feature extraction network by using the loss value;

performing parameter replacement on the first branch of the second feature extraction network by using the adjusted first feature extraction network parameters;

performing parameter adjustment on a second branch of the second feature extraction network using the loss value or the first feature extraction network;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

In a second aspect, the present application further provides an audio processing method, including:

acquiring target dry sound audio and corresponding target accompaniment audio;

inputting the target dry sound audio into a first feature extraction network of an audio evaluation model to obtain target dry sound features;

inputting the target accompaniment audio into a second feature extraction network of the audio evaluation model to obtain target accompaniment features;

inputting the target dry sound features and the target accompaniment features into a splicing network of the audio evaluation model to obtain target features;

inputting the target features into a feature processing network of the audio evaluation model to obtain a processing result; the processing result is used for representing the matching harmony degree between the target dry sound audio and the target accompaniment audio, and the audio evaluation model is obtained based on the model training method described above.

Optionally, the acquiring the target dry sound audio and the corresponding target accompaniment audio includes:

acquiring initial dry sound audio and corresponding initial accompaniment audio;

identifying and removing a mute blank part in the initial dry audio to obtain an intermediate dry audio;

removing redundant parts in the initial accompaniment audio to obtain an intermediate accompaniment audio; the redundant part corresponds to the mute blank part on a time axis;

performing sliding window segmentation processing with the same parameters on the intermediate dry sound audio and the intermediate accompaniment audio to obtain a plurality of target dry sound audio corresponding to the intermediate dry sound audio and a plurality of target accompaniment audio corresponding to the intermediate accompaniment audio; the parameters include window length and sliding window step length.

Optionally, the acquiring the target dry sound audio and the corresponding target accompaniment audio includes:

acquiring initial dry sound audio and corresponding initial accompaniment audio;

performing segmentation processing in the same form on the initial dry sound audio and the initial accompaniment audio to obtain a plurality of target dry sound audio and corresponding target accompaniment audio;

the audio processing method further comprises the following steps:

acquiring the processing result corresponding to each target dry sound audio;

and generating an evaluation result corresponding to the initial dry sound audio by using all the processing results.

In a third aspect, the present application further provides an electronic device, comprising a memory and a processor, wherein:

the memory is used for storing a computer program;

the processor is configured to execute the computer program to implement the above model training method and/or the above audio processing method.

In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the model training method described above, and/or the audio processing method described above.

The model training method provided by the application acquires training data; the training data comprises training dry sound data and corresponding training accompaniment data; inputting training dry sound data into a first feature extraction network of the initial model to obtain training dry sound features; inputting training accompaniment data into a second feature extraction network of the initial model to obtain training accompaniment features; inputting the training dry sound characteristic and the training accompaniment characteristic into a splicing network of the initial model to obtain a characteristic to be processed; inputting the characteristics to be processed into a characteristic processing network of the initial model to obtain training parameters; determining a loss value by using the training parameters and the training labels of the training data, and performing parameter adjustment on the initial model by using the loss value; and if the condition that the training is completed is detected to be met, determining the adjusted model as an audio evaluation model.

The audio processing method provided by the application acquires target dry sound audio and corresponding target accompaniment audio; inputs the target dry sound audio into a first feature extraction network of the audio evaluation model to obtain target dry sound features; inputs the target accompaniment audio into a second feature extraction network of the audio evaluation model to obtain target accompaniment features; inputs the target dry sound features and the target accompaniment features into a splicing network of the audio evaluation model to obtain target features; and inputs the target features into a feature processing network of the audio evaluation model to obtain a processing result; the processing result is used for representing the matching harmony degree between the target dry sound audio and the target accompaniment audio, and the audio evaluation model is obtained based on the model training method.

Therefore, the method trains the initial model with the training data to obtain the audio evaluation model. The training data is constructed in pairs, comprising training dry sound data and training accompaniment data. The initial model is provided with a first feature extraction network and a second feature extraction network, which are respectively used for extracting features from the training dry sound data and the training accompaniment data to obtain the training dry sound features and the training accompaniment features. After the training dry sound features and the training accompaniment features are spliced into the features to be processed, the features to be processed are input into the feature processing network, which can comprehensively consider the matching harmony degree between the training dry sound features and the training accompaniment features and obtain training parameters capable of reflecting this matching harmony degree. The training labels express the harmony degree of the dry sound and the accompaniment; determining a loss value from the training parameters and the training labels makes it possible to measure the gap between the evaluation result obtained by the initial model's current evaluation manner and the real result, and the loss value is then used to adjust the parameters of the initial model, improving its evaluation manner so that the harmony degree between the dry sound and the accompaniment can be evaluated more accurately. Once the training completion condition is met, it can be determined that the initial model is able to accurately evaluate the harmony degree of the dry sound and the accompaniment, and the initial model is then determined as the audio evaluation model. In application, the target dry sound audio sung by the user and the target accompaniment audio corresponding to the song are respectively input into the first feature extraction network and the second feature extraction network, and a processing result reflecting the harmony degree of the target dry sound audio and the target accompaniment audio is obtained. Through this training manner, an audio evaluation model capable of evaluating the matching degree between the user's dry sound and the song accompaniment is obtained, richer evaluation modes can be provided, and the evaluation is carried out from multiple angles of music theory, so that the processing parameters are credible and reliable.

In addition, the present application further provides an electronic device and a computer-readable storage medium, which have the same beneficial effects described above.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic diagram of a hardware composition framework to which a model training method according to an embodiment of the present disclosure is applied;

FIG. 2 is a block diagram of a hardware framework for another model training method according to an embodiment of the present disclosure;

fig. 3 is a schematic flowchart of a model training method according to an embodiment of the present disclosure;

fig. 4 is a schematic structural diagram of a specific audio evaluation model provided in an embodiment of the present application;

fig. 5 is a schematic structural diagram of another specific audio evaluation model provided in an embodiment of the present application;

fig. 6 is a schematic structural diagram of another specific audio evaluation model provided in an embodiment of the present application;

fig. 7 is a schematic structural diagram of another specific audio evaluation model provided in an embodiment of the present application;

FIG. 8 is a diagram of a specific audio waveform provided by an embodiment of the present application;

fig. 9 is a flowchart of data processing provided in an embodiment of the present application;

fig. 10 is a schematic view of an audio processing flow provided in an embodiment of the present application;

fig. 11 is a flowchart for generating a specific audio evaluation result according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

For convenience of understanding, a hardware composition framework used in a model training method and/or a scheme corresponding to an audio processing method provided in the embodiments of the present application is described first. Referring to fig. 1, fig. 1 is a schematic diagram of a hardware composition framework applicable to a model training method according to an embodiment of the present disclosure. Wherein the electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.

Wherein, the processor 101 is used for controlling the overall operation of the electronic device 100 to complete all or part of the steps of the model training method and/or the audio processing method; the memory 102 is used to store various types of data to support operation at the electronic device 100, such data may include, for example, instructions for any application or method operating on the electronic device 100, as well as application-related data. The Memory 102 may be implemented by any type or combination of volatile and non-volatile Memory devices, such as one or more of Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic or optical disk. In the present embodiment, the memory 102 stores therein at least programs and/or data for realizing the following functions:

acquiring training data; the training data comprises training dry sound data and corresponding training accompaniment data;

inputting training dry sound data into a first feature extraction network of the initial model to obtain training dry sound features;

inputting training accompaniment data into a second feature extraction network of the initial model to obtain training accompaniment features;

inputting the training dry sound features and the training accompaniment features into a splicing network of the initial model to obtain features to be processed;

inputting the features to be processed into a feature processing network of the initial model to obtain training parameters;

determining a loss value by using the training parameters and the training labels of the training data, and performing parameter adjustment on the initial model by using the loss value;

and if the condition that the training is completed is detected to be met, determining the adjusted model as an audio evaluation model.

and/or:

acquiring target dry sound audio and corresponding target accompaniment audio;

inputting the target dry sound audio into a first feature extraction network of the audio evaluation model to obtain target dry sound features;

inputting the target accompaniment audio into a second feature extraction network of the audio evaluation model to obtain target accompaniment features;

inputting the target dry sound features and the target accompaniment features into a splicing network of the audio evaluation model to obtain target features;

inputting the target features into a feature processing network of the audio evaluation model to obtain a processing result; the processing result is used for representing the matching harmony degree between the target dry sound audio and the target accompaniment audio.

The multimedia component 103 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 102 or transmitted through the communication component 105. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 104 provides an interface between the processor 101 and other interface modules, such as a keyboard, a mouse, and buttons, which may be virtual buttons or physical buttons. The communication component 105 is used for wired or wireless communication between the electronic device 100 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 105 may include a Wi-Fi module, a Bluetooth module, and an NFC module.

The electronic Device 100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the model training method.

Of course, the structure of the electronic device 100 shown in fig. 1 does not constitute a limitation of the electronic device in the embodiment of the present application, and in practical applications, the electronic device 100 may include more or less components than those shown in fig. 1, or some components may be combined.

It is to be understood that, in the embodiment of the present application, the number of the electronic devices is not limited, and it may be that a plurality of electronic devices cooperate to perform the model training method, and/or the audio processing method. In a possible implementation manner, please refer to fig. 2, and fig. 2 is a schematic diagram of a hardware composition framework applicable to another model training method provided in the embodiment of the present application. As can be seen from fig. 2, the hardware composition framework may include: the first electronic device 11 and the second electronic device 12 are connected to each other through a network 13.

In the embodiment of the present application, the hardware structures of the first electronic device 11 and the second electronic device 12 may refer to the electronic device 100 in fig. 1. That is, it can be understood that there are two electronic devices 100 in the present embodiment, and the two devices perform data interaction. Further, in this embodiment of the application, the form of the network 13 is not limited, that is, the network 13 may be a wireless network (e.g., WIFI, bluetooth, etc.), or may be a wired network.

The first electronic device 11 and the second electronic device 12 may be the same type of electronic device, for example, both may be servers; or they may be different types of electronic devices, for example, the first electronic device 11 may be a smartphone or another smart terminal, and the second electronic device 12 may be a server. In one possible embodiment, a server with high computing power may be used as the second electronic device 12 to improve data processing efficiency and reliability, and thus the processing efficiency of the model training. Meanwhile, a smartphone with low cost and a wide application range is used as the first electronic device 11 to realize interaction between the second electronic device 12 and the user. The interaction process may be: the smartphone acquires the target dry sound audio or the training dry sound data and sends it to the server, the server performs model training or audio processing, and the server then sends the trained audio evaluation model or the processing result back to the smartphone.

Based on the above description, please refer to fig. 3, and fig. 3 is a schematic flowchart of a model training method according to an embodiment of the present application. The method in this embodiment comprises:

s101: training data is acquired.

The training data comprises training dry sound data and corresponding training accompaniment data; the two correspond to the same song and to the same time period. The dry sound refers to a sound without accompaniment, the training dry sound data refers to sound data used for training, and the training accompaniment data refers to accompaniment data matched with the training dry sound data. The specific form of the training data is not limited in this embodiment. For example, in one possible embodiment, the training data may be audio file data, such as the mp3 format; in another possible embodiment, the training data may be signal waveform data, i.e., a waveform that varies over time; in another possible embodiment, the training data may be time-frequency domain feature data, for example in the form of a mel spectrogram. According to the input data format of the audio evaluation model obtained after training, training data in a corresponding format can be adaptively selected.
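Where the mel-spectrogram (time-frequency) format is chosen, the preparation of one training pair might look like the following sketch. It is illustrative only: librosa, the 44.1 kHz sample rate, 80 mel bands, the log compression and the file-path arguments are all assumptions not fixed by this application.

```python
# Illustrative only: one way to turn a dry-sound/accompaniment pair into
# time-frequency (mel-spectrogram) training data.
import librosa
import numpy as np

def load_mel_pair(dry_path: str, acc_path: str, sr: int = 44100, n_mels: int = 80):
    dry, _ = librosa.load(dry_path, sr=sr, mono=True)
    acc, _ = librosa.load(acc_path, sr=sr, mono=True)
    # Trim both signals to the same length so the pair stays time-aligned.
    n = min(len(dry), len(acc))
    dry, acc = dry[:n], acc[:n]
    dry_mel = librosa.feature.melspectrogram(y=dry, sr=sr, n_mels=n_mels)
    acc_mel = librosa.feature.melspectrogram(y=acc, sr=sr, n_mels=n_mels)
    # Log compression is a common (assumed) normalization for mel features.
    return np.log1p(dry_mel), np.log1p(acc_mel)
```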

It is to be understood that there are usually multiple pieces of training data, and the content, style and the like of each piece are not limited. Specifically, training data may be generated using songs of various styles so that the audio evaluation model can accurately evaluate songs of various types. For example, the training data may include 75% popular music, 15% drama, 5% country music and 5% other genres of music. In addition, the training dry sound data and the training accompaniment data in the training data correspond to each other in time, and their lengths can be set as needed. Because the singing style may change across different time periods of the same song, the way in which the dry sound and the accompaniment match and harmonize may also change. Therefore, in order to improve the recognition accuracy of the model, the lengths of the training dry sound data and the training accompaniment data may be short (e.g., less than 5 seconds) so that more accurate features can be extracted.

It should be noted that the training data may be generated locally or may be obtained externally. For example, in one embodiment, the designated song may be subjected to dry sound separation (or called sound source separation), resulting in training dry sound data and training accompaniment data; in another embodiment, a plurality of training dry sound data and a plurality of training accompaniment data may be acquired, and the two types of data are in one-to-one correspondence according to the acquired correspondence data to obtain training data.

S102: and inputting the training dry sound data into a first feature extraction network of the initial model to obtain the training dry sound features.

The initial model refers to the audio evaluation model that has not yet been trained; it is determined as the audio evaluation model after it, or its training process, meets the training completion condition. The initial model comprises a first feature extraction network, a second feature extraction network and a feature processing network, where the first feature extraction network is a network for extracting dry sound features, the second feature extraction network is a network for extracting accompaniment features, and the feature processing network is a network for processing the dry sound features and the accompaniment features to obtain a processing result. It should be noted that this embodiment does not limit the specific structures of the first feature extraction network, the second feature extraction network, and the feature processing network; the structures may be set as needed.
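For concreteness, the parts described above can be sketched in PyTorch roughly as follows. The layer types, sizes and the sigmoid score head are assumptions; the application deliberately leaves the concrete network structures open, so this is only one possible shape of the initial model (the splicing step appears here as a simple concatenation inside forward).

```python
# A minimal PyTorch sketch of the initial model: two feature extraction
# networks, a splicing (concatenation) step, and a feature processing network.
import torch
import torch.nn as nn

def make_extractor(n_mels: int = 80, feat_dim: int = 128) -> nn.Module:
    return nn.Sequential(
        nn.Conv1d(n_mels, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),            # (B, 64)
        nn.Linear(64, feat_dim), nn.ReLU(),
    )

class InitialModel(nn.Module):
    def __init__(self, n_mels: int = 80, feat_dim: int = 128):
        super().__init__()
        self.first_extractor = make_extractor(n_mels, feat_dim)   # dry sound branch
        self.second_extractor = make_extractor(n_mels, feat_dim)  # accompaniment branch
        self.feature_processing = nn.Sequential(                  # FC head
            nn.Linear(2 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),                       # score in [0, 1]
        )

    def forward(self, dry_mel: torch.Tensor, acc_mel: torch.Tensor) -> torch.Tensor:
        dry_feat = self.first_extractor(dry_mel)      # training dry sound features
        acc_feat = self.second_extractor(acc_mel)     # training accompaniment features
        fused = torch.cat([dry_feat, acc_feat], dim=1)   # splicing network (end-to-end)
        return self.feature_processing(fused).squeeze(1)  # training parameter
```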

After the training dry sound data are obtained, the training dry sound data are input into the first feature extraction network, and then the corresponding training dry sound features can be obtained. The generation process of the training dry sound feature can be different according to different structures of the first feature extraction network.

S103: and inputting the training accompaniment data into a second feature extraction network of the initial model to obtain the training accompaniment features.

Correspondingly, after the training accompaniment data are obtained, they are input into the second feature extraction network to obtain the corresponding training accompaniment features. The feature extraction network extracts features from the input data so as to express the characteristics of the input data with the output features and provide a data basis for the subsequent feature processing network.

It should be noted that the execution order of step S102 and step S103 is not limited in the embodiment of the present application. It can be understood that the first feature extraction network and the second feature extraction network extract different features and operate independently of each other, so step S102 and step S103 can be executed simultaneously. In another embodiment, the two steps may be executed sequentially under the influence of factors such as the model structure (for example, there is only one feature extraction network, whose identity differs according to the type of input data), and the order in which they are executed is not limited.

S104: and inputting the training dry sound characteristic and the training accompaniment characteristic into a splicing network of the initial model to obtain the characteristic to be processed.

The splicing network is a network that splices the input features into one feature according to a certain rule; for example, the training dry sound features and the training accompaniment features can be spliced end to end, or the two features can be interleaved, to obtain the features to be processed.
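The two splicing rules mentioned above can be sketched as follows (the batch size and feature dimension are arbitrary illustrative values):

```python
# Two possible splicing rules, sketched with PyTorch tensors.
import torch

dry_feat = torch.randn(8, 128)   # (batch, feature)
acc_feat = torch.randn(8, 128)

# End-to-end splicing: simple concatenation along the feature dimension.
end_to_end = torch.cat([dry_feat, acc_feat], dim=1)                    # (8, 256)

# Interleaved splicing: stack then flatten so the elements alternate.
interleaved = torch.stack([dry_feat, acc_feat], dim=2).reshape(8, -1)  # (8, 256)
```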

S104: inputting the characteristics to be processed into the characteristic processing network of the initial model to obtain training parameters.

The feature processing network is a network for determining the matching harmony degree of the stem sound and the accompaniment according to the features. Therefore, after the features to be processed are obtained, the features to be processed are input into the feature processing network, and the training dry sound features represent the features of the training dry sound data, while the training accompaniment features represent the features of the training accompaniment data, so that the feature processing network can start from the features of the two data to detect whether the two data are matched and harmonious or how much the two are matched and harmonious, and characterize the result of the detection process in the form of the training parameters. It will be appreciated that the specific form of the training parameters may be set as desired, for example as a percentage score.

S105: and determining a loss value by using the training parameters and the training labels of the training data, and performing parameter adjustment on the initial model by using the loss value.

The training labels of the training data are labels capable of reflecting the real matching degree between the training dry sound data and the training accompaniment data; they are usually obtained by manual annotation or can be generated by an annotation network. It should be noted that the matching degree between the dry sound and the accompaniment can be evaluated from multiple music theory angles, such as interval consistency, rhythm matching degree, intonation harmony, timbre harmony, dynamic consistency, and the like, so the training label can reflect the matching degree between the training dry sound data and the training accompaniment data from multiple angles. Determining the loss value from the training parameters and the training labels makes it possible to measure the distance between the current result obtained by the initial model and the real result; the parameters of the initial model are then adjusted according to this distance so that the initial model approaches the real result and thus gains the ability to accurately evaluate the harmony matching degree of the dry sound and the accompaniment. This embodiment does not limit the form and type of the loss value; it may be based, for example, on the Pearson correlation coefficient. The performance of the model is improved through multiple rounds of iterative training.
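As one hedged example of the loss computation, a batch-level loss built from the Pearson correlation coefficient mentioned above might look like this; the exact 1 − r form and the epsilon term are assumptions rather than requirements of the embodiment.

```python
# Sketch of a Pearson-correlation-based loss over a batch of training
# parameters (pred) and training labels (target).
import torch

def pearson_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    pred_c = pred - pred.mean()
    target_c = target - target.mean()
    r = (pred_c * target_c).sum() / (pred_c.norm() * target_c.norm() + eps)
    return 1.0 - r   # 0 when predictions and labels are perfectly correlated
```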

S106: and if the condition that the training is completed is detected to be met, determining the adjusted model as an audio evaluation model.

The training completion condition is a condition characterizing that the initial model can be determined as the audio evaluation model; it may constrain the initial model itself or the training process of the initial model. When the initial model itself meets the training completion condition (for example, its accuracy reaches a threshold), or the training process meets the training completion condition (for example, the number of training rounds or the training duration reaches a threshold), the adjusted model can be determined as the audio evaluation model. Specifically, the adjusted current initial model may be determined directly as the audio evaluation model, or the initial model may first be processed, for example by removing the network layer group used for generating the loss value, to obtain the audio evaluation model.
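Putting S101 to S107 together, a condensed training-loop sketch with a simple training completion condition could look like the following; the Adam optimizer, the mean-squared-error placeholder loss (the Pearson-based loss sketched earlier could be substituted), and the epoch/threshold values are all assumptions.

```python
# Condensed training loop: forward both inputs, compute the loss from the
# training parameters and labels, adjust parameters, stop when the (assumed)
# training completion condition is met.
import torch
import torch.nn.functional as F

def train(model, loader, max_epochs: int = 50, loss_threshold: float = 0.05):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for dry_mel, acc_mel, label in loader:
            pred = model(dry_mel, acc_mel)        # training parameters
            loss = F.mse_loss(pred, label)        # placeholder loss on parameters vs. labels
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # Training completion condition (assumed): mean epoch loss below a threshold,
        # or the maximum number of epochs reached.
        if epoch_loss / max(len(loader), 1) < loss_threshold:
            break
    return model  # the adjusted model is determined as the audio evaluation model
```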

By applying the model training method provided by the embodiment of the application, the initial model is trained with the training data to obtain the audio evaluation model. The training data is constructed in pairs, comprising training dry sound data and training accompaniment data. The initial model is provided with a first feature extraction network and a second feature extraction network, which are respectively used for extracting features from the training dry sound data and the training accompaniment data to obtain the training dry sound features and the training accompaniment features. The training dry sound features and the training accompaniment features are jointly input into the feature processing network, which can comprehensively consider the matching harmony degree between them and obtain training parameters capable of reflecting this matching harmony degree. The training labels express the harmony degree of the dry sound and the accompaniment; determining a loss value from the training parameters and the training labels makes it possible to measure the gap between the evaluation result obtained by the initial model's current evaluation manner and the real result, and the loss value is then used to adjust the parameters of the initial model, improving its evaluation manner so that the harmony degree between the dry sound and the accompaniment can be evaluated more accurately. Once the training completion condition is met, it can be determined that the initial model is able to accurately evaluate the harmony degree of the dry sound and the accompaniment, and the initial model is then determined as the audio evaluation model. In application, the target dry sound audio sung by the user and the target accompaniment audio corresponding to the song are respectively input into the first feature extraction network and the second feature extraction network, and a processing result reflecting the harmony degree of the target dry sound audio and the target accompaniment audio is obtained. Through this training manner, an audio evaluation model capable of evaluating the matching degree between the user's dry sound and the song accompaniment is obtained, richer evaluation modes can be provided, and the evaluation is carried out from multiple angles of music theory, so that the processing parameters are credible and reliable.

Based on the above embodiments, the present embodiment specifically describes some steps in the above embodiments. In one embodiment, in order to obtain an audio evaluation model with higher accuracy, it is necessary to generate a loss value by using a training label with higher accuracy so as to perform parameter adjustment. Therefore, the generation process of the training label comprises the following steps:

step 11: and outputting training audio corresponding to the training data.

Step 12: and acquiring a plurality of groups of label data corresponding to the training audio.

Step 13: and generating initial training labels by utilizing the plurality of training sub-labels of each group, and generating training labels by utilizing the plurality of initial training labels.

When training labels need to be obtained, the training audio corresponding to the training data can be output so that annotators can determine the label data from it, where the training audio is the song audio formed by the training dry sound data and the training accompaniment data. It should be noted that each group of label data includes a plurality of training sub-labels, and different training sub-labels correspond to different evaluation angles of the matching between singing voice and accompaniment (e.g., interval consistency, rhythm matching degree, intonation harmony, timbre harmony, dynamic consistency, etc.). After listening to the training audio, an annotator evaluates it from these evaluation angles by inputting the corresponding training sub-labels. In this embodiment there may be several annotators, so for one training audio several corresponding groups of label data may be acquired.

After the plurality of groups of label data are obtained, a plurality of initial training labels can be generated from them, and the initial training labels are further used to generate the training label. This embodiment does not limit the specific generation manner of the initial training labels and the training label; it may be, for example, averaging or weighted averaging.
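A minimal sketch of this aggregation, assuming plain averaging (the embodiment equally allows weighted averaging) and five illustrative evaluation angles:

```python
# Each annotator provides a group of sub-labels (one per evaluation angle);
# each group is averaged into an initial training label, and the initial
# labels are averaged into the final training label.
import numpy as np

# Rows: annotators (groups of label data); columns: evaluation angles, e.g.
# interval consistency, rhythm match, intonation, timbre harmony, dynamics.
label_data = np.array([
    [0.8, 0.7, 0.9, 0.6, 0.8],
    [0.7, 0.8, 0.8, 0.7, 0.9],
    [0.9, 0.6, 0.9, 0.8, 0.7],
])

initial_labels = label_data.mean(axis=1)   # one initial training label per group
training_label = initial_labels.mean()     # final training label for this clip
```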

Based on the above embodiments, in an implementation, the initial model may be a twin network, in which case, the process of performing parameter adjustment on the initial model by using the loss value may include:

step 21: and carrying out parameter adjustment on the first feature extraction network by using the loss value.

Step 22: and utilizing the adjusted first feature extraction network parameters to perform parameter replacement on the second feature extraction network.

Step 23: and utilizing the loss value to adjust parameters of the feature processing network.

The twin network is a twin neural network (also known as a Siamese neural network), a coupled framework built on two artificial neural networks. The twin neural network takes two samples as input and outputs their embeddings in a high-dimensional space so as to compare the degree of similarity between the two samples. Usually, the twin neural network is formed by two structurally identical neural networks that share weights. Therefore, when parameters are adjusted, the loss value is used to adjust the parameters of the first feature extraction network, and after the adjustment is completed, weight sharing is performed on the second feature extraction network according to the first feature extraction network. Weight sharing means that the parameters of the second feature extraction network are replaced by the parameters of the first feature extraction network, that is, by the adjusted first feature extraction network parameters. In addition, the loss value is also used to adjust the parameters of the feature processing network; that is, the parameter adjustment process covers all adjustable parameters of the network. It can be understood that in this case only one feature extraction network may actually be included in the initial model: it serves as the first feature extraction network when the input is dry sound data and as the second feature extraction network otherwise. It should be noted that in this embodiment the roles of the first and second feature extraction networks may be swapped, that is, the second feature extraction network is parameter-adjusted and the first feature extraction network receives the shared weights.
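A sketch of this weight-sharing update, assuming the two branches have identical structure as in the InitialModel sketch above (the attribute names first_extractor and second_extractor come from that sketch, not from the application):

```python
# Copy the adjusted first-branch parameters into the second branch so that the
# two feature extraction networks keep identical weights.
import torch

def share_weights(model) -> None:
    model.second_extractor.load_state_dict(model.first_extractor.state_dict())

# One possible schedule: exclude second_extractor from the optimizer and call
# share_weights(model) after every optimizer.step().
```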

Referring to fig. 4, fig. 4 is a schematic structural diagram of a specific audio evaluation model according to an embodiment of the present application. In the application process, a target dry sound and a target accompaniment are input respectively; the two branches share weights (that is, there are weight-sharing channels between corresponding layers) and are respectively used for extracting features from the target dry sound and the target accompaniment. The extracted features are then input into the feature processing network to obtain the final result. In this embodiment, the feature processing network includes a network layer group composed of a concat network layer (feature merging network layer) and an FC network layer (fully connected layer).

In another embodiment, the initial model may be a pseudo-twin network. In this case, the process of performing parameter adjustment on the initial model by using the loss value may include:

step 31: and respectively carrying out parameter adjustment on the first characteristic extraction network and the second characteristic extraction network by using the loss value.

Step 32: and utilizing the loss value to adjust parameters of the feature processing network.

A pseudo-twin network (pseudo-Siamese network) also has two branches, but each branch has its own weights (parameters). In this case, the first feature extraction network and the second feature extraction network need to be parameter-adjusted separately using the loss value, and the initial model must include two feature extraction networks.

Referring to fig. 5, fig. 5 is a schematic structural diagram of another specific audio evaluation model according to an embodiment of the present application. In the application process, the two branches respectively extract the characteristics of the target dry sound and the target accompaniment.

In another embodiment, the initial model may be a semi-twin network. In this case, the initial model is parametrically adjusted by a loss value, including:

step 41: and carrying out parameter adjustment on the first feature extraction network by using the loss value.

Step 42: and utilizing the adjusted first characteristic to extract a plurality of first network layer parameters in the network, and carrying out parameter replacement on a plurality of corresponding second network layers in the second characteristic extraction network.

Step 43: and carrying out parameter adjustment on a non-second network layer in the second feature extraction network by using the loss value.

Step 44: and utilizing the loss value to adjust parameters of the feature processing network.

A semi-twin network means that in the two feature extraction branches of the initial model, the earlier network layers share weights while the later network layers do not. Therefore, in this case, after the first feature extraction network is parameter-adjusted with the loss value, the plurality of first network layers are used to share weights with the corresponding second network layers in the second feature extraction network, so those second network layers do not need to be adjusted with the loss value. The loss value may be used to adjust the parameters of the non-second network layers in the second feature extraction network simultaneously with the weight sharing, or before or after it. In this embodiment, the roles of the first and second feature extraction networks may likewise be swapped, that is, the second feature extraction network is parameter-adjusted and the first feature extraction network receives the shared weights.
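A sketch of the partial weight sharing, assuming both feature extraction networks are nn.Sequential modules with an identical front section; the choice of four shared layers follows the Fig. 6 example and is otherwise arbitrary:

```python
import torch.nn as nn

def share_front_layers(first_net: nn.Sequential, second_net: nn.Sequential,
                       num_shared_layers: int = 4) -> None:
    # Replace the parameters of the first few layers of the second feature
    # extraction network with those of the (already adjusted) first one.
    for i in range(num_shared_layers):
        second_net[i].load_state_dict(first_net[i].state_dict())
    # The remaining (non-second) layers of second_net keep their own weights
    # and are adjusted directly by the loss value.
```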

Referring to fig. 6, fig. 6 is a schematic structural diagram of another specific audio evaluation model according to an embodiment of the present application. It can be seen from the figure that the first four network layers of the two feature extraction networks are shared in weight, and the rest network layers are not shared.

In another embodiment, the initial model may be a varying twin network. In this case, the initial model is parametrically adjusted by a loss value, including:

step 51: and carrying out parameter adjustment on the first feature extraction network by using the loss value.

Step 52: and utilizing the adjusted first feature extraction network parameters to carry out parameter replacement on the first branch of the second feature extraction network.

Step 53: and performing parameter adjustment on a second branch of the second feature extraction network by using the loss value or the first feature extraction network.

Step 54: and utilizing the loss value to adjust parameters of the feature processing network.

The varying twin network is a combination of the pseudo-twin network and the semi-twin network. Specifically, the second feature extraction network has two branch structures: one branch is exactly the same as the first feature extraction network, and the two share weights during training; the other branch may or may not have the same structure as the first feature extraction network. If it is different, its parameters must be adjusted independently using the loss value; if it is the same, its parameters can be replaced based on the first feature extraction network.
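One possible (assumed) shape of such a two-branch second feature extraction network is sketched below; how the two branch outputs are combined is not specified by the application, so the concatenation here is an assumption:

```python
import torch
import torch.nn as nn

class VaryingSecondExtractor(nn.Module):
    """Second feature extraction network with two branches (cf. Fig. 7)."""

    def __init__(self, shared_branch: nn.Module, independent_branch: nn.Module):
        super().__init__()
        # Same structure as the first feature extraction network; during training
        # its parameters are replaced with the adjusted first-network parameters.
        self.shared_branch = shared_branch
        # Own structure and weights, adjusted directly by the loss value (or, if
        # its structure matches, optionally replaced from the first network too).
        self.independent_branch = independent_branch

    def forward(self, acc_mel: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.shared_branch(acc_mel),
                          self.independent_branch(acc_mel)], dim=1)
```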

Referring to fig. 7, fig. 7 is a schematic structural diagram of another specific audio evaluation model according to an embodiment of the present application, which shows a case where the second branch and the first feature extraction network do not share a weight.

Based on the above embodiments, after the model training is finished, the dry sound sung by the user can be evaluated with the above method to judge whether it matches the corresponding accompaniment. Specifically, the method can comprise the following steps:

step 61: and acquiring target dry sound audio and corresponding target accompaniment audio.

Step 62: and inputting the target dry sound audio into a first feature extraction network of the audio evaluation model to obtain the target dry sound feature.

Step 63: And inputting the target accompaniment audio into a second feature extraction network of the audio evaluation model to obtain the target accompaniment features.

Step 64: and inputting the target dry sound characteristic and the target accompaniment characteristic into a splicing network of the audio evaluation model to obtain the target characteristic.

Step 65: and inputting the target characteristics into a characteristic processing network of the initial model to obtain a processing result. The target accompanying audio is obtained based on the model training method. The target dry sound audio refers to the dry sound audio obtained based on singing of the user, and the target accompaniment audio refers to the accompaniment audio matched with the target dry sound audio. After the target stem sound audio and the target accompaniment audio are input into the corresponding feature extraction network, the target stem sound feature and the target accompaniment feature are obtained and are further spliced to obtain the target feature, the target feature is input into the feature processing network to be processed, the audio evaluation model can output a corresponding processing result, the processing result refers to a result capable of evaluating the harmonious matching degree between the target stem sound audio and the target accompaniment audio, and namely the processing result is used for representing the matching harmonious degree between the target stem sound audio and the target accompaniment audio.

In practical applications, a user usually sings a complete song continuously and expects an evaluation of the whole song, whereas to improve the model's accuracy the target dry sound audio and the target accompaniment audio are usually kept short. In this case, acquiring the target dry sound audio and the corresponding target accompaniment audio includes:

step 71: and acquiring initial dry sound audio and corresponding initial accompaniment audio.

Step 72: and identifying and removing the mute blank part in the initial dry audio to obtain the intermediate dry audio.

Step 73: and removing redundant parts in the initial accompaniment audio to obtain an intermediate accompaniment audio.

Step 74: and performing sliding window segmentation processing with the same parameters on the intermediate dry sound audio and the intermediate accompaniment audio to obtain a plurality of target dry sound audios corresponding to the intermediate dry sound audio and a plurality of target accompaniment audios corresponding to the intermediate accompaniment audio.

The initial dry sound audio refers to the complete dry sound audio sung by the user, which usually corresponds to a complete song or a longer song segment (whose length exceeds the window length), and the initial accompaniment audio is the accompaniment audio corresponding to the initial dry sound audio. The specific manner of acquiring the initial dry sound audio and the initial accompaniment audio is not limited in this embodiment. Referring to fig. 8, fig. 8 is a specific audio waveform diagram according to an embodiment of the present application, in which the two tracks record the initial dry sound audio and the initial accompaniment audio respectively.

Since not every moment in a song needs to be sung, there are intervals in which the user simply waits, so the initial dry sound audio contains blank, i.e. silent, portions. In these silent blank portions there is no dry sound to match against the accompaniment, and evaluating them cannot reflect the user's singing level. The silent blank portions in the initial dry sound audio are therefore identified and removed, avoiding interference with the accuracy of the processing result and yielding the intermediate dry sound audio.
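As one possible implementation (an assumption, since this disclosure does not mandate a specific detection method), the silent blank portions can be located with an energy-based split such as librosa's, keeping the voiced intervals so the accompaniment can later be trimmed on the same time axis:

import numpy as np
import librosa

def remove_silence(dry_audio, top_db=40):
    # Non-silent intervals, in samples; top_db is an illustrative threshold.
    intervals = librosa.effects.split(dry_audio, top_db=top_db)
    intermediate_dry = np.concatenate([dry_audio[s:e] for s, e in intervals])
    return intermediate_dry, intervals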

Because matching-degree detection is only meaningful for the dry sound and the accompaniment at the same moment, after the silent blank portions are removed, the redundant portions of the initial accompaniment audio are removed as well to obtain the intermediate accompaniment audio. The redundant portions correspond to the silent blank portions on the time axis. After the intermediate accompaniment audio is obtained, the intermediate dry sound audio and the intermediate accompaniment audio are segmented with sliding windows to obtain a plurality of target dry sound audios and a plurality of target accompaniment audios. The segmentation parameters include the window length and the sliding step: the window length is the length of each target dry sound audio and target accompaniment audio, for example 5 seconds; the sliding step is the distance the window moves each time, usually expressed as a duration, for example 2 seconds.
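A minimal sketch of this sliding-window segmentation, using the 5-second window and 2-second step given above as example values only:

def slide_segments(audio, sr, window_s=5.0, step_s=2.0):
    window, step = int(window_s * sr), int(step_s * sr)
    return [audio[i:i + window] for i in range(0, len(audio) - window + 1, step)]

# The intermediate dry sound audio and the intermediate accompaniment audio are
# segmented with identical parameters, so segment k of one always aligns in time
# with segment k of the other.
# dry_segments = slide_segments(intermediate_dry, sr)
# acc_segments = slide_segments(intermediate_acc, sr)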

Referring to fig. 9, fig. 9 is a flowchart of data processing according to an embodiment of the present disclosure. The silent blank portions can be detected, for example, by voice activity detection. After the target dry sound audios and target accompaniment audios are obtained by segmentation, they can be input into the audio evaluation model. In one embodiment, the audio evaluation model itself performs downsampling, framing and windowing, Fourier transformation, Mel filtering and so on on the audio signal to obtain a Mel spectrum. In another embodiment, the Mel spectrum may be computed externally and fed to the audio evaluation model as input data. Convolution, pooling and similar operations are then applied to the Mel spectrum to obtain the corresponding features, namely the target dry sound feature and the target accompaniment feature, both of which can be represented as feature maps. After the target dry sound feature and the target accompaniment feature are obtained, they are combined and processed by several fully connected layers to obtain the corresponding processing result.
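The front-end chain described above (downsampling, framing and windowing, Fourier transformation, Mel filtering) could be realized, for example, with librosa; the 16 kHz sample rate, FFT size and 80 Mel bands below are illustrative assumptions, not values fixed by this disclosure:

import librosa

def mel_spectrum(segment, sr, target_sr=16000, n_mels=80):
    y = librosa.resample(segment, orig_sr=sr, target_sr=target_sr)   # downsampling
    mel = librosa.feature.melspectrogram(
        y=y, sr=target_sr, n_fft=1024, hop_length=256, n_mels=n_mels  # framing, windowing, FFT, Mel filtering
    )
    return librosa.power_to_db(mel)   # log-Mel spectrum fed to the feature extraction network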

Further, in one embodiment, the process of obtaining the target dry sound audio and the corresponding target accompaniment audio may include:

step 81: and acquiring initial dry sound audio and corresponding initial accompaniment audio.

Step 82: and carrying out segmentation processing in the same form on the initial dry sound audio and the initial accompaniment audio to obtain a plurality of target dry sound audio and corresponding target accompaniment audio.

The audio processing method further comprises the following steps:

step 83: and acquiring a processing result corresponding to each target dry sound.

Step 84: and generating an evaluation result corresponding to the initial dry sound by using all the processing results.

For example, the segmentation processing manner of steps 81 to 82 may specifically adopt the sliding window segmentation process described in steps 71 to 74.

In this embodiment, the initial dry sound audio and the initial accompaniment audio may be acquired in two ways. In the first way, the input audio is used directly as the initial dry sound audio, for example audio captured from the user with a microphone component, and the initial accompaniment audio is selected from a plurality of preset accompaniment audios according to the input audio information. That is, the user indicates the song being sung through the input audio information, the input audio itself provides the initial dry sound audio, and the initial accompaniment audio is obtained from the preset accompaniment audios.

In the second way, in order to avoid the storage space occupied by preset accompaniment audios and to avoid invalid processing results caused by a mismatch between the audio information and the input audio, input audio in which the dry sound and the accompaniment are mixed together can be acquired directly. Sound source separation is then performed on the input audio to distinguish the dry sound from the accompaniment, yielding the initial dry sound audio and the initial accompaniment audio. Referring to fig. 10, fig. 10 is a schematic view of an audio processing flow provided by an embodiment of the present application, in which the initial dry sound audio and the initial accompaniment audio are obtained by sound source separation.
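This disclosure does not name a particular separation algorithm; as one hedged example, an off-the-shelf two-stem separator such as Spleeter can split the mixed input audio into vocals (used as the initial dry sound audio) and accompaniment (used as the initial accompaniment audio):

from spleeter.separator import Separator

separator = Separator('spleeter:2stems')            # pretrained vocals + accompaniment model
separator.separate_to_file('mixed_input.wav', 'separated/')
# Expected outputs: separated/mixed_input/vocals.wav and .../accompaniment.wav
# (exact paths depend on the Spleeter version and the input file name).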

After all the target dry sound audios and target accompaniment audios are processed, a plurality of corresponding processing results are obtained. Each processing result evaluates the user's singing level over one time window, so all the processing results together can be used to generate the evaluation result corresponding to the initial dry sound audio, which comprehensively reflects the user's average singing level over the whole song. Referring to fig. 11, fig. 11 is a flowchart illustrating a specific audio evaluation result generation process provided in an embodiment of the present application, namely scoring a user's singing in a karaoke scene. The user's voice is captured with an audio acquisition device such as a microphone and segmented into dry sound segments 1 to N, and the accompaniment corresponding to the song being sung is segmented in the same way into accompaniment segments 1 to N. The song evaluation model in the figure is the audio evaluation model; after the score (i.e., processing result) of each segment is obtained, the scores of segments 1 to N are combined to obtain the score (i.e., evaluation result) of the whole song. For example, the average score may be used as the whole-song score.
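Aggregating the per-segment scores into the whole-song score is straightforward; the text gives averaging as one example, sketched below, and other statistics such as the median would be applied the same way.

def whole_song_score(segment_scores):
    # Average of the processing results of segments 1..N; 0.0 if nothing was scored.
    return sum(segment_scores) / len(segment_scores) if segment_scores else 0.0

# e.g. whole_song_score([82, 76, 90]) == 82.666...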

In the following, a computer-readable storage medium provided by an embodiment of the present application is introduced, and the computer-readable storage medium described below and the model training method described above may be referred to correspondingly.

The present application further provides a computer-readable storage medium having a computer program stored thereon, which, when being executed by a processor, implements the steps of the above-mentioned model training method.

The computer-readable storage medium may include: various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between these entities or actions. Also, the terms 'comprise', 'include', or any variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus.

The principle and the implementation of the present application are explained herein by applying specific examples, and the above description of the embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
