Model training method, audio processing method, device and readable storage medium

Document No.: 344500    Publication date: 2021-12-03

Note: This technology, "Model training method, audio processing method, device and readable storage medium", was designed and created by 江益靓, 姜涛, 赵合 and 胡鹏 on 2021-09-07. Abstract: The application discloses a model training method, an audio processing method, a device and a computer-readable storage medium. The model training method comprises: acquiring training data, the training data comprising training dry sound data and corresponding training accompaniment data; inputting the training dry sound data into a first feature extraction network of an initial model to obtain training dry sound features; inputting the training accompaniment data into a second feature extraction network of the initial model to obtain training accompaniment features; inputting the training dry sound features and the training accompaniment features into a feature processing network of the initial model to obtain training parameters; determining a loss value by using the training parameters and the training labels of the training data, and performing parameter adjustment on the initial model by using the loss value; and if the training completion condition is detected to be met, determining the adjusted model as an audio evaluation model. The method provides richer evaluation modes and evaluates from multiple angles of music theory, so that the processing parameters are credible and reliable.

1. A method of model training, comprising:

acquiring training data; the training data comprises training dry sound data and corresponding training accompaniment data;

inputting the training dry sound data into a first feature extraction network of an initial model to obtain training dry sound features;

inputting the training accompaniment data into a second feature extraction network of the initial model to obtain training accompaniment features;

inputting the training dry sound features and the training accompaniment features into a splicing network of the initial model to obtain features to be processed;

inputting the features to be processed into a feature processing network of the initial model to obtain training parameters;

determining a loss value by using the training parameters and training labels of the training data, and performing parameter adjustment on the initial model by using the loss value;

and if the condition that the training is completed is detected to be met, determining the adjusted model as an audio evaluation model.

2. The model training method of claim 1, wherein the generation process of the training labels comprises:

outputting training audio corresponding to the training data;

acquiring a plurality of groups of label data corresponding to the training audio; each group of label data comprises a plurality of training sub-labels, and different training sub-labels correspond to different singing voice and accompaniment matching evaluation angles;

and generating an initial training label by using each group of the plurality of training sub-labels, and generating the training label by using the plurality of initial training labels.

3. The model training method of claim 1, wherein the initial model is a twin network, and wherein the parameter adjusting the initial model using the loss value comprises:

carrying out parameter adjustment on the first feature extraction network by using the loss value;

utilizing the adjusted first feature extraction network parameters to carry out parameter replacement on the second feature extraction network;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

4. The model training method of claim 1, wherein the initial model is a pseudo-twin network, and wherein the parameter adjusting the initial model using the loss value comprises:

respectively carrying out parameter adjustment on the first feature extraction network and the second feature extraction network by using the loss values;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

5. The model training method of claim 1, wherein the initial model is a semi-twin network, and wherein the parameter adjusting the initial model using the loss value comprises:

carrying out parameter adjustment on the first feature extraction network by using the loss value;

performing parameter replacement on a plurality of corresponding second network layers in the second feature extraction network by using a plurality of adjusted first network layer parameters in the first feature extraction network;

performing parameter adjustment on a non-second network layer in the second feature extraction network by using the loss value;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

6. The model training method of claim 1, wherein the initial model is a varying twin network, and wherein the parameter adjusting the initial model using the loss value comprises:

carrying out parameter adjustment on the first feature extraction network by using the loss value;

performing parameter replacement on the first branch of the second feature extraction network by using the adjusted first feature extraction network parameters;

performing parameter adjustment on a second branch of the second feature extraction network using the loss value or the first feature extraction network;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

7. An audio processing method, comprising:

acquiring target dry sound audio and corresponding target accompaniment audio;

inputting the target dry sound audio into a first feature extraction network of an audio evaluation model to obtain target dry sound features;

inputting the target accompaniment audio into a second feature extraction network of the audio evaluation model to obtain target accompaniment features;

inputting the target dry sound features and the target accompaniment features into a splicing network of the audio evaluation model to obtain target features;

inputting the target features into a feature processing network of the audio evaluation model to obtain a processing result; wherein the processing result is used for representing the matching harmony degree between the target dry sound audio and the target accompaniment audio, and the audio evaluation model is obtained based on the model training method according to any one of claims 1 to 6.

8. The audio processing method according to claim 7, wherein the obtaining of the target dry audio and the corresponding target accompaniment audio comprises:

acquiring initial dry sound audio and corresponding initial accompaniment audio;

identifying and removing a mute blank part in the initial dry audio to obtain an intermediate dry audio;

removing redundant parts in the initial accompaniment audio to obtain an intermediate accompaniment audio; the redundant part corresponds to the mute blank part on a time axis;

performing sliding window segmentation processing with the same parameters on the intermediate dry sound audio and the intermediate accompaniment audio to obtain a plurality of target dry sound audio corresponding to the intermediate dry sound audio and a plurality of target accompaniment audio corresponding to the intermediate accompaniment audio; the parameters include window length and sliding window step length.

9. The audio processing method according to claim 7, wherein the obtaining of the target dry audio and the corresponding target accompaniment audio comprises:

acquiring initial dry sound audio and corresponding initial accompaniment audio;

performing segmentation processing in the same form on the initial dry sound audio and the initial accompaniment audio to obtain a plurality of target dry sound audio and corresponding target accompaniment audio;

the audio processing method further comprises the following steps:

acquiring the processing result corresponding to each target dry sound audio;

and generating an evaluation result corresponding to the initial dry sound audio by using all the processing results.

10. An electronic device comprising a memory and a processor, wherein:

the memory is used for storing a computer program;

the processor is used for executing the computer program to implement the model training method of any one of claims 1 to 6 and/or the audio processing method of any one of claims 7 to 9.

11. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the model training method of any one of claims 1 to 6 and/or the audio processing method of any one of claims 7 to 9.

Technical Field

The present application relates to the field of audio processing technologies, and in particular, to a model training method, an audio processing method, an electronic device, and a computer-readable storage medium.

Background

In karaoke software, the user's singing needs to be evaluated so that the user can play games or clearly know his or her singing level. In the related art, the user's dry sound is generally evaluated with intonation and the like as the evaluation criterion: for example, a fundamental frequency curve of the original singing of a song is obtained, the fundamental frequency curve of the user's dry sound is compared with it, and the matching degree is used as an evaluation parameter of the user's singing level. However, this evaluation mode is single and rigid, which limits the user's free play, and it does not consider other evaluation angles such as rhythm and timbre harmony, so the reliability of the evaluation parameters is low.

Disclosure of Invention

In view of the above, an object of the present application is to provide a model training method, an audio processing method, an electronic device, and a computer-readable storage medium that make the evaluation parameters of audio credible and reliable.

In order to solve the above technical problem, in a first aspect, the present application provides a model training method, including:

acquiring training data; the training data comprises training dry sound data and corresponding training accompaniment data;

inputting the training dry sound data into a first feature extraction network of an initial model to obtain training dry sound features;

inputting the training accompaniment data into a second feature extraction network of the initial model to obtain training accompaniment features;

inputting the training dry sound features and the training accompaniment features into a splicing network of the initial model to obtain features to be processed;

inputting the features to be processed into a feature processing network of the initial model to obtain training parameters;

determining a loss value by using the training parameters and training labels of the training data, and performing parameter adjustment on the initial model by using the loss value;

and if the condition that the training is completed is detected to be met, determining the adjusted model as an audio evaluation model.

Optionally, the generating process of the training label includes:

outputting training audio corresponding to the training data;

acquiring a plurality of groups of label data corresponding to the training audio; each group of label data comprises a plurality of training sub-labels, and different training sub-labels correspond to different singing voice and accompaniment matching evaluation angles;

and generating an initial training label by using each group of the plurality of training sub-labels, and generating the training label by using the plurality of initial training labels.

Optionally, the initial model is a twin network, and the parameter adjustment of the initial model using the loss value includes:

carrying out parameter adjustment on the first feature extraction network by using the loss value;

utilizing the adjusted first feature extraction network parameters to carry out parameter replacement on the second feature extraction network;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

Optionally, the initial model is a pseudo-twin network, and the performing parameter adjustment on the initial model by using the loss value includes:

respectively carrying out parameter adjustment on the first feature extraction network and the second feature extraction network by using the loss values;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

Optionally, the initial model is a semi-twin network, and the parameter adjustment of the initial model using the loss value includes:

carrying out parameter adjustment on the first feature extraction network by using the loss value;

performing parameter replacement on a plurality of corresponding second network layers in the second feature extraction network by using a plurality of adjusted first network layer parameters in the first feature extraction network;

performing parameter adjustment on a non-second network layer in the second feature extraction network by using the loss value;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

Optionally, the initial model is a varying twin network, and the parameter adjustment of the initial model using the loss value includes:

carrying out parameter adjustment on the first feature extraction network by using the loss value;

performing parameter replacement on the first branch of the second feature extraction network by using the adjusted first feature extraction network parameters;

performing parameter adjustment on a second branch of the second feature extraction network using the loss value or the first feature extraction network;

and utilizing the loss value to carry out parameter adjustment on the feature processing network.

In a second aspect, the present application further provides an audio processing method, including:

acquiring target dry sound audio and corresponding target accompaniment audio;

inputting the target dry sound audio into a first feature extraction network of an audio evaluation model to obtain target dry sound features;

inputting the target accompaniment audio into a second feature extraction network of the audio evaluation model to obtain target accompaniment features;

inputting the target dry sound features and the target accompaniment features into a splicing network of the audio evaluation model to obtain target features;

inputting the target features into a feature processing network of the audio evaluation model to obtain a processing result; the processing result is used for representing the matching harmony degree between the target dry sound audio and the target accompaniment audio, and the audio evaluation model is obtained based on the model training method described above.

Optionally, the acquiring the target dry sound audio and the corresponding target accompaniment audio includes:

acquiring initial dry sound audio and corresponding initial accompaniment audio;

identifying and removing a mute blank part in the initial dry audio to obtain an intermediate dry audio;

removing redundant parts in the initial accompaniment audio to obtain an intermediate accompaniment audio; the redundant part corresponds to the mute blank part on a time axis;

performing sliding window segmentation processing with the same parameters on the intermediate dry sound audio and the intermediate accompaniment audio to obtain a plurality of target dry sound audio corresponding to the intermediate dry sound audio and a plurality of target accompaniment audio corresponding to the intermediate accompaniment audio; the parameters include window length and sliding window step length.

Optionally, the acquiring the target dry sound audio and the corresponding target accompaniment audio includes:

acquiring initial dry sound audio and corresponding initial accompaniment audio;

performing segmentation processing in the same form on the initial dry sound audio and the initial accompaniment audio to obtain a plurality of target dry sound audio and corresponding target accompaniment audio;

the audio processing method further comprises the following steps:

acquiring the processing result corresponding to each target dry sound audio;

and generating an evaluation result corresponding to the initial dry sound audio by using all the processing results.

In a third aspect, the present application further provides an electronic device, comprising a memory and a processor, wherein:

the memory is used for storing a computer program;

the processor is configured to execute the computer program to implement the above model training method and/or the above audio processing method.

In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the model training method described above, and/or the audio processing method described above.

The model training method provided by the application acquires training data; the training data comprises training dry sound data and corresponding training accompaniment data; inputting training dry sound data into a first feature extraction network of the initial model to obtain training dry sound features; inputting training accompaniment data into a second feature extraction network of the initial model to obtain training accompaniment features; inputting the training dry sound characteristic and the training accompaniment characteristic into a splicing network of the initial model to obtain a characteristic to be processed; inputting the characteristics to be processed into a characteristic processing network of the initial model to obtain training parameters; determining a loss value by using the training parameters and the training labels of the training data, and performing parameter adjustment on the initial model by using the loss value; and if the condition that the training is completed is detected to be met, determining the adjusted model as an audio evaluation model.

The audio processing method provided by the application acquires target dry sound audio and corresponding target accompaniment audio; inputs the target dry sound audio into a first feature extraction network of the audio evaluation model to obtain target dry sound features; inputs the target accompaniment audio into a second feature extraction network of the audio evaluation model to obtain target accompaniment features; inputs the target dry sound features and the target accompaniment features into a splicing network of the audio evaluation model to obtain target features; and inputs the target features into a feature processing network of the audio evaluation model to obtain a processing result; the processing result is used for representing the matching harmony degree between the target dry sound audio and the target accompaniment audio, and the audio evaluation model is obtained based on the model training method.

Therefore, the method trains the initial model with the training data to obtain the audio evaluation model. The training data is constructed in pairs, comprising training dry sound data and training accompaniment data. The initial model is provided with a first feature extraction network and a second feature extraction network, which are respectively used for extracting features from the training dry sound data and the training accompaniment data to obtain the training dry sound features and the training accompaniment features. After the training dry sound features and the training accompaniment features are spliced into the features to be processed, the features to be processed are input into the feature processing network, which can comprehensively consider the matching harmony degree between the training dry sound features and the training accompaniment features and obtain training parameters capable of reflecting this matching harmony degree. The training labels express the harmony degree of the dry sound and the accompaniment; determining a loss value from the training parameters and the training labels makes it possible to measure the gap between the evaluation result obtained by the initial model's current evaluation manner and the real result, and the loss value is then used to adjust the parameters of the initial model, improving its evaluation manner so that the harmony degree between the dry sound and the accompaniment can be evaluated more accurately. Once the training completion condition is met, it can be determined that the initial model is able to accurately evaluate the harmony degree of the dry sound and the accompaniment, and the initial model is then determined as the audio evaluation model. In application, the target dry sound audio sung by the user and the target accompaniment audio corresponding to the song are respectively input into the first feature extraction network and the second feature extraction network, and a processing result reflecting the harmony degree of the target dry sound audio and the target accompaniment audio is obtained. Through this training manner, an audio evaluation model capable of evaluating the matching degree between the user's dry sound and the song accompaniment is obtained, richer evaluation modes can be provided, and the evaluation is carried out from multiple angles of music theory, so that the processing parameters are credible and reliable.

In addition, the present application further provides an electronic device and a computer-readable storage medium, which have the same beneficial effects described above.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic diagram of a hardware composition framework to which a model training method according to an embodiment of the present disclosure is applied;

FIG. 2 is a block diagram of a hardware framework for another model training method according to an embodiment of the present disclosure;

fig. 3 is a schematic flowchart of a model training method according to an embodiment of the present disclosure;

fig. 4 is a schematic structural diagram of a specific audio evaluation model provided in an embodiment of the present application;

fig. 5 is a schematic structural diagram of another specific audio evaluation model provided in an embodiment of the present application;

fig. 6 is a schematic structural diagram of another specific audio evaluation model provided in an embodiment of the present application;

fig. 7 is a schematic structural diagram of another specific audio evaluation model provided in an embodiment of the present application;

FIG. 8 is a diagram of a specific audio waveform provided by an embodiment of the present application;

fig. 9 is a flowchart of data processing provided in an embodiment of the present application;

fig. 10 is a schematic view of an audio processing flow provided in an embodiment of the present application;

fig. 11 is a flowchart for generating a specific audio evaluation result according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

For convenience of understanding, a hardware composition framework used in a model training method and/or a scheme corresponding to an audio processing method provided in the embodiments of the present application is described first. Referring to fig. 1, fig. 1 is a schematic diagram of a hardware composition framework applicable to a model training method according to an embodiment of the present disclosure. Wherein the electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.

Wherein, the processor 101 is used for controlling the overall operation of the electronic device 100 to complete all or part of the steps of the model training method and/or the audio processing method; the memory 102 is used to store various types of data to support operation at the electronic device 100, such data may include, for example, instructions for any application or method operating on the electronic device 100, as well as application-related data. The Memory 102 may be implemented by any type or combination of volatile and non-volatile Memory devices, such as one or more of Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic or optical disk. In the present embodiment, the memory 102 stores therein at least programs and/or data for realizing the following functions:

acquiring training data; the training data comprises training dry sound data and corresponding training accompaniment data;

inputting training dry sound data into a first feature extraction network of the initial model to obtain training dry sound features;

inputting training accompaniment data into a second feature extraction network of the initial model to obtain training accompaniment features;

inputting the training dry sound features and the training accompaniment features into a splicing network of the initial model to obtain features to be processed;

inputting the features to be processed into a feature processing network of the initial model to obtain training parameters;

determining a loss value by using the training parameters and the training labels of the training data, and performing parameter adjustment on the initial model by using the loss value;

and if the condition that the training is completed is detected to be met, determining the adjusted model as an audio evaluation model.

and/or:

acquiring target dry sound audio and corresponding target accompaniment audio;

inputting the target dry sound audio into a first feature extraction network of the audio evaluation model to obtain target dry sound features;

inputting the target accompaniment audio into a second feature extraction network of the audio evaluation model to obtain target accompaniment features;

inputting the target dry sound features and the target accompaniment features into a splicing network of the audio evaluation model to obtain target features;

inputting the target features into a feature processing network of the audio evaluation model to obtain a processing result; the processing result is used for representing the matching harmony degree between the target dry sound audio and the target accompaniment audio.

The multimedia component 103 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 102 or transmitted through the communication component 105. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 104 provides an interface between the processor 101 and other interface modules, such as a keyboard, a mouse, and buttons, which may be virtual buttons or physical buttons. The communication component 105 is used for wired or wireless communication between the electronic device 100 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 105 may include a Wi-Fi module, a Bluetooth module, and an NFC module.

The electronic Device 100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the model training method.

Of course, the structure of the electronic device 100 shown in fig. 1 does not constitute a limitation of the electronic device in the embodiment of the present application, and in practical applications, the electronic device 100 may include more or less components than those shown in fig. 1, or some components may be combined.

It is to be understood that, in the embodiment of the present application, the number of the electronic devices is not limited, and it may be that a plurality of electronic devices cooperate to perform the model training method, and/or the audio processing method. In a possible implementation manner, please refer to fig. 2, and fig. 2 is a schematic diagram of a hardware composition framework applicable to another model training method provided in the embodiment of the present application. As can be seen from fig. 2, the hardware composition framework may include: the first electronic device 11 and the second electronic device 12 are connected to each other through a network 13.

In the embodiment of the present application, the hardware structures of the first electronic device 11 and the second electronic device 12 may refer to the electronic device 100 in fig. 1. That is, it can be understood that there are two electronic devices 100 in the present embodiment, and the two devices perform data interaction. Further, in this embodiment of the application, the form of the network 13 is not limited, that is, the network 13 may be a wireless network (e.g., WIFI, bluetooth, etc.), or may be a wired network.

The first electronic device 11 and the second electronic device 12 may be the same type of electronic device, for example, both may be servers; or they may be different types of electronic devices, for example, the first electronic device 11 may be a smartphone or another smart terminal, and the second electronic device 12 may be a server. In one possible embodiment, a server with high computing power may be used as the second electronic device 12 to improve data processing efficiency and reliability, and thus the processing efficiency of the model training. Meanwhile, a smartphone with low cost and a wide application range is used as the first electronic device 11 to realize interaction between the second electronic device 12 and the user. The interaction process may be: the smartphone acquires the target dry sound audio or the training dry sound data and sends it to the server, the server performs model training or audio processing, and the server then sends the trained audio evaluation model or the processing result back to the smartphone.

Based on the above description, please refer to fig. 3, and fig. 3 is a schematic flowchart of a model training method according to an embodiment of the present application. The method in this embodiment comprises:

s101: training data is acquired.

The training data comprises training dry sound data and corresponding training accompaniment data; the two correspond to the same song and to the same time period. The dry sound refers to a sound without accompaniment, the training dry sound data refers to sound data used for training, and the training accompaniment data refers to accompaniment data matched with the training dry sound data. The specific form of the training data is not limited in this embodiment. For example, in one possible embodiment, the training data may be audio file data, such as the mp3 format; in another possible embodiment, the training data may be signal waveform data, i.e., a waveform that varies over time; in another possible embodiment, the training data may be time-frequency domain feature data, for example in the form of a mel spectrogram. According to the input data format of the audio evaluation model obtained after training, training data in a corresponding format can be adaptively selected.
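Where the mel-spectrogram (time-frequency) format is chosen, the preparation of one training pair might look like the following sketch. It is illustrative only: librosa, the 44.1 kHz sample rate, 80 mel bands, the log compression and the file-path arguments are all assumptions not fixed by this application.

```python
# Illustrative only: one way to turn a dry-sound/accompaniment pair into
# time-frequency (mel-spectrogram) training data.
import librosa
import numpy as np

def load_mel_pair(dry_path: str, acc_path: str, sr: int = 44100, n_mels: int = 80):
    dry, _ = librosa.load(dry_path, sr=sr, mono=True)
    acc, _ = librosa.load(acc_path, sr=sr, mono=True)
    # Trim both signals to the same length so the pair stays time-aligned.
    n = min(len(dry), len(acc))
    dry, acc = dry[:n], acc[:n]
    dry_mel = librosa.feature.melspectrogram(y=dry, sr=sr, n_mels=n_mels)
    acc_mel = librosa.feature.melspectrogram(y=acc, sr=sr, n_mels=n_mels)
    # Log compression is a common (assumed) normalization for mel features.
    return np.log1p(dry_mel), np.log1p(acc_mel)
```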

It is to be understood that there are usually multiple pieces of training data, and the content, style and the like of each piece are not limited. Specifically, training data may be generated using songs of various styles so that the audio evaluation model can accurately evaluate songs of various types. For example, the training data may include 75% popular music, 15% drama, 5% country music and 5% other genres of music. In addition, the training dry sound data and the training accompaniment data in the training data correspond to each other in time, and their lengths can be set as needed. Because the singing style may change across different time periods of the same song, the way in which the dry sound and the accompaniment match and harmonize may also change. Therefore, in order to improve the recognition accuracy of the model, the lengths of the training dry sound data and the training accompaniment data may be short (e.g., less than 5 seconds) so that more accurate features can be extracted.

It should be noted that the training data may be generated locally or may be obtained externally. For example, in one embodiment, the designated song may be subjected to dry sound separation (or called sound source separation), resulting in training dry sound data and training accompaniment data; in another embodiment, a plurality of training dry sound data and a plurality of training accompaniment data may be acquired, and the two types of data are in one-to-one correspondence according to the acquired correspondence data to obtain training data.

S102: and inputting the training dry sound data into a first feature extraction network of the initial model to obtain the training dry sound features.

The initial model refers to the audio evaluation model that has not yet been trained; it is determined as the audio evaluation model after it, or its training process, meets the training completion condition. The initial model comprises a first feature extraction network, a second feature extraction network and a feature processing network, where the first feature extraction network is a network for extracting dry sound features, the second feature extraction network is a network for extracting accompaniment features, and the feature processing network is a network for processing the dry sound features and the accompaniment features to obtain a processing result. It should be noted that this embodiment does not limit the specific structures of the first feature extraction network, the second feature extraction network, and the feature processing network; the structures may be set as needed.
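For concreteness, the parts described above can be sketched in PyTorch roughly as follows. The layer types, sizes and the sigmoid score head are assumptions; the application deliberately leaves the concrete network structures open, so this is only one possible shape of the initial model (the splicing step appears here as a simple concatenation inside forward).

```python
# A minimal PyTorch sketch of the initial model: two feature extraction
# networks, a splicing (concatenation) step, and a feature processing network.
import torch
import torch.nn as nn

def make_extractor(n_mels: int = 80, feat_dim: int = 128) -> nn.Module:
    return nn.Sequential(
        nn.Conv1d(n_mels, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),            # (B, 64)
        nn.Linear(64, feat_dim), nn.ReLU(),
    )

class InitialModel(nn.Module):
    def __init__(self, n_mels: int = 80, feat_dim: int = 128):
        super().__init__()
        self.first_extractor = make_extractor(n_mels, feat_dim)   # dry sound branch
        self.second_extractor = make_extractor(n_mels, feat_dim)  # accompaniment branch
        self.feature_processing = nn.Sequential(                  # FC head
            nn.Linear(2 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),                       # score in [0, 1]
        )

    def forward(self, dry_mel: torch.Tensor, acc_mel: torch.Tensor) -> torch.Tensor:
        dry_feat = self.first_extractor(dry_mel)      # training dry sound features
        acc_feat = self.second_extractor(acc_mel)     # training accompaniment features
        fused = torch.cat([dry_feat, acc_feat], dim=1)   # splicing network (end-to-end)
        return self.feature_processing(fused).squeeze(1)  # training parameter
```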

After the training dry sound data are obtained, the training dry sound data are input into the first feature extraction network, and then the corresponding training dry sound features can be obtained. The generation process of the training dry sound feature can be different according to different structures of the first feature extraction network.

S103: and inputting the training accompaniment data into a second feature extraction network of the initial model to obtain the training accompaniment features.

Correspondingly, after the training accompaniment data are obtained, they are input into the second feature extraction network to obtain the corresponding training accompaniment features. The feature extraction network extracts features from the input data so as to express the characteristics of the input data with the output features and provide a data basis for the subsequent feature processing network.

It should be noted that the execution order of step S102 and step S103 is not limited in the embodiment of the present application. It can be understood that the first feature extraction network and the second feature extraction network extract different features and operate independently of each other, so step S102 and step S103 can be executed simultaneously. In another embodiment, the two steps may be executed sequentially under the influence of factors such as the model structure (for example, there is only one feature extraction network, whose identity differs according to the type of input data), and the order in which they are executed is not limited.

S104: and inputting the training dry sound characteristic and the training accompaniment characteristic into a splicing network of the initial model to obtain the characteristic to be processed.

The splicing network is a network that splices the input features into one feature according to a certain rule; for example, the training dry sound features and the training accompaniment features can be spliced end to end, or the two features can be interleaved, to obtain the features to be processed.
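The two splicing rules mentioned above can be sketched as follows (the batch size and feature dimension are arbitrary illustrative values):

```python
# Two possible splicing rules, sketched with PyTorch tensors.
import torch

dry_feat = torch.randn(8, 128)   # (batch, feature)
acc_feat = torch.randn(8, 128)

# End-to-end splicing: simple concatenation along the feature dimension.
end_to_end = torch.cat([dry_feat, acc_feat], dim=1)                    # (8, 256)

# Interleaved splicing: stack then flatten so the elements alternate.
interleaved = torch.stack([dry_feat, acc_feat], dim=2).reshape(8, -1)  # (8, 256)
```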

S104: inputting the characteristics to be processed into the characteristic processing network of the initial model to obtain training parameters.

The feature processing network is a network for determining the matching harmony degree of the stem sound and the accompaniment according to the features. Therefore, after the features to be processed are obtained, the features to be processed are input into the feature processing network, and the training dry sound features represent the features of the training dry sound data, while the training accompaniment features represent the features of the training accompaniment data, so that the feature processing network can start from the features of the two data to detect whether the two data are matched and harmonious or how much the two are matched and harmonious, and characterize the result of the detection process in the form of the training parameters. It will be appreciated that the specific form of the training parameters may be set as desired, for example as a percentage score.

S105: and determining a loss value by using the training parameters and the training labels of the training data, and performing parameter adjustment on the initial model by using the loss value.

The training labels of the training data are labels capable of reflecting the real matching degree between the training dry sound data and the training accompaniment data; they are usually obtained by manual annotation or can be generated by an annotation network. It should be noted that the matching degree between the dry sound and the accompaniment can be evaluated from multiple music theory angles, such as interval consistency, rhythm matching degree, intonation harmony, timbre harmony, dynamic consistency, and the like, so the training label can reflect the matching degree between the training dry sound data and the training accompaniment data from multiple angles. Determining the loss value from the training parameters and the training labels makes it possible to measure the distance between the current result obtained by the initial model and the real result; the parameters of the initial model are then adjusted according to this distance so that the initial model approaches the real result and thus gains the ability to accurately evaluate the harmony matching degree of the dry sound and the accompaniment. This embodiment does not limit the form and type of the loss value; it may be based, for example, on the Pearson correlation coefficient. The performance of the model is improved through multiple rounds of iterative training.
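As one hedged example of the loss computation, a batch-level loss built from the Pearson correlation coefficient mentioned above might look like this; the exact 1 − r form and the epsilon term are assumptions rather than requirements of the embodiment.

```python
# Sketch of a Pearson-correlation-based loss over a batch of training
# parameters (pred) and training labels (target).
import torch

def pearson_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    pred_c = pred - pred.mean()
    target_c = target - target.mean()
    r = (pred_c * target_c).sum() / (pred_c.norm() * target_c.norm() + eps)
    return 1.0 - r   # 0 when predictions and labels are perfectly correlated
```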

S106: and if the condition that the training is completed is detected to be met, determining the adjusted model as an audio evaluation model.

The training completion condition is a condition characterizing that the initial model can be determined as the audio evaluation model; it may constrain the initial model itself or the training process of the initial model. When the initial model itself meets the training completion condition (for example, its accuracy reaches a threshold), or the training process meets the training completion condition (for example, the number of training rounds or the training duration reaches a threshold), the adjusted model can be determined as the audio evaluation model. Specifically, the adjusted current initial model may be determined directly as the audio evaluation model, or the initial model may first be processed, for example by removing the network layer group used for generating the loss value, to obtain the audio evaluation model.
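Putting S101 to S107 together, a condensed training-loop sketch with a simple training completion condition could look like the following; the Adam optimizer, the mean-squared-error placeholder loss (the Pearson-based loss sketched earlier could be substituted), and the epoch/threshold values are all assumptions.

```python
# Condensed training loop: forward both inputs, compute the loss from the
# training parameters and labels, adjust parameters, stop when the (assumed)
# training completion condition is met.
import torch
import torch.nn.functional as F

def train(model, loader, max_epochs: int = 50, loss_threshold: float = 0.05):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for dry_mel, acc_mel, label in loader:
            pred = model(dry_mel, acc_mel)        # training parameters
            loss = F.mse_loss(pred, label)        # placeholder loss on parameters vs. labels
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # Training completion condition (assumed): mean epoch loss below a threshold,
        # or the maximum number of epochs reached.
        if epoch_loss / max(len(loader), 1) < loss_threshold:
            break
    return model  # the adjusted model is determined as the audio evaluation model
```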

By applying the model training method provided by the embodiment of the application, the initial model is trained with the training data to obtain the audio evaluation model. The training data is constructed in pairs, comprising training dry sound data and training accompaniment data. The initial model is provided with a first feature extraction network and a second feature extraction network, which are respectively used for extracting features from the training dry sound data and the training accompaniment data to obtain the training dry sound features and the training accompaniment features. The training dry sound features and the training accompaniment features are jointly input into the feature processing network, which can comprehensively consider the matching harmony degree between them and obtain training parameters capable of reflecting this matching harmony degree. The training labels express the harmony degree of the dry sound and the accompaniment; determining a loss value from the training parameters and the training labels makes it possible to measure the gap between the evaluation result obtained by the initial model's current evaluation manner and the real result, and the loss value is then used to adjust the parameters of the initial model, improving its evaluation manner so that the harmony degree between the dry sound and the accompaniment can be evaluated more accurately. Once the training completion condition is met, it can be determined that the initial model is able to accurately evaluate the harmony degree of the dry sound and the accompaniment, and the initial model is then determined as the audio evaluation model. In application, the target dry sound audio sung by the user and the target accompaniment audio corresponding to the song are respectively input into the first feature extraction network and the second feature extraction network, and a processing result reflecting the harmony degree of the target dry sound audio and the target accompaniment audio is obtained. Through this training manner, an audio evaluation model capable of evaluating the matching degree between the user's dry sound and the song accompaniment is obtained, richer evaluation modes can be provided, and the evaluation is carried out from multiple angles of music theory, so that the processing parameters are credible and reliable.

Based on the above embodiments, the present embodiment specifically describes some steps in the above embodiments. In one embodiment, in order to obtain an audio evaluation model with higher accuracy, it is necessary to generate a loss value by using a training label with higher accuracy so as to perform parameter adjustment. Therefore, the generation process of the training label comprises the following steps:

step 11: and outputting training audio corresponding to the training data.

Step 12: and acquiring a plurality of groups of label data corresponding to the training audio.

Step 13: and generating initial training labels by utilizing the plurality of training sub-labels of each group, and generating training labels by utilizing the plurality of initial training labels.

When training labels need to be obtained, the training audio corresponding to the training data can be output so that annotators can determine the label data from it, where the training audio is the song audio formed by the training dry sound data and the training accompaniment data. It should be noted that each group of label data includes a plurality of training sub-labels, and different training sub-labels correspond to different evaluation angles of the matching between singing voice and accompaniment (e.g., interval consistency, rhythm matching degree, intonation harmony, timbre harmony, dynamic consistency, etc.). After listening to the training audio, an annotator evaluates it from these evaluation angles by inputting the corresponding training sub-labels. In this embodiment there may be several annotators, so for one training audio several corresponding groups of label data may be acquired.

After the plurality of groups of label data are obtained, a plurality of initial training labels can be generated from them, and the initial training labels are further used to generate the training label. This embodiment does not limit the specific generation manner of the initial training labels and the training label; it may be, for example, averaging or weighted averaging.
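A minimal sketch of this aggregation, assuming plain averaging (the embodiment equally allows weighted averaging) and five illustrative evaluation angles:

```python
# Each annotator provides a group of sub-labels (one per evaluation angle);
# each group is averaged into an initial training label, and the initial
# labels are averaged into the final training label.
import numpy as np

# Rows: annotators (groups of label data); columns: evaluation angles, e.g.
# interval consistency, rhythm match, intonation, timbre harmony, dynamics.
label_data = np.array([
    [0.8, 0.7, 0.9, 0.6, 0.8],
    [0.7, 0.8, 0.8, 0.7, 0.9],
    [0.9, 0.6, 0.9, 0.8, 0.7],
])

initial_labels = label_data.mean(axis=1)   # one initial training label per group
training_label = initial_labels.mean()     # final training label for this clip
```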

Based on the above embodiments, in an implementation, the initial model may be a twin network, in which case, the process of performing parameter adjustment on the initial model by using the loss value may include:

step 21: and carrying out parameter adjustment on the first feature extraction network by using the loss value.

Step 22: and utilizing the adjusted first feature extraction network parameters to perform parameter replacement on the second feature extraction network.

Step 23: and utilizing the loss value to adjust parameters of the feature processing network.

The twin network is a twin neural network (also known as a Siamese neural network), a coupled framework built on two artificial neural networks. The twin neural network takes two samples as input and outputs their embeddings in a high-dimensional space so as to compare the degree of similarity between the two samples. Usually, the twin neural network is formed by two structurally identical neural networks that share weights. Therefore, when parameters are adjusted, the loss value is used to adjust the parameters of the first feature extraction network, and after the adjustment is completed, weight sharing is performed on the second feature extraction network according to the first feature extraction network. Weight sharing means that the parameters of the second feature extraction network are replaced by the parameters of the first feature extraction network, that is, by the adjusted first feature extraction network parameters. In addition, the loss value is also used to adjust the parameters of the feature processing network; that is, the parameter adjustment process covers all adjustable parameters of the network. It can be understood that in this case only one feature extraction network may actually be included in the initial model: it serves as the first feature extraction network when the input is dry sound data and as the second feature extraction network otherwise. It should be noted that in this embodiment the roles of the first and second feature extraction networks may be swapped, that is, the second feature extraction network is parameter-adjusted and the first feature extraction network receives the shared weights.
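A sketch of this weight-sharing update, assuming the two branches have identical structure as in the InitialModel sketch above (the attribute names first_extractor and second_extractor come from that sketch, not from the application):

```python
# Copy the adjusted first-branch parameters into the second branch so that the
# two feature extraction networks keep identical weights.
import torch

def share_weights(model) -> None:
    model.second_extractor.load_state_dict(model.first_extractor.state_dict())

# One possible schedule: exclude second_extractor from the optimizer and call
# share_weights(model) after every optimizer.step().
```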

Referring to fig. 4, fig. 4 is a schematic structural diagram of a specific audio evaluation model according to an embodiment of the present application. In the application process, a target dry sound and a target accompaniment are input respectively; the two branches share weights (that is, there are weight-sharing channels between corresponding layers) and are respectively used for extracting features from the target dry sound and the target accompaniment. The extracted features are then input into the feature processing network to obtain the final result. In this embodiment, the feature processing network includes a network layer group composed of a concat network layer (feature merging network layer) and an FC network layer (fully connected layer).

In another embodiment, the initial model may be a pseudo-twin network. In this case, the process of performing parameter adjustment on the initial model by using the loss value may include:

step 31: and respectively carrying out parameter adjustment on the first characteristic extraction network and the second characteristic extraction network by using the loss value.

Step 32: and utilizing the loss value to adjust parameters of the feature processing network.

A pseudo-twin network (pseudo-Siamese network) also has two branches, but each branch has its own weights (parameters). In this case, the first feature extraction network and the second feature extraction network need to be parameter-adjusted separately using the loss value, and the initial model must include two feature extraction networks.

Referring to fig. 5, fig. 5 is a schematic structural diagram of another specific audio evaluation model according to an embodiment of the present application. In the application process, the two branches respectively extract the characteristics of the target dry sound and the target accompaniment.

In another embodiment, the initial model may be a semi-twin network. In this case, the initial model is parametrically adjusted by a loss value, including:

step 41: and carrying out parameter adjustment on the first feature extraction network by using the loss value.

Step 42: and utilizing the adjusted first characteristic to extract a plurality of first network layer parameters in the network, and carrying out parameter replacement on a plurality of corresponding second network layers in the second characteristic extraction network.

Step 43: and carrying out parameter adjustment on a non-second network layer in the second feature extraction network by using the loss value.

Step 44: and utilizing the loss value to adjust parameters of the feature processing network.

A semi-twin network means that in the two feature extraction branches of the initial model, the earlier network layers share weights while the later network layers do not. Therefore, in this case, after the first feature extraction network is parameter-adjusted with the loss value, the plurality of first network layers are used to share weights with the corresponding second network layers in the second feature extraction network, so those second network layers do not need to be adjusted with the loss value. The loss value may be used to adjust the parameters of the non-second network layers in the second feature extraction network simultaneously with the weight sharing, or before or after it. In this embodiment, the roles of the first and second feature extraction networks may likewise be swapped, that is, the second feature extraction network is parameter-adjusted and the first feature extraction network receives the shared weights.
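A sketch of the partial weight sharing, assuming both feature extraction networks are nn.Sequential modules with an identical front section; the choice of four shared layers follows the Fig. 6 example and is otherwise arbitrary:

```python
import torch.nn as nn

def share_front_layers(first_net: nn.Sequential, second_net: nn.Sequential,
                       num_shared_layers: int = 4) -> None:
    # Replace the parameters of the first few layers of the second feature
    # extraction network with those of the (already adjusted) first one.
    for i in range(num_shared_layers):
        second_net[i].load_state_dict(first_net[i].state_dict())
    # The remaining (non-second) layers of second_net keep their own weights
    # and are adjusted directly by the loss value.
```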

Referring to fig. 6, fig. 6 is a schematic structural diagram of another specific audio evaluation model according to an embodiment of the present application. It can be seen from the figure that the first four network layers of the two feature extraction networks are shared in weight, and the rest network layers are not shared.

In another embodiment, the initial model may be a varying twin network. In this case, the initial model is parametrically adjusted by a loss value, including:

step 51: and carrying out parameter adjustment on the first feature extraction network by using the loss value.

Step 52: and utilizing the adjusted first feature extraction network parameters to carry out parameter replacement on the first branch of the second feature extraction network.

Step 53: and performing parameter adjustment on a second branch of the second feature extraction network by using the loss value or the first feature extraction network.

Step 54: and utilizing the loss value to adjust parameters of the feature processing network.

The varying twin network is a combination of the pseudo-twin network and the semi-twin network. Specifically, the second feature extraction network has two branch structures: one branch is exactly the same as the first feature extraction network, and the two share weights during training; the other branch may or may not have the same structure as the first feature extraction network. If it is different, its parameters must be adjusted independently using the loss value; if it is the same, its parameters can be replaced based on the first feature extraction network.
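One possible (assumed) shape of such a two-branch second feature extraction network is sketched below; how the two branch outputs are combined is not specified by the application, so the concatenation here is an assumption:

```python
import torch
import torch.nn as nn

class VaryingSecondExtractor(nn.Module):
    """Second feature extraction network with two branches (cf. Fig. 7)."""

    def __init__(self, shared_branch: nn.Module, independent_branch: nn.Module):
        super().__init__()
        # Same structure as the first feature extraction network; during training
        # its parameters are replaced with the adjusted first-network parameters.
        self.shared_branch = shared_branch
        # Own structure and weights, adjusted directly by the loss value (or, if
        # its structure matches, optionally replaced from the first network too).
        self.independent_branch = independent_branch

    def forward(self, acc_mel: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.shared_branch(acc_mel),
                          self.independent_branch(acc_mel)], dim=1)
```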

Referring to fig. 7, fig. 7 is a schematic structural diagram of another specific audio evaluation model according to an embodiment of the present application, which shows a case where the second branch and the first feature extraction network do not share a weight.

Based on the above embodiments, after the model training is finished, the dry sound sung by the user can be evaluated with the above method to judge whether it matches the corresponding accompaniment. Specifically, the method can comprise the following steps:

step 61: and acquiring target dry sound audio and corresponding target accompaniment audio.

Step 62: and inputting the target dry sound audio into a first feature extraction network of the audio evaluation model to obtain the target dry sound feature.

Step 63: And inputting the target accompaniment audio into a second feature extraction network of the audio evaluation model to obtain the target accompaniment features.

Step 64: and inputting the target dry sound characteristic and the target accompaniment characteristic into a splicing network of the audio evaluation model to obtain the target characteristic.

Step 65: and inputting the target characteristics into a characteristic processing network of the initial model to obtain a processing result. The target accompanying audio is obtained based on the model training method. The target dry sound audio refers to the dry sound audio obtained based on singing of the user, and the target accompaniment audio refers to the accompaniment audio matched with the target dry sound audio. After the target stem sound audio and the target accompaniment audio are input into the corresponding feature extraction network, the target stem sound feature and the target accompaniment feature are obtained and are further spliced to obtain the target feature, the target feature is input into the feature processing network to be processed, the audio evaluation model can output a corresponding processing result, the processing result refers to a result capable of evaluating the harmonious matching degree between the target stem sound audio and the target accompaniment audio, and namely the processing result is used for representing the matching harmonious degree between the target stem sound audio and the target accompaniment audio.

In practical applications, a user usually sings a complete song continuously and expects an evaluation of the whole song, whereas to improve the model's accuracy the target dry sound audio and the target accompaniment audio are usually kept short. In this case, acquiring the target dry sound audio and the corresponding target accompaniment audio includes:

step 71: and acquiring initial dry sound audio and corresponding initial accompaniment audio.

Step 72: and identifying and removing the mute blank part in the initial dry audio to obtain the intermediate dry audio.

Step 73: and removing redundant parts in the initial accompaniment audio to obtain an intermediate accompaniment audio.

Step 74: and performing sliding window segmentation processing with the same parameters on the intermediate dry sound audio and the intermediate accompaniment audio to obtain a plurality of target dry sound audios corresponding to the intermediate dry sound audio and a plurality of target accompaniment audios corresponding to the intermediate accompaniment audio.

The initial dry sound audio refers to the complete dry sound audio sung by the user, which usually corresponds to a complete song or a longer song segment (whose length exceeds the window length), and the initial accompaniment audio is the accompaniment audio corresponding to the initial dry sound audio. The specific manner of acquiring the initial dry sound audio and the initial accompaniment audio is not limited in this embodiment. Referring to fig. 8, fig. 8 is a specific audio waveform diagram according to an embodiment of the present application, in which the two tracks record the initial dry sound audio and the initial accompaniment audio respectively.

Since not every moment in a song needs to be sung, there are intervals in which the user simply waits, so the initial dry sound audio contains blank, i.e. silent, portions. In these silent blank portions there is no dry sound to match against the accompaniment, and evaluating them cannot reflect the user's singing level. The silent blank portions in the initial dry sound audio are therefore identified and removed, avoiding interference with the accuracy of the processing result and yielding the intermediate dry sound audio.
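As one possible implementation (an assumption, since this disclosure does not mandate a specific detection method), the silent blank portions can be located with an energy-based split such as librosa's, keeping the voiced intervals so the accompaniment can later be trimmed on the same time axis:

import numpy as np
import librosa

def remove_silence(dry_audio, top_db=40):
    # Non-silent intervals, in samples; top_db is an illustrative threshold.
    intervals = librosa.effects.split(dry_audio, top_db=top_db)
    intermediate_dry = np.concatenate([dry_audio[s:e] for s, e in intervals])
    return intermediate_dry, intervals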

Because matching-degree detection is only meaningful for the dry sound and the accompaniment at the same moment, after the silent blank portions are removed, the redundant portions of the initial accompaniment audio are removed as well to obtain the intermediate accompaniment audio. The redundant portions correspond to the silent blank portions on the time axis. After the intermediate accompaniment audio is obtained, the intermediate dry sound audio and the intermediate accompaniment audio are segmented with sliding windows to obtain a plurality of target dry sound audios and a plurality of target accompaniment audios. The segmentation parameters include the window length and the sliding step: the window length is the length of each target dry sound audio and target accompaniment audio, for example 5 seconds; the sliding step is the distance the window moves each time, usually expressed as a duration, for example 2 seconds.
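A minimal sketch of this sliding-window segmentation, using the 5-second window and 2-second step given above as example values only:

def slide_segments(audio, sr, window_s=5.0, step_s=2.0):
    window, step = int(window_s * sr), int(step_s * sr)
    return [audio[i:i + window] for i in range(0, len(audio) - window + 1, step)]

# The intermediate dry sound audio and the intermediate accompaniment audio are
# segmented with identical parameters, so segment k of one always aligns in time
# with segment k of the other.
# dry_segments = slide_segments(intermediate_dry, sr)
# acc_segments = slide_segments(intermediate_acc, sr)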

Referring to fig. 9, fig. 9 is a flowchart of data processing according to an embodiment of the present disclosure. The silent blank portions can be detected, for example, by voice activity detection. After the target dry sound audios and target accompaniment audios are obtained by segmentation, they can be input into the audio evaluation model. In one embodiment, the audio evaluation model itself performs downsampling, framing and windowing, Fourier transformation, Mel filtering and so on on the audio signal to obtain a Mel spectrum. In another embodiment, the Mel spectrum may be computed externally and fed to the audio evaluation model as input data. Convolution, pooling and similar operations are then applied to the Mel spectrum to obtain the corresponding features, namely the target dry sound feature and the target accompaniment feature, both of which can be represented as feature maps. After the target dry sound feature and the target accompaniment feature are obtained, they are combined and processed by several fully connected layers to obtain the corresponding processing result.
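The front-end chain described above (downsampling, framing and windowing, Fourier transformation, Mel filtering) could be realized, for example, with librosa; the 16 kHz sample rate, FFT size and 80 Mel bands below are illustrative assumptions, not values fixed by this disclosure:

import librosa

def mel_spectrum(segment, sr, target_sr=16000, n_mels=80):
    y = librosa.resample(segment, orig_sr=sr, target_sr=target_sr)   # downsampling
    mel = librosa.feature.melspectrogram(
        y=y, sr=target_sr, n_fft=1024, hop_length=256, n_mels=n_mels  # framing, windowing, FFT, Mel filtering
    )
    return librosa.power_to_db(mel)   # log-Mel spectrum fed to the feature extraction network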

Further, in one embodiment, the process of obtaining the target dry sound audio and the corresponding target accompaniment audio may include:

step 81: and acquiring initial dry sound audio and corresponding initial accompaniment audio.

Step 82: and carrying out segmentation processing in the same form on the initial dry sound audio and the initial accompaniment audio to obtain a plurality of target dry sound audio and corresponding target accompaniment audio.

The audio processing method further comprises the following steps:

step 83: and acquiring a processing result corresponding to each target dry sound.

Step 84: and generating an evaluation result corresponding to the initial dry sound by using all the processing results.

For example, the segmentation processing manner of steps 81 to 82 may specifically adopt the sliding window segmentation process described in steps 71 to 74.

In this embodiment, the initial dry sound audio and the initial accompaniment audio may be acquired in two ways. In the first way, the input audio is used directly as the initial dry sound audio, for example audio captured from the user with a microphone component, and the initial accompaniment audio is selected from a plurality of preset accompaniment audios according to the input audio information. That is, the user indicates the song being sung through the input audio information, the input audio itself provides the initial dry sound audio, and the initial accompaniment audio is obtained from the preset accompaniment audios.

In the second way, in order to avoid the storage space occupied by preset accompaniment audios and to avoid invalid processing results caused by a mismatch between the audio information and the input audio, input audio in which the dry sound and the accompaniment are mixed together can be acquired directly. Sound source separation is then performed on the input audio to distinguish the dry sound from the accompaniment, yielding the initial dry sound audio and the initial accompaniment audio. Referring to fig. 10, fig. 10 is a schematic view of an audio processing flow provided by an embodiment of the present application, in which the initial dry sound audio and the initial accompaniment audio are obtained by sound source separation.
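This disclosure does not name a particular separation algorithm; as one hedged example, an off-the-shelf two-stem separator such as Spleeter can split the mixed input audio into vocals (used as the initial dry sound audio) and accompaniment (used as the initial accompaniment audio):

from spleeter.separator import Separator

separator = Separator('spleeter:2stems')            # pretrained vocals + accompaniment model
separator.separate_to_file('mixed_input.wav', 'separated/')
# Expected outputs: separated/mixed_input/vocals.wav and .../accompaniment.wav
# (exact paths depend on the Spleeter version and the input file name).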

After all the target dry sound audios and target accompaniment audios are processed, a plurality of corresponding processing results are obtained. Each processing result evaluates the user's singing level over one time window, so all the processing results together can be used to generate the evaluation result corresponding to the initial dry sound audio, which comprehensively reflects the user's average singing level over the whole song. Referring to fig. 11, fig. 11 is a flowchart illustrating a specific audio evaluation result generation process provided in an embodiment of the present application, namely scoring a user's singing in a karaoke scene. The user's voice is captured with an audio acquisition device such as a microphone and segmented into dry sound segments 1 to N, and the accompaniment corresponding to the song being sung is segmented in the same way into accompaniment segments 1 to N. The song evaluation model in the figure is the audio evaluation model; after the score (i.e., processing result) of each segment is obtained, the scores of segments 1 to N are combined to obtain the score (i.e., evaluation result) of the whole song. For example, the average score may be used as the whole-song score.
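Aggregating the per-segment scores into the whole-song score is straightforward; the text gives averaging as one example, sketched below, and other statistics such as the median would be applied the same way.

def whole_song_score(segment_scores):
    # Average of the processing results of segments 1..N; 0.0 if nothing was scored.
    return sum(segment_scores) / len(segment_scores) if segment_scores else 0.0

# e.g. whole_song_score([82, 76, 90]) == 82.666...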

In the following, a computer-readable storage medium provided by an embodiment of the present application is introduced, and the computer-readable storage medium described below and the model training method described above may be referred to correspondingly.

The present application further provides a computer-readable storage medium having a computer program stored thereon, which, when being executed by a processor, implements the steps of the above-mentioned model training method.

The computer-readable storage medium may include: various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between these entities or actions. Also, the terms 'comprise', 'include', or any variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus.

The principle and the implementation of the present application are explained herein by applying specific examples, and the above description of the embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
