Multi-singer singing voice synthesis method and system based on a generative adversarial network

Document No.: 193371  Publication date: 2021-11-02

Reading note: This technology, "A multi-singer singing voice synthesis method and system based on a generative adversarial network", was designed and created by Zhao Zhou, Li Ruiqi and Huang Rongjie on 2021-07-29. Its main content is as follows: The invention discloses a multi-singer singing voice synthesis method and system based on a generative adversarial network, belonging to the field of singing voice synthesis. The invention adopts a multi-band parallel high-fidelity waveform generator that captures information of different frequency bands with different sensitivities while keeping computation efficient. The invention also uses two discriminators with different objectives: the conditional discriminator receives the real singer's identity features together with the waveform and judges whether the generator has correctly reconstructed the singer's identity information (i.e., timbre and similar characteristics) in the waveform, while the unconditional discriminator only judges whether the waveform is generated or real. During training the generator tries to fool both discriminators, so that high-fidelity waveforms are generated quickly while the degradation observed on unseen singers is alleviated.

1. A multi-singer singing voice synthesis method based on a generative adversarial network, characterized by comprising the following steps:

1) acquiring an aligned multi-singer singing voice training sample set, wherein each sample consists of source singing voice audio, aligned lyric text and singer identity information;

2) establishing a multi-singer singing voice generative adversarial network comprising a multi-band waveform generator, a singer identity feature extraction network, a singer conditional discriminator and an unconditional discriminator;

taking the Mel spectrum corresponding to the source singing voice audio as input to the multi-band waveform generator, together with synthetic noise drawn from a Gaussian distribution input in parallel; the multi-band waveform generator generates synthesized waveforms for four different frequency bands, which are processed by a pseudo-quadrature mirror filter bank to obtain the synthesized waveform output;

inputting the real waveform corresponding to the source singing voice audio and the synthesized waveform into the discriminators in proportion; for the singer conditional discriminator, the real or synthesized waveform is first encoded, the singer identity feature sequence is then added to the encoded sequence, and the probability that the singer identity information has been reconstructed is output; for the unconditional discriminator, the real or synthesized waveform is taken as input and the probability that the waveform is a synthesized waveform is output;

training the multi-singer singing voice generative adversarial network on the aligned singing voice training sample set of step 1), according to the loss of the multi-band waveform generator, the loss of the singer conditional discriminator and the loss of the unconditional discriminator;

3) dividing the source singing voice audio to be synthesized into segments in the same way as the training samples, using the Mel spectrum and noise of each segment as input to the multi-band waveform generator to output a synthesized waveform, concatenating the synthesized waveforms of all segments to obtain the final synthesized waveform, and converting it into the audio output.

2. The multi-singer singing voice synthesis method based on a generative adversarial network as claimed in claim 1, wherein step 1) specifically comprises:

1.1) audio pre-processing: for the multi-singer audio files, removing silent segments longer than 100 milliseconds using voice activity detection;

1.2) dividing the preprocessed audio into sample segments of 1-11 seconds, aligning them with the lyric text, and labeling the singer identity information of each sample.

3. The multi-singer singing voice synthesis method based on a generative adversarial network as claimed in claim 1, wherein the multi-band waveform generator consists of a low-frequency adaptive waveform generator and a high-frequency adaptive waveform generator; the two waveform generators share the same main structure, each comprising an upsampling layer for the Mel spectrum, a 1-dimensional convolution layer for the synthetic noise, adaptive WaveNet neural network blocks, and two identical ReLU activation layers with 1x1 convolution layers; the number of adaptive WaveNet neural network blocks and the receptive field of their internal convolution layers are adapted to the different frequency bands, wherein: the low-frequency adaptive waveform generator has 16 convolutional layers with the dilation coefficient cycling every 8 layers and a convolution kernel size of 7; the high-frequency adaptive waveform generator has 15 convolutional layers with the dilation coefficient cycling every 5 layers and a convolution kernel size of 5;

the Mel spectrum corresponding to the source singing voice audio is processed by the upsampling layer and the Gaussian synthetic noise by the 1-dimensional convolution layer; the outputs of the upsampling layer and the 1-dimensional convolution layer serve as input to the WaveNet neural network blocks, whose output passes through the two 1x1 convolution layers and their corresponding ReLU activation layers, so that each generator outputs a two-channel high-frequency-band or low-frequency-band synthesized waveform; two low-frequency-band waveforms and two high-frequency-band waveforms are thus obtained from the low-frequency and high-frequency adaptive waveform generators.

4. The multi-singer singing voice synthesis method based on a generative adversarial network as claimed in claim 3, wherein the WaveNet neural network block comprises a dilated convolution layer for the noise input, a 1x1 convolution layer for the Mel spectrum input, a sigmoid-tanh activation layer that processes the four tensors obtained after splitting the two outputs, and finally two fully connected layers for the output; the two pieces of feature information output by a WaveNet neural network block serve as the input of the next WaveNet neural network block;

in the i-th WaveNet neural network block, the input noise processing result X_i and the Mel spectrum processing result H_i are used as inputs to the dilated convolution layer and the 1x1 convolution layer respectively, and their outputs are split into four different tensors xa_i, xb_i, sa_i, sb_i; xa_i and xb_i are connected and activated by the tanh function of the sigmoid-tanh activation layer, sa_i and sb_i are connected and activated by the sigmoid function of the sigmoid-tanh activation layer; the two activated tensors then pass through two parallel fully connected layers to output the noise processing result X_{i+1} and the Mel spectrum processing result H_{i+1}, which are fed into the (i+1)-th WaveNet neural network block for further processing.

5. The multi-singer singing voice synthesis method based on a generative adversarial network as claimed in claim 1, wherein when the Mel spectrum corresponding to the source singing voice audio is used as input to the multi-band waveform generator, it is zero-padded at the end of the time dimension so that all inputs have the same size.

6. The multi-singer singing voice synthesis method based on a generative adversarial network as claimed in claim 1, wherein the singer identity feature extraction network is used for encoding the singer identity information and consists of three long short-term memory (LSTM) layers, one fully connected layer, a ReLU activation layer and batch normalization; hidden-layer information is first extracted from the Mel spectrum by the LSTM layers and then mapped by the fully connected layer and the activation layer into a singer identity embedding, which serves as the encoded singer identity feature sequence used to compute the singer perceptual loss.

7. The multi-singer singing voice synthesis method based on a generative adversarial network as claimed in claim 1 or 6, wherein the singer conditional discriminator consists of one-dimensional convolution layers, a downsampling layer, a long short-term memory (LSTM) layer, a singer identity feature input layer, a fully connected layer and a ReLU activation layer; the real or synthesized waveform sequence is processed in turn by a one-dimensional convolution layer, the downsampling layer, another one-dimensional convolution layer and the LSTM layer to obtain an encoded waveform sequence; the corresponding singer identity feature sequence is input at the same time and added element-wise to the encoded waveform sequence; the probability that the singer identity information has been reconstructed is then output through the fully connected layer and the activation layer, and the loss of the singer conditional discriminator is computed.

8. The multi-singer singing voice synthesis method based on a generative adversarial network as claimed in claim 1, wherein the unconditional discriminator consists of 10 non-causal dilated convolution layers and a one-dimensional convolution layer; the dilation coefficients of the dilated convolution layers increase in turn; the features output by the 10 dilated convolution layers for the real or synthesized waveform sequence are mapped by the one-dimensional convolution layer to a probability value, giving the probability that the waveform is a synthesized waveform, from which the loss of the unconditional discriminator is computed.

9. The multi-singer singing voice synthesis method based on a generative adversarial network as claimed in claim 1, wherein the conditional discriminator loss and the unconditional discriminator loss are combined as the main loss value; the singer perceptual loss and the multi-resolution short-time Fourier transform loss are introduced as auxiliary loss values, the weighted combination of the main loss and the auxiliary loss is taken as the final loss, and the multi-singer singing voice generative adversarial network is trained jointly.

10. A multi-singer singing voice synthesis system based on a generative adversarial network, characterized by being used for realizing the multi-singer singing voice synthesis method of claim 1.

Technical Field

The invention relates to the technical field of singing voice synthesis, and in particular to a multi-singer singing voice synthesis method and system based on a generative adversarial network.

Background

High-fidelity multi-singer singing voice synthesis is a challenge in the field of neural network vocoders due to the shortage of data sets, limited generalization across singers, huge computational costs, and so on. Given a Mel spectrum of singing voice as input, the objective of the vocoder is to reconstruct the corresponding waveform while recovering as much as possible the singer characteristics hidden in the Mel spectrum.

Vocoder performance in singing voice synthesis has been improving in recent years, but synthesizing the singing voices of multiple singers remains difficult. In a multi-singer modeling scenario, the timbre, dynamics and tempo of different singers differ greatly. For unseen singers, i.e., singers who do not appear in the training set, many current vocoders show obvious degradation: their performance drops sharply and the quality of the generated audio deteriorates. Meanwhile, practical applications often require fast computation, whereas vocoder training and inference consume a great deal of computing resources; existing vocoders struggle to meet the requirement of fast singing voice synthesis, and current GAN-based vocoders require large amounts of data and computation.

In conclusion, existing high-fidelity vocoders cannot effectively solve these problems: they degrade significantly in multi-singer scenarios and their inference speed falls short, making them difficult to use in demanding applications.

Disclosure of Invention

The invention aims to solve the degradation of vocoders in multi-singer scenarios and the slow speed of singing voice synthesis in the prior art. Mainstream vocoders do not explicitly build a framework for reconstructing singer identity features, so the invention provides a fast high-fidelity multi-singer vocoder based on a generative adversarial network. First, exploiting the fact that different frequency bands of a waveform have different characteristics, the waveform is generated in parallel as four different frequency bands, and the output waveform is synthesized with a pseudo-quadrature mirror filter bank (PQMF), which achieves good parallelism and accelerates computation. In addition to the traditional unconditional discriminator, a singer conditional discriminator is added: the reference singer identity feature is introduced as conditioning information during discrimination to judge whether the generator has reasonably reconstructed the singer characteristics in the waveform, which effectively mitigates the degradation problem in multi-singer models and improves the waveform reconstruction performance of the vocoder in multi-singer scenarios.

To achieve this purpose, the invention adopts the following technical scheme:

One of the objectives of the present invention is to provide a multi-singer singing voice synthesis method based on a generative adversarial network, comprising the following steps:

1) acquiring an aligned multi-singer singing voice training sample set, wherein each sample consists of source singing voice audio, aligned lyric text and singer identity information;

2) establishing a multi-singer singing voice generative adversarial network comprising a multi-band waveform generator, a singer identity feature extraction network, a singer conditional discriminator and an unconditional discriminator;

taking the Mel spectrum corresponding to the source singing voice audio as input to the multi-band waveform generator, together with synthetic noise drawn from a Gaussian distribution input in parallel; the multi-band waveform generator generates synthesized waveforms for four different frequency bands, which are processed by a pseudo-quadrature mirror filter bank to obtain the synthesized waveform output;

inputting the real waveform corresponding to the source singing voice audio and the synthesized waveform into the discriminators in proportion; for the singer conditional discriminator, the real or synthesized waveform is first encoded, the singer identity feature sequence is then added to the encoded sequence, and the probability that the singer identity information has been reconstructed is output; for the unconditional discriminator, the real or synthesized waveform is taken as input and the probability that the waveform is a synthesized waveform is output;

training the multi-singer singing voice generative adversarial network on the aligned singing voice training sample set of step 1), according to the loss of the multi-band waveform generator, the loss of the singer conditional discriminator and the loss of the unconditional discriminator;

3) dividing the source singing voice audio to be synthesized into segments in the same way as the training samples, using the Mel spectrum and noise of each segment as input to the multi-band waveform generator to output a synthesized waveform, concatenating the synthesized waveforms of all segments to obtain the final synthesized waveform, and converting it into the audio output.

Further, the step 1) specifically comprises:

1.1) audio pre-processing: for the multi-singer audio files, removing silent segments longer than 100 milliseconds using voice activity detection;

1.2) dividing the preprocessed audio into sample segments of 1-11 seconds, aligning them with the lyric text, and labeling the singer identity information of each sample.

Furthermore, the multi-band waveform generator consists of a low-frequency adaptive waveform generator and a high-frequency adaptive waveform generator; the two waveform generators share the same structure, each comprising an upsampling layer for the Mel spectrum, a 1-dimensional convolution layer for the synthetic noise, adaptive WaveNet neural network blocks, and two identical ReLU activation layers with 1x1 convolution layers; the number of adaptive WaveNet neural network blocks and the receptive field of their internal convolution layers are adapted to the different frequency bands;

the Mel spectrum corresponding to the source singing voice audio is processed by the upsampling layer and the Gaussian synthetic noise by the 1-dimensional convolution layer; the outputs of both serve as input to the WaveNet neural network blocks, and each generator finally outputs two high-frequency-band or low-frequency-band synthesized waveforms; two low-frequency-band waveforms and two high-frequency-band waveforms are thus obtained from the low-frequency and high-frequency adaptive waveform generators.

Further, the WaveNet neural network block comprises a dilated convolution layer for the noise input, a 1x1 convolution layer for the Mel spectrum input, a sigmoid-tanh activation layer that processes the four tensors obtained after splitting the two outputs, and finally two fully connected layers for the output; the two pieces of feature information output by a WaveNet neural network block serve as the input of the next WaveNet neural network block;

in the i-th WaveNet neural network block, the input noise processing result X_i and the Mel spectrum processing result H_i are used as inputs to the dilated convolution layer and the 1x1 convolution layer respectively, and their outputs are split into four different tensors xa_i, xb_i, sa_i, sb_i; xa_i and xb_i are connected and activated by the tanh function of the sigmoid-tanh activation layer, sa_i and sb_i are connected and activated by the sigmoid function of the sigmoid-tanh activation layer; the two activated tensors then pass through two parallel fully connected layers to output the noise processing result X_{i+1} and the Mel spectrum processing result H_{i+1}, which are fed to the (i+1)-th WaveNet neural network block for further processing.

Further, when the Mel spectrum corresponding to the source singing voice audio is used as input to the multi-band waveform generator, it is zero-padded at the end of the time dimension so that all inputs have the same size.

Furthermore, the singer identity feature extraction network is used for encoding the singer identity information and consists of three long short-term memory (LSTM) layers, one fully connected layer, a ReLU activation layer and batch normalization; hidden-layer information is first extracted from the Mel spectrum by the LSTM layers and then mapped by the fully connected layer and the activation layer into a singer identity embedding; the encoded singer identity feature sequence is finally output and used to compute the singer perceptual loss.

Furthermore, the singer conditional discriminator consists of one-dimensional convolution layers, a downsampling layer, a long short-term memory (LSTM) layer, a singer identity feature input layer, a fully connected layer and a ReLU activation layer; the real or synthesized waveform sequence is processed in turn by a one-dimensional convolution layer, the downsampling layer, another one-dimensional convolution layer and the LSTM layer to obtain an encoded waveform sequence; the corresponding singer identity feature sequence is input at the same time and added element-wise to the encoded waveform sequence; the probability that the singer identity information has been reconstructed is then output through the fully connected layer and the activation layer, and the loss of the singer conditional discriminator is computed.

Furthermore, the unconditional discriminator consists of 10 non-causal dilated convolution layers and a one-dimensional convolution layer; the dilation coefficients of the dilated convolution layers increase in turn; the features output by the 10 dilated convolution layers for the real or synthesized waveform sequence are mapped by the one-dimensional convolution layer to a probability value, giving the probability that the waveform is a synthesized waveform, from which the loss of the unconditional discriminator is computed.

Further, the conditional discriminator loss and the unconditional discriminator loss are combined as the main loss value; the singer perceptual loss and the multi-resolution short-time Fourier transform loss are introduced as auxiliary loss values, the weighted combination of the main loss and the auxiliary loss is taken as the final loss, and the multi-singer singing voice generative adversarial network is trained jointly.

The second purpose of the present invention is to provide a multi-singer singing voice synthesis system based on a generative adversarial network, which is used for realizing the above multi-singer singing voice synthesis method.

Compared with the prior art, the invention effectively improves the performance of high-fidelity singing voice synthesis in multi-singer scenarios, embodied in the following aspects:

(1) Addressing the insufficient computing speed of prior-art high-fidelity waveform synthesis, the invention improves parallelism without losing detail in the output waveform, thereby reducing computation time.

Exploiting the fact that different frequency bands of a waveform have different characteristics, the invention adopts two frequency-adaptive waveform generators that generate the waveform as four different frequency bands in parallel, and then synthesizes the output waveform with a pseudo-quadrature mirror filter bank (PQMF). The two frequency-adaptive waveform generators have different sensitivities and focal points for different frequency bands, so more of the real sound information is captured and a high-fidelity waveform is output, while good parallelism is achieved and computation is accelerated.

(2) Addressing the degradation that easily arises in multi-singer scenarios in the prior art, the invention adds a singer conditional discriminator to supervise how the generator reconstructs singer characteristics in the generated waveform. The conditional discriminator introduces the identity features of the real singer (produced by a pre-trained identity feature extraction network) during discrimination, and judges on that basis whether the waveform generator has reasonably reconstructed the singer identity features in the waveform. Degradation when inference encounters Mel spectra of singers who do not appear in the training set is thus greatly alleviated, i.e., the unseen-singer degradation problem is addressed and the generalization ability of the model is improved.

(3) On top of the generative adversarial network, the invention also introduces two loss functions that help improve vocoder performance and stabilize adversarial training. First, to improve the quality of the generated waveform, the invention introduces a Singer Perceptual Loss that enables the generator to capture deviations between singers and to optimize the frequency-domain singer similarity between the real and synthesized waveforms. Second, to make adversarial training more stable, the invention introduces a Multi-resolution STFT Loss, the sum of short-time Fourier transform losses computed under a set of different resolutions and analysis parameters. These two auxiliary losses further improve the performance of the vocoder.

Drawings

FIG. 1 shows the structure of the multi-band high-fidelity vocoder proposed by the present invention;

FIG. 2 is a schematic diagram of a generator structure in an embodiment of the invention;

FIG. 3 is a schematic structural diagram of the singer conditional discriminator in an embodiment of the present invention;

FIG. 4 is a diagram illustrating an unconditional discriminator according to an embodiment of the present invention.

Detailed Description

The invention will be further elucidated and described with reference to the drawings and the detailed description.

The invention provides a multi-singer singing voice synthesis method based on a generative adversarial network, which mainly comprises the following contents: Step one, acquiring an aligned high-quality multi-singer singing voice training sample set.

Step two, establishing a multi-singer singing voice generative adversarial network, comprising a multi-band waveform generator (a multi-band high-fidelity vocoder), a singer identity feature extraction network, a singer conditional discriminator and an unconditional discriminator; training the established adversarial network on the aligned high-quality multi-singer singing voice training sample set, with the two discriminators introduced during training, to obtain the trained multi-band waveform generator, i.e., the trained multi-band high-fidelity vocoder.

Step three, taking the Mel spectrum containing the singing voice information as input and generating a high-quality audio waveform with the trained multi-band high-fidelity vocoder.

In one embodiment of the present invention, the implementation of step one is described.

The aligned high-quality singing voice training sample set is obtained by preprocessing the source singing voice audio files, the corresponding aligned lyric text and the singer identities, specifically: Voice Activity Detection (VAD) is used to remove silent segments (100 ms or more of continuous silence is treated as a silent segment) from the audio files, which greatly shortens them; lyrics-to-singing alignment is used to segment each whole song into aligned lyrics and audio, such that every processed phrase fragment is between 0 and 11 seconds long; the Montreal Forced Aligner (MFA) tool is used to obtain a word-level time-aligned audio-text data set. The phoneme labels in the MFA algorithm are obtained by aligning a manually labeled phoneme sequence of each song with a GMM-HMM (Gaussian mixture model - hidden Markov model) algorithm.
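For illustration only, the following Python sketch shows an energy-based stand-in for the silence-removal and segmentation part of this step. It assumes librosa is available; the file name, sample rate, and the use of an energy threshold in place of the VAD and lyrics-alignment tools named above are all hypothetical.

```python
import librosa
import numpy as np

def remove_long_silences(wav, sr, top_db=40, max_gap_ms=100):
    """Keep voiced intervals; drop silent gaps longer than max_gap_ms.
    Energy-based stand-in for the VAD step described above."""
    intervals = librosa.effects.split(wav, top_db=top_db)  # (start, end) of non-silent regions
    max_gap = int(sr * max_gap_ms / 1000)
    pieces, prev_end = [], 0
    for start, end in intervals:
        gap = start - prev_end
        if 0 < gap <= max_gap:
            pieces.append(wav[prev_end:start])   # keep short pauses for naturalness
        pieces.append(wav[start:end])
        prev_end = end
    return np.concatenate(pieces) if pieces else wav

def segment(wav, sr, max_len_s=11):
    """Cut the silence-trimmed audio into chunks no longer than 11 seconds."""
    step = int(sr * max_len_s)
    return [wav[i:i + step] for i in range(0, len(wav), step)]

wav, sr = librosa.load("song.wav", sr=24000)  # hypothetical file and sample rate
chunks = segment(remove_long_silences(wav, sr), sr)
```

In practice the patent segments by aligned lyrics rather than fixed-length windows; the sketch only illustrates the silence removal and length limit.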

After processing in this step, each segmented sample consists of source singing voice audio, aligned lyric text, and singer identity information.

In one embodiment of the present invention, the implementation of step two is described.

2.1) Establishing the multi-singer singing voice generative adversarial network model.

A network model consisting of a multi-band waveform generator, a singer identity feature extraction network, a singer conditional discriminator and an unconditional discriminator is established.

The multi-band waveform generator is composed of two frequency-adaptive waveform generators, one low-frequency and one high-frequency. As shown in FIG. 2, each contains an upsampling layer for the Mel spectrum, a 1-dimensional convolution layer for the synthetic noise, WaveNet neural network blocks whose number depends on the frequency band, and two identical ReLU activation layers with 1x1 convolution layers. Besides the number of WaveNet neural network blocks, the receptive field of the convolution layers inside the blocks is also adapted to the frequency band: the low-frequency generator has 16 convolutional layers with the dilation coefficient cycling every 8 layers and a convolution kernel size of 7; the high-frequency generator has 15 convolutional layers with the dilation coefficient cycling every 5 layers and a convolution kernel size of 5. Finally, each frequency-adaptive generator outputs a two-channel waveform through two 1x1 convolution layers and their corresponding ReLU activation layers, so the two generators together produce four channels, i.e., waveforms for four different frequency bands. The four waveforms generated by the two frequency-adaptive waveform generators are finally combined into one output waveform by the pseudo-quadrature mirror filter bank (PQMF) algorithm.
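A minimal PyTorch sketch of one frequency-adaptive generator follows. It reuses the WaveNetBlock sketched further below; the Mel dimension, channel width, the sub-band upsampling factor (Mel hop 256 / 4 sub-bands = 64) and the exponential dilation pattern 2^(i mod cycle) are assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class FrequencyAdaptiveGenerator(nn.Module):
    """One band generator: 16 blocks / kernel 7 / dilation cycle 8 for the low band,
    15 blocks / kernel 5 / dilation cycle 5 for the high band (per the text above)."""
    def __init__(self, mel_dim=80, channels=64, n_blocks=16, kernel=7,
                 dilation_cycle=8, upsample=64):
        super().__init__()
        # upsample the Mel spectrogram to the (assumed) sub-band sample rate
        self.upsample = nn.ConvTranspose1d(mel_dim, channels, kernel_size=2 * upsample,
                                           stride=upsample, padding=upsample // 2)
        # 1-D convolution over the Gaussian noise input
        self.noise_conv = nn.Conv1d(1, channels, kernel_size=1)
        self.blocks = nn.ModuleList([
            WaveNetBlock(channels, kernel, dilation=2 ** (i % dilation_cycle))
            for i in range(n_blocks)
        ])
        # two ReLU + 1x1 convolution stages; the final output has two sub-band channels
        self.out = nn.Sequential(
            nn.ReLU(), nn.Conv1d(channels, channels, 1),
            nn.ReLU(), nn.Conv1d(channels, 2, 1),
        )

    def forward(self, mel, noise):        # mel: (B, 80, frames), noise: (B, 1, frames*upsample)
        h = self.upsample(mel)            # conditioning path H
        x = self.noise_conv(noise)        # noise path X
        skips = 0
        for block in self.blocks:
            x, h = block(x, h)
            skips = skips + h             # skip-connect every H_i output (see below)
        return self.out(skips)            # (B, 2, samples): two sub-band waveforms
```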

In this embodiment, the WaveNet neural network block includes a dilated convolution layer for the noise input, a 1x1 convolution layer for the Mel spectrum input, a sigmoid-tanh activation layer that processes the four tensors obtained after splitting the two outputs, and two fully connected layers for the output. The two pieces of feature information output by one WaveNet neural network block serve as the input of the next WaveNet neural network block.

As shown in FIG. 3, the singer conditional discriminator is composed of one-dimensional convolution layers, a downsampling layer, long short-term memory (LSTM) layers, a singer identity feature input, a fully connected layer and a ReLU activation layer. The singer conditional discriminator extracts a hidden singer identity feature through the stacked LSTM layers and then adds the reference singer identity feature, thereby judging whether the generator has reasonably reconstructed the singer's characteristics in the waveform.

As shown in FIG. 4, the unconditional discriminator is composed of nine non-causal one-dimensional dilated convolution layers and a one-dimensional convolution output layer. The convolution stride of each dilated convolution layer is 1; the dilation coefficient of the first layer is 1 and the dilation coefficients of the second to ninth layers range from 1 to 8; the number of input channels is 1 for the first layer and 64 for the second to ninth layers, and the number of output channels is 64; the convolution kernel size is 3. After the nine dilated convolution layers, the extracted features are mapped to a probability value by the one-dimensional convolution layer with a single output channel, which is used to judge whether the waveform is a synthesized or an original waveform.

2.2) training the model.

The established adversarial network model is trained on the aligned high-quality multi-singer singing voice training sample set, with the two discriminators introduced for adversarial training. The training process also includes two loss functions that help improve vocoder performance and stabilize adversarial training.

a. Pre-training of the singer identity feature extraction network:

Both the singer conditional discriminator and the computation of the singer perceptual loss require a pre-trained singer identity feature extraction network. This network is a singer identity encoder composed of three long short-term memory (LSTM) layers, a fully connected layer, a ReLU activation layer and batch normalization. The singer identity contained in the Mel spectrum, a relatively stable characteristic of the input sequence, is extracted into the hidden layers of the LSTM network and mapped into a singer identity embedding by the fully connected layer and the activation layer. The encoder is trained with a generalized end-to-end loss, making the mapping from Mel spectrum space to singer identity space more effective.
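A minimal PyTorch sketch of such an encoder is given below. The hidden and embedding sizes of 256 are assumptions, and the LSTM layers are kept as separate modules only so that their hidden states can later be compared for the singer perceptual loss.

```python
import torch
import torch.nn as nn

class SingerEncoder(nn.Module):
    """Singer identity feature extractor: 3 LSTM layers -> FC -> ReLU -> BatchNorm,
    pre-trained separately (the text mentions a generalized end-to-end loss)."""
    def __init__(self, mel_dim=80, hidden=256, embed_dim=256):
        super().__init__()
        # three LSTM layers kept separate so per-layer hidden states can be read
        self.lstm_layers = nn.ModuleList([
            nn.LSTM(mel_dim if i == 0 else hidden, hidden, batch_first=True)
            for i in range(3)
        ])
        self.fc = nn.Linear(hidden, embed_dim)
        self.bn = nn.BatchNorm1d(embed_dim)

    def forward(self, mel):                     # mel: (B, frames, 80)
        x = mel
        for layer in self.lstm_layers:
            x, _ = layer(x)                     # hidden-layer information
        embed = torch.relu(self.fc(x[:, -1]))   # last frame mapped to identity embedding
        return self.bn(embed)                   # (B, embed_dim) singer identity feature
```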

b. Multi-band waveform generator training procedure:

In this embodiment, as shown in FIG. 1, an 80-band Mel spectrum is used as input to the multi-band generator, and a synthetic noise sequence drawn from a Gaussian distribution is input in parallel with it. After zero-padding, the input spectrum and the noise waveform are distributed in parallel to the two frequency-adaptive generators, which synthesize four waveforms of different frequency bands: two high-band waveforms and two low-band waveforms. Each generator first upsamples the Mel spectrum to produce H and applies a 1-dimensional convolution to the synthetic noise to produce X, and then feeds H and X into WaveNet neural network blocks whose depth depends on the frequency band, followed by two pairs of identical ReLU activation layers and 1x1 convolution layers. The low-frequency generator has 16 convolutional layers with the dilation coefficient cycling every 8 layers and a convolution kernel size of 7; the high-frequency generator has 15 convolutional layers with the dilation coefficient cycling every 5 layers and a convolution kernel size of 5. Finally, each frequency-adaptive generator outputs a two-channel waveform through the two 1x1 convolution layers and their corresponding ReLU activation layers, so the two generators together produce four channels, i.e., waveforms for four different frequency bands.

In the i-th WaveNet neural network block, the input noise processing result X_i and the Mel spectrum processing result H_i are used as inputs to the dilated convolution layer and the 1x1 convolution layer respectively, and their outputs are split into four different tensors xa_i, xb_i, sa_i, sb_i; xa_i and xb_i are connected and activated by the tanh function of the sigmoid-tanh activation layer, sa_i and sb_i are connected and activated by the sigmoid function of the sigmoid-tanh activation layer; the two activated tensors then pass through two parallel fully connected layers to output the noise processing result X_{i+1} and the Mel spectrum processing result H_{i+1}, which are fed to the (i+1)-th WaveNet neural network block for further processing. The H_i output of every WaveNet neural network block is also skip-connected: all H_i are accumulated and, after normalization, fed into the final output network, namely the two 1x1 convolution layers and their corresponding ReLU activation layers, which produce the two-channel output of a single frequency-adaptive waveform generator.
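The following PyTorch sketch of one such block follows the description above. How the two activated tensors are combined before the parallel output layers is not fully specified in the text, so the standard WaveNet gated product is assumed, and the "fully connected" output layers are realized as 1x1 convolutions.

```python
import torch
import torch.nn as nn

class WaveNetBlock(nn.Module):
    """One adaptive WaveNet block: dilated convolution on the noise path X_i,
    1x1 convolution on the Mel path H_i, split into xa/sa and xb/sb, a tanh/sigmoid
    gate, and two parallel output layers that emit X_{i+1} and H_{i+1}."""
    def __init__(self, channels=64, kernel=7, dilation=1):
        super().__init__()
        pad = (kernel - 1) // 2 * dilation
        self.dilated = nn.Conv1d(channels, 2 * channels, kernel, padding=pad,
                                 dilation=dilation)       # output split into xa_i, sa_i
        self.cond = nn.Conv1d(channels, 2 * channels, 1)   # output split into xb_i, sb_i
        self.fc_x = nn.Conv1d(channels, channels, 1)       # parallel output layer -> X_{i+1}
        self.fc_h = nn.Conv1d(channels, channels, 1)       # parallel output layer -> H_{i+1}

    def forward(self, x, h):
        xa, sa = self.dilated(x).chunk(2, dim=1)   # split of the dilated-conv output
        xb, sb = self.cond(h).chunk(2, dim=1)      # split of the 1x1-conv output
        t = torch.tanh(xa + xb)                    # xa_i, xb_i connected, tanh-activated
        s = torch.sigmoid(sa + sb)                 # sa_i, sb_i connected, sigmoid-activated
        gated = t * s                              # gated combination (assumed)
        return self.fc_x(gated), self.fc_h(gated)  # X_{i+1}, H_{i+1}
```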

The four waveforms generated by the two frequency-adaptive waveform generators are finally combined into one output waveform by the pseudo-quadrature mirror filter bank (PQMF) algorithm.

c. The training process of the discriminator:

and proportionally inputting the real waveform sequence x and the synthesized waveform sequence y into a condition discriminator. The waveform of the input is complemented by 0 in the time dimension so that all inputs have the same size.

The singer conditional discriminator first performs 256x downsampling with strided average pooling, applying pooling layers of kernel size 4 in four steps (by factors of 8, 8, 2 and 2), with 1 input channel and 256 output channels. After downsampling, a one-dimensional convolution layer outputs a 256-dimensional vector z; this vector is fed into a three-layer LSTM network to capture a stable long-term singer hidden feature h; finally, h is added element-wise to the reference singer identity feature s, and a probability P1 of whether the singer identity information has been reasonably reconstructed is output through a fully connected layer and a ReLU activation layer, from which the conditional discriminator loss is obtained. The reference singer identity feature sequence s is extracted by the pre-trained singer identity feature extraction network.
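A PyTorch sketch of this conditional discriminator is shown below. The per-step pooling strides (8, 8, 2, 2), the convolution kernel size and the use of the final LSTM frame are assumptions chosen to match the 256x total downsampling and 256 channels described above.

```python
import torch
import torch.nn as nn

class SingerConditionalDiscriminator(nn.Module):
    """Conditional discriminator sketch: 256x strided average pooling, a 1-D
    convolution to 256 channels, a 3-layer LSTM, element-wise addition of the
    reference singer embedding, then FC + ReLU to a scalar score."""
    def __init__(self, channels=256, embed_dim=256):
        super().__init__()
        self.pool = nn.Sequential(                      # four pooling steps, kernel size 4
            nn.AvgPool1d(kernel_size=4, stride=8),
            nn.AvgPool1d(kernel_size=4, stride=8),
            nn.AvgPool1d(kernel_size=4, stride=2, padding=1),
            nn.AvgPool1d(kernel_size=4, stride=2, padding=1),
        )
        self.conv = nn.Conv1d(1, channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(channels, channels, num_layers=3, batch_first=True)
        self.out = nn.Sequential(nn.Linear(channels, 1), nn.ReLU())

    def forward(self, wav, singer_embed):       # wav: (B, 1, samples), embed: (B, 256)
        z = self.conv(self.pool(wav))            # 256-dimensional encoded sequence z
        h, _ = self.lstm(z.transpose(1, 2))      # long-term singer hidden feature
        h = h[:, -1] + singer_embed              # element-wise addition of identity feature s
        return self.out(h)                       # probability-like score P1
```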

While the conditional discriminator is being trained, the zero-padded real waveform x and synthesized waveform y are also input to the unconditional discriminator in proportion and pass through nine non-causal one-dimensional dilated convolution layers. The convolution stride of each dilated convolution layer is 1; the dilation coefficient of the first layer is 1 and the dilation coefficients of the second to ninth layers range from 1 to 8; the number of input channels is 1 for the first layer and 64 for the second to ninth layers, and the number of output channels is 64; the convolution kernel size is 3. After the nine dilated convolution layers, the extracted features are mapped to a probability value by a one-dimensional convolution layer with a single output channel, which is used to judge whether the waveform is a synthesized or an original waveform and to obtain the unconditional discriminator loss.
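A corresponding PyTorch sketch of the unconditional discriminator follows; the LeakyReLU activations between layers and the averaging of per-sample scores into a single value are assumptions.

```python
import torch
import torch.nn as nn

class UnconditionalDiscriminator(nn.Module):
    """Nine non-causal dilated 1-D convolutions (kernel 3, stride 1, dilations
    1, 1, 2, ..., 8, channels 1 -> 64 -> ... -> 64) followed by a 1-D convolution
    that maps the features to a single real/fake score, as described above."""
    def __init__(self, channels=64):
        super().__init__()
        dilations = [1] + list(range(1, 9))              # layer 1, then layers 2..9
        layers, in_ch = [], 1
        for d in dilations:
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3, padding=d, dilation=d),
                       nn.LeakyReLU(0.2)]                # inter-layer activation (assumed)
            in_ch = channels
        self.body = nn.Sequential(*layers)
        self.out = nn.Conv1d(channels, 1, kernel_size=3, padding=1)

    def forward(self, wav):                              # wav: (B, 1, samples)
        score = self.out(self.body(wav))                 # per-sample scores, 1 channel
        return score.mean(dim=2)                         # one value per waveform (assumed)
```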

The generator loss and the discriminator loss are combined as the adversarial training loss used to update the network parameters, expressed as follows:

wherein x and y are the real and synthesized waveforms, s and m are the singer identity feature and the Mel spectrum, G is the multi-band waveform generator, D is the unconditional discriminator, D_s is the singer conditional discriminator, L_adv(D; G) denotes the discriminator loss, L_adv(G; D) denotes the generator loss, E_{x,m} denotes the expectation over the audio and corresponding Mel spectra in the training set, and E_{x,s,m} denotes the expectation over the audio, corresponding Mel spectra and corresponding singer identity features in the training set. This loss is called the main loss.
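The formula images did not survive extraction. Using the notation just defined, a plausible reconstruction in the least-squares form commonly used by GAN vocoders is the following; the exact functional form in the patent is not recoverable from the text, so this should be read as an assumption (z denotes the Gaussian noise input to G):

$$L_{adv}(D;G) = \mathbb{E}_{x,m}\big[(1 - D(x))^{2} + D(G(z,m))^{2}\big] + \mathbb{E}_{x,s,m}\big[(1 - D_{s}(x,s))^{2} + D_{s}(G(z,m),s)^{2}\big]$$

$$L_{adv}(G;D) = \mathbb{E}_{x,m}\big[(1 - D(G(z,m)))^{2}\big] + \mathbb{E}_{x,s,m}\big[(1 - D_{s}(G(z,m),s))^{2}\big]$$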

In one embodiment of the invention, the singer perceptual loss and the multi-resolution short-time Fourier transform loss are introduced as auxiliary losses. The singer perceptual loss is the sum of the L2 norms between corresponding LSTM hidden layers of the singer identity feature extraction network when it is fed with the Mel spectra Mel(x) and Mel(y) of the original and generated waveforms. It can be expressed as:

$$L_{spl}(x, y) = \sum_{j=1}^{L} \left\| E^{(j)}(\mathrm{Mel}(x)) - E^{(j)}(\mathrm{Mel}(y)) \right\|_{2}$$

wherein L_spl(x, y) denotes the singer perceptual loss, ||·||_2 denotes the L2 norm, E^(j) denotes the hidden output of the j-th LSTM layer of the pre-trained singer identity feature extraction network when its input is the Mel spectrum of the waveform, and L denotes the number of LSTM hidden layers.
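A PyTorch sketch of this loss, using the SingerEncoder sketched earlier, is shown below; the layer-by-layer iteration over encoder.lstm_layers is an assumption of that sketch, not an API taken from the patent.

```python
import torch

def singer_perceptual_loss(mel_real, mel_fake, encoder):
    """Sum of L2 distances between corresponding LSTM hidden-layer outputs of the
    frozen singer identity encoder, fed with Mel(x) and Mel(y)."""
    def hidden_layers(mel):                     # mel: (B, frames, 80)
        outs, h = [], mel
        for layer in encoder.lstm_layers:       # per the SingerEncoder sketch above
            h, _ = layer(h)
            outs.append(h)
        return outs

    loss = 0.0
    for hr, hf in zip(hidden_layers(mel_real), hidden_layers(mel_fake)):
        loss = loss + torch.norm(hr - hf, p=2)  # L2 norm per hidden layer, summed over layers
    return loss
```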

The multi-resolution short-time Fourier transform loss is the sum of the losses of the synthesized waveform under short-time Fourier transforms with a set of different analysis parameters (i.e., resolutions). For the m-th parameter set, the losses are:

$$L^{m}_{sc}(x, y) = \frac{\left\| \, |STFT(x)| - |STFT(y)| \, \right\|_{F}}{\left\| \, |STFT(x)| \, \right\|_{F}}, \qquad L^{m}_{mag}(x, y) = \frac{1}{N} \left\| \log|STFT(x)| - \log|STFT(y)| \right\|_{1}$$

wherein L_sc^m and L_mag^m respectively denote the spectral convergence loss and the log STFT magnitude loss, ||·||_F and ||·||_1 respectively denote the Frobenius norm and the L1 norm, and |STFT(·)| and N respectively denote the magnitude of the short-time Fourier transform under the m-th set of Fourier parameters and the number of elements in that magnitude.

The final multi-resolution short-time Fourier transform loss is expressed as:

$$L_{mr\_stft}(x, y) = \sum_{m=1}^{M} \left( L^{m}_{sc}(x, y) + L^{m}_{mag}(x, y) \right)$$

wherein M denotes the number of short-time Fourier transforms with different parameters, and the superscript m indexes the m-th spectral convergence loss and log STFT magnitude loss.
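A PyTorch sketch of the multi-resolution STFT loss follows; the three (n_fft, hop, window) triples are typical values and not taken from the patent.

```python
import torch
import torch.nn.functional as F

def stft_mag(x, n_fft, hop, win):
    """Magnitude spectrogram |STFT(x)| for one analysis setting; x: (B, samples)."""
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_resolution_stft_loss(x, y, resolutions=((1024, 256, 1024),
                                                  (2048, 512, 2048),
                                                  (512, 128, 512))):
    """Sum over M resolutions of spectral convergence plus log-magnitude losses."""
    total = 0.0
    for n_fft, hop, win in resolutions:
        X, Y = stft_mag(x, n_fft, hop, win), stft_mag(y, n_fft, hop, win)
        sc = torch.norm(X - Y, p="fro") / torch.norm(X, p="fro")   # L_sc^m
        mag = F.l1_loss(torch.log(Y), torch.log(X))                # L_mag^m (mean = 1/N * L1)
        total = total + sc + mag
    return total
```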

In summary, the auxiliary loss used in the present invention is:

$$L_{aux}(x, y) = L_{spl}(x, y) + L_{mr\_stft}(x, y)$$

wherein L_aux is referred to as the auxiliary loss.

The final loss of the invention during training combines the main (adversarial) loss with the auxiliary loss, where a is a hyperparameter balancing the auxiliary loss against the adversarial loss; in this embodiment, a = 10.
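One plausible reading of this weighting, with the generator's adversarial loss as the main term, is the following; only the balancing role of a and the value a = 10 are stated in the text, so the placement of a is an assumption:

$$L_{final} = L_{adv}(G;D) + a \cdot L_{aux}, \qquad a = 10$$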

In one embodiment of the present invention, the process of step three is described.

Using the 80-band Mel spectrum containing the singing voice information as input, a high-quality audio waveform is generated by the trained multi-band high-fidelity vocoder.

The method specifically comprises the following steps:

3.1) Acquiring the 80-band Mel spectrum data to be synthesized into a waveform sequence and cutting it as in training, so that in this embodiment each audio segment does not exceed 11 seconds. Small batches of Mel spectra are fed to the trained multi-band waveform generator of the generative adversarial network, zero-padded at the end of the time dimension so that they have the same size, giving the Mel feature sequence. At the same time, synthetic noise drawn from a Gaussian distribution is generated as the other input to the generator.

3.2) The input spectrum and noise are distributed in parallel to the two frequency-adaptive generators and synthesized into four waveforms of different frequency bands: two high-band waveforms and two low-band waveforms. The four waveforms are finally combined into one output waveform by the pseudo-quadrature mirror filter bank (PQMF) algorithm, and the synthesized waveform is then converted into the audio output.
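An inference sketch in PyTorch follows; generator_low, generator_high and pqmf are assumed to be the trained modules, pqmf.synthesis is a hypothetical method name, and hop = 256 is an assumed Mel hop size.

```python
import torch

@torch.no_grad()
def synthesize(mel_chunks, generator_low, generator_high, pqmf, hop=256):
    """Each zero-padded Mel chunk plus Gaussian noise goes through both frequency-adaptive
    generators, the four sub-bands are recombined by PQMF synthesis, and the per-chunk
    waveforms are concatenated into the final output."""
    pieces = []
    for mel in mel_chunks:                                  # mel: (1, 80, frames)
        noise = torch.randn(1, 1, mel.size(-1) * hop // 4)  # noise at the sub-band rate
        low = generator_low(mel, noise)                     # (1, 2, T/4): two low-band signals
        high = generator_high(mel, noise)                   # (1, 2, T/4): two high-band signals
        subbands = torch.cat([low, high], dim=1)            # (1, 4, T/4)
        pieces.append(pqmf.synthesis(subbands))             # (1, 1, T) full-band waveform
    return torch.cat(pieces, dim=-1)                        # concatenated chunk waveforms
```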

Corresponding to the foregoing embodiments of the multi-singer singing voice synthesis method based on a generative adversarial network, the present application further provides an embodiment of a multi-singer singing voice synthesis system based on a generative adversarial network, which includes:

the system comprises a sample preprocessing module, a signal processing module and a signal processing module, wherein the sample preprocessing module is used for acquiring an aligned singing voice training sample set of multiple singers, and each sample consists of source singing voice frequency, aligned lyric text and singer identity information;

the network module for generating the countering of the singing voice of a plurality of singers comprises a multiband waveform generator, a singing person identity characteristic extraction network, a singer condition discriminator and a non-condition discriminator;

the Mel spectrum corresponding to the source singing voice audio is taken as input to the multi-band waveform generator, together with synthetic noise drawn from a Gaussian distribution input in parallel; the multi-band waveform generator generates synthesized waveforms for four different frequency bands, which are processed by a pseudo-quadrature mirror filter bank to obtain the synthesized waveform output;

the real waveform corresponding to the source singing voice audio and the synthesized waveform are input to the discriminators in proportion; for the singer conditional discriminator, the real or synthesized waveform is first encoded, the singer identity feature sequence is then added to the encoded sequence, and the probability that the singer identity information has been reconstructed is output; for the unconditional discriminator, the real or synthesized waveform is taken as input and the probability that the waveform is a synthesized waveform is output;

the generative adversarial network is trained on the aligned singing voice training sample set from the sample preprocessing module, according to the loss of the multi-band waveform generator, the loss of the singer conditional discriminator and the loss of the unconditional discriminator;

and the singing voice synthesis module is configured to divide the source singing voice audio to be synthesized into segments in the same way as the training samples, to use the Mel spectrum and noise of each segment as input to the multi-band waveform generator and output a synthesized waveform, to concatenate the synthesized waveforms of all segments into the final synthesized waveform, and to convert it into the audio output.

With regard to the system in the above-described embodiment, the specific manner in which each unit or module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated herein.

For the system embodiment, since it basically corresponds to the method embodiment, reference may be made to the corresponding parts of the method embodiment for relevant details. The system embodiments described above are merely illustrative; the multi-singer singing voice generative adversarial network module may or may not be physically separate. In addition, each functional module in the present invention may be integrated into one processing unit, or each module may exist separately, or two or more modules may be integrated into one unit. The integrated modules or units can be implemented in hardware or as software functional units, so that some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application.

In the present embodiment, three evaluation methods are used to score the generated waveform sequences: Mean Opinion Score (MOS), Fréchet DeepSpeech Distance (FDSD), and cosine similarity.

The MOS score is derived from ratings by a large number of native listeners, ranging from 1 to 5, where 1 indicates very poor quality with significant distortion and 5 indicates no perceptible distortion.

FDSD computes a score based on the distance between the synthesized audio and the reference audio, and is conceptually similar to FID (Fréchet Inception Distance).

Cosine similarity is used to measure singer similarity in the multi-singer song collection. In addition, in this embodiment the running speed of the model is estimated using the real-time factor (RTF).
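For reference, singer cosine similarity can be computed from embeddings of the singer identity encoder roughly as follows; this is a generic sketch, not the exact evaluation protocol of the patent.

```python
import numpy as np

def singer_cosine_similarity(embed_real, embed_synth):
    """Cosine similarity between the singer embeddings of real and synthesized audio."""
    a = embed_real / np.linalg.norm(embed_real)
    b = embed_synth / np.linalg.norm(embed_synth)
    return float(np.dot(a, b))
```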

The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.

Examples

Experiments were carried out in three parts, covering multi-band waveform generation, multi-singer modeling and ablation verification, and were performed on the OpenSing data set.

The following is a basic description of the data set.

The data set is processed as follows:

silent segments (100 ms or more of continuous silence) are removed from the audio files using Voice Activity Detection (VAD), which greatly shortens them; lyrics-to-singing alignment is used to segment each whole song into aligned lyrics and audio, such that every processed phrase fragment is between 0 and 11 seconds long; the Montreal Forced Aligner (MFA) tool is used to obtain a word-level time-aligned audio-text data set. The phoneme labels in the MFA algorithm are obtained by aligning a manually labeled phoneme sequence of each song with a GMM-HMM (Gaussian mixture model - hidden Markov model) algorithm.

After initial processing of the data set, 340 samples were randomly selected as the validation set, and 60 samples from 6 singers were randomly selected as the seen-singer test set. In addition, 5 samples each from 5 male singers and 5 female singers were selected as the unseen-singer test set.

The following are experimental results for multi-band waveform generation:

as shown in table 2, the Multi-band WaveRNN model achieves the best performance level and generates the most natural singing voice, but its operation speed is greatly limited due to its autoregressive architecture; the Multi-band MelGAN achieves the fastest operating speed, but significant quality degradation occurs; the multiband high-fidelity vocoder provided by the invention has the optimal performance as a non-autoregressive generator on the premise of realizing rapid operation, because the multiband waveform generating structure has high parallelism and the structure of the generator is adaptive according to the characteristics of different frequency bands.

The following are experimental results modeled for multiple singers:

as can be seen from the results in table 3, none of the non-autoregressive models such as MelGAN and parallell WaveGAN explicitly model and migrate the scenes of multiple singers, so that it is inevitable that significant degradation occurs when synthetic non-singers are encountered; singer Conditional WaveRNN (SC-WaveRNN) uses Singer information embedding as extra information in synthesis to control the identity of singers, but because of the architecture of its autoregressive model, huge computational consumption is incurred; the invention can sense the identity of the singer in the frequency spectrum without consuming extra computing resources, balances the efficiency and the quality and obtains remarkable technical effect.

The rationality and the necessity of the technology adopted by the invention are verified through an ablation experiment.

As can be seen from Table 4, when the multi-band generator is replaced with a full-band generator, the generation speed drops greatly and the quality of the generated singing voice also decreases; without the singer conditional discriminator, the singer cosine similarity on the unseen-singer task drops, indicating that the generator has difficulty capturing singer identities; without the singer perceptual loss, the quality of the generated waveform also diminishes. The ablation experiments show that the invention accelerates waveform generation while effectively reconstructing singer identity even for unseen singers.

The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.
