Many-to-many voice conversion method and system based on speaker style feature modeling

Document No. 1005935 · Published 2020-10-23

This technology, "Many-to-many voice conversion method and system based on speaker style feature modeling" (基于说话人风格特征建模的多对多语音转换方法及系统), was designed and created by 李燕萍 (Li Yanping) and 张成飞 (Zhang Chengfei) on 2020-06-02. Its main content is as follows: the invention discloses a many-to-many voice conversion method and system based on speaker style feature modeling. First, a multi-layer perceptron and a style encoder are added to the StarGAN neural network to effectively extract and constrain speaker style features, overcoming the drawback that the one-hot vector in traditional models carries limited speaker information. Then, an adaptive instance normalization method is adopted to fully fuse the semantic features with the speaker personality features, so that the network can learn more semantic information and speaker personality information. Further, a lightweight network module, SKNet, is introduced into the generator's residual network, so that the network can adaptively adjust the size of the receptive field according to multiple scales of the input information, adjust the weight of each feature channel through an attention mechanism, enhance the learning of spectral features, and refine spectral feature details.

1. A many-to-many voice conversion method based on speaker style feature modeling is characterized by comprising a training stage and a conversion stage, wherein the training stage comprises the following steps:

(1.1) acquiring a training corpus, wherein the training corpus consists of corpora of a plurality of speakers, and the speakers comprise source speakers and target speakers;

(1.2) extracting the frequency spectrum characteristic x of each speaker voice in the training corpus;

(1.3) inputting the spectral feature x of each speaker's voice, the source speaker label c_s, the target speaker label c_t, and random noise z following a normal distribution into an SKNet StarGAN network for training, wherein the SKNet StarGAN network comprises a generator G, a discriminator D, a classifier C, a style encoder S and a multi-layer perceptron M, the generator G comprises an encoding network, a decoding network and at least one SKNet layer, and the SKNet layer is built in a residual network between the encoding network and the decoding network;

(1.4) in the training process, the loss function of the generator G and the loss function of the discriminator D are made as small as possible until the set number of iterations is reached, so as to obtain the trained SKNet StarGAN network;

(1.5) constructing a fundamental frequency conversion function from the voice fundamental frequency of the source speaker to the voice fundamental frequency of the target speaker;

the conversion stage comprises the following steps:

(2.1) extracting the spectral feature x_s', the aperiodic feature and the fundamental frequency feature from the source speaker's voice in the corpus to be converted;

(2.2) inputting the source speaker spectral feature x_s', the target speaker label feature c_t' and random noise z' obeying a normal distribution into the SKNet StarGAN network trained in step (1.4) to obtain the target speaker spectral feature x_st';

(2.3) converting the fundamental frequency features of the source speaker extracted in the step (2.1) into fundamental frequency features of the target speaker through the fundamental frequency conversion function obtained in the step (1.5);

(2.4) synthesizing the target speaker spectral feature x_st' generated in step (2.2), the target speaker fundamental frequency feature obtained in step (2.3) and the aperiodic feature extracted in step (2.1) through the WORLD speech analysis/synthesis model to obtain the converted speech.

2. The method of claim 1, wherein the SKNet built between the encoding network and the decoding network has 6 layers.

3. The method of claim 1, wherein the style encoder S comprises 6 one-dimensional convolutions with filter sizes of 1, 1 and 16 respectively, step sizes of 1, and filter depths of 32, 64, 128, 256, 512 and 512 respectively; the middle layer comprises 5 one-dimensional average pooling layers and 5 residual networks, the filter size of each one-dimensional average pooling layer is 2 with step size 2, each residual network layer comprises 2 one-dimensional convolutions, the filter size of each one-dimensional convolution is 2 with step size 2, and the depth is twice the depth of the previous layer's filter.

4. The method of claim 1, wherein the multi-layer perceptron M comprises 7 linear layers: the input layer has 16 input neurons and 512 output neurons, each of the middle 5 linear layers has 512 input neurons and 512 output neurons, and the output layer has 512 input neurons and 64 output neurons.

5. The method of many-to-many speech conversion based on speaker style feature modeling according to claim 1, wherein the training process of steps (1.3) and (1.4) comprises the steps of:

(1) inputting random noise z following a normal distribution and the target speaker label feature c_t into the multi-layer perceptron M to obtain the style feature s_t of the target speaker;

(2) inputting the source speaker spectral feature x_s into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s);

(3) inputting the generated semantic features G(x_s) and the target speaker style feature s_t into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, so as to obtain the target speaker spectral feature x_st;

(4) inputting the source speaker spectral feature x_s and the source speaker label feature c_s into the style encoder S to obtain the style indicating feature ŝ_s of the source speaker;

(5) inputting the generated target speaker spectral feature x_st into the encoding network of the generator G again to obtain speaker-independent semantic features G(x_st);

(6) inputting the generated semantic features G(x_st) together with the source speaker style indicating feature ŝ_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the reconstructed source speaker spectral feature x̂_s;

(7) inputting the target speaker spectral feature x_st generated in step (3) into the discriminator D and the classifier C for training, minimizing the loss function of the discriminator D and the loss function of the classifier C;

(8) inputting the target speaker spectral feature x_st generated in step (3) and the target speaker label feature c_t into the style encoder S for training, minimizing the style reconstruction loss function of the style encoder S;

(9) returning to step (1) and repeating the above steps until the set number of iterations is reached, so as to obtain the trained SKNet StarGAN network.

6. The method of claim 5, wherein the style reconstruction loss function of the style encoder S is expressed as:

L_sty = E_{x_s, s_t}[ || s_t − S(G(x_s, s_t)) ||_1 ]

wherein E_{x_s, s_t}[·] denotes the expectation over the distribution generated by the generator, S(·) is the style encoder, s_t is the target speaker style feature generated by the multi-layer perceptron M, G(x_s, s_t) is the target speaker spectral feature generated by the generator, and x_s is the source speaker spectral feature.

7. The method of many-to-many speech conversion based on speaker style feature modeling according to claim 1, wherein the input process of step (2.2) comprises the steps of:

(1) inputting random noise z' following a normal distribution and the target speaker label feature c_t' into the multi-layer perceptron M to obtain the target speaker style feature s_t';

(2) inputting the source speaker spectral feature x_s' into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s');

(3) inputting the generated semantic features G(x_s') and the target speaker style feature s_t' into the decoding network of the generator G to obtain the target speaker spectral feature x_st'.

8. The method of claim 1, wherein the objective function of the SKNet StarGAN network is expressed as:

L_SKNetStarGAN = L_G + L_D

wherein L_G is the loss function of the generator and L_D is the loss function of the discriminator;

the loss function L_G of the generator is expressed as:

L_G = L_adv^G + λ_cyc·L_cyc + λ_ds·L_ds + λ_sty·L_sty + λ_cls·L_cls^G

wherein λ_cyc, λ_ds, λ_sty and λ_cls are a group of regularization hyper-parameters which respectively represent the weights of the cycle consistency loss, the style diversity loss, the style reconstruction loss and the classification loss, and L_adv^G, L_cyc, L_ds, L_sty and L_cls^G respectively denote the adversarial loss of the generator, the cycle consistency loss, the style diversity loss, the style reconstruction loss of the style encoder and the classification loss of the classifier;

the loss function L_D of the discriminator is:

L_D = L_adv^D + λ_cls·L_cls^D

wherein λ_cls is the weight of the classification loss, and L_adv^D and L_cls^D respectively denote the adversarial loss of the discriminator and the classification loss of the classifier.

9. A many-to-many speech conversion system based on speaker style feature modeling, comprising a training phase and a conversion phase, the training phase comprising:

the corpus acquiring module is used for acquiring a training corpus, wherein the training corpus consists of corpora of a plurality of speakers, and the speakers comprise source speakers and target speakers;

the preprocessing module is used for extracting the frequency spectrum characteristic x of each speaker voice in the training corpus;

a network training module for inputting the spectral feature x of each speaker's voice, the source speaker label c_s, the target speaker label c_t and random noise z following a normal distribution into an SKNet StarGAN network for training, wherein the SKNet StarGAN network comprises a generator G, a discriminator D, a classifier C, a style encoder S and a multi-layer perceptron M, the generator G comprises an encoding network, a decoding network and at least one SKNet layer, and the SKNet layer is built in a residual network between the encoding network and the decoding network;

in the training process, the loss function of the generator G and the loss function of the discriminator D are made as small as possible until the set number of iterations is reached, so that the trained SKNet StarGAN network is obtained;

the function construction module is used for constructing a fundamental frequency conversion function from the voice fundamental frequency of the source speaker to the voice fundamental frequency of the target speaker;

the transition phase comprises:

a source speech processing module for extracting the spectral feature x_s', the aperiodic feature and the fundamental frequency feature from the source speaker's speech in the corpus to be converted;

a conversion module for inputting the source speaker spectral feature x_s', the target speaker label feature c_t' and random noise z' obeying a normal distribution into the SKNet StarGAN network trained by the network training module to obtain the target speaker spectral feature x_st';

The target characteristic acquisition module is used for converting the extracted fundamental frequency characteristic of the source speaker into the fundamental frequency characteristic of the target speaker by using the obtained fundamental frequency conversion function;

a speaker voice acquisition module for synthesizing the generated target speaker spectral feature x_st', the target speaker fundamental frequency feature and the aperiodic feature through the WORLD speech analysis/synthesis model to obtain the converted speech.

10. A computer storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a computer processor, implements the method of any one of claims 1 to 8.

Technical Field

The invention relates to the technical field of voice conversion, in particular to a many-to-many voice conversion method based on speaker style feature modeling.

Background

Speech conversion is a branch of research in the field of speech signal processing, and is developed and extended on the basis of research on speech analysis, synthesis, and speaker recognition. The goal of speech conversion is to change the personality characteristics of the source speaker to have the personality characteristics of the target speaker while leaving the semantic information unchanged, i.e., to make the source speaker's speech sound like the target speaker's speech after conversion.

After years of research, many classical conversion methods have emerged. According to the training corpus, speech conversion technology can be classified into conversion methods under parallel text conditions and conversion methods under non-parallel text conditions. Parallel-text methods require a large number of parallel training texts to be collected in advance, which is time-consuming and labor-intensive, and parallel texts cannot be collected at all in cross-language conversion and medical assistance systems; therefore, speech conversion research under non-parallel text conditions has broader application scenarios and greater practical significance.

Conventional speech conversion methods under non-parallel text conditions include methods based on the Cycle-Consistent Adversarial Network (Cycle-GAN) and methods based on the Conditional Variational Auto-Encoder (C-VAE). The voice conversion method based on the C-VAE model directly uses the speaker's identity label to build a voice conversion system: the encoder decouples the semantics and the personality information of the speech, and the decoder reconstructs the speech from the semantics and the speaker identity label, thereby relieving the dependence on parallel texts. However, because C-VAE relies on the idealized assumption that the observed data follow a Gaussian distribution, the decoder's output speech is over-smoothed and the converted speech quality is low. The voice conversion method based on the Cycle-GAN model uses adversarial loss and cycle consistency loss to learn the forward and inverse mappings of the acoustic features simultaneously, which effectively alleviates the over-smoothing problem and improves the quality of the converted speech, but Cycle-GAN can currently only realize one-to-one voice conversion.

The voice conversion method based on the Star Generative Adversarial Network (StarGAN) model combines the advantages of C-VAE and Cycle-GAN: its generator has an encoder-decoder structure that can learn many-to-many mappings simultaneously, and the attributes of the generator's output are controlled by the speaker identity label, so many-to-many voice conversion under non-parallel text conditions can be realized. However, this method still has three problems. First, the speaker identity label is only a one-hot vector; although it has an indicating function, it cannot provide richer speaker identity information, and this lack of speaker information makes it difficult for the generator to reconstruct converted speech with high personality similarity. Second, in the generator's decoding network, the speaker identity label controls the output attributes only through simple concatenation, which cannot fully fuse the semantic features with the speaker personality features, so deep semantic features and speaker personality features in the spectrum are easily lost in transmission. Third, the encoding network and the decoding network in the generator are independent, and this simple network structure leaves the generator lacking the ability to extract deep features, which easily causes loss of information and generation of noise.

Disclosure of Invention

The purpose of the invention is as follows: in order to overcome the defects of the prior art, the invention provides a many-to-many voice conversion method based on speaker style feature modeling, which addresses three problems of existing methods: speaker labels carry insufficient personality information, semantic features and speaker features are fused only by simple concatenation, and the receptive field and channel weights in the residual network are fixed.

The technical scheme is as follows: according to a first aspect of the present invention, a many-to-many voice conversion method based on speaker style feature modeling is provided, which comprises a training phase and a conversion phase, wherein the training phase comprises the following steps:

(1.1) acquiring a training corpus, wherein the training corpus consists of corpora of a plurality of speakers, and the speakers comprise source speakers and target speakers;

(1.2) extracting the frequency spectrum characteristic x of each speaker voice in the training corpus;

(1.3) inputting the spectral feature x of each speaker's voice, the source speaker label c_s, the target speaker label c_t, and random noise z following a normal distribution into an SKNet StarGAN network for training, wherein the SKNet StarGAN network comprises a generator G, a discriminator D, a classifier C, a style encoder S and a multi-layer perceptron M, the generator G comprises an encoding network, a decoding network and at least one SKNet layer, and the SKNet layer is built in a residual network between the encoding network and the decoding network;

(1.4) in the training process, the loss function of the generator G and the loss function of the discriminator D are made as small as possible until the set number of iterations is reached, so as to obtain the trained SKNet StarGAN network;

(1.5) constructing a fundamental frequency conversion function from the voice fundamental frequency of the source speaker to the voice fundamental frequency of the target speaker;

the conversion phase comprises the following steps:

(2.1) extracting the spectral feature x_s', the aperiodic feature and the fundamental frequency feature from the source speaker's voice in the corpus to be converted;

(2.2) inputting the source speaker spectral feature x_s', the target speaker label feature c_t' and random noise z' obeying a normal distribution into the SKNet StarGAN network trained in step (1.4) to obtain the target speaker spectral feature x_st';

(2.3) converting the fundamental frequency features of the source speaker extracted in the step (2.1) into fundamental frequency features of the target speaker through the fundamental frequency conversion function obtained in the step (1.5);

(2.4) synthesizing the target speaker spectral feature x_st' generated in step (2.2), the target speaker fundamental frequency feature obtained in step (2.3) and the aperiodic feature extracted in step (2.1) through the WORLD speech analysis/synthesis model to obtain the converted speech.

Further, the SKNet built between the encoding network and the decoding network has 6 layers.

Further, the style encoder S comprises 6 one-dimensional convolutions with filter sizes of 1, 1 and 16 respectively, step sizes of 1, and filter depths of 32, 64, 128, 256, 512 and 512 respectively; the middle layer comprises 5 one-dimensional average pooling layers and 5 residual networks, the filter size of each one-dimensional average pooling layer is 2 with step size 2, each residual network layer comprises 2 one-dimensional convolutions, the filter size of each one-dimensional convolution is 2 with step size 2, and the depth is twice the depth of the previous layer's filter.

Further, the multi-layer perceptron M comprises 7 linear layers: the input layer has 16 input neurons and 512 output neurons, each of the 5 middle linear layers has 512 input neurons and 512 output neurons, and the output layer has 512 input neurons and 64 output neurons.

Further, the training process of steps (1.3) and (1.4) comprises the following steps:

(1) inputting random noise z following a normal distribution and the target speaker label feature c_t into the multi-layer perceptron M to obtain the style feature s_t of the target speaker;

(2) inputting the source speaker spectral feature x_s into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s);

(3) inputting the generated semantic features G(x_s) and the target speaker style feature s_t into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, so as to obtain the target speaker spectral feature x_st;

(4) inputting the source speaker spectral feature x_s and the source speaker label feature c_s into the style encoder S to obtain the style indicating feature ŝ_s of the source speaker;

(5) inputting the generated target speaker spectral feature x_st into the encoding network of the generator G again to obtain speaker-independent semantic features G(x_st);

(6) inputting the generated semantic features G(x_st) together with the source speaker style indicating feature ŝ_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the reconstructed source speaker spectral feature x̂_s;

(7) inputting the target speaker spectral feature x_st generated in step (3) into the discriminator D and the classifier C for training, minimizing the loss function of the discriminator D and the loss function of the classifier C;

(8) inputting the target speaker spectral feature x_st generated in step (3) and the target speaker label feature c_t into the style encoder S for training, minimizing the style reconstruction loss function of the style encoder S;

(9) returning to step (1) and repeating the above steps until the set number of iterations is reached, so as to obtain the trained SKNet StarGAN network.

Further, the style reconstruction loss function of the style encoder S is expressed as:

L_sty = E_{x_s, s_t}[ || s_t − S(G(x_s, s_t)) ||_1 ]

wherein E_{x_s, s_t}[·] denotes the expectation over the distribution generated by the generator, S(·) is the style encoder, s_t is the target speaker style feature generated by the multi-layer perceptron M, G(x_s, s_t) is the target speaker spectral feature generated by the generator, and x_s is the source speaker spectral feature.

Further, the input process of step (2.2) comprises the following steps:

(1) inputting random noise z' following a normal distribution and the target speaker label feature c_t' into the multi-layer perceptron M to obtain the target speaker style feature s_t';

(2) inputting the source speaker spectral feature x_s' into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s');

(3) inputting the generated semantic features G(x_s') and the target speaker style feature s_t' into the decoding network of the generator G to obtain the target speaker spectral feature x_st'.

Further, the objective function of the SKNet StarGAN network is expressed as:

L_SKNetStarGAN = L_G + L_D

wherein L_G is the loss function of the generator and L_D is the loss function of the discriminator;

The loss function L_G of the generator is expressed as:

L_G = L_adv^G + λ_cyc·L_cyc + λ_ds·L_ds + λ_sty·L_sty + λ_cls·L_cls^G

wherein λ_cyc, λ_ds, λ_sty and λ_cls are a group of regularization hyper-parameters which respectively represent the weights of the cycle consistency loss, the style diversity loss, the style reconstruction loss and the classification loss, and L_adv^G, L_cyc, L_ds, L_sty and L_cls^G respectively denote the adversarial loss of the generator, the cycle consistency loss, the style diversity loss, the style reconstruction loss of the style encoder and the classification loss of the classifier;

The loss function L_D of the discriminator is:

L_D = L_adv^D + λ_cls·L_cls^D

wherein λ_cls is the weight of the classification loss, and L_adv^D and L_cls^D respectively denote the adversarial loss of the discriminator and the classification loss of the classifier.

On the other hand, the invention also provides a many-to-many voice conversion system based on speaker style feature modeling, which comprises a training stage and a conversion stage, wherein the training stage comprises:

the corpus acquiring module is used for acquiring a training corpus, wherein the training corpus consists of corpora of a plurality of speakers, and the speakers comprise source speakers and target speakers;

the preprocessing module is used for extracting the frequency spectrum characteristic x of each speaker voice in the training corpus;

a network training module for inputting the spectral feature x of each speaker's voice, the source speaker label c_s, the target speaker label c_t and random noise z following a normal distribution into an SKNet StarGAN network for training, wherein the SKNet StarGAN network comprises a generator G, a discriminator D, a classifier C, a style encoder S and a multi-layer perceptron M, the generator G comprises an encoding network, a decoding network and at least one SKNet layer, and the SKNet layer is built in a residual network between the encoding network and the decoding network;

in the training process, the loss function of the generator G and the loss function of the discriminator D are made as small as possible until the set number of iterations is reached, so that the trained SKNet StarGAN network is obtained;

the function construction module is used for constructing a fundamental frequency conversion function from the voice fundamental frequency of the source speaker to the voice fundamental frequency of the target speaker;

the conversion stage comprises:

a source speech processing module for extracting the spectral feature x_s', the aperiodic feature and the fundamental frequency feature from the source speaker's speech in the corpus to be converted;

a conversion module for inputting the source speaker spectral feature x_s', the target speaker label feature c_t' and random noise z' obeying a normal distribution into the trained SKNet StarGAN network to obtain the target speaker spectral feature x_st';

The target characteristic acquisition module is used for converting the extracted fundamental frequency characteristic of the source speaker into the fundamental frequency characteristic of the target speaker by using the obtained fundamental frequency conversion function;

a speaker voice acquisition module for synthesizing the generated target speaker spectral feature x_st', the target speaker fundamental frequency feature and the aperiodic feature through the WORLD speech analysis/synthesis model to obtain the converted speech.

Furthermore, the present invention provides a computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a computer processor, implements the method described above.

Beneficial effects: (1) the invention obtains speaker personality features by adding a multi-layer perceptron and a style encoder, and uses speaker style features in place of speaker labels, overcoming the drawback that one-hot vectors carry limited speaker information; this helps the decoding network learn more speaker personality features, improves the personality similarity of the converted speech, and yields more satisfactory converted speech; (2) the invention adopts adaptive instance normalization so that semantic features and speaker personality features can be fully fused, improving the decoding network's ability to learn spectral features at different scales; meanwhile, an SKNet module is added between the generator's encoding network and decoding network, so the network can adaptively adjust the size of the receptive field according to multiple scales of the input information, adjust the weight of each feature channel through an attention mechanism, and refine spectral feature details, making the generated spectrum clearer, more natural and finer; (3) training of the network is more stable and efficient: the normalization method accelerates network training and avoids gradient vanishing or gradient explosion during back-propagation, while the residual network effectively alleviates network degradation during training. Therefore, the SKNet StarGAN network realizes many-to-many voice conversion with high speech quality and high personality similarity under non-parallel text conditions, and has good application prospects in cross-language voice conversion, film dubbing, speech translation and other fields.

Drawings

FIG. 1 is a schematic diagram of the SKNet StarGAN principle of the present method;

FIG. 2 is a network architecture diagram of a generator of the model SKNet StarGAN of the present method;

FIG. 3 is a schematic diagram of SKNet principle in model SKNet StarGAN of the present method;

FIG. 4 is a network architecture diagram of the discriminator of the model SKNet StarGAN of the present method;

FIG. 5 is a network architecture diagram of the multi-layer perceptron of the model SKNet StarGAN of the present method;

FIG. 6 is a network architecture diagram of the style encoder of model SKNet StarGAN of the present method;

FIG. 7 is a comparison of spectrograms of speech synthesized by the SKNet StarGAN model of the present method and by the reference StarGAN model under same-gender conversion;

FIG. 8 is a comparison of spectrograms of speech synthesized by the SKNet StarGAN model of the present method and by the reference StarGAN model under cross-gender conversion;

FIG. 9 is a graph comparing the convergence speed of the generator reconstruction loss of the SKNet StarGAN model of the present method and the reference StarGAN model.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a many-to-many voice conversion method based on speaker style feature modeling. A multi-layer perceptron and a style encoder are added to the traditional StarGAN neural network to effectively extract and constrain speaker style features, and the speaker style features are used in place of the speaker label features, overcoming the drawback that the one-hot vectors in traditional models carry limited speaker information. Secondly, semantic features and speaker personality features are fully fused in the generator network through adaptive instance normalization, enhancing the learning and expressive ability of the generator network. Further, an SKNet module is added between the generator's encoding network and decoding network, so that the network can adaptively adjust the size of the receptive field according to multiple scales of the input information, adjust the weight of each feature channel through an attention mechanism, and refine the spectral feature details. The SKNet StarGAN network based on speaker style feature modeling can thus produce converted speech with high quality and high personality similarity. The present invention refers to the improved StarGAN as SKNet StarGAN.

As shown in fig. 1, the method implemented in this example is divided into two parts: the training part is used for obtaining parameters and conversion functions required by voice conversion, and the conversion part is used for realizing the conversion from the voice of a source speaker to the voice of a target speaker.

The training stage comprises the following implementation steps:

1.1) obtaining a training corpus of non-parallel text, wherein the training corpus is a corpus of multiple speakers and comprises source speakers and target speakers. The corpus is taken from the VCC2018 speech corpus, which contains 6 male and 6 female speakers, each with 81 training sentences and 35 test sentences. In this experiment, 4 female speakers and 4 male speakers were selected, namely VCC2SF3, VCC2SF4, VCC2TF1, VCC2TF2, VCC2SM3, VCC2SM4, VCC2TM1, and VCC2TM2.

1.2) extracting the spectral envelope feature x, the aperiodic feature and the logarithmic fundamental frequency log f_0 of each speaker's sentences from the training corpus through the WORLD speech analysis/synthesis model. Since the Fast Fourier Transform (FFT) length is set to 1024, the obtained spectral envelope feature x and the aperiodic feature are both 1024/2 + 1 = 513 dimensional. Each speech block has 512 frames, and 36-dimensional Mel-cepstral coefficient (MCEP) features are extracted from each frame as the spectral features of the SKNet StarGAN model; 8 speech blocks are taken in one training step, so each training batch has dimensions 8 × 36 × 512.
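For reference, a minimal sketch of this extraction step is shown below using the pyworld and soundfile packages; the 1024-point FFT and the 36-dimensional MCEP follow the settings above, while the function name, file handling and the use of code_spectral_envelope for MCEP coding are illustrative assumptions rather than the exact tooling of the experiments.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

FFT_SIZE = 1024      # gives 1024/2 + 1 = 513-dim spectral envelope / aperiodicity
MCEP_DIM = 36        # Mel-cepstral coefficients used as the spectral feature

def extract_world_features(wav_path):
    """Extract spectral envelope (coded as MCEP), aperiodicity and log-F0 with WORLD."""
    wav, fs = sf.read(wav_path)
    wav = wav.astype(np.float64)
    f0, timeaxis = pw.harvest(wav, fs)                              # fundamental frequency
    sp = pw.cheaptrick(wav, f0, timeaxis, fs, fft_size=FFT_SIZE)    # 513-dim spectral envelope
    ap = pw.d4c(wav, f0, timeaxis, fs, fft_size=FFT_SIZE)           # 513-dim aperiodicity
    mcep = pw.code_spectral_envelope(sp, fs, MCEP_DIM)              # 36-dim MCEP per frame
    log_f0 = np.log(f0[f0 > 0])                                     # log-F0 over voiced frames
    return mcep, ap, f0, log_f0
```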

1.3) The SKNet StarGAN network in this embodiment is based on the StarGAN model. On the one hand, effective modeling and extraction of speaker style features are realized by adding a style encoder and a multi-layer perceptron; on the other hand, an adaptive instance normalization method is proposed to fully fuse semantic features with speaker style features, and a lightweight network module, SKNet, is further introduced to refine the spectral features. SKNet StarGAN consists of five parts: a generator G for generating the spectrum, a discriminator D for judging the source of the spectrum, a classifier C for judging the label attribution of the generated spectrum, a multi-layer perceptron M for generating the speaker style features, and a style encoder S for constraining the speaker style features.

The objective function of the SKNet StarGAN network is:

L_SKNetStarGAN = L_G + L_D

wherein L_G is the loss function of the generator:

L_G = L_adv^G + λ_cyc·L_cyc + λ_ds·L_ds + λ_sty·L_sty + λ_cls·L_cls^G

where λ_cyc, λ_ds, λ_sty and λ_cls are a group of regularization hyper-parameters which respectively represent the weights of the cycle consistency loss, the style diversity loss, the style reconstruction loss and the classification loss, and L_adv^G, L_cyc, L_ds, L_sty and L_cls^G respectively denote the adversarial loss of the generator, the cycle consistency loss, the style diversity loss, the style reconstruction loss of the style encoder and the classification loss of the classifier.

The loss function of the discriminator is:

L_D = L_adv^D + λ_cls·L_cls^D

wherein λ_cls is the weight of the classification loss, and L_adv^D and L_cls^D respectively denote the adversarial loss of the discriminator and the classification loss of the classifier.

1.4) taking the random noise z following a normal distribution and the target speaker label feature c_t as a joint feature (z, c_t) and inputting it into the multi-layer perceptron M to obtain the target speaker style feature s_t.

1.5) taking the extracted source speaker spectral feature x_s and the style feature s_t obtained in 1.4) as a joint feature (x_s, s_t) and inputting it into the generator for training so that its loss function L_G is as small as possible, obtaining the generated target speaker spectral feature x_st.

As shown in fig. 2, the generator adopts a two-dimensional convolutional neural network and consists of an encoding network, a decoding network and several SKNet layers. The encoding network comprises 3 two-dimensional convolution layers with filter sizes (k) of 3 × 9, 4 × 8 and 4 × 8, step sizes (s) of 1 × 1, 2 × 2 and 2 × 2, and filter depths (c) of 64, 128 and 256 respectively; the decoding network comprises 2 two-dimensional deconvolution layers (ConvT2) with filter sizes of 4 × 4, step sizes of 2 × 2, and filter depths of 128 and 64 respectively; the output layer contains 1 two-dimensional convolution with filter size 3 × 9, step size 1 × 1, and filter depth 1. Several SKNet layers are built into the residual network between the encoding network and the decoding network: the output of each layer of the residual network, recalibrated by an SKNet module, is spliced and fed into the next layer. SKNet, short for Selective Kernel Networks, is a lightweight embeddable module whose inspiration is that when viewing objects of different sizes at different distances, visual cortical neurons adjust the size of their receptive fields according to the stimulus.

In this embodiment, 6 SKNet layers are preferably used. The SKNet principle is shown in fig. 3: first, the Split operation divides the network into two branches that are convolved separately; then the Fuse operation sums the outputs of the two branches, converts each two-dimensional feature channel into a real number with a global receptive field through global average pooling, performs dimensionality reduction and expansion through two convolution layers to obtain two groups of channel information, and obtains the weights of the two groups of channels through a Softmax function; finally, the Select operation applies the two groups of channel weights to the corresponding channel features of the two branch outputs to complete the recalibration along the channel dimension, and the two recalibrated outputs are summed and passed to the next layer.
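The following PyTorch sketch illustrates the Split–Fuse–Select flow described above for a two-branch selective-kernel block; the concrete kernel sizes, internal batch normalization and reduction ratio are assumptions made for the example, not values taken from the patent.

```python
import torch
import torch.nn as nn

class SKConv2d(nn.Module):
    """Selective-kernel block sketch: Split into two branches, Fuse by summation and
    global average pooling, Select by softmax channel weights. Kernel sizes, the
    internal normalization and the reduction ratio are illustrative assumptions."""
    def __init__(self, channels=256, reduction=8):
        super().__init__()
        self.branch3 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                     nn.BatchNorm2d(channels), nn.ReLU())
        self.branch5 = nn.Sequential(nn.Conv2d(channels, channels, 5, padding=2),
                                     nn.BatchNorm2d(channels), nn.ReLU())
        mid = max(channels // reduction, 32)
        self.reduce = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.ReLU())  # dimensionality reduction
        self.expand3 = nn.Conv2d(mid, channels, 1)   # per-branch channel logits (expansion)
        self.expand5 = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        u3, u5 = self.branch3(x), self.branch5(x)          # Split: two receptive fields
        u = u3 + u5                                        # Fuse: element-wise sum
        s = u.mean(dim=(2, 3), keepdim=True)               # global average pooling per channel
        z = self.reduce(s)                                 # reduce, then expand per branch
        w = torch.softmax(torch.stack([self.expand3(z), self.expand5(z)], dim=0), dim=0)
        return w[0] * u3 + w[1] * u5                       # Select: re-weighted sum to next layer

x = torch.randn(2, 256, 36, 64)
print(SKConv2d()(x).shape)  # torch.Size([2, 256, 36, 64])
```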

The block structures of the first three SKNet layers are the same, consisting, in order, of a convolution layer (Conv2), an instance normalization layer (Instance Norm), a rectified linear unit (ReLU), an SKNet layer, a convolution layer, and a normalization layer, with a filter size of 3 × 3, a depth of 256 and a step size of 1 × 1. The block structure of the last three SKNet layers differs slightly from that of the first three: the normalization layer is replaced by adaptive instance normalization (AdaIN), which fully fuses the semantic features with the speaker personality features; the filter size is 3 × 3, the depth is 256, and the step size is 1 × 1.
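A minimal sketch of adaptive instance normalization, as used in place of the plain normalization layer in the last three blocks, is given below; the linear projection from a 64-dimensional style vector to per-channel scale and shift parameters is an assumed design, not taken verbatim from the patent.

```python
import torch
import torch.nn as nn

class AdaIN2d(nn.Module):
    """Adaptive instance normalization sketch: the content feature map is instance-
    normalized and then re-scaled/shifted by parameters predicted from the speaker
    style feature, fusing semantic and speaker-personality information."""
    def __init__(self, channels=256, style_dim=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.affine = nn.Linear(style_dim, 2 * channels)   # predicts per-channel gamma, beta

    def forward(self, x, s):
        gamma, beta = self.affine(s).chunk(2, dim=1)       # (B, C) each
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta           # style-conditioned re-normalization

x, s = torch.randn(2, 256, 36, 64), torch.randn(2, 64)
print(AdaIN2d()(x, s).shape)  # torch.Size([2, 256, 36, 64])
```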

The filter sizes of the 3 two-dimensional convolution layers of the encoding network of the generator G are 3 × 9, 4 × 8 and 4 × 8 respectively, the step sizes are 1 × 1, 2 × 2 and 2 × 2 respectively, and the filter depths are 64, 128 and 256 respectively; the filter sizes of the 2 two-dimensional deconvolution layers of the decoding network are all 4 × 4, the step sizes are all 2 × 2, and the filter depths are 128 and 64 respectively; the filter size of the single two-dimensional convolution of the output layer is 3 × 9, the step size is 1 × 1, and the filter depth is 1. The filter sizes of the 5 two-dimensional convolution layers shared by the discriminator D and the classifier C are all 4 × 4, the step sizes are all 2 × 2, and the filter depths are 64, 128, 256, 512 and 1024 respectively; the filter size of the two-dimensional convolution of the output layer of the discriminator D is 1 × 16, the step size is 1 × 1, and the filter depth is 1; the filter size of the two-dimensional convolution of the output layer of the classifier C is 1 × 8, the step size is 1 × 1, and the filter depth equals the number of conversion speakers.

Specifically, as shown in fig. 6, the filter sizes of the 6 one-dimensional convolutions of the style encoder S are 1, 1 and 16 respectively, the step sizes are all 1, and the filter depths are 32, 64, 128, 256, 512 and 512 respectively; the middle layer comprises 5 one-dimensional average pooling layers and 5 residual networks, the filter size of each one-dimensional average pooling layer is 2 with step size 2, each residual network layer comprises 2 one-dimensional convolutions, the filter size of each one-dimensional convolution is 2 with step size 2, and the depth is twice the depth of the previous layer's filter.

As shown in fig. 5, the multi-layer perceptron M comprises 7 linear layers: the input layer has 16 input neurons and 512 output neurons; each of the 5 middle linear layers has 512 input and 512 output neurons; and the output layer has 512 input neurons and 64 output neurons.
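As an illustration, the following sketch builds a mapping network with these layer sizes; the ReLU nonlinearities and the way the 16 input units are split between the noise z and the one-hot label c_t are assumptions made for the example.

```python
import torch
import torch.nn as nn

class MappingMLP(nn.Module):
    """Multi-layer perceptron M: (z, c_t) -> style feature s_t.

    Layer sizes follow the description above (16 -> 512, five 512 -> 512 layers,
    512 -> 64); the activation functions and the split of the 16 input units
    between noise and label are illustrative assumptions."""
    def __init__(self, in_dim=16, hidden=512, style_dim=64):
        super().__init__()
        layers = [nn.Linear(in_dim, hidden), nn.ReLU()]
        for _ in range(5):                                  # five middle linear layers
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers += [nn.Linear(hidden, style_dim)]            # output layer: 64-dim style feature
        self.net = nn.Sequential(*layers)

    def forward(self, z, c_t):
        return self.net(torch.cat([z, c_t], dim=1))

# Example: assumed 8-dim noise and 8-speaker one-hot label form the 16-dim input.
m = MappingMLP()
s_t = m(torch.randn(4, 8), torch.eye(8)[torch.tensor([0, 1, 2, 3])])
print(s_t.shape)  # torch.Size([4, 64])
```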

1.6) inputting the target speaker spectral feature x_st generated in 1.5) and the real target speaker spectral feature x_t obtained from the corpus in 1.2) into the discriminator for training, making the adversarial loss function of the discriminator as small as possible.

As shown in fig. 4, the discriminator employs a two-dimensional convolutional neural network comprising 6 two-dimensional convolution layers: the filter sizes of the first 5 layers are all 4 × 4, with step sizes of 2 × 2 and filter depths of 64, 128, 256, 512 and 1024 respectively; the output layer's two-dimensional convolution has a filter size of 1 × 16, a step size of 1 × 1, and a filter depth of 1.
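A sketch following this layer specification is shown below; the padding, LeakyReLU activations and the 36 × 512 input block size are assumptions chosen so that the tensor shapes work out.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Discriminator sketch following the layer sizes above: five 4x4, stride-2
    convolutions with depths 64/128/256/512/1024 and a 1x16 output convolution of
    depth 1. Padding, activations and input size are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        depths = [1, 64, 128, 256, 512, 1024]
        blocks = []
        for c_in, c_out in zip(depths[:-1], depths[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
        self.body = nn.Sequential(*blocks)
        self.out = nn.Conv2d(1024, 1, kernel_size=(1, 16), stride=1)   # real/fake score

    def forward(self, x):                 # x: (B, 1, 36, 512) Mel-cepstral block
        return self.out(self.body(x))     # (B, 1, 1, 1)

print(Discriminator()(torch.randn(2, 1, 36, 512)).shape)  # torch.Size([2, 1, 1, 1])
```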

The loss function of the discriminator is:

L_D = L_adv^D + λ_cls·L_cls^D

wherein λ_cls is the weight of the classification loss, and L_adv^D and L_cls^D are respectively the adversarial loss of the discriminator and the classification loss of the classifier. The adversarial loss of the discriminator is:

L_adv^D = − E_{x_s}[ log D_s(x_s) ] − E_{x_s, s_t}[ log(1 − D_t(G(x_s, s_t))) ]

wherein D_s(x_s) denotes the discriminator D's judgment of a real spectral feature, C_t(c_t | G(x_s, s_t)) denotes the classifier C's judgment of the label attribution of the generated spectrum, s_t denotes the target speaker style feature generated by the multi-layer perceptron M, i.e. M(z, c_t) = s_t, G(x_s, s_t) denotes the target speaker spectral feature generated by the generator, i.e. x_st, D_t(G(x_s, s_t)) denotes the discriminator's judgment of the generated spectral feature, E_{x_s, s_t}[·] denotes the expectation over the distribution generated by the generator G, and E_{x_s}[·] denotes the expectation over the real data distribution.

The optimization target is to minimize L_D with respect to the discriminator D and the classifier C.

1.7) inputting the obtained target speaker spectral feature x_st into the encoding network of the generator G again to obtain speaker-independent semantic features G(x_st); inputting the source speaker spectral feature x_s and the source speaker label feature c_s into the style encoder S to obtain the style indicating feature ŝ_s of the source speaker; then taking the obtained semantic features G(x_st) and the source speaker style indicating feature ŝ_s as a joint feature (G(x_st), ŝ_s), inputting it into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the reconstructed source speaker spectral feature x̂_s.

The loss function of the generator, minimized during training, includes the adversarial loss of the generator, the cycle consistency loss, the style reconstruction loss of the style encoder, the style diversity loss and the classification loss of the classifier. The cycle consistency loss requires that the reconstructed source speaker spectral feature x̂_s, obtained after the source speaker spectral feature x_s passes through the generator G, stays as consistent as possible with x_s; the style reconstruction loss constrains the multi-layer perceptron to generate a style feature s_t that better matches the target speaker; the style diversity loss ensures that the generator can realize multi-speaker conversion; and the classification loss is the loss on the probability with which the classifier judges that the generated target speaker spectrum x_st belongs to the label c_t.

The loss function of the generator is:

L_G = L_adv^G + λ_cyc·L_cyc + λ_ds·L_ds + λ_sty·L_sty + λ_cls·L_cls^G

The optimization target is to minimize L_G.

wherein λ_cyc, λ_ds, λ_sty and λ_cls are a set of regularization hyper-parameters, respectively representing the weights of the cycle consistency loss, the style diversity loss, the style reconstruction loss and the classification loss.

L_adv^G represents the adversarial loss of the generator in the GAN:

L_adv^G = − E_{x_s, s_t}[ log D_t(G(x_s, s_t)) ]

wherein E_{x_s, s_t}[·] denotes the expectation over the distribution generated by the generator, s_t denotes the target speaker style feature generated by the multi-layer perceptron M, i.e. M(z, c_t) = s_t, and G(x_s, s_t) denotes the spectral feature generated by the generator. L_adv^G and the adversarial loss L_adv^D of the discriminator together form the usual adversarial loss in a GAN, which is used to discriminate whether the spectrum input to the discriminator is a real spectrum or a generated one. During training, L_adv^G is made as small as possible, so the generator is continuously optimized until it produces spectral features G(x_s, s_t) that the discriminator can hardly distinguish from real ones.

L_cyc is the cycle consistency loss of the generator G:

L_cyc = E_{x_s, s_t, ŝ_s}[ || x_s − G(G(x_s, s_t), ŝ_s) ||_1 ]

wherein ŝ_s denotes the style indicating feature of the source speaker, i.e. the output of the style encoder S given x_s and c_s, G(G(x_s, s_t), ŝ_s) is the reconstructed source speaker spectral feature x̂_s, and the expectation is taken of the loss between the reconstructed source speaker spectrum and the true source speaker spectrum. During training of the generator, L_cyc is made as small as possible, so that when the generated target spectrum G(x_s, s_t) and the source speaker style indicating feature ŝ_s are input into the generator again, the reconstructed source speaker spectrum is as similar as possible to x_s. Training with L_cyc effectively ensures that the semantic features of the speaker's voice are not lost after encoding by the generator.

L_ds is the style diversity loss, which ensures that the generator can realize multi-speaker conversion:

L_ds = − E_{x_s, s_t1, s_t2}[ || G(x_s, s_t1) − G(x_s, s_t2) ||_1 ]

wherein z_1 and z_2 are random noise following a normal distribution, and s_t1 and s_t2 are target speaker style features generated by the multi-layer perceptron M, i.e. M(z_1, c_t) = s_t1 and M(z_2, c_t) = s_t2. During training, L_ds is made as small as possible, so that conversion from multiple speakers to multiple speakers is achieved.

L_sty is the style reconstruction loss of the style encoder S, used to optimize the style feature s_t:

L_sty = E_{x_s, s_t}[ || s_t − S(G(x_s, s_t)) ||_1 ]

wherein s_t denotes the target speaker style feature generated by the multi-layer perceptron M, and G(x_s, s_t) denotes the target speaker spectral feature generated by the generator. The generated target speaker spectral feature G(x_s, s_t) is input into the style encoder S to obtain a reconstructed style feature, and the absolute difference between it and the style feature s_t generated by the multi-layer perceptron M is computed. During training, L_sty is made as small as possible, so that the style feature s_t generated by the multi-layer perceptron M fully expresses the personality characteristics of the target speaker.

L_cls^G is the classification loss of the classifier C:

L_cls^G = − E_{x_s, s_t}[ log C_t(c_t | G(x_s, s_t)) ]

wherein C_t(c_t | G(x_s, s_t)) is the probability with which the classifier attributes the generated spectrum to the label c_t; L_cls^G is made as small as possible, minimizing the loss function of the classifier.
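Putting these terms together, the sketch below computes one weighted generator objective from tensors assumed to come from a single forward pass; the tensor shapes, default weight values and the binary/categorical cross-entropy forms of the adversarial and classification terms are assumptions made for illustration, not the patent's exact formulas.

```python
import torch
import torch.nn.functional as F

def generator_losses(x_s, x_s_rec, s_t, s_rec, d_fake, cls_fake, c_t_idx,
                     x_st_1, x_st_2,
                     lam_cyc=10.0, lam_sty=1.0, lam_ds=1.0, lam_cls=1.0):
    """Weighted generator objective sketch.

    Assumed inputs from one forward pass:
      x_s      : source spectra                          (B, 1, 36, 512)
      x_s_rec  : reconstructed source spectra x̂_s        (B, 1, 36, 512)
      s_t      : style feature from the perceptron M     (B, 64)
      s_rec    : style re-extracted by S from G(x_s,s_t) (B, 64)
      d_fake   : discriminator logits on G(x_s, s_t)     (B, 1)
      cls_fake : classifier logits on G(x_s, s_t)        (B, n_speakers)
      c_t_idx  : target speaker indices (long tensor)    (B,)
      x_st_1/2 : conversions of x_s with two styles s_t1, s_t2
    """
    l_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))  # fool D
    l_cyc = torch.mean(torch.abs(x_s - x_s_rec))           # cycle consistency
    l_sty = torch.mean(torch.abs(s_t - s_rec))             # style reconstruction
    l_ds  = -torch.mean(torch.abs(x_st_1 - x_st_2))        # style diversity (encourage difference)
    l_cls = F.cross_entropy(cls_fake, c_t_idx)             # attribution of the fake spectrum to c_t
    return l_adv + lam_cyc * l_cyc + lam_ds * l_ds + lam_sty * l_sty + lam_cls * l_cls
```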

1.8) repeating steps 1.4)-1.7) until the set number of iterations is reached, thus obtaining the trained SKNet StarGAN network, in which the generator parameters φ, the discriminator parameters θ, the classifier parameters ψ, the multi-layer perceptron parameters and the style encoder parameters are the trained parameters. The number of iterations varies with the specific configuration of the neural network and the performance of the experimental equipment; in this experiment it was set to 200000.

1.9) A fundamental frequency conversion relation is established using the mean and standard deviation of the logarithmic fundamental frequency log f_0: the mean and standard deviation of the logarithmic fundamental frequency of each speaker are computed, and the source speaker's logarithmic fundamental frequency log f_0s is converted into the target speaker's logarithmic fundamental frequency log f_0t' through a linear transformation in the logarithmic domain.

The fundamental frequency conversion function is:

log f_0t' = (σ_t / σ_s) · (log f_0s − μ_s) + μ_t

wherein μ_s and σ_s are respectively the mean and standard deviation of the source speaker's fundamental frequency in the logarithmic domain, and μ_t and σ_t are respectively the mean and standard deviation of the target speaker's fundamental frequency in the logarithmic domain.
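A small sketch of this log-domain linear conversion is given below; the statistics in the example call are invented for illustration.

```python
import numpy as np

def convert_f0(f0_source, mu_s, sigma_s, mu_t, sigma_t):
    """Log-domain linear F0 conversion matching the formula above.

    mu_*/sigma_* are the log-F0 mean and standard deviation gathered over the
    voiced frames of each speaker's training corpus; unvoiced frames (f0 == 0)
    are left untouched."""
    f0_converted = np.zeros_like(f0_source)
    voiced = f0_source > 0
    log_f0 = np.log(f0_source[voiced])
    f0_converted[voiced] = np.exp((log_f0 - mu_s) / sigma_s * sigma_t + mu_t)
    return f0_converted

# Example with assumed statistics for one source/target speaker pair.
f0 = np.array([0.0, 120.0, 125.0, 0.0, 118.0])
print(convert_f0(f0, mu_s=np.log(120), sigma_s=0.15, mu_t=np.log(200), sigma_t=0.20))
```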

The implementation steps of the conversion stage are as follows:

2.1) passing the source speaker's speech through the WORLD speech analysis/synthesis model to extract the spectral features x_s', the aperiodic features and the fundamental frequency of the source speaker's different sentences.

2.2) inputting random noise z' following a normal distribution and the target speaker label feature c_t' into the multi-layer perceptron M to obtain the target speaker style feature s_t'.

2.3) taking the source speaker spectral feature x_s' extracted in 2.1) and the target speaker style feature s_t' obtained in 2.2) as a joint feature (x_s', s_t') and inputting it into the SKNet StarGAN network trained in 1.8) to obtain the target speaker spectral feature x_st'.

2.4) converting the fundamental frequency of the source speaker extracted in the step 2.1) into the fundamental frequency of the target speaker through the fundamental frequency conversion function obtained in the step 1.9).

2.5) synthesizing the target speaker spectral feature x_st' generated in 2.3), the target speaker fundamental frequency obtained in 2.4) and the aperiodic feature extracted in 2.1) through the WORLD speech analysis/synthesis model to obtain the converted speech.
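The synthesis step can be sketched as follows with pyworld; the array layouts, dtype casts and output path are illustrative assumptions.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def synthesize(mcep_converted, f0_converted, ap, fs, fft_size=1024, out_path="converted.wav"):
    """Rebuild the waveform with WORLD from the converted MCEP, the converted F0 and
    the source aperiodicity, as in step 2.5)."""
    sp = pw.decode_spectral_envelope(
        np.ascontiguousarray(mcep_converted, dtype=np.float64), fs, fft_size)  # MCEP -> 513-dim envelope
    wav = pw.synthesize(
        np.ascontiguousarray(f0_converted, dtype=np.float64),
        sp,
        np.ascontiguousarray(ap, dtype=np.float64),
        fs)
    sf.write(out_path, wav, fs)
    return wav
```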

For example, fig. 7a and 7b are the spectrograms of a pair of source speaker speech and target speaker speech under same-gender conversion, and fig. 7d and 7c are the spectrograms of the speech synthesized by the model of the invention and by the reference StarGAN model, respectively.

Fig. 8a and 8b are the spectrograms of a pair of source speech and target speech under cross-gender conversion, and fig. 8d and 8c are the spectrograms of the speech synthesized by the model of the invention and by the reference StarGAN model, respectively. To illustrate the advantages of the proposed method in detail, spectral feature details in the three comparison boxes are selected for comparison; the figures show that the spectral details of the speech converted by the invention are more similar to those of the target speech. As shown in fig. 9, the method of the invention converges faster and has a smaller reconstruction loss as the number of iterations increases.

On the other hand, the invention also provides an SKNet StarGAN many-to-many voice conversion system based on speaker style feature modeling, which comprises a training stage and a conversion stage, wherein the training stage comprises:

the corpus acquiring module is used for acquiring a training corpus, wherein the training corpus consists of corpora of a plurality of speakers, and the speakers comprise source speakers and target speakers;

the preprocessing module is used for extracting the frequency spectrum characteristic x of each speaker voice in the training corpus;

a network training module for inputting the spectral feature x of each speaker's voice, the source speaker label c_s, the target speaker label c_t and random noise z following a normal distribution into an SKNet StarGAN network for training, wherein the SKNet StarGAN network comprises a generator G, a discriminator D, a classifier C, a style encoder S and a multi-layer perceptron M, the generator G comprises an encoding network, a decoding network and at least one SKNet layer, and the SKNet layer is built in a residual network between the encoding network and the decoding network;

in the training process, the loss function of the generator G and the loss function of the discriminator D are made as small as possible until the set number of iterations is reached, so that the trained SKNet StarGAN network is obtained;

the function construction module is used for constructing a fundamental frequency conversion function from the voice fundamental frequency of the source speaker to the voice fundamental frequency of the target speaker;

the conversion stage comprises:

a source speech processing module for extracting the spectral feature x_s', the aperiodic feature and the fundamental frequency feature from the source speaker's speech in the corpus to be converted;

a conversion module for inputting the source speaker spectral feature x_s', the target speaker label feature c_t' and random noise z' obeying a normal distribution into the trained SKNet StarGAN network to obtain the target speaker spectral feature x_st';

The target characteristic acquisition module is used for converting the extracted fundamental frequency characteristic of the source speaker into the fundamental frequency characteristic of the target speaker by using the obtained fundamental frequency conversion function;

a speaker voice acquisition module for synthesizing the generated target speaker spectral feature x_st', the target speaker fundamental frequency feature and the aperiodic feature through the WORLD speech analysis/synthesis model to obtain the converted speech.

The embodiments of the present invention, if implemented in the form of software functional modules and sold or used as independent products, may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.

Accordingly, embodiments of the present invention also provide a computer storage medium having a computer program stored thereon. The computer program, when executed by a processor, implements the aforementioned SKNet StarGAN many-to-many voice conversion method based on speaker style feature modeling. For example, the computer storage medium is a computer-readable storage medium.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
