Many-to-many speaker conversion method based on improved STARGAN and x vectors

Document No. 1639666 | Published: 2019-12-20

Note: This technique, "Many-to-many speaker conversion method based on improved STARGAN and x-vectors," was devised by Li Yanping, Cao Pan, and Zhang Yan on 2019-09-17. Its main content is as follows: The invention discloses a many-to-many speaker conversion method based on an improved STARGAN and x-vectors, comprising a training phase and a conversion phase. An improved STARGAN is combined with x-vectors to realize a voice conversion system, further improving STARGAN for voice conversion applications. The proposed two-step adversarial loss effectively alleviates the over-smoothing caused by the L1-based cycle-consistency loss, and the generator adopts a 2-1-2D CNN network, which improves the model's ability to learn semantics and to synthesize speech spectra, overcoming the poor similarity and naturalness of speech converted by STARGAN. Meanwhile, the x-vector characterizes short utterances better and fully represents a speaker's individual characteristics, yielding a high-quality many-to-many voice conversion method under non-parallel text conditions.

1. A many-to-many speaker conversion method based on improved STARGAN and x-vectors, comprising a training phase and a conversion phase, the training phase comprising the steps of:

(1.1) acquiring a training corpus, wherein the training corpus consists of corpora of a plurality of speakers and comprises a source speaker and a target speaker;

(1.2) extracting the spectral envelope feature x and the fundamental frequency feature of each speaker from the training corpus through a WORLD speech analysis/synthesis model, and extracting an x-vector, X-vector, characterizing each speaker's individual characteristics;

(1.3) inputting the source speaker's spectral envelope feature x_s, the target speaker's spectral envelope feature x_t, the source speaker label c_s and x-vector X-vector_s, and the target speaker label c_t and x-vector X-vector_t into a STARGAN-X network for training, wherein the STARGAN-X network consists of a generator G, a discriminator D and a classifier C; the generator G adopts a 2-1-2D network structure and consists of an encoding network, a decoding network and ResNet layers, the encoding network and the decoding network adopt two-dimensional convolutional neural networks, at least 1 ResNet layer is built between the encoding network and the decoding network, and the ResNet layers adopt a one-dimensional convolutional neural network;

(1.4) during training, making the loss function of the generator G, the loss function of the discriminator D and the loss function of the classifier C as small as possible until the set number of iterations is reached, obtaining a trained STARGAN-X network;

(1.5) constructing a fundamental frequency conversion function from the voice fundamental frequency of the source speaker to the voice fundamental frequency of the target speaker;

the conversion phase comprises the steps of:

(2.1) extracting the spectral envelope feature x_s', the aperiodic feature, and the fundamental frequency from the source speaker's speech in the corpus to be converted through the WORLD speech analysis/synthesis model;

(2.2) inputting the source speaker's spectral envelope feature x_s', the target speaker label feature c_t', and the target speaker x-vector X-vector_t' into the STARGAN-X network trained in (1.4), reconstructing the target speaker's spectral envelope feature x_tc';

(2.3) converting the fundamental frequency of the source speaker extracted in the step (2.1) into the fundamental frequency of the target speaker through the fundamental frequency conversion function obtained in the step (1.5);

(2.4) synthesizing the converted speaker speech through the WORLD speech analysis/synthesis model from the target speaker's spectral envelope feature x_tc' obtained in (2.2), the converted fundamental frequency obtained in (2.3), and the aperiodic feature extracted in (2.1).

2. The improved STARGAN and x-vector based many-to-many speaker conversion method according to claim 1, wherein: 6 ResNet layers are built between the encoding network and the decoding network of the generator G.

3. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 1, wherein the training process in steps (1.3) and (1.4) comprises the steps of:

(1) inputting the source speaker's spectral envelope feature x_s into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s);

(2) inputting the obtained semantic features G(x_s), together with the target speaker label features c_t and the target speaker x-vector X-vector_t, into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the target speaker's spectral envelope feature x_tc;

(3) inputting the obtained target speaker spectral envelope feature x_tc into the encoding network of the generator G again to obtain speaker-independent semantic features G(x_tc);

(4) inputting the obtained semantic features G(x_tc), the source speaker label features c_s, and the source speaker x-vector X-vector_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the reconstructed source speaker spectral envelope feature x_sc; then inputting x_sc and the source speaker label features c_s into the discriminator D for training, minimizing the loss function of the discriminator D;

(5) inputting the generated target speaker spectral envelope feature x_tc, the target speaker spectral feature x_t, and the target speaker label features c_t into the discriminator D for training, minimizing the loss function of the discriminator D;

(6) inputting the generated target speaker spectral envelope feature x_tc and the target speaker's spectral envelope feature x_t into the classifier C for training, minimizing the loss function of the classifier C;

(7) returning to step (1) and repeating the above steps until the number of iterations is reached, thereby obtaining the trained STARGAN-X network.

4. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 1, wherein the input process in step (2.2) comprises the steps of:

(1) inputting the source speaker's spectral envelope feature x_s' into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s)';

(2) inputting the obtained semantic features G(x_s)', together with the target speaker label features c_t' and the target speaker x-vector X-vector_t', into the decoding network of the generator G to obtain the target speaker's spectral envelope feature x_tc'.

5. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 1, wherein the loss function of the generator G is:

L_G = L_adv^G + λ_cls L_cls^G + λ_cyc L_cyc(G) + λ_id L_id(G)

wherein λ_cls ≥ 0, λ_cyc ≥ 0 and λ_id ≥ 0 are regularization parameters representing the weights of the classification loss, the cycle-consistency loss, and the feature mapping loss, respectively; L_adv^G, L_cls^G, L_cyc(G) and L_id(G) respectively denote the two-step adversarial loss of the generator, the classification loss with which the classifier optimizes the generator, the cycle-consistency loss, and the feature mapping loss;

the discriminator D adopts a two-dimensional convolution neural network, and the loss function is as follows:

L_D = L_adv1^D + L_adv2^D

L_adv1^D = −E_{x_t}[log D(x_t, c_t)] − E_{G}[log(1 − D(G(x_s, c_t, X-vector_t), c_t))]

L_adv2^D = −E_{x_s}[log D(x_s, c_s)] − E_{rec}[log(1 − D(G(G(x_s, c_t, X-vector_t), c_s), c_s))]

wherein L_adv1^D denotes the one-step adversarial loss of the discriminator D and L_adv2^D denotes the two-step adversarial loss of the discriminator; D(x_s, c_s) and D(x_t, c_t) respectively denote the discriminator D discriminating the real source and target spectral features; G(x_s, c_t, X-vector_t) denotes the target speaker spectral feature generated by the generator G; D(G(x_s, c_t, X-vector_t), c_t) denotes the discriminator discriminating the generated spectral feature; D(G(G(x_s, c_t, X-vector_t), c_s), c_s) denotes the discriminator discriminating the reconstructed source speaker spectral feature; E_{G}[·] denotes the expectation over the probability distribution generated by the generator G, E_{x_t}[·] and E_{x_s}[·] denote expectations over the real probability distribution, and E_{rec}[·] denotes the expectation over the probability distribution of the reconstructed source speaker spectral features;

the classifier C adopts a two-dimensional convolution neural network, and the loss function is as follows:

L_C = −E_{x_t}[log p_C(c_t | x_t)]

wherein p_C(c_t | x_t) denotes the probability with which the classifier C judges the real target speaker spectral feature x_t to belong to the label c_t.

6. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 5, wherein:

L_adv^G = L_adv1^G + L_adv2^G

wherein L_adv1^G denotes the one-step adversarial loss of the generator and L_adv2^G denotes the two-step adversarial loss of the generator;

L_adv1^G = −E_{G}[log D(G(x_s, c_t, X-vector_t), c_t)]

L_adv2^G = −E_{rec}[log D(G(G(x_s, c_t, X-vector_t), c_s), c_s)]

wherein E_{G}[·] denotes the expectation over the probability distribution generated by the generator, and G(x_s, c_t, X-vector_t) denotes the spectral features generated by the generator;

L_cls^G = −E_{G}[log p_C(c_t | G(x_s, c_t, X-vector_t))]

wherein p_C(c_t | G(x_s, c_t, X-vector_t)) denotes the probability with which the classifier judges the generated target speaker spectrum to belong to the label c_t, and G(x_s, c_t, X-vector_t) denotes the target speaker spectrum generated by the generator;

L_cyc(G) = E[ ‖ G(G(x_s, c_t, X-vector_t), c_s) − x_s ‖_1 ]

wherein G(G(x_s, c_t, X-vector_t), c_s) is the reconstructed source speaker spectral feature, and E[·] denotes the expected loss between the reconstructed source speaker spectrum and the real source speaker spectrum;

L_id(G) = E[ ‖ G(x_s, c_s, X-vector_s) − x_s ‖_1 ]

wherein G(x_s, c_s, X-vector_s) is the source speaker spectral feature obtained after the source speaker spectrum, speaker label and x-vector are input into the generator, and E[·] denotes the expected loss between x_s and G(x_s, c_s, X-vector_s).

7. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 5, wherein: the encoding network of the generator G comprises 5 convolutional layers with filter sizes of 3×9, 4×8, 4×8, 3×5 and 9×5, strides of 1×1, 2×2, 2×2, 1×1 and 9×1, and filter depths of 32, 64, 128, 64 and 5, respectively; the decoding network of the generator G comprises 5 deconvolution layers with filter sizes of 9×5, 3×5, 4×8, 4×8 and 3×9, strides of 9×1, 1×1, 2×2, 2×2 and 1×1, and filter depths of 64, 128, 64, 32 and 1, respectively.

8. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 5, wherein: the discriminator D comprises 5 convolutional layers with filter sizes of 3×9, 3×8, 3×8, 3×6 and 36×5, strides of 1×1, 1×2, 1×2, 1×2 and 36×1, and filter depths of 32, 32, 32, 32 and 1, respectively.

9. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 5, wherein: the classifier C comprises 5 convolutional layers with filter sizes of 4×4, 4×4, 4×4, 3×4 and 1×4, strides of 2×2, 2×2, 2×2, 1×2 and 1×2, and filter depths of 8, 16, 32, 16 and 4, respectively.

10. The improved STARGAN and x-vector based many-to-many speaker conversion method according to any of claims 1 to 9, wherein the fundamental frequency conversion function is as follows:

log f_0t' = (σ_t / σ_s) · (log f_0s − μ_s) + μ_t

wherein μ_s and σ_s are respectively the mean and variance of the source speaker's fundamental frequency in the logarithmic domain, μ_t and σ_t are respectively the mean and variance of the target speaker's fundamental frequency in the logarithmic domain, log f_0s is the logarithmic fundamental frequency of the source speaker, and log f_0t' is the converted logarithmic fundamental frequency.

Technical Field

The invention relates to a many-to-many speaker conversion method, in particular to a many-to-many speaker conversion method based on improved STARGAN and x vectors.

Background

Speech conversion is a research branch in the field of speech signal processing, developed and extended from research on speech analysis, recognition and synthesis. The goal of speech conversion is to change the personality characteristics of the source speaker's speech so that it carries the personality characteristics of the target speaker, that is, to make the speech of one person sound, after conversion, as if it were spoken by another person, while preserving the semantic content.

After years of research, many classical conversion methods have emerged, including speech conversion methods based on the Gaussian Mixture Model (GMM), the Recurrent Neural Network (RNN), the Deep Neural Network (DNN), and the like. However, most of these speech conversion methods require the training corpus to be parallel text, that is, the source speaker and the target speaker must utter sentences with the same linguistic content and speech duration, with pronunciation rhythm and emotion kept as consistent as possible. Collecting such data is time consuming, and even when parallel data are obtained the difficulty remains, because most speech conversion methods rely on accurate time alignment of the data. Alignment is a difficult process, and inaccurately aligned speech feature parameters in parallel data make alignment accuracy during training a bottleneck for conversion performance. In addition, parallel speech cannot be obtained at all in practical applications such as cross-language conversion and voice conversion for medical assistance of patients. Therefore, considering the universality and practicality of speech conversion systems, research on speech conversion under non-parallel text conditions has great practical significance and application value.

Existing speech conversion methods under non-parallel text conditions include those based on the Cycle-Consistent Adversarial Network (Cycle-GAN) and those based on the Conditional Variational Auto-Encoder (C-VAE). The speech conversion method based on the C-VAE model directly uses the speaker's identity label to build a conversion system: the encoder separates the semantics from the speaker's personal information, and the decoder reconstructs speech from the semantics and the speaker identity label, thereby removing the dependence on parallel text. However, because C-VAE rests on the idealized assumption that the observed data follow a Gaussian distribution, the decoder output is over-smoothed and the quality of the converted speech is poor. The speech conversion method based on the Cycle-GAN model uses an adversarial loss and a cycle-consistency loss to learn the forward and inverse mappings of the acoustic features simultaneously, which effectively alleviates the over-smoothing problem and improves the quality of the converted speech, but Cycle-GAN can only realize one-to-one voice conversion.

The speech conversion method based on the Star Generative Adversarial Network (STARGAN) model combines the advantages of C-VAE and Cycle-GAN: its generator has an encoder-decoder structure, it can learn many-to-many mappings simultaneously, and the attributes of the generator output are controlled by the speaker identity label, so many-to-many voice conversion under non-parallel text conditions can be realized. In STARGAN, introducing an adversarial loss effectively alleviates the over-smoothing caused by statistical averaging, but the L1-based cycle-consistency loss still causes over-smoothing. The generator uses a two-dimensional convolutional neural network, which cannot effectively capture the temporal information of speech. More importantly, the encoding network and the decoding network in the generator are independent: the encoding network alone cannot cleanly separate the semantic features from the speaker's personalized features, and the decoding network alone cannot synthesize them well. Moreover, in this method the speaker identity label cannot fully express the speaker's personalized characteristics, so the converted speech still needs improvement in both speech quality and personality similarity.

Disclosure of Invention

The purpose of the invention is as follows: the technical problem to be solved by the invention is to provide a many-to-many speaker conversion method based on improved STARGAN and x-vectors. The proposed two-step adversarial loss effectively alleviates the over-smoothing caused by the L1-based cycle-consistency loss, and the generator adopts a 2-1-2D CNN network, which improves the model's ability to learn semantics and to synthesize speech spectra, overcomes the poor similarity and naturalness of speech converted by STARGAN, reduces the difficulty of semantic learning for the encoding network, and improves the spectrum generation quality of the decoding network. The x-vector fully characterizes the speakers' personalized features and effectively improves the personality similarity of the converted speech.

The technical scheme is as follows: the invention discloses a method for many-to-many speaker conversion based on improved STARGAN and x vectors, which comprises a training phase and a conversion phase, wherein the training phase comprises the following steps:

(1.1) acquiring a training corpus, wherein the training corpus consists of corpora of a plurality of speakers and comprises a source speaker and a target speaker;

(1.2) extracting the spectral envelope feature x and the fundamental frequency feature of each speaker from the training corpus through a WORLD speech analysis/synthesis model, and extracting an x-vector, X-vector, characterizing each speaker's individual characteristics;

(1.3) inputting the source speaker's spectral envelope feature x_s, the target speaker's spectral envelope feature x_t, the source speaker label c_s and x-vector X-vector_s, and the target speaker label c_t and x-vector X-vector_t into a STARGAN-X network for training, wherein the STARGAN-X network consists of a generator G, a discriminator D and a classifier C; the generator G adopts a 2-1-2D network structure and consists of an encoding network, a decoding network and ResNet layers, the encoding network and the decoding network adopt two-dimensional convolutional neural networks, at least 1 ResNet layer is built between the encoding network and the decoding network, and the ResNet layers adopt a one-dimensional convolutional neural network;

(1.4) during training, making the loss function of the generator G, the loss function of the discriminator D and the loss function of the classifier C as small as possible until the set number of iterations is reached, obtaining a trained STARGAN-X network;

(1.5) constructing a fundamental frequency conversion function from the voice fundamental frequency of the source speaker to the voice fundamental frequency of the target speaker;

the conversion phase comprises the steps of:

(2.1) extracting the spectral envelope feature x_s', the aperiodic feature, and the fundamental frequency from the source speaker's speech in the corpus to be converted through the WORLD speech analysis/synthesis model;

(2.2) inputting the source speaker's spectral envelope feature x_s', the target speaker label feature c_t', and the target speaker x-vector X-vector_t' into the STARGAN-X network trained in (1.4), reconstructing the target speaker's spectral envelope feature x_tc';

(2.3) converting the fundamental frequency of the source speaker extracted in the step (2.1) into the fundamental frequency of the target speaker through the fundamental frequency conversion function obtained in the step (1.5);

(2.4) synthesizing the converted speaker speech through the WORLD speech analysis/synthesis model from the target speaker's spectral envelope feature x_tc' obtained in (2.2), the converted fundamental frequency obtained in (2.3), and the aperiodic feature extracted in (2.1).

Further, 6 ResNet layers are built between the encoding network and the decoding network of the generator G.

Further, the training process in steps (1.3) and (1.4) comprises the following steps:

(1) inputting the source speaker's spectral envelope feature x_s into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s);

(2) inputting the obtained semantic features G(x_s), together with the target speaker label features c_t and the target speaker x-vector X-vector_t, into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the target speaker's spectral envelope feature x_tc;

(3) inputting the obtained target speaker spectral envelope feature x_tc into the encoding network of the generator G again to obtain speaker-independent semantic features G(x_tc);

(4) inputting the obtained semantic features G(x_tc), the source speaker label features c_s, and the source speaker x-vector X-vector_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the reconstructed source speaker spectral envelope feature x_sc; then inputting x_sc and the source speaker label features c_s into the discriminator D for training, minimizing the loss function of the discriminator D;

(5) inputting the generated target speaker spectral envelope feature x_tc, the target speaker spectral feature x_t, and the target speaker label features c_t into the discriminator D for training, minimizing the loss function of the discriminator D;

(6) inputting the generated target speaker spectral envelope feature x_tc and the target speaker's spectral envelope feature x_t into the classifier C for training, minimizing the loss function of the classifier C;

(7) returning to step (1) and repeating the above steps until the number of iterations is reached, thereby obtaining the trained STARGAN-X network; a high-level sketch of this training loop is given below.
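The following is a minimal Python sketch of the alternating training loop described in steps (1)-(7), assuming PyTorch-style modules G, D and C and assumed helper functions g_loss_fn, d_loss_fn and c_loss_fn that implement the loss functions defined later in this document; all names, the optimizer choice and the learning rate are illustrative, not the patent's own implementation.

```python
import torch

def train_stargan_x(G, D, C, loader, num_iters=20000, lr=1e-4):
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    opt_c = torch.optim.Adam(C.parameters(), lr=lr)
    it = 0
    while it < num_iters:
        for x_s, c_s, xvec_s, x_t, c_t, xvec_t in loader:
            # steps (1)-(2): encode x_s, decode with (c_t, X-vector_t) -> x_tc
            # steps (3)-(4): re-encode x_tc, decode with (c_s, X-vector_s) -> x_sc
            with torch.no_grad():
                x_tc = G(x_s, c_t, xvec_t)
                x_sc = G(x_tc, c_s, xvec_s)

            # steps (4)-(5): train the discriminator on (x_sc, c_s) and (x_tc, x_t, c_t)
            opt_d.zero_grad()
            d_loss_fn(D, x_s, x_t, x_tc, x_sc, c_s, c_t).backward()
            opt_d.step()

            # step (6): train the classifier on (x_tc, x_t) with label c_t
            opt_c.zero_grad()
            c_loss_fn(C, x_t, x_tc, c_t).backward()
            opt_c.step()

            # steps (2) and (4): train the generator (two-step adversarial + cyc/cls/id losses)
            opt_g.zero_grad()
            g_loss_fn(G, D, C, x_s, c_s, xvec_s, c_t, xvec_t).backward()
            opt_g.step()

            it += 1
            if it >= num_iters:
                return
```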

Further, the input process in step (2.2) comprises the following steps:

(1) inputting the source speaker's spectral envelope feature x_s' into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s)';

(2) inputting the obtained semantic features G(x_s)', together with the target speaker label features c_t' and the target speaker x-vector X-vector_t', into the decoding network of the generator G to obtain the target speaker's spectral envelope feature x_tc'.

Further, the loss function of the generator G is:

L_G = L_adv^G + λ_cls L_cls^G + λ_cyc L_cyc(G) + λ_id L_id(G)

wherein λ_cls ≥ 0, λ_cyc ≥ 0 and λ_id ≥ 0 are regularization parameters representing the weights of the classification loss, the cycle-consistency loss, and the feature mapping loss, respectively; L_adv^G, L_cls^G, L_cyc(G) and L_id(G) respectively denote the two-step adversarial loss of the generator, the classification loss with which the classifier optimizes the generator, the cycle-consistency loss, and the feature mapping loss;

the discriminator D adopts a two-dimensional convolution neural network, and the loss function is as follows:

L_D = L_adv1^D + L_adv2^D

L_adv1^D = −E_{x_t}[log D(x_t, c_t)] − E_{G}[log(1 − D(G(x_s, c_t, X-vector_t), c_t))]

L_adv2^D = −E_{x_s}[log D(x_s, c_s)] − E_{rec}[log(1 − D(G(G(x_s, c_t, X-vector_t), c_s), c_s))]

wherein L_adv1^D denotes the one-step adversarial loss of the discriminator D and L_adv2^D denotes the two-step adversarial loss of the discriminator; D(x_s, c_s) and D(x_t, c_t) respectively denote the discriminator D discriminating the real source and target spectral features; G(x_s, c_t, X-vector_t) denotes the target speaker spectral feature generated by the generator G; D(G(x_s, c_t, X-vector_t), c_t) denotes the discriminator discriminating the generated spectral feature; D(G(G(x_s, c_t, X-vector_t), c_s), c_s) denotes the discriminator discriminating the reconstructed source speaker spectral feature; E_{G}[·] denotes the expectation over the probability distribution generated by the generator G, E_{x_t}[·] and E_{x_s}[·] denote expectations over the real probability distribution, and E_{rec}[·] denotes the expectation over the probability distribution of the reconstructed source speaker spectral features;

the classifier C adopts a two-dimensional convolution neural network, and the loss function is as follows:

L_C = −E_{x_t}[log p_C(c_t | x_t)]

wherein p_C(c_t | x_t) denotes the probability with which the classifier C judges the real target speaker spectral feature x_t to belong to the label c_t.

Further, the terms above are defined as follows:

L_adv^G = L_adv1^G + L_adv2^G

wherein L_adv1^G denotes the one-step adversarial loss of the generator and L_adv2^G denotes the two-step adversarial loss of the generator;

L_adv1^G = −E_{G}[log D(G(x_s, c_t, X-vector_t), c_t)]

L_adv2^G = −E_{rec}[log D(G(G(x_s, c_t, X-vector_t), c_s), c_s)]

wherein E_{G}[·] denotes the expectation over the probability distribution generated by the generator, and G(x_s, c_t, X-vector_t) denotes the spectral features generated by the generator;

L_cls^G = −E_{G}[log p_C(c_t | G(x_s, c_t, X-vector_t))]

wherein p_C(c_t | G(x_s, c_t, X-vector_t)) denotes the probability with which the classifier judges the generated target speaker spectrum to belong to the label c_t, and G(x_s, c_t, X-vector_t) denotes the target speaker spectrum generated by the generator;

L_cyc(G) = E[ ‖ G(G(x_s, c_t, X-vector_t), c_s) − x_s ‖_1 ]

wherein G(G(x_s, c_t, X-vector_t), c_s) is the reconstructed source speaker spectral feature, and E[·] denotes the expected loss between the reconstructed source speaker spectrum and the real source speaker spectrum;

L_id(G) = E[ ‖ G(x_s, c_s, X-vector_s) − x_s ‖_1 ]

wherein G(x_s, c_s, X-vector_s) is the source speaker spectral feature obtained after the source speaker spectrum, speaker label and x-vector are input into the generator, and E[·] denotes the expected loss between x_s and G(x_s, c_s, X-vector_s).

Further, the encoding network of the generator G comprises 5 convolutional layers with filter sizes of 3×9, 4×8, 4×8, 3×5 and 9×5, strides of 1×1, 2×2, 2×2, 1×1 and 9×1, and filter depths of 32, 64, 128, 64 and 5, respectively; the decoding network of the generator G comprises 5 deconvolution layers with filter sizes of 9×5, 3×5, 4×8, 4×8 and 3×9, strides of 9×1, 1×1, 2×2, 2×2 and 1×1, and filter depths of 64, 128, 64, 32 and 1, respectively.

Further, the discriminator D comprises 5 convolutional layers with filter sizes of 3×9, 3×8, 3×8, 3×6 and 36×5, strides of 1×1, 1×2, 1×2, 1×2 and 36×1, and filter depths of 32, 32, 32, 32 and 1, respectively.

Further, the classifier C comprises 5 convolutional layers with filter sizes of 4×4, 4×4, 4×4, 3×4 and 1×4, strides of 2×2, 2×2, 2×2, 1×2 and 1×2, and filter depths of 8, 16, 32, 16 and 4, respectively.

Further, the fundamental frequency conversion function is:

log f_0t' = (σ_t / σ_s) · (log f_0s − μ_s) + μ_t

wherein μ_s and σ_s are respectively the mean and variance of the source speaker's fundamental frequency in the logarithmic domain, μ_t and σ_t are respectively the mean and variance of the target speaker's fundamental frequency in the logarithmic domain, log f_0s is the logarithmic fundamental frequency of the source speaker, and log f_0t' is the converted logarithmic fundamental frequency.

Advantageous effects: the method combines an improved STARGAN with the x-vector to realize many-to-many speaker voice conversion under both parallel and non-parallel text conditions. On top of the existing network structure, an adversarial loss is additionally applied to the cyclically converted features, so that the adversarial loss is used twice per cycle, i.e., a two-step adversarial loss, which effectively alleviates the over-smoothing caused by the L1-based cycle-consistency loss. The generator adopts a 2-1-2D CNN structure, with ResNet layers built between its encoding and decoding networks, where the main conversion takes place. The 2D CNN is better suited to converting features while keeping the original structure of the speech features, whereas the 1D CNN proposed for the ResNet layers captures the dynamic changes of the speech information better than a 2D CNN; adopting 2D CNNs in the encoding and decoding networks allows features to be captured more broadly. The proposed 2-1-2D CNN structure therefore effectively overcomes the loss of speech features caused by degradation of the STARGAN network, improves the semantic extraction ability of the generator's encoding network, and improves the speech conversion ability of the generator's decoding network. The method is a further improvement of the STARGAN network in speech conversion applications.

In addition, the x-vector has better characterization performance for short utterances and can fully characterize the speaker's individual features, realizing a high-quality voice conversion method. The method realizes voice conversion under non-parallel text conditions, requires no alignment process during training, and improves the universality and practicality of the voice conversion system; furthermore, the conversion systems of multiple source-target speaker pairs are integrated into one conversion model, i.e., many-to-many speaker conversion is realized, which has good application prospects in cross-language voice conversion, film dubbing, speech translation and other fields.

Drawings

FIG. 1 is an overall flow diagram of the present method;

FIG. 2 is a network architecture diagram of the generator of the model STARGAN-X of the present method.

Detailed Description

As shown in fig. 1, the method of the present invention is divided into two parts: the training part is used for obtaining parameters and conversion functions required by voice conversion, and the conversion part is used for realizing the conversion from the voice of a source speaker to the voice of a target speaker.

The training stage comprises the following implementation steps:

1.1) Obtain a non-parallel text training corpus consisting of the corpora of multiple speakers, including source speakers and target speakers. The corpus is taken from the VCC2018 corpus. The training set contains 6 male and 6 female speakers, each with 81 sentences. The method can realize conversion under parallel text as well as non-parallel text, so the training corpus may also be parallel text.

1.2) Extract the spectral envelope feature x, the aperiodic feature, and the logarithmic fundamental frequency log f_0 of each speaker's sentences from the training corpus through the WORLD speech analysis/synthesis model, and at the same time extract the x-vector, X-vector, characterizing each speaker's individual features. Since the Fast Fourier Transform (FFT) length is set to 1024, the obtained spectral envelope feature x and the aperiodic feature are both of 1024/2 + 1 = 513 dimensions. Each speech block has 512 frames, 36-dimensional Mel-cepstral coefficient (MCEP) features are extracted from the spectral envelope features, and 8 speech blocks are taken per training step, so a training batch has dimensions 8 × 36 × 512.
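A minimal sketch of this feature extraction step, assuming the pyworld binding of WORLD and soundfile for I/O (the patent does not name a specific toolkit); the FFT length (1024), MCEP dimension (36) and 512-frame blocks follow the text, while the frame period and random block selection are illustrative assumptions.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

FFT_SIZE = 1024          # gives 1024/2 + 1 = 513-dim spectral envelope / aperiodicity
MCEP_DIM = 36            # dimension of the coded spectral envelope (MCEP) features
FRAME_PERIOD = 5.0       # ms, a typical WORLD analysis setting (assumption)

def world_decompose(wav_path):
    """Extract f0, 513-dim spectral envelope, and 513-dim aperiodicity with WORLD."""
    wav, fs = sf.read(wav_path)
    wav = wav.astype(np.float64)
    f0, t = pw.harvest(wav, fs, frame_period=FRAME_PERIOD)   # fundamental frequency
    sp = pw.cheaptrick(wav, f0, t, fs, fft_size=FFT_SIZE)    # spectral envelope (T, 513)
    ap = pw.d4c(wav, f0, t, fs, fft_size=FFT_SIZE)           # aperiodicity     (T, 513)
    return f0, sp, ap, fs

def spectral_envelope_to_mcep(sp, fs):
    """Code the 513-dim spectral envelope into 36-dim MCEP-like features."""
    return pw.code_spectral_envelope(sp, fs, MCEP_DIM)       # (T, 36)

def make_training_block(mcep, block_len=512):
    """Cut one 512-frame block (frames x dims -> dims x frames) for the 8 x 36 x 512 batch."""
    start = np.random.randint(0, mcep.shape[0] - block_len)
    return mcep[start:start + block_len].T                   # (36, 512)
```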

In practical applications, the speech available from the person to be converted is relatively short, and converting speech with the traditional i-vector speaker representation gives only mediocre results. The x-vector is a novel low-dimensional, fixed-length embedding extracted with a DNN; thanks to the strong feature extraction capability of the DNN, it characterizes short utterances better. The network is implemented in the Kaldi speech recognition toolkit using the nnet3 neural network library. The main difference between the x-vector and the i-vector lies in the extraction method. The structure of the x-vector extraction system is shown in Table 1: it consists of frame-level layers, a stats pooling layer, segment-level layers, and a softmax layer. T denotes all input speech frames and N denotes the number of training speakers; the training corpus is taken from the VCC2018 speech corpus, so N = 12.

TABLE 1. Architecture of the x-vector extraction system

Layer          Layer context     Total context   Input × output
frame1         [t-2, t+2]        5               120 × 512
frame2         {t-2, t, t+2}     9               1536 × 512
frame3         {t-3, t, t+3}     15              1536 × 512
frame4         {t}               15              512 × 512
frame5         {t}               15              512 × 1500
stats pooling  [0, T)            T               1500T × 3000
segment6       {0}               T               3000 × 512
segment7       {0}               T               512 × 512
softmax        {0}               T               512 × N

The DNN in the x-vector system has a time-delay structure: it first splices 5 frames of context into a new frame-level representation, then, taking this as the center, splices further context frames, and so on, until a total context of 15 frames is covered as the input to the DNN. The input features are 23-dimensional MFCCs with a frame length of 25 ms. The stats pooling layer aggregates all T frame-level outputs of the frame5 layer and computes their mean and standard deviation. These statistics (mean and standard deviation) are each 1500-dimensional vectors, computed once per input speech segment, and are passed together to the segment layers. Finally, the softmax layer outputs a posterior probability, with the number of output neurons equal to the number of speakers in the training set. The x-vector system classifies the training speakers using the following formula.

The loss function for DNN network training is:

E = −Σ_n Σ_{k=1}^{N} d_nk · ln P(spk_k | x_n)

where x_n denotes the n-th input utterance, k indexes the speakers, P(spk_k | x_n) is the posterior probability given by the softmax layer that the input speech belongs to speaker k, and d_nk equals 1 only when the speaker of utterance n is k, and 0 otherwise.

The DNN is not only a classifier but a combination of a feature extractor and a classifier, with every layer having strong feature extraction capability. After training, the softmax layer is removed and the remaining structure is used to extract the 512-dimensional x-vector at the segment6 layer, as shown in Table 1. Once the x-vector is extracted, the similarity between x-vectors can be computed with a probabilistic linear discriminant analysis (PLDA) back end, just as for the i-vector.
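The patent builds this network with Kaldi's nnet3; the following PyTorch sketch only illustrates the layer structure of Table 1 (the 23-dim MFCC input, layer contexts realized as dilated 1-D convolutions, stats pooling, and the 512-dim embedding at segment6). Activations, batch normalization and the cross-entropy call are assumptions.

```python
import torch
import torch.nn as nn

class XVectorTDNN(nn.Module):
    def __init__(self, feat_dim=23, num_speakers=12, embed_dim=512):
        super().__init__()
        tdnn = lambda i, o, k, d: nn.Sequential(
            nn.Conv1d(i, o, kernel_size=k, dilation=d), nn.ReLU(), nn.BatchNorm1d(o))
        self.frame1 = tdnn(feat_dim, 512, 5, 1)   # context [t-2, t+2]
        self.frame2 = tdnn(512, 512, 3, 2)        # context {t-2, t, t+2}
        self.frame3 = tdnn(512, 512, 3, 3)        # context {t-3, t, t+3}
        self.frame4 = tdnn(512, 512, 1, 1)        # context {t}
        self.frame5 = tdnn(512, 1500, 1, 1)       # context {t}
        self.segment6 = nn.Linear(2 * 1500, embed_dim)   # the x-vector is taken here
        self.segment7 = nn.Linear(embed_dim, embed_dim)
        self.softmax_layer = nn.Linear(embed_dim, num_speakers)

    def forward(self, mfcc):                      # mfcc: (batch, 23, T)
        h = self.frame5(self.frame4(self.frame3(self.frame2(self.frame1(mfcc)))))
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)   # stats pooling: mean + std
        xvector = self.segment6(stats)            # 512-dim speaker embedding
        logits = self.softmax_layer(torch.relu(self.segment7(torch.relu(xvector))))
        return logits, xvector

# Training uses the multi-class cross-entropy above; after training the softmax
# branch is dropped and `xvector` is used as the speaker representation:
# loss = nn.CrossEntropyLoss()(logits, speaker_ids)
```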

1.3) The STARGAN network in this example builds on the Cycle-GAN model and improves on it by modifying the GAN structure and adding a classifier. STARGAN consists of three parts: a generator G that generates realistic spectra, a discriminator D that judges whether its input is a real spectrum or a generated spectrum, and a classifier C that judges whether the generated spectrum belongs to the label c_t.

The objective function of the STARGAN-X network is composed of the loss functions of the generator, the discriminator and the classifier, each minimized with respect to its own network, wherein I_G(G) is the loss function of the generator:

I_G(G) = L_adv^G + λ_cls L_cls^G + λ_cyc L_cyc(G) + λ_id L_id(G)

wherein λ_cls ≥ 0, λ_cyc ≥ 0 and λ_id ≥ 0 are regularization parameters representing the weights of the classification loss, the cycle-consistency loss, and the feature mapping loss, respectively; L_adv^G, L_cls^G, L_cyc(G) and L_id(G) respectively denote the two-step adversarial loss of the generator, the classification loss with which the classifier optimizes the generator, the cycle-consistency loss, and the feature mapping loss.

The loss function of the discriminator is:

L_D = L_adv1^D + L_adv2^D

L_adv1^D = −E_{x_t}[log D(x_t, c_t)] − E_{G}[log(1 − D(G(x_s, c_t, X-vector_t), c_t))]

L_adv2^D = −E_{x_s}[log D(x_s, c_s)] − E_{rec}[log(1 − D(G(G(x_s, c_t, X-vector_t), c_s), c_s))]

wherein L_adv1^D denotes the one-step adversarial loss of the discriminator, i.e., the loss with which the discriminator distinguishes the generated target spectral features from the real target spectral features, and L_adv2^D denotes the two-step adversarial loss of the discriminator, i.e., the loss with which the discriminator distinguishes the source spectral features reconstructed by passing the generated spectrum through the generator again from the real source spectral features. D(x_s, c_s) and D(x_t, c_t) respectively denote the discriminator D discriminating the real source and target spectral features, G(x_s, c_t, X-vector_t) denotes the target speaker spectral feature generated by the generator G, D(G(x_s, c_t, X-vector_t), c_t) denotes the discriminator discriminating the generated spectral feature, D(G(G(x_s, c_t, X-vector_t), c_s), c_s) denotes the discriminator discriminating the reconstructed source speaker spectral feature, E_{G}[·] denotes the expectation over the probability distribution generated by the generator G, E_{x_t}[·] and E_{x_s}[·] denote expectations over the real probability distribution, and E_{rec}[·] denotes the expectation over the probability distribution of the reconstructed source speaker spectral features.

the loss function of the classifier two-dimensional convolutional neural network is:

wherein p isC(ct|xt) C, representing the characteristic of the classifier for distinguishing the target speaker as a labeltOf the true spectrum of the spectrum.

1.4) Take the source speaker spectral envelope feature x_s extracted in 1.2), the target speaker label feature c_t, and the target speaker x-vector X-vector_t as a joint feature (x_s, c_t, X-vector_t) and input it into the generator for training. The generator is trained to make its loss function L_G as small as possible, obtaining the generated target speaker spectral envelope feature x_tc.

The generator adopts a 2-1-2D CNN structure composed of an encoding network, a decoding network and ResNet layers. The encoding network comprises 5 convolutional layers with filter sizes of 3×9, 4×8, 4×8, 3×5 and 9×5, strides of 1×1, 2×2, 2×2, 1×1 and 9×1, and filter depths of 32, 64, 128, 64 and 5, respectively. The decoding network comprises 5 deconvolution layers with filter sizes of 9×5, 3×5, 4×8, 4×8 and 3×9, strides of 9×1, 1×1, 2×2, 2×2 and 1×1, and filter depths of 64, 128, 64, 32 and 1, respectively. Several ResNet layers, using a one-dimensional convolutional neural network (1D CNN), are built between the encoding network and the decoding network; in this embodiment 6 ResNet layers are preferred.
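A minimal PyTorch sketch of the 2-1-2D generator (2D-CNN encoder, six 1D-CNN ResNet blocks, 2D-CNN decoder). The filter sizes, strides and depths follow the text above (including the repeated 4×8 / stride 2×2 layer, which is an assumption where the listing is collapsed); the paddings, the way the speaker label and x-vector are concatenated at the bottleneck, and the instance normalization are illustrative assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class Res1DBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(ch, ch, kernel_size=3, padding=1), nn.InstanceNorm1d(ch), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=3, padding=1), nn.InstanceNorm1d(ch))

    def forward(self, x):
        return x + self.body(x)                           # residual connection

class Generator212D(nn.Module):
    """2D-CNN encoder -> conditioned 1D-CNN ResNet bottleneck -> 2D-CNN decoder."""
    def __init__(self, num_speakers=12, xvec_dim=512, n_res=6):
        super().__init__()
        cond = num_speakers + xvec_dim
        conv = lambda i, o, k, s, p: nn.Sequential(nn.Conv2d(i, o, k, s, p), nn.ReLU())
        deconv = lambda i, o, k, s, p: nn.Sequential(nn.ConvTranspose2d(i, o, k, s, p), nn.ReLU())
        # 2D encoder over (batch, 1, 36 MCEPs, 512 frames); paddings chosen so shapes divide cleanly
        self.enc = nn.Sequential(
            conv(1, 32, (3, 9), (1, 1), (1, 4)),
            conv(32, 64, (4, 8), (2, 2), (1, 3)),
            conv(64, 128, (4, 8), (2, 2), (1, 3)),
            conv(128, 64, (3, 5), (1, 1), (1, 2)),
            conv(64, 5, (9, 5), (9, 1), (0, 2)))          # -> (batch, 5, 1, 128)
        # 1D ResNet bottleneck; the speaker label + x-vector condition is concatenated here
        self.res = nn.Sequential(*[Res1DBlock(5 + cond) for _ in range(n_res)])
        # 2D decoder mirroring the encoder, back to (batch, 1, 36, 512)
        self.dec = nn.Sequential(
            deconv(5 + cond, 64, (9, 5), (9, 1), (0, 2)),
            deconv(64, 128, (3, 5), (1, 1), (1, 2)),
            deconv(128, 64, (4, 8), (2, 2), (1, 3)),
            deconv(64, 32, (4, 8), (2, 2), (1, 3)),
            nn.ConvTranspose2d(32, 1, (3, 9), (1, 1), (1, 4)))

    def forward(self, x, label_onehot, xvector):
        h = self.enc(x).squeeze(2)                        # (batch, 5, 128) "semantic" sequence
        cond = torch.cat([label_onehot, xvector], dim=1)  # (batch, num_speakers + 512)
        cond = cond.unsqueeze(2).expand(-1, -1, h.size(2))
        h = self.res(torch.cat([h, cond], dim=1))         # 1D residual blocks on conditioned features
        return self.dec(h.unsqueeze(2))                   # (batch, 1, 36, 512)
```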

1.5) Take the generated target speaker spectral envelope feature x_tc obtained in 1.4), the target speaker spectral envelope feature x_t of the training corpus obtained in 1.2), and the target speaker label c_t as the inputs of the discriminator, and train the discriminator so that its adversarial loss function is as small as possible.

The discriminator uses a two-dimensional convolutional neural network comprising 5 convolutional layers with filter sizes of 3×9, 3×8, 3×8, 3×6 and 36×5, strides of 1×1, 1×2, 1×2, 1×2 and 36×1, and filter depths of 32, 32, 32, 32 and 1, respectively.

The loss function of the discriminator is:

L_D = L_adv1^D + L_adv2^D

and the optimization target is to minimize L_D with respect to the discriminator parameters: min_D L_D.

1.6) Input the obtained target speaker spectral envelope feature x_tc into the encoding network of the generator G again to obtain the speaker-independent semantic features G(x_tc); then input the semantic features G(x_tc), the source speaker label feature c_s, and the source speaker x-vector X-vector_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training and obtaining the reconstructed source speaker spectral envelope feature x_sc. The generator loss minimized during training comprises the two-step adversarial loss of the generator, the cycle-consistency loss, the feature mapping loss, and the classification loss. The proposed two-step adversarial loss builds on the adversarial loss of the STARGAN network by further applying an adversarial loss to the cyclically converted features, which effectively alleviates the over-smoothing caused by the L1-based cycle-consistency loss. The cycle-consistency loss is trained so that after the source speaker spectral feature x_s passes through the generator G, the reconstructed source speaker spectral feature x_sc stays as consistent with x_s as possible. The feature mapping loss is trained to guarantee that the speaker label of x_s is still c_s after passing through the generator G. The classification loss is the loss associated with the probability, judged by the classifier, that the target speaker spectrum x_tc generated by the generator belongs to the label c_t.

The loss function of the generator is:

L_G = L_adv^G + λ_cls L_cls^G + λ_cyc L_cyc(G) + λ_id L_id(G)

and the optimization target is to minimize L_G with respect to the generator parameters: min_G L_G,

wherein λ_cls ≥ 0, λ_cyc ≥ 0 and λ_id ≥ 0 are regularization parameters representing the weights of the classification loss, the cycle-consistency loss, and the feature mapping loss, respectively.

L_adv^G = L_adv1^G + L_adv2^G represents the two-step adversarial loss of the generator in the GAN:

L_adv1^G = −E_{G}[log D(G(x_s, c_t, X-vector_t), c_t)]

L_adv2^G = −E_{rec}[log D(G(G(x_s, c_t, X-vector_t), c_s), c_s)]

wherein L_adv1^G denotes the one-step adversarial loss of the generator and L_adv2^G denotes the two-step adversarial loss of the generator; E_{G}[·] denotes the expectation over the probability distribution generated by the generator, and G(x_s, c_t, X-vector_t) denotes the spectral features generated by the generator. L_adv^G and the discriminator loss L_D together form the two-step adversarial loss in STARGAN-X, used to discriminate whether the spectrum input to the discriminator is a real spectrum or a generated spectrum. During training, L_adv^G is made as small as possible, and the generator is continuously optimized until it generates spectral features G(x_s, c_t, X-vector_t) realistic enough that the discriminator finds it difficult to distinguish real from fake.
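A sketch of the one-step and two-step adversarial terms as reconstructed above, using the standard cross-entropy GAN formulation (the patent's exact formulas appear as images in the original publication and are not reproduced verbatim, so this form is an assumption). G is assumed to be a callable like the Generator212D sketch above, and D(spectrum, label) is assumed to return a probability in (0, 1).

```python
import torch

def adversarial_losses(G, D, x_s, x_t, c_s, c_t, xvec_s, xvec_t, eps=1e-8):
    x_tc = G(x_s, c_t, xvec_t)                 # generated target spectrum
    x_sc = G(x_tc, c_s, xvec_s)                # reconstructed source spectrum (second pass)

    # one-step terms: real target vs. generated target
    d_loss1 = -(torch.log(D(x_t, c_t) + eps).mean()
                + torch.log(1 - D(x_tc.detach(), c_t) + eps).mean())
    g_loss1 = -torch.log(D(x_tc, c_t) + eps).mean()

    # two-step terms: real source vs. reconstructed source
    d_loss2 = -(torch.log(D(x_s, c_s) + eps).mean()
                + torch.log(1 - D(x_sc.detach(), c_s) + eps).mean())
    g_loss2 = -torch.log(D(x_sc, c_s) + eps).mean()

    return g_loss1 + g_loss2, d_loss1 + d_loss2    # L_adv^G, L_D
```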

L_cls^G is the classification loss with which the classifier C optimizes the generator:

L_cls^G = −E_{G}[log p_C(c_t | G(x_s, c_t, X-vector_t))]

wherein p_C(c_t | G(x_s, c_t, X-vector_t)) denotes the probability with which the classifier judges the generated target speaker spectrum to belong to the label c_t, and G(x_s, c_t, X-vector_t) denotes the target speaker spectrum generated by the generator. During training, L_cls^G is made as small as possible, so that the spectrum G(x_s, c_t, X-vector_t) generated by the generator G can be correctly classified by the classifier as the label c_t.

L_cyc(G) and L_id(G) follow the generator losses of the Cycle-GAN model. L_cyc(G) is the cycle-consistency loss of the generator G:

L_cyc(G) = E[ ‖ G(G(x_s, c_t, X-vector_t), c_s) − x_s ‖_1 ]

wherein G(G(x_s, c_t, X-vector_t), c_s) is the reconstructed source speaker spectral feature, and E[·] is the expected loss between the reconstructed source speaker spectrum and the real source speaker spectrum. In the generator loss, L_cyc(G) is made as small as possible, so that after the generated target spectrum G(x_s, c_t, X-vector_t) and the source speaker label c_s are input into the generator again, the reconstructed source speaker spectrum is as similar to x_s as possible. Training with L_cyc(G) effectively ensures that the semantic features of the speaker's speech are not lost after being encoded by the generator.

L_id(G) is the feature mapping loss of the generator G:

L_id(G) = E[ ‖ G(x_s, c_s, X-vector_s) − x_s ‖_1 ]

wherein G(x_s, c_s, X-vector_s) is the source speaker spectral feature obtained after the source speaker spectrum, speaker label and x-vector are input into the generator, and E[·] is the expected loss between x_s and G(x_s, c_s, X-vector_s). Training with L_id(G) effectively ensures that the label c_s and the speaker representation vector X-vector_s of the input spectrum remain unchanged after being input into the generator.

1.7) Input the generated target speaker spectral envelope feature x_tc and the target speaker's spectral envelope feature x_t into the classifier for training, minimizing the loss function of the classifier.

the classifier uses a two-dimensional convolutional neural network C, including 5 convolutional layers, the filter sizes of the 5 convolutional layers are 4 × 4, 3 × 4, and 1 × 4, respectively, the step sizes are 2 × 2, 1 × 2, and the filter depths are 8, 16, 32, 16, and 4, respectively.

The loss function of the classifier, a two-dimensional convolutional neural network, is:

L_C = −E_{x_t}[log p_C(c_t | x_t)]

and the optimization target is to minimize L_C with respect to the classifier parameters: min_C L_C.

1.8) Repeat 1.4), 1.5), 1.6) and 1.7) until the number of iterations is reached, thereby obtaining the trained STARGAN-X network, where the generator parameters φ, the discriminator parameters θ, and the classifier parameters ψ are the trained parameters. The required number of iterations varies with the specific network configuration and the performance of the experimental equipment; in this experiment it was set to 20000.

1.9) Establish the fundamental frequency conversion relation using the mean and variance of the logarithmic fundamental frequency log f_0: compute the mean and variance of each speaker's logarithmic fundamental frequency, and convert the source speaker's logarithmic fundamental frequency log f_0s into the target speaker's logarithmic fundamental frequency log f_0t' by a linear transformation in the logarithmic domain.

The fundamental frequency conversion function is:

log f_0t' = (σ_t / σ_s) · (log f_0s − μ_s) + μ_t

wherein μ_s and σ_s are respectively the mean and variance of the source speaker's fundamental frequency in the logarithmic domain, and μ_t and σ_t are respectively the mean and variance of the target speaker's fundamental frequency in the logarithmic domain.
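A small sketch of this log-domain linear F0 conversion. Here σ is used as the standard deviation in the log domain, and statistics are computed over voiced frames (f0 > 0); both are common conventions and are assumptions rather than details stated in the text.

```python
import numpy as np

def logf0_statistics(f0):
    """Mean and standard deviation of the log fundamental frequency over voiced frames."""
    log_f0 = np.log(f0[f0 > 0])
    return log_f0.mean(), log_f0.std()

def convert_f0(f0_source, mu_s, sigma_s, mu_t, sigma_t):
    """Apply log f0' = (sigma_t / sigma_s) * (log f0 - mu_s) + mu_t to voiced frames."""
    f0_converted = np.zeros_like(f0_source)
    voiced = f0_source > 0
    f0_converted[voiced] = np.exp(
        (np.log(f0_source[voiced]) - mu_s) / sigma_s * sigma_t + mu_t)
    return f0_converted
```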

The implementation steps of the conversion stage are as follows:

2.1) Pass the source speaker's speech through the WORLD speech analysis/synthesis model to extract the spectral envelope feature x_s', the aperiodic feature, and the fundamental frequency of each of the source speaker's sentences. Since the Fast Fourier Transform (FFT) length is set to 1024, the resulting spectral envelope feature x_s' and the aperiodic feature are both of 1024/2 + 1 = 513 dimensions.

2.2) Take the spectral envelope feature x_s' of the source speaker's speech extracted in 2.1), the target speaker label feature c_t', and the target speaker x-vector X-vector_t' as a joint feature (x_s', c_t', X-vector_t') and input it into the STARGAN-X network trained in 1.8) to reconstruct the target speaker spectral envelope feature x_tc'.

2.3) converting the fundamental frequency of the source speaker extracted in the step 2.1) into the fundamental frequency of the target speaker by the fundamental frequency conversion function obtained in the step 1.9).

2.4) Synthesize the converted speaker's speech through the WORLD speech analysis/synthesis model from the target speaker spectral envelope feature x_tc' obtained in 2.2), the converted fundamental frequency obtained in 2.3), and the aperiodic feature extracted in 2.1).
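A minimal end-to-end conversion sketch for steps 2.1)-2.4), reusing the assumed helpers introduced earlier (world_decompose, spectral_envelope_to_mcep, convert_f0) together with a trained generator G, a target-speaker one-hot label tensor c_t and a target x-vector tensor xvec_t. All of these names are illustrative; for simplicity the frame count is assumed to be compatible with the generator's strided layers.

```python
import numpy as np
import pyworld as pw
import torch

FFT_SIZE = 1024
FRAME_PERIOD = 5.0

def convert_utterance(wav_path, G, c_t, xvec_t, mu_s, sig_s, mu_t, sig_t):
    f0, sp, ap, fs = world_decompose(wav_path)                    # 2.1) WORLD analysis
    mcep = spectral_envelope_to_mcep(sp, fs)                      # (T, 36)

    x = torch.from_numpy(mcep.T[None, None].astype(np.float32))   # (1, 1, 36, T)
    with torch.no_grad():
        mcep_conv = G(x, c_t, xvec_t).squeeze().numpy().T         # 2.2) converted MCEPs (T, 36)

    f0_conv = convert_f0(f0, mu_s, sig_s, mu_t, sig_t)            # 2.3) F0 conversion
    sp_conv = pw.decode_spectral_envelope(                        # back to a 513-dim envelope
        mcep_conv.astype(np.float64), fs, FFT_SIZE)
    wav_conv = pw.synthesize(f0_conv, sp_conv, ap, fs, FRAME_PERIOD)  # 2.4) WORLD synthesis
    return wav_conv, fs
```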
