Many-to-many speaker conversion method based on improved STARGAN and x vectors

Document No. 1639666 | Published: 2019-12-20

Note: This technique, "Many-to-many speaker conversion method based on improved STARGAN and x-vectors," was devised by Li Yanping, Cao Pan, and Zhang Yan on 2019-09-17. Its main content is as follows: The invention discloses a many-to-many speaker conversion method based on an improved STARGAN and x-vectors, comprising a training phase and a conversion phase. An improved STARGAN is combined with x-vectors to realize a voice conversion system, further improving STARGAN for voice conversion applications. The proposed two-step adversarial loss effectively alleviates the over-smoothing caused by the L1-based cycle-consistency loss, and the generator adopts a 2-1-2D CNN network, which improves the model's ability to learn semantics and to synthesize speech spectra, overcoming the poor similarity and naturalness of speech converted by STARGAN. Meanwhile, the x-vector characterizes short utterances better and fully represents a speaker's individual characteristics, yielding a high-quality many-to-many voice conversion method under non-parallel text conditions.

1. A many-to-many speaker conversion method based on improved STARGAN and x-vectors, comprising a training phase and a conversion phase, the training phase comprising the steps of:

(1.1) acquiring a training corpus, wherein the training corpus consists of corpora of a plurality of speakers and comprises a source speaker and a target speaker;

(1.2) extracting the spectral envelope feature x and the fundamental frequency feature of each speaker from the training corpus through a WORLD speech analysis/synthesis model, and extracting an x-vector, X-vector, characterizing each speaker's individual characteristics;

(1.3) inputting the source speaker's spectral envelope feature x_s, the target speaker's spectral envelope feature x_t, the source speaker label c_s and x-vector X-vector_s, and the target speaker label c_t and x-vector X-vector_t into a STARGAN-X network for training, wherein the STARGAN-X network consists of a generator G, a discriminator D and a classifier C; the generator G adopts a 2-1-2D network structure and consists of an encoding network, a decoding network and ResNet layers, the encoding network and the decoding network adopt two-dimensional convolutional neural networks, at least 1 ResNet layer is built between the encoding network and the decoding network, and the ResNet layers adopt a one-dimensional convolutional neural network;

(1.4) during training, making the loss function of the generator G, the loss function of the discriminator D and the loss function of the classifier C as small as possible until the set number of iterations is reached, obtaining a trained STARGAN-X network;

(1.5) constructing a fundamental frequency conversion function from the voice fundamental frequency of the source speaker to the voice fundamental frequency of the target speaker;

the conversion phase comprises the steps of:

(2.1) extracting the spectral envelope feature x_s', the aperiodic feature, and the fundamental frequency from the source speaker's speech in the corpus to be converted through the WORLD speech analysis/synthesis model;

(2.2) inputting the source speaker's spectral envelope feature x_s', the target speaker label feature c_t', and the target speaker x-vector X-vector_t' into the STARGAN-X network trained in (1.4), reconstructing the target speaker's spectral envelope feature x_tc';

(2.3) converting the fundamental frequency of the source speaker extracted in the step (2.1) into the fundamental frequency of the target speaker through the fundamental frequency conversion function obtained in the step (1.5);

(2.4) synthesizing the converted speaker speech through the WORLD speech analysis/synthesis model from the target speaker's spectral envelope feature x_tc' obtained in (2.2), the converted fundamental frequency obtained in (2.3), and the aperiodic feature extracted in (2.1).

2. The improved STARGAN and x-vector based many-to-many speaker conversion method according to claim 1, wherein: 6 ResNet layers are built between the encoding network and the decoding network of the generator G.

3. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 1, wherein the training process in steps (1.3) and (1.4) comprises the steps of:

(1) inputting the source speaker's spectral envelope feature x_s into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s);

(2) inputting the obtained semantic features G(x_s), together with the target speaker label features c_t and the target speaker x-vector X-vector_t, into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the target speaker's spectral envelope feature x_tc;

(3) inputting the obtained target speaker spectral envelope feature x_tc into the encoding network of the generator G again to obtain speaker-independent semantic features G(x_tc);

(4) inputting the obtained semantic features G(x_tc), the source speaker label features c_s, and the source speaker x-vector X-vector_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the reconstructed source speaker spectral envelope feature x_sc; then inputting x_sc and the source speaker label features c_s into the discriminator D for training, minimizing the loss function of the discriminator D;

(5) inputting the generated target speaker spectral envelope feature x_tc, the target speaker spectral feature x_t, and the target speaker label features c_t into the discriminator D for training, minimizing the loss function of the discriminator D;

(6) inputting the generated target speaker spectral envelope feature x_tc and the target speaker's spectral envelope feature x_t into the classifier C for training, minimizing the loss function of the classifier C;

(7) returning to step (1) and repeating the above steps until the number of iterations is reached, thereby obtaining the trained STARGAN-X network.

4. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 1, wherein the input process in step (2.2) comprises the steps of:

(1) inputting the source speaker's spectral envelope feature x_s' into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s)';

(2) inputting the obtained semantic features G(x_s)', together with the target speaker label features c_t' and the target speaker x-vector X-vector_t', into the decoding network of the generator G to obtain the target speaker's spectral envelope feature x_tc'.

5. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 1, wherein the loss function of the generator G is:

L_G = L_adv^G + λ_cls L_cls^G + λ_cyc L_cyc(G) + λ_id L_id(G)

wherein λ_cls ≥ 0, λ_cyc ≥ 0 and λ_id ≥ 0 are regularization parameters representing the weights of the classification loss, the cycle-consistency loss, and the feature mapping loss, respectively; L_adv^G, L_cls^G, L_cyc(G) and L_id(G) respectively denote the two-step adversarial loss of the generator, the classification loss with which the classifier optimizes the generator, the cycle-consistency loss, and the feature mapping loss;

the discriminator D adopts a two-dimensional convolution neural network, and the loss function is as follows:

L_D = L_adv1^D + L_adv2^D

L_adv1^D = −E_{x_t}[log D(x_t, c_t)] − E_{G}[log(1 − D(G(x_s, c_t, X-vector_t), c_t))]

L_adv2^D = −E_{x_s}[log D(x_s, c_s)] − E_{rec}[log(1 − D(G(G(x_s, c_t, X-vector_t), c_s), c_s))]

wherein L_adv1^D denotes the one-step adversarial loss of the discriminator D and L_adv2^D denotes the two-step adversarial loss of the discriminator; D(x_s, c_s) and D(x_t, c_t) respectively denote the discriminator D discriminating the real source and target spectral features; G(x_s, c_t, X-vector_t) denotes the target speaker spectral feature generated by the generator G; D(G(x_s, c_t, X-vector_t), c_t) denotes the discriminator discriminating the generated spectral feature; D(G(G(x_s, c_t, X-vector_t), c_s), c_s) denotes the discriminator discriminating the reconstructed source speaker spectral feature; E_{G}[·] denotes the expectation over the probability distribution generated by the generator G, E_{x_t}[·] and E_{x_s}[·] denote expectations over the real probability distribution, and E_{rec}[·] denotes the expectation over the probability distribution of the reconstructed source speaker spectral features;

the classifier C adopts a two-dimensional convolution neural network, and the loss function is as follows:

L_C = −E_{x_t}[log p_C(c_t | x_t)]

wherein p_C(c_t | x_t) denotes the probability with which the classifier C judges the real target speaker spectral feature x_t to belong to the label c_t.

6. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 5, wherein:

L_adv^G = L_adv1^G + L_adv2^G

wherein L_adv1^G denotes the one-step adversarial loss of the generator and L_adv2^G denotes the two-step adversarial loss of the generator;

L_adv1^G = −E_{G}[log D(G(x_s, c_t, X-vector_t), c_t)]

L_adv2^G = −E_{rec}[log D(G(G(x_s, c_t, X-vector_t), c_s), c_s)]

wherein E_{G}[·] denotes the expectation over the probability distribution generated by the generator, and G(x_s, c_t, X-vector_t) denotes the spectral features generated by the generator;

L_cls^G = −E_{G}[log p_C(c_t | G(x_s, c_t, X-vector_t))]

wherein p_C(c_t | G(x_s, c_t, X-vector_t)) denotes the probability with which the classifier judges the generated target speaker spectrum to belong to the label c_t, and G(x_s, c_t, X-vector_t) denotes the target speaker spectrum generated by the generator;

L_cyc(G) = E[ ‖ G(G(x_s, c_t, X-vector_t), c_s) − x_s ‖_1 ]

wherein G(G(x_s, c_t, X-vector_t), c_s) is the reconstructed source speaker spectral feature, and E[·] denotes the expected loss between the reconstructed source speaker spectrum and the real source speaker spectrum;

L_id(G) = E[ ‖ G(x_s, c_s, X-vector_s) − x_s ‖_1 ]

wherein G(x_s, c_s, X-vector_s) is the source speaker spectral feature obtained after the source speaker spectrum, speaker label and x-vector are input into the generator, and E[·] denotes the expected loss between x_s and G(x_s, c_s, X-vector_s).

7. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 5, wherein: the encoding network of the generator G comprises 5 convolutional layers with filter sizes of 3×9, 4×8, 4×8, 3×5 and 9×5, strides of 1×1, 2×2, 2×2, 1×1 and 9×1, and filter depths of 32, 64, 128, 64 and 5, respectively; the decoding network of the generator G comprises 5 deconvolution layers with filter sizes of 9×5, 3×5, 4×8, 4×8 and 3×9, strides of 9×1, 1×1, 2×2, 2×2 and 1×1, and filter depths of 64, 128, 64, 32 and 1, respectively.

8. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 5, wherein: the discriminator D comprises 5 convolutional layers with filter sizes of 3×9, 3×8, 3×8, 3×6 and 36×5, strides of 1×1, 1×2, 1×2, 1×2 and 36×1, and filter depths of 32, 32, 32, 32 and 1, respectively.

9. The improved STARGAN and x-vector based many-to-many speaker conversion method as claimed in claim 5, wherein: the classifier C comprises 5 convolutional layers with filter sizes of 4×4, 4×4, 4×4, 3×4 and 1×4, strides of 2×2, 2×2, 2×2, 1×2 and 1×2, and filter depths of 8, 16, 32, 16 and 4, respectively.

10. The improved STARGAN and x-vector based many-to-many speaker conversion method according to any of claims 1 to 9, wherein the fundamental frequency conversion function is as follows:

log f_0t' = (σ_t / σ_s) · (log f_0s − μ_s) + μ_t

wherein μ_s and σ_s are respectively the mean and variance of the source speaker's fundamental frequency in the logarithmic domain, μ_t and σ_t are respectively the mean and variance of the target speaker's fundamental frequency in the logarithmic domain, log f_0s is the logarithmic fundamental frequency of the source speaker, and log f_0t' is the converted logarithmic fundamental frequency.

Technical Field

The invention relates to a many-to-many speaker conversion method, in particular to a many-to-many speaker conversion method based on improved STARGAN and x vectors.

Background

Speech conversion is a research branch in the field of speech signal processing, developed and extended from research on speech analysis, recognition and synthesis. The goal of speech conversion is to change the personality characteristics of the source speaker's speech so that it carries the personality characteristics of the target speaker, that is, to make the speech of one person sound, after conversion, as if it were spoken by another person, while preserving the semantic content.

After years of research, many classical conversion methods have emerged, including speech conversion methods based on the Gaussian Mixture Model (GMM), the Recurrent Neural Network (RNN), the Deep Neural Network (DNN), and the like. However, most of these speech conversion methods require the training corpus to be parallel text, that is, the source speaker and the target speaker must utter sentences with the same linguistic content and speech duration, with pronunciation rhythm and emotion kept as consistent as possible. Collecting such data is time consuming, and even when parallel data are obtained the difficulty remains, because most speech conversion methods rely on accurate time alignment of the data. Alignment is a difficult process, and inaccurately aligned speech feature parameters in parallel data make alignment accuracy during training a bottleneck for conversion performance. In addition, parallel speech cannot be obtained at all in practical applications such as cross-language conversion and voice conversion for medical assistance of patients. Therefore, considering the universality and practicality of speech conversion systems, research on speech conversion under non-parallel text conditions has great practical significance and application value.

Existing speech conversion methods under non-parallel text conditions include those based on the Cycle-Consistent Adversarial Network (Cycle-GAN) and those based on the Conditional Variational Auto-Encoder (C-VAE). The speech conversion method based on the C-VAE model directly uses the speaker's identity label to build a conversion system: the encoder separates the semantics from the speaker's personal information, and the decoder reconstructs speech from the semantics and the speaker identity label, thereby removing the dependence on parallel text. However, because C-VAE rests on the idealized assumption that the observed data follow a Gaussian distribution, the decoder output is over-smoothed and the quality of the converted speech is poor. The speech conversion method based on the Cycle-GAN model uses an adversarial loss and a cycle-consistency loss to learn the forward and inverse mappings of the acoustic features simultaneously, which effectively alleviates the over-smoothing problem and improves the quality of the converted speech, but Cycle-GAN can only realize one-to-one voice conversion.

The speech conversion method based on the Star Generative Adversarial Network (STARGAN) model combines the advantages of C-VAE and Cycle-GAN: its generator has an encoder-decoder structure, it can learn many-to-many mappings simultaneously, and the attributes of the generator output are controlled by the speaker identity label, so many-to-many voice conversion under non-parallel text conditions can be realized. In STARGAN, introducing an adversarial loss effectively alleviates the over-smoothing caused by statistical averaging, but the L1-based cycle-consistency loss still causes over-smoothing. The generator uses a two-dimensional convolutional neural network, which cannot effectively capture the temporal information of speech. More importantly, the encoding network and the decoding network in the generator are independent: the encoding network alone cannot cleanly separate the semantic features from the speaker's personalized features, and the decoding network alone cannot synthesize them well. Moreover, in this method the speaker identity label cannot fully express the speaker's personalized characteristics, so the converted speech still needs improvement in both speech quality and personality similarity.

Disclosure of Invention

The purpose of the invention is as follows: the technical problem to be solved by the invention is to provide a many-to-many speaker conversion method based on improved STARGAN and x-vectors. The proposed two-step adversarial loss effectively alleviates the over-smoothing caused by the L1-based cycle-consistency loss, and the generator adopts a 2-1-2D CNN network, which improves the model's ability to learn semantics and to synthesize speech spectra, overcomes the poor similarity and naturalness of speech converted by STARGAN, reduces the difficulty of semantic learning for the encoding network, and improves the spectrum generation quality of the decoding network. The x-vector fully characterizes the speakers' personalized features and effectively improves the personality similarity of the converted speech.

The technical scheme is as follows: the invention discloses a method for many-to-many speaker conversion based on improved STARGAN and x vectors, which comprises a training phase and a conversion phase, wherein the training phase comprises the following steps:

(1.1) acquiring a training corpus, wherein the training corpus consists of corpora of a plurality of speakers and comprises a source speaker and a target speaker;

(1.2) extracting the spectral envelope feature x and the fundamental frequency feature of each speaker from the training corpus through a WORLD speech analysis/synthesis model, and extracting an x-vector, X-vector, characterizing each speaker's individual characteristics;

(1.3) inputting the source speaker's spectral envelope feature x_s, the target speaker's spectral envelope feature x_t, the source speaker label c_s and x-vector X-vector_s, and the target speaker label c_t and x-vector X-vector_t into a STARGAN-X network for training, wherein the STARGAN-X network consists of a generator G, a discriminator D and a classifier C; the generator G adopts a 2-1-2D network structure and consists of an encoding network, a decoding network and ResNet layers, the encoding network and the decoding network adopt two-dimensional convolutional neural networks, at least 1 ResNet layer is built between the encoding network and the decoding network, and the ResNet layers adopt a one-dimensional convolutional neural network;

(1.4) during training, making the loss function of the generator G, the loss function of the discriminator D and the loss function of the classifier C as small as possible until the set number of iterations is reached, obtaining a trained STARGAN-X network;

(1.5) constructing a fundamental frequency conversion function from the voice fundamental frequency of the source speaker to the voice fundamental frequency of the target speaker;

the conversion phase comprises the steps of:

(2.1) extracting the spectral envelope feature x_s', the aperiodic feature, and the fundamental frequency from the source speaker's speech in the corpus to be converted through the WORLD speech analysis/synthesis model;

(2.2) inputting the source speaker's spectral envelope feature x_s', the target speaker label feature c_t', and the target speaker x-vector X-vector_t' into the STARGAN-X network trained in (1.4), reconstructing the target speaker's spectral envelope feature x_tc';

(2.3) converting the fundamental frequency of the source speaker extracted in the step (2.1) into the fundamental frequency of the target speaker through the fundamental frequency conversion function obtained in the step (1.5);

(2.4) synthesizing the converted speaker speech through the WORLD speech analysis/synthesis model from the target speaker's spectral envelope feature x_tc' obtained in (2.2), the converted fundamental frequency obtained in (2.3), and the aperiodic feature extracted in (2.1).

Further, 6 ResNet layers are built between the encoding network and the decoding network of the generator G.

Further, the training process in steps (1.3) and (1.4) comprises the following steps:

(1) inputting the source speaker's spectral envelope feature x_s into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s);

(2) inputting the obtained semantic features G(x_s), together with the target speaker label features c_t and the target speaker x-vector X-vector_t, into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the target speaker's spectral envelope feature x_tc;

(3) inputting the obtained target speaker spectral envelope feature x_tc into the encoding network of the generator G again to obtain speaker-independent semantic features G(x_tc);

(4) inputting the obtained semantic features G(x_tc), the source speaker label features c_s, and the source speaker x-vector X-vector_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the reconstructed source speaker spectral envelope feature x_sc; then inputting x_sc and the source speaker label features c_s into the discriminator D for training, minimizing the loss function of the discriminator D;

(5) inputting the generated target speaker spectral envelope feature x_tc, the target speaker spectral feature x_t, and the target speaker label features c_t into the discriminator D for training, minimizing the loss function of the discriminator D;

(6) inputting the generated target speaker spectral envelope feature x_tc and the target speaker's spectral envelope feature x_t into the classifier C for training, minimizing the loss function of the classifier C;

(7) returning to step (1) and repeating the above steps until the number of iterations is reached, thereby obtaining the trained STARGAN-X network; a high-level sketch of this training loop is given below.
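The following is a minimal Python sketch of the alternating training loop described in steps (1)-(7), assuming PyTorch-style modules G, D and C and assumed helper functions g_loss_fn, d_loss_fn and c_loss_fn that implement the loss functions defined later in this document; all names, the optimizer choice and the learning rate are illustrative, not the patent's own implementation.

```python
import torch

def train_stargan_x(G, D, C, loader, num_iters=20000, lr=1e-4):
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    opt_c = torch.optim.Adam(C.parameters(), lr=lr)
    it = 0
    while it < num_iters:
        for x_s, c_s, xvec_s, x_t, c_t, xvec_t in loader:
            # steps (1)-(2): encode x_s, decode with (c_t, X-vector_t) -> x_tc
            # steps (3)-(4): re-encode x_tc, decode with (c_s, X-vector_s) -> x_sc
            with torch.no_grad():
                x_tc = G(x_s, c_t, xvec_t)
                x_sc = G(x_tc, c_s, xvec_s)

            # steps (4)-(5): train the discriminator on (x_sc, c_s) and (x_tc, x_t, c_t)
            opt_d.zero_grad()
            d_loss_fn(D, x_s, x_t, x_tc, x_sc, c_s, c_t).backward()
            opt_d.step()

            # step (6): train the classifier on (x_tc, x_t) with label c_t
            opt_c.zero_grad()
            c_loss_fn(C, x_t, x_tc, c_t).backward()
            opt_c.step()

            # steps (2) and (4): train the generator (two-step adversarial + cyc/cls/id losses)
            opt_g.zero_grad()
            g_loss_fn(G, D, C, x_s, c_s, xvec_s, c_t, xvec_t).backward()
            opt_g.step()

            it += 1
            if it >= num_iters:
                return
```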

Further, the input process in step (2.2) comprises the following steps:

(1) inputting the source speaker's spectral envelope feature x_s' into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s)';

(2) inputting the obtained semantic features G(x_s)', together with the target speaker label features c_t' and the target speaker x-vector X-vector_t', into the decoding network of the generator G to obtain the target speaker's spectral envelope feature x_tc'.

Further, the loss function of the generator G is:

L_G = L_adv^G + λ_cls L_cls^G + λ_cyc L_cyc(G) + λ_id L_id(G)

wherein λ_cls ≥ 0, λ_cyc ≥ 0 and λ_id ≥ 0 are regularization parameters representing the weights of the classification loss, the cycle-consistency loss, and the feature mapping loss, respectively; L_adv^G, L_cls^G, L_cyc(G) and L_id(G) respectively denote the two-step adversarial loss of the generator, the classification loss with which the classifier optimizes the generator, the cycle-consistency loss, and the feature mapping loss;

the discriminator D adopts a two-dimensional convolution neural network, and the loss function is as follows:

L_D = L_adv1^D + L_adv2^D

L_adv1^D = −E_{x_t}[log D(x_t, c_t)] − E_{G}[log(1 − D(G(x_s, c_t, X-vector_t), c_t))]

L_adv2^D = −E_{x_s}[log D(x_s, c_s)] − E_{rec}[log(1 − D(G(G(x_s, c_t, X-vector_t), c_s), c_s))]

wherein L_adv1^D denotes the one-step adversarial loss of the discriminator D and L_adv2^D denotes the two-step adversarial loss of the discriminator; D(x_s, c_s) and D(x_t, c_t) respectively denote the discriminator D discriminating the real source and target spectral features; G(x_s, c_t, X-vector_t) denotes the target speaker spectral feature generated by the generator G; D(G(x_s, c_t, X-vector_t), c_t) denotes the discriminator discriminating the generated spectral feature; D(G(G(x_s, c_t, X-vector_t), c_s), c_s) denotes the discriminator discriminating the reconstructed source speaker spectral feature; E_{G}[·] denotes the expectation over the probability distribution generated by the generator G, E_{x_t}[·] and E_{x_s}[·] denote expectations over the real probability distribution, and E_{rec}[·] denotes the expectation over the probability distribution of the reconstructed source speaker spectral features;

the classifier C adopts a two-dimensional convolution neural network, and the loss function is as follows:

L_C = −E_{x_t}[log p_C(c_t | x_t)]

wherein p_C(c_t | x_t) denotes the probability with which the classifier C judges the real target speaker spectral feature x_t to belong to the label c_t.

Further, the terms above are defined as follows:

L_adv^G = L_adv1^G + L_adv2^G

wherein L_adv1^G denotes the one-step adversarial loss of the generator and L_adv2^G denotes the two-step adversarial loss of the generator;

L_adv1^G = −E_{G}[log D(G(x_s, c_t, X-vector_t), c_t)]

L_adv2^G = −E_{rec}[log D(G(G(x_s, c_t, X-vector_t), c_s), c_s)]

wherein E_{G}[·] denotes the expectation over the probability distribution generated by the generator, and G(x_s, c_t, X-vector_t) denotes the spectral features generated by the generator;

L_cls^G = −E_{G}[log p_C(c_t | G(x_s, c_t, X-vector_t))]

wherein p_C(c_t | G(x_s, c_t, X-vector_t)) denotes the probability with which the classifier judges the generated target speaker spectrum to belong to the label c_t, and G(x_s, c_t, X-vector_t) denotes the target speaker spectrum generated by the generator;

L_cyc(G) = E[ ‖ G(G(x_s, c_t, X-vector_t), c_s) − x_s ‖_1 ]

wherein G(G(x_s, c_t, X-vector_t), c_s) is the reconstructed source speaker spectral feature, and E[·] denotes the expected loss between the reconstructed source speaker spectrum and the real source speaker spectrum;

L_id(G) = E[ ‖ G(x_s, c_s, X-vector_s) − x_s ‖_1 ]

wherein G(x_s, c_s, X-vector_s) is the source speaker spectral feature obtained after the source speaker spectrum, speaker label and x-vector are input into the generator, and E[·] denotes the expected loss between x_s and G(x_s, c_s, X-vector_s).

Further, the encoding network of the generator G comprises 5 convolutional layers with filter sizes of 3×9, 4×8, 4×8, 3×5 and 9×5, strides of 1×1, 2×2, 2×2, 1×1 and 9×1, and filter depths of 32, 64, 128, 64 and 5, respectively; the decoding network of the generator G comprises 5 deconvolution layers with filter sizes of 9×5, 3×5, 4×8, 4×8 and 3×9, strides of 9×1, 1×1, 2×2, 2×2 and 1×1, and filter depths of 64, 128, 64, 32 and 1, respectively.

Further, the discriminator D comprises 5 convolutional layers with filter sizes of 3×9, 3×8, 3×8, 3×6 and 36×5, strides of 1×1, 1×2, 1×2, 1×2 and 36×1, and filter depths of 32, 32, 32, 32 and 1, respectively.

Further, the classifier C comprises 5 convolutional layers with filter sizes of 4×4, 4×4, 4×4, 3×4 and 1×4, strides of 2×2, 2×2, 2×2, 1×2 and 1×2, and filter depths of 8, 16, 32, 16 and 4, respectively.

Further, the fundamental frequency conversion function is:

log f_0t' = (σ_t / σ_s) · (log f_0s − μ_s) + μ_t

wherein μ_s and σ_s are respectively the mean and variance of the source speaker's fundamental frequency in the logarithmic domain, μ_t and σ_t are respectively the mean and variance of the target speaker's fundamental frequency in the logarithmic domain, log f_0s is the logarithmic fundamental frequency of the source speaker, and log f_0t' is the converted logarithmic fundamental frequency.

Advantageous effects: the method combines an improved STARGAN with the x-vector to realize many-to-many speaker voice conversion under both parallel and non-parallel text conditions. On top of the existing network structure, an adversarial loss is additionally applied to the cyclically converted features, so that the adversarial loss is used twice per cycle, i.e., a two-step adversarial loss, which effectively alleviates the over-smoothing caused by the L1-based cycle-consistency loss. The generator adopts a 2-1-2D CNN structure, with ResNet layers built between its encoding and decoding networks, where the main conversion takes place. The 2D CNN is better suited to converting features while keeping the original structure of the speech features, whereas the 1D CNN proposed for the ResNet layers captures the dynamic changes of the speech information better than a 2D CNN; adopting 2D CNNs in the encoding and decoding networks allows features to be captured more broadly. The proposed 2-1-2D CNN structure therefore effectively overcomes the loss of speech features caused by degradation of the STARGAN network, improves the semantic extraction ability of the generator's encoding network, and improves the speech conversion ability of the generator's decoding network. The method is a further improvement of the STARGAN network in speech conversion applications.

In addition, the x-vector has better characterization performance for short utterances and can fully characterize the speaker's individual features, realizing a high-quality voice conversion method. The method realizes voice conversion under non-parallel text conditions, requires no alignment process during training, and improves the universality and practicality of the voice conversion system; furthermore, the conversion systems of multiple source-target speaker pairs are integrated into one conversion model, i.e., many-to-many speaker conversion is realized, which has good application prospects in cross-language voice conversion, film dubbing, speech translation and other fields.

Drawings

FIG. 1 is an overall flow diagram of the present method;

FIG. 2 is a network architecture diagram of the generator of the model STARGAN-X of the present method.

Detailed Description

As shown in fig. 1, the method of the present invention is divided into two parts: the training part is used for obtaining parameters and conversion functions required by voice conversion, and the conversion part is used for realizing the conversion from the voice of a source speaker to the voice of a target speaker.

The training stage comprises the following implementation steps:

1.1) Obtain a non-parallel text training corpus consisting of the corpora of multiple speakers, including source speakers and target speakers. The corpus is taken from the VCC2018 corpus. The training set contains 6 male and 6 female speakers, each with 81 sentences. The method can realize conversion under parallel text as well as non-parallel text, so the training corpus may also be parallel text.

1.2) Extract the spectral envelope feature x, the aperiodic feature, and the logarithmic fundamental frequency log f_0 of each speaker's sentences from the training corpus through the WORLD speech analysis/synthesis model, and at the same time extract the x-vector, X-vector, characterizing each speaker's individual features. Since the Fast Fourier Transform (FFT) length is set to 1024, the obtained spectral envelope feature x and the aperiodic feature are both of 1024/2 + 1 = 513 dimensions. Each speech block has 512 frames, 36-dimensional Mel-cepstral coefficient (MCEP) features are extracted from the spectral envelope features, and 8 speech blocks are taken per training step, so a training batch has dimensions 8 × 36 × 512.
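A minimal sketch of this feature extraction step, assuming the pyworld binding of WORLD and soundfile for I/O (the patent does not name a specific toolkit); the FFT length (1024), MCEP dimension (36) and 512-frame blocks follow the text, while the frame period and random block selection are illustrative assumptions.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

FFT_SIZE = 1024          # gives 1024/2 + 1 = 513-dim spectral envelope / aperiodicity
MCEP_DIM = 36            # dimension of the coded spectral envelope (MCEP) features
FRAME_PERIOD = 5.0       # ms, a typical WORLD analysis setting (assumption)

def world_decompose(wav_path):
    """Extract f0, 513-dim spectral envelope, and 513-dim aperiodicity with WORLD."""
    wav, fs = sf.read(wav_path)
    wav = wav.astype(np.float64)
    f0, t = pw.harvest(wav, fs, frame_period=FRAME_PERIOD)   # fundamental frequency
    sp = pw.cheaptrick(wav, f0, t, fs, fft_size=FFT_SIZE)    # spectral envelope (T, 513)
    ap = pw.d4c(wav, f0, t, fs, fft_size=FFT_SIZE)           # aperiodicity     (T, 513)
    return f0, sp, ap, fs

def spectral_envelope_to_mcep(sp, fs):
    """Code the 513-dim spectral envelope into 36-dim MCEP-like features."""
    return pw.code_spectral_envelope(sp, fs, MCEP_DIM)       # (T, 36)

def make_training_block(mcep, block_len=512):
    """Cut one 512-frame block (frames x dims -> dims x frames) for the 8 x 36 x 512 batch."""
    start = np.random.randint(0, mcep.shape[0] - block_len)
    return mcep[start:start + block_len].T                   # (36, 512)
```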

In practical applications, the speech available from the person to be converted is relatively short, and converting speech with the traditional i-vector speaker representation gives only mediocre results. The x-vector is a novel low-dimensional, fixed-length embedding extracted with a DNN; thanks to the strong feature extraction capability of the DNN, it characterizes short utterances better. The network is implemented in the Kaldi speech recognition toolkit using the nnet3 neural network library. The main difference between the x-vector and the i-vector lies in the extraction method. The structure of the x-vector extraction system is shown in Table 1: it consists of frame-level layers, a stats pooling layer, segment-level layers, and a softmax layer. T denotes all input speech frames and N denotes the number of training speakers; the training corpus is taken from the VCC2018 speech corpus, so N = 12.

TABLE 1. Architecture of the x-vector extraction system

Layer          Layer context     Total context   Input × output
frame1         [t-2, t+2]        5               120 × 512
frame2         {t-2, t, t+2}     9               1536 × 512
frame3         {t-3, t, t+3}     15              1536 × 512
frame4         {t}               15              512 × 512
frame5         {t}               15              512 × 1500
stats pooling  [0, T)            T               1500T × 3000
segment6       {0}               T               3000 × 512
segment7       {0}               T               512 × 512
softmax        {0}               T               512 × N

The DNN in the x-vector system has a time-delay structure: it first splices 5 frames of context into a new frame-level representation, then, taking this as the center, splices further context frames, and so on, until a total context of 15 frames is covered as the input to the DNN. The input features are 23-dimensional MFCCs with a frame length of 25 ms. The stats pooling layer aggregates all T frame-level outputs of the frame5 layer and computes their mean and standard deviation. These statistics (mean and standard deviation) are each 1500-dimensional vectors, computed once per input speech segment, and are passed together to the segment layers. Finally, the softmax layer outputs a posterior probability, with the number of output neurons equal to the number of speakers in the training set. The x-vector system classifies the training speakers using the following formula.

The loss function for DNN network training is:

E = −Σ_n Σ_{k=1}^{N} d_nk · ln P(spk_k | x_n)

where x_n denotes the n-th input utterance, k indexes the speakers, P(spk_k | x_n) is the posterior probability given by the softmax layer that the input speech belongs to speaker k, and d_nk equals 1 only when the speaker of utterance n is k, and 0 otherwise.

The DNN is not only a classifier but a combination of a feature extractor and a classifier, with every layer having strong feature extraction capability. After training, the softmax layer is removed and the remaining structure is used to extract the 512-dimensional x-vector at the segment6 layer, as shown in Table 1. Once the x-vector is extracted, the similarity between x-vectors can be computed with a probabilistic linear discriminant analysis (PLDA) back end, just as for the i-vector.
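The patent builds this network with Kaldi's nnet3; the following PyTorch sketch only illustrates the layer structure of Table 1 (the 23-dim MFCC input, layer contexts realized as dilated 1-D convolutions, stats pooling, and the 512-dim embedding at segment6). Activations, batch normalization and the cross-entropy call are assumptions.

```python
import torch
import torch.nn as nn

class XVectorTDNN(nn.Module):
    def __init__(self, feat_dim=23, num_speakers=12, embed_dim=512):
        super().__init__()
        tdnn = lambda i, o, k, d: nn.Sequential(
            nn.Conv1d(i, o, kernel_size=k, dilation=d), nn.ReLU(), nn.BatchNorm1d(o))
        self.frame1 = tdnn(feat_dim, 512, 5, 1)   # context [t-2, t+2]
        self.frame2 = tdnn(512, 512, 3, 2)        # context {t-2, t, t+2}
        self.frame3 = tdnn(512, 512, 3, 3)        # context {t-3, t, t+3}
        self.frame4 = tdnn(512, 512, 1, 1)        # context {t}
        self.frame5 = tdnn(512, 1500, 1, 1)       # context {t}
        self.segment6 = nn.Linear(2 * 1500, embed_dim)   # the x-vector is taken here
        self.segment7 = nn.Linear(embed_dim, embed_dim)
        self.softmax_layer = nn.Linear(embed_dim, num_speakers)

    def forward(self, mfcc):                      # mfcc: (batch, 23, T)
        h = self.frame5(self.frame4(self.frame3(self.frame2(self.frame1(mfcc)))))
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)   # stats pooling: mean + std
        xvector = self.segment6(stats)            # 512-dim speaker embedding
        logits = self.softmax_layer(torch.relu(self.segment7(torch.relu(xvector))))
        return logits, xvector

# Training uses the multi-class cross-entropy above; after training the softmax
# branch is dropped and `xvector` is used as the speaker representation:
# loss = nn.CrossEntropyLoss()(logits, speaker_ids)
```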

1.3) The STARGAN network in this example builds on the Cycle-GAN model and improves on it by modifying the GAN structure and adding a classifier. STARGAN consists of three parts: a generator G that generates realistic spectra, a discriminator D that judges whether its input is a real spectrum or a generated spectrum, and a classifier C that judges whether the generated spectrum belongs to the label c_t.

The objective function of the STARGAN-X network is composed of the loss functions of the generator, the discriminator and the classifier, each minimized with respect to its own network, wherein I_G(G) is the loss function of the generator:

I_G(G) = L_adv^G + λ_cls L_cls^G + λ_cyc L_cyc(G) + λ_id L_id(G)

wherein λ_cls ≥ 0, λ_cyc ≥ 0 and λ_id ≥ 0 are regularization parameters representing the weights of the classification loss, the cycle-consistency loss, and the feature mapping loss, respectively; L_adv^G, L_cls^G, L_cyc(G) and L_id(G) respectively denote the two-step adversarial loss of the generator, the classification loss with which the classifier optimizes the generator, the cycle-consistency loss, and the feature mapping loss.

The loss function of the discriminator is:

L_D = L_adv1^D + L_adv2^D

L_adv1^D = −E_{x_t}[log D(x_t, c_t)] − E_{G}[log(1 − D(G(x_s, c_t, X-vector_t), c_t))]

L_adv2^D = −E_{x_s}[log D(x_s, c_s)] − E_{rec}[log(1 − D(G(G(x_s, c_t, X-vector_t), c_s), c_s))]

wherein L_adv1^D denotes the one-step adversarial loss of the discriminator, i.e., the loss with which the discriminator distinguishes the generated target spectral features from the real target spectral features, and L_adv2^D denotes the two-step adversarial loss of the discriminator, i.e., the loss with which the discriminator distinguishes the source spectral features reconstructed by passing the generated spectrum through the generator again from the real source spectral features. D(x_s, c_s) and D(x_t, c_t) respectively denote the discriminator D discriminating the real source and target spectral features, G(x_s, c_t, X-vector_t) denotes the target speaker spectral feature generated by the generator G, D(G(x_s, c_t, X-vector_t), c_t) denotes the discriminator discriminating the generated spectral feature, D(G(G(x_s, c_t, X-vector_t), c_s), c_s) denotes the discriminator discriminating the reconstructed source speaker spectral feature, E_{G}[·] denotes the expectation over the probability distribution generated by the generator G, E_{x_t}[·] and E_{x_s}[·] denote expectations over the real probability distribution, and E_{rec}[·] denotes the expectation over the probability distribution of the reconstructed source speaker spectral features.

the loss function of the classifier two-dimensional convolutional neural network is:

wherein p isC(ct|xt) C, representing the characteristic of the classifier for distinguishing the target speaker as a labeltOf the true spectrum of the spectrum.

1.4) Take the source speaker spectral envelope feature x_s extracted in 1.2), the target speaker label feature c_t, and the target speaker x-vector X-vector_t as a joint feature (x_s, c_t, X-vector_t) and input it into the generator for training. The generator is trained to make its loss function L_G as small as possible, obtaining the generated target speaker spectral envelope feature x_tc.

The generator adopts a 2-1-2D CNN structure composed of an encoding network, a decoding network and ResNet layers. The encoding network comprises 5 convolutional layers with filter sizes of 3×9, 4×8, 4×8, 3×5 and 9×5, strides of 1×1, 2×2, 2×2, 1×1 and 9×1, and filter depths of 32, 64, 128, 64 and 5, respectively. The decoding network comprises 5 deconvolution layers with filter sizes of 9×5, 3×5, 4×8, 4×8 and 3×9, strides of 9×1, 1×1, 2×2, 2×2 and 1×1, and filter depths of 64, 128, 64, 32 and 1, respectively. Several ResNet layers, using a one-dimensional convolutional neural network (1D CNN), are built between the encoding network and the decoding network; in this embodiment 6 ResNet layers are preferred.
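A minimal PyTorch sketch of the 2-1-2D generator (2D-CNN encoder, six 1D-CNN ResNet blocks, 2D-CNN decoder). The filter sizes, strides and depths follow the text above (including the repeated 4×8 / stride 2×2 layer, which is an assumption where the listing is collapsed); the paddings, the way the speaker label and x-vector are concatenated at the bottleneck, and the instance normalization are illustrative assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class Res1DBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(ch, ch, kernel_size=3, padding=1), nn.InstanceNorm1d(ch), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=3, padding=1), nn.InstanceNorm1d(ch))

    def forward(self, x):
        return x + self.body(x)                           # residual connection

class Generator212D(nn.Module):
    """2D-CNN encoder -> conditioned 1D-CNN ResNet bottleneck -> 2D-CNN decoder."""
    def __init__(self, num_speakers=12, xvec_dim=512, n_res=6):
        super().__init__()
        cond = num_speakers + xvec_dim
        conv = lambda i, o, k, s, p: nn.Sequential(nn.Conv2d(i, o, k, s, p), nn.ReLU())
        deconv = lambda i, o, k, s, p: nn.Sequential(nn.ConvTranspose2d(i, o, k, s, p), nn.ReLU())
        # 2D encoder over (batch, 1, 36 MCEPs, 512 frames); paddings chosen so shapes divide cleanly
        self.enc = nn.Sequential(
            conv(1, 32, (3, 9), (1, 1), (1, 4)),
            conv(32, 64, (4, 8), (2, 2), (1, 3)),
            conv(64, 128, (4, 8), (2, 2), (1, 3)),
            conv(128, 64, (3, 5), (1, 1), (1, 2)),
            conv(64, 5, (9, 5), (9, 1), (0, 2)))          # -> (batch, 5, 1, 128)
        # 1D ResNet bottleneck; the speaker label + x-vector condition is concatenated here
        self.res = nn.Sequential(*[Res1DBlock(5 + cond) for _ in range(n_res)])
        # 2D decoder mirroring the encoder, back to (batch, 1, 36, 512)
        self.dec = nn.Sequential(
            deconv(5 + cond, 64, (9, 5), (9, 1), (0, 2)),
            deconv(64, 128, (3, 5), (1, 1), (1, 2)),
            deconv(128, 64, (4, 8), (2, 2), (1, 3)),
            deconv(64, 32, (4, 8), (2, 2), (1, 3)),
            nn.ConvTranspose2d(32, 1, (3, 9), (1, 1), (1, 4)))

    def forward(self, x, label_onehot, xvector):
        h = self.enc(x).squeeze(2)                        # (batch, 5, 128) "semantic" sequence
        cond = torch.cat([label_onehot, xvector], dim=1)  # (batch, num_speakers + 512)
        cond = cond.unsqueeze(2).expand(-1, -1, h.size(2))
        h = self.res(torch.cat([h, cond], dim=1))         # 1D residual blocks on conditioned features
        return self.dec(h.unsqueeze(2))                   # (batch, 1, 36, 512)
```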

1.5) Take the generated target speaker spectral envelope feature x_tc obtained in 1.4), the target speaker spectral envelope feature x_t of the training corpus obtained in 1.2), and the target speaker label c_t as the inputs of the discriminator, and train the discriminator so that its adversarial loss function is as small as possible.

The discriminator uses a two-dimensional convolutional neural network comprising 5 convolutional layers with filter sizes of 3×9, 3×8, 3×8, 3×6 and 36×5, strides of 1×1, 1×2, 1×2, 1×2 and 36×1, and filter depths of 32, 32, 32, 32 and 1, respectively.

The loss function of the discriminator is:

L_D = L_adv1^D + L_adv2^D

and the optimization target is to minimize L_D with respect to the discriminator parameters: min_D L_D.

1.6) Input the obtained target speaker spectral envelope feature x_tc into the encoding network of the generator G again to obtain the speaker-independent semantic features G(x_tc); then input the semantic features G(x_tc), the source speaker label feature c_s, and the source speaker x-vector X-vector_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training and obtaining the reconstructed source speaker spectral envelope feature x_sc. The generator loss minimized during training comprises the two-step adversarial loss of the generator, the cycle-consistency loss, the feature mapping loss, and the classification loss. The proposed two-step adversarial loss builds on the adversarial loss of the STARGAN network by further applying an adversarial loss to the cyclically converted features, which effectively alleviates the over-smoothing caused by the L1-based cycle-consistency loss. The cycle-consistency loss is trained so that after the source speaker spectral feature x_s passes through the generator G, the reconstructed source speaker spectral feature x_sc stays as consistent with x_s as possible. The feature mapping loss is trained to guarantee that the speaker label of x_s is still c_s after passing through the generator G. The classification loss is the loss associated with the probability, judged by the classifier, that the target speaker spectrum x_tc generated by the generator belongs to the label c_t.

The loss function of the generator is:

L_G = L_adv^G + λ_cls L_cls^G + λ_cyc L_cyc(G) + λ_id L_id(G)

and the optimization target is to minimize L_G with respect to the generator parameters: min_G L_G,

wherein λ_cls ≥ 0, λ_cyc ≥ 0 and λ_id ≥ 0 are regularization parameters representing the weights of the classification loss, the cycle-consistency loss, and the feature mapping loss, respectively.

L_adv^G = L_adv1^G + L_adv2^G represents the two-step adversarial loss of the generator in the GAN:

L_adv1^G = −E_{G}[log D(G(x_s, c_t, X-vector_t), c_t)]

L_adv2^G = −E_{rec}[log D(G(G(x_s, c_t, X-vector_t), c_s), c_s)]

wherein L_adv1^G denotes the one-step adversarial loss of the generator and L_adv2^G denotes the two-step adversarial loss of the generator; E_{G}[·] denotes the expectation over the probability distribution generated by the generator, and G(x_s, c_t, X-vector_t) denotes the spectral features generated by the generator. L_adv^G and the discriminator loss L_D together form the two-step adversarial loss in STARGAN-X, used to discriminate whether the spectrum input to the discriminator is a real spectrum or a generated spectrum. During training, L_adv^G is made as small as possible, and the generator is continuously optimized until it generates spectral features G(x_s, c_t, X-vector_t) realistic enough that the discriminator finds it difficult to distinguish real from fake.
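A sketch of the one-step and two-step adversarial terms as reconstructed above, using the standard cross-entropy GAN formulation (the patent's exact formulas appear as images in the original publication and are not reproduced verbatim, so this form is an assumption). G is assumed to be a callable like the Generator212D sketch above, and D(spectrum, label) is assumed to return a probability in (0, 1).

```python
import torch

def adversarial_losses(G, D, x_s, x_t, c_s, c_t, xvec_s, xvec_t, eps=1e-8):
    x_tc = G(x_s, c_t, xvec_t)                 # generated target spectrum
    x_sc = G(x_tc, c_s, xvec_s)                # reconstructed source spectrum (second pass)

    # one-step terms: real target vs. generated target
    d_loss1 = -(torch.log(D(x_t, c_t) + eps).mean()
                + torch.log(1 - D(x_tc.detach(), c_t) + eps).mean())
    g_loss1 = -torch.log(D(x_tc, c_t) + eps).mean()

    # two-step terms: real source vs. reconstructed source
    d_loss2 = -(torch.log(D(x_s, c_s) + eps).mean()
                + torch.log(1 - D(x_sc.detach(), c_s) + eps).mean())
    g_loss2 = -torch.log(D(x_sc, c_s) + eps).mean()

    return g_loss1 + g_loss2, d_loss1 + d_loss2    # L_adv^G, L_D
```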

L_cls^G is the classification loss with which the classifier C optimizes the generator:

L_cls^G = −E_{G}[log p_C(c_t | G(x_s, c_t, X-vector_t))]

wherein p_C(c_t | G(x_s, c_t, X-vector_t)) denotes the probability with which the classifier judges the generated target speaker spectrum to belong to the label c_t, and G(x_s, c_t, X-vector_t) denotes the target speaker spectrum generated by the generator. During training, L_cls^G is made as small as possible, so that the spectrum G(x_s, c_t, X-vector_t) generated by the generator G can be correctly classified by the classifier as the label c_t.

L_cyc(G) and L_id(G) follow the generator losses of the Cycle-GAN model. L_cyc(G) is the cycle-consistency loss of the generator G:

L_cyc(G) = E[ ‖ G(G(x_s, c_t, X-vector_t), c_s) − x_s ‖_1 ]

wherein G(G(x_s, c_t, X-vector_t), c_s) is the reconstructed source speaker spectral feature, and E[·] is the expected loss between the reconstructed source speaker spectrum and the real source speaker spectrum. In the generator loss, L_cyc(G) is made as small as possible, so that after the generated target spectrum G(x_s, c_t, X-vector_t) and the source speaker label c_s are input into the generator again, the reconstructed source speaker spectrum is as similar to x_s as possible. Training with L_cyc(G) effectively ensures that the semantic features of the speaker's speech are not lost after being encoded by the generator.

L_id(G) is the feature mapping loss of the generator G:

L_id(G) = E[ ‖ G(x_s, c_s, X-vector_s) − x_s ‖_1 ]

wherein G(x_s, c_s, X-vector_s) is the source speaker spectral feature obtained after the source speaker spectrum, speaker label and x-vector are input into the generator, and E[·] is the expected loss between x_s and G(x_s, c_s, X-vector_s). Training with L_id(G) effectively ensures that the label c_s and the speaker representation vector X-vector_s of the input spectrum remain unchanged after being input into the generator.

1.7) Input the generated target speaker spectral envelope feature x_tc and the target speaker's spectral envelope feature x_t into the classifier for training, minimizing the loss function of the classifier.

the classifier uses a two-dimensional convolutional neural network C, including 5 convolutional layers, the filter sizes of the 5 convolutional layers are 4 × 4, 3 × 4, and 1 × 4, respectively, the step sizes are 2 × 2, 1 × 2, and the filter depths are 8, 16, 32, 16, and 4, respectively.

The loss function of the classifier, a two-dimensional convolutional neural network, is:

L_C = −E_{x_t}[log p_C(c_t | x_t)]

and the optimization target is to minimize L_C with respect to the classifier parameters: min_C L_C.

1.8) Repeat 1.4), 1.5), 1.6) and 1.7) until the number of iterations is reached, thereby obtaining the trained STARGAN-X network, where the generator parameters φ, the discriminator parameters θ, and the classifier parameters ψ are the trained parameters. The required number of iterations varies with the specific network configuration and the performance of the experimental equipment; in this experiment it was set to 20000.

1.9) Establish the fundamental frequency conversion relation using the mean and variance of the logarithmic fundamental frequency log f_0: compute the mean and variance of each speaker's logarithmic fundamental frequency, and convert the source speaker's logarithmic fundamental frequency log f_0s into the target speaker's logarithmic fundamental frequency log f_0t' by a linear transformation in the logarithmic domain.

The fundamental frequency conversion function is:

log f_0t' = (σ_t / σ_s) · (log f_0s − μ_s) + μ_t

wherein μ_s and σ_s are respectively the mean and variance of the source speaker's fundamental frequency in the logarithmic domain, and μ_t and σ_t are respectively the mean and variance of the target speaker's fundamental frequency in the logarithmic domain.
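A small sketch of this log-domain linear F0 conversion. Here σ is used as the standard deviation in the log domain, and statistics are computed over voiced frames (f0 > 0); both are common conventions and are assumptions rather than details stated in the text.

```python
import numpy as np

def logf0_statistics(f0):
    """Mean and standard deviation of the log fundamental frequency over voiced frames."""
    log_f0 = np.log(f0[f0 > 0])
    return log_f0.mean(), log_f0.std()

def convert_f0(f0_source, mu_s, sigma_s, mu_t, sigma_t):
    """Apply log f0' = (sigma_t / sigma_s) * (log f0 - mu_s) + mu_t to voiced frames."""
    f0_converted = np.zeros_like(f0_source)
    voiced = f0_source > 0
    f0_converted[voiced] = np.exp(
        (np.log(f0_source[voiced]) - mu_s) / sigma_s * sigma_t + mu_t)
    return f0_converted
```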

The implementation steps of the conversion stage are as follows:

2.1) Pass the source speaker's speech through the WORLD speech analysis/synthesis model to extract the spectral envelope feature x_s', the aperiodic feature, and the fundamental frequency of each of the source speaker's sentences. Since the Fast Fourier Transform (FFT) length is set to 1024, the resulting spectral envelope feature x_s' and the aperiodic feature are both of 1024/2 + 1 = 513 dimensions.

2.2) Take the spectral envelope feature x_s' of the source speaker's speech extracted in 2.1), the target speaker label feature c_t', and the target speaker x-vector X-vector_t' as a joint feature (x_s', c_t', X-vector_t') and input it into the STARGAN-X network trained in 1.8) to reconstruct the target speaker spectral envelope feature x_tc'.

2.3) converting the fundamental frequency of the source speaker extracted in the step 2.1) into the fundamental frequency of the target speaker by the fundamental frequency conversion function obtained in the step 1.9).

2.4) Synthesize the converted speaker's speech through the WORLD speech analysis/synthesis model from the target speaker spectral envelope feature x_tc' obtained in 2.2), the converted fundamental frequency obtained in 2.3), and the aperiodic feature extracted in 2.1).
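A minimal end-to-end conversion sketch for steps 2.1)-2.4), reusing the assumed helpers introduced earlier (world_decompose, spectral_envelope_to_mcep, convert_f0) together with a trained generator G, a target-speaker one-hot label tensor c_t and a target x-vector tensor xvec_t. All of these names are illustrative; for simplicity the frame count is assumed to be compatible with the generator's strided layers.

```python
import numpy as np
import pyworld as pw
import torch

FFT_SIZE = 1024
FRAME_PERIOD = 5.0

def convert_utterance(wav_path, G, c_t, xvec_t, mu_s, sig_s, mu_t, sig_t):
    f0, sp, ap, fs = world_decompose(wav_path)                    # 2.1) WORLD analysis
    mcep = spectral_envelope_to_mcep(sp, fs)                      # (T, 36)

    x = torch.from_numpy(mcep.T[None, None].astype(np.float32))   # (1, 1, 36, T)
    with torch.no_grad():
        mcep_conv = G(x, c_t, xvec_t).squeeze().numpy().T         # 2.2) converted MCEPs (T, 36)

    f0_conv = convert_f0(f0, mu_s, sig_s, mu_t, sig_t)            # 2.3) F0 conversion
    sp_conv = pw.decode_spectral_envelope(                        # back to a 513-dim envelope
        mcep_conv.astype(np.float64), fs, FFT_SIZE)
    wav_conv = pw.synthesize(f0_conv, sp_conv, ap, fs, FRAME_PERIOD)  # 2.4) WORLD synthesis
    return wav_conv, fs
```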
