Speech synthesis method and apparatus, electronic device and storage medium


Abstract

The invention provides a speech synthesis method, apparatus, electronic device and storage medium (designed and created by 宋飞豹, 宋锐, 侯秋侠, 孟亚洲 and 江源, 2020-12-31). The method includes: determining a text to be synthesized; and inputting the text to be synthesized into a speech synthesis model to obtain a synthesis result. The speech synthesis model is obtained, on the basis of a language model for extracting text features, through adversarial training jointly with a speaker recognition model based on the text features, using sample text and its corresponding sample speech. The provided method, apparatus, electronic device and storage medium rely on the strong text understanding capability of the language model to ensure reasonable prediction of prosody- and phoneme-level information during speech synthesis, thereby ensuring the reliability and accuracy of the synthesis result. No front-end module is required, which saves a large amount of manpower and time; in particular, in multilingual speech synthesis scenarios there is no need to obtain a separate front-end module for each language, which greatly reduces the difficulty of implementing speech synthesis tasks and helps popularize speech synthesis applications.

1. A method of speech synthesis, comprising:

determining a text to be synthesized;

inputting the text to be synthesized into a speech synthesis model to obtain a synthesis result output by the speech synthesis model; wherein the speech synthesis model is obtained, on the basis of a language model for extracting text features, through adversarial training jointly with a speaker recognition model based on the text features, using sample text and its corresponding sample speech.

2. The method according to claim 1, wherein the inputting the text to be synthesized into a speech synthesis model to obtain a synthesis result output by the speech synthesis model comprises:

inputting the text to be synthesized into a text coding layer of the speech synthesis model to obtain text features output by the text coding layer; the text coding layer is built based on the language model;

and inputting the text features, or the text features together with a target voiceprint feature and/or a target language feature, into a decoding layer of the speech synthesis model to obtain the synthesis result output by the decoding layer.

3. The speech synthesis method of claim 2, wherein the text coding layer is obtained through adversarial training with the speaker recognition model, the speaker recognition model is used for performing speaker recognition on adversarial text features, and the adversarial text features are obtained by applying gradient reversal to the output of the text coding layer.

4. The speech synthesis method of claim 2, wherein the text coding layer comprises a multi-layer convolution structure and a language coding layer connected in series with it, and the structure of the language coding layer is consistent with the structure of the language model.

5. The speech synthesis method according to claim 2, wherein the inputting the text features, or the text features together with the target voiceprint feature and/or the target language feature, into a decoding layer of the speech synthesis model to obtain a synthesis result output by the decoding layer comprises:

inputting the text features, or the text features together with the target voiceprint feature and/or the target language feature, into a fusion decoding layer of the decoding layer to obtain a plurality of fused acoustic features output by the fusion decoding layer, wherein each fused acoustic feature corresponds to a preset number of speech frames, and the preset number is an integer greater than 1;

inputting the text feature and the multiple fused acoustic features, or inputting the text feature, the multiple fused acoustic features, and the target voiceprint feature and/or the target language feature into a general decoding layer of the decoding layer, so as to obtain the synthesis result output by the general decoding layer.

6. The speech synthesis method according to claim 5, wherein the loss function of the speech synthesis model is determined based on a loss value of a synthesis result and a loss value of a fused acoustic feature, or based on a loss value of the synthesis result, a loss value of the fused acoustic feature, and a loss value of a target voiceprint feature.

7. The speech synthesis method according to any one of claims 2 to 6, wherein the target voiceprint feature is determined based on:

inputting speech of a target speaker into a voiceprint model to obtain the target voiceprint feature output by the voiceprint model;

wherein the voiceprint model is obtained through adversarial training with a language identification model, the language identification model is used for performing language identification on adversarial voiceprint features, and the adversarial voiceprint features are obtained by applying gradient reversal to the output of the voiceprint model.

8. A speech synthesis apparatus, comprising:

a text determining unit, configured to determine a text to be synthesized;

a speech synthesis unit, configured to input the text to be synthesized into a speech synthesis model to obtain a synthesis result output by the speech synthesis model; wherein the speech synthesis model is obtained, on the basis of a language model for extracting text features, through adversarial training jointly with a speaker recognition model based on the text features, using sample text and its corresponding sample speech.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the speech synthesis method according to any of claims 1 to 7 are implemented when the program is executed by the processor.

10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech synthesis method according to any one of claims 1 to 7.

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular, to a speech synthesis method, apparatus, electronic device, and storage medium.

Background

Speech synthesis refers to the process of converting input text into speech output. Multilingual speech synthesis supports synthesizing speech in different languages, and the input text may be written in the characters of different languages.

A current speech synthesis system usually comprises two parts: a front-end module and a speech synthesis model. The front-end module performs text analysis, prosody prediction, text-to-phoneme conversion and the like; building it requires detailed knowledge of the language and consumes a large amount of manpower and time. In a multilingual speech synthesis task, front-end modules for a large number of languages are extremely difficult to obtain, which makes the multilingual speech synthesis task very difficult.

Disclosure of Invention

The invention provides a speech synthesis method, a speech synthesis apparatus, an electronic device and a storage medium, which are used to overcome the defect in the prior art that speech synthesis is difficult to realize because front-end modules are difficult to construct.

The invention provides a speech synthesis method, which comprises the following steps:

determining a text to be synthesized;

inputting the text to be synthesized into a speech synthesis model to obtain a synthesis result output by the speech synthesis model; the speech synthesis model is obtained, on the basis of a language model for extracting text features, through adversarial training jointly with a speaker recognition model based on the text features, using sample text and its corresponding sample speech.

According to a speech synthesis method provided by the present invention, the inputting the text to be synthesized into a speech synthesis model to obtain a synthesis result output by the speech synthesis model includes:

inputting the text to be synthesized into a text coding layer of the speech synthesis model to obtain text features output by the text coding layer; the text coding layer is built based on the language model;

and inputting the text features, or the text features together with a target voiceprint feature and/or a target language feature, into a decoding layer of the speech synthesis model to obtain a synthesis result output by the decoding layer.

According to the speech synthesis method provided by the invention, the text coding layer is obtained through adversarial training with a speaker recognition model, the speaker recognition model is used for performing speaker recognition on adversarial text features, and the adversarial text features are obtained by applying gradient reversal to the output of the text coding layer.

According to the speech synthesis method provided by the invention, the text coding layer comprises a plurality of layers of convolution structures and a language coding layer connected in series with the convolution structures, and the structure of the language coding layer is consistent with that of the language model.

According to a speech synthesis method provided by the present invention, the inputting the text feature, or the text feature, and the target voiceprint feature and/or the target language feature into a decoding layer of the speech synthesis model to obtain a synthesis result output by the decoding layer includes:

inputting the text features, or the text features together with the target voiceprint feature and/or the target language feature, into a fusion decoding layer of the decoding layer to obtain a plurality of fused acoustic features output by the fusion decoding layer, wherein each fused acoustic feature corresponds to a preset number of speech frames, and the preset number is an integer greater than 1;

inputting the text feature and the multiple fused acoustic features, or inputting the text feature, the multiple fused acoustic features, and the target voiceprint feature and/or the target language feature into a general decoding layer of the decoding layer, so as to obtain the synthesis result output by the general decoding layer.

According to the speech synthesis method provided by the invention, the loss function of the speech synthesis model is determined based on the loss value of the synthesis result and the loss value of the fusion acoustic feature, or based on the loss value of the synthesis result, the loss value of the fusion acoustic feature and the loss value of the target voiceprint feature.

According to a speech synthesis method provided by the present invention, the target voiceprint feature is determined based on the following steps:

inputting speech of a target speaker into a voiceprint model to obtain the target voiceprint feature output by the voiceprint model;

the voiceprint model is obtained through adversarial training with a language identification model, the language identification model is used for performing language identification on adversarial voiceprint features, and the adversarial voiceprint features are obtained by applying gradient reversal to the output of the voiceprint model.

According to the speech synthesis method provided by the invention, the voiceprint model comprises a residual network and a fully connected layer which are connected in series.

The present invention also provides a speech synthesis apparatus comprising:

the text determining unit is used for determining a text to be synthesized;

the speech synthesis unit is used for inputting the text to be synthesized into a speech synthesis model to obtain a synthesis result output by the speech synthesis model; the speech synthesis model is obtained, on the basis of a language model for extracting text features, through adversarial training jointly with a speaker recognition model based on the text features, using sample text and its corresponding sample speech.

The invention also provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor implements the steps of any of the speech synthesis methods when executing the computer program.

The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the speech synthesis method as described in any one of the above.

The speech synthesis method, apparatus, electronic device and storage medium provided by the invention perform speech synthesis on a text to be synthesized by applying an end-to-end speech synthesis model constructed on the basis of a language model. Relying on the strong text understanding capability of the language model, the method ensures reasonable prediction of prosody- and phoneme-level information during speech synthesis, thereby ensuring the reliability and accuracy of the synthesis result. No front-end module needs to be added in this process, which saves a large amount of manpower and time.

In addition, during training of the speech synthesis model, adversarial training is performed jointly with a speaker recognition model based on text features. This ensures that the text features extracted by the speech synthesis model are decorrelated from speaker information, and improves how closely the timbre of the synthesis result matches the target timbre in multilingual speech synthesis, especially when synthesizing speech in less-common languages.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of a speech synthesis method provided by the present invention;

FIG. 2 is a flow chart illustrating an embodiment of step 120 of the speech synthesis method provided by the present invention;

FIG. 3 is a schematic diagram of a training structure of a text encoding layer according to the present invention;

FIG. 4 is a flowchart illustrating an embodiment of step 122 in the speech synthesis method provided by the present invention;

FIG. 5 is a schematic diagram of a training structure of a voiceprint model provided by the present invention;

FIG. 6 is a schematic diagram of a speech synthesis model provided by the present invention;

FIG. 7 is a schematic structural diagram of a speech synthesis apparatus provided in the present invention;

fig. 8 is a schematic structural diagram of an electronic device provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In a speech synthesis task with unified multilingual modeling, a large amount of manpower and material resources are needed to perform front-end processing for each language. For some less-common languages, such as Hindi and Arabic, a front-end module is extremely difficult to obtain, which makes the multilingual speech synthesis task hard to realize. In view of this, the embodiment of the present invention provides an end-to-end speech synthesis model that can convert text into speech acoustic features without a front-end module. Fig. 1 is a schematic flow chart of a speech synthesis method provided by the present invention; as shown in fig. 1, the method includes:

step 110, determine the text to be synthesized.

Specifically, the text to be synthesized is the text on which speech synthesis needs to be performed. The text to be synthesized may be text directly input by a user, text automatically generated by a computer during human-computer interaction, or text obtained by capturing an image with an image acquisition device such as a scanner, a mobile phone or a camera and performing OCR (Optical Character Recognition) on the image; this is not specifically limited in the embodiment of the present invention.

Step 120, inputting the text to be synthesized into the speech synthesis model to obtain a synthesis result output by the speech synthesis model; the speech synthesis model is obtained, on the basis of a language model for extracting text features, through adversarial training jointly with a speaker recognition model based on the text features, using sample text and its corresponding sample speech.

The input of the speech synthesis model is the text to be synthesized, and the output is the synthesis result, i.e. the speech acoustic features or the speech audio corresponding to the text to be synthesized. The conversion from the text to be synthesized to the synthesis result can be realized by an end-to-end speech synthesis model. In this process, the operations of analyzing and encoding the text to be synthesized are moved into the end-to-end speech synthesis model, so that text analysis, prosody prediction and text-to-phoneme conversion no longer need to be performed by a front-end module; the prosody- and phoneme-level information corresponding to the text to be synthesized is extracted while the text is being encoded.

Considering that, in the absence of a front-end module, a generic feature extraction approach can hardly guarantee adequate understanding of the text, the extracted text features can hardly support reasonable and reliable prosody- and phoneme-level prediction, and the reasonableness of the resulting synthesis result is therefore hard to guarantee. To solve this problem, the embodiment of the present invention constructs the speech synthesis model on the basis of a language model, where the language model may be a pre-trained natural language processing model, such as BERT (Bidirectional Encoder Representations from Transformers) or m-BERT (multilingual BERT), that performs outstandingly on tasks such as text synthesis and semantic understanding. Relying on the outstanding text understanding ability of the language model, constructing the speech synthesis model on the basis of the language model deepens the model's understanding of the input text to be synthesized, thereby ensuring reasonable prosody- and phoneme-level prediction.

Before step 120 is executed, the speech synthesis model may be obtained through training in advance. The training method of the speech synthesis model includes the following steps: first, an initial speech synthesis model structure is constructed based on the structure of a pre-trained language model, and the parameters of the text coding part in the initial speech synthesis model are initialized with the parameters of the language model; meanwhile, a large amount of sample text and corresponding sample speech is collected. Then, the initialized speech synthesis model is trained based on the sample text and the corresponding sample speech, thereby obtaining the trained speech synthesis model. A minimal sketch of this initialization is given below.
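For illustration only, the following sketch shows one way the text coding part could be initialized from a pre-trained multilingual language model, assuming PyTorch and the Hugging Face transformers package; the model name, the dummy token ids and the 80-dimensional Mel projection are assumptions for the example, not details taken from the patent.

```python
import torch
from transformers import BertModel

# Load a pre-trained multilingual language model; its parameters initialize the
# text coding part of the speech synthesis model, while the decoding part is
# initialized randomly and the whole model is then trained on (sample text, sample speech) pairs.
pretrained_lm = BertModel.from_pretrained("bert-base-multilingual-cased")

# Toy stand-in for the rest of the model: project text features to 80-dim Mel frames.
mel_head = torch.nn.Linear(pretrained_lm.config.hidden_size, 80)

token_ids = torch.tensor([[101, 2023, 2003, 1037, 3231, 102]])  # dummy tokenized sample text
text_features = pretrained_lm(input_ids=token_ids).last_hidden_state
predicted_mels = mel_head(text_features)                         # shape: (1, seq_len, 80)
```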

On this basis, in a multilingual speech synthesis scenario, the sample speech of a single language in the training samples, especially the sample speech of a less-common language, is most likely to come from only a few speakers. The small number of speakers corresponding to the sample speech of a single language directly causes the text features extracted by the trained speech synthesis model to be correlated with those speakers, so that no matter how the target speaker is set, the synthesized speech carries the timbre of the speakers whose speech was used for that language during training.

To solve this problem, when training the speech synthesis model, the embodiment of the present invention jointly applies a speaker recognition model based on text features and performs adversarial training with the part of the initial model, constructed based on the language model, that extracts text features. Here, the speaker recognition model based on text features recognizes the speaker corresponding to the text features by capturing the speaker information contained in the input text features, and it can be trained on the text features of the sample text and the speakers corresponding to those text features.

Specifically, during adversarial training, the text features extracted by the part of the initial model constructed based on the language model are used as the input of the speaker recognition model. The training target of the language-model-based part is to decorrelate the extracted text features from speaker information as much as possible, while the training target of the speaker recognition model is to accurately recognize the speaker corresponding to the text features. As a result, the language-model-based part obtained through adversarial training extracts text features that are decorrelated from speaker information, preventing the speech synthesized by the subsequent parts from carrying the timbre of the speakers associated with that language during training.

The method provided by the embodiment of the present invention performs speech synthesis on the text to be synthesized by applying an end-to-end speech synthesis model constructed based on a language model. Relying on the strong text understanding capability of the language model, it ensures reasonable prediction of prosody- and phoneme-level information during speech synthesis, thereby ensuring the reliability and accuracy of the synthesis result. No front-end module needs to be added in this process, which saves a large amount of manpower and time; in particular, in a multilingual speech synthesis scenario there is no need to separately obtain a front-end module for each language, which greatly reduces the difficulty of implementing the speech synthesis task and helps popularize speech synthesis applications.

In addition, during training of the speech synthesis model, adversarial training is performed jointly with a speaker recognition model based on text features. This ensures that the text features extracted by the speech synthesis model are decorrelated from speaker information, and improves how closely the timbre of the synthesis result matches the target timbre in multilingual speech synthesis, especially when synthesizing speech in less-common languages.

Based on the above embodiment, the speech synthesis model includes a text encoding layer and a decoding layer; fig. 2 is a schematic flowchart of an embodiment of step 120 in the speech synthesis method provided in the present invention, and as shown in fig. 2, step 120 includes:

step 121, inputting a text to be synthesized into a text coding layer of the speech synthesis model to obtain text characteristics output by the text coding layer; the text encoding layer is built based on a language model.

Specifically, the text coding layer is configured to perform feature encoding on the input text to be synthesized, so as to output the text features of the text to be synthesized. In order to ensure that the text coding layer has excellent text understanding capability and to improve the accuracy of the extracted text features, the text coding layer inside the speech synthesis model can be built on the basis of the language model during the construction stage of the speech synthesis model, so that the strong text understanding capability of the language model is transferred to the text coding layer.

Step 122, inputting the text features, or the text features together with the target voiceprint feature and/or the target language feature, into a decoding layer of the speech synthesis model to obtain a synthesis result output by the decoding layer.

Specifically, the decoding layer may take only the text features of the text to be synthesized as input, or may additionally combine a target voiceprint feature and/or a target language feature with the text features:

the target voiceprint feature reflects the voiceprint expected of the synthesis result: if the synthesis result is expected to imitate the voice of speaker A, the voiceprint feature of speaker A can be used as the target voiceprint feature to guide the timbre of the synthesis result during speech synthesis. The target language feature is the encoding vector of the language in which the synthesis result is expected to be produced; target language features are generally applied in multilingual speech synthesis scenarios to guide the language of the synthesis result.

The decoding layer can decode the text features or the features obtained by fusing the text features and the target voiceprint features and/or the target language features, so as to predict the acoustic features of each frame in the synthesized voice corresponding to the text to be synthesized, and obtain and output a synthesis result.

Based on any of the above embodiments, fig. 3 is a schematic diagram of a training structure of the text coding layer provided by the present invention. As shown in fig. 3, the text coding layer is obtained through adversarial training with a speaker recognition model; the speaker recognition model is used for performing speaker recognition on adversarial text features, and the adversarial text features are obtained by applying gradient reversal to the output of the text coding layer.

Specifically, the text coding layer can be regarded as an independent model that undergoes adversarial training with the speaker recognition model. During adversarial training, the training target of the text coding layer is to decorrelate the text features obtained by text encoding from speaker information, so that the text features contain as little speaker information as possible, while the training target of the speaker recognition model is to capture as much speaker information as possible from the text features in order to recognize the speaker corresponding to them.

During the adversarial training of the text coding layer and the speaker recognition model, the two models learn against each other in a game: speaker-related information is eliminated from the text features output by the text coding layer, while the speaker recognition model improves its ability to capture and discriminate speaker-related information in the text features. Specifically, a sample text can be input into the text coding layer, which outputs the text features of the sample text; gradient reversal (through a Gradient Reversal Layer, GRL) is then applied to these text features, and the gradient-reversed text features are input into the speaker recognition model for speaker recognition. Gradient reversal makes the training targets of the text coding layer and the speaker recognition model, which are connected back to back, opposite to each other, so that eventually the speaker recognition model cannot recognize the speaker corresponding to the text features extracted by the text coding layer, achieving the adversarial effect. The text features extracted by the text coding layer obtained through this adversarial training are decorrelated from speaker information, preventing the speech synthesized by the subsequent parts from carrying the timbre of the speakers associated with that language during training. A minimal sketch of such a gradient reversal layer is given below.
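The following is a minimal PyTorch sketch of a gradient reversal layer, shown only to illustrate the mechanism described above; the scaling factor `lambd` and the commented usage with a hypothetical `speaker_classifier` module are assumptions, not details from the patent.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient flows back into the text coding layer, so the layer
        # learns to remove speaker information while the speaker recognition model
        # still learns to extract it from its (un-reversed) input.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Hypothetical usage in the adversarial branch:
#   speaker_logits = speaker_classifier(grad_reverse(text_features))
#   adv_loss = torch.nn.functional.cross_entropy(speaker_logits, speaker_ids)
```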

Based on any of the above embodiments, the text coding layer includes a plurality of convolution structures and a language coding layer connected in series with the convolution structures, and the structure of the language coding layer is consistent with that of the language model.

In particular, in order to improve the feature extraction capability of the text coding layer, the structure of the language model can be further extended: a multi-layer convolution structure can be added before the language coding layer whose structure is the same as that of the language model. The multi-layer convolution structure may consist of several convolution layers. After the text to be synthesized, input to the text coding layer, passes through the multi-layer convolution structure for feature extraction, the extracted features are input into the language coding layer for further feature extraction and encoding. This ensures that the text coding layer has strong text feature extraction capability, improving the reliability and accuracy of speech synthesis. A structural sketch is given below.
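As an illustration of the structure just described, the sketch below stacks a few 1-D convolution layers in front of a BERT-style encoder. The layer sizes, kernel width, separate embedding table and the `inputs_embeds` hand-off to the language coding layer are all assumptions made for the example, not specifics from the patent; the hidden size must match the language model's hidden size.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class TextCodingLayer(nn.Module):
    """Multi-layer convolution structure followed by a language coding layer (BERT-style)."""

    def __init__(self, language_encoder: BertModel, vocab_size=30000, hidden=768, n_convs=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
                nn.BatchNorm1d(hidden),
                nn.ReLU(),
            )
            for _ in range(n_convs)
        ])
        self.language_encoder = language_encoder   # same structure as the language model

    def forward(self, token_ids):
        x = self.embed(token_ids).transpose(1, 2)   # (B, hidden, T) for Conv1d
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                       # back to (B, T, hidden)
        # Feed the convolutional features into the language coding layer.
        return self.language_encoder(inputs_embeds=x).last_hidden_state

# Example: encoder = TextCodingLayer(BertModel.from_pretrained("bert-base-multilingual-cased"))
```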

Based on any embodiment, the decoding layer comprises a fusion decoding layer and a general decoding layer; fig. 4 is a flowchart illustrating an embodiment of step 122 in the speech synthesis method provided in the present invention, and as shown in fig. 4, step 122 includes:

step 1221, inputting the text features, or the text features, and the target voiceprint features and/or the target language features into a fusion decoding layer of the decoding layer to obtain a plurality of fusion acoustic features output by the fusion decoding layer, where each fusion acoustic feature corresponds to a preset number of speech frames, and the preset number is an integer greater than 1.

And 1222, inputting the text feature and the multiple fused acoustic features, or inputting the text feature and the multiple fused acoustic features, and the target voiceprint feature and/or the target language feature into a general decoding layer of the decoding layer, so as to obtain a synthesis result output by the general decoding layer.

Specifically, the currently common decoding approach directly uses the input features to predict the acoustic features of every speech frame in the speech corresponding to the text to be synthesized. With this approach, however, the decoding layer is difficult to learn during model training. In view of this, on top of the general decoding approach, the embodiment of the present invention adds a fusion decoding layer that performs predictive decoding of coarse-grained acoustic features; the coarse-grained acoustic features output by the fusion decoding layer also serve as a reference for the fine-grained, i.e. general, decoding approach, so that the general decoding layer that performs speech synthesis in the general manner can complete synthesis more easily. Correspondingly, within the decoding layer, the fusion decoding layer performs coarser-grained decoding over groups of speech frames, while the general decoding layer performs finer-grained decoding for individual speech frames.

The fusion decoding layer can decode based on the input text features, or based on the features obtained by fusing the input text features with the target voiceprint feature and/or the target language feature, so as to predict one fused acoustic feature for every preset number of speech frames in the synthesized speech corresponding to the text to be synthesized. Unlike ordinary acoustic features, which correspond one-to-one to speech frames, one fused acoustic feature here corresponds to a preset number of consecutive speech frames in the synthesized speech; the preset number is set in advance. For example, when the preset number is 8, the fusion decoding layer predicts one fused acoustic feature for every 8 speech frames.

The general decoding layer can decode based on the input text features and the fused acoustic features, or based on the features obtained by fusing the input text features, the fused acoustic features and the target voiceprint feature and/or the target language feature, so as to predict the acoustic features of each speech frame in the synthesized speech corresponding to the text to be synthesized, and thereby obtain and output the synthesis result. In this process, the fused acoustic features serve as a reference and an aid for predicting the acoustic features of the corresponding group of speech frames, which reduces the difficulty of predicting the acoustic features of each speech frame. A simplified sketch of this coarse-to-fine decoding is shown below.
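For illustration, the sketch below keeps only the coarse-to-fine structure described above: a fusion decoding layer predicts one fused acoustic feature per group of 8 frames, and a general decoding layer predicts per-frame Mel features conditioned on a text summary, a voiceprint/language embedding and the fused features. The patent's decoding part is built on a Tacotron-style attention decoder; the omission of attention and autoregression, the module choices and all dimensions here are simplifications and assumptions.

```python
import torch
import torch.nn as nn

class CoarseToFineDecoder(nn.Module):
    def __init__(self, text_dim=768, cond_dim=256, mel_dim=80, group_size=8):
        super().__init__()
        self.group_size = group_size
        self.fusion_rnn = nn.GRU(text_dim + cond_dim, 256, batch_first=True)             # fusion decoding layer
        self.fusion_proj = nn.Linear(256, mel_dim)
        self.general_rnn = nn.GRU(text_dim + cond_dim + mel_dim, 256, batch_first=True)  # general decoding layer
        self.general_proj = nn.Linear(256, mel_dim)

    def forward(self, text_feats, cond, n_frames):
        # text_feats: (B, T_text, text_dim); cond: (B, cond_dim) voiceprint/language embedding.
        assert n_frames % self.group_size == 0
        n_groups = n_frames // self.group_size

        ctx = torch.cat([text_feats.mean(dim=1), cond], dim=-1)        # crude text summary + condition
        coarse_in = ctx.unsqueeze(1).repeat(1, n_groups, 1)
        fused_feats = self.fusion_proj(self.fusion_rnn(coarse_in)[0])  # one fused feature per group

        # Each fused feature is shared by its group of `group_size` consecutive frames.
        fused_per_frame = fused_feats.repeat_interleave(self.group_size, dim=1)
        fine_in = torch.cat([ctx.unsqueeze(1).repeat(1, n_frames, 1), fused_per_frame], dim=-1)
        frame_mels = self.general_proj(self.general_rnn(fine_in)[0])   # per-frame acoustic features
        return frame_mels, fused_feats

# Example: mels, fused = CoarseToFineDecoder()(torch.randn(2, 20, 768), torch.randn(2, 256), 64)
```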

According to the method provided by the embodiment of the invention, the acoustic features of each speech frame are predicted with reference to the fused acoustic feature of its group of a preset number of speech frames, which reduces the decoding difficulty in the speech synthesis process and improves speech synthesis efficiency.

Based on any of the above embodiments, the loss function of the speech synthesis model is determined based on the loss value of the synthesis result and the loss value of the fused acoustic feature, or based on the loss value of the synthesis result, the loss value of the fused acoustic feature, and the loss value of the target voiceprint feature.

Specifically, the loss value of the synthesis result represents a difference between a predicted synthesis result output by the speech synthesis model for the sample text in the training process and the acoustic feature corresponding to the sample speech.

The loss value of the fused acoustic features represents the difference between the plurality of predicted fused acoustic features obtained by the speech synthesis model for the sample text during training and the sample fused acoustic features. The sample fused acoustic features can be obtained by grouping the acoustic features of the speech frames of the sample speech, with every preset number of speech frames forming one group, and fusing each group; the fusion here can be done by averaging the acoustic features of the preset number of speech frames within a group and taking the average as the sample fused acoustic feature of that group.

The loss value of the target voiceprint feature may be a difference between a voiceprint feature of a predicted synthesis result output by the speech synthesis model for the sample text in the training process and the target voiceprint feature.

Owing to the arrangement of the fusion decoding layer in the speech synthesis model, the loss value of the fused acoustic features during training can also be included in the loss function, which accelerates the convergence of the decoding layer of the speech synthesis model and further reduces its learning difficulty.

In addition, adding the loss value of the target voiceprint feature to the loss function makes the timbre of the speech synthesized by the speech synthesis model as consistent as possible with the timbre represented by the target voiceprint feature. Here, the loss value of the target voiceprint feature may be expressed as the similarity between the voiceprint feature of the predicted synthesis result and the target voiceprint feature, and the similarity may be measured by cosine similarity, Euclidean distance or the like.

For example, the Loss function Loss determined based on the Loss value of the synthesis result, the Loss value of the fused acoustic feature, and the Loss value of the target voiceprint feature may be expressed as follows:

Loss = t + avgloss + cosi(e_i, s_i);

avgloss = MSE(PAM, AM);

where t represents the loss value of the synthesis result, avgloss represents the loss value of the fused acoustic features, and cosi(e_i, s_i) represents the loss value of the target voiceprint feature. Here avgloss is computed with a mean square error (MSE) loss function, in which PAM denotes the fused acoustic features predicted by the speech synthesis model for the sample text and AM denotes the sample fused acoustic features of the sample speech; cosi(e_i, s_i) denotes the cosine similarity between e_i and s_i, where e_i is the voiceprint feature of the predicted synthesis result and s_i is the target voiceprint feature.
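As an illustration only, the sketch below computes a loss of this form in PyTorch. The synthesis-result loss is taken here as a per-frame MSE, the sample fused acoustic features are built by averaging each group of frames as described above, and the voiceprint term is taken as 1 minus the cosine similarity so that higher similarity lowers the loss; these specific choices are assumptions, not prescribed by the patent.

```python
import torch
import torch.nn.functional as F

def synthesis_loss(pred_mels, target_mels, pred_fused, pred_voiceprint, target_voiceprint,
                   group_size=8):
    # t: loss value of the synthesis result (predicted vs. sample per-frame acoustic features).
    t = F.mse_loss(pred_mels, target_mels)

    # avgloss = MSE(PAM, AM): AM is obtained by averaging the sample acoustic features
    # within each group of `group_size` consecutive frames.
    B, n_frames, mel_dim = target_mels.shape
    assert n_frames % group_size == 0
    target_fused = target_mels.reshape(B, n_frames // group_size, group_size, mel_dim).mean(dim=2)
    avgloss = F.mse_loss(pred_fused, target_fused)

    # Voiceprint term from cosi(e_i, s_i), written here as (1 - cosine similarity).
    cos = F.cosine_similarity(pred_voiceprint, target_voiceprint, dim=-1).mean()
    return t + avgloss + (1.0 - cos)
```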

Traditional voiceprint models such as i-vector, x-vector and d-vector work well for voiceprint extraction within a single language, but perform poorly across languages: for example, the same speaker speaking Chinese and speaking English often shows a large difference in timbre. Traditional voiceprint models do not take this into account, which causes the extracted voiceprint features to be strongly correlated with the language. If such voiceprint features are directly applied to speech synthesis, the timbre of the synthesized speech is directly affected: for example, if the voiceprint feature of speaker A is extracted from speaker A's Chinese speech and used as the target voiceprint feature to guide English speech synthesis, the timbre of the synthesized speech will still be speaker A's timbre when speaking Chinese, which differs greatly from that speaker's timbre when actually speaking English.

In response to this problem, based on any of the above embodiments, the target voiceprint characteristics are determined based on the following steps:

inputting the speech of the target speaker into a voiceprint model to obtain the target voiceprint feature output by the voiceprint model; the voiceprint model is obtained through adversarial training with a language identification model, the language identification model is used for performing language identification on adversarial voiceprint features, and the adversarial voiceprint features are obtained by applying gradient reversal to the output of the voiceprint model.

Specifically, fig. 5 is a schematic diagram of a training structure of the voiceprint model provided by the present invention. As shown in fig. 5, the voiceprint model can undergo adversarial training with the language identification model. During adversarial training, the training target of the voiceprint model is to decorrelate the extracted voiceprint features from language information, so that the voiceprint features contain as little language information as possible, while the training target of the language identification model is to capture as much language information as possible from the voiceprint features in order to identify the language corresponding to them.

During the adversarial training of the voiceprint model and the language identification model, the two models learn against each other in a game: language-related information is eliminated from the voiceprint features output by the voiceprint model, while the language identification model improves its ability to capture and discriminate language-related information in the voiceprint features. Specifically, sample speech can be input into the voiceprint model, which outputs the voiceprint features of the sample speech; gradient reversal is then applied to these voiceprint features, and the gradient-reversed voiceprint features are input into the language identification model for language identification. Gradient reversal makes the training targets of the voiceprint model and the language identification model, which are connected back to back, opposite to each other, so that eventually the language identification model cannot identify the language corresponding to the voiceprint features extracted by the voiceprint model, achieving the adversarial effect. The voiceprint features extracted by the voiceprint model obtained through this adversarial training are thus decorrelated from language information.

Extracting voiceprints on the basis of the voiceprint model obtained through this training ensures that the target voiceprint feature input to the speech synthesis model is language-independent, so that the target voiceprint feature does not introduce language-level interference during cross-language speech synthesis, thereby preserving the timbre of the synthesis result.

In addition, a common voiceprint model, such as the d-vector model GE2E (Generalized End-to-End), adopts an LSTM (Long Short-Term Memory network) as its model structure. Owing to the structural characteristics of the LSTM, processing of a new frame must wait until processing of the previous frame has finished, so the parallel-computation advantage of GPUs (Graphics Processing Units) cannot be exploited when running the LSTM.

In this regard, based on any of the above embodiments, the voiceprint model includes a residual network and a fully connected layer connected in series.

Specifically, in order to make full use of the parallel-computation advantage of the GPU, the embodiment of the present invention improves the model structure of the voiceprint model, replacing the common LSTM structure with a combination of a residual network (ResNet) and a fully connected layer. The resulting voiceprint model no longer needs to encode features frame by frame; it can encode all speech frames of the input speech in parallel, which effectively improves the efficiency of voiceprint feature extraction. A structural sketch is given below.
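The sketch below shows one way a ResNet-plus-fully-connected voiceprint extractor could be arranged, assuming PyTorch and torchvision; the specific ResNet variant, the single-channel input adaptation, the embedding size and the normalization are assumptions for illustration rather than details from the patent.

```python
import torch
import torch.nn as nn
import torchvision

class VoiceprintModel(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        # Accept 1-channel spectrogram "images" instead of 3-channel RGB.
        resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        resnet.fc = nn.Identity()                # keep the 512-dim pooled features
        self.backbone = resnet
        self.fc = nn.Linear(512, embed_dim)      # fully connected layer -> voiceprint feature

    def forward(self, mel):
        # mel: (B, n_frames, n_mels); all frames are processed in parallel,
        # unlike a frame-by-frame LSTM.
        emb = self.backbone(mel.unsqueeze(1))    # (B, 512)
        return torch.nn.functional.normalize(self.fc(emb), dim=-1)

# Example: voiceprint = VoiceprintModel()(torch.randn(2, 200, 80))
```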

Based on any of the above embodiments, fig. 6 is a schematic structural diagram of a speech synthesis model provided by the present invention. Solid lines in fig. 6 show the modules used in the prediction process of the speech synthesis model, including the text coding layer, the fusion decoding layer and the general decoding layer; dashed lines show the modules used only during training, including the gradient reversal and the speaker recognition model.

The speech synthesis model may be particularly divided into an encoding part and a decoding part:

the coding part, namely the text coding layer, is realized by additionally adding a multilayer convolution structure on the basis of a language model BERT. The multi-layer convolution structure herein may specifically be a 3-layer convolution. For the text coding layer, the initial parameters may be BERT and training parameters, and in the training process of the speech synthesis model, the initial parameters of the text coding layer may not be fixed, but the parameters may be updated at a lower learning rate, for example, the parameters of the text coding layer may be updated at a learning rate of about 1e-6, so that the text coding layer is more suitable for a multilingual speech synthesis environment than BERT itself.

In addition, in a multilingual speech synthesis scenario, the number of speakers corresponding to the sample speech of a single language in the training samples is small, which directly causes the text features extracted by the text coding layer of the trained speech synthesis model to be correlated with those speakers. Therefore, in the training stage of the text coding layer, a gradient reversal and a speaker recognition model are introduced to realize adversarial training between the text coding layer and the speaker recognition model, so that the text features extracted by the text coding layer obtained through adversarial training are decorrelated from speaker information, preventing the speech synthesized by the subsequent parts from carrying the timbre of the speakers associated with that language during training.

The decoding part can be constructed based on the decoder of a Tacotron model and specifically comprises the fusion decoding layer and the general decoding layer. The fusion decoding layer combines the text features, the target voiceprint feature and the target language feature to predict one fused acoustic feature for every preset number of speech frames in the synthesized speech; the general decoding layer combines the text features, the target voiceprint feature, the target language feature and each fused acoustic feature to predict the acoustic feature of each speech frame in the synthesized speech, thereby obtaining the synthesis result. The acoustic feature here may be a Mel-spectrum feature. The arrangement of the fusion decoding layer reduces the decoding difficulty in the speech synthesis process and improves speech synthesis efficiency.

The following describes a speech synthesis apparatus provided by the present invention, and the speech synthesis apparatus described below and the speech synthesis method described above may be referred to correspondingly.

Fig. 7 is a schematic structural diagram of a speech synthesis apparatus provided in the present invention, and as shown in fig. 7, the apparatus includes:

a text determining unit 710, configured to determine a text to be synthesized;

the speech synthesis unit 720 is configured to input the text to be synthesized into a speech synthesis model, so as to obtain a synthesis result output by the speech synthesis model; the speech synthesis model is obtained, on the basis of a language model for extracting text features, through adversarial training jointly with a speaker recognition model based on the text features, using sample text and its corresponding sample speech.

The apparatus provided by the embodiment of the present invention performs speech synthesis on the text to be synthesized by applying an end-to-end speech synthesis model constructed based on a language model. Relying on the strong text understanding capability of the language model, it ensures reasonable prediction of prosody- and phoneme-level information during speech synthesis, thereby ensuring the reliability and accuracy of the synthesis result. No front-end module needs to be added in this process, which saves a large amount of manpower and time; in particular, in a multilingual speech synthesis scenario there is no need to separately obtain a front-end module for each language, which greatly reduces the difficulty of implementing the speech synthesis task and helps popularize speech synthesis applications.

In addition, during training of the speech synthesis model, adversarial training is performed jointly with a speaker recognition model based on text features. This ensures that the text features extracted by the speech synthesis model are decorrelated from speaker information, and improves how closely the timbre of the synthesis result matches the target timbre in multilingual speech synthesis, especially when synthesizing speech in less-common languages.

Based on any of the above embodiments, the speech synthesis unit 720 includes:

the coding subunit is used for inputting the text to be synthesized to a text coding layer of the speech synthesis model to obtain text characteristics output by the text coding layer; the text coding layer is established based on the language model;

and the decoding subunit is used for inputting the text features, or the text features together with the target voiceprint feature and/or the target language feature, into a decoding layer of the speech synthesis model to obtain a synthesis result output by the decoding layer.

Based on any of the above embodiments, the text coding layer is obtained through adversarial training with the speaker recognition model, the speaker recognition model is used for performing speaker recognition on adversarial text features, and the adversarial text features are obtained by applying gradient reversal to the output of the text coding layer.

Based on any one of the above embodiments, the text coding layer includes a plurality of layers of convolution structures and a language coding layer connected in series with the convolution structures, and the structure of the language coding layer is consistent with that of the language model.

Based on any of the above embodiments, the decoding subunit is configured to:

inputting the text features, or the text features together with the target voiceprint feature and/or the target language feature, into a fusion decoding layer of the decoding layer to obtain a plurality of fused acoustic features output by the fusion decoding layer, wherein each fused acoustic feature corresponds to a preset number of speech frames, and the preset number is an integer greater than 1;

inputting the text feature and the multiple fused acoustic features, or inputting the text feature, the multiple fused acoustic features, and the target voiceprint feature and/or the target language feature into a general decoding layer of the decoding layer, so as to obtain the synthesis result output by the general decoding layer.

According to any of the above embodiments, the loss function of the speech synthesis model is determined based on the loss value of the synthesis result and the loss value of the fused acoustic feature, or based on the loss value of the synthesis result, the loss value of the fused acoustic feature and the loss value of the target voiceprint feature.

Based on any of the above embodiments, the apparatus further comprises a voiceprint extraction unit, the voiceprint extraction unit is configured to:

inputting speech of a target speaker into a voiceprint model to obtain the target voiceprint feature output by the voiceprint model;

the voiceprint model is obtained through adversarial training with a language identification model, the language identification model is used for performing language identification on adversarial voiceprint features, and the adversarial voiceprint features are obtained by applying gradient reversal to the output of the voiceprint model.

Based on any one of the above embodiments, the voiceprint model comprises a residual network and a fully connected layer connected in series.

Fig. 8 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 8, the electronic device may include a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a speech synthesis method comprising: determining a text to be synthesized; and inputting the text to be synthesized into a speech synthesis model to obtain a synthesis result output by the speech synthesis model; the speech synthesis model is obtained, on the basis of a language model for extracting text features, through adversarial training jointly with a speaker recognition model based on the text features, using sample text and its corresponding sample speech.

In addition, the logic instructions in the memory 830 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the speech synthesis method provided above, the method comprising: determining a text to be synthesized; and inputting the text to be synthesized into a speech synthesis model to obtain a synthesis result output by the speech synthesis model; the speech synthesis model is obtained, on the basis of a language model for extracting text features, through adversarial training jointly with a speaker recognition model based on the text features, using sample text and its corresponding sample speech.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the speech synthesis method provided above, the method comprising: determining a text to be synthesized; and inputting the text to be synthesized into a speech synthesis model to obtain a synthesis result output by the speech synthesis model; the speech synthesis model is obtained, on the basis of a language model for extracting text features, through adversarial training jointly with a speaker recognition model based on the text features, using sample text and its corresponding sample speech.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
