Speech synthesis method and training method of speech synthesis model


Reading note: this technology, "Speech synthesis method and training method of speech synthesis model" (一种语音合成方法及语音合成模型的训练方法), was designed and created by 高占杰 and 李文杰 on 2021-08-12. Its main content is as follows: the disclosure provides a speech synthesis method and a training method of a speech synthesis model, and relates to the technical field of artificial intelligence, in particular to the fields of deep learning, speech technology and the like. The specific implementation scheme is: acquiring a text to be synthesized and at least two voices to be synthesized; obtaining a tone hidden vector of a first voice to be synthesized and a style hidden vector of a second voice to be synthesized in the at least two voices to be synthesized; acquiring a text hidden vector of the text to be synthesized; and splicing the tone hidden vector, the style hidden vector and the text hidden vector, and generating a target synthesized voice of the text to be synthesized based on the spliced hidden vector. Therefore, for the same text to be synthesized, the disclosure can combine multiple different combinations of the tone hidden vector and the style hidden vector to generate multiple different target synthesized voices, realizing style migration, enabling each tone to have multiple styles, and improving the efficiency and reliability of the speech synthesis process.

1. A method of speech synthesis comprising:

acquiring a text to be synthesized and at least two voices to be synthesized;

obtaining a tone hidden vector of a first voice to be synthesized and a style hidden vector of a second voice to be synthesized in the at least two voices to be synthesized;

acquiring a text hidden vector of the text to be synthesized;

and splicing the tone hidden vector, the style hidden vector and the text hidden vector, and generating a target synthesized voice of the text to be synthesized based on the spliced hidden vector.

2. The speech synthesis method according to claim 1, wherein the obtaining of the timbre hidden vector of a first speech to be synthesized and the style hidden vector of a second speech to be synthesized of the at least two speeches to be synthesized comprises:

extracting features of the tone of the first voice to be synthesized, and generating the tone hidden vector of the first voice to be synthesized according to the extracted features of the tone;

and extracting features of the style of the second voice to be synthesized, and generating the style hidden vector of the second voice to be synthesized according to the extracted style features.

3. The speech synthesis method of claim 1, wherein the concatenating the timbre hidden vector, the style hidden vector, and the text hidden vector comprises:

performing dimension conversion on the tone hidden vector, the style hidden vector and the text hidden vector to obtain a target text hidden vector, a target tone hidden vector and a target style hidden vector with the same dimension;

and splicing the target text hidden vector, the target tone hidden vector and the target style hidden vector.

4. The speech synthesis method according to any one of claims 1 to 3, wherein the obtaining of the timbre hidden vector of a first speech to be synthesized and the style hidden vector of a second speech to be synthesized of the at least two speeches to be synthesized comprises:

inputting the first voice to be synthesized and the second voice to be synthesized into a voice synthesis model;

coding the first to-be-synthesized voice by a tone coding network in the voice synthesis model to output the tone hidden vector corresponding to the tone of the first to-be-synthesized voice;

and coding the second voice to be synthesized by a style coding network in the voice synthesis model so as to output the style hidden vector corresponding to the style of the second voice to be synthesized.

5. The speech synthesis method according to claim 1, wherein the obtaining of the text hidden vector of the text to be synthesized comprises:

inputting the text to be synthesized into a speech synthesis model;

and encoding the text to be synthesized by a text encoding network in the speech synthesis model so as to output the text hidden vector of the text to be synthesized.

6. The speech synthesis method according to any one of claims 1-3, wherein the generating a target synthesized speech of the text to be synthesized based on the spliced hidden vector comprises:

and inputting the spliced hidden vector into a decoding network in a speech synthesis model for decoding so as to output the target synthesized speech of the text to be synthesized.

7. A method of training a speech synthesis model, comprising:

acquiring a sample text marked with a voice synthesis sample result and at least two sample voices to be synthesized aiming at the same sample speaker, wherein the at least two sample voices to be synthesized have at least two different styles;

inputting the sample text and the at least two sample voices to be synthesized into a voice synthesis model to be trained, and outputting a voice synthesis training result aiming at the sample text;

and obtaining the difference between the voice synthesis sample result and the voice synthesis training result, adjusting the model parameters of the voice synthesis model to be trained according to the difference, returning to the step of acquiring the sample text marked with the voice synthesis sample result and the at least two sample voices to be synthesized for the same sample speaker until the training result meets the training end condition, and determining the voice synthesis model to be trained after the model parameters are adjusted for the last time as the trained voice synthesis model.

8. The method for training a speech synthesis model according to claim 7, wherein the inputting the sample text and the at least two sample voices to be synthesized into the speech synthesis model to be trained and outputting a speech synthesis training result for the sample text comprises:

inputting the at least two sample voices to be synthesized into the voice synthesis model to be trained, and outputting a sample tone hidden vector corresponding to the tone of a first sample voice to be synthesized and a sample style hidden vector corresponding to the style of a second sample voice to be synthesized in the at least two sample voices to be synthesized;

inputting the sample text into the speech synthesis model to be trained, and outputting a sample text hidden vector of the sample text;

and splicing the sample tone hidden vector, the sample style hidden vector and the sample text hidden vector, and generating the speech synthesis training result of the sample text based on the spliced hidden vector.

9. A speech synthesis apparatus comprising:

a first obtaining module, configured to acquire a text to be synthesized and at least two voices to be synthesized;

a second obtaining module, configured to obtain a timbre hidden vector of a first voice to be synthesized and a style hidden vector of a second voice to be synthesized in the at least two voices to be synthesized;

a third obtaining module, configured to obtain a text hidden vector of the text to be synthesized;

and the generating module is used for splicing the tone hidden vector, the style hidden vector and the text hidden vector and generating the target synthesized voice of the text to be synthesized based on the spliced hidden vector.

10. The speech synthesis apparatus of claim 9, wherein the second obtaining module is further configured to:

extracting features of the tone of the first voice to be synthesized, and generating the tone hidden vector of the first voice to be synthesized according to the extracted features of the tone;

and extracting features of the style of the second voice to be synthesized, and generating the style hidden vector of the second voice to be synthesized according to the extracted style features.

11. The speech synthesis apparatus of claim 9, wherein the generation module is further configured to:

performing dimension conversion on the tone hidden vector, the style hidden vector and the text hidden vector to obtain a target text hidden vector, a target tone hidden vector and a target style hidden vector with the same dimension;

and splicing the target text hidden vector, the target tone hidden vector and the target style hidden vector.

12. The speech synthesis apparatus according to any one of claims 9-11, wherein the second obtaining module is further configured to:

inputting the first voice to be synthesized and the second voice to be synthesized into a voice synthesis model;

coding the first to-be-synthesized voice by a tone coding network in the voice synthesis model to output the tone hidden vector corresponding to the tone of the first to-be-synthesized voice;

and coding the second voice to be synthesized by a style coding network in the voice synthesis model so as to output the style hidden vector corresponding to the style of the second voice to be synthesized.

13. The speech synthesis apparatus of claim 9, wherein the third obtaining module is further configured to:

inputting the text to be synthesized into a speech synthesis model;

and encoding the text to be synthesized by a text encoding network in the speech synthesis model so as to output the text hidden vector of the text to be synthesized.

14. The speech synthesis apparatus of any one of claims 9-11, wherein the generation module is further configured to:

and inputting the spliced hidden vector into a decoding network in a speech synthesis model for decoding so as to output the target synthesized speech of the text to be synthesized.

15. An apparatus for training a speech synthesis model, comprising:

an acquisition module, configured to acquire a sample text marked with a speech synthesis sample result and at least two sample speeches to be synthesized for the same sample speaker, wherein the at least two sample speeches to be synthesized have at least two different styles;

the output module is used for inputting the sample text and the at least two sample voices to be synthesized into a voice synthesis model to be trained and outputting a voice synthesis training result aiming at the sample text;

and the determining module is used for acquiring the difference between the voice synthesis sample result and the voice synthesis training result, adjusting the model parameters of the voice synthesis model to be trained according to the difference, returning to the step of acquiring the sample text marked with the voice synthesis sample result and the at least two sample voices to be synthesized for the same sample speaker until the training result meets the training end condition, and determining the voice synthesis model to be trained after the model parameters are adjusted for the last time as the trained voice synthesis model.

16. The apparatus for training a speech synthesis model according to claim 15, wherein the output module is further configured to:

inputting the at least two sample voices to be synthesized into the voice synthesis model to be trained, and outputting a sample tone hidden vector corresponding to the tone of a first sample voice to be synthesized and a sample style hidden vector corresponding to the style of a second sample voice to be synthesized in the at least two sample voices to be synthesized;

inputting the sample text into the speech synthesis model to be trained, and outputting a sample text hidden vector of the sample text;

and splicing the sample tone hidden vector, the sample style hidden vector and the sample text hidden vector, and generating the speech synthesis training result of the sample text based on the spliced hidden vector.

17. An electronic device comprising a processor and a memory;

wherein the processor reads the executable program code stored in the memory and runs a program corresponding to the executable program code, so as to implement the method of any one of claims 1-6 or claims 7-8.

18. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6 or 7-8.

19. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method of claim 1 or 7.

Technical Field

The present disclosure relates to the field of computer technology, and more particularly to the field of artificial intelligence, and more particularly to the fields of deep learning, speech technology, and the like.

Background

In recent years, with the rapid development of artificial intelligence technology, speech synthesis has been applied more and more widely. Current speech synthesis is mainly based on an acoustic model, which converts a text or phoneme sequence into a mel spectrum, and a vocoder, which converts the mel spectrum into speech. In this case, the style of the synthesized speech follows the style of the input speech to be synthesized. For example, if the input speech to be synthesized is Mandarin, the synthesized speech obtained is Mandarin; if the input speech to be synthesized sounds happy, the synthesized speech obtained also sounds happy.

However, in the related art, when recording the speech to be synthesized, it is not possible to record a sound library covering every kind of style for each speaker, because of the recording cost and the limits of each speaker's ability to imitate different styles. This tends to limit the expressiveness of speech synthesized from the sound library.

Therefore, how to improve the efficiency and reliability in the speech synthesis process has become one of the important research directions.

Disclosure of Invention

The present disclosure provides a speech synthesis method and a training method of a speech synthesis model.

According to an aspect of the present disclosure, there is provided a speech synthesis method including:

acquiring a text to be synthesized and at least two voices to be synthesized;

obtaining a tone hidden vector of a first voice to be synthesized and a style hidden vector of a second voice to be synthesized in the at least two voices to be synthesized;

acquiring a text hidden vector of the text to be synthesized;

and splicing the tone hidden vector, the style hidden vector and the text hidden vector, and generating a target synthesized voice of the text to be synthesized based on the spliced hidden vector.

According to another aspect of the present disclosure, there is provided a method for training a speech synthesis model, including:

acquiring a sample text marked with a voice synthesis sample result and at least two sample voices to be synthesized aiming at the same sample speaker, wherein the at least two sample voices to be synthesized have at least two different styles;

inputting the sample text and the at least two sample voices to be synthesized into a voice synthesis model to be trained, and outputting a voice synthesis training result aiming at the sample text;

and obtaining the difference between the voice synthesis sample result and the voice synthesis training result, adjusting the model parameters of the voice synthesis model to be trained according to the difference, returning to the step of acquiring the sample text marked with the voice synthesis sample result and the at least two sample voices to be synthesized for the same sample speaker until the training result meets the training end condition, and determining the voice synthesis model to be trained after the model parameters are adjusted for the last time as the trained voice synthesis model.

According to another aspect of the present disclosure, there is provided a speech synthesis apparatus including:

a first obtaining module, configured to acquire a text to be synthesized and at least two voices to be synthesized;

a second obtaining module, configured to obtain a timbre hidden vector of a first voice to be synthesized and a style hidden vector of a second voice to be synthesized in the at least two voices to be synthesized;

a third obtaining module, configured to obtain a text hidden vector of the text to be synthesized;

and the generating module is used for splicing the tone hidden vector, the style hidden vector and the text hidden vector and generating the target synthesized voice of the text to be synthesized based on the spliced hidden vector.

According to another aspect of the present disclosure, there is provided a training apparatus for a speech synthesis model, including:

an acquisition module, configured to acquire a sample text marked with a speech synthesis sample result and at least two sample speeches to be synthesized for the same sample speaker, wherein the at least two sample speeches to be synthesized have at least two different styles;

the output module is used for inputting the sample text and the at least two sample voices to be synthesized into a voice synthesis model to be trained and outputting a voice synthesis training result aiming at the sample text;

and the determining module is used for acquiring the difference between the voice synthesis sample result and the voice synthesis training result, adjusting the model parameters of the voice synthesis model to be trained according to the difference, returning to the step of acquiring the sample text marked with the voice synthesis sample result and the at least two sample voices to be synthesized for the same sample speaker until the training result meets the training end condition, and determining the voice synthesis model to be trained after the model parameters are adjusted for the last time as the trained voice synthesis model.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of speech synthesis according to the first aspect of the present disclosure or the method of training a speech synthesis model according to the second aspect.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the speech synthesis method of the first aspect of the present disclosure or the training method of the speech synthesis model of the second aspect.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, characterized in that the computer program realizes the steps of the method of claim 1 or the steps of the method of claim 7 when being executed by a processor.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;

FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;

FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;

FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;

FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;

FIG. 6 is a block diagram of a speech synthesis apparatus for implementing the speech synthesis method of an embodiment of the present disclosure;

FIG. 7 is a block diagram of a training apparatus for a speech synthesis model, for implementing the training method of a speech synthesis model of an embodiment of the present disclosure;

FIG. 8 is a block diagram of an electronic device for implementing the speech synthesis method or the training method of a speech synthesis model according to an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The following briefly describes the technical field to which the disclosed solution relates:

computer Technology (Computer Technology), the content of which is very extensive, can be roughly divided into several aspects of Computer system Technology, Computer machine element Technology, Computer component Technology and Computer assembly Technology. The computer technology comprises the following steps: the basic principle of the operation method, the design of an arithmetic unit, an instruction system, the design of a Central Processing Unit (CPU), the pipeline principle, the application of the basic principle in the CPU design, a storage system, a bus and input and output.

AI (Artificial Intelligence) is the discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and it involves technologies at both the hardware level and the software level. Artificial intelligence software technologies generally include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.

DL (Deep Learning) learns the intrinsic regularities and representation hierarchies of sample data, and the information obtained in this learning process greatly helps the interpretation of data such as text, images and sound. Its ultimate goal is to give machines the same analytical and learning ability as humans, so that they can recognize data such as text, images and sound. Deep learning is a complex machine learning approach whose results in speech and image recognition far exceed those of earlier related techniques. Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization, and other related fields. Deep learning enables machines to imitate human activities such as seeing, hearing and thinking, solves many complex pattern recognition problems, and has driven great progress in artificial-intelligence-related technology.

Speech technology refers to key technologies in the computer field such as Automatic Speech Recognition (ASR) and Text To Speech (TTS). Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes, with more advantages than other interaction modes.

A speech synthesis method and a method for training a speech synthesis model according to the embodiments of the present disclosure are described below with reference to the drawings.

Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure.

As shown in fig. 1, a speech synthesis method proposed in this embodiment includes the following steps:

s101, obtaining a text to be synthesized and at least two voices to be synthesized.

The text to be synthesized may be a sequence of words or phonemes.

It should be noted that the goal of speech synthesis is to generate the corresponding speech from a text; however, plain text is less stable in the speech synthesis process than a phoneme sequence. Therefore, in practical applications, if the acquired input is plain text, the text may optionally be used directly as the text to be synthesized; alternatively, the words may first be converted into the corresponding phoneme sequence, and the phoneme sequence may then be used as the text to be synthesized.
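For illustration only, the following is a minimal, self-contained sketch of the optional word-to-phoneme conversion described above. The toy lexicon and the spell-out fallback are assumptions made for the example; a production frontend would use a full grapheme-to-phoneme (G2P) module.

```python
# Minimal sketch of converting input text to a phoneme sequence before synthesis.
# TOY_LEXICON is a made-up example; a real frontend would use a full G2P model or lexicon.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def text_to_phonemes(text: str) -> list[str]:
    phonemes = []
    for word in text.lower().split():
        # Fall back to spelling out unknown words character by character.
        phonemes.extend(TOY_LEXICON.get(word, list(word.upper())))
    return phonemes

print(text_to_phonemes("Hello world"))  # ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```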

The voice to be synthesized at least comprises a tone and a style.

Wherein the style comprises at least one of the following dimensions: dialect, accent, emotion, rhythm, broadcast style, etc.

For example, for an acquired voice 1 to be synthesized, the voice includes tone 1 and style 1; aiming at the acquired voice 2 to be synthesized, the voice comprises tone 2 and style 2; wherein, style 1, includes: dialect 1 and accent 1; style 2, comprising: dialect 2, accent 2, and prosodic rhythm 1.

It should be noted that, in the present disclosure, the two voices to be synthesized may have the same tone and/or style, or may have different tones and/or styles.

For example, for an acquired voice 1 to be synthesized, the voice includes tone 1 and style 1; and aiming at the acquired voice 2 to be synthesized, the voice comprises tone 1 and style 2.

S102, obtaining a tone hidden vector of a first voice to be synthesized and a style hidden vector of a second voice to be synthesized in at least two voices to be synthesized.

In the embodiment of the disclosure, in order to synthesize a target synthesized voice for a text to be synthesized in a style migration manner, any two voices to be synthesized may be selected from all voices to be synthesized to obtain a first voice to be synthesized and a second voice to be synthesized, and then the two voices to be synthesized are separated in tone and style, so as to obtain a tone hidden vector of the first voice to be synthesized and a style hidden vector of the second voice to be synthesized.

As a possible implementation manner, feature extraction may be performed on the tone of the first speech to be synthesized, and a tone hidden vector of the first speech to be synthesized is generated according to the features of the extracted tone; and performing feature extraction on the style of the second voice to be synthesized, and generating a style implicit vector of the second voice to be synthesized according to the extracted style feature.
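The disclosure does not fix how the tone and style features are extracted; one common possibility is a reference-encoder-style module that summarizes a mel spectrogram into a fixed-size hidden vector. The PyTorch sketch below illustrates that idea only; the GRU architecture, dimensions and variable names are assumptions, not the specific implementation of the disclosure.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Sketch of a tone or style encoder: a GRU over mel-spectrogram frames whose
    final hidden state is projected to a fixed-size hidden vector. The architecture
    is an assumption; the disclosure does not specify one."""

    def __init__(self, n_mels: int = 80, hidden_dim: int = 128, out_dim: int = 64):
        super().__init__()
        self.gru = nn.GRU(input_size=n_mels, hidden_size=hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, out_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> hidden vector: (batch, out_dim)
        _, last_hidden = self.gru(mel)
        return self.proj(last_hidden[-1])

tone_encoder = ReferenceEncoder(out_dim=64)   # encodes the first voice to be synthesized
style_encoder = ReferenceEncoder(out_dim=32)  # encodes the second voice to be synthesized

mel_1 = torch.randn(1, 200, 80)  # dummy mel spectrogram of the first voice
mel_2 = torch.randn(1, 250, 80)  # dummy mel spectrogram of the second voice
tone_vec = tone_encoder(mel_1)   # tone hidden vector, shape (1, 64)
style_vec = style_encoder(mel_2) # style hidden vector, shape (1, 32)
```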

S103, obtaining a text hidden vector of the text to be synthesized.

In the embodiment of the disclosure, after the text to be synthesized is obtained, the semantic features of each character vector in the text to be synthesized can be obtained in multiple ways, so as to obtain the text hidden vector of the text to be synthesized.

And S104, splicing the tone hidden vector, the style hidden vector and the text hidden vector, and generating a target synthesized voice of the text to be synthesized based on the spliced hidden vector.

In the embodiment of the disclosure, after the timbre hidden vector, the style hidden vector and the text hidden vector are obtained, a target synthesized voice of a text to be synthesized can be generated by splicing.

In the present disclosure, since the two voices to be synthesized input by the user are not particularly limited, there are many different combinations of the timbre hidden vector and the style hidden vector.

Optionally, for the same text to be synthesized, a plurality of different combinations of the timbre hidden vector and the style hidden vector may be combined to generate a plurality of different target synthesized voices. For example, the text hidden vector 1 is spliced with the timbre hidden vectors 1-2 and the style hidden vectors 1-3 to generate target synthesized voices 1-6 of the text to be synthesized.

Optionally, the tone of the speaker may be defined for the same text to be synthesized, that is, the same tone hidden vector may be spliced with multiple different style hidden vectors, so as to generate multiple different target synthesized voices. For example, the text hidden vector 1 and the timbre hidden vector 1 are spliced with the style hidden vectors 1 to 3, so that target synthesized voices 1 to 3 of the text to be synthesized can be generated.

It should be noted that, in the present disclosure, a specific manner of splicing the timbre hidden vector, the style hidden vector, and the text hidden vector is not limited, and may be selected according to an actual situation. Optionally, the tone hidden vector, the style hidden vector and the text hidden vector can be directly spliced, so that a hidden vector with the dimension being the sum of the dimensions of the tone hidden vector, the style hidden vector and the text hidden vector is obtained; optionally, dimension conversion may be performed on the tone hidden vector, the style hidden vector, and the text hidden vector, and then the tone hidden vector, the style hidden vector, and the text hidden vector after the dimension conversion are spliced.

According to the speech synthesis method provided by the embodiment of the disclosure, at least two voices to be synthesized are separated in tone and style, so that a tone hidden vector of a first voice to be synthesized and a style hidden vector of a second voice to be synthesized are obtained, a text hidden vector of a text to be synthesized is obtained, and then, for the same text to be synthesized, multiple different combinations of the tone hidden vector and the style hidden vector are combined, so that multiple different target synthesized voices are generated, style migration is realized, each tone can have multiple styles, and efficiency and reliability in a speech synthesis process are improved.

Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure.

As shown in fig. 2, the speech synthesis method proposed by the present disclosure specifically includes the following steps based on the above embodiment:

s201, obtaining a text to be synthesized and at least two voices to be synthesized.

Step S201 is identical to step S101, and is not described herein again.

The specific process of obtaining the timbre hidden vector of the first speech to be synthesized and the style hidden vector of the second speech to be synthesized in the at least two speeches to be synthesized in step S102 in the previous embodiment includes the following steps S202 to S204.

S202, inputting the first voice to be synthesized and the second voice to be synthesized into a voice synthesis model.

The speech synthesis model is a converged (fully trained) model including a plurality of networks (a tone coding network, a style coding network, a text coding network, a decoding network, and the like).

S203, the first voice to be synthesized is coded by the tone coding network in the voice synthesis model so as to output a tone implicit vector corresponding to the tone of the first voice to be synthesized.

In the embodiment of the present disclosure, a tone Encoder (Encoder) in a tone encoding network in a speech synthesis model may encode the first speech to be synthesized to output a tone hidden vector corresponding to a tone of the first speech to be synthesized.

And S204, coding the second voice to be synthesized by the style coding network in the voice synthesis model so as to output a style hidden vector corresponding to the style of the second voice to be synthesized.

In the embodiment of the present disclosure, a style Encoder (Encoder) in a style encoding network in the speech synthesis model may encode the second speech to be synthesized to output a style hidden vector corresponding to the style of the second speech to be synthesized.

The specific process of acquiring the text hidden vector of the text to be synthesized in step S103 in the previous embodiment includes the following steps S205 to S206.

And S205, inputting the text to be synthesized into the voice synthesis model.

S206, the text to be synthesized is coded by the text coding network in the speech synthesis model so as to output the text hidden vector of the text to be synthesized.

In the embodiment of the present disclosure, a text to be synthesized may be encoded by a text Encoder (Encoder) in a text encoding network in a speech synthesis model, so as to output a text hidden vector of the text to be synthesized.

The specific process of splicing the timbre hidden vector, the style hidden vector and the text hidden vector in step S104 in the previous embodiment includes the following steps S207 to S208.

And S207, performing dimension conversion on the tone color hidden vector, the style hidden vector and the text hidden vector to obtain a target text hidden vector, a target tone color hidden vector and a target style hidden vector with the same dimension.

In the embodiment of the disclosure, a target dimension may be obtained, and dimensions of the timbre hidden vector, the style hidden vector and the text hidden vector are respectively converted into the target dimension according to the target dimension, so as to obtain a target text hidden vector, a target timbre hidden vector and a target style hidden vector with the same dimensions.

For example, the dimensions of the timbre hidden vector, the style hidden vector and the text hidden vector are 3 dimensions, 4 dimensions and 6 dimensions respectively, and the target dimension is 5 dimensions, in this case, the dimensions of the timbre hidden vector, the style hidden vector and the text hidden vector can be converted into 5 dimensions respectively, so as to obtain the target text hidden vector, the target timbre hidden vector and the target style hidden vector, all of which have 5 dimensions.

And S208, splicing the target text hidden vector, the target tone hidden vector and the target style hidden vector.

In the embodiment of the disclosure, after the target text hidden vector, the target timbre hidden vector and the target style hidden vector with the same dimensionality are obtained, the target text hidden vector, the target timbre hidden vector and the target style hidden vector can be spliced.

It should be noted that, in the present disclosure, the timbre hidden vector, the style hidden vector, and the text hidden vector may also be directly spliced, so as to obtain a hidden vector having a dimension that is a sum of dimensions of the timbre hidden vector, the style hidden vector, and the text hidden vector.

For example, the dimensions of the timbre hidden vector, the style hidden vector and the text hidden vector are 3-dimensional, 4-dimensional and 6-dimensional respectively, and in this case, the timbre hidden vector, the style hidden vector and the text hidden vector can be directly spliced to obtain a hidden vector with 13-dimensional dimensions.
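Using the example dimensions above (3, 4 and 6), the following sketch illustrates both splicing options: direct splicing into a 13-dimensional hidden vector, and dimension conversion to a common target dimension (here 5) followed by splicing. The learned linear projections are only one assumed way of performing the dimension conversion; the disclosure does not specify the conversion method.

```python
import torch
import torch.nn as nn

timbre_vec = torch.randn(1, 3)  # timbre (tone) hidden vector, 3-dimensional
style_vec = torch.randn(1, 4)   # style hidden vector, 4-dimensional
text_vec = torch.randn(1, 6)    # text hidden vector, 6-dimensional

# Option 1: splice directly; the result's dimension is the sum of the inputs (3 + 4 + 6 = 13).
direct = torch.cat([timbre_vec, style_vec, text_vec], dim=-1)
print(direct.shape)  # torch.Size([1, 13])

# Option 2: convert each hidden vector to the same target dimension (5) and then splice.
target_dim = 5
proj_timbre = nn.Linear(3, target_dim)
proj_style = nn.Linear(4, target_dim)
proj_text = nn.Linear(6, target_dim)
converted = torch.cat(
    [proj_text(text_vec), proj_timbre(timbre_vec), proj_style(style_vec)], dim=-1
)
print(converted.shape)  # torch.Size([1, 15])
```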

S209, generating a target synthesized voice of the text to be synthesized based on the spliced hidden vector.

In the embodiment of the present disclosure, the concatenated hidden vectors may be input to a decoding network in a speech synthesis model, and a Decoder (Decoder) decodes the concatenated hidden vectors to output a target synthesized speech of a text to be synthesized.

In this case, the tone of the target synthesized speech coincides with the tone of the first speech to be synthesized, and the style coincides with the style of the second speech to be synthesized.

According to the speech synthesis method disclosed by the embodiment of the disclosure, the generation of the target synthesized speech can be realized based on a converged speech synthesis model comprising a plurality of networks, which further improves the efficiency and reliability of the speech synthesis process.

As shown in fig. 3, in the speech synthesis method proposed by the present disclosure, the tone mel spectrum (of the first speech to be synthesized), the style mel spectrum (of the second speech to be synthesized) and the text (the text to be synthesized) are input into the speech synthesis model. The tone encoder in the tone coding network encodes the tone mel spectrum, the style encoder in the style coding network encodes the style mel spectrum, and the text encoder in the text coding network encodes the text, so as to obtain the tone hidden vector, the style hidden vector and the text hidden vector. These hidden vectors are spliced and input into the decoding network, and the decoder decodes them to obtain the target mel spectrum. Further, the target mel spectrum is input into a vocoder for conversion, and the target synthesized voice is obtained.
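A minimal PyTorch sketch of this flow is given below for illustration. The module types (GRU encoders and a GRU decoder), all dimensions, and the absence of attention or duration modelling are simplifying assumptions made here; the sketch only shows how the three encoders, the splicing step and the decoder connect, and the predicted mel spectrum would still need a vocoder to become audio.

```python
import torch
import torch.nn as nn

class SpeechSynthesisSketch(nn.Module):
    """Sketch of the fig. 3 flow: a tone encoder and a style encoder summarize two
    reference mel spectra into hidden vectors, a text encoder encodes the phoneme
    sequence, the three hidden vectors are spliced, and a decoder predicts the
    target mel spectrum. Architectures and sizes are illustrative assumptions."""

    def __init__(self, n_mels=80, vocab_size=100, hidden=128):
        super().__init__()
        self.tone_encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.style_encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.text_encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(3 * hidden, hidden, batch_first=True)
        self.mel_out = nn.Linear(hidden, n_mels)

    def forward(self, tone_mel, style_mel, phoneme_ids):
        _, h_tone = self.tone_encoder(tone_mel)    # (1, batch, hidden)
        _, h_style = self.style_encoder(style_mel)
        text_hidden, _ = self.text_encoder(self.embedding(phoneme_ids))  # (batch, T, hidden)

        # Splice: broadcast the utterance-level tone/style vectors over the text frames.
        T = text_hidden.size(1)
        tone_vec = h_tone[-1].unsqueeze(1).expand(-1, T, -1)
        style_vec = h_style[-1].unsqueeze(1).expand(-1, T, -1)
        spliced = torch.cat([text_hidden, tone_vec, style_vec], dim=-1)

        decoded, _ = self.decoder(spliced)
        return self.mel_out(decoded)  # target mel spectrum; a vocoder would produce audio

model = SpeechSynthesisSketch()
tone_mel = torch.randn(1, 200, 80)        # mel of the first speech to be synthesized
style_mel = torch.randn(1, 250, 80)       # mel of the second speech to be synthesized
phoneme_ids = torch.randint(0, 100, (1, 40))
target_mel = model(tone_mel, style_mel, phoneme_ids)
print(target_mel.shape)  # torch.Size([1, 40, 80])
```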

Fig. 4 is a schematic diagram according to a third embodiment of the present disclosure.

As shown in fig. 4, the training method of a speech synthesis model provided in this embodiment includes the following steps:

s401, obtaining a sample text marked with a voice synthesis sample result and at least two sample voices to be synthesized aiming at the same sample speaker, wherein the at least two sample voices to be synthesized have at least two different styles.

The sample texts, the first sample voices to be synthesized and the second sample voices to be synthesized are equal in number, and the number can be chosen according to the actual situation. For example, 1000 sets of sample text, first sample speech to be synthesized and second sample speech to be synthesized may be obtained.

In the present disclosure, the at least two sample voices to be synthesized for the same sample speaker have different styles. For example, for the same sample speaker a, the first sample voice to be synthesized corresponds to style 1, and the second sample voice to be synthesized corresponds to style 2.

S402, inputting the sample text and at least two sample voices to be synthesized into a voice synthesis model to be trained, and outputting a voice synthesis training result aiming at the sample text.

S403, obtaining the difference between the voice synthesis sample result and the voice synthesis training result, adjusting the model parameters of the voice synthesis model to be trained according to the difference, returning to the steps of obtaining the sample text marked with the voice synthesis sample result and at least two sample voices to be synthesized for the same sample speaker until the training result meets the training end condition, and determining the voice synthesis model to be trained after the model parameters are adjusted for the last time as the trained voice synthesis model.

The training end condition may be set according to an actual situation, and the disclosure is not limited.

Alternatively, the training end condition may be set such that the difference between the speech synthesis sample result and the speech synthesis training result is smaller than a preset difference threshold. For example, the training end condition may be set to be that the difference between the speech synthesis sample result and the speech synthesis training result is less than 95%.

According to the training method of the speech synthesis model disclosed by the embodiment of the disclosure, the sample text with the labeled speech synthesis sample result and at least two sample voices to be synthesized for the same sample speaker can be obtained, the sample text and the at least two sample voices to be synthesized are input into the speech synthesis model to be trained, the speech synthesis training result for the sample text is output, and then the model is trained according to the difference between the speech synthesis sample result and the speech synthesis training result to obtain the trained speech synthesis model, so that the training effect of the speech synthesis model is ensured, and a foundation is laid for accurately generating the target synthesized voice based on the speech synthesis model.

Fig. 5 is a schematic diagram according to a fourth embodiment of the present disclosure.

As shown in fig. 5, the training method of a speech synthesis model provided in this embodiment includes the following steps:

s501, obtaining a sample text marked with a voice synthesis sample result and at least two sample voices to be synthesized aiming at the same sample speaker, wherein the at least two sample voices to be synthesized have at least two different styles.

This step S501 is identical to the step S401, and is not described here again.

The specific process of inputting the sample text and the two sample voices to be synthesized into the voice synthesis model to be trained in step S402 in the previous embodiment and outputting the voice synthesis training result for the sample text includes the following steps S502 to S504.

S502, inputting at least two sample voices to be synthesized into a voice synthesis model to be trained, and outputting a sample tone hidden vector corresponding to the tone of a first sample voice to be synthesized and a sample style hidden vector corresponding to the style of a second sample voice to be synthesized in the at least two sample voices to be synthesized.

S503, inputting the sample text into the speech synthesis model to be trained, and outputting a sample text hidden vector of the sample text.

S504, splicing the sample tone hidden vector, the sample style hidden vector and the sample text hidden vector, and generating a speech synthesis training result of the sample text based on the spliced hidden vector.

And S505, obtaining the difference between the voice synthesis sample result and the voice synthesis training result, adjusting the model parameters of the voice synthesis model to be trained according to the difference, returning to the step of obtaining the sample text marked with the voice synthesis sample result and the at least two sample voices to be synthesized for the same sample speaker until the training result meets the training end condition, and determining the voice synthesis model to be trained after the model parameters are adjusted for the last time as the trained voice synthesis model.

This step S505 is identical to the step S403, and is not described herein again.
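For illustration, the following is a hedged sketch of the training loop described in S501-S505. The dummy stand-in model, the L1 loss on mel spectra, the Adam optimizer, the random data, and the loss-threshold end condition are all assumptions made only to keep the example runnable; they are not details fixed by the disclosure.

```python
import torch
import torch.nn as nn

class DummySynthesizer(nn.Module):
    """Trivial stand-in for a speech synthesis model with the interface
    model(tone_mel, style_mel, phoneme_ids) -> predicted target mel spectrum.
    It ignores the two reference speeches and exists only to keep the loop runnable."""

    def __init__(self, n_mels=80, vocab_size=100, hidden=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.mel_out = nn.Linear(hidden, n_mels)

    def forward(self, tone_mel, style_mel, phoneme_ids):
        return self.mel_out(self.embedding(phoneme_ids))

model = DummySynthesizer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()  # measures the difference between sample result and training result

# Each sample: labeled target mel (speech synthesis sample result), two sample speeches of
# the same speaker with different styles, and the phoneme ids of the sample text (dummy data).
dataset = [
    (torch.randn(1, 40, 80), torch.randn(1, 200, 80), torch.randn(1, 250, 80),
     torch.randint(0, 100, (1, 40)))
    for _ in range(8)
]

loss_threshold = 0.05  # illustrative training-end condition
for epoch in range(100):
    epoch_loss = 0.0
    for target_mel, tone_mel, style_mel, phoneme_ids in dataset:
        pred_mel = model(tone_mel, style_mel, phoneme_ids)  # speech synthesis training result
        loss = loss_fn(pred_mel, target_mel)                # difference from the sample result
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                    # adjust the model parameters
        epoch_loss += loss.item()
    if epoch_loss / len(dataset) < loss_threshold:          # training end condition met
        break
```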

In the technical solution of the present disclosure, the acquisition, storage and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations, and do not violate public order or good morals.

Corresponding to the embodiments provided above, an embodiment of the present disclosure further provides a speech synthesis apparatus, and since the speech synthesis apparatus provided in the embodiment of the present disclosure corresponds to the speech synthesis method provided in the embodiments described above, the implementation manner of the speech synthesis method is also applicable to the speech synthesis apparatus provided in the embodiment, and is not described in detail in the embodiment.

Fig. 6 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present disclosure.

As shown in fig. 6, the speech synthesis apparatus 600 includes: a first obtaining module 601, a second obtaining module 602, a third obtaining module 603, and a generating module 604. Wherein:

a first obtaining module 601, configured to obtain a text to be synthesized and at least two voices to be synthesized;

a second obtaining module 602, configured to obtain a timbre hidden vector of a first voice to be synthesized and a style hidden vector of a second voice to be synthesized in the at least two voices to be synthesized;

a third obtaining module 603, configured to obtain a text hidden vector of the text to be synthesized;

a generating module 604, configured to splice the timbre hidden vector, the style hidden vector, and the text hidden vector, and generate a target synthesized speech of the text to be synthesized based on the spliced hidden vector.

The second obtaining module 602 is further configured to:

extracting features of the tone of the first voice to be synthesized, and generating the tone implicit vector of the first voice to be synthesized according to the extracted features of the tone;

and extracting the style of the second voice to be synthesized, and generating the style implicit vector of the second voice to be synthesized according to the extracted style features.

Wherein, the generating module 604 is further configured to:

performing dimension conversion on the tone hidden vector, the style hidden vector and the text hidden vector to obtain a target text hidden vector, a target tone hidden vector and a target style hidden vector with the same dimension;

and splicing the target text hidden vector, the target tone hidden vector and the target style hidden vector.

The second obtaining module 602 is further configured to:

inputting the first voice to be synthesized and the second voice to be synthesized into a voice synthesis model;

coding the first to-be-synthesized voice by a tone coding network in the voice synthesis model to output the tone hidden vector corresponding to the tone of the first to-be-synthesized voice;

and coding the second voice to be synthesized by a style coding network in the voice synthesis model so as to output the style hidden vector corresponding to the style of the second voice to be synthesized.

The third obtaining module 603 is further configured to:

inputting the text to be synthesized into a speech synthesis model;

and encoding the text to be synthesized by a text encoding network in the speech synthesis model so as to output the text hidden vector of the text to be synthesized.

Wherein, the generating module 604 is further configured to:

and inputting the spliced hidden vector into a decoding network in a speech synthesis model for decoding so as to output the target synthesized speech of the text to be synthesized.

According to the speech synthesis device disclosed by the embodiment of the disclosure, at least two voices to be synthesized are separated in tone and style, so that a tone hidden vector of a first voice to be synthesized and a style hidden vector of a second voice to be synthesized are obtained, a text hidden vector of a text to be synthesized is obtained, and then different combinations of the tone hidden vector and the style hidden vector are combined for the same text to be synthesized, so that different target synthesized voices are generated, style migration is realized, each tone can have multiple styles, and efficiency and reliability in a speech synthesis process are improved.

Fig. 7 is a schematic structural diagram of a training apparatus for a speech synthesis model according to an embodiment of the present disclosure.

As shown in fig. 7, the training apparatus 700 for a speech synthesis model includes: an obtaining module 701, an output module 702, and a determining module 703. Wherein:

an obtaining module 701, configured to obtain a sample text with a speech synthesis sample result labeled and at least two sample speeches to be synthesized for a same sample speaker, where the at least two sample speeches to be synthesized have at least two different styles;

an output module 702, configured to input the sample text and the at least two sample voices to be synthesized into a voice synthesis model to be trained, and output a voice synthesis training result for the sample text;

a determining module 703, configured to obtain a difference between the speech synthesis sample result and the speech synthesis training result, adjust model parameters of the speech synthesis model to be trained according to the difference, return to the step of acquiring the sample text labeled with the speech synthesis sample result and the at least two sample speeches to be synthesized for the same sample speaker until the training result meets a training end condition, and determine the speech synthesis model to be trained after the model parameters are adjusted for the last time as the trained speech synthesis model.

Wherein, the output module 702 is further configured to:

inputting the at least two sample voices to be synthesized into the voice synthesis model to be trained, and outputting a sample tone hidden vector corresponding to the tone of a first sample voice to be synthesized and a sample style hidden vector corresponding to the style of a second sample voice to be synthesized in the at least two sample voices to be synthesized;

inputting the sample text into the speech synthesis model to be trained, and outputting a sample text hidden vector of the sample text;

and splicing the sample tone hidden vector, the sample style hidden vector and the sample text hidden vector, and generating the speech synthesis training result of the sample text based on the spliced hidden vector.

According to the training device of the speech synthesis model disclosed by the embodiment of the disclosure, the sample text with the labeled speech synthesis sample result and the at least two sample speeches to be synthesized for the same sample speaker can be obtained, the sample text and the at least two sample speeches to be synthesized are input into the speech synthesis model to be trained, the speech synthesis training result for the sample text is output, and the trained speech synthesis model is obtained according to the difference between the speech synthesis sample result and the speech synthesis training result, which ensures the training effect of the speech synthesis model and lays a foundation for accurately generating the target synthesized speech based on the speech synthesis model.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The calculation unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.

A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The calculation unit 801 executes the respective methods and processes described above, such as a speech synthesis method or a training method of a speech synthesis model. For example, in some embodiments, the speech synthesis method or the training method of the speech synthesis model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM 802 and/or communications unit 809. When loaded into RAM 803 and executed by the computing unit 801, a computer program may perform one or more steps of the speech synthesis method or the training method of a speech synthesis model described above. Alternatively, in other embodiments, the computing unit 801 may be configured by any other suitable means (e.g., by means of firmware) to perform a speech synthesis method or a training method of a speech synthesis model.

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

The present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method of claim 1 or the steps of the method of claim 7.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
