Voice synthesis method and device and electronic equipment

Document No.: 228389    Publication date: 2021-11-09

Reading note: This technology, "Voice synthesis method and device and electronic equipment" (一种语音合成方法和装置、电子设备), was designed and created by 陈凌辉, 伍芸荻 and 刘丹 on 2021-08-12. Summary: The application provides a speech synthesis method and apparatus. A first coding model is first called to encode text information to obtain a text feature, and a first decoding model is then called to decode based on the text feature to obtain voice information. The first coding model and the first decoding model comprise at least N cascaded layers of first encoders and M cascaded layers of first decoders, respectively. For any 1 ≤ i < N, the input encoding of the (i+1)-th layer first encoder includes the output encoding of the i-th layer first encoder; for any 1 ≤ j < M, the input encoding of the j-th layer first decoder includes the output encoding of the (j+1)-th layer first decoder, where i, j, M and N are all positive integers. The text feature includes the output encoding of at least one of the N layers of first encoders, and the input encoding of at least one of the M layers of first decoders is obtained from the text feature. This scheme can provide the user with synthesized speech that is richer in rhythmic variation and closer to the prosody of a real human voice.

1. A method of speech synthesis, comprising:

calling a first coding model to encode text information to obtain a text feature, wherein the first coding model comprises at least N cascaded layers of first encoders, the text feature comprises an output encoding of at least one of the N layers of first encoders, and, for any 1 ≤ i < N, an input encoding of the (i+1)-th layer first encoder comprises an output encoding of the i-th layer first encoder; and,

calling a first decoding model to decode based on the text feature to obtain voice information, wherein the first decoding model comprises at least M cascaded layers of first decoders, a first input encoding of at least one of the M layers of first decoders is obtained from the text feature, and, for any 1 ≤ j < M, a second input encoding of the j-th layer first decoder comprises an output encoding of the (j+1)-th layer first decoder;

wherein i, j, M and N are positive integers.

2. The speech synthesis method according to claim 1,

for each layer of first encoder, the text information can be segmented according to the text granularity corresponding to that first encoder to obtain at least one text segment, and the output encoding of that layer of first encoder is used for characterizing a feature of each text segment; and,

for each layer of first decoder, the voice information can be segmented according to the speech granularity corresponding to that first decoder to obtain at least one speech segment, and the second input encoding of that layer of first decoder is used for characterizing a feature of each speech segment.

3. The speech synthesis method according to claim 2,

for any 1 ≤ i < N, a text granularity G_i corresponding to the i-th layer first encoder is smaller than a text granularity G_{i+1} corresponding to the (i+1)-th layer first encoder, and each text segment obtained at the text granularity G_{i+1} is composed of one or more text segments obtained at the text granularity G_i; and,

for any 1 ≤ j < M, a speech granularity G_j corresponding to the j-th layer first decoder is smaller than a speech granularity G_{j+1} corresponding to the (j+1)-th layer first decoder, and each speech segment obtained at the speech granularity G_{j+1} is composed of one or more speech segments obtained at the speech granularity G_j.

4. The speech synthesis method according to claim 3,

k layers of first encoders in the N layers of first encoders correspond to K layers of first decoders in the M layers of first decoders one by one, and K is a positive integer;

the first input encoding of each of the K layers of first decoders includes the output encoding of its corresponding first encoder; and,

for each pair of mutually corresponding first decoder and first encoder, the text segments obtained by segmenting the text information according to the text granularity corresponding to the first encoder correspond to the speech segments obtained by segmenting the voice information according to the speech granularity corresponding to the first decoder.

5. The speech synthesis method according to claim 4, wherein the step of calling the first decoding model to decode based on the text feature comprises:

obtaining an output encoding of a j-th layer first decoder, wherein the i-th layer first encoder and the j-th layer first decoder belong to the K layers of first encoders and the K layers of first decoders, respectively, and correspond to each other; the output encoding of the i-th layer first encoder is an input text coding sequence used for characterizing a feature of each first text segment; the second input encoding of the j-th layer first decoder is an input speech coding sequence used for characterizing a feature of each first speech segment; the output encoding of the j-th layer first decoder is an output speech coding sequence used for characterizing a feature of each second speech segment; the first text segments are obtained by segmenting the text information according to the text granularity corresponding to the i-th layer first encoder; the first speech segments are obtained by segmenting the voice information according to the speech granularity corresponding to the j-th layer first decoder; and each first speech segment is formed from one or more second speech segments.

6. The speech synthesis method according to any one of claims 1 to 5, further comprising: when i is equal to 1, obtaining, according to the text information, a phoneme sequence corresponding to each text segment corresponding to the i-th layer first encoder; and,

wherein the step of calling the first decoding model to decode based on the text feature to obtain the voice information comprises, when j is equal to 1:

calling the first decoding model to decode the text feature to obtain an output encoding of the j-th layer first decoder;

obtaining acoustic features of the voice information according to the output encoding of the j-th layer first decoder and the phoneme sequence; and,

obtaining a waveform signal of the voice information through a vocoder according to the acoustic features.

7. The speech synthesis method according to any one of claims 1 to 5, further comprising:

training an initial first coding model and/or an initial first decoding model to obtain the first coding model and/or the first decoding model.

8. The speech synthesis method of claim 7, wherein the step of training the initial first decoding model comprises:

preprocessing each voice sample in a voice sample set to obtain segmentation information of the voice sample, wherein the segmentation information is used for indicating the speech segments of the voice sample at the speech granularity corresponding to each layer of first decoder; and,

training the initial first decoding model based on the voice sample set and the segmentation information to obtain the first decoding model;

wherein the voice sample set comprises at least one voice sample, the initial first decoding model comprises at least M cascaded layers of initial first decoders respectively corresponding to the M layers of first decoders, and, for any 1 ≤ j < M, the second input encoding of the j-th layer initial first decoder comprises the output encoding of the (j+1)-th layer initial first decoder.

9. The method of claim 8, wherein the step of training the initial first decoding model comprises:

inputting the voice sample set into an automatic coding and decoding network, wherein the automatic coding and decoding network comprises an initial second coding model and the initial first decoding model, the initial second coding model comprises at least M cascaded layers of initial second encoders respectively corresponding to the M layers of initial first decoders, and, for any 1 ≤ j < M, the input encoding of the (j+1)-th layer initial second encoder comprises the output encoding of the j-th layer initial second encoder; and,

adjusting parameters of the initial second coding model and the initial first decoding model until a reconstruction loss of the voice sample set satisfies a preset condition.

10. The speech synthesis method of claim 9, wherein the calculation of the reconstruction loss comprises:

calling the initial second coding model to encode each voice sample to obtain a first distribution parameter output by each layer of initial second encoder, wherein the first distribution parameter is used for characterizing a first distribution of the features of the speech segments obtained by segmenting the voice sample, according to the segmentation information, at the speech granularity corresponding to that layer of initial second encoder;

for each voice sample, sampling based on the first distribution corresponding to each layer of initial second encoder to obtain a sampling code corresponding to each layer of initial second encoder;

calling the initial first decoding model to decode the sampling codes of each voice sample, wherein the first input encoding of the M-th layer initial first decoder comprises the sampling code corresponding to the M-th layer initial second encoder, and, for any 1 ≤ j < M, the first input encoding of the j-th layer initial first decoder comprises the sampling code corresponding to the j-th layer initial second encoder;

when j is equal to 1, obtaining a reconstructed sample corresponding to each voice sample according to the output of the j-th layer initial first decoder, wherein the reconstructed samples corresponding to the voice samples form a reconstructed sample set;

calculating a first difference based on the set of speech samples and the set of reconstruction samples;

calculating a second difference based on the first distribution and a preset target distribution; and,

obtaining the reconstruction loss based on the first difference and the second difference.

11. The method of claim 7, wherein the step of training the initial first coding model comprises:

training an initial first coding model based on a text sample set to obtain the first coding model;

wherein the text sample set comprises at least one text sample, the initial first coding model comprises at least N cascaded layers of initial first encoders respectively corresponding to the N layers of first encoders, and, for any 1 ≤ i < N, the input encoding of the (i+1)-th layer initial first encoder comprises the output encoding of the i-th layer initial first encoder.

12. The method of claim 11, wherein the step of training the initial first coding model based on the set of text samples comprises:

inputting preamble information into the i-th layer initial first encoder, wherein the preamble information is a text sequence in each text sample when i is equal to 1, and is the output encoding of the (i-1)-th layer first encoder when i is greater than 1;

and adjusting parameters of the i-th layer initial first encoder until a preset loss of the preamble information satisfies a preset condition.

13. The speech synthesis method of claim 12, wherein the calculation of the preset loss comprises:

calling a feature extraction network corresponding to the i-th layer initial first encoder to process the preamble information to obtain a feature code;

selecting at least one element in the feature codes as an anchor point;

determining a target speech segment corresponding to the element, in the preamble information, that corresponds to the anchor point, wherein the text granularity corresponding to the target speech segment is larger than the text granularity corresponding to the preamble information;

calling a reverse support vector extraction network corresponding to the i-th layer initial first encoder, selecting a positive sample from the other elements, in the preamble information, that correspond to the target speech segment, and selecting at least one negative sample from the elements that do not correspond to the target speech segment;

computing a noise contrastive estimate based on the anchor point, the positive sample, and the negative sample; and,

obtaining the preset loss based on the noise contrastive estimate.

14. The method of claim 7, wherein the step of training the initial first coding model and/or the initial first decoding model comprises:

jointly training an initial first encoder and an initial first decoder based on a text sample set and a voice sample set to obtain the first encoding model and the first decoding model;

wherein the text sample set comprises at least one text sample, the speech sample set comprises at least one speech sample, and a one-to-one correspondence exists between the at least one text sample and the at least one speech sample;

wherein the initial first coding model comprises at least N cascaded layers of initial first encoders respectively corresponding to the N layers of first encoders, and, for any 1 ≤ i < N, the input encoding of the (i+1)-th layer initial first encoder comprises the output encoding of the i-th layer initial first encoder;

wherein the initial first decoding model comprises at least M cascaded layers of initial first decoders respectively corresponding to the M layers of first decoders, and, for any 1 ≤ j < M, the second input encoding of the j-th layer initial first decoder comprises the output encoding of the (j+1)-th layer initial first decoder.

15. The speech synthesis method of claim 14, wherein the step of jointly training the initial first encoder and the initial first decoder based on the text sample set and the voice sample set comprises:

inputting the text sample set into the initial first coding model, wherein each text sample is processed by the initial first coding model to obtain an intermediate text feature, the intermediate text feature is processed by the initial first decoding model to obtain a predicted voice sample, and the predicted voice samples of the text samples form a predicted voice sample set; and,

adjusting parameters of the initial first encoder and the initial first decoder until a difference between the voice sample set and the predicted voice sample set satisfies a preset condition.

16. A speech synthesis apparatus, comprising:

an encoding module, configured to call a first coding model to encode text information to obtain a text feature, wherein the first coding model comprises at least N cascaded layers of first encoders, the text feature comprises an output encoding of at least one of the N layers of first encoders, and, for any 1 ≤ i < N, an input encoding of the (i+1)-th layer first encoder comprises an output encoding of the i-th layer first encoder; and,

a decoding module, configured to call a first decoding model to decode based on the text feature to obtain voice information, wherein the first decoding model comprises at least M cascaded layers of first decoders, a first input encoding of at least one of the M layers of first decoders is obtained from the text feature, and, for any 1 ≤ j < M, a second input encoding of the j-th layer first decoder comprises an output encoding of the (j+1)-th layer first decoder;

wherein i, j, M and N are positive integers.

17. An electronic device comprising a processor and a memory, wherein the memory stores instructions and the instructions, when executed by the processor, cause the electronic device to perform the speech synthesis method of any of claims 1-5, 8-17.

18. A computer readable storage medium storing computer instructions which, when executed by a processor, cause a computer to perform the speech synthesis method according to any one of claims 1 to 5, 8 to 17.

19. A computer program product comprising computer instructions which, when run on a computer, cause the computer to perform the method of speech synthesis according to any one of claims 1 to 5, 8 to 17.

Technical Field

The present application relates to the field of artificial intelligence, and in particular, to a speech synthesis method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.

Background

With the development of computer technology and artificial intelligence, speech synthesis (also called Text-to-Speech) technology has advanced rapidly in recent years and is being applied ever more widely in many areas of daily life. The diversified applications of speech synthesis bring great convenience to everyday life while enriching the experience offered by multimedia technology. For example, reading aids based on speech synthesis not only allow visually impaired people to read a much wider range of text materials, but also provide more reading scenarios for ordinary users. As another example, virtual avatars based on speech synthesis can simulate vivid human voices with a simplified pronunciation database, providing a more general-purpose technology for fields such as gaming and entertainment, augmented reality, and virtual reality.

The main function of a speech synthesis system is to convert text into speech. A common speech synthesis system comprises three modules: a text front end, an acoustic model, and a vocoder, of which the vocoder is mainly used for converting acoustic features into the final speech waveform signal. In recent years, driven by the development of deep learning, end-to-end acoustic models (such as Tacotron and FastSpeech) have gradually been widely adopted in speech synthesis systems, greatly improving the sound quality and naturalness of sentence-level synthesized speech and providing a better user experience.
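
For illustration only, the following sketch outlines this three-module pipeline. The function bodies are placeholders and the feature shapes (80-dimensional frames, 256 samples per frame) are illustrative assumptions rather than details of the present application.

```python
# Minimal sketch of the conventional three-module TTS pipeline described above.
from typing import List

def text_front_end(text: str) -> List[str]:
    # Placeholder: a real front end performs text normalization, grapheme-to-phoneme
    # conversion, prosody labeling, etc.; here each character is simply treated as a unit.
    return list(text)

def acoustic_model(units: List[str]) -> List[List[float]]:
    # Placeholder for an end-to-end acoustic model (e.g. Tacotron- or FastSpeech-style)
    # that would predict acoustic feature frames such as mel-spectrogram frames.
    return [[0.0] * 80 for _ in units]

def vocoder(frames: List[List[float]]) -> List[float]:
    # The vocoder converts acoustic features into the final speech waveform signal.
    return [0.0] * (len(frames) * 256)

waveform = vocoder(acoustic_model(text_front_end("hello")))
```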

However, in practical application scenarios of speech synthesis, sentence-level speech synthesis cannot satisfy all requirements. The problem is especially prominent in scenarios where long speech resources need to be produced, such as audiobook reading and text news broadcasting. In these scenarios, speech often needs to be generated directly from long passages of text, or even from the text of an entire piece. A complete text has multiple levels, which can be roughly divided, from large to small, into chapters, paragraphs, sentences, intonation phrases (also called clauses), and so on; within an intonation phrase, the text can be further divided into levels such as prosodic phrases, prosodic words, grammatical words, and characters. However, a sentence-level speech synthesis system usually takes only sentences or clauses as input and does not consider the dependency relationships between different sentences or between different clauses within the whole piece, which results in a poor user experience of the existing systems in speech synthesis. In particular, since sentence-level speech synthesis does not take the structure and semantic information of the whole piece into account, the listener's experience of the piece as a whole is significantly degraded.

Specifically, existing sentence-level speech synthesis schemes typically have several significant drawbacks. First, the rhythm is monotonous: the tone and intonation of all sentences are similar, and the lack of variation causes auditory fatigue over prolonged listening. Second, there is no speech focus: central sentences are not stressed, so users easily miss important information. Third, contextual semantic association is lacking: prosodic transitions between sentences are not considered, which may result in incorrect intonation. Finally, the true intonation corresponding to common rhetorical devices in language, such as parallelism and antithesis, cannot be reflected.

Therefore, how to provide a speech synthesis system capable of accurately reflecting the multi-level structural information in a text, and thereby provide the user with synthesized speech closer to the prosody of a real human voice, has become a technical problem to be solved by those skilled in the art.

Disclosure of Invention

In view of the above, the present application provides a speech synthesis method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product, so as to provide a speech synthesis system capable of accurately reflecting the multi-level structural information in a text and thereby providing the user with synthesized speech closer to the prosody of a real human voice.

One aspect of the embodiments of the present application provides a speech synthesis method, including: calling a first coding model to encode text information to obtain a text feature, wherein the first coding model comprises at least N cascaded layers of first encoders, the text feature comprises an output encoding of at least one of the N layers of first encoders, and, for any 1 ≤ i < N, an input encoding of the (i+1)-th layer first encoder comprises an output encoding of the i-th layer first encoder; and calling a first decoding model to decode based on the text feature to obtain voice information, wherein the first decoding model comprises at least M cascaded layers of first decoders, a first input encoding of at least one of the M layers of first decoders is obtained from the text feature, and, for any 1 ≤ j < M, a second input encoding of the j-th layer first decoder comprises an output encoding of the (j+1)-th layer first decoder; wherein i, j, M and N are all positive integers.

Another aspect of the embodiments of the present application provides a speech synthesis apparatus, including: an encoding module, configured to call a first coding model to encode text information to obtain a text feature, wherein the first coding model comprises at least N cascaded layers of first encoders, the text feature comprises an output encoding of at least one of the N layers of first encoders, and, for any 1 ≤ i < N, an input encoding of the (i+1)-th layer first encoder comprises an output encoding of the i-th layer first encoder; and a decoding module, configured to call a first decoding model to decode based on the text feature to obtain voice information, wherein the first decoding model comprises at least M cascaded layers of first decoders, a first input encoding of at least one of the M layers of first decoders is obtained from the text feature, and, for any 1 ≤ j < M, a second input encoding of the j-th layer first decoder comprises an output encoding of the (j+1)-th layer first decoder; wherein i, j, M and N are all positive integers.

It is yet another aspect of the embodiments of the present application to provide an electronic device, which can be used to implement the foregoing speech synthesis method. In some embodiments, the electronic device includes a processor and a memory. The memory stores instructions, and the instructions, when executed by the processor, cause the electronic device to perform the speech synthesis method.

Yet another aspect of embodiments of the present application provides a computer-readable storage medium. The computer readable storage medium stores computer instructions, and the computer instructions, when executed by the processor, cause the computer to perform the aforementioned speech synthesis method.

One aspect of an embodiment of the present application provides a computer program product. The computer program product contains computer instructions which, when run on a computer, cause the computer to perform the aforementioned speech synthesis method.

According to the technical solution provided by the embodiments of the present application, a first coding model is called to encode text information to obtain a text feature, and a first decoding model is then called to decode based on the text feature to obtain voice information. The first coding model and the first decoding model comprise at least N cascaded layers of first encoders and M cascaded layers of first decoders, respectively. For any 1 ≤ i < N, the input encoding of the (i+1)-th layer first encoder comprises the output encoding of the i-th layer first encoder; for any 1 ≤ j < M, the input encoding of the j-th layer first decoder comprises the output encoding of the (j+1)-th layer first decoder, where i, j, M and N are all positive integers. The text feature comprises the output encoding of at least one of the N layers of first encoders, and the input encoding of at least one of the M layers of first decoders is obtained from the text feature. With this technical solution, the multi-level structural information in the text information is extracted layer by layer by the cascaded first encoders and is then restored layer by layer into speech prosody features by the cascaded first decoders, so that the voice information generated from these speech prosody features can accurately reflect the multi-level structure of the corresponding text information, providing the user with synthesized speech that is rich in rhythmic variation and closer to the prosody of a real human voice.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained from these drawings by those skilled in the art without creative effort. In the drawings:

FIG. 1 is a flow chart of a speech synthesis method in an embodiment of the present application;

FIG. 2 is a schematic diagram illustrating an information flow of a speech synthesis method in an embodiment of the present application;

FIG. 3 is a partial diagram of an information flow of a speech synthesis method in an embodiment of the present application;

FIG. 4 is a diagram illustrating a method for segmenting text information or voice information to obtain segments at different levels of granularity in an embodiment of the present application;

FIG. 5 is a schematic diagram of an information flow of another speech synthesis method in the embodiment of the present application;

FIG. 6 is a partial diagram of an information flow of another speech synthesis method in an embodiment of the present application;

FIG. 7 is a diagram illustrating that a first decoder obtains an output speech sequence through an input speech sequence and an input text sequence in the embodiment of the present application;

FIG. 8 is a flow chart of another speech synthesis method in an embodiment of the present application;

FIG. 9 is a flow chart of another speech synthesis method in an embodiment of the present application;

FIG. 10 is a flow chart of another speech synthesis method in an embodiment of the present application;

FIG. 11 is a flow chart of another speech synthesis method in an embodiment of the present application;

FIG. 12 is a flow chart of another speech synthesis method in an embodiment of the present application;

FIG. 13 is a diagram illustrating an information flow for obtaining a reconstruction loss when training an initial first decoding model according to an embodiment of the present application;

FIG. 14 is a flow chart of yet another speech synthesis method in an embodiment of the present application;

FIG. 15 is a schematic diagram illustrating an information flow for obtaining a predetermined loss when training an initial first coding model according to an embodiment of the present application;

FIG. 16 is a flow chart of another speech synthesis method in an embodiment of the present application;

FIG. 17 is a schematic structural diagram of a speech synthesis apparatus in an embodiment of the present application; and

FIG. 18 is a schematic structural diagram of another speech synthesis apparatus in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Some terms used in the present invention will be described first.

Encoder: the encoding side of an autoencoder (Autoencoder) architecture, used to map an input to an encoding (also referred to as a hidden variable or hidden representation).

Decoder: the decoding side of an autoencoder architecture, used to map an encoding to an output, where the output is usually, to some extent, a reconstruction of the input to the encoding side.

Tacotron model: a classic end-to-end speech synthesis model based on the attention mechanism, capable of generating speech directly from text.

BERT model: Bidirectional Encoder Representations from Transformers, a deep bidirectional unsupervised language representation model pre-trained using only a plain-text corpus; it belongs to the pre-training techniques of natural language processing.

VAE model: Variational Autoencoder, a generative model based on the autoencoder structure that makes a strong assumption on the distribution of the hidden variables and learns the hidden representation using a variational method, thereby introducing an additional loss component and the SGVB (Stochastic Gradient Variational Bayes) estimator into the training objective.

As described in the Background section, existing speech synthesis systems generate synthesized speech mainly on the basis of sentences or clauses, and usually extract only sentence-level prosody encodings. Taking a speech synthesis system based on the Tacotron model as an example, during model training the model uses a speech coding model (e.g., a bidirectional LSTM network) to extract a fixed-dimension vector from the speech, in addition to the conventional frame-level acoustic target prediction. This vector is constrained by a standard normal distribution assumption and is used to extract a sentence-level acoustic prosody representation from the speech.

In the prosody prediction stage, existing methods typically extract sentence-level text representations using a pre-trained language model such as BERT, and then simply use contextual text representations to predict the acoustic prosody representation of the sentence to be synthesized. The prediction methods fall mainly into two types: learnable prediction methods, which directly train a neural network model to predict the prosody representation, and retrieval-based prediction methods, which use text features as indexes of prosody representations and select the corresponding prosody representation from the training corpus, based on text-feature matching, for subsequent speech synthesis. However, being limited to the single-sentence or clause level, the prior-art solutions, when applied to long texts (such as an entire piece of text), generate synthesized speech that exhibits at least the following problems.

First, the synthesized speech lacks hierarchical acoustic representations. Natural human speech contains rich prosodic variation, including but not limited to the presence of focus words within sentences, the transitions of intonation between sentences, the emphasis of central sentences within paragraphs, the intonation at sentence ends, and the prosodic consistency between sentences within the same rhetorical structure. However, the prior art usually characterizes prosody with only a vector of a few dimensions, which can hardly abstract and express so many prosodic variations. Second, the synthesized speech only models dependencies between adjacent sentences. Although such dependencies can, to a certain extent, improve intonation transitions between sentences, linguistic relations and semantic phenomena that span more sentences are involved, and it is difficult to obtain an abstract representation from only two or three consecutive sentences of text. Third, the synthesized speech does not make effective use of linguistic experience and knowledge. The acoustic models of conventional speech synthesis rely entirely on data for learning, while limited data can hardly reflect the experience and knowledge accumulated by linguistics through long-term induction.

A first aspect of the embodiments of the present application provides a speech synthesis method. In this embodiment, the speech synthesis method can be applied to an automatic coding and decoding network including a first coding model and a first decoding model. The automatic coding and decoding network may be implemented in hardware, software, or a combination of both. For example, the network may exist in a hardware environment formed by a server and a terminal, with the server and the terminal connected through a network, including but not limited to a local area network, a metropolitan area network, or a wide area network. In this case, the coding model and the decoding model may be implemented by software running on the server and the terminal, or by instructions embedded in hardware on the server and the terminal. It should be noted that the coding model and the decoding model may also run entirely on a server or a terminal, which is not limited in this embodiment.

The above speech synthesis method is further described below in conjunction with FIG. 1. FIG. 1 is a flow chart of a speech synthesis method in an embodiment of the present application. As shown in FIG. 1, the speech synthesis method may include the following steps.

S20: calling the first coding model to encode the text information to obtain the text feature, wherein the first coding model comprises at least N cascaded layers of first encoders, and the text feature comprises the output encoding of at least one of the N layers of first encoders. For any 1 ≤ i < N, the input encoding of the (i+1)-th layer first encoder includes the output encoding of the i-th layer first encoder. Both i and N are positive integers.

The text information is information embodied in a text format; it is the input of the whole speech synthesis process and is usually the text to be converted into speech. In this embodiment, the text information may have a multi-level text structure. For example, the text information may be chapter-level text, i.e., it may contain one or more complete chapters. "Chapter" is understood here to mean a complete article structure, such as an entire news report, a complete speech transcript, or an entire chapter of a literary work. It will be appreciated that the "text format" here may be embodied either as characters of a written language or as a visual or computer-readable encoding of a written language.

The structure of the first coding model can be seen in the left half of FIG. 2. FIG. 2 is a schematic information flow diagram of a speech synthesis method in an embodiment of the present application. As shown in FIG. 2, the first coding model includes, from bottom to top, a layer-1 first encoder to a layer-N first encoder. Those skilled in the art will appreciate that terms like "bottom-up" are used herein only for clarity in describing the orientation illustrated in the corresponding drawing (e.g., FIG. 2 here) and are not intended to limit the spatial or logical relationship of elements. The text information input into the first coding model is first processed by the layer-1 first encoder to obtain the layer-1 output encoding. This output encoding is then processed by the layer-2 first encoder to obtain the layer-2 output encoding. Next, the layer-2 output encoding is processed by the layer-3 first encoder to obtain the layer-3 output encoding, and so on, until the layer-N output encoding is obtained by the layer-N first encoder. It should be understood that although in FIG. 2 the text information is input directly to the layer-1 first encoder, in practical applications the text information may first be pre-processed (for example, by text analysis, text segmentation, or text correction) and then input to the layer-1 first encoder.

The encoding process of the first encoders described above yields N layers of output encodings in total, and the text feature output by the first coding model may include at least one of these encodings (FIG. 2 shows the case in which all N layers of encodings are included). Since the N layers of output encodings are generated in cascade, the text feature can accordingly characterize the features of one or more levels in the hierarchical structure of the text information. The specific construction of the hierarchical structure determines the specific relationship among the N cascaded layers of first encoders; in addition, the levels of interest in a specific usage scenario determine which layers of output encodings are contained in the text feature.
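
For illustration only, the following sketch shows one possible way to organize the cascaded N layers of first encoders, where the output encoding of each layer feeds the next layer and the layer outputs are collected as the text feature. The use of PyTorch, the GRU layers, and the fixed dimensions are illustrative assumptions and do not limit the embodiments of the present application.

```python
# Sketch of a cascaded N-layer first coding model (illustrative assumptions).
import torch
import torch.nn as nn

class FirstCodingModel(nn.Module):
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        # One "first encoder" per granularity level; layer 1 is the finest granularity.
        self.layers = nn.ModuleList(
            nn.GRU(dim, dim, batch_first=True) for _ in range(num_layers)
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, num_segments_at_level_1, dim) embeddings of the finest text segments.
        outputs = []
        h = x
        for layer in self.layers:
            h, _ = layer(h)     # the input of layer i+1 contains the output of layer i
            outputs.append(h)   # output encoding of this layer
            # In the full scheme, h would also be pooled segment by segment here so that
            # the next layer works at a coarser granularity (e.g. sentences -> paragraphs).
        return outputs          # the text feature may include any subset of these encodings

text_feature = FirstCodingModel(num_layers=4, dim=256)(torch.randn(1, 34, 256))
```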

S40: calling the first decoding model to decode based on the text feature to obtain the voice information, wherein the first decoding model comprises at least M cascaded layers of first decoders, and the input encoding of at least one of the M layers of first decoders is obtained from the text feature. For any 1 ≤ j < M, the input encoding of the j-th layer first decoder contains the output encoding of the (j+1)-th layer first decoder. Both j and M are positive integers.

The voice information here refers to information embodied in a speech format; it is the output of the whole speech synthesis process and is usually the finally obtained personalized speech. In this embodiment, the voice information corresponds to the text information; for example, the voice information may be the reading speech corresponding to chapter-level text. It is understood that the "speech format" here may be embodied either as speech recognizable by the human ear or by a machine, or as an encoding or file stored in a medium that can be played back by a specific device, such as an audio file (e.g., wav, MP3, MP4) or the audio track of a video file (e.g., mkv, avi).

The structure of the first decoding model can be seen in the right half of FIG. 2. As shown in FIG. 2, the first decoding model includes, from bottom to top, a layer-1 first decoder to a layer-M first decoder. The text feature input into the first decoding model is first processed by the layer-M first decoder to obtain the layer-M output encoding. This output encoding is then processed by the layer-(M-1) first decoder to obtain the layer-(M-1) output encoding. Next, the layer-(M-1) output encoding is processed by the layer-(M-2) first decoder to obtain the layer-(M-2) output encoding, and so on, until the output encoding of the layer-1 first decoder is obtained. It is understood that, in the case where the layer-1 first decoder is a vocoder or the like, the output encoding of the layer-1 first decoder may be a speech waveform, which can directly serve as the voice information finally output by the first decoding model. Alternatively, in practical applications the output encoding of the layer-1 first decoder may contain only the acoustic features of the voice information, and the acoustic features are then subjected to subsequent processing to obtain the voice information.

In this embodiment, the cascaded M layers of first decoders are configured to decode the text feature layer by layer to obtain acoustic features having a hierarchical structure (i.e., containing speech prosody features). The specific construction of this hierarchy depends on the specific relationship among the M cascaded layers of first decoders. Meanwhile, since the text feature can characterize the features of one or more levels in the hierarchical structure of the text information, all or part of the features of these levels can also be input directly into the first decoder of the corresponding level, so that the speech prosody features output by that layer of first decoder have a more accurate correspondence with the text features of that specific level. Thus, apart from the layer-M first decoder, each layer of first decoder may have two input encodings: the first input encoding is obtained from the text feature, and the second input encoding is the output encoding of the layer above. It will be appreciated that although FIG. 2 shows the layer-1 to layer-(M-1) first decoders each having a first input encoding, in practical applications some of the first decoders may be configured not to obtain a first input encoding from the text feature. Which layers of first decoders have a first input encoding then depends on the text levels of interest (or the text levels corresponding to the speech levels of interest) in the specific usage scenario.
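
For illustration only, the following sketch shows one possible way to organize the cascaded M layers of first decoders, where each layer below the M-th combines a first input encoding taken from the text feature with the second input encoding produced by the layer above it. The concatenation-plus-GRU structure and the simplifying assumption that all layers are already aligned to the same number of segments are illustrative only.

```python
# Sketch of a cascaded M-layer first decoding model (illustrative assumptions).
import torch
import torch.nn as nn

class FirstDecodingModel(nn.Module):
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.GRU(2 * dim, dim, batch_first=True) for _ in range(num_layers)
        )

    def forward(self, text_feature):
        # text_feature[m] holds the first input encoding for decoder layer m+1,
        # assumed here to be already aligned to that layer's speech segments.
        second_input = torch.zeros_like(text_feature[-1])  # no layer above the M-th
        for m in reversed(range(len(self.layers))):        # layer M down to layer 1
            first_input = text_feature[m]                  # taken from the text feature
            h, _ = self.layers[m](torch.cat([first_input, second_input], dim=-1))
            second_input = h         # becomes the second input of the layer below
        return second_input          # layer-1 output: acoustic features (or a waveform)

feats = [torch.randn(1, 34, 256) for _ in range(4)]
acoustic = FirstDecodingModel(num_layers=4, dim=256)(feats)
```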

In practical applications, both the input encodings and the output encodings can be implemented as vectors. In some embodiments, for each layer of first encoder, the text information can be segmented according to the text granularity corresponding to that first encoder to obtain at least one text segment, and the output encoding of that layer of first encoder is used for characterizing the feature of each text segment; meanwhile, for each layer of first decoder, the voice information can be segmented according to the speech granularity corresponding to that first decoder to obtain at least one speech segment, and the second input encoding of that layer of first decoder is used for characterizing the feature of each speech segment.

A text/speech segment refers to one of the parts obtained after the text/voice information is segmented according to the corresponding text/speech granularity; splicing these parts in order losslessly recovers the original text/voice information. "Granularity" is understood here to mean the structural scale of the text information or the voice information. For example, if the text of a novel is segmented at chapter-level granularity, the resulting text segments may be the individual chapters of the novel; if the text of the same novel is segmented at paragraph-level granularity, the resulting text segments may be all the paragraphs of all the chapters of the novel. As another example, if the reading speech corresponding to part of the novel is segmented at chapter-level granularity, the resulting speech segments may be the reading speech of each chapter, and if the reading speech corresponding to the same part of the novel is segmented at paragraph-level granularity, the resulting speech segments may be the reading speech of each paragraph of those chapters. Of course, each segment obtained by segmentation at chapter-level granularity may include one or more chapters, and each segment obtained by segmentation at paragraph-level granularity may also include one or more paragraphs.

In some embodiments, vectors are used to represent the input encodings and the output encodings, which can directly reflect the feature of each segment at the corresponding granularity, i.e., the text structure feature of a text segment or the speech prosody feature of a speech segment. On the basis of FIG. 2, FIG. 3 further shows a partial information flow diagram of a speech synthesis method in the embodiment of the present application. As shown in FIG. 3, taking the i-th layer of the N layers of first encoders as an example, and assuming that a total of x text segments are obtained by segmenting the text information at its corresponding granularity, the output encoding of this first encoder may be represented as a vector O_i = (O_i1, O_i2, …, O_ix), where O_i1 represents the feature of the 1st text segment at that granularity, O_i2 represents the feature of the 2nd text segment at that granularity, and so on, and O_ix represents the feature of the x-th text segment at that granularity. O_i is, on the one hand, input into the (i+1)-th layer first encoder and, on the other hand, input into the first decoding model together with the output encodings of the other layers of first encoders as the text feature. Similarly, taking the j-th layer of the M layers of first decoders as an example, and assuming that a total of y speech segments are obtained by segmenting the voice information at its corresponding granularity, the second input encoding of this first decoder is the output encoding from the (j+1)-th layer first decoder, which may be represented as a vector I_j = (I_j1, I_j2, …, I_jy), where I_j1 represents the feature of the 1st speech segment at that granularity, I_j2 represents the feature of the 2nd speech segment at that granularity, and so on, and I_jy represents the feature of the y-th speech segment at that granularity. The first decoder may decode I_j to obtain the j-th layer output encoding, or, as shown in FIG. 3, it may further obtain a first input encoding from the text feature and obtain the j-th layer output encoding from the first input encoding and the second input encoding. It is to be understood that the feature of each segment may be embodied as a scalar, a vector, or the like, and may be determined according to the text structure features or speech prosody features of interest in the actual application, which is not limited in the embodiments of the present application.

Further, in some embodiments, the text information is segmented layer by layer at the N levels of text granularity corresponding to the N layers of first encoders, and the voice information is segmented layer by layer at the M levels of speech granularity corresponding to the M layers of first decoders. That is, for any 1 ≤ i < N, the text granularity G_i corresponding to the i-th layer first encoder is smaller than the text granularity G_{i+1} corresponding to the (i+1)-th layer first encoder, and each text segment obtained at text granularity G_{i+1} is composed of one or more text segments obtained at text granularity G_i. Meanwhile, for any 1 ≤ j < M, the speech granularity G_j corresponding to the j-th layer first decoder is smaller than the speech granularity G_{j+1} corresponding to the (j+1)-th layer first decoder, and each speech segment at speech granularity G_{j+1} is composed of one or more speech segments at speech granularity G_j. FIG. 4 is a schematic diagram illustrating the layer-by-layer segmentation of text information or voice information at four different levels of granularity in an embodiment of the present application, where the structural scale gradually decreases from the first granularity to the fourth granularity. Taking text information as an example, the first to fourth granularities may be the chapter level, paragraph level, sentence level, and clause level, respectively, and the single segment shown at the first granularity may represent a complete text. If the text contains four paragraphs, segmenting it at the paragraph-level second granularity yields the four segments at the second granularity shown in FIG. 4, where each segment is the text of one paragraph. The four paragraphs of the text may contain 3, 2, 3, and 4 sentences, respectively, so segmenting the text at the sentence-level third granularity yields the 12 segments at the third granularity shown in FIG. 4, where the first 3 segments come from the first paragraph, the next 2 from the second paragraph, the next 3 from the third paragraph, and the last 4 from the fourth paragraph, and the segments correspond in order to the 12 sentences of the text. The 12 sentences may in turn each contain one or more clauses (34 clauses in total in this example), so further segmenting the text at the clause-level fourth granularity yields the 34 segments at the fourth granularity shown in FIG. 4, where each segment corresponds in order to one of the 34 clauses of the text. It should be noted that, although not shown in FIG. 4, the segmentation can be continued by analogy to obtain segments at granularities of even smaller structural scale, for example by further segmenting each clause into words, thereby obtaining a larger number of segments at a fifth granularity. The segmentation of the voice information can be obtained by analogy with this text example and is not repeated here.
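
For illustration only, the following sketch segments a text layer by layer at decreasing granularities (chapter, paragraph, sentence, clause), mirroring the FIG. 4 example. The delimiter choices are illustrative assumptions; a practical system would rely on a text front end for boundary detection.

```python
# Sketch of layer-by-layer text segmentation at decreasing granularities
# (chapter -> paragraph -> sentence -> clause). Delimiters are assumptions.
import re

def segment(text: str):
    chapters = [text]                                   # first granularity: the whole text
    paragraphs = [p for p in text.split("\n") if p.strip()]
    sentences = [s for p in paragraphs
                 for s in re.split(r"(?<=[。！？.!?])", p) if s.strip()]
    clauses = [c for s in sentences
               for c in re.split(r"[，,；;：:]", s) if c.strip()]
    return chapters, paragraphs, sentences, clauses

chapters, paragraphs, sentences, clauses = segment("第一段，有两句。第二句！\n第二段，只有一句。")
# Each coarser-granularity segment is composed of one or more finer-granularity segments.
```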

In this embodiment, segmenting the text information layer by layer at the granularities from the N-th layer down to the 1st layer first encoder means that each layer of first encoder can extract the feature of a text segment of larger structural scale from several consecutive smaller text segments. That is, multiple layers of features, from small-scale structure to large-scale structure, are generated step by step for the text structure, and the feature of each text segment is determined by the features of all the smaller-granularity text segments (also called lower-level text segments) it contains. Correspondingly, segmenting the voice information layer by layer at the granularities from the M-th layer down to the 1st layer first decoder means that, as the M layers of first decoders process the data, the multi-level text features are restored layer by layer onto speech segments of smaller and smaller scale. That is, the features of the multiple layers of text segments in the text structure (i.e., the text feature) are gradually decoupled, from large scale to small scale, into speech prosody features, and the feature of each speech segment is influenced by the features of all the larger-granularity speech segments (also called upper-level segments) that contain it. It can be seen from the above process that, in the generated voice information, each speech segment at the smallest granularity (e.g., a phoneme-level segment) carries features that take into account the text structure of every level it belongs to, so that rich prosody can be given to the voice information during decoding. At the same time, the multi-level text structure allows semantic and grammatical information at multiple scales in the text to be reflected in the final voice information.

Since both the encoding process and the decoding process use segments obtained by layer-by-layer segmentation at multiple levels of granularity, the first input encoding that a first decoder obtains from the text feature can be exactly the output encoding of the first encoder at the corresponding granularity. In some embodiments, K layers of first encoders among the N layers of first encoders correspond one-to-one to K layers of first decoders among the M layers of first decoders, and the first input encoding of each of the K layers of first decoders includes the output encoding of the first encoder corresponding to that first decoder, where K is a positive integer.

Please refer to FIG. 5 on the basis of FIG. 2. FIG. 5 is a schematic information flow diagram of another speech synthesis method in the embodiment of the present application, in which K layers of first encoders C_L1 to C_LK among the N layers of first encoders correspond respectively to K layers of first decoders D_L1 to D_LK among the M layers of first decoders. For any 1 ≤ k ≤ K, the text feature contains the output encoding of the first encoder C_Lk, and this output encoding is directly used as the first input encoding of the first decoder D_Lk. In this case, the output encoding of the first encoder C_Lk may be represented as a vector O_Lk.

The first encoder and first decoder that correspond to each other here may be understood as an encoder and a decoder with corresponding granularities, that is, the text granularity and the speech granularity corresponding to the two are at the same level. In some embodiments, "corresponding granularities" means that segmenting the text information and the voice information at these granularities yields segments that correspond one-to-one. That is to say, for each pair of mutually corresponding first decoder and first encoder, the text segments obtained by segmenting the text information according to the text granularity corresponding to the first encoder correspond to the speech segments obtained by segmenting the voice information according to the speech granularity corresponding to the first decoder.
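
For illustration only, the following sketch shows the one-to-one wiring between corresponding first encoders and first decoders: the output encoding O_Lk of encoder C_Lk is used directly as the first input encoding of decoder D_Lk. The dictionary-based routing and the level names are illustrative assumptions.

```python
# Sketch of routing encoder outputs to the corresponding decoders (assumptions).
encoder_outputs = {            # O_Lk, keyed by granularity level (hypothetical names)
    "sentence": [[0.1] * 8 for _ in range(12)],   # 12 sentence-level segments
    "paragraph": [[0.2] * 8 for _ in range(4)],   # 4 paragraph-level segments
}

def first_input_for_decoder(level: str):
    # The first input encoding of the decoder at this level is the output encoding of
    # its corresponding encoder; segments correspond one-to-one at this granularity.
    return encoder_outputs[level]

assert len(first_input_for_decoder("paragraph")) == 4
```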

In some embodiments, the text granularities corresponding to the aforementioned K layers of first encoders may include one or more of the following levels: chapter level, paragraph level, sentence level, clause level, phrase level, word level, and character level. Character-level text granularity has the smallest text structural scale and can be regarded as the basic unit of the text structure; the corresponding text segment is usually one character unit, such as one Chinese character in a Chinese text or the one or more letters corresponding to a single pronunciation in an English text. In some embodiments, the first encoder corresponding to this text granularity may obtain, according to a character-to-phoneme prediction mechanism, an output encoding characterizing the pronunciation and tone of the character. In a practical application scenario, character-level text granularity generally corresponds to the layer-1 (i.e., i = 1) first encoder, and in this case the first encoder may include a BERT model whose output encoding characterizes the feature of each text segment (e.g., each character unit) at character-level text granularity.

Further, word-level text granularity is larger than character-level text granularity on the text structural scale; the corresponding text segment usually contains one or several character units and can independently constitute a word with a specific meaning. In some embodiments, the first encoder corresponding to this text granularity may obtain, according to a focus-word prediction mechanism, an output encoding used for characterizing stressed words, where a focus word generally refers to a word that carries the central meaning in the text and needs to be stressed in the speech to indicate emphasis.

Further, clause-level and phrase-level text granularities are larger than word-level text granularity on the text structural scale; the corresponding text segment usually contains one or several words but usually cannot constitute an independent sentence. In some more finely divided scenarios, a clause may also be considered to contain one or several phrases. In some embodiments, the first encoder corresponding to this text granularity may obtain, according to a parsing or syntactic analysis mechanism, an output encoding used for characterizing the mood, intonation, or duration dependency of words in the speech.

Further, sentence-level text granularity is larger than clause-level and phrase-level text granularity on the text structure scale, and the corresponding text segment usually contains a complete sentence. Sentences typically have very distinct cut points in the text structure, such as periods, semicolons, or commas in Chinese and English.

Further, paragraph-level text granularity is larger than sentence-level text granularity on the text structure scale, and the corresponding text segment usually contains one or several sentences, i.e., one paragraph. Paragraphs tend to have line breaks or long blanks as cut points in the text structure, and in some cases are supplemented with formats such as first-line indentation or hanging indentation. In some embodiments, the first encoder corresponding to this text granularity may obtain an output code characterizing emphasized sentences according to discourse analysis or a paragraph central-sentence prediction mechanism, where an emphasized sentence is generally a sentence carrying the central meaning of the text that needs to be emphasized with a specific tone in speech.

Further, chapter-level text granularity is larger than paragraph-level text granularity on the text structure scale, and the corresponding text segment usually contains one or several paragraphs, or even the whole text. In some embodiments, the first encoder corresponding to this text granularity may obtain output codes characterizing the overall semantics of the text sample according to genre analysis or abstract analysis.

The above-described scheme of representing input and output codes as vectors is considered below; please refer to fig. 6 on the basis of fig. 5. Fig. 6 shows a partial schematic diagram of the information flow in fig. 5. The text granularity corresponding to the first encoder C_Lk and the speech granularity corresponding to the first decoder D_Lk are at the same level. The text information can be segmented at this text granularity to obtain z text segments; correspondingly, the speech information is segmented at this speech granularity to obtain z speech segments, and the z text segments and the z speech segments correspond one-to-one in order. According to the above scheme, the output code of the first encoder C_Lk, which is the first input code of the first decoder D_Lk, may be represented as O_Lk = (O_Lk1, O_Lk2, …, O_Lkz), where the elements respectively represent the features of the z text segments. Meanwhile, the second input code of the first decoder D_Lk may be represented as I_Lk = (I_Lk1, I_Lk2, …, I_Lkz), where the elements respectively represent the features of the z speech segments. Further, consider the first decoder D_L(k-1) one layer below the first decoder D_Lk. Assume that segmenting the speech information at the speech granularity corresponding to D_L(k-1) yields z' speech segments; the output code of the first decoder D_Lk can then be expressed as P_Lk = (P_Lk1, P_Lk2, …, P_Lkz'), where the elements respectively represent the features of the z' speech segments. It is to be understood that the z speech segments and the z' speech segments are segments obtained by segmenting the speech information at different speech granularities.

Since the decoding of the text features by the M layers of first decoders is performed layer by layer, step S40 includes obtaining the output code of the j-th layer first decoder, where 1 ≤ j < M. In some embodiments, the j-th layer first decoder D_j belongs to the aforementioned K layers of first decoders, i.e., there exists k such that D_Lk = D_j. Meanwhile, the first encoder corresponding to the j-th layer first decoder D_j is the i-th layer first encoder C_i, i.e., there exists i such that C_i = C_Lk. In connection with the information flow diagram in fig. 6, the z text segments obtained by segmenting at the text granularity corresponding to the first encoder C_Lk are called first text segments and are characterized by O_Lk1 to O_Lkz respectively; the z speech segments obtained by segmenting at the speech granularity corresponding to the first decoder D_Lk are called first speech segments and are characterized by I_Lk1 to I_Lkz respectively; the z' speech segments obtained by segmenting at the speech granularity corresponding to the first decoder D_L(k-1) are called second speech segments and are characterized by P_Lk1 to P_Lkz' respectively. Under this scheme of layer-by-layer segmentation, each first speech segment consists of one or more second speech segments, i.e., z' ≥ z. For the j-th layer first decoder D_j (or D_Lk), the first input code is therefore the output code of the first encoder C_i (or C_Lk), namely the input text encoding sequence O_Lk = (O_Lk1, O_Lk2, …, O_Lkz); the second input code is the input speech encoding sequence I_Lk = (I_Lk1, I_Lk2, …, I_Lkz); and the output code is the output speech encoding sequence P_Lk = (P_Lk1, P_Lk2, …, P_Lkz').

In some embodiments, the aforementioned obtaining of the output code of the j-th layer first decoder may be implemented by a multilayer perceptron (MLP) model and a recurrent neural network (RNN) model. Typically, the MLP model is applied mainly to the highest layer of the aforementioned M layers of first decoders (i.e., the M-th layer first decoder), which tends to have only a first input code and no second input code (because there is no higher-layer decoder). Meanwhile, this first decoder often corresponds to the largest-scale speech granularity, so its first input code may contain only one feature element, for example a single element characterizing the structure of the whole text at chapter-level granularity. In contrast, the RNN model is mainly applied to the first decoders of the other layers, which correspond to speech granularities of smaller scale, so that both their first and second input codes may contain text structure features or speech prosody features in the form of sequences. In this case, the RNN model can capture the temporal correlations in these sequences well, so as to accurately reflect the correlation between text and speech features. The RNN model may be implemented by a long short-term memory (LSTM) model, a gated recurrent unit (GRU) model, or the like.
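As a hedged, non-authoritative sketch of this division of labor (written with PyTorch, which this application does not prescribe), the top-layer first decoder can be a simple MLP over a single chapter-level feature vector, while a lower-layer first decoder can be a GRU over aligned sequences:

```python
import torch
import torch.nn as nn

class TopLevelDecoder(nn.Module):
    # M-th layer: only a first input code (e.g., one chapter-level vector), no second input.
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

    def forward(self, chapter_code):          # (batch, in_dim)
        return self.mlp(chapter_code)         # (batch, out_dim)

class MidLevelDecoder(nn.Module):
    # Lower layers: first and second input codes are sequences; a GRU captures their temporal correlation.
    def __init__(self, text_dim, speech_dim, out_dim, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(text_dim + speech_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, text_seq, speech_seq):  # both (batch, length, dim), already aligned in length
        h, _ = self.rnn(torch.cat([text_seq, speech_seq], dim=-1))
        return self.proj(h)                   # (batch, length, out_dim)
```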

Considering that the numbers of first speech segments and second speech segments may differ, when obtaining the output code of the j-th layer first decoder D_j, the input codes and the output code of the decoder may first be aligned, so that D_j can capture element-wise correlations between the input encodings and the output encoding. In some embodiments, this alignment may be achieved by extending the input codes of smaller length. In this case, obtaining the output code of the j-th layer first decoder may include the following steps.

Firstly, the input text encoding sequence O_Lk and the input speech encoding sequence I_Lk are respectively converted into an extended input text encoding sequence and an extended input speech encoding sequence, the lengths of which are both equal to the length z' of the output speech encoding sequence. Referring to fig. 7 on the basis of the example in fig. 6, during the conversion each text code in the input text encoding sequence O_Lk is expanded in turn into a text subsequence: the length of each text subsequence equals the number of second speech segments contained in the first speech segment corresponding to the speech code at the same position in the input speech encoding sequence I_Lk, and every element of the text subsequence is identical to that text code. For example, the text code O_Lk1 in fig. 7 corresponds to the speech code I_Lk1 in the input speech encoding sequence I_Lk; if the first speech segment corresponding to I_Lk1 contains 3 second speech segments, then O_Lk1 is expanded into a text subsequence of length 3, (O_Lk1, O_Lk1, O_Lk1). Next, the text code O_Lk2 corresponds to the speech code I_Lk2; if the first speech segment corresponding to I_Lk2 contains 2 second speech segments, then O_Lk2 is expanded into a text subsequence of length 2, (O_Lk2, O_Lk2). By analogy, if the first speech segment corresponding to I_Lkz contains 4 second speech segments, O_Lkz is expanded into a text subsequence of length 4, (O_Lkz, O_Lkz, O_Lkz, O_Lkz). The total number of elements in all text subsequences then equals the total number z' of second speech segments, so concatenating the text subsequences in order yields the extended input text encoding sequence of length z'. Similarly, each speech code in the input speech encoding sequence I_Lk is expanded in turn into a speech subsequence whose length equals the number of second speech segments contained in the first speech segment corresponding to that speech code, and every element of the speech subsequence is identical to that speech code. As shown in fig. 7, the converted extended input speech encoding sequence also has length z'.
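The expansion just described can be sketched in plain Python as follows. The variable names are illustrative only; in particular, counts[t] is assumed to hold the number of second speech segments contained in the t-th first speech segment, an input representation not fixed by this application.

```python
def expand_sequences(text_codes, speech_codes, counts):
    """Expand O_Lk and I_Lk to the length z' of the output speech encoding sequence.

    text_codes:   list of z text codes  (O_Lk1 ... O_Lkz)
    speech_codes: list of z speech codes (I_Lk1 ... I_Lkz)
    counts:       list of z integers; counts[t] = number of second speech
                  segments contained in the t-th first speech segment
    """
    assert len(text_codes) == len(speech_codes) == len(counts)
    extended_text, extended_speech = [], []
    for o, i, n in zip(text_codes, speech_codes, counts):
        extended_text.extend([o] * n)    # text subsequence: n copies of the same text code
        extended_speech.extend([i] * n)  # speech subsequence: n copies of the same speech code
    return extended_text, extended_speech  # both of length z' = sum(counts)

# Example from fig. 7: counts = [3, 2, ..., 4] gives sequences of length z' = 3 + 2 + ... + 4.
```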

Then, the first speech code in the output speech encoding sequence (i.e., P_Lk1 in fig. 7) is obtained based on the first element in the extended input text encoding sequence (i.e., the first O_Lk1 in fig. 7) and the first element in the extended input speech encoding sequence (i.e., the first I_Lk1 in fig. 7).

Next, for the speech code at any remaining position in the output speech encoding sequence, the speech code is obtained based on the element at the corresponding position in the extended input text encoding sequence, the element at the corresponding position in the extended input speech encoding sequence, and all speech codes preceding that position in the output speech encoding sequence. Still taking the information flow shown in fig. 7 as an example, after P_Lk1 is obtained as above, P_Lk2 is obtained based on the second O_Lk1, the second I_Lk1 and P_Lk1; P_Lk3 is then obtained based on the third O_Lk1, the third I_Lk1, P_Lk1 and P_Lk2; P_Lk4 is then obtained based on the first O_Lk2, the first I_Lk2 and P_Lk1 to P_Lk3; and so on, until P_Lkz' is obtained based on the last O_Lkz, the last I_Lkz and P_Lk1 to P_Lk(z'-1). This process essentially predicts each speech code in the output speech encoding sequence recursively; if an RNN model is used to model the probability distribution of each speech code in the sequence, the output speech encoding sequence P_Lk can be expressed as

p(P_Lk) = ∏_{l=1}^{z'} p( P_Lkl | O'_Lkl, I'_Lkl, P_Lk1, …, P_Lk(l-1) )

where O'_Lkl and I'_Lkl respectively denote the l-th element of the extended input text encoding sequence and of the extended input speech encoding sequence.
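A minimal, non-authoritative sketch of this recursive prediction follows, using a PyTorch GRU cell as a stand-in for the RNN model (the application does not fix a particular RNN variant, hidden size, or output projection):

```python
import torch
import torch.nn as nn

class AutoregressiveLayerDecoder(nn.Module):
    # Predicts P_Lk1 ... P_Lkz' one element at a time from the aligned extended text
    # codes O'_Lkl, the extended speech codes I'_Lkl, and the history of previous outputs.
    def __init__(self, text_dim, speech_dim, out_dim, hidden=256):
        super().__init__()
        self.cell = nn.GRUCell(text_dim + speech_dim + out_dim, hidden)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, ext_text, ext_speech):          # both (batch, z', dim)
        batch, z_prime, _ = ext_text.shape
        h = torch.zeros(batch, self.cell.hidden_size)  # hidden state summarizes P_Lk1 ... P_Lk(l-1)
        prev = torch.zeros(batch, self.proj.out_features)  # no previous output before P_Lk1
        outputs = []
        for l in range(z_prime):
            x = torch.cat([ext_text[:, l], ext_speech[:, l], prev], dim=-1)
            h = self.cell(x, h)
            prev = self.proj(h)
            outputs.append(prev)
        return torch.stack(outputs, dim=1)             # (batch, z', out_dim)
```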

Fig. 8 shows a flow chart of another speech synthesis method in an embodiment of the present application. In some embodiments, the aforementioned speech synthesis method further comprises the following step: S30, based on the text information, obtaining the phoneme sequence corresponding to the text segments corresponding to the layer-1 (i.e., i = 1) first encoder. It is understood that the phoneme sequence here contains a series of phonemes that correspond one-to-one to the text segments obtained at the text granularity corresponding to the layer-1 first encoder (usually the character-level text granularity). That is, the phoneme corresponding to each text segment may be obtained, and the phonemes may then be arranged in the order of the text segments in the text information to obtain the phoneme sequence. There are many methods for obtaining the phoneme corresponding to a text segment; for example, the text content of the segment may be used as an index into an existing phoneme database.

At this time, step S40 may include the following steps:

and S41, calling a first decoding model to decode the text features, and obtaining the output code of the first decoder of the layer 1 (namely j is 1).

S42, combining the output code of the layer-1 first decoder with the phoneme sequence to obtain the acoustic features of the speech information.

The output code of the layer-1 first decoder here reflects the prosodic features of the speech information at the phoneme level, i.e., the form in which each phoneme should be presented in the speech information, such as whether it is stressed and how long it should last. Thus, the acoustic features of the speech information are obtained by combining this output code with the phoneme sequence.

S43, obtaining the waveform signal of the speech information through a vocoder according to the acoustic features.

The vocoder here is a software or hardware tool that converts acoustic features into a specific speech waveform signal; since the acoustic features contain prosodic information at the phoneme level, they can be converted into prosodic speech by the vocoder.
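The flow of steps S41 to S43 can be sketched as follows. This is a pseudocode-level Python sketch: decode_model, combine and vocoder are hypothetical callables standing in for the layer-1 first decoder, the combination in S42, and the vocoder in S43, none of which are concrete components defined by this application.

```python
def synthesize(text_features, phoneme_sequence, decode_model, combine, vocoder):
    # S41: layer-1 (j = 1) first decoder output: phoneme-level prosody codes
    prosody_codes = decode_model(text_features)

    # S42: merge prosody codes with the phoneme sequence into acoustic features
    # (e.g., per-phoneme stress/duration information attached to each phoneme identity)
    acoustic_features = combine(prosody_codes, phoneme_sequence)

    # S43: the vocoder turns acoustic features into the speech waveform
    waveform = vocoder(acoustic_features)
    return waveform
```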

The above embodiments describe the process of directly converting text information into speech information using the first coding model and the first decoding model; the process of obtaining the first coding model and the first decoding model is further described below. In some embodiments, the foregoing speech synthesis method further includes: step S10, training an initial first coding model and/or an initial first decoding model to obtain the first coding model and/or the first decoding model. The initial first coding model generally has a structure similar to that of the first coding model, for example comprising at least N cascaded layers of initial first encoders that respectively correspond to the N layers of first encoders in the first coding model; for any 1 ≤ i < N, the input code of the (i+1)-th layer initial first encoder contains the output code of the i-th layer initial first encoder. Similarly, the initial first decoding model typically has an architecture similar to that of the first decoding model, for example comprising at least M cascaded layers of initial first decoders; for any 1 ≤ j < M, the second input code of the j-th layer initial first decoder contains the output code of the (j+1)-th layer initial first decoder. Depending on the practical application scenario, step S10 may take, but is not limited to, the following forms: training only the initial first coding model to obtain the first coding model, training only the initial first decoding model to obtain the first decoding model, or training both to obtain the first coding model and the first decoding model.

The following first describes the training of the first decoding model. The training of the first decoding model is typically based on a speech sample set containing at least one speech sample. In some embodiments, the training of the first decoding model may adopt a weakly supervised setting: for example, each speech sample has a corresponding text sample, but the segmentation of neither the text sample nor the speech sample is labeled. In this case, in order to exploit more of the information contained in the speech samples when training the first decoding model, the speech samples may be preprocessed first. Fig. 10 shows a flowchart of another speech synthesis method in an embodiment of the present application, in which step S10 may include the following steps:

and S11, preprocessing each voice sample in the voice sample set to obtain the segmentation information of the voice sample. And the segmentation information is used for indicating the speech segmentation of the speech sample under the corresponding speech granularity of each layer of the first encoder.

The segmentation information may take various forms. In some embodiments, the segmentation information may be embodied as the positions of the cut points between corresponding speech segments; for example, for a speech sample containing multiple speech frames, the segmentation information may contain a series of speech frame indices indicating that these frames are the head or tail frames of the respective speech segments. In other embodiments, the segmentation information may be embodied as the lengths of the speech segments; for a speech sample comprising multiple speech frames, the segmentation information may contain a series of integers indicating how many speech frames each segment in the sequence contains. The above are merely examples of the segmentation information, and the embodiments of the present application are not limited thereto. Since the segmentation information indicates the speech segments, the process of obtaining the segmentation information may also be regarded as the process of segmenting the speech sample to obtain the respective segments.
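Both representations mentioned above (boundary frame indices or per-segment lengths) could be captured by a structure such as the following sketch; the field and level names are illustrative assumptions, not part of the original disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SegmentationInfo:
    """Segmentation of one speech sample, one entry per speech granularity level."""
    # granularity name (e.g., "phoneme", "word", "sentence") -> list of segment lengths in frames
    segment_lengths: Dict[str, List[int]]

    def boundaries(self, level: str) -> List[int]:
        """Convert per-segment lengths into cumulative boundary frame indices."""
        out, total = [], 0
        for n in self.segment_lengths[level]:
            total += n
            out.append(total)
        return out
```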

In some embodiments, the segmentation information may be obtained by using a text corresponding to the voice sample. Fig. 11 is a flowchart illustrating a speech synthesis method according to another embodiment of the present application, wherein step S11 includes the following steps:

and S1111, acquiring a text sample corresponding to the voice sample.

The text sample can be obtained in multiple ways. In some embodiments, the speech samples in the speech training set may be speech that itself has original text, such as a text sample that is obtained by a reader reading the audio of an article. In other embodiments, the speech samples in the speech training set may not have the original corresponding text, and then speech recognition may be used to obtain their corresponding text samples, for example, using the audio of the speech uttered by the speaker as the speech sample, and performing speech recognition on the audio to obtain the corresponding text as the text sample. Of course, in some application scenarios, the two manners may be simultaneously adopted to obtain the text samples corresponding to different voice samples in the same voice sample set.

S1112, obtaining an initial text segmentation of the text sample under the first text granularity.

The initial text segmentation segment usually has obvious text structure characteristics. For example, the first text granularity is at chapter level, and chapter numbers can be directly identified to obtain the initial text segmentation. For another example, the first text granularity is at a paragraph level, and the initial text segmentation can be obtained by directly recognizing formats such as first line indentation or text wrapping. As another example, the first text granularity is at the sentence level, and periods, semicolons, and/or commas may be directly identified to obtain the initial text snippet. Therefore, more accurate segmentation can be directly carried out on the basis of the text sample.

S1113, segmenting the voice sample by utilizing voice endpoint detection to obtain an initial voice segmentation of the voice sample under the first voice granularity.

The initial speech segmentation segment here usually has a relatively distinct temporal feature. For example, the first speech granularity is a phoneme level, and the initial speech segmentation can be obtained by directly recognizing the phoneme-to-phoneme switch. Where the first speech granularity is, for example, clause level or phrase level, the pause interval between phrases can be directly identified to obtain the initial speech segmentation. Therefore, accurate segmentation can be directly carried out on the basis of the voice sample. For the details of voice endpoint detection, reference may be made to existing voice endpoint detection techniques, such as conventional Voice Activity Detection (VAD), which is not described herein again.

S1114, obtaining a predicted text segmentation corresponding to each initial speech segmentation under the first speech granularity by using speech recognition.

The predicted text segments here are the texts obtained by directly converting the initial speech segments through speech recognition. For details of speech recognition technology, reference may be made to the prior art, which is not described here.

It is to be noted that the above step S1112 is executed after step S1111, and step S1114 is executed after step S1113, but the execution order between the combination of step S1111 and step S1112 and the combination of step S1113 and S1114 is not limited here. For example, any combination may be performed first, or two groups may be performed simultaneously, as long as step S1112 and step S1114 are guaranteed to be performed before step S1115.

S1115, obtaining a first corresponding relation between each initial text segmentation segment and each predicted text segmentation segment by using a text alignment algorithm. Wherein each initial text segment corresponds to one or more predicted text segments.

As can be seen from the foregoing, each predicted-text segment substantially corresponds to a portion of an initial text segment, since the structural dimension corresponding to the first speech granularity is typically smaller than the structural dimension corresponding to the first text granularity. Since the predictive text segment is obtained by speech recognition of the initial speech segment, which does not necessarily coincide exactly with the portion of the initial text segment, the text alignment here is essentially an approximate alignment. That is, in the first correspondence, the difference between each initial text segment and the whole of the one or more predicted text segments corresponding to the initial text segment is minimized in order in the text sample.

S1116, based on the first corresponding relation, recombining the initial voice segmentation segments into one or more first recombined voice segmentation segments of the voice sample under the second voice granularity. The second speech granularity is greater than the first speech granularity, and each of the first recombined speech segments is comprised of one or more initial speech segments.

Because the initial speech segmentation segment corresponds to the predicted text segmentation segment one by one, and the predicted text segmentation segment is recombined into the initial text segmentation segment approximately through the first corresponding relation, the initial speech segmentation segment can be recombined according to the first corresponding relation, and then the recombined speech segmentation segment corresponding to the initial text segmentation segment one by one is obtained. Obviously, the second speech granularity is of the same level of granularity as the first text granularity.

Through the above steps S1111 to S1116, initial speech segments of smaller granularity can be obtained from the speech itself, and recombined speech segments of larger granularity can be obtained by combining the initial text segments. Both the initial speech segments and the recombined speech segments may be incorporated into the segmentation information described above. Meanwhile, since the granularity of the initial text segments and that of the recombined speech segments are at the same level, if another text granularity is selected as the first text granularity in S1112, recombined speech segments at another speech granularity can be obtained through S1113 to S1116, thereby producing richer segmentation information.
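A minimal sketch of the regrouping in S1116 follows (plain Python). It assumes, purely for illustration, that the first correspondence is given as, for each initial text segment, the number of consecutive predicted text segments (and hence initial speech segments) assigned to it; this representation is not fixed by the application.

```python
def regroup_speech_segments(initial_speech_segments, counts_per_text_segment):
    """S1116 sketch: merge initial speech segments into first recombined speech segments.

    initial_speech_segments: list of (start_frame, end_frame) at the first speech granularity
    counts_per_text_segment: first correspondence, expressed as how many consecutive
                             initial speech segments belong to each initial text segment
    """
    recombined, cursor = [], 0
    for n in counts_per_text_segment:
        group = initial_speech_segments[cursor:cursor + n]
        recombined.append((group[0][0], group[-1][1]))  # span from first start to last end
        cursor += n
    return recombined  # one segment per initial text segment (second speech granularity)
```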

The above steps utilize speech recognition of the speech segments and text segmentation to obtain a larger-sized recombined speech segment. In other embodiments, additionally or alternatively, pronunciation prediction for text may also be utilized to obtain smaller granularity of speech segmentation. Fig. 12 is a flowchart illustrating a speech synthesis method according to another embodiment of the present application, wherein step S11 includes the following steps:

and S1121, obtaining a text sample corresponding to the voice sample.

For details of this step, refer to the step S1111, which is not described herein again.

And S1122, converting the text sequence in the text sample into a phoneme sequence by using pronunciation prediction. The text sequence comprises all the text units which are sequentially arranged in the text sample, and the phoneme sequence comprises phoneme units which are sequentially arranged and respectively correspond to all the text units.

It is to be understood that the pronunciation prediction herein is a prediction of pronunciation for each unit of text. The text units here can be considered as text segments at a smaller granularity in the text sample, typically at the character level, and thus can be obtained from the text structure itself. For example, in a chinese text, a text unit may be a single chinese character, or may be a single word composed of chinese characters. For example, a text unit in an english text may be a single word, or several letters in a word corresponding to a single phoneme. Accordingly, the phoneme sequence is composed of phoneme units, and each phoneme unit may contain a single phoneme or several phonemes constituting one word. Details of the pronunciation prediction technique at the character level can be found in the prior art, and are not described herein.

S1123, obtaining a second corresponding relation between the phoneme sequence and the voice sample by using a forced alignment algorithm.

Similar to step S1115, each phoneme unit substantially corresponds to a portion of a speech sample. Since the phoneme units are obtained from the text units by speech prediction, which do not necessarily exactly coincide with the part of the speech sample, the forced alignment algorithm here is essentially an approximate speech alignment. That is, in the second correspondence, the phoneme units are in one-to-one correspondence with the respective parts at the corresponding granularity in the speech in the order in the phoneme sequence. Here the "corresponding granularity" may be the same as or different from the aforementioned first speech granularity.

S1124, segmenting the voice sample into one or more phoneme voice segments based on the second corresponding relationship. Wherein each phoneme speech fragment corresponds to a phoneme unit.

It is to be understood that the phoneme speech segments here are the "individual parts" of the speech described above. Since they correspond to the phoneme level, the phoneme speech segments may be regarded as speech segments at a smaller granularity. In some embodiments, steps S1121 to S1124 may be used in place of the voice endpoint detection in step S1113 to obtain initial speech segments of smaller granularity.
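The conversion of a forced-alignment result into phoneme speech segments (steps S1123 to S1124) could look like the following sketch. The alignment format is an assumption: it is taken to be a list of per-phoneme time intervals produced by some forced-alignment tool, which this application does not specify.

```python
def phoneme_segments_from_alignment(alignment, frame_rate_hz=100):
    """S1123/S1124 sketch: turn a forced-alignment result into phoneme speech segments.

    alignment: list of (phoneme_unit, start_seconds, end_seconds) tuples, assumed to be
               produced by some forced-alignment tool (not specified by this application).
    Returns a list of (phoneme_unit, start_frame, end_frame) segments.
    """
    segments = []
    for phoneme, start_s, end_s in alignment:
        start_f = int(round(start_s * frame_rate_hz))
        end_f = int(round(end_s * frame_rate_hz))
        segments.append((phoneme, start_f, end_f))
    return segments
```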

Due to the mutual correspondence between the text units, the phoneme units and the phoneme speech segmentation, the phoneme speech segmentation can be further recombined by using the text units to obtain speech segmentation with larger granularity. With further reference to fig. 12, step S11 may additionally include the steps of:

s1125, text segmentation of the text sample under the second text granularity is obtained by text analysis. Wherein each text segmentation segment at the second text granularity comprises one or more text units.

The second text granularity may be the same or different than the first text granularity described above. And text analysis is a process of obtaining the corresponding relation between each text segmentation segment and each text unit under the preset text granularity. For example, when the text unit is Chinese, the text analysis may be a process of determining the Chinese characters contained in each chapter, each paragraph, each sentence, each clause, or each word in the article. That is, text analysis is a process of recombining text units into each text segment at a preset text granularity (i.e., a second text granularity). The details of the text analysis technique can refer to the prior art, and are not described herein.

S1126, based on the corresponding relation between each text unit and each text segmentation under the second text granularity, recombining each phoneme speech segmentation into one or more second recombined speech segmentation of the speech sample under a third speech granularity. Wherein each second recombined speech segment is comprised of one or more phoneme speech segments.

Since the text units correspond to the phoneme speech segments and the text units may be reassembled into the text segments at the second text granularity, the phoneme speech segments may be correspondingly reassembled into the second reassembled speech segments. Obviously, in view of the re-assembly process, the third speech granularity is larger than the corresponding granularity of the phoneme unit, i.e. the aforementioned "corresponding granularity", and at the same time, the third speech granularity may be the same as or different from the aforementioned second speech granularity.

It can be seen that steps S1125 and S1126 utilize text analysis to recombine phoneme units, so as to obtain a second recombined speech segmentation based on phoneme segmentation, and enrich segmentation information of speech samples. It is understood that if a different second text granularity is selected in step S1125, the segmentation information corresponding to the different speech granularity (i.e., the third speech granularity) can be obtained in step S1126, further enriching the segmentation information of the speech sample.

It should be noted that, although steps S1121 through S1126 are shown simultaneously in fig. 12, steps S1125 and S1126 may be omitted depending on the actual application scenario (for example, when the second recombined speech segment is not required). In some embodiments, the phoneme speech segmentation may be acquired using only the steps S1121 to S1124.

After the preprocessing of step S11, segmentation information of the voice sample can be obtained, that is, automatic labeling of the segmentation of the voice sample is completed. Then, the initial first decoding model can be trained in the weak supervision environment by using the labeled information.

S12, training the initial first decoding model based on the voice sample set and the segmentation information to obtain a first decoding model.

Since the segmentation information indicates the speech segments corresponding to each layer of the first decoder (and hence to each layer of the initial first decoder), the speech segment features at different speech granularities can be used to train each layer of the initial first decoder layer by layer.

Considering the audio-signal nature of the speech samples, strict quality requirements often cannot be imposed on the samples when a large amount of training data is to be obtained; in this case, the collected speech samples usually contain considerable interference information, such as environmental noise, channel noise and voiceprint. Therefore, in order to exclude such interference information, the initial first decoding model may be trained using a probabilistic "generation network + discrimination network" architecture, for example a network architecture similar to the VAE model. In this case, an automatic coding and decoding network can be constructed that contains the initial first decoding model and a coding model corresponding to it. To distinguish it from the aforementioned initial first coding model used for text coding, this coding model may be referred to as the initial second coding model, and it comprises at least M cascaded layers of initial second encoders. The M layers of initial second encoders respectively correspond to the M layers of initial first decoders, and for any 1 ≤ j < M, the input code of the (j+1)-th layer initial second encoder contains the output code of the j-th layer initial second encoder. It can be seen that each layer of the initial second encoder essentially performs speech prosody feature extraction layer by layer from small granularity to large granularity, and may therefore comprise several 1-dimensional convolutional layers and at least one pooling layer, where the 1-dimensional convolutional layers can be used for equal-length feature extraction and the pooling layer can perform pooling such as maximum pooling, minimum pooling or average pooling.
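One layer of such an initial second encoder could be sketched as follows (PyTorch; the kernel sizes, depths and the choice of average pooling are illustrative assumptions, and segment_lengths is assumed to come from the segmentation information):

```python
import torch
import torch.nn as nn

class InitialSecondEncoderLayer(nn.Module):
    # One layer of the initial second coding model: equal-length feature extraction with
    # 1-D convolutions, followed by pooling over the frames/segments that belong to one
    # coarser segment according to the segmentation information.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, out_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(out_dim, out_dim, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x, segment_lengths):     # x: (batch=1, in_dim, length)
        h = self.convs(x)                       # (1, out_dim, length), same length as the input
        pooled, start = [], 0
        for n in segment_lengths:               # average-pool each segment into one vector
            pooled.append(h[:, :, start:start + n].mean(dim=-1))
            start += n
        return torch.stack(pooled, dim=-1)      # (1, out_dim, number_of_segments)
```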

In some embodiments, step S12 includes the following steps:

and S1201, inputting the voice sample set into an automatic coding and decoding network. Wherein, the automatic coding and decoding network comprises the initial second coding model and the initial first decoding model.

S1202, adjusting parameters of the initial second coding model and the initial first decoding model until the reconstruction loss of the voice sample set meets a preset condition.

The adjustment of the parameters is usually an iterative adjustment, and the basis of each adjustment is the reconstruction loss obtained based on each voice sample and the automatic coding and decoding network, and the reconstruction loss is the basis for judging whether the training is completed. In general, the reconstruction loss is at least used for representing the difference between a reconstructed sample obtained after a sample is encoded and decoded by an automatic coding and decoding network and an original sample, and a smaller reconstruction loss indicates that the sample is better restored by the automatic coding network. Therefore, considering the adjustment effect and the computing resource consumption in combination, the preset condition is generally set as: the reconstruction loss is less than or equal to a first preset threshold, or the iteration number is greater than or equal to a second preset threshold. Of course, the preset condition may also have other settings, such as that the descending speed of the reconstruction loss within the preset number of iterations is less than or equal to a third preset threshold, or that the descending amplitude is less than or equal to a fourth preset threshold. The embodiment of the present application does not limit this.

In some embodiments, the aforementioned reconstruction loss may be calculated according to the information flow diagram shown in fig. 13. As shown in fig. 13, the calculation of the reconstruction loss may comprise the following steps:

Firstly, the initial second coding model is called to encode each speech sample, obtaining the first distribution parameter output by each layer of initial second encoder C'_0j (1 ≤ j ≤ M). The first distribution parameter characterizes a first distribution q_φj, and the first distribution q_φj characterizes the features of the speech segments obtained after the speech sample is segmented, according to the segmentation information, at the speech granularity corresponding to the initial second encoder C'_0j (i.e., the speech granularity corresponding to the first decoder D_j). It will be appreciated that the first distribution parameter here embodies, on the one hand, the encoding result of each layer of initial second encoder C'_0j and, on the other hand, also reflects the random distribution of the interference information.

Secondly, for each speech sample, sampling is performed based on the first distribution q_φj corresponding to each layer of initial second encoder C'_0j, obtaining a sampled code S_j corresponding to each layer of initial second encoder C'_0j. For each sample, this step may therefore perform M samplings, corresponding to the M layers of initial second encoders, i.e., to the M levels of speech granularity. It will be appreciated that, for each speech granularity, the sampling is analogous to appending random disturbance information to the features of all speech segments obtained by segmenting the speech sample.

The sampled codes of each speech sample are then decoded using the initial first decoding model. The first input code of the M-th layer initial first decoder contains the sampled code corresponding to the M-th layer initial second encoder, and for any 1 ≤ j < M, the first input code of the j-th layer initial first decoder contains the sampled code corresponding to the j-th layer initial second encoder. Therefore, during decoding, except that the M-th layer initial first decoder (the highest layer in fig. 13) only decodes the sampled code corresponding to the M-th layer initial second encoder, the initial first decoders of the other layers need to decode both the sampled code of the corresponding layer and the output code of the layer above. That is, the initial first decoding model not only has to reconstruct the speech samples step by step from large granularity to small granularity along the hierarchy, but is also affected at each layer of first decoder by the interference information introduced by the sampled codes.

Next, a reconstructed sample corresponding to each speech sample is obtained from the output of the layer-1 (i.e., j = 1) initial first decoder, and the reconstructed samples of the respective speech samples form a reconstructed sample set. Here, the layer-1 initial first decoder corresponds to the layer-1 initial second encoder. For example, if the speech samples input to the layer-1 initial second encoder are audio signals (e.g., the layer-1 initial second encoder is a wav2vec model for extracting speech features), then the output of the layer-1 initial first decoder is a reconstructed audio signal. As another example, if the speech samples input to the layer-1 initial second encoder are speech features of an audio signal, then the output of the layer-1 initial first decoder is reconstructed speech features of the audio signal.

Finally, a first difference is calculated based on the speech sample set and the reconstructed sample set, a second difference is calculated based on the first distributions and a preset target distribution, and the reconstruction loss is obtained based on the first difference and the second difference. Obviously, the first difference reflects how well the automatic coding and decoding network restores each speech sample during training; a smaller first difference indicates a stronger reconstruction capability of the network. The second difference mainly constrains the random distribution of the interference information with the preset target distribution, preventing the model from collapsing this random distribution to a point distribution merely to achieve a smaller first difference. The resulting reconstruction loss therefore reflects both the reconstruction capability of the automatic coding and decoding network and its robustness against interference. The first distribution can in general be preset to a particular type of distribution; in some embodiments, the first distribution may be a normal distribution, and accordingly the target distribution may be a standard normal distribution.

In some application scenarios, the reconstruction loss may be constructed from the log-likelihood of the speech sample set, for example:

log p_θ(x) ≥ E_{q_φ(S|x)}[ log p_θ(x|S) ] − KL( q_φ(S|x) ‖ p_θ(S) )

where φ and θ correspond to the parameters of the initial second coding model and of the initial first decoding model respectively, x denotes a speech sample input to the initial second coding model, S denotes the sampled codes of the respective layers corresponding to the speech sample, q_φ(S|x) denotes the posterior probability of the sampled codes given the speech sample under the initial second coding model, p_θ(x|S) denotes the probability of the reconstructed sample given the sampled codes under the initial first decoding model, and p_θ(S) is the target distribution. KL denotes the Kullback-Leibler divergence between two distributions. The right-hand side of the inequality is the evidence lower bound (ELBO) of the log-likelihood, in which the first term corresponds to the first difference and the second term corresponds to the second difference. It will be appreciated that the objective of the iterative adjustment is then to optimize the ELBO.

Further, considering the M cascaded layers of initial second encoders in the initial second coding model, one obtains:

q_φ(S|x) = q_φ(S_M, S_{M-1}, …, S_2, S_1 | x) = q_φ1(S_1|x) q_φ2(S_2|S_1) ⋯ q_φM(S_M|S_{M-1})

where q_φ1(S_1|x) denotes the posterior probability of the sampled code obtained by the layer-1 initial second encoder given the speech sample, and q_φj(S_j|S_{j-1}) denotes the posterior probability of the sampled code obtained by the j-th layer initial second encoder given the sampled code obtained by the (j-1)-th layer initial second encoder, for 1 < j ≤ M.

As can be understood from the foregoing process, the training process may be terminated when the reconstruction loss (or the iterative computation based on the reconstruction loss) satisfies the predetermined condition. At this time, the decoding part in the automatic coding and decoding network is the trained first decoding model.
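As a minimal, non-authoritative sketch of the reconstruction loss described above, assuming (as suggested earlier) that each first distribution q_φj is a diagonal Gaussian and that the target distribution is the standard normal distribution, the sampling and the loss could be written as follows; the use of a mean-squared error for the first difference is an additional simplifying assumption.

```python
import torch

def sample_code(mu, logvar):
    # reparameterized sampling of the sampled code S_j from q_phi_j = N(mu, exp(logvar))
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def reconstruction_loss(x, x_hat, mus, logvars):
    """Sketch of the reconstruction loss (negative ELBO up to constants/weighting).

    x, x_hat: original and reconstructed speech sample (or speech features)
    mus, logvars: first distribution parameters output by the M layers of initial second encoders
    """
    # first difference: how well the auto-coding/decoding network restores the sample
    first_difference = torch.mean((x - x_hat) ** 2)

    # second difference: KL(q_phi_j || N(0, I)) summed over the M layers
    second_difference = 0.0
    for mu, logvar in zip(mus, logvars):
        second_difference = second_difference + 0.5 * torch.sum(
            mu ** 2 + logvar.exp() - 1.0 - logvar)

    return first_difference + second_difference
```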

The training of the first coding model is explained below. The training of the first coding model is typically based on a text sample set containing at least one text sample. In some embodiments, the training of the first coding model adopts an unsupervised setting: for example, texts without any labels are directly acquired as text samples. Fig. 14 is a flowchart illustrating a speech synthesis method according to another embodiment of the present application, in which step S10 may include the following step:

s13, training the initial first coding model based on the text sample set to obtain the first coding model.

For text, the interference information contained therein is small compared with audio, so instead of training all layers synchronously as a whole with a "generation network + discrimination network" probabilistic model, each layer of the initial first encoder can be trained independently. It should be noted that "independently" here means that each layer of initial first encoder does not have to be trained at the same time as the other layers, not that its training is unrelated to the other layers. Since the N layers of initial first encoders are cascaded, training each layer of initial first encoder C_0i still requires the output code of the already trained previous-layer initial first encoder C_0(i-1), for 1 < i ≤ N. In some embodiments, step S13 may include the following steps:

S1301, inputting preamble information into the i-th layer initial first encoder C_0i. When i = 1, the preamble information is the text sequence of each text sample; when i > 1, the preamble information is the output code of the (i-1)-th layer initial first encoder C_0(i-1).

S1302, adjusting the parameters of the i-th layer initial first encoder C_0i until the preset loss computed on the preamble information meets a preset condition.

Similar to step S1202, the adjustment of the parameters is typically iterative; each adjustment is based on the preset loss obtained from each text sample and the initial first encoder C_0i, and the preset loss is used to judge whether training is finished. In general, a smaller preset loss indicates a better training effect. Therefore, weighing the adjustment effect against the consumption of computing resources, the preset condition is generally set as: the preset loss is less than or equal to a first preset threshold, or the number of iterations is greater than or equal to a second preset threshold. Of course, the preset condition may also be set otherwise, for example that the descent speed of the preset loss within a preset number of iterations is less than or equal to a third preset threshold, or that the descent amplitude is less than or equal to a fourth preset threshold. The embodiments of the present application do not limit this.

In some embodiments, the preset loss may be constructed with contrastive learning, taking into account the contextual relevance of text. For example, an initial first encoder may include a feature extraction network and a reverse support vector extraction network corresponding thereto, and the preset loss may be calculated according to the information flow diagram shown in fig. 15. For clarity, fig. 15 is explained with the (i+1)-th layer initial first encoder C_0(i+1) as the core; it is to be understood that for the layer-1 initial first encoder C_01, it suffices to replace the preamble information O_0i in fig. 15 and the related description with the text sequence of the text sample and to regard the corresponding text granularity as 0. As shown in fig. 15, the calculation of the preset loss may include the following steps.

First, the feature extraction network is called to process the preamble information O_0i = (O_0i1, O_0i2, …, O_0ix) to obtain a feature code c_i = (c_i1, c_i2, …, c_ix). For example, the feature extraction network may be a unidirectional LSTM network or another RNN network, and the feature code c_i may characterize the contextual features of the output code O_0i. Here, x is the number of text segments obtained by segmenting the text sample at the text granularity corresponding to the i-th layer first encoder C_i.

Secondly, at least one element c_it of the feature code c_i is selected as an anchor, and the target text segment corresponding to the element O_0it in the preamble information O_0i that corresponds to the anchor c_it is determined, where 1 ≤ t < x. The text granularity corresponding to the target text segment is larger than the text granularity corresponding to the i-th layer initial first encoder C_0i. For example, if the text granularity corresponding to the initial first encoder C_0i is at the word level, the target text segment may be a segment at clause, sentence, paragraph or chapter granularity. As another example, if the text granularity corresponding to the initial first encoder C_0i is at the sentence level, the target text segment may be a segment at paragraph or chapter granularity. It is to be understood that the element O_0it corresponds to the t-th of the x text segments, and the target text segment is the larger-granularity segment containing that segment.

Then, the reverse support vector extraction network is called to select, from the other elements of the preamble information O_0i that also correspond to the target text segment, a positive sample O_0it+, and to select, from the elements that do not correspond to the target text segment, at least one negative sample O_0it-. For example, if the text granularity corresponding to the initial first encoder C_0i is at the word level, the positive sample O_0it+ and the element O_0it may be different words in the same clause, sentence, paragraph or chapter, while the negative samples O_0it- are words not located in that clause, sentence, paragraph or chapter. As another example, if the text granularity corresponding to the initial first encoder C_0i is at the sentence level, the positive sample O_0it+ and the element O_0it may correspond to different sentences in the same paragraph or chapter, while the negative samples O_0it- correspond to sentences not located in that paragraph or chapter. In some embodiments, the aforementioned feature extraction network and reverse support vector extraction network may be constructed based on a noise contrastive estimation (NCE) model; in the output code, the positive sample O_0it+ needs to be located after the element O_0it corresponding to the anchor c_it, and the negative samples O_0it- need to be located after the positive sample O_0it+, i.e., t < t+ < t-.

Finally, a noise contrastive estimate is calculated based on the anchor, the positive sample and the negative samples, and the preset loss is obtained based on this noise contrastive estimate. For example, in a Contrastive Predictive Coding (CPC) model, the estimate can be obtained by calculating mutual information. In some embodiments, the preset loss may be expressed by the following formula:

where E is a scaling factor, and Σ denotes summation of the corresponding terms over all negative samples.

In addition to the feature extraction network and the reverse support vector extraction network, the (i+1)-th layer initial first encoder C_0(i+1) may also comprise a corresponding output network for converting the feature code c_i into the output code O_0(i+1) = (O_0(i+1)1, O_0(i+1)2, …, O_0(i+1)x') of the initial first encoder C_0(i+1), where x' is the number of text segments obtained by segmenting the text sample at the text granularity corresponding to the (i+1)-th layer first encoder C_(i+1). Considering that x' < x, the output network may comprise at least a pooling layer for pooling the feature code c_i, such as maximum pooling, minimum pooling or average pooling.
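The exact loss formula referenced above is not reproduced in this text; as a generic, non-authoritative stand-in, the following sketch computes an InfoNCE-style contrastive loss over an anchor, one positive sample and several negative samples. The dot-product scoring and the single scale parameter are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

def contrastive_preset_loss(anchor, positive, negatives, scale=1.0):
    """Generic InfoNCE-style stand-in for the preset loss described above.

    anchor:    feature code c_it                                    (dim,)
    positive:  element O_0it+ from the same target text segment     (dim,)
    negatives: elements O_0it- from other segments                  (num_neg, dim)
    scale:     scaling factor (corresponding in spirit to E above)
    """
    pos_score = scale * torch.dot(anchor, positive)   # similarity with the positive
    neg_scores = scale * (negatives @ anchor)          # similarities with the negatives
    logits = torch.cat([pos_score.unsqueeze(0), neg_scores])
    # cross-entropy with the positive at index 0 equals -log softmax_0(logits)
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```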

The following describes the synchronous training of the first coding model and the first decoding model. In a similar way to the two previous approaches, the training of the first coding model is based on a text sample set comprising at least one text sample, the training of the first decoding model is based on a speech sample set comprising at least one speech sample, and there is a one-to-one correspondence between text samples and speech samples. In some embodiments, the foregoing synchronous training may be regarded as training in a supervised environment, that is, a speech sample corresponding to a text sample is directly used as a label of the text sample. Fig. 16 is a flowchart illustrating a speech synthesis method according to another embodiment of the present application, wherein step S10 may include the following steps:

s14, performing joint training on the initial first coding model and the initial first decoding model based on the text sample set and the voice sample set to obtain a first coding model and a first decoding model.

In some embodiments, step S14 may include the following steps:

s1401, inputting the text sample set into the initial first coding model. And obtaining an intermediate text characteristic after each text sample is processed by the initial first coding model, obtaining a predicted voice sample after the intermediate text characteristic is processed by the initial first decoding model, and forming a predicted voice sample set by the predicted voice sample of each text sample.

The process of deriving the predicted speech samples from the text samples is here similar to the process of obtaining speech information from text information as described above, wherein the intermediate text features of the text samples correspond to text features of the text information. Accordingly, the details of step S1401 can refer to the aforementioned steps S20 and S40, which are not described herein.

S1402, adjusting parameters of the initial first coding model and the initial first decoding model until a difference between the speech sample set and the predicted speech sample set meets a preset condition.

Similar to step S1202, the adjustment of the parameters is usually an iterative adjustment, and each adjustment is based on the difference obtained by each text sample, the corresponding speech sample, and the initial first coding model and the initial first decoding model. The difference is a basis for judging whether the current first coding model and the first decoding model can accurately obtain the corresponding label of the text sample (i.e. the corresponding voice sample). In general, smaller differences indicate better prediction labeling. Therefore, considering the adjustment effect and the computing resource consumption in combination, the preset condition is generally set as: the difference is less than or equal to a first preset threshold, or the iteration number is greater than or equal to a second preset threshold. Of course, the preset condition may also have other settings, such as that the descending speed of the difference within the preset number of iterations is less than or equal to the third preset threshold, or the descending amplitude is less than or equal to the fourth preset threshold. The embodiment of the present application does not limit this.
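One iteration of the joint training in S1401 and S1402 could be sketched as follows; encode_model, decode_model, loss_fn and optimizer are hypothetical interfaces standing in for the initial first coding model, the initial first decoding model, the difference measure and the parameter-adjustment machinery, none of which are fixed by this application.

```python
def joint_training_step(text_batch, speech_batch, encode_model, decode_model,
                        loss_fn, optimizer):
    """One iteration of S1401/S1402 (sketch; the model and loss interfaces are assumptions).

    text_batch / speech_batch: corresponding "text-speech" sample pairs
    """
    optimizer.zero_grad()
    intermediate_text_features = encode_model(text_batch)         # initial first coding model
    predicted_speech = decode_model(intermediate_text_features)   # initial first decoding model
    difference = loss_fn(predicted_speech, speech_batch)          # e.g., an L1/L2 distance
    difference.backward()
    optimizer.step()
    return difference.item()

# Training stops once the difference (or the iteration count) meets the preset condition.
```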

Since the above training process uses mutually corresponding text samples and speech samples, i.e., "text-speech" sample pairs, the speech information and text information obtained by the speech synthesis method in practical applications can also form new "text-speech" sample pairs to be added to the training sets. In some embodiments, after step S40, the aforementioned speech synthesis method further includes the following step:

and S50, adding the text information into the text sample set, and adding the voice information into the voice sample set.

Therefore, more and more abundant labeled samples can be obtained along with the application of the speech synthesis method, so that more accurate first coding models and first decoding models can be trained in other application scenes.

According to the speech synthesis method provided by the embodiments of the present application, the first coding model is called to encode the text information to obtain text features, and the first decoding model is then called to decode based on the text features to obtain speech information. The first coding model and the first decoding model respectively comprise at least N cascaded layers of first encoders and M cascaded layers of first decoders. For any 1 ≤ i < N, the input code of the (i+1)-th layer first encoder contains the output code of the i-th layer first encoder; for any 1 ≤ j < M, the input code of the j-th layer first decoder contains the output code of the (j+1)-th layer first decoder, where i, j, M and N are positive integers. The text features contain the output code of at least one of the N layers of first encoders, and the input code of at least one of the M layers of first decoders is obtained from the text features. With this technical scheme, the multi-level structural information in the text information is extracted layer by layer by the cascaded first encoders and then restored layer by layer into speech prosody features by the cascaded first decoders, so that the speech information generated from these prosody features accurately reflects the multi-level structure of the corresponding text information, providing the user with synthesized speech that is richer in rhythmic variation and closer to the prosody of a real human voice.

A second aspect of the embodiments of the present application provides a speech synthesis apparatus, which can be used to implement the aforementioned speech synthesis method. Fig. 17 is a schematic structural diagram of a speech synthesis apparatus in an embodiment of the present application, where the speech synthesis apparatus may include an encoding module 1702 and a decoding module 1704.

The encoding module 1702 is configured to invoke the first coding model to encode the text information to obtain text features. The first coding model comprises at least N cascaded layers of first encoders, and the text features contain the output code of at least one of the N layers of first encoders. For any 1 ≤ i < N, the input code of the (i+1)-th layer first encoder contains the output code of the i-th layer first encoder.

The decoding module 1704 is configured to invoke the first decoding model to perform decoding based on the text feature, so as to obtain voice information. Wherein the first decoding model comprises at least cascaded M-layer decoders, a first input encoding of at least one of said M-layer decoders being obtained from the text feature. For any 1 ≦ j < M, the second input code for the j-th layer first decoder comprises the output code for the j + 1-th layer first decoder.

Wherein i, j, M and N are positive integers.

In some embodiments, for each layer of first encoder, the text information can be segmented at the text granularity corresponding to that first encoder to obtain at least one text segment, and the output code of that layer of first encoder characterizes the features of each text segment. For each layer of first decoder, the speech information can be segmented at the speech granularity corresponding to that first decoder to obtain at least one speech segment, and the second input code of that layer of first decoder characterizes the features of each speech segment.

In some embodiments, for any 1 ≤ i < N, the text granularity G_i corresponding to the i-th layer first encoder is smaller (finer) than the text granularity G_(i+1) corresponding to the (i+1)-th layer first encoder, and each text segment obtained under granularity G_(i+1) is composed of one or more text segments obtained under granularity G_i. Similarly, for any 1 ≤ j < M, the speech granularity G_j corresponding to the j-th layer first decoder is smaller (finer) than the speech granularity G_(j+1) corresponding to the (j+1)-th layer first decoder, and each speech segment under granularity G_(j+1) is composed of one or more speech segments under granularity G_j.
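Purely as a toy illustration of this nesting (the segmentation rules below are assumptions, not prescribed by the embodiment):

```python
# a toy example of three nested text granularities
text = "今天天气很好,我们去公园散步。"

chars   = [c for c in text if c not in ",。"]   # G1: character level (finest)
clauses = ["今天天气很好", "我们去公园散步"]       # G2: clause level
chapter = ["今天天气很好我们去公园散步"]           # G3: whole-text level (coarsest)

# each segment at a coarser granularity is the concatenation of one or more finer segments
assert "".join(chars) == "".join(clauses) == "".join(chapter)
```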

In some embodiments, K layers of the N layers of first encoders correspond one-to-one to K layers of the M layers of first decoders, and the first input code of each of these K layers of first decoders includes the output code of its corresponding first encoder, where K is a positive integer.

In some embodiments, for each of the K mutually corresponding pairs of first encoder and first decoder, the text information can be segmented according to the text granularity corresponding to the first encoder to obtain text segments, the speech information can be segmented according to the speech granularity corresponding to the first decoder to obtain speech segments, and the resulting text segments and speech segments correspond to each other.

In some embodiments, the last one of the K layers of first decoders is based on a multi-layer perceptron model, and at least one of the remaining first decoders is based on an autoregressive neural network model.

In some embodiments, the aforementioned decoding module is specifically configured to obtain the output code of the j-th layer first decoder. Here, the i-th layer first encoder and the j-th layer first decoder belong to the K layers of first encoders and the K layers of first decoders, respectively, and correspond to each other. The output code of the i-th layer first encoder is an input text coding sequence characterizing the features of each first text segment, the second input code of the j-th layer first decoder is an input speech coding sequence characterizing the features of each first speech segment, and the output code of the j-th layer first decoder is an output speech coding sequence characterizing the features of each second speech segment. Each first text segment is obtained by segmenting the text information according to the text granularity corresponding to the i-th layer first encoder, each first speech segment is obtained by segmenting the speech information according to the speech granularity corresponding to the j-th layer first decoder, and each first speech segment is composed of one or more second speech segments.

In some embodiments, the aforementioned decoding module includes a conversion sub-module, a first obtaining sub-module, and a second obtaining sub-module.

The conversion sub-module is configured to convert the input text coding sequence and the input speech coding sequence into an extended input text coding sequence and an extended input speech coding sequence, respectively, both of which have the same length as the output speech coding sequence. During this conversion, each text code in the input text coding sequence is expanded, in order, into a text subsequence of the extended input text coding sequence; the length of each text subsequence equals the number of second speech segments contained in the first speech segment corresponding to the speech code at the position in the input speech coding sequence that corresponds to that text code, and every element of the text subsequence is identical to that text code. Similarly, each speech code in the input speech coding sequence is expanded, in order, into a speech subsequence of the extended input speech coding sequence; the length of each speech subsequence equals the number of second speech segments contained in the first speech segment corresponding to that speech code, and every element of the speech subsequence is identical to that speech code.
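A minimal sketch of this expansion step, assuming the size of each first speech segment is given as a count of the second speech segments it contains; all names below are illustrative only.

```python
def expand_codes(codes, counts):
    """Repeat the k-th code counts[k] times so the result aligns one-to-one
    with the output speech coding sequence (len(result) == sum(counts))."""
    assert len(codes) == len(counts)
    expanded = []
    for code, n in zip(codes, counts):
        expanded.extend([code] * n)
    return expanded

# counts[k] = number of second speech segments in the k-th first speech segment
counts = [2, 3, 1]
extended_text   = expand_codes(["t1", "t2", "t3"], counts)  # extended input text coding sequence
extended_speech = expand_codes(["s1", "s2", "s3"], counts)  # extended input speech coding sequence
# extended_text == ['t1', 't1', 't2', 't2', 't2', 't3']
```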

The first obtaining sub-module is configured to obtain the first speech code of the output speech coding sequence based on the first element of the extended input text coding sequence and the first element of the extended input speech coding sequence.

For the speech code at any other position in the output speech coding sequence, the second obtaining sub-module is configured to obtain that output speech code based on the element at the corresponding position in the extended input text coding sequence, the element at the corresponding position in the extended input speech coding sequence, and all speech codes preceding that position in the output speech coding sequence.
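Putting the three sub-modules together, a hedged sketch of the resulting autoregressive generation loop is shown below; the `step` function standing in for one forward pass of the j-th layer first decoder is an assumption of the sketch.

```python
def decode_layer(step, extended_text, extended_speech):
    """Generate the output speech coding sequence position by position.
    step(text_code, speech_code, history) stands in for one call of the
    j-th layer first decoder and returns one output speech code."""
    outputs = []
    for txt, spc in zip(extended_text, extended_speech):
        outputs.append(step(txt, spc, list(outputs)))  # condition on all earlier outputs
    return outputs

# toy usage with a dummy step function standing in for the real decoder
codes = decode_layer(lambda t, s, h: t + s + sum(h), [1, 2, 3], [10, 10, 10])
# codes == [11, 23, 47]
```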

In some embodiments, the K layers of first encoders respectively correspond to one or more of the following text granularities: chapter level, paragraph level, clause level, phrase level, word level, and character level.

In some embodiments, the first encoder corresponding to the character-level text granularity obtains, based on phoneme prediction, output codes characterizing the pronunciation and tone of each character; the first encoder corresponding to the word-level text granularity obtains, based on focus-word prediction, output codes characterizing stressed words; the first encoder corresponding to the clause-level text granularity obtains, based on sentence analysis and syntactic analysis, output codes characterizing mood, intonation, and duration dependencies between words; the first encoder corresponding to the paragraph-level text granularity obtains, based on rhetorical analysis and paragraph center-sentence prediction, output codes characterizing the emphasized sentences within a paragraph; and the first encoder corresponding to the chapter-level text granularity obtains, based on text analysis and abstract analysis, output codes characterizing the overall semantics of the text.

In some embodiments, the layer-1 (i.e., i = 1) first encoder is based on a BERT model, and its output code characterizes the features of each text segment at the character-level text granularity.
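As an illustration only (the specific checkpoint name and the token handling are assumptions; the embodiment merely requires a BERT-based layer-1 first encoder), character-level features could be obtained roughly as follows with the Hugging Face transformers library:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # character-level tokens for Chinese
model = BertModel.from_pretrained("bert-base-chinese")

text = "今天天气很好"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, num_tokens, 768)
char_features = hidden[0, 1:-1]                  # drop [CLS]/[SEP]; one vector per character
```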

In some embodiments, the speech synthesis apparatus further includes a bottom-layer phoneme module, configured to obtain, from the text information, the phoneme sequence corresponding to each text segment at the text granularity of the layer-1 first encoder. Further, the decoding module 1704 includes a bottom-layer decoding sub-module, an acoustic feature sub-module, and a waveform generation sub-module.

The bottom-layer decoding sub-module is configured to call the first decoding model to decode the text feature and obtain the output code of the layer-1 (i.e., j = 1) first decoder.

The acoustic feature sub-module is configured to obtain the acoustic features of the speech information from the output code of the layer-1 first decoder and the phoneme sequence.

The waveform generation sub-module is configured to obtain the waveform signal of the speech information through a vocoder according to the acoustic features.
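A hedged end-to-end sketch of these sub-modules together with the bottom-layer phoneme module is given below; `text_to_phonemes`, `acoustic_model`, and `vocoder` are placeholders for components the embodiment leaves open (for example a grapheme-to-phoneme converter, a duration- or attention-based acoustic model, and a neural or signal-processing vocoder).

```python
def synthesize(text, first_coding_model, first_decoding_model,
               text_to_phonemes, acoustic_model, vocoder):
    """Placeholder pipeline: prosody codes + phonemes -> acoustic features -> waveform."""
    text_feature = first_coding_model(text)            # N layers of first encoders
    layer1_codes = first_decoding_model(text_feature)  # output code of the layer-1 first decoder
    phonemes = text_to_phonemes(text)                  # bottom-layer phoneme module
    acoustic = acoustic_model(layer1_codes, phonemes)  # e.g. mel-spectrogram frames
    return vocoder(acoustic)                           # waveform signal of the speech information
```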

In some embodiments, as shown in fig. 18, the foregoing speech synthesis apparatus further includes a training module 1701 for training the initial first coding model and/or the initial first decoding model to obtain the first coding model and/or the first decoding model.

In some embodiments, the training module 1701 includes a pre-processing sub-module and a training sub-module.

The preprocessing sub-module is configured to preprocess each speech sample in a speech sample set to obtain segmentation information of that speech sample, where the segmentation information indicates the speech segments of the speech sample at the speech granularity corresponding to each layer of first decoder. The speech sample set includes at least one speech sample, and the initial first decoding model includes at least M cascaded layers of initial first decoders corresponding respectively to the M layers of first decoders. For any 1 ≤ j < M, the second input code of the j-th layer initial first decoder includes the output code of the (j+1)-th layer initial first decoder.

The training submodule is used for training the initial first decoding model based on the voice sample set and the segmentation information to obtain a first decoding model.

In some embodiments, the training of the initial first decoding model described above is training in a weakly supervised environment.

In some embodiments, the preprocessing submodule includes a text obtaining unit, a text segmentation unit, a first voice segmentation unit, a voice recognition unit, a text alignment unit, and a first voice recombination unit.

The text acquisition unit is used for acquiring a text sample corresponding to the voice sample.

The text segmentation unit is used for obtaining an initial text segmentation of the text sample under the first text granularity.

The first voice segmentation unit is used for segmenting the voice sample by utilizing voice endpoint detection to obtain an initial voice segmentation of the voice sample under the first voice granularity.

The speech recognition unit is used for obtaining the predicted text segmentation corresponding to each initial speech segmentation under the first speech granularity by using speech recognition.

The text alignment unit is used for obtaining a first corresponding relation between each text segmentation and each predicted text segmentation by using a text alignment algorithm. Wherein each initial text segment corresponds to one or more predicted text segments.

The first speech recombination unit is configured to recombine the initial speech segments, based on the first correspondence, into one or more first recombined speech segments of the speech sample at the second speech granularity, where each first recombined speech segment is composed of one or more initial speech segments.
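A minimal sketch of the recombination step only (the voice-endpoint-detection, speech-recognition, and text-alignment components are assumed to exist and are not specified here): given the first correspondence, which maps each initial text segment to the indices of the predicted text segments, and hence of the initial speech segments, that it covers, the initial speech segments are simply regrouped.

```python
def recombine(initial_speech_segments, first_correspondence):
    """first_correspondence[k] lists the indices of the initial speech segments
    (via their predicted text segments) covered by the k-th initial text segment."""
    regrouped = []
    for indices in first_correspondence:
        # each first recombined speech segment is the concatenation of one or
        # more initial speech segments, preserving their temporal order
        regrouped.append(b"".join(initial_speech_segments[i] for i in sorted(indices)))
    return regrouped

# toy usage: 4 endpoint-detected segments, 2 text segments covering {0, 1} and {2, 3}
segments = [b"seg0", b"seg1", b"seg2", b"seg3"]
print(recombine(segments, [[0, 1], [2, 3]]))   # [b'seg0seg1', b'seg2seg3']
```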

In some embodiments, the text obtaining unit is specifically configured to perform speech recognition on the speech sample to obtain the text sample.

In some embodiments, the pre-processing sub-module includes a text acquisition unit, a pronunciation prediction unit, a speech alignment unit, and a second speech segmentation unit.

The text acquisition unit is used for acquiring a text sample corresponding to the voice sample.

The pronunciation prediction unit is used for converting the text sequence in the text sample into a phoneme sequence by utilizing pronunciation prediction. The text sequence comprises all the text units which are sequentially arranged in the text sample, and the phoneme sequence comprises phoneme units which are sequentially arranged and respectively correspond to all the text units.

The speech alignment unit is used for obtaining a second corresponding relation between the phoneme sequence and the speech sample by using a forced alignment algorithm.

The second speech segmentation unit is used for segmenting the speech sample into one or more phoneme speech segments based on the second corresponding relation. Wherein each phoneme speech fragment corresponds to a phoneme unit.
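A hedged sketch of slicing the waveform with the second correspondence, assuming the forced-alignment algorithm has already produced (phoneme unit, start sample, end sample) triples; the aligner itself is not specified by this embodiment.

```python
import numpy as np

def cut_phoneme_segments(waveform: np.ndarray, alignment):
    """alignment: iterable of (phoneme_unit, start_sample, end_sample) from forced alignment."""
    return [(ph, waveform[start:end]) for ph, start, end in alignment]

# toy usage on 1 second of silence at 16 kHz with a made-up alignment
wav = np.zeros(16000, dtype=np.float32)
pieces = cut_phoneme_segments(wav, [("n", 0, 3200), ("i3", 3200, 8000), ("h", 8000, 16000)])
```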

In some embodiments, the pre-processing sub-module further comprises a text analysis unit and a second speech re-assembly unit.

The text analysis unit is used for obtaining the text segmentation of the text sample under the second text granularity by using text analysis. And each text segmentation segment under the second text granularity is composed of one or more text units.

The second speech recombination unit is used for recombining each phoneme speech segmentation into one or more second recombined speech segmentation of the speech sample under a third speech granularity based on the corresponding relation between each text unit and each text segmentation under the second text granularity. Wherein each second recombined speech segment is comprised of one or more phoneme speech segments.

In some embodiments, the training submodule includes a first input unit and a first adjustment unit.

The first input unit is configured to input the speech sample set into an automatic encoding-decoding network. The automatic encoding-decoding network includes an initial second coding model and the initial first decoding model, where the initial second coding model includes at least M cascaded layers of initial second encoders corresponding respectively to the M layers of initial first decoders. For any 1 ≤ j < M, the input code of the (j+1)-th layer initial second encoder includes the output code of the j-th layer initial second encoder.

The first adjusting unit is configured to adjust parameters of the initial second coding model and the initial first decoding model until a reconstruction loss of the speech sample set meets a preset condition.

In some embodiments, the calculation of the reconstruction loss is performed by a first calculation unit. The first calculation unit is configured to: call the initial second coding model to encode each speech sample, obtaining the first distribution parameters output by each layer of initial second encoder, where the first distribution parameters characterize a first distribution of the features of the speech segments obtained by segmenting that speech sample, according to the segmentation information, at the speech granularity corresponding to that layer of initial second encoder; for each speech sample, sample from the first distribution corresponding to each layer of initial second encoder to obtain the sampled code corresponding to that layer; call the initial first decoding model to decode the sampled codes of all speech samples, where the first input code of the M-th layer initial first decoder includes the sampled code corresponding to the M-th layer initial second encoder and, for any 1 ≤ j < M, the first input code of the j-th layer initial first decoder includes the sampled code corresponding to the j-th layer initial second encoder; obtain the reconstructed sample corresponding to each speech sample from the output of the layer-1 (i.e., j = 1) initial first decoder, the reconstructed samples of all speech samples forming a reconstructed sample set; calculate a first difference based on the speech sample set and the reconstructed sample set; calculate a second difference based on the first distributions and a preset target distribution; and obtain the reconstruction loss based on the first difference and the second difference.

In some embodiments, the predetermined condition is that the reconstruction loss is less than or equal to a first predetermined threshold, or that the number of iterations of the adjusting reaches a second predetermined threshold.

In some embodiments, the first distribution is a normal distribution and the target distribution is a standard normal distribution.
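A hedged sketch of this loss under the stated choice of distributions (a normal first distribution and a standard-normal target): mean-squared error is only one possible choice for the first difference, the closed-form Gaussian KL divergence serves as the second difference, and the encoder/decoder calls that produce `mu`, `log_var`, and `reconstructed` are assumed to happen elsewhere.

```python
import torch
import torch.nn.functional as F

def sample(mu, log_var):
    """Reparameterised sampling from the first distribution N(mu, sigma^2)."""
    return mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)

def reconstruction_loss(speech, reconstructed, mu, log_var, kl_weight=1.0):
    """speech/reconstructed: (batch, ...); mu/log_var: first-distribution parameters."""
    # first difference: how far the reconstructed samples are from the speech samples
    first_diff = F.mse_loss(reconstructed, speech)
    # second difference: KL divergence between N(mu, sigma^2) and the standard normal target
    second_diff = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return first_diff + kl_weight * second_diff
```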

In some embodiments, each layer of the initial second encoder comprises a pooling layer and a plurality of 1-dimensional convolutional layers.
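For example (the channel counts, kernel size, and pooling factor below are assumptions), one layer of the initial second encoder could be sketched as:

```python
import torch.nn as nn

class InitialSecondEncoderLayer(nn.Module):
    """Several 1-D convolutions followed by a pooling layer that coarsens the time axis."""
    def __init__(self, channels: int = 256, pool: int = 4):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.pool = nn.AvgPool1d(kernel_size=pool)     # moves to the next (coarser) speech granularity
        self.to_params = nn.Conv1d(channels, 2 * channels, kernel_size=1)  # -> mean and log-variance

    def forward(self, x):                              # x: (batch, channels, time)
        h = self.pool(self.convs(x))
        mu, log_var = self.to_params(h).chunk(2, dim=1)
        return h, mu, log_var                          # h feeds the next layer; (mu, log_var) is the first distribution
```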

In some embodiments, the aforementioned training module 1701 is specifically configured to train an initial first coding model based on a text sample set to obtain the first coding model, where the text sample set includes at least one text sample and the initial first coding model includes at least N cascaded layers of initial first encoders corresponding respectively to the N layers of first encoders. For any 1 ≤ i < N, the input code of the (i+1)-th layer initial first encoder includes the output code of the i-th layer initial first encoder.

In some embodiments, the training of the initial first coding model described above is training in an unsupervised environment.

In some embodiments, the aforementioned training module 1701 includes a second input unit and a second adjustment unit.

The second input unit is configured to input preamble information into the i-th layer initial first encoder. When i = 1, the preamble information is the text sequence of each text sample; when i > 1, the preamble information is the output code of the (i-1)-th layer initial first encoder.

The second adjusting unit is used for adjusting the parameters of the initial first encoder of the ith layer until the preset loss of the preamble information meets the preset condition.

In some embodiments, the calculation of the preset loss is performed by a second calculation unit. The second calculation unit is configured to: call the feature extraction network corresponding to the i-th layer initial first encoder to process the preamble information and obtain a feature code; select at least one element of the feature code as an anchor point; determine the target text segment corresponding to the element of the preamble information that corresponds to the anchor point, where the text granularity corresponding to the target text segment is larger than the text granularity corresponding to the preamble information; call the reverse support vector extraction network corresponding to the i-th layer initial first encoder to select a positive sample from the other elements of the preamble information corresponding to the target text segment and to select at least one negative sample from the elements not corresponding to the target text segment; compute a noise contrastive estimate based on the anchor point, the positive sample, and the negative samples; and obtain the preset loss based on the noise contrastive estimate.

In some embodiments, the predetermined condition is that the predetermined loss is less than or equal to a first predetermined threshold, or the number of iterations of the adjusting reaches a second predetermined threshold.

In some embodiments, the feature extraction network and the inverse support vector extraction network are based on a contrast predictive coding model. Furthermore, in the output encoding, the positive sample is located after the element corresponding to the anchor point, and the at least one negative sample is located after the positive sample.
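A hedged sketch of the noise-contrastive estimation over one anchor with one positive and several negatives, in the spirit of contrastive predictive coding (InfoNCE); how the anchor, positive, and negatives are selected from the feature code is handled by the networks described above and is not repeated here, and the cosine similarity and temperature are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def nce_loss(anchor, positive, negatives, temperature: float = 0.1):
    """anchor/positive: (dim,); negatives: (num_neg, dim). The loss is low when the
    anchor is more similar to the positive than to every negative."""
    candidates = torch.cat([positive.unsqueeze(0), negatives], dim=0)        # positive at index 0
    logits = F.cosine_similarity(anchor.unsqueeze(0), candidates) / temperature
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# toy usage
loss = nce_loss(torch.randn(128), torch.randn(128), torch.randn(8, 128))
```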

In some embodiments, the aforementioned training module 1701 is specifically configured to jointly train the initial first coding model and the initial first decoding model based on a text sample set and a speech sample set to obtain the first coding model and the first decoding model. The text sample set includes at least one text sample, the speech sample set includes at least one speech sample, and there is a one-to-one correspondence between the at least one text sample and the at least one speech sample. The initial first coding model includes at least N cascaded layers of initial first encoders corresponding respectively to the N layers of first encoders, and for any 1 ≤ i < N, the input code of the (i+1)-th layer initial first encoder includes the output code of the i-th layer initial first encoder. The initial first decoding model includes at least M cascaded layers of initial first decoders corresponding respectively to the M layers of first decoders, and for any 1 ≤ j < M, the second input code of the j-th layer initial first decoder includes the output code of the (j+1)-th layer initial first decoder.

In some embodiments, the aforementioned training module 1701 includes a third input unit and a third adjustment unit.

The third input unit is configured to input the text sample set into the initial first coding model to obtain an intermediate text feature for each text sample. The intermediate text features are then processed by the initial first decoding model to obtain predicted speech samples, and the predicted speech samples of all text samples form a predicted speech sample set.

The third adjusting unit is configured to adjust the parameters of the initial first coding model and the initial first decoding model until the difference between the speech sample set and the predicted speech sample set satisfies a predetermined condition.

In some embodiments, the predetermined condition is that the difference is less than or equal to a first predetermined threshold, or that the number of iterations of the adjusting reaches a second predetermined threshold.
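A minimal sketch of the joint training loop under these two stopping conditions; the loss function, optimizer, learning rate, and per-sample batching are assumptions rather than part of the embodiment.

```python
import torch

def joint_train(coding_model, decoding_model, text_samples, speech_samples,
                loss_fn, threshold: float = 0.01, max_iters: int = 100_000):
    params = list(coding_model.parameters()) + list(decoding_model.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    for _ in range(max_iters):                             # second predetermined threshold
        total = 0.0
        for text, speech in zip(text_samples, speech_samples):
            predicted = decoding_model(coding_model(text)) # intermediate text feature -> predicted speech
            loss = loss_fn(predicted, speech)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total / len(text_samples) <= threshold:         # first predetermined threshold
            break
```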

In some embodiments, the aforementioned speech synthesis apparatus further includes a sample supplement module, configured to add the text information to the text sample set and add the speech information to the speech sample set after the decoding module calls the first decoding model to decode the text feature.

The speech synthesis apparatus provided in the embodiment of the present application can achieve the same technical effects as the speech synthesis method, and specific details can refer to the foregoing method embodiment and are not described herein again.

A third aspect of embodiments of the present application provides an electronic device, which may be used to implement the foregoing speech synthesis method. In some embodiments, the electronic device includes a processor and a memory. The memory stores instructions, and the instructions, when executed by the processor, cause the electronic device to perform any of the aforementioned speech synthesis methods.

A fourth aspect of embodiments of the present application provides a computer-readable storage medium. The computer readable storage medium stores computer instructions, and the computer instructions, when executed by the processor, cause the computer to perform any of the speech synthesis methods described above.

The computer-readable storage medium contains program instructions, data files, data structures, or a combination thereof. The program recorded in the computer-readable storage medium may be designed or configured to implement the method of the present invention. The computer-readable storage medium includes a hardware system for storing and executing program commands. Examples of such hardware are magnetic media (such as hard disks, floppy disks, and magnetic tape), optical media (such as CD-ROMs and DVDs), magneto-optical media (such as floptical disks), and memory devices (such as ROM, RAM, and flash memory). The program includes assembly language code or machine code produced by a compiler, as well as higher-level language code interpreted by an interpreter. The hardware system may be implemented with at least one software module to conform to the present invention.

A fifth aspect of embodiments of the present application provides a computer program product. The computer program product contains computer instructions which, when run on a computer, cause the computer to perform any of the speech synthesis methods described above.

The embodiments are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the corresponding parts of the method.

It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
