Speech synthesis method and system with prosody

Document No.: 513261 | Publication date: 2021-05-28 | Views: 30 | Chinese

Reading note: this technique, "Speech synthesis method and system with prosody" (带有韵律的语音合成方法及系统), was designed and created by Yu Kai (俞凯) and Du Chenpeng (杜晨鹏) on 2020-12-31. Abstract: An embodiment of the invention provides a speech synthesis method with prosody. The method comprises the following steps: predicting prosody using a Mixture Density Network (MDN) based on a Gaussian mixture model; extracting phoneme-level prosody information from target speech to serve as the training target of the mixture density network; performing prosody prediction on the current speech with the trained mixture density network, and sampling the prosody information of each phoneme from the predicted Gaussian mixture distribution; and synthesizing speech based on the sampled prosody information of the individual phonemes. An embodiment of the invention also provides a speech synthesis system with prosody. After the phoneme-level prosody is modeled with a Gaussian mixture distribution, different Gaussian components can represent different prosodies. Experiments show that the prosody likelihood obtained with the Gaussian mixture distribution is significantly higher, so phoneme-level prosody is modeled better and speech with richer prosody is generated.

1. A method for synthesizing speech with prosody, comprising:

predicting prosody using a Mixture Density Network (MDN) based on a Gaussian mixture model;

extracting phoneme-level prosody information from target speech to serve as the training target of the mixture density network;

performing prosody prediction on the current speech using the trained mixture density network, and sampling the prosody information of each phoneme from the predicted Gaussian mixture distribution; and

synthesizing speech based on the sampled prosody information of the individual phonemes.

2. The method of claim 1, wherein the performing prosody prediction on the current speech using the trained mixture density network comprises:

performing prosody prediction using the trained mixture density network conditioned on the current speech and the historical prosody information.

3. The method of claim 1, wherein the phoneme-level prosodic information is obtained by a prosody extractor, wherein the prosody extractor comprises a recurrent neural network layer for embedding the phoneme-level prosodic information.

4. The method of claim 3, wherein the architecture of the prosody extractor comprises: two-layer two-dimensional convolution, batch normalization layer and ReLU activation function.

5. A speech synthesis system with prosody, comprising:

a prosody prediction program module for predicting prosody using a Mixture Density Network (MDN) based on a Gaussian mixture model;

a training target determining program module for extracting phoneme-level prosody information from target speech to serve as the training target of the mixture density network;

a prosody information prediction program module for performing prosody prediction on the current speech using the trained mixture density network, and sampling the prosody information of each phoneme from the predicted Gaussian mixture distribution; and

a speech synthesis program module for synthesizing speech based on the sampled prosody information of the individual phonemes.

6. The system of claim 5, wherein the prosody information prediction program module is configured to:

perform prosody prediction using the trained mixture density network conditioned on the current speech and the historical prosody information.

7. The system of claim 5, wherein the phoneme-level prosodic information is obtained by a prosody extractor, wherein the prosody extractor comprises a recurrent neural network layer for embedding the phoneme-level prosodic information.

8. The system of claim 7, wherein the architecture of the prosody extractor comprises: two-layer two-dimensional convolution, batch normalization layer and ReLU activation function.

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-4.

10. A storage medium on which a computer program is stored which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 4.

Technical Field

The invention relates to the field of intelligent speech, and in particular to a speech synthesis method and system with prosody.

Background

A neural text-to-speech (TTS) synthesis model with a sequence-to-sequence structure may be used to generate natural-sounding speech.

In addition to advances in acoustic modeling, prosody modeling has also been studied extensively. For example, an utterance-level prosody model in TTS extracts a global (utterance-level) prosody embedding from reference speech to control the prosody of the TTS output, where the prosody is embedded in several Global Style Tokens (GSTs). Prosody modeling can also be performed with a variational auto-encoder (VAE), so that diverse prosody embeddings can be sampled from a standard Gaussian prior. Beyond global prosody modeling, recent research has also examined fine-grained prosody, for example extracting frame-level prosody information and aligning it to each phoneme encoding with an attention module.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:

Most existing prosody models adopt a unimodal distribution, such as a single Gaussian, which is not reasonable enough. As a result, the sampled prosody is insufficiently diverse; moreover, prosody sampling under such a distribution may degrade the quality of the synthesized speech.

Disclosure of Invention

The invention aims at least to solve the problem in the prior art that, because the sampled prosody is insufficiently diverse, prosody sampling under a unimodal distribution can degrade the quality of the synthesized speech.

In a first aspect, an embodiment of the present invention provides a method for synthesizing speech with prosody, including:

predicting prosody using a Mixture Density Network (MDN) based on a Gaussian mixture model;

extracting phoneme-level prosody information from target speech to serve as the training target of the mixture density network;

performing prosody prediction on the current speech using the trained mixture density network, and sampling the prosody information of each phoneme from the predicted Gaussian mixture distribution; and

synthesizing speech based on the sampled prosody information of the individual phonemes.

In a second aspect, an embodiment of the present invention provides a speech synthesis system with prosody, including:

a prosody prediction program module for predicting prosody using a Mixture Density Network (MDN) based on a Gaussian mixture model;

a training target determining program module for extracting phoneme-level prosody information from target speech to serve as the training target of the mixture density network;

a prosody information prediction program module for performing prosody prediction on the current speech using the trained mixture density network, and sampling the prosody information of each phoneme from the predicted Gaussian mixture distribution; and

a speech synthesis program module for synthesizing speech based on the sampled prosody information of the individual phonemes.

In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for synthesizing speech with prosody of any embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the program is executed by a processor to implement the steps of the method for synthesizing speech with prosody according to any embodiment of the present invention.

The embodiments of the invention have the following beneficial effects: after the phoneme-level prosody is modeled with a Gaussian mixture distribution, different Gaussian components can represent different prosodies. Experiments show that the prosody likelihood obtained with the Gaussian mixture distribution is significantly higher, so phoneme-level prosody is modeled better and speech with richer prosody is generated.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flow chart of a method for synthesizing speech with prosody according to an embodiment of the present invention;

FIG. 2 is a diagram of the overall architecture, based on FastSpeech2, of a speech synthesis method with prosody according to an embodiment of the present invention;

FIG. 3 is a diagram of a prosody extractor architecture for a method for synthesizing speech with prosody according to an embodiment of the present invention;

FIG. 4 is a diagram of prosody predictor architecture for a method for speech synthesis with prosody according to an embodiment of the present invention;

FIG. 5 is a graph of performance data on a test set for a method of speech synthesis with prosody provided by an embodiment of the present invention;

FIG. 6 is a diagram of log-likelihood of extracted ground truth PL prosody embedding for a method of speech synthesis with prosody provided by an embodiment of the present invention;

FIG. 7 is a chart of AB preferences test data for prosody diversity of a method for speech synthesis with prosody in accordance with an embodiment of the present invention;

FIG. 8 is a diagram of data for assessing the naturalness and inference speed of a TTS system with a prosodic speech synthesis method according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a speech synthesis system with prosody according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of a speech synthesis method with prosody according to an embodiment of the present invention, which includes the following steps:

S11: predicting prosody using a Mixture Density Network (MDN) based on a Gaussian mixture model;

S12: extracting phoneme-level prosody information from target speech to serve as the training target of the mixture density network;

S13: performing prosody prediction on the current speech using the trained mixture density network, and sampling the prosody information of each phoneme from the predicted Gaussian mixture distribution;

S14: synthesizing speech based on the sampled prosody information of the individual phonemes.

In this embodiment, mapping a phoneme sequence to its corresponding mel spectrogram is a one-to-many mapping; therefore, the use of a multimodal distribution is contemplated.

For step S11, the method defines the neural network as a mixture model and focuses on a Mixture Density Network (MDN) based on a Gaussian Mixture Model (GMM), which predicts the parameters of a Gaussian mixture distribution: the means μ_i, the variances σ_i², and the mixing weights α_i. Note that the mixing weights are constrained to sum to 1, which is achieved by applying a Softmax function, formalized as:

α_i = exp(z_{α,i}) / Σ_{j=1}^{M} exp(z_{α,j})

where M is the number of Gaussian components and z_{α,i} is the corresponding neural network output. The means and standard deviations of the Gaussian components are expressed as:

μ_i = z_{μ,i},  σ_i = exp(z_{σ,i})

where z_{μ,i} and z_{σ,i} are the neural network outputs corresponding to the i-th Gaussian component. The exponential in the formula above guarantees that σ_i is positive.

The criterion for training the MDN in this work is the negative log-likelihood of the observation e_k given its inputs h and e_{k-1}. The loss function can be expressed as:

L_MDN = −log Σ_{i=1}^{M} α_i N(e_k; μ_i, σ_i² I)

Thus, the mixture density network is optimized to predict the GMM parameters under which e_k has the highest probability, and prosody prediction is performed using the Mixture Density Network (MDN) based on the Gaussian mixture model.
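The MDN parameterization and its negative log-likelihood criterion can be sketched in numpy as follows. This is a minimal illustration, not the paper's implementation; the function names are ours, and the covariance is taken as isotropic (σ_i² I) per component:

```python
import numpy as np

def mdn_params(z_alpha, z_mu, z_sigma):
    """Map raw MDN outputs to valid GMM parameters.

    z_alpha, z_sigma: shape (M,); z_mu: shape (M, D).
    The Softmax constrains the mixing weights to sum to 1, and the
    exponential constrains every standard deviation to be positive,
    matching the formulas above.
    """
    alpha = np.exp(z_alpha - z_alpha.max())
    alpha /= alpha.sum()          # mixing weights alpha_i, sum to 1
    mu = z_mu                     # means mu_i, used directly
    sigma = np.exp(z_sigma)       # std devs sigma_i, guaranteed > 0
    return alpha, mu, sigma

def mdn_nll(e_k, alpha, mu, sigma):
    """Negative log-likelihood of one observation e_k (shape (D,))
    under the GMM, with isotropic covariance sigma_i^2 * I per
    component; log-sum-exp is used for numerical stability."""
    D = e_k.shape[0]
    log_comp = (-0.5 * np.sum((e_k - mu) ** 2, axis=1) / sigma ** 2
                - D * np.log(sigma)
                - 0.5 * D * np.log(2.0 * np.pi))
    log_mix = np.log(alpha) + log_comp
    m = log_mix.max()
    return -(m + np.log(np.sum(np.exp(log_mix - m))))
```

Minimizing this quantity over the training data is exactly what drives the predicted mixture toward assigning e_k the highest probability.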

For step S12, the method is applied to a TTS model in actual use.

The TTS model is based on FastSpeech2, which resolves problems in FastSpeech and better addresses the one-to-many mapping problem in TTS by: 1. training the model directly on real targets instead of the simplified output of a teacher; 2. introducing more speech variation information (such as pitch, energy, and more precise duration) as conditional input. The encoder converts the input phoneme sequence into a hidden state sequence h, and the output mel spectrogram is then predicted through a variance adaptor and a decoder. Compared with the original FastSpeech, FastSpeech2 is optimized to minimize the mean squared error (MSE) L_MEL between the predicted and ground-truth mel spectrograms, without teacher-student training. Furthermore, the duration target is not extracted from the attention of an autoregressive teacher model, but from the forced alignment of speech and text. In addition, the prediction of the mel spectrogram is conditioned on variance information such as pitch and energy through the variance adaptor, which is trained to predict the variance information under an MSE loss L_VAR.

On the basis of this TTS model, the method introduces a prosody extractor and a prosody predictor. The model structure is shown in Fig. 2, which contains the prosody extractor, the mixture density network (which may also be called the prosody predictor), and other components. In the figure, SG denotes a stop-gradient operation that prevents back-propagation of the gradient, and OR denotes that training uses the prosody extracted from the real speech while prediction uses the prosody sampled from the predicted distribution. The mixture density network is part of the TTS model, so at the macroscopic level its training is simply part of training the entire TTS model.

During model training, the phoneme-level prosody information is extracted by an extractor network from the target speech segment corresponding to each phoneme and used as the training target of the mixture density network. The mixture density network here is used to predict phoneme-level prosody and is therefore also called the prosody predictor network.

More specifically, within the macroscopic TTS model, both the prosody extractor and the prosody predictor are trained jointly with the FastSpeech2 architecture. A phoneme-level prosody embedding e is extracted from the ground-truth mel spectrogram segment by the prosody extractor and projected onto the hidden state sequence h. The prosody extractor is thus optimized to pack into e the prosody information that helps to better reconstruct the mel spectrogram. The method uses a GMM, whose parameters are predicted by the MDN, to model the distribution of e. Here the MDN is the prosody predictor, which takes the hidden state sequence h as input and predicts z_α, z_μ, and z_σ (the network outputs that determine the mixing weights, means, and variances of the Gaussian components) for each phoneme. A GRU (Gated Recurrent Unit) is also designed for predicting the current prosody distribution. During inference, the GMM distribution is predicted autoregressively and the prosody embedding of each phoneme is sampled from it. The sampled embeddings are then projected and added to the corresponding hidden state sequence h.

The overall architecture is optimized with the loss function:

L = L_FastSpeech2 + β · L_MDN

where L_MDN is the negative log-likelihood of e as defined above, L_FastSpeech2 is the loss function of FastSpeech2 (the sum of the variance prediction loss L_VAR and the mel spectrogram reconstruction loss L_MEL), and β is the relative weight between the two terms. It is worth noting that we apply the stop-gradient operation on e when computing L_MDN, so L_MDN does not directly optimize the prosody extractor.

In one embodiment, the phoneme-level prosody information is obtained by a prosody extractor, wherein the prosody extractor includes a recurrent neural network layer for embedding the phoneme-level prosody information.

The architecture of the prosody extractor includes: two-layer two-dimensional convolution, batch normalization layer and ReLU activation function.

In the present embodiment, the detailed configuration of the prosody extractor is shown in Fig. 3. It contains 2 layers of 2D convolution with kernel size 3 × 3, each followed by a batch normalization layer and a ReLU activation function. After this module, a bidirectional GRU with hidden size 32 is applied. The concatenated forward and backward states from the GRU layer form the output of the prosody extractor, which is called the prosody embedding of the phoneme.

Further, the detailed architecture of the prosody predictor is shown in Fig. 4. The hidden state h passes through 2 layers of one-dimensional convolution with kernel size 3, each followed in turn by ReLU, layer normalization, and a dropout layer. The output of this module is then concatenated with the previous prosody embedding e_{k-1} and fed to a GRU with hidden size 384, from which we obtain z_α, z_μ, and z_σ.

For step S13, prosody prediction is performed on the current speech by the mixture density network trained in the above steps, so that the prosody information of each phoneme is sampled from the predicted Gaussian mixture distribution.

As an embodiment, the performing prosody prediction on the current speech using the trained mixture density network includes:

performing prosody prediction using the trained mixture density network conditioned on the current speech and the historical prosody information.

In the present embodiment, the prosody distribution of each phoneme is predicted based on the information of the current phoneme and the historical prosody information. When synthesizing speech, the prosody of each phoneme is sampled from the corresponding predicted Gaussian mixture distribution.
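The per-phoneme autoregressive sampling just described can be sketched as below. `predict_params` is a hypothetical stand-in for the trained prosody predictor (not part of the original text), and the standard-deviation scale of 0.2 follows the experimental setting described later:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prosody(alpha, mu, sigma, scale=0.2):
    """Sample one prosody embedding from a predicted GMM.
    alpha: (M,) mixing weights; mu: (M, D) means; sigma: (M,)
    std devs. `scale` shrinks the standard deviation at sampling
    time to trade some diversity for stability."""
    i = rng.choice(len(alpha), p=alpha)   # pick a Gaussian component
    return mu[i] + scale * sigma[i] * rng.standard_normal(mu.shape[1])

def sample_sequence(predict_params, h_seq, e_dim, scale=0.2):
    """Autoregressively sample one prosody embedding per phoneme.
    predict_params(h_k, e_prev) -> (alpha, mu, sigma) stands in for
    the trained MDN conditioned on the current phoneme's hidden
    state and the previous (historical) prosody embedding."""
    e_prev = np.zeros(e_dim)
    embeddings = []
    for h_k in h_seq:
        alpha, mu, sigma = predict_params(h_k, e_prev)
        e_prev = sample_prosody(alpha, mu, sigma, scale)
        embeddings.append(e_prev)
    return np.stack(embeddings)
```

Each sampled embedding feeds back into the next prediction, which is what makes the prosody vary coherently across the utterance rather than independently per phoneme.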

The current speech is the speech to which prosody is to be added. For example, when a user has a voice conversation with a smart speaker, the speech to be output by the smart speaker is speech without added prosody, i.e., the current speech referred to in this step, and the historical prosody information can be extracted from the speech previously input by the user, so that the user's prosody can be added to the current speech to obtain the prosody information of each phoneme. (Depending on the use case, the historical prosody information can also be obtained in other ways; the current speech to which prosody is added is not limited to this example.)

In step S14, speech is synthesized based on the sampled prosody information of the individual phonemes, finally producing synthesized speech rich in prosodic variation.

It can be seen from this embodiment that after the phoneme-level prosody is modeled with a Gaussian mixture distribution, different Gaussian components can represent different prosodies. Experiments show that the prosody likelihood obtained with the Gaussian mixture distribution is significantly higher, so phoneme-level prosody is modeled better and speech with richer prosody is generated.

The method was tested as follows. LJSpeech is a single-speaker English dataset containing about 24 hours of speech and 13100 utterances. We selected 50 utterances for validation, another 50 utterances for testing, and the remaining utterances for training. For simplicity, the speech was resampled to 16 kHz. Before training the TTS, we computed the phoneme alignments of the training data using an HMM-GMM (Hidden Markov Model - Gaussian Mixture Model) ASR (Automatic Speech Recognition) model trained on LibriSpeech, and then extracted the duration of each phoneme from the alignments for FastSpeech2 training.

In the method, all FastSpeech2-based TTS models take a phoneme sequence as input and the corresponding 320-dimensional mel spectrogram as output. The frame shift is set to 12.5 ms and the frame length to 50 ms. β is set to 0.02. WaveNet acts as the vocoder to reconstruct the waveform from the mel spectrogram.

To demonstrate the necessity of using phoneme-level prosody information, the method verifies whether using the extracted PL (phoneme-level) prosody embedding e in reconstruction is better than using a global VAE (variational auto-encoder). In the global VAE system, a 256-dimensional global prosody embedding is extracted by the VAE for each utterance, and the embedding is then broadcast and added to the encoder output of FastSpeech2 to reconstruct the mel spectrogram. In our PL model, the number of Gaussian components in the prosody predictor is 10, and e is extracted as described in the embodiment above. An open-source tool is used to compute the mel-cepstral distortion (MCD) over the test set to measure the distance between the reconstructed speech and the real speech. The results are shown in Fig. 5; the lower the MCD, the better. We find that using the extracted phoneme-level prosody e improves reconstruction performance.
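For illustration, a generic form of the mel-cepstral distortion metric (not necessarily the exact formula of the open-source tool used in the experiments) can be computed as:

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_syn):
    """Mel-cepstral distortion in dB between two time-aligned
    mel-cepstrum sequences of shape (T, K). The 0th (energy)
    coefficient is conventionally excluded; lower is better."""
    diff = c_ref[:, 1:] - c_syn[:, 1:]
    # Euclidean distance per frame over the cepstral coefficients
    frame_dist = np.sqrt(np.sum(diff ** 2, axis=1))
    # 10 / ln(10) * sqrt(2) converts the log-spectral distance to dB
    return (10.0 / np.log(10.0)) * np.sqrt(2.0) * np.mean(frame_dist)
```

In practice the two sequences are first time-aligned (e.g., by dynamic time warping) before this frame-wise distance is averaged.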

To analyze the number of Gaussian components, we examine how many components are needed to model the distribution of the extracted e. We plot log-likelihood curves on the training and validation sets in Fig. 6 for several different numbers of Gaussian components. It can be observed that the gap between the training and validation curves is larger for the single-Gaussian model than for the GMMs. Furthermore, increasing the number of components yields higher log-likelihood, thereby improving PL prosody modeling. Therefore, we use 10 components in the following GMM experiments.

We performed subjective evaluation of three FastSpeech2-based TTS systems with different prosody modeling: 1) a global VAE; 2) PL1, PL prosody modeling with a single Gaussian; 3) PL10, PL prosody modeling with 10 Gaussian components. To provide better speech quality in the synthesized speech, we scale the predicted standard deviation of each Gaussian by a factor of 0.2 at sampling time.

Using different sampled prosodies, each utterance in the test set was synthesized 3 times. We performed an AB preference test in which the two sets of synthesized speech came from two different TTS models and 20 listeners were asked to select the better one in terms of prosodic diversity. The results in Fig. 7 show that PL10 provides better prosodic diversity in synthesized speech than PL1 and the global VAE.

We also assessed the naturalness of the synthesized speech with a Mean Opinion Score (MOS) test, in which listeners were asked to score each utterance on a 5-point scale. Speech converted back from the ground-truth spectrogram with the WaveNet vocoder was also rated, as "ground truth". The results are shown in Fig. 8. Autoregressively sampled PL prosody from a single Gaussian sometimes produces very unnatural speech, lowering the MOS of PL1. We found that the naturalness of PL10 is better than that of PL1, indicating that a GMM can model PL prosody better than a single-Gaussian model. The global VAE system also has good naturalness, very close to the result of PL10.

FastSpeech2 is used as a non-autoregressive TTS model to avoid frame-by-frame generation and speed up inference. In this work, only the distribution of the PL prosody embedding is predicted autoregressively, in the hope of maintaining a fast inference speed. We evaluated all systems on the test set using an Intel Xeon Gold 6240 CPU. As shown in Fig. 8, the time cost of the proposed model is only 1.11 times that of the baseline. Therefore, the impact of autoregressive PL prosody prediction on inference speed is very limited.

The method models prosody at the phoneme level using a GMM-based mixture density network; the extracted phoneme-level prosody embedding is denoted e. Our experiments demonstrate for the first time that the extracted e provides effective reconstruction information, better than using a global VAE. We then find that the log-likelihood of e increases when more Gaussian components are used, indicating that a GMM can model PL prosody better than a single Gaussian. Subjective evaluation shows that the method significantly improves prosodic diversity in the synthesized speech without manual control and achieves better naturalness. We also find that the additional mixture density network has a very limited impact on inference speed.

Fig. 9 is a schematic structural diagram of a speech synthesis system with prosody according to an embodiment of the present invention, which can execute the speech synthesis method with prosody described in any of the above embodiments and is configured in a terminal.

The present embodiment provides a speech synthesis system 10 with prosody, which includes: a prosody prediction program module 11, a training target determination program module 12, a prosody information prediction program module 13, and a speech synthesis program module 14.

Wherein, the prosody prediction program module 11 is configured to predict prosody using a Mixture Density Network (MDN) based on a mixture gaussian model; the training target determining program module 12 is configured to extract phoneme-level prosody information from the target speech as a training target of the mixed density network; the prosody information prediction program module 13 is configured to perform prosody prediction on the current speech by using the trained mixed density network, and sample prosody information of each phoneme from a mixture gaussian distribution obtained by prediction; the speech synthesis program module 14 is used to synthesize speech based on the prosodic information of the sampled individual phonemes.

Further, the prosody information prediction program module is configured to:

perform prosody prediction using the trained mixture density network conditioned on the current speech and the historical prosody information.

Further, the phoneme-level prosody information is obtained by a prosody extractor, wherein the prosody extractor includes a recurrent neural network layer for embedding the phoneme-level prosody information.

Further, the architecture of the prosody extractor includes: two-layer two-dimensional convolution, batch normalization layer and ReLU activation function.

An embodiment of the invention also provides a non-volatile computer storage medium, which stores computer-executable instructions that can execute the speech synthesis method with prosody in any of the above method embodiments.

as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

predicting prosody using a Mixture Density Network (MDN) based on a Gaussian mixture model;

extracting phoneme-level prosody information from target speech to serve as a training target of the mixed density network;

carrying out prosody prediction on the current voice by using the trained mixed density network, and sampling prosody information of each phoneme from the mixed Gaussian distribution obtained by prediction;

speech is synthesized based on the prosody information of the sampled individual phonemes.

The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the speech synthesis method with prosody in any of the method embodiments described above.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for synthesizing speech with prosody of any embodiment of the present invention.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.

(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.

(3) Portable entertainment devices such devices may display and play multimedia content. The devices comprise audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.

(4) Other electronic devices with data processing capabilities.

As used herein, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
