Speech synthesis method, device and storage medium

Document No.: 513256 Publication date: 2021-05-28

Reading note: this technology, Speech synthesis method, device and storage medium (一种语音合成方法、装置及存储介质), was designed by 殷昊, 陈云琳, 江明奇, 杨喜鹏 and 张旭 on 2020-12-31. Main content: The invention discloses a speech synthesis method, an apparatus and a computer-readable storage medium. First, an original speech signal is decomposed by frequency into n sub-band frequency signals, where n is a positive integer greater than or equal to 2; next, mel spectrum features are extracted from the original speech signal; then, a predicted sampling point of each of the n sub-band frequency signals is generated according to the extracted mel spectrum features; finally, the n sub-band frequency signals are synthesized using the predicted sampling point of each sub-band frequency signal to obtain a speech synthesis signal corresponding to the original speech signal.

1. A method of speech synthesis, the method comprising:

decomposing an original speech signal into n sub-band frequency signals according to frequency, wherein the value of n is a positive integer greater than or equal to 2;

extracting mel spectrum features from the original speech signal;

generating a predicted sampling point of each sub-band frequency signal in the n sub-band frequency signals according to the extracted mel spectrum features;

and synthesizing the n sub-band frequency signals by using the predicted sampling point of each sub-band frequency signal in the n sub-band frequency signals to obtain a speech synthesis signal corresponding to the original speech signal.

2. The method of claim 1, wherein generating the predicted sample point for each of the n subband frequency signals from the extracted mel spectral features comprises:

performing linear prediction on the n sub-band frequency signals according to the extracted mel frequency spectrum characteristics to obtain a linear prediction value corresponding to each sub-band frequency signal in the n sub-band frequency signals;

performing neural network prediction on the n sub-band frequency signals by using the extracted mel frequency spectrum characteristics to obtain a residual error corresponding to each sub-band frequency signal in the n sub-band frequency signals;

and correspondingly adding the linear predicted value and the residual error corresponding to each sub-band frequency signal in the n sub-band frequency signals to obtain a predicted sampling point of each sub-band frequency signal in the n sub-band frequency signals.

3. The method according to claim 1, wherein synthesizing the n subband frequency signals using the predicted sampling point of each subband frequency signal in the n subband frequency signals to obtain the speech synthesis signal corresponding to the original speech signal comprises:

generating n sub-band voice synthesis signals according to the linear prediction value and the residual error corresponding to each sub-band frequency signal in the n sub-band frequency signals;

and combining the n sub-band voice synthesis signals according to frequency to obtain a voice synthesis signal corresponding to the original voice signal.

4. The method of claim 2, wherein performing linear prediction on the n subband frequency signals according to the extracted mel spectral features comprises:

converting the extracted mel frequency spectrum features into linear spectrums;

equally dividing the linear spectrum into n sub-band linear spectra;

performing linear prediction on the n sub-band linear spectrums to obtain a linear prediction coefficient corresponding to each sub-band linear spectrum;

and determining a linear prediction value corresponding to each sub-band frequency signal in the n sub-band frequency signals according to the linear prediction coefficient.

5. The method of claim 2, wherein performing neural network prediction on the n subband frequency signals using the extracted mel-frequency spectral features comprises:

carrying out model training by utilizing the Mel frequency spectrum sample and the n subband frequency signals to obtain a neural network model;

and taking the extracted Mel frequency spectrum characteristics as the input of the neural network model, and performing neural network prediction on the n sub-band frequency signals.

6. A speech synthesis apparatus, characterized in that the apparatus comprises:

the signal decomposition module is used for decomposing the original voice signal into n sub-band frequency signals according to frequency, wherein the value of n is a positive integer which is more than or equal to 2;

the characteristic extraction module is used for extracting Mel frequency spectrum characteristics from the original voice signal;

the sampling point generating module is used for generating a prediction sampling point of each sub-band frequency signal in the n sub-band frequency signals according to the extracted Mel frequency spectrum characteristics;

and the signal synthesis module is used for synthesizing the n sub-band frequency signals by using the predicted sampling point of each sub-band frequency signal in the n sub-band frequency signals to obtain a voice synthesis signal corresponding to the original voice signal.

7. The apparatus of claim 6,

the sampling point generating module is specifically configured to perform linear prediction on the n subband frequency signals according to the extracted mel frequency spectrum features to obtain a linear prediction value corresponding to each subband frequency signal in the n subband frequency signals; performing neural network prediction on the n sub-band frequency signals by using the extracted mel frequency spectrum characteristics to obtain a residual error corresponding to each sub-band frequency signal in the n sub-band frequency signals; and correspondingly adding the linear predicted value and the residual error corresponding to each sub-band frequency signal in the n sub-band frequency signals to obtain a predicted sampling point of each sub-band frequency signal in the n sub-band frequency signals.

8. The apparatus of claim 6,

the signal synthesis module is specifically configured to generate n subband speech synthesis signals according to the linear prediction value and the residual error corresponding to each subband frequency signal in the n subband frequency signals; and combining the n sub-band voice synthesis signals according to frequency to obtain a voice synthesis signal corresponding to the original voice signal.

9. The apparatus of claim 7,

the sampling point generating module is also used for converting the extracted Mel frequency spectrum characteristics into a linear spectrum; equally dividing the linear spectrum into n sub-band linear spectra; performing linear prediction on the n sub-band linear spectrums to obtain a linear prediction coefficient corresponding to each sub-band linear spectrum; and determining a linear prediction value corresponding to each sub-band frequency signal in the n sub-band frequency signals according to the linear prediction coefficient.

10. A computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform the speech synthesis method of any one of claims 1 to 5.

Technical Field

The present invention relates to speech processing technologies, and in particular, to a speech synthesis method, apparatus, and computer-readable storage medium.

Background

Speech synthesis refers to a technique in which a computer automatically generates corresponding speech from text. A speech synthesis system mainly consists of text front-end analysis, an acoustic model and a vocoder, and the field is gradually shifting from conventional techniques to deep learning.

Speech synthesis based on deep learning can greatly improve the quality of synthesized speech. However, because the sampling rate of speech is high, a neural-network-based speech synthesis system must generate a large number of sampling points every second, which often causes latency.

Disclosure of Invention

In order to overcome the above drawbacks of current neural-network-based speech synthesis technology, embodiments of the present invention provide a speech synthesis method, an apparatus, and a computer-readable storage medium.

According to a first aspect of the present invention, there is provided a speech synthesis method comprising: decomposing an original voice signal into n sub-band frequency signals according to frequency, wherein the value of n is a positive integer greater than or equal to 2; extracting mel frequency spectrum characteristics from the original voice signal; generating a prediction sampling point of each sub-band frequency signal in the n sub-band frequency signals according to the extracted Mel frequency spectrum characteristics; and synthesizing the n sub-band frequency signals by using the predicted sampling point of each sub-band frequency signal in the n sub-band frequency signals to obtain a speech synthesis signal corresponding to the original speech signal.

According to an embodiment of the present invention, generating a predicted sample point of each of the n subband frequency signals according to the extracted mel spectral features comprises: performing linear prediction on the n sub-band frequency signals according to the extracted mel frequency spectrum characteristics to obtain a linear prediction value corresponding to each sub-band frequency signal in the n sub-band frequency signals; performing neural network prediction on the n sub-band frequency signals by using the extracted mel frequency spectrum characteristics to obtain a residual error corresponding to each sub-band frequency signal in the n sub-band frequency signals; and correspondingly adding the linear predicted value and the residual error corresponding to each sub-band frequency signal in the n sub-band frequency signals to obtain a predicted sampling point of each sub-band frequency signal in the n sub-band frequency signals.

According to an embodiment of the present invention, synthesizing the n subband frequency signals by using the predicted sampling point of each subband frequency signal in the n subband frequency signals to obtain a speech synthesis signal corresponding to the original speech signal, includes: generating n sub-band voice synthesis signals according to the linear prediction value and the residual error corresponding to each sub-band frequency signal in the n sub-band frequency signals; and combining the n sub-band voice synthesis signals according to frequency to obtain a voice synthesis signal corresponding to the original voice signal.

According to an embodiment of the present invention, the performing linear prediction on the n subband frequency signals according to the extracted mel-frequency spectrum features includes: converting the extracted mel frequency spectrum features into linear spectrums; equally dividing the linear spectrum into n sub-band linear spectra; performing linear prediction on the n sub-band linear spectrums to obtain a linear prediction coefficient corresponding to each sub-band linear spectrum; and determining a linear prediction value corresponding to each sub-band frequency signal in the n sub-band frequency signals according to the linear prediction coefficient.

According to an embodiment of the present invention, the performing neural network prediction on the n subband frequency signals by using the extracted mel-frequency spectrum features includes: carrying out model training by utilizing the Mel frequency spectrum sample and the n subband frequency signals to obtain a neural network model; and taking the extracted Mel frequency spectrum characteristics as the input of the neural network model, and performing neural network prediction on the n sub-band frequency signals.

According to the second aspect of the present invention, there is also provided a speech synthesis apparatus comprising: the signal decomposition module is used for decomposing the original voice signal into n sub-band frequency signals according to frequency, wherein the value of n is a positive integer which is more than or equal to 2; the characteristic extraction module is used for extracting Mel frequency spectrum characteristics from the original voice signal; the sampling point generating module is used for generating a prediction sampling point of each sub-band frequency signal in the n sub-band frequency signals according to the extracted Mel frequency spectrum characteristics; and the signal synthesis module is used for synthesizing the n sub-band frequency signals by using the predicted sampling point of each sub-band frequency signal in the n sub-band frequency signals to obtain a voice synthesis signal corresponding to the original voice signal.

According to an embodiment of the present invention, the sampling point generating module is specifically configured to perform linear prediction on the n subband frequency signals according to the extracted mel frequency spectrum feature, so as to obtain a linear prediction value corresponding to each subband frequency signal in the n subband frequency signals; performing neural network prediction on the n sub-band frequency signals by using the extracted mel frequency spectrum characteristics to obtain a residual error corresponding to each sub-band frequency signal in the n sub-band frequency signals; and correspondingly adding the linear predicted value and the residual error corresponding to each sub-band frequency signal in the n sub-band frequency signals to obtain a predicted sampling point of each sub-band frequency signal in the n sub-band frequency signals.

According to an embodiment of the present invention, the signal synthesis module is specifically configured to generate n subband speech synthesis signals according to a linear prediction value and a residual error corresponding to each subband frequency signal in the n subband frequency signals; and combining the n sub-band voice synthesis signals according to frequency to obtain a voice synthesis signal corresponding to the original voice signal.

According to an embodiment of the present invention, the sampling point generating module is further configured to convert the extracted mel-frequency spectrum feature into a linear spectrum; equally dividing the linear spectrum into n sub-band linear spectra; performing linear prediction on the n sub-band linear spectrums to obtain a linear prediction coefficient corresponding to each sub-band linear spectrum; and determining a linear prediction value corresponding to each sub-band frequency signal in the n sub-band frequency signals according to the linear prediction coefficient.

According to an embodiment of the present invention, the sampling point generating module is further configured to perform model training by using mel-frequency spectrum samples and the n subband frequency signals to obtain a neural network model; and taking the extracted Mel frequency spectrum characteristics as the input of the neural network model, and performing neural network prediction on the n sub-band frequency signals.

According to a third aspect of the present invention, there is also provided a computer-readable storage medium comprising a set of computer-executable instructions which, when executed, are operable to perform any of the speech synthesis methods described above.

The invention discloses a speech synthesis method, an apparatus and a computer-readable storage medium, in which an original speech signal is first decomposed by frequency into n sub-band frequency signals, where n is a positive integer greater than or equal to 2; mel spectrum features are then extracted from the original speech signal; a predicted sampling point of each of the n sub-band frequency signals is then generated according to the extracted mel spectrum features; and finally the n sub-band frequency signals are synthesized using the predicted sampling point of each sub-band frequency signal to obtain a speech synthesis signal corresponding to the original speech signal. By introducing the sub-band (subband) technique, the invention thus enables the speech synthesis system to output multiple sampling points at each step instead of one, which effectively accelerates the prediction speed of the speech synthesis system.

It is to be understood that the teachings of the present invention need not achieve all of the above-described benefits, but rather that specific embodiments may achieve specific technical results, and that other embodiments of the present invention may achieve benefits not mentioned above.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which the same or corresponding reference numerals indicate the same or corresponding parts:

FIG. 1 is a schematic diagram of the implementation flow of a speech synthesis method according to an embodiment of the present invention;

FIG. 2 is a block diagram of a speech synthesis system according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the implementation flow of generating predicted sampling points from mel spectrum features according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the composition structure of a speech synthesis apparatus according to an embodiment of the present invention.

Detailed Description

The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given only to enable those skilled in the art to better understand and to implement the present invention, and do not limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

The technical solution of the present invention is further elaborated below with reference to the drawings and the specific embodiments.

In the related art, one method of accelerating the prediction speed of a vocoder combines linear prediction coefficients with a neural network. The method divides the speech signal into a linear part and a nonlinear part, estimates the easily predicted linear part using linear prediction coefficients, and leaves the hard-to-predict nonlinear part to a powerful neural network. Since the neural network only needs to predict the residual error (the nonlinear part) of the speech signal, a relatively simple network can be used, which speeds up speech synthesis.

However, this method achieves real-time speech synthesis only on a relatively high-performance graphics processing unit (GPU). When offline speech synthesis is performed on a lower-performance machine (such as a watch, a headset, or a vehicle), the large number of sampling points that must be predicted per second often causes delay.

To solve the above problems, embodiments of the present invention provide a subband-based linear prediction speech synthesis method that implements a parallel vocoder technique: an autoregressive vocoder that outputs one sampling point at a time is converted into one that generates multiple sampling points at a time, which effectively accelerates prediction.
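The effect on the number of sequential generation steps can be illustrated with a little arithmetic. This is only a sketch: it assumes that each step emits exactly one sample per subband and that the n subbands together carry as many samples as the full-band signal. The function name and the 16000 Hz sample rate are illustrative and do not come from the source.

```python
def autoregressive_steps(num_samples, n_bands):
    """Sequential steps needed when every step emits n_bands samples (assumption)."""
    if n_bands < 1:
        raise ValueError("n_bands must be at least 1")
    return -(-num_samples // n_bands)  # ceiling division


sample_rate = 16000  # illustrative value, not taken from the source
for n in (1, 2, 4):
    # n = 1 is the one-sample-at-a-time baseline; the method itself requires n >= 2
    print(n, autoregressive_steps(sample_rate, n))
```

Under these assumptions, one second of audio needs 16000 steps with a single band, 8000 with two subbands and 4000 with four.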

FIG. 1 is a flow chart illustrating a speech synthesis method according to an embodiment of the present invention; FIG. 2 is a block diagram of a speech synthesis system according to an embodiment of the present invention.

Referring to FIG. 1, a speech synthesis method according to an embodiment of the present invention includes: operation 101, decomposing an original speech signal by frequency into n subband frequency signals; operation 102, extracting mel spectrum features from the original speech signal; operation 103, generating a predicted sampling point of each of the n subband frequency signals according to the extracted mel spectrum features; and operation 104, synthesizing the n subband frequency signals by using the predicted sampling point of each subband frequency signal in the n subband frequency signals, so as to obtain a speech synthesis signal corresponding to the original speech signal.

In operation 101, sub-band decomposition (subband) means converting the original signal from the time domain to the frequency domain and then dividing it into several sub-bands.

Specifically, the speech synthesis system first decomposes the original speech signal (wav) by frequency into n subband frequency signals (subband wav), as shown in the speech synthesis system framework of FIG. 2, which may be denoted subband frequency_1, subband frequency_2, …, subband frequency_n, where n is a positive integer greater than or equal to 2.
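A minimal sketch of decomposition by frequency follows. It assumes a simple approach in which the spectrum is cut into n equal slices and each slice is transformed back to the time domain. The source does not say which filter bank is used, and practical systems often use a decimating filter bank instead. The function names are hypothetical, and the naive O(N^2) transform is meant only for short demonstration signals.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse transform; returns the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def split_into_subbands(signal, n_bands):
    """Return n_bands real signals, each holding one equal slice of the spectrum."""
    n = len(signal)
    spec = dft(signal)
    half = n // 2 + 1  # number of non-negative frequency bins
    edges = [round(b * half / n_bands) for b in range(n_bands + 1)]
    subbands = []
    for b in range(n_bands):
        masked = [0j] * n
        for k in range(edges[b], edges[b + 1]):
            masked[k] = spec[k]
            # copy the mirrored negative-frequency bin so the band stays real-valued
            if 0 < k and k != n - k:
                masked[n - k] = spec[n - k]
        subbands.append(idft(masked))
    return subbands


# Hypothetical test signal: two tones that fall into different subbands.
signal = [math.sin(2 * math.pi * 3 * t / 64) + 0.5 * math.sin(2 * math.pi * 20 * t / 64)
          for t in range(64)]
bands = split_into_subbands(signal, 4)
```

Because every frequency bin is assigned to exactly one band, adding the band signals sample by sample gives back the input signal in this sketch.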

Here, the speech synthesis system according to the embodiment of the present invention may be applied to electronic devices with lower performance, such as watches, earphones, vehicles and other devices with offline speech synthesis function.

In operation 102, the speech synthesis system performs feature extraction from the original speech signal wav, resulting in mel-frequency spectrum (mel spectrum) features.

The mel spectrum simulates the human ear's suppression of high-frequency signals: the linear spectrum obtained after the fast Fourier transform (sfft) is processed with a bank of triangular filters to obtain low-dimensional features. For this reason, the mel spectrum is widely used in speech feature extraction.
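The triangular filter bank can be sketched as follows. The source gives no filter parameters, so the mel-scale formula used here (the common 2595 * log10(1 + f / 700) variant), the function names and all sizes are assumptions chosen for illustration.

```python
import math


def hz_to_mel(f):
    """Common mel-scale formula (an assumption; the source does not name one)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return n_mels triangular filters over the n_fft // 2 + 1 linear bins."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 points equally spaced in mel: left edge, filter centers, right edge
    hz_points = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_freqs = [k * sample_rate / n_fft for k in range(n_bins)]
    filters = []
    for m in range(1, n_mels + 1):
        lo, center, hi = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        row = []
        for f in bin_freqs:
            rising = (f - lo) / (center - lo)
            falling = (hi - f) / (hi - center)
            row.append(max(0.0, min(rising, falling)))
        filters.append(row)
    return filters


def apply_filterbank(filters, power_spectrum):
    """Project a linear power spectrum onto the low-dimensional mel features."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in filters]


fb = mel_filterbank(n_mels=8, n_fft=256, sample_rate=16000)
mel_features = apply_filterbank(fb, [1.0] * 129)
```

The filters widen toward high frequencies, so high-frequency detail is averaged more coarsely than low-frequency detail, and 129 linear bins are reduced to 8 features in this example.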

In operations 103-104, the speech synthesis system generates a predicted sample point of each sub-band frequency signal of the n sub-band frequency signals in a linear prediction mode and a non-linear prediction mode (i.e. neural network prediction mode) according to the extracted mel-frequency spectrum characteristics.

Specifically, referring to fig. 3, operation 103 includes: operation 1031, performing linear prediction on the n subband frequency signals according to the extracted mel frequency spectrum features to obtain a linear prediction value corresponding to each subband frequency signal in the n subband frequency signals; operation 1032, performing neural network prediction on the n subband frequency signals by using the extracted mel frequency spectrum features to obtain a residual error corresponding to each subband frequency signal in the n subband frequency signals; and in operation 1033, the linear prediction value and the residual error corresponding to each of the n subband frequency signals are added correspondingly to obtain a predicted sampling point of each of the n subband frequency signals.
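Operation 1033 can be sketched as below, assuming the linear prediction coefficients and the residual errors are already available. In the described method the residuals would come from the neural network; here they are plain numbers. All function and variable names are hypothetical.

```python
def linear_prediction(history, lpc):
    """Weighted sum of past samples; history[-1] is the most recent sample."""
    return sum(a * history[-k] for k, a in enumerate(lpc, start=1) if k <= len(history))


def predict_step(histories, lpcs, residuals):
    """One generation step: returns one new predicted sample per subband."""
    samples = []
    for history, lpc, residual in zip(histories, lpcs, residuals):
        sample = linear_prediction(history, lpc) + residual  # linear value + residual error
        history.append(sample)  # the new sample becomes context for the next step
        samples.append(sample)
    return samples


# Two subbands with made-up histories, coefficients and residuals.
histories = [[1.0, 2.0], [0.5]]
step_output = predict_step(histories, [[0.5, 0.25], [1.0]], [0.1, -0.2])
```

Each call produces one sample for every subband, which is why a system with n subbands emits n sampling points per step.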

In operation 1031, performing linear prediction on the n subband frequency signals according to the extracted mel spectral features, including: converting the extracted mel frequency spectrum features into linear spectrums; equally dividing the linear spectrum into n sub-band linear spectra; performing linear prediction on the n sub-band linear spectrums to obtain a linear prediction coefficient corresponding to each sub-band linear spectrum; and determining a linear prediction value corresponding to each sub-band frequency signal in the n sub-band frequency signals according to the linear prediction coefficient.

Specifically, referring to FIG. 2, the speech synthesis system first converts the extracted mel spectrum features into a linear spectrum; the linear spectrum is then equally divided into n sub-band linear spectra, denoted sub_linear_1, sub_linear_2, …, sub_linear_n; linear prediction is then performed on the n sub-band linear spectra to obtain the linear prediction coefficients (LPC) corresponding to each sub-band linear spectrum, denoted sub_LPC_1, sub_LPC_2, …, sub_LPC_n; finally, the linear prediction value corresponding to each of the n sub-band frequency signals is determined according to the linear prediction coefficients and denoted sub_wav_1, sub_wav_2, …, sub_wav_n.
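One possible way to obtain per-subband linear prediction coefficients from a linear spectrum is sketched below. The source does not say how the coefficients are computed; this sketch assumes the common route of converting each sub-band power spectrum into autocorrelation lags and solving with the Levinson-Durbin recursion. It starts from an already converted linear power spectrum, treats each slice as the complete spectrum of its subband signal, and uses hypothetical function names throughout.

```python
import math


def split_equally(linear_spectrum, n_bands):
    """Cut a list of linear-frequency power bins into n_bands equal slices."""
    width = len(linear_spectrum) // n_bands
    return [linear_spectrum[b * width:(b + 1) * width] for b in range(n_bands)]


def autocorrelation_from_power(power, order):
    """Approximate autocorrelation lags as a cosine transform of the power bins."""
    bins = len(power)
    return [sum(p * math.cos(math.pi * lag * (k + 0.5) / bins)
                for k, p in enumerate(power)) / bins
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Return a[1..order] such that x_hat[t] = sum_k a[k] * x[t - k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)  # remaining prediction error energy
    return a[1:]


def subband_lpc(linear_spectrum, n_bands, order):
    """One coefficient list per sub-band linear spectrum."""
    return [levinson_durbin(autocorrelation_from_power(band, order), order)
            for band in split_equally(linear_spectrum, n_bands)]


coeffs = subband_lpc([1.0] * 64, n_bands=4, order=2)
```

A flat spectrum carries no predictable structure, so the coefficients in this example come out as approximately zero; a spectrum with strong peaks would give non-trivial coefficients.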

Here, a speech sample can be approximated by a linear combination of several past speech samples. Minimizing the mean square error between the actual samples and these approximations determines a unique set of prediction coefficients, which are the linear prediction coefficients.
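In symbols, this can be written as follows. The notation (prediction order p, coefficients a_k, residual e(t)) is introduced here for illustration and does not appear in the source.

```latex
\hat{x}(t) = \sum_{k=1}^{p} a_k \, x(t-k), \qquad
e(t) = x(t) - \hat{x}(t), \qquad
(a_1, \dots, a_p) = \arg\min_{a_1, \dots, a_p} \sum_{t} e(t)^2
```

Each sample is therefore recovered as x(t) = \hat{x}(t) + e(t), the sum of a linear prediction value and a residual error, which is the addition performed in operation 1033.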

At operation 1032, performing neural network prediction on the n subband frequency signals using the extracted mel-frequency spectral features, including: carrying out model training by utilizing the Mel frequency spectrum sample and the n subband frequency signals to obtain a neural network model; and taking the extracted Mel frequency spectrum characteristics as the input of the neural network model, and performing neural network prediction on the n sub-band frequency signals.

Referring to FIG. 2, the speech synthesis system first performs model training using mel spectrum samples and the n subband frequency signals to obtain a neural network model (neural model); the extracted mel spectrum features are then taken as the input of the neural network model, neural network prediction is performed on the n subband frequency signals, and the residual error corresponding to each of the n subband frequency signals is output, denoted out_1, out_2, …, out_n.

In operation 104, the speech synthesis system generates a full band speech synthesis signal (full band wav) using the n predicted sample points. Specifically, n subband speech synthesis signals are generated according to a linear prediction value and a residual error corresponding to each subband frequency signal in the n subband frequency signals; and combining the n sub-band voice synthesis signals according to frequency to obtain a voice synthesis signal corresponding to the original voice signal.
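Combining by frequency can be sketched as below under a strong simplifying assumption: the n sub-band synthesis signals are kept at the full sampling rate and occupy disjoint frequency ranges, so that combination reduces to sample-wise addition. If the subbands were downsampled, upsampling and synthesis filtering would be needed first, which the source does not detail. The names and test tones are hypothetical.

```python
import math


def combine_subbands(subband_signals):
    """Add equal-length, full-rate subband signals sample by sample."""
    length = len(subband_signals[0])
    if any(len(band) != length for band in subband_signals):
        raise ValueError("all subband signals must have the same length")
    return [sum(band[t] for band in subband_signals) for t in range(length)]


# Two made-up band-limited signals: a low-frequency and a high-frequency tone.
low = [math.sin(2 * math.pi * 2 * t / 32) for t in range(32)]
high = [0.3 * math.sin(2 * math.pi * 12 * t / 32) for t in range(32)]
full_band = combine_subbands([low, high])
```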

The speech synthesis method of the embodiment of the present invention first decomposes an original speech signal by frequency into n sub-band frequency signals, where n is a positive integer greater than or equal to 2; then extracts mel spectrum features from the original speech signal; then generates a predicted sampling point of each of the n sub-band frequency signals according to the extracted mel spectrum features; and finally synthesizes the n sub-band frequency signals using the predicted sampling point of each sub-band frequency signal to obtain a speech synthesis signal corresponding to the original speech signal. By introducing the sub-band (subband) technique, the invention thus enables the speech synthesis system to output multiple sampling points at each step instead of one, which effectively accelerates the prediction speed of the speech synthesis system.

Similarly, based on the above speech synthesis method, an embodiment of the present invention further provides a computer-readable storage medium in which a program is stored. When the program is executed by a processor, the processor is caused to perform at least the following operations: operation 101, decomposing an original speech signal by frequency into n subband frequency signals; operation 102, extracting mel spectrum features from the original speech signal; operation 103, generating a predicted sampling point of each of the n subband frequency signals according to the extracted mel spectrum features; and operation 104, synthesizing the n subband frequency signals by using the predicted sampling point of each subband frequency signal in the n subband frequency signals, so as to obtain a speech synthesis signal corresponding to the original speech signal.

Further, based on the above-mentioned speech synthesis method, an embodiment of the present invention further provides a speech synthesis apparatus, as shown in fig. 4, where the apparatus 40 includes: a signal decomposition module 401, configured to decompose an original speech signal into n subband frequency signals according to frequency, where a value of n is a positive integer greater than or equal to 2; a feature extraction module 402, configured to extract mel-frequency spectrum features from the original speech signal; a sampling point generating module 403, configured to generate a predicted sampling point of each of the n subband frequency signals according to the extracted mel frequency spectrum feature; a signal synthesizing module 404, configured to synthesize the n subband frequency signals by using the predicted sampling point of each subband frequency signal in the n subband frequency signals, so as to obtain a speech synthesis signal corresponding to the original speech signal.

According to an embodiment of the present invention, the sampling point generating module 403 is specifically configured to perform linear prediction on the n subband frequency signals according to the extracted mel frequency spectrum feature, so as to obtain a linear prediction value corresponding to each subband frequency signal in the n subband frequency signals; performing neural network prediction on the n sub-band frequency signals by using the extracted mel frequency spectrum characteristics to obtain a residual error corresponding to each sub-band frequency signal in the n sub-band frequency signals; and correspondingly adding the linear predicted value and the residual error corresponding to each sub-band frequency signal in the n sub-band frequency signals to obtain a predicted sampling point of each sub-band frequency signal in the n sub-band frequency signals.

According to an embodiment of the present invention, the signal synthesis module 404 is specifically configured to generate n subband speech synthesis signals according to a linear prediction value and a residual error corresponding to each subband frequency signal in the n subband frequency signals; and combining the n sub-band voice synthesis signals according to frequency to obtain a voice synthesis signal corresponding to the original voice signal.

According to an embodiment of the present invention, the sampling point generating module 403 is further configured to convert the extracted mel-frequency spectrum feature into a linear spectrum; equally dividing the linear spectrum into n sub-band linear spectra; performing linear prediction on the n sub-band linear spectrums to obtain a linear prediction coefficient corresponding to each sub-band linear spectrum; and determining a linear prediction value corresponding to each sub-band frequency signal in the n sub-band frequency signals according to the linear prediction coefficient.

According to an embodiment of the present invention, the sampling point generating module 403 is further configured to: perform model training by using mel frequency spectrum samples and the n subband frequency signals, so as to obtain a neural network model; and take the extracted mel frequency spectrum features as the input of the neural network model, so as to perform neural network prediction on the n subband frequency signals.
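The training and prediction steps can be illustrated with a toy stand-in. The sketch below assumes a one-hidden-layer network trained by plain gradient descent on a mean squared error, mapping one mel feature vector to one residual per sub-band; the embodiment does not specify the network architecture, loss, or optimizer, and the function names and hyperparameters are hypothetical.

```python
import numpy as np


def train_residual_model(mel, residuals, hidden=16, lr=0.1, epochs=500, seed=0):
    """Fit a tiny MLP: mel features (N, F) -> residuals for n sub-bands (N, n)."""
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((mel.shape[1], hidden)) * 0.1
    b1 = np.zeros(hidden)
    w2 = rng.standard_normal((hidden, residuals.shape[1])) * 0.1
    b2 = np.zeros(residuals.shape[1])
    losses = []
    for _ in range(epochs):
        h = np.tanh(mel @ w1 + b1)               # hidden activations
        err = h @ w2 + b2 - residuals            # prediction error
        losses.append(float(np.mean(err ** 2)))
        g_pred = 2.0 * err / err.size            # gradient of the mean squared error
        g_z = (g_pred @ w2.T) * (1.0 - h ** 2)   # back-propagate through tanh
        w2 -= lr * (h.T @ g_pred)
        b2 -= lr * g_pred.sum(axis=0)
        w1 -= lr * (mel.T @ g_z)
        b1 -= lr * g_z.sum(axis=0)
    return (w1, b1, w2, b2), losses


def predict_residuals(params, mel):
    """Use the trained model: mel features in, one residual per sub-band out."""
    w1, b1, w2, b2 = params
    return np.tanh(mel @ w1 + b1) @ w2 + b2
```

Here the training targets would be residuals derived from the n sub-band frequency signals, for example each true sample minus its linear prediction value; that pairing is an assumption of this sketch.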

It should be noted here that the above description of the speech synthesis apparatus embodiment is similar to the description of the method embodiments shown in figs. 1 to 3 and has similar beneficial effects, so it is not repeated. For technical details not disclosed in the speech synthesis apparatus embodiment of the present invention, refer to the description of the method embodiments shown in figs. 1 to 3 of the present invention.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical functional division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.

The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

Those of ordinary skill in the art will understand that all or part of the steps for implementing the method embodiments may be completed by hardware related to program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.

Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on this understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.

The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can easily conceive of within the technical scope disclosed by the present invention shall be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
