Speech synthesis method

Document No.: 989494. Publication date: 2020-11-06.

Abstract: This technology, a speech synthesis method, was designed by 袁熹 on 2020-07-14. The invention discloses a speech synthesis method that introduces the spectral-gradient Sobel operator into the loss function design of a speech synthesis model, thereby improving the ability of the feature prediction model to capture detail in speech synthesis; the invention improves the sound quality of synthesized speech.

1. A speech synthesis method, comprising the steps of:

step 1, selecting a Mel spectrum or a linear spectrum as the acoustic feature output by a feature prediction model, and passing the text through the feature prediction model in a forward calculation to obtain a predicted spectrum $\hat{S}$;

step 2, calculating the Sobel operator $\hat{S}_{sobel}$ of $\hat{S}$;

step 3, calculating the spectrum $S$ of the real audio;

step 4, calculating the Sobel operator $S_{sobel}$ of $S$;

step 5, calculating the mean square error $\mathrm{MSE}(\hat{S}, S)$ of $\hat{S}$ and $S$;

step 6, calculating the mean square error $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ of the Sobel operators of $\hat{S}$ and $S$;

step 7, determining a balance coefficient $\alpha$;

step 8, constructing the following loss function:

$$loss = \alpha \cdot \mathrm{MSE}(\hat{S}, S) + (1 - \alpha) \cdot \mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$$

wherein the loss consists of two parts, the mean square error from step 5 and the mean square error of the spectral Sobel operators from step 6, and $\alpha$ is the coefficient that balances the two parts;

step 9, backpropagating the loss calculated in step 8 and updating the parameters of the feature prediction model;

step 10, repeating steps 1 to 9 to train the feature prediction model until it converges, finally obtaining a fully trained feature prediction model;

step 11, during speech synthesis, passing the text to the feature prediction model, which calculates and outputs the predicted spectrum $\hat{S}$ of step 1, and then inputting $\hat{S}$ to a vocoder to obtain the sound.

2. A speech synthesis method according to claim 1, characterized in that, in step 2, calculating $\hat{S}_{sobel}$ refers to the Sobel feature calculation of $\hat{S}$, including the x direction and the y direction; Sobel is derived from image processing; an image is in fact a two-dimensional array, and a spectrum of acoustic features, like an image, can be understood as a two-dimensional array, where the x direction refers to the horizontal direction of the array and the y direction refers to the vertical direction of the array.

3. A speech synthesis method according to claim 1, characterized in that, in step 3, the spectrum calculation of the real audio refers to calculating the spectrum of the target audio, which is a linear spectrum or a Mel spectrum and is consistent with the spectrum selected in step 1.

4. A speech synthesis method according to claim 1, characterized in that, in step 4, calculating $S_{sobel}$ refers to the Sobel feature calculation of $S$, including the x direction and the y direction.

5. A speech synthesis method according to claim 1, characterized in that, in step 7, the balance coefficient $\alpha$ is in the range 0 to 1.

Technical Field

The invention relates to the technical field of speech, and in particular to a speech synthesis method.

Background

Speech synthesis technology gives computers (or other terminal devices) the ability to speak like a human; it is a typical interdisciplinary field. Text-to-speech (TTS) technology belongs to speech synthesis and converts text generated by a computer or supplied from outside into intelligible, fluent speech output.

At present, speech synthesis methods are commonly evaluated by assessing the sound quality of the speech they produce, which makes sound quality central to research on speech synthesis. In practice, speech is first synthesized by a given method; its sound quality is then assessed; the assessment reflects the quality of the synthesis method; and the factors that degrade the sound quality are identified and modified so that the method synthesizes speech with better sound quality. A synthesis algorithm with high sound quality is therefore particularly important for advancing speech synthesis technology.

The mainstream speech synthesis approach at present is parametric modeling, which generally comprises two parts, a feature prediction model and a vocoder model, trained separately. The feature prediction model maps the input text sequence to acoustic features; the vocoder model receives these features and restores them to actual speech. Before a model is trained, a loss function (also called an objective function) is defined to express the difference between the prediction and the real sample, and this difference is used to adjust the model parameters. The design of the loss function has a large impact on model training.

The acoustic features and corresponding loss functions commonly used for feature prediction models are as follows:

(1) The acoustic feature is the fundamental frequency (F0), and the loss is the mean absolute error (MAE), calculated as the L1-norm distance. The duration of each phoneme is calculated first, and then the distance between the predicted F0 and the F0 of the corresponding real audio is calculated:

$$loss = \mathrm{MAE}(\hat{F}_0, F_0)$$

(2) The acoustic feature is the linear spectrum, and the loss is the MAE or the mean square error (MSE), calculated as the L1-norm or L2-norm distance. The distance between the predicted linear spectrum $\hat{S}_{linear}$ and the real linear spectrum $S_{linear}$ is calculated:

$$loss = \mathrm{MAE}(\hat{S}_{linear}, S_{linear}) \quad \text{or} \quad loss = \mathrm{MSE}(\hat{S}_{linear}, S_{linear})$$

(3) The acoustic feature is the Mel spectrum (Mel spectrogram), and the loss is the MAE or the MSE, calculated as the L1 or L2 distance. The distance between the predicted Mel spectrum $\hat{S}_{mel}$ and the real Mel spectrum $S_{mel}$ is calculated:

$$loss = \mathrm{MAE}(\hat{S}_{mel}, S_{mel}) \quad \text{or} \quad loss = \mathrm{MSE}(\hat{S}_{mel}, S_{mel})$$

(4) The acoustic feature is a combination of the above features, and the loss is designed as a combination of the above losses.

The paper [Deep Voice: Real-time Neural Text-to-Speech] discloses a loss design that uses the MAE loss of F0. F0 carries the most energy, and fitting F0 correctly can largely restore the timbre of the target speaker. However, the mid- and high-frequency parts of speech carry its details, which are also related to timbre; a loss design based on F0 alone ignores the mid- and high-frequency part and can seriously reduce the sound quality of the synthesized speech.

In speech signals, the low-frequency part has high energy and the mid- and high-frequency part has low energy. With an MAE or MSE loss, the model is bound to fit mainly the low-frequency part (because the large low-frequency energy produces large gradients); the mid- and high-frequency part becomes muddy, the mid- and high-frequency texture visible on a spectrogram is missing, and the synthesized sound is dull. In addition, the MAE loss can sharpen the spectrum, giving the synthesized sound a mechanical quality.

The paper [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions] discloses a loss design that uses the MSE loss of the Mel spectrogram. As described above, mid- and high-frequency details still cannot be captured; moreover, the MSE loss makes the spectrogram "blurred", so the synthesized sound has a "muddy" quality.
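The dominance of the low-frequency band under a plain MAE or MSE loss can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration: the toy 4 x 3 "spectrum", the energy levels of the two bands, and the uniform 10 % relative error are invented and are not data from the patent.

```python
import numpy as np


def mae(pred, target):
    """Mean absolute error (L1 distance averaged over all bins)."""
    return float(np.mean(np.abs(pred - target)))


def mse(pred, target):
    """Mean square error (squared L2 distance averaged over all bins)."""
    return float(np.mean((pred - target) ** 2))


# Toy "spectrum": 4 frequency bins x 3 frames (hypothetical values).
# Rows 0-1 stand in for the high-energy low-frequency band,
# rows 2-3 for the low-energy mid/high-frequency band.
target = np.array([[10.0, 10.0, 10.0],
                   [10.0, 10.0, 10.0],
                   [0.1, 0.1, 0.1],
                   [0.1, 0.1, 0.1]])

# Assumed prediction: every bin is off by the same relative error (10 %).
pred = 1.1 * target

# Share of the total squared error that comes from the low-frequency rows.
sq_err = (pred - target) ** 2
low_share = float(sq_err[:2].sum() / sq_err.sum())

print(f"MAE = {mae(pred, target):.5f}, MSE = {mse(pred, target):.5f}")
print(f"low-frequency share of squared error = {low_share:.4%}")
```

Under these assumptions almost all of the squared error comes from the low-frequency rows, even though every bin has the same relative error, which is why a loss of this kind gives the mid- and high-frequency detail very little weight.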

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the defects of the prior art by providing a speech synthesis method that improves the sound quality of synthesized speech.

The invention adopts the following technical scheme for solving the technical problems:

the invention provides a voice synthesis method, which comprises the following steps:

step 1, selecting a Mel spectrum or a linear spectrum as the acoustic feature output by a feature prediction model, and passing the text through the feature prediction model in a forward calculation to obtain a predicted spectrum $\hat{S}$;

step 2, calculating the Sobel operator $\hat{S}_{sobel}$ of $\hat{S}$;

step 3, calculating the spectrum $S$ of the real audio;

step 4, calculating the Sobel operator $S_{sobel}$ of $S$;

step 5, calculating the mean square error $\mathrm{MSE}(\hat{S}, S)$ of $\hat{S}$ and $S$;

step 6, calculating the mean square error $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ of the Sobel operators of $\hat{S}$ and $S$;

step 7, determining a balance coefficient $\alpha$;

step 8, constructing the following loss function:

$$loss = \alpha \cdot \mathrm{MSE}(\hat{S}, S) + (1 - \alpha) \cdot \mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$$

wherein the loss consists of two parts, the mean square error $\mathrm{MSE}(\hat{S}, S)$ from step 5 and the mean square error $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ of the spectral Sobel operators from step 6, and $\alpha$ is the coefficient that balances the two parts;

step 9, backpropagating the loss calculated in step 8 and updating the parameters of the feature prediction model;

step 10, repeating steps 1 to 9 to train the feature prediction model until it converges, finally obtaining a fully trained feature prediction model;

step 11, during speech synthesis, passing the text to the feature prediction model, which calculates and outputs the predicted spectrum $\hat{S}$ of step 1, and then inputting $\hat{S}$ to a vocoder to obtain the sound.

As a further optimization scheme of the speech synthesis method of the present invention, in step 2, calculating $\hat{S}_{sobel}$ refers to the Sobel feature calculation of $\hat{S}$, including the x direction and the y direction; Sobel is derived from image processing; an image is in fact a two-dimensional array, and a spectrum of acoustic features, like an image, can be understood as a two-dimensional array, where the x direction refers to the horizontal direction of the array and the y direction refers to the vertical direction of the array.
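The x- and y-direction Sobel features of a spectrum treated as a two-dimensional array might be computed as in the following sketch. The text does not specify the kernel size, the border handling, or which array axis holds frequency and which holds time; the standard 3x3 Sobel kernels, edge-replicating padding, and the names `correlate3x3` and `sobel_features` are assumptions introduced here for illustration.

```python
import numpy as np

# Standard 3x3 Sobel kernels (assumed; the text fixes neither size nor padding).
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])   # responds to change along x (columns)
SOBEL_Y = SOBEL_X.T                      # responds to change along y (rows)


def correlate3x3(array, kernel):
    """Cross-correlate a 2-D array with a 3x3 kernel, replicating the border."""
    padded = np.pad(array, 1, mode="edge")
    rows, cols = array.shape
    out = np.zeros((rows, cols), dtype=float)
    for di in range(3):
        for dj in range(3):
            # Shifted view of the padded array, weighted by one kernel tap.
            out += kernel[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out


def sobel_features(spectrum):
    """Return the x and y Sobel responses stacked as shape (2, rows, cols)."""
    spectrum = np.asarray(spectrum, dtype=float)
    return np.stack([correlate3x3(spectrum, SOBEL_X),
                     correlate3x3(spectrum, SOBEL_Y)])


# Usage: a ramp that increases along x (columns) and is constant along y (rows).
ramp = np.tile(np.arange(5.0), (4, 1))
gx, gy = sobel_features(ramp)
print(gx)  # non-zero: the array changes along x
print(gy)  # all zeros: the array does not change along y
```

Returning the two directions as separate channels keeps both gradients available to whatever distance is applied afterwards; combining them into a single gradient magnitude would be another possible choice that the text does not rule out.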

As a further optimization scheme of the speech synthesis method according to the present invention, in step 3, the spectrum calculation of the real audio refers to calculating the spectrum of the target audio, which is a linear spectrum or a Mel spectrum and is consistent with the spectrum selected in step 1.

As a further optimization scheme of the speech synthesis method of the present invention, in step 4, calculating $S_{sobel}$ refers to the Sobel feature calculation of $S$, including the x direction and the y direction.

As a further optimization scheme of the speech synthesis method according to the present invention, in step 7, the balance coefficient $\alpha$ ranges from 0 to 1.

Compared with the prior art, the invention adopting the technical scheme has the following technical effects:

(1) the spectral gradient Sobel operator is innovatively introduced into the loss function design of the speech synthesis model, so that the detail description capability of the feature prediction model in speech synthesis is improved;

(2) the invention improves the tone quality of voice synthesis.

Drawings

Fig. 1 shows the calculation process of the loss function combined with the Sobel operator.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

Sobel is a gradient feature operator derived from image processing that describes the texture features of an image. The lack of mid- and high-frequency detail is an important cause of degraded sound quality in synthesized speech. In this technology, the Sobel operator is introduced into the loss design of the feature prediction model so that the model attends to the details of the acoustic features, thereby improving the sound quality of the synthesized speech.

The method for determining the loss function of the speech synthesis model by combining the Sobel operator comprises the following steps:

step 1, selecting a Mel spectrogram or a linear spectrogram as the acoustic feature, and obtaining the spectrum output $\hat{S}$ through forward calculation;

step 2, calculating the Sobel operator $\hat{S}_{sobel}$ of $\hat{S}$;

step 3, calculating the spectrum $S$ of the real audio;

step 4, calculating the Sobel operator $S_{sobel}$ of $S$;

step 5, calculating the MSE of the two spectra, $\mathrm{MSE}(\hat{S}, S)$;

step 6, calculating the MSE of the Sobel operators of the two spectra, $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$;

step 7, determining a balance coefficient $\alpha$;

step 8, constructing the following loss:

$$loss = \alpha \cdot \mathrm{MSE}(\hat{S}, S) + (1 - \alpha) \cdot \mathrm{MSE}(\hat{S}_{sobel}, S_{sobel}) \qquad (1)$$

The forward process of step 1 refers to the spectrum output $\hat{S}$ during model training. Considering that existing vocoder models can restore sound well from a linear spectrum or a Mel spectrum, the feature selected for the feature prediction model can be a linear spectrum or a Mel spectrum;

in step 2, calculating $\hat{S}_{sobel}$ refers to the Sobel feature calculation of $\hat{S}$, including the x direction and the y direction;

in step 3, the spectrum calculation of the real audio refers to calculating the spectrum of the target audio, which may be a linear spectrum or a Mel spectrum but must be consistent with the spectrum selected in step 1;

in step 4, calculating $S_{sobel}$ refers to the Sobel feature calculation of $S$, including the x direction and the y direction;

in step 5, the MSE of the spectra is calculated;

in step 6, the MSE of the spectral Sobel operators is calculated;

in step 7, the balance coefficient is used to control the weights of the two parts, and its range is between 0 and 1;

in step 8, the constructed loss function is the final loss function, which is composed of two parts, the MSE of the spectra and the MSE of the Sobel operators, with the balance coefficient controlling the weights of the two parts.
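The loss built from the MSE of the spectra and the MSE of their Sobel features, weighted by the balance coefficient, can be sketched in NumPy as follows. This is a minimal illustration and not the patented implementation: the 3x3 kernels, the edge padding, averaging the x and y responses together inside one MSE, the default `alpha=0.5`, and the function names are all assumptions, and the random arrays merely stand in for spectra.

```python
import numpy as np

# Assumed 3x3 Sobel kernels; the text does not fix kernel size or padding.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T


def sobel_features(spectrum):
    """x and y Sobel responses of a 2-D spectrum, shape (2, rows, cols)."""
    spectrum = np.asarray(spectrum, dtype=float)
    rows, cols = spectrum.shape
    padded = np.pad(spectrum, 1, mode="edge")  # replicate border values
    out = np.zeros((2, rows, cols))
    for k, kernel in enumerate((SOBEL_X, SOBEL_Y)):
        for di in range(3):
            for dj in range(3):
                out[k] += kernel[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out


def mse(a, b):
    """Mean square error over all elements."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))


def sobel_loss(pred_spec, true_spec, alpha=0.5):
    """alpha * MSE(spectra) + (1 - alpha) * MSE(Sobel features)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    spec_mse = mse(pred_spec, true_spec)                 # MSE of the spectra
    sobel_mse = mse(sobel_features(pred_spec),           # MSE of the Sobel
                    sobel_features(true_spec))           # features
    return alpha * spec_mse + (1.0 - alpha) * sobel_mse  # weighted sum


# Usage with random stand-in "spectra" (80 bins x 20 frames); not real audio.
rng = np.random.default_rng(0)
true_spec = rng.random((80, 20))
pred_spec = true_spec + 0.05 * rng.standard_normal((80, 20))
loss = sobel_loss(pred_spec, true_spec, alpha=0.5)
print(loss)
```

With `alpha=1.0` the function reduces to the plain spectral MSE, and with `alpha=0.0` it uses only the Sobel term. In actual training the same computation would have to be written in a framework with automatic differentiation so that the loss can be backpropagated through the feature prediction model.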

The invention focuses on the feature prediction model of speech synthesis and on a loss design method based on the Sobel operator; the spectral-gradient Sobel operator is innovatively introduced into the loss function design of the speech synthesis model.

FIG. 1 is a schematic diagram of the loss function. The feature prediction model performs a forward calculation to obtain the predicted spectrum (component 101, corresponding to $\hat{S}$ in formula (1)), from which its Sobel operator is calculated (component 102, corresponding to $\hat{S}_{sobel}$ in formula (1)). The real audio is processed to obtain the real audio spectrum (component 103, corresponding to $S$ in formula (1)), from which the Sobel operator of the real audio spectrum is further calculated (component 104, corresponding to $S_{sobel}$ in formula (1)). The MSE of components 101 and 103 is calculated to obtain the spectral MSE (component 105, corresponding to $\mathrm{MSE}(\hat{S}, S)$ in formula (1)), and the MSE of components 102 and 104 is calculated to obtain the MSE of the spectral Sobel operators (component 106, corresponding to $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ in formula (1)). Given the balance coefficient $\alpha$, the dot product of components 105 and 106 with $[\alpha, (1 - \alpha)]$ gives the final loss (component 107).

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.
