Speech synthesis model training and speech synthesis method, device and speech synthesizer

Document No. 702072 · Published 2021-04-13

Note: This technology, "Speech synthesis model training and speech synthesis method, device and speech synthesizer", was designed and created by 马达标 (Ma Dabiao) on 2020-12-24. Its main content is as follows: The invention provides a speech synthesis model training method, a speech synthesis method, corresponding devices, and a speech synthesizer. The speech synthesis model used is a full convolution speech synthesis model, which can perform parallel processing on multiple pieces of data to be synthesized and thereby improves speech synthesis efficiency. In the process of training the full convolution speech synthesis model, the full convolution speech synthesis model to be trained is called to process the acoustic feature training sample to obtain a discrete speech synthesis result; the discrete speech synthesis result is converted into a continuous speech synthesis result, and a loss function is then obtained from the continuous speech synthesis result. Only then can the loss function be used to adjust the model parameters of the full convolution speech synthesis model to be trained, i.e., to optimize them. After this model parameter optimization, the finally obtained trained full convolution speech synthesis model improves speech synthesis quality.

1. A method for training a speech synthesis model, the method comprising:

acquiring an acoustic feature training sample;

calling a full convolution speech synthesis model to be trained, and processing the acoustic feature training sample to obtain a discrete speech synthesis result;

converting the discrete speech synthesis result into a continuous speech synthesis result;

comparing the continuous speech synthesis result with a reference output speech synthesis result corresponding to the acoustic feature training sample, and obtaining a loss function of the full convolution speech synthesis model to be trained by using the comparison result;

adjusting model parameters of the full convolution speech synthesis model to be trained by utilizing the loss function;

taking the parameter-adjusted full convolution speech synthesis model as the full convolution speech synthesis model to be trained, and returning to the step of calling the full convolution speech synthesis model to be trained and processing the acoustic feature training sample, until a model training termination condition is met;

and taking the model parameters obtained when the model training termination condition is met as the model parameters of the full convolution speech synthesis model, to obtain the trained full convolution speech synthesis model.

2. The method of claim 1, wherein converting the discrete speech synthesis result into a continuous speech synthesis result comprises:

obtaining a uniformly distributed sampling result that obeys a uniform distribution;

obtaining a speech synthesis probability distribution result that satisfies a discrete multinomial distribution by using the discrete speech synthesis result and the uniformly distributed sampling result;

and processing the speech synthesis probability distribution result with a continuity function to obtain a continuous speech synthesis result.

3. The method of claim 1, wherein the loss function comprises at least: a short-time Fourier transform (STFT) loss function.

4. The method of claim 3, wherein the adjusting the model parameters of the full-convolution speech synthesis model to be trained using the loss function comprises:

obtaining a continuous speech synthesis spectrum corresponding to the continuous speech synthesis result and a reference output speech synthesis spectrum corresponding to the reference output speech synthesis result from the loss function;

comparing the continuous speech synthesis frequency spectrum with the reference output speech synthesis frequency spectrum, and obtaining the model gradient of the full convolution speech synthesis model to be trained by using the comparison result;

and adjusting the model parameters of the full convolution speech synthesis model to be trained along the descending direction of the model gradient.

5. The method according to claim 1, wherein the model training termination condition comprises at least:

the number of model training iterations reaches a preset number, or the model training time reaches a preset duration, or the loss function satisfies a preset condition.

6. A method of speech synthesis, the method comprising:

acquiring acoustic features corresponding to the speech to be synthesized;

calling the full convolution speech synthesis model obtained by training according to the speech synthesis model training method of any one of claims 1 to 5, and processing the acoustic features corresponding to the speech to be synthesized to obtain a speech synthesis result.

7. A speech synthesis model training apparatus, characterized in that the apparatus comprises:

the sample acquisition unit is used for acquiring an acoustic feature training sample;

the sample processing unit is used for calling a full convolution speech synthesis model to be trained, and processing the acoustic feature training sample to obtain a discrete speech synthesis result;

the conversion unit is used for converting the discrete speech synthesis result into a continuous speech synthesis result;

a loss function obtaining unit, configured to compare the continuous speech synthesis result with a reference output speech synthesis result corresponding to an acoustic feature training sample, and obtain a loss function of the full convolution speech synthesis model to be trained by using the comparison result;

the parameter adjusting unit is used for adjusting the model parameters of the full convolution speech synthesis model to be trained by utilizing the loss function; taking the parameter-adjusted full convolution speech synthesis model as the full convolution speech synthesis model to be trained, and returning to the step, executed by the sample processing unit, of calling the full convolution speech synthesis model to be trained and processing the acoustic feature training sample, until a model training termination condition is met; and taking the model parameters obtained when the model training termination condition is met as the model parameters of the full convolution speech synthesis model, to obtain the trained full convolution speech synthesis model.

8. A speech synthesis apparatus, characterized in that the apparatus comprises:

the acoustic feature acquisition unit is used for acquiring acoustic features corresponding to the speech to be synthesized;

an acoustic feature processing unit, configured to invoke the full convolution speech synthesis model obtained by training according to the speech synthesis model training method of any one of claims 1 to 5, and process the acoustic features corresponding to the speech to be synthesized, so as to obtain a speech synthesis result.

9. A speech synthesizer, characterized in that it comprises at least a speech synthesis apparatus according to claim 8.

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a speech synthesis model training method, a speech synthesis method, corresponding devices, and a speech synthesizer.

Background

With the development of artificial intelligence technology, speech synthesis technology has received more and more attention. It can be applied in fields such as human-computer interaction and converting text into natural language output. At present, recurrent neural network models are commonly used for speech synthesis, but this approach suffers from low synthesis efficiency.

Therefore, how to improve speech synthesis efficiency has become a pressing technical problem.

Disclosure of Invention

In view of the above, the present invention provides a method and an apparatus for training a speech synthesis model and speech synthesis, and a speech synthesizer, so as to improve speech synthesis efficiency.

To achieve the above objective, the invention provides the following technical solutions:

a method of speech synthesis model training, the method comprising:

acquiring an acoustic feature training sample;

calling a full convolution speech synthesis model to be trained, and processing the acoustic feature training sample to obtain a discrete speech synthesis result;

converting the discrete speech synthesis result into a continuous speech synthesis result;

comparing the continuous speech synthesis result with a reference output speech synthesis result corresponding to the acoustic feature training sample, and obtaining a loss function of the full convolution speech synthesis model to be trained by using the comparison result;

adjusting model parameters of the full convolution speech synthesis model to be trained by utilizing the loss function;

taking the parameter-adjusted full convolution speech synthesis model as the full convolution speech synthesis model to be trained, and returning to the step of calling the full convolution speech synthesis model to be trained and processing the acoustic feature training sample, until a model training termination condition is met;

and taking the model parameters obtained when the model training termination condition is met as the model parameters of the full convolution speech synthesis model, to obtain the trained full convolution speech synthesis model.

Preferably, the converting the discrete speech synthesis result into a continuous speech synthesis result includes:

obtaining a uniformly distributed sampling result that obeys a uniform distribution;

obtaining a speech synthesis probability distribution result that satisfies a discrete multinomial distribution by using the discrete speech synthesis result and the uniformly distributed sampling result;

and processing the speech synthesis probability distribution result with a continuity function to obtain a continuous speech synthesis result.

Preferably, the loss function includes at least: a short-time Fourier transform (STFT) loss function.

Preferably, the adjusting the model parameters of the full convolution speech synthesis model to be trained by using the loss function includes:

obtaining a continuous speech synthesis spectrum corresponding to the continuous speech synthesis result and a reference output speech synthesis spectrum corresponding to the reference output speech synthesis result from the loss function;

comparing the continuous speech synthesis frequency spectrum with the reference output speech synthesis frequency spectrum, and obtaining the model gradient of the full convolution speech synthesis model to be trained by using the comparison result;

and adjusting the model parameters of the full convolution speech synthesis model to be trained along the descending direction of the model gradient.

Preferably, the model training termination condition at least includes:

the number of model training iterations reaches a preset number, or the model training time reaches a preset duration, or the loss function satisfies a preset condition.

A method of speech synthesis, the method comprising:

acquiring acoustic features corresponding to the speech to be synthesized;

and calling the full convolution speech synthesis model obtained by training with the above speech synthesis model training method, and processing the acoustic features corresponding to the speech to be synthesized to obtain a speech synthesis result.

A speech synthesis model training apparatus, the apparatus comprising:

the sample acquisition unit is used for acquiring an acoustic feature training sample;

the sample processing unit is used for calling a full convolution speech synthesis model to be trained, and processing the acoustic feature training sample to obtain a discrete speech synthesis result;

the conversion unit is used for converting the discrete speech synthesis result into a continuous speech synthesis result;

a loss function obtaining unit, configured to compare the continuous speech synthesis result with a reference output speech synthesis result corresponding to an acoustic feature training sample, and obtain a loss function of the full convolution speech synthesis model to be trained by using the comparison result;

the parameter adjusting unit is used for adjusting the model parameters of the full convolution speech synthesis model to be trained by utilizing the loss function; taking the parameter-adjusted full convolution speech synthesis model as the full convolution speech synthesis model to be trained, and returning to the step, executed by the sample processing unit, of calling the full convolution speech synthesis model to be trained and processing the acoustic feature training sample, until a model training termination condition is met; and taking the model parameters obtained when the model training termination condition is met as the model parameters of the full convolution speech synthesis model, to obtain the trained full convolution speech synthesis model.

A speech synthesis apparatus, the apparatus comprising:

the acoustic feature acquisition unit is used for acquiring acoustic features corresponding to the speech to be synthesized;

and the acoustic feature processing unit is used for calling the full convolution speech synthesis model obtained by training with the above speech synthesis model training method, and processing the acoustic features corresponding to the speech to be synthesized to obtain a speech synthesis result.

A speech synthesizer comprising at least a speech synthesis apparatus as described above.

According to the above technical solutions, compared with the prior art, the invention provides a speech synthesis model training method, a speech synthesis method, corresponding devices, and a speech synthesizer. The speech synthesis model used is a full convolution speech synthesis model, which can perform parallel processing on multiple pieces of data to be synthesized and therefore improves speech synthesis efficiency. In addition, in the process of training the full convolution speech synthesis model, the full convolution speech synthesis model to be trained is called to process the acoustic feature training sample to obtain a discrete speech synthesis result, and the discrete speech synthesis result is converted into a continuous speech synthesis result; a loss function is then obtained from the continuous speech synthesis result and used to optimize the model parameters, so that the finally obtained trained full convolution speech synthesis model can also improve speech synthesis quality.

Drawings

To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic structural diagram of a WaveRNN speech synthesizer model according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for training a speech synthesis model according to an embodiment of the present invention;

FIG. 3 is a specific structure of a full convolution speech synthesis model to be trained according to an embodiment of the present invention;

FIG. 4 is a flowchart of a method for converting the discrete speech synthesis result into a continuous speech synthesis result according to an embodiment of the present invention;

FIG. 5 is a flowchart of a speech synthesis method according to an embodiment of the present invention;

FIG. 6 is a block diagram of a speech synthesis model training apparatus according to an embodiment of the present application;

FIG. 7 is a block diagram of a speech synthesis apparatus according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

With the development of artificial intelligence technology, speech synthesis technology has received more and more attention; it can be applied in fields such as human-computer interaction and converting text into natural language output. At present, a WaveRNN speech synthesizer is often adopted for speech synthesis. Referring to the WaveRNN speech synthesizer model shown in FIG. 1, the WaveRNN speech synthesizer mainly uses a recurrent neural network model: several neural network layers are connected in series, and during synthesis the output of the previous neural network layer serves as the input of the next neural network layer, so the final speech synthesis result can only be obtained through serial, step-by-step computation, and each neural network layer can process only one sampling point at a time. If the speech operation executed by each neural network layer spans T time steps, the total time t is proportional to the number of time steps, i.e., t = O(T). Clearly, the WaveRNN speech synthesizer is limited by this serial loop structure, resulting in a slow speech synthesis speed and low synthesis efficiency.
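For illustration only, the following is a minimal Python sketch of why such a serial synthesizer needs time proportional to the number of sampling points; it is not the WaveRNN implementation itself, and the step function and state size are assumptions:

```python
import numpy as np

def serial_synthesis(step_fn, num_samples, state_dim=64):
    """Autoregressive synthesis: each sampling point depends on the previous
    one, so the loop cannot be parallelized and total time grows as O(T)."""
    state = np.zeros(state_dim)
    samples = []
    for _ in range(num_samples):        # T strictly sequential steps
        state, sample = step_fn(state)  # the next step needs this step's output
        samples.append(sample)
    return np.array(samples)

# Dummy step function standing in for one pass through the stacked layers.
def dummy_step(state):
    new_state = np.tanh(state + 0.1)
    return new_state, float(new_state.mean())

audio = serial_synthesis(dummy_step, num_samples=16000)  # about 1 s at 16 kHz
```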

To solve the above technical problems, the present invention provides a speech synthesis model training method, a speech synthesis method, corresponding devices, and a speech synthesizer. The speech synthesis model used in the present invention is a full convolution speech synthesis model, which can perform parallel processing on multiple pieces of data to be synthesized, so speech synthesis efficiency can be improved. In addition, in the process of training the full convolution speech synthesis model, the full convolution speech synthesis model to be trained is called to process the acoustic feature training sample to obtain a discrete speech synthesis result, and the discrete speech synthesis result is converted into a continuous speech synthesis result, from which a loss function can be obtained to optimize the model parameters and thereby improve speech synthesis quality.

The technical solutions of the invention are described in detail below through specific embodiments:

FIG. 2 is a flowchart of a method for training a speech synthesis model according to an embodiment of the present invention; the method is applicable to a server. Referring to FIG. 2, the method may include:

s100, obtaining an acoustic feature training sample;

the acoustic feature training sample is a training sample to be trained which meets acoustic features, and the acoustic feature training sample comprises a preset number of discrete acoustic feature sampling points.

Step S110, calling a full convolution speech synthesis model to be trained, and processing the acoustic feature training sample to obtain a discrete speech synthesis result;

It should be noted that the full convolution speech synthesis model to be trained in the embodiment of the present invention may specifically be a UFANS full convolution speech synthesis model, which is fully convolutional and has a large receptive field.

The embodiment of the invention mainly calls the full convolution speech synthesis model to be trained and performs speech synthesis processing on the acoustic feature training sample to obtain a discrete speech synthesis result. The full convolution speech synthesis model to be trained provided by the embodiment of the invention can process multiple sampling points in the acoustic feature training sample simultaneously; for example, 275 sampling points can be processed in parallel at the same time. If the speech operation executed by the full convolution speech synthesis model to be trained spans T time steps, the total time t is proportional to the number of time steps divided by the number of points processed in parallel, i.e., t = O(T/275). Obviously, the full convolution speech synthesis model to be trained in the present application can perform parallel processing on multiple pieces of data to be synthesized, the total time t is reduced, the speech synthesis speed is fast, and the synthesis efficiency is high.
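As a contrast to the serial sketch above, here is a hedged example of the parallel behavior described; the 80-channel feature input and layer sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

# A 1-D convolution sees every time step of its input in a single call,
# so all sampling points are processed in parallel on suitable hardware.
conv = nn.Conv1d(in_channels=80, out_channels=256, kernel_size=3, padding=1)

features = torch.randn(1, 80, 275)  # 275 points: (batch, channels, time)
out = conv(features)                # one parallel pass over all 275 points
print(out.shape)                    # torch.Size([1, 256, 275])
```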

FIG. 3 shows a specific structure of the full convolution speech synthesis model to be trained disclosed in the embodiment of the present invention. The full convolution speech synthesis model to be trained mainly includes: convolution structure A, average pooling layer B, upsampling layer C, and convolution structure D.
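A minimal PyTorch sketch of this four-part structure, reading layer B as an average pooling (downsampling) layer; all channel counts, kernel sizes, and scale factors below are assumptions for illustration, not the patented configuration:

```python
import torch
import torch.nn as nn

class FullConvSynthesizer(nn.Module):
    """Sketch: convolution structure A, average pooling layer B,
    upsampling layer C, and convolution structure D."""
    def __init__(self, in_ch=80, hidden=256, n_levels=256):
        super().__init__()
        self.conv_a = nn.Sequential(                # convolution structure A
            nn.Conv1d(in_ch, hidden, 3, padding=1), nn.ReLU())
        self.pool_b = nn.AvgPool1d(kernel_size=2)   # average pooling layer B
        self.up_c = nn.Upsample(scale_factor=2)     # upsampling layer C
        self.conv_d = nn.Conv1d(hidden, n_levels, 3, padding=1)  # structure D

    def forward(self, x):            # x: (batch, in_ch, time)
        x = self.conv_a(x)
        x = self.pool_b(x)
        x = self.up_c(x)
        return self.conv_d(x)        # per-step logits over n_levels sample values

logits = FullConvSynthesizer()(torch.randn(1, 80, 300))
print(logits.shape)                  # torch.Size([1, 256, 300])
```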

Step S120, converting the discrete speech synthesis result into a continuous speech synthesis result;

Because the full convolution speech synthesis model to be trained processes the acoustic feature training sample into a discretized speech synthesis result, and a discretized result cannot be used to optimize the model (for example, no gradient can be obtained from a discrete speech synthesis result, while a gradient can be obtained from a continuous one), the discrete speech synthesis result needs to be converted into a continuous speech synthesis result, so that the continuous speech synthesis result can be used to optimize the full convolution speech synthesis model to be trained.

Step S130, comparing the continuous speech synthesis result with a reference output speech synthesis result corresponding to the acoustic feature training sample, and obtaining a loss function of the full convolution speech synthesis model to be trained by using the comparison result;

the reference output voice synthesis result corresponding to the acoustic feature training sample is generated in advance according to the acoustic feature training sample and is an actual voice synthesis result corresponding to the acoustic feature training sample. And the continuous speech synthesis result is obtained by processing the acoustic feature training sample by using the full convolution speech synthesis model to be trained, and the difference between the speech synthesis result predicted by the full convolution speech synthesis model to be trained and the actual speech synthesis result corresponding to the acoustic feature training sample can be obtained by comparing the continuous speech synthesis result with the reference output speech synthesis result corresponding to the acoustic feature training sample.

Optionally, the loss function adopted in the embodiment of the present invention may include: an STFT (short-time Fourier transform) loss function, which is not specifically limited in the embodiments of the present invention.

The STFT loss function mainly converts the time sequence in a speech synthesis result from the time domain to the frequency domain, yielding a spectrum corresponding to the continuous speech synthesis result and a spectrum corresponding to the reference output speech synthesis result; the difference between these two spectra is the loss of the full convolution speech synthesis model to be trained. In addition, the embodiment of the invention can use the STFT loss function to reduce high-frequency noise in the speech synthesis result and enhance the synthesis effect.
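A hedged sketch of such an STFT loss in PyTorch; the FFT size, hop length, and the use of an L1 distance between log-magnitude spectra are assumptions rather than the patent's exact formulation:

```python
import torch

def stft_loss(continuous_synth, reference, n_fft=1024, hop_length=256):
    """Spectral difference between the continuous speech synthesis result
    and the reference output speech synthesis result."""
    window = torch.hann_window(n_fft)
    spec_synth = torch.stft(continuous_synth, n_fft, hop_length,
                            window=window, return_complex=True).abs()
    spec_ref = torch.stft(reference, n_fft, hop_length,
                          window=window, return_complex=True).abs()
    # L1 distance between log-magnitude spectra; also penalizes
    # high-frequency noise in the synthesized waveform
    return torch.mean(torch.abs(torch.log1p(spec_synth) - torch.log1p(spec_ref)))

loss = stft_loss(torch.randn(1, 16000), torch.randn(1, 16000))
```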

Step S140, adjusting model parameters of the full convolution speech synthesis model to be trained by using the loss function;

by adjusting the model parameters of the full convolution speech synthesis model to be trained, the loss function of the full convolution speech synthesis model to be trained is in a convergence state, namely, the frequency spectrum corresponding to the continuous speech synthesis result is closer to the frequency spectrum corresponding to the reference output speech synthesis result.

The model parameters of the full convolution speech synthesis model to be trained may be adjusted by gradient backpropagation; the embodiment of the present invention is not specifically limited in this respect.

The embodiment of the invention provides a method for adjusting model parameters of the full convolution speech synthesis model to be trained by using the loss function, which comprises the following steps:

obtaining a continuous speech synthesis spectrum corresponding to the continuous speech synthesis result and a reference output speech synthesis spectrum corresponding to the reference output speech synthesis result from the loss function; comparing the continuous speech synthesis spectrum with the reference output speech synthesis spectrum, and obtaining the model gradient of the full convolution speech synthesis model to be trained from the comparison result; and adjusting the model parameters of the full convolution speech synthesis model to be trained along the descending direction of the model gradient.
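A hedged sketch of one such parameter-adjustment step. It reuses the FullConvSynthesizer and stft_loss sketches above and a to_continuous conversion given in the Gumbel section below; the Adam optimizer and learning rate are assumptions:

```python
import torch

model = FullConvSynthesizer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(features, reference_wave):
    logits = model(features)            # discrete speech synthesis result (logits)
    continuous = to_continuous(logits)  # continuous speech synthesis result
    loss = stft_loss(continuous, reference_wave)  # compare the two spectra
    optimizer.zero_grad()
    loss.backward()                     # model gradient via backpropagation
    optimizer.step()                    # move along the descending gradient
    return loss.item()
```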

Step S150, taking the full convolution speech synthesis model after parameter adjustment as a full convolution speech synthesis model to be trained, and returning to execute the step S110 until a model training termination condition is met;

Each time the model parameters of the full convolution speech synthesis model to be trained are adjusted, the parameter-adjusted full convolution speech synthesis model is used as the full convolution speech synthesis model to be trained, and the process returns to step S110.

The model training termination condition at least comprises the following steps:

the number of model training iterations reaches a preset number, or the model training time reaches a preset duration, or the loss function satisfies a preset condition, for example, the loss function has converged.
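A sketch of an outer loop that checks these three termination conditions, reusing the train_step sketch above; max_iters, max_seconds, and the convergence tolerance tol are assumed values:

```python
import time

def train(batches, max_iters=100_000, max_seconds=3600, tol=1e-4):
    """Run train_step until one of the termination conditions is met."""
    start, prev_loss = time.time(), float("inf")
    for i, (features, reference_wave) in enumerate(batches):
        loss = train_step(features, reference_wave)
        if i + 1 >= max_iters:                  # preset number of iterations
            break
        if time.time() - start >= max_seconds:  # preset training duration
            break
        if abs(prev_loss - loss) < tol:         # loss function has converged
            break
        prev_loss = loss
```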

Step S160, taking the model parameters obtained when the model training termination condition is met as the model parameters of the full convolution speech synthesis model, to obtain the trained full convolution speech synthesis model.

The speech synthesis model adopted in the invention is a full convolution speech synthesis model, which can perform parallel processing on multiple pieces of data to be synthesized with a fast synthesis speed, so speech synthesis efficiency is improved; in a parallel environment, the synthesis speed of the speech synthesizer is improved by at least two orders of magnitude. In the process of training the full convolution speech synthesis model, the full convolution speech synthesis model to be trained is called to process the acoustic feature training sample to obtain a discrete speech synthesis result, and the discrete speech synthesis result is converted into a continuous speech synthesis result, from which the loss function used for model parameter optimization is obtained.

A specific process of converting the discrete speech synthesis result into a continuous speech synthesis result is given below. FIG. 4 is a flowchart of a method for converting the discrete speech synthesis result into a continuous speech synthesis result according to an embodiment of the present invention; the method is applicable to a server. Referring to FIG. 4, the method may include:

s200, obtaining a uniformly distributed sampling result which is subjected to uniform distribution;

step S210, obtaining a speech synthesis probability distribution result meeting discrete multinomial distribution by using a discrete speech synthesis result and a uniformly distributed sampling result;

and step S220, processing the voice synthesis probability distribution result by using a continuity function to obtain a continuous voice synthesis result.

Specifically, the embodiment of the present invention mainly uses the Gumbel transformation method. First, uniformly distributed sampling results z_1, ..., z_n that obey a uniform distribution are obtained. Then, using the discrete speech synthesis result and the uniformly distributed sampling results, a speech synthesis probability distribution result x satisfying the discrete multinomial distribution is obtained:

x = argmax_k (log p_k - log(-log z_k))

The speech synthesis probability distribution result is then processed with a continuity function (the softmax relaxation of the above argmax) to obtain the continuous speech synthesis result x_k:

x_k = exp(log p_k - log(-log z_k)) / Σ_j exp(log p_j - log(-log z_j))

where p_k denotes any one of the discrete speech synthesis results and z_k denotes any one of the uniformly distributed sampling results that obey a uniform distribution.
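A sketch of this conversion in PyTorch. Reading the continuity function as the softmax relaxation above, adding a temperature tau, and mapping the relaxed distribution to an expected sample value in [-1, 1] are implementation assumptions:

```python
import torch

def to_continuous(logits, tau=1.0):
    """Convert per-step logits over n quantization levels (the discrete
    speech synthesis result) into a differentiable continuous waveform."""
    z = torch.rand_like(logits)               # z_k drawn from Uniform(0, 1)
    gumbel = -torch.log(-torch.log(z))        # Gumbel transformation of z_k
    log_p = torch.log_softmax(logits, dim=1)  # log p_k
    x = torch.softmax((log_p + gumbel) / tau, dim=1)  # continuity function
    # expected sample value under the relaxed distribution, in [-1, 1]
    levels = torch.linspace(-1.0, 1.0, logits.size(1))
    return torch.einsum("bct,c->bt", x, levels)
```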

Through the above method, the discrete speech synthesis result is converted into a continuous speech synthesis result, and the loss function is obtained from the continuous speech synthesis result. The loss function, combined with gradient backpropagation, can then be used to adjust the model parameters of the full convolution speech synthesis model to be trained, that is, to optimize them; after this model parameter optimization, the finally obtained trained full convolution speech synthesis model can improve speech synthesis quality.

A speech synthesis method is described below with reference to FIG. 5, which is a flowchart of a speech synthesis method according to an embodiment of the present invention; the method is applicable to a server. Referring to FIG. 5, the speech synthesis method may include:

s300, obtaining acoustic characteristics corresponding to the voice to be synthesized;

and S310, calling the full convolution speech synthesis model, and processing the acoustic characteristics corresponding to the speech to be synthesized to obtain a speech synthesis result.

It should be noted that the speech synthesis method in the embodiment of the present invention calls the full convolution speech synthesis model obtained by training with the speech synthesis model training method described in the above embodiments.

It should be noted that the full convolution speech synthesis model in the embodiment of the present invention may specifically be a UFANS full convolution speech synthesis model, which is fully convolutional and has a large receptive field.

The full convolution speech synthesis model provided by the embodiment of the invention can process multiple sampling points in the acoustic features simultaneously and can perform parallel processing on multiple pieces of data to be synthesized, so the total time consumption is reduced, the speech synthesis speed is fast, and the synthesis efficiency is high.

After the discrete speech synthesis result is obtained, it can be converted into a continuous speech synthesis result to facilitate subsequent processing.

The structure of the full convolution speech synthesis model in the embodiment of the present invention is the same as the specific structure of the full convolution speech synthesis model to be trained shown in FIG. 3.
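A hypothetical end-to-end usage of the trained model, reusing the FullConvSynthesizer sketch above; the checkpoint name, the 80-dimensional features, and the argmax decoding into waveform values are assumptions for illustration:

```python
import torch

model = FullConvSynthesizer()                  # same structure as the FIG. 3 sketch
model.load_state_dict(torch.load("synth.pt"))  # trained model parameters
model.eval()

with torch.no_grad():
    features = torch.randn(1, 80, 500)    # acoustic features of the speech to synthesize
    logits = model(features)              # all frames processed in one parallel pass
    indices = logits.argmax(dim=1)        # discrete speech synthesis result
    wave = indices.float() / 127.5 - 1.0  # map 256 levels to samples in [-1, 1]
```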

The following describes a speech synthesis model training apparatus provided in an embodiment of the present application, and the speech synthesis model training apparatus described below may be referred to in correspondence with the above speech synthesis model training method.

FIG. 6 is a block diagram of a speech synthesis model training apparatus according to an embodiment of the present application; referring to FIG. 6, the speech synthesis model training apparatus includes:

a sample obtaining unit 600, configured to obtain an acoustic feature training sample;

the sample processing unit 610 is configured to invoke a full convolution speech synthesis model to be trained, and process the acoustic feature training sample to obtain a discrete speech synthesis result;

a converting unit 620, configured to convert the discrete speech synthesis result into a continuous speech synthesis result;

a loss function obtaining unit 630, configured to compare the continuous speech synthesis result with a reference output speech synthesis result corresponding to an acoustic feature training sample, and obtain a loss function of the full convolution speech synthesis model to be trained by using the comparison result;

a parameter adjusting unit 640, configured to adjust the model parameters of the full convolution speech synthesis model to be trained by using the loss function; take the parameter-adjusted full convolution speech synthesis model as the full convolution speech synthesis model to be trained, and return to the step, executed by the sample processing unit, of calling the full convolution speech synthesis model to be trained and processing the acoustic feature training sample, until a model training termination condition is met; and take the model parameters obtained when the model training termination condition is met as the model parameters of the full convolution speech synthesis model, to obtain the trained full convolution speech synthesis model.

The conversion unit comprises:

the uniformly distributed sampling result acquisition unit is used for acquiring a uniformly distributed sampling result that obeys a uniform distribution;

the discrete probability distribution result obtaining unit is used for obtaining a speech synthesis probability distribution result that satisfies a discrete multinomial distribution by using the discrete speech synthesis result and the uniformly distributed sampling result;

and the continuous speech synthesis result acquisition unit is used for processing the speech synthesis probability distribution result with a continuity function to obtain a continuous speech synthesis result.

The loss function includes at least: a short-time Fourier transform (STFT) loss function.

The parameter adjusting unit is specifically configured to:

obtaining a continuous speech synthesis spectrum corresponding to the continuous speech synthesis result and a reference output speech synthesis spectrum corresponding to the reference output speech synthesis result from the loss function;

comparing the continuous speech synthesis frequency spectrum with the reference output speech synthesis frequency spectrum, and obtaining the model gradient of the full convolution speech synthesis model to be trained by using the comparison result;

and adjusting the model parameters of the full convolution speech synthesis model to be trained along the descending direction of the model gradient.

The model training termination condition at least comprises:

the number of model training iterations reaches a preset number, or the model training time reaches a preset duration, or the loss function satisfies a preset condition.

Optionally, an embodiment of the present invention further discloses a speech synthesis apparatus. FIG. 7 is a block diagram of the structure of the speech synthesis apparatus provided in the embodiment of the present application; referring to FIG. 7, the speech synthesis apparatus includes:

an acoustic feature obtaining unit 700, configured to obtain an acoustic feature corresponding to a speech to be synthesized;

the acoustic feature processing unit 710 is configured to invoke the full convolution speech synthesis model obtained by training with the above speech synthesis model training method, and process the acoustic features corresponding to the speech to be synthesized to obtain a speech synthesis result.

Optionally, the embodiment of the present invention further discloses a speech synthesizer, where the speech synthesizer at least includes the above speech synthesis apparatus.

Technical features described in the embodiments in the present specification may be replaced or combined with each other, each embodiment is described with a focus on differences from other embodiments, and the same and similar portions among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
