Speech synthesis method and apparatus, and storage medium

Document No.: 1005937    Publication date: 2020-10-23

Note: This technology, "Speech synthesis method and apparatus, and storage medium" (一种语音合成方法及装置、存储介质), was created by Wu Zhizheng (武执政) and Song Wei (宋伟) on 2019-09-17. Its main content is as follows: An embodiment of the invention discloses a speech synthesis method and apparatus, and a storage medium. The method includes: obtaining a symbol sequence of a sentence to be synthesized, where the sentence to be synthesized includes a recording sentence representing a target object and a query result sentence for the target object; encoding the symbol sequence with a preset encoding model to obtain a feature vector set; acquiring recording acoustic features corresponding to the recording sentence; predicting acoustic features corresponding to the sentence to be synthesized based on a preset decoding model, the feature vector set, a preset attention model and the recording acoustic features, to obtain predicted acoustic features corresponding to the sentence to be synthesized, where the preset attention model is a model that uses the feature vector set to generate context vectors for decoding, and the predicted acoustic features consist of at least one associated acoustic feature; and performing feature conversion and synthesis on the predicted acoustic features to obtain the speech corresponding to the sentence to be synthesized.

1. A method of speech synthesis, the method comprising:

obtaining a symbol sequence of a sentence to be synthesized, wherein the sentence to be synthesized comprises a recording sentence representing a target object and a query result sentence for the target object;

encoding the symbol sequence by using a preset encoding model to obtain a feature vector set;

acquiring recording acoustic features corresponding to the recording sentence;

predicting acoustic features corresponding to the sentence to be synthesized based on a preset decoding model, the feature vector set, a preset attention model and the recording acoustic features, to obtain predicted acoustic features corresponding to the sentence to be synthesized, wherein the preset attention model is a model that uses the feature vector set to generate context vectors for decoding, and the predicted acoustic features consist of at least one associated acoustic feature;

and performing feature conversion and synthesis on the predicted acoustic features to obtain the speech corresponding to the sentence to be synthesized.

2. The method according to claim 1, wherein the predicting acoustic features corresponding to the sentence to be synthesized based on a preset decoding model, the feature vector set, a preset attention model and the recording acoustic features to obtain the predicted acoustic features corresponding to the sentence to be synthesized comprises:

when i is equal to 1, acquiring an initial acoustic feature at the i-th decoding time, and predicting the 1st acoustic feature based on the initial acoustic feature, the preset decoding model, the feature vector set and the preset attention model, wherein i is an integer greater than 0;

in a case that i is greater than 1, when the i-th decoding time is a decoding time of the recording sentence, taking a j-th frame acoustic feature from the recording acoustic features, using the j-th frame acoustic feature as the (i-1)-th frame acoustic feature, and predicting the i-th acoustic feature based on the (i-1)-th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model, wherein j is an integer greater than 0;

when the i-th decoding time is a decoding time of the query result sentence, using one frame of the (i-1)-th acoustic feature as the (i-1)-th frame acoustic feature, and predicting the i-th acoustic feature based on the (i-1)-th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model;

continuing to perform the prediction process at the (i+1)-th decoding time until decoding of the sentence to be synthesized ends, so as to obtain the n-th acoustic feature, wherein n is the total number of decoding times of the sentence to be synthesized and is an integer greater than 1;

and using the obtained acoustic features, from the 1st acoustic feature to the n-th acoustic feature, as the predicted acoustic features.

3. The method according to claim 2, wherein the preset decoding model comprises a first recurrent neural network and a second recurrent neural network, and predicting the i-th acoustic feature based on the (i-1)-th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model comprises:

performing a nonlinear transformation on the (i-1)-th frame acoustic feature to obtain an intermediate feature vector;

performing matrix operations and a nonlinear transformation on the intermediate feature vector by using the first recurrent neural network to obtain an i-th intermediate hidden variable;

performing context vector calculation on the feature vector set and the i-th intermediate hidden variable by using the preset attention model to obtain an i-th context vector;

performing matrix operations and a nonlinear transformation on the i-th context vector and the i-th intermediate hidden variable by using the second recurrent neural network to obtain an i-th hidden variable;

and performing a linear transformation on the i-th hidden variable according to a preset number of frames to obtain the i-th acoustic feature.

4. The method according to claim 3, wherein the feature vector set comprises a feature vector corresponding to each symbol in the symbol sequence, and performing the context vector calculation on the feature vector set and the i-th intermediate hidden variable by using the preset attention model to obtain the i-th context vector comprises:

performing attention calculation on the feature vector corresponding to each symbol in the symbol sequence and the i-th intermediate hidden variable by using the preset attention model to obtain an i-th group of attention values;

and performing a weighted summation over the feature vector set according to the i-th group of attention values to obtain the i-th context vector.

5. The method according to claim 4, wherein after predicting the i-th acoustic feature based on the (i-1)-th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model, and before continuing to perform the prediction process at the (i+1)-th decoding time, the method further comprises:

determining, from the i-th group of attention values, an i-th target symbol corresponding to the maximum attention value;

when the i-th target symbol is a non-end symbol of the recording sentence, determining that the (i+1)-th decoding time is a decoding time of the recording sentence;

and/or, when the i-th target symbol is a non-end symbol of the query result sentence, determining that the (i+1)-th decoding time is a decoding time of the query result sentence;

and/or, when the i-th target symbol is the end symbol of the recording sentence and the end symbol of the recording sentence is not the end symbol of the sentence to be synthesized, determining that the (i+1)-th decoding time is a decoding time of the query result sentence;

and/or, when the i-th target symbol is the end symbol of the query result sentence and the end symbol of the query result sentence is not the end symbol of the sentence to be synthesized, determining that the (i+1)-th decoding time is a decoding time of the recording sentence;

and/or, when the i-th target symbol is the end symbol of the sentence to be synthesized, determining that the (i+1)-th decoding time is the decoding end time of the sentence to be synthesized.

6. The method according to claim 1, wherein encoding the symbol sequence by using the preset encoding model to obtain the feature vector set comprises:

performing vector conversion on the symbol sequence by using the preset encoding model to obtain an initial feature vector set;

and performing a nonlinear transformation and feature extraction on the initial feature vector set to obtain the feature vector set.

7. The method according to claim 1, wherein performing feature conversion and synthesis on the predicted acoustic features to obtain the speech corresponding to the sentence to be synthesized comprises:

performing feature conversion on the predicted acoustic features to obtain a linear spectrum;

and reconstructing and synthesizing the linear spectrum to obtain the speech.

8. The method according to claim 1, wherein the symbol sequence is a letter sequence or a phoneme sequence.

9. The method according to claim 1, wherein before obtaining the symbol sequence of the sentence to be synthesized, the method further comprises:

obtaining a sample symbol sequence corresponding to each of at least one sample synthesis sentence, wherein each sample synthesis sentence represents a sample object and a reference query result for the sample object;

acquiring an initial speech synthesis model, and an initial acoustic feature and sample acoustic features corresponding to the sample synthesis sentence, wherein the initial speech synthesis model is a model for encoding and prediction;

and training the initial speech synthesis model by using the sample symbol sequence, the initial acoustic feature and the sample acoustic features to obtain the preset encoding model, the preset decoding model and the preset attention model.

10. A speech synthesis apparatus, characterized in that the apparatus comprises a sequence generation module, a speech synthesis module and an acquisition module; wherein:

the sequence generation module is configured to acquire a symbol sequence of a sentence to be synthesized, wherein the sentence to be synthesized comprises a recording sentence representing a target object and a query result sentence for the target object;

the speech synthesis module is configured to encode the symbol sequence by using a preset encoding model to obtain a feature vector set;

the acquisition module is configured to acquire recording acoustic features corresponding to the recording sentence;

the speech synthesis module is further configured to predict, based on a preset decoding model, the feature vector set, a preset attention model and the recording acoustic features, acoustic features corresponding to the sentence to be synthesized, so as to obtain predicted acoustic features corresponding to the sentence to be synthesized, wherein the preset attention model is a model that uses the feature vector set to generate context vectors for decoding, and the predicted acoustic features consist of at least one associated acoustic feature; and to perform feature conversion and synthesis on the predicted acoustic features to obtain the speech corresponding to the sentence to be synthesized.

11. A speech synthesis apparatus, characterized in that the apparatus comprises: a processor, a memory and a communication bus, the memory being in communication with the processor through the communication bus, the memory storing one or more programs executable by the processor, and the one or more programs, when executed, causing the processor to perform the method according to any one of claims 1-9.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program which, when executed by at least one processor, causes the at least one processor to perform the method of any one of claims 1-9.

Technical Field

Embodiments of the present invention relate to speech processing technologies in the field of electronic applications, and in particular, to a speech synthesis method and apparatus, and a storage medium.

Background

At present, speech synthesis technology is applied in many intelligent devices, such as smart speakers, outbound telephone call systems and queue-number calling systems. After receiving a query request for a target object sent by a user, the intelligent device generates a sentence to be synthesized that represents the target object and the query result according to the query request, converts the sentence to be synthesized into a complete speech, and plays the complete speech to inform the user of the query result about the target object. When the sentence to be synthesized is converted into the complete speech, a recording of the fixed target object in the sentence to be synthesized is made in advance, the dynamically updated query result in the sentence to be synthesized is converted into synthesized speech by speech synthesis, and the recording and the synthesized speech are spliced to obtain the complete speech of the sentence to be synthesized.

However, because the recording and the synthesized speech are generated by independent processes, they differ in speaking rate, pitch and other aspects. This can make the prosody of the complete speech obtained by combining the recording and the synthesized speech inconsistent, and the transition duration between the recording and the synthesized speech uncertain, resulting in poor speech quality.

Disclosure of Invention

The main objective of the present invention is to provide a speech synthesis method and apparatus, and a storage medium, which can achieve prosodic consistency of the synthesized speech and improve the quality of the synthesized speech.

The technical solution of the present invention is implemented as follows:

An embodiment of the present invention provides a speech synthesis method, which includes:

obtaining a symbol sequence of a sentence to be synthesized, wherein the sentence to be synthesized comprises a recording sentence representing a target object and a query result sentence for the target object;

encoding the symbol sequence by using a preset encoding model to obtain a feature vector set;

acquiring recording acoustic features corresponding to the recording sentence;

predicting acoustic features corresponding to the sentence to be synthesized based on a preset decoding model, the feature vector set, a preset attention model and the recording acoustic features, to obtain predicted acoustic features corresponding to the sentence to be synthesized, wherein the preset attention model is a model that uses the feature vector set to generate context vectors for decoding, and the predicted acoustic features consist of at least one associated acoustic feature;

and performing feature conversion and synthesis on the predicted acoustic features to obtain the speech corresponding to the sentence to be synthesized.

In the foregoing solution, the predicting acoustic features corresponding to the sentence to be synthesized based on the preset decoding model, the feature vector set, the preset attention model and the recording acoustic features to obtain the predicted acoustic features corresponding to the sentence to be synthesized includes:

when i is equal to 1, acquiring an initial acoustic feature at the i-th decoding time, and predicting the 1st acoustic feature based on the initial acoustic feature, the preset decoding model, the feature vector set and the preset attention model, wherein i is an integer greater than 0;

in a case that i is greater than 1, when the i-th decoding time is a decoding time of the recording sentence, taking a j-th frame acoustic feature from the recording acoustic features, using the j-th frame acoustic feature as the (i-1)-th frame acoustic feature, and predicting the i-th acoustic feature based on the (i-1)-th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model, wherein j is an integer greater than 0;

when the i-th decoding time is a decoding time of the query result sentence, using one frame of the (i-1)-th acoustic feature as the (i-1)-th frame acoustic feature, and predicting the i-th acoustic feature based on the (i-1)-th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model;

continuing to perform the prediction process at the (i+1)-th decoding time until decoding of the sentence to be synthesized ends, so as to obtain the n-th acoustic feature, wherein n is the total number of decoding times of the sentence to be synthesized and is an integer greater than 1;

and using the obtained acoustic features, from the 1st acoustic feature to the n-th acoustic feature, as the predicted acoustic features.
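Illustratively, a minimal sketch of this hybrid decoding loop is given below. The helper name decode_step, the array shapes and the termination limit are assumptions made only for illustration: decode_step stands for one pass through the preset decoding model and the preset attention model, and is assumed to report whether the next decoding time still belongs to the recording sentence and whether the end of the sentence to be synthesized has been reached.

```python
import numpy as np

def predict_acoustic_features(encoder_outputs, recording_features, decode_step, n_max=1000):
    """Hybrid decoding loop: frames taken from the recording acoustic features drive decoding
    while the current decoding time belongs to the recording sentence; otherwise the previously
    predicted frame is fed back (autoregression) for the query result sentence."""
    predicted = []                                       # i-th predicted acoustic feature per step
    prev_frame = np.zeros(recording_features.shape[1])   # initial acoustic feature (all zeros here)
    state = None                                         # recurrent state carried across decoding times
    j = 0                                                # index of the next recorded frame to use
    in_recording = True                                  # updated after each step by decode_step

    for i in range(1, n_max + 1):
        if i > 1:
            if in_recording:
                prev_frame = recording_features[j]       # j-th recorded frame as the (i-1)-th frame
                j += 1
            else:
                prev_frame = predicted[-1][-1]           # one frame of the (i-1)-th predicted feature
        # decode_step returns the i-th acoustic feature (one or more frames), the updated recurrent
        # state, whether the next decoding time is still in the recording sentence, and whether the
        # end of the sentence to be synthesized was reached.
        feature_i, state, in_recording, done = decode_step(prev_frame, encoder_outputs, state)
        predicted.append(feature_i)
        if done:
            break
    return np.concatenate(predicted, axis=0)             # predicted acoustic features, frame by frame
```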

In the above solution, the preset decoding model includes a first recurrent neural network and a second recurrent neural network, and predicting the i-th acoustic feature based on the (i-1)-th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model includes:

performing a nonlinear transformation on the (i-1)-th frame acoustic feature to obtain an intermediate feature vector;

performing matrix operations and a nonlinear transformation on the intermediate feature vector by using the first recurrent neural network to obtain an i-th intermediate hidden variable;

performing context vector calculation on the feature vector set and the i-th intermediate hidden variable by using the preset attention model to obtain an i-th context vector;

performing matrix operations and a nonlinear transformation on the i-th context vector and the i-th intermediate hidden variable by using the second recurrent neural network to obtain an i-th hidden variable;

and performing a linear transformation on the i-th hidden variable according to a preset number of frames to obtain the i-th acoustic feature.
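Illustratively, one decoding step can be sketched as follows, with plain tanh recurrent cells standing in for the first and second recurrent neural networks, a ReLU layer standing in for the pre-net nonlinearity, and an attention callback supplied by the caller; the parameter names and the default of three frames per step are assumptions, not limitations of this solution.

```python
import numpy as np

def decoder_step(prev_frame, enc_outputs, h1, h2, params, attention_fn, r=3):
    """One decoding step for a single utterance (simplified stand-in for the decoding model)."""
    p = params
    # Nonlinear transformation of the (i-1)-th frame acoustic feature -> intermediate feature vector
    x = np.maximum(0.0, p["W_pre"] @ prev_frame + p["b_pre"])
    # First recurrent neural network: matrix operations + nonlinearity -> i-th intermediate hidden variable
    h1 = np.tanh(p["W_xh1"] @ x + p["W_hh1"] @ h1 + p["b_h1"])
    # Preset attention model: i-th context vector from the feature vector set and h1
    context, attention_values = attention_fn(enc_outputs, h1)
    # Second recurrent neural network on [context; h1] -> i-th hidden variable
    z = np.concatenate([context, h1])
    h2 = np.tanh(p["W_xh2"] @ z + p["W_hh2"] @ h2 + p["b_h2"])
    # Linear transformation producing a preset number of frames r (the i-th acoustic feature)
    frames = (p["W_out"] @ h2 + p["b_out"]).reshape(r, -1)
    return frames, h1, h2, attention_values
```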

In the above solution, the feature vector set includes a feature vector corresponding to each symbol in the symbol sequence, and performing the context vector calculation on the feature vector set and the i-th intermediate hidden variable by using the preset attention model to obtain the i-th context vector includes:

performing attention calculation on the feature vector corresponding to each symbol in the symbol sequence and the i-th intermediate hidden variable by using the preset attention model to obtain an i-th group of attention values;

and performing a weighted summation over the feature vector set according to the i-th group of attention values to obtain the i-th context vector.
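Illustratively, the attention calculation can be sketched as follows; the additive score function and the softmax normalisation are assumptions, since this solution only requires one attention value per symbol followed by a weighted summation over the feature vector set.

```python
import numpy as np

def attention_context(enc_outputs, hidden, W_q, W_k, v):
    """Compute the i-th group of attention values and the i-th context vector.
    enc_outputs: (num_symbols, enc_dim), one feature vector per symbol in the symbol sequence."""
    scores = np.array([v @ np.tanh(W_q @ hidden + W_k @ e) for e in enc_outputs])
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()      # i-th group of attention values (softmax-normalised)
    context = weights @ enc_outputs        # weighted summation over the feature vector set
    return context, weights
```

A function of this shape, with its weight matrices bound in advance (for example via functools.partial), can serve as the attention_fn callback in the decoder-step sketch above.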

In the foregoing solution, after predicting the i-th acoustic feature based on the (i-1)-th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model, and before continuing to perform the prediction process at the (i+1)-th decoding time, the method further includes:

determining, from the i-th group of attention values, an i-th target symbol corresponding to the maximum attention value;

when the i-th target symbol is a non-end symbol of the recording sentence, determining that the (i+1)-th decoding time is a decoding time of the recording sentence;

and/or, when the i-th target symbol is a non-end symbol of the query result sentence, determining that the (i+1)-th decoding time is a decoding time of the query result sentence;

and/or, when the i-th target symbol is the end symbol of the recording sentence and the end symbol of the recording sentence is not the end symbol of the sentence to be synthesized, determining that the (i+1)-th decoding time is a decoding time of the query result sentence;

and/or, when the i-th target symbol is the end symbol of the query result sentence and the end symbol of the query result sentence is not the end symbol of the sentence to be synthesized, determining that the (i+1)-th decoding time is a decoding time of the recording sentence;

and/or, when the i-th target symbol is the end symbol of the sentence to be synthesized, determining that the (i+1)-th decoding time is the decoding end time of the sentence to be synthesized.
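Illustratively, these rules can be condensed into a small helper, sketched below under the simplifying assumption that the recording sentence and the query result sentence occupy known, disjoint index ranges in the symbol sequence; the function name and the returned labels are hypothetical.

```python
def next_decoding_time(attention_values, rec_symbols, query_symbols, rec_end, query_end, sent_end):
    """Decide what the (i+1)-th decoding time is from the i-th group of attention values.
    rec_symbols / query_symbols: index sets of the recording and query result sentences;
    rec_end / query_end / sent_end: indices of the respective end symbols."""
    target = int(attention_values.argmax())   # i-th target symbol = position of the maximum attention value
    if target == sent_end:
        return "end"                          # end symbol of the sentence to be synthesized
    if target == rec_end:
        return "query"                        # recording sentence ended, decode the query result next
    if target == query_end:
        return "recording"                    # query result ended, decode the recording part next
    return "recording" if target in rec_symbols else "query"   # non-end symbol: stay in that sentence
```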

In the foregoing solution, encoding the symbol sequence by using the preset encoding model to obtain the feature vector set includes:

performing vector conversion on the symbol sequence by using the preset encoding model to obtain an initial feature vector set;

and performing a nonlinear transformation and feature extraction on the initial feature vector set to obtain the feature vector set.
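Illustratively, this encoding process can be compressed into the following sketch, in which an embedding lookup stands for the vector conversion and two dense layers stand in for the nonlinear transformation and the feature extraction; the layer choices are assumptions rather than the structure of the preset encoding model.

```python
import numpy as np

def encode_symbols(symbol_ids, embedding, W1, b1, W2, b2):
    """Encode a symbol sequence into one feature vector per symbol."""
    vectors = embedding[symbol_ids]               # vector conversion -> initial feature vector set
    hidden = np.maximum(0.0, vectors @ W1 + b1)   # nonlinear transformation (ReLU stand-in for the pre-net)
    features = np.tanh(hidden @ W2 + b2)          # feature extraction (stand-in for the CBHG module)
    return features                               # feature vector set, shape (num_symbols, dim)
```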

In the foregoing solution, performing feature conversion and synthesis on the predicted acoustic features to obtain the speech corresponding to the sentence to be synthesized includes:

performing feature conversion on the predicted acoustic features to obtain a linear spectrum;

and reconstructing and synthesizing the linear spectrum to obtain the speech.
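Illustratively, the reconstruction step can be sketched as follows, assuming a hypothetical learned conversion to_linear that maps the predicted acoustic features to a linear magnitude spectrogram, and using the Griffin-Lim algorithm (here via librosa) to recover the waveform; the STFT parameters are assumptions.

```python
import librosa

def features_to_speech(predicted_features, to_linear, n_fft=1024, hop_length=256, n_iter=60):
    """Feature conversion and synthesis: predicted acoustic features -> linear spectrum -> waveform."""
    linear_spec = to_linear(predicted_features)        # feature conversion to a linear (magnitude) spectrum
    # librosa.griffinlim expects the magnitude spectrogram shaped (1 + n_fft // 2, num_frames)
    waveform = librosa.griffinlim(linear_spec.T, n_iter=n_iter,
                                  hop_length=hop_length, win_length=n_fft)
    return waveform
```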

In the above solution, the symbol sequence is a letter sequence or a phoneme sequence.

In the above solution, before obtaining the symbol sequence of the sentence to be synthesized, the method further includes:

obtaining a sample symbol sequence corresponding to each of at least one sample synthesis sentence, wherein each sample synthesis sentence represents a sample object and a reference query result for the sample object;

acquiring an initial speech synthesis model, and an initial acoustic feature and sample acoustic features corresponding to the sample synthesis sentence, wherein the initial speech synthesis model is a model for encoding and prediction;

and training the initial speech synthesis model by using the sample symbol sequence, the initial acoustic feature and the sample acoustic features to obtain the preset encoding model, the preset decoding model and the preset attention model.
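Illustratively, the training procedure can be sketched as the skeleton below, which assumes a hypothetical model object bundling the initial encoding, decoding and attention models and accepting teacher-forced sample acoustic features, and a dataset yielding (sample symbol sequence, sample acoustic features) pairs; the L1 loss and the teacher keyword are assumptions, not part of this disclosure.

```python
import torch
from torch import nn

def train_speech_synthesis_model(model, dataset, epochs=10, lr=1e-3):
    """Training-loop skeleton for the initial speech synthesis model (hypothetical interface)."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()                                   # frame-level reconstruction loss (assumed)
    for _ in range(epochs):
        for symbol_ids, sample_features in dataset:
            go_frame = torch.zeros_like(sample_features[:1])  # initial acoustic feature
            predicted = model(symbol_ids, go_frame, teacher=sample_features)
            loss = criterion(predicted, sample_features)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return model
```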

An embodiment of the present invention provides a speech synthesis apparatus, where the apparatus includes a sequence generation module, a speech synthesis module and an acquisition module; wherein:

the sequence generation module is configured to acquire a symbol sequence of a sentence to be synthesized, wherein the sentence to be synthesized comprises a recording sentence representing a target object and a query result sentence for the target object;

the speech synthesis module is configured to encode the symbol sequence by using a preset encoding model to obtain a feature vector set;

the acquisition module is configured to acquire recording acoustic features corresponding to the recording sentence;

the speech synthesis module is further configured to predict, based on a preset decoding model, the feature vector set, a preset attention model and the recording acoustic features, acoustic features corresponding to the sentence to be synthesized, so as to obtain predicted acoustic features corresponding to the sentence to be synthesized, where the preset attention model is a model that uses the feature vector set to generate context vectors for decoding, and the predicted acoustic features consist of at least one associated acoustic feature; and to perform feature conversion and synthesis on the predicted acoustic features to obtain the speech corresponding to the sentence to be synthesized.

In the foregoing solution, the speech synthesis module is specifically configured to: when i is equal to 1, acquire an initial acoustic feature at the i-th decoding time, and predict the 1st acoustic feature based on the initial acoustic feature, the preset decoding model, the feature vector set and the preset attention model, where i is an integer greater than 0;

in a case that i is greater than 1, when the i-th decoding time is a decoding time of the recording sentence, take a j-th frame acoustic feature from the recording acoustic features, use the j-th frame acoustic feature as the (i-1)-th frame acoustic feature, and predict the i-th acoustic feature based on the (i-1)-th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model, where j is an integer greater than 0;

when the i-th decoding time is a decoding time of the query result sentence, use one frame of the (i-1)-th acoustic feature as the (i-1)-th frame acoustic feature, and predict the i-th acoustic feature based on the (i-1)-th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model;

continue to perform the prediction process at the (i+1)-th decoding time until decoding of the sentence to be synthesized ends, so as to obtain the n-th acoustic feature, where n is the total number of decoding times of the sentence to be synthesized and is an integer greater than 1;

and use the obtained acoustic features, from the 1st acoustic feature to the n-th acoustic feature, as the predicted acoustic features.

In the above solution, the preset decoding model includes a first recurrent neural network and a second recurrent neural network;

the speech synthesis module is specifically configured to: perform a nonlinear transformation on the (i-1)-th frame acoustic feature to obtain an intermediate feature vector; perform matrix operations and a nonlinear transformation on the intermediate feature vector by using the first recurrent neural network to obtain an i-th intermediate hidden variable; perform context vector calculation on the feature vector set and the i-th intermediate hidden variable by using the preset attention model to obtain an i-th context vector; perform matrix operations and a nonlinear transformation on the i-th context vector and the i-th intermediate hidden variable by using the second recurrent neural network to obtain an i-th hidden variable; and perform a linear transformation on the i-th hidden variable according to a preset number of frames to obtain the i-th acoustic feature.

In the above solution, the feature vector set includes a feature vector corresponding to each symbol in the symbol sequence;

the speech synthesis module is specifically configured to: perform attention calculation on the feature vector corresponding to each symbol in the symbol sequence and the i-th intermediate hidden variable by using the preset attention model to obtain an i-th group of attention values; and perform a weighted summation over the feature vector set according to the i-th group of attention values to obtain the i-th context vector.

In the foregoing solution, the speech synthesis module is further configured to: after predicting the i-th acoustic feature based on the (i-1)-th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model, and before continuing to perform the prediction process at the (i+1)-th decoding time, determine, from the i-th group of attention values, an i-th target symbol corresponding to the maximum attention value;

when the i-th target symbol is a non-end symbol of the recording sentence, determine that the (i+1)-th decoding time is a decoding time of the recording sentence;

and/or, when the i-th target symbol is a non-end symbol of the query result sentence, determine that the (i+1)-th decoding time is a decoding time of the query result sentence;

and/or, when the i-th target symbol is the end symbol of the recording sentence and the end symbol of the recording sentence is not the end symbol of the sentence to be synthesized, determine that the (i+1)-th decoding time is a decoding time of the query result sentence;

and/or, when the i-th target symbol is the end symbol of the query result sentence and the end symbol of the query result sentence is not the end symbol of the sentence to be synthesized, determine that the (i+1)-th decoding time is a decoding time of the recording sentence;

and/or, when the i-th target symbol is the end symbol of the sentence to be synthesized, determine that the (i+1)-th decoding time is the decoding end time of the sentence to be synthesized.

In the foregoing solution, the speech synthesis module is specifically configured to: perform vector conversion on the symbol sequence to obtain an initial feature vector set; and perform a nonlinear transformation and feature extraction on the initial feature vector set to obtain the feature vector set.

In the above solution, the speech synthesis module is specifically configured to: perform feature conversion on the predicted acoustic features to obtain a linear spectrum; and reconstruct and synthesize the linear spectrum to obtain the speech.

In the above solution, the symbol sequence is a letter sequence or a phoneme sequence.

In the above solution, the apparatus further comprises a training module;

the training module is configured to: before the symbol sequence of the sentence to be synthesized is obtained, obtain a sample symbol sequence corresponding to each of at least one sample synthesis sentence, where each sample synthesis sentence represents a sample object and a reference query result for the sample object; acquire an initial speech synthesis model, and an initial acoustic feature and sample acoustic features corresponding to the sample synthesis sentence, where the initial speech synthesis model is a model for encoding and prediction; and train the initial speech synthesis model by using the sample symbol sequence, the initial acoustic feature and the sample acoustic features to obtain the preset encoding model, the preset decoding model and the preset attention model.

An embodiment of the present invention provides a speech synthesis apparatus, where the apparatus includes: a processor, a memory and a communication bus, the memory being in communication with the processor through the communication bus, the memory storing one or more programs executable by the processor, the one or more programs, when executed, causing the processor to perform the steps of any of the speech synthesis methods described above.

Embodiments of the present invention provide a computer-readable storage medium storing a program which, when executed by at least one processor, causes the at least one processor to perform the steps of any of the speech synthesis methods described above.

Embodiments of the present invention provide a speech synthesis method and apparatus, and a storage medium. With the above technical solution, the predicted acoustic features corresponding to the sentence to be synthesized are obtained by prediction based on a preset decoding model, a feature vector set, a preset attention model and the recording acoustic features, and the predicted acoustic features are then subjected to feature conversion and synthesis to obtain the speech. This solves the problem of the uncertain transition duration that arises when a recording and synthesized speech are spliced, and improves the quality of the synthesized speech.

Drawings

Fig. 1 is a first schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention;

fig. 2 is a schematic structural diagram of a Tacotron model according to an embodiment of the present invention;

fig. 3 is a first flowchart of a speech synthesis method according to an embodiment of the present invention;

fig. 4 is a second flowchart of a speech synthesis method according to an embodiment of the present invention;

fig. 5 is a schematic diagram of a correspondence between a phoneme sequence and attention values according to an embodiment of the present invention;

fig. 6 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention;

fig. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.

In the following description, suffixes such as "module", "component" or "unit" used to denote elements are used only to facilitate the description of the present invention and have no specific meaning in themselves. Therefore, "module", "component" and "unit" may be used interchangeably.

Referring now to fig. 1, which is a schematic diagram of a speech synthesis apparatus 1 for implementing various embodiments of the present invention, the apparatus 1 may include a sequence generation module 11, a speech synthesis module 12 and a playing module 13. The sequence generation module 11 receives a query request for a target object sent by a user, determines a sentence to be synthesized according to the query request, where the sentence to be synthesized is the text of the query result about the target object, and transmits a symbol sequence of the sentence to be synthesized to the speech synthesis module 12; the speech synthesis module 12 performs speech synthesis on the symbol sequence to obtain the speech corresponding to the sentence to be synthesized and transmits the speech to the playing module 13; and the playing module 13 plays the speech.

In some embodiments, the speech synthesis module 12 is a module built from an attention model and an Encoder-Decoder model. For example, the speech synthesis module 12 is a Tacotron model, a Text-to-Speech (TTS) model based on deep learning. As shown in fig. 2, the Tacotron model mainly includes a coding model 21, an attention model 22 and a decoding model 23. The coding model 21 includes a character embedding model 211, a Pre-net model 212 and a CBHG model 213; the decoding model 23 includes a Pre-net model 231, a first Recurrent Neural Network (RNN) 232, a second recurrent neural network 233, a linear conversion model 234, a CBHG model 235 and a speech reconstruction model 236. The CBHG model 213 and the CBHG model 235 have the same structure and consist of a convolution bank, a highway network and a gated recurrent unit (GRU). The speech reconstruction model 236 is a model built with the Griffin-Lim algorithm.

Illustratively, the Tacotron model receives the symbol sequence of the sentence to be synthesized and performs the encoding process as follows: the character embedding model 211 performs vector conversion on the symbol sequence to obtain a converted vector set and transmits the converted vector set to the Pre-net model 212; the Pre-net model 212 performs a nonlinear transformation on the converted vector set to obtain an intermediate feature vector set and transmits it to the CBHG model 213; the CBHG model 213 performs a series of matrix operations and nonlinear transformations on the intermediate feature vector set to obtain a feature vector set, and the encoding ends.

Further, after the encoding process ends, the prediction process is performed as follows: at the current decoding time, the Pre-net model 231 performs a nonlinear transformation on the current frame acoustic feature to obtain an intermediate feature vector and transmits it to the first recurrent neural network 232; the first recurrent neural network 232 performs a series of matrix operations and nonlinear transformations on the intermediate feature vector to obtain a current intermediate hidden variable (hidden state), transmits it to the attention model 22 and the second recurrent neural network 233, and also stores the current intermediate hidden variable for use at the next decoding time; the attention model 22 performs context vector calculation on the current intermediate hidden variable and the feature vector set obtained by encoding to obtain a current context vector and transmits it to the second recurrent neural network 233; the second recurrent neural network 233 performs a series of matrix operations and nonlinear transformations on the current context vector and the current intermediate hidden variable to obtain a current hidden variable and transmits it to the linear conversion model 234; the linear conversion model 234 performs a linear transformation on the current hidden variable to obtain the current acoustic feature and transmits it to the CBHG model 235; the prediction process continues at the next decoding time until decoding of the sentence to be synthesized ends and the last acoustic feature is obtained; the CBHG model 235 performs feature conversion on the first through last acoustic features to obtain a linear spectrum and transmits the linear spectrum to the speech reconstruction model 236; and the speech reconstruction model 236 reconstructs and synthesizes the linear spectrum to generate the speech.

It should be noted that, as indicated by the dashed line in fig. 2, during the prediction process the decoding model 23 may operate in an autoregressive manner, that is, one frame of the current acoustic feature obtained at the current decoding time is used as the input of the next decoding time; it may also operate in a non-autoregressive manner, that is, the input of the next decoding time is not a frame of the acoustic feature obtained at the current decoding time; fig. 2 illustrates only three decoding times as an example, and the number of decoding times is not limited in the embodiments of the present invention.

It will be appreciated by those skilled in the art that the configuration of the speech synthesis apparatus shown in fig. 1 or fig. 2 is not intended to be limiting: the speech synthesis apparatus may include more or fewer components than those shown, combine certain components, or arrange the components differently.

It should be noted that the embodiments of the present invention can be implemented based on the speech synthesis apparatus shown in fig. 1 or fig. 2, and specific embodiments of speech synthesis are described below based on fig. 1 or fig. 2.
