Acoustic feature conversion and model training method, device, equipment and medium

Document No.: 154800  Publication date: 2021-10-26

Reading note: This technology, "Acoustic feature conversion and model training method, device, equipment and medium", was designed and created by 林诗伦 on 2020-12-30. Its main content includes: the present application provides an acoustic feature conversion and model training method, device, equipment and medium, applied to the field of artificial intelligence. The acoustic feature conversion method comprises: inputting a text sequence to be converted into an encoder network of a conversion model to obtain a text representation sequence, the text sequence to be converted comprising rhyme characteristic information; inputting the text representation sequence into a basic attention network of the conversion model to obtain a first attention state, a first context vector and a basic attention score matrix of the current time step; and inputting the first attention state and the first context vector of the current time step into a decoder network of the conversion model to obtain a first acoustic feature, the first acoustic feature being used for synthesizing audio data corresponding to the text sequence to be converted. The acoustic feature conversion method provided by the present application can generate acoustic features of high quality.

1. An acoustic feature conversion method, comprising:

inputting a text sequence to be converted into an encoder network of a conversion model to obtain a text representation sequence; the text sequence to be converted comprises rhyme characteristic information;

inputting the text representation sequence into a basic attention network of the conversion model to obtain a first attention state, a first context vector and a basic attention score matrix of the current time step;

inputting the first attention state and the first context vector of the current time step into a decoder network of the conversion model to obtain a first acoustic feature; the first acoustic feature is used for synthesizing audio data corresponding to the text sequence to be converted;

wherein a loss function of the conversion model in the training process is related to a first loss value corresponding to at least one guiding attention network; the first loss value is used to characterize a distance between the guiding attention score matrix output by the guiding attention network and the base attention score matrix.

2. The method of claim 1, wherein inputting the sequence of textual representations into a base attention network of the transformation model results in a first attention state, a first context vector, and a base attention score matrix for a current time step, comprising:

determining a first attention state of the current time step according to the second attention state, the second context vector and the second acoustic feature of the previous time step;

determining the basic attention score matrix according to the text representation sequence, the first attention state and the sequence position of the current time step;

determining the first context vector based on the base attention score matrix and the sequence of text representations.

3. The method of claim 2, wherein the text representation sequence comprises a plurality of text representation vectors corresponding to respective sequence positions; said determining the first context vector based on the base attention score matrix and the text representation sequence comprises:

and according to the attention weight corresponding to each sequence position in the basic attention score matrix, performing weighted summation on the text representation vector corresponding to each sequence position to obtain the first context vector.

4. The method of claim 3, wherein inputting the first attention state and the first context vector of the current time step to the decoder network of the conversion model to obtain the first acoustic feature comprises:

acquiring a second decoder state of the previous time step;

inputting the second decoder state, the first context vector, and the first attention state to the decoder network, resulting in the first acoustic feature.

5. The method of claim 4, wherein inputting the second decoder state, the first context vector, and the first attention state to the decoder network to obtain the first acoustic feature comprises:

determining a first decoder state based on the second decoder state, the first context vector, and a first attention state;

converting the first decoder state into the first acoustic feature based on a preset affine function.

6. A method for training a conversion model, comprising:

acquiring sample data; the sample data comprises a sample text sequence;

inputting the sample text sequence into an encoder network of a conversion model to obtain a sample representation sequence;

inputting the sample representation sequence into a basic attention network of the conversion model to obtain a sample basic score matrix of the current time step;

inputting the sample representation sequence into at least one guiding attention network of the conversion model to obtain a sample guiding score matrix of the current time step output by each guiding attention network;

determining a first loss value corresponding to each guiding attention network according to the sample basic score matrix and the sample guiding score matrix output by each guiding attention network; the first loss value is used to characterize a distance between the sample guiding score matrix output by the guiding attention network and the sample basic score matrix;

and adjusting the model parameters of the conversion model by using the first loss value corresponding to each guiding attention network to obtain the trained conversion model.

7. The method of claim 6, wherein inputting the sample representation sequence into a base attention network of the transformation model to obtain a sample base score matrix for a current time step comprises:

determining a third attention state of the current time step according to a fourth attention state, a fourth context vector and a fourth acoustic feature of the previous time step;

and determining the sample base score matrix according to the sample representation sequence, the third attention state and the sequence position of the current time step.

8. The method of claim 7, wherein the sample data further comprises sample acoustic features corresponding to a sample text sequence; the method further comprises the following steps:

determining a third context vector based on the sample base score matrix and the third attention state;

inputting the third attention state and the third context vector to a decoder network of the conversion model to obtain a third acoustic feature; the third acoustic feature is used for synthesizing audio data corresponding to the sample text sequence;

determining a second loss value according to the third acoustic characteristic and the sample acoustic characteristic;

the adjusting the model parameters of the conversion model by using the first loss value corresponding to each guiding attention network to obtain the trained conversion model includes:

and adjusting the model parameters of the conversion model by using the second loss value and the first loss value corresponding to each guiding attention network to obtain the trained conversion model.

9. The method of claim 7, wherein, in the case that the guiding attention network is a forward attention network, the inputting the sample representation sequence to at least one guiding attention network of the conversion model to obtain a sample guiding score matrix of the current time step output by each guiding attention network comprises:

determining a first alignment parameter of the current time step based on the sample base score matrix of the current time step and a second alignment parameter of a previous time step;

normalizing the first alignment parameter to obtain a first guidance score matrix output by the forward attention network;

the determining a first loss value corresponding to each guiding attention network according to the sample base score matrix and the sample guiding score matrix output by each guiding attention network includes:

determining a first loss value corresponding to the forward attention network according to the sample base score matrix and the first guidance score matrix.

10. The method of claim 9, wherein the first alignment parameter comprises a first sub-parameter corresponding to each of the sequence positions; the determining a first alignment parameter of the current time step based on the sample base score matrix of the current time step and a second alignment parameter of a previous time step includes:

determining a first sub-parameter corresponding to each of the sequence positions of the current time step based on the attention weight corresponding to each of the sequence positions in the sample base score matrix and a second sub-parameter corresponding to each of the sequence positions in the second alignment parameter.

11. The method of claim 7, wherein, in the case that the guiding attention network is a Gaussian attention network, the inputting the sample representation sequence to at least one guiding attention network of the conversion model to obtain a sample guiding score matrix of the current time step output by each guiding attention network comprises:

acquiring a first mean value parameter, a first variance parameter and a first offset parameter of the current time step according to the third attention state;

determining a Gaussian mixture distribution according to the first mean parameter, the first variance parameter and the first offset parameter;

obtaining a second guidance score matrix output by the Gaussian attention network based on the Gaussian mixture distribution;

determining a first loss value corresponding to each guiding attention network according to the sample basic score matrix and the sample guiding score matrix output by each guiding attention network, wherein the determining comprises the following steps:

and determining a first loss value corresponding to the Gaussian attention network according to the sample base score matrix and the second guidance score matrix.

12. The method of claim 11, wherein obtaining the first mean parameter, the first variance parameter, and the first offset parameter for the current time step according to the third attention state comprises:

converting, by a multi-layer perceptron, the third attention state into a mean intermediate parameter, a variance intermediate parameter, and an offset intermediate parameter;

inputting the variance intermediate parameter into an exponential function to obtain the first variance parameter;

inputting the offset intermediate parameter into a first activation function to obtain the first offset parameter;

and inputting the mean value intermediate parameter into a second activation function, and determining the first mean value parameter according to the parameter output by the second activation function and the second mean value parameter of the previous time step.

13. An acoustic feature conversion apparatus, comprising:

the first input module is used for inputting the text sequence to be converted into the encoder network of the conversion model to obtain a text representation sequence; the text sequence to be converted comprises rhyme characteristic information;

the second input module is used for inputting the text representation sequence into a basic attention network of the conversion model to obtain a first attention state, a first context vector and a basic attention score matrix of the current time step;

a third input module, configured to input the first attention state and the first context vector of the current time step to a decoder network of the conversion model to obtain a first acoustic feature; the first acoustic feature is used for synthesizing audio data corresponding to the text sequence to be converted;

wherein a loss function of the conversion model in the training process is related to a first loss value corresponding to at least one guiding attention network; the first loss value is used to characterize a distance between the guiding attention score matrix output by the guiding attention network and the base attention score matrix.

14. A computer device, comprising:

a memory for storing executable instructions;

a processor for implementing the method of any one of claims 1 to 5, or implementing the method of any one of claims 6 to 12, when executing executable instructions stored in the memory.

15. A computer-readable storage medium having stored thereon executable instructions for, when executed by a processor, implementing the method of any one of claims 1 to 5, or implementing the method of any one of claims 6 to 12.

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to an acoustic feature conversion method, apparatus, device, and computer-readable storage medium.

Background

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Speech synthesis technology converts text into corresponding audio content through certain rules or model algorithms. Traditional speech synthesis technology is mainly based on concatenation (splicing) methods or statistical parametric methods, which can realize text-to-speech-feature conversion. However, the acoustic features obtained by such conventional text-to-speech-feature conversion schemes are of low quality and cannot meet the requirements of application scenarios.

Disclosure of Invention

The embodiment of the application provides an acoustic feature conversion method, an acoustic feature conversion device, acoustic feature conversion equipment and a computer-readable storage medium, which can generate acoustic features with high quality.

The technical scheme of the embodiment of the application is realized as follows:

The embodiment of the application provides an acoustic feature conversion method, which comprises the following steps: inputting a text sequence to be converted into an encoder network of a conversion model to obtain a text representation sequence; the text sequence to be converted comprises rhyme characteristic information; inputting the text representation sequence into a basic attention network of the conversion model to obtain a first attention state, a first context vector and a basic attention scoring matrix of the current time step; inputting the first attention state and the first context vector of the current time step into a decoder network of the conversion model to obtain a first acoustic feature; the first acoustic feature is used for synthesizing audio data corresponding to the text sequence to be converted; wherein the loss function of the conversion model in the training process is related to the first loss value corresponding to at least one guiding attention network; the first loss value is used to characterize a distance between the guiding attention score matrix output by the guiding attention network and the basic attention scoring matrix.
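For illustration only, the composition of the three networks at one time step can be sketched as follows; the function and module names (encoder, base_attention, decoder and the state dictionary) are hypothetical and do not denote the application's actual implementation:

```python
def convert_step(encoder, base_attention, decoder, text_sequence, prev):
    # Encoder network: text sequence to be converted -> text representation sequence.
    text_repr = encoder(text_sequence)
    # Base attention network: previous-step attention state, context vector and
    # acoustic feature -> first attention state, first context vector, score matrix.
    attn_state, context, base_scores = base_attention(
        text_repr, prev["attn_state"], prev["context"], prev["acoustic"])
    # Decoder network: first attention state + first context vector -> first acoustic feature.
    acoustic, dec_state = decoder(prev["dec_state"], context, attn_state)
    return acoustic, {"attn_state": attn_state, "context": context,
                      "acoustic": acoustic, "dec_state": dec_state}
```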

In some embodiments of the application, the inputting the text representation sequence into a base attention network of a conversion model to obtain a first attention state, a first context vector, and a base attention score matrix for a current time step includes: determining a first attention state of the current time step according to the second attention state, the second context vector and the second acoustic feature of the previous time step; determining a basic attention scoring matrix according to the text representation sequence, the first attention state and the sequence position of the current time step; a first context vector is determined based on the base attention score matrix and the sequence of text representations.

In some embodiments of the present application, the text representation sequence includes a plurality of text representation vectors corresponding to respective sequence positions; determining the first context vector based on the basic attention scoring matrix and the text representation sequence comprises: performing, according to the attention weight corresponding to each sequence position in the basic attention scoring matrix, weighted summation on the text representation vectors corresponding to the sequence positions to obtain the first context vector.
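As an illustration of the weighted summation just described (a minimal numpy sketch; the helper name is hypothetical):

```python
import numpy as np

def first_context_vector(attention_weights, text_repr):
    # attention_weights: one attention weight per sequence position, shape (L,)
    # text_repr: one text representation vector per sequence position, shape (L, d)
    # Weighted summation over sequence positions yields the first context vector, shape (d,).
    return np.sum(attention_weights[:, None] * text_repr, axis=0)
```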

In some embodiments of the present application, the inputting the first attention state and the first context vector of the current time step into the decoder network of the conversion model to obtain the first acoustic feature includes: acquiring a second decoder state of the previous time step; and inputting the second decoder state, the first context vector and the first attention state to the decoder network to obtain the first acoustic feature.

In some embodiments of the present application, the inputting the second decoder state, the first context vector, and the first attention state into the decoder network to obtain the first acoustic feature comprises: determining a first decoder state based on the second decoder state, the first context vector, and the first attention state; the first decoder state is converted into a first acoustic feature based on a preset affine function.
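For illustration only, the decoder step described above might be sketched as follows; the single-layer tanh state update is an assumption (the application only specifies that the three inputs determine the first decoder state), and the last computation stands for the preset affine function:

```python
import numpy as np

def decode_step(W_state, W_affine, b_affine, prev_dec_state, context, attn_state):
    # Combine the second decoder state, the first context vector and the first
    # attention state into the first decoder state (assumed single-layer update).
    dec_input = np.concatenate([prev_dec_state, context, attn_state])
    dec_state = np.tanh(W_state @ dec_input)
    # Preset affine function: map the first decoder state to the first acoustic feature.
    acoustic = W_affine @ dec_state + b_affine
    return acoustic, dec_state
```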

In some embodiments of the present application, the method further comprises: inputting the first acoustic feature into a post-processing network of a conversion model to obtain a first feature to be converted; and inputting the first feature to be converted into a preset vocoder to obtain audio data corresponding to the text sequence to be converted.

The embodiment of the application provides a conversion model training method, which comprises the following steps: acquiring sample data; the sample data comprises a sample text sequence; inputting the sample text sequence into an encoder network of a conversion model to obtain a sample representation sequence; inputting the sample representation sequence into a basic attention network of the conversion model to obtain a sample basic score matrix of the current time step; inputting the sample representation sequence into at least one guiding attention network of the conversion model to obtain a sample guiding score matrix of the current time step output by each guiding attention network; determining a first loss value corresponding to each guiding attention network according to the sample basic score matrix and the sample guiding score matrix output by each guiding attention network; the first loss value is used to characterize a distance between the sample guiding score matrix output by the guiding attention network and the sample basic score matrix; and adjusting the model parameters of the conversion model by using the first loss value corresponding to each guiding attention network to obtain the trained conversion model.
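A minimal sketch of the first loss values described above; the application only specifies a distance between the two score matrices, and the mean absolute difference used here is an assumed concrete choice:

```python
import numpy as np

def first_loss(sample_base_scores, sample_guide_scores):
    # Distance between a sample guiding score matrix and the sample base score matrix.
    return np.mean(np.abs(sample_base_scores - sample_guide_scores))

def guide_losses(sample_base_scores, guide_score_matrices):
    # One first loss value per guiding attention network.
    return [first_loss(sample_base_scores, g) for g in guide_score_matrices]
```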

In some embodiments of the present application, the inputting the sample representation sequence into a basic attention network of the conversion model to obtain a sample basic score matrix of the current time step includes: determining a third attention state of the current time step according to the fourth attention state, the fourth context vector and the fourth acoustic feature of the previous time step; and determining the sample basic score matrix according to the sample representation sequence, the third attention state and the sequence position of the current time step.

In some embodiments of the present application, the sample data further comprises sample acoustic features corresponding to the sample text sequence; the method further comprises the following steps: determining a third context vector according to the sample basic score matrix and the third attention state; inputting the third attention state and the third context vector into a decoder network of the conversion model to obtain a third acoustic feature; the third acoustic feature is used for synthesizing audio data corresponding to the sample text sequence; determining a second loss value according to the third acoustic feature and the sample acoustic features; the adjusting the model parameters of the conversion model by using the first loss value corresponding to each guiding attention network to obtain the trained conversion model comprises: and adjusting the model parameters of the conversion model by using the second loss value and the first loss value corresponding to each guiding attention network to obtain the trained conversion model.
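A sketch of the combined training objective, reusing the guide_losses helper from the sketch above; the L1 reconstruction term and the weighting factor are assumptions not specified by the application:

```python
import numpy as np

def total_loss(sample_acoustic, third_acoustic, sample_base_scores,
               guide_score_matrices, guide_weight=1.0):
    # Second loss value: distance between the third acoustic feature and the
    # sample acoustic feature (mean absolute error is an assumed choice).
    second_loss = np.mean(np.abs(sample_acoustic - third_acoustic))
    # First loss values of all guiding attention networks, combined with the
    # second loss value to adjust the model parameters.
    first_losses = guide_losses(sample_base_scores, guide_score_matrices)
    return second_loss + guide_weight * sum(first_losses)
```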

In some embodiments of the present application, in the case where the guiding attention network is a forward attention network, the inputting the sample representation sequence to at least one guiding attention network of the conversion model to obtain a sample guiding score matrix of the current time step output by each guiding attention network includes: determining a first alignment parameter of the current time step based on the sample basic score matrix of the current time step and a second alignment parameter of the previous time step; normalizing the first alignment parameter to obtain a first guidance score matrix output by the forward attention network; the determining a first loss value corresponding to each guiding attention network according to the sample basic score matrix and the sample guiding score matrix output by each guiding attention network comprises: and determining a first loss value corresponding to the forward attention network according to the sample basic score matrix and the first guidance score matrix.

In some embodiments of the present application, the first alignment parameter comprises a first sub-parameter corresponding to each sequence position; determining the first alignment parameter of the current time step based on the sample basic score matrix of the current time step and the second alignment parameter of the previous time step comprises: determining a first sub-parameter corresponding to each sequence position of the current time step based on the attention weight corresponding to each sequence position in the sample basic score matrix and a second sub-parameter corresponding to each sequence position in the second alignment parameter.
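As a non-limiting illustration, one step of such a forward attention guiding network might be written as follows; the exact combination rule (the standard forward-attention recursion) is an assumption, since the application only states that each first sub-parameter is determined from the base attention weights and the second sub-parameters:

```python
import numpy as np

def forward_attention_step(base_weights, prev_alignment):
    # base_weights: attention weights of the current time step from the sample
    # basic score matrix, shape (L,).
    # prev_alignment: second alignment parameter of the previous time step, shape (L,).
    # Each first sub-parameter combines the previous sub-parameters at the same
    # position and at the preceding position with the current base weight.
    shifted = np.concatenate(([0.0], prev_alignment[:-1]))
    alignment = (prev_alignment + shifted) * base_weights
    # Normalization yields the guiding scores for this time step.
    return alignment / (alignment.sum() + 1e-8)
```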

In some embodiments of the present application, in the case that the guiding attention network is a Gaussian attention network, the inputting the sample representation sequence to at least one guiding attention network of the conversion model to obtain a sample guiding score matrix of the current time step output by each guiding attention network includes: acquiring a first mean parameter, a first variance parameter and a first offset parameter of the current time step according to the third attention state; determining a Gaussian mixture distribution according to the first mean parameter, the first variance parameter and the first offset parameter; obtaining a second guidance score matrix output by the Gaussian attention network based on the Gaussian mixture distribution; the determining a first loss value corresponding to each guiding attention network according to the sample basic score matrix and the sample guiding score matrix output by each guiding attention network comprises: and determining a first loss value corresponding to the Gaussian attention network according to the sample basic score matrix and the second guidance score matrix.

In some embodiments of the present application, the obtaining a first mean parameter, a first variance parameter and a first offset parameter of the current time step according to the third attention state includes: converting the third attention state into a mean intermediate parameter, a variance intermediate parameter and an offset intermediate parameter by a multi-layer perceptron; inputting the variance intermediate parameter into an exponential function to obtain the first variance parameter; inputting the offset intermediate parameter into a first activation function to obtain the first offset parameter; and inputting the mean intermediate parameter into a second activation function, and determining the first mean parameter according to the parameter output by the second activation function and the second mean parameter of the previous time step.
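A minimal sketch of such a Gaussian mixture attention step; the softplus activation functions and the use of the offset parameter as the mixture weight are assumptions, since the application only names a first and a second activation function:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def gaussian_attention_step(mlp, third_attn_state, prev_mean, positions):
    # The multi-layer perceptron maps the third attention state to mean,
    # variance and offset intermediate parameters (each of length K).
    mean_mid, var_mid, offset_mid = mlp(third_attn_state)
    variance = np.exp(var_mid)              # exponential function
    offset = softplus(offset_mid)           # assumed first activation function
    mean = prev_mean + softplus(mean_mid)   # assumed second activation function + previous mean
    # Gaussian mixture distribution evaluated over the sequence positions;
    # treating the offset parameter as the mixture weight is an assumption.
    scores = np.sum(offset[:, None] *
                    np.exp(-(positions[None, :] - mean[:, None]) ** 2
                           / (2.0 * variance[:, None])), axis=0)
    return scores / (scores.sum() + 1e-8), mean
```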

In some embodiments of the present application, the method further comprises: inputting the third acoustic feature into a post-processing network of the conversion model to obtain a second feature to be converted; determining a third loss value according to the second feature to be converted and the sample acoustic features; the adjusting the model parameters of the conversion model by using the first loss value corresponding to each guiding attention network to obtain the trained conversion model comprises: and adjusting the model parameters of the conversion model by using the third loss value and the first loss value corresponding to each guiding attention network to obtain the trained conversion model.

The embodiment of the application provides an acoustic feature conversion device, the device includes: the first input module is used for inputting the text sequence to be converted into the encoder network of the conversion model to obtain a text representation sequence; the text sequence to be converted comprises rhyme characteristic information; the second input module is used for inputting the text representation sequence into a basic attention network of the conversion model to obtain a first attention state, a first context vector and a basic attention scoring matrix of the current time step; the third input module is used for inputting the first attention state and the first context vector of the current time step into a decoder network of the conversion model to obtain a first acoustic feature; the first acoustic feature is used for synthesizing audio data corresponding to the text sequence to be converted; wherein the loss function of the conversion model in the training process is related to the first loss value corresponding to at least one guiding attention network; the first loss value is used to characterize a distance between the guiding attention score matrix output by the guiding attention network and the basic attention scoring matrix.

The embodiment of the application provides a conversion model training device, the device includes: the encoding module is used for acquiring sample data, the sample data comprising a sample text sequence, and for inputting the sample text sequence into an encoder network of a conversion model to obtain a sample representation sequence. The basic attention module is used for inputting the sample representation sequence into a basic attention network of the conversion model to obtain a sample basic score matrix of the current time step. The guiding attention module is used for inputting the sample representation sequence into at least one guiding attention network of the conversion model to obtain a sample guiding score matrix of the current time step output by each guiding attention network. The parameter adjusting module is used for determining a first loss value corresponding to each guiding attention network according to the sample basic score matrix and the sample guiding score matrix output by each guiding attention network, the first loss value being used to characterize a distance between the sample guiding score matrix output by the guiding attention network and the sample basic score matrix, and for adjusting the model parameters of the conversion model by using the first loss value corresponding to each guiding attention network to obtain the trained conversion model.

An embodiment of the present application provides a computer device, including:

a memory for storing executable instructions;

and the processor is used for realizing the acoustic feature conversion method or the conversion model training method provided by the embodiment of the application when the executable instructions stored in the memory are executed.

The embodiment of the present application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to implement the acoustic feature conversion method or the conversion model training method provided in the embodiments of the present application.

The embodiment of the application has the following beneficial effects:

In the embodiments of the present application, the conversion model converts the text sequence to be converted into a first acoustic feature from which audio data can be synthesized, and the model parameters of the conversion model are adjusted during training based on the first loss value corresponding to at least one guiding attention network, so that the conversion model inherits the advantages of these guiding attention networks. Therefore, the conversion accuracy of the conversion model and the quality of the acoustic features can be improved, and the range of applicable scenarios is enlarged. In addition, in practical applications, this can provide technical support for good user interaction: more accurate audio information can be obtained by using the acoustic features converted by the method of the embodiments of the present application, which facilitates use by users and improves the user experience.

Drawings

Fig. 1A is an alternative architecture diagram of an acoustic feature transformation system provided by an embodiment of the present application;

FIG. 1B is a block diagram of an alternative architecture of a vehicle-mounted speech synthesis system according to an embodiment of the present invention;

fig. 2A is a schematic structural diagram of an acoustic feature conversion apparatus provided in an embodiment of the present application;

fig. 2B is a schematic structural diagram of a transformation model training device provided in an embodiment of the present application;

Fig. 3 is a schematic flow chart of an alternative acoustic feature transformation method provided in the embodiment of the present application;

fig. 4 is a schematic flow chart of an alternative acoustic feature transformation method provided in the embodiment of the present application;

fig. 5 is a schematic flow chart of an alternative acoustic feature transformation method provided in the embodiment of the present application;

FIG. 6 is a schematic flow chart diagram illustrating an alternative transformation model training method according to an embodiment of the present disclosure;

FIG. 7 is a schematic flow chart diagram illustrating an alternative transformation model training method according to an embodiment of the present disclosure;

FIG. 8 is a schematic flow chart diagram illustrating an alternative transformation model training method according to an embodiment of the present disclosure;

FIG. 9 is a schematic flow chart diagram illustrating an alternative transformation model training method according to an embodiment of the present disclosure;

FIG. 10 is a schematic flow chart diagram illustrating an alternative transformation model training method according to an embodiment of the present disclosure;

fig. 11 is a scene schematic diagram of an alternative cloud service scenario provided by an embodiment of the present application;

FIG. 12 is a scene schematic diagram of an alternative customized voice scene provided by an embodiment of the present application;

fig. 13 is an alternative architecture diagram of a speech synthesis system according to an embodiment of the present application.

Detailed Description

In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail below with reference to the attached drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.

In the following description, the terms "first/second/third" are used merely to distinguish similar objects and do not represent a specific ordering of the objects. It is understood that "first/second/third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the present application described herein can be implemented in an order other than that shown or described herein.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.

The scheme provided by the embodiment of the application relates to an artificial intelligence technology, and is specifically explained by the following embodiment:

AI is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science, attempting to understand the essence of intelligence and producing a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning. The embodiments of the present application relate to machine learning technology.

Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer can simulate or realize human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.

Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.

(1) Speech synthesis: also known as Text-to-Speech (TTS), its function is to convert computer-generated or externally input text information into intelligible, fluent speech and read it aloud.

(2) Spectrum: a spectrum is the frequency-domain representation of a time-domain signal and can be obtained by applying a Fourier transform to the signal. The result consists of two graphs with frequency on the horizontal axis and amplitude or phase on the vertical axis. In speech synthesis applications, the phase information is often discarded and only the amplitude information at the different frequencies is kept.
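For illustration only (a minimal numpy sketch, not part of the present application), the magnitude spectrum of one signal frame can be computed as follows:

```python
import numpy as np

def magnitude_spectrum(frame):
    # The Fourier transform of a signal frame gives a complex spectrum carrying
    # both amplitude and phase; only the amplitude per frequency is kept here.
    return np.abs(np.fft.rfft(frame))
```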

(3) Fundamental frequency: in acoustics, the fundamental frequency refers to the frequency of the fundamental tone in a complex tone, denoted by the symbol F0. Among the tones that make up a complex tone, the fundamental tone has the lowest frequency and the highest intensity. The fundamental frequency determines the pitch of a sound. The frequency of speech usually refers to the frequency of the fundamental tone.

(4) Vocoder: the term vocoder is a contraction of "voice encoder". A vocoder, also known as a speech signal analysis and synthesis system, converts acoustic features into sound.

(5) GMM: a Gaussian Mixture Model is an extension of a single Gaussian probability density function; it uses multiple Gaussian probability density functions to statistically model the distribution of variables more accurately.

(6) DNN: a Deep Neural Network is a discriminative model, essentially a multi-layer perceptron (MLP) with more than two hidden layers. Every node other than the input nodes is a neuron with a nonlinear activation function, and, like an MLP, a DNN can be trained using the back-propagation algorithm.

(7) CNN: a Convolutional Neural Network is a type of feed-forward neural network whose neurons respond to elements within their receptive fields. A CNN is generally composed of multiple convolutional layers with fully-connected layers on top; by sharing parameters it reduces the number of model parameters, which makes CNNs widely used in image and speech recognition.

(8) RNN: a Recurrent Neural Network is a class of neural networks that takes sequence data as input, recurses along the direction in which the sequence evolves, and has all of its nodes (recurrent units) connected in a chain.

(9) LSTM: a Long Short-Term Memory network is a recurrent neural network that adds a cell for judging whether information is useful. An input gate, a forget gate and an output gate are placed in each cell. After information enters the LSTM, it is judged according to these rules: information that passes the check is kept, while information that does not is discarded through the forget gate. The network is suitable for processing and predicting important events with relatively long intervals and delays in a time series.

(10) GRU: a Gated Recurrent Unit is a kind of recurrent neural network. Like the LSTM, it was proposed to address the problems of long-term memory and of gradients in back propagation. Compared with the LSTM, the GRU has one fewer gate and fewer parameters, yet in most cases it achieves results comparable to the LSTM while effectively reducing computation time.

(11) CTC: Connectionist Temporal Classification is a time-series classification algorithm whose advantage is the automatic alignment of unaligned data. It is mainly used on sequential data that are not aligned in advance, such as in speech recognition and optical character recognition.

Referring to fig. 1A, fig. 1A is an alternative architecture diagram of an acoustic feature transformation system 100 provided in an embodiment of the present application. In order to support an acoustic feature transformation application, a terminal 400-1 is connected to a server 200 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of the two. Fig. 1A further shows that the server 200 may be a server cluster including servers 200-1 to 200-3; similarly, the servers 200-1 to 200-3 may be physical machines, or virtual machines constructed by using virtualization technologies (such as container technology and virtual machine technology), which is not limited in this embodiment. Of course, a single server may also be used in this embodiment to provide the service.

In some embodiments of the present application, any form of terminal 400-1 may access the server 200 for providing the voice synthesis service through the network 300. After the terminal normally accesses the server 200, the text to be synthesized is sent to the server 200, and after the server 200 performs fast synthesis, the corresponding synthesized audio is sent to the terminal in a streaming or sentence returning mode. The one-time complete speech synthesis process comprises the following steps: the terminal uploads a text sequence to be converted to the server 200, and the server 200 inputs the text sequence to be converted to an encoder network of a conversion model after receiving the text sequence to be converted to obtain a text representation sequence; the text sequence to be converted comprises rhyme characteristic information; inputting the text representation sequence into a basic attention network of a conversion model to obtain a first attention state, a first context vector and a basic attention scoring matrix of the current time step; inputting a first attention state and a first context vector of a current time step into a decoder network of a conversion model to obtain a first acoustic feature; rapidly synthesizing audio corresponding to the text sequence to be converted through the first acoustic characteristics, and completing processing operations such as audio compression and the like; the server returns the audio to the terminal in a streaming or sentence returning mode, and the terminal can perform smooth and natural voice playing after receiving the audio.

Taking a vehicle-mounted terminal scenario as an example, please refer to fig. 1B, which is an optional architecture schematic diagram of a vehicle-mounted speech synthesis system provided in an embodiment of the present invention. In order to support an exemplary application, a vehicle device 11 is any vehicle running on a road, and a vehicle-mounted device 12 (for example, the central control system of the vehicle, a vehicle-mounted computer, or the like) is provided in the vehicle device 11. The vehicle-mounted device 12 may establish a connection with a terminal 400-2 in a wired/wireless manner, and may also be connected with a server 200 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of the two. Fig. 1B also shows that the server 200 may be a server cluster including servers 200-1 to 200-3; similarly, the servers 200-1 to 200-3 may be physical machines or virtual machines constructed by using virtualization technologies (e.g., container technology, virtual machine technology, etc.). The terminal 400-2 may be a mobile device such as a mobile phone, a tablet or a wearable device.

In the related art, once the vehicle-mounted device receives a message, it either places the message in a message queue and immediately starts a message reminder, similar to the experience on a mobile phone or personal computer, or it does not remind at all and waits for the user to actively check for new messages. For the message reminding problem in the vehicle-mounted terminal scenario, a speech synthesis system that includes the acoustic feature conversion method can convert the message into audio and notify the user, which reduces interference with the user and eliminates potential safety hazards.

In some embodiments of the present application, a voice synthesis system is disposed in the vehicle-mounted device 12, and after the vehicle-mounted device 12 receives a message that needs to be broadcasted by voice, a text sequence to be converted corresponding to the message may be acquired by using the voice synthesis system preset in the vehicle-mounted device 12, and the text sequence to be converted is input to an encoder network of a conversion model to obtain a text representation sequence; inputting the text representation sequence into a basic attention network of a conversion model to obtain a first attention state, a first context vector and a basic attention scoring matrix of the current time step; inputting a first attention state and a first context vector of a current time step into a decoder network of a conversion model to obtain a first acoustic feature; and rapidly synthesizing the audio corresponding to the text sequence to be converted through the first acoustic characteristics, and completing processing operations such as audio compression and the like. The audio may then be output by an audio playback device built in the vehicle device, or the audio may be sent to the vehicle device 11 in a streaming or sentence-back manner and output by an audio playback device in the vehicle device 11.

In some embodiments of the present application, after receiving a message that needs to be broadcasted by voice, the in-vehicle device 12 uploads a text sequence to be converted that needs to be synthesized to the server 200, and after receiving the text sequence to be converted, the server 200 inputs the text sequence to be converted to an encoder network of a conversion model to obtain a text representation sequence; the text sequence to be converted comprises rhyme characteristic information; inputting the text representation sequence into a basic attention network of a conversion model to obtain a first attention state, a first context vector and a basic attention scoring matrix of the current time step; inputting a first attention state and a first context vector of a current time step into a decoder network of a conversion model to obtain a first acoustic feature; rapidly synthesizing audio corresponding to the text sequence to be converted through the first acoustic characteristics, and completing processing operations such as audio compression and the like; the server returns the audio to the vehicular apparatus 11 by streaming or sentence return. The audio may then be output by an audio playback device built in the vehicle device, or the audio may be sent to the vehicle device 11 in a streaming or sentence-back manner and output by an audio playback device in the vehicle device 11.

In some embodiments of the present application, the in-vehicle device 12 may receive the message that the voice broadcast is required in the following manner. The vehicle-mounted device 12 is connected to the terminal 400-2, and the terminal 400-2 may receive information that needs to be played in voice, where the information may include at least one of: short message information sent by other terminals, information received by instant messaging software, push information of any application software, information to be played specified by a user and the like. After receiving the information that needs to be played, the terminal transmits the information to the in-vehicle device 12, and the in-vehicle device 12 may synthesize the information into audio through the speech synthesis system provided in the above embodiment, and output the audio through the audio playing device of the in-vehicle device 12 or the vehicle device 11.

Referring to fig. 2A, fig. 2A is a schematic structural diagram of an acoustic feature conversion apparatus 500 provided in an embodiment of the present application, and the acoustic feature conversion apparatus 500 shown in fig. 2A includes: at least one processor 510, memory 550, at least one network interface 520, and a user interface 530. The various components in the acoustic feature translation device 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable communications among the components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various buses are labeled as bus system 540 in figure 2A.

Referring to fig. 2B, fig. 2B is a schematic structural diagram of a transformation model training apparatus 600 according to an embodiment of the present application, where the transformation model training apparatus 600 shown in fig. 2B includes: at least one processor 610, memory 650, at least one network interface 620, and a user interface 630. The various components in the conversion model training apparatus 600 are coupled together by a bus system 640. It is understood that bus system 640 is used to enable communications among the components. Bus system 640 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 640 in fig. 2B.

The processor 510/610 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.

The user interface 530/630 includes one or more output devices 531/631, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 530/630 also includes one or more input devices 532/632, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.

The memory 550/650 may be a volatile memory or a nonvolatile memory, and may also include both volatile and nonvolatile memories. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 550/650 described in the embodiments disclosed herein is intended to comprise any suitable type of memory. The memory 550/650 optionally includes one or more storage devices physically remote from the processor 510/610.

In some embodiments of the present application, memory 550/650 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.

An operating system 551/651, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and for handling hardware-based tasks;

a network communication module 552/652 for reaching other computing devices via one or more (wired or wireless) network interfaces 520/620, an exemplary network interface 520/620 includes: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), etc.;

a display module 553/653 to enable presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 531/631 (e.g., display screens, speakers, etc.) associated with the user interface 530/630;

an input processing module 554/654 for detecting one or more user inputs or interactions from one of the one or more input devices 532/632 and translating the detected inputs or interactions.

In some embodiments of the present application, the acoustic feature transformation apparatus/transformation model training apparatus provided in the embodiments of the present application may be implemented by a combination of software and hardware, and as an example, the acoustic feature transformation apparatus/transformation model training apparatus provided in the embodiments of the present application may be a processor in the form of a hardware decoding processor, which is programmed to execute the acoustic feature transformation method/transformation model training method provided in the embodiments of the present application.

In some embodiments of the present application, the acoustic feature transformation apparatus/transformation model training apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2A illustrates an acoustic feature transformation apparatus 555 stored in a memory 550, which may be software in the form of programs and plug-ins, and includes the following software modules: a first input module 5551, a second input module 5552, and a third input module 5553. Fig. 2B shows a conversion model training means 655 stored in memory 650, which may be software in the form of programs and plug-ins, etc., comprising the following software modules: an encoding module 6551, a base attention module 6552, an attentional module 6553 and a parameter tuning module 6554, which are logical and thus may be arbitrarily combined or further separated depending on the functionality implemented.

The functions of the respective modules will be explained below.

In other embodiments, the apparatus provided in the embodiments of the present application may be implemented in hardware. As an example, the apparatus provided in the embodiments of the present application may be a processor in the form of a hardware decoding processor, which is programmed to perform the acoustic feature conversion method provided in the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.

In the following, the acoustic feature conversion method provided in the embodiments of the present application will be described in combination with the exemplary application and implementation of the server provided in the embodiments of the present application, with the server as the execution subject.

Referring to fig. 3, fig. 3 is an alternative flow chart of an acoustic feature transformation method provided in an embodiment of the present application, which will be described with reference to the steps shown in fig. 3.

In step 301, a text sequence to be converted is input to an encoder network of a conversion model to obtain a text representation sequence; the text sequence to be converted comprises rhyme characteristic information.

In some embodiments of the present application, the rhyme feature information included in the text sequence to be converted may include at least one of: phonemes, tones, and prosodic boundaries. A phoneme is the smallest speech unit divided according to the natural attributes of speech; it is analyzed according to the articulatory actions within a syllable, and one action forms one phoneme. Phonemes are divided into two major categories, vowels and consonants. For example, for Chinese, phonemes include initials (consonants that precede a final and form a complete syllable together with the final) and finals (i.e., vowels). For English, phonemes include vowels and consonants. Tone refers to the rise and fall in pitch of a sound. Illustratively, Chinese has four tones: yin ping (first tone), yang ping (second tone), shang sheng (third tone) and qu sheng (fourth tone); English has stressed, secondary-stressed and unstressed syllables; Japanese has stressed and unstressed syllables. Prosodic boundaries are used to indicate where pauses should be made when reading text aloud. Illustratively, prosodic boundaries are divided into different pause levels such as "#1", "#2", "#3" and "#4", whose pause degrees increase in turn.

In some embodiments of the present application, before step 301 is performed, an original text may be obtained and processed to obtain the text sequence to be converted, where the original text is a character sequence in any language. Taking English as an example, the original text is "The sky is blue"; taking Chinese as an example, the original text is the Chinese sentence meaning "the sky is blue".

In some embodiments of the present application, the original text may be processed through a preset phonological feature conversion model to obtain the text sequence to be converted, which carries the phonological feature information. The phonological feature conversion model may include a text regularization (TN) model, a Grapheme-to-Phoneme (G2P) model, a word segmentation model, and a prosody model. The numbers, symbols, abbreviations, and the like in the original text can be converted into language words through the TN model; the phonemes of the multilingual text are obtained through the G2P model; the multilingual text is segmented through the word segmentation model; and the prosodic boundaries and tones of the multilingual text are obtained through the prosody model.

The G2P model may use a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) network to implement the conversion from graphemes to phonemes. The word segmentation model may be an n-gram model, a hidden Markov model, a naive Bayes classification model, or the like. The prosody model may be a pre-trained language model such as BERT (Bidirectional Encoder Representations from Transformers), a bidirectional LSTM-CRF (Conditional Random Field) model, or the like.

For example, taking the original text as "sky is blue" (in Chinese), the original text may be processed through the preset phonological feature conversion model, and the resulting text sequence to be converted may be "tiānkōng #1 shì #2 lánsè"; it can be seen that the text sequence to be converted includes phonemes, tones, and prosodic boundaries.
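As an illustrative sketch only, the front-end pipeline can be composed as follows; the pypinyin package serves here as a stand-in for the G2P step, and predict_prosody is a hypothetical placeholder for a BERT or bidirectional LSTM-CRF prosody model, so the prosodic labels in the example output are dummy values.

```python
# Hypothetical front-end composition; pypinyin is a stand-in G2P step,
# and predict_prosody is a placeholder for a real prosody model.
from pypinyin import lazy_pinyin, Style

def predict_prosody(words):
    # Placeholder: a real prosody model would predict #1-#4 boundaries per word.
    return ["#1"] * (len(words) - 1) + ["#4"]

def text_to_phonological_sequence(words):
    """Convert segmented Chinese words into a phoneme + tone + prosody token sequence."""
    boundaries = predict_prosody(words)
    tokens = []
    for word, boundary in zip(words, boundaries):
        # Style.TONE3 appends the tone digit to each syllable, e.g. "tian1".
        tokens.extend(lazy_pinyin(word, style=Style.TONE3))
        tokens.append(boundary)
    return " ".join(tokens)

print(text_to_phonological_sequence(["天空", "是", "蓝色"]))
# e.g. "tian1 kong1 #1 shi4 #1 lan2 se4 #4" (the prosody labels here are dummy values)
```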

In some embodiments of the present application, the encoder network is configured to convert the original text sequence to be converted, which carries the rhyme feature information, into a text representation sequence that can be recognized by the attention network, where the attention network may be the base attention network in step 302. The encoder network may be a CBHG (Convolution Bank + Highway network + bidirectional Gated Recurrent Unit, i.e., convolutional layers + highway network + bidirectional recurrent neural network) network. By means of the CBHG network, overfitting can be reduced, and the obtained first acoustic feature can be made closer to the real acoustic feature.

Wherein, the above step 301 can be realized by formula (1-1):

h1:L=CBHGEncoder(x1:L) Formula (1-1);

wherein, h1:L represents the text representation sequence, x1:L represents the text sequence to be converted, CBHGEncoder (.) is the CBHG network, and L is the sequence length.

In some embodiments of the present application, the CBHG network consists of convolutional layers, a highway network, and a bidirectional recurrent neural network.
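A minimal PyTorch sketch of such a CBHG-style encoder is given below; the layer sizes are illustrative assumptions, and the max-pooling and projection details of a full CBHG implementation are omitted for brevity.

```python
import torch
import torch.nn as nn

class CBHGEncoder(nn.Module):
    """Sketch of a CBHG-style encoder: convolution bank + highway layers + bidirectional GRU."""
    def __init__(self, vocab_size, dim=128, bank_k=8, num_highway=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Bank of 1-D convolutions with kernel sizes 1..bank_k.
        self.conv_bank = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2) for k in range(1, bank_k + 1)]
        )
        self.proj = nn.Conv1d(bank_k * dim, dim, kernel_size=3, padding=1)
        self.highway = nn.ModuleList([nn.Linear(dim, 2 * dim) for _ in range(num_highway)])
        self.bigru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, x_ids):                      # x_ids: (batch, L) integer phonological symbols
        x = self.embed(x_ids).transpose(1, 2)      # (batch, dim, L)
        L = x.size(-1)
        bank = torch.cat([torch.relu(conv(x))[..., :L] for conv in self.conv_bank], dim=1)
        y = torch.relu(self.proj(bank)).transpose(1, 2) + x.transpose(1, 2)  # residual connection
        for layer in self.highway:                 # highway network
            h, t = layer(y).chunk(2, dim=-1)
            gate = torch.sigmoid(t)
            y = gate * torch.relu(h) + (1.0 - gate) * y
        h_seq, _ = self.bigru(y)                   # (batch, L, 2*dim) text representation sequence
        return h_seq
```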

In step 302, the text representation sequence is input to the basic attention network of the conversion model, and a first attention state, a first context vector and a basic attention score matrix of the current time step are obtained.

In some embodiments of the present application, the base attention network may include an attention RNN module (Attention-RNN) and a base attention module. After the text representation sequence is input into the base attention network of the conversion model, the attention RNN module obtains the first attention state, the base attention module determines the base attention score matrix based on the first attention state and the text representation sequence, and the first context vector is then determined based on the base attention score matrix and the text representation sequence.

The text representation sequence is input into the base attention network sequentially in time order, and this time order embodies the concept of time steps. The first attention state is the hidden state of the hidden-layer neurons of the attention RNN module at the current time step, and the hidden state is an intermediate value connecting the neurons of the hidden layers.

In step 303, inputting the first attention state and the first context vector of the current time step into a decoder network of a conversion model to obtain a first acoustic feature; the first acoustic feature is used for synthesizing audio data corresponding to the text sequence to be converted.

In some embodiments of the application, the Decoder network may include a Decoder RNN module (Decoder-RNN) that, upon input of the first attention state and the first context vector for the current time step to the Decoder network, may determine the first acoustic feature based on the first attention state and the first context vector for the current time step.

In some embodiments of the present application, the first acoustic feature is used to synthesize the audio data corresponding to the text sequence to be converted, where the first acoustic feature may be presented in the form of a compressed spectrogram, for example, a magnitude spectrogram (magnitude spectra), a Mel-frequency spectrogram (Mel spectra), or the like. Using a compressed spectrogram instead of, for example, the raw spectrogram reduces redundancy, thereby reducing the amount of computation and the time required by the training process and the feature conversion process.
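For instance, a log-mel spectrogram of this kind can be computed from a reference waveform with librosa (useful as the training target for such features); the sampling rate, hop length, and number of mel bands below are illustrative assumptions rather than values prescribed by this application.

```python
import librosa
import numpy as np

def compute_mel_target(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Compute a log-mel spectrogram to serve as a compressed acoustic feature."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Log compression reduces the dynamic range compared with the raw spectrogram.
    return np.log(np.clip(mel, 1e-5, None))       # shape: (n_mels, frames)
```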

In some embodiments of the present application, a loss function of the transformation model during training is associated with a first loss value corresponding to the at least one mentoring network; the first loss value is used to characterize a distance between a guiding attention score matrix and a base attention score matrix of the guiding attention network output.

Wherein, the training process of the conversion model may include: acquiring sample data; the sample data comprises a sample text sequence; inputting the sample text sequence into an encoder network of a conversion model to obtain a sample representation sequence; inputting the sample representation sequence into a basic attention network of a conversion model to obtain a sample basic score matrix of the current time step; inputting the sample representation sequence into at least one attention directing network of the conversion model to obtain a sample directing score matrix of the current time step output by each attention directing network; determining a first loss value corresponding to each guiding attention network according to the sample basic score matrix and the sample guiding score matrix output by each guiding attention network; the first loss value is used to characterize a distance between a sample guideline score matrix and a sample base score matrix of the guideline attention network output; and adjusting the model parameters of the conversion model by using the first loss value corresponding to each attention directing network to obtain the trained conversion model.

In the training process of the conversion model, the model parameters of the conversion model are adjusted according to the first loss value corresponding to at least one guiding attention network, and the first loss value is used to characterize the distance between the sample guidance score matrix output by that guiding attention network and the sample base score matrix. In the process of converting text into acoustic features with the conversion model, although only the base attention network among the attention networks of the conversion model is used, the model parameters of the conversion model have been adjusted based on at least one guiding attention network, and the characteristics of the at least one guiding attention network are fused into the conversion model, so that the advantages of each guiding attention network are still obtained in the process of converting text into acoustic features.

In some embodiments of the present application, the at least one guiding attention network and the base attention network use different mechanisms. For example, the mechanism of the base attention network may be a location-sensitive attention mechanism; where the base attention network uses the location-sensitive attention mechanism, the mechanism of each guiding attention network is any mechanism other than the location-sensitive attention mechanism. For example, the mechanism of a guiding attention network may be a monotonic attention mechanism (e.g., a forward attention mechanism), a location-based attention mechanism (e.g., a mixed Gaussian attention mechanism or a dynamic convolution attention mechanism), or the like.

Taking as an example a base attention network adopting the location-sensitive attention mechanism, a guiding attention network adopting the forward attention mechanism, and a guiding attention network adopting the mixed Gaussian attention mechanism: during training, the forward attention mechanism provides strong monotonicity guidance for the base attention network, which reduces the adverse effect caused by misalignment during training and improves the training speed and the quality of the generated acoustic features; meanwhile, the mixed Gaussian attention mechanism brings long-sentence synthesis capability to the base attention network, so that the base attention network can synthesize sentences ten times as long as the training data, which greatly improves the robustness of the whole system for long-sentence synthesis.

It should be noted that the at least one guiding attention network in the conversion model is used only in the model training process, for adjusting the model parameters of the conversion model to obtain the trained conversion model. In the process of actually performing text-to-acoustic-feature conversion, after the text sequence to be converted is input into the trained conversion model, only the base attention network is used, and the at least one guiding attention network is not used.

As can be seen from the foregoing exemplary implementation of fig. 3, in the embodiment of the present application, a text sequence to be converted is converted into a first acoustic feature that can synthesize audio data through a conversion model, where a model parameter of the conversion model is adjusted in a training process based on a first loss value corresponding to at least one attention-directing network, so that the conversion model has advantages of each attention-directing network. Therefore, the conversion accuracy of the conversion model can be improved, the quality of acoustic features is improved, and the application scene of the application is enlarged. In addition, in practical application, technical support can be provided for good interaction of users, more accurate audio information can be obtained by using the acoustic features converted by the method of the embodiment of the application, the use by the users is facilitated, and the use experience of the users is improved.

Referring to fig. 4, fig. 4 is an alternative flow chart of an acoustic feature transformation method provided in an embodiment of the present application, and based on fig. 3, step 302 in fig. 3 includes steps 401 to 403, and step 303 includes step 404 and step 405, which will be described in conjunction with the steps shown in fig. 4.

In step 401, the first attention state of the current time step is determined according to the second attention state of the previous time step, the second context vector and the second acoustic feature.

In some embodiments of the present application, step 401 above may be implemented by equation (1-2):

st=AttentionRNN(st-1,ct-1,ot-1) Formula (1-2);

wherein, ct-1 represents the second context vector of the previous time step, st-1 represents the second attention state of the previous time step, ot-1 represents the second acoustic feature of the previous time step, and st represents the first attention state of the current time step.

In step 402, a base attention score matrix is determined based on the sequence of textual representations, the first attention state, and the sequence position of the current time step.

In some embodiments of the present application, step 402 described above may be implemented by equations (1-3):

αt=LSAttention(st,hi,lt) Formula (1-3);

wherein, αt represents the base attention score matrix of the current time step, st represents the first attention state of the current time step, hi represents the text representation sequence, and lt represents the sequence position of the current time step.

In step 403, a first context vector is determined based on the base attention score matrix and the sequence of text representations.

In some embodiments of the present application, the sequence of text representations includes a plurality of text representation vectors corresponding in sequence position. The above step 403 may be implemented according to the following manner: and according to the attention weight corresponding to each sequence position in the basic attention scoring matrix, performing weighted summation on the text expression vector corresponding to each sequence position to obtain a first context vector.

For example, the above step 403 can be realized by the formulas (1-4):

ct=∑iαt,ihi Formula (1-4);

wherein, ct represents the first context vector of the current time step, αt,i represents the attention weight corresponding to sequence position i at the current time step, and hi represents the text representation vector corresponding to sequence position i.

In step 404, a second decoder state for the previous time step is obtained.

In step 405, the second decoder state, the first context vector and the first attention state are input to a decoder network, resulting in a first acoustic feature.

In some embodiments of the present application, the above step 405 may be implemented according to the following manner: determining a first decoder state based on the second decoder state, the first context vector, and the first attention state; the first decoder state is converted into a first acoustic feature based on a preset affine function.

In some embodiments of the present application, step 405 may be implemented by equations (1-5) and (1-6):

dt=DecoderRNN(dt-1,ct,st) Formula (1-5);

wherein d istA first decoder state representing the current time step, dt-1Second decoder state representing last time step, ctA first context vector, s, representing the current time steptA first attention state representing a current time step.

ot=Affine(dt) Formula (1-6);

wherein, dt represents the first decoder state of the current time step, Affine (.) represents the preset affine function, and ot represents the first acoustic feature of the current time step.
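A hedged PyTorch sketch of one decoding step following formulas (1-2) to (1-6) is given below; the scoring function here is a simplified stand-in for the location-sensitive attention (whose convolutional location features are omitted), and all layer dimensions are assumptions rather than values given in the embodiments above.

```python
import torch
import torch.nn as nn

class DecodeStep(nn.Module):
    """One time step: attention RNN -> simplified attention -> context -> decoder RNN -> affine."""
    def __init__(self, enc_dim=256, attn_dim=256, dec_dim=256, n_mels=80):
        super().__init__()
        self.attention_rnn = nn.GRUCell(enc_dim + n_mels, attn_dim)          # formula (1-2)
        self.score = nn.Linear(attn_dim + enc_dim + 1, 1)                    # stand-in for LSAttention
        self.decoder_rnn = nn.GRUCell(enc_dim + attn_dim, dec_dim)           # formula (1-5)
        self.affine = nn.Linear(dec_dim, n_mels)                             # formula (1-6)

    def forward(self, h_seq, s_prev, c_prev, o_prev, d_prev, pos):
        # h_seq: (batch, L, enc_dim); s_prev, c_prev, o_prev, d_prev: previous-step quantities;
        # pos: (batch,) sequence position of the current time step.
        s_t = self.attention_rnn(torch.cat([c_prev, o_prev], dim=-1), s_prev)
        L = h_seq.size(1)
        pos_feat = pos.view(-1, 1, 1).expand(-1, L, 1).float()
        energies = self.score(
            torch.cat([s_t.unsqueeze(1).expand(-1, L, -1), h_seq, pos_feat], dim=-1)
        )
        alpha_t = torch.softmax(energies.squeeze(-1), dim=-1)                # formula (1-3): (batch, L)
        c_t = torch.bmm(alpha_t.unsqueeze(1), h_seq).squeeze(1)              # formula (1-4)
        d_t = self.decoder_rnn(torch.cat([c_t, s_t], dim=-1), d_prev)        # formula (1-5)
        o_t = self.affine(d_t)                                               # formula (1-6)
        return o_t, s_t, c_t, d_t, alpha_t
```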

Referring to fig. 5, fig. 5 is an alternative flow chart of the acoustic feature transformation method provided in the embodiment of the present application, and based on the above embodiment, after step 303 in fig. 3, the method further includes step 501 and step 502, which will be described in conjunction with the steps shown in fig. 5.

In step 501, the first acoustic feature is input to a post-processing network of a conversion model to obtain a first feature to be converted.

In some embodiments of the present application, the conversion model further includes a post-processing network (postnet). The post-processing network may be a CBHG network.

In step 502, the first feature to be converted is input to a preset vocoder, and audio data corresponding to the text sequence to be converted is obtained.

In some embodiments of the present application, the post-processing network may process the first acoustic feature to obtain a first feature to be converted, and may use the first feature to be converted as an input of the vocoder to obtain audio data output by the vocoder, where the audio data is speech synthesized according to the text sequence to be converted.

As can be seen from the above exemplary implementation of fig. 5, the conversion model with the post-processing network provided by the present application can utilize more context information than a speech synthesis model without a post-processing network, and the first feature to be converted output by the post-processing network contains better-resolved harmonics and high-frequency formant structures, which reduces artifacts in the synthesized sound.
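As one hedged possibility for the preset vocoder, Griffin-Lim phase reconstruction can turn a predicted linear magnitude spectrogram into a waveform; a neural vocoder would be wired into the same place, and the parameters below are assumptions.

```python
import librosa
import soundfile as sf

def features_to_audio(mag_spectrogram, out_path="synth.wav", sr=22050, hop_length=256):
    """Stand-in vocoder: Griffin-Lim phase reconstruction from a linear magnitude spectrogram."""
    wav = librosa.griffinlim(mag_spectrogram, hop_length=hop_length)
    sf.write(out_path, wav, sr)
    return wav
```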

Referring to fig. 6, fig. 6 is an alternative flow chart diagram of a conversion model training method provided in the embodiment of the present application, which will be described with reference to the steps shown in fig. 6.

In step 601, sample data is acquired; the sample data comprises a sequence of sample texts.

In step 602, the sample text sequence is input to an encoder network of a conversion model, resulting in a sample representation sequence.

In some embodiments of the present application, formula (1-7) may be employed to convert the sample text sequence into the sample representation sequence:

h′1:L=CBHGEncoder(x′1:L) Formula (1-7);

wherein, h′1:L represents the sample representation sequence, x′1:L represents the sample text sequence, CBHGEncoder (.) is the CBHG network, and L is the sequence length.

In step 603, the sample representation sequence is input to the basic attention network of the conversion model, and a sample basic score matrix of the current time step is obtained.

In step 604, the sample representation sequence is input to at least one mentoring network of the transformation model, resulting in a sample mentoring score matrix for the current time step for each mentoring network output.

In some embodiments of the present application, the at least one guiding attention network and the base attention network use different mechanisms. For example, the mechanism of the base attention network may be a location-sensitive attention mechanism; where the base attention network uses the location-sensitive attention mechanism, the mechanism of each guiding attention network is any mechanism other than the location-sensitive attention mechanism. For example, the mechanism of a guiding attention network may be a monotonic attention mechanism (e.g., a forward attention mechanism), a location-based attention mechanism (e.g., a mixed Gaussian attention mechanism or a dynamic convolution attention mechanism), or the like.

In some embodiments of the present application, in the training of the conversion model, based on each guidance attention network in the conversion model, a sample guidance score matrix corresponding to the guidance attention network is obtained.

In step 605, a first loss value corresponding to each of the mentoring networks is determined according to the sample base score matrix and the sample mentoring score matrix output by each mentoring network.

In some embodiments of the present application, the first loss value is used to characterize the distance between the sample guidance score matrix output by the guiding attention network and the sample base score matrix. In some embodiments, the first loss value corresponding to a guiding attention network may be calculated as the L1 norm between its sample guidance score matrix and the sample base score matrix.
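A minimal sketch of this first loss value, assuming the two score matrices have already been aligned to the same (time steps, sequence length) shape:

```python
import torch
import torch.nn.functional as F

def first_loss(guide_scores: torch.Tensor, base_scores: torch.Tensor) -> torch.Tensor:
    """First loss value: L1 distance between a sample guidance score matrix and the
    sample base score matrix of the same shape."""
    return F.l1_loss(guide_scores, base_scores)
```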

In step 606, the model parameters of the transformed model are adjusted by using the first loss value corresponding to each attention-directing network, so as to obtain the trained transformed model.

As can be seen from the foregoing exemplary implementation of fig. 6, the embodiment of the present application adjusts the model parameters of the conversion model based on the first loss value corresponding to at least one guiding attention network, so that the conversion model has the advantages of each guiding attention network. Therefore, the conversion accuracy of the conversion model can be improved, the quality of the acoustic features is improved, and the application scenarios of the present application are expanded. In addition, in practical applications, because guiding attention networks with different characteristics can be selected according to actual requirements during training, the finally generated conversion model of the embodiment of the present application can be adapted to different application scenarios, and the application range is wider; meanwhile, because the characteristics of the guiding attention networks are fused into the base attention network during training, the guiding attention networks are not needed during use, the framework of the existing conversion model does not need to be adjusted in the model updating process, and the operation, maintenance, and updating costs are lower.

Referring to fig. 7, fig. 7 is an optional flowchart of a conversion model training method provided in the embodiment of the present application, based on fig. 6, step 603 in fig. 6 may include step 701 and step 702, step 604 may include step 703 and step 704, and step 605 may be updated to step 705, which will be described with reference to the steps shown in fig. 7.

In step 701, a third attention state of the current time step is determined according to a fourth attention state of the previous time step, a fourth context vector and a fourth acoustic feature.

In some embodiments of the present application, the above step 701 may be implemented by equations (1-8):

s′t=AttentionRNN(s′t-1,c′t-1,o′t-1) Formulas (1-8);

wherein, c′t-1 represents the fourth context vector of the previous time step, s′t-1 represents the fourth attention state of the previous time step, o′t-1 represents the fourth acoustic feature of the previous time step, and s′t represents the third attention state of the current time step.

In step 702, a sample base score matrix is determined based on the sample representation sequence, the third attention state, and the sequence position of the current time step.

In some embodiments of the present application, step 702 above may be implemented by equations (1-9):

α′t=LSAttention(s′t,h′i,l′t) Formula (1-9);

wherein, α′t represents the sample base score matrix of the current time step, s′t represents the third attention state of the current time step, h′i represents the sample representation sequence, and l′t represents the sequence position of the current time step.

Where the transformation model includes a forward attention network, step 604 may include:

in step 703, a first alignment parameter for the current time step is determined based on the sample base score matrix for the current time step and the second alignment parameter for the previous time step.

In some embodiments of the present application, the first alignment parameter comprises a first sub-parameter corresponding to each sequence position; the above step 703 can be implemented by: and determining a first sub-parameter corresponding to each sequence position of the current time step based on the attention weight corresponding to each sequence position in the sample basic score matrix and a second sub-parameter corresponding to each sequence position in the second alignment parameter.

In some embodiments of the present application, the above step 703 may be implemented by equations (1-10):

et,i=(et-1,i+et-1,i-1)αt,i Formula (1-10);

wherein, et,i represents the first sub-parameter corresponding to sequence position i at the current time step, et-1,i represents the second sub-parameter of the second alignment parameter et-1 of the previous time step at sequence position i, et-1,i-1 represents the second sub-parameter at sequence position i-1, and αt,i represents the attention weight corresponding to sequence position i at the current time step.

In step 704, the first alignment parameter is normalized to obtain a first guidance score matrix corresponding to the forward attention network.

In some embodiments of the present application, the above step 704 may be implemented by formula (1-11):

aft,i=et,i/∑jet,j Formula (1-11);

wherein, aft,i represents the first guidance score matrix corresponding to the forward attention network, and et,i represents the first sub-parameter (i.e., the first alignment parameter) corresponding to sequence position i at the current time step.

In step 705, a first loss value corresponding to the forward attention network is determined according to the sample base score matrix and the first guidance score matrix.

Wherein, the first loss value corresponding to the forward attention network is determined by calculating the distance between aft,i and α′t.
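A numpy sketch of one time step of the forward-attention guidance alignment, following formulas (1-10) and (1-11); the sum-to-one normalization is the usual choice and is stated here as an assumption.

```python
import numpy as np

def forward_attention_step(e_prev: np.ndarray, alpha_t: np.ndarray):
    """One step of the forward (monotonic) guidance alignment.

    e_prev:  alignment sub-parameters of the previous time step, shape (L,)
    alpha_t: base attention weights of the current time step, shape (L,)
    """
    e_shifted = np.concatenate(([0.0], e_prev[:-1]))     # e_{t-1, i-1}, with e_{t-1, -1} = 0
    e_t = (e_prev + e_shifted) * alpha_t                 # formula (1-10)
    af_t = e_t / (e_t.sum() + 1e-8)                      # formula (1-11): normalization
    return e_t, af_t
```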

As can be seen from the foregoing exemplary implementation of fig. 7, in the embodiment of the present application, the model parameters of the conversion model are adjusted through the forward direction guide network, so that adverse effects caused by misalignment in the training process can be reduced, and the training speed and the quality of the generated acoustic features are improved.

Referring to fig. 8, fig. 8 is an optional flowchart of a conversion model training method provided in this embodiment of the present application, based on fig. 6, step 603 in fig. 6 may include step 801 and step 802, step 604 may include step 803, step 804 and step 805, and step 605 may be updated to step 806, which will be described with reference to the steps shown in fig. 8.

In step 801, a third attention state for the current time step is determined based on the fourth attention state, the fourth context vector, and the fourth acoustic feature for the previous time step.

In step 802, a sample base score matrix is determined based on the sample representation sequence, the third attention state, and the sequence position of the current time step.

In some embodiments of the present application, implementation methods of step 801 and step 802 are the same as implementation methods of step 701 and step 702 in the embodiment of fig. 7, and are not described herein again.

In step 803, a first mean parameter, a first variance parameter, and a first deviation parameter of the current time step are obtained according to the third attention state.

In some embodiments of the present application, the step 803 may be implemented by: converting, by the multi-layer perceptron, the third attention state into a mean intermediate parameter, a variance intermediate parameter, and a shift intermediate parameter; obtaining the first variance parameter based on inputting the variance intermediate parameter into an exponential function; inputting the offset intermediate parameter into a first activation function to obtain a first offset parameter; and inputting the mean value intermediate parameter into a second activation function, and determining the first mean value parameter according to the parameter output by the second activation function and the second mean value parameter of the previous time step.

In some embodiments of the present application, the step 803 may be implemented by equations (1-12), equations (1-13), equations (1-14), and equations (1-15):

ω′t,μ′t,σ′t=MLP(s′t) Formulas (1-12);

wherein, s′t represents the third attention state of the current time step, ω′t represents the offset intermediate parameter of the current time step, μ′t represents the mean intermediate parameter of the current time step, σ′t represents the variance intermediate parameter of the current time step, and MLP (.) represents the multi-layer perceptron.

ωt=softmax(ω′t) Formulas (1-13);

wherein, ωt represents the first offset parameter of the current time step, ω′t represents the offset intermediate parameter of the current time step, and softmax (.) represents the first activation function.

σt=exp(σ′t) Formulas (1-14);

wherein, σt represents the first variance parameter of the current time step, σ′t represents the variance intermediate parameter of the current time step, and exp (.) represents the exponential function.

μt=softplus(μ′t)+μt-1 Formulas (1-15);

wherein, μt represents the first mean parameter of the current time step, μ′t represents the mean intermediate parameter of the current time step, μt-1 represents the second mean parameter of the previous time step, and softplus (.) represents the second activation function.

In step 804, a gaussian mixture distribution is determined according to the first mean parameter, the first variance parameter, and the first offset parameter.

In step 805, a second guidance score matrix output by the Gaussian attention network is obtained based on the mixed Gaussian distribution.

In some embodiments of the present application, the above step 805 may be implemented by formula (1-16):

agt,i=∑nωt,nexp(−(i−μt,n)2/(2σt,n2)) Formula (1-16);

wherein, agt,i represents the second guidance score matrix output by the Gaussian attention network, the summation runs over the N Gaussian components of the mixture, and ωt,n, μt,n and σt,n represent the components of the first offset parameter, the first mean parameter, and the first variance parameter corresponding to the n-th Gaussian component.

In step 806, a first loss value corresponding to the Gaussian attention network is determined according to the sample base score matrix and the second guidance score matrix.

Wherein, the first loss value corresponding to the Gaussian attention network is determined by calculating the distance between agt,i and α′t.
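A PyTorch sketch of steps 803 to 805 is given below, assuming a single linear layer as the multi-layer perceptron and an unnormalized Gaussian-mixture score; both simplifications are assumptions rather than requirements of the embodiments above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMGuidanceAttention(nn.Module):
    """Mixture-of-Gaussians guidance attention: MLP -> (offset, mean, scale) -> scores."""
    def __init__(self, attn_dim=256, n_mixtures=5):
        super().__init__()
        self.mlp = nn.Linear(attn_dim, 3 * n_mixtures)       # formula (1-12), single-layer MLP
        self.n_mixtures = n_mixtures

    def forward(self, s_t, mu_prev, seq_len):
        # s_t: (batch, attn_dim) third attention state; mu_prev: (batch, N) mean of previous step.
        w_raw, mu_raw, sigma_raw = self.mlp(s_t).chunk(3, dim=-1)
        w_t = torch.softmax(w_raw, dim=-1)                   # formula (1-13): first offset parameter
        sigma_t = torch.exp(sigma_raw)                       # formula (1-14): interpreted here as a scale
        mu_t = F.softplus(mu_raw) + mu_prev                  # formula (1-15): monotonically moving mean
        pos = torch.arange(seq_len, device=s_t.device).view(1, -1, 1).float()
        # Gaussian-mixture score over sequence positions (one common, unnormalized form).
        scores = (
            w_t.unsqueeze(1)
            * torch.exp(-0.5 * ((pos - mu_t.unsqueeze(1)) / sigma_t.unsqueeze(1)) ** 2)
        ).sum(-1)
        return scores, mu_t                                  # scores: (batch, L)
```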

As can be seen from the foregoing exemplary implementation of fig. 8, in the embodiment of the present application, model parameters of the conversion model are adjusted through a gaussian attention network, so that the conversion model has a long sentence synthesis capability, and the conversion model can process a text sequence ten times as long as training data.

Referring to fig. 9, fig. 9 is an optional flowchart of a conversion model training method provided in this embodiment of the present application, based on fig. 6, step 601 in fig. 6 may be updated to step 901, step 604 may include step 902, step 903, and step 904, the method further includes step 905 and step 906, and step 606 may be updated to step 907. The description will be made in conjunction with the steps shown in fig. 9.

In step 901, sample data is acquired; the sample data comprises a sample text sequence and sample acoustic features corresponding to the sample text sequence.

In some embodiments of the present application, the sample acoustic feature is a standard acoustic feature corresponding to the sample text sequence, and the present application needs to train the conversion model so that any one sample text sequence can be converted to an acoustic feature close to or even identical to the standard acoustic feature. The sample acoustic features can be selected according to different actual scenes, for example, for the same sample text sequence, the sample acoustic features with different style characteristics can be corresponding, the sample acoustic features with the target style characteristics can be selected as labels of the sample text sequence according to the target style characteristics, and then the conversion model is trained, and the obtained trained conversion model can convert the text sequence into the acoustic features with the target style characteristics.

In step 902, a third attention state for the current time step is determined based on the fourth attention state, the fourth context vector, and the fourth acoustic feature for the previous time step.

In step 903, a sample base score matrix is determined based on the sample representation sequence, the third attention state, and the sequence position of the current time step.

In some embodiments of the present application, the implementation methods of step 902 and step 903 are the same as those of step 701 and step 702 in the embodiment of fig. 7, and are not described herein again.

In step 904, a third context vector is determined based on the sample base score matrix and the third attention state.

In some embodiments of the present application, the sample representation sequence includes a plurality of sample representation vectors corresponding to sequence positions. The above step 904 may be implemented as follows: according to the attention weight corresponding to each sequence position in the sample base score matrix, the sample representation vectors corresponding to the sequence positions are weighted and summed to obtain the third context vector.

The above step 904 can be implemented by formula (1-17):

c′t=∑iα′t,ih′i Formula (1-17);

wherein, c′t represents the third context vector of the current time step, α′t,i represents the attention weight corresponding to sequence position i in the sample base score matrix of the current time step, and h′i represents the sample representation vector corresponding to sequence position i.

In step 905, inputting the third attention state and the third context vector to a decoder network of the conversion model to obtain a third acoustic feature; the third acoustic feature is used to synthesize audio data corresponding to the sample text sequence.

In some embodiments of the present application, step 905 described above may be implemented according to the following: determining a third decoder state based on the fourth decoder state, the third context vector, and the third attention state; and converting the third decoder state into a third acoustic feature based on a preset affine function.

In some embodiments of the present application, step 905 may be implemented by equations (1-18) and (1-19):

d′t=DecoderRNN(d′t-1,c′t,s′t) Formulas (1-18);

wherein, d′t represents the third decoder state of the current time step, d′t-1 represents the fourth decoder state of the previous time step, c′t represents the third context vector of the current time step, and s′t represents the third attention state of the current time step.

o′t=Affine(d′t) Formula (1-19);

wherein, d′t represents the third decoder state of the current time step, Affine (.) represents the preset affine function, and o′t represents the third acoustic feature of the current time step.

In step 906, a second loss value is determined based on the third acoustic feature and the sample acoustic feature.

In step 907, the model parameters of the transformed model are adjusted by using the second loss values and the first loss values corresponding to each of the guiding attention networks to obtain a trained transformed model.

In some embodiments of the present application, a loss function of the transformation model is associated with the second loss value and the first loss value corresponding to each attention-directing network, and an overall loss value of the transformation model is obtained through the loss function, and model parameters of each sub-network in the transformation model can be adjusted through the overall loss value.

In some embodiments of the present application, the loss function of the conversion model further includes a second loss value and a loss weight corresponding to each first loss value, and the loss function may be configured to perform weighted summation on the loss values according to the loss weights corresponding to the loss values to obtain the overall loss value. The magnitude of the loss weight of the loss value is used for representing the influence degree of the aspect corresponding to the loss value on the conversion model, namely the attention degree of the conversion model on the aspect.

For example, if the loss weight corresponding to the forward guidance attention network of the loss function is larger, it indicates that the influence of the forward guidance network on the conversion model is larger, and the conversion model obtained through the training of the loss function has higher training speed and higher accuracy; if the loss weight corresponding to the Gaussian guide attention network of the loss function is larger, the influence of the Gaussian guide attention network on the conversion model is larger, and the effect of the conversion model obtained through the training of the loss function on the long sentence synthesis is better.

Referring to fig. 10, fig. 10 is an optional flowchart of a conversion model training method provided in an embodiment of the present application, and based on fig. 9, the method further includes step 1001 and step 1002, and step 907 in fig. 9 may be updated to step 1003. The description will be made in conjunction with the steps shown in fig. 10.

In step 1001, the third acoustic feature is input to the post-processing network of the conversion model, and a second feature to be converted is obtained.

In step 1002, a third loss value is determined according to the second feature to be converted and the sample acoustic feature.

In step 1003, the model parameters of the conversion model are adjusted by using the second loss value, the third loss value, and the first loss value corresponding to each of the guidance attention networks, so as to obtain a trained conversion model.

In some embodiments of the present application, a loss function of the transformation model is associated with the second loss value, the third loss value, and the corresponding first loss value of each attention-directing network, and an overall loss value of the transformation model is obtained through the loss function, and model parameters of each sub-network in the transformation model are adjusted through the overall loss value.

In some embodiments of the present application, the loss function of the conversion model further includes a second loss value, a third loss value, and a loss weight corresponding to each first loss value, and the loss function may be implemented by performing a weighted summation on the loss values according to the loss weights corresponding to the loss values to obtain the overall loss value. The magnitude of the loss weight of the loss value is used for representing the influence degree of the aspect corresponding to the loss value on the conversion model, namely the attention degree of the conversion model on the aspect.

For example, if the loss weight corresponding to the forward guidance attention network of the loss function is larger, it indicates that the influence of the forward guidance network on the conversion model is larger, and the conversion model obtained through the training of the loss function has higher training speed and higher accuracy; if the loss weight corresponding to the Gaussian guide attention network of the loss function is larger, the influence of the Gaussian guide attention network on the conversion model is larger, and the effect of the conversion model obtained through the training of the loss function on the long sentence synthesis is better.
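A sketch of how the overall loss of steps 906/907 and 1002/1003 might be assembled, with the loss weights treated as hypothetical hyperparameters and the branch names chosen only for illustration:

```python
import torch
import torch.nn.functional as F

def total_loss(o3, postnet_out, sample_target, base_scores, guide_scores, weights):
    """Weighted sum of the second loss value (decoder output), the third loss value
    (post-processing network output), and the first loss value of every guiding attention network."""
    loss = weights["decoder"] * F.l1_loss(o3, sample_target)                   # second loss value
    loss = loss + weights["postnet"] * F.l1_loss(postnet_out, sample_target)   # third loss value
    for name, g in guide_scores.items():
        loss = loss + weights[name] * F.l1_loss(g, base_scores)                # first loss values
    return loss

# Example weighting that emphasises the forward (monotonic) guidance:
# weights = {"decoder": 1.0, "postnet": 1.0, "forward": 1.0, "gaussian": 0.5}
```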

Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.

With the rapid development of smart devices (such as smart phones, smart speakers, etc.), the voice interaction technology is increasingly applied as a natural interaction mode. As an important part of the voice interaction technology, the voice synthesis technology has also made great progress. The application provides and realizes a text-to-acoustic feature conversion model for multi-attention mechanism guidance learning. The model divides the end-to-end text-to-acoustic feature conversion into an encoding module (Encoder), a plurality of Attention modules (Attention) and a corresponding decoding module (Decoder). During training, a plurality of attention mechanisms are used for guiding and learning the basic attention module, and only the basic attention module is reserved during actual use. The scheme can transfer different characteristics of the attention mechanism which are beneficial to improving the synthesis robustness and quality to the basic attention mechanism on the premise of not changing an on-line engine code. The obtained end-to-end model can synthesize the input text and output more robust and high-quality acoustic characteristics for subsequent vocoder use. The scheme is widely applied to scenes such as reading APP intelligent reading, intelligent customer service, news broadcasting, intelligent equipment interaction and the like.

The core of the application is divided into two parts. The first part is the backbone network, which decomposes the end-to-end text-to-acoustic feature model into multiple modules. An encoding module (Encoder), a basic Attention module (Attention) and a basic decoding module (Decoder) are included, which are retained in both training and practical use. The second part is a multi-attention guiding learning network, which shares the coding module with the backbone network and comprises two different attention modules and corresponding decoding modules for guiding the learning of the basic attention module. The multi-attention-guiding learning network is only used during training, guidance is provided for learning of a basic attention mechanism, and the part is removed during actual use. The innovative multi-attention guiding learning mode can instill the characteristics of various attention mechanisms to the basic attention mechanism, so that the robustness and the quality of the synthesis are improved under the condition that the online forward framework does not need to be changed.

The speech synthesis technology converts the text into corresponding audio content through a certain rule or model algorithm. The traditional speech synthesis technology is mainly based on a concatenation method or a statistical parameter method. With the continuous breakthrough of deep learning in the field of speech recognition, some leading-edge internet companies at home and abroad begin to introduce deep learning into the field of speech synthesis, and make great progress.

Conventional speech synthesis methods can be divided into two main categories: the concatenation method and the parametric method. A concatenation-based speech synthesis system divides existing audio into small units, and during synthesis these small units are strung together through dynamic algorithms and post-processed to form new audio. A parameter-based speech synthesis system converts existing audio into spectra and acoustic parameters, such as the fundamental frequency and the pronunciation duration, and trains acoustic models; during synthesis it predicts the relevant parameters from the text information and sends them to a vocoder to synthesize new audio. Both of these approaches typically involve two components, a front end and a back end. The front end is responsible for text analysis and linguistic feature extraction, such as word segmentation, part-of-speech tagging, disambiguation, and prosodic structure prediction. The back end is responsible for converting the linguistic features extracted by the front end into audio, which may include operations such as acoustic parameter prediction, prosody modeling, and audio generation. Speech synthesis systems based on either the concatenation method or the parametric method have dominated over the past decades. However, the conventional techniques require a large number of modules and sophisticated feature design. In addition, due to the splitting among modules and the insufficient capacity of the models, speech synthesis systems based on conventional methods still have a large space for improvement in naturalness and fidelity.

Through research, the application finds that the traditional technical scheme has the following problems to be solved: (1) traditional speech synthesis techniques such as concatenative synthesis and parameter synthesis all require a large number of modules and sophisticated feature design. In addition, due to the fact that the splitting among the modules and the insufficient capacity of the models exist, a speech synthesis system based on the traditional method has a large improvement space in the aspects of naturalness and fidelity; (2) end-to-end speech synthesis based on deep learning has great improvement in naturalness and fidelity, but a single attention mechanism cannot meet the requirement of robustness of online synthesis.

In some embodiments of the present application, the text-to-acoustic feature conversion method provided by the embodiments of the present application has a wide application range. The application provides two application scenarios, and in the first application scenario, a speech synthesis scheme can be put on a cloud service and used as a basic technology to enable a user using the cloud service, such as a bank intelligent customer service. In a second application scenario, the solution can be applied to personalized scenarios in the vertical domain, such as book intelligent reading, news reporting, and the like.

Please refer to the scene diagram of the cloud service scenario shown in fig. 11. In the cloud service scenario, various smart devices such as intelligent robots and smartphones can access, through a wireless network, a server that provides the speech synthesis service. After a smart device accesses the server normally, it sends the text to be synthesized to the server; after the server quickly synthesizes the text, it sends the corresponding synthesized audio to the device in a streaming manner or sentence by sentence. A complete speech synthesis process comprises the following steps: the client uploads the text to be synthesized to the server, and after receiving the text the server performs the corresponding regularization processing; the regularized text information is input into the end-to-end speech synthesis system designed and assembled for the service, the audio corresponding to the text is quickly synthesized, and processing operations such as audio compression are completed; the server returns the audio to the client in a streaming manner or sentence by sentence, and the client can perform smooth and natural voice playback after receiving the audio.

In the whole process, the delay of the background speech synthesis service is very small, and the client can basically and immediately obtain a return result. The user can hear the required content in a short time, and the eyes are liberated, so that the interaction is natural and convenient.

Please refer to the scene diagram of the customized voice scenario shown in fig. 12. In the customized voice scenario, customized dedicated-timbre speech synthesis services are needed in many vertical scenarios, such as novel reading and news broadcasting. The specific flow of the customized speech synthesis service is shown in fig. 12: the demand side submits a timbre requirement list for the speech synthesis required by its product, such as the gender of the speaker, the timbre type, and the like; after receiving the list from the demand side, the background collects a sound library according to the required timbre conditions and trains a corresponding customized model; after the synthesized samples are delivered to the demand side and checked and confirmed, the customized model is deployed online; the application of the demand side (such as a reading APP, a news client, and the like) sends the required text to the corresponding background model; the user can then hear the content read aloud with the corresponding customized timbre in the application, and the synthesis process is the same as that of the cloud service scenario.

Personalized customized speech synthesis places higher requirements on the robustness, generalization, and real-time performance of the system. The modular end-to-end system can be flexibly adjusted according to actual conditions, guaranteeing high adaptability of the system under different requirements on the premise that the synthesis effect is hardly affected.

The present application will focus on solving the above-mentioned problems. The end-to-end text-to-acoustic feature module is divided into two parts. The first part is the backbone network, which decomposes the end-to-end text-to-acoustic feature conversion model into multiple modules. The system comprises an encoding module (Encode), a basic Attention module (Attention) and a basic decoding module (Decode), wherein the basic Attention module uses a position-sensitive Attention mechanism, and the modules are reserved in training and practical use. The second part is a multi-attention guiding learning network, the network and the backbone network share a coding module, and two attention modules with respective characteristics and corresponding decoding modules are included for guiding the learning of the basic attention module. The forward attention mechanism module provides strong monotonicity guidance for a basic attention mechanism, reduces adverse effects caused by misalignment in a training process, and improves training speed and quality of generated acoustic features. The attention mechanism based on the Gaussian mixture model brings the long sentence synthesis capability to the basic attention mechanism, so that the basic attention mechanism can synthesize sentences ten times as long as training data, and the robustness of the whole system to long sentence synthesis is greatly improved. The multi-attention-guiding learning network is only used during training, guidance is provided for learning of a basic attention mechanism, and the part is removed during actual use. Based on the multi-attention guiding learning mode in the embodiment of the application, the characteristics of various attention mechanisms can be infused to the basic attention mechanism, so that the robustness and the quality of the synthesis are improved under the condition that the on-line forward framework does not need to be changed.

As shown in fig. 13, which is the architecture diagram of the speech synthesis system provided by the present application, a text a110 is input to a CBHG encoder network a120 to obtain a hidden text representation sequence a130. During training, the hidden text representation sequence a130 is input to the forward attention network a141, the base attention network a140, and the Gaussian attention network a142 respectively, the attention score matrix output by each attention network is obtained, and the model parameters of the acoustic feature conversion model can be adjusted according to the distances between the attention score matrices output by a141 and a142 and the attention score matrix output by the base attention network a140, respectively. During use, the hidden text representation sequence a130 is input only to the base attention network a140 and then passes through the decoder network a150 to obtain the acoustic feature corresponding to the text a110; in order to improve the fidelity of the subsequent speech synthesis, the acoustic feature can further be processed by the CBHG post-processing network a160.
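A high-level sketch of this data flow is given below; the callables stand in for the sub-networks of fig. 13, and the names are illustrative rather than an actual API.

```python
from typing import Callable, Dict, Optional
import torch

def convert(text_ids: torch.Tensor,
            encoder: Callable,
            base_branch: Callable,
            postnet: Callable,
            guide_branches: Optional[Dict[str, Callable]] = None):
    """High-level data flow of fig. 13: at training time guide_branches holds the forward
    attention branch (a141 + a151) and the Gaussian attention branch (a142 + a152); at
    inference it is None and only the base branch (a140 + a150) and the post-net run."""
    h = encoder(text_ids)                              # a120 -> hidden text representation a130
    base_scores, acoustic = base_branch(h)             # a140 + decoder a150
    if guide_branches is None:                         # inference: base attention only
        return postnet(acoustic)                       # a160 -> feature sent to the vocoder
    guide_scores = {name: branch(h)[0] for name, branch in guide_branches.items()}
    return acoustic, base_scores, guide_scores         # used to build the training losses
```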

In some embodiments of the present application, a location-sensitive attention mechanism is used as the attention mechanism of the base attention network. The application uses the CBHG encoder to convert the Chinese phoneme sequence x1:L, which carries tone and prosodic information, into a hidden text representation h1:L that is more suitable for the attention mechanism, see formula (2-1):

h1:L=CBHGEncoder(x1:L) Formula (2-1);

The attention RNN (AttentionRNN) takes the state of the previous time step, the context vector of the previous time step, and the decoding result of the previous time step as input, and outputs the current state st, which is used to calculate the attention score, see formula (2-2).

st=AttentionRNN(st-1,ct-1,ot-1) Formula (2-2);

wherein, ct-1 represents the context vector of the previous time step, st-1 represents the state of the previous time step, ot-1 represents the decoding result of the previous time step, and st represents the current state of the attention RNN.

The location-sensitive attention mechanism takes the current state, the hidden representation, and the location-related information as input to obtain the attention score αt, see formula (2-3).

αt=LSAttention(st,hi,lt) Formula (2-3);

wherein, st represents the current state, hi represents the hidden text representation, and lt represents the location-related information.

Then the context vector ct of the current time step is calculated, see formula (2-4):

ct=∑iαt,ihi Formula (2-4);

Finally, the current attention RNN state and context vector are input to the decoder RNN.

Then, the decoder state dt is obtained, see formula (2-5), and the final decoding result ot is obtained through an affine function, see formula (2-6).

dt=DecoderRNN(dt-1,ct,st) Formula (2-5);

ot=Affine(dt) Formula (2-6);

A multi-guidance attention mechanism is introduced into the above basic structure. All guiding attention modules share the encoder with the base attention module, and each has its own attention RNN and decoder RNN (identical in structure). Taking fig. 13 as an example, the forward attention network a141 has a corresponding decoder network a151, and the Gaussian attention network a142 has a corresponding decoder network a152.

in some embodiments of the present application, two instructive attention networks with different attention mechanisms, namely a forward attention network and a GMM-based attention network (gaussian attention network), may be selected to provide an instructive attention score matrix aftAnd agt

Among them, as a monotonic attention mechanism, forward attention considers, at each decoding time step, only the alignment paths that satisfy the monotonicity condition, so as to ensure the monotonicity of the final alignment path. It has been verified that this approach can accelerate convergence and improve the stability of feature generation. When the training process is unable to learn an effective alignment, adding a fixed diagonal mask to constrain the attention alignment matrix during training helps solve this problem.

With reference to the connectionist temporal classification (CTC) model, an intermediate variable et,i is defined to denote the sum of the probabilities of all monotonic alignment paths. et,i can be calculated recursively, see formula (2-7), where αt,i is the attention weight of sequence position i at the current time step t generated by the base attention module. Then, aft,i can be obtained by normalization, see formula (2-8).

et,i=(et-1,i+et-1,i-1)αt,i Formula (2-7);

aft,i=et,i/∑jet,j Formula (2-8);

unlike forward attention, gaussian attention networks are a purely location-dependent attention mechanism. It may give basic attention mechanisms a different benefit than monotonic attention mechanisms, such as robustness of long sentence synthesis (much longer than seen during training) while preserving the naturalness of shorter utterances.

Given the current attention RNN state st, three intermediate parameters ω′t, μ′t and σ′t are first calculated by a multilayer perceptron (MLP), see formula (2-9).

ω′t,μ′t,σ′t=MLP(st) Formula (2-9);

the three intermediate parameters are then refined by varying transfer functions to obtain the parameters of the mixed gaussian distribution. Wherein the variance is calculated by an exponential function, and the mean and the offset are obtained by softplus and softmax, respectively, to ensure that they are positive, see equations (2-10, 2-11, 2-12).

ωt=softmax(ω′t) Formula (2-10);

σt=exp(σ′t) Formula (2-11);

μt=softplus(μ′t)+μt-1formula (2-12);

The Gaussian attention network uses a mixture of N Gaussian distributions to generate the attention score agt,i, see formula (2-13):

agt,i=∑nωt,nexp(−(i−μt,n)2/(2σt,n2)) Formula (2-13);

In the training process, the guiding attention components are trained together with the basic structure; only the basic modules are retained at inference time. The training loss function includes three parts: the first is the distance between the decoder output and the real acoustic features; the second is the distance between the output of the post-processing network and the real acoustic features; the last part includes the distances between the base alignment scores and all of the guidance alignment scores. For simplicity, the L1 norm may be taken as the distance metric.

The conversion model provided by the embodiments of the present application can instill the characteristics of multiple attention mechanisms into the base attention mechanism, thereby improving the robustness and quality of synthesis without changing the online forward framework, and can stably output speech synthesis services with high intelligibility, high naturalness, and high fidelity. The scheme can be deployed in the cloud to provide general synthesis services for various devices, and exclusive timbres can be customized according to the requirements of different applications. The scheme can also continuously absorb the characteristics of newly proposed attention mechanisms; a beneficial effect of the technical scheme is that the whole text-to-acoustic-feature network can be hot-updated without modifying the forward framework.

In some embodiments of the present application, the guidance attention mechanism may include a monotonic attention mechanism similar to the forward attention mechanism, and may also include a dynamic convolution attention mechanism that, like the Gaussian mixture attention mechanism, also takes location information into account.

Continuing with the exemplary structure of the acoustic feature conversion apparatus 555 provided by the embodiments of the present application implemented as software modules, in some embodiments of the present application, as shown in fig. 2A, the software modules stored in the acoustic feature conversion apparatus 555 in the memory 550 may include:

the first input module 5551 is configured to input the text sequence to be converted into an encoder network of the conversion model, so as to obtain a text representation sequence; the text sequence to be converted comprises rhyme characteristic information;

a second input module 5552, configured to input the text representation sequence into a basic attention network of the conversion model, so as to obtain a first attention state, a first context vector, and a basic attention score matrix at the current time step;

a third input module 5553, configured to input the first attention state and the first context vector at the current time step into a decoder network of the conversion model, so as to obtain a first acoustic feature; the first acoustic feature is used for synthesizing audio data corresponding to the text sequence to be converted; wherein the loss function of the conversion model in the training process is related to a first loss value corresponding to at least one guiding attention network, and the first loss value is used to characterize a distance between the guiding attention score matrix output by the guiding attention network and the base attention score matrix.

In some embodiments of the present application, the second input module 5552 is further configured to perform weighted summation on the text representation vector corresponding to each sequence position according to the attention weight corresponding to each sequence position in the basic attention score matrix, so as to obtain a first context vector.
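As a minimal sketch of the weighted summation just described, assuming the text representation sequence is a tensor with one row per sequence position and the attention weights are the row of the base attention score matrix for the current time step:

```python
import torch

def context_vector(text_repr: torch.Tensor, attn_weights: torch.Tensor) -> torch.Tensor:
    """text_repr: (N, hidden_dim) text representation vectors, one per sequence position.
    attn_weights: (N,) attention weights for the current time step.
    Returns the (hidden_dim,) context vector as the attention-weighted sum."""
    return (attn_weights.unsqueeze(-1) * text_repr).sum(dim=0)
```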

In some embodiments of the present application, the second input module 5552 is further configured to obtain a second decoder state of the previous time step, and to input the second decoder state, the first context vector and the first attention state into the decoder network to obtain the first acoustic feature.

In some embodiments of the present application, the second input module 5552 is further configured to determine a first decoder state based on the second decoder state, the first context vector and the first attention state, and to convert the first decoder state into the first acoustic feature based on a preset affine function.
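The sketch below illustrates one way the decoder step just described could look, with a recurrent state update followed by an affine projection to the acoustic-feature dimension; the choice of an LSTM cell, the layer sizes and the class name are assumptions for illustration, not the disclosed implementation.

```python
import torch
import torch.nn as nn

class DecoderStepSketch(nn.Module):
    """One decoder time step: state update, then a preset affine projection."""

    def __init__(self, attn_dim: int, ctx_dim: int, state_dim: int, n_mels: int):
        super().__init__()
        self.rnn = nn.LSTMCell(attn_dim + ctx_dim, state_dim)
        self.affine = nn.Linear(state_dim, n_mels)   # "preset affine function"

    def forward(self, attn_state, context, prev_decoder_state):
        # First decoder state from (second decoder state, first context vector,
        # first attention state); all inputs are batched: (batch, ...).
        h, c = self.rnn(torch.cat([attn_state, context], dim=-1), prev_decoder_state)
        acoustic_feature = self.affine(h)            # first acoustic feature
        return acoustic_feature, (h, c)
```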

In some embodiments of the present application, the third input module 5553 is further configured to input the first acoustic feature into the post-processing network of the conversion model to obtain a first feature to be converted, and to input the first feature to be converted into a preset vocoder to obtain audio data corresponding to the text sequence to be converted.
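A hedged sketch of the synthesis path described for this module is given below; `postnet` and `vocoder` stand in for the conversion model's post-processing network and the preset vocoder, neither of which is specified here.

```python
import torch

def synthesize(first_acoustic_feature: torch.Tensor, postnet, vocoder) -> torch.Tensor:
    """Sketch: the post-processing network turns the first acoustic feature into the
    first feature to be converted, which the preset vocoder turns into audio data."""
    first_feature_to_convert = postnet(first_acoustic_feature)
    return vocoder(first_feature_to_convert)
```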

Continuing with the exemplary structure of the conversion model training apparatus 655 embodied as software modules provided by the embodiments of the present application, in some embodiments of the present application, as shown in fig. 2B, the software modules stored in the conversion model training apparatus 655 of the memory 650 include:

an encoding module 6551 for acquiring sample data; the sample data comprises a sample text sequence; and inputting the sample text sequence into an encoder network of a conversion model to obtain a sample representation sequence.

The base attention module 6552 is configured to input the sample representation sequence into a base attention network of the transformation model to obtain a sample base score matrix of the current time step.

A guiding attention module 6553, configured to input the sample representation sequence into at least one guiding attention network of the conversion model, so as to obtain the sample guidance score matrix output by each guiding attention network for the current time step.

A parameter adjusting module 6554, configured to determine a first loss value corresponding to each guiding attention network according to the sample base score matrix and the sample guidance score matrix output by each guiding attention network; the first loss value is used to characterize a distance between the sample guidance score matrix output by the guiding attention network and the sample base score matrix; and to adjust the model parameters of the conversion model by using the first loss value corresponding to each guiding attention network to obtain the trained conversion model.

In some embodiments of the present application, the base attention module 6552 is further configured to determine a third attention state for the current time step based on a fourth attention state, a fourth context vector and a fourth acoustic feature of the previous time step, and to determine the sample base score matrix according to the sample representation sequence, the third attention state and the sequence position of the current time step.

In some embodiments of the present application, the sample data further comprises sample acoustic features corresponding to the sample text sequence; the parameter adjusting module 6554 is further configured to determine a third context vector according to the sample base score matrix and the third attention state; input the third attention state and the third context vector into the decoder network of the conversion model to obtain a third acoustic feature, the third acoustic feature being used for synthesizing audio data corresponding to the sample text sequence; determine a second loss value according to the third acoustic feature and the sample acoustic features; and adjust the model parameters of the conversion model by using the second loss value and the first loss value corresponding to each guiding attention network to obtain the trained conversion model.

In some embodiments of the present application, in the case that the guiding attention network is a forward attention network, the guiding attention module 6553 is further configured to determine a first alignment parameter for the current time step based on the sample base score matrix of the current time step and a second alignment parameter of the previous time step, and to normalize the first alignment parameter to obtain a first guidance score matrix output by the forward attention network; the parameter adjusting module 6554 is further configured to determine a first loss value corresponding to the forward attention network according to the sample base score matrix and the first guidance score matrix.

In some embodiments of the present application, the first alignment parameter comprises a first sub-parameter corresponding to each sequence position; the guiding attention module 6553 is further configured to determine the first sub-parameter corresponding to each sequence position at the current time step based on the attention weight corresponding to that sequence position in the sample base score matrix and the second sub-parameter corresponding to that sequence position in the second alignment parameter.

In some embodiments of the present application, in the case that the guiding attention network is a Gaussian attention network, the guiding attention module 6553 is further configured to obtain a first mean parameter, a first variance parameter and a first offset parameter for the current time step according to the third attention state; determine a Gaussian mixture distribution according to the first mean parameter, the first variance parameter and the first offset parameter; and obtain a second guidance score matrix output by the Gaussian attention network based on the Gaussian mixture distribution; the parameter adjusting module 6554 is further configured to determine a first loss value corresponding to the Gaussian attention network according to the sample base score matrix and the second guidance score matrix.

In some embodiments of the present application, the guiding attention module 6553 is further configured to convert the third attention state into a mean intermediate parameter, a variance intermediate parameter and an offset intermediate parameter via the multilayer perceptron; input the variance intermediate parameter into an exponential function to obtain the first variance parameter; input the offset intermediate parameter into a first activation function to obtain the first offset parameter; and input the mean intermediate parameter into a second activation function, and determine the first mean parameter according to the parameter output by the second activation function and the second mean parameter of the previous time step.

In some embodiments of the present application, the parameter adjusting module 6554 is further configured to input the third acoustic feature into the post-processing network of the conversion model to obtain a second feature to be converted; determine a third loss value according to the second feature to be converted and the sample acoustic features; and adjust the model parameters of the conversion model by using the third loss value and the first loss value corresponding to each guiding attention network to obtain the trained conversion model.

Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the acoustic feature conversion method or the conversion model training method described in the embodiments of the present application.

Embodiments of the present application provide a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to perform the acoustic feature conversion method or the conversion model training method provided by the embodiments of the present application, for example, the methods shown in Figs. 3 to 10.

In some embodiments of the present application, the computer readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.

In some embodiments of the application, the executable instructions may be in the form of a program, software module, script, or code, written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).

By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.

In summary, the following technical effects can be achieved through the embodiments of the present application:

(1) In the embodiments of the present application, the conversion model converts the text sequence to be converted into a first acoustic feature from which audio data can be synthesized, and the model parameters of the conversion model are adjusted during training based on the first loss value corresponding to at least one guiding attention network, so that the conversion model inherits the advantages of each guiding attention network. This improves the conversion accuracy of the conversion model, raises the quality of the acoustic features, and broadens the applicable scenarios of the present application. In addition, in practical applications, this provides technical support for good user interaction: more accurate audio can be obtained from the acoustic features converted by the method of the embodiments, which is convenient for users and improves the user experience.

(2) Compared with a speech synthesis model without a post-processing network, the conversion model with a post-processing network provided by the embodiments of the present application can utilize more context information. Moreover, the first feature to be converted output by the post-processing network contains better-resolved harmonics and high-frequency formant structures, which reduces artifacts in the synthesized sound.

(3) The embodiments of the present application adjust the model parameters based on the first loss value corresponding to at least one guiding attention network, so that the conversion model inherits the advantages of each guiding attention network. This improves the conversion accuracy of the conversion model, raises the quality of the acoustic features, and broadens the applicable scenarios of the present application. In addition, in practical applications, because guiding attention networks with different characteristics can be selected according to actual requirements during training, the finally generated conversion model is suited to different application scenarios and has a wider application range. Meanwhile, because the characteristics of the guiding attention networks are integrated into the base attention network during training, the guiding attention networks are not needed at inference time, and the framework of the existing conversion model does not need to be adjusted when the model is updated, so the operation, maintenance and update costs are lower.

(4) In the embodiments of the present application, adjusting the model parameters of the conversion model through the forward attention guiding network reduces the adverse effects caused by misalignment during training and improves both the training speed and the quality of the generated acoustic features.

(5) In the embodiments of the present application, adjusting the model parameters of the conversion model through the Gaussian attention guiding network gives the conversion model the ability to synthesize long sentences; the conversion model can process text sequences ten times as long as those in the training data.

The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.
