Voice emotion conversion method and device, computer equipment and storage medium

Document No.: 154827    Publication date: 2021-10-26

Abstract: This technology, "Voice emotion conversion method and device, computer equipment and storage medium", was created by Zhang Xulong and Wang Jianzong on 2021-07-26. The invention discloses a speech emotion conversion method, apparatus, computer device and storage medium, wherein the method comprises: receiving speech input by a user and the emotion information, selected by the user, into which the speech is to be converted, the emotion information corresponding to a preset emotion code; inputting the speech into a pre-trained acoustic model to obtain a target Mel frequency spectrum; inputting the target Mel frequency spectrum and the emotion code into a trained style code extraction network to obtain a target style code; extracting the text content from the speech and inputting the text content together with the target style code into the trained acoustic model to obtain a style-converted Mel frequency spectrum; and generating speech containing the emotion information from the style-converted Mel frequency spectrum. In this way, the invention uses the acoustic model to convert the emotion of the speech input by the user, greatly improving the efficiency of speech emotion conversion.

1. A speech emotion conversion method is characterized by comprising the following steps:

receiving voice input by a user and emotion information which is selected by the user and needs to be converted, wherein the emotion information corresponds to a preset emotion code;

inputting the voice into a pre-trained acoustic model to obtain a target Mel frequency spectrum;

inputting the target Mel frequency spectrum and the emotion codes into a trained style code extraction network to obtain target style codes;

extracting text content from the voice, and inputting the text content and the target style code into the trained acoustic model together to obtain a Mel frequency spectrum after style conversion;

and generating voice comprising the emotional information according to the Mel frequency spectrum after the style conversion.

2. The speech emotion conversion method of claim 1, wherein pre-training the acoustic model and the style code extraction network comprises:

constructing a style code mapping network with the same structure as the style code extraction network and an optimization target of the style code mapping network;

acquiring a sample speech and preset emotion codes for the current training;

inputting the emotion codes and random noise prepared in advance into a style code mapping network to obtain first style codes;

analyzing the sample voice by using the acoustic model to obtain voice characteristics, generating a first Mel frequency spectrum according to the voice characteristics, and generating a second Mel frequency spectrum according to the voice characteristics and the first style code;

inputting the second Mel frequency spectrum and the emotion codes into the style code extraction network to obtain second style codes;

and updating the acoustic model through back propagation by combining the pre-constructed optimization target of the style code mapping network and the style loss function.

3. The speech emotion conversion method of claim 2, wherein the optimization target is constructed from a speech reconstruction loss and an emotion diversity loss:

Lrecon = ‖Melpredict − Melsource‖1,  Ldiverse = ‖Mels1 − Mels2‖1

wherein F is the style code mapping network, Lrecon is the speech reconstruction loss, Ldiverse is the emotion diversity loss, Melpredict is the first Mel frequency spectrum, Melsource is the second Mel frequency spectrum, s1 and s2 are first style codes generated by the style code mapping network from different emotion codes, and Mels1 and Mels2 are the second Mel frequency spectra generated by combining the first style codes corresponding to the different emotion codes with the speech features extracted by the acoustic model;

the style loss function is:

Lstyle = ‖sF − sE‖1

wherein Lstyle is the style loss function, sF is the first style code, and sE is the second style code.

4. The method of claim 2, wherein before obtaining the sample speech of the current training and the predetermined emotion encoding, the method further comprises:

and setting one corresponding emotion code in one-hot format for each emotion.

5. The method of converting speech emotion of claim 1, wherein the acoustic model is one of a Fastspeech model and a Fastspeech2 model.

6. The speech emotion conversion method of claim 1, wherein the style code extraction network comprises a convolutional network and a neural network, the convolutional network extracts features from the target Mel frequency spectrum with the emotion code as a constraint, and the neural network generates the target style code based on the features.

7. The method of speech emotion conversion of claim 6, wherein the convolutional network comprises a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer and a third pooling layer, and the neural network comprises a first bi-directional LSTM layer and a second bi-directional LSTM layer.

8. A speech emotion conversion apparatus, comprising:

the receiving module is used for receiving voice input by a user and emotion information which is selected by the user and needs to be converted, wherein the emotion information corresponds to a preset emotion code;

the first input module is used for inputting the voice into a pre-trained acoustic model to obtain a target Mel frequency spectrum;

the style extraction module is used for inputting the target Mel frequency spectrum and the emotion codes into a trained style code extraction network to obtain target style codes;

the second input module is used for extracting text content from the voice and inputting the text content and the target style code into the trained acoustic model together to obtain a Mel frequency spectrum after style conversion;

and the generating module is used for generating the voice comprising the emotion information according to the Mel frequency spectrum after the style conversion.

9. A computer device, characterized in that the computer device comprises a processor, a memory coupled to the processor, in which memory program instructions are stored, which program instructions, when executed by the processor, cause the processor to carry out the steps of the speech emotion conversion method as claimed in any of claims 1-7.

10. A storage medium characterized in that it stores program instructions capable of implementing the speech emotion conversion method as recited in any one of claims 1 to 7.

Technical Field

The present application relates to the field of speech processing technologies, and in particular, to a speech emotion conversion method, apparatus, computer device, and storage medium.

Background

A speech signal contains not only semantic information but also other information such as the speaker's identity and the emotion with which the words are spoken. Emotional voice conversion is a technology that converts speech from one emotion to another while keeping the semantics, speaker identity and other information unchanged. With the development of the economy and of artificial intelligence, people's entertainment life has become increasingly rich, and audio-visual technology has become inseparable from public life. Giving machines the ability to perceive and express emotion as humans do is key to harmonious human-computer interaction. Speech processing technology has improved remarkably in recent years, yet at present computers possess only logical reasoning capability. If computers were endowed with the capability to express emotion, harmonious human-computer interaction could be achieved and indirect communication tools such as the keyboard and mouse could be dispensed with; communication between humans and machines would no longer be limited to neutral speech, and people could communicate with computers through speech that carries emotion. In the field of film and television art, converting the emotion of a human voice, for example in dubbing, could greatly raise the quality of a work. Therefore, speech emotion conversion has profound research significance, whether the object of communication is a machine or a human.

Disclosure of Invention

The application provides a voice emotion conversion method and device, computer equipment and a storage medium, so as to convert the emotion of speech.

In order to solve the technical problem, the application adopts a technical scheme that: a speech emotion conversion method is provided, which comprises the following steps: receiving voice input by a user and emotion information which is selected by the user and needs to be converted, wherein the emotion information corresponds to a preset emotion code; inputting the voice into a pre-trained acoustic model to obtain a target Mel frequency spectrum; inputting the target Mel frequency spectrum and the emotion codes into a trained style code extraction network to obtain target style codes; extracting text content from the voice, and inputting the text content and the target style code into a trained acoustic model to obtain a Mel frequency spectrum after style conversion; and generating voice comprising emotional information according to the Mel frequency spectrum after the style conversion.

As a further improvement of the present application, the pre-training of the acoustic model and the style code extraction network includes: constructing a style code mapping network with the same structure as the style code extraction network, together with an optimization target for the style code mapping network; acquiring the sample speech and preset emotion codes for the current training; inputting the emotion codes and pre-prepared random noise into the style code mapping network to obtain a first style code; analyzing the sample speech with the acoustic model to obtain speech features, generating a first Mel frequency spectrum from the speech features, and generating a second Mel frequency spectrum from the speech features and the first style code; inputting the second Mel frequency spectrum and the emotion codes into the style code extraction network to obtain a second style code; and updating the acoustic model through back propagation by combining the pre-constructed optimization target of the style code mapping network and the style loss function.

As a further improvement of the present application, the optimization target is constructed from a speech reconstruction loss and an emotion diversity loss:

Lrecon = ‖Melpredict − Melsource‖1,  Ldiverse = ‖Mels1 − Mels2‖1

wherein F is the style code mapping network, Lrecon is the speech reconstruction loss, Ldiverse is the emotion diversity loss, Melpredict is the first Mel frequency spectrum, Melsource is the second Mel frequency spectrum, s1 and s2 are first style codes generated by the style code mapping network from different emotion codes, and Mels1 and Mels2 are the second Mel frequency spectra generated by combining the first style codes corresponding to the different emotion codes with the speech features extracted by the acoustic model;

the style loss function is:

Lstyle=‖sF-sE‖1

wherein Lstyle is a style loss function, sF is a first style code, and sE is a second style code.

As a further improvement of the present application, before obtaining the sample speech of the present training and the preset emotion encoding, the method further includes: and setting one corresponding emotion code in one-hot format for each emotion.

As a further improvement of the application, the acoustic model is one of a Fastspeech model and a Fastspeech2 model.

As a further improvement of the application, the style code extraction network comprises a convolutional network and a neural network; the convolutional network extracts features from the target Mel frequency spectrum with the emotion code as a constraint, and the neural network generates the target style code from the features.

As a further improvement of the present application, the convolutional network comprises a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer and a third pooling layer, and the neural network comprises a first bi-directional LSTM layer and a second bi-directional LSTM layer.

In order to solve the above technical problem, another technical solution adopted by the present application is: provided is a speech emotion conversion device including: the receiving module is used for receiving voice input by a user and emotion information which is selected by the user and needs to be converted, and the emotion information corresponds to a preset emotion code; the first input module is used for inputting the voice into a pre-trained acoustic model to obtain a target Mel frequency spectrum; the style extraction module is used for inputting the target Mel frequency spectrum and the emotion codes into a trained style code extraction network to obtain target style codes; the second input module is used for extracting text content from the voice and inputting the text content and the target style code into the trained acoustic model to obtain a Mel frequency spectrum after style conversion; and the generating module is used for generating voice comprising emotion information according to the Mel frequency spectrum after the style conversion.

In order to solve the above technical problem, the present application adopts another technical solution that: there is provided a computer device comprising a processor, a memory coupled to the processor, the memory having stored therein program instructions which, when executed by the processor, cause the processor to carry out the steps of the speech emotion translation method as claimed in any one of the above.

In order to solve the above technical problem, the present application adopts another technical solution that: there is provided a storage medium storing program instructions capable of implementing the speech emotion conversion method as described in any one of the above.

The beneficial effect of this application is: in the speech emotion conversion method of this application, after the speech input by the user and the emotion information to be converted are obtained, the speech is input into a pre-trained acoustic model to extract a target Mel frequency spectrum of the speech; a trained style code extraction network is used to extract a target style code from the target Mel frequency spectrum; the speech is converted into emotion-free text content; the text content and the target style code are input into the aforementioned acoustic model to obtain a style-converted Mel frequency spectrum; and speech containing the emotion information is generated from the style-converted Mel frequency spectrum, thereby converting the emotion of the speech.

Drawings

FIG. 1 is a flowchart illustrating a speech emotion conversion method according to a first embodiment of the present invention;

FIG. 2 is a functional block diagram of a speech emotion converting apparatus according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention;

fig. 4 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first", "second" and "third" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any indication of the number of technical features indicated. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of the feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise. All directional indications (such as up, down, left, right, front, and rear … …) in the embodiments of the present application are only used to explain the relative positional relationship between the components, the movement, and the like in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indication is changed accordingly. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

FIG. 1 is a flowchart illustrating a speech emotion conversion method according to a first embodiment of the present invention. It should be noted that the method of the present invention is not limited to the flow sequence shown in fig. 1 if the results are substantially the same. As shown in fig. 1, the method comprises the steps of:

step S101: receiving voice input by a user and emotion information which is selected by the user and needs to be converted, wherein the emotion information corresponds to a preset emotion code.

It should be noted that human emotions usually include happiness, anger, sorrow, joy, fear, excitement, disgust and so on. A person's emotions may differ in different situations, and under the influence of those emotions the speech they produce also carries emotional elements, so that others can perceive the emotion contained in the speech and resonate with it.

Note that emotion encoding for each emotion needs to be set in advance.

In step S101, after the speech input by the user is received, if the user wants to convert the emotion information of the speech, the emotion in the speech is converted according to the emotion code corresponding to the selected emotion information, so as to obtain speech containing the emotion information desired by the user.

Step S102: and inputting the voice into a pre-trained acoustic model to obtain a target Mel frequency spectrum.

In step S102, after the speech is acquired, it is input into an acoustic model that has been trained in advance, and the target Mel frequency spectrum of the speech is obtained. The acoustic model is an indispensable part of the speech recognition field: its task is to describe the physical variation of speech and to compute the probability of generating a given speech waveform, from which the Mel frequency spectrum of the speech is obtained.

Further, the acoustic model is one of a Fastspeech model and a Fastspeech2 model.

In this embodiment, the acoustic model is preferably a Fastspeech2 model, whose fast inference capability improves the efficiency of speech emotion conversion. Compared with the Fastspeech model, the Fastspeech2 model better handles the one-to-many mapping problem, simplifies training and produces higher-quality sound.

Specifically, the Fastspeech2 model is an acoustic model comprising a Character embedding layer, an Encoder layer, a Variance adaptor layer and a Decoder layer. The Character embedding layer converts the input phoneme sequence into an embedding sequence and adds position information to it; the Encoder layer maps the embedding sequence with position information into intermediate features; the Variance adaptor layer introduces different acoustic feature information into the intermediate features and comprises a duration predictor, a pitch predictor and an energy predictor, the three predictors having the same structure, each consisting of a two-layer 1D convolutional network and a linear layer, where the different acoustic feature information includes duration (sound-length feature), pitch (pitch feature), energy (sound-energy feature) and the like; the Decoder layer outputs a Mel frequency spectrum with a unique style from the intermediate features into which the different acoustic feature information has been introduced.
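As an illustrative sketch of the predictor structure described above (two 1D convolutional layers followed by a linear output layer), a PyTorch module could look as follows; the channel size, kernel width and dropout rate are assumptions chosen for illustration and are not values taken from this application:

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Duration/pitch/energy predictor: two 1D conv layers plus a linear output layer.
    Hidden size, kernel width and dropout rate are illustrative assumptions."""
    def __init__(self, hidden_dim: int = 256, kernel_size: int = 3, dropout: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2)
        self.linear = nn.Linear(hidden_dim, 1)  # one scalar (duration/pitch/energy) per frame
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim) intermediate features from the encoder
        x = x.transpose(1, 2)                        # -> (batch, hidden_dim, time) for Conv1d
        x = self.dropout(self.relu(self.conv1(x)))
        x = self.dropout(self.relu(self.conv2(x)))
        x = x.transpose(1, 2)                        # -> (batch, time, hidden_dim)
        return self.linear(x).squeeze(-1)            # -> (batch, time)
```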

Step S103: and inputting the target Mel frequency spectrum and the emotion codes into a trained style code extraction network to obtain the target style codes.

In step S103, after the target Mel frequency spectrum and the emotion code are acquired, the target Mel frequency spectrum is input into the pre-trained style code extraction network, and style extraction is performed under the constraint of the emotion code, so as to obtain the target style code, i.e., a style code that expresses the emotion information.

Further, the style code extraction network comprises a convolutional network and a neural network; the convolutional network extracts features from the target Mel frequency spectrum with the emotion code as a constraint, and the neural network generates the target style code from those features.

Specifically, the convolutional network extracts high-level features from the target Mel frequency spectrum, and the neural network then extracts the target style code from those high-level features.

The convolutional network comprises a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer and a third pooling layer, and the neural network comprises a first bidirectional LSTM layer and a second bidirectional LSTM layer.

Specifically, the first convolutional layer, first pooling layer, second convolutional layer, second pooling layer, third convolutional layer and third pooling layer in the convolutional network are connected in sequence. It should be noted that the Mel spectrum is a spectrum representing the feature information of speech. The convolutional layers extract feature information from the Mel spectrum, while the pooling layers reduce the feature dimensionality, compress the amount of data and parameters, reduce overfitting and improve the fault tolerance of the model. Generally, in a convolutional neural network a pooling layer follows each convolutional layer; adding pooling layers speeds up computation and makes the detected features more robust.

In this embodiment, the layers may, for example, be configured as follows: the first convolutional layer with input size 32 × 9 × 6000, output size 32 × 9 × 3000, kernel size 2 × 4 and stride 1 × 2; the first pooling layer with input size 32 × 9 × 3000, output size 16 × 4 × 3000, kernel size 2 × 1 and stride 2 × 1; the second convolutional layer with input size 16 × 4 × 3000, output size 16 × 4 × 3000, kernel size 2 × 4 and stride 1 × 1; the second pooling layer with input size 16 × 4 × 3000, output size 8 × 2 × 1500, kernel size 2 × 4 and stride 2 × 1; and the third convolutional and pooling layers configured analogously, with the third pooling layer producing an output of size 4 × 2 × 750 with kernel size 2 × 1 × 2 and stride 2 × 1 × 2.

In particular, the neural network includes a first bidirectional LSTM layer and a second bidirectional LSTM layer. A bidirectional LSTM (bidirectional long short-term memory recurrent neural network) contains a forward LSTM unit and a backward LSTM unit. The features extracted by the convolutional network are first fed into the forward LSTM unit to obtain and store the output of the forward hidden layer at each time step; they are then fed into the backward LSTM unit in reverse order to obtain and store the output of the backward hidden layer at each time step; finally, the outputs of the forward and backward LSTM units at each time step are combined to produce the final output.
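The style code extraction network described above can be sketched as follows in PyTorch. This is a minimal illustration: the layer sizes are simplified rather than those listed earlier, and the way the emotion code constrains the extraction (here, concatenating the one-hot code to the pooled features before the LSTM layers) is an assumption, not a structure specified by this application:

```python
import torch
import torch.nn as nn

class StyleCodeExtractor(nn.Module):
    """Sketch: three conv+pool stages followed by two bidirectional LSTM layers.
    Layer sizes and the concatenation-based emotion conditioning are illustrative assumptions."""
    def __init__(self, n_mels: int = 80, n_emotions: int = 4, style_dim: int = 128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(16, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
        )
        freq_out = n_mels // 8                               # 80 mel bins -> 10 after three poolings
        self.lstm = nn.LSTM(input_size=8 * freq_out + n_emotions,
                            hidden_size=style_dim // 2,
                            num_layers=2, batch_first=True, bidirectional=True)
        self.out = nn.Linear(style_dim, style_dim)

    def forward(self, mel: torch.Tensor, emotion_onehot: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, time); emotion_onehot: (batch, n_emotions)
        x = self.convs(mel.unsqueeze(1))                     # -> (batch, 8, n_mels/8, time/8)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)       # -> (batch, time/8, 8*f)
        emo = emotion_onehot.unsqueeze(1).expand(b, t, -1)   # broadcast the emotion code over time
        x = torch.cat([x, emo], dim=-1)                      # condition on the emotion code
        _, (h, _) = self.lstm(x)                             # h: (num_layers*2, batch, style_dim//2)
        h = torch.cat([h[-2], h[-1]], dim=-1)                # last layer's forward + backward states
        return self.out(h)                                   # -> (batch, style_dim) style code
```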

Step S104: text content is extracted from the voice, and the text content and the target style code are input into a trained acoustic model together to obtain a Mel frequency spectrum after style conversion.

In step S104, after the target style code is acquired, the text content of the speech is recognized using speech recognition technology, and the text content is input, as a phoneme sequence, together with the target style code into the acoustic model to obtain the style-converted Mel frequency spectrum. Automatic Speech Recognition (ASR) is a technology for converting speech into text.

Step S105: and generating voice comprising emotional information according to the Mel frequency spectrum after the style conversion.

In step S105, after the style-converted Mel frequency spectrum is obtained, a vocoder generates the corresponding speech waveform from it, yielding speech that contains the emotion information. The vocoder is trained in advance and may be, for example, a WaveGlow model.
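Steps S101 to S105 can be summarized in the following sketch. The callables `acoustic_model`, `style_extractor`, `transcribe` and `vocoder` stand for the pre-trained components described above, and their method names are placeholders for illustration rather than interfaces defined in this application:

```python
import torch

def convert_emotion(waveform: torch.Tensor,
                    emotion_onehot: torch.Tensor,
                    acoustic_model,     # pre-trained acoustic model (e.g. a Fastspeech2-style model)
                    style_extractor,    # trained style code extraction network
                    transcribe,         # ASR front end: waveform -> text/phoneme sequence
                    vocoder):           # trained vocoder (e.g. a WaveGlow-style model)
    """Sketch of steps S101-S105; all callables are assumed, pre-trained components."""
    # S102: obtain the target Mel spectrum of the input speech
    target_mel = acoustic_model.speech_to_mel(waveform)
    # S103: extract the target style code under the chosen emotion code
    style_code = style_extractor(target_mel, emotion_onehot)
    # S104: recognize the (emotion-free) text content and re-synthesize with the style code
    text = transcribe(waveform)
    converted_mel = acoustic_model.text_to_mel(text, style_code)
    # S105: generate the emotional speech waveform from the converted Mel spectrum
    return vocoder(converted_mel)
```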

Further, the pre-training of the acoustic model and the style code extraction network includes:

1. and constructing a style code mapping network with the same structure as the style code extraction network and an optimization target of the style code mapping network.

Specifically, when the acoustic model and the style code extraction network are trained, an additional style code mapping network is needed to assist training; the style code mapping network and the style code extraction network are identical in structure. In this embodiment, the purpose of adding the style code mapping network when training the acoustic model and the style code extraction network is to introduce perturbation into the training process, thereby improving the diversity of the output results so that different style codes can be generated.

2. And acquiring sample voice and preset emotion codes of the training.

Specifically, the sample speech and emotion encoding need to be set in advance.

Further, before obtaining the sample speech of the present training and the preset emotion encoding, the method further includes: and setting one corresponding emotion code in one-hot format for each emotion.

In particular, one-hot encoding, also known as one-bit-effective encoding, uses an N-bit status register to encode N states; each state has its own independent register bit, and only one bit is active at any time. In this embodiment, assuming that there are four emotions, happiness, anger, sorrow and joy, i.e. four emotion codes are provided, a 4-bit status register is required to encode the 4 emotions, with "0001" representing happiness, "0010" representing anger, "0100" representing sorrow and "1000" representing joy.
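As a small illustration of building such one-hot emotion codes (the emotion names and which bit corresponds to which emotion are merely a convention):

```python
import torch

EMOTIONS = ["happiness", "anger", "sorrow", "joy"]   # one register bit per emotion

def emotion_to_onehot(emotion: str) -> torch.Tensor:
    """Return the one-hot emotion code; exactly one bit is set per emotion."""
    code = torch.zeros(len(EMOTIONS))
    code[EMOTIONS.index(emotion)] = 1.0
    return code

print(emotion_to_onehot("anger"))   # tensor([0., 1., 0., 0.])
```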

3. And inputting the emotion codes and the pre-prepared random noise into the style code mapping network to obtain a first style code.

The random noise is generated randomly in advance, and its size is the same as that of the Mel frequency spectrum output by the acoustic model.

Specifically, the process of generating the first style code by the style code mapping network according to the random noise and the emotion encoding is the same as the process of generating the target style code by the style code extracting network according to the mel frequency spectrum and the emotion encoding, and details are not repeated here.

4. And analyzing the sample voice by using an acoustic model to obtain voice characteristics, generating a first Mel frequency spectrum according to the voice characteristics, and generating a second Mel frequency spectrum according to the voice characteristics and the first style code.

Specifically, the acoustic model needs to be run twice during training, the first time is to obtain a first mel frequency spectrum of the sample voice from the sample voice, and the second time is to extract voice features from the sample voice after the style code mapping network obtains the first style code, and generate a second mel frequency spectrum after combining the voice features with the first style code.

5. And inputting the second Mel frequency spectrum and the emotion codes into the style code extraction network to obtain second style codes.

Specifically, the process of extracting the second style code by the style code extracting network according to the second mel frequency spectrum and the emotion encoding is the same as the process of generating the target style code by the style code extracting network according to the mel frequency spectrum and the emotion encoding, and the description is omitted here.
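Training steps 3 to 5 can be sketched as follows, reusing the StyleCodeExtractor and emotion_to_onehot sketches above for the mapping and extraction networks (the two networks share one structure); the acoustic model's method names (`analyze`, `decode`) and the noise dimensions are placeholder assumptions, not APIs from this application:

```python
import torch

def training_forward_pass(acoustic_model, sample_speech, emotion_onehot,
                          mapping_network, style_extractor,
                          n_mels: int = 80, n_frames: int = 400):
    """Sketch of training steps 3-5 with placeholder acoustic-model methods."""
    # Step 3: Mel-shaped random noise and the emotion code drive the mapping network
    noise = torch.randn(1, n_mels, n_frames)
    first_style_code = mapping_network(noise, emotion_onehot)

    # Step 4: run the acoustic model twice -- once without and once with the style code
    speech_features = acoustic_model.analyze(sample_speech)
    first_mel = acoustic_model.decode(speech_features)
    second_mel = acoustic_model.decode(speech_features, style=first_style_code)

    # Step 5: the extraction network recovers a second style code from the second Mel spectrum
    second_style_code = style_extractor(second_mel, emotion_onehot)
    return first_mel, second_mel, first_style_code, second_style_code
```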

6. And updating the acoustic model through back propagation by combining the pre-constructed optimization target of the style code mapping network and the style loss function.

Wherein, the optimization target is constructed from a speech reconstruction loss and an emotion diversity loss:

Lrecon = ‖Melpredict − Melsource‖1,  Ldiverse = ‖Mels1 − Mels2‖1

wherein F is the style code mapping network, Lrecon is the speech reconstruction loss, Ldiverse is the emotion diversity loss, Melpredict is the first Mel frequency spectrum, Melsource is the second Mel frequency spectrum, s1 and s2 are first style codes generated by the style code mapping network from different emotion codes, and Mels1 and Mels2 are the second Mel frequency spectra generated by combining the first style codes corresponding to the different emotion codes with the speech features extracted by the acoustic model.

Specifically, in this embodiment, when the acoustic model and the style code extraction network are trained, a speech reconstruction task is constructed and its loss Lrecon is defined. In addition, in order to generate different styles, a diversity loss Ldiverse is constructed. The optimization target of the style code mapping network is then built from the speech reconstruction loss Lrecon and the diversity loss Ldiverse.

The style loss function is:

Lstyle = ‖sF − sE‖1

wherein Lstyle is the style loss function, sF is the first style code, and sE is the second style code.
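The losses above can be sketched as follows. The L1 form of the reconstruction and diversity losses follows the definitions given earlier, while the way they are combined into the mapping network's optimization target, and the weight `lam`, are illustrative assumptions:

```python
import torch

def reconstruction_loss(mel_predict: torch.Tensor, mel_source: torch.Tensor) -> torch.Tensor:
    """Lrecon: L1 distance between the first and second Mel spectra produced by the acoustic model."""
    return torch.mean(torch.abs(mel_predict - mel_source))

def diversity_loss(mel_s1: torch.Tensor, mel_s2: torch.Tensor) -> torch.Tensor:
    """Ldiverse: L1 distance between Mel spectra generated from two different emotion codes."""
    return torch.mean(torch.abs(mel_s1 - mel_s2))

def style_loss(s_f: torch.Tensor, s_e: torch.Tensor) -> torch.Tensor:
    """Lstyle = ||sF - sE||1 between the mapped (first) and extracted (second) style codes."""
    return torch.mean(torch.abs(s_f - s_e))

def mapping_objective(mel_predict, mel_source, mel_s1, mel_s2, lam: float = 1.0) -> torch.Tensor:
    # Assumed combination: minimize reconstruction while encouraging diversity
    # (the diversity term is subtracted so that making it larger lowers the objective).
    return reconstruction_loss(mel_predict, mel_source) - lam * diversity_loss(mel_s1, mel_s2)
```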

According to the speech emotion conversion method of this embodiment, after the speech input by the user and the emotion information to be converted are obtained, the speech is input into the pre-trained acoustic model to extract a target Mel frequency spectrum of the speech; the trained style code extraction network is used to extract a target style code from the target Mel frequency spectrum; the speech is converted into emotion-free text content; the text content and the target style code are input into the aforementioned acoustic model to obtain a style-converted Mel frequency spectrum; and speech containing the emotion information is generated from the style-converted Mel frequency spectrum, thereby converting the emotion of the speech.

FIG. 2 is a functional block diagram of a speech emotion conversion apparatus according to an embodiment of the present invention. As shown in FIG. 2, the speech emotion conversion apparatus 20 includes a receiving module 21, a first input module 22, a style extraction module 23, a second input module 24, and a generation module 25.

The receiving module 21 is configured to receive a voice input by a user and emotion information selected by the user and needing to be converted, where the emotion information corresponds to a preset emotion code.

The first input module 22 is configured to input the speech into a pre-trained acoustic model to obtain a target mel spectrum.

And the style extraction module 23 is configured to input the target mel frequency spectrum and the emotion codes into the trained style code extraction network to obtain the target style codes.

And the second input module 24 is configured to extract text content from the speech, and input the text content and the target style code into the trained acoustic model together to obtain a mel frequency spectrum after style conversion.

And a generating module 25, configured to generate a speech including emotion information according to the style-converted mel frequency spectrum.

Optionally, the speech emotion conversion apparatus 20 further includes a training module configured to pre-train the acoustic model and the style code extraction network, and the pre-training operation of the training module specifically includes: constructing a style code mapping network with the same structure as the style code extraction network, together with an optimization target for the style code mapping network; acquiring the sample speech and preset emotion codes for the current training; inputting the emotion codes and pre-prepared random noise into the style code mapping network to obtain a first style code; analyzing the sample speech with the acoustic model to obtain speech features, generating a first Mel frequency spectrum from the speech features, and generating a second Mel frequency spectrum from the speech features and the first style code; inputting the second Mel frequency spectrum and the emotion codes into the style code extraction network to obtain a second style code; and updating the acoustic model through back propagation by combining the pre-constructed optimization target of the style code mapping network and the style loss function.

Optionally, the optimization target is constructed from a speech reconstruction loss and an emotion diversity loss:

Lrecon = ‖Melpredict − Melsource‖1,  Ldiverse = ‖Mels1 − Mels2‖1

wherein F is the style code mapping network, Lrecon is the speech reconstruction loss, Ldiverse is the emotion diversity loss, Melpredict is the first Mel frequency spectrum, Melsource is the second Mel frequency spectrum, s1 and s2 are first style codes generated by the style code mapping network from different emotion codes, and Mels1 and Mels2 are the second Mel frequency spectra generated by combining the first style codes corresponding to the different emotion codes with the speech features extracted by the acoustic model;

the style loss function is:

Lstyle = ‖sF − sE‖1

wherein Lstyle is the style loss function, sF is the first style code, and sE is the second style code.

Optionally, before the training module performs the operation of obtaining the sample speech and the preset emotion encoding of the current training, the training module is further configured to: and setting one corresponding emotion code in one-hot format for each emotion.

Optionally, the acoustic model is one of a Fastspeech model and a Fastspeech2 model.

Optionally, the style code extraction network includes a convolutional network and a neural network; the convolutional network extracts features from the target Mel frequency spectrum with the emotion code as a constraint, and the neural network generates the target style code from the features.

Optionally, the convolutional network comprises a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, and a third pooling layer, and the neural network comprises a first bi-directional LSTM layer and a second bi-directional LSTM layer.

For other details of the technical solution implemented by each module in the speech emotion converting apparatus in the above embodiment, reference may be made to the description of the speech emotion converting method in the above embodiment, and details are not described herein again.

It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the apparatus embodiment is basically similar to the method embodiment, its description is brief, and for relevant details reference may be made to the description of the method embodiment.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 3, the computer device 30 includes a processor 31 and a memory 32 coupled to the processor 31, wherein the memory 32 stores program instructions, and the program instructions, when executed by the processor 31, cause the processor 31 to perform the following steps:

receiving voice input by a user and emotion information which needs to be converted and is selected by the user, wherein each emotion corresponds to a preset emotion code;

inputting the voice into a pre-trained acoustic model to obtain a target Mel frequency spectrum;

inputting the target Mel frequency spectrum and the emotion codes into a trained style code extraction network to obtain target style codes;

extracting text content from the voice, and inputting the text content and the target style code into a trained acoustic model to obtain a Mel frequency spectrum after style conversion;

and generating voice comprising emotional information according to the Mel frequency spectrum after the style conversion.

The processor 31 may also be referred to as a CPU (Central Processing Unit). The processor 31 may be an integrated circuit chip having signal processing capabilities. The processor 31 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

Referring to fig. 4, fig. 4 is a schematic structural diagram of a storage medium according to an embodiment of the invention. A storage medium of an embodiment of the invention stores program instructions 41 that are capable of implementing all of the methods described above, the program instructions 41 when executed implement the steps of:

receiving voice input by a user and emotion information which needs to be converted and is selected by the user, wherein each emotion corresponds to a preset emotion code;

inputting the voice into a pre-trained acoustic model to obtain a target Mel frequency spectrum;

inputting the target Mel frequency spectrum and the emotion codes into a trained style code extraction network to obtain target style codes;

extracting text content from the voice, and inputting the text content and the target style code into a trained acoustic model to obtain a Mel frequency spectrum after style conversion;

and generating voice comprising emotional information according to the Mel frequency spectrum after the style conversion.

The program instructions 41 may be stored in the storage medium in the form of a software product, and include several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or various media capable of storing program codes, or a computer device such as a computer, a server, a mobile phone, or a tablet.

In the several embodiments provided in the present application, it should be understood that the disclosed computer apparatus, device and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. The above embodiments are merely examples and are not intended to limit the scope of the present disclosure, and all modifications, equivalents, and flow charts using the contents of the specification and drawings of the present disclosure or those directly or indirectly applied to other related technical fields are intended to be included in the scope of the present disclosure.
