Method, apparatus, device and medium for generating audio


Reading note: This technique, "Method, apparatus, device and medium for generating audio" (用于生成音频的方法、装置、设备和介质), was created by 殷翔 (Yin Xiang) on 2020-07-30. Its main content is as follows: Embodiments of the present disclosure disclose methods, apparatuses, devices and media for generating audio. One embodiment of the method for generating audio comprises: acquiring target Mandarin text information and timbre information of user speech audio uttered by a target user; converting the target Mandarin text information into target dialect text information corresponding thereto; and generating target speech audio based on the target dialect text information and the timbre information of the user speech audio, wherein the timbre of the target speech audio matches the timbre information of the user speech audio, and the target dialect text information indicates the text information corresponding to the target speech audio. This embodiment can convert Mandarin text into corresponding dialect speech audio that carries the timbre of speech audio uttered by the target user, thereby enriching the ways in which speech audio can be generated.

1. A method for generating audio, comprising:

acquiring target Mandarin text information and timbre information of user speech audio uttered by a target user;

converting the target Mandarin text information into target dialect text information corresponding thereto;

and generating target speech audio based on the target dialect text information and the timbre information of the user speech audio, wherein the timbre of the target speech audio matches the timbre information of the user speech audio, and the target dialect text information indicates the text information corresponding to the target speech audio.

2. The method of claim 1, wherein generating the target speech audio based on the target dialect text information and the timbre information of the user speech audio comprises:

extracting text feature information of the target dialect text information;

inputting the text feature information into a pre-trained encoder to obtain encoded text feature information;

inputting the encoded text feature information and the timbre information of the user speech audio into a pre-trained decoder to obtain Mel spectrum information;

and inputting the Mel spectrum information into a vocoder to obtain the target speech audio.

3. The method of claim 2, wherein the pre-trained encoder and the pre-trained decoder are trained by:

acquiring audio samples, provided by different users, that are labeled with Mel spectrum information;

inputting the audio samples into an encoder to be trained to obtain encoded audio samples;

inputting the encoded audio samples into a text feature information classifier and a timbre information classifier, respectively, to obtain classified text feature information and classified timbre information;

inputting the classified text feature information and the classified timbre information into a decoder to be trained to obtain predicted Mel spectrum information;

and adjusting parameters of the encoder and the decoder according to the deviation between the labeled Mel spectrum information and the predicted Mel spectrum information until the deviation satisfies a preset condition, thereby obtaining the trained encoder and decoder.

4. The method of claim 1, wherein the timbre information of the user speech audio is derived based on audio data provided by the target user and a pre-trained timbre encoder.

5. The method of claim 1, wherein generating the target speech audio based on the target dialect text information and the timbre information of the user speech audio comprises:

generating the target speech audio based on the target dialect text information, the timbre information of the user speech audio, and target speech style information, wherein the target speech style information indicates the style of the target speech audio.

6. The method of claim 5, wherein generating the target speech audio based on the target dialect text information, the timbre information of the user speech audio, and the target speech style information comprises:

extracting text feature information of the target dialect text information;

inputting the text feature information into a pre-trained encoder to obtain encoded text feature information;

inputting the encoded text feature information, the timbre information of the user speech audio, and the target speech style information into a pre-trained decoder to obtain Mel spectrum information;

and inputting the Mel spectrum information into a vocoder to obtain the target speech audio.

7. The method of claim 5, wherein the target speech style information is obtained by:

acquiring speech audio of a person having the speech style indicated by the target speech style information;

and inputting the speech audio of the person into a pre-trained speech style encoder to generate the target speech style information.

8. An apparatus for generating audio, comprising:

an acquisition unit configured to acquire target Mandarin text information and timbre information of user speech audio uttered by a target user;

a conversion unit configured to convert the target Mandarin text information into target dialect text information corresponding thereto;

and a generation unit configured to generate target speech audio based on the target dialect text information and the timbre information of the user speech audio, wherein the timbre of the target speech audio matches the timbre information of the user speech audio, and the target dialect text information indicates the text information corresponding to the target speech audio.

9. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

10. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.

Technical Field

Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for generating audio.

Background

Text-To-Speech (TTS), also known as speech synthesis, is a technology that converts text information into intelligible, fluent spoken Chinese and outputs it. Speech synthesis not only helps visually impaired people read information on a computer, but also increases the readability of text documents.

Existing general-purpose speech synthesis technology mainly records a single-timbre speech library in advance and then builds a speech synthesis system on that library; the synthesized speech therefore depends on the library, i.e., the synthesized voice sounds like the person who recorded it. This process only converts the text input by the user into speech with a single timbre, and attributes of the speech such as timbre and intonation are likewise very limited.

Disclosure of Invention

The present disclosure presents methods, apparatuses, devices and media for generating audio.

In a first aspect, an embodiment of the present disclosure provides a method for generating audio, the method including: acquiring target Mandarin text information and timbre information of user speech audio uttered by a target user; converting the target Mandarin text information into target dialect text information corresponding thereto; and generating target speech audio based on the target dialect text information and the timbre information of the user speech audio, wherein the timbre of the target speech audio matches the timbre information of the user speech audio, and the target dialect text information indicates the text information corresponding to the target speech audio.

In some embodiments, generating the target speech audio based on the target dialect text information and the timbre information of the user speech audio includes: extracting text feature information of the target dialect text information; inputting the text feature information into a pre-trained encoder to obtain encoded text feature information; inputting the encoded text feature information and the timbre information of the user speech audio into a pre-trained decoder to obtain Mel spectrum information; and inputting the Mel spectrum information into a vocoder to obtain the target speech audio.

In some embodiments, the pre-trained encoder and the pre-trained decoder are trained by: acquiring audio samples, provided by different users, that are labeled with Mel spectrum information; inputting the audio samples into an encoder to be trained to obtain encoded audio samples; inputting the encoded audio samples into a text feature information classifier and a timbre information classifier, respectively, to obtain classified text feature information and classified timbre information; inputting the classified text feature information and the classified timbre information into a decoder to be trained to obtain predicted Mel spectrum information; and adjusting parameters of the encoder and the decoder according to the deviation between the labeled Mel spectrum information and the predicted Mel spectrum information until the deviation satisfies a preset condition, thereby obtaining the trained encoder and decoder.

In some embodiments, the timbre information of the user speech audio is derived based on audio data provided by the target user and a pre-trained timbre encoder.

In some embodiments, generating the target speech audio based on the target dialect text information and the timbre information of the user speech audio includes: generating the target speech audio based on the target dialect text information, the timbre information of the user speech audio, and target speech style information, wherein the target speech style information indicates the style of the target speech audio.

In some embodiments, generating the target speech audio based on the target dialect text information, the timbre information of the user speech audio, and the target speech style information includes: extracting text feature information of the target dialect text information; inputting the text feature information into a pre-trained encoder to obtain encoded text feature information; inputting the encoded text feature information, the timbre information of the user speech audio, and the target speech style information into a pre-trained decoder to obtain Mel spectrum information; and inputting the Mel spectrum information into a vocoder to obtain the target speech audio.

In some embodiments, the target speech style information is obtained by: acquiring speech audio of a person having the speech style indicated by the target speech style information; and inputting the speech audio of the person into a pre-trained speech style encoder to generate the target speech style information.

In a second aspect, an embodiment of the present disclosure provides an apparatus for generating audio, the apparatus including: an acquisition unit configured to acquire target Mandarin text information and timbre information of user speech audio uttered by a target user; a conversion unit configured to convert the target Mandarin text information into target dialect text information corresponding thereto; and a generation unit configured to generate target speech audio based on the target dialect text information and the timbre information of the user speech audio, wherein the timbre of the target speech audio matches the timbre information of the user speech audio, and the target dialect text information indicates the text information corresponding to the target speech audio.

In a third aspect, embodiments of the present disclosure provide an electronic device for generating audio, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the method of any of the embodiments of the method for generating audio as described above.

In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium for generating audio, on which a computer program is stored, which when executed by a processor, implements the method of any of the embodiments of the method for generating audio as described above.

The method, apparatus, device and medium for generating audio provided by the embodiments of the present disclosure acquire target Mandarin text information and timbre information of user speech audio uttered by a target user; convert the target Mandarin text information into corresponding target dialect text information; and generate target speech audio based on the target dialect text information and the timbre information of the user speech audio, wherein the timbre of the target speech audio matches the timbre information of the user speech audio and the target dialect text information indicates the text information corresponding to the target speech audio. Mandarin text can thus be converted into corresponding dialect speech audio that carries the timbre of speech audio uttered by the target user, enriching the ways in which speech audio can be generated.

Drawings

Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;

FIG. 2 is a flow diagram of one embodiment of a method for generating audio according to the present disclosure;

FIG. 3 is a schematic diagram of one application scenario of a method for generating audio according to the present disclosure;

FIG. 4 is a flow diagram of yet another embodiment of a method for generating audio according to the present disclosure;

FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for generating audio according to the present disclosure;

FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.

Detailed Description

The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the disclosure and are not limiting of the disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.

It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 illustrates an exemplary system architecture 100 of an embodiment of a method for generating audio or an apparatus for generating audio to which embodiments of the present disclosure may be applied.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to receive or transmit data (e.g., target Mandarin text information and timbre information of user speech audio uttered by the target user). The terminal devices 101, 102, 103 may have various client applications installed thereon, such as audio playing software, music processing applications, news information applications, image processing applications, web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, and social platform software.

The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices with information processing capabilities, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers. When they are software, they may be installed in the electronic devices listed above and implemented either as multiple pieces of software or software modules (e.g., software or software modules for providing a generated-audio service) or as a single piece of software or software module. No specific limitation is imposed here.

The server 105 may be a server providing various services, for example a background audio processing server that generates the target speech audio based on the Mandarin text information provided by the target user and the user speech audio uttered by the target user, both transmitted by the terminal devices 101, 102, 103. Optionally, the background audio processing server may also feed the generated target speech audio back to the terminal device so that the terminal device can play it. As an example, the server 105 may be a cloud server.

The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers or as a single server. When the server is software, it may be implemented either as multiple pieces of software or software modules (e.g., software or software modules for providing a generated-audio service) or as a single piece of software or software module. No specific limitation is imposed here.

It should be further noted that the method for generating audio provided by the embodiments of the present disclosure may be executed by the server, by the terminal device, or by the server and the terminal device in cooperation. Accordingly, the parts (e.g., units, sub-units, modules, and sub-modules) of the apparatus for generating audio may all be disposed in the server, all in the terminal device, or distributed between the server and the terminal device.

It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there may be any number of each, as required by the implementation. When the electronic device on which the method for generating audio runs does not need to exchange data with other electronic devices, the system architecture may include only that electronic device (e.g., a server or terminal device).

With continued reference to fig. 2, a flow 200 of one embodiment of a method for generating audio in accordance with the present disclosure is shown. The method for generating audio comprises the following steps:

Step 201, acquiring target Mandarin text information and timbre information of user speech audio uttered by a target user.

In this embodiment, the execution body of the method for generating audio (e.g., the server or terminal device shown in FIG. 1) may acquire the target Mandarin text information and the timbre information of the user speech audio uttered by the target user from other electronic devices, via a wired or wireless connection, or locally.

Here, the target user may be any user, and the user speech audio may be the audio of any speech uttered by the target user, for example the audio of a song the target user sings or of speech the target user utters during a conversation.

Here, the timbre information of the user speech audio may be obtained through a pre-trained timbre information generation model, which may be trained based on speech audio samples labeled with timbre information.

Specifically, the execution body may input the user speech audio into the pre-trained timbre information generation model to generate the timbre information of the user speech audio.

Optionally, the timbre information generation model may also be a model obtained by training with an unsupervised machine learning algorithm.

In some alternatives, the timbre information of the user speech audio is derived based on audio data provided by the target user and a pre-trained timbre encoder.

In this implementation, the timbre encoder is configured to capture the timbre features of the speech audio data provided by the target user. These timbre features are independent of the text features of the speech audio and of the speaker's unique style features, and the output of the pre-trained timbre encoder may take the form of an embedding vector.

Obtaining the timbre information of the user speech audio from the audio data provided by the target user and a pre-trained timbre encoder thus captures the timbre features of the speech audio better, improving the accuracy of the obtained timbre information.
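As a concrete illustration (not the patent's specified architecture), the PyTorch sketch below shows one common way such a timbre encoder can be realized: an LSTM over mel frames followed by mean pooling and L2 normalization, in the spirit of speaker-verification d-vectors. The class name, dimensions, and architecture are assumptions.

```python
import torch
import torch.nn as nn

class TimbreEncoder(nn.Module):
    """Maps a mel spectrogram of user speech to a fixed-length timbre embedding."""

    def __init__(self, n_mels: int = 80, hidden: int = 256, embed_dim: int = 128):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels)
        frames, _ = self.rnn(mel)
        # Average over time so the embedding does not depend on clip length,
        # then L2-normalize so it captures voice identity rather than level.
        embedding = self.proj(frames.mean(dim=1))
        return embedding / embedding.norm(dim=-1, keepdim=True)

# A ~3-second clip: 300 frames of an 80-bin mel spectrogram.
timbre = TimbreEncoder()(torch.randn(1, 300, 80))  # shape: (1, 128)
```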

Step 202, converting the target Mandarin text information into target dialect text information corresponding thereto.

In this embodiment, the execution body converts the acquired target Mandarin text information into the target dialect text information corresponding thereto.

The target dialect text information may be text information of any dialect, such as the Shanxi, Hunan, or Sichuan dialect, which is not limited in this application.

Here, the execution body may convert the target Mandarin text information into the corresponding target dialect text information in either of two ways: using a pre-trained dialect text conversion model, trained based on Mandarin text samples labeled with the corresponding dialect text information; or looking up the dialect text information corresponding to the Mandarin text information in a preset lookup table of Mandarin text information and dialect text information. This is not limited in the present application.

As an example, the Mandarin text information "open that drawer" would be converted into the phrasing a speaker of the target dialect would actually use for the same request.
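A minimal sketch of the lookup-table variant described above; the table entries are illustrative stand-ins (a production table would be far larger and curated by dialect speakers), and longest-match-first substitution is just one simple policy.

```python
# Hypothetical Mandarin -> Sichuan-dialect word substitutions (illustrative only).
MANDARIN_TO_DIALECT = {
    "什么": "啥子",      # "what"
    "没有": "没得",      # "not have"
    "聊天": "摆龙门阵",  # "to chat"
}

def mandarin_to_dialect(text: str) -> str:
    """Rewrite Mandarin text using the lookup table, longest entries first."""
    for mandarin in sorted(MANDARIN_TO_DIALECT, key=len, reverse=True):
        text = text.replace(mandarin, MANDARIN_TO_DIALECT[mandarin])
    return text

print(mandarin_to_dialect("你在干什么"))  # -> "你在干啥子"
```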

Step 203, generating target speech audio based on the target dialect text information and the timbre information of the user speech audio.

In this embodiment, the execution body may generate the target speech audio from the target dialect text information and the timbre information of the user speech audio. The timbre of the target speech audio matches the timbre information of the user speech audio, and the target dialect text information indicates the text information corresponding to the target speech audio.

As an example, the execution body may input the target dialect text information and the timbre information of the user speech audio into a pre-trained audio generation model to generate the target speech audio. The audio generation model may be trained based on sample data that comprises target dialect text information and timbre information of user speech audio and is labeled with the corresponding speech audio.

In some alternatives, generating the target speech audio based on the target dialect text information and the timbre information of the user speech audio includes: generating the target speech audio based on the target dialect text information, the timbre information of the user speech audio, and target speech style information.

In this implementation, the execution body may generate the target speech audio based on the target dialect text information, the timbre information of the user speech audio, and the target speech style information acquired in the above steps. The speech style of the target speech audio is the style indicated by the target speech style information.

Here, speech style information (including the target speech style information) may characterize the style of speech; for example, it may characterize at least one of: speaking rate, rhythm, intonation, stress, and articulation.

The execution body may input speech audio into a pre-trained speech style information generation model to obtain the speech style information of that audio. The speech style information generation model may be trained based on speech audio samples labeled with speech style information.

Generating the target speech audio based on the target dialect text information, the timbre information of the user speech audio, and the target speech style information lets the generated dialect audio additionally carry the style information of a speaker of that dialect, further improving the naturalness and fluency of the generated dialect audio.

In some alternatives, the target speech style information is obtained by: acquiring speech audio of a person having the speech style indicated by the target speech style information; and inputting the speech audio of the person into a pre-trained speech style encoder to generate the target speech style information.

In this implementation, the execution body may obtain the speech audio of a person having the speech style indicated by the target speech style information and input that audio into the pre-trained speech style encoder to obtain the target speech style information.

The speech style encoder captures the style features of the input speech audio. These style features are independent of the text features of the speech audio and of the speaker's unique timbre features, and the output of the pre-trained style encoder may take the form of an embedding vector.

As an example, if the target speech audio is Sichuan-dialect speech audio, the execution body may use the style information of a Sichuan speaker's speech audio as the target speech style information: it obtains Sichuan-dialect speech audio provided by a Sichuan speaker and inputs it into the pre-trained speech style encoder to obtain the target speech style information.

This implementation obtains the speech audio of a person having the speech style indicated by the target speech style information and inputs it into a pre-trained speech style encoder to generate the target speech style information; this captures the style features of the speech audio better and improves the accuracy of the obtained target speech style information.
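A sketch of such a speech style encoder, under assumed names and dimensions; it mirrors the timbre encoder as a reference encoder over mel frames, except that here the final recurrent state is taken as the style embedding.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Encodes a reference speaker's mel spectrogram into a style embedding."""

    def __init__(self, n_mels: int = 80, embed_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, embed_dim, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels); the final hidden state summarizes
        # prosodic traits (rate, rhythm, intonation, stress) of the clip.
        _, last_hidden = self.rnn(mel)
        return last_hidden[-1]  # shape: (batch, embed_dim)

# E.g., feed a Sichuan speaker's recording to obtain target style information.
style = StyleEncoder()(torch.randn(1, 400, 80))  # shape: (1, 128)
```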

With continued reference to FIG. 3, FIG. 3 is a schematic diagram of one application scenario of the method for generating audio according to this embodiment. In the application scenario of FIG. 3, the server 301 first acquires the target Mandarin text information 302 (e.g., "open that drawer") and the timbre information 303 of the user speech audio uttered by the target user; it then converts the acquired target Mandarin text information into the target dialect text information 304 (e.g., the dialect rendering of "open that drawer"); finally, it generates the target speech audio 305 (e.g., dialect audio whose timbre matches the timbre information 303 of the user speech audio) based on the target dialect text information and the timbre information of the user speech audio, where the target dialect text information 304 indicates the text information corresponding to the target speech audio.

In the method provided by the above embodiment of the present disclosure, target Mandarin text information and timbre information of user speech audio uttered by a target user are acquired; the target Mandarin text information is converted into corresponding target dialect text information; and target speech audio is generated based on the target dialect text information and the timbre information of the user speech audio, where the timbre of the target speech audio matches the timbre information of the user speech audio and the target dialect text information indicates the text information corresponding to the target speech audio. Mandarin text is thus converted into corresponding dialect speech audio that carries the timbre of speech audio uttered by the target user, enriching the ways in which speech audio can be generated.

With further reference to fig. 4, a flow 400 of yet another embodiment of a method for generating audio is shown. The flow 400 of the method for generating audio comprises the steps of:

Step 401, acquiring target Mandarin text information and timbre information of user speech audio uttered by a target user.

In this embodiment, step 401 is substantially the same as step 201 in the corresponding embodiment of fig. 2, and is not described here again.

Step 402, converting the target Mandarin text information into target dialect text information corresponding thereto.

In this embodiment, step 402 is substantially the same as step 202 in the corresponding embodiment of fig. 2, and is not described herein again.

Step 403, extracting text feature information of the target dialect text information.

In this embodiment, the execution body may extract the text feature information of the target dialect text information by performing text analysis on it.

The text feature information may include phonemes, tones, word segmentation, prosody, phrases, and the like.

Specifically, the text analysis process may include: converting the text information into its corresponding phonemes through G2P (Grapheme-to-Phoneme) conversion, determining its word segmentation through a word segmentation prediction model, and determining its prosodic phrases through a prosodic phrase prediction model, where the prosodic phrase prediction model may be trained based on text information samples labeled with prosodic phrases, and the word segmentation prediction model may be trained based on text information samples labeled with word segmentation.

Here, the prosodic phrase prediction model and the word segmentation prediction model may be implemented based on statistical machine learning algorithms, for example the HMM (Hidden Markov Model) algorithm or the SVM (Support Vector Machine) algorithm, which is not limited in this application.

As an example, if the target dialect text information says, roughly, "the strong wind a couple of days ago kicked up dust all over the front of my house", the corresponding text feature information may be "qian2liang3tian1 | da4feng1 | gua1de | wo3jia1 | qian2mian4 | bao4tu3yang2chang2 | de".
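The fragment below sketches how such a feature string can be assembled; pypinyin supplies real tone-numbered G2P for Chinese, while the hand-supplied phrase split stands in for the trained word-segmentation and prosodic-phrase models.

```python
from pypinyin import Style, lazy_pinyin  # a real, widely used Chinese G2P library

def text_features(prosodic_phrases: list[str]) -> str:
    """Join tone-numbered pinyin of each prosodic phrase with '|' boundaries.

    `prosodic_phrases` stands in for the output of the trained word-segmentation
    and prosodic-phrase prediction models; here the split is given by hand.
    """
    return " | ".join(
        "".join(lazy_pinyin(phrase, style=Style.TONE3)) for phrase in prosodic_phrases
    )

# Hand-segmented stand-in for the prosody model's output on the example sentence.
print(text_features(["前两天", "大风", "刮得", "我家", "前面", "暴土扬场", "的"]))
# Expected (up to pypinyin's tone choices for neutral-tone characters):
# qian2liang3tian1 | da4feng1 | gua1de | wo3jia1 | qian2mian4 | bao4tu3yang2chang2 | de
```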

Step 404, inputting the text feature information into a pre-trained encoder to obtain encoded text feature information.

In this embodiment, the pre-trained encoder is used to encode the text feature information. The encoder may be implemented based on neural networks available now or developed in the future, for example CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory) networks, GRU (Gated Recurrent Unit), BGRU (Bidirectional Gated Recurrent Unit), and the like, which is not limited in this application.

Preferably, the encoder may be implemented based on a hybrid CNN + BGRU neural network, in which the convolutional layers can learn deeper text features.
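A minimal PyTorch sketch of such a CNN + BGRU text encoder, with assumed vocabulary size and dimensions: convolution layers over embedded token ids, then a bidirectional GRU.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """CNN + bidirectional-GRU encoder over phoneme/tone/boundary token ids."""

    def __init__(self, vocab: int = 100, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.convs = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Bidirectional GRU: dim//2 per direction keeps the output width at dim.
        self.bgru = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) integer ids of phonemes/tones/boundaries.
        x = self.embed(tokens).transpose(1, 2)   # (batch, dim, seq) for Conv1d
        x = self.convs(x).transpose(1, 2)        # back to (batch, seq, dim)
        encoded, _ = self.bgru(x)                # (batch, seq, dim)
        return encoded

encoded = TextEncoder()(torch.randint(0, 100, (1, 50)))  # shape: (1, 50, 256)
```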

Step 405, inputting the encoded text feature information and the timbre information of the user speech audio into a pre-trained decoder to obtain Mel spectrum information.

In this embodiment, the execution body may input the encoded text feature information and the timbre information of the user speech audio into a pre-trained decoder to obtain Mel spectrum information.

The pre-trained decoder may be autoregressive or non-autoregressive; compared with other forms of decoder, an autoregressive decoder can better exploit the dependencies of speech audio across different time scales, improving the generation quality of the target speech audio.

Furthermore, it should be noted that the encoder and the decoder may be connected directly through an attention mechanism model.
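A condensed sketch of an attention-connected autoregressive decoder of this kind: the timbre embedding is broadcast across the encoder outputs, a dot-product attention summarizes them for each output frame, and each predicted mel frame conditions the next step. Production systems add prenets, stop-token prediction, and location-sensitive attention; all names and dimensions here are assumptions.

```python
import torch
import torch.nn as nn

class ARDecoder(nn.Module):
    """Autoregressive mel decoder conditioned on text encodings and a timbre embedding."""

    def __init__(self, enc_dim: int = 256, timbre_dim: int = 128, n_mels: int = 80):
        super().__init__()
        self.n_mels = n_mels
        cond = enc_dim + timbre_dim
        self.query = nn.Linear(n_mels, cond)            # previous frame -> attention query
        self.cell = nn.LSTMCell(cond + n_mels, 512)
        self.to_mel = nn.Linear(512, n_mels)

    def forward(self, enc: torch.Tensor, timbre: torch.Tensor, steps: int) -> torch.Tensor:
        # enc: (batch, seq, enc_dim); timbre: (batch, timbre_dim)
        b, s, _ = enc.shape
        # Broadcast the timbre embedding across text positions as conditioning.
        memory = torch.cat([enc, timbre.unsqueeze(1).expand(b, s, -1)], dim=-1)
        frame = enc.new_zeros(b, self.n_mels)           # all-zero <GO> frame
        h = c = enc.new_zeros(b, 512)
        frames = []
        for _ in range(steps):
            # Dot-product attention over the conditioned encoder memory.
            weights = torch.bmm(memory, self.query(frame).unsqueeze(-1)).softmax(dim=1)
            context = (weights * memory).sum(dim=1)
            h, c = self.cell(torch.cat([context, frame], dim=-1), (h, c))
            frame = self.to_mel(h)                      # next mel frame
            frames.append(frame)
        return torch.stack(frames, dim=1)               # (batch, steps, n_mels)

mel = ARDecoder()(torch.randn(1, 50, 256), torch.randn(1, 128), steps=200)
```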

Step 406, inputting the Mel spectrum information into the vocoder to obtain the target speech audio.

In this embodiment, the execution body inputs the Mel spectrum information obtained in the above steps into the vocoder to obtain the target speech audio. The vocoder characterizes the correspondence between Mel spectrum information and speech audio.
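The patent does not fix a particular vocoder. As a runnable stand-in, the sketch below inverts a mel spectrogram with Griffin-Lim via librosa; a trained neural vocoder would replace `mel_to_wav` with a model call of the same shape.

```python
import numpy as np
import librosa
import soundfile as sf

def mel_to_wav(mel: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Invert a mel spectrogram (n_mels, frames) to a waveform via Griffin-Lim."""
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=1024, hop_length=256)

# Round-trip demo on a synthetic tone: waveform -> mel -> reconstructed waveform.
t = np.linspace(0, 2.0, int(2.0 * 22050), endpoint=False)
wav = 0.5 * np.sin(2 * np.pi * 220.0 * t).astype(np.float32)
mel = librosa.feature.melspectrogram(
    y=wav, sr=22050, n_fft=1024, hop_length=256, n_mels=80)
sf.write("reconstructed.wav", mel_to_wav(mel), 22050)
```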

In some alternatives, the pre-trained encoder and the pre-trained decoder are trained by: acquiring audio samples, provided by different users, that are labeled with Mel spectrum information; inputting the audio samples into an encoder to be trained to obtain encoded audio samples; inputting the encoded audio samples into a text feature information classifier and a timbre information classifier, respectively, to obtain classified text feature information and classified timbre information; inputting the classified text feature information and the classified timbre information into a decoder to be trained to obtain predicted Mel spectrum information; and adjusting parameters of the encoder and the decoder according to the deviation between the labeled Mel spectrum information and the predicted Mel spectrum information until the deviation satisfies a preset condition, thereby obtaining the trained encoder and decoder.

In this implementation, the pre-trained encoder and the pre-trained decoder are trained as follows. First, audio samples provided by different users and labeled with Mel spectrum information are acquired; the samples provided by different users may be audio samples of different dialects.

As an example, the audio samples of different dialects may include Mandarin audio provided by the target user plus one dialect audio sample from each of five dialect speakers, e.g., the Shanxi, Hunan, Sichuan, Northeastern, and Guangdong dialects, two hours each.

Then, the audio samples are input into the encoder to be trained to obtain encoded audio samples, and the encoded audio samples are input into a text feature information classifier and a timbre information classifier, respectively, to obtain classified text feature information and classified timbre information.

Here, the text feature information classifier ensures, by means of supervised learning, that the encoder encodes only text feature information; the timbre information classifier likewise ensures that the encoder encodes only timbre feature information.

Further, the execution body may combine the classified text feature information and the classified timbre information and input them into the decoder to be trained to obtain predicted Mel spectrum information, compute the deviation between the predicted Mel spectrum information and the labeled Mel spectrum information, and adjust the parameters of the encoder and the decoder according to the deviation until it satisfies a preset condition, thereby obtaining the trained encoder and decoder.

Furthermore, it should be noted that the encoded audio samples may also be input into a style information classifier, which ensures, by means of supervised learning, that the encoder encodes only style feature information.

In that case, the execution body may combine the classified text feature information, the classified timbre information, and the classified style information and input them into the decoder to be trained to obtain predicted Mel spectrum information, compute the deviation between the predicted and labeled Mel spectrum information, and adjust the parameters of the encoder and the decoder according to the deviation until it satisfies the preset condition, thereby obtaining the trained encoder and decoder.

This implementation trains the encoder and decoder to be trained on audio samples, provided by different users, that are labeled with Mel spectrum information, so that the trained encoder and decoder can learn the characteristics of the text feature information and timbre information of different dialect audio, which helps improve their generalization ability.
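A sketch of one such training step under assumed module shapes: the encoder output feeds the two supervised classifier heads, the decoder predicts mel frames, and an L1 reconstruction loss against the labeled mel spectrum drives both networks. The 0.1 loss weights are assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(encoder, decoder, text_clf, timbre_clf, optimizer, batch):
    """One optimization step over a batch of labeled dialect audio samples."""
    audio, mel_target, text_labels, speaker_ids = batch

    encoded = encoder(audio)                        # (batch, frames, dim)

    # Supervised classifiers keep the encoding disentangled: one head must
    # recover per-frame text labels, the other the speaker (timbre) identity.
    text_loss = F.cross_entropy(
        text_clf(encoded).transpose(1, 2), text_labels)
    timbre_loss = F.cross_entropy(
        timbre_clf(encoded.mean(dim=1)), speaker_ids)

    mel_pred = decoder(encoded)                     # (batch, frames, n_mels)
    recon_loss = F.l1_loss(mel_pred, mel_target)

    loss = recon_loss + 0.1 * text_loss + 0.1 * timbre_loss  # assumed weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```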

In some alternatives, generating the target speech audio based on the target dialect text information, the timbre information of the user speech audio, and the target speech style information includes: extracting text feature information of the target dialect text information; inputting the text feature information into a pre-trained encoder to obtain encoded text feature information; inputting the encoded text feature information, the timbre information of the user speech audio, and the target speech style information into a pre-trained decoder to obtain Mel spectrum information; and inputting the Mel spectrum information into a vocoder to obtain the target speech audio.

In this implementation, the execution body may extract the text feature information of the target dialect text information by performing text analysis on it.

Then, the text feature information is input into the pre-trained encoder to obtain encoded text feature information, and the encoded text feature information, the timbre information of the user speech audio, and the target speech style information are input into the pre-trained decoder to obtain Mel spectrum information.

Here, the timbre information of the user speech audio may be obtained with the pre-trained timbre encoder, and the target speech style information with the pre-trained style encoder.

Finally, the execution body inputs the Mel spectrum information obtained in the above steps into the vocoder to obtain the target speech audio. The vocoder characterizes the correspondence between Mel spectrum information and speech audio.

In this implementation, Mel spectrum information is obtained from a pre-trained decoder fed with the encoded text feature information, the timbre information of the speech audio, and the target style information, and is then input into a vocoder to obtain the target speech audio. The target speech audio thus fully combines the target dialect text information, the timbre information of the speech audio, and the target style information, and generating it with a vocoder improves its accuracy, making it closer to real speech audio and the synthesis effect more natural.
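Putting the pieces together, the sketch below shows the inference path of this style-conditioned variant; every name is a placeholder for the trained modules described above, and the decoder is assumed to accept the concatenated timbre and style embeddings as its conditioning vector.

```python
import torch

def synthesize(text_encoder, decoder, vocoder, timbre_encoder, style_encoder,
               dialect_tokens: torch.Tensor,
               user_mel: torch.Tensor,
               style_ref_mel: torch.Tensor) -> torch.Tensor:
    """Dialect token ids + user timbre + reference style -> waveform."""
    encoded = text_encoder(dialect_tokens)          # encoded text features
    timbre = timbre_encoder(user_mel)               # "who it sounds like"
    style = style_encoder(style_ref_mel)            # "how it is delivered"
    # The decoder consumes all three conditioning signals and emits mel frames.
    mel = decoder(encoded, torch.cat([timbre, style], dim=-1), steps=400)
    return vocoder(mel)                             # mel -> audio samples
```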

As can be seen from FIG. 4, compared with the embodiment corresponding to FIG. 2, the flow 400 of the method for generating audio in this embodiment highlights encoding the text feature information with the encoder, decoding the encoded text feature information together with the timbre information of the user speech audio with the decoder, and obtaining the target speech audio with a vocoder. The scheme described in this embodiment therefore makes the target speech audio fully combine the target dialect text information with the timbre information of the speech audio, and generating the target speech audio with a vocoder improves its accuracy, bringing it closer to real speech audio.

With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for generating audio that corresponds to the method embodiment shown in FIG. 2. In addition to the features described below, the apparatus embodiment may include features the same as or corresponding to those of the method embodiment shown in FIG. 2 and produce the same or corresponding effects. The apparatus may be applied to various electronic devices.

As shown in FIG. 5, the apparatus 500 for generating audio of this embodiment includes: an acquisition unit 501 configured to acquire target Mandarin text information and timbre information of user speech audio uttered by a target user; a conversion unit 502 configured to convert the target Mandarin text information into target dialect text information corresponding thereto; and a generation unit 503 configured to generate target speech audio based on the target dialect text information and the timbre information of the user speech audio, wherein the timbre of the target speech audio matches the timbre information of the user speech audio, and the target dialect text information indicates the text information corresponding to the target speech audio.

In this embodiment, the acquisition unit 501 of the apparatus 500 for generating audio may acquire the target Mandarin text information and the timbre information of the user speech audio uttered by the target user from other electronic devices, via a wired or wireless connection, or locally.

In this embodiment, the generation unit 503 may generate the target speech audio based on the timbre information of the user speech audio uttered by the target user, acquired by the acquisition unit 501, and the target dialect text information obtained by the conversion unit 502. The timbre of the target speech audio matches the timbre information of the user speech audio, and the target dialect text information indicates the text information corresponding to the target speech audio.

In some optional implementations of this embodiment, the generation unit 503 includes: a first generation subunit (not shown) configured to extract text feature information of the target dialect text information; a second generation subunit (not shown) configured to input the text feature information into a pre-trained encoder to obtain encoded text feature information; a third generation subunit (not shown) configured to input the encoded text feature information and the timbre information of the user speech audio into a pre-trained decoder to obtain Mel spectrum information; and a fourth generation subunit (not shown) configured to input the Mel spectrum information into a vocoder to obtain the target speech audio.

In some optional implementations of this embodiment, the pre-trained encoder and the pre-trained decoder are trained by: acquiring audio samples, provided by different users, that are labeled with Mel spectrum information; inputting the audio samples into an encoder to be trained to obtain encoded audio samples; inputting the encoded audio samples into a text feature information classifier and a timbre information classifier, respectively, to obtain classified text feature information and classified timbre information; inputting the classified text feature information and the classified timbre information into a decoder to be trained to obtain predicted Mel spectrum information; and adjusting parameters of the encoder and the decoder according to the deviation between the labeled Mel spectrum information and the predicted Mel spectrum information until the deviation satisfies a preset condition, thereby obtaining the trained encoder and decoder.

In some optional implementations of this embodiment, the timbre information of the user speech audio is derived based on audio data provided by the target user and a pre-trained timbre encoder.

In some optional implementations of this embodiment, the generation unit is further configured to generate the target speech audio based on the target dialect text information, the timbre information of the user speech audio, and target speech style information, the target speech style information indicating the style of the target speech audio.

In some optional implementations of this embodiment, generating the target speech audio based on the target dialect text information, the timbre information of the user speech audio, and the target speech style information includes: extracting text feature information of the target dialect text information; inputting the text feature information into a pre-trained encoder to obtain encoded text feature information; inputting the encoded text feature information, the timbre information of the user speech audio, and the target speech style information into a pre-trained decoder to obtain Mel spectrum information; and inputting the Mel spectrum information into a vocoder to obtain the target speech audio.

In some optional implementations of this embodiment, the target speech style information is obtained by: acquiring speech audio of a person having the speech style indicated by the target speech style information; and inputting the speech audio of the person into a pre-trained speech style encoder to generate the target speech style information.

In the apparatus provided by the above embodiment of the present disclosure, the acquisition unit 501 acquires target Mandarin text information and timbre information of user speech audio uttered by a target user, the conversion unit 502 converts the target Mandarin text information into corresponding target dialect text information, and the generation unit 503 then generates target speech audio based on the target dialect text information and the timbre information of the user speech audio, where the timbre of the target speech audio matches the timbre information of the user speech audio and the target dialect text information indicates the text information corresponding to the target speech audio.

Referring now to fig. 6, a schematic diagram of an electronic device (e.g., the server or terminal device of fig. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The terminal device/server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in FIG. 6, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage means 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600. The processing means 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure.

It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Python, Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In accordance with one or more embodiments of the present disclosure, there is provided a method for generating audio, the method comprising: acquiring target Mandarin text information and tone color information of user voice audio sent by a target user; converting the target Mandarin text information into target dialect text information corresponding to the target Mandarin text information; and generating a target voice audio based on the target dialect text information and the tone color information of the user voice audio, wherein the tone color of the target voice audio is matched with the tone color information of the user voice audio, and the target dialect text information is used for indicating the text information corresponding to the target voice audio.

According to one or more embodiments of the present disclosure, the present disclosure provides a method for generating audio, in which generating target speech audio based on target dialect text information and timbre information of user speech audio includes: extracting text characteristic information of the text information of the target dialect; inputting the text characteristic information into a pre-trained encoder to obtain encoded text characteristic information; inputting the coded text characteristic information and the tone color information of the user voice audio into a pre-trained decoder to obtain Mel frequency spectrum information; and inputting the Mel frequency spectrum information into a vocoder to obtain the target voice audio.

According to one or more embodiments of the present disclosure, in the method for generating audio, the pre-trained encoder and the pre-trained decoder are trained by: acquiring audio samples, annotated with Mel frequency spectrum information, provided by different users; inputting each audio sample into an encoder to be trained to obtain an encoded audio sample; inputting the encoded audio sample into a text characteristic information classifier and a timbre information classifier, respectively, to obtain classified text characteristic information and classified timbre information; inputting the classified text characteristic information and the classified timbre information into a decoder to be trained to obtain predicted Mel frequency spectrum information; and adjusting parameters of the encoder and the decoder according to the deviation between the annotated Mel frequency spectrum information and the predicted Mel frequency spectrum information until the deviation satisfies a preset condition, thereby obtaining the trained encoder and decoder.
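
A sketch of this training loop is given below, with the networks treated as generic callables (note the encoder here consumes audio samples rather than text). The disclosure fixes the data flow (encoder, two classifiers, decoder, deviation on the Mel frequency spectrum information) but not the loss function or the preset stopping condition; the L1 loss and the fixed threshold used here are assumptions:

```python
import torch
import torch.nn as nn

def train_encoder_decoder(encoder, decoder, text_classifier, timbre_classifier,
                          dataloader, threshold=0.01, lr=1e-3):
    """Train until the Mel-spectrum deviation satisfies the preset condition."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    deviation = nn.L1Loss()                          # assumed deviation measure
    while True:
        total = 0.0
        for audio_sample, annotated_mel in dataloader:
            encoded = encoder(audio_sample)
            # Split the encoded sample into its two factors.
            text_info = text_classifier(encoded)     # classified text info
            timbre_info = timbre_classifier(encoded) # classified timbre info
            predicted_mel = decoder(text_info, timbre_info)
            loss = deviation(predicted_mel, annotated_mel)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total / len(dataloader) < threshold:      # preset condition (assumed)
            return encoder, decoder
```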

In accordance with one or more embodiments of the present disclosure, in the method for generating audio, the timbre information of the user voice audio is derived from audio data provided by the target user using a pre-trained timbre encoder.
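
For example, the pre-trained timbre encoder might pool the user's audio (represented as a Mel-spectrogram) into a fixed-length timbre vector, as in this hypothetical sketch:

```python
import torch.nn as nn

class TimbreEncoder(nn.Module):
    """Hypothetical timbre encoder: user audio frames -> timbre vector."""
    def __init__(self, n_mels=80, timbre_dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, timbre_dim, batch_first=True)

    def forward(self, user_mel):                   # (batch, frames, n_mels)
        _, hidden = self.rnn(user_mel)
        return hidden[-1]                          # (batch, timbre_dim)
```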

According to one or more embodiments of the present disclosure, in the method for generating audio, generating the target voice audio based on the target dialect text information and the timbre information of the user voice audio includes: generating the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and target voice style information, wherein the target voice style information indicates the style of the target voice audio.

According to one or more embodiments of the present disclosure, in the method for generating audio, generating the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and the target voice style information includes: extracting text characteristic information of the target dialect text information; inputting the text characteristic information into a pre-trained encoder to obtain encoded text characteristic information; inputting the encoded text characteristic information, the timbre information of the user voice audio, and the target voice style information into a pre-trained decoder to obtain Mel frequency spectrum information; and inputting the Mel frequency spectrum information into a vocoder to obtain the target voice audio.
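
Extending the earlier decoder sketch, the style conditioning can be illustrated by additionally concatenating a style vector; the concatenation mechanism is again an assumption, as the disclosure does not fix one:

```python
import torch
import torch.nn as nn

class StyledMelDecoder(nn.Module):
    """Decoder conditioned on both a timbre vector and a voice style vector."""
    def __init__(self, dim=128, timbre_dim=64, style_dim=64, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(dim + timbre_dim + style_dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, n_mels)

    def forward(self, encoded_text, timbre, style):
        cond = torch.cat([timbre, style], dim=-1)             # (batch, t+s)
        cond = cond.unsqueeze(1).expand(-1, encoded_text.size(1), -1)
        out, _ = self.rnn(torch.cat([encoded_text, cond], dim=-1))
        return self.proj(out)                      # Mel frequency spectrum info
```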

According to one or more embodiments of the present disclosure, in the method for generating audio, the target voice style information is obtained by: acquiring voice audio of a person whose voice style is the style indicated by the target voice style information; and inputting the voice audio of the person into a pre-trained voice style encoder to generate the target voice style information.
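
The voice style encoder can be sketched analogously to the timbre encoder above; this is again a hypothetical stand-in, not the disclosure's concrete model:

```python
import torch.nn as nn

class VoiceStyleEncoder(nn.Module):
    """Hypothetical style encoder: reference speaker audio -> style vector."""
    def __init__(self, n_mels=80, style_dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, style_dim, batch_first=True)

    def forward(self, reference_mel):              # (batch, frames, n_mels)
        _, hidden = self.rnn(reference_mel)
        return hidden[-1]                          # target voice style information
```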

In accordance with one or more embodiments of the present disclosure, there is provided an apparatus for generating audio, the apparatus comprising: an acquisition unit configured to acquire target Mandarin text information and timbre information of user voice audio uttered by a target user; a conversion unit configured to convert the target Mandarin text information into target dialect text information corresponding thereto; and a generation unit configured to generate target voice audio based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates the text information corresponding to the target voice audio.
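
As an illustration only, the three units could be organized as follows; the constituent callables are hypothetical placeholders for the models of the method embodiments:

```python
class AudioGenerationApparatus:
    """Sketch of the apparatus; each attribute backs one described unit."""
    def __init__(self, acquire_timbre, convert_to_dialect, synthesize):
        self.acquire_timbre = acquire_timbre          # acquisition unit
        self.convert_to_dialect = convert_to_dialect  # conversion unit
        self.synthesize = synthesize                  # generation unit

    def generate(self, mandarin_text, user_audio):
        timbre = self.acquire_timbre(user_audio)
        dialect_text = self.convert_to_dialect(mandarin_text)
        return self.synthesize(dialect_text, timbre)  # target voice audio
```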

According to one or more embodiments of the present disclosure, in the apparatus for generating audio, the generation unit includes: a first generation subunit configured to extract text characteristic information of the target dialect text information; a second generation subunit configured to input the text characteristic information into a pre-trained encoder to obtain encoded text characteristic information; a third generation subunit configured to input the encoded text characteristic information and the timbre information of the user voice audio into a pre-trained decoder to obtain Mel frequency spectrum information; and a fourth generation subunit configured to input the Mel frequency spectrum information into a vocoder to obtain the target voice audio.

In accordance with one or more embodiments of the present disclosure, in the apparatus for generating audio, the pre-trained encoder and the pre-trained decoder are trained by: acquiring audio samples, annotated with Mel frequency spectrum information, provided by different users; inputting each audio sample into an encoder to be trained to obtain an encoded audio sample; inputting the encoded audio sample into a text characteristic information classifier and a timbre information classifier, respectively, to obtain classified text characteristic information and classified timbre information; inputting the classified text characteristic information and the classified timbre information into a decoder to be trained to obtain predicted Mel frequency spectrum information; and adjusting parameters of the encoder and the decoder according to the deviation between the annotated Mel frequency spectrum information and the predicted Mel frequency spectrum information until the deviation satisfies a preset condition, thereby obtaining the trained encoder and decoder.

In accordance with one or more embodiments of the present disclosure, in the apparatus for generating audio, the timbre information of the user voice audio is obtained from audio data provided by the target user using a pre-trained timbre encoder.

According to one or more embodiments of the present disclosure, in the apparatus for generating audio provided by the present disclosure, the generation unit is further configured to generate the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and target voice style information, the target voice style information indicating the style of the target voice audio.

According to one or more embodiments of the present disclosure, in the apparatus for generating audio, generating the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and the target voice style information includes: extracting text characteristic information of the target dialect text information; inputting the text characteristic information into a pre-trained encoder to obtain encoded text characteristic information; inputting the encoded text characteristic information, the timbre information of the user voice audio, and the target voice style information into a pre-trained decoder to obtain Mel frequency spectrum information; and inputting the Mel frequency spectrum information into a vocoder to obtain the target voice audio.

According to one or more embodiments of the present disclosure, in the apparatus for generating audio, the target voice style information is obtained by: acquiring voice audio of a person whose voice style is the style indicated by the target voice style information; and inputting the voice audio of the person into a pre-trained voice style encoder to generate the target voice style information.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor including an acquisition unit and a generation unit. The names of these units do not, in some cases, limit the units themselves; for example, the acquisition unit may also be described as a "unit that acquires target Mandarin text information and timbre information of user voice audio uttered by a target user".

As another aspect, the present disclosure also provides a computer-readable medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire target Mandarin text information and timbre information of user voice audio uttered by a target user; convert the target Mandarin text information into target dialect text information corresponding thereto; and generate target voice audio based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates the text information corresponding to the target voice audio.

The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to the particular combinations of features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by interchanging the above features with (but not limited to) features having similar functions disclosed in this disclosure.
