Voice generation method and device, electronic equipment and readable storage medium

Document No.: 228387    Publication date: 2021-11-09

Note: This technology, "Voice generation method and device, electronic equipment and readable storage medium", was designed and created by 刘若澜, 卢春晖, 陈萧, 文学 and 楼晓雁 on 2020-05-15. Its main content is as follows: the embodiments of the present application provide a voice generation method, a voice generation device, electronic equipment and a readable storage medium; the voice generation method executed by the electronic equipment may be performed using an artificial intelligence model. The voice generation method includes: acquiring information to be processed; coding the information to be processed to obtain an information coding result; and generating, based on the information coding result, voice information of the target user corresponding to the target language. In the embodiments of the present application, the voice information is generated based on the voice feature of the target user, the information coding result and the target language feature, i.e., both the sound characteristics of the target user and the language characteristics of the target language are taken into account when generating the voice information, so the synthesized voice information better matches the timbre of the target user speaking the target language and the speech synthesis effect is improved.

1. A method of speech generation, comprising:

acquiring information to be processed;

coding the information to be processed to obtain an information coding result;

and generating voice information corresponding to the target language of the target user based on the information coding result.

2. The method according to claim 1, wherein if the information to be processed is a text to be processed, the generating the speech information corresponding to the target language for the target user based on the information encoding result comprises:

acquiring tone characteristics of the text to be processed;

and generating voice information corresponding to the target language of the target user based on the tone characteristics and the information coding result.

3. The method according to claim 1, wherein if the information to be processed is a text to be processed, the encoding the information to be processed to obtain an information encoding result includes:

acquiring phoneme characteristics corresponding to the text to be processed;

and carrying out text coding on the phoneme characteristics to obtain a text coding result.

4. The method of claim 3, wherein the text coding the phoneme features to obtain a text coding result comprises:

acquiring tone features corresponding to the text to be processed;

and performing text coding on the tone features and the phoneme features to obtain the text coding result.

5. The method of claim 1, wherein if the information to be processed is speech information to be processed, the encoding the information to be processed to obtain an information encoding result comprises:

acquiring a phoneme posterior probability corresponding to the voice information to be processed;

and coding the phoneme posterior probability to obtain an information coding result.

6. The method according to claim 5, wherein the obtaining the posterior probability of the phoneme corresponding to the speech information to be processed comprises:

acquiring phoneme posterior probability of the speech information to be processed corresponding to each candidate language;

and splicing the phoneme posterior probabilities corresponding to each candidate language to obtain the phoneme posterior probability corresponding to the voice information to be processed.

7. The method according to any one of claims 1 to 6, characterized in that the method is implemented by a speech generation model, wherein the speech generation model is obtained by:

acquiring an initial neural network model and training data, wherein the training data comprises training sample pairs, and the training sample pairs comprise sample input information, sample output voice information corresponding to the sample input information and a sample user information label;

the initial neural network model comprises an initial voice generation model and an initial user information classification module, the initial voice generation model comprises an initial coding module and an initial voice generation module, the initial user information classification module is connected with the initial coding module, the initial coding module is used for coding input information to obtain a sample information coding result, the initial voice generation module is used for obtaining predicted voice information based on the sample information coding result, and the initial user information classification module is used for obtaining predicted user information based on the sample information coding result;

training the initial neural network model based on the training data until the total loss function corresponding to the initial neural network model converges, to obtain a trained initial neural network model, and taking the trained initial speech generation model as the speech generation model;

the total loss function comprises a first loss function and a second loss function, wherein the value of the first loss function represents the difference between the predicted voice information corresponding to the sample input information and the sample output voice information, and the value of the second loss function represents the difference between the predicted user information corresponding to the sample input information and the sample user information label.

8. The method of claim 7, wherein the initial neural network model further comprises a pitch classifier coupled to the initial coding module, and wherein for the training sample pair, the training sample pair further comprises sample pitch features corresponding to sample input information;

the pitch classifier is used for obtaining a predicted pitch characteristic based on a sample information coding result;

the total loss function further includes a third loss function whose value characterizes a difference between the predicted pitch feature and the sample pitch feature.

9. The method of claim 7, wherein if the sample input information is sample input speech information, the initial neural network model further comprises an initial phoneme recognizer connected to the initial coding module, the initial phoneme recognizer being configured to determine a phoneme posterior probability corresponding to the sample input speech information;

the initial encoding module is specifically configured to, when encoding sample input information to obtain a sample information encoding result:

and coding the phoneme posterior probability corresponding to the sample input voice information to obtain a sample information coding result.

10. The method of claim 7, wherein the obtaining training data comprises:

acquiring initial training data, wherein the initial training data comprises initial sample pairs, the initial sample pairs comprise initial sample input information, initial sample output voice information corresponding to the initial sample input information, and user information labels corresponding to the initial sample input information;

performing voice adjustment processing on at least one piece of initial sample output voice information to obtain processed initial sample output voice information, and obtaining a user information label corresponding to the processed initial sample output voice information;

taking the initial training data, the processed initial sample output voice information and a user information label corresponding to the processed initial sample output voice information as the training data;

wherein the voice adjustment processing includes at least one of speed adjustment processing, pitch adjustment processing, and noise addition processing.

11. The method according to claim 10, wherein the initial sample output speech information includes initial sample output speech information corresponding to at least two different languages with different amounts of training data, wherein the amount of training data includes an amount of user data and/or an amount of speech data;

the performing voice adjustment processing on at least one piece of the initial sample output voice information includes:

performing voice adjustment processing on the initial sample output voice information corresponding to the language with less training data.

12. The method according to claim 11, wherein if the voice adjustment process includes the speed adjustment process and/or the pitch adjustment process, the user information tag corresponding to the processed initial sample output voice information is different from the user information tag corresponding to the initial sample output voice information;

and if the voice adjustment processing only comprises the noise addition processing, the user information label corresponding to the processed initial sample output voice information is the same as the user information label corresponding to the corresponding initial sample output voice information.

13. A speech synthesis apparatus, comprising:

the information acquisition module is used for acquiring information to be processed;

the information coding module is used for coding the information to be processed to obtain an information coding result;

and the voice generating module is used for generating the voice information of the target user corresponding to the target language based on the information coding result.

14. An electronic device comprising a memory and a processor;

the memory has stored therein a computer program;

the processor, when executing the computer program, is configured to perform the method of any of claims 1 to 12.

15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 12.

Technical Field

The present application relates to the field of speech synthesis technologies, and in particular, to a speech generation method, apparatus, electronic device, and readable storage medium.

Background

In real life, input information can be converted into voice information in different languages; for example, text in different languages can be converted into voice information with the timbres of different speakers. For instance, a Chinese-English speech generation system can, given a Chinese text and a designated target speaker, generate Chinese speech read aloud with the timbre of that target speaker, and can similarly generate English speech read aloud with the target speaker's timbre when English text is input.

However, although prior-art speech generation systems can realize multi-speaker, multi-language speech synthesis, the quality of the generated speech still needs to be improved.

Disclosure of Invention

An object of the embodiments of the present application is to provide a speech generation method, apparatus, electronic device and readable storage medium. The scheme provided by the embodiments of the present application is as follows:

in a first aspect, an embodiment of the present application provides a speech generation method, where the method includes:

acquiring information to be processed;

coding the information to be processed to obtain an information coding result;

and generating voice information corresponding to the target language of the target user based on the information coding result.

In a second aspect, an embodiment of the present application provides a speech generating apparatus, including:

the information acquisition module is used for acquiring information to be processed;

the information coding module is used for coding the information to be processed to obtain an information coding result;

and the voice generating module is used for generating the voice information of the target user corresponding to the target language based on the information coding result.

In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor; wherein the memory has stored therein a computer program; the processor is used for executing the method provided by the embodiment of the application when the computer program runs.

In a fourth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program performs the method provided by the present application.

The advantageous effects of the technical solutions provided in the embodiments of the present application will be described in detail in the following description of the specific embodiments with reference to the various alternative embodiments, and are not repeated here.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.

FIG. 1 shows a basic block diagram of prior art speech synthesis;

FIG. 2 is a flow chart of a method for generating speech according to an embodiment of the present application;

FIG. 3 is a schematic diagram illustrating an initial neural network model according to an embodiment of the present disclosure;

FIG. 4 is a diagram illustrating a specific structure of an initial neural network model provided by an embodiment of the present application;

FIG. 5 is a schematic diagram illustrating a structure of a speech generation model provided by an embodiment of the present application;

FIG. 6 is a schematic diagram illustrating a principle of generating voice information according to an embodiment of the present application;

FIG. 7 is a schematic diagram illustrating an initial neural network model according to an embodiment of the present disclosure;

FIG. 8 is a schematic diagram illustrating a specific structure of another initial neural network model provided in an embodiment of the present application;

FIG. 9 is a schematic diagram illustrating the structure of another speech generation model provided by embodiments of the present application;

FIG. 10 is a schematic diagram of another principle of generating voice information according to an embodiment of the present application;

FIG. 11 is a schematic structural diagram of another initial neural network model provided by an embodiment of the present application;

FIG. 12 is a diagram illustrating a structure of a Chinese-English phoneme recognizer according to an embodiment of the present application;

FIG. 13 is a schematic diagram illustrating the structure of another speech generation model provided by an embodiment of the present application;

FIG. 14 is a schematic diagram illustrating an initial neural network model according to an embodiment of the present disclosure;

FIG. 15 is a schematic diagram illustrating an FFT block with context preservation provided in an embodiment of the present application;

FIG. 16 shows a schematic structural diagram of a pure FFT block provided in an embodiment of the present application;

FIG. 17 is a schematic diagram illustrating the structure of another speech generation model provided by embodiments of the present application;

FIG. 18 is a schematic structural diagram illustrating a speech generating apparatus according to an embodiment of the present application;

FIG. 19 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The following detailed description of embodiments of the present application is provided with reference to the accompanying drawings for purposes of explanation only. It is intended to give a thorough understanding of the embodiments of the present application, which are defined by the claims and their equivalents, and it includes various specific details to assist understanding; these details are to be regarded as illustrative only and not as limiting the application. Accordingly, those of ordinary skill in the art will recognize that changes and modifications of the described embodiments can be made without departing from the scope and spirit of the application. In addition, some descriptions of well-known functions and constructions may be omitted in the following description for clarity and conciseness.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.

For better illustration and understanding of the solutions provided by the embodiments of the present application, the following first describes the technologies related to the embodiments of the present application.

Speech synthesis refers to converting input text into speech information. FIG. 1 shows a basic block diagram of speech synthesis in the prior art: input text (e.g., "Hello") passes through a text encoder, a decoder and a vocoder in a multilingual speech synthesis network, and speech data with the timbre of the target speaker (selected from speaker-1, speaker-2, etc., as shown in the figure) is obtained.

Voice conversion refers to transforming the speech signal of a source speaker so that it sounds as if it were uttered by the target speaker, without changing the content of the utterance. For example, suppose the source speaker is an American who speaks only English and the target speaker is a Chinese speaker who speaks only Mandarin; after a phrase of English is spoken into the voice conversion system and the Chinese speaker is designated as the target, the system can synthesize the same English phrase, originally spoken by the American speaker, read aloud with the timbre of the Chinese speaker.

Cross-language means that the native language of the target speaker differs from the language to be synthesized. For example, a Chinese-English speech generation system generates a piece of English speech using the timbre of a Chinese speaker, or, conversely, Chinese speech using the timbre of an English speaker. When a Chinese speaker is made to speak English, the target speaker never speaks English in the training set, i.e., that speaker and English speech never appear together during training. Because this combination has never been learned, the generated speech usually carries the accent of the target speaker's mother tongue rather than the standard accent of the target language, much like heavily accented English spoken by a Chinese beginner. This becomes more apparent when the training data of the different languages are unbalanced: the tones of the resulting speech then resemble those of the target speaker's native language even more, rather than those of the target language.

A low-resource language is a language with relatively little training data among the multi-language training data, and low-resource speech generation means that the target language being generated is such a low-resource language. For example, in a Chinese-English speech generation system where the Chinese training data is very small, whenever the generated language is Chinese, the Chinese pronunciation is not full and the tones are biased, regardless of whether the designated speaker is Chinese or English. This is because, in prior-art schemes, the tone features used to control the tones of the generated speech are typically injected very early in the system: they are fed into the text encoder together with the phoneme features and encoded jointly. Since the text encoder has to encode the phoneme features and the tone features at the same time, the tone information tends to be weakened in the process; if a text encoding containing little tone information is sent to the decoder, the tones of the synthesized speech will not be very distinct, and for low-resource languages the tone problem becomes even more prominent because of the scarce training data.

In order to solve at least one of the above technical problems in the prior art, embodiments of the present application provide a speech generation method, apparatus, electronic device and readable storage medium, based on which the tones of the generated speech are more standard and the speech effect is effectively improved.

In addition, the voice generation method provided by the embodiments of the present application can be applied to any electronic device, such as smart phones, tablet computers and similar products. Of course, the method may also be applied to a server (including but not limited to a physical server and a cloud server), and the server may convert the acquired to-be-processed information into voice information based on the method provided in the embodiments of the present application.

In order to make the objects, technical solutions and advantages of the present application clearer, various alternative embodiments of the present application and how the technical solutions of the embodiments of the present application solve the above technical problems will be described in detail below with reference to specific embodiments and drawings. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Fig. 2 shows a flow chart of a speech generating method provided in an embodiment of the present application, and as can be seen from the foregoing description, the method may be executed by any electronic device, such as a terminal device of a user (e.g., a mobile phone of the user, a tablet computer, etc.), or a server. As shown in fig. 2, the method may include the steps of:

step S110: obtaining information to be processed

In practical applications, the information to be processed refers to information that needs to be synthesized into speech; it may be text (i.e., text to be processed) or speech information (i.e., speech information to be processed). For example, suppose the text "hello" needs to be turned into voice information; the text "hello" is then the text to be processed. Alternatively, a spoken "hello" may need to be converted into another rendition of the speech "hello"; the input speech "hello" is then the speech information to be processed. The language of the information to be processed is not limited in the embodiments of the present application; for example, it may be Chinese, English or French.

Step S120: coding the information to be processed to obtain an information coding result.

In practical applications, when the information to be processed is encoded, the information to be processed may be encoded by an encoding module (e.g., a text encoder) to obtain a corresponding information encoding result.

Step S130: generating voice information of the target user corresponding to the target language based on the information coding result.

The target user indicates whose voice the finally generated voice information should have, and the target language indicates which language the finally generated voice information should be in. For example, suppose the information to be processed needs to be generated as voice information that has the timbre of user A and is in English; then user A is the target user and English is the target language. The target language may be the same as or different from the language of the information to be processed, which is not limited in the embodiments of the present application.

Specifically, the speech feature corresponding to the target user (i.e., the target speech feature) and the language feature corresponding to the target language (i.e., the target language feature) may be obtained, and then the speech information corresponding to the target language of the target user is generated based on the obtained information encoding result, the target speech feature, and the target language feature.

In practical application, the voice characteristics corresponding to each user and the language characteristics corresponding to each language can be learned in advance and stored, and further, when the voice characteristics of a certain user and the language characteristics of a certain language need to be determined, the voice characteristics and the language characteristics can be directly acquired according to the pre-stored information. The voice feature corresponding to each user may be determined according to all the sample output voice information corresponding to the user, for example, a voice feature average value of all the sample output voice information corresponding to the user may be used as the voice feature corresponding to the user.
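As an illustration of the averaging mentioned above, the following Python sketch turns the per-utterance voice features of one user into a single stored voice feature; the per-utterance feature extractor itself is assumed to exist elsewhere, and all names are hypothetical.

```python
import numpy as np

def average_voice_feature(utterance_features):
    """Average the voice features of all sample output utterances of one user.

    utterance_features: list of 1-D numpy arrays of equal length,
    one per utterance of that user.
    """
    stacked = np.stack(utterance_features, axis=0)   # (num_utterances, dim)
    return stacked.mean(axis=0)                      # (dim,) voice feature of the user

# Precompute one vector per user and store it; look it up at synthesis time.
voice_feature_table = {
    "user_a": average_voice_feature([np.random.randn(256) for _ in range(10)]),
}
```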

In the embodiment of the present application, when generating the voice information, the voice information is generated based on the voice feature of the target user, the information encoding result, and the target language feature, that is, when generating the voice information, the sound feature of the target user and the language feature of the target language are considered, so that it can be ensured that the generated voice information is more consistent with the tone of the target user when speaking the target language, and the voice effect is improved.

In an optional embodiment of the present application, if the information to be processed is a text to be processed, the method may further include:

acquiring tone characteristics of a text to be processed;

generating voice information corresponding to the target language of the target user based on the information coding result, wherein the voice information comprises:

and generating voice information corresponding to the target language of the target user based on the tone characteristics and the information coding result.

In practical applications, each language has its own tonal pattern, and the tone features of different languages are generally different. For example, Chinese tones differ significantly from English "tones": Chinese tones refer to the tones of Mandarin, which may include "yin ping", "yang ping", "shang sheng" and "qu sheng" as well as the neutral (light) tone and tone sandhi, while English "tone" here refers to the stressed and unstressed reading of English sounds.

In the prior art, when a speaker of one language is used to generate speech in another language, the tones of the generated speech are closer to the tones of the native language of the designated speaker (i.e., the target user) than to the tones of the target language. For example, if a Chinese speaker for whom only Chinese training data exists is made to speak English, the resulting English may carry a Mandarin accent rather than a native English accent, and vice versa.

Based on this, in order to make the tones of the generated voice information more accurate, when the information to be processed is a text to be processed, the tone feature of the text to be processed may also be obtained after the text is acquired. The specific way of obtaining the tone feature may be configured in advance and is not limited in the embodiments of the present application. For example, the text to be processed may be fed into a pre-trained prosody prediction module (which extracts the tone information contained in a text) to obtain the tone information of the text; if the mapping relationship between each piece of tone information and a tone feature has been learned in advance, the tone feature corresponding to the text to be processed can then be determined from that mapping once its tone information is known.

Optionally, when the text to be processed is generated into the voice information, the voice information corresponding to the text to be processed may be generated based on the obtained target voice feature, the target language feature, the information encoding result, and the tone feature.

In the embodiments of the present application, the speech information of the text to be processed is generated based on both the tone feature of the text and the information encoding result, i.e., the tone features contained in the text are taken into account when generating the speech information. In this way the tone feature can guide the decoder and the vocoder more directly and conveniently during synthesis, which improves the quality of the generated speech information and makes its tones conform better to the actual tonal characteristics of the target language. For example, when the text to be processed is a Chinese text and the target language is English, the generated voice information fits the intonation of English more closely, which better meets practical application requirements.
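For illustration only, the tone-information-to-tone-feature mapping can be sketched as an embedding lookup; the tone inventory, its size and the feature dimension below are assumptions rather than values from this application.

```python
import torch
import torch.nn as nn

# Hypothetical tone inventory: 4 Mandarin tones, the neutral tone and a padding symbol.
NUM_TONES, TONE_DIM = 6, 64
tone_embedding = nn.Embedding(NUM_TONES, TONE_DIM)   # learned tone-to-feature mapping

def tone_features(tone_ids):
    """tone_ids: LongTensor of shape (seq_len,) produced by a prosody prediction module."""
    return tone_embedding(tone_ids)                   # (seq_len, TONE_DIM)

# Example: the two syllables of "ni hao" both carry the third tone (id 3 here).
print(tone_features(torch.tensor([3, 3])).shape)      # torch.Size([2, 64])
```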

In an optional embodiment of the present application, encoding information to be processed to obtain an information encoding result includes:

acquiring phoneme characteristics corresponding to a text to be processed;

and carrying out text coding on the phoneme characteristics to obtain a text coding result.

A phoneme is the smallest speech unit divided according to the natural attributes of speech; it is analyzed in terms of the articulatory actions within a syllable, and one action forms one phoneme. For example, in Chinese, phonemes can generally be divided into the two categories of vowels and consonants, and diphthongs and nasal finals can each be divided into two or three single vowels. For instance, the Chinese syllable ā consists of only one single vowel, so the corresponding character has only one phoneme; the syllable ài contains one diphthong and likewise only one phoneme; and the syllable dāi contains a consonant and a diphthong and therefore has two phonemes.

In practical applications, when the text to be processed is coded, the phoneme feature corresponding to the text can first be obtained and then text-coded; the resulting information coding result may be called a text coding result. The phoneme feature covers all the phoneme information contained in the text to be processed, and the specific way of obtaining it may be configured in advance and is not limited in the embodiments of the present application. For example, the text to be processed may be input to a pre-trained text-to-phoneme module (which extracts the phoneme information contained in a text), the phoneme information of the text is obtained from the module's output, and the phoneme feature corresponding to the text is then determined based on a pre-learned mapping relationship between each piece of phoneme information and a phoneme feature.
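The phoneme-feature lookup can be sketched in the same way; the toy phoneme inventory and the placeholder text-to-phoneme front end below are hypothetical.

```python
import torch
import torch.nn as nn

PHONEME_INVENTORY = ["n", "i", "h", "a", "o"]          # toy inventory for illustration
phoneme_to_id = {p: i for i, p in enumerate(PHONEME_INVENTORY)}
phoneme_embedding = nn.Embedding(len(PHONEME_INVENTORY), 64)

def text_to_phonemes(text):
    # Placeholder for a pre-trained text-to-phoneme (grapheme-to-phoneme) module;
    # for the Chinese "ni hao" it would return its 5 phonemes.
    return ["n", "i", "h", "a", "o"]

def phoneme_features(text):
    ids = torch.tensor([phoneme_to_id[p] for p in text_to_phonemes(text)])
    return phoneme_embedding(ids)          # (num_phonemes, 64), fed to the text encoder

print(phoneme_features("ni hao").shape)    # torch.Size([5, 64])
```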

In an alternative embodiment of the present application, text coding is performed on the phoneme features to obtain a text coding result, where the text coding result includes:

acquiring tone features corresponding to a text to be processed;

and carrying out text coding on the tone features and the phoneme features to obtain a text coding result.

In practical applications, there may be multiple encoding modes when encoding a text to be processed, for example, text encoding may be performed based on a phoneme feature corresponding to the text to be processed, or text encoding may be performed based on a tone feature and a phoneme feature of the text to be processed, and for descriptions of the tone feature and the phoneme feature corresponding to the text to be processed, reference may be made to the description in the foregoing, and details are not repeated here.

As an example, suppose the text to be processed is the Chinese for "hello" ("ni hao"). Its phoneme information consists of the 5 phonemes of "ni hao", and its tone information consists of the third tone (shang sheng) of the syllable "ni" and the third tone of the syllable "hao". The phoneme feature covering these 5 phonemes can be determined from the pre-learned mapping between phoneme information and phoneme features, and the tone feature covering the third tones of "ni" and "hao" can be determined from the pre-learned mapping between tone information and tone features; the text coding result corresponding to the Chinese "hello" is then obtained by text coding the determined phoneme feature and tone feature.

In an optional embodiment of the present application, if the information to be processed is speech information to be processed, encoding the information to be processed to obtain an information encoding result, including:

acquiring a phoneme posterior probability corresponding to the voice information to be processed;

and coding the phoneme posterior probability to obtain an information coding result.

In practical application, when the information to be processed is speech information to be processed (i.e., speech information of the target user corresponding to the target language needs to be generated from some existing speech), a phoneme Posterior Probability (PPG) corresponding to the speech information to be processed may be determined and then coded to obtain the information coding result. The PPG is a frame-level representation of phoneme information: each frame of speech can be mapped by a phoneme recognizer to a probability distribution over the phonemes, so the PPG of a frame represents both the phoneme information and the pitch information of that frame. Accordingly, when the information to be processed is speech rather than text, the phoneme feature and tone feature used for text are replaced, during coding, by the phoneme posterior probability corresponding to the speech information to be processed.

In an optional embodiment of the present application, obtaining a posterior probability of a phoneme corresponding to the speech information to be processed includes:

acquiring phoneme posterior probability of the speech information to be processed corresponding to each candidate language;

and splicing the phoneme posterior probabilities corresponding to each candidate language to obtain the phoneme posterior probabilities corresponding to the voice information to be processed.

Wherein the candidate language refers to a language in which the generated speech information can be used. For example, the user may select to convert the information to be processed into the phonetic information of chinese or english, where chinese and english are candidate languages. In practical applications, when determining the phoneme posterior probability corresponding to the speech information to be processed, the phoneme posterior probability corresponding to each candidate language of the speech information to be processed may be determined, and then the phoneme posterior probability after the phoneme posterior probabilities corresponding to each candidate language of the speech information to be processed are spliced is used as the phoneme posterior probability corresponding to the speech information to be processed.

In practical applications, the phoneme posterior probabilities corresponding to different candidate languages may be obtained by different phoneme recognizers. For example, a phoneme posterior probability corresponding to Chinese may be obtained by a pre-trained Mandarin Chinese phoneme recognizer, and a phoneme posterior probability corresponding to English may be obtained by a pre-trained English phoneme recognizer. When there are multiple candidate languages, the phone recognizers corresponding to the various languages may be independent from each other, or the phone recognizers corresponding to the various languages may be integrated into one model.
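The sketch below illustrates this concatenation: the same acoustic frames are passed through two per-language phoneme recognizers and the resulting posteriorgrams are concatenated frame by frame. Both recognizers are toy stand-ins for the pre-trained Mandarin and English phoneme recognizers mentioned above, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ToyPhonemeRecognizer(nn.Module):
    """Maps acoustic frames to a per-frame probability distribution over one language's phonemes."""
    def __init__(self, feat_dim, num_phonemes):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_phonemes)

    def forward(self, frames):                            # frames: (T, feat_dim)
        return torch.softmax(self.proj(frames), dim=-1)   # PPG: (T, num_phonemes)

mandarin_recognizer = ToyPhonemeRecognizer(feat_dim=80, num_phonemes=100)
english_recognizer = ToyPhonemeRecognizer(feat_dim=80, num_phonemes=70)

frames = torch.randn(200, 80)                             # e.g. 200 spectrogram frames
ppg = torch.cat([mandarin_recognizer(frames),
                 english_recognizer(frames)], dim=-1)     # (200, 170)
# The concatenated PPG is then sent to the coding module in place of the
# phoneme and tone features used for text input.
```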

In an alternative embodiment of the present application, the method is implemented by a speech generation model, wherein the speech generation model is obtained by:

acquiring an initial neural network model and training data, wherein the training data comprises training sample pairs, and the training sample pairs comprise sample input information, sample output voice information corresponding to the sample input information and a sample user information label;

the initial neural network model comprises an initial voice generation model and an initial user information classification module, the initial voice generation model comprises an initial coding module and an initial voice generation module, the initial user information classification module is connected with the initial coding module, the initial coding module is used for coding sample input information to obtain a sample information coding result, the initial voice generation module is used for obtaining predicted voice information based on the sample information coding result, and the initial user information classification module is used for obtaining predicted user information based on the sample information coding result;

training the initial neural network model based on the training data until the total loss function corresponding to the initial neural network model converges, to obtain a trained initial neural network model, and taking the trained initial speech generation model as the speech generation model;

the total loss function comprises a first loss function and a second loss function, the value of the first loss function represents the difference between the predicted voice information corresponding to the sample input information and the sample output voice information, and the value of the second loss function represents the difference between the predicted user information corresponding to the sample input information and the sample user information label.

In practical application, when the information to be processed is generated into the speech information of the target user corresponding to the target language, the method can be implemented based on a pre-trained speech generation model, and the speech generation model can include an encoding module and a speech generation module.

Specifically, the information to be processed may be input to a speech generation model, the speech generation model may encode the information to be processed to obtain an information encoding result, and input the target speech feature, the target language feature, and the information encoding result (where, when the information to be processed is a text to be processed, the target speech feature, the target language feature, the tone feature, and the information encoding result) to a speech generation module, and the speech generation module may generate speech information corresponding to the target language for the target user based on the input data.
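The inference flow just described can be summarised by the following sketch; every module, table and parameter name here is a hypothetical placeholder rather than the actual model of this application.

```python
def generate_speech(model, info_to_process, target_user, target_language,
                    voice_feature_table, language_feature_table, tone_features=None):
    """One synthesis pass through a trained speech generation model (sketch)."""
    encoding = model.encoder(info_to_process)            # information coding result
    voice_feat = voice_feature_table[target_user]        # pre-stored target voice feature
    lang_feat = language_feature_table[target_language]  # pre-stored target language feature

    inputs = [encoding, voice_feat, lang_feat]
    if tone_features is not None:                        # only when the input is text
        inputs.append(tone_features)

    mel = model.generator(*inputs)                       # speech generation module (decoder)
    return model.vocoder(mel)                            # waveform of the target user's speech
```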

In practical applications, the speech generation model may be obtained by training in the following manner, which may specifically include:

acquiring an initial neural network model and training data for training the initial neural network model, wherein the initial neural network model can comprise an initial speech generation model and an initial user information classification module, an initial coding module in the initial speech generation model is connected with the initial user information classification module, and the initial coding module can comprise a text coder; training data may include training sample pairs, and each training sample pair may include sample input information, sample output speech information corresponding to the sample input information, and a sample user information tag for characterizing which user's speech the sample output speech information corresponds to.

It can be understood that, when the information to be processed is the text to be processed, the sample input information in the training sample pair is the sample input text; and when the information to be processed is the voice information to be processed, the sample input information in the training sample pair is the sample input voice information, and the sample input voice information and the sample output voice information in the training sample are the same voice information.

Further, the sample input information in the training data can be input into the initial speech generation model of the initial neural network model. The initial coding module in the initial speech generation model codes the sample input information to obtain a sample information coding result, and the initial speech generation model determines the sample language feature corresponding to the sample input information and extracts the sample speech feature of the sample output speech information. The sample information coding result, the sample language feature and the sample speech feature are then input to the initial speech generation module (which may comprise a decoder) to obtain the predicted speech information corresponding to the sample input information; at the same time, the sample information coding result is input to the initial user information classification module to obtain the predicted user information corresponding to the sample input information. The value of the first loss function is determined based on the predicted speech information and the sample output speech information, and the value of the second loss function is determined based on the predicted user information and the sample user information label; whether the total loss function corresponding to the initial neural network model converges is then determined based on these two values. If the total loss function has not converged, the current initial neural network model does not yet meet the requirements; its network parameters are adjusted and training continues on the adjusted model based on the training data until the total loss function converges. The trained initial neural network model is thereby obtained, and the trained initial speech generation model it contains is used as the speech generation model.

In practical application, the initial neural network model may further include a residual network connected to the initial speech generation module, and the residual network may eliminate noise in the sample output speech information, thereby ensuring that the sample output speech information in the training data is clean and noiseless speech information. The input of the residual error network is sample output voice information, the output is prediction residual error coding, the prediction residual error coding output by the residual error network can be input to an initial voice generating module, and the initial voice generating module can obtain prediction voice information corresponding to the sample input information based on a sample information coding result, a sample language characteristic, the prediction residual error coding and a sample voice characteristic.

At this time, the total loss function (also referred to as the objective function) corresponding to the initial neural network model can be expressed as a loss function with a domain-adversarial training objective combined with an Evidence Lower Bound (ELBO). When the sample input information is the sample input text, the total loss function may be expressed as:

$$\hat{\mathcal{L}}(\theta,\phi_r,\psi_s;\ \mathrm{speech},\mathrm{text},y_s)=\mathrm{ELBO}(\theta,\phi_r;\ \mathrm{speech},\mathrm{text})+\lambda_s\,\mathcal{L}_s(\psi_s;\ \mathrm{speech},y_s)$$

where $\hat{\mathcal{L}}$ denotes the total loss function, $\mathcal{L}_s$ denotes the user information classification loss (i.e., the second loss function), and $\mathrm{ELBO}(\theta,\phi_r;\ \mathrm{speech},\mathrm{text})$ represents all losses of the initial speech generation module part (i.e., the synthesizer, which may specifically include the text encoder and decoder) and the residual network part; $\theta$ denotes the model parameters of the initial speech generation module, $\phi_r$ the model parameters of the residual network, $\psi_s$ the model parameters of the initial user information classification module (i.e., the speaker adversarial network hereinafter), $\mathrm{speech}$ the feature vector corresponding to the sample output voice information, $\mathrm{text}$ the feature vector corresponding to the sample input text, $\lambda_s$ the weight of the initial user information classification module, and $y_s$ the feature vector corresponding to the user information represented by the sample user information label.

Wherein $\mathrm{ELBO}(\theta,\phi_r;\ \mathrm{speech},\mathrm{text})$ may be specifically expressed as:

$$\mathrm{ELBO}(\theta,\phi_r;\ \mathrm{speech},\mathrm{text})=\mathbb{E}_{q_{\phi_r}(z_r\mid \mathrm{speech})}\!\left[\log p_{\theta}(\mathrm{speech}\mid z_r,\mathrm{text})\right]-\lambda_{KL}\,D_{KL}\!\left(q_{\phi_r}(z_r\mid \mathrm{speech})\,\|\,\mathcal{N}(0,I)\right)$$

where $\mathbb{E}_{q_{\phi_r}(z_r\mid \mathrm{speech})}[\cdot]$ represents the expectation, over all training data and in the presence of the residual network, of the values of the first loss function; $\log p_{\theta}(\mathrm{speech}\mid z_r,\mathrm{text})$ represents the loss between the predicted speech information output by the initial speech generation module and the sample output speech information (i.e., the first loss function); $\mathcal{N}(0,I)$ represents the standard Gaussian distribution; $D_{KL}(\cdot\,\|\,\cdot)$ represents the KL divergence (also known as relative entropy) between the distribution of the predicted residual encoding output by the residual network and the standard Gaussian distribution; $\lambda_{KL}$ represents the weight of the $D_{KL}$ term; $\mathrm{speech}$ represents the feature vector corresponding to the sample output voice information; $\mathrm{text}$ represents the feature vector corresponding to the sample input text; and $z_r$ represents the predicted residual encoding result.

It should be noted that, if the sample input information is a sample input text, the sample information coding result during training may be obtained either by text coding both the sample tone features and the sample phoneme features corresponding to the sample input text, or by text coding only the sample phoneme features. In the latter case, the input of the initial speech generation module can be the sample tone features of the sample input text, the sample information coding result, the sample language feature and the sample speech feature, so that even if the tone features are weakened when the initial text encoder performs text coding, the directly input tone features can still conveniently guide the initial speech generation model to synthesize the speech information, avoiding tone deviation in the generated speech information.
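To make the training objective concrete, here is a single-step sketch under stated assumptions: the first (reconstruction) loss is taken as an L1 loss, the residual network is treated as a VAE-style encoder with a KL term, and the user information classification loss is made adversarial with a gradient reversal layer; all module names, shapes and weights are illustrative, not the actual implementation.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer used for the adversarial user information classifier."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad

def training_step(batch, encoder, decoder, residual_net, speaker_classifier,
                  lambda_s=0.5, lambda_kl=1e-3):
    text, speech, speaker_label, lang_feat, voice_feat = batch

    encoding = encoder(text)                    # sample information coding result, (B, T, D)
    z_r, mu, logvar = residual_net(speech)      # predicted residual encoding and its posterior
    pred_speech = decoder(encoding, z_r, lang_feat, voice_feat)

    # First loss: difference between predicted speech and sample output speech.
    recon_loss = F.l1_loss(pred_speech, speech)

    # KL term between the residual posterior and a standard Gaussian.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # Second loss: user information classification on the gradient-reversed encoding,
    # which adversarially pushes speaker information out of the coding result.
    speaker_logits = speaker_classifier(GradReverse.apply(encoding).mean(dim=1))
    speaker_loss = F.cross_entropy(speaker_logits, speaker_label)

    return recon_loss + lambda_kl * kl + lambda_s * speaker_loss
```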

In an optional embodiment of the present application, the initial neural network model further includes a pitch classifier connected to the initial encoding module, and for a training sample pair, the training data further includes a sample pitch feature corresponding to the sample input information;

the pitch classifier is used for obtaining a predicted pitch characteristic based on the sample information coding result;

the total loss function also includes a third loss function whose value characterizes a difference between the predicted pitch characteristic and the sample pitch characteristic.

In practical application, the initial neural network model may further include a pitch classifier connected to the initial encoding module, which is configured to classify the pitch features corresponding to the sample input information. In this case, after the sample information coding result corresponding to the sample input information is obtained, that coding result may be input to the pitch classifier to generate a predicted pitch feature; the value of the third loss function is then determined based on the predicted pitch feature and the sample pitch feature, and whether the total loss function corresponding to the initial neural network model converges is determined jointly from the value of the third loss function, the value of the first loss function and the value of the second loss function.

When the total loss function corresponding to the initial neural network model is determined based on the value of the third loss function, the value of the first loss function and the value of the second loss function, the values of the loss functions may simply be summed to obtain the total loss function, or the total loss function may be determined by a weighted summation according to the importance of each loss function, which is not limited in the embodiments of the present application.
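A minimal sketch of that combination, with placeholder weights (a plain sum corresponds to all weights being 1):

```python
def combine_losses(first_loss, second_loss, third_loss, lambda_s=1.0, lambda_t=1.0):
    # Weighted summation; the weights reflect the relative importance of each loss term.
    return first_loss + lambda_s * second_loss + lambda_t * third_loss
```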

Optionally, when the sample input information is a sample input text, the total loss function corresponding to the initial neural network model may be expressed as:

$$\hat{\mathcal{L}}(\theta,\phi_r,\psi_s,\psi_t)=\mathcal{L}(\theta,\phi_r,\psi_s;\ \mathrm{speech},\mathrm{text},y_s)+\lambda_t\,\mathcal{L}_t(\psi_t;\ \mathrm{text},y_t)$$

where $\hat{\mathcal{L}}$ denotes the total loss function; $\mathcal{L}_t$ denotes the third loss function (i.e., the pitch classification loss function hereinafter); $\lambda_t$ denotes the weight of the third loss function; $\psi_t$ denotes the model parameters of the pitch classifier; $y_t$ denotes the sample pitch feature; and $\mathcal{L}(\theta,\phi_r,\psi_s;\ \mathrm{speech},\mathrm{text},y_s)$ is the objective function given above (which includes the first loss function, the second classification loss function and the loss function corresponding to the residual network). As before, $\theta$ denotes the model parameters of the initial speech generation module, $\phi_r$ the model parameters of the residual network, $\mathrm{speech}$ the feature vector corresponding to the sample output voice information, $\mathrm{text}$ the feature vector corresponding to the sample input text, and $y_s$ the feature vector corresponding to the user information represented by the sample user information label; these are described above and not repeated here.

In an optional embodiment of the present application, if the sample input information is sample input speech information, the initial neural network model further includes an initial phoneme recognizer connected to the initial coding module, where the initial phoneme recognizer is configured to determine a posterior probability of a phoneme corresponding to the sample input speech information;

the initial encoding module is specifically configured to, when encoding the sample input information to obtain a sample information encoding result:

and coding the posterior probability of the phoneme corresponding to the sample input voice information to obtain a sample information coding result.

In an optional embodiment of the present application, if the sample input information is a sample input text, the initial encoding module is specifically configured to, when encoding the sample input information to obtain a sample information encoding result:

and coding the sample input text to obtain a sample text coding result.

In practical applications, when the sample input information is sample input speech information, the initial neural network model further includes an initial phoneme recognizer for determining a posterior probability of a phoneme corresponding to the sample input speech information, and the initial phoneme recognizer is connected to the initial encoding module. That is, after the sample input speech information is input to the initial neural network model, the initial phoneme recognizer may determine a phoneme posterior probability corresponding to the sample input speech information, and then input the obtained phoneme posterior probability to the initial coding module, and the initial coding module may code the input phoneme posterior probability to obtain a sample information coding result.

Accordingly, when the sample input information is the sample input text, after the sample input text is input to the initial neural network model, the pitch characteristic and the phoneme characteristic of the sample input text may be determined, and then the initial encoding module may encode the input pitch characteristic and the phoneme characteristic (or only the phoneme characteristic) to obtain the sample information encoding result (i.e., the sample text encoding result).

In an alternative embodiment of the present application, if the sample input information is sample input speech information, the sample pitch characteristic corresponding to the sample input information is determined by the following method:

acquiring, for each frame of voice in the sample input voice information, the probability that the frame corresponds to each phoneme;

determining a target phoneme of each frame of voice according to the probabilities that the frame corresponds to the respective phonemes;

and taking the tone characteristic corresponding to the target phoneme of each frame of voice as a sample tone characteristic corresponding to the sample input voice information.

In practical applications, when the sample input information differs, the way the corresponding sample pitch feature is determined also differs. If the sample input information is a sample input text, the pitch feature corresponding to the pitch information contained in the sample input text can be determined directly as the sample pitch feature, based on the preset mapping relationship between pitch information and pitch features. If the sample input information is sample input speech information, the speech consists of individual frames; in this case, the probability (i.e., the PPG) that each frame of speech corresponds to each phoneme can be obtained, the target phoneme of each frame is determined from these probabilities, and the pitch feature corresponding to the target phoneme of each frame is used as the sample pitch feature corresponding to the sample input speech information.

The implementation manner for determining the target phoneme of each frame of speech may be configured in advance, and the embodiment of the present application is not limited. For example, a phoneme corresponding to the highest probability among probabilities of respective phonemes in a certain frame of speech may be used as a target phoneme of the frame of speech.

As an example, suppose the sample input speech information contains 2 frames of speech and the phonemes of the corresponding language are a, b, c and d. It may be determined that the probabilities of the 1st frame corresponding to phonemes a, b, c and d are 0.3, 0.2, 0.4 and 0.1 respectively, and that the probabilities of the 2nd frame are 0.6, 0.2, 0.1 and 0.1 respectively. Since the probability of the 1st frame corresponding to phoneme c is the largest and the probability of the 2nd frame corresponding to phoneme a is the largest, the target phoneme of the 1st frame is c and the target phoneme of the 2nd frame is a. The pitch feature corresponding to the "flat" tone of phoneme c and the pitch feature corresponding to the "flat" tone of phoneme a can then be determined, and the determined pitch features are used as the sample pitch features corresponding to the sample input speech information. The pitch feature corresponding to each tone may be determined based on a preset mapping relationship between tones and pitch features.
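This per-frame selection can be sketched directly from the PPG; the phoneme-to-tone table below is purely illustrative.

```python
import numpy as np

# Toy table: phoneme index -> tone id (illustrative values only).
PHONEME_TO_TONE = {0: 1, 1: 2, 2: 1, 3: 4}   # e.g. phonemes a and c carry the "flat" tone (id 1)

def sample_tone_ids(ppg):
    """ppg: (num_frames, num_phonemes) posterior probabilities.

    For each frame, pick the phoneme with the highest posterior probability
    (the target phoneme) and look up the tone it carries.
    """
    target_phonemes = ppg.argmax(axis=1)
    return np.array([PHONEME_TO_TONE[int(p)] for p in target_phonemes])

ppg = np.array([[0.3, 0.2, 0.4, 0.1],        # frame 1 -> phoneme index 2 (c)
                [0.6, 0.2, 0.1, 0.1]])       # frame 2 -> phoneme index 0 (a)
print(sample_tone_ids(ppg))                  # [1 1], later mapped to tone features
```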

In an alternative embodiment of the present application, the obtaining training data includes:

acquiring initial training data, wherein the initial training data comprises initial sample pairs, the initial sample pairs comprise initial sample input information, initial sample output voice information corresponding to the initial sample input information, and user information labels corresponding to the initial sample input information;

performing voice adjustment processing on at least one piece of initial sample output voice information to obtain processed initial sample output voice information, and obtaining a user information label corresponding to the processed initial sample output voice information;

taking the initial training data, the processed initial sample output voice information and a user information label corresponding to the processed initial sample output voice information as training data;

wherein the voice adjustment processing includes at least one of a speed adjustment processing, a pitch adjustment processing, and a noise addition processing.

Specifically, the initial training data may include initial sample pairs, and each initial sample pair may include initial sample input information (which may be an initial sample input text or initial sample input speech information), initial sample output speech information corresponding to the initial sample input information, and a user information tag characterizing which user's speech the initial sample output speech information is.
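A minimal sketch of how such an initial sample pair could be represented is given below; the field names and types are illustrative assumptions rather than the patent's data format.

```python
from dataclasses import dataclass
from typing import Union
import numpy as np

@dataclass
class InitialSamplePair:
    sample_input: Union[str, np.ndarray]   # initial sample input text, or input speech waveform
    output_speech: np.ndarray              # initial sample output speech information (waveform)
    user_tag: str                          # user information tag, e.g. "speaker_A"
    language: str                          # e.g. "zh" or "en"
```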

In practical application, if the training data of a certain language is relatively scarce, the finally obtained speech synthesis network may be insufficiently trained and the effect of the finally generated speech may be unsatisfactory. Based on this, in the embodiment of the present application, after the initial training data is obtained, speech adjustment processing may be performed on the initial sample output speech information included in the initial sample pairs to obtain processed initial sample output speech information, and the initial training data, the processed initial sample output speech information, and the user information tags corresponding to the processed initial sample output speech information are then collectively used as the training data.

In an optional embodiment of the present application, taking the initial training data, the processed initial sample output voice information, and the user information tag corresponding to the processed initial sample output voice information as the training data, may include:

generating processed training sample pairs based on the processed initial sample output voice information and the corresponding user information labels, and taking the initial training data together with the generated processed training sample pairs as the training data;

if the initial sample input information is a sample input text, each processed training sample pair comprises an original initial sample input text, corresponding processed initial sample output voice information and a user information label corresponding to the processed initial sample output voice information; if the initial sample input information is sample input voice information, each processed training sample pair comprises processed initial sample output voice information and a user information label corresponding to the processed initial sample output voice information.

In practical application, after the initial training data is acquired, the initial sample output voice information included in the initial sample pairs may be subjected to voice adjustment processing to obtain processed initial sample output voice information, new training sample pairs (i.e., processed training sample pairs) are then formed based on the processed initial sample output voice information, and the new training sample pairs and the original initial training data are used together as the training data. Compared with the original initial training data, the training data now also contains the training samples obtained after each processing, so the amount of training data is significantly increased and the purpose of expanding the training data is achieved.

When the initial sample input information is a sample input text, each processed training sample pair may include the original sample input text, the corresponding processed initial sample output voice information, and the user information tag corresponding to the processed initial sample output voice information. When the initial sample input information is sample input voice information, the sample input voice information and the sample output voice information in the initial sample pair are the same voice information; accordingly, performing voice adjustment processing on the initial sample output voice information is equivalent to performing it on the initial sample input voice information as well, and each processed training sample pair includes the corresponding processed initial sample output voice information and the user information tag corresponding to the processed initial sample output voice information.

The speed adjustment processing refers to adjusting the speaking speed of the voice by a certain ratio, the pitch adjustment processing refers to adjusting the fundamental frequency of the voice by a certain ratio, and the noise addition processing refers to adding noise to the voice.

In practical application, which processing manners the voice adjustment processing specifically includes, and the implementation manner of each processing manner, may be configured in advance and are not limited in the embodiment of the present application. For example, the speed adjustment processing and the pitch adjustment processing may be implemented by the sox tool; when the speed parameter of sox takes different values, different speech information is obtained, for example, when the speed parameter takes 0.8, 0.9, 1.1, and 1.2 (that is, 80%, 90%, 110%, and 120% of the original speed), 4 voices with different speech speeds may be obtained. When the noise addition processing is performed on the initial sample output voice information, vehicle-mounted noise may be added to the initial sample output voice information according to a set signal-to-noise ratio in decibels.
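A minimal sketch of the two augmentations is given below, assuming the sox command-line tool is installed and a car-noise clip is available as a NumPy array; parameter names and the mixing routine are illustrative assumptions.

```python
import subprocess
import numpy as np

def adjust_speed(in_wav: str, out_wav: str, speed: float) -> None:
    # e.g. speed = 0.8, 0.9, 1.1 or 1.2 (80% .. 120% of the original speed)
    subprocess.run(["sox", in_wav, out_wav, "speed", str(speed)], check=True)

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Scale the noise so that the speech-to-noise power ratio equals snr_db.
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```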

When the voice adjustment processing includes two or more processing manners, the different voice adjustment processings may be performed separately on the initial sample output voice information, or one voice adjustment processing may be performed on the result obtained after another voice adjustment processing; this is not limited in the example of the present application. For example, assuming that the voice adjustment processing includes the speed adjustment processing and the noise addition processing, the speed adjustment processing and the noise addition processing may be performed separately on the initial sample output voice information to obtain speed-adjusted initial sample output voice information and noise-added initial sample output voice information; alternatively, the speed adjustment processing may be performed first to obtain speed-adjusted initial sample output voice information, and the noise addition processing may then be performed on the speed-adjusted initial sample output voice information to obtain noise-added initial sample output voice information.

In an optional embodiment of the present application, each initial sample output speech information includes initial sample output speech information corresponding to at least two different languages and having different training data amounts, where a training data amount includes a user data amount and/or a speech data amount;

performing speech conditioning processing on at least one initial sample output speech information, comprising:

and performing voice adjustment processing on the initial sample output voice information corresponding to the language with a smaller amount of training data.

In practical applications, the initial sample output speech information may include initial sample output speech information corresponding to at least two different languages, so that the speech synthesis network trained based on the initial sample output speech information can synthesize speech information in different languages. The amount of training data corresponding to each language in the initial sample output speech information may differ. The training data amount may include a user data amount and/or a speech data amount; the user data amount refers to the number of users corresponding to each language, and the speech data amount refers to the total speech duration or the number of utterances corresponding to each language. For example, if the initial sample output speech information belonging to Chinese corresponds to 5 users and its total duration is 10 hours, the user data amount corresponding to Chinese is 5 and the speech data amount is 10 hours.

Furthermore, for a language with less training data, voice adjustment processing can be performed on the initial sample output voice information corresponding to that language to increase its training data, so that the networks related to that language can be trained more sufficiently, alleviating the problem of an unsatisfactory speech synthesis effect caused by pronunciation errors for that language due to too little training data.

In an optional embodiment of the present application, if the voice adjustment processing includes voice speed adjustment processing and/or tone adjustment processing, a user information tag corresponding to the processed initial sample output voice information is different from a user information tag corresponding to the corresponding initial sample output voice information;

and if the voice regulation processing only comprises noise adding processing, the user information label corresponding to the processed initial sample output voice information is the same as the user information label corresponding to the corresponding initial sample output voice information.

In practical applications, in order to characterize which user's voice the processed initial sample output voice information belongs to, the processed initial sample output voice information also has a corresponding user information tag. If the processed initial sample output voice information is obtained based on the speed adjustment processing and/or the pitch adjustment processing, the processed voice differs considerably from the initial sample output voice information; the user corresponding to the processed initial sample output voice information is therefore regarded as different from the user corresponding to the initial sample output voice information, and the user information tag of the processed initial sample output voice information is naturally also different from that of the initial sample output voice information. In this case, the user corresponding to the processed initial sample output voice information is treated as a new user, and a new user information tag is added.

For example, assume that a piece of initial sample output voice information is the voice of user A and that its user information tag is a. After the speed adjustment processing and/or the pitch adjustment processing is performed on this initial sample output voice information, the processed initial sample output voice information is obtained; at this time, the user corresponding to the processed initial sample output voice information is a new user B, and the user information tag of the processed initial sample output voice information is b.

Correspondingly, if the processed initial sample output voice information is obtained only based on the noise addition processing, it does not differ greatly from the initial sample output voice information; in this case, the user corresponding to the processed initial sample output voice information is the same user as the one corresponding to the initial sample output voice information, that is, the user information tag of the processed initial sample output voice information is the same as that of the initial sample output voice information. For example, if a piece of initial sample output speech information is the speech of user A and its user information tag is a, and only the noise addition processing is performed on it, the user corresponding to the processed initial sample output speech information is still user A and its user information tag is still a.
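A minimal sketch of this tag rule follows: speed or pitch adjustment creates a new (virtual) speaker tag, while noise addition keeps the original tag. The naming scheme for the new tags is an assumption.

```python
def augmented_user_tag(original_tag: str, adjustment: str, param=None) -> str:
    if adjustment in ("speed", "pitch"):
        # e.g. "A" -> "A_speed_0.8", treated as a new user during training
        return f"{original_tag}_{adjustment}_{param}"
    if adjustment == "noise":
        return original_tag
    raise ValueError(f"unknown adjustment: {adjustment}")
```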

Optionally, in order to better understand the implementation manner of the speech adjustment processing performed on the initial sample output speech information and the resulting technical effect, a specific example is described. In this example, it is assumed that the initial training data includes initial sample pairs of the Chinese and English languages. The initial sample output speech information corresponding to Chinese includes only one speaker (i.e., the number of users is 1), and the corresponding amount of speech data is about 15 minutes; the initial sample output speech information corresponding to English includes 6 speakers (i.e., the number of users is 6), the corresponding amount of speech data exceeds 10 hours, and the users of each language speak only that language. The speech adjustment processing includes the speed adjustment processing and the noise addition processing. Because the amount of training data corresponding to Chinese is small, the speed parameter of the sox tool may be set to 0.8, 0.9, 1.1, and 1.2, respectively, and the speed of the initial sample output speech information corresponding to Chinese is adjusted to obtain new initial sample output speech information, so that the data amount of the initial sample output speech information corresponding to Chinese is expanded by 4 times. Then, the original initial sample output speech information corresponding to Chinese and the new initial sample output speech information obtained after the speed adjustment processing may be subjected to the noise addition processing, obtaining the same number of pieces of initial sample output speech information as before the noise addition processing; at this time, the data amount of the initial sample output speech information corresponding to Chinese is 10 times the original data amount. Since the fundamental frequency of the initial sample output speech information changes considerably after the speed adjustment processing, the initial sample output speech information obtained after adjustment based on each speed parameter corresponds to a new user, and the number of users of the initial sample output speech information corresponding to Chinese therefore also changes from the original 1 to 5.

Therefore, in this example, the training data amount corresponding to the chinese language is greatly expanded compared to the original training data amount, and after the initial neural network model is trained based on the expanded training data, the relevant network corresponding to the chinese language can be trained more sufficiently, so that the problem of pronunciation error caused by too little training data can be alleviated.

In addition, in order to better understand the speech generation method provided in the embodiment of the present application and the training process of the speech generation model, the method provided in the present application will be described in detail below with different embodiments and different application scenarios as examples.

The first embodiment is as follows:

the following description is made in detail with reference to an application scenario in which an input text is used to generate speech information in different candidate languages (such as Chinese or English). In this scenario, the information to be processed is a text to be processed, and the sample input information in the training data is a sample input text. The corresponding speech generation model can be obtained by training neural network models with different network structures, and when the network structures are different, the training manners of the initial neural network model also differ in part; initial neural network models with different network structures and their training manners are described in detail below using different examples (example 1 and example 2).

Example 1:

as shown in FIG. 3, the initial neural network model may include an initial speech generation model, an initial user information classification module, a residual network, and a pitch classifier. The initial speech generation model comprises an initial coding module (not shown in fig. 3) and an initial speech generation module (not shown in fig. 3); the initial coding module comprises a phoneme tone information acquisition module, a phoneme tone feature acquisition module, and a text encoder which are connected in sequence, and the initial speech generation module comprises a decoder and a vocoder which are connected in sequence. The residual network comprises a speech feature extraction module and a residual encoder (specifically, the residual encoder can be a variational self-encoder) which are connected in sequence; the residual encoder is connected with the decoder in the initial speech generation module, and the text encoder is connected with the initial user information classification module, the decoder, and the pitch classifier, respectively.

The process of training the initial neural network model shown in fig. 3 is described in detail below with reference to fig. 4. In this example, the initial user information classification module may be a countermeasure network (i.e., the speaker countermeasure network in fig. 4) composed of a gradient inversion layer and a user information classifier (the speaker classifier in fig. 4); the gradient inversion layer multiplies the error passed back through it by a negative number, so that the training targets of the networks before and after the gradient inversion layer are opposite to each other, thereby achieving an adversarial (countermeasure) effect. The residual encoder is a variational self-encoder, and the speech feature extracted by the speech feature extraction module (not shown in fig. 4) included in the residual network may be a mel spectrum (a representation of speech features). The pitch classifier (the pitch classification network in fig. 4) may be composed of one or more fully-connected layers (not shown in fig. 4) and a Softmax (excitation function) layer (not shown in fig. 4) cascaded in sequence, so as to perform pitch feature classification on the input text encoding.
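A minimal PyTorch sketch of a gradient inversion layer of the kind described above is given below; the scaling constant lambd and its default value are assumptions.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)                       # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None     # flip the sign of the gradient

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Usage: speaker_logits = speaker_classifier(grad_reverse(text_encoding))
```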

Accordingly, when the initial neural network model is trained based on the training data, the pitch information and the phoneme information included in the sample input text (the text in fig. 4) may be obtained by a phoneme pitch information obtaining module (not shown in fig. 4), forming the toned phoneme sequence in fig. 4. The pitch feature containing the pitch information (i.e., the pitch embedding in fig. 4) and the phoneme feature containing the phoneme information (i.e., the phoneme embedding in fig. 4) are then obtained by the phoneme pitch feature obtaining module (not shown in fig. 4) and input to the text encoder to obtain the sample information encoding result (i.e., the text encoding in fig. 4); the sample information encoding result is input to the pitch classifier and the speaker countermeasure network, respectively, to obtain the predicted pitch feature and the predicted user information. Meanwhile, the residual network can extract the mel spectrum of the sample output voice information (i.e., the sample voice feature) and input it to the residual encoder to obtain a prediction residual code (i.e., the residual code in fig. 4). Further, the residual code, the language feature corresponding to the sample input text (i.e., the language embedding in fig. 4), the speech feature corresponding to the sample output speech information (i.e., the speaker embedding in fig. 4), and the text encoding may be concatenated and then input to the decoder to obtain a predicted mel spectrum (i.e., the mel spectrum in fig. 4), and the vocoder then generates and outputs the predicted speech information (i.e., the speech in fig. 4) based on the obtained mel spectrum.

The language features corresponding to the input text of each sample (i.e., language embedding in fig. 4) and the speech features corresponding to the speech information output by each sample (i.e., speaker embedding in fig. 4) can be predetermined and configured in the initial neural network model, and then can be directly obtained during training; of course, a module for determining the language features of the sample input text and the speech features corresponding to the sample output speech information may also be configured in the initial neural network model, and determined in real time during training, which is not limited in the embodiment of the present application. The speech feature (i.e., speaker embedding) corresponding to the sample output speech information may be obtained by extracting an x-vector (e.g., a 64-dimensional x-vector) of the sample output speech using a pre-trained x-vector (x-vector, an expression form of speech feature) extractor.

Further, a value of a pitch classification loss function (i.e., the third loss function in the foregoing) may be obtained based on the sample pitch feature corresponding to the sample input text and the predicted pitch feature; a value of a user information classification loss function (i.e., the second loss function in the foregoing) may be obtained based on the predicted user information and the user information label corresponding to the sample input text; a KL distance of the variational self-encoder may be determined based on the obtained residual code and the sample residual code corresponding to the sample output voice information; and a value of a voice loss function (i.e., the first loss function in the foregoing) may be determined based on the sample output voice information and the predicted voice information. The sum of the value of the speech loss function, the value of the user information classification loss function, the value of the pitch classification loss function, and the KL distance of the variational self-encoder is then taken as the total loss function of the initial neural network model; if the total loss function has not converged, training of the initial neural network model continues until the total loss function converges, the trained initial neural network model is obtained, and the speech generation model included in the trained initial neural network model is taken as the speech generation model in the embodiment of the application. When determining the value of the speech loss function, the mean square error between the predicted mel spectrum and the mel spectrum extracted from the sample output speech information can be determined and used as the value of the speech loss function, and the pitch classification loss function and the user information classification loss function can be cross entropy loss functions.
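A minimal sketch of this total loss is given below, assuming PyTorch tensors and a Gaussian posterior (mu, logvar) from the variational residual encoder; the unweighted sum follows the description above, while the function and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_mel, target_mel, speaker_logits, speaker_labels,
               pitch_logits, pitch_labels, mu, logvar):
    speech_loss = F.mse_loss(pred_mel, target_mel)                    # first loss function
    speaker_loss = F.cross_entropy(speaker_logits, speaker_labels)    # second loss function
    pitch_loss = F.cross_entropy(pitch_logits, pitch_labels)          # third loss function
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())     # KL term of the VAE
    return speech_loss + speaker_loss + pitch_loss + kl
```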

Alternatively, in practical applications, when the speech information of the target user corresponding to the target language (e.g., English) needs to be generated from the text to be processed, the text to be processed (the text in fig. 5) may be input into the speech generation model shown in fig. 5. The speech generation model obtains the pitch information and the phoneme information of the text to be processed (i.e., the toned phoneme sequence in fig. 5), then obtains the pitch feature containing the pitch information (i.e., the pitch embedding in fig. 5) and the phoneme feature containing the phoneme information (i.e., the phoneme embedding in fig. 5) and inputs them into the text encoder to obtain the information encoding result (i.e., the text encoding in fig. 5). The language feature corresponding to English (i.e., the language embedding in fig. 5), the speech feature of the target user (i.e., the speaker embedding in fig. 5), and the text encoding are then spliced and sent to the decoder to obtain the mel spectrum (i.e., the mel spectrum in fig. 5), and the speech information of the target user corresponding to English (i.e., the speech in fig. 5) is obtained through the vocoder.

The speech feature corresponding to each user and the language feature corresponding to each language can be pre-configured in the speech generation model, and the speech feature corresponding to each user is the average of the speech features of all sample output speech information corresponding to that user during training (for example, the average of the x-vectors of all sample output speech information). In order to reduce the noise of the sample output speech information, improve the stability of the model output, and further improve the speech synthesis effect, a residual code of an all-zero vector can be used as the prior mean and spliced with the language embedding, the speaker embedding, and the text encoding before being sent to the decoder to obtain the mel spectrum.
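A minimal sketch of how the per-user speaker embedding and the all-zero residual prior could be prepared is shown below, assuming a pretrained x-vector extractor has already produced one x-vector per utterance; the residual-code dimension is an assumption.

```python
import numpy as np

def speaker_embedding(utterance_xvectors):
    # Mean of the x-vectors of all sample output speech of this user.
    return np.mean(np.stack(utterance_xvectors), axis=0)

# All-zero residual code used as the prior mean at inference time (dimension assumed).
zero_residual = np.zeros(16, dtype=np.float32)
```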

As shown in fig. 6, in this example, after the text encoder obtains the sample information encoding result (the text encoding in fig. 6) based on the phoneme feature (the phoneme embedding in fig. 6) and the pitch feature (the pitch embedding in fig. 6), the obtained text encoding can be input to the pitch classifier for pitch classification to obtain the predicted pitch feature, and the loss between the predicted pitch feature and the actual pitch feature is added to the total loss function; the text encoder is thereby forced to learn more pitch information so as to obtain an information encoding result containing stronger pitch information. After the text encoding is input to the decoder, the pitch of the speech generated by the decoder based on the text encoding, the speaker embedding, and the language embedding (exemplified by the mel spectrum in fig. 6) becomes more distinct.

Example 2:

as shown in fig. 7, the initial neural network model may include an initial speech generation model, an initial user information classification module, and a residual network. The initial speech generation model comprises an initial coding module (not shown in fig. 7) and an initial speech generation module (not shown in fig. 7), wherein the initial coding module comprises a phoneme tone information acquisition module, a phoneme tone feature acquisition module and a text encoder which are sequentially connected; the initial voice generation module comprises a decoder and a vocoder which are connected in sequence; the residual error network comprises a voice feature extraction module and a residual error coder (specifically, the residual error coder can be a variational self-coder) which are connected in sequence, the residual error coder is connected with a decoder in the initial voice generation module, and the text coder is connected with the initial user information classification module and the decoder.

The process of training the initial neural network model shown in fig. 7 is described in detail below with reference to fig. 8. In this example, the initial user information classification module may be a countermeasure network (the speaker countermeasure network in fig. 8) composed of a gradient inversion layer and a user information classifier (the speaker classifier in fig. 8); for the description of the gradient inversion layer, reference may be made to the description in example 1, which is not repeated here. The residual encoder may be a variational self-encoder, and the speech feature extracted by the speech feature extraction module included in the residual network may be a mel spectrum.

Accordingly, when the initial neural network model is trained based on the training data, the pitch information and phoneme information included in the sample input text (the text in fig. 8) may be obtained by a phoneme pitch information obtaining module (not shown in fig. 8), forming the toned phoneme sequence in fig. 8, and the pitch feature containing the pitch information (i.e., the pitch embedding in fig. 8) and the phoneme feature containing the phoneme information (i.e., the phoneme embedding in fig. 8) are then obtained by the phoneme pitch feature obtaining module (not shown in fig. 8). Only the phoneme feature is input to the text encoder to obtain the sample information encoding result (i.e., the text encoding in fig. 8), and the sample information encoding result is input to the speaker countermeasure network to obtain the predicted user information. The residual network extracts the mel spectrum of the sample output voice information (i.e., the sample voice feature) and inputs it into the residual encoder to obtain a prediction residual code (i.e., the residual code in fig. 8). Further, the prediction residual code, the language feature corresponding to the sample input text (i.e., the language embedding in fig. 8), the voice feature corresponding to the sample output voice information (i.e., the speaker embedding in fig. 8), the pitch feature (i.e., the pitch embedding in fig. 8), and the text encoding may be concatenated and input to the decoder to obtain a predicted mel spectrum (i.e., the mel spectrum in fig. 8), and the vocoder may then generate and output the predicted voice information (i.e., the voice in fig. 8) based on the obtained mel spectrum. The description of the language feature corresponding to the sample input text (i.e., the language embedding) and the speech feature corresponding to the sample output speech information (i.e., the speaker embedding) is the same as in example 1 above and is not repeated here.

Further, a value of a user information classification loss function (i.e., the second loss function in the foregoing) may be obtained based on the predicted user information and the user information label corresponding to the sample input text; a KL distance of the variational self-encoder may be determined based on the obtained prediction residual code and the sample residual code corresponding to the sample output voice information; and a value of a speech loss function (i.e., the first loss function in the foregoing) may be determined based on the sample output voice information and the predicted voice information. The sum of the value of the voice loss function, the value of the user information classification loss function, and the KL distance of the variational self-encoder is then taken as the total loss function of the initial neural network model; if the total loss function has not converged, training of the initial neural network model continues until the total loss function converges, so that the trained initial neural network model is obtained, and the voice generation model included in the trained initial neural network model is taken as the voice generation model in the embodiment of the application. When the value of the speech loss function is determined, the mean square error between the predicted mel spectrum and the mel spectrum extracted from the sample output speech information can be determined and used as the value of the speech loss function, and the user information classification loss function can be a cross entropy loss function.

Alternatively, in practical applications, when the speech information of the target user corresponding to the target language (for example, English) needs to be generated from the text to be processed, the text to be processed (the text in fig. 9) may be input into the speech generation model shown in fig. 9. The pitch information and the phoneme information of the text to be processed (i.e., the toned phoneme sequence in fig. 9) are obtained, the pitch feature containing the pitch information (i.e., the pitch embedding in fig. 9) and the phoneme feature containing the phoneme information (i.e., the phoneme embedding in fig. 9) are then obtained, and the phoneme embedding is input into the text encoder to obtain the information encoding result (i.e., the text encoding in fig. 9). The language feature corresponding to English (i.e., the language embedding in fig. 9), the speech feature of the target user (i.e., the speaker embedding in fig. 9), the text encoding, and the pitch embedding are then spliced and sent to the decoder to obtain the mel spectrum, and the speech information of the target user corresponding to English is obtained through the vocoder. For the description of the speech feature corresponding to each user and the language feature corresponding to each language, reference may be made to the description in example 1 above, which is not repeated here. Similarly, in order to reduce the noise of the sample output speech information and further improve the speech synthesis effect, a residual code of an all-zero vector can be used as the prior mean, spliced with the language embedding, the speaker embedding, and the text encoding, and then sent to the decoder to obtain the mel spectrum.

As shown in fig. 10, in this example, the pitch feature (pitch embedding in fig. 10) is not directly fed into the text encoder, but is spliced with the information encoding result (text encoding in fig. 10) obtained based on the phoneme feature, the speech feature (speaker embedding in fig. 10) corresponding to the target user, and the language feature (language embedding in fig. 10) of the target language and then input to the decoder, so that the pitch of the generated speech (mel spectrum in fig. 10) can be more directly controlled, thereby avoiding the influence of the text encoder on the pitch and improving the speech effect of the generated speech.

The second embodiment is as follows:

the following is a detailed description of an application scenario in which the input speech information is used to generate speech information in different candidate languages (such as chinese or english), and a corresponding speech generation model can be obtained by training a neural network model. In this scenario, the information to be processed is speech information to be processed, the sample input information in the training data is sample input speech information, and the sample input speech information and the sample output speech information in the training sample pair are the same speech information.

As shown in fig. 11, the initial neural network model may include an initial speech generation model (not shown in fig. 11), an initial user information classification module (the speaker countermeasure network in fig. 11), a residual network, and a pitch classifier. The initial speech generation model comprises an initial coding module (not shown in fig. 11) and an initial speech generation module (not shown in fig. 11); the initial coding module comprises an initial phoneme recognizer (taking a Chinese-English phoneme recognizer as an example in fig. 11) and a text encoder which are connected in sequence, and the initial speech generation module comprises a decoder and a vocoder (not shown in fig. 11) which are connected in sequence. The residual network includes a speech feature extraction module (not shown in fig. 11) and a residual encoder (specifically, the residual encoder can be a variational self-encoder) connected in sequence; the residual encoder is connected to the decoder in the initial speech generation module, and the text encoder is connected to the initial user information classification module, the decoder, and the pitch classifier, respectively.

The process of training the initial neural network model shown in fig. 11 is described in detail below. In this example, the initial user information classification module may be a confrontation network (i.e., the speaker confrontation network in fig. 11) composed of a gradient inversion layer and a user information classifier (i.e., the speaker classifier in fig. 11), where the roles played by the gradient inversion layer are the same as those played by the gradient inversion layer in the first embodiment, and the description of this portion may refer to the description in the first embodiment, and will not be repeated here. The voice feature extracted by the voice feature extraction module (not shown in fig. 11) included in the residual error network may be a mel spectrum; the structure and functions of the tone classifier are the same as those of the tone classifier in the first embodiment, and the description of this part can be referred to the description of the first embodiment, and will not be repeated here.

Correspondingly, when the initial neural network model is trained based on the training data, the phoneme posterior probability of the sample input speech information (the speech in fig. 11) can be obtained by the Chinese-English phoneme recognizer (i.e., the Chinese-English PPG in fig. 11) and input to the text encoder to obtain the sample information encoding result (i.e., the text encoding in fig. 11); the text encoding is input to the pitch classifier and the speaker countermeasure network, respectively, to obtain the predicted pitch feature (the pitch label in fig. 11) and the predicted user information (the speaker label in fig. 11). Meanwhile, the mel spectrum of the sample output voice information (the voice in fig. 11) can be extracted by the residual network and input to the residual encoder to obtain the prediction residual code (i.e., the residual code in fig. 11). Further, the residual code, the language feature corresponding to the sample input speech information (i.e., the language embedding in fig. 11), the speech feature corresponding to the sample output speech information (i.e., the speaker embedding in fig. 11), and the text encoding may be concatenated ("concatenating" in fig. 11) and then input to the decoder to obtain a predicted mel spectrum (i.e., the mel spectrum in fig. 11), and the vocoder may then generate and output the predicted speech information (not shown in fig. 11) based on the obtained mel spectrum.

The language features corresponding to the input speech information of each sample (i.e., language embedding in fig. 11) and the speech features corresponding to the output speech information of each sample (i.e., speaker embedding in fig. 11) may be predetermined and configured in the initial neural network model, and then directly obtained during training; of course, the module for determining the language feature and the speech feature may also be configured in the initial neural network model and determined in real time during training, which is not limited in the embodiment of the present application. The speech feature (i.e., speaker embedding) corresponding to the sample output speech information may be obtained by extracting an x-vector (e.g., a 64-dimensional x-vector) of the sample output speech using a pre-trained x-vector (x-vector, an expression form of speech feature) extractor.

Further, a value of a pitch classification loss function (i.e., the third loss function in the foregoing) may be obtained based on the sample pitch feature corresponding to the sample input speech information and the predicted pitch feature; a value of a user information classification loss function (i.e., the second loss function in the foregoing) may be obtained based on the predicted user information and the user information label corresponding to the sample input voice information; a KL distance of the variational self-encoder may be determined based on the obtained prediction residual code and the sample residual code; and a value of a voice loss function (i.e., the first loss function in the foregoing) may be determined based on the sample output voice information and the predicted voice information. The sum of the value of the speech loss function, the value of the user information classification loss function, the value of the pitch classification loss function, and the KL distance of the variational self-encoder is then taken as the total loss function of the initial neural network model; if the total loss function has not converged, training of the initial neural network model continues until the total loss function converges, the trained initial neural network model is obtained, and the speech generation model included in the trained initial neural network model is taken as the speech generation model in the embodiment of the application. When determining the value of the speech loss function, the mean square error between the predicted mel spectrum and the mel spectrum extracted from the sample output speech information can be determined and used as the value of the speech loss function, and the pitch classification loss function and the user information classification loss function can be cross entropy loss functions.

Here, the specific structure of the Chinese-English phoneme recognizer may be as shown in fig. 12. Since this example involves two candidate languages, the Chinese-English phoneme recognizer may include an English phoneme recognizer corresponding to English and a Mandarin phoneme recognizer corresponding to Chinese; the English phoneme recognizer and the Mandarin phoneme recognizer are obtained by training in advance on a large amount of English and Mandarin speech information, respectively. Accordingly, after the sample input speech information (the speech in fig. 12) is input to the Chinese-English phoneme recognizer, it may be input to the English phoneme recognizer and the Mandarin phoneme recognizer respectively to obtain the phoneme posterior probability of the sample input speech information corresponding to English (the English PPG in fig. 12) and the phoneme posterior probability corresponding to Chinese (the Mandarin PPG in fig. 12). The two phoneme posterior probabilities are then spliced together ("splicing" in fig. 12) to obtain the phoneme posterior probability corresponding to the sample input speech information (i.e., the Chinese-English PPG in fig. 12), which is input to the text encoder for the subsequent processing.
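A minimal sketch of this PPG splicing is shown below, assuming two pretrained per-language recognizers that each return a frame-level PPG matrix; the function names are illustrative.

```python
import numpy as np

def chinese_english_ppg(speech: np.ndarray, en_recognizer, zh_recognizer) -> np.ndarray:
    en_ppg = en_recognizer(speech)   # [num_frames, num_english_phonemes]
    zh_ppg = zh_recognizer(speech)   # [num_frames, num_mandarin_phonemes]
    return np.concatenate([en_ppg, zh_ppg], axis=-1)   # spliced along the phoneme axis
```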

Optionally, in practical applications, when the voice information of the target user corresponding to the target language (for example, English) needs to be generated from the voice information to be processed, the speech information to be processed (the speech in fig. 13) may be input into the speech generation model shown in fig. 13. The Chinese-English phoneme recognizer in the speech generation model obtains the phoneme posterior probability of the speech information to be processed (the Chinese-English PPG in fig. 13) and inputs it into the text encoder to obtain the information encoding result (i.e., the text encoding in fig. 13); the language feature corresponding to English (i.e., the language embedding in fig. 13), the speech feature of the target user (i.e., the speaker embedding in fig. 13), and the text encoding are then spliced and sent to the decoder to obtain the mel spectrum, and the speech information of the target user corresponding to English (not shown in fig. 13) is then obtained through the vocoder.

The speech feature corresponding to each user and the language feature corresponding to each language can be pre-configured in the speech generation model, and the speech feature corresponding to each user is the average of the speech features of all sample output speech information corresponding to that user during training (for example, the average of the x-vectors of all sample output speech information). In order to reduce the noise of the sample output speech information and further improve the speech synthesis effect, a residual code of an all-zero vector can be used as the prior mean and spliced with the language embedding, the speaker embedding, and the text encoding before being sent to the decoder to obtain the mel spectrum.

The third embodiment is as follows:

in practical applications, when the information to be processed is speech information to be processed, in addition to the attention-based recurrent neural network (RNN) decoder described above, the speech generation model may also perform decoding using a decoder based on FFT (Feed-Forward Transformer) blocks; the corresponding initial neural network model is shown in fig. 14. As can be seen from fig. 14, this initial neural network model is the same as the initial neural network model shown in fig. 11, except that the decoder in fig. 11 is replaced by an FFT block with context preservation repeated N times (usually N is 6) and a linear layer is added. Therefore, for a detailed description of the structure of the initial neural network model other than the FFT block with context preservation and the linear layer, reference may be made to the description of fig. 11 above, which is not repeated here.

As shown in fig. 15, the FFT block with context preservation described above (i.e., the FFT block with context preservation in fig. 15) may generally include a pure FFT block and a linear layer for obtaining a predicted speech feature (i.e., the mel spectrum in fig. 15) based on the input information, so as to form the context preservation mechanism. It should be noted that, in the actual training process, after the output information is obtained from the included pure FFT block, the output information needs to be input both to the linear layer forming the context preservation mechanism and to the linear layer connected to the FFT block with context preservation (i.e., the linear layer connected to the FFT block with context preservation in fig. 14).

In this example, the FFT block used for decoding is provided with a context preservation mechanism, so a better predicted mel spectrum can be decoded. In practical applications, each pure FFT block may generally comprise a multi-head self-attention layer, a convolutional layer (a one-dimensional convolutional layer is taken as an example in the figures), and two residual connection and normalization layers (i.e., Add & Norm in fig. 15 and fig. 16); one of the residual connection and normalization layers is connected to the multi-head self-attention layer and the convolutional layer, respectively, and the other is connected to the convolutional layer and the linear layers (the linear layers include the linear layer forming the context preservation mechanism and the linear layer connected to the FFT block with context preservation, i.e., the linear layer connected to the FFT block with context preservation in fig. 14), as specifically shown in fig. 16.
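A minimal PyTorch sketch of one such FFT block with context preservation is given below; the layer sizes, head count, and mel dimension are illustrative assumptions rather than the patent's configuration.

```python
import torch
import torch.nn as nn

class FFTBlockWithContext(nn.Module):
    def __init__(self, d_model=256, n_heads=4, mel_dim=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm2 = nn.LayerNorm(d_model)
        self.mel_out = nn.Linear(d_model, mel_dim)    # linear layer forming the context preservation output

    def forward(self, x):                             # x: [batch, frames, d_model]
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)                         # residual connection + normalization
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm2(x + c)                         # residual connection + normalization
        return x, self.mel_out(x)                     # hidden states + intermediate mel prediction
```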

Correspondingly, when the initial neural network model is trained based on the training data, the phoneme posterior probability of the sample input speech information (the speech in fig. 14) can be obtained by the Chinese-English phoneme recognizer (i.e., the Chinese-English PPG in fig. 14) and input to the text encoder to obtain the sample information encoding result (i.e., the text encoding in fig. 14); the text encoding is input to the pitch classifier and the speaker countermeasure network, respectively, to obtain the predicted pitch feature (the pitch label in fig. 14) and the predicted user information (the speaker label in fig. 14). Meanwhile, the mel spectrum of the sample output voice information (the voice in fig. 14) can be extracted by the residual network and input into the residual encoder to obtain the prediction residual code (i.e., the residual code in fig. 14). Further, the residual code, the language feature corresponding to the sample input speech information (i.e., the language embedding in fig. 14), the speech feature corresponding to the sample output speech information (i.e., the speaker embedding in fig. 14), and the text encoding may be concatenated (i.e., "concatenating" in fig. 14) and then input to the N-times repeated FFT block with context preservation (the FFT block with context preservation in fig. 14) and the linear layer to obtain a predicted mel spectrum (i.e., the mel spectrum in fig. 14), and the vocoder may then generate and output the predicted speech information (not shown in fig. 14) based on the obtained mel spectrum.

For the determination manner of the language feature corresponding to each sample input speech information (i.e., the language embedding in fig. 14), the speech feature corresponding to each sample output speech information (i.e., the speaker embedding in fig. 14), and the expression form of the speech feature corresponding to each sample output speech information, reference may be made to the description in the second embodiment, which is not repeated here; the specific structure of the Chinese-English phoneme recognizer is the same as that shown in fig. 12, and for this structure reference may also be made to the description in the second embodiment, which is not repeated here.

Further, a value of a pitch classification loss function (i.e., the third loss function in the foregoing) may be obtained based on the sample pitch feature corresponding to the sample input speech information and the predicted pitch feature; a value of a user information classification loss function (i.e., the second loss function in the foregoing) may be obtained based on the predicted user information and the user information label corresponding to the sample input voice information; a KL distance of the variational self-encoder may be determined based on the obtained residual code and the sample residual code; and a value of a voice loss function (i.e., the first loss function in the foregoing) may be determined based on the sample output voice information and the predicted voice information. The sum of the value of the speech loss function, the value of the user information classification loss function, the value of the pitch classification loss function, and the KL distance of the variational self-encoder is then taken as the total loss function of the initial neural network model; if the total loss function has not converged, training of the initial neural network model continues until the total loss function converges, so as to obtain the trained initial neural network model, and at this time the speech generation model included in the trained initial neural network model may be used as the speech generation model in the embodiment of the present application, as specifically shown in fig. 17. Note that the decoder used for decoding in the speech generation model obtained at this time is a pure FFT block repeated N times, not the FFT block with context preservation repeated N times.

It can be understood that, in this example, since the decoding is performed by the FFT block with context preservation repeated N times, and the linear layer included in each FFT block with context preservation outputs one predicted speech feature, in order to make the training more accurate, when determining the total loss function of the neural network model, the loss between each predicted speech feature output by the linear layers in the FFT blocks with context preservation and the sample speech feature corresponding to the sample output speech information can be determined based on preset weights, and the total loss function can then be determined together with the sum of the speech loss function, the user information classification loss function, the pitch classification loss function, and the KL distance of the variational self-encoder.
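A minimal sketch of adding the intermediate mel predictions of the N repeated blocks to the loss with preset weights is shown below; the uniform default weights are an assumption.

```python
import torch.nn.functional as F

def context_preservation_loss(intermediate_mels, target_mel, weights=None):
    # intermediate_mels: list of mel predictions, one per FFT block with context preservation
    weights = weights or [1.0 / len(intermediate_mels)] * len(intermediate_mels)
    return sum(w * F.mse_loss(m, target_mel) for w, m in zip(weights, intermediate_mels))
```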

When the value of the speech loss function is determined, the mean square error between the predicted Mel spectrum and the Mel spectrum extracted from the sample output speech information can be determined and used as the value of the speech loss function, and the pitch classification loss function and the user information classification loss function can be cross entropy loss functions.

Alternatively, in practical application, when the speech information to be processed needs to be synthesized into the speech information of the target user corresponding to the target language (for example, English), the speech information to be processed (the speech in fig. 17) may be input into the speech generation model shown in fig. 17. The Chinese-English phoneme recognizer in the speech generation model obtains the phoneme posterior probability of the speech information to be processed (the Chinese-English PPG in fig. 17) and inputs it into the text encoder to obtain the information encoding result (i.e., the text encoding in fig. 17); the language feature corresponding to English (i.e., the language embedding in fig. 17), the speech feature of the target user (i.e., the speaker embedding in fig. 17), and the text encoding are then spliced and sent to the pure FFT block repeated N times to obtain the mel spectrum, and the speech information of the target user corresponding to English (not shown in fig. 17) is then obtained through the vocoder.

The speech feature corresponding to each user and the language feature corresponding to each language can be pre-configured in the speech generation model, and the speech feature corresponding to each user is the average of the speech features of all sample output speech information corresponding to that user during training (for example, the average of the x-vectors of all sample output speech information). In order to reduce the noise of the sample output speech information and further improve the speech synthesis effect, a residual code of an all-zero vector can be used as the prior mean, spliced with the language embedding, the speaker embedding, and the text encoding, and then sent to the pure FFT block repeated N times to obtain the mel spectrum.

Based on the same principle as the method provided by the embodiment of the present application, the embodiment of the present application further provides a speech generation apparatus. As shown in fig. 18, the speech generation apparatus 100 may include an information acquisition module 101, an information encoding module 102, and a speech generation module 103, wherein:

an information obtaining module 101, configured to obtain information to be processed;

the information encoding module 102 is configured to encode the information to be processed to obtain an information encoding result;

and the voice generating module 103 is used for generating voice information of the target user corresponding to the target language based on the information coding result.

Optionally, if the information to be processed is a text to be processed, when the speech generation module generates speech information of the target user corresponding to the target language based on the information encoding result, the speech generation module is specifically configured to:

acquiring tone characteristics of a text to be processed;

and generating voice information corresponding to the target language of the target user based on the tone characteristics and the information coding result.

Optionally, if the information to be processed is a text to be processed, the information encoding module is specifically configured to, when encoding the information to be processed to obtain an information encoding result:

acquiring phoneme characteristics corresponding to the text to be processed;

and carrying out text coding on the phoneme characteristics to obtain a text coding result.

Optionally, the information encoding module is specifically configured to, when performing text encoding on the phoneme features to obtain a text encoding result:

acquiring tone features corresponding to a text to be processed;

and carrying out text coding on the tone features and the phoneme features to obtain a text coding result.
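As an illustration of this option, a minimal sketch of text-encoding tone features together with phoneme features is given below (Python/PyTorch); the vocabulary sizes, the embedding dimensionality and the use of a GRU as the text encoder are assumptions, not details given by this embodiment.

    import torch
    import torch.nn as nn

    class ToneAwareTextEncoder(nn.Module):
        # Sketch: jointly text-encode phoneme features and tone features.
        def __init__(self, n_phonemes=100, n_tones=6, dim=128):
            super().__init__()
            self.phoneme_embedding = nn.Embedding(n_phonemes, dim)
            self.tone_embedding = nn.Embedding(n_tones, dim)
            self.encoder = nn.GRU(2 * dim, dim, batch_first=True)

        def forward(self, phoneme_ids, tone_ids):
            # both inputs: (batch, length) index sequences derived from the
            # text to be processed
            x = torch.cat([self.phoneme_embedding(phoneme_ids),
                           self.tone_embedding(tone_ids)], dim=-1)
            text_encoding, _ = self.encoder(x)
            return text_encoding              # the text encoding result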

Optionally, if the information to be processed is speech information to be processed, the information encoding module is configured to, when encoding the information to be processed to obtain an information encoding result, specifically:

acquiring a phoneme posterior probability corresponding to the voice information to be processed;

and coding the phoneme posterior probability to obtain an information coding result.

Optionally, when obtaining the posterior probability of the phoneme corresponding to the speech information to be processed, the information encoding module is specifically configured to:

acquiring phoneme posterior probability of the speech information to be processed corresponding to each candidate language;

and splicing the phoneme posterior probabilities corresponding to each candidate language to obtain the phoneme posterior probabilities corresponding to the voice information to be processed.
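A minimal sketch of this splicing (Python/numpy), assuming frame-aligned posteriorgrams and two candidate languages such as Chinese and English:

    import numpy as np

    def combined_ppg(ppgs_per_language):
        # Each element: array of shape (time, n_phonemes_i) for one candidate
        # language; the result has shape (time, sum of n_phonemes_i).
        return np.concatenate(ppgs_per_language, axis=-1)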

Optionally, the apparatus is implemented by a speech generating model, and the speech generating model is trained by a training module in the following manner:

acquiring an initial neural network model and training data, wherein the training data comprises training sample pairs, and the training sample pairs comprise sample input information, sample output voice information corresponding to the sample input information and a sample user information label;

the initial neural network model comprises an initial speech generation model and an initial user information classification module, the initial speech generation model comprises an initial coding module and an initial speech generation module, and the initial user information classification module is connected with the initial coding module; the initial coding module is used for coding the sample input information to obtain a sample information coding result, the initial speech generation module is used for obtaining predicted speech information based on the sample information coding result, and the initial user information classification module is used for obtaining predicted user information based on the sample information coding result;

training the initial neural network model based on the training data until a total loss function corresponding to the initial neural network model converges, to obtain a trained initial neural network model, and taking the trained initial speech generation model as the speech generation model;

the total loss function comprises a first loss function and a second loss function, the value of the first loss function represents the difference between the predicted voice information corresponding to the input information and the sample output voice information, and the value of the second loss function represents the difference between the predicted user information corresponding to the sample input information and the sample user information label.

Optionally, the initial neural network model further includes a pitch classifier connected to the initial coding module, and the training sample pair further includes a sample pitch feature corresponding to the sample input information;

the pitch classifier is used for obtaining a predicted pitch characteristic based on the sample information coding result;

the total loss function also includes a third loss function whose value characterizes a difference between the predicted pitch characteristic and the sample pitch characteristic.
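Purely as a sketch of how these loss terms might be combined in one training step (Python/PyTorch): the module and batch field names below are hypothetical, and equal weighting of the terms is assumed rather than specified by this embodiment.

    import torch.nn.functional as F

    def train_step(coding_module, speech_generation_module, user_classifier,
                   pitch_classifier, optimizer, batch):
        # Shared sample information encoding result.
        encoding = coding_module(batch["sample_input"])
        pred_mel = speech_generation_module(encoding)
        user_logits = user_classifier(encoding)

        loss = F.mse_loss(pred_mel, batch["target_mel"])                  # first loss
        loss = loss + F.cross_entropy(user_logits, batch["user_label"])   # second loss
        if pitch_classifier is not None:                                  # third loss
            pitch_logits = pitch_classifier(encoding)
            loss = loss + F.cross_entropy(pitch_logits, batch["pitch_label"])

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()   # training repeats until the total loss converges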

Optionally, if the sample input information is sample input speech information, the initial neural network model further includes an initial phoneme recognizer connected to the initial coding module, and the initial phoneme recognizer is configured to determine a posterior probability of a phoneme corresponding to the sample input speech information;

the initial encoding module is specifically configured to, when encoding the sample input information to obtain a sample information encoding result:

and coding the posterior probability of the phoneme corresponding to the sample input voice information to obtain a sample information coding result.

Optionally, when the training module acquires the training data, the training module is specifically configured to:

acquiring initial training data, wherein the initial training data comprises initial sample pairs, the initial sample pairs comprise initial sample input information, initial sample output voice information corresponding to the initial sample input information, and user information labels corresponding to the initial sample input information;

performing voice adjustment processing on at least one piece of initial sample output voice information to obtain processed initial sample output voice information, and obtaining a user information label corresponding to the processed initial sample output voice information;

taking the initial training data, the processed initial sample output voice information and a user information label corresponding to the processed initial sample output voice information as training data;

wherein the voice adjustment processing includes at least one of a speed adjustment processing, a pitch adjustment processing, and a noise addition processing.
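A minimal sketch of these three kinds of voice adjustment processing (Python, using librosa and numpy) is given below; the parameter choices and the use of Gaussian noise are assumptions for illustration.

    import numpy as np
    import librosa

    def adjust_voice(wav, sr=16000, speed=None, pitch_steps=None, noise_std=None):
        out = wav
        if speed is not None:                                    # speed adjustment
            out = librosa.effects.time_stretch(out, rate=speed)
        if pitch_steps is not None:                              # pitch adjustment
            out = librosa.effects.pitch_shift(out, sr=sr, n_steps=pitch_steps)
        if noise_std is not None:                                # noise addition
            out = out + np.random.normal(0.0, noise_std, size=out.shape)
        return out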

Optionally, the initial sample output voice information includes initial sample output voice information corresponding to at least two languages with different training data volumes, where the training data volume includes a user data volume and/or a voice data volume;

when the training module performs the speech adjustment processing on the speech information output by at least one initial sample, the training module is specifically configured to:

and performing voice adjustment processing on the initial sample output voice information corresponding to the language with the smaller training data volume.

Optionally, if the voice adjustment processing includes speed adjustment processing and/or pitch adjustment processing, the user information label corresponding to the processed initial sample output voice information is different from the user information label corresponding to the corresponding initial sample output voice information;

and if the voice adjustment processing only includes noise addition processing, the user information label corresponding to the processed initial sample output voice information is the same as the user information label corresponding to the corresponding initial sample output voice information.
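A minimal sketch of this labelling rule (Python), where new_virtual_user_label is a hypothetical placeholder for however a new user identity is allocated:

    def processed_user_label(original_label, used_speed, used_pitch, used_noise,
                             new_virtual_user_label):
        # Speed or pitch adjustment changes the voice enough that the processed
        # speech is treated as a new (virtual) user; noise addition alone keeps
        # the original user information label.
        if used_speed or used_pitch:
            return new_virtual_user_label
        if used_noise:
            return original_label
        return original_label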

Based on the same principle as the method and the apparatus provided by the embodiments of the present application, an embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor is configured to perform, when running the computer program, the method shown in any optional embodiment of the present application.

According to an embodiment of the present application, among the speech generation methods performed by an electronic device, a speech generation method for recognizing a user's speech and interpreting the user's intention may receive a speech signal as an analog signal via a speech acquisition device (e.g., a microphone) and convert the speech portion into computer-readable text using an Automatic Speech Recognition (ASR) model. The user's utterance intention can be obtained by interpreting the converted text using a Natural Language Understanding (NLU) model. The ASR model or the NLU model may be an artificial intelligence model. An artificial intelligence model may be processed by an artificial-intelligence-dedicated processor designed in a hardware architecture specified for artificial intelligence model processing. The artificial intelligence model may be obtained through training. Here, "obtained through training" means that a basic artificial intelligence model is trained with a plurality of pieces of training data by a training algorithm to obtain a predefined operation rule or artificial intelligence model configured to achieve a desired characteristic (or purpose). The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values, and the neural network calculation is performed by a calculation between the calculation result of the previous layer and the plurality of weight values.

Language understanding is a technique for recognizing and applying/processing human language/text, including, for example, natural language processing, machine translation, dialog systems, question answering, or speech recognition/synthesis.

Embodiments of the present application further provide a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program can perform the method shown in any optional embodiment of the present application.

As an example, fig. 19 shows a schematic structural diagram of an electronic device to which an embodiment of the present application is applied, and as shown in fig. 19, an electronic device 4000 shown in fig. 19 includes: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.

The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor 4001 may also be a combination that performs computing functions, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.

Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 19, but it is not intended that there be only one bus or one type of bus.

The Memory 4003 may be a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage, optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.

The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in any of the foregoing method embodiments.

The apparatus provided in the embodiment of the present application may implement at least one of the modules through an AI (Artificial Intelligence) model. The functions associated with the AI may be performed by the non-volatile memory, the volatile memory, and the processor.

The processor may include one or more processors. The one or more processors may be general-purpose processors, such as a Central Processing Unit (CPU) or an Application Processor (AP), graphics-dedicated processors, such as a Graphics Processing Unit (GPU) or a Vision Processing Unit (VPU), and/or AI-dedicated processors, such as a Neural Processing Unit (NPU).

The one or more processors control the processing of the input data according to predefined operating rules or Artificial Intelligence (AI) models stored in the non-volatile memory and the volatile memory. Predefined operating rules or artificial intelligence models are provided through training or learning.

Here, the provision by learning means that a predefined operation rule or an AI model having a desired characteristic is obtained by applying a learning algorithm to a plurality of learning data. This learning may be performed in the device itself in which the AI according to the embodiment is performed, and/or may be implemented by a separate server/system.

The AI model may include a plurality of neural network layers. Each layer has a plurality of weight values, and the calculation of one layer is performed on the calculation result of the previous layer using the plurality of weight values of the current layer. Examples of neural networks include, but are not limited to, Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Bidirectional Recurrent Deep Neural Networks (BRDNNs), Generative Adversarial Networks (GANs), and deep Q-networks.

A learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data to make, allow, or control the target device to make a determination or prediction. Examples of the learning algorithm include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or multiple stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

The foregoing is only a part of the embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.
