Cross-language voice conversion method and device, computer equipment and storage medium

Document No.: 617691    Publication date: 2021-05-07

Note: this technology, "Cross-language voice conversion method and device, computer equipment and storage medium", was designed and created by 赵之源, 王若童 and 黄东延 on 2020-12-28. Its main content is as follows: The embodiment of the invention discloses a cross-language voice conversion method, a cross-language voice conversion device, computer equipment and a storage medium. The method comprises the following steps: acquiring a voice to be converted and an example voice of a target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice; preprocessing the voice to be converted to obtain voice characteristics to be converted, and preprocessing the example voice to obtain example voice characteristics; taking the voice feature to be converted and the example voice feature as input, and obtaining a target voice feature by using a pre-trained voice conversion model; and converting the target voice characteristics into target voice simulating the example voice, wherein the voice content of the target voice is the same as that of the voice to be converted. The embodiment of the invention realizes cross-language synthesis of the target user's voice.

1. A method of cross-language speech conversion, the method comprising:

acquiring a voice to be converted and an example voice of a target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice;

preprocessing the voice to be converted to obtain voice characteristics to be converted, and preprocessing the example voice to obtain example voice characteristics;

taking the voice feature to be converted and the example voice feature as input, and obtaining a target voice feature by using a pre-trained voice conversion model;

and converting the target voice characteristics into target voice simulating the example voice, wherein the voice content of the target voice is the same as that of the voice to be converted.

2. The method of claim 1, wherein the speech feature to be converted is a mel-frequency cepstrum to be converted, the example speech feature is an example mel-frequency cepstrum, the speech conversion model comprises a first encoder, a second encoder, a length adjuster and a decoder, the first encoder comprises an FFT Block, and the obtaining the target speech feature by using a pre-trained speech conversion model with the speech feature to be converted and the example speech feature as input comprises:

inputting the mel-frequency cepstrum to the first encoder to obtain a first vector;

inputting a portion of an example mel-frequency cepstrum to the second encoder to obtain a second vector, the portion of the example mel-frequency cepstrum being randomly truncated in the example mel-frequency cepstrum;

splicing the first vector and the second vector to obtain a third vector;

inputting the third vector to the length adjuster to obtain a fourth vector;

and inputting the fourth vector to the decoder to obtain a predicted Mel cepstrum as a target voice characteristic.

3. The method of claim 2, wherein the first encoder is configured to compress the mel-frequency cepstrum to obtain a first vector, and wherein the length adjuster is configured to obtain a predicted extension length for each frame in the third vector according to the third vector, and to extend the third vector into a fourth vector according to the predicted extension length.

4. The method of claim 1, wherein the training of the speech conversion model comprises:

acquiring training voice, first training example voice of a training user and second training example voice, wherein the voice content of the first training example voice is the same as the voice content of the training voice, and the language used by the voice content of the training voice is different from the language used by the voice content of the second training example voice;

preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, preprocessing the second training example speech to obtain a second training example speech feature, wherein the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum;

inputting the training mel-frequency cepstrum to the first encoder to obtain a first vector;

inputting a portion of a second training example mel-frequency cepstrum to the second encoder to obtain a second vector, the portion of the second training example mel-frequency cepstrum being randomly truncated in the second training example mel-frequency cepstrum;

splicing the first vector and the second vector to obtain a third vector;

inputting the third vector to the length adjuster to obtain a fourth vector;

inputting the fourth vector to the decoder to obtain a training predicted Mel-cepstrum;

calculating a training loss of the training prediction Mel cepstrum and a first training example Mel cepstrum;

and performing back propagation according to the training loss to update the training weight of the voice conversion model until the voice conversion model converges.

5. The method of claim 1, wherein the obtaining the speech to be converted comprises:

acquiring a text to be converted;

and converting the text to be converted into synthetic voice as the voice to be converted.

6. The method according to claim 2, wherein the preprocessing the speech to be converted to obtain the speech features to be converted comprises:

carrying out short-time Fourier transform on the voice to be converted to obtain a magnitude spectrum;

filtering the amplitude spectrum to obtain a Mel frequency spectrum;

and carrying out cepstrum analysis on the Mel frequency spectrum to obtain a Mel cepstrum to be converted, wherein the Mel cepstrum to be converted is used as a voice feature to be converted.

7. The method according to claim 6, wherein the short-time fourier transforming the speech to be converted to obtain an amplitude spectrum comprises:

subtracting the head and tail blank parts in the voice to be converted to obtain a first corrected voice to be converted;

pre-emphasis, framing and windowing are carried out on the first corrected voice to be converted to obtain second corrected voice to be converted;

and carrying out short-time Fourier transform on the second corrected voice to be converted to obtain a magnitude spectrum.

8. An apparatus for cross-language speech conversion, the apparatus comprising:

the voice acquisition module is used for acquiring voice to be converted and example voice of a target user, and the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice;

the voice processing module is used for preprocessing the voice to be converted to obtain voice characteristics to be converted and preprocessing the example voice to obtain example voice characteristics;

the feature conversion module is used for taking the voice feature to be converted and the example voice feature as input and obtaining a target voice feature by using a pre-trained voice conversion model;

and the voice simulation module is used for converting the target voice characteristics into target voice simulating the example voice, and the voice content of the target voice is the same as that of the voice to be converted.

9. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 7.

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for cross-language speech conversion, a computer device, and a storage medium.

Background

Machine learning and deep learning rely on massive data and the strong processing capability of computers, and have made major breakthroughs in fields such as images, speech and text. Because the same type of framework can achieve good results in different fields, neural network algorithm models that were originally used to solve text and image problems have been applied to the speech field.

Existing neural network algorithm models applied to the speech field can capture the characteristics of a target speaker from the target speaker's voice and thus stably synthesize other utterances in the target speaker's voice, with voice similarity and naturalness close to the level of a real person. However, the synthesized voice can only be in the same language as the target speaker's recordings: the target speaker's voice cannot be synthesized speaking other languages. For example, if the target speaker only speaks Chinese, only Chinese speech can be synthesized, and speech in other languages cannot.

Disclosure of Invention

In view of the above, it is necessary to provide a cross-language voice conversion method, apparatus, computer device and storage medium.

In a first aspect, an embodiment of the present invention provides a cross-language speech conversion method, where the method includes:

acquiring a voice to be converted and an example voice of a target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice;

preprocessing the voice to be converted to obtain voice characteristics to be converted, and preprocessing the example voice to obtain example voice characteristics;

taking the voice feature to be converted and the example voice feature as input, and obtaining a target voice feature by using a pre-trained voice conversion model;

and converting the target voice characteristics into target voice simulating the example voice, wherein the voice content of the target voice is the same as that of the voice to be converted.

In a second aspect, an embodiment of the present invention provides a cross-language voice conversion apparatus, where the apparatus includes:

the voice acquisition module is used for acquiring voice to be converted and example voice of a target user, and the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice;

the voice processing module is used for preprocessing the voice to be converted to obtain voice characteristics to be converted and preprocessing the example voice to obtain example voice characteristics;

the feature conversion module is used for taking the voice feature to be converted and the example voice feature as input and obtaining a target voice feature by using a pre-trained voice conversion model;

and the voice simulation module is used for converting the target voice characteristics into target voice simulating the example voice, and the voice content of the target voice is the same as that of the voice to be converted.

In a third aspect, an embodiment of the present invention provides a computer device, including a memory and a processor, where the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to perform the following steps:

acquiring a voice to be converted and an example voice of a target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice;

preprocessing the voice to be converted to obtain voice characteristics to be converted, and preprocessing the example voice to obtain example voice characteristics;

taking the voice feature to be converted and the example voice feature as input, and obtaining a target voice feature by using a pre-trained voice conversion model;

and converting the target voice characteristics into target voice simulating the example voice, wherein the voice content of the target voice is the same as that of the voice to be converted.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the processor is caused to execute the following steps:

acquiring a voice to be converted and an example voice of a target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice;

preprocessing the voice to be converted to obtain voice characteristics to be converted, and preprocessing the example voice to obtain example voice characteristics;

taking the voice feature to be converted and the example voice feature as input, and obtaining a target voice feature by using a pre-trained voice conversion model;

and converting the target voice characteristics into target voice simulating the example voice, wherein the voice content of the target voice is the same as that of the voice to be converted.

The embodiment of the invention obtains a voice to be converted and an example voice whose voice contents use different languages, and inputs them into a pre-trained voice conversion model to obtain a target voice that has the same voice content as the voice to be converted while simulating the example voice. This solves the problem that a target speaker's voice cannot be synthesized speaking other languages, and achieves the beneficial effect of synthesizing the target user's voice across languages.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.

Wherein:

FIG. 1 is a diagram of an application environment for a cross-language speech conversion method in one embodiment;

FIG. 2 is a flow diagram of a method for cross-language speech conversion in one embodiment;

FIG. 3 is a flowchart of step S130 of the cross-language speech conversion method in one embodiment;

FIG. 4 is a flowchart of step S110 of the cross-language speech conversion method in one embodiment;

FIG. 5 is a flowchart of step S120 of the cross-language speech conversion method in one embodiment;

FIG. 6 is a flowchart of step S410 of the cross-language speech conversion method in one embodiment;

FIG. 7 is a flow diagram of a method for speech conversion model training in one embodiment;

FIG. 8 is a block diagram showing the structure of a cross-language speech conversion apparatus according to an embodiment;

FIG. 9 is a block diagram of a computer device in one embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

FIG. 1 is a diagram of an application environment for the cross-language speech conversion method in one embodiment. Referring to FIG. 1, the cross-language voice conversion method is applied to a cross-language voice conversion system. The cross-language voice conversion system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers. The terminal 110 is used for acquiring the voice to be converted and the example voice of the target user and uploading them to the server 120, where the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice. The server 120 is used for receiving the voice to be converted and the example voice of the target user; preprocessing the voice to be converted to obtain voice characteristics to be converted, and preprocessing the example voice to obtain example voice characteristics; taking the voice feature to be converted and the example voice feature as input, and obtaining a target voice feature by using a pre-trained voice conversion model; and converting the target voice characteristics into target voice simulating the example voice, wherein the voice content of the target voice is the same as that of the voice to be converted.

In another embodiment, the above cross-language voice conversion method may also be directly applied to the terminal 110, where the terminal 110 is configured to obtain a voice to be converted and an example voice of a target user, and a language used by a voice content of the voice to be converted is different from a language used by a voice content of the example voice; preprocessing the voice to be converted to obtain voice characteristics to be converted, and preprocessing the example voice to obtain example voice characteristics; taking the voice feature to be converted and the example voice feature as input, and obtaining a target voice feature by using a pre-trained voice conversion model; and converting the target voice characteristics into target voice simulating the example voice, wherein the voice content of the target voice is the same as that of the voice to be converted.

As shown in FIG. 2, in one embodiment, a cross-language speech conversion method is provided. The method can be applied to both the terminal and the server, and this embodiment is exemplified by being applied to the terminal. The cross-language voice conversion method specifically comprises the following steps:

s110, obtaining the voice to be converted and the example voice of the target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice.

In this embodiment, when executing the cross-language voice conversion method, the user may execute it on a mobile device such as a mobile phone. First, the user needs to input the voice to be converted and the example voice of the target user, where the voice content of the voice to be converted is the voice content that the user finally wants to obtain, and the example voice of the target user carries the sound characteristics that the user finally wants the output voice to have. In addition, the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice; that is, the voice to be converted may be Chinese while the example voice is English, or the voice to be converted may be English plus Chinese while the example voice is English. It should be noted that the two languages are considered different as long as the language used by the voice content of the voice to be converted is at least partially different from the language used by the voice content of the example voice. Illustratively, if a user wants to obtain a target voice in which a target speaker who only speaks Chinese utters "Yes", the user only needs to speak "Yes" himself as the voice to be converted and to obtain an example voice of that target speaker, which can be any Chinese utterance spoken by the target speaker.

S120, preprocessing the voice to be converted to obtain voice characteristics to be converted, and preprocessing the example voice to obtain example voice characteristics.

And S130, taking the voice feature to be converted and the example voice feature as input, and obtaining a target voice feature by using a pre-trained voice conversion model.

S140, converting the target voice characteristics into target voice simulating the example voice, wherein the voice content of the target voice is the same as the voice content of the voice to be converted.

In this embodiment, after the voice to be converted and the example voice are obtained, the voice to be converted needs to be preprocessed to obtain the voice feature to be converted, and the example voice is preprocessed to obtain the example voice feature, so that both can be conveniently input to the voice conversion model. The voice conversion model is a neural network model trained in advance on the voices of a large number of training users; both its input and its output during training are voice features. The voice conversion model can extract and combine the voice content in the voice feature to be converted and the sound characteristics in the example voice feature, so the target voice feature can be obtained after the voice feature to be converted and the example voice feature are input to the pre-trained voice conversion model. Finally, the target voice feature is converted into the target voice through another preset neural network model. The target voice obtained from the target voice feature produced by the voice conversion model simulates the sound characteristics of the example voice, while its voice content is the voice content of the voice to be converted; because the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice, cross-language voice conversion is thereby completed. The other preset neural network model may be a WaveNet neural network model, a WaveRNN neural network model, or the like.
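
As a concrete illustration of steps S110 to S140, the following minimal Python sketch shows how the pieces fit together; the arguments preprocess, conversion_model and vocoder are hypothetical stand-ins for the preprocessing routine, the pre-trained speech conversion model and the neural vocoder (e.g. WaveNet or WaveRNN), not an API defined by this embodiment.

```python
def convert_cross_language(voice_to_convert, example_voice,
                           preprocess, conversion_model, vocoder):
    """Minimal sketch of steps S110-S140 (illustrative only).

    preprocess:       callable turning a waveform into a mel-cepstrum feature
    conversion_model: pre-trained speech conversion model (S130)
    vocoder:          neural vocoder such as WaveNet/WaveRNN (S140)
    """
    # S120: preprocess both waveforms into speech features
    features_to_convert = preprocess(voice_to_convert)   # mel cepstrum to be converted
    example_features = preprocess(example_voice)         # example mel cepstrum

    # S130: combine the speech content of the first input with the
    # sound characteristics of the second input
    target_features = conversion_model(features_to_convert, example_features)

    # S140: turn the predicted mel cepstrum back into a waveform that
    # imitates the example speaker but says the content to be converted
    return vocoder(target_features)
```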

The embodiment of the invention obtains a voice to be converted and an example voice whose voice contents use different languages, and inputs them into a pre-trained voice conversion model to obtain a target voice that has the same voice content as the voice to be converted while simulating the example voice. This solves the problem that a target speaker's voice cannot be synthesized speaking other languages, and achieves the beneficial effect of synthesizing the target user's voice across languages.

In one embodiment, as shown in fig. 3, step S130 specifically includes:

s210, inputting the Mel cepstrum to the first encoder to obtain a first vector.

And S220, inputting a part of the example Mel cepstrum to the second encoder to obtain a second vector, wherein the part of the example Mel cepstrum is obtained by random interception in the example Mel cepstrum.

In this embodiment, the speech feature to be converted is a Mel cepstrum to be converted and the example speech feature is an example Mel cepstrum. After the speech feature to be converted and the example speech feature are obtained, they may be input to the pre-trained speech conversion model, where the speech conversion model includes a first encoder, a second encoder, a length adjuster and a decoder. The first encoder is built on the FastSpeech framework and includes an FFT Block (Feed-Forward Transformer Block), which is generated based on a non-autoregressive self-attention mechanism and a one-dimensional convolutional neural network, so the first encoder does not depend on the output of the previous frame and can operate in parallel, which greatly speeds up the generation of the target speech features. Specifically, the first encoder includes a CNN (convolutional neural network) model, a positional encoding model and an FFT Block; the second encoder includes an LSTM (Long Short-Term Memory) model, a Linear (linear regression) model, a pooling layer and a normalization layer; the length adjuster includes a CNN model and a Linear model; and the decoder includes an FFT Block, a Linear model, a Post-Net and an output layer.

Specifically, the Mel cepstrum to be converted is input into the first encoder. The CNN model in the first encoder compresses the Mel cepstrum to be converted to obtain bottleneck features, which helps extract the speech content, and the FFT Block then quickly outputs a first vector based on parallel operation. The vector length of the first vector takes the maximum input sequence length within a batch, and the remaining shorter sequences are zero-padded at the end, so the obtained first vector serves as the extracted speech content. The partial example Mel cepstrum is then input to the second encoder, which outputs a second vector, where the partial example Mel cepstrum is randomly truncated from the example speech feature, i.e. the example Mel cepstrum. Specifically, after the example speech is converted into an example Mel cepstrum, a preset number of segments are randomly cut from the example Mel cepstrum of the target user and spliced together as the partial example Mel cepstrum, and the obtained second vector serves as the extracted sound characteristics.
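
For readers who prefer code, here is a partial PyTorch-style sketch of two of the components just described, the FFT Block used by the first encoder and the LSTM-based second encoder; all layer sizes are placeholder values and the exact layer arrangement of the patented model is not disclosed in this text, so this is an assumption-laden illustration rather than the actual architecture.

```python
import torch.nn as nn

class FFTBlock(nn.Module):
    """Feed-Forward Transformer block: self-attention plus 1-D convolutions,
    roughly in the FastSpeech style; hyperparameters are placeholders."""
    def __init__(self, dim=256, heads=2, conv_dim=1024, kernel=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, conv_dim, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(conv_dim, dim, kernel, padding=kernel // 2),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, frames, dim)
        a, _ = self.attn(x, x, x)              # non-autoregressive self-attention
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + c)

class SecondEncoder(nn.Module):
    """LSTM + linear + pooling + normalization, producing one sound-characteristic
    vector from a randomly truncated example mel cepstrum."""
    def __init__(self, mel_dim=80, dim=256):
        super().__init__()
        self.lstm = nn.LSTM(mel_dim, dim, batch_first=True)
        self.linear = nn.Linear(dim, dim)

    def forward(self, mel):                    # mel: (batch, frames, mel_dim)
        h, _ = self.lstm(mel)
        v = self.linear(h).mean(dim=1)         # average pooling over frames
        return nn.functional.normalize(v, dim=-1)
```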

And S230, splicing the first vector and the second vector to obtain a third vector.

And S240, inputting the third vector to the length adjuster to obtain a fourth vector.

And S250, inputting the fourth vector to the decoder to obtain a predicted Mel cepstrum as a target voice characteristic.

In this embodiment, after the first vector and the second vector are obtained, they are spliced to obtain a third vector, and the third vector is then input to the length adjuster. Through its two convolutional layers, the length adjuster obtains, from the third vector, a predicted extension length for each frame in the third vector, which corresponds to the length that frame should occupy in the predicted Mel cepstrum, and extends the third vector into the fourth vector according to the predicted extension lengths. Illustratively, suppose the speech content corresponding to the third vector is "你好吗" ("how are you"), whose feature length is 3, and the predicted extension lengths obtained by the length adjuster from the third vector are [4, 2, 3]; then in the resulting fourth vector the feature length of "你" is 4, the feature length of "好" is 2, and the feature length of "吗" is 3. Finally, the fourth vector is input into the decoder to obtain a predicted Mel cepstrum, which is taken as the target speech feature.
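
The length adjustment itself amounts to repeating each frame by its predicted extension length; a minimal sketch, assuming PyTorch and integer durations already predicted by the two convolutional layers, is:

```python
import torch

def length_regulate(frames, durations):
    """Expand a frame sequence by repeating frame i `durations[i]` times,
    as the length adjuster is described to do.

    frames:    tensor of shape (frames, dim), e.g. the third vector
    durations: integer tensor of shape (frames,), the predicted extension lengths
    """
    return torch.repeat_interleave(frames, durations, dim=0)

# Toy example mirroring the text: 3 input frames, predicted lengths [4, 2, 3]
x = torch.arange(3).unsqueeze(1).float()   # shape (3, 1)
d = torch.tensor([4, 2, 3])
print(length_regulate(x, d).shape)         # torch.Size([9, 1]): 4 + 2 + 3 frames
```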

In the embodiment of the invention, thanks to the FFT Block generated from the non-autoregressive self-attention mechanism and the one-dimensional convolutional neural network, the first encoder does not depend on the output of the previous frame and can operate in parallel, so the generation of the target speech features is greatly accelerated.

In one embodiment, as shown in fig. 4, step S110 specifically includes:

s310, obtaining the text to be converted and the example voice of the target user.

S320, converting the text to be converted into synthetic voice serving as the voice to be converted, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice.

In this embodiment, if the speech to be converted spoken directly by the user were used as the input speech feature of the subsequent speech conversion model, the input speech feature could carry interference introduced by the user, such as coughing or unclear articulation. Therefore, a text to be converted with the same content is converted into clear and accurate synthesized speech, which eliminates the interference caused by the user.
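
The text does not name a particular TTS front end; as one possible stand-in, the off-the-shelf pyttsx3 package can synthesize the text to be converted into a waveform file that then serves as the speech to be converted:

```python
import pyttsx3

def text_to_synthetic_speech(text_to_convert, wav_path="to_convert.wav"):
    """Sketch of S310-S320 using pyttsx3 as a stand-in TTS engine."""
    engine = pyttsx3.init()
    engine.save_to_file(text_to_convert, wav_path)   # synthesize the text to a file
    engine.runAndWait()                              # block until the file is written
    return wav_path                                  # used as the speech to be converted
```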

Further, to explain why using the synthesized speech as the input of the speech conversion model eliminates the interference caused by the user, consider the use of the speech conversion model. Assume that the feature sequence of the input speech feature to be converted is x = (x_1, x_2, ..., x_n), where n denotes the n-th frame in the time sequence of the Mel cepstrum to be converted, and that the feature sequence of the target speech feature predicted by the speech conversion model is y = (y_1, y_2, ..., y_m), where m likewise denotes the m-th frame in the time sequence of the predicted Mel cepstrum. It is desirable that the feature sequence predicted by the speech conversion model be as close as possible to the target feature sequence of the actual speech feature. Here we assume that each frame of the input feature sequence contains two implicit variables: one implicit variable is the speech content of the input speech, c = (c_1, c_2, ..., c_n), and the other implicit variable is the sound characteristic of the input speech, denoted s^i; the target sequence likewise contains the sound characteristic of the target user, denoted s^t. Here i denotes the input speech and t denotes the target user, with i ∈ {1, 2, ..., j} and t ∈ {1, 2, ..., k}, where j is the number of input speeches in the entire input data set and k is the number of target users in the entire input data set.

The role of the first encoder in the speech conversion model is to remove the sound characteristic s^i of the input speech from the input sequence so that only the speech content c remains; the input sequence can accordingly be represented in terms of c and the sound characteristic, as in equation (1).

Because we use the method of converting TTS-synthesized speech into the target user's speech, the sound characteristic and the speech content can be separated: there is only one sound characteristic in the input speech, namely the sound characteristic of the synthesized speech, which we denote s^0 and which can be regarded as a constant. According to Bayes' theorem, equation (1) can then be rewritten as equation (2).

For the predicted sequence y, the same reasoning gives equation (3), where s^t is the output of the second encoder and c is the output of the first encoder; the two are combined, adjusted by the length adjuster, used as the input of the decoder, and the predicted sequence y is finally output by the decoder. Because c and s^t are derived from two different sequences, they can be considered independent of each other. Combining equations (2) and (3) then yields equation (4).

As can be seen from equation (4), when the input speech is a fixed synthesized speech, the predicted sequence y depends only on the input sequence x, the target user's sound characteristic s^t and the speech content c. This removes the interference with extracting the speech content in the speech conversion model that would arise if the speech to be converted read aloud by the user were acquired directly as the input speech.

In one embodiment, as shown in fig. 5, step S120 specifically includes:

and S410, carrying out short-time Fourier transform on the voice to be converted to obtain a magnitude spectrum.

And S420, filtering the amplitude spectrum to obtain a Mel frequency spectrum.

And S430, performing cepstrum analysis on the Mel frequency spectrum to obtain a Mel cepstrum to be converted, wherein the Mel cepstrum to be converted is used as a voice feature to be converted.

In this embodiment, when the voice to be converted is preprocessed to obtain the voice feature to be converted, the voice to be converted first needs to undergo a short-time Fourier transform, which yields an amplitude spectrum and a phase spectrum and converts the waveform of the voice to be converted from the time domain to the frequency domain, making it easier to extract speech features. A Mel spectrum is then obtained simply by filtering the amplitude spectrum; the filter used here may be a filter bank, which follows the principle that the human ear discriminates low-frequency sounds more finely, so the filters are denser at low frequencies and sparser at high frequencies, and the filtering result is therefore better matched to human hearing of speech. Finally, in order to obtain features closer to the human voice production mechanism and the human nonlinear auditory system, cepstrum analysis is performed on the Mel spectrum to obtain the Mel cepstrum (Mel-Frequency Cepstrum, MFC) to be converted, which is taken as the voice feature to be converted. It should be noted that the example voice needs to be processed in the same way as the voice to be converted, which is not repeated here.
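
A minimal sketch of this preprocessing with common open-source tools (librosa and SciPy) is given below; the sampling rate, FFT size, hop length and the numbers of mel bands and cepstral coefficients are placeholder values, not parameters specified by this embodiment.

```python
import librosa
import numpy as np
from scipy.fftpack import dct

def mel_cepstrum(wav_path, sr=16000, n_fft=1024, hop=256, n_mels=80, n_ceps=40):
    """Sketch of S410-S430: STFT -> mel filtering -> cepstral analysis."""
    y, _ = librosa.load(wav_path, sr=sr)

    # S410: short-time Fourier transform -> amplitude (magnitude) spectrum
    magnitude = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

    # S420: mel filter bank (denser at low frequencies) applied to the magnitudes
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spectrum = mel_fb @ magnitude

    # S430: cepstral analysis (log + DCT) -> mel cepstrum to be converted
    log_mel = np.log(mel_spectrum + 1e-6)
    return dct(log_mel, axis=0, norm="ortho")[:n_ceps]
```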

The embodiment of the invention converts the voice to be converted into a Mel cepstrum, which is not only closer to the characteristics of the human voice production mechanism and the nonlinear auditory system, but is also convenient for the training, input and output of the neural network model.

In one embodiment, as shown in fig. 6, step S410 specifically includes:

and S510, subtracting the head and tail blank parts in the voice to be converted to obtain a first corrected voice to be converted.

S520, pre-emphasis, framing and windowing are carried out on the first corrected voice to be converted to obtain second corrected voice to be converted.

And S530, carrying out short-time Fourier transform on the second corrected voice to be converted to obtain a magnitude spectrum.

In this embodiment, because blank parts may exist at the head and tail of the speech to be converted, the head and tail blank parts are subtracted from the speech to be converted to obtain the first corrected speech to be converted before the short-time Fourier transform, so that the speech conversion model can align, learn and convert better. In addition, in order to better suit the short-time Fourier transform, after the first corrected speech to be converted is obtained, pre-emphasis, framing and windowing also need to be performed on it to obtain the second corrected speech to be converted: pre-emphasis strengthens the high-frequency information of the speech to be converted and filters out part of the noise, while framing and windowing make the speech to be converted more stationary and continuous. Finally, the short-time Fourier transform is performed on the second corrected speech to be converted to obtain the amplitude spectrum. Steps S510 and S520 in the embodiment of the present invention may be selectively performed according to user requirements.
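
A sketch of these optional corrections, again using librosa, is shown below; the trimming threshold top_db and the pre-emphasis coefficient alpha are typical values chosen for illustration, not values given in the text.

```python
import librosa
import numpy as np

def correct_speech(y, sr, top_db=30, alpha=0.97):
    """Sketch of S510-S530 applied before the STFT."""
    # S510: trim leading/trailing silence (the head and tail blank parts)
    y, _ = librosa.effects.trim(y, top_db=top_db)

    # S520: pre-emphasis boosts high frequencies and suppresses some noise
    y = np.append(y[0], y[1:] - alpha * y[:-1])

    # S530: framing and windowing happen inside the STFT call (each frame is
    # multiplied by a Hann window), giving the amplitude spectrum
    magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=256,
                                    window="hann", center=True))
    return magnitude
```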

As shown in FIG. 7, in one embodiment, a method of speech conversion model training is provided. The method can be applied to both the terminal and the server, and this embodiment is exemplified by being applied to the terminal. The training of the voice conversion model specifically comprises the following steps:

s610, obtaining training voice, and training the first training example voice and the second training example voice of the user.

S620, preprocessing the training speech to obtain training speech features, preprocessing the first training example speech to obtain first training example speech features, and preprocessing the second training example speech to obtain second training example speech features.

S630, inputting the training mel-frequency cepstrum to the first encoder to obtain a first vector.

And S640, inputting a part of the second training example Mel cepstrum to the second encoder to obtain a second vector, wherein the part of the second training example Mel cepstrum is obtained by randomly intercepting the second training example Mel cepstrum.

And S650, splicing the first vector and the second vector to obtain a third vector.

And S660, inputting the third vector to the length adjuster to obtain a fourth vector.

And S670, inputting the fourth vector to the decoder to obtain a training prediction Mel cepstrum.

S680, calculating the training loss of the training prediction Mel cepstrum and the first training example Mel cepstrum.

And S690, performing back propagation according to the training loss to update the training weight of the voice conversion model until the voice conversion model converges.

In this embodiment, when training the speech conversion model, a training speech and training example speeches of a training user first need to be obtained. The training example speeches include a first training example speech and a second training example speech, where the speech content of the first training example speech is the same as the speech content of the training speech, and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech; the first training example speech is the speech whose prediction is ultimately required, and the second training example speech supplies the speech feature used as model input. The training speech then needs to be preprocessed to obtain the training speech feature, the first training example speech is preprocessed to obtain the first training example speech feature, and the second training example speech is preprocessed to obtain the second training example speech feature, where the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum. The subsequent operations are the same as those in S210 to S250 of the embodiment of the present invention and are not repeated here. After the training predicted Mel cepstrum is obtained, the training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum, i.e. the loss between the predicted value and the actual value, needs to be calculated, and finally back propagation is performed according to the training loss to update the training weights of the speech conversion model until the speech conversion model converges.
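
One training iteration of S630 to S690 can be sketched as follows in PyTorch; here model is assumed to wrap the first encoder, second encoder, length adjuster and decoder and to return a predicted Mel cepstrum aligned with the first training example Mel cepstrum, and the L1 loss is an assumption, since the text does not specify which loss function is used.

```python
import torch.nn.functional as F

def training_step(model, optimizer, train_mel, first_example_mel, second_example_part):
    """One iteration of S630-S690 (illustrative; model interface is assumed)."""
    # S630-S670: forward pass through encoders, length adjuster and decoder
    predicted_mel = model(train_mel, second_example_part)

    # S680: training loss between prediction and the first training example mel cepstrum
    loss = F.l1_loss(predicted_mel, first_example_mel)

    # S690: back-propagate and update the model weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```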

Two training example speeches need to be acquired; however, when the training set contains enough data, this does not require additional data collection. For example, if the training speech contains "YES", it is necessary to acquire a first training example speech with the same content, that is, "YES" uttered by the training user, and also a second training example speech in a different language, that is, speech in another language uttered by the training user, for example "好" ("good"). When the training set data is sufficient, the "好" uttered by the training user in turn serves as the first training example speech for a training speech that contains "好", and at that point no additional second training example speech needs to be acquired.

Preferably, the languages used by the speech content of the training speech include the language used by the speech content of the speech to be converted in actual use, that is, the language used by the speech content of the speech to be converted participates in the training of the speech conversion model; and the training users also include the target user, that is, the target user participates in the training of the speech conversion model as a training user, so that cross-language conversion can be realized more accurately. In addition, because the first encoder does not depend on the output of the previous frame, the training speed of the speech conversion model is greatly increased.

As shown in fig. 8, in an embodiment, a cross-language speech conversion apparatus is provided, and the cross-language speech conversion apparatus provided in this embodiment can execute the cross-language speech conversion method provided in any embodiment of the present invention, and has corresponding functional modules and beneficial effects of the execution method. The cross-language voice conversion apparatus includes a voice acquisition module 100, a voice processing module 200, a feature conversion module 300, and a voice simulation module 400.

Specifically, the speech acquiring module 100 is configured to acquire a speech to be converted and an example speech of a target user, where a language used by a speech content of the speech to be converted is different from a language used by a speech content of the example speech; the voice processing module 200 is configured to preprocess the voice to be converted to obtain a voice feature to be converted, and preprocess the example voice to obtain an example voice feature; the feature conversion module 300 is configured to use the to-be-converted speech feature and the example speech feature as input, and obtain a target speech feature by using a pre-trained speech conversion model; the voice simulation module 400 is configured to convert the target voice feature into a target voice simulating the example voice, where the voice content of the target voice is the same as the voice content of the voice to be converted.

In one embodiment, the speech to be converted is characterized by a mel-frequency cepstrum to be converted, the example speech characteristic is an example mel-frequency cepstrum, the speech conversion model comprises a first encoder, a second encoder, a length adjuster and a decoder, and the characteristic conversion module 300 is specifically configured to input the mel-frequency cepstrum to the first encoder to obtain a first vector; inputting a portion of an example mel-frequency cepstrum to the second encoder to obtain a second vector, the portion of the example mel-frequency cepstrum being randomly truncated in the example mel-frequency cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector to the length adjuster to obtain a fourth vector; and inputting the fourth vector to the decoder to obtain a predicted Mel cepstrum as a target voice characteristic.

In one embodiment, the first encoder is configured to compress the mel-frequency cepstrum to obtain a first vector, and the length adjuster is configured to obtain a predicted extension length of each frame in the third vector according to the third vector, and extend the third vector into a fourth vector according to the predicted extension length.

In one embodiment, the cross-language speech conversion apparatus further includes a model training module 500, where the model training module 500 is configured to obtain a training speech, a first training example speech for training a user, and a second training example speech, where a speech content of the first training example speech is the same as a speech content of the training speech, and a language used by the speech content of the training speech is different from a language used by the speech content of the second training example speech; preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, preprocessing the second training example speech to obtain a second training example speech feature, wherein the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum; inputting the training mel-frequency cepstrum to the first encoder to obtain a first vector; inputting a portion of a second training example mel-frequency cepstrum to the second encoder to obtain a second vector, the portion of the second training example mel-frequency cepstrum being randomly truncated in the second training example mel-frequency cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector to the length adjuster to obtain a fourth vector; inputting the fourth vector to the decoder to obtain a training predicted Mel-cepstrum; calculating a training loss of the training prediction Mel cepstrum and a first training example Mel cepstrum; and performing back propagation according to the training loss to update the training weight of the voice conversion model until the voice conversion model converges.

In an embodiment, the speech obtaining module 100 is specifically configured to obtain a text to be converted; and converting the text to be converted into synthetic voice as the voice to be converted.

In an embodiment, the speech processing module 200 is specifically configured to perform short-time fourier transform on the speech to be converted to obtain an amplitude spectrum; filtering the amplitude spectrum to obtain a Mel frequency spectrum; and carrying out cepstrum analysis on the Mel frequency spectrum to obtain a Mel cepstrum to be converted, wherein the Mel cepstrum to be converted is used as a voice feature to be converted.

In an embodiment, the speech processing module 200 is further configured to subtract the head and tail blank parts in the speech to be converted to obtain a first corrected speech to be converted; perform pre-emphasis, framing and windowing on the first corrected speech to be converted to obtain a second corrected speech to be converted; and perform a short-time Fourier transform on the second corrected speech to be converted to obtain an amplitude spectrum.

FIG. 9 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be a terminal, and may also be a server. As shown in fig. 9, the computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement a cross-language speech conversion method. The internal memory may also have a computer program stored therein, which when executed by the processor, causes the processor to perform a cross-language speech conversion method. Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is proposed, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:

acquiring a voice to be converted and an example voice of a target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice; preprocessing the voice to be converted to obtain voice characteristics to be converted, and preprocessing the example voice to obtain example voice characteristics; taking the voice feature to be converted and the example voice feature as input, and obtaining a target voice feature by using a pre-trained voice conversion model; and converting the target voice characteristics into target voice simulating the example voice, wherein the voice content of the target voice is the same as that of the voice to be converted.

In one embodiment, the speech feature to be converted is a mel cepstrum to be converted, the example speech feature is an example mel cepstrum, the speech conversion model includes a first encoder, a second encoder, a length adjuster and a decoder, the first encoder includes an FFT Block, and the obtaining the target speech feature by using the pre-trained speech conversion model with the speech feature to be converted and the example speech feature as input includes: inputting the mel-frequency cepstrum to the first encoder to obtain a first vector; inputting a portion of an example mel-frequency cepstrum to the second encoder to obtain a second vector, the portion of the example mel-frequency cepstrum being randomly truncated in the example mel-frequency cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector to the length adjuster to obtain a fourth vector; and inputting the fourth vector to the decoder to obtain a predicted Mel cepstrum as a target voice characteristic.

In one embodiment, the first encoder is configured to compress the mel-frequency cepstrum to obtain a first vector, and the length adjuster is configured to obtain a predicted extension length of each frame in the third vector according to the third vector, and extend the third vector into a fourth vector according to the predicted extension length.

In one embodiment, the training of the speech conversion model comprises: acquiring training voice, first training example voice of a training user and second training example voice, wherein the voice content of the first training example voice is the same as the voice content of the training voice, and the language used by the voice content of the training voice is different from the language used by the voice content of the second training example voice; preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, preprocessing the second training example speech to obtain a second training example speech feature, wherein the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum; inputting the training mel-frequency cepstrum to the first encoder to obtain a first vector; inputting a portion of a second training example mel-frequency cepstrum to the second encoder to obtain a second vector, the portion of the second training example mel-frequency cepstrum being randomly truncated in the second training example mel-frequency cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector to the length adjuster to obtain a fourth vector; inputting the fourth vector to the decoder to obtain a training predicted Mel-cepstrum; calculating a training loss of the training prediction Mel cepstrum and a first training example Mel cepstrum; and performing back propagation according to the training loss to update the training weight of the voice conversion model until the voice conversion model converges.

In one embodiment, the obtaining the voice to be converted includes: acquiring a text to be converted; and converting the text to be converted into synthetic voice as the voice to be converted.

In an embodiment, the preprocessing the speech to be converted to obtain the speech feature to be converted includes: carrying out short-time Fourier transform on the voice to be converted to obtain a magnitude spectrum; filtering the amplitude spectrum to obtain a Mel frequency spectrum; and carrying out cepstrum analysis on the Mel frequency spectrum to obtain a Mel cepstrum to be converted, wherein the Mel cepstrum to be converted is used as a voice feature to be converted.

In one embodiment, the short-time fourier transform of the speech to be converted to obtain a magnitude spectrum includes: subtracting the head and tail blank parts in the voice to be converted to obtain a first corrected voice to be converted; pre-emphasis, framing and windowing are carried out on the first corrected voice to be converted to obtain second corrected voice to be converted; and carrying out short-time Fourier transform on the second corrected voice to be converted to obtain a magnitude spectrum.

In one embodiment, a computer-readable storage medium is proposed, in which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of:

acquiring a voice to be converted and an example voice of a target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice; preprocessing the voice to be converted to obtain voice characteristics to be converted, and preprocessing the example voice to obtain example voice characteristics; taking the voice feature to be converted and the example voice feature as input, and obtaining a target voice feature by using a pre-trained voice conversion model; and converting the target voice characteristics into target voice simulating the example voice, wherein the voice content of the target voice is the same as that of the voice to be converted.

In one embodiment, the speech feature to be converted is a mel cepstrum to be converted, the example speech feature is an example mel cepstrum, the speech conversion model includes a first encoder, a second encoder, a length adjuster and a decoder, the first encoder includes an FFT Block, and the obtaining the target speech feature by using the pre-trained speech conversion model with the speech feature to be converted and the example speech feature as input includes: inputting the mel-frequency cepstrum to the first encoder to obtain a first vector; inputting a portion of an example mel-frequency cepstrum to the second encoder to obtain a second vector, the portion of the example mel-frequency cepstrum being randomly truncated in the example mel-frequency cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector to the length adjuster to obtain a fourth vector; and inputting the fourth vector to the decoder to obtain a predicted Mel cepstrum as a target voice characteristic.

In one embodiment, the first encoder is configured to compress the mel-frequency cepstrum to obtain a first vector, and the length adjuster is configured to obtain a predicted extension length of each frame in the third vector according to the third vector, and extend the third vector into a fourth vector according to the predicted extension length.

In one embodiment, the training of the speech conversion model comprises: acquiring training voice, first training example voice of a training user and second training example voice, wherein the voice content of the first training example voice is the same as the voice content of the training voice, and the language used by the voice content of the training voice is different from the language used by the voice content of the second training example voice; preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, preprocessing the second training example speech to obtain a second training example speech feature, wherein the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum; inputting the training mel-frequency cepstrum to the first encoder to obtain a first vector; inputting a portion of a second training example mel-frequency cepstrum to the second encoder to obtain a second vector, the portion of the second training example mel-frequency cepstrum being randomly truncated in the second training example mel-frequency cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector to the length adjuster to obtain a fourth vector; inputting the fourth vector to the decoder to obtain a training predicted Mel-cepstrum; calculating a training loss of the training prediction Mel cepstrum and a first training example Mel cepstrum; and performing back propagation according to the training loss to update the training weight of the voice conversion model until the voice conversion model converges.

In one embodiment, the obtaining the voice to be converted includes: acquiring a text to be converted; and converting the text to be converted into synthetic voice as the voice to be converted.

In an embodiment, the preprocessing the speech to be converted to obtain the speech feature to be converted includes: carrying out short-time Fourier transform on the voice to be converted to obtain a magnitude spectrum; filtering the amplitude spectrum to obtain a Mel frequency spectrum; and carrying out cepstrum analysis on the Mel frequency spectrum to obtain a Mel cepstrum to be converted, wherein the Mel cepstrum to be converted is used as a voice feature to be converted.

In one embodiment, the short-time fourier transform of the speech to be converted to obtain a magnitude spectrum includes: subtracting the head and tail blank parts in the voice to be converted to obtain a first corrected voice to be converted; pre-emphasis, framing and windowing are carried out on the first corrected voice to be converted to obtain second corrected voice to be converted; and carrying out short-time Fourier transform on the second corrected voice to be converted to obtain a magnitude spectrum.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
