Voice conversion method and device

Document No.: 1273724    Publication date: 2020-08-25

Note: this technology, "Voice conversion method and device" (变声方法及装置), was designed and created by Liu Kai (刘恺) on 2019-01-30. Its main content is summarized below.

The invention discloses a voice conversion method and apparatus. The method includes: receiving a source speaker utterance; extracting speech recognition acoustic features and speech synthesis acoustic features from the source speaker utterance; obtaining speech recognition hidden-layer features from the speech recognition acoustic features; obtaining speech synthesis encoding features from the speech synthesis acoustic features; inputting the speech recognition hidden-layer features and the speech synthesis encoding features into a pre-constructed timbre conversion model corresponding to a specific target speaker to obtain speech synthesis acoustic features of the specific target speaker; and generating the audio signal of the specific target speaker from those speech synthesis acoustic features. The scheme of the invention can convert the speech of any source speaker into the speech of a target speaker with good conversion quality.

1. A voice conversion method, the method comprising:

receiving a source speaker utterance;

extracting speech recognition acoustic features and speech synthesis acoustic features from the source speaker utterance;

obtaining speech recognition hidden-layer features from the speech recognition acoustic features;

obtaining speech synthesis encoding features from the speech synthesis acoustic features;

inputting the speech recognition hidden-layer features and the speech synthesis encoding features into a pre-constructed timbre conversion model corresponding to a specific target speaker, to obtain speech synthesis acoustic features of the specific target speaker; and

generating an audio signal of the specific target speaker from the speech synthesis acoustic features of the specific target speaker.

2. The method of claim 1, further comprising constructing the timbre conversion model corresponding to the specific target speaker by:

collecting audio data of the specific target speaker; and

adaptively training a generic voice conversion model, pre-constructed from the audio data of multiple speakers, with the audio data of the specific target speaker to obtain the timbre conversion model corresponding to the specific target speaker.

3. The method of claim 2, wherein constructing the generic voice conversion model from the audio data of multiple speakers comprises:

collecting audio data of multiple speakers as training data;

extracting speech recognition acoustic features and speech synthesis acoustic features from the training data;

obtaining speech recognition hidden-layer features from the speech recognition acoustic features;

obtaining speech synthesis encoding features from the speech synthesis acoustic features; and

training the generic voice conversion model with the speech recognition hidden-layer features, the speech synthesis encoding features, and the speech synthesis acoustic features.

4. The method of claim 1, wherein obtaining the speech recognition hidden-layer features from the speech recognition acoustic features comprises:

inputting the speech recognition acoustic features into a speech recognition model to obtain the speech recognition hidden-layer features.

5. The method of claim 1, wherein obtaining the speech synthesis encoding features from the speech synthesis acoustic features comprises:

inputting the speech synthesis acoustic features into a pre-constructed reference encoder model to obtain the speech synthesis encoding features.

6. The method of claim 1, wherein the speech recognition acoustic features comprise any one or more of: Mel-frequency cepstral coefficients and perceptual linear prediction parameters.

7. The method of claim 1, wherein the speech synthesis acoustic features comprise a Mel spectrogram.

8. A voice conversion apparatus, comprising:

a receiving module configured to receive a source speaker utterance;

a feature extraction module configured to extract speech recognition acoustic features and speech synthesis acoustic features from the source speaker utterance;

a hidden-layer feature acquisition module configured to obtain speech recognition hidden-layer features from the speech recognition acoustic features;

an encoding feature acquisition module configured to obtain speech synthesis encoding features from the speech synthesis acoustic features;

a feature conversion module configured to input the speech recognition hidden-layer features and the speech synthesis encoding features into a pre-constructed timbre conversion model corresponding to a specific target speaker, to obtain speech synthesis acoustic features of the specific target speaker; and

a speech synthesis module configured to generate an audio signal of the specific target speaker from the speech synthesis acoustic features of the specific target speaker.

9. An electronic device, comprising one or more processors and a memory, wherein the memory stores computer-executable instructions, and the processor executes the computer-executable instructions to implement the method of any one of claims 1 to 7.

10. A readable storage medium having stored thereon instructions that, when executed, implement the method of any one of claims 1 to 7.

Technical Field

The invention relates to the field of speech signal processing, and in particular to a voice conversion method and apparatus.

Background

With the development of speech synthesis technology, making synthesized speech natural, diverse, and personalized has become a focus of current speech research, and voice conversion is one way to achieve such diversity and personalization. Voice conversion refers to techniques that preserve the semantic content of a speech signal while changing the speaker's voice characteristics, so that one person's speech sounds like another's. From the perspective of speaker conversion, voice conversion generally falls into two categories: conversion between non-specific speakers, such as conversion between male and female voices or between different age groups; and conversion between specific speakers, such as converting the voice of speaker A into the voice of speaker B.

A conventional approach to converting the timbre of an arbitrary speaker into that of a target speaker is based on speech recognition technology: parallel corpora are aligned using DTW (Dynamic Time Warping) or an attention mechanism, and timbre conversion is then performed. When training the conversion model, this approach requires collecting parallel corpora of the source and target speakers, that is, audio with identical content, and training the model on the aligned spectral features. During conversion, the spectral features extracted from the source speaker's audio are transformed by the conversion model, the fundamental frequency is linearly stretched, and the aperiodic components are left unchanged. The resulting conversion quality is poor; in particular, the converted speech fails to reflect the prosody and emotion of the source speaker.

Disclosure of Invention

The embodiments of the invention provide a voice conversion method and apparatus that improve conversion quality, making the prosody and emotion of the converted speech closer to the voice characteristics of the source speaker.

To this end, the invention provides the following technical solutions:

A voice conversion method, the method comprising:

receiving a source speaker utterance;

extracting speech recognition acoustic features and speech synthesis acoustic features from the source speaker utterance;

obtaining speech recognition hidden-layer features from the speech recognition acoustic features;

obtaining speech synthesis encoding features from the speech synthesis acoustic features;

inputting the speech recognition hidden-layer features and the speech synthesis encoding features into a pre-constructed timbre conversion model corresponding to a specific target speaker, to obtain speech synthesis acoustic features of the specific target speaker; and

generating an audio signal of the specific target speaker from the speech synthesis acoustic features of the specific target speaker.

Optionally, the method further includes constructing the timbre conversion model corresponding to the specific target speaker by:

collecting audio data of the specific target speaker; and

adaptively training a generic voice conversion model, pre-constructed from the audio data of multiple speakers, with the audio data of the specific target speaker to obtain the timbre conversion model corresponding to the specific target speaker.

Optionally, the method further comprises constructing the generic voice conversion model from the audio data of multiple speakers, specifically:

collecting audio data of multiple speakers as training data;

extracting speech recognition acoustic features and speech synthesis acoustic features from the training data;

obtaining speech recognition hidden-layer features from the speech recognition acoustic features;

obtaining speech synthesis encoding features from the speech synthesis acoustic features; and

training the generic voice conversion model with the speech recognition hidden-layer features, the speech synthesis encoding features, and the speech synthesis acoustic features.

Optionally, obtaining the speech recognition hidden-layer features from the speech recognition acoustic features includes:

inputting the speech recognition acoustic features into a speech recognition model to obtain the speech recognition hidden-layer features.

Optionally, obtaining the speech synthesis encoding features from the speech synthesis acoustic features includes:

inputting the speech synthesis acoustic features into a pre-constructed reference encoder model to obtain the speech synthesis encoding features.

Optionally, the reference encoder model is a neural network model.

Optionally, the speech recognition acoustic features include any one or more of: Mel-frequency cepstral coefficients and perceptual linear prediction parameters.

Optionally, the speech synthesis acoustic features include a Mel spectrogram.

A voice conversion apparatus, the apparatus comprising:

a receiving module configured to receive a source speaker utterance;

a feature extraction module configured to extract speech recognition acoustic features and speech synthesis acoustic features from the source speaker utterance;

a hidden-layer feature acquisition module configured to obtain speech recognition hidden-layer features from the speech recognition acoustic features;

an encoding feature acquisition module configured to obtain speech synthesis encoding features from the speech synthesis acoustic features;

a feature conversion module configured to input the speech recognition hidden-layer features and the speech synthesis encoding features into a pre-constructed timbre conversion model corresponding to a specific target speaker, to obtain speech synthesis acoustic features of the specific target speaker; and

a speech synthesis module configured to generate an audio signal of the specific target speaker from the speech synthesis acoustic features of the specific target speaker.

Optionally, the apparatus further comprises a timbre conversion model construction module configured to construct the timbre conversion model corresponding to the specific target speaker;

the timbre conversion model construction module comprises:

a target data collection unit configured to collect audio data of the specific target speaker; and

a model training unit configured to adaptively train a generic voice conversion model, pre-constructed from the audio data of multiple speakers, with the audio data of the specific target speaker to obtain the timbre conversion model corresponding to the specific target speaker.

Optionally, the apparatus further comprises a generic model construction module configured to construct the generic voice conversion model from the audio data of multiple speakers;

the generic model construction module comprises:

a generic data collection unit configured to collect audio data of multiple speakers as training data;

a feature extraction unit configured to extract speech recognition acoustic features and speech synthesis acoustic features from the training data;

a hidden-layer feature acquisition unit configured to obtain speech recognition hidden-layer features from the speech recognition acoustic features;

an encoding feature acquisition unit configured to obtain speech synthesis encoding features from the speech synthesis acoustic features; and

a generic parameter training unit configured to train the generic voice conversion model with the speech recognition hidden-layer features, the speech synthesis encoding features, and the speech synthesis acoustic features.

Optionally, the hidden-layer feature acquisition unit is specifically configured to input the speech recognition acoustic features into a speech recognition model to obtain the speech recognition hidden-layer features.

Optionally, the encoding feature acquisition unit is specifically configured to input the speech synthesis acoustic features into a pre-constructed reference encoder model to obtain the speech synthesis encoding features.

Optionally, the reference encoder model is a neural network model.

Optionally, the speech recognition acoustic features include any one or more of: Mel-frequency cepstral coefficients and perceptual linear prediction parameters.

Optionally, the speech synthesis acoustic features include a Mel spectrogram.

An electronic device, comprising one or more processors and a memory;

the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the method described above.

A readable storage medium having stored thereon instructions that, when executed, implement the foregoing method.

The voice conversion method and apparatus provided by the embodiments of the invention pre-construct a timbre conversion model corresponding to a specific target speaker; extract speech recognition acoustic features and speech synthesis acoustic features from the audio data of a source speaker utterance; obtain speech recognition hidden-layer features from the speech recognition acoustic features, and speech synthesis encoding features from the speech synthesis acoustic features. Using the hidden-layer features and encoding features as intermediaries, the timbre conversion model converts the source speaker's acoustic features into speech synthesis acoustic features of the specific target speaker, from which the target speaker's audio signal is generated. Because multiple acoustic features are modeled jointly, a better conversion result can be obtained; and because the encoding features, obtained by compressing the speech synthesis acoustic features of the whole utterance, are included, the prosody and emotion of the converted speech can be closer to the voice characteristics of the source speaker.

Furthermore, in the scheme of the invention, a generic voice conversion model is first trained on the audio data of multiple speakers, and the timbre conversion model corresponding to a specific target speaker is then obtained by adaptive training on a small amount of that speaker's audio data. Because the adaptation starts from the generic model, the parameters of the resulting timbre conversion model are more accurate, and the speech synthesis acoustic features it produces better match the voice characteristics of the target speaker, so the final synthesized audio signal is of higher quality. Moreover, for each new target speaker only a small amount of that speaker's audio needs to be recorded, and no parallel corpus matching the source speaker is required, which greatly simplifies the collection of training data.

Drawings

To illustrate the embodiments of the present application and the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the invention; those skilled in the art can derive other drawings from them.

FIG. 1 is a flowchart of constructing the generic voice conversion model in the voice conversion method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the topology of the generic voice conversion model in the voice conversion method according to an embodiment of the present invention;

FIG. 3 is a flowchart of the voice conversion method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the model training and testing process in the voice conversion method according to an embodiment of the present invention;

FIG. 5 is a block diagram of an embodiment of the voice conversion apparatus of the present invention;

FIG. 6 is a block diagram of an apparatus for the voice conversion method according to an exemplary embodiment;

FIG. 7 is a schematic structural diagram of a server in an embodiment of the present invention.

Detailed Description

To help those skilled in the art better understand the solutions of the embodiments of the invention, the embodiments are described in further detail below with reference to the drawings.

The embodiments of the invention provide a voice conversion method and apparatus. A timbre conversion model corresponding to a specific target speaker is pre-constructed; speech recognition acoustic features and speech synthesis acoustic features are extracted from the audio data of a received source speaker utterance; speech recognition hidden-layer features are obtained from the speech recognition acoustic features, and speech synthesis encoding features from the speech synthesis acoustic features. Using the hidden-layer features and encoding features as intermediaries, the timbre conversion model converts the source speaker's acoustic features into speech synthesis acoustic features of the specific target speaker, from which the target speaker's audio signal is generated.

In practice, the timbre conversion model can be trained directly on a large amount of audio data from the specific target speaker. Alternatively, a generic voice conversion model is first trained on the audio data of multiple speakers, and the timbre conversion model corresponding to the specific target speaker is then obtained by adaptive training on a small amount of that speaker's audio data.

FIG. 1 is a flowchart of constructing the generic voice conversion model in the voice conversion method according to an embodiment of the present invention. The process includes the following steps:

Step 101: collect audio data of multiple speakers as training data.

The generic voice conversion model is not tied to a particular target speaker, so it can be trained on audio data from multiple speakers.

Step 102: extract speech recognition acoustic features and speech synthesis acoustic features from the training data.

The speech recognition acoustic features may include, but are not limited to, any one or more of: MFCCs (Mel-frequency cepstral coefficients) and PLP (perceptual linear prediction) parameters. MFCCs are cepstral parameters extracted on the Mel frequency scale, which models the nonlinear frequency perception of the human ear. PLP parameters are features based on an auditory model: the coefficients of an all-pole model's prediction polynomial, comparable to linear prediction coefficient (LPC) features. These features can be extracted with existing techniques and are not described in detail here.

The speech synthesis acoustic features include the Mel spectrogram and the like.
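As a concrete illustration of this feature extraction step, the following Python sketch computes both feature streams for one utterance with librosa. The sampling rate, frame parameters, and feature dimensions (13 MFCCs, 80 Mel bands) are illustrative assumptions, not values specified by the patent, which also allows PLP features in place of MFCCs.

```python
# A minimal feature-extraction sketch for step 102, assuming librosa.
import librosa
import numpy as np

def extract_features(wav_path, sr=16000, n_fft=1024, hop_length=256):
    y, sr = librosa.load(wav_path, sr=sr)
    # Speech recognition acoustic features: MFCCs (PLP could be used instead).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop_length)
    # Speech synthesis acoustic features: log-Mel spectrogram.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=80)
    log_mel = np.log(mel + 1e-6)
    return mfcc.T, log_mel.T   # both shaped (frames, feature_dim)
```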

Step 103: obtain speech recognition hidden-layer features from the speech recognition acoustic features.

In the embodiment of the invention, the speech recognition model may be a neural network model containing one or more hidden layers. The speech recognition acoustic features are fed into the speech recognition model, and the outputs of its hidden layers are captured; in practice, the output of one or more hidden layers can be taken as the speech recognition hidden-layer features.
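A minimal sketch of this step, assuming a pretrained PyTorch speech recognition network: a forward hook taps the output of a chosen hidden layer during a normal forward pass. The model object and the chosen layer are hypothetical placeholders; the patent does not fix a particular recognizer.

```python
# Tapping a hidden layer of a (hypothetical) pretrained ASR network.
import torch

def get_hidden_features(asr_model, layer, feats):
    """feats: tensor of shape (1, frames, feat_dim)."""
    captured = {}

    def hook(module, inputs, output):
        captured["h"] = output.detach()

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        asr_model(feats)      # full forward pass; we only keep the tap
    handle.remove()
    return captured["h"]      # hidden-layer features, (1, frames, hidden_dim)
```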

Step 104: obtain speech synthesis encoding features from the speech synthesis acoustic features.

Specifically, the speech synthesis acoustic features may be fed into a pre-constructed reference encoder model, and the speech synthesis encoding features obtained from its output. The reference encoder may be a neural network, for example several convolutional layers followed by a unidirectional GRU (Gated Recurrent Unit), which compresses an audio signal of indefinite length into a feature vector of fixed length.

Note that the speech synthesis encoding features are the result of compressing the acoustic features of a complete utterance.
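The following PyTorch sketch shows one possible reference encoder of the kind described above: a small convolution stack followed by a unidirectional GRU whose final hidden state compresses a variable-length Mel sequence into a fixed-length vector. Layer counts and widths are illustrative assumptions.

```python
# A minimal reference encoder sketch: conv stack + unidirectional GRU.
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, n_mels=80, conv_channels=128, gru_dim=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(conv_channels, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.gru = nn.GRU(conv_channels, gru_dim, batch_first=True)

    def forward(self, mel):                   # mel: (batch, frames, n_mels)
        x = self.convs(mel.transpose(1, 2))   # -> (batch, channels, frames)
        _, h_n = self.gru(x.transpose(1, 2))  # run the GRU over time
        return h_n[-1]                        # fixed-length vector, (batch, gru_dim)
```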

Step 105: train the generic voice conversion model with the speech recognition hidden-layer features, the speech synthesis encoding features, and the speech synthesis acoustic features.

Both the generic voice conversion model and the timbre conversion model for a specific target speaker may be neural network models, such as a CNN-LSTM (a convolutional neural network combined with a long short-term memory network).

FIG. 2 is a schematic diagram of the topology of the generic voice conversion model in the voice conversion method according to an embodiment of the present invention.

The inputs of the generic voice conversion model are speech recognition hidden-layer features A, speech recognition hidden-layer features B, and the speech synthesis encoding features produced by the reference encoder from the source audio's speech synthesis acoustic features; the output is the target audio's speech synthesis acoustic features. Hidden-layer features A pass through several neural network layers, such as convolutional, pooling, and residual layers, to produce hidden layers 1 and 2; hidden-layer features B pass through multiple DNN layers to produce hidden layer 3; the source speech synthesis acoustic features pass through the reference encoder to produce hidden layer 4. Hidden layers 1 through 4 are combined as the input of the LSTM model.
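The sketch below mirrors this topology in simplified form, with illustrative dimensions: one convolutional branch stands in for the convolution/pooling/residual stack over features A, a small DNN processes features B, the fixed-length reference encoding is broadcast over time, and the concatenation feeds an LSTM that predicts the target speaker's Mel frames.

```python
# A simplified sketch of the FIG. 2 topology; dimensions are assumptions.
import torch
import torch.nn as nn

class ConversionModel(nn.Module):
    def __init__(self, dim_a=256, dim_b=256, ref_dim=128,
                 hidden=256, n_mels=80):
        super().__init__()
        self.branch_a = nn.Conv1d(dim_a, hidden, kernel_size=3, padding=1)
        self.branch_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden * 2 + ref_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, feats_a, feats_b, ref_vec):
        # feats_a, feats_b: (batch, frames, dim); ref_vec: (batch, ref_dim)
        h1 = self.branch_a(feats_a.transpose(1, 2)).transpose(1, 2)
        h3 = self.branch_b(feats_b)
        ref = ref_vec.unsqueeze(1).expand(-1, h1.size(1), -1)
        x = torch.cat([h1, h3, ref], dim=-1)   # combine the hidden layers
        out, _ = self.lstm(x)
        return self.proj(out)                  # predicted target Mel frames
```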

On the basis of the generic voice conversion model, the timbre conversion model corresponding to a specific target speaker can be obtained by collecting a small amount of that speaker's audio data and adaptively training the generic model on it.

The adaptive training process is the same as that of the generic voice conversion model except for the training data.
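A minimal sketch of this adaptive training, assuming the generic model was saved as a PyTorch state dict: the conversion model is initialized from the generic weights and then trained for a few epochs on the target speaker's data alone. The optimizer, learning rate, loss, and epoch count are illustrative assumptions.

```python
# Fine-tuning the generic model on target-speaker data (sketch).
import torch
import torch.nn as nn

def adapt_to_target(model, generic_ckpt, target_loader, epochs=10):
    model.load_state_dict(torch.load(generic_ckpt))  # start from generic weights
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.L1Loss()                          # spectral regression loss
    for _ in range(epochs):
        for feats_a, feats_b, ref_vec, target_mel in target_loader:
            pred = model(feats_a, feats_b, ref_vec)
            loss = criterion(pred, target_mel)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```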

Because the hidden-layer features output by the recognition acoustic model carry little of the source speaker's timbre while retaining the semantic content and part of the prosodic information, learning the mapping from these hidden-layer features to the target speaker's synthesis acoustic features through the conversion model realizes timbre conversion from the source speaker to the target speaker.

The voice conversion method provided by the embodiment of the invention uses the timbre conversion model to convert the source speaker's speech recognition acoustic features into the specific target speaker's speech synthesis acoustic features, and then generates the target speaker's audio signal from those features, realizing real-time conversion of the source speaker's audio into the specific target speaker's audio signal.

FIG. 3 is a flowchart of the voice conversion method according to an embodiment of the present invention. The method includes the following steps:

Step 301: receive a source speaker utterance.

The source speaker utterance is a complete sentence spoken by the source speaker, that is, the audio corresponding to a complete sentence.

Step 302: extract speech recognition acoustic features and speech synthesis acoustic features from the source speaker utterance.

As in the model training phase, the speech recognition acoustic features may include, but are not limited to, any one or more of MFCC and PLP features; the speech synthesis acoustic features include at least a Mel spectrogram.

Step 303: obtain speech recognition hidden-layer features from the speech recognition acoustic features.

The hidden-layer features are obtained by feeding the speech recognition acoustic features into the speech recognition model; specifically, the output of one or more hidden layers of the model can be taken as the speech recognition hidden-layer features.

Step 304: obtain speech synthesis encoding features from the speech synthesis acoustic features.

Specifically, the speech synthesis acoustic features may be fed into the reference encoder model, and the speech synthesis encoding features obtained from its output.

Note that the speech synthesis encoding features are the result of compressing the acoustic features of a complete utterance.

Step 305: input the speech recognition hidden-layer features and the speech synthesis encoding features into the pre-constructed timbre conversion model corresponding to a specific target speaker to obtain the speech synthesis acoustic features of the specific target speaker.

The inputs of the timbre conversion model are the speech recognition hidden-layer features from step 303 and the speech synthesis encoding features from step 304; the output is the speech synthesis acoustic features.

With the timbre conversion model, the source speaker's speech recognition acoustic features are converted into speech synthesis acoustic features carrying the voice characteristics of the specific target speaker.

Step 306: generate the audio signal of the specific target speaker from the speech synthesis acoustic features of the specific target speaker.

Specifically, a neural network vocoder such as WaveNet or WaveRNN can synthesize a speech signal from the speech synthesis acoustic features, realizing the conversion from any source speaker's speech to the target speaker's speech.
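The patent names neural vocoders such as WaveNet and WaveRNN; as a runnable stand-in, the sketch below inverts the predicted log-Mel spectrogram with librosa's Griffin-Lim based mel inversion, which is lower quality than a neural vocoder but needs no trained model. The parameters must match those used when the Mel features were extracted.

```python
# Step 306 sketch: Griffin-Lim mel inversion as a neural-vocoder stand-in.
import librosa
import numpy as np
import soundfile as sf

def mel_to_wav(log_mel, sr=16000, n_fft=1024, hop_length=256):
    mel = np.exp(log_mel.T) - 1e-6   # undo the log compression; (n_mels, frames)
    y = librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
    return y

# Usage: sf.write("converted.wav", mel_to_wav(predicted_log_mel), 16000)
```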

For a better understanding of the solution of the invention, FIG. 4 shows a schematic diagram of the model training and testing process in the voice conversion method according to an embodiment of the present invention.

Note that the type and number of the acoustic features contained in the speech synthesis acoustic features extracted from the source speaker utterance in step 302 may be the same as or different from those of the specific target speaker's speech synthesis acoustic features obtained in step 305; the embodiment of the invention does not limit this.
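Putting the pieces together, the following sketch wires the earlier illustrative components into the inference pipeline of steps 301 to 306. It assumes the functions and models defined in the previous sketches are in scope; `asr_model` and its tapped layer remain hypothetical placeholders, and passing the same hidden tap to both branches of the conversion model is a simplification of the FIG. 2 topology.

```python
# End-to-end inference sketch (steps 301-306) using the components above;
# all models here are untrained, illustrative placeholders.
import torch

mfcc, log_mel = extract_features("source_utterance.wav")        # steps 301-302
feats = torch.tensor(mfcc, dtype=torch.float32).unsqueeze(0)
mel = torch.tensor(log_mel, dtype=torch.float32).unsqueeze(0)

hidden = get_hidden_features(asr_model, asr_model.lstm, feats)  # step 303 (hypothetical model/layer)
ref_vec = reference_encoder(mel)                                # step 304
with torch.no_grad():
    target_mel = conversion_model(hidden, hidden, ref_vec)      # step 305 (same tap reused for both branches)
wav = mel_to_wav(target_mel.squeeze(0).numpy())                 # step 306
```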

In the voice conversion method provided by the embodiment of the invention, a timbre conversion model corresponding to a specific target speaker is pre-constructed; speech recognition acoustic features and speech synthesis acoustic features are extracted from the received source speaker utterance; speech recognition hidden-layer features are obtained from the speech recognition acoustic features, and speech synthesis encoding features from the speech synthesis acoustic features. Using the hidden-layer features and encoding features as intermediaries, the timbre conversion model converts the source speaker's acoustic features into speech synthesis acoustic features of the specific target speaker, from which the target speaker's audio signal is generated. Because multiple acoustic features are modeled jointly, a better conversion result can be obtained; and because the encoding features, obtained by compressing the speech synthesis acoustic features of the whole utterance, are included, the prosody and emotion of the converted speech can be closer to the voice characteristics of the source speaker.

In addition, in the scheme of the invention, a generic voice conversion model is first trained on the audio data of multiple speakers, and the timbre conversion model corresponding to a specific target speaker is then obtained by adaptive training on a small amount of that speaker's audio data. Because the adaptation starts from the generic model, the parameters of the resulting timbre conversion model are more accurate, and the speech synthesis acoustic features it produces better match the voice characteristics of the target speaker, so the final synthesized audio signal is of higher quality. Moreover, for each new target speaker only a small amount of that speaker's audio needs to be recorded, and no parallel corpus matching the source speaker is required, which greatly simplifies the collection of training data.

Correspondingly, an embodiment of the invention further provides a voice conversion apparatus; FIG. 5 is a block diagram of the apparatus.

In this embodiment, the apparatus includes the following modules:

a receiving module 501 configured to receive a source speaker utterance;

a feature extraction module 502 configured to extract speech recognition acoustic features and speech synthesis acoustic features from the source speaker utterance;

a hidden-layer feature acquisition module 503 configured to obtain speech recognition hidden-layer features from the speech recognition acoustic features;

an encoding feature acquisition module 504 configured to obtain speech synthesis encoding features from the speech synthesis acoustic features;

a feature conversion module 505 configured to input the speech recognition hidden-layer features and the speech synthesis encoding features into a pre-constructed timbre conversion model corresponding to a specific target speaker, to obtain speech synthesis acoustic features of the specific target speaker;

a speech synthesis module 506 configured to generate the audio signal of the specific target speaker from the speech synthesis acoustic features of the specific target speaker.

In the embodiment of the invention, the speech recognition acoustic features may include, but are not limited to, any one or more of MFCC and PLP features; the speech synthesis acoustic features include at least a Mel spectrogram.

The hidden-layer feature acquisition module 503 may specifically feed the speech recognition acoustic features into a speech recognition model to obtain the hidden-layer features. The speech recognition model may be a neural network model, such as an LSTM (Long Short-Term Memory) network or an LC-CLDNN (latency-controlled CLDNN), a model built from convolutional, recurrent, and fully connected structures.

In practice, the output of one or more hidden layers of the speech recognition model can be taken as the speech recognition hidden-layer features.

The encoding feature acquisition module 504 may specifically feed the speech synthesis acoustic features into a pre-constructed reference encoder model and obtain the speech synthesis encoding features from its output. The structure of the reference encoder has been described in detail above and is not repeated here.

The feature conversion module 505 inputs the speech recognition hidden-layer features and the speech synthesis encoding features into the pre-constructed timbre conversion model corresponding to a specific target speaker to obtain that speaker's speech synthesis acoustic features, such as a Mel spectrogram. The speech synthesis module 506 can then use a neural network vocoder such as WaveNet or WaveRNN to generate the target speaker's audio signal from these features, realizing the conversion from any source speaker's speech to the target speaker's speech.

In the voice conversion apparatus provided by the embodiment of the invention, a timbre conversion model corresponding to a specific target speaker is pre-constructed; speech recognition acoustic features and speech synthesis acoustic features are extracted from the received source speaker audio data; speech recognition hidden-layer features are obtained from the speech recognition acoustic features, and speech synthesis encoding features from the speech synthesis acoustic features. Using the hidden-layer features and encoding features as intermediaries, the timbre conversion model converts the source speaker's acoustic features into speech synthesis acoustic features of the specific target speaker, from which the target speaker's audio signal is generated. Because multiple acoustic features are modeled jointly, a better conversion result can be obtained; and because the encoding features, obtained by compressing the speech synthesis acoustic features of the whole utterance, are included, the prosody and emotion of the converted speech can be closer to the voice characteristics of the source speaker.

In practical applications, the timbre conversion model may be built by a dedicated timbre conversion model construction module, which may be part of the apparatus of the invention or independent of it; this is not limited here.

The timbre conversion model construction module may train the timbre conversion model directly on a large amount of audio data from the specific target speaker, or first train a generic voice conversion model on the audio data of multiple speakers and then adaptively train it on a small amount of the target speaker's audio data to obtain the timbre conversion model corresponding to that speaker.

The generic voice conversion model may be built by a corresponding generic model construction module, which likewise may be part of the apparatus of the invention or independent of it; this is not limited here.

Note that both the training of the generic voice conversion model and the adaptive training based on it are iterative computation processes; in practice, the generic model construction module and the timbre conversion model construction module may therefore be merged into one functional module or kept as two independent modules, which is not limited here. The two share the same iterative computation process but use different training data.

In one embodiment, the timbre conversion model construction module may include the following units:

a target data collection unit configured to collect a large amount of audio data of the specific target speaker as training data;

a feature extraction unit configured to extract speech recognition acoustic features and speech synthesis acoustic features from the training data;

a hidden-layer feature acquisition unit configured to obtain speech recognition hidden-layer features from the speech recognition acoustic features;

an encoding feature acquisition unit configured to obtain speech synthesis encoding features from the speech synthesis acoustic features;

a parameter training unit configured to train the timbre conversion model corresponding to the specific target speaker with the hidden-layer features, the speech synthesis encoding features, and the speech synthesis acoustic features.

In another embodiment, the generic model construction module may include the following units:

a generic data collection unit configured to collect audio data of multiple speakers as training data;

a feature extraction unit configured to extract speech recognition acoustic features and speech synthesis acoustic features from the training data;

a hidden-layer feature acquisition unit configured to obtain speech recognition hidden-layer features from the speech recognition acoustic features;

an encoding feature acquisition unit configured to obtain speech synthesis encoding features from the speech synthesis acoustic features;

a generic parameter training unit configured to train the generic voice conversion model with the speech recognition hidden-layer features, the speech synthesis encoding features, and the speech synthesis acoustic features.

Correspondingly, the timbre conversion model construction module may include the following units:

a target data collection unit configured to collect audio data of the specific target speaker;

a model training unit configured to adaptively train the generic voice conversion model, pre-constructed from the audio data of multiple speakers, with the audio data of the specific target speaker to obtain the timbre conversion model corresponding to the specific target speaker.

The adaptive training process mainly consists of extracting speech recognition acoustic features and speech synthesis acoustic features from the specific target speaker's audio data, obtaining the speech recognition hidden-layer features and speech synthesis encoding features from them respectively, and then iteratively training the timbre conversion model corresponding to the specific target speaker with the hidden-layer features, the encoding features, and the speech synthesis acoustic features.

With the scheme of this embodiment, the timbre conversion model corresponding to a specific target speaker is obtained by collecting a small amount of that speaker's audio data and adaptively training the generic voice conversion model on it. The parameters of the resulting model are therefore more accurate, and the speech synthesis acoustic features it produces better match the voice characteristics of the target speaker, so the final synthesized audio signal is of higher quality. Moreover, for each new target speaker only a small amount of that speaker's audio needs to be recorded, and no parallel corpus matching the source speaker is required, which greatly simplifies the collection of training data.

FIG. 6 is a block diagram of an apparatus 800 for the voice conversion method according to an exemplary embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, exercise equipment, a personal digital assistant, or the like.

Referring to fig. 6, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.

The processing component 802 generally controls the overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operation, and recording. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operation of the device 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disks.

Power component 806 provides power to the various components of device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.

The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or one of its components, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in its temperature. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, and may also include a light sensor, such as a CMOS or CCD image sensor, for imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the voice conversion method described above, is also provided. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

The present invention also provides a non-transitory computer readable storage medium having instructions which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform all or part of the steps of the above-described method embodiments of the present invention.

FIG. 7 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900, which may vary widely in configuration or performance, may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and storage media 1930 may be transient or persistent storage. The program stored in a storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and execute on the server 1900 the series of instruction operations in the storage medium 1930.

The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
