Real-time voice-changing method and device

Document No.: 1298291 · Published: 2020-08-07

Reading note: this invention, "Real-time voice-changing method and device" (实时变声方法及装置), was created by Liu Kai (刘恺) on 2019-01-30. Abstract: The invention discloses a real-time voice-changing method and device. The method comprises: receiving source speaker audio data; extracting speech-recognition acoustic features from the source speaker audio data and using them to obtain hidden-layer features of speech recognition; inputting the hidden-layer features into a pre-constructed timbre conversion model corresponding to a specific target speaker, to obtain speech-synthesis acoustic features of the specific target speaker; and generating an audio signal of the specific target speaker from those speech-synthesis acoustic features. The invention achieves real-time voice changing with low response latency and a good voice-changing effect.

1. A real-time voice-changing method, the method comprising:

receiving source speaker audio data;

extracting speech-recognition acoustic features from the source speaker audio data, and using the speech-recognition acoustic features to obtain hidden-layer features of speech recognition;

inputting the hidden-layer features into a pre-constructed timbre conversion model corresponding to a specific target speaker, to obtain speech-synthesis acoustic features of the specific target speaker; and

generating an audio signal of the specific target speaker by using the speech-synthesis acoustic features of the specific target speaker.

2. The method of claim 1, further comprising constructing the timbre conversion model corresponding to the specific target speaker by:

collecting audio data of the specific target speaker; and

performing adaptive training, using the audio data of the specific target speaker, on a universal voice-changing model pre-constructed from audio data of a plurality of speakers, to obtain the timbre conversion model corresponding to the specific target speaker.

3. The method of claim 2, wherein constructing the universal voice-changing model based on the audio data of a plurality of speakers comprises:

collecting audio data of a plurality of speakers as training data;

extracting speech-recognition acoustic features and speech-synthesis acoustic features from the training data, and using the speech-recognition acoustic features to obtain hidden-layer features of speech recognition; and

training the universal voice-changing model using the hidden-layer features and the speech-synthesis acoustic features.

4. The method of claim 1, wherein obtaining the hidden-layer features of speech recognition using the speech-recognition acoustic features comprises:

inputting the speech-recognition acoustic features into a speech recognition model to obtain the hidden-layer features.

5. The method of claim 4, wherein the speech recognition model is a neural network model.

6. The method of claim 1, wherein the speech-recognition acoustic features comprise any one or more of: Mel-frequency cepstral coefficients and perceptual linear prediction parameters.

7. The method of claim 1, wherein the speech-synthesis acoustic features comprise any one or more of: voiced/unvoiced features, fundamental-frequency features, spectral features, and aperiodic components.

8. A real-time voice-changing apparatus, comprising:

a receiving module, configured to receive source speaker audio data;

a feature acquisition module, configured to extract speech-recognition acoustic features from the source speaker audio data and to use them to obtain hidden-layer features of speech recognition;

a feature conversion module, configured to input the hidden-layer features into a pre-constructed timbre conversion model corresponding to a specific target speaker, to obtain speech-synthesis acoustic features of the specific target speaker; and

a speech synthesis module, configured to generate an audio signal of the specific target speaker using the speech-synthesis acoustic features of the specific target speaker.

9. An electronic device, comprising: one or more processors and a memory;

wherein the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the method of any one of claims 1 to 7.

10. A readable storage medium having instructions stored thereon which, when executed, implement the method of any one of claims 1 to 7.

Technical Field

The invention relates to the field of speech signal processing, and in particular to a real-time voice-changing method and device.

Background

With the development of speech synthesis technology, making synthesized speech natural, diverse and personalized has become a focus of current speech research, and voice-changing technology is one way to diversify and personalize synthesized speech. Voice changing refers to preserving the semantic content of a speech signal while altering the speaker's voice characteristics, so that one person's voice sounds like another's. From the perspective of speaker conversion, voice changing generally falls into two categories: conversion between non-specific speakers, such as conversion between male and female voices or between different age groups; and conversion between specific speakers, such as converting the voice of speaker A into the voice of speaker B.

A conventional approach to timbre conversion from an arbitrary speaker to a target speaker is based on speech recognition technology: parallel corpora are aligned using DTW (Dynamic Time Warping) or an attention mechanism, and timbre conversion is then performed. In this approach, training the conversion model requires collecting parallel corpora of the source and target speakers, i.e. audio with identical content, and training on the aligned spectral features. At conversion time, the spectral features extracted from the source speaker's audio are transformed by the conversion model, the fundamental-frequency features are linearly stretched, and the aperiodic components are left unchanged. The resulting voice-changing quality is poor, and the approach cannot satisfy application scenarios with real-time requirements.
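For concreteness, the "linear stretching" of the fundamental frequency mentioned above is conventionally a mean-variance linear transform in the log-F0 domain. A minimal numpy sketch (the speaker statistics and frame values below are invented for illustration):

```python
import numpy as np

def convert_f0(f0_src, src_stats, tgt_stats):
    """Linearly map source log-F0 to the target speaker's log-F0
    distribution; unvoiced frames (f0 == 0) are left untouched."""
    src_mean, src_std = src_stats
    tgt_mean, tgt_std = tgt_stats
    f0_out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    f0_out[voiced] = np.exp((log_f0 - src_mean) / src_std * tgt_std + tgt_mean)
    return f0_out

# Example: lower-pitched source (~120 Hz) mapped toward a higher target (~220 Hz)
f0 = np.array([0.0, 118.0, 121.0, 125.0, 0.0])
out = convert_f0(f0, src_stats=(np.log(120.0), 0.1), tgt_stats=(np.log(220.0), 0.1))
```

Note that this only shifts and scales pitch statistics; it cannot change timbre, which is why the conventional approach sounds poor compared with joint modeling of several acoustic features.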

Disclosure of Invention

Embodiments of the invention provide a real-time voice-changing method and device that achieve real-time voice changing with low response latency and a good voice-changing effect.

To this end, the invention provides the following technical solution:

A real-time voice-changing method, the method comprising:

receiving source speaker audio data;

extracting speech-recognition acoustic features from the source speaker audio data, and using the speech-recognition acoustic features to obtain hidden-layer features of speech recognition;

inputting the hidden-layer features into a pre-constructed timbre conversion model corresponding to a specific target speaker, to obtain speech-synthesis acoustic features of the specific target speaker; and

generating an audio signal of the specific target speaker by using the speech-synthesis acoustic features of the specific target speaker.

Optionally, the method further comprises constructing the timbre conversion model corresponding to the specific target speaker by:

collecting audio data of the specific target speaker; and

performing adaptive training, using the audio data of the specific target speaker, on a universal voice-changing model pre-constructed from audio data of a plurality of speakers, to obtain the timbre conversion model corresponding to the specific target speaker.

Optionally, constructing the universal voice-changing model based on the audio data of a plurality of speakers comprises:

collecting audio data of a plurality of speakers as training data;

extracting speech-recognition acoustic features and speech-synthesis acoustic features from the training data, and using the speech-recognition acoustic features to obtain hidden-layer features of speech recognition; and

training the universal voice-changing model using the hidden-layer features and the speech-synthesis acoustic features.

Optionally, obtaining the hidden-layer features of speech recognition using the speech-recognition acoustic features comprises:

inputting the speech-recognition acoustic features into a speech recognition model to obtain the hidden-layer features.

Optionally, the speech recognition model is a neural network model.

Optionally, the speech-recognition acoustic features comprise any one or more of: Mel-frequency cepstral coefficients and perceptual linear prediction parameters.

Optionally, the speech-synthesis acoustic features comprise any one or more of: voiced/unvoiced features, fundamental-frequency features, spectral features, and aperiodic components.

A real-time voice-changing apparatus, the apparatus comprising:

a receiving module, configured to receive source speaker audio data;

a feature acquisition module, configured to extract speech-recognition acoustic features from the source speaker audio data and to use them to obtain hidden-layer features of speech recognition;

a feature conversion module, configured to input the hidden-layer features into a pre-constructed timbre conversion model corresponding to a specific target speaker, to obtain speech-synthesis acoustic features of the specific target speaker; and

a speech synthesis module, configured to generate an audio signal of the specific target speaker using the speech-synthesis acoustic features of the specific target speaker.

Optionally, the apparatus further comprises a timbre conversion model building module, configured to build the timbre conversion model corresponding to a specific target speaker;

the timbre conversion model building module comprises:

a target data collection unit, configured to collect audio data of the specific target speaker; and

a model training unit, configured to perform adaptive training, using the audio data of the specific target speaker, on a universal voice-changing model pre-constructed from audio data of a plurality of speakers, to obtain the timbre conversion model corresponding to the specific target speaker.

Optionally, the apparatus further comprises a universal model building module, configured to build the universal voice-changing model based on audio data of a plurality of speakers;

the universal model building module comprises:

a universal data collection unit, configured to collect audio data of a plurality of speakers as training data;

a feature acquisition unit, configured to extract speech-recognition acoustic features and speech-synthesis acoustic features from the training data, and to use the speech-recognition acoustic features to obtain hidden-layer features of speech recognition; and

a universal parameter training unit, configured to train the universal voice-changing model using the hidden-layer features and the speech-synthesis acoustic features.

Optionally, the feature acquisition module comprises:

an acoustic feature extraction unit, configured to extract speech-recognition acoustic features from the source speaker audio data; and

a hidden-layer feature extraction unit, configured to input the speech-recognition acoustic features into a speech recognition model to obtain the hidden-layer features.

Optionally, the speech recognition model is a neural network model.

Optionally, the speech-recognition acoustic features comprise any one or more of: Mel-frequency cepstral coefficients and perceptual linear prediction parameters.

Optionally, the speech-synthesis acoustic features comprise any one or more of: voiced/unvoiced features, fundamental-frequency features, spectral features, and aperiodic components.

An electronic device, comprising: one or more processors, memory;

the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions to implement the method described above.

A readable storage medium having instructions stored thereon which, when executed, implement the foregoing method.

In the real-time voice-changing method and device provided by embodiments of the invention, a timbre conversion model corresponding to a specific target speaker is constructed in advance; speech-recognition acoustic features are extracted from the received source speaker audio data and used to obtain hidden-layer features of speech recognition; with the hidden-layer features as an intermediary, the timbre conversion model converts the speech-recognition acoustic features of the source speaker into speech-synthesis acoustic features of the specific target speaker; and an audio signal of the specific target speaker is then generated from those speech-synthesis acoustic features. Because multiple acoustic features are modeled jointly, a better voice-changing effect is obtained; and because features can be extracted in a streaming fashion, real-time voice changing with low response latency is achieved, meeting the requirements of real-time applications.

Furthermore, in the scheme of the invention, a universal voice-changing model is first trained on the audio data of a plurality of speakers, and adaptive training is then performed on top of it with a small amount of audio data from the specific target speaker to obtain the timbre conversion model for that speaker. Because the adaptation starts from the universal model, the trained parameters are more accurate, and the resulting speech-synthesis acoustic features better match the target speaker's voice, so the finally synthesized audio sounds better. Moreover, for each new target speaker only a small amount of that speaker's audio needs to be recorded, with no parallel corpora matching the source speaker, which greatly simplifies the collection of training data.

Drawings

To illustrate the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. The drawings show only some embodiments of the invention; those skilled in the art can derive other drawings from them.

FIG. 1 is a flowchart of constructing a universal voice-changing model in a real-time voice-changing method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the topology of a universal voice-changing model in a real-time voice-changing method according to an embodiment of the present invention;

FIG. 3 is a flowchart of a real-time voice-changing method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the model training and testing process in a real-time voice-changing method according to an embodiment of the present invention;

FIG. 5 is a block diagram of a real-time voice-changing apparatus according to an embodiment of the present invention;

FIG. 6 is a block diagram of an apparatus for a real-time voice-changing method according to an exemplary embodiment;

FIG. 7 is a schematic structural diagram of a server in an embodiment of the present invention.

Detailed Description

To help those skilled in the art better understand the solutions of the embodiments of the invention, the embodiments are described in further detail below with reference to the drawings.

Embodiments of the invention provide a real-time voice-changing method and device: a timbre conversion model corresponding to a specific target speaker is constructed in advance; speech-recognition acoustic features are extracted from the received source speaker audio data and used to obtain hidden-layer features of speech recognition; with the hidden-layer features as an intermediary, the timbre conversion model converts the speech-recognition acoustic features of the source speaker into speech-synthesis acoustic features of the specific target speaker; and those speech-synthesis acoustic features are then used to generate an audio signal of the specific target speaker.

In practice, the timbre conversion model can be obtained by collecting and training on a large amount of audio data from the specific target speaker. Alternatively, a universal voice-changing model is first trained on the audio data of multiple speakers, and adaptive training is then performed on top of it with a small amount of the specific target speaker's audio data to obtain the timbre conversion model for that speaker.

Fig. 1 is a flowchart of constructing a universal voice-changing model in a real-time voice-changing method according to an embodiment of the present invention; the construction comprises the following steps:

step 101, collecting audio data of a plurality of speakers as training data.

The universal voice-changing model is not specific to a particular target speaker and can therefore be trained on the audio data of multiple speakers.

Step 102: extracting speech-recognition acoustic features and speech-synthesis acoustic features from the training data, and using the speech-recognition acoustic features to obtain hidden-layer features of speech recognition.

The speech-recognition acoustic features may include, but are not limited to, any one or more of MFCC (Mel-Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction) parameters. MFCC are cepstral parameters extracted on the mel frequency scale, which describes the nonlinear frequency perception of the human ear. PLP parameters are auditory-model-based features, given as the coefficients of an all-pole prediction polynomial, and are equivalent to LPC (Linear Predictive Coding) features.
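As an illustration of the MFCC pipeline (frame, power spectrum, mel filterbank, log, DCT), the following is a compact textbook-style extraction in numpy; the frame sizes and filter counts are common choices assumed here, not values specified by the patent:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC: frame -> power spectrum -> mel filterbank -> log -> DCT."""
    # Frame the signal with a Hamming window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i*hop : i*hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular mel filterbank (mel scale models the ear's nonlinearity)
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)

    # DCT-II decorrelates log-mel energies into cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
feats = mfcc(sig)  # one 13-dim vector per 10 ms frame
```

A production system would use a tested library implementation; this sketch only makes the stages of the transform concrete.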

The speech-synthesis acoustic features may include, but are not limited to, any one or more of voiced/unvoiced flags (UV), log fundamental frequency (LF0), spectral features (MCEP), and aperiodic components (AP).

A hidden-layer feature is the output of a hidden layer of a speech recognition model. In the embodiment of the present invention, the speech recognition model may be a neural network model with one or more hidden layers; the speech-recognition acoustic features are input into the model, and the hidden-layer outputs are obtained. In practice, the output of one or more hidden layers may serve as the hidden-layer features of speech recognition.
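The idea of tapping a hidden layer instead of the output layer can be sketched with a tiny stand-in network (numpy, random weights; the dimensions are invented for illustration, not the patent's):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class TinyASRNet:
    """Stand-in for an ASR acoustic model: two hidden layers plus an output
    projection. The voice-changing pipeline discards the output layer and
    keeps a hidden-layer activation as the speaker-light bottleneck feature."""
    def __init__(self, n_in=13, n_hid=64, n_out=40):
        self.W1 = rng.standard_normal((n_in, n_hid)) * 0.1
        self.W2 = rng.standard_normal((n_hid, n_hid)) * 0.1
        self.W3 = rng.standard_normal((n_hid, n_out)) * 0.1

    def forward(self, x, return_hidden=True):
        h1 = relu(x @ self.W1)
        h2 = relu(h1 @ self.W2)   # <- hidden feature used downstream
        if return_hidden:
            return h2             # one layer here; several could be concatenated
        return h2 @ self.W3       # ASR logits, unused by the voice changer

net = TinyASRNet()
frames = rng.standard_normal((97, 13))  # e.g. 97 frames of 13-dim MFCC
hidden = net.forward(frames)
```

In a real system the network is a trained recognizer, so its hidden activations carry the semantic and prosodic content while attenuating speaker timbre.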

Step 103: training the universal voice-changing model using the hidden-layer features and the speech-synthesis acoustic features.

The universal voice-changing model and the timbre conversion model corresponding to the specific target speaker may adopt neural network models such as CNN-LSTM (a convolutional neural network combined with a long short-term memory network).

Fig. 2 is a schematic diagram of the topology of a universal voice-changing model in a real-time voice-changing method according to an embodiment of the present invention.

The inputs of the universal voice-changing model are speech-recognition hidden-layer feature A and speech-recognition hidden-layer feature B, and the output is the speech-synthesis acoustic features of the target audio. Hidden-layer feature A passes through several neural network layers, such as convolutional, pooling and residual layers, to produce hidden layer 1 and hidden layer 2; hidden-layer feature B passes through a multi-layer DNN to produce hidden layer 3; and hidden layers 1, 2 and 3 are combined as the input of an LSTM model.
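The merge-then-LSTM topology can be sketched at the shape level. In the numpy sketch below all weights are random and all dimensions are invented; the convolution/pooling/residual layers are replaced by plain dense projections for brevity, so this only demonstrates how the two hidden-feature streams are combined and fed to an LSTM that predicts synthesis features:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_layer(x_seq, dim):
    """Single-layer LSTM run over a sequence (naive numpy, random weights)."""
    d_in = x_seq.shape[1]
    W = rng.standard_normal((d_in + dim, 4 * dim)) * 0.1
    h, c = np.zeros(dim), np.zeros(dim)
    out = []
    for x in x_seq:
        z = np.concatenate([x, h]) @ W
        i, f, o, g = np.split(z, 4)          # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out.append(h)
    return np.stack(out)

T = 50
feat_a = rng.standard_normal((T, 64))  # hidden-feature stream A
feat_b = rng.standard_normal((T, 32))  # hidden-feature stream B

# Branch A: two projections standing in for the conv/pool/residual layers
h1 = np.tanh(feat_a @ (rng.standard_normal((64, 48)) * 0.1))
h2 = np.tanh(h1 @ (rng.standard_normal((48, 48)) * 0.1))
# Branch B: a small DNN
h3 = np.tanh(feat_b @ (rng.standard_normal((32, 24)) * 0.1))

# Merge hidden layers 1-3 and run the LSTM, then project to the
# synthesis features (output width 66 is an arbitrary placeholder)
merged = np.concatenate([h1, h2, h3], axis=1)  # (T, 120)
lstm_out = lstm_layer(merged, dim=96)
synth_feats = lstm_out @ (rng.standard_normal((96, 66)) * 0.1)
```

The design point is that the LSTM sees a frame-aligned concatenation of all branch outputs, so temporal modeling happens after the streams are fused.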

On the basis of the universal voice-changing model, the timbre conversion model for a specific target speaker can be obtained by collecting a small amount of that speaker's audio data and using it for adaptive training of the universal model.

The adaptive training process is similar to that of the generic acoustic variation model, except for the training data.
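The benefit of starting adaptation from pretrained weights can be illustrated with a deliberately simplified linear model in numpy (all data below is synthetic; a real system would adapt the neural voice-changing model the same way, by continuing gradient training from the universal weights on the target speaker's data):

```python
import numpy as np

rng = np.random.default_rng(2)

def train(X, Y, W_init, lr=0.05, steps=200):
    """Plain gradient descent on mean-squared error; 'adaptation' just means
    starting from pretrained weights instead of a fresh initialisation."""
    W = W_init.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ W - Y) / len(X)
        W -= lr * grad
    return W

# Stage 1: "universal" model trained on plentiful multi-speaker data
X_multi = rng.standard_normal((500, 8))
W_true_multi = rng.standard_normal((8, 3))
Y_multi = X_multi @ W_true_multi + 0.01 * rng.standard_normal((500, 3))
W_universal = train(X_multi, Y_multi, W_init=np.zeros((8, 3)))

# Stage 2: adapt to a specific target speaker with only a little data,
# whose mapping is close to (but not identical to) the universal one
X_tgt = rng.standard_normal((20, 8))
W_true_tgt = W_true_multi + 0.1 * rng.standard_normal((8, 3))
Y_tgt = X_tgt @ W_true_tgt
W_adapted = train(X_tgt, Y_tgt, W_init=W_universal, steps=50)

err_adapted = np.mean((X_tgt @ W_adapted - Y_tgt) ** 2)
err_scratch = np.mean((X_tgt @ train(X_tgt, Y_tgt, np.zeros((8, 3)), steps=50) - Y_tgt) ** 2)
```

With the same few samples and the same training budget, the adapted model ends up far closer to the target mapping than training from scratch, which mirrors why the patent adapts a universal model rather than training per-speaker models from nothing.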

Because the hidden-layer features output by the speech-recognition acoustic model contain little of the source speaker's timbre while retaining the semantic information and part of the prosodic information, learning the mapping from hidden-layer features to the target speaker's synthesis acoustic features with the voice-changing model realizes timbre conversion from the source speaker to the target speaker.

The real-time voice-changing method provided by the embodiment of the invention uses the timbre conversion model to convert the source speaker's speech-recognition acoustic features into the specific target speaker's speech-synthesis acoustic features, and then generates the target speaker's audio signal from those features, thereby realizing real-time conversion from source speaker audio to target speaker audio.

Fig. 3 is a flowchart of a real-time voice-changing method according to an embodiment of the present invention; the method comprises the following steps:

step 301, source speaker audio data is received.

The source speaker audio data may be real-time online streaming audio data or offline audio data, which is not limited in the embodiments of the present invention.

Step 302, extracting voice recognition acoustic features from the source speaker audio data, and obtaining hidden layer features of voice recognition by using the voice recognition acoustic features.

As in the model training phase, the speech-recognition acoustic features may include, but are not limited to, any one or more of MFCC, PLP, and the like.

The hidden layer features may be obtained by inputting the speech recognition acoustic features into a speech recognition model, and specifically, one or more hidden layers in the speech recognition model may be output as the hidden layer features of the speech recognition.

And 303, inputting the hidden layer characteristics into a pre-constructed tone conversion model corresponding to the specific target speaker to obtain the voice synthesis acoustic characteristics of the specific target speaker.

The input of the timbre conversion model is the hidden-layer features obtained in step 302, and the output is the speech-synthesis acoustic features, which may include, but are not limited to, any one or more of voiced/unvoiced features (UV), fundamental-frequency features (LF0), spectral features (MCEP), aperiodic components (AP), and the like.

Using the timbre conversion model, the source speaker's speech-recognition acoustic features can be converted into speech-synthesis acoustic features carrying the voice characteristics of the specific target speaker.

Step 304, generating the audio signal of the specific target speaker by using the voice synthesis acoustic feature of the specific target speaker.

Specifically, a signal-processing-based vocoder such as WORLD or STRAIGHT can synthesize the speech-synthesis acoustic features into a speech signal, realizing the conversion from any source speaker's voice to the target speaker's voice.
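The vocoder step above can be illustrated with a toy excitation generator in numpy. This is only a stand-in: it drives voiced frames with a phase-continuous sinusoid at F0 and unvoiced frames with noise, showing the frame-rate-to-sample-rate step; a real vocoder would also apply the spectral envelope and aperiodicity:

```python
import numpy as np

def synthesize(f0, uv, sr=16000, hop=80):
    """Toy 'vocoder': voiced frames -> sinusoid at F0, unvoiced -> noise.
    Demonstrates turning frame-level features into a waveform; real vocoders
    additionally shape the excitation with the spectral envelope."""
    rng = np.random.default_rng(0)
    out = np.zeros(len(f0) * hop)
    phase = 0.0
    for t, (hz, voiced) in enumerate(zip(f0, uv)):
        seg = slice(t * hop, (t + 1) * hop)
        if voiced:
            phase_inc = 2 * np.pi * hz / sr
            phases = phase + phase_inc * np.arange(hop)
            out[seg] = 0.5 * np.sin(phases)
            phase = (phases[-1] + phase_inc) % (2 * np.pi)  # keep phase continuous
        else:
            out[seg] = 0.05 * rng.standard_normal(hop)
    return out

f0 = np.array([220.0] * 40 + [0.0] * 10)  # 40 voiced frames, then 10 unvoiced
uv = f0 > 0
wav = synthesize(f0, uv)  # 50 frames x 80 samples = 4000 samples
```

Carrying the phase across frame boundaries is the key detail; without it, per-frame synthesis produces audible clicks.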

For a better understanding of the solution of the invention, fig. 4 shows a schematic diagram of the model training and testing process in the real-time voice-changing method according to the embodiment of the present invention.

In the real-time voice-changing method provided by the embodiment of the invention, a timbre conversion model corresponding to a specific target speaker is constructed in advance; speech-recognition acoustic features are extracted from the received source speaker audio data and used to obtain hidden-layer features of speech recognition; the hidden-layer features are fed as an intermediary into the timbre conversion model, which converts the source speaker's speech-recognition acoustic features into the specific target speaker's speech-synthesis acoustic features; and those features are then used to generate the target speaker's audio signal. Because multiple acoustic features are modeled jointly, a better voice-changing effect is obtained; and because features can be extracted in a streaming fashion, real-time voice changing with low response latency is achieved, meeting the requirements of real-time applications. The method keeps the source speaker's content and prosody while removing speaker-specific information, i.e. it changes the source speaker's voice into the target speaker's voice in real time.

In addition, in the scheme of the invention, a universal voice-changing model is first trained on the audio data of a plurality of speakers, and adaptive training is then performed on top of it with a small amount of audio data from the specific target speaker to obtain the timbre conversion model for that speaker. Because the adaptation starts from the universal model, the trained parameters are more accurate, and the resulting speech-synthesis acoustic features better match the target speaker's voice, so the finally synthesized audio sounds better. Moreover, for each new target speaker only a small amount of that speaker's audio needs to be recorded, with no parallel corpora matching the source speaker, which greatly simplifies the collection of training data.

Correspondingly, an embodiment of the present invention further provides a real-time voice-changing apparatus; fig. 5 is a structural block diagram of the apparatus.

In this embodiment, the apparatus includes the following modules:

a receiving module 501, configured to receive source speaker audio data;

a feature acquisition module 502, configured to extract speech-recognition acoustic features from the source speaker audio data, and to use the speech-recognition acoustic features to obtain hidden-layer features of speech recognition;

a feature conversion module 503, configured to input the hidden-layer features into a pre-constructed timbre conversion model corresponding to a specific target speaker, to obtain speech-synthesis acoustic features of the specific target speaker;

a speech synthesis module 504, configured to generate an audio signal of the specific target speaker using the speech-synthesis acoustic features of the specific target speaker.

It should be noted that the real-time voice-changing apparatus provided by the embodiment of the invention can be applied to both real-time online and offline voice-changing scenarios; that is, the audio data received by the receiving module 501 may be streaming audio input by the source speaker in real time, or non-real-time audio of the source speaker, for example obtained from an audio file of the source speaker.
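The streaming case hinges on frame-by-frame buffering: audio arrives in small packets, and each analysis frame is emitted (and can be converted) as soon as its samples are available, so the end-to-end delay is on the order of one frame rather than a whole utterance. A minimal sketch (packet, frame and hop sizes are illustrative):

```python
import numpy as np

def stream_frames(chunks, frame_len=400, hop=160):
    """Buffer an incoming audio stream and emit analysis frames as soon as
    enough samples have arrived; the buffer slides forward by the hop size."""
    buf = np.empty(0)
    for chunk in chunks:
        buf = np.concatenate([buf, chunk])
        while len(buf) >= frame_len:
            yield buf[:frame_len].copy()
            buf = buf[hop:]  # slide by the hop size, keeping the overlap

# Simulate a microphone delivering 100-sample packets
audio = np.arange(2000, dtype=float)
packets = [audio[i:i + 100] for i in range(0, 2000, 100)]
frames = list(stream_frames(packets))
```

The frames produced this way are identical to offline framing of the full signal, which is what lets the same feature extraction and conversion models serve both the online and offline paths.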

The feature acquisition module 502 may specifically include an acoustic feature extraction unit and a hidden-layer feature extraction unit, wherein:

the acoustic feature extraction unit is configured to extract speech-recognition acoustic features from the source speaker audio data, where the features may include, but are not limited to, any one or more of MFCC, PLP, and the like;

the hidden-layer feature extraction unit is configured to input the speech-recognition acoustic features into a speech recognition model to obtain the hidden-layer features.

The speech recognition model may be a neural network model, such as an LSTM (Long Short-Term Memory network) or LC-CLDNN (Latency-Controlled CLDNN), where CLDNN is a neural network built from convolutional, recurrent, and fully connected structures together.

In practice, the output of one or more hidden layers of the speech recognition model can serve as the hidden-layer features of speech recognition.

In the embodiment of the present invention, the speech-synthesis acoustic features may include, but are not limited to, any one or more of voiced/unvoiced features (UV), fundamental-frequency features (LF0), spectral features (MCEP), aperiodic components (AP), and the like.

In the real-time voice-changing apparatus provided by the embodiment of the invention, a timbre conversion model corresponding to a specific target speaker is constructed in advance; speech-recognition acoustic features are extracted from the received source speaker audio data and used to obtain hidden-layer features of speech recognition; with the hidden-layer features as an intermediary, the timbre conversion model converts the source speaker's speech-recognition acoustic features into the specific target speaker's speech-synthesis acoustic features; and an audio signal of the target speaker is then generated from those features. Because multiple acoustic features are modeled jointly, a better voice-changing effect is obtained; and because features can be extracted in a streaming fashion, real-time voice changing with low response latency is achieved, meeting the requirements of real-time applications.

In practical applications, the tone conversion model may be constructed by a corresponding tone conversion model construction module, and the tone conversion model construction module may be a part of the apparatus of the present invention, or may be independent of the apparatus of the present invention, which is not limited thereto.

Specifically, the tone conversion model building module may obtain the tone conversion model by collecting a large amount of audio data of the specific target speaker for training. Alternatively, it may first train a universal sound variation model using audio data of multiple speakers, and then, on the basis of that universal model, perform adaptive training with a small amount of audio data of the specific target speaker to obtain the tone conversion model corresponding to the specific target speaker.

The universal sound variation model can be constructed by a corresponding universal model building module; similarly, the universal model building module can be a part of the device of the present invention or can be independent of it, which is not limited herein.

It should be noted that both the training of the universal sound variation model and the adaptive training based on it are iterative computation processes; the iterative calculation is the same in both cases, and only the training data differ. Therefore, in practical applications, the universal model building module and the tone conversion model building module may be combined into one functional module or kept as two independent functional modules, which is not limited herein.

In a specific embodiment, the tone conversion model building module may include the following units:

a target data collection unit, configured to collect a large amount of audio data of the specific target speaker as training data;

the feature acquisition unit is used for extracting voice recognition acoustic features and voice synthesis acoustic features from the training data and obtaining hidden layer features of voice recognition by utilizing the voice recognition acoustic features;

and the parameter training unit is used for training to obtain a tone conversion model corresponding to the specific target speaker by utilizing the hidden layer characteristics and the voice synthesis acoustic characteristics.
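A minimal sketch of the parameter training unit follows, with a linear model standing in for the neural tone conversion model and synthetic arrays standing in for the target speaker's extracted features (all dimensions are assumptions; 5 output dims stand in for UV/LF0/MCEP/AP):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins for the target speaker's training data:
# H = hidden layer features, Y = speech synthesis acoustic features.
H = rng.standard_normal((500, 64))
true_W = rng.standard_normal((64, 5)) * 0.5
Y = H @ true_W + 0.01 * rng.standard_normal((500, 5))

# Iterative training of a linear conversion model by gradient descent on
# mean squared error; the invention's model is a neural network, but the
# parameter training unit's iterative computation has this shape.
W = np.zeros((64, 5))
lr = 0.05
for _ in range(300):
    grad = H.T @ (H @ W - Y) / len(H)   # MSE gradient over the batch
    W -= lr * grad

mse = float(np.mean((H @ W - Y) ** 2))
print(mse)  # converges to roughly the noise floor
```

The same loop, fed with multi-speaker data instead of single-speaker data, is exactly the universal parameter training unit described below.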

In another embodiment, the universal model building module may include the following units:

the universal data collection unit is used for collecting audio data of a plurality of speakers as training data;

the feature acquisition unit is used for extracting voice recognition acoustic features and voice synthesis acoustic features from the training data and obtaining hidden layer features of voice recognition by utilizing the voice recognition acoustic features;

and the universal parameter training unit is used for training to obtain the multi-person sound-changing model by utilizing the hidden layer characteristics and the voice synthesis acoustic characteristics.

Accordingly, the tone conversion model building module may include the following units:

a target data collection unit for collecting audio data of a specific target speaker;

and the model training unit is used for performing adaptive training, with the audio data of the specific target speaker, on a universal sound variation model constructed in advance from the audio data of multiple speakers, so as to obtain the tone conversion model corresponding to the specific target speaker. The adaptive training process mainly comprises: extracting speech recognition acoustic features and speech synthesis acoustic features from the audio data of the specific target speaker; obtaining hidden layer features of speech recognition from the speech recognition acoustic features; and training, by iterative computation using the hidden layer features and the speech synthesis acoustic features, the tone conversion model corresponding to the specific target speaker.
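A minimal sketch of this adaptive training, with linear regression standing in for both the universal sound variation model and the tone conversion model (all data, dimensions, and the `train` helper are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def train(H, Y, W0, lr=0.05, steps=200):
    """Gradient-descent regression from hidden features H to synthesis features Y."""
    W = W0.copy()
    for _ in range(steps):
        W -= lr * H.T @ (H @ W - Y) / len(H)
    return W

# Universal model: trained on plentiful pooled multi-speaker data.
H_multi = rng.standard_normal((2000, 64))
W_multi_true = rng.standard_normal((64, 5)) * 0.5
Y_multi = H_multi @ W_multi_true
W_universal = train(H_multi, Y_multi, np.zeros((64, 5)))

# Target speaker differs only slightly from the universal mapping, and
# only a small recording (50 frames here) is available for adaptation.
W_target_true = W_multi_true + 0.05 * rng.standard_normal((64, 5))
H_small = rng.standard_normal((50, 64))
Y_small = H_small @ W_target_true

W_scratch = train(H_small, Y_small, np.zeros((64, 5)))   # no universal model
W_adapted = train(H_small, Y_small, W_universal)         # adaptive training

# Held-out frames from the target speaker show why adaptation helps.
H_test = rng.standard_normal((200, 64))
Y_test = H_test @ W_target_true

def err(W):
    return float(np.mean((H_test @ W - Y_test) ** 2))

print(err(W_scratch), err(W_adapted))
```

With only 50 frames the from-scratch model is underdetermined and generalizes poorly, while the model initialized from the universal weights stays close to the correct mapping, illustrating why a small amount of target-speaker audio suffices.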

With the scheme of this embodiment, the tone conversion model corresponding to the specific target speaker can be obtained by collecting only a small amount of the target speaker's audio data and performing adaptive training on the basis of the universal sound variation model. The parameters of the resulting tone conversion model are therefore more accurate, and the speech synthesis acoustic features it produces better match the voice characteristics of the specific target speaker, so the finally synthesized audio signal sounds better. Moreover, for each new target speaker, only a small amount of that speaker's audio needs to be recorded; no parallel corpus matching the source speaker is required, which greatly simplifies the collection of training data.

Fig. 6 is a block diagram illustrating an apparatus 800 for a real-time voicing method according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 6, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.

The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disks, or optical disks.

Power component 806 provides power to the various components of device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.

The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800; it may also detect a change in position of the apparatus 800 or of a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the methods described above.

In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the real-time voice changing method described above, is also provided. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

The present invention also provides a non-transitory computer readable storage medium having instructions which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform all or part of the steps of the above-described method embodiments of the present invention.

Fig. 7 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944. The memory 1932 and the storage media 1930 may provide transient or persistent storage. The program stored in a storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and execute, on the server 1900, the series of instruction operations stored in the storage medium 1930.

The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and so forth.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
