Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing

Document No.: 513255    Publication date: 2021-05-28

Reading note: this technology, "Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing", was created by 黄智颖 and 雷鸣 on 2019-11-27. The invention discloses a method for constructing a personalized speech synthesis model, a speech synthesis method, a testing method, and corresponding devices. The construction method comprises: determining training data similar to a user from the training set data of the multiple speakers of a multi-speaker speech synthesis model; selecting, from those speakers and excluding the speaker to whom the similar training data belongs, a speaker of the same category as the user; and training the multi-speaker speech synthesis model according to the training data similar to the user and the selected same-category speaker, to obtain the user's personalized speech synthesis model. The invention can synthesize speech in the user's specific speaking style, improving the user experience.

1. A method for constructing a personalized speech synthesis model is characterized by comprising the following steps:

determining training data similar to a user from training set data of a plurality of speakers of a multi-speaker speech synthesis model;

selecting, from the plurality of speakers other than the speaker to whom the similar training data belongs, a same-category speaker belonging to the same category as the user;

and training the multi-speaker speech synthesis model according to the training data similar to the user and the selected same-category speaker, to obtain the personalized speech synthesis model of the user.

2. The method of claim 1, wherein, prior to the determining of training data similar to the user from the training set data of the plurality of speakers of the multi-speaker speech synthesis model, the method further comprises:

processing the data of the user, and extracting corresponding linguistic features and acoustic features;

wherein the training of the multi-speaker speech synthesis model according to the training data similar to the user and the selected same-category speaker to obtain the personalized speech synthesis model of the user comprises:

inputting the ID of the same-category speaker in the multi-speaker speech synthesis model and the corresponding speaker characterization into the multi-speaker speech synthesis model, taking the linguistic features and acoustic features of the user together with the similar training data as training data, and training the multi-speaker speech synthesis model to obtain the personalized speech synthesis model of the user.

3. The method of claim 2, wherein determining training data similar to the user from the training set data of the plurality of speakers of the multi-speaker speech synthesis model comprises:

determining, from the training set data of the plurality of speakers of the multi-speaker speech synthesis model, training data of a preset number of adjacent speakers similar to the user; and/or determining training data corresponding to a preset number of adjacent sentences similar to the user;

the training data includes speech data and corresponding text, as well as linguistic features of the text and acoustic features of the speech data.

4. The method of claim 1 or 2, wherein the preset number of adjacent speakers similar to the user is determined by:

calculating corresponding vectors for the user and each speaker of the multiple speakers respectively;

determining the distance between the vector of each of the plurality of speakers and the vector of the user, sorting the distances by magnitude, and taking the preset number of speakers with the smallest distances as the adjacent speakers.

5. The method of claim 3, wherein the preset number of adjacent sentences similar to the user is determined by:

calculating corresponding vectors for the user and for each sentence of each of the plurality of speakers;

determining the distance between the vector of each sentence of each of the plurality of speakers and the vector of the user, sorting the distances by magnitude, and taking the preset number of sentences with the smallest distances as the adjacent sentences.

6. A method according to claim 2 or 3, wherein the user's data comprises: voice data and corresponding text;

wherein the processing of the data of the user and the extracting of the corresponding linguistic features and acoustic features comprise:

automatically labeling the text of the user through speech synthesis to determine labeling information, wherein the labeling information comprises: pronunciation labels and prosody labels; determining phoneme boundaries from the speech data of the user through speech recognition and voice activity detection; and extracting corresponding linguistic features according to the pronunciation labels, the prosody labels and the phoneme boundaries;

and extracting acoustic features from the speech data of the user.

7. The method of claim 2, 3 or 5, wherein, prior to the extracting of acoustic features from the speech data of the user, the method further comprises:

performing preprocessing operations on the speech data, including energy normalization, dereverberation, and energy enhancement.

8. The method according to any one of claims 1-3 and 5, wherein the same category is a category determined according to any one or a combination of the following attributes of a speaker: gender, age, speaking style, and speaking environment.

9. A method for constructing a personalized speech synthesis model is characterized by comprising the following steps:

selecting, according to a preset scene, a similar speaker similar to the user from at least one social network of the user corresponding to the scene, and acquiring training set data of the similar speaker;

selecting, from a plurality of speakers other than the similar speaker, a same-category speaker belonging to the same category as the user, wherein the plurality of speakers are the speakers corresponding to the training set of a multi-speaker speech synthesis model;

and training the multi-speaker speech synthesis model according to the training data of the similar speaker in each scene and the selected same-category speaker, to obtain the personalized speech synthesis model of the user in that scene.

10. A method for constructing a personalized speech synthesis model is characterized by comprising the following steps:

searching, according to a preset priority of each similar-speaker set and in order of that priority, each similar-speaker set in turn for at least one similar speaker similar to the user;

acquiring training set data of the at least one similar speaker that is found;

selecting, from a plurality of speakers other than the similar speaker, a same-category speaker belonging to the same category as the user, wherein the plurality of speakers are the speakers corresponding to the training set of a multi-speaker speech synthesis model;

and training the multi-speaker speech synthesis model according to the training data of the at least one similar speaker and the selected same-category speaker, to obtain the personalized speech synthesis model of the user.

11. A method for constructing a personalized speech synthesis model is characterized by comprising the following steps:

pushing, according to the priority of each similar-speaker set of the user and in order of that priority, each level of similar-speaker set to the client of the user in turn;

receiving the identification of the similar speaker selected from each level of similar-speaker set, as returned by the client, and acquiring training set data of the similar speaker according to the identification;

selecting, from a plurality of speakers other than the similar speaker, a same-category speaker belonging to the same category as the user, wherein the plurality of speakers are the speakers corresponding to the training set of a multi-speaker speech synthesis model;

and training the multi-speaker speech synthesis model according to the training data of the at least one similar speaker and the selected same-category speaker, to obtain the personalized speech synthesis model of the user.

12. The method of claim 11, wherein each level of similar-speaker set comprises one or more of:

at least one set of users from the user's social network;

at least one set of users belonging to the same geographic area as the user;

at least one set of users selected by the user according to the user's preference.

13. A method for personalized speech synthesis, comprising:

processing a text to be synthesized by voice, and extracting corresponding linguistic features;

inputting the linguistic features and the ID of the same-category speaker corresponding to the user in the training of the personalized speech synthesis model into the personalized speech synthesis model, and predicting the acoustic features corresponding to the text to be synthesized;

synthesizing, according to the acoustic features, the synthesized speech of the user corresponding to the text;

wherein the personalized speech synthesis model is obtained by the method for constructing a personalized speech synthesis model according to any one of claims 1-12.

14. The method of claim 13, wherein synthesizing the synthesized speech of the user corresponding to the text based on the acoustic features comprises:

converting the acoustic features into the corresponding speech by using a vocoder.

15. A method for testing a personalized speech synthesis model is characterized by comprising the following steps:

processing a text to be synthesized by voice, and extracting corresponding linguistic features;

inputting the linguistic features and the ID of the same-category speaker corresponding to the user in the training of the personalized speech synthesis model into the personalized speech synthesis model, and predicting the acoustic features corresponding to the text;

synthesizing, according to the acoustic features, the synthesized speech of the user corresponding to the text;

verifying the synthesized speech and determining whether the personalized speech synthesis model is qualified;

wherein the personalized speech synthesis model is obtained by the method for constructing a personalized speech synthesis model according to any one of claims 1-12.

16. Use of the method for constructing a personalized speech synthesis model according to any one of claims 1-12, the personalized speech synthesis method according to claim 13 or 14, or the method for testing a personalized speech synthesis model according to claim 15 in audio reading, smart customer service, speech interaction, speech broadcasting, or machine translation.

17. An apparatus for constructing a personalized speech synthesis model, comprising:

a determining module, configured to determine training data similar to a user from the training set data of a plurality of speakers of the multi-speaker speech synthesis model;

a selection module, configured to select, from the plurality of speakers other than the speaker to whom the similar training data belongs, a same-category speaker belonging to the same category as the user;

and a training module, configured to train the multi-speaker speech synthesis model according to the training data similar to the user and the selected same-category speaker, to obtain the personalized speech synthesis model of the user.

18. An apparatus for constructing a personalized speech synthesis model, comprising:

a first selection module, configured to select, according to a preset scene, a similar speaker similar to the user from at least one social network of the user corresponding to the scene;

an acquisition module, configured to acquire training set data of the similar speaker;

a second selection module, configured to select, from a plurality of speakers other than the similar speaker, a same-category speaker belonging to the same category as the user, wherein the plurality of speakers are the speakers corresponding to the training set of a multi-speaker speech synthesis model;

and a training module, configured to train the multi-speaker speech synthesis model according to the training data of the similar speaker in each scene and the selected same-category speaker, to obtain the personalized speech synthesis model of the user in that scene.

19. An apparatus for constructing a personalized speech synthesis model, comprising:

a searching module, configured to search, according to a preset priority of each similar-speaker set and in order of that priority, each similar-speaker set in turn for at least one similar speaker similar to the user;

an acquisition module, configured to acquire training set data of the at least one similar speaker that is found;

a selection module, configured to select, from a plurality of speakers other than the similar speaker, a same-category speaker belonging to the same category as the user, wherein the plurality of speakers are the speakers corresponding to the training set of a multi-speaker speech synthesis model;

and a training module, configured to train the multi-speaker speech synthesis model according to the training data of the at least one similar speaker and the selected same-category speaker, to obtain the personalized speech synthesis model of the user.

20. An apparatus for constructing a personalized speech synthesis model, comprising:

a pushing module, configured to push, according to the priority of each similar-speaker set of the user and in order of that priority, each level of similar-speaker set to the client of the user in turn;

a receiving module, configured to receive the identification of the similar speaker selected from each level of similar-speaker set, as returned by the client;

an acquisition module, configured to acquire training set data of the similar speaker according to the identification;

a selection module, configured to select, from a plurality of speakers other than the similar speaker, a same-category speaker belonging to the same category as the user, wherein the plurality of speakers are the speakers corresponding to the training set of a multi-speaker speech synthesis model;

and a training module, configured to train the multi-speaker speech synthesis model according to the training data of the at least one similar speaker and the selected same-category speaker, to obtain the personalized speech synthesis model of the user.

21. A personalized speech synthesis apparatus, comprising:

an extraction module, configured to process a text to be synthesized and extract the corresponding linguistic features;

a prediction module, configured to input the linguistic features and the ID of the same-category speaker corresponding to the user in the training of the personalized speech synthesis model into the personalized speech synthesis model, and predict the acoustic features corresponding to the text to be synthesized;

and a speech synthesis module, configured to synthesize, according to the acoustic features, the synthesized speech of the user corresponding to the text;

wherein the personalized speech synthesis model is obtained by the device for constructing a personalized speech synthesis model according to any one of claims 17-20.

22. A device for testing a personalized speech synthesis model, comprising:

an extraction module, configured to process a text to be synthesized and extract the corresponding linguistic features;

a prediction module, configured to input the linguistic features and the ID of the same-category speaker corresponding to the user in the training of the personalized speech synthesis model into the personalized speech synthesis model, and predict the acoustic features corresponding to the text to be synthesized;

a speech synthesis module, configured to synthesize, according to the acoustic features, the synthesized speech of the user corresponding to the text;

a verification module, configured to verify the synthesized speech and determine whether the personalized speech synthesis model is qualified;

wherein the personalized speech synthesis model is obtained by the device for constructing a personalized speech synthesis model according to any one of claims 17-20.

23. An intelligent voice server, comprising: a memory and a processor; wherein the memory stores a computer program which, when executed by the processor, is capable of implementing a method of constructing a personalized speech synthesis model according to any of claims 1 to 12, or of implementing a personalized speech synthesis method according to claim 13 or 14, or of implementing a method of testing a personalized speech synthesis model according to claim 15.

24. A computer-readable storage medium, on which computer instructions are stored, which instructions, when executed by a processor, are capable of implementing a method of constructing a personalized speech synthesis model according to any one of claims 1 to 12, or of implementing a method of personalized speech synthesis according to claim 13 or 14, or of implementing a method of testing a personalized speech synthesis model according to claim 15.

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method for constructing a personalized speech synthesis model, a speech synthesis method, a testing method and a testing device.

Background

Voice interaction scenarios in artificial intelligence require personalized speech synthesis. Personalized speech synthesis is a strong business need and one of the future trends in the field of speech synthesis.

In traditional speech synthesis technology, a multi-speaker speech synthesis system can be built from massive data, for example hundreds of hours of training data from hundreds of speakers. Specifically, a multi-speaker speech synthesis model such as a neural-network-based Text-To-Speech (TTS) model can be used. In the training data of such a model, the amount of speech data per speaker typically ranges from several hours to tens of hours, and a speech synthesis system built from the data of a large number of speakers can provide a relatively stable synthesis effect.

For a multi-speaker Neural TTS model, given any speaker in the training set, the model can be used to synthesize that speaker's voice; but for a specific speaker outside the training set (hereinafter referred to as the "target speaker"), the model cannot synthesize speech in that speaker's specific style.

Disclosure of Invention

In view of the above, the present invention is proposed to provide a method for constructing a personalized speech synthesis model, a speech synthesis method and a testing method and apparatus that overcome or at least partially solve the above problems.

In a first aspect, an embodiment of the present invention provides a method for constructing a personalized speech synthesis model, including:

determining training data similar to a user from training set data of a plurality of speakers of a multi-speaker speech synthesis model;

selecting, from the plurality of speakers other than the speaker to whom the similar training data belongs, a same-category speaker belonging to the same category as the user;

and training the multi-speaker speech synthesis model according to the training data similar to the user and the selected same-category speaker, to obtain the personalized speech synthesis model of the user.

In one or more possible embodiments, before determining training data similar to the user from the training set data of the speakers of the multi-speaker speech synthesis model, the method further includes:

processing the data of the user, and extracting corresponding linguistic features and acoustic features;

wherein the training of the multi-speaker speech synthesis model according to the training data similar to the user and the selected same-category speaker to obtain the personalized speech synthesis model of the user comprises:

inputting the ID of the same-category speaker in the multi-speaker speech synthesis model and the corresponding speaker characterization into the multi-speaker speech synthesis model, taking the linguistic features and acoustic features of the user together with the similar training data as training data, and training the multi-speaker speech synthesis model to obtain the personalized speech synthesis model of the user.

In one or more possible embodiments, determining training data similar to the user from the training set data of the multiple speakers of the multi-speaker speech synthesis model comprises:

determining, from the training set data of the multiple speakers of the multi-speaker speech synthesis model, training data of a preset number of adjacent speakers similar to the user; and/or determining training data corresponding to a preset number of adjacent sentences similar to the user;

the training data includes speech data and corresponding text, as well as linguistic features of the text and acoustic features of the speech data.

In one or more possible embodiments, the preset number of adjacent speakers similar to the user is determined by:

calculating corresponding vectors for the user and each speaker of the multiple speakers respectively;

determining the distance between the vector of each of the multiple speakers and the vector of the user, sorting the distances by magnitude, and taking the preset number of speakers with the smallest distances as the adjacent speakers.

In one or more possible embodiments, the preset number of adjacent sentences similar to the user is determined by:

calculating corresponding vectors for the user and for each sentence of each of the multiple speakers;

determining the distance between the vector of each sentence of each of the multiple speakers and the vector of the user, sorting the distances by magnitude, and taking the preset number of sentences with the smallest distances as the adjacent sentences.

In one or more possible embodiments, the data of the user includes: voice data and corresponding text;

the processing of the data of the user and the extraction of the corresponding linguistic features and acoustic features include:

automatically labeling the text of the user through speech synthesis to determine labeling information, wherein the labeling information comprises: pronunciation labels and prosody labels; determining phoneme boundaries from the speech data of the user through speech recognition and voice activity detection; and extracting corresponding linguistic features according to the pronunciation labels, the prosody labels and the phoneme boundaries;

and extracting acoustic features of the voice data of the user.

In one or more possible embodiments, before extracting the acoustic features from the speech data of the user, the method further includes:

performing preprocessing operations on the speech data, including energy normalization, dereverberation, and energy enhancement.

In one or more possible embodiments, the same category is a category determined according to any one or a combination of the following attributes of a speaker: gender, age, speaking style, and speaking environment.

In a second aspect, an embodiment of the present invention provides a method for constructing a personalized speech synthesis model, including:

selecting, according to a preset scene, a similar speaker similar to the user from at least one social network of the user corresponding to the scene, and acquiring training set data of the similar speaker;

selecting, from a plurality of speakers other than the similar speaker, a same-category speaker belonging to the same category as the user, wherein the plurality of speakers are the speakers corresponding to the training set of a multi-speaker speech synthesis model;

and training the multi-speaker speech synthesis model according to the training data of the similar speaker in each scene and the selected same-category speaker, to obtain the personalized speech synthesis model of the user in that scene.

In a third aspect, an embodiment of the present invention provides a method for constructing a personalized speech synthesis model, including:

searching, according to a preset priority of each similar-speaker set and in order of that priority, each similar-speaker set in turn for at least one similar speaker similar to the user;

acquiring training set data of the at least one similar speaker that is found;

selecting, from a plurality of speakers other than the similar speaker, a same-category speaker belonging to the same category as the user, wherein the plurality of speakers are the speakers corresponding to the training set of a multi-speaker speech synthesis model;

and training the multi-speaker speech synthesis model according to the training data of the at least one similar speaker and the selected same-category speaker, to obtain the personalized speech synthesis model of the user.

In a fourth aspect, an embodiment of the present invention provides a method for constructing a personalized speech synthesis model, including:

pushing, according to the priority of each similar-speaker set of the user and in order of that priority, each level of similar-speaker set to the client of the user in turn;

receiving the identification of the similar speaker selected from each level of similar-speaker set, as returned by the client, and acquiring training set data of the similar speaker according to the identification;

selecting, from a plurality of speakers other than the similar speaker, a same-category speaker belonging to the same category as the user, wherein the plurality of speakers are the speakers corresponding to the training set of a multi-speaker speech synthesis model;

and training the multi-speaker speech synthesis model according to the training data of the at least one similar speaker and the selected same-category speaker, to obtain the personalized speech synthesis model of the user.

In one or more possible embodiments, each level of similar-speaker set comprises one or more of:

at least one set of users from the user's social network;

at least one set of users belonging to the same geographic area as the user;

at least one set of users selected by the user according to the user's preference.

In a fifth aspect, an embodiment of the present invention provides a personalized speech synthesis method, including:

processing a text to be synthesized by voice, and extracting corresponding linguistic features;

inputting the linguistic features and the ID of the same-category speaker corresponding to the user in the training of the personalized speech synthesis model into the personalized speech synthesis model, and predicting the acoustic features corresponding to the text to be synthesized;

synthesizing, according to the acoustic features, the synthesized speech of the user corresponding to the text;

wherein the personalized speech synthesis model is obtained by the above method for constructing a personalized speech synthesis model.

In one embodiment, synthesizing the synthesized speech of the user corresponding to the text according to the acoustic features includes:

converting the acoustic features into the corresponding speech by using a vocoder.

In a sixth aspect, an embodiment of the present invention provides a method for testing a personalized speech synthesis model, including:

processing a text to be synthesized by voice, and extracting corresponding linguistic features;

inputting the linguistic features and the ID of the same-category speaker corresponding to the user in the training of the personalized speech synthesis model into the personalized speech synthesis model, and predicting the acoustic features corresponding to the text;

synthesizing, according to the acoustic features, the synthesized speech of the user corresponding to the text;

verifying the synthesized speech and determining whether the personalized speech synthesis model is qualified;

wherein the personalized speech synthesis model is obtained by the above method for constructing a personalized speech synthesis model.

In a seventh aspect, embodiments of the present invention provide an application of the method for constructing the personalized speech synthesis model, the personalized speech synthesis method, and the method for testing the personalized speech synthesis model in audio reading, smart customer service, speech interaction, speech broadcasting, and machine translation.

In an eighth aspect, an embodiment of the present invention provides a device for constructing a personalized speech synthesis model, including:

a determining module, configured to determine training data similar to a user from the training set data of a plurality of speakers of the multi-speaker speech synthesis model;

a selection module, configured to select, from the plurality of speakers other than the speaker to whom the similar training data belongs, a same-category speaker belonging to the same category as the user;

and a training module, configured to train the multi-speaker speech synthesis model according to the training data similar to the user and the selected same-category speaker, to obtain the personalized speech synthesis model of the user.

In a ninth aspect, an embodiment of the present invention provides a device for constructing a personalized speech synthesis model, including:

a first selection module, configured to select, according to a preset scene, a similar speaker similar to the user from at least one social network of the user corresponding to the scene;

an acquisition module, configured to acquire training set data of the similar speaker;

a second selection module, configured to select, from a plurality of speakers other than the similar speaker, a same-category speaker belonging to the same category as the user, wherein the plurality of speakers are the speakers corresponding to the training set of a multi-speaker speech synthesis model;

and a training module, configured to train the multi-speaker speech synthesis model according to the training data of the similar speaker in each scene and the selected same-category speaker, to obtain the personalized speech synthesis model of the user in that scene.

In a tenth aspect, an embodiment of the present invention provides a device for constructing a personalized speech synthesis model, including:

a searching module, configured to search, according to a preset priority of each similar-speaker set and in order of that priority, each similar-speaker set in turn for at least one similar speaker similar to the user;

an acquisition module, configured to acquire training set data of the at least one similar speaker that is found;

a selection module, configured to select, from a plurality of speakers other than the similar speaker, a same-category speaker belonging to the same category as the user, wherein the plurality of speakers are the speakers corresponding to the training set of a multi-speaker speech synthesis model;

and a training module, configured to train the multi-speaker speech synthesis model according to the training data of the at least one similar speaker and the selected same-category speaker, to obtain the personalized speech synthesis model of the user.

In an eleventh aspect, an embodiment of the present invention provides an apparatus for constructing a personalized speech synthesis model, including:

a pushing module, configured to push, according to the priority of each similar-speaker set of the user and in order of that priority, each level of similar-speaker set to the client of the user in turn;

a receiving module, configured to receive the identification of the similar speaker selected from each level of similar-speaker set, as returned by the client;

an acquisition module, configured to acquire training set data of the similar speaker according to the identification;

a selection module, configured to select, from a plurality of speakers other than the similar speaker, a same-category speaker belonging to the same category as the user, wherein the plurality of speakers are the speakers corresponding to the training set of a multi-speaker speech synthesis model;

and a training module, configured to train the multi-speaker speech synthesis model according to the training data of the at least one similar speaker and the selected same-category speaker, to obtain the personalized speech synthesis model of the user.

In a twelfth aspect, an embodiment of the present invention provides a personalized speech synthesis apparatus, including:

an extraction module, configured to process a text to be synthesized and extract the corresponding linguistic features;

a prediction module, configured to input the linguistic features and the ID of the same-category speaker corresponding to the user in the training of the personalized speech synthesis model into the personalized speech synthesis model, and predict the acoustic features corresponding to the text to be synthesized;

and a speech synthesis module, configured to synthesize, according to the acoustic features, the synthesized speech of the user corresponding to the text;

wherein the personalized speech synthesis model is obtained by the above device for constructing a personalized speech synthesis model.

In a thirteenth aspect, an embodiment of the present invention provides a testing apparatus for personalized speech synthesis models, including:

an extraction module, configured to process a text to be synthesized and extract the corresponding linguistic features;

a prediction module, configured to input the linguistic features and the ID of the same-category speaker corresponding to the user in the training of the personalized speech synthesis model into the personalized speech synthesis model, and predict the acoustic features corresponding to the text to be synthesized;

a speech synthesis module, configured to synthesize, according to the acoustic features, the synthesized speech of the user corresponding to the text;

a verification module, configured to verify the synthesized speech and determine whether the personalized speech synthesis model is qualified;

wherein the personalized speech synthesis model is obtained by the above device for constructing a personalized speech synthesis model.

In a fourteenth aspect, an embodiment of the present invention provides an intelligent voice server, including: a memory and a processor; wherein the memory stores a computer program which, when executed by the processor, is capable of implementing the aforementioned method of constructing a personalized speech synthesis model, or of implementing the aforementioned method of personalized speech synthesis, or of implementing a method of testing the aforementioned personalized speech synthesis model.

In a fifteenth aspect, the embodiment of the present invention provides a computer-readable storage medium, on which computer instructions are stored, and when the instructions are executed by a processor, the method for constructing the personalized speech synthesis model, or the method for personalized speech synthesis, or the method for testing the personalized speech synthesis model, can be implemented.

The technical solutions provided by the embodiments of the invention have at least the following beneficial effects:

The method for constructing a personalized speech synthesis model, the personalized speech synthesis method, and the method and device for testing a personalized speech synthesis model provided by the embodiments of the invention determine training data similar to the personalized user (the target speaker) from the training set data of the multiple speakers of a multi-speaker speech synthesis model, select a same-category speaker belonging to the same category as the user from those speakers, and then train the multi-speaker speech synthesis model with the same-category speaker and the similar training data. This yields a personalized speech synthesis model for the user: with only a small amount of the target speaker's data and an existing multi-speaker speech synthesis model, speech in the user's (target speaker's) specific speaking style can be synthesized. The resulting personalized voice gives the machine warmth and improves the user experience.

In one embodiment, training data of a preset number of adjacent speakers similar to the user is determined from the training set data of the multiple speakers of the multi-speaker speech synthesis model, and/or training data corresponding to a preset number of adjacent sentences similar to the user is determined. Using adjacent speakers and/or adjacent sentences close to the user to assist the learning of the target speaker's voice improves the naturalness and similarity of the final synthesized speech.

In one embodiment, the same-category speaker is chosen from the multiple speakers to be as close to the user as possible in gender, age, speaking style and speaking environment, so that the personalized speech synthesis model can better learn the user's voice.

In an embodiment, the speech synthesis model construction method, the speech synthesis method, and the speech synthesis device provided in the embodiments of the present invention perform energy normalization, dereverberation, and energy enhancement preprocessing on the user's speech data before extracting linguistic and acoustic features, so that the speech synthesis model is more robust to environmental noise, reverberation, and volume variation.

In an embodiment, the method for constructing a speech synthesis model provided in the embodiments of the present invention may further select, according to the usage scene, a speaker similar to the user from at least one of the user's social networks, and train the multi-speaker speech synthesis model with the training set data of that similar speaker and a same-category speaker to obtain the user's final personalized speech synthesis model.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flowchart of a method for constructing a personalized speech synthesis model according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a training process of a multi-speaker speech synthesis model according to an embodiment of the present invention;

FIG. 3 is a flowchart of a process for extracting linguistic features provided by an embodiment of the invention;

FIG. 4 is an exemplary diagram of the distance relationship between the target speaker and adjacent speakers according to an embodiment of the present invention;

FIG. 5 is an exemplary diagram of the distance relationship between the target speaker and adjacent sentences according to an embodiment of the present invention;

FIGS. 6A-6D are flow charts of a first embodiment of the present invention;

FIGS. 7A and 7B are flow charts of a second embodiment of the present invention;

FIG. 8 is a flowchart of a personalized speech synthesis method according to an embodiment of the present invention;

FIG. 9 is another flowchart of a personalized speech synthesis method according to an embodiment of the present invention;

FIG. 10 is a flowchart of a method for testing a personalized speech synthesis model according to an embodiment of the present invention;

FIGS. 11 to 13 are flowcharts of further methods for constructing a personalized speech synthesis model according to embodiments of the present invention;

FIGS. 14 to 17 are structural block diagrams of several devices for constructing a personalized speech synthesis model according to embodiments of the present invention;

FIG. 18 is a structural block diagram of a personalized speech synthesis device according to an embodiment of the present invention;

FIG. 19 is a structural block diagram of a testing device for a personalized speech synthesis model according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

In order to realize automatic synthesis of a personalized speaker's voice, the embodiments of the invention provide a method for constructing a personalized speech synthesis model, a personalized speech synthesis method, a method for testing a personalized speech synthesis model, and related devices. For convenience of description, the user for whom the speech synthesis model is to be generated, i.e., this personalized speaker, is referred to as the "target speaker"; the target speaker is not one of the speakers in the multi-speaker speech synthesis model.

Specific implementations of the above embodiments are described in detail below, in order, with reference to the drawings.

Referring to FIG. 1, the method for constructing a personalized speech synthesis model provided by the embodiment of the present invention includes the following steps:

S11, determining training data similar to the user from the training set data of the multiple speakers of the multi-speaker speech synthesis model;

S12, selecting a same-category speaker belonging to the same category as the user from the speakers other than the speaker to whom the similar training data belongs;

and S13, training the multi-speaker speech synthesis model according to the training data similar to the user and the selected same-category speaker, to obtain the personalized speech synthesis model of the user.

In one or more possible embodiments, before step S11, the following steps may also be performed:

processing data of a user (namely a target speaker), and extracting corresponding linguistic features and acoustic features;

accordingly, in step S13, training the personalized speech synthesis model of the user (the target speaker) can be achieved by:

and inputting the ID of the same-category speaker in the multi-speaker speech synthesis model and the corresponding speaker characterization into the multi-speaker speech synthesis model, and training the model with the user's linguistic and acoustic features together with the similar training data as training data, to obtain the personalized speech synthesis model of the user.

The method for constructing a personalized speech synthesis model provided by the embodiments of the invention can synthesize speech in the specific speaking style of the user (the target speaker) using only a small amount of the user's data together with an existing multi-speaker speech synthesis model, yielding a personalized voice that gives the machine warmth and improves the user experience.

In the method for constructing a speech synthesis model and the speech synthesis method described later, whether for the multi-speaker speech synthesis model or the user's (target speaker's) personalized speech synthesis model, the model may be any neural-network-based speech synthesis model (e.g., a Neural TTS model) or another similar speech synthesis model, such as an end-to-end (End-to-End) speech synthesis model; the embodiments of the present invention place no limit on this.

First, the multi-speaker speech synthesis model is briefly explained. In the embodiments of the invention, any existing multi-speaker speech synthesis model may be adopted. In its construction, the model is trained with the training set data of a plurality of speakers; the plurality of speakers are a predetermined set of speakers, e.g., Zhang San, Li Si and Wang Wu, and each speaker has a corresponding ID (serial number) in the model.

The training set data of the multiple speakers comprises the training data of each speaker, and each speaker's training data may in turn include the speaker's speech data and corresponding text, as well as the linguistic features and acoustic features extracted from them.

Different personalized speech synthesis models can target different individuals, and the multi-speaker speech synthesis model is the basis of the personalized speech synthesis model. To ensure the learning accuracy of the model, the linguistic features and acoustic features can be extracted by a variety of means, for example by manual labeling or by computer-aided manual labeling; the embodiments of the present invention place no limit on this.

The training process of the multi-speaker speech synthesis model can be as shown in FIG. 2. When the model is trained, each speaker is assigned an ID; assume the training set contains data of three speakers with IDs 1, 2 and 3. During training, the input data are the linguistic features of the three speakers together with their corresponding IDs and speaker characterizations (Speaker Embedding). Referring to FIG. 2, the multi-speaker speech synthesis model comprises an encoder (Encoder), an attention mechanism (Attention Mechanism) and a decoder (Decoder), and the output is the acoustic features of the three speakers. Training can be carried out with a standard neural-network training method such as the Back Propagation (BP) algorithm.

The BP algorithm mainly comprises two phases, excitation propagation and weight updating, which are iterated repeatedly until the network's response to the input reaches a preset target range. Its learning process consists of a forward propagation pass and a backward propagation pass. In the forward pass, the input passes from the input layer through the hidden layers, is processed layer by layer, and is transmitted to the output layer. If the expected output is not obtained at the output layer, the sum of squared errors between the actual and expected outputs is taken as the objective function, and learning switches to backward propagation: the partial derivatives of the objective function with respect to the weights of each neuron layer are computed layer by layer to form the gradient of the objective with respect to the weight vector, which serves as the basis for modifying the weights. Learning finishes when the error reaches the expected value. Through such learning over the training set data, the relationship between linguistic features and acoustic features is learned.

During training, the multi-speaker speech synthesis model takes as input the linguistic features and corresponding speaker IDs of the three speakers 1, 2 and 3, together with their speaker characterizations, i.e., the feature vectors of ID1, ID2 and ID3. After training is finished, if the ID of a speaker in the training set, for example ID 1, is input together with any linguistic features, the corresponding speech of the speaker with ID 1 can be predicted.
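To make the training setup above concrete, the following is a minimal PyTorch sketch of an encoder-attention-decoder model with a speaker embedding table, trained by back propagation as described. All module names, layer sizes and tensor shapes are illustrative assumptions; the patent does not prescribe a particular architecture.

```python
import torch
import torch.nn as nn

class MultiSpeakerTTS(nn.Module):
    """Hypothetical minimal multi-speaker TTS: encoder, attention, decoder,
    plus an N x M speaker embedding table (one M-dim vector per speaker)."""
    def __init__(self, n_speakers, ling_dim=128, spk_dim=64, acous_dim=80):
        super().__init__()
        self.spk_embedding = nn.Embedding(n_speakers, spk_dim)
        self.encoder = nn.GRU(ling_dim + spk_dim, 256, batch_first=True)
        self.attn = nn.MultiheadAttention(256, num_heads=4, batch_first=True)
        self.decoder = nn.Linear(256, acous_dim)

    def forward(self, ling_feats, speaker_id):
        # Broadcast the speaker's embedding vector across all input frames.
        spk = self.spk_embedding(speaker_id).unsqueeze(1)
        spk = spk.expand(-1, ling_feats.size(1), -1)
        enc, _ = self.encoder(torch.cat([ling_feats, spk], dim=-1))
        ctx, _ = self.attn(enc, enc, enc)
        return self.decoder(ctx)  # predicted acoustic features

model = MultiSpeakerTTS(n_speakers=3)  # IDs 0..2 stand in for speakers 1..3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One BP iteration: forward pass, squared-error objective, backward pass.
ling_feats = torch.randn(8, 50, 128)    # a batch of linguistic feature frames
speaker_id = torch.randint(0, 3, (8,))  # which speaker each utterance belongs to
target = torch.randn(8, 50, 80)         # ground-truth acoustic features
loss = nn.functional.mse_loss(model(ling_feats, speaker_id), target)
optimizer.zero_grad()
loss.backward()   # gradient of the objective w.r.t. every weight
optimizer.step()  # weight update
```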

Speaker characterization (Speaker Embedding) comprises a set of feature vectors, one per speaker, the number of vectors being equal to the total number of speakers in the multi-speaker training set. Mathematically it is an N × M matrix (N is the number of speakers, M the dimension of each vector), where each speaker's feature vector is abstracted and quantized from a series of that speaker's characteristics.

The multi-speaker speech synthesis model has the following characteristic: without a speaker ID and speaker characterization, the model can only output a standard voice for the text; but given the ID and speaker characterization of a specific speaker in its training set data (for example, Zhang San with ID 1), it outputs the speech of that specific speaker. In other words, the model can be controlled to output the voice of any speaker in the training set, but for a speaker not in the training set data it cannot output the corresponding voice: the input ID must be the ID of a speaker in the training set. The embodiments of the invention exploit exactly this characteristic, letting the user (the target speaker) impersonate one of the speakers in the trained multi-speaker speech synthesis model, so that the user's personalized speech synthesis model can be trained on top of the existing multi-speaker model. The specific training process of the personalized model is similar to that of the multi-speaker model and is not repeated here.
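A minimal sketch of the impersonation idea just described, reusing the imports, model and optimizer from the previous sketch: the target speaker's utterances are relabeled with the ID of the selected same-category speaker, mixed with the similar (adjacent) training data, and used to fine-tune the model. The function name and data layout are hypothetical.

```python
# Fine-tune the trained multi-speaker model so that one in-set ID comes to
# represent the target speaker. Each dataset item is a tuple of batched
# tensors (ling_feats, acoustic_feats, speaker_id).
SAME_CATEGORY_ID = torch.tensor([2])  # illustrative ID of the same-category speaker

def fine_tune(model, optimizer, target_data, neighbor_data, epochs=10):
    """target_data: the target speaker's pairs, relabeled with SAME_CATEGORY_ID;
    neighbor_data: adjacent-speaker/adjacent-sentence pairs with original IDs."""
    for _ in range(epochs):
        for ling, acous, spk_id in list(target_data) + list(neighbor_data):
            loss = nn.functional.mse_loss(model(ling, spk_id), acous)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Afterwards, synthesizing with SAME_CATEGORY_ID yields the target voice.
    return model
```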

In one embodiment, the data of the user (the target speaker) that is processed for feature extraction includes speech data and corresponding text.

The process of extracting linguistic features, as shown in FIG. 3, may be implemented, for example, by the following steps:

s31, automatically labeling the text of the user through voice synthesis to determine labeling information, wherein the labeling information comprises: pronunciation labeling and rhythm labeling;

for example, the TTS front end performs processing to perform pronunciation labeling and prosody labeling.

S32, determining a sound speed boundary by the voice data of the user through voice recognition and voice activity detection;

and S33, extracting corresponding linguistic features according to the pronunciation labels, the rhythm labels and the speed boundaries.

The linguistic feature refers to a linguistic feature extracted based on pronunciation labels and prosody labels, such as phoneme sequences, tones, boundary information, and pauses.

For example, pronunciation labels are text labels with pinyin (including tone): "I" is labeled "wo3", where the number 3 indicates the third tone.

A prosody label is, for example, a pause label, as in "I am #3 Chinese #1". In this sentence, "#3" indicates a long pause and "#1" indicates a short pause.

Specifically, the phoneme boundaries of the speech data may be determined by technical means such as Automatic Speech Recognition (ASR) and Voice Activity Detection (VAD), which determine the silent parts of the speech and the start and end time of each phoneme.
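As a rough illustration of the VAD part of this step, the sketch below marks speech versus silence frames by simple log-energy thresholding; a production system would combine a trained VAD with ASR forced alignment to obtain per-phoneme start and end times. All parameter values are illustrative.

```python
import numpy as np

def energy_vad(wave, sr, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Toy voice-activity detector: a frame is 'speech' if its RMS energy,
    relative to the waveform peak, exceeds a dB threshold."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(wave) - frame) // hop)
    ref = np.max(np.abs(wave)) + 1e-9
    flags = np.zeros(n, dtype=bool)
    for i in range(n):
        seg = wave[i * hop:i * hop + frame]
        rms = np.sqrt(np.mean(seg ** 2)) + 1e-12
        flags[i] = 20 * np.log10(rms / ref) > threshold_db
    return flags  # True = speech frame; flips mark speech/silence boundaries
```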

In the embodiments of the present invention, the acoustic features are features extracted from the speech, for example: the linear spectrum, Mel-Frequency Cepstral Coefficients (MFCC), and Fbank (filter bank) features.

MFCC is a cepstral parameter extracted in the Mel-scale frequency domain. According to research on the human auditory mechanism, the ear has different sensitivity to sound waves of different frequencies, and speech signals from 200 Hz to 5000 Hz have the largest impact on intelligibility. When two sounds of different loudness reach the ear, the frequency components of the louder one affect the perception of the quieter one, making it less noticeable; this is called the masking effect. Since lower-frequency sounds travel further up the basilar membrane of the cochlea than higher-frequency sounds, bass tends to mask treble, whereas treble masks bass with more difficulty, and the critical bandwidth of masking is smaller at low frequencies than at high frequencies.

Accordingly, a group of band-pass filters, spaced by critical bandwidth from low to high frequency, is arranged to filter the input signal. The signal energy output by each band-pass filter is taken as a basic feature of the signal and, after further processing, can be used as the input feature for speech. Because these features do not depend on the properties of the signal, make no assumptions or restrictions on the input, and exploit the findings of auditory models, they are more robust than LPCC, which is based on the vocal tract model; they conform better to the auditory characteristics of the human ear and retain good recognition performance when the signal-to-noise ratio drops.

The extraction process of MFCC includes: pre-emphasis, framing, windowing, FFT (Fast Fourier Transform), filtering with a triangular band-pass filter bank, computing the logarithmic energy output by each filter, and finally obtaining the MFCC coefficients through a Discrete Cosine Transform (DCT).

The MFCC coefficients are obtained by applying a discrete cosine transform on top of the Fbank features, so the Fbank extraction process coincides with the first steps of MFCC extraction.

The linear spectrum is extracted by applying a sliding-window (short-time) Fourier transform to the speech signal, which yields the linear spectrum of the signal.
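The three acoustic feature types above can be computed, for example, with librosa, following the very steps listed: STFT for the linear spectrum, mel filterbank plus log for Fbank, and a DCT on top for MFCC. The file name and parameter values (FFT size, hop length, numbers of mel bands and coefficients) are illustrative assumptions.

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# Linear spectrum: magnitude of the sliding-window (short-time) Fourier transform.
linear_spec = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Fbank: mel filterbank energies followed by a log (the steps shared with MFCC).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
fbank = np.log(mel + 1e-10)

# MFCC: discrete cosine transform of the log filterbank energies.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=1024, hop_length=256)
```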

The embodiment of the invention does not limit the specific characteristics adopted by the linguistic characteristics and the acoustic characteristics, nor the specific extraction mode of the characteristics, and can be realized by adopting the extraction means in the prior art.

Because the way the user's speech data is collected is arbitrary, for example speaking into a mobile phone in an environment with a complicated background, the user's speech may contain varying degrees of environmental noise and reverberation, and its volume may also vary. To achieve a good training effect even in a relatively poor recording environment, preferably, in the embodiments of the invention, the following may be performed before acoustic features are extracted from the user's speech data: preprocessing operations on the speech data, including energy normalization, dereverberation, and energy enhancement.

Specifically, the energy normalization step normalizes the energy of data from the same batch to a specific energy distribution; the dereverberation step removes reverberation from the speech; the enhancement step strengthens the speech signal and attenuates the noise.
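Of the three operations, energy normalization is the simplest to illustrate. A minimal sketch, assuming normalization to a fixed target RMS energy (the target value is an assumption; dereverberation and enhancement are not shown):

import numpy as np

def normalize_energy(wav, target_rms=0.05):
    # Scale the waveform so that its RMS energy matches a fixed target,
    # bringing utterances recorded at different volumes to one energy
    # distribution.
    rms = np.sqrt(np.mean(wav ** 2)) + 1e-10
    return wav * (target_rms / rms)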

In theory, the more speech data a user provides and the longer it is, the better for training the personalized speech synthesis model. In practice, however, more data raises the cost of recording and speech processing (labeling, linguistic and acoustic feature extraction) and degrades the user experience. Therefore, in the embodiment of the present invention, roughly 10 to 100 sentences of speech data from the user suffice to build the personalized speech synthesis model, which greatly improves the user experience while preserving the accuracy of the model.

In one embodiment, step S12, determining the training data similar to the user from the training set data of the multiple speakers of the multi-speaker speech synthesis model, can be implemented as follows:

determining training data of a preset number of adjacent speakers similar to a user from training set data of a plurality of speakers of a multi-speaker voice synthesis model; and/or determining training data corresponding to a preset number of adjacent sentences similar to the user; the training data includes voice data and corresponding text, as well as linguistic features of the text and acoustic features of the voice data.

In other words, among the training set data of the multiple speakers, data of speakers or sentences similar to the user is sought as far as possible, the purpose being to use this similar training data to assist the learning of the speech-related characteristics of the user (the target speaker), thereby achieving naturalness and similarity in the finally synthesized speech.

Determining a preset number of adjacent speakers similar to the user (the target speaker) can be accomplished as follows:

calculating corresponding vectors for the user and each speaker of the multiple speakers respectively;

respectively determining the distance between the vector of each of the multiple speakers and the vector of the user, sorting the distances by magnitude, and taking a preset number of speakers, starting from the smallest distance, as the adjacent speakers.

For example, the i-vector speaker recognition algorithm can be used: it computes a vector (the i-vector) for each speaker and judges the similarity between different speakers by the distance between their vectors (Euclidean distance or cosine distance).

That is, the distances between each speaker's vector and the vector of the user (the target speaker) are sorted by magnitude, and the speakers with the smallest distances are taken as the adjacent speakers.
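A minimal sketch of this neighbor selection, assuming speaker vectors (for example, i-vectors) are already available and using cosine distance; the function and variable names are illustrative:

import numpy as np

def nearest_speakers(user_vec, speaker_vecs, k=4):
    # speaker_vecs: dict mapping speaker ID -> speaker vector (np.ndarray).
    # Rank all speakers by cosine distance to the user's vector and return
    # the IDs of the k closest ones as the adjacent speakers.
    def cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    ranked = sorted(speaker_vecs.items(),
                    key=lambda item: cosine_distance(user_vec, item[1]))
    return [speaker_id for speaker_id, _ in ranked[:k]]

The same routine applies unchanged at sentence level: pass a mapping from sentence identifiers to sentence vectors and a larger k to obtain the adjacent sentences described below.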

Referring to fig. 4, the circular dots each represent a speaker's vector; the dot at the center of the dotted circle is the user's vector, and four adjacent speakers (all of them speakers in the training set of the multi-speaker speech synthesis model) also fall inside the dotted circle, identified as adjacent speaker 1 to adjacent speaker 4.

Adjacent sentences are determined in a similar manner. Selecting adjacent speakers does not account for the diversity of sentences from the same speaker within the training set pool: some of a speaker's sentences may be close to the user while others are far away. Therefore, a vector can be computed for each sentence, its distance to the user's vector calculated, and the sentences closest to the user taken as the adjacent sentences.

Specifically, the distance between the vector of each sentence of each of the multiple speakers and the vector of the user is determined, the distances are sorted by magnitude, and a preset number of sentences, starting from the smallest distance, are taken as the adjacent sentences.

Referring to fig. 5, the larger dots represent speaker vectors and the smaller dots sentence vectors; the center position within the dashed box is the user, and the other small dots within the dashed box represent the neighboring sentences close to the user.

After the adjacent speakers and/or adjacent sentences of the user (the target speaker) have been determined, it is still necessary to select, from the speakers in the training set of the multi-speaker speech synthesis model, a same-class speaker belonging to the same category as the user (the target speaker); the selected speaker must not be one of the adjacent speakers or the speakers to which the adjacent sentences belong.

In one embodiment, the same category is determined according to any one, or a combination, of the following speaker attributes: gender, age, speaking style, speaking environment, and the like. One or more of these conditions are considered comprehensively to select speakers of the same class.

For example, speakers of the same gender can be selected directly, or speakers of the same gender and the same age group; the general principle is to choose speakers closer to the user (the target speaker).

If the selection is kept simple, a single condition can be considered; for example, because the difference between male and female voices is large, simply selecting a speaker of the same gender is often sufficient.
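By way of illustration, a minimal sketch of such a same-class filter, assuming each speaker is described by a dictionary of attributes (the attribute names are assumptions of this sketch):

def same_class_speakers(candidates, user, keys=("gender",)):
    # candidates: list of dicts describing speakers; user: dict of the same
    # form. Keep only candidates whose attributes match the user's on every
    # chosen key, e.g. keys=("gender", "age_group") for gender plus age group.
    return [s for s in candidates
            if all(s.get(k) == user.get(k) for k in keys)]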

The following briefly describes the method for constructing a personalized speech synthesis model according to an embodiment of the present invention with two specific examples.

The first embodiment is as follows:

Assume that the training set data of the multiple speakers of the multi-speaker speech synthesis model includes speaker A, speaker B, speaker C, speaker D, and speaker E, whose IDs in the model are ID1, ID2, ID3, ID4, and ID5, respectively. The multi-speaker Neural TTS model is trained with the training data of these speakers to obtain the trained multi-speaker Neural TTS model.

Now suppose there is a speaker F to be personalized. Referring to the flowchart shown in fig. 6A, automatic labeling (via speech synthesis) and speech data preprocessing are performed on speaker F's personalized data, i.e., speech data and text, respectively, and the corresponding linguistic and acoustic features are then extracted.

Fig. 6B shows how the linguistic features are extracted from the text and the speech: the pronunciation labels and prosody labels of the text are first extracted by the TTS front end, phoneme boundaries are obtained by applying ASR and VAD processing to the speech, and the linguistic features are then extracted from the pronunciation and prosody labels combined with the phoneme boundaries.

Fig. 6C shows the process of extracting the corresponding acoustic features from the speech after preprocessing (energy normalization, dereverberation, and enhancement).

According to the distances between the vectors of speaker A, speaker B, speaker C, speaker D, speaker E, and speaker F, the adjacent speakers of speaker F are selected, namely speaker B and speaker C.

Among speakers A to E, excluding speaker B and speaker C, speaker D, who has the same gender as speaker F, is selected from the three remaining speakers A, D, and E.

Referring to fig. 6D, the ID4 of speaker D and the speaker characterization of speaker D are input into the multi-speaker Neural TTS model, and the model is trained using the linguistic and acoustic features of the user (the target speaker), i.e., speaker F, together with the training data of speaker B and speaker C from the training set data of the multi-speaker Neural TTS model (including the speech data, the text, and the corresponding linguistic and acoustic features), so as to obtain the personalized speech synthesis model for speaker F.

The second embodiment is as follows:

Similar to the first embodiment, the training set data of the multiple speakers of the multi-speaker speech synthesis model in the second embodiment includes all the training data of speaker A, speaker B, speaker C, speaker D, and speaker E, whose IDs in the model are ID1, ID2, ID3, ID4, and ID5, respectively. The multi-speaker Neural TTS model is trained with the training data of these speakers to obtain the trained multi-speaker Neural TTS model.

Now suppose there is a speaker F to be personalized. Referring to the flowchart shown in fig. 7A, automatic labeling (via speech synthesis) and speech data preprocessing are performed on speaker F's personalized data, i.e., speech data and text, respectively, and the corresponding linguistic and acoustic features are then extracted. For details, refer to fig. 6B and 6C of the first embodiment.

Different from the first embodiment, corresponding vectors are respectively computed for each sentence spoken by speaker A, speaker B, speaker C, speaker D, and speaker E, as well as for speaker F; the distances between the sentence vectors and speaker F's vector are calculated, and the adjacent sentences of speaker F are selected, assumed here to be 14 sentences, sentence 1 through sentence 14.

Among speakers A to E, excluding the speakers to which the 14 sentences belong, speaker D is selected as the speaker having the same gender as speaker F.

Referring to fig. 7B, the ID4 of speaker D and the speaker characterization of speaker D are input into the multi-speaker Neural TTS model, and the model is trained using the linguistic and acoustic features of the target speaker F together with the training data corresponding to sentences 1 to 14 from the training set data of the multi-speaker Neural TTS model (including the speech data, the text, and the corresponding linguistic and acoustic features), so as to obtain the personalized speech synthesis model for speaker F.

The inventor has verified through experiments that the construction method of the personalized speech synthesis model provided by the embodiment of the present invention can simulate the speaker's voice well even when the target speaker F provides very few sentences, for example only 10, so the method is convenient for users and is not limited by the recording environment or the recording equipment used. In addition, because the amount of data from the target speaker F is small, the time cost of recording and labeling is greatly reduced, and both the training of the speech synthesis model and the speed of speech synthesis are greatly accelerated.

The embodiment of the invention also provides a personalized speech synthesis method based on the construction method of the personalized speech synthesis model.

The personalized speech synthesis method, as shown in fig. 8, includes the following steps:

S81, processing the text to be synthesized, and extracting the corresponding linguistic features;

S82, inputting the linguistic features, together with the ID of the same-class speaker that corresponded to the user (the target speaker) during training of the personalized speech synthesis model, into the personalized speech synthesis model, and predicting the acoustic features corresponding to the text to be synthesized;

and S83, synthesizing, according to the acoustic features, the user's (the target speaker's) speech corresponding to the text.

The personalized speech synthesis model is obtained by adopting the construction method of the personalized speech synthesis model.

Taking the flowchart shown in fig. 9 as an example, suppose a short text is to be synthesized in the voice of target speaker F, and that speaker D was selected as the same-class speaker when the personalized Neural TTS model was trained. First, the TTS front end extracts the linguistic features of the text to be synthesized; these features and speaker D's ID, i.e., ID4, are then input together into the personalized Neural TTS model, which outputs the corresponding acoustic features; finally, a vocoder converts the acoustic features into speaker F's speech. The text to be synthesized can be specified arbitrarily as required.
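The flow of steps S81 to S83 can be sketched as follows; front_end, model, and vocoder are placeholder callables standing in for the TTS front end, the trained personalized Neural TTS model, and the vocoder, whose actual interfaces are not specified in this embodiment:

def synthesize(text, front_end, model, vocoder, same_class_speaker_id):
    # Step S81: the TTS front end extracts linguistic features from the text.
    linguistic = front_end(text)
    # Step S82: the personalized model predicts acoustic features, conditioned
    # on the ID of the same-class speaker chosen during training (ID4 here).
    acoustic = model(linguistic, speaker_id=same_class_speaker_id)
    # Step S83: the vocoder converts the acoustic features into a waveform.
    return vocoder(acoustic)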

Referring to fig. 10, in the method for testing a personalized speech synthesis model according to the embodiment of the present invention, steps S101 to S103 are similar to the speech synthesis method described above. The difference is step S104, which verifies the speech output by the model and determines whether the personalized speech synthesis model is qualified; if the test fails, the training process of the model can be adjusted according to the feedback. The specific implementation process is not repeated here.

The method for constructing a personalized speech synthesis model and the personalized speech synthesis method provided by the embodiments of the present invention can be widely applied to various artificial intelligence scenarios, such as audio reading, intelligent customer service, voice interaction, voice broadcasting, and machine translation.

For example, when applied to products such as a voice assistant or intelligent customer service, the user records a segment of speech in advance; given any piece of text, the system can then output lifelike speech, enabling scenarios such as intelligent interaction and voice broadcasting.

In other possible embodiments, an embodiment of the present invention provides a method for constructing a personalized speech synthesis model, which is shown in fig. 11 and includes the following steps:

S111, according to a preset scene, selecting users similar to the user from at least one of the user's social networks corresponding to the scene, and acquiring the training set data of these approximate speakers;

S112, selecting, from the multiple speakers and excluding the approximate speakers, a same-class speaker belonging to the same category as the user; the multiple speakers are the speakers corresponding to the training set of the multi-speaker speech synthesis model;

S113, training the multi-speaker speech synthesis model according to the training data of the speakers similar to the user in each scene and the selected same-class speaker, so as to obtain the personalized speech synthesis model of the user in that scene.

In the above method, the scenario can take many forms. For example, a client APP corresponds to a social-network scenario: according to the applicable scenario of the client APP the user is using (for example a social APP, a voice-imitation APP, or an online game APP), the server can select users similar to the user from at least one of the user's social networks, for instance from the family members, relatives, friends, colleagues, and classmates in the user's circle, picking users who share some common ground with the user, such as the same family, the same school, or the same work unit. The training data of these similar users is then acquired for subsequently training the multi-speaker speech synthesis model.

The scenario may also be one set by the user at the client, such as a home, work, or leisure scenario. In different scenarios the user may want different voices and/or different modes of expression, or may want a specific voice and/or expression on certain occasions, so the user can choose the corresponding personalized speech synthesis model, which requires learning from the training data of those similar users. Through the client, the user can select similar users from several of his or her social networks, or directly select the voice of a specific person to learn for a specific occasion, so as to imitate the voice and/or expression of the similar users.

The embodiment of the invention also provides a method for constructing the personalized speech synthesis model, which is shown in fig. 12 and comprises the following steps:

S121, according to the preset priority of each approximate-speaker set, searching the sets in order of priority for at least one approximate speaker similar to the user;

S122, acquiring the training set data of the at least one approximate speaker that was found;

S123, selecting, from the multiple speakers and excluding the approximate speakers, a same-class speaker belonging to the same category as the user; the multiple speakers are the speakers corresponding to the training set of the multi-speaker speech synthesis model;

and S124, training the multi-speaker speech synthesis model according to the training data of the at least one approximate speaker and the selected same-class speaker to obtain the personalized speech synthesis model of the user.

In steps S121 to S122 above, there may be multiple sets of approximate users selected in different ways, such as the sets drawn from social networks. According to the preset priorities, the server searches each approximate-speaker set in turn for at least one approximate speaker similar to the user. For example, the sets in descending order of priority may be: the family set, the friend set, the classmate set, the colleague set, the residential community set, and then administrative-region sets of increasing size such as district, city, and province. When selecting approximate speakers, the sets are consulted from highest to lowest priority until a sufficient number of similar users have been found; if a higher-priority set already satisfies the selection, the next level need not be searched.
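An illustrative sketch of this priority-ordered search (the function and parameter names are assumptions; is_similar stands for whatever similarity test the system applies, for example a vector-distance threshold):

def select_approximate_speakers(candidate_sets, is_similar, quota):
    # candidate_sets: speaker collections ordered from highest to lowest
    # priority (e.g. family, friends, classmates, colleagues, community,
    # district, city, province). Stop as soon as the quota is filled, so
    # lower-priority sets are consulted only when higher ones fall short.
    selected = []
    for speaker_set in candidate_sets:
        for speaker in speaker_set:
            if is_similar(speaker):
                selected.append(speaker)
                if len(selected) >= quota:
                    return selected
    return selected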

The selection of the approximate speakers (approximate users) can also be handed over entirely to the user (the target speaker): the server pushes the corresponding sets of approximate users, the selection is made at the client, and once it is completed the chosen approximate speakers are returned to the server for the acquisition of training data and the training of the personalized speech synthesis model.

The corresponding method for constructing the personalized speech synthesis model, as shown in fig. 13, includes the following steps:

S131, pushing the approximate-speaker sets of each level to the user's client in sequence, according to the priority of each of the user's approximate-user sets and in order of priority;

S132, receiving the identifications, returned by the client, of the approximate speakers selected from the sets of each level, and acquiring the training set data of those approximate speakers according to the identifications;

S133, selecting, from the multiple speakers and excluding the approximate speakers, a same-class speaker belonging to the same category as the user; the multiple speakers are the speakers corresponding to the training set of the multi-speaker speech synthesis model;

S134, training the multi-speaker speech synthesis model according to the training data of the at least one approximate speaker and the selected same-class speaker to obtain the personalized speech synthesis model of the user.

In one embodiment, the above-mentioned approximate-speaker sets of the various levels include one or more of the following:

At least one user set from the user's social networks; as mentioned above, the user's social networks may be friends from multiple communities and the like that the user selects autonomously, which will not be repeated here.

At least one user set of users belonging to the same geographical area, such as users belonging to the same district, city, or even province; such a scenario may be applicable, for example, when a personalized speech synthesis model is required to learn dialects, pronunciation characteristics, and the like of a specific region.

At least one user set selected by the user according to the preference of the user; when a user wants to imitate the voice and/or language style of a specific person, for example, one or more preferred users can be selected as approximate users (approximate speakers) according to the preference of the user.

When steps S131 to S134 are implemented, they may be carried out by a server or by a client with sufficient computing power. The user selects approximate speakers from several possible sets, for example a favorite celebrity, so that the user's voice characteristics and the celebrity's voice characteristics can be fused in a certain ratio, producing a rich voice experience. The server or the client acquires the corresponding training set data according to the identifications of the approximate speakers, and the multi-speaker speech synthesis model is trained to generate the personalized speech synthesis model for the user.
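The fusion "in a certain ratio" is not specified further in this embodiment; one plausible reading, sketched here purely as an assumption, is linear interpolation between speaker embeddings:

import numpy as np

def fuse_speaker_embeddings(user_emb, star_emb, ratio=0.5):
    # Linear interpolation between the user's embedding and the chosen
    # speaker's embedding; 'ratio' controls how much of the other voice is
    # mixed in (ratio=0 keeps the user's voice unchanged).
    return (1.0 - ratio) * np.asarray(user_emb) + ratio * np.asarray(star_emb)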

Based on the same inventive concept, the embodiments of the present invention also provide a device for constructing a personalized speech synthesis model, a personalized speech synthesis device, and a server. Because the principles by which these devices and the server solve the problem are similar to those of the above construction method and personalized speech synthesis method, their implementation can refer to the implementation of the methods, and repeated parts are not described again.

The device for constructing the personalized speech synthesis model provided by the embodiment of the invention, as shown in fig. 14, includes:

a determining module 141, configured to determine training data similar to the user from training set data of multiple speakers in the multiple speaker speech synthesis model;

a selecting module 142, configured to select, from the multiple speakers and excluding the speaker to which the approximate training data belongs, a same-class speaker belonging to the same category as the user;

a training module 143, configured to train the multi-speaker speech synthesis model according to the training data similar to the user and the selected same-class speaker, so as to obtain the personalized speech synthesis model of the user.

In an embodiment, the apparatus for constructing a personalized speech synthesis model, as shown in fig. 14, further includes: an extraction module 144; wherein:

the extraction module 144 is configured to process data of a target speaker and extract corresponding linguistic features and acoustic features;

correspondingly, the training module 143 is specifically configured to input the ID of the same speaker in the multi-speaker speech synthesis model and the corresponding speaker characterization into the multi-speaker speech synthesis model, and train the multi-speaker speech synthesis model by using the linguistic feature and the acoustic feature corresponding to the user and the similar training data as training data, so as to obtain the personalized speech synthesis model of the user.

In one embodiment, the determining module 141 is further configured to determine training data of a preset number of neighboring speakers similar to the user from training set data of multiple speakers of the multiple speaker speech synthesis model; and/or determining training data corresponding to a preset number of adjacent sentences similar to the user; the training data includes speech data and corresponding text, as well as linguistic features of the text and acoustic features of the speech data.

In one embodiment, the determining module 141 is further configured to calculate a corresponding vector for the user and each speaker of the plurality of speakers; determining the distance between each speaker in the speakers and the vector of the user respectively, sorting the distances according to the sizes, and determining a preset number of speakers starting from the smallest distance as adjacent speakers.

In one embodiment, the determining module 141 is further configured to calculate a corresponding vector for each sentence of each speaker of the user and the plurality of speakers; the distance between each sentence of each speaker in the multiple speakers and the vector of the user is determined respectively and sorted according to the size, and a preset number of sentences starting from the smallest distance are determined as adjacent sentences.

In one embodiment, the user's data includes: voice data and corresponding text;

accordingly, the extraction module 144 is further configured to: automatically label the user's text via speech synthesis to determine labeling information, the labeling information including pronunciation labels and prosody labels; determine phoneme boundaries from the user's speech data through speech recognition and voice activity detection; extract the corresponding linguistic features according to the pronunciation labels, the prosody labels, and the phoneme boundaries; and extract acoustic features from the user's speech data.

In one embodiment, the above extraction module 144 is further configured to perform preprocessing operations, including energy normalization, dereverberation, and enhancement, on the user's speech data before extracting the acoustic features.

In an embodiment of the method for constructing a personalized speech synthesis model, the personalized speech synthesis method, and the apparatus provided by the embodiments of the present invention, the user's voice can be learned better during training of the personalized speech synthesis model if all model parameters are updated. Of course, it is also possible to update only part of the parameters, for example only the Decoder parameters of the model.

An embodiment of the present invention further provides another apparatus for constructing a personalized speech synthesis model, which is shown in fig. 15 and includes:

a first selection module 151, configured to select, according to a preset scene, a user similar to the user from at least one social network of the user corresponding to the scene;

an obtaining module 152 for obtaining training set data of the approximate speaker;

a second selecting module 153, configured to select, from the multiple speakers and excluding the approximate speakers, a same-class speaker belonging to the same category as the user; the multiple speakers are the speakers corresponding to the training set of the multi-speaker speech synthesis model;

a training module 154, configured to train the multi-speaker speech synthesis model according to the training data of the speakers similar to the user in each scene and the selected same-class speaker, so as to obtain the personalized speech synthesis model of the user in that scene.

An embodiment of the present invention further provides another apparatus for constructing a personalized speech synthesis model, which is shown in fig. 16 and includes:

a searching module 161, configured to search each approximate-speaker set in order of its preset priority for at least one approximate speaker similar to the user;

an obtaining module 162, configured to obtain training set data of at least one approximate speaker according to the found at least one approximate speaker;

a selection module 163, configured to select, from the multiple speakers and excluding the approximate speakers, a same-class speaker belonging to the same category as the user; the multiple speakers are the speakers corresponding to the training set of the multi-speaker speech synthesis model;

a training module 164, configured to train the multi-speaker speech synthesis model according to the training data of the at least one approximate speaker and the selected same-class speaker, so as to obtain the personalized speech synthesis model of the user.

An embodiment of the present invention further provides another apparatus for constructing a personalized speech synthesis model, which is shown in fig. 17 and includes:

the pushing module 171 is configured to sequentially push the approximate speaker sets of each level to the client where the user is located according to the priority of each approximate user set of the user and the order of the priority;

a receiving module 172, configured to receive identifiers of approximate speakers in the selected approximate speaker sets of each level returned by the client;

an obtaining module 173, configured to obtain training set data of the approximate speaker according to the identifier;

a selection module 174, configured to select, from the multiple speakers and excluding the approximate speakers, a same-class speaker belonging to the same category as the user; the multiple speakers are the speakers corresponding to the training set of the multi-speaker speech synthesis model;

a training module 175, configured to train the multi-speaker speech synthesis model according to the training data of the at least one approximate speaker and the selected same-class speaker, so as to obtain the personalized speech synthesis model of the user.

An apparatus for personalized speech synthesis provided by an embodiment of the present invention, as shown in fig. 18, includes:

the extraction module 181 is configured to process a text to be speech-synthesized, and extract a corresponding linguistic feature;

a prediction module 182, configured to input the linguistic features, together with the ID of the same-class speaker that corresponded to the user during training of the personalized speech synthesis model, into the personalized speech synthesis model, and to predict the acoustic features corresponding to the text to be synthesized;

a speech synthesis module 183, configured to synthesize, according to the acoustic feature, a synthesized speech of the user corresponding to the text;

the personalized speech synthesis model is obtained by the above device for constructing a personalized speech synthesis model.

An embodiment of the present invention further provides a device for testing a personalized speech synthesis model, which is shown in fig. 19 and includes:

the extraction module 191 is used for processing the text to be speech-synthesized and extracting the corresponding linguistic features;

a prediction module 192, configured to input the linguistic features, together with the ID of the same-class speaker that corresponded to the user during training of the personalized speech synthesis model, into the personalized speech synthesis model, and to predict the acoustic features corresponding to the text to be synthesized;

a speech synthesis module 193, configured to synthesize, according to the acoustic feature, a synthesized speech corresponding to the text by the user;

a verification module 194, configured to verify the synthesized speech, and determine whether the personalized speech synthesis model is qualified;

similarly, the personalized speech synthesis model is also obtained by the above device for constructing a personalized speech synthesis model.

An embodiment of the present invention further provides an intelligent voice server, including: a memory and a processor; wherein the memory stores a computer program which, when executed by the processor, is capable of implementing the aforementioned method of constructing a personalized speech synthesis model or of implementing one of the aforementioned methods of personalized speech synthesis.

Embodiments of the present invention further provide a non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor, can perform the foregoing method for constructing a personalized speech synthesis model or can implement the foregoing method for personalized speech synthesis.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
