Model training and voice synthesis method, device, equipment and medium

Document No.: 972880 | Publication date: 2020-11-03

Note: This technology, "Model training and voice synthesis method, device, equipment and medium" (一种模型训练及语音合成方法、装置、设备和介质), was designed and created by 康永国 on 2020-07-13. Its main content is as follows: The application discloses a model training and voice synthesis method, device, equipment and medium, relating to the technical fields of artificial intelligence, deep learning and speech. The specific implementation scheme is: acquiring a sample text in a training data set; determining label information corresponding to the sample text based on an acoustic model trained in advance with an unsupervised training method, the label information comprising style information and/or role information; and training a text classification model based on the sample text and the label information corresponding to it, the text classification model being used to output corresponding label information according to an input text. The embodiments of the application realize the technical effect of automatically determining the label information corresponding to a sample text, improve labeling accuracy and efficiency, and correspondingly improve the training speed of the text classification model.

1. A method of model training, the method comprising:

acquiring a sample text in a training data set;

determining label information corresponding to the sample text based on an acoustic model trained by adopting an unsupervised training method in advance; the label information comprises style information and/or role information;

training a text classification model based on the sample text and the label information corresponding to the sample text; and the text classification model is used for outputting corresponding label information according to the input text.

2. The method of claim 1, wherein the method of training the acoustic model comprises:

acquiring training data in the training data set, wherein the training data comprises text features of sample texts and voice data corresponding to the sample texts;

and training a pre-constructed acoustic model based on the training data by adopting an unsupervised training method to establish a mapping relation between text characteristics and acoustic characteristics and obtain a clustering result for clustering the training data according to styles and/or roles.

3. The method according to claim 2, wherein the clustering result includes label information corresponding to each voice data; determining label information corresponding to the sample text based on an acoustic model trained by adopting an unsupervised training method in advance, wherein the label information comprises the following steps:

and inputting the voice data corresponding to the sample text into the trained acoustic model to obtain the label information corresponding to the sample text output by the acoustic model.

4. The method of any of claims 1-3, wherein the method of generating the sample text comprises:

acquiring a preset number of real person voice data, and executing background music and/or noise removal operation on each real person voice data;

and segmenting each real person voice data, and acquiring texts corresponding to each segmented voice data as sample texts.

5. The method of claim 4, wherein the live speech data comprises: and the real person of the literature carrier broadcasts data in voice.

6. A method of speech synthesis, the method comprising:

inputting a text to be synthesized into a text classification model trained in advance, and obtaining label information corresponding to the text to be synthesized output by the classification model; the label information comprises style information and/or role information; the text classification model is a model trained by using the model training method of any one of claims 1 to 5;

inputting the text features of the text to be synthesized and the label information corresponding to the text to be synthesized into an acoustic model trained in advance by an unsupervised training method, and obtaining the acoustic features, output by the acoustic model, that correspond to the text features and the label information;

and performing voice synthesis on the text to be synthesized based on the acoustic features to obtain voice data corresponding to the text to be synthesized.

7. The method of claim 6, wherein the text to be synthesized comprises: a literary carrier text to be synthesized.

8. A model training apparatus, the apparatus comprising:

the sample text acquisition module is used for acquiring sample texts in the training data set;

the label information determining module is used for determining label information corresponding to the sample text based on an acoustic model trained by adopting an unsupervised training method in advance; the label information comprises style information and/or role information;

the text classification model training module is used for training a text classification model based on the sample text and the label information corresponding to the sample text; and the text classification model is used for outputting corresponding label information according to the input text.

9. The apparatus of claim 8, wherein the method of training the acoustic model comprises:

acquiring training data in the training data set, wherein the training data comprises text features of sample texts and voice data corresponding to the sample texts;

and training a pre-constructed acoustic model based on the training data by adopting an unsupervised training method to establish a mapping relation between text characteristics and acoustic characteristics and obtain a clustering result for clustering the training data according to styles and/or roles.

10. The apparatus according to claim 9, wherein the clustering result includes label information corresponding to each voice data; the tag information determination module is specifically configured to:

and inputting the voice data corresponding to the sample text into the trained acoustic model to obtain the label information corresponding to the sample text output by the acoustic model.

11. The apparatus according to any one of claims 8-10, wherein the method of generating the sample text comprises:

acquiring a preset number of real person voice data, and executing background music and/or noise removal operation on each real person voice data;

and segmenting each real person voice data, and acquiring texts corresponding to each segmented voice data as sample texts.

12. The apparatus of claim 11, wherein the live speech data comprises: and the real person of the literature carrier broadcasts data in voice.

13. A speech synthesis apparatus, the apparatus comprising:

the label information acquisition module is used for inputting a text to be synthesized into a text classification model trained in advance and acquiring label information corresponding to the text to be synthesized output by the classification model; the label information comprises style information and/or role information; the text classification model is a model trained by using the model training method of any one of claims 1 to 5;

the acoustic feature acquisition module is used for inputting the text features of the text to be synthesized and the label information corresponding to the text to be synthesized into an acoustic model trained in advance by an unsupervised training method, and acquiring the acoustic features, output by the acoustic model, that correspond to the text features and the label information;

and the voice synthesis module is used for carrying out voice synthesis on the text to be synthesized based on the acoustic characteristics to obtain voice data corresponding to the text to be synthesized.

14. The apparatus of claim 13, wherein the text to be synthesized comprises: a literary carrier text to be synthesized.

15. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein:

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model training method of any one of claims 1-5 and/or the speech synthesis method of any one of claims 6-7.

16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the model training method of any one of claims 1-5 and/or the speech synthesis method of any one of claims 6-7.

Technical Field

The embodiment of the application relates to the technical field of artificial intelligence, deep learning and voice, in particular to a model training and voice synthesis method, device, equipment and medium.

Background

The traditional speech synthesis technology adopts supervised machine learning, namely, text data of different styles, different emotions or different roles have corresponding labels, and the labels can help a speech synthesis system to better model and generate speech.

In existing methods, annotators usually label the acquired text data according to subjective experience. Because different annotators understand the labels inconsistently, labeling accuracy is low; and because the data must be labeled manually, labeling efficiency is low as well.

Disclosure of Invention

The embodiment of the disclosure provides a model training and speech synthesis method, device, equipment and medium.

According to an aspect of the present disclosure, there is provided a model training method, the method including:

acquiring a sample text in a training data set;

determining label information corresponding to the sample text based on an acoustic model trained by adopting an unsupervised training method in advance; the label information comprises style information and/or role information;

training a text classification model based on the sample text and the label information corresponding to the sample text; and the text classification model is used for outputting corresponding label information according to the input text.

According to another aspect of the present disclosure, there is provided a speech synthesis method, the method including:

inputting a text to be synthesized into a text classification model trained in advance, and obtaining label information corresponding to the text to be synthesized output by the classification model; the label information comprises style information and/or role information; the text classification model is a model trained by using a model training method disclosed by the application;

inputting the text features of the text to be synthesized and the label information corresponding to the text to be synthesized into an acoustic model trained in advance by an unsupervised training method, and obtaining the acoustic features, output by the acoustic model, that correspond to the text features and the label information;

and performing voice synthesis on the text to be synthesized based on the acoustic features to obtain voice data corresponding to the text to be synthesized.

according to another aspect of the present disclosure, there is provided a model training apparatus, the apparatus including:

the sample text acquisition module is used for acquiring sample texts in the training data set;

the label information determining module is used for determining label information corresponding to the sample text based on an acoustic model trained by adopting an unsupervised training method in advance; the label information comprises style information and/or role information;

the text classification model training module is used for training a text classification model based on the sample text and the label information corresponding to the sample text; and the text classification model is used for outputting corresponding label information according to the input text.

According to another aspect of the present disclosure, there is provided a speech synthesis apparatus, the apparatus including:

the label information acquisition module is used for inputting a text to be synthesized into a text classification model trained in advance and acquiring label information corresponding to the text to be synthesized output by the classification model; the label information comprises style information and/or role information; the text classification model is a model trained by using a model training method disclosed by the application;

the acoustic feature acquisition module is used for inputting the text features of the text to be synthesized and the label information corresponding to the text to be synthesized into an acoustic model trained in advance by an unsupervised training method, and acquiring the acoustic features, output by the acoustic model, that correspond to the text features and the label information;

and the voice synthesis module is used for carrying out voice synthesis on the text to be synthesized based on the acoustic characteristics to obtain voice data corresponding to the text to be synthesized.

According to another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein:

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a model training method and/or a speech synthesis method as described in any of the embodiments of the present application.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a model training method and/or a speech synthesis method according to any one of the embodiments of the present disclosure.

According to the technology of the application, the technical effect of automatically determining the label information corresponding to a sample text is achieved, labeling accuracy and efficiency are improved, and the training speed of the text classification model is correspondingly improved; furthermore, the generation of multi-style, multi-role, emotionally rich speech data is realized, the speech is closer to a real person's reading style, and both the length of time users listen to the speech data and their listening experience are greatly improved. It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

FIG. 1 is a flow chart of a model training method according to an embodiment of the present application;

FIG. 2A is a flow chart of a model training method according to an embodiment of the present application;

FIG. 2B is a schematic illustration of model training according to an embodiment of the present application;

FIG. 3A is a flow chart of a method of speech synthesis according to an embodiment of the present application;

FIG. 3B is a schematic illustration of speech synthesis according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a model training apparatus according to an embodiment of the present application;

fig. 5 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;

FIG. 6 is a block diagram of an electronic device for model training and speech synthesis methods according to embodiments of the present application.

Detailed Description

The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.

In traditional speech synthesis, determining the labels corresponding to text data for supervised machine learning is not easy. For example, to obtain text data bearing four different labels, one approach is to record the data label by label, which is extremely time-consuming; another is to collect existing historical data and have annotators label it manually according to experience, which suffers from low accuracy and low efficiency. Therefore, for multi-style, multi-role voice broadcasting scenarios, the traditional supervised machine learning method is difficult to apply.

Fig. 1 is a flowchart of a model training method disclosed in an embodiment of the present application, applicable to training a text classification model without supervision. The method of this embodiment may be performed by a model training apparatus, which may be implemented in software and/or hardware and integrated on any electronic device with computing capability, such as a server.

As shown in fig. 1, the model training method disclosed in this embodiment may include:

s101, obtaining a sample text in the training data set.

Wherein the training data set comprises training data for training the acoustic model and the text classification model. In this embodiment, the training data set includes a sample text, text features of the sample text, and speech data corresponding to the sample text. The sample text is obtained by performing speech recognition on the collected speech data.

Optionally, the method for generating the sample text includes two steps:

A. Acquire a preset amount of real-person voice data, and perform a background-music and/or noise removal operation on each piece of real-person voice data.

Real-person voice data refers to voice data spoken by a real person.

In one implementation, a preset amount of real-person voice data is collected from the internet. Then, according to the frequency range occupied by the background music and/or noise in the real-person voice data, a filter matched to that range is applied to filter the data and remove the background music and/or noise.
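As a concrete illustration of the filtering step, the sketch below suppresses a known interfering band with a band-stop filter. This is a minimal sketch assuming scipy is available; the band edges, filter order, and sample rate are illustrative assumptions rather than values from this application.

```python
# Minimal sketch of removing an interfering frequency band (e.g. background
# music or hum); the 4th-order filter and the band edges are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

def remove_band(audio: np.ndarray, sr: int, low_hz: float, high_hz: float) -> np.ndarray:
    nyq = sr / 2.0
    b, a = butter(4, [low_hz / nyq, high_hz / nyq], btype="bandstop")
    return filtfilt(b, a, audio)

# e.g. suppress a hypothetical 50-300 Hz background band in 16 kHz speech:
# clean = remove_band(audio, sr=16000, low_hz=50.0, high_hz=300.0)
```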

Optionally, the real-person voice data includes real-person voice broadcast data of literary carriers, such as real-person voice broadcasts of novels, poetry, prose, operas, and the like.

Using real-person voice broadcast data of literary carriers as the real-person voice data has the following advantages over traditional studio recording: 1) style expression and emotional fluctuation are more natural, and role play is more vivid and lifelike; 2) whether a piece of real-person voice broadcast data is usable can be evaluated in real time, whereas with studio recording, even if a speaker is screened in advance, a long recording session may still produce recordings whose quality does not reach the standard or whose style is not expressed adequately; 3) the approach can be quickly replicated to more literary-carrier speakers and to real-person voice broadcast data of different types of literary carriers.

B. Segment each piece of real-person voice data, and obtain the text corresponding to each segmented piece of voice data as a sample text.

In one embodiment, the real-person voice data is segmented in units of at least one of characters, words, or sentences to obtain multiple pieces of voice data; for example, segmenting in units of "words" yields multiple pieces of "word voice data", and segmenting in units of "sentences" yields multiple pieces of "sentence voice data". Speech recognition is then performed on each piece of voice data using an existing speech recognition algorithm to obtain the text corresponding to each piece, and that text is used as a sample text in the training data set. Speech recognition algorithms include, but are not limited to, dynamic time warping, hidden Markov models based on parametric models, vector quantization based on non-parametric models, and the like.

Optionally, segmenting the real-person voice data in units of sentences includes segmenting the acquired data with an existing voice segmentation method, for example according to pause duration or according to the decibel level of the speech. Acquiring a preset amount of real-person voice data and removing background music and/or noise from each piece reduces interference in the acquired data and ensures the accuracy and reliability of subsequent acoustic model training; segmenting each piece of real-person voice data and taking the text corresponding to each segment as a sample text keeps the sample text consistent with the voice data, so that subsequent acoustic model training can proceed smoothly.
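A pause-based segmentation of this kind might be sketched as follows, assuming the librosa package; the top_db threshold plays the role of the pause/decibel criterion described above and is an illustrative assumption. Each resulting segment would then be passed to a speech recognizer to obtain its sample text.

```python
# Minimal sketch of segmenting real-person voice data at pauses.
import librosa

def split_by_pause(path: str, top_db: float = 30.0):
    audio, sr = librosa.load(path, sr=16000)
    # intervals of non-silence, delimited by pauses quieter than top_db
    intervals = librosa.effects.split(audio, top_db=top_db)
    return [audio[start:end] for start, end in intervals], sr
```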

By obtaining the sample text in the training data set, a foundation is laid for determining the corresponding label information according to the sample text subsequently.

S102, determining label information corresponding to the sample text based on an acoustic model trained by adopting an unsupervised training method in advance; wherein the tag information includes style information and/or role information.

An unsupervised training method completes model training without human involvement, autonomously mining the relationships within the data.

In one embodiment, an unsupervised training method is used to build the acoustic model, and an existing clustering algorithm, including but not limited to k-means clustering, hierarchical clustering, DBSCAN density clustering, or grid clustering, is used to cluster the voice data and obtain a clustering result. Optionally, the clustering result includes label information corresponding to each piece of voice data; the label information includes style information and/or role information. For example, style information may include styles such as impassioned, vigorous, emotional, calm, sad, or fantastical, and role information may include roles such as female, male, elderly, or child. Finally, the label information corresponding to a piece of voice data is taken as the label information of the sample text corresponding to that voice data.
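As one concrete reading of this step, the sketch below derives cluster ids for utterances with k-means, one of the algorithms named above. Mean-pooled MFCCs stand in for whatever utterance-level representation the acoustic model actually learns, and the cluster count is an illustrative assumption; each cluster id would then be mapped to a style/role tag.

```python
# Minimal sketch: derive style/role cluster ids for utterances via k-means.
import numpy as np
import librosa
from sklearn.cluster import KMeans

def cluster_utterances(wavs, sr: int, n_clusters: int = 4) -> np.ndarray:
    embeddings = np.stack([
        librosa.feature.mfcc(y=w, sr=sr, n_mfcc=20).mean(axis=1) for w in wavs
    ])
    # each cluster id serves as an (initially unnamed) style/role label
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
```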

The label information corresponding to the sample text is determined based on the acoustic model trained by adopting an unsupervised training method in advance, so that the effect of automatically determining the label corresponding to the sample text is realized, and the label marking accuracy and efficiency are improved.

S103, training a text classification model based on the sample text and the label information corresponding to the sample text; and the text classification model is used for outputting corresponding label information according to the input text.

Types of text classification models include, but are not limited to, CNN (Convolutional Neural Network) models, Transformer models, and the like.

In one embodiment, the sample text and the obtained label information corresponding to it are used as training data, and an existing model training method is used to train the text classification model. The resulting text classification model labels text: a text to be labeled is input into the model, and the model outputs the corresponding label information.
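A CNN text classifier of the kind named above might look like the following PyTorch sketch; the vocabulary size, embedding width, filter widths, and label count are illustrative assumptions. It would be trained with cross-entropy on the (sample text, label) pairs produced by the acoustic model.

```python
# Minimal sketch of a convolutional text classifier mapping token ids to
# style/role label logits.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size: int = 5000, emb_dim: int = 128, n_labels: int = 4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, 64, kernel_size=k) for k in (2, 3, 4)
        )
        self.fc = nn.Linear(64 * 3, n_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:  # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)                  # (batch, emb, seq)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))                 # label logits
```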

The text classification model is trained based on the sample text and the label information corresponding to the sample text, so that the modeling of the text classification model is realized, manual participation is not needed, and the labor cost is saved.

According to the technical scheme of this embodiment, a sample text is acquired from the training data set, its label information is determined based on an acoustic model trained in advance with an unsupervised training method, and the text classification model is finally trained on the sample text and its corresponding label information. This realizes the technical effect of automatically determining the label information corresponding to sample texts, improves labeling accuracy and efficiency, and correspondingly improves the training speed of the text classification model.

Fig. 2A is a flowchart of a model training method disclosed in an embodiment of the present application, which is further optimized and expanded based on the above technical solution, and can be combined with the above optional embodiments. As shown in fig. 2A, the method may include:

s201, training data in a training data set are obtained, wherein the training data comprise text features of sample texts and voice data corresponding to the sample texts.

The text features of a sample text are obtained by performing text analysis on it; text features include, but are not limited to, initial/final (shengmu/yunmu) sequences, character sequences, syllable sequences, and the like.
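For Chinese text, such features could be derived as in the sketch below, assuming the pypinyin package; the sample string is illustrative.

```python
# Minimal sketch: initial (shengmu), final (yunmu) and syllable sequences.
from pypinyin import pinyin, Style

text = "今天天气不错"  # illustrative sample text
initials  = [p[0] for p in pinyin(text, style=Style.INITIALS, strict=False)]
finals    = [p[0] for p in pinyin(text, style=Style.FINALS, strict=False)]
syllables = [p[0] for p in pinyin(text, style=Style.NORMAL)]
```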

S202, training a pre-constructed acoustic model based on the training data by adopting an unsupervised training method to establish a mapping relation between text features and acoustic features and obtain a clustering result for clustering the training data according to styles and/or roles.

The acoustic features include, but are not limited to, mel-frequency cepstral coefficients, acoustic energy, acoustic fundamental frequency, acoustic spectrum, and the like.
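These features can be extracted with standard tooling, for example with librosa as sketched below; the MFCC count and pitch range are illustrative assumptions.

```python
# Minimal sketch of extracting the acoustic features listed above.
import numpy as np
import librosa

def acoustic_features(audio: np.ndarray, sr: int):
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)        # cepstral coefficients
    energy = librosa.feature.rms(y=audio)                          # per-frame energy
    f0, _, _ = librosa.pyin(audio, fmin=65.0, fmax=400.0, sr=sr)   # fundamental frequency
    mel = librosa.feature.melspectrogram(y=audio, sr=sr)           # mel spectrum
    return mfcc, energy, f0, mel
```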

In one embodiment, the acoustic model is trained with an unsupervised training method comprising any one of: a VAE (Variational Autoencoder), a VQ-VAE (Vector Quantized Variational Autoencoder), a mutual-information method, or a GAN (Generative Adversarial Network). The acoustic model can be used to identify the acoustic features corresponding to text features: text features are input into the trained acoustic model, which outputs the corresponding acoustic features according to the established mapping between text features and acoustic features. During unsupervised training of the acoustic model, the voice data is clustered according to preset styles and/or roles to determine the style information and/or role information corresponding to each piece of voice data, yielding the clustering result.
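As one heavily simplified reading of the VAE variant, the sketch below maps text features to acoustic frames while a latent vector absorbs residual style/role variation; clustering the per-utterance latents would then yield the style/role groups. All dimensions are illustrative assumptions, and a production acoustic model would be a full sequence-to-sequence network rather than frame-wise linear layers.

```python
# Heavily simplified VAE-style acoustic model sketch (not the application's
# actual architecture): text features condition the decoder, and the latent z
# absorbs style/role variation that can later be clustered.
import torch
import torch.nn as nn

class VAEAcousticModel(nn.Module):
    def __init__(self, text_dim: int = 256, mel_dim: int = 80, z_dim: int = 16):
        super().__init__()
        self.enc = nn.Linear(mel_dim, 2 * z_dim)  # audio frame -> (mu, logvar)
        self.dec = nn.Sequential(
            nn.Linear(text_dim + z_dim, 256), nn.ReLU(), nn.Linear(256, mel_dim)
        )

    def forward(self, text_feat: torch.Tensor, mel_frame: torch.Tensor):
        mu, logvar = self.enc(mel_frame).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        recon = self.dec(torch.cat([text_feat, z], dim=-1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon, kl  # train with mse(recon, mel_frame) + beta * kl
```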

S203, obtaining a sample text in the training data set.

S204, inputting the voice data corresponding to the sample text into the trained acoustic model, and obtaining the label information, output by the acoustic model, corresponding to the sample text.

In one embodiment, since the trained acoustic model has already produced the clustering result for the voice data corresponding to each sample text, inputting any piece of voice data into the trained model causes it to output the clustering result for that data, i.e., the corresponding style information and/or role information; this output is then used as the label information of the sample text corresponding to the voice data.

S205, training a text classification model based on the sample text and the label information corresponding to the sample text; and the text classification model is used for outputting corresponding label information according to the input text.

In one embodiment, the label information corresponding to each sample text is acquired by inputting the speech data corresponding to that sample text into the acoustic model; the text classification model is then trained on the sample texts in the training data set and this label information, yielding a trained text classification model.

Fig. 2B is a schematic diagram of model training disclosed in an embodiment of the present application: unsupervised acoustic model training is performed using the features 21 of the sample texts 20 and the voice data 22 corresponding to the sample texts 20; the clustering result obtained by clustering according to styles and/or roles during the unsupervised acoustic model training is used as the label information 23 corresponding to the sample texts 20; and the text classification model is then trained on the sample texts 20 and the label information 23.

According to the technical scheme of this embodiment, training data in the training data set is acquired and an unsupervised training method is used to train a pre-constructed acoustic model, establishing a mapping between text features and acoustic features and producing a clustering result that groups the training data by style and/or role. This realizes the modeling of the acoustic model and lays a foundation for automatically determining acoustic features from text features; because the training data is clustered by style and/or role, the style and/or role information of voice data is determined automatically. Inputting the voice data corresponding to a sample text into the trained acoustic model and obtaining the label information output by the model achieves the technical effect of automatically determining the label information corresponding to the sample text, improves labeling accuracy and efficiency, and correspondingly improves the training speed of the text classification model.

With the boom in audio novels, radio dramas, and podcasts of all kinds, it is increasingly common for people to use fragmented time to obtain information and entertainment content by listening. Speech synthesis is a technology for converting text into speech and can technically supply large amounts of speech data.

However, current speech synthesis still falls noticeably short of real-person voice broadcasting: the synthesized speech has a single style and a single role, with no emotional fluctuation. Users therefore tire of it easily, product usage time cannot be extended, and the user experience is poor.

Fig. 3A is a flowchart of a speech synthesis method disclosed in an embodiment of the present application, which can be applied to a case where speech data corresponding to a text to be synthesized is obtained according to the text to be synthesized. The method of the present embodiment may be performed by a speech synthesis apparatus, which may be implemented by software and/or hardware, and may be integrated on any electronic device with computing capability, such as a server.

As shown in fig. 3A, the speech synthesis method disclosed in this embodiment may include:

s301, inputting a text to be synthesized into a text classification model trained in advance, and obtaining label information corresponding to the text to be synthesized output by the classification model; the label information comprises style information and/or role information; the text classification model is a model trained by using the model training method described in this embodiment.

In an embodiment, since the text classification model is obtained by training according to the sample text and the label information corresponding to the sample text, the text to be synthesized is input into the trained text classification model, and the text classification model can output the label information corresponding to the text to be synthesized.

Optionally, the text to be synthesized includes literary carrier text to be synthesized, such as novel text, poetry text, prose text, opera text, and the like.

By taking literary carrier text as the text to be synthesized, richer and more colorful literary-carrier voice broadcast content in different styles and with multiple roles can ultimately be obtained, meeting users' demand for entertainment content.

S302, inputting the text features of the text to be synthesized and the label information corresponding to the text to be synthesized into an acoustic model trained in advance by an unsupervised training method, and obtaining the acoustic features, output by the acoustic model, that correspond to the text features and the label information.

In one embodiment, text analysis is performed on the text to be synthesized to obtain its text features. The text features and the label information obtained in S301 are input together into the trained acoustic model, and the acoustic model outputs the acoustic features corresponding to the text features and the label information, according to the established mapping between text features and acoustic features and the clustering result obtained by clustering the training data by style and/or role. In other words, the acoustic model is controlled through the label information, so that the predicted acoustic features not only correspond to the text content to be synthesized but also carry the style and/or role information indicated by the label information.
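The conditioning described here might be sketched as follows: the predicted label id is embedded and concatenated with the text features before a decoder predicts acoustic features. The names, label inventory, and dimensions are illustrative assumptions.

```python
# Minimal sketch of label-conditioned acoustic feature prediction.
import torch
import torch.nn as nn

N_LABELS, LABEL_DIM = 4, 16          # hypothetical label inventory
label_emb = nn.Embedding(N_LABELS, LABEL_DIM)

def predict_acoustics(decoder: nn.Module, text_feat: torch.Tensor, label_id: int):
    # broadcast one label embedding across all text-feature frames
    lab = label_emb(torch.tensor([label_id])).expand(text_feat.size(0), -1)
    return decoder(torch.cat([text_feat, lab], dim=-1))  # predicted acoustic frames
```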

S303, performing voice synthesis on the text to be synthesized based on the acoustic features to obtain voice data corresponding to the text to be synthesized.

In one embodiment, the acoustic features output by the acoustic model are input into a preset vocoder, which performs speech synthesis based on them and outputs the speech data corresponding to the text to be synthesized.
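The final step can be approximated as in the sketch below, using Griffin-Lim inversion (via librosa) as a stand-in for the preset vocoder; a neural vocoder would normally take its place. The sample rate and output path are illustrative assumptions.

```python
# Minimal end-of-pipeline sketch: mel spectrogram -> waveform -> wav file.
import numpy as np
import librosa
import soundfile as sf

def mel_to_wav(mel: np.ndarray, sr: int = 22050, out_path: str = "synth.wav") -> np.ndarray:
    audio = librosa.feature.inverse.mel_to_audio(mel, sr=sr)  # Griffin-Lim inversion
    sf.write(out_path, audio, sr)
    return audio
```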

Fig. 3B is a schematic diagram of speech synthesis disclosed in an embodiment of the present application: a text 30 to be synthesized undergoes a text analysis operation 31 to obtain text features 32; the text 30 is input into a trained text classification model 33 to obtain tag information 34 corresponding to it; the obtained text features 32 and the tag information 34 are input together into a trained acoustic model 35 to obtain acoustic features 36 corresponding to the text features 32 and the tag information 34; and finally the acoustic features 36 are input into a preset vocoder 37 for speech synthesis, obtaining speech data 38 corresponding to the text 30 to be synthesized.

According to the technical scheme of this embodiment, the text to be synthesized is input into a pre-trained text classification model to obtain its corresponding label information; the text features of the text to be synthesized and that label information are input into an acoustic model trained in advance with an unsupervised training method to obtain the acoustic features corresponding to the text features and the label information; and finally, speech synthesis is performed on the text to be synthesized based on the acoustic features to obtain the corresponding voice data. This realizes the generation of multi-style, multi-role, emotionally rich speech data that is closer to a real person's reading style, greatly increasing both the length of time users listen to the speech data and their listening experience.

Fig. 4 is a schematic structural diagram of a model training apparatus disclosed in an embodiment of the present application, applicable to training a text classification model without supervision. The apparatus of this embodiment can be implemented in software and/or hardware and integrated on any electronic device with computing capability, such as a server.

As shown in fig. 4, the model training apparatus 40 disclosed in this embodiment may include a sample text obtaining module 41, a label information determining module 42, and a text classification model training module 43, where:

a sample text obtaining module 41, configured to obtain a sample text in a training data set;

the tag information determining module 42 is configured to determine, based on an acoustic model trained by using an unsupervised training method in advance, tag information corresponding to the sample text; the label information comprises style information and/or role information;

a text classification model training module 43, configured to train a text classification model based on the sample text and the label information corresponding to the sample text; and the text classification model is used for outputting corresponding label information according to the input text.

Optionally, the training method of the acoustic model includes:

acquiring training data in the training data set, wherein the training data comprises text features of sample texts and voice data corresponding to the sample texts;

and training a pre-constructed acoustic model based on the training data by adopting an unsupervised training method to establish a mapping relation between text characteristics and acoustic characteristics and obtain a clustering result for clustering the training data according to styles and/or roles.

Optionally, the clustering result includes label information corresponding to each voice data; the tag information determining module 42 is specifically configured to:

and inputting the voice data corresponding to the sample text into the trained acoustic model to obtain the label information corresponding to the sample text output by the acoustic model.

Optionally, the method for generating the sample text includes:

acquiring a preset number of real person voice data, and executing background music and/or noise removal operation on each real person voice data;

and segmenting each real person voice data, and acquiring texts corresponding to each segmented voice data as sample texts.

Optionally, the real-person voice data includes: real-person voice broadcast data of a literary carrier.

The model training device 40 disclosed in the embodiment of the present application can execute any model training method disclosed in the embodiment of the present application, and has functional modules and beneficial effects corresponding to the execution method. The contents not described in detail in this embodiment may refer to the description in any embodiment of the model training method in this application.

Fig. 5 is a schematic structural diagram of a speech synthesis apparatus disclosed in an embodiment of the present application, which is applicable to a case where speech data corresponding to a text to be synthesized is obtained according to the text to be synthesized. The apparatus of the embodiment can be implemented by software and/or hardware, and can be integrated on any electronic device with computing capability, such as a server.

As shown in fig. 5, the speech synthesis apparatus 50 disclosed in this embodiment may include a tag information acquisition module 51, an acoustic feature acquisition module 52, and a speech synthesis module 53, where:

the tag information obtaining module 51 is configured to input a text to be synthesized into a text classification model trained in advance, and obtain tag information corresponding to the text to be synthesized output by the classification model; the label information comprises style information and/or role information; the text classification model is a model trained by using a model training method disclosed by the application;

the acoustic feature obtaining module 52 is configured to input the text features of the text to be synthesized and the tag information corresponding to the text to be synthesized into an acoustic model trained in advance by an unsupervised training method, and obtain the acoustic features, output by the acoustic model, that correspond to the text features and the tag information;

and a speech synthesis module 53, configured to perform speech synthesis on the text to be synthesized based on the acoustic features, and obtain speech data corresponding to the text to be synthesized.

Optionally, the text to be synthesized includes: a literary carrier text to be synthesized.

The speech synthesis apparatus 50 disclosed in the embodiment of the present application can execute the speech synthesis method disclosed in the embodiment of the present application, and has functional modules and beneficial effects corresponding to the execution method. The contents not described in detail in this embodiment may refer to the description in any embodiment of the speech synthesis method in this application.

According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.

Fig. 6 is a block diagram of an electronic device for a model training method and/or a speech synthesis method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the present application described and/or claimed herein.

As shown in fig. 6, the electronic apparatus includes: one or more processors 601, memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.

The memory 602 is a non-transitory computer-readable storage medium as provided herein. The memory stores instructions executable by at least one processor, so as to cause the at least one processor to perform the model training method and/or the speech synthesis method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the model training method and/or the speech synthesis method provided herein.

The memory 602, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the model training method and/or the speech synthesis method in the embodiments of the present application (e.g., the sample text acquisition module 41, the tag information determination module 42, and the text classification model training module 43 shown in fig. 4, and/or the tag information acquisition module 51, the acoustic feature acquisition module 52, and the speech synthesis module 53 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing, i.e., implements the model training method and/or the speech synthesis method in the above-described method embodiments, by executing the non-transitory software programs, instructions, and modules stored in the memory 602.

The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device by the model training method and/or the speech synthesis method, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, and these remote memories may be connected over a network to the electronics of the model training method and/or the speech synthesis method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device of the model training method and/or the speech synthesis method may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.

The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the model training method and/or the speech synthesis method, such as an input device of a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

According to the technical scheme of the embodiments of the application, a sample text is acquired from the training data set, its label information is determined based on an acoustic model trained in advance with an unsupervised training method, and the text classification model is finally trained on the sample text and its corresponding label information; this realizes the technical effect of automatically determining the label information corresponding to sample texts, improves labeling accuracy and efficiency, and correspondingly improves the training speed of the text classification model;

the method comprises the steps of inputting a text to be synthesized into a text classification model trained in advance, obtaining label information corresponding to the text to be synthesized output by the classification model, inputting text characteristics of the text to be synthesized and the label information corresponding to the text to be synthesized into an acoustic model trained in advance by an unsupervised training method, obtaining the text characteristics output by the acoustic model and acoustic characteristics corresponding to the label information, finally carrying out voice synthesis on the text to be synthesized based on the acoustic characteristics, obtaining voice data corresponding to the text to be synthesized, achieving the generation of multi-style, multi-role and rich-emotion voice data, being closer to the reading style of a real person, and greatly improving the time and experience of a user for listening to the voice data.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
