Speech synthesis method, system, device and storage medium based on neural network

Document No.: 702070    Publication date: 2021-04-13

Note: This technology, 基于神经网络的语音合成方法、系统、设备及存储介质 (Speech synthesis method, system, device and storage medium based on neural network), was created by 陈子浩, 罗超, 周明康, 邹宇, 李巍 and 严丽 on 2020-12-15. Abstract: The invention provides a speech synthesis method, system, device and storage medium based on a neural network. The method comprises: providing a first audio-text data set in pure Chinese and a second audio-text data set in pure English; preprocessing the first Chinese text and the first English text to obtain a second Chinese text and a second English text that retain only preset punctuation, performing word segmentation with a natural language processing algorithm in combination with each scene, and converting the Chinese text into pinyin; aligning the pure Chinese audio with the segmented second Chinese text and the pure English audio with the segmented second English text, inputting them into a neural network model, and establishing the mapping from pinyin to Chinese audio and the mapping from capitalized English words to English audio; and feeding the result into a trained vocoder to convert the Mel spectrum into audio. The invention can synthesize fluent audio for mixed Chinese-English text and achieves natural, lifelike synthesized speech without hiring a real person for recording.

1. A speech synthesis method based on a neural network is characterized by comprising the following steps:

S110, providing a first audio text data set in pure Chinese and a second audio text data set in pure English;

S120, preprocessing a first Chinese text in the first audio text data set and a first English text in the second audio text data set to obtain a second Chinese text and a second English text retaining only preset punctuation;

S130, performing word segmentation on the second Chinese text and the second English text using a natural language processing algorithm in combination with each scene, and converting the Chinese text into pinyin;

S140, aligning the audio in the first audio text data set with the second Chinese text after word segmentation, and aligning the audio in the second audio text data set with the second English text after word segmentation;

S150, inputting the aligned first audio text data set and the aligned second audio text data set into a neural network model, and respectively establishing a mapping from pinyin to Chinese audio and a mapping from capitalized English words to English audio by using a seq2seq encoder-decoder model;

and S160, feeding the Mel frequency spectrum into a trained vocoder to convert it into audio.

2. The method for speech synthesis based on a neural network of claim 1, wherein in step S120, the preset punctuation comprises a comma, a period and a question mark in English (half-width) form in the first Chinese text, and a comma, an apostrophe, a period and a question mark in English form in the first English text.

3. The method for speech synthesis based on a neural network as claimed in claim 1, wherein in step S130, the Arabic numerals in the English text are converted into English words.

4. The method for speech synthesis based on neural network as claimed in claim 1, wherein in step S140, a language tag is added to each text, and each phoneme in the converted pinyin text is converted into a corresponding dictionary index, thereby obtaining a vector for the neural network model to use.

5. The method for speech synthesis based on a neural network as claimed in claim 4, wherein in step S150, an end-to-end encoder-decoder neural network model is built using a bidirectional LSTM, a multi-layer CNN and fully connected layers, and the alignment relationship between phoneme vectors and the corresponding Mel-spectrum features is learned through an attention mechanism; after the aligned acoustic model is obtained, the text is converted into a Mel frequency spectrum.

6. The method of claim 5, wherein the neural network model employs two encoders, namely a Chinese encoder and an English encoder; in the training stage, the input text is fed into both encoders, and the final encoder output is selected according to the language label of the input.

7. The neural-network-based speech synthesis method of claim 5, wherein in the decoder decoding process, the model's audio is fed into a discriminator and the information output by the discriminator is fed into each step of decoding, thereby establishing a mapping between the discriminator output and the speaker's timbre; a fully connected layer follows the decoder to generate Mel-spectrum features of a specified dimension.

8. A neural network-based speech synthesis system for implementing the neural network-based speech synthesis method of claim 1, comprising:

the data set module is used for providing a first audio text data set in pure Chinese and a second audio text data set in pure English;

the preprocessing module is used for preprocessing a first Chinese text in the first audio text data set and a first English text in the second audio text data set to obtain a second Chinese text and a second English text that retain only the preset punctuation;

the text word segmentation module is used for performing word segmentation on the second Chinese text and the second English text using a natural language processing algorithm in combination with each scene, and converting the Chinese text into pinyin;

the text alignment module aligns the audio in the first audio text data set with the second Chinese text after word segmentation, and aligns the audio in the second audio text data set with the second English text after word segmentation;

the audio mapping module is used for inputting the aligned first audio text data set and the aligned second audio text data set into a neural network model, and respectively establishing a mapping from pinyin to Chinese audio and a mapping from capitalized English words to English audio by using a seq2seq encoder-decoder model;

and the audio generation module is used for feeding the Mel frequency spectrum into the trained vocoder to convert it into audio.

9. A neural network-based speech synthesis apparatus, comprising:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the steps of the neural network-based speech synthesis method of any one of claims 1-7 via execution of the executable instructions.

10. A computer-readable storage medium storing a program which, when executed, implements the steps of the neural network-based speech synthesis method of any one of claims 1 to 7.

Technical Field

The present invention relates to the field of speech synthesis, and in particular, to a speech synthesis method, system, device and storage medium based on a neural network.

Background

An online travel service company needs to call a great number of merchants and guests every day. Using an outbound robot built on speech synthesis, together with modules such as speech recognition, dialogue management, natural language understanding and natural language generation, to call hotels and customers can save a great deal of human resources. The goal of Chinese-English mixed speech synthesis is to synthesize audio for mixed Chinese-English text in a single person's voice; however, few people pronounce both Chinese and English fluently with a pleasant timbre, so recording such audio is very costly. Customer service agents who can speak mixed Chinese-English scripts are hard to find, which makes running a large volume of telephone services expensive and reduces the timeliness and flexibility of adding new mixed scripts. By contrast, audio-text data containing only Chinese or only English is readily available.

In addition, bugs may appear when internal enterprise services go live or are released, and the releaser can be reminded promptly by mail and telephone to fix them in time; since these services involve many English technical terms, a large amount of mixed Chinese-English scripted text needs to be read out during telephone broadcasts.

Accordingly, the present invention provides a method, system, device and storage medium for neural network based speech synthesis.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a speech synthesis method, system, device and storage medium based on a neural network that overcome the difficulties of the prior art: they can synthesize fluent audio for mixed Chinese-English text, with a natural and lifelike synthesized voice, without spending heavily to find a recorder fluent in both Chinese and English.

The embodiment of the invention provides a speech synthesis method based on a neural network, which comprises the following steps:

S110, providing a first audio text data set in pure Chinese and a second audio text data set in pure English;

S120, preprocessing a first Chinese text in the first audio text data set and a first English text in the second audio text data set to obtain a second Chinese text and a second English text retaining only preset punctuation;

S130, performing word segmentation on the second Chinese text and the second English text using a natural language processing algorithm in combination with each scene, and converting the Chinese text into pinyin;

S140, aligning the audio in the first audio text data set with the second Chinese text after word segmentation, and aligning the audio in the second audio text data set with the second English text after word segmentation;

S150, inputting the aligned first audio text data set and the aligned second audio text data set into a neural network model, and respectively establishing a mapping from pinyin to Chinese audio and a mapping from capitalized English words to English audio by using a seq2seq encoder-decoder model;

and S160, feeding the Mel frequency spectrum into a trained vocoder to convert it into audio.

Preferably, in step S120, the preset punctuation comprises a comma, a period and a question mark in English (half-width) form in the first Chinese text, and a comma, an apostrophe, a period and a question mark in English form in the first English text.

Preferably, in step S130, the Arabic numerals in the English text are converted into English words.

Preferably, in step S140, a language tag is added to each text, and each phoneme in the converted pinyin text is converted into a corresponding dictionary index, so as to obtain a vector for the neural network model to use.

Preferably, in step S150, an end-to-end encoder-decoder neural network model is built using a bidirectional LSTM, a multi-layer CNN and fully connected layers, and the alignment relationship between phoneme vectors and the corresponding Mel-spectrum features is learned through an attention mechanism; after the aligned acoustic model is obtained, the text is converted into a Mel frequency spectrum.

Preferably, the neural network model adopts two encoders, a Chinese encoder and an English encoder; in the training stage, the input text is fed into both encoders, and the final encoder output is selected according to the language label of the input.

Preferably, in the decoder decoding process, the model's audio is fed into a discriminator and the information output by the discriminator is fed into each step of decoding, thereby establishing a mapping between the discriminator output and the speaker's timbre; a fully connected layer follows the decoder to generate Mel-spectrum features of a specified dimension.

An embodiment of the present invention further provides a speech synthesis system based on a neural network, which is used to implement the above speech synthesis method based on the neural network, and the speech synthesis system based on the neural network includes:

the data set module is used for providing a first audio text data set in pure Chinese and a second audio text data set in pure English;

the preprocessing module is used for preprocessing a first Chinese text in the first audio text data set and a first English text in the second audio text data set to obtain a second Chinese text and a second English text that retain only the preset punctuation;

the text word segmentation module is used for performing word segmentation on the second Chinese text and the second English text using a natural language processing algorithm in combination with each scene, and converting the Chinese text into pinyin;

the text alignment module aligns the audio in the first audio text data set with the second Chinese text after word segmentation, and aligns the audio in the second audio text data set with the second English text after word segmentation;

the audio mapping module is used for inputting the aligned first audio text data set and the aligned second audio text data set into a neural network model, and respectively establishing a mapping from pinyin to Chinese audio and a mapping from capitalized English words to English audio by using a seq2seq encoder-decoder model;

and the audio generation module is used for feeding the Mel frequency spectrum into the trained vocoder to convert it into audio.

An embodiment of the present invention further provides a speech synthesis apparatus based on a neural network, including:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the steps of the neural network-based speech synthesis method described above via execution of the executable instructions.

Embodiments of the present invention also provide a computer-readable storage medium storing a program that, when executed, implements the steps of the above-described neural network-based speech synthesis method.

The invention provides a speech synthesis method, system, device and storage medium based on a neural network that can synthesize fluent audio for mixed Chinese-English text, with a natural and lifelike synthesized voice, without spending heavily to find a recorder fluent in both Chinese and English.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.

FIG. 1 is a flow chart of a neural network based speech synthesis method of the present invention.

FIG. 2 is a block diagram of a neural network based speech synthesis system of the present invention.

Fig. 3 is a schematic structural diagram of a neural network-based speech synthesis apparatus of the present invention.

Fig. 4 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, and thus their repetitive description will be omitted.

FIG. 1 is a flow chart of a neural network based speech synthesis method of the present invention. As shown in fig. 1, an embodiment of the present invention provides a speech synthesis method based on a neural network, including the following steps:

s110, providing a first audio text data set in pure Chinese and a second audio text data set in pure English.

S120, preprocessing the first Chinese text in the first audio text data set and the first English text in the second audio text data set to obtain a second Chinese text and a second English text retaining only preset punctuation.

S130, performing word segmentation on the second Chinese text and the second English text using a natural language processing algorithm in combination with each scene, and converting the Chinese text into pinyin.

S140, aligning the audio in the first audio text data set with the second Chinese text after word segmentation, and aligning the audio in the second audio text data set with the second English text after word segmentation.

S150, inputting the aligned first audio text data set and the aligned second audio text data set into a neural network model, and respectively establishing a mapping from pinyin to Chinese audio and a mapping from capitalized English words to English audio by using a seq2seq encoder-decoder model. Encoder-decoder is a very common model framework in deep learning: for example, the autoencoder of unsupervised learning is designed and trained with an encoding-decoding structure; the image-captioning applications popular in recent years use a CNN-RNN encoding-decoding framework; and neural machine translation (NMT) models are often an LSTM-LSTM encoding-decoding framework. seq2seq is one instance of the encoder-decoder structure; its basic idea is to use two RNNs, one as the encoder and the other as the decoder. The encoder compresses the input sequence into a vector of specified length, which can be regarded as the semantics of the sequence; this process is called encoding. The simplest way to obtain the semantic vector is to use the last hidden state of the input directly as the semantic vector C; alternatively, the last hidden state can be transformed to obtain the semantic vector, or all hidden states of the input sequence can be transformed to obtain the semantic variable.
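To make the encoder-decoder idea concrete, the following is a minimal PyTorch sketch (the GRU cells, layer sizes and 80-band mel output are illustrative assumptions, not the patent's actual architecture): the encoder compresses a phoneme-index sequence into its final hidden state, used as the semantic vector C, and the decoder generates spectrogram frames conditioned on it.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # Compress the whole input sequence; the last hidden state
        # plays the role of the semantic vector C described above.
        _, hidden = self.rnn(self.embedding(token_ids))
        return hidden                       # (1, batch, hidden_dim)

class Decoder(nn.Module):
    def __init__(self, out_dim=80, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRU(out_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, out_dim)

    def forward(self, prev_frames, context):
        # Generation is conditioned on C via the initial hidden state.
        out, _ = self.rnn(prev_frames, context)
        return self.proj(out)               # e.g. 80-dim mel frames

encoder, decoder = Encoder(vocab_size=100), Decoder()
tokens = torch.randint(0, 100, (2, 12))     # batch of phoneme indices
mel = decoder(torch.zeros(2, 30, 80), encoder(tokens))
print(mel.shape)                            # torch.Size([2, 30, 80])
```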

S160, feeding the Mel frequency spectrum into a trained vocoder to convert it into audio.

By having a native English speaker record the English audio and a native Chinese speaker record the Chinese audio, the neural network model of the invention ultimately yields a single voice that speaks both Chinese and English; information about overseas orders can then be broadcast in this way, reducing labor cost.

In a preferred embodiment, in step S120 the preset punctuation includes a comma, a period and a question mark in English (half-width) form in the first Chinese text, and a comma, an apostrophe, a period and a question mark in English form in the first English text.

In a preferred embodiment, in step S130, the Arabic numerals in the English text are converted into English words.

In a preferred embodiment, in step S140, a language tag is added to each text, and each phoneme in the converted pinyin text is converted into a corresponding dictionary index, so as to obtain a vector for the neural network model to use.

In a preferred embodiment, in step S150, an end-to-end encoder-decoder neural network model is built using a bidirectional LSTM, a multi-layer CNN and fully connected layers, and the alignment relationship between phoneme vectors and the corresponding Mel-spectrum features is learned through an attention mechanism. After the aligned acoustic model is obtained, the text is converted into a Mel frequency spectrum. The Long Short-Term Memory network (LSTM) is a recurrent neural network designed to address the long-term dependency problem of ordinary RNNs; all RNNs take the form of a chain of repeating neural network modules. Convolutional Neural Networks (CNNs) are a class of feedforward neural networks that involve convolution computations and have deep structure, and are among the representative algorithms of deep learning.
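The attention mechanism referred to above can be sketched as a generic additive (Bahdanau-style) attention module; the dimensions below are assumptions, and the patent does not specify which attention variant it uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Score every encoder (phoneme) position against the decoder state and
    take a weighted sum: a soft alignment between phonemes and mel frames."""
    def __init__(self, enc_dim=256, dec_dim=256, attn_dim=128):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, attn_dim)
        self.dec_proj = nn.Linear(dec_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim); enc_outputs: (batch, T, enc_dim)
        energy = self.score(torch.tanh(
            self.enc_proj(enc_outputs)
            + self.dec_proj(dec_state).unsqueeze(1))).squeeze(-1)
        weights = F.softmax(energy, dim=-1)      # (batch, T) alignment weights
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights

attn = AdditiveAttention()
context, weights = attn(torch.randn(2, 256), torch.randn(2, 17, 256))
print(context.shape, weights.shape)              # (2, 256) (2, 17)
```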

In a preferred embodiment, the neural network model adopts two encoders, a Chinese encoder and an English encoder; in the training stage, the input text is fed into both encoders, and the final encoder output is selected according to the language label of the input.

In a preferred embodiment, in the decoder decoding process, the model's audio is fed into a discriminator and the information output by the discriminator is fed into each step of decoding, thereby establishing a mapping between the discriminator output and the speaker's timbre; a fully connected layer follows the decoder to generate Mel-spectrum features of a specified dimension.

Because customer service agents who can speak mixed Chinese-English scripts are hard to find, running a large volume of telephone services is costly. By having a native English speaker record English audio and a native Chinese speaker record Chinese audio, and training a neural network model, a single voice that speaks both Chinese and English is finally obtained, and information about overseas orders can be broadcast in this way, reducing labor cost. The problem the invention solves is therefore: without a recorder fluent in both Chinese and English, record English audio from a native English speaker and Chinese audio from a native Chinese speaker, use a neural network model to learn the mapping from Chinese pinyin to Chinese audio and from capitalized English words to English audio, and synthesize the corresponding audio from an input text, so that a voice outbound robot can replace a real person and meet business needs quickly.

The invention discloses a neural-network-based speech synthesis technology for mixed Chinese-English text. It uses deep learning to construct a network structure, trains the model on pure Chinese audio from one speaker and pure English audio from another speaker, and can then synthesize audio for mixed Chinese-English text, converting text information into speech for voice broadcasting in the relevant scenes.

The invention provides a Chinese-English mixed speech synthesis method based on a neural network: a text to be synthesized that mixes Chinese and English is fed into the model, and the model synthesizes the corresponding audio. The invention mainly comprises the following steps: 1) first, preprocess the pure-Chinese audio-text data set and the pure-English audio-text data set to obtain texts containing only some punctuation plus Chinese and English, then perform word segmentation with an NLP word-segmentation algorithm adapted to the different scenes, and then convert the Chinese into pinyin; for example, '携程旅行网是中国最大的在线旅行服务公司' ('Ctrip travel network is the biggest online travel service company in China') is converted into 'xie2cheng2 lv3xing2 wang3 shi4 zhong1 guo2 zui4 da4 de5 zai4xian4 lv3xing2 fu2 wu4 gong1 si1', while the English data set additionally needs Arabic numerals and the like converted into English words, e.g. '32 dollars' into 'THIRTY-TWO DOLLARS'; 2) preprocess the recording data by program, force-align audio and text with a forced-alignment method, and add language tags to the preprocessed data for subsequent models to use; 3) feed the data into a neural network model and use a seq2seq encoder-decoder model to respectively establish the mapping from pinyin to Chinese audio and the mapping from capitalized English words to English audio; during decoding, feed the model's audio into a discriminator whose output is expected to match the true language label, and feed the discriminator's output into each step of decoding, establishing the mapping between the discriminator output and the speaker's timbre; 4) finally, feed the result into a trained vocoder to convert the Mel spectrum into audio.

In one embodiment, the invention provides a Chinese-English hybrid speech synthesis model based on a neural network, which comprises a text regularization stage, a data post-processing stage, acoustic modeling and a vocoder. The technology comprises the following steps:

Text regularization stage:

First, confirm that the texts and the audio correspond one to one; then regularize the Chinese text, deleting punctuation other than commas, periods and question marks, and converting the remaining punctuation to its English (half-width) form.

The Arabic numerals in the Chinese text are converted into Chinese characters according to how they are read in the actual scene. For example, 'order end number 6158' should be converted to 'order end number six one five eight', and 'now 22:20' should be converted to 'now twenty-two twenty'.
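Based only on the two examples above, the digit-by-digit and clock-time readings could be implemented as follows (the reading rules are assumptions inferred from those examples):

```python
DIGITS = "零一二三四五六七八九"

def read_digits(s: str) -> str:
    """Read a digit string one digit at a time, e.g. '6158' -> '六一五八'."""
    return "".join(DIGITS[int(c)] for c in s)

def read_number(n: int) -> str:
    """Read 0-99 in the colloquial style used for clock times."""
    if n < 10:
        return DIGITS[n]
    tens, ones = divmod(n, 10)
    return ("" if tens == 1 else DIGITS[tens]) + "十" + (DIGITS[ones] if ones else "")

def read_time(t: str) -> str:
    hours, minutes = t.split(":")
    return read_number(int(hours)) + "点" + read_number(int(minutes)) + "分"

print(read_digits("6158"))   # 六一五八 ("six one five eight")
print(read_time("22:20"))    # 二十二点二十分 ("twenty-two twenty")
```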

After the above processing, the Chinese is converted into pinyin format, for example: '语音合成' ('speech synthesis') is converted to 'yu3 yin1 he2 cheng2'.
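One way to perform this hanzi-to-pinyin step is the third-party pypinyin library, an assumed tool choice since the patent does not name one; Style.TONE3 appends tone numbers in the same notation as the example, and neutral_tone_with_five (available in recent pypinyin versions) writes the neutral tone as 5, matching 'de5' in the earlier example.

```python
from pypinyin import Style, lazy_pinyin

def to_pinyin(text: str) -> str:
    # Tone numbers are appended ("yu3"); the neutral tone comes out as 5.
    return " ".join(lazy_pinyin(text, style=Style.TONE3,
                                neutral_tone_with_five=True))

print(to_pinyin("语音合成"))  # yu3 yin1 he2 cheng2
```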

Punctuation in the English text other than commas, apostrophes, periods and question marks is deleted, and each remaining mark is converted to its English (half-width) form.

The Arabic numerals in the English text are converted into English words. For example, '10 dollars' is converted to 'ten dollars'; finally, all letters in the English words are converted to capitals in the same way.
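A sketch of this normalization using the third-party num2words package (an assumed dependency; any number-to-words routine would serve):

```python
import re
from num2words import num2words

def normalize_numbers(text: str) -> str:
    # Spell out every run of Arabic digits, then uppercase the whole text.
    spelled = re.sub(r"\d+", lambda m: num2words(int(m.group(0))), text)
    return spelled.upper()

print(normalize_numbers("10 dollars"))  # TEN DOLLARS
print(normalize_numbers("32 dollars"))  # THIRTY-TWO DOLLARS
```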

Data post-processing stage:

First, the regularized text is lightly processed and force-aligned with the audio using the Montreal Forced Aligner tool; the result is further processed into text the model can use, and a language label is added to each text for subsequent acoustic modeling. Each phoneme in the converted pinyin text is then converted into its dictionary index, yielding a vector for the subsequent model to use.
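The dictionary-index step might look like the following sketch; the symbol inventory and the language-tag encoding are illustrative assumptions:

```python
# A toy symbol table over pinyin syllables, capitalized English words and
# the retained punctuation; a real table would be built from the corpus.
symbols = ["<pad>", ",", ".", "?", "'",
           "yu3", "yin1", "he2", "cheng2",   # pinyin "phonemes"
           "TEN", "DOLLARS"]                 # capitalized English words
symbol_to_id = {s: i for i, s in enumerate(symbols)}
LANG_CN, LANG_EN = 0, 1                      # language tags

def encode(tokens, lang):
    # Map each token to its dictionary index; the index vector plus its
    # language tag is what the acoustic model consumes.
    return [symbol_to_id[t] for t in tokens], lang

print(encode(["yu3", "yin1", "he2", "cheng2", "."], LANG_CN))
# ([5, 6, 7, 8, 2], 0)
```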

Acoustic modeling:

The whole model is built with neural network structures such as a bidirectional LSTM, a multi-layer CNN and fully connected layers, and its framework is a seq2seq encoder-decoder model. To better learn the alignment between input text and audio, the model adds an attention mechanism. Because Chinese and English pronunciation differ greatly, the model adopts two encoders, a Chinese encoder and an English encoder; in the training stage, the text is fed into both encoders simultaneously, which reduces each encoder's error when encoding the other language, and the final encoder output is selected according to the language label of the input.

Vocoder:

The vocoder part uses a MelGAN generative adversarial network model to convert the Mel spectrum into audio.

In the specific implementation of the invention, the method is mainly divided into the following six parts: data set preparation, a text regularization module, a data post-processing module, an acoustic model, a vocoder and model training. The specific implementation steps are as follows:

step 1: data set preparation

The Chinese scripts in the data set are extracted and annotated from call recordings between hotel customer service and merchants, and the English scripts are extracted and annotated from overseas orders. Two dedicated human customer service agents were trained and then recorded in a studio, producing 10000 Chinese audio clips and 10000 English audio clips at 48 kHz, about 21 hours of audio in total; each clip was annotated and checked by dedicated staff.

Step 2: Text regularization module

First, check whether the text matches the audio. Once the data is correct, regularize the Chinese text: delete punctuation other than commas, periods and question marks, and convert the remaining punctuation to its English (half-width) form. The Arabic numerals in the Chinese text are converted into Chinese characters according to how they are read in the actual scene; for example, 'order end 3364' is converted to 'order end three three six four', and 'today 23:20' should be converted to 'today twenty-three twenty'. After this processing, the Chinese is converted into pinyin format, for example: '语音合成' ('speech synthesis') becomes 'yu3 yin1 he2 cheng2'. Punctuation in the English text other than commas, apostrophes, periods and question marks is deleted, and each remaining mark is converted to its English form. The Arabic numerals in the English text are converted into English words; for example, 'give me 5 books' is converted to 'give me five books', and finally all letters in the English words are converted to capitals.

Step 3: Data post-processing stage

First, all punctuation is removed, keeping only capitalized English words and pinyin characters; the text and audio are then force-aligned with the Montreal Forced Aligner (MFA) tool, matching audio to text content at the character level for Chinese characters and at the word level for English words, so that the subsequent model can learn the alignment better, and a language label is added to each text for subsequent acoustic modeling. Each character in the pinyin then passes through an embedding layer, converting the input text into vectors the model can exploit.

Step 4: Acoustic model modeling

The acoustic model is a neural network built from structures such as a bidirectional LSTM, a multi-layer CNN and fully connected layers; its main structure is an end-to-end encoder-decoder model, and an attention mechanism is used to better learn the alignment between characters and audio and to accelerate convergence. Because the pronunciation characteristics and habits of Chinese and English differ greatly, two encoders are adopted, named encoder_cn and encoder_en; during training the input is fed into both encoders, the encoder of the non-matching language is masked according to the input language label, and the final encoder outputs the result of the encoder whose language matches the label. In the decoder decoding process, the model's audio is fed into a discriminator whose output is expected to match the true language label, and the discriminator's output is fed into each step of decoding, establishing a mapping between the discriminator output and the speaker's timbre; a fully connected layer follows the decoder to generate Mel-spectrum features of the specified dimension.
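A sketch of the dual-encoder selection described here (layer sizes and the masking scheme are assumptions, and the discriminator branch is omitted):

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Two language-specific encoders; the language label decides whose
    output reaches the decoder, masking the non-matching encoder."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder_cn = nn.LSTM(emb_dim, hidden_dim,
                                  batch_first=True, bidirectional=True)
        self.encoder_en = nn.LSTM(emb_dim, hidden_dim,
                                  batch_first=True, bidirectional=True)

    def forward(self, token_ids, lang_label):
        x = self.embedding(token_ids)
        out_cn, _ = self.encoder_cn(x)   # both encoders see every batch,
        out_en, _ = self.encoder_en(x)   # as in the training stage above
        # lang_label: (batch,), 0 = Chinese, 1 = English.
        mask = lang_label.view(-1, 1, 1).float()
        return (1 - mask) * out_cn + mask * out_en

enc = DualEncoder(vocab_size=100)
tokens = torch.randint(0, 100, (2, 12))
labels = torch.tensor([0, 1])            # one Chinese, one English utterance
print(enc(tokens, labels).shape)         # torch.Size([2, 12, 256])
```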

Step 5: Vocoder

The vocoder part uses MelGAN; by training a MelGAN model, Mel-spectrum features can be synthesized into audio.
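Since the text only names MelGAN, the following is a drastically simplified stand-in showing the mel-to-waveform shape contract: transposed convolutions upsample an 80-band mel spectrogram by a factor of 256 (8 x 8 x 4) samples per frame. The real MelGAN adds residual dilated-convolution stacks and is trained adversarially against a discriminator.

```python
import torch
import torch.nn as nn

class TinyMelGANGenerator(nn.Module):
    def __init__(self, mel_bands=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(mel_bands, 256, kernel_size=7, padding=3),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(256, 128, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(64, 32, kernel_size=8, stride=4, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(32, 1, kernel_size=7, padding=3),
            nn.Tanh(),                       # waveform in [-1, 1]
        )

    def forward(self, mel):                  # mel: (batch, 80, frames)
        return self.net(mel)                 # (batch, 1, frames * 256)

audio = TinyMelGANGenerator()(torch.randn(1, 80, 100))
print(audio.shape)                           # torch.Size([1, 1, 25600])
```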

Step 6: Model training

The acoustic model and the vocoder are each trained separately.

First, the Montreal Forced Aligner tool is used to force-align the text and audio, and the text information is converted into vectors the model can use; the data is then fed into the acoustic model for training. The data volume is large and, to make the model more stable, training runs for 400,000 iterations, after which the loss has essentially converged and the text phonemes are aligned with the Mel spectrum. The vocoder is trained as a MelGAN generative adversarial network, converting Mel spectra into realistic audio.
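A toy version of the separate acoustic-model training loop; the stand-in model, random batch and plain L1 spectrogram loss are assumptions, and only the shape of the loop follows the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in acoustic model: phoneme embedding + projection to 80 mel bands.
model = nn.Sequential(nn.Embedding(100, 64), nn.Linear(64, 80))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(3):                        # the patent runs ~400,000 steps
    phoneme_ids = torch.randint(0, 100, (8, 20))  # dummy aligned batch
    mel_target = torch.randn(8, 20, 80)           # dummy mel-spectrum targets
    loss = F.l1_loss(model(phoneme_ids), mel_target)  # spectrogram regression
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.3f}")
```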

The invention discloses a Chinese-English mixed speech synthesis method based on a neural network, comprising four main modules. First, the text is regularized: the regularized Chinese text becomes a pinyin text containing only commas, periods and question marks in English (half-width) form, and the regularized English text contains only capitalized words plus commas, apostrophes, periods and question marks in English form. Each phoneme of the text is converted into a vector and fed into an encoder-decoder model; the neural network model is trained on GPUs, with an attention mechanism learning the alignment between phoneme vectors and the corresponding Mel-spectrum features. Once the aligned acoustic model is obtained, text is converted into a Mel spectrum, and a MelGAN model converts the Mel spectrum into audio. The method can synthesize fluent audio for mixed Chinese-English text, with a natural and lifelike synthesized voice, without spending heavily to find a recorder fluent in both Chinese and English.

FIG. 2 is a block diagram of a neural network based speech synthesis system of the present invention. As shown in fig. 2, the neural network-based speech synthesis system 5 of the present invention includes:

The data set module 51 provides a first audio text data set in pure Chinese and a second audio text data set in pure English.

The preprocessing module 52 preprocesses the first Chinese text in the first audio text data set and the first English text in the second audio text data set to obtain a second Chinese text and a second English text that retain only the preset punctuation.

The text word segmentation module 53 performs word segmentation on the second Chinese text and the second English text using a natural language processing algorithm in combination with each scene, and converts the Chinese text into pinyin.

The text alignment module 54 aligns the audio in the first audio text data set with the second Chinese text after word segmentation, and aligns the audio in the second audio text data set with the second English text after word segmentation.

The audio mapping module 55 inputs the aligned first audio text data set and second audio text data set into the neural network model, and respectively establishes the mapping from pinyin to Chinese audio and the mapping from capitalized English words to English audio using the seq2seq encoder-decoder model.

The audio generation module 56 feeds the Mel spectrum into the trained vocoder to convert it into audio.

The neural-network-based speech synthesis system of the invention can synthesize fluent audio for mixed Chinese-English text, with a natural and lifelike synthesized voice, without spending heavily to find a recorder fluent in both Chinese and English.

The embodiment of the invention also provides a neural-network-based speech synthesis device, comprising a processor and a memory in which executable instructions of the processor are stored; the processor is configured to perform the steps of the neural-network-based speech synthesis method by executing the executable instructions.

As shown above, the neural-network-based speech synthesis system of the present invention can synthesize fluent audio for mixed Chinese-English text, with a natural and lifelike synthesized voice, without spending heavily to find a recorder fluent in both Chinese and English.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "platform."

Fig. 3 is a schematic structural diagram of a neural network-based speech synthesis apparatus of the present invention. An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 3. The electronic device 600 shown in fig. 3 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 3, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one memory unit 620, a bus 630 connecting the different platform components (including the memory unit 620 and the processing unit 610), a display unit 640, etc.

Wherein the storage unit stores program code executable by the processing unit 610, so that the processing unit 610 performs the steps according to the various exemplary embodiments of the present invention described in the method sections above. For example, the processing unit 610 may perform the steps shown in fig. 1.

The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.

The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.

Embodiments of the present invention also provide a computer-readable storage medium for storing a program; when the program is executed, the steps of the neural-network-based speech synthesis method are implemented. In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present invention described in the method sections above.

As shown above, the neural-network-based speech synthesis system of the present invention can synthesize fluent audio for mixed Chinese-English text, with a natural and lifelike synthesized voice, without spending heavily to find a recorder fluent in both Chinese and English.

Fig. 4 is a schematic structural diagram of a computer-readable storage medium of the present invention. Referring to fig. 4, a program product 800 for implementing the above method according to an embodiment of the present invention is described; it may employ a portable compact disc read-only memory (CD-ROM) including program code, and may be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic or optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

In summary, the present invention provides a speech synthesis method, system, device and storage medium based on a neural network that can synthesize fluent audio for mixed Chinese-English text, with a natural and lifelike synthesized voice, without spending heavily to find a recorder fluent in both Chinese and English.

The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all of them shall be considered as belonging to the protection scope of the invention.
