Speech synthesis method and device

Document No.: 1659486; Publication date: 2019-12-27

Reading note: The technology "Speech synthesis method and device" was designed by Zhang Peng (张鹏) on 2018-06-19. Main content: The embodiment of the invention provides a speech synthesis method and a speech synthesis device. The method comprises: upon detecting that the speech to be synthesized includes an unknown (out-of-vocabulary) word, determining, according to a preset algorithm, the target candidate word in a preset dictionary that has the highest similarity to the unknown word, where the similarity comprises literal similarity and semantic similarity; and replacing the unknown word with the target candidate word to perform speech synthesis. The invention solves the problem that unknown words cannot be synthesized, without expanding the TTS corpus and without increasing the training complexity or decoding complexity of the TTS neural network. Because the similarity comprises both literal similarity and semantic similarity, target candidate words similar to the unknown word are screened along two dimensions, which ensures the accuracy of replacing the unknown word with the target candidate word.

1. A method of speech synthesis, comprising:

detecting that the voice to be synthesized comprises unknown words, and determining a target candidate word with the highest similarity with the unknown words in a preset dictionary according to a preset algorithm; the similarity comprises literal similarity and semantic similarity;

and replacing the unknown word with the target candidate word to perform voice synthesis.

2. The method according to claim 1, wherein the step of determining the target candidate word with the highest similarity to the unknown word in a preset dictionary according to a preset algorithm comprises:

determining candidate words in the preset dictionary according to a first preset rule; the literal similarity between the candidate word and the unknown word meets a first preset condition, and the semantic similarity between the candidate word and the unknown word meets a second preset condition;

and determining a target candidate word with the highest similarity with the unknown word in the candidate words according to a second preset rule.

3. The method according to claim 2, wherein the step of determining the candidate words in the preset dictionary according to the first preset rule comprises:

querying, according to a first preset formula, a first vocabulary in the preset dictionary, wherein the literal similarity between the first vocabulary and the unknown word meets a first preset condition;

and querying, according to a second preset formula, candidate words in the first vocabulary, wherein a candidate word is a word of the first vocabulary whose semantic similarity to the unknown word meets a second preset condition.

4. The method according to claim 3, wherein the step of searching for the first vocabulary in the predetermined dictionary whose literal similarity to the unknown word satisfies a first predetermined condition according to a first predetermined formula comprises:

querying the literal similarity between the vocabulary in the preset dictionary and the unknown word according to the following formula;

wherein $\mathrm{sim}_D$ is the literal similarity between a word in the preset dictionary and the unknown word; w is the unknown word, and w1 is a word in the preset dictionary;

and determining, among the vocabulary in the preset dictionary, a first vocabulary whose literal similarity to the unknown word meets a first preset condition.

5. The method according to claim 3, wherein the step of querying, according to a second predetermined formula, a candidate word in the first vocabulary, the candidate word being a first vocabulary with a semantic similarity to the unknown word satisfying a second predetermined condition, comprises:

determining semantic similarity between the first vocabulary and the unknown words according to the following formula;

$$s_G = \gamma\,[\,p(w_i \mid w_{i-1}, w_{i-2}) + p(w_i \mid w_{i+1}, w_{i+2})\,] + (1-\gamma)\,[\,p(w_i \mid w_{i-1}) + p(w_i \mid w_{i+1})\,]$$

determining candidate words in the first vocabulary, wherein the semantic similarity between the candidate words and the unknown words meets a second preset condition;

wherein $s_G$ is the semantic similarity between the first vocabulary and the unknown word; $w_i$ is the first vocabulary; $\gamma$ is a first preset weight;

$w_{i-1}$ is the word immediately preceding the unknown word in the sentence of the speech to be synthesized, taken along a preset direction; $w_{i-2}$ is the word immediately preceding $w_{i-1}$ in that sentence along the preset direction; $p(w_i \mid w_{i-1}, w_{i-2})$ is a first preset probability that $w_{i-2}$, $w_{i-1}$, $w_i$ are arranged in sequence along the preset direction in a preset language model; $p(w_i \mid w_{i-1})$ is a second preset probability that $w_{i-1}$, $w_i$ are arranged in sequence along the preset direction in the preset language model;

$w_{i+1}$ is the word immediately following the unknown word in the sentence of the speech to be synthesized, taken along the preset direction; $w_{i+2}$ is the word immediately following $w_{i+1}$ in that sentence along the preset direction; $p(w_i \mid w_{i+1}, w_{i+2})$ is a third preset probability that $w_i$, $w_{i+1}$, $w_{i+2}$ are arranged in sequence along the preset direction in the preset language model; $p(w_i \mid w_{i+1})$ is a fourth preset probability that $w_i$, $w_{i+1}$ are arranged in sequence along the preset direction in the preset language model.

6. The method according to claim 2, wherein the step of determining, according to a second preset rule, a target candidate word with the highest similarity to the unknown word in the candidate words comprises:

according to a preset weight, carrying out weighted summation on the literal similarity and the semantic similarity of the candidate words to obtain a comprehensive weight of the candidate words; and the candidate word with the highest comprehensive weight is the target candidate word.

7. The method according to claim 1, wherein the step of performing speech synthesis by replacing the unknown word with the target candidate word comprises:

placing the target candidate word at the position of the unknown word in the sentence of the voice to be synthesized, and performing voice synthesis.

8. A speech synthesis apparatus, characterized in that the apparatus comprises:

the detection module is used for detecting that the voice to be synthesized comprises unknown words, and determining a target candidate word with the highest similarity with the unknown words in a preset dictionary according to a preset algorithm; the similarity comprises literal similarity and semantic similarity;

and the synthesis module is used for replacing the unknown words with the target candidate words to carry out voice synthesis.

9. An electronic device, comprising a memory, a processor, a bus, and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the speech synthesis method according to any one of claims 1 to 7 when executing the program.

10. A non-transitory computer-readable storage medium having stored thereon a computer program, characterized in that: the program when executed by a processor implements the steps in a speech synthesis method according to any one of claims 1 to 7.

Technical Field

The embodiment of the invention relates to the technical field of information processing, in particular to a voice synthesis method and a voice synthesis device.

Background

Speech synthesis technology (Text To Speech, TTS) generates artificial speech by mechanical or electronic means. TTS is now widely used in daily life. For example, the relatively common statistical speech synthesis technology based on Hidden Markov Models (HMMs) achieves voice conversion by adjusting model parameters, and its synthesis quality is relatively good. On the other hand, users' requirements for speech synthesis systems keep rising: HMM-synthesized sound is still overly smooth and dull, lacks detail, and has low naturalness, all of which affect the sound quality.

To meet these requirements, researchers proposed an effective method for building multilayer neural networks, namely deep neural networks, on unsupervised data, which set off a wave of research in machine learning and neural networks. Developers have also begun to combine deep neural networks with speech processing techniques widely, and have made breakthroughs in speech synthesis. For example, an end-to-end speech synthesis system has been developed that receives characters as input, outputs the corresponding raw spectrogram, and passes it to the Griffin-Lim reconstruction algorithm (an iterative algorithm that estimates a waveform from a magnitude spectrogram) to produce speech directly.

However, because of limits on memory and on Central Processing Unit (CPU) computation time, the TTS corpus can cover only common expressions. Unknown words, also called out-of-vocabulary (OOV) words, that is, words that are not recorded in the word segmentation vocabulary but must still be segmented out, usually cannot be synthesized, and only noise is produced. Moreover, with the rapid development of internet technology, speech synthesis inevitably encounters more and more OOV words. At present, the problem is mostly addressed by expanding the TTS corpus, but the training complexity and decoding complexity of a TTS neural network depend heavily on the size of the corpus, and complexity grows rapidly as the corpus grows, which is unfavorable for speech synthesis.

Disclosure of Invention

The embodiment of the invention provides a voice synthesis method and a voice synthesis device, which are used for solving the problem that TTS (text to speech) in the prior art cannot synthesize unknown words.

In one aspect, an embodiment of the present invention provides a speech synthesis method, where the method includes:

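detecting that the voice to be synthesized comprises unknown words, and determining a target candidate word with the highest similarity with the unknown words in a preset dictionary according to a preset algorithm; the similarity comprises literal similarity and semantic similarity;

and replacing the unknown word with the target candidate word to perform voice synthesis.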

In another aspect, an embodiment of the present invention provides a speech synthesis apparatus, where the apparatus includes:

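the detection module is used for detecting that the voice to be synthesized comprises unknown words, and determining a target candidate word with the highest similarity with the unknown words in a preset dictionary according to a preset algorithm; the similarity comprises literal similarity and semantic similarity;

and the synthesis module is used for replacing the unknown words with the target candidate words to carry out voice synthesis.

In another aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, a bus, and a computer program stored in the memory and executable on the processor, where the processor implements the steps in the speech synthesis method when executing the program.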

In still another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps in the above-mentioned speech synthesis method.

According to the voice synthesis method and device provided by the embodiment of the invention, in the voice synthesis process, when the situation that the voice to be synthesized comprises the unknown word is detected, the target candidate word with the highest similarity with the unknown word in the preset dictionary is determined according to the preset algorithm; the target candidate words are used for replacing the unknown words to carry out voice synthesis, so that the problem that the unknown words cannot be subjected to voice synthesis is solved, a TTS corpus does not need to be expanded, and the training complexity and the decoding complexity of a TTS neural network cannot be increased; and the similarity comprises the literal similarity and the semantic similarity, and the target candidate words similar to the unknown words are screened through two dimensions, so that the accuracy of replacing the unknown words by the target candidate words is ensured.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a flowchart illustrating a speech synthesis method according to an embodiment of the present invention;

Fig. 2 is a second schematic flowchart of a speech synthesis method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart illustrating a speech synthesis method according to an embodiment of the present invention.

As shown in fig. 1, the speech synthesis method provided in the embodiment of the present invention specifically includes the following steps:

Step 101, detecting that the voice to be synthesized comprises unknown words, and determining a target candidate word with the highest similarity with the unknown words in a preset dictionary according to a preset algorithm; the similarity comprises literal similarity and semantic similarity.

Unknown words are words that are not contained in the TTS corpus but must be segmented out during speech synthesis. During speech synthesis, text preprocessing is first performed on the speech to be synthesized, and the input text is converted into word vectors; if an unknown word is detected during this conversion, a target candidate word is searched for in a preset dictionary.

The target candidate word is the word with the highest similarity to the unknown word. The similarity is determined according to a preset algorithm and comprises literal similarity and semantic similarity; literal similarity is the similarity between the written forms of the two words, and semantic similarity is the similarity between their meanings.

Step 102, replacing the unknown word with the target candidate word to perform voice synthesis.

Specifically, the target candidate word is placed at the position of the unknown word in the sentence of the speech to be synthesized, and speech synthesis is performed. Speech synthesis is usually performed sentence by sentence; in the sentence containing the unknown word, the target candidate word takes the position of the unknown word, and text preprocessing is then performed on the sentence.
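Steps 101 and 102 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `replace_unknown_words`, the `pick_substitute` callback, and the toy vocabulary are all hypothetical, and the sentence is assumed to be already segmented into words.

```python
from typing import Callable, Iterable, List


def replace_unknown_words(
    tokens: List[str],
    corpus_vocab: Iterable[str],
    pick_substitute: Callable[[List[str], int], str],
) -> List[str]:
    """Return a copy of tokens in which every unknown word is replaced."""
    vocab = set(corpus_vocab)
    result = list(tokens)
    for index, token in enumerate(tokens):
        if token not in vocab:
            # The substitute takes the sentence position of the unknown word.
            result[index] = pick_substitute(tokens, index)
    return result


# Toy data (assumed): "D" is not in the vocabulary, and the stand-in
# picker always answers "H" instead of running a similarity search.
sentence = ["A", "B", "C", "D", "E", "F"]
vocabulary = {"A", "B", "C", "E", "F", "H"}
replaced = replace_unknown_words(sentence, vocabulary, lambda toks, i: "H")
print(replaced)
```

The replaced token list would then go through the usual text preprocessing and synthesis path, which is outside the scope of this sketch.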

In the embodiment of the invention, in the process of voice synthesis, when detecting that the voice to be synthesized comprises unknown words, determining a target candidate word with the highest similarity with the unknown words in a preset dictionary according to a preset algorithm; the target candidate words are used for replacing the unknown words to carry out voice synthesis, so that the problem that the unknown words cannot be subjected to voice synthesis is solved, a TTS corpus does not need to be expanded, and the training complexity and the decoding complexity of a TTS neural network cannot be increased; the similarity comprises the literal similarity and the semantic similarity, and a target candidate word similar to the unknown word is screened through two dimensions, so that the accuracy of replacing the unknown word by the target candidate word is ensured; the invention solves the problem that the TTS in the prior art can not synthesize the unknown words.

It should be noted that, in the embodiment of the present invention, a "word" or "vocabulary" may include any number of Chinese characters or other characters, and the segmentation of a specific "word" or "vocabulary" is determined by the text preprocessing.

As shown in fig. 2, another embodiment of the present invention provides a speech synthesis method, which specifically includes the following steps:

Step 201, detecting that the speech to be synthesized includes unknown words.

The unknown words are words which are not contained in a TTS corpus but must be segmented in the speech synthesis process; in the process of speech synthesis, firstly, text preprocessing is performed on the speech to be synthesized, and the input text is converted into word vectors, and in the process of conversion, if an unknown word is detected, step 202 is executed.

Step 202, determining candidate words in the preset dictionary according to a first preset rule; the literal similarity between the candidate word and the unknown word meets a first preset condition, and the semantic similarity between the candidate word and the unknown word meets a second preset condition;

That is, the first preset rule first screens a subset of candidate words out of the preset dictionary: words whose literal similarity to the unknown word meets the first preset condition and whose semantic similarity to the unknown word meets the second preset condition. Literal similarity is the similarity between the written forms of the two words, and semantic similarity is the similarity between their meanings.

Step 203, determining a target candidate word with the highest similarity to the unknown word in the candidate words according to a second preset rule; the similarity comprises literal similarity and semantic similarity.

After a part of candidate words are screened out from the preset dictionary, the candidate words with the highest similarity are screened out from the candidate words to serve as target candidate words according to a second preset rule; the similarity comprises the literal similarity and the semantic similarity, and the target candidate words similar to the unknown words are screened through two dimensions, so that the accuracy of replacing the unknown words by the target candidate words is ensured.

Step 204, replacing the unknown word with the target candidate word, and performing voice synthesis.

Specifically, the target candidate word is placed at the position of the unknown word in the sentence of the voice to be synthesized, and voice synthesis is performed. Speech synthesis is usually performed sentence by sentence; in the sentence containing the unknown word, the target candidate word takes the position of the unknown word, and text preprocessing is then performed on the sentence.

The target candidate word is vectorized with a preset tool, and the resulting word vectors are input into a preset language model to finally synthesize the voice.

Optionally, in this embodiment of the present invention, step 202 includes:

When screening candidate words, a first vocabulary meeting a first preset condition is first selected from the preset dictionary according to a first preset formula. The first preset condition constrains the literal similarity: it may require that the literal similarity between a first-vocabulary word and the unknown word be greater than a preset threshold, or that the first vocabulary consist of a preset number of words with the greatest literal similarity to the unknown word.

After the first vocabulary is determined, candidate words whose semantic similarity meets a second preset condition are searched for in the first vocabulary according to a second preset formula. The second preset condition constrains the semantic similarity: it may require that the semantic similarity between a candidate word and the unknown word be greater than a preset threshold, or that the candidate words be a preset number of words with the greatest semantic similarity to the unknown word.
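The two preset conditions described above (a threshold, or a fixed number of best-scoring words) can be illustrated with one small helper applied twice. The helper name `select_words` and all scores below are hypothetical; the sketch only shows the two-stage narrowing, not how the scores are computed.

```python
from typing import Dict, List, Optional


def select_words(
    scores: Dict[str, float],
    threshold: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[str]:
    """Keep words scoring above threshold and/or the top_n best-scoring words."""
    ranked = sorted(scores, key=lambda word: scores[word], reverse=True)
    if threshold is not None:
        ranked = [word for word in ranked if scores[word] > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked


# Stage 1: literal similarity selects the first vocabulary (made-up scores).
literal_scores = {"H": 0.5, "J": 0.4, "K": 0.1}
first_vocabulary = select_words(literal_scores, threshold=0.3)

# Stage 2: semantic similarity is only needed for the first vocabulary.
semantic_scores = {"H": 0.25, "J": 0.05}
candidates = select_words(
    {word: semantic_scores[word] for word in first_vocabulary}, top_n=1
)
print(first_vocabulary, candidates)
```

Filtering on literal similarity first keeps the more expensive language-model lookups limited to a short list of words.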

Further, in this embodiment of the present invention, the step of querying, according to a first preset formula, a first vocabulary in the preset dictionary whose literal similarity to the unknown word satisfies a first preset condition includes:

querying the literal similarity between the vocabulary in the preset dictionary and the unknown word according to the following formula;

wherein $\mathrm{sim}_D$ is the literal similarity between a word in the preset dictionary and the unknown word; w is the unknown word, and w1 is a word in the preset dictionary;

and determining, among the vocabulary in the preset dictionary, a first vocabulary whose literal similarity to the unknown word meets a first preset condition.

The formula above is formula 1, which is the first preset formula; the literal similarity is determined according to it. Specifically, w ∩ w1 denotes the number of characters (or Chinese characters) that w and w1 have in common, and w + w1 denotes the sum of the numbers of characters (or Chinese characters) in w and w1, i.e., the total character count, for example, when w is "north", w1 is "north",

Optionally, since the preset dictionary contains a large vocabulary, when selecting the first vocabulary, words that share at least one character (or Chinese character) with the unknown word can be selected preferentially.

After the literal similarity between the vocabulary in the preset dictionary and the unknown word is determined, the words whose literal similarity meets the first preset condition are screened out as the first vocabulary.
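The text does not reproduce formula 1 itself, so the sketch below assumes one plausible reading of the description: the number of characters the two words share, divided by the total number of characters in both words. The function names and the dictionary entries are hypothetical, and shared characters are counted as a multiset intersection, which is also an assumption.

```python
from collections import Counter


def literal_similarity(unknown: str, entry: str) -> float:
    """Assumed form: shared-character count / total characters in both words."""
    total = len(unknown) + len(entry)
    if total == 0:
        return 0.0
    shared = sum((Counter(unknown) & Counter(entry)).values())
    return shared / total


def shares_a_character(unknown: str, entry: str) -> bool:
    """Cheap pre-filter: does the entry share at least one character?"""
    return bool(set(unknown) & set(entry))


# Hypothetical dictionary and unknown word, for illustration only.
dictionary = ["北京市", "南京", "上海"]
unknown_word = "北京"
scores = {
    entry: literal_similarity(unknown_word, entry)
    for entry in dictionary
    if shares_a_character(unknown_word, entry)
}
print(scores)
```

Under this assumed form, two identical words score 0.5 rather than 1, so any threshold used as the first preset condition would have to be chosen with that scale in mind.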

Further, in the embodiment of the present invention, according to a second preset formula, the step of querying a candidate word in the first vocabulary, where the semantic similarity between the candidate word and the unknown word satisfies a second preset condition, includes:

determining semantic similarity between the first vocabulary and the unknown words according to the following formula;

$$s_G = \gamma\,[\,p(w_i \mid w_{i-1}, w_{i-2}) + p(w_i \mid w_{i+1}, w_{i+2})\,] + (1-\gamma)\,[\,p(w_i \mid w_{i-1}) + p(w_i \mid w_{i+1})\,]$$

determining candidate words in the first vocabulary, wherein the semantic similarity between the candidate words and the unknown words meets a second preset condition;

wherein the above formula is hereinafter referred to as formula 2; $w_i$ is the first vocabulary; $s_G$ is the semantic similarity between the first vocabulary and the unknown word; $\gamma$ is a first preset weight, set in advance according to an empirical value, and its value may range between 0 and 1.

In formula 2, $w_{i-1}$ is the word immediately preceding the unknown word in the sentence of the speech to be synthesized along a preset direction, that is, the word that comes just before the unknown word in the preset direction.

As a specific example, consider a sentence of the speech to be synthesized, "ABCDEF", where each letter A to F represents a word.

If the unknown word is "D", the first vocabulary is "H", and the preset direction is from left to right, then $w_{i-1}$ is "C". $w_{i-2}$ is the word immediately preceding $w_{i-1}$ in the sentence along the preset direction; in the example, $w_{i-2}$ is "B".

After the unknown word is replaced by the first vocabulary, $p(w_i \mid w_{i-1}, w_{i-2})$ is the first preset probability that $w_{i-2}$, $w_{i-1}$, $w_i$ are arranged in sequence along the preset direction in the preset language model; in the example, it is the probability that "B", "C" and "H" are arranged in sequence from left to right in the preset language model. $p(w_i \mid w_{i-1})$ is the second preset probability that $w_{i-1}$, $w_i$ are arranged in sequence along the preset direction in the preset language model, namely the probability that "C" and "H" are arranged in sequence from left to right.

$w_{i+1}$ is the word immediately following the unknown word in the sentence of the speech to be synthesized along the preset direction, that is, the word that comes just after the unknown word in the preset direction; in the example, where the unknown word is "D" and the first vocabulary is "H", $w_{i+1}$ is "E". $w_{i+2}$ is the word immediately following $w_{i+1}$ in the sentence along the preset direction; in the example, $w_{i+2}$ is "F". After the unknown word is replaced by the first vocabulary, $p(w_i \mid w_{i+1}, w_{i+2})$ is the third preset probability that $w_i$, $w_{i+1}$, $w_{i+2}$ are arranged in sequence along the preset direction in the preset language model, namely the probability that "H", "E" and "F" are arranged in sequence from left to right; $p(w_i \mid w_{i+1})$ is the fourth preset probability that $w_i$, $w_{i+1}$ are arranged in sequence along the preset direction in the preset language model, namely the probability that "H" and "E" are arranged in sequence from left to right.
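Substituting the "ABCDEF" example into formula 2, with "H" standing in for the unknown word "D", gives the expression below. The numeric values in the second line are purely hypothetical and serve only to show the arithmetic; they are not taken from any language model.

```latex
s_G(\mathrm{H}) = \gamma\,[\,p(\mathrm{H} \mid \mathrm{C}, \mathrm{B}) + p(\mathrm{H} \mid \mathrm{E}, \mathrm{F})\,]
                + (1-\gamma)\,[\,p(\mathrm{H} \mid \mathrm{C}) + p(\mathrm{H} \mid \mathrm{E})\,]

% Hypothetical values: \gamma = 0.6,\ p(H|C,B) = 0.2,\ p(H|E,F) = 0.1,\ p(H|C) = 0.3,\ p(H|E) = 0.2
s_G(\mathrm{H}) = 0.6\,(0.2 + 0.1) + 0.4\,(0.3 + 0.2) = 0.18 + 0.20 = 0.38
```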

The preset language model may be an n-gram language model. Language models are an important component of natural language processing tasks, and the n-gram language model is currently the most widely used statistical language model. An n-gram language model defines a probability distribution p(s) over character strings s, where p(s) reflects how frequently the string s occurs as a sentence; the higher p(s) is, the closer the sentence formed by s is to natural human language.
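Formula 2 can be sketched in code as a lookup into a table of n-gram probabilities. Everything named here is an assumption made for illustration: the table is a plain dictionary keyed by word tuples in reading order (left to right), the probabilities are invented, and missing context at a sentence boundary is treated as probability 0.

```python
from typing import Dict, List, Optional, Tuple

NGramTable = Dict[Tuple[Optional[str], ...], float]


def semantic_similarity(
    tokens: List[str],
    index: int,
    candidate: str,
    ngram_prob: NGramTable,
    gamma: float = 0.5,
) -> float:
    """Evaluate formula 2 for `candidate` placed at tokens[index]."""

    def prob(*words: Optional[str]) -> float:
        # Unseen sequences (including ones with missing context) score 0.
        return ngram_prob.get(words, 0.0)

    def context(offset: int) -> Optional[str]:
        position = index + offset
        return tokens[position] if 0 <= position < len(tokens) else None

    prev1, prev2 = context(-1), context(-2)
    next1, next2 = context(1), context(2)
    trigram_part = prob(prev2, prev1, candidate) + prob(candidate, next1, next2)
    bigram_part = prob(prev1, candidate) + prob(candidate, next1)
    return gamma * trigram_part + (1 - gamma) * bigram_part


# Hypothetical probabilities for the "ABCDEF" example with candidate "H".
table: NGramTable = {
    ("B", "C", "H"): 0.2,
    ("H", "E", "F"): 0.1,
    ("C", "H"): 0.3,
    ("H", "E"): 0.2,
}
score = semantic_similarity(list("ABCDEF"), 3, "H", table, gamma=0.6)
print(round(score, 2))
```

In practice the table would be replaced by queries to a trained n-gram model; the dictionary keeps the sketch self-contained.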

Optionally, in an embodiment of the present invention, step 203 includes:

The literal similarity and the semantic similarity are weighted and summed according to their preset weights to obtain a comprehensive weight for each candidate word, and the candidate word with the highest comprehensive weight is taken as the target candidate word. Specifically, the comprehensive weight of a candidate word is determined according to the following formula 3:

$$T = a \cdot \mathrm{sim}_D + b \cdot s_G$$

wherein T represents the comprehensive weight of the candidate word, a is the preset weight of $\mathrm{sim}_D$ (literal similarity), and b is the preset weight of $s_G$ (semantic similarity).

After the comprehensive weight values of all candidate words are determined, determining the target candidate words according to the following formula 4:

$$w^{*} = \arg\max(T)$$

wherein $w^{*}$ is the target candidate word, and formula 4 indicates that the candidate word with the highest comprehensive weight is taken as the target candidate word.
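Formulas 3 and 4 amount to a weighted sum followed by an argmax. A minimal sketch, assuming each candidate already has its two similarity scores; the function name `pick_target`, the default weights, and the scores are hypothetical.

```python
from typing import Dict, Tuple


def pick_target(
    candidates: Dict[str, Tuple[float, float]],
    a: float = 0.5,
    b: float = 0.5,
) -> str:
    """Return the candidate maximizing T = a * sim_D + b * s_G."""
    if not candidates:
        raise ValueError("no candidate words to choose from")

    def comprehensive_weight(word: str) -> float:
        sim_d, s_g = candidates[word]
        return a * sim_d + b * s_g

    return max(candidates, key=comprehensive_weight)


# Each entry maps a candidate to (literal similarity, semantic similarity).
scored = {"H": (0.40, 0.38), "J": (0.25, 0.45)}
print(pick_target(scored, a=0.7, b=0.3))  # literal similarity dominates
print(pick_target(scored, a=0.2, b=0.8))  # semantic similarity dominates
```

With these made-up scores the choice flips between the two candidates as the weights change, which shows how a and b trade the two dimensions against each other.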

In the above embodiment of the present invention, when the speech to be synthesized is detected to include an unknown word during speech synthesis, the target candidate word with the highest similarity to the unknown word in a preset dictionary is determined according to a preset algorithm. The unknown word is replaced with the target candidate word for speech synthesis, which solves the problem that unknown words cannot be synthesized, without expanding the TTS corpus and without increasing the training complexity or decoding complexity of the TTS neural network. The similarity comprises literal similarity and semantic similarity, calculated from formula 1 and formula 2 respectively, and target candidate words similar to the unknown word are screened along these two dimensions, which ensures the accuracy of replacing the unknown word with the target candidate word.

The speech synthesis method according to the embodiment of the present invention is described above, and the speech synthesis apparatus according to the embodiment of the present invention will be described below with reference to the accompanying drawings.

Referring to fig. 3, an embodiment of the present invention provides a speech synthesis apparatus, including:

A detection module 301, configured to detect that an unknown word is included in the speech to be synthesized, and determine a target candidate word with the highest similarity to the unknown word in a preset dictionary according to a preset algorithm; the similarity comprises literal similarity and semantic similarity.

A synthesizing module 302, configured to substitute the target candidate word for the unknown word, and perform speech synthesis.

Optionally, in this embodiment of the present invention, the detecting module 301 includes:

the first determining submodule is used for determining candidate words in the preset dictionary according to a first preset rule; the literal similarity between the candidate word and the unknown word meets a first preset condition, and the semantic similarity between the candidate word and the unknown word meets a second preset condition;

and the second determining submodule is used for determining a target candidate word with the highest similarity to the unknown word in the candidate words according to a second preset rule.

Optionally, in this embodiment of the present invention, the first determining sub-module includes:

the first query unit is used for querying, according to a first preset formula, a first vocabulary in the preset dictionary whose literal similarity to the unknown word meets a first preset condition;

and the second query unit is used for querying a candidate word in the first vocabulary according to a second preset formula, wherein the candidate word is the first vocabulary with semantic similarity to the unknown word meeting a second preset condition.
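The module decomposition described above can be sketched as two small classes. This is an illustrative arrangement only: the class and field names, the callable-based submodules, and the fake synthesis back end that just returns encoded text are all assumptions, not part of the described apparatus.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class DetectionModule:
    corpus_vocab: Set[str]
    # First determining submodule: candidate words for an unknown word.
    find_candidates: Callable[[List[str], int], List[str]]
    # Second determining submodule: the target among those candidates.
    choose_target: Callable[[List[str], int, List[str]], str]

    def detect(self, tokens: List[str]) -> Dict[int, str]:
        """Map each unknown-word position to its target candidate word."""
        targets: Dict[int, str] = {}
        for index, token in enumerate(tokens):
            if token not in self.corpus_vocab:
                candidates = self.find_candidates(tokens, index)
                if candidates:
                    targets[index] = self.choose_target(tokens, index, candidates)
        return targets


@dataclass
class SynthesisModule:
    # Hypothetical TTS back end: takes tokens, returns audio bytes.
    synthesize: Callable[[List[str]], bytes]

    def run(self, tokens: List[str], targets: Dict[int, str]) -> bytes:
        replaced = [targets.get(i, tok) for i, tok in enumerate(tokens)]
        return self.synthesize(replaced)


detector = DetectionModule(
    corpus_vocab={"A", "B", "C", "E", "F", "H"},
    find_candidates=lambda tokens, index: ["H"],
    choose_target=lambda tokens, index, candidates: candidates[0],
)
synthesizer = SynthesisModule(synthesize=lambda tokens: " ".join(tokens).encode())
words = list("ABCDEF")
audio = synthesizer.run(words, detector.detect(words))
print(audio)
```

Passing the submodules in as callables keeps the similarity logic swappable without touching the detection or synthesis wiring.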

Optionally, in this embodiment of the present invention, the first querying unit is configured to:

querying the literal similarity between the vocabulary in the preset dictionary and the unknown word according to the following formula;

wherein $\mathrm{sim}_D$ is the literal similarity between a word in the preset dictionary and the unknown word; w is the unknown word, and w1 is a word in the preset dictionary;

and determining, among the vocabulary in the preset dictionary, a first vocabulary whose literal similarity to the unknown word meets a first preset condition.

Optionally, in this embodiment of the present invention, the second querying unit is configured to:

determining semantic similarity between the first vocabulary and the unknown words according to the following formula;

$$s_G = \gamma\,\bigl[p(w_i \mid w_{i-1}, w_{i-2}) + p(w_i \mid w_{i+1}, w_{i+2})\bigr] + (1-\gamma)\,\bigl[p(w_i \mid w_{i-1}) + p(w_i \mid w_{i+1})\bigr]$$

determining candidate words in the first vocabulary, wherein the semantic similarity between the candidate words and the unknown words meets a second preset condition;

wherein $s_G$ is the semantic similarity between the first vocabulary and the unknown word; $w_i$ is a word of the first vocabulary; $\gamma$ is a first preset weight;

$w_{i-1}$ is the word immediately preceding the unknown word, in a preset direction, in the sentence of the speech to be synthesized; $w_{i-2}$ is the word immediately preceding $w_{i-1}$, in the preset direction, in that sentence; $p(w_i \mid w_{i-1}, w_{i-2})$ is a first preset probability that $w_{i-2}$, $w_{i-1}$, $w_i$ are arranged in sequence, in the preset direction, in a preset language model; $p(w_i \mid w_{i-1})$ is a second preset probability that $w_{i-1}$, $w_i$ are arranged in sequence, in the preset direction, in the preset language model;

$w_{i+1}$ is the word immediately following the unknown word, in the preset direction, in the sentence of the speech to be synthesized; $w_{i+2}$ is the word immediately following $w_{i+1}$, in the preset direction, in that sentence; $p(w_i \mid w_{i+1}, w_{i+2})$ is a third preset probability that $w_i$, $w_{i+1}$, $w_{i+2}$ are arranged in sequence, in the preset direction, in the preset language model; $p(w_i \mid w_{i+1})$ is a fourth preset probability that $w_i$, $w_{i+1}$ are arranged in sequence, in the preset direction, in the preset language model.
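As an illustration, the semantic similarity formula can be evaluated as in the sketch below. The language model is represented here by two lookup functions, one for probabilities conditioned on the preceding words and one for probabilities conditioned on the following words; the helper `table_model`, the function name `semantic_similarity`, and the convention that an unseen n-gram has probability 0 are assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

# prob(word, context) -> conditional probability p(word | context)
Prob = Callable[[str, Tuple[str, ...]], float]


def table_model(table: Dict[Tuple[str, Tuple[str, ...]], float]) -> Prob:
    """Wrap a lookup table as a probability function (unseen entries give 0.0)."""
    def prob(word: str, context: Tuple[str, ...]) -> float:
        return table.get((word, tuple(context)), 0.0)
    return prob


def semantic_similarity(
    w_i: str,
    left: Tuple[str, str],    # (w_{i-1}, w_{i-2})
    right: Tuple[str, str],   # (w_{i+1}, w_{i+2})
    p_left: Prob,
    p_right: Prob,
    gamma: float,
) -> float:
    # Trigram terms: p(w_i | w_{i-1}, w_{i-2}) + p(w_i | w_{i+1}, w_{i+2})
    trigram = p_left(w_i, left) + p_right(w_i, right)
    # Bigram terms: p(w_i | w_{i-1}) + p(w_i | w_{i+1})
    bigram = p_left(w_i, left[:1]) + p_right(w_i, right[:1])
    return gamma * trigram + (1.0 - gamma) * bigram
```

A larger value of `gamma` gives more weight to the two-word contexts, while a smaller value falls back more heavily on the one-word contexts, which are less likely to be missing from the language model.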

Optionally, in this embodiment of the present invention, the second determining sub-module is configured to:

Optionally, in an embodiment of the present invention, the synthesis module 302 is configured to:

and placing the target candidate word at the position of the unknown word in the sentence of the speech to be synthesized, and performing speech synthesis.

In the above embodiment of the present invention, during speech synthesis, when the detection module 301 detects that the speech to be synthesized includes an unknown word, it determines, according to a preset algorithm, the target candidate word in a preset dictionary that has the highest similarity to the unknown word; the synthesis module 302 then replaces the unknown word with the target candidate word and performs speech synthesis. This solves the problem that unknown words cannot be synthesized, without expanding the TTS corpus and without increasing the training complexity or the decoding complexity of the TTS neural network. In addition, the similarity comprises both literal similarity and semantic similarity, so target candidate words similar to the unknown word are screened along two dimensions, which ensures the accuracy of replacing the unknown word with the target candidate word.
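The overall flow of the detection module and the synthesis module can be sketched as below. This is a simplified illustration under stated assumptions: the sentence is assumed to be already segmented into words, the preset dictionary is modelled as a non-empty set, and `score` is a hypothetical function returning the overall similarity of a dictionary word to the unknown word at a given position; the function name `replace_unknown_words` is likewise invented for the example.

```python
from typing import Callable, List, Sequence, Set

# score(tokens, position, candidate) -> overall similarity (hypothetical)
Score = Callable[[Sequence[str], int, str], float]


def replace_unknown_words(tokens: Sequence[str], lexicon: Set[str], score: Score) -> List[str]:
    """Return a copy of the sentence with each unknown word replaced.

    Assumes `lexicon` is non-empty; max() raises ValueError otherwise.
    """
    output: List[str] = []
    for position, token in enumerate(tokens):
        if token in lexicon:
            # Known word: pass through unchanged.
            output.append(token)
            continue
        # Unknown word: choose the dictionary word with the highest similarity.
        # Sorting makes the result deterministic when scores tie.
        best = max(sorted(lexicon), key=lambda cand: score(tokens, position, cand))
        output.append(best)
    return output
```

The returned word sequence contains only dictionary words, so it can be handed to the existing synthesis front end without retraining or enlarging the TTS corpus.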

Fig. 4 is a schematic structural diagram of an electronic device according to yet another embodiment of the present invention.

Referring to Fig. 4, an embodiment of the present invention provides an electronic device, which includes a memory 41, a processor 42, a bus 43, and a computer program stored in the memory 41 and executable on the processor 42. The memory 41 and the processor 42 communicate with each other through the bus 43.

The processor 42 is used to call the program instructions in the memory 41 to implement the method of fig. 1 when executing the program.

In another embodiment, the processor, when executing the program, implements the method of:

and replacing the unknown word with the target candidate word to perform speech synthesis.

The electronic device provided in the embodiment of the present invention may be configured to execute a program corresponding to the method in the foregoing method embodiment, and details of this implementation are not described again.

With the electronic device provided by the embodiment of the present invention, during speech synthesis, when it is detected that the speech to be synthesized includes an unknown word, the target candidate word in the preset dictionary that has the highest similarity to the unknown word is determined according to the preset algorithm; the unknown word is replaced with the target candidate word for speech synthesis. This solves the problem that unknown words cannot be synthesized, without expanding the TTS corpus and without increasing the training complexity or the decoding complexity of the TTS neural network. In addition, the similarity comprises both literal similarity and semantic similarity, so target candidate words similar to the unknown word are screened along two dimensions, which ensures the accuracy of replacing the unknown word with the target candidate word.

A non-transitory computer readable storage medium is provided according to a further embodiment of the present invention, having a computer program stored thereon, which when executed by a processor implements the steps of fig. 1.

In another embodiment, the program when executed by a processor implements a method comprising:

and replacing the unknown word with the target candidate word to perform speech synthesis.

In the non-transitory computer-readable storage medium provided in the embodiment of the present invention, when the program is executed by the processor, the method in the above-described method embodiment is implemented, and details of this implementation are not described again.

With the non-transitory computer-readable storage medium provided by the embodiment of the present invention, during speech synthesis, when it is detected that the speech to be synthesized includes an unknown word, the target candidate word in the preset dictionary that has the highest similarity to the unknown word is determined according to the preset algorithm; the unknown word is replaced with the target candidate word for speech synthesis. This solves the problem that unknown words cannot be synthesized, without expanding the TTS corpus and without increasing the training complexity or the decoding complexity of the TTS neural network. In addition, the similarity comprises both literal similarity and semantic similarity, so target candidate words similar to the unknown word are screened along two dimensions, which ensures the accuracy of replacing the unknown word with the target candidate word.

Yet another embodiment of the present invention discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-mentioned method embodiments, for example, comprising:

and replacing the unknown word with the target candidate word to perform speech synthesis.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
