Multi-speaker and multi-language voice synthesis method and system

Document No.: 925429    Publication date: 2021-03-02

Reading note: This technology, "Multi-speaker and multi-language voice synthesis method and system" (一种多说话人、多语言的语音合成方法及系统), was designed and created by Zhu Hai, Wang Kun, Zhou Linmin and Liu Shujun on 2020-11-11. Its main content: the invention discloses a multi-speaker, multi-language speech synthesis method comprising extracting speech acoustic features; processing texts of different languages into a unified representation, aligning the audio with the text, and acquiring duration information; constructing a speaker space and a language space, generating speaker ids and language ids, extracting speaker vectors and language vectors and adding them to an initial speech synthesis model, and training the initial model with the aligned text, the duration information and the speech acoustic features to obtain a trained speech synthesis model; processing the text to be synthesized to generate a speaker id and language ids; and inputting the speaker id, the text and the language ids into the speech synthesis model, which outputs speech acoustic features that are converted into audio. A corresponding system is also disclosed. The invention disentangles speaker characteristics from language characteristics, so that switching the speaker or the language only requires changing the corresponding id.

1. A method for synthesizing multi-speaker and multi-language speech, comprising:

step S100: training a speech synthesis model specifically comprises:

step S110: acquiring a voice training database of multiple speakers and a single language, and extracting voice acoustic features;

step S120: processing texts of different languages in a voice training database into a uniform representation mode, aligning audio and the texts, and acquiring duration information corresponding to the texts;

step S130: constructing a speaker space and a language space, generating a speaker id corresponding to the aligned text and a language id corresponding to each character in the aligned text, extracting a speaker vector corresponding to the speaker id from the speaker space, and extracting a language vector corresponding to the language id from the language space;

step S140: adding the speaker vector and the language vector into each part of the initial voice synthesis model, and training the speaker space, the language space and the initial voice synthesis model by adopting the aligned text, duration information and voice acoustic characteristics to obtain a trained voice synthesis model;

step S200: converting the text to be synthesized into audio, specifically comprising:

step S210: carrying out standardized processing on the text to be synthesized, and classifying the text according to the text language;

step S220: processing the classified texts into a uniform representation mode, aligning the audio and the texts, predicting duration information corresponding to the texts by a predictor, and generating a speaker id corresponding to the aligned texts and a language id corresponding to each character in the aligned texts;

step S230: specifying a speaker id, inputting the speaker id, the text processed in S220 and the language ids of the corresponding characters into the trained speech synthesis model, and outputting the speech acoustic features;

step S240: the speech acoustic features are converted into audio.

2. The method of claim 1, wherein the speech acoustic features include mel-frequency spectral features, spectral energy features, and fundamental frequency features.

3. The method of claim 2, wherein the step S120 comprises:

processing texts of different languages in a speech training database into a unified phoneme expression mode, or processing the texts of different languages into a unified Unicode coding expression mode;

aligning texts and audios of different languages by adopting the MFA (Montreal Forced Aligner) algorithm to obtain the aligned texts and the durations corresponding to the texts;

and converting the durations into frame counts, wherein the sum of the duration frame counts is equal to the number of frames of the Mel spectrum features.

4. The method of claim 3, wherein the step S130 comprises:

setting the length of the language id sequence of each piece of voice training data to be equal to the length of the aligned text; setting the length of the speaker id of each piece of voice training data to 1, with different speakers and different languages being assigned different id values;

and constructing and initializing a speaker space and a language space according to the number of speakers and the number of languages in the voice training data, converting the speaker id and the language id into one-hot vectors, and extracting the speaker vectors and the language vectors.

5. The method of claim 3, wherein the step S240 uses a Multi-band MelGAN vocoder to convert the acoustic features of the speech into audio.

6. A multi-speaker, multi-language speech synthesis system comprising a text processing module, an information tagging module, an information encoding module, an acoustic feature output module, and a vocoder module, wherein:

the text processing module is used for carrying out standardized processing on the text, classifying the text according to the language and processing the text of different languages into a uniform expression mode;

the information marking module is used for generating a corresponding language id for each character of the text and generating a speaker id according to the user requirement;

the information coding module is used for constructing a speaker space and a language space and extracting corresponding language vectors and speaker vectors from the language space and the speaker space according to the language id and the speaker id;

the acoustic feature output module is used, in the training stage, for inputting the processed text, the language vector and the speaker vector into the speech synthesis model for model training to obtain a trained speech synthesis model; and, in the inference stage, for inputting the processed text, the language vector and the speaker vector into the trained speech synthesis model, which converts them into speech acoustic features and outputs these features;

and the vocoder module is used for outputting audio according to the acoustic characteristics of the input voice.

Technical Field

The invention relates to the technical field of voice synthesis, in particular to a method and a system for synthesizing multi-speaker and multi-language voice.

Background

Speech synthesis is a technology for converting text information into speech information, that is, converting arbitrary text into audible speech, and it draws on multiple disciplines such as acoustics, linguistics and computer science. However, building a multi-speaker, multi-language speech synthesis system from monolingual speech databases while keeping the speaker's timbre consistent has long been a difficult problem. Conventional multilingual speech synthesis systems rely on multilingual speech databases, which are difficult to obtain in practice (it is hard to find speakers proficient in multiple languages to record speech data), and they cannot freely switch the speaker's timbre or the pronunciation language.

Disclosure of Invention

The invention aims to provide a method and a system for multi-speaker, multi-language speech synthesis, so as to solve the problem that the prior art cannot synthesize multi-speaker, multi-language speech from a single-language speech database while keeping the speaker consistent.

The invention solves the problems through the following technical scheme:

a multi-speaker, multi-language speech synthesis method, comprising:

step S100: training a speech synthesis model specifically comprises:

step S110: acquiring a voice training database of multiple speakers and a single language, and extracting voice acoustic features;

step S120: processing texts of different languages in a voice training database into a uniform representation mode, aligning audio and the texts, and acquiring duration information corresponding to the texts;

step S130: constructing a speaker space and a language space, generating a speaker id corresponding to the aligned text and a language id corresponding to each character in the aligned text, extracting a speaker vector corresponding to the speaker id from the speaker space, and extracting a language vector corresponding to the language id from the language space;

step S140: adding the speaker vector and the language vector into each part of the initial voice synthesis model, and training the speaker space, the language space and the initial voice synthesis model by adopting the aligned text, duration information and voice acoustic characteristics to obtain a trained voice synthesis model;

step S200: converting the text to be synthesized into audio, specifically comprising:

step S210: carrying out standardized processing on the text to be synthesized, and classifying the text according to the text language;

step S220: processing the classified texts into a uniform representation mode, aligning the audio and the texts, predicting duration information corresponding to the texts by a predictor, and generating a speaker id corresponding to the aligned texts and a language id corresponding to each character in the aligned texts;

step S230: specifying a speaker id, inputting the speaker id, the text processed in S220 and the language ids of the corresponding characters into the trained speech synthesis model, and outputting the speech acoustic features;

step S240: the speech acoustic features are converted into audio.

The speech acoustic features include mel-frequency spectral features, spectral energy features and fundamental frequency features.

The step S120 specifically includes:

processing the texts of different languages in the speech training database into a unified phoneme representation, which can use pinyin phonemes or CMU phonemes, or processing the texts of different languages into a unified Unicode coding representation;

aligning texts and audios of different languages by adopting an MFA algorithm to obtain the aligned texts and duration corresponding to the texts;

and converting the durations into frame counts, wherein the sum of the duration frame counts is equal to the number of frames of the Mel spectrum features.
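As an illustrative sketch only (the sampling rate, hop length and function names below are assumptions and not part of the claims), converting the aligned durations into frame counts and checking the total against the Mel spectrum length might look like:

```python
import numpy as np

def durations_to_frames(durations_sec, sample_rate=22050, hop_length=256, n_mel_frames=None):
    """Convert per-character durations in seconds to integer frame counts.

    Rounding drift is absorbed by the last character so that the summed
    frame count matches the number of Mel-spectrum frames.
    """
    frames = np.round(np.asarray(durations_sec) * sample_rate / hop_length).astype(int)
    if n_mel_frames is not None:
        frames[-1] += n_mel_frames - frames.sum()
        assert frames.sum() == n_mel_frames, "duration frames must match Mel frames"
    return frames

# three aligned characters covering a 0.9 s utterance with 78 Mel frames
print(durations_to_frames([0.30, 0.25, 0.35], n_mel_frames=78))  # e.g. [26 22 30]
```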

The step S130 specifically includes:

setting the length of the language id sequence of each piece of voice training data to be equal to the length of the aligned text; setting the length of the speaker id of each piece of voice training data to 1, with different speakers and different languages being assigned different id values;

and constructing and initializing a speaker space and a language space according to the number of speakers and the number of languages in the voice training data, converting the speaker id and the language id into one-hot vectors, and extracting the speaker vectors and the language vectors.
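A minimal sketch of the speaker space and language space as trainable embedding tables (the dimension and class names are assumptions; an embedding lookup is mathematically equivalent to multiplying a one-hot vector by the table):

```python
import torch
import torch.nn as nn

class SpeakerLanguageSpaces(nn.Module):
    """Speaker and language spaces; an id selects one row of each table."""

    def __init__(self, n_speakers, n_languages, dim=256):
        super().__init__()
        self.speaker_space = nn.Embedding(n_speakers, dim)    # randomly initialized
        self.language_space = nn.Embedding(n_languages, dim)  # trained with the model

    def forward(self, speaker_id, language_ids):
        speaker_vec = self.speaker_space(speaker_id)       # [1, dim], one per utterance
        language_vecs = self.language_space(language_ids)  # [num_chars, dim], one per character
        return speaker_vec, language_vecs

spaces = SpeakerLanguageSpaces(n_speakers=21, n_languages=2)
spk, lang = spaces(torch.tensor([7]), torch.tensor([0, 0, 0, 1, 1]))
```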

The step S240 converts the voice acoustic features into audio using a Multi-band MelGAN vocoder.

A multi-speaker, multi-language speech synthesis system comprising a text processing module, an information tagging module, an information encoding module, an acoustic feature output module, and a vocoder module, wherein:

the text processing module is used for carrying out standardized processing on the text, classifying the text according to the language and processing the text of different languages into a uniform expression mode;

optionally, the texts in different languages of the voice database are processed into a unified phoneme representation, or into a unified Unicode coding representation; if used for training, the texts and audios of different languages are aligned with the MFA algorithm to obtain the aligned texts and the durations corresponding to the texts, and the durations are converted into frame counts, wherein the sum of the duration frame counts equals the number of frames of the extracted Mel spectrum features;

the information marking module is used for generating a corresponding language id for each character of the text and generating a speaker id according to the user requirement;

the length of the language id is equal to the length of the text processed by the text processing module, the length of the speaker id is 1, and different id values are taken for different speakers and different languages;

the information coding module is used for constructing a speaker space and a language space and extracting corresponding language vectors and speaker vectors from the language space and the speaker space according to the language id and the speaker id; the speaker space and the language space need to be constructed according to the number of speakers and the number of languages in the training data;

the acoustic feature output module is used, in the training stage, for inputting the processed text, the language vector and the speaker vector into the speech synthesis model for model training to obtain a trained speech synthesis model; and, in the inference stage, for inputting the processed text, the language vector and the speaker vector into the trained speech synthesis model, which converts them into speech acoustic features and outputs these features;

the speaker vector extracted from the speaker space is added directly into the encoder, the variable information adapter and the decoder of the speech synthesis model; before the language vectors extracted from the language space are added into the decoder, their length must be adjusted by a length regulator, which yields a language vector corresponding to each spectrogram frame;

and the vocoder module is used for outputting audio according to the acoustic characteristics of the input voice. Preferably, the vocoder module may employ a Multi-band MelGAN vocoder.

Compared with the prior art, the invention has the following advantages and beneficial effects:

the invention constructs a speaker space and a language space, extracts the speaker vector and the language vector through the speaker id and the language id, and adds them to each part of the speech synthesis model, thereby disentangling speaker characteristics from language characteristics, so that switching the speaker or the language (for example, making a foreign speaker's voice speak Chinese) only requires changing the corresponding id; the invention only needs multi-speaker monolingual data, synthesis is extremely fast, the synthesized speech has high quality and good stability, and fluent switching between different languages can be achieved while keeping the speaker's timbre consistent.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram of a model structure of a speech synthesis module.

Detailed Description

The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.

Example 1:

referring to fig. 1, a method for synthesizing multi-speaker and multi-language speech includes:

Training stage:

step S11: acquiring a multi-speaker, single-language (per speaker) voice training database and extracting voice acoustic features, wherein the training data comprises voice data of multiple speakers and the corresponding texts in at least two different languages; the voice acoustic features comprise Mel spectrum features, spectral energy features and fundamental frequency features; optionally, a Chinese and English voice database is selected as the training database: the Chinese data set can use the public Biaobei female-voice corpus together with a self-recorded corpus covering more than 20 speakers, and the English data set can use public databases such as LJSpeech and VCTK;
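A rough sketch of this feature extraction using librosa (the sampling rate, FFT size, hop length and F0 search range are illustrative assumptions, not values fixed by the patent):

```python
import librosa
import numpy as np

def extract_acoustic_features(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Return log-Mel spectrum, per-frame spectral energy and fundamental frequency."""
    y, _ = librosa.load(wav_path, sr=sr)
    stft = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    mel = librosa.feature.melspectrogram(S=stft ** 2, sr=sr, n_mels=n_mels)
    log_mel = np.log(np.clip(mel, 1e-5, None))                    # [n_mels, T]
    energy = np.linalg.norm(stft, axis=0)                         # [T]
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"),
                            sr=sr, frame_length=n_fft, hop_length=hop_length)
    return log_mel, energy, np.nan_to_num(f0)                     # unvoiced frames -> 0
```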

step S12: processing the texts of different languages in the speech training database into a unified representation, namely a unified phoneme representation, or processing the texts of different languages into a unified Unicode coding representation; aligning audio and text with the MFA (Montreal Forced Aligner) algorithm and acquiring the aligned text and the duration information corresponding to the text; converting the durations into frame counts, wherein the sum of the duration frame counts equals the number of frames of the Mel spectrum features;

For example, the English text "who met him at the door" is converted into a phoneme representation, yielding "h u1 m ai1 t h i1 m a1 t s i a0 d uo1 r pp4"; the Chinese text "我是中国人，我爱中国" ("I am Chinese, I love China") is processed into the unified phoneme representation "uo3 sh iii4 pp1 zh ong1 g uo2 r en2 pp3 uo3 ai4 zh ong1 g uo2 pp4". The MFA algorithm is then used to obtain the aligned text (with additional markers such as sil and sp) and the duration of each character in the text; the durations are converted into frame counts, and the sum of the duration frame counts is kept equal to the number of frames of the extracted Mel spectrum features.
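A hedged sketch of one possible front end for this unification step (pypinyin for Chinese and g2p_en for English; the patent's exact symbol set, the initial/final split of pinyin syllables and the prosody markers such as pp1/pp3/pp4, sil and sp are not reproduced here and remain assumptions):

```python
from pypinyin import lazy_pinyin, Style   # Chinese grapheme-to-pinyin
from g2p_en import G2p                    # English grapheme-to-CMU-phoneme

g2p = G2p()

def to_phonemes(text, language):
    if language == "zh":
        # tone-numbered pinyin syllables, e.g. "中国" -> ["zhong1", "guo2"]
        return lazy_pinyin(text, style=Style.TONE3, neutral_tone_with_five=True)
    if language == "en":
        # CMU phonemes with stress digits, e.g. "door" -> ["D", "AO1", "R"]
        return [p for p in g2p(text) if p.strip()]
    raise ValueError(f"unsupported language: {language}")

print(to_phonemes("我爱中国", "zh"))
print(to_phonemes("who met him at the door", "en"))
```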

Step S13: constructing a speaker space and a language space, generating a speaker id for the text processed in step S12 and a language id for each of its characters, and extracting a speaker vector and a language vector from the speaker space and the language space respectively using the generated ids; namely: the length of the language id sequence of each piece of voice training data is set equal to the length of the aligned text, the length of the speaker id is set to 1, and different speakers and different languages are assigned different id values;

and constructing and initializing a speaker space and a language space according to the number of speakers and the number of languages in the voice training data, converting the speaker id and the language id into one-hot vectors, and extracting the speaker vectors and the language vectors.

For example, for the English text processed by S12, "sil h u1 m ai1 t h i1 m a1 t s i a0 d uo1 r pp4", the speaker id corresponding to the utterance is [20] and the language id is a sequence of 1s, one per aligned character; for the Chinese text processed by S12, "sil uo3 sh iii4 pp1 zh ong1 g uo2 r en2 pp3 sp uo3 ai4 zh ong1 g uo2 pp4", the speaker id corresponding to the utterance is [7] and the language id is a sequence of 0s, one per aligned character. The ids are converted into one-hot vectors and used to extract the speaker vector and the language vector from the corresponding spaces; a sketch of this id generation is given below.
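A minimal sketch of the id generation referenced above (the index values are illustrative only):

```python
def make_ids(aligned_symbols, speaker_index, language_index):
    """One language id per aligned symbol, one speaker id per utterance (length 1)."""
    return [speaker_index], [language_index] * len(aligned_symbols)

symbols = "sil uo3 sh iii4 pp1 zh ong1 g uo2 r en2 pp4".split()
speaker_id, language_ids = make_ids(symbols, speaker_index=7, language_index=0)
print(speaker_id, language_ids)   # [7] and a list of 0s, one per aligned symbol
```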

Step S14: adding the speaker vectors and the language vectors obtained in step S13 to each part of the initial speech synthesis model. Specifically, as shown in fig. 2, the prosodic phoneme sequence is converted into phoneme embedding vectors by an embedding layer, and positional encodings (absolute positional encodings obtained by feeding the positions of the different phonemes into sine functions) are added to the phoneme embedding vectors in order to inject positional information between phonemes; after the variable information adapter, positional encoding information is added again to ensure that the positional information is not lost in the frame expansion performed by the length regulator; the prosody duration predictor in the variable information adapter mainly predicts the pause durations of the phoneme sequence and is used to control the prosody of the synthesized speech; the speaker vector extracted from the speaker space is added directly into the encoder, the variable information adapter and the decoder of the speech synthesis model; the language vector extracted from the language space must have its length adjusted by the length regulator before it is added into the decoder; the main function of the length regulator is to perform a frame expansion operation on the language vector according to the duration prediction result so that it can be added to the decoder (a sketch of this expansion is given after this step).

The speaker space, the language space and the initial speech synthesis model are then trained with the aligned text, the duration information and the speech acoustic features to obtain the trained speech synthesis model.
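A minimal sketch of the length regulator's frame expansion and of adding the speaker and language vectors to a hidden sequence (tensor shapes and variable names are assumptions):

```python
import torch

def length_regulate(char_vectors, frame_durations):
    """Expand one vector per character into one vector per spectrogram frame.

    char_vectors:    [num_chars, dim]  (e.g. language vectors)
    frame_durations: [num_chars]       (frame counts from the aligner or duration predictor)
    returns:         [total_frames, dim]
    """
    return torch.repeat_interleave(char_vectors, frame_durations, dim=0)

lang_vecs = torch.randn(3, 256)                 # 3 aligned characters
durs = torch.tensor([26, 22, 30])               # their frame counts
per_frame = length_regulate(lang_vecs, durs)    # [78, 256], fed to the decoder

hidden = torch.randn(78, 256)                   # decoder-side hidden sequence
speaker_vec = torch.randn(1, 256)               # broadcast over all frames
hidden = hidden + speaker_vec + per_frame       # speaker and language conditioning
```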

Inference stage:

step S21: carrying out standardized processing on the text to be synthesized, and classifying the text according to the text language;

step S22: processing the classified texts into a uniform representation mode, aligning the audio and the texts, predicting duration information corresponding to the texts by a predictor, and generating language id corresponding to each character of the texts;

step S23: specifying a speaker id, inputting the speaker id, the text processed in S22 and the language ids of the corresponding characters into the trained speech synthesis model, and outputting the speech acoustic features;

step S24: a Multi-band MelGAN vocoder is employed to convert speech acoustic features into audio.
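A hedged sketch of this final step, assuming a pretrained Multi-band MelGAN generator from the open-source ParallelWaveGAN toolkit (the loading function, checkpoint path and mel layout below are assumptions, not the patent's own code):

```python
import torch
import soundfile as sf
from parallel_wavegan.utils import load_model  # assumed third-party toolkit API

vocoder = load_model("multiband_melgan.pkl")    # hypothetical checkpoint path
vocoder.remove_weight_norm()
vocoder.eval()

mel = torch.randn(200, 80)                      # [T, n_mels] from the synthesis model
with torch.no_grad():
    wav = vocoder.inference(mel).view(-1)       # waveform samples
sf.write("output.wav", wav.cpu().numpy(), 22050)
```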

Example 2:

a multi-speaker, multi-language speech synthesis system comprising a text processing module, an information tagging module, an information encoding module, an acoustic feature output module, and a vocoder module, wherein:

the text processing module is used for carrying out standardized processing on the text, classifying the text according to the language and processing the text of different languages into a uniform expression mode;

optionally, the texts in different languages of the voice database are processed into a unified phoneme representation, or into a unified Unicode coding representation; if used for training, the texts and audios of different languages are aligned with the MFA algorithm to obtain the aligned texts and the durations corresponding to the texts, and the durations are converted into frame counts, wherein the sum of the duration frame counts equals the number of frames of the extracted Mel spectrum features;

the information marking module is used for generating a corresponding language id for each character of the text and generating a speaker id according to the user requirement;

the length of the language id is equal to the length of the text processed by the text processing module, the length of the speaker id is 1, and different id values are taken for different speakers and different languages;

the information coding module is used for constructing a speaker space and a language space and extracting corresponding language vectors and speaker vectors from the language space and the speaker space according to the language id and the speaker id; the speaker space and the language space need to be constructed according to the number of speakers and the number of languages in the training data;

the acoustic feature output module is used, in the training stage, for inputting the processed text, the language vector and the speaker vector into the speech synthesis model for model training to obtain a trained speech synthesis model; and, in the inference stage, for inputting the processed text, the language vector and the speaker vector into the trained speech synthesis model, which converts them into speech acoustic features and outputs these features;

the speaker vector extracted from the speaker space is added directly into the encoder, the variable information adapter and the decoder of the speech synthesis model; before the language vectors extracted from the language space are added into the decoder, their length must be adjusted by a length regulator, which yields a language vector corresponding to each spectrogram frame;

and the vocoder module is used for outputting audio according to the acoustic characteristics of the input voice. Preferably, the vocoder module may employ a Multi-band MelGAN vocoder.

Although the present invention has been described herein with reference to the illustrated embodiments, which are intended as preferred embodiments, it is to be understood that the invention is not limited thereto, and that those skilled in the art can devise numerous other modifications and embodiments that fall within the spirit and scope of the principles of this disclosure.
