Original-voice speech translation method

Document No.: 36608 | Published: 2021-09-24

Note: This technology, "An original-voice speech translation method" (一种原声语音翻译方法), was designed and created by 孟强祥, 田俊麟, and 宋昱 on 2021-05-31. Abstract: The invention discloses an original-voice speech translation method in the technical field of speech translation, comprising the following steps: source-language speech is collected; a voice feature learning module extracts the speaker's voice features and feeds them to a deep neural network (DNN) for training; an STT module converts the source speech into text, which is obtained by the translation module and the language feature learning module respectively, the latter extracting and recording the language features of the source language; and a speech synthesis module performs speech synthesis simulation. By feeding the language pronunciation features into the DNN as feature values, the method obtains, after learning, a language-feature model vector and a human-voice-feature model vector for reference by the translation and synthesis modules respectively; the synthesis module then produces a voice similar to the speaker's, so that the translated, synthesized speech is highly close to the speaker's own characteristics.

1. An original-voice speech translation method, comprising the steps of:

Step one: source-language speech collection, wherein voice information is collected by a voice collection module and then sent to a voice feature learning module and an STT (Speech-To-Text) module.

Step two: the voice feature learning module extracts the speaker's voice features, and the extracted features are learned by a deep neural network (DNN) to establish a voice feature model. The language pronunciation features are fed into the DNN as feature values for training and learning; after learning, a language-feature model vector and a human-voice-feature model vector are obtained for reference by the translation module and the synthesis module, respectively.

Step three: the STT module converts the source speech into text information, which is obtained by the translation module and the language feature learning module respectively. The language feature learning module extracts and records the language features of the source language; the language feature model is corrected after the features are learned by the DNN, and the parameters used by the model serve as important reference parameters for the translation module, acting as translation pre-judgment information;

Step four: speech synthesis simulation is performed by the speech synthesis module, using the language feature model, corrected after translation and DNN learning, as the information basis of the speech output; the language information is simulated and output; a synthesized voice model is established by combining a time-interval model and a fundamental-frequency model to generate a time-frequency spectrum signal; and the synthesis module performs synthesis processing with the Griffin-Lim algorithm to obtain a speech signal with the corresponding voice characteristics, wherein the synthesized voice model is as follows:

x_{i+1} = F^{-1}(S · P_i), where P_i = F(x_i) / |F(x_i)|, and:

S is the given time-frequency spectrum signal,

x_i is the signal reconstructed at the i-th iteration,

F is the short-time Fourier transform,

F^{-1} is the inverse transform,

S_i and P_i respectively denote the magnitude and phase of the short-time Fourier transform of x_i;

Step five: the signal is reconstructed iteratively until the synthesized speech is closest to the speaker's language and voice characteristics, and the translated content is played in real time, completing the speech translation process.

2. The original-voice speech translation method according to claim 1, wherein in step one the source speech collection comprises preprocessing and judgment of the sound signal; the preprocessing comprises signal-optimizing operations such as speech enhancement, background sound elimination, and echo suppression; and the judgment comprises determining whether the sound signal contains language information, the current information being discarded if no language information is detected.

3. The original-voice speech translation method according to claim 1, wherein in step two a pre-trained voice feature model is provided, and the model is corrected each time a new voice feature is learned.

4. The original-voice speech translation method according to claim 1, wherein the voice feature learning module in step two performs feature extraction; the extracted features mainly comprise language pronunciation features such as vowels, consonants, and voiced sounds, and further comprise the speaker's pronunciation characteristics such as sound intensity, pitch, and timbre.

5. The original-voice speech translation method according to claim 1, wherein the main modules of the translation process in step three execute synchronously in real time, while the learning of voice and language features and the model correction process can execute asynchronously, so that the real-time performance of the translation process is not affected.

Technical Field

The invention relates to the technical field of speech translation, and in particular to an original-voice speech translation method.

Background

Advances in artificial intelligence have driven substantial development and application of speech translation. In a typical speech translation process, the speaker's source speech signal is first converted into source text information, the source text is converted into target-language text by a text translation module, and a speech synthesis module then generates a target-language speech signal and plays it to complete the translation.

Disclosure of Invention

The invention aims to provide an original-voice speech translation method to overcome the defects in the prior art.

In order to achieve the above purpose, the invention provides the following technical solution: an original-voice speech translation method, comprising the following steps:

Step one: source-language speech collection, wherein voice information is collected by a voice collection module and then sent to a voice feature learning module and an STT (Speech-To-Text) module.

Step two: the voice feature learning module extracts the speaker's voice features, and the extracted features are learned by a deep neural network (DNN) to establish a voice feature model. The language pronunciation features are fed into the DNN as feature values for training and learning; after learning, a language-feature model vector and a human-voice-feature model vector are obtained for reference by the translation module and the synthesis module, respectively.

Step three: the STT module converts the source speech into text information, which is obtained by the translation module and the language feature learning module respectively. The language feature learning module extracts and records the language features of the source language; the language feature model is corrected after the features are learned by the DNN, and the parameters used by the model serve as important reference parameters for the translation module, acting as translation pre-judgment information;

Step four: speech synthesis simulation is performed by the speech synthesis module, using the language feature model, corrected after translation and DNN learning, as the information basis of the speech output; the language information is simulated and output; a synthesized voice model is established by combining a time-interval model and a fundamental-frequency model to generate a time-frequency spectrum signal; and the synthesis module performs synthesis processing with the Griffin-Lim algorithm to obtain a speech signal with the corresponding voice characteristics, wherein the synthesized voice model is as follows:

x_{i+1} = F^{-1}(S · P_i), where P_i = F(x_i) / |F(x_i)|, and:

S is the given time-frequency spectrum signal,

x_i is the signal reconstructed at the i-th iteration,

F is the short-time Fourier transform,

F^{-1} is the inverse transform,

S_i and P_i respectively denote the magnitude and phase of the short-time Fourier transform of x_i;

Step five: the signal is reconstructed iteratively until the synthesized speech is closest to the speaker's language and voice characteristics, and the translated content is played in real time, completing the speech translation process.

Preferably, the source speech collection in step one includes preprocessing and judgment of the sound signal; the preprocessing includes signal-optimizing operations such as speech enhancement, background sound elimination, and echo suppression; and the judgment includes determining whether the sound signal contains language information, the current information being discarded if no language information is detected.

Preferably, a pre-trained voice feature model is provided in step two, and the model is corrected each time a new voice feature is learned.

Preferably, the voice feature learning module in step two performs feature extraction; the extracted features mainly include language pronunciation features such as vowels, consonants, and voiced sounds, and further include the speaker's pronunciation characteristics such as sound intensity, pitch, and timbre.

Preferably, the main modules of the translation process in step three execute synchronously in real time, while the learning of voice and language features and the model correction process can execute asynchronously, so that the real-time performance of the translation process is not affected.

In the above technical solution, the invention provides the following technical effects and advantages:

The invention collects voice information through a voice collection module and feeds the language pronunciation features into a deep neural network (DNN) as feature values for training and learning; after learning, a language-feature model vector and a human-voice-feature model vector are obtained for reference by the translation module and the synthesis module, respectively. Meanwhile, the STT module converts the source speech into text, and the language feature model, corrected through DNN learning, serves as translation pre-judgment information. The speech synthesis module then performs speech synthesis simulation: based on language information carrying the speaker's speaking style, it produces a voice similar to the speaker's, so that the translated, synthesized speech is highly close to the speaker's own characteristics.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the invention; other drawings can be derived from them by those skilled in the art.

Fig. 1 is a schematic view of the overall structure of the present invention.

Fig. 2 is a flow chart of sound feature extraction according to the present invention.

Fig. 3 is a diagram of the ADSR envelope according to the present invention.

Fig. 4 is a logical block diagram of model reconstruction according to the present invention.

Description of reference numerals:

A (Attack): the time from silence to the peak of the pronunciation; this is the energy burst phase;

D (Decay): the time for the pronunciation to fall from its peak to a stable level;

S (Sustain): the time interval of stable pronunciation;

R (Release): the time to fall back to silence after the pronunciation ends.

Detailed Description

In order to make the technical solutions of the present invention better understood, the invention is now described in further detail with reference to the accompanying drawings.

The invention provides an original-voice speech translation method, comprising the following steps:

Step one: source-language speech collection, wherein voice information is collected by a voice collection module and then sent to a voice feature learning module and an STT (Speech-To-Text) module.

Step two: the voice feature learning module extracts the speaker's voice features, and the extracted features are learned by a deep neural network (DNN) to establish a voice feature model. The language pronunciation features are fed into the DNN as feature values for training and learning; after learning, a language-feature model vector and a human-voice-feature model vector are obtained for reference by the translation module and the synthesis module, respectively.
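
By way of illustration, the following is a minimal sketch of such a feature-learning network, written in PyTorch (an assumption; the patent names no framework, and the layer sizes, the 40-dimensional input features, and the 128-dimensional output vector are illustrative choices, not values from the invention):

```python
import torch
import torch.nn as nn

class VoiceFeatureDNN(nn.Module):
    """Maps frame-level pronunciation features to one voice-feature vector."""
    def __init__(self, feat_dim: int = 40, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, feat_dim); mean-pool into one utterance vector
        return self.net(frames).mean(dim=0)

model = VoiceFeatureDNN()
utterance = torch.randn(200, 40)   # 200 frames of 40-dim pronunciation features
voice_vector = model(utterance)    # the human-voice feature model vector
```

Training such a network on the speaker's utterances would yield the stable feature vectors that the translation and synthesis modules consult.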

Step three: the STT module converts the source speech into text information, which is obtained by the translation module and the language feature learning module respectively. The language feature learning module extracts and records the language features of the source language; the language feature model is corrected after the features are learned by the DNN, and the parameters used by the model serve as important reference parameters for the translation module, acting as translation pre-judgment information;

Step four: speech synthesis simulation is performed by the speech synthesis module, using the language feature model, corrected after translation and DNN learning, as the information basis of the speech output; the language information is simulated and output; a synthesized voice model is established by combining a time-interval model and a fundamental-frequency model to generate a time-frequency spectrum signal; and the synthesis module performs synthesis processing with the Griffin-Lim algorithm to obtain a speech signal with the corresponding voice characteristics, wherein the synthesized voice model is as follows:

x_{i+1} = F^{-1}(S · P_i), where P_i = F(x_i) / |F(x_i)|, and:

S is the given time-frequency spectrum signal,

x_i is the signal reconstructed at the i-th iteration,

F is the short-time Fourier transform,

F^{-1} is the inverse transform,

S_i and P_i respectively denote the magnitude and phase of the short-time Fourier transform of x_i;

Given a time-frequency spectrum signal S, the goal is to reconstruct a signal whose own time-frequency spectrum is as close to S as possible;
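
A minimal sketch of this iterative reconstruction, written with librosa's STFT routines (librosa also ships a ready-made librosa.griffinlim); the iteration count and hop length below are illustrative assumptions:

```python
import numpy as np
import librosa

def griffin_lim(S, n_iter=60, hop_length=256):
    """Reconstruct a time signal whose STFT magnitude approaches S.

    S: target magnitude spectrogram of shape (1 + n_fft/2, frames).
    """
    n_fft = (S.shape[0] - 1) * 2
    rng = np.random.default_rng(0)
    # P_0: random initial phase estimate
    angles = np.exp(2j * np.pi * rng.random(S.shape))
    for _ in range(n_iter):
        # x_i = F^{-1}(S * P_i): impose the target magnitude, keep current phase
        x = librosa.istft(S * angles, hop_length=hop_length)
        # P_{i+1} = phase of F(x_i); its magnitude S_i is discarded
        spec = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)
        angles = np.exp(1j * np.angle(spec))
    return librosa.istft(S * angles, hop_length=hop_length)
```

Each pass replaces the magnitude of the current estimate with S while keeping the phase, so the reconstruction's spectrum moves steadily closer to S.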

The human voice features include:

Intensity: the strength of the pronunciation, i.e., the vibration amplitude of the audio signal;

Pitch: the vibration frequency of the audio signal;

Timbre: an important index by which a speaker's voice is distinguished from others'. Timbre is determined by the corresponding spectral envelope (Envelope), whose ADSR description consists of four parameters: Attack, Decay, Sustain, and Release. For the same words, different people's voices differ mainly in these four parameters;
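
To make the four parameters concrete, the sketch below builds a piecewise-linear ADSR amplitude envelope; all durations and the sustain level are illustrative values, not figures from the patent:

```python
import numpy as np

def adsr_envelope(sr, attack, decay, sustain_level, sustain_time, release):
    """Piecewise-linear ADSR amplitude envelope; all times in seconds."""
    a = np.linspace(0.0, 1.0, int(sr * attack), endpoint=False)           # A: silence to peak
    d = np.linspace(1.0, sustain_level, int(sr * decay), endpoint=False)  # D: peak to stable level
    s = np.full(int(sr * sustain_time), sustain_level)                    # S: stable pronunciation
    r = np.linspace(sustain_level, 0.0, int(sr * release))                # R: fall back to silence
    return np.concatenate([a, d, s, r])

env = adsr_envelope(sr=16000, attack=0.02, decay=0.05,
                    sustain_level=0.7, sustain_time=0.2, release=0.08)
```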

Step five: the signal is reconstructed iteratively until the synthesized speech is closest to the speaker's language and voice characteristics, and the translated content is played in real time, completing the speech translation process;

Further, in the above technical solution, the source speech collection in step one includes preprocessing and judgment of the sound signal; the preprocessing includes signal-optimizing operations such as speech enhancement, background sound elimination, and echo suppression; and the judgment includes determining whether the sound signal contains language information, the current information being discarded if no language information is detected;
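
The judgment step can be illustrated with a crude energy-based check for language information; the frame length and threshold ratio are assumptions of this sketch, not values specified by the invention:

```python
import numpy as np

def has_speech(signal, sr, frame_ms=25, energy_ratio=3.0):
    """Return False (discard) when no frame rises well above the noise floor."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)            # short-time energy per frame
    noise_floor = np.percentile(energy, 10) + 1e-10
    return bool((energy > energy_ratio * noise_floor).any())
```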

Further, in the above technical solution, a pre-trained voice feature model is provided in step two, and the model is corrected each time a new voice feature is learned;

Further, in the above technical solution, the voice feature learning module in step two performs feature extraction; the extracted features mainly include language pronunciation features such as vowels, consonants, and voiced sounds, and further include the speaker's pronunciation characteristics such as sound intensity, pitch, and timbre;
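
For illustration, the pronunciation characteristics named here (sound intensity, pitch, timbre) can be estimated with standard librosa calls; the file name "utterance.wav", the sampling rate, and the pitch search range are hypothetical:

```python
import librosa

# "utterance.wav" is a hypothetical input file; the pitch range covers adult speech.
y, sr = librosa.load("utterance.wav", sr=16000)

intensity = librosa.feature.rms(y=y)[0]              # sound intensity per frame
f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)        # pitch (fundamental frequency)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # timbre via the spectral envelope
```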

Further, in the above technical solution, the main modules of the translation process in step three execute synchronously in real time, while the learning of voice and language features and the model correction process can execute asynchronously, so that the real-time performance of the translation process is not affected;

the implementation mode is specifically as follows: after voice information is collected by a voice collecting module, the voice information is sent to a voice characteristic learning module and an STT module, voice characteristics are extracted and then are learned by a deep neural network DNN to establish a voice characteristic model, language pronunciation characteristics are used as characteristic values to be sent to the deep neural network DNN for training and learning, and after learning, a language characteristic model characteristic vector and a human voice characteristic model characteristic vector which are respectively used for reference of a translation and synthesis module are obtained, meanwhile, the STT module converts the character information of the source language, the language feature learning extracts and records the language feature of the source language, the language feature model is corrected after the deep neural network DNN learning and is used as the translation prejudgment information, then the speech synthesis simulation is carried out through the speech synthesis module, based on the language information with the speaking style of the speaker, the synthesized voice is synthesized to make a sound similar to the voice of the speaker, so that the synthesized voice after translation is highly close to the characteristics of the speaker.

While certain exemplary embodiments of the present invention have been described above by way of illustration only, it will be apparent to those of ordinary skill in the art that the described embodiments may be modified in various different ways without departing from the spirit and scope of the invention. Accordingly, the drawings and description are illustrative in nature and should not be construed as limiting the scope of the invention.
