Self-adaptive speech synthesis method and device

Publication No.: 1339658  Publication date: 2020-07-17

Description: This invention, "Self-adaptive speech synthesis method and device", was designed and created by He Laipeng on 2020-03-11. Abstract: The invention discloses an adaptive speech synthesis method and apparatus, comprising: training a preset neural network model with preset recordings and the text annotation data corresponding to those recordings, to obtain a trained preset neural network model; designing a recording text library from which a user selects a target recording text and records it, yielding a current recording; performing secondary training of the trained preset neural network model with the current recording and the target recording text; and using the secondarily trained model to extract static speech parameters of the text to be synthesized, which are input into a synthesizer to obtain synthesized speech. This effectively solves the prior-art problem that synthesized speech is low in both quality and accuracy, caused by the small amount and usually low quality of training data and by insufficient model prediction accuracy, and improves the user experience.

1. An adaptive speech synthesis method, comprising the steps of:

training a preset neural network model by using preset recordings and text annotation data corresponding to the preset recordings, to obtain a trained preset neural network model;

designing a recording text base for a user to select a target recording text for recording to obtain a current recording;

performing secondary training on the trained preset neural network model by using the current recording and the target recording text;

and extracting static speech parameters of a text to be synthesized by using the secondarily trained preset neural network model, and inputting the static speech parameters into a synthesizer to obtain synthesized speech.

2. The adaptive speech synthesis method of claim 1, wherein the designing of the recording text library for a user to select a target recording text for recording to obtain a current recording comprises:

establishing a blank recording text library in advance;

acquiring N recording texts, and entering the N recording texts into the blank recording text library to form the recording text library;

when an instruction requesting recording is received from a user, pushing M first recording texts for selection, wherein each first recording text is one of the recording texts;

determining the first recording text selected by the user from among the M first recording texts as the target recording text;

and receiving the user's current recording based on the target recording text.

3. The adaptive speech synthesis method of claim 1, wherein before the secondary training of the trained preset neural network model using the current recording and the target recording text, the method further comprises:

acquiring each sentence of speech in the current recording;

removing silence segments exceeding a preset duration from each sentence of speech;

preprocessing each sentence of speech by denoising and dereverberation;

detecting whether the preprocessed current speech is complete;

if so, using the annotation corresponding to the target recording text;

otherwise, reminding the user that the preprocessed current speech does not meet the requirement.

4. The adaptive speech synthesis method of claim 1, wherein the secondary training of the trained preset neural network model using the current recording and the target recording text comprises:

extracting acoustic characteristic parameters of the preprocessed current voice;

extracting first linguistic information associated with context from the content of the target recording text;

generating training data according to the acoustic characteristic parameters and the first linguistic information;

and performing secondary training on the trained preset neural network model by using the training data.

5. The adaptive speech synthesis method according to claim 1, wherein the extracting a static speech parameter of a text to be synthesized by using the secondarily trained preset neural network model, and inputting the static speech parameter into a synthesizer to obtain a synthesized speech, comprises:

acquiring second linguistic information of the text to be synthesized;

inputting the second linguistic information into the secondarily trained preset neural network model to obtain voice characteristic parameters;

obtaining static voice parameters according to the voice characteristic parameters;

inputting the static voice parameters into a synthesizer for synthesis;

and outputting the synthesized voice after the synthesis is finished.

6. An adaptive speech synthesis apparatus, comprising:

the first training module is used for training a preset neural network model by using preset recordings and text annotation data corresponding to the preset recordings, to obtain a trained preset neural network model;

the recording module is used for designing a recording text base for a user to select a target recording text for recording to obtain a current recording;

the second training module is used for carrying out secondary training on the trained preset neural network model by utilizing the current recording and the target recording text;

and the synthesis module is used for extracting the static voice parameters of the text to be synthesized by using the preset neural network model after the secondary training and inputting the static voice parameters into the synthesizer to obtain the synthesized voice.

7. The adaptive speech synthesis apparatus of claim 6, wherein the recording module comprises:

establishing a submodule for establishing a blank recording text base in advance;

the first obtaining submodule is used for obtaining N recording texts and inputting the N recording texts into the blank recording text base to form the recording text base;

the pushing submodule is used for pushing M first recording texts for selection when receiving a recording request instruction of a user, wherein the first recording texts are any one of the recording texts;

the determining submodule is used for determining the first sound recording text selected by the user in the M first sound recording texts as the target sound recording text;

and the receiving submodule is used for receiving the current sound recording of the user based on the target sound recording text.

8. The adaptive speech synthesis apparatus of claim 6, wherein the apparatus further comprises:

the acquisition module is used for acquiring each sentence of speech in the current recording;

the removing module is used for removing silence segments exceeding a preset duration from each sentence of speech;

the preprocessing module is used for preprocessing each sentence of speech by denoising and dereverberation;

the detection module is used for detecting whether the preprocessed current speech is complete;

the determining module is used for using the annotation corresponding to the target recording text when the detection module detects that the preprocessed current speech is complete;

and the reminding module is used for reminding the user that the preprocessed current speech does not meet the requirement when the detection module detects that the preprocessed current speech is not complete.

9. The adaptive speech synthesis apparatus of claim 6, wherein the second training module comprises:

the first extraction submodule is used for extracting the acoustic characteristic parameters of the preprocessed current voice;

the second extraction submodule is used for extracting the first linguistic information associated with the context in the target sound recording text content;

the generating submodule is used for generating training data according to the acoustic characteristic parameters and the first linguistic information;

and the training submodule is used for carrying out secondary training on the trained preset neural network model by utilizing the training data.

10. The adaptive speech synthesis apparatus of claim 6, wherein the synthesis module comprises:

the second obtaining submodule is used for obtaining second linguistic information of the text to be synthesized;

the obtaining submodule is used for inputting the second linguistic information into the preset neural network model after the secondary training to obtain voice characteristic parameters;

the third obtaining submodule is used for obtaining static voice parameters according to the voice characteristic parameters;

the synthesis submodule is used for inputting the static voice parameters into a synthesizer for synthesis;

and the output submodule is used for outputting the synthesized voice after the synthesis is finished.

Technical Field

The invention relates to the technical field of voice synthesis, in particular to a self-adaptive voice synthesis method and a self-adaptive voice synthesis device.

Background

In recent years, as speech technology has matured, speech synthesis has gradually been applied to speech signal processing systems such as voice interaction, voice broadcasting, and personalized voice production. In the social and commercial fields, synthesized speech gives products a voice, bringing convenience and richness to social life, and it holds broad potential value. Because conventional training requires a large amount of high-quality speech, adaptive speech synthesis systems were proposed: a synthesis system is built quickly from a small amount of a target speaker's recordings and text data and generates synthesized speech in the target speaker's timbre. However, this approach has a drawback: the amount of training data is small and its quality is usually low, and the model's prediction accuracy is insufficient, so the quality and accuracy of the synthesized speech are both low, degrading the user experience.

Disclosure of Invention

To address the problems described above, the method performs secondary training of the trained preset neural network model based on the user's current recording data, and finally synthesizes speech for the text to be synthesized using the secondarily trained preset neural network model.

An adaptive speech synthesis method comprising the steps of:

training a preset neural network model by using preset recordings and text annotation data corresponding to the preset recordings, to obtain a trained preset neural network model;

designing a recording text base for a user to select a target recording text for recording to obtain a current recording;

performing secondary training on the trained preset neural network model by using the current recording and the target recording text;

and extracting static speech parameters of a text to be synthesized by using the secondarily trained preset neural network model, and inputting the static speech parameters into a synthesizer to obtain synthesized speech.

Preferably, designing the recording text library for the user to select a target recording text for recording, to obtain the current recording, includes:

establishing a blank recording text library in advance;

acquiring N recording texts, and entering the N recording texts into the blank recording text library to form the recording text library;

when an instruction requesting recording is received from a user, pushing M first recording texts for selection, wherein each first recording text is one of the recording texts;

determining the first recording text selected by the user from among the M first recording texts as the target recording text;

and receiving the user's current recording based on the target recording text.

Preferably, before the secondary training of the trained preset neural network model using the current recording and the target recording text, the method further includes:

acquiring each sentence of speech in the current recording;

removing silence segments exceeding a preset duration from each sentence of speech;

preprocessing each sentence of speech by denoising and dereverberation;

detecting whether the preprocessed current speech is complete;

if so, using the annotation corresponding to the target recording text;

otherwise, reminding the user that the preprocessed current speech does not meet the requirement.

Preferably, the performing secondary training on the trained preset neural network model by using the current recording and the target recording text includes:

extracting acoustic characteristic parameters of the preprocessed current voice;

extracting first linguistic information associated with the context in the target sound recording text content;

generating training data according to the acoustic characteristic parameters and the first linguistic information;

and performing secondary training on the trained preset neural network model by using the training data.

Preferably, the extracting, by using the secondarily trained preset neural network model, the static speech parameter of the text to be synthesized, and inputting the static speech parameter into the synthesizer to obtain the synthesized speech includes:

acquiring second linguistic information of the text to be synthesized;

inputting the second linguistic information into the secondarily trained preset neural network model to obtain voice characteristic parameters;

obtaining static voice parameters according to the voice characteristic parameters;

inputting the static voice parameters into a synthesizer for synthesis;

and outputting the synthesized voice after the synthesis is finished.

An adaptive speech synthesis apparatus, the apparatus comprising:

the first training module is used for training a preset neural network model by using a preset sound record and text marking data corresponding to the preset sound record to obtain the trained preset neural network model;

the recording module is used for designing a recording text base for a user to select a target recording text for recording to obtain a current recording;

the second training module is used for carrying out secondary training on the trained preset neural network model by utilizing the current recording and the target recording text;

and the synthesis module is used for extracting the static voice parameters of the text to be synthesized by using the preset neural network model after the secondary training and inputting the static voice parameters into the synthesizer to obtain the synthesized voice.

Preferably, the sound recording module includes:

establishing a submodule for establishing a blank recording text base in advance;

the first obtaining submodule is used for obtaining N recording texts and inputting the N recording texts into the blank recording text base to form the recording text base;

the pushing submodule is used for pushing M first recording texts for selection when receiving a recording request instruction of a user, wherein the first recording texts are any one of the recording texts;

the determining submodule is used for determining the first sound recording text selected by the user in the M first sound recording texts as the target sound recording text;

and the receiving submodule is used for receiving the current sound recording of the user based on the target sound recording text.

Preferably, the apparatus further comprises:

the acquisition module is used for acquiring each sentence of voice in the current recording;

the removing module is used for removing the mute sections exceeding the preset duration in each sentence of voice;

the preprocessing module is used for carrying out preprocessing of denoising and dereverberating on each sentence of voice;

the detection module is used for detecting whether the current voice after the preprocessing is complete;

the determining module is used for using the label corresponding to the target recording text when the detecting module detects that the preprocessed current voice is complete;

and the reminding module is used for reminding a user that the preprocessed current voice does not meet the requirement when the detection module detects that the preprocessed current voice is not complete.

Preferably, the second training module includes:

the first extraction submodule is used for extracting the acoustic characteristic parameters of the preprocessed current voice;

the second extraction submodule is used for extracting the first linguistic information associated with the context in the target sound recording text content;

the generating submodule is used for generating training data according to the acoustic characteristic parameters and the first linguistic information;

and the training submodule is used for carrying out secondary training on the trained preset neural network model by utilizing the training data.

Preferably, the synthesis module comprises:

the second obtaining submodule is used for obtaining second linguistic information of the text to be synthesized;

the obtaining submodule is used for inputting the second linguistic information into the preset neural network model after the secondary training to obtain voice characteristic parameters;

the third obtaining submodule is used for obtaining static voice parameters according to the voice characteristic parameters;

the synthesis submodule is used for inputting the static voice parameters into a synthesizer for synthesis;

and the output submodule is used for outputting the synthesized voice after the synthesis is finished.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

FIG. 1 is a flowchart illustrating a method for adaptive speech synthesis according to the present invention;

FIG. 2 is another flowchart of an adaptive speech synthesis method according to the present invention;

FIG. 3 is a block diagram of an adaptive speech synthesis apparatus according to the present invention;

fig. 4 is another structural diagram of an adaptive speech synthesis apparatus provided in the present invention.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

In recent years, as speech technology has matured, speech synthesis has gradually been applied to speech signal processing systems such as voice interaction, voice broadcasting, and personalized voice production. In the social and commercial fields, synthesized speech gives products a voice, bringing convenience and richness to social life, and it holds broad potential value. Because conventional training requires a large amount of high-quality speech, adaptive speech synthesis systems were proposed: a synthesis system is built quickly from a small amount of a target speaker's recordings and text data and generates synthesized speech in the target speaker's timbre. However, this approach has the following disadvantages: 1. The amount of training data is small and its quality is usually low, and the model's prediction accuracy is insufficient, so the quality and accuracy of the synthesized speech are both low, degrading the user experience. 2. Recording environments differ greatly between users, and the recorded speech data may contain over-long silence segments, noise, reverberation, and other interference, which harms model training. 3. Users' actual pronunciations are inconsistent with the recording texts, with dropped words, extra words, repetitions, misreadings, over-long pauses, and similar phenomena, so the audio data and the text annotations do not match, which also harms model training. To solve these problems, this embodiment discloses a method that performs secondary training of a trained preset neural network model based on the user's current recording data and finally synthesizes speech for the text to be synthesized using the secondarily trained model.

An adaptive speech synthesis method, as shown in fig. 1, includes the following steps:

step S101, training a preset neural network model by using a preset sound recording and text label data corresponding to the preset sound recording to obtain the trained preset neural network model;

step S102, designing a recording text base for a user to select a target recording text for recording to obtain a current recording;

step S103, performing secondary training on the trained preset neural network model by using the current recording and the target recording text;

step S104, extracting static voice parameters of the text to be synthesized by using the preset neural network model after secondary training, and inputting the static voice parameters into a synthesizer to obtain synthesized voice;

in this embodiment, a large amount of high-quality preset recordings and text label data corresponding to the preset recordings are used to train a preset neural network model to obtain a trained preset neural network model, then a user selects a suitable target recording text to record according to own preference and demand to obtain a small amount of current recording, and then the trained preset neural network model is subjected to secondary training according to the small amount of recording to obtain a new model capable of synthesizing own voice, and a static voice parameter of any text to be synthesized can be extracted according to the new model and input into a synthesizer to obtain synthesized voice of the user. The preset neural network model comprises a duration model and an acoustic model, namely, different voices are synthesized according to different timbres of the users while paying attention to the duration of the voice synthesis of the users.

The working principle of this technical solution is as follows: a preset neural network model is trained with preset recordings and their corresponding text annotation data to obtain a trained preset neural network model; a recording text library is designed, from which the user selects a target recording text and records it to obtain a current recording; the trained model undergoes secondary training with the current recording and the target recording text; and the secondarily trained model extracts static speech parameters of the text to be synthesized, which are input into a synthesizer to obtain synthesized speech.

The beneficial effects of this technical solution are as follows. The preset neural network model is first trained on preset recordings, then secondarily trained on the user's current recording, and finally used to synthesize speech for the user's text. Because the first training uses a large amount of high-quality preset recordings, the model synthesizes speech with very high quality and accuracy; secondary training on the current recording then yields a model that synthesizes the user's own voice, likewise with high quality and accuracy. This effectively solves the prior-art problem of low quality and accuracy in synthesized speech caused by small, usually low-quality training data and insufficient model prediction accuracy, and improves the user experience. Moreover, the user can select a target text from the recording text library, which diversifies the choices and mitigates the prior-art problem that, because users' recording environments differ greatly, recorded speech data contains over-long silence segments, noise, reverberation, and other interference that harms model training.

In one embodiment, designing a recording text library for the user to select a target recording text for recording, to obtain a current recording, includes:

establishing a blank recording text library in advance;

acquiring N recording texts, and entering the N recording texts into the blank recording text library to form the recording text library;

when an instruction requesting recording is received from a user, pushing M first recording texts for selection, wherein each first recording text is one of the recording texts;

determining the first recording text selected by the user from among the M first recording texts as the target recording text;

and receiving the user's current recording based on the target recording text.
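The library flow above, building the library from N texts, pushing M candidates on a recording request, and taking the user's pick as the target recording text, might be sketched as follows (a hypothetical illustration; the class and method names are not from the patent):

```python
import random

# Hypothetical sketch of the recording-text library flow: fill a blank
# library with N texts, push M candidates when recording is requested,
# and treat the user's selection as the target recording text.

class RecordingTextLibrary:
    def __init__(self):
        self.texts = []           # starts as the "blank" library

    def load(self, texts):
        self.texts.extend(texts)  # enter the N recording texts

    def push_candidates(self, m):
        # Push M first recording texts for the user to choose from.
        return random.sample(self.texts, m)

lib = RecordingTextLibrary()
lib.load([f"sentence {i}" for i in range(10)])  # N = 10
candidates = lib.push_candidates(3)             # M = 3
target_text = candidates[0]                     # the user's selection
```

The user then records the chosen `target_text`, producing the current recording used for secondary training.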

The beneficial effects of this technical solution are as follows: by offering selectable first recording texts, a user can choose a target recording text suited to his or her age, education level, region, and usage scenario; the many different choices further improve the user experience.

In one embodiment, before performing the secondary training on the trained preset neural network model by using the current recording and the target recording text, the method further includes:

acquiring each sentence of speech in the current recording;

removing silence segments exceeding a preset duration from each sentence of speech;

preprocessing each sentence of speech by denoising and dereverberation;

detecting whether the preprocessed current speech is complete;

if so, using the annotation corresponding to the target recording text;

otherwise, reminding the user that the preprocessed current speech does not meet the requirement;

in this embodiment, the step of detecting whether the current voice after the preprocessing is complete includes: if the insertion error or the deletion error exists in the processed current voice, prompting the user that the recording quality does not meet the requirement, and enabling the user to select to repeat the current text or switch a new text for re-recording. If there are no insertion and deletion errors, but there are replacement errors, the piece of speech is accepted and the original recorded text is replaced with the text recognized by the recognizer to generate the annotation. And if the identification error does not exist, using the label corresponding to the original recording text.

The beneficial effects of this technical solution are as follows: denoising and dereverberating the recorded speech and removing redundant silence segments improves speech quality and provides good samples for synthesis. Detecting whether the preprocessed current speech is complete, and choosing between re-recording and correcting the text annotation according to the current speech quality, keeps the recording and the text annotation consistent and improves data quality.

In one embodiment, as shown in fig. 2, the secondary training of the trained preset neural network model using the current recording and the target recording text includes:

step S201, extracting acoustic characteristic parameters of the preprocessed current voice;

step S202, extracting first linguistic information associated with context in the target sound recording text content;

step S203, generating training data according to the acoustic characteristic parameters and the first linguistic information;

and S204, performing secondary training on the trained preset neural network model by using the training data.
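Steps S201 to S203 amount to pairing frame-level acoustic features with context-dependent linguistic features to form training examples. A minimal sketch, with hypothetical array shapes (the patent does not fix the feature dimensions):

```python
import numpy as np

# Hypothetical sketch of steps S201-S203: pair frame-level acoustic feature
# vectors (e.g. spectral and F0 parameters) with context-dependent
# linguistic feature vectors to form (input, target) training pairs.

def make_training_data(linguistic_feats, acoustic_feats):
    """linguistic_feats: (T, L) array; acoustic_feats: (T, A) array."""
    assert len(linguistic_feats) == len(acoustic_feats), "frame counts must match"
    return list(zip(linguistic_feats, acoustic_feats))

T, L, A = 5, 8, 3                 # frames, linguistic dim, acoustic dim
ling = np.zeros((T, L))           # stand-in for extracted linguistic info
acou = np.ones((T, A))            # stand-in for extracted acoustic features
pairs = make_training_data(ling, acou)
```

Each pair is one training example for step S204: the network maps the linguistic input to the acoustic target.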

The beneficial effects of the above technical scheme are: the preset neural network model is trained secondarily by using the acoustic characteristic parameters and the first linguistic information, so that the accuracy of modeling of the voice characteristic parameters is improved, and a good model is provided for extracting static voice parameters required by voice synthesis.

In one embodiment, extracting static speech parameters of a text to be synthesized by using a preset neural network model after secondary training, and inputting the static speech parameters into a synthesizer to obtain synthesized speech, the method includes:

acquiring second linguistic information of a text to be synthesized;

inputting the second linguistic information into the preset neural network model after secondary training to obtain voice characteristic parameters;

obtaining static voice parameters according to the voice characteristic parameters;

inputting the static speech parameters into a synthesizer for synthesis;

outputting synthesized voice after the synthesis is finished;

in this embodiment, the speech feature parameters include dynamic speech parameters, and the dynamic speech parameters are converted into static speech parameters according to the established model.

The beneficial effects of this technical solution are as follows: synthesizing speech from static speech parameters is more stable than synthesizing from dynamic parameters, and removing the unstable components yields higher-quality synthesized speech.

In one embodiment, the method comprises the following steps:

Step 1: train a multi-speaker mixed base neural network model (using a model structure combining a feedforward neural network and an RNN-LSTM) on high-quality multi-speaker recordings and text annotation data, and add speaker embedding information to the neural network input to improve the stability of timbre modeling;

Step 2: design a recording text library under the principle of guaranteeing phoneme coverage, with far more texts than the number N of sentences that actually need to be recorded. Randomly select N recording texts for each user; for each recording text, the user may choose to skip it and switch to a new one;

Step 3: when the user records speech, pass the recorded audio through the audio preprocessing module, remove over-long silence segments from the recording, and apply denoising and dereverberation to the input audio;

Step 4: send the processed audio to an audio quality evaluation module for speech recognition detection. If an insertion error or a deletion error is found, prompt the user that the recording quality does not meet the requirement; the user may repeat the current text or switch to a new text and re-record. If there are no insertion or deletion errors but there are substitution errors, accept the utterance and replace the original recording text with the text recognized by the recognizer to generate the annotation. If there are no recognition errors, use the annotation corresponding to the original recording text;
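The accept/relabel/re-record rule above hinges on classifying recognition errors by type, which can be done with an edit-distance alignment between the prompt text and the recognizer output. A minimal word-level sketch:

```python
def edit_ops(ref, hyp):
    """Count (substitutions, insertions, deletions) aligning hyp words to ref."""
    R, H = len(ref), len(hyp)
    # dp[i][j] = (cost, subs, ins, dels) for ref[:i] vs hyp[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        c = dp[i - 1][0]
        dp[i][0] = (c[0] + 1, c[1], c[2], c[3] + 1)          # deletions
    for j in range(1, H + 1):
        c = dp[0][j - 1]
        dp[0][j] = (c[0] + 1, c[1], c[2] + 1, c[3])          # insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            diag = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                cands = [diag]                                # match
            else:
                cands = [(diag[0] + 1, diag[1] + 1, diag[2], diag[3])]
            d = dp[i - 1][j]
            cands.append((d[0] + 1, d[1], d[2], d[3] + 1))    # deletion
            s = dp[i][j - 1]
            cands.append((s[0] + 1, s[1], s[2] + 1, s[3]))    # insertion
            dp[i][j] = min(cands)
    return dp[R][H][1:]

def check_recording(ref_text, asr_text):
    """Apply the acceptance rule: reject on ins/del, relabel on subs only."""
    subs, ins, dels = edit_ops(ref_text.split(), asr_text.split())
    if ins or dels:
        return "re-record"
    if subs:
        return "accept-relabel"   # keep the audio, annotate with the ASR text
    return "accept"
```

Insertions and deletions usually mean the user misread or dropped words, so the audio is rejected; a pure substitution is tolerated by trusting the recognizer's transcript for the annotation, keeping recording and labels consistent.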

Step 5: use the speech feature parameters and the linguistic information to generate training data for the neural network model; with the basic model of Step 1 as the source model, retrain the duration and acoustic neural network models using adaptation techniques;
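The adaptation retraining amounts to continuing optimization from the source model's weights on the small target-speaker dataset. A toy linear stand-in (the real duration and acoustic models are the feedforward/RNN-LSTM networks above) makes the mechanism concrete:

```python
import numpy as np

def finetune(source_W, X, y, lr=0.2, steps=500):
    """Continue gradient descent from source-model weights on target data.

    Toy linear stand-in for adapting the duration/acoustic networks:
    starting from source_W rather than random weights is what lets a
    small amount of target-speaker data suffice.
    """
    W = source_W.copy()
    for _ in range(steps):
        grad = X.T @ (X @ W - y) / len(X)   # squared-error gradient
        W -= lr * grad
    return W
```

The same loop applied to the neural networks, typically with a reduced learning rate, shifts the source model toward the target speaker without discarding what the multi-speaker training already learned.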

Step 6: in the synthesis stage, obtain the context-dependent linguistic information of the input text to be synthesized through a front-end model; run inference with the duration and acoustic neural network models trained in Step 5 to obtain the speech feature parameters (including dynamic feature parameters); obtain smooth static speech feature parameters through a parameter generation module; and send these feature parameters to a synthesizer to obtain the synthesized speech of the target speaker.

The beneficial effects of this technical solution are: 1. it solves the problem that the existing system is fixed and unchangeable, giving the user freedom to choose recording texts within a framed range and improving the accuracy and fluency of pronunciation; 2. denoising and dereverberation are applied to the recording and redundant silent segments are removed, improving the speech quality; 3. quality evaluation is performed on the recording file, and re-recording or text relabeling is chosen according to the recording quality, ensuring consistency between the recording and the text annotation, improving data quality, and making full use of the small amount of training data in the adaptive system; 4. a neural network model (feedforward network plus RNN-LSTM structure) is adopted, and dynamic parameter modeling is combined with a maximum likelihood parameter generation algorithm, improving the accuracy of speech feature parameter modeling.

The embodiment also discloses an adaptive speech synthesis apparatus, as shown in fig. 3, the apparatus includes:

the first training module 301 is configured to train a preset neural network model by using a preset sound recording and text label data corresponding to the preset sound recording to obtain the trained preset neural network model;

the recording module 302 is configured to design a recording text library for a user to select a target recording text for recording, so as to obtain a current recording;

the second training module 303 is configured to perform secondary training on the trained preset neural network model by using the current recording and the target recording text;

and the synthesis module 304 is configured to extract a static speech parameter of the text to be synthesized by using the secondarily trained preset neural network model, and input the static speech parameter into the synthesizer to obtain a synthesized speech.

In one embodiment, a sound recording module includes:

establishing a submodule for establishing a blank recording text base in advance;

the first obtaining submodule is used for obtaining N recording texts and inputting the N recording texts into a blank recording text base to form a recording text base;

the pushing submodule is used for pushing M first recording texts for selection when receiving a recording request instruction of a user, wherein each first recording text is any one of the recording texts in the recording text library;

the determining submodule is used for determining the first sound recording text selected by the user in the M first sound recording texts as a target sound recording text;

and the receiving submodule is used for receiving the current sound recording of the user based on the target sound recording text.

In one embodiment, the above apparatus further comprises:

the acquisition module is used for acquiring each sentence of voice in the current recording;

the removing module is used for removing the mute sections exceeding the preset time length in each sentence of voice;

the preprocessing module is used for carrying out denoising and dereverberation preprocessing on each sentence of voice;

the detection module is used for detecting whether the current voice after the preprocessing is complete;

the determining module is used for using the label corresponding to the target recording text when the detection module detects that the preprocessed current voice is complete;

and the reminding module is used for reminding the user that the preprocessed current voice does not meet the requirement when the detection module detects that the preprocessed current voice is not complete.

In one embodiment, as shown in fig. 4, the second training module includes:

a first extraction submodule 3031, configured to extract acoustic feature parameters of the preprocessed current voice;

the second extraction submodule 3032 is used for extracting the first linguistic information associated with the context in the target sound recording text content;

a generating submodule 3033, configured to generate training data according to the acoustic feature parameter and the first linguistic information;

and the training submodule 3034 is configured to perform secondary training on the trained preset neural network model by using the training data.

In one embodiment, a synthesis module comprises:

the second obtaining submodule is used for obtaining second linguistic information of the text to be synthesized;

the obtaining submodule is used for inputting the second linguistic information into the preset neural network model after secondary training to obtain the voice characteristic parameters;

the third obtaining submodule is used for obtaining the static voice parameters according to the voice characteristic parameters;

the synthesis submodule is used for inputting the static voice parameters into the synthesizer for synthesis;

and the output submodule is used for outputting the synthesized voice after the synthesis is finished.

It will be understood by those skilled in the art that the terms "first" and "second" in the present invention merely distinguish different stages of application.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
