Method for improving personalized synthesized voice quality

Document No.: 1339707. Publication date: 2020-07-17.

Reading note: this technology, "Method for improving personalized synthesized voice quality", was designed by 丁少为 and 关海欣 on 2020-03-11. Abstract: The invention relates to the technical field of voice processing and provides a method for improving the quality of personalized synthesized voice, comprising the following steps: S100, collecting user voice as original voice data; S200, performing noise reduction processing on the original voice data; S300, generating a personalized voice model from the base model through model conversion, using the noise-reduced voice data; S400, performing voice synthesis with the personalized voice model to obtain personalized synthesized voice in the user's voice. By performing offline noise reduction on the data collected from the user, the method improves the quality of the personalized voice model; the resulting higher-quality model is then used for personalized voice synthesis of the user's voice, thereby improving the quality of the personalized synthesized voice.

1. A method for improving the quality of personalized synthesized speech, comprising the steps of:

S100, collecting user voice as original voice data;

S200, performing noise reduction processing on the original voice data;

S300, generating a personalized voice model from the base model through model conversion, using the noise-reduced voice data;

S400, performing voice synthesis with the personalized voice model to obtain personalized synthesized voice.

2. The method according to claim 1, wherein in step S200, an offline denoising processing manner is adopted to perform denoising processing on the speech data.

3. The method of claim 2, wherein the offline denoising comprises:

S210, performing pre-emphasis processing on the original voice data and then performing a short-time Fourier transform;

S250, filtering and denoising the transformed voice data with a noise reduction filter;

S290, performing an inverse short-time Fourier transform on the processed data to restore it to the time domain, and then performing de-emphasis to obtain the noise-reduced voice data.

4. A method for improving the quality of personalized synthesized speech according to claim 3, wherein the noise reduction filter is obtained by the steps of:

S220, estimating noise and effective voice in the short-time Fourier transformed voice data using a global noise estimation method;

S230, generating a noise reduction filter from the noise estimation result and the effective voice estimation result;

S240, smoothing the noise reduction filter.

5. The method according to claim 4, wherein in step S220, the global noise estimation method comprises the following steps:

S222, calculating the signal energy value of each time-frequency point of the short-time Fourier transformed voice data; selecting all time-frequency points whose energy is greater than zero, taking the logarithmic mean of their energies by time frame, and taking the exponential of that logarithmic mean as a first threshold; selecting from all time-frequency points those whose energy is greater than zero and lower than the first threshold, again taking the logarithmic mean by time frame, and taking its exponential as a second threshold; comparing each signal energy value with the second threshold, marking time-frequency points greater than or equal to the second threshold as effective voice points (1) and those below it as noise points (0), to obtain a preliminary time-frequency mask estimate;

S224, based on the preliminary time-frequency mask estimate, summing the marks of the time-frequency points in each frame; taking the logarithmic mean of the nonzero per-frame sums and taking its exponential as a third threshold; taking the logarithmic mean of the per-frame sums that are greater than zero and less than the third threshold and taking its exponential as a fourth threshold; comparing each per-frame sum with the fourth threshold, a frame whose sum is greater than the fourth threshold being a speech frame and any other frame being a noise frame, to obtain a frame mask estimate;

S226, estimating noise and effective voice from the frame mask estimate and the signal energy of each time-frequency point.

6. The method of claim 5, wherein the global noise estimation approach further comprises the following steps:

S225, first performing erosion processing on the frame mask estimate and then performing dilation processing, so as to improve its accuracy.

7. The method for improving the quality of personalized synthesized speech according to claim 1, wherein in step S100, the voices of different broadcast hosts are collected for adaptive model training to obtain a voice beautification model; and in step S400, the voice beautification model is used to beautify the personalized synthesized voice.

8. The method for improving the quality of personalized synthesized speech according to any one of claims 1 to 7, wherein in step S100 the voices of different users are collected, steps S200 and S300 are repeated to obtain a personalized voice model for each user, and a personalized voice model library is established; and in step S400, voice instruction information is received and user identification is performed to call the corresponding personalized voice model for voice synthesis.

9. The method of claim 8, wherein the user identification comprises: extracting the voice characteristics of each user's voice collected in step S100 as first characteristic information, storing the first characteristic information and the personalized voice model in the personalized voice model library, and associating the first characteristic information of each user with that same user's personalized voice model; in step S400, extracting the voice characteristics of the voice instruction information as second characteristic information, comparing the second characteristic information with each stored first characteristic information one by one, and, if they are consistent, selecting the associated personalized voice model for voice synthesis.

10. The method of claim 9, wherein the voice characteristics include at least sound wave frequency, sound wave amplitude, duration, and timbre.

Technical Field

The invention relates to the technical field of voice processing, in particular to a method for improving the quality of personalized synthesized voice.

Background

Speech synthesis technology is widely used. Because the sound quality and naturalness of synthesized speech are now good, users place more demands on synthesis systems, and speech synthesis is trending toward diversification and personalization, for example synthesis covering multiple speakers, multiple speaking styles, and multiple languages. Many software products offer speech synthesis, for example model-adaptation software built on trainable speech synthesis technology.

When user data is collected, effects such as noise and channel distortion are hard to avoid, so the user data does not match the base model. This mismatch substantially degrades the generated personalized speech synthesis model and, in turn, the quality of the personalized synthesized speech.

Disclosure of Invention

In order to solve the above technical problem, the present invention provides a method for improving the quality of personalized synthesized speech, which comprises the following steps:

S100, collecting user voice as original voice data;

S200, performing noise reduction processing on the original voice data;

S300, generating a personalized voice model from the base model through model conversion, using the noise-reduced voice data;

S400, performing voice synthesis with the personalized voice model to obtain personalized synthesized voice.

Optionally, in step S200, an offline denoising processing manner is adopted to perform denoising processing on the voice data.

Optionally, the offline denoising processing method includes the following steps:

S210, performing pre-emphasis processing on the original voice data and then performing a short-time Fourier transform;

S250, filtering and denoising the transformed voice data with a noise reduction filter;

S290, performing an inverse short-time Fourier transform on the processed data to restore it to the time domain, and then performing de-emphasis to obtain the noise-reduced voice data.

Optionally, the noise reduction filter is obtained by:

S220, estimating noise and effective voice in the short-time Fourier transformed voice data using a global noise estimation method;

S230, generating a noise reduction filter from the noise estimation result and the effective voice estimation result;

S240, smoothing the noise reduction filter.

Optionally, in step S220, the global noise estimation method includes the following steps:

S222, calculating the signal energy value of each time-frequency point of the short-time Fourier transformed voice data; selecting all time-frequency points whose energy is greater than zero, taking the logarithmic mean of their energies by time frame, and taking the exponential of that logarithmic mean as a first threshold; selecting from all time-frequency points those whose energy is greater than zero and lower than the first threshold, again taking the logarithmic mean by time frame, and taking its exponential as a second threshold; comparing each signal energy value with the second threshold, marking time-frequency points greater than or equal to the second threshold as effective voice points (1) and those below it as noise points (0), to obtain a preliminary time-frequency mask estimate;

S224, based on the preliminary time-frequency mask estimate, summing the marks of the time-frequency points in each frame; taking the logarithmic mean of the nonzero per-frame sums and taking its exponential as a third threshold; taking the logarithmic mean of the per-frame sums that are greater than zero and less than the third threshold and taking its exponential as a fourth threshold; comparing each per-frame sum with the fourth threshold, a frame whose sum is greater than the fourth threshold being a speech frame and any other frame being a noise frame, to obtain a frame mask estimate;

S226, estimating noise and effective voice from the frame mask estimate and the signal energy of each time-frequency point.

Optionally, the global noise estimation method further includes the following steps:

S225, first performing erosion processing on the frame mask estimate and then performing dilation processing, so as to improve its accuracy.

Optionally, in step S100, the voices of different broadcast hosts are collected for adaptive model training to obtain a voice beautification model; in step S400, the voice beautification model is used to beautify the personalized synthesized voice.

Optionally, in step S100 the voices of different users are collected, steps S200 and S300 are repeated to obtain a personalized voice model for each user, and a personalized voice model library is established; in step S400, voice instruction information is received and user identification is performed to call the corresponding personalized voice model for voice synthesis.

Optionally, the user identification includes: extracting the voice characteristics of each user's voice collected in step S100 as first characteristic information, storing the first characteristic information and the personalized voice model in the personalized voice model library, and associating the first characteristic information of each user with that same user's personalized voice model; in step S400, extracting the voice characteristics of the voice instruction information as second characteristic information, comparing the second characteristic information with each stored first characteristic information one by one, and, if they are consistent, selecting the associated personalized voice model for voice synthesis.

Optionally, the voice characteristics include at least sound wave frequency, sound wave amplitude, duration, and timbre.

By performing offline noise reduction on the data collected from the user, the method of the present invention improves the quality of the personalized voice model; the resulting higher-quality model is then used for personalized voice synthesis of the user's voice, thereby improving the quality of the personalized synthesized voice.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flowchart illustrating an embodiment of a method for improving personalized synthesized speech quality according to the present invention;

FIG. 2 is a flowchart of an embodiment of an off-line denoising process according to the present invention;

FIG. 3 is a flowchart of another embodiment of the off-line denoising process according to the present invention;

FIG. 4 is a flow chart of an embodiment of global noise estimation used in the present invention;

FIG. 5 is a flowchart of another embodiment of the global noise estimation method employed by the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

Fig. 1 shows a flow of an alternative embodiment of the method for improving the quality of personalized synthesized speech according to the present invention, which includes the following steps:

S100, collecting user voice as original voice data;

S200, performing noise reduction processing on the original voice data;

S300, generating a personalized voice model from the base model through model conversion, using the noise-reduced voice data;

S400, performing voice synthesis with the personalized voice model to obtain personalized synthesized voice.

The working principle of the technical scheme is as follows: voice noise reduction removes background noise from the original voice data, which reduces the mismatch between the voice data and the base model and improves the quality of the generated personalized voice model, so that synthesis with the personalized voice model produces higher-quality personalized voice.

The beneficial effects of the above technical scheme are: the quality of the personalized voice model can be improved by carrying out noise reduction processing on the voice data for generating the personalized voice model in advance, and the personalized voice model subjected to the processing is adopted for personalized synthesis, so that the quality of personalized synthesized voice is improved.

In one embodiment, in step S200, an offline denoising processing manner is adopted to perform denoising processing on the voice data.

The beneficial effects of the above technical scheme are: offline noise reduction reduces interference and adverse effects introduced during transmission, is faster, prevents distortion of the voice data, and preserves the personalized characteristics of the user's voice data well.

In one embodiment, as shown in fig. 2, the offline denoising processing method adopted by the method for improving the quality of personalized synthesized speech of the present invention includes the following steps:

S210, performing pre-emphasis processing on the original voice data and then performing a short-time Fourier transform;

S250, filtering and denoising the transformed voice data with a noise reduction filter;

S290, performing an inverse short-time Fourier transform on the processed data to restore it to the time domain, and then performing de-emphasis to obtain the noise-reduced voice data.

The working principle of the technical scheme is as follows: pre-emphasis is a signal processing technique that compensates the high-frequency components of the input signal at the sending end. The original voice signal is enhanced by pre-emphasis while the noise is unaffected, so the signal-to-noise ratio of the output can be effectively improved. A short-time Fourier transform is then performed to determine the frequency and phase of the sinusoidal components in each local region of the time-varying signal. After filtering and denoising, the inverse short-time Fourier transform and de-emphasis are applied in turn to obtain the denoised voice data.
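The pre-emphasis and de-emphasis pair in S210 and S290 can be illustrated with a minimal sketch. The source does not give the filter or its coefficient; the first-order difference filter and the value `alpha = 0.97` below are common choices and are assumptions made only for illustration.

```python
def pre_emphasis(signal, alpha=0.97):
    """First-order high-frequency boost: y[n] = x[n] - alpha * x[n-1].

    alpha = 0.97 is an assumed, commonly used value (not from the source).
    """
    out, prev = [], 0.0
    for x in signal:
        out.append(x - alpha * prev)
        prev = x
    return out


def de_emphasis(signal, alpha=0.97):
    """Inverse filter: x[n] = y[n] + alpha * x[n-1]."""
    out, prev = [], 0.0
    for y in signal:
        prev = y + alpha * prev
        out.append(prev)
    return out


samples = [0.0, 1.0, 0.5, -0.25, 0.75]
restored = de_emphasis(pre_emphasis(samples))
```

With no processing in between, `restored` matches `samples` up to floating-point rounding, which is why de-emphasis is applied last in S290 to undo the spectral tilt introduced in S210.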

The beneficial effects of the above technical scheme are: pre-emphasis before noise reduction improves the signal-to-noise ratio, which makes noise easier to identify, strengthens the noise reduction effect, and improves the purity of the voice data; the denoised voice data is then recovered through the inverse process.
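The transform pair used in S210 and S290 can be sketched as follows. The source does not specify the window, frame length, or hop size; the periodic Hann window, the frame length of 8, and the 50% overlap below are assumptions chosen so that overlapping windows sum to one and the inverse is a plain overlap-add. A naive DFT is used to keep the sketch dependency-free; a real system would use an FFT library.

```python
import cmath
import math


def hann(n):
    # Periodic Hann window; copies shifted by n/2 sum to exactly 1.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft(frame):
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)) for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(c * cmath.exp(2j * math.pi * k * t / n)
                for k, c in enumerate(spectrum)).real / n for t in range(n)]


def stft(signal, size=8):
    """Windowed DFT of half-overlapping frames (zero-padded at both ends)."""
    hop = size // 2
    window = hann(size)
    padded = [0.0] * hop + list(signal) + [0.0] * size
    return [dft([padded[s + i] * window[i] for i in range(size)])
            for s in range(0, len(padded) - size + 1, hop)]


def istft(frames, length, size=8):
    """Overlap-add of inverse DFTs, then removal of the leading padding."""
    hop = size // 2
    out = [0.0] * (hop * (len(frames) - 1) + size)
    for j, spectrum in enumerate(frames):
        for i, value in enumerate(idft(spectrum)):
            out[j * hop + i] += value
    return out[hop:hop + length]


signal = [math.sin(0.3 * n) for n in range(20)]
frames = stft(signal)
restored = istft(frames, len(signal))
```

In the described method the noise reduction filter of S250 would be applied to `frames` between the two calls; with no filtering, `restored` reproduces `signal` up to rounding error.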

In one embodiment, as shown in fig. 3, the offline denoising processing method includes the following steps:

S210, performing pre-emphasis processing on the original voice data and then performing a short-time Fourier transform;

S220, estimating noise and effective voice in the short-time Fourier transformed voice data using a global noise estimation method;

S230, generating a noise reduction filter from the noise estimation result and the effective voice estimation result;

S240, smoothing the noise reduction filter;

S250, filtering and denoising the transformed voice data with the noise reduction filter;

S290, performing an inverse short-time Fourier transform on the processed data to restore it to the time domain, and then performing de-emphasis to obtain the noise-reduced voice data.

The working principle of the technical scheme is as follows: noise and effective voice are estimated using a global noise estimation method; a noise reduction filter is then generated and smoothed to improve its quality; finally, the transformed voice data is filtered and denoised with that filter.
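The source does not state the form of the noise reduction filter or of its smoothing. As one possible reading of S226, S230, and S240, the sketch below averages the energy of noise-labelled frames to estimate noise per frequency bin, builds a Wiener-style gain, and smooths it with a three-frame moving average. All three choices, and the function names, are assumptions for illustration only.

```python
def estimate_noise(energy, frame_labels):
    """Mean energy per frequency bin over frames labelled 0 (noise)."""
    bins = len(energy[0])
    noise_frames = [f for f, label in zip(energy, frame_labels) if label == 0]
    if not noise_frames:
        return [0.0] * bins
    return [sum(f[k] for f in noise_frames) / len(noise_frames)
            for k in range(bins)]


def wiener_gain(energy, noise):
    """Gain speech / (speech + noise), with speech = max(energy - noise, 0)."""
    gains = []
    for frame in energy:
        row = []
        for e, n in zip(frame, noise):
            speech = max(e - n, 0.0)
            row.append(speech / (speech + n) if speech + n > 0 else 0.0)
        gains.append(row)
    return gains


def smooth_over_time(gains):
    """Three-frame moving average per bin; edges use available neighbours."""
    smoothed = []
    for t in range(len(gains)):
        lo, hi = max(0, t - 1), min(len(gains), t + 2)
        smoothed.append([sum(gains[u][k] for u in range(lo, hi)) / (hi - lo)
                         for k in range(len(gains[t]))])
    return smoothed


energy = [[1.0, 1.0], [9.0, 1.0], [1.0, 1.0]]   # frames x frequency bins
labels = [0, 1, 0]                               # frame mask: 1 = speech
noise = estimate_noise(energy, labels)
gains = smooth_over_time(wiener_gain(energy, noise))
```

Each gain lies between 0 and 1 and would multiply the corresponding time-frequency point of the transformed voice data in S250.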

The beneficial effects of the above technical scheme are: generating the noise reduction filter from globally estimated noise and effective voice gives stable performance, strong noise robustness, and a low risk of distortion, and further enhances the noise reduction effect.

In one embodiment, as shown in fig. 4, in step S220, the global noise estimation method includes the following steps:

S222, calculating the signal energy value of each time-frequency point of the short-time Fourier transformed voice data; selecting all time-frequency points whose energy is greater than zero, taking the mean of the natural logarithm (base e) of their energies by time frame, and taking the exponential of that logarithmic mean as a first threshold; selecting from all time-frequency points those whose energy is greater than zero and lower than the first threshold, again taking the logarithmic mean by time frame, and taking its exponential as a second threshold; comparing each signal energy value with the second threshold, marking time-frequency points greater than or equal to the second threshold as effective voice points (1) and those below it as noise points (0), to obtain a preliminary time-frequency mask estimate;

S224, based on the preliminary time-frequency mask estimate, summing the marks of the time-frequency points in each frame; taking the mean of the natural logarithm (base e) of the nonzero per-frame sums and taking its exponential as a third threshold; taking the logarithmic mean of the per-frame sums that are greater than zero and less than the third threshold and taking its exponential as a fourth threshold; comparing each per-frame sum with the fourth threshold, marking a frame whose sum is greater than the fourth threshold as a speech frame (1) and any other frame as a noise frame (0), to obtain a frame mask estimate;

S226, estimating noise and effective voice from the frame mask estimate and the signal energy of each time-frequency point.

The working principle of the technical scheme is as follows: each frame contains a number of time-frequency points. A signal energy value is calculated for each time-frequency point of each frame of voice data, and threshold comparison separates effective voice points from noise points. The more voice points a frame contains, the more likely the frame is speech, so a similar thresholding method is used to decide whether each frame is an effective voice frame or a noise frame, which facilitates the subsequent noise reduction.

The beneficial effects of the above technical scheme are: because processing is offline, all of the voice data is available, so the noise is estimated over the entire recording. This yields a more accurate noise estimate and more accurate noise decisions, and provides a basis for the subsequent efficient noise reduction.
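The two-stage thresholds of S222 and S224 can be sketched as below. The exponential of a logarithmic mean is a geometric mean, so both steps reduce to the same helper. The phrase "by time frame" is ambiguous in the source; this sketch assumes a single threshold pooled over all time-frequency points, and the fallback used when no value lies below the first-stage threshold is an addition not found in the source.

```python
import math


def geometric_mean(values):
    """exp of the mean natural logarithm; values must be positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def two_stage_threshold(values):
    """Geometric mean of the positive values below their geometric mean."""
    positive = [v for v in values if v > 0]
    if not positive:
        return 0.0
    first = geometric_mean(positive)
    lower = [v for v in positive if v < first]
    # Fallback (assumption): reuse the first threshold if nothing is below it.
    return geometric_mean(lower) if lower else first


def time_frequency_mask(energy):
    """S222: 1 where energy >= second threshold, else 0."""
    second = two_stage_threshold([e for frame in energy for e in frame])
    return [[1 if e >= second else 0 for e in frame] for frame in energy]


def frame_mask(tf_mask):
    """S224: 1 where the per-frame mark sum > fourth threshold, else 0."""
    sums = [sum(frame) for frame in tf_mask]
    fourth = two_stage_threshold(sums)
    return [1 if s > fourth else 0 for s in sums]


# Toy energies: rows are frames, columns are frequency bins.
energy = [
    [1.0, 1.0, 1.0, 2.0],          # low-energy frame
    [100.0, 200.0, 100.0, 100.0],  # high-energy frame
    [100.0, 100.0, 200.0, 100.0],  # high-energy frame
    [1.0, 1.0, 1.0, 1.0],          # low-energy frame
]
tf_mask = time_frequency_mask(energy)
frames = frame_mask(tf_mask)   # [0, 1, 1, 0]
```

In this toy input one point of the first frame exceeds the second threshold, but its frame sum does not exceed the fourth threshold, so the frame is still labelled noise.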

In one embodiment, as shown in fig. 5, in step S220, the global noise estimation method includes the following steps:

S222, calculating the signal energy value of each time-frequency point of the short-time Fourier transformed voice data; selecting all time-frequency points whose energy is greater than zero, taking the mean of the natural logarithm (base e) of their energies by time frame, and taking the exponential of that logarithmic mean as a first threshold; selecting from all time-frequency points those whose energy is greater than zero and lower than the first threshold, again taking the logarithmic mean by time frame, and taking its exponential as a second threshold; comparing each signal energy value with the second threshold, marking time-frequency points greater than or equal to the second threshold as effective voice points (1) and those below it as noise points (0), to obtain a preliminary time-frequency mask estimate;

S224, based on the preliminary time-frequency mask estimate, summing the marks of the time-frequency points in each frame; taking the mean of the natural logarithm (base e) of the nonzero per-frame sums and taking its exponential as a third threshold; taking the logarithmic mean of the per-frame sums that are greater than zero and less than the third threshold and taking its exponential as a fourth threshold; comparing each per-frame sum with the fourth threshold, marking a frame whose sum is greater than the fourth threshold as a speech frame (1) and any other frame as a noise frame (0), to obtain a frame mask estimate;

S225, first performing erosion processing on the frame mask estimate and then performing dilation processing, so as to increase the accuracy of the frame mask estimate;

S226, estimating noise and effective voice from the frame mask estimate and the signal energy of each time-frequency point.

The working principle of the technical scheme is as follows: the global noise estimation method borrows the erosion and dilation operations from image processing. The frame mask estimate is first eroded and then dilated, which eliminates small, meaningless objects such as isolated frames.

The beneficial effects of the above technical scheme are: combining the erosion and dilation operations from image processing further suppresses noise, improves the accuracy of the frame mask estimate, and achieves a better result than common noise reduction methods.
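Erosion followed by dilation (a morphological opening) on the one-dimensional frame mask of S225 can be sketched as follows. The source does not give the structuring element; the radius of one frame on each side, and ignoring positions outside the mask at the edges, are assumptions.

```python
def erode(mask, radius=1):
    """A frame stays 1 only if every frame within `radius` is also 1."""
    n = len(mask)
    return [1 if all(mask[j] for j in range(max(0, i - radius),
                                            min(n, i + radius + 1))) else 0
            for i in range(n)]


def dilate(mask, radius=1):
    """A frame becomes 1 if any frame within `radius` is 1."""
    n = len(mask)
    return [1 if any(mask[j] for j in range(max(0, i - radius),
                                            min(n, i + radius + 1))) else 0
            for i in range(n)]


def open_mask(mask, radius=1):
    """Erosion then dilation: removes runs of 1s shorter than the window."""
    return dilate(erode(mask, radius), radius)


frame_mask = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]
cleaned = open_mask(frame_mask)   # [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
```

The isolated speech label at index 1 is removed, while the four-frame speech run keeps its original extent because dilation restores the frames that erosion trimmed from its ends.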

In one embodiment, in step S100, the voices of different broadcast hosts are collected for adaptive model training to obtain a voice beautification model; in step S400, the voice beautification model is used to beautify the personalized synthesized voice.

The working principle of the technical scheme is as follows: a voice beautification model is trained on the voices of a group of broadcast hosts, whose voices are of better quality, and is used to beautify the personalized synthesized voice and improve its quality.

The beneficial effects of the above technical scheme are: beautification by the voice beautification model compensates for possible defects in the personalized synthesized voice, making it more pleasant and vivid and improving how it is perceived.

In one embodiment, in step S100 the voices of different users are collected, steps S200 and S300 are repeated to obtain a personalized voice model for each user, and a personalized voice model library is established; in step S400, voice instruction information is received and user identification is performed to call the corresponding personalized voice model for voice synthesis.

The working principle of the technical scheme is as follows: the same method is used to build a high-quality personalized voice model library that stores the personalized voice models of different users, and the received voice instruction information serves as the trigger for calling the corresponding user's personalized voice model for personalized voice synthesis.

The beneficial effects of the above technical scheme are: the voices of different users are collected to build the personalized voice model library, and each user can subsequently call their model with a voice instruction; if the library is connected to the Internet, users are not restricted by location and can call it conveniently at any time and place.

In one embodiment, in step S100 the voices of different users are collected and the voice characteristics of each user's voice are extracted as first characteristic information; steps S200 and S300 are repeated to obtain a personalized voice model for each user, and a personalized voice model library is established; the first characteristic information and the personalized voice models are stored in the library, and the first characteristic information of each user is associated with that same user's personalized voice model. In step S400, voice instruction information is received, its voice characteristics are extracted as second characteristic information, and the second characteristic information is compared with the first characteristic information; if they are consistent, the associated personalized voice model is selected for voice synthesis. The voice characteristics include at least sound wave frequency, sound wave amplitude, duration, and timbre.

The working principle of the technical scheme is as follows: the voice characteristics of the user's voice serve as the link to the personalized voice model. After the voice characteristics of the received voice instruction information are extracted, they are compared with the voice characteristics stored in the personalized voice model library; a consistent match is the trigger condition for calling that user's personalized voice model for voice synthesis.

The beneficial effects of the above technical scheme are: the trigger condition for calling the user's personalized voice model is simple and convenient and requires no manual operation, which suits special groups such as disabled persons who are able to speak, and elderly people and children who cannot read.
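The lookup described above can be sketched as a small library that stores each user's first characteristic information next to their model and returns the model whose stored characteristics are consistent with those of the incoming voice instruction. The source does not define "consistent" or how the characteristics are encoded; the four-number feature tuple, the 5% relative tolerance, and the names `ModelLibrary`, `register`, and `lookup` are hypothetical.

```python
import math


class ModelLibrary:
    """Personalized voice model library keyed by voice characteristics."""

    def __init__(self, tolerance=0.05):
        # tolerance: assumed relative tolerance for "consistent" features.
        self.tolerance = tolerance
        self.entries = []   # list of (first_characteristics, model)

    def register(self, characteristics, model):
        """Store a user's first characteristic information with their model."""
        self.entries.append((tuple(characteristics), model))

    def lookup(self, characteristics):
        """Compare one by one; return the first consistent model, else None."""
        for stored, model in self.entries:
            if all(math.isclose(a, b, rel_tol=self.tolerance)
                   for a, b in zip(stored, characteristics)):
                return model
        return None


# Features: (frequency in Hz, amplitude, duration in s, timbre descriptor).
# A single number for timbre is a simplification made for this sketch.
library = ModelLibrary()
library.register((120.0, 0.50, 1.20, 1500.0), "model_user_a")
library.register((220.0, 0.40, 1.00, 2400.0), "model_user_b")

selected = library.lookup((121.0, 0.51, 1.19, 1510.0))   # "model_user_a"
unknown = library.lookup((300.0, 0.50, 1.20, 1500.0))    # None
```

Returning `None` for an unmatched speaker leaves the fallback behaviour to the caller; the source does not say what happens when no stored characteristics are consistent.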

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
