Singing voice synthesis method and device and readable storage medium

Document No.: 600200 Publication date: 2021-05-04

Reading note: This technology, "Singing voice synthesis method and device and readable storage medium", was created by 杨喜鹏, 郁霖, 陈云琳, 江明奇, 张旭 and 殷昊 on 2020-12-23. Abstract: The invention discloses a singing voice synthesis method, a singing voice synthesis device and a readable storage medium. The method comprises: acquiring audio recited by a user and the lyric text corresponding to the audio; performing duration labeling on a first phoneme in the audio according to a preset speech recognition model and the lyric text to obtain a first duration of the first phoneme; determining a first spectral feature of the audio; when first lyrics in the lyric text correspond to second lyrics of a preset target song, scaling the first spectral feature according to a second duration of a second phoneme of the preset target song and the first duration of the first phoneme to obtain a second spectral feature; and synthesizing the second spectral feature and a first fundamental frequency of the preset target song to obtain a synthesized singing voice. The invention achieves singing voice synthesis without collecting a large amount of recording data, which reduces the cost of singing voice synthesis; the synthesized singing voice is more natural, carries the rhythm of the target song's original singer, and retains the user's own timbre.

1. A singing voice synthesizing method, comprising:

acquiring audio recited by a user and lyric text corresponding to the audio;

performing duration labeling on a first phoneme in the audio according to a preset speech recognition model and the lyric text to obtain a first duration of the first phoneme;

determining a first spectral feature of the audio;

when first lyrics in the lyric text correspond to second lyrics of a preset target song, scaling the first spectral feature according to a preset second duration of a second phoneme of the target song and the first duration of the first phoneme to obtain a second spectral feature;

and synthesizing the second spectral feature and a preset first fundamental frequency of the target song to obtain a synthesized singing voice.

2. The singing voice synthesizing method according to claim 1, wherein performing duration labeling on the first phoneme in the audio according to the preset speech recognition model and the lyric text to obtain the first duration of the first phoneme comprises:

determining the first phoneme in the audio according to an initial and a final of a character in the first lyrics in the lyric text;

inputting the first phoneme and the audio into the preset speech recognition model;

and labeling the audio through the speech recognition model according to the first phoneme to obtain the first duration of the first phoneme.

3. The singing voice synthesizing method according to claim 1, wherein the first lyrics in the lyric text corresponding to the second lyrics of the preset target song comprises:

when the characters of each lyric line in the first lyrics are the same as the characters of each lyric line in the second lyrics of the preset target song, or the number of characters of each lyric line in the first lyrics is the same as the number of characters of each lyric line in the second lyrics of the preset target song, the first lyrics in the lyric text correspond to the second lyrics of the preset target song.

4. The singing voice synthesizing method according to claim 1, wherein scaling the first spectral feature according to the preset second duration of the second phoneme of the target song and the first duration of the first phoneme to obtain the second spectral feature comprises:

labeling the first spectrum feature according to the first duration of the first phoneme to obtain a third spectrum feature;

calculating a scaling ratio according to a second duration of a second phoneme of the target song and a first duration of the first phoneme;

and carrying out scaling processing on the third spectral feature according to the scaling ratio to obtain the second spectral feature.

5. The singing voice synthesizing method according to claim 1, further comprising:

when the first lyrics in the lyric text do not correspond to the second lyrics of the preset target song, splicing and/or cutting the first phoneme to obtain a third phoneme corresponding to the second phoneme;

determining a third duration of the third phoneme according to the first duration of the first phoneme;

scaling the first spectral feature according to the preset second duration of the second phoneme of the target song and the third duration of the third phoneme to obtain a third spectral feature;

and synthesizing the third spectral feature and the preset fundamental frequency of the target song to obtain a synthesized singing voice.

6. The method for synthesizing singing voice according to claim 1, further comprising, before the synthesizing the second spectral feature and the preset first fundamental frequency of the target song to obtain the synthesized singing voice:

determining a second fundamental frequency of the audio;

determining a zero value in the second fundamental frequency;

determining a zero value in a preset first fundamental frequency of the target song;

interpolating zero values in the first fundamental frequency to non-zero values;

adjusting a non-zero value in the first fundamental frequency based on a zero value in the second fundamental frequency.

7. The method for synthesizing singing voice according to claim 1, further comprising, after the synthesizing the second spectral feature and the preset first fundamental frequency of the target song to obtain a synthesized singing voice:

performing voice changing processing on the synthesized singing voice;

and filtering the synthesized singing voice subjected to the voice changing processing.

8. A singing voice synthesizing apparatus, comprising:

an acquiring unit for acquiring audio recited by a user and lyric text corresponding to the audio;

a labeling unit for performing duration labeling on a first phoneme in the audio according to a preset speech recognition model and the lyric text to obtain a first duration of the first phoneme;

a determining unit for determining a first spectral feature of the audio;

a processing unit for scaling the first spectral feature according to a preset second duration of a second phoneme of a target song and the first duration of the first phoneme to obtain a second spectral feature when first lyrics in the lyric text correspond to preset second lyrics of the target song;

and a synthesis unit for synthesizing the second spectral feature and a preset first fundamental frequency of the target song to obtain a synthesized singing voice.

9. A computer, comprising:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the singing voice synthesis method of any one of claims 1-7.

10. A computer-readable storage medium storing computer instructions for causing a computer to execute the method of singing voice synthesis according to any one of claims 1-7.

Technical Field

The present application relates to the field of singing voice synthesis technology, and in particular, to a singing voice synthesis method, device and readable storage medium.

Background

In recent years, singing voice synthesis technology has attracted wide attention. As the technology has developed, it has gradually entered people's daily lives. For example, some users cannot sing in tune and may wish to recite the lyrics of a song and then have a singing voice synthesis technique generate a singing voice of their own.

At present, the related technology generally recognizes the user's speech, looks up the corresponding stored singing voice in a singing voice synthesis database, extracts the timbre of that singing voice, and then uses a pre-established conversion model to change the timbre of the singing voice into the timbre of the user, thereby obtaining the user's synthesized singing voice.

However, the core of the above technology is to record in advance the singing voice of every pronunciation of a given language at different pitches to build the singing voice synthesis database. Synthesizing the user's singing voice from the stored singing voices in the database therefore relies on a very large amount of recording data, which takes considerable time and manpower to collect and results in a high cost of singing voice synthesis.

Summary of the Application

The embodiment of the application provides a singing voice synthesis method, a singing voice synthesis device and a readable storage medium, and aims to solve the problem that in the prior art, singing voice synthesis depends on huge recording data, and a large amount of time and manpower are consumed to collect the recording data, so that the singing voice synthesis cost is high.

In order to solve the above problem, in a first aspect, an embodiment of the present invention provides a singing voice synthesis method, including: acquiring audio recited by a user and lyric text corresponding to the audio; performing duration labeling on a first phoneme in the audio according to a preset speech recognition model and the lyric text to obtain a first duration of the first phoneme; determining a first spectral feature of the audio; when first lyrics in the lyric text correspond to second lyrics of a preset target song, scaling the first spectral feature according to a second duration of a second phoneme of the preset target song and the first duration of the first phoneme to obtain a second spectral feature; and synthesizing the second spectral feature and the first fundamental frequency of the preset target song to obtain the synthesized singing voice.

Optionally, performing duration labeling on the first phoneme in the audio according to the preset speech recognition model and the lyric text to obtain the first duration of the first phoneme includes: determining the first phoneme in the audio according to the initial and the final of a character in the first lyrics in the lyric text; inputting the first phoneme and the audio into the preset speech recognition model; and labeling the audio through the speech recognition model according to the first phoneme to obtain the first duration of the first phoneme.

Optionally, the first lyrics in the lyric text corresponding to the second lyrics of the preset target song includes: when the characters of each lyric line in the first lyrics are the same as the characters of each lyric line in the second lyrics of the preset target song, or the number of characters of each lyric line in the first lyrics is the same as the number of characters of each lyric line in the second lyrics of the preset target song, the first lyrics in the lyric text correspond to the second lyrics of the preset target song.

Optionally, scaling the first spectral feature according to the preset second duration of the second phoneme of the target song and the first duration of the first phoneme to obtain the second spectral feature includes: labeling the first spectral feature according to the first duration of the first phoneme to obtain a third spectral feature; calculating a scaling ratio according to the second duration of the second phoneme of the target song and the first duration of the first phoneme; and scaling the third spectral feature according to the scaling ratio to obtain the second spectral feature.

Optionally, when the first lyrics in the lyric text do not correspond to the second lyrics of the preset target song, the first phoneme is spliced and/or cut to obtain a third phoneme corresponding to the second phoneme; a third duration of the third phoneme is determined according to the first duration of the first phoneme; the first spectral feature is scaled according to the preset second duration of the second phoneme of the target song and the third duration of the third phoneme to obtain a third spectral feature; and the third spectral feature and the preset fundamental frequency of the target song are synthesized to obtain the synthesized singing voice.

Optionally, before synthesizing the second spectral feature and the preset first fundamental frequency of the target song to obtain the synthesized singing voice, the singing voice synthesizing method further includes: determining a second fundamental frequency of the audio; determining a zero value in the second fundamental frequency; determining zero values in a first fundamental frequency of a preset target song; interpolating zero values in the first fundamental frequency to non-zero values; the non-zero values in the first fundamental frequency are adjusted in accordance with the zero values in the second fundamental frequency.
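The fundamental-frequency preparation described above can be sketched as follows. This is only an illustrative reading, not the method as claimed: it assumes both contours are frame-aligned lists of equal length, that a value of 0 marks an unvoiced frame, and that "adjusting according to the zero values in the second fundamental frequency" means silencing the target contour wherever the user's recitation is unvoiced. The function names `interpolate_zeros` and `apply_unvoiced_mask` are hypothetical.

```python
def interpolate_zeros(f0):
    """Fill zero (unvoiced) frames by linear interpolation between voiced frames.

    Leading and trailing zeros take the nearest voiced value (an assumption).
    """
    voiced = [i for i, value in enumerate(f0) if value > 0]
    if not voiced:
        return list(f0)  # nothing to interpolate from
    out = list(f0)
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]
    for left, right in zip(voiced, voiced[1:]):
        for i in range(left + 1, right):
            weight = (i - left) / (right - left)
            out[i] = (1 - weight) * f0[left] + weight * f0[right]
    return out


def apply_unvoiced_mask(first_f0, second_f0):
    """Zero the target-song F0 wherever the user's F0 is zero (assumed reading)."""
    if len(first_f0) != len(second_f0):
        raise ValueError("contours must be frame-aligned and of equal length")
    return [0.0 if user == 0 else target
            for target, user in zip(first_f0, second_f0)]


target_f0 = interpolate_zeros([0, 100.0, 0, 200.0, 0])
masked_f0 = apply_unvoiced_mask(target_f0, [0, 120.0, 110.0, 0, 0])
```

Interpolating first gives every frame a usable pitch; the mask then restores unvoiced frames at the positions where the user's own speech has no pitch.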

Optionally, after synthesizing the second spectral feature and the preset first fundamental frequency of the target song to obtain a synthesized singing voice, the singing voice synthesizing method further includes: performing voice changing processing on the synthesized singing voice; and filtering the synthesized singing voice after the voice changing processing.

In a second aspect, an embodiment of the present invention provides a singing voice synthesizing apparatus, including: an acquiring unit for acquiring audio recited by a user and lyric text corresponding to the audio; a labeling unit for performing duration labeling on a first phoneme in the audio according to a preset speech recognition model and the lyric text to obtain a first duration of the first phoneme; a determining unit for determining a first spectral feature of the audio; a processing unit for scaling the first spectral feature according to a second duration of a second phoneme of a preset target song and the first duration of the first phoneme to obtain a second spectral feature when first lyrics in the lyric text correspond to second lyrics of the preset target song; and a synthesis unit for synthesizing the second spectral feature and the first fundamental frequency of the preset target song to obtain the synthesized singing voice.

In a third aspect, an embodiment of the present invention provides a singing voice synthesizing apparatus, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a singing voice synthesis method as in the first aspect or any embodiment of the first aspect.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the method for singing voice synthesis according to the first aspect or any implementation manner of the first aspect.

According to the singing voice synthesis method, device and readable storage medium provided by the embodiments of the present invention, audio recited by a user and the lyric text corresponding to the audio are acquired; duration labeling is performed on a first phoneme in the audio according to a preset speech recognition model and the lyric text to obtain a first duration of the first phoneme; a first spectral feature of the audio is determined; when first lyrics in the lyric text correspond to second lyrics of a preset target song, the first spectral feature is scaled according to a second duration of a second phoneme of the preset target song and the first duration of the first phoneme to obtain a second spectral feature; and the second spectral feature and a preset first fundamental frequency of the target song are synthesized to obtain a synthesized singing voice. Because the first spectral feature is obtained from the audio recited by the user and is scaled based on the second duration of the second phoneme in the target song and the first duration of the first phoneme, the resulting second spectral feature, and hence the synthesized singing voice, carries the rhythm of the target song's original singer while retaining the user's own timbre. Because the first fundamental frequency of the target song is used during synthesis, the synthesized singing voice is more natural. Singing voice synthesis is achieved without collecting a large amount of recording data, which reduces its cost. Moreover, the method supports the case in which the first lyrics recited by the user differ from the second lyrics of the target song, thereby meeting the user's need to modify lyrics.

The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the present application are more readily apparent, specific embodiments of the present application are described in detail below.

Drawings

FIG. 1 is a schematic flow chart of a singing voice synthesizing method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a singing voice synthesizing apparatus according to an embodiment of the present invention;

FIG. 3 is a schematic hardware configuration diagram of a singing voice synthesizing apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An embodiment of the present invention provides a singing voice synthesis method, as shown in FIG. 1, including:

S101, acquiring audio recited by a user and lyric text corresponding to the audio; specifically, the method may be executed by a singing voice synthesizing device, a terminal, or a server, which is not specifically limited herein; the embodiments of the present invention take the singing voice synthesizing device as an example. The singing voice synthesizing device may receive a singing request from the user over a wired or wireless connection, then prompt the user to recite the first lyrics in the lyric text, and generate the recited audio from the user's recitation. The lyric text may be a lyric text designated by the user, a lyric text randomly selected by the singing voice synthesizing device from a preset lyric library when the singing request is received, a lyric text selected by the singing voice synthesizing device from the preset lyric library according to the user's behavior and usage habits, or a custom lyric text entered by the user. The invention also supports selecting the lyric text by position: the indices of the starting character and the ending character of the first lyrics within the lyric text are input, and the lyric text is obtained from those indices.

S102, performing duration labeling on a first phoneme in the audio according to a preset speech recognition model and the lyric text to obtain a first duration of the first phoneme; specifically, the first phoneme may be determined according to the initial and the final of a character in the lyric text, and the audio is then labeled according to the first phoneme by the speech recognition model to obtain the first duration of the first phoneme.

S103, determining a first spectral feature of the audio; specifically, a WORLD vocoder may be used to extract the first spectral feature from the audio. The first spectral feature may include mel-frequency spectral features and aperiodic component features.

S104, when first lyrics in the lyric text correspond to second lyrics of a preset target song, scaling the first spectral feature according to a second duration of a second phoneme of the preset target song and the first duration of the first phoneme to obtain a second spectral feature; specifically, the target song may be a song specified by the user. The target song, the lyric text of the target song, and the second duration of the second phoneme of the target song are preset in the singing voice synthesizing device. The lyric text of the target song comprises the second lyrics. The second phoneme is determined from the initials and finals of the characters of the second lyrics. The initial of a character corresponds to one second phoneme, and the final of a character may correspond to at least one second phoneme.

When calculating the second duration of the second phoneme of the target song, if the target song is mixed with background music, the open-source tool Spleeter may be used to separate the two, yielding the dry vocal and the background music. A duration alignment method in the speech recognition model is then used to mark the characters of the second lyrics and the duration and position of each corresponding second phoneme in the dry vocal, producing a pre-labeled duration file. The duration file includes: the song id, the position of the dry vocal relative to the background music, the fourth duration of each character in the second lyrics (the syllable duration), and the second duration of each second phoneme. The duration file is then converted into a file in TextGrid format, and the pre-labeled second durations are fine-tuned with the Praat speech analysis tool to generate accurate second durations of the second phonemes.

The first lyrics in the lyric text corresponding to the second lyrics of the preset target song may include: when the characters of each lyric line in the first lyrics are the same as the characters of each lyric line in the second lyrics of the preset target song, or the number of characters of each lyric line in the first lyrics is the same as the number of characters of each lyric line in the second lyrics of the preset target song, the first lyrics in the lyric text correspond to the second lyrics of the preset target song. The user can therefore modify the second lyrics to obtain first lyrics that still correspond to the second lyrics.
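The correspondence test can be sketched in a few lines. This is a minimal illustration, assuming each set of lyrics is given as a list of lines with punctuation and whitespace already removed, and that the two sets must have the same number of lines; the function name `lyrics_correspond` is hypothetical.

```python
def lyrics_correspond(first_lyrics, second_lyrics):
    """Return True when the recited lyrics correspond to the target song's lyrics.

    Both arguments are lists of lyric lines (strings), one entry per line.
    """
    if len(first_lyrics) != len(second_lyrics):
        return False  # assumption: a differing line count never corresponds
    pairs = list(zip(first_lyrics, second_lyrics))
    same_characters = all(a == b for a, b in pairs)
    same_counts = all(len(a) == len(b) for a, b in pairs)
    return same_characters or same_counts


unchanged = lyrics_correspond(["一闪一闪亮晶晶"], ["一闪一闪亮晶晶"])
rewritten = lyrics_correspond(["今天天气真不错"], ["一闪一闪亮晶晶"])
mismatch = lyrics_correspond(["今天天气好"], ["一闪一闪亮晶晶"])
```

Identical characters imply identical counts, so the second condition alone decides the result; both are kept to mirror the wording of the description.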

When the first lyrics in the lyric text correspond to the second lyrics of the preset target song, the second duration of a second phoneme generally differs from the first duration of the corresponding first phoneme, so the first duration may be lengthened or shortened to match the second duration. Accordingly, the first spectral feature may be scaled by the ratio of the second duration of the second phoneme to the first duration of the first phoneme to obtain the second spectral feature. The first spectral feature is extracted frame by frame; scaling it according to the second duration of the second phoneme and the first duration of the first phoneme allows it to be scaled phoneme by phoneme. Because the phonemes within a character are stretched by different amounts when a person sustains a long note in a song, phoneme-level scaling gives the second spectral feature the rhythm of the target song and matches the way people actually sing. Scaling the first spectral feature phoneme by phoneme therefore makes the synthesized song more accurate.
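As a small worked example of the ratio described above, the sketch below pairs each first phoneme with its second phoneme and divides the durations. It assumes durations are given in seconds and that the two sequences are already in one-to-one correspondence; the function name `scaling_ratios` is hypothetical.

```python
def scaling_ratios(first_durations, second_durations):
    """Return second_duration / first_duration for each corresponding phoneme pair."""
    if len(first_durations) != len(second_durations):
        raise ValueError("phoneme sequences must correspond one-to-one")
    ratios = []
    for first, second in zip(first_durations, second_durations):
        if first <= 0:
            raise ValueError("first durations must be positive")
        ratios.append(second / first)
    return ratios


# A recited syllable of 0.25 s + 0.5 s mapped onto a sung one of 0.5 s + 0.25 s.
ratios = scaling_ratios([0.25, 0.5], [0.5, 0.25])
```

A ratio above 1 lengthens the phoneme and a ratio below 1 shortens it, so here the first phoneme is doubled and the second is halved.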

S105, synthesizing the second spectral feature and the first fundamental frequency of the preset target song to obtain the synthesized singing voice. Specifically, the first fundamental frequency of the target song may be extracted with a fundamental frequency extraction tool, including but not limited to YIN, Melodia, and WORLD. A WORLD vocoder is then used to synthesize the sound, and the values of the first fundamental frequency are adjusted to obtain a more accurate first fundamental frequency.

The embodiment of the present invention provides a singing voice synthesis method in which audio recited by a user and the lyric text corresponding to the audio are acquired; duration labeling is performed on a first phoneme in the audio according to a preset speech recognition model and the lyric text to obtain a first duration of the first phoneme; a first spectral feature of the audio is determined; when first lyrics in the lyric text correspond to second lyrics of a preset target song, the first spectral feature is scaled according to a second duration of a second phoneme of the preset target song and the first duration of the first phoneme to obtain a second spectral feature; and the second spectral feature and a preset first fundamental frequency of the target song are synthesized to obtain a synthesized singing voice. Because the first spectral feature is obtained from the audio recited by the user and is scaled based on the second duration of the second phoneme in the target song and the first duration of the first phoneme, the resulting second spectral feature, and hence the synthesized singing voice, carries the rhythm of the target song's original singer while retaining the user's own timbre. Because the first fundamental frequency of the target song is used during synthesis, the synthesized singing voice is more natural. Singing voice synthesis is achieved without collecting a large amount of recording data, which reduces its cost. Moreover, the method supports the case in which the first lyrics recited by the user differ from the second lyrics of the target song, thereby meeting the user's need to modify lyrics.

In an optional embodiment, in step S102, performing duration labeling on the first phoneme in the audio according to the preset speech recognition model and the lyric text to obtain the first duration of the first phoneme may specifically include: determining the first phoneme in the audio according to the initial and the final of a character in the first lyrics in the lyric text; inputting the first phoneme and the audio into the preset speech recognition model; and labeling the audio through the speech recognition model according to the first phoneme to obtain the first duration of the first phoneme.

Specifically, the characters in the lyric text may be converted into initials and finals using the pypinyin tool or a speech synthesis tool. The initial of a character corresponds to one first phoneme, and the final of a character may correspond to at least one first phoneme; the number of first phonemes corresponding to a final is determined by the composition of the final. For example, the compound final iang corresponds to two first phonemes, i and ang, whereas the non-compound final ei corresponds to one first phoneme. The lyric text converted into initials and finals is input into the speech recognition model together with the audio. The speech recognition model analyzes the audio and labels it sequentially with the first phonemes corresponding to the characters of the lyrics, yielding a timestamp and a duration for each first phoneme, from which the first duration of the first phoneme is determined.
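The mapping from a syllable to its first phonemes, and from aligner output to first durations, can be sketched as follows. The sketch assumes the initial and final of each character are already available (for instance from pypinyin) and that the speech recognition model returns (phoneme, start, end) tuples in seconds. The table `COMPOUND_FINALS` is a deliberately incomplete, hypothetical lookup containing only the iang example from the text, and both function names are illustrative.

```python
# Hypothetical, incomplete table: compound finals that split into two phonemes.
COMPOUND_FINALS = {"iang": ["i", "ang"]}


def syllable_to_phonemes(initial, final):
    """Return the first phonemes for one character.

    The initial (if any) is one phoneme; the final is one phoneme unless
    it appears in COMPOUND_FINALS.
    """
    phonemes = [initial] if initial else []
    phonemes.extend(COMPOUND_FINALS.get(final, [final]))
    return phonemes


def first_durations(alignment):
    """Turn (phoneme, start_s, end_s) tuples into (phoneme, duration_s) pairs."""
    return [(phoneme, end - start) for phoneme, start, end in alignment]


liang = syllable_to_phonemes("l", "iang")
fei = syllable_to_phonemes("f", "ei")
durations = first_durations([("l", 0.0, 0.25), ("i", 0.25, 0.5), ("ang", 0.5, 1.0)])
```

Here the character with final iang yields three phonemes (one for the initial, two for the final), while the character with final ei yields two.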

Using the speech recognition model to label the durations in the audio yields the first duration of the first phoneme quickly and accurately.

In an optional embodiment, in step S104, scaling the first spectral feature according to the second duration of the second phoneme of the preset target song and the first duration of the first phoneme to obtain the second spectral feature specifically includes: labeling the first spectral feature according to the first duration of the first phoneme to obtain a third spectral feature; calculating a scaling ratio according to the second duration of the second phoneme of the target song and the first duration of the first phoneme; and scaling the third spectral feature according to the scaling ratio to obtain the second spectral feature.

Specifically, the first spectral feature may be scaled phoneme by phoneme. Because the first spectral feature is extracted in units of frames, it must first be labeled according to the first duration of each first phoneme so that it can be divided by phoneme, which yields the third spectral feature of each first phoneme. The scaling ratio may be determined as the ratio of the second duration of the second phoneme to the first duration of the first phoneme. The third spectral feature of the first phoneme is then scaled by the scaling ratio to obtain the second spectral feature. Further, when the final of a character consists of one first phoneme, the third spectral feature of that first phoneme is scaled by the scaling ratio to obtain the second spectral feature. When the final of a character consists of several first phonemes, the third spectral feature of the last first phoneme of the final and the third spectral feature of the first phoneme corresponding to the initial are scaled by the scaling ratio to obtain the second spectral feature; the third spectral features of the other first phonemes of the final are not scaled.
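The labeling step and the per-character scaling rule above can be sketched as follows. The sketch assumes phonemes are contiguous from the start of the audio and that features are extracted every 5 ms; the frame period and both function names (`label_frames`, `scaled_mask`) are assumptions made for illustration.

```python
def label_frames(first_durations, frame_period=0.005):
    """Return half-open (start, end) frame ranges, one per first phoneme.

    Durations and frame_period are in seconds; 5 ms is an assumed hop size.
    """
    ranges = []
    elapsed = 0.0
    for duration in first_durations:
        start = round(elapsed / frame_period)
        elapsed += duration
        ranges.append((start, round(elapsed / frame_period)))
    return ranges


def scaled_mask(has_initial, n_final_phonemes):
    """Mark which phonemes of one character are scaled.

    The initial and the last phoneme of the final are scaled; earlier
    phonemes of a multi-phoneme final keep their original length.
    """
    if n_final_phonemes < 1:
        raise ValueError("a final has at least one phoneme")
    mask = [True] if has_initial else []
    mask += [False] * (n_final_phonemes - 1) + [True]
    return mask


frame_ranges = label_frames([0.1, 0.2])
mask = scaled_mask(has_initial=True, n_final_phonemes=2)
```

Slicing the frame-level feature matrix with each range gives the third spectral feature of that phoneme, and the mask says whether the slice is passed to the scaling step or copied through unchanged.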

During scaling, linear interpolation may be used: the third spectral feature is linearly interpolated according to the scaling ratio.
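
As an illustration of linear interpolation along the time axis, the sketch below resamples the frames of one phoneme to a new length given by the scaling ratio. The function name `stretch_segment` and the rule for choosing the output frame count (rounding the input count times the ratio, with a minimum of one frame) are assumptions made for this example.

```python
from typing import List, Sequence


def stretch_segment(
    segment: Sequence[Sequence[float]], ratio: float
) -> List[List[float]]:
    """Resample one phoneme's frames along time by linear interpolation."""
    n_in = len(segment)
    if n_in == 0:
        return []
    n_out = max(1, round(n_in * ratio))  # assumed rounding rule
    if n_in == 1 or n_out == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(segment[0]) for _ in range(n_out)]
    stretched = []
    for i in range(n_out):
        # Map output index i onto a fractional position in the input.
        pos = i * (n_in - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        w = pos - lo
        stretched.append(
            [(1 - w) * a + w * b for a, b in zip(segment[lo], segment[hi])]
        )
    return stretched
```

For example, stretching the two frames `[[0.0], [2.0]]` with a ratio of 1.5 yields three frames whose middle value lies halfway between the two originals.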

By labeling the first spectral feature according to the first duration of the first phoneme to obtain the third spectral feature, calculating the scaling ratio from the second duration of the second phoneme and the first duration of the first phoneme, and scaling the third spectral feature according to that ratio to obtain the second spectral feature, the first spectral feature is scaled phoneme by phoneme. When a person sings a long note in a song, each phoneme in the character is stretched by a different amount; scaling by phoneme therefore gives the second spectral feature the rhythm of the target song and matches how people actually sing. Scaling the first spectral feature according to the phoneme correspondence can also make the synthesized singing voice more accurate.

In an optional embodiment, when the first lyrics in the lyric text do not correspond to the second lyrics of the preset target song, the method includes: splicing and/or cutting the first phoneme to obtain a third phoneme corresponding to the second phoneme; determining a third duration of the third phoneme according to the first duration of the first phoneme; scaling the first spectral feature according to the second duration of the second phoneme of the preset target song and the third duration of the third phoneme to obtain a third spectral feature; and synthesizing the third spectral feature and the fundamental frequency of the preset target song to obtain the synthesized singing voice.

Specifically, when the number of characters in the first lyrics in the lyric text differs from the number of characters in the second lyrics of the preset target song, the first lyrics do not correspond to the second lyrics. When the first lyrics have fewer characters than the second lyrics, the first phoneme may be cut using a long-syllable cutting method to obtain a third phoneme corresponding to the second phoneme. When the first lyrics have more characters than the second lyrics, the first phonemes may be spliced using a syllable-splicing method to obtain a third phoneme corresponding to the second phoneme. Here, the third phoneme corresponding to the second phoneme means that the number of third phonemes equals the number of second phonemes. The third duration of the third phoneme may be determined according to the duration of the first phoneme and a duration ratio threshold between the third phonemes. The first spectral feature is labeled according to the third duration of the third phoneme, and the labeled first spectral feature is then scaled according to the ratio of the second duration of the second phoneme to the third duration of the third phoneme to obtain the third spectral feature.
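
The character-count rule above can be expressed as a small decision function. This is only a sketch of the branching logic: the function name, the returned labels, and the assumption that the lyrics contain no punctuation (whitespace is ignored) are all illustrative. The cutting and splicing methods themselves are not specified in enough detail in the text to be reproduced here.

```python
def alignment_strategy(first_lyrics: str, second_lyrics: str) -> str:
    """Choose how to align recited lyrics with the target song's lyrics.

    Returns "scale" when the character counts match, "cut" when the recited
    lyrics have fewer characters, and "splice" when they have more.
    """
    # Whitespace is ignored; punctuation is assumed to be removed already.
    n_first = len("".join(first_lyrics.split()))
    n_second = len("".join(second_lyrics.split()))
    if n_first == n_second:
        return "scale"
    if n_first < n_second:
        return "cut"
    return "splice"
```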

In this embodiment of the invention, when the first lyrics in the lyric text do not correspond to the second lyrics of the preset target song, the first phoneme is spliced and/or cut to obtain a third phoneme corresponding to the second phoneme; a third duration of the third phoneme is determined according to the first duration of the first phoneme; the first spectral feature is scaled according to the second duration of the second phoneme of the preset target song and the third duration of the third phoneme to obtain a third spectral feature; and the third spectral feature and the fundamental frequency of the preset target song are synthesized to obtain the synthesized singing voice. The user can therefore synthesize a singing voice with arbitrary first lyrics, and the synthesized singing voice has the rhythm and the first fundamental frequency of the target song.

In an optional embodiment, before the second spectral feature and the first fundamental frequency of the preset target song are synthesized to obtain the synthesized singing voice, the first fundamental frequency may be further adjusted. The singing voice synthesis method then further includes: determining a second fundamental frequency of the audio; determining the zero values in the second fundamental frequency; determining the zero values in the first fundamental frequency of the preset target song; interpolating the zero values in the first fundamental frequency to non-zero values; and adjusting the non-zero values in the first fundamental frequency according to the zero values in the second fundamental frequency.

Specifically, the second fundamental frequency of the audio may be extracted with a fundamental frequency extraction tool. Determining the zero values in the second fundamental frequency makes it possible to determine the beginning and end of each line of the first lyrics. The zero values in the first fundamental frequency are interpolated to non-zero values mainly to ensure that the first fundamental frequency at the beginning and end of each line of the second lyrics does not gradually fall to zero. Some first phonemes, such as b and sh, carry no fundamental frequency information, and the second fundamental frequency corresponding to such a phoneme is zero. Accordingly, when the singing voice is synthesized, the value at the corresponding position in the first fundamental frequency should also be zero, so the non-zero values in the first fundamental frequency are adjusted according to the zero values in the second fundamental frequency.
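
The two adjustment steps can be sketched as below. The sketch assumes that both fundamental frequency contours are already given on the same frame grid, that is, the second fundamental frequency has been time-aligned to the target song; the text does not state how this alignment is obtained. The function names and the choice of linear interpolation for filling zeros are likewise assumptions.

```python
from typing import List, Sequence


def fill_zeros(f0: Sequence[float]) -> List[float]:
    """Replace zero values by interpolating between non-zero neighbours."""
    voiced = [i for i, value in enumerate(f0) if value > 0]
    if not voiced:
        return list(f0)
    filled = list(f0)
    # Leading and trailing zeros take the nearest non-zero value.
    for i in range(voiced[0]):
        filled[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        filled[i] = f0[voiced[-1]]
    # Interior zeros are linearly interpolated between their neighbours.
    for left, right in zip(voiced, voiced[1:]):
        for i in range(left + 1, right):
            w = (i - left) / (right - left)
            filled[i] = (1 - w) * f0[left] + w * f0[right]
    return filled


def adjust_target_f0(
    first_f0: Sequence[float], second_f0: Sequence[float]
) -> List[float]:
    """Fill zeros in the target F0, then zero it where the user's F0 is zero."""
    if len(first_f0) != len(second_f0):
        raise ValueError("both F0 contours must share one frame grid")
    filled = fill_zeros(first_f0)
    return [0.0 if user == 0 else target for target, user in zip(filled, second_f0)]
```

With a target contour `[0, 100, 0, 200, 0]` and a user contour that is zero only at the first and fourth frames, the result keeps the interpolated target values everywhere except those two frames.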

By adjusting the first fundamental frequency, the noise problem caused by inaccurate extraction of the first fundamental frequency can be reduced.

In an optional embodiment, after synthesizing the second spectral feature and the preset first fundamental frequency of the target song to obtain the synthesized singing voice, the singing voice synthesizing method further includes: performing voice changing processing on the synthesized singing voice; and filtering the synthesized singing voice after the voice changing processing.

Specifically, the voice changing processing may be performed on the synthesized singing voice using the open-source SoundTouch tool, and hiss in the synthesized singing voice may be removed by low-pass filtering. Background music may also be added to the synthesized singing voice. Depending on the sampling rate used for singing voice synthesis, the background music may be up-sampled or down-sampled (supported sampling rates include, but are not limited to, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz and 48 kHz). Reverberation may also be applied to the synthesized singing voice. Performing voice changing processing on the synthesized singing voice and then filtering the result can improve how the synthesized singing voice sounds.
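
To make the low-pass filtering step concrete, the sketch below applies a simple first-order (one-pole) low-pass filter to a list of samples. The text does not specify the filter type or the cutoff frequency, so both the filter design and the parameter names here are assumptions chosen for brevity; a practical system would likely use a steeper filter.

```python
import math
from typing import List, Sequence


def low_pass(
    samples: Sequence[float], cutoff_hz: float, sample_rate_hz: float
) -> List[float]:
    """First-order low-pass filter: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    if not samples:
        return []
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    filtered = []
    previous = samples[0]  # start from the first sample to avoid a transient
    for x in samples:
        previous = previous + alpha * (x - previous)
        filtered.append(previous)
    return filtered
```

A constant signal passes through unchanged, while a component alternating at half the sampling rate is strongly attenuated.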

An embodiment of the present invention further provides a singing voice synthesizing apparatus which, as shown in fig. 2, includes: an acquiring unit 201, configured to acquire audio recited by a user and the lyric text corresponding to the audio (for the specific implementation, see step S101 in the above embodiments, which is not repeated here); a labeling unit 202, configured to perform duration labeling on a first phoneme in the audio according to a preset speech recognition model and the lyric text to obtain a first duration of the first phoneme (see step S102); a determining unit 203, configured to determine a first spectral feature of the audio (see step S103); a processing unit 204, configured to, when first lyrics in the lyric text correspond to second lyrics of a preset target song, scale the first spectral feature according to a second duration of a second phoneme of the preset target song and the first duration of the first phoneme to obtain a second spectral feature (see step S104); and a synthesizing unit 205, configured to synthesize the second spectral feature and the first fundamental frequency of the preset target song to obtain a synthesized singing voice (see step S105).

The singing voice synthesizing apparatus provided by this embodiment of the invention acquires audio recited by a user and the lyric text corresponding to the audio; performs duration labeling on a first phoneme in the audio according to a preset speech recognition model and the lyric text to obtain a first duration of the first phoneme; determines a first spectral feature of the audio; when first lyrics in the lyric text correspond to second lyrics of a preset target song, scales the first spectral feature according to a second duration of a second phoneme of the preset target song and the first duration of the first phoneme to obtain a second spectral feature; and synthesizes the second spectral feature and the first fundamental frequency of the preset target song to obtain a synthesized singing voice. The first spectral feature is thus obtained from the audio recited by the user and scaled based on the second duration of the second phoneme in the target song and the first duration of the first phoneme, so that the resulting second spectral feature, and hence the synthesized singing voice, has the rhythm of the original singer of the target song while retaining the user's original timbre. Because the first fundamental frequency of the target song is used during synthesis, the synthesized singing voice is more natural. At the same time, singing voice synthesis can be achieved without collecting a large amount of recording data, which reduces its cost. Moreover, the method supports the case where the first lyrics of the audio recited by the user are inconsistent with the second lyrics of the target song, which meets the user's need to modify lyrics.

Based on the same inventive concept as the singing voice synthesis method in the foregoing embodiments, the present invention also provides a singing voice synthesizing apparatus having stored thereon a computer program which, when executed by a processor, implements the steps of any one of the singing voice synthesis methods described above.

As shown in fig. 3, the apparatus uses a bus architecture (represented by bus 300). Bus 300 may include any number of interconnected buses and bridges, and links together various circuits including one or more processors, represented by processor 302, and memory, represented by memory 304. Bus 300 may also link together various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and are therefore not described further here. A bus interface 306 provides an interface between bus 300 and the receiver 301 and transmitter 303. The receiver 301 and the transmitter 303 may be the same element, that is, a transceiver, providing a means for communicating with various other apparatus over a transmission medium.

The processor 302 is responsible for managing the bus 300 and general processing, and the memory 304 may be used for storing data used by the processor 302 in performing operations.

Based on the same inventive concept as the singing voice synthesis method in the foregoing embodiments, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the following steps:

acquiring audio recited by a user and the lyric text corresponding to the audio; performing duration labeling on a first phoneme in the audio according to a preset speech recognition model and the lyric text to obtain a first duration of the first phoneme; determining a first spectral feature of the audio; when first lyrics in the lyric text correspond to second lyrics of a preset target song, scaling the first spectral feature according to a second duration of a second phoneme of the preset target song and the first duration of the first phoneme to obtain a second spectral feature; and synthesizing the second spectral feature and the first fundamental frequency of the preset target song to obtain the synthesized singing voice.

In a specific implementation, when the program is executed by a processor, any method step in the first embodiment may be further implemented.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable information processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable information processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable information processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable information processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
