Singing voice synthesis method, device and equipment

Document No.: 600201 Publication date: 2021-05-04

Reading note: this technology, "Singing voice synthesis method, device and equipment" (一种歌声合成方法、装置及设备), was designed by 杨喜鹏 (Yang Xipeng), 张旭 (Zhang Xu), 殷昊 (Yin Hao), 江明奇 (Jiang Mingqi) and 陈云琳 (Chen Yunlin) on 2020-12-23. Main content: The invention discloses a singing voice synthesis method, device and equipment. The method comprises: acquiring a first lyric text of a target song; determining a first phoneme of the first lyric text; inputting the first phoneme and a preset singing duration of the first phoneme into a preset acoustic model for processing, and outputting a corresponding first acoustic feature, which comprises a first fundamental frequency and a first spectral envelope; adjusting the first fundamental frequency according to a preset second fundamental frequency of the target song; and synthesizing the adjusted first fundamental frequency and the first spectral envelope to obtain a synthesized singing voice. Because the data needed to train the acoustic model is far less than the data required by existing singing voice synthesis, singing voice can be synthesized without collecting a large amount of data, which reduces the cost of singing voice synthesis. Moreover, the synthesized singing voice has the rhythm and melody of the original singer's performance of the target song, and the melody is continuous, so no unnatural listening impression arises from abrupt changes of pitch.

1. A singing voice synthesizing method, comprising:

acquiring a first lyric text of a target song;

determining a first phoneme of the first lyric text;

inputting the first phoneme and a preset singing duration of the first phoneme into a preset acoustic model for processing, and outputting a corresponding first acoustic feature, wherein the first acoustic feature comprises a first fundamental frequency and a first spectral envelope;

adjusting the first fundamental frequency according to a preset second fundamental frequency of the target song;

and synthesizing the adjusted first fundamental frequency and the first spectrum envelope to obtain synthesized singing voice.

2. The method of synthesizing singing voice according to claim 1, wherein the determining the first phoneme of the first lyric text comprises:

converting the first lyric text into a pinyin text, wherein the pinyin text comprises initials and finals;

and determining the first phoneme of the first lyric text according to the initials and the finals.

3. The method of synthesizing singing voice according to claim 1, further comprising, before the obtaining the first lyrics text of the target song:

obtaining a plurality of training samples, each training sample comprising: a second lyric text and a reading speech corresponding to the second lyric text;

extracting a second phoneme of a second lyric text and a second acoustic feature of the reading speech from each training sample, wherein the second acoustic feature comprises a third fundamental frequency and a second spectrum envelope;

determining the reading duration of the second phoneme according to the second phoneme and the reading speech;

inputting the second phoneme and the reading duration of the second phoneme into a recurrent neural network, and training a duration model;

inputting the second phoneme, the reading duration of the second phoneme and the second acoustic feature into a recurrent neural network, and training a feature model;

and obtaining a preset acoustic model according to the duration model and the feature model.

4. The method for synthesizing singing voice according to claim 3, wherein the inputting the first phoneme and the preset singing duration of the first phoneme into the preset acoustic model for processing and outputting the corresponding first acoustic feature comprises:

inputting the first phoneme into the duration model to obtain the reading duration of the first phoneme;

adjusting the reading duration of the first phoneme according to the singing duration of the first phoneme;

inputting the first phoneme and the adjusted reading duration of the first phoneme into the feature model to obtain the first acoustic feature, wherein the first acoustic feature comprises a first fundamental frequency and a first spectral envelope.

5. The method for synthesizing singing voice according to claim 1, wherein the adjusting the first fundamental frequency according to the second fundamental frequency of the preset target song comprises:

and adjusting the first fundamental frequency to be the second fundamental frequency according to the second fundamental frequency of a preset target song.

6. The method for synthesizing singing voice according to claim 1, further comprising, before the adjusting the first fundamental frequency according to the second fundamental frequency of the preset target song:

interpolating a zero value in a second fundamental frequency of a preset target song into a non-zero value;

determining a zero value in the first fundamental frequency;

adjusting the fundamental frequency value at the corresponding position in the second fundamental frequency according to the zero value in the first fundamental frequency.

7. The method for synthesizing singing voice according to claim 1, further comprising, after the synthesizing the adjusted first fundamental frequency and the first spectral envelope to obtain the synthesized singing voice:

performing voice changing processing on the synthesized singing voice;

and filtering the synthesized singing voice subjected to the voice changing processing.

8. A singing voice synthesizing device, comprising:

an acquisition unit for acquiring a first lyric text of a target song;

a determining unit for determining a first phoneme of the first lyric text;

a processing unit for inputting the first phoneme and a preset singing duration of the first phoneme into a preset acoustic model for processing, and outputting a corresponding first acoustic feature, wherein the first acoustic feature comprises a first fundamental frequency and a first spectral envelope;

an adjusting unit for adjusting the first fundamental frequency according to a preset second fundamental frequency of the target song;

and a synthesis unit for synthesizing the adjusted first fundamental frequency and the first spectral envelope to obtain a synthesized singing voice.

9. Singing voice synthesizing equipment, comprising:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the singing voice synthesis method of any one of claims 1-7.

10. A computer-readable storage medium storing computer instructions for causing a computer to execute the singing voice synthesizing method according to any one of claims 1 to 7.

Technical Field

The present application relates to the field of speech synthesis technologies, and in particular to a singing voice synthesis method, device and equipment.

Background

In recent years, singing voice synthesis technology has attracted wide attention. Its greatest convenience is that it lets a computer sing a song with any melody. One of the mainstream prior-art techniques for synthesizing singing voice is waveform splicing, whose core is to pre-record the singing voice of each pronunciation of a language at different pitches to build a voice synthesis database. Synthesizing singing voice from the recordings in such a database therefore relies on a very large amount of recorded data, and collecting that data takes a great deal of time and labor, which makes singing voice synthesis costly.

Summary of the Application

The embodiments of the application provide a singing voice synthesis method, device and equipment, to solve the problem that prior-art singing voice synthesis depends on a huge amount of recorded data whose collection consumes a great deal of time and labor, making singing voice synthesis costly.

To solve the above problem, in a first aspect, an embodiment of the present invention provides a singing voice synthesis method, including: acquiring a first lyric text of a target song; determining a first phoneme of the first lyric text; inputting the first phoneme and a preset singing duration of the first phoneme into a preset acoustic model for processing, and outputting a corresponding first acoustic feature, wherein the first acoustic feature comprises a first fundamental frequency and a first spectral envelope; adjusting the first fundamental frequency according to a preset second fundamental frequency of the target song; and synthesizing the adjusted first fundamental frequency and the first spectral envelope to obtain a synthesized singing voice.

Optionally, determining the first phoneme of the first lyric text comprises: converting the first lyric text into a pinyin text, wherein the pinyin text comprises initials and finals; and determining the first phoneme of the first lyric text according to the initials and the finals.

Optionally, before obtaining the first lyric text of the target song, the singing voice synthesizing method further includes: obtaining a plurality of training samples, each training sample comprising a second lyric text and a reading speech corresponding to the second lyric text; extracting a second phoneme of the second lyric text and a second acoustic feature of the reading speech from each training sample, wherein the second acoustic feature comprises a third fundamental frequency and a second spectral envelope; determining the reading duration of the second phoneme according to the second phoneme and the reading speech; inputting the second phoneme and the reading duration of the second phoneme into a recurrent neural network, and training a duration model; inputting the second phoneme, the reading duration of the second phoneme and the second acoustic feature into a recurrent neural network, and training a feature model; and obtaining the preset acoustic model according to the duration model and the feature model.

Optionally, inputting the first phoneme and the preset singing duration of the first phoneme into the preset acoustic model for processing and outputting the corresponding first acoustic feature, which includes the first fundamental frequency and the first spectral envelope, includes: inputting the first phoneme into the duration model to obtain the reading duration of the first phoneme; adjusting the reading duration of the first phoneme according to the singing duration of the first phoneme; and inputting the first phoneme and the adjusted reading duration of the first phoneme into the feature model to obtain the first acoustic feature.

Optionally, adjusting the first fundamental frequency according to a second fundamental frequency of the preset target song includes: and adjusting the first fundamental frequency to be the second fundamental frequency according to the second fundamental frequency of the preset target song.

Optionally, before adjusting the first fundamental frequency according to the second fundamental frequency of the preset target song, the singing voice synthesizing method further includes: interpolating zero values in the second fundamental frequency of the preset target song into non-zero values; determining the zero values in the first fundamental frequency; and adjusting the fundamental frequency value at the corresponding position in the second fundamental frequency according to each zero value in the first fundamental frequency.

Optionally, after synthesizing the adjusted first fundamental frequency and the first spectral envelope to obtain a synthesized singing voice, the singing voice synthesizing method further includes: performing voice changing processing on the synthesized singing voice; and filtering the synthesized singing voice after the voice changing processing.

In a second aspect, an embodiment of the present invention provides a singing voice synthesizing device, including: an acquisition unit for acquiring a first lyric text of a target song; a determining unit for determining a first phoneme of the first lyric text; a processing unit for inputting the first phoneme and a preset singing duration of the first phoneme into a preset acoustic model for processing, and outputting a corresponding first acoustic feature, wherein the first acoustic feature comprises a first fundamental frequency and a first spectral envelope; an adjusting unit for adjusting the first fundamental frequency according to a preset second fundamental frequency of the target song; and a synthesis unit for synthesizing the adjusted first fundamental frequency and the first spectral envelope to obtain a synthesized singing voice.

In a third aspect, an embodiment of the present invention provides singing voice synthesizing equipment, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the singing voice synthesis method of the first aspect or any embodiment of the first aspect.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are used to cause a computer to execute the singing voice synthesizing method according to the first aspect or any implementation manner of the first aspect.

The embodiments of the invention provide a singing voice synthesis method, device and equipment. The target song is processed in advance to obtain the singing duration of the first phoneme and the second fundamental frequency, and the acoustic model is trained in advance. When singing voice is synthesized, the first lyric text of the target song is acquired and its first phoneme is determined; the first phoneme and the preset singing duration of the first phoneme are input into the preset acoustic model, which outputs the corresponding first acoustic feature comprising the first fundamental frequency and the first spectral envelope; the first fundamental frequency is adjusted according to the preset second fundamental frequency of the target song; and the adjusted first fundamental frequency and the first spectral envelope are synthesized to obtain the synthesized singing voice. Because the data used to train the acoustic model is far less than the data required by existing singing voice synthesis, singing voice can be synthesized without collecting a large amount of data, which reduces the cost of singing voice synthesis. In addition, because the first acoustic feature is produced from the first phoneme and its preset singing duration, it carries the rhythm of the original singer's performance of the target song; and because the first fundamental frequency is then adjusted to be consistent with the second fundamental frequency of the target song, the synthesized singing voice has both the rhythm and the melody of the original performance. Since the singing voice is synthesized from the adjusted first fundamental frequency, its melody is continuous, and no unnatural listening impression is produced by abrupt changes of pitch.

The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the application can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features and advantages of the application are easier to understand, specific embodiments of the application are described below.

Drawings

FIG. 1 is a schematic flow chart of a singing voice synthesizing method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a singing voice synthesizing device according to an embodiment of the present invention;

FIG. 3 is a schematic hardware configuration diagram of singing voice synthesizing equipment according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An embodiment of the present invention provides a singing voice synthesis method, as shown in fig. 1, including:

S101, acquiring a first lyric text of a target song. Specifically, the executing subject of the present invention may be a singing voice synthesizing device, a terminal, or a server, which is not specifically limited herein; in the embodiments of the present invention, the singing voice synthesizing device is taken as an example. The singing voice synthesizing device can receive a singing request from a user over a wired or wireless connection and then acquire the lyric text of the target song according to that singing request. The target song may be a song designated by the user, a song randomly selected by the singing voice synthesizing device from a preset song library when the singing request is received, or a song selected by the singing voice synthesizing device from the preset song library according to the user's behavior and usage habits.

S102, determining a first phoneme of the first lyric text. Specifically, the first phoneme of the first lyric text can be determined according to the initials and finals corresponding to the lyrics in the first lyric text.

S103, inputting the first phoneme and a preset singing duration of the first phoneme into a preset acoustic model for processing, and outputting a corresponding first acoustic feature, wherein the first acoustic feature comprises a first fundamental frequency and a first spectral envelope. Specifically, the singing duration of the first phoneme may be set in advance in the singing voice synthesizing device. To calculate the singing duration of the first phoneme in the target song, if the target song is mixed with background music, the Spleeter open-source tool may be used to separate the song into dry vocals and background music. A duration alignment method (alignment) from a speech recognition model is then used to mark the duration and position of each first phoneme of the lyrics in the dry vocals, producing a pre-labeled duration file. The duration file includes: the song id, the position of the dry vocals relative to the background music, and the singing duration of each first phoneme in the lyrics. The duration file is then converted into TextGrid format, and the pre-labeled singing durations of the first phonemes are fine-tuned with the Praat speech analysis tool to produce accurate singing durations.

The preset acoustic model can be obtained by training a recurrent neural network with a plurality of training samples. Each training sample includes a second lyric text and the reading speech corresponding to the second lyric text. The reading speech may be produced by a user reading the second lyric text aloud, or by a speech broadcast model reading out the second lyric text. Inputting the first phoneme and the preset singing duration of the first phoneme into the preset acoustic model yields the corresponding first acoustic feature. The first spectral envelope includes fast Fourier transform features, aperiodic component features, mel-spectrum features, linear predictive coding (LPC) features, and Fbank features. When a person sings a long note, each phoneme within a character is stretched by a different amount. Because the first acoustic feature is obtained from the first phoneme and the singing duration of the first phoneme, it carries the rhythm of the target song and matches the way people actually sing, which makes the synthesized song more accurate.
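The conversion of the pre-labeled duration file into TextGrid format, mentioned in step S103, can be illustrated with a short sketch. It assumes the phoneme labels and their singing durations (in seconds) have already been read from the duration file, and it writes a single interval tier in Praat's long text format. The function name `to_textgrid`, the tier name, and the four-decimal formatting are illustrative choices and are not taken from the original.

```python
def to_textgrid(phonemes, durations, tier_name="phonemes"):
    """Render per-phoneme durations (seconds) as a Praat long-format TextGrid string."""
    if len(phonemes) != len(durations):
        raise ValueError("each phoneme needs exactly one duration")
    total = sum(durations)
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        '',
        'xmin = 0',
        f'xmax = {total:.4f}',
        'tiers? <exists>',
        'size = 1',
        'item []:',
        '    item [1]:',
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        '        xmin = 0',
        f'        xmax = {total:.4f}',
        f'        intervals: size = {len(phonemes)}',
    ]
    start = 0.0
    for index, (phoneme, duration) in enumerate(zip(phonemes, durations), start=1):
        end = start + duration
        lines += [
            f'        intervals [{index}]:',
            f'            xmin = {start:.4f}',
            f'            xmax = {end:.4f}',
            f'            text = "{phoneme}"',
        ]
        start = end  # the next interval begins where this one ends
    return "\n".join(lines) + "\n"
```

The returned string can be saved with a `.TextGrid` extension and opened in Praat, where the interval boundaries can be adjusted by hand.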

S104, adjusting the first fundamental frequency according to the preset second fundamental frequency of the target song. Specifically, the second fundamental frequency of the target song may be extracted in advance and preset in the singing voice synthesizing device; it may be extracted with a fundamental frequency extraction tool. Adjusting the first fundamental frequency according to the preset second fundamental frequency gives the first fundamental frequency the melody of the target song.

S105, synthesizing the adjusted first fundamental frequency and the first spectral envelope to obtain a synthesized singing voice. Specifically, the adjusted first fundamental frequency and the first spectral envelope are synthesized to obtain the singing voice corresponding to the lyric text of the target song. It is understood that the synthesized singing voice obtained in this embodiment is a singing voice without accompaniment.

In the singing voice synthesis method provided by this embodiment, the target song is processed in advance to obtain the singing duration of the first phoneme and the second fundamental frequency, and the acoustic model is trained in advance. At synthesis time, the first lyric text of the target song is acquired, its first phoneme is determined, the first phoneme and its preset singing duration are input into the preset acoustic model to output the first acoustic feature (the first fundamental frequency and the first spectral envelope), the first fundamental frequency is adjusted according to the preset second fundamental frequency of the target song, and the adjusted first fundamental frequency and the first spectral envelope are synthesized into the singing voice. Because the data used to train the acoustic model is far less than the data required by existing singing voice synthesis, singing voice can be synthesized without collecting a large amount of data, which reduces cost. Because the first acoustic feature is produced from the first phoneme and its preset singing duration, it has the rhythm of the original singer's performance; because the first fundamental frequency is made consistent with the second fundamental frequency of the target song, the synthesized singing voice also has the melody of the original performance. Since the singing voice is synthesized from the adjusted first fundamental frequency, its melody is continuous, and no unnatural listening impression is produced by abrupt changes of pitch.
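The flow of steps S101 to S105 can be summarized as a small pipeline skeleton. This is only a sketch of how the stages connect: the four callables (`to_phonemes`, `acoustic_model`, `adjust_f0`, `vocoder`) are hypothetical placeholders for the components described in the text, and their signatures are assumptions made for illustration.

```python
from typing import Callable, Sequence


def synthesize_singing(
    lyric_text: str,                      # S101: first lyric text of the target song
    singing_durations: Sequence[float],   # preset singing duration of each first phoneme
    target_f0: Sequence[float],           # preset second fundamental frequency
    to_phonemes: Callable,                # hypothetical: text -> list of first phonemes
    acoustic_model: Callable,             # hypothetical: (phonemes, durations) -> (f0, envelope)
    adjust_f0: Callable,                  # hypothetical: (first f0, second f0) -> adjusted f0
    vocoder: Callable,                    # hypothetical: (f0, envelope) -> waveform
):
    """Chain the stages of the method; every stage is supplied by the caller."""
    phonemes = to_phonemes(lyric_text)                                    # S102
    if len(phonemes) != len(singing_durations):
        raise ValueError("one singing duration is required per phoneme")
    first_f0, envelope = acoustic_model(phonemes, singing_durations)      # S103
    adjusted_f0 = adjust_f0(first_f0, target_f0)                          # S104
    return vocoder(adjusted_f0, envelope)                                 # S105
```

Keeping each stage as a separate callable mirrors the unit structure of the device (acquisition, determining, processing, adjusting and synthesis units) and lets any stage be replaced independently.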

In an alternative embodiment, step S102 of determining the first phoneme of the first lyric text comprises: converting the first lyric text into a pinyin text, wherein the pinyin text comprises initials and finals; and determining the first phoneme of the first lyric text according to the initials and the finals.

Specifically, the lyrics in the first lyric text may be converted into pinyin with the pypinyin tool or a speech synthesis tool to obtain the pinyin text. The first phoneme of the first lyric text is then determined according to the initials and finals in the pinyin text: each initial corresponds to one first phoneme, and each final corresponds to one first phoneme.

Because the initial and the final of a syllable are stretched by different amounts when a person sings, the first lyric text is converted into the pinyin text and the first phoneme is then determined according to the initials and finals. The first acoustic feature determined from the first phoneme and the preset singing duration of the first phoneme therefore better matches how people actually sing, which makes the synthesized singing voice more natural.
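The split of each pinyin syllable into an initial and a final can be sketched as follows. The sketch assumes the lyrics have already been converted to toneless pinyin syllables (for example with pypinyin), and it treats `y` and `w` as initials, which is a simplifying assumption rather than something the original specifies. The names `split_syllable` and `lyrics_to_phonemes` are hypothetical.

```python
# Mandarin initials; the two-letter ones come first so that "zh", "ch" and "sh"
# are matched before "z", "c" and "s". Treating "y" and "w" as initials is an
# assumption made for this sketch.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split one toneless pinyin syllable into its initial and final."""
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # syllable without an initial, such as "ai" or "er"


def lyrics_to_phonemes(pinyin_syllables):
    """Flatten a list of pinyin syllables into a phoneme sequence."""
    phonemes = []
    for syllable in pinyin_syllables:
        phonemes.extend(split_syllable(syllable))
    return phonemes
```

For the syllables `["wo", "ai", "ni"]` this yields `["w", "o", "ai", "n", "i"]`, so each initial and each final becomes one phoneme that can receive its own singing duration.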

In an alternative embodiment, before obtaining the first lyric text of the target song in step S101, the singing voice synthesizing method further includes: obtaining a plurality of training samples, each training sample comprising a second lyric text and a reading speech corresponding to the second lyric text; extracting a second phoneme of the second lyric text and a second acoustic feature of the reading speech from each training sample, wherein the second acoustic feature comprises a third fundamental frequency and a second spectral envelope; determining the reading duration of the second phoneme according to the second phoneme and the reading speech; inputting the second phoneme and the reading duration of the second phoneme into a recurrent neural network, and training a duration model; inputting the second phoneme, the reading duration of the second phoneme and the second acoustic feature into a recurrent neural network, and training a feature model; and obtaining the preset acoustic model according to the duration model and the feature model.

Specifically, the preset acoustic model may include a duration model and a feature model. To train the duration model, the second phoneme and the reading duration of the second phoneme are input into a recurrent neural network. The reading duration of the second phoneme can be obtained by using the second lyric text and a speech recognition model to label the duration of the second phoneme in the reading speech. To train the feature model, the second phoneme, the reading duration of the second phoneme, and the second acoustic feature are input into a recurrent neural network. The acoustic model is then obtained from the trained duration model and the trained feature model.

Because the acoustic model is obtained by training the duration model and the feature model on reading speech, it requires far less data than existing singing voice synthesis. Singing voice can therefore be synthesized without collecting a large amount of data, which reduces the cost of singing voice synthesis.
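To make the roles of the two models concrete, the sketch below shows one way the inputs and targets for each model could be assembled from a training sample. The `TrainingSample` container and both helper functions are hypothetical, and the actual recurrent networks and their training loops are deliberately left out because the original does not specify them.

```python
from dataclasses import dataclass


@dataclass
class TrainingSample:
    phonemes: list            # second phonemes of the second lyric text
    reading_durations: list   # reading duration of each second phoneme, in seconds
    f0: list                  # third fundamental frequency, one value per frame
    spectral_envelope: list   # second spectral envelope, one vector per frame


def duration_model_pair(sample):
    """Input and target for the duration model: phonemes -> reading durations."""
    return sample.phonemes, sample.reading_durations


def feature_model_pair(sample):
    """Input and target for the feature model:
    (phoneme, reading duration) pairs -> per-frame (f0, spectral envelope)."""
    inputs = list(zip(sample.phonemes, sample.reading_durations))
    targets = list(zip(sample.f0, sample.spectral_envelope))
    return inputs, targets
```

The duration model learns only from text and timing, while the feature model additionally sees the acoustic features, which is why the two are trained separately and then combined into the preset acoustic model.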

In an alternative embodiment, step S103 of inputting the first phoneme and the preset singing duration of the first phoneme into the preset acoustic model for processing and outputting the corresponding first acoustic feature, which includes the first fundamental frequency and the first spectral envelope, includes: inputting the first phoneme into the duration model to obtain the reading duration of the first phoneme; adjusting the reading duration of the first phoneme according to the singing duration of the first phoneme; and inputting the first phoneme and the adjusted reading duration of the first phoneme into the feature model to obtain the first acoustic feature.

Specifically, since the duration model is trained on the second phoneme and the reading duration of the second phoneme, inputting the first phoneme into the duration model outputs the reading duration of the first phoneme. The reading duration of the first phoneme may then be adjusted based on the singing duration of the first phoneme so that the two correspond. Because the feature model is trained on the second phoneme, the reading duration of the second phoneme, and the second acoustic feature, inputting the first phoneme and its adjusted reading duration into the feature model outputs the first acoustic feature corresponding to the target song.

The reading duration of the first phoneme is adjusted according to the singing duration of the first phoneme, and the first phoneme and its adjusted reading duration are input into the feature model to obtain the first acoustic feature. Because the first acoustic feature is thus obtained based on the singing duration of the first phoneme, it has the rhythm of the target song.
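A minimal sketch of the duration adjustment follows. It makes two assumptions that the original does not state: that "adjusting" means replacing each predicted reading duration with the preset singing duration, and that the feature model consumes one phoneme label per frame with a 5 ms frame shift. The function names and the frame shift value are illustrative.

```python
def adjust_durations(reading_durations, singing_durations):
    """Make each reading duration correspond to the preset singing duration.
    Assumption: the singing duration simply replaces the predicted one."""
    if len(reading_durations) != len(singing_durations):
        raise ValueError("duration lists must have the same length")
    return list(singing_durations)


def expand_to_frames(phonemes, durations, frame_shift=0.005):
    """Repeat each phoneme once per frame so that it spans its duration.
    The 5 ms frame shift is an assumed value."""
    frames = []
    for phoneme, duration in zip(phonemes, durations):
        count = max(1, round(duration / frame_shift))  # at least one frame per phoneme
        frames.extend([phoneme] * count)
    return frames
```

With this representation, a phoneme held for a long note occupies proportionally more frames, which is how the singing rhythm reaches the feature model.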

In an alternative embodiment, in step S104, adjusting the first fundamental frequency according to the second fundamental frequency of the preset target song includes: adjusting the first fundamental frequency to be the second fundamental frequency.

Specifically, the first fundamental frequency is adjusted to the second fundamental frequency, so that the first fundamental frequency can be made to have a tune identical to the target song.

In an optional embodiment, before adjusting the first fundamental frequency according to the second fundamental frequency of the preset target song, the singing voice synthesizing method further includes: interpolating zero values in the second fundamental frequency of the preset target song into non-zero values; determining the zero values in the first fundamental frequency; and adjusting the fundamental frequency value at the corresponding position in the second fundamental frequency according to each zero value in the first fundamental frequency.

Specifically, the zero value in the second fundamental frequency is interpolated into a non-zero value mainly to ensure that the second fundamental frequency does not gradually fall to zero at the beginning and end of each sentence of lyrics in the lyric text. Some first phonemes, such as b and sh, carry no fundamental frequency information, and the first fundamental frequency corresponding to such a phoneme is zero. Accordingly, when synthesizing the singing voice, the fundamental frequency value at the corresponding position in the second fundamental frequency should be set to zero before the first fundamental frequency is adjusted according to the second fundamental frequency. The fundamental frequency value at the corresponding position in the second fundamental frequency is therefore adjusted according to the zero value in the first fundamental frequency.

By adjusting the second fundamental frequency, the noise problem caused by inaccurate extraction of the second fundamental frequency can be reduced.
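The handling of the second fundamental frequency described above can be sketched as follows. This is an illustrative example only: it assumes linear interpolation (the invention does not name the interpolation method), assumes that both fundamental frequency contours are frame-aligned arrays of equal length in Hz, and uses hypothetical function names.

```python
import numpy as np

def interpolate_zeros(f0):
    """Fill zero (unvoiced) frames by linear interpolation between voiced frames.

    Leading and trailing zeros take the nearest voiced value, so the contour
    does not decay toward zero at the start or end of a phrase.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        return f0.copy()
    idx = np.arange(len(f0))
    return np.interp(idx, idx[voiced], f0[voiced])

def adjust_f0(first_f0, second_f0):
    """Replace the first F0 with the second F0, keeping the first F0's zeros.

    Frames where the model output (first F0) is zero, e.g. phonemes such as
    b or sh, are set to zero in the interpolated second F0.
    """
    first_f0 = np.asarray(first_f0, dtype=float)
    adjusted = interpolate_zeros(second_f0)
    if first_f0.shape != adjusted.shape:
        raise ValueError("the two F0 contours must have the same length")
    adjusted[first_f0 == 0] = 0.0
    return adjusted

if __name__ == "__main__":
    first = [0, 120, 130, 0, 125]    # F0 output by the acoustic model
    second = [0, 200, 0, 0, 260]     # F0 extracted from the target song
    print(adjust_f0(first, second))
```

Here the voiced/unvoiced decision comes from the first fundamental frequency, while the pitch values come from the second fundamental frequency, so extraction gaps in the target song do not introduce spurious unvoiced frames.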

In an alternative embodiment, after synthesizing the adjusted first fundamental frequency and the first spectral envelope to obtain the synthesized singing voice, the singing voice synthesizing method further includes: performing voice changing processing on the synthesized singing voice; and filtering the synthesized singing voice after the voice changing processing.

Specifically, voice changing processing can be performed on the synthesized singing voice using the open-source SoundTouch tool, and hissing in the synthesized singing voice can be eliminated by low-pass filtering. Background music may also be added to the synthesized singing voice. When the singing voice is synthesized at a given sampling rate, the background music may be up-sampled or down-sampled accordingly (supported rates include, but are not limited to, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz and 48 kHz). A reverberation operation may also be performed on the synthesized singing voice. Performing voice changing processing on the synthesized singing voice and filtering the result improves the singing effect of the synthesized singing voice.
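The filtering and background music steps can be illustrated with a NumPy-only sketch. All of the following are assumptions made for illustration and are not specified by the invention: the windowed-sinc FIR design, the 4 kHz cutoff, the linear-interpolation resampler (which applies no anti-aliasing and is cruder than a production resampler), and the mixing gain. Voice changing with SoundTouch is not shown.

```python
import numpy as np

def lowpass_fir(signal, sample_rate, cutoff_hz, num_taps=101):
    """Low-pass filter using a Hamming-windowed sinc FIR (odd tap count)."""
    fc = cutoff_hz / sample_rate                  # cutoff in cycles per sample
    n = np.arange(num_taps) - (num_taps - 1) / 2  # taps centred on zero
    taps = 2 * fc * np.sinc(2 * fc * n) * np.hamming(num_taps)
    taps /= taps.sum()                            # unity gain at DC
    return np.convolve(signal, taps, mode="same")

def resample_linear(signal, src_rate, dst_rate):
    """Resample by linear interpolation (no anti-aliasing; a rough sketch)."""
    n_out = int(round(len(signal) * dst_rate / src_rate))
    t_out = np.arange(n_out) / dst_rate
    t_in = np.arange(len(signal)) / src_rate
    return np.interp(t_out, t_in, signal)

def mix(voice, music, music_gain=0.5):
    """Add background music to the voice, truncating to the shorter signal."""
    n = min(len(voice), len(music))
    return voice[:n] + music_gain * music[:n]

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    voice = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 7000 * t)
    clean = lowpass_fir(voice, sr, cutoff_hz=4000)
    music = resample_linear(np.zeros(44100), 44100, sr)
    print(len(mix(clean, music)))
```

Resampling the background music to the synthesis sampling rate before mixing keeps the two signals sample-aligned.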

An embodiment of the present invention further provides a singing voice synthesizing apparatus, as shown in fig. 2, including: an obtaining unit 201, configured to obtain a first lyric text of a target song; the specific implementation manner is described in detail in step S101 of the above embodiment, and is not described again here. A determining unit 202, configured to determine a first phoneme of the first lyric text; the specific implementation manner is described in detail in step S102 of the above embodiment, and is not described again here. A processing unit 203, configured to input the first phoneme and a singing duration preset for the first phoneme into a preset acoustic model for processing, and output a corresponding first acoustic feature, where the first acoustic feature includes a first fundamental frequency and a first spectral envelope; the specific implementation manner is described in detail in step S103 of the above embodiment, and is not described again here. An adjusting unit 204, configured to adjust the first fundamental frequency according to a second fundamental frequency of the preset target song; the specific implementation manner is described in detail in step S104 of the above embodiment, and is not described again here. A synthesizing unit 205, configured to synthesize the adjusted first fundamental frequency and the first spectral envelope to obtain a synthesized singing voice; the specific implementation manner is described in detail in step S105 of the above embodiment, and is not described again here.

The singing voice synthesizing device provided by the embodiment of the invention processes a target song in advance to obtain the singing duration of the first phoneme and the second fundamental frequency, and trains an acoustic model in advance. When synthesizing the singing voice, the device obtains the first lyric text of the target song, determines the first phoneme of the first lyric text, inputs the first phoneme and the preset singing duration of the first phoneme into the preset acoustic model for processing, and outputs the corresponding first acoustic feature, which includes the first fundamental frequency and the first spectral envelope. The device then adjusts the first fundamental frequency according to the second fundamental frequency of the preset target song, and synthesizes the adjusted first fundamental frequency and the first spectral envelope to obtain the synthesized singing voice. Because the data used to train the acoustic model is far less than the data required by existing singing voice synthesis, singing voice synthesis can be realized without collecting a large amount of data, which reduces the cost of singing voice synthesis. In addition, because the first acoustic feature is output from the first phoneme and the preset singing duration of the first phoneme, the first acoustic feature has the rhythm of the original singer of the target song. Because the first fundamental frequency is then adjusted to be consistent with the second fundamental frequency of the target song, the synthesized singing voice has both the rhythm and the melody of the original singer of the target song. Since the synthesized singing voice is synthesized based on the adjusted first fundamental frequency, its melody is continuous, and no unnatural listening impression is produced by sudden changes in pitch.

Based on the same inventive concept as the singing voice synthesis method in the foregoing embodiments, the present invention also provides singing voice synthesis equipment having stored thereon a computer program which, when executed by a processor, implements the steps of any one of the singing voice synthesis methods described above.

In fig. 3, a bus architecture is represented by bus 300. Bus 300 may include any number of interconnected buses and bridges, and links together various circuits including one or more processors, represented by processor 302, and memory, represented by memory 304. The bus 300 may also link together various other circuits such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore are not described further herein. A bus interface 306 provides an interface between the bus 300 and the receiver 301 and transmitter 303. The receiver 301 and the transmitter 303 may be the same element, i.e., a transceiver, providing a means for communicating with various other apparatus over a transmission medium.

The processor 302 is responsible for managing the bus 300 and general processing, and the memory 304 may be used for storing data used by the processor 302 in performing operations.

Based on the same inventive concept as the singing voice synthesizing method in the foregoing embodiments, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, realizes the steps of:

acquiring a first lyric text of a target song; determining a first phoneme of the first lyric text; inputting the first phoneme and a singing duration preset for the first phoneme into a preset acoustic model for processing, and outputting a corresponding first acoustic feature, wherein the first acoustic feature comprises a first fundamental frequency and a first spectral envelope; adjusting the first fundamental frequency according to the second fundamental frequency of the preset target song; and synthesizing the adjusted first fundamental frequency and the first spectral envelope to obtain the synthesized singing voice.

In a specific implementation, when the program is executed by a processor, any method step in the first embodiment may be further implemented.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable information processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable information processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable information processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable information processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
