Song processing method and device

Document No.: 1339657 | Publication date: 2020-07-17

Description: This technique, "Song processing method and device," was created by 孙见青 (Sun Jianqing) on 2020-03-03. Abstract: The invention relates to a song processing method and device. The method comprises the following steps: acquiring numbered musical notation information of a song; determining theoretical audio features of the numbered musical notation information and current phoneme features of the song's text; determining the target phoneme duration of the song according to the current phoneme features, the theoretical audio features, and a phoneme duration model; and performing speech synthesis on the song's text according to the target phoneme duration. According to this technical scheme, when the song is synthesized, the phoneme duration and the synthesized pitch are controlled using the theoretical audio features and the phoneme duration model, so that the accuracy and naturalness of singing synthesis are improved.

1. A song processing method, comprising:

acquiring numbered musical notation information of songs;

determining theoretical audio features of the numbered musical notation information and current phoneme features of the text of the song;

determining the target phoneme duration of the song according to the current phoneme characteristics, the theoretical audio characteristics and the phoneme duration model;

and carrying out voice synthesis on the text of the song according to the target phoneme duration.

2. The method of claim 1,

the theoretical audio features comprise theoretical syllable duration and theoretical fundamental frequency values;

determining the target phoneme duration of the song according to the current phoneme characteristics, the theoretical audio characteristics and the phoneme duration model, wherein the determining comprises:

inputting the current phoneme characteristics into the phoneme duration model to obtain a current phoneme duration;

and adjusting the current phoneme duration according to the theoretical syllable duration to obtain the target phoneme duration.

3. The method of claim 2,

the speech synthesis of the text of the song according to the target phoneme duration comprises:

performing frame expansion on the phonemes of the text of the song according to the target phoneme duration to obtain target phoneme characteristics of the text after frame expansion;

inputting the theoretical fundamental frequency value and the target phoneme characteristics into an end-to-end speech synthesis model to predict the acoustic parameters of the song;

and reconstructing a target voice corresponding to the text of the song according to the acoustic parameters of the song.

4. The method of claim 3,

the number of phonemes of the text after frame expansion is the same as the number of frames of the target voice;

the acoustic parameters include a fundamental frequency and spectral parameters.

5. The method of claim 2,

the determining theoretical audio features of the numbered musical notation information includes:

determining the theoretical syllable duration d_i (in seconds) of the numbered musical notation information through a first preset formula, wherein the first preset formula is as follows:

d_i = (60 / tmpo) × dnote_i

wherein tmpo is the tempo in the numbered musical notation information, i.e. the number of beats per minute, and dnote_i is the note value, in beats, of the current syllable i;

determining the theoretical fundamental frequency value f0 of the numbered musical notation information through a second preset formula, wherein the second preset formula is as follows:

f0 = 440 × 2^(p / 12)

wherein 440 represents the frequency (in Hz) of the A above middle C (A4), and p is the distance, in semitones, between the pitch noted in the numbered musical notation information and A4.

6. A song processing apparatus, comprising:

the acquisition module is used for acquiring numbered musical notation information of the songs;

the first determining module is used for determining theoretical audio characteristics of the numbered musical notation information and current phoneme characteristics of the text of the song;

a second determining module, configured to determine a target phoneme duration of the song according to the current phoneme feature, the theoretical audio feature, and a phoneme duration model;

and the synthesis module is used for carrying out voice synthesis on the text of the song according to the target phoneme duration.

7. The apparatus of claim 6,

the theoretical audio features comprise theoretical syllable duration and theoretical fundamental frequency values;

the second determining module includes:

the input submodule is used for inputting the current phoneme characteristics to the phoneme duration model to obtain the current phoneme duration;

and the adjusting submodule is used for adjusting the current phoneme duration according to the theoretical syllable duration to obtain the target phoneme duration.

8. The apparatus of claim 7,

the synthesis module comprises:

the extension submodule is used for carrying out frame extension on the phonemes of the text of the song according to the target phoneme duration to obtain target phoneme characteristics of the text after frame extension;

the prediction submodule is used for inputting the theoretical fundamental frequency value and the target phoneme characteristics into an end-to-end speech synthesis model so as to predict the acoustic parameters of the song;

and the reconstruction submodule is used for reconstructing the target voice corresponding to the text of the song according to the acoustic parameters of the song.

9. The apparatus of claim 8,

the number of phonemes of the text after frame expansion is the same as the number of frames of the target voice;

the acoustic parameters include a fundamental frequency and spectral parameters.

10. The apparatus of claim 7,

the first determining module includes:

a first determining submodule, configured to determine the theoretical syllable duration d_i (in seconds) of the numbered musical notation information through a first preset formula, wherein the first preset formula is as follows:

d_i = (60 / tmpo) × dnote_i

wherein tmpo is the tempo in the numbered musical notation information, i.e. the number of beats per minute, and dnote_i is the note value, in beats, of the current syllable i;

a second determining submodule, configured to determine the theoretical fundamental frequency value f0 of the numbered musical notation information through a second preset formula, wherein the second preset formula is as follows:

f0 = 440 × 2^(p / 12)

wherein 440 represents the frequency (in Hz) of the A above middle C (A4), and p is the distance, in semitones, between the pitch noted in the numbered musical notation information and A4.

Technical Field

The invention relates to the technical field of song processing, in particular to a song processing method and device.

Background

At present, speech synthesis of song text is needed in many scenarios. In the prior art, however, such synthesis is poorly controllable: the synthesis rhythm and pitch often cannot be controlled, which results in low naturalness of the synthesized singing.

Disclosure of Invention

The embodiment of the invention provides a song processing method and device. The technical scheme is as follows:

according to a first aspect of embodiments of the present invention, there is provided a song processing method, including:

acquiring numbered musical notation information of songs;

determining theoretical audio features of the numbered musical notation information and current phoneme features of the text of the song;

determining the target phoneme duration of the song according to the current phoneme characteristics, the theoretical audio characteristics and the phoneme duration model;

and carrying out voice synthesis on the text of the song according to the target phoneme duration.

In one embodiment, the theoretical audio features include theoretical syllable duration and theoretical fundamental frequency value;

determining the target phoneme duration of the song according to the current phoneme characteristics, the theoretical audio characteristics and the phoneme duration model, wherein the determining comprises:

inputting the current phoneme characteristics into the phoneme duration model to obtain a current phoneme duration;

and adjusting the current phoneme duration according to the theoretical syllable duration to obtain the target phoneme duration.

In one embodiment, the speech synthesizing the text of the song according to the target phoneme duration includes:

performing frame expansion on the phonemes of the text of the song according to the target phoneme duration to obtain target phoneme characteristics of the text after frame expansion;

inputting the theoretical fundamental frequency value and the target phoneme characteristics into an end-to-end speech synthesis model to predict the acoustic parameters of the song;

and reconstructing a target voice corresponding to the text of the song according to the acoustic parameters of the song.

In one embodiment, the number of phonemes of the frame-extended text is the same as the number of frames of the target speech;

the acoustic parameters include a fundamental frequency and spectral parameters.

In one embodiment, the determining theoretical audio features of the numbered musical notation information comprises:

determining the theoretical syllable duration d_i (in seconds) of the numbered musical notation information through a first preset formula, wherein the first preset formula is as follows:

d_i = (60 / tmpo) × dnote_i

wherein tmpo is the tempo in the numbered musical notation information, i.e. the number of beats per minute, and dnote_i is the note value, in beats, of the current syllable i;

determining the theoretical fundamental frequency value f0 of the numbered musical notation information through a second preset formula, wherein the second preset formula is as follows:

f0 = 440 × 2^(p / 12)

wherein 440 represents the frequency (in Hz) of the A above middle C (A4), and p is the distance, in semitones, between the pitch noted in the numbered musical notation information and A4.

According to a second aspect of embodiments of the present invention, there is provided a song processing apparatus including:

the acquisition module is used for acquiring numbered musical notation information of the songs;

the first determining module is used for determining theoretical audio characteristics of the numbered musical notation information and current phoneme characteristics of the text of the song;

a second determining module, configured to determine a target phoneme duration of the song according to the current phoneme feature, the theoretical audio feature, and a phoneme duration model;

and the synthesis module is used for carrying out voice synthesis on the text of the song according to the target phoneme duration.

In one embodiment, the theoretical audio features include theoretical syllable duration and theoretical fundamental frequency value;

the second determining module includes:

the input submodule is used for inputting the current phoneme characteristics to the phoneme duration model to obtain the current phoneme duration;

and the adjusting submodule is used for adjusting the current phoneme duration according to the theoretical syllable duration to obtain the target phoneme duration.

In one embodiment, the synthesis module comprises:

the extension submodule is used for carrying out frame extension on the phonemes of the text of the song according to the target phoneme duration to obtain target phoneme characteristics of the text after frame extension;

the prediction submodule is used for inputting the theoretical fundamental frequency value and the target phoneme characteristics into an end-to-end speech synthesis model so as to predict the acoustic parameters of the song;

and the reconstruction submodule is used for reconstructing the target voice corresponding to the text of the song according to the acoustic parameters of the song.

In one embodiment, the number of phonemes of the frame-extended text is the same as the number of frames of the target speech;

the acoustic parameters include a fundamental frequency and spectral parameters.

In one embodiment, the first determining module comprises:

a first determining submodule, configured to determine the theoretical syllable duration d_i (in seconds) of the numbered musical notation information through a first preset formula, wherein the first preset formula is as follows:

d_i = (60 / tmpo) × dnote_i

wherein tmpo is the tempo in the numbered musical notation information, i.e. the number of beats per minute, and dnote_i is the note value, in beats, of the current syllable i;

a second determining submodule, configured to determine the theoretical fundamental frequency value f0 of the numbered musical notation information through a second preset formula, wherein the second preset formula is as follows:

f0 = 440 × 2^(p / 12)

wherein 440 represents the frequency (in Hz) of the A above middle C (A4), and p is the distance, in semitones, between the pitch noted in the numbered musical notation information and A4.

The technical scheme provided by the embodiment of the invention can have the following beneficial effects:

after the theoretical audio features and the current phoneme features of a song's text are determined from the song's numbered musical notation information, the target phoneme duration of the song can be determined automatically from the current phoneme features, the theoretical audio features, and a phoneme duration model, and the song's text is then speech-synthesized using that target phoneme duration. In this way, when the song is synthesized, the theoretical audio features and the phoneme duration model are used to control the phoneme duration and the synthesized pitch, improving the accuracy and naturalness of singing synthesis.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow diagram illustrating a song processing method according to an exemplary embodiment.

Fig. 2 is a block diagram illustrating a song processing apparatus according to an example embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

In order to solve the above technical problem, an embodiment of the present invention provides a song processing method, which may be used in a song processing program, system or device, and an execution subject corresponding to the method may be a terminal or a server, as shown in fig. 1, where the method includes steps S101 to S104:

in step S101, the numbered musical notation information of the song is acquired;

in step S102, theoretical audio characteristics of the numbered musical notation information and current phoneme characteristics of the text of the song are determined;

in step S103, determining a target phoneme duration of the song according to the current phoneme feature, the theoretical audio feature and a phoneme duration model;

the phoneme duration model is used to predict the duration of the phonemes of the text.

In step S104, the text of the song is speech-synthesized according to the target phoneme duration.

After theoretical audio characteristics and current phoneme characteristics of a text of a song are determined according to numbered musical notation information of the song, target phoneme duration of the song can be automatically determined according to the current phoneme characteristics, the theoretical audio characteristics and a phoneme duration model, so that the target phoneme duration is utilized to carry out speech synthesis on the text of the song, the phoneme duration and the synthesis pitch are controlled by utilizing the theoretical audio characteristics and the phoneme duration model when the song is synthesized, and the accuracy and the naturalness of singing synthesis are improved.

In one embodiment, the theoretical audio features include theoretical syllable duration and theoretical fundamental frequency value;

determining the target phoneme duration of the song according to the current phoneme characteristics, the theoretical audio characteristics and the phoneme duration model, wherein the determining comprises:

inputting the current phoneme characteristics into the phoneme duration model to obtain a current phoneme duration;

the current phoneme characterization is the current phoneme characterization of the numbered musical notation information of the song.

And adjusting the current phoneme duration according to the theoretical syllable duration to obtain the target phoneme duration.

The current phoneme duration can be obtained by inputting the current phoneme features into the phoneme duration model; the current phoneme duration is then stretched or compressed using the theoretical syllable duration from the numbered musical notation to obtain the target phoneme duration. In this way, the rhythm and pitch information contained in the numbered musical notation is used to adjust the phonemes of the song's text, so as to control the pitch during singing synthesis and improve its naturalness.
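One plausible reading of this adjustment step, sketched here under that assumption (the function name and per-syllable grouping are illustrative, not from the patent), is to rescale the model-predicted phoneme durations of a syllable so that they sum to the theoretical syllable duration:

```python
def adjust_phoneme_durations(predicted, theoretical_syllable_dur):
    """Stretch or compress the model-predicted phoneme durations of one
    syllable so that their sum equals the theoretical syllable duration
    taken from the numbered musical notation (a hypothetical sketch)."""
    scale = theoretical_syllable_dur / sum(predicted)
    return [d * scale for d in predicted]

# The model predicts 0.1 s + 0.3 s for a syllable whose notation says
# it should last 0.8 s, so both phonemes are stretched by roughly 2x:
adjusted = adjust_phoneme_durations([0.1, 0.3], 0.8)
```

The relative proportions between phonemes are preserved; only the overall syllable length is forced to match the score.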

In one embodiment, the speech synthesizing the text of the song according to the target phoneme duration includes:

performing frame expansion on the phonemes of the text of the song according to the target phoneme duration to obtain target phoneme characteristics of the text after frame expansion; the target phoneme duration is the duration occupied by the target phoneme characteristics.

Inputting the theoretical fundamental frequency value and the target phoneme features into a DNN (Deep Neural Network)-based end-to-end speech synthesis model to predict the acoustic parameters of the song;

the end-to-end speech synthesis model training process is as follows:

a singing voice corpus of a certain size needs to be recorded to obtain text and speech pairs;

during training, the phoneme representation corresponding to the text is used as the input to the end-to-end model, and the acoustic parameters (fundamental frequency and spectral parameters) corresponding to the speech are used as the output. Specifically: in order to control the rhythm of the synthesized speech, the text and the speech are forced-aligned, and the phonemes of the text are frame-expanded according to the forced-alignment result (i.e., each phoneme of the text is repeated to increase the phoneme count), ensuring that the input and the output have the same number of frames. In addition, in order to control the pitch of the synthesized speech, the input includes not only the phoneme representation but also fundamental frequency information; the acoustic parameters corresponding to the speech are output, and the end-to-end model is trained on these pairs.

And reconstructing a target voice corresponding to the text of the song according to the acoustic parameters of the song.

In order to control the rhythm of the synthesized speech, the phonemes of the song's text can be frame-expanded according to the target phoneme duration to obtain the target phoneme features of the frame-expanded text. The theoretical fundamental frequency value and the target phoneme features are then input into the end-to-end speech synthesis model to predict the song's acoustic parameters, including its spectral parameters, which characterize timbre, pronunciation content, and the like. Finally, the target voice corresponding to the song's text is reconstructed from these acoustic parameters. In this way, when the song is synthesized, the acoustic parameters are used to adjust the timbre and pitch of the synthesized speech, ensuring timbre consistency and pitch suitability and improving the naturalness of singing synthesis.
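The frame-expansion step described above can be sketched as repeating each phoneme's feature vector for the number of frames given by its target duration and appending the theoretical f0 as an extra input dimension (all names here are illustrative assumptions, not the patent's implementation):

```python
import numpy as np

def frame_expand(phoneme_feats, durations_in_frames, f0):
    """Repeat each phoneme's feature vector for its target number of
    frames, then append the theoretical fundamental frequency as one
    extra column so the model input carries pitch information
    (a hypothetical sketch)."""
    expanded = np.repeat(phoneme_feats, durations_in_frames, axis=0)
    f0_col = np.full((expanded.shape[0], 1), f0)
    return np.concatenate([expanded, f0_col], axis=1)

# Two phonemes with 3-dim features, lasting 2 and 3 frames:
feats = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
out = frame_expand(feats, [2, 3], f0=440.0)
print(out.shape)  # (5, 4): 5 frames, 3 feature dims + f0
```

The expanded sequence now has exactly one input row per output frame, which is what makes the frame counts of input and output consistent during training and synthesis.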

Moreover, end-to-end speech synthesis models in the prior art generally cannot control rhythm accurately, whereas the invention adjusts phoneme durations using the numbered musical notation information, thereby accurately controlling the rhythm of the song synthesized by the end-to-end speech synthesis model.

In addition, a phoneme is a minimum voice unit divided according to natural attributes of voice, and is analyzed according to pronunciation actions in syllables, and one action constitutes one phoneme. Phonemes are divided into two major categories, vowels and consonants.

In reconstructing the target voice, the acoustic parameters of the song may be input to a vocoder, which then reconstructs the target voice. The vocoder may be Griffin-Lim, WaveNet, LPCNet, or the like.

In one embodiment, the number of phonemes of the frame-extended text is the same as the number of frames of the target speech;

the acoustic parameters include a fundamental frequency and spectral parameters.

After the frame expansion is performed, it is ensured that the number of phonemes of the frame-expanded text is the same as the number of frames of the target speech, so that the text and the target speech are aligned. The number of phonemes is the number of phonemes of the text.

In one embodiment, the determining theoretical audio features of the numbered musical notation information comprises:

determining the theoretical syllable duration d_i (in seconds) of the numbered musical notation information through a first preset formula, wherein the first preset formula is as follows:

d_i = (60 / tmpo) × dnote_i

wherein tmpo is the tempo in the numbered musical notation information, i.e. the number of beats per minute, and dnote_i is the note value, in beats, of the current syllable i;

determining the theoretical fundamental frequency value f0 of the numbered musical notation information through a second preset formula, wherein the second preset formula is as follows:

f0 = 440 × 2^(p / 12)

wherein 440 represents the frequency (in Hz) of the A above middle C (A4), and p is the distance, in semitones, between the pitch noted in the numbered musical notation information and A4.

Determining the theoretical syllable duration d_i through the first preset formula and the theoretical fundamental frequency value f0 through the second preset formula makes it possible to control the pitch, beat, and rhythm of the song during synthesis, thereby improving the accuracy and naturalness of the synthesized target voice.
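Assuming the two preset formulas are the standard tempo-to-duration relation (60/tmpo seconds per beat) and the equal-temperament relation anchored at A4 = 440 Hz, they can be written out directly (function names are illustrative):

```python
def theoretical_syllable_duration(tmpo: float, dnote_i: float) -> float:
    """First preset formula: d_i = (60 / tmpo) * dnote_i, in seconds,
    where tmpo is the tempo in beats per minute and dnote_i is the
    note value (in beats) of syllable i."""
    return 60.0 / tmpo * dnote_i

def theoretical_f0(p: float) -> float:
    """Second preset formula: f0 = 440 * 2**(p / 12), in Hz, where p is
    the signed distance in semitones from A4 (440 Hz)."""
    return 440.0 * 2.0 ** (p / 12.0)

# A one-beat syllable at 120 BPM lasts 0.5 s:
print(theoretical_syllable_duration(120, 1))  # 0.5
# One octave (12 semitones) above A4 doubles the frequency:
print(theoretical_f0(12))                     # 880.0
```

These two values per syllable are all the numbered musical notation needs to contribute: the duration anchors the rhythm, and f0 anchors the pitch fed to the synthesis model.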

Finally, it should be noted that the above embodiments may be freely combined by those skilled in the art according to actual needs.

Corresponding to the song processing method provided by the embodiment of the present invention, an embodiment of the present invention further provides a song processing apparatus, as shown in fig. 2, the apparatus includes:

an obtaining module 201, configured to obtain numbered musical notation information of a song;

a first determining module 202, configured to determine theoretical audio features of the numbered musical notation information and current phoneme features of a text of the song;

a second determining module 203, configured to determine a target phoneme duration of the song according to the current phoneme feature, the theoretical audio feature, and a phoneme duration model;

and a synthesis module 204, configured to perform speech synthesis on the text of the song according to the target phoneme duration.

In one embodiment, the theoretical audio features include theoretical syllable duration and theoretical fundamental frequency value;

the second determining module includes:

the input submodule is used for inputting the current phoneme characteristics to the phoneme duration model to obtain the current phoneme duration;

and the adjusting submodule is used for adjusting the current phoneme duration according to the theoretical syllable duration to obtain the target phoneme duration.

In one embodiment, the synthesis module comprises:

the extension submodule is used for carrying out frame extension on the phonemes of the text of the song according to the target phoneme duration to obtain target phoneme characteristics of the text after frame extension;

the prediction submodule is used for inputting the theoretical fundamental frequency value and the target phoneme characteristics into an end-to-end speech synthesis model so as to predict the acoustic parameters of the song;

and the reconstruction submodule is used for reconstructing the target voice corresponding to the text of the song according to the acoustic parameters of the song.

In one embodiment, the number of phonemes of the frame-extended text is the same as the number of frames of the target speech;

the acoustic parameters include a fundamental frequency and spectral parameters.

In one embodiment, the first determining module comprises:

a first determining submodule, configured to determine the theoretical syllable duration d_i (in seconds) of the numbered musical notation information through a first preset formula, wherein the first preset formula is as follows:

d_i = (60 / tmpo) × dnote_i

wherein tmpo is the tempo in the numbered musical notation information, i.e. the number of beats per minute, and dnote_i is the note value, in beats, of the current syllable i;

a second determining submodule, configured to determine the theoretical fundamental frequency value f0 of the numbered musical notation information through a second preset formula, wherein the second preset formula is as follows:

f0 = 440 × 2^(p / 12)

wherein 440 represents the frequency (in Hz) of the A above middle C (A4), and p is the distance, in semitones, between the pitch noted in the numbered musical notation information and A4.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
