Song generation method and device, readable medium and electronic equipment

Document No.: 1355677 | Publication date: 2020-07-24

Abstract: This disclosure, "Song generation method and device, readable medium and electronic equipment", created by Yin Xiang on 2020-03-23, relates to a song generation method and apparatus, a readable medium, and an electronic device. The method comprises: receiving target text information input by a user; determining a target song template; determining a singing duration for each character in the target text information; generating target spectrum data according to the singing duration of each character and the target song template; synthesizing target voice waveform data from the target song template and the target spectrum data; and synthesizing the target voice waveform data with the template accompaniment information of the target song template into a target song. In this way, the user need not consider the content of the original song's lyrics: any number of input characters can replace the original lyrics to generate a new song, and the vocal part of the generated song retains the vocal characteristics of the original song, so the vocals of the generated song stay close to those of the original and the song generation effect is better.

1. A song generation method, the method comprising:

receiving target character information input by a user;

determining a target song template;

determining singing duration of each character in the target character information;

generating target frequency spectrum data according to the singing duration of each character in the target character information and the target song template;

synthesizing target voice waveform data through the target song template and the target frequency spectrum data;

and synthesizing the target voice waveform data and the template accompaniment information of the target song template into a target song.

2. The method of claim 1,

the target song template also comprises template lyric information, template fundamental frequency data and template music information;

the singing duration of each character in the target character information is determined according to the template lyric information and the template music information;

the target frequency spectrum data is generated according to the singing duration of each character in the target character information and the template music information;

and synthesizing the target voice waveform data by the template fundamental frequency data and the target frequency spectrum data.

3. The method of claim 2, further comprising:

performing text analysis on the target character information to obtain phoneme information contained in each character in the target character information;

the determining the singing duration of each character in the target character information comprises:

performing character dynamic matching on the template lyric information and the target character information to obtain a corresponding relation between each character in the target character information and the character in the template lyric information;

and determining the state duration of each state in each phoneme contained in each character in the target character information according to the corresponding relation and the template music information.

4. The method of claim 3, wherein the performing character dynamic matching on the template lyric information and the target character information to obtain a corresponding relation between each character in the target character information and the character in the template lyric information comprises:

and dynamically matching characters of the template lyric information and the target character information through a first preset machine learning model to obtain the corresponding relation between each character in the target character information and the character in the template lyric information.

5. The method of claim 3 or 4, wherein dynamically matching the template lyric information and the target text information further comprises:

in the case where the matching effect between the target text information and the template lyric information is lower than the expected matching target,

adding one or more preset onomatopoeic words into the target character information, and then performing character dynamic matching on the template lyric information and the target character information added with the onomatopoeic words until the matching effect reaches the expected matching target; and/or

And repeating all characters in the target character information, and then dynamically matching the characters of the template lyric information and the repeated target character information until the matching effect reaches the expected matching target.

6. The method of claim 3 or 4, wherein dynamically matching the template lyric information and the target text information further comprises:

and under the condition that the matching effect between the target text information and the template lyric information is lower than an expected matching target, repeating the target song template, and then performing text dynamic matching on the target text information and the template lyric information in the repeated target song template until the matching effect reaches the expected matching target.

7. The method according to claim 3, wherein the determining the state duration of each state in each phoneme contained in each text in the target text information according to the correspondence and the template music information comprises:

and determining the state duration of each state in each phoneme contained in each character in the target character information according to the corresponding relation, the template lyric information and the template music information through a second preset machine learning model.

8. The method of claim 1, wherein generating target frequency spectrum data according to the singing duration of each character in the target character information and the target song template comprises:

and generating the target frequency spectrum data according to the singing duration of each character in the target character information and the target song template through a preset neural network acoustic model.

9. An apparatus for generating a song, the apparatus comprising:

the receiving module is used for receiving target character information input by a user;

the first determining module is used for determining a target song template;

the second determining module is used for determining singing duration of each character in the target character information;

the generating module is used for generating target frequency spectrum data according to the singing duration of each character in the target character information and the target song template;

the first synthesis module is used for synthesizing target voice waveform data through the target song template and the target frequency spectrum data;

and the second synthesis module is used for synthesizing the target voice waveform data and the template accompaniment information of the target song template into a target song.

10. A computer-readable medium, on which a computer program is stored, characterized in that the program, when executed by a processing device, implements the steps of the method of any one of claims 1 to 8.

11. An electronic device, comprising:

a storage device having one or more computer programs stored thereon;

one or more processing devices for executing the one or more computer programs in the storage device to implement the steps of the method of any one of claims 1-8.

Technical Field

The present disclosure relates to the field of speech synthesis technologies, and in particular, to a song generation method, an apparatus, a readable medium, and an electronic device.

Background

In the prior art, common speech synthesis schemes can only read a passage of text aloud with a human voice; there is no technical scheme that directly converts text information containing an arbitrary number of characters into a song to be sung. How to automatically replace the lyrics of a song segment with an arbitrarily input passage of text and combine that text intelligently with the song, so that the text is sung as the lyrics of that segment, is therefore a problem the prior art cannot solve.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In a first aspect, the present disclosure provides a song generation method, including:

receiving target character information input by a user;

determining a target song template;

determining singing duration of each character in the target character information;

generating target frequency spectrum data according to the singing duration of each character in the target character information and the target song template;

synthesizing target voice waveform data through the target song template and the target frequency spectrum data;

and synthesizing the target voice waveform data and the template accompaniment information of the target song template into a target song.

In a second aspect, the present disclosure provides a song generating apparatus, the apparatus comprising:

the receiving module is used for receiving target character information input by a user;

the first determining module is used for determining a target song template;

the second determining module is used for determining singing duration of each character in the target character information;

the generating module is used for generating target frequency spectrum data according to the singing duration of each character in the target character information and the target song template;

a first synthesis module for synthesizing target voice waveform data through the target song template and the target spectrum data;

and the second synthesis module is used for synthesizing the target voice waveform data and the template accompaniment information of the target song template into a target song.

In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect.

In a fourth aspect, the present disclosure provides an electronic device comprising:

a storage device having one or more computer programs stored thereon;

one or more processing devices for executing the one or more computer programs in the storage device to implement the steps of the method of the first aspect.

With this technical solution, a user who wants to generate a song does not need to consider the content of the lyrics of the original song. Any number of freely input characters can replace the original lyrics, and a new song is generated following the melody of the original song. Moreover, the vocal part of the generated song retains the vocal characteristics of the original song, so the vocals of the generated song stay close to those of the original and the song generation effect is better.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.

In the drawings:

fig. 1 is a flowchart illustrating a song generation method according to an exemplary embodiment of the present disclosure.

Fig. 2 is a flowchart illustrating a song generation method according to yet another exemplary embodiment of the present disclosure.

Fig. 3 is a block diagram illustrating a structure of a song generating apparatus according to an exemplary embodiment of the present disclosure.

FIG. 4 shows a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It is noted that references to "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting; those skilled in the art will understand that "one or more" is meant unless the context clearly indicates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

Fig. 1 is a flowchart illustrating a song generation method according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the method includes steps 101 to 106.

In step 101, target text information input by a user is received.

In step 102, a target song template is determined.

The target text information may be input by the user in any manner; for example, it may be typed directly by the user, or obtained by capturing the user's speech input and recognizing it. For the number of characters that different users may input, an upper limit and/or a lower limit can be set according to actual conditions to ensure the quality of the generated song.
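For instance, a length check of the kind described above could look like the following minimal Python sketch; the concrete bounds (2 and 50 characters) are made-up values standing in for whatever limits are set according to actual conditions.

```python
def validate_target_text(text: str, min_chars: int = 2, max_chars: int = 50) -> str:
    """Reject target text whose character count falls outside the configured bounds."""
    if not (min_chars <= len(text) <= max_chars):
        raise ValueError(f"expected {min_chars}-{max_chars} characters, got {len(text)}")
    return text
```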

The target song template may be determined by the user's selection, or a default song template may be chosen automatically if the user makes no selection. That is, when multiple songs are available, the user can pick the desired song after entering the target text information, or simply use the default song template without choosing, or, if the template selection function supports it, have one song template picked at random from all available songs as the target song template.
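As an illustration of the selection logic just described, the following Python sketch picks a template by user choice, falls back to a default, or chooses at random; the dictionary layout and function name are hypothetical.

```python
import random
from typing import Optional

def choose_target_template(templates: dict, user_choice: Optional[str] = None,
                           default_name: str = "default",
                           allow_random: bool = False):
    """Pick one song template from `templates` (name -> template data)."""
    if user_choice is not None and user_choice in templates:
        return templates[user_choice]                    # explicit user selection
    if allow_random:
        return random.choice(list(templates.values()))   # random pick from all songs
    return templates[default_name]                       # automatic default template
```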

In one possible implementation, the target song template may be a portion of the original song, for example, a refrain portion of the original song, i.e., a climax of the original song.

In step 103, the singing duration of each character in the target character information is determined.

The singing duration of each character in the target character information can be respectively determined through the target song template.

In a possible implementation, the target song template may further include template lyric information and template music information. The template lyric information may include annotation information such as the part of speech, length, and number of notes of the lyrics in the song; the template music information may include annotation information such as the melody, beat, dynamics, note durations, rhythm, tempo, bars, sections, and vibrato of the song. The singing duration of each character in the target text information can then be determined from the template lyric information and the template music information: a singing duration is allocated to each character in the target text information based on the template lyric information and template music information already present in the determined target song template. For example, the characters in the target text information may first be assigned to lyric positions according to the number of notes in the lyric information, and each character's singing duration may then be allocated according to the note durations and the number of notes it corresponds to in the template music information.
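A minimal sketch of this allocation step, under the assumption that each template lyric character is annotated with the durations of the notes it is sung over (the data layout is hypothetical, not taken from this disclosure):

```python
def assign_singing_durations(target_chars, template_note_durations):
    """template_note_durations: one list of note durations (seconds) per lyric
    character in the template; target characters reuse these slots in order."""
    durations = []
    for i, ch in enumerate(target_chars):
        notes = template_note_durations[i % len(template_note_durations)]
        durations.append((ch, sum(notes)))    # singing duration = sum of its notes
    return durations

# e.g. first character sung over one 0.5 s note, second over two 0.25 s notes
print(assign_singing_durations(["今", "天"], [[0.5], [0.25, 0.25]]))
```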

Alternatively, the singing duration of each character in the target text information can be determined by the method shown in the flowchart of fig. 2, which is described below in connection with fig. 2.

In step 104, target frequency spectrum data is generated according to the singing duration of each character in the target character information and the target song template.

After the singing duration of each character in the target text information has been determined, it can be combined with the target song template to generate the target spectrum data. When the target song template includes template music information, the target spectrum data may be generated from the template music information and the singing duration of each character in the target text information. For example, in one possible implementation, the target spectrum data may be generated from the singing duration of each character and the template music information through a preset neural network acoustic model. The preset neural network acoustic model is obtained by pre-training and may be, for example, a DNN (deep neural network) model.
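As an illustration of what such a preset acoustic model could look like, here is a minimal PyTorch sketch mapping frame-level duration/music features to MCEP frames; the feature layout, dimensions, and (omitted) training are assumptions, not details from this disclosure.

```python
import torch
import torch.nn as nn

class AcousticDNN(nn.Module):
    """Frame-wise DNN: duration/music features in, MCEP spectrum frames out."""
    def __init__(self, in_dim=64, hidden=256, mcep_dim=60):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, mcep_dim),         # one MCEP vector per frame
        )

    def forward(self, frame_features):           # (num_frames, in_dim)
        return self.net(frame_features)          # (num_frames, mcep_dim)

model = AcousticDNN()
frames = torch.randn(120, 64)                    # e.g. 120 frames of duration + music features
mcep = model(frames)                             # target spectrum data (MCEP trajectory)
```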

The target spectrum data may be, for example, MCEP (mel-cepstral coefficients).

In step 105, target voice waveform data is synthesized from the target song template and the target spectrum data. The target song template may further include template fundamental frequency data, and the target voice waveform data may be synthesized from the template fundamental frequency data and the target spectrum data. The template fundamental frequency data is the fundamental frequency data of the vocal part of the song; it may also be fundamental frequency data obtained by smoothing the fundamental frequency of the vocal part of the song.

The target speech waveform data may be wave waveform data, for example.

Because the template fundamental frequency data in the target song template can be reused in the newly generated song, the target voice waveform data can be obtained directly by synthesizing the template fundamental frequency data with the target spectrum data generated from the target text information; this target voice waveform data is the audio corresponding to the vocals of the song to be generated.

The target voice waveform data may be synthesized by a preset neural network vocoder, for example a WaveNet vocoder.
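The following self-contained numpy sketch is only a crude stand-in for that vocoder stage: it shows the principle of driving frame-wise synthesis with the template fundamental frequency and the generated spectral frames, not how a WaveNet vocoder actually works. All parameter values are illustrative assumptions.

```python
import numpy as np

def synthesize_waveform(f0_per_frame, mcep, sr=22050, frame_shift=0.0125):
    """Very rough frame-wise synthesis from template F0 and generated MCEP frames."""
    samples_per_frame = int(sr * frame_shift)
    phase, out = 0.0, []
    for f0, mc in zip(f0_per_frame, mcep):
        amp = np.exp(mc[0])                       # 0th MCEP coefficient ~ log energy
        t = np.arange(samples_per_frame)
        if f0 > 0:                                # voiced frame: sine at the template F0
            frame = amp * np.sin(2 * np.pi * f0 * t / sr + phase)
            phase += 2 * np.pi * f0 * samples_per_frame / sr
        else:                                     # unvoiced frame: low-level noise
            frame = 0.01 * amp * np.random.randn(samples_per_frame)
        out.append(frame)
    return np.concatenate(out)
```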

In step 106, the target voice waveform data and the template accompaniment information of the target song template are synthesized into a target song.

The template accompaniment information included in the target song template can be the accompaniment audio extracted from the original song, with the vocals removed. After the target voice waveform data has been determined, it is directly mixed with the template accompaniment information in the target song template, yielding the target song generated from the target text information input by the user.
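A minimal sketch of this final mixing step, assuming both signals are mono numpy arrays at the same sample rate:

```python
import numpy as np

def mix_song(vocal: np.ndarray, accompaniment: np.ndarray, vocal_gain: float = 1.0) -> np.ndarray:
    n = max(len(vocal), len(accompaniment))
    mix = np.zeros(n)
    mix[:len(accompaniment)] += accompaniment     # template accompaniment track
    mix[:len(vocal)] += vocal_gain * vocal        # synthesized vocal track
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 0 else mix        # normalize to avoid clipping
```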

With this technical solution, a user who wants to generate a song does not need to consider the content of the lyrics of the original song. Any number of freely input characters can replace the original lyrics, and a new song is generated following the melody of the original song. Moreover, the vocal part of the generated song retains the vocal characteristics of the original song, so the vocals of the generated song stay close to those of the original and the song generation effect is better.

Fig. 2 is a flowchart illustrating a song generation method according to yet another exemplary embodiment of the present disclosure. As shown in fig. 2, the method includes steps 201 to 205 in addition to steps 101, 104 and 106 as shown in fig. 1.

In step 201, a target song template is determined, wherein the target song template comprises template accompaniment information, template fundamental frequency data, template lyric information and template music information.

In step 202, text analysis is performed on the target text information to obtain the phoneme and tone information contained in each character of the target text information.

The text analysis may be performed, for example, by a pre-built text analysis module.

After the target text information input by the user is received in step 101, text analysis is performed on it to obtain the phoneme and tone information contained in each of its characters.

In practice, step 202 may be executed immediately after step 101, or at any point before the singing duration of each character is determined in step 103 of fig. 1, as long as the text analysis of the target text information is completed before the singing durations are determined.

As described above, there are various ways to determine the singing duration of each character in step 103 of fig. 1; steps 203 and 204 of this embodiment provide another such method.

In step 203, dynamic character matching is performed between the template lyric information and the target text information to obtain a correspondence between each character of the target text information and the characters of the template lyric information.

In step 204, the state duration of each state of each phoneme contained in each character of the target text information is determined according to the correspondence and the template music information.

The dynamic character matching may be performed by a first preset machine learning model, which matches the template lyric information against the target text information to obtain the correspondence between each character of the target text information and the characters of the template lyric information. The first preset machine learning model may be, for example, a Hidden Markov Model (HMM). It can dynamically match each character of the target text information to the lyric characters using the annotation information contained in the template lyric information, such as part of speech, length, and number of notes, together with the phonemes and tones contained in the target text information.

For example, if the target text information input by the user is "今天心情真超好" ("My mood today is really great") and the lyric corresponding to the template lyric information of the determined target song template is "天青色等烟雨，而我在等你" ("The azure sky awaits the misty rain, and I am waiting for you"), the result of dynamic character matching between the template lyric information and the target text information may be: "今天" matched to "天青色", "心情" matched to "等烟雨", and "真超好" matched to "而我在等你". Since the singing time of each character of the lyric in the template lyric information is fixed, once the correspondence between the target text information and the template lyric information is determined, the singing duration corresponding to each character, or group of characters, in the target text information is determined as well.
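The disclosure performs this matching with a learned model (for example an HMM); the deliberately simple Python stand-in below only shows the shape of the result, mapping each target character to a contiguous run of lyric characters whose fixed singing times it will inherit.

```python
def match_text_to_lyrics(target_chars, lyric_chars):
    """Proportional segmentation as a toy substitute for the learned dynamic matching."""
    mapping, n, m = [], len(target_chars), len(lyric_chars)
    for i, ch in enumerate(target_chars):
        start = i * m // n
        end = (i + 1) * m // n
        mapping.append((ch, lyric_chars[start:end]))   # target char -> lyric characters
    return mapping

# "今天心情真超好" matched against "天青色等烟雨而我在等你"
print(match_text_to_lyrics(list("今天心情真超好"), list("天青色等烟雨而我在等你")))
```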

Further, the fine-grained singing duration of each character in the target text information is determined from the correspondence between the target text information and the template lyric information obtained by the dynamic character matching; that is, the state duration of each state of each phoneme contained in each character is determined. Preferably, each phoneme in the present disclosure is modeled with 5 states. Once the state duration of every state of every phoneme contained in a character has been determined, the singing duration of that character is determined.

For example, if "今天" ("today") in the target text information corresponds to "天青色" in the template lyric information, and "天青色" is sung for a total of 1 second (for example, 12 frames), then, after all phonemes contained in "今天" have been obtained by text analysis, that 1 second of singing time can be allocated across the states of all phonemes contained in "今天". Once the state duration of each state of each phoneme of a character is obtained, the sum of the state durations over all states of all its phonemes is the singing duration of that character.
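A toy sketch of this distribution step follows; the uniform split and the pinyin-style phoneme names are assumptions for illustration, whereas in the disclosure the state durations come from a trained model (see below).

```python
def allocate_state_durations(phonemes, total_frames, states_per_phoneme=5):
    """Spread a character's frame budget over the states of its phonemes.
    A uniform split stands in for the model-predicted state durations."""
    slots = len(phonemes) * states_per_phoneme
    base, extra = divmod(total_frames, slots)
    allocation, slot = [], 0
    for p in phonemes:
        per_state = []
        for _ in range(states_per_phoneme):
            per_state.append(base + (1 if slot < extra else 0))
            slot += 1
        allocation.append((p, per_state))
    return allocation

# e.g. the phonemes of "今" (j + in) sharing a 12-frame budget over 2 x 5 states
print(allocate_state_durations(["j", "in"], total_frames=12))
```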

The above state durations may be determined by a second preset machine learning model. The second preset machine learning model may determine the state duration of each state of each phoneme contained in each character of the target text information according to the correspondence, the template lyric information, and the template music information. Like the first preset machine learning model, the second preset machine learning model may also be a hidden Markov model.

In step 205, target speech waveform data is synthesized from the template fundamental frequency data and the target spectrum data.

With this technical solution, after text analysis obtains the phonemes, tones, and other information of each character in the target text information, the singing duration of each character can be determined by first performing dynamic character matching and then predicting the state durations on the basis of the matching result. The dynamic character matching further improves how well the target text information fits the target song template, so the target text information combines better with the target song template and the quality of the finally generated song is improved.

In a possible implementation, the dynamic character matching between the template lyric information and the target text information in step 203 of fig. 2 may further include: when the matching effect between the target text information and the template lyric information is lower than the expected matching target, adding one or more preset onomatopoeic words to the target text information and then performing dynamic character matching again between the template lyric information and the target text information with the onomatopoeic words added, until the matching effect reaches the expected matching target; and/or repeating all characters of the target text information and then performing dynamic character matching again between the template lyric information and the repeated target text information, until the matching effect reaches the expected matching target.

The matching effect between the target text information and the template lyric information may be evaluated in any preset manner. For example, when the dynamic matching is performed through the first preset machine learning model, that model can itself judge the matching effect. If directly matching the target text information against the template lyric information gives a poor result, then whether the onomatopoeic words are added or the target text information is repeated automatically by the first preset machine learning model, or by some other means, the correspondence finally obtained between the target text information and the template lyric information makes the song generation effect better.

Adding one or more preset onomatopoeic words to the target text information may be, for example, appending the onomatopoeic word "啊" ("ah") after the target text "今天天气真超好" ("The weather today is really great"); repeating all characters of the target text information may turn the original "今天天气真超好" into "今天天气真超好今天天气真超好". Either of these two processing methods may be applied, or both.
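A hedged sketch of these two fallbacks; the simple length comparison stands in for the real "matching effect" criterion, which the disclosure leaves to the model.

```python
def pad_until_matchable(target_chars, lyric_len, filler="啊", strategy="pad"):
    """Lengthen the target text until it can plausibly cover the template lyric."""
    chars = list(target_chars)
    while chars and len(chars) < lyric_len:
        if strategy == "pad":
            chars.append(filler)               # add onomatopoeic filler words one at a time
        else:
            chars.extend(target_chars)         # repeat all original characters
    return chars

print("".join(pad_until_matchable(list("今天天气真超好"), lyric_len=11)))
```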

In a possible implementation, the dynamic character matching between the template lyric information and the target text information in step 203 of fig. 2 may further include: when the matching effect between the target text information and the template lyric information is lower than the expected matching target, repeating the target song template and then performing dynamic character matching between the target text information and the template lyric information of the repeated target song template, until the matching effect reaches the expected matching target. For example, if the target text information is "今天天气真超好，但是我想在家吹空调" ("The weather today is really great, but I want to stay home with the air conditioning") and the lyric contained in the template lyric information of the determined target song template is only "天青色等烟雨，而我在等你", the target song template can be repeated before matching: after repetition, the template lyric information contains the lyric "天青色等烟雨，而我在等你" twice, and this repeated lyric information is then dynamically matched against the target text information.

In addition, repeating the target song template is not limited to cases where the target text information contains many characters: whenever the matching effect between the target text information and the template lyric information fails to reach the expected matching target, the above methods of processing the target song template and/or the target text information can be used to improve the matching effect.

Fig. 3 is a block diagram illustrating a structure of a song generating apparatus 100 according to still another exemplary embodiment of the present disclosure. As shown in fig. 3, the apparatus 100 includes: the receiving module 10 is used for receiving target character information input by a user; a first determining module 20, configured to determine a target song template; a second determining module 30, configured to determine a singing duration of each character in the target character information; the generating module 40 is configured to generate target frequency spectrum data according to the singing duration of each character in the target character information and the target song template; a first synthesizing module 50, configured to synthesize target speech waveform data by using the target song template and the target spectrum data; and a second synthesizing module 60, configured to synthesize the target speech waveform data and the template accompaniment information of the target song template into a target song.
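Purely as an illustration of how the modules of apparatus 100 chain together, here is a loose Python skeleton; the method names and template fields are hypothetical, and only the calling order follows the text.

```python
class SongGenerator:
    """Skeleton mirroring apparatus 100; module internals are placeholders."""
    def __init__(self, templates):
        self.templates = templates                              # data for the first determining module

    def generate(self, target_text, template_name):
        template = self.templates[template_name]                # first determining module
        durations = self.determine_durations(target_text, template)  # second determining module
        mcep = self.generate_spectrum(durations, template)           # generating module
        vocal = self.synthesize_vocal(template["f0"], mcep)          # first synthesis module
        return self.mix(vocal, template["accompaniment"])            # second synthesis module

    # The methods below would wrap the sketches shown earlier in this document.
    def determine_durations(self, text, template): ...
    def generate_spectrum(self, durations, template): ...
    def synthesize_vocal(self, f0, mcep): ...
    def mix(self, vocal, accompaniment): ...
```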

With this technical solution, a user who wants to generate a song does not need to consider the content of the lyrics of the original song. Any number of freely input characters can replace the original lyrics, and a new song is generated following the melody of the original song. Moreover, the vocal part of the generated song retains the vocal characteristics of the original song, so the vocals of the generated song stay close to those of the original and the song generation effect is better.

In a possible implementation, the target song template determined by the first determining module 20 further includes template lyric information, template fundamental frequency data and template music information; the singing duration of each character in the target character information is determined by the second determining module 30 according to the template lyric information and the template music information; the target frequency spectrum data is generated by the generating module 40 according to the singing duration of each character in the target character information and the template music information; the target voice waveform data is synthesized by the first synthesis module 50 through the template fundamental frequency data and the target frequency spectrum data.

In a possible implementation, the apparatus 100 further comprises: the text analysis module is used for performing text analysis on the target character information to obtain phoneme and tone information contained in each character in the target character information; the second determination module 30 includes: the dynamic matching module is used for carrying out character dynamic matching on the template lyric information and the target character information so as to obtain the corresponding relation between each character in the target character information and the character in the template lyric information; and the duration prediction module is used for determining the state duration of each state in each phoneme contained in each character in the target character information according to the corresponding relation and the template music information.

In a possible implementation manner, the dynamic matching module is further configured to perform character dynamic matching on the template lyric information and the target character information through a first preset machine learning model, so as to obtain a correspondence between each character in the target character information and a character in the template lyric information.

In a possible implementation manner, the dynamic matching module is further configured to, when the matching effect between the target text information and the template lyric information is lower than the expected matching target, add one or more preset onomatopoeic words to the target text information and then perform dynamic character matching again between the template lyric information and the target text information with the onomatopoeic words added, until the matching effect reaches the expected matching target; and/or repeat all characters in the target text information and then perform dynamic character matching again between the template lyric information and the repeated target text information, until the matching effect reaches the expected matching target.

In a possible implementation manner, the dynamic matching module is further configured to, when a matching effect between the target text information and the template lyric information is lower than an expected matching target, repeat the target song template and then perform text dynamic matching on the target text information and the template lyric information in the repeated target song template again until the matching effect reaches the expected matching target.

In a possible implementation manner, the duration prediction module is further configured to determine, through a second preset machine learning model, the state duration of each state in each phoneme included in each text in the target text information according to the correspondence, the template lyric information, and the template music information.

In a possible implementation manner, the generating module 40 is further configured to generate the target frequency spectrum data according to the singing duration of each character in the target character information and the target song template through a preset neural network acoustic model.

Referring now to FIG. 4, a block diagram of an electronic device 400 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 4, electronic device 400 may include a processing device (e.g., central processing unit, graphics processor, etc.) 401 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage device 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the electronic apparatus 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.

In general, the following may be provided: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 407 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 408 including, for example, magnetic tape, hard disk, etc.; and communication devices 409, which may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. Although fig. 4 illustrates the electronic device 400 with various means, it should be understood that not all of the illustrated means are required to be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 409, or from the storage device 408, or from the ROM 402. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 401.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future developed network.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: receiving target character information input by a user; determining a target song template; determining singing duration of each character in the target character information; generating target frequency spectrum data according to the singing duration of each character in the target character information and the target song template; synthesizing target voice waveform data through the target song template and the target frequency spectrum data; and synthesizing the target voice waveform data and the template accompaniment information of the target song template into the target song.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including but not limited to an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of the module does not constitute a limitation to the module itself in some cases, and for example, the receiving module may be further described as a "module that receives target text information input by a user".

For example, without limitation, exemplary types of hardware logic that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), and so forth.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Example 1 provides a song generation method according to one or more embodiments of the present disclosure, the method including: receiving target character information input by a user; determining a target song template; determining singing duration of each character in the target character information; generating target frequency spectrum data according to the singing duration of each character in the target character information and the target song template; synthesizing target voice waveform data through the target song template and the target frequency spectrum data; and synthesizing the target voice waveform data and the template accompaniment information of the target song template into a target song.

Example 2 provides the method of example 1, the target song template further including template lyric information, template fundamental frequency data, and template music information; the singing duration of each character in the target character information is determined according to the template lyric information and the template music information; the target frequency spectrum data is generated according to the singing duration of each character in the target character information and the template music information; and synthesizing the target voice waveform data by the template fundamental frequency data and the target frequency spectrum data.

Example 3 provides the method of example 2, further comprising, in accordance with one or more embodiments of the present disclosure: performing text analysis on the target character information to obtain phoneme and tone information contained in each character in the target character information; the determining the singing duration of each character in the target character information comprises: performing character dynamic matching on the template lyric information and the target character information to obtain a corresponding relation between each character in the target character information and the character in the template lyric information; and determining the state duration of each state in each phoneme contained in each character in the target character information according to the corresponding relation and the template music information.

Example 4 provides the method of example 3, and the performing dynamic word matching on the template lyric information and the target word information to obtain a correspondence between each word in the target word information and a word in the template lyric information includes: and dynamically matching characters of the template lyric information and the target character information through a first preset machine learning model to obtain the corresponding relation between each character in the target character information and the character in the template lyric information.

Example 5 provides the method of example 3 or example 4, the dynamically text matching the template lyric information and the target text information further comprising: under the condition that the matching effect between the target text information and the template lyric information is lower than an expected matching target, adding one or more preset onomatopoeic words into the target text information, and then performing character dynamic matching on the template lyric information and the target text information added with the onomatopoeic words again until the matching effect reaches the expected matching target; and/or repeating all characters in the target text information, and then performing character dynamic matching on the template lyric information and the repeated target text information again until the matching effect reaches the expected matching target.

Example 6 provides the method of example 3 or example 4, the dynamically text matching the template lyric information and the target text information further comprising: and under the condition that the matching effect between the target text information and the template lyric information is lower than an expected matching target, repeating the target song template, and then performing text dynamic matching on the target text information and the template lyric information in the repeated target song template until the matching effect reaches the expected matching target.

Example 7 provides the method of example 3, wherein the determining a state duration of each state in each phoneme included in each text in the target text information according to the correspondence and the template music information includes: and determining the state duration of each state in each phoneme contained in each character in the target character information according to the corresponding relation, the template lyric information and the template music information through a second preset machine learning model.

Example 8 provides the method of example 1, and the generating target spectrum data according to the singing duration of each text in the target text information and the template music information includes: and generating the target frequency spectrum data according to the singing duration of each character in the target character information and the template music information through a preset neural network acoustic model.

Example 9 provides, in accordance with one or more embodiments of the present disclosure, a song generation apparatus, the apparatus comprising: the receiving module is used for receiving target character information input by a user; the first determining module is used for determining a target song template; the second determining module is used for determining singing duration of each character in the target character information; the generating module is used for generating target frequency spectrum data according to the singing duration of each character in the target character information and the target song template; the first synthesis module is used for synthesizing target voice waveform data through the target song template and the target frequency spectrum data; and the second synthesis module is used for synthesizing the target voice waveform data and the template accompaniment information of the target song template into a target song.

Example 10 provides the apparatus of example 9, the target song template determined by the first determination module further including template lyric information, template fundamental frequency data, and template music information; the singing time length of each character in the target character information is determined by the second determining module according to the template lyric information and the template music information; the target frequency spectrum data is generated by the generating module according to the singing duration of each character in the target character information and the template music information; the target voice waveform data is synthesized by the first synthesis module through the template fundamental frequency data and the target frequency spectrum data.

Example 11 provides the apparatus of example 10, in accordance with one or more embodiments of the present disclosure, further comprising: the text analysis module is used for performing text analysis on the target character information to obtain phoneme and tone information contained in each character in the target character information; the second determining module includes: the dynamic matching module is used for carrying out character dynamic matching on the template lyric information and the target character information so as to obtain the corresponding relation between each character in the target character information and the character in the template lyric information; and the duration prediction module is used for determining the state duration of each state in each phoneme contained in each character in the target character information according to the corresponding relation and the template music information.

Example 12 provides the apparatus of example 11, wherein the dynamic matching module is further configured to perform dynamic character matching on the template lyric information and the target character information through a first preset machine learning model to obtain a correspondence between each character in the target character information and a character in the template lyric information.

Example 13 provides the apparatus of example 11 or 12, wherein the dynamic matching module is further configured to, in a case that the matching effect between the target text information and the template lyric information is lower than an expected matching target, add one or more preset onomatopoeic words to the target text information and then perform character dynamic matching on the template lyric information and the target text information with the onomatopoeic words added again until the matching effect reaches the expected matching target; and/or repeat all characters in the target text information, and then perform character dynamic matching on the template lyric information and the repeated target text information again until the matching effect reaches the expected matching target.

Example 14 provides the apparatus of example 11 or 12, wherein the dynamic matching module is further configured to repeat the target song template and then perform dynamic character matching on the target text information and the template lyric information in the repeated target song template again until the matching effect reaches the expected matching target, in a case that the matching effect between the target text information and the template lyric information is lower than the expected matching target.

Example 15 provides the apparatus of example 11, wherein the duration prediction module is further configured to determine the state duration for each state in each phoneme included in each text in the target text information according to the correspondence, the template lyric information, and the template music information through a second preset machine learning model.

Example 16 provides the apparatus of example 9, and the generating module is further configured to generate the target frequency spectrum data according to a singing duration of each text in the target text information and the target song template through a preset neural network acoustic model.

Example 17 provides a computer-readable medium, on which a computer program is stored, according to one or more embodiments of the present disclosure, characterized in that the program, when executed by a processing device, implements the steps of the method of any one of examples 1-8.

Example 18 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising: a storage device having one or more computer programs stored thereon; one or more processing devices for executing the one or more computer programs in the storage device to implement the steps of the method of any of examples 1-8.

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to the particular combinations of features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by interchanging the above features with (but not limited to) features disclosed in this disclosure having similar functions.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
