Music data generation method, device, equipment and storage medium

Document No.: 88015    Publication date: 2021-10-08

Reading note: This technology, "Music data generation method, device, equipment and storage medium", was designed and created by Wang Zhenyu on 2021-06-25. Its main content is as follows: The disclosure relates to a music data generation method, apparatus, device and storage medium in the field of artificial intelligence. The embodiments of the disclosure at least solve the problems in the related art that the technical threshold for generating rap music is high and that the generated rap music sounds poor. The method comprises: acquiring original video data and a preset resource template, the resource template comprising the number of character strings, a first duration and a stress level for each accompaniment clip; generating lyrics according to the resource characteristics of the original video data and the number of character strings of each accompaniment clip, and generating voice data with a timbre characteristic based on the lyrics, the first duration of each character string in the lyrics, the stress level of each character string and a preset timbre characteristic, the voice data being used to play each character string in the lyrics according to the corresponding first duration and stress level; and merging the voice data and the accompaniment data to generate music data corresponding to the original video data.

1. A method of generating music data, comprising:

acquiring original video data and a preset resource template; the resource template comprises the number of character strings, first duration and stress level of each accompaniment clip in preset accompaniment data; the first duration is the number of frames occupied by the character string in the accompaniment data, and the stress level is the stress level of the character string in the accompaniment data;

generating lyrics corresponding to the resource characteristics of the original video according to the resource characteristics of the original video data and the number of character strings of each accompaniment fragment in the accompaniment data; lyric fragments in the lyrics correspond to accompaniment fragments in the accompaniment data one by one, and the number of character strings of each lyric fragment is equal to that of the character strings of the corresponding accompaniment fragments;

generating voice data with the tone color characteristics based on the lyrics, the first duration of each character string in the lyrics, the accent level of each character string and preset tone color characteristics; the voice data is used for playing each character string in the lyrics according to the corresponding first duration and the stress level;

and combining the voice data and the accompaniment data to generate music data corresponding to the original video data.

2. The method for generating music data according to claim 1, wherein the generating of the voice data having the timbre characteristic based on the lyrics, the first duration of each character string in the lyrics, the accent level of each character string, and a preset timbre characteristic comprises:

determining phonemes included in each character string in the lyrics and a pitch of each phoneme;

determining a second duration of each phoneme in the accompaniment data and a first energy value of each phoneme; the sum of the second durations of all phonemes in each character string is the first duration of that character string; the first energy value of each phoneme is the energy value of the phoneme in the accompaniment data, and the first energy value of each phoneme is positively correlated with the stress level of the character string in which the phoneme is located;

generating the voice data according to each phoneme, the pitch of each phoneme, the second duration of each phoneme, the first energy value of each phoneme and the timbre characteristic.

3. The method of generating music data according to claim 2, wherein said determining a second duration of each phoneme in the accompaniment data and a first energy value of each phoneme comprises:

for a first character string, determining a third duration of a phoneme in the first character string and a second energy value of the phoneme in the first character string; the first character string is any character string in the lyrics; the third duration of each phoneme is the number of frames of each phoneme in the character string of the lyrics; the second energy value of each phoneme is an energy value of each phoneme in a character string of the lyric;

determining the second duration of the phoneme in the first character string according to the third duration of the phoneme in the first character string and the first duration of the first character string;

determining the first energy value of a phoneme in the first string based on the second energy value of a phoneme in the first string and the accent level of the first string.

4. The method for generating music data according to claim 3, wherein said determining the second duration of the phoneme in the first string according to the third duration of the phoneme in the first string and the first duration of the first string comprises:

determining the ratio of the sum of the third duration of the phonemes in the first character string to the first duration of the first character string as the adjustment ratio of the first character string;

and respectively adjusting the third duration of the phonemes in the first character string based on the adjustment proportion to obtain the second duration of the phonemes in the first character string.

5. The method of generating music data according to claim 2, wherein said generating the speech data based on each phoneme, the pitch of each phoneme, the second duration of each phoneme, the first energy value of each phoneme, and the timbre characteristic includes:

for a first phoneme, generating a frame feature group corresponding to the first phoneme according to the second duration of the first phoneme, the pitch of the first phoneme, the first energy value of the first phoneme and the timbre feature; the first phoneme is any phoneme in the lyrics; the frame feature group corresponding to each phoneme comprises a plurality of frame features, and the number of the plurality of frame features corresponds to the second duration of the phoneme; each frame feature corresponding to a phoneme comprises the phoneme, the pitch of the phoneme, the first energy value of the phoneme and the timbre feature;

and combining the frame feature groups respectively corresponding to all phonemes in the lyrics to obtain a voice spectrum parameter corresponding to the lyrics, and converting the voice spectrum parameter into the voice data.

6. The method of generating music data according to claim 5, wherein said generating a frame feature set corresponding to the first phoneme according to the second duration of the first phoneme, the pitch of the first phoneme, the first energy value of the first phoneme, and the timbre feature comprises:

determining the number of frames occupied by the second duration of the first phoneme as the number of the plurality of frame features corresponding to the first phoneme;

generating a plurality of frame features corresponding to the first phoneme based on the number of the plurality of frame features corresponding to the first phoneme, the pitch of the first phoneme, the first energy value of the first phoneme, and the timbre feature; each of the plurality of frame features corresponding to the first phoneme comprises a position index; the position index is used for identifying the position of the frame feature in the frame feature group;

and combining the plurality of frame features corresponding to the first phoneme to obtain a frame feature group corresponding to the first phoneme.

7. A generation apparatus of music data is characterized by comprising an acquisition unit, a generation unit and a merging unit;

the acquisition unit is used for acquiring original video data and a preset resource template; the resource template comprises the number of character strings, first duration and stress level of each accompaniment clip in preset accompaniment data; the first duration is the number of frames occupied by the character string in the accompaniment data, and the stress level is the stress level of the character string in the accompaniment data;

the generating unit is used for generating lyrics corresponding to the resource characteristics of the original video according to the resource characteristics of the original video data and the number of character strings of each accompaniment fragment in the accompaniment data; lyric fragments in the lyrics correspond to accompaniment fragments in the accompaniment data one by one, and the number of character strings of each lyric fragment is equal to that of the character strings of the corresponding accompaniment fragments;

the generating unit is further used for generating voice data with the tone characteristic based on the lyrics, the first duration of each character string in the lyrics, the accent level of each character string and a preset tone characteristic; the voice data is used for playing each character string in the lyrics according to the corresponding first duration and the stress level;

the merging unit is used for merging the voice data and the accompaniment data to generate music data corresponding to the original video data.

8. An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the method of generating music data of any one of claims 1-6.

9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of generating music data according to any one of claims 1 to 6.

10. A computer program product, characterized in that it comprises computer instructions which, when run on an electronic device, cause the electronic device to carry out the method of generating music data according to any one of claims 1 to 6.

Technical Field

The present disclosure relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a device, and a storage medium for generating music data.

Background

Rap music makes heavy use of rhyming techniques and dynamic beats, which gives listeners a strong sense of impact. In the related art, artificial intelligence (AI) and speech synthesis technologies are commonly applied to the creation of rap music. Specifically, a rap creation module matches appropriate background music to the rap lyrics input by the user; the module then converts the rap lyrics into a speech spectrum and attaches the converted spectrum to the background music to generate the rap track.

However, in the above creation process the user must write the rap lyrics and input them into the creation module, which presumes a certain rap background. Moreover, because the creation module simply attaches the speech spectrum of the lyrics onto the background music, the lyrics do not follow the rhythm of the background music well, so the generated rap music sounds mechanical and fails to meet users' expectations.

Disclosure of Invention

The present disclosure provides a method, an apparatus, a device, and a storage medium for generating music data, so as to at least solve the problems in the related art that the technical threshold for generating rap music is high and the effect of the generated rap music is not good. The technical scheme of the disclosure is as follows:

according to a first aspect of the embodiments of the present disclosure, there is provided a method for generating music data, including: acquiring original video data and a preset resource template; the resource template comprises the number of character strings, the first time length and the stress level of each accompaniment clip in preset accompaniment data; the first duration is the frame number occupied by the character string in the accompaniment data, and the stress level is the stress level of the character string in the accompaniment data; generating lyrics corresponding to the resource characteristics of the original video according to the resource characteristics of the original video data and the number of character strings of each accompaniment fragment in the accompaniment data; lyric fragments in the lyrics correspond to accompaniment fragments in the accompaniment data one by one, and the number of character strings of each lyric fragment is equal to that of the character strings of the corresponding accompaniment fragments; generating voice data with tone color characteristics based on the lyrics, the first duration of each character string in the lyrics, the accent level of each character string and preset tone color characteristics; the voice data is used for playing each character string in the lyrics according to the corresponding first duration and the stress level; and combining the voice data and the accompaniment data to generate music data corresponding to the original video data.

Optionally, the "generating voice data with a timbre characteristic based on the lyrics, the first duration of each character string in the lyrics, the accent level of each character string, and a preset timbre characteristic" includes: determining phonemes included in each character string in the lyrics and the tone of each phoneme; determining a second duration of each phoneme in the accompaniment data and a first energy value for each phoneme; the sum of the second durations of all phonemes in each character string is the first duration of each character string; the first energy value of each phoneme is the energy value of each phoneme in the accompaniment data, and the first energy value of each phoneme is positively correlated with the stress level of the character string in which each phoneme is positioned;

and generating the voice data according to each phoneme, the tone of each phoneme, the second duration of each phoneme, the first energy value of each phoneme and the tone characteristic.

Optionally, the "determining the second duration of each phoneme in the accompaniment data and the first energy value of each phoneme" includes: for the first character string, determining a third duration of a phoneme in the first character string and a second energy value of the phoneme in the first character string; the first character string is any character string in the lyrics; the third duration of each phoneme is the number of frames each phoneme occupies in the character string of the lyrics; the second energy value of each phoneme is the energy value of each phoneme in the character string of the lyrics; determining a second duration of the phoneme in the first character string according to the third duration of the phoneme in the first character string and the first duration of the first character string; the first energy value of the phoneme in the first character string is determined based on the second energy value of the phoneme in the first character string and the accent level of the first character string.

Optionally, the "determining the second duration of the phoneme in the first character string according to the third duration of the phoneme in the first character string and the first duration of the first character string" includes: determining the ratio of the sum of the third duration of the phonemes in the first character string to the first duration of the first character string as the adjustment ratio of the first character string; and respectively adjusting the third duration of the phonemes in the first character string based on the adjustment proportion to obtain the second duration of the phonemes in the first character string.

Optionally, the "generating the speech data according to each phoneme, the pitch of each phoneme, the second duration of each phoneme, the first energy value of each phoneme, and the timbre characteristic" includes: for a first phoneme, generating a frame feature group corresponding to the first phoneme according to a second duration of the first phoneme, a tone of the first phoneme, a first energy value of the first phoneme and a tone feature; the first phoneme is any phoneme in the lyrics; the frame feature group corresponding to each phoneme comprises a plurality of frame features, and the number of the plurality of frame features corresponds to the second duration of each phoneme; each frame feature corresponding to each phoneme comprises each phoneme, the tone of each phoneme, a first energy value of each phoneme and a tone feature; and combining the frame feature groups respectively corresponding to all phonemes in the lyrics to obtain the voice spectrum parameters corresponding to the lyrics, and converting the voice spectrum parameters into voice data.

Optionally, the "generating a frame feature group corresponding to the first phoneme according to the second duration of the first phoneme, the pitch of the first phoneme, the first energy value of the first phoneme, and the timbre feature" includes: determining the number of frames occupied by the second duration of the first phoneme, wherein the number of the frame features corresponding to the first phoneme is the number of the frame features; generating a plurality of frame features corresponding to the first phoneme based on the number of the plurality of frame features corresponding to the first phoneme, the tone of the first phoneme, the first energy value of the first phoneme and the tone feature; each frame feature comprises a position index in a plurality of frame features corresponding to the first phoneme; the position index is used for identifying the position of the frame feature in the frame feature group; and combining the plurality of frame features corresponding to the first phoneme to obtain a frame feature group corresponding to the first phoneme.

Optionally, after the "combining the voice data and the accompaniment data to generate the music data corresponding to the original video data", the method further includes: and combining the music data and the original video data to generate target video data corresponding to the original video data.

According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for generating music data, including an acquisition unit, a generating unit and a merging unit. The acquisition unit is used for acquiring original video data and a preset resource template; the resource template comprises the number of character strings, the first duration and the stress level of each accompaniment clip in preset accompaniment data; the first duration is the number of frames occupied by the character string in the accompaniment data, and the stress level is the stress level of the character string in the accompaniment data. The generating unit is used for generating lyrics corresponding to the resource characteristics of the original video data according to the resource characteristics of the original video data and the number of character strings of each accompaniment fragment in the accompaniment data; lyric fragments in the lyrics correspond to accompaniment fragments in the accompaniment data one by one, and the number of character strings of each lyric fragment is equal to the number of character strings of the corresponding accompaniment fragment. The generating unit is further used for generating voice data with a timbre characteristic based on the lyrics, the first duration of each character string in the lyrics, the stress level of each character string and a preset timbre characteristic; the voice data is used for playing each character string in the lyrics according to the corresponding first duration and stress level. The merging unit is used for merging the voice data and the accompaniment data to generate music data corresponding to the original video data.

Optionally, the generating unit is specifically further configured to: determining phonemes included in each character string in the lyrics and the tone of each phoneme; determining a second duration of each phoneme in the accompaniment data and a first energy value for each phoneme; the sum of the second durations of all phonemes in one character string is the first duration of the character string; the first energy value of one phoneme is the energy value of one phoneme in the accompaniment data, and the first energy value of one phoneme is positively correlated with the accent level of the character string in which the phoneme is positioned; and generating the voice data according to each phoneme, the tone of each phoneme, the second duration of each phoneme, the first energy value of each phoneme and the tone characteristic.

Optionally, the generating unit is specifically further configured to: for the first character string, determining a third duration of a phoneme in the first character string and a second energy value of the phoneme in the first character string; the first character string is any character string in the lyrics; the third duration of a phoneme is the number of frames of the phoneme in the character string of the lyric; the second energy value of a phoneme is the energy value of a phoneme in the character string of the lyric; determining a second duration of the phoneme in the first character string according to the third duration of the phoneme in the first character string and the first duration of the first character string; the first energy value of the phoneme in the first character string is determined based on the second energy value of the phoneme in the first character string and the accent level of the first character string.

Optionally, the generating unit is specifically further configured to: determining the ratio of the sum of the third duration of the phonemes in the first character string to the first duration of the first character string as the adjustment ratio of the first character string; and respectively adjusting the third duration of the phonemes in the first character string based on the adjustment proportion to obtain the second duration of the phonemes in the first character string.

Optionally, the generating unit is specifically configured to: for a first phoneme, generating a frame feature group corresponding to the first phoneme according to the first phoneme, the second duration of the first phoneme, the tone of the first phoneme, the first energy value of the first phoneme and the tone feature; the first phoneme is any phoneme in the lyrics; the frame feature group corresponding to one phoneme comprises a plurality of frame features, and the number of the plurality of frame features corresponds to the second duration of one phoneme; each frame feature corresponding to one phoneme comprises one phoneme, a tone of the phoneme, a first energy value of the phoneme and a tone feature; and combining the frame feature groups respectively corresponding to all phonemes in the lyrics to obtain the voice spectrum parameters corresponding to the lyrics, and converting the voice spectrum parameters into voice data.

Optionally, the generating unit is further specifically configured to: determining the number of frames occupied by the second duration of the first phoneme, wherein the number of the frame features corresponding to the first phoneme is the number of the frame features; generating a plurality of frame features corresponding to the first phoneme based on the number of the plurality of frame features corresponding to the first phoneme, the tone of the first phoneme, the first energy value of the first phoneme and the tone feature; each frame feature comprises a position index in a plurality of frame features corresponding to the first phoneme; the position index is used for identifying the position of the frame feature in the frame feature group; and combining the plurality of frame features corresponding to the first phoneme to obtain a frame feature group corresponding to the first phoneme.

Optionally, the merging unit is further configured to merge the music data and the original video data after the music data are merged to generate the target video data corresponding to the original video data.

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor, a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method of generating music data as provided by the first aspect.

According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions of the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method for generating music data as provided by the first aspect and any one of its possible design manners.

According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the method of generating music data as provided by the first aspect and any one of its possible designs.

The technical solutions provided by the present disclosure bring at least the following beneficial effects: by acquiring the original video data and the preset resource template, lyrics matching the resource characteristics of the original video data can be generated for the user. Because the resource template contains the first duration and stress level of each character string in the accompaniment data, every character string of the rap voice in the generated voice data conforms to the first duration and stress level required by the accompaniment data and has the preset timbre characteristic. The voice data and the accompaniment data can then be merged into music data. In this way, the user only needs to provide video data and does not need any musical background, which lowers the threshold of rap creation, while the generated voice data follows the melody and drumbeat rhythm of the accompaniment data and the preset timbre characteristic. Since the generated music data contains both the voice data and the accompaniment data, it better meets users' needs.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.

FIG. 1 is a block diagram illustrating a multimedia system according to an exemplary embodiment;

fig. 2 is a first flowchart illustrating a music data generation method according to an exemplary embodiment;

fig. 3 is a second flowchart illustrating a music data generating method according to an exemplary embodiment;

fig. 4 is a third flowchart illustrating a music data generating method according to an exemplary embodiment;

fig. 5 is a fourth flowchart illustrating a music data generation method according to an exemplary embodiment;

fig. 6 is a fifth flowchart illustrating a music data generation method according to an exemplary embodiment;

fig. 7 is a sixth flowchart illustrating a music data generation method according to an exemplary embodiment;

fig. 8 is a seventh flowchart illustrating a music data generation method according to an exemplary embodiment;

fig. 9 is a schematic structural diagram showing a music data generation apparatus according to an exemplary embodiment;

fig. 10 is a schematic structural diagram of an electronic device according to an exemplary embodiment.

Detailed Description

In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

In addition, in the description of the embodiments of the present disclosure, unless otherwise specified, "/" indicates "or"; for example, A/B may indicate A or B. "And/or" herein merely describes an association between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, in the description of the embodiments of the present disclosure, "a plurality" means two or more.

The music data generation method provided by the embodiments of the present disclosure can be applied to a multimedia system. Fig. 1 shows a schematic structural diagram of the multimedia system. As shown in fig. 1, the multimedia system 10 is used to solve the problems in the related art that the threshold for generating rap music is high and the effect of the generated rap music is poor. The multimedia system 10 includes a music data generating device (hereinafter simply referred to as the generating device for convenience of description) 11 and an electronic device 12. The generating device 11 is connected to the electronic device 12. The generating device 11 and the electronic device 12 may be connected in a wired or wireless manner, which is not limited in the embodiments of the present disclosure.

It should be noted that the multimedia system according to the embodiments of the present disclosure may be applied in multiple scenarios. For example, in a first scenario, the generating device 11 may be a user device and the electronic device 12 may be a server. In a second scenario, the generating device 11 may be a unit or module with the corresponding functionality, and the electronic device 12 may be a user equipment. In the following description of the embodiments of the present disclosure, the multimedia system and the music data generation method of the present disclosure are described by taking the second scenario as an example.

The generating means 11 may be used for data interaction with the electronic device 12, for example, the generating means 11 may be used for receiving video data transmitted by the electronic device and transmitting generated music data to the electronic device.

The generating device 11 may also execute the music data generation method in the embodiments of the present disclosure, for example, processing the received video data to obtain music data in which the voice data matches the melody and drumbeats of the accompaniment data.

The electronic device 12 captures video data, or receives video data transmitted by other similar devices.

Illustratively, the electronic device 12 includes a shooting module and a communication module. The shooting module is used for shooting and collecting video data. The communication module is used for data interaction with the generating device 11.

In the second scenario, the generating device 11 and the electronic device 12 may be independent devices or may be integrated in the same device, and the disclosure is not limited thereto.

When the generating device 11 and the electronic device 12 are integrated in the same device, the communication mode between the generating device 11 and the electronic device 12 is communication between internal modules of the device. In this case, the communication flow between the two is the same as "the communication flow between the generating apparatus 11 and the electronic device 12 when they are independent of each other".

In the following embodiments provided by the present disclosure, the present disclosure is explained by taking an example in which the generating device 11 and the electronic apparatus 12 are provided independently of each other.

In practical applications, the method for generating music data provided by the embodiment of the present disclosure may be applied to a generating device, and may also be applied to an electronic device.

As shown in fig. 2, the method for generating music data provided by the embodiment of the present disclosure includes the following steps S201 to S204.

S201, the generating device acquires original video data and a preset resource template.

The resource template comprises the number of character strings, the first time length and the stress level of each accompaniment clip in preset accompaniment data. The first duration is the number of frames occupied by the character string in the accompaniment data, and the stress level is the stress level of the character string in the accompaniment data.

As a possible implementation, the generating apparatus may receive raw video data sent by an electronic device or other similar devices.

As another possible implementation, the generating device itself has a capture or shooting function, and the raw video data can be obtained by capture.

It should be noted that the original video data may be uploaded to the generating device by the user. The original video data may or may not include a human voice signal.

Note that the resource template may be stored in advance in the memory of the generation apparatus. The preset accompaniment data may be accompaniment data designated or selected by the user. The accent levels of a string may be classified into 0-n levels, where level 0 indicates that the string does not require accenting, and the accent of the string is heavier as the accent level increases. Background music can be further included in the resource template, and the accompaniment data can be segments or parts of the background music. The accompaniment data includes a plurality of accompaniment clips, each of which may include at least one beat. In each accompaniment clip, the first duration of each character string in the accompaniment clip and the stress level of each character string in the accompaniment clip are preset.

A character string in the embodiments of the present disclosure may specifically be a character in the subsequently acquired lyrics. For example, when the lyrics include Chinese, a character string may be any Chinese character or word; when the lyrics include English, a character string may be any English word.

The number of character strings of an accompaniment clip is the preset number of character strings that the accompaniment clip can contain.

The disclosed embodiments illustrate an example relating to a resource template, for example, the resource template may be [ aaa, bbb, ccc, ddd, eee, fff, … … ], where aaa is an identifier of the background music where the accompaniment data is located, bbb is a start time of the accompaniment data in the background music, ccc is a first character string, ddd is a first duration of the first character string, eee is an accent level of the first character string, fff is a second character string in the lyrics, and so on.
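For concreteness, the following is a minimal sketch, in Python, of one possible in-memory layout for such a resource template; all class and field names are hypothetical illustrations rather than the disclosure's own data format.

```python
# Hypothetical layout for the resource template described above.
from dataclasses import dataclass
from typing import List

@dataclass
class StringSlot:
    first_duration: int   # number of frames the string occupies in the accompaniment
    stress_level: int     # 0 = no stress; higher values mean heavier stress

@dataclass
class AccompanimentClip:
    slots: List[StringSlot]            # one slot per character string in this clip

    @property
    def string_count(self) -> int:     # "number of character strings" of the clip
        return len(self.slots)

@dataclass
class ResourceTemplate:
    background_music_id: str           # identifier of the background music ("aaa")
    start_time_s: float                # start time of the accompaniment in it ("bbb")
    clips: List[AccompanimentClip]

# Example: one accompaniment clip that holds two character strings.
template = ResourceTemplate(
    background_music_id="bgm_001",
    start_time_s=12.5,
    clips=[AccompanimentClip(slots=[StringSlot(40, 2), StringSlot(35, 0)])],
)
```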

S202, the generating device generates lyrics corresponding to the resource characteristics of the original video according to the resource characteristics of the original video data and the number of character strings of each accompaniment fragment in the accompaniment data.

The lyric fragments in the lyrics correspond to the accompaniment fragments in the accompaniment data one by one, and the number of character strings of each lyric fragment is equal to that of the character strings of the corresponding accompaniment fragments.

As a possible implementation, the generating device performs frame extraction and OCR processing on the original video data to determine resource characteristics of the original video data.

It should be noted that the resource characteristics of the original video data can be used to reflect the content, scene, theme, and other information in the original video data.

Further, the generating device inputs the determined resource features and the number of character strings of each accompaniment clip into a preset first neural network to obtain the lyrics corresponding to the resource features.

It should be noted that the first neural network is a supervised model. During training, the input (student) set of each training sample consists of preset resource features of video data and the numbers of character strings of different accompaniment clips, and the target (teacher) set consists of preset lyrics corresponding to those resource features.

In one case, to ensure that the lyrics determined via the first neural network are rhymed, the lyrics in the training sample also have a corresponding rhyme effect.

Illustratively, the first neural network may be a GPT-3 model.
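As an illustration only, the lyric-generation step might be wrapped as in the following sketch; the feature extractor and the model's generate() interface are assumptions, since the disclosure only requires that the first neural network map resource features and per-clip string counts to lyrics.

```python
# Hedged sketch of S202; extract_resource_features and model.generate are placeholders.
from typing import List

def extract_resource_features(video_path: str) -> List[str]:
    """Placeholder for frame extraction + OCR: returns keywords describing
    the content, scene and theme of the original video."""
    raise NotImplementedError

def generate_lyrics(model, features: List[str], string_counts: List[int]) -> List[str]:
    """One lyric fragment per accompaniment clip, each containing exactly the
    number of character strings required by the resource template."""
    prompt = f"keywords: {', '.join(features)}; fragment lengths: {string_counts}"
    fragments = model.generate(prompt)   # assumed interface of the first neural network
    # Each fragment should contain the requested number of character strings.
    assert all(len(frag) == n for frag, n in zip(fragments, string_counts))
    return fragments
```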

S203, the generating device generates voice data with tone color characteristics based on the lyrics, the first duration of each character string in the lyrics, the accent level of each character string and preset tone color characteristics.

The voice data is used for playing each character string in the lyrics according to the corresponding first time length and the corresponding accent level.

As a possible implementation, the generating means processes the lyrics accordingly to obtain the phonemes included in each character string of the lyrics and the pitch of each phoneme.

It should be noted that each character string includes at least one phoneme.

Further, the generating means predicts a duration of each phoneme in each character string, and an energy value of each phoneme in the character string. Wherein, the energy value of a phoneme is used for reflecting the phonetic feature of the phoneme.

Further, the generating means adjusts the duration of each phoneme in the character string according to the first duration in the resource template to obtain the duration of the phoneme in the accompaniment data. Meanwhile, the generating device also adjusts the energy value of the phoneme in the character string according to the stress level of each character string in the resource template so as to obtain the energy value of each phoneme in the accompaniment data.

Finally, the generating means generates the speech data based on the duration of each phoneme determined in the accompaniment data, the energy value of each phoneme in the accompaniment data, and the timbre characteristics.

The preset timbre feature may be preset in the generating device, or may be selected by the user in the generating device. The specific implementation of this step is described later in this disclosure and is not repeated here.

S204, the generating device combines the voice data and the accompaniment data to generate music data corresponding to the original video data.

As a possible implementation, the generating device acquires the accompaniment data based on the identifier of the background music and the start time of the accompaniment data in the background music, both recorded in the resource template, and merges the voice data and the accompaniment data based on a preset data synthesis algorithm to obtain the music data corresponding to the original video data.

The specific implementation of merging the speech data and the accompaniment data in this step can refer to the description in the prior art, and is not repeated here.
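As a rough illustration of the merging step, assuming both signals are mono waveforms at the same sampling rate, a simple overlay could look like the following sketch; a production data synthesis algorithm may additionally handle gain staging, alignment and limiting.

```python
# Minimal sketch of merging the synthesized voice with the accompaniment (S204).
import numpy as np

def merge_voice_and_accompaniment(voice: np.ndarray,
                                  accompaniment: np.ndarray,
                                  voice_gain: float = 1.0,
                                  acc_gain: float = 0.6) -> np.ndarray:
    length = max(len(voice), len(accompaniment))
    mix = np.zeros(length, dtype=np.float32)
    mix[:len(voice)] += voice_gain * voice
    mix[:len(accompaniment)] += acc_gain * accompaniment
    return np.clip(mix, -1.0, 1.0)   # keep the overlay within the valid amplitude range
```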

The technical solution provided by this embodiment brings at least the following beneficial effects: by acquiring the original video data and the preset resource template, lyrics matching the resource characteristics of the original video data can be generated for the user. Because the resource template contains the first duration and stress level of each character string in the accompaniment data, every character string of the rap voice in the generated voice data conforms to the first duration and stress level required by the accompaniment data and has the preset timbre characteristic. The voice data and the accompaniment data can then be merged into music data. In this way, the user only needs to provide video data and does not need any musical background, which lowers the threshold of rap creation, while the generated voice data follows the melody and drumbeat rhythm of the accompaniment data and the preset timbre characteristic. Since the generated music data contains both the voice data and the accompaniment data, it better meets users' needs.

In one design, to enable generation of voice data, as shown in fig. 3, S203 provided in the embodiments of the present disclosure specifically includes the following S301 to S303.

S301, the generating device determines phonemes included in each character string of the lyrics and a pitch of each phoneme.

As a possible implementation manner, the generating device may decompose the lyrics according to a preset function module to obtain phonemes included in each character string of the lyrics and a pitch of each phoneme.

For example, the preset function modules may be a TN (text normalization) module and a ZhuYin (phonetic notation) module.

Wherein the TN module is used to determine the pronunciation of each character string in the lyrics (handling, for example, polyphonic characters and the spoken form of digits), and then determine the phonemes according to the pronunciation. The ZhuYin module is used to match a tone to each determined phoneme.

The specific implementation of the TN module and the phonetic notation module in this step may refer to the description in the prior art, and will not be described herein again.
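Purely as an illustration of the output such modules produce, the following sketch maps a character string to its phonemes and tone using a tiny hard-coded lookup table; a real system would rely on full text normalization and phonetic annotation.

```python
# Illustrative stand-in for the TN + ZhuYin modules; the lookup table is hypothetical.
PRONUNCIATION_TABLE = {
    "好": (("h", "ao"), 3),    # phonemes and tone of the character
    "唱": (("ch", "ang"), 4),
}

def to_phonemes_and_tone(char: str):
    phonemes, tone = PRONUNCIATION_TABLE[char]
    return [(p, tone) for p in phonemes]   # each phoneme carries the tone/pitch

print(to_phonemes_and_tone("好"))   # -> [('h', 3), ('ao', 3)]
```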

S302, the generating device determines a second duration of each phoneme in the accompaniment data and a first energy value of each phoneme.

And the sum of the second durations of all phonemes in one character string is the first duration of one character string. The first energy value of one phoneme is the energy value of one phoneme in the accompaniment data, the first energy value of one phoneme is used for representing the phonetic feature of one phoneme in the accompaniment data, and the first energy value of one phoneme is positively correlated with the stress level of the character string in which the phoneme is positioned.

As a possible implementation, the generating means determines the duration of all phonemes in the lyrics in the character string of the lyrics and the energy values of all phonemes in the character string of the lyrics based on all phonemes in the lyrics, the pitch of all phonemes, and a preset second neural network.

Further, the generating device determines the second duration of each phoneme in the accompaniment data based on the duration of the phoneme in the character string of the lyrics and the first duration of that character string.

Meanwhile, the generating device determines the first energy value of each phoneme in the accompaniment data according to the energy value of the phoneme in the character string of the lyrics and the stress level of that character string.

For a specific implementation of this step, reference may be made to the subsequent description of the embodiment of the present disclosure, and details are not repeated here.

S303, generating means generates voice data according to each phoneme, the tone of each phoneme, the second duration of each phoneme, the first energy value of each phoneme and the tone color characteristics.

As a possible implementation, the generating device generates the speech spectrum parameters corresponding to the lyrics according to a preset third neural network, the phonemes in the lyrics, the second duration of each phoneme, the pitch of each phoneme, the first energy value of each phoneme and the timbre characteristic, and converts the speech spectrum parameters into sample-level voice data.

The specific implementation manner of this step may refer to the subsequent description of the embodiment of the present disclosure, and is not described herein again.

The technical solution provided by this embodiment brings at least the following beneficial effects: the character strings in the lyrics are decomposed into phonemes, the smallest unit of pronunciation, and a corresponding second duration and first energy value are configured for each phoneme in the lyrics, so that the generated voice data fits the melody and drumbeats of the accompaniment data more closely.

In one design, in order to determine the second duration of each phoneme in the lyric in the accompaniment data and the first energy value of each phoneme, as shown in fig. 4, S302 provided by the embodiment of the present disclosure may specifically include the following S3021 to S3023.

S3021, for the first character string, the generating device determines a third duration of the phoneme in the first character string and a second energy value of the phoneme in the first character string.

Wherein, the first character string is any character string in the lyrics. The third duration of a phoneme is the number of frames the phoneme occupies in the character string of the lyrics. The second energy value of a phoneme is the energy value of the phoneme in the character string of the lyrics.

As a possible implementation, the generating device may input the phonemes included in the lyrics and the pitch of each phoneme into a preset second neural network, and determine the third duration and the second energy value of each phoneme from the result output by the second neural network.

It should be noted that the second neural network is a supervised training model. In the training process of the second neural network, the student set in the training sample is any phoneme and the tone of the phoneme, and the teacher set in the training sample is the number of frames the phoneme occupies in the character string and the energy value of the phoneme in the character string.

In one case, the generating means further performs a one-hot (onehot) encoding and normalization on the result output by the second neural network model to obtain a third duration for each phoneme and a second energy value for each phoneme, respectively.

It is understood that the second energy value of any phoneme is a normalized value between 0 and 1.

And S3022, the generating device determines the second duration of the phoneme in the first character string according to the third duration of the phoneme in the first character string and the first duration of the first character string.

As a possible implementation, the generating device may determine the sum of the third durations of all the phonemes in the first character string as the default duration of the first character string, and determine the second duration of each phoneme in the first character string based on this default duration and the first duration of the first character string in the accompaniment data.

The specific implementation manner in this step may refer to the subsequent description of the embodiment of the present disclosure, and is not described herein again.

S3023, the generating device determines the first energy value of the phoneme in the first character string based on the second energy value of the phoneme in the first character string and the accent level of the first character string.

As a possible implementation manner, the generating device adjusts the second energy value of the phoneme in the first character string according to the accent level of the first character string to obtain the first energy value of the phoneme in the first character string.

For example, for any phoneme, if the stress level of the character string in which the phoneme is located is 2, the preset unit energy value per stress level is 0.1, and the second energy value of the phoneme is 0.3, then after the adjustment the first energy value of the phoneme in the accompaniment data is 0.3 + 2 × 0.1 = 0.5.
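The adjustment in this example can be sketched as follows; the per-level unit energy value is a preset constant assumed here for illustration.

```python
# Minimal sketch of the stress adjustment in S3023 (0.3 + 2 * 0.1 = 0.5).
import math

UNIT_ENERGY_PER_STRESS_LEVEL = 0.1   # assumed preset unit energy per stress level

def first_energy(second_energy: float, stress_level: int) -> float:
    # Raise the phoneme's normalized energy in proportion to the stress level
    # of the character string the phoneme belongs to.
    return second_energy + stress_level * UNIT_ENERGY_PER_STRESS_LEVEL

assert math.isclose(first_energy(0.3, 2), 0.5)
```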

In practical application, S3022 may be executed first and then S3023, or S3023 may be executed first and then S3022, or S3022 and S3023 may be executed at the same time; this is not limited in the embodiments of the present disclosure.

The technical scheme provided by the embodiment at least has the following beneficial effects: based on the third duration and the second energy value of the phoneme in the character string, the determined second duration and the determined first energy value are more accurate and are more suitable for the melody and the drum point of the accompaniment data.

In one design, to determine the second duration of the phoneme in the first string in the accompaniment data, as shown in fig. 5, S3022 provided by the present disclosure includes, in particular, the following S401-S402.

S401, the generating device determines the ratio of the sum of the third duration of the phonemes in the first character string to the first duration of the first character string as the adjustment ratio of the first character string.

As a possible implementation manner, the generating device determines the sum of the third durations of the phonemes in the first character string as a default duration of the first character string, and determines a ratio of the default duration of the first character string to the first duration of the first character string as an adjustment ratio of the first character string.

Illustratively, consider the character string "good" (i.e., the Chinese character 好, pronounced "hao"), whose first duration in the accompaniment data is 0.4 milliseconds (ms) and whose phonemes are "h" and "ao". The third duration of the phoneme "h" is 0.2 ms and the third duration of the phoneme "ao" is 0.3 ms, so the default duration of the string "good" is 0.5 ms. The adjustment ratio of the string "good" is therefore 0.5 ms / 0.4 ms = 1.25.

S402, the generating device respectively adjusts the third duration of the phonemes in the first character string based on the adjustment proportion to obtain the second duration of the phonemes in the first character string.

As a possible implementation manner, the generating device determines a ratio of the third duration of the phoneme in the first character string to the adjustment ratio as the second duration of the phoneme in the first character string.

Taking the above character string "good" as an example, with an adjustment ratio of 1.25, the generating device may determine that the second duration of the phoneme "h" is 0.2 ms / 1.25 = 0.16 ms, and the second duration of the phoneme "ao" is 0.3 ms / 1.25 = 0.24 ms.

Thus, for the character string "good", the sum of the second durations of the phonemes "h" and "ao" is 0.4 ms, which is the same as the first duration of the character string.
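The proportional adjustment of S401-S402 can be sketched as follows, using the numbers from this example.

```python
# Sketch of the duration rescaling: third durations 0.2 and 0.3 for "h" and "ao"
# are scaled so that their sum equals the string's first duration of 0.4.
from typing import List

def rescale_durations(third_durations: List[float], first_duration: float) -> List[float]:
    ratio = sum(third_durations) / first_duration   # adjustment ratio, e.g. 0.5 / 0.4 = 1.25
    return [d / ratio for d in third_durations]     # second durations

print(rescale_durations([0.2, 0.3], 0.4))           # approximately [0.16, 0.24]
```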

The technical scheme provided by the embodiment at least has the following beneficial effects: the first time length of the character string can be more accurately distributed to different phonemes through the ratio of the sum of the third time lengths of the phonemes in the character string to the first time length of the character string, and the second time length of each phoneme can be more accurately determined.

In one design, to enable generation of voice data, as shown in fig. 6, S303 provided in the embodiments of the present disclosure specifically includes the following S501-S503.

S501, for the first phoneme, the generating device generates a frame feature group corresponding to the first phoneme according to the first phoneme, the second duration of the first phoneme, the pitch of the first phoneme, the first energy value of the first phoneme, and the timbre feature.

Wherein, the first phoneme is any phoneme in the lyrics. The frame feature set corresponding to one phoneme comprises a plurality of frame features, and the number of the plurality of frame features corresponds to the second duration of one phoneme. Each frame feature corresponding to one phoneme comprises one phoneme, the tone of one phoneme, the first energy value of one phoneme and the tone color feature.

As a possible implementation manner, the generating device determines the number of frame features corresponding to the first phoneme according to the second duration of the first phoneme.

Further, the generating device generates a plurality of frame features corresponding to the first phoneme according to the number of the plurality of frame features corresponding to the first phoneme, the pitch of the first phoneme, the first energy value of the first phoneme, and the timbre feature.

Further, the generating device combines the plurality of frame features corresponding to the first phoneme to obtain a frame feature group corresponding to the first phoneme.

For example, the frame feature set corresponding to the first phoneme may be a first matrix. Each row in the first matrix is a frame feature corresponding to the first phoneme. Any frame feature corresponding to the first phoneme comprises the first phoneme, the tone of the first phoneme, the first energy value of the first phoneme and the tone feature.

S502, the generating device combines the frame feature groups corresponding to all phonemes in the lyrics to obtain the voice spectrum parameters corresponding to the lyrics.

As a possible implementation manner, the generating device combines the frame feature groups corresponding to all phonemes in the lyric to obtain a combination result of the frame feature groups.

Illustratively, the lyric includes 10 phonemes, and each phoneme corresponds to a frame feature set that is a matrix of 100 × 4. Therefore, the generating device combines all the frame feature groups to obtain a 1000 × 4 matrix, which is used to predict the speech spectrum parameters corresponding to the lyrics.

Further, the generating device inputs the merged result of the frame feature groups into the third neural network, and the third neural network processes the merged result to obtain the speech spectrum parameters corresponding to the lyrics.

It should be noted that the speech spectrum parameters may be in a Linear Predictive Coding (LPC) format or a mel-spectrogram format.

For example, the third neural network may be a predictive model composed of a plurality of convolutional layers. When the sampling rate of the voice data is 16 kHz, the generating device can use the third neural network to process the 1000 × 4 matrix and predict a speech spectrum parameter matrix of size 1000 × 80, which meets the sampling-rate requirement of the voice data.
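The stacking in this example can be sketched as follows; the call to the third neural network is an assumed interface, shown only to indicate the expected input and output shapes.

```python
# Sketch of S502: 10 phonemes x (100, 4) frame feature groups -> a (1000, 4) matrix,
# which the "third neural network" would map to a (1000, 80) mel-style spectrum.
import numpy as np

def build_frame_matrix(frame_feature_groups):
    # Concatenate the per-phoneme groups along the time (frame) axis.
    return np.concatenate(frame_feature_groups, axis=0)

frame_matrix = build_frame_matrix([np.zeros((100, 4), dtype=np.float32) for _ in range(10)])
print(frame_matrix.shape)                           # (1000, 4)
# spectrum = third_neural_network(frame_matrix)     # assumed to return a (1000, 80) matrix
```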

S503, the generating device converts the voice spectrum parameters into voice data.

As a possible implementation, the generating device may input the voice spectrum parameters into a neural vocoder such as WaveRNN or LPCNet to convert the voice spectrum parameters into voice data.

In this step, the implementation manner of converting the voice spectrum parameter into the voice data may refer to the description in the prior art, and is not described herein again.
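Illustratively, the sketch below converts mel-format voice spectrum parameters into a waveform. A neural vocoder such as WaveRNN or LPCNet would normally perform this step; here librosa's Griffin-Lim based mel inversion is used as a simple stand-in, purely for illustration. The 1000 × 80 shape and the 16 kHz sampling rate follow the example above, while the FFT size and hop length are assumed values.

```python
import numpy as np
import librosa
import soundfile as sf

# 1000 x 80 mel-format voice spectrum parameters (frames x mel bins);
# random values are used here purely as a placeholder.
spectrum = np.abs(np.random.randn(1000, 80)).astype(np.float32)

# Griffin-Lim mel inversion as a stand-in for a neural vocoder
# (WaveRNN / LPCNet would normally be used for this conversion).
waveform = librosa.feature.inverse.mel_to_audio(
    spectrum.T,        # librosa expects (n_mels, n_frames)
    sr=16000,          # sampling rate from the example above
    n_fft=1024,        # assumed analysis settings
    hop_length=200,    # 12.5 ms hop at 16 kHz (assumed)
)

sf.write("voice.wav", waveform, 16000)
```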

The technical solution provided by this embodiment has at least the following beneficial effects: it provides a specific implementation that generates voice data from the phonemes in the lyrics, the pitch of each phoneme, the second duration of each phoneme, the first energy value of each phoneme, and the timbre feature, and that is capable of converting frame-level data into sample-level voice data.

In one design, in order to generate the frame feature set corresponding to the first phoneme, as shown in fig. 7, the foregoing S501 provided by the embodiment of the present disclosure specifically includes the following S5011 to S5013.

S5011, the generating device determines the number of frames occupied by the second duration of the first phoneme; this number is the number of frame features corresponding to the first phoneme.

Illustratively, if the second duration of a phoneme occupies 100 speech frames, the number of frame features corresponding to the phoneme is 100.

S5012, the generating device generates a plurality of frame features corresponding to the first phoneme based on the number of the plurality of frame features corresponding to the first phoneme, the pitch of the first phoneme, the first energy value of the first phoneme, and the timbre feature.

Each frame feature of the plurality of frame features corresponding to the first phoneme includes a position index. The position index is used to identify the position of the frame feature in the set of frame features.

As a possible implementation manner, the generating device generates position indexes equal in number to the frame features corresponding to the first phoneme, and then generates that number of frame features according to the first phoneme, the pitch of the first phoneme, the first energy value of the first phoneme, the timbre feature, and the position indexes.

Illustratively, for any frame feature [ aa, bb, cc, dd, ee ] corresponding to the first phoneme, aa represents the first phoneme, bb represents the pitch of the first phoneme, cc represents the first energy value of the first phoneme, dd represents the identification of the timbre feature, and ee represents the position index of the frame feature within the frame feature group corresponding to the first phoneme. Taking 100 frame features corresponding to the first phoneme as an example, the position index takes values from 0 to 99.

S5013, the generating device combines the plurality of frame features corresponding to the first phoneme to obtain a frame feature group corresponding to the first phoneme.

Illustratively, taking the number of the frame features corresponding to the first phoneme as 100 as an example, the frame feature group corresponding to the first phoneme is a matrix with a size of 100 × 5.

It can be understood that, in the 100 × 5 matrix of the frame feature group corresponding to the first phoneme, the values in the first 4 columns are identical across all rows; only the position index in the 5th column differs from row to row.
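Illustratively, such a 100 × 5 frame feature group can be built as in the sketch below; the concrete phoneme id, pitch, energy, and timbre id values are placeholders and not values prescribed by the disclosure.

```python
import numpy as np

def build_frame_feature_group(phoneme_id, pitch, energy, timbre_id, num_frames):
    """Build a (num_frames x 5) frame feature group: the first 4 columns repeat
    the phoneme-level values, and the 5th column is the position index."""
    base = np.array([phoneme_id, pitch, energy, timbre_id], dtype=np.float32)
    group = np.tile(base, (num_frames, 1))                    # num_frames x 4
    position_index = np.arange(num_frames, dtype=np.float32).reshape(-1, 1)
    return np.hstack([group, position_index])                 # num_frames x 5

# A phoneme whose second duration occupies 100 frames (placeholder values).
group = build_frame_feature_group(phoneme_id=12, pitch=220.0, energy=0.8,
                                  timbre_id=3, num_frames=100)
print(group.shape)           # (100, 5)
print(group[0], group[-1])   # identical first 4 columns, indexes 0 and 99
```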

The technical solution provided by this embodiment has at least the following beneficial effects: different position indexes can be set for different frame features in the frame feature group, which makes the generated frame feature group more accurate; as a result, the generated voice data and the accompaniment data are equal in length and aligned in time.

In one design, in order to improve the user experience, as shown in fig. 8, the music data generation method provided by the embodiment of the present disclosure further includes, after S204, the following S205.

S205, the generating device combines the music data and the original video data to generate target video data corresponding to the original video data.

The technical solution provided by this embodiment has at least the following beneficial effects: on the basis of generating the music data, the music data can be combined with the original video data to obtain the target video data. Since the target video data includes both the original video data and the music data, the user experience is further improved.
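Illustratively, S205 can be implemented by muxing the generated music track with the original video stream, for example with ffmpeg invoked from Python as sketched below. The file names and the choice of ffmpeg are assumptions for illustration; the disclosure does not prescribe a specific tool.

```python
import subprocess

def merge_music_with_video(video_path, music_path, output_path):
    """Mux the generated music data with the original video data: the video
    stream is copied unchanged and the audio is taken from the music file."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,     # original video data
        "-i", music_path,     # generated music data
        "-map", "0:v:0",      # video stream from the first input
        "-map", "1:a:0",      # audio stream from the second input
        "-c:v", "copy",       # do not re-encode the video
        "-shortest",          # stop at the shorter of the two streams
        output_path,
    ], check=True)

merge_music_with_video("original.mp4", "music.wav", "target.mp4")
```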

Fig. 9 is a schematic structural diagram showing a music data generation apparatus according to an exemplary embodiment. Referring to fig. 9, a generation apparatus 60 of music data provided by the embodiment of the present disclosure includes an acquisition unit 601, a generation unit 602, and a merging unit 603.

The obtaining unit 601 is configured to obtain original video data and a preset resource template. The resource template comprises the number of character strings, the first duration and the stress level of each accompaniment clip in preset accompaniment data. The first duration is the number of frames occupied by the character string in the accompaniment data, and the stress level is the stress level of the character string in the accompaniment data. For example, as shown in fig. 2, the obtaining unit 601 may be configured to execute S201.

The generating unit 602 is configured to generate lyrics corresponding to the resource feature of the original video according to the resource feature of the original video data and the number of character strings of each accompaniment segment in the accompaniment data. Lyric fragments in the lyrics correspond to accompaniment fragments in the accompaniment data one by one, and the number of character strings of each lyric fragment is equal to that of the character strings of the corresponding accompaniment fragments. For example, as shown in fig. 2, the generating unit 602 may be configured to execute S202.

The generating unit 602 is further configured to generate voice data with timbre characteristics based on the lyrics, the first duration of each character string in the lyrics, the accent level of each character string, and preset timbre characteristics. The voice data is used for playing each character string in the lyrics according to the corresponding first duration and the stress level. For example, as shown in fig. 2, the generating unit 602 may be configured to execute S203.

The merging unit 603 is configured to merge the speech data and the accompaniment data to generate music data corresponding to the original video data. For example, as shown in fig. 2, the merging unit 603 may be configured to execute S204.

Optionally, as shown in fig. 9, the generating unit 602 provided in the embodiment of the present disclosure is specifically further configured to:

the phonemes included in each string of words of the lyrics and the pitch of each phoneme are determined. For example, as shown in fig. 3, the generating unit 602 may be configured to perform S301.

A second duration of each phoneme in the accompaniment data is determined, as well as a first energy value of each phoneme. The sum of the second durations of all phonemes in a character string is the first duration of that character string. The first energy value of a phoneme is the energy value of the phoneme in the accompaniment data, and it is positively correlated with the accent level of the character string in which the phoneme is located. For example, as shown in fig. 3, the generating unit 602 may be configured to perform S302.

And the voice data is generated according to each phoneme, the pitch of each phoneme, the second duration of each phoneme, the first energy value of each phoneme, and the timbre feature. For example, as shown in fig. 3, the generating unit 602 may be configured to perform S303.

Optionally, as shown in fig. 9, the generating unit 602 provided in the embodiment of the present disclosure is specifically further configured to:

for the first string, a third duration of phonemes in the first string and a second energy value of phonemes in the first string are determined. The first character string is any character string in the lyrics. The third duration of a phoneme is the number of frames a phoneme occupies in the string of lyrics. The second energy value of a phoneme is the energy value of a phoneme in the string of lyrics. For example, as shown in fig. 4, the generating unit 602 may be configured to execute S3021.

And determining the second duration of the phoneme in the first character string according to the third duration of the phoneme in the first character string and the first duration of the first character string. For example, as shown in fig. 4, the generating unit 602 may be configured to execute S3022.

The first energy value of the phoneme in the first character string is determined based on the second energy value of the phoneme in the first character string and the accent level of the first character string. For example, as shown in fig. 4, the generating unit 602 may be configured to execute S3023.
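Illustratively, the exact mapping from accent level to energy is not restated here; the sketch below assumes a simple multiplicative gain that grows with the accent level, purely to illustrate the positive correlation between the first energy value and the accent level of the character string. The gain table and values are hypothetical.

```python
# Hypothetical gains: a higher accent level yields a larger gain.
ACCENT_GAIN = {0: 1.0, 1: 1.2, 2: 1.5}

def first_energy_values(second_energy_values, accent_level):
    """Scale each phoneme's second energy value by a gain that grows with the
    accent level of the character string containing the phoneme."""
    gain = ACCENT_GAIN[accent_level]
    return [e * gain for e in second_energy_values]

# Phonemes of one character string with accent level 2 (placeholder values).
print(first_energy_values([0.4, 0.6], 2))  # [0.6, 0.9] (up to float rounding)
```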

Optionally, as shown in fig. 9, the generating unit 602 provided in the embodiment of the present disclosure is specifically further configured to:

and determining the ratio of the sum of the third duration of the phonemes in the first character string to the first duration of the first character string as the adjustment ratio of the first character string. For example, as shown in fig. 5, the generating unit 602 may be configured to execute S401.

And respectively adjusting the third duration of the phonemes in the first character string based on the adjustment proportion to obtain the second duration of the phonemes in the first character string. For example, as shown in fig. 5, the generating unit 602 may be configured to execute S402.

Optionally, as shown in fig. 9, the generating unit 602 provided in the embodiment of the present disclosure is specifically configured to:

and for the first phoneme, generating a frame feature group corresponding to the first phoneme according to the first phoneme, the second duration of the first phoneme, the tone of the first phoneme, the first energy value of the first phoneme and the tone feature. The first phoneme is any phoneme in the lyrics. The frame feature set corresponding to one phoneme comprises a plurality of frame features, and the number of the plurality of frame features corresponds to the second duration of one phoneme. Each frame feature corresponding to one phoneme comprises one phoneme, the tone of one phoneme, the first energy value of one phoneme and the tone color feature. For example, as shown in fig. 6, the generating unit 602 may be configured to perform S501.

And combining the frame feature groups respectively corresponding to all phonemes in the lyrics to obtain the voice spectrum parameters corresponding to the lyrics, and converting the voice spectrum parameters into voice data. For example, as shown in fig. 6, the generating unit 602 may be configured to perform S502-S503.

Optionally, as shown in fig. 9, the generating unit 602 provided in the embodiment of the present disclosure is further specifically configured to:

and determining the number of frames occupied by the second duration of the first phoneme, wherein the number of the frame features corresponding to the first phoneme is the number of the frame features. For example, as shown in fig. 7, the generating unit 602 may be configured to perform S5011.

And generating a plurality of frame features corresponding to the first phoneme based on the number of frame features corresponding to the first phoneme, the pitch of the first phoneme, the first energy value of the first phoneme, and the timbre feature. Each of the frame features corresponding to the first phoneme includes a position index, which identifies the position of the frame feature in the frame feature group. For example, as shown in fig. 7, the generating unit 602 may be configured to perform S5012.

And combining the plurality of frame features corresponding to the first phoneme to obtain a frame feature group corresponding to the first phoneme. For example, as shown in fig. 7, the generating unit 602 may be configured to perform S5013.

Optionally, as shown in fig. 9, the merging unit 603 provided in the embodiment of the present disclosure is further configured to, after the music data is generated, merge the music data and the original video data to generate target video data corresponding to the original video data. For example, as shown in fig. 8, the merging unit 603 may be configured to perform S205.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Fig. 10 is a schematic structural diagram of an electronic device provided by the present disclosure. As shown in fig. 10, the electronic device 70 may include at least one processor 701 and a memory 703 for storing processor-executable instructions. Wherein the processor 701 is configured to execute instructions in the memory 703 to implement the generation method of music data in the above-described embodiments.

Additionally, electronic device 70 may also include a communication bus 702 and at least one communication interface 704.

The processor 701 may be a central processing unit (CPU), a micro-processing unit, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs of the disclosed solution.

The communication bus 702 may include a path that conveys information between the aforementioned components.

The communication interface 704, using any transceiver or the like, may be used to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).

The memory 703 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and connected to the processor 701 via the communication bus 702, or it may be integrated with the processor 701.

The memory 703 is used for storing instructions for executing the disclosed solution, and is controlled by the processor 701. The processor 701 is configured to execute instructions stored in the memory 703 to implement the functions of the disclosed method.

As an example, in connection with fig. 9, the functions implemented by the acquisition unit 601, the generation unit 602, and the merging unit 603 in the generation apparatus 60 of music data are the same as those of the processor 701 in fig. 10.

In a particular implementation, as an example, the processor 701 may include one or more CPUs, such as CPU0 and CPU1 in fig. 10.

In a particular implementation, as an embodiment, the electronic device 70 may include multiple processors, such as the processor 701 and the processor 707 in fig. 10. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).

In particular implementations, electronic device 70 may also include an output device 705 and an input device 706, as one embodiment. An output device 705 is in communication with the processor 701 and may display information in a variety of ways. For example, the output device 705 may be a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display device, a Cathode Ray Tube (CRT) display device, a projector (projector), or the like. The input device 706 communicates with the processor 701 and may accept input from a user in a variety of ways. For example, the input device 706 may be a mouse, a keyboard, a touch screen device, or a sensing device, among others.

Those skilled in the art will appreciate that the configuration shown in fig. 10 is not limiting of electronic device 70 and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components may be used.

In addition, the present disclosure also provides a computer-readable storage medium, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the generation method of music data provided as the above embodiment.

In addition, the present disclosure also provides a computer program product comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the method of generating music data as provided in the above embodiments.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
