Animation generation method, system, medium and electronic terminal

Document No.: 1923226    Publication date: 2021-12-03

Note: This technology, "Animation generation method, system, medium and electronic terminal" (一种动画生成方法、系统、介质及电子终端), was designed and created by 骆晋豪 and 刘鹏 on 2021-09-09. Its main content is as follows: the invention provides an animation generation method, system, medium and electronic terminal, wherein the method comprises the following steps: acquiring lens element information by collecting a story text; inputting the lens element information into an animation layer generation model for layer generation to obtain one or more animation layers, the layer generation step comprising background layer matching, text lexical analysis, text emotion analysis, text-to-speech conversion and lip sequence generation; and generating an animation by performing synthesis rendering on the plurality of animation layers. In the method of the invention, the lens element information is input into the animation layer generation model to perform background layer matching, text lexical analysis, text emotion analysis, text-to-speech conversion and lip sequence generation, one or more animation layers are obtained, and an animation is generated by performing synthesis rendering on the plurality of animation layers, which reduces the animation creation difficulty for common users or novice animation creators, with a low threshold, low cost and a high degree of automation.

1. An animation generation method, comprising:

acquiring lens element information by collecting a story text;

inputting the lens element information into an animation layer generation model for layer generation to obtain one or more animation layers;

the layer generation step comprises: background layer matching, text lexical analysis, text emotion analysis, text-to-speech conversion and lip sequence generation;

and generating the animation by performing synthesis rendering on the plurality of animation layers.

2. The animation generation method according to claim 1, wherein the step of obtaining the animation layer generation model includes:

acquiring a training set;

inputting the training texts in the training set into a neural network for training to obtain an animation layer, wherein the neural network comprises: a first convolutional neural subnetwork for performing text lexical analysis, a long short-term memory subnetwork for performing text emotion analysis, a deep neural subnetwork for performing text-to-speech conversion and a second convolutional neural subnetwork for performing lip sequence generation;

and training the neural network according to the animation layer output by the neural network and a preset loss function to obtain an animation layer generation model.

3. The animation generation method according to claim 1, wherein the animation layers include: a background layer, the lens element information includes: scene description and character information, and the step of obtaining the background layer includes:

extracting keywords from the scene description and/or the character information to obtain keywords;

matching the keywords with scene resource labels in a preset digital asset management center to obtain a first matching result, wherein the scene resource labels and the scene resources have a first corresponding relation;

and acquiring corresponding scene resources according to the first matching result and the first corresponding relation, and taking the corresponding scene resources as a background layer of the animation to complete background layer matching.

4. The animation generation method according to claim 1, wherein the animation layers further include: a special effect layer, the lens element information further includes: dialogue content and voice-over content, and the step of obtaining the special effect layer includes:

inputting the dialogue content and/or the voice-over content into a long short-term memory (LSTM) sub-network of the animation layer generation model, and performing emotion label matching according to the context of the dialogue content and/or the voice-over content to obtain a corresponding emotion label;

determining a corresponding emotion tendency according to the emotion label, wherein the emotion tendency includes: negative and positive, thereby completing the text emotion analysis;

respectively matching the emotion labels and the emotion tendencies with special effect resource labels in a preset digital asset management center to obtain a second matching result, wherein the special effect resource labels and the special effect resources have a second corresponding relation;

and acquiring corresponding special effect resources according to the second matching result and the second corresponding relation, and taking the corresponding special effect resources as a special effect layer of the animation.

5. The animation generation method according to claim 4, wherein the animation layers further include: a character action layer, and the step of obtaining the character action layer includes:

performing word segmentation on the dialogue content and/or the voice-over content to obtain one or more segmented words;

inputting the segmented words into a first convolutional neural subnetwork of the animation layer generation model for entity recognition and part-of-speech classification to obtain an entity recognition result and a part-of-speech classification result, wherein the entity recognition types include: name, organization name, address and time, thereby completing the text lexical analysis;

acquiring corresponding character model resources according to the character information;

acquiring corresponding action resources according to the entity recognition result, the part-of-speech classification result, the emotion label and the emotion tendency;

and synthesizing the character action layer according to the character model resources and the action resources.

6. The animation generation method according to claim 4, wherein the animation layers further include: a character expression layer, and the step of obtaining the character expression layer includes:

matching the emotion labels and the emotion tendencies with expression resource labels in a preset digital asset management center respectively to obtain a third matching result, wherein the expression resource labels and the expression resources have a third corresponding relation;

acquiring corresponding expression resources according to the third matching result and the third corresponding relation;

and synthesizing the character expression layer according to the acquired character model resources and the expression resources.

7. The animation generation method according to claim 4, wherein the animation layers further include: a character lip layer, and the step of obtaining the character lip layer includes:

inputting the dialogue content and/or the voice-over content into a deep neural subnetwork of the animation layer generation model for voice conversion to obtain a corresponding voice sequence;

performing prosody adjustment on the voice sequence according to the emotion label and the emotion tendency to acquire voice audio, thereby completing the text-to-speech conversion;

inputting the voice audio into a second convolution neural sub-network of the animation layer generation model to perform voice feature extraction, and acquiring voice features;

acquiring a corresponding lip sequence according to the voice characteristics to complete the generation of the lip sequence;

and synthesizing the character lip layer according to the lip sequence and the acquired character model resources.

8. An animation generation system, comprising:

the element information acquisition module is used for acquiring lens element information by collecting a story text;

the animation layer synthesis module is used for inputting the lens element information into an animation layer generation model for layer generation to obtain one or more animation layers; the layer generation step comprises: background layer matching, text lexical analysis, text emotion analysis, text-to-speech conversion and lip sequence generation;

the animation generation module is used for generating an animation by performing synthesis rendering on the plurality of animation layers; the element information acquisition module, the animation layer synthesis module and the animation generation module are connected.

9. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.

10. An electronic terminal, comprising: a processor and a memory;

the memory is for storing a computer program and the processor is for executing the computer program stored by the memory to cause the terminal to perform the method of any of claims 1 to 7.

Technical Field

The present invention relates to the field of animation production, and in particular, to an animation generation method, system, medium, and electronic terminal.

Background

With the increase of network bandwidth and the popularization of mobile terminals, animation has gradually attracted more and more attention and affection, and people increasingly choose animation as one of the main forms of acquiring information. However, the animation production process includes multiple processes, such as dynamic scenario production, animation modeling, visual special effect production and the like, and its production and implementation are complex, which poses a technical threshold to animation creation for common users or novice animation creators and makes it difficult and costly.

Disclosure of Invention

The invention provides an animation generation method, an animation generation system, an animation generation medium and an electronic terminal, aiming to solve the problems in the prior art that common users or novice animation creators face a high technical threshold, high difficulty and high cost in animation creation.

The animation generation method provided by the invention comprises the following steps:

acquiring lens element information by collecting a story text;

inputting the lens element information into an animation layer generation model for layer generation to obtain one or more animation layers;

the layer generation step comprises: background layer matching, text lexical analysis, text emotion analysis, text-to-speech conversion and lip sequence generation;

and generating the animation by performing synthesis rendering on the plurality of animation layers.

Optionally, the step of obtaining the animation layer generation model includes:

acquiring a training set;

inputting the training texts in the training set into a neural network for training to obtain an animation layer, wherein the neural network comprises: a first convolutional neural subnetwork for performing text lexical analysis, a long short-term memory subnetwork for performing text emotion analysis, a deep neural subnetwork for performing text-to-speech conversion and a second convolutional neural subnetwork for performing lip sequence generation;

and training the neural network according to the animation layer output by the neural network and a preset loss function to obtain an animation layer generation model.

Optionally, the animation layers include: a background layer, the lens element information includes: scene description and character information, and the step of obtaining the background layer includes:

extracting keywords from the scene description and/or the character information to obtain keywords;

matching the keywords with scene resource labels in a preset digital asset management center to obtain a first matching result, wherein the scene resource labels and the scene resources have a first corresponding relation;

and acquiring corresponding scene resources according to the first matching result and the first corresponding relation, and taking the corresponding scene resources as a background layer of the animation to complete background layer matching.

Optionally, the animation layers further include: a special effect layer, the lens element information further includes: dialogue content and voice-over content, and the step of obtaining the special effect layer includes:

inputting the dialogue content and/or the voice-over content into a long short-term memory (LSTM) sub-network of the animation layer generation model, and performing emotion label matching according to the context of the dialogue content and/or the voice-over content to obtain a corresponding emotion label;

determining a corresponding emotion tendency according to the emotion label, wherein the emotion tendency includes: negative and positive, thereby completing the text emotion analysis;

respectively matching the emotion labels and the emotion tendencies with special effect resource labels in a preset digital asset management center to obtain a second matching result, wherein the special effect resource labels and the special effect resources have a second corresponding relation;

and acquiring corresponding special effect resources according to the second matching result and the second corresponding relation, and taking the corresponding special effect resources as a special effect layer of the animation.

Optionally, the animation layers further include: a character action layer, and the step of obtaining the character action layer includes:

performing word segmentation on the dialogue content and/or the voice-over content to obtain one or more segmented words;

inputting the segmented words into a first convolutional neural subnetwork of the animation layer generation model for entity recognition and part-of-speech classification to obtain an entity recognition result and a part-of-speech classification result, wherein the entity recognition types include: name, organization name, address and time, thereby completing the text lexical analysis;

acquiring corresponding character model resources according to the character information;

acquiring corresponding action resources according to the entity recognition result, the part-of-speech classification result, the emotion label and the emotion tendency;

and synthesizing the character action layer according to the character model resources and the action resources.

Optionally, the animation layers further include: a character expression layer, and the step of obtaining the character expression layer includes:

matching the emotion labels and the emotion tendencies with expression resource labels in a preset digital asset management center respectively to obtain a third matching result, wherein the expression resource labels and the expression resources have a third corresponding relation;

acquiring corresponding expression resources according to the third matching result and the third corresponding relation;

and synthesizing the character expression layer according to the acquired character model resources and the expression resources.

Optionally, the animation layers further include: a character lip layer, and the step of obtaining the character lip layer includes:

inputting the dialogue content and/or the voice-over content into a deep neural subnetwork of the animation layer generation model for voice conversion to obtain a corresponding voice sequence;

performing prosody adjustment on the voice sequence according to the emotion label and the emotion tendency to acquire voice audio, thereby completing the text-to-speech conversion;

inputting the voice audio into a second convolution neural sub-network of the animation layer generation model to perform voice feature extraction, and acquiring voice features;

acquiring a corresponding lip sequence according to the voice characteristics to complete the generation of the lip sequence;

and synthesizing the character lip layer according to the lip sequence and the acquired character model resources.

The present invention also provides an animation generation system, comprising:

the element information acquisition module is used for acquiring lens element information by collecting a story text;

the animation layer synthesis module is used for inputting the lens element information into an animation layer generation model for layer generation to obtain one or more animation layers; the layer generation step comprises: background layer matching, text lexical analysis, text emotion analysis, text-to-speech conversion and lip sequence generation;

the animation generation module is used for generating an animation by performing synthesis rendering on the plurality of animation layers; the element information acquisition module, the animation layer synthesis module and the animation generation module are connected.

The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method as defined in any one of the above.

The present invention also provides an electronic terminal, comprising: a processor and a memory;

the memory is adapted to store a computer program and the processor is adapted to execute the computer program stored by the memory to cause the terminal to perform the method as defined in any one of the above.

The invention has the beneficial effects that: in the animation generation method, lens element information is acquired by collecting a story text and is input into the animation layer generation model to perform background layer matching, text lexical analysis, text emotion analysis, text-to-speech conversion and lip sequence generation, so that one or more animation layers are obtained, and the animation is generated by performing synthesis rendering on the plurality of animation layers. This reduces the animation creation difficulty for common users or novice animation creators, with a low threshold, low cost and a high degree of automation.

Drawings

FIG. 1 is a flow chart of an animation generation method according to an embodiment of the invention.

Fig. 2 is a schematic flow chart illustrating the process of obtaining an animation layer generation model in the animation generation method according to the embodiment of the present invention.

Fig. 3 is a schematic flow chart of obtaining a background layer in an animation generation method according to an embodiment of the present invention.

Fig. 4 is a schematic flowchart of obtaining a special effect layer in the animation generation method according to the embodiment of the present invention.

Fig. 5 is a schematic flowchart of acquiring a character action layer in the animation generation method according to the embodiment of the present invention.

Fig. 6 is a schematic flow chart of obtaining a character expression layer in the animation generation method according to the embodiment of the present invention.

Fig. 7 is a schematic flowchart of obtaining a character lip layer in the animation generation method according to an embodiment of the present invention.

FIG. 8 is a schematic structural diagram of an animation generation system according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

The inventor has found that, with the increase of network bandwidth and the popularization of mobile terminals, animation has gradually attracted more and more attention and affection, and people increasingly choose animation as one of the main forms of acquiring information. However, the animation production process includes multiple processes, such as dynamic scenario production, animation modeling, visual special effect production and the like, and its production and implementation are complex, which poses a technical threshold to animation creation for common users or novice animation creators and makes it difficult and costly. Animation is an advanced form of cartoon expression; according to international convention, the production standard of animation is 24 frames per second, which means that one second of animation carries as much content as 24 cartoon drawings, so the time and cost required for production multiply as the total duration of the animation increases, further raising the difficulty of animation creation for common users or novice animation creators. Therefore, the inventor proposes an animation generation method, system, medium and electronic terminal: lens element information is acquired by collecting a story text and is input into an animation layer generation model to perform background layer matching, text lexical analysis, text emotion analysis, text-to-speech conversion and lip sequence generation, so that one or more animation layers are obtained, and an animation is generated by performing synthesis rendering on the plurality of animation layers, which reduces the animation creation difficulty for common users or novice animation creators, with a low threshold, low cost, a high degree of automation, convenient implementation and high feasibility.

As shown in fig. 1, the animation generation method in the present embodiment includes:

s101: acquiring lens element information by acquiring a story text; the lens element information includes: scene description, character information, dialog content, and voice-over content; for example: when a user inputs a story text, different types of lens element information are respectively input according to a preset input format, and animation with high accuracy and high fitting degree with the input story text can be generated at a later stage by inputting a plurality of types of lens element information. The personal information includes: name, gender, age, and wear, etc.

S102: inputting the lens element information into an animation layer generation model for layer generation to obtain one or more animation layers; the layer generation step includes: background layer matching, text lexical analysis, text emotion analysis, text-to-speech conversion and lip sequence generation. For example: background layer matching is performed according to the scene description and/or the character information to obtain the background layer of the animation; text emotion analysis is performed according to the dialogue content and/or the voice-over content to obtain a corresponding emotion label and emotion tendency, and the special effect layer of the animation is obtained according to the emotion label and the emotion tendency; text lexical analysis is performed according to the dialogue content and/or the voice-over content to obtain an entity recognition result and a part-of-speech classification result, and the character action layer is obtained according to the character information, the entity recognition result, the part-of-speech classification result, the obtained emotion label and the emotion tendency; and the character expression layer is obtained according to the character information, the emotion label and the emotion tendency. Obtaining one or more animation layers facilitates the subsequent synthesis of the animation with high accuracy.

S103: generating the animation by performing synthesis rendering on the plurality of animation layers. By inputting the lens element information into the animation layer generation model, one or more animation layers that fit the lens element information are generated, and the plurality of animation layers are synthesized and rendered to generate an animation that closely fits the lens element information. This realizes automatic generation of animation, makes it convenient to generate a corresponding animation from the story text input by the user, lowers the user's animation creation threshold, difficulty and cost, and provides a high degree of automation and convenient implementation. In some embodiments, an animation may be generated by performing synthesis rendering on the plurality of animation layers using an animation synthesis renderer (ZAC).
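
The animation synthesis renderer itself is not detailed in this description; as a minimal stand-in, the sketch below alpha-composites a stack of animation layers into a single frame with Pillow, assuming each layer has been exported as an RGBA image of the same scene. The file names are placeholders.

```python
from PIL import Image

def composite_frame(layer_paths):
    """Alpha-composite animation layers bottom-to-top into one frame.

    layer_paths: background first, then special effect, character action,
    character expression and character lip layers on top.
    """
    frame = Image.open(layer_paths[0]).convert("RGBA")
    for path in layer_paths[1:]:
        layer = Image.open(path).convert("RGBA").resize(frame.size)
        frame = Image.alpha_composite(frame, layer)
    return frame

# One output frame of the animation; repeating this per time step and encoding
# the resulting frames at 24 fps yields the rendered animation.
frame = composite_frame([
    "background.png", "effect.png", "action.png", "expression.png", "lip.png",
])
frame.save("frame_0001.png")
```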

Referring to fig. 2, in order to improve the accuracy with which the animation layer generation model generates animation layers, the inventor proposes that the step of obtaining the animation layer generation model includes:

S201: acquiring a training set; the training set includes a plurality of training texts and one or more real animation layers corresponding to the training texts.

S202: inputting the training texts in the training set into a neural network for training to obtain an animation layer, wherein the neural network comprises: a first convolutional neural subnetwork for performing text lexical analysis, a long short-term memory subnetwork for performing text emotion analysis, a deep neural subnetwork for performing text-to-speech conversion and a second convolutional neural subnetwork for performing lip sequence generation;

S203: training the neural network according to the animation layers output by the neural network and a preset loss function to obtain the animation layer generation model; existing loss functions, such as a cross-entropy loss function or a mean squared error loss function, may be adopted for the first convolutional neural subnetwork for text lexical analysis, the long short-term memory subnetwork for emotion analysis and the second convolutional neural subnetwork for lip sequence generation.
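
For illustration, a compact PyTorch sketch of the four subnetworks and a combined preset loss is given below; the layer sizes, vocabulary size, output targets and the way the layer annotations are encoded are all assumptions, not the embodiment's actual architecture.

```python
import torch
import torch.nn as nn

class AnimationLayerNet(nn.Module):
    """Four subnetworks roughly matching the description in S202 (illustrative)."""
    def __init__(self, vocab=5000, emb=64, n_pos=16, n_emotion=8, n_mel=80, n_lip=20):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        # First convolutional subnetwork: text lexical analysis (one tag per token).
        self.lexical_cnn = nn.Sequential(
            nn.Conv1d(emb, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, n_pos, kernel_size=1),
        )
        # LSTM subnetwork: text emotion analysis (one label per sentence).
        self.emotion_lstm = nn.LSTM(emb, 128, batch_first=True)
        self.emotion_head = nn.Linear(128, n_emotion)
        # Deep neural subnetwork: text-to-speech (a mel-spectrogram frame per token).
        self.tts_dnn = nn.Sequential(nn.Linear(emb, 256), nn.ReLU(), nn.Linear(256, n_mel))
        # Second convolutional subnetwork: lip-shape class per audio frame.
        self.lip_cnn = nn.Sequential(
            nn.Conv1d(n_mel, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, n_lip, kernel_size=1),
        )

    def forward(self, tokens):
        x = self.embed(tokens)                                  # (B, T, emb)
        pos_logits = self.lexical_cnn(x.transpose(1, 2))        # (B, n_pos, T)
        _, (h, _) = self.emotion_lstm(x)
        emotion_logits = self.emotion_head(h[-1])               # (B, n_emotion)
        mel = self.tts_dnn(x)                                   # (B, T, n_mel)
        lip_logits = self.lip_cnn(mel.transpose(1, 2))          # (B, n_lip, T)
        return pos_logits, emotion_logits, mel, lip_logits

model = AnimationLayerNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

# Toy batch standing in for training texts and their annotated real layers.
tokens = torch.randint(0, 5000, (2, 12))
pos_target = torch.randint(0, 16, (2, 12))
emotion_target = torch.randint(0, 8, (2,))
mel_target = torch.randn(2, 12, 80)
lip_target = torch.randint(0, 20, (2, 12))

pos_logits, emotion_logits, mel, lip_logits = model(tokens)
loss = (ce(pos_logits, pos_target) + ce(emotion_logits, emotion_target)
        + mse(mel, mel_target) + ce(lip_logits, lip_target))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```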

In order to facilitate obtaining the plurality of animation layers, the inventor proposes to preset a digital asset management center (DAM) for storing texts, images, models, actions and other data related to animation content creation, such as scene resources, special effect resources, character model resources, action resources and expression resources, which are provided for the animation layer generation model to call.
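
A minimal sketch of such a digital asset management center follows, assuming each resource is simply a file path with a set of tags; the field names, pools and tag-overlap score are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    path: str                           # file of the resource (image, model, action clip, ...)
    tags: set = field(default_factory=set)

class DigitalAssetCenter:
    """Tagged store of scene, effect, character model, action and expression resources."""
    def __init__(self):
        self.pools = {"scene": [], "effect": [], "model": [], "action": [], "expression": []}

    def add(self, kind, asset):
        self.pools[kind].append(asset)

    def match(self, kind, keywords, threshold=0.5):
        """Return the asset whose tags best overlap the keywords, if above threshold."""
        best, best_score = None, 0.0
        for asset in self.pools[kind]:
            score = len(asset.tags & set(keywords)) / max(len(keywords), 1)
            if score > best_score:
                best, best_score = asset, score
        return best if best_score >= threshold else None

dam = DigitalAssetCenter()
dam.add("scene", Asset("assets/scenes/rainy_street.png", {"rain", "street", "night"}))
print(dam.match("scene", ["rain", "night", "street"]))
```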

As shown in fig. 3, the animation layers include a background layer; in order to facilitate obtaining the background layer, the inventor proposes that the step of obtaining the background layer includes:

S301: extracting keywords from the scene description and/or the character information to obtain keywords;

S302: matching the keywords with scene resource labels in a preset digital asset management center to obtain a first matching result, wherein the scene resource labels and the scene resources have a first corresponding relation;

S303: acquiring corresponding scene resources according to the first matching result and the first corresponding relation, and taking the corresponding scene resources as the background layer of the animation to complete the background layer matching. For example: when the first matching result exceeds a preset scene resource label matching threshold, the scene resource corresponding to the scene resource label is determined to be the background layer of the animation.
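
Under the same illustrative assumptions, the background layer matching of S301-S303 might look like the following sketch; the naive word filter stands in for a real keyword extractor, the label table plays the role of the first corresponding relation, and the threshold value is an assumption.

```python
# Hypothetical scene resource labels and their first corresponding relation to scene files.
SCENE_LABELS = {
    "classroom": "assets/scenes/classroom.png",
    "rainy street": "assets/scenes/rainy_street.png",
    "forest": "assets/scenes/forest.png",
}
MATCH_THRESHOLD = 0.5  # preset scene resource label matching threshold (see the S303 example)

def extract_keywords(text):
    """Very naive keyword extraction: lower-cased words longer than 3 characters."""
    return {w.strip(".,!?").lower() for w in text.split() if len(w) > 3}

def match_background(scene_description):
    keywords = extract_keywords(scene_description)
    best_label, best_score = None, 0.0
    for label in SCENE_LABELS:                       # build the first matching result
        label_words = set(label.split())
        score = len(label_words & keywords) / len(label_words)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= MATCH_THRESHOLD:                # exceeds the preset threshold
        return SCENE_LABELS[best_label]              # scene resource used as background layer
    return None

print(match_background("A quiet rainy street at night, lit by a single lamp."))
```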

Referring to fig. 4, the animation layers further include a special effect layer, and the step of obtaining the special effect layer includes:

S401: inputting the dialogue content and/or the voice-over content into a long short-term memory (LSTM) sub-network of the animation layer generation model, and performing emotion label matching according to the context of the dialogue content and/or the voice-over content to obtain a corresponding emotion label; by combining the context of the dialogue content and/or the voice-over content, a more accurate emotion label can be obtained.

S402: determining a corresponding emotion tendency according to the emotion label, wherein the emotion tendency includes: negative and positive, thereby completing the text emotion analysis. For example: when the emotion label is angry or sad, the corresponding emotion tendency is negative; when the emotion label is happy or excited, the corresponding emotion tendency is positive.

S403: respectively matching the emotion labels and the emotion tendencies with special effect resource labels in a preset digital asset management center to obtain a second matching result, wherein the special effect resource labels and the special effect resources have a second corresponding relation;

S404: acquiring corresponding special effect resources according to the second matching result and the second corresponding relation, and taking the corresponding special effect resources as the special effect layer of the animation.
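
A small sketch of S402-S404 under illustrative assumptions: the emotion-label-to-tendency table and the special effect resource labels with their second corresponding relation are hypothetical, and the emotion label itself is assumed to have come from the LSTM sub-network of S401.

```python
# Illustrative mapping from emotion labels to emotion tendency (S402) and to
# special effect resources through a second corresponding relation (S403-S404).
EMOTION_TENDENCY = {
    "angry": "negative", "sad": "negative",
    "happy": "positive", "excited": "positive",
}
EFFECT_RESOURCES = {
    ("angry", "negative"): "assets/effects/storm_clouds.webm",
    ("sad", "negative"): "assets/effects/rain_overlay.webm",
    ("happy", "positive"): "assets/effects/sunshine.webm",
    ("excited", "positive"): "assets/effects/confetti.webm",
}

def special_effect_layer(emotion_label):
    tendency = EMOTION_TENDENCY.get(emotion_label, "positive")
    effect = EFFECT_RESOURCES.get((emotion_label, tendency))   # second matching result
    return tendency, effect

print(special_effect_layer("angry"))   # ('negative', 'assets/effects/storm_clouds.webm')
```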

As shown in fig. 5, the animation layers further include a character action layer, and the step of obtaining the character action layer includes:

S501: performing word segmentation on the dialogue content and/or the voice-over content to obtain one or more segmented words;

S502: inputting the segmented words into a first convolutional neural subnetwork of the animation layer generation model for entity recognition and part-of-speech classification to obtain an entity recognition result and a part-of-speech classification result, wherein the entity recognition types include: name, organization name, address and time, thereby completing the text lexical analysis; as can be understood, the part-of-speech classification result includes: verbs, nouns, adjectives, quantifiers and the like.

S503: acquiring corresponding character model resources according to the character information. For example: character model resources in the digital asset management center are matched according to the gender, age, wear and the like in the character information to obtain the character model resources corresponding to the character information.

S504: acquiring corresponding action resources according to the entity recognition result, the part-of-speech classification result, the emotion label and the emotion tendency. For example: different weights are respectively set for the entity recognition result, the part-of-speech classification result, the emotion label and the emotion tendency, and the corresponding action resources are determined according to the weights.

S505: synthesizing the character action layer according to the character model resources and the action resources.
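
One way to read the weighting mentioned in S504, with purely illustrative weights, tags and action resources, is a weighted vote over candidate action resources, as sketched below.

```python
# Candidate action resources, each tagged with the cues that favour it (illustrative).
ACTION_RESOURCES = {
    "assets/actions/wave.fbx":  {"pos:verb_greet", "emotion:happy", "tendency:positive"},
    "assets/actions/stomp.fbx": {"pos:verb_move", "emotion:angry", "tendency:negative"},
    "assets/actions/point.fbx": {"entity:address", "pos:verb_point"},
}
# Different weights for the four kinds of evidence (see the S504 example).
WEIGHTS = {"entity": 1.0, "pos": 2.0, "emotion": 3.0, "tendency": 1.5}

def pick_action(entities, pos_tags, emotion_label, tendency):
    cues = ({f"entity:{e}" for e in entities} | {f"pos:{p}" for p in pos_tags}
            | {f"emotion:{emotion_label}", f"tendency:{tendency}"})
    def score(tags):
        return sum(WEIGHTS[c.split(":")[0]] for c in tags & cues)
    return max(ACTION_RESOURCES, key=lambda path: score(ACTION_RESOURCES[path]))

print(pick_action(["address"], ["verb_move"], "angry", "negative"))
# -> assets/actions/stomp.fbx (score 6.5: pos 2.0 + emotion 3.0 + tendency 1.5)
```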

Referring to fig. 6, the animation layers further include a character expression layer, and the step of obtaining the character expression layer includes:

S601: matching the emotion labels and the emotion tendencies with expression resource labels in a preset digital asset management center respectively to obtain a third matching result, wherein the expression resource labels and the expression resources have a third corresponding relation;

S602: acquiring corresponding expression resources according to the third matching result and the third corresponding relation;

S603: synthesizing the character expression layer according to the acquired character model resources and the expression resources. By matching the corresponding expression resources with the emotion labels and the emotion tendencies, and synthesizing the character expression layer from the character model resources and the expression resources, the character expression layer is more accurate and better fits the story text.
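
For completeness, a sketch of S601-S603 under the same illustrative assumptions, treating the third corresponding relation as a lookup table and "synthesis" as pasting an expression image onto the character model image; all paths and the face position are placeholders.

```python
from PIL import Image

# Third corresponding relation between expression resource labels and expression resources.
EXPRESSION_RESOURCES = {
    ("angry", "negative"): "assets/expressions/frown.png",
    ("happy", "positive"): "assets/expressions/smile.png",
}

def character_expression_layer(model_path, emotion_label, tendency, face_box=(120, 60)):
    """Paste the matched expression onto the character model to form the expression layer."""
    expression_path = EXPRESSION_RESOURCES.get((emotion_label, tendency))  # third matching result
    layer = Image.open(model_path).convert("RGBA")
    if expression_path is None:
        return layer
    expression = Image.open(expression_path).convert("RGBA")
    layer.paste(expression, face_box, mask=expression)   # keep the expression's transparency
    return layer

character_expression_layer("assets/models/li_lei.png", "happy", "positive").save("expr_layer.png")
```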

Referring to fig. 7, the animation layers further include a character lip layer, and the step of obtaining the character lip layer includes:

S701: inputting the dialogue content and/or the voice-over content into a deep neural subnetwork of the animation layer generation model for voice conversion to obtain a corresponding voice sequence;

S702: performing prosody adjustment on the voice sequence according to the emotion label and the emotion tendency to acquire voice audio, thereby completing the text-to-speech conversion; the prosody includes rhythm and tone, where rhythm refers to the speed of the voice sequence. That is, the rhythm and tone of the voice sequence are adjusted according to a preset prosody adjustment rule, the emotion label and the emotion tendency to obtain better voice audio. For example: when the emotion label is angry and the emotion tendency is negative, the rhythm of the voice sequence is accelerated and its tone is raised according to the preset prosody adjustment rule, thereby realizing the prosody adjustment of the voice sequence.

S703: inputting the voice audio into a second convolution neural sub-network of the animation layer generation model to perform voice feature extraction, and acquiring voice features;

S704: acquiring a corresponding lip sequence according to the voice features to complete the lip sequence generation; that is, the voice features are used to match the corresponding lip features, and the corresponding lip sequence is obtained according to the lip features.

S705: synthesizing the character lip layer according to the lip sequence and the acquired character model resources. By synthesizing the lip sequence and the acquired character model resources, a better character lip layer can be obtained, which effectively improves the synthesis accuracy and the fitting degree of the character lip layer.
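
The deep neural TTS subnetwork and the second convolutional subnetwork are not reproduced here; as a rough, purely illustrative stand-in for S703-S705, the sketch below derives a per-frame energy feature from the voice audio with NumPy and quantizes it into lip-shape IDs at 24 frames per second, which is the kind of lip sequence S705 composites with the character model resources.

```python
import numpy as np

FPS = 24  # production standard of 24 frames per second cited in this description

def lip_sequence_from_audio(audio: np.ndarray, sample_rate: int) -> list:
    """Map voice audio to one lip-shape ID per animation frame.

    A real system would extract learned voice features with the second
    convolutional subnetwork; here RMS frame energy is a crude stand-in:
    0 = closed mouth, 1 = half open, 2 = open, 3 = wide open.
    """
    samples_per_frame = sample_rate // FPS
    n_frames = len(audio) // samples_per_frame
    lip_ids = []
    for i in range(n_frames):
        frame = audio[i * samples_per_frame:(i + 1) * samples_per_frame]
        energy = float(np.sqrt(np.mean(frame ** 2)))   # RMS energy as the "voice feature"
        lip_ids.append(min(3, int(energy / 0.1)))      # quantize energy into 4 lip shapes
    return lip_ids

# Toy audio: 1 second of a 220 Hz tone with a rising envelope standing in for speech.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
audio = np.sin(2 * np.pi * 220 * t) * np.linspace(0, 0.4, sr)
print(lip_sequence_from_audio(audio, sr))   # 24 lip-shape IDs, one per animation frame
```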

In some embodiments, a plurality of animation layers are input into an animation synthesis renderer to be subjected to synthesis rendering, and an animation is generated;

and acquiring an animation video according to the generated animation and the voice audio. For example: the generated animation and the voice audio are matched or aligned in time sequence to obtain the animation video.
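
A minimal sketch of this temporal alignment, assuming the 24 fps production standard cited in this description and a known voice audio duration; muxing the frames and audio into an actual video file is left to an external tool.

```python
import math

FPS = 24  # animation production standard cited in the description

def plan_animation_video(audio_duration_s: float):
    """Pair animation frames with the voice audio timeline.

    Returns (total_frames, frame_timestamps): the renderer produces total_frames
    composited frames, each shown at its timestamp, so the finished animation
    video has the same length as the voice audio.
    """
    total_frames = math.ceil(audio_duration_s * FPS)
    frame_timestamps = [i / FPS for i in range(total_frames)]
    return total_frames, frame_timestamps

frames, timestamps = plan_animation_video(3.5)   # 3.5 s of voice audio
print(frames, timestamps[:3])                    # 84 frames; 0.0, 0.0416..., 0.0833...
```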

In some embodiments, after acquiring the lens element information, animation generation may be performed in other manners, including:

according to the scene description and/or the character information in the lens element information, scene resource retrieval is carried out in a preset digital asset management center, and corresponding scene resources are obtained to serve as a background layer;

transmitting the dialogue content and/or the voice-over content in the lens element information to a preset emotion analysis module through a preset first interface, and acquiring corresponding emotion information by using the emotion analysis module; the emotion analysis module can be an existing system with a text emotion analysis function, and corresponding emotion information, such as love, pleasure, gratitude, neutrality, complaint, anger, disgust, fear, sadness and the like, can be acquired quickly by inputting the dialogue content and/or the voice-over content into the emotion analysis module for emotion analysis, which realizes the calling of external resources and reduces the operation difficulty;

matching the emotion information with a special effect resource label of the digital asset management center to obtain a corresponding special effect resource as a special effect layer;

transmitting the dialogue content and/or the voice-over content to a preset lexical analysis module through a preset second interface to obtain an entity recognition result and a part-of-speech classification result; the lexical analysis module can be an existing system with lexical analysis capability, and transmitting the dialogue content and/or the voice-over content to the lexical analysis module through the second interface allows external resources to be called for lexical analysis, which is convenient to implement;

according to the entity recognition result, the part-of-speech classification result and the emotion information, action resource matching is carried out in the digital asset management center, and corresponding action resources are obtained;

according to the character information, searching is carried out in the digital asset management center, and corresponding character model resources are obtained;

synthesizing a character action layer according to the action resources and the character model resources;

acquiring corresponding expression resources from the digital asset management center according to the emotion information;

synthesizing a character expression layer according to the acquired character model resources and the expression resources;

transmitting the dialogue content and/or the voice-over content to a preset text-to-speech module through a preset third interface to obtain a corresponding voice sequence; the text-to-speech module can be an existing system or device capable of converting text to speech, so that the resources of the text-to-speech system are called to convert the dialogue content and/or the voice-over content into a voice sequence, reducing the operation load;

utilizing a preset phoneme extraction tool to extract phonemes from the voice sequence to obtain phoneme information, wherein the phoneme information comprises: vowel phoneme information and consonant phoneme information;

according to the phoneme information, acquiring corresponding lip images from the digital asset management center, and combining the lip images in time order to obtain a lip sequence matched with the voice sequence;

synthesizing a character lip layer according to the lip sequence and the obtained character model resources;

and performing synthesis rendering on the background layer, the special effect layer, the character action layer, the character expression layer and the character lip layer to generate the animation.
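
Referring back to the phoneme extraction and lip image combination steps above, the following sketch shows the phoneme-driven lip sequence assembly used in this alternative flow, assuming the phoneme extraction tool has already produced (phoneme, start, end) spans; the phoneme-to-lip-image table and file names are illustrative.

```python
FPS = 24

# Illustrative mapping from phonemes (vowel and consonant phoneme information)
# to lip images stored in the digital asset management center.
PHONEME_TO_LIP = {
    "a": "assets/lips/open_wide.png",
    "o": "assets/lips/round.png",
    "i": "assets/lips/spread.png",
    "m": "assets/lips/closed.png",
    "s": "assets/lips/teeth.png",
}
DEFAULT_LIP = "assets/lips/neutral.png"

def lip_sequence(phoneme_spans, duration_s):
    """Combine lip images in time order to match the voice sequence.

    phoneme_spans: list of (phoneme, start_s, end_s) from a phoneme extraction tool.
    Returns one lip image path per animation frame.
    """
    frames = [DEFAULT_LIP] * int(round(duration_s * FPS))
    for phoneme, start, end in phoneme_spans:
        lip = PHONEME_TO_LIP.get(phoneme, DEFAULT_LIP)
        for f in range(int(start * FPS), min(len(frames), int(end * FPS) + 1)):
            frames[f] = lip
    return frames

spans = [("m", 0.00, 0.10), ("a", 0.10, 0.35), ("o", 0.35, 0.55)]
print(lip_sequence(spans, 0.6))   # ~14 lip image paths, one per frame
```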

As shown in fig. 8, the present embodiment further provides an animation generation system, including:

the element information acquisition module is used for acquiring lens element information by collecting a story text;

the animation layer synthesis module is used for inputting the lens element information into an animation layer generation model for layer generation to obtain one or more animation layers; the layer generation step comprises: background layer matching, text lexical analysis, text emotion analysis, text-to-speech conversion and lip sequence generation;

the animation generation module is used for generating an animation by performing synthesis rendering on the plurality of animation layers; the element information acquisition module, the animation layer synthesis module and the animation generation module are connected. The animation generation system in this embodiment acquires lens element information by collecting a story text, inputs the lens element information into an animation layer generation model to perform background layer matching, text lexical analysis, text emotion analysis, text-to-speech conversion and lip sequence generation, obtains one or more animation layers, and generates an animation by synthesizing and rendering the plurality of animation layers, so that the animation creation difficulty of common users or novice animation creators is reduced, the threshold is low, the cost is low, and the degree of automation is high.

In some embodiments, the step of obtaining the animation layer generation model includes:

acquiring a training set;

inputting the training texts in the training set into a neural network for training to obtain an animation layer, wherein the neural network comprises: a first convolutional neural subnetwork for performing text lexical analysis, a long short-term memory subnetwork for performing text emotion analysis, a deep neural subnetwork for performing text-to-speech conversion and a second convolutional neural subnetwork for performing lip sequence generation;

and training the neural network according to the animation layer output by the neural network and a preset loss function to obtain an animation layer generation model.

In some embodiments, the animation layers include: a background layer, the lens element information includes: scene description and character information, and the step of obtaining the background layer includes:

extracting keywords from the scene description and/or the character information to obtain keywords;

matching the keywords with scene resource labels in a preset digital asset management center to obtain a first matching result, wherein the scene resource labels and the scene resources have a first corresponding relation;

and acquiring corresponding scene resources according to the first matching result and the first corresponding relation, and taking the corresponding scene resources as a background layer of the animation to complete background layer matching.

In some embodiments, the animation layers further include: a special effect layer, the lens element information further includes: dialogue content and voice-over content, and the step of obtaining the special effect layer includes:

inputting the dialogue content and/or the voice-over content into a long short-term memory (LSTM) sub-network of the animation layer generation model, and performing emotion label matching according to the context of the dialogue content and/or the voice-over content to obtain a corresponding emotion label;

determining a corresponding emotion tendency according to the emotion label, wherein the emotion tendency includes: negative and positive, thereby completing the text emotion analysis;

respectively matching the emotion labels and the emotion tendencies with special effect resource labels in a preset digital asset management center to obtain a second matching result, wherein the special effect resource labels and the special effect resources have a second corresponding relation;

and acquiring corresponding special effect resources according to the second matching result and the second corresponding relation, and taking the corresponding special effect resources as a special effect layer of the animation.

In some embodiments, the animation layers further include: a character action layer, and the step of obtaining the character action layer includes:

performing word segmentation on the dialogue content and/or the voice-over content to obtain one or more segmented words;

inputting the segmented words into a first convolutional neural subnetwork of the animation layer generation model for entity recognition and part-of-speech classification to obtain an entity recognition result and a part-of-speech classification result, wherein the entity recognition types include: name, organization name, address and time, thereby completing the text lexical analysis;

acquiring corresponding character model resources according to the character information;

acquiring corresponding action resources according to the entity recognition result, the part-of-speech classification result, the emotion label and the emotion tendency;

and synthesizing the character action layer according to the character model resources and the action resources.

In some embodiments, the animation layers further include: a character expression layer, and the step of obtaining the character expression layer includes:

matching the emotion labels and the emotion tendencies with expression resource labels in a preset digital asset management center respectively to obtain a third matching result, wherein the expression resource labels and the expression resources have a third corresponding relation;

acquiring corresponding expression resources according to the third matching result and the third corresponding relation;

and synthesizing the character expression layer according to the acquired character model resources and the expression resources.

In some embodiments, the animation layers further include: a character lip layer, and the step of obtaining the character lip layer includes:

inputting the dialogue content and/or the voice-over content into a deep neural subnetwork of the animation layer generation model for voice conversion to obtain a corresponding voice sequence;

performing prosody adjustment on the voice sequence according to the emotion label and the emotion tendency to acquire voice audio, thereby completing the text-to-speech conversion;

inputting the voice audio into a second convolution neural sub-network of the animation layer generation model to perform voice feature extraction, and acquiring voice features;

acquiring a corresponding lip sequence according to the voice characteristics to complete the generation of the lip sequence;

and synthesizing the character lip layer according to the lip sequence and the acquired character model resources.

The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements any of the methods in the present embodiments.

The present embodiment further provides an electronic terminal, including: a processor and a memory;

the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the terminal to execute the method in the embodiment.

The computer-readable storage medium in the present embodiment can be understood by those skilled in the art as follows: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

The electronic terminal provided by the embodiment comprises a processor, a memory, a transceiver and a communication interface, wherein the memory and the communication interface are connected with the processor and the transceiver and are used for completing mutual communication, the memory is used for storing a computer program, the communication interface is used for carrying out communication, and the processor and the transceiver are used for operating the computer program so that the electronic terminal can execute the steps of the method.

In this embodiment, the memory may include a random access memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.

All the "feature vectors" in the above embodiments refer to vectors representing data features, and not to the term "eigenvector" as used in linear algebra, which refers to the eigenvectors and eigenvalues of a matrix.

The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

