Voice-driven 3D character facial expression method based on deep learning

Document No.: 1939511    Publication date: 2021-12-07

Note: This technology, "Voice-driven 3D character facial expression method based on deep learning" (一种基于深度学习的语音驱动3D人物面部表情方法), was designed and created by 赵锐, 候志迎, and 李帆 on 2021-11-09. Abstract: The invention provides a voice-driven 3D character facial expression method based on deep learning, comprising the following steps: acquiring voice from the produced animation and extracting voice features to obtain fused voice features, then pairing them with the controller parameters to form training data; building a deep learning network model that models the voice features from multiple dimensions, yielding abstract features used for training and for regression prediction of the facial expression controller parameters; inputting the fused voice features into the model, computing the mean square error between the actual output and the target output, and minimizing this error with an Adam optimizer to optimize the network parameters and obtain the final model; and processing the voice features, feeding them into the deep learning network model, obtaining controller parameters by regression prediction, and driving the animation model with the output controller parameters to generate facial expression animation. A more stable animation product can be obtained more quickly, while labor and time costs are reduced.

1. A voice-driven 3D character facial expression method based on deep learning is characterized by comprising the following steps:

S1, acquiring, from the produced animation, the facial expression control parameters of the animated character in each frame together with the corresponding voice; extracting multiple voice features and splicing and fusing them to obtain fused voice features; and pairing the fused voice features with the controller parameters to form the training data;

S2, building a deep learning network model and modeling the voice features from multiple dimensions to obtain abstract features used for training and for regression prediction of the facial expression controller parameters;

S3, model training: inputting the fused voice features of step S1 into the deep learning network model of step S2 to obtain the actual output, taking the expression controller parameters corresponding to the voice features as the target output of the network, computing the mean square error between the actual output and the target output, and optimizing the network parameters with an Adam optimizer so as to minimize the error;

S4, training for a plurality of rounds according to step S3 to obtain the final model;

and S5, inputting voice, after feature extraction and processing, into the deep learning network model, obtaining a series of controller parameters through regression prediction, and driving the animation model with the output controller parameters to generate the facial expression animation.

2. The method of claim 1, wherein the deep learning network model of step S2 comprises:

CNN network layers, which perform convolution operations over the feature dimension of the fused voice features to reduce the feature dimension and obtain abstract features;

an RNN layer, which adopts a bidirectional lstm network to perform sequence modeling over the time dimension of the features extracted by the CNN layers to obtain feature vectors; and

a fully connected network layer, which performs regression prediction on the feature vectors to obtain the facial expression controller parameter values.

3. The method of claim 1, wherein the mean square error of step S3 is calculated as:

MSELoss = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} (Y_{ij} - Z_{ij})^2

where Y is the controller parameter value corresponding to the input feature, i.e., the target output; Z is the controller parameters obtained by regression prediction when the input feature is fed into the deep learning network model, i.e., the actual output; n is the sample size during training; and m is the controller parameter dimension, i.e., the dimension of the last fully connected network layer.

4. The method of claim 2, wherein the CNN comprises 7 layers, the RNN comprises 1 layer, and the fully connected network comprises two layers, one of which is a hidden layer.

5. The method of claim 1, wherein the step S1 of obtaining the control parameters of the facial expression of each frame of the animated character from the produced animation comprises extracting the controller parameters from the produced maya animation.

6. The method of claim 1, wherein the speech features of step S1 include fbank speech features and mfcc speech features, which are spliced and fused in the feature dimension with the intermediate features of the deepspeech open-source model to serve as the fused speech features for training.

7. The method for driving 3D character facial expressions by voice based on deep learning of claim 1, wherein, during model training, the voice features of the q frames before and after the current frame are used together with the voice features of the current frame, where q is greater than or equal to 1.

8. The method of claim 1, wherein, in step S5, the controller parameter values predicted by the model are connected frame by frame through maya or ue, so that each set of controller parameters deforms the model once, yielding the 3D character facial expression animation.

Technical Field

The invention relates to the technical field of animation production, in particular to a voice-driven 3D character facial expression method based on deep learning.

Background

In existing animation production, the facial expression of a 3D character is generally captured on set with face-capture technology to obtain a preliminary expression animation; an animator then refines it by manually adjusting the controller parameter values to obtain the final facial expression animation. Because this process relies on manual work, it is time-consuming and labor-intensive, and the quality of the final 3D character facial expression animation may be uneven owing to differences between animators.

The invention patent with application number CN202011413230.2 discloses an automatic digital human facial expression animation migration method based on a deep learning framework, which uses a neural network to learn the mapping relation between the animation controller channel parameters of different models, so that the animation of a known model can drive the animation of another model.

Disclosure of Invention

The invention aims to provide a deep-learning-based method for driving 3D character facial expressions with voice, which can produce more stable animation output and reduce labor and time costs.

The invention provides the following technical scheme:

a voice-driven 3D character facial expression method based on deep learning comprises the following steps:

S1, acquiring, from the produced animation, the facial expression control parameters of the animated character in each frame together with the corresponding voice; extracting multiple voice features and splicing and fusing them to obtain fused voice features; and pairing the fused voice features with the controller parameters to form the training data;

S2, building a deep learning network model and modeling the voice features from multiple dimensions to obtain abstract features used for training and for regression prediction of the facial expression controller parameters;

S3, model training: inputting the fused voice features of step S1 into the deep learning network model of step S2 to obtain the actual output, taking the expression controller parameters corresponding to the voice features as the target output of the network, computing the mean square error between the actual output and the target output, and optimizing the network parameters with an Adam optimizer so as to minimize the error;

S4, training for a plurality of rounds according to step S3 to obtain the final model;

and S5, inputting voice, after feature extraction and processing, into the deep learning network model, obtaining a series of controller parameters through regression prediction, and driving the animation model with the output controller parameters to generate the facial expression animation.

Preferably, the deep learning network model of step S2 comprises:

CNN network layers, which perform convolution operations over the feature dimension of the fused voice features to reduce the feature dimension and obtain abstract features;

an RNN layer, which adopts a bidirectional lstm network to perform sequence modeling over the time dimension of the features extracted by the CNN layers to obtain feature vectors; and

a fully connected network layer, which performs regression prediction on the feature vectors to obtain the facial expression controller parameter values.

Preferably, the mean square error calculation formula in step S3 is:

MSELoss = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} (Y_{ij} - Z_{ij})^2

where Y is the controller parameter value corresponding to the input feature, i.e., the target output; Z is the controller parameters obtained by regression prediction when the input feature is fed into the deep learning network model, i.e., the actual output; n is the sample size during training; and m is the controller parameter dimension, i.e., the dimension of the last fully connected network layer.

Preferably, the CNN comprises 7 layers, the RNN comprises 1 layer, and the fully connected network comprises two layers, one of which is a hidden layer.

Preferably, obtaining the facial expression control parameters of each frame of the animated character from the produced animation in step S1 comprises extracting the controller parameters from the produced maya animation.

Preferably, the speech features in step S1 include fbank speech features and mfcc speech features, which are spliced and fused in the feature dimension with the intermediate features of the deepspeech open-source model to serve as the fused speech features for training.

Preferably, during model training, the speech features of the q frames before and after the current frame are used together with the speech features of the current frame, where q is greater than or equal to 1.

Preferably, in step S5, the controller parameter values predicted by the model are connected frame by frame through maya or ue, so that each set of controller parameters deforms the model once, yielding the 3D character facial expression animation.

The invention has the beneficial effects that:

(1) the corresponding 3D character facial expression controller parameters are obtained by inference directly from the voice, eliminating human factors, yielding more stable animation output, and reducing labor and time costs;

(2) in the method, the fbank features and the deepspeech features are spliced together as the network input, which enhances the features and improves the generalization performance of the model;

(3) in the method, the original features are modeled in multiple dimensions by the CNN layers and the RNN layer, and the controller parameter predictions are regressed by the final fully connected layers, which greatly improves the prediction accuracy of the model and ensures the quality of the final 3D character facial expression animation.

Detailed Description

Example one

A voice-driven 3D character facial expression method based on deep learning comprises the following steps:

S1, acquiring, from the produced animation, the facial expression control parameters of the animated character in each frame together with the corresponding voice; extracting multiple voice features and splicing and fusing them to obtain fused voice features; and pairing the fused voice features with the controller parameters to form the training data;

S2, building a deep learning network model and modeling the voice features from multiple dimensions to obtain abstract features used for training and for regression prediction of the facial expression controller parameters;

S3, model training: inputting the fused voice features of step S1 into the deep learning network model of step S2 to obtain the actual output, taking the expression controller parameters corresponding to the voice features as the target output of the network, computing the mean square error between the actual output and the target output, and optimizing the network parameters with an Adam optimizer so as to minimize the error;

S4, training for a plurality of rounds according to step S3 to obtain the final model;

and S5, inputting voice, after feature extraction and processing, into the deep learning network model, obtaining a series of controller parameters through regression prediction, and driving the animation model with the output controller parameters to generate the facial expression animation.

The corresponding 3D character facial expression controller parameters are obtained by inference directly from the voice, human factors are eliminated, more stable animation output is obtained, and labor and time costs are reduced.

Example two

A voice-driven 3D character facial expression method based on deep learning comprises the following steps:

S1, obtaining the facial expression control parameters of the animated character in each frame and the corresponding voice from the produced maya animation; extracting multiple voice features, including fbank voice features and mfcc voice features; splicing and fusing them in the feature dimension with the intermediate features of the deepspeech open-source model; and pairing the fused voice features with the controller parameters to form the training data. Experiments show that adding the deepspeech features improves the generalization performance of the model;

s2, building a deep learning network model, wherein the structure comprises:

CNN network layers, which perform convolution operations over the feature dimension of the fused voice features to reduce the feature dimension and obtain abstract features;

an RNN layer, which adopts a bidirectional lstm network to perform sequence modeling over the time dimension of the features extracted by the CNN layers to obtain feature vectors;

a fully connected network layer, which comprises a hidden layer and performs regression prediction on the feature vectors to obtain the facial expression controller parameter values;

S3, model training: inputting the fused voice features of step S1 into the deep learning network model of step S2 to obtain the actual output, taking the expression controller parameters corresponding to the voice features as the target output of the network, and computing the mean square error between the actual output and the target output, calculated as:

MSELoss = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} (Y_{ij} - Z_{ij})^2

where Y is the controller parameter value corresponding to the input feature, i.e., the target output; Z is the controller parameters obtained by regression prediction when the input feature is fed into the deep learning network model, i.e., the actual output; n is the sample size during training; and m is the controller parameter dimension, i.e., the dimension of the last fully connected network layer;

then, an Adam optimizer is used to minimize the error and optimize the network parameters;

S4, training for a plurality of rounds according to step S3 to obtain the final model; specifically, during model training, in order to make the final result smoother, the voice features of the q frames before and after the current frame are used together with the voice features of the current frame, where q is greater than or equal to 1;

and S5, inputting voice, after feature extraction and processing, into the deep learning network model, performing regression prediction to obtain a series of controller parameters, and, through maya or ue, letting each set of predicted controller parameters deform the model once and connecting the results to obtain the 3D character facial expression animation.

Example three

A method for driving 3D character facial expressions by voice based on deep learning comprises the following steps:

the method comprises the steps of firstly, obtaining control parameters of facial expressions of each frame of animation character and corresponding voice from a manufactured maya animation, extracting various voice characteristics including fbank voice characteristics and mfcc voice characteristics, and extracting 50 frames per second because the dimension of deep characteristic time before a deepspeed classification layer is the same as the dimension of the characteristics.

In the second step, the deep learning network model models the original features in multiple dimensions through 7 CNN layers and one RNN layer, and the controller parameter predictions are regressed by the last two fully connected layers.

Each CNN layer performs a convolution over the feature dimension of the fused voice features (feature size (1, 20, 109), where 20 denotes the voice features of the 10 frames before and after the current frame) to reduce the feature dimension. For the 109-dimensional features, the first CNN layer reduces the feature dimension to 55 and increases the number of convolution channels to 16; after 7 CNN layers the feature dimension is 1 and the number of convolution channels is 256, so the feature size after the CNN layers is (256, 20, 1).

The RNN layer performs sequence modeling over the time dimension of the features, i.e., the dimension of size 20. The RNN used here is a bidirectional lstm, and the feature obtained after the RNN layer is a 512-dimensional vector, which both fuses information across the feature dimension and models the importance of the features over time.

The fully connected layers regress this vector; they comprise a hidden layer of dimension 256 and a final fully connected layer of dimension 127, whose output is the controller parameter values obtained by regression prediction.

The number of layers or the structure of the deep learning network model can also be adjusted according to the feature dimension, the amount of data, or other factors.

The model thus models the voice features from multiple dimensions, and the resulting abstract features can be used effectively for training and for regression prediction of the facial expression controller parameters.
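The following PyTorch sketch shows one way such a network could be realized. The kernel sizes, the channel progression, the activation functions, and the choice of taking the last time step of the bidirectional lstm output as the 512-dimensional vector are assumptions; the text only fixes 7 CNN layers, one bidirectional lstm layer, a 256-dimensional hidden fully connected layer, and a 127-dimensional output.

import torch
import torch.nn as nn

class VoiceToController(nn.Module):
    def __init__(self, n_params=127):
        super().__init__()
        chans = [1, 16, 32, 32, 64, 64, 128, 256]  # 7 conv layers, ending at 256 channels
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            # stride 2 on the feature axis only: 109 -> 55 -> 28 -> 14 -> 7 -> 4 -> 2 -> 1
            layers += [nn.Conv2d(c_in, c_out, kernel_size=(1, 3),
                                 stride=(1, 2), padding=(0, 1)),
                       nn.ReLU()]
        self.cnn = nn.Sequential(*layers)
        self.rnn = nn.LSTM(input_size=256, hidden_size=256,
                           batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                                nn.Linear(256, n_params))

    def forward(self, x):                    # x: (B, 1, 20, 109)
        h = self.cnn(x)                      # (B, 256, 20, 1)
        h = h.squeeze(-1).permute(0, 2, 1)   # (B, 20, 256)
        h, _ = self.rnn(h)                   # (B, 20, 512)
        h = h[:, -1, :]                      # last time step -> (B, 512)
        return self.fc(h)                    # (B, 127) controller parameters

# Shape check: VoiceToController()(torch.zeros(4, 1, 20, 109)) -> torch.Size([4, 127])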

In the third step, the fused voice features are taken as the network input and the expression controller parameters corresponding to the voice features as the target output of the network; using the mean square error (MSE) between the actual output of the step-two network and the target output, an Adam optimizer minimizes the error and optimizes the network parameters over a plurality of training rounds to obtain the final model.

Let the input feature be X, a tensor of size (1, 20, 109), and let the controller parameter values of the corresponding animation frame be Y.

Let the deep learning network model be f; then the controller parameters obtained by regression prediction are Z = f(X).

MSELoss = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} (Y_{ij} - Z_{ij})^2, where n is the sample size during training and m is equal to the controller parameter dimension, 127.
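For illustration, a training loop matching this description could look as follows in PyTorch, where net could be an instance of the VoiceToController sketch above and train_loader, the learning rate, and the number of epochs are placeholders not specified in the text; nn.MSELoss implements the mean square error defined above and torch.optim.Adam is the Adam optimizer.

import torch
import torch.nn as nn

def train(net, train_loader, epochs=100, lr=1e-3, device="cpu"):
    net.to(device)
    criterion = nn.MSELoss()  # mean square error between actual and target output
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for epoch in range(epochs):
        net.train()
        total = 0.0
        for features, target_params in train_loader:  # fused features, controller params
            features = features.to(device)
            target_params = target_params.to(device)
            actual = net(features)                    # actual output
            loss = criterion(actual, target_params)   # MSE against the target output
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch}: mean loss {total / len(train_loader):.6f}")
    return net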

Specifically, during model training, because the number of animation frames is smaller than the number of extracted voice feature frames, the voice features are mapped to the animation frames at an integer ratio. With voice features extracted at 50 frames per second, either the odd or the even feature frames can be selected as the key frames corresponding to the animation. Meanwhile, in order to make the final result smoother, the voice features of the q frames before and after the current frame are used together with the voice features of the current frame for training, where q is greater than or equal to 1.
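A hedged sketch of this alignment and windowing step is given below: the 50 fps voice features are mapped to animation key frames by taking every other feature frame, and each training sample stacks a window of 2q feature frames around the current frame, so that q = 10 reproduces the (1, 20, 109) input size used above. The exact alignment policy (odd versus even frames, edge padding, whether the current frame itself is included) is an assumption.

import numpy as np

def build_samples(features, controller_params, q=10, use_odd=False):
    """features: (T_speech, D) at 50 fps; controller_params: (T_anim, M) at 25 fps."""
    start = 1 if use_odd else 0
    keyframes = np.arange(start, len(features), 2)[:len(controller_params)]
    # pad the feature sequence so that windows at the edges are full-sized
    padded = np.pad(features, ((q, q), (0, 0)), mode="edge")
    X, Y = [], []
    for anim_idx, feat_idx in enumerate(keyframes):
        window = padded[feat_idx:feat_idx + 2 * q]  # 2q consecutive feature frames
        X.append(window[np.newaxis])                # (1, 2q, D)
        Y.append(controller_params[anim_idx])
    return np.stack(X), np.stack(Y)                 # (N, 1, 2q, D), (N, M)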

In the fourth step, voice, after feature extraction and processing, is input into the deep learning network model and regression prediction yields a series of controller parameters; through maya or ue, each set of predicted controller parameters deforms the model once, and the deformed frames are connected to obtain the 3D character facial expression animation.
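On the maya side, one possible way to connect the predicted values, shown here only as a sketch, is to key each predicted value onto the corresponding rig attribute frame by frame with maya.cmds.setKeyframe, so that consecutive keys form the animation curves; CONTROLLER_ATTRS is a placeholder for the actual list of 127 rig attributes, which depends on the production rig and is not specified in the text.

import maya.cmds as cmds

CONTROLLER_ATTRS = ["face_ctrl.jawOpen", "face_ctrl.mouthSmileL"]  # placeholder; 127 attrs in total

def apply_predicted_params(param_frames, start_frame=1):
    """param_frames: iterable of per-frame sequences of predicted controller values."""
    for offset, params in enumerate(param_frames):
        frame = start_frame + offset
        for attr, value in zip(CONTROLLER_ATTRS, params):
            # set a key so that consecutive frames connect into an animation curve
            cmds.setKeyframe(attr, time=frame, value=float(value))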

The corresponding 3D character facial expression controller parameters are obtained by inference directly from the voice, human factors are eliminated, more stable animation output is obtained, and labor and time costs are reduced;

according to the method, the fbank characteristics and deep deepness characteristics are spliced to be used as the input of the network, so that the characteristic enhancement effect is achieved, and the generalization performance of the model is improved;

according to the method, the original characteristics are modeled in multiple dimensions through the CNN layers and the RNN layers, and the predicted values of the parameters of the controller are regressed by the last full-connection layers, so that the accuracy of model prediction is greatly improved, and the quality of the final 3D character facial expression animation effect is ensured.

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
