Expression driving method, system, and computer device

Document No.: 1954831    Publication date: 2021-12-10

Note: this technology, "Expression driving method, system, and computer device", was created by 李团辉 (Li Tuanhui) and 王擎 (Wang Qing) on 2021-09-15. Abstract: The embodiments of the present application provide an expression driving method, system, and computer device. Multi-view reconstruction is performed on the multi-view expression image sequences corresponding to the facial expressions of a target object to obtain a three-dimensional expression model for each facial expression. A preset neural network is then trained with the multi-view expression image sequences and the corresponding three-dimensional expression models to obtain an expression prediction neural network. Finally, an acquired facial image of the target object is input into the expression prediction neural network to obtain a three-dimensional expression model of the target object, and the facial expression of the virtual digital avatar in the live broadcast picture is driven according to that model. In this way, the facial expression of the target object (such as an anchor) is expressed more accurately and finely through a more lifelike virtual digital avatar, making the live broadcast more vivid and interesting and greatly improving the effect of virtual live broadcasting and the user experience.

1. An expression driving method, characterized in that the method comprises:

capturing, through an image acquisition device, expression images of a plurality of facial expressions of a target object from a plurality of view angles, to obtain a multi-view expression image sequence corresponding to each facial expression of the target object; wherein the multi-view expression image sequence corresponding to each facial expression comprises at least one facial expression image at each view angle, obtained by photographing that facial expression of the target object from a plurality of different view angles;

obtaining a three-dimensional expression model corresponding to each facial expression through multi-view reconstruction according to the multi-view expression image sequence corresponding to each facial expression;

training a preset neural network through the multi-view expression image sequences corresponding to the facial expressions respectively and the three-dimensional expression models corresponding to the facial expressions respectively to obtain an expression prediction neural network;

and inputting an acquired facial image of the target object into the expression prediction neural network to obtain a three-dimensional expression model of the target object, and driving the facial expression of the virtual digital avatar in the live broadcast picture according to the three-dimensional expression model.

2. The expression driving method according to claim 1, wherein obtaining a three-dimensional expression model corresponding to each of the facial expressions through multi-view reconstruction according to a multi-view expression image sequence corresponding to each of the facial expressions, comprises:

for each multi-view expression image sequence, extracting key points of each facial expression image in the multi-view expression image sequence to obtain facial key points included in each facial expression image;

taking one facial expression image in the multi-view expression image sequence as a reference image, sequentially traversing each facial key point in the reference image, and searching for facial key points corresponding to each facial key point in the reference image in other facial expression images in the multi-view expression image sequence;

determining the position information of each facial key point in the reference image according to the found facial key points corresponding to each facial key point in the other facial expression images;

and reconstructing according to the position information of each facial key point to obtain a three-dimensional expression model of the facial expression corresponding to the multi-view expression image sequence.

3. The expression driving method according to claim 1 or 2, wherein the step of training a preset neural network through a multi-view expression image sequence corresponding to each facial expression and a three-dimensional expression model corresponding to each facial expression to obtain an expression prediction neural network comprises:

carrying out topological mapping on the three-dimensional expression models respectively corresponding to the facial expressions to obtain a regular three-dimensional grid model meeting set rules;

for each multi-view expression image sequence, determining at least one training sample based on facial expression images in the multi-view expression image sequence;

sequentially inputting the training samples into the neural network to obtain a predicted three-dimensional grid model output by the neural network, calculating a loss function value of the neural network according to the predicted three-dimensional grid model and a sample label corresponding to the training sample, and iteratively updating network parameters of the neural network according to the loss function value until a training termination condition is met, to obtain the expression prediction neural network;

the sample label of the training sample is a regularization three-dimensional grid model corresponding to the multi-view expression image sequence to which the training sample belongs.

4. The expression driving method according to claim 3, wherein determining, for each of the sequences of multi-view expression images, at least one training sample based on a facial expression image in the sequence of multi-view expression images comprises:

and taking the facial expression image corresponding to a preset shooting view angle in the multi-view expression image sequence as the training sample.

5. The expression driving method according to claim 3, wherein determining, for each of the sequences of multi-view expression images, at least one training sample based on a facial expression image in the sequence of multi-view expression images comprises:

taking a facial expression image corresponding to a preset shooting view angle in the multi-view expression image sequence as a reference sample;

performing data enhancement on the basis of the reference sample to obtain at least one enhanced sample, and finally taking the reference sample and the at least one enhanced sample as the training sample;

wherein the data enhancement comprises one or a combination of two or more of rotating, mirroring, adjusting the brightness of, and injecting noise into the reference sample.

6. The expression driving method according to claim 1 or 2, wherein the step of training a preset neural network through a multi-view expression image sequence corresponding to each facial expression and a three-dimensional expression model corresponding to each facial expression to obtain an expression prediction neural network comprises:

for each multi-view expression image sequence, determining at least one training sample based on facial expression images in the multi-view expression image sequence;

sequentially inputting the training samples into the neural network to obtain a predicted three-dimensional grid model output by the neural network, calculating a loss function value of the neural network according to the predicted three-dimensional grid model and a sample label corresponding to the training sample, and iteratively updating network parameters of the neural network according to the loss function value until a training termination condition is met, to obtain the expression prediction neural network;

the sample label of the training sample is a three-dimensional expression model obtained through multi-view reconstruction according to the multi-view expression image sequence to which the training sample belongs.

7. The expression driving method according to claim 1 or 2, wherein driving the facial expression of the virtual digital avatar in the live broadcast picture according to the three-dimensional expression model comprises:

driving each facial key point of the virtual digital avatar to move according to the position information of each model vertex included in the three-dimensional expression model output by the expression prediction neural network, so that the virtual digital avatar expresses the facial expression of the target object; or

replacing the position coordinates of each corresponding facial key point of the virtual digital avatar with the position coordinates of each model vertex included in the three-dimensional expression model output by the expression prediction neural network, to realize the facial expression driving of the virtual digital avatar.

8. The expression driving method according to claim 1, wherein the image acquisition device comprises a plurality of 4D video cameras arranged around the target object to form an array camera system, and the plurality of 4D video cameras are respectively used for photographing the facial expressions of the target object from different view angles to obtain facial expression images corresponding to the different view angles, so as to form the multi-view expression image sequence.

9. An expression drive system, comprising:

the image acquisition module is used for capturing, through an image acquisition device, expression images of a plurality of facial expressions of a target object from a plurality of view angles, to obtain a multi-view expression image sequence corresponding to each facial expression of the target object; wherein the multi-view expression image sequence corresponding to each facial expression comprises at least one facial expression image at each view angle, obtained by photographing that facial expression of the target object from a plurality of different view angles;

the three-dimensional reconstruction module is used for obtaining a three-dimensional expression model corresponding to each facial expression through multi-view reconstruction according to the multi-view expression image sequence corresponding to each facial expression;

the network training module is used for training a preset neural network through the multi-view expression image sequences corresponding to the facial expressions respectively and the three-dimensional expression models corresponding to the facial expressions respectively to obtain an expression prediction neural network;

and the expression driving module is used for inputting an acquired facial image of the target object into the expression prediction neural network to obtain a three-dimensional expression model of the target object, and driving the facial expression of the virtual digital avatar in the live broadcast picture according to the three-dimensional expression model.

10. A computer device comprising a machine-readable storage medium and one or more processors, the machine-readable storage medium having stored thereon machine-executable instructions that, when executed by the one or more processors, perform the method of any one of claims 1-8.

Technical Field

The present application relates to the technical field of artificial-intelligence-based digital live broadcasting, and in particular, to an expression driving method, an expression driving system, and a computer device.

Background

With the continuous development of mobile internet technology and network communication technology, webcast live streaming has developed rapidly and is widely used in people's daily work and life. For example, a user may watch live content provided by various anchors of a live broadcast platform online through a smartphone, a computer, a tablet computer, or the like, or may provide live content on a corresponding live broadcast platform anytime and anywhere through such a device for others to watch. In some specific live scenes, in order to provide a diversified live experience, a virtual live broadcast mode based on a virtual digital avatar is also widely applied. Compared with a live broadcast mode in which a real anchor appears on camera, virtual live broadcast does not require the anchor to interact on camera in person; instead, the anchor in the background controls the virtual digital avatar to simulate the anchor's behavior and carry out the live interaction. In a virtual live broadcast application scene based on a virtual digital avatar, expression driving of the virtual digital avatar is an important technical branch of virtual live broadcast. Most existing schemes for driving the expression of the virtual digital avatar suffer from low driving precision, so the expression of the virtual digital avatar is not fine and delicate.

Disclosure of Invention

Based on the above, in a first aspect, an embodiment of the present application provides an expression driving method, where the method includes:

capturing, through an image acquisition device, expression images of a plurality of facial expressions of a target object from a plurality of view angles, to obtain a multi-view expression image sequence corresponding to each facial expression of the target object; wherein the multi-view expression image sequence corresponding to each facial expression comprises at least one facial expression image at each view angle, obtained by photographing that facial expression of the target object from a plurality of different view angles;

obtaining a three-dimensional expression model corresponding to each facial expression through multi-view reconstruction according to the multi-view expression image sequence corresponding to each facial expression;

training a preset neural network through the multi-view expression image sequences corresponding to the facial expressions respectively and the three-dimensional expression models corresponding to the facial expressions respectively to obtain an expression prediction neural network;

and inputting an acquired facial image of the target object into the expression prediction neural network to obtain a three-dimensional expression model of the target object, and driving the facial expression of the virtual digital avatar in the live broadcast picture according to the three-dimensional expression model.

Based on a possible implementation manner of the first aspect, obtaining, through multi-view reconstruction, a three-dimensional expression model corresponding to each facial expression according to a multi-view expression image sequence corresponding to each facial expression, includes:

for each multi-view expression image sequence, extracting key points of each facial expression image in the multi-view expression image sequence to obtain facial key points included in each facial expression image;

taking one facial expression image in the multi-view expression image sequence as a reference image, sequentially traversing each facial key point in the reference image, and searching for facial key points corresponding to each facial key point in the reference image in other facial expression images in the multi-view expression image sequence;

determining the position information of each facial key point in the reference image according to the found facial key points corresponding to each facial key point in the other facial expression images;

and reconstructing according to the position information of each facial key point to obtain a three-dimensional expression model of the facial expression corresponding to the multi-view expression image sequence.

Based on a possible implementation manner of the first aspect, training a preset neural network through a multi-view expression image sequence corresponding to each facial expression and a three-dimensional expression model corresponding to each facial expression to obtain an expression prediction neural network includes:

carrying out topological mapping on the three-dimensional expression models respectively corresponding to the facial expressions to obtain a regular three-dimensional grid model meeting set rules;

for each multi-view expression image sequence, determining at least one training sample based on facial expression images in the multi-view expression image sequence;

sequentially inputting the training samples into the neural network to obtain a predicted three-dimensional grid model output by the neural network, calculating a loss function value of the neural network according to the predicted three-dimensional grid model and a sample label corresponding to the training sample, and iteratively updating network parameters of the neural network according to the loss function value until a training termination condition is met, to obtain the expression prediction neural network;

the sample label of the training sample is a regularization three-dimensional grid model corresponding to the multi-view expression image sequence to which the training sample belongs.

In a possible implementation manner of the first aspect, for each sequence of multi-view expression images, determining at least one training sample based on a facial expression image in the sequence of multi-view expression images includes:

and taking the facial expression image corresponding to a preset shooting view angle in the multi-view expression image sequence as the training sample.

In a possible implementation manner of the first aspect, for each sequence of multi-view expression images, determining at least one training sample based on a facial expression image in the sequence of multi-view expression images includes:

taking a facial expression image corresponding to a preset shooting view angle in the multi-view expression image sequence as a reference sample;

performing data enhancement on the basis of the reference sample to obtain at least one enhanced sample, and finally taking the reference sample and the at least one enhanced sample as the training sample;

wherein the data enhancement comprises one or a combination of two or more of rotating, mirroring, adjusting the brightness of, and injecting noise into the reference sample.

Based on a possible implementation manner of the first aspect, training a preset neural network through a multi-view expression image sequence corresponding to each facial expression and a three-dimensional expression model corresponding to each facial expression to obtain an expression prediction neural network includes:

for each multi-view expression image sequence, determining at least one training sample based on facial expression images in the multi-view expression image sequence;

sequentially inputting the training samples into the neural network to obtain a predicted three-dimensional grid model output by the neural network, calculating a loss function value of the neural network according to the predicted three-dimensional grid model and a sample label corresponding to the training sample, and iteratively updating network parameters of the neural network according to the loss function value until a training termination condition is met, to obtain the expression prediction neural network;

the sample label of the training sample is a three-dimensional expression model obtained through multi-view reconstruction according to the multi-view expression image sequence to which the training sample belongs.

In a possible implementation manner of the first aspect, driving the facial expression of the virtual digital avatar in the live broadcast picture according to the three-dimensional expression model includes:

driving each facial key point of the virtual digital avatar to move according to the position information of each model vertex included in the three-dimensional expression model output by the expression prediction neural network, so that the virtual digital avatar expresses the facial expression of the target object; or

replacing the position coordinates of each corresponding facial key point of the virtual digital avatar with the position coordinates of each model vertex included in the three-dimensional expression model output by the expression prediction neural network, to realize the facial expression driving of the virtual digital avatar.

Based on one possible implementation manner of the first aspect, the image acquisition device includes a plurality of 4D video cameras arranged around the target object to form an array camera system, and the plurality of 4D video cameras are respectively used for photographing the facial expressions of the target object from different view angles to obtain facial expression images corresponding to the different view angles, so as to form the multi-view expression image sequence.

In a second aspect, an embodiment of the present application further provides an expression driving system, where the expression driving system includes:

the image acquisition module is used for capturing, through an image acquisition device, expression images of a plurality of facial expressions of a target object from a plurality of view angles, to obtain a multi-view expression image sequence corresponding to each facial expression of the target object; wherein the multi-view expression image sequence corresponding to each facial expression comprises at least one facial expression image at each view angle, obtained by photographing that facial expression of the target object from a plurality of different view angles;

the three-dimensional reconstruction module is used for obtaining a three-dimensional expression model corresponding to each facial expression through multi-view reconstruction according to the multi-view expression image sequence corresponding to each facial expression;

the network training module is used for training a preset neural network through the multi-view expression image sequences corresponding to the facial expressions respectively and the three-dimensional expression models corresponding to the facial expressions respectively to obtain an expression prediction neural network;

and the expression driving module is used for inputting an acquired facial image of the target object into the expression prediction neural network to obtain a three-dimensional expression model of the target object, and driving the facial expression of the virtual digital avatar in the live broadcast picture according to the three-dimensional expression model.

In a third aspect, embodiments of the present application further provide a computer device including a machine-readable storage medium and one or more processors, where the machine-readable storage medium stores machine-executable instructions, and the machine-executable instructions, when executed by the one or more processors, implement the method described above.

Based on the above, compared with the prior art, the expression driving method, system, and computer device provided in the embodiments of the present application perform multi-view reconstruction on the multi-view expression image sequence obtained for each facial expression of the target object to obtain a three-dimensional expression model corresponding to each facial expression; train a preset neural network with the multi-view expression image sequences and the corresponding three-dimensional expression models to obtain an expression prediction neural network; and finally input an acquired facial image of the target object into the expression prediction neural network to obtain a three-dimensional expression model of the target object, and drive the facial expression of the virtual digital avatar in the live broadcast picture according to that model.

Thus, compared with conventional expression driving schemes based on expression bases and the like, the scheme of this embodiment achieves high precision and good real-time performance. In particular, in a live-broadcast-room application scene, a one-to-one virtual digital avatar can be created for the anchor. At the same time, this embodiment further trains the neural network with the help of a high-precision expression reconstruction technique, so that the facial expression of the anchor can be expressed more accurately and delicately through a more lifelike virtual digital avatar, making the live broadcast more vivid and interesting and greatly improving the effect of virtual live broadcasting and the user experience. Furthermore, the scheme of this embodiment does not depend on a large amount of manual labor, which greatly improves production efficiency and reduces production cost.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

Fig. 1 is a schematic diagram of a live broadcast system for implementing the expression driving method provided in this embodiment.

Fig. 2 is a schematic flow chart of an expression driving method according to an embodiment of the present application.

Fig. 3 is a schematic diagram of an array camera system for multi-view facial image capture of a target object according to an embodiment of the present application.

Fig. 4 is a flow chart illustrating the sub-steps of step S200 in fig. 2.

Fig. 5 is a flow chart illustrating the sub-steps of step S300 in fig. 2.

Fig. 6 is a schematic diagram of a process for training a neural network according to this embodiment.

Fig. 7 is an application diagram of expression model prediction using the trained expression prediction neural network.

Fig. 8 is a schematic diagram of a computer device for implementing the expression driving method according to an embodiment of the present application.

Fig. 9 is a functional block diagram of an expression driving system according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

In the description of the present application, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.

In the description of the present application, it is further noted that, unless expressly stated or limited otherwise, the terms "disposed," "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present application can be understood in a specific case by those of ordinary skill in the art.

Based on the problems mentioned in the background art, the inventors found through research and investigation that, in a common virtual live broadcast scene based on a virtual digital human, the more common expression animation driving scheme is an expression driving mode implemented based on expression bases. An expression base is an expression unit obtained by decomposing a specific expression of the driven character. Different expression units represent the movement of different parts, such as the eyes, mouth, eyebrows, and nose, for example closing the eyes, opening the mouth, or raising the eyebrows. Different expressions can be obtained by linearly combining the expression bases with different weights. However, this scheme has the drawback that its precision is very low and it cannot express fine expressions. The upper limit of the precision of the expression animation obtained by linearly combining expression bases is determined by the number of expression bases; generally, hundreds or even thousands of expression bases are designed in a film-and-television-grade solution, and each expression base is responsible for the expression change of a specific part of the face. Such a scheme requires a large amount of manpower and relies heavily on the manual design of animators, who have to adjust the expression bases repeatedly so that they can better express real expressions. Therefore, in common film and television production, professional actors perform the expressions, and their captured expressions are then used to drive the digital avatar. Consequently, achieving high-precision, highly detailed driving of a virtual digital avatar with an expression-base-based scheme is very expensive and is not conducive to batch production.
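For illustration only, the linear combination of expression bases mentioned above can be sketched as follows; the neutral mesh, the expression-base offsets, and the weights in this toy example are hypothetical placeholders and are not part of the present application.

```python
import numpy as np

# Toy blendshape rig: a neutral face mesh plus per-expression-base vertex
# offsets ("open mouth", "close eyes", ...), each of shape (V, 3).
V = 5                                  # number of mesh vertices (tiny example)
neutral = np.zeros((V, 3))
expression_bases = {
    "open_mouth": np.random.randn(V, 3) * 0.01,
    "close_eyes": np.random.randn(V, 3) * 0.01,
    "raise_brows": np.random.randn(V, 3) * 0.01,
}

def blend(weights):
    """Linear combination: mesh = neutral + sum_i(weight_i * base_i)."""
    mesh = neutral.copy()
    for name, w in weights.items():
        mesh += w * expression_bases[name]
    return mesh

# A half-open mouth with slightly raised eyebrows.
combined = blend({"open_mouth": 0.5, "raise_brows": 0.2})
```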

In view of this, the embodiments of the present application innovatively provide an expression driving method based on a neural network model, which can achieve high-precision expression driving, express the expression of the virtual digital avatar more delicately, greatly improve the driving efficiency, and reduce the required cost. The embodiments of the present application are described in detail below with reference to specific examples.

First, the system architecture of an application scenario of the embodiments of the present application is introduced. Fig. 1 is a schematic diagram of a live broadcast system provided in an embodiment of the present application. In this embodiment, the live broadcast system includes a live broadcast providing terminal 100, a live broadcast server 200, and a live broadcast receiving terminal 300. Illustratively, the live broadcast providing terminal 100 and the live broadcast receiving terminal 300 may access the live broadcast server 200 through a network to use the live broadcast service it provides. For example, an anchor-side application (APP) may be downloaded to the live broadcast providing terminal 100 through the live broadcast server 200, and after registration through the anchor-side application, content may be broadcast live through the live broadcast server 200. Correspondingly, the live broadcast receiving terminal 300 may download a viewer-side application through the live broadcast server 200 and watch the live content provided by the live broadcast providing terminal 100 by accessing the live broadcast server 200 through that application. In some possible embodiments, the anchor-side application and the viewer-side application may also be one integrated application.

For example, the live broadcast providing terminal 100 may transmit live content (e.g., a live video stream) to the live broadcast server 200, and the viewer may access the live broadcast server 200 through the live broadcast receiving terminal 300 to view the live broadcast content. The live content pushed by the live server 200 may be real-time content currently live in the live platform, or historical live content stored after the live broadcast is completed. It will be appreciated that the live system shown in fig. 1 is only an alternative example, and that in other possible embodiments the live system may comprise only some of the components shown in fig. 1 or may also comprise further components.

In addition, it should be noted that, in a specific application scenario, the live broadcast providing terminal 100 and the live broadcast receiving terminal 300 may also exchange roles. For example, the anchor of the live broadcast providing terminal 100 may use it to provide a live broadcast service, or may watch live content provided by other anchors as a viewer. Likewise, the user of the live broadcast receiving terminal 300 may use it to watch live content provided by an anchor of interest, or may broadcast live as an anchor through the live broadcast receiving terminal 300.

In this embodiment, the live broadcast providing terminal 100 and the live broadcast receiving terminal 300 may be, but are not limited to, a smart phone, a personal digital assistant, a tablet computer, a personal computer, a notebook computer, a virtual reality terminal device, an augmented reality terminal device, and the like. The live broadcast providing terminal 100 and the live broadcast receiving terminal 300 may have installed therein related applications or program components for implementing live broadcast interaction, such as an application APP, a Web page, a live broadcast applet, a live broadcast plug-in or component, but are not limited thereto. The live server 200 may be a background device providing live services, and may be, for example and without limitation, a server cluster, a cloud service center, and the like.

In this embodiment, the live broadcast providing terminal 100 may include an image capturing device for capturing a main broadcast image. In addition, an audio acquisition device for acquiring the sound of the anchor, an input/output device for inputting information by the anchor, and the like can be further included, for example, but not limited to, a keyboard, a mouse, a touch screen, a microphone, a loudspeaker, and the like. The image capturing device, the audio capturing device, and the input/output device may be directly installed or integrated on the live broadcast providing terminal 100, or may be independent of the live broadcast providing terminal 100 and communicatively connected to the live broadcast providing terminal 100 to perform data communication and interaction.

Fig. 2 is a schematic flow chart of an expression driving method according to an embodiment of the present application. In this embodiment, the expression driving method is executed and implemented by a computer device. The computer device may be the live broadcast providing terminal 100 shown in fig. 1, or may be the live broadcast server 200, which is not limited specifically. It should be understood that, in the expression driving method provided in this embodiment, the sequences of some steps included in the expression driving method may be interchanged according to actual needs in actual implementation, or some steps may be omitted or deleted, which is not specifically limited in this embodiment.

The steps of the expression driving method of this embodiment are described in detail below with reference to fig. 2. As shown in fig. 2, the method may include steps S100 to S400 described below.

Step S100, capturing expression images of a plurality of facial expressions of a target object from a plurality of view angles through an image acquisition device, and obtaining a multi-view expression image sequence corresponding to each facial expression of the target object.

In this embodiment, the multi-view expression image sequence corresponding to each facial expression includes at least one facial expression image at each view angle, obtained by capturing the facial expression of the target object from a plurality of different view angles.

For example, as shown in fig. 3, a camera array distributed around the target object (such as an anchor) may be used to capture expression images of each facial expression of the target object, so as to obtain one multi-view expression image sequence per facial expression. In an alternative preferred embodiment, also shown in fig. 3, twelve 4D digital human capture devices (4D video cameras) may be used to form a high-precision array camera system serving as the image acquisition device, which captures the target object's facial expressions from multiple views, thereby obtaining multiple multi-view expression image sequences corresponding to the different facial expressions.

Illustratively, the high-precision array camera system of this embodiment may include twelve 4D video cameras (capture devices) C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, and C12 as shown in fig. 3, arranged according to pre-calibrated capture parameters (including, for example, camera focal length, shooting angle, and camera position distribution). When a multi-view expression sequence of the target object needs to be acquired, the target object makes the various predefined expressions (including extreme expressions) as required, and while each expression is held the 4D cameras capture expression images of the target object from their different views. For example, twelve facial expression images P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, and P12 at twelve view angles, as shown in fig. 3, may be obtained, and the multi-view expression image sequence is composed of these twelve facial expression images.
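As a rough sketch of how one such sequence might be assembled, the snippet below grabs one frame from each of twelve cameras; the OpenCV device indices stand in for the actual 4D capture devices, whose vendor SDK and hardware synchronization are outside the scope of this illustration.

```python
import cv2

NUM_CAMERAS = 12  # C1..C12 arranged around the target object

def capture_multiview_sequence(camera_ids=range(NUM_CAMERAS)):
    """Grab one frame per camera while the target object holds one expression."""
    sequence = []
    for cam_id in camera_ids:
        cap = cv2.VideoCapture(cam_id)   # placeholder: one device index per view
        ok, frame = cap.read()
        cap.release()
        if ok:
            sequence.append(frame)       # P1..P12 for this facial expression
    return sequence                      # one multi-view expression image sequence
```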

Further, in order to make the subsequent expression driving of the virtual digital avatar more refined and delicate, the predefined expressions in this embodiment may include a large amount (e.g., hundreds of thousands) of expression content. For example, during facial expression image capture, the performance content required for making extreme expressions may be defined in advance, and the target object (such as an anchor) is asked to make the various expressions as defined. Illustratively, the first category is extreme expression content: the target object is required to make predefined extreme expressions, and the motion process of each extreme expression is captured with the 4D devices in order to record the maximum range of facial motion. These extreme expressions may include, but are not limited to, opening the mouth as wide as possible, moving the chin as far sideways and forward as possible, pursing the lips, and opening the eyes wide and then squeezing them shut. The second category is FACS-type expressions, which are typically conventional preset expressions such as squinting, opening the mouth, frowning, and cheek movements. The third category is preset speaking content: sentences and classical poems containing common syllables may be selected for this part, so that all common syllables are covered when the target object speaks and the corresponding mouth shapes are made, which is beneficial for capturing facial expressions.

Step S200, obtaining a three-dimensional expression model corresponding to each facial expression through multi-view reconstruction according to the multi-view expression image sequence corresponding to each facial expression.

In detail, the multi-view reconstruction described in this embodiment means that a three-dimensional mesh model is reconstructed using images at each view angle in a multi-view expression image sequence corresponding to each facial expression, and the reconstructed three-dimensional expression model is a three-dimensional mesh (mesh) model that can express the facial expression of the target object. In this embodiment, the three-dimensional expression model may include model vertices formed by a plurality of different key points (e.g., eye key points, eyebrow key points, nose key points, mouth key points, chin key points), and the like, and the three-dimensional mesh model formed by different mesh patches (e.g., triangular mesh patches or polygonal mesh patches) may be obtained by connecting the different model vertices according to a set topological order.
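A minimal way to represent such a mesh (vertex coordinates plus triangular patches connected in a fixed topological order) is sketched below; the structure is generic and the example values are illustrative only.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ExpressionMesh:
    vertices: np.ndarray  # (V, 3): model vertices, e.g. eye/brow/nose/mouth/chin key points
    faces: np.ndarray     # (F, 3): vertex indices, one triangular mesh patch per row

# Toy example: a single triangular patch spanning three key points.
mesh = ExpressionMesh(
    vertices=np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    faces=np.array([[0, 1, 2]]),
)
```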

In this embodiment, in one exemplary implementation of a possible multi-view reconstruction, for example, as shown in fig. 4, the step S200 may include the following steps S210-S240, which are exemplarily described as follows.

Step S210, for each multi-view expression image sequence, performing key point extraction on each facial expression image in the multi-view expression image sequence to obtain a facial key point included in each facial expression image.

In detail, in one possible implementation manner, in step S210, a key point SDK (Software Development Kit) may be adopted to perform facial key point extraction based on each of the facial expression images. The key point SDK may be any mature key point acquisition tool in the market at present, which is not limited in this embodiment.
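As one concrete illustration of such a key point tool, the sketch below uses MediaPipe Face Mesh purely as a stand-in for the key point SDK mentioned above; any comparable detector could be substituted.

```python
import cv2
import mediapipe as mp  # stand-in for the key point SDK; any mature tool works

def extract_face_keypoints(image_bgr):
    """Return the (x, y) pixel coordinates of facial key points in one expression image."""
    h, w = image_bgr.shape[:2]
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True) as face_mesh:
        result = face_mesh.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        return []
    landmarks = result.multi_face_landmarks[0].landmark
    return [(lm.x * w, lm.y * h) for lm in landmarks]
```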

Step S220, using one facial expression image in the multi-view expression image sequence as a reference image, sequentially traversing each facial key point in the reference image, and searching for facial key points corresponding to each facial key point in the reference image in other facial expression images in the multi-view expression image sequence.

In this embodiment, the facial expression image with the most complete set of facial key points may be used as the reference image; for example, the facial expression image captured by the 4D camera facing the target object's face directly (i.e., a front face image) may be used as the reference image.

Step S230, determining the position information of each facial key point in the reference image according to the facial key points found in the other facial expression images that correspond to that facial key point.

In this embodiment, the position information of each facial key point may be a three-dimensional coordinate of each facial key point.

Step S240, reconstructing to obtain a three-dimensional expression model of facial expression corresponding to the multi-view expression image sequence according to the position information of each facial key point.

For example, adjacent facial key points may be connected in topological order according to the position information of each of the facial key points. For example, three adjacent facial key points may form a triangular model patch, and a plurality of different triangular model patches are connected in topological order to form a corresponding three-dimensional expression model.
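For reference, the core of steps S220 to S240 is standard multi-view triangulation. The sketch below recovers one key point's 3D position from the reference image and one other view, assuming the projection matrices come from the pre-calibrated capture parameters; a full reconstruction would fuse all matched views and then connect the recovered vertices into mesh patches.

```python
import cv2
import numpy as np

def triangulate_keypoint(pt_ref, pt_other, P_ref, P_other):
    """Recover the 3D position of one facial key point from two corresponding
    2D observations, given the 3x4 projection matrices of the two cameras."""
    point_4d = cv2.triangulatePoints(
        P_ref, P_other,
        np.asarray(pt_ref, dtype=float).reshape(2, 1),
        np.asarray(pt_other, dtype=float).reshape(2, 1),
    )
    return (point_4d[:3] / point_4d[3]).ravel()  # homogeneous -> (x, y, z)
```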

Step S300, training a preset neural network through the multi-view expression image sequences corresponding to the facial expressions respectively and the three-dimensional expression models corresponding to the facial expressions respectively to obtain an expression prediction neural network.

In this embodiment, for step S300, as shown in fig. 5, the specific implementation steps of training the preset neural network to obtain the expression prediction neural network may include the following steps S310 to S330, which are exemplarily described as follows.

Step S310, topological mapping is carried out on the three-dimensional expression models respectively corresponding to the facial expressions to obtain a regular three-dimensional grid model meeting set rules.

Specifically, in this embodiment, the three-dimensional expression model reconstructed from each multi-view expression image sequence may be an irregular three-dimensional mesh model; for example, facial key points with the same semantic meaning may have different vertex numbers (topological relations) in different three-dimensional expression models. In that case, the neural network is difficult to converge quickly during subsequent training, because the three-dimensional expression models are not regular three-dimensional mesh models. Therefore, in step S310, topological mapping is performed on each three-dimensional expression model so that the same facial key point has the same vertex number in every model, and the regularized three-dimensional mesh models corresponding to the three-dimensional expression models all have the same number of model vertices.
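A heavily simplified illustration of this regularization is given below: every reconstructed model is re-indexed so that a vertex with a given semantic meaning always occupies the same position in the vertex array. A real re-topology step would instead register a template mesh with a fixed vertex count to each scan; the semantic names here are hypothetical.

```python
import numpy as np

# Hypothetical canonical ordering shared by all regularized meshes
# (a real template would contain far more vertices).
CANONICAL_ORDER = ["left_eye", "right_eye", "nose_tip", "mouth_left", "mouth_right"]

def regularize_vertices(vertices, name_to_index):
    """Reorder an irregular model's vertices so that the same facial key point
    gets the same vertex number in every three-dimensional expression model."""
    return np.stack([vertices[name_to_index[name]] for name in CANONICAL_ORDER])
```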

Step S320, for each multi-view expression image sequence, determining at least one training sample based on facial expression images in the multi-view expression image sequence.

In detail, in one alternative implementation of this embodiment, the facial expression image corresponding to a preset shooting view angle in the multi-view expression image sequence may be used as the training sample. For example, the facial expression image (front face image) captured by the 4D camera directly facing the target object's face may be used as the training sample.

In another alternative embodiment, the facial expression image corresponding to a preset shooting view angle in the multi-view expression image sequence may first be used as a reference sample, data enhancement is then performed on the reference sample to obtain at least one enhanced sample, and finally the reference sample and the at least one enhanced sample are used as the training samples. The reference sample may be the facial expression image (front face image) captured by the 4D camera directly facing the target object's face. The data enhancement may include, but is not limited to, one or a combination of two or more of rotation, mirroring, brightness adjustment, and noise injection applied to the reference sample. The training samples obtained through data enhancement further increase the amount of training data, which enhances the robustness of the trained neural network. After data enhancement, each reference sample (such as a front face image) yields M training images, and the training labels of these M training samples all correspond to the same regularized three-dimensional mesh model, namely the regularized three-dimensional mesh model corresponding to the multi-view expression image sequence to which the reference sample belongs. Thus, if there are N (e.g., ten thousand) multi-view expression image sequences, the training data set obtained after data enhancement contains N × M training samples.
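The following sketch shows one possible way to generate enhanced samples from a reference front-face image using the enhancement modes listed above; the rotation range, brightness offset, and noise level are arbitrary illustrative parameters.

```python
import cv2
import numpy as np

def augment(reference, num_samples=4, seed=0):
    """Derive enhanced samples from one reference image by rotation, mirroring,
    brightness adjustment, and noise injection; all share the reference's label."""
    rng = np.random.default_rng(seed)
    h, w = reference.shape[:2]
    samples = [reference]
    for _ in range(num_samples):
        angle = rng.uniform(-10, 10)                                  # small rotation
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        img = cv2.warpAffine(reference, M, (w, h))
        if rng.random() < 0.5:
            img = cv2.flip(img, 1)                                    # horizontal mirror
        img = cv2.convertScaleAbs(img, alpha=1.0, beta=rng.uniform(-20, 20))  # brightness
        noise = rng.normal(0.0, 5.0, img.shape)                       # noise injection
        img = np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
        samples.append(img)
    return samples
```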

Step S330, inputting the training samples into the neural network in sequence to obtain a predicted three-dimensional mesh model output by the neural network, calculating a loss function value of the neural network according to the predicted three-dimensional mesh model and the sample label corresponding to the training sample, and iteratively updating network parameters of the neural network according to the loss function value until a training termination condition is met, to obtain the expression prediction neural network.

In this embodiment, the sample label of the training sample is a regularized three-dimensional mesh model corresponding to the multi-view expression image sequence to which the training sample belongs. In this way, the loss function value may be obtained from the degree of matching (or similarity) between the predicted three-dimensional mesh model and the regularized three-dimensional mesh model, or may be the degree of matching, and the higher the degree of matching, the smaller the loss function value. The training termination condition may be that the loss function value is smaller than a set loss function threshold, or that the number of training iterations reaches a preset number.
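A minimal training loop consistent with step S330 is sketched below in PyTorch. The network architecture, the mean-squared per-vertex error as the loss, and the hyperparameters are assumptions made for illustration; the application only requires that the loss decrease as the predicted mesh matches the regularized mesh label more closely.

```python
import torch
import torch.nn as nn

def train_expression_net(model, dataloader, epochs=50, loss_threshold=1e-4, lr=1e-4):
    """Train a network that maps a face image to V*3 vertex coordinates,
    supervised by the regularized mesh of each sample's image sequence."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()                                  # per-vertex position error
    for _ in range(epochs):
        for images, target_vertices in dataloader:            # (B,3,H,W), (B,V,3)
            pred_vertices = model(images).view(target_vertices.shape)
            loss = criterion(pred_vertices, target_vertices)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss.item() < loss_threshold:                       # one termination condition
            break
    return model
```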

In this embodiment, it should be understood that, in some possible application scenarios, the three-dimensional expression model obtained by the above three-dimensional reconstruction may also be directly used for training the neural network without performing the re-topology mapping, that is, in some other possible embodiments, the step S310 may be omitted. Based on this, in the case where the step S310 is omitted, the step S300 may specifically include the following.

Firstly, for each multi-view expression image sequence, at least one training sample is determined based on facial expression images in the multi-view expression image sequence. The method for determining the training samples is substantially the same as the step of step S320, and is not repeated here.

Then, the training samples are sequentially input into the neural network to obtain a predicted three-dimensional mesh model output by the neural network, a loss function value of the neural network is calculated according to the predicted three-dimensional mesh model and the sample label corresponding to the training sample, and network parameters of the neural network are iteratively updated according to the loss function value until a training termination condition is met, so that the expression prediction neural network is obtained. In this case, the sample label of the training sample is the three-dimensional expression model reconstructed from the multi-view expression image sequence to which the training sample belongs. The loss function value may be obtained according to the matching degree (or similarity) between the predicted three-dimensional mesh model and the reconstructed three-dimensional expression model, or may be the matching degree itself. The training termination condition may be that the loss function value is smaller than a set loss function threshold, or that the number of training iterations reaches a preset number.

As an example, please refer to fig. 6, which shows a schematic diagram of a training process of the neural network according to the present embodiment, and the training is generally described below with reference to fig. 6.

The training process comprises two main branches, wherein one branch is used for reconstructing a regular three-dimensional expression model through collected training data, the other branch is used for predicting the training data through a neural network, and error calculation is carried out through the predicted three-dimensional expression model and the reconstructed regular three-dimensional expression model so as to adjust network parameters of the neural network.

In detail, firstly, the 4D camera array may be used to perform multi-view shooting on various predefined expressions made by the target object, so as to obtain multi-view expression image sequences corresponding to different facial expressions, respectively, and form a training data set.

Secondly, performing three-dimensional model reconstruction on each facial expression image in each multi-view expression image sequence in the training data set to obtain a three-dimensional expression model M1 corresponding to each facial expression. And then carrying out topological mapping on the three-dimensional expression model corresponding to each facial expression to obtain a regularized three-dimensional expression model M2 of each facial expression.

Thirdly, a training sample P (such as a front face image) is obtained from the training data set, and the training sample P is input into the neural network for model prediction, so that a predicted three-dimensional expression model M3 corresponding to the training sample P is obtained.

Fourthly, calculating the model vertex error between the predicted three-dimensional expression model M3 and the regularized three-dimensional expression model M2 corresponding to the multi-view expression image sequence to which the training sample P belongs, so as to obtain a loss function value, and adjusting the network parameters of the neural network according to the loss function value until a training termination condition is met, at which point the expression prediction neural network is obtained. The model vertex error may be the position offset between the three-dimensional coordinates of corresponding model vertices and may be represented, for example, by a Euclidean distance, which is not limited in this embodiment.

Step S400, inputting the acquired facial image of the target object into the expression prediction neural network to obtain a three-dimensional expression model of the target object, and driving the facial expression of the virtual digital image in the live broadcast picture according to the three-dimensional expression model.

In detail, in this embodiment, the face of the target object may be photographed by any one of the 4D cameras to obtain a facial image of the target object, and the facial image is then input into the expression prediction neural network for expression prediction to obtain a three-dimensional expression model of the target object. In step S400, only a single 4D camera is needed: the camera is aimed at the front face of the target object, each frame of image data is preprocessed to the input size used for the network's training data and then fed into the expression prediction neural network for prediction, and the network outputs a regular three-dimensional expression model of the same form as the training labels, which expresses the face mesh data and is used to drive the virtual digital avatar according to the prediction result.
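A sketch of this single-camera inference step is given below; the 256x256 input size and the normalization are placeholders for whatever preprocessing was used on the training data.

```python
import cv2
import numpy as np
import torch

def predict_expression_mesh(model, frame_bgr, input_size=(256, 256)):
    """Preprocess one camera frame to the network's input size and predict
    the regular three-dimensional expression model (V x 3 vertex coordinates)."""
    img = cv2.resize(frame_bgr, input_size)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    tensor = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)   # (1, 3, H, W)
    with torch.no_grad():
        vertices = model(tensor).reshape(-1, 3)
    return vertices.numpy()
```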

For example, as shown in fig. 7, a 4D camera facing the target object's face continuously captures front face images P0 of the target object; each front face image P0 is input into the trained expression prediction neural network for prediction to obtain a predicted three-dimensional expression model M0, and the facial expression of the virtual digital avatar is finally driven by the predicted three-dimensional expression model M0, so that the virtual digital avatar expresses the facial expression of the target object in real time.

As an example, in this embodiment, each facial key point of the virtual digital avatar may be driven to move according to the position information of each model vertex (facial key point) included in the three-dimensional expression model output by the expression prediction neural network, so that the virtual digital avatar expresses the facial expression of the target object. Alternatively, the facial expression driving of the virtual digital avatar may be implemented by replacing the position coordinates of each corresponding facial key point of the virtual digital avatar with the position coordinates of each model vertex included in the three-dimensional expression model output by the expression prediction neural network.
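The second driving option (replacing key point coordinates with predicted vertex coordinates) can be sketched as follows; the avatar's key point container and the vertex-to-key-point mapping are hypothetical, since a real avatar would expose them through its rendering engine.

```python
def drive_avatar(avatar_keypoints, predicted_vertices, vertex_to_keypoint):
    """Overwrite each avatar facial key point with the coordinates of its
    corresponding model vertex from the predicted three-dimensional expression model."""
    for vertex_index, keypoint_name in vertex_to_keypoint.items():
        avatar_keypoints[keypoint_name] = tuple(predicted_vertices[vertex_index])
    return avatar_keypoints
```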

Referring to fig. 8, fig. 8 is a schematic view of a computer device for implementing the expression driving method according to an embodiment of the present application. In detail, the computer device may include one or more processors 110, a machine-readable storage medium 120, and an expression driving system 130. The processor 110 and the machine-readable storage medium 120 may be communicatively connected via a system bus. The machine-readable storage medium 120 stores machine-executable instructions, and the processor 110 implements the expression driving method described above by reading and executing the machine-executable instructions in the machine-readable storage medium 120.

The machine-readable storage medium 120 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The machine-readable storage medium 120 is used for storing a program, and the processor 110 executes the program after receiving an execution instruction.

The processor 110 may be an integrated circuit chip having signal processing capabilities. The processor may be, but is not limited to, a general-purpose processor such as a Central Processing Unit (CPU) or a Network Processor (NP), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), and the like.

Fig. 9 is a schematic diagram of the functional modules of the expression driving system 130. In this embodiment, the expression driving system 130 may include one or more software functional modules running on the computer device. These software functional modules may be stored in the machine-readable storage medium 120 in the form of a computer program, so that when they are called and executed by the processor 110, the expression driving method described in the embodiments of the present application is implemented.

In detail, the expression driving system 130 includes an image acquisition module 131, a three-dimensional reconstruction module 132, a network training module 133, and an expression driving module 134.

The image acquisition module 131 is configured to perform expression image shooting on a plurality of facial expressions of a target object from a plurality of viewing angles through an image acquisition device, and obtain a multi-view expression image sequence corresponding to each facial expression of the target object. The multi-view expression image sequence corresponding to each facial expression comprises at least one facial expression image under each view angle, wherein the at least one facial expression image is obtained by shooting the facial expression of the target object from a plurality of different view angles. In this embodiment, the image acquisition module 131 is configured to execute step S100 in the above method embodiment, and for the detailed content of the image acquisition module 131, reference may be made to the above detailed description of step S100, which is not described herein again.

The three-dimensional reconstruction module 132 is configured to obtain a three-dimensional expression model corresponding to each facial expression through multi-view reconstruction according to the multi-view expression image sequence corresponding to each facial expression. In this embodiment, the three-dimensional reconstruction module 132 is configured to execute the step S200 in the above method embodiment, and for the detailed content of the three-dimensional reconstruction module 132, reference may be made to the detailed description of the step S200, which is not repeated herein.

The network training module 133 is configured to train a preset neural network through the multi-view expression image sequences corresponding to the facial expressions respectively and the three-dimensional expression models corresponding to the facial expressions respectively, so as to obtain an expression prediction neural network. In this embodiment, the network training module 133 is configured to execute the step S300 in the above method embodiment, and for the detailed content of the network training module 133, reference may be made to the above detailed description of the step S300, which is not repeated herein.

The expression driving module 134 is configured to input the acquired facial image of the target object into the expression prediction neural network, obtain a three-dimensional expression model of the target object, and drive a facial expression of the virtual digital image in the live broadcast frame according to the three-dimensional expression model. In this embodiment, the expression driver module 134 is configured to execute step S400 in the above method embodiment, and for the detailed content of the expression driver module 134, reference may be made to the above detailed description of step S400, which is not described in detail herein.
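As a rough illustration of how the four functional modules might be organized in code (a sketch only; the class and method names are assumptions and do not come from the original), the system could expose one entry point per module:

```python
class ExpressionDrivingSystem:
    """Illustrative skeleton of the expression driving system 130 (names are assumed)."""

    def acquire_images(self, cameras, expressions):
        # Image acquisition module 131: capture a multi-view expression image
        # sequence for each facial expression of the target object (step S100).
        ...

    def reconstruct_models(self, image_sequences):
        # Three-dimensional reconstruction module 132: obtain a 3D expression model
        # for each facial expression by multi-view reconstruction (step S200).
        ...

    def train_network(self, image_sequences, expression_models):
        # Network training module 133: train the preset neural network to obtain
        # the expression prediction neural network (step S300).
        ...

    def drive_expression(self, face_image, avatar):
        # Expression driving module 134: predict the 3D expression model from a
        # facial image and drive the virtual digital image in the live frame (step S400).
        ...
```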

In summary, the expression driving method, system and computer device provided by the embodiments of the present application offer an innovative solution capable of outputting high-precision expression animation in real time. Compared with traditional expression driving schemes based on expression bases and the like, the scheme of this embodiment has high precision and good real-time performance. In particular, in a live broadcast room scenario, a one-to-one virtual digital image can be created for the anchor. Meanwhile, this embodiment can further train the neural network with the help of a high-precision reconstruction technique, providing a more refined solution: the facial expression of the anchor can be expressed more accurately and delicately through a more lifelike virtual digital image, making the live broadcast more vivid and interesting and greatly improving the virtual live broadcast effect and the user experience. Furthermore, the scheme provided by this embodiment does not rely on a large amount of manual labor, which can greatly improve production efficiency and reduce manufacturing cost.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above description is only for various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all such changes or substitutions are included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
