Voice-driven talking face video generation method based on a teacher-student network

Document No.: 1818109    Publication date: 2021-11-09

Note: this technology, "Voice-driven talking face video generation method based on a teacher-student network", was designed and created by 熊盛武, 陈燚雷, 曾瑞, 林承德 and 马宜祯 on 2021-07-19. Its main content is as follows. The invention relates to a voice-driven talking face video generation method based on a teacher-student network. A teacher network is first used to compress the dynamic information in video data; a student network then learns to predict this dynamic information from speech; finally, the face dynamic information extracted by the pre-trained teacher network is used as supervision and combined with face identity information to accomplish the voice-driven talking face task. Compared with traditional arbitrary talking face video generation techniques, the invention is the first to mine the dynamic information in the video signal, and it substantially improves face generation quality, picture sharpness, and the lip-shape accuracy of the generated talking face video.

1. A voice-driven talking face video generation method based on a teacher-student network, characterized by comprising the following steps:

step 1, acquiring a large-scale dataset of talking face videos;

step 2, extracting video frames and voice data from the data set obtained in the step 1;

step 3, extracting the face photos from the video frames of step 2, converting them into front-face photos, and cropping each into an N × N front-face photo I_1; extracting the MFCC features of the voice signal from step 2;

step 4, detecting the face feature points in the front-face photo I_1 cropped in step 3;

step 5, establishing and training a teacher network;

step 6, constructing and training a student network;

step 7, training the cascaded student network;

step 8, inputting the MFCC feature sequence extracted in step 3 and an arbitrary face photo I into the cascaded student network trained in step 7 to obtain the corresponding picture sequence, and synthesizing the picture sequence into a video with ffmpeg.

2. The voice-driven talking face video generation method based on a teacher-student network of claim 1, characterized in that constructing and training the teacher network in step 5 comprises the following steps:

step 5.1, the whole network adopts a self-supervised learning mode: the face feature points l_1 and l_2 detected in step 4 and the cropped front-face photo I_1 are encoded by three encoders f_1, f_2, f_3 respectively, producing latent variables z_1, z_2, z_3;

Step 5.2, let z_4 = concat((z_2 - z_1), z_3); decode z_4 with the decoder f_D to obtain the dynamic features of the cropped front-face photo I_1, namely the change region m and the change information c of the pixel values within m, calculated as follows:

(m, c) = f_D(z_4) (1)

step 5.3, using the parameters m and c calculated in step 5.2, combined with the cropped front-face photo I_1, obtain the synthesized photo I_1′:

I_1′ = m × c + (1 - m) × I_1 (2)

step 5.4, training the teacher network with the network architecture of the W-GAN-gp algorithm.

3. The voice-driven talking face video generation method based on a teacher-student network of claim 2, characterized in that training the teacher network with the W-GAN-gp network architecture in step 5.4 comprises a generator training stage and a discriminator training stage, which are trained alternately until the algorithm converges, at which point teacher-network training is finished; in the generator training stage, given the preprocessed face feature points l_1, l_2 and the cropped front-face photo I_1, the network generates the picture I_1′ from the predicted motion information m and c using the calculation procedure of steps 5.1-5.3, and the generator loss function l_loss is calculated as:

l_loss = l_rec + l_reg + l_gen (3)

l_rec = ||I_1 - I_1′||_1 (4)

l_reg = ||m||_1 (5)

l_gen = -D_I([I_1′, m]) (6)

where l_rec is the reconstruction loss, l_reg is the sparse regularization loss, l_gen is the adversarial loss, D_I(·) denotes the discriminator, and ||·||_1 denotes the L_1 norm.

4. The voice-driven talking face video generation method based on a teacher-student network of claim 3, characterized in that, in the discriminator training stage of step 5.4, the discriminator loss function l_dis of the W-GAN-gp discriminator part is calculated as follows:

l_dis = D_I([I_1′, m]) - D_I([I_1, m]) + λ·l_gp (7)

l_gp = (||∇_x̂ D_I(x̂)||_2 - 1)^2 (8)

where x̂ is sampled uniformly on the line segment between the real input [I_1, m] and the generated input [I_1′, m], ∇ denotes the gradient, D_I(·) denotes the discriminator, ||·|| denotes the L_2 norm, λ = 10, and l_gp is the Lipschitz penalty term introduced to prevent gradient explosion.

5. The voice-driven talking face video generation method based on a teacher-student network of claim 1, characterized in that constructing and training the student network in step 6 comprises the following steps:

step 6.1, using the MFCC features of the voice signal extracted in step 3, applying a 10 ms time window centered on the time point of each video frame to extract the corresponding MFCC segment;

step 6.2, using the teacher network trained in step 5, inputting the face feature points l_1, l_2 and the cropped front-face photo I_1 to obtain the change region m and the change information c of the pixel values within m;

step 6.3, inputting the 10 ms MFCC feature a_mfcc of the speech signal cut in step 6.1 and the cropped front-face photo I_1, encoding them with the speech encoder f_4 and the identity information encoder f_5 respectively to produce latent variables z_5 and z_6, then letting z_7 = concat(z_5, z_6);

Step 6.4, using a decoder to predict the motion information (m_s, c_s) from z_7;

Step 6.5, using the parameters m_s and c_s calculated in step 6.4, combined with the cropped front-face photo I_1, obtain the synthesized photo I_1s′:

I_1s′ = m_s × c_s + (1 - m_s) × I_1 (9)

step 6.6, training the student network with the network architecture of the W-GAN-gp algorithm.

6. The voice-driven talking face video generation method based on a teacher-student network of claim 5, characterized in that training the student network with the W-GAN-gp network architecture in step 6.6 comprises a generator training stage and a discriminator training stage, which are trained alternately until the algorithm converges, at which point student-network training is finished; in the generator training stage, given the MFCC feature a_mfcc and the cropped front-face photo I_1, the student network generates the picture I_1s′ from the predicted motion information m_s and c_s using the calculation procedure of steps 6.2-6.5, and the generator loss function l′_loss is calculated as:

l′_loss = l′_rec + l′_reg + l′_gen + l_mot (10)

l′_rec = ||I_1 - I_1s′||_1 (11)

l′_reg = ||m||_1 (12)

l′_gen = -D_I([I_1s′, m]) (13)

l_mot = ||m_s - m||_1 + ||c_s - c||_1 (14)

where l′_rec is the reconstruction loss, l′_reg is the sparse regularization loss, l′_gen is the adversarial loss, l_mot is the motion-information supervision loss, D_I(·) denotes the discriminator, and ||·||_1 denotes the L_1 norm.

7. The voice-driven talking face video generation method based on a teacher-student network of claim 6, characterized in that, in the discriminator training stage of step 6.6, the discriminator loss function l′_dis of the W-GAN-gp discriminator part is:

l′_dis = D_I([I_1s′, m]) - D_I([I_1, m]) + λ·l′_gp (15)

l′_gp = (||∇_x̂ D_I(x̂)||_2 - 1)^2 (16)

where x̂ is sampled uniformly on the line segment between the real input [I_1, m] and the generated input [I_1s′, m], ∇ denotes the gradient, D_I(·) denotes the discriminator, ||·|| denotes the L_2 norm, λ = 10, and l′_gp is the Lipschitz penalty term introduced to prevent gradient explosion.

8. The voice-driven talking face video generation method based on a teacher-student network of claim 1, characterized in that training the cascaded student network in step 7 comprises the following steps:

step 7.1, passing the MFCC feature sequence {a_1, a_2, ..., a_n} extracted in step 3 through the speech encoder f_4 of step 6.3 to obtain the speech latent-variable sequence {a′_1, a′_2, ..., a′_n};

Step 7.2, inputting a face identity photo I_1 to the identity encoder f_5 of step 6.3 to obtain an identity latent variable z, and splicing z with the speech latent-variable sequence {a′_1, a′_2, ..., a′_n} to obtain the latent-variable sequence {b_1, b_2, ..., b_n};

Step 7.3, in order to model the temporal dependence of the sequence, inputting the latent-variable sequence {b_1, b_2, ..., b_n} into an LSTM network to obtain a latent-variable sequence {b′_1, b′_2, ..., b′_n} containing temporal information; each latent variable in {b′_1, b′_2, ..., b′_n} is then trained according to steps 6.4-6.6 to generate the picture sequence {I_1a, I_2a, ..., I_na}.

Technical Field

The invention relates to the fields of multimedia and artificial intelligence, in particular to a voice-driven talking face video generation method based on a teacher-student network.

Background

Arbitrary talking face video generation takes a front-face photo of an arbitrary person and a segment of that person's speech as input, and generates a front-facing talking video of the person with accurate lip movements and expression changes. Generating a natural and smooth talking face video from a single face picture and a speech recording is very challenging: it requires generating multiple face frames that preserve the identity characteristics, and it requires the face variations, especially the lip variations, to be consistent with the input speech in the time domain. Talking face video generation has very broad application prospects and potential in fields such as virtual anchors, smart homes, and character production for games and movies.

The task of generating a talking face can be traced back to the 1990s, when a face was modeled with a sparse mesh and the mesh motion was driven by a speech signal. In the early 2000s, Ezzat proposed a "make it talk" scheme: a certain number of talking face videos of a single person are collected to form a single-person video library, text is converted into phoneme signals, the most suitable visemes for those phonemes are retrieved from the library, and intermediate frames between visemes are computed with optical flow to generate a video. In recent years, with the growth of computing power, the construction of large-scale datasets, and the rise of deep learning, Joon Son Chung of the VGG group, in the 2016 paper "You said that?", first used an encoder-decoder learning structure trained on the large-scale dataset LRW to generate a talking video of a single face from a single face photo and speech audio. Subsequent techniques use video frames as ground truth for self-supervised learning of the network, but none of these methods adequately mine the dynamic information in the video.

Disclosure of Invention

Aiming at the defects of the prior art, the invention combines the strengths of generative adversarial networks and knowledge distillation for image generation, on the basis of a deep-learning autoencoder generative model, and provides a voice-driven talking face video generation method based on a teacher-student network. A teacher network is first used to compress the dynamic information in video data; a student network then learns to predict this dynamic information from speech; finally, the face dynamic information extracted by the pre-trained teacher network is used as supervision and combined with face identity information to accomplish the voice-driven talking face task.

In order to achieve the aim, the technical scheme provided by the invention is a voice-driven talking face video generation method based on a teacher-student network, which comprises the following steps:

step 1, acquiring a large-scale dataset of talking face videos;

step 2, extracting video frames and voice data from the data set obtained in the step 1 by using an ffmpeg tool;

step 3, extracting the face photos from the video frames of step 2 with the face detection tool provided by the dlib library, converting them into front-face photos, and cropping each into an N × N front-face photo I_1; extracting the MFCC features of the voice signal from step 2 with the speech processing toolkit python_speech_features;

step 4, detecting the face feature points in the front-face photo I_1 cropped in step 3 with the face alignment tool provided by the face_alignment library;

step 5, establishing and training a teacher network;

step 6, constructing and training a student network;

step 7, training the cascaded student network;

step 8, inputting the MFCC feature sequence extracted in step 3 and an arbitrary face photo I into the cascaded student network trained in step 7 to obtain the corresponding picture sequence, and synthesizing the picture sequence into a video with ffmpeg.

Moreover, the step 5 of constructing and training the teacher network includes the following steps:

step 5.1, the whole network adopts a self-supervised learning mode: the face feature points l_1 and l_2 detected in step 4 and the cropped front-face photo I_1 are encoded by three encoders f_1, f_2, f_3 respectively, producing latent variables z_1, z_2, z_3;

Step 5.2, let z_4 = concat((z_2 - z_1), z_3); decode z_4 with the decoder f_D to obtain the dynamic features of the cropped front-face photo I_1, namely the change region m and the change information c of the pixel values within m, calculated as follows:

(m, c) = f_D(z_4) (1)

step 5.3, using the parameters m and c calculated in step 5.2, combined with the cropped front-face photo I_1, obtain the synthesized photo I_1′:

I_1′ = m × c + (1 - m) × I_1 (2)

step 5.4, training the teacher network with the network architecture of the W-GAN-gp algorithm.

Moreover, the training of the teacher network in the step 5.4 by using the network architecture of the W-GAN-gp algorithm includes a generator training phase and a discriminator training phase:

step 5.4.1, generator training stage: given the preprocessed face feature points l_1, l_2 and the cropped front-face photo I_1, the network generates the picture I_1′ from the predicted motion information m and c using the calculation procedure of steps 5.1-5.3, and the generator loss function l_loss is calculated as:

l_loss = l_rec + l_reg + l_gen (3)

l_rec = ||I_1 - I_1′||_1 (4)

l_reg = ||m||_1 (5)

l_gen = -D_I([I_1′, m]) (6)

where l_rec is the reconstruction loss, l_reg is the sparse regularization loss, l_gen is the adversarial loss, D_I(·) denotes the discriminator, and ||·||_1 denotes the L_1 norm.

Step 5.4.2, discriminator training stage: using the discriminator part of W-GAN-gp, the discriminator loss function l_dis is calculated as follows:

l_dis = D_I([I_1′, m]) - D_I([I_1, m]) + λ·l_gp (7)

l_gp = (||∇_x̂ D_I(x̂)||_2 - 1)^2 (8)

where x̂ is sampled uniformly on the line segment between the real input [I_1, m] and the generated input [I_1′, m], ∇ denotes the gradient, D_I(·) denotes the discriminator, ||·|| denotes the L_2 norm, λ = 10, and l_gp is the Lipschitz penalty term introduced to prevent gradient explosion.

The generator and discriminator stages are trained alternately until the algorithm converges, at which point teacher-network training is finished.

Moreover, the step 6 of constructing and training the student network includes the following steps:

step 6.1, using the MFCC features of the voice signal extracted in step 3, applying a 10 ms time window centered on the time point of each video frame to extract the corresponding MFCC segment;

step 6.2, using the teacher network trained in step 5, inputting the face feature points l_1, l_2 and the cropped front-face photo I_1 to obtain the change region m and the change information c of the pixel values within m;

step 6.3, inputting the 10 ms MFCC feature a_mfcc of the speech signal cut in step 6.1 and the cropped front-face photo I_1, encoding them with the speech encoder f_4 and the identity information encoder f_5 respectively to produce latent variables z_5 and z_6, then letting z_7 = concat(z_5, z_6);

Step 6.4, using a decoder to predict the motion information (m_s, c_s) from z_7;

Step 6.5, using the parameters m_s and c_s calculated in step 6.4, combined with the cropped front-face photo I_1, obtain the synthesized photo I_1s′:

I_1s′ = m_s × c_s + (1 - m_s) × I_1 (9)

step 6.6, training the student network with the network architecture of the W-GAN-gp algorithm.

Moreover, the training of the student network in the step 6.6 by using the network architecture of the W-GAN-gp algorithm includes a generator training phase and a discriminator training phase:

step 6.6.1, Generator training phase, given MFCC feature amfccAnd a cut-out front face photograph I1Using the calculation process of steps 6.2-6.5, the student network passes the predicted movement information msAnd csGenerate picture I'1sAnd calculating a loss function l 'of the generator'loss

l′_loss = l′_rec + l′_reg + l′_gen + l_mot (10)

l′_rec = ||I_1 - I_1s′||_1 (11)

l′_reg = ||m||_1 (12)

l′_gen = -D_I([I_1s′, m]) (13)

l_mot = ||m_s - m||_1 + ||c_s - c||_1 (14)

where l′_rec is the reconstruction loss, l′_reg is the sparse regularization loss, l′_gen is the adversarial loss, l_mot is the motion-information supervision loss, D_I(·) denotes the discriminator, and ||·||_1 denotes the L_1 norm.

Step 6.6.2, discriminator training stage: using the discriminator part of W-GAN-gp, the discriminator loss function l′_dis is:

l′_dis = D_I([I_1s′, m]) - D_I([I_1, m]) + λ·l′_gp (15)

l′_gp = (||∇_x̂ D_I(x̂)||_2 - 1)^2 (16)

where x̂ is sampled uniformly on the line segment between the real input [I_1, m] and the generated input [I_1s′, m], ∇ denotes the gradient, D_I(·) denotes the discriminator, ||·|| denotes the L_2 norm, λ = 10, and l′_gp is the Lipschitz penalty term introduced to prevent gradient explosion.

The generator and discriminator stages are trained alternately until the algorithm converges, at which point student-network training is finished.

Moreover, the step 7 of training the cascaded student network includes the following steps:

step 7.1, extracting the MFCC characteristic sequence { a ] extracted in step 31,a2,...anThe sequence goes through the speech coder f in step 6.34Obtaining a voice hidden variable sequence { a'1,a′2,...a′n};

Step 7.2, inputting a face identity photo I_1 to the identity encoder f_5 of step 6.3 to obtain an identity latent variable z, and splicing z with the speech latent-variable sequence {a′_1, a′_2, ..., a′_n} to obtain the latent-variable sequence {b_1, b_2, ..., b_n};

Step 7.3, in order to model the temporal dependence of the sequence, inputting the latent-variable sequence {b_1, b_2, ..., b_n} into an LSTM network to obtain a latent-variable sequence {b′_1, b′_2, ..., b′_n} containing temporal information; each latent variable in {b′_1, b′_2, ..., b′_n} is then trained according to steps 6.4-6.6 to generate the picture sequence {I_1a, I_2a, ..., I_na}.

Compared with the prior art, the invention has the following advantages: compared with traditional arbitrary talking face video generation techniques, the invention is the first to mine the dynamic information in the video signal, and it substantially improves the quality of the generated faces, the picture sharpness, and the lip-shape accuracy of the generated talking face video.

Drawings

Fig. 1 is a network structure diagram according to an embodiment of the present invention.

Fig. 2 is a block diagram of the teacher network model based on an adversarial network in the embodiment.

Fig. 3 is a block diagram of the student network model based on an adversarial network in the embodiment.

Fig. 4 is a block diagram of the cascaded student network model based on an adversarial network in the embodiment.

Detailed Description

The invention provides a voice-driven talking face video generation method based on a teacher-student network: a teacher network is first used to compress the dynamic information in video data; a student network then learns to predict this dynamic information from speech; finally, the face dynamic information extracted by the pre-trained teacher network is used as supervision and combined with face identity information to accomplish the voice-driven talking face task.

The technical solution of the present invention is further explained with reference to the drawings and the embodiments.

As shown in fig. 1, the process of the embodiment of the present invention includes the following steps:

Step 1, acquiring a large-scale dataset of talking face videos.

Step 2, extracting video frames and voice data from the dataset acquired in step 1 with the ffmpeg tool.
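The following Python sketch shows one possible way to carry out this extraction by invoking ffmpeg through subprocess; the 25 fps frame rate, 16 kHz sampling rate, and output paths are illustrative assumptions rather than values fixed by the invention.

```python
# Illustrative sketch of step 2: split each video into image frames and a mono WAV track.
import subprocess
from pathlib import Path


def extract_frames_and_audio(video_path: str, out_dir: str, fps: int = 25, sr: int = 16000):
    out = Path(out_dir)
    (out / "frames").mkdir(parents=True, exist_ok=True)
    # Video frames at a fixed frame rate.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vf", f"fps={fps}",
                    str(out / "frames" / "%05d.png")], check=True)
    # 16 kHz mono PCM audio for later MFCC extraction.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", str(sr),
                    "-acodec", "pcm_s16le", str(out / "audio.wav")], check=True)
```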

Step 3, extracting the face picture from each video frame of step 2 with the face detection tool provided by the dlib library, converting it into a front-face picture, and cropping it into an N × N (N may be 64, 128, or 256) front-face photo I_1; the MFCC features of the speech signal from step 2 are extracted with the speech processing toolkit python_speech_features.

Step 4, detecting the face feature points in the front-face photo I_1 cropped in step 3 with the face alignment tool provided by the face_alignment library.
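A minimal Python sketch of steps 3 and 4 follows, using dlib for face detection and cropping, python_speech_features for MFCC extraction, and the face_alignment library for the 68 landmark points; the frontalization step is omitted, the crop size and MFCC parameters are assumptions, and the face_alignment API names differ slightly between library versions.

```python
# Minimal sketch of steps 3-4 (frontalization omitted; parameters are illustrative).
import cv2
import dlib
import face_alignment
import numpy as np
from python_speech_features import mfcc
from scipy.io import wavfile

detector = dlib.get_frontal_face_detector()
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, device="cpu")


def crop_face(frame_bgr: np.ndarray, n: int = 128) -> np.ndarray:
    """Detect the largest face in a frame and crop/resize it to an N x N photo I_1."""
    rects = detector(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY), 1)
    r = max(rects, key=lambda r: r.width() * r.height())
    face = frame_bgr[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
    return cv2.resize(face, (n, n))


def face_landmarks(face_rgb: np.ndarray) -> np.ndarray:
    """68 x 2 facial landmarks used as l_1 / l_2 (first detected face)."""
    return fa.get_landmarks(face_rgb)[0]


def mfcc_features(wav_path: str) -> np.ndarray:
    """Frame-level MFCCs of the speech track (13 coefficients, 25 ms window, 10 ms hop)."""
    rate, signal = wavfile.read(wav_path)
    return mfcc(signal, samplerate=rate, winlen=0.025, winstep=0.01, numcep=13)
```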

Step 5, constructing and training the teacher network.

Step 5.1, the whole network adopts a self-supervised learning mode: first, the face feature points l_1 and l_2 detected in step 4 and the cropped front-face photo I_1 are encoded by three encoders f_1, f_2, f_3 respectively, producing latent variables z_1, z_2, z_3.

Step 5.2, let z_4 = concat((z_2 - z_1), z_3); decode z_4 with the decoder f_D to obtain, for the cropped front-face photo I_1, the change region m and the change information c of the pixel values within m.

The dynamic features m and c are calculated as follows:

(m, c) = f_D(z_4) (1)

Step 5.3, the parameters m and c calculated in step 5.2 are combined with the cropped front-face photo I_1 to obtain the synthesized photo I_1′.

The synthesized photo I_1′ is calculated as follows:

I_1′ = m × c + (1 - m) × I_1 (2)
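A hedged PyTorch sketch of the teacher forward pass described in steps 5.1-5.3 is given below; only the data flow of equations (1) and (2) is taken from the text, while the concrete encoder and decoder architectures are placeholders supplied by the caller.

```python
# Hedged PyTorch sketch of the teacher forward pass (steps 5.1-5.3).
# f1, f2, f3, fD are placeholder modules; only the data flow comes from the text.
import torch
import torch.nn as nn


class TeacherNet(nn.Module):
    def __init__(self, f1: nn.Module, f2: nn.Module, f3: nn.Module, fD: nn.Module):
        super().__init__()
        self.f1, self.f2, self.f3, self.fD = f1, f2, f3, fD

    def forward(self, l1, l2, I1):
        z1, z2, z3 = self.f1(l1), self.f2(l2), self.f3(I1)   # latent codes of l_1, l_2, I_1
        z4 = torch.cat([z2 - z1, z3], dim=1)                 # z_4 = concat((z_2 - z_1), z_3)
        m, c = self.fD(z4)                                   # (m, c) = f_D(z_4), eq. (1)
        I1_syn = m * c + (1.0 - m) * I1                      # I_1' = m*c + (1 - m)*I_1, eq. (2)
        return I1_syn, m, c
```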

Step 5.4, training the teacher network with the network architecture of the W-GAN-gp algorithm.

Step 5.4.1, generator training stage: given the preprocessed face feature points l_1, l_2 and the cropped front-face photo I_1, the network generates the picture I_1′ from the predicted motion information m and c using the calculation procedure of steps 5.1-5.3. The generator loss function l_loss comprises three terms, the reconstruction loss l_rec, the sparse regularization loss l_reg, and the adversarial loss l_gen, calculated as follows:

l_loss = l_rec + l_reg + l_gen (3)

l_rec = ||I_1 - I_1′||_1 (4)

l_reg = ||m||_1 (5)

l_gen = -D_I([I_1′, m]) (6)

where D_I(·) denotes the discriminator and ||·||_1 denotes the L_1 norm.

Step 5.4.2, discriminator training stage: using the discriminator part of W-GAN-gp, the discriminator loss function l_dis is calculated as follows:

l_dis = D_I([I_1′, m]) - D_I([I_1, m]) + λ·l_gp (7)

l_gp = (||∇_x̂ D_I(x̂)||_2 - 1)^2 (8)

where x̂ is sampled uniformly on the line segment between the real input [I_1, m] and the generated input [I_1′, m], ∇ denotes the gradient, D_I(·) denotes the discriminator, ||·|| denotes the L_2 norm, λ = 10, and l_gp is the Lipschitz penalty term introduced to prevent gradient explosion.
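The sketch below shows how the generator loss of equations (3)-(6) and the W-GAN-gp discriminator loss with gradient penalty could be computed in PyTorch; the channel-wise concatenation [image, m] fed to the discriminator and the interpolation-based penalty are assumptions that follow the standard WGAN-GP recipe implied, but not spelled out, by the text.

```python
# Sketch of the W-GAN-gp training losses of step 5.4 (lambda = 10 as in the text).
import torch


def generator_loss(D, I1, I1_syn, m):
    l_rec = (I1 - I1_syn).abs().mean()                       # reconstruction loss, eq. (4)
    l_reg = m.abs().mean()                                   # sparse regularization, eq. (5)
    l_gen = -D(torch.cat([I1_syn, m], dim=1)).mean()         # adversarial loss, eq. (6)
    return l_rec + l_reg + l_gen                             # l_loss, eq. (3)


def discriminator_loss(D, I1, I1_syn, m, lam=10.0):
    m_d = m.detach()
    real = torch.cat([I1, m_d], dim=1)
    fake = torch.cat([I1_syn.detach(), m_d], dim=1)
    w_term = D(fake).mean() - D(real).mean()                 # Wasserstein critic term
    # Lipschitz (gradient) penalty l_gp on points interpolated between real and fake inputs.
    eps = torch.rand(I1.size(0), 1, 1, 1, device=I1.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    l_gp = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return w_term + lam * l_gp
```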

The generator and discriminator stages are trained alternately until the algorithm converges, at which point teacher-network training is finished.

Step 6, constructing and training the student network.

Step 6.1, using the MFCC features of the speech signal extracted in step 3, a 10 ms time window centered on the time point of each video frame is applied to extract the corresponding MFCC segment.
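A small sketch of this windowing, assuming a 25 fps video and the 25 ms / 10 ms MFCC framing used in step 3; all concrete values are illustrative.

```python
# Sketch of step 6.1: pick the MFCC rows inside a short window centred on a frame timestamp.
import numpy as np


def mfcc_at_frame(mfcc_feat: np.ndarray, frame_idx: int, fps: float = 25.0,
                  hop_s: float = 0.01, win_s: float = 0.010) -> np.ndarray:
    """Return the MFCC rows inside a win_s window centred on a video frame's timestamp."""
    t = frame_idx / fps                                      # frame timestamp in seconds
    lo = max(int(round((t - win_s / 2.0) / hop_s)), 0)
    hi = min(int(round((t + win_s / 2.0) / hop_s)) + 1, len(mfcc_feat))
    return mfcc_feat[lo:hi]                                  # a_mfcc for this frame
```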

Step 6.2, using the teacher network pre-trained in step 5, the face feature points l_1, l_2 and the cropped front-face photo I_1 are input to obtain the change region m and the change information c of the pixel values within m.

Step 6.3, the 10 ms MFCC feature a_mfcc of the speech signal cut in step 6.1 and the cropped front-face photo I_1 are encoded with the speech encoder f_4 and the identity information encoder f_5 respectively, producing latent variables z_5 and z_6; then let z_7 = concat(z_5, z_6).

Step 6.4, a decoder is used to predict the motion information (m_s, c_s) from z_7.

Step 6.5, the parameters m_s and c_s calculated in step 6.4 are combined with the cropped front-face photo I_1 to obtain the synthesized photo I_1s′.

The synthesized photo I_1s′ is calculated as follows:

I_1s′ = m_s × c_s + (1 - m_s) × I_1 (9)
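A hedged PyTorch sketch of the student forward pass of steps 6.3-6.5 follows; the speech encoder f_4, identity encoder f_5, and decoder are placeholders, and only the data flow ending in equation (9) comes from the text.

```python
# Hedged PyTorch sketch of the student forward pass (steps 6.3-6.5).
import torch
import torch.nn as nn


class StudentNet(nn.Module):
    def __init__(self, f4: nn.Module, f5: nn.Module, decoder: nn.Module):
        super().__init__()
        self.f4, self.f5, self.decoder = f4, f5, decoder

    def forward(self, a_mfcc, I1):
        z5 = self.f4(a_mfcc)                                 # speech latent z_5
        z6 = self.f5(I1)                                     # identity latent z_6
        z7 = torch.cat([z5, z6], dim=1)                      # z_7 = concat(z_5, z_6)
        m_s, c_s = self.decoder(z7)                          # predicted motion information
        I1s_syn = m_s * c_s + (1.0 - m_s) * I1               # eq. (9)
        return I1s_syn, m_s, c_s
```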

Step 6.6, training the student network with the network architecture of the W-GAN-gp algorithm.

Step 6.6.1, generator training stage: given the MFCC feature a_mfcc and the cropped front-face photo I_1, the student network generates the picture I_1s′ from the predicted motion information m_s and c_s using the calculation procedure of steps 6.2-6.5. The generator loss function l′_loss comprises four terms, the reconstruction loss l′_rec, the sparse regularization loss l′_reg, the adversarial loss l′_gen, and the motion-information supervision loss l_mot, calculated as follows:

l′_loss = l′_rec + l′_reg + l′_gen + l_mot (10)

l′_rec = ||I_1 - I_1s′||_1 (11)

l′_reg = ||m||_1 (12)

l′_gen = -D_I([I_1s′, m]) (13)

l_mot = ||m_s - m||_1 + ||c_s - c||_1 (14)

where D_I(·) denotes the discriminator and ||·||_1 denotes the L_1 norm.
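A sketch of this combined student generator loss is given below; note that the text writes ||m||_1 and D_I([I_1s′, m]) with the teacher's mask m, whereas the sketch applies those two terms to the student's m_s, since only m_s receives gradients; this substitution is a labeled assumption, not the patent's wording.

```python
# Sketch of the student generator loss of step 6.6.1 (equations (10)-(14)).
import torch


def student_generator_loss(D, I1, I1s_syn, m_s, c_s, m_t, c_t):
    m_t, c_t = m_t.detach(), c_t.detach()                    # teacher outputs act as fixed targets
    l_rec = (I1 - I1s_syn).abs().mean()                      # eq. (11)
    l_reg = m_s.abs().mean()                                 # eq. (12), applied to m_s (assumption)
    l_gen = -D(torch.cat([I1s_syn, m_s], dim=1)).mean()      # eq. (13), applied to m_s (assumption)
    l_mot = (m_s - m_t).abs().mean() + (c_s - c_t).abs().mean()  # distillation term, eq. (14)
    return l_rec + l_reg + l_gen + l_mot                     # l'_loss, eq. (10)
```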

Step 6.6.2, discriminator training stage: using the discriminator part of W-GAN-gp, the discriminator loss function l′_dis is:

l′_dis = D_I([I_1s′, m]) - D_I([I_1, m]) + λ·l′_gp (15)

l′_gp = (||∇_x̂ D_I(x̂)||_2 - 1)^2 (16)

where x̂ is sampled uniformly on the line segment between the real input [I_1, m] and the generated input [I_1s′, m], ∇ denotes the gradient, D_I(·) denotes the discriminator, ||·|| denotes the L_2 norm, λ = 10, and l′_gp is the Lipschitz penalty term introduced to prevent gradient explosion.

The generator and discriminator stages are trained alternately until the algorithm converges, at which point student-network training is finished.

Step 7, training the cascaded student network.

Step 7.1, the MFCC feature sequence {a_1, a_2, ..., a_n} extracted in step 3 is passed through the speech encoder f_4 of step 6.3 to obtain the speech latent-variable sequence {a′_1, a′_2, ..., a′_n}.

Step 7.2, a face identity photo I_1 is input to the identity encoder f_5 of step 6.3 to obtain an identity latent variable z, and z is spliced with the speech latent-variable sequence {a′_1, a′_2, ..., a′_n} to obtain the latent-variable sequence {b_1, b_2, ..., b_n}.

Step 7.3, in order to model the temporal dependence of the sequence, the latent-variable sequence {b_1, b_2, ..., b_n} is input into an LSTM network to obtain a latent-variable sequence {b′_1, b′_2, ..., b′_n} containing temporal information; each latent variable in {b′_1, b′_2, ..., b′_n} is then trained according to steps 6.4-6.6 to generate the picture sequence {I_1a, I_2a, ..., I_na}.
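The following PyTorch sketch shows one way to realize this cascade: the per-frame speech latents are joined with the identity latent, passed through an LSTM, and decoded frame by frame with the student decoder; the layer sizes and the requirement that the LSTM dimensions match the encoders and decoder are assumptions.

```python
# Hedged sketch of the cascaded student network of step 7.
# lstm_in must equal the concatenated latent size; lstm_hidden must match the decoder input.
import torch
import torch.nn as nn


class CascadedStudent(nn.Module):
    def __init__(self, f4: nn.Module, f5: nn.Module, decoder: nn.Module,
                 lstm_in: int = 512, lstm_hidden: int = 512):
        super().__init__()
        self.f4, self.f5, self.decoder = f4, f5, decoder
        self.lstm = nn.LSTM(input_size=lstm_in, hidden_size=lstm_hidden, batch_first=True)

    def forward(self, mfcc_seq, I1):
        """mfcc_seq: (B, T, ...) per-frame audio features; I1: (B, 3, N, N) identity photo."""
        B, T = mfcc_seq.shape[:2]
        a = torch.stack([self.f4(mfcc_seq[:, t]) for t in range(T)], dim=1)  # {a'_1 ... a'_n}
        z = self.f5(I1)                                                      # identity latent z
        b = torch.cat([a, z.unsqueeze(1).expand(-1, T, -1)], dim=2)          # {b_1 ... b_n}
        b_prime, _ = self.lstm(b)                                            # {b'_1 ... b'_n}
        frames = []
        for t in range(T):
            m_s, c_s = self.decoder(b_prime[:, t])
            frames.append(m_s * c_s + (1.0 - m_s) * I1)                      # frame I_ta
        return torch.stack(frames, dim=1)                                    # (B, T, 3, N, N)
```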

Step 8, the MFCC feature sequence {a_1, a_2, ..., a_n} extracted in step 3 and an arbitrary face photo I are input into the cascaded student network trained in step 7 to obtain the corresponding picture sequence {I_1a, I_2a, ..., I_na}; the picture sequence is then synthesized into a video with ffmpeg.
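A short Python sketch of this final synthesis step, writing the generated frames to disk and calling ffmpeg to assemble them into a video, optionally muxed with the driving speech; codec choices and file paths are illustrative.

```python
# Sketch of the final muxing: frames on disk -> ffmpeg -> video (optionally with audio).
import subprocess
from pathlib import Path
from typing import Optional

import cv2


def frames_to_video(frames, out_path: str, audio_path: Optional[str] = None, fps: int = 25):
    """frames: iterable of H x W x 3 uint8 BGR arrays produced by the cascaded student network."""
    tmp = Path("generated_frames")
    tmp.mkdir(exist_ok=True)
    for i, frame in enumerate(frames):
        cv2.imwrite(str(tmp / f"{i:05d}.png"), frame)
    cmd = ["ffmpeg", "-y", "-framerate", str(fps), "-i", str(tmp / "%05d.png")]
    if audio_path is not None:
        cmd += ["-i", audio_path, "-c:a", "aac", "-shortest"]   # pair with the driving speech
    cmd += ["-c:v", "libx264", "-pix_fmt", "yuv420p", out_path]
    subprocess.run(cmd, check=True)
```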

In a specific implementation, the above process can be implemented with computer software for automated operation.

The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit or scope of the invention as defined in the appended claims.
