Voice-driven talking face video generation method based on a teacher-student network

Document No.: 1818109    Publication date: 2021-11-09

Note: this technology, "Voice-driven talking face video generation method based on a teacher-student network", was designed and created by 熊盛武, 陈燚雷, 曾瑞, 林承德 and 马宜祯 on 2021-07-19. Its main content is as follows. The invention relates to a voice-driven talking face video generation method based on a teacher-student network. A teacher network is first used to compress the dynamic information in video data; a student network then learns to predict this dynamic information from speech; finally, the face dynamic information extracted by the pre-trained teacher network is used as supervision and combined with face identity information to accomplish the voice-driven talking face task. Compared with traditional arbitrary talking face video generation techniques, the invention is the first to mine the dynamic information in the video signal, and it substantially improves face generation quality, picture sharpness, and the lip-shape accuracy of the generated talking face video.

1. A voice-driven talking face video generation method based on a teacher-student network, characterized by comprising the following steps:

step 1, acquiring a large-scale dataset of talking face videos;

step 2, extracting video frames and voice data from the data set obtained in the step 1;

step 3, extracting the face photos from the video frames of step 2, converting them into front-face photos, and cropping each into an N × N front-face photo I_1; extracting the MFCC features of the voice signal from step 2;

step 4, detecting the face feature points in the front-face photo I_1 cropped in step 3;

step 5, establishing and training a teacher network;

step 6, constructing and training a student network;

step 7, training the cascaded student network;

step 8, inputting the MFCC feature sequence extracted in step 3 and an arbitrary face photo I into the cascaded student network trained in step 7 to obtain the corresponding picture sequence, and synthesizing the picture sequence into a video with ffmpeg.

2. The voice-driven talking face video generation method based on a teacher-student network of claim 1, characterized in that constructing and training the teacher network in step 5 comprises the following steps:

step 5.1, the whole network adopts a self-supervised learning mode: the face feature points l_1 and l_2 detected in step 4 and the cropped front-face photo I_1 are encoded by three encoders f_1, f_2, f_3 respectively, producing latent variables z_1, z_2, z_3;

Step 5.2, let z_4 = concat((z_2 - z_1), z_3); decode z_4 with the decoder f_D to obtain the dynamic features of the cropped front-face photo I_1, namely the change region m and the change information c of the pixel values within m, calculated as follows:

(m, c) = f_D(z_4) (1)

step 5.3, using the parameters m and c calculated in step 5.2, combined with the cropped front-face photo I_1, obtain the synthesized photo I_1′:

I_1′ = m × c + (1 - m) × I_1 (2)

step 5.4, training the teacher network with the network architecture of the W-GAN-gp algorithm.

3. The voice-driven talking face video generation method based on a teacher-student network of claim 2, characterized in that training the teacher network with the W-GAN-gp network architecture in step 5.4 comprises a generator training stage and a discriminator training stage, which are trained alternately until the algorithm converges, at which point teacher-network training is finished; in the generator training stage, given the preprocessed face feature points l_1, l_2 and the cropped front-face photo I_1, the network generates the picture I_1′ from the predicted motion information m and c using the calculation procedure of steps 5.1-5.3, and the generator loss function l_loss is calculated as:

l_loss = l_rec + l_reg + l_gen (3)

l_rec = ||I_1 - I_1′||_1 (4)

l_reg = ||m||_1 (5)

l_gen = -D_I([I_1′, m]) (6)

where l_rec is the reconstruction loss, l_reg is the sparse regularization loss, l_gen is the adversarial loss, D_I(·) denotes the discriminator, and ||·||_1 denotes the L_1 norm.

4. The voice-driven talking face video generation method based on a teacher-student network of claim 3, characterized in that, in the discriminator training stage of step 5.4, the discriminator loss function l_dis of the W-GAN-gp discriminator part is calculated as follows:

l_dis = D_I([I_1′, m]) - D_I([I_1, m]) + λ·l_gp (7)

l_gp = (||∇_x̂ D_I(x̂)||_2 - 1)^2 (8)

where x̂ is sampled uniformly on the line segment between the real input [I_1, m] and the generated input [I_1′, m], ∇ denotes the gradient, D_I(·) denotes the discriminator, ||·|| denotes the L_2 norm, λ = 10, and l_gp is the Lipschitz penalty term introduced to prevent gradient explosion.

5. The voice-driven talking face video generation method based on a teacher-student network of claim 1, characterized in that constructing and training the student network in step 6 comprises the following steps:

step 6.1, using the MFCC features of the voice signal extracted in step 3, applying a 10 ms time window centered on the time point of each video frame to extract the corresponding MFCC segment;

step 6.2, using the teacher network trained in step 5, inputting the face feature points l_1, l_2 and the cropped front-face photo I_1 to obtain the change region m and the change information c of the pixel values within m;

step 6.3, inputting the 10 ms MFCC feature a_mfcc of the speech signal cut in step 6.1 and the cropped front-face photo I_1, encoding them with the speech encoder f_4 and the identity information encoder f_5 respectively to produce latent variables z_5 and z_6, then letting z_7 = concat(z_5, z_6);

Step 6.4, using a decoder to predict the motion information (m_s, c_s) from z_7;

Step 6.5, using the parameters m_s and c_s calculated in step 6.4, combined with the cropped front-face photo I_1, obtain the synthesized photo I_1s′:

I_1s′ = m_s × c_s + (1 - m_s) × I_1 (9)

step 6.6, training the student network with the network architecture of the W-GAN-gp algorithm.

6. The voice-driven talking face video generation method based on a teacher-student network of claim 5, characterized in that training the student network with the W-GAN-gp network architecture in step 6.6 comprises a generator training stage and a discriminator training stage, which are trained alternately until the algorithm converges, at which point student-network training is finished; in the generator training stage, given the MFCC feature a_mfcc and the cropped front-face photo I_1, the student network generates the picture I_1s′ from the predicted motion information m_s and c_s using the calculation procedure of steps 6.2-6.5, and the generator loss function l′_loss is calculated as:

l′_loss = l′_rec + l′_reg + l′_gen + l_mot (10)

l′_rec = ||I_1 - I_1s′||_1 (11)

l′_reg = ||m||_1 (12)

l′_gen = -D_I([I_1s′, m]) (13)

l_mot = ||m_s - m||_1 + ||c_s - c||_1 (14)

where l′_rec is the reconstruction loss, l′_reg is the sparse regularization loss, l′_gen is the adversarial loss, l_mot is the motion-information supervision loss, D_I(·) denotes the discriminator, and ||·||_1 denotes the L_1 norm.

7. The voice-driven talking face video generation method based on a teacher-student network of claim 6, characterized in that, in the discriminator training stage of step 6.6, the discriminator loss function l′_dis of the W-GAN-gp discriminator part is:

l′_dis = D_I([I_1s′, m]) - D_I([I_1, m]) + λ·l′_gp (15)

l′_gp = (||∇_x̂ D_I(x̂)||_2 - 1)^2 (16)

where x̂ is sampled uniformly on the line segment between the real input [I_1, m] and the generated input [I_1s′, m], ∇ denotes the gradient, D_I(·) denotes the discriminator, ||·|| denotes the L_2 norm, λ = 10, and l′_gp is the Lipschitz penalty term introduced to prevent gradient explosion.

8. The voice-driven talking face video generation method based on a teacher-student network of claim 1, characterized in that training the cascaded student network in step 7 comprises the following steps:

step 7.1, passing the MFCC feature sequence {a_1, a_2, ..., a_n} extracted in step 3 through the speech encoder f_4 of step 6.3 to obtain the speech latent-variable sequence {a′_1, a′_2, ..., a′_n};

Step 7.2, inputting a face identity photo I_1 to the identity encoder f_5 of step 6.3 to obtain an identity latent variable z, and splicing z with the speech latent-variable sequence {a′_1, a′_2, ..., a′_n} to obtain the latent-variable sequence {b_1, b_2, ..., b_n};

Step 7.3, in order to model the temporal dependence of the sequence, inputting the latent-variable sequence {b_1, b_2, ..., b_n} into an LSTM network to obtain a latent-variable sequence {b′_1, b′_2, ..., b′_n} containing temporal information; each latent variable in {b′_1, b′_2, ..., b′_n} is then trained according to steps 6.4-6.6 to generate the picture sequence {I_1a, I_2a, ..., I_na}.

Technical Field

The invention relates to the fields of multimedia and artificial intelligence, in particular to a voice-driven talking face video generation method based on a teacher-student network.

Background

Arbitrary talking face video generation takes a front-face photo of an arbitrary person and a segment of that person's speech as input, and generates a front-facing talking video of the person with accurate lip movements and expression changes. Generating a natural and smooth talking face video from a single face picture and a speech recording is very challenging: it requires generating multiple face frames that preserve the identity characteristics, and it requires the face variations, especially the lip variations, to be consistent with the input speech in the time domain. Talking face video generation has very broad application prospects and potential in fields such as virtual anchors, smart homes, and character production for games and movies.

The task of generating a talking face can be traced back to the 1990s, when a face was modeled with a sparse mesh and the mesh motion was driven by a speech signal. In the early 2000s, Ezzat proposed a "make it talk" scheme: a certain number of talking face videos of a single person are collected to form a single-person video library, text is converted into phoneme signals, the most suitable visemes for those phonemes are retrieved from the library, and intermediate frames between visemes are computed with optical flow to generate a video. In recent years, with the growth of computing power, the construction of large-scale datasets, and the rise of deep learning, Joon Son Chung of the VGG group, in the 2016 paper "You said that?", first used an encoder-decoder learning structure trained on the large-scale dataset LRW to generate a talking video of a single face from a single face photo and speech audio. Subsequent techniques use video frames as ground truth for self-supervised learning of the network, but none of these methods adequately mine the dynamic information in the video.

Disclosure of Invention

Aiming at the defects of the prior art, the invention combines the strengths of generative adversarial networks and knowledge distillation for image generation, on the basis of a deep-learning autoencoder generative model, and provides a voice-driven talking face video generation method based on a teacher-student network. A teacher network is first used to compress the dynamic information in video data; a student network then learns to predict this dynamic information from speech; finally, the face dynamic information extracted by the pre-trained teacher network is used as supervision and combined with face identity information to accomplish the voice-driven talking face task.

In order to achieve the aim, the technical scheme provided by the invention is a voice-driven talking face video generation method based on a teacher-student network, which comprises the following steps:

step 1, acquiring a large-scale dataset of talking face videos;

step 2, extracting video frames and voice data from the data set obtained in the step 1 by using an ffmpeg tool;

step 3, extracting the face photos from the video frames of step 2 with the face detection tool provided by the dlib library, converting them into front-face photos, and cropping each into an N × N front-face photo I_1; extracting the MFCC features of the voice signal from step 2 with the speech processing toolkit python_speech_features;

step 4, detecting the face feature points in the front-face photo I_1 cropped in step 3 with the face alignment tool provided by the face_alignment library;

step 5, establishing and training a teacher network;

step 6, constructing and training a student network;

step 7, training the cascaded student network;

step 8, inputting the MFCC feature sequence extracted in step 3 and an arbitrary face photo I into the cascaded student network trained in step 7 to obtain the corresponding picture sequence, and synthesizing the picture sequence into a video with ffmpeg.

Moreover, the step 5 of constructing and training the teacher network includes the following steps:

step 5.1, the whole network adopts a self-supervised learning mode: the face feature points l_1 and l_2 detected in step 4 and the cropped front-face photo I_1 are encoded by three encoders f_1, f_2, f_3 respectively, producing latent variables z_1, z_2, z_3;

Step 5.2, let z_4 = concat((z_2 - z_1), z_3); decode z_4 with the decoder f_D to obtain the dynamic features of the cropped front-face photo I_1, namely the change region m and the change information c of the pixel values within m, calculated as follows:

(m, c) = f_D(z_4) (1)

step 5.3, using the parameters m and c calculated in step 5.2, combined with the cropped front-face photo I_1, obtain the synthesized photo I_1′:

I_1′ = m × c + (1 - m) × I_1 (2)

step 5.4, training the teacher network with the network architecture of the W-GAN-gp algorithm.

Moreover, the training of the teacher network in the step 5.4 by using the network architecture of the W-GAN-gp algorithm includes a generator training phase and a discriminator training phase:

step 5.4.1, generator training stage: given the preprocessed face feature points l_1, l_2 and the cropped front-face photo I_1, the network generates the picture I_1′ from the predicted motion information m and c using the calculation procedure of steps 5.1-5.3, and the generator loss function l_loss is calculated as:

l_loss = l_rec + l_reg + l_gen (3)

l_rec = ||I_1 - I_1′||_1 (4)

l_reg = ||m||_1 (5)

l_gen = -D_I([I_1′, m]) (6)

where l_rec is the reconstruction loss, l_reg is the sparse regularization loss, l_gen is the adversarial loss, D_I(·) denotes the discriminator, and ||·||_1 denotes the L_1 norm.

Step 5.4.2, discriminator training stage: using the discriminator part of W-GAN-gp, the discriminator loss function l_dis is calculated as follows:

l_dis = D_I([I_1′, m]) - D_I([I_1, m]) + λ·l_gp (7)

l_gp = (||∇_x̂ D_I(x̂)||_2 - 1)^2 (8)

where x̂ is sampled uniformly on the line segment between the real input [I_1, m] and the generated input [I_1′, m], ∇ denotes the gradient, D_I(·) denotes the discriminator, ||·|| denotes the L_2 norm, λ = 10, and l_gp is the Lipschitz penalty term introduced to prevent gradient explosion.

The generator and discriminator stages are trained alternately until the algorithm converges, at which point teacher-network training is finished.

Moreover, the step 6 of constructing and training the student network includes the following steps:

step 6.1, using the MFCC features of the voice signal extracted in step 3, applying a 10 ms time window centered on the time point of each video frame to extract the corresponding MFCC segment;

step 6.2, using the teacher network trained in step 5, inputting the face feature points l_1, l_2 and the cropped front-face photo I_1 to obtain the change region m and the change information c of the pixel values within m;

step 6.3, inputting the 10 ms MFCC feature a_mfcc of the speech signal cut in step 6.1 and the cropped front-face photo I_1, encoding them with the speech encoder f_4 and the identity information encoder f_5 respectively to produce latent variables z_5 and z_6, then letting z_7 = concat(z_5, z_6);

Step 6.4, using a decoder to predict the motion information (m_s, c_s) from z_7;

Step 6.5, using the parameters m_s and c_s calculated in step 6.4, combined with the cropped front-face photo I_1, obtain the synthesized photo I_1s′:

I_1s′ = m_s × c_s + (1 - m_s) × I_1 (9)

step 6.6, training the student network with the network architecture of the W-GAN-gp algorithm.

Moreover, the training of the student network in the step 6.6 by using the network architecture of the W-GAN-gp algorithm includes a generator training phase and a discriminator training phase:

step 6.6.1, Generator training phase, given MFCC feature amfccAnd a cut-out front face photograph I1Using the calculation process of steps 6.2-6.5, the student network passes the predicted movement information msAnd csGenerate picture I'1sAnd calculating a loss function l 'of the generator'loss

l′_loss = l′_rec + l′_reg + l′_gen + l_mot (10)

l′_rec = ||I_1 - I_1s′||_1 (11)

l′_reg = ||m||_1 (12)

l′_gen = -D_I([I_1s′, m]) (13)

l_mot = ||m_s - m||_1 + ||c_s - c||_1 (14)

where l′_rec is the reconstruction loss, l′_reg is the sparse regularization loss, l′_gen is the adversarial loss, l_mot is the motion-information supervision loss, D_I(·) denotes the discriminator, and ||·||_1 denotes the L_1 norm.

Step 6.6.2, discriminator training stage: using the discriminator part of W-GAN-gp, the discriminator loss function l′_dis is:

l′_dis = D_I([I_1s′, m]) - D_I([I_1, m]) + λ·l′_gp (15)

l′_gp = (||∇_x̂ D_I(x̂)||_2 - 1)^2 (16)

where x̂ is sampled uniformly on the line segment between the real input [I_1, m] and the generated input [I_1s′, m], ∇ denotes the gradient, D_I(·) denotes the discriminator, ||·|| denotes the L_2 norm, λ = 10, and l′_gp is the Lipschitz penalty term introduced to prevent gradient explosion.

The generator and discriminator stages are trained alternately until the algorithm converges, at which point student-network training is finished.

Moreover, the step 7 of training the cascaded student network includes the following steps:

step 7.1, extracting the MFCC characteristic sequence { a ] extracted in step 31,a2,...anThe sequence goes through the speech coder f in step 6.34Obtaining a voice hidden variable sequence { a'1,a′2,...a′n};

Step 7.2, inputting a face identity photo I_1 to the identity encoder f_5 of step 6.3 to obtain an identity latent variable z, and splicing z with the speech latent-variable sequence {a′_1, a′_2, ..., a′_n} to obtain the latent-variable sequence {b_1, b_2, ..., b_n};

Step 7.3, in order to model the temporal dependence of the sequence, inputting the latent-variable sequence {b_1, b_2, ..., b_n} into an LSTM network to obtain a latent-variable sequence {b′_1, b′_2, ..., b′_n} containing temporal information; each latent variable in {b′_1, b′_2, ..., b′_n} is then trained according to steps 6.4-6.6 to generate the picture sequence {I_1a, I_2a, ..., I_na}.

Compared with the prior art, the invention has the following advantages: compared with traditional arbitrary talking face video generation techniques, the invention is the first to mine the dynamic information in the video signal, and it substantially improves the quality of the generated faces, the picture sharpness, and the lip-shape accuracy of the generated talking face video.

Drawings

Fig. 1 is a network structure diagram according to an embodiment of the present invention.

Fig. 2 is a block diagram of the teacher network model based on an adversarial network in the embodiment.

Fig. 3 is a block diagram of the student network model based on an adversarial network in the embodiment.

Fig. 4 is a block diagram of the cascaded student network model based on an adversarial network in the embodiment.

Detailed Description

The invention provides a voice-driven talking face video generation method based on a teacher-student network: a teacher network is first used to compress the dynamic information in video data; a student network then learns to predict this dynamic information from speech; finally, the face dynamic information extracted by the pre-trained teacher network is used as supervision and combined with face identity information to accomplish the voice-driven talking face task.

The technical solution of the present invention is further explained with reference to the drawings and the embodiments.

As shown in fig. 1, the process of the embodiment of the present invention includes the following steps:

Step 1, acquiring a large-scale dataset of talking face videos.

Step 2, extracting video frames and voice data from the dataset acquired in step 1 with the ffmpeg tool.
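The following Python sketch shows one possible way to carry out this extraction by invoking ffmpeg through subprocess; the 25 fps frame rate, 16 kHz sampling rate, and output paths are illustrative assumptions rather than values fixed by the invention.

```python
# Illustrative sketch of step 2: split each video into image frames and a mono WAV track.
import subprocess
from pathlib import Path


def extract_frames_and_audio(video_path: str, out_dir: str, fps: int = 25, sr: int = 16000):
    out = Path(out_dir)
    (out / "frames").mkdir(parents=True, exist_ok=True)
    # Video frames at a fixed frame rate.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vf", f"fps={fps}",
                    str(out / "frames" / "%05d.png")], check=True)
    # 16 kHz mono PCM audio for later MFCC extraction.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", str(sr),
                    "-acodec", "pcm_s16le", str(out / "audio.wav")], check=True)
```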

Step 3, extracting the face picture from each video frame of step 2 with the face detection tool provided by the dlib library, converting it into a front-face picture, and cropping it into an N × N (N may be 64, 128, or 256) front-face photo I_1; the MFCC features of the speech signal from step 2 are extracted with the speech processing toolkit python_speech_features.

Step 4, detecting the face feature points in the front-face photo I_1 cropped in step 3 with the face alignment tool provided by the face_alignment library.
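A minimal Python sketch of steps 3 and 4 follows, using dlib for face detection and cropping, python_speech_features for MFCC extraction, and the face_alignment library for the 68 landmark points; the frontalization step is omitted, the crop size and MFCC parameters are assumptions, and the face_alignment API names differ slightly between library versions.

```python
# Minimal sketch of steps 3-4 (frontalization omitted; parameters are illustrative).
import cv2
import dlib
import face_alignment
import numpy as np
from python_speech_features import mfcc
from scipy.io import wavfile

detector = dlib.get_frontal_face_detector()
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, device="cpu")


def crop_face(frame_bgr: np.ndarray, n: int = 128) -> np.ndarray:
    """Detect the largest face in a frame and crop/resize it to an N x N photo I_1."""
    rects = detector(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY), 1)
    r = max(rects, key=lambda r: r.width() * r.height())
    face = frame_bgr[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
    return cv2.resize(face, (n, n))


def face_landmarks(face_rgb: np.ndarray) -> np.ndarray:
    """68 x 2 facial landmarks used as l_1 / l_2 (first detected face)."""
    return fa.get_landmarks(face_rgb)[0]


def mfcc_features(wav_path: str) -> np.ndarray:
    """Frame-level MFCCs of the speech track (13 coefficients, 25 ms window, 10 ms hop)."""
    rate, signal = wavfile.read(wav_path)
    return mfcc(signal, samplerate=rate, winlen=0.025, winstep=0.01, numcep=13)
```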

Step 5, constructing and training the teacher network.

Step 5.1, the whole network adopts a self-supervised learning mode: first, the face feature points l_1 and l_2 detected in step 4 and the cropped front-face photo I_1 are encoded by three encoders f_1, f_2, f_3 respectively, producing latent variables z_1, z_2, z_3.

Step 5.2, let z_4 = concat((z_2 - z_1), z_3); decode z_4 with the decoder f_D to obtain, for the cropped front-face photo I_1, the change region m and the change information c of the pixel values within m.

The dynamic features m and c are calculated as follows:

(m, c) = f_D(z_4) (1)

Step 5.3, the parameters m and c calculated in step 5.2 are combined with the cropped front-face photo I_1 to obtain the synthesized photo I_1′.

The synthesized photo I_1′ is calculated as follows:

I_1′ = m × c + (1 - m) × I_1 (2)
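A hedged PyTorch sketch of the teacher forward pass described in steps 5.1-5.3 is given below; only the data flow of equations (1) and (2) is taken from the text, while the concrete encoder and decoder architectures are placeholders supplied by the caller.

```python
# Hedged PyTorch sketch of the teacher forward pass (steps 5.1-5.3).
# f1, f2, f3, fD are placeholder modules; only the data flow comes from the text.
import torch
import torch.nn as nn


class TeacherNet(nn.Module):
    def __init__(self, f1: nn.Module, f2: nn.Module, f3: nn.Module, fD: nn.Module):
        super().__init__()
        self.f1, self.f2, self.f3, self.fD = f1, f2, f3, fD

    def forward(self, l1, l2, I1):
        z1, z2, z3 = self.f1(l1), self.f2(l2), self.f3(I1)   # latent codes of l_1, l_2, I_1
        z4 = torch.cat([z2 - z1, z3], dim=1)                 # z_4 = concat((z_2 - z_1), z_3)
        m, c = self.fD(z4)                                   # (m, c) = f_D(z_4), eq. (1)
        I1_syn = m * c + (1.0 - m) * I1                      # I_1' = m*c + (1 - m)*I_1, eq. (2)
        return I1_syn, m, c
```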

Step 5.4, training the teacher network with the network architecture of the W-GAN-gp algorithm.

Step 5.4.1, generator training stage: given the preprocessed face feature points l_1, l_2 and the cropped front-face photo I_1, the network generates the picture I_1′ from the predicted motion information m and c using the calculation procedure of steps 5.1-5.3. The generator loss function l_loss comprises three terms, the reconstruction loss l_rec, the sparse regularization loss l_reg, and the adversarial loss l_gen, calculated as follows:

l_loss = l_rec + l_reg + l_gen (3)

l_rec = ||I_1 - I_1′||_1 (4)

l_reg = ||m||_1 (5)

l_gen = -D_I([I_1′, m]) (6)

where D_I(·) denotes the discriminator and ||·||_1 denotes the L_1 norm.

Step 5.4.2, discriminator training stage: using the discriminator part of W-GAN-gp, the discriminator loss function l_dis is calculated as follows:

l_dis = D_I([I_1′, m]) - D_I([I_1, m]) + λ·l_gp (7)

l_gp = (||∇_x̂ D_I(x̂)||_2 - 1)^2 (8)

where x̂ is sampled uniformly on the line segment between the real input [I_1, m] and the generated input [I_1′, m], ∇ denotes the gradient, D_I(·) denotes the discriminator, ||·|| denotes the L_2 norm, λ = 10, and l_gp is the Lipschitz penalty term introduced to prevent gradient explosion.
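The sketch below shows how the generator loss of equations (3)-(6) and the W-GAN-gp discriminator loss with gradient penalty could be computed in PyTorch; the channel-wise concatenation [image, m] fed to the discriminator and the interpolation-based penalty are assumptions that follow the standard WGAN-GP recipe implied, but not spelled out, by the text.

```python
# Sketch of the W-GAN-gp training losses of step 5.4 (lambda = 10 as in the text).
import torch


def generator_loss(D, I1, I1_syn, m):
    l_rec = (I1 - I1_syn).abs().mean()                       # reconstruction loss, eq. (4)
    l_reg = m.abs().mean()                                   # sparse regularization, eq. (5)
    l_gen = -D(torch.cat([I1_syn, m], dim=1)).mean()         # adversarial loss, eq. (6)
    return l_rec + l_reg + l_gen                             # l_loss, eq. (3)


def discriminator_loss(D, I1, I1_syn, m, lam=10.0):
    m_d = m.detach()
    real = torch.cat([I1, m_d], dim=1)
    fake = torch.cat([I1_syn.detach(), m_d], dim=1)
    w_term = D(fake).mean() - D(real).mean()                 # Wasserstein critic term
    # Lipschitz (gradient) penalty l_gp on points interpolated between real and fake inputs.
    eps = torch.rand(I1.size(0), 1, 1, 1, device=I1.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    l_gp = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return w_term + lam * l_gp
```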

The generator and discriminator stages are trained alternately until the algorithm converges, at which point teacher-network training is finished.

Step 6, constructing and training the student network.

Step 6.1, using the MFCC features of the speech signal extracted in step 3, a 10 ms time window centered on the time point of each video frame is applied to extract the corresponding MFCC segment.
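A small sketch of this windowing, assuming a 25 fps video and the 25 ms / 10 ms MFCC framing used in step 3; all concrete values are illustrative.

```python
# Sketch of step 6.1: pick the MFCC rows inside a short window centred on a frame timestamp.
import numpy as np


def mfcc_at_frame(mfcc_feat: np.ndarray, frame_idx: int, fps: float = 25.0,
                  hop_s: float = 0.01, win_s: float = 0.010) -> np.ndarray:
    """Return the MFCC rows inside a win_s window centred on a video frame's timestamp."""
    t = frame_idx / fps                                      # frame timestamp in seconds
    lo = max(int(round((t - win_s / 2.0) / hop_s)), 0)
    hi = min(int(round((t + win_s / 2.0) / hop_s)) + 1, len(mfcc_feat))
    return mfcc_feat[lo:hi]                                  # a_mfcc for this frame
```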

Step 6.2, using the teacher network pre-trained in step 5, the face feature points l_1, l_2 and the cropped front-face photo I_1 are input to obtain the change region m and the change information c of the pixel values within m.

Step 6.3, the 10 ms MFCC feature a_mfcc of the speech signal cut in step 6.1 and the cropped front-face photo I_1 are encoded with the speech encoder f_4 and the identity information encoder f_5 respectively, producing latent variables z_5 and z_6; then let z_7 = concat(z_5, z_6).

Step 6.4, a decoder is used to predict the motion information (m_s, c_s) from z_7.

Step 6.5, the parameters m_s and c_s calculated in step 6.4 are combined with the cropped front-face photo I_1 to obtain the synthesized photo I_1s′.

The synthesized photo I_1s′ is calculated as follows:

I_1s′ = m_s × c_s + (1 - m_s) × I_1 (9)
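A hedged PyTorch sketch of the student forward pass of steps 6.3-6.5 follows; the speech encoder f_4, identity encoder f_5, and decoder are placeholders, and only the data flow ending in equation (9) comes from the text.

```python
# Hedged PyTorch sketch of the student forward pass (steps 6.3-6.5).
import torch
import torch.nn as nn


class StudentNet(nn.Module):
    def __init__(self, f4: nn.Module, f5: nn.Module, decoder: nn.Module):
        super().__init__()
        self.f4, self.f5, self.decoder = f4, f5, decoder

    def forward(self, a_mfcc, I1):
        z5 = self.f4(a_mfcc)                                 # speech latent z_5
        z6 = self.f5(I1)                                     # identity latent z_6
        z7 = torch.cat([z5, z6], dim=1)                      # z_7 = concat(z_5, z_6)
        m_s, c_s = self.decoder(z7)                          # predicted motion information
        I1s_syn = m_s * c_s + (1.0 - m_s) * I1               # eq. (9)
        return I1s_syn, m_s, c_s
```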

Step 6.6, training the student network with the network architecture of the W-GAN-gp algorithm.

Step 6.6.1, generator training stage: given the MFCC feature a_mfcc and the cropped front-face photo I_1, the student network generates the picture I_1s′ from the predicted motion information m_s and c_s using the calculation procedure of steps 6.2-6.5. The generator loss function l′_loss comprises four terms, the reconstruction loss l′_rec, the sparse regularization loss l′_reg, the adversarial loss l′_gen, and the motion-information supervision loss l_mot, calculated as follows:

l′_loss = l′_rec + l′_reg + l′_gen + l_mot (10)

l′_rec = ||I_1 - I_1s′||_1 (11)

l′_reg = ||m||_1 (12)

l′_gen = -D_I([I_1s′, m]) (13)

l_mot = ||m_s - m||_1 + ||c_s - c||_1 (14)

where D_I(·) denotes the discriminator and ||·||_1 denotes the L_1 norm.
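A sketch of this combined student generator loss is given below; note that the text writes ||m||_1 and D_I([I_1s′, m]) with the teacher's mask m, whereas the sketch applies those two terms to the student's m_s, since only m_s receives gradients; this substitution is a labeled assumption, not the patent's wording.

```python
# Sketch of the student generator loss of step 6.6.1 (equations (10)-(14)).
import torch


def student_generator_loss(D, I1, I1s_syn, m_s, c_s, m_t, c_t):
    m_t, c_t = m_t.detach(), c_t.detach()                    # teacher outputs act as fixed targets
    l_rec = (I1 - I1s_syn).abs().mean()                      # eq. (11)
    l_reg = m_s.abs().mean()                                 # eq. (12), applied to m_s (assumption)
    l_gen = -D(torch.cat([I1s_syn, m_s], dim=1)).mean()      # eq. (13), applied to m_s (assumption)
    l_mot = (m_s - m_t).abs().mean() + (c_s - c_t).abs().mean()  # distillation term, eq. (14)
    return l_rec + l_reg + l_gen + l_mot                     # l'_loss, eq. (10)
```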

Step 6.6.2, discriminator training stage: using the discriminator part of W-GAN-gp, the discriminator loss function l′_dis is:

l′_dis = D_I([I_1s′, m]) - D_I([I_1, m]) + λ·l′_gp (15)

l′_gp = (||∇_x̂ D_I(x̂)||_2 - 1)^2 (16)

where x̂ is sampled uniformly on the line segment between the real input [I_1, m] and the generated input [I_1s′, m], ∇ denotes the gradient, D_I(·) denotes the discriminator, ||·|| denotes the L_2 norm, λ = 10, and l′_gp is the Lipschitz penalty term introduced to prevent gradient explosion.

The generator and discriminator stages are trained alternately until the algorithm converges, at which point student-network training is finished.

Step 7, training the cascaded student network.

Step 7.1, the MFCC feature sequence {a_1, a_2, ..., a_n} extracted in step 3 is passed through the speech encoder f_4 of step 6.3 to obtain the speech latent-variable sequence {a′_1, a′_2, ..., a′_n}.

Step 7.2, a face identity photo I_1 is input to the identity encoder f_5 of step 6.3 to obtain an identity latent variable z, and z is spliced with the speech latent-variable sequence {a′_1, a′_2, ..., a′_n} to obtain the latent-variable sequence {b_1, b_2, ..., b_n}.

Step 7.3, in order to model the temporal dependence of the sequence, the latent-variable sequence {b_1, b_2, ..., b_n} is input into an LSTM network to obtain a latent-variable sequence {b′_1, b′_2, ..., b′_n} containing temporal information; each latent variable in {b′_1, b′_2, ..., b′_n} is then trained according to steps 6.4-6.6 to generate the picture sequence {I_1a, I_2a, ..., I_na}.
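The following PyTorch sketch shows one way to realize this cascade: the per-frame speech latents are joined with the identity latent, passed through an LSTM, and decoded frame by frame with the student decoder; the layer sizes and the requirement that the LSTM dimensions match the encoders and decoder are assumptions.

```python
# Hedged sketch of the cascaded student network of step 7.
# lstm_in must equal the concatenated latent size; lstm_hidden must match the decoder input.
import torch
import torch.nn as nn


class CascadedStudent(nn.Module):
    def __init__(self, f4: nn.Module, f5: nn.Module, decoder: nn.Module,
                 lstm_in: int = 512, lstm_hidden: int = 512):
        super().__init__()
        self.f4, self.f5, self.decoder = f4, f5, decoder
        self.lstm = nn.LSTM(input_size=lstm_in, hidden_size=lstm_hidden, batch_first=True)

    def forward(self, mfcc_seq, I1):
        """mfcc_seq: (B, T, ...) per-frame audio features; I1: (B, 3, N, N) identity photo."""
        B, T = mfcc_seq.shape[:2]
        a = torch.stack([self.f4(mfcc_seq[:, t]) for t in range(T)], dim=1)  # {a'_1 ... a'_n}
        z = self.f5(I1)                                                      # identity latent z
        b = torch.cat([a, z.unsqueeze(1).expand(-1, T, -1)], dim=2)          # {b_1 ... b_n}
        b_prime, _ = self.lstm(b)                                            # {b'_1 ... b'_n}
        frames = []
        for t in range(T):
            m_s, c_s = self.decoder(b_prime[:, t])
            frames.append(m_s * c_s + (1.0 - m_s) * I1)                      # frame I_ta
        return torch.stack(frames, dim=1)                                    # (B, T, 3, N, N)
```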

Step 8, the MFCC feature sequence {a_1, a_2, ..., a_n} extracted in step 3 and an arbitrary face photo I are input into the cascaded student network trained in step 7 to obtain the corresponding picture sequence {I_1a, I_2a, ..., I_na}; the picture sequence is then synthesized into a video with ffmpeg.
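A short Python sketch of this final synthesis step, writing the generated frames to disk and calling ffmpeg to assemble them into a video, optionally muxed with the driving speech; codec choices and file paths are illustrative.

```python
# Sketch of the final muxing: frames on disk -> ffmpeg -> video (optionally with audio).
import subprocess
from pathlib import Path
from typing import Optional

import cv2


def frames_to_video(frames, out_path: str, audio_path: Optional[str] = None, fps: int = 25):
    """frames: iterable of H x W x 3 uint8 BGR arrays produced by the cascaded student network."""
    tmp = Path("generated_frames")
    tmp.mkdir(exist_ok=True)
    for i, frame in enumerate(frames):
        cv2.imwrite(str(tmp / f"{i:05d}.png"), frame)
    cmd = ["ffmpeg", "-y", "-framerate", str(fps), "-i", str(tmp / "%05d.png")]
    if audio_path is not None:
        cmd += ["-i", audio_path, "-c:a", "aac", "-shortest"]   # pair with the driving speech
    cmd += ["-c:v", "libx264", "-pix_fmt", "yuv420p", out_path]
    subprocess.run(cmd, check=True)
```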

In a specific implementation, the above process can be implemented with computer software for automated operation.

The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit or scope of the invention as defined in the appended claims.
