Virtual face construction method and device, computer equipment and computer readable medium

Document No.: 909793 | Publication date: 2021-02-26

Note: This technology, "Virtual face construction method and device, computer equipment and computer readable medium", was created by 魏舒 and 刘玉宇 on 2020-11-17. Abstract: The application belongs to the technical field of artificial intelligence and provides a virtual face construction method and apparatus based on phoneme text, a computer device, and a computer-readable storage medium. The method acquires a target text, generates audio from the target text in a preset TTS manner, performs phoneme segmentation on the audio to obtain the phoneme text corresponding to the audio, inputs the phoneme text into a preset recurrent neural network model to obtain the face features of a preset human anchor corresponding to the phoneme text, and constructs the virtual face of the preset human anchor corresponding to the phoneme text according to those face features. The virtual face corresponding to the preset human anchor can thus be constructed directly from the input target text, which simplifies the procedure for converting a human anchor into the corresponding virtual anchor and improves the generation efficiency and accuracy of the virtual anchor.

1. A virtual human face construction method based on phoneme texts is characterized by comprising the following steps:

acquiring a target text, and generating audio from the target text in a preset TTS mode;

performing phoneme segmentation on the audio to obtain a phoneme text corresponding to the audio;

inputting the phoneme text into a preset recurrent neural network model to obtain the human face characteristics of a preset human anchor corresponding to the phoneme text;

and constructing a virtual face of the preset human anchor corresponding to the phoneme text according to the face characteristics of the preset human anchor.

2. The method for constructing a virtual face based on phoneme texts as claimed in claim 1, wherein before the step of inputting the phoneme texts into a preset recurrent neural network model to obtain the face features of the preset human anchor corresponding to the phoneme texts, the method further comprises:

acquiring a training video sample recorded by a preset human anchor, wherein the training video sample comprises a training audio sample and the image of the preset human anchor;

performing phoneme segmentation on the training audio sample to obtain a training phoneme text, and extracting the face features of the training human anchor corresponding to the training phoneme text from the image of the preset human anchor;

and inputting the training phoneme text and the human face characteristics of the training human anchor into a preset recurrent neural network model so as to train the preset recurrent neural network model.

3. The method for constructing a virtual face based on phoneme texts as claimed in claim 2, wherein the step of performing phoneme segmentation on the training audio samples to obtain training phoneme texts and extracting the face features of the training human anchor corresponding to the training phoneme texts from the image of the preset human anchor comprises:

acquiring a training audio sample corresponding to the training video sample;

according to the time sequence, performing phoneme segmentation on the training audio sample according to preset phonemes to obtain a training phoneme text sequence arranged according to the time sequence, wherein the training phoneme text sequence comprises a plurality of training phoneme text elements;

according to the time sequence, acquiring a video frame corresponding to the training phoneme text element from the training video sample, wherein the video frame comprises an image of the preset human anchor;

and extracting the appearance feature, the expression feature and the posture feature of the preset human anchor contained in the video frame corresponding to the training phoneme text element so as to obtain the face feature of the training human anchor corresponding to the training phoneme text.

4. The method for constructing a virtual face based on a phoneme text as claimed in claim 1, wherein after the step of constructing the virtual face of the preset human anchor corresponding to the phoneme text according to the human face features of the preset human anchor, the method further comprises:

acquiring all the phoneme texts contained in the audio and virtual faces corresponding to the phoneme texts;

combining all the virtual human faces into a video frame sequence according to the time sequence of the phoneme texts in the audio;

combining the audio with the sequence of video frames to obtain a virtual video.

5. The method for constructing a virtual face based on a phoneme text as claimed in claim 4, wherein after the step of constructing the virtual face of the preset human anchor corresponding to the phoneme text according to the human face features of the preset human anchor, the method further comprises:

generating other audio from the target text in another preset TTS manner, wherein the other audio and the audio are of different audio style types;

combining the other audio with the sequence of video frames to obtain another virtual video.

6. The method for constructing a virtual face based on phoneme texts as claimed in claim 1, wherein before the step of generating audio from the target text in a preset TTS manner, the method further comprises:

counting the text length corresponding to the target text;

judging whether the text length is greater than or equal to a preset text length threshold value or not;

and if the text length is greater than or equal to the preset text length threshold value, executing the step of generating the audio from the target text in a preset TTS mode.

7. The method for constructing a virtual face based on phoneme texts as claimed in claim 6, wherein after the step of generating the audio from the target text in a preset TTS mode if the text length is greater than or equal to the preset text length threshold value, the method further comprises:

if the text length is smaller than the preset text length threshold value, sending an alarm that the target text does not meet the preset requirement.

8. A virtual human face constructing device based on phoneme texts is characterized by comprising:

the first acquisition unit is used for acquiring a target text and generating audio from the target text in a preset TTS mode;

the first segmentation unit is used for carrying out phoneme segmentation on the audio to obtain a phoneme text corresponding to the audio;

the first input unit is used for inputting the phoneme text into a preset recurrent neural network model so as to obtain the human face characteristics of a preset human anchor corresponding to the phoneme text;

and the construction unit is used for constructing the virtual face of the preset human anchor corresponding to the phoneme text according to the human face characteristics of the preset human anchor.

9. A computer device, comprising a memory and a processor coupled to the memory; the memory is used for storing a computer program; the processor is adapted to run the computer program to perform the steps of the method according to any of claims 1-7.

10. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when being executed by a processor, realizes the steps of the method according to any one of claims 1 to 7.

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a virtual face construction method and apparatus based on phoneme text, a computer device, and a computer-readable storage medium.

Background

A virtual anchor is an anchor or customer-service agent that uses an avatar to interact with customers in video, based on technologies such as speech, NLP, and vision. The virtual anchor can solve the problems of the traditional customer-service agent, such as high cost (agents require wages and benefits, and management and training are costly and time-consuming) and unstable working quality (easily affected by emotion, fatigue, and the like), thereby reducing a company's customer-service cost, improving customer satisfaction, and ensuring stable working quality.

At present, conventional virtual anchor technologies mostly construct virtual video images from audio, or from audio combined with images; the corresponding audio, or audio and images, must be recorded as input, and customizing different virtual anchors requires either learning different audio characteristics or collecting a large amount of video/audio recorded by different real-person anchors for generalized learning. Conventional technologies therefore suffer from low efficiency in constructing virtual anchors.

Disclosure of Invention

The application provides a virtual face construction method and device based on phoneme texts, computer equipment and a computer readable storage medium, which can solve the problem of low efficiency in constructing a virtual anchor in the prior art.

In a first aspect, the present application provides a virtual face construction method based on phoneme texts, where the method includes: acquiring a target text, and generating audio from the target text in a preset TTS mode; performing phoneme segmentation on the audio to obtain a phoneme text corresponding to the audio; inputting the phoneme text into a preset recurrent neural network model to obtain the human face characteristics of a preset human anchor corresponding to the phoneme text; and constructing a virtual face of the preset human anchor corresponding to the phoneme text according to the face characteristics of the preset human anchor.

In a second aspect, the present application further provides a virtual face constructing apparatus based on a phoneme text, including: the first acquisition unit is used for acquiring a target text and generating audio from the target text in a preset TTS mode; the first segmentation unit is used for carrying out phoneme segmentation on the audio to obtain a phoneme text corresponding to the audio; the first input unit is used for inputting the phoneme text into a preset recurrent neural network model so as to obtain the human face characteristics of a preset human anchor corresponding to the phoneme text; and the construction unit is used for constructing the virtual face of the preset human anchor corresponding to the phoneme text according to the human face characteristics of the preset human anchor.

In a third aspect, the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the virtual face construction method based on phoneme text when executing the computer program.

In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program, which when executed by a processor, causes the processor to execute the steps of the phoneme text based virtual face construction method.

The application provides a virtual face construction method and device based on phoneme texts, computer equipment and a computer readable storage medium. By acquiring an input target text, generating audio from the target text in a preset TTS manner, performing phoneme segmentation on the audio to obtain the phoneme text corresponding to the audio, inputting the phoneme text into a preset recurrent neural network model to obtain the face features of the preset human anchor corresponding to the phoneme text, and constructing the virtual face of the preset human anchor according to those face features, the virtual face corresponding to the preset human anchor can be constructed directly from the input target text. This simplifies the procedure for converting a human anchor into the corresponding virtual anchor and improves the generation efficiency and accuracy of the virtual anchor.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a schematic flowchart of a virtual face construction method based on phoneme texts according to an embodiment of the present application;

fig. 2 is a schematic view of a first sub-flow of a virtual face construction method based on phoneme texts according to an embodiment of the present application;

fig. 3 is a second sub-flow diagram of a virtual face construction method based on phoneme texts according to an embodiment of the present application;

fig. 4 is a third sub-flow diagram of a virtual face construction method based on phoneme texts according to an embodiment of the present application;

fig. 5 is a fourth sub-flow diagram of a virtual face construction method based on phoneme texts according to an embodiment of the present application;

fig. 6 is a fifth sub-flow diagram of a virtual face construction method based on phoneme texts according to an embodiment of the present application;

fig. 7 is a schematic block diagram of a virtual face construction apparatus based on phoneme texts according to an embodiment of the present application; and

fig. 8 is a schematic block diagram of a computer device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Referring to fig. 1, fig. 1 is a schematic flowchart of a virtual face construction method based on phoneme texts according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps S11-S14:

S11, acquiring a target text, and generating audio from the target text in a preset TTS mode.

TTS (Text To Speech, "from text to speech") is the process of intelligently converting text into speech.

Specifically, in the embodiment of the application, when the virtual face is constructed based on the phoneme text, only the target text needs to be input, and the virtual face can be constructed according to the input target text. After a target text is acquired, audio corresponding to the target text is generated from the target text in a preset TTS manner. TTS methods include the "splicing method" and the "parametric method". The splicing method prepares a large amount of speech in advance, splices it from basic units (such as syllables or phonemes), and extracts the target synthesized sound from the prepared speech. The parametric method generates speech parameters (including fundamental frequency, formant frequencies, and the like) at each moment according to a statistical model and then converts the parameters into waveforms.
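By way of illustration only, step S11 might look like the following minimal Python sketch, which assumes the offline TTS library pyttsx3 as the "preset TTS manner"; the patent does not prescribe a particular TTS engine, and the file name is likewise illustrative.

```python
import pyttsx3  # assumed TTS engine; the patent does not name one

def text_to_audio(target_text: str, wav_path: str = "target.wav") -> str:
    """Generate audio from the target text in a (preset) TTS manner."""
    engine = pyttsx3.init()                      # initialize the TTS backend
    engine.save_to_file(target_text, wav_path)   # queue synthesis to a file
    engine.runAndWait()                          # block until synthesis ends
    return wav_path

audio_path = text_to_audio("Hello, and welcome to today's broadcast.")
```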

S12, carrying out phoneme segmentation on the audio to obtain a phoneme text corresponding to the audio.

A phoneme (Phone in English) is the smallest unit of speech, divided according to the natural attributes of speech; it is analyzed from the articulatory actions within a syllable, and one action constitutes one phoneme.

Specifically, after audio is generated from the acquired target text, the audio is segmented and aligned to obtain the phoneme text corresponding to the target text. Virtual faces are then constructed according to the phoneme text, and when the virtual faces and the audio are played as a virtual video, the phoneme text corresponds to each frame in the virtual video.

S13, inputting the phoneme text into a preset recurrent neural network model to obtain the human face characteristics of the preset human anchor corresponding to the phoneme text.

S14, constructing a virtual face of the preset human anchor corresponding to the phoneme text according to the face features of the preset human anchor.

The recurrent neural network (Recurrent Neural Network in English) is abbreviated as RNN.

Specifically, a preset recurrent neural network model is trained first. To train the preset recurrent neural network model, a training video sample of the human anchor is recorded, the training audio sample contained in the training video sample is segmented to obtain a training phoneme text, and the training phoneme text, together with the images of the human anchor corresponding to the training phoneme text contained in the training video sample, is input into the preset recurrent neural network model, so that the recurrent neural network model automatically learns the association between the training phoneme text and the face features of the corresponding human-anchor images, matching the training phoneme text with the face features of the human anchor contained in those images.

After the phoneme text corresponding to the acquired target text is input into the preset recurrent neural network model, the model applies the association learned during training between training phoneme texts and the face features of the human anchor contained in the human-anchor images to obtain the face features of the preset human anchor contained in each video frame corresponding to the phoneme text. Combining the relationship between face features and faces, a virtual face corresponding to the face features of the preset human anchor is constructed, so as to obtain the virtual face of the preset human anchor corresponding to the phoneme text; the virtual face is rendered and then matched with the audio to obtain the virtual anchor video.

In this embodiment of the application, by acquiring the target text, generating audio from the target text in a preset TTS manner, performing phoneme segmentation on the audio to obtain the phoneme text corresponding to the audio, inputting the phoneme text into the preset recurrent neural network model to obtain the face features of the preset human anchor corresponding to the phoneme text, and constructing the virtual face of the preset human anchor corresponding to the phoneme text according to those face features, the virtual face corresponding to the preset human anchor can be constructed directly from the input target text. This simplifies the procedure for converting a human anchor into the corresponding virtual anchor and improves the generation efficiency and accuracy of the virtual anchor.

Referring to fig. 2, fig. 2 is a schematic sub-flow chart of a virtual face construction method based on phoneme texts according to an embodiment of the present application. As shown in fig. 2, in this embodiment, before the step of inputting the phoneme text into a preset recurrent neural network model to obtain a face feature of a preset human anchor corresponding to the phoneme text, the method further includes:

S21, acquiring a training video sample recorded by a preset human anchor, wherein the training video sample comprises a training audio sample and the image of the preset human anchor;

S22, performing phoneme segmentation on the training audio sample to obtain a training phoneme text, and extracting the face features of the training human anchor corresponding to the training phoneme text from the image of the preset human anchor;

S23, inputting the training phoneme text and the human face characteristics of the training human anchor into a preset recurrent neural network model so as to train the preset recurrent neural network model.

Specifically, before the step of inputting the phoneme text into the preset recurrent neural network model to obtain the face features of the preset human anchor corresponding to the phoneme text, the preset recurrent neural network model is trained with a training video sample corresponding to the preset human anchor. When recording the training video sample over a preset time period, the front of the preset human anchor's face should be recorded with the whole face exposed, using supplementary lighting and a microphone during recording. For example, a training video sample of 2-3 hours is sufficient to meet the training requirements of the preset recurrent neural network model in this embodiment. The training video sample comprises a training audio sample and the image of the preset human anchor; phoneme segmentation is performed on the training audio sample to obtain a training phoneme text, and the face features of the training human anchor corresponding to the training phoneme text are extracted from the training video sample. For example, 3DMM features of the face (3DMM, 3D Morphable Model) may be extracted, where each phoneme corresponds to one video frame of the video.

The training phoneme text and the face features of the training human anchor are input into the preset recurrent neural network model to train it. For example, the obtained phoneme text is passed through an Embedding layer to obtain the input vectors of the preset recurrent neural network model, and the pre-extracted 150-dimensional 3DMM features serve as the Label of the model. Training the preset recurrent neural network model comprises the following two processes:

1) Forward propagation: the Embedding feature vectors pass through two layers of bidirectional RNNs to obtain a 150-dimensional prediction (Predict). 2) Backward propagation: a loss such as MSE, MAE, L1, or L2 is computed from Predict and Label, and the gradients are updated by backpropagation. When the preset recurrent neural network model is subsequently used to construct a virtual face, only a phoneme text (of assumed length N) needs to be input, and an N x 150 dimensional 3DMM feature prediction is obtained by the forward propagation process, one row for each of the N video frames; these predictions are used for the per-frame face reconstruction of the subsequent virtual anchor, yielding a complete and continuous virtual anchor video.
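The following PyTorch sketch illustrates the two processes above under stated assumptions: the patent does not specify the phoneme vocabulary size, embedding dimension, hidden size, or RNN cell, so those choices (including the LSTM cell) are illustrative, and the tensors are dummies.

```python
import torch
import torch.nn as nn

class Phoneme2Face(nn.Module):
    """Maps a phoneme sequence of length N to N x 150 3DMM face features."""
    def __init__(self, num_phonemes=70, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, emb_dim)   # phoneme IDs -> Embedding vectors
        self.rnn = nn.LSTM(emb_dim, hidden, num_layers=2,  # two bidirectional RNN layers
                           bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, 150)             # 150-dim Predict per frame

    def forward(self, phoneme_ids):                        # (batch, N)
        x = self.embed(phoneme_ids)
        x, _ = self.rnn(x)
        return self.head(x)                                # (batch, N, 150)

# One training step: MSE loss between Predict and the 3DMM Label,
# followed by a gradient update (backward propagation).
model = Phoneme2Face()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
phoneme_ids = torch.randint(0, 70, (1, 40))   # dummy phoneme-ID sequence, N = 40
label = torch.randn(1, 40, 150)               # dummy 150-dim 3DMM Label per frame
loss = nn.MSELoss()(model(phoneme_ids), label)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```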

Steps 1) and 2) are repeated until convergence, so that the preset recurrent neural network model adjusts its parameters through automatic learning; in this way, the model is trained with training video samples corresponding to the preset human anchor. Different preset RNN architectures, losses, optimization methods, and model-parameter initialization schemes can be tried to adapt to different types of data. Because this embodiment can use training video samples of relatively short duration, for example 2-3 hours of recorded audio and video sample data, to meet the training requirements of the preset recurrent neural network model, a recurrent neural network model with good performance can be trained and an accurate virtual face can be constructed from the input target text. Compared with conventional technologies, which require a large number of training video samples or samples of long duration, this embodiment greatly simplifies the training process of the preset recurrent neural network model, avoids the need to process a large amount of audio and video simultaneously during training, greatly reduces the amount of data used in training, improves the training efficiency and accuracy of the model, reduces to a great extent the amount of training video sample data required to customize a virtual anchor, and greatly reduces the time cost of generating the recurrent neural network model.

Referring to fig. 3, fig. 3 is a schematic view of a second sub-flow of a virtual face construction method based on phoneme texts according to an embodiment of the present application. As shown in fig. 3, in this embodiment, the step of performing phoneme segmentation on the training audio sample to obtain a training phoneme text, and extracting a face feature of the training human anchor corresponding to the training phoneme text from the image of the preset human anchor includes:

s31, obtaining a training audio sample corresponding to the training video sample;

s32, performing phoneme segmentation on the training audio sample according to a preset phoneme according to a time sequence to obtain a training phoneme text sequence arranged according to the time sequence, wherein the training phoneme text sequence comprises a plurality of training phoneme text elements;

s33, acquiring video frames corresponding to the training phoneme text elements from the training video samples according to the time sequence, wherein the video frames comprise the images of the preset human anchor;

S34, extracting the appearance feature, the expression feature and the posture feature of the preset human anchor contained in the video frame corresponding to the training phoneme text element to obtain the face feature of the training human anchor corresponding to the training phoneme text.

Specifically, the audio contained in the training video sample is a set of phonemes, the video contained in the training video sample is a set of video frames, and the phoneme elements in the phoneme set have a chronological correspondence with the video-frame elements in the video-frame set. Therefore, after the training video sample is obtained, the training audio sample corresponding to it is acquired; the training audio sample is a chronologically ordered set of training phoneme text elements, and phoneme segmentation is performed on it according to preset phonemes in chronological order, so as to obtain a training phoneme text sequence, arranged in chronological order, that contains a plurality of training phoneme text elements. For example, the phoneme segmentation results may be as follows:

Intervals[1]: xmin=0.000 xmax=0.700 text="sil";
Intervals[2]: xmin=0.700 xmax=0.780 text="HH";
Intervals[3]: xmin=0.780 xmax=0.920 text="IY1";
Intervals[4]: xmin=0.920 xmax=1.070 text="HH";

where Intervals[i] denotes the i-th phoneme, and xmin and xmax denote its start time and end time, respectively, and so on.

A phoneme text corresponding to the video frames is then generated according to the segmentation result; in one example, the generated text is as follows:

["sil","sil","sil","HH","HH","HH","IY1","IY1","IY1","HH","HH","HH","HH"]

where each phoneme corresponds to one frame of the video.
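As a hedged sketch of this interval-to-frame expansion (the patent gives no code for it), each interval can be sampled at the video frame rate; the 25 fps rate and the hard-coded interval list are assumptions taken from the example above.

```python
# (xmin, xmax, phoneme) triples from the segmentation example above.
intervals = [(0.000, 0.700, "sil"), (0.700, 0.780, "HH"),
             (0.780, 0.920, "IY1"), (0.920, 1.070, "HH")]

def intervals_to_frames(intervals, fps=25.0):
    """Assign one phoneme label to every video frame by its midpoint time."""
    n_frames = int(intervals[-1][1] * fps)       # total duration * frame rate
    frames = []
    for i in range(n_frames):
        t = (i + 0.5) / fps                      # midpoint time of frame i
        frames.append(next(p for lo, hi, p in intervals if lo <= t < hi))
    return frames

print(intervals_to_frames(intervals))            # ['sil', ..., 'HH', 'IY1', ..., 'HH']
```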

Video frames corresponding to the training phoneme text elements are acquired from the training video sample in chronological order, where the video frames contain the image of the preset human anchor. The face features of the training human anchor contained in the video frame corresponding to each training phoneme text element are then extracted; these face features may include the appearance, expression, and posture features of the face. In this way, the face features of the training human anchor corresponding to the chronologically arranged training phoneme text are obtained: the training phoneme text sequence contained in the training audio sample, together with the face features of the training human anchor in each video frame corresponding to that sequence.

When extracting the face features from the training video sample, 3DMM face features (3DMM, 3D Morphable Model, i.e., a statistical model of 3D face deformation) can be extracted. 3DMM features are extracted from each video frame of the training video sample, and dimensions [0-80] can be selected as shape features (namely, appearance features), dimensions [80-144] as expression features, and dimensions [224-227] and [254-257] as pose features, so as to obtain the 150-dimensional face features of the training human anchor. Rich human-anchor face features are thus obtained in terms of appearance, expression, pose, and the like, which improves the richness, vividness, and realism of the constructed virtual face. The shape features describe the shape of the face by extracting key points of the facial shape; expression feature extraction locates and extracts organ features, texture regions, and predefined feature points of the face; and the pose features describe the attitude of the face, such as looking up, looking down, or looking sideways.
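The dimension selection just described can be sketched as follows; the 257-dimensional per-frame coefficient vector is a placeholder, and the index layout is the one implied by the dimensions quoted above.

```python
import numpy as np

coeffs = np.random.randn(257)                # placeholder per-frame 3DMM coefficients

shape      = coeffs[0:80]                    # appearance (shape) features
expression = coeffs[80:144]                  # expression features
pose       = np.concatenate([coeffs[224:227],    # rotation part of the pose
                             coeffs[254:257]])   # translation part of the pose

face_features = np.concatenate([shape, expression, pose])
assert face_features.shape == (150,)         # matches the 150-dim Label above
```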

Referring to fig. 4, fig. 4 is a third sub-flow diagram of a virtual face construction method based on phoneme texts according to an embodiment of the present application. As shown in fig. 4, in this embodiment, after the step of constructing the virtual face of the preset human anchor corresponding to the phoneme text according to the human face features of the preset human anchor, the method further includes:

s41, acquiring all the phoneme texts contained in the audio and virtual faces corresponding to the phoneme texts;

s42, combining all the virtual human faces into a video frame sequence according to the time sequence of the phoneme texts in the audio;

s43, combining the audio and the video frame sequence to obtain the virtual video.

Specifically, after the virtual faces corresponding to the phoneme texts are constructed, all the phoneme texts contained in the audio are sorted according to their chronological order in the audio, so that a phoneme text ordering consistent with the audio is obtained. The virtual faces corresponding to the phoneme texts are arranged in the order of that phoneme text ordering and combined into a video frame sequence, and the audio is then combined with the video frame sequence to obtain a virtual video. In this way, when constructing a virtual anchor video, the corresponding video can be generated directly from the phoneme text without recording audio data, which reduces the customer's usage cost and improves usage efficiency.
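Steps S42-S43 might be realized as in the sketch below, which assumes the moviepy 1.x library for muxing (the patent does not name a tool); the frame rate, the list of rendered face frames, and the file names are illustrative.

```python
from moviepy.editor import ImageSequenceClip, AudioFileClip

def build_virtual_video(face_frames, audio_path, out_path="virtual.mp4", fps=25):
    """Combine rendered virtual-face frames (in phoneme order) with the audio."""
    clip = ImageSequenceClip(face_frames, fps=fps)     # frames -> video frame sequence
    clip = clip.set_audio(AudioFileClip(audio_path))   # attach the TTS audio track
    clip.write_videofile(out_path, fps=fps)
    return out_path
```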

Referring to fig. 5, fig. 5 is a fourth sub-flow diagram of a virtual face construction method based on phoneme texts according to an embodiment of the present application. As shown in fig. 5, in this embodiment, after the step of constructing the virtual face of the preset human anchor corresponding to the phoneme text according to the human face features of the preset human anchor, the method further includes:

S51, generating other audio from the target text in another preset TTS manner, wherein the other audio and the audio are of different audio style types;

S52, combining the other audio with the video frame sequence to obtain another virtual video.

Specifically, different TTS (Text-To-Speech) tools can be used to generate audio of different styles and types for the same target text. Because audio of different style types generated from the same target text shares the same phoneme text and the same timing of that phoneme text, audio of different style types can each be combined with the same video frame sequence to obtain videos with different audio styles. This makes it possible to give the same video different voices according to different scenario requirements, providing different voice options for the same anchor's video without re-recording the video and without redoing data processing, training, or model adjustment to regenerate the model, which improves the efficiency of generating videos of various styles and types and the applicability of the generated videos.
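Continuing the hedged moviepy sketch above, the same rendered frame sequence can be muxed with several style variants of the audio; the style names and audio files are assumptions.

```python
# Reuse the same rendered frames with audio of different style types.
for style, audio_path in [("news", "news.wav"), ("casual", "casual.wav")]:
    build_virtual_video(face_frames, audio_path, out_path=f"virtual_{style}.mp4")
```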

Referring to fig. 6, fig. 6 is a fifth sub-flow diagram of a virtual face construction method based on phoneme texts according to an embodiment of the present application. As shown in fig. 6, in this embodiment, before the step of generating audio from the target text in a preset TTS manner, the method further includes:

s61, counting the text length corresponding to the target text;

s62, judging whether the text length is larger than or equal to a preset text length threshold value;

S63, if the text length is greater than or equal to the preset text length threshold value, executing the step of generating audio from the target text in a preset TTS mode;

S64, if the text length is smaller than the preset text length threshold value, sending an alarm that the target text does not meet the preset requirement.

Specifically, in extreme cases the target text may not satisfy the conditions for conversion into audio, for example when it is only an interjection or a short phrase; in such cases the target text cannot be converted into speech and then used to construct a virtual anchor video. Therefore, before generating audio from the target text in a preset TTS manner, the text length corresponding to the target text can be counted, for example by counting the number of characters contained in the target text. Whether the text meets the requirement for conversion into speech is then judged from the text length, by checking whether the text length is greater than or equal to a preset text length threshold; for example, the preset text length threshold may be 2 characters or 5 characters. If the target text is greater than or equal to the preset text length threshold, the step of generating audio from the target text in a preset TTS manner is executed. If the target text is smaller than the preset text length threshold, that is, the target text is too short, an alarm is issued that the target text does not meet the preset requirement, so that no virtual anchor is subsequently constructed from a target text that fails the requirement.
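A trivial sketch of steps S61-S64 follows; the threshold value and the alarm mechanism (an exception) are illustrative assumptions.

```python
MIN_TEXT_LEN = 5  # assumed preset text length threshold, in characters

def validate_target_text(target_text: str) -> None:
    """Raise an alarm if the target text is too short to convert to audio."""
    if len(target_text) < MIN_TEXT_LEN:
        raise ValueError("target text does not meet the preset length requirement")

validate_target_text("Hello, welcome!")  # passes; a bare interjection would not
```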

It should be noted that, in the virtual face construction method based on the phoneme text described in the above embodiments, the technical features included in different embodiments may be recombined as needed to obtain a combined implementation, but all of the embodiments are within the scope of the present application.

Referring to fig. 7, fig. 7 is a schematic block diagram of a virtual face construction apparatus based on phoneme texts according to an embodiment of the present application. Corresponding to the virtual face construction method based on the phoneme text, the embodiment of the application also provides a virtual face construction device based on the phoneme text. As shown in fig. 7, the phoneme text based virtual face construction apparatus includes units for performing the phoneme text based virtual face construction method, and the apparatus may be configured in a computer device. Specifically, referring to fig. 7, the virtual face construction apparatus 70 based on phoneme text includes a first obtaining unit 71, a first segmentation unit 72, a first input unit 73, and a construction unit 74.

The first obtaining unit 71 is configured to obtain a target text, and generate audio from the target text in a preset TTS manner;

a first segmentation unit 72, configured to perform phoneme segmentation on the audio to obtain a phoneme text corresponding to the audio;

a first input unit 73, configured to input the phoneme text into a preset recurrent neural network model, so as to obtain a human face feature of a preset human anchor corresponding to the phoneme text;

and the constructing unit 74 is configured to construct a virtual face of the preset human anchor corresponding to the phoneme text according to the face features of the preset human anchor.

In an embodiment, the virtual face construction apparatus 70 based on phoneme text further includes:

the second acquisition unit is used for acquiring a training video sample recorded by a preset human anchor, wherein the training video sample comprises a training audio sample and the image of the preset human anchor;

the second segmentation unit is used for performing phoneme segmentation on the training audio sample to obtain a training phoneme text, and extracting the face features of the training human anchor corresponding to the training phoneme text from the image of the preset human anchor;

and the training unit is used for inputting the training phoneme text and the human face characteristics of the training human anchor into a preset recurrent neural network model so as to train the preset recurrent neural network model.

In one embodiment, the second segmentation unit includes:

the first obtaining subunit is used for obtaining a training audio sample corresponding to the training video sample;

the first molecular unit is used for carrying out phoneme segmentation on the training audio sample according to a preset phoneme according to a time sequence so as to obtain a training phoneme text sequence arranged according to the time sequence, and the training phoneme text sequence comprises a plurality of training phoneme text elements;

a second obtaining subunit, configured to obtain, according to the time sequence, a video frame corresponding to the training phoneme text element from the training video sample, where the video frame includes an image of the preset human anchor;

and the extracting subunit is used for extracting the appearance characteristic, the expression characteristic and the posture characteristic of the preset human anchor contained in the video frame corresponding to the training phoneme text element so as to obtain the face characteristic of the training human anchor corresponding to the training phoneme text.

In an embodiment, the virtual face construction apparatus 70 based on phoneme text further includes:

a third obtaining unit, configured to obtain all the phoneme texts included in the audio and a virtual face corresponding to the phoneme texts;

the first combination unit is used for combining all the virtual human faces into a video frame sequence according to the time sequence of the phoneme texts in the audio;

a second combining unit for combining the audio with the sequence of video frames to obtain a virtual video.

In an embodiment, the virtual face construction apparatus 70 based on phoneme text further includes:

the generating unit is used for generating other audio from the target text in another preset TTS manner, wherein the other audio and the audio are of different audio style types;

a third combining unit for combining the other audio with the video frame sequence to obtain another virtual video.

In an embodiment, the virtual face construction apparatus 70 based on phoneme text further includes:

the statistical unit is used for counting the text length corresponding to the target text;

the judging unit is used for judging whether the text length is larger than or equal to a preset text length threshold value or not;

and the execution unit is used for executing the step of generating the audio by the target text in a preset TTS mode if the text length is greater than or equal to the preset text length threshold value.

In an embodiment, the virtual face construction apparatus 70 based on phoneme text further includes:

and the sending unit is used for sending an alarm that the target text does not meet the preset requirement if the text length is smaller than the preset text length threshold.

It should be noted that, as can be clearly understood by those skilled in the art, the above-mentioned virtual face construction apparatus based on phoneme text and the specific implementation process of each unit may refer to the corresponding description in the foregoing method embodiment, and for convenience and brevity of description, no further description is provided here.

Meanwhile, the division and connection of the units in the virtual face construction device based on the phoneme text are only illustrative; in other embodiments, the virtual face construction device based on the phoneme text may be divided into different units as required, and the units may adopt different connection orders and manners to complete all or part of the functions of the device.

The above-described virtual face construction apparatus based on phoneme text may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 8.

Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a computer device such as a desktop computer or a server, or may be a component or part of another device.

Referring to fig. 8, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected by a system bus 501, wherein the memory may include a non-volatile storage medium 503 and an internal memory 504, and the memory may also be a volatile computer-readable storage medium.

The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform a method for constructing a virtual face based on phoneme text as described above.

The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.

The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 may be enabled to execute a virtual face construction method based on the phoneme text.

The network interface 505 is used for network communication with other devices. Those skilled in the art will appreciate that the configuration shown in fig. 8 is a block diagram of only a portion of the configuration relevant to the present application and does not constitute a limitation on the computer device 500 to which the present application is applied; a particular computer device 500 may include more or fewer components than those shown, or combine certain components, or have a different arrangement of components. For example, in some embodiments, the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 8, and are not described herein again.

Wherein the processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps: acquiring a target text, and generating audio from the target text in a preset TTS mode; performing phoneme segmentation on the audio to obtain a phoneme text corresponding to the audio; inputting the phoneme text into a preset recurrent neural network model to obtain the human face characteristics of a preset human anchor corresponding to the phoneme text; and constructing a virtual face of the preset human anchor corresponding to the phoneme text according to the face characteristics of the preset human anchor.

In an embodiment, before implementing the step of inputting the phoneme text into a preset recurrent neural network model to obtain a face feature of a preset human anchor corresponding to the phoneme text, the processor 502 further implements the following steps:

acquiring a training video sample recorded by a preset human anchor, wherein the training video sample comprises a training audio sample and the image of the preset human anchor;

performing phoneme segmentation on the training audio sample to obtain a training phoneme text, and extracting the face features of the training human anchor corresponding to the training phoneme text from the image of the preset human anchor;

and inputting the training phoneme text and the human face characteristics of the training human anchor into a preset recurrent neural network model so as to train the preset recurrent neural network model.

In an embodiment, when implementing the step of performing phoneme segmentation on the training audio sample to obtain a training phoneme text, and extracting a face feature of the training human anchor corresponding to the training phoneme text from the image of the preset human anchor, the processor 502 specifically implements the following steps:

acquiring a training audio sample corresponding to the training video sample;

according to the time sequence, performing phoneme segmentation on the training audio sample according to preset phonemes to obtain a training phoneme text sequence arranged according to the time sequence, wherein the training phoneme text sequence comprises a plurality of training phoneme text elements;

according to the time sequence, acquiring a video frame corresponding to the training phoneme text element from the training video sample, wherein the video frame comprises an image of the preset human anchor;

and extracting the appearance feature, the expression feature and the posture feature of the preset human anchor contained in the video frame corresponding to the training phoneme text element so as to obtain the face feature of the training human anchor corresponding to the training phoneme text.

In an embodiment, after the step of constructing the virtual face of the preset human anchor corresponding to the phoneme text according to the human face features of the preset human anchor is implemented by the processor 502, the following steps are further implemented:

acquiring all the phoneme texts contained in the audio and virtual faces corresponding to the phoneme texts;

combining all the virtual human faces into a video frame sequence according to the time sequence of the phoneme texts in the audio;

combining the audio with the sequence of video frames to obtain a virtual video.

In an embodiment, after the step of constructing the virtual face of the preset human anchor corresponding to the phoneme text according to the human face features of the preset human anchor is implemented by the processor 502, the following steps are further implemented:

generating other audio from the target text in another preset TTS manner, wherein the other audio and the audio are of different audio style types;

combining the other audio with the sequence of video frames to obtain another virtual video.

In an embodiment, before the step of generating audio from the target text in a preset TTS manner, the processor 502 further performs the following steps:

counting the text length corresponding to the target text;

judging whether the text length is greater than or equal to a preset text length threshold value or not;

and if the text length is greater than or equal to the preset text length threshold value, executing the step of generating the audio from the target text in a preset TTS mode.

In an embodiment, after implementing the step of generating the audio from the target text in a preset TTS manner if the text length is greater than or equal to the preset text length threshold, the processor 502 further implements the following steps:

and if the text length is smaller than the preset text length threshold value, sending an alarm that the target text does not meet the preset requirement.

It should be understood that in the embodiment of the present Application, the Processor 502 may be a Central Processing Unit (CPU), and the Processor 502 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

It will be understood by those skilled in the art that all or part of the processes in the method for implementing the above embodiments may be implemented by a computer program, and the computer program may be stored in a computer readable storage medium. The computer program is executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.

Accordingly, the present application also provides a computer-readable storage medium. The computer readable storage medium may be a non-volatile computer readable storage medium, or may be a volatile computer readable storage medium, and the computer readable storage medium stores a computer program, and when the computer program is executed by a processor, the computer program causes the processor to execute the steps of the virtual face construction method based on the phoneme text described in the above embodiments.

The computer readable storage medium may be an internal storage unit of the aforementioned device, such as a hard disk or a memory of the device. The computer readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the apparatus.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The storage medium is a physical, non-transitory storage medium, and may be any of various physical storage media capable of storing computer programs, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and the components and steps of the examples have been described generally in terms of their functionality in the foregoing description in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, various elements or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented.

The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the application can be combined, divided and deleted according to actual needs. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing an electronic device (which may be a personal computer, a terminal, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
