Audio and video synthesis method

Document No.: 1568591    Publication date: 2020-01-24

Abstract: This invention, "An audio and video synthesis method" (一种音视频合成方法), was created by 孙见青 on 2019-09-25. The invention provides an audio and video synthesis method. Unlike the prior art, which requires extracting and modeling key points of video images, the method involves no design of video image key points during synthesis; instead, it is driven entirely by the input text data to generate the audio and the video and to combine the two. When generating the video, the method fully considers the relationship between successive video frames and further optimizes the video corresponding to the text data with a coarse-to-fine generation scheme, so that the generated video meets specific requirements on both picture continuity and picture definition, thereby greatly reducing the difficulty of audio and video synthesis and improving its quality.

1. An audio and video synthesis method is characterized by comprising the following steps:

step (1), inputting text data into a trained deep neural network model to obtain acoustic parameters and variational autoencoder (VAE) hidden-layer features corresponding to the text data;

step (2), obtaining a target audio corresponding to the text data according to the acoustic parameters;

step (3), obtaining a target video corresponding to the text data according to the VAE hidden-layer features;

and step (4), combining the target audio and the target video to realize audio and video synthesis.

2. The audio and video synthesis method according to claim 1, characterized in that:

in the step (1), inputting text data into the trained deep neural network model to obtain the acoustic parameters and the variational autoencoder (VAE) hidden-layer features corresponding to the text data specifically comprises,

step (101), building a text-audio-video database for said training;

step (102), training the deep neural network model based on all data in the text-audio-video database;

and step (103), inputting the text data into the trained deep neural network model, and obtaining the acoustic parameters and the VAE hidden-layer features corresponding to the text data from the output of the deep neural network model.

3. The audio and video synthesis method according to claim 2, characterized in that:

in said step (101), building the text-audio-video database for said training specifically comprises,

acquiring audio data and video data related to different scenes, extracting the corresponding acoustic parameters from the audio data, extracting the corresponding VAE hidden-layer features from each frame image of the video data, and recording and labeling the acoustic parameters and the VAE hidden-layer features according to a preset data structure to obtain the text-audio-video database.

4. The audio and video synthesis method according to claim 1, characterized in that:

in the step (2), obtaining the target audio corresponding to the text data according to the acoustic parameters specifically comprises,

step (201), inputting the text data into the trained network model, thereby obtaining the acoustic parameters;

and step (202), inputting the acoustic parameters into a vocoder, so that the target audio is obtained through vocoder reconstruction.

5. The audio and video synthesis method according to claim 1, characterized in that:

in the step (3), obtaining the target video corresponding to the text data according to the variational autoencoder (VAE) hidden-layer features specifically comprises,

step (301), transforming the VAE hidden-layer features to obtain a primary video;

step (302), transforming the primary video with a preset algorithm model to obtain a natural video related to the text data;

and step (303), generating the target video according to the natural video.

6. The audio and video synthesis method according to claim 5, characterized in that:

in said step (301), transforming the variational autoencoder (VAE) hidden-layer features to obtain the primary video specifically comprises,

step (3011), performing VAE decoding on the VAE hidden-layer features, so as to obtain a plurality of frame images related to said text data;

and step (3012), combining the plurality of frame images according to a preset image combination mode to obtain the primary video.

7. The audio and video synthesis method according to claim 5, characterized in that:

in the step (302), transforming the primary video with the preset algorithm model to obtain the natural video related to the text data specifically comprises,

step (3021), constructing a Video-to-Video Synthesis model as the preset algorithm model based on a conditional generative adversarial network (CGAN);

step (3022), training the Video-to-Video Synthesis model;

and step (3023), inputting the primary video into the trained Video-to-Video Synthesis model, and obtaining the natural video from the model output.

8. The audio and video synthesis method according to claim 7, characterized in that:

in the step (3021), constructing the Video-to-Video Synthesis model as the preset algorithm model based on the conditional generative adversarial network (CGAN) specifically comprises constructing the Video-to-Video Synthesis model based on the conditional generative adversarial network (CGAN), a preset video image frame sequence condition and a preset video precision adjustment mode;

alternatively,

in the step (3022), training the Video-to-Video Synthesis model specifically comprises,

training the Video-to-Video Synthesis model with the video-related data in the text-audio-video database used for training the deep neural network model, so that the video transformation characteristic of the Video-to-Video Synthesis model meets a preset characteristic condition.

9. The audio and video synthesis method according to claim 7, characterized in that:

in the step (3023), inputting the primary video into the trained Video-to-Video Synthesis model and obtaining the natural video from the model output specifically comprises,

step (30231), acquiring the video conversion precision and/or the video conversion continuity corresponding to the Video-to-Video Synthesis model;

step (30232), training the Video-to-Video Synthesis model based on the video conversion precision and/or the video conversion continuity;

and step (30233), inputting the primary video into the trained Video-to-Video Synthesis model, so as to obtain the natural video through conversion.

10. The audio and video synthesis method according to claim 1, characterized in that:

in the step (4), combining the target audio and the target video to realize audio and video synthesis specifically comprises,

step (401), obtaining audio characteristics corresponding to the target audio and video characteristics corresponding to the target video;

and step (402), combining the target audio and the target video through a preset audio and video synthesizer according to the audio characteristics and the video characteristics.

Technical Field

The invention relates to the technical field of intelligent signal processing, in particular to an audio and video synthesis method.

Background

Conventional speech synthesis is limited to synthesizing audio signals and does not involve the simultaneous synthesis of audio and video signals. With changing application scenarios and the wide adoption of human-computer interaction technology, audio and video synthesis, which combines audio signals with video signals, is receiving wide attention. In the prior art, video synthesis is generally performed either by modeling key points in the video images or by reducing the dimensionality of the video images and then modeling them.

Although modeling the video image allows the image to be accurately calibrated, the modeling process involves relatively complicated steps, including extracting key-point information from the video image. Moreover, whether the key points of the video image are modeled directly or the video image is modeled after dimensionality reduction, problems such as poor continuity and low definition of the synthesized video generally arise, so that audio and video signals synthesized in this modeling manner cannot meet the continuity and definition requirements placed on them. Therefore, a synthesis method that can obtain high-quality audio and video synthesis signals simply, conveniently and quickly is urgently needed.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides an audio and video synthesis method, which comprises the following steps: step (1), inputting text data into a trained deep neural network model to obtain acoustic parameters and variational autoencoder (VAE) hidden-layer features corresponding to the text data; step (2), obtaining a target audio corresponding to the text data according to the acoustic parameters; step (3), obtaining a target video corresponding to the text data according to the VAE hidden-layer features; and step (4), combining the target audio and the target video to realize audio and video synthesis. Unlike the prior art, which requires extracting and modeling key points of video images, this audio and video synthesis method involves no design of video image key points during synthesis; instead, it is driven entirely by the input text data to generate the audio and the video and to combine them. When generating the video, the method fully considers the relationship between successive video frames and further optimizes the video corresponding to the text data with a coarse-to-fine generation scheme, so that the generated video meets specific requirements on picture continuity and picture definition, thereby greatly reducing the difficulty of audio and video synthesis and improving its quality.

The invention provides an audio and video synthesis method which is characterized by comprising the following steps:

step (1), inputting text data into a trained deep neural network model to obtain acoustic parameters and variational autoencoder (VAE) hidden-layer features corresponding to the text data;

step (2), obtaining a target audio corresponding to the text data according to the acoustic parameters;

step (3), obtaining a target video corresponding to the text data according to the VAE hidden-layer features;

step (4), combining the target audio and the target video to realize audio and video synthesis;

further, in the step (1), inputting text data into the trained deep neural network model to obtain the acoustic parameters and the variational autoencoder (VAE) hidden-layer features corresponding to the text data specifically comprises,

step (101), building a text-audio-video database for said training;

step (102), training the deep neural network model based on all data in the text-audio-video database;

step (103), inputting the text data into the trained deep neural network model, and obtaining the acoustic parameters and the VAE hidden-layer features corresponding to the text data from the output of the deep neural network model;

further, in said step (101), constructing the text-audio-video database for said training specifically comprises,

acquiring audio data and video data related to different scenes, extracting the corresponding acoustic parameters from the audio data, extracting the corresponding VAE hidden-layer features from each frame image of the video data, and recording and labeling the acoustic parameters and the VAE hidden-layer features according to a preset data structure to obtain the text-audio-video database;

further, in the step (2), obtaining the target audio corresponding to the text data according to the acoustic parameters specifically comprises,

step (201), inputting the text data into the trained network model, thereby obtaining the acoustic parameters;

step (202), inputting the acoustic parameters into a vocoder, so as to obtain the target audio through vocoder reconstruction;

further, in the step (3), obtaining the target video corresponding to the text data according to the variational autoencoder (VAE) hidden-layer features specifically comprises,

step (301), transforming the VAE hidden-layer features to obtain a primary video;

step (302), transforming the primary video with a preset algorithm model to obtain a natural video related to the text data;

step (303), generating the target video from the natural video;

further, in the step (301), transforming the variational autoencoder (VAE) hidden-layer features to obtain the primary video specifically comprises,

step (3011), performing VAE decoding on the VAE hidden-layer features, so as to obtain a plurality of frame images related to said text data;

step (3012), combining the plurality of frame images according to a preset image combination mode to obtain the primary video;

further, in the step (302), transforming the primary video with the preset algorithm model to obtain the natural video related to the text data specifically comprises,

step (3021), constructing a Video-to-Video Synthesis model as the preset algorithm model based on a conditional generative adversarial network (CGAN);

step (3022), training the Video-to-Video Synthesis model;

step (3023), inputting the primary video into the trained Video-to-Video Synthesis model, and obtaining the natural video from the model output;

further, in the step (3021), constructing the Video-to-Video Synthesis model as the preset algorithm model based on the conditional generative adversarial network (CGAN) specifically comprises,

constructing the Video-to-Video Synthesis model based on the conditional generative adversarial network (CGAN), a preset video image frame sequence condition and a preset video precision adjustment mode;

alternatively,

in the step (3022), training the Video-to-Video Synthesis model specifically comprises,

training the Video-to-Video Synthesis model with the video-related data in the text-audio-video database used for training the deep neural network model, so that the video transformation characteristic of the Video-to-Video Synthesis model meets a preset characteristic condition;

further, in the step (3023), inputting the primary video into the trained Video-to-Video Synthesis model and obtaining the natural video from the model output specifically comprises,

step (30231), acquiring the video conversion precision and/or the video conversion continuity corresponding to the Video-to-Video Synthesis model;

step (30232), training the Video-to-Video Synthesis model based on the video conversion precision and/or the video conversion continuity;

step (30233), inputting the primary video into the trained Video-to-Video Synthesis model, thereby obtaining the natural video through conversion;

further, in the step (4), combining the target audio and the target video to realize audio and video synthesis specifically comprises,

step (401), obtaining audio characteristics corresponding to the target audio and video characteristics corresponding to the target video;

and step (402), combining the target audio and the target video through a preset audio and video synthesizer according to the audio characteristics and the video characteristics.

Compared with the prior art, the audio and video synthesis method comprises the following steps: step (1), inputting text data into a trained deep neural network model to obtain acoustic parameters and variational autoencoder (VAE) hidden-layer features corresponding to the text data; step (2), obtaining a target audio corresponding to the text data according to the acoustic parameters; step (3), obtaining a target video corresponding to the text data according to the VAE hidden-layer features; and step (4), combining the target audio and the target video to realize audio and video synthesis. Unlike the prior art, which requires extracting and modeling key points of video images, this method involves no design of video image key points during synthesis; instead, it is driven entirely by the input text data to generate the audio and the video and to combine them. When generating the video, the method fully considers the relationship between successive video frames and further optimizes the video corresponding to the text data with a coarse-to-fine generation scheme, so that the generated video meets specific requirements on picture continuity and picture definition, thereby greatly reducing the difficulty of audio and video synthesis and improving its quality.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of the audio and video synthesis method provided by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. It is obvious that the described embodiments are only a part of the embodiments of the present invention, rather than all of them. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.

Fig. 1 is a schematic flow chart of an audio/video synthesis method according to an embodiment of the present invention. The audio and video synthesis method comprises the following steps:

Step (1), inputting text data into a trained deep neural network model to obtain acoustic parameters and variational autoencoder (VAE) hidden-layer features corresponding to the text data.

Preferably, in the step (1), inputting text data into the trained deep neural network model to obtain the acoustic parameters and the variational autoencoder (VAE) hidden-layer features corresponding to the text data specifically comprises,

step (101), building a text-audio-video database for the training;

step (102), training the deep neural network model based on all data in the text-audio-video database;

and step (103), inputting the text data into the trained deep neural network model, and obtaining the acoustic parameters and the VAE hidden-layer features corresponding to the text data from the output of the deep neural network model.
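As a non-limiting illustration of steps (102) and (103), the sketch below shows one possible deep neural network with a shared text encoder and two output heads, one predicting the acoustic parameters (assumed here to be mel-spectrogram frames) and one predicting the per-frame VAE hidden-layer features. The layer choices, dimensions and names are illustrative assumptions, not the exact architecture of the invention.

```python
import torch
import torch.nn as nn

class TextToAVFeatures(nn.Module):
    """Shared text encoder with two heads: acoustic parameters and VAE latents (illustrative)."""
    def __init__(self, vocab_size=100, emb_dim=256, hid_dim=512,
                 n_acoustic=80, n_vae_latent=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Head 1: acoustic parameters per frame (assumed mel-spectrogram bins)
        self.acoustic_head = nn.Linear(2 * hid_dim, n_acoustic)
        # Head 2: VAE hidden-layer features per video frame
        self.vae_head = nn.Linear(2 * hid_dim, n_vae_latent)

    def forward(self, text_ids):
        x = self.embed(text_ids)            # (batch, seq_len, emb_dim)
        h, _ = self.encoder(x)              # (batch, seq_len, 2*hid_dim)
        return self.acoustic_head(h), self.vae_head(h)

# Step (103): run inference on tokenized text (dummy ids shown)
model = TextToAVFeatures()
text_ids = torch.randint(0, 100, (1, 50))
acoustic_params, vae_features = model(text_ids)
```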

Preferably, in this step (101), constructing the text-audio-video database for the training specifically comprises,

acquiring audio data and video data related to different scenes, extracting the corresponding acoustic parameters from the audio data, extracting the corresponding VAE hidden-layer features from each frame image of the video data, and recording and labeling the acoustic parameters and the VAE hidden-layer features according to a preset data structure to obtain the text-audio-video database.
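The preset data structure for recording and labeling is not fixed by the embodiment; one minimal way to represent a labeled entry of the text-audio-video database is sketched below, with field names and array layouts chosen purely for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AVRecord:
    """One labeled entry of the text-audio-video database (illustrative layout)."""
    text: str                    # transcript for the scene
    acoustic_params: np.ndarray  # e.g., (n_audio_frames, 80) mel-spectrogram
    vae_features: np.ndarray     # (n_video_frames, latent_dim) VAE hidden-layer features

def build_record(text, audio_params, frame_latents):
    # Label and store one scene; a real database would persist many such records.
    return AVRecord(text=text,
                    acoustic_params=np.asarray(audio_params, dtype=np.float32),
                    vae_features=np.asarray(frame_latents, dtype=np.float32))
```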

Step (2), obtaining a target audio corresponding to the text data according to the acoustic parameters.

Preferably, in the step (2), obtaining the target audio corresponding to the text data according to the acoustic parameters specifically comprises,

step (201), inputting the text data into the trained network model, thereby obtaining the acoustic parameters;

and step (202), inputting the acoustic parameters into a vocoder, so that the target audio is obtained through vocoder reconstruction.
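The embodiment does not name a specific vocoder. As one illustrative possibility, assuming the acoustic parameters are mel-spectrogram frames, the waveform could be reconstructed with a Griffin-Lim style procedure such as librosa's mel_to_audio, as in the sketch below; sampling rate and FFT settings are assumptions.

```python
import librosa
import soundfile as sf

def acoustic_params_to_audio(mel_spec, sr=22050, n_fft=1024, hop_length=256):
    """Reconstruct a waveform from mel-spectrogram acoustic parameters (Griffin-Lim sketch)."""
    # mel_spec: (n_mels, n_frames) mel-spectrogram -- an assumed acoustic-parameter format
    wav = librosa.feature.inverse.mel_to_audio(mel_spec, sr=sr, n_fft=n_fft,
                                               hop_length=hop_length)
    return wav

# Example: write the reconstructed target audio to disk
# wav = acoustic_params_to_audio(predicted_mel)
# sf.write("target_audio.wav", wav, 22050)
```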

Step (3), obtaining a target video corresponding to the text data according to the variational autoencoder (VAE) hidden-layer features.

Preferably, in the step (3), obtaining the target video corresponding to the text data according to the VAE hidden-layer features specifically comprises,

step (301), transforming the VAE hidden-layer features to obtain a primary video;

step (302), transforming the primary video with a preset algorithm model to obtain a natural video related to the text data;

and step (303), generating the target video according to the natural video.

Preferably, in this step (301), transforming the variational autoencoder (VAE) hidden-layer features to obtain the primary video specifically comprises,

step (3011), performing VAE decoding on the VAE hidden-layer features, so as to obtain a plurality of frame images related to the text data;

and step (3012), combining the plurality of frame images according to a preset image combination mode to obtain the primary video.
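To make steps (3011)-(3012) concrete, the sketch below decodes each VAE hidden-layer feature vector into one frame image and stacks the frames in temporal order to form the primary video. The decoder layers and image size are assumptions for illustration and are not prescribed by the embodiment.

```python
import torch
import torch.nn as nn

class VAEDecoder(nn.Module):
    """Maps one VAE hidden-layer feature vector to one RGB frame (illustrative)."""
    def __init__(self, latent_dim=128, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 3 * img_size * img_size), nn.Sigmoid())

    def forward(self, z):
        x = self.net(z)
        return x.view(-1, 3, self.img_size, self.img_size)

def features_to_primary_video(decoder, vae_features):
    # vae_features: (n_frames, latent_dim) -> primary video (n_frames, 3, H, W)
    with torch.no_grad():
        frames = decoder(vae_features)
    return frames  # frames kept in temporal order form the primary video
```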

Preferably, in the step (302), transforming the primary video with the preset algorithm model to obtain the natural video related to the text data specifically comprises,

step (3021), constructing a Video-to-Video Synthesis model as the preset algorithm model based on a conditional generative adversarial network (CGAN);

step (3022), training the Video-to-Video Synthesis model;

and step (3023), inputting the primary video into the trained Video-to-Video Synthesis model, and obtaining the natural video from the model output.

Preferably, in the step (3021), constructing the Video-to-Video Synthesis model as the preset algorithm model based on the conditional generative adversarial network (CGAN) specifically comprises,

constructing the Video-to-Video Synthesis model based on the conditional generative adversarial network (CGAN), a preset video image frame sequence condition and a preset video precision adjustment mode.
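A highly simplified sketch of such a conditionally constructed model is given below: a generator that refines a coarse frame conditioned on the previously refined frame (standing in for the video image frame sequence condition) and a discriminator that judges frame/condition pairs. The layer choices are illustrative assumptions, not the actual Video-to-Video Synthesis architecture.

```python
import torch
import torch.nn as nn

class FrameGenerator(nn.Module):
    """Refines a coarse frame, conditioned on the previously refined frame (CGAN generator sketch)."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, coarse_frame, prev_refined):
        return self.net(torch.cat([coarse_frame, prev_refined], dim=1))

class FrameDiscriminator(nn.Module):
    """Judges whether a refined frame is consistent with its conditioning frame."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1))

    def forward(self, frame, condition):
        return self.net(torch.cat([frame, condition], dim=1))
```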

Preferably, in the step (3022), training the Video-to-Video Synthesis model specifically comprises,

training the Video-to-Video Synthesis model with the video-related data in the text-audio-video database used for training the deep neural network model, so that the video transformation characteristic of the Video-to-Video Synthesis model meets a preset characteristic condition.
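Under the same illustrative assumptions, one bare-bones adversarial training step for the sketched generator and discriminator could look as follows; the loss weighting and optimizer choices are placeholders rather than the training scheme of the invention.

```python
import torch
import torch.nn.functional as F

def train_step(gen, disc, opt_g, opt_d, coarse, prev, real):
    """One adversarial update; coarse/prev/real are (batch, 3, H, W) tensors (illustrative)."""
    # Discriminator update: real frames vs. generated (fake) frames
    fake = gen(coarse, prev).detach()
    real_logits = disc(real, coarse)
    fake_logits = disc(fake, coarse)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
              F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: fool the discriminator and stay close to the real frame
    fake = gen(coarse, prev)
    g_logits = disc(fake, coarse)
    g_loss = (F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits)) +
              10.0 * F.l1_loss(fake, real))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```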

Preferably, in the step (3023), inputting the primary video into the trained Video-to-Video Synthesis model and obtaining the natural video from the model output specifically comprises,

step (30231), acquiring the video conversion precision and/or the video conversion continuity corresponding to the Video-to-Video Synthesis model;

step (30232), training the Video-to-Video Synthesis model based on the video conversion precision and/or the video conversion continuity;

and step (30233), inputting the primary video into the trained Video-to-Video Synthesis model, thereby obtaining the natural video through conversion.
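At step (30233), the trained model converts the primary video into the natural video frame by frame. One possible autoregressive refinement pass, again under the illustrative assumptions above, is sketched below.

```python
import torch

def refine_primary_video(gen, primary_video):
    """primary_video: (n_frames, 3, H, W) coarse frames -> refined natural video of the same shape."""
    refined = []
    prev = torch.zeros_like(primary_video[:1])  # no previous frame before the first one
    with torch.no_grad():
        for t in range(primary_video.shape[0]):
            coarse = primary_video[t:t + 1]
            out = gen(coarse, prev)   # condition on the previously refined frame
            refined.append(out)
            prev = out
    return torch.cat(refined, dim=0)
```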

Step (4), combining the target audio and the target video to realize audio and video synthesis.

Preferably, in the step (4), combining the target audio and the target video to realize audio and video synthesis specifically comprises,

step (401), acquiring the audio characteristic corresponding to the target audio and the video characteristic corresponding to the target video;

and (402) combining the target audio and the target video through a preset audio and video synthesizer according to the audio characteristics and the video characteristics.
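The embodiment does not name a particular audio and video synthesizer. As one concrete possibility, the generated target video and target audio could be muxed with FFmpeg, as sketched below; the file names are placeholders.

```python
import subprocess

def mux_audio_video(video_path, audio_path, out_path):
    """Combine the target video and target audio into one file using FFmpeg (illustrative)."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,          # generated target video
        "-i", audio_path,          # generated target audio
        "-c:v", "copy",            # keep the video stream as-is
        "-c:a", "aac",             # encode the audio stream
        "-shortest",               # stop at the shorter of the two streams
        out_path,
    ], check=True)

# mux_audio_video("target_video.mp4", "target_audio.wav", "synthesized_av.mp4")
```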

It can be seen from the above embodiments that, unlike the prior art, which requires extracting and modeling key points of video images, the audio and video synthesis method involves no design of video image key points during synthesis; instead, it is driven entirely by the input text data to generate the audio and the video and to combine them. When generating the video, the method fully considers the relationship between successive video frames and further optimizes the video corresponding to the text data with a coarse-to-fine generation scheme, so that the generated video meets specific requirements on picture continuity and picture definition, thereby greatly reducing the difficulty of audio and video synthesis and improving its quality.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
