Video prediction method and device, electronic equipment and vehicle
Note: the technology "Video prediction method and device, electronic equipment and vehicle" was created by 侯鹏飞 (Hou Pengfei) and 范坤 (Fan Kun) on 2018-07-13. Its main content is as follows: A video prediction method, a video prediction apparatus, an electronic device, and a vehicle are disclosed. The video prediction method comprises: a training step comprising: generating a prior distribution from a previous frame using a prior encoder; generating a posterior distribution from the previous frame and a subsequent frame using a posterior encoder; using the posterior distribution as an intermediate variable of a generator to generate a first predicted frame from the previous frame; and optimizing the prior encoder and the generator with the difference between the first predicted frame and the subsequent frame and the KL divergence between the prior distribution and the posterior distribution as a loss function; and a prediction step comprising: generating a prior distribution from a known frame using the prior encoder; and using the prior distribution as an intermediate variable of the generator to generate future frames from the known frames. In this way, the generator and the prior encoder used for video prediction can be optimized using the prior distribution of the prior encoder and the posterior distribution of the posterior encoder, thereby simplifying the training process of video prediction and improving the prediction effect.
1. A video prediction method, comprising:
a training step comprising:
generating an a priori distribution from a previous frame using an a priori encoder;
generating an a posteriori distribution from the previous and subsequent frames using an a posteriori encoder;
using the a posteriori distribution as an intermediate variable for a generator, generating a first predicted frame from the previous frame using the generator; and
optimizing the prior encoder and the generator with a difference between the first predicted frame and the subsequent frame and a KL divergence between the prior distribution and the posterior distribution as a loss function; and
a prediction step comprising:
generating an a priori distribution from a known frame using the a priori encoder; and
using the prior distribution as an intermediate variable for the generator, generating future frames from the known frames using the generator.
2. The method of claim 1, wherein the a priori encoder and the a posteriori encoder each comprise a convolutional network, and the generator comprises one of a long short term memory network, a convolutional network, and an optical flow network.
3. The method of claim 1, wherein the previous and subsequent frames for the a priori encoder and the a posteriori encoder in the training step are both real video data.
4. The video prediction method of claim 3, wherein the previous frame and the subsequent frame are video frames acquired by a driver assistance system of a vehicle.
5. The method of claim 1, wherein the predicting step further comprises:
generating a next future frame using the future frame as a known frame.
6. The method of claim 1, wherein,
generating an a priori distribution from a previous frame using an a priori encoder includes:
generating a plurality of first data pairs of means and variances using the previous frame; and
using the plurality of first data pairs of means and variances as the prior distribution to generate a first random number that follows a Gaussian distribution, and
generating an a posteriori distribution from the previous and subsequent frames using an a posteriori encoder comprises:
generating a plurality of second data pairs of means and variances using the previous and subsequent frames; and
generating a second random number that follows a Gaussian distribution as the posterior distribution using the plurality of second data pairs of means and variances.
7. The method of claim 1, wherein the training step further comprises:
using the a priori distribution as an intermediate variable for the generator, generating a second predicted frame from the previous frame using the generator; and
optimizing the prior encoder and the generator with a difference between the second predicted frame and the subsequent frame and a KL divergence between the prior distribution and the posterior distribution as a loss function.
8. A video prediction apparatus includes a prior encoder, a posterior encoder, a generator, a training unit, and a prediction unit,
the training unit is configured to:
generating an a priori distribution from a previous frame using the a priori encoder;
generating an a posteriori distribution from the previous and subsequent frames using the a posteriori encoder;
generating, by the generator, a first predicted frame from the previous frame using the a posteriori distribution as an intermediate variable for the generator; and
optimizing the prior encoder and the generator using a difference between the first predicted frame and the subsequent frame and a KL divergence between the prior distribution and the posterior distribution as a loss function, and
the prediction unit is configured to:
generating an a priori distribution from a known frame using the a priori encoder; and
generating, by the generator, a future frame from the known frame using the prior distribution as an intermediate variable of the generator.
9. The apparatus of claim 8, wherein the training unit is further configured to:
generating, by the generator, a second predicted frame from the previous frame using the prior distribution as an intermediate variable of the generator; and
optimizing the prior encoder and the generator with a difference between the second predicted frame and the subsequent frame and a KL divergence between the prior distribution and the posterior distribution as a loss function.
10. The apparatus of claim 9, wherein said a priori encoder and said a posteriori encoder each comprise a convolutional network, and said generator comprises one of a long short term memory network, a convolutional network, and an optical flow network.
11. An electronic device, comprising:
a processor; and
memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the video prediction method of any one of claims 1-7.
12. A vehicle comprising the electronic device of claim 11.
13. A computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform a video prediction method according to any one of claims 1-7.
Technical Field
The present application relates generally to the field of Advanced Driver Assistance Systems (ADAS), and more particularly, to a video prediction method, a video prediction apparatus, an electronic device, and a vehicle.
Background
In recent years, automated driving and Advanced Driver Assistance Systems (ADAS) have received extensive attention and intensive research. An ADAS system needs to sense various states of the vehicle itself and its surrounding environment using various vehicle-mounted sensors, collect data, identify, detect and track static and dynamic entities, and perform systematic calculation and analysis in combination with map data, so as to make driving policy decisions and ultimately realize an automatic driving function.
In an automatic driving scenario, video obtained by image acquisition devices such as cameras needs to be predicted in order to dynamically predict entities in the environment; the prediction results are then used by subsequent modules to realize functions such as driving control of the vehicle.
In video prediction, a variational auto-encoder (VAE) is used to fit the posterior distribution of the future frames of a video by calculating the prior distribution of the video from its previous frames. The predicted images need to be as realistic as possible, and the predicted motion trajectories need to follow the real motion of the objects as closely as possible. During training, the posterior distribution needs to gradually approach the distribution of the data set, and the prior distribution needs to gradually approach the posterior distribution. However, since the posterior distribution is random at the beginning of training, the training of the prior is easily disturbed, and the overall effect is ultimately not ideal.
Accordingly, there is a need for an improved video prediction scheme.
Disclosure of Invention
The present application is proposed to solve the above technical problems. Embodiments of the present application provide a video prediction method, a video prediction apparatus, an electronic device, and a vehicle, which obtain the posterior distribution from real data rather than predicted data in the training stage and use it to optimize the prior encoder and the prediction generator, thereby simplifying the training process of video prediction and improving the prediction effect.
According to an aspect of the present application, there is provided a video prediction method, including: a training step comprising: generating an a priori distribution from a previous frame using an a priori encoder; generating an a posteriori distribution from the previous and subsequent frames using an a posteriori encoder; using the a posteriori distribution as an intermediate variable for a generator, generating a first predicted frame from the previous frame using the generator; and optimizing the prior encoder and the generator with the difference between the first predicted frame and the subsequent frame and the KL divergence between the prior distribution and the posterior distribution as a loss function; and, a predicting step comprising: generating an a priori distribution from a known frame using the a priori encoder; and using the a priori distribution as an intermediate variable for the generator, generating future frames from the known frames using the generator.
In the above video prediction method, the a priori encoder and the a posteriori encoder each comprise a convolutional network, and the generator comprises one of a long-short term memory network, a convolutional network, and an optical flow network.
In the above video prediction method, the previous frame and the subsequent frame for the a priori encoder and the a posteriori encoder in the training step are both real video data.
In the above video prediction method, the previous frame and the subsequent frame are video frames acquired by a driving assistance system of the vehicle.
In the above video prediction method, the predicting step further includes: generating a next future frame using the future frame as a known frame.
In the above video prediction method, generating an a priori distribution from a previous frame using an a priori encoder comprises: generating a plurality of first data pairs of means and variances using the previous frame; and generating a first random number that follows a Gaussian distribution as the prior distribution using the plurality of first data pairs of means and variances; and generating an a posteriori distribution from the previous and subsequent frames using an a posteriori encoder comprises: generating a plurality of second data pairs of means and variances using the previous and subsequent frames; and generating a second random number that follows a Gaussian distribution as the posterior distribution using the plurality of second data pairs of means and variances.
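The sampling described here is commonly implemented with the reparameterization trick; a minimal NumPy sketch (the function and variable names are illustrative, not from the application):

```python
import numpy as np

def sample_from_pairs(mean, var, rng):
    """Reparameterization trick: z = mean + sqrt(var) * eps with eps ~ N(0, 1),
    so the sample follows N(mean, var) while remaining a deterministic
    function of (mean, var) and the noise."""
    eps = rng.standard_normal(np.shape(mean))
    return mean + np.sqrt(var) * eps

rng = np.random.default_rng(0)
mean = np.array([0.0, 1.0, -2.0])
var = np.array([1.0, 0.25, 4.0])
z = sample_from_pairs(mean, var, rng)
print(z.shape)  # → (3,)
```

With zero variance the sample collapses onto the mean, which makes the trick easy to sanity-check.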
In the above video prediction method, the training step further includes: using the a priori distribution as an intermediate variable for the generator, generating a second predicted frame from the previous frame using the generator; and optimizing the prior encoder and the generator with the difference between the second predicted frame and the subsequent frame and the KL divergence between the prior distribution and the posterior distribution as a loss function.
According to another aspect of the present application, there is provided a video prediction apparatus comprising an a priori encoder, an a posteriori encoder, a generator, a training unit, and a prediction unit, wherein the training unit is configured to: generating an a priori distribution from a previous frame using the a priori encoder; generating an a posteriori distribution from the previous and subsequent frames using the a posteriori encoder; generating, by the generator, a first predicted frame from the previous frame using the a posteriori distribution as an intermediate variable for the generator; and optimizing the prior encoder and the generator using the difference between the first predicted frame and the subsequent frame and the KL divergence between the prior distribution and the posterior distribution as a loss function, and the prediction unit is configured to: generating an a priori distribution from a known frame using the a priori encoder; and generating, by the generator, future frames from the known frames using the a priori distribution as an intermediate variable of the generator.
In the above video prediction apparatus, the a priori encoder and the a posteriori encoder each include a convolutional network, and the generator includes one of a long-short term memory network, a convolutional network, and an optical flow network.
In the above video prediction apparatus, the previous frame and the subsequent frame for the a priori encoder and the a posteriori encoder in the training unit are both real video data.
In the above-described video prediction apparatus, the previous frame and the subsequent frame are video frames acquired by a driving assistance system of the vehicle.
In the above video prediction apparatus, the prediction unit is further configured to: generating a next future frame using the future frame as a known frame.
In the above video prediction apparatus, the training unit generating an a priori distribution from a previous frame using the a priori encoder comprises: generating a plurality of first data pairs of means and variances using the previous frame; and generating a first random number that follows a Gaussian distribution as the prior distribution using the plurality of first data pairs of means and variances; and the training unit generating an a posteriori distribution from the previous and subsequent frames using the a posteriori encoder comprises: generating a plurality of second data pairs of means and variances using the previous and subsequent frames; and generating a second random number that follows a Gaussian distribution as the posterior distribution using the plurality of second data pairs of means and variances.
In the above video prediction apparatus, the training unit is further configured to: using the a priori distribution as an intermediate variable for the generator, generating a second predicted frame from the previous frame using the generator; and optimizing the prior encoder and the generator with the difference between the second predicted frame and the subsequent frame and the KL divergence between the prior distribution and the posterior distribution as a loss function.
According to still another aspect of the present application, there is provided an electronic apparatus including: a processor; and a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the video prediction method as described above.
According to yet another aspect of the present application, there is provided a vehicle comprising an electronic device as described above.
According to yet another aspect of the present application, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform a video prediction method as described above.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a schematic diagram illustrating a system architecture to which a video prediction method according to an embodiment of the present application is applied.
Fig. 2 illustrates a flow diagram of a video prediction method according to an embodiment of the present application.
Fig. 3 illustrates a schematic diagram of a training process of a video prediction method according to an embodiment of the present application.
Fig. 4 illustrates a schematic diagram of a prediction process of a video prediction method according to an embodiment of the present application.
Fig. 5 illustrates a schematic diagram of another example of a training process of a video prediction method according to an embodiment of the present application.
Fig. 6 illustrates a block diagram of a video prediction apparatus according to an embodiment of the present application.
FIG. 7 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Summary of the application
As described above, current video prediction mainly uses a variational auto-encoder to calculate a prior distribution from the first few frames of a video and fit it to the posterior distribution of the following frames. The prior distribution is usually estimated with an LSTM (Long Short-Term Memory) network or simply assumed to be a standard normal distribution, and an LSTM is also used to encode the posterior. However, assuming the prior distribution to be a normal distribution is too simple to fit real data. On the other hand, if an LSTM network is adopted, the LSTM structure itself is relatively difficult to train, so learning the prior and the posterior easily becomes difficult and training efficiency is low.
In view of the above technical problems, a basic idea of the present application is to provide a video prediction method, a video prediction apparatus, an electronic device, and a vehicle, in which a prior encoder generates a prior distribution using real data in a training step, a posterior encoder generates a posterior distribution using real data of more frames, the posterior distribution is used as an intermediate variable of a prediction generator to perform prediction, and the prior encoder and the generator are trained using KL divergence between the prior and posterior distributions and a difference between a predicted frame and a real frame, such as a mean square error. Moreover, the prior encoder and the posterior encoder can adopt a convolution network to replace a commonly used LSTM network, thereby greatly simplifying the training process and improving the prediction effect. In the prediction step, the a priori distribution generated by the trained a priori encoder can be used as an intermediate variable of the generator to perform video prediction.
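The training flow just described can be sketched end to end. The stand-in "networks" below are toy placeholders of our own (the application's real encoders are convolutional); only the loss structure, a frame difference plus the KL divergence between prior and posterior, mirrors the text:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for the three networks; the real ones would be convolutional
# encoders and e.g. a convolutional / LSTM / optical-flow generator.
def prior_encoder(prev):
    """Previous frames -> (mean, variance) of the prior distribution."""
    m = prev.mean(axis=0)
    return m, np.ones_like(m)

def posterior_encoder(prev, subs):
    """Previous + subsequent frames -> (mean, variance) of the posterior."""
    m = np.concatenate([prev, subs]).mean(axis=0)
    return m, np.ones_like(m)

def generator(prev, z):
    """Previous frames + latent sample -> predicted next frame."""
    return prev[-1] + z

prev = rng.standard_normal((4, 8))   # previous frames (real data)
subs = rng.standard_normal((4, 8))   # subsequent frames (real data)

mu_p, var_p = prior_encoder(prev)                    # prior distribution P1
mu_q, var_q = posterior_encoder(prev, subs)          # posterior distribution P2
z = mu_q + np.sqrt(var_q) * rng.standard_normal(8)   # sample from the posterior
pred = generator(prev, z)                            # first predicted frame

mse = np.mean((pred - subs[0]) ** 2)                 # difference to the true frame
kl = 0.5 * np.sum(np.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
loss = mse + kl                                      # optimize encoder + generator
```

In an actual implementation `loss` would be backpropagated through the prior encoder and the generator; this sketch only shows how the two loss terms are assembled.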
Here, the video prediction method, the video prediction apparatus, the electronic device, and the vehicle according to the embodiments of the present application may be directly applied to video prediction, and may also be used for any other prediction task that can be converted into video prediction. For example, the motion prediction of objects such as vehicles and pedestrians in an automatic driving scene can be converted into a prediction task of a position occupying lattice point diagram sequence of each object in a panoramic top view. Furthermore, the predicted image does not only refer to a natural image containing a single or three color channels, but may also be any multi-channel three-dimensional data that implicitly expresses other information (e.g., velocity, acceleration).
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary System
Fig. 1 is a schematic diagram illustrating a system architecture to which a video prediction method according to an embodiment of the present application is applied.
As shown in fig. 1, the system architecture to which the video prediction method according to an embodiment of the present application is applied includes an a priori encoder, an a posteriori encoder, and a generator. In the training stage, the a priori encoder and the a posteriori encoder generate the prior distribution and the posterior distribution, respectively, and the generator generates predicted frames; in the prediction stage, only the a priori encoder and the generator are used.
Here, specific implementations of the a priori encoder, the a posteriori encoder, and the generator will be described in detail in the exemplary method and exemplary apparatus sections below.
Exemplary method
Fig. 2 illustrates a flow diagram of a video prediction method according to an embodiment of the present application.
The training step S210 and the prediction step S220 shown in fig. 2 will be described in detail below with reference to fig. 3 to 4. As shown in fig. 2, the training step S210 may include a step S211 of generating an a priori distribution from a previous frame using the a priori encoder, and a step S212 of generating an a posteriori distribution from the previous frame and a subsequent frame using the a posteriori encoder.
With continued reference to fig. 2, the training step S210 further comprises a step S213 of using the posterior distribution P2 as an intermediate variable for the generator and generating a first predicted frame X'_{t+1} from the previous frame using the generator, and a step S214 of optimizing the a priori encoder and the generator.
In step S214, the difference between the predicted frame X'_{t+1} and its true value (i.e. the subsequent frame X_{t+1}), for example the Mean Square Error (MSE), together with the KL divergence between the prior distribution P1 and the posterior distribution P2, may be used as the loss function, i.e., loss = MSE + KL, to train the a priori encoder and the generator.
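When both distributions are modeled as diagonal Gaussians, the KL term has a closed form, so loss = MSE + KL can be computed directly; a minimal NumPy sketch (the function names are ours, not from the application):

```python
import numpy as np

def kl_diag_gaussian(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal
    Gaussians, summed over all dimensions."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def training_loss(pred_frame, true_frame, mu_q, var_q, mu_p, var_p):
    """loss = MSE(predicted frame, subsequent frame) + KL(posterior || prior)."""
    mse = np.mean((pred_frame - true_frame) ** 2)
    return mse + kl_diag_gaussian(mu_q, var_q, mu_p, var_p)

mu = np.array([0.1, -0.2])
var = np.array([1.0, 0.5])
# identical prior and posterior -> the KL term vanishes, loss reduces to the MSE
print(training_loss(np.ones(4), np.ones(4), mu, var, mu, var))  # → 0.0
```

As training drives the prior toward the posterior, the KL term shrinks toward zero and the loss is dominated by the frame reconstruction error.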
The inventors have found that when convolutional networks are used for both the a priori encoder and the a posteriori encoder, instead of the commonly used LSTM networks, the training process is greatly simplified and the prediction effect is improved.
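As a rough illustration of a convolutional encoder head producing mean/variance pairs, here is a toy NumPy sketch; the kernels, shapes, and the exp-of-log-variance trick are our own assumptions, not the application's architecture:

```python
import numpy as np

def conv2d(img, kernel):
    """Minimal 'valid' 2D convolution (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def conv_encoder(frame, k_mean, k_logvar):
    """Toy convolutional encoder: two conv heads produce per-location
    (mean, variance) pairs; exponentiating the log-variance head keeps
    every variance strictly positive."""
    mean = conv2d(frame, k_mean)
    var = np.exp(conv2d(frame, k_logvar))
    return mean, var

rng = np.random.default_rng(1)
frame = rng.standard_normal((8, 8))
mean, var = conv_encoder(frame,
                         rng.standard_normal((3, 3)),
                         rng.standard_normal((3, 3)))
print(mean.shape, bool((var > 0).all()))  # → (6, 6) True
```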
After the training process is completed, a prediction step S220 may be performed, the prediction step S220 performing prediction using only the a priori encoder and the generator, without the a posteriori encoder.
Fig. 4 illustrates this process. As shown in fig. 4, the a priori encoder generates an a priori distribution P1 from the known frames, and the generator uses the a priori distribution P1 as an intermediate variable to generate a future frame from the known frames. The generated future frame may in turn be used as a known frame to generate the next future frame.
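The recursive use of generated frames can be sketched as an autoregressive rollout; the stand-in networks and the 4-frame conditioning window below are illustrative assumptions, not the application's design:

```python
import numpy as np

def predict_rollout(known, prior_encoder, generator, n_future, rng):
    """Autoregressive prediction: each generated future frame is appended
    to the known frames and conditions the prediction of the next one."""
    frames = list(known)
    for _ in range(n_future):
        window = np.stack(frames[-4:])                         # last 4 frames
        mu, var = prior_encoder(window)                        # prior P1
        z = mu + np.sqrt(var) * rng.standard_normal(mu.shape)  # sample from P1
        frames.append(generator(window, z))                    # next future frame
    return frames[len(known):]

# Toy stand-ins for the trained networks.
prior_encoder = lambda prev: (prev.mean(axis=0), np.full(prev.shape[1], 0.01))
generator = lambda prev, z: prev[-1] + z

rng = np.random.default_rng(0)
known = [rng.standard_normal(8) for _ in range(4)]
future = predict_rollout(known, prior_encoder, generator, n_future=3, rng=rng)
print(len(future))  # → 3
```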
Fig. 5 illustrates a schematic diagram of another example of a training process of a video prediction method according to an embodiment of the present application. For simplicity and clarity, only the differences of the example of fig. 5 from the example of fig. 3 will be described below. As shown in fig. 5, the a priori distribution P1 produced by the a priori encoder 10 is also provided to the generator as an intermediate variable to generate a second predicted frame Y'_{t+1} from the previous frame, and the difference between the second predicted frame Y'_{t+1} and the subsequent frame X_{t+1}, together with the KL divergence between the prior distribution P1 and the posterior distribution P2, is likewise used as a loss function to optimize the a priori encoder and the generator.
Exemplary devices
Fig. 6 illustrates a functional block diagram of a video prediction apparatus 300 according to an embodiment of the present application. As shown in fig. 6, the video prediction apparatus 300 according to the embodiment of the present application may include an a priori encoder 310, an a posteriori encoder 320, a generator 330, a training unit 340, and a prediction unit 350.
Training unit 340 may be configured to schedule other units to perform the training process. In particular, it may use the a priori encoder 310 to generate an a priori distribution P1 from the previous frames X_{t-4}:X_t, use the a posteriori encoder 320 to generate an a posteriori distribution P2 from the previous frames X_{t-4}:X_t and the subsequent frames X_{t+1}:X_{t+4}, and provide the posterior distribution P2 to the generator 330 as an intermediate variable. Training unit 340 may also use the generator 330 to generate a predicted frame X'_{t+1} from the previous frames X_{t-4}:X_t, and optimize the a priori encoder 310 and the generator 330 with the difference between the predicted frame X'_{t+1} and its corresponding true value, i.e. the frame X_{t+1}, for example the Mean Square Error (MSE), and the KL divergence between the a priori distribution P1 and the a posteriori distribution P2 as loss functions. In some embodiments, training unit 340 may also provide the a priori distribution P1 to the generator 330 as an intermediate variable, use the generator 330 to generate a predicted frame Y'_{t+1} from the previous frames X_{t-4}:X_t, and optimize the a priori encoder 310 and the generator 330 with the difference between the predicted frame Y'_{t+1} and its corresponding true value, i.e. the frame X_{t+1}, for example the mean square error, and the KL divergence between the a priori distribution P1 and the a posteriori distribution P2 as a loss function. The training unit 340 may alternately or simultaneously perform the training process with the prior distribution P1 and the posterior distribution P2 as intermediate variables until the prior distribution P1 and the posterior distribution P2 converge to be consistent.
The prediction unit 350 may be configured to schedule other units to perform the prediction process. In particular, it may use the a priori encoder 310 to generate an a priori distribution P1 from the known frames X_{t-4}:X_t, and use the a priori distribution P1 as an intermediate variable of the generator 330 to generate a future frame X'_{t+1} from the known frames X_{t-4}:X_t by the generator 330.
In one example, the a priori encoder 310 and the a posteriori encoder 320 may each comprise a convolutional network, and the generator 330 may comprise one of a long-short term memory network, a convolutional network, and an optical flow network.
It is to be understood that the specific functions and operations of the respective units and modules in the video prediction apparatus 300 have been described in detail in the video prediction method described above with reference to fig. 1 to 5, and thus, a repetitive description thereof will be omitted.
As described above, the video prediction apparatus 300 according to the embodiment of the present application can be implemented in various terminal devices, for example, in-vehicle devices for driving assistance. In one example, the video prediction apparatus 300 according to the embodiment of the present application may be integrated into the terminal device as a software module and/or a hardware module. For example, the apparatus 300 may be a software module in an operating system of the terminal device, or may be an application program developed for the terminal device, which runs on a CPU (central processing unit) and/or a GPU (graphics processing unit), or runs on a dedicated hardware acceleration chip, such as a dedicated chip adapted to run a deep neural network; of course, the apparatus 300 may also be one of many hardware modules of the terminal device.
Alternatively, in another example, the video prediction apparatus 300 and the terminal device may be separate devices, and the apparatus 300 may be connected to the terminal device through a wired and/or wireless network and transmit the interactive information according to an agreed data format.
Exemplary electronic device
FIG. 7 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
As shown in fig. 7, the electronic device 400 includes one or more processors 410 and a memory storing computer program instructions which, when executed by the processor 410, cause the processor to perform the video prediction method described above.
In one example, the electronic device 400 can also include an input device and an output device.
Of course, for simplicity, only some of the components of the electronic device 400 relevant to the present application are shown in fig. 7, and components such as buses, input/output interfaces, and the like are omitted. In addition, electronic device 400 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the video prediction method according to various embodiments of the present application described in the "exemplary methods" section of this specification, supra.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the video prediction method according to various embodiments of the present application described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, systems referred to in this application are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, configurations, etc. must be made in the manner shown in the block diagrams. These devices, apparatuses, devices, systems may be connected, arranged, configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "as used herein mean, and are used interchangeably with, the word" and/or, "unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.