Video prediction has advanced rapidly with the rise of deep learning. As an important part of unsupervised representation learning, it plays a key role in anomalous behavior detection, autonomous driving, video games, and other fields. However, prediction methods based on optical flow estimation are susceptible to brightness changes and camera shake, and they struggle to predict occluded objects, while prediction methods based on pixel generation have difficulty fitting ambiguous and complex scenes, which leads to blurry predictions. In this work, we propose an end-to-end video prediction framework that combines an optical flow estimation module with a pixel generation module through a learnable mask weight to predict high-fidelity videos. To further improve prediction quality, we introduce adversarial training into the framework: a frame discriminator and a sequence discriminator ensure that the spatial and temporal distributions of predicted video frames are consistent with those of real video frames. Experiments on challenging datasets demonstrate the practicability and effectiveness of the proposed framework. On the one hand, our framework achieves quality comparable to the current state-of-the-art model while requiring fewer parameters and offering faster prediction. On the other hand, ablation experiments demonstrate the benefit of fusing the two modules and the effectiveness of adversarial training.

INDEX TERMS adversarial training, convolutional neural network, optical flow prediction, pixel generation, video prediction.
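To make the mask-based fusion concrete, the following is a minimal sketch of how a flow-warped frame and a generated frame can be blended by a learnable per-pixel mask. The module name, layer sizes, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MaskFusion(nn.Module):
    """Hypothetical sketch: fuse a flow-warped frame and a generated frame
    with a learnable soft mask (not the paper's exact architecture)."""

    def __init__(self, in_channels: int = 6):
        super().__init__()
        # Predict a per-pixel mask in [0, 1] from the two candidate frames.
        self.mask_head = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, warped_frame: torch.Tensor, generated_frame: torch.Tensor) -> torch.Tensor:
        # warped_frame:    frame warped by the estimated optical flow, shape (B, 3, H, W)
        # generated_frame: frame synthesized by the pixel generation module, shape (B, 3, H, W)
        mask = self.mask_head(torch.cat([warped_frame, generated_frame], dim=1))
        # Per-pixel convex combination of the two predictions.
        return mask * warped_frame + (1.0 - mask) * generated_frame


if __name__ == "__main__":
    fusion = MaskFusion()
    warped = torch.rand(1, 3, 64, 64)
    generated = torch.rand(1, 3, 64, 64)
    print(fusion(warped, generated).shape)  # torch.Size([1, 3, 64, 64])
```

In such a scheme, regions where optical flow warping is reliable (e.g., smooth motion) can receive a mask value near 1, while occluded or ambiguous regions fall back to the generated pixels.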