2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.01045
Compositional Video Prediction

Abstract: We present an approach for pixel-level future prediction given an input image of a scene. We observe that a scene is comprised of distinct entities that undergo motion and present an approach that operationalizes this insight. We implicitly predict future states of independent entities while reasoning about their interactions, and compose future video frames using these predicted states. We overcome the inherent multi-modality of the task using a global trajectory-level latent random variable, and show that th…
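The abstract describes per-entity future-state prediction with interaction reasoning, a global trajectory-level latent variable to handle multi-modality, and a composition step that renders the predicted entity states back into frames. The PyTorch sketch below is a minimal, hypothetical illustration of that pipeline only; the module names (EntityPredictor, FrameComposer), feature sizes, pairwise message passing, and mask-based compositing are assumptions made for this example, not the authors' implementation.

import torch
import torch.nn as nn

class EntityPredictor(nn.Module):
    # Advances per-entity states one step, conditioned on summed pairwise
    # interaction messages and a global latent z sampled once per trajectory.
    def __init__(self, feat_dim=64, latent_dim=8):
        super().__init__()
        self.pairwise = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        self.update = nn.GRUCell(feat_dim + latent_dim, feat_dim)

    def forward(self, entities, z):
        # entities: (N, feat_dim) per-entity states; z: (latent_dim,) global latent
        n = entities.size(0)
        src = entities.unsqueeze(1).expand(n, n, -1)
        dst = entities.unsqueeze(0).expand(n, n, -1)
        messages = self.pairwise(torch.cat([src, dst], dim=-1)).sum(dim=1)
        inp = torch.cat([messages, z.expand(n, -1)], dim=-1)
        return self.update(inp, entities)  # next per-entity states, (N, feat_dim)

class FrameComposer(nn.Module):
    # Decodes each entity to an RGB layer plus a mask and composites the
    # layers into a single frame (a stand-in for the paper's composition step).
    def __init__(self, feat_dim=64, frame_hw=64):
        super().__init__()
        self.decode = nn.Linear(feat_dim, 4 * frame_hw * frame_hw)
        self.hw = frame_hw

    def forward(self, entities):
        out = self.decode(entities).view(-1, 4, self.hw, self.hw)
        rgb, mask = out[:, :3], out[:, 3:].softmax(dim=0)  # masks normalized over entities
        return (rgb * mask).sum(dim=0)  # composed (3, H, W) frame

predictor, composer = EntityPredictor(), FrameComposer()
entities = torch.randn(3, 64)   # e.g., three detected entities in the input image
z = torch.randn(8)              # one latent sample -> one plausible future
frames = []
for _ in range(5):              # roll out five future frames
    entities = predictor(entities, z)
    frames.append(composer(entities))

Sampling a different z for the same initial entities yields a different rollout, which is how a trajectory-level latent of the kind the abstract describes can capture multiple plausible futures.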

Cited by 77 publications (72 citation statements). References 23 publications.

Citation statements (ordered by relevance):
“…Video Prediction Video prediction task predicts future frames by conditioning on the input frame(s) [20,28,30,69,71]. Using future frames as ground-truth leads to conditioned supervised learning approach which gives better results in contrast to unconditional video generation [8,18,28,39].…”
Section: Related Work (mentioning, confidence: 99%)
“…Using future frames as ground-truth leads to conditioned supervised learning approach which gives better results in contrast to unconditional video generation [8,18,28,39]. GAN based approaches often relies on a sequence of input frames as priors to reduce ambiguity [15,19,62,71]. Our approach uses only the first input frame and action class name as prior for the prediction task similar to [28,60].…”
Section: Related Work (mentioning, confidence: 99%)
“…However, although the model can generate possible future frames from one image, it is not suitable for complex scenes and has low accuracy. Ye et al [6] proposed a pixel-level future prediction approach, which implicitly predicts future states of independent entities while reasoning about their interactions, and composes future video frames using these predicted states. Jasti et al [7] proposed a model based on temporal motion encodings to make it possible to predict any arbitrary number of future frames.…”
Section: Introduction (mentioning, confidence: 99%)
“…Those can be explicitly estimated as optical flow [13,14,18] resulting in high fidelity outcome for real video sequences, or with Spatial transformers [25] as in [8,12]. Closer to our proposal, Ye et al follow a compositional approach by factorizing abstract visual entities, yet, they operate in the latent space rather than with visual clues [22]. Wu et al [23] proposes a very similar pipeline to ours, with pretrained networks for many sub-tasks although the method is claimed to be unsupervised.…”
Section: Introduction (mentioning, confidence: 99%)