2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.01045
Compositional Video Prediction

Abstract: We present an approach for pixel-level future prediction given an input image of a scene. We observe that a scene is comprised of distinct entities that undergo motion and present an approach that operationalizes this insight. We implicitly predict future states of independent entities while reasoning about their interactions, and compose future video frames using these predicted states. We overcome the inherent multi-modality of the task using a global trajectory-level latent random variable, and show that th…
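The abstract describes per-entity future-state prediction with interaction reasoning, a global trajectory-level latent variable to handle multi-modality, and a composition step that renders the predicted entity states back into frames. The PyTorch sketch below is a minimal, hypothetical illustration of that pipeline only; the module names (EntityPredictor, FrameComposer), feature sizes, pairwise message passing, and mask-based compositing are assumptions made for this example, not the authors' implementation.

import torch
import torch.nn as nn

class EntityPredictor(nn.Module):
    # Advances per-entity states one step, conditioned on summed pairwise
    # interaction messages and a global latent z sampled once per trajectory.
    def __init__(self, feat_dim=64, latent_dim=8):
        super().__init__()
        self.pairwise = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        self.update = nn.GRUCell(feat_dim + latent_dim, feat_dim)

    def forward(self, entities, z):
        # entities: (N, feat_dim) per-entity states; z: (latent_dim,) global latent
        n = entities.size(0)
        src = entities.unsqueeze(1).expand(n, n, -1)
        dst = entities.unsqueeze(0).expand(n, n, -1)
        messages = self.pairwise(torch.cat([src, dst], dim=-1)).sum(dim=1)
        inp = torch.cat([messages, z.expand(n, -1)], dim=-1)
        return self.update(inp, entities)  # next per-entity states, (N, feat_dim)

class FrameComposer(nn.Module):
    # Decodes each entity to an RGB layer plus a mask and composites the
    # layers into a single frame (a stand-in for the paper's composition step).
    def __init__(self, feat_dim=64, frame_hw=64):
        super().__init__()
        self.decode = nn.Linear(feat_dim, 4 * frame_hw * frame_hw)
        self.hw = frame_hw

    def forward(self, entities):
        out = self.decode(entities).view(-1, 4, self.hw, self.hw)
        rgb, mask = out[:, :3], out[:, 3:].softmax(dim=0)  # masks normalized over entities
        return (rgb * mask).sum(dim=0)  # composed (3, H, W) frame

predictor, composer = EntityPredictor(), FrameComposer()
entities = torch.randn(3, 64)   # e.g., three detected entities in the input image
z = torch.randn(8)              # one latent sample -> one plausible future
frames = []
for _ in range(5):              # roll out five future frames
    entities = predictor(entities, z)
    frames.append(composer(entities))

Sampling a different z for the same initial entities yields a different rollout, which is how a trajectory-level latent of the kind the abstract describes can capture multiple plausible futures.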

Cited by 77 publications (72 citation statements). References 23 publications.

Citation statements (ordered by relevance):
“…Video Prediction Video prediction task predicts future frames by conditioning on the input frame(s) [20,28,30,69,71]. Using future frames as ground-truth leads to conditioned supervised learning approach which gives better results in contrast to unconditional video generation [8,18,28,39].…”
Section: Related Work (mentioning, confidence: 99%)
“…Using future frames as ground-truth leads to conditioned supervised learning approach which gives better results in contrast to unconditional video generation [8,18,28,39]. GAN based approaches often relies on a sequence of input frames as priors to reduce ambiguity [15,19,62,71]. Our approach uses only the first input frame and action class name as prior for the prediction task similar to [28,60].…”
Section: Related Work (mentioning, confidence: 99%)
“…However, although the model can generate possible future frames from one image, it is not suitable for complex scenes and has low accuracy. Ye et al [6] proposed a pixel-level future prediction approach, which implicitly predicts future states of independent entities while reasoning about their interactions, and composes future video frames using these predicted states. Jasti et al [7] proposed a model based on temporal motion encodings to make it possible to predict any arbitrary number of future frames.…”
Section: Introduction (mentioning, confidence: 99%)
“…Those can be explicitly estimated as optical flow [13,14,18] resulting in high fidelity outcome for real video sequences, or with Spatial transformers [25] as in [8,12]. Closer to our proposal, Ye et al follow a compositional approach by factorizing abstract visual entities, yet, they operate in the latent space rather than with visual clues [22]. Wu et al [23] proposes a very similar pipeline to ours, with pretrained networks for many sub-tasks although the method is claimed to be unsupervised.…”
Section: Introduction (mentioning, confidence: 99%)