Motion Prediction Under Multimodality with Conditional Stochastic Networks

Fragkiadaki, Katerina; Huang, Jonathan; Alemi, Alex; Vijayanarasimhan, Sudheendra; Ricco, Susanna; Sukthankar, Rahul

doi:10.48550/arxiv.1705.02082

Cited by 7 publications

(12 citation statements)

References 23 publications


“…The uncertainty is usually encoded as a sequence of latent variables, which are then used in a generative model such as GAN [12] based [27,34,5,31], or, similar to ours, VAE [20] based [35,6]. These methods [11,6,41] often leverage an input sequence instead of a single frame, which helps reduce the ambiguities. Further, the latent variables are either per-timestep [6], or global [1,41] whereas our model leverages a global latent variable, which in turn induces per-timestep variables.…”

Section: Related Work (mentioning)

confidence: 99%

Compositional Video Prediction

Singh, Gupta et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)


We present an approach for pixel-level future prediction given an input image of a scene. We observe that a scene is comprised of distinct entities that undergo motion and present an approach that operationalizes this insight. We implicitly predict future states of independent entities while reasoning about their interactions, and compose future video frames using these predicted states. We overcome the inherent multi-modality of the task using a global trajectory-level latent random variable, and show that this allows us to sample diverse and plausible futures. We empirically validate our approach against alternate representations and ways of incorporating multi-modality. We examine two datasets, one comprising of stacked objects that may fall, and the other containing videos of humans performing activities in a gym, and show that our approach allows realistic stochastic video prediction across these diverse settings. See project website for video predictions.

“…Authors focused on evaluating the linearization properties, yet the model was not contrasted to previous works. Extending [92], [186], Fragkiadaki et al [177] proposed several architectural changes and training schemes to handle marginalization over stochastic variables, such as sampling from the prior and variational inference. Their stochastic ED architecture predicts future optical flow, i.e., dense pixel motion field, used to spatially transform the current frame into the next frame prediction.…”

Section: Incorporating Uncertainty (mentioning)

confidence: 99%

A Review on Deep Learning Techniques for Video Prediction

Oprea, Martinez-Gonzalez, García-García et al. 2022

IEEE Trans. Pattern Anal. Mach. Intell.



The ability to predict, anticipate and reason about future outcomes is a key component of intelligent decision-making systems. In light of the success of deep learning in computer vision, deep-learning-based video prediction emerged as a promising research direction. Defined as a self-supervised learning task, video prediction represents a suitable framework for representation learning, as it demonstrated potential capabilities for extracting meaningful representations of the underlying patterns in natural videos. Motivated by the increasing interest in this task, we provide a review on the deep learning methods for prediction in video sequences. We firstly define the video prediction fundamentals, as well as mandatory background concepts and the most used datasets. Next, we carefully analyze existing video prediction models organized according to a proposed taxonomy, highlighting their contributions and their significance in the field. The summary of the datasets and methods is accompanied with experimental results that facilitate the assessment of the state of the art on a quantitative basis. The paper is summarized by drawing some general conclusions, identifying open research challenges and by pointing out future research directions.
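The citation statement from this review notes that the cited model predicts future optical flow, a dense pixel motion field, and uses it to spatially transform the current frame into the next-frame prediction. The following is a minimal NumPy sketch of that warping step only, not the authors' implementation. The function name `warp_frame`, the backward-sampling convention, the bilinear interpolation, and the border clamping are all assumptions made for illustration.

```python
import numpy as np


def warp_frame(frame, flow):
    """Backward-warp a frame with a dense flow field (illustrative sketch).

    frame: array of shape (H, W) or (H, W, C).
    flow:  array of shape (H, W, 2); flow[y, x] = (dx, dy) means output
           pixel (x, y) is sampled from input location (x + dx, y + dy).
    Sampling is bilinear; source coordinates are clamped to the border.
    """
    h, w = flow.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Real-valued source coordinates for every output pixel.
    src_x = np.clip(xs + flow[..., 0], 0, w - 1)
    src_y = np.clip(ys + flow[..., 1], 0, h - 1)

    # Integer corners surrounding each source coordinate.
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets act as bilinear interpolation weights.
    wx = src_x - x0
    wy = src_y - y0
    if frame.ndim == 3:
        # Add a trailing axis so the weights broadcast over channels.
        wx = wx[..., None]
        wy = wy[..., None]

    top = (1 - wx) * frame[y0, x0] + wx * frame[y0, x1]
    bottom = (1 - wx) * frame[y1, x0] + wx * frame[y1, x1]
    return (1 - wy) * top + wy * bottom
```

In a stochastic setting, one would call a function like this once per sampled flow field, so that each sample drawn from the latent prior yields a different predicted next frame from the same current frame.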


“…Using future frames as ground-truth leads to conditioned supervised learning approach which gives better results in contrast to unconditional video generation [8,18,28,39]. GAN based approaches often relies on a sequence of input frames as priors to reduce ambiguity [15,19,62,71]. Our approach uses only the first input frame and action class name as prior for the prediction task similar to [28,60].…”

Section: Related Work (mentioning)

confidence: 99%

LARNet: Latent Action Representation for Human Action Synthesis

Biyani, Rana, Vyas et al. 2021

Preprint


We present LARNet, a novel end-to-end approach for generating human action videos. A joint generative modeling of appearance and dynamics to synthesize a video is very challenging and therefore recent works in video synthesis have proposed to decompose these two factors. However, these methods require a driving video to model the video dynamics. In this work, we propose a generative approach instead, which explicitly learns action dynamics in latent space avoiding the need of a driving video during inference. The generated action dynamics is integrated with the appearance using a recurrent hierarchical structure which induces motion at different scales to focus on both coarse as well as fine level action details. In addition, we propose a novel mix-adversarial loss function which aims at improving the temporal coherency of synthesized videos. We evaluate the proposed approach on four real-world human action datasets demonstrating the effectiveness of the proposed approach in generating human actions. Code available at https://github.com/aayushjr/larnet.

