Transformation-based Adversarial Video Prediction on Large-Scale Data

Luc, Pauline; Clark, Aidan; Dieleman, Sander; Casas, Diego de Las; Doron, Yotam; Cassirer, Albin; Simonyan, Karen

doi:10.48550/arxiv.2003.04035

Cited by 7 publications

(15 citation statements)

References 33 publications

“…Quantitatively, Table 1 shows FVD results on BAIR, com- [footnote 3: SV2P (Babaeizadeh et al, 2017), SAVP (Lee et al, 2018), DVD-GAN-FP (Clark et al, 2019), Video Transformer (Weissenborn et al, 2019), Latent Video Transformer (LVT) (Rakhimov et al, 2020), and TrIVD-GAN (Luc et al, 2020) are our baselines.] We can see that our method is able to generate realistically looking samples. In addition, we see that VideoGPT is able to sample different trajectories from the same initial frame, showing that it is not simply copying the dataset.…”

Section: BAIR Robot Pushing (mentioning)

confidence: 97%

VideoGPT: Video Generation using VQ-VAE and Transformers

Yan, Zhang, Abbeel, et al. 2021. Preprint.

We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high fidelity natural images from UCF-101 and Tumblr GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html.

“…1. On the widely benchmarked BAIR Robot Pushing dataset (Ebert et al, 2017), VideoGPT can generate realistic samples that are competitive with existing methods such as TrIVD-GAN (Luc et al, 2020), achieving an FVD of 103 when benchmarked with real samples, and an FVD* (Razavi et al, 2019) of 94 when benchmarked with reconstructions.…”

Section: Introduction (mentioning)

confidence: 99%

VideoGPT: Video Generation using VQ-VAE and Transformers

Yan, Zhang, Abbeel, et al. 2021. Preprint.

“…However, no performance comparison with previous works has been conducted. Improving [127], Luc et al [128] proposed the Transformation-based & TrIple Video Discriminator GAN (TrIVD-GAN-FP) featuring a novel recurrent unit that computes the parameters of a transformation used to warp previous hidden states without any supervision. These Transformation-based Spatial Recurrent Units (TSRUs) are generic modules and can replace any traditional recurrent unit in currently existent video prediction approaches.…”

Section: Kernel-based Resampling (mentioning)

confidence: 99%

A Review on Deep Learning Techniques for Video Prediction

Oprea, Martinez-Gonzalez, García-García, et al. 2022. IEEE Trans. Pattern Anal. Mach. Intell.

153

The ability to predict, anticipate and reason about future outcomes is a key component of intelligent decision-making systems. In light of the success of deep learning in computer vision, deep-learning-based video prediction emerged as a promising research direction. Defined as a self-supervised learning task, video prediction represents a suitable framework for representation learning, as it demonstrated potential capabilities for extracting meaningful representations of the underlying patterns in natural videos. Motivated by the increasing interest in this task, we provide a review on the deep learning methods for prediction in video sequences. We firstly define the video prediction fundamentals, as well as mandatory background concepts and the most used datasets. Next, we carefully analyze existing video prediction models organized according to a proposed taxonomy, highlighting their contributions and their significance in the field. The summary of the datasets and methods is accompanied with experimental results that facilitate the assessment of the state of the art on a quantitative basis. The paper is summarized by drawing some general conclusions, identifying open research challenges and by pointing out future research directions.

“…While some work in the area of video generation [7,54,45] has explored video synthesis (generating videos with no prior frame information), many approaches actually focus on the task of video prediction conditioned on past frames [41,47,38,30,23,2,34,58,60,13,27]. It can be argued that video synthesis is a combination of image generation and video prediction.…”

Section: Introduction (mentioning)

confidence: 99%

“…Approaches toward video prediction have largely skewed toward variations of generative adversarial networks [30,23,7,54,27]. In comparison, we are aware of only a relatively small number of approaches which propose variational autoencoders [2,60,8], autoregressive models [20,57], or flow based approaches [22].…”

Section: Introduction (mentioning)

confidence: 99%

Predicting Video with VQVAE

Walker, Razavi, Oord. 2021. Preprint.

In recent years, the task of video prediction (forecasting future video given past video frames) has attracted attention in the research community. In this paper we propose a novel approach to this problem with Vector Quantized Variational AutoEncoders (VQ-VAE). With VQ-VAE we compress high-resolution videos into a hierarchical set of multi-scale discrete latent variables. Compared to pixels, this compressed latent space has dramatically reduced dimensionality, allowing us to apply scalable autoregressive generative models to predict video. In contrast to previous work that has largely emphasized highly constrained datasets, we focus on very diverse, large-scale datasets such as Kinetics-600. We predict video at a higher resolution on unconstrained videos, 256 × 256, than any other previous method to our knowledge. We further validate our approach against prior work via a crowdsourced human evaluation.
