2021
DOI: 10.48550/arxiv.2103.01950
Preprint

Predicting Video with VQVAE

Jacob Walker,
Ali Razavi,
Aäron van den Oord

Abstract: In recent years, the task of video prediction, forecasting future video given past video frames, has attracted attention in the research community. In this paper we propose a novel approach to this problem with Vector Quantized Variational AutoEncoders (VQ-VAE). With VQ-VAE we compress high-resolution videos into a hierarchical set of multi-scale discrete latent variables. Compared to pixels, this compressed latent space has dramatically reduced dimensionality, allowing us to apply scalable autoregressive genera…
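The abstract describes a two-stage pipeline: compress video into discrete latent codes with VQ-VAE, then fit a scalable autoregressive model over those codes. As a rough illustration of the quantization step only, here is a minimal PyTorch sketch of a VQ bottleneck; the codebook size, latent grid shape, and straight-through gradient trick are generic VQ-VAE conventions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck: snap continuous encoder outputs to the
    nearest entry of a learned codebook (hypothetical sizes)."""

    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z_e):
        # z_e: (batch, time, height, width, code_dim) continuous latents
        flat = z_e.reshape(-1, z_e.shape[-1])
        # Squared L2 distance from each latent vector to every code.
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
        codes = dists.argmin(dim=1)                  # discrete token ids
        z_q = self.codebook(codes).view_as(z_e)      # quantized latents
        # Straight-through estimator: copy gradients through the argmin.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes.view(z_e.shape[:-1])

# Assumed example: a clip encoded to an 8x16x16 grid of 64-d latents.
quantizer = VectorQuantizer()
z_e = torch.randn(1, 8, 16, 16, 64)
z_q, codes = quantizer(z_e)
print(codes.shape)  # torch.Size([1, 8, 16, 16]) -- tokens for the prior
```

The dimensionality reduction the abstract refers to is visible here: each 64-dimensional continuous vector collapses to a single integer id, so the autoregressive prior models a compact grid of token ids rather than raw pixels.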

Cited by 16 publications (20 citation statements)
References 29 publications
“…A clean architecture with VQ-VAE for video generation has not been presented yet and we hope VideoGPT is useful from that standpoint. While VQ-VAE-2 (Razavi et al, 2019) proposes using multi-scale hierarchical latents and SNAIL blocks (Chen et al, 2017) (and this setup has been applied to videos in (Walker et al, 2021)), the pipeline is inherently complicated and hard to reproduce. For simplicity, ease of reproduction and presenting the first VQ-VAE based video generation model with minimal complexity, we stick with a single scale of discrete latents and transformers for the autoregressive priors, a design choice also adopted in DALL-E (Ramesh et al, 2021).…”
Section: Related Work (mentioning)
confidence: 99%
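The quoted passage contrasts the multi-scale hierarchical latents used in this paper with VideoGPT's single scale of discrete latents modeled by a transformer prior. Below is a minimal sketch of that latter design, assuming a GPT-style causal transformer over the flattened token ids produced by a quantizer like the one above; all sizes and layer counts are hypothetical.

```python
import torch
import torch.nn as nn

class LatentPrior(nn.Module):
    """Causal transformer over flattened VQ token ids: predicts the
    next discrete code given all previous ones (single-scale design)."""

    def __init__(self, num_codes=512, seq_len=8 * 16 * 16, d_model=256):
        super().__init__()
        self.tok = nn.Embedding(num_codes, d_model)
        self.pos = nn.Embedding(seq_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, num_codes)

    def forward(self, ids):
        # ids: (batch, seq_len) token ids from the VQ bottleneck
        t = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
        # Upper-triangular -inf mask: each position attends only backwards.
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=ids.device), diagonal=1)
        x = self.blocks(x, mask=causal)
        return self.head(x)  # logits over the next token id

prior = LatentPrior()
ids = torch.randint(0, 512, (1, 8 * 16 * 16))
logits = prior(ids)  # (1, 2048, 512): next-token predictions
```

Sampling token by token from these logits and decoding the result through the VQ-VAE decoder is what turns such a prior into a video generator; the hierarchical approach cited for this paper layers additional latent scales on top of this scheme.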
“…This may, however, be practical for real-world web services and other applications, where users already continually interact with the system and A/B testing is standard practice. End-to-end training could also enable PICO to be applied to problems other than compression, such as image captioning for visually-impaired users, or audio visualization for hearing-impaired users [42]; such applications could also be enabled through continued improvements to generative models for video [43,44], audio [45], and text [46,47]. Another exciting area for future work is to apply pragmatic compression to a wider range of realistic applications, including video compression for robotic space exploration [13], audio compression for hearing aids [48,49], and spatial compression for virtual reality [50].…”
Section: Discussion (mentioning)
confidence: 99%
“…Early approaches for this problem typically employed recurrent convolutional models trained with reconstruction objective [17,54,65], but later adversarial losses were introduced to improve the synthesis quality [37,71,75]. Some recent works explore autoregressive video prediction with recurrent or attention-based models (e.g., [27,53,73,78,81]). Another close line of research is video interpolation, i.e.…”
Section: Related Work (mentioning)
confidence: 99%