2021
DOI: 10.48550/arxiv.2104.10157
Preprint

VideoGPT: Video Generation using VQ-VAE and Transformers

Wilson Yan,
Yunzhi Zhang,
Pieter Abbeel
et al.

Abstract: We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competi…
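The quantization step the abstract refers to can be sketched in a few lines: each continuous latent vector produced by the encoder is replaced by its nearest codebook entry. The names and sizes below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical sketch of VQ-VAE quantization: map each latent vector
# to the index of its nearest codebook entry (squared Euclidean distance).
def quantize(latents, codebook):
    """latents: (N, D), codebook: (K, D) -> nearest code indices (N,)."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))       # 8 codes, 4-dim embeddings (made up)
latents = codebook[[2, 5, 5]] + 0.001    # latents lying near codes 2, 5, 5
print(quantize(latents, codebook))       # → [2 5 5]
```

In the full model these indices, not the raw pixels, form the discrete sequence that the GPT-like prior is trained on.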

Cited by 55 publications (87 citation statements)
References 48 publications
“…This may, however, be practical for real-world web services and other applications, where users already continually interact with the system and A/B testing is standard practice. End-to-end training could also enable PICO to be applied to problems other than compression, such as image captioning for visually-impaired users, or audio visualization for hearing-impaired users [42]; such applications could also be enabled through continued improvements to generative models for video [43,44], audio [45], and text [46,47]. Another exciting area for future work is to apply pragmatic compression to a wider range of realistic applications, including video compression for robotic space exploration [13], audio compression for hearing aids [48,49], and spatial compression for virtual reality [50].…”
Section: Discussion
confidence: 99%
“…Recently, VQ-VAE-based [40] visual auto-regressive models were proposed for visual synthesis tasks. By converting images into discrete visual tokens, such methods can conduct efficient and large-scale pre-training for text-to-image generation (e.g., DALL-E [33] and CogView [9]), text-to-video generation (e.g., GODIVA [45]), and video prediction (e.g., LVT [31] and VideoGPT [48]), with higher resolution of generated images or videos. However, none of these models was trained on images and videos together.…”
Section: Visual Auto-regressive Models
confidence: 99%
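The auto-regressive pattern this excerpt describes is the standard one-token-at-a-time loop: each discrete visual token is sampled conditioned on the prefix generated so far. Below is a minimal sketch of that loop; `next_token_logits` is a hypothetical stand-in for a trained transformer prior, and the vocabulary size is invented.

```python
import numpy as np

VOCAB = 16  # hypothetical codebook size
rng = np.random.default_rng(0)

def next_token_logits(prefix):
    # Stand-in for a GPT-style model; a real prior would condition on `prefix`.
    return np.zeros(VOCAB)

def sample_sequence(length):
    """Generate `length` tokens auto-regressively from the (dummy) prior."""
    tokens = []
    for _ in range(length):
        logits = next_token_logits(tokens)
        probs = np.exp(logits) / np.exp(logits).sum()  # softmax
        tokens.append(int(rng.choice(VOCAB, p=probs)))
    return tokens

seq = sample_sequence(8)
print(len(seq))  # → 8
```

Decoding the sampled token grid back through the VQ-VAE decoder then yields the generated image or video.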
“…We use a placeholder of 1 since images have no temporal dimension. Videos can be viewed as a temporal extension of images, and recent works like VideoGPT [48] and VideoGen [51] extend convolutions in the VQ-VAE encoder from 2D to 3D and train a video-specific representation. However, this fails to share a common codebook for both images and videos.…”
Section: 3D Data Representation
confidence: 99%
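The placeholder idea quoted above can be made concrete: an image token grid is treated as a video grid with a temporal axis of size 1, so one (t, h, w) layout covers both modalities. The shapes below are illustrative assumptions.

```python
import numpy as np

def to_token_grid(x):
    """Unify image and video latents: give images a temporal axis of size 1."""
    return x[None, ...] if x.ndim == 2 else x  # (H, W) -> (1, H, W); video passes through

image_tokens = np.arange(16).reshape(4, 4)     # (H, W) image latent grid
video_tokens = np.arange(32).reshape(2, 4, 4)  # (T, H, W) video latent grid
print(to_token_grid(image_tokens).shape)  # → (1, 4, 4)
print(to_token_grid(video_tokens).shape)  # → (2, 4, 4)
```

With a shared (t, h, w) grid, a single codebook and a single prior can in principle be trained on both images and videos, which is the limitation the excerpt attributes to video-specific encoders.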
“…The initial idea was proposed in the seminal work [67] (VQ-VAE) and further improved in [56] (VQ-VAE-2). Applications of VQ-VAE to content generation include images [56,67], audio [15,40,67,81], and videos [54,74]. Recently, it was found to be beneficial to train a transformer to sample from the codebook given a rich condition, e.g.…”
Section: Related Work
confidence: 99%