2021
DOI: 10.48550/arxiv.2104.10157
Preprint

VideoGPT: Video Generation using VQ-VAE and Transformers

Wilson Yan,
Yunzhi Zhang,
Pieter Abbeel
et al.

Abstract: We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competi…
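The quantization step the abstract refers to can be sketched in a few lines: each continuous latent vector produced by the encoder is replaced by its nearest codebook entry. The names and sizes below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical sketch of VQ-VAE quantization: map each latent vector
# to the index of its nearest codebook entry (squared Euclidean distance).
def quantize(latents, codebook):
    """latents: (N, D), codebook: (K, D) -> nearest code indices (N,)."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))       # 8 codes, 4-dim embeddings (made up)
latents = codebook[[2, 5, 5]] + 0.001    # latents lying near codes 2, 5, 5
print(quantize(latents, codebook))       # → [2 5 5]
```

In the full model these indices, not the raw pixels, form the discrete sequence that the GPT-like prior is trained on.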

Cited by 55 publications (87 citation statements)
References 48 publications
“…This may, however, be practical for real-world web services and other applications, where users already continually interact with the system and A/B testing is standard practice. End-to-end training could also enable PICO to be applied to problems other than compression, such as image captioning for visually-impaired users, or audio visualization for hearing-impaired users [42]; such applications could also be enabled through continued improvements to generative models for video [43,44], audio [45], and text [46,47]. Another exciting area for future work is to apply pragmatic compression to a wider range of realistic applications, including video compression for robotic space exploration [13], audio compression for hearing aids [48,49], and spatial compression for virtual reality [50].…”
Section: Discussion
confidence: 99%
“…Recently, VQ-VAE-based [40] visual auto-regressive models were proposed for visual synthesis tasks. By converting images into discrete visual tokens, such methods can conduct efficient and large-scale pre-training for text-to-image generation (e.g., DALL-E [33] and CogView [9]), text-to-video generation (e.g., GODIVA [45]), and video prediction (e.g., LVT [31] and VideoGPT [48]), with higher resolution of generated images or videos. However, none of these models was trained on images and videos together.…”
Section: Visual Auto-regressive Models
confidence: 99%
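The auto-regressive pattern this excerpt describes is the standard one-token-at-a-time loop: each discrete visual token is sampled conditioned on the prefix generated so far. Below is a minimal sketch of that loop; `next_token_logits` is a hypothetical stand-in for a trained transformer prior, and the vocabulary size is invented.

```python
import numpy as np

VOCAB = 16  # hypothetical codebook size
rng = np.random.default_rng(0)

def next_token_logits(prefix):
    # Stand-in for a GPT-style model; a real prior would condition on `prefix`.
    return np.zeros(VOCAB)

def sample_sequence(length):
    """Generate `length` tokens auto-regressively from the (dummy) prior."""
    tokens = []
    for _ in range(length):
        logits = next_token_logits(tokens)
        probs = np.exp(logits) / np.exp(logits).sum()  # softmax
        tokens.append(int(rng.choice(VOCAB, p=probs)))
    return tokens

seq = sample_sequence(8)
print(len(seq))  # → 8
```

Decoding the sampled token grid back through the VQ-VAE decoder then yields the generated image or video.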
“…We use a placeholder of 1 since images have no temporal dimension. Videos can be viewed as a temporal extension of images, and recent works like VideoGPT [48] and VideoGen [51] extend convolutions in the VQ-VAE encoder from 2D to 3D and train a video-specific representation. However, this fails to share a common codebook for both images and videos.…”
Section: 3D Data Representation
confidence: 99%
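The placeholder idea quoted above can be made concrete: an image token grid is treated as a video grid with a temporal axis of size 1, so one (t, h, w) layout covers both modalities. The shapes below are illustrative assumptions.

```python
import numpy as np

def to_token_grid(x):
    """Unify image and video latents: give images a temporal axis of size 1."""
    return x[None, ...] if x.ndim == 2 else x  # (H, W) -> (1, H, W); video passes through

image_tokens = np.arange(16).reshape(4, 4)     # (H, W) image latent grid
video_tokens = np.arange(32).reshape(2, 4, 4)  # (T, H, W) video latent grid
print(to_token_grid(image_tokens).shape)  # → (1, 4, 4)
print(to_token_grid(video_tokens).shape)  # → (2, 4, 4)
```

With a shared (t, h, w) grid, a single codebook and a single prior can in principle be trained on both images and videos, which is the limitation the excerpt attributes to video-specific encoders.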
“…The initial idea was proposed in the seminal work [67] (VQ-VAE) and further improved in [56] (VQ-VAE-2). Applications of VQ-VAE to content generation include images [56,67], audio [15,40,67,81], and videos [54,74]. Recently, it was found to be beneficial to train a transformer to sample from the codebook given a rich condition, e.g.…”
Section: Related Work
confidence: 99%