2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01123
Autoregressive Image Generation using Residual Quantization

Cited by 62 publications (33 citation statements)
References 16 publications
“…Unlike the minimax game in popular GAN models [13,1,20,21], the VQ-based generator is trained by optimizing the negative log-likelihood over all examples in the training set, leading to stable training and bypassing the "mode collapse" issue. Driven by these advantages, many image synthesis models follow the two-stage paradigm, such as image generation [31,45,2,24,16], image-to-image translation [11,10,32], text-to-image synthesis [30,29,10,7], conditional video generation [28,42,44], and image completion [11,10,47]. Apart from VQGAN, the most related works also include ViT-VQGAN [45] and RQ-VAE [24], which aim to train a better quantizer in the first stage.…”
Section: Related Work
Confidence: 99%
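The objective quoted above is ordinary maximum-likelihood training: once a stage-one quantizer maps images to grids of discrete code indices, the stage-two autoregressive prior is fit with a cross-entropy (negative log-likelihood) loss over those indices, with no adversarial game involved. A minimal sketch of one such training step, assuming PyTorch and treating vq_encode (the frozen stage-one quantizer) and prior (any autoregressive model over code sequences) as hypothetical stand-ins, not names from the paper:

    import torch
    import torch.nn.functional as F

    def nll_step(prior, vq_encode, images, optimizer):
        # Stage one is frozen: images -> flat grid of code indices (B, L).
        with torch.no_grad():
            codes = vq_encode(images)
        # Predict each code from its prefix; logits: (B, L-1, K).
        logits = prior(codes[:, :-1])
        # Negative log-likelihood over the shifted targets.
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            codes[:, 1:].reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Because every training example contributes directly to the likelihood, there is no discriminator to balance, which is why the quoted passage credits this setup with stable training and no mode collapse.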
“…Apart from VQGAN, the most related works also include ViT-VQGAN [45] and RQ-VAE [24], which aim to train a better quantizer in the first stage. Compared to them, our model is simple and efficient, yet effective at improving image quality without adding computational cost from higher-resolution representations [45] or more stages of recursive quantization [24].…”
Section: Related Work
Confidence: 99%
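The "stages of recursive quantization" attributed to RQ-VAE [24] refer to residual quantization: each feature vector is approximated by a sum of D codebook entries, where stage d quantizes the residual left over by stages 1 through d-1, deepening the code rather than enlarging the spatial grid. A minimal sketch, assuming a single codebook shared across stages (an illustrative simplification; names and shapes are not from the paper):

    import torch

    def residual_quantize(z, codebook, depth):
        # z: (N, C) feature vectors; codebook: (K, C) entries.
        # Returns per-stage indices (N, depth) and the cumulative
        # reconstruction (N, C), i.e. the sum of the chosen entries.
        residual = z.clone()
        recon = torch.zeros_like(z)
        indices = []
        for _ in range(depth):
            dists = torch.cdist(residual, codebook)  # (N, K) distances
            idx = dists.argmin(dim=1)                # nearest entry per vector
            chosen = codebook[idx]                   # (N, C)
            recon = recon + chosen
            residual = residual - chosen             # quantize what is left
            indices.append(idx)
        return torch.stack(indices, dim=1), recon

Each extra stage refines the approximation at the same spatial resolution, which is the trade-off the quoted passage contrasts with ViT-VQGAN's higher-resolution representations [45].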