2021
DOI: 10.48550/arxiv.2112.10741
Preprint
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Abstract: Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators over those from DALL-E, even when the latter uses expensive CLIP reranking.
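For reference, the two guidance strategies compared in the abstract can be written down explicitly. The following is a sketch in the notation commonly used in the diffusion-guidance literature (not reproduced from this page): ε_θ is the noise-prediction network, μ_θ and Σ_θ the reverse-process mean and covariance, f and g CLIP's image and caption encoders, c the caption, and s a guidance scale.

```latex
% Classifier-free guidance: extrapolate the conditional noise
% prediction away from the unconditional one with scale s >= 1.
\hat{\epsilon}_\theta(x_t \mid c)
  = \epsilon_\theta(x_t \mid \varnothing)
  + s \cdot \bigl( \epsilon_\theta(x_t \mid c) - \epsilon_\theta(x_t \mid \varnothing) \bigr)

% CLIP guidance: shift the reverse-process mean along the gradient
% of the CLIP image-caption similarity f(x_t) \cdot g(c).
\hat{\mu}_\theta(x_t \mid c)
  = \mu_\theta(x_t \mid c)
  + s \cdot \Sigma_\theta \, \nabla_{x_t} \bigl( f(x_t) \cdot g(c) \bigr)
```

Raising s sharpens agreement with the caption at the cost of sample diversity, which is the diversity-fidelity trade-off the abstract refers to.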


Cited by 180 publications (304 citation statements)
References 28 publications

“…Specifically, VQ-Diffusion proposes to model the latent space of a vector-quantized variational autoencoder [138] by learning a parametric model using a conditional variant of the Denoising Diffusion Probabilistic Model (DDPM) [43], [149]. GLIDE [164] compares CLIP guidance and classifier-free guidance in diffusion models for text-guided image synthesis, and concludes that a diffusion model with 3.5 billion parameters and classifier-free guidance outperforms DALL-E in terms of human evaluation.…”
Section: Other Methods (mentioning)
confidence: 99%
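As a concrete illustration of the classifier-free guidance rule discussed in this statement, here is a minimal sketch of one guided noise prediction. The function and its arguments (eps_model, text_emb, null_emb) are hypothetical stand-ins, not GLIDE's actual API:

```python
def cfg_noise_prediction(eps_model, x_t, t, text_emb, null_emb,
                         guidance_scale=3.0):
    """Classifier-free guided noise prediction at timestep t.

    eps_model: hypothetical noise-prediction network eps(x_t, t, cond);
    text_emb:  embedding of the caption c;
    null_emb:  embedding of the empty caption, i.e. the unconditional
               branch obtained by randomly dropping captions in training.
    """
    eps_uncond = eps_model(x_t, t, null_emb)  # eps(x_t | empty caption)
    eps_cond = eps_model(x_t, t, text_emb)    # eps(x_t | c)
    # Extrapolate away from the unconditional prediction; scales > 1
    # trade sample diversity for fidelity to the caption.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```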
“…We see that CM3 is capable of generating non-trivial, semantically coherent captions. That being said, most failure cases of our proposed zero-shot captioning are due …

Model                            FID     Zero-shot FID
AttnGAN (Xu et al., 2017)        35.49
DM-GAN (Zhu et al., 2019)        32.64
DF-GAN (Tao et al., 2020)        21.42
DM-GAN + CL (Ye et al., 2021)    20.79
XMC-GAN                           9.33
LAFITE (Zhou et al., 2021)        8.12
DALL-E                                   ~28
LAFITE (Zhou et al., 2021)               26.94
GLIDE (Nichol et al., 2021)              12.…

…(2021) we sample roughly 30k conditioned samples for our models, and compare against the entire validation set.…”
Section: Source Image (mentioning)
confidence: 97%
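For readers who want to reproduce an FID number like those in the table above, the sketch below uses torchmetrics' FrechetInceptionDistance. The library choice is an assumption (the quoted papers do not specify their implementation), and the random tensors stand in for real MS-COCO images and model samples:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Accumulates Inception-v3 features for both image sets, then computes
# the Fréchet distance between the two fitted Gaussians.
fid = FrechetInceptionDistance(feature=2048)

# Placeholder uint8 batches; in practice each set would hold roughly
# 30k images (model samples vs. the MS-COCO validation set).
real = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
fake = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)

fid.update(real, real=True)   # reference images
fid.update(fake, real=False)  # generated samples
print(float(fid.compute()))   # lower is better
```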
“…We continue with an empirical study of the unconditional generation of CM3, generating 30k samples without textual conditioning and calculating the Fréchet Inception Distance (FID; Heusel et al., 2017) over MS-COCO (Lin et al., 2014), following the methodology proposed in Nichol et al. (2021). We present our results in the unified table showing FID calculations in Table 2.…”
Section: Unconditional Image Generation (mentioning)
confidence: 99%
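The FID referenced here has a closed form (the standard definition from Heusel et al., 2017): it compares Gaussian fits (μ, Σ) to Inception activations of the real set r and the generated set g, with lower values indicating closer distributions.

```latex
\mathrm{FID}(r, g) = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
  - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```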
“…(iii) Quality and resolution. Although quality has gradually improved across successive methods, the previous state-of-the-art methods are still limited to an output image resolution of 256 × 256 pixels [45, 41]. Alternative approaches propose a super-resolution network, which yields less favorable visual and quantitative results [12].…”
Section: Introduction (mentioning)
confidence: 99%