2021
DOI: 10.48550/arxiv.2112.10741
Preprint
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Abstract: Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators over those from DALL-E, even when the latter uses expensive CLIP reranking.
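For reference, the two guidance strategies compared in the abstract can be written down explicitly. The following is a sketch in the notation commonly used in the diffusion-guidance literature (not reproduced from this page): ε_θ is the noise-prediction network, μ_θ and Σ_θ the reverse-process mean and covariance, f and g CLIP's image and caption encoders, c the caption, and s a guidance scale.

```latex
% Classifier-free guidance: extrapolate the conditional noise
% prediction away from the unconditional one with scale s >= 1.
\hat{\epsilon}_\theta(x_t \mid c)
  = \epsilon_\theta(x_t \mid \varnothing)
  + s \cdot \bigl( \epsilon_\theta(x_t \mid c) - \epsilon_\theta(x_t \mid \varnothing) \bigr)

% CLIP guidance: shift the reverse-process mean along the gradient
% of the CLIP image-caption similarity f(x_t) \cdot g(c).
\hat{\mu}_\theta(x_t \mid c)
  = \mu_\theta(x_t \mid c)
  + s \cdot \Sigma_\theta \, \nabla_{x_t} \bigl( f(x_t) \cdot g(c) \bigr)
```

Raising s sharpens agreement with the caption at the cost of sample diversity, which is the diversity-fidelity trade-off the abstract refers to.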


Cited by 180 publications (304 citation statements)
References 28 publications

“…Specifically, VQ-Diffusion proposes to model the latent space of a vector-quantized variational autoencoder [138] by learning a parametric model using a conditional variant of the Denoising Diffusion Probabilistic Model (DDPM) [43], [149]. GLIDE [164] compares CLIP guidance and classifier-free guidance in diffusion models for text-guided image synthesis, and concludes that a diffusion model with 3.5 billion parameters and classifier-free guidance outperforms DALL-E in terms of human evaluation.…”
Section: Other Methods (mentioning)
confidence: 99%
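As a concrete illustration of the classifier-free guidance rule discussed in this statement, here is a minimal sketch of one guided noise prediction. The function and its arguments (eps_model, text_emb, null_emb) are hypothetical stand-ins, not GLIDE's actual API:

```python
def cfg_noise_prediction(eps_model, x_t, t, text_emb, null_emb,
                         guidance_scale=3.0):
    """Classifier-free guided noise prediction at timestep t.

    eps_model: hypothetical noise-prediction network eps(x_t, t, cond);
    text_emb:  embedding of the caption c;
    null_emb:  embedding of the empty caption, i.e. the unconditional
               branch obtained by randomly dropping captions in training.
    """
    eps_uncond = eps_model(x_t, t, null_emb)  # eps(x_t | empty caption)
    eps_cond = eps_model(x_t, t, text_emb)    # eps(x_t | c)
    # Extrapolate away from the unconditional prediction; scales > 1
    # trade sample diversity for fidelity to the caption.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```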
“…We see that CM3 is capable of generating non-trivial, semantically coherent captions. That being said, most failure cases of our proposed zero-shot captioning are due …

Model                            FID     Zero-shot FID
AttnGAN (Xu et al., 2017)        35.49
DM-GAN (Zhu et al., 2019)        32.64
DF-GAN (Tao et al., 2020)        21.42
DM-GAN + CL (Ye et al., 2021)    20.79
XMC-GAN                           9.33
LAFITE (Zhou et al., 2021)        8.12
DALL-E                                   ~28
LAFITE (Zhou et al., 2021)               26.94
GLIDE (Nichol et al., 2021)              12.…

…(2021) we sample roughly 30k conditioned samples for our models, and compare against the entire validation set.…”
Section: Source Image (mentioning)
confidence: 97%
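For readers who want to reproduce an FID number like those in the table above, the sketch below uses torchmetrics' FrechetInceptionDistance. The library choice is an assumption (the quoted papers do not specify their implementation), and the random tensors stand in for real MS-COCO images and model samples:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Accumulates Inception-v3 features for both image sets, then computes
# the Fréchet distance between the two fitted Gaussians.
fid = FrechetInceptionDistance(feature=2048)

# Placeholder uint8 batches; in practice each set would hold roughly
# 30k images (model samples vs. the MS-COCO validation set).
real = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
fake = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)

fid.update(real, real=True)   # reference images
fid.update(fake, real=False)  # generated samples
print(float(fid.compute()))   # lower is better
```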
“…We continue with an empirical study of the unconditional generation of CM3, generating 30k samples without textual conditioning and calculating the Fréchet Inception Distance (FID; Heusel et al., 2017) over MS-COCO (Lin et al., 2014), following the methodology proposed in Nichol et al. (2021). We present our results in the unified table showing FID calculations in Table 2.…”
Section: Unconditional Image Generation (mentioning)
confidence: 99%
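The FID referenced here has a closed form (the standard definition from Heusel et al., 2017): it compares Gaussian fits (μ, Σ) to Inception activations of the real set r and the generated set g, with lower values indicating closer distributions.

```latex
\mathrm{FID}(r, g) = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
  - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```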
“…(iii) Quality and resolution. Although quality has gradually improved across successive methods, the previous state-of-the-art methods are still limited to an output image resolution of 256 × 256 pixels [45, 41]. Alternative approaches propose a super-resolution network, which yields less favorable visual and quantitative results [12].…”
Section: Introduction (mentioning)
confidence: 99%