2022
DOI: 10.48550/arxiv.2204.08583
Preprint

VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance

Cited by 26 publications (37 citation statements). References: 0 publications.

“…Using DrawBench, we compare Imagen with DALL-E 2 (the public version) [54], GLIDE [41], Latent Diffusion [57], and CLIP-guided VQ-GAN [12]. Fig.…”
Section: Results on DrawBench (mentioning)
confidence: 99%
“…4a). , GLIDE [41], VQ-GAN+CLIP [12] and Latent Diffusion [57] on DrawBench: User preference rates (with 95% confidence intervals) for image-text alignment and image fidelity. Scaling text encoder size is more important than U-Net size.…”
Section: Analysis of Imagen (mentioning)
confidence: 99%
“…[32] Many of the earliest open-source text-to-image generative frameworks to gain traction used CLIP in a discriminator-like fashion, using it in conjunction with a host of generative models from BigGAN to VQGAN. [2,12,13,17] Newer methods such as diffusion models have also increased output quality. [11,15,29,30]…”
Section: Multimodality (mentioning)
confidence: 99%
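
The "discriminator-like" use of CLIP described in that statement can be illustrated in a few lines: CLIP generates nothing itself; it only scores how well candidate images match a text prompt, and a separate generator (BigGAN, VQGAN, etc.) is pushed toward higher scores. The sketch below shows just that scoring step. It assumes the open-source OpenAI `clip` package; the candidate tensors are random placeholders, not outputs of any of the cited models.

```python
# CLIP as a "discriminator": score how well candidate images match a prompt.
# Assumes the OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 for simplicity

prompts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)
# Placeholder candidates: in practice these would be generator outputs
# (e.g. BigGAN or VQGAN samples) resized to CLIP's 224x224 input resolution.
candidates = torch.rand(4, 3, 224, 224, device=device)

with torch.no_grad():
    text_feats = model.encode_text(prompts)
    image_feats = model.encode_image(candidates)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    scores = image_feats @ text_feats.T  # cosine similarity, one row per candidate

print(scores)  # higher = better image-text match; this signal steers the generator
```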
“…Out of consideration for cross-modal feature alignment, we choose to render a specific visualization corresponding to each piece of input text from scratch. Specifically, we construct an imagination of the textual input with a large-scale vision-and-language-model-guided generative framework, VQGAN+CLIP (Crowson et al., 2022). For each piece of input text x, we treat it as the prompt and use the VQGAN (Esser et al., 2021) model to render the imagination i with 128 × 128 resolution and 200-step optimization.…”
Section: Model Architecture (mentioning)
confidence: 99%
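
For intuition, the quoted setup boils down to the following loop: encode the prompt once with CLIP, then update an image variable for 200 steps so that its CLIP embedding moves toward the prompt's. The snippet below is only a hedged sketch: to keep it self-contained it optimizes raw 128 × 128 pixels directly, whereas VQGAN+CLIP (Crowson et al., 2022) optimizes the VQGAN latent and decodes it to an image at each step.

```python
# CLIP-guided image optimization: the loop behind VQGAN+CLIP, with the VQGAN
# decoder replaced by a raw pixel tensor so the sketch stays self-contained.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()
for p in model.parameters():
    p.requires_grad_(False)  # CLIP stays frozen; only the image is optimized

prompt = clip.tokenize(["an oil painting of a lighthouse at dusk"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(prompt)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Optimized variable: a 128x128 image (stand-in for the VQGAN latent).
pixels = torch.zeros(1, 3, 128, 128, device=device, requires_grad=True)
opt = torch.optim.Adam([pixels], lr=0.05)

for step in range(200):  # 200-step optimization, as in the quoted setup
    opt.zero_grad()
    image = torch.sigmoid(pixels)  # keep pixel values in [0, 1]
    # Resize to CLIP's 224x224 input (CLIP's usual normalization omitted for brevity).
    image = F.interpolate(image, size=224, mode="bilinear", align_corners=False)
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * text_feat).sum()  # maximize cosine similarity to the prompt
    loss.backward()
    opt.step()

imagination = torch.sigmoid(pixels).detach()  # final 1x3x128x128 rendering
```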