2022
DOI: 10.48550/arxiv.2204.08583
Preprint

VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance

Cited by 26 publications (37 citation statements). References: 0 publications.

“…Using DrawBench, we compare Imagen with DALL-E 2 (the public version) [54], GLIDE [41], Latent Diffusion [57], and CLIP-guided VQ-GAN [12]. Fig.…”
Section: Results on DrawBench (mentioning)
confidence: 99%
“…4a). , GLIDE [41], VQ-GAN+CLIP [12] and Latent Diffusion [57] on DrawBench: User preference rates (with 95% confidence intervals) for image-text alignment and image fidelity. Scaling text encoder size is more important than U-Net size.…”
Section: Analysis of Imagen (mentioning)
confidence: 99%
“…[32] Many of the earliest open-source text-to-image generative frameworks to gain traction used CLIP in a discriminator-like fashion, using it in conjunction with a host of generative models from BigGAN to VQGAN. [2,12,13,17] Newer methods such as diffusion models have also increased output quality. [11,15,29,30]…”
Section: Multimodality (mentioning)
confidence: 99%
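
The "discriminator-like" use of CLIP described in that statement can be illustrated in a few lines: CLIP generates nothing itself; it only scores how well candidate images match a text prompt, and a separate generator (BigGAN, VQGAN, etc.) is pushed toward higher scores. The sketch below shows just that scoring step. It assumes the open-source OpenAI `clip` package; the candidate tensors are random placeholders, not outputs of any of the cited models.

```python
# CLIP as a "discriminator": score how well candidate images match a prompt.
# Assumes the OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 for simplicity

prompts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)
# Placeholder candidates: in practice these would be generator outputs
# (e.g. BigGAN or VQGAN samples) resized to CLIP's 224x224 input resolution.
candidates = torch.rand(4, 3, 224, 224, device=device)

with torch.no_grad():
    text_feats = model.encode_text(prompts)
    image_feats = model.encode_image(candidates)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    scores = image_feats @ text_feats.T  # cosine similarity, one row per candidate

print(scores)  # higher = better image-text match; this signal steers the generator
```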
“…Out of consideration for cross-modal feature alignment, we choose to render a specific visualization corresponding to each piece of input text from scratch. Specifically, we construct an imagination of the textual input with a large-scale vision-and-language-model-guided generative framework, VQGAN+CLIP (Crowson et al., 2022). For each piece of input text x, we treat it as the prompt and use the VQGAN (Esser et al., 2021) model to render the imagination i with 128 × 128 resolution and 200-step optimization.…”
Section: Model Architecture (mentioning)
confidence: 99%
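
For intuition, the quoted setup boils down to the following loop: encode the prompt once with CLIP, then update an image variable for 200 steps so that its CLIP embedding moves toward the prompt's. The snippet below is only a hedged sketch: to keep it self-contained it optimizes raw 128 × 128 pixels directly, whereas VQGAN+CLIP (Crowson et al., 2022) optimizes the VQGAN latent and decodes it to an image at each step.

```python
# CLIP-guided image optimization: the loop behind VQGAN+CLIP, with the VQGAN
# decoder replaced by a raw pixel tensor so the sketch stays self-contained.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()
for p in model.parameters():
    p.requires_grad_(False)  # CLIP stays frozen; only the image is optimized

prompt = clip.tokenize(["an oil painting of a lighthouse at dusk"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(prompt)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Optimized variable: a 128x128 image (stand-in for the VQGAN latent).
pixels = torch.zeros(1, 3, 128, 128, device=device, requires_grad=True)
opt = torch.optim.Adam([pixels], lr=0.05)

for step in range(200):  # 200-step optimization, as in the quoted setup
    opt.zero_grad()
    image = torch.sigmoid(pixels)  # keep pixel values in [0, 1]
    # Resize to CLIP's 224x224 input (CLIP's usual normalization omitted for brevity).
    image = F.interpolate(image, size=224, mode="bilinear", align_corners=False)
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * text_feat).sum()  # maximize cosine similarity to the prompt
    loss.backward()
    opt.step()

imagination = torch.sigmoid(pixels).detach()  # final 1x3x128x128 rendering
```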