2021
DOI: 10.48550/arxiv.2107.02423
Preprint

Improving Text-to-Image Synthesis Using Contrastive Learning

Abstract: The goal of text-to-image synthesis is to generate a visually realistic image that matches a given text description. In practice, the captions annotated by humans for the same image vary widely in content and word choice. This linguistic discrepancy between captions of the same image causes the synthetic images to deviate from the ground truth. To address this issue, we propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images…
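
The abstract's core idea, pulling embeddings of captions that describe the same image together while pushing apart captions of different images, is a standard contrastive setup. Below is a minimal InfoNCE-style sketch of that idea, not the paper's actual implementation; the function name, batch pairing, and temperature value are all illustrative assumptions.

```python
# Hedged sketch of a contrastive (InfoNCE-style) objective over caption
# embeddings: two captions of the same image form a positive pair, captions
# of different images in the batch are negatives. Not the paper's code.
import torch
import torch.nn.functional as F

def caption_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """emb_a[i] and emb_b[i] embed two different captions of image i."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(emb_a.size(0))      # matching pairs on the diagonal
    # Symmetric cross-entropy: each caption must identify its paired caption.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for a text encoder's output.
a, b = torch.randn(8, 256), torch.randn(8, 256)
print(caption_contrastive_loss(a, b).item())
```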

Cited by 19 publications (31 citation statements) · References 35 publications

“…The human evaluator may also indicate that neither image is significantly better than the other, in which case half of a win is assigned to both models.

MS-COCO FID: DM-GAN (Zhu et al, 2019) 32.64, DF-GAN (Tao et al, 2020) 21.42, DM-GAN + CL (Ye et al, 2021) 20.79, XMC-GAN (Zhang et al, 2021) 9.33, LAFITE (Zhou et al, 2021) 8.12. Zero-shot FID: DALL-E (Ramesh et al, 2021) ∼28, LAFITE (Zhou et al, 2021) 26.94, GLIDE 12.24, GLIDE (Validation filtered) …”
Section: Quantitative Results
confidence: 99%
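
The half-win convention in the quoted evaluation protocol is easy to make concrete. Here is a hedged sketch, with hypothetical judgment labels, of how pairwise win rates would be tallied when a tie awards half a win to each model:

```python
# Sketch of the win-rate bookkeeping the quote describes: in pairwise human
# evaluation, a tie ("neither image is significantly better") awards half a
# win to each model. The 'A'/'B'/'tie' labels are illustrative.
from collections import Counter

def win_rates(judgments):
    """judgments: iterable of 'A', 'B', or 'tie' from pairwise comparisons."""
    counts = Counter(judgments)
    total = sum(counts.values())
    wins_a = counts['A'] + 0.5 * counts['tie']
    wins_b = counts['B'] + 0.5 * counts['tie']
    return wins_a / total, wins_b / total

print(win_rates(['A', 'tie', 'B', 'A', 'tie']))  # -> (0.6, 0.4)
```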
“…A mountain with pine trees in a starry winter night. For text-to-image, we compare with DF-GAN [60] and DM-GAN + CL [73] on MS-COCO. Since the original models are trained on the 2014 split, we retrain their models on the 2017 split using the official code.…”
Section: Results
confidence: 99%
“…To this end, we have various single modality-to-image models. When the input modality is text, we have the text-to-image model [48,49,60,71,73,76,82]. When the input modality is a segmentation mask, we have the segmentation-to-image model [10,20,34,43,53,65].…”
Section: Introduction
confidence: 99%
“…We see that CM3 is capable of generating non-trivial semantically coherent captions. That being said, most failure cases of our proposed zero-shot captioning are due …

Model, FID, Zero-shot FID: AttnGAN (Xu et al, 2017) 35.49; DM-GAN (Zhu et al, 2019) 32.64; DF-GAN (Tao et al, 2020) 21.42; DM-GAN + CL (Ye et al, 2021) 20.79; XMC-GAN 9.33; LAFITE (Zhou et al, 2021) 8.12; zero-shot: DALL-E ∼28; LAFITE (Zhou et al, 2021) 26.94; GLIDE (Nichol et al, 2021) 12.24.

…2021) we sample roughly 30k conditioned samples for our models, and compare against the entire validation set.…”
Section: Source Image
confidence: 96%
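
The FID numbers quoted in these statements are Fréchet Inception Distances: the Fréchet distance between Gaussians fitted to Inception features of real and of generated images. Below is a minimal sketch of that computation, assuming feature matrices have already been extracted; the feature extractor itself and the array names are omitted assumptions.

```python
# Sketch of the FID metric the quoted tables report: Frechet distance between
# Gaussians (mean, covariance) fitted to image features. Lower is better.
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2 * covmean))

# Toy usage with random features standing in for Inception activations.
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(500, 64)), rng.normal(0.1, 1.0, size=(500, 64))))
```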