2021
DOI: 10.48550/arxiv.2107.02423
Preprint

Improving Text-to-Image Synthesis Using Contrastive Learning

Abstract: The goal of text-to-image synthesis is to generate a visually realistic image that matches a given text description. In practice, the captions annotated by humans for the same image vary widely in content and word choice. This linguistic discrepancy between captions of the same image causes the synthetic images to deviate from the ground truth. To address this issue, we propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images…
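
The abstract's core idea, pulling embeddings of captions that describe the same image together while pushing apart captions of different images, is a standard contrastive setup. Below is a minimal InfoNCE-style sketch of that idea, not the paper's actual implementation; the function name, batch pairing, and temperature value are all illustrative assumptions.

```python
# Hedged sketch of a contrastive (InfoNCE-style) objective over caption
# embeddings: two captions of the same image form a positive pair, captions
# of different images in the batch are negatives. Not the paper's code.
import torch
import torch.nn.functional as F

def caption_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """emb_a[i] and emb_b[i] embed two different captions of image i."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(emb_a.size(0))      # matching pairs on the diagonal
    # Symmetric cross-entropy: each caption must identify its paired caption.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for a text encoder's output.
a, b = torch.randn(8, 256), torch.randn(8, 256)
print(caption_contrastive_loss(a, b).item())
```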

Cited by 19 publications (31 citation statements) · References 35 publications

“…The human evaluator may also indicate that neither image is significantly better than the other, in which case half of a win is assigned to both models.

MS-COCO FID: DM-GAN (Zhu et al, 2019) 32.64, DF-GAN (Tao et al, 2020) 21.42, DM-GAN + CL (Ye et al, 2021) 20.79, XMC-GAN (Zhang et al, 2021) 9.33, LAFITE (Zhou et al, 2021) 8.12. Zero-shot FID: DALL-E (Ramesh et al, 2021) ∼28, LAFITE (Zhou et al, 2021) 26.94, GLIDE 12.24, GLIDE (Validation filtered) …”
Section: Quantitative Results
confidence: 99%
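
The half-win convention in the quoted evaluation protocol is easy to make concrete. Here is a hedged sketch, with hypothetical judgment labels, of how pairwise win rates would be tallied when a tie awards half a win to each model:

```python
# Sketch of the win-rate bookkeeping the quote describes: in pairwise human
# evaluation, a tie ("neither image is significantly better") awards half a
# win to each model. The 'A'/'B'/'tie' labels are illustrative.
from collections import Counter

def win_rates(judgments):
    """judgments: iterable of 'A', 'B', or 'tie' from pairwise comparisons."""
    counts = Counter(judgments)
    total = sum(counts.values())
    wins_a = counts['A'] + 0.5 * counts['tie']
    wins_b = counts['B'] + 0.5 * counts['tie']
    return wins_a / total, wins_b / total

print(win_rates(['A', 'tie', 'B', 'A', 'tie']))  # -> (0.6, 0.4)
```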
“…A mountain with pine trees in a starry winter night. For text-to-image, we compare with DF-GAN [60] and DM-GAN + CL [73] on MS-COCO. Since the original models are trained on the 2014 split, we retrain their models on the 2017 split using the official code.…”
Section: Results
confidence: 99%
“…To this end, we have various single modality-to-image models. When the input modality is text, we have the text-to-image model [48,49,60,71,73,76,82]. When the input modality is a segmentation mask, we have the segmentation-to-image model [10,20,34,43,53,65].…”
Section: Introduction
confidence: 99%
“…We see that CM3 is capable of generating non-trivial semantically coherent captions. That being said, most failure cases of our proposed zero-shot captioning are due …

Model, FID, Zero-shot FID: AttnGAN (Xu et al, 2017) 35.49; DM-GAN (Zhu et al, 2019) 32.64; DF-GAN (Tao et al, 2020) 21.42; DM-GAN + CL (Ye et al, 2021) 20.79; XMC-GAN 9.33; LAFITE (Zhou et al, 2021) 8.12; zero-shot: DALL-E ∼28; LAFITE (Zhou et al, 2021) 26.94; GLIDE (Nichol et al, 2021) 12.24.

…2021) we sample roughly 30k conditioned samples for our models, and compare against the entire validation set.…”
Section: Source Image
confidence: 96%
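
The FID numbers quoted in these statements are Fréchet Inception Distances: the Fréchet distance between Gaussians fitted to Inception features of real and of generated images. Below is a minimal sketch of that computation, assuming feature matrices have already been extracted; the feature extractor itself and the array names are omitted assumptions.

```python
# Sketch of the FID metric the quoted tables report: Frechet distance between
# Gaussians (mean, covariance) fitted to image features. Lower is better.
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2 * covmean))

# Toy usage with random features standing in for Inception activations.
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(500, 64)), rng.normal(0.1, 1.0, size=(500, 64))))
```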