2022
DOI: 10.48550/arxiv.2202.04053
Preprint

DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers

Abstract (Figure 1 caption): Overview of our evaluation process for text-to-image models. We propose to evaluate models in four ways: visual reasoning skills (Sec. 4.1), image-text alignment (Sec. 4.2), image quality (Sec. 4.3), and social biases (Sec. 4.4). Images in the figure are generated using ruDALL-E-XL. We also conduct human evaluation to verify our model-based visual reasoning, image-text alignment, and social bias evaluations.

Cited by 18 publications (34 citation statements) | References 41 publications

Citation statements:
“…For model analysis in Subsection 2.2, we randomly sample 3K prompts from COCO for efficiency and report both FID and CLIPScore. Besides COCO, several other benchmarks have also been proposed to systematically evaluate the performance of different text-to-image models, such as PaintSkills [Cho et al., 2022], DrawBench [Saharia et al., 2022], and PartiPrompts [Yu et al., 2022]. These benchmarks do not have reference images, so it is difficult to evaluate on them automatically.…”
Section: Evaluation Benchmark (mentioning)
confidence: 99%
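The FID-plus-CLIPScore protocol mentioned in the statement above is straightforward to reproduce. The following is a minimal sketch of the CLIPScore half only, assuming the Hugging Face transformers checkpoint openai/clip-vit-base-patch32 and a user-supplied list of (prompt, generated image) pairs; the helper name and the 2.5 * max(cos, 0) convention follow Hessel et al.'s CLIPScore definition and are not taken from the cited paper's code.

```python
# Minimal CLIPScore sketch (assumption: not the cited paper's exact implementation).
# Scores image-text alignment as the clipped CLIP cosine similarity,
# averaged over sampled (prompt, generated image) pairs.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(prompts, image_paths):
    """Average CLIPScore = 2.5 * mean(max(0, cos(text, image)))."""
    scores = []
    for prompt, path in zip(prompts, image_paths):
        image = Image.open(path).convert("RGB")
        inputs = processor(text=[prompt], images=image,
                           return_tensors="pt", padding=True, truncation=True)
        out = model(**inputs)
        text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        cos = (text_emb * img_emb).sum(dim=-1).item()
        scores.append(2.5 * max(cos, 0.0))
    return sum(scores) / len(scores)
```

In practice the prompts would be the 3K randomly sampled COCO captions mentioned in the statement, with FID computed separately against the corresponding reference images.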
“…Park et al. [11] provide curated splits of CUB [28,29] and Flowers [30] to assess unseen color and shape compositions in these narrow domains. DALL-Eval [31] proposed the diagnostic dataset PaintSkills and uses it to assess the visual reasoning skills of models. Since the dataset is generated from a 3D simulator with limited configurations, its data distribution deviates from that of real-world datasets.…”
Section: Metrics For Assessing Text-to-Image Generation (mentioning)
confidence: 99%
“…The runner-up was InfoNCE, while SOA was ineffective at differentiating between the two fake images and the real image. Although DALL-Eval [31] proposed to use a detector and its dedicated heads for the count, color, and spatial relationship tasks, this was limited to 3D-generated images, for which detection is near-perfect. Recall that the accuracy of random guessing is 33.3%, whereas MID requires a powerful feature extractor such as CLIP ViT-L/14 to achieve meaningful performance on the count, color, and spatial relationship tasks.…”
Section: Evaluation On Text-to-Image Generation (mentioning)
confidence: 99%
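The detector-based protocol this statement refers to (detect a skill-specific object class in the generated image and compare the result against what the prompt requested) can be sketched as follows for the "count" skill. This is a hedged illustration using an off-the-shelf torchvision Faster R-CNN rather than DALL-Eval's dedicated detector heads; the function name, score threshold, and expected-count format are assumptions for the example.

```python
# Hedged sketch of detector-based "count" skill checking, in the spirit of
# DALL-Eval/PaintSkills but with a stock torchvision detector instead of the
# paper's dedicated heads. Labels follow torchvision's COCO category list.
import torch
from PIL import Image
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)
from torchvision.transforms.functional import to_tensor

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights).eval()
categories = weights.meta["categories"]  # COCO class names, index-aligned with labels

@torch.no_grad()
def count_correct(image_path, class_name, expected_count, score_thresh=0.7):
    """True if the detector finds exactly `expected_count` instances of `class_name`."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    pred = detector([image])[0]
    keep = pred["scores"] >= score_thresh
    labels = [categories[i] for i in pred["labels"][keep].tolist()]
    return labels.count(class_name) == expected_count

# Example usage (hypothetical prompt "a photo of three dogs"):
# skill_accuracy = sum(count_correct(p, "dog", 3) for p in paths) / len(paths)
```

As the statement notes, this kind of check is reliable when detection is near-perfect (e.g., on simulator-rendered PaintSkills images) but degrades on photorealistic generations where the detector itself makes errors.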
“…This has been further extended to develop natural language explanations by using captioning methods to describe a set of image patches that activated a neuron [107], [188]. Aside from these, dissection has also been used to analyze what types of neurons are exploited by adversarial examples [263], identify failure modes of text-to-image models [49], and probe neural responses in transformers to isolate where certain facts are stored [169]. Unfortunately, these types of methods are limited by the diversity of examples in the dataset used and the quality of the labels.…”
Section: B Dataset-based (Post Hoc) (mentioning)
confidence: 99%