2022
DOI: 10.48550/arxiv.2212.10015
Preprint
Benchmarking Spatial Relationships in Text-to-Image Generation

Cited by 5 publications (11 citation statements) | References 0 publications | Citing publications: 2022–2024
“…In this section, we evaluate Control-GPT on a range of experimental settings to test its controllability with respect to spatial relations, object positions, and sizes, based on the Visor dataset [6]. We also extend the evaluation to multiple objects and out-of-distribution prompts.…”
Section: Methods (mentioning)
confidence: 99%
“…Human evaluation. We randomly sample 100 queries from the Visor dataset [6], which includes 25K text prompts specifying the spatial relationship between two objects, such as "a carrot above a boat" or "a bird below a bus". These prompts are challenging partly because many of them are rare compositions of two unrelated objects, and associating the spatial deixis in the text with regions in the image is not easy.…”
Section: Querying GPT-4 for Programmatic Sketches at Inference (mentioning)
confidence: 99%
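
The quoted statements describe the benchmark's prompt format: two objects composed with a spatial relation. Below is a minimal illustrative sketch of how prompts of that form can be enumerated from an object list and a relation list; it is not the official VISOR tooling, and the object vocabulary and relation phrasing are assumptions. With the 80 COCO categories and four relations, such an enumeration would yield 80 × 79 × 4 = 25,280 prompts, which is consistent with the "25K" figure quoted above.

```python
# Illustrative sketch only: enumerate two-object spatial prompts of the form
# "a <object A> <relation> a <object B>", e.g. "a carrot above a boat".
# The object list and relation phrases below are placeholder assumptions.
from itertools import permutations

OBJECTS = ["carrot", "boat", "bird", "bus"]  # placeholder vocabulary (VISOR uses COCO categories)
RELATIONS = ["above", "below", "to the left of", "to the right of"]

def build_prompts(objects, relations):
    """Pair every ordered object pair with every relation (article handling kept simple)."""
    prompts = []
    for obj_a, obj_b in permutations(objects, 2):
        for rel in relations:
            prompts.append(f"a {obj_a} {rel} a {obj_b}")
    return prompts

if __name__ == "__main__":
    prompts = build_prompts(OBJECTS, RELATIONS)
    print(len(prompts), "prompts; first:", prompts[0])
```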