Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.241
Image Retrieval from Contextual Descriptions

Abstract: The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we propose a new multimodal challenge, Image Retrieval from Contextual Descriptions (IMAGECODE). In particular, models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description. As such, each description co…
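The retrieval setup in the abstract — score a contextual description against 10 minimally contrastive candidate images and pick the best match — can be sketched as follows. This is an illustrative assumption, not the paper's actual model: it assumes precomputed description and image embeddings (e.g. from a CLIP-like encoder) and ranks candidates by cosine similarity.

```python
import numpy as np

def retrieve(text_emb: np.ndarray, image_embs: np.ndarray) -> int:
    """Return the index of the candidate image whose embedding is most
    similar (by cosine similarity) to the description's embedding.

    text_emb:   shape (d,)      -- embedding of the contextual description
    image_embs: shape (10, d)   -- embeddings of the candidate images
    """
    text = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ text  # one cosine-similarity score per candidate
    return int(np.argmax(scores))

# Toy check: 10 random candidates; the "correct" one (index 7) is a noisy
# copy of the description embedding, so retrieval should recover it.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=128)
image_embs = rng.normal(size=(10, 128))
image_embs[7] = text_emb + 0.1 * rng.normal(size=128)
print(retrieve(text_emb, image_embs))  # → 7
```

A set size of 10 matches the task description above; the embedding dimension and encoder are placeholders.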

Cited by 9 publications (17 citation statements); references 13 publications (13 reference statements).
“…Here, we show that, when the best pre-trained image retrieval system of Krojer et al. (2022) is fed captions produced by an out-of-the-box neural caption generator, its performance makes a big jump forward. Zero-shot image retrieval accuracy improves by almost 6% compared to the highest previously reported human-caption-based performance by the same model, with fine-tuning and various ad-hoc architectural adaptations.…”
Section: Introduction (confidence: 92%)
“…Data: We use the more challenging video section of the IMAGECODE dataset (Krojer et al., 2022). Since we do not fine-tune our model, we only use the validation set, comprising 1,872 data points.…”
Section: Setup (confidence: 99%)