Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.241
Image Retrieval from Contextual Descriptions

Abstract: The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we propose a new multimodal challenge, Image Retrieval from Contextual Descriptions (IMAGECODE). In particular, models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description. As such, each description co…
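The retrieval setup in the abstract — score a contextual description against 10 minimally contrastive candidate images and pick the best match — can be sketched as follows. This is an illustrative assumption, not the paper's actual model: it assumes precomputed description and image embeddings (e.g. from a CLIP-like encoder) and ranks candidates by cosine similarity.

```python
import numpy as np

def retrieve(text_emb: np.ndarray, image_embs: np.ndarray) -> int:
    """Return the index of the candidate image whose embedding is most
    similar (by cosine similarity) to the description's embedding.

    text_emb:   shape (d,)      -- embedding of the contextual description
    image_embs: shape (10, d)   -- embeddings of the candidate images
    """
    text = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ text  # one cosine-similarity score per candidate
    return int(np.argmax(scores))

# Toy check: 10 random candidates; the "correct" one (index 7) is a noisy
# copy of the description embedding, so retrieval should recover it.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=128)
image_embs = rng.normal(size=(10, 128))
image_embs[7] = text_emb + 0.1 * rng.normal(size=128)
print(retrieve(text_emb, image_embs))  # → 7
```

A set size of 10 matches the task description above; the embedding dimension and encoder are placeholders.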

Cited by 9 publications (17 citation statements); references 13 publications (13 reference statements).
“…Here, we show that, when the best pre-trained image retrieval system of Krojer et al. (2022) is fed captions produced by an out-of-the-box neural caption generator, its performance makes a big jump forward. Zero-shot image retrieval accuracy improves by almost 6% compared to the highest previously reported human-caption-based performance by the same model, with fine-tuning and various ad-hoc architectural adaptations.…”
Section: Introduction (confidence: 92%)
“…Data: We use the more challenging video section of the IMAGECODE dataset (Krojer et al., 2022). Since we do not fine-tune our model, we only use the validation set, comprising 1,872 data points.…”
Section: Setup (confidence: 99%)