2021
DOI: 10.48550/arxiv.2111.14447
Preprint

ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Abstract: Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate a descriptive text given an image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefitin…

Cited by 6 publications (15 citation statements)
References 45 publications

“…It should be emphasized that MAGIC Search allows us to directly plug visual controls into the decoding process of the language model, without the need of extra supervised training [14] or gradient update on additional features [14,78]. This property makes our method much more computationally efficient than previous approaches as demonstrated in our experiments (Section §4.1).…”
Section: MAGIC Search
confidence: 94%
“…It has shown impressive zero-shot capabilities on various vision-language tasks and can open new avenues for answering the former question. ZeroCap [78] is the most related to our work. It is built on a pre-trained CLIP model together with the GPT-2 language model [60].…”
Section: Image Captioning
confidence: 99%
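The citation statements above describe the common mechanism behind ZeroCap and MAGIC Search: a contrastive image-text scorer (CLIP) steering an autoregressive language model (GPT-2) at inference time, with no extra training. A minimal sketch of that general idea, with toy dictionaries standing in for real log-probabilities and CLIP similarities (the function name, weighting scheme, and all numbers are illustrative, not taken from either paper):

```python
# Hedged sketch: visually guided next-token selection. A language
# model's log-probabilities are combined with an image-text matching
# score for each candidate token, and the best combined candidate wins.
# All scores below are toy values, not real CLIP/GPT-2 outputs.

def guided_next_token(lm_logprobs, visual_scores, alpha=1.0):
    """Pick the next token by adding an image-text matching score
    (weighted by alpha) to the language model's log-probability."""
    combined = {
        tok: lm_logprobs[tok] + alpha * visual_scores.get(tok, 0.0)
        for tok in lm_logprobs
    }
    return max(combined, key=combined.get)

# Toy example: the LM alone slightly prefers "animal", but a visual
# score favoring "dog" (say, for a photo of a dog) flips the choice.
lm = {"animal": -1.0, "dog": -1.3, "car": -4.0}
vis = {"dog": 0.9, "animal": 0.2, "car": -0.5}
print(guided_next_token(lm, vis))  # -> dog  (-1.3 + 0.9 beats -1.0 + 0.2)
```

This is why the quoted statement calls the approach training-free: the visual signal enters only through the per-step score at decoding time, so neither model's weights are updated.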