2021
DOI: 10.48550/arxiv.2111.09734
Preprint
ClipCap: CLIP Prefix for Image Captioning

Abstract: Image captioning is a fundamental task in vision-language understanding, where the model predicts an informative textual caption for a given input image. In this paper, we present a simple approach to address this task. We use the CLIP encoding as a prefix to the caption, by employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features which were trained with textual context, making it well suited for vision-language …
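The abstract sketches the pipeline: a CLIP image embedding is passed through a small mapping network to produce a "prefix" in the language model's input space, and the language model is then fine-tuned to continue that prefix into a caption. Below is a minimal sketch of such a mapper in PyTorch; the MLP shape, the 512/768 dimensions, and the prefix length of 10 are illustrative assumptions, not necessarily the paper's exact configuration.

```python
# Hypothetical sketch: map one CLIP image embedding to a sequence of
# prefix embeddings in the language model's input space.
import torch
import torch.nn as nn

class ClipPrefixMapper(nn.Module):
    def __init__(self, clip_dim: int = 512, lm_embed_dim: int = 768,
                 prefix_length: int = 10):
        super().__init__()
        self.prefix_length = prefix_length
        self.lm_embed_dim = lm_embed_dim
        hidden = (lm_embed_dim * prefix_length) // 2
        # A simple MLP that expands one CLIP vector into
        # prefix_length language-model embeddings.
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, lm_embed_dim * prefix_length),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_length, lm_embed_dim)
        batch = clip_embedding.shape[0]
        return self.mlp(clip_embedding).view(
            batch, self.prefix_length, self.lm_embed_dim)
```

During training, these prefix embeddings would be concatenated in front of the caption's token embeddings and the combined sequence fed to the language model under a standard captioning loss, while the CLIP encoder itself can stay frozen.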

Cited by 85 publications (158 citation statements) · References 35 publications
“…However, this ability is exactly the generative task DALL-E was trained to do, only in new domains. No previous computer vision work, as far as we can ascertain, has … [flattened table residue; recoverable caption: Table 2, “Comparison of our method and the CLIP-Prefix [49] baseline on our novel benchmark for visual relations,” with B@1, R@5, and C-s columns]…”
Section: Discussion and Limitations · mentioning
confidence: 99%
“…1, we present our results on COCO's test set [42]. We compare against two recent baselines that use CLIP's embedding: CLIP-Prefix [49] and CLIP-VL [61]. In CLIP-Prefix, the image is encoded with CLIP and the resulting representation is plugged in as a token to a fine-tuned GPT-2.…”
Section: Image Captioning · mentioning
confidence: 99%
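This statement describes the decoding side: the CLIP-derived representation enters a fine-tuned GPT-2 as if it were a sequence of input tokens. Below is a hedged sketch of such prefix-conditioned decoding with Hugging Face transformers, taking the mapper output from the earlier sketch as input; the model name and the greedy loop are illustrative, and the cited works may decode differently (e.g. with beam search).

```python
# Hypothetical sketch: generate a caption by feeding prefix embeddings
# to GPT-2 via inputs_embeds and decoding greedily.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def caption_from_prefix(prefix_embeds: torch.Tensor,
                        max_tokens: int = 20) -> str:
    # prefix_embeds: (1, prefix_length, 768), e.g. a ClipPrefixMapper output.
    generated = prefix_embeds
    token_ids = []
    with torch.no_grad():
        for _ in range(max_tokens):
            # Predict the next token from the running embedding sequence.
            logits = lm(inputs_embeds=generated).logits
            next_id = int(logits[0, -1].argmax())
            if next_id == tokenizer.eos_token_id:
                break
            token_ids.append(next_id)
            # Append the new token's embedding and continue.
            next_embed = lm.transformer.wte(torch.tensor([[next_id]]))
            generated = torch.cat([generated, next_embed], dim=1)
    return tokenizer.decode(token_ids)
```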