2022
DOI: 10.48550/arxiv.2206.01843
Preprint

Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning

Abstract: People say, "A picture is worth a thousand words". Then how can we get the rich information out of the image? We argue that by using visual clues to bridge large pretrained vision foundation models and language models, we can do so without any extra cross-modal training. Thanks to the strong zero-shot capability of foundation models, we start by constructing a rich semantic representation of the image (e.g., image tags, object attributes / locations, captions) as a structured textual prompt, called visual clue…
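The abstract's core step is serializing the outputs of vision foundation models into a structured textual prompt ("visual clues") that a language model can consume. The snippet below is a minimal sketch of that serialization, assuming hypothetical tag/object/caption outputs; the field names and template are illustrative placeholders, not the paper's actual prompt format.

```python
# Minimal sketch (not the paper's actual template): serialize hypothetical
# vision-model outputs (tags, object attributes/locations, caption candidates)
# into a structured textual prompt for a large language model.
def build_visual_clue_prompt(tags, objects, captions):
    lines = ["Image tags: " + ", ".join(tags)]
    for obj in objects:
        # Each object carries an attribute and a bounding box (illustrative keys).
        lines.append(f"Object: {obj['attribute']} {obj['name']} at box {obj['box']}")
    lines.append("Caption candidates: " + " / ".join(captions))
    lines.append("Describe this image in a detailed paragraph:")
    return "\n".join(lines)

prompt = build_visual_clue_prompt(
    tags=["beach", "dog", "frisbee"],
    objects=[{"name": "dog", "attribute": "brown", "box": (40, 60, 180, 200)}],
    captions=["a dog playing with a frisbee on the beach"],
)
print(prompt)
```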

Cited by 1 publication (1 citation statement)
References 25 publications (47 reference statements)
“…MAGIC (Su et al, 2022) uses a CLIP-induced score to regularize the language generation of GPT-2 so that the zero-shot generated caption is semantically related to the given image. BEST (Xie et al, 2022) uses the cooperation of Florence (Yuan et al, 2021) and GPT-3 for visual storytelling and image paragraph captioning. Wang et al (2022j) propose the cooperation of CLIP, BLIP (Li et al, 2022f), and GPT-3 for few-shot video-language learning.…”
Section: VLP for L Big Models
Citation type: mentioning (confidence: 99%)
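For readers unfamiliar with the CLIP-guided generation the statement refers to, the sketch below shows a simplified variant: re-ranking candidate captions (e.g., sampled from GPT-2) by CLIP image-text similarity. This is not MAGIC's exact token-level decoding objective; the model checkpoint, candidate strings, and image path are illustrative assumptions.

```python
# Simplified sketch of CLIP-based caption re-ranking (not MAGIC's exact
# token-level objective). Uses the open-source OpenAI CLIP package.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_scores(image_path, candidate_captions):
    """Return CLIP cosine similarity between the image and each candidate caption."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(candidate_captions).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).squeeze(0)

# Candidates would normally be sampled from a language model such as GPT-2;
# here they are placeholders, and "example.jpg" is an illustrative path.
candidates = ["a dog catching a frisbee on the beach", "a plate of pasta on a table"]
scores = clip_scores("example.jpg", candidates)
print(candidates[int(scores.argmax())])
```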