2022
DOI: 10.48550/arxiv.2209.15162
Preprint
Linearly Mapping from Image to Text Space

Cited by 7 publications (13 citation statements)
References: 0 publications
“…Some works have also been proposed to answer this question. Merullo et al proposed a method [146] that injects a linear projection between the frozen image encoder and the text encoder. During training, only the linear projection is tuned.…”
Section: Vision Language Generation (mentioning)
confidence: 99%
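The setup described in this statement (frozen models on both sides, with only a linear projection trained between them) can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in rather than the cited paper's code: the 2x2 `FROZEN_ENCODER`, the `PAIRS` data, and the squared-error objective are assumptions made for brevity. In the real setting the training signal would presumably come through the frozen text model rather than from fixed target vectors.

```python
# Toy sketch: train only a linear projection on top of a frozen encoder.
# All names, sizes, and data are hypothetical stand-ins, not the cited method.

# Frozen "image encoder": a fixed matrix that is never updated.
FROZEN_ENCODER = ((1.0, 0.0), (0.5, 1.0))

def encode_image(pixels):
    """Map a 2-value toy 'image' to a 2-d feature vector."""
    return [sum(w * x for w, x in zip(row, pixels)) for row in FROZEN_ENCODER]

def project(weights, bias, feats):
    """The only trainable part: a linear map into the text model's input space."""
    return [sum(w * f for w, f in zip(row, feats)) + b
            for row, b in zip(weights, bias)]

def mean_squared_error(weights, bias, pairs):
    """Average squared distance between projected features and targets."""
    total = 0.0
    for pixels, target in pairs:
        out = project(weights, bias, encode_image(pixels))
        total += sum((o - t) ** 2 for o, t in zip(out, target))
    return total / len(pairs)

def train_projection(pairs, epochs=2000, lr=0.1):
    """Plain SGD on squared error; only the projection parameters change."""
    weights = [[0.0, 0.0], [0.0, 0.0]]
    bias = [0.0, 0.0]
    for _ in range(epochs):
        for pixels, target in pairs:
            feats = encode_image(pixels)
            out = project(weights, bias, feats)
            for i, (o, t) in enumerate(zip(out, target)):
                err = o - t
                for j, f in enumerate(feats):
                    weights[i][j] -= lr * err * f
                bias[i] -= lr * err
    return weights, bias

# Toy (image, target) pairs; the targets stand in for the input vectors
# that a frozen text model would expect.
PAIRS = [((1.0, 0.0), (1.6, 0.8)),
         ((0.0, 1.0), (-0.9, 0.8)),
         ((1.0, 1.0), (0.6, 1.8))]

if __name__ == "__main__":
    start = mean_squared_error([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], PAIRS)
    weights, bias = train_projection(PAIRS)
    print("loss before:", start)
    print("loss after:", mean_squared_error(weights, bias, PAIRS))
```

Storing the encoder as an immutable tuple makes the "frozen" constraint explicit: the training loop has no way to update it, so any drop in loss comes from the projection alone.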
“…Many recent works have transferred it on multiple downstream tasks, including semantic segmentation [8,24], object detection [5], Visual Question Answering [48] and image generation [33]. Many researchers regard CLIP as a pre-trained feature extractor [8,24,26,38,52]. [26,38] directly utilize CLIP as an image encoder to extract visual context.…”
Section: Vision-language Contrastive Learning (mentioning)
confidence: 99%
“…Many researchers regard CLIP as a pre-trained feature extractor [8,24,26,38,52]. [26,38] directly utilize CLIP as an image encoder to extract visual context. [8,24] employ CLIP to align the image with the target class for open-vocabulary tasks.…”
Section: Vision-language Contrastive Learning (mentioning)
confidence: 99%
“…MAPL keeps both the vision encoder and the LM frozen (thus further reducing the number of trainable parameters) and only learns a lightweight mapping network to connect both frozen models. Similar to MAPL, concurrent work LiMBeR (Merullo et al, 2022) also proposes to connect a frozen vision encoder with a frozen LM but using a linear mapping, which is not as parameter- and compute-efficient as MAPL (Sec. 4.5).…”
Section: Related Work (mentioning)
confidence: 99%