2023
DOI: 10.48550/arxiv.2302.14794
Preprint

Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning

Abstract: Multimodal few-shot learning is challenging due to the large domain gap between vision and language modalities. Existing methods are trying to communicate visual concepts as prompts to frozen language models, but rely on hand-engineered task induction to reduce the hypothesis space. To make the whole process learnable, we introduce a multimodal meta-learning approach. Specifically, our approach decomposes the training of the model into a set of related multimodal few-shot tasks. We define a meta-mapper network…
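The abstract's decomposition of training into related multimodal few-shot tasks can be pictured as episodic sampling of support and query sets. The sketch below is only an illustration under assumptions: the dataset layout, field names, and function name are hypothetical and not taken from the paper's code.

```python
import random
from collections import defaultdict

# Illustrative sketch of episodic task construction for multimodal
# few-shot meta-learning (names and data layout are assumptions).

def sample_episode(dataset, n_way=2, k_shot=1, n_query=1):
    """Sample one few-shot task from a list of {"image", "caption", "label"} items."""
    by_label = defaultdict(list)
    for item in dataset:
        by_label[item["label"]].append(item)

    # Pick n_way classes, then split each class into support and query examples.
    classes = random.sample(list(by_label), n_way)
    support, query = [], []
    for c in classes:
        examples = random.sample(by_label[c], k_shot + n_query)
        support.extend(examples[:k_shot])
        query.extend(examples[k_shot:])
    return support, query

# Meta-training would iterate over many such episodes, adapting the
# learnable mapper on the support set and evaluating on the query set.
```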

Cited by 2 publications (8 citation statements)
References 22 publications (39 reference statements)
“…Frozen [56] freezes the language decoder and trains the vision encoder, and serves as the baseline for this problem. MML [54] uses a mapper to convert vision features into the language feature space and trains it with meta-learning by constructing a new training dataset. Our method, VL-Few, uses a learnable vision prompt and a vision-language prompt to carry the visual information, helping the model learn from images, which are data of a different modality.…”
Section: Results
confidence: 99%
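The mapper idea in the statement above can be pictured as a small projection from a frozen vision encoder's pooled feature into the language model's token-embedding space. The sketch below is an assumption-laden illustration: the module name, MLP design, and dimensions are not taken from the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of a vision-to-language mapper: project features from a
# frozen vision encoder into the embedding space of a frozen language model
# so they can be consumed as prompt vectors. Sizes are illustrative.

class VisionToLanguageMapper(nn.Module):
    def __init__(self, vision_dim=768, lm_dim=768, prefix_len=4):
        super().__init__()
        self.prefix_len = prefix_len
        # A small MLP mapping one pooled vision feature to `prefix_len`
        # pseudo-token embeddings in the language model's space.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len, lm_dim * prefix_len),
        )

    def forward(self, vision_feat):          # (batch, vision_dim)
        out = self.mlp(vision_feat)          # (batch, lm_dim * prefix_len)
        return out.view(vision_feat.size(0), self.prefix_len, -1)

# The resulting (batch, prefix_len, lm_dim) tensor is prepended to the
# text embeddings of the frozen language model.
```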
“…However, this method requires a large amount of data to train the visual encoder, making it difficult to apply in small-sample scenarios. MML [54] proposes a meta mapper that maps image features into the language feature space. Specifically, MML extracts image features with a visual encoder and feeds them, together with a visual prefix, into the meta mapper.…”
Section: Related Work
confidence: 99%
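A minimal sketch of the step described above, assuming an attention-style meta mapper in which a learnable visual prefix is fed in together with the encoder's image features; the layer choice, sizes, and names are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Hedged sketch of a meta mapper: learnable visual prefix tokens attend over
# frozen-encoder image features and come out as a compact visual prompt.

class MetaMapper(nn.Module):
    def __init__(self, dim=768, prefix_len=4, num_heads=8):
        super().__init__()
        # Learnable visual prefix that is fed in together with image features.
        self.visual_prefix = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats):                 # (batch, n_patches, dim)
        prefix = self.visual_prefix.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        # Prefix tokens query the image features; the output keeps prefix length.
        mapped, _ = self.attn(query=prefix, key=image_feats, value=image_feats)
        return mapped                               # (batch, prefix_len, dim)
```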
“…It also constructs a meta-learning dataset so the model learns how to learn and can perform tasks it has not seen before. MML [40] notes that Frozen [39] needs to retrain the vision model, which requires substantial computing power and time. It therefore designs a lightweight module that obtains the vision features through four learnable prefix tokens, extracts these tokens as the image embedding, and concatenates them with the text embedding.…”
Section: Multimodal Meta-learning
confidence: 99%
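The concatenation described above fits in a few lines: the mapped prefix tokens serve as the image embedding and are prepended to the text embedding before the frozen language model. Shapes and the embedding interface in this sketch are assumptions.

```python
import torch

# Sketch of the concatenation step: four mapped prefix tokens act as the
# image embedding and are prepended to the text token embeddings.

batch, prefix_len, dim = 2, 4, 768
image_prefix = torch.randn(batch, prefix_len, dim)   # output of the mapper
text_embeds = torch.randn(batch, 16, dim)            # frozen LM token embeddings

multimodal_input = torch.cat([image_prefix, text_embeds], dim=1)
print(multimodal_input.shape)                        # torch.Size([2, 20, 768])

# The frozen language model then consumes this sequence through its
# embedding-level input interface; only the mapper parameters are trained.
```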
“…But it needs a large amount of data and computing power to train the vision encoder, which is not friendly to low-resource learners. MML [40] proposes a mapper module that converts vision features into the language feature space with limited data. This reduces the training computation, but the vision features cannot learn language information, which is important for multimodal tasks.…”
Section: Introduction
confidence: 99%