2023
DOI: 10.48550/arxiv.2302.04858
Preprint

Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning

Abstract: Augmenting pretrained language models (LMs) with a vision encoder (e.g., Flamingo) has obtained state-of-the-art results in image-to-text generation. However, these models store all their knowledge within their parameters, thus often requiring enormous model parameters to model the abundant visual concepts and very rich textual descriptions. Additionally, they are inefficient in incorporating new data, requiring a computationally expensive fine-tuning process. In this work, we introduce a Retrieval-augmented Visual…

Cited by 2 publications (3 citation statements)
References 21 publications (40 reference statements)
“…In addition, they introduced contrastive learning [17], similarity-bucket, and retrieval-augmented [37] methods. Captions in the NICE dataset often contain new concepts, such as camera-angle descriptions and proper nouns like place names, which are difficult to predict under zero-shot settings. Motivated by retrieval-augmented models [5,20], the RETRIEVER framework aims to compensate for such limited data conditions by efficiently utilizing external knowledge in model training and inference.…”
Section: Fine-tuning Stage Further Compresses the Dataset In The Last
confidence: 99%
“…Recently, retrieval-augmented generation (RAG) has attracted increasing attention in both the natural language processing [1,9,10,2] and vision-and-language [3,4,18] communities. REALM [1] uses the query to retrieve the top 𝑘 most relevant article snippets, then uses large language models (LLMs) to generate 𝑘 responses, which are combined to obtain a final output for question answering.…”
Section: Retrieval-augmented Generation
confidence: 99%
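The retrieve-then-generate pattern described in that excerpt can be sketched in plain Python. This is a toy illustration, not REALM itself: the bag-of-words cosine retriever stands in for REALM's learned dense retriever, and `generate` is a hypothetical placeholder for an LLM call conditioned on the query plus one retrieved snippet.

```python
import re
from collections import Counter
from math import sqrt

def tokens(text: str) -> Counter:
    """Lowercase bag-of-words vector for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query: str, snippets: list[str], k: int) -> list[str]:
    """Rank snippets by similarity to the query and keep the best k."""
    q = tokens(query)
    return sorted(snippets, key=lambda s: cosine(q, tokens(s)), reverse=True)[:k]

def generate(query: str, snippet: str) -> str:
    """Hypothetical stand-in for an LLM call conditioned on one snippet."""
    return f"{query} | context: {snippet}"

snippets = [
    "The Eiffel Tower is in Paris.",
    "Retrieval-augmented models fetch external knowledge.",
    "Bananas are rich in potassium.",
]
query = "how do retrieval-augmented models use external knowledge"
top = retrieve_top_k(query, snippets, k=2)
responses = [generate(query, s) for s in top]
```

In a full RAG pipeline the 𝑘 per-snippet responses would then be combined (e.g. by marginalizing over retrieved documents) into a single answer.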
“…The memory, encoder, retriever, and generator are pre-trained in an end-to-end manner. Re-ViLM [4] augments Flamingo [5] by retrieving relevant image-text pairs from external image-text datasets [6,7,8] for zero-shot and in-context few-shot image-to-text generation. RA-CM3 [40] performs retrieval from an external memory for generating both images and text.…”
Section: Retrieval-augmented Generation
confidence: 99%
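The image-text retrieval step that this excerpt attributes to Re-ViLM can be sketched as nearest-neighbor lookup over an external datastore of (image embedding, caption) pairs. Everything here is illustrative: the 3-dimensional hand-made embeddings and the tiny datastore are stand-ins for a real pretrained image encoder and a large external image-text dataset.

```python
# Toy datastore of (image embedding, caption) pairs. A real system would
# embed images with a pretrained encoder; these 3-d vectors are made up.
datastore = [
    ([1.0, 0.0, 0.0], "a dog running on grass"),
    ([0.0, 1.0, 0.0], "a plate of pasta"),
    ([0.0, 0.0, 1.0], "a city skyline at night"),
]

def dot(a: list[float], b: list[float]) -> float:
    """Inner-product similarity between two embeddings."""
    return sum(x * y for x, y in zip(a, b))

def retrieve_pairs(query_emb: list[float], k: int = 2) -> list[str]:
    """Return the k captions whose image embeddings best match the query."""
    ranked = sorted(datastore, key=lambda p: dot(query_emb, p[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]

# A query image whose embedding mostly resembles the dog photo:
neighbors = retrieve_pairs([0.9, 0.1, 0.0])
```

The retrieved captions would then be fed to the generator as extra conditioning context alongside the query image.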