2022
DOI: 10.48550/arxiv.2206.01201
Preprint

REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering

Abstract: This paper revisits visual representation in knowledge-based visual question answering (VQA) and demonstrates that better use of regional information can significantly improve performance. While visual representation is extensively studied in traditional VQA, it is under-explored in knowledge-based VQA even though the two tasks share a common spirit, i.e., both rely on visual input to answer the question. Specifically, we observe that in most state-of-the-art knowledge-based VQA methods: 1) visual f…

Cited by 3 publications (6 citation statements)
References 27 publications
“…combine Wikipedia, ConceptNet, and Google Images to supplement multi-modal knowledge. With the emergence of language models, researchers treat them as implicit KBs [43,54], and several studies [12,15,28,31] combine explicit and implicit knowledge to improve a model's ability to handle visual questions. Recently, large language models have impressed people with a quantum leap in understanding and reasoning capabilities.…”
Section: Related Work, 2.1 VQA Tasks
confidence: 99%
“…KRISP [89] leverages several external KGs [24,26,81], visual knowledge from Visual Genome [90], as well as implicit knowledge from BERT [27]. REVIVE [91] deploys several visual features to retrieve knowledge from various sources, such as Wikidata and GPT-3. Visual feature guidance was proven critical to improving the knowledge retrieval process.…”
Section: Visual Question Answering (VQA)
confidence: 99%
“…Knowledge-Based VQA. In REVIVE (Lin et al., 2022), the authors proposed to first employ an object detector to locate the objects, and then use the cropped bounding-box proposals to retrieve various types of external knowledge. Finally, they fed this knowledge, together with the regional visual features, into a transformer to predict an answer.…”
Section: Related Work
confidence: 99%
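
The citing description above outlines a concrete detect → crop → retrieve → fuse pipeline. Below is a minimal Python sketch of that flow, not the authors' implementation: torchvision's off-the-shelf Faster R-CNN stands in for REVIVE's detector, while `retrieve_knowledge` and `AnswerTransformer` are hypothetical placeholders for its knowledge retriever (e.g. Wikidata, GPT-3) and its answer-prediction transformer.

```python
# Minimal sketch of a REVIVE-style detect -> crop -> retrieve -> fuse pipeline.
# Assumptions: torchvision's Faster R-CNN stands in for the paper's detector;
# retrieve_knowledge() and AnswerTransformer are hypothetical placeholders.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_regions(image: torch.Tensor, score_threshold: float = 0.7) -> torch.Tensor:
    """Run the detector on a (C, H, W) image and keep confident boxes."""
    with torch.no_grad():
        pred = detector([image])[0]  # dict with 'boxes', 'labels', 'scores'
    keep = pred["scores"] > score_threshold
    return pred["boxes"][keep]  # (N, 4) boxes as [x1, y1, x2, y2]

def crop_regions(image: torch.Tensor, boxes: torch.Tensor) -> list:
    """Crop each bounding-box proposal out of the image."""
    return [image[:, y1:y2, x1:x2]
            for x1, y1, x2, y2 in boxes.round().int().tolist()]

def retrieve_knowledge(crop: torch.Tensor) -> list:
    """Hypothetical retriever: map a region crop to external knowledge
    snippets (e.g. Wikidata entries or GPT-3 outputs, per the quote)."""
    return ["<knowledge snippet for this region>"]

class AnswerTransformer(torch.nn.Module):
    """Hypothetical fusion head: jointly encodes question, knowledge, and
    regional features, then classifies over a fixed answer vocabulary."""
    def __init__(self, dim: int = 256, num_answers: int = 1000):
        super().__init__()
        layer = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                 batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
        self.head = torch.nn.Linear(dim, num_answers)

    def forward(self, fused_tokens: torch.Tensor) -> torch.Tensor:
        # fused_tokens: (B, T, dim) question + knowledge + region embeddings
        return self.head(self.encoder(fused_tokens)[:, 0])

if __name__ == "__main__":
    image = torch.rand(3, 480, 640)  # stand-in RGB image
    crops = crop_regions(image, detect_regions(image))
    knowledge = [retrieve_knowledge(c) for c in crops]
    answer_logits = AnswerTransformer()(torch.rand(1, 8, 256))
    print(len(crops), "regions,", answer_logits.shape)
```

The key design point the quote emphasizes is that knowledge retrieval is driven per region (from the crops) rather than from the whole image, and the regional features are reused again at answering time.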
“…The traditional knowledge retrieval module usually retrieves knowledge from sources such as Wikipedia, knowledge graphs, and web search (Wu et al., 2022). More recently, Large Language Models (LLMs) such as GPT-3 have been used to produce related knowledge (Lin et al., 2022; Hu et al., 2022b). The latter approach is preferred since traditional knowledge retrieval often introduces information irrelevant to the question.…”
Section: Introduction
confidence: 99%
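
To illustrate the LLM-as-knowledge-source approach this quote contrasts with traditional retrieval, the sketch below shows one plausible way to prompt a language model for question-related knowledge. The prompt format and the `call_llm` helper are assumptions for illustration, not the prompts used by Lin et al. or Hu et al.

```python
# Hypothetical sketch of LLM-based knowledge generation for a visual question.
# call_llm() is a placeholder for any completion API (e.g. GPT-3); the prompt
# format below is illustrative, not the one used in the cited papers.
def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to an LLM and return its completion."""
    raise NotImplementedError("wire this to your LLM provider of choice")

def generate_knowledge(question: str, captions: list[str], n_facts: int = 3) -> str:
    """Ask the model for facts relevant to the question, conditioned on
    textual descriptions of the image (captions or detected-object tags)."""
    context = "\n".join(f"- {c}" for c in captions)
    prompt = (
        "Image context:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        f"List {n_facts} short facts that help answer this question:\n"
    )
    return call_llm(prompt)

# Example (would require a real LLM backend):
# generate_knowledge("What sport is being played?",
#                    ["a man swinging a bat on a grass field"])
```

Conditioning the prompt on image descriptions is what keeps the generated knowledge question-relevant, which is the advantage over blind retrieval that the quote highlights.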