2022
DOI: 10.48550/arxiv.2206.01201
Preprint

REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering

Abstract: This paper revisits visual representation in knowledge-based visual question answering (VQA) and demonstrates that better use of regional information can significantly improve performance. While visual representation is extensively studied in traditional VQA, it is under-explored in knowledge-based VQA even though the two tasks share a common spirit, i.e., both rely on visual input to answer the question. Specifically, we observe that in most state-of-the-art knowledge-based VQA methods: 1) visual f…

Cited by 3 publications (6 citation statements)
References 27 publications
“…combine Wikipedia, ConceptNet, and Google Images to supplement multi-modal knowledge. With the emergence of language models, researchers treat them as implicit KBs [43,54], and several studies [12,15,28,31] combine explicit and implicit knowledge to improve a model's ability to handle visual questions. Recently, large language models have impressed people with a quantum leap in understanding and reasoning capabilities.…”
Section: Related Work, 2.1 VQA Tasks
confidence: 99%
“…KRISP [89] leverages several external KGs [24,26,81], visual knowledge from Visual Genome [90], as well as implicit knowledge from BERT [27]. REVIVE [91] deploys several visual features to retrieve knowledge from various sources, such as Wikidata and GPT-3. Visual feature guidance was proven critical to improving the knowledge retrieval process.…”
Section: Visual Question Answering (VQA)
confidence: 99%
“…Knowledge-Based VQA. In REVIVE (Lin et al., 2022), the authors proposed to first employ an object detector to locate the objects, and then use the cropped bounding-box proposals to retrieve various types of external knowledge. Finally, they fed this knowledge, together with the regional visual features, into a transformer to predict an answer.…”
Section: Related Work
confidence: 99%
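
The citing description above outlines a concrete detect → crop → retrieve → fuse pipeline. Below is a minimal Python sketch of that flow, not the authors' implementation: torchvision's off-the-shelf Faster R-CNN stands in for REVIVE's detector, while `retrieve_knowledge` and `AnswerTransformer` are hypothetical placeholders for its knowledge retriever (e.g. Wikidata, GPT-3) and its answer-prediction transformer.

```python
# Minimal sketch of a REVIVE-style detect -> crop -> retrieve -> fuse pipeline.
# Assumptions: torchvision's Faster R-CNN stands in for the paper's detector;
# retrieve_knowledge() and AnswerTransformer are hypothetical placeholders.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_regions(image: torch.Tensor, score_threshold: float = 0.7) -> torch.Tensor:
    """Run the detector on a (C, H, W) image and keep confident boxes."""
    with torch.no_grad():
        pred = detector([image])[0]  # dict with 'boxes', 'labels', 'scores'
    keep = pred["scores"] > score_threshold
    return pred["boxes"][keep]  # (N, 4) boxes as [x1, y1, x2, y2]

def crop_regions(image: torch.Tensor, boxes: torch.Tensor) -> list:
    """Crop each bounding-box proposal out of the image."""
    return [image[:, y1:y2, x1:x2]
            for x1, y1, x2, y2 in boxes.round().int().tolist()]

def retrieve_knowledge(crop: torch.Tensor) -> list:
    """Hypothetical retriever: map a region crop to external knowledge
    snippets (e.g. Wikidata entries or GPT-3 outputs, per the quote)."""
    return ["<knowledge snippet for this region>"]

class AnswerTransformer(torch.nn.Module):
    """Hypothetical fusion head: jointly encodes question, knowledge, and
    regional features, then classifies over a fixed answer vocabulary."""
    def __init__(self, dim: int = 256, num_answers: int = 1000):
        super().__init__()
        layer = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                 batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
        self.head = torch.nn.Linear(dim, num_answers)

    def forward(self, fused_tokens: torch.Tensor) -> torch.Tensor:
        # fused_tokens: (B, T, dim) question + knowledge + region embeddings
        return self.head(self.encoder(fused_tokens)[:, 0])

if __name__ == "__main__":
    image = torch.rand(3, 480, 640)  # stand-in RGB image
    crops = crop_regions(image, detect_regions(image))
    knowledge = [retrieve_knowledge(c) for c in crops]
    answer_logits = AnswerTransformer()(torch.rand(1, 8, 256))
    print(len(crops), "regions,", answer_logits.shape)
```

The key design point the quote emphasizes is that knowledge retrieval is driven per region (from the crops) rather than from the whole image, and the regional features are reused again at answering time.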
“…The traditional knowledge retrieval module usually retrieves knowledge from sources such as Wikipedia, knowledge graphs, and web search (Wu et al., 2022). More recently, Large Language Models (LLMs) such as GPT-3 have been used to produce related knowledge (Lin et al., 2022; Hu et al., 2022b). The latter approach is preferred since traditional knowledge retrieval often introduces information irrelevant to the question.…”
Section: Introduction
confidence: 99%
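
To illustrate the LLM-as-knowledge-source approach this quote contrasts with traditional retrieval, the sketch below shows one plausible way to prompt a language model for question-related knowledge. The prompt format and the `call_llm` helper are assumptions for illustration, not the prompts used by Lin et al. or Hu et al.

```python
# Hypothetical sketch of LLM-based knowledge generation for a visual question.
# call_llm() is a placeholder for any completion API (e.g. GPT-3); the prompt
# format below is illustrative, not the one used in the cited papers.
def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to an LLM and return its completion."""
    raise NotImplementedError("wire this to your LLM provider of choice")

def generate_knowledge(question: str, captions: list[str], n_facts: int = 3) -> str:
    """Ask the model for facts relevant to the question, conditioned on
    textual descriptions of the image (captions or detected-object tags)."""
    context = "\n".join(f"- {c}" for c in captions)
    prompt = (
        "Image context:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        f"List {n_facts} short facts that help answer this question:\n"
    )
    return call_llm(prompt)

# Example (would require a real LLM backend):
# generate_knowledge("What sport is being played?",
#                    ["a man swinging a bat on a grass field"])
```

Conditioning the prompt on image descriptions is what keeps the generated knowledge question-relevant, which is the advantage over blind retrieval that the quote highlights.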