Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021
DOI: 10.1145/3404835.3463259
Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering

Cited by 16 publications (8 citation statements)
References 33 publications

“…KB-VQA questions can also require commonsense reasoning, as in parts of OK-VQA and A-OKVQA (Schwenk et al., 2022). In particular, S3VQA (Jain et al., 2021) is an augmented version of OK-VQA, improving both the quantity and quality of some question types. A-OKVQA has shifted its core task to "reasoning questions".…”
Section: Related Work
confidence: 99%
“…VQA 2.0 (Goyal et al., 2017) collects 'complementary images' such that each question is associated with a pair of images that result in different answers. Jain et al. (2021) derive new S3VQA questions from manually defined question templates. They annotated spans of objects that could be replaced, and then substituted them with a complicated substitute-and-search system.…”
Section: Related Work
confidence: 99%
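
The substitute-and-search construction quoted above (annotate a replaceable object span in a question, substitute it with a more specific label, then search for external knowledge) can be illustrated with a minimal sketch. Everything below is assumed for illustration: the function name, the `search_fn` wrapper, and the detector label are hypothetical, not the S3VQA authors' actual pipeline.

```python
# Hypothetical sketch of the select-substitute-search idea; names are
# illustrative and do not reflect the S3VQA authors' implementation.

def select_substitute_search(question: str, span: str, detected_label: str,
                             search_fn) -> list[str]:
    """Rewrite `question` by substituting the annotated object `span` with a
    more specific label (e.g., from an object detector), then use the
    rewritten question as a web-search query for external knowledge.

    `search_fn` is assumed to be any callable mapping a query string to a
    list of retrieved text snippets (e.g., a search-engine wrapper).
    """
    assert span in question, "the annotated span must occur in the question"
    rewritten = question.replace(span, detected_label, 1)  # substitute step
    return search_fn(rewritten)                            # search step
```

For instance, "What is the speed of this animal?" with the annotated span "this animal" and a detector output "a cheetah" would be rewritten to "What is the speed of a cheetah?" before retrieval.
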
“…retriever) to recall the required explicit knowledge as external input to the downstream reader. To take advantage of information on the Internet, [20,31,33,37] pass the vision-linguistic information through a search engine (e.g., Google) to retrieve a relevant corpus (e.g., sentences from Wikipedia articles or snippets from search results) as weak positive knowledge samples, which are then passed to the reader module for knowledge incorporation. Among these methods, Luo et al. [31] use the previously retrieved snippets as a KB and treat snippets that contain the answer words as weak positive signals for retriever training.…”
Section: Knowledge-based VQA
confidence: 99%
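
The weak-supervision heuristic attributed to Luo et al. [31] in the excerpt above (retrieved snippets that contain the answer words are treated as weak positives for retriever training) is simple enough to sketch. The function name and return format below are assumptions for illustration, not the paper's API.

```python
# Illustrative sketch of the weak-positive labeling heuristic described
# above; not Luo et al.'s actual code.

def label_weak_positives(snippets: list[str], answer: str) -> list[tuple[str, int]]:
    """Return (snippet, label) pairs: label 1 marks a weak positive, i.e.,
    the snippet contains the gold answer words; label 0 otherwise."""
    answer_lc = answer.lower()
    return [(s, int(answer_lc in s.lower())) for s in snippets]
```

Pairs produced this way could supervise a retriever with a binary relevance loss, at the cost of some noise: a snippet may mention the answer words without actually supporting the answer, which is why the signal is only weakly positive.
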
“…Works in the second category are based on a knowledge-retrieval strategy. We observe that these methods [20,31,37,57] usually pass the vision-linguistic information through a search engine, where network delay can become a bottleneck. Others retrieve a relevant corpus from encyclopedia articles, which introduces a lot of irrelevant information and interferes with the model's judgment.…”
Section: Introduction
confidence: 99%
“…In this work, we mainly focus on the KRVQR dataset and also test our model on the FVQA dataset. Other VQA datasets that require external knowledge exist (Marino et al. 2019; Jain et al. 2021), but there the task is to search for external knowledge, which is outside the scope of this work.…”
Section: Introduction
confidence: 99%