Dynamic Key-Value Memory Enhanced Multi-Step Graph Reasoning for Knowledge-Based Visual Question Answering

Li, Mingxiao; Moens, Marie‐Francine

doi:10.1609/aaai.v36i10.21346

Cited by 9 publications

(3 citation statements)

References 32 publications

(58 reference statements)


“…Graph Neural Network. Graph neural network (GNN) (Li and Moens 2022; Scarselli et al 2008; Li et al 2019; Gao et al 2020; Zhu et al 2020) is a highly effective framework for representing graph-structured data. GNNs follow the message passing scheme that updates each node's feature using its neighborhoods of nodes to capture specific patterns of a graph.…”

Section: Related Work (mentioning)

confidence: 99%

“…GNNs follow the message passing scheme that updates each node's feature using its neighborhoods of nodes to capture specific patterns of a graph. Some encouraging works (Li and Moens 2022; Li et al 2019; Gao et al 2020; Zhu et al 2020) study graph neural networks to solve the VQA task. For example, ReGAT (Li et al 2019) represents the image as a graph and captures interactions between objects through the graph attention mechanism.…”

Section: Related Work (mentioning)

confidence: 99%

Object Attribute Matters in Visual Question Answering

Li, Si, et al. 2024. AAAI

Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. However, integrating visual and textual semantics solely through attention layers is insufficient to comprehensively understand and align information from both modalities. Intuitively, object attributes can naturally serve as a bridge to unify them, which has been overlooked in previous research. In this paper, we propose a novel VQA approach that utilizes object attributes, aiming to achieve better object-level visual-language alignment and multimodal scene understanding. Specifically, we design an attribute fusion module and a contrastive knowledge distillation module. The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing. The enhanced object-level visual features help solve fine-grained problems such as counting questions. The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness. Furthermore, to improve scene understanding and out-of-distribution performance, the contrastive knowledge distillation module introduces a series of implicit knowledge. We distill knowledge into attributes through a contrastive loss, which further strengthens the representation learning of attribute features and facilitates visual-linguistic alignment. Extensive experiments on six datasets, COCO-QA, VQAv2, VQA-CPv2, VQA-CPv1, VQAvs and TDIUC, show the superiority of the proposed method.


“…Several benchmark datasets [32,42,48,49], including complex reasoning questions, facilitate the development of this field. To incorporate with external knowledge, early methods turned to textual Knowledge Bases (KBs) and applied either graph-based [24,36,60,61] or transformer-based approaches [11,13] to introduce the KB information into the question answering module. Besides, multi-modal KBs are also leveraged to solve VQA tasks.…”

Section: Related Work, 2.1 VQA Tasks (mentioning)

confidence: 99%