Existing works in Visual Question Answering (VQA) that introduce external knowledge mainly focus on leveraging this knowledge to supplement the language representation of the model's question input. However, such approaches ignore the commonsense information implied in the image. In this paper, we propose a novel VQA framework that embeds knowledge features into both the visual and language representations via a shared knowledge graph. To bridge the gap between visual representation and knowledge representation, we propose the knowledge-enhancing visual representation (KEVR) module, which retrieves external knowledge related to the image from the knowledge graph. With KEVR, external knowledge related to the objects in the image can be embedded into the visual representation directly. For the input question, a dedicated transformer is used to embed knowledge features into the language representation. The knowledge graph used in our model is extracted from three knowledge bases: we organize the prior knowledge as RDF triples to establish knowledge connections, and a graph neural network is then employed to extract the multilateral relationships in the knowledge graph. In addition, a two-stream transformer is employed to obtain the attention-based vision-language representation. Experimental results show that our model outperforms the best baseline by 1.34% and 2.59% in accuracy on the VQA 2.0 and OK-VQA datasets, respectively.
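The following is a minimal, illustrative sketch of the KEVR idea described above: knowledge-node embeddings derived from the RDF triples are refined with a graph neural network and then injected into object region features. All module names, dimensions, and the single-layer GCN / cross-attention choices are assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch: GNN over knowledge-graph nodes + knowledge-enhanced
# visual representation via cross-attention (names and shapes are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: mean-aggregate neighbor embeddings using
    an adjacency matrix built from the RDF triples (with self-loops)."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # adj: (num_nodes, num_nodes) adjacency with self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        agg = adj @ node_feats / deg          # mean over neighbors
        return F.relu(self.linear(agg))


class KEVRFusion(nn.Module):
    """Cross-attention from object region features to knowledge-node
    embeddings, producing knowledge-enhanced visual features."""

    def __init__(self, vis_dim, kg_dim):
        super().__init__()
        self.kg_proj = nn.Linear(kg_dim, vis_dim)
        self.attn = nn.MultiheadAttention(vis_dim, num_heads=4, batch_first=True)

    def forward(self, region_feats, kg_feats):
        # region_feats: (batch, num_regions, vis_dim)
        # kg_feats:     (batch, num_nodes, kg_dim)
        kg = self.kg_proj(kg_feats)
        enhanced, _ = self.attn(query=region_feats, key=kg, value=kg)
        return region_feats + enhanced        # residual knowledge injection


if __name__ == "__main__":
    batch, num_regions, num_nodes, vis_dim, kg_dim = 2, 36, 50, 768, 300
    regions = torch.randn(batch, num_regions, vis_dim)
    nodes = torch.randn(num_nodes, kg_dim)
    adj = (torch.rand(num_nodes, num_nodes) > 0.9).float()
    adj = adj + torch.eye(num_nodes)          # add self-loops

    gcn = SimpleGCNLayer(kg_dim)
    node_feats = gcn(nodes, adj)              # refined knowledge embeddings
    fusion = KEVRFusion(vis_dim, kg_dim)
    out = fusion(regions, node_feats.unsqueeze(0).expand(batch, -1, -1))
    print(out.shape)                          # torch.Size([2, 36, 768])
```

The residual form keeps the original region features intact while adding knowledge-conditioned context, which mirrors the abstract's claim that external knowledge is embedded into the visual representation directly rather than only into the question side.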