Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.643
Multimodal Neural Graph Memory Networks for Visual Question Answering

Abstract: We introduce a new neural network architecture, Multimodal Neural Graph Memory Networks (MN-GMN), for visual question answering. The MN-GMN uses a graph structure with different region features as node attributes and applies a recently proposed powerful graph neural network model, the Graph Network (GN), to reason about objects and their interactions in an image. The input module of the MN-GMN generates a set of visual features plus a set of encoded region-grounded captions (RGCs) for the image. The RGCs capture obj…
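The abstract describes a visual graph whose nodes carry region features and which is updated by a Graph Network. The following is a minimal sketch of one GN-style message-passing round over image regions, assuming the common "aggregate neighbour features, then update each node" pattern; it is not the authors' implementation, and the function name, data layout, and mean aggregation are all illustrative choices.

```python
def gn_step(node_feats, edges):
    """One illustrative round of message passing over a region graph.

    node_feats: dict mapping region id -> feature vector (list of floats)
    edges: list of (src, dst) pairs, one per directed region interaction
    Returns updated features: each node's vector becomes the elementwise
    mean of its own vector and the vectors of its in-neighbours.
    """
    # Collect incoming messages per node (here, simply the sender's features).
    incoming = {n: [] for n in node_feats}
    for src, dst in edges:
        incoming[dst].append(node_feats[src])

    updated = {}
    for n, feat in node_feats.items():
        msgs = incoming[n] + [feat]  # include a self-loop
        dim = len(feat)
        updated[n] = [sum(m[d] for m in msgs) / len(msgs) for d in range(dim)]
    return updated

# Toy example: three regions; regions 1 and 2 both interact with region 0.
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}
edges = [(1, 0), (2, 0)]
print(gn_step(feats, edges)[0])  # node 0 averages its own and two incoming vectors
```

In a real GN the per-edge messages and per-node updates are learned functions (e.g. small MLPs) rather than a fixed mean, and a global feature may also be updated; this sketch only shows the graph-structured data flow.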

Cited by 24 publications (19 citation statements)
References 28 publications (22 reference statements)
“…), whereas in our case the graph nodes represent multimodal views of a single data-generating source (visual, acoustic, and textual nodes from a single speaking person). In the NLP domain, multimodal GNN methods (Khademi, 2020; Yin et al., 2020) have been applied to tasks such as Visual Question Answering and Machine Translation. However, these settings still differ from ours because they focus on static images and short text, which, unlike the multimodal video data in our case, do not exhibit long-term temporal dependencies across modalities.…”
Section: Related Work
confidence: 99%
“…Recently, Graph Convolutional Networks have been applied to different multimodal tasks, such as Visual Dialog (Guo et al., 2020; Khademi, 2020), multimodal fake news detection (Wang et al., 2020a), and Visual Question Answering (VQA) (Hudson and Manning, 2019; Khademi, 2020). Jiang et al. (2020) applied a novel Knowledge-Bridge Graph Network (KBGN) to model the cross-modal relations in visual dialogue at fine granularity.…”
Section: Graph Neural Network
confidence: 99%
“…However, the KMGCN extracted visual words as visual information and did not make full use of the global information of the image. Khademi (2020) introduced a new neural network architecture, the Multimodal Neural Graph Memory Network (MN-GMN), for VQA; this model constructs a visual graph from bounding boxes, whose overlapping regions may provide redundant information.…”
Section: Graph Neural Network
confidence: 99%
“…Visual Question Answering. VQA has attracted wide attention (Cao et al. 2021; Jain et al. 2021; Khademi 2020; Yu et al. 2020), as it is regarded as a typical multimodal task linking natural language processing and computer vision.…”
Section: Related Work
confidence: 99%