Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.643
Multimodal Neural Graph Memory Networks for Visual Question Answering

Abstract: We introduce a new neural network architecture, Multimodal Neural Graph Memory Networks (MN-GMN), for visual question answering. The MN-GMN uses a graph structure with different region features as node attributes and applies a recently proposed powerful graph neural network model, the Graph Network (GN), to reason about objects and their interactions in an image. The input module of the MN-GMN generates a set of visual features plus a set of encoded region-grounded captions (RGCs) for the image. The RGCs capture obj…
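The abstract describes a visual graph whose nodes carry region features and which is updated by a Graph Network. The following is a minimal sketch of one GN-style message-passing round over image regions, assuming the common "aggregate neighbour features, then update each node" pattern; it is not the authors' implementation, and the function name, data layout, and mean aggregation are all illustrative choices.

```python
def gn_step(node_feats, edges):
    """One illustrative round of message passing over a region graph.

    node_feats: dict mapping region id -> feature vector (list of floats)
    edges: list of (src, dst) pairs, one per directed region interaction
    Returns updated features: each node's vector becomes the elementwise
    mean of its own vector and the vectors of its in-neighbours.
    """
    # Collect incoming messages per node (here, simply the sender's features).
    incoming = {n: [] for n in node_feats}
    for src, dst in edges:
        incoming[dst].append(node_feats[src])

    updated = {}
    for n, feat in node_feats.items():
        msgs = incoming[n] + [feat]  # include a self-loop
        dim = len(feat)
        updated[n] = [sum(m[d] for m in msgs) / len(msgs) for d in range(dim)]
    return updated

# Toy example: three regions; regions 1 and 2 both interact with region 0.
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}
edges = [(1, 0), (2, 0)]
print(gn_step(feats, edges)[0])  # node 0 averages its own and two incoming vectors
```

In a real GN the per-edge messages and per-node updates are learned functions (e.g. small MLPs) rather than a fixed mean, and a global feature may also be updated; this sketch only shows the graph-structured data flow.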

Cited by 24 publications (19 citation statements)
References 28 publications (22 reference statements)
“…), whereas in our case the graph nodes represent multimodal views of a single data-generating source (visual, acoustic, and textual nodes from a single speaking person). In the NLP domain, multimodal GNN methods (Khademi, 2020; Yin et al., 2020) have been applied to tasks such as Visual Question Answering and Machine Translation. However, these settings still differ from ours because they focus on static images and short text, which, unlike the multimodal video data in our case, do not exhibit long-term temporal dependencies across modalities.…”
Section: Related Work
confidence: 99%
“…Recently, Graph Convolutional Networks have been applied to different multimodal tasks, such as Visual Dialog (Guo et al., 2020; Khademi, 2020), multimodal fake news detection (Wang et al., 2020a), and Visual Question Answering (VQA) (Hudson and Manning, 2019; Khademi, 2020). Jiang et al. (2020) applied a novel Knowledge-Bridge Graph Network (KBGN) to model the cross-modal relations in visual dialogue at fine granularity.…”
Section: Graph Neural Network
confidence: 99%
“…However, the KMGCN extracted visual words as visual information and did not make full use of the global information of the image. Khademi (2020) introduced a new neural network architecture, the Multimodal Neural Graph Memory Network (MN-GMN), for VQA; this model constructs a visual graph from bounding boxes, whose overlapping regions may provide redundant information.…”
Section: Graph Neural Network
confidence: 99%
“…Visual Question Answering. VQA has attracted wide attention (Cao et al. 2021; Jain et al. 2021; Khademi 2020; Yu et al. 2020), as it is regarded as a typical multimodal task linking natural language processing and computer vision.…”
Section: Related Work
confidence: 99%