Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.79
MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences

Abstract: Human communication is multimodal in nature; it is through multiple modalities such as language, voice, and facial expressions that opinions and emotions are expressed. Data in this domain exhibits complex multi-relational and temporal interactions. Learning from this data is a fundamentally challenging research problem. In this paper, we propose Modal-Temporal Attention Graph (MTAG). MTAG is an interpretable graph-based neural model that provides a suitable framework for analyzing multimodal sequential data.…
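As a rough illustration of the graph hinted at in the abstract, the sketch below builds one node per (modality, timestep) pair of the unaligned sequences and types every edge by its modality pair and temporal direction. This is a minimal sketch under our own assumptions, not MTAG's released implementation; all names are illustrative.

```python
# Illustrative construction of a modal-temporal graph (our assumption of the
# idea, not MTAG's code): one node per (modality, timestep), edges typed by
# (source modality, target modality, temporal direction).
import itertools
import torch

def build_modal_temporal_edges(seq_lens):
    """seq_lens: dict mapping modality name -> sequence length (may differ)."""
    # Assign each (modality, timestep) node a global index.
    node_index, offset = {}, 0
    for mod, n in seq_lens.items():
        for t in range(n):
            node_index[(mod, t)] = offset + t
        offset += n

    edges, edge_types = [], []
    for (m1, n1), (m2, n2) in itertools.product(seq_lens.items(), repeat=2):
        for t1 in range(n1):
            for t2 in range(n2):
                if (m1, t1) == (m2, t2):
                    continue  # skip self-loops
                direction = "past" if t2 < t1 else "future" if t2 > t1 else "present"
                edges.append((node_index[(m2, t2)], node_index[(m1, t1)]))
                edge_types.append((m2, m1, direction))
    src, dst = zip(*edges)
    return torch.tensor([src, dst]), edge_types

# Unaligned lengths: 3 text, 4 audio, 2 video timesteps.
edge_index, edge_types = build_modal_temporal_edges({"text": 3, "audio": 4, "video": 2})
print(edge_index.shape, len(set(edge_types)))  # (2, 72) edges, 24 distinct edge types
```

Each distinct (modality pair, direction) type would then get its own attention parameters, which is what makes the attention both modal and temporal.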

Cited by 43 publications (17 citation statements) | References 25 publications (32 reference statements)
“…[14] show that multimodal transformers can be considered a special case of fully connected GNNs, including large multimodal transformers such as MERLOT [41] and UNITER [6]. Most similar to our work is [34], which uses a graph neural network to model cross-modality, within-modality, and temporal connections, but differs from ours in its factorization strategy and contrastive learning setup. c) Self-Supervised Contrastive Learning: Self-supervised learning attempts to create "labels" from data (e.g.…”
Section: Related Work
confidence: 82%
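The claim that multimodal transformers are a special case of fully connected GNNs can be pictured concretely: run a graph attention layer over an all-pairs edge set covering the modality tokens. The snippet below is our sketch of that view using PyTorch Geometric's GATConv, not code from [14] or any of the cited models.

```python
# Our sketch of the "transformer as fully connected GNN" view: graph
# attention over an all-pairs edge set mirrors self-attention over the
# concatenated modality tokens.
import torch
from torch_geometric.nn import GATConv

num_tokens, dim = 6, 16                      # e.g., 3 text + 3 audio tokens
x = torch.randn(num_tokens, dim)

# All-pairs edge_index without self-loops (GATConv adds them itself).
row = torch.arange(num_tokens).repeat_interleave(num_tokens)
col = torch.arange(num_tokens).repeat(num_tokens)
mask = row != col
edge_index = torch.stack([row[mask], col[mask]])

gat = GATConv(dim, dim, heads=1, add_self_loops=True)
out = gat(x, edge_index)                     # attention over every token pair
print(out.shape)                             # torch.Size([6, 16])
```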
“…For supervised training (baselines we introduce next), we model the modality-modality components of each graph in a fashion similar to [34], implemented with the PyTorch Geometric package [10]. We add the following edges and associated graph convolutions to the graph: edges connecting questions, edges connecting answers, edges connecting questions and answers, and edges connecting the factorization nodes z_s to the question and answer nodes.…”
Section: Experimental Methodology
confidence: 99%
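The edge types listed in this statement map naturally onto PyTorch Geometric's heterogeneous graph API. The sketch below is our own reading of that setup; the node counts, relation names, and the 'z' factorization nodes are illustrative stand-ins, not the cited paper's code.

```python
# Hedged sketch of the described graph: question, answer, and factorization
# ('z') node types, with one convolution per edge type via HeteroConv.
import torch
from torch_geometric.data import HeteroData
from torch_geometric.nn import HeteroConv, SAGEConv

def fully_connect(n_src, n_dst):
    """All src -> dst pairs as a (2, n_src * n_dst) edge_index."""
    src = torch.arange(n_src).repeat_interleave(n_dst)
    dst = torch.arange(n_dst).repeat(n_src)
    return torch.stack([src, dst])

data = HeteroData()
data["question"].x = torch.randn(4, 32)   # 4 question nodes (illustrative)
data["answer"].x = torch.randn(4, 32)
data["z"].x = torch.randn(2, 32)          # factorization nodes z_s

# Edge types mirroring the description: q-q, a-a, q-a, and z -> q / z -> a.
data["question", "qq", "question"].edge_index = fully_connect(4, 4)
data["answer", "aa", "answer"].edge_index = fully_connect(4, 4)
data["question", "qa", "answer"].edge_index = fully_connect(4, 4)
data["z", "zq", "question"].edge_index = fully_connect(2, 4)
data["z", "za", "answer"].edge_index = fully_connect(2, 4)

# One graph convolution per edge type, aggregated by sum at each node type.
conv = HeteroConv({et: SAGEConv((-1, -1), 32) for et in data.edge_types}, aggr="sum")
out = conv(data.x_dict, data.edge_index_dict)
print({k: v.shape for k, v in out.items()})  # updated question/answer features
```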
“…(Zhang et al., 2021a) propose a multimodal graph fusion approach for named entity recognition, which conducts graph encoding via multimodal semantic interaction. (Yang et al., 2021) focus on multimodal sentiment analysis and emotion recognition, unifying video, audio, and text modalities into an attention graph and learning their interactions through graph fusion, dynamic pruning, and a read-out technique.…”
Section: Related Work
confidence: 99%
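The dynamic pruning mentioned here can be pictured as discarding low-attention edges between message-passing rounds. The helper below is our illustrative guess at that idea, a simple top-k filter on edge attention scores, not MTAG's actual pruning procedure.

```python
# Our guess at the dynamic-pruning idea: keep only the highest-attention
# fraction of edges before the next round of message passing.
import torch

def prune_edges(edge_index, edge_attention, keep_ratio=0.5):
    """edge_index: (2, E) LongTensor; edge_attention: (E,) scores."""
    k = max(1, int(edge_attention.numel() * keep_ratio))
    topk = torch.topk(edge_attention, k).indices
    return edge_index[:, topk], edge_attention[topk]

edge_index = torch.randint(0, 10, (2, 40))   # 40 random edges over 10 nodes
attn = torch.rand(40)                        # e.g., from a graph attention layer
pruned_index, pruned_attn = prune_edges(edge_index, attn)
print(pruned_index.shape)                    # torch.Size([2, 20])
```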
“…Multimodal modeling is a research hotspot that has attracted the attention of many scholars in recent years [2,64]. Key aspects of multimodal modeling include multimodal fusion [53,56], consistency and difference [16,55], and modality alignment [47]. We recommend the survey [2] for a comprehensive understanding.…”
Section: Related Work
confidence: 99%