Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1382

Contextual Inter-modal Attention for Multi-modal Sentiment Analysis

Abstract: Multi-modal sentiment analysis offers various challenges, one being the effective combination of different input modalities, namely text, visual and acoustic. In this paper, we propose a recurrent neural network based multi-modal attention framework that leverages the contextual information for utterance-level sentiment prediction. The proposed approach applies attention on multi-modal multi-utterance representations and tries to learn the contributing features amongst them. We evaluate our proposed approach o…
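The abstract describes a recurrent, attention-based fusion of text, visual and acoustic utterance sequences. Purely as a rough illustration of that idea, the sketch below wires up per-modality bi-GRUs followed by pairwise inter-modal attention in PyTorch; the feature dimensions, the element-wise gating and the final concatenation are assumptions for this sketch, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code) of contextual inter-modal attention:
# a bi-GRU gives each modality a context-aware utterance sequence, then every
# modality pair is fused with a soft attention over utterances.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualInterModalAttention(nn.Module):
    def __init__(self, d_text=300, d_visual=35, d_acoustic=74,
                 d_hidden=100, n_classes=2):
        super().__init__()
        # One bi-GRU per modality captures inter-utterance context.
        self.gru_t = nn.GRU(d_text, d_hidden, batch_first=True, bidirectional=True)
        self.gru_v = nn.GRU(d_visual, d_hidden, batch_first=True, bidirectional=True)
        self.gru_a = nn.GRU(d_acoustic, d_hidden, batch_first=True, bidirectional=True)
        d = 2 * d_hidden
        # Classifier over the concatenated pairwise-attended representations.
        self.out = nn.Linear(6 * d, n_classes)

    def _pair_attention(self, x, y):
        # Cross-modal attention: score every utterance of x against every
        # utterance of y, then re-read each modality through the other.
        scores = torch.matmul(x, y.transpose(1, 2))                 # (B, u, u)
        a_xy = torch.matmul(F.softmax(scores, dim=-1), y)            # x attends over y
        a_yx = torch.matmul(F.softmax(scores.transpose(1, 2), dim=-1), x)
        return a_xy * x, a_yx * y                                    # element-wise gating

    def forward(self, text, visual, acoustic):
        # Inputs: (batch, utterances, feature_dim) per modality.
        ht, _ = self.gru_t(text)
        hv, _ = self.gru_v(visual)
        ha, _ = self.gru_a(acoustic)
        tv, vt = self._pair_attention(ht, hv)
        ta, at = self._pair_attention(ht, ha)
        va, av = self._pair_attention(hv, ha)
        fused = torch.cat([tv, vt, ta, at, va, av], dim=-1)          # (B, u, 6*d)
        return self.out(fused)                                       # per-utterance logits
```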

Cited by 123 publications (88 citation statements)
References 22 publications
“…Multimodal sentiment analysis provides an opportunity to learn interactions between different modalities. Similar to the inter-modal attention approach of Ghosal et al. [10], we propose a method to learn cross-interaction vectors. For a pair of text ($H_T$) and video ($H_V$) modalities, the co-attention matrix ($M_{TV} \in \mathbb{R}^{u \times u}$) can be defined as:…”
Section: Cross Attention
confidence: 99%
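The quoted definition of the co-attention matrix is cut off above. As an illustration of the general construction only, the snippet below builds a u×u co-attention matrix from two contextual utterance sequences via a softmax over pairwise dot products; the shapes and the normalisation are assumptions, not the exact formula from the cited work.

```python
# Hedged sketch: one common way a co-attention matrix over u utterances is
# formed from text and video sequences H_T, H_V in R^{u x d}.
import torch
import torch.nn.functional as F

u, d = 20, 200                            # assumed: utterances per video, hidden size
H_T = torch.randn(u, d)                   # contextual text representations
H_V = torch.randn(u, d)                   # contextual video representations

M_TV = F.softmax(H_T @ H_V.T, dim=-1)     # (u, u) co-attention matrix
attended_T = M_TV @ H_V                   # text re-read through the video modality
```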
“…In our experiments, we used the same features as mentioned in Ghosal et al. [10]. Specifically, for the CMU-MOSEI dataset, we used GloVe embeddings for word features, Facet for visual features and COVAREP [16] for acoustic features.…”
Section: Implementation Details
confidence: 99%
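As a small, self-contained illustration of that feature setup, the snippet below mocks per-utterance CMU-MOSEI inputs. The utterance count and feature dimensions (300-d GloVe, 35-d Facet, 74-d COVAREP) are assumptions made for this sketch, not values quoted from the cited work.

```python
# Hedged sketch of the per-utterance multimodal feature layout described above.
import numpy as np

n_utt = 20                                # assumed utterances per video
text = np.random.randn(n_utt, 300)        # averaged GloVe word embeddings
visual = np.random.randn(n_utt, 35)       # Facet facial-expression features
acoustic = np.random.randn(n_utt, 74)     # COVAREP acoustic features

# Each modality goes to its own recurrent encoder; only the utterance axis
# needs to agree across modalities.
assert text.shape[0] == visual.shape[0] == acoustic.shape[0]
```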