Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML) 2018
DOI: 10.18653/v1/w18-3302

Recognizing Emotions in Video Using Multimodal DNN Feature Fusion

Abstract: We present our system description of input-level multimodal fusion of audio, video, and text for recognition of emotions and their intensities for the 2018 First Grand Challenge on Computational Modeling of Human Multimodal Language. Our proposed approach is based on input-level feature fusion with sequence learning from Bidirectional Long Short-Term Memory (BLSTM) deep neural networks (DNNs). We show that our fusion approach outperforms unimodal predictors. Our system performs 6-way simultaneous classificatio…
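The abstract describes early (input-level) fusion: per-timestep audio, video, and text features are concatenated before any sequence modeling, and a single BLSTM learns over the joint sequence. The sketch below illustrates that idea only; it is not the authors' implementation, and the feature dimensions, hidden size, and the 6-way output head are assumptions chosen for illustration.

# A minimal sketch of input-level (early) fusion with a BLSTM classifier,
# assuming per-timestep audio, video, and text features are already aligned.
# All dimensions and the 6-way output are illustrative, not from the paper.
import torch
import torch.nn as nn

class EarlyFusionBLSTM(nn.Module):
    def __init__(self, audio_dim=74, video_dim=35, text_dim=300,
                 hidden_dim=128, num_classes=6):
        super().__init__()
        fused_dim = audio_dim + video_dim + text_dim
        self.blstm = nn.LSTM(fused_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, audio, video, text):
        # Concatenate the modalities at every timestep before any learning.
        fused = torch.cat([audio, video, text], dim=-1)   # (B, T, fused_dim)
        _, (h_n, _) = self.blstm(fused)
        # Join the final forward and backward hidden states.
        h = torch.cat([h_n[0], h_n[1]], dim=-1)           # (B, 2*hidden_dim)
        return self.classifier(h)

Because fusion happens before the recurrent layer, the BLSTM can in principle model cross-modal interactions at every timestep, which is what distinguishes this setup from the unimodal predictors the abstract compares against.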

Cited by 60 publications (15 citation statements) · References 26 publications
“…MSA aims to predict people's sentiment from the video, audio, and text of utterances. Models like MFN [30] and EF-LSTM [27] work on aligned multimodal data, meaning the frames of the audio and vision streams have explicit correspondence with the words in the text modality. To deal with more practical scenarios, MSA models are gradually expanding to the area of unaligned multimodal data inputs.…”
Section: Multimodal Sentiment Analysis
confidence: 99%
“…This method is used in many models. In [23], the authors perform input-level feature fusion directly on the multimodal data at the input stage and combine it with deep neural networks for sentiment analysis. In [24,25], the authors first encode each modality separately and then use feature-level fusion in the middle layer to obtain a multimodal embedding, which is static feature-level fusion.…”
Section: Multimodal Sentiment Analysis
confidence: 99%
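The excerpt above distinguishes two strategies: fusing raw features at the input stage versus encoding each modality separately and fusing the resulting embeddings in a middle layer. The sketch below illustrates the second, static feature-level variant for contrast with the early-fusion example earlier; the module names, dimensions, and single-layer LSTM encoders are illustrative assumptions, not the models cited in [23-25].

# A minimal sketch of static feature-level (mid-level) fusion, assuming
# utterance-level feature sequences per modality; names and sizes are
# hypothetical and chosen only to contrast with the early-fusion sketch.
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    def __init__(self, audio_dim=74, video_dim=35, text_dim=300,
                 hidden_dim=64, num_classes=6):
        super().__init__()
        # One unimodal encoder per modality.
        self.audio_enc = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.video_enc = nn.LSTM(video_dim, hidden_dim, batch_first=True)
        self.text_enc = nn.LSTM(text_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(3 * hidden_dim, num_classes)

    def forward(self, audio, video, text):
        # Each encoder produces its own embedding before any fusion happens.
        _, (ha, _) = self.audio_enc(audio)
        _, (hv, _) = self.video_enc(video)
        _, (ht, _) = self.text_enc(text)
        # Static feature-level fusion: concatenate the unimodal summaries.
        fused = torch.cat([ha[-1], hv[-1], ht[-1]], dim=-1)  # (B, 3*hidden_dim)
        return self.classifier(fused)

Keeping separate encoders lets each modality use a representation suited to its own sampling rate and feature space, at the cost of delaying cross-modal interaction until after the unimodal summaries are computed.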
“…It integrates the contextual information encoded in the word embedding space with the extracted glyph representation of sequential handwritten characters, covering two modalities of features. Inspired by previous multi-modal fusion methods such as Early Fusion LSTM [35], Tensor Fusion Network [40], Memory Fusion Network [41], and Low-rank Multimodal Fusion [21], we design a multi-modal fusion sub-layer in each block of DSTFN to handle the above features.…”
Section: Related Work
confidence: 99%