Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval
DOI: 10.1145/3206025.3206076
Multimodal Continuous Prediction of Emotions in Movies using Long Short-Term Memory Networks

Cited by 16 publications (24 citation statements)
References 14 publications
“…The general architecture is shown in Figure 1, and the main idea is adopted from Sivaprasad et al. [50]. The visual LSTM network and the acoustic LSTM network each consist of a 32-unit FC layer followed by a 64-unit LSTM layer with ReLU activations; the fusion LSTM network is a 128-unit LSTM layer, also with ReLU. The optimizer used in training is Adam, with a learning rate of 0.001.…”
Section: Methods (mentioning)
confidence: 99%
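The quoted description maps onto a small two-stream network. Below is a minimal Keras sketch of that architecture; the per-frame feature dimensions and the single-value regression head are assumptions (the quote does not specify them), while the 32-unit FC layers, 64-unit modality LSTMs, 128-unit fusion LSTM, ReLU activations, and Adam learning rate of 0.001 follow the quoted text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

# Assumed per-frame feature dimensions; the quote does not specify them.
VIS_DIM, AUD_DIM = 128, 64

vis_in = layers.Input(shape=(None, VIS_DIM), name="visual_features")
aud_in = layers.Input(shape=(None, AUD_DIM), name="acoustic_features")

# Per-modality stream: 32-unit FC layer followed by a 64-unit LSTM (ReLU).
vis = layers.TimeDistributed(layers.Dense(32, activation="relu"))(vis_in)
vis = layers.LSTM(64, activation="relu", return_sequences=True)(vis)
aud = layers.TimeDistributed(layers.Dense(32, activation="relu"))(aud_in)
aud = layers.LSTM(64, activation="relu", return_sequences=True)(aud)

# Fusion stream: a 128-unit LSTM with ReLU over the concatenated streams.
fused = layers.Concatenate()([vis, aud])
fused = layers.LSTM(128, activation="relu", return_sequences=True)(fused)
out = layers.TimeDistributed(layers.Dense(1))(fused)  # assumed regression head

model = models.Model([vis_in, aud_in], out)
model.compile(optimizer=optimizers.Adam(learning_rate=0.001), loss="mse")
```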
“…ZCR [86] is used to separate different types of audio signals, such as music, environmental sound, and human speech. Besides these frequently used features, audio flatness [177], spectral flux [177], delta spectrum magnitude, harmony [86, 111, 177], band energy ratio, spectral centroid [49, 177], and spectral contrast [86] are also utilized.…”
Section: Content-related Features (mentioning)
confidence: 99%
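For concreteness, most of the listed features can be computed with librosa; a hedged sketch, where the input file path is a placeholder and the spectral-flux formulation is one common choice rather than the cited papers' exact definition:

```python
import numpy as np
import librosa

# "clip.wav" is a placeholder input; librosa defaults are used for frame/hop sizes.
y, sr = librosa.load("clip.wav", sr=None)

zcr = librosa.feature.zero_crossing_rate(y)               # ZCR
flatness = librosa.feature.spectral_flatness(y=y)         # audio flatness
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectral centroid
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # spectral contrast

# A simple spectral flux: L2 norm of the frame-to-frame magnitude change.
S = np.abs(librosa.stft(y))
flux = np.sqrt((np.diff(S, axis=1) ** 2).sum(axis=0))
```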
“…The dimensional method has been used in most predictive studies [5-10] because the dimensional scheme constituted by arousal and valence dimensions can effectively represent the emotions elicited by pictures, videos, sounds, etc. [11].…”
Section: Introduction (mentioning)
confidence: 99%
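As a minimal illustration (an assumption for concreteness, not taken from the cited works), a continuous annotation in this dimensional scheme is simply a (valence, arousal) pair per time step, each bounded, e.g., in [-1, 1]:

```python
import numpy as np

T = 120  # e.g., one annotation per second for a two-minute clip
labels = np.zeros((T, 2), dtype=np.float32)  # column 0: valence, column 1: arousal
labels[:, 0] = np.linspace(-0.2, 0.8, T)                     # valence drifting positive
labels[:, 1] = 0.5 * np.sin(np.linspace(0.0, 3 * np.pi, T))  # arousal oscillating
```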
“…Goyal et al. [6] proposed a mixture-of-experts (MoE) based fusion model that dynamically combines information from audio and video modalities to predict the dynamic emotion evoked in movies. Sivaprasad et al. [7] presented a continuous emotion prediction model for movies based on long short-term memory (LSTM) [13] that models contextual information while using handcrafted audio-video features as input. Joshi et al. [8] proposed a method to model the interdependence of arousal and valence using custom joint loss terms to simultaneously train different LSTM models for arousal and valence prediction.…”
Section: Introduction (mentioning)
confidence: 99%
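A hedged sketch of the joint-loss idea attributed to Joshi et al. [8]: separate arousal and valence models are trained simultaneously, with each model's loss coupled to the other dimension's error. The coupling form and the weight `alpha` are illustrative assumptions, not the paper's exact terms.

```python
import tensorflow as tf

def joint_losses(y_true_a, y_pred_a, y_true_v, y_pred_v, alpha=0.3):
    """Return (arousal loss, valence loss), each coupled to the other
    dimension's error; alpha is an assumed coupling weight."""
    mse_a = tf.reduce_mean(tf.square(y_true_a - y_pred_a))
    mse_v = tf.reduce_mean(tf.square(y_true_v - y_pred_v))
    return mse_a + alpha * mse_v, mse_v + alpha * mse_a
```

Each returned loss would drive its own model inside a custom training loop (e.g., with tf.GradientTape), so the two predictors are trained simultaneously while remaining separate networks.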