Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval
DOI: 10.1145/3206025.3206076
Multimodal Continuous Prediction of Emotions in Movies using Long Short-Term Memory Networks

Cited by 16 publications (24 citation statements)
References 14 publications
“…The general architecture is shown in Figure 1, and the main idea is adopted from Sivaprasad et al. [50]. The visual LSTM network and the acoustic LSTM network each consist of a 32-unit FC layer followed by a 64-unit LSTM layer with ReLU activations; the fusion LSTM network is a 128-unit LSTM layer, also with ReLU. The optimizer used in training is Adam, with a learning rate of 0.001.…”
Section: Methods (mentioning)
confidence: 99%
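The quoted description maps onto a small two-stream network. Below is a minimal Keras sketch of that architecture; the per-frame feature dimensions and the single-value regression head are assumptions (the quote does not specify them), while the 32-unit FC layers, 64-unit modality LSTMs, 128-unit fusion LSTM, ReLU activations, and Adam learning rate of 0.001 follow the quoted text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

# Assumed per-frame feature dimensions; the quote does not specify them.
VIS_DIM, AUD_DIM = 128, 64

vis_in = layers.Input(shape=(None, VIS_DIM), name="visual_features")
aud_in = layers.Input(shape=(None, AUD_DIM), name="acoustic_features")

# Per-modality stream: 32-unit FC layer followed by a 64-unit LSTM (ReLU).
vis = layers.TimeDistributed(layers.Dense(32, activation="relu"))(vis_in)
vis = layers.LSTM(64, activation="relu", return_sequences=True)(vis)
aud = layers.TimeDistributed(layers.Dense(32, activation="relu"))(aud_in)
aud = layers.LSTM(64, activation="relu", return_sequences=True)(aud)

# Fusion stream: a 128-unit LSTM with ReLU over the concatenated streams.
fused = layers.Concatenate()([vis, aud])
fused = layers.LSTM(128, activation="relu", return_sequences=True)(fused)
out = layers.TimeDistributed(layers.Dense(1))(fused)  # assumed regression head

model = models.Model([vis_in, aud_in], out)
model.compile(optimizer=optimizers.Adam(learning_rate=0.001), loss="mse")
```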
“…ZCR [86] is used to separate different types of audio signals, such as music, environmental sound, and human speech. Besides these frequently used features, audio flatness [177], spectral flux [177], delta spectrum magnitude, harmony [86, 111, 177], band energy ratio, spectral centroid [49, 177], and spectral contrast [86] are also utilized.…”
Section: Content-related Features (mentioning)
confidence: 99%
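For concreteness, most of the listed features can be computed with librosa; a hedged sketch, where the input file path is a placeholder and the spectral-flux formulation is one common choice rather than the cited papers' exact definition:

```python
import numpy as np
import librosa

# "clip.wav" is a placeholder input; librosa defaults are used for frame/hop sizes.
y, sr = librosa.load("clip.wav", sr=None)

zcr = librosa.feature.zero_crossing_rate(y)               # ZCR
flatness = librosa.feature.spectral_flatness(y=y)         # audio flatness
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectral centroid
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # spectral contrast

# A simple spectral flux: L2 norm of the frame-to-frame magnitude change.
S = np.abs(librosa.stft(y))
flux = np.sqrt((np.diff(S, axis=1) ** 2).sum(axis=0))
```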
“…The dimensional method has been used in most predictive studies [5-10] because the dimensional scheme constituted by arousal and valence dimensions can effectively represent the emotions elicited by pictures, videos, sounds, etc. [11].…”
Section: Introduction (mentioning)
confidence: 99%
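As a minimal illustration (an assumption for concreteness, not taken from the cited works), a continuous annotation in this dimensional scheme is simply a (valence, arousal) pair per time step, each bounded, e.g., in [-1, 1]:

```python
import numpy as np

T = 120  # e.g., one annotation per second for a two-minute clip
labels = np.zeros((T, 2), dtype=np.float32)  # column 0: valence, column 1: arousal
labels[:, 0] = np.linspace(-0.2, 0.8, T)                     # valence drifting positive
labels[:, 1] = 0.5 * np.sin(np.linspace(0.0, 3 * np.pi, T))  # arousal oscillating
```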
“…Goyal et al. [6] proposed a mixture-of-experts (MoE) based fusion model that dynamically combines information from audio and video modalities to predict the dynamic emotion evoked in movies. Sivaprasad et al. [7] presented a continuous emotion prediction model for movies based on long short-term memory (LSTM) [13] that models contextual information while using handcrafted audio-video features as input. Joshi et al. [8] proposed a method to model the interdependence of arousal and valence using custom joint loss terms to simultaneously train different LSTM models for arousal and valence prediction.…”
Section: Introduction (mentioning)
confidence: 99%
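A hedged sketch of the joint-loss idea attributed to Joshi et al. [8]: separate arousal and valence models are trained simultaneously, with each model's loss coupled to the other dimension's error. The coupling form and the weight `alpha` are illustrative assumptions, not the paper's exact terms.

```python
import tensorflow as tf

def joint_losses(y_true_a, y_pred_a, y_true_v, y_pred_v, alpha=0.3):
    """Return (arousal loss, valence loss), each coupled to the other
    dimension's error; alpha is an assumed coupling weight."""
    mse_a = tf.reduce_mean(tf.square(y_true_a - y_pred_a))
    mse_v = tf.reduce_mean(tf.square(y_true_v - y_pred_v))
    return mse_a + alpha * mse_v, mse_v + alpha * mse_a
```

Each returned loss would drive its own model inside a custom training loop (e.g., with tf.GradientTape), so the two predictors are trained simultaneously while remaining separate networks.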