2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD) 2017
DOI: 10.1109/fskd.2017.8393364
Synchronous prediction of arousal and valence using LSTM network for affective video content analysis

Abstract: The affect embedded in video data conveys high-level semantic information about the content and has a direct impact on viewers' understanding and perception, as well as their emotional responses. Affective Video Content Analysis (AVCA) attempts to generate a direct mapping between video content and the corresponding affective states, such as the arousal and valence dimensions. Most existing studies establish the mapping for each dimension separately using knowledge-based rules or traditional classifiers such …
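The "synchronous prediction" in the title refers to estimating arousal and valence jointly rather than training one model per dimension. A minimal NumPy sketch of the idea is below: a single LSTM cell consumes a sequence of frame-level feature vectors, and one shared two-unit linear head reads the final hidden state to produce both dimensions at once. All dimensions, weights, and the random stand-in features are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; gates are stacked as [input, forget, cell, output]."""
    n = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[:n])          # input gate
    f = sigmoid(z[n:2 * n])     # forget gate
    g = np.tanh(z[2 * n:3 * n]) # candidate cell state
    o = sigmoid(z[3 * n:])      # output gate
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

d_in, d_hid, T = 8, 16, 5  # feature size, hidden size, frame count (all assumed)
W = rng.standard_normal((4 * d_hid, d_in)) * 0.1
U = rng.standard_normal((4 * d_hid, d_hid)) * 0.1
b = np.zeros(4 * d_hid)
W_out = rng.standard_normal((2, d_hid)) * 0.1  # shared head: [arousal, valence]

h = np.zeros(d_hid)
c = np.zeros(d_hid)
for t in range(T):
    x_t = rng.standard_normal(d_in)  # stand-in for a per-frame feature vector
    h, c = lstm_step(x_t, h, c, W, U, b)

arousal, valence = W_out @ h  # both dimensions predicted from the same state
print(float(arousal), float(valence))
```

Because the two outputs share the recurrent state, correlations between arousal and valence can be exploited during training, which is the advantage a joint formulation has over two independent per-dimension regressors.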

Cited by 13 publications (12 citation statements); references 17 publications.
“…Note that our proposed method provides better performance than the methods of Baveye et al [20] and Gan et al [21], even though we use only visual information. Also, our method shows performance comparable to or better than the method of Zhang and Zhang [19], which uses carefully designed and selected handcrafted features. The results show that the proposed model is promising for the emotional evaluation of video clips.…”
Section: Results (mentioning)
Confidence: 79%
“…In a recent study, LSTM was used to estimate the emotion of a video clip in the same Thayer's emotion space. However, hand-crafted audio and video features were extracted, and only the selected features were fed to the LSTM to estimate the degrees of arousal and valence [19]. The role of the LSTM in that work is similar to ours, in that it characterizes the long-term dynamic behavior of video clips.…”
Section: LSTM With MLP-Type Regression Network (mentioning)
Confidence: 99%
“…The study [8] identifies 27 distinct possible categories of human emotion, but in the case of music video it is convenient to organize them into coarse semantic groups so that an end-user can easily retrieve the required music video from large video banks or online music video stores. Following [41,52,67], we categorize the adjectives of music video emotion classification into six basic emotion categories: Exciting, Fear, Neutral, Relaxation, Sad, and Tension. Three samples from each emotion class are shown (from left to right) in Fig.…”
Section: Music Video Emotion Dataset (mentioning)
Confidence: 99%
“…Images usually contain textual descriptions such as street names, road signs, building numbers and product descriptions, which often provide key clues for information perception. Thus, scene text understanding in natural images is extremely useful in fields such as direct perception for autonomous driving [7], image captioning for image retrieval [8, 9], text recognition for automatic translation [10, 11], and text location and recognition for video content analysis [12, 13], etc.…”
Section: Introduction (mentioning)
Confidence: 99%