2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2011.5946961
A supervised approach to movie emotion tracking

Cited by 81 publications (67 citation statements)
References 11 publications
“…Cui et al. [5] address affective content analysis of music videos, where they employ audio-visual features for the construction of arousal and valence models. Intended emotion tracking of movies is a subject addressed by Malandrakis et al. [13], where audio-visual features are extracted for the affective representation of movies. In [19], a combined analysis of low-level audio and visual representations based on early feature fusion is presented for facial emotion recognition in videos.…”
Section: Related Work
confidence: 99%
“…Therefore, one key issue in designing video affective content analysis algorithms is the representation of video content as in any pattern recognition task. The common approach for video content representation is either to use low-level audio-visual features or to build hand-crafted higher level representations based on the low-level ones (e.g., [5,8,13,21]). Low-level features have the disadvantage of losing global relations or structure in data, whereas creating hand-crafted higher level representations is time consuming, problem-dependent, and requires domain knowledge.…”
Section: Introduction
confidence: 99%
“…There has been little prior work toward emotion recognition using both audio and visual cues in multimedia content [16,18,19]. The authors in [16] performed continuous-scale emotion tracking in movies, fusing features from audio, music, and video modalities.…”
Section: Related Work
confidence: 99%
“…(c) Affective information: both intended emotions and experienced emotions have been annotated. More details on the affective annotation and the associated emotion tracking task are provided in [73].…”
Section: Database
confidence: 99%