2021
DOI: 10.3390/s21248356

AttendAffectNet–Emotion Prediction of Movie Viewers Using Multimodal Fusion with Self-Attention

Abstract: In this paper, we tackle the problem of predicting the affective responses of movie viewers, based on the content of the movies. Current studies on this topic focus on video representation learning and fusion techniques to combine the extracted features for predicting affect. Yet, these typically ignore the correlation between multiple modality inputs as well as the correlation between temporal inputs (i.e., sequential features). To explore these correlations, a neural network architecture, namely AttendAffectNet…
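
The truncated abstract points to self-attention over features from several modalities as the core of the approach. As a rough illustration of that idea, here is a minimal sketch of attention-based multimodal fusion, assuming PyTorch; the module name, dimensions, and the two-output (valence/arousal) head are assumptions made for the sketch, not the authors' exact AttendAffectNet architecture.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative self-attention fusion over per-modality feature vectors.

    Each modality (e.g., video, audio) is projected to a common dimension and
    treated as one "token" that a self-attention layer can mix with the others,
    capturing cross-modal correlations. Hypothetical sketch; the dimensions and
    number of heads are arbitrary choices.
    """

    def __init__(self, modality_dims, d_model=256, n_heads=4):
        super().__init__()
        self.projections = nn.ModuleList(
            [nn.Linear(d, d_model) for d in modality_dims]
        )
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.regressor = nn.Linear(d_model, 2)  # e.g., valence and arousal

    def forward(self, features):
        # features: list of tensors, one per modality, each of shape (batch, dim_m)
        tokens = torch.stack(
            [proj(f) for proj, f in zip(self.projections, features)], dim=1
        )  # (batch, n_modalities, d_model)
        fused, _ = self.attention(tokens, tokens, tokens)
        return self.regressor(fused.mean(dim=1))  # (batch, 2)

# Example with dummy video (512-d) and audio (128-d) features
model = AttentionFusion(modality_dims=[512, 128])
out = model([torch.randn(8, 512), torch.randn(8, 128)])
print(out.shape)  # torch.Size([8, 2])
```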

Cited by 13 publications (10 citation statements)
References 104 publications

“…The main goal of multimodal fusion is to reduce the heterogeneous differences among modalities [17], keep the integrity of the specific semantics of each modality, and achieve the best performance in deep learning models. It is divided into three types: joint architecture, cooperative architecture, and codec architecture.…”
Section: Fig 4 Heterogeneous Integration Of Multimedia Information In...
Mentioning (confidence: 99%)
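
The statement above distinguishes joint, cooperative, and codec fusion architectures. A minimal sketch of the first category, a joint architecture that projects each modality into a shared space and concatenates the results, is given below; it assumes PyTorch, and the dimensions and output head are placeholders rather than code from any of the cited works.

```python
import torch
import torch.nn as nn

class JointFusion(nn.Module):
    """Joint (feature-level) fusion: project each modality into a shared space
    and concatenate into a single representation. Hypothetical sketch of the
    "joint architecture" category only."""

    def __init__(self, modality_dims, d_shared=128):
        super().__init__()
        self.projections = nn.ModuleList(
            [nn.Linear(d, d_shared) for d in modality_dims]
        )
        self.head = nn.Sequential(
            nn.Linear(d_shared * len(modality_dims), 64),
            nn.ReLU(),
            nn.Linear(64, 2),  # e.g., valence and arousal
        )

    def forward(self, features):
        shared = [proj(f) for proj, f in zip(self.projections, features)]
        return self.head(torch.cat(shared, dim=-1))

# Dummy video (512-d), audio (128-d), and text (300-d) features
model = JointFusion([512, 128, 300])
print(model([torch.randn(4, 512), torch.randn(4, 128), torch.randn(4, 300)]).shape)
```
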
“…Reviews have summarized extracted features relevant to affect detection in the audio modality, such as intensity (loudness, energy), timbre (MFCC) and rhythm (tempo, regularity) features [31], and in the video modality, such as colour, lighting key, motion intensity and shot length [35], [36]. Features that can capture complex latent dimensions in the data, such as the audio embeddings generated by the VGGish model [37], [38], [39], are also becoming increasingly popular. Features may be provided in the dataset or extracted from the source data if it is available.…”
Section: Datasets For Affective Multimedia Content Analysis
Mentioning (confidence: 99%)
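
To make the hand-crafted descriptors named above concrete (loudness/energy, MFCC timbre, tempo), the following sketch extracts comparable audio features with librosa; the file path and parameter choices are placeholders, and the cited reviews do not prescribe this particular toolchain.

```python
import librosa
import numpy as np

# Placeholder path; any mono audio clip works
y, sr = librosa.load("movie_clip.wav", sr=22050, mono=True)

# Timbre: 13 Mel-frequency cepstral coefficients, averaged over time
mfcc_mean = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

# Intensity: root-mean-square energy statistics over frames
rms = librosa.feature.rms(y=y)
loudness_mean, loudness_std = float(rms.mean()), float(rms.std())

# Rhythm: global tempo estimate in beats per minute
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
tempo = float(np.atleast_1d(tempo)[0])  # scalar in older librosa, array in 0.10+

feature_vector = np.concatenate([mfcc_mean, [loudness_mean, loudness_std, tempo]])
print(feature_vector.shape)  # (16,)
```
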
“…Audio feature extraction was performed with openSMILE [46], a popular open-source library for audio feature extraction. Specifically, we used the "emobase" configuration file to extract a set of 988 low-level descriptors (LLDs), including MFCC, pitch, spectral, zero-crossing-rate, loudness and intensity statistics, many of which have been shown to be effective for identifying emotion in music [38], [39], [47], [48]. Many other configurations are available in openSMILE, but we provide the "emobase" set of acoustic features since it is well documented and was designed for emotion recognition applications [49].…”
Section: Feature Extraction
Mentioning (confidence: 99%)
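
For reference, the emobase feature set mentioned in the quote can be extracted with the opensmile Python wrapper as sketched below; this is an assumption about tooling, since the quoted work may instead have run the standalone SMILExtract binary with the emobase configuration file, and the audio path is a placeholder.

```python
import opensmile

# emobase at the Functionals level yields the 988 statistics referred to above
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)

features = smile.process_file("movie_clip.wav")  # placeholder path
print(features.shape)  # (1, 988): one row of functionals for the whole file
```
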
“…Alternatively, in an effort to collect larger quantities of affect labels in a shorter amount of time, although with a potential loss in accuracy, crowd-sourcing on platforms such as Amazon Mechanical Turk (MTurk) has also been explored [3, 53, 54, 55, 56]. Some researchers utilize a mix of both online and offline collection methods [57, 58], or even use predictive models such as AttendAffectNet [59] for emotion labeling [60]. Regardless of the data collection method, it is important for each musical excerpt in the dataset to be labelled by multiple participants in order to account for subjectivity.…”
Section: Data Gathering Procedures
Mentioning (confidence: 99%)
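
On the closing point that each excerpt should be rated by multiple participants, a common way to turn repeated ratings into labels is to average them per excerpt and keep the spread as a rough measure of disagreement. The pandas sketch below does this on a made-up annotation table; the column names and values are purely illustrative.

```python
import pandas as pd

# Hypothetical annotation table: one row per (participant, excerpt) rating
ratings = pd.DataFrame({
    "excerpt_id": [1, 1, 1, 2, 2, 2],
    "participant": ["p1", "p2", "p3", "p1", "p2", "p3"],
    "valence":     [0.6, 0.4, 0.5, -0.2, -0.4, -0.1],
    "arousal":     [0.7, 0.8, 0.6,  0.3,  0.2,  0.4],
})

# Aggregate across participants: the mean becomes the label, and the standard
# deviation gives a per-excerpt indication of annotator disagreement
labels = ratings.groupby("excerpt_id")[["valence", "arousal"]].agg(["mean", "std"])
print(labels)
```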