2019 International Conference on Multimodal Interaction (ICMI 2019)
DOI: 10.1145/3340555.3355713
Exploring Emotion Features and Fusion Strategies for Audio-Video Emotion Recognition

Abstract: Audio-video emotion recognition aims to classify a given video into basic emotions. In this paper, we describe our approaches in EmotiW 2019, which mainly explore emotion features and feature fusion strategies for the audio and visual modalities. For emotion features, we explore audio features with both speech spectrograms and log-Mel spectrograms and evaluate several facial features with different CNN models and different emotion-pretraining strategies. For fusion strategies, we explore intra-modal and cross…
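
To make the fusion idea concrete, here is a minimal PyTorch sketch of cross-modal fusion by simple feature concatenation. The module names, feature dimensions, and two-layer classifier are illustrative assumptions, not the authors' EmotiW system.

```python
# Minimal sketch of cross-modal (audio + visual) feature fusion, written in
# PyTorch. Dimensions and the concatenation-based fusion head are assumptions.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, audio_dim=512, visual_dim=512, num_emotions=7):
        super().__init__()
        # Maps the concatenated audio + visual features to emotion logits.
        self.fusion = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_emotions),
        )

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (batch, audio_dim), e.g. pooled CNN features of a spectrogram
        # visual_feat: (batch, visual_dim), e.g. pooled CNN features of face frames
        return self.fusion(torch.cat([audio_feat, visual_feat], dim=-1))

# Usage with random stand-in features:
model = LateFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 512))  # shape (4, 7)
```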

Cited by 57 publications (24 citation statements)
References 33 publications
“…These fusion strategies for combining audio and visual modalities emphasize the most important frames that reveal the subject's emotion. For example, Zhou et al. consider CNNs to extract features from the speech spectrogram and several relevant video frames, which are highlighted through various intra-modal fusion strategies (e.g., self-attention, relation-attention, perceptron-attention) [22].…”
Section: Related Work (mentioning)
confidence: 99%
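
As a rough illustration of the self-attention strategy named in this statement, the sketch below scores each per-frame feature with a learned linear layer and pools frames by their softmax weights. The layer sizes are assumptions for illustration, not taken from [22].

```python
# Hedged sketch of self-attention over per-frame features: each frame gets a
# learned scalar score, and the video-level feature is the softmax-weighted
# sum of the frame features.
import torch
import torch.nn as nn

class FrameSelfAttention(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # one attention score per frame

    def forward(self, frames):
        # frames: (batch, num_frames, feat_dim) CNN features of face frames
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, num_frames, 1)
        return (weights * frames).sum(dim=1)                # (batch, feat_dim)

pooled = FrameSelfAttention()(torch.randn(2, 16, 512))  # shape (2, 512)
```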
“…The quality of features extracted from speech data is crucial for the performance of SER. Many existing studies have shown that the performance of the SER task can be improved by converting the speech signal sequence data into log-Mels spectrogram data (Zhang et al., 2017; Chen et al., 2018; Dai et al., 2019; Zhao et al., 2019; Zhou et al., 2019; Chen and Zhao, 2020; Zayene et al., 2020).…”
Section: Log-Mels Spectrogram Feature Calculation (mentioning)
confidence: 99%
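
A minimal sketch of the log-Mel spectrogram computation these studies build on, using librosa; the sampling rate, FFT window, hop length, and mel-band count below are common speech-processing defaults, not parameters taken from any of the cited papers.

```python
# Compute a log-Mel spectrogram from a waveform file with librosa.
import librosa
import numpy as np

def log_mel_spectrogram(wav_path, sr=16000, n_fft=400, hop_length=160, n_mels=64):
    y, sr = librosa.load(wav_path, sr=sr)          # load and resample the audio
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)    # (n_mels, num_frames) in dB
```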
“…Then, using all frames, they trained a 3D CNN on the visual data. More recently, Zhou et al. [18] explored emotion features in multimodal video classification systems by using attention mechanisms to highlight important emotional features.…”
Section: Related Research (mentioning)
confidence: 99%
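
For reference, a toy sketch of a 3D CNN over a stack of video frames follows; the architecture is an assumption for illustration (real systems use far deeper networks such as C3D), not the model from [18].

```python
# Toy 3D CNN that convolves jointly over time and space, then pools
# to a clip-level feature for emotion classification.
import torch
import torch.nn as nn

video_net = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1),  # convolve over (time, H, W)
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),                     # pool to one vector per clip
    nn.Flatten(),
    nn.Linear(16, 7),                            # 7 basic emotion classes
)

clip = torch.randn(2, 3, 16, 112, 112)  # (batch, channels, frames, H, W)
logits = video_net(clip)                # shape (2, 7)
```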