2021
DOI: 10.1109/taffc.2019.2928297
EmoBed: Strengthening Monomodal Emotion Recognition via Training with Crossmodal Emotion Embeddings

Abstract: Despite remarkable advances in emotion recognition, systems are severely restrained by either the essentially limited properties of the single modality employed, or the required synchronous presence of all involved modalities. Motivated by this, we propose a novel crossmodal emotion embedding framework called EmoBed, which aims to leverage knowledge from other auxiliary modalities to improve the performance of an emotion recognition system at hand. The framework generally includes two main learning components…

Cited by 43 publications (45 citation statements)
References 55 publications
“…It is to be noted that the dataset provides separate features for arousal and valence. As in [9][16], to compensate for the delay in annotation, we shift the ground-truth labels back in time by 2.4 s. This dataset is ideal for our objective, since the uni-modal performance of audio and video features varies considerably for arousal and valence, as reported in [9] and confirmed by our experiments (see Table 1). As in the AVEC 2016 challenge, we use the Concordance Correlation Coefficient (CCC) (eq.…”
Section: Dataset and Evaluation Measures (mentioning)
confidence: 95%
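The evaluation measure quoted above, the Concordance Correlation Coefficient (CCC), follows the definition used in the AVEC 2016 challenge: CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The following is a minimal NumPy sketch of that standard definition, for illustration only; it is not code from the cited work.

import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    # Concordance Correlation Coefficient as defined for AVEC 2016:
    # CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)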
“…In order to identify the stronger and weaker modalities, we first assess the unimodal performances of audio, video-geometric and video-appearance features for arousal and valence using a regressor similar to [9]. The regressor consists of 4 single time-step GRU-RNN layers, each made up of 120 neurons, followed by a linear layer and trained using the MSE loss.…”
Section: Methods (mentioning)
confidence: 99%
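As a rough illustration of the unimodal regressor described in this statement, the sketch below stacks four GRU layers of 120 units each and adds a linear output layer trained with the MSE loss. It is a hypothetical PyTorch reconstruction; the input dimensionality, the single-target output, and the training details are assumptions rather than specifics taken from the cited paper.

import torch
import torch.nn as nn

class GRURegressor(nn.Module):
    # Hypothetical sketch: 4 stacked GRU layers with 120 units each,
    # followed by a linear layer producing one continuous output
    # (arousal or valence) per time step.
    def __init__(self, input_dim, hidden_dim=120, num_layers=4):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim,
                          num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x: (batch, time, features) -> predictions (batch, time, 1)
        h, _ = self.gru(x)
        return self.out(h)

# Training would minimise the MSE between predictions and the
# (time-shifted) gold-standard annotations, e.g.:
# criterion = nn.MSELoss()
# loss = criterion(model(features), labels)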