2021
DOI: 10.1109/taffc.2019.2928297
EmoBed: Strengthening Monomodal Emotion Recognition via Training with Crossmodal Emotion Embeddings

Abstract: Despite remarkable advances in emotion recognition, systems are severely restrained by either the essentially limited properties of the single modality employed, or the required synchronous presence of all involved modalities. Motivated by this, we propose a novel crossmodal emotion embedding framework called EmoBed, which aims to leverage knowledge from other auxiliary modalities to improve the performance of an emotion recognition system at hand. The framework generally includes two main learning components…

Cited by 43 publications (45 citation statements)
References 55 publications
“…It is to be noted that the dataset provides separate features for arousal and valence. As in [9][16], to compensate for the delay in annotation, we shift the ground-truth labels back in time by 2.4 s. This dataset is ideal for our objective, since the uni-modal performance of audio and video features varies considerably for arousal and valence, as reported in [9] and confirmed by our experiments (see Table 1). As in the AVEC 2016 challenge, we use the Concordance Correlation Coefficient (CCC) (eq.…”
Section: Dataset and Evaluation Measures (mentioning)
confidence: 95%
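The evaluation measure quoted above, the Concordance Correlation Coefficient (CCC), follows the definition used in the AVEC 2016 challenge: CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The following is a minimal NumPy sketch of that standard definition, for illustration only; it is not code from the cited work.

import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    # Concordance Correlation Coefficient as defined for AVEC 2016:
    # CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)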
“…In order to identify the stronger and weaker modalities, we first assess the unimodal performances of audio, video-geometric and video-appearance features for arousal and valence using a regressor similar to [9]. The regressor consists of 4 single time-step GRU-RNN layers, each made up of 120 neurons, followed by a linear layer and trained using the MSE loss.…”
Section: Methods (mentioning)
confidence: 99%
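As a rough illustration of the unimodal regressor described in this statement, the sketch below stacks four GRU layers of 120 units each and adds a linear output layer trained with the MSE loss. It is a hypothetical PyTorch reconstruction; the input dimensionality, the single-target output, and the training details are assumptions rather than specifics taken from the cited paper.

import torch
import torch.nn as nn

class GRURegressor(nn.Module):
    # Hypothetical sketch: 4 stacked GRU layers with 120 units each,
    # followed by a linear layer producing one continuous output
    # (arousal or valence) per time step.
    def __init__(self, input_dim, hidden_dim=120, num_layers=4):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim,
                          num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x: (batch, time, features) -> predictions (batch, time, 1)
        h, _ = self.gru(x)
        return self.out(h)

# Training would minimise the MSE between predictions and the
# (time-shifted) gold-standard annotations, e.g.:
# criterion = nn.MSELoss()
# loss = criterion(model(features), labels)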