2021
DOI: 10.1109/lsp.2021.3086395
Cross-Corpus Speech Emotion Recognition Based on Few-Shot Learning and Domain Adaptation

Cited by 30 publications (16 citation statements) · References 35 publications
“…Tab. 1 lists approaches based on a similar concept as that in this paper, together with their performance:

Method                    | Features+Dataset     | Classes | Accuracy
GAN [29]                  | eGeMAPS [27]+EMO-DB  | 2       | 66% (UAR)
FLUDA [30]                | IS10 [31]+IEMOCAP    | 4       | 50% (UA)
VAE+LSTM [13]             | LogMel+IEMOCAP       | 4       | 56.08% (UA)
AE+LSTM [13]              | LogMel+IEMOCAP       | 4       | 55.42% (UA)
Stacked-AE+BLSTM-RNN [12] | COVAREP+IEMOCAP      | 4       | …”
Section: Methods, Features+Dataset
confidence: 93%
“…Method                     | Features+Dataset     | Classes | Accuracy
GAN [18]                     | eGeMAPS [10]+EMO-DB  | 2       | 66% (UAR)
FLUDA [1]                    | IS10 [27]+IEMOCAP(+) | 4       | 50% (UA)
VAE+LSTM [20]                | LogMel+IEMOCAP       | 4       | 56.08% (UA)
AE+LSTM [20]                 | LogMel+IEMOCAP       | 4       | 55.42% (UA)
Stacked-AE+BLSTM-RNN [13]    | COVAREP+IEMOCAP [6]  | 4       | 50.26% (UA)
DAE+Linear-SVM (baseline)    | eGeMAPS+IEMOCAP      | 4       | 52.09% (UA)…”
Section: Methods
confidence: 99%
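The DAE+Linear-SVM baseline row above describes a common two-stage recipe: learn utterance-level bottleneck features with a denoising autoencoder, then train a linear classifier on them. The following is a minimal NumPy sketch of that recipe, not the cited paper's implementation: the data are random stand-ins for eGeMAPS/IEMOCAP features, a softmax-regression head stands in for the linear SVM, and all dimensions and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for eGeMAPS functionals: 200 utterances x 88 features, 4 emotion classes.
X = rng.normal(size=(200, 88))
y = rng.integers(0, 4, size=200)

# --- Stage 1: single-hidden-layer denoising autoencoder, full-batch SGD ---
d_in, d_hid, lr = X.shape[1], 32, 0.05
W1 = rng.normal(scale=0.1, size=(d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.1, size=(d_hid, d_in)); b2 = np.zeros(d_in)

def encode(x):
    return np.tanh(x @ W1 + b1)

losses = []
for _ in range(200):
    noisy = X + rng.normal(scale=0.3, size=X.shape)  # corrupt the input
    h = np.tanh(noisy @ W1 + b1)
    recon = h @ W2 + b2
    err = recon - X                                  # reconstruct the *clean* input
    losses.append((err ** 2).mean())
    # Backprop of the mean-squared-error loss
    gW2 = h.T @ err / len(X); gb2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h ** 2)
    gW1 = noisy.T @ dh / len(X); gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# --- Stage 2: linear classifier on the bottleneck features ---
Z = encode(X)
Wc = np.zeros((d_hid, 4)); bc = np.zeros(4)
for _ in range(300):
    logits = Z @ Wc + bc
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    p[np.arange(len(y)), y] -= 1                     # softmax gradient
    Wc -= 0.1 * Z.T @ p / len(y); bc -= 0.1 * p.mean(0)

acc = (np.argmax(Z @ Wc + bc, 1) == y).mean()
```

With real features, stage 1 would be trained on pooled (possibly unlabeled) data and stage 2 on the labeled source corpus; the accuracies in the table are unweighted averages (UA/UAR) over classes, which this toy accuracy does not compute.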
“…Compared with existing state-of-the-art approaches, our proposed CTA-RNN architecture can significantly improve the performance of SER in both within-corpus and cross-corpus experiments. Different from previous works [18], [26], [27], the unlabeled target datasets were not available in advance for the cross-corpus experiments described in this paper. The excellent robustness of our approach mainly benefits from the large-scale ASR datasets used for pre-training.…”
Section: Fusion of ASR Embeddings
confidence: 99%