Interspeech 2017
DOI: 10.21437/interspeech.2017-736

Towards Speech Emotion Recognition “in the Wild” Using Aggregated Corpora and Deep Multi-Task Learning

Abstract: One of the challenges in Speech Emotion Recognition (SER) "in the wild" is the large mismatch between training and test data (e.g. speakers and tasks). In order to improve the generalisation capabilities of the emotion models, we propose to use Multi-Task Learning (MTL) and use gender and naturalness as auxiliary tasks in deep neural networks. This method was evaluated in within-corpus and various cross-corpus classification experiments that simulate conditions "in the wild". In comparison to Single-Task Learn…
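The multi-task setup the abstract describes (a shared network with an emotion head plus gender and naturalness auxiliary heads) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's actual architecture: the feature dimension, hidden size, class counts, and the auxiliary-task weight are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 88 acoustic features, one shared hidden
# layer, and three task-specific output heads.
n_feat, n_hidden = 88, 32
W_shared = rng.normal(0, 0.1, (n_feat, n_hidden))
heads = {
    "emotion":     rng.normal(0, 0.1, (n_hidden, 4)),  # main task (4 classes, illustrative)
    "gender":      rng.normal(0, 0.1, (n_hidden, 2)),  # auxiliary task
    "naturalness": rng.normal(0, 0.1, (n_hidden, 2)),  # auxiliary task: acted vs. natural
}

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    # One shared representation feeds every task head.
    h = np.tanh(x @ W_shared)
    return {task: softmax(h @ W) for task, W in heads.items()}

def mtl_loss(probs, targets, aux_weight=0.1):
    # Cross-entropy on the main task plus down-weighted auxiliary losses.
    ce = lambda p, y: -np.log(p[np.arange(len(y)), y]).mean()
    loss = ce(probs["emotion"], targets["emotion"])
    for task in ("gender", "naturalness"):
        loss += aux_weight * ce(probs[task], targets[task])
    return loss

x = rng.normal(size=(16, n_feat))
targets = {"emotion": rng.integers(0, 4, 16),
           "gender": rng.integers(0, 2, 16),
           "naturalness": rng.integers(0, 2, 16)}
loss = mtl_loss(forward(x), targets)
print(f"MTL loss: {loss:.3f}")
```

Because the auxiliary gradients flow through the same shared weights, the representation is pushed to encode speaker and elicitation attributes alongside emotion, which is the intended regularisation effect.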

Cited by 66 publications (48 citation statements). References 25 publications.
“…Emotion is only one of several factors that impacts the acoustics of speech. Some factors that change across datasets and can impact affect recognition include the environmental noise [4], the spoken language [3], the recording device quality [5], and the elicitation strategy (acted versus natural) [6]. Additionally, a mismatch in subject demographics between datasets can result in misclassification, due to the small numbers of participants common in speech emotion recognition datasets [2].…”
Section: Introduction
confidence: 99%
“…Most modern techniques of cross-corpus speech emotion recognition use deep learning to build representations over low-level acoustic features. Many of these techniques incorporate tasks in addition to emotion in order to learn more robust representations [6], [10].…”
Section: Introduction
confidence: 99%
“…In the first experiment, we discuss the influence of different feature selection and prediction manners. Since multi-task learning has demonstrated its performance in [6,20,21], we treat multi-task learning as the comparison approach in the second experiment. In the last experiment, we show the advantages of adding the contrastive loss during the training phase.…”
Section: Evaluation Results
confidence: 99%
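The contrastive loss mentioned in the excerpt above can be illustrated with the classic pairwise formulation; this is a generic NumPy sketch, not necessarily the exact loss used in that work, and the margin value is an illustrative assumption.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_label, margin=1.0):
    """Pairwise contrastive loss: pull same-class embedding pairs
    together, push different-class pairs at least `margin` apart."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)          # pairwise distances
    pos = same_label * d**2                            # attract same-class pairs
    neg = (1 - same_label) * np.maximum(margin - d, 0.0)**2  # repel different-class pairs
    return 0.5 * (pos + neg).mean()

rng = np.random.default_rng(1)
a = rng.normal(size=(8, 16))
b = a + 0.01 * rng.normal(size=(8, 16))  # near-duplicate embeddings
print(contrastive_loss(a, b, np.ones(8)))   # small: same-class pairs are close
```

During training, such a term encourages utterances with the same emotion label to cluster in the embedding space regardless of corpus, which complements the standard classification loss.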
“…DAE and ladder structures pay less attention to modeling long-term dynamic dependencies. However, temporal information is important for speech emotion recognition [17,18]. Therefore, our self-attention based FOP model, which can capture long-term dynamic dependencies, is superior to other currently advanced unsupervised learning strategies.…”
Section: Comparison To Other Advanced Approaches
confidence: 99%
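The excerpt above argues that self-attention captures long-term dynamic dependencies. A minimal single-head, unparameterised scaled dot-product self-attention sketch makes the point: every frame attends to every other frame, so its output can depend on acoustic context arbitrarily far away. Real models such as the cited FOP architecture add learned projections and multiple heads; this stripped-down version is for illustration only.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a (T, d) frame sequence,
    without learned query/key/value projections."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                       # (T, T) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # rows are attention distributions
    return weights @ X, weights                         # context vectors, attention map

X = np.random.default_rng(2).normal(size=(50, 8))       # 50 frames, 8-dim features
out, attn = self_attention(X)
```

Note that frame 0 receives a non-zero weight for frame 49, which is exactly the long-range dependency that DAE and ladder structures, operating frame-locally, do not model.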
“…These methods [12,13] take the whole input into account, aiming to learn intermediate feature representations that can reconstruct the input. However, they pay less attention to modeling the long-term dynamic dependency, which is important for speech emotion recognition [17,18].…”
Section: Introduction
confidence: 99%