Interspeech 2017
DOI: 10.21437/interspeech.2017-1494

Jointly Predicting Arousal, Valence and Dominance with Multi-Task Learning

Abstract: An appealing representation of emotions is the use of emotional attributes such as arousal (passive versus active), valence (negative versus positive) and dominance (weak versus strong). While previous studies have considered these dimensions as orthogonal descriptors to represent emotions, there is strong theoretical and practical evidence of the interrelation between these emotional attributes. This observation suggests that predicting emotional attributes with a unified framework should outperform ma…


Cited by 112 publications (86 citation statements); references 27 publications.
“…Most modern techniques of cross-corpus speech emotion recognition use deep learning to build representations over low-level acoustic features. Many of these techniques incorporate tasks in addition to emotion in order to learn more robust representations [6], [10].…”
Section: Introduction
confidence: 99%
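The auxiliary-task idea in the quote can be illustrated with a toy shared-encoder, multi-head forward pass: one shared layer builds a representation from low-level acoustic features, and separate heads predict the emotion attributes and an auxiliary task. All shapes, names, and the auxiliary task itself are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-task forward pass (illustrative shapes, random weights).
features = rng.normal(size=(8, 40))          # 8 utterances, 40-dim acoustic features
W_shared = rng.normal(size=(40, 16)) * 0.1   # shared representation layer
W_emotion = rng.normal(size=(16, 3)) * 0.1   # head 1: arousal/valence/dominance
W_aux = rng.normal(size=(16, 1)) * 0.1       # head 2: hypothetical auxiliary task

h = np.tanh(features @ W_shared)             # shared representation
emotion_pred = h @ W_emotion                 # shape (8, 3)
aux_pred = h @ W_aux                         # shape (8, 1)
print(emotion_pred.shape, aux_pred.shape)
```

Because both heads backpropagate through `W_shared` during training, the shared representation is pushed to encode information useful for more than one task, which is the robustness argument made in the quote.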
“…0.7, 0.2, and 1.0 for α, β, and γ, respectively. Our proposed MTL with three parameters outperforms STL and the previous MTL approach [4]. For the STL approaches, both arousal and valence obtained the highest CCC score when the corresponding attribute was optimized.…”
Section: Multitask Learning Results
confidence: 78%
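The CCC score mentioned in the quote is the concordance correlation coefficient, a standard agreement metric for dimensional emotion prediction. A minimal numpy sketch (the toy data are illustrative):

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient between predictions x and labels y."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()        # population covariance
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

preds = np.array([0.1, 0.4, 0.35, 0.8])
labels = np.array([0.0, 0.5, 0.30, 0.9])
print(round(ccc(preds, labels), 3))           # → 0.952
```

Unlike Pearson correlation, CCC also penalizes differences in mean and scale between predictions and labels, so it rewards calibrated predictions rather than merely correlated ones.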
“…where α, β, and γ are the weighting factors for each emotion dimension's loss function. In the common approach, α, β, and γ are all set to 1, while in [4], γ is set to 1 − (α + β) to minimize the MSE. In that approach, all weighting factors lie in the range 0-1.…”
Section: Multitask Learning Based on CCC Loss
confidence: 99%
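The weighted multi-task CCC loss described in the quote can be sketched as follows; the per-attribute dictionary layout and function names are illustrative assumptions, and the default weights mirror the 0.7/0.2/1.0 values reported in the earlier excerpt.

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient between predictions x and labels y."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

def mtl_ccc_loss(preds, labels, alpha=0.7, beta=0.2, gamma=1.0):
    """Weighted sum of per-attribute CCC losses: L = α(1-CCC_aro) + β(1-CCC_val) + γ(1-CCC_dom)."""
    return (alpha * (1 - ccc(preds['arousal'], labels['arousal']))
            + beta * (1 - ccc(preds['valence'], labels['valence']))
            + gamma * (1 - ccc(preds['dominance'], labels['dominance'])))

# Perfect predictions give CCC = 1 for every attribute, so the loss is 0.
y = {k: np.array([0.1, 0.5, 0.9]) for k in ('arousal', 'valence', 'dominance')}
print(mtl_ccc_loss(y, y))
```

Minimizing `1 - CCC` per attribute optimizes the evaluation metric directly, and the weights let training trade off the three attributes, e.g. emphasizing dominance with γ = 1.0 as in the quoted configuration.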
“…In the first experiment, we discuss the influence of different feature selection and prediction strategies. Since multi-task learning has demonstrated its effectiveness in [6,20,21], we treat multi-task learning as the comparison approach in the second experiment. In the last experiment, we show the advantages of adding a contrastive loss during the training phase.…”
Section: Evaluation Results
confidence: 99%