A multitask approach to continuous five-dimensional affect sensing in natural speech

Eyben, Florian; Wöllmer, Martin; Schuller, Björn

doi:10.1145/2133366.2133372

Cited by 40 publications

(24 citation statements)

References 47 publications

(45 reference statements)

“…Results show that arousal is significantly better recognised from the acoustic features than valence. This result is in agreement with the literature, where acoustic features have always been shown to present a stronger correlation with the arousal dimension in comparison with valence [21], [25], [26], [30], [37]. The values of CCC and CC are most of the time almost identical, as the RMSE is quite low; we obtained an average RMSE of 0.068 for arousal and of 0.128 for valence over a range of 2.…”

Section: Training and Optimization of SSRM (supporting)

confidence: 92%
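
The relationship the citing authors describe between CCC, CC, and RMSE can be illustrated numerically. The following is a minimal sketch, not code from either paper: it assumes the standard definitions of Pearson's correlation coefficient (CC) and Lin's concordance correlation coefficient (CCC) computed with population statistics, and the function names and toy values are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _moments(x, y):
    """Return means, population variances, and population covariance."""
    mx, my = _mean(x), _mean(y)
    n = len(x)
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, var_x, var_y, cov


def pearson_cc(x, y):
    """Pearson's CC: insensitive to offset and scale differences."""
    _, _, var_x, var_y, cov = _moments(x, y)
    return cov / math.sqrt(var_x * var_y)


def concordance_cc(x, y):
    """Lin's CCC: also penalises differences in mean and variance."""
    mx, my, var_x, var_y, cov = _moments(x, y)
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


def rmse(x, y):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical gold-standard trace and two predictions (toy values).
gold = [0.1, 0.2, 0.3, 0.4]
close = [0.12, 0.19, 0.31, 0.41]    # small error: CC and CCC nearly agree
biased = [v + 0.5 for v in gold]    # constant offset: CC stays high, CCC drops

for name, pred in (("close", close), ("biased", biased)):
    print(name, pearson_cc(gold, pred), concordance_cc(gold, pred), rmse(gold, pred))
```

CCC can be written as CC multiplied by a bias-correction factor that equals 1 only when the two signals share the same mean and variance, so the two measures diverge when a prediction is offset or rescaled relative to the gold standard.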

“…However, the natural diversity found in emotion perception is usually merged when a machine learning model is trained, by averaging several evaluations from a pool of raters into a single gold standard. Whereas the use of all annotation data can help at preserving diversity in emotion perception, e. g., by using multi-task learning of each annotator [25], [26], it has the main disadvantage to increase the overall complexity of the model according to the number of available raters. The issue of synchronisation of various individual ratings for defining a gold standard has also been investigated with signal processing techniques.…”

Section: Related Work (mentioning)

confidence: 99%

Continuous Estimation of Emotions in Speech by Dynamic Cooperative Speaker Models

Mencattini; Martinelli; Ringeval; et al. (2017)

IEEE Trans. Affective Comput.

Automatic emotion recognition from speech has recently focused on the prediction of time-continuous dimensions (e.g., arousal and valence) of spontaneous and realistic expressions of emotion, as found in real-life interactions. However, the automatic prediction of such emotions poses several challenges, such as the subjectivity found in the definition of a gold standard from a pool of raters and the issue of data scarcity in training models. In this work, we introduce a novel emotion recognition system based on an ensemble of single-speaker-regression-models (SSRMs). The emotion estimate is obtained by combining a subset of the initial pool of SSRMs, selecting those that are most concordant with one another. The proposed approach allows the addition or removal of speakers from the ensemble without the need to rebuild the entire machine learning system. The simplicity of this aggregation strategy, coupled with the flexibility assured by the modular architecture and the promising results obtained on the RECOLA database, highlights the potential of the proposed method in real-life scenarios, in particular in web-based applications.

“…For speech emotion prediction, MTL has been frequently utilised. Eyben et al [17] firstly proposed to jointly train five different emotional dimensions for continuous emotion recognition. The experimental results have clearly indicated that the MTL model remarkably outperforms single-task-based models.…”

Section: Related Work (mentioning)

confidence: 99%

Attention-augmented End-to-end Multi-task Learning for Emotion Prediction from Speech

Zhang; Schuller (2019)

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Despite the increasing research interest in end-to-end learning systems for speech emotion recognition, conventional systems either suffer from overfitting, due in part to limited training data, or do not explicitly consider the different contributions of automatically learnt representations to a specific task. In this contribution, we propose a novel end-to-end framework which is enhanced by learning other auxiliary tasks and by an attention mechanism. That is, we jointly train an end-to-end network with several different but related emotion prediction tasks, i.e., arousal, valence, and dominance prediction, to extract representations shared among the tasks that are more robust than those of traditional systems, in the hope of relieving the overfitting problem. Meanwhile, an attention layer is implemented on top of the layers for each task, with the aim of capturing the contribution distribution of different segment parts for each individual task. To evaluate the effectiveness of the proposed system, we conducted a set of experiments on the widely used IEMOCAP database. The empirical results show that the proposed systems significantly outperform the corresponding baseline systems.

“…It was reported that while ground truth hard labels performed better than soft labels, soft labels had a more similar entropy to human annotators. In [15], the inter-annotator standard deviation was used to model the variability between multiple annotators in a multi-task learning emotion recognition framework.…”

Section: Relation to Prior Work (mentioning)

confidence: 99%

Modeling subjectiveness in emotion recognition with deep neural networks: Ensembles vs soft labels

Fayek; Lech; Cavedon (2016)

2016 International Joint Conference on Neural Networks (IJCNN)

Ground truth labels obtained by averaging or majority voting are commonly used to train automatic emotion classifiers. However, ground truth labels fail to encapsulate interannotator variability and ignore the subjectivity of emotions. In this paper, we propose two viable approaches to model the subjectiveness of emotions by incorporating inter-annotator variability, which are soft labels and model ensembling, where each model represents an annotator. Using a deep neural network that recognizes emotions in real-time from one second windows of speech spectrograms, we demonstrate that both approaches lead to consistent improvement over using ground truth labels. It is empirically shown that the performance gain of the ensemble over the baseline model could be achieved using soft labels generated from multiple annotators.
