2021
DOI: 10.1109/msp.2021.3106890
On the Evolution of Speech Representations for Affective Computing: A brief history and critical overview

Abstract: Recent advances in the field of machine learning have shown great potential for the automatic recognition of apparent human emotions. In the era of the Internet of Things (IoT) and big-data processing, where voice-based systems are well established, opportunities to leverage cutting-edge technologies to develop personalised and human-centered services are genuinely real, with growing demand in many areas such as education, health, well-being and entertainment. Automatic emotion recognition from speech, which is a …


Cited by 14 publications (7 citation statements)
References 28 publications
“…Similarly, in the larger and more complex MSPConv, the E2E-based BBB-KL model achieves the best uncertainty-estimation performance against all other models in comparison, with 0.1181 L_ccc(s) and 0.3571 L_KL. This trend is in line with the literature, which suggests end-to-end learning for uncertainty modeling [14].…”
Section: Ablation Study (supporting)
confidence: 89%
“…Recently, end-to-end architectures have been shown to deliver state-of-the-art emotion predictions [11]–[13] by learning features rather than relying on hand-crafted ones. For modeling subjectivity in emotions, it has been conjectured that end-to-end learning also promotes learning subjectivity-dependent representations [14].…”
Section: Introduction (mentioning)
confidence: 99%
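The L_ccc term quoted above refers to the concordance-correlation-coefficient loss commonly used for dimensional emotion prediction. A minimal NumPy sketch of that loss is below; the function name and variable names are illustrative and not taken from the cited papers:

```python
import numpy as np

def ccc_loss(pred, gold):
    """Concordance correlation coefficient loss, L_ccc = 1 - CCC.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    It rewards both correlation with and calibration to the gold labels.
    """
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mx, my = pred.mean(), gold.mean()
    vx, vy = pred.var(), gold.var()
    cov = ((pred - mx) * (gold - my)).mean()
    ccc = 2.0 * cov / (vx + vy + (mx - my) ** 2)
    return 1.0 - ccc
```

A perfect prediction gives a loss of 0; a constant shift between predictions and labels is penalised through the mean-difference term, which distinguishes CCC from plain Pearson correlation.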
“…(Baltrusaitis et al, 2018). Alternatively, benefiting from the development of deep learning, deep-learned feature representations based on large-scale pre-trained convolutional neural networks (CNNs) such as ResNet (He et al, 2016) and VGGish (Hershey et al, 2017) have also been widely used for emotion recognition (Alisamir and Ringeval, 2021; Li and Deng, 2022). Compared with hand-crafted features, pre-trained CNN encoders can extract more powerful visual/audio features.…”
Section: Related Work (mentioning)
confidence: 99%
“…The authors showed that W2V2 representations allow the use of less complex models compared to MFB features, concluding that W2V2 representations provide contextualised information about speech that is robust across different contexts. Thus, by using self-supervised representations, less labelled data is needed for the downstream task, which is beneficial for SER, as it is highly susceptible to data-scarcity issues [18].…”
Section: Self-Supervised Representations (mentioning)
confidence: 99%
“…Self-Supervised Learning (SSL) methods, such as the contrastive loss objective used to build a W2V2 model [15], do not need any labels to learn contextualised abstractions of speech and can thus benefit from the abundance of unlabelled data. Therefore, by training SSL models on large amounts of data, we can achieve highly contextualised representations of speech that are robust against domain-mismatch issues [16]–[18]. Robustness against unseen data can be further improved by training on data from several different domains [19].…”
Section: Introduction (mentioning)
confidence: 99%
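As a rough illustration of how such frozen SSL representations are typically used downstream, the sketch below mean-pools frame-level features into an utterance-level vector and applies a linear probe. The random features and weights are stand-ins for a real pre-trained encoder (e.g., W2V2) and a trained classifier; the dimensions (768 hidden units, 4 emotion classes) are assumptions, not values from the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen SSL (e.g., wav2vec 2.0) frame-level outputs of shape
# (num_frames, hidden_dim). In practice these would come from a pre-trained
# encoder; random features here only illustrate the pipeline shape.
frames = rng.normal(size=(120, 768))

# Mean-pool frames into a single utterance-level representation,
# then apply a small linear probe (weights also illustrative).
utterance = frames.mean(axis=0)           # shape (768,)
W = rng.normal(size=(768, 4)) * 0.01      # 4 hypothetical emotion classes
logits = utterance @ W
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over classes
pred = int(probs.argmax())
```

Because the encoder stays frozen, only the probe's parameters would be fitted on labelled data, which is why SSL representations help in the low-resource SER settings the quoted statements describe.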