2021
DOI: 10.1109/msp.2021.3106890
On the Evolution of Speech Representations for Affective Computing: A brief history and critical overview

Abstract: Recent advances in the field of machine learning have shown great potential for the automatic recognition of apparent human emotions. In the era of the Internet of Things (IoT) and big-data processing, where voice-based systems are well established, opportunities to leverage cutting-edge technologies to develop personalised and human-centered services are genuinely real, with growing demand in many areas such as education, health, well-being and entertainment. Automatic emotion recognition from speech, which is a …


Cited by 14 publications (7 citation statements)
References 28 publications
“…Similarly, in the larger and more complex MSPConv, the E2E-based BBB-KL model achieves the best uncertainty-estimation performance against all other models in comparison, with 0.1181 L_ccc(s) and 0.3571 L_KL. This trend is in line with the literature, which suggests end-to-end learning for uncertainty modeling [14].…”
Section: Ablation Study (supporting)
confidence: 89%
“…Recently, end-to-end architectures have been shown to deliver state-of-the-art emotion predictions [11]–[13] by learning features rather than relying on hand-crafted ones. For modeling subjectivity in emotions, it has been conjectured that end-to-end learning also promotes learning subjectivity-dependent representations [14].…”
Section: Introduction (mentioning)
confidence: 99%
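The L_ccc term quoted above refers to the concordance-correlation-coefficient loss commonly used for dimensional emotion prediction. A minimal NumPy sketch of that loss is below; the function name and variable names are illustrative and not taken from the cited papers:

```python
import numpy as np

def ccc_loss(pred, gold):
    """Concordance correlation coefficient loss, L_ccc = 1 - CCC.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    It rewards both correlation with and calibration to the gold labels.
    """
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mx, my = pred.mean(), gold.mean()
    vx, vy = pred.var(), gold.var()
    cov = ((pred - mx) * (gold - my)).mean()
    ccc = 2.0 * cov / (vx + vy + (mx - my) ** 2)
    return 1.0 - ccc
```

A perfect prediction gives a loss of 0; a constant shift between predictions and labels is penalised through the mean-difference term, which distinguishes CCC from plain Pearson correlation.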
“…(Baltrusaitis et al, 2018). Alternatively, benefiting from the development of deep learning, deep-learned feature representations based on large-scale pre-trained convolutional neural networks (CNNs) such as ResNet (He et al, 2016) and VGGish (Hershey et al, 2017) have also been widely used for emotion recognition (Alisamir and Ringeval, 2021; Li and Deng, 2022). Compared with hand-crafted features, pre-trained CNN encoders can extract more powerful visual/audio features.…”
Section: Related Work (mentioning)
confidence: 99%
“…The authors showed that W2V2 representations allow the use of less complex models compared to MFB features, concluding that W2V2 representations provide contextualised information about speech that is robust across different contexts. Thus, by using self-supervised representations, less labelled data is needed for the downstream task, which is beneficial for SER, as it is highly susceptible to data-scarcity issues [18].…”
Section: Self-Supervised Representations (mentioning)
confidence: 99%
“…Self-Supervised Learning (SSL) methods, such as the contrastive loss objective used to build a W2V2 model [15], do not need any labels to learn contextualised abstractions of speech and can thus benefit from the abundance of unlabelled data. Therefore, by training SSL models on large amounts of data, we can achieve highly contextualised representations of speech that are robust against domain-mismatch issues [16]–[18]. Robustness against unseen data can be further improved by training on data from several different domains [19].…”
Section: Introduction (mentioning)
confidence: 99%
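As a rough illustration of how such frozen SSL representations are typically used downstream, the sketch below mean-pools frame-level features into an utterance-level vector and applies a linear probe. The random features and weights are stand-ins for a real pre-trained encoder (e.g., W2V2) and a trained classifier; the dimensions (768 hidden units, 4 emotion classes) are assumptions, not values from the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen SSL (e.g., wav2vec 2.0) frame-level outputs of shape
# (num_frames, hidden_dim). In practice these would come from a pre-trained
# encoder; random features here only illustrate the pipeline shape.
frames = rng.normal(size=(120, 768))

# Mean-pool frames into a single utterance-level representation,
# then apply a small linear probe (weights also illustrative).
utterance = frames.mean(axis=0)           # shape (768,)
W = rng.normal(size=(768, 4)) * 0.01      # 4 hypothetical emotion classes
logits = utterance @ W
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over classes
pred = int(probs.argmax())
```

Because the encoder stays frozen, only the probe's parameters would be fitted on labelled data, which is why SSL representations help in the low-resource SER settings the quoted statements describe.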