2023
DOI: 10.1109/taffc.2021.3114365
Survey of Deep Representation Learning for Speech Emotion Recognition

Abstract: Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features using feature engineering. However, the design of handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learni…

Cited by 51 publications (28 citation statements)
References 188 publications
“…This allows CNNs to develop a deeper understanding of the provided input compared to typical multilayer perceptron models. CNNs have revolutionized the field of computer vision, where they have been used for a variety of tasks such as classification, object detection, segmentation, and object counting [43, 44], and they have also been used successfully for applications in the speech and other time-series signal domains [45, 46, 47]. In this paper, rather than using handcrafted features, a CNN has been used to perform feature extraction in order to take advantage of the spatial and temporal dependency-capturing capabilities of CNNs.…”
Section: Methods
confidence: 99%
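The statement above describes a CNN sliding learned filters over a raw signal instead of computing handcrafted acoustic features. A minimal NumPy sketch of that idea follows; it is not the cited authors' implementation, and the random kernels merely stand in for filters that would be learned by training.

```python
import numpy as np

def conv1d_features(signal, kernels, stride=1):
    """Slide each kernel over the signal and apply ReLU,
    yielding one learned feature channel per kernel."""
    k = kernels.shape[1]
    n_out = (len(signal) - k) // stride + 1
    # Stack every window of length k (im2col-style), then project
    # all windows onto all kernels with one matrix product.
    windows = np.stack([signal[i * stride : i * stride + k] for i in range(n_out)])
    feats = windows @ kernels.T          # shape: (n_out, n_kernels)
    return np.maximum(feats, 0.0)        # ReLU non-linearity

# Toy "waveform" and two random kernels standing in for learned filters.
rng = np.random.default_rng(0)
wave = np.sin(np.linspace(0, 8 * np.pi, 200))
kernels = rng.standard_normal((2, 16))
feats = conv1d_features(wave, kernels, stride=4)
print(feats.shape)   # (47, 2): 47 time steps, 2 feature channels
```

Because the same small kernels are reused at every position, the output preserves local spatial/temporal structure, which is the dependency-capturing property the citation refers to.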
“…In this work, we implement convolutional neural network (CNN)-BLSTM-based classifiers due to their popularity in SER research [37]. It has been found that the performance of a BLSTM can be improved by feeding it a good emotional representation [38].…”
Section: Speech Emotion Classifier
confidence: 99%
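The CNN-BLSTM pipeline mentioned above can be sketched end to end: a convolutional front end extracts features, a bidirectional recurrence reads them in both directions, and a pooled hidden state is projected to emotion logits. This is a simplified NumPy sketch, not the cited system: a tanh RNN cell stands in for the LSTM cell for brevity, and all weights are random stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv_block(x, kernels):
    """1-D valid convolution + ReLU: the CNN front end that
    replaces handcrafted acoustic features."""
    k = kernels.shape[1]
    windows = np.stack([x[i:i + k] for i in range(len(x) - k + 1)])
    return np.maximum(windows @ kernels.T, 0.0)        # (T', channels)

def rnn_pass(seq, Wx, Wh):
    """One recurrent direction (tanh cell stands in for an LSTM cell)."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x_t in seq:
        h = np.tanh(Wx @ x_t + Wh @ h)
        states.append(h)
    return np.stack(states)

def cnn_blstm_logits(signal, kernels, Wx_f, Wh_f, Wx_b, Wh_b, W_out):
    feats = conv_block(signal, kernels)
    fwd = rnn_pass(feats, Wx_f, Wh_f)                  # left-to-right
    bwd = rnn_pass(feats[::-1], Wx_b, Wh_b)[::-1]      # right-to-left
    h = np.concatenate([fwd, bwd], axis=1).mean(axis=0)  # pool over time
    return W_out @ h                                   # one logit per emotion

# Toy signal, 3 conv kernels, hidden size 4, 5 hypothetical emotion classes.
wave = np.sin(np.linspace(0, 4 * np.pi, 120))
kernels = rng.standard_normal((3, 8))
H, C = 4, 5
Wx_f, Wh_f = rng.standard_normal((H, 3)), rng.standard_normal((H, H))
Wx_b, Wh_b = rng.standard_normal((H, 3)), rng.standard_normal((H, H))
W_out = rng.standard_normal((C, 2 * H))
logits = cnn_blstm_logits(wave, kernels, Wx_f, Wh_f, Wx_b, Wh_b, W_out)
print(logits.shape)   # (5,)
```

The bidirectional pass is the point of the design: each time step's state summarises both past and future context, which is why a richer input representation (the CNN features) improves the BLSTM, as the citation notes.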
“…They are commonly divided into behavioral and physiological modalities. Behavioral modalities include emotion recognition from facial expressions [9, 10, 11, 12, 13], from gestures [14, 15, 16], and from speech [17, 18, 19], while physiological modalities include emotion recognition from physiological signals such as electroencephalogram (EEG), electrocardiogram (ECG), galvanic skin response (GSR), electrodermal activity (EDA), and so on [20, 21, 22, 23, 24].…”
Section: Introduction
confidence: 99%