2019
DOI: 10.48550/arxiv.1906.05682
Preprint

Focal Loss based Residual Convolutional Neural Network for Speech Emotion Recognition

Cited by 4 publications (4 citation statements)
References 0 publications
“…For example, ResNet is shown to have better performance than VGGNet and GoogLeNet using a residual learning framework to ease the training of networks that are substantially deeper than those used previously [37]. Besides, ResNet has been successfully applied to audio emotion recognition [72][73][74][75] and visual emotion recognition [76][77][78][79][80]. These factors inspire us to use ResNet as the backbone of our whole network for audio-visual emotion recognition.…”
Section: Feature Extraction With Deep Learning
confidence: 99%
“…Several methods and architectures have been tried: Niu et al [43] extracted features from spectrograms via an AlexNet-based model and passed them to a DPARIP algorithm (a data augmentation technique based on the principle of retinal imaging and convex lens imaging), on six classes from IEMOCAP; Li et al [34] used a CNN-SSAE on Mel and IMel spectrograms from three separate datasets; Tripathi et al [65] implemented a ResNet supervised by Focal Loss to address the class imbalance in IEMOCAP; Wang et al [69] combined a CNN-BiLSTM model with multiple stacked Transformers, creating well-defined feature clusters in the latent space; Sultana et al [59] validated a series of CNN/LSTM-based architectures through multilingual experiments conducted on IEMOCAP and SUBESCO [60] (English and Bangla, respectively); Su et al [58] applied a Graph Attentive GRU to 78-dimensional acoustic descriptors representing four classes from IEMOCAP and MSP-IMPROV [15]; Latif et al [32] proposed a hybrid architecture composed of Dense blocks and LSTM on spectrograms, combining two speech datasets with real environmental noises from DEMAND [62] in order to improve noise robustness; Wu et al [72] utilized a Capsule Network along with recurrent connections, also on IEMOCAP; Sahu et al [52] passed a complex space of 1582 features extracted with the OpenSmile toolkit [21] into an Adversarial AutoEncoder; Mohan et al [40] recently achieved remarkable results with a decision-tree-based ensemble model with a gradient boosting framework (XGBoost) using only MFCCs as input features.…”
Section: Related Work, A. Speech Emotion Recognition
confidence: 99%
“…This loss function has been shown to be preferable to the cross-entropy loss when facing the class imbalance problem. Because of its effectiveness, it has been successfully applied in many applications, e.g., medical diagnosis (Al Rahhal et al., 2019; Shu et al., 2019; Ulloa et al., 2020; Xu et al., 2020), speech processing (Tripathi et al., 2019), and natural language processing (Shi et al., 2018). Although the focal loss has been successfully applied in many real-world problems (Al Rahhal et al., 2019; Chang et al., 2018; Lotfy et al., 2019; Romdhane and Pr, 2020; Shu et al., 2019; Sun et al., 2019; Ulloa et al., 2020; Xu et al., 2020), considerably less attention has …”
Section: Introduction
confidence: 99%
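
The focal loss referred to in these citation statements is the formulation of Lin et al. (2017), FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), which down-weights well-classified examples so training concentrates on hard, minority-class samples. Below is a minimal PyTorch sketch of this standard multi-class form; the function name and defaults (gamma = 2, optional per-class alpha weights) are illustrative assumptions, not the exact implementation used in the preprint or in any of the citing works.

```python
# Minimal sketch of the standard multi-class focal loss (Lin et al., 2017).
# Names and defaults here are illustrative assumptions, not the exact
# implementation used in the preprint or the citing papers.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:  (batch, num_classes) raw scores from the network
    targets: (batch,) integer class labels
    gamma:   focusing parameter; gamma = 0 recovers plain cross-entropy
    alpha:   optional (num_classes,) tensor of per-class weights
    """
    log_probs = F.log_softmax(logits, dim=-1)                      # log p for all classes
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t of the true class
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt                         # down-weight easy examples
    if alpha is not None:
        loss = alpha[targets] * loss                               # extra class re-weighting
    return loss.mean()

# Tiny usage example on random data (4 hypothetical emotion classes).
logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
print(focal_loss(logits, targets, gamma=2.0))
```

With gamma = 0 and no alpha weighting the expression reduces to ordinary cross-entropy, which is the comparison point made in the statement quoted above.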