Focal Loss based Residual Convolutional Neural Network for Speech Emotion Recognition

Tripathi, Suraj; Kumar, Abhay; Ramesh, A.; Singh, Chirag; Yenigalla, Promod

doi:10.48550/arxiv.1906.05682

Cited by 4 publications

(4 citation statements)

References 0 publications

“…For example, ResNet is shown to have better performance than VGGNet and GoogLeNet using a residual learning framework to ease the training of networks that are substantially deeper than those used previously [37]. Besides, ResNet has been successfully applied to audio emotion recognition [72][73][74][75] and visual emotion recognition [76][77][78][79][80]. These factors inspire us to use ResNet as the backbone of our whole network for audio-visual emotion recognition.…”

Section: Feature Extraction With Deep Learning (mentioning)

confidence: 99%

Learning Better Representations for Audio-Visual Emotion Recognition with Common Information

Zhang et al. 2020

Applied Sciences

Audio-visual emotion recognition aims to distinguish human emotional states by integrating the audio and visual data acquired in the expression of emotions. It is crucial for facilitating the affect-related human-machine interaction system by enabling machines to intelligently respond to human emotions. One challenge of this problem is how to efficiently extract feature representations from audio and visual modalities. Although progress has been made by previous works, most of them ignore common information between audio and visual data during the feature learning process, which may limit the performance since these two modalities are highly correlated in terms of their emotional information. To address this issue, we propose a deep learning approach in order to efficiently utilize common information for audio-visual emotion recognition by correlation analysis. Specifically, we design an audio network and a visual network to extract the feature representations from audio and visual data respectively, and then employ a fusion network to combine the extracted features for emotion prediction. These neural networks are trained by a joint loss, combining: (i) the correlation loss based on Hirschfeld-Gebelein-Rényi (HGR) maximal correlation, which extracts common information between audio data, visual data, and the corresponding emotion labels, and (ii) the classification loss, which extracts discriminative information from each modality for emotion prediction. We further generalize our architecture to the semi-supervised learning scenario. The experimental results on the eNTERFACE’05 dataset, BAUM-1s dataset, and RAVDESS dataset show that common information can significantly enhance the stability of features learned from different modalities, and improve the emotion recognition performance.

“…Several methods and architectures have been tried: Niu et al [43] extracted features from spectrograms via an AlexNet-based model and passed them to a DPARIP algorithm (a data augmentation technique based on the principle of retinal imaging and convex lens imaging), on six classes from IEMOCAP; Li et al [34] used a CNN-SSAE on Mel and IMel spectrograms from three separated datasets; Tripathi et al [65] implemented a ResNet supervised by Focal Loss to address the class imbalance in IEMOCAP; Wang et al [69] combined a CNN-BiLSTM model with multiple stacked Transformers creating well-defined features clusters in the latent space; Sultana et al [59] validated a series of CNNs/LSTMs-based architectures throughout multilingual experiments conducted on IEMOCAP and SUBESCO [60] (respectively English and Bangla); Su et al [58] applied a Graph Attentive GRU to 78-dimensional acoustic descriptors representing four classes from IEMOCAP and MSP-IMPROV [15]; Latif et al [32] proposed a hybrid architecture composed of Dense blocks and LSTM on spectrograms combining two speech datasets with real environmental noises from DEMAND [62] in order to improve noise robustness; Wu et al [72] utilized Capsule Network along with recurrent connections also on IEMOCAP; Sahu et al [52] passed a complex space of 1582 features extracted with the OpenSmile toolkit [21] into an Adversarial AutoEncoder; Mohan et al [40] recently achieved remarkable results with a decision-tree-based ensemble model with a gradient boosting framework (XG Boosting) using only MFCCs as input features.…”

Section: Related Work A. Speech Emotion Recognition (mentioning)

confidence: 99%

Speech Emotion Recognition and Deep Learning: An Extensive Validation Using Convolutional Neural Networks

Rí, Ciardi, Conci 2023

IEEE Access

The domain of Speech Emotion Recognition (SER) has experienced a tremendous revolution due to the outbreak of deep learning, which has contributed, as in many other research areas, to a significant boost in terms of model accuracy. SER refers to a branch of Human-Computer Interaction (HCI), which deals with recognizing emotional states from human speech. Although being a thriving field of research, SER still poses a number of non-trivial challenges, mainly due to the lack of shared best practices and high-quality datasets that can make the developed models suitable for their application in real environments. In this paper, we implement a CNN-based model combined with a Convolutional Attention Block, and conduct a series of experiments involving a selection of four English datasets popularly used for SER applications: RAVDESS, TESS, CREMA-D, and IEMOCAP. After testing the proposed pipeline on individual datasets, achieving a mean accuracy of 83%, 100%, 68% and 63% respectively, we perform an extensive cross-validation between common emotional classes belonging to single datasets or combinations of them, with the aim to investigate the generalization abilities of the extracted features. Index terms: speech emotion recognition, affective computing, deep learning.

“…This loss function has been shown to be preferable over the cross-entropy loss when facing the class imbalance problem. Because of its effectiveness, it has been successfully applied in many applications, e.g., medical diagnosis (Al Rahhal et al, 2019; Shu et al, 2019; Ulloa et al, 2020; Xu et al, 2020), speech processing (Tripathi et al, 2019), and natural language processing (Shi et al, 2018). Although the focal loss has been successfully applied in many real-world problems (Al Rahhal et al, 2019; Chang et al, 2018; Lotfy et al, 2019; Romdhane and Pr, 2020; Shu et al, 2019; Sun et al, 2019; Ulloa et al, 2020; Xu et al, 2020), considerably less attention has…”

Section: Introduction (mentioning)

confidence: 99%

On Focal Loss for Class-Posterior Probability Estimation: A Theoretical Perspective

Charoenphakdee, Vongkulbhisal, Chairatanakul et al. 2021

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

The focal loss has demonstrated its effectiveness in many real-world applications such as object detection and image classification, but its theoretical understanding has been limited so far. In this paper, we first prove that the focal loss is classification-calibrated, i.e., its minimizer surely yields the Bayes-optimal classifier and thus the use of the focal loss in classification can be theoretically justified. However, we also prove a negative fact that the focal loss is not strictly proper, i.e., the confidence score of the classifier obtained by focal loss minimization does not match the true class-posterior probability and thus it is not reliable as a class-posterior probability estimator. To mitigate this problem, we next prove that a particular closed-form transformation of the confidence score allows us to recover the true class-posterior probability. Through experiments on benchmark datasets, we demonstrate that our proposed transformation significantly improves the accuracy of class-posterior probability estimation.
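For readers unfamiliar with the loss discussed in the statements above, the following is a minimal sketch of the commonly used focal loss form, FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The function name, the example probabilities, and the choice gamma = 2 are assumptions made for illustration; they are not taken from Tripathi et al. or from the citing papers, and the optional class-weighting factor is omitted.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for a single example (illustrative sketch).

    probs  : predicted class probabilities, assumed to sum to 1
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 recovers plain cross-entropy
    """
    p_t = probs[target]
    # The factor (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Hypothetical predictions for a three-class problem; the true class is 0.
easy = [0.9, 0.05, 0.05]  # confident and correct
hard = [0.2, 0.5, 0.3]    # true class receives low probability

for name, probs in (("easy", easy), ("hard", hard)):
    ce = focal_loss(probs, 0, gamma=0.0)  # cross-entropy baseline
    fl = focal_loss(probs, 0, gamma=2.0)
    print(f"{name}: cross-entropy={ce:.4f}  focal={fl:.4f}")
```

With gamma = 2, the easy example's loss is scaled by (1 - 0.9)^2 = 0.01 while the hard example's loss is scaled by (1 - 0.2)^2 = 0.64, so poorly classified examples contribute relatively more to training. This is the usual intuition for why the loss is applied to imbalanced datasets.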