A Review on Emotion Recognition Algorithms using Speech Analysis

Gunawan, Teddy Surya; Alghifari, Muhammad Fahreza; Morshidi, Malik Arman; Kartiwi, Mira

doi:10.11591/ijeei.v6i1.409

Cited by 18 publications

(8 citation statements)

References 19 publications


“…The scope for future improvements is very appealing in this field. Different multimodel deep learning techniques can be used along with different architectures to improve the performance parameters [20][21][22][23][24][25][26][27]. Apart from recognizing the emotions only, there can be further addition of intensity scale.…”

Section: Results (mentioning)

confidence: 99%

Development of video-based emotion recognition using deep learning with Google Colab

Gunawan, Ashraf, Riza, et al. 2020

TELKOMNIKA

Self Cite


Emotion recognition using images, videos, or speech as input has been a hot research topic for some years. The introduction of deep learning techniques, e.g., convolutional neural networks (CNN), to emotion recognition has produced promising results. Human facial expressions are considered critical components in understanding one's emotions. This paper sheds light on recognizing emotions from videos using deep learning techniques. The methodology of the recognition process, along with its description, is provided. Some of the video-based datasets used in many scholarly works are also examined. Results obtained from different emotion recognition models are presented along with their performance parameters. An experiment carried out on the fer2013 dataset in Google Colab for depression detection achieved 97% accuracy on the training set and 57.4% accuracy on the testing set.


“…While such end-to-end DNNs may provide the valuable results, the limitation due to the scarcity of emotion-labeled speech datasets hinders the training of DNNs from scratch. A relevant number of studies, therefore, still employ traditional handcrafted speech features, particularly MFCCs, which are reportedly one of the most conventional and effective feature sets [3,41]. In [42], for example, MFCCs achieved notable performance on the Audio Video Emotion Challenge 2016.…”

Section: Speech Emotion Recognition (mentioning)

confidence: 99%

When Old Meets New: Emotion Recognition from Speech Signals

et al. 2021


Speech is one of the most natural communication channels for expressing human emotions. Therefore, speech emotion recognition (SER) has been an active area of research with an extensive range of applications that can be found in several domains, such as biomedical diagnostics in healthcare and human–machine interactions. Recent works in SER have been focused on end-to-end deep neural networks (DNNs). However, the scarcity of emotion-labeled speech datasets inhibits the full potential of training a deep network from scratch. In this paper, we propose new approaches for classifying emotions from speech by combining conventional mel-frequency cepstral coefficients (MFCCs) with image features extracted from spectrograms by a pretrained convolutional neural network (CNN). Unlike prior studies that employ end-to-end DNNs, our methods eliminate the resource-intensive network training process. By using the best prediction model obtained, we also build an SER application that predicts emotions in real time. Among the proposed methods, the hybrid feature set fed into a support vector machine (SVM) achieves an accuracy of 0.713 in a 6-class prediction problem evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, which is higher than the previously published results. Interestingly, MFCCs taken as unique input into a long short-term memory (LSTM) network achieve a slightly higher accuracy of 0.735. Our results reveal that the proposed approaches lead to an improvement in prediction accuracy. The empirical findings also demonstrate the effectiveness of using a pretrained CNN as an automatic feature extractor for the task of emotion prediction. Moreover, the success of the MFCC-LSTM model is evidence that, despite being conventional features, MFCCs can still outperform more sophisticated deep-learning feature sets.


“…For examples, the sadness, happy, joy, neutral etc., all are considered to be non-violent and only the anger is defined as violence. In earlier there are so many approaches are developed to detect the emotion from the audio signals [23]. With the help of MFCCs of speech signals, S. Demircan and H. Kahramanl [24] developed an emotion recognition system with unsupervised learning.…”

Section: Literature Survey (mentioning)

confidence: 99%

Classification of Ontological Violence Content Detection through Audio Features and Supervised Learning

Potharaju, Kamsali, Kesavari 2019

IJIES


Violence detection is an important task with many applications. Depending on the data format, violence can be defined in many ways. This paper develops an automatic violence detection framework for audio data. To do this, a new and efficient set of features is extracted from the audio signals, which provides more discrimination between the different types of violence in audio signals. Considering both spatial and Mel frequency characteristics of audio signals, a total of 12 statistical functionals are computed to describe every signal. Furthermore, violence is defined in an ontological fashion, such that all possible violence types that signify violent behavior are detected. Extensive simulations are carried out on the proposed detection framework using audio signals extracted from video clips taken from different movies. Performance is analyzed through Receiver Operating Characteristics such as Accuracy, Precision, Recall, and False Positive Rate, and the results show better performance than the conventional approaches.

