2021
DOI: 10.1109/access.2021.3067460

Head Fusion: Improving the Accuracy and Robustness of Speech Emotion Recognition on the IEMOCAP and RAVDESS Dataset

Abstract: Speech Emotion Recognition (SER) refers to the use of machines to recognize a speaker's emotions from his or her speech. SER benefits Human-Computer Interaction (HCI), but many problems remain in SER research, e.g., the lack of high-quality data, insufficient model accuracy, and little research under noisy environments. In this paper, we propose a method called Head Fusion, based on the multi-head attention mechanism, to improve the accuracy of SER. We implemented an attention-based convolutional…
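The abstract describes an attention-based model built on multi-head attention. As a rough illustration of the mechanism involved, here is a minimal NumPy sketch of multi-head self-attention over a sequence of frame-level features; the random projection weights, the head count, and the concatenation-style fusion at the end are illustrative assumptions, not the paper's actual Head Fusion design:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, rng):
    # x: (seq_len, d_model) sequence of feature vectors, e.g. CNN frame features
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    # Random projections stand in for learned Q/K/V weights (illustrative only)
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
    # Project, then split the feature dimension into heads: (heads, seq, d_head)
    q = (x @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (heads, seq, seq)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    heads = attn @ v  # (heads, seq, d_head)
    # Fuse the heads by concatenating along the feature axis
    return heads.transpose(1, 0, 2).reshape(seq_len, d_model)

rng = np.random.default_rng(0)
features = rng.standard_normal((5, 8))        # 5 frames, 8-dim features
fused = multi_head_self_attention(features, num_heads=2, rng=rng)
print(fused.shape)  # (5, 8)
```

Concatenation is the standard way to recombine heads; the paper's Head Fusion method presumably modifies this step, but the abstract excerpt does not give enough detail to reproduce it.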


Cited by 68 publications (30 citation statements)
References 37 publications (48 reference statements)
“…In addition, in another work that was published in 2021, they worked on improving the accuracy and robustness on IEMOCAP and RAVDESS datasets [51]. In their work, they proposed a method called head fusion to improve speech emotion recognition accuracy.…”
Section: Discussion and Comparison
confidence: 99%
“…Patel et al presented another work in which they utilized an autoencoder to reduce dimensionality and used a CNN classifier to reach an accuracy of 80% for RAVDESS audio-only files [58]. A system consisting of CNN and head fusion multi-head attention achieved 77.8% WA for the audio-only speech files of RAVDESS in recent work [59]. The most recent SER system using this dataset was presented in [60].…”
Section: E Analysis of Models Using Multilingual Datasets (Setup 7)
confidence: 99%
“…Weighted accuracy (WA) and unweighted accuracy (UA) are used to assess the model performance. Following the recent studies [22,23,25,28,29,30,31], we use the averages from the 10-fold and 5-fold cross-validation as experimental results of IEMOCAP and RAVDESS, respectively. Baselines.…”
Section: Dataset and Experimental Setup
confidence: 99%
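The last statement above evaluates with weighted accuracy (WA) and unweighted accuracy (UA), the two standard SER metrics: WA is the overall fraction of correctly classified utterances, while UA averages per-class recall so that rare emotion classes count equally. A minimal sketch of both (function names are mine):

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    # WA: overall fraction of correctly classified utterances
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def unweighted_accuracy(y_true, y_pred):
    # UA: recall of each emotion class, averaged over classes,
    # so that minority classes weigh as much as majority ones
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy labels with an imbalanced class distribution
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(weighted_accuracy(y_true, y_pred))    # 0.75
print(unweighted_accuracy(y_true, y_pred))  # (2/3 + 1)/2 ≈ 0.8333
```

On imbalanced corpora such as IEMOCAP, WA and UA can diverge noticeably, which is why SER papers conventionally report both.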