2021
DOI: 10.1109/access.2021.3067460

Head Fusion: Improving the Accuracy and Robustness of Speech Emotion Recognition on the IEMOCAP and RAVDESS Dataset

Abstract: Speech Emotion Recognition (SER) refers to the use of machines to recognize a speaker's emotions from his or her speech. SER benefits Human-Computer Interaction (HCI), but many problems remain in SER research, e.g., the lack of high-quality data, insufficient model accuracy, and little research under noisy environments. In this paper, we propose a method called Head Fusion, based on the multi-head attention mechanism, to improve the accuracy of SER. We implemented an attention-based convolutional…
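The abstract describes an attention-based model built on multi-head attention. As a rough illustration of the mechanism involved, here is a minimal NumPy sketch of multi-head self-attention over a sequence of frame-level features; the random projection weights, the head count, and the concatenation-style fusion at the end are illustrative assumptions, not the paper's actual Head Fusion design:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, rng):
    # x: (seq_len, d_model) sequence of feature vectors, e.g. CNN frame features
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    # Random projections stand in for learned Q/K/V weights (illustrative only)
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
    # Project, then split the feature dimension into heads: (heads, seq, d_head)
    q = (x @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (heads, seq, seq)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    heads = attn @ v  # (heads, seq, d_head)
    # Fuse the heads by concatenating along the feature axis
    return heads.transpose(1, 0, 2).reshape(seq_len, d_model)

rng = np.random.default_rng(0)
features = rng.standard_normal((5, 8))        # 5 frames, 8-dim features
fused = multi_head_self_attention(features, num_heads=2, rng=rng)
print(fused.shape)  # (5, 8)
```

Concatenation is the standard way to recombine heads; the paper's Head Fusion method presumably modifies this step, but the abstract excerpt does not give enough detail to reproduce it.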


Cited by 68 publications (30 citation statements)
References 37 publications (48 reference statements)
“…In addition, in another work that was published in 2021, they worked on improving the accuracy and robustness on IEMOCAP and RAVDESS datasets [51]. In their work, they proposed a method called head fusion to improve speech emotion recognition accuracy.…”
Section: Discussion and Comparison
confidence: 99%
“…Patel et al presented another work in which they utilized an autoencoder to reduce dimensionality and used a CNN classifier to reach an accuracy of 80% for RAVDESS audio-only files [58]. A system consisting of CNN and head fusion multi-head attention achieved 77.8% WA for the audio-only speech files of RAVDESS in recent work [59]. The most recent SER system using this dataset was presented in [60].…”
Section: E Analysis of Models Using Multilingual Datasets (Setup 7)
confidence: 99%
“…Weighted accuracy (WA) and unweighted accuracy (UA) are used to assess the model performance. Following the recent studies [22,23,25,28,29,30,31], we use the averages from the 10-fold and 5-fold cross-validation as experimental results of IEMOCAP and RAVDESS, respectively. Baselines.…”
Section: Dataset and Experimental Setup
confidence: 99%
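The last statement above evaluates with weighted accuracy (WA) and unweighted accuracy (UA), the two standard SER metrics: WA is the overall fraction of correctly classified utterances, while UA averages per-class recall so that rare emotion classes count equally. A minimal sketch of both (function names are mine):

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    # WA: overall fraction of correctly classified utterances
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def unweighted_accuracy(y_true, y_pred):
    # UA: recall of each emotion class, averaged over classes,
    # so that minority classes weigh as much as majority ones
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy labels with an imbalanced class distribution
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(weighted_accuracy(y_true, y_pred))    # 0.75
print(unweighted_accuracy(y_true, y_pred))  # (2/3 + 1)/2 ≈ 0.8333
```

On imbalanced corpora such as IEMOCAP, WA and UA can diverge noticeably, which is why SER papers conventionally report both.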