A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism

Lieskovská, Eva; Jakubec, Maroš; Jarina, Roman; Chmulík, Michal

doi:10.3390/electronics10101163

Cited by 114 publications

(67 citation statements)

References 113 publications


“…Recent tremendous results in speech emotion recognition (SER) have been focused on the utilizations of deep learning and convolutional networks [7], [8], [9], [10], [11], [12]. The task is also investigated in Arabic speech emotion recognition (ASER) in several recent results [13], [14], [15].…”

Section: Related Work (mentioning)

confidence: 99%

Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset

Mohamed, Aly

2021

Preprint


Recently, there have been tremendous research outcomes in the fields of speech recognition and natural language processing. This is due to well-developed multilayer deep learning paradigms such as wav2vec2.0, Wav2vecU, WavBERT, and HuBERT, which provide better representation learning and capture more information. Such paradigms are trained on hundreds of unlabeled data and then fine-tuned on a small dataset for specific tasks. This paper introduces a deep-learning-based emotion recognition model for Arabic speech dialogues. The developed model employs state-of-the-art audio representations, including wav2vec2.0 and HuBERT. The experimental and performance results of our model outperform previously known results.



“…Overall, the precision, recall, and f1-score values obtained were very similar to the recognition accuracy as we can see from Tables 5-11, and the AUC values were also quite close to 1. Tables 6-11 show that the common point of CNN, CRNN, and GRU models were that the highest precision, recall, and f1-score were achieved with the "sadness" emotion, and the lowest recall and f1-score were for the "happiness" ("excitement") emotion and for both sets of parameters. The lowest precision was for the emotions of "excitement" or "anger."…”

Section: Results (mentioning)

confidence: 99%

“…The research in [7] has surveyed and evaluated quite a significant number of studies on speech emotion recognition for different corpuses including IEMOCAP [8]. IEMOCAP was a corpus collected by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC).…”

Section: Related Work (mentioning)

confidence: 99%

Emotional Speech Recognition Using Deep Neural Networks

Van, Xuan, et al. 2022

Sensors


The expression of emotions in human communication plays a very important role in the information that needs to be conveyed to the partner. The forms of expression of human emotions are very rich. It could be body language, facial expressions, eye contact, laughter, and tone of voice. The languages of the world’s peoples are different, but even without understanding a language in communication, people can almost understand part of the message that the other partner wants to convey with emotional expressions as mentioned. Among the forms of human emotional expression, the expression of emotions through voice is perhaps the most studied. This article presents our research on speech emotion recognition using deep neural networks such as CNN, CRNN, and GRU. We used the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus for the study with four emotions: anger, happiness, sadness, and neutrality. The feature parameters used for recognition include the Mel spectral coefficients and other parameters related to the spectrum and the intensity of the speech signal. The data augmentation was used by changing the voice and adding white noise. The results show that the GRU model gave the highest average recognition accuracy of 97.47%. This result is superior to existing studies on speech emotion recognition with the IEMOCAP corpus.


“…They are now also used in speaker recognition [29]. The approaches that have been successfully applied in speaker recognition are often adopted in emotion recognition (see e.g., [30][31][32]).…”

Section: System Architecture (mentioning)

confidence: 99%

Mapping Discrete Emotions in the Dimensional Space: An Acoustic Approach

et al. 2021


A frequently used procedure to examine the relationship between categorical and dimensional descriptions of emotions is to ask subjects to place verbal expressions representing emotions in a continuous multidimensional emotional space. This work chooses a different approach. It aims at creating a system predicting the values of Activation and Valence (AV) directly from the sound of emotional speech utterances without the use of its semantic content or any other additional information. The system uses X-vectors to represent sound characteristics of the utterance and a Support Vector Regressor for the estimation of the AV values. The system is trained on a pool of three publicly available databases with dimensional annotation of emotions. The quality of regression is evaluated on the test sets of the same databases. Mapping of categorical emotions to the dimensional space is tested on another pool of eight categorically annotated databases. The aim of the work was to test whether in each unseen database the predicted values of Valence and Activation will place emotion-tagged utterances in the AV space in accordance with expectations based on Russell’s circumplex model of affective space. Due to the great variability of speech data, clusters of emotions create overlapping clouds. Their average location can be represented by centroids. A hypothesis on the position of these centroids is formulated and evaluated. The system’s ability to separate the emotions is evaluated by measuring the distance of the centroids. It can be concluded that the system works as expected and the positions of the clusters follow the hypothesized rules. Although the variance in individual measurements is still very high and the overlap of emotion clusters is large, it can be stated that the AV coordinates predicted by the system lead to an observable separation of the emotions in accordance with the hypothesis.
Knowledge from training databases can therefore be used to predict AV coordinates of unseen data of various origins. This could be used to detect high levels of stress or depression. With the appearance of more dimensionally annotated training data, the systems predicting emotional dimensions from speech sound will become more robust and usable in practical applications in call-centers, avatars, robots, information-providing systems, security applications, and the like.
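
The centroid-based evaluation described in this abstract can be illustrated with a minimal sketch. It assumes each utterance already has a predicted (valence, activation) pair and an emotion label; the function names and the sample values are hypothetical and are not taken from the paper, which does not specify its distance measure (Euclidean distance is an assumption here).

```python
from itertools import combinations
from math import dist
from statistics import fmean


def centroids(points_by_emotion):
    """Return the mean (valence, activation) point for each emotion label."""
    return {
        label: (fmean(v for v, _ in pts), fmean(a for _, a in pts))
        for label, pts in points_by_emotion.items()
    }


def centroid_distances(cents):
    """Return the Euclidean distance between every pair of centroids."""
    return {
        (a, b): dist(cents[a], cents[b])
        for a, b in combinations(sorted(cents), 2)
    }


# Hypothetical predicted (valence, activation) values, for illustration only.
predicted = {
    "anger": [(-0.6, 0.8), (-0.4, 0.6)],
    "sadness": [(-0.5, -0.7), (-0.7, -0.5)],
    "happiness": [(0.6, 0.5), (0.8, 0.7)],
}

cents = centroids(predicted)
distances = centroid_distances(cents)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

Larger pairwise distances suggest better separation of the emotion clusters, although centroids alone say nothing about how much the clusters overlap.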

