Speech Emotion Recognition (SER) is a fast-developing area of study whose primary goal is to automatically identify and analyze the emotional states expressed in speech. Emotions are crucial in human communication, as they shape the effectiveness and meaning of linguistic expressions. SER aims to create computational approaches and models that detect and interpret emotions from speech signals. One of its primary applications is in Human-Computer Interaction (HCI), where it can be used to build interactive systems that adapt to a user's emotional state based on their voice. This paper investigates the use of speech data for speech emotion recognition. In addition, we apply a transformation that converts the speech signals into 2D images and compare the results obtained from these images with those obtained from the original speech data, using two labeled speech datasets, one in Arabic and one in English. Our experiments compare three methods: a transformer-based model, a Vision Transformer (ViT) based model, and a wav2vec-based model. The transformer model is trained from scratch on two audio datasets, the Arabic Natural Audio Dataset (ANAD) and the Toronto Emotional Speech Set (TESS), while the ViT and wav2vec models are evaluated in a transfer-learning setting. The transformer model achieved accuracies of 94% and 99% on the ANAD and TESS datasets, respectively. ViT also performed strongly, reaching accuracies of 88% and 98% on ANAD and TESS, respectively. To assess the transfer-learning potential further, we also fine-tuned the wav2vec model; however, the results suggest limited success, with an accuracy of only 56% on the ANAD dataset.
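As a concrete illustration of the speech-to-image transformation mentioned above, the sketch below converts a waveform into a mel-spectrogram image with librosa. The abstract does not specify the exact transform, so the choice of a mel-spectrogram, the sampling rate, the number of mel bands, and the file names here are all assumptions rather than the paper's exact pipeline.

```python
# Sketch: convert a speech clip into a 2D mel-spectrogram image.
# Assumptions: mel-spectrogram as the 2D representation, 16 kHz audio,
# 128 mel bands, and the file name "clip.wav" are all illustrative.
import librosa
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("clip.wav", sr=16000)        # load and resample the waveform
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)     # log scale for better contrast

# Save the spectrogram as an image that an image model such as ViT can consume.
plt.imsave("clip_mel.png", mel_db, origin="lower", cmap="magma")
```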
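Similarly, the following is a minimal sketch of the wav2vec fine-tuning setup, assuming the Hugging Face transformers implementation of wav2vec 2.0; the checkpoint name, the number of emotion labels, and the single dummy training step are illustrative assumptions, not the paper's configuration.

```python
# Sketch: fine-tune wav2vec 2.0 for speech emotion classification.
# Assumptions: the "facebook/wav2vec2-base" checkpoint and the five
# emotion labels are illustrative placeholders, not the paper's setup.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=5
)

# One training step on a dummy 1-second clip; a real run loops over batches.
waveform = torch.randn(16000)                     # placeholder 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([2])                        # placeholder emotion label
outputs = model(**inputs, labels=labels)
outputs.loss.backward()                           # gradients for fine-tuning
```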