Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset

Mohamed, Omar; Aly, Salah A.

doi:10.48550/arxiv.2110.04425

Cited by 10 publications

(12 citation statements)

References 12 publications


“…They even included the feature selection method, but the Linear SVM classifier was used, but with a comparatively lightweight model, they obtained an accuracy of 96.02%, which is much closer to our findings. However, the proposed model was inferior to the model produced by [ 51 ] using the BAVED database. In another work [ 52 ] this technique is also applied.…”

Section: Discussion (contrasting)

confidence: 59%

Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention

Alsabhan

2023

Sensors


Emotions have a crucial function in the mental existence of humans. They are vital for identifying a person’s behaviour and mental condition. Speech Emotion Recognition (SER) is extracting a speaker’s emotional state from their speech signal. SER is a growing discipline in human–computer interaction, and it has recently attracted more significant interest. This is because there are not so many universal emotions; therefore, any intelligent system with enough computational capacity can educate itself to recognise them. However, the issue is that human speech is immensely diverse, making it difficult to create a single, standardised recipe for detecting hidden emotions. This work attempted to solve this research difficulty by combining a multilingual emotional dataset with building a more generalised and effective model for recognising human emotions. A two-step process was used to develop the model. The first stage involved the extraction of features, and the second stage involved the classification of the features that were extracted. ZCR, RMSE, and the renowned MFC coefficients were retrieved as features. Two proposed models, 1D CNN combined with LSTM and attention and a proprietary 2D CNN architecture, were used for classification. The outcomes demonstrated that the suggested 1D CNN with LSTM and attention performed better than the 2D CNN. For the EMO-DB, SAVEE, ANAD, and BAVED datasets, the model’s accuracy was 96.72%, 97.13%, 96.72%, and 88.39%, respectively. The model beat several earlier efforts on the same datasets, demonstrating the generality and efficacy of recognising multiple emotions from various languages.
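The feature-extraction stage named in the abstract above (ZCR and RMSE, alongside MFCCs) can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the frame length, hop size, and test tone are hypothetical choices, "RMSE" is assumed here to mean frame-wise root-mean-square energy, and MFCC extraction is omitted for brevity.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (trailing samples are dropped)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def rms_energy(frames):
    """Root-mean-square amplitude of each frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 440 * t)

# Hypothetical framing parameters (2048-sample frames, 512-sample hop).
frames = frame_signal(x, frame_len=2048, hop=512)
zcr = zero_crossing_rate(frames)
rms = rms_energy(frames)

# One row per frame, one column per feature; a classifier would consume this.
features = np.stack([zcr, rms], axis=1)
```

For a pure tone, both features are nearly constant across frames; on real speech they vary with voicing and loudness, which is what makes them usable as classifier inputs.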

“…Compared with the results in Table 2 , the proposed W2V-BLSTM-FT classifier performed better in all the measures than the AE-BLSTM-JT using eGeMAPS features, which is because the pretrained wav2vec2.0 model implicitly extracted the critical features for the ASD/TD classification, in contrast to the eGeMAPS, for which the feature extraction is based on the deterministic approach. In other words, data manipulation in an E2E manner benefits this ASD/TD classification, as researchers have reported in other tasks [ 28 , 29 , 32 , 33 , 34 ].…”

Section: Methods (supporting)

confidence: 51%

“…The pretrained wav2vec2.0 model is a follow-up model of the wav2vec and VQ-wav2vec models [ 30 , 31 ], which can learn a representation of the raw waveform without labeled phonemes or graphemes. Researchers widely employ the model as a pretrained model in audio- and speech-processing tasks [ 32 , 33 , 34 ], as it has the advantage of freeing them from having to select the best predefined feature set task by task. In addition, the pretrained model usually comprises numerous parameters and is a priori trained with many speech and audio datasets, without regard to a specific task.…”

Section: Proposed End-to-end Asd/td Classification Based On Pretraine... (mentioning)

confidence: 99%

“…In addition, the pretrained model usually comprises numerous parameters and is a priori trained with many speech and audio datasets, without regard to a specific task. Consequently, we can effectively apply the pretrained model in various downstream tasks with fine-tuning, such as automatic speech recognition, emotion classification, and speaker identification [ 28 , 29 , 32 , 33 , 34 ].…”

Section: Proposed End-to-end Asd/td Classification Based On Pretraine... (mentioning)

confidence: 99%

End-to-End Model-Based Detection of Infants with Autism Spectrum Disorder Using a Pretrained Model

Lee, Bong, et al.

2022

Sensors


In this paper, we propose an end-to-end (E2E) neural network model to detect autism spectrum disorder (ASD) from children’s voices without explicitly extracting the deterministic features. In order to obtain the decisions for discriminating between the voices of children with ASD and those with typical development (TD), we combined two different feature-extraction models and a bidirectional long short-term memory (BLSTM)-based classifier to obtain the ASD/TD classification in the form of probability. We realized one of the feature extractors as the bottleneck feature from an autoencoder using the extended version of the Geneva minimalistic acoustic parameter set (eGeMAPS) input. The other feature extractor is the context vector from a pretrained wav2vec2.0-based model directly applied to the waveform input. In addition, we optimized the E2E models in two different ways: (1) fine-tuning and (2) joint optimization. To evaluate the performance of the proposed E2E models, we prepared two datasets from video recordings of ASD diagnoses collected between 2016 and 2018 at Seoul National University Bundang Hospital (SNUBH), and between 2019 and 2021 at a Living Lab. According to the experimental results, the proposed wav2vec2.0-based E2E model with joint optimization achieved significant improvements in the accuracy and unweighted average recall, from 64.74% to 71.66% and from 65.04% to 70.81%, respectively, compared with a conventional model using autoencoder-based BLSTM and the deterministic features of the eGeMAPS.
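The abstract above reports both accuracy and unweighted average recall (UAR). The difference between the two can be shown with a short sketch on toy labels. The label coding (0 for TD, 1 for ASD) and the predictions are hypothetical and are not taken from the paper.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    return float(np.mean(y_true == y_pred))

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally regardless of size."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# Hypothetical imbalanced toy data: eight samples of class 0, two of class 1.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

acc = accuracy(y_true, y_pred)                   # 9 of 10 correct
uar = unweighted_average_recall(y_true, y_pred)  # mean of recalls 1.0 and 0.5
```

Here accuracy is 0.9 while UAR is 0.75, because the single missed minority-class sample halves that class's recall. This is why UAR is often reported next to accuracy when classes are imbalanced.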


“…Recent work on speech recognition focuses on the way speakers are stressed, emotional, and disguised in their speeches [6]. In this work, we aim to develop a deep learning model for voice identification in Arabic speech.…”

Section: Introduction (mentioning)

confidence: 99%

Towards an Efficient Voice Identification Using Wav2Vec2.0 and HuBERT Based on the Quran Reciters Dataset

Moustafa, Aly

2021

Preprint

Self Cite


Current authentication and trusted systems depend on classical and biometric methods to recognize or authorize users. Such methods include audio speech recognitions, eye, and finger signatures. Recent tools utilize deep learning and transformers to achieve better results. In this paper, we develop a deep learning constructed model for Arabic speakers' identification by using Wav2Vec2.0 and HuBERT audio representation learning tools. The end-to-end Wav2Vec2.0 paradigm acquires contextualized speech representations learning's by randomly masking a set of feature vectors, and then applies a transformer neural network. We employ an MLP classifier that is able to differentiate between invariant labeled classes. We show several experimental results that safeguard the high accuracy of the proposed model. The experiments ensure that an arbitrary wave signal for a certain speaker can be identified with 98% and 97.1% accuracies in the cases of Wav2Vec2.0 and HuBERT, respectively.
