Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning

Luna-Jiménez, Cristina; Griol, David; Callejas, Zoraida; Kleinlein, Ricardo; Montero, Juan Manuel; Fernández-Martínez, Fernando

doi:10.3390/s21227665

Cited by 71 publications

(24 citation statements)

References 77 publications


“…More specifically, for the feature extraction, we have obtained an improvement of 10.73 points, and for the fine-tuning, an increment of 5.24 with the current transformer-based approach. Regarding the visual modality, the AUs got a slight increment in comparison with the embeddings extracted from the STN on [13]. In our previous work, we reported an accuracy of 57.08%, and now we achieved 62.13%.…”

Section: Comparative Results With Previous Work

mentioning

confidence: 53%

“…Regarding our previous publications, we can see that both methods, the feature extraction and the fine-tuning of the xlsr-Wav2Vec2.0, surpassed our previous proposals for the SER using CNNs in [13]. More specifically, for the feature extraction, we have obtained an improvement of 10.73 points, and for the fine-tuning, an increment of 5.24 with the current transformer-based approach.…”

Section: Comparative Results With Previous Work

mentioning

confidence: 57%

“…The distribution per actor for the validation folds was as follows: We proposed this setup following the work of Issa et al [35], who applied a similar subject-wise cross-validation methodology using the eight classes of the dataset. This evaluation procedure allowed us to compare our contribution to this previous work and with our prior solutions in [13].…”

Section: The Dataset and Evaluation

mentioning

confidence: 99%

“…In this way, we expect to create a common framework to compare contributions and models' performance on the RAVDESS dataset. We decided to continue with the formulation of our previous paper of Luna-Jiménez et al [13] that consisted of a subject-wise 5-CV technique based on the eight emotions captured in the RAVDESS dataset.…”

mentioning

confidence: 99%


A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset

et al. 2021

Self Cite


Emotion recognition is attracting the attention of the research community due to its multiple applications in different fields, such as medicine or autonomous driving. In this paper, we proposed an automatic emotion recognizer system that consisted of a speech emotion recognizer (SER) and a facial emotion recognizer (FER). For the SER, we evaluated a pre-trained xlsr-Wav2Vec2.0 transformer using two transfer-learning techniques: embedding extraction and fine-tuning. The best accuracy results were achieved when we fine-tuned the whole model by appending a multilayer perceptron on top of it, confirming that the training was more robust when it did not start from scratch and the previous knowledge of the network was close to the target task. Regarding the facial emotion recognizer, we extracted the Action Units of the videos and compared the performance of static models against sequential models. Results showed that sequential models beat static models by a narrow margin. Error analysis reported that the visual systems could improve with a detector of frames with a high emotional load, which opened a new line of research to discover new ways to learn from videos. Finally, combining these two modalities with a late fusion strategy, we achieved 86.70% accuracy on the RAVDESS dataset on a subject-wise 5-CV evaluation, classifying eight emotions. Results demonstrated that these modalities carried relevant information to detect users’ emotional state and that their combination improved the final system performance.
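The subject-wise 5-CV evaluation and the late fusion mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the round-robin assignment of actors to folds, the function names, and the weighted averaging of posterior probabilities are all assumptions chosen for illustration. The abstract only states that folds are split by subject and that the two modalities are combined with a late fusion strategy.

```python
from collections import defaultdict


def subject_wise_folds(actor_ids, n_folds=5):
    """Group clip indices into folds so that each actor lands in exactly one fold.

    actor_ids: one actor label per clip, in clip order.
    The round-robin actor-to-fold mapping is a hypothetical choice,
    not the fold distribution used in the paper.
    """
    unique_actors = sorted(set(actor_ids))
    fold_of_actor = {actor: i % n_folds for i, actor in enumerate(unique_actors)}
    folds = defaultdict(list)
    for clip_index, actor in enumerate(actor_ids):
        folds[fold_of_actor[actor]].append(clip_index)
    return [folds[k] for k in range(n_folds)]


def late_fusion(p_speech, p_face, w_speech=0.5):
    """Combine per-class posteriors of two modalities by weighted averaging.

    Weighted averaging is an assumed, simple form of late fusion;
    w_speech is a hypothetical weight for the speech model.
    """
    fused = [w_speech * s + (1.0 - w_speech) * f for s, f in zip(p_speech, p_face)]
    total = sum(fused)
    return [value / total for value in fused]


def predict(probabilities):
    """Return the index of the most probable class."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)
```

Calling `subject_wise_folds` on the actor label of every clip yields five lists of clip indices. Holding out one list for validation and training on the other four keeps every speaker out of their own training set, which is the point of a subject-wise split.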


“…Emotions can be detected by facial expressions [37,38]. Article [39] proposed a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, they fine-tuned the CNN-14 of the PANNs framework, and for facial emotion recognizers, they proposed a framework that consists of a pre-trained Spatial Transformer Network on saliency maps and facial images followed by a bi-LSTM with an attention mechanism.…”

Section: Introduction

mentioning

confidence: 99%

Emotional Speech Recognition Method Based on Word Transcription

Bekmanova, Yergesh, Sharipbay, et al. 2022

Sensors


The emotional speech recognition method presented in this article was applied to recognize the emotions of students during online exams in distance learning due to COVID-19. The purpose of this method is to recognize emotions in spoken speech through the knowledge base of emotionally charged words, which are stored as a code book. The method analyzes human speech for the presence of emotions. To assess the quality of the method, an experiment was conducted for 420 audio recordings. The accuracy of the proposed method is 79.7% for the Kazakh language. The method can be used for different languages and consists of the following tasks: capturing a signal, detecting speech in it, recognizing speech words in a simplified transcription, determining word boundaries, comparing a simplified transcription with a code book, and constructing a hypothesis about the degree of speech emotionality. In case of the presence of emotions, there occurs complete recognition of words and definitions of emotions in speech. The advantage of this method is the possibility of its widespread use since it is not demanding on computational resources. The described method can be applied when there is a need to recognize positive and negative emotions in a crowd, in public transport, schools, universities, etc. The experiment carried out has shown the effectiveness of this method. The results obtained will make it possible in the future to develop devices that begin to record and recognize a speech signal, for example, in the case of detecting negative emotions in sounding speech and, if necessary, transmitting a message about potential threats or riots.
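The code-book comparison step described in this abstract might look roughly like the following Python sketch. Everything concrete here is an assumption for illustration: the function name, the code-book layout (word mapped to a signed polarity), the placeholder English words, and the scoring rule. The abstract does not specify how the hypothesis about the degree of emotionality is computed.

```python
def emotionality(transcript_words, code_book):
    """Compare transcribed words against a code book of emotionally charged words.

    transcript_words: list of words from a (simplified) transcription.
    code_book: hypothetical mapping word -> polarity (+1 positive, -1 negative).
    Returns (share of charged words, coarse label).
    """
    hits = [code_book[word] for word in transcript_words if word in code_book]
    if not hits:
        # No emotionally charged word found: treat the utterance as neutral.
        return 0.0, "neutral"
    share = len(hits) / len(transcript_words)
    polarity = sum(hits)
    if polarity > 0:
        label = "positive"
    elif polarity < 0:
        label = "negative"
    else:
        label = "neutral"
    return share, label
```

A dictionary lookup per word keeps the cost linear in the length of the transcription, which is consistent with the abstract's claim that the method is not demanding on computational resources.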
