2021
DOI: 10.3390/s21227665

Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning

Abstract: Emotion recognition is attracting the attention of the research community because of the many areas where it can be applied, such as healthcare and road safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, namely embedding extraction and fine-tuning. The best accuracy results were achieved when we fine-tuned the CNN-14 of the PANNs framewo…
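The visible part of the abstract does not say how the speech and facial predictions are combined. As a minimal sketch only, assuming a late-fusion scheme (a weighted average of per-modality class probabilities, which is one common option and not necessarily the authors' method), the combination step could look like this. The functions `late_fusion` and `predict`, the weight `w_speech`, and the example probabilities are hypothetical; the eight labels are the RAVDESS emotion classes.

```python
from typing import List, Sequence

# The eight emotion classes annotated in RAVDESS.
EMOTIONS = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]


def late_fusion(p_speech: Sequence[float], p_face: Sequence[float], w_speech: float = 0.5) -> List[float]:
    """Hypothetical late fusion: weighted average of two class-probability vectors."""
    if len(p_speech) != len(p_face):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= w_speech <= 1.0:
        raise ValueError("w_speech must lie in [0, 1]")
    return [w_speech * s + (1.0 - w_speech) * f for s, f in zip(p_speech, p_face)]


def predict(p_fused: Sequence[float]) -> str:
    # Pick the class with the highest fused probability.
    best = max(range(len(p_fused)), key=lambda i: p_fused[i])
    return EMOTIONS[best]


if __name__ == "__main__":
    # Made-up posteriors for a single clip, for illustration only.
    speech = [0.05, 0.05, 0.40, 0.10, 0.20, 0.10, 0.05, 0.05]
    face = [0.05, 0.05, 0.10, 0.10, 0.50, 0.10, 0.05, 0.05]
    print(predict(late_fusion(speech, face, w_speech=0.5)))
    print(predict(late_fusion(speech, face, w_speech=0.8)))
```

In this toy case, equal weights favour the facial model's "angry", while shifting the weight toward speech (`w_speech=0.8`) flips the prediction to "happy"; this sensitivity is why such a weight would typically be tuned on held-out data rather than fixed by hand.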

Cited by 71 publications (24 citation statements)
References 77 publications

“…More specifically, for the feature extraction, we have obtained an improvement of 10.73 points, and for the fine-tuning, an increment of 5.24 with the current transformer-based approach. Regarding the visual modality, the AUs got a slight increment in comparison with the embeddings extracted from the STN on [13]. In our previous work, we reported an accuracy of 57.08%, and now we achieved 62.13%.…”
Section: Comparative Results With Previous Work (mentioning)
confidence: 53%
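Worked through as arithmetic, the visual-modality accuracies in the quote above (57.08% previously, 62.13% now) correspond to an absolute gain of about five percentage points, or roughly a 9% relative improvement:

```latex
\Delta_{\text{abs}} = 62.13\% - 57.08\% = 5.05 \text{ percentage points},
\qquad
\Delta_{\text{rel}} = \frac{5.05}{57.08} \approx 8.85\%
```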
“…Regarding our previous publications, we can see that both methods, the feature extraction and the fine-tuning of the xlsr-Wav2Vec2.0, surpassed our previous proposals for the SER using CNNs in [13]. More specifically, for the feature extraction, we have obtained an improvement of 10.73 points, and for the fine-tuning, an increment of 5.24 with the current transformer-based approach.…”
Section: Comparative Results With Previous Work (mentioning)
confidence: 57%
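The quote contrasts two transfer-learning strategies for a pretrained model: feature extraction, where the pretrained encoder stays frozen and only a new task head is trained on its embeddings, and fine-tuning, where the encoder's weights are updated as well. The toy below is a sketch of that distinction only and does not use xlsr-Wav2Vec2.0 or PANNs: a one-weight linear "encoder" and a one-weight "head" fitted by gradient descent on squared error stand in for a pretrained network and an emotion classifier. The functions `train` and `mse`, the learning rate, and the data are all hypothetical.

```python
from typing import Sequence, Tuple


def mse(xs: Sequence[float], ys: Sequence[float], enc_w: float, head_w: float) -> float:
    """Mean squared error of the toy model y_hat = head_w * (enc_w * x)."""
    return sum((head_w * enc_w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(
    xs: Sequence[float],
    ys: Sequence[float],
    enc_w: float,
    head_w: float,
    freeze_encoder: bool,
    lr: float = 0.01,
    steps: int = 200,
) -> Tuple[float, float]:
    """Full-batch gradient descent on the toy model.

    freeze_encoder=True  -> feature extraction: only the head is trained.
    freeze_encoder=False -> fine-tuning: encoder and head are both trained.
    """
    n = len(xs)
    for _ in range(steps):
        g_enc = 0.0
        g_head = 0.0
        for x, y in zip(xs, ys):
            z = enc_w * x           # "embedding" produced by the pretrained encoder
            err = head_w * z - y    # residual of the task head
            g_head += 2.0 * err * z / n
            g_enc += 2.0 * err * head_w * x / n
        head_w -= lr * g_head
        if not freeze_encoder:
            enc_w -= lr * g_enc     # only fine-tuning updates the encoder
    return enc_w, head_w


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    ys = [2.0, 4.0, 6.0]
    for freeze in (True, False):
        enc_w, head_w = train(xs, ys, enc_w=1.0, head_w=0.5, freeze_encoder=freeze)
        label = "feature extraction" if freeze else "fine-tuning"
        print(f"{label}: enc_w={enc_w:.3f} head_w={head_w:.3f} mse={mse(xs, ys, enc_w, head_w):.2e}")
```

With the encoder frozen, its weight stays at its pretrained value and only the head adapts; fine-tuning lets both move, which gives the model more freedom to adapt to the target data at the cost of updating, and storing, a full copy of the pretrained weights.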
“…Emotions can be detected by facial expressions [37,38]. Article [39] proposed a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, they fine-tuned the CNN-14 of the PANNs framework, and for facial emotion recognizers, they proposed a framework that consists of a pre-trained Spatial Transformer Network on saliency maps and facial images followed by a bi-LSTM with an attention mechanism.…”
Section: Introduction (mentioning)
confidence: 99%
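The facial pipeline quoted above ends in a bi-LSTM with an attention mechanism. As a hedged sketch of one simple form of attention pooling (dot-product scores against a context vector, a softmax over frames, then a weighted sum of the frame vectors), not necessarily the exact variant used in [39], the pooling step could look like this. The names `attention_pool` and `context` and the toy numbers are hypothetical; in a real model each frame vector would be the bi-LSTM's concatenated forward and backward hidden states, and `context` would be learned.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden: Sequence[Sequence[float]], context: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Collapse T per-frame vectors into one clip-level vector.

    hidden:  T frame vectors (hypothetical bi-LSTM outputs).
    context: scoring vector (hypothetical; learned in a real model).
    Returns the pooled vector and the attention weight of each frame.
    """
    # Dot-product score of each frame against the context vector.
    scores = [sum(h_d * c_d for h_d, c_d in zip(h, context)) for h in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    # Weighted sum of the frames, one dimension at a time.
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy sequence, T = 3
    pooled, weights = attention_pool(frames, context=[10.0, 0.0])
    print([round(w, 4) for w in weights], [round(p, 4) for p in pooled])
```

With an all-zero context every frame gets the same weight and the result reduces to mean pooling; a context aligned with the first feature concentrates almost all of the weight on the frames where that feature is active.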