2021
DOI: 10.1016/j.knosys.2021.107316
A multimodal hierarchical approach to speech emotion recognition from audio and text

Cited by 65 publications (40 citation statements)
References 40 publications
“…However, Deep Learning models capable of processing hand-crafted features or complete audio recordings have also been used. For example, Singh et al. [39] proposed feeding hand-crafted features to deep neural networks. Prosody, spectral, and voice-quality features were used to train a hierarchical DNN classifier, achieving an accuracy of 81.2% on the RAVDESS dataset.…”
Section: Speech Emotion Recognition
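As a rough illustration of such a hierarchical classifier, the sketch below first predicts a coarse emotion group from a hand-crafted feature vector and then routes each sample to a group-specific emotion head. The grouping, feature dimensionality, layer sizes, and the class name HierarchicalSER are assumptions for illustration, not the configuration reported by Singh et al.

```python
# Minimal sketch of a two-stage "hierarchical DNN" for speech emotion
# recognition. The coarse grouping (e.g., by arousal) and all sizes are
# hypothetical; the cited paper's exact hierarchy is not reproduced here.
import torch
import torch.nn as nn

class HierarchicalSER(nn.Module):
    def __init__(self, n_features=120, n_groups=2, n_emotions_per_group=(4, 4)):
        super().__init__()
        # Shared encoder over hand-crafted prosody/spectral/voice-quality features.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # Stage 1: coarse group classifier (hypothetical grouping).
        self.group_head = nn.Linear(128, n_groups)
        # Stage 2: one fine-grained emotion head per coarse group.
        self.emotion_heads = nn.ModuleList(
            nn.Linear(128, n) for n in n_emotions_per_group
        )

    def forward(self, x):
        h = self.encoder(x)
        group_logits = self.group_head(h)
        groups = group_logits.argmax(dim=1)
        # Route each sample through the emotion head of its predicted group.
        emotion_logits = [self.emotion_heads[int(g)](h[i]) for i, g in enumerate(groups)]
        return group_logits, emotion_logits

model = HierarchicalSER()
x = torch.randn(8, 120)  # batch of hand-crafted feature vectors
group_logits, emotion_logits = model(x)
```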
“…Although vocal information is an essential modality for predicting emotions, the results of the emotion recognizer could be improved by incorporating other modalities, as demonstrated by Singh et al. [39], where incorporating textual features enhanced the results of the speech emotion recognizer. In our case, we include the information that facial expressions provide for emotion recognition.…”
Section: Facial Emotion Recognition
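A minimal sketch of how textual features can supplement an acoustic emotion recognizer is shown below, using simple feature-level concatenation. The vector dimensions, the fusion strategy, and the class name AudioTextFusion are illustrative assumptions rather than the method of the cited work.

```python
# Early-fusion sketch: concatenate an acoustic feature vector with a text
# embedding (e.g., from an ASR transcript) before a shared classifier.
# All dimensions are placeholders.
import torch
import torch.nn as nn

class AudioTextFusion(nn.Module):
    def __init__(self, audio_dim=120, text_dim=384, n_emotions=8):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(audio_dim + text_dim, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, n_emotions),
        )

    def forward(self, audio_vec, text_vec):
        # Feature-level fusion: join the two modality vectors.
        fused = torch.cat([audio_vec, text_vec], dim=1)
        return self.classifier(fused)

model = AudioTextFusion()
logits = model(torch.randn(4, 120), torch.randn(4, 384))
```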
“…These classifiers are usually neural networks capable of processing such descriptors or complete audio recordings. For example, Singh et al. [33] suggested using prosody, spectral information, and voice quality to train a hierarchical DNN classifier, reaching an accuracy of 81.2% on RAVDESS. Pepino et al. [34] combined eGeMAPS features with embeddings extracted from an xlsr-Wav2Vec2.0 model to train a CNN.…”
Section: Speech Emotion Recognition
“…Although voice is a crucial indicator of a subject's emotion, other modalities could enhance SER performance, as demonstrated by Singh et al. in [33], who incorporated textual features to supplement the speech emotion recognizer. In our scenario, we included the visual information of facial expressions.…”
Section: Facial Emotion Recognition
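As a minimal sketch of combining an audio-based and a face-based recognizer, the snippet below averages the class probabilities of two independently trained models (late fusion). The equal weighting and the function name late_fusion are naive assumptions, not the scheme used in the cited works.

```python
# Late-fusion sketch: combine per-class probabilities from an audio model
# and a facial-expression model; the 50/50 weighting is a placeholder.
import torch

def late_fusion(audio_logits, face_logits, w_audio=0.5):
    p_audio = torch.softmax(audio_logits, dim=1)
    p_face = torch.softmax(face_logits, dim=1)
    return w_audio * p_audio + (1.0 - w_audio) * p_face

fused = late_fusion(torch.randn(4, 8), torch.randn(4, 8))
pred = fused.argmax(dim=1)  # fused emotion prediction per sample
```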