ICASSP 2020 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054229

Attentive Modality Hopping Mechanism for Speech Emotion Recognition

Abstract: In this work, we explore the impact of visual modality in addition to speech and text for improving the accuracy of the emotion detection system. The traditional approaches tackle this task by fusing the knowledge from the various modalities independently for performing emotion classification. In contrast to these approaches, we tackle the problem by introducing an attention mechanism to combine the information. In this regard, we first apply a neural network to obtain hidden representations of the modalities.…
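As a rough illustration only (not the paper's exact attentive-modality-hopping formulation, which the truncated abstract does not fully specify), the sketch below uses plain NumPy and made-up utterance-level vectors to show the general idea: one modality attends over the other two, and the attended summary becomes the query for the next hop.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention: weight each key vector by its similarity to the query."""
    scores = softmax(keys @ query)   # (num_keys,)
    return scores @ keys             # weighted sum over keys, shape (dim,)

def modality_hop(audio, text, video, num_hops=3):
    """Illustrative 'hopping': on each hop, attend over the two modalities other than
    the current one and carry the attended summary forward as the next query."""
    modalities = [audio, text, video]
    query = audio                    # hypothetical choice of starting modality
    for hop in range(num_hops):
        others = np.stack([m for i, m in enumerate(modalities) if i != hop % 3])
        query = attend(query, others)
    return query

# Toy 8-dimensional utterance-level representations for each modality.
rng = np.random.default_rng(0)
audio, text, video = rng.normal(size=(3, 8))
fused = modality_hop(audio, text, video)
print(fused.shape)  # (8,)
```

The fused vector would then feed an emotion classifier; the dimensionality, number of hops, and starting modality here are arbitrary choices for the sketch.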

Cited by 28 publications (27 citation statements)
References 27 publications
“…MCSAN outperforms the state-of-the-art MHA model by 6.9%. We also present the performance of AMH [20] in Table 1. AMH is a tri-modal version of MHA by incorporating the visual information into MHA's framework.…”
Section: Comparison To State-of-the-art Methods
confidence: 99%
“…To align with previous studies [20], we use 7,487 utterances from seven emotions: frustration, neutral, anger, sadness, excitement, happiness, surprise. Since there is no standard split for this dataset, we follow [20,14] to perform 10-fold cross-validation, where 8:1:1 are used for training, validation and test, respectively. The weighted accuracy (WA, i.e., the overall accuracy) and unweighted accuracy (UA, i.e., the average accuracy over all emotion categories) is adopted as the evaluation metrics.…”
Section: Datasets
confidence: 99%
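The weighted accuracy (WA) and unweighted accuracy (UA) described in that excerpt can be computed as in the sketch below; the label arrays are hypothetical and merely stand in for fold-level predictions over the seven emotion classes.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """WA: overall fraction of correctly classified utterances."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

def unweighted_accuracy(y_true, y_pred):
    """UA: recall averaged over emotion classes, so rare classes count equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy labels over seven emotion classes (0..6).
y_true = [0, 1, 1, 2, 3, 4, 5, 6, 6, 6]
y_pred = [0, 1, 2, 2, 3, 4, 5, 6, 0, 6]
print(weighted_accuracy(y_true, y_pred))    # 0.8
print(unweighted_accuracy(y_true, y_pred))  # per-class average recall
```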
“…To obtain more detailed information, more attention would be invested to the target area. In the meantime, it suppresses other useless information [55][56][57][58][59][60].…”
Section: A. CBAM: Attention-Based Conv-BiLSTM
confidence: 99%
“…For example, a driver emotion detection system can automatically infer the driver’s emotional state and take corresponding measures to ensure road safety and human health [19]. Many previous research efforts on emotion recognition process the information from different modalities and use multimodal clues to infer the emotional states, and they show improvement of the overall performance by multimodal fusion [20]. Tzirakis et al applied convolutional neural networks to extract features from the speech and the facial expression and utilized long short-term memory networks to model the context and improve the performance [21].…”
Section: Related Work
confidence: 99%
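As a hedged sketch of the CNN-plus-LSTM pattern that the excerpt attributes to Tzirakis et al. (convolutional feature extraction followed by recurrent context modelling), the PyTorch module below is illustrative only; the layer sizes and the single-modality spectrogram input are assumptions, not the cited architecture.

```python
import torch
import torch.nn as nn

class CnnLstmEmotion(nn.Module):
    """Hypothetical sketch: 1-D convolutions extract frame-level features,
    an LSTM summarizes temporal context, a linear head predicts the emotion."""
    def __init__(self, n_mels=40, hidden=64, n_classes=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, n_mels, frames)
        h = self.conv(x).transpose(1, 2)   # (batch, frames, 64)
        _, (h_n, _) = self.lstm(h)         # last hidden state summarizes context
        return self.head(h_n[-1])          # (batch, n_classes)

logits = CnnLstmEmotion()(torch.randn(2, 40, 100))
print(logits.shape)  # torch.Size([2, 7])
```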