ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054441

Attention Driven Fusion for Multi-Modal Emotion Recognition

Abstract: Deep learning has emerged as a powerful alternative to hand-crafted methods for emotion recognition on combined acoustic and text modalities. Baseline systems model emotion information in text and acoustic modes independently using Deep Convolutional Neural Networks (DCNN) and Recurrent Neural Networks (RNN), followed by applying attention, fusion, and classification. In this paper, we present a deep learning-based approach to exploit and fuse text and acoustic data for emotion classification. We utilize a Sin…
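The abstract outlines a pipeline of per-modality encoders followed by attention, fusion, and classification. As a rough illustration of that fusion-and-classification stage, here is a hedged PyTorch sketch that pools an acoustic and a text embedding with learned attention weights before classifying; the dimensions and encoder outputs are placeholders, not the paper's actual architecture.

```python
# Hedged sketch of attention-driven fusion of acoustic and text embeddings
# (modality encoders -> attention -> fusion -> classification). Sizes are
# illustrative placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim=128, n_classes=4):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # scores each modality vector
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, acoustic, text):        # each: (batch, dim)
        stack = torch.stack([acoustic, text], dim=1)     # (batch, 2, dim)
        alpha = torch.softmax(self.score(stack), dim=1)  # modality weights
        fused = (alpha * stack).sum(dim=1)               # (batch, dim)
        return self.classifier(fused)

logits = AttentionFusion()(torch.randn(8, 128), torch.randn(8, 128))  # (8, 4)
```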

Cited by 52 publications (41 citation statements)
References 23 publications
“…To our knowledge, the best reported accuracy using textual features only on this dataset was 70.8% [33]. Fine-tuning the off-the-shelf RoBERTa on word transcripts only (without pauses) achieved better performance.…”
Section: Experiments and Results
confidence: 77%
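As an illustration of what fine-tuning an off-the-shelf RoBERTa on word transcripts involves, here is a minimal hedged sketch using the Hugging Face transformers API; the model name, example texts, label indices, and the single training step are assumptions for illustration, not details from the cited work.

```python
# Minimal sketch of fine-tuning RoBERTa on word transcripts (no pause
# tokens) for 4-class emotion recognition. Assumes the Hugging Face
# `transformers` API; all specifics below are illustrative.
import torch
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=4  # e.g. happy, sad, angry, neutral
)

texts = ["i can't believe we won", "leave me alone"]  # toy transcripts
labels = torch.tensor([0, 2])  # hypothetical label indices

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # an optimizer step would follow in real training
```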
“…We used the IEMOCAP dataset [29], a benchmark dataset containing 12 hours of speech from 10 professional actors. Following the literature [30,31,32,33], we extracted 5531 utterances of four emotion types from the dataset: 1636 happy (also including excited), 1084 sad, 1103 angry, and 1708 neutral. The utterances were force-aligned using the P2FA forced aligner.…”
Section: Data
confidence: 99%
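The four-class setup described above, with "excited" folded into "happy", is a common IEMOCAP preprocessing step. A minimal sketch follows, assuming utterances arrive as (id, raw-label) pairs; the label codes and the loader are hypothetical.

```python
# Hedged sketch of the common IEMOCAP filtering step: keep four emotion
# classes and fold "excited" into "happy". Records are assumed to be
# (utterance_id, raw_label) pairs; label codes are hypothetical.
LABEL_MAP = {
    "hap": "happy", "exc": "happy",   # excited merged into happy
    "sad": "sad", "ang": "angry", "neu": "neutral",
}

def filter_utterances(records):
    """Keep only the four mapped classes (5531 utterances on full IEMOCAP)."""
    return [(uid, LABEL_MAP[lab]) for uid, lab in records if lab in LABEL_MAP]

print(filter_utterances([("Ses01F_impro01_F000", "exc"), ("x", "fru")]))
# -> [('Ses01F_impro01_F000', 'happy')]  ("fru" is filtered out)
```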
“…[Priyasad et al.] [64] presented a deep learning-based approach to extract features that are characteristic of emotion. Through a SincNet layer, a band-pass filtering technique combined with a neural net, the researchers managed to extract acoustic features.…”
Section: A. Multimodal Emotion Recognition Combining (Audio…
confidence: 99%
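A SincNet layer implements learnable band-pass filters as windowed sinc kernels applied by 1-D convolution over raw audio. The sketch below is a hedged, simplified rendition of that idea, not the paper's exact configuration; the filter count, kernel size, and cutoff initialization are illustrative.

```python
# Hedged sketch of a SincNet-style layer: each "filter" is a learnable
# band-pass defined by low/high cutoff frequencies, realized as a windowed
# sinc kernel and applied as 1-D convolution over raw audio.
import torch
import torch.nn as nn

class SincConv(nn.Module):
    def __init__(self, n_filters=16, kernel_size=101, sample_rate=16000):
        super().__init__()
        self.kernel_size, self.sr = kernel_size, sample_rate
        # Learnable cutoffs in Hz, initialized to spread over the spectrum.
        self.low_hz = nn.Parameter(torch.linspace(30, sample_rate / 2 - 200, n_filters))
        self.band_hz = nn.Parameter(torch.full((n_filters,), 100.0))
        n = torch.arange(kernel_size) - kernel_size // 2
        self.register_buffer("t", n / sample_rate)  # kernel time axis (s)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):  # x: (batch, 1, samples)
        low = self.low_hz.abs()
        high = (low + self.band_hz.abs()).clamp(max=self.sr / 2)
        t = self.t.unsqueeze(0)  # (1, kernel_size)
        # Band-pass impulse response = difference of two low-pass sincs.
        lp_high = 2 * high.unsqueeze(1) * torch.sinc(2 * high.unsqueeze(1) * t)
        lp_low = 2 * low.unsqueeze(1) * torch.sinc(2 * low.unsqueeze(1) * t)
        kernels = (lp_high - lp_low) * self.window   # (n_filters, kernel_size)
        return nn.functional.conv1d(x, kernels.unsqueeze(1))

out = SincConv()(torch.randn(2, 1, 16000))  # -> (2, 16, 15900)
```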
“…There has been recent interest in attention-based SER models for higher accuracy [8,9,12]. However, those attention mechanisms can only be computed at a preset granularity, which may not adapt dynamically to different areas of interest in the spectrogram.…”
Section: Related Work
confidence: 99%
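To make the "preset granularity" point concrete, the hedged sketch below computes attention at one fixed granularity, per spectrogram frame, and pools frames into an utterance vector; the feature sizes are illustrative, and this is the generic pattern the quote critiques rather than any specific cited model.

```python
# Hedged sketch of fixed-granularity attention over a spectrogram:
# one scalar weight per frame, then attention-weighted pooling.
import torch
import torch.nn as nn

class FrameAttentionPool(nn.Module):
    def __init__(self, n_mels=80):
        super().__init__()
        self.score = nn.Linear(n_mels, 1)  # one score per frame

    def forward(self, spec):  # spec: (batch, frames, n_mels)
        alpha = torch.softmax(self.score(spec), dim=1)  # (batch, frames, 1)
        return (alpha * spec).sum(dim=1)  # utterance vector: (batch, n_mels)

pooled = FrameAttentionPool()(torch.randn(4, 300, 80))  # -> (4, 80)
```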
“…For example, a psychologist can design a treatment plan according to the emotions hidden or expressed in a patient's speech. Deep learning has accelerated progress in recognizing human emotions from speech [4][5][6][7][8][9], but there are still deficiencies in SER research, such as data shortage and insufficient model accuracy.…”
Section: Introduction
confidence: 99%