Interspeech 2018
DOI: 10.21437/interspeech.2018-2466

Deep Neural Networks for Emotion Recognition Combining Audio and Transcripts

Abstract: In this paper, we propose to improve emotion recognition by combining acoustic information and conversation transcripts. On the one hand, an LSTM network was used to detect emotion from acoustic features such as F0, shimmer, jitter, MFCC, etc. On the other hand, a multi-resolution CNN was used to detect emotion from word sequences. This CNN consists of several parallel convolutions with different kernel sizes to exploit contextual information at different levels. A temporal pooling layer aggregates the hidden representations…
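
The abstract gives enough detail to sketch both branches. The PyTorch sketch below is our illustration of that description, not the authors' released code; every hyperparameter (feature dimension, hidden size, kernel widths, filter counts) is an assumption:

```python
import torch
import torch.nn as nn

class AcousticLSTM(nn.Module):
    """Acoustic branch: an LSTM over frame-level features such as F0,
    jitter, shimmer and MFCCs. All sizes here are illustrative."""
    def __init__(self, n_feats=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)

    def forward(self, frames):                  # frames: (batch, T, n_feats)
        _, (h_n, _) = self.lstm(frames)
        return h_n[-1]                          # last hidden state: (batch, hidden)

class MultiResolutionTextCNN(nn.Module):
    """Text branch: parallel 1-D convolutions with different kernel sizes
    over word embeddings, aggregated by temporal max-pooling."""
    def __init__(self, vocab_size=10000, embed_dim=300,
                 n_filters=64, kernel_sizes=(2, 3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, k, padding=k // 2)
            for k in kernel_sizes)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Each kernel size captures context at a different resolution;
        # max over time pools each branch into a fixed-size vector.
        pooled = [conv(x).relu().amax(dim=2) for conv in self.convs]
        return torch.cat(pooled, dim=1)         # (batch, n_filters * len(kernel_sizes))
```

The two branch outputs would then be combined and fed to a shared classifier over the emotion classes.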

Cited by 72 publications (46 citation statements)
References 14 publications (26 reference statements)
“…To measure the performance of systems, we report the weighted accuracy (WA) and unweighted accuracy (UA) averaged over the 10-fold cross-validation experiments. We use the same dataset and features as other researchers [7,18]. Table 1 presents the performance of the proposed approaches for recognizing speech emotion in comparison with various models.…”
Section: Performance Evaluation (mentioning)
confidence: 99%
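
For readers unfamiliar with these two metrics in the speech-emotion literature: WA weights every utterance equally, while UA is the mean of per-class recalls, so rare emotion classes count as much as frequent ones. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def wa_ua(y_true, y_pred, n_classes):
    """WA: fraction of all utterances classified correctly.
    UA: unweighted mean of per-class recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))
    recalls = [float(np.mean(y_pred[y_true == c] == c))
               for c in range(n_classes) if np.any(y_true == c)]
    ua = float(np.mean(recalls))
    return wa, ua
```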
“…In [19], emotional keywords are exploited to effectively identify the classes. Recently, in [9,10,20], a long short-term memory (LSTM)-based network has been explored to encode the information of both modalities. Furthermore, there have been some attempts to fuse the modalities using the inter-attention mechanism [11,12].…”
Section: Recent Work (mentioning)
confidence: 99%
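
The inter-attention fusion referenced in [11,12] can be pictured as cross-attention between the two modalities' hidden sequences. The sketch below is our paraphrase of that idea, not the cited papers' exact architecture; all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class InterAttentionFusion(nn.Module):
    """Cross-modal attention: text steps attend over audio steps;
    the symmetric audio-to-text direction would be analogous."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, text_seq, audio_seq):
        # text_seq: (batch, T_text, d_model); audio_seq: (batch, T_audio, d_model)
        attended, _ = self.attn(query=text_seq, key=audio_seq, value=audio_seq)
        fused = torch.cat([text_seq, attended], dim=-1)  # (batch, T_text, 2*d_model)
        return self.proj(fused).mean(dim=1)              # utterance-level fused vector
```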
“…A total of 10 unique speakers participated in this work. Following previous research [9,10,12], we assign a single categorical emotion to each utterance on which a majority of annotators agreed. The final dataset contains 7,487 utterances in total (1,103 angry, 1,041 excited, 595 happy, 1,084 sad, 1,849 frustrated, 107 surprised and 1,708 neutral).…”
Section: Dataset and Experimental Setup (mentioning)
confidence: 99%
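
The majority-agreement filtering described in that statement can be sketched as follows (our illustration; the cited papers' exact agreement criterion may differ):

```python
from collections import Counter

def majority_emotion(annotations):
    """Keep an utterance only when a strict majority of annotators agreed
    on one label; otherwise return None so the utterance is discarded."""
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes > len(annotations) / 2 else None

# majority_emotion(["sad", "sad", "frustrated"])   -> "sad"
# majority_emotion(["sad", "happy", "frustrated"]) -> None
```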
“…In [1,2,3], feature learning from the raw waveform or spectrogram using CNN- and LSTM-based models is explored. In [4,5,6,7], CNN- and LSTM-based models are explored on feature representations such as MFCC and OpenSMILE [8] features. In [9,10,11,12], the adversarial learning paradigm is explored for robust recognition.…”
Section: Introduction (mentioning)
confidence: 99%