2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8462440

Deep Multimodal Learning for Emotion Recognition in Spoken Language

Abstract: In this paper, we present a novel deep multimodal framework to predict human emotions based on sentence-level spoken language. Our architecture has two distinctive characteristics. First, it extracts the high-level features from both text and audio via a hybrid deep multimodal structure, which considers the spatial information from text, temporal information from audio, and high-level associations from low-level handcrafted features. Second, we fuse all features by using a three-layer deep neural network to le…
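The fusion step the abstract describes (concatenating the per-modality features and passing them through a three-layer deep neural network) can be sketched as below. This is a minimal forward-pass illustration, not the paper's implementation: the feature dimensions, random weights, and four-class output are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fuse_features(text_feat, audio_feat, weights):
    """Concatenate modality features and run them through a three-layer DNN.

    text_feat, audio_feat: 1-D feature vectors (dimensions are illustrative).
    weights: list of three (W, b) pairs, one per layer.
    """
    h = np.concatenate([text_feat, audio_feat])
    for W, b in weights[:-1]:          # hidden layers with ReLU
        h = relu(W @ h + b)
    W, b = weights[-1]                 # output layer
    logits = W @ h + b
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

# Assumed dimensions: 128-d text features, 128-d audio features,
# 4 emotion classes (all hypothetical, for illustration only).
rng = np.random.default_rng(0)
dims = [256, 128, 64, 4]
weights = [(rng.standard_normal((o, i)) * 0.1, np.zeros(o))
           for i, o in zip(dims[:-1], dims[1:])]
probs = fuse_features(rng.standard_normal(128),
                      rng.standard_normal(128), weights)
```

The output is a probability distribution over the emotion classes; in practice the weights would be learned jointly with the text and audio feature extractors.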

Cited by 37 publications (42 citation statements)
References 15 publications
“…Compared with [20,31] based on the IEMOCAP database, the model presented in this paper performs better, as shown in Table 2. For modal fusion, the feature-level and decision-level fusion methods are both useful.…”
Section: Results
confidence: 88%
“…In order to test the performance of the multimodal emotion recognition model proposed in this paper, we compared it with other models on the IEMOCAP database. Gu et al. [20] applied a CNN-LSTM to process the speech data and CNNs for textual feature learning; finally, they integrated all features and trained them with a three-layer deep neural network. They adopted the feature fusion method, which we also referenced.…”
Section: Results
confidence: 99%
“…The encoder-decoder model was recently introduced in natural language processing and computer vision to model sequential data such as phrases [10,11,29,30] and videos [13]. It has shown great performance on a number of tasks including machine translation [6], question answering [25] and video description [13].…”
Section: Related Work
confidence: 99%