Interspeech 2019
DOI: 10.21437/interspeech.2019-3243

Deep Hierarchical Fusion with Application in Sentiment Analysis

Abstract: Recognizing the emotional tone in spoken language is a challenging research problem that requires modeling not only the acoustic and textual modalities separately but also their cross-interactions. In this work, we introduce a hierarchical fusion scheme for sentiment analysis of spoken sentences. Two bidirectional Long Short-Term Memory networks (BiLSTM), followed by multiple fully connected layers, are trained in order to extract feature representations for each of the textual and audio modalities. The representations…
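The architecture sketched in the abstract (one BiLSTM followed by fully connected layers per modality, with the resulting text and audio representations fused for sentence-level sentiment) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: all dimensions and layer sizes are assumptions, and only a single fusion point is shown, whereas the paper fuses the two branches hierarchically at multiple levels.

```python
# Minimal sketch of a two-branch BiLSTM encoder with a simple fusion head.
# Hidden sizes, layer counts, and the single fusion point are illustrative
# assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """BiLSTM over a feature sequence, followed by fully connected layers."""
    def __init__(self, input_dim, hidden_dim=64, repr_dim=32):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden_dim, repr_dim), nn.ReLU(),
            nn.Linear(repr_dim, repr_dim), nn.ReLU(),
        )

    def forward(self, x):              # x: (batch, time, input_dim)
        out, _ = self.bilstm(x)        # (batch, time, 2 * hidden_dim)
        return self.fc(out[:, -1])     # last time step -> (batch, repr_dim)

class FusionSentimentModel(nn.Module):
    """Encodes text and audio separately, then fuses the two representations."""
    def __init__(self, text_dim, audio_dim, repr_dim=32, num_classes=2):
        super().__init__()
        self.text_enc = ModalityEncoder(text_dim, repr_dim=repr_dim)
        self.audio_enc = ModalityEncoder(audio_dim, repr_dim=repr_dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * repr_dim, repr_dim), nn.ReLU(),
            nn.Linear(repr_dim, num_classes),
        )

    def forward(self, text, audio):
        fused = torch.cat([self.text_enc(text), self.audio_enc(audio)], dim=-1)
        return self.classifier(fused)

# Example: a batch of 4 utterances, 20 word vectors and 50 acoustic frames each.
model = FusionSentimentModel(text_dim=300, audio_dim=74)
logits = model(torch.randn(4, 20, 300), torch.randn(4, 50, 74))  # shape (4, 2)
```

Moving toward the paper's hierarchical scheme would mean fusing the branches' intermediate states (word/frame-level LSTM outputs and dense-layer activations) as well, not only the final utterance representations.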

Cited by 28 publications (23 citation statements)
References 20 publications
“…[5] 76.5 73.4; Zadeh et al. [7] 76.9 77.0; Georgiou et al. [9] 76.9 76.9; Poria et al. [2] 77.64 -; Ghosal et al. [10] 82.31 80.69; Ghosal et al. [10] 79.80 -; Sun et al. [4] 80… [7], (¦) results are obtained on the CMU-MOSEI dataset after excluding the utterances with a sentiment score of 0. We mention the results of the proposed model with this setup in the parenthesis.…”
Section: CMU-MOSEI Approach (mentioning)
confidence: 99%
“…Methods that jointly learn the interactions between two or three modalities [3,4], and 3. Methods that explicitly learn contributions from these unimodal and cross-modal cues, typically using attention-based techniques [5,6,7,8,9,10].…”
Section: Introduction (mentioning)
confidence: 99%
“…In contrast, early fusion can model interactions across modalities at the raw-feature stage. Georgiou et al. [4] concatenated features from different modalities at various levels and used a multi-layer perceptron for emotion prediction. Generally speaking, concatenation-based early fusion methods do not outperform the late fusion methods in SER [5].…”
Section: Introduction (mentioning)
confidence: 99%
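The statement above contrasts concatenation-based early fusion with late fusion. For reference, a generic concatenation-plus-MLP fusion head looks roughly like the following; feature dimensions and layer sizes are illustrative assumptions rather than details taken from [4] or [5].

```python
# Sketch of concatenation-based fusion: utterance-level text and audio feature
# vectors are concatenated and passed through a multi-layer perceptron.
# All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ConcatFusionMLP(nn.Module):
    def __init__(self, text_dim=300, audio_dim=74, hidden=128, num_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, text_feats, audio_feats):   # each: (batch, dim)
        return self.mlp(torch.cat([text_feats, audio_feats], dim=-1))

logits = ConcatFusionMLP()(torch.randn(8, 300), torch.randn(8, 74))  # (8, 2)
```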
“…The general modus operandi of SLU systems is to convert voice into text using an ASR engine and apply natural language understanding (NLU) to the transcribed text, modelling conversational and channel properties while being robust to ASR errors. Since spoken conversation is an amalgamation of spontaneous speaker interactions, it has become imperative for model architectures to capture multimodal features from text and speech modalities (Georgiou et al., 2019). The aim of these multimodal systems is to capture acoustic information such as pitch, intonation, rate of speech, etc.…”
Section: Introduction (mentioning)
confidence: 99%
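The statement above motivates pairing ASR transcripts with acoustic cues such as pitch and intonation. A hedged sketch of extracting simple utterance-level acoustic descriptors with librosa follows; the file path and the choice of descriptors are assumptions, and rate of speech would additionally require time-aligned ASR output.

```python
# Sketch: utterance-level acoustic descriptors (pitch and energy statistics)
# of the kind multimodal SLU models commonly pair with ASR transcripts.
# "utterance.wav" is a placeholder path; the descriptor set is illustrative.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)

# Frame-level fundamental frequency (pitch) via probabilistic YIN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Frame-level energy (RMS).
rms = librosa.feature.rms(y=y)[0]

# Collapse to a fixed-size utterance vector (mean/std statistics).
features = np.array([
    np.nanmean(f0), np.nanstd(f0),   # pitch level and variability
    rms.mean(), rms.std(),           # loudness level and variability
    voiced_prob.mean(),              # proportion of voiced frames (proxy)
])
print(features.shape)  # (5,)
```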