Interspeech 2019
DOI: 10.21437/interspeech.2019-3201
Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts

Abstract: In human perception and understanding, a number of different and complementary cues are adopted according to different modalities. Various emotional states in communication between humans reflect this variety of cues across modalities. Recent developments in multi-modal emotion recognition utilize deep learning techniques to achieve remarkable performance, with models based on different features suitable for text, audio and vision. This work focuses on cross-modal fusion techniques over deep learning models fo…

Cited by 49 publications (21 citation statements)
References 21 publications (29 reference statements)
“…In [19], emotional keywords are exploited to effectively identify the classes. Recently in [9,10,20], a long short-term memory (LSTM) based network has been explored to encode the information of both modalities. Furthermore, there have been some attempts to fuse the modalities using the inter-attention mechanism [11,12].…”
Section: Recent Work
confidence: 99%
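The LSTM-based bimodal encoding these citing papers describe can be sketched as follows: one LSTM per modality, with the final hidden states combined for utterance-level classification. This is a minimal illustration, assuming log-Mel audio frames and pretrained word embeddings as inputs; the dimensions, concatenation, and linear head are illustrative assumptions, not the cited papers' exact architectures.

```python
import torch
import torch.nn as nn

class BimodalLSTMEncoder(nn.Module):
    """Separate LSTMs encode the audio-frame sequence and the word-embedding
    sequence of one utterance; the final hidden states are concatenated.
    Input dimensions and the classifier head are illustrative assumptions."""

    def __init__(self, audio_dim=40, text_dim=300, hidden=128, n_classes=4):
        super().__init__()
        self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.text_lstm = nn.LSTM(text_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio_seq, text_seq):
        # audio_seq: (B, T_a, audio_dim); text_seq: (B, T_t, text_dim)
        _, (h_a, _) = self.audio_lstm(audio_seq)
        _, (h_t, _) = self.text_lstm(text_seq)
        fused = torch.cat([h_a[-1], h_t[-1]], dim=-1)  # (B, 2 * hidden)
        return self.classifier(fused)
```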
“…Figure 1 shows the architecture of the proposed AMH model. Previous research used multi-modal information independently, fusing information over each modality with a neural network model [9,20]. Recently, researchers have also investigated an inter-attention mechanism over the modalities [11,12].…”
Section: Proposed Attentive Modality Hopping Mechanism
confidence: 99%
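Read loosely, one "hop" of the attentive modality hopping idea uses the summary vector of one modality as the query for attention over the other modality's hidden states, then alternates. A minimal sketch of a single hop, assuming equal hidden dimensions across modalities; the published AMH model's exact scoring, gating, and number of hops may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityHop(nn.Module):
    """One attention hop: a summary of modality A re-weights the hidden
    states of modality B and returns an updated summary of B."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, query, states):
        # query: (B, dim) summary of modality A
        # states: (B, T, dim) hidden states of modality B
        scores = torch.bmm(states, self.proj(query).unsqueeze(-1)).squeeze(-1)  # (B, T)
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights.unsqueeze(1), states).squeeze(1)  # (B, dim)
```

Hopping then alternates: the audio summary attends over the text states, the resulting text summary attends back over the audio states, and so on for a fixed number of hops before classification.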
“…In order to take advantage of the linguistic content in SER, the fusion of both textual and audio information gains in popularity [32], [33], [34]. Three strategies are usually applied for multi-modal fusion: (a) at the feature level by concatenating the inputs of different modalities, (b) at the decision level with majority voting, or (c) at the model level by merging intermediate representations [7], [10], [35], [36]. More precisely, the fused model (c) concatenates the outputs of two distinct networks, one per modality, to feed the next layers [37].…”
Section: Modality Fusion
confidence: 99%
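The three strategies in this statement can be illustrated directly. A minimal sketch; the encoder modules, dimensions, and tie-break note are assumptions for illustration, not any cited paper's implementation.

```python
import torch
import torch.nn as nn

# (a) Feature-level (early) fusion: concatenate the features of both
# modalities before feeding any model.
def feature_level_fusion(audio_feats, text_feats):
    return torch.cat([audio_feats, text_feats], dim=-1)

# (b) Decision-level (late) fusion: majority vote over the hard class
# predictions of per-modality classifiers. With only two voters a
# tie-break rule (e.g. trusting the more confident modality) is needed.
def decision_level_fusion(preds):
    # preds: list of (batch,) integer class predictions, one per classifier
    return torch.stack(preds, dim=0).mode(dim=0).values  # (batch,)

# (c) Model-level fusion: concatenate intermediate representations of two
# modality-specific networks and feed the result to shared layers.
class ModelLevelFusion(nn.Module):
    def __init__(self, audio_enc, text_enc, dim, n_classes):
        super().__init__()
        self.audio_enc, self.text_enc = audio_enc, text_enc
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, n_classes))

    def forward(self, audio, text):
        z = torch.cat([self.audio_enc(audio), self.text_enc(text)], dim=-1)
        return self.head(z)
```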
“…Traditional feature-level fusion and decision-level fusion, as well as the current tensor-level fusion trend, are widely used [1]. Furthermore, many researchers have also compared the performance of various fusion methods [2]. All of these works demonstrated the utility of combining linguistic information in SER.…”
Section: Introduction
confidence: 99%
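Tensor-level fusion is commonly realized as an outer product of per-modality utterance embeddings, as in Tensor Fusion Networks; whether [1] refers to exactly this variant is an assumption. A minimal two-modality sketch:

```python
import torch

def tensor_fusion(z_a, z_t):
    """Outer-product (tensor-level) fusion of two utterance embeddings.

    Appending a constant 1 to each embedding keeps the unimodal terms
    alongside the bimodal interaction terms (an assumption based on the
    Tensor Fusion Network formulation).
    """
    B = z_a.size(0)
    ones = z_a.new_ones(B, 1)
    za = torch.cat([z_a, ones], dim=-1)  # (B, Da + 1)
    zt = torch.cat([z_t, ones], dim=-1)  # (B, Dt + 1)
    # Batched outer product, flattened for a downstream classifier.
    return torch.bmm(za.unsqueeze(2), zt.unsqueeze(1)).reshape(B, -1)
```

Unlike plain concatenation, the resulting (Da + 1)(Dt + 1)-dimensional vector exposes every pairwise interaction between audio and text features, at the cost of a much larger input to the classifier.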