2022
DOI: 10.1609/aaai.v36i8.20895

Tailor Versatile Multi-Modal Learning for Multi-Label Emotion Recognition

Abstract: Multi-modal Multi-label Emotion Recognition (MMER) aims to identify various human emotions from heterogeneous visual, audio and text modalities. Previous methods mainly focus on projecting multiple modalities into a common latent space and learning an identical representation for all labels, which neglects the diversity of each modality and fails to capture richer semantic information for each label from different perspectives. Besides, associated relationships of modalities and labels have not been fully expl…
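To make the critique in the abstract concrete, the sketch below shows the kind of common-latent-space baseline it describes: every modality is projected into one shared space and a single identical representation scores all emotion labels. This is an illustrative PyTorch sketch, not the paper's model; the feature dimensions, pooling, and layer sizes are assumptions.

import torch
import torch.nn as nn

class CommonSpaceMMER(nn.Module):
    # Baseline the abstract critiques: all modalities are projected into one shared
    # latent space, and a single representation scores every emotion label.
    def __init__(self, dims=(35, 74, 300), hidden=128, num_labels=6):
        super().__init__()
        # one linear projection per modality (visual, audio, text); dims are assumed
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, visual, audio, text):
        # mean-pool each sequence over time, project, then average into one shared vector
        feats = [p(x.mean(dim=1)) for p, x in zip(self.proj, (visual, audio, text))]
        shared = torch.stack(feats, dim=0).mean(dim=0)
        return torch.sigmoid(self.classifier(shared))  # multi-label probabilities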

Cited by 27 publications (12 citation statements)
References 42 publications
“…For instance, Fu et al. (2022) proposed a non-homogeneous fusion network by first fusing audio and visual feature sequences through an attention aggregation module and then fusing audio-visual features with the textual feature sequence via cross-modal attention. Similarly, Zhang et al. (2022) proposed a hierarchical cross-modal encoder module to gradually fuse the modality features. Specifically, an adversarial multimodal refinement module was designed to decompose each modality's features into common and private representations.…”
Section: Related Work
Mentioning (confidence: 99%)
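As a concrete reading of the two-stage fusion described in the excerpt above (audio and visual fused first, then fused with text via cross-modal attention), here is a minimal PyTorch sketch built from standard multi-head attention. It is not Fu et al.'s or Zhang et al.'s actual module; the layer choices and shapes are assumptions.

import torch.nn as nn

class BimodalThenTextFusion(nn.Module):
    # Two-stage fusion: audio attends to visual to form an audio-visual sequence,
    # which then attends to the textual sequence via cross-modal attention.
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.av_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.avt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual, text):
        # stage 1: a simple stand-in for the attention aggregation over audio/visual
        av, _ = self.av_attn(query=audio, key=visual, value=visual)
        # stage 2: the bimodal sequence queries the textual sequence
        fused, _ = self.avt_attn(query=av, key=text, value=text)
        return fused.mean(dim=1)  # pooled multimodal representation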
“…In these works (Tsai et al., 2019; He et al., 2021; Zheng et al., 2022), the audio, visual, and text modalities were treated as three time-series that play the same role. Several works proposed to first fuse the audio and visual feature sequences into a higher-level space, then fuse this bimodal feature sequence with the textual feature sequence (Fu et al., 2022; Zhang et al., 2022). Alternatively, text-centered frameworks were designed to explore the cross-modal interactions between textual and non-textual feature sequences (Han et al., 2021; He and Hu, 2021; Wu et al., 2021).…”
Section: Introduction
Mentioning (confidence: 99%)
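The text-centered alternative mentioned in the excerpt above can be sketched the same way: the textual sequence acts as the query against the audio and visual sequences, so all cross-modal interaction runs through text. Again a hedged illustration with assumed dimensions and pooling, not any cited framework's API.

import torch
import torch.nn as nn

class TextCenteredFusion(nn.Module):
    # Text-centered interaction: the textual sequence queries audio and visual,
    # and remains the anchor of the fused representation.
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.text_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(3 * dim, dim)

    def forward(self, text, audio, visual):
        ta, _ = self.text_audio(query=text, key=audio, value=audio)     # text attends to audio
        tv, _ = self.text_visual(query=text, key=visual, value=visual)  # text attends to visual
        fused = self.merge(torch.cat([text, ta, tv], dim=-1))
        return fused.mean(dim=1)  # pooled text-anchored representation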
“…However, unifying multiple modalities into one identical representation can inevitably neglect the specificity of each modality, thus losing the rich discriminative features. Although recent works (Hazarika, Zimmermann, and Poria 2020; Zhang et al. 2022) attempt to learn modality-specific representations, they still utilize attention to fuse these representations into one. Therefore, a key challenge of MMER is how to effectively represent multi-modal data while maintaining modality specificity and integrating complementary information.…”
Section: Introduction
Mentioning (confidence: 99%)
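The common/private (modality-specific vs. shared) split discussed in the excerpt above can be illustrated with a small module that keeps one private encoder per modality plus one common encoder shared by all, then fuses the resulting parts with attention. This is a simplified sketch under assumed pooled inputs; the adversarial objective that typically enforces the decomposition is omitted.

import torch
import torch.nn as nn

class CommonPrivateEncoder(nn.Module):
    # Each modality keeps a private (modality-specific) encoder; one common encoder is
    # shared across modalities; a scalar attention score fuses all parts into one vector.
    def __init__(self, dim=128, num_modalities=3):
        super().__init__()
        self.private = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_modalities)])
        self.common = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)  # one attention score per representation

    def forward(self, feats):
        # feats: list of (batch, dim) pooled vectors, one per modality
        parts = [p(x) for p, x in zip(self.private, feats)] + [self.common(x) for x in feats]
        stacked = torch.stack(parts, dim=1)                  # (batch, 2 * num_modalities, dim)
        weights = torch.softmax(self.score(stacked), dim=1)  # attention over all parts
        return (weights * stacked).sum(dim=1)                # fused representation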
“…Compared to unisensory recognition, multisensory recognition, which considers information from various senses, demonstrates superior performance. This superiority has led to extensive research in multisensory recognition tasks, resulting in significant progress in areas such as emotion recognition (Ju et al. 2020; Zhang et al. 2022), medical diagnosis (Boehm et al. 2022a,b), and intelligent robotics (Papanastasiou et al. 2019; Heredia et al. 2022).…”
Section: Introduction
Mentioning (confidence: 99%)