Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413577
Transformer-based Label Set Generation for Multi-modal Multi-label Emotion Detection

Cited by 38 publications (13 citation statements) | References 21 publications
“…Recently, multi-modal multi-label emotion recognition has aroused increasing interest. For example, (Ju et al. 2020; Zhang et al. 2021a) model modality-to-label and feature-to-label dependencies in addition to label correlations.…”
Section: Related Work
confidence: 99%
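The paper's title describes label set generation with a transformer; the sketch below illustrates that general idea in PyTorch, emitting labels one at a time so earlier predictions can condition later ones (one way to capture label correlations). The label vocabulary, <bos>/<eos> tokens, greedy decoding, and all dimensions are illustrative assumptions, not the paper's exact model.

```python
# Illustrative sketch of label-set generation with a transformer decoder.
# The vocabulary, special tokens, and greedy loop are assumptions.
import torch
import torch.nn as nn

LABELS = ["<bos>", "happy", "sad", "angry", "surprise", "<eos>"]

class LabelSetDecoder(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(len(LABELS), d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, len(LABELS))

    @torch.no_grad()
    def generate(self, memory, max_len=4):
        # memory: fused multi-modal features, shape (1, seq_len, d_model)
        ys = torch.zeros(1, 1, dtype=torch.long)  # start with <bos> (id 0)
        emitted = []
        for _ in range(max_len):
            h = self.decoder(self.embed(ys), memory)   # (1, cur_len, d_model)
            next_id = self.out(h[:, -1]).argmax(-1).item()
            if LABELS[next_id] == "<eos>":
                break
            emitted.append(LABELS[next_id])
            ys = torch.cat([ys, torch.tensor([[next_id]])], dim=1)
        return emitted

dec = LabelSetDecoder()
print(dec.generate(torch.randn(1, 10, 128)))  # e.g. ['sad', 'angry']
```

Generating labels sequentially, rather than scoring them independently, is what lets earlier emitted emotions influence later ones.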
“…In real-world applications, videos are often characterized by heterogeneous representations (i.e., visual, audio and text) and annotated with various emotion labels (e.g., happy, surprise). Multi-modal Multi-label Emotion Recognition (MMER) (Ju et al. 2020; Zhang et al. 2021a) refers to identifying various emotions by leveraging the visual, audio and text modalities present in videos.…”
Section: Introduction
confidence: 99%
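As a concrete picture of the MMER task the quote defines, here is a minimal PyTorch sketch that fuses visual, audio and text features and scores each emotion label independently with a sigmoid, so one video can receive several labels at once. The feature dimensions, concatenation-based fusion, and label count are assumptions for illustration only.

```python
# Minimal multi-modal multi-label sketch: one sigmoid per emotion label.
# Dimensions and the concatenation fusion are illustrative assumptions.
import torch
import torch.nn as nn

class SimpleMMER(nn.Module):
    def __init__(self, d_visual=35, d_audio=74, d_text=300,
                 d_hidden=128, num_labels=6):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.proj_v = nn.Linear(d_visual, d_hidden)
        self.proj_a = nn.Linear(d_audio, d_hidden)
        self.proj_t = nn.Linear(d_text, d_hidden)
        # Independent head per label: multi-label, not multi-class.
        self.classifier = nn.Linear(3 * d_hidden, num_labels)

    def forward(self, v, a, t):
        fused = torch.cat([self.proj_v(v).relu(),
                           self.proj_a(a).relu(),
                           self.proj_t(t).relu()], dim=-1)
        return torch.sigmoid(self.classifier(fused))  # one probability per emotion

model = SimpleMMER()
v, a, t = torch.randn(4, 35), torch.randn(4, 74), torch.randn(4, 300)
probs = model(v, a, t)   # shape (4, 6)
preds = probs > 0.5      # each sample may activate several labels
```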
“…where ATT_self denotes the self-modal multi-head attention of (Vaswani et al., 2017), and ATT_cross denotes the cross-modal multi-head attention of (Ju et al., 2020). O_rel and T_rel are the pre-trained embeddings of image I and text X.…”
Section: Cross-modal Relation Detection
confidence: 99%
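To make the distinction in the quoted equation concrete, the sketch below contrasts self-modal attention (query, key and value all from one modality) with cross-modal attention (queries from one modality attending over another), using PyTorch's nn.MultiheadAttention. The sequence lengths, model width, and the O_rel/T_rel stand-in tensors are assumptions, not the cited papers' actual configurations.

```python
# Self-modal vs. cross-modal multi-head attention, in the sense of the
# quoted equation. Shapes and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
att_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
att_cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

O_rel = torch.randn(2, 49, d_model)  # image-region embeddings (stand-in for O_rel)
T_rel = torch.randn(2, 20, d_model)  # text-token embeddings (stand-in for T_rel)

# ATT_self(O_rel): query, key and value all come from the image modality.
o_self, _ = att_self(O_rel, O_rel, O_rel)

# ATT_cross(T_rel, O_rel): text queries attend over image keys/values,
# letting each token gather relevant visual context.
t_cross, _ = att_cross(T_rel, O_rel, O_rel)

print(o_self.shape, t_cross.shape)  # (2, 49, 256) (2, 20, 256)
```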
“…Compared to CNNs, the attention mechanism learns more global dependencies; therefore, the transformer also shows strong performance in low-level tasks [3]. The transformer has also proven effective in the multi-modal area, including multi-modal representations [45] and applications [13, 19, 31]. Inspired by the extensive applications of the transformer, we integrate the transformer encoder-decoder into the document image rectification problem.…”
Section: Transformer in Language and Vision
confidence: 99%