ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746418
Mmlatch: Bottom-Up Top-Down Fusion For Multimodal Sentiment Analysis

Cited by 23 publications (5 citation statements)
References 36 publications
“…To separately address the relationships between each pair of modalities, BBFN integrates two bimodal fusion modules along with a gated control mechanism. Given that top-down interactions remain unaccounted for in previous approaches, MMLatch (Paraskevopoulos, Georgiou, and Potamianos 2022) addresses this limitation by incorporating a feedback mechanism in the forward pass.…”
Section: Related Work, Multimodal Fusion Methods
Mentioning confidence: 99%
“…Kumar et al [39] achieved deep multimodal feature vector fusion by introducing learnable gating mechanisms, self-attended context representations, and recurrent layer-based self and gated cross-fusion. Paraskevopoulos et al [17] proposed a neural architecture for multimodal fusion, utilizing a feedback mechanism in the forward pass during network training to capture top-down cross-modal interactions. Subsequently, numerous studies have employed even more novel approaches.…”
Section: Multimodal Sentiment Analysis
Mentioning confidence: 99%
“…MMLATCH [17]: Bottom-up, top-down fusion proposes a neural architecture that captures top-down cross-modal interactions by using a feedback mechanism in the forward pass during network training.…”
Section: Baselines
Mentioning confidence: 99%
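The bottom-up, top-down feedback mechanism described in the statements above can be sketched roughly as follows: a first bottom-up pass encodes each modality and fuses the results; gates derived from the fused high-level summary then mask the low-level inputs, which are re-encoded in a second bottom-up step inside the same forward pass. This is a minimal illustrative sketch, not the paper's exact architecture (MMLatch builds on recurrent encoders and attention-based masking); all dimensions, weight names, and the mean-pooling fusion here are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(x, W):
    # Stand-in for a per-modality encoder (the paper uses recurrent encoders).
    return np.tanh(x @ W)

# Hypothetical feature sizes for text/audio/visual streams, hidden size 8,
# over a sequence of 5 time steps.
d_t, d_a, d_v, h = 10, 6, 4, 8
x_t = rng.normal(size=(5, d_t))
x_a = rng.normal(size=(5, d_a))
x_v = rng.normal(size=(5, d_v))
W_t, W_a, W_v = (rng.normal(size=(d, h)) * 0.1 for d in (d_t, d_a, d_v))

# 1) Bottom-up pass: encode each modality independently.
h_t, h_a, h_v = encode(x_t, W_t), encode(x_a, W_a), encode(x_v, W_v)

# 2) High-level fused summary (simple concatenation + mean pooling here).
fused = np.concatenate([h_t, h_a, h_v], axis=-1).mean(axis=0)

# 3) Top-down feedback: sigmoid gates derived from the fused summary
#    mask the low-level inputs of each modality.
U_t, U_a, U_v = (rng.normal(size=(3 * h, d)) * 0.1 for d in (d_t, d_a, d_v))
g_t, g_a, g_v = sigmoid(fused @ U_t), sigmoid(fused @ U_a), sigmoid(fused @ U_v)

# 4) Second bottom-up pass on the masked inputs, still within one forward pass.
h_t2 = encode(x_t * g_t, W_t)
h_a2 = encode(x_a * g_a, W_a)
h_v2 = encode(x_v * g_v, W_v)

# Final multimodal representation fed to the sentiment classifier head.
out = np.concatenate([h_t2.mean(axis=0), h_a2.mean(axis=0), h_v2.mean(axis=0)])
print(out.shape)
```

Because the gates are produced and applied inside the forward pass, the gradient of the task loss flows through the feedback path as well, which is what lets the high-level fused representation shape low-level feature extraction end to end.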
“…Hazarika et al designed a new framework that projects modalities into modality-invariant and modality-specific subspaces to achieve a more holistic view of the multimodal data [36]. Paraskevopoulos et al [37] introduced a neural architecture that adeptly captures cross-modal interactions from a top-down perspective to analyze users' sentiment. Transformer-based methods have also been proposed for MSA tasks, such as the multi-layer fusion module based on the transformer-encoder developed by Li et al [38], which incorporates contrastive learning to further explore sentiment features, and the text-enhanced transformer fusion model proposed by Wang et al to better understand text-oriented pairwise cross-modal mappings and acquire crucial unified multimodal representations [39].…”
Section: Multimodal Sentiment Analysis
Mentioning confidence: 99%