FiLMing Multimodal Sarcasm Detection with Attention

Gupta, Sundesh; Shah, Aditya; Shah, Miten; Syiemlieh, Laribok; Maurya, Chandresh Kumar

doi:10.1007/978-3-030-92307-5_21

Cited by 5 publications

(6 citation statements)

References 10 publications


“…Noticing this issue, nowadays the research interests have shifted to exploring the task of multimodal sarcasm detection (MSD), whose key objective is to accurately detect the inter- and intra-modal incongruities of someone's implied sentiment expression within the given context. Early approaches incorporated fusion techniques that combined entire text and image features by concatenating operation (Pan et al 2020) or attention mechanism (Gupta et al 2021). Despite their considerable progress, they overlook the possibility that sarcastic information may be expressed in some local segments of the text and certain regions of the image.…”

Section: Sota Model (mentioning)

confidence: 99%


Debiasing Multimodal Sarcasm Detection with Contrastive Learning

Jia, Xie, Jing. 2024. AAAI


Despite commendable achievements made by existing work, prevailing multimodal sarcasm detection studies rely more on textual content over visual information. It unavoidably induces spurious correlations between textual words and labels, thereby significantly hindering the models' generalization capability. To address this problem, we define the task of out-of-distribution (OOD) multimodal sarcasm detection, which aims to evaluate models' generalizability when the word distribution is different in training and testing settings. Moreover, we propose a novel debiasing multimodal sarcasm detection framework with contrastive learning, which aims to mitigate the harmful effect of biased textual factors for robust OOD generalization. In particular, we first design counterfactual data augmentation to construct the positive samples with dissimilar word biases and negative samples with similar word biases. Subsequently, we devise an adapted debiasing contrastive learning mechanism to empower the model to learn robust task-relevant features and alleviate the adverse effect of biased words. Extensive experiments show the superiority of the proposed framework.


“…Textual Encoding. To better model the semantic information in the textual sentence, we feed it into the pre-trained language encoder RoBERTa (Liu et al 2019), which has gained appreciative results in multimodal language understanding tasks (Cao et al 2022; Gupta et al 2021),…”

Section: Msd Model Initialization (mentioning)

confidence: 99%

Debiasing Multimodal Sarcasm Detection with Contrastive Learning

Jia, Xie, Jing. 2024. AAAI


“…Arevalo et al [43] propose the Gated Multimodal Unit (GMU) model, which controls the influence of input modalities on unit activation levels for data fusion. Gupta et al [44] introduce a Collaborative Attention Model based on RoBERTa and FiLMed ResNet, addressing the issue of visual-text inconsistency through joint attention mechanisms.…”

Section: B Multimodal Sentiment Analysis (mentioning)

confidence: 99%
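The gating idea behind the Gated Multimodal Unit named in the statement above can be sketched in a few lines. This is a minimal illustration of the commonly described bimodal formulation (a sigmoid gate `z` that mixes tanh-activated visual and textual projections), not code from any of the cited papers; the function names, weight matrices, and toy inputs are all hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Matrix-vector product for small dense matrices."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def gmu(x_v: Vector, x_t: Vector, w_v: Matrix, w_t: Matrix, w_z: Matrix) -> Vector:
    """Bimodal gated fusion: h = z * tanh(W_v x_v) + (1 - z) * tanh(W_t x_t)."""
    h_v = [math.tanh(a) for a in matvec(w_v, x_v)]  # visual hidden features
    h_t = [math.tanh(a) for a in matvec(w_t, x_t)]  # textual hidden features
    # The gate sees both modalities (concatenated) and weights each hidden unit.
    z = [sigmoid(a) for a in matvec(w_z, x_v + x_t)]
    return [zi * hv + (1.0 - zi) * ht for zi, hv, ht in zip(z, h_v, h_t)]


# Hypothetical toy inputs and weights (assumptions for illustration only).
x_v = [1.0, 0.0]
x_t = [0.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_gate = [[0.0] * 4, [0.0] * 4]  # all-zero gate weights give z = 0.5 everywhere

fused = gmu(x_v, x_t, identity, identity, zero_gate)
```

With all-zero gate weights the gate is 0.5 for every unit, so the output is the plain average of the two modality projections; learned gate weights would instead let one modality dominate per unit.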

Cross-Modal Sentiment Analysis Based on CLIP Image-Text Attention Interaction

Lu, Ni, Ding. 2024. IJACSA


Multimodal sentiment analysis is a traditional text-based sentiment analysis technique. However, the field of multimodal sentiment analysis still faces challenges such as inconsistent cross-modal feature information, poor interaction capabilities, and insufficient feature fusion. To address these issues, this paper proposes a cross-modal sentiment model based on CLIP image-text attention interaction. The model utilizes pre-trained ResNet50 and RoBERTa to extract primary image-text features. After contrastive learning with the CLIP model, it employs a multi-head attention mechanism for cross-modal feature interaction to enhance information exchange between different modalities. Subsequently, a cross-modal gating module is used to fuse feature networks, combining features at different levels while controlling feature weights. The final output is fed into a fully connected layer for sentiment recognition. Comparative experiments are conducted on the publicly available datasets MSVA-Single and MSVA-Multiple. The experimental results demonstrate that our model achieved accuracy rates of 75.38% and 73.95%, and F1-scores of 75.21% and 73.83% on the mentioned datasets, respectively. This indicates that the proposed approach exhibits higher generalization and robustness compared to existing sentiment analysis models.


“…Gupta proposed FiLM, which uses FiLMed ResNet blocks to modulate input image and text features to integrate feature affine transformations (FiLM) for capturing multimodal information. The model connects the output of the CLS token from RoBERTa for the final prediction [16].…”

Section: Multi-modal Sarcasm Detection (mentioning)

confidence: 99%
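The "feature affine transformations (FiLM)" mentioned in the statement above refer to feature-wise linear modulation: each channel of a feature map is scaled by a `gamma` and shifted by a `beta` predicted from a conditioning input. The sketch below illustrates only that generic operation. It assumes, for illustration, that a text embedding conditions image feature channels; the projection weights, shapes, and names are hypothetical and do not reproduce the architecture of Gupta et al.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def linear(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Dense layer: returns weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features: FeatureMap, gamma: Vector, beta: Vector) -> FeatureMap:
    """Per-channel affine transform: gamma[c] * features[c] + beta[c]."""
    return [
        [[g * v + b for v in row] for row in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Hypothetical toy inputs: a 2-channel 2x2 image feature map and a 3-dim text embedding.
image_features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [1.0, 1.0]],
]
text_embedding = [1.0, 0.0, -1.0]

# Hypothetical projections mapping the text embedding to one (gamma, beta) pair per channel.
w_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, -2.0]]
b_gamma = [0.0, 0.0]
w_beta = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b_beta = [0.5, 0.0]

gamma = linear(w_gamma, b_gamma, text_embedding)
beta = linear(w_beta, b_beta, text_embedding)
modulated = film(image_features, gamma, beta)
```

Because `gamma` and `beta` depend only on the conditioning vector, the same image features are rescaled differently for different texts, which is what lets one modality steer the processing of the other.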

Granularity Based Inter and Intra-Modal Fusion Network for Sarcasm Detection

Shi, Zhao, Chen. 2023. Preprint


Multi-modal sarcasm detection is a task that involves detecting and identifying sarcasm using multiple modalities of information. The key aspect of this task lies in how to model intra and inter-modality incongruity. Existing multi-modal sarcasm detection methods often focus on incongruity between modalities while overlooking the potential of fully exploring the semantic information within each modality. We tackle the problem in this paper by designing the Granularity Based Inter and Intra-Modal Fusion Network (GIIFN). Our approach combines traditional image algorithms with deep learning models to extract comprehensive and rich semantic information from images. By incorporating a pre-trained language model, we leverage knowledge from large-scale textual data to enhance the analysis of images. Furthermore, our feature interaction model can effectively fuse and interact features at different granularities, capturing fine details and contextual information in the images. Through extensive experimental validation, we demonstrate the outstanding performance of our approach in multimodal satire detection tasks, surpassing existing methods and outperforming state-of-the-art results.
