Beyond Question-Based Biases: Assessing Multimodal Shortcut Learning in Visual Question Answering

Dancette, Corentin; Cadène, Rémi; Teney, Damien; Cord, Matthieu

doi:10.1109/iccv48922.2021.00160

Cited by 44 publications

(22 citation statements)

References 27 publications

“…Deep neural networks often solve the task-specific problem, e.g., image classification, by learning the shortcuts such as the correlations of cows and grass instead of the intended solution, e.g., the features from cows [8]. Recently, the shortcut in deep learning models gains increasing attention across the deep learning field from computer vision (CV) [3,32,55], natural language processing (NLP) [31,36] to reinforcement learning [1]. To date, various methods have been devised to mitigate the negative effects of shortcuts [27].…”

Section: Shortcut Learning (mentioning)

confidence: 99%

Eye-gaze-guided Vision Transformer for Rectifying Shortcut Learning

Ma, Zhang, Chen et al. 2022

Preprint

Learning harmful shortcuts such as spurious correlations and biases prevents deep neural networks from learning the meaningful and useful representations, thus jeopardizing the generalizability and interpretability of the learned representation. The situation becomes even more serious in medical imaging, where the clinical data (e.g., MR images with pathology) are limited and scarce while the reliability, generalizability and transparency of the learned model are highly required. To address this problem, we propose to infuse human experts' intelligence and domain knowledge into the training of deep neural networks. The core idea is that we infuse the visual attention information from expert radiologists to proactively guide the deep model to focus on regions with potential pathology and avoid being trapped in learning harmful shortcuts. To do so, we propose a novel eye-gaze-guided vision transformer (EG-ViT) for diagnosis with limited medical image data. We mask the input image patches that are out of the radiologists' interest and add an additional residual connection in the last encoder layer of EG-ViT to maintain the correlations of all patches. The experiments on two public datasets of INbreast and SIIM-ACR demonstrate our EG-ViT model can effectively learn/transfer experts' domain knowledge and achieve much better performance than baselines. Meanwhile, it successfully rectifies the harmful shortcut learning and significantly improves the EG-ViT model's interpretability. In general, EG-ViT takes the advantages of both human expert's prior knowledge and the power of deep neural networks. This work opens new avenues for advancing current artificial intelligence paradigms by infusing human intelligence.

“…Recent years have shown rapid developments in the field of multimodal machine learning [2]. Neural architectures are employed in tasks that go beyond single modalities, for example, Visual Question Answering (VQA) [12], Visual Commonsense Reasoning (VCR) [46], etc. In these tasks and beyond, priors and features from different modalities are required and algorithms or deep networks cannot be effective when provided with only a single modality.…”

Section: Multimodal Learning (mentioning)

confidence: 99%

Multimodal Fake News Detection via CLIP-Guided Learning

Zhou, Ying, Qian et al. 2022

Preprint

Multimodal fake news detection has attracted many research interests in social forensics. Many existing approaches introduce tailored attention mechanisms to guide the fusion of unimodal features. However, how the similarity of these features is calculated and how it will affect the decision-making process in FND are still open questions. Besides, the potential of pretrained multimodal feature learning models in fake news detection has not been well exploited. This paper proposes a FND-CLIP framework, i.e., a multimodal Fake News Detection network based on Contrastive Language-Image Pretraining (CLIP). Given a targeted multimodal news, we extract the deep representations from the image and text using a ResNet-based encoder, a BERT-based encoder and two pairwise CLIP encoders. The multimodal feature is a concatenation of the CLIP-generated features weighted by the standardized crossmodal similarity of the two modalities. The extracted features are further processed for redundancy reduction before feeding them into the final classifier. We introduce a modality-wise attention module to adaptively reweight and aggregate the features. We have conducted extensive experiments on typical fake news datasets. The results indicate that the proposed framework has a better capability in mining crucial features for fake news detection. The proposed FND-CLIP can achieve better performances than previous works, i.e., 0.7%, 6.8% and 1.3% improvements in overall accuracy on Weibo, Politifact and Gossipcop, respectively. Besides, we justify that CLIP-based learning can allow better flexibility on multimodal feature selection.

“…Shortcut learning Recently, shortcut learning has received much attention in deep learning areas such as computer vision (CV) [9,43,34] and natural language processing (NLP) [12,37,39]. For most of the tasks in deep learning, both the training and test sets come from the same dataset.…”

Section: Related Work (mentioning)

confidence: 99%

Rectify ViT Shortcut Learning by Visual Saliency

Ma, Zhang, Chen et al. 2022

Preprint

Shortcut learning is common but harmful to deep learning models, leading to degenerated feature representations and consequently jeopardizing the model's generalizability and interpretability. However, shortcut learning in the widely used Vision Transformer (ViT) framework is largely unknown. Meanwhile, introducing domain-specific knowledge is a major approach to rectifying the shortcuts, which are predominated by background related factors. For example, in the medical imaging field, eye-gaze data from radiologists is an effective human visual prior knowledge that has the great potential to guide the deep learning models to focus on meaningful foreground regions of interest. However, obtaining eye-gaze data is time-consuming, labor-intensive and sometimes even not practical. In this work, we propose a novel and effective saliency-guided vision transformer (SGT) model to rectify shortcut learning in ViT with the absence of eye-gaze data. Specifically, a computational visual saliency model (either pre-trained or fine-tuned) is adopted to predict saliency maps for input image samples. Then, the saliency maps are used to distil the most informative image patches. In the proposed SGT, the self-attention…
