2020
DOI: 10.48550/arxiv.2006.04315
Preprint

Counterfactual VQA: A Cause-Effect Look at Language Bias

Abstract: Visual Question Answering (VQA) models tend to rely on language bias and thus fail to learn reasoning from visual knowledge, which is the original intention of VQA. In this paper, we propose a novel cause-effect look at language bias, where the bias is formulated as the direct effect of the question on the answer from the view of causal inference. The effect can be captured by counterfactual VQA, where the image had not existed in an imagined scenario. Our proposed cause-effect look 1) is general t…
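
The debiasing idea summarized in the abstract can be illustrated in a few lines: run the model once on the factual input and once in the counterfactual scenario where the image is absent, then subtract the question-only (direct) effect from the total effect. The sketch below assumes a hypothetical `vqa_model` that maps image and question features to answer logits and approximates the missing image with a zeroed feature; it is not the authors' released implementation.

```python
import torch

# Minimal sketch of the cause-effect idea in the abstract: compare the
# factual prediction with a counterfactual pass in which the image
# "had not existed", and subtract the question-only (direct) effect.
# `vqa_model` is a hypothetical fusion network mapping (image features,
# question features) -> answer logits; this is not the authors' code.

def counterfactual_debias(vqa_model, image_feat, question_feat):
    # Total effect: prediction with both modalities observed.
    total_effect = vqa_model(image_feat, question_feat)

    # Counterfactual pass: block the image so only the language shortcut
    # (question -> answer) can drive the prediction. A zeroed feature is
    # an assumed stand-in for the paper's reference "no-image" value.
    blocked_image = torch.zeros_like(image_feat)
    direct_effect = vqa_model(blocked_image, question_feat)

    # Debiased scores: total effect minus the language-only direct effect.
    return total_effect - direct_effect
```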

Cited by 18 publications (29 citation statements) · References: 68 publications

“…We will revisit this procedure formally in the following sections. Different from the plain "observation" (e.g., a different color from the background) made by biased training, counterfactual training opens the door to "imagination" and allows models to think comprehensively [42]. A better prediction can possibly be made because features such as the elongated and curved shape (instead of the round shape of a baseball or orange) or the yellow-green color (instead of the dark color of a remote control or avocado) are captured.…”
Section: Counterfactual Training (citation type: mentioning)
Confidence: 99%
“…In this paper, we aim at finding a "cost-free" way to handle the distribution inconsistency in co-saliency detection. Intrigued by causal effects [46,45] and their extensions in vision & language [54,42,60], we introduce counterfactual training, regarding the gap between the current training distribution D and the true distribution T as the direct cause [44,50] of incorrect co-saliency predictions. As shown in Figure 2, the quality of the prediction P made by a learning-based model depends on the quality of the input data I under distribution D. The goal of counterfactual training is to synthesize "imaginary" data samples Î whose distribution D̂, which also originates from D, can mimic T.…”
Section: Introduction (citation type: mentioning)
Confidence: 99%
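
The counterfactual-training idea in the quote above (synthesizing "imaginary" samples Î whose distribution D̂ mimics T) could look roughly like the following. This is a loose sketch under assumed choices: a simple batch-blending perturbation stands in for whatever synthesis the cited paper actually uses, and `model`, `loss_fn`, and `optimizer` are generic placeholders.

```python
import torch

# Purely illustrative sketch of counterfactual training: alongside real
# samples from the training distribution D, train on synthesized
# "imaginary" samples meant to better cover the true distribution T.
# The blending perturbation below is an assumption for illustration
# only, not the cited paper's actual synthesis procedure.

def synthesize_imaginary_batch(images: torch.Tensor, alpha: float = 0.3):
    """Blend each image with a randomly drawn image from the same batch
    to perturb nuisance context while keeping the labelled content."""
    perm = torch.randperm(images.size(0))
    return (1.0 - alpha) * images + alpha * images[perm]

def counterfactual_training_step(model, loss_fn, optimizer, images, labels):
    # "Observation" loss on the real batch ...
    loss_real = loss_fn(model(images), labels)
    # ... plus an "imagination" loss on the synthesized counterfactuals.
    loss_imag = loss_fn(model(synthesize_imaginary_batch(images)), labels)
    loss = loss_real + loss_imag
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```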
“…RUBi [7], LMH [10] and PoE [28] re-weight samples based on the question-only prediction. Niu et al. [33] further improve ensemble strategies from a causal-effect perspective. CSS [8] combines grounding-based and ensemble-based methods by synthesizing counterfactual samples.…”
Section: De-bias With Model Design (citation type: mentioning)
Confidence: 99%
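
For the ensemble-based line of work cited here, a rough sketch of RUBi-style question-only re-weighting during training is given below. `base_model` and `question_only_model` are assumed, hypothetical components, and the masking and loss details are simplified relative to RUBi, LMH, PoE, and the causal-effect variant of Niu et al.

```python
import torch
import torch.nn.functional as F

# Rough sketch of the question-only re-weighting idea behind the
# ensemble-based de-biasing methods mentioned above (RUBi and relatives).
# `base_model` (fused vision-language branch) and `question_only_model`
# are hypothetical components; the exact masking and loss weighting
# differ between the actual methods.

def rubi_style_losses(base_model, question_only_model, image, question, answer):
    fused_logits = base_model(image, question)       # vision + language
    q_only_logits = question_only_model(question)    # language-only shortcut

    # Re-weight the fused prediction with a question-only "mask": samples
    # the shortcut already answers confidently contribute less gradient,
    # pushing the fused branch toward visually grounded evidence.
    masked_logits = fused_logits * torch.sigmoid(q_only_logits)

    loss_fused = F.cross_entropy(masked_logits, answer)
    loss_q_only = F.cross_entropy(q_only_logits, answer)

    # At test time only `fused_logits` (without the mask) would be used.
    return loss_fused + loss_q_only
```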
“…For example, a model may blindly answer "tennis" for the question "What sports ..." based only on the most common textual QA pairs in the train set. Unfortunately, models exploiting … methods [8,33] further combine these two lines of work and achieve better performance.…”
Section: Introduction (citation type: mentioning)
Confidence: 99%
“…Language and vision already interact in simple tasks such as object classification, where images are mapped to concepts in a closed vocabulary of categories. However, multimodal representations [4] allow for richer interactions, enabling cross-modal tasks such as cross-modal retrieval [11,63,9,66,13], image captioning [18,12,49], visual question answering [47,23,10,65], and more recently text-to-image synthesis [32,75]. Language also enables recognition beyond the limited categories seen during training by projecting to language spaces, also known as zero-shot recognition [15,71].…”
Section: Introduction (citation type: mentioning)
Confidence: 99%