“…Language and vision already interact in simple tasks such as object classification, where images are mapped to concepts in a closed vocabulary of categories. However, multimodal representations [4] allow for richer interactions, enabling cross-modal tasks such as cross-modal retrieval [11,63,9,66,13], image captioning [18,12,49], visual question answering [47,23,10,65], and, more recently, text-to-image synthesis [32,75]. Language also makes it possible to recognize concepts beyond the limited categories seen during training by projecting to language spaces, also known as zero-shot recognition [15,71].…”
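The zero-shot recognition idea mentioned above can be sketched concretely: an image embedding is compared, by cosine similarity, against language embeddings of candidate class names, and the nearest class wins. This is a minimal illustration with hand-crafted toy vectors, not the method of any cited work; the function name and the embeddings are assumptions for the sketch.

```python
import numpy as np

def zero_shot_classify(image_emb, class_name_embs, class_names):
    """Return the class whose language embedding has the highest
    cosine similarity to the image embedding (toy sketch)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_name_embs / np.linalg.norm(class_name_embs, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarity against each class embedding
    return class_names[int(np.argmax(sims))]

# Toy example: two orthogonal "text" embeddings standing in for the
# language encodings of the class names (hypothetical values).
class_names = ["cat", "dog"]
class_embs = np.array([[1.0, 0.0],
                       [0.0, 1.0]])
image_emb = np.array([0.9, 0.1])  # closer to the "cat" direction
pred = zero_shot_classify(image_emb, class_embs, class_names)
print(pred)  # → cat
```

Because classification reduces to nearest-neighbor search in the shared space, the candidate vocabulary can include class names never seen during training, as long as a language embedding can be computed for them.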