2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01081

Counterfactual Samples Synthesizing for Robust Visual Question Answering

Abstract: Today's VQA models still tend to capture superficial linguistic correlations in the training set and fail to generalize to the test set with different QA distributions. To reduce these language biases, recent VQA works introduce an auxiliary question-only model to regularize the training of targeted VQA model, and achieve dominating performance on diagnostic benchmarks for out-of-distribution testing. However, due to complex model design, these ensemble-based methods are unable to equip themselves with two ind…
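The abstract describes the ensemble-based debiasing setup these methods share: a question-only branch gates the main VQA model's predictions during training so that answers predictable from the question alone are down-weighted. Below is a minimal sketch of one common fusion scheme of this kind (RUBi-style sigmoid masking) — not this paper's exact formulation, and the logit values are made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_logits(vqa_logits, q_only_logits):
    """Mask the VQA model's logits with a sigmoid gate from the
    question-only branch. Answers the question-only model scores
    highly keep their logits; the rest are suppressed, so the loss
    pushes the VQA model to rely on the image for those answers.
    The mask is used only during training, not at test time."""
    return vqa_logits * sigmoid(q_only_logits)

# Toy example with 4 candidate answers; answer 0 has a strong
# language prior (the question alone predicts it).
vqa = np.array([2.0, 1.0, 0.5, -1.0])
q_only = np.array([5.0, -5.0, 0.0, 0.0])

fused = fuse_logits(vqa, q_only)
```

Answer 1, which the question-only branch rates very unlikely, has its logit gated almost to zero, so gradients on that answer must come from the image side.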

Cited by 257 publications (163 citation statements)
References 46 publications
“…Counterfactual sample. Constructing counterfactual samples has become an emerging data augmentation technique in natural language processing, used across a wide spectrum of language understanding tasks, including SA (Kaushik et al., 2019), NLI (Kaushik et al., 2019), named entity recognition (Zeng et al., 2020), question answering (Chen et al., 2020), dialogue systems, and vision-language navigation (Fu et al., 2020). Beyond data augmentation under the standard supervised learning paradigm, a line of research explores incorporating counterfactual samples into other learning paradigms such as adversarial training (Fu et al., 2020; Teney et al., 2020) and contrastive learning (Liang et al., 2020).…”
Section: Related Work
“…Given the labeled factual sample, counterfactual samples can be constructed either manually (Kaushik et al., 2019) or automatically (Chen et al., 2020) by making minimal changes to x that swap its label from y to c…”
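The excerpt above defines counterfactual construction as a minimal edit to x that flips its label. A toy illustration for sentiment analysis, where the antonym substitution table is hypothetical and just for demonstration:

```python
# Hypothetical substitution table: swap sentiment-bearing words for
# antonyms, flipping the label while leaving all other tokens intact.
FLIP = {"great": "terrible", "terrible": "great",
        "good": "bad", "bad": "good"}

def counterfactual(x: str) -> str:
    """Return a minimally edited copy of x: only tokens found in
    FLIP are replaced; everything else is unchanged."""
    return " ".join(FLIP.get(tok, tok) for tok in x.split())

factual = "a great movie with a good plot"          # label: positive
print(counterfactual(factual))  # -> "a terrible movie with a bad plot"
```

Real systems (manual or automatic) use far richer edit operations, but the principle is the same: the smallest change that moves the example across the decision boundary.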
“…Apart from proposing newer architectures to tackle the VQA problem, training techniques [18,19,20,21] have been put forward that may help increase accuracy. Special care is taken with the dataset during training, accounting for semantic changes in the input data that might affect the output.…”
Section: Related Work
“…Some other works (such as DCN [39], BAN [40], and MCAN [41]) investigate "dense" co-attention that uses bidirectional attention between images and questions. More recent works try to capture more complex visual-textual information [42]-[45]. Our work instead tries to keep our approach as simple as possible by using three independently trained models to obtain the entropy.…”
Section: Related Work