Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.63

MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering

Abstract: While progress has been made on the visual question answering leaderboards, models often utilize spurious correlations and priors in datasets under the i.i.d. setting. As such, evaluation on out-of-distribution (OOD) test samples has emerged as a proxy for generalization. In this paper, we present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input, to improve OOD generalization, such as the VQA-CP challenge. Under this paradigm, models u…
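The abstract's core idea, training on semantically distinct mutations of each input, can be viewed as a structured data-augmentation objective. Below is a minimal sketch, assuming a generic `model(image, question)` callable and a dataset that pairs each sample with a pre-computed mutation (e.g., a negated question with a correspondingly changed answer). The names and the simple summed loss are illustrative assumptions; the paper's full method also includes components (such as answer-embedding and pairwise-consistency losses) not shown here.

```python
import torch
import torch.nn.functional as F

def mutant_style_step(model, batch):
    """One training step on an original VQA sample plus its mutation.

    `batch` is assumed (hypothetically) to hold an original
    (image, question, answer) triple and a perceptually similar but
    semantically distinct mutant: a mutated question with its changed answer.
    """
    img, q, a = batch["image"], batch["question"], batch["answer"]
    q_mut, a_mut = batch["mutated_question"], batch["mutated_answer"]

    # Supervise the model on both the original and the mutant sample,
    # so shortcut features that ignore the mutated detail are penalized.
    loss_orig = F.cross_entropy(model(img, q), a)
    loss_mut = F.cross_entropy(model(img, q_mut), a_mut)
    return loss_orig + loss_mut
```

Because the paired samples differ only in the mutated detail, gradients from the pair push the model toward features that track that detail rather than dataset priors.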

Cited by 84 publications (72 citation statements)
References 43 publications

Citation statements:

“…Training and testing under the independent and identically distributed (i.i.d.) setting have resulted in the performance of most VQA models being highly affected by superficial correlations (i.e., language biases and dataset biases) [1,2,20,74]. Recently, evaluation on the out-of-distribution (OOD) setting [18,24,35,60] has thus become an increasing concern for VQA. To improve the OOD generalization performance of VQA models, the prevailing methods target eliminating the language bias.…”
Section: Related Work 2.1 OOD Generalization in VQA (mentioning)
confidence: 99%
“…To improve the OOD generalization performance of VQA models, the prevailing methods target eliminating the language bias. Accordingly, current debiasing methods for VQA can be broadly divided into two groups, Known Bias-based [7,10,45] and Unknown Bias-based [11,18,58].…”
Section: Related Work 2.1 OOD Generalization in VQA (mentioning)
confidence: 99%
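The "Known Bias" family referenced in the excerpt above can be made concrete with a short sketch. The following is a minimal illustration in the spirit of RUBi-style question-only debiasing, not any cited paper's exact formulation; the function names and the sigmoid fusion are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def known_bias_debias_loss(vqa_logits, q_only_logits, answer):
    """Sketch of question-only debiasing: a branch that sees only the
    question captures the language prior, and its confidence masks the
    main model's logits so that answers predictable from the question
    alone earn the main model little reward during training."""
    fused = vqa_logits * torch.sigmoid(q_only_logits)
    loss_fused = F.cross_entropy(fused, answer)
    # The question-only branch is trained separately to model the bias.
    loss_q = F.cross_entropy(q_only_logits, answer)
    return loss_fused + loss_q
```

At inference time only `vqa_logits` would be used, so the bias branch acts purely as a training-time regularizer.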
“…Most of them are designed for the language priors problem, while LXMERT represents the recent trend towards utilizing BERT-like pre-trained models (Li et al., 2019; Chen et al., 2020b; Li et al., 2020) which have top performances on various downstream vision and language tasks (including VQA-v2). Note that MUTANT (Gokhale et al., 2020) uses the extra object-name label to ground the textual concepts in the image. For fair comparison, we do not compare with MUTANT.…”
Section: Inference Process (mentioning)
confidence: 99%
“…Apart from coming up with newer architectures to tackle the VQA problem, training techniques [18,19,20,21] have been put forward which might help to increase accuracy. Special care is taken during training to account for semantic changes in the input data that might affect the output.…”
Section: Related Work (mentioning)
confidence: 99%