2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.670

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Abstract: Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability. We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the…
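The core issue the abstract points to, models answering from language priors while ignoring the image, can be probed with a simple ablation. The sketch below is a minimal illustration, not the authors' code; the model interface, dataset format, and exact-match scoring are all assumptions. It compares a model's accuracy with and without access to the image: a small gap suggests heavy reliance on question-only priors.

```python
# Minimal sketch (illustrative assumptions, not the paper's implementation):
# estimate how much of a VQA model's accuracy survives when the image is
# withheld, i.e. how much it leans on language priors alone.
from typing import Callable, Iterable, Tuple

Example = Tuple[str, str, str]  # (image_path, question, ground_truth_answer)

def exact_match_accuracy(predict: Callable[[str, str], str],
                         examples: Iterable[Example],
                         use_image: bool = True) -> float:
    """Exact-match accuracy; pass use_image=False for a question-only ('blind') run."""
    correct = total = 0
    for image_path, question, answer in examples:
        prediction = predict(image_path if use_image else "", question)
        correct += int(prediction.strip().lower() == answer.strip().lower())
        total += 1
    return correct / max(total, 1)

# Hypothetical usage with a model exposing predict(image_path, question) -> str:
# full_acc  = exact_match_accuracy(model.predict, val_examples, use_image=True)
# blind_acc = exact_match_accuracy(model.predict, val_examples, use_image=False)
# print(f"with image: {full_acc:.3f}  question-only: {blind_acc:.3f}")
```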

Citations: Cited by 1,575 publications (1,513 citation statements).
References: 33 publications.
“…Synsets: Figure 4 shows the counts of examples per synset in the training and development sets. Image Pair Reasoning: We use a 200-sentence subset of the sentences analyzed in Table 5 … (3) existential and (4) universal quantifiers; (5) coordination; (6) coreference; (7) spatial relations; (8) presupposition; (9) preposition attachment ambiguity … VQA1.0 (Antol et al, 2015), VQA-CP (Agrawal et al, 2017), VQA2.0 (Goyal et al, 2017) Visual Question Answering; Referring Expression Generation; VQA (Abstract) (Zitnick and Parikh, 2013) Visual Question Answering; ReferItGame (Kazemzadeh et al, 2014) Referring Expression Resolution; SHAPES (Andreas et al, 2016) Visual Question Answering; Bisk et al (2016) Instruction Following; MSCOCO (Chen et al, 2016) Caption Generation; Google RefExp (Mao et al, 2016) Referring Expression Resolution; ROOM-TO-ROOM (Anderson et al, 2018) Instruction Following; Visual Dialog (Das et al, 2017) Dialogue, Visual Question Answering; CLEVR (Johnson et al, 2017a) Visual Question Answering; CLEVR-Humans (Johnson et al, 2017b) Visual Question Answering; TDIUC (Kafle and Kanan, 2017) Visual Question Answering; ShapeWorld (Kuhnle and Copestake, 2017) Binary Sentence Classification; FigureQA (Kahou et al, 2018) Visual Question Answering; TVQA (Lei et al, 2018) Video Question Answering; LANI & CHAI (Misra et al, 2018) Instruction Following; Talk the Walk (de Vries et al, 2018) Dialogue, Instruction Following; COG (Yang et al, 2018) Visual Question Answering, Instruction Following; VCR (Zellers et al, 2019) Visual Question Answering; TallyQA (Acharya et al, 2019) Visual Question Answering … What to avoid…”
Section: Additional Data Analysis
Citation type: mentioning; confidence: 99%
“…However, commonly used resources for language and vision (e.g., Antol et al, 2015; Chen et al, 2016) focus mostly on identification of object properties and few spatial relations (Section 4; Ferraro et al, 2015; Alikhani and Stone, 2019). This relatively simple reasoning, together with biases in the data, removes much of the need to consider language compositionality (Goyal et al, 2017). This motivated the design of datasets that require compositional visual reasoning, including… [Figure 1: Two examples from NLVR2.]”
Section: Introduction
Citation type: mentioning; confidence: 99%
“…We evaluate our model on four public datasets: the VQA 1.0 dataset (Antol et al 2015), the VQA 2.0 dataset (Goyal et al 2017), the COCO-QA dataset (Ren, Kiros, and Zemel 2015), and the TDIUC dataset (Kafle and Kanan 2017a). …”
Section: Datasets and Evaluation Metrics
Citation type: mentioning; confidence: 99%
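For context on how systems are scored on the VQA-style datasets named in this excerpt, the snippet below sketches the commonly quoted consensus accuracy used by the VQA benchmarks, acc(a) = min(#annotators who gave a / 3, 1). The function name and inputs are illustrative, and the official evaluation additionally averages this score over subsets of the ten human answers.

```python
# Sketch of the standard VQA consensus accuracy (commonly quoted form):
# an answer gets min(#annotators who gave it / 3, 1), so matching at least
# 3 of the 10 collected human answers counts as fully correct.
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Consensus accuracy for one question (function name here is illustrative)."""
    norm = predicted.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == norm)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators said "red" -> partial credit of 2/3.
print(vqa_accuracy("red", ["red", "red", "maroon"] + ["dark red"] * 7))
```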
“…This tutorial will provide an overview of the growing number of multimodal tasks and datasets that combine textual and visual understanding. We will comprehensively review existing state-of-the-art approaches to selected tasks such as image captioning (Chen et al, 2015), visual question answering (VQA) (Antol et al, 2015; Goyal et al, 2017) and visual dialog (Das et al, 2017a,b), presenting the key architectural building blocks (such as co-attention (Lu et al, 2016)) and novel algorithms (such as cooperative/adversarial games (Das et al, 2017b)) used to train models for these tasks. We will then discuss some of the current and upcoming challenges of combining language, vision and actions, and introduce some recently-released interactive 3D simulation environments designed for this purpose (Anderson et al, 2018b; Das et al, 2018).…”
Section: Tutorial Overview
Citation type: mentioning; confidence: 99%