Abstract: Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability. We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the…
“…Synsets: Figure 4 shows the counts of examples per synset in the training and development sets. Image Pair Reasoning: We use a 200-sentence subset of the sentences analyzed in Table 5; the phenomena analyzed include (3) existential and (4) universal quantifiers; (5) coordination; (6) coreference; (7) spatial relations; (8) presupposition; and (9) preposition attachment ambiguity. Related resources and their tasks (flattened table): VQA1.0 (Antol et al., 2015), VQA-CP (Agrawal et al., 2017), VQA2.0 (Goyal et al., 2017): Visual Question Answering; Referring Expression Generation; VQA (Abstract) (Zitnick and Parikh, 2013): Visual Question Answering; ReferItGame (Kazemzadeh et al., 2014): Referring Expression Resolution; SHAPES (Andreas et al., 2016): Visual Question Answering; Bisk et al. (2016): Instruction Following; MSCOCO (Chen et al., 2016): Caption Generation; Google RefExp (Mao et al., 2016): Referring Expression Resolution; ROOM-TO-ROOM (Anderson et al., 2018): Instruction Following; Visual Dialog (Das et al., 2017): Dialogue, Visual Question Answering; CLEVR (Johnson et al., 2017a): Visual Question Answering; CLEVR-Humans (Johnson et al., 2017b): Visual Question Answering; TDIUC (Kafle and Kanan, 2017): Visual Question Answering; ShapeWorld (Kuhnle and Copestake, 2017): Binary Sentence Classification; FigureQA (Kahou et al., 2018): Visual Question Answering; TVQA (Lei et al., 2018): Video Question Answering; LANI & CHAI (Misra et al., 2018): Instruction Following; Talk the Walk (de Vries et al., 2018): Dialogue, Instruction Following; COG (Yang et al., 2018): Visual Question Answering, Instruction Following; VCR (Zellers et al., 2019): Visual Question Answering; TallyQA (Acharya et al., 2019): Visual Question Answering. What to avoid…”
Section: Additional Data Analysis (mentioning)
confidence: 99%
“…However, commonly used resources for language and vision (e.g., Antol et al., 2015; Chen et al., 2016) focus mostly on identification of object properties and few spatial relations (Section 4; Ferraro et al., 2015; Alikhani and Stone, 2019). This relatively simple reasoning, together with biases in the data, removes much of the need to consider language compositionality (Goyal et al., 2017). This motivated the design of datasets that require compositional¹ visual reasoning, including… (Figure 1: Two examples from NLVR2.)…”
We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the data requires compositional joint reasoning, including about quantities, comparisons, and relations. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge. (* Contributed equally. † Work done as an undergraduate at Cornell University. ¹ In parts of this paper, we use the term compositional differently than it is commonly used in linguistics, to refer to reasoning that requires composition; this type of reasoning often manifests itself in highly compositional language.) Example sentences from Figure 1: "The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing." "One image shows exactly two brown acorns in back-to-back caps on green foliage."
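To make the task format described in the abstract concrete, here is a minimal, illustrative sketch of the binary-classification setup: each example pairs a caption with two photographs and a true/false label, and systems are scored by accuracy. The field names, file names, and the label in the toy example are assumptions made for illustration, not the released corpus schema.

```python
# Illustrative sketch of the caption-over-image-pair task; record layout is assumed.
from dataclasses import dataclass


@dataclass
class Example:
    sentence: str      # natural language caption
    left_image: str    # path/URL of the left photograph
    right_image: str   # path/URL of the right photograph
    label: bool        # True iff the caption is true of the image pair


def accuracy(examples, predict) -> float:
    """Fraction of examples where the True/False prediction matches the label."""
    correct = sum(
        predict(ex.sentence, ex.left_image, ex.right_image) == ex.label
        for ex in examples
    )
    return correct / len(examples)


# Toy example using a sentence from Figure 1; the label here is made up.
ex = Example(
    sentence="The left image contains twice the number of dogs as the right "
             "image, and at least two dogs in total are standing.",
    left_image="left.jpg",
    right_image="right.jpg",
    label=True,
)

# A trivial baseline that always answers True.
print(accuracy([ex], lambda s, l, r: True))  # 1.0 on this single toy example
```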
“…We evaluate our model on four public datasets: the VQA 1.0 dataset (Antol et al. 2015), the VQA 2.0 dataset (Goyal et al. 2017), the COCO-QA dataset (Ren, Kiros, and Zemel 2015), and the TDIUC dataset (Kafle and Kanan 2017a). The VQA…”
Section: Datasets and Evaluation Metrics (mentioning)
The task of Visual Question Answering (VQA) has emerged in recent years for its potential applications. To address the VQA task, a model should efficiently fuse feature elements from both images and questions. Existing models fuse an image feature element v_i and a question feature element q_i directly, for example through an element-wise product v_i * q_i. Those solutions largely ignore two key points: 1) whether v_i and q_i lie in the same space, and 2) how to reduce the observation noise in v_i and q_i. We argue that differences between feature elements of the same modality, such as (v_i − v_j) and (q_i − q_j), are more likely to lie in the same space, and that the difference operation helps reduce observation noise. To achieve this, we first propose Differential Networks (DN), a novel plug-and-play module that computes differences between pairwise feature elements. With DN as a building block, we then propose DN-based Fusion (DF), a novel model for the VQA task. We achieve state-of-the-art results on four publicly available datasets. Ablation studies also show the effectiveness of the difference operation in the DF model.
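To make the difference-based fusion idea above concrete, here is a minimal PyTorch sketch. It is not the authors' implementation: the way pairwise element differences are approximated (two learned linear projections whose outputs are subtracted), the layer sizes, and the final classifier are all illustrative assumptions.

```python
# Minimal sketch of difference-based fusion for VQA; sizes and details are illustrative.
import torch
import torch.nn as nn


class DifferentialModule(nn.Module):
    """Maps a feature vector to projected differences of its elements.

    Each output unit is (W1 x)_k - (W2 x)_k, i.e. a weighted aggregation of
    element differences, approximating the idea of working with (x_i - x_j)
    rather than raw elements.
    """

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.left = nn.Linear(in_dim, out_dim, bias=False)
        self.right = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.left(x) - self.right(x)


class DifferentialFusion(nn.Module):
    """Fuses image and question features via their differential representations."""

    def __init__(self, img_dim: int, q_dim: int, hidden: int, num_answers: int):
        super().__init__()
        self.img_dn = DifferentialModule(img_dim, hidden)
        self.q_dn = DifferentialModule(q_dim, hidden)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, img_feat: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
        # Element-wise product of the two differential representations,
        # followed by a linear classifier over candidate answers.
        fused = self.img_dn(img_feat) * self.q_dn(q_feat)
        return self.classifier(fused)


# Toy usage: a batch of 2 examples with 2048-d image and 1024-d question features.
model = DifferentialFusion(img_dim=2048, q_dim=1024, hidden=512, num_answers=3000)
scores = model(torch.randn(2, 2048), torch.randn(2, 1024))
print(scores.shape)  # torch.Size([2, 3000])
```

The subtraction of two learned projections keeps the module a drop-in ("plug-and-play") transform: it takes a feature vector and returns a vector of aggregated weighted differences, which the fusion model then combines across modalities.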
“…This tutorial will provide an overview of the growing number of multimodal tasks and datasets that combine textual and visual understanding. We will comprehensively review existing state-of-the-art approaches to selected tasks such as image captioning (Chen et al., 2015), visual question answering (VQA) (Antol et al., 2015; Goyal et al., 2017) and visual dialog (Das et al., 2017a,b), presenting the key architectural building blocks (such as co-attention (Lu et al., 2016)) and novel algorithms (such as cooperative/adversarial games (Das et al., 2017b)) used to train models for these tasks. We will then discuss some of the current and upcoming challenges of combining language, vision and actions, and introduce some recently-released interactive 3D simulation environments designed for this purpose (Anderson et al., 2018b; Das et al., 2018).…”
A long-term goal of AI research is to build intelligent agents that can see the rich visual environment around us, communicate this understanding in natural language to humans and other agents, and act in a physical or embodied environment. To this end, recent advances at the intersection of language and vision have made incredible progress: from generating natural language descriptions of images and videos, to answering questions about them, to even holding free-form conversations about visual content. However, while these agents can passively describe images or answer (a sequence of) questions about them, they cannot act in the world (what if I cannot answer a question from my current view, or am asked to move or manipulate something?). Thus, the challenge now is to extend this progress in language and vision to embodied agents that take actions and actively interact with their visual environments.
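The tutorial snippet above names co-attention (Lu et al., 2016) as a key architectural building block. Below is a minimal sketch of parallel co-attention in that spirit; the dimensions, toy inputs, and exact parameterization are assumptions for illustration rather than a reproduction of the original model.

```python
# Illustrative sketch of parallel co-attention between image regions and question words.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelCoAttention(nn.Module):
    def __init__(self, dim: int, k: int = 256):
        super().__init__()
        self.W_b = nn.Linear(dim, dim, bias=False)   # cross-modal affinity
        self.W_v = nn.Linear(dim, k, bias=False)
        self.W_q = nn.Linear(dim, k, bias=False)
        self.w_hv = nn.Linear(k, 1, bias=False)
        self.w_hq = nn.Linear(k, 1, bias=False)

    def forward(self, V: torch.Tensor, Q: torch.Tensor):
        # V: (batch, regions, dim) image region features
        # Q: (batch, words, dim) question word features
        C = torch.tanh(Q @ self.W_b(V).transpose(1, 2))                   # (batch, words, regions)
        H_v = torch.tanh(self.W_v(V) + C.transpose(1, 2) @ self.W_q(Q))   # (batch, regions, k)
        H_q = torch.tanh(self.W_q(Q) + C @ self.W_v(V))                   # (batch, words, k)
        a_v = F.softmax(self.w_hv(H_v).squeeze(-1), dim=-1)               # attention over regions
        a_q = F.softmax(self.w_hq(H_q).squeeze(-1), dim=-1)               # attention over words
        v_att = (a_v.unsqueeze(-1) * V).sum(dim=1)                        # attended image vector
        q_att = (a_q.unsqueeze(-1) * Q).sum(dim=1)                        # attended question vector
        return v_att, q_att


# Toy usage: 36 region features and 14 word features, each 512-d.
coatt = ParallelCoAttention(dim=512)
v, q = coatt(torch.randn(2, 36, 512), torch.randn(2, 14, 512))
print(v.shape, q.shape)  # torch.Size([2, 512]) torch.Size([2, 512])
```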