IQA: Visual Question Answering in Interactive Environments

Gordon, Daniel; Kembhavi, Aniruddha; Rastegari, Mohammad; Redmon, Joseph; Fox, Dieter; Farhadi, Ali

doi:10.1109/cvpr.2018.00430

Cited by 301 publications

(300 citation statements)

References 59 publications

Supporting

Mentioning

295

Contrasting


“…Gandhi et al [19] collect a dataset of drone crashes and train self-supervised agents to avoid obstacles. A number of new challenging tasks have been proposed including instruction-based navigation [6,7], target-driven navigation [2,4], embodied/interactive question answering [1,9], and task planning [5].…”

Section: Related Work (mentioning)

confidence: 99%


Embodied Question Answering in Photorealistic Environments With Point Cloud Perception

Wijmans, Datta, Maksymets, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception, we instantiate a large-scale navigation task, Embodied Question Answering [1], in photo-realistic environments (Matterport 3D). We thoroughly study navigation policies that utilize 3D point clouds, RGB images, or their combination. Our analysis of these models reveals several key findings. We find that two seemingly naive navigation baselines, forward-only and random, are strong navigators and challenging to outperform, due to the specific choice of the evaluation setting presented by [1]. We find a novel loss-weighting scheme we call Inflection Weighting to be important when training recurrent models for navigation with behavior cloning, and we are able to outperform the baselines with this technique. We find that point clouds provide a richer signal than RGB images for learning obstacle avoidance, motivating the use (and continued study) of 3D deep learning models for embodied navigation.


Section: Related Work (mentioning)

confidence: 99%

“…We empirically show that point cloud representations are more effective for navigation in this task. Moreover, contrary to [1,9] that use synthetic environments, we extend the task to real environments sourced from [16]. 3D Representations and Architectures.…”

Section: Related Work (mentioning)

confidence: 99%

Embodied Question Answering in Photorealistic Environments With Point Cloud Perception

Wijmans, Datta, Maksymets, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


“…These approaches have been extended to the video domain as well [20,34,42]. Recently, [15,10] address the problem of question answering in an interactive environment. None of these approaches, however, is designed for leveraging external knowledge so they cannot handle the cases that the image does not represent the full knowledge to answer the question.…”

Section: Related Work (mentioning)

confidence: 99%

OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

Marino, Rastegari, Farhadi, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite


Visual Question Answering (VQA) in its ideal form lets us study reasoning in the joint space of vision and language and serves as a proxy for the AI task of scene understanding. However, most VQA benchmarks to date are focused on questions such as simple counting, visual attributes, and object detection that do not require reasoning or knowledge beyond what is in the image. In this paper, we address the task of knowledge-based visual question answering and provide a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources. Our new dataset includes more than 14,000 questions that require external knowledge to answer. We show that the performance of the state-of-the-art VQA models degrades drastically in this new setting. Our analysis shows that our knowledge-based VQA task is diverse, difficult, and large compared to previous knowledge-based VQA datasets. We hope that this dataset enables researchers to open up new avenues for research in this domain. See http://okvqa.allenai.org to download and browse the dataset.


“…Several embodied or visual question answering datasets have been presented recently to address some of the problems of interest in our work, such as those of Brodeur et al (2017); Das et al (2017); Gordon et al (2017). In contrast with these, our purely text-based environment circumvents challenges inherent to modelling interactions between separate data modalities.…”

Section: Interactive Environments (mentioning)

confidence: 99%

Interactive Language Learning by Question Answering

Yuan, Côté, Fu, et al. 2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen


Humans observe and interact with the world to acquire knowledge. However, most existing machine reading comprehension (MRC) tasks miss the interactive, information-seeking component of comprehension. Such tasks present models with static documents that contain all necessary information, usually concentrated in a single short substring. Thus, models can achieve strong performance through simple word- and phrase-based pattern matching. We address this problem by formulating a novel text-based question answering task: Question Answering with Interactive Text (QAit). In QAit, an agent must interact with a partially observable text-based environment to gather information required to answer questions. QAit poses questions about the existence, location, and attributes of objects found in the environment. The data is built using a text-based game generator that defines the underlying dynamics of interaction with the environment. We propose and evaluate a set of baseline models for the QAit task that includes deep reinforcement learning agents. Experiments show that the task presents a major challenge for machine reading systems, while humans solve it with relative ease.
