2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00682

Embodied Question Answering in Photorealistic Environments With Point Cloud Perception

Abstract: To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception, we instantiate a large-scale navigation task, Embodied Question Answering [1], in photorealistic environments (Matterport3D). We thoroughly study navigation policies that utilize 3D point clouds, RGB images, or their combination. Our analysis of these models reveals several key findings. We find that two seemingly naive navigation baselines, forward-only and random, are strong navigators and challenging…

Cited by 115 publications (107 citation statements)
References 33 publications
“…Additionally, we diversify the training trajectories by sampling actions from the agent's policy with probability ε, instead of exclusively following the expert trajectories. We use inflection weighting to prevent the policy from simply repeating the previous action (Wijmans et al. 2019). The perturbation probability ε is varied from its start value to its end value over the course of training, with a constant increase of 0.1 after every E updates.…”
Section: Imitation Learning
confidence: 99%
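The inflection weighting quoted above can be made concrete with a short sketch. The following is a minimal PyTorch illustration of the idea, not the cited authors' code; the per-trajectory normalization of the inflection weight and all tensor names are assumptions.

import torch
import torch.nn.functional as F

def inflection_weighted_loss(logits, expert_actions):
    # Cross-entropy over one trajectory that up-weights "inflection" steps,
    # i.e. steps where the expert action differs from the previous action,
    # so the policy is not rewarded for simply repeating its last action.
    #   logits:         (T, num_actions) action scores
    #   expert_actions: (T,) expert action indices
    prev = torch.cat([expert_actions.new_tensor([-1]), expert_actions[:-1]])
    inflections = (expert_actions != prev).float()  # first step counts as one
    # Assumed normalization: weight inflections by their inverse frequency in
    # this trajectory (the paper derives the weight from dataset statistics).
    w = inflections.numel() / inflections.sum().clamp(min=1.0)
    weights = 1.0 + inflections * (w - 1.0)  # 1 everywhere, w at inflections
    per_step = F.cross_entropy(logits, expert_actions, reduction="none")
    return (weights * per_step).sum() / weights.sum()

The ε-sampling described in the quote lives outside this loss: during rollout the agent's own action would be taken with probability ε instead of the expert's, with ε increased on the schedule the authors describe.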
“…In this section, we provide the quantitative and qualitative results of our 3D adversarial perturbations on EQA and EVR through our differentiable renderer. For EQA, besides PACMAN-RL+Q, we also evaluate the transferability of our attacks using the following models: (1) NAV-GRU, an agent using GRU instead of LSTM in navigation [37];…”
Section: Attack Via a Differentiable Renderer
confidence: 99%
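For readers unfamiliar with attacks through a differentiable renderer, here is a hypothetical, minimal FGSM-style sketch of the general idea; it is not the cited authors' pipeline, and render_fn, model, and the tensor shapes are assumed stand-ins for a differentiable rendering and answering stack.

import torch
import torch.nn.functional as F

def fgsm_geometry_attack(render_fn, model, vertices, answer_idx, epsilon=0.01):
    # One gradient step on scene geometry through a differentiable renderer:
    # nudge the (V, 3) mesh vertices so the rendered view pushes the answering
    # model away from the correct answer (an untargeted attack).
    delta = torch.zeros_like(vertices, requires_grad=True)
    image = render_fn(vertices + delta)    # differentiable render, (C, H, W)
    logits = model(image.unsqueeze(0))     # (1, num_answers)
    loss = F.cross_entropy(logits, torch.tensor([answer_idx]))
    loss.backward()
    # Ascend the loss w.r.t. the geometry; iterating this step with a
    # projection onto an epsilon-ball gives the stronger PGD variant.
    return (vertices + epsilon * delta.grad.sign()).detach()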
“…Concurrently, Gordon et al. [15] studied the EQA task in an interactive environment named AI2-THOR [20]. Recently, several studies have been proposed to improve agent performance using different frameworks [9] and point cloud perception [37]. Similar to EQA, embodied vision recognition (EVR) [40] is an embodied task in which an agent is instantiated close to an occluded target object and must perform visual object recognition.…”
Section: Introduction
confidence: 99%
“…The Embodied Question Answering (EQA) v1.0 [18] dataset consists of scenes sampled from the SUNCG dataset with additional question–answer pairs. The authors further extended the EQA task to a realistic scene setting by adapting the Matterport3D dataset into their Matterport3D EQA dataset [19]. The Room-to-Room dataset [20] added navigation instruction annotations to the Matterport3D dataset for the vision-language navigation task.…”
Section: Related Work
confidence: 99%
“…Various 2D approaches have been adapted for 3D data, such as recognition [3,4,5], detection [16], and segmentation [17]. Researchers have proposed a series of embodied AI tasks that define an indoor scene and an agent that explores the scene and answers vision-related questions (e.g., embodied question answering [18,19]), or navigates based on a given instruction (e.g., vision-language navigation [20,21]). However, most 3D recognition-related studies have focused on static scenes.…”
Section: Introduction
confidence: 99%