2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01000

REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments

Cited by 184 publications (197 citation statements)
References 19 publications

“…navigation and assembling), and there have been only a few recent efforts to combine the traditional navigation task with other tasks. Touchdown (Chen et al., 2019) combines navigation and object referring expression resolution, REVERIE (Qi et al., 2020) performs remote referring expression comprehension, while ALFRED (Shridhar et al., 2020) combines indoor navigation and household manipulation. Our new complementary task merges navigation in a complex outdoor space with object referring expression comprehension and assembling tasks that require spatial relation understanding in an interwoven temporal style, in which the two tasks alternate for multiple turns, leading to cascading error effects.…”
Section: Related Work (mentioning, confidence: 99%)
“…Recently, Vision-and-Language Navigation (VLN) tasks, in which agents follow NL instructions to navigate through an environment, have been actively studied in research communities (MacMahon et al., 2006; Mooney, 2008; Chen and Mooney, 2011; Tellex et al., 2011; Mei et al., 2016; Hermann et al., 2017; Anderson et al., 2018; Das et al., 2018; Thomason et al., 2019; Chen et al., 2019; Shridhar et al., 2020; Qi et al., 2020; Hermann et al., 2020). To encourage the exploration of this challenging research topic, multiple simulated environments have been introduced. Synthetic (Kempka et al., 2016; Beattie et al., 2016; Brodeur et al., 2017; Wu et al., 2018; Savva et al., 2017; Yan et al., 2018; Shah et al., 2018; Puig et al., 2018) as well as real-world and image-based environments (Anderson et al., 2018; Xia et al., 2018; Chen et al., 2019) have been used to provide agents with diverse and complementary training environments.…”
Section: Related Work (mentioning, confidence: 99%)
“…Embodied Language Tasks. A number of 'Embodied AI' tasks combining language, visual perception, and navigation in realistic 3D environments have recently gained prominence, including Interactive and Embodied Question Answering (Das et al., 2018; Gordon et al., 2018), Vision-and-Language Navigation or VLN (Anderson et al., 2018; Chen et al., 2019; Mehta et al., 2020; Qi et al., 2020), and challenges based on household tasks (Puig et al., 2018; Shridhar et al., 2020). While these tasks utilize only a single question or instruction input, several papers have extended the VLN task, in which an agent must follow natural language instructions to traverse a path in the environment, to dialog settings.…”
Section: Related Work (mentioning, confidence: 99%)
“…Some of the typical multimodal intelligent tasks are visual question answering (VQA) [1], which generates answers to natural language questions about the presented image; visual dialog [2], which holds a meaningful question-and-answer (Q&A) dialog about the input image; and image/video captioning, which generates text describing the contents of the input image or video. More advanced multimodal intelligent tasks have also been presented, including embodied question answering (EQA) [3], which assumes an embodied agent moving around in a virtual environment, interactive question answering (IQA) [4], cooperative vision and dialog navigation (CVDN) [5], remote embodied visual referring expression in real indoor environments (REVERIE) [6], and vision and language navigation (VLN) [7].…”
Section: Introduction (mentioning, confidence: 99%)