2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01000

REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments

Cited by 184 publications (197 citation statements)
References 19 publications

“…navigation and assembling), and there have been only a few recent efforts to combine the traditional navigation task with other tasks. Touchdown (Chen et al., 2019) combines navigation and object referring expression resolution, REVERIE (Qi et al., 2020) performs remote referring expression comprehension, while ALFRED (Shridhar et al., 2020) combines indoor navigation and household manipulation. Our new complementary task merges navigation in a complex outdoor space with object referring expression comprehension and assembling tasks that require spatial relation understanding in an interwoven temporal style, in which the two tasks alternate for multiple turns, leading to cascading error effects.…”
Section: Related Work (mentioning, confidence: 99%)
“…Recently, Vision-and-Language Navigation (VLN) tasks, in which agents follow NL instructions to navigate through an environment, have been actively studied in research communities (MacMahon et al., 2006; Mooney, 2008; Chen and Mooney, 2011; Tellex et al., 2011; Mei et al., 2016; Hermann et al., 2017; Anderson et al., 2018; Das et al., 2018; Thomason et al., 2019; Chen et al., 2019; Shridhar et al., 2020; Qi et al., 2020; Hermann et al., 2020). To encourage the exploration of this challenging research topic, multiple simulated environments have been introduced. Synthetic (Kempka et al., 2016; Beattie et al., 2016; Brodeur et al., 2017; Wu et al., 2018; Savva et al., 2017; Yan et al., 2018; Shah et al., 2018; Puig et al., 2018) as well as real-world and image-based environments (Anderson et al., 2018; Xia et al., 2018; Chen et al., 2019) have been used to provide agents with diverse and complementary training environments.…”
Section: Related Work (mentioning, confidence: 99%)
“…Embodied Language Tasks. A number of 'Embodied AI' tasks combining language, visual perception, and navigation in realistic 3D environments have recently gained prominence, including Interactive and Embodied Question Answering (Das et al., 2018; Gordon et al., 2018), Vision-and-Language Navigation or VLN (Anderson et al., 2018; Chen et al., 2019; Mehta et al., 2020; Qi et al., 2020), and challenges based on household tasks (Puig et al., 2018; Shridhar et al., 2020). While these tasks utilize only a single question or instruction input, several papers have extended the VLN task, in which an agent must follow natural language instructions to traverse a path in the environment, to dialog settings.…”
Section: Related Work (mentioning, confidence: 99%)
“…Some of the typical multimodal intelligent tasks are visual question answering (VQA) [1], which generates answers to natural language questions about the presented image; visual dialog [2], which holds a meaningful question-and-answer (Q&A) dialog about the input image; and image/video captioning, which generates text describing the contents of the input image or video. More advanced multimodal intelligent tasks have also been presented, including embodied question answering (EQA) [3], which assumes an embodied agent moving around in a virtual environment, interactive question answering (IQA) [4], cooperative vision and dialog navigation (CVDN) [5], remote embodied visual referring expression in real indoor environments (REVERIE) [6], and vision and language navigation (VLN) [7].…”
Section: Introduction (mentioning, confidence: 99%)