Proceedings of the Third International Workshop on Spatial Language Understanding 2020
DOI: 10.18653/v1/2020.splu-1.7

Retouchdown: Releasing Touchdown on StreetLearn as a Public Resource for Language Grounding Tasks in Street View

Abstract: The Touchdown dataset (Chen et al., 2019) provides instructions by human annotators for navigation through New York City streets and for resolving spatial descriptions at a given location. To enable the wider research community to work effectively with the Touchdown tasks, we are publicly releasing the 29k raw Street View panoramas needed for Touchdown. We follow the process used for the StreetLearn data release (Mirowski et al., 2019) to check panoramas for personally identifiable information and blur them a…

Cited by 28 publications (40 citation statements). References 11 publications.
“…A number of VLN datasets situated in photorealistic 3D reconstructions of real locations contain human instructions or dialogue: R2R (Anderson et al., 2018b), Touchdown (Chen et al., 2019; Mehta et al., 2020), CVDN (Thomason et al., 2019b) and REVERIE. RxR addresses shortcomings of these datasets, in particular multilinguality, scale, fine-grained word grounding, and human follower demonstrations (Table 1).…”
Section: Motivation
confidence: 99%
“…Vision-and-Language Navigation (VLN) tasks require computational agents to mediate the relationship between language, visual scenes and movement. Datasets have been collected for both indoor (Anderson et al., 2018b; Thomason et al., 2019b) and outdoor (Chen et al., 2019; Mehta et al., 2020) environments; success in these is based on clearly defined, objective task completion rather than language- or vision-specific annotations. These VLN tasks fall in the Goldilocks zone: they can be tackled, but not solved, with current methods, and progress on them makes headway on real-world grounded language understanding.…”
Section: Introduction
confidence: 99%
“…Embodied Language Tasks. A number of 'Embodied AI' tasks combining language, visual perception, and navigation in realistic 3D environments have recently gained prominence, including Interactive and Embodied Question Answering (Das et al., 2018; Gordon et al., 2018), Vision-and-Language Navigation or VLN (Anderson et al., 2018; Chen et al., 2019; Mehta et al., 2020; Qi et al., 2020), and challenges based on household tasks (Puig et al., 2018; Shridhar et al., 2020). While these tasks utilize only a single question or instruction input, several papers have extended the VLN task, in which an agent must follow natural language instructions to traverse a path in the environment, to dialog settings.…”
Section: Related Work
confidence: 99%
“…We use the same experimental setup as in Touchdown-SDR, using the scenes provided in the concurrent work (Mehta et al., 2020), where we slice the 360° scene into 8 FoVs covering the scene. We pass each of these FoVs to a pre-trained model (He et al., 2016) and extract features from the fourth-to-last layer (before classification) to get a feature map representation of the FoVs.…”
Section: Localization Experiments
confidence: 99%
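
As a rough illustration of the feature-extraction recipe quoted above, the sketch below slices an equirectangular panorama into 8 horizontal FoV crops and runs each through a pre-trained ResNet, keeping an intermediate spatial feature map. The crop geometry, the ResNet-18 variant, input resolution, and the layer kept are illustrative assumptions, not the exact setup of the cited work or the Retouchdown release.

```python
# Hedged sketch: split a 360-degree equirectangular panorama into 8 FoV crops
# and extract a spatial feature map per crop with a pre-trained ResNet.
# Crop geometry, model variant, and chosen layer are assumptions for illustration.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

def slice_panorama(pano: Image.Image, num_fovs: int = 8):
    """Split a panorama into equal-width horizontal crops."""
    width, height = pano.size
    fov_width = width // num_fovs
    return [pano.crop((i * fov_width, 0, (i + 1) * fov_width, height))
            for i in range(num_fovs)]

# Pre-trained ResNet-18; drop the average pool and classification head so the
# output is a spatial feature map rather than class logits.
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def panorama_features(pano: Image.Image) -> torch.Tensor:
    """Return a (num_fovs, C, H, W) tensor of feature maps, one per FoV crop."""
    crops = slice_panorama(pano)
    batch = torch.stack([preprocess(c) for c in crops])  # (8, 3, 224, 224)
    return feature_extractor(batch)                      # (8, 512, 7, 7) for ResNet-18

# Example usage with a hypothetical panorama file:
# pano = Image.open("panorama.jpg").convert("RGB")
# feats = panorama_features(pano)
```

The resulting per-FoV feature maps preserve spatial layout within each crop, which is what a downstream localization or grounding model would attend over; the specific downstream use is not shown here.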