Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.62

Learning to Stop: A Simple yet Effective Approach to Urban Vision-Language Navigation

Abstract: Vision-and-Language Navigation (VLN) is a natural language grounding task where an agent learns to follow language instructions and navigate to specified destinations in real-world environments. A key challenge is to recognize and stop at the correct location, especially in complicated outdoor environments. Existing methods treat the STOP action the same as other actions, which results in undesirable behavior: the agent often fails to stop at the destination even though it might be on the right path. There…

Cited by 23 publications (22 citation statements) | References 12 publications
“…Ku et al. (2020) report lower SDTW scores of 21% to 24%. Given this, the TC of 12.8% and SDTW of 1.4% obtained by Retouch-RCONCAT, and the current best results from Xiang et al. (2020) (TC: 19.0%; SDTW: 16.3%), amply demonstrate the challenge of the outdoor navigation problem defined by Touchdown. The greater diversity of the visual environments and the far greater degrees of freedom for navigation thus provide plenty of headroom for future research.…”
Section: Methods
confidence: 81%
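The TC and SDTW figures quoted in the citation statement above are standard outdoor-VLN metrics: TC (task completion) checks whether the agent stops near the goal, while SDTW weights that success by how closely the agent's path matched the reference path under dynamic time warping. A minimal sketch of an SDTW-style score, assuming Euclidean distances, a fixed success radius, and the usual exponential normalization of the DTW cost (the function names here are illustrative, not from the paper):

```python
import math

def dtw(pred, ref):
    # Classic dynamic-time-warping cost between two sequences of 2-D points.
    n, m = len(pred), len(ref)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(pred[i - 1], ref[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def sdtw(pred, ref, success_radius=3.0):
    # nDTW normalizes the DTW cost by reference length and the success radius;
    # SDTW zeroes the score when the agent stops outside that radius.
    ndtw = math.exp(-dtw(pred, ref) / (len(ref) * success_radius))
    success = math.dist(pred[-1], ref[-1]) <= success_radius
    return ndtw if success else 0.0
```

A perfect trajectory scores 1.0, while an agent that ends far from the goal scores 0.0 regardless of path fidelity, which is why the "learning to stop" problem discussed in this paper directly affects SDTW.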
“…Vision-and-Language Navigation (VLN) is a task that requires an agent to achieve a final goal based on given instructions in a 3D environment. Besides the generalizability problem studied by previous works (Wang et al., 2019), the data scarcity problem is another critical issue for the VLN task, especially in outdoor environments (Chen et al., 2019; Mehta et al., 2020; Xiang et al., 2020). Fried et al. (2018) obtain a broad set of augmented training data for VLN by sampling trajectories in the navigation environment and using a Speaker model to back-translate their instructions.…”
Section: Related Work
confidence: 99%
“…There are no previous results for multitask SILGNetHack and SymTD, as they are introduced here. Though not comparable, the manual-stop VisTD SOTA trained using imitation learning on supervised trajectories is 16.7% [56]. State-tracking consistently improves convergence and generalization, even when the correct next step is fully determined by current world observations (e.g.…”
Section: Analyses of Recent Grounded Language RL Modelling Contributions
confidence: 95%