2022
DOI: 10.48550/arxiv.2203.12667
Preprint

Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

Jing Gu,
Eliana Stefani,
Qi Wu
et al.

Abstract: A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic towards this goal, and has received increasing attention from the natural language processing, computer vision, robotics, and machine learning communities. In this paper, we review contemporary studies in the emerging field of VLN, covering tasks, evaluation m…


Cited by 7 publications (7 citation statements)
References 58 publications
“…Vision-Language Retrieval (VLR) can be used in many applications, such as text-based person search [250] or general object retrieval based on language [251]. Vision-Language Navigation (VLN) [252,253] is a task in which agents learn to navigate in 3D indoor environments by following a given natural language instruction. A benchmark for the popular VLN task can be found at the following leaderboard.…”
Section: Visual
confidence: 99%
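The VLN setting described in the statement above — an agent receives a single natural-language instruction, then emits navigation actions in a 3D environment until it stops — can be sketched as a minimal episode loop. This is an illustrative toy, not any real benchmark's API; the class names, action set, and trivial policy are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative VLN action vocabulary (hypothetical; real benchmarks differ).
ACTIONS = ["forward", "turn_left", "turn_right", "stop"]

@dataclass
class VLNEpisode:
    instruction: str                          # given once, before navigation starts
    goal: str                                 # target location identifier
    path: List[str] = field(default_factory=list)

class FixedStepAgent:
    """Toy policy: walks forward a fixed number of steps, then stops.
    A real VLN agent would condition on the instruction and visual observations."""
    def __init__(self, max_steps: int = 3):
        self.max_steps = max_steps

    def act(self, episode: VLNEpisode, step: int) -> str:
        return "forward" if step < self.max_steps else "stop"

def run_episode(agent: FixedStepAgent, episode: VLNEpisode, horizon: int = 10) -> List[str]:
    """Roll out one episode: the agent acts until it emits 'stop' or the horizon ends."""
    for t in range(horizon):
        action = agent.act(episode, t)
        episode.path.append(action)
        if action == "stop":
            break
    return episode.path

episode = VLNEpisode(
    instruction="Walk down the hallway and stop at the kitchen.",
    goal="kitchen",
)
trajectory = run_episode(FixedStepAgent(), episode)
print(trajectory)  # ['forward', 'forward', 'forward', 'stop']
```

The key structural point the sketch captures is the "communication complexity" distinction made later on this page: here the instruction is fixed at episode start (VLN), whereas in Vision-and-Dialog Navigation the agent could query the human for further instructions mid-episode.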
“…VLN involves interactive cooperation between a human interlocutor and an AI agent, facilitated through dialogue, to orchestrate the agent's maneuvering within an environment [166]. Pashevich et al. [167] configured an episodic transformer to realize the VLN task for autonomous agent interaction with humans and the environment, mediated via visual and textual modalities. Concurrently, Yan et al. [168] introduced a memory vision-voice indoor navigation (MVV-IN) system, enabling humans to guide an AI agent verbally for VLN tasks.…”
Section: Cognition
confidence: 99%
“…Natural-language-grounded visual navigation tasks have drawn increasing research interest in recent years due to their practicality in real life, and they also pose great challenges for vision-language understanding. Depending on the communication complexity [7] between the agent and the human, i.e., whether the navigation instruction is given once or multiple times, natural-language-grounded visual navigation tasks can be divided into two types: Vision-and-Language Navigation (VLN) and Vision-and-Dialog Navigation (VDN).…”
Section: Natural-language-grounded Visual Navigation
confidence: 99%