2022
DOI: 10.48550/arxiv.2203.12667
Preprint

Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

Jing Gu,
Eliana Stefani,
Qi Wu
et al.

Abstract: A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic towards this goal, and has received increasing attention from the natural language processing, computer vision, robotics, and machine learning communities. In this paper, we review contemporary studies in the emerging field of VLN, covering tasks, evaluation m…


Cited by 7 publications (7 citation statements)
References 58 publications
“…Vision-Language Retrieval (VLR) can be used in many applications, such as text-based person search [250] or general object retrieval based on language [251]. Vision-Language Navigation (VLN) [252,253] is a task in which agents learn to navigate in 3D indoor environments by following a given natural language instruction. A benchmark for the popular VLN task can be found at the following leaderboard.…”
Section: Visual
confidence: 99%
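The VLN setting described in the statement above — an agent receives a single natural-language instruction, then emits navigation actions in a 3D environment until it stops — can be sketched as a minimal episode loop. This is an illustrative toy, not any real benchmark's API; the class names, action set, and trivial policy are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative VLN action vocabulary (hypothetical; real benchmarks differ).
ACTIONS = ["forward", "turn_left", "turn_right", "stop"]

@dataclass
class VLNEpisode:
    instruction: str                          # given once, before navigation starts
    goal: str                                 # target location identifier
    path: List[str] = field(default_factory=list)

class FixedStepAgent:
    """Toy policy: walks forward a fixed number of steps, then stops.
    A real VLN agent would condition on the instruction and visual observations."""
    def __init__(self, max_steps: int = 3):
        self.max_steps = max_steps

    def act(self, episode: VLNEpisode, step: int) -> str:
        return "forward" if step < self.max_steps else "stop"

def run_episode(agent: FixedStepAgent, episode: VLNEpisode, horizon: int = 10) -> List[str]:
    """Roll out one episode: the agent acts until it emits 'stop' or the horizon ends."""
    for t in range(horizon):
        action = agent.act(episode, t)
        episode.path.append(action)
        if action == "stop":
            break
    return episode.path

episode = VLNEpisode(
    instruction="Walk down the hallway and stop at the kitchen.",
    goal="kitchen",
)
trajectory = run_episode(FixedStepAgent(), episode)
print(trajectory)  # ['forward', 'forward', 'forward', 'stop']
```

The key structural point the sketch captures is the "communication complexity" distinction made later on this page: here the instruction is fixed at episode start (VLN), whereas in Vision-and-Dialog Navigation the agent could query the human for further instructions mid-episode.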
“…VLN involves interactive cooperation between a human interlocutor and an AI agent, facilitated through dialogue, to orchestrate the agent's maneuvering within an environment [166]. Pashevich et al. [167] configured an episodic transformer to realize the VLN task for autonomous agent interaction with humans and the environment, mediated via visual and textual modalities. Concurrently, Yan et al. [168] introduced a memory vision-voice indoor navigation (MVV-IN) system, enabling humans to guide an AI agent verbally for VLN tasks.…”
Section: Cognition
confidence: 99%
“…Natural-language-grounded visual navigation tasks have drawn increasing research interest in recent years due to their practicality in real life, and they also pose great challenges for vision-language understanding. Depending on the communication complexity [7] between the agent and the human, i.e., whether the navigation instruction is given once or multiple times, natural-language-grounded visual navigation tasks can be divided into two types: Vision-and-Language Navigation (VLN) and Vision-and-Dialog Navigation (VDN).…”
Section: Natural-language-grounded Visual Navigation
confidence: 99%