Structured Scene Memory for Vision-Language Navigation

Wang, Hanqing; Yang, Yi; Liang, Wei; Xiong, Caiming

doi:10.1109/cvpr46437.2021.00835

Cited by 77 publications

(54 citation statements)

References 89 publications

(149 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Exploration and language grounding are two essential abilities for VLN agents. However, existing works either only allow for local actions A t [13][14][15] which hinders long-range action planning, or lack object representations O t [8,19,20] which might be insufficient for fine-grained grounding. Our work addresses both issues with a dualscale representation and global action planning.…”

Section: Methodsmentioning

confidence: 99%

“…Therefore, several works [38,39] propose to represent the map as topological structures for pre-exploring environments [40], or for back-tracking to other locations, tradingoff navigation accuracy with the path length [10,24]. A few recent VLN works [8,19,20] used topological maps to support global action planning, but they suffer from using recurrent architectures for state tracking and also lack a fine-scale representation for language grounding as shown in Figure 2. We address the above limitations via a dualscale graph transformer with topological maps.…”

Section: Related Workmentioning

confidence: 99%

“…This can be explained by the fact that for map-based approaches backtracking is encouraged which makes the trajectory length longer. We further compare a coarse-scale DUET for fair comparison with previous graph-based approaches [8,19,20] which do not use a fine-scale encoder. Even without using the fine-scale representation, DUET still outperform them by a margin, showing the effectiveness of our graph transformer.…”

Section: Comparison With State Of the Artmentioning

confidence: 99%

“…For example, the agent is able to select a long-term goal from all navigable locations in the map, and then uses the map to calculate a shortest path to the goal. Topological maps have been explored by previous VLN works [8,19,20]. These methods, however, still fall short in two aspects.…”

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

Chen,

Guhur,

Tapaswi

et al. 2022

Preprint

View full text Add to dashboard Cite

Following language instructions to navigate in unseen environments is a challenging problem for autonomous embodied agents. The agent not only needs to ground languages in visual scenes, but also should explore the environment to reach its target. In this work, we propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding. We build a topological map on-the-fly to enable efficient exploration in global action space. To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers. The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation (VLN) benchmarks REVERIE and SOON. It also improves the success rate on the fine-grained VLN benchmark R2R.

show abstract

Section: Methodsmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Section: Comparison With State Of the Artmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

Chen,

Guhur,

Tapaswi

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

“…Another line of research was trying to improve the unseen generalization through either pre-training [13,10,23], auxiliary supervision [21,32,33,29], or training-data processing [26,9,24]. The basic Seq2Seq structure had also been improved by introducing cross-modal attention [30] and fine-grained relationship [12], utilizing the semantic or syntactic information of languages [25,19], reformulating under a Bayesian framework [2], and combining longrange memory for global decision [6,27]. Besides R2R, some later works also proposed more challenging datasets like Room-for-Room (R4R) [15], TOUCHDOWN [4], and Room-across-Room [17] (RxR).…”

Section: Related Workmentioning

confidence: 99%

Rethinking the Spatial Route Prior in Vision-and-Language Navigation

Zhou¹,

Liu²,

Mu³

2021

Preprint

View full text Add to dashboard Cite

Vision-and-language navigation (VLN) is a trending topic which aims to navigate an intelligent agent to an expected position through natural language instructions. This work addresses the task of VLN from a previouslyignored aspect, namely the spatial route prior of the navigation scenes. A critically enabling innovation of this work is explicitly considering the spatial route prior under several different VLN settings. In a most information-rich case of knowing environment maps and admitting shortestpath prior, we observe that given an origin-destination node pair, the internal route can be uniquely determined. Thus, VLN can be effectively formulated as an ordinary classification problem over all possible destination nodes in the scenes. Furthermore, we relax it to other more general VLN settings, proposing a sequential-decision variant (by abandoning the shortest-path route prior) and an exploreand-exploit scheme (for addressing the case of not knowing the environment maps) that curates a compact and informative sub-graph to exploit. As reported by [34], the performance of VLN methods has been stuck at a plateau in past two years. Even with increased model complexity, the state-of-the-art success rate on R2R validation-unseen set has stayed around 62% for single-run and 73% for beam-search. We have conducted comprehensive evaluations on both R2R and R4R, and surprisingly found that utilizing the spatial route priors may be the key of breaking above-mentioned performance ceiling. For example, on R2R validation-unseen set, when the number of discrete nodes explored is about 40, our single-model success rate reaches 73%, and increases to 78% if a Speaker model is ensembled, which significantly outstrips previous state-ofthe-art VLN-BERT with 3 models ensembled.

show abstract

Bridging the Visual Semantic Gap in VLN via Semantically Richer Instructions

Ossandón

Earle

Soto

2022

Lecture Notes in Computer Science

View full text Add to dashboard Cite

The Visual-and-Language Navigation (VLN) task requires understanding a textual instruction to navigate a natural indoor environment using only visual information. While this is a trivial task for most humans, it is still an open problem for AI models. In this work, we hypothesize that poor use of the visual information available is at the core of the low performance of current models. To support this hypothesis, we provide experimental evidence showing that state-of-the-art models are not severely affected when they receive just limited or even no visual data, indicating a strong overfitting to the textual instructions. To encourage a more suitable use of the visual information, we propose a new data augmentation method that fosters the inclusion of more explicit visual information in the generation of textual navigational instructions. Our main intuition is that current VLN datasets include textual instructions that are intended to inform an expert navigator, such as a human, but not a beginner visual navigational agent, such as a randomly initialized DL model. Specifically, to bridge the visual semantic gap of current VLN datasets, we take advantage of metadata available for the Matter-port3D dataset that, among others, includes information about object labels that are present in the scenes. Training a state-of-the-art model with the new set of instructions increase its performance by 8% in terms of success rate on unseen environments, demonstrating the advantages of the proposed data augmentation method.

show abstract

Structured Scene Memory for Vision-Language Navigation

Cited by 77 publications

References 89 publications

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

Rethinking the Spatial Route Prior in Vision-and-Language Navigation

Bridging the Visual Semantic Gap in VLN via Semantically Richer Instructions

Contact Info

Product

Resources

About