“…A main line of research in VLN utilizes soft attention over individual words for cross-modal grounding in both the natural language instruction and the visual scene (Wang et al., 2018, 2019; Tan et al., 2019; Landi et al., 2019; Xia et al., 2020; Wang et al., 2020a,b; Xiang et al., 2020; Zhu et al., 2020b). Other works improve vision and language representations (Hu et al., 2019; Li et al., 2019; Huang et al., 2019a,b; Hao et al., 2020; Majumdar et al., 2020), propose an additional progress monitor module (Ma et al., 2019a,b; Ke et al., 2019), or introduce object- and action-aware modules (Qi et al., 2020b) that aid co-grounding.…”