Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)
DOI: 10.18653/v1/n19-1197
Shifting the Baseline: Single Modality Performance on Visual Navigation & QA

Abstract: We demonstrate the surprising strength of unimodal baselines in multimodal domains, and make concrete recommendations for best practices in future research. Where existing work often compares against random or majority class baselines, we argue that unimodal approaches better capture and reflect dataset biases and therefore provide an important comparison when assessing the performance of multimodal techniques. We present unimodal ablations on three recent datasets in visual navigation and QA, seeing an up to …
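One way to operationalize the abstract's recommendation is to report a multimodal model's gain over its strongest unimodal ablation rather than over chance or the majority class. The sketch below is a minimal illustration with invented scores; the function and its inputs are assumptions for the example, not the authors' code.

```python
def shifted_gain(multimodal_score, unimodal_scores, chance_score):
    """Compare a multimodal model against the strongest unimodal
    baseline, not just against chance or the majority class."""
    best_name = max(unimodal_scores, key=unimodal_scores.get)
    best_score = unimodal_scores[best_name]
    return {
        "gain_over_chance": multimodal_score - chance_score,
        "gain_over_best_unimodal": multimodal_score - best_score,
        "best_unimodal": best_name,
    }

# Invented numbers for illustration only.
print(shifted_gain(0.52,
                   {"vision_only": 0.33, "language_only": 0.45},
                   chance_score=0.20))
```

A model that looks strong against chance (here, +32 points) can look far weaker against its best unimodal ablation (+7 points), which is exactly the "shifted baseline" the paper argues for.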

Cited by 64 publications (50 citation statements); references 25 publications.

Citation statements (ordered by relevance):
“…R2R paths span 4-6 edges and are the shortest paths from start to goal. Thomason et al (2019a) showed that agents can exploit effective priors over R2R paths, and showed that R2R paths encourage goal seeking. [Figure 2: Given the panorama navigation graph P with room graph R in Figure 2a, we sample a simple room path (r_0, r_2, r_3) inducing the subgraph in Figure 2b.]…”
Section: Motivation (mentioning)
confidence: 99%
“…Unimodal Ablations: Table 7 reports the performance of the multilingual agent under settings in which we ablate either the vision or the language inputs during both training and evaluation, as advocated by Thomason et al (2019a). The multimodal agent (4) outperforms both the language-only agent (9) and the vision-only agent (10), indicating that both modalities contribute to performance.…”
Section: Multitask and Transfer Learning (mentioning)
confidence: 99%
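As a rough sketch of the protocol this excerpt describes, where the same modality is zeroed out during both training and evaluation, the Python below uses a toy agent with invented feature sizes and action space; none of it reflects the cited implementation.

```python
import torch
from torch import nn

class ToyMultimodalAgent(nn.Module):
    """Toy stand-in for an instruction-following agent that fuses
    vision and language features into action logits."""
    def __init__(self, vis_dim=2048, lang_dim=300, n_actions=6):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + lang_dim, n_actions)

    def forward(self, batch):
        x = torch.cat([batch["vision"], batch["language"]], dim=-1)
        return self.fuse(x)

def run_epoch(model, batches, optimizer=None, ablate=None):
    """One pass over `batches`, zeroing the `ablate` modality in every
    batch so the ablation is identical at train and eval time."""
    loss_fn = nn.CrossEntropyLoss()
    total_loss = 0.0
    for batch, actions in batches:
        if ablate is not None:
            batch = {k: torch.zeros_like(v) if k == ablate else v
                     for k, v in batch.items()}
        loss = loss_fn(model(batch), actions)
        if optimizer is not None:  # training pass only
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        total_loss += loss.item()
    return total_loss / len(batches)

# Synthetic data standing in for a navigation dataset.
batches = [({"vision": torch.randn(8, 2048),
             "language": torch.randn(8, 300)},
            torch.randint(0, 6, (8,))) for _ in range(4)]

agent = ToyMultimodalAgent()
opt = torch.optim.SGD(agent.parameters(), lr=0.01)
run_epoch(agent, batches, optimizer=opt, ablate="language")  # vision-only training
vision_only_loss = run_epoch(agent, batches, ablate="language")  # matching evaluation
```

Applying the same ablation at train and test time matters: an agent trained multimodally but evaluated with one modality zeroed measures robustness to missing inputs, not the unimodal baseline the quote describes.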
“…There have also been concerns about structural biases present in these datasets which may provide hidden shortcuts to agents training on these problems. Thomason et al (2019) presented an analysis on R2R dataset, where the trained agent continued to perform surprisingly well in the absence of language inputs.…”
Section: Room-to-Room (R2R) (mentioning)
confidence: 99%
“…Biases in VQA datasets: A growing body of work points to the existence of biases in popular VQA datasets (Agrawal et al, 2016; Zhang et al, 2016; Jabri et al, 2016; Goyal et al, 2017; Johnson et al, 2017; Chao et al, 2018; Thomason et al, 2019). In VQA v1 (Antol et al, 2015), for instance, for questions of the form, "What sport is...?"…”
Section: Related Work (mentioning)
confidence: 99%
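The question-type prior this excerpt alludes to is easy to quantify with a majority-class baseline conditioned on a question prefix. The sketch below is an illustrative toy, not the method of any cited work; the three-token prefix heuristic and the example data are assumptions.

```python
from collections import Counter, defaultdict

def question_type(question, n_tokens=3):
    """Crude question-type key: the first few tokens of the question."""
    return " ".join(question.lower().split()[:n_tokens])

def fit_majority_baseline(qa_pairs):
    """Most frequent training answer per question type: a language-only
    baseline that exploits priors like 'what sport is' -> a single
    dominant answer, with no access to the image at all."""
    answers_by_type = defaultdict(Counter)
    for question, answer in qa_pairs:
        answers_by_type[question_type(question)][answer] += 1
    return {t: counts.most_common(1)[0][0]
            for t, counts in answers_by_type.items()}

# Toy data illustrating the kind of prior the quote describes.
train_pairs = [
    ("What sport is the man playing?", "tennis"),
    ("What sport is shown on TV?", "tennis"),
    ("What sport is this?", "baseball"),
    ("What color is the bus?", "red"),
]
baseline = fit_majority_baseline(train_pairs)
print(baseline[question_type("What sport is being played?")])  # -> tennis
```

If a baseline like this scores well above chance, the dataset rewards answering from the question alone, which is the bias the cited analyses document.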