Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1181
Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation

Abstract: Advances in learning and representations have reinvigorated work that connects language to other modalities. A particularly exciting direction is Vision-and-Language Navigation (VLN), in which agents interpret natural language instructions and visual scenes to move through environments and reach goals. Despite recent progress, current research leaves unclear how much of a role language understanding plays in this task, especially because dominant evaluation metrics have focused on goal completion rather than t…
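The abstract's distinction between goal completion and instruction fidelity can be made concrete with a toy comparison. The sketch below is illustrative only: `success_rate` and `path_coverage` are simplified stand-ins for the general idea (a trajectory-level coverage score), not the metric the paper actually defines, and the 3.0 threshold is an arbitrary assumption.

```python
import math

def success_rate(final_pos, goal, threshold=3.0):
    """Goal-completion view: did the agent end near the goal?"""
    return 1.0 if math.dist(final_pos, goal) <= threshold else 0.0

def path_coverage(path, reference_path, threshold=3.0):
    """Fidelity view: how closely does the whole trajectory track the
    reference path? Averages a soft proximity score over reference points.
    (Illustrative coverage score, not the paper's exact metric.)"""
    def dist_to_path(point):
        return min(math.dist(point, p) for p in path)
    return sum(math.exp(-dist_to_path(r) / threshold)
               for r in reference_path) / len(reference_path)

# A detour that still reaches the goal scores perfectly on success rate
# but strictly below 1.0 on coverage.
reference = [(0, 0), (1, 0), (2, 0), (3, 0)]
detour = [(0, 0), (0, 3), (3, 3), (3, 0)]
print(success_rate(detour[-1], reference[-1]))   # 1.0 — goal reached
print(path_coverage(detour, reference) < 1.0)    # True — path not followed
```

This is exactly the failure mode a goal-only metric cannot see: both trajectories "succeed", but only one follows the instructed path.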

Cited by 118 publications (122 citation statements) · References 24 publications
“…Despite recent progress in the area of vision and language, recent work (Jain et al, 2019) in the navigation task (VLN) argues that current research leaves unclear how much of a role language plays in this task. They point out that dominant evaluation metrics have focused on goal completion rather than how each action contributes to the goal.…”
Section: Previous Work
confidence: 99%
“…In a study on VLN tasks [ 7 ], a relatively simple deep neural network model of the sequence-to-sequence (Seq2Seq) type was proposed, in which an action sequence was output from two input sequences with input video stream and natural language instructions, respectively. A few other VLN-related studies [ 9 , 15 , 16 ] presented methods to solve the problem of insufficient R2R datasets for training VLN models. They undertook various data augmentation techniques, including the development of a speaker module to generate additional training data [ 9 ], new training data through environment dropout (eliminating selected objects from the environment) [ 15 ], and more sophisticated task data by concatenating the existing R2R data [ 16 ].…”
Section: Related Work
confidence: 99%
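The sequence-to-sequence interface described in the quote above — instruction and visual observations in, an action sequence out — can be sketched minimally. Everything below (the action set, the bag-of-words "encoder", the additive "decoder") is an assumed stand-in for illustration, not the cited implementation.

```python
# Toy sketch of a Seq2Seq-style VLN interface: condition on an encoded
# instruction plus per-step visual features, emit one action per step.
ACTIONS = ["forward", "left", "right", "stop"]

def encode_instruction(tokens):
    # Stand-in encoder: bag-of-words counts over a tiny assumed vocab.
    vocab = {"go": 0, "left": 1, "right": 2, "stop": 3}
    vec = [0.0] * len(vocab)
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] += 1.0
    return vec

def decode_actions(instr_vec, visual_features):
    # Stand-in decoder: score each action from the instruction encoding
    # plus the current visual feature vector, take the argmax, stop on "stop".
    actions = []
    for feat in visual_features:
        scores = [instr_vec[i] + feat[i] for i in range(len(ACTIONS))]
        a = ACTIONS[max(range(len(ACTIONS)), key=lambda i: scores[i])]
        actions.append(a)
        if a == "stop":
            break
    return actions
```

In a real model both functions would be learned networks; the point here is only the data flow: two input sequences, one output action sequence.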