Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
DOI: 10.18653/v1/2022.naacl-main.438
Diagnosing Vision-and-Language Navigation: What Really Matters

Abstract: Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments. Multiple setups have been proposed, and researchers apply new model architectures or training techniques to boost navigation performance. However, non-negligible gaps remain between machines' performance and human benchmarks. Moreover, the agents' inner mechanisms for navigation decisions remain unclear. To the best of our knowledge, how the agents perc…

Cited by 26 publications (12 citation statements)
References 55 publications
“…While task completion drops significantly when direction tokens are masked, the agent still performs at a high level. This finding is surprising and at odds with Zhu et al. (2021a), who report that task completion drops nearly to zero when direction tokens are masked during testing only. We believe that in our setting (masking during both training and testing), the model learns to infer the correct directions from redundancies in the instructions or from the context around the direction tokens.…”
Section: Token Masking (mentioning)
confidence: 57%
“…To analyze the importance of direction and object tokens in the navigation instructions, we run masking experiments similar to Zhu et al. (2021a), except that we mask the tokens during both training and testing instead of during testing only. Figure 4 shows the resulting task completion rates for an increasing number of masked direction or object tokens.…”
Section: Token Masking (mentioning)
confidence: 99%
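To make the masking setup in this citation statement concrete, here is a minimal Python sketch of masking direction tokens in an instruction, applied identically at training and test time. The direction vocabulary, mask symbol, and function name are illustrative assumptions, not taken from either paper.

import random

# Illustrative direction vocabulary -- an assumption, not the exact
# token list used by Zhu et al. (2021a) or the citing paper.
DIRECTION_TOKENS = {"left", "right", "forward", "straight", "around", "turn", "up", "down"}
MASK = "[MASK]"

def mask_direction_tokens(instruction: str, n_masked: int, seed: int = 0) -> str:
    """Replace up to `n_masked` direction tokens in an instruction with [MASK].

    In the setup described above, this is applied during both training and
    testing, so the model can learn to recover directions from the
    surrounding context instead of simply failing on unseen masks.
    """
    rng = random.Random(seed)
    tokens = instruction.lower().split()
    candidates = [i for i, tok in enumerate(tokens) if tok in DIRECTION_TOKENS]
    for i in rng.sample(candidates, min(n_masked, len(candidates))):
        tokens[i] = MASK
    return " ".join(tokens)

# Example: two of the three direction tokens get masked.
print(mask_direction_tokens("Turn left and walk straight to the kitchen", n_masked=2))

Masking at training time as well as test time is what lets the model exploit redundancy: it sees masked instructions during learning and can adapt, rather than encountering masks for the first time at evaluation.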
“…High-level features such as visual appearance, route structure, and detected objects outperform the low-level visual features extracted by a CNN (Hu et al., 2019). Different types of tokens within the instruction also function differently (Zhu et al., 2021b). Extracting these tokens and encoding the object tokens and direction tokens are crucial (Zhu et al., 2021b).…”
Section: Semantic Understanding (mentioning)
confidence: 99%
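A rough sketch of what extracting the two token types could look like, so that direction and object tokens can be encoded by separate streams; the keyword lists and function are hypothetical stand-ins for whatever tagger the cited works actually use.

# Hypothetical keyword lists -- stand-ins for the direction/object
# extraction method in Zhu et al. (2021b), which may differ.
DIRECTION_TOKENS = {"left", "right", "forward", "straight", "turn"}
OBJECT_TOKENS = {"door", "table", "kitchen", "stairs", "chair", "sofa"}

def split_token_types(instruction: str):
    """Partition instruction token indices into direction/object/other groups,
    so each group can be embedded or encoded separately downstream."""
    directions, objects, other = [], [], []
    for i, tok in enumerate(instruction.lower().split()):
        if tok in DIRECTION_TOKENS:
            directions.append(i)
        elif tok in OBJECT_TOKENS:
            objects.append(i)
        else:
            other.append(i)
    return directions, objects, other

print(split_token_types("Turn left at the sofa and stop by the door"))
# -> ([0, 1], [4, 9], [2, 3, 5, 6, 7, 8])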