2020
DOI: 10.48550/arxiv.2004.14973
Preprint

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Cited by 14 publications (21 citation statements)
References: 0 publications

“…Parisotto & Salakhutdinov (2017) investigate a memory system to navigate in mazes. Some methods (Sepulveda et al., 2018; Chen et al., 2019; Savinov et al., 2018) use both visual features and the topological guidance of scenes for navigation, while natural-language instructions are employed to guide an agent to route among rooms (Anderson et al., 2018b; Wang et al., 2019; Deng et al., 2020; Hu et al., 2019; Majumdar et al., 2020; Hao et al., 2020). We notice that transformer architectures are also employed by Hao et al. (2020) […] and thus not the most prominent ones across the feature pyramid.…”
Section: Related Work (mentioning)
confidence: 94%
“…PRESS [33] applies the pre-trained BERT to process instructions. PREVALENT [18] pre-trains an encoder with image-text-action triplets to align the language and visual states, while VLN-BERT [39] fine-tunes ViLBERT [34] with trajectory-instruction pairs. Hong et al. [23] implement a recurrent function to leverage history-dependent state representations based on previous models.…”
Section: Related Work (mentioning)
confidence: 99%
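
The statement above describes pre-training on image-text-action triplets and a recurrent function that carries history-dependent state across navigation steps. As a rough illustration of the latter idea only, here is a minimal PyTorch sketch of such a recurrent state update; the class name, fusion layer, GRU cell, and dimensions are all assumptions made for this example and are not taken from PREVALENT, VLN-BERT, or Hong et al.

```python
# Illustrative sketch only: a history-dependent recurrent state for a
# VLN agent. Names, dimensions, and the GRU choice are assumptions,
# not the architecture of any paper cited above.
import torch
import torch.nn as nn

class RecurrentVLNState(nn.Module):
    def __init__(self, lang_dim=768, vis_dim=2048, hidden_dim=512):
        super().__init__()
        # Fuse the attended instruction feature with the current visual feature.
        self.fuse = nn.Linear(lang_dim + vis_dim, hidden_dim)
        # A GRU cell carries the history-dependent state across timesteps.
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, lang_feat, vis_feat, prev_state):
        # lang_feat:  (B, lang_dim)   attended instruction representation
        # vis_feat:   (B, vis_dim)    current visual observation feature
        # prev_state: (B, hidden_dim) state carried over from the last step
        x = torch.relu(self.fuse(torch.cat([lang_feat, vis_feat], dim=-1)))
        return self.rnn(x, prev_state)  # (B, hidden_dim) updated state
```

At each step the agent would call this module with its currently attended instruction feature and the current visual feature, threading the returned state into the next step.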
“…A main line of research in VLN utilizes soft attention over individual words for cross-modal grounding in both the natural language instruction and the visual scene (Wang et al., 2018, 2019; Tan et al., 2019; Landi et al., 2019; Xia et al., 2020; Wang et al., 2020b,a; Xiang et al., 2020; Zhu et al., 2020b). Other works improve vision and language representations (Hu et al., 2019; Li et al., 2019; Huang et al., 2019b,a; Hao et al., 2020; Majumdar et al., 2020) and propose an additional progress monitor module (Ma et al., 2019b,a; Ke et al., 2019) and object and action aware modules (Qi et al., 2020b) that aid co-grounding.…”
Section: Related Work (mentioning)
confidence: 99%
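
Several of the works cited above rely on soft attention over individual instruction words conditioned on the agent's state. The sketch below is one minimal way such word-level soft attention could look in PyTorch; the module name, the single linear query projection, and the dimensions are illustrative assumptions rather than the formulation of any cited paper.

```python
# Illustrative sketch only: word-level soft attention for cross-modal
# grounding. The projection and dimensions are assumptions, not the
# exact formulation of any paper cited above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordSoftAttention(nn.Module):
    def __init__(self, state_dim=512, word_dim=768):
        super().__init__()
        # Project the agent state into the word-feature space to form a query.
        self.query = nn.Linear(state_dim, word_dim, bias=False)

    def forward(self, state, word_feats, mask=None):
        # state:      (B, state_dim)    current agent state
        # word_feats: (B, L, word_dim)  per-word instruction features
        # mask:       (B, L) bool       True for real tokens, False for padding
        q = self.query(state).unsqueeze(2)            # (B, word_dim, 1)
        scores = torch.bmm(word_feats, q).squeeze(2)  # (B, L)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)              # soft weights over words
        grounded = torch.bmm(attn.unsqueeze(1), word_feats).squeeze(1)
        return grounded, attn                         # (B, word_dim), (B, L)
```

A caller would pass the current agent state and the per-word instruction features (plus an optional boolean padding mask) and receive the attended instruction summary together with the attention weights.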
“…Recently, several approaches were proposed to solve the Vision-Language Navigation task with better interactions between natural language instructions and visual scenes (Fried et al., 2018; Wang et al., 2019; Landi et al., 2019; Wang et al., 2020a; Huang et al., 2019a; Hu et al., 2019; Majumdar et al., 2020; Ma et al., 2019a; Qi et al., 2020b; Zhu et al., 2020a,c). Some approaches utilize soft attention over individual words for better cross-modal grounding, while others improve co-grounding with better language and vision representations and an additional alignment module.…”
Section: Introduction (mentioning)
confidence: 99%