2020
DOI: 10.1007/978-3-030-58539-6_16
Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Cited by 138 publications (140 citation statements)
References 20 publications
“…Another drawback of these models is their use of a recurrent neural network to model the sequence of words in natural-language instructions, which is unsuitable for parallel processing. To overcome these limitations, some researchers developed pretrained models [20, 21] in which natural-language instructions and images for the VLN task are embedded together, using large-scale benchmark datasets in addition to the R2R dataset. VisualBERT [22], Vision-and-Language BERT (ViLBERT) [23], Visual-Linguistic BERT (VL-BERT) [24], and UNiversal Image-TExt Representation (UNITER) [25] are pretrained models applicable to various vision–language tasks.…”
Section: Related Work
Citation type: mentioning
Confidence: 99%
“…VisualBERT [22], Vision-and-Language BERT (ViLBERT) [23], Visual-Linguistic BERT (VL-BERT) [24], and UNiversal Image-TExt Representation (UNITER) [25] are pretrained models applicable to various vision–language tasks. There are also models pretrained specifically for VLN tasks [20, 21]. These VLN-specific models have a simple structure that directly selects one of the candidate actions, because they use only the multimodal context extracted from the jointly embedded natural-language instructions and input images.…”
Section: Related Work
Citation type: mentioning
Confidence: 99%