Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Majumdar, Arjun; Shrivastava, Ayush; Anderson, Peter; Parikh, Devi; Batra, Dhruv

doi:10.1007/978-3-030-58539-6_16

Cited by 138 publications

(140 citation statements)

References 20 publications

Supporting

Mentioning

139

Contrasting

Order By: Relevance

“…Another drawback of the models is the use of a recurrent neural network to model the sequence of words used in natural language instructions, which is unsuitable for parallel processing. To overcome these limitations, some researchers developed pretrained models [ 20 , 21 ] in which natural language instructions and images for the VLN task are embedded together with large-scale benchmark datasets in addition to R2R datasets. VisualBERT [ 22 ], Vision-and-Language BERT (ViLBERT) [ 23 ], Visual-Linguistic BERT (VL-BERT) [ 24 ], and UNiversal Image-TExt Representation (UNITER) [ 25 ], are pretrained models applicable to various vision–language tasks.…”

Section: Related Workmentioning

confidence: 99%

“…VisualBERT [ 22 ], Vision-and-Language BERT (ViLBERT) [ 23 ], Visual-Linguistic BERT (VL-BERT) [ 24 ], and UNiversal Image-TExt Representation (UNITER) [ 25 ], are pretrained models applicable to various vision–language tasks. There are also models pretrained specifically for VLN tasks [ 20 , 21 ]. These VLN-specific models have a simple structure that immediately selects one of the candidate actions because they use only the multimodal context of the concurrently embedded data extracted according to natural language instructions and input images.…”

Section: Related Workmentioning

confidence: 99%

“…Because most natural language instructions provide only partial descriptions of the trajectory to the target position, the agent encounters difficulties in understanding the instructions unless they are efficiently combined with visual features using the alignment and grounding feature. In VLN-related studies [ 7 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 ], various attention mechanisms, such as visual or textual attention, are employed to ensure the alignment and grounding between natural language instruction and input images. However, these attention-based VLN models are trained only to learn the association between natural language instructions and images with a limited number of R2R datasets; moreover, they have difficulties acquiring broader general knowledge of the relationship between natural language instruction and images.…”

Section: Introductionmentioning

confidence: 99%

“…Therefore, it is a great challenge for attention-based VLN models to extract sufficient context from multimodal input data, including natural language instructions and images, to make real-time action decisions. To address this drawback, researchers proposed transformer-based pretrained models [ 20 , 21 ]. Unlike attention-based models, pretrained models are equipped with feature aligning natural language instructions and images, trained with attended masked language modeling (AMKM) and action prediction (AP) tasks based on a transformer neural network before being used for VLN tasks.…”

Section: Introductionmentioning

confidence: 99%

“…Effective path planning and action selection strategies for reaching the target position are crucial for executing VLN tasks. Most VLN-related studies [ 7 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 ] applied search techniques that rely on local scoring of candidate actions, such as greedy local and beam searches. In the case of the greedy area search, the search speed is increased because the search range is limited; however, its success rate is low because it is difficult to return to the original path once a wrong path is selected.…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation

Hwang¹,

Kim²

2021

Sensors

View full text Add to dashboard Cite

Due to the development of computer vision and natural language processing technologies in recent years, there has been a growing interest in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data such as images and text. Vision-and-language navigation (VLN) require the alignment and grounding of multimodal input data to enable real-time perception of the task status on panoramic images and natural language instruction. This study proposes a novel deep neural network model (JMEBS), with joint multimodal embedding and backtracking search for VLN tasks. The proposed JMEBS model uses a transformer-based joint multimodal embedding module. JMEBS uses both multimodal context and temporal context. It also employs backtracking-enabled greedy local search (BGLS), a novel algorithm with a backtracking feature designed to improve the task success rate and optimize the navigation path, based on the local and global scores related to candidate actions. A novel global scoring method is also used for performance improvement by comparing the partial trajectories searched thus far with a plurality of natural language instructions. The performance of the proposed model on various operations was then experimentally demonstrated and compared with other models using the Matterport3D Simulator and room-to-room (R2R) benchmark datasets.

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation

Hwang¹,

Kim²

2021

Sensors

View full text Add to dashboard Cite

show abstract

Reframing explanation as an interactive medium: The EQUAS (Explainable QUestion Answering System) project

et al. 2021

Self Cite

View full text Add to dashboard Cite

This letter is a retrospective analysis of our team's research for the Defense Advanced Research Projects Agency Explainable Artificial Intelligence project. Our initial approach was to use salience maps, English sentences, and lists of feature names to explain the behavior of deep-learning-based discriminative systems, with particular focus on visual question answering systems. We found that presenting static explanations along with answers led to limited positive effects. By exploring various combinations of machine and human explanation production and consumption, we evolved a notion of explanation as an interactive process that takes place usually between humans and artificial intelligence systems but sometimes within the software system. We realized that by interacting via explanations people could task and adapt machine learning (ML) agents. We added affordances for editing explanations and modified the ML system to act in accordance with the edits to produce an interpretable interface to the agent. Through this interface, editing an explanation can adapt a system's performance to new, modified purposes. This deep tasking, wherein the agent knows its objective and the explanation for that objective, will be critical to enable higher levels of autonomy.explainable artificial intelligence (XAI), human/computer interaction (HCI), tasking and adapting agents, visual question answering (VQA)

show abstract

Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting

Xue

Zhang²,

Yu³

et al. 2022

Lecture Notes in Computer Science

View full text Add to dashboard Cite

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Cited by 138 publications

References 20 publications

Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation

Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation

Reframing explanation as an interactive medium: The EQUAS (Explainable QUestion Answering System) project

Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting

Contact Info

Product

Resources

About