A Dual Semantic-Aware Recurrent Global-Adaptive Network for Vision-and-Language Navigation

Wang, Liuyi; He, Zongtao; Tang, Juipeng; Dang, Ronghao; Wang, Naijia; Liu, Chengju; Chen, Qijun

doi:10.24963/ijcai.2023/164

Cited by 3 publications

(6 citation statements)

References 11 publications

“…Our model shows superior performance across multiple metrics. Notably, compared with the previous state-of-the-art DSRG [29], CausalVLN gains significant improvements in SR (↑ 2.61%) and RGS (↑ 2.13%) on the validation unseen set. Similar enhancements are also observed on the validation seen and test unseen sets.…”

Section: B Implementation Details (mentioning)

confidence: 86%

“…Finally, we utilize the memory-augmented global-local crossmodal fusion module from our previous work DSRG [29] to enable the agent to align and leverage features from different modalities, capturing valuable historical cues throughout the navigation.…”

Section: … … (mentioning)

confidence: 99%

“…SEvol [45] utilized the graph to construct relationships of objects. Our previous work DSRG [29] proposed a dual semantic-augmented module to model the semantics explicitly. Although previous methods have highlighted the importance of data augmentation and the leverage of semantic cues, we argue that these approaches are necessary but insufficient.…”

Section: A Vision-and-language Navigation (mentioning)

confidence: 99%

“…Therefore, it is essential to accurately represent the agent's state. In this paper, we employ the memory-augmented global-local crossmodal fusion proposed by our previous work [29], consisting of a global adaptive aggregation (GAA) method, a crossmodal encoder, and a recurrent memory fusion (RMF). GAA computes the visual representation at the t-th step as:…”

Section: F Memory-augmented Global-local Cross-modal Fusion (mentioning)

confidence: 99%

“…We adopt a similar configuration to DSRG [29], with 9 layers for the language encoder, 2 layers for the vision encoder, and 4 layers for the cross-modal encoder. To intervene with the object confounders, we use the object features extracted by the bottom-up attention model provided by [16].…”

Section: B Implementation Details (mentioning)

confidence: 99%
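
The encoder depths quoted in the statement above (9 language layers, 2 vision layers, 4 cross-modal layers) can be written down as a small configuration object. This is only an illustrative sketch: the class name `EncoderDepths` and its field names are hypothetical and do not come from the DSRG or CausalVLN code; only the three layer counts are taken from the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderDepths:
    """Hypothetical container for the layer counts quoted in the citation."""

    language_layers: int = 9      # language encoder depth (from the quote)
    vision_layers: int = 2        # vision encoder depth (from the quote)
    cross_modal_layers: int = 4   # cross-modal encoder depth (from the quote)

    def total(self) -> int:
        # Sum of layers across the three encoders.
        return self.language_layers + self.vision_layers + self.cross_modal_layers


depths = EncoderDepths()
print(depths)
```

Freezing the dataclass keeps the quoted values from being changed by accident when the object is passed around.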

RES-StS: Referring Expression Speaker via Self-Training With Scorer for Goal-Oriented Vision-Language Navigation

Wang, Dang, et al. 2023

IEEE Trans. Circuits Syst. Video Technol.

Vision-and-Language Navigation (VLN) has gained significant research interest in recent years due to its potential applications in real-world scenarios. However, existing VLN methods struggle with the issue of spurious associations, resulting in poor generalization with a significant performance gap between seen and unseen environments. In this paper, we tackle this challenge by proposing a unified framework CausalVLN based on the causal learning paradigm to train a robust navigator capable of learning unbiased feature representations. Specifically, we establish reasonable assumptions about confounders for vision and language in VLN using the structured causal model (SCM). Building upon this, we propose an iterative backdoor-based representation learning (IBRL) method that allows for the adaptive and effective intervention on confounders. Furthermore, we introduce the visual and linguistic backdoor causal encoders to enable unbiased feature expression for multi-modalities during training and validation, enhancing the agent's capability to generalize across different environments. Experiments on three VLN datasets (R2R, RxR, and REVERIE) showcase the superiority of our proposed method over previous state-of-the-art approaches. Moreover, detailed visualization analysis demonstrates the effectiveness of CausalVLN in significantly narrowing down the performance gap between seen and unseen environments, underscoring its strong generalization capability.
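
The abstract refers to backdoor-based intervention on confounders without stating a formula. As general background only (this is the textbook backdoor adjustment, not the paper's IBRL formulation, which is not reproduced here), for an input X, an outcome Y, and an observed confounder Z that satisfies the backdoor criterion, the interventional distribution is usually written as follows. The symbols X, Y, and Z are generic placeholders assumed for illustration.

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z} P(Y \mid X, z)\, P(z)
```

Compared with the observational quantity P(Y | X) = \sum_z P(Y | X, z) P(z | X), the adjustment weights each confounder value by its prior P(z) rather than by P(z | X), which removes the dependence between the confounder and the input.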
