“…A main line of research in VLN utilizes soft attention over individual words for cross-modal grounding in both the natural language instruction and the visual scene (Wang et al., 2018, 2019; Tan et al., 2019; Landi et al., 2019; Xia et al., 2020; Wang et al., 2020a,b; Xiang et al., 2020; Zhu et al., 2020b). Other works improve vision and language representations (Hu et al., 2019; Li et al., 2019; Huang et al., 2019a,b; Hao et al., 2020; Majumdar et al., 2020), propose an additional progress monitor module (Ma et al., 2019a,b; Ke et al., 2019), or introduce object- and action-aware modules (Qi et al., 2020b) that aid co-grounding.…”