“…Recently, Vision-and-Language Navigation (VLN) tasks, in which agents follow NL instructions to navigate through an environment, have been actively studied in research communities (MacMahon et al., 2006; Mooney, 2008; Chen and Mooney, 2011; Tellex et al., 2011; Mei et al., 2016; Hermann et al., 2017; Anderson et al., 2018; Das et al., 2018; Thomason et al., 2019; Chen et al., 2019; Shridhar et al., 2020; Qi et al., 2020; Hermann et al., 2020). To encourage the exploration of this challenging research topic, multiple simulated environments have been introduced. Synthetic (Kempka et al., 2016; Beattie et al., 2016; Brodeur et al., 2017; Wu et al., 2018; Savva et al., 2017; Yan et al., 2018; Shah et al., 2018; Puig et al., 2018) as well as real-world and image-based environments (Anderson et al., 2018; Xia et al., 2018; Chen et al., 2019) have been used to provide agents with diverse and complementary training environments.…”