“…Extensive work has demonstrated that, after learning with indirect supervision from a reward function, rich representations for their task automatically emerge (Bansal et al, 2017;Lowe et al, 2017;Jaderberg et al, 2019). Several recent works have created 3D embodiment simulation environment (Kolve et al, 2017;Brodeur et al, 2017;Savva et al, 2017;Das et al, 2018;Xia et al, 2018;Savva et al, 2019) for navigation and visual question answering tasks. To train these models, visual navigation is often framed as a reinforcement learning problem (Chen et al, 2015;Giusti et al, 2015;Oh et al, 2016;Abel et al, 2016;Bhatti et al, 2016;Daftry et al, 2016;Mirowski et al, 2016;Brahmbhatt & Hays, 2017;Zhang et al, 2017a;Zhu et al, 2017a;Kahn et al, 2018).…”