“…a) Navigation approaches: Traditional approaches to visual navigation first build a 3D metric map of the environment [18], [3] and then use that representation for downstream navigation tasks, which does not lend itself to task-driven, learnable representations that can capture contextual cues. The recent introduction of large-scale indoor environments and simulators [7], [17], [6] has fuelled a slew of learning-based methods for indoor navigation tasks [1] such as point-goal [10], [19], [20], [21], [22], object-goal [23], [24], [25], [26], [27], and image-goal [8], [28], [29]. Modular approaches that incorporate explicit or learned map representations [11], [23], [25] have been shown to outperform end-to-end methods on tasks such as object-goal; however, this is not currently the case for the point-goal task [10], [20].…”