“…Audio-Visual Learning: Recent research bridges audio and vision for various cross-modal learning tasks. Existing methods have achieved remarkable performance in audio-visual action recognition [52,40,58], audio-visual correspondence [8,6,7], audio-visual synchronization [68,55,102], visual sound separation [31,93,92,99,39,100,79,101], visual-to-auditory [98,36,96,71,86], audio spatialisation [23,88,67,96,38,71,86,69], and audio-visual navigation [16,17,18,37,27,15,63]. In this work, we leverage audio-visual learning to better perceive the geometrical structure of the environment.…”