“…Audio-visual spatial correspondence learning. Learning the spatial alignment between video and audio is important for self-supervision [77,50,75,66], spatial audio generation [51,21,63,7,45], audio-visual embodied learning [8,44,46,9], and 3D scene mapping [62,47]. However, these methods are either restricted to exocentric settings [51,77,50,21,63,66,7] or tackle egocentric settings [46,45,9,47] in simulated 3D environments that lack realism and diversity, both in the audio-visual content of the videos and in the continuous camera motion caused by the camera-wearer's physical movements. In contrast, we learn an audio-visual representation from real-world egocentric video.…”