Leveraging temporal synchronization and association between sight and sound is an essential step towards robust localization of sounding objects. To this end, we propose a space-time memory network for sounding object localization in videos. The network simultaneously learns spatio-temporal attention over both uni-modal and cross-modal representations from the audio and visual modalities. We analyze, both quantitatively and qualitatively, the effectiveness of incorporating spatio-temporal learning for localizing audio-visual objects. We demonstrate that our approach generalizes to a variety of complex audio-visual scenes and outperforms recent state-of-the-art methods. Code and data can be found at https://sites.google.com/view/bmvc2021stm.
Introduction

Neurological evidence suggests that human understanding of scenes relies predominantly on the integration of visual and auditory cues [3]. As humans, we direct attention to sounding sources by leveraging the temporal, cross-modal alignments between vision and sound, where understanding of the past tells us where and what to attend to next. For computational models, although several sound source spatial localization frameworks have been developed [21,22,27], how much is gained from explicitly leveraging the temporal correspondence that exists naturally between video and audio remains unexplored.

Temporal coherence, however, must be taken into account to facilitate consistent understanding of complex scenes. Imagine a person playing a guitar in front of a wall of unused guitars. To determine which guitar is sounding and to obtain stable localization results, we must take multiple timesteps into account. It is therefore worthwhile to explore whether learning temporal cues can benefit the localization of sounding objects in videos; a toy sketch of this temporal intuition appears at the end of this section.

To localize the visual objects associated with specific sound sources, most previous works capture audio-visual spatial correspondence via similarities between audio and visual modalities [2,15,21], cross-modal attention mechanisms [25,27], or sounding class activation mapping [22]; the first sketch below illustrates the similarity-based variant. Nevertheless, these methods typically localize sounding objects in static images, and the audio-visual temporal coherence available in videos is commonly ignored.
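As a concrete illustration of the similarity-based approach mentioned above, the following sketch computes a localization heatmap as the cosine similarity between a clip-level audio embedding and every spatial location of a visual feature map. This is a minimal PyTorch example, not the implementation of any cited method; the tensor shapes, feature dimension, and function name are our own illustrative assumptions.

```python
# Minimal sketch (not the paper's method): similarity-based sound source
# localization. All shapes and names here are illustrative assumptions.
import torch
import torch.nn.functional as F

def similarity_heatmap(visual_feat: torch.Tensor,
                       audio_feat: torch.Tensor) -> torch.Tensor:
    """visual_feat: (B, C, H, W) frame features; audio_feat: (B, C) clip embedding.

    Returns a (B, H, W) cosine-similarity heatmap in [-1, 1].
    """
    v = F.normalize(visual_feat, dim=1)   # unit-normalize along channels
    a = F.normalize(audio_feat, dim=1)    # unit-normalize the embedding
    # Dot product over channels at every spatial location = cosine similarity.
    return torch.einsum('bchw,bc->bhw', v, a)

# Hypothetical usage: 512-d features on a 14x14 grid for a batch of 2 frames.
vis = torch.randn(2, 512, 14, 14)
aud = torch.randn(2, 512)
print(similarity_heatmap(vis, aud).shape)  # torch.Size([2, 14, 14])
```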
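Building on the per-frame heatmap above, a second sketch illustrates the temporal intuition from the guitar example: aggregating evidence across timesteps can suppress visually plausible but silent objects. The softmax weighting here is a crude, hand-crafted stand-in assumed purely for illustration; it is not the learned space-time memory attention proposed in this paper.

```python
# Toy temporal aggregation (illustrative only, not the proposed memory network):
# a single frame's heatmap can fire on any plausible object, but weighting
# frames by their peak audio-visual response down-weights spurious detections.
import torch

def temporal_heatmap(frame_heatmaps: torch.Tensor) -> torch.Tensor:
    """frame_heatmaps: (T, H, W) per-frame similarity maps over a clip.

    Returns an (H, W) map favoring regions consistently aligned with the audio.
    """
    peak = frame_heatmaps.flatten(1).max(dim=1).values  # (T,) peak response/frame
    weights = torch.softmax(peak, dim=0)                # (T,) temporal weights
    # Weighted temporal average of the per-frame maps.
    return torch.einsum('t,thw->hw', weights, frame_heatmaps)

# Hypothetical usage: aggregate 8 frames of 14x14 heatmaps.
maps = torch.randn(8, 14, 14)
print(temporal_heatmap(maps).shape)  # torch.Size([14, 14])
```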