2020
DOI: 10.1007/978-3-030-58539-6_2
SoundSpaces: Audio-Visual Navigation in 3D Environments

Cited by 125 publications (197 citation statements)
References 67 publications
“…Christensen et al. [20] predict depth maps from real-world scenes using echo responses. Gao et al. [21] learn visual representations by echolocation in a simulated environment [22]. In contrast, we learn through passive observation rather than active sensing.…”
Section: Related Work (mentioning)
confidence: 99%
“…Recent works have proposed methods that use sound for robot navigation. These robotic systems are designed to localize sound sources and to navigate to audio goals in indoor environments [68,69,22,70,71,72]. Unlike these methods, which largely use distinctive sound sources, we use ambient sounds collected in real-world scenes.…”
Section: Static Motion (mentioning)
confidence: 99%
“…This area has attracted increasing interest in recent years, since synchronized audio-visual scenes are widely available in videos. In addition to localizing sound sources, a wide range of tasks have been proposed, including audio-visual sound separation [7,9,26,34,35], audio-visual action recognition [10,17,19,30], audio-visual event localization [27,33], audio-visual video captioning [23,28,32], embodied audio-visual navigation [4,8], audio-visual sound recognition [5], and audio-visual video parsing [29]. Our framework demonstrates that temporal learning facilitates better audio-visual understanding, which in turn benefits localization performance.…”
Section: Audio-Visual Video Understanding (mentioning)
confidence: 99%
“…to the location of a sound-emitting source using audio and visual signals [9,21], semantic audio-visual navigation [8] with coherent room and sound semantics, active perception tasks such as active audio-visual source separation [30] and audio-visual dereverberation [11], curiosity-based exploration via audio-visual association [46], as well as tasks explicitly focusing on the geometric information contained in audio, such as audio-visual floor plan reconstruction [5,35].…”
Section: Introduction (mentioning)
confidence: 99%
“…However, they have mostly focused on clean and distractor-free audio settings, in which the only change to the audio signal comes from changes in the agent's position. Furthermore, they have struggled to generalize to unheard sounds [9,10]. In this work, we take the next steps towards more challenging scenarios.…”
Section: Introduction (mentioning)
confidence: 99%