Audio-Visual Floorplan Reconstruction

Purushwalkam, Senthil; Garí, Sebastià V. Amengual; Ithapu, Vamsi Krishna; Schissler, Carl; Robinson, Philip W.; Gupta, Abhinav; Grauman, Kristen

doi:10.1109/iccv48922.2021.00122

Cited by 25 publications

(15 citation statements)

References 36 publications


“…Audio-visual scene mapping. To our knowledge, the only prior work to translate audio-visual inputs into a general (arbitrarily shaped) floorplan maps is AV-Floorplan [59]. Unlike AV-Floorplan, our method maps from speech in natural human conversations, which avoids emitting intrusive frequency sweep signals to generate echoes.…”

Section: Related Work (mentioning)

confidence: 99%

“…ping (e.g., visual SLAM) are highly effective when extensive exposure to the environment is possible, in many real-world scenarios only a fraction of the space is observed by the camera. Recent work shows the promise of sensing 3D spaces with both sight and sound [8,14,26,28,59]: listening to echoes bounce around the room can reveal the depth and shape of surrounding surfaces, and even help extrapolate a floorplan beyond the camera's field of view or behind occluded objects [59]. While we are inspired by these advances, they also have certain limitations.…”

Section: Introduction (mentioning)

confidence: 99%

“…While we are inspired by these advances, they also have certain limitations. Often systems will emit sounds (e.g., a frequency sweep) into the environment to ping for spatial information [1,14,15,24,28,44,59,69], which is intrusive if done around people. Furthermore, existing audio-visual models assume that the camera is always on grabbing new frames, which is wasteful if not intractable, particularly on lightweight, low-power computing devices in AR settings.…”

Section: Introduction (mentioning)

confidence: 99%


Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations

Majumder, Jiang, Moulon et al. 2023

Preprint


Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multiple people ("egos") move in a scene and talk among themselves, they receive rich audio-visual cues that can help uncover the unseen areas of the scene. Given the high cost of continuously processing egocentric visual streams, we further explore how to actively coordinate the sampling of visual information, so as to minimize redundancy and reduce power use. To that end, we present an audio-visual deep reinforcement learning approach that works with our shared scene mapper to selectively turn on the camera to efficiently chart out the space. We evaluate the approach using a state-of-the-art audio-visual simulator for 3D scenes as well as real-world video. Our model outperforms previous state-of-the-art mapping methods, and achieves an excellent cost-accuracy tradeoff. Project: http://vision.cs.utexas.edu/projects/chat2map.


“…Parida et al [7] estimated depth maps using multi-modal data (RGB images, echoes, and materials of objects) from indoor scenes. Purushwalkam et al [23] reconstructed the floor plan of the invisible area using echoes. Batvision [5] used both vision and echoes to train, and in the test phase, they estimated depth using echoes only.…”

Section: Related Work (mentioning)

confidence: 99%

Deep Non-Line-of-Sight Imaging Using Echolocation

Jang, Shin, Kim 2022

Sensors


Non-line-of-sight (NLOS) imaging is aimed at visualizing hidden scenes from an observer’s (e.g., camera) viewpoint. Typically, hidden scenes are reconstructed using diffused signals that emit light sources using optical equipment and are reflected multiple times. Optical systems are commonly adopted in NLOS imaging because lasers can transport energy and focus light over long distances without loss. In contrast, we propose NLOS imaging using acoustic equipment inspired by echolocation. Existing acoustic NLOS is a computational method motivated by seismic imaging that analyzes the geometry of underground structures. However, this physical method is susceptible to noise and requires a clear signal, resulting in long data acquisition times. Therefore, we reduced the scan time by modifying the echoes to be collected simultaneously rather than sequentially. Then, we propose end-to-end deep-learning models to overcome the challenges of echoes interfering with each other. We designed three distinctive architectures: an encoder that extracts features by dividing multi-channel echoes into groups and merging them hierarchically, a generator that constructs an image of the hidden object, and a discriminator that compares the generated image with the ground-truth image. The proposed model successfully reconstructed the outline of the hidden objects.


“…Sound Simulation using Machine Learning: Many recent deep learning methods have been proposed for sound synthesis [Hawley et al 2020;Ji et al 2020;Jin et al 2020], scattering effect computation, and sound propagation [Fan et al 2020;Pulkki and Svensson 2019;. Deep learning methods have also been used to compute material properties of a room and acoustic characteristics [Schissler et al 2017;Tang et al 2020a] Other applications that have used acoustic datasets include navigation , floorplan reconstruction [Purushwalkam et al 2021] and depth estimation algorithms [Gao et al 2020].…”

Section: Introduction (mentioning)

confidence: 99%

GWA: A Large High-Quality Acoustic Dataset for Audio Processing

Tang, Aralikatti, Ratnarajah et al. 2022

Preprint


Figure 1: Our IR data generation pipeline starts from a 3D model of a complex scene and its visual material annotations (unstructured texts). We sample multiple collision-free source and receiver locations in the scene. We use a novel scheme to automatically assign acoustic material parameters by semantic matching from a large acoustic database. Our hybrid acoustic simulator generates accurate impulse responses (IRs), which become part of the large synthetic impulse response dataset after post-processing.

