Recognizing Scenes from Novel Viewpoints

Qian, Shengyi; Kirillov, Alexander; Ravi, Nikhila; Chaplot, Devendra Singh; Johnson, Justin; Fouhey, David F.; Gkioxari, Georgia

doi:10.48550/arxiv.2112.01520

Cited by 4 publications

(5 citation statements)

References 68 publications

(95 reference statements)


“…Using neural networks to implicitly represent 3D scenes [29,36,48,49,52,57] has drawn much recent attention. NeRF [32] and its variants [2,3,30,34,53,54,61,66] have achieved impressive results on novel view synthesis [8,57,63] and have many applications including 3D reconstruction [28,52,64,69,72], semantic segmentation [19,44,71], generative model [5,6,9,35,46], 3D content creation [1,17,37,42,55,65].…”

Section: Related Work (mentioning)

confidence: 99%

HexPlane: A Fast Representation for Dynamic Scenes

Cao, Johnson

2023

Preprint


Modeling and re-rendering dynamic 3D scenes is a challenging task in 3D vision. Prior approaches build on NeRF and rely on implicit representations. This is slow since it requires many MLP evaluations, constraining real-world applications. We show that dynamic 3D scenes can be explicitly represented by six planes of learned features, leading to an elegant solution we call HexPlane. A HexPlane computes features for points in spacetime by fusing vectors extracted from each plane, which is highly efficient. Pairing a HexPlane with a tiny MLP to regress output colors and training via volume rendering gives impressive results for novel view synthesis on dynamic scenes, matching the image quality of prior work but reducing training time by more than 100×. Extensive ablations confirm our HexPlane design and show that it is robust to different feature fusion mechanisms, coordinate systems, and decoding mechanisms. HexPlanes are a simple and effective solution for representing 4D volumes, and we hope they can broadly contribute to modeling spacetime for dynamic 3D scenes.
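The six-plane lookup described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the names `make_hexplane`, `lookup`, and `query` are hypothetical, the planes hold random values instead of learned features, sampling is nearest-neighbour, and the fusion rule (elementwise product of complementary plane pairs, then concatenation) is only one of the fusion mechanisms the abstract alludes to.

```python
import random

AXES = "xyzt"
# The six axis pairs of a 4D (x, y, z, t) volume: three spatial, three spatio-temporal.
PLANES = [("x", "y"), ("x", "z"), ("y", "z"), ("x", "t"), ("y", "t"), ("z", "t")]
# Assumed pairing: each plane is fused with the plane covering the two remaining axes.
PAIRS = [(("x", "y"), ("z", "t")), (("x", "z"), ("y", "t")), (("y", "z"), ("x", "t"))]


def make_hexplane(resolution, channels, seed=0):
    """Create six resolution x resolution grids of random feature vectors."""
    rng = random.Random(seed)
    return {
        plane: [[[rng.uniform(-1, 1) for _ in range(channels)]
                 for _ in range(resolution)]
                for _ in range(resolution)]
        for plane in PLANES
    }


def lookup(grid, u, v):
    """Nearest-neighbour lookup for coordinates u, v in [0, 1] (a simplification)."""
    n = len(grid)
    i = min(n - 1, max(0, round(u * (n - 1))))
    j = min(n - 1, max(0, round(v * (n - 1))))
    return grid[i][j]


def query(hexplane, point):
    """Fuse the six plane features for a point (x, y, z, t) with coordinates in [0, 1]."""
    coords = dict(zip(AXES, point))
    fused = []
    for a, b in PAIRS:
        fa = lookup(hexplane[a], coords[a[0]], coords[a[1]])
        fb = lookup(hexplane[b], coords[b[0]], coords[b[1]])
        # Assumed fusion: multiply the paired features, then concatenate the pairs.
        fused.extend(p * q for p, q in zip(fa, fb))
    return fused


hexplane = make_hexplane(resolution=8, channels=4)
feature = query(hexplane, (0.1, 0.5, 0.9, 0.3))
print(len(feature))  # prints 12 (3 plane pairs x 4 channels)
```

In a full pipeline, the fused vector would be passed to a small MLP that regresses color, as the abstract describes; that step is omitted here.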


Section: Related Work (mentioning)

confidence: 99%

HexPlane: A Fast Representation for Dynamic Scenes

Cao, Johnson

2023

Preprint


“…Most similarly to our approach, 2D3DNet [17] obtains 2D features in each image using a pre-trained segmentation model, projects ("lifts") these predictions to 3D points, and refines them by a 3D network trained without 3D labels, bypassing the need for 3D annotations during training. [33] predicts semantic segmentation for a target viewpoint by rendering a volumetric 3D representation of projected semantics predicted by a pre-trained segmentation model. Similarly to the latter two works, we use an existing pre-trained generic segmentation network but with a synthesized appearance view.…”

Section: Related Work (mentioning)

confidence: 99%

S4R: Self-Supervised Semantic Scene Reconstruction from RGB-D Scans

Artemov, Chen, Zhi et al.

2023

Preprint


“…where $R_{s,t}$ is 2D rotation matrix of $\phi_{s,t}$. Following common practice in indoor scene reconstruction, we give the camera a fixed downward tilt [99,71,98] and only estimate azimuth [41]. This is also a common assumption in audio localization [1,86], since azimuth has strong binaural cues.…”

Section: Estimating Pose and Localizing Sounds (mentioning)

confidence: 99%
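
For reference, the 2D rotation matrix mentioned in the statement quoted above can be written out explicitly. This is the standard counterclockwise form; the sign convention is an assumption, since the quoted passage does not state which one the citing paper uses.

```latex
R_{s,t} =
\begin{pmatrix}
\cos\phi_{s,t} & -\sin\phi_{s,t} \\
\sin\phi_{s,t} & \cos\phi_{s,t}
\end{pmatrix}
```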

“…The rotations are limited to (10°, 90°) relative to the source viewpoints. We follow the standard practice to set the height to agents to be 1.5m and lock a downward tilt angle [41,98,71,99]. We render the binaural RIRs and images given the position of agents and sound sources.…”

Section: Dataset (mentioning)

confidence: 99%

Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation

Chen, Qian, Owens

2023

Preprint


The images and sounds that we perceive undergo subtle but geometrically consistent changes as we rotate our heads. In this paper, we use these cues to solve a problem we call Sound Localization from Motion (SLfM): jointly estimating camera rotation and localizing sound sources. We learn to solve these tasks solely through self-supervision. A visual model predicts camera rotation from a pair of images, while an audio model predicts the direction of sound sources from binaural sounds. We train these models to generate predictions that agree with one another. At test time, the models can be deployed independently. To obtain a feature representation that is well-suited to solving this challenging problem, we also propose a method for learning an audio-visual representation through cross-view binauralization: estimating binaural sound from one view, given images and sound from another. Our model can successfully estimate accurate rotations on both real and synthetic scenes, and localize sound sources with accuracy competitive with state-of-the-art self-supervised approaches. Project site: https://ificl.github.io/SLfM.

