Unsupervised Learning of 3D Object Categories from Videos in the Wild

Henzler, Philipp; Reizenstein, Jeremy; Labatut, Patrick; Шаповалов, Роман; Ritschel, Tobias; Vedaldi, Andrea; Novotný, David

doi:10.1109/cvpr46437.2021.00467

Cited by 51 publications

(45 citation statements)

References 30 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Neural radiance fields. Recently, a variety of methods based on NeRF [94] have become popular for novel view synthesis [5,39,53,63,77,79,80,85,99,107,115,124,136,141], 3D reconstruction [8,8,15,27,35,54,60,61,102,116,132,139,142,143], generative modeling [70,90,100,121] and semantic segmentation [149]. The majority of these models demonstrate impressive results on novel view synthesis but are only applicable in the single-scene setting.…”

Section: Related Workmentioning

confidence: 99%

NeSF: Neural Semantic Fields for Generalizable Semantic Segmentation of 3D Scenes

Vora¹,

Radwan²,

Greff³

et al. 2021

Preprint

View full text Add to dashboard Cite

We present NeSF, a method for producing 3D semantic fields from posed RGB images alone. In place of classical 3D representations, our method builds on recent work in implicit neural scene representations wherein 3D structure is captured by point-wise functions. We leverage this methodology to recover 3D density fields upon which we then train a 3D semantic segmentation model supervised by posed 2D semantic maps. Despite being trained on 2D signals alone, our method is able to generate 3D-consistent semantic maps from novel camera poses and can be queried at arbitrary 3D points. Notably, NeSF is compatible with any method producing a density field, and its accuracy improves as the quality of the density field improves. Our empirical analysis demonstrates comparable quality to competitive 2D and 3D semantic segmentation baselines on complex, realisticallyrendered synthetic scenes. Our method is the first to offer truly dense 3D scene segmentations requiring only 2D supervision for training, and does not require any semantic input for inference on novel scenes. We encourage the readers to visit the project website.* Denotes equal contribution.

show abstract

Section: Related Workmentioning

confidence: 99%

NeSF: Neural Semantic Fields for Generalizable Semantic Segmentation of 3D Scenes

Vora¹,

Radwan²,

Greff³

et al. 2021

Preprint

View full text Add to dashboard Cite

show abstract

“…A long-standing objective in computer vision has been to understand the 3D structure of scenes and objects from a single image. Many works have approached this problem by encoding a relationship between appearance and structure using prior knowledge of that structure as supervision [16,20,24,39]. Until recently however, the problem of deriving such a model from only single-view observations has remained very difficult.…”

Section: Related Workmentioning

confidence: 99%

LOLNeRF: Learn from One Look

Rebain¹,

Matthews²,

Yi³

et al. 2021

Preprint

View full text Add to dashboard Cite

show abstract

“…The densities for each point are predicted by aggregating information from all other points using a single attention layer. Similarly for category-specific reconstruction, NerFormer [38] proposed to replace the MLP in NeRF-WCE [23] with a transformer model to allow for spatial reasoning. Our work crucially differs from these methods at their core: the rendering framework.…”

Section: Related Workmentioning

confidence: 99%

Light Field Neural Rendering

Suhail¹,

Esteves²,

Sigal³

et al. 2021

Preprint

View full text Add to dashboard Cite

Classical light field rendering for novel view synthesis can accurately reproduce view-dependent effects such as reflection, refraction, and translucency, but requires a dense view sampling of the scene. Methods based on geometric reconstruction need only sparse views, but cannot accurately model non-Lambertian effects. We introduce a model that combines the strengths and mitigates the limitations of these two directions. By operating on a four-dimensional representation of the light field, our model learns to represent view-dependent effects accurately. By enforcing geometric constraints during training and inference, the scene geometry is implicitly learned from a sparse set of views. Concretely, we introduce a two-stage transformer-based model that first aggregates features along epipolar lines, then aggregates features along reference views to produce the color of a target ray. Our model outperforms the state-ofthe-art on multiple forward-facing and 360 • datasets, with larger margins on scenes with severe view-dependent variations. Code and results can be found at light-field-neuralrendering.github.io.* Work done while interning at Google. Ground TruthOurs (35.3 dB) NeX (30.4 dB) NeRF (29.6 dB)

show abstract

Unsupervised Learning of 3D Object Categories from Videos in the Wild

Cited by 51 publications

References 30 publications

NeSF: Neural Semantic Fields for Generalizable Semantic Segmentation of 3D Scenes

NeSF: Neural Semantic Fields for Generalizable Semantic Segmentation of 3D Scenes

LOLNeRF: Learn from One Look

Light Field Neural Rendering

Contact Info

Product

Resources

About