FrameNet: Learning Local Canonical Frames of 3D Surfaces From a Single RGB Image

Huang, Jingwei; Zhou, Yichao; Guibas, Leonidas J.

doi:10.1109/iccv.2019.00873

Cited by 50 publications

(43 citation statements)

References 51 publications

(113 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…3D generation and reconstruction has been studied extensively in the computer vision and graphics communities (Saxena et al 2009;Chaudhuri et al 2011;Kalogerakis et al 2012;Chang et al 2015;Rezende et al 2016;Soltani et al 2017;Kulkarni et al 2015;Tulsiani et al 2016;Huang et al 2019;Jiang et al 2019). Most methods in the literature focus on recovering the 3D structure from 2D images by using explicit 3D supervision.…”

Section: Single View 3d Reconstruction and Generationmentioning

confidence: 99%

Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images Using a View-Based Representation

Rajeswar

Mannan²,

Golemo

et al. 2020

Int J Comput Vis

View full text Add to dashboard Cite

We infer and generate three-dimensional (3D) scene information from a single input image and without supervision. This problem is under-explored, with most prior work relying on supervision from, e.g., 3D groundtruth, multiple images of a scene, image silhouettes or key-points. We propose Pix2Shape, an approach to solve this problem with four component: (i) an encoder that infers the latent 3D representation from an image, (ii) a decoder that generates an explicit 2.5D surfelbased reconstruction of a scene -from the latent code -(iii) a differentiable renderer that synthesizes a 2D image from the surfel representation, and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from a training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, i.e., voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space, and that the decoder can then be applied to this latent representation in order to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on a novel benchmark we developed -called 3D-IQTTto evaluate models based on their ability to enable 3d spatial reasoning. Qualitative and quantitative evaluation demonstrate Pix2Shape's ability to solve scene reconstruction, generation and understanding tasks.

show abstract

Section: Single View 3d Reconstruction and Generationmentioning

confidence: 99%

Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images Using a View-Based Representation

Rajeswar

Mannan²,

Golemo

et al. 2020

Int J Comput Vis

View full text Add to dashboard Cite

show abstract

“…Huang et al [28] propose FrameNet, a model to learn a canonical frame from a RGB image, where a canonical frame is represented by three orthogonal directions, one along the normal direction and two in the tangent plane. Dasgupta et al [29] propose a novel method called DeLay for room layout estimation from a single monocular RGB image that uses a CNN model to generate an initial belief map, which is then used by an optimization algorithm to predict the final room layout.…”

Section: B Monocular Depth Estimationmentioning

confidence: 99%

Automatic Dense Annotation for Monocular 3D Scene Understanding

et al. 2020

View full text Add to dashboard Cite

Deep neural networks have revolutionized many areas of computer vision, but they require notoriously large amounts of labeled training data. For tasks such as semantic segmentation and monocular 3d scene layout estimation, collecting high-quality training data is extremely laborious because dense, pixellevel ground truth is required and must be annotated by hand. In this paper, we present two techniques for significantly reducing the manual annotation effort involved in collecting large training datasets. The tools are designed to allow rapid annotation of entire videos collected by RGBD cameras, thus generating thousands of ground-truth frames to use for training. First, we propose a fully-automatic approach to produce dense pixel-level semantic segmentation maps. The technique uses noisy evidence from pre-trained object detectors and scene layout estimators and incorporates spatial and temporal context in a conditional random field formulation. Second, we propose a semi-automatic technique for dense annotation of 3d geometry, and in particular, the 3d poses of planes in indoor scenes. This technique requires a human to quickly annotate just a handful of keyframes per video, and then uses the camera poses and geometric reasoning to propagate these labels through an entire video sequence. Experimental results indicate that the technique could be used as an alternative or complementary source of training data, allowing large-scale data to be collected with minimal human effort.

show abstract

“…Unfortunately, such method assumes the ground plane is always visible in the images, and only applies to vehicle-control use cases. In addition, there are recent work making use of local surface frame representation for a variety of 3D scene understanding tasks [20,21]. However, our method extends beyond these ideas by estimating both local and global aligned surface geometry from single images and use such correspondences to estimate camera orientation.…”

Section: Related Workmentioning

confidence: 99%

“…We find this choice leads to the best performance in our experiments. However, other surface frame representations could also be used [20,21].…”

Section: Approachmentioning

confidence: 99%

UprightNet: Geometry-Aware Camera Orientation Estimation From Single Images

Xian

Snavely

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

View full text Add to dashboard Cite

We introduce UprightNet, a learning-based approach for estimating 2DoF camera orientation from a single RGB image of an indoor scene. Unlike recent methods that leverage deep learning to perform black-box regression from image to orientation parameters, we propose an end-to-end framework that incorporates explicit geometric reasoning. In particular, we design a network that predicts two representations of scene geometry, in both the local camera and global reference coordinate systems, and solves for the camera orientation as the rotation that best aligns these two predictions via a differentiable least squares module. This network can be trained end-to-end, and can be supervised with both ground truth camera poses and intermediate representations of surface geometry. We evaluate UprightNet on the single-image camera orientation task on synthetic and real datasets, and show significant improvements over prior state-of-the-art approaches.

show abstract

FrameNet: Learning Local Canonical Frames of 3D Surfaces From a Single RGB Image

Cited by 50 publications

References 51 publications

Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images Using a View-Based Representation

Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images Using a View-Based Representation

Automatic Dense Annotation for Monocular 3D Scene Understanding

UprightNet: Geometry-Aware Camera Orientation Estimation From Single Images

Contact Info

Product

Resources

About