2015 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2015.308

Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views

Abstract: Object viewpoint estimation from 2D images is an essential task in computer vision. However, two issues hinder its progress: scarcity of training data with viewpoint annotations, and a lack of powerful features. Inspired by the growing availability of 3D models, we propose a framework to address both issues by combining render-based image synthesis and CNNs. We believe that 3D models have the potential to generate a large number of images of high variation, which can be well exploited by a deep CNN with a high …

Cited by 619 publications (697 citation statements). References 33 publications (51 reference statements).
“…Methods in the first category, such as [21] and [13], predict 2D keypoints from an image and then use 3D object models to predict the 3D pose given these keypoints. Methods in the second category, such as Viewpoints and Keypoints (V&K) [20] and Render-for-CNN [17], which are closer to what we do, predict 3D pose directly given an image. Both of these methods discretize the pose space into bins and solve a pose classification problem.…”
Section: Introduction
Mentioning confidence: 73%
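The snippet above notes that both V&K and Render-for-CNN discretize the pose space into bins and solve a classification problem. Below is a minimal sketch of that bin-based formulation, assuming fine-grained azimuth bins; `NUM_BINS` and `angle_to_bin` are illustrative names, not from either paper.

```python
import torch
import torch.nn as nn

NUM_BINS = 360  # assumed fine-grained binning: one bin per degree of azimuth

def angle_to_bin(azimuth_deg: float) -> int:
    """Map a continuous azimuth in degrees to a discrete class label."""
    return int(azimuth_deg % 360) * NUM_BINS // 360

# Viewpoint estimation reduced to plain N-way classification over angle bins.
logits = torch.randn(8, NUM_BINS)  # stand-in for CNN outputs on a batch of 8 crops
targets = torch.tensor([angle_to_bin(a) for a in
                        [0.0, 45.5, 90.0, 135.2, 180.0, 225.9, 270.0, 359.4]])
loss = nn.CrossEntropyLoss()(logits, targets)
```

At test time the predicted viewpoint is simply the arg-max bin, optionally refined by interpolating over neighbouring bin scores.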
“…They have a similar network architecture, which is shared across object categories up to the second-last layer, with a separate output layer for every category. While V&K [20] uses a standard cross-entropy loss for classification, Render-for-CNN [17] uses a weighted cross-entropy loss that respects the circular symmetry of angles. While V&K [20] uses jittered bounding boxes with sufficient overlap to augment the annotated training data, Render-for-CNN [17] uses rendered images with a well-sampled distribution over pose space, random crops, and backgrounds.…”
Section: Introduction
Mentioning confidence: 99%
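One plausible form of the "weighted cross-entropy loss that respects the circular symmetry of angles" mentioned above is a soft target that decays with circular angular distance from the ground-truth bin. This is a hedged sketch, not the exact loss from Render-for-CNN [17]; the bandwidth `sigma` is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def circular_soft_cross_entropy(logits, target_bins, num_bins=360, sigma=5.0):
    """Cross-entropy against a soft target distribution that decays with
    circular distance from the ground-truth bin, so neighbouring bins
    (including the 359/0 wrap-around) are penalised less than distant ones."""
    bins = torch.arange(num_bins, device=logits.device)       # (B,)
    diff = (bins[None, :] - target_bins[:, None]).abs()       # (N, B)
    dist = torch.minimum(diff, num_bins - diff).float()       # circular distance
    weights = torch.exp(-dist / sigma)                        # soft target mass
    weights = weights / weights.sum(dim=1, keepdim=True)      # normalise to a distribution
    log_probs = F.log_softmax(logits, dim=1)
    return -(weights * log_probs).sum(dim=1).mean()
```

As `sigma` shrinks, the soft target collapses onto the ground-truth bin and the loss approaches the standard cross-entropy that V&K [20] uses.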
“…It is shown that synthetic data is beneficial, especially in situations where few (or no) training instances are available, but 3D CAD models are. Su et al [33] follow a similar pipeline of rendering images from 3D models for viewpoint estimation, but with substantially more synthetic data, obtained, e.g., by deforming existing 3D models before rendering.…”
Section: Related Work
Mentioning confidence: 99%
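A render-for-training pipeline like the one the snippet describes needs, at minimum, a way to sample camera poses over a well-covered distribution and to convert them into camera extrinsics. The sketch below shows one such sampler; the distribution parameters are illustrative stand-ins, not the statistics estimated in [33].

```python
import numpy as np

def sample_viewpoint(rng):
    """Draw one camera pose. The distributions are assumed placeholders
    for the pose statistics a real pipeline would estimate from data."""
    azimuth = rng.uniform(0.0, 360.0)    # degrees around the object
    elevation = rng.normal(10.0, 15.0)   # cameras mostly slightly above the horizon
    tilt = rng.normal(0.0, 5.0)          # small in-plane rotation
    distance = rng.uniform(1.5, 3.0)     # camera-to-object distance
    return azimuth, elevation, tilt, distance

def look_at_rotation(azimuth, elevation):
    """World-to-camera rotation for a camera orbiting the origin."""
    az, el = np.deg2rad(azimuth), np.deg2rad(elevation)
    Rz = np.array([[np.cos(az), -np.sin(az), 0.0],
                   [np.sin(az),  np.cos(az), 0.0],
                   [0.0,         0.0,        1.0]])
    Rx = np.array([[1.0, 0.0,         0.0],
                   [0.0, np.cos(el), -np.sin(el)],
                   [0.0, np.sin(el),  np.cos(el)]])
    return Rx @ Rz
```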
“…It remains open how well synthesized images can be used as a proxy for real training data without sacrificing performance. To bridge the gap between synthetic renderings and realistic images, Su et al (2015) propose a pipeline for rendering 3D objects in common poses onto realistic backgrounds. Based on this, Massa et al (2016) learn a mapping from CNN features computed on a realistic photo to features from a rendering, both showing the same object in the same pose, thus improving matching.…”
Section: Related Work
Mentioning confidence: 99%
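The render-then-composite step the snippet refers to, pasting a rendered object onto a real photo, can be sketched with PIL as below. The paths, crop policy, and function name are illustrative assumptions, not the exact procedure of Su et al (2015).

```python
import random
from PIL import Image

def composite_on_background(render_path, background_path, rng=random):
    """Paste an RGBA rendering (alpha produced by the renderer) onto a
    random crop of a real photo, so the synthetic object inherits a
    realistic background."""
    obj = Image.open(render_path).convert("RGBA")
    bg = Image.open(background_path).convert("RGB")
    # Random crop of the background at the rendering's resolution.
    x = rng.randint(0, max(0, bg.width - obj.width))
    y = rng.randint(0, max(0, bg.height - obj.height))
    canvas = bg.crop((x, y, x + obj.width, y + obj.height))
    canvas.paste(obj, (0, 0), mask=obj)  # alpha channel masks the paste
    return canvas
```

Compositing onto varied real backgrounds is what keeps a CNN trained on renderings from overfitting to uniform synthetic backdrops, which is the domain gap the quoted passage is concerned with.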