“…Nevertheless, addressing 3D instance understanding from a single image in an uncontrolled environment is ill-posed and challenging, thus attracting growing attention. With the development of deep CNNs, researchers are able to achieve impressive results with supervised [18,69,43,46,57,54,63,70,6,32,49,38,3,66] or weakly supervised strategies [28,48,24]. Existing works consider to represent an object as a parameterized 3D bounding box [18,54,57,49], coarse wire-frame skeletons [14,32,62,69,68], voxels [9], one-hot selection from a small set of exemplar models [3,45,1], and point clouds [17].…”