“…In ad-dition to numerous semantic-specific details, recognition in novel viewpoints via direct appearance synthesis is suboptimal: one may be sure of the presence of a rug behind a couch, but unsure of its particular color. Similarly, there have been advances in learning to infer 3D properties of scenes from image cues [20,46,63], or with differentiable rendering [10,29,38,50] and other methods for bypassing the need for direct 3D supervision [27,33,34,68]. However, these approaches do not connect to complex scene semantics; they primarily focus on single objects or small, less diverse 3D annotated datasets.…”