End-to-End 6-DoF Object Pose Estimation Through Differentiable Rasterization

Palazzi, Andrea; Bergamini, Luca; Calderara, Simone; Cucchiara, Rita

doi:10.1007/978-3-030-11015-4_53

Cited by 21 publications (21 citation statements)

References 42 publications


“…In particular, in Zhu et al [79] the 2.5D sketch consists of both a silhouette and a depth image rendered from a learnt low-resolution voxel grid by means of a differentiable ray-tracer. While this method is appealing for its geometrical guarantees, it is limited by a number of factors: i) it requires a custom differentiable ray-tracing module; ii) footprint of voxel-based representations scales with the cube of the resolution despite most of the information lying on the surface [54], [41]; iii) errors in the 3D voxel grid naturally propagate to the 2.5D sketch. We also follow this line of work to provide soft 3D priors to the synthesis process.…”

Section: Related Work (mentioning)

confidence: 99%

Warp and Learn: Novel Views Generation for Vehicles and Other Objects

Palazzi, Bergamini, Calderara et al. 2022

IEEE Trans. Pattern Anal. Mach. Intell.

Self Cite


In this work we introduce a new self-supervised, semi-parametric approach for synthesizing novel views of a vehicle starting from a single monocular image. Differently from parametric (i.e. entirely learning-based) methods, we show how a-priori geometric knowledge about the object and the 3D world can be successfully integrated into a deep learning based image generation framework. As this geometric component is not learnt, we call our approach semi-parametric. In particular, we exploit man-made object symmetry and piece-wise planarity to integrate rich a-priori visual information into the novel viewpoint synthesis process. An Image Completion Network (ICN) is then trained to generate a realistic image starting from this geometric guidance. This careful blend between parametric and non-parametric components allows us to i) operate in a real-world scenario, ii) preserve high-frequency visual information such as textures, iii) handle truly arbitrary 3D roto-translations of the input and iv) perform shape transfer to completely different 3D models. Eventually, we show that our approach can be easily complemented with synthetic data and extended to other rigid objects with completely different topology, even in presence of concave structures and holes (e.g. chairs). A comprehensive experimental analysis against state-of-the-art competitors shows the efficacy of our method both from a quantitative and a perceptive point of view.
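The footprint argument in the first quote above (point ii: dense voxel grids scale with the cube of the resolution while the information lies on the surface) can be checked with simple counting. The sketch below is an illustration only, assuming a sphere of radius 0.8 inside the cube [-1, 1]^3 and a one-voxel-thick shell as the "surface" criterion; the function name `voxel_counts` is hypothetical and nothing here comes from the cited works.

```python
import numpy as np

def voxel_counts(n, radius=0.8):
    """Return (total voxels, voxels on a sphere's surface) for an n^3 grid.

    Hypothetical helper: the grid spans [-1, 1]^3 and a voxel counts as
    "surface" if its centre lies within half a voxel of the sphere.
    """
    voxel = 2.0 / n
    coords = (np.arange(n) + 0.5) * voxel - 1.0          # voxel centres along one axis
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    r = np.sqrt(x ** 2 + y ** 2 + z ** 2)
    surface = np.count_nonzero(np.abs(r - radius) <= voxel / 2)
    return n ** 3, int(surface)

for n in (32, 64):
    total, surface = voxel_counts(n)
    print(f"n={n}: total={total}, surface={surface}, fraction={surface / total:.3f}")
```

Doubling the resolution multiplies the total by 8 but the surface count only by roughly 4, so the fraction of cells that actually carry shape information keeps shrinking as the grid gets finer.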


Section: Related Work (mentioning)

confidence: 99%

Warp and Learn: Novel Views Generation for Vehicles and Other Objects

Palazzi, Bergamini, Calderara et al. 2022

IEEE Trans. Pattern Anal. Mach. Intell.

Self Cite


“…We propose a direct pose optimisation through differentiable rendering. While differentiable rendering-based approaches have been shown to be effective for pose estimation [24,10], these works rely on homogeneous data to compute losses between the prediction and the target, often employing pixelwise losses based on photometric or depth reconstruction errors. However, in our application we again must tackle the challenge of the asymmetry of our query (RGB) and reference (layouts) data types.…”

Section: Latent Optimisation Of Pose (mentioning)

confidence: 99%

“…Differentiable rendering has been shown to be effective for object pose estimation [24,10]. But these works typically rely on like-for-like rendering losses, such as the pixelwise error between the rendered and target images.…”

Section: Introduction (mentioning)

confidence: 99%

LaLaLoc: Latent Layout Localisation in Dynamic, Unvisited Environments

Howard-Jenkins, Ruiz-Sarmiento, Prisacariu 2021

Preprint


We present LaLaLoc to localise in environments without the need for prior visitation, and in a manner that is robust to large changes in scene appearance, such as a full rearrangement of furniture. Specifically, LaLaLoc performs localisation through latent representations of room layout. LaLaLoc learns a rich embedding space shared between RGB panoramas and layouts inferred from a known floor plan that encodes the structural similarity between locations. Further, LaLaLoc introduces direct, cross-modal pose optimisation in its latent space. Thus, LaLaLoc enables fine-grained pose estimation in a scene without the need for prior visitation, as well as being robust to dynamics, such as a change in furniture configuration. We show that in a domestic environment LaLaLoc is able to accurately localise a single RGB panorama image to within 8.3cm, given only a floor plan as a prior.
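The LaLaLoc statements above contrast their latent optimisation with the "like-for-like" pixelwise rendering losses used by differentiable-rendering pose estimators. As a toy illustration of that pixelwise setting only, the sketch below replaces the full 6-DoF pose with a 2D disc centre, rasterises a soft silhouette (a sigmoid of the signed distance to the disc boundary, an assumed soft-rasteriser choice), and recovers the centre by gradient descent on the squared error against a target silhouette. The disc model, the analytic gradient, and the names `render_silhouette` and `fit_center` are hypothetical; this is not the method of Palazzi et al. or of LaLaLoc.

```python
import numpy as np

def render_silhouette(center, radius, size=32, tau=1.0):
    """Soft silhouette of a disc and its derivative w.r.t. the disc centre."""
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    pix = np.stack([xs, ys], axis=-1)                 # (H, W, 2) pixel coordinates
    diff = pix - center                               # offset of each pixel from the centre
    dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)  # eps avoids 0/0 at the centre pixel
    sil = 1.0 / (1.0 + np.exp(-(radius - dist) / tau))
    # d sil / d center = s(1 - s)/tau * (p - c)/|p - c|
    dsil_dc = (sil * (1.0 - sil) / tau)[..., None] * diff / dist[..., None]
    return sil, dsil_dc

def fit_center(target_img, init, radius, lr=0.05, steps=300):
    """Gradient descent on the pixelwise squared error between render and target."""
    c = np.asarray(init, dtype=float)
    for _ in range(steps):
        sil, dsil_dc = render_silhouette(c, radius, size=target_img.shape[0])
        residual = sil - target_img                   # like-for-like pixelwise residual
        grad = (2.0 * residual[..., None] * dsil_dc).sum(axis=(0, 1))
        c = c - lr * grad
    return c

target, _ = render_silhouette(np.array([16.0, 14.0]), 6.0)
estimate = fit_center(target, init=[12.0, 12.0], radius=6.0)
print(estimate)
```

The softness parameter `tau` is what makes the rasteriser differentiable: with a hard 0/1 mask the loss would be piecewise constant in the centre and carry no gradient. The same loss cannot be formed when query and reference live in different modalities (RGB versus layout), which is the asymmetry the quoted passage points out.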


“…Recently, thanks to the development of several differentiable renderers [31,20,34,30,2], a handful of methods [17,13,16] have shown that the task can be addressed as an inverse graphics problem using fewer supervisory signals, such as 2D segmentation masks and object keypoints. Following methods have even relaxed these constraints, training without keypoint supervision [2,19,18] or known camera poses [47,7,28].…”

Section: Related Work (mentioning)

confidence: 99%

Multi-Category Mesh Reconstruction From Image Collections

Simoni, Pini, Vezzani et al. 2021

Preprint

Self Cite


Recently, learning frameworks have shown the capability of inferring the accurate shape, pose, and texture of an object from a single RGB image. However, current methods are trained on image collections of a single category in order to exploit specific priors, and they often make use of category-specific 3D templates. In this paper, we present an alternative approach that infers the textured mesh of objects combining a series of deformable 3D models and a set of instance-specific deformation, pose, and texture. Differently from previous works, our method is trained with images of multiple object categories using only foreground masks and rough camera poses as supervision. Without specific 3D templates, the framework learns category-level models which are deformed to recover the 3D shape of the depicted object. The instance-specific deformations are predicted independently for each vertex of the learned 3D mesh, enabling the dynamic subdivision of the mesh during the training process. Experiments show that the proposed framework can distinguish between different object categories and learn category-specific shape priors in an unsupervised manner. Predicted shapes are smooth and can leverage from multiple steps of subdivision during the training process, obtaining comparable or state-of-the-art results on two public datasets. Models and code are publicly released.
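The abstract above states that deformations are predicted per vertex and that this enables subdividing the mesh during training. It does not say which subdivision scheme is used, so the sketch below assumes plain midpoint subdivision (every triangle split into four, one new vertex per shared edge) followed by adding a per-vertex offset array; the helper names `subdivide` and `deform` are hypothetical and only illustrate why per-vertex predictions make refinement straightforward.

```python
import numpy as np

def subdivide(vertices, faces):
    """One step of midpoint subdivision: each triangle becomes four."""
    verts = [np.asarray(v, dtype=float) for v in vertices]
    midpoint_index = {}  # sorted edge (i, j) -> index of its midpoint vertex

    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in midpoint_index:          # shared edges get a single new vertex
            midpoint_index[key] = len(verts)
            verts.append((verts[i] + verts[j]) / 2.0)
        return midpoint_index[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return np.stack(verts), np.array(new_faces)

def deform(vertices, offsets):
    """Apply one independent 3D offset per vertex (offsets: shape (V, 3))."""
    return vertices + offsets

# Tetrahedron: 4 vertices, 6 edges, 4 faces -> 10 vertices, 16 faces after one step.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
faces = np.array([[0, 1, 2], [0, 3, 1], [0, 2, 3], [1, 3, 2]])
fine_verts, fine_faces = subdivide(verts, faces)
deformed = deform(fine_verts, np.zeros_like(fine_verts))
print(fine_verts.shape, fine_faces.shape)
```

Because the offsets are just an array indexed by vertex, in this sketch a refined mesh only requires producing a larger offset array for the new vertex count; no per-face or template-specific bookkeeping is involved.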
