Abstract:Figure 1. We present Mask3D, which learns to embed 3D priors to 2D representations for image understanding tasks, based on a selfsupervised pre-training formulation from single RGB-D views, without requiring any camera pose or multi-view correspondence information. Our pre-training takes masked RGB and depth patches as input to reconstruct the dense depth map, and the pre-trained color backbone is used to fine-tune various downstream image understanding tasks. This results in effective ViT pre-training for a v… Show more
Set email alert for when this publication receives citations?
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.