“…For the intrinsic decomposition task, self-supervision is obtained via dense pixel correspondences across multiple views [54], training on sequences of multi-lit images or video streams [25,42], model-based shape reconstruction [46], or through a reconstruction loss (imposing consistency between the original images and the ones re-rendered from the estimated intrinsic components) while training on a mix of labeled and unlabeled datasets [37,36,16]. Here, we introduce a new self-supervised loss term that 1) reduces the need for pseudo-labels and multi-stage training [37], 2) does not require a sequence of images as input during training [25,42], and 3) does not rely on the strong priors imposed in [54,25,46] for training under limited supervision (no labels for albedo and normals), where intrinsic decomposition from a single image is highly ambiguous. Furthermore, compared to [28], which proposes an unsupervised intrinsic decomposition technique given multi-lit images at training time, we further disentangle the lighting component from the normals, thus facilitating relighting and light transfer between a source and a target image pair.…”
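The reconstruction loss mentioned above can be sketched as follows, assuming a simple Lambertian image-formation model (image = albedo ⊙ shading); the function name, array shapes, and NumPy implementation are illustrative, not the paper's actual training code.

```python
import numpy as np

def reconstruction_loss(image, albedo, shading):
    """Mean-squared error between the input image and its re-rendering
    from estimated intrinsic components, assuming I = A * S (Lambertian).
    All arrays are H x W x 3 with values in [0, 1]."""
    recon = albedo * shading  # re-render from the estimated components
    return float(np.mean((image - recon) ** 2))

# Toy check: a perfect decomposition reconstructs the image exactly,
# so the self-supervised loss vanishes without any ground-truth labels.
rng = np.random.default_rng(0)
albedo = rng.uniform(0.2, 0.9, size=(4, 4, 3))
shading = rng.uniform(0.1, 1.0, size=(4, 4, 3))
image = albedo * shading
print(reconstruction_loss(image, albedo, shading))  # → 0.0
```

Because the loss is computed purely from the input image, it can supervise training on unlabeled data; the ambiguity noted in the text arises because many (albedo, shading) pairs reconstruct the same image, which is why additional priors or loss terms are needed.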