Unsupervised Part-Based Disentangling of Object Shape and Appearance

Lorenz, Dominik; Bereska, Leonard; Milbich, Timo; Ommer, Björn

doi:10.1109/cvpr.2019.01121

Cited by 156 publications

(193 citation statements)

References 49 publications

(100 reference statements)

Citation statement types: Supporting; Mentioning (192); Contrasting

“…Jakab & Gupta et al [25], the most related, is described in the introduction. Lorenz et al [33], Zhang et al [74] develop an auto-encoding formulation to discover landmarks as explicit structural representations for a given…”

Section: Related Work (mentioning)

confidence: 99%

“…Our method allows for similar but more fine-grained conditional image generation, conditioned on an appearance image or object landmarks. Many unsupervised methods for pose estimation [25,33,50,67,74] share similar ability. However, we can achieve more accurate and predictable image editing by manipulating semantic parts in the image through their corresponding landmarks.…”

Section: Conditional Image Decoder (mentioning)

confidence: 99%

“…Our method obtains state-of-the-art landmark detection performance for approaches that use unlabelled images for supervision. In contrast, self-supervised landmark detectors [25,33,54,74] can only learn to discover keypoints [right] that are not human-interpretable (predictions from [25]) and require supervised post-processing.…”

Section: Introduction (mentioning)

confidence: 99%

Self-Supervised Learning of Interpretable Keypoints From Unlabelled Videos

Jakab, Gupta, Bilen, et al. 2020

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

We propose a new method for recognizing the pose of objects from a single image that, for learning, uses only unlabelled videos and a weak empirical prior on the object poses. Video frames differ primarily in the pose of the objects they contain, so our method distils the pose information by analyzing the differences between frames. The distillation uses a new dual representation of the geometry of objects as a set of 2D keypoints, and as a pictorial representation, i.e. a skeleton image. This has three benefits: (1) it provides a tight 'geometric bottleneck' which disentangles pose from appearance, (2) it can leverage powerful image-to-image translation networks to map between photometry and geometry, and (3) it allows empirical pose priors to be incorporated in the learning process. The pose priors are obtained from unpaired data, such as from a different dataset or modality such as mocap, such that no annotated image is ever used in learning the pose recognition network. In standard benchmarks for pose recognition for humans and faces, our method achieves state-of-the-art performance among methods that do not require any labelled images for training.

“…Pose guided image and video generation Given a source image and a target 2D pose, image-based methods [43,44,22,59,49] produce an image with the source appearance in the target pose. To deal with pixel miss-alignments, it is helpful to transform the pixels in the original image to match the target pose within the network [60,11,42]. Following similar ideas to growing GANs [36], high resolution anime characters can be generated [29].…”

Section: Related Work (mentioning)

confidence: 99%

360-Degree Textures of People in Clothing from a Single Image

Lazova, Insafutdinov, Pons-Moll 2019

2019 International Conference on 3D Vision (3DV)

[Figure 1 panels: DensePose; Garment segmentation; Partial texture; Completed texture; Partial segmentation; Completed segmentation; Displacement maps; Input view; Fully-textured 3D avatar]

Figure 1: Given a single view of a person we predict a complete texture map in the UV space, complete clothing segmentation as well as a displacement map for the SMPL model [41], which we then combine to obtain a fully-textured 3D avatar.

In this paper we predict a full 3D avatar of a person from a single image. We infer texture and geometry in the UV-space of the SMPL model using an image-to-image translation method. Given partial texture and segmentation layout maps derived from the input view, our model predicts the complete segmentation map, the complete texture map, and a displacement map. The predicted maps can be applied to the SMPL model in order to naturally generalize to novel poses, shapes, and even new clothing. In order to learn our model in a common UV-space, we non-rigidly register the SMPL model to thousands of 3D scans, effectively encoding textures and geometries as images in correspondence. This turns a difficult 3D inference task into a simpler image-to-image translation one. Results on rendered scans of people and images from the DeepFashion dataset demonstrate that our method can reconstruct plausible 3D avatars from a single image. We further use our model to digitally change pose, shape, swap garments between people and edit clothing. To encourage research in this direction we will make the source code available for research purposes [5].

“…[17] proposes a cycle-consistent VAE which adds a cyclic loss to the VAE objective. [40] directly models π as keypoints. All of these methods rely on the same basic principle for disentanglement: Constraining the amount of information in π.…”

Section: Disentangling Without Pose-annotations (mentioning)

confidence: 99%

Unsupervised Robust Disentangling of Latent Characteristics for Image Synthesis

Esser

Haux, Ommer 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite

Deep generative models come with the promise to learn an explainable representation for visual objects that allows image sampling, synthesis, and selective modification. The main challenge is to learn to properly model the independent latent characteristics of an object, especially its appearance and pose. We present a novel approach that learns disentangled representations of these characteristics and explains them individually. Training requires only pairs of images depicting the same object appearance, but no pose annotations. We propose an additional classifier that estimates the minimal amount of regularization required to enforce disentanglement. Thus both representations together can completely explain an image while being independent of each other. Previous methods based on adversarial approaches fail to enforce this independence, while methods based on variational approaches lead to uninformative representations. In experiments on diverse object categories, the approach successfully recombines pose and appearance to reconstruct and retarget novel synthesized images. We achieve significant improvements over state-of-the-art methods which utilize the same level of supervision, and reach performances comparable to those of pose-supervised approaches. However, we can handle the vast body of articulated object classes for which no pose models/annotations are available.
