Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos

Dundar, Aysegul; Shih, Kevin J.; Garg, Animesh; Pottorff, Robert; Tao, Andrew; Catanzaro, Bryan

doi:10.1109/tpami.2021.3055560

Cited by 17 publications

(12 citation statements)

References 24 publications

“…Unsupervised Pose Representation - There are several recent examples of RGB-based unsupervised learning approaches to 3D pose estimation, such as [23,24,25,26,27,28]. Authors in [24,25,26] extract unsupervised pose features from 2D joints generated from RGB data.…”

Section: Related Work (mentioning)

confidence: 99%

“…For example, Chen et al [24] train a network through a 2D-3D consistency loss, computed after lifting 2D pose to 3D joints and reprojecting 3D onto 2D. Dundar et al [28] disentangle pose and appearance features from an RGB image by designing a self-supervised auto-encoder that reconstructs an input image into foreground and background with the constraint that the appearance features remain consistent temporally while the pose features change. Honari et al [27] also relies on temporal information and factorizes the pose and appearance features in a contrastive learning manner.…”

Section: Related Work (mentioning)

confidence: 99%
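
The citation statement above describes a 2D-3D consistency loss computed by lifting a 2D pose to 3D joints and reprojecting it onto 2D. The following is a minimal toy sketch of that general idea, not the implementation used by any of the cited works. The orthographic projection, the rotation about the vertical axis, and all names (`lifter`, `rotate_y`, `consistency_loss`) are assumptions made for illustration.

```python
import math

def lift(joints_2d, depths):
    """Attach one depth value to each 2D joint to form 3D joints."""
    return [(x, y, z) for (x, y), z in zip(joints_2d, depths)]

def rotate_y(joints_3d, angle):
    """Rotate 3D joints about the vertical (y) axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in joints_3d]

def project(joints_3d):
    """Orthographic projection (an assumption): drop the depth coordinate."""
    return [(x, y) for x, y, _ in joints_3d]

def consistency_loss(joints_2d, lifter, angle):
    """Lift, view from a rotated camera, lift again, rotate back, reproject.

    `lifter` is a hypothetical callable mapping 2D joints to per-joint depths.
    The result is the mean squared distance between the input 2D joints and
    the joints recovered after the round trip.
    """
    lifted = lift(joints_2d, lifter(joints_2d))
    virtual_2d = project(rotate_y(lifted, angle))
    relifted = lift(virtual_2d, lifter(virtual_2d))
    recovered_2d = project(rotate_y(relifted, -angle))
    total = sum((x - u) ** 2 + (y - v) ** 2
                for (x, y), (u, v) in zip(joints_2d, recovered_2d))
    return total / len(joints_2d)
```

In this sketch a lifter that always predicts zero depth is penalized once the virtual view is rotated, while a lifter whose depths are geometrically consistent with the true 3D shape leaves the loss near zero. That is the property such a loss is meant to reward.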

“…Our key contributions can be summarized as follows: (i) we propose a novel unsupervised method that learns view-invariant 3D pose representation from a 2D image without using 3D skeleton data and camera parameters. Our view-invariant features can be applied directly by downstream tasks to be resilient to human pose variations in unseen viewpoints, unlike unsupervised 3D pose estimation methods such as [23,24,25,26,27,28] which obtain view-specific 3D pose features, and require camera parameters and further steps to align their view-specific features in a canonical space, (ii) we introduce novel view-invariance and equivariance losses that impose on the network to preserve geometrical and positional order consistency of pose features - these losses can benefit the training process in other pretext tasks that exploit landmark representation, (iii) we evaluate the performance of learned pose features on two downstream tasks that demand view-invariancy and achieve state-of-the-art unsupervised cross-view action recognition accuracy on the NTU RGB+D standard benchmark dataset for RGB and depth images at 74.8% and 67.5% respectively, and for the first time we obtain unsupervised cross-view and cross-subject rank correlation results for human movement assessment scores on the QMAR dataset, while exceeding its supervised state-of-the-art results, (iv) we perform ablation studies to explore the impact of our loss functions on our proposed model.…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised View-Invariant Human Posture Representation

Sardari, Ommer, Mirmehdi

2021 (Preprint)

Most recent view-invariant action recognition and performance assessment approaches rely on a large amount of annotated 3D skeleton data to extract view-invariant features. However, acquiring 3D skeleton data can be cumbersome, if not impractical, in in-the-wild scenarios. To overcome this problem, we present a novel unsupervised approach that learns to extract view-invariant 3D human pose representation from a 2D image without using 3D joint data. Our model is trained by exploiting the intrinsic view-invariant properties of human pose between simultaneous frames from different viewpoints and their equivariant properties between augmented frames from the same viewpoint. We evaluate the learned view-invariant pose representations for two downstream tasks. We perform comparative experiments that show improvements on the state-of-the-art unsupervised cross-view action classification accuracy on NTU RGB+D by a significant margin, on both RGB and depth images. We also show the efficiency of transferring the learned representations from NTU RGB+D to obtain the first ever unsupervised cross-view and cross-subject rank correlation results on the multi-view human movement quality dataset, QMAR, and marginally improve on the state-of-the-art supervised results for this dataset. We also carry out ablation studies to examine the contributions of the different components of our proposed network.

Figure 1: Left: the proposed network learns to disentangle canonical 3D human pose representations and view-dependent features through simultaneous frames from different views and augmented frames from the same view. Right: the unsupervised learned canonical pose representation can be used for downstream tasks.

“…However, for this technology to be used in gaming, simulators, and virtual applications, they need to be fully controllable. Recent work controls image generation by conditioning output on different type of inputs such as natural and synthetic images [17,50,23,46,7,28], landmarks [27,39,8], and semantic maps [14,40,31]. Among these methods, [27,8,15] disentangle images into pose and appearance in an unsupervised way, and show control over pose during inference.…”

Section: Related Work (mentioning)

confidence: 99%

“…Recent work controls image generation by conditioning output on different type of inputs such as natural and synthetic images [17,50,23,46,7,28], landmarks [27,39,8], and semantic maps [14,40,31]. Among these methods, [27,8,15] disentangle images into pose and appearance in an unsupervised way, and show control over pose during inference. However, since they do not disentangle images into their 3D attributes, the control and manipulation over generated images are still limited.…”

Section: Related Work (mentioning)

confidence: 99%

View Generalization for Single Image Textured 3D Models

Bhattad, Dundar, Liu et al.

2021 (Preprint, self-citation)

Figure 1: Given a single 2D image input, we infer high quality textured 3D models (a triangular mesh and a texture map), which can then be used to render novel views. Our inferred models compare well with the source image when rendered from that view. More important, our method displays good view generalization: new views of an inferred model look like real pictures of that object. Project page: https://nv-adlr.github.io/view-generalization