2021
DOI: 10.1109/tpami.2021.3055560

Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos

Abstract: Unsupervised landmark learning is the task of learning semantic keypoint-like representations without the use of expensive input keypoint annotations. A popular approach is to factorize an image into a pose and appearance data stream, then to reconstruct the image from the factorized components. The pose representation should capture a set of consistent and tightly localized landmarks in order to facilitate reconstruction of the input image. Ultimately, we wish for our learned landmarks to focus on the foregro…

citations
Cited by 17 publications
(12 citation statements)
references
References 24 publications
“…Unsupervised Pose Representation -There are several recent examples of RGB-based unsupervised learning approaches to 3D pose estimation, such as [23,24,25,26,27,28]. Authors in [24,25,26] extract unsupervised pose features from 2D joints generated from RGB data.…”
Section: Related Work (mentioning)
confidence: 99%
“…For example, Chen et al [24] train a network through a 2D-3D consistency loss, computed after lifting 2D pose to 3D joints and reprojecting 3D onto 2D. Dundar et al [28] disentangle pose and appearance features from an RGB image by designing a self-supervised auto-encoder that reconstructs an input image into foreground and background with the constraint that the appearance features remain consistent temporally while the pose features change. Honari et al [27] also relies on temporal information and factorizes the pose and appearance features in a contrastive learning manner.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, for this technology to be used in gaming, simulators, and virtual applications, they need to be fully controllable. Recent work controls image generation by conditioning output on different type of inputs such as natural and synthetic images [17,50,23,46,7,28], landmarks [27,39,8], and semantic maps [14,40,31]. Among these methods, [27,8,15] disentangle images into pose and appearance in an unsupervised way, and show control over pose during inference.…”
Section: Related Work (mentioning)
confidence: 99%
“…Recent work controls image generation by conditioning output on different type of inputs such as natural and synthetic images [17,50,23,46,7,28], landmarks [27,39,8], and semantic maps [14,40,31]. Among these methods, [27,8,15] disentangle images into pose and appearance in an unsupervised way, and show control over pose during inference. However, since they do not disentangle images into their 3D attributes, the control and manipulation over generated images are still limited.…”
Section: Related Work (mentioning)
confidence: 99%