2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020
DOI: 10.1109/cvpr42600.2020.00881

Self-Supervised Learning of Interpretable Keypoints From Unlabelled Videos

Abstract: We propose a new method for recognizing the pose of objects from a single image that for learning uses only unlabelled videos and a weak empirical prior on the object poses. Video frames differ primarily in the pose of the objects they contain, so our method distils the pose information by analyzing the differences between frames. The distillation uses a new dual representation of the geometry of objects as a set of 2D keypoints, and as a pictorial representation, i.e. a skeleton image. This has three benefits…

Cited by 77 publications (80 citation statements)
References 65 publications (108 reference statements)
“…quences, as unlabeled videos can be acquired without effort while providing rich information. The general idea of using videos for self-supervision is not new [13,28,36,40,63,65]. Here, we consider video sequences to automatically generate masks for the objects visible in their frames.…”
Section: Input Image (mentioning)
confidence: 99%
“…However, unlike our method, their method does not contain any temporal constraint, intuitive editing, nor does it demonstrate any result over videos. Jakab et al [16] use unpaired pose prior to train a keypoint extractor, but their work is also limited to humans. Similar to ours, Siarohin et al [34] learn keypoint representations in an unsupervised fashion, however, as we demonstrate (see Section 4), their method cannot handle cross-domain videos well.…”
Section: Shared Geometric Representation (mentioning)
confidence: 99%
“…These methods implicitly take in equivariant constraints by modeling objects as deformation (or flow). For example, Jakab et al [11] proposed a method for recognizing poses of objects, which is trained by learning auxiliary tasks without manually labeled data, where a network is deployed to reconstruct the original image. Wiles et al [12] developed an approach by leveraging information from multiple source frames and predicting confidence masks for each frame.…”
Section: Related Work (mentioning)
confidence: 99%
“…To further utilize the pixel-wise information, other video-based methods tries to disentangle pictorial representations from appearances. [11] distills a dual representation of pose from the target image, which reconstructs the objective with the appearance information extracted from the source frame. Though, progress has been achieved in terms of performances, these methods heavily depend on a preliminary that the predicting object must be shown in the video with stable motion.…”
Section: Introduction (mentioning)
confidence: 99%