2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
DOI: 10.1109/iccvw.2019.00364

Self-Supervised Learning of Class Embeddings from Video

Abstract: This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information. At train time, two frames of the same video of an object class (e.g. human upper body) are extracted and each encoded to an embedding. Conditioned on these embeddings, the decoder network is tasked to transform one frame into another. To successfully perform long range transformations (e.g. a wrist lowered in one image should be mapped to the same wrist raised in a…
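The pretext task in the abstract — encode two frames of the same video, then decode one into the other conditioned on both embeddings — can be sketched minimally as follows. This is an illustrative toy with linear maps in NumPy; the function names, shapes, and weights are assumptions for exposition, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frame, W):
    # Toy linear encoder (assumption): flatten the frame and
    # project it to a low-dimensional embedding.
    return W @ frame.ravel()

def decode(src_frame, src_emb, tgt_emb, V):
    # Toy decoder (assumption): predicts the target frame from the
    # source frame plus a correction driven by the embedding difference,
    # i.e. the decoder is conditioned on both embeddings.
    delta = V @ (tgt_emb - src_emb)
    return src_frame.ravel() + delta

# Two frames of the same video (e.g. the same upper body in two poses),
# stand-ins here as random arrays.
h, w, d = 8, 8, 16
frame_a = rng.normal(size=(h, w))
frame_b = rng.normal(size=(h, w))

W = rng.normal(scale=0.1, size=(d, h * w))   # encoder weights (toy)
V = rng.normal(scale=0.1, size=(h * w, d))   # decoder weights (toy)

emb_a, emb_b = encode(frame_a, W), encode(frame_b, W)
pred_b = decode(frame_a, emb_a, emb_b, V)

# Self-supervised reconstruction objective: no labels are needed,
# the second frame itself is the regression target.
loss = np.mean((pred_b - frame_b.ravel()) ** 2)
```

Training would minimize this reconstruction loss over many frame pairs, forcing the embedding to capture whatever varies between frames of one video — pose and shape — rather than identity or appearance.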


Cited by 51 publications (61 citation statements)
References 49 publications
“…We used the pre-trained Fabnet [16] model to obtain embeddings for each frame in the video that contained the speaker's face. The pretext task of Fabnet is specially designed to encourage the network to learn the facial attributes that encode the landmarks, pose, and emotions.…”
Section: FabNet
confidence: 99%
“…Both pre-trained models of RoBERTa [19] and Wav2Vec [17] were accessed from Fairseq code-base [34] and used to extract text and speech SSL features. To download the pre-trained Fabnet model and extract features for video modality, we referred to their publication [16]. To extract features from videos, we cropped faces from each video frame using Retina-Face [42] facial recognition model.…”
Section: A. Self-Supervised Embedding Extraction
confidence: 99%
“…Other methods use no supervision, and some no data-driven prior either. The works of [28,45,50,67] learn to match pairs of images of an object, but they do not learn geometric invariants such as keypoints. [54,55,56] do learn sparse and dense landmarks, also without any annotation.…”
Section: Related Work
confidence: 99%
“…In this way, our method outputs poses that are directly interpretable. By contrast, state-of-the-art self-supervised keypoint detectors [25,50,54,67,74] do not learn "semantic" keypoints and, in post-processing, they need at least some paired supervision to output human-interpretable keypoints. We highlight this difference in fig.…”
Section: Introduction
confidence: 99%