2021
DOI: 10.1109/access.2021.3084840

Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video

Abstract: We propose a self-supervised visual learning method by predicting the variable playback speeds of a video. Without semantic labels, we learn the spatio-temporal visual representation of the video by leveraging the variations in the visual appearance according to different playback speeds under the assumption of temporal coherence. To learn the spatio-temporal visual variations in the entire video, we have not only predicted a single playback speed but also generated clips of various playback speeds and directi…
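
As a rough illustration of this kind of pretext task (a minimal sketch, not the authors' implementation: the speed set, clip length, and toy 3D-CNN encoder below are assumptions), one can generate clips at different frame-sampling strides and playback directions and train a classifier to recover them:

```python
import torch
import torch.nn as nn

# Assumed speed/direction classes; the paper's exact configuration may differ.
SPEEDS = [1, 2, 4]       # frame-sampling strides (playback speeds)
DIRECTIONS = [1, -1]     # forward / reverse playback

def sample_clip(video, speed, direction, clip_len=16):
    """Sample a clip_len-frame clip from `video` (T, C, H, W) at the given
    playback speed (frame stride) and direction; assumes T >= clip_len * speed."""
    max_start = video.shape[0] - clip_len * speed
    start = torch.randint(0, max_start + 1, (1,)).item()
    clip = video[start : start + clip_len * speed : speed]
    if direction == -1:                       # reversed playback
        clip = torch.flip(clip, dims=[0])
    return clip.permute(1, 0, 2, 3)           # (C, T, H, W) for 3D convolutions

class PlaybackSpeedPredictor(nn.Module):
    """Toy spatio-temporal encoder plus a head over (speed, direction) classes."""
    def __init__(self, n_classes=len(SPEEDS) * len(DIRECTIONS)):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, clips):                 # clips: (B, C, T, H, W)
        return self.head(self.encoder(clips))
```

A training target would then be the index of the (speed, direction) pair used to generate each clip, optimized with standard cross-entropy.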


Cited by 18 publications (9 citation statements)
References 30 publications
“…More recently, many works have been proposed to learn features by discriminating playback speeds. Epstein, Chen, and Vondrick (2020); Cho et al (2020) try to predict whether a clip is sped up or not. Wang, Jiao, and Liu (2020); Yao et al (2020); Jenni, Meishvili, and Favaro (2020) attempt to predict the specific playback speed of one clip.…”
Section: Related Work
Mentioning confidence: 99%
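
For concreteness, the two formulations contrasted in the statement above ("sped up or not" versus the specific playback speed) differ only in how the target label is built; the following sketch assumes integer frame strides as speed labels and is not taken from any of the cited works:

```python
import torch
import torch.nn.functional as F

SPEEDS = [1, 2, 4, 8]   # assumed frame-sampling strides

def binary_speedup_target(speed: int) -> torch.Tensor:
    """'Sped up or not': one label for normal speed, one for any faster stride."""
    return torch.tensor(int(speed > 1))

def specific_speed_target(speed: int) -> torch.Tensor:
    """Specific playback speed: one class per stride in SPEEDS."""
    return torch.tensor(SPEEDS.index(speed))

# Given per-clip logits from any encoder, both variants are trained with
# cross-entropy, e.g. F.cross_entropy(logits, specific_speed_target(4).unsqueeze(0)).
```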
“…However, these works suffer from the imprecise speed label issue. Cho et al (2020) design a method to sort video clips according to their different playback speeds. We use a spatial-temporal encoder f(·; θ) followed by two projection heads (i.e., g_m and g_a) to extract clip features for two pretext tasks. In the relative speed perception (RSP) task, we identify the relative playback speed between clips instead of predicting their specific playback speeds.…”
Section: Related Work
Mentioning confidence: 99%
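
As a rough sketch of the relative speed perception idea quoted above (not the citing paper's implementation; the shared encoder, the g_m head, and the three-way relative label are assumptions), two clips sampled at different strides can be compared as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSpeedPerception(nn.Module):
    """Shared spatio-temporal encoder f(.; theta) with a projection head g_m that
    classifies the relative playback speed of two clips (slower / same / faster)."""
    def __init__(self, encoder: nn.Module, feat_dim: int = 64):
        super().__init__()
        self.encoder = encoder                 # any module mapping a clip batch to (B, feat_dim)
        self.g_m = nn.Linear(2 * feat_dim, 3)  # relative label: a < b, a == b, a > b

    def forward(self, clip_a, clip_b):
        za, zb = self.encoder(clip_a), self.encoder(clip_b)
        return self.g_m(torch.cat([za, zb], dim=1))

def relative_speed_loss(logits, speed_a, speed_b):
    """Cross-entropy against the relative label derived from the two sampling
    strides; speed_a and speed_b are integer tensors of shape (B,)."""
    target = (speed_a > speed_b).long() * 2 + (speed_a == speed_b).long()
    return F.cross_entropy(logits, target)
```

The second projection head mentioned in the quote (g_a) would serve the other pretext task and is omitted here.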
“…However, these works suffer from the imprecise speed label issue. Cho et al [7] design a method to sort video clips according to their playback speeds. However, they do not explicitly encourage the model to learn appearance features.…”
Section: Related Work
Mentioning confidence: 99%
“…Recently, unsupervised video representation learning, which seeks to learn appearance and motion features from unlabeled videos, has attracted great attention [7,1,9,15]. This task, however, is very difficult due to several challenges: 1) The downstream video understanding tasks, such as action recognition, rely on both appearance features (e.g., texture and shape of objects, background scene) and motion features (e.g., the movement of objects).…”
Section: Introduction
Mentioning confidence: 99%