2021
DOI: 10.1109/access.2021.3084840

Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video

Abstract: We propose a self-supervised visual learning method by predicting the variable playback speeds of a video. Without semantic labels, we learn the spatio-temporal visual representation of the video by leveraging the variations in the visual appearance according to different playback speeds under the assumption of temporal coherence. To learn the spatio-temporal visual variations in the entire video, we have not only predicted a single playback speed but also generated clips of various playback speeds and directi…
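
As a rough illustration of this kind of pretext task (a minimal sketch, not the authors' implementation: the speed set, clip length, and toy 3D-CNN encoder below are assumptions), one can generate clips at different frame-sampling strides and playback directions and train a classifier to recover them:

```python
import torch
import torch.nn as nn

# Assumed speed/direction classes; the paper's exact configuration may differ.
SPEEDS = [1, 2, 4]       # frame-sampling strides (playback speeds)
DIRECTIONS = [1, -1]     # forward / reverse playback

def sample_clip(video, speed, direction, clip_len=16):
    """Sample a clip_len-frame clip from `video` (T, C, H, W) at the given
    playback speed (frame stride) and direction; assumes T >= clip_len * speed."""
    max_start = video.shape[0] - clip_len * speed
    start = torch.randint(0, max_start + 1, (1,)).item()
    clip = video[start : start + clip_len * speed : speed]
    if direction == -1:                       # reversed playback
        clip = torch.flip(clip, dims=[0])
    return clip.permute(1, 0, 2, 3)           # (C, T, H, W) for 3D convolutions

class PlaybackSpeedPredictor(nn.Module):
    """Toy spatio-temporal encoder plus a head over (speed, direction) classes."""
    def __init__(self, n_classes=len(SPEEDS) * len(DIRECTIONS)):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, clips):                 # clips: (B, C, T, H, W)
        return self.head(self.encoder(clips))
```

A training target would then be the index of the (speed, direction) pair used to generate each clip, optimized with standard cross-entropy.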


Cited by 18 publications (9 citation statements)
References 30 publications
“…More recently, many works have been proposed to learn features by discriminating playback speeds. Epstein, Chen, and Vondrick (2020); Cho et al (2020) try to predict whether a clip is sped up or not. Wang, Jiao, and Liu (2020); Yao et al (2020); Jenni, Meishvili, and Favaro (2020) attempt to predict the specific playback speed of one clip.…”
Section: Related Work
Mentioning confidence: 99%
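
For concreteness, the two formulations contrasted in the statement above ("sped up or not" versus the specific playback speed) differ only in how the target label is built; the following sketch assumes integer frame strides as speed labels and is not taken from any of the cited works:

```python
import torch
import torch.nn.functional as F

SPEEDS = [1, 2, 4, 8]   # assumed frame-sampling strides

def binary_speedup_target(speed: int) -> torch.Tensor:
    """'Sped up or not': one label for normal speed, one for any faster stride."""
    return torch.tensor(int(speed > 1))

def specific_speed_target(speed: int) -> torch.Tensor:
    """Specific playback speed: one class per stride in SPEEDS."""
    return torch.tensor(SPEEDS.index(speed))

# Given per-clip logits from any encoder, both variants are trained with
# cross-entropy, e.g. F.cross_entropy(logits, specific_speed_target(4).unsqueeze(0)).
```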
“…However, these works suffer from the imprecise speed label issue. Cho et al (2020) design a method to sort video clips according to their different playback speeds. We use a spatial-temporal encoder f(·; θ) followed by two projection heads (i.e., g_m and g_a) to extract clip features for two pretext tasks. In the relative speed perception (RSP) task, we identify the relative playback speed between clips instead of predicting their specific playback speeds.…”
Section: Related Work
Mentioning confidence: 99%
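
As a rough sketch of the relative speed perception idea quoted above (not the citing paper's implementation; the shared encoder, the g_m head, and the three-way relative label are assumptions), two clips sampled at different strides can be compared as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSpeedPerception(nn.Module):
    """Shared spatio-temporal encoder f(.; theta) with a projection head g_m that
    classifies the relative playback speed of two clips (slower / same / faster)."""
    def __init__(self, encoder: nn.Module, feat_dim: int = 64):
        super().__init__()
        self.encoder = encoder                 # any module mapping a clip batch to (B, feat_dim)
        self.g_m = nn.Linear(2 * feat_dim, 3)  # relative label: a < b, a == b, a > b

    def forward(self, clip_a, clip_b):
        za, zb = self.encoder(clip_a), self.encoder(clip_b)
        return self.g_m(torch.cat([za, zb], dim=1))

def relative_speed_loss(logits, speed_a, speed_b):
    """Cross-entropy against the relative label derived from the two sampling
    strides; speed_a and speed_b are integer tensors of shape (B,)."""
    target = (speed_a > speed_b).long() * 2 + (speed_a == speed_b).long()
    return F.cross_entropy(logits, target)
```

The second projection head mentioned in the quote (g_a) would serve the other pretext task and is omitted here.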
“…However, these works suffer from the imprecise speed label issue. Cho et al [7] design a method to sort video clips according to their playback speeds. However, they do not explicitly encourage the model to learn appearance features.…”
Section: Related Work
Mentioning confidence: 99%
“…Recently, unsupervised video representation learning, which seeks to learn appearance and motion features from unlabeled videos, has attracted great attention [7,1,9,15]. This task, however, is very difficult due to several challenges: 1) The downstream video understanding tasks, such as action recognition, rely on both appearance features (e.g., texture and shape of objects, background scene) and motion features (e.g., the movement of objects).…”
Section: Introduction
Mentioning confidence: 99%