2020
DOI: 10.48550/arxiv.2007.10730
Preprint

Video Representation Learning by Recognizing Temporal Transformations

Abstract: We introduce a novel self-supervised learning approach to learn representations of videos that are responsive to changes in the motion dynamics. Our representations can be learned from data without human annotation and provide a substantial boost to the training of neural networks on small labeled data sets for tasks such as action recognition, which require accurately distinguishing the motion of objects. We promote an accurate learning of motion without human annotation by training a neural network to discri…
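The abstract describes a pretext task in which a network must recognize which temporal transformation was applied to a clip. A minimal, hypothetical sketch of the data side of such a task is shown below, operating on lists of frame indices; the transformation set and all names here are illustrative assumptions, not the paper's exact design.

```python
import random

# Illustrative set of temporal transformations; the paper's actual set
# may differ (this is an assumption for the sketch).
TRANSFORMS = ["identity", "speedup_2x", "reverse", "shuffle"]

def transform_clip(frames, name, rng=random):
    """Apply one temporal transformation to a list of frame indices."""
    if name == "identity":
        return list(frames)
    if name == "speedup_2x":
        return list(frames)[::2]      # keep every second frame (2x speed)
    if name == "reverse":
        return list(frames)[::-1]     # play the clip backwards
    if name == "shuffle":
        out = list(frames)
        rng.shuffle(out)              # destroy temporal order entirely
        return out
    raise ValueError(f"unknown transformation: {name}")

def make_example(frames, rng=random):
    """Return (transformed_clip, label): the network is trained to
    classify which transformation index was applied."""
    label = rng.randrange(len(TRANSFORMS))
    return transform_clip(frames, TRANSFORMS[label], rng), label

clip, label = make_example(list(range(16)))
```

A network trained to classify `label` from the transformed clip must attend to motion dynamics, since appearance alone does not distinguish, say, a reversed clip from the original.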

Cited by 10 publications (18 citation statements)
References 44 publications
“…Unsupervised learning in videos has followed a similar trajectory with earlier methods focusing on predictive tasks based on motion, color and spatiotemporal ordering [29,43,1,44,78,85,60,84,58,57,21,51,86,66,22,48,91,16,87,70,45], and contrastive objectives with visual [74,79,34,53,28,92] and audio-visual input [65,4,5,49,3,68,69].…”
Section: Related Work
confidence: 97%
“…It is desired that in the embedding space the distance between two clips from the same video is smaller than the distance between clips from different videos. Jenni et al [18] introduced a novel self-supervised framework to learn video representations that are sensitive to changes in the motion dynamics. They observed that the motion of objects is essential for action recognition tasks.…”
Section: Related Work
confidence: 99%
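The embedding constraint quoted above (same-video clips closer than different-video clips) is commonly encoded as a triplet margin loss. The sketch below is a minimal pure-Python illustration of that idea, not necessarily the exact objective used in [18].

```python
def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distance.

    anchor/positive are embeddings of two clips from the same video;
    negative is an embedding of a clip from a different video.
    The loss is zero once the positive is closer than the negative
    by at least `margin`.
    """
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
```

For example, with `anchor=[0, 0]`, `positive=[0, 1]`, `negative=[5, 0]` the constraint is already satisfied and the loss is zero; swapping in a nearby negative yields a positive loss that pushes the embeddings apart during training.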
“…In self-supervised video representation learning, a line of works designed various pretext tasks, e.g., temporal ordering [46,74,75], spatiotemporal puzzles [33,63], colorization [59], playback speed prediction [31,6] and temporal cycle-consistency [66,30,37]. Some works proposed to predict future frames from the given sequence to learn feature embeddings [58,57,43,5].…”
Section: Self-supervised Video Representation Learning
confidence: 99%
“…To achieve this goal, early works designed various pretext tasks to uncover effective supervision from video sequences [6,46,33,31,74,63]. Recently, contrastive learning has been shown to be powerful in image representation learning [28,47,55,12,26,77].…”
Section: Introduction
confidence: 99%