2020
DOI: 10.48550/arxiv.2007.10730
Preprint

Video Representation Learning by Recognizing Temporal Transformations

Abstract: We introduce a novel self-supervised learning approach to learn representations of videos that are responsive to changes in the motion dynamics. Our representations can be learned from data without human annotation and provide a substantial boost to the training of neural networks on small labeled data sets for tasks such as action recognition, which require accurately distinguishing the motion of objects. We promote an accurate learning of motion without human annotation by training a neural network to discri…
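The abstract describes a pretext task in which a network must recognize which temporal transformation was applied to a clip. A minimal, hypothetical sketch of the data side of such a task is shown below, operating on lists of frame indices; the transformation set and all names here are illustrative assumptions, not the paper's exact design.

```python
import random

# Illustrative set of temporal transformations; the paper's actual set
# may differ (this is an assumption for the sketch).
TRANSFORMS = ["identity", "speedup_2x", "reverse", "shuffle"]

def transform_clip(frames, name, rng=random):
    """Apply one temporal transformation to a list of frame indices."""
    if name == "identity":
        return list(frames)
    if name == "speedup_2x":
        return list(frames)[::2]      # keep every second frame (2x speed)
    if name == "reverse":
        return list(frames)[::-1]     # play the clip backwards
    if name == "shuffle":
        out = list(frames)
        rng.shuffle(out)              # destroy temporal order entirely
        return out
    raise ValueError(f"unknown transformation: {name}")

def make_example(frames, rng=random):
    """Return (transformed_clip, label): the network is trained to
    classify which transformation index was applied."""
    label = rng.randrange(len(TRANSFORMS))
    return transform_clip(frames, TRANSFORMS[label], rng), label

clip, label = make_example(list(range(16)))
```

A network trained to classify `label` from the transformed clip must attend to motion dynamics, since appearance alone does not distinguish, say, a reversed clip from the original.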

Cited by 10 publications (18 citation statements)
References 44 publications
“…Unsupervised learning in videos has followed a similar trajectory with earlier methods focusing on predictive tasks based on motion, color and spatiotemporal ordering [29,43,1,44,78,85,60,84,58,57,21,51,86,66,22,48,91,16,87,70,45], and contrastive objectives with visual [74,79,34,53,28,92] and audio-visual input [65,4,5,49,3,68,69].…”
Section: Related Work
confidence: 97%
“…It is desired that in the embedding space the distance between two clips from the same video is smaller than the distance between clips from different videos. Jenni et al [18] introduced a novel self-supervised framework to learn video representations that are sensitive to changes in the motion dynamics. They observed that the motion of objects is essential for action recognition tasks.…”
Section: Related Work
confidence: 99%
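The embedding constraint quoted above (same-video clips closer than different-video clips) is commonly encoded as a triplet margin loss. The sketch below is a minimal pure-Python illustration of that idea, not necessarily the exact objective used in [18].

```python
def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distance.

    anchor/positive are embeddings of two clips from the same video;
    negative is an embedding of a clip from a different video.
    The loss is zero once the positive is closer than the negative
    by at least `margin`.
    """
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
```

For example, with `anchor=[0, 0]`, `positive=[0, 1]`, `negative=[5, 0]` the constraint is already satisfied and the loss is zero; swapping in a nearby negative yields a positive loss that pushes the embeddings apart during training.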
“…In self-supervised video representation learning, a line of works designed various pretext tasks, e.g., temporal ordering [46,74,75], spatiotemporal puzzles [33,63], colorization [59], playback speed prediction [31,6] and temporal cycle-consistency [66,30,37]. Some works proposed to predict future frames from the given sequence to learn feature embeddings [58,57,43,5].…”
Section: Self-supervised Video Representation Learning
confidence: 99%
“…To achieve this goal, early works designed various pretext tasks to uncover effective supervision from video sequences [6,46,33,31,74,63]. Recently, contrastive learning has been shown to be powerful in image representation learning [28,47,55,12,26,77].…”
Section: Introduction
confidence: 99%