2020
DOI: 10.48550/arxiv.2001.00294
Preprint
Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

Abstract: We propose a novel self-supervised method, referred to as Video Cloze Procedure (VCP), to learn rich spatial-temporal representations. VCP first generates "blanks" by withholding video clips and then creates "options" by applying spatio-temporal operations on the withheld clips. Finally, it fills the blanks with "options" and learns representations by predicting the categories of operations applied on the clips. VCP can act as either a proxy task or a target task in self-supervised learning. As a proxy task, it…
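To make the cloze construction concrete, here is a minimal PyTorch-style sketch of a VCP-like pretext step. The operation set (identity, spatial rotation, temporal shuffle) and all names are illustrative assumptions, not the paper's exact design.

```python
# Sketch of a VCP-style pretext task: withhold a clip ("blank"), transform it
# ("option"), fill the blank, and classify which operation was applied.
# The three operations here are illustrative, not the paper's exact set.
import torch
import torch.nn as nn


def apply_operation(clip: torch.Tensor, op: int) -> torch.Tensor:
    """clip: (C, T, H, W). Returns the transformed 'option' clip."""
    if op == 0:                                      # identity
        return clip
    if op == 1:                                      # spatial rotation by 90 degrees
        return torch.rot90(clip, k=1, dims=(2, 3))
    return clip[:, torch.randperm(clip.shape[1])]    # temporal shuffle


class VCPHead(nn.Module):
    """Wraps a (hypothetical) 3D-CNN encoder with an operation classifier."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_ops: int = 3):
        super().__init__()
        self.backbone = backbone
        self.fc = nn.Linear(feat_dim, num_ops)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        return self.fc(self.backbone(clip))


def vcp_step(model: VCPHead, withheld_clip: torch.Tensor,
             criterion=nn.CrossEntropyLoss()) -> torch.Tensor:
    # 1) "blank": a clip withheld from the video; 2) "option": transform it;
    # 3) fill the blank and predict the category of the applied operation.
    op = torch.randint(0, 3, (1,)).item()
    option = apply_operation(withheld_clip, op).unsqueeze(0)  # add batch dim
    logits = model(option)
    return criterion(logits, torch.tensor([op]))
```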

Cited by 8 publications (33 citation statements) · References 26 publications

Citation statements, ordered by relevance:
“…With an extra time dimension, videos provide rich static and dynamic information, and there is thus an abundant supply of various supervision signals. A natural way is to extend patch-based context prediction to spatio-temporal scenarios, such as spatio-temporal puzzles [21], video cloze procedure [30] and frame/clip order prediction [26,54,9]. Besides these extensions of image-based supervision, recent works propose to learn representations by predicting future frames [12,13].…”
Section: Related Work
confidence: 99%
“…Evaluation. We adopt the standard evaluation protocol [50,26] during testing. For each testing video, we uniformly sample 10 clips and average their prediction results.…”
Section: Implementation Details
confidence: 99%
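A small sketch of the 10-clip evaluation protocol described in this excerpt, assuming a trained `model` that maps a clip tensor (C, T, H, W) to class logits; the clip length and sampling details are assumptions.

```python
# Uniformly sample 10 clips from a test video and average softmax scores.
import torch


@torch.no_grad()
def predict_video(model, video: torch.Tensor,
                  clip_len: int = 16, n_clips: int = 10) -> torch.Tensor:
    """video: (C, T, H, W). Returns the class index after score averaging."""
    C, T, H, W = video.shape
    starts = torch.linspace(0, max(T - clip_len, 0), n_clips).long()
    scores = []
    for s in starts:
        clip = video[:, s:s + clip_len].unsqueeze(0)       # add batch dim
        scores.append(model(clip).softmax(dim=-1))
    return torch.stack(scores).mean(dim=0).argmax(dim=-1)  # averaged prediction
```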
“…Although these works demonstrated the effectiveness of self-supervised representation learning with unlabeled videos and showed impressive performance when transferring the learned features to video recognition tasks, their approaches are only applicable to a CNN that accepts one or two frames as input and cannot be applied to network architectures suited to spatio-temporal representations. Therefore, several recent papers [10], [13], [14], [20] used 3D CNNs as backbone networks to learn spatio-temporal representations, among which [10], [13], [14] extended the 2D frame-ordering pretext tasks to 3D video clip ordering, and [20] formulated the pretext task as a dense prediction problem and proposed to predict future frames. Very recently, self-supervised learning leveraging multi-modality sources, e.g., learning from video and audio [36], [37], is becoming increasingly popular.…”
Section: Self-Supervised Representation Learning
confidence: 99%
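The clip-ordering extension mentioned in this excerpt can be illustrated with a short sketch: shuffle N clips and classify which permutation was applied. The names (`encoder`, `order_fc`) and the choice of 3 clips are hypothetical.

```python
# Illustrative 3D clip-order pretext: 3 clips -> 6 permutation classes.
import itertools
import random
import torch
import torch.nn as nn

PERMS = list(itertools.permutations(range(3)))   # 6 possible orderings


def make_order_sample(clips):
    """clips: list of 3 tensors (C, T, H, W). Returns shuffled clips + label."""
    label = random.randrange(len(PERMS))
    shuffled = [clips[i] for i in PERMS[label]]
    return shuffled, torch.tensor([label])


class ClipOrderNet(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.encoder = encoder                    # shared 3D-CNN over each clip
        self.order_fc = nn.Linear(3 * feat_dim, len(PERMS))

    def forward(self, clips) -> torch.Tensor:
        feats = torch.cat([self.encoder(c.unsqueeze(0)) for c in clips], dim=1)
        return self.order_fc(feats)               # logits over permutations
```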
“…Stemming from these two works, various network architectures have been designed to learn video representations, including P3D [46], I3D [1], R(2+1)D [8], etc. In this work, following prior works [10], [14], we use three backbone networks, C3D [45], 3D-ResNet [8] and R(2+1)D [8], to validate the proposed approach. Backbone networks pre-trained with the proposed spatio-temporal statistics will be used as weight initialization and fine-tuned on the UCF101 [47] and HMDB51 [48] datasets for the action recognition downstream task.…”
Section: Representation Learning for Video Analytic Tasks
confidence: 99%
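The pre-train-then-fine-tune workflow in this excerpt can be sketched as follows. The tiny stand-in backbone, checkpoint path, and feature/class dimensions are placeholders; real work would use C3D, 3D-ResNet, or R(2+1)D as cited above.

```python
# Initialize a 3D backbone from self-supervised weights, swap in a
# classification head, and fine-tune for action recognition.
import torch
import torch.nn as nn

# Stand-in 3D-CNN backbone (a real setup would use C3D / 3D-ResNet / R(2+1)D).
backbone = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),                                         # -> (N, 64) features
)

# Load self-supervised pretext weights (path is a placeholder);
# strict=False drops any pretext-head keys absent from the backbone.
state = torch.load("pretext_checkpoint.pt")
backbone.load_state_dict(state, strict=False)

# 101-way classifier head for UCF101; fine-tune end-to-end.
model = nn.Sequential(backbone, nn.Linear(64, 101))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```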