2020
DOI: 10.1609/aaai.v34i07.6840
Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

Abstract: We propose a novel self-supervised method, referred to as Video Cloze Procedure (VCP), to learn rich spatial-temporal representations. VCP first generates “blanks” by withholding video clips and then creates “options” by applying spatio-temporal operations on the withheld clips. Finally, it fills the blanks with “options” and learns representations by predicting the categories of operations applied on the clips. VCP can act as either a proxy task or a target task in self-supervised learning. As a proxy task, i…
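As a rough illustration of the cloze construction described in the abstract, the sketch below (NumPy; the function names and the operation set are hypothetical assumptions — the paper's actual operations and implementation differ) withholds a clip, transforms it with a randomly chosen spatio-temporal operation, and uses the operation index as the self-supervised label:

```python
import numpy as np

# Hypothetical operation set; the paper's actual operations may differ.
def identity(clip, rng):
    return clip

def spatial_rotate(clip, rng):
    # Rotate every frame by 90 degrees; clip has shape (T, H, W, C).
    # Assumes square frames so the clip keeps its shape.
    return np.rot90(clip, k=1, axes=(1, 2))

def temporal_shuffle(clip, rng):
    # Randomly permute the frame order.
    return clip[rng.permutation(clip.shape[0])]

OPERATIONS = [identity, spatial_rotate, temporal_shuffle]

def make_vcp_example(video, start, length, rng):
    """Withhold a clip (the "blank"), turn it into an "option",
    fill the blank, and return the operation index as the label."""
    label = int(rng.integers(len(OPERATIONS)))
    option = OPERATIONS[label](video[start:start + length], rng)
    filled = video.copy()
    filled[start:start + length] = option
    return filled, label
```

A classifier trained to predict `label` from `filled` then learns spatio-temporal features without any manual annotation.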

Cited by 146 publications (85 citation statements) · References 15 publications (36 reference statements)
“…HMDB-51 is a dataset of 51 human action classes, each represented by at least 101 clips, for a total of 6,849 clips. In this paper, we verify our method by using the clips for pre-training to compare the performance of the self-supervised learning methods [8], [9], [11]. Specifically, we first train a 3D ConvNet on UCF-101 without label information and then fine-tune the model on labeled videos from UCF-101 and HMDB-51, respectively.…”
Section: Results
confidence: 98%
“…Xu et al. [9] efficiently improved the frame order prediction method [8] by sorting the order of neighboring clips, known as video clip order prediction (VCOP), where the clips are consistent with the video dynamics. In [11], the video cloze procedure (VCP) was proposed to learn spatio-temporal representations of video data using spatial rotation and temporal shuffling operations, which enhanced accuracy in action recognition. Our proposed method is inspired by [8] and [9], but we make use of the playback speeds of the videos rather than the correct sequential order of sampled frames [8] or clips [9].…”
Section: Related Work
confidence: 99%
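The playback-speed idea mentioned in this citation can be sketched as follows (NumPy; the function name, speed set, and sampling scheme are illustrative assumptions, not the cited method's exact procedure): sample a clip by striding over frames at a randomly chosen playback speed, and use the speed index as the label.

```python
import numpy as np

def speed_pretext_example(video, num_frames, speeds=(1, 2, 4), rng=None):
    """Sample `num_frames` frames at a random playback speed; the speed
    index serves as the self-supervised label."""
    rng = rng or np.random.default_rng()
    label = int(rng.integers(len(speeds)))
    stride = speeds[label]
    span = stride * (num_frames - 1) + 1  # frames spanned by the strided clip
    start = int(rng.integers(len(video) - span + 1))
    clip = video[start:start + span:stride]
    return clip, label
```

A network predicting `label` from `clip` must capture motion dynamics, which is the intuition behind speed-based pretext tasks.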
“…Due to the uniqueness and discriminative power of corners, we look forward to extending this work to accurate arbitrary-shape text detection and end-to-end text spotting. We would also like to combine this work with self-supervised learning [27][28][29][30].…”
Section: Discussion
confidence: 99%