2019
DOI: 10.1609/aaai.v33i01.33018545

Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles

Abstract: Self-supervised tasks such as colorization, inpainting, and jigsaw puzzles have been utilized for visual representation learning on still images when labeled images are limited or entirely absent. Recently, this line of work has extended to the video domain, where the cost of human labeling is even higher. However, most existing methods are still based on 2D CNN architectures that cannot directly capture spatio-temporal information for video applications. In this paper, we introduce a new self-supervised task, Space-Time Cubic Puzzles, to train 3D CNNs…
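To make the pretext task concrete, here is a minimal PyTorch sketch of a cubic-puzzle setup: a clip is split into four space-time cubes, the cubes are shuffled, and a shared 3D-CNN encoder plus classifier predicts which of the 4! = 24 permutations was applied. The grid layout, tensor shapes, and the small backbone are illustrative assumptions, not the authors' exact configuration.

```python
import itertools
import torch
import torch.nn as nn

# All 4! = 24 orderings of four space-time cubes; the permutation index is the label.
PERMS = list(itertools.permutations(range(4)))

def make_puzzle(clip):
    """clip: (C, T, H, W) video tensor with T divisible by 4. Split along time
    into four cubes, shuffle them, and return the shuffled cubes plus the
    permutation label. (A 2x2 spatial split works the same way; this sketch
    uses the temporal axis.)"""
    cubes = torch.chunk(clip, 4, dim=1)                       # four (C, T/4, H, W) pieces
    label = torch.randint(len(PERMS), (1,)).item()
    shuffled = torch.stack([cubes[i] for i in PERMS[label]])  # (4, C, T/4, H, W)
    return shuffled, label

class CubicPuzzleNet(nn.Module):
    """Shared 3D-CNN encoder applied to each cube; concatenated features feed a
    classifier over the 24 permutations (hypothetical small backbone, RGB input)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(4 * feat_dim, len(PERMS))

    def forward(self, cubes):                                 # cubes: (B, 4, C, T', H, W)
        feats = [self.encoder(cubes[:, i]) for i in range(4)]
        return self.classifier(torch.cat(feats, dim=1))
```

Training then simply minimizes cross-entropy between the predicted and applied permutation, and the pre-trained encoder is reused for downstream video tasks.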

Cited by 338 publications (247 citation statements)
References 12 publications
“…Third, in essence, DPC is trained by predicting future representations and using them as a “query” to pick […] Table 4: Comparison with other self-supervised methods; results are reported as an average over three training-testing splits. Note that previous works [15, 17] use a full-scale 3D-ResNet18, i.e. all convolutions are 3D, and the input sizes for the different models are shown.…”
Section: Discussion (mentioning)
confidence: 99%
“…This creates a shortcut for discriminating positives from spatial negatives by using padding patterns. One can limit the spatial receptive field (RF) by cutting input frames into patches [40, 17]. However, this brings some drawbacks: first, the self-supervised pre-trained network will have a limited RF, so the representation may not generalize well to downstream tasks where a large RF is required.…”
Section: Avoiding Shortcuts and Learning Semantics (mentioning)
confidence: 99%
“…One approach is to use temporal ordering or coherence as a proxy loss in order to learn the representation [10, 17, 22, 24, 30, 31, 49, 52, 64]. Other approaches use egomotion [2, 21] in order to enforce equivariance in feature space [21].…”
Section: Related Work (mentioning)
confidence: 99%
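As a concrete instance of the temporal-ordering idea cited above, here is a hedged sketch of an “ordered vs. shuffled” sampler; the function name, frame count, and binary-label setup are illustrative assumptions rather than any one cited paper's exact procedure.

```python
import torch

def order_verification_batch(clips, num_frames=3):
    """clips: (B, C, T, H, W). Sample num_frames frames per clip and either keep
    or shuffle their temporal order; a network trained to classify ordered vs.
    shuffled must learn temporal coherence. (A real sampler would reject
    shuffles that happen to remain ordered.)"""
    B, _, T, _, _ = clips.shape
    samples, labels = [], []
    for b in range(B):
        idx = torch.sort(torch.randperm(T)[:num_frames]).values  # ordered frame indices
        label = torch.randint(2, (1,)).item()                    # 1 = shuffled
        if label:
            idx = idx[torch.randperm(num_frames)]
        samples.append(clips[b, :, idx])                         # (C, num_frames, H, W)
        labels.append(label)
    return torch.stack(samples), torch.tensor(labels)
```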
“…For the final sequence t = B, this loss term is simply turned off. This work is similar to [16]; however, they used a 3D CNN for spatio-temporal encoding instead of the LSTM used in our work.…”
Section: Training Methodology (mentioning)
confidence: 99%
“…The advent of deep learning delivered highly discriminative hashing algorithms, e.g. those derived from deep auto-encoders [30, 38], convolutional neural networks (CNNs) [36, 19, 16], or recurrent neural networks [10, 35]. These techniques train a global model for content hashing over a representative video corpus, and focus on hashing short clips of a few minutes at most, using visual cues only.…”
Section: Related Work (mentioning)
confidence: 99%