2019
DOI: 10.1609/aaai.v33i01.33018545

Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles

Abstract: Self-supervised tasks such as colorization, inpainting, and jigsaw puzzles have been utilized for visual representation learning on still images when labeled images are limited or entirely absent. Recently, this line of work has extended to the video domain, where the cost of human labeling is even higher. However, most existing methods are still based on 2D CNN architectures that cannot directly capture spatio-temporal information for video applications. In this paper, we introduce a new self-supervised task, Space-Time Cubic Puzzles, to train 3D CNNs…
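To make the pretext task concrete, here is a minimal PyTorch sketch of a cubic-puzzle setup: a clip is split into four space-time cubes, the cubes are shuffled, and a shared 3D-CNN encoder plus classifier predicts which of the 4! = 24 permutations was applied. The grid layout, tensor shapes, and the small backbone are illustrative assumptions, not the authors' exact configuration.

```python
import itertools
import torch
import torch.nn as nn

# All 4! = 24 orderings of four space-time cubes; the permutation index is the label.
PERMS = list(itertools.permutations(range(4)))

def make_puzzle(clip):
    """clip: (C, T, H, W) video tensor with T divisible by 4. Split along time
    into four cubes, shuffle them, and return the shuffled cubes plus the
    permutation label. (A 2x2 spatial split works the same way; this sketch
    uses the temporal axis.)"""
    cubes = torch.chunk(clip, 4, dim=1)                       # four (C, T/4, H, W) pieces
    label = torch.randint(len(PERMS), (1,)).item()
    shuffled = torch.stack([cubes[i] for i in PERMS[label]])  # (4, C, T/4, H, W)
    return shuffled, label

class CubicPuzzleNet(nn.Module):
    """Shared 3D-CNN encoder applied to each cube; concatenated features feed a
    classifier over the 24 permutations (hypothetical small backbone, RGB input)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(4 * feat_dim, len(PERMS))

    def forward(self, cubes):                                 # cubes: (B, 4, C, T', H, W)
        feats = [self.encoder(cubes[:, i]) for i in range(4)]
        return self.classifier(torch.cat(feats, dim=1))
```

Training then simply minimizes cross-entropy between the predicted and applied permutation, and the pre-trained encoder is reused for downstream video tasks.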

Cited by 338 publications (247 citation statements)
References 12 publications
“…Third, in essence, DPC is trained by predicting future representations and using them as a “query” to pick […] Table 4: Comparison with other self-supervised methods; results are reported as an average over three training-testing splits. Note that previous works [15, 17] use a full-scale 3D-ResNet18, i.e. all convolutions are 3D, and the input sizes for the different models are shown.…”
Section: Discussion (mentioning)
confidence: 99%
“…This creates a shortcut for discriminating positives from spatial negatives by using padding patterns. One can limit the spatial receptive field (RF) by cutting input frames into patches [40, 17]. However, this brings some drawbacks: first, the self-supervised pre-trained network will have a limited RF, so the representation may not generalize well to downstream tasks where a large RF is required.…”
Section: Avoiding Shortcuts and Learning Semantics (mentioning)
confidence: 99%
“…One approach is to use temporal ordering or coherence as a proxy loss in order to learn the representation [10, 17, 22, 24, 30, 31, 49, 52, 64]. Other approaches use egomotion [2, 21] in order to enforce equivariance in feature space [21].…”
Section: Related Work (mentioning)
confidence: 99%
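As a concrete instance of the temporal-ordering idea cited above, here is a hedged sketch of an “ordered vs. shuffled” sampler; the function name, frame count, and binary-label setup are illustrative assumptions rather than any one cited paper's exact procedure.

```python
import torch

def order_verification_batch(clips, num_frames=3):
    """clips: (B, C, T, H, W). Sample num_frames frames per clip and either keep
    or shuffle their temporal order; a network trained to classify ordered vs.
    shuffled must learn temporal coherence. (A real sampler would reject
    shuffles that happen to remain ordered.)"""
    B, _, T, _, _ = clips.shape
    samples, labels = [], []
    for b in range(B):
        idx = torch.sort(torch.randperm(T)[:num_frames]).values  # ordered frame indices
        label = torch.randint(2, (1,)).item()                    # 1 = shuffled
        if label:
            idx = idx[torch.randperm(num_frames)]
        samples.append(clips[b, :, idx])                         # (C, num_frames, H, W)
        labels.append(label)
    return torch.stack(samples), torch.tensor(labels)
```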
“…For the final sequence t = B, this loss term is simply turned off. This work is similar to [16]; however, they used a 3D CNN for spatio-temporal encoding instead of the LSTM used in our work.…”
Section: Training Methodology (mentioning)
confidence: 99%
“…The advent of deep learning delivered highly discriminative hashing algorithms, e.g. those derived from deep auto-encoders [30, 38], convolutional neural networks (CNNs) [36, 19, 16], or recurrent neural networks [10, 35]. These techniques train a global model for content hashing over a representative video corpus, and focus on hashing short clips of a few minutes at most, using visual cues only.…”
Section: Related Work (mentioning)
confidence: 99%