2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00689

Spatiotemporal Contrastive Video Representation Learning

Cited by 300 publications (207 citation statements)
References 40 publications
“…We conclude by applying the training strategies to the Kinetics-400 video classification task, using a 3D ResNet as the baseline architecture (Qian et al., 2020) (see Appendix G for experimental details). Table 6 presents an additive study of the RS training recipe and architectural improvements.…”
Section: Revised 3D ResNet for Video Classification (mentioning)
confidence: 99%
“…We follow the training and inference protocols in (Qian et al., 2020; Feichtenhofer et al., 2019). We train with a random 224×224 crop or its horizontal flip on the spatial domain and sample a 32-frame clip with temporal stride 2.…”
Section: G Video Classification Experimental Details (mentioning)
confidence: 99%
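
The protocol quoted above is concrete enough to sketch. Below is a minimal NumPy illustration, not the authors' released code, of the described pipeline: a 32-frame clip sampled with temporal stride 2, followed by a random 224×224 crop or its horizontal flip. The function names and the pad-by-looping behavior for short videos are assumptions made for this sketch.

import numpy as np

def sample_clip(frames: np.ndarray, num_frames: int = 32, stride: int = 2) -> np.ndarray:
    """Sample `num_frames` frames with the given temporal stride from (T, H, W, C) video."""
    span = num_frames * stride
    # Assumption: loop the video if it is shorter than the sampled span.
    if frames.shape[0] < span:
        reps = int(np.ceil(span / frames.shape[0]))
        frames = np.concatenate([frames] * reps, axis=0)
    start = np.random.randint(0, frames.shape[0] - span + 1)
    return frames[start:start + span:stride]

def random_crop_flip(clip: np.ndarray, size: int = 224) -> np.ndarray:
    """Apply one random spatial crop and a random horizontal flip to all frames of the clip."""
    _, h, w, _ = clip.shape
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    clip = clip[:, top:top + size, left:left + size, :]
    if np.random.rand() < 0.5:
        clip = clip[:, :, ::-1, :]  # flip along the width axis
    return clip

# Usage: a dummy 300-frame 256x320 RGB video.
video = np.random.randint(0, 256, size=(300, 256, 320, 3), dtype=np.uint8)
clip = random_crop_flip(sample_clip(video))
print(clip.shape)  # (32, 224, 224, 3)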
“…Self-supervised Video Representation Learning: In the past few years, a growing number of works have been dedicated to self-supervised video representation learning for various downstream tasks, such as action recognition [9, 8, 33], video retrieval [29], video captioning [37, 46], and many others. In this paper, we focus on the downstream task of label propagation.…”
Section: Related Work (mentioning)
confidence: 99%
“…The current dominant praxis is to train models to perform challenging self-supervised learning tasks on a large dataset, and then fine-tune the learnt representations for specific 'downstream' tasks using smaller, annotated datasets. Major successes have been reported in image classification [4, 7, 8, 11, 16], video understanding [13, 27] and NLP [17, 25, 28], with self-supervised approaches often matching or exceeding the performance of fully-supervised approaches.…”
Section: Related Work (mentioning)
confidence: 99%
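
The cited paper is itself an instance of this pretrain-then-fine-tune recipe, applied to contrastive video representation learning. As a minimal sketch of the pretraining objective (an illustration under stated assumptions, not the paper's implementation), the NumPy snippet below computes an InfoNCE-style contrastive loss in which embeddings of two augmented clips of the same video form a positive pair and all other clips in the batch act as negatives.

import numpy as np

def info_nce_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE over a batch: z1[i] and z2[i] embed two augmented clips of video i."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # L2-normalize embeddings
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                     # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # positives lie on the diagonal

# Usage: a random batch of 8 videos with 128-d clip embeddings.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(8, 128)), rng.normal(size=(8, 128))
print(info_nce_loss(z1, z2))

After pretraining with such an objective on a large unlabeled video corpus, the encoder would be fine-tuned on a smaller annotated dataset for the downstream task, as the quoted passage describes.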