2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00689

Spatiotemporal Contrastive Video Representation Learning

Cited by 300 publications (207 citation statements)
References 40 publications
“…We conclude by applying the training strategies to the Kinetics-400 video classification task, using a 3D ResNet as the baseline architecture (Qian et al., 2020) (see Appendix G for experimental details). Table 6 presents an additive study of the RS training recipe and architectural improvements.…”
Section: Revised 3D ResNet for Video Classification (mentioning)
confidence: 99%
“…We follow the training and inference protocols in (Qian et al., 2020; Feichtenhofer et al., 2019). We train with a random 224×224 crop or its horizontal flip on the spatial domain and sample a 32-frame clip with temporal stride 2.…”
Section: G Video Classification Experimental Details (mentioning)
confidence: 99%
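
The protocol quoted above is concrete enough to sketch. Below is a minimal NumPy illustration, not the authors' released code, of the described pipeline: a 32-frame clip sampled with temporal stride 2, followed by a random 224×224 crop or its horizontal flip. The function names and the pad-by-looping behavior for short videos are assumptions made for this sketch.

import numpy as np

def sample_clip(frames: np.ndarray, num_frames: int = 32, stride: int = 2) -> np.ndarray:
    """Sample `num_frames` frames with the given temporal stride from (T, H, W, C) video."""
    span = num_frames * stride
    # Assumption: loop the video if it is shorter than the sampled span.
    if frames.shape[0] < span:
        reps = int(np.ceil(span / frames.shape[0]))
        frames = np.concatenate([frames] * reps, axis=0)
    start = np.random.randint(0, frames.shape[0] - span + 1)
    return frames[start:start + span:stride]

def random_crop_flip(clip: np.ndarray, size: int = 224) -> np.ndarray:
    """Apply one random spatial crop and a random horizontal flip to all frames of the clip."""
    _, h, w, _ = clip.shape
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    clip = clip[:, top:top + size, left:left + size, :]
    if np.random.rand() < 0.5:
        clip = clip[:, :, ::-1, :]  # flip along the width axis
    return clip

# Usage: a dummy 300-frame 256x320 RGB video.
video = np.random.randint(0, 256, size=(300, 256, 320, 3), dtype=np.uint8)
clip = random_crop_flip(sample_clip(video))
print(clip.shape)  # (32, 224, 224, 3)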
“…Self-supervised Video Representation Learning: In the past few years, a growing number of works have been dedicated to self-supervised video representation learning for various downstream tasks, such as action recognition [9, 8, 33], video retrieval [29], video captioning [37, 46], and many others. In this paper, we focus on the downstream task of label propagation.…”
Section: Related Work (mentioning)
confidence: 99%
“…The current dominant praxis is to train models to perform challenging self-supervised learning tasks on a large dataset, and then fine-tune the learnt representations for specific 'downstream' tasks using smaller, annotated datasets. Major successes have been reported in image classification [4, 7, 8, 11, 16], video understanding [13, 27] and NLP [17, 25, 28], with self-supervised approaches often matching or exceeding the performance of fully-supervised approaches.…”
Section: Related Work (mentioning)
confidence: 99%
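
The cited paper is itself an instance of this pretrain-then-fine-tune recipe, applied to contrastive video representation learning. As a minimal sketch of the pretraining objective (an illustration under stated assumptions, not the paper's implementation), the NumPy snippet below computes an InfoNCE-style contrastive loss in which embeddings of two augmented clips of the same video form a positive pair and all other clips in the batch act as negatives.

import numpy as np

def info_nce_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE over a batch: z1[i] and z2[i] embed two augmented clips of video i."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # L2-normalize embeddings
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                     # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # positives lie on the diagonal

# Usage: a random batch of 8 videos with 128-d clip embeddings.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(8, 128)), rng.normal(size=(8, 128))
print(info_nce_loss(z1, z2))

After pretraining with such an objective on a large unlabeled video corpus, the encoder would be fine-tuned on a smaller annotated dataset for the downstream task, as the quoted passage describes.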