2020
DOI: 10.48550/arxiv.2008.03800
Preprint

Spatiotemporal Contrastive Video Representation Learning

Abstract: We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Inspired by the recently proposed self-supervised contrastive learning framework, our representations are learned using a contrastive loss, where two clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data augmentation for video self-supervised learning a…
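
The contrastive objective described in the abstract is an InfoNCE-style loss over clip embeddings: two clips from the same video form a positive pair, and clips from other videos in the batch act as negatives. The following PyTorch sketch is a simplified illustration under assumptions of my own (function name, temperature value, and tensor shapes), not the authors' exact implementation:

```python
# Minimal sketch of an InfoNCE-style contrastive loss over video clip embeddings.
# Assumptions (not from the paper text): z1 and z2 are encoder outputs for two clips
# drawn from the same video, each of shape (batch, dim); temperature = 0.1.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Pull embeddings of clips from the same video together, push other videos' clips away."""
    z1 = F.normalize(z1, dim=1)          # project embeddings onto the unit hypersphere
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # (batch, batch) matrix of scaled cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives sit on the diagonal
    # Symmetrize: each clip treats its counterpart as the positive, all other clips as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example usage with random tensors standing in for encoder outputs:
if __name__ == "__main__":
    z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
    print(clip_contrastive_loss(z1, z2).item())
```

Note that this sketch draws negatives only from the opposite view; SimCLR-style formulations also include same-view negatives, which this simplified version omits.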

Cited by 48 publications (67 citation statements) | References 60 publications

Citation statements (ordered by relevance):
“…Most works demonstrate that selecting robust pretext tasks along with suitable data augmentations can greatly boost the quality of representations. In the image domain, data augmentation mainly involves color transformations and geometric transformations such as cropping, resizing, rotation, and flipping [3][6][45][46][47][48]. Recently, SwAV [10] outperformed other self-supervised methods by using multiple augmentations.…”
Section: B. Contrastive Learning (mentioning)
Confidence: 99%
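
The augmentations listed in the statement above are typically composed into a stochastic pipeline. Below is a hypothetical torchvision-based sketch; the parameter values are illustrative assumptions and are not taken from any of the cited works:

```python
# Hypothetical sketch of the color + geometric augmentations mentioned above,
# composed with torchvision; all parameter values are illustrative assumptions.
from torchvision import transforms

image_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.3, 1.0)),  # cropping + resizing
    transforms.RandomRotation(degrees=15),                # rotation
    transforms.RandomHorizontalFlip(p=0.5),               # flipping
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),           # color transformation
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

def augment_clip(frames):
    """Apply the augmentation to each frame of a clip (frames are PIL images).

    Note: this samples the transform independently per frame, which is the simplest
    option; many video methods instead fix one sampled transform for the whole clip
    so that the spatial augmentation stays temporally consistent.
    """
    return [image_augment(frame) for frame in frames]
```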
“…Several self-supervised methods [2,15] for video tasks have been studied by the computer vision community. Qian et al. [26] proposed the self-supervised Contrastive Video Representation Learning (CVRL) method, which uses a contrastive loss to map video clips into an embedding space. In this embedding space, the distance between two clips from the same video should be smaller than the distance between clips from different videos.…”
Section: Related Work (mentioning)
Confidence: 99%
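
The positive pairs described in this statement come from sampling two clips of the same video. A simple illustrative sampler is sketched below; the clip length, frame indexing, and video layout are assumptions for illustration, not details from the cited paper:

```python
# Illustrative sketch: form a positive pair by sampling two clips from one video;
# clips from other videos in the batch implicitly serve as negatives.
# Assumption: `video` is an indexable sequence of frames (e.g. a (T, H, W, C) array).
import random

def sample_clip_pair(video, clip_len: int = 16):
    """Return two randomly located clips of `clip_len` consecutive frames from the same video."""
    num_frames = len(video)
    assert num_frames >= clip_len, "video shorter than requested clip length"
    starts = [random.randint(0, num_frames - clip_len) for _ in range(2)]
    return tuple(video[s:s + clip_len] for s in starts)

# Usage: the embeddings of the two returned clips form the positive pair for the
# contrastive loss sketched earlier; clips drawn from other videos act as negatives.
```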
“…Recent advances in contrastive learning for image representation learning [8,57] also show promising results on videos. Research along this direction has achieved the state-of-the-art by contrasting on the temporal dimension [17,47], distilling motion representations [18], and designing better video augmentations [43].…”
Section: Related Work (mentioning)
Confidence: 99%