2020
DOI: 10.48550/arxiv.2004.02753
Preprint

Temporally Coherent Embeddings for Self-Supervised Video Representation Learning

Abstract: This paper presents TCE: Temporally Coherent Embeddings for self-supervised video representation learning. The proposed method exploits the inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space, rather than indirectly learning it through ranking or predictive pretext tasks. In the same way that high-level visual information in the world changes smoothly, we believe that nearby frames in learned representations should demonstrate similar properties. Using this as…
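The abstract above is truncated, but its core idea (nearby frames should map to nearby points in the embedding space) can be illustrated with a minimal contrastive sketch. This is an assumed, simplified form in Python/PyTorch, not the authors' TCE implementation; the function name and the choice of an InfoNCE-style loss are illustrative only.

# Minimal sketch of a temporal-coherence objective (assumed form, not the
# authors' TCE code): embeddings of temporally adjacent frames from the same
# video are pulled together, while frames from other videos act as negatives.
import torch
import torch.nn.functional as F

def temporal_coherence_loss(z_t, z_t_next, temperature=0.1):
    # z_t, z_t_next: (B, D) embeddings of frame t and a nearby frame t+delta
    # from the same B videos; row i of z_t_next is the positive for row i of z_t.
    z_t = F.normalize(z_t, dim=1)
    z_t_next = F.normalize(z_t_next, dim=1)
    logits = z_t @ z_t_next.T / temperature                  # (B, B) cosine similarities
    targets = torch.arange(z_t.size(0), device=z_t.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)

In practice the two inputs would come from a shared frame encoder applied to frames sampled a few timesteps apart; everything off the diagonal of the similarity matrix serves as a negative.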

Cited by 3 publications (4 citation statements)
References 42 publications
“…For example, using video clips with different playback rates as positive pairs for contrastive loss along with predicting the playback rate [49]. Other works propose frame-based contrastive learning, along with existing pretext tasks of frame rotation prediction [27] and frame-tuple order verification [55], with the pretext tasks borrowed from [22] and [35], respectively. Unlike previous work, we take a different approach and propose TCLR: Temporal Contrastive Learning for Video Representation, to improve video contrastive learning by adding temporal contrastive losses that focus on the harder task of discriminating between clips within the same video instance.…”
Section: Related Work (mentioning)
confidence: 99%
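The TCLR statement quoted above hinges on contrasting clips drawn from the same video. A rough sketch of such an intra-video ("local-local") loss follows; it is a simplified reading of that description, not the paper's exact formulation, and the function name and tensor shapes are assumptions.

# Sketch of an intra-video temporal contrastive loss in the spirit of the
# statement above (an assumed simplification): two augmented views of the same
# clip are positives, the other clips of the SAME video are the negatives.
import torch
import torch.nn.functional as F

def intra_video_contrastive_loss(view_a, view_b, temperature=0.1):
    # view_a, view_b: (num_clips, D) embeddings of two augmented views of the
    # non-overlapping clips taken from one video.
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.T / temperature               # clip i vs. every clip of this video
    targets = torch.arange(a.size(0), device=a.device)
    # Clip i must match its own second view and reject its temporal neighbours,
    # forcing discrimination within a single video instance.
    return F.cross_entropy(logits, targets)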
“…Effectively, this model treats temporal jitter between neighboring frames as another type of data augmentation. A similar temporal contrastive learning model was proposed by Knights et al (2020) before.…”
Section: Static Contrastive Learning (mentioning)
confidence: 94%
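Read this way, "temporal jitter as augmentation" amounts to choosing the positive view of a frame from a small temporal window rather than (or in addition to) applying photometric or spatial transforms. The helper below is a hypothetical illustration of that sampling step; the name sample_jittered_pair and the max_jitter parameter are not from the cited works.

# Hypothetical sketch: sample an anchor frame and a temporally jittered
# positive from the same video, to be encoded and fed to a contrastive loss.
import random

def sample_jittered_pair(num_frames, max_jitter=4):
    t = random.randrange(num_frames)                      # anchor frame index
    offset = random.randint(-max_jitter, max_jitter)      # small temporal jitter
    t_pos = min(max(t + offset, 0), num_frames - 1)       # clamp to valid range
    return t, t_pos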
“…by applying color distortions to an image (Chen et al, 2020b). Contrastive self-supervised learning has been applied to both images (Oord et al, 2018;Hjelm et al, 2018;He et al, 2019;Chen et al, 2020b,a) and videos (Sermanet et al, 2018;Zhuang et al, 2020;Knights et al, 2020) with promising results. However, these works were primarily motivated by computer vision applications, and did not apply self-supervised learning methods to a developmentally realistic, longitudinal, first-person video dataset.…”
Section: Related Work (mentioning)
confidence: 99%
“…[8], [9] employ contrastive loss for representation learning and achieve high accuracy on classification and segmentation tasks. [10], [14] use the temporal correlations in the streaming data to improve representation learning. However, all these works assume that the whole training dataset is available in the learning process, and each mini-batch can be formed by sampling from the dataset.…”
Section: B. Related Work (mentioning)
confidence: 99%