2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv45572.2020.9093278

Temporal Contrastive Pretraining for Video Action Recognition

Abstract: In this paper, we propose a self-supervised method for video representation learning based on Contrastive Predictive Coding (CPC) [27]. Previously, CPC has been used to learn representations for different signals (audio, text, or images). It benefits from autoregressive modeling and contrastive estimation to learn long-term relations inside a raw signal while remaining robust to local noise. Our self-supervised task consists of predicting the latent representation of future segments of the video. As …
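The pipeline the abstract describes (a segment encoder, an autoregressive summary of the past, and contrastive prediction of future latents) can be sketched as follows. This is a minimal PyTorch illustration of the generic CPC recipe, not the authors' exact architecture; the names and defaults (TemporalCPC, feat_dim, n_future, the choice of a GRU) are assumptions made for the sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalCPC(nn.Module):
    """CPC-style future prediction over video segments (illustrative)."""
    def __init__(self, encoder, feat_dim=512, ctx_dim=256, n_future=3):
        super().__init__()
        self.encoder = encoder  # maps one video segment to a feat_dim vector
        self.gru = nn.GRU(feat_dim, ctx_dim, batch_first=True)
        self.predictors = nn.ModuleList(
            [nn.Linear(ctx_dim, feat_dim) for _ in range(n_future)]
        )

    def forward(self, segments):
        # segments: (B, T, C, L, H, W) -- B clips, each split into T segments
        B, T = segments.shape[:2]
        z = self.encoder(segments.flatten(0, 1)).view(B, T, -1)  # (B, T, feat_dim)
        t = T - len(self.predictors)           # last observed segment index
        ctx, _ = self.gru(z[:, :t])            # summarize z_1 .. z_t
        c_t = ctx[:, -1]                       # context representation (B, ctx_dim)

        loss = 0.0
        for k, head in enumerate(self.predictors):
            pred = head(c_t)                   # prediction of z_{t+k+1}
            target = z[:, t + k]               # true future latent
            logits = pred @ target.t()         # (B, B): other clips act as negatives
            labels = torch.arange(B, device=logits.device)
            loss = loss + F.cross_entropy(logits, labels)  # InfoNCE term
        return loss / len(self.predictors)

Negatives here come from the other clips in the batch; the paper may draw them differently (e.g., from other temporal positions of the same video).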

Cited by 40 publications (24 citation statements)
References 18 publications
“…The goal of a future prediction task is to predict high-level information of a future time-step given a series of past ones. In [21,22], high-dimensional data are compressed into a compact lower-dimensional latent embedding space. Powerful autoregressive models are used to summarize the information in the latent space, and a context latent representation C_t is produced, as represented in Figure 7.…”
Section: Future Prediction
confidence: 99%
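For reference, the predictive setup this excerpt describes (latents z_t summarized into a context c_t) is typically trained with the InfoNCE objective from the CPC literature. Restated here, not quoted from the report, with W_k a learned k-step prediction matrix and Z a set containing the true future latent z_{t+k} plus negatives:

\mathcal{L}_k = -\,\mathbb{E}_Z\!\left[ \log \frac{\exp\!\left(z_{t+k}^{\top} W_k\, c_t\right)}{\sum_{z_j \in Z} \exp\!\left(z_j^{\top} W_k\, c_t\right)} \right]

Minimizing \mathcal{L}_k maximizes a lower bound on the mutual information between c_t and z_{t+k}, which is what makes the learned context useful for downstream recognition.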
“…Further, a shallow MLP (1 hidden layer) maps representations to a latent space where a contrastive loss is applied. For training a model for action recognition, the most common approach to extracting features from a sequence of image frames is to use a 3D-ResNet as the encoder [22,24].…”
Section: Encoders
confidence: 99%
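A minimal sketch of the encoder-plus-projection-head setup this excerpt describes, assuming torchvision's r3d_18 as a stand-in 3D-ResNet (the citing papers' exact backbones and dimensions may differ) and a batch-wise InfoNCE loss between two augmented views:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models.video import r3d_18

# 3D-ResNet backbone as the clip encoder (r3d_18 is an assumed stand-in).
backbone = r3d_18()
backbone.fc = nn.Identity()        # expose the 512-d features

# Shallow projection MLP (one hidden layer), as described above.
projector = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(inplace=True), nn.Linear(512, 128)
)

def contrastive_loss(h1, h2, temperature=0.1):
    """InfoNCE between two augmented views of the same batch of clips."""
    z1 = F.normalize(projector(h1), dim=1)
    z2 = F.normalize(projector(h2), dim=1)
    logits = z1 @ z2.t() / temperature   # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

# clips1, clips2: two augmentations of the same batch, shape (B, 3, T, H, W)
# loss = contrastive_loss(backbone(clips1), backbone(clips2))

As is common in this family of methods, the projection head is used only during pretraining; for downstream action recognition the 512-d backbone features are kept and the head is discarded.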
“…For unsupervised representation learning, we are inspired by the success of contrastive learning in images (Chen et al 2020b), short-trimmed videos (Lorre et al 2020; Singh et al 2021) and other areas of machine learning (Chen et al 2021; Rahaman, Ghosh, and Thiery 2021). Works which apply contrastive learning to longer sequences bring together multiple …”
[Figure residue removed: a plot of MoF versus the fraction of labeled video, comparing semi-supervised and supervised training.]
Section: Introduction
confidence: 99%
“…The few direct extensions of SimCLR to video (Bai et al 2020; Qian et al 2020; Lorre et al 2020) target action recognition on short clips of a few seconds. Others integrate contrastive learning by bringing together next-frame feature predictions with actual representations (Kong et al 2020; Lorre et al 2020), using path-object tracks for cycle-consistency (Wang, Zhou, and Li 2020), and considering multiple viewpoints (Sermanet et al 2018) or accompanying modalities like audio (Alwassel et al 2019) or text (Miech et al 2020). We are inspired by these works to develop contrastive learning for long-range segmentation.…”
Section: Introduction
confidence: 99%
“…Previously, contrastive learning has been widely adopted in image representation learning, where multiple methods have been proposed. Inspired by the success of contrastive learning on images, recent methods [18, 19, 79-81] have been proposed to leverage contrastive learning for video representation learning. For instance, [18] …”
Section: Contrastive Learning
confidence: 99%