2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00413

Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

Abstract: We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are merely frame-based and are not applicable to many video analysis tasks where spatio-temporal features prevail. In this paper we propose a novel self-supervised approach to learn spatio-temporal features for video representation. Inspired by the success of two-stream approaches…

Cited by 207 publications (138 citation statements); References 41 publications.

“…All the approaches employ RGB frames as input. In particular, the Max‐GL is compared with DrLim [44], TempoCoh [11], object patch [9], temporal order [6], Odd‐One‐Out (O3N) [7], Order Prediction Network (OPN) [8], Generative Adversarial Network for Video (VGAN) [30], geometry [28], motion pattern [45], and video jigsaw [46]. As shown in Table 4, our Max‐GL achieves 64.0% on the UCF101 dataset, a 3.7% improvement over the second-best approach reported in [7].…”
Section: Methods
confidence: 99%
“…The general pipeline is to train a pretext task on unlabeled data and transfer the knowledge to a supervised downstream task (Jing and Tian 2020), or to cluster video datasets without manual supervision (Asano et al. 2020). Pretext tasks include dense predictive coding (Han et al. 2020), shuffling frames (Fernando et al. 2017; Xu et al. 2019), exploiting spatial and/or temporal order (Jenni et al. 2020; Tschannen et al. 2020; Wang et al. 2019), or matching frames with other modalities (Afouras et al. 2020; Alayrac et al. 2020; Owens and Efros 2018; Patrick et al. 2020). Self-supervised approaches utilize unlabeled training videos to learn representations without semantic class labels.…”
Section: Self-Supervised Video Learning
confidence: 99%
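The pretrain-then-transfer pipeline summarized in the statement above can be made concrete with a short sketch. The snippet below is only an illustrative assumption (a toy PyTorch 3D-conv backbone and a frame-order pretext task), not code from this paper or any cited work.

import torch
import torch.nn as nn

class VideoBackbone(nn.Module):
    # Toy 3D-conv encoder producing a clip-level feature vector.
    def __init__(self, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )

    def forward(self, clip):                      # clip: (B, 3, T, H, W)
        return self.encoder(clip)

# Stage 1: self-supervised pretext task (here: ordered vs. shuffled frames).
backbone = VideoBackbone()
pretext_head = nn.Linear(256, 2)
params = list(backbone.parameters()) + list(pretext_head.parameters())
optimizer = torch.optim.SGD(params, lr=1e-2)

clips = torch.randn(8, 3, 16, 112, 112)           # unlabeled clips (random stand-in)
pretext_labels = torch.randint(0, 2, (8,))        # labels obtained "for free" by (not) shuffling frames
loss = nn.functional.cross_entropy(pretext_head(backbone(clips)), pretext_labels)
loss.backward()
optimizer.step()

# Stage 2: transfer — reuse the pretrained backbone for a supervised downstream task.
action_head = nn.Linear(256, 101)                 # e.g., 101 action classes in UCF101
logits = action_head(backbone(clips))             # fine-tune on labeled clips in practice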
“…In addition, some works applied self-supervised approaches to learn video features based on multi-stream structures. Wang et al. [23] proposed a two-stream-based self-supervised approach that learns visual features by regressing both motion and appearance statistics without action labels. In this work, RGB data and optical flow data were used to compute the appearance and motion statistics, respectively.…”
Section: Multi-Stream Structure for Action Recognition
confidence: 99%
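To ground the idea of regressing motion and appearance statistics described above, here is a rough numpy sketch; it is an assumption on my part, not the paper's exact formulation. Each spatial block is summarized by its average flow magnitude and orientation (motion) and its mean color (appearance), and a network taking only the RGB clip would be trained to regress these targets.

import numpy as np

def block_statistics(rgb_clip, flow_clip, grid=4):
    # rgb_clip: (T, H, W, 3) frames; flow_clip: (T-1, H, W, 2) optical flow.
    T, H, W, _ = rgb_clip.shape
    bh, bw = H // grid, W // grid
    motion_stats, appearance_stats = [], []
    for i in range(grid):
        for j in range(grid):
            fb = flow_clip[:, i*bh:(i+1)*bh, j*bw:(j+1)*bw, :]
            mag = np.linalg.norm(fb, axis=-1).mean()          # average motion magnitude
            ang = np.arctan2(fb[..., 1], fb[..., 0]).mean()   # average motion orientation
            rb = rgb_clip[:, i*bh:(i+1)*bh, j*bw:(j+1)*bw, :]
            color = rb.reshape(-1, 3).mean(axis=0)            # mean color of the block
            motion_stats.append([mag, ang])
            appearance_stats.append(color)
    return np.asarray(motion_stats), np.asarray(appearance_stats)

# The statistics come "for free" from the video itself and serve as regression
# targets for a 3D CNN that sees only the RGB frames.
m, a = block_statistics(np.random.rand(16, 112, 112, 3), np.random.rand(15, 112, 112, 2))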
“…2. Referring to the clip-level learning methods [1,2,23], the length of the clip is set to 16 frames in this paper.…”
Section: Problem Definition
confidence: 99%
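As a small illustration of the clip-level setting referenced above, the following sketch (illustrative only, not from the cited paper) samples a random contiguous 16-frame clip from a longer video.

import numpy as np

def sample_clip(video, clip_len=16):
    # video: (T, H, W, 3) array; returns a random contiguous clip of clip_len frames.
    num_frames = video.shape[0]
    start = np.random.randint(0, max(num_frames - clip_len, 0) + 1)
    return video[start:start + clip_len]

clip = sample_clip(np.zeros((120, 112, 112, 3), dtype=np.float32))
assert clip.shape == (16, 112, 112, 3)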