2020
DOI: 10.48550/arxiv.2008.13426
Preprint

Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

Jiangliu Wang, Jianbo Jiao, Linchao Bao, et al.

Abstract: This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the video frames as inputs. In order …
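To make the pretext task concrete, here is a minimal sketch of how such statistical summaries could be derived from a clip. It assumes a uniform 4x4 grid partition of each frame, Farneback optical flow (via OpenCV) as the motion measure, and per-pixel temporal hue variation as the color-diversity measure; the grid size, quantization bins, and function names are illustrative assumptions, not the authors' exact labeling scheme.

```python
import cv2
import numpy as np

def motion_statistics(frames, grid=4):
    """Find the grid block with the largest accumulated motion and its
    dominant flow direction (one of 8 quantized bins). The 4x4 grid and
    8-bin quantization are illustrative assumptions."""
    h, w = frames[0].shape[:2]
    bh, bw = h // grid, w // grid
    mag_sum = np.zeros((grid, grid))
    ang_hist = np.zeros((grid, grid, 8))
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # ang in radians
        bins = (ang / (2 * np.pi) * 8).astype(int) % 8
        for i in range(grid):
            for j in range(grid):
                m = mag[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
                b = bins[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
                mag_sum[i, j] += m.sum()
                for k in range(8):  # magnitude-weighted direction histogram
                    ang_hist[i, j, k] += m[b == k].sum()
        prev = gray
    loc = np.unravel_index(mag_sum.argmax(), mag_sum.shape)
    return loc, int(ang_hist[loc].argmax())  # block location, dominant direction

def color_statistics(frames, grid=4):
    """Find the grid block whose hue varies most over time (the 'largest
    color diversity along the temporal axis') and its dominant hue bin."""
    h, w = frames[0].shape[:2]
    bh, bw = h // grid, w // grid
    hues = np.stack([cv2.cvtColor(f, cv2.COLOR_BGR2HSV)[..., 0]
                     for f in frames]).astype(np.float32)  # (T, H, W), hue 0..179
    diversity = np.zeros((grid, grid))
    dom_hue = np.zeros((grid, grid), dtype=int)
    for i in range(grid):
        for j in range(grid):
            block = hues[:, i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            diversity[i, j] = block.std(axis=0).mean()  # per-pixel temporal std
            hist, _ = np.histogram(block, bins=8, range=(0, 180))
            dom_hue[i, j] = hist.argmax()
    loc = np.unravel_index(diversity.argmax(), diversity.shape)
    return loc, int(dom_hue[loc])  # block location, dominant hue bin
```

The block index and the quantized direction and hue bins would then serve as the prediction targets the network is trained to regress or classify from the raw clip; the paper's exact partition, motion measure, and label encoding may differ.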

Cited by 3 publications (7 citation statements) · References 59 publications
“…Our method is significantly better than those using temporal cues to design pretext tasks. Though DSM [61] and STS [63] designed elaborate operations to build static appearance and dynamic motion statistics, our higher performance indicates good transferability of knowledge to the downstream task, hence showing the efficacy of our multi-level feature optimization. Further improvement can be observed when Kinetics-400 is utilized.…”
Section: Evaluation on Downstream Tasks (citation type: mentioning)
confidence: 93%
“…Under the end-to-end fine-tuning setting, for models pretrained on UCF-101, our method can outperform approaches that used simple temporal order or playback rate as their pretext task, and is comparable to STS [63], which designed a complex learning scheme to characterize appearance and motion statistics. This demonstrates that our method is capable of robust spatiotemporal modeling.…”
Section: Evaluation on Downstream Tasks (citation type: mentioning)
confidence: 97%