2020
DOI: 10.48550/arXiv.2011.07949
Preprint

RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning

Abstract: We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only, which can be reused for downstream tasks such as action recognition. This task, however, is extremely challenging due to: 1) the highly complex spatial-temporal information in videos; and 2) the lack of labeled data for training. Unlike the representation learning for static images, it is difficult to construct a suitable self-supervised task to well model both motion and appearance…

Cited by 10 publications (6 citation statements). References 36 publications.

Citation statements, ordered by relevance:
“…Specifically, when pretrained with the 3D ResNet-18 backbone, our method outperforms 3D-RotNet [19], ST-Puzzle [21], and DPC [13] by a large margin (80.5% vs. 62.9%, 65.8%, and 68.2%, respectively, on UCF-101 and 52.3% vs. 33.7%, 33.7%, and 34.5%). When utilizing S3D-G as the backbone, our ASCNet achieves better accuracy than SpeedNet [2], Pace [34], and RSPNet [5] (90.8% vs. 81.1%, 87.1%, and 89.9%, respectively, on UCF-101 and 60.5% vs. 48.8%, 52.6%, and 59.9%) under the same settings. Remarkably, without the need for any annotation during pretraining, our ASCNet outperforms the ImageNet [10] supervised pretrained model on both datasets (90.8% vs. 86.6%, 60.5% vs. 57.7%).…”
Section: Evaluation on the Action Recognition Task
confidence: 96%
“…SpeedNet [2] predicts whether a video clip is sped up or not, while Pace [34] predicts the exact playback speed of the clip. Instead of predicting the absolute playback speed, RSPNet [5] predicts the relative speed to avoid the […]. We use a video encoder f(·; θ) to map the clips into the appearance and speed embedding spaces. For the ACP task, we pull the appearance features from the same video closer.…”
Section: Related Work
confidence: 99%
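The statement above contrasts absolute speed prediction (SpeedNet, Pace) with RSPNet's relative speed perception. Below is a minimal sketch of the relative-speed idea: sample two clips from the same video at different playback speeds (simulated here by frame striding) and train a head to judge which clip is faster. The names `sample_clip`, `RelativeSpeedHead`, and `encoder` are illustrative assumptions, not RSPNet's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_clip(video, stride, length=16, start=0):
    """Subsample frames with a temporal stride; a larger stride simulates
    faster playback. `video` is a (T, C, H, W) tensor of frames."""
    idx = start + stride * torch.arange(length)
    return video[idx]                                # (length, C, H, W)

class RelativeSpeedHead(nn.Module):
    """Predicts which of two clips was played faster (relative, not absolute)."""
    def __init__(self, feat_dim):
        super().__init__()
        self.fc = nn.Linear(2 * feat_dim, 1)

    def forward(self, z_a, z_b):
        # z_a, z_b: speed embeddings of the two clips, each (B, feat_dim)
        return self.fc(torch.cat([z_a, z_b], dim=1))  # logit for "a is faster"

def relative_speed_loss(encoder, head, video):
    # `encoder` is a placeholder for any network mapping a batched clip
    # tensor to (B, feat_dim) speed embeddings.
    clip_slow = sample_clip(video, stride=1).unsqueeze(0)
    clip_fast = sample_clip(video, stride=2).unsqueeze(0)
    z_slow, z_fast = encoder(clip_slow), encoder(clip_fast)
    logit = head(z_slow, z_fast)          # is the first clip the faster one?
    target = torch.zeros_like(logit)      # no: the second clip is faster
    return F.binary_cross_entropy_with_logits(logit, target)
```

Comparing two clips drawn from the same video means the label depends only on the speed ratio between the clips, not on how fast the underlying action happens to be, which is the motivation the quoted passage attributes to relative (rather than absolute) speed prediction.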
“…Based on the formulation in [31], [15, 16] contrast the representations of predicted future frames with those of the real ones. Some recent works exploit video pace variation as an augmentation and contrast representations with different paces [39, 6]. Whether in the video paradigm, which is the focus of this paper, or in the image domain, augmentations have been shown to be critical to learning a strong representation.…”
Section: Related Work
confidence: 99%
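The passage above describes treating pace variation as an augmentation and contrasting representations across paces [39, 6]. A common way to instantiate such a contrast is InfoNCE over a batch, where the two paced views of the same video form the positive pair; this is a generic sketch of that formulation, not necessarily the exact loss used in the cited works.

```python
import torch
import torch.nn.functional as F

def pace_contrastive_loss(z_1x, z_2x, temperature=0.1):
    """InfoNCE between two paced views of the same batch of videos.
    z_1x[i] and z_2x[i] are embeddings of video i at 1x and 2x pace."""
    z_1x = F.normalize(z_1x, dim=1)
    z_2x = F.normalize(z_2x, dim=1)
    logits = z_1x @ z_2x.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(z_1x.size(0))     # matching indices are positives
    return F.cross_entropy(logits, targets)

# Usage (encoder and the paced clip tensors are assumed):
#   loss = pace_contrastive_loss(encoder(clips_1x), encoder(clips_2x))
```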