RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning

Chen, Peihao; Huang, Deng; He, Dongliang; Long, Xiang; Zeng, Runhao; Wen, Shifeng; Tan, Mingkui; Gan, Chuang

doi:10.48550/arxiv.2011.07949

Cited by 10 publications

(6 citation statements)

References 36 publications


“…Specifically, when pretrained with the 3D ResNet-18 backbone, our method outperforms 3D-RotNet [19], ST-Puzzle [21], and DPC [13] by a large margin (80.5% vs. 62.9%, 65.8%, and 68.2%, respectively, on UCF-101 and 52.3% vs. 33.7%, 33.7%, and 34.5%). When utilizing S3D-G as the backbone, our ASC-Net achieves better accuracy than SpeedNet [2], Pace [34], and RSPNet [5] (90.8% vs. 81.1%, 87.1%, and 89.9%, respectively, on UCF-101 and 60.5% vs. 48.8%, 52.6%, and 59.9%) under the same settings. Remarkably, without the need of any annotation for pretraining, our ASCNet outperforms the ImageNet [10] supervised pretrained model over two datasets (90.8% vs. 86.6%, 60.5% vs. 57.7%).…”

Section: Evaluation On the Action Recognition Task (mentioning)

confidence: 96%

“…SpeedNet [2] predicts whether the video clip is sped up or not, while Pace [34] predicts the exact speed of the video clip. Instead of predicting the absolute playback speed, RSPNet [5] predicts the relative speed to avoid the we use a video encoder f (•; θ) to map the clips into appearance and speed embedding space. For the ACP task, we pull the appearance features from the same video closer.…”

Section: Related Work (mentioning)

confidence: 99%

“…Self-supervised pretraining stage. Following prior work [34,2,5], we sample 16 consecutive frames with 112 ×112 spatial size for each clip unless specified otherwise. Video clips are augmented using random cropping with resizing, random color jittering, random Gaussian blurring, and random grayscale and solarization.…”

Section: Implementation Details (mentioning)

confidence: 99%
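The clip sampling described in the statement above can be sketched as follows. This is a minimal illustration, not code from any of the cited papers: the function name `sample_clip_indices`, its parameters, and the modelling of playback speed as a frame stride are assumptions made for the example.

```python
import random


def sample_clip_indices(num_frames, clip_len=16, speed=1, rng=None):
    """Pick clip_len frame indices spaced `speed` frames apart.

    speed=1 yields consecutive frames; larger values imitate faster
    playback (an assumption of this sketch, not a quoted detail).
    """
    rng = rng or random.Random()
    # Number of source frames the clip spans, endpoints included.
    span = (clip_len - 1) * speed + 1
    if span > num_frames:
        raise ValueError("video too short for this clip length and speed")
    # Random start position such that the whole clip fits in the video.
    start = rng.randrange(num_frames - span + 1)
    return [start + i * speed for i in range(clip_len)]
```

Each sampled frame would then be cropped and resized to 112 × 112 before augmentation; that step is omitted here because it depends on the image library in use.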

“…Different evaluation protocols. We survey existing self-supervised video representation learning methods and make the following observations about the evaluation protocols: (1) Different works may use different cropping strategies for evaluation, such as center-crop [2,34,5], three-crop [27], and ten-crop [13,14,15] 3, we present the results of our method with different evaluation protocols used in prior works. Impact of the pretraining epochs.…”

Section: Evaluation On the Action Recognition Task (mentioning)

confidence: 99%


ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency

Huang, Wu, Hu et al. 2021

Preprint

Self Cite


We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information. Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other, but they require careful treatment of negative pairs by relying on large batch sizes, memory banks, extra modalities or customized mining strategies, inevitably including noisy data. In this paper, we observe that the consistency between positive samples is the key to learn robust video representations. Specifically, we propose two tasks to learn appearance and speed consistency, separately. The appearance consistency task aims to maximize the similarity between two clips of the same video with different playback speeds. The speed consistency task aims to maximize the similarity between two clips with the same playback speed but different appearance information. We show that joint optimization of the two tasks consistently improves the performance on downstream tasks, e.g., action recognition and video retrieval. Remarkably, for action recognition on the UCF-101 dataset, we achieve 90.8% accuracy without using any additional modalities or negative pairs for unsupervised pretraining, outperforming the ImageNet supervised pretrained model. Codes and models will be available.
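The two consistency tasks in the abstract above both pull a positive pair of clip embeddings together without using negatives. A minimal sketch of such an objective follows. The abstract only says that similarity is maximized, so the cosine-distance form, the function names, and the `weight` parameter are assumptions for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def consistency_loss(u, v):
    """Loss that is 0 for aligned embeddings and 2 for opposite ones."""
    return 1.0 - cosine_similarity(u, v)


def joint_loss(app_a, app_b, spd_a, spd_b, weight=1.0):
    """Sum of the appearance and speed consistency terms.

    app_a, app_b: appearance embeddings of two clips from the same video
                  played at different speeds.
    spd_a, spd_b: speed embeddings of two clips with the same playback
                  speed but different appearance.
    weight:       hypothetical balancing factor (assumed, not from the source).
    """
    return consistency_loss(app_a, app_b) + weight * consistency_loss(spd_a, spd_b)
```

Minimizing `joint_loss` raises the similarity of both positive pairs at once, which corresponds to the joint optimization the abstract describes.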


“…Based on the formulation in [31], [15,16] contrast between the representation of the predicted future frames and that of the real ones. Some recent works exploit video pace variation as augmentation and contrast between representations with different paces [39,6]. Whether it is in the video paradigm, which is the focus of this paper, or in the image domain, augmentations are all shown to be critical to learning a strong representation.…”

Section: Related Work (mentioning)

confidence: 99%

ParamCrop: Parametric Cubic Cropping for Video Contrastive Learning

Qing, Huang, Zhang et al. 2021

Preprint


The central idea of contrastive learning is to discriminate between different instances and force different views of the same instance to share the same representation. To avoid trivial solutions, augmentation plays an important role in generating different views, among which random cropping is shown to be effective for the model to learn a strong and generalized representation. Commonly used random crop operation keeps the difference between two views statistically consistent along the training process. In this work, we challenge this convention by showing that adaptively controlling the disparity between two augmented views along the training process enhances the quality of the learnt representation. Specifically, we present a parametric cubic cropping operation, ParamCrop, for video contrastive learning, which automatically crops a 3D cubic from the video by differentiable 3D affine transformations. ParamCrop is trained simultaneously with the video backbone using an adversarial objective and learns an optimal cropping strategy from the data. The visualizations show that the center distance and the IoU between two augmented views are adaptively controlled by ParamCrop and the learned change in the disparity along the training process is beneficial to learning a strong representation. Extensive ablation studies demonstrate the effectiveness of the proposed ParamCrop on multiple contrastive learning frameworks and video backbones. With ParamCrop, we improve the state-of-the-art performance on both HMDB51 and UCF101 datasets.


Back to the Future: Cycle Encoding Prediction for Self-supervised Contrastive Video Representation Learning

Yang, Mirmehdi, Burghardt 2020

Preprint


In this paper we show that learning video feature spaces in which temporal cycles are maximally predictable benefits action classification. In particular, we propose a novel learning approach termed Cycle Encoding Prediction (CEP) that is able to effectively represent high-level spatio-temporal structure of unlabelled video content. CEP builds a latent space wherein the concept of closed forward-backward as well as backward-forward temporal loops is approximately preserved. As a self-supervision signal, CEP leverages the bi-directional temporal coherence of the video stream and applies loss functions that encourage both temporal cycle closure as well as contrastive feature separation. Architecturally, the underpinning network structure utilises a single feature encoder for all video snippets, adding two predictive modules that learn temporal forward and backward transitions. We apply our framework for pretext training of networks for action recognition tasks. We report significantly improved results for the standard datasets UCF101 and HMDB51. Detailed ablation studies support the effectiveness of the proposed components. We publish source code for the CEP components in full with this paper.
