Self-supervised Video Representation Learning by Pace Prediction

Wang, Jiangliu; Jiao, Jianbo; Liu, Yunhui

doi:10.1007/978-3-030-58520-4_30

Cited by 164 publications

(202 citation statements)

References 45 publications

Supporting

Mentioning

202

Contrasting


“…However, our method always outperforms the two other methods, VCOP and VCP, regardless of the 3D ConvNet model. From Table 9, our method shows results comparable to the state-of-the-art (SOTA) self-supervised methods such as VCOP, VCP, Dense Predictive Coding (DPC) [39], SpeedNet [24], Temporal Transformation (TT) [25], Pace Prediction (PP) [40], and CoCLR [38]. Since each method has a different backbone architecture, hyper-parameter setting, and augmentation method, we compare the performance for each backbone and augmentation setting to ensure a fair comparison.…”

Section: B Action Recognition (mentioning)

confidence: 85%

Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video

et al. 2021


We propose a self-supervised visual learning method that predicts the variable playback speeds of a video. Without semantic labels, we learn the spatio-temporal visual representation of the video by leveraging the variations in visual appearance across different playback speeds under the assumption of temporal coherence. To learn the spatio-temporal visual variations in the entire video, we not only predict a single playback speed but also generate clips of various playback speeds and directions with randomized starting points. Hence the visual representation can be learned from the meta information (playback speeds and directions) of the video. We also propose a new layer-dependable temporal group normalization method that can be applied to 3D convolutional networks to improve representation learning performance: we divide the temporal features into several groups and normalize each one using its own corresponding parameters. We validate the effectiveness of our method by fine-tuning it on the action recognition and video retrieval tasks on UCF-101 and HMDB-51.
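The clip generation described in this abstract (variable playback speed, playback direction, randomized starting point) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_speed_clip`, the default speed set `(1, 2, 4)`, and the 50% reversal probability are all assumptions made for illustration.

```python
import random


def sample_speed_clip(num_frames, clip_len=16, speeds=(1, 2, 4), rng=None):
    """Pick frame indices for one training clip and its pretext labels.

    Hypothetical helper: returns (indices, speed_label, direction_label),
    where speed_label indexes into `speeds` and direction_label is
    0 for forward playback and 1 for reversed playback.
    """
    rng = rng or random.Random()
    # Keep only the speeds whose sampled span fits inside the video.
    feasible = [s for s in speeds if (clip_len - 1) * s < num_frames]
    if not feasible:
        raise ValueError("video too short for any requested speed")
    speed = rng.choice(feasible)
    span = (clip_len - 1) * speed
    # Randomized starting point (randint is inclusive on both ends).
    start = rng.randint(0, num_frames - 1 - span)
    indices = [start + i * speed for i in range(clip_len)]
    # Assumed 50/50 choice of playback direction.
    reverse = rng.random() < 0.5
    if reverse:
        indices.reverse()
    return indices, speeds.index(speed), int(reverse)
```

A network trained on such clips would receive the frames at `indices` as input and the two labels as classification targets; the frame loading and the 3D ConvNet itself are outside the scope of this sketch.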

“…Self-supervised learning aims to extract the underlying useful representation of unlabeled data by designing effective pretext tasks. Recently, self-supervised techniques have a broad range of applications in different domains such as computer vision [14][15][16][17][18], and audio/speech processing [19][20][21][22]. For visual data, various pretext tasks are designed including solving jigsaw puzzles [14], rotation prediction [15] and visual contrastive learning [16] for image, and frame order validation [17] and pace prediction [18]…”

Section: Related Work (mentioning)

confidence: 99%

Semi-Supervised Time Series Classification by Temporal Relation Prediction

Fan, Zhang, Wang, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Semi-supervised learning (SSL) has proven to be a powerful approach in different domains, leveraging unlabeled data to reduce the reliance on large amounts of annotated data. However, few efforts consider the underlying temporal relation structure of unlabeled time series data in the semi-supervised learning paradigm. In this work, we propose a simple and effective semi-supervised time series classification architecture (termed SemiTime) that exploits the structure of unlabeled data in a self-supervised manner. Specifically, for the labeled time series, SemiTime conducts supervised classification directly under the supervision of the annotated class label. For the unlabeled time series, past-future segment pairs are sampled from the time series: two segments from the same time series candidate are in a positive temporal relation, while two segments from different candidates are in a negative temporal relation. The temporal relation between those segments is then predicted by SemiTime in a self-supervised manner. Finally, by jointly classifying labeled data and predicting the temporal relation of unlabeled data, SemiTime captures a useful representation of unlabeled time series. Extensive experiments on multiple real-world datasets show that SemiTime consistently outperforms the state of the art, which demonstrates the effectiveness of the proposed method. Code and data are publicly available at https://haoyfan.github.io.
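The past-future pair sampling described in this abstract can be sketched as follows. This is an illustrative reading of the abstract, not the released SemiTime code: the function name `sample_relation_pairs`, the fixed segment length, and the choice of one negative per positive are assumptions.

```python
import random


def sample_relation_pairs(series_list, seg_len, rng=None):
    """Build (past, other, label) triples from unlabeled series.

    Hypothetical helper: label 1 means `other` is the future segment that
    directly follows `past` in the same series (positive relation);
    label 0 means `other` comes from a different series (negative relation).
    """
    rng = rng or random.Random()
    if len(series_list) < 2:
        raise ValueError("need at least two series to form negatives")
    if any(len(s) < 2 * seg_len for s in series_list):
        raise ValueError("every series must hold a past and a future segment")
    pairs = []
    for i, series in enumerate(series_list):
        # Split point between the past segment and the future segment.
        split = rng.randint(seg_len, len(series) - seg_len)
        past = series[split - seg_len:split]
        future = series[split:split + seg_len]
        pairs.append((past, future, 1))
        # Negative: a segment drawn from a different series.
        j = rng.choice([k for k in range(len(series_list)) if k != i])
        other = series_list[j]
        start = rng.randint(0, len(other) - seg_len)
        pairs.append((past, other[start:start + seg_len], 0))
    return pairs
```

In a full training loop, a relation head would classify each pair as positive or negative, and that loss would be added to the supervised classification loss on the labeled series, as the abstract describes.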


“…Instead of directly predicting low-level information, methods based on spatio-temporal reasoning adopt more dedicated pretext tasks, such as temporal order prediction [13,14,39,40,75] and video speed prediction [15,16,18]. Compared to dense prediction methods, the spatio-temporal reasoning methods are more efficient since they discard additional generators.…”

Section: Spatio-temporal Reasoning Methods (mentioning)

confidence: 99%

“…Although improvement can be achieved by temporal order prediction, there is still a noticeable gap in performance when compared to fully-supervised methods. To narrow the performance gap, recent methods [15,16,18] … et al. [18] propose a self-supervised pace prediction task, where they discard the generation task in PRP [15] and include an additional contrastive learning task.…”

Section: Spatio-temporal Reasoning Methods (mentioning)

confidence: 99%


Effective action recognition with fully supervised and self-supervised methods

Cao¹


Action recognition in videos has attracted interest in the computer vision and machine learning communities thanks to its applications such as surveillance and smart homes. In addition to spatial information in individual frames, videos contain temporal information across the temporal dimension. Therefore, effective spatiotemporal representation is the key to accurate action recognition in videos. Previous works have proposed various fully-supervised and self-supervised methods for video representation learning. Most fully-supervised methods use convolutional neural networks (CNNs) to extract spatial representation, while temporal representation is usually modelled by pixel-wise correlations. However, it is inefficient to extract correlations between all pixels since some of them may relate to non-salient areas (e.g. backgrounds or environments). On the other hand, self-supervised methods are proposed to leverage more accessible unlabeled data on the Internet and transfer the extracted representation to different downstream tasks. The core of self-supervised methods is to design a pretext task where the supervision signal is automatically generated based on characteristics of unlabeled data. Although self-supervised methods avoid the annotation of labeled data, there is still room for performance improvement compared to fully-supervised methods. In this thesis, we address the above research gap with two novel deep learning methods, to advance fully-supervised and self-supervised methods, respectively. For fully-supervised learning, we propose a novel Key Point Shift Embedding Module (KPSEM) to adaptively extract channel-wise key point shifts across video frames without key point annotation for temporal feature extraction. Key points are adaptively extracted as feature points with maximum feature values at split regions, while key point shifts are the spatial displacements of corresponding key points.
The key point shifts are encoded as the overall temporal features via linear embedding layers in a multi-set manner. To advance self-supervised learning, we propose a novel self-supervised learning method, called Video Incoherence Detection (VID), that leverages incoherence detection for spatio-temporal feature extraction. It stems from the observation that the human visual system can easily identify video incoherence based on a comprehensive understanding of videos. Specifically, the training sample, denoted as the incoherent clip, is constructed from multiple sub-clips hierarchically sampled from the same raw video with various lengths of incoherence between each other. The network is trained to learn high-level representation by predicting the relative location and length of incoherence given the incoherent clip as input. Additionally, intra-video contrastive learning is introduced to maximize the mutual information between different incoherent clips from the same raw video. Our experiments show that both KPSEM and VID achieve state-of-the-art performance on action recognition wi...
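The construction of an incoherent clip in VID can be illustrated with a deliberately simplified sketch. The abstract describes multiple hierarchically sampled sub-clips; the version below assumes only two sub-clips separated by a single jump, and the name `make_incoherent_clip` and the parameters `clip_len` and `max_gap` are hypothetical.

```python
import random


def make_incoherent_clip(num_frames, clip_len=16, max_gap=8, rng=None):
    """Return (indices, location, gap) for a two-sub-clip incoherent clip.

    Simplifying assumption: one incoherence per clip. `location` is the
    position in the clip of the first frame after the jump, and `gap` is
    the number of raw-video frames skipped at the jump. Both would serve
    as prediction targets.
    """
    rng = rng or random.Random()
    if clip_len < 2 or max_gap < 1:
        raise ValueError("clip_len must be >= 2 and max_gap >= 1")
    if clip_len + max_gap > num_frames:
        raise ValueError("video too short for the requested clip and gap")
    gap = rng.randint(1, max_gap)            # length of the incoherence
    location = rng.randint(1, clip_len - 1)  # where the incoherence occurs
    start = rng.randint(0, num_frames - clip_len - gap)
    first = list(range(start, start + location))
    second_start = start + location + gap
    second = list(range(second_start, second_start + clip_len - location))
    return first + second, location, gap
```

Frames within each sub-clip stay consecutive, so the only temporal discontinuity the network can detect is the jump of `gap + 1` frames at `location`.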

