Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

Wang, Jiangliu; Jiao, Jianbo; Bao, Linchao; He, Shengfeng; Liu, Wei; Liu, Yun-hui

doi:10.48550/arxiv.2008.13426

Cited by 3 publications

(7 citation statements)

References 59 publications

“…Our method is significantly better than those using temporal cues to design pretext tasks. Though DSM [61] and STS [63] designed elaborate operations to build static appearance and dynamic motion statistics, our higher performance indicates good transferability of knowledge to the downstream task, hence showing the efficacy of our multi-level feature optimization. Further improvement can be observed when Kinetics-400 is utilized.…”

Section: Evaluation On Downstream Tasks (mentioning)

confidence: 93%

“…Under the end-to-end finetune setting, for models pretrained on UCF-101, our method can outperform approaches that used simple temporal order or playback rate as their pretext task, and is comparable to STS [63] which designed a complex learning scheme to characterize appearance and motion statistics. This demonstrates that our method is capable of robust spatiotemporal modeling.…”

Section: Evaluation On Downstream Tasks (mentioning)

confidence: 97%

“…In self-supervised video representation learning, a line of works designed various pretext tasks, e.g., temporal ordering [46,74,75], spatiotemporal puzzles [33,63], colorization [59], playback speed prediction [31,6] and temporal cycle-consistency [66,30,37]. Some works proposed to predict future frames from the given sequence to learn feature embeddings [58,57,43,5].…”

Section: Self-supervised Video Representation Learning (mentioning)

confidence: 99%

“…Note that due to limited computational resources, we only report results with resolution 112 and training epochs 100. According to [50,63], further improvement is expected when using resolution 224 and more epochs for self-supervised pretraining. Video Retrieval.…”

Section: Evaluation On Downstream Tasks (mentioning)

confidence: 99%

“…To achieve this goal, early works designed various pretext tasks to uncover effective supervision from video sequences [6,46,33,31,74,63]. Recently, contrastive learning has shown to be powerful in image representation learning [28,47,55,12,26,77].…”

Section: Introduction (mentioning)

confidence: 99%

Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization

Qian, Liu, et al. 2021

Preprint

The crux of self-supervised video representation learning is to build general features from unlabeled videos. However, most recent works have mainly focused on high-level semantics and neglected lower-level representations and their temporal relationship which are crucial for general video understanding. To address these challenges, this paper proposes a multi-level feature optimization framework to improve the generalization and temporal modeling ability of learned video representations. Concretely, high-level features obtained from naive and prototypical contrastive learning are utilized to build distribution graphs, guiding the process of low-level and mid-level feature learning. We also devise a simple temporal modeling module from multi-level features to enhance motion pattern learning. Experiments demonstrate that multi-level feature optimization with the graph constraint and temporal modeling can greatly improve the representation ability in video understanding. Code is available here.

TCLR: Temporal Contrastive Learning for Video Representation

Dave, Gupta, Rizve, et al. 2021

Preprint

Contrastive learning has nearly closed the gap between supervised and self-supervised learning of image representations. Existing extensions of contrastive learning to the domain of video data however do not explicitly attempt to represent the internal distinctiveness across the temporal dimension of video clips. We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods. The first loss adds the task of discriminating between non-overlapping clips from the same video, whereas the second loss aims to discriminate between timesteps of the feature map of an input clip in order to increase the temporal diversity of the features. Temporal contrastive learning achieves significant improvement over the state-of-the-art results in downstream video understanding tasks such as action recognition, limited-label action classification, and nearest-neighbor video retrieval on video datasets across multiple 3D CNN architectures. With the commonly used 3D-ResNet-18 architecture, we achieve 82.4% (+5.1% increase over the previous best) top-1 accuracy on UCF101 and 52.9% (+5.4% increase) on HMDB51 action classification, and 56.2% (+11.7% increase) Top-1 Recall on UCF101 nearest neighbor video retrieval.

Static and Dynamic Concepts for Self-supervised Video Representation Learning

Qu, Ding, Liu, et al. 2022

Preprint

In this paper, we propose a novel learning scheme for self-supervised video representation learning. Motivated by how humans understand videos, we propose to first learn general visual concepts then attend to discriminative local areas for video understanding. Specifically, we utilize static frame and frame difference to help decouple static and dynamic concepts, and respectively align the concept distributions in latent space. We add diversity and fidelity regularizations to guarantee that we learn a compact set of meaningful concepts. Then we employ a cross-attention mechanism to aggregate detailed local features of different concepts, and filter out redundant concepts with low activations to perform local concept contrast. Extensive experiments demonstrate that our method distills meaningful static and dynamic concepts to guide video understanding, and obtains state-of-the-art results on UCF-101, HMDB-51, and Diving-48.
