2020
DOI: 10.1609/aaai.v34i07.6840
Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

Abstract: We propose a novel self-supervised method, referred to as Video Cloze Procedure (VCP), to learn rich spatial-temporal representations. VCP first generates “blanks” by withholding video clips and then creates “options” by applying spatio-temporal operations on the withheld clips. Finally, it fills the blanks with “options” and learns representations by predicting the categories of operations applied on the clips. VCP can act as either a proxy task or a target task in self-supervised learning. As a proxy task, i…
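As a rough illustration of the cloze construction described in the abstract, the sketch below (NumPy; the function names and the operation set are hypothetical assumptions — the paper's actual operations and implementation differ) withholds a clip, transforms it with a randomly chosen spatio-temporal operation, and uses the operation index as the self-supervised label:

```python
import numpy as np

# Hypothetical operation set; the paper's actual operations may differ.
def identity(clip, rng):
    return clip

def spatial_rotate(clip, rng):
    # Rotate every frame by 90 degrees; clip has shape (T, H, W, C).
    # Assumes square frames so the clip keeps its shape.
    return np.rot90(clip, k=1, axes=(1, 2))

def temporal_shuffle(clip, rng):
    # Randomly permute the frame order.
    return clip[rng.permutation(clip.shape[0])]

OPERATIONS = [identity, spatial_rotate, temporal_shuffle]

def make_vcp_example(video, start, length, rng):
    """Withhold a clip (the "blank"), turn it into an "option",
    fill the blank, and return the operation index as the label."""
    label = int(rng.integers(len(OPERATIONS)))
    option = OPERATIONS[label](video[start:start + length], rng)
    filled = video.copy()
    filled[start:start + length] = option
    return filled, label
```

A classifier trained to predict `label` from `filled` then learns spatio-temporal features without any manual annotation.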

Cited by 146 publications (85 citation statements) · References 15 publications (36 reference statements)
“…HMDB-51 is a dataset of 51 human action classes, each represented by at least 101 clips, for a total of 6,849 clips. In this paper, we verify our method by using the clips for pre-training to compare the performance of the self-supervised learning methods [8], [9], [11]. Specifically, we first train a 3D ConvNet on UCF-101 without label information and then fine-tune the model on labeled videos from UCF-101 and HMDB-51, respectively.…”
Section: Results
confidence: 98%
“…Xu et al. [9] efficiently improved the frame order prediction method [8] by sorting the order of neighboring clips, known as video clip order prediction (VCOP), where the clips are consistent with the video dynamics. In [11], the video cloze procedure (VCP) was proposed to learn spatio-temporal representations of video data using spatial rotation and temporal shuffling operations, which enhanced accuracy in action recognition. Our proposed method is inspired by [8] and [9], but we make use of the playback speeds of the videos rather than the correct sequential order of sampled frames [8] or clips [9].…”
Section: Related Work
confidence: 99%
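The playback-speed idea mentioned in this citation can be sketched as follows (NumPy; the function name, speed set, and sampling scheme are illustrative assumptions, not the cited method's exact procedure): sample a clip by striding over frames at a randomly chosen playback speed, and use the speed index as the label.

```python
import numpy as np

def speed_pretext_example(video, num_frames, speeds=(1, 2, 4), rng=None):
    """Sample `num_frames` frames at a random playback speed; the speed
    index serves as the self-supervised label."""
    rng = rng or np.random.default_rng()
    label = int(rng.integers(len(speeds)))
    stride = speeds[label]
    span = stride * (num_frames - 1) + 1  # frames spanned by the strided clip
    start = int(rng.integers(len(video) - span + 1))
    clip = video[start:start + span:stride]
    return clip, label
```

A network predicting `label` from `clip` must capture motion dynamics, which is the intuition behind speed-based pretext tasks.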
“…Due to the uniqueness and discriminative power of corners, we look forward to extending this work to accurate arbitrary-shape text detection and end-to-end text spotting. We would also like to combine this work with self-supervised learning [27][28][29][30].…”
Section: Discussion
confidence: 99%