2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00413

Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

Abstract: We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are merely frame-based and are not applicable to many video analysis tasks where spatio-temporal features prevail. In this paper we propose a novel self-supervised approach to learn spatio-temporal features for video representation. Inspired by the success of two-stream approaches…

Cited by 207 publications (138 citation statements); References 41 publications.

“…All the approaches employ RGB frames as input. In particular, the Max‐GL is compared with DrLim [44], TempoCoh [11], object patch [9], temporal order [6], Odd‐One‐Out (O3N) [7], Order Prediction Network (OPN) [8], Generative Adversarial Network for Video (VGAN) [30], geometry [28], motion pattern [45], and video jigsaw [46]. As shown in Table 4, our Max‐GL achieves 64.0% on the UCF101 dataset, a 3.7% improvement over the second-best approach reported in [7].…”
Section: Methods
confidence: 99%
“…The general pipeline is to train a pretext task on unlabeled data and transfer the knowledge to a supervised downstream task (Jing and Tian 2020), or to cluster video datasets without manual supervision (Asano et al. 2020). Pretext tasks include dense predictive coding (Han et al. 2020), shuffling frames (Fernando et al. 2017; Xu et al. 2019), exploiting spatial and/or temporal order (Jenni et al. 2020; Tschannen et al. 2020; Wang et al. 2019), or matching frames with other modalities (Afouras et al. 2020; Alayrac et al. 2020; Owens and Efros 2018; Patrick et al. 2020). Self-supervised approaches utilize unlabeled training videos to learn representations without semantic class labels.…”
Section: Self-Supervised Video Learning
confidence: 99%
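The pretrain-then-transfer pipeline summarized in the statement above can be made concrete with a short sketch. The snippet below is only an illustrative assumption (a toy PyTorch 3D-conv backbone and a frame-order pretext task), not code from this paper or any cited work.

import torch
import torch.nn as nn

class VideoBackbone(nn.Module):
    # Toy 3D-conv encoder producing a clip-level feature vector.
    def __init__(self, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )

    def forward(self, clip):                      # clip: (B, 3, T, H, W)
        return self.encoder(clip)

# Stage 1: self-supervised pretext task (here: ordered vs. shuffled frames).
backbone = VideoBackbone()
pretext_head = nn.Linear(256, 2)
params = list(backbone.parameters()) + list(pretext_head.parameters())
optimizer = torch.optim.SGD(params, lr=1e-2)

clips = torch.randn(8, 3, 16, 112, 112)           # unlabeled clips (random stand-in)
pretext_labels = torch.randint(0, 2, (8,))        # labels obtained "for free" by (not) shuffling frames
loss = nn.functional.cross_entropy(pretext_head(backbone(clips)), pretext_labels)
loss.backward()
optimizer.step()

# Stage 2: transfer — reuse the pretrained backbone for a supervised downstream task.
action_head = nn.Linear(256, 101)                 # e.g., 101 action classes in UCF101
logits = action_head(backbone(clips))             # fine-tune on labeled clips in practice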
“…In addition, some works applied self-supervised approaches to learn video features based on multi-stream structures. Wang et al. [23] proposed a two-stream-based self-supervised approach that learns visual features by regressing both motion and appearance statistics without action labels. In this work, RGB data and optical flow data were used to compute the appearance and motion statistics, respectively.…”
Section: Multi-Stream Structure for Action Recognition
confidence: 99%
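To ground the idea of regressing motion and appearance statistics described above, here is a rough numpy sketch; it is an assumption on my part, not the paper's exact formulation. Each spatial block is summarized by its average flow magnitude and orientation (motion) and its mean color (appearance), and a network taking only the RGB clip would be trained to regress these targets.

import numpy as np

def block_statistics(rgb_clip, flow_clip, grid=4):
    # rgb_clip: (T, H, W, 3) frames; flow_clip: (T-1, H, W, 2) optical flow.
    T, H, W, _ = rgb_clip.shape
    bh, bw = H // grid, W // grid
    motion_stats, appearance_stats = [], []
    for i in range(grid):
        for j in range(grid):
            fb = flow_clip[:, i*bh:(i+1)*bh, j*bw:(j+1)*bw, :]
            mag = np.linalg.norm(fb, axis=-1).mean()          # average motion magnitude
            ang = np.arctan2(fb[..., 1], fb[..., 0]).mean()   # average motion orientation
            rb = rgb_clip[:, i*bh:(i+1)*bh, j*bw:(j+1)*bw, :]
            color = rb.reshape(-1, 3).mean(axis=0)            # mean color of the block
            motion_stats.append([mag, ang])
            appearance_stats.append(color)
    return np.asarray(motion_stats), np.asarray(appearance_stats)

# The statistics come "for free" from the video itself and serve as regression
# targets for a 3D CNN that sees only the RGB frames.
m, a = block_statistics(np.random.rand(16, 112, 112, 3), np.random.rand(15, 112, 112, 2))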
“…2. Referring to the clip-level learning methods [1,2,23], the length of the clip is set to 16 frames in this paper.…”
Section: Problem Definition
confidence: 99%
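As a small illustration of the clip-level setting referenced above, the following sketch (illustrative only, not from the cited paper) samples a random contiguous 16-frame clip from a longer video.

import numpy as np

def sample_clip(video, clip_len=16):
    # video: (T, H, W, 3) array; returns a random contiguous clip of clip_len frames.
    num_frames = video.shape[0]
    start = np.random.randint(0, max(num_frames - clip_len, 0) + 1)
    return video[start:start + clip_len]

clip = sample_clip(np.zeros((120, 112, 112, 3), dtype=np.float32))
assert clip.shape == (16, 112, 112, 3)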