Self-supervised Motion Learning from Static Images

Huang, Ziyuan; Zhang, Shiwei; Jiang, Jianwen; Tang, Mingqian; Jin, Rong; Ang, Marcelo H.

doi:10.48550/arxiv.2104.00240

Cited by 5 publications

(5 citation statements)

References 54 publications

(87 reference statements)


“…Transfer learning is an important measure to improve the generalization ability of the model. Supervised training [24,26,8,29,25,11] as well as unsupervised ones [15,13,21] fore, we adopt the former strategy. Recently, Transformer-Based methods have shown great potential in image recognition [10,33] and video understanding [1,3].…”

Section: Pre-train Of Classification Models (mentioning)

confidence: 99%

A Stronger Baseline for Ego-Centric Action Detection

Qing, Huang, Wang et al. 2021

Preprint

Self Cite


This technical report analyzes an egocentric video action detection method we used in the 2021 EPIC-KITCHENS-100 competition hosted at the CVPR 2021 Workshop. The goal of the task is to locate the start and end times of each action in a long untrimmed video and to predict its action category. We adopt a sliding-window strategy to generate proposals, which adapts better to short-duration actions. In addition, we show that classification and proposal generation conflict within the same network. Separating the two tasks boosts detection performance with high efficiency. By simply employing these strategies, we achieved 16.10% on the test set of the EPIC-KITCHENS-100 Action Detection challenge using a single model, surpassing the baseline method by 11.7% in terms of average mAP.


“…There are multiple ways to prepare the pre-trained model, such as supervised pre-training [17,7,1,4] as is used in [14,13,18] as well as unsupervised ones [10,9]. Here we adopt the supervised pre-training as it yields a better downstream performance.…”

Section: Initialization Preparation (mentioning)

confidence: 99%

Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition

Huang, Qing, Wang et al. 2021

Preprint

Self Cite


With the recent surge of research on vision transformers, these models have demonstrated remarkable potential for various challenging computer vision applications, such as image recognition, point cloud classification, and video understanding. In this paper, we present empirical results for training a stronger video vision transformer on the EPIC-KITCHENS-100 Action Recognition dataset. Specifically, we explore training techniques for video vision transformers, such as augmentations, resolutions, and initialization. With our training recipe, a single ViViT model achieves 47.4% on the validation set of the EPIC-KITCHENS-100 dataset, outperforming what is reported in the original paper [1] by 3.4%. We found that video transformers are especially good at predicting the noun in the verb-noun action prediction task. This makes the overall action prediction accuracy of video transformers notably higher than that of convolutional ones. Surprisingly, even the best video transformers underperform convolutional networks on verb prediction. Therefore, we combine the video vision transformers with some of the convolutional video networks and present our solution to the EPIC-KITCHENS-100 Action Recognition competition.


“…The existing mainstream pre-training methods can be divided into two types: supervised [21, 9, 1, 8] and unsupervised [13,11]. Supervised methods can achieve stronger performance, but need to provide labels for each video.…”

Section: Training Backbones (mentioning)

confidence: 99%

Exploring Stronger Feature for Temporal Action Localization

Qing, Wang, Huang et al. 2021

Preprint

Self Cite


Temporal action localization aims to localize the start and end times of actions along with their categories. Limited by GPU memory, mainstream methods pre-extract features for each video. Therefore, feature quality determines the upper bound of detection performance. In this technical report, we explore classic convolution-based backbones and the recent surge of transformer-based backbones. We found that transformer-based methods can achieve better classification performance than convolution-based ones, but they cannot generate accurate action proposals. In addition, extracting features at a larger frame resolution to reduce the loss of spatial information can also effectively improve the performance of temporal action localization. We achieve 42.42% in terms of mAP on the validation set with a single SlowFast [9] feature by a simple combination, BMN [16] + TCANet [19], which is 1.87% higher than the result of 2020 [20]'s multi-model ensemble. We also rank 1st on the CVPR2021 HACS supervised Temporal Action Localization Challenge.

