In this paper, we propose an efficient approach to exploit off-the-shelf image-trained CNN architectures for video classification, and evaluate it on the challenging TRECVID MED'14 and UCF-101 datasets. Our work is closely related to other research efforts towards the efficient use of CNNs for video classification. While it is now clear that CNN-based approaches outperform most state-of-the-art handcrafted features for image classification, it is not yet obvious that this holds true for video classification. Moreover, conclusions appear mixed regarding the benefit of training a spatiotemporal CNN vs. applying an image-trained CNN architecture to videos. Although the specificity of the considered video datasets might play a role, the way the 2D CNN architecture is exploited for video classification is certainly the main reason behind these contradictory observations. The additional computational cost of training on videos should also be taken into account when comparing the two options. Prior to training a spatiotemporal CNN architecture, it thus seems legitimate to fully exploit the potential of image-trained CNN architectures. Because our results are obtained on a highly heterogeneous video dataset, we believe they can serve as a strong 2D CNN baseline against which to compare CNN architectures specifically trained on videos.

We conduct an in-depth exploration of different strategies for event detection in videos using convolutional neural networks (CNNs) trained for image classification (Figures 1, 2). We study different ways of performing spatial and temporal pooling, feature normalization, and the choice of CNN layers as well as classifiers. Making judicious choices along these dimensions led to a very significant increase in performance over the more naive approaches that have been used until now. The modality fusion of image-trained CNN features and motion-based Fisher vectors yields a considerable improvement in classification performance.
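The pooling and normalization choices above can be illustrated with a minimal sketch. This is not the paper's exact pipeline; it assumes per-frame CNN activations have already been extracted into an array, and the function name and pooling options are illustrative:

```python
import numpy as np

def pool_and_normalize(frame_features, pooling="avg"):
    """Aggregate per-frame CNN descriptors into a single video-level vector.

    frame_features: (num_frames, dim) array of CNN activations
                    (e.g. from a fully-connected or convolutional layer)
    pooling: "avg" or "max" temporal pooling across frames
    """
    if pooling == "avg":
        video_vec = frame_features.mean(axis=0)
    elif pooling == "max":
        video_vec = frame_features.max(axis=0)
    else:
        raise ValueError(f"unknown pooling: {pooling}")
    # L2-normalize so videos of different lengths yield comparable descriptors
    norm = np.linalg.norm(video_vec)
    return video_vec / norm if norm > 0 else video_vec
```

The resulting fixed-length vector can then be fed to any standard classifier (e.g. a linear SVM), which is what makes image-trained features directly reusable for video-level prediction.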
On the TRECVID MED'14 dataset, our methods, based entirely on image-trained CNN features, can outperform several state-of-the-art non-CNN models. Our proposed late fusion of CNN- and motion-based features can further increase the mean average precision (mAP) on MED'14 from 34.95% to 38.
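Late fusion of the two modalities can be sketched as a weighted combination of per-class detection scores. The weighting scheme below is a common, simple instance and an assumption, not necessarily the exact fusion rule used in the paper:

```python
import numpy as np

def late_fuse(cnn_scores, motion_scores, weight=0.5):
    """Late fusion: combine per-class scores from two independently
    trained classifiers (CNN-feature-based and motion-Fisher-vector-based).

    weight: contribution of the CNN modality, in [0, 1];
            typically tuned on a validation set.
    """
    cnn_scores = np.asarray(cnn_scores, dtype=float)
    motion_scores = np.asarray(motion_scores, dtype=float)
    return weight * cnn_scores + (1.0 - weight) * motion_scores
```

Because fusion happens at the score level, each modality's classifier can be trained and calibrated separately before combining.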