Exploiting Image-trained CNN Architectures for Unconstrained Video Classification

Zha, Shengxin; Luisier, Florian; Andrews, W.D.; Srivastava, Nitish; Salakhutdinov, Ruslan

doi:10.5244/c.29.60

Cited by 148 publications

(108 citation statements)

References 31 publications

(52 reference statements)

Supporting

Mentioning

102

Contrasting

“…Considering deep learning methods, our method performs on par and is only outperformed from [33]. [33] makes use of the very deep VGGnet [24], which is a more competitive network than that the Alexnet architecture we rely on. Hence a direct comparison is not possible.…”

Section: State-of-the-art Comparisons (mentioning)

confidence: 99%

Dynamic Image Networks for Action Recognition

Bilen, Fernando, Gavves, et al. 2016

2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

We introduce the concept of dynamic image, a novel compact representation of videos useful for video analysis especially when convolutional neural networks (CNNs) are used. The dynamic image is based on the rank pooling concept and is obtained through the parameters of a ranking machine that encodes the temporal evolution of the frames of the video. Dynamic images are obtained by directly applying rank pooling on the raw image pixels of a video producing a single RGB image per video. This idea is simple but powerful as it enables the use of existing CNN models directly on video data with fine-tuning. We present an efficient and effective approximate rank pooling operator, speeding it up orders of magnitude compared to rank pooling. Our new approximate rank pooling CNN layer allows us to generalize dynamic images to dynamic feature maps and we demonstrate the power of our new representations on standard benchmarks in action recognition achieving state-of-the-art performance.

“…video classification in noisy video streams). Nevertheless, we examined the following papers : [15], [16], [17] which presents results of video classification using UCF-101 dataset. The best systems presented in those papers are based on various architectures of Convolutional Neural Networks (CNNs) and achieve accuracy of 80% and more.…”

Section: Experiments and The Discussion (mentioning)

confidence: 99%

Using Spatial Pooler of Hierarchical Temporal Memory for object classification in noisy video streams

Wielgosz, Pietroń, Wiatr 2016

Annals of Computer Science and Information Systems

Abstract: This paper focuses on analyzing a Spatial Pooler (SP) of Hierarchical Temporal Memory (HTM) ability for facilitating object classification in noisy video streams. In particular, we seek to determine whether employing SP as a component of the video system increases overall robustness to noise. We have implemented our own version of HTM and applied it to object recognition tasks under various testing conditions. The system is composed of a video preprocessing block, a dimensionality reduction section which contains SP, a histograms collecting module and SVM classifier. Our experiments involve assessing performance of two different system setups (i.e. a version featuring SP and one without it) under various noise conditions with 32-frame video files. In order to make tests fair and repeatable the videos of several 3-D geometric shapes were artificially generated. Subsequently, Gaussian noise of a different intensity was introduced to the videos making them more indistinct. Such an approach mimics real-life scenarios where the system is taught ideal objects and then faces in its normal working conditions the challenge of detecting noisy ones. The results of the experiments reveal the superiority of the solution featuring Spatial Pooler over the one without it. Furthermore, the system with SP performed better also in the experiment without a noise component introduced and achieved a mean F1-score of 0.91 in ten trials.

“…Zhongwen Xu et al [19] proposed discriminative CNN video representation to perform event detection from video dataset. Andrej Karpathy et al [6], Joe Yue-Hei Ng [8] and Shengxin Zha [20] used CNN architectures to perform video classification. They also retrained the top layers of their systems to study the generalization performance of their models and reported performance improvements from 88.6 percent to 88.0 percent.…”

Section: Related Work (mentioning)

confidence: 99%