Leaving Some Stones Unturned: Dynamic Feature Prioritization for Activity Detection in Streaming Video

Su, Yu-Chuan; Grauman, Kristen

doi:10.1007/978-3-319-46478-7_48

Cited by 32 publications

(18 citation statements)

References 47 publications

“…Most existing work focuses on extending 2D convolution to the video domain and modeling motion information in videos [19,29,28,35,24]. Only a few methods consider efficient video classification [38,31,40,20,10]. However, these approaches perform mean-pooling of scores/features from multiple frames, either uniformly sampled or decided by an agent, to classify a video clip.…”

Section: Related Work (mentioning)

confidence: 99%

AdaFrame: Adaptive Frame Selection for Fast Video Recognition

Xiong et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

190

175

We present AdaFrame, a framework that adaptively selects relevant frames on a per-input basis for fast video recognition. AdaFrame contains a Long Short-Term Memory network augmented with a global memory that provides context information for searching which frames to use over time. Trained with policy gradient methods, AdaFrame generates a prediction, determines which frame to observe next, and computes the utility, i.e., expected future rewards, of seeing more frames at each time step. At testing time, AdaFrame exploits predicted utilities to achieve adaptive lookahead inference such that the overall computational costs are reduced without incurring a decrease in accuracy. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet. AdaFrame matches the performance of using all frames with only 8.21 and 8.65 frames on FCVID and ActivityNet, respectively. We further qualitatively demonstrate learned frame usage can indicate the difficulty of making classification decisions; easier samples need fewer frames while harder ones require more, both at instance-level within the same class and at class-level among different categories.

“…In contrast, our policy network makes all routing decisions in a single step, resulting in lower overhead cost for the routing itself and thus larger computational savings. Reinforcement learning has also been applied for dynamic feature prioritization in images [26] and video [45,56], actively deciding which frames or image regions to visit next. These techniques could be used in tandem with our approach.…”

Section: Related Work (mentioning)

confidence: 99%

BlockDrop: Dynamic Inference Paths in Residual Networks

Wu, Nagarajan, Kumar et al. 2018

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

Self Cite

402

349

Very deep convolutional neural networks offer excellent recognition results, yet their computational expense limits their impact for many real-world applications. We introduce BlockDrop, an approach that learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation without degrading prediction accuracy. Exploiting the robustness of Residual Networks (ResNets) to layer dropping, our framework selects on-the-fly which residual blocks to evaluate for a given novel image. In particular, given a pretrained ResNet, we train a policy network in an associative reinforcement learning setting for the dual reward of utilizing a minimal number of blocks while preserving recognition accuracy. We conduct extensive experiments on CIFAR and ImageNet. The results provide strong quantitative and qualitative evidence that these learned policies not only accelerate inference but also encode meaningful visual information. Built upon a ResNet-101 model, our method achieves a speedup of 20% on average, going as high as 36% for some images, while maintaining the same 76.4% top-1 accuracy on ImageNet.

“…We found many research works addressing the issue of temporal action localization (Shou et al, 2016;Caba Heilbron et al, 2016;Escorcia et al, 2016;Karaman et al, 2014;Buch et al, 2017;Oneata et al, 2014;Gao et al, 2017;Su and Grauman, 2016;Sun et al, 2015;Wang et al, 2014;Yuan et al, 2015;Tran et al, 2015;Singh et al, 2016;Duchenne et al, 2009). A traditional way of performing temporal action detection is to densely apply action classifiers in a sliding window fashion (Duchenne et al, 2009).…”

Section: Temporal Action Detection and Proposals (mentioning)

confidence: 99%

TAB: Temporally aggregated bag-of-discriminant-words for temporal action proposals

Murtaza, Yousaf, Velastín 2019

Computer Vision and Image Understanding

In this work, we propose a new method to generate temporal action proposals from long untrimmed videos named Temporally Aggregated Bag-of-Discriminant-Words (TAB). TAB is based on the observation that there are many overlapping frames in action and background temporal regions of untrimmed videos, which cause difficulties in segmenting actions from non-action regions. TAB solve this issue by extracting class-specific codewords from the action and background videos and extracting the discriminative weights of these codewords based on their ability to discriminate between these two classes. We integrate these discriminative weights with Bag of Word encoding, which we then call Bag-of-Discriminant-Words (BoDW). We sample the untrimmed videos into non-overlapping snippets and temporally aggregate the BoDW representation of multiple snippets into action proposals using a binary classifier trained on trimmed videos in a single pass. We present the effectiveness of our TAB proposal extraction method on two challenging temporal action detection datasets: MSR-II and Thumos14, where it improves upon state-of-the-art with recall rate of 82.0% and 80.65% respectively at a temporal intersection over union ratio of 0.8.
