Adaptive Focus for Efficient Video Recognition

Wang, Yulin; Chen, Zhaoxi; Jiang, Haojun; Song, Shiji; Han, Yizeng; Huang, Gao

doi:10.1109/iccv48922.2021.01594

Cited by 75 publications

(26 citation statements)

References 49 publications


“…The training in AR-Net is simplified using the Gumbel-Softmax trick. Later, this idea was extended to adaptively select a proper modality [20] or patches [42]. Our approach is motivated by these prior works to apply a similar framework to adaptive computation on deep learning-based VIO for the first time.…”

Section: Adaptive Inference (mentioning)

confidence: 99%

Efficient Deep Visual and Inertial Odometry with Adaptive Visual Modality Selection

Yang, Chen, Kim

2022

Preprint


In recent years, deep learning-based approaches for visual-inertial odometry (VIO) have shown remarkable performance, outperforming traditional geometric methods. Yet all existing methods use both the visual and inertial measurements for every pose estimation, incurring potential computational redundancy. While visual data processing is much more expensive than processing data from the inertial measurement unit (IMU), it may not always improve the pose estimation accuracy. In this paper, we propose an adaptive deep learning-based VIO method that reduces computational redundancy by opportunistically disabling the visual modality. Specifically, we train a policy network that learns to deactivate the visual feature extractor on the fly based on the current motion state and IMU readings. A Gumbel-Softmax trick is adopted to train the policy network, making the decision process differentiable for end-to-end system training. The learned strategy is interpretable, and it shows scenario-dependent decision patterns for adaptive complexity reduction. Experimental results show that our method achieves similar or even better performance than the full-modality baseline, with up to 78.8% computational complexity reduction on the KITTI dataset. Our code will be shared at https://github.com/mingyuyng/Visual-Selective-VIO

Keywords: visual-inertial odometry, long short-term memory, Gumbel-Softmax, adaptive learning
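The Gumbel-Softmax trick mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two-way choice (skip or use the visual branch), the logit values, and the temperature are all assumptions made for illustration. The sketch shows only the forward sampling step in plain Python; in an automatic-differentiation framework, the relaxed sample is what would let gradients reach the policy logits.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via inverse transform sampling."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau, rng):
    """Return a relaxed one-hot sample over the choices in `logits`.

    `tau` is the temperature: smaller values push the output
    closer to a one-hot vector.
    """
    noisy = [(logit + sample_gumbel(rng)) / tau for logit in logits]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_decision(soft):
    """Discretise the relaxed sample: index of the largest entry."""
    return max(range(len(soft)), key=lambda i: soft[i])


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical policy scores: index 0 = skip visual, 1 = use visual.
    logits = [0.2, 1.5]
    soft = gumbel_softmax(logits, tau=0.5, rng=rng)
    print("relaxed sample:", soft)
    print("decision:", hard_decision(soft))
```

Because adding Gumbel noise and taking the argmax samples from the softmax distribution over the logits, the hard decision picks each option with its softmax probability regardless of the temperature; the temperature only controls how sharp the relaxed vector is.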


“…Reducing spatio-temporal redundancy for efficient video analysis has recently been a popular research topic. The mainstream approaches mostly train an additional lightweight network to achieve: (i) adaptive frame selection [12]-[14], [16], [44], i.e., dynamically determining the relevant frames for the recognition networks; (ii) adaptive frame resolution [12], i.e., learning an optimal resolution for each frame online; (iii) early stopping [45], i.e., terminating the inference process before observing all frames; (iv) adaptive spatio-temporal regions [10], [11], i.e., localizing the most task-relevant spatiotemporal regions; (v) adaptive network architectures [15], [16], [46], i.e., adjusting the network architecture to save computation on less informative features. Another line is to manually define low redundant sampling rules, such as MGSampler [47], which selects frames containing rich motion information by the cumulative motion distribution.…”

Section: B Spatio-temporal Redundancy (mentioning)

confidence: 99%

“…Although this yields decent performances, the computation over full videos is highly redundant due to the excessive and widely present spatio-temporal redundancy of visual information [9]-[13] in videos. In light of this, a branch of previous works has proposed to reduce the spatiotemporal redundancy by training an additional model to focus on relevant frames [12]-[17] or spatio-temporal regions [10], [11], which can significantly reduce the computation cost. However, they mostly require complicated operations, such as reinforcement learning and multi-stage training.…”

Section: Introduction (mentioning)

confidence: 99%

MAR: Masked Autoencoders for Efficient Action Recognition

Qing, Zhang, Huang et al.

2022

Preprint


Standard approaches for video action recognition usually operate on the full input videos, which is inefficient due to the widely present spatio-temporal redundancy in videos. Recent progress in masked video modelling, i.e., VideoMAE, has shown the ability of vanilla Vision Transformers (ViT) to complement spatio-temporal contexts given only limited visual contents. Inspired by this, we propose Masked Action Recognition (MAR), which reduces the redundant computation by discarding a proportion of patches and operating only on a part of the videos. MAR contains the following two indispensable components: cell running masking and a bridging classifier. Specifically, to enable the ViT to perceive the details beyond the visible patches easily, cell running masking is presented to preserve the spatio-temporal correlations in videos, i.e., it ensures the patches at the same spatial location can be observed in turn for easy reconstruction. Additionally, we notice that, although the partially observed features can reconstruct semantically explicit invisible patches, they fail to achieve accurate classification. To address this, a bridging classifier is proposed to bridge the semantic gap between the ViT encoded features for reconstruction and the features specialized for classification. Our proposed MAR reduces the computational cost of ViT by 53%, and extensive experiments show that MAR consistently outperforms existing ViT models by a notable margin. In particular, we found that a ViT-Large trained by MAR outperforms the ViT-Huge trained by a standard training scheme by convincing margins on both the Kinetics-400 and Something-Something v2 datasets, while the computation overhead of our ViT-Large is only 14.5% of that of ViT-Huge. Codes and models will be made available here.
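The property this abstract attributes to cell running masking, that patches at the same spatial location are observed in turn, can be illustrated with a toy mask generator. This is a simplified sketch and not the MAR authors' masking scheme: the 2x2 cell size, the per-frame shift of one slot, the default of two visible patches per cell, and the function names are all assumptions chosen for illustration.

```python
def running_cell_mask(num_frames, grid_h, grid_w, keep_per_cell=2):
    """Build mask[t][y][x], where True marks a visible patch.

    Assumption: the patch grid is tiled with 2x2 cells, and the set of
    visible slots inside every cell shifts by one slot at each frame.
    """
    mask = []
    for t in range(num_frames):
        frame = []
        for y in range(grid_h):
            row = []
            for x in range(grid_w):
                # Slot of this patch inside its 2x2 cell, in 0..3.
                slot = (y % 2) * 2 + (x % 2)
                # The visible window of slots rotates with the frame index.
                row.append((slot - t) % 4 < keep_per_cell)
            frame.append(row)
        mask.append(frame)
    return mask


def visible_ratio(mask):
    """Fraction of patches that are visible across all frames."""
    flat = [v for frame in mask for row in frame for v in row]
    return sum(flat) / len(flat)


if __name__ == "__main__":
    mask = running_cell_mask(num_frames=4, grid_h=4, grid_w=4)
    print("visible ratio:", visible_ratio(mask))
    # How often the top-left patch is visible over the four frames.
    print("top-left visits:", sum(mask[t][0][0] for t in range(4)))
```

In this toy version, every spatial location is visible in the same number of frames over any window of four frames, so no location stays hidden for the whole clip while only half of the patches are processed per frame.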


“…For example, a dynamic network spends less computation on easy samples or less informative spatial areas/temporal locations of an input. For image [32,60] or video-related [30,59] tasks, sample-wise, spatial-wise, or temporal-wise adaptive inference could be conducted by formulating the recognition or detection task as a sequential decision problem and allowing early exiting during inference.…”

Section: Dynamic Neural Network (mentioning)

confidence: 99%

E^2TAD: An Energy-Efficient Tracking-based Action Detector

Hu, Wu, Miao et al.

2022

Preprint


Video action detection (spatio-temporal action localization) is usually the starting point for human-centric intelligent analysis of videos nowadays. It has high practical impact for many applications across robotics, security, healthcare, etc. The two-stage paradigm of Faster R-CNN in object detection inspires a standard paradigm of video action detection, i.e., first generating person proposals and then classifying their actions. However, none of the existing solutions could provide fine-grained action detection to the "who-when-where-what" level. This paper presents a tracking-based solution to accurately and efficiently localize predefined key actions spatially (by predicting the associated target IDs and locations) and temporally (by predicting the time in exact frame indices). This solution won first place in the UAV-Video Track of the 2021 Low-Power Computer Vision Challenge (LPCVC).
