2021
DOI: 10.48550/arxiv.2112.14238
Preprint

AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

Abstract: Recent works have shown that the computational efficiency of video recognition can be significantly improved by reducing the spatial redundancy. As a representative work, the adaptive focus method (AdaFocus) has achieved a favorable trade-off between accuracy and inference speed by dynamically identifying and attending to the informative regions in each video frame. However, AdaFocus requires a complicated three-stage training pipeline (involving reinforcement learning), leading to slow convergence and is unfr…
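Below is a hypothetical, minimal PyTorch sketch of the kind of spatial dynamic pipeline the abstract describes: a lightweight global encoder glances at a down-sampled frame, a policy head predicts an informative region, a heavier local encoder processes only the cropped high-resolution patch, and a recurrent module aggregates per-frame features. All module names, layer sizes, and the interpolation-based differentiable crop are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical, minimal PyTorch sketch of a spatial dynamic recognition
# pipeline in the spirit of the abstract above. Layer sizes, module names,
# and the differentiable crop are illustrative assumptions, not the
# authors' released AdaFocus V2 implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialDynamicSketch(nn.Module):
    def __init__(self, patch_size=96, feat_dim=128, num_classes=174):
        super().__init__()
        self.patch_size = patch_size
        self.feat_dim = feat_dim
        # Lightweight global encoder: looks only at a low-resolution view of the frame.
        self.global_enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        # Policy head: predicts the (x, y) center of the informative patch in [0, 1].
        self.policy = nn.Sequential(nn.Linear(feat_dim, 2), nn.Sigmoid())
        # Heavier local encoder: applied only to the cropped high-resolution patch.
        self.local_enc = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, feat_dim),
        )
        # Recurrent aggregation over frames, followed by a classifier.
        self.gru = nn.GRUCell(2 * feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def crop(self, frames, centers):
        """Differentiable crop around each predicted center via an affine grid."""
        b, _, h, _ = frames.shape
        scale = self.patch_size / h
        theta = torch.zeros(b, 2, 3, device=frames.device)
        theta[:, 0, 0] = scale
        theta[:, 1, 1] = scale
        theta[:, :, 2] = centers * 2 - 1  # map [0, 1] centers to [-1, 1] offsets
        grid = F.affine_grid(theta, (b, 3, self.patch_size, self.patch_size),
                             align_corners=False)
        return F.grid_sample(frames, grid, align_corners=False)

    def forward(self, video):
        # video: (batch, time, 3, H, W)
        b, t = video.shape[:2]
        hidden = torch.zeros(b, self.feat_dim, device=video.device)
        for i in range(t):
            frame = video[:, i]
            coarse = F.interpolate(frame, size=96)   # cheap global glance
            g = self.global_enc(coarse)
            center = self.policy(g)                  # where to look next
            patch = self.crop(frame, center)         # informative region, full resolution
            l = self.local_enc(patch)
            hidden = self.gru(torch.cat([g, l], dim=1), hidden)
        return self.classifier(hidden)


if __name__ == "__main__":
    model = SpatialDynamicSketch()
    clip = torch.randn(2, 8, 3, 224, 224)  # 2 clips of 8 frames
    print(model(clip).shape)               # torch.Size([2, 174])
```

Because the crop is produced with affine_grid/grid_sample, gradients flow through the predicted patch location, which is the property that permits single-stage end-to-end training instead of a separate reinforcement-learning stage.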

Cited by 2 publications (3 citation statements)
References 53 publications

“…Reducing spatio-temporal redundancy for efficient video analysis has recently been a popular research topic. The mainstream approaches mostly train an additional lightweight network to achieve: (i) adaptive frame selection [12]-[14], [16], [44], i.e., dynamically determining the relevant frames for the recognition networks; (ii) adaptive frame resolution [12], i.e., learning an optimal resolution for each frame online; (iii) early stopping [45], i.e., terminating the inference process before observing all frames; (iv) adaptive spatio-temporal regions [10], [11], i.e., localizing the most task-relevant spatio-temporal regions; (v) adaptive network architectures [15], [16], [46], i.e., adjusting the network architecture to save computation on less informative features. Another line is to manually define low-redundancy sampling rules, such as MGSampler [47], which selects frames containing rich motion information according to the cumulative motion distribution.…”
Section: B. Spatio-temporal Redundancy (mentioning, confidence: 99%)
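As a concrete illustration of the "low-redundancy sampling rules" mentioned in the statement above, the following is a hedged NumPy sketch in the spirit of MGSampler [47]: per-frame motion is approximated by consecutive-frame differences, accumulated into a cumulative distribution, and frames are drawn at uniform quantiles of that distribution so that motion-rich segments are sampled more densely. The function name and the motion proxy are assumptions for illustration, not the MGSampler reference code.

```python
# Hedged NumPy sketch of cumulative-motion-guided frame sampling in the
# spirit of MGSampler [47]. The motion proxy (frame differences) and the
# function itself are illustrative assumptions, not the reference code.
import numpy as np


def motion_guided_sample(video, num_samples=8):
    """video: (T, H, W, C) array; returns indices of sampled frames."""
    frames = video.astype(np.float32)
    # Motion proxy: mean absolute difference between consecutive frames.
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2, 3))
    diffs = np.concatenate([[diffs.mean()], diffs])  # give the first frame a score too
    # Cumulative motion distribution, normalized to [0, 1].
    cdf = np.cumsum(diffs)
    cdf = cdf / max(float(cdf[-1]), 1e-8)
    # Pick frames at uniform quantiles of accumulated motion, so motion-rich
    # segments are sampled densely and near-static segments sparsely.
    quantiles = (np.arange(num_samples) + 0.5) / num_samples
    indices = np.searchsorted(cdf, quantiles)
    return np.clip(indices, 0, len(frames) - 1)


if __name__ == "__main__":
    clip = np.random.randint(0, 256, size=(64, 32, 32, 3), dtype=np.uint8)
    print(motion_guided_sample(clip, num_samples=8))
```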
“…Although this yields decent performance, the computation over full videos is highly redundant due to the excessive and widely present spatio-temporal redundancy of visual information [9]-[13] in videos. In light of this, a branch of previous works has proposed to reduce the spatio-temporal redundancy by training an additional model to focus on relevant frames [12]-[17] or spatio-temporal regions [10], [11], which can significantly reduce the computation cost. However, they mostly require complicated operations, such as reinforcement learning and multi-stage training.…”
Section: Introduction (mentioning, confidence: 99%)
“…Wu et al. [47] utilize multi-agent reinforcement learning to model parallel frame sampling, and Lin et al. [24] make a one-step decision with a holistic view. Meng et al. [27] and Wang et al. [42, 44] focus their attention on spatial redundancy. Panda et al. adaptively decide modalities for video segments.…”
Section: Related Work (mentioning, confidence: 99%)