AdaFocusV3: On Unified Spatial-Temporal Dynamic Video Recognition

Wang, Yulin; Yue, Yang; Xu, Xiangde; Hassani, Ali; Kulikov, V. A.; Orlov, Nikita; Song, Shiji; Huang, Gao

doi:10.1007/978-3-031-19772-7_14

Cited by 10 publications

(5 citation statements)

References 54 publications


“…TSN [14], TRN [15], ECO [47], TSMNet [16, 17] and MSNet [48] are static to input and therefore require no policy model other than the classification model. The other CNN-based methods we picked (Adafuse [35], Dynamic-STE [40], FrameExit [39], LiteEval [29], SCSampler [32], ARNet [36], VideoIQ [34] and Adafocus v1-3 [36, 37, 38]) include dynamic optimization of inferencing. They optimize the input-size or inference pipeline dynamically from each viewpoint and require a policy model (and additional loss functions to train them) to adjust inference other than the classification model.…”

Section: Methods (mentioning)

confidence: 99%

“…These approaches usually focus on dynamic optimizations of input-size using additional policy models or modules. For example, they select salient frames/clips as input [29, 30, 31, 32]; adjust the resolutions [33], precision of network quantization [34] and channel fusion [35]; predict crop regions to input [36, 37, 38]; or determine whether to exit early in the time direction [39] or to use the heavy teacher knowledge [40]. Our method, on the other hand, selects spatiotemporal patches to input.…”

Section: Related Work (mentioning)

confidence: 99%

“…Then, patches with a large probability are considered to have a small amount of information and are excluded from the input. IPS is simple and effective in that it can dynamically reduce inference cost depending on the input without requiring any policy model and complicated training, unlike the previous works [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]. Modeling MPEG4-encoded video has long been conducted for network traffic prediction [41, 42, 43, 44], and recently, neural-network-based video recognition models directly taking MPEG4-encoded video as input were proposed [5, 45].…”

Section: Introduction (mentioning)

confidence: 99%


Efficient Transformer-Based Compressed Video Modeling via Informative Patch Selection

Suzuki

Aoki

2022

Sensors


Recently, Transformer-based video recognition models have achieved state-of-the-art results on major video recognition benchmarks. However, their high inference cost significantly limits research speed and practical use. In video compression, methods considering small motions and residuals that are less informative and assigning short code lengths to them (e.g., MPEG4) have successfully reduced the redundancy of videos. Inspired by this idea, we propose Informative Patch Selection (IPS), which efficiently reduces the inference cost by excluding redundant patches from the input of the Transformer-based video model. The redundancy of each patch is calculated from motions and residuals obtained while decoding a compressed video. The proposed method is simple and effective in that it can dynamically reduce the inference cost depending on the input without any policy model or additional loss term. Extensive experiments on action recognition demonstrated that our method could significantly improve the trade-off between the accuracy and inference cost of the Transformer-based video model. Although the method does not require any policy model or additional loss term, its performance approaches that of existing methods that do require them.
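The abstract above describes excluding redundant patches based on motions and residuals, without a policy model. A minimal sketch of that selection step follows. The scoring rule (sum of per-patch motion and residual magnitudes), the threshold, and the function name `select_informative_patches` are assumptions made for illustration; the abstract does not state the formula the paper actually uses.

```python
def select_informative_patches(motion_mag, residual_mag, threshold):
    """Return indices of patches kept as model input.

    motion_mag, residual_mag: per-patch magnitudes (lists of equal length),
    assumed to be available from decoding a compressed video.
    threshold: hypothetical cut-off on the informativeness score.
    """
    if len(motion_mag) != len(residual_mag):
        raise ValueError("motion and residual lists must have the same length")

    kept = []
    for idx, (motion, residual) in enumerate(zip(motion_mag, residual_mag)):
        # Hypothetical score: small motion and small residual mean the patch
        # is treated as redundant. This is NOT the paper's exact formula.
        score = motion + residual
        if score > threshold:
            kept.append(idx)
    return kept


# Four patches: only the second and fourth carry notable motion or residual.
motion = [0.0, 2.0, 0.1, 0.0]
residual = [0.05, 0.5, 0.0, 1.2]
kept_indices = select_informative_patches(motion, residual, threshold=0.5)
```

Because the selection depends only on quantities computed per input, the number of patches passed to the model varies with the video, which is the sense in which the inference cost is input-dependent.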


“…As a special type of efficient neural networks, dynamic neural networks can adaptively change their inference complexity under different computational budgets, latency requirements, and prediction confidence requirements. Existing dynamic neural networks are designed in different aspects including sample-wise (Teerapittayanon, McDanel, and Kung 2016; Wang et al 2018; Veit and Belongie 2018; Wu et al 2018; Yu et al 2019; Guo et al 2019), spatial-wise (Li et al 2017; Wang et al 2019a, 2020; Yang et al 2020; Wang et al 2022a), and temporal-wise dynamism (Shen et al 2017; Yu, Lee, and Le 2017; Wu et al 2019; Meng et al 2020; Wang et al 2022b), as categorized by (Han et al 2021). Specially, as one type of sample-wise methods, depth-wise dynamic models with early exits adaptively exit at different layer depths given different inputs (Huang et al 2017; Li et al 2019; McGill and Perona 2017; Jie et al 2019; Yang et al 2020).…”

Section: Dynamic Neural Network (mentioning)

confidence: 99%

Boosted Dynamic Neural Networks

Li

Hua

et al. 2023

AAAI


Early-exiting dynamic neural networks (EDNN), as one type of dynamic neural networks, have been widely studied recently. A typical EDNN has multiple prediction heads at different layers of the network backbone. During inference, the model will exit at either the last prediction head or an intermediate prediction head where the prediction confidence is higher than a predefined threshold. To optimize the model, these prediction heads together with the network backbone are trained on every batch of training data. This brings a train-test mismatch problem: all the prediction heads are optimized on all types of data in the training phase, while the deeper heads will only see difficult inputs in the testing phase. Treating training and testing inputs differently at the two phases causes a mismatch between the training and testing data distributions. To mitigate this problem, we formulate an EDNN as an additive model inspired by gradient boosting, and propose multiple training techniques to optimize the model effectively. We name our method BoostNet. Our experiments show it achieves state-of-the-art performance on the CIFAR100 and ImageNet datasets in both anytime and budgeted-batch prediction modes. Our code is released at https://github.com/SHI-Labs/Boosted-Dynamic-Networks.
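The early-exit inference rule described in this abstract (exit at the first head whose confidence exceeds a threshold, otherwise at the last head) can be sketched as follows. This is a minimal illustration, not the BoostNet implementation: the function names are hypothetical, confidence is assumed to be the maximum softmax probability, and per-head logits are passed in as an iterable rather than computed by a real backbone.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(head_logits, threshold):
    """Return (predicted_class, exit_index).

    head_logits: iterable of logit lists, ordered from the shallowest
    prediction head to the deepest. Passing a generator means deeper
    heads are never evaluated once an earlier head has exited.
    threshold: assumed confidence threshold on the max softmax probability.
    """
    result = None
    for index, logits in enumerate(head_logits):
        probs = softmax(logits)
        confidence = max(probs)
        result = (probs.index(confidence), index)
        if confidence > threshold:
            break  # confident enough: exit at this head
    # If no head was confident, this is the last head's prediction.
    return result


easy_input = [[4.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
hard_input = [[0.2, 0.1, 0.0], [0.0, 3.0, 0.0]]
easy_result = early_exit_predict(easy_input, threshold=0.9)
hard_result = early_exit_predict(hard_input, threshold=0.9)
```

In this sketch the "easy" input exits at the first head, while the "hard" input falls through to the deeper one. That asymmetry is the source of the train-test mismatch the abstract describes: at test time the deeper heads only ever see inputs that earlier heads could not classify confidently.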


“…Fortunately, images often have more spatial redundancy than languages (Wang, Stuijk, and De Haan 2014), such as regions with task-unrelated objects. Thus, many works (Wang et al 2020; Yang et al 2020; Wang et al 2022a, 2021b, 2022b; Han et al 2022b, 2021c) try to adaptively reduce the input resolution of convolutional neural networks. Also, great efforts have been made to excavate redundant tokens for ViTs.…”

Section: Introduction (mentioning)

confidence: 99%

CF-ViT: A General Coarse-to-Fine Method for Vision Transformer

Chen

Lin

et al. 2023

AAAI


Vision Transformers (ViT) have made many breakthroughs in computer vision tasks. However, considerable redundancy arises in the spatial dimension of an input image, leading to massive computational costs. Therefore, we propose a coarse-to-fine vision transformer (CF-ViT) to relieve the computational burden while retaining performance. Our proposed CF-ViT is motivated by two important observations in modern ViT models: (1) Coarse-grained patch splitting can locate the informative regions of an input image. (2) Most images can be well recognized by a ViT model with a short token sequence. Therefore, our CF-ViT implements network inference in a two-stage manner. At the coarse inference stage, an input image is split into a short patch sequence for a computationally economical classification. If it is not well recognized, the informative patches are identified and further re-split at a fine-grained granularity. Extensive experiments demonstrate the efficacy of our CF-ViT. For example, without any compromise on performance, CF-ViT reduces the FLOPs of LV-ViT by 53% and also achieves 2.01x throughput. Code of this project is at https://github.com/ChenMnZ/CF-V
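The two-stage control flow in this abstract can be sketched as below. This is an assumption-laden illustration rather than the CF-ViT code: `coarse_model` and `fine_model` are hypothetical callables, "well recognized" is assumed to mean that the maximum class probability reaches a threshold, and informative patches are assumed to be the `top_k` patches by a per-patch score returned from the coarse stage.

```python
def coarse_to_fine_predict(image, coarse_model, fine_model, threshold, top_k):
    """Return (predicted_class, stage) where stage is "coarse" or "fine".

    coarse_model(image) -> (class_probs, patch_scores)   # hypothetical API
    fine_model(image, patch_indices) -> class_probs      # hypothetical API
    """
    probs, patch_scores = coarse_model(image)
    confidence = max(probs)
    if confidence >= threshold:
        # Stage 1 is confident: stop here and skip the expensive stage.
        return probs.index(confidence), "coarse"

    # Stage 2: pick the highest-scoring coarse patches for fine re-splitting.
    ranked = sorted(range(len(patch_scores)),
                    key=lambda i: patch_scores[i], reverse=True)
    informative = sorted(ranked[:top_k])
    fine_probs = fine_model(image, informative)
    return fine_probs.index(max(fine_probs)), "fine"


# Stub models standing in for real networks (illustrative values only).
fine_calls = []

def stub_coarse(image):
    return [0.6, 0.4], [0.1, 0.9, 0.3, 0.8]

def stub_fine(image, patch_indices):
    fine_calls.append(patch_indices)
    return [0.2, 0.8]

confident = coarse_to_fine_predict(None, stub_coarse, stub_fine, 0.5, 2)
uncertain = coarse_to_fine_predict(None, stub_coarse, stub_fine, 0.7, 2)
```

With the stubs above, the first call stops at the coarse stage and never invokes the fine model, while the second call re-processes only the two highest-scoring patches.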

