2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.00244
Resolution Adaptive Networks for Efficient Inference

Cited by 161 publications (124 citation statements)
References 20 publications
“…The AdaFocusV2 network studied in this paper can be classified into this category as well. Many of the spatially adaptive networks are designed from the lens of inference efficiency [5,24,52,62,71]. For example, recent works have revealed that 2D images can be efficiently processed via attending to the task-relevant or more informative image regions [17,61,66,70].…”
Section: Related Work (mentioning)
confidence: 99%
“…Specifically, we attach two linear classifiers, FC_G(·) and FC_L(·), to the outputs of f_G and f_L, and replace the loss function L in (9) by L′. We assume that processing a subset of frames (from the beginning) rather than all of them may be sufficient for these "easier" samples. To implement this idea, at test time, we propose to compare the largest entry of the softmax prediction p_t (defined as the confidence in previous works [28,65,66,71]) at the t-th frame with a pre-defined threshold η_t. Once max_j p_{t,j} ≥ η_t, the prediction is postulated to be reliable enough, and the inference is terminated by outputting p_t.…”
Section: Training Techniques (mentioning)
confidence: 99%
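The confidence-thresholded early exit described in the excerpt above is simple to state in code. The following is a minimal sketch, not the cited paper's implementation: frame_encoder, classifier, and the per-frame threshold list thresholds (the η_t values) are assumed names for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def early_exit_inference(frames, frame_encoder, classifier, thresholds):
    """frames: (T, C, H, W) video clip; thresholds: one eta_t per frame."""
    p_t = None
    for frame, eta_t in zip(frames, thresholds):
        feat = frame_encoder(frame.unsqueeze(0))   # process the t-th frame
        p_t = F.softmax(classifier(feat), dim=-1)  # softmax prediction p_t
        if p_t.max().item() >= eta_t:              # max_j p_tj >= eta_t
            break                                  # reliable enough: terminate
    return p_t                                     # output the last p_t
```

Easier samples clear the threshold after only a few frames, so the per-sample computation adapts to input difficulty.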
“…Special architectures. One way is to change the architecture of the model to support adaptive computations [4,14,15,18,25,27,30,37,42,51,54]. For example, models that represent a neural network as a fixed-point function can have the property of adaptive computation by default.…”
Section: Related Work (mentioning)
confidence: 99%
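As a concrete illustration of the fixed-point claim above: a layer defined implicitly by z* = f(z*, x) and solved by naive iteration stops after an input-dependent number of steps, so the computation adapts "by default". This is a hedged sketch under that assumption, not any specific cited model; the shape of z matching x is also an assumption.

```python
import torch

@torch.no_grad()
def fixed_point_forward(f, x, max_iter=50, tol=1e-4):
    """Iterate z <- f(z, x) until convergence; iteration count adapts per input."""
    z = torch.zeros_like(x)                        # initial guess z_0 = 0
    for _ in range(max_iter):
        z_next = f(z, x)                           # one application of the layer
        if (z_next - z).norm() <= tol * z.norm().clamp(min=1.0):
            return z_next                          # converged: adaptive early stop
        z = z_next
    return z                                       # fallback after max_iter steps
```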
“…Using ODEs requires a specific solver, is often slower than fixed-depth models, and requires adding extra constraints on the model design. [54] learns a set of classifiers with different resolutions executed in order; computation stops when the confidence of the model is above the threshold. [27] proposed a residual variant with shared weights and a halting mechanism.…”
Section: Related Work (mentioning)
confidence: 99%
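The in-order resolution cascade attributed to [54] (the RANet paper this page indexes) can be sketched as follows. The names subnets, heads, and resolutions are illustrative placeholders, and the real model shares features across scales rather than recomputing each stage from the raw input.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cascade_inference(image, subnets, heads, resolutions, threshold=0.9):
    """image: (1, C, H, W); subnets[i] with heads[i] runs at resolutions[i]."""
    probs = None
    for subnet, head, res in zip(subnets, heads, resolutions):
        x = F.interpolate(image, size=(res, res), mode="bilinear",
                          align_corners=False)     # rescale input for this stage
        probs = F.softmax(head(subnet(x)), dim=-1) # classifier at this resolution
        if probs.max().item() >= threshold:        # confident: stop the cascade
            break
    return probs
```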
“…Such works have followed either multi-scale or HyperNet strategies. While the former redesigns the network topology to encode features from shallow and deep layers [Yang et al. 2020], the latter preserves the network topology, encouraging application to off-the-shelf networks [Sindagi and Patel 2019]. Despite the positive results, both strategies increase the computational burden significantly, since they insert time-consuming operations at multiple levels of the network.…”
Section: Introduction (mentioning)
confidence: 99%