2022
DOI: 10.1007/978-3-031-20083-0_24

Adaptive Token Sampling for Efficient Vision Transformers

Cited by 65 publications (25 citation statements)
References 42 publications
“…Dehghani et al. [13] highlight the significance of using throughput as a metric of model efficiency: a reduction in FLOPs does not necessarily translate into lower latency, since FLOP counts do not account for the degree of parallelism or other hardware details. In line with this argument, we observe that while SoTA methods such as ATS [17] and SPViT [26] achieve large reductions in FLOPs, they actually have lower throughput than SKIPAT. Furthermore, HVT [40], while achieving higher gains in both throughput and FLOPs, has poor top-1 accuracy (a 2.6% drop on ViT-T and a 1.8% drop on ViT-S).…”
Section: Image Classification (supporting)
confidence: 75%
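The excerpt above argues for wall-clock throughput over FLOP counts as the efficiency metric. A minimal sketch of how such a throughput measurement is typically taken (warm-up iterations, then timed forward passes with device synchronization) is shown below; the function name and defaults are illustrative, not from any of the cited papers:

```python
import time
import torch

def measure_throughput(model, batch_size=64, image_size=224, n_warmup=10, n_iters=50):
    """Measure inference throughput (images/sec) on the model's device.

    FLOP counts alone ignore parallelism and other hardware details,
    so wall-clock throughput can rank models differently.
    """
    device = next(model.parameters()).device
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    model.eval()
    with torch.no_grad():
        for _ in range(n_warmup):          # warm up kernels / caches
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()       # finish queued GPU work before timing
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return batch_size * n_iters / elapsed  # images per second
```

The synchronization calls matter on GPU: CUDA launches are asynchronous, so timing without them under-reports the real latency.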
“…Token sampling improves efficiency either by restructuring images during the tokenization step [21,66], by pruning redundant tokens during training [26,46], or by pruning them dynamically at inference [7,17,43,63]. Despite their effectiveness in reducing the computational cost of image classification, token sampling methods are hardly applicable to dense prediction tasks, e.g.…”
Section: Related Work (mentioning)
confidence: 99%
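The excerpt above describes pruning redundant tokens dynamically at inference. A simplified sketch of one common variant — keeping the top-k patch tokens scored by their attention from the [CLS] token — is given below. This is an illustrative top-k scheme, not the exact algorithm of ATS or any cited method (ATS, for instance, samples tokens rather than taking a hard top-k):

```python
import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
    """Keep the top-k patch tokens ranked by [CLS] attention.

    tokens:   (B, N+1, D) with tokens[:, 0] the [CLS] token
    cls_attn: (B, N)      attention weights from [CLS] to the N patch tokens
    """
    B, n_plus_1, D = tokens.shape
    n_keep = max(1, int((n_plus_1 - 1) * keep_ratio))
    idx = cls_attn.topk(n_keep, dim=1).indices            # (B, n_keep) best patches
    idx = idx.unsqueeze(-1).expand(-1, -1, D)             # broadcast over feature dim
    kept = torch.gather(tokens[:, 1:], dim=1, index=idx)  # gather surviving tokens
    return torch.cat([tokens[:, :1], kept], dim=1)        # re-attach [CLS] in front
```

Because later transformer blocks then operate on fewer tokens, the quadratic attention cost shrinks accordingly; this is also why such methods suit classification (one global label) better than dense prediction, where discarded tokens correspond to discarded spatial positions.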
“…Yu et al. [55] propose to reformulate cross-attention learning as a clustering process. Some approaches [14,35,43] study the efficiency of ViTs and propose dynamic token sparsification frameworks to prune redundant tokens progressively. Wang et al. [49] propose to automatically configure a proper number of tokens for each input image.…”
Section: Related Work (mentioning)
confidence: 99%