2022
DOI: 10.1007/978-3-031-20083-0_24

Adaptive Token Sampling for Efficient Vision Transformers

Cited by 65 publications (25 citation statements)
References 42 publications
“…Dehghani et al. [13] highlight the significance of using throughput as a metric of model efficiency: a reduction in FLOPs does not necessarily translate into lower latency, since FLOP counts do not account for the degree of parallelism or other hardware details. In line with this argument, we observe that while SoTA methods such as ATS [17] and SPViT [26] achieve large reductions in FLOPs, they actually have lower throughput than SKIPAT. Furthermore, HVT [40], while achieving higher gains in both throughput and FLOPs, has poor top-1 accuracy (a 2.6% drop on ViT-T and a 1.8% drop on ViT-S).…”
Section: Image Classification (supporting)
confidence: 75%
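The excerpt above argues for wall-clock throughput over FLOP counts as the efficiency metric. A minimal sketch of how such a throughput measurement is typically taken (warm-up iterations, then timed forward passes with device synchronization) is shown below; the function name and defaults are illustrative, not from any of the cited papers:

```python
import time
import torch

def measure_throughput(model, batch_size=64, image_size=224, n_warmup=10, n_iters=50):
    """Measure inference throughput (images/sec) on the model's device.

    FLOP counts alone ignore parallelism and other hardware details,
    so wall-clock throughput can rank models differently.
    """
    device = next(model.parameters()).device
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    model.eval()
    with torch.no_grad():
        for _ in range(n_warmup):          # warm up kernels / caches
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()       # finish queued GPU work before timing
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return batch_size * n_iters / elapsed  # images per second
```

The synchronization calls matter on GPU: CUDA launches are asynchronous, so timing without them under-reports the real latency.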
“…Token sampling improves efficiency either by restructuring images during the tokenization step [21,66], by pruning redundant tokens during training [26,46], or by pruning them dynamically at inference [7,17,43,63]. Despite their effectiveness in reducing the computational cost of image classification, token sampling methods are hardly applicable to dense prediction tasks, e.g.…”
Section: Related Work (mentioning)
confidence: 99%
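The excerpt above describes pruning redundant tokens dynamically at inference. A simplified sketch of one common variant — keeping the top-k patch tokens scored by their attention from the [CLS] token — is given below. This is an illustrative top-k scheme, not the exact algorithm of ATS or any cited method (ATS, for instance, samples tokens rather than taking a hard top-k):

```python
import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
    """Keep the top-k patch tokens ranked by [CLS] attention.

    tokens:   (B, N+1, D) with tokens[:, 0] the [CLS] token
    cls_attn: (B, N)      attention weights from [CLS] to the N patch tokens
    """
    B, n_plus_1, D = tokens.shape
    n_keep = max(1, int((n_plus_1 - 1) * keep_ratio))
    idx = cls_attn.topk(n_keep, dim=1).indices            # (B, n_keep) best patches
    idx = idx.unsqueeze(-1).expand(-1, -1, D)             # broadcast over feature dim
    kept = torch.gather(tokens[:, 1:], dim=1, index=idx)  # gather surviving tokens
    return torch.cat([tokens[:, :1], kept], dim=1)        # re-attach [CLS] in front
```

Because later transformer blocks then operate on fewer tokens, the quadratic attention cost shrinks accordingly; this is also why such methods suit classification (one global label) better than dense prediction, where discarded tokens correspond to discarded spatial positions.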
“…Yu et al. [55] propose to reformulate cross-attention learning as a clustering process. Some approaches [14,35,43] study the efficiency of ViTs and propose dynamic token sparsification frameworks to prune redundant tokens progressively. Wang et al. [49] propose to automatically configure a proper number of tokens for each input image.…”
Section: Related Work (mentioning)
confidence: 99%