2021
DOI: 10.48550/arxiv.2111.15667
Preprint

Adaptive Token Sampling For Efficient Vision Transformers

Abstract: While state-of-the-art vision transformer models achieve promising results for image classification, they are computationally very expensive and require many GFLOPs. Although the GFLOPs of a vision transformer can be decreased by reducing the number of tokens in the network, there is no setting that is optimal for all input images. In this work, we therefore introduce a differentiable parameter-free Adaptive Token Sampling (ATS) module, which can be plugged into any existing vision transformer architecture. ATS…
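The abstract only sketches the idea at a high level. As a rough illustration of what adaptive token reduction inside a vision transformer block can look like, the short PyTorch sketch below scores patch tokens by the attention the class token pays to them and keeps only the highest-scoring ones before the next block. The function name, argument names, and the hard top-k selection are assumptions made for this example; the paper's actual ATS module uses a differentiable, parameter-free sampling procedure rather than a fixed top-k cut.

import torch

def adaptive_token_sampling(tokens, attn, max_keep):
    """Keep only the patch tokens that the CLS token attends to most.

    tokens: (B, N, D) token embeddings; tokens[:, 0] is the CLS token.
    attn:   (B, N, N) attention weights averaged over heads.
    max_keep: maximum number of patch tokens to retain.

    Note: this hard top-k selection is a simplification for illustration;
    it is not the paper's ATS sampling procedure.
    """
    cls_attn = attn[:, 0, 1:]                               # CLS -> patch attention, (B, N-1)
    scores = cls_attn / cls_attn.sum(dim=-1, keepdim=True)  # normalize to a distribution
    k = min(max_keep, scores.shape[-1])
    keep_idx = scores.topk(k, dim=-1).indices + 1           # +1 to skip the CLS position
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    kept = torch.gather(tokens, dim=1, index=keep_idx)      # (B, k, D)
    return torch.cat([tokens[:, :1], kept], dim=1)          # re-attach the CLS token

# Example: reduce the 197 tokens of a ViT-B/16 block to at most 99
# x = adaptive_token_sampling(x, attn_weights, max_keep=98)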

Cited by 6 publications (9 citation statements)
References 20 publications
“…Main Results. We compare our method with several representative methods including DynamicViT [53], IA-RED² [48], RegNetY [51], CrossViT [6], VTP [86], ATS [24], CvT [66], PVT [64], T2T-ViT [79], UP-DeiT [76], PS-ViT [59], Evo-ViT [70], TNT [29], HVT [49], Swin [43], CoaT [69], CPVT [16], MD-DeiT [34], and S²ViTE [11]. As shown in Table 14, the accuracy of the produced model is 0.91% higher than the original PiT-XS; when the target size is PiT-T, the accuracy of the produced model is 0.9% higher than the original PiT-T.…”
Section: Results
confidence: 99%
“…Despite the success in most vision tasks, ViT-based models cannot compete with the well-studied lightweight CNNs [21,49] when the inference speed is the major concern [50,51,52], especially on resource-constrained edge devices [17]. To accelerate ViT, many approaches have been introduced with different methodologies, such as proposing new architectures or modules [53,54,55,56,57,58], re-thinking self-attention and sparse-attention mechanisms [59,60,61,62,63,64,65], and utilizing search algorithms that are widely explored in CNNs to find smaller and faster ViTs [66,28,29,67]. Recently, LeViT [23] proposes a CONV-clothing design to accelerate vision transformer.…”
Section: Related Work
confidence: 99%
“…[38] uses distillation to improve the efficiency of the network. [14,46,39] decrease the number of tokens by pruning unimportant tokens. Although these works limit the computation generally, softmax is still required to calculate the attention.…”
Section: Related Work
confidence: 99%