2021
DOI: 10.48550/arxiv.2105.15075
Preprint
Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

Abstract: Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens leads to higher prediction accuracy, but it also results in drastically increased computational cost. To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16. In this paper, we argue that every image has its own characteristics…
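To make the tokenization trade-off concrete, here is a minimal sketch in plain PyTorch (the name PatchEmbed and the dimensions are illustrative, not taken from the paper): the same image yields far fewer tokens under a coarser patch size, and since self-attention cost grows quadratically with token count, the coarse split is much cheaper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each patch to a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=384):
        super().__init__()
        self.num_tokens = (img_size // patch_size) ** 2
        # A strided convolution is the standard way to embed non-overlapping patches.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)      # (B, num_tokens, D)

img = torch.randn(1, 3, 224, 224)
print(PatchEmbed(patch_size=32)(img).shape)   # coarse: torch.Size([1, 49, 384])
print(PatchEmbed(patch_size=16)(img).shape)   # fine:   torch.Size([1, 196, 384])
```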

Cited by 15 publications (26 citation statements) | References 27 publications
“…• For segmentation, the encoder-decoder Transformer models may unify three segmentation sub-tasks into a mask prediction problem by a series of learnable mask embeddings [29], [84], [137]. This box-free approach has achieved the latest SOTA on multiple benchmarks [137].…”
Section: A Summary Of Recent Improvements (mentioning)
confidence: 99%
“…The latter, unstructured token sparsification, is the work most closely related to ours. Wang et al [42] proposed DVT to dynamically determine the number of patches into which an image is divided. Specifically, they leverage a cascade of ViT models, where each ViT is responsible for one token length.…”
Section: Model Compression (mentioning)
confidence: 99%
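A rough sketch of the cascade structure described in this citation, under the assumption that off-the-shelf timm ViTs with different patch sizes stand in for the per-token-length models (the model names below are illustrative, not the authors' released checkpoints):

```python
import timm

# One ViT per tokenization granularity, ordered from cheapest to most expensive.
cascade = [
    timm.create_model("vit_small_patch32_224", pretrained=True),  #  7x7  =  49 tokens
    timm.create_model("vit_small_patch16_224", pretrained=True),  # 14x14 = 196 tokens
]
for vit in cascade:
    vit.eval()  # inference only
```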
“…It will stop inference for an input image once it has sufficient confidence in the prediction at the current token length. Different from DVT [42], our method is more accessible and practical since only a single ViT model is required. Moreover, we pay more attention to how to accurately decide the smallest token length that gives a correct prediction in the transformer for each image.…”
Section: Model Compression (mentioning)
confidence: 99%
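The confidence-based stopping rule mentioned in these citations can be sketched as follows (a simplified illustration of the early-exit idea, not either paper's exact criterion; the threshold value is an assumption). With the cascade from the previous sketch, `label, conf, depth = dynamic_predict(cascade, img)` reports how far down the cascade an image had to travel.

```python
import torch

@torch.no_grad()
def dynamic_predict(cascade, image, threshold=0.9):
    """Try the cheapest tokenization first; fall back to finer ones only when
    the softmax confidence stays below `threshold`."""
    for i, vit in enumerate(cascade):
        probs = vit(image).softmax(dim=-1)       # (1, num_classes)
        conf, label = probs.max(dim=-1)
        if conf.item() >= threshold or i == len(cascade) - 1:
            return label.item(), conf.item(), i  # i = how far the cascade went
```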