2022
DOI: 10.48550/arxiv.2201.00520
Preprint

Vision Transformer with Deformable Attention

Abstract: Transformers have recently shown superior performances on various vision tasks. The large, sometimes even global, receptive field endows Transformer models with higher representation power over their CNN counterparts. Nevertheless, simply enlarging receptive field also gives rise to several concerns. On the one hand, using dense attention e.g., in ViT, leads to excessive memory and computational cost, and features can be influenced by irrelevant parts which are beyond the region of interests. On the other hand…

Cited by 14 publications (24 citation statements). References 37 publications.

“…For larger EATFormer-small and EATFormer-base models, we consistently get better results than recent counterparts, which surpass Swin-T by +2.4↑/+2.1↑ and Swin-S by +1.5↑/+1.7↑ with 1× schedule, while by +1.4↑/+1.3↑ and by +0.5↑/+0.9↑ with 3× schedule. Also, we obtain slightly higher results than DAT [114] with computation amount going down by 29G↓.…”
Section: Results (mentioning)
confidence: 75%
“…Then KV is obtained with the new feature map X̃, i.e., KV = f_kv(X̃). It is worth mentioning that the main difference between MD-MSA and recent similar work [114] lies in the modulation operation, where MD-MSA could apply appropriate attention to different position features to obtain better results. Also, any form of position embedding is not used since it makes no contribution to results, and detailed comparative experiments can be viewed in Section 5.4.3.…”
Section: Re-sample (mentioning)
confidence: 99%
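The modulation operation mentioned in this statement is not detailed on this page; below is a minimal PyTorch sketch of one plausible form, assuming re-sampled features x_tilde of shape (B, N, C), a joint K/V projection f_kv, and a sigmoid gate mod_proj. All names and the gating form are illustrative assumptions, not the cited MD-MSA implementation.

import torch
import torch.nn as nn

class ModulatedKV(nn.Module):
    # Sketch: project re-sampled features to keys/values and scale them
    # by a learned per-position modulation factor in [0, 1].
    # (Hypothetical layer names; not the cited paper's code.)
    def __init__(self, dim):
        super().__init__()
        self.f_kv = nn.Linear(dim, 2 * dim)   # joint K/V projection
        self.mod_proj = nn.Linear(dim, 1)     # per-position modulation logit

    def forward(self, x_tilde):               # x_tilde: (B, N, C) re-sampled features
        m = torch.sigmoid(self.mod_proj(x_tilde))  # (B, N, 1) modulation in [0, 1]
        kv = self.f_kv(x_tilde) * m                # modulated keys/values
        k, v = kv.chunk(2, dim=-1)
        return k, v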
“…But nearly none of them makes full use of the rich short- and long-range dependencies generated by the self-attention mechanism. In the field of image processing, Deformable Attention Transformer (DAT) proposed in [137] generates the deformed sampling points by introducing an offset network. It achieves consistently-improved results on comprehensive benchmarks and reduces computational costs.…”
Section: A Discussion (mentioning)
confidence: 99%
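The offset network described in this citation predicts where to sample the feature map; below is a minimal PyTorch sketch of such offset-based deformed sampling, assuming an input feature map of shape (B, C, H, W). The offset_net layout, the tanh bounding of offsets, and the use of F.grid_sample are illustrative assumptions rather than the reference implementation of DAT [137].

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformedSampling(nn.Module):
    # Sketch: predict per-location offsets, then bilinearly sample the
    # feature map at the deformed points to form keys/values.
    def __init__(self, dim):
        super().__init__()
        self.offset_net = nn.Sequential(       # hypothetical offset network
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, 2, 1),              # (dx, dy) per reference point
        )

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        offsets = self.offset_net(x).permute(0, 2, 3, 1)            # (B, H, W, 2)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=x.device),
            torch.linspace(-1, 1, W, device=x.device),
            indexing="ij",
        )
        ref = torch.stack((xs, ys), dim=-1).expand(B, -1, -1, -1)   # reference grid
        grid = (ref + offsets.tanh()).clamp(-1, 1)                  # deformed points
        sampled = F.grid_sample(x, grid, align_corners=True)        # (B, C, H, W)
        return sampled                          # features at deformed locations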
“…By incorporating the pyramid structure from CNNs, PVT [38] serves as a versatile backbone for many dense prediction tasks. DeformableViT [40] is equipped with a deformable self-attention module in line with deformable convolution [11] to enable flexible spatial locations conditioned on input data. ViT-Slim [5] searches for a sub-transformer network across three dimensions of input tokens, MHSA and MLP modules with an ℓ1-regularized soft mask to indicate the global importance of dimensions, just like CNN network slimming [29].…”
Section: Related Work (mentioning)
confidence: 99%