2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00475

Vision Transformer with Deformable Attention

Abstract: Transformers have shown superior performance on various vision tasks. Their large receptive field endows Transformer models with higher representation power than their CNN counterparts. Nevertheless, simply enlarging the receptive field also raises several concerns. On the one hand, using dense attention in ViT leads to excessive memory and computational cost, and features can be influenced by irrelevant parts that are beyond the region of interest. On the other hand, the handcrafted attention adopted in PVT …
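For context, the dense attention the abstract refers to is the standard self-attention used in ViT (general background, not specific to this paper): for N patch tokens with queries Q, keys K, and values V of dimension d,

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,

which forms an N × N attention map, so memory and compute grow quadratically with the number of tokens.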

Cited by 235 publications (100 citation statements). References 94 publications.
“…Our DGMN2 is related to DAT [76] and Deformable DETR [91], but has key differences: when calculating attention, DAT [76] learns an offset for each point on the feature map and uses the deformed points as the keys and values. In contrast, our DGMN2 samples K nodes of key and value and calculates attention using the sampled keys and values.…”
Section: Discussion
confidence: 99%
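The contrast drawn above is easiest to see in code. Below is a minimal, single-head PyTorch sketch of the offset-then-sample idea attributed to DAT [76]: an offset network shifts a uniform reference grid, and features bilinearly sampled at the deformed points become the keys and values. This is an illustrative sketch under simplifying assumptions (single head, offsets predicted directly from the feature map, no downsampled reference grid or relative position bias), not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformedKeyValueSampler(nn.Module):
    """Learns per-point offsets and attends over deformed keys/values (simplified sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        # Lightweight network predicting a 2-D offset for each reference point.
        self.offset_net = nn.Conv2d(dim, 2, kernel_size=5, padding=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map
        B, C, H, W = x.shape
        q = self.to_q(x.flatten(2).transpose(1, 2))               # (B, H*W, C)

        # Uniform reference grid in [-1, 1], shifted by learned offsets.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=x.device),
            torch.linspace(-1, 1, W, device=x.device),
            indexing="ij",
        )
        ref = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2)    # (B, H, W, 2)
        offsets = self.offset_net(x).permute(0, 2, 3, 1).tanh()   # (B, H, W, 2)
        deformed = (ref + offsets).clamp(-1, 1)

        # Bilinearly sample features at the deformed points; they become keys/values.
        sampled = F.grid_sample(x, deformed, mode="bilinear", align_corners=True)
        kv = self.to_kv(sampled.flatten(2).transpose(1, 2))       # (B, H*W, 2C)
        k, v = kv.chunk(2, dim=-1)

        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
        return attn @ v                                           # (B, H*W, C)
```

A scheme such as the one quoted for DGMN2 would instead pick K sampled nodes to serve as keys and values, rather than deforming every key/value position.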
“…[Results tables from the citing paper: semantic-segmentation mIoU by method and backbone; truncated values appear as in the source.]

Method | Backbone | mIoU
FCN [1] | CNN | 29.4
RefineNet [35] | ResNet-152 | 40.7
UperNet [17] | ResNet-50 | 41.2
UperNet+Conv [14] | ConvNeXt-XL | 54.0
UperNet+DeAtt [25] | DAT-B | 49.4
DeepLabv3++ [36] | Xception-65 | 45.7
Auto-DeepLab [37] | NAS | 44.0
OCR [38] | HRNetV2-W48 | 45.7
MaskFormer [41] | ResNet-50 | 44.5
MaskFormer+FaPN [11] | Swin-L | 55.2
SegFormer [21] | MiT-B5 | 51.8
HRViT [43] | HRViT-b3 | 50.2
BEiT [44] | Transformer | 47.7
CSWin [45] | CSWin-L | 54.0
Mask2Former [6] | Swin-L | 56.

Method | Backbone | mIoU
— | Dilated-ResNet-101 | 80.2
RefineNet [35] | ResNet-101 | 73.6
DeepLabv3++ [36] | Xception-65 | 82.1
Auto-DeepLab [37] | NAS | 80.3
OCR [38] | HRNetV2-W48 | 83.6
MDEQ [39] | MDEQ | 80.3
SynBoost [40] | VGG-16 & CNN | 83.5
MaskFormer+FaPN [11] | ResNet-101 | 80.1
SML [42] | ResNet-101 | 80.3
SegFormer [21] | MiT-B5 | 84.0
HRViT [43] | HRViT-b3 | 83.2
HSB-Net [13] | ResNet-34 | 73.1
Mask2Former [6] | Swin-L | 83.…”
Section: Methods
confidence: 99%
“…Yu et al. [24] proposed to replace the self-attention module of the transformer with a simple spatial pooling operator and showed competitive performance while significantly reducing processing complexity. In [25], the authors proposed a deformable self-attention module, which lets the attention block concentrate on relevant regions in a data-aware way. However, it is difficult to maintain a high-resolution feature map in transformer-based architectures due to computational and memory constraints.…”
Section: Related Work
confidence: 99%
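As a concrete illustration of the pooling idea mentioned in the quote above, here is a minimal PyTorch sketch of a pooling token mixer in the spirit of Yu et al. [24] (class name and shapes are my own; this is not the cited implementation): self-attention is replaced by parameter-free average pooling, with the input subtracted so only the mixing component remains.

```python
import torch
import torch.nn as nn

class PoolingTokenMixer(nn.Module):
    """Parameter-free token mixer: spatial average pooling in place of self-attention."""
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; subtracting x keeps only the mixed component.
        return self.pool(x) - x
```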
“…Large-scale pre-training. Benefiting from the development of Transformers in both vision [35,63,36] and language [54] tasks, large-scale pre-training frameworks have received wide attention in recent years and have shown promising results in computer vision and natural language processing. GPT [39] is one of the pioneering works for language pre-training; it maximizes the probability of each output token conditioned on the previous words in the sequence.…”
Section: Related Work
confidence: 99%
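The GPT objective the quote paraphrases is the standard autoregressive language-modeling loss (general background, not drawn from the cited paper): for a token sequence x_1, …, x_T and model parameters θ,

L(\theta) = \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t}),

i.e., each token's probability is maximized given all previous tokens in the sequence.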