2021
DOI: 10.48550/arxiv.2107.00652
Preprint

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Abstract: We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism for computing self-attention in the horizontal and vertical stripes in parallel that form a cross-shaped window, with each stripe …
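The description above is enough to sketch the mechanism. Below is a minimal PyTorch sketch of cross-shaped window self-attention, assuming the stripe width sw divides the feature-map height and width and that the heads are split evenly between a horizontal-stripe branch and a vertical-stripe branch; these choices, the tensor layout, and the function names are illustrative, not the authors' exact implementation.

# Minimal sketch of cross-shaped window self-attention as described in the
# abstract. Assumptions (not taken from the paper text): sw divides H and W,
# and the heads are split evenly between the horizontal and vertical branches.
import torch


def stripe_attention(q, k, v, H, W, sw, horizontal):
    # q, k, v: (B, heads, H*W, d); attention is computed independently inside
    # each stripe of sw rows (horizontal=True) or sw columns (horizontal=False).
    B, h, N, d = q.shape
    stripe_len = sw * (W if horizontal else H)

    def to_stripes(x):
        x = x.reshape(B, h, H, W, d)
        if horizontal:
            x = x.reshape(B, h, H // sw, sw, W, d)                              # group rows
        else:
            x = x.reshape(B, h, H, W // sw, sw, d).permute(0, 1, 3, 2, 4, 5)    # group columns
        return x.reshape(B, h, -1, stripe_len, d)

    def from_stripes(x):
        if horizontal:
            x = x.reshape(B, h, H // sw, sw, W, d)
        else:
            x = x.reshape(B, h, W // sw, H, sw, d).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, h, H * W, d)

    qs, ks, vs = map(to_stripes, (q, k, v))
    attn = (qs @ ks.transpose(-2, -1)) * d ** -0.5          # per-stripe attention logits
    return from_stripes(attn.softmax(dim=-1) @ vs)


def cross_shaped_window_attention(q, k, v, H, W, sw=2):
    # Half of the heads attend within horizontal stripes, the other half within
    # vertical stripes; together the two groups cover a cross-shaped window.
    h = q.shape[1] // 2
    out_h = stripe_attention(q[:, :h], k[:, :h], v[:, :h], H, W, sw, horizontal=True)
    out_v = stripe_attention(q[:, h:], k[:, h:], v[:, h:], H, W, sw, horizontal=False)
    return torch.cat([out_h, out_v], dim=1)                 # (B, heads, H*W, d)


# smoke test on an 8x8 token grid with 4 heads
q = torch.randn(2, 4, 64, 16)
print(cross_shaped_window_attention(q, q, q, H=8, W=8, sw=2).shape)  # torch.Size([2, 4, 64, 16])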

Cited by 75 publications (123 citation statements)
References 68 publications
“…Transformer and Focal Transformer. For a fairer comparison, we adopt the same positional encoding, LePE (Dong et al., 2021), for PVTv2, Swin, and Focal Transformer. As shown in Table 8, QuadTree attention obtains consistently better performance than Swin and PVTv2 on both the classification and detection tasks.…”
Section: Discussion
confidence: 99%
“…Comparison on ImageNet-1K (Flops (G), Top-1 (%), Mem. (MB)) and COCO with RetinaNet (AP, AP50, AP75):
PVTv2                           0.6   70.5   574   37.2   57.2   39.5
PVTv2+LePE (Dong et al., 2021)  0.6   70.9   574   37.6   57.8   39.9
Swin                            0…”
Section: Discussion
confidence: 99%
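The quotes above adopt LePE as the positional encoding. Below is a minimal PyTorch sketch of the LePE idea, assuming it is a depthwise convolution applied to V and added to the attention output; the 3x3 kernel size, the use of full (non-striped) attention, and the class and parameter names are illustrative assumptions rather than the authors' exact implementation.

# Minimal sketch of locally-enhanced positional encoding (LePE, Dong et al., 2021):
# a depthwise convolution over V, laid out as a 2-D feature map, is added to the
# attention output instead of adding a bias to the attention logits.
import torch
import torch.nn as nn


class AttentionWithLePE(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # depthwise conv acting as the locally-enhanced positional encoding on V
        self.lepe = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, H*W, dim) tokens from an H x W feature map
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        heads = self.heads

        def split_heads(t):                                   # (B, N, C) -> (B, h, N, d)
            return t.reshape(B, N, heads, C // heads).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)

        # positional term: depthwise conv over V as a (B, C, H, W) map
        v_map = v.transpose(1, 2).reshape(B, N, C).permute(0, 2, 1).reshape(B, C, H, W)
        pos = self.lepe(v_map).reshape(B, C, N).permute(0, 2, 1)  # back to (B, N, C)

        return self.proj(out + pos)


# example: a 14x14 token grid with embedding dim 64
x = torch.randn(2, 14 * 14, 64)
print(AttentionWithLePE(64)(x, H=14, W=14).shape)  # torch.Size([2, 196, 64])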
“…Vision transformers [2,18,19,22,31,49,70,71,88] treat an image as a set of patches and model their interactions with transformer-based architectures [74]. Recent works that add vision priors such as multi-scale feature hierarchies [22,31,49,80,88] or local structure modeling [9,18,49] have been shown to be effective. They have also been generalized from the image domain to the video domain [3,22,51,54].…”
Section: Related Work
confidence: 99%
“…Recently, the pioneering work ViT [22] successfully applies a pure transformer-based architecture to computer vision, revealing the potential of transformers for visual tasks. Many follow-up studies have been proposed [4,5,9,12,18,21,23,24,[27][28][29]31,38,41,43,45,50,52,56,76,77,80,81,84]. Many of them analyze ViT [15,17,26,32,44,55,69,73,75,82] and improve it by introducing locality into earlier layers [11,17,48,64,79,83,87].…”
Section: Related Work
confidence: 99%