2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01181
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Cited by 512 publications (230 citation statements) · References 29 publications
“…Table 1 (ADE20K): Method, Backbone, mIoU
FCN [1], CNN, 29.4
RefineNet [35], ResNet-152, 40.7
UperNet [17], ResNet-50, 41.2
UperNet+Conv [14], ConvNeXt-XL, 54.0
UperNet+DeAtt [25], DAT-B, 49.4
DeepLabv3++ [36], Xception-65, 45.7
Auto-DeepLab [37], NAS, 44.0
OCR [38], HRNetV2-W48, 45.7
MaskFormer [41], ResNet-50, 44.5
MaskFormer+FaPN [11], Swin-L, 55.2
SegFormer [21], MiT-B5, 51.8
HRViT [43], HRViT-b3, 50.2
BEiT [44], Transformer, 47.7
CSWin [45], CSWin-L, 54.0
Mask2Former [6], Swin-L, 56.

Table 2 (Cityscapes): Method, Backbone, mIoU
…, Dilated-ResNet-101, 80.2
RefineNet [35], ResNet-101, 73.6
DeepLabv3++ [36], Xception-65, 82.1
Auto-DeepLab [37], NAS, 80.3
OCR [38], HRNetV2-W48, 83.6
MDEQ [39], MDEQ, 80.3
SynBoost [40], VGG-16 & CNN, 83.5
MaskFormer+FaPN [11], ResNet-101, 80.1
SML [42], ResNet-101, 80.3
SegFormer [21], MiT-B5, 84.0
HRViT [43], HRViT-b3, 83.2
HSB-Net [13], ResNet-34, 73.1
Mask2Former [6], Swin-L, 83.…”
Section: Methods
“…and CSWin [45], using the mean IoU (mIoU) metric, which measures the intersection-over-union between the predicted segmentation and the ground truth. The corresponding results on the ADE20K and Cityscapes datasets are shown in Tables 1 and 2, respectively.…”
Section: Performance Evaluation
confidence: 99%
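The mIoU metric described above can be sketched as a small function: IoU is computed per class and averaged. Skipping classes absent from both prediction and ground truth is a common convention assumed here; evaluation toolkits differ in such details.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between predicted and ground-truth
    label maps. Classes absent from both maps are skipped rather than
    counted as zero (an assumed convention)."""
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class appears in neither map; do not penalize
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

pred = np.array([[0, 1], [1, 1]])
gt = np.array([[0, 1], [0, 1]])
print(mean_iou(pred, gt, 2))  # (1/2 + 2/3) / 2 = 0.5833...
```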
“…As shown in Fig. 1, the CSWin Transformer [5] encoder first takes an RGB image as input and splits it into non-overlapping patches as "tokens". The encoder consists of four stages that produce hierarchical representations (i.e., ST1, ST2, ST3, and ST4).…”
Section: Network Architecture
confidence: 99%
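The tokenization step in the quote can be sketched as a strided-convolution patch embedding: a convolution whose kernel size equals its stride is equivalent to flattening each non-overlapping patch and projecting it linearly. The patch size (4) and embedding dimension (96) are illustrative assumptions, not the exact CSWin configuration.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an RGB image into non-overlapping patches and project each
    patch to an embedding vector ("token"). A minimal sketch; CSWin's
    actual token embedding differs in its exact configuration."""
    def __init__(self, patch=4, dim=96):
        super().__init__()
        # kernel_size == stride -> one output position per patch
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)  # (B, num_tokens, dim)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96]) -- 56 x 56 tokens
```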
“…(2) We replace the Swin Transformer encoder of MaskFormer with the CSWin Transformer [5]. The latter introduces a cross-shaped-window self-attention mechanism that computes self-attention over horizontal and vertical stripes in parallel, enlarging the attention area of each token without increasing the computational complexity.…”
Section: Introduction
confidence: 99%
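The stripe grouping behind cross-shaped-window attention can be sketched as follows: the channels (heads) are split in half, one half attending within horizontal stripes of a few rows and the other within vertical stripes of a few columns, in parallel. This is a minimal illustration of the grouping only; the real CSWin layer adds qkv projections, a locally-enhanced positional encoding, and other details omitted here, and the stripe width `sw` and head count below are illustrative assumptions.

```python
import torch

def cross_shaped_window_attention(x, H, W, sw=2, heads=4):
    """Sketch of cross-shaped-window self-attention over a token grid.
    Half the heads attend within horizontal stripes (sw rows x W cols),
    the other half within vertical stripes (H rows x sw cols)."""
    B, N, C = x.shape
    assert N == H * W and heads % 2 == 0 and C % heads == 0
    d = C // heads
    half = C // 2

    def stripe_attn(t, n_stripes, length, hds):
        # t: (B, n_stripes, length, hds*d) tokens grouped per stripe
        t = t.reshape(B, n_stripes, length, hds, d).transpose(2, 3)
        att = (t @ t.transpose(-2, -1)) / d ** 0.5  # per-stripe attention
        out = att.softmax(-1) @ t
        return out.transpose(2, 3).reshape(B, n_stripes, length, hds * d)

    grid = x.reshape(B, H, W, C)
    xh, xv = grid[..., :half], grid[..., half:]  # split channels by head group
    # Horizontal stripes: groups of sw consecutive rows, full width.
    h = xh.reshape(B, H // sw, sw * W, half)
    h = stripe_attn(h, H // sw, sw * W, heads // 2).reshape(B, H, W, half)
    # Vertical stripes: groups of sw consecutive columns, full height.
    v = xv.transpose(1, 2).reshape(B, W // sw, sw * H, half)
    v = stripe_attn(v, W // sw, sw * H, heads // 2)
    v = v.reshape(B, W, H, half).transpose(1, 2)
    # Concatenating the two head groups restores the full channel dim.
    return torch.cat([h, v], dim=-1).reshape(B, N, C)
```

Because each stripe covers a full row span or column span, a token's attention area forms a cross, and the two orientations run in the same layer rather than in alternating layers.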
“…To reduce the number of tokens, two camps of approaches have been proposed. The first applies the concepts of hierarchical convolutional neural nets, downsampling tokens with various pooling methods (Liu et al, 2021; Heo et al, 2021; Dong et al, 2022). The other attempts to measure significance scores among the tokens and drop or prune tokens accordingly (Goyal et al, 2020; Rao et al, 2021; Marin et al, 2021).…”
Section: Related Work
confidence: 99%
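The second camp, score-based token pruning, can be sketched as a top-k selection over per-token significance scores. How the scores are produced varies by method (e.g. attention received from a class token); here they are assumed to be given.

```python
import torch

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Sketch of score-based token pruning: keep the top-k tokens
    ranked by a per-token significance score, preserving their
    original order. The scoring function itself is assumed given."""
    B, N, C = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = scores.topk(k, dim=1).indices          # (B, k) highest-scoring
    idx = idx.sort(dim=1).values                 # restore original order
    return tokens.gather(1, idx.unsqueeze(-1).expand(B, k, C))

x = torch.randn(2, 100, 64)
s = torch.randn(2, 100)
print(prune_tokens(x, s).shape)  # torch.Size([2, 50, 64])
```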