2022
DOI: 10.48550/arxiv.2201.01615
Preprint
Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

Abstract: Multi-scale representations are crucial for semantic segmentation. The community has witnessed a flourishing of semantic segmentation convolutional neural networks (CNNs) that exploit multi-scale contextual information. Motivated by the power of the vision transformer (ViT) in image classification, several semantic segmentation ViTs have recently been proposed; most attain impressive results, but at the cost of computational economy. In this paper, we succeed in introducing multi-scale representations into sema…

Cited by 11 publications (11 citation statements)
References 41 publications
“…It is noteworthy that the RA-Net can be simply implemented by reusing the network's output with class probabilities without complex modules, while remaining faithful to the concept of attention that focuses on important RoI-associated context of the image. Moreover, it possesses advantages in terms of efficiency and performance, compared to transformer networks that prioritize efficiency like Lawin [29], SegNeXt [30], EfficientViT [31].…”
Section: Related Work
confidence: 99%
“…Because it requires a finer-grained analysis of the image, semantic segmentation is also a more time-consuming and challenging task in image analysis. To achieve this, most neural network-based models utilize an encoder/decoder-like architecture, such as U-Net [100], FCN [101], SegNet [102], DeepLab [103][104][105], AdaptSegNet [106], Fast-SCNN [107], HANet [108], Panoptic-DeepLab [109], SegFormer [110], or Lawin+ [111]. The encoder conducts feature extraction through CNNs and derives an abstract representation (also called a feature map) of the original image.…”
Section: Semantic Segmentation
confidence: 99%
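The encoder/decoder pattern described in the statement above can be sketched in a few lines of NumPy. This is a deliberately minimal toy, not any of the cited architectures: random matrices stand in for learned convolutions, average pooling for the encoder's downsampling, and nearest-neighbour upsampling for the decoder; all function and variable names (`avg_pool2x2`, `segment`, `w_enc`, `w_dec`) are hypothetical.

```python
import numpy as np

def avg_pool2x2(x):
    # Encoder downsampling: (C, H, W) -> (C, H//2, W//2) by 2x2 averaging.
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    # Decoder upsampling: nearest-neighbour, doubling H and W.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def segment(img, w_enc, w_dec):
    # Encoder: downsample, then mix channels (a stand-in for conv layers),
    # producing the abstract feature map the quoted text describes.
    feat = np.einsum('oc,chw->ohw', w_enc, avg_pool2x2(img))
    feat = np.maximum(feat, 0.0)  # ReLU
    # Decoder: restore resolution and project features to per-class logits.
    logits = np.einsum('kc,chw->khw', w_dec, upsample2x(feat))
    return logits.argmax(axis=0)  # per-pixel class-label map

rng = np.random.default_rng(0)
img = rng.standard_normal((3, 8, 8))    # 3-channel 8x8 input image
w_enc = rng.standard_normal((16, 3))    # 3 input channels -> 16 features
w_dec = rng.standard_normal((4, 16))    # 16 features -> 4 classes
mask = segment(img, w_enc, w_dec)
print(mask.shape)                       # (8, 8) label map, same size as input
```

The design point the sketch illustrates is that segmentation, unlike classification, must recover a prediction at every pixel, so the decoder's job is to undo the spatial compression the encoder introduced; real models (U-Net, SegFormer) replace the random matrices with learned convolutions or attention and add skip connections between matching resolutions.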
“…Leveraging the advance of DETR [62], MaX-DeepLab [63] and MaskFormer [64] view image segmentation from the perspective of mask classification. Following this trend, various architectures of dense prediction transformers [65], [66], [67], [68] and semantic segmentation transformers [69], [70], [71], [72] emerge in the field. While these approaches have achieved high segmentation performance, most of them focus on using RGB images and suffer when RGB images cannot provide sufficient information in real-world scenes, e.g., under low-illumination conditions or in high-dynamic areas.…”
Section: Transformer-driven Semantic Segmentation
confidence: 99%