A Novel Transformer Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images

Wang, Libo; Li, Rui; Duan, Chenxi; Zhang, Ce; Meng, Xin; Fang, Shenghui

doi:10.1109/lgrs.2022.3143368

Cited by 104 publications

(58 citation statements)

References 24 publications


“…The structure of the ViT is completely different from the CNN, which treats the 2D image as the 1D ordered sequence and applies the self-attention mechanism for global dependency modelling, demonstrating stronger global feature extraction. Driven by this, many researchers in the field of remote sensing introduced ViTs for segmentation-related tasks, such as land cover classification [63][64][65][66][67][68], urban scene parsing [69][70][71][72][73][74], change detection [75,76], road extraction [77] and especially building extraction [78]. For example, Chen et al. [79] proposed a sparse token Transformer to learn the global dependency of tokens in both spatial and channel dimensions, achieving state-of-the-art accuracy on benchmark building extraction datasets.…”

Section: B ViT-based Building Extraction Methods (mentioning)

confidence: 99%
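The statement above notes that a ViT treats a 2D image as a 1D ordered sequence. A minimal sketch of that flattening step follows, assuming non-overlapping square patches and NumPy arrays; the function name `image_to_patch_sequence` and the patch size are illustrative and do not come from any of the cited papers.

```python
import numpy as np


def image_to_patch_sequence(img, patch):
    """Flatten an (H, W, C) image into a sequence of patch tokens.

    Returns an array of shape (num_patches, patch * patch * C), with
    patches ordered row by row (raster order).
    """
    H, W, C = img.shape
    if H % patch or W % patch:
        raise ValueError("image height and width must be divisible by patch")
    # Split both spatial axes into (block index, offset inside block).
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    # Bring the two block indices to the front: (rows, cols, ph, pw, C).
    x = x.transpose(0, 2, 1, 3, 4)
    # One row per patch, pixels of the patch flattened into a token.
    return x.reshape(-1, patch * patch * C)


if __name__ == "__main__":
    img = np.arange(16).reshape(4, 4, 1)
    tokens = image_to_patch_sequence(img, patch=2)
    print(tokens.shape)
    print(tokens)
```

In a full ViT each token would then pass through a learned linear projection and receive a positional encoding; both are omitted here.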

“…the Massachusetts building dataset, WHU building dataset and Inria Aerial Image Labeling dataset. The selected methods include convolutional networks, such as U-Net [21], Deeplabv3+ [88], SRI-Net [16], DS-Net [49], BRRNet [20], SiU-Net [18], CU-Net [19], EU-Net [89], DE-Net [90], MA-FCN [48], MANet [53], MAP-Net [27], Bias-UNet [57], CBRNet [35], and ViT-based networks like SwinUperNet [34], Sparse Token Transformer (STT) [79], MSST-Net [80], BANet [72], DC-Swin [69].…”

Section: B Comparison Of State-of-the-art Methods (mentioning)

confidence: 99%


Building Extraction With Vision Transformer

Wang, Fang, et al. 2022, IEEE Trans. Geosci. Remote Sensing

Self Cite


Buildings are an important carrier of human productive activities, and their extraction is essential for both urban dynamic monitoring and suburban construction inspection. Accurate building extraction from remote sensing images remains a challenge due to the complex backgrounds and diverse appearances of buildings. Convolutional neural network (CNN) based building extraction methods, although they have increased accuracy significantly, are criticized for their inability to model global dependencies. Thus, this paper applies the Vision Transformer to building extraction. However, practical use of the Vision Transformer comes with two limitations. First, the Vision Transformer requires more GPU memory and computation than CNNs, a limitation that is further magnified for large inputs such as fine-resolution remote sensing images. Second, spatial details are not sufficiently preserved during feature extraction in the Vision Transformer, which hinders fine-grained building segmentation. To handle these issues, we propose a novel Vision Transformer (BuildFormer) with a dual-path structure. Specifically, we design a spatial-detailed context path to encode rich spatial details and a global context path to capture global dependencies. In addition, we develop a window-based linear multi-head self-attention that makes the complexity of multi-head self-attention linear in the window size, which strengthens global context extraction by allowing large windows and greatly improves the potential of the Vision Transformer for processing large remote sensing images. The proposed method yields state-of-the-art performance (75.74% IoU) on the Massachusetts building dataset. Code will be available.
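The abstract does not give the formula behind its window-based linear attention, so the sketch below swaps in a generic kernelized linear attention (feature map elu(x) + 1) applied inside non-overlapping windows. It is not BuildFormer's actual module: it is single-head, it has no learned query/key/value projections, and every function name is hypothetical. It only illustrates how reordering the matrix products makes the cost grow linearly with the number of tokens per window.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise (always > 0).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v):
    """Kernelized attention on (n, d) inputs without an (n, n) matrix."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                 # (d, d_v): keys and values summarized once
    norm = q @ k.sum(axis=0)     # (n,): per-query normalizer
    return (q @ kv) / norm[:, None]


def quadratic_reference(q, k, v):
    """Same result computed through the explicit (n, n) attention matrix."""
    q, k = feature_map(q), feature_map(k)
    attn = q @ k.T
    attn = attn / attn.sum(axis=1, keepdims=True)
    return attn @ v


def window_partition(x, w):
    """Split an (H, W, C) feature map into (num_windows, w * w, C) tokens."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)


def windowed_linear_attention(x, w):
    # Illustration only: the raw features serve as queries, keys and values.
    windows = window_partition(x, w)
    return np.stack([linear_attention(t, t, t) for t in windows])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 8, 4))
    out = windowed_linear_attention(feats, w=4)
    print(out.shape)
```

With n = w * w tokens per window and channel width d, the reference version costs on the order of n² · d operations, while the reordered version costs on the order of n · d². This is why larger windows become affordable under this kind of formulation.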


“…Transformer models perform well on natural language processing (NLP) and computer vision (CV) tasks and have attracted considerable attention in remote sensing. Some people apply Transformer to the research of remote sensing works, such as remote sensing image segmentation [38], [39] and remote sensing image change detection [15], [16]. For example, a Transformer-based method has recently been proposed to detect changes in remote sensing images.…”

Section: Related Work (mentioning)

confidence: 99%

A CBAM Based Multiscale Transformer Fusion Approach for Remote Sensing Image Change Detection

Wang, Zhang, Wang 2022, IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing


“…In the transformer-based network, self-attention is treated as the main operation in the encoder phase and not only as a single module in the decoder phase. References [25,26] applied a transformer model on remote imagery successfully. However, they only considered color features as inputs.…”

Section: Acquiring Long-range Dependency (mentioning)

confidence: 99%

Efficient Depth Fusion Transformer for Aerial Image Semantic Segmentation

Huang, Xie, et al. 2022, Remote Sensing


Taking depth into consideration has been proven to improve the performance of semantic segmentation by providing additional geometric information. Most existing works adopt a two-stream network that extracts features from color images and depth images separately using two branches of the same structure, which incurs high memory and computation costs. We find that depth features acquired by simple downsampling can also play a complementary part in the semantic segmentation task, sometimes performing even better than the two-stream scheme with two identical branches. In this paper, a novel and efficient depth fusion transformer network for aerial image segmentation is proposed. The presented network uses patch merging to downsample the depth input, and a depth-aware self-attention (DSA) module is designed to mitigate the gap caused by the differences between the two branches and the two modalities. Concretely, the DSA fuses depth and color features by computing depth similarity and its impact on the self-attention map calculated from the color features. Extensive experiments on the ISPRS 2D semantic segmentation dataset validate the efficiency and effectiveness of our method. With nearly half the parameters of the traditional two-stream scheme, our method achieves 83.82% mIoU on the Vaihingen dataset, outperforming other state-of-the-art methods, and 87.43% mIoU on the Potsdam dataset, comparable to the state of the art.
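The mIoU figures quoted in this abstract are mean intersection-over-union scores. The sketch below shows how the metric is commonly computed from a confusion matrix, assuming integer class labels in the range 0 to num_classes - 1. The function names are illustrative, and the evaluation protocol of the cited paper (for example, which classes are ignored) is not stated here and is not reproduced.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    idx = y_true.ravel() * num_classes + y_pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou(y_true, y_pred, num_classes):
    cm = confusion_matrix(y_true, y_pred, num_classes)
    tp = np.diag(cm)                           # correctly labeled pixels
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0                          # skip classes absent from both
    return float((tp[valid] / union[valid]).mean())


if __name__ == "__main__":
    truth = np.array([[0, 0], [1, 1]])
    pred = np.array([[0, 1], [1, 1]])
    print(mean_iou(truth, pred, num_classes=2))
```

In the toy case, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.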
