2022
DOI: 10.1109/jstars.2022.3155665
Homo–Heterogenous Transformer Learning Framework for RS Scene Classification

Abstract: Remote sensing (RS) scene classification plays an essential role in the RS community and has attracted increasing attention due to its wide applications. Recently, benefiting from the powerful feature learning capabilities of convolutional neural networks (CNNs), the accuracy of RS scene classification has been significantly improved. Although the existing CNN-based methods achieve excellent results, there is still room for improvement. First, the CNN-based methods are adept at capturing the global information …

Cited by 41 publications (27 citation statements)
References 85 publications
“…Lu et al 62: Explores the content of semantic labels
Li et al 111: Captures the key region
Bi et al 110: Direct control over semantic labels at the bag level
Li et al 111: Reduces the dimensionality of bilinear pooled features
Wang et al 112: Addresses the problem of large scale variance
Xu et al 76: Extracts the context information
Xu et al 114: Improves discriminative ability among ambiguous classes
Deng et al 63: Extracts semantic features and local structural features
Ma et al 75: Captures contextual relationships in complex RSI
[AB] Wang et al 115: Reduces the number of training parameters
Li et al 116: Fuses global and local features.…”
Section: Model / Reference(s) / Remarks (mentioning)
confidence: 99%
“…Zhao et al 118: Extracts stronger discriminative features
Li et al 119: Eliminates irrelevant and redundant information
Tang et al 120: Minimizes intra-class distance and maximizes inter-class distance
Bi et al 121: Enhances local semantic representation capability
Chen et al 122: Improves feature extraction ability
[GAN] Lin et al 98: Captures powerful discriminative features
Yu et al 106: Collects contextual information
Wei et al 105: Incorporates a multi-feature fusion layer
Ma et al 75: Generates labeled samples
Miao et al 107: Deals with the overfitting problem of deep models
[SOSFB] He et al 61: Preserves second-order information.…”
Section: Model / Reference(s) / Remarks (mentioning)
confidence: 99%
“…Based on the basic CNN, CTNet 30 develops an enhanced version of the CNN-based network. HHTL 31 carefully designs the patches before they are input to the transformers and fuses them after feature extraction. Some methods improve classification performance by adding operations such as multi-scale processing, spatial attention, and feature aggregation.…”
Section: NWPU Dataset (mentioning)
confidence: 99%
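The patch-and-fuse idea attributed to HHTL in the statement above can be illustrated with a minimal, hypothetical sketch: split an RS image into non-overlapping patches, encode them with a transformer, and fuse the patch features for classification. This is not the authors' HHTL code; the patch size, embedding width, encoder depth, mean-pool fusion, and the 45-class output (chosen to match NWPU-RESISC45) are illustrative assumptions, and the homogeneous/heterogeneous branch design of the paper is omitted.

```python
# Minimal sketch (assumption, not the HHTL implementation): patch splitting,
# transformer encoding, and simple fusion of patch features for classification.
import torch
import torch.nn as nn

class PatchTransformerClassifier(nn.Module):
    def __init__(self, img_size=256, patch_size=16, embed_dim=256,
                 depth=4, num_heads=8, num_classes=45):
        super().__init__()
        # Non-overlapping patch embedding implemented as a strided convolution.
        self.patch_embed = nn.Conv2d(3, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                           # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                # (B, D, H/ps, W/ps)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, D) patch tokens
        tokens = self.encoder(tokens + self.pos_embed)
        fused = tokens.mean(dim=1)                  # simple fusion of patch features
        return self.head(fused)

logits = PatchTransformerClassifier()(torch.randn(2, 3, 256, 256))
print(logits.shape)  # torch.Size([2, 45])
```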
“…Therefore, Li et al 30 proposed the remote sensing transformer (TRS), which integrates self-attention into a residual neural network (ResNet), replaces spatial convolutions with Multi-Head Self-Attention (MHSA) layers, and concatenates multiple pure transformer encoders to improve attention-dependent representation learning. Ma et al 31 proposed a homo-heterogenous transformer learning (HHTL) framework for HRRS scene classification that, in keeping with the characteristics of the transformer, divides the image into multiple patches.…”
(mentioning)
confidence: 99%
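To make the "MHSA instead of spatial convolutions" idea in the statement above concrete, here is a minimal, hypothetical sketch of a bottleneck-style residual block whose 3x3 spatial convolution is replaced by multi-head self-attention over the flattened spatial positions. It is not the TRS authors' implementation; the channel widths, head count, and normalization choices are assumptions, and positional encodings are omitted for brevity.

```python
# Illustrative sketch (assumption, not TRS): a residual block where spatial
# convolution is replaced by multi-head self-attention over spatial positions.
import torch
import torch.nn as nn

class MHSABottleneck(nn.Module):
    def __init__(self, channels=256, width=64, num_heads=4):
        super().__init__()
        self.reduce = nn.Conv2d(channels, width, kernel_size=1)
        self.attn = nn.MultiheadAttention(embed_dim=width, num_heads=num_heads,
                                          batch_first=True)
        self.expand = nn.Conv2d(width, channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                        # x: (B, C, H, W)
        b, _, h, w = x.shape
        y = self.reduce(x)                       # 1x1 conv reduces channels
        seq = y.flatten(2).transpose(1, 2)       # (B, H*W, width) token sequence
        attn_out, _ = self.attn(seq, seq, seq)   # self-attention replaces 3x3 conv
        y = attn_out.transpose(1, 2).reshape(b, -1, h, w)
        y = self.expand(y)                       # 1x1 conv restores channels
        return self.act(self.norm(y) + x)        # residual connection

out = MHSABottleneck()(torch.randn(2, 256, 16, 16))
print(out.shape)  # torch.Size([2, 256, 16, 16])
```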
“…From the available transformation change information, they can distinguish the changed regions. In recent years, deep learning methods, especially convolutional neural networks (CNNs), have been widely used in various computer vision (CV) tasks, such as image segmentation [11], [12], [13], scene classification [14], [15], [16], object detection [17], [18], saliency detection [19], [20], image enhancement [21], [22], group detection [23], and so on. Because CNNs can learn multi-level features and semantic features of bi-temporal images, they have also been introduced into CD to effectively describe the change information [24], [25], [26].…”
Section: Introduction (mentioning)
confidence: 99%