2020
DOI: 10.1109/tip.2020.2975718
A Multiple-Instance Densely-Connected ConvNet for Aerial Scene Classification

Cited by 116 publications (96 citation statements)
References: 59 publications
“…Training-to-test set ratio:        50%           80%

PLSA(SIFT) [10]                      67.55±1.11    71.38±1.77
BoVW(SIFT) [10]                      73.48±1.39    75.52±2.13
AlexNet [10]                         93.98±0.67    95.02±0.81
VGGNet-16 [10]                       94.14±0.69    95.21±1.20
GoogLeNet [10]                       92.70±0.60    94.31±0.89
CaffeNet [10]                        93.98±0.67    95.02±0.81
TEX-Net with VGG [41]                94.22±0.50    95.31±0.69
D-CNN with AlexNet [13]              --            96.67±0.10
Fine-tuned GoogLeNet [37]            --            97.1
Two-Stream Fusion [26]               96.97±0.75    98.02±1.03
SPP with AlexNet [19]                94.77±0.46    96.67±0.94
Gated attention [64]                 94.64±0.43    96.12±0.42
CCP-net [65]                         --            97.52±0.97
Fusion by addition [20]              --            97.42±1.79
DSFATN [61]                          --            98.25
Deep CNN Transfer [22]               --            98.49
MIDC-Net [66]                        95.41±0.40    97.40±0.48
DFAGCN [44]                          --            98.48±0.42
Inception-v3-CapsNet [34]            97.59±0.16    99.05±0.24
Backbone (Xception) [47]             92.76±0.

2) The improvements of the CSDS model are more prominent for the large (80%) than for the small (50%) training-to-test set ratio.…”
Section: Methods
confidence: 99%
“…A popular trend among deep learning algorithms for single-scene classification is to take a CNN as the backbone and introduce well-designed modules that further enhance feature efficiency. For instance, Bi et al. [31] proposed learning multiple instances from feature maps extracted by a densely connected CNN and integrating them into bag-level features for single-scene classification. Li et al. [49] proposed a key-region capturing method that learns class-specific features while retaining global information for inferring scene labels.…”
Section: A. Aerial Single-Scene Classification
confidence: 99%
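The instance-to-bag aggregation that the excerpt above attributes to Bi et al. [31] can be illustrated with a minimal numpy sketch: each spatial position of a backbone feature map is treated as one "instance", a per-instance classifier scores it, and the instance scores are pooled into a single bag-level (image-level) prediction. This is an assumption-laden sketch, not the authors' implementation; the shapes, the random weights, and the use of a plain matrix product in place of a 1x1 convolution are all illustrative.

```python
import numpy as np

# Hypothetical MIL sketch (not the paper's code): spatial positions of a
# CNN feature map act as instances; their class scores are pooled into a
# bag-level prediction for the whole image.
rng = np.random.default_rng(0)

n_classes = 5
feat = rng.normal(size=(7, 7, 64))    # illustrative H x W x D feature map
w = rng.normal(size=(64, n_classes))  # stand-in for a 1x1-conv classifier

instances = feat.reshape(-1, 64)      # 49 instances, one per location
scores = instances @ w                # instance-level class scores, (49, 5)

# Bag-level aggregation: mean pooling over instances (max pooling is the
# other classic MIL pooling choice).
bag_logits = scores.mean(axis=0)
probs = np.exp(bag_logits - bag_logits.max())
probs /= probs.sum()                  # softmax over bag-level logits
pred = int(np.argmax(probs))
```

Mean pooling makes every local region contribute to the scene label, whereas max pooling would let a single highly discriminative region dominate; which is preferable depends on how localized the scene semantics are.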
“…Other approaches based on recurrent neural networks (RNNs) [23], generative adversarial networks (GANs) [24,25], graph convolutional networks (GCNs) [26], and long short-term memory (LSTM) networks [27] have also been introduced. In a recent contribution, the authors treated remote-sensing scene classification as a multiple-instance learning (MIL) problem [28]. They proposed a multiple-instance densely connected network to highlight the local semantics relevant to the scene label.…”
Section: Introduction
confidence: 99%