Remote sensing image scene classification has been widely applied and has attracted increasing attention. Recently, convolutional neural networks (CNNs) have achieved remarkable results in scene classification. However, scene images contain complex semantic relationships between multi-scale ground objects, and the traditional stacked network structure lacks the ability to effectively extract multi-scale and key features, resulting in limited feature representation capabilities. By simulating the way that humans understand and perceive images, attention mechanisms can be beneficial for quickly and accurately acquiring key features. In our study, we propose a channel-attention-based DenseNet (CAD) network for scene classification. Firstly, the lightweight DenseNet121 is selected as the backbone to model the spatial relationships between multi-scale ground objects. In the spatial domain, densely connected CNN layers can extract spatial features at multiple scales and correlate them with each other. Secondly, in the channel domain, a channel attention mechanism is introduced to adaptively strengthen the weights of the important feature channels and to suppress the secondary feature channels. Thirdly, a cross-entropy loss function based on label smoothing is used to reduce the impact of inter-class similarity upon feature representations. The proposed CAD network is evaluated on three public datasets. The experimental results demonstrate that the CAD network can achieve performance comparable to that of other state-of-the-art methods. The visualization produced by the Grad-CAM++ algorithm also reflects the effectiveness of channel attention and the powerful feature representation capabilities of the CAD network.
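The label-smoothing cross-entropy mentioned in the abstract can be sketched as follows. This is a generic illustration of the technique, not the paper's implementation: it uses one common variant that assigns eps/(K-1) probability mass to each wrong class and 1-eps to the true class, so highly similar classes are not forced toward a hard one-hot target.

```python
import math

def label_smoothing_ce(logits, target, eps=0.1):
    """Cross-entropy against a label-smoothed target distribution.

    The one-hot target is softened: the true class gets 1 - eps,
    and each of the K - 1 other classes gets eps / (K - 1).
    """
    K = len(logits)
    # numerically stable log-softmax
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    log_probs = [math.log(e / s) for e in exps]
    # smoothed target distribution
    smooth = [eps / (K - 1)] * K
    smooth[target] = 1.0 - eps
    # cross-entropy H(smooth, softmax(logits))
    return -sum(q * lp for q, lp in zip(smooth, log_probs))
```

With eps = 0 this reduces to the standard cross-entropy; with eps > 0 the loss penalizes overconfident predictions, which is the stated motivation for mitigating inter-class similarity.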
Deep convolutional neural networks have become an indispensable method in remote sensing image scene classification because of their powerful feature extraction capabilities. However, the ability of current models to extract multi-scale features and global features of surface objects in complex scenes is insufficient. We propose a framework based on global context spatial attention (GCSA) and densely connected convolutional networks to extract multi-scale global scene features, called GCSANet. The mixup operation augments remote sensing images by spatially mixing samples, rendering the discrete sample space continuous and improving smoothness in the neighborhood of the data space. The densely connected backbone network extracts the features of multi-scale surface objects and strengthens their internal dense connections. GCSA is introduced into the densely connected backbone network to encode the context information of the remote sensing scene image into the local features. Experiments were performed on four remote sensing scene datasets to evaluate the performance of GCSANet. GCSANet achieved the highest classification precision on the AID and NWPU datasets and the second-best performance on the UCM dataset, indicating that GCSANet can effectively extract the global features of remote sensing images. In addition, GCSANet presents the highest classification accuracy on the constructed mountain image scene dataset. These results reveal that GCSANet can effectively extract multi-scale global scene features from complex remote sensing scenes. The source code of this method can be found at https://github.com/ShubingOuyangcug/GCSANet.
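The mixup operation described above can be illustrated with a minimal sketch. This is the standard mixup formulation (Beta-distributed blending of two samples and their one-hot labels), not code from the GCSANet repository; the function name and signature are illustrative.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup data augmentation: form a convex combination of two
    samples (x1, x2) and their one-hot labels (y1, y2) with a
    coefficient lam drawn from Beta(alpha, alpha).

    The blended pairs fill in the space between discrete training
    samples, which is the smoothness effect the abstract describes.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

A small alpha (e.g. 0.2) concentrates lam near 0 or 1, so most mixed samples stay close to one of the originals while still smoothing the data space.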