Abstract: Deep neural networks (DNNs) face many problems in the very high resolution remote sensing (VHRRS) per-pixel classification field. Among these problems is the fact that, as network depth increases, vanishing gradients degrade classification accuracy, and the correspondingly larger number of parameters to be learned increases the risk of overfitting, especially when only a small number of labeled VHRRS samples is available for training. Further, the hidden layers in DNNs are not tr…
“…High-resolution imagery must be split into patches for CNNs because of Graphics Processing Unit (GPU) memory limitations. Within such a limited area, to make full use of the output features of different convolution layers and achieve a better semantic segmentation effect, researchers often use a multi-depth network model [40] or design a multiple-feature-reuse network in which each layer is connected to all subsequent layers of the same size, enabling the direct use of the hierarchical features in each layer [41]. Emerging networks, such as U-Net [42] and DenseNet [43], have also been applied to remote sensing image semantic segmentation [44]. The application scenario also extends from surface geographic objects to continuous phenomena such as highly dynamic clouds [45].…”
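The feature-reuse idea cited above, where each layer receives every earlier same-size feature map, can be illustrated with a minimal sketch. This is a hypothetical NumPy toy (the `conv3x3` and `dense_block` helpers are illustrative names, not the cited networks' actual layers), showing only the concatenation pattern of DenseNet-style reuse:

```python
import numpy as np

def conv3x3(x, w):
    """Minimal 'same'-padded 3x3 convolution over a (C, H, W) tensor.
    w has shape (C_out, C_in, 3, 3). Illustrative only, not optimized."""
    c_out, c_in, kh, kw = w.shape
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + 3, j:j + 3]           # (C_in, 3, 3) window
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return out

def dense_block(x, weights):
    """Each layer consumes the concatenation of all previous same-size
    feature maps, in the spirit of the feature-reuse network in [41]."""
    feats = [x]
    for w in weights:
        inp = np.concatenate(feats, axis=0)           # reuse every earlier map
        feats.append(np.maximum(conv3x3(inp, w), 0))  # conv + ReLU
    return np.concatenate(feats, axis=0)
```

With a 2-channel input and two layers producing 3 channels each, the block's output stacks 2 + 3 + 3 = 8 channels, which is exactly the hierarchical-feature reuse the snippet describes.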
A comprehensive interpretation of remote sensing images involves not only remote sensing object recognition but also the recognition of spatial relations between objects. Especially in the case of different objects with the same spectrum, the spatial relationship can help interpret remote sensing objects more accurately. Compared with traditional remote sensing object recognition methods, deep learning has the advantages of high accuracy and strong generalizability in scene classification and semantic segmentation. However, it is difficult to simultaneously recognize remote sensing objects and their spatial relationships end-to-end relying only on present deep learning networks. To address this problem, we propose a multi-scale remote sensing image interpretation network, called the MSRIN. The architecture of the MSRIN is a parallel deep neural network based on a fully convolutional network (FCN), a U-Net, and a long short-term memory network (LSTM). The MSRIN recognizes remote sensing objects and their spatial relationships through three processes. First, the MSRIN defines a multi-scale remote sensing image caption strategy and simultaneously segments the same image using the FCN and U-Net at different spatial scales, so that a two-scale hierarchy is formed. The outputs of the FCN and U-Net are masked to obtain the locations and boundaries of remote sensing objects. Second, using an attention-based LSTM, the remote sensing image captions describe the remote sensing objects (nouns) and their spatial relationships in natural language. Finally, we designed a remote sensing object recognition and correction mechanism that builds the relationship between nouns in captions and object mask graphs using an attention weight matrix, transferring the spatial relationship from captions to object mask graphs.
In other words, the MSRIN simultaneously realizes the semantic segmentation of remote sensing objects and the identification of their spatial relationships end-to-end. Experimental results demonstrated that the matching rate between samples and the mask graph increased by 67.37 percentage points, and the matching rate between nouns and the mask graph increased by 41.78 percentage points compared to before correction. The proposed MSRIN has achieved remarkable results. Existing studies do not address the interpretation of spatial relationships between remote sensing objects, which limits the understanding of remote sensing objects, especially when different objects have the same spectrum. The phenomenon of different objects with the same spectrum is quite common in remote sensing. It is difficult to identify objects only by their own textures, spectra, and shape information; object identification requires multi-scale semantic information and spatially adjacent objects to assist in decision-making. The spatial relationship between remote sensing objects is of great significance to their recognition when different objects have the same spectru...
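The correction mechanism described above links caption nouns to mask regions via attention weights. A minimal sketch of that linking step, assuming an attention matrix with one row per generated word over R mask regions (the function name and the argmax rule are simplifications for illustration, not the MSRIN's exact procedure):

```python
import numpy as np

def link_nouns_to_masks(attn, noun_positions):
    """attn: (T, R) array — one row of attention weights per caption word
    over R candidate mask regions. Each noun token is assigned to the
    region receiving its highest attention weight, so the spatial
    relationship stated in the caption can be transferred to that mask."""
    links = {}
    for t in noun_positions:                 # indices of noun tokens
        links[t] = int(np.argmax(attn[t]))  # region carrying the noun's label
    return links
```

For example, with nouns at word positions 0 and 2 whose attention rows peak on region 1, both nouns are attached to mask region 1, and the caption's relational phrase between them is assigned to that region pair.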
“…Multi-scale feature studies often use multi-scale convolution kernels to extract image features in parallel. For example, the sliding window covers only one pixel when the 1 × 1 filter is applied to each channel, which means it focuses more on continuous spectral characteristics, while the 3 × 3 and 5 × 5 kernels can focus on different potential local spatial structures due to their different receptive field sizes [26]. Another method applies atrous convolution with a 3 × 3 kernel and rates of r = 1, 2, and 4 for the target pixel, and has been used for urban land use and land cover classification [27].…”
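The atrous (dilated) convolution mentioned above enlarges the receptive field without adding weights by sampling inputs `rate` steps apart. A minimal 1-D NumPy sketch of this mechanism (a toy helper for illustration, not the cited 2-D implementation):

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution: the k-tap kernel w samples inputs
    `rate` steps apart. With a 3-tap kernel, rates 1, 2, and 4 give
    effective receptive fields of 3, 5, and 9 samples — the same trade-off
    as the r = 1, 2, 4 scheme in [27]. Returns 'valid' outputs only."""
    k = len(w)
    span = (k - 1) * rate + 1  # effective receptive field of the kernel
    return np.array([
        sum(w[j] * x[i + j * rate] for j in range(k))
        for i in range(len(x) - span + 1)
    ])
```

With `rate=2`, output position 0 sums inputs 0, 2, and 4, so the same three weights cover a 5-sample span instead of 3.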
Section: Multi-scale
“…The multi-scale concept in this study refers to the relationship between the local and the global, that is, between a small local area and a larger area within a certain neighborhood. However, current multi-scale research focuses on multi-scale feature extraction and fusion within the same training sample image [24][25][26][27][28][29][30][31]. There are few studies on the semantic analysis of variable regions across different spatial extents [32][33][34][35].…”
Remote sensing image captioning involves remote sensing objects and their spatial relationships. However, it is still difficult to determine the spatial extent of a remote sensing object and the size of a sample patch. If the patch size is too large, it includes too many remote sensing objects and overly complex spatial relationships, which increases the computational burden of the image captioning network and reduces its precision. If the patch size is too small, it often fails to provide enough environmental and contextual information, making the remote sensing object difficult to describe. To address this problem, we propose a multi-scale semantic long short-term memory network (MS-LSTM). The remote sensing images are paired into image patches at different spatial scales. First, for the large-scale patches, we use a Visual Geometry Group (VGG) network to extract features and input them into the improved MS-LSTM network as semantic information. This provides a larger receptive field and more contextual semantic information for small-scale image captioning, serving as a global perspective and enabling the accurate identification of small-scale samples with the same features. Second, a small-scale patch is used to highlight remote sensing objects and simplify their spatial relations. In addition, the multiple receptive fields provide perspectives from local to global. The experimental results demonstrated that, compared with the original long short-term memory network (LSTM), the MS-LSTM’s Bilingual Evaluation Understudy (BLEU) score increased by 5.6% to 0.859, reflecting that the MS-LSTM has a more comprehensive receptive field, which provides richer semantic information and enhances the remote sensing image captions.
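The core idea above, feeding large-scale (contextual) features into the caption decoder alongside small-scale inputs, can be sketched as a single LSTM step whose input is the concatenation of the word/patch feature and a fixed context vector. This is a hedged toy (plain NumPy, hypothetical shapes; the MS-LSTM's actual gating and VGG features are not reproduced here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_with_context(x_t, ctx, h, c, W, b):
    """One LSTM step where the large-scale context vector ctx (standing in
    for VGG features of the surrounding patch) is concatenated with the
    small-scale input x_t, so every decoding step sees the global semantics.
    W: (4H, len(x_t) + len(ctx) + H); gate order: input, forget, cell, output."""
    z = W @ np.concatenate([x_t, ctx, h]) + b
    H = h.size
    i, f, g, o = z[:H], z[H:2 * H], z[2 * H:3 * H], z[3 * H:]
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # gated cell update
    h_new = sigmoid(o) * np.tanh(c_new)               # gated hidden output
    return h_new, c_new
```

The design choice is that `ctx` is computed once per image pair and re-fed at every step, which is how a fixed global perspective can condition each generated caption word.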
“…Recently, several CNN-based semantic segmentation methods have been used for building extraction from earth observation images [6][7][8]. Patch-based CNN methods [9][10][11][12][13] were initially adopted for prediction in dense urban areas. These patch-based CNNs label the center pixel by processing an image patch through a neural network.…”
Automated methods to extract buildings from very high resolution (VHR) remote sensing data have applications in a wide range of fields. Many convolutional neural network (CNN) based methods have been proposed and have achieved significant advances in the building extraction task. To refine predictions, many recent approaches fuse features from earlier layers of CNNs to introduce abundant spatial information, a strategy known as skip connection. However, reusing earlier features directly without processing can reduce the performance of the network. To address this problem, we propose a novel fully convolutional network (FCN) that adopts attention-based re-weighting to extract buildings from aerial imagery. Specifically, we consider the semantic gap between features from different stages and leverage the attention mechanism to bridge the gap prior to the fusion of features. The inferred attention weights along the spatial and channel-wise dimensions make the low-level feature maps adaptive to the high-level feature maps in a target-oriented manner. Experimental results on three publicly available aerial imagery datasets show that the proposed model (RFA-UNet) achieves comparable and improved performance compared to other state-of-the-art models for building extraction.
…to detect large objects [14,15]. Since Long et al. [16] adapted the classification network into a fully convolutional network (FCN) for semantic segmentation, FCNs and their extensions have gradually become the preferred solution in the field of semantic labeling [17][18][19][20]. Though FCN-based methods can produce dense pixel-wise output directly, the pixel-wise classification derived from the final score map is quite coarse because of the sequential sub-sampling operations in the FCN. To address the problem of coarse predictions, recent research [21][22][23][24][25][26] has further improved FCN-based methods for semantic labeling of remote sensing images.
A growing body of literature [27][28][29][30][31] employs the encoder-decoder architecture with skip connections. UNet [32], a typical encoder-decoder model, reuses low-level information to refine the output, resulting in better performance. To obtain accurate labeling of VHR images, an effective structure is needed to integrate the high-resolution, low-level features with the low-resolution, high-level features. The skip connection fuses features to compensate for the loss of spatial information caused by repeated local operations (e.g., pooling and strided convolution). Features passed via skip connections are multi-scale in nature due to the increasingly large receptive field sizes [33]. However, most existing approaches built on top of a contemporary classification network are good at aggregating global contexts. While the reuse of information from early encoding layers contributes to localization in the decoding phase, it may introduce redundant information which results in over-segmentation [3...
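The attention-based re-weighting of skip connections described above can be sketched minimally: derive a per-channel gate from the high-level features and apply it to the low-level skip features before fusing them. This is a hedged NumPy toy under simplifying assumptions (both maps share shape `(C, H, W)`; RFA-UNet's actual spatial attention branch and learned layers are omitted):

```python
import numpy as np

def attention_reweighted_skip(low, high):
    """Re-weight a low-level skip feature map (C, H, W) with a channel
    attention gate derived from the high-level map, then fuse by
    concatenation — a sketch of target-oriented re-weighting before
    skip-connection fusion, not RFA-UNet's exact layers."""
    gap = high.mean(axis=(1, 2))                       # global average pool
    w = 1.0 / (1.0 + np.exp(-gap))                     # sigmoid gate per channel
    reweighted = low * w[:, None, None]                # adapt low-level features
    return np.concatenate([reweighted, high], axis=0)  # fuse for the decoder
```

The point of gating before concatenation is that channels of the low-level map which conflict with the high-level semantics are suppressed, rather than passed through unprocessed, which is the failure mode the snippet attributes to plain skip connections.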