Person re-identification has made great progress in the visible domain, where all person images are captured by visible-light cameras. However, in a 24-hour intelligent surveillance system, visible cameras may be ineffective at night. In this situation, thermal cameras are the best supplement, since they capture images without depending on visible light. Therefore, in this paper, we investigate the visible-thermal cross-modality person re-identification (VT Re-ID) problem. VT Re-ID poses two difficult problems that must be handled well: cross-modality discrepancy and intra-modality variations. To address these two issues, we propose enhancing discriminative feature learning (EDFL) with two extremely simple means that target two core aspects: (1) a skip connection that incorporates mid-level features to make the person features more discriminative and robust, and (2) a dual-modality triplet loss that guides training by simultaneously considering the cross-modality discrepancy and intra-modality variations. Additionally, a two-stream CNN structure is adopted to learn multi-modality sharable person features. Experimental results on two datasets show that the proposed EDFL approach outperforms state-of-the-art methods by large margins, demonstrating its effectiveness in enhancing discriminative feature learning for VT Re-ID.
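The abstract above does not give the formula for the dual-modality triplet loss, so the following is only a minimal sketch of one plausible reading: a standard margin-based triplet term is computed once within a modality and once across modalities, and the two are summed. The function names, the margin of 0.3, the Euclidean distance, and the toy 2-D features are all assumptions made for illustration, not the authors' implementation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_term(anchor, positive, negative, margin=0.3):
    """Standard margin-based triplet loss for a single triplet."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def dual_modality_triplet(anchor_v, pos_v, neg_v, pos_t, neg_t, margin=0.3):
    """Hypothetical combination: intra-modality term plus cross-modality term.

    anchor_v, pos_v, neg_v: visible-image features (same / different identity).
    pos_t, neg_t: thermal-image features (same / different identity).
    """
    intra = triplet_term(anchor_v, pos_v, neg_v, margin)   # visible vs. visible
    cross = triplet_term(anchor_v, pos_t, neg_t, margin)   # visible vs. thermal
    return intra + cross

# Toy features: identities are well separated in both modalities.
easy = dual_modality_triplet([0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 0.2], [0.0, 1.0])
# Toy features: the thermal positive is farther away than the thermal negative.
hard = dual_modality_triplet([0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.2])
print(easy, hard)
```

In this sketch the well-separated case contributes no loss, while the case with a modality gap is penalized only through the cross-modality term, which is the behavior the abstract's description suggests.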
Currently, an increasing number of convolutional neural networks (CNNs) focus specifically on capturing contextual features (con. feat) to improve performance in semantic segmentation tasks. However, high-level con. feat are biased towards encoding features of large objects, disregard spatial details, and have a limited capacity to discriminate between easily confused classes (e.g., trees and grasses). We therefore incorporate low-level features (low. feat) and class-specific discriminative features (dis. feat) to boost model performance further, with low. feat helping the model recover spatial information and dis. feat reducing class confusion during segmentation. To this end, we propose a novel deep multi-feature learning framework for the semantic segmentation of very high-resolution remote sensing images (VHR RSIs), dubbed MFNet. The proposed MFNet adopts a multi-feature learning mechanism to learn more complete features, including con. feat, low. feat, and dis. feat. More specifically, aside from a widely used context aggregation module for capturing con. feat, we append two branches for learning low. feat and dis. feat. One learns low. feat at a shallow layer of the backbone network through local contrast processing, while the other groups con. feat and then optimizes each class individually to generate dis. feat with better inter-class discriminative capability. Extensive quantitative and qualitative evaluations demonstrate that the proposed MFNet outperforms most state-of-the-art models on the ISPRS Vaihingen and Potsdam datasets. In particular, thanks to the multi-feature learning mechanism, our model achieves an overall accuracy of 91.91% on the Potsdam test set with VGG16 as the backbone, performing favorably against advanced models with ResNet101.
Scene parsing of high-resolution remote-sensing images (HRRSIs) refers to parsing different semantic regions from the images and is an important fundamental task in image understanding. However, due to the inherent complexity of urban scenes, HRRSIs contain numerous object classes. These objects vary widely in scale and have irregular morphological structures. Furthermore, their spatial distribution is uneven, and the images contain substantial spatial detail. All of these characteristics make it difficult to parse urban scenes accurately. To address these challenges, in this paper, we propose a multi-branch adaptive hard region mining network (MBANet) for urban scene parsing of HRRSIs. MBANet consists of three branches: a multi-scale semantic branch, an adaptive hard region mining (AHRM) branch, and an edge branch. First, the multi-scale semantic branch is constructed based on a feature pyramid network (FPN). To reduce the memory footprint, ResNet50 is chosen as the backbone, which, combined with the atrous spatial pyramid pooling module, can extract rich multi-scale contextual information effectively, thereby enhancing object representation at various scales. Second, the AHRM branch is proposed to enhance the feature representation of hard regions with a complex distribution, which would otherwise be difficult to parse. Third, the edge branch is introduced to supervise boundary perception training so that the contours of objects can be better captured. In our experiments, the three branches complemented each other in feature extraction and achieved state-of-the-art performance for urban scene parsing of HRRSIs. We also performed ablation studies on two HRRSI datasets from ISPRS and compared the results with those of other methods.
Currently, the most advanced high-resolution remote sensing image (HRRSI) semantic labeling methods rely on deep neural networks. However, HRRSIs naturally have a serious class imbalance problem that is not yet well solved by current methods. The cross-entropy (CE) loss is often used to guide the training of semantic labeling neural networks for HRRSIs, but it is essentially dominated by the majority classes in the image, resulting in poor predictions for the minority classes. Focal Loss (FL) effectively suppresses the negative impact of class imbalance in dense object detection by redistributing the loss of each sample based on the prediction results. In this paper, we thoroughly analyze the inadequacy of FL for semantic labeling: while suppressing the loss of well-classified examples, it inevitably introduces confusing-classified examples that are more difficult to classify. Therefore, following the core idea of FL, we redefine the hard examples in semantic labeling of HRRSIs and propose the prediction confusion map (PCM) to measure classification difficulty. Based on this, we further propose the Calibrated Focal Loss (CFL) for semantic labeling of HRRSIs. We then conduct extensive experiments on the International Society for Photogrammetry and Remote Sensing (ISPRS) Vaihingen and Potsdam datasets to analyze the semantic labeling performance, model uncertainty, and confidence calibration of different loss functions. Experimental results show that CFL achieves outstanding results compared to other commonly used loss functions without increasing model parameters or training iterations, demonstrating the effectiveness of our method. Finally, combined with our previously proposed HCANet, we further verify the effectiveness of CFL on state-of-the-art network structures.
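To make the "redistributing the loss of each sample" idea concrete, here is a minimal sketch of the standard focal loss compared with plain cross-entropy for a single pixel. It illustrates only the original FL, not the Calibrated Focal Loss or the prediction confusion map proposed in the abstract, whose definitions are not given here. The focusing parameter gamma = 2 and the sample probabilities are assumptions chosen for illustration.

```python
import math

def cross_entropy(p_true: float) -> float:
    """CE loss for one sample, given the predicted probability of its true class."""
    return -math.log(p_true)

def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Standard focal loss: CE scaled by (1 - p_true) ** gamma.

    Confident, correct predictions (p_true close to 1) are down-weighted,
    so poorly classified samples contribute relatively more to the total loss.
    """
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)

# Compare the two losses for a well-classified, a borderline, and a hard sample.
for p in (0.9, 0.6, 0.2):
    ce, fl = cross_entropy(p), focal_loss(p)
    print(f"p_true={p:.1f}  CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.4f}")
```

With gamma = 2, the scaling factor is about 0.01 for a sample predicted at 0.9 and 0.64 for a sample predicted at 0.2, which is how FL shifts the training signal away from well-classified examples. Setting gamma to 0 recovers plain cross-entropy.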