Semantic image segmentation, one of the most popular tasks in computer vision, is widely used in autonomous driving, robotics, and other fields. Currently, deep convolutional neural networks (DCNNs) drive major advances in semantic segmentation thanks to their powerful feature representations. However, DCNNs extract high-level features through strided convolutions, which makes it difficult to segment foreground objects precisely, especially when locating object boundaries. This paper presents a novel semantic segmentation algorithm that combines DeepLab v3+ with the quick shift superpixel segmentation algorithm. DeepLab v3+ is employed to generate a class-indexed score map for the input image, while quick shift segments the same image into superpixels. Both outputs are then fed into a class voting module that refines the semantic segmentation result. Extensive experiments on the PASCAL VOC 2012 dataset show that the proposed method provides a more efficient solution.
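A minimal sketch of the class-voting step is shown below, assuming a score map precomputed by DeepLab v3+ (or any segmentation network); the `superpixel_class_voting` function name and the quick shift parameters are illustrative choices, not values from the paper.

```python
import numpy as np
from skimage.segmentation import quickshift

def superpixel_class_voting(image, score_map):
    """Refine a per-pixel class map by majority voting inside quick shift superpixels.

    image: H x W x 3 float array in [0, 1]
    score_map: H x W x C class scores (e.g., from DeepLab v3+)
    """
    labels = np.argmax(score_map, axis=-1)            # per-pixel class indices
    segments = quickshift(image, kernel_size=3, max_dist=6, ratio=0.5)
    refined = labels.copy()
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        # assign the most frequent class within the superpixel to all of its pixels,
        # snapping the prediction to superpixel (and hence object-boundary) edges
        refined[mask] = np.bincount(labels[mask]).argmax()
    return refined
```

Because quick shift superpixels tend to adhere to image edges, the vote transfers those edges to the class map, which is how the refinement sharpens object boundaries.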
The widespread application of Convolutional Neural Network (CNN)-based remote sensing image scene classification is severely hampered by the lack of large-scale datasets with clean annotations. Data crawled from the Internet or other sources allows existing datasets to be expanded rapidly and at low cost. However, training directly on such an expanded dataset can cause the network to overfit to noisy labels. Traditional methods typically divide the noisy dataset into multiple parts and fine-tune the network on each part separately to improve performance. These approaches are inefficient and sometimes even hurt performance. To address these problems, this study proposes a novel noisy label distillation (NLD) method based on an end-to-end teacher-student framework. First, unlike general knowledge distillation methods, NLD does not require pre-training on clean or noisy data. Second, NLD effectively distills knowledge from labels across the full range of noise levels for better performance. In addition, NLD can serve as a model distillation method on a fully clean dataset to improve the student classifier's performance. NLD is evaluated on three remote sensing image datasets (UC Merced Land-use, NWPU-RESISC45, and AID) into which a variety of noise patterns and noise amounts are injected. Experimental results show that NLD outperforms widely used direct fine-tuning methods and remote sensing pseudo-labeling methods.
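To make the teacher-student idea concrete, here is a generic distillation objective, a sketch only: the exact NLD loss and how it weights clean versus noisy labels are defined in the paper, and the temperature `T` and mixing weight `alpha` below are conventional defaults, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, noisy_labels, T=4.0, alpha=0.5):
    """Generic teacher-student distillation loss (illustrative, not the exact NLD objective).

    Combines a soft KL term against the teacher's temperature-smoothed predictions
    with a hard cross-entropy term against the (possibly noisy) labels.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, noisy_labels)
    return alpha * soft + (1.0 - alpha) * hard
```

The soft term lets the teacher's predictions dampen the influence of mislabeled samples, which is the mechanism an end-to-end framework like NLD exploits without any separate pre-training stage.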
Most object detection methods for remote sensing images depend on a large amount of high-quality labeled training data. However, due to the slow acquisition cycle of remote sensing images and the difficulty of labeling, many types of data samples are scarce. This makes few-shot object detection an urgent and necessary research problem. In this paper, we introduce a remote sensing few-shot object detection method based on text semantic fusion relation graph reasoning (TSF-RGR), which learns various types of relationships from common sense knowledge in an end-to-end manner, thereby empowering the detector to reason over all classes. Specifically, based on the region proposals provided by the basic detection network, we first build a corpus containing a large number of text language descriptions, such as object attributes and relations, which are used to encode the corresponding common sense embeddings for each region. Then, graph structures are constructed between regions to propagate and learn key spatial and semantic relationships. Finally, a joint relation reasoning module is proposed to actively enhance the reliability and robustness of few-shot object feature representations by focusing on the degree of influence of different relations. Our TSF-RGR is lightweight and easy to extend, and it can incorporate any form of common sense information. Extensive experiments show that the introduced text information delivers substantial performance gains over the baseline model. Compared with other few-shot detectors, the proposed method achieves state-of-the-art performance for different shot settings and obtains highly competitive results on two benchmark datasets (NWPU VHR-10 and DIOR).
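The graph-reasoning step can be pictured as message passing over region proposals. The layer below is a minimal sketch under assumed shapes (N regions, D-dimensional features fused with the common-sense embeddings); the class name and the softmax-normalized aggregation are illustrative stand-ins for the paper's joint relation reasoning module.

```python
import torch
import torch.nn as nn

class RelationGraphLayer(nn.Module):
    """One round of message passing over region proposals (illustrative sketch).

    node_feats: N x D region features fused with text/common-sense embeddings
    adj:        N x N relation weights (e.g., spatial or semantic affinities)
    """
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # normalize relation weights so each region aggregates a convex
        # combination of its neighbors, weighted by relation strength
        weights = torch.softmax(adj, dim=-1)
        messages = weights @ self.proj(node_feats)   # propagate neighbor features
        return torch.relu(node_feats + messages)     # residual update per region
```

Stacking a few such layers lets evidence from frequently seen classes flow, via the relation graph, into the representations of few-shot classes.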
With the rapid development of remote sensing technology, remote sensing image registration plays an important role in the assessment of various natural disasters, especially earthquakes. However, multi-temporal remote sensing images used for such assessment exhibit characteristics, e.g., large scale differences and rotations, that make registration challenging. In order to better register remote sensing images, we propose a new image registration method with a deep learning feature matching strategy. We first extract the pre-match point sets M and S using SIFT with FLANN (the Fast Library for Approximate Nearest Neighbors). Second, we filter the correct matching point pairs from M and S using a multiscale neighborhood information network and a dual-path ConvNeXt network with self-attention-guided local information enhancement. Third, we register the multi-temporal remote sensing images by solving the model parameters of the spatial transformation. Finally, we evaluate our proposed method on a variety of remote sensing images from different phases, including visible light images with different illumination, scale, and geometry changes. On a remote sensing image dataset containing pre- and post-earthquake images, we compare our method to existing state-of-the-art methods and report results with evaluation indexes such as Root Mean Square Error (RMSE). The results show that our method achieves higher registration accuracy and greater robustness for multi-temporal remote sensing registration.
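A compact sketch of the SIFT-FLANN pre-matching and transformation-solving stages follows, using OpenCV. Note the hedge: RANSAC stands in here for the paper's learned match-filtering networks (the multiscale neighborhood network and dual-path ConvNeXt), and the ratio threshold and FLANN parameters are common defaults, not the paper's.

```python
import cv2
import numpy as np

def sift_flann_register(img_ref, img_sen, ratio=0.75):
    """SIFT + FLANN pre-matching, then a homography fit.

    RANSAC replaces the paper's learned match filters in this sketch.
    """
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_ref, None)
    kp2, des2 = sift.detectAndCompute(img_sen, None)
    flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})
    matches = flann.knnMatch(des1, des2, k=2)
    # Lowe's ratio test produces the pre-match pairs
    good = [m for m, n in matches if m.distance < ratio * n.distance]
    M = np.float32([kp1[m.queryIdx].pt for m in good])   # pre-match set M
    S = np.float32([kp2[m.trainIdx].pt for m in good])   # pre-match set S
    # solve the spatial transformation parameters from the filtered pairs
    H, inliers = cv2.findHomography(M, S, cv2.RANSAC, 5.0)
    return H, inliers
```

In the proposed pipeline, the learned filtering stage replaces the ratio test and RANSAC inlier selection, which is where the robustness to large scale and rotation differences comes from.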
Supervised learning models require large-scale datasets for effective training. However, building a large-scale dataset requires time-consuming data collection and highly complicated preprocessing, posing a significant challenge for researchers. Creating such a dataset for each practical application is therefore unachievable in a real-world setting. In this case, unsupervised domain adaptation is crucial for practical applications, as it enables models to improve their performance on unlabeled data by training on labeled data. For multi-modal egocentric video analysis, some models have used unsupervised domain adaptation and achieved outstanding performance. However, they use either early or late fusion, ignoring the correlations within and between multi-modal inputs. Therefore, this paper investigates different fusion architectures and proposes a cascade attentional fusion method to improve feature representation for unsupervised domain adaptation in multi-modal egocentric video analysis. First, we propose a cascade fusion architecture for increased audio signal reuse. Then, we propose a temporal-spatial attention mechanism for highlighting spatio-temporal feature representations. Third, we propose a novel cascade attentional fusion method for multi-modal egocentric video data fusion by combining the architecture and attention mechanism described above. In addition, we study how the different attention mechanisms can be integrated. Finally, we propose an adversarial domain alignment model that incorporates the proposed fusion for unsupervised domain adaptation in multi-modal egocentric video analysis, achieving state-of-the-art performance on a public dataset.
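The sketch below illustrates one plausible form of a temporal-spatial attention module over clip features; it is an assumption-laden toy, not the paper's exact design. It assumes B x T x D inputs, uses a softmax over time for the temporal branch, and a channel gate as a simple proxy for the spatial branch.

```python
import torch
import torch.nn as nn

class TemporalSpatialAttention(nn.Module):
    """Illustrative temporal-spatial attention over video features (a sketch).

    feats: B x T x D clip features from a per-modality backbone.
    """
    def __init__(self, dim):
        super().__init__()
        self.temporal_score = nn.Linear(dim, 1)                 # which time steps matter
        self.channel_gate = nn.Sequential(nn.Linear(dim, dim),  # which channels matter
                                          nn.Sigmoid())

    def forward(self, feats):
        t_weights = torch.softmax(self.temporal_score(feats), dim=1)  # B x T x 1
        pooled = (t_weights * feats).sum(dim=1)     # temporally attended summary, B x D
        return self.channel_gate(pooled) * pooled   # re-weight feature channels
```

In a cascade fusion, the audio features would be injected at several such stages rather than once at the start (early fusion) or once at the end (late fusion), increasing audio signal reuse as the abstract describes.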