One-shot video-based person re-identification exploits unlabeled data by using a single labeled sample for each individual to train a model, reducing the need for laborious labeling. Although recent works on this task have made progress, most state-of-the-art models remain vulnerable to misalignment, pose variation and corrupted frames. To address these challenges, we propose a one-shot video-based person re-identification model based on pose-guided spatial alignment and key frame selection (KFS). First, a spatial transformer sub-network trained with pose-guided regression performs the spatial alignment. Second, we propose a novel training strategy based on KFS: key frames with abruptly changing poses are deliberately identified and selected to make the network adaptive to pose variation. Finally, we propose a frame feature pooling method that incorporates long short-term memory with an attention mechanism to reduce the influence of corrupted frames. Comprehensive experiments are presented on the MARS and DukeMTMC-VideoReID datasets. The mAP values reach 46.5% and 68.4%, respectively, demonstrating that the proposed model achieves significant improvements over state-of-the-art one-shot person re-identification methods.
INDEX TERMS Person re-identification, one-shot learning, spatial alignment, key frame selection, frame feature pooling.
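The attention-weighted frame feature pooling described above can be sketched in miniature as follows. This is a minimal illustration, not the paper's implementation: the LSTM that produces per-frame attention scores is omitted, and the scores are simply passed in as inputs; function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frame_features, frame_scores):
    """Pool per-frame feature vectors into one clip-level vector.

    Each frame is weighted by a softmax over its attention score, so a
    corrupted frame given a low score contributes little to the result.
    frame_features: list of equal-length feature vectors, one per frame.
    frame_scores: one scalar score per frame (here assumed to come from
    an upstream scorer such as an LSTM, which this sketch omits).
    """
    weights = softmax(frame_scores)
    dim = len(frame_features[0])
    pooled = [0.0] * dim
    for w, feat in zip(weights, frame_features):
        for i, x in enumerate(feat):
            pooled[i] += w * x
    return pooled, weights
```

With three frames where the third is "corrupted" and scored low, its weight collapses toward zero and the pooled vector is dominated by the two clean frames.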
Semantic segmentation plays a critical role in image understanding. Recently, Fully Convolutional Network (FCN)-based models have made significant progress in semantic segmentation. However, fully utilizing contextual information and recovering lost spatial details remain major challenges. In this paper, we present a semantic segmentation model based on pyramid context contrast and a subpixel-aware dense decoder. First, a pyramid context contrast module exploits contextual information by aggregating multi-scale foreground representations against different background regions. Then, a subpixel-aware dense decoder architecture reuses features extracted at different decoder levels via pixel shuffle, which resolves the resolution inconsistency between feature maps. Next, a boundary refinement branch with auxiliary supervision refines object boundaries by exploiting the spatial visual information in low-level features. The presented model was evaluated on the PASCAL VOC 2012 semantic segmentation benchmark and achieved a score of 86.9%, demonstrating considerable improvement over most state-of-the-art models.
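The pixel shuffle operation that the subpixel-aware decoder relies on can be sketched as follows. This is a plain-Python illustration of the standard sub-pixel rearrangement (as in PyTorch's `PixelShuffle`), not the paper's decoder; it trades channel depth for spatial resolution without any interpolation.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor of nested lists into (C, H*r, W*r).

    Standard sub-pixel convolution layout:
        out[c][h*r + i][w*r + j] = x[c*r*r + i*r + j][h][w]
    so each group of r*r channels is folded into an r-by-r spatial block,
    upsampling the feature map by factor r without interpolation.
    """
    cr2 = len(x)
    H, W = len(x[0]), len(x[0][0])
    assert cr2 % (r * r) == 0, "channel count must be divisible by r*r"
    C = cr2 // (r * r)
    out = [[[0.0] * (W * r) for _ in range(H * r)] for _ in range(C)]
    for c in range(C):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for h in range(H):
                    for w in range(W):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out
```

For example, four 1x1 channels with r = 2 fold into a single 2x2 map, which is why decoder features at different resolutions can be aligned this way instead of by bilinear upsampling.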