Person re-identification (ReID) aims to identify the same person across different cameras. It is a challenging task due to large variations in person pose, occlusion, background clutter, etc. How to extract powerful features is a fundamental problem in ReID and remains open today. In this paper, we design a Multi-Scale Context-Aware Network (MSCAN) to learn powerful features over the full body and body parts, which captures local context knowledge by stacking multi-scale convolutions in each layer. Moreover, instead of using predefined rigid parts, we propose to learn and localize deformable pedestrian parts using Spatial Transformer Networks (STN) with novel spatial constraints. The learned body parts alleviate some difficulties in part-based representation, e.g., pose variations and background clutter. Finally, we integrate the representation learning of the full body and body parts into a unified framework for person ReID through multi-class person identification tasks. Extensive evaluations on challenging large-scale person ReID datasets, including the image-based Market1501 and CUHK03 datasets and the sequence-based MARS dataset, show that the proposed method achieves state-of-the-art results. [Figure: comparison of feature-learning architectures (full body, rigid body parts, and ours), where MSCAN trunks with latent part localization produce per-part features that are concatenated and fed through FC layers.]
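To make the multi-scale convolution idea concrete, here is a minimal PyTorch sketch of an MSCAN-style layer; the channel counts and dilation rates are our own illustrative assumptions, not the paper's exact configuration. Parallel dilated convolutions at several rates are applied to the same input and their outputs are concatenated, so each layer aggregates context at multiple scales.

```python
import torch
import torch.nn as nn

class MultiScaleConvBlock(nn.Module):
    """One MSCAN-style layer: parallel dilated 3x3 convolutions whose
    outputs are concatenated, so each layer sees multi-scale context.
    Channel counts and dilation rates are illustrative assumptions."""
    def __init__(self, in_channels, branch_channels, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x):
        # Concatenate the multi-scale responses along the channel axis.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Usage: stack several such blocks to form an MSCAN-like trunk.
block = MultiScaleConvBlock(in_channels=32, branch_channels=32)
features = block(torch.randn(1, 32, 160, 64))  # -> (1, 96, 160, 64)
```

Because each dilated branch uses matching padding, the spatial resolution is preserved, which keeps the concatenation well-defined when blocks are stacked.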
Most existing methods for pedestrian attribute recognition in video surveillance formulate the task as multi-label image classification, while attribute localization is usually disregarded due to low image quality and large variations in camera viewpoint and human pose. In this paper, we propose a weakly-supervised learning based approach that performs multi-attribute classification and localization simultaneously, without the need for bounding box annotations of attributes. Firstly, a set of mid-level attribute features is discovered by a multi-scale attribute-aware module that receives the outputs of multiple inception layers in a deep Convolutional Neural Network (CNN), e.g., GoogLeNet, where a Flexible Spatial Pyramid Pooling (FSPP) operation is performed to acquire the activation maps of attribute features. Subsequently, attribute labels are predicted through a fully-connected layer that performs regression between the response magnitudes in the activation maps and the image-level attribute annotations. Finally, the locations of pedestrian attributes can be inferred by fusing the multiple activation maps, where the fusion weights are estimated as the correlation strengths between attributes and their relevant mid-level features. To validate the proposed approach, extensive experiments are performed on the two currently largest pedestrian attribute datasets, i.e., the PETA dataset and the RAP dataset.
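As an illustration of the classification branch described above, the following PyTorch sketch shows one plausible form of an FSPP-style head; the class name, bin layout, and sizes are assumptions for illustration, not the paper's exact design. Activation maps from a mid-level layer are max-pooled over spatial pyramid bins, and a fully-connected layer regresses the pooled responses to image-level attribute scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FSPPHead(nn.Module):
    """Sketch of a flexible spatial pyramid pooling head (names and bin
    layout are assumptions): activation maps are max-pooled over pyramid
    bins, then a fully-connected layer regresses the pooled responses
    to image-level attribute scores."""
    def __init__(self, in_channels, num_attributes, pyramid=(1, 3)):
        super().__init__()
        self.pyramid = pyramid  # e.g. a global bin plus 3 vertical bins
        self.fc = nn.Linear(in_channels * sum(pyramid), num_attributes)

    def forward(self, maps):
        feats = []
        for bins in self.pyramid:
            # Vertical bins suit pedestrians (roughly head/torso/legs).
            pooled = F.adaptive_max_pool2d(maps, output_size=(bins, 1))
            feats.append(pooled.flatten(1))
        return self.fc(torch.cat(feats, dim=1))  # attribute logits

# Training needs only image-level labels, e.g. with a BCE loss:
head = FSPPHead(in_channels=512, num_attributes=35)
logits = head(torch.randn(8, 512, 14, 14))
loss = F.binary_cross_entropy_with_logits(logits, torch.rand(8, 35).round())
```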
State-of-the-art methods treat pedestrian attribute recognition as a multi-label image classification problem. In previous work, the location information of person attributes is usually eliminated, or simply encoded by rigidly splitting the whole body. In this paper, we formulate the task in a weakly-supervised attribute localization framework. Based on GoogLeNet, firstly, a set of mid-level attribute features is discovered by newly designed detection layers, where a max-pooling based weakly-supervised object detection technique is used to train these layers with only image-level labels, without the need for bounding box annotations of pedestrian attributes. Secondly, attribute labels are predicted by regression of the detection response magnitudes. Finally, the locations and rough shapes of pedestrian attributes can be inferred by performing clustering on a fusion of the activation maps of the detection layers, where the fusion weights are estimated as the correlation strengths between each attribute and its relevant mid-level features. Extensive experiments are performed on the two currently largest pedestrian attribute datasets, i.e., the PETA dataset and the RAP dataset. Results show that the proposed method achieves competitive performance on attribute recognition compared to other state-of-the-art methods. Moreover, the results of attribute localization are visualized to illustrate the characteristics of the proposed method.
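The localization step can be illustrated with a short sketch; the tensor shapes and the clamping of negative weights are our assumptions. The idea is that the FC weights regressing detector responses to an attribute serve as correlation strengths, so a weighted fusion of the detection-layer activation maps yields a map whose high-response regions indicate that attribute's location.

```python
import torch

def attribute_activation_map(det_maps, fc_weight, attr_idx):
    """Sketch of the fusion-based localization step (shapes assumed):
    det_maps  -- (K, H, W) activation maps of K mid-level detectors,
                 whose max-pooled responses were regressed to attribute
                 labels by a fully-connected layer.
    fc_weight -- (num_attributes, K) weights of that FC layer; weight
                 magnitude acts as the correlation strength between an
                 attribute and each mid-level feature.
    Returns a fused (H, W) map whose high-response regions indicate the
    attribute's location; a clustering step (omitted here) would then
    extract its rough shape."""
    weights = fc_weight[attr_idx].clamp(min=0)       # keep positive correlations
    fused = torch.einsum('k,khw->hw', weights, det_maps)
    fused = fused - fused.min()
    return fused / (fused.max() + 1e-8)              # normalize to [0, 1]

# Usage with dummy tensors:
fused = attribute_activation_map(torch.rand(64, 14, 14),
                                 torch.randn(35, 64), attr_idx=3)
```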