“…However, the use of anchor boxes causes a severe imbalance between positive and negative training samples [37] and involves complex hyperparameter settings (e.g., box size, aspect ratio, stride, and intersection-over-union threshold) [29]. Our method differs from existing anchor-box-based multispectral pedestrian detectors [27,24,32,16,15,31] in two major aspects. First, we use the manually annotated ground-truth bounding boxes to generate coarse box-level segmentation masks, which replace the anchor boxes in training two-stream deep neural networks to learn human-related characteristic features.…”
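
To make the first point concrete, the following is a minimal sketch of how ground-truth boxes could be rasterized into a coarse box-level mask. It is an illustration under stated assumptions, not the authors' implementation: the function name `boxes_to_mask`, the `(x_min, y_min, x_max, y_max)` pixel format with exclusive maxima, and the binary 0/1 encoding are all hypothetical choices that the excerpt does not specify.

```python
import numpy as np


def boxes_to_mask(boxes, height, width):
    """Rasterize bounding boxes into a binary box-level mask.

    Assumption (not from the source): each box is given as
    (x_min, y_min, x_max, y_max) in pixels, with exclusive maxima.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    for x_min, y_min, x_max, y_max in boxes:
        # Clip each box to the image bounds before filling.
        x0, x1 = max(0, int(x_min)), min(width, int(x_max))
        y0, y1 = max(0, int(y_min)), min(height, int(y_max))
        if x1 > x0 and y1 > y0:
            # Every pixel inside the box is marked as "pedestrian".
            mask[y0:y1, x0:x1] = 1
    return mask


# Example: one 2x3 box inside a 5x6 image.
mask = boxes_to_mask([(1, 1, 3, 4)], height=5, width=6)
print(mask.sum())  # 6 pixels are marked
```

A mask of this kind is coarse because it labels every pixel inside the box as foreground, including background pixels around the person. It could serve as a per-pixel training target in place of anchor-box assignments, but how the mask is actually consumed by the two-stream network is not described in this excerpt.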