Acquisition of Localization Confidence for Accurate Object Detection

Jiang, Borui; Luo, Ruixuan; Mao, Jiayuan; Xiao, Tete; Jiang, Yuning

doi:10.1007/978-3-030-01264-9_48

Cited by 748 publications (574 citation statements)

References 37 publications (73 reference statements)


“…For bounding box estimation, we train the IoU-Net [14] based architecture proposed in [4], employing features from the same backbone network used for target classification. The training procedure in [4] is extended to image sets by computing the modulation vector on the first frame in M_train and sampling proposal boxes from all images in M_test.…”

Section: Offline Training (mentioning, confidence: 99%)
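The quoted training procedure samples proposal boxes and trains an IoU-Net-style head to predict each proposal's overlap with the target. As a minimal sketch, assuming corner-format boxes and Gaussian jitter around the ground truth, (proposal, IoU target) training pairs could be generated as below. The helper `sample_proposals` and its `sigma` parameter are hypothetical and are not the sampling scheme of [4] or [14].

```python
import math
import random
from typing import List, Optional, Tuple

# Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1 (assumed convention).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_proposals(gt: Box, n: int, sigma: float = 0.1,
                     rng: Optional[random.Random] = None) -> List[Tuple[Box, float]]:
    """Hypothetical helper: jitter a ground-truth box into (proposal, IoU target) pairs.

    Center offsets are Gaussian and scaled by the box size; width and height are
    multiplied by exp(Gaussian) so they stay positive. sigma is an assumed value.
    """
    rng = rng or random.Random(0)
    cx, cy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    w, h = gt[2] - gt[0], gt[3] - gt[1]
    pairs = []
    for _ in range(n):
        ncx = cx + rng.gauss(0.0, sigma) * w
        ncy = cy + rng.gauss(0.0, sigma) * h
        nw = w * math.exp(rng.gauss(0.0, sigma))
        nh = h * math.exp(rng.gauss(0.0, sigma))
        box = (ncx - nw / 2, ncy - nh / 2, ncx + nw / 2, ncy + nh / 2)
        pairs.append((box, iou(box, gt)))  # the IoU is the regression target
    return pairs


pairs = sample_proposals((10.0, 10.0, 50.0, 60.0), n=8)
```

Each pair feeds an IoU regressor: the network sees the proposal and is supervised with its true overlap, which is what lets it later score or refine boxes by predicted localization quality.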

Learning Discriminative Model Prediction for Tracking

Bhat, Danelljan, Van Gool, et al., 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)


The current drive towards end-to-end trainable computer vision systems poses major challenges for the task of visual tracking. In contrast to most other vision problems, tracking requires learning a robust target-specific appearance model online, during the inference stage. To be end-to-end trainable, the online learning of the target model thus needs to be embedded in the tracking architecture itself. Due to these difficulties, the popular Siamese paradigm simply predicts a target feature template. However, such a model possesses limited discriminative power due to its inability to integrate background information. We develop an end-to-end tracking architecture capable of fully exploiting both target and background appearance information for target model prediction. Our architecture is derived from a discriminative learning loss by designing a dedicated optimization process that is capable of predicting a powerful model in only a few iterations. Furthermore, our approach is able to learn key aspects of the discriminative loss itself. The proposed tracker sets a new state of the art on 6 tracking benchmarks, achieving an EAO score of 0.440 on VOT2018, while running at over 40 FPS.
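The abstract describes predicting a target model by running a few iterations of a dedicated optimizer on a discriminative loss. As a heavily simplified illustration, and not the DiMP model predictor (which optimizes convolutional filters with learned loss parameters), the sketch below runs steepest descent with an exact line-search step on a ridge-regression loss over a linear model. The function names `ridge_loss` and `predict_model` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(X: Matrix, f: Vector) -> Vector:
    return [sum(x * w for x, w in zip(row, f)) for row in X]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def ridge_loss(X: Matrix, y: Vector, f: Vector, lam: float) -> float:
    """L(f) = 0.5 * ||X f - y||^2 + 0.5 * lam * ||f||^2."""
    r = [p - t for p, t in zip(matvec(X, f), y)]
    return 0.5 * dot(r, r) + 0.5 * lam * dot(f, f)


def predict_model(X: Matrix, y: Vector, lam: float, iters: int) -> Vector:
    """Hypothetical model predictor: a few steepest-descent steps starting from f = 0."""
    n_rows, n_cols = len(X), len(X[0])
    f = [0.0] * n_cols
    for _ in range(iters):
        r = [p - t for p, t in zip(matvec(X, f), y)]
        # Gradient of the loss: X^T r + lam * f
        g = [sum(X[i][j] * r[i] for i in range(n_rows)) + lam * f[j]
             for j in range(n_cols)]
        gg = dot(g, g)
        if gg == 0.0:
            break  # already at the minimum
        Xg = matvec(X, g)
        # Exact step length for a quadratic loss: g^T g / g^T (X^T X + lam I) g
        alpha = gg / (dot(Xg, Xg) + lam * gg)
        f = [fj - alpha * gj for fj, gj in zip(f, g)]
    return f


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
f_one = predict_model(X, y, lam=0.1, iters=1)
f_five = predict_model(X, y, lam=0.1, iters=5)
```

Because the step length is computed in closed form rather than hand-tuned, even a handful of iterations makes steady progress, which is the property the abstract relies on when it claims a powerful model in only a few iterations.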


“…Upon publication of the conference version of this manuscript, several works have pursued the idea behind Cascade R-CNN [5], [32], [41], [55]. [41], [55] applied it to single-shot object detectors, showing nontrivial improvements for high quality single-shot detection, for general objects and pedestrians, respectively.…”

Section: Related Work (mentioning, confidence: 99%)

“…[41], [55] applied it to single-shot object detectors, showing nontrivial improvements for high quality single-shot detection, for general objects and pedestrians, respectively. The IoU-Net [32] explored in greater detail high-quality localization, achieving some gains over the Cascade R-CNN by cascading more bounding box regression steps. [24] showed it is possible to achieve state-of-the-art object detectors without ImageNet pretraining, with a help of the Cascade R-CNN.…”

Section: Related Work (mentioning, confidence: 99%)

Cascade R-CNN: High Quality Object Detection and Instance Segmentation

Cai, Vasconcelos, 2021

IEEE Trans. Pattern Anal. Mach. Intell.


In object detection, the intersection over union (IoU) threshold is frequently used to define positives/negatives. The threshold used to train a detector defines its quality. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overfitting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of a detector as training set for the next. This resampling progressively improves hypotheses quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overfitting. The same cascade is applied at inference, to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells and whistles achieves state-of-the-art performance on the COCO dataset, and significantly improves high-quality detection on generic and specific object detection datasets, including VOC, KITTI, CityPerson, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over the Mask R-CNN. To facilitate future research, two implementations are made available at https://github.com/zhaoweicai/cascade-rcnn (Caffe) and https://github.com/zhaoweicai/Detectron-Cascade-RCNN (Detectron).
[Figure: (a) Detection of u = 0.5, showing person and frisbee detections with their confidence scores]
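The abstract's first cause, vanishing positives at large IoU thresholds, can be seen by labeling one fixed set of proposals at increasing thresholds u. This is a minimal sketch with illustrative thresholds 0.5, 0.6 and 0.7 and a hypothetical `label_proposals` helper. It does not model the cascade's remedy, in which each stage relabels boxes already refined by the previous stage's regressor so that the positive set stays roughly the same size.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals: Sequence[Box], gts: Sequence[Box], u: float) -> List[int]:
    """Label a proposal positive (1) if its best IoU with any ground truth is >= u."""
    return [1 if max(iou(p, g) for g in gts) >= u else 0 for p in proposals]


gts = [(0.0, 0.0, 10.0, 10.0)]
# Horizontally shifted proposals with IoU 1.0, 0.82, 0.67, 0.54 and 0.33 to the ground truth
proposals = [(float(dx), 0.0, 10.0 + dx, 10.0) for dx in (0, 1, 2, 3, 5)]
positives: Dict[float, int] = {u: sum(label_proposals(proposals, gts, u))
                               for u in (0.5, 0.6, 0.7)}
print(positives)  # {0.5: 4, 0.6: 3, 0.7: 2}
```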


“…[15] proposes an object relation module to learn the NMS function as an end-to-end general object detector. [41] and [17] replace the classification scores of proposals used in the NMS process with learned localization confidences to guide NMS to preserve more accurately localized bounding boxes. These methods prove effective in general object detection, but as we state, pedestrian detection in a crowd has its own challenge.…”

Section: Related Work (mentioning, confidence: 99%)
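The quote describes NMS that ranks boxes by a learned localization confidence instead of the classification score. A minimal sketch in the spirit of IoU-guided NMS, assuming each box already has a predicted IoU (`loc_conf`): boxes are processed in order of localization confidence, and the kept box inherits the highest classification score among the boxes it suppresses. The function name `iou_guided_nms` is hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_guided_nms(boxes: Sequence[Box], cls_scores: Sequence[float],
                   loc_conf: Sequence[float], thresh: float) -> List[Tuple[int, float]]:
    """Rank by localization confidence; return (kept index, classification score)."""
    order = sorted(range(len(boxes)), key=lambda i: loc_conf[i], reverse=True)
    keep: List[Tuple[int, float]] = []
    while order:
        best = order.pop(0)
        score = cls_scores[best]
        survivors = []
        for j in order:
            if iou(boxes[best], boxes[j]) > thresh:
                # A suppressed box hands its classification score to the kept box
                score = max(score, cls_scores[j])
            else:
                survivors.append(j)
        order = survivors
        keep.append((best, score))
    return keep


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 0.0, 11.0, 10.0), (50.0, 50.0, 60.0, 60.0)]
cls_scores = [0.9, 0.7, 0.8]
loc_conf = [0.6, 0.95, 0.5]  # predicted IoUs, assumed to come from an IoU head
print(iou_guided_nms(boxes, cls_scores, loc_conf, thresh=0.5))  # [(1, 0.9), (2, 0.8)]
```

Standard NMS would keep box 0 here because it has the highest classification score; ranking by predicted IoU keeps the better-localized box 1 instead, while still reporting the cluster's best classification score.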

Adaptive NMS: Refining Pedestrian Detection in a Crowd

Liu, Huang, Wang, 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


Pedestrian detection in a crowd is a very challenging problem. This paper addresses it with a novel Non-Maximum Suppression (NMS) algorithm that better refines the bounding boxes produced by detectors. The contributions are threefold: (1) we propose adaptive-NMS, which applies a dynamic suppression threshold to an instance according to the target density; (2) we design an efficient subnetwork to learn density scores, which can be conveniently embedded into both single-stage and two-stage detectors; and (3) we achieve state-of-the-art results on the CityPersons and CrowdHuman benchmarks.
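The dynamic suppression threshold can be sketched as greedy NMS whose threshold depends on the density of the currently selected box. This sketch assumes the rule threshold = max(base_thresh, density of the selected box) with hard suppression, and takes the densities as given rather than predicting them with the paper's subnetwork. The function name `adaptive_nms` is hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def adaptive_nms(boxes: Sequence[Box], scores: Sequence[float],
                 densities: Sequence[float], base_thresh: float) -> List[int]:
    """Greedy NMS with a per-box threshold raised in dense (crowded) regions."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        m = order.pop(0)
        keep.append(m)
        # Dynamic threshold: never below base_thresh, higher where targets are dense
        thresh_m = max(base_thresh, densities[m])
        order = [j for j in order if iou(boxes[m], boxes[j]) < thresh_m]
    return keep


boxes = [(0.0, 0.0, 10.0, 10.0), (3.0, 0.0, 13.0, 10.0)]  # IoU of about 0.54
scores = [0.9, 0.8]
print(adaptive_nms(boxes, scores, [0.3, 0.3], base_thresh=0.5))  # [0]: sparse, neighbor suppressed
print(adaptive_nms(boxes, scores, [0.7, 0.7], base_thresh=0.5))  # [0, 1]: crowd, neighbor kept
```

With a fixed threshold of 0.5, the second pedestrian would always be suppressed; letting density raise the threshold keeps heavily overlapping but genuine detections in crowds while behaving like standard NMS in sparse scenes.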
