2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01416
Open-Vocabulary Object Detection Using Captions

Cited by 140 publications (132 citation statements)
References 27 publications
“…Experiments on the open-vocabulary LVIS [15,16] and the open-vocabulary COCO [2] … On open-vocabulary COCO, our method outperforms the previous state-of-the-art OVR-CNN [66] by 5 points with the same detector and data. Finally, we train a detector using the full ImageNet-21K with more than twenty thousand classes.…”
Section: Introduction
confidence: 91%
“…Rahman et al [38] and Li et al [29] improve the classifier embedding by introducing external text information. OVR-CNN [66] pretrains the detector on image-text pairs using contrastive learning. ViLD [15] upgrades the language embedding to CLIP [37] and distills region features from CLIP image features.…”
Section: Related Work
confidence: 99%
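The snippet above describes the core open-vocabulary mechanism shared by OVR-CNN and ViLD: class names are embedded by a text encoder (e.g. CLIP), and those embeddings replace the learned classifier weights, so region features can be scored against arbitrary categories. A minimal NumPy sketch of that classification step follows; the function name, the temperature value, and the assumption that embeddings are already computed are all illustrative, not taken from either paper.

```python
import numpy as np

def embed_classifier(region_feats, class_text_embs, temperature=0.01):
    """Score region features against class text embeddings.

    region_feats: (N, D) array of detector region features
    class_text_embs: (C, D) array of per-class text-encoder embeddings
    Returns an (N, C) array of softmax probabilities over classes.
    """
    # Cosine similarity: L2-normalize both sides, then take dot products.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    logits = r @ t.T / temperature
    # Numerically stable softmax over the class dimension.
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```

Because the classifier is just a similarity against text embeddings, adding a new category at test time only requires embedding its name; no detector weights change.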
See 1 more Smart Citation
“…We report the effects of different objectness scores in Table 5. It can be seen that our similarity entropy outperforms the initial score and maximum similarity. The initial score achieves better performance than the original Edge Boxes [46], thanks to our CLIP entropy selection.…”
Section: Ablation Study
confidence: 93%
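The ablation above compares objectness scores built from CLIP similarities: a proposal whose class-similarity distribution is sharply peaked is more likely to be a real object than one that matches all classes about equally. The cited paper's exact formula is not reproduced in the snippet, so the sketch below only illustrates the general idea of using the Shannon entropy of a normalized similarity distribution as an (inverse) objectness signal.

```python
import numpy as np

def similarity_entropy(sim_probs, eps=1e-12):
    """Shannon entropy of each proposal's class-similarity distribution.

    sim_probs: (N, C) array, each row a softmax-normalized similarity
    distribution over C classes. Lower entropy (a peaked distribution)
    suggests a confident match, i.e. a more object-like proposal.
    """
    p = np.clip(sim_probs, eps, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)
```

Proposals can then be ranked by ascending entropy (or by any monotone transform such as `1 / (1 + H)`) when selecting boxes, which is the selection role the snippet attributes to the CLIP entropy.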
“…Uijlings et al. use Multiple Instance Learning to pseudo-label data and then train a large-vocabulary detector. Recent works build open-vocabulary detectors [23,32,88] by leveraging image-caption pairs (or models like CLIP [63], which are built from the same), obtained in large quantities on the web. Even though image-caption pairs are noisy, the resulting detectors improve as the data is scaled up.…”
Section: Related Work
confidence: 99%
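The Multiple Instance Learning pseudo-labeling mentioned above treats each image-level label as a "bag" label: the image is known to contain the class somewhere, and the label is assigned to the proposal that scores highest for it. A minimal sketch of that assignment step, under the assumption that per-proposal class scores are already available (the function name is illustrative, not from the cited work):

```python
import numpy as np

def mil_pseudo_labels(proposal_scores, image_labels):
    """Assign each image-level label to its highest-scoring proposal.

    proposal_scores: (N, C) per-proposal class scores for one image
    image_labels: iterable of class indices known to be present
    Returns a list of (proposal_index, class_index) pseudo-boxes
    that can be used as training targets for a detector.
    """
    return [(int(np.argmax(proposal_scores[:, c])), c)
            for c in image_labels]
```

In practice the pseudo-labels and the detector are refined together over several rounds, since early scores are noisy; the sketch shows only a single labeling pass.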