PromptDet: Towards Open-Vocabulary Detection Using Uncurated Images (2022)
DOI: 10.1007/978-3-031-20077-9_41

Cited by 35 publications (22 citation statements)
References 25 publications
“…Each image is annotated with five captions describing the visually-grounded objects in the image. Unlike previous works [19,49,79] that adopt extra caption datasets, like Conceptual Captions [53] with 3M image-caption pairs for pre-training, we do not use extra caption datasets or detection datasets. We follow the origin OVR-CNN [69] setting by only exploring a limited caption dataset within COCO.…”
Section: Methods
confidence: 99%
“…Meanwhile, several works decouple the learning of open vocabulary classification and detection/segmentation into a two-stage pipeline [15,21]. Recently, state-of-the-art solutions [19,28,33,71,79] for open vocabulary detection/segmentation try to adopt larger-scale dataset pre-training with the help of VLMs. For example, Detic [79] adopts the ImageNet-21k [51] dataset to enlarge the detector in a weakly supervised manner, while Prompt-Det [19] augments the detection dataset with image-caption pairs scraped from the Internet.…”
Section: Introduction
confidence: 99%
“…Bansal et al [4] introduce ZS+OV detection where the classification layer of a closed vocabulary detector is replaced with the text embeddings of the class names, an approach taken by many subsequent works [11,14,16,24,31,42,46,46], including this one. Some works [16,24,42] take the OV classification closer to the backbone features by directly extracting them from object proposals with ROI-Align [20], and optionally distill a strong OV classifier into the detector [16].…”
Section: Related Work
confidence: 99%
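The excerpt above describes the core open-vocabulary mechanism: replacing a detector's fixed classification layer with text embeddings of the class names. The sketch below is a minimal, illustrative take on that idea, not the cited papers' actual code; it assumes region features have already been projected into the same space as a frozen text encoder's class-name embeddings, and uses cosine-similarity logits with an assumed temperature value.

```python
# Minimal sketch: score ROI features against class-name text embeddings
# instead of a learned, closed-vocabulary classifier layer.
import torch
import torch.nn.functional as F


def open_vocab_scores(region_feats: torch.Tensor,
                      class_text_embeds: torch.Tensor,
                      temperature: float = 0.01) -> torch.Tensor:
    """region_feats: (N, D) region features projected into the text-embedding space.
    class_text_embeds: (C, D) embeddings of the class names from a frozen text encoder.
    Both are L2-normalised so the dot product is a cosine similarity."""
    region_feats = F.normalize(region_feats, dim=-1)
    class_text_embeds = F.normalize(class_text_embeds, dim=-1)
    # Cosine-similarity logits, scaled by a temperature as in contrastive pre-training.
    return region_feats @ class_text_embeds.t() / temperature


if __name__ == "__main__":
    feats = torch.randn(5, 512)    # 5 region proposals (toy values)
    text = torch.randn(80, 512)    # 80 class-name embeddings (toy values)
    probs = open_vocab_scores(feats, text).softmax(dim=-1)
    print(probs.shape)             # torch.Size([5, 80])
```

Because the class set only enters through the text embeddings, new categories can be added at inference time by encoding their names, which is what makes the head "open-vocabulary".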
“…(ii) A typical detector pretrains the vision backbone and language model on image-text datasets to obtain aligned image and text embeddings [14,16,31,46], but also inserts many modules (feature pyramid network [28], detection heads [13,28,38]) that are trained from scratch. The added modules break the vision-text alignment established during pretraining, and we propose to side-step this issue by modifying their architecture.…”
Section: Introduction
confidence: 99%
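This last excerpt, together with the Related Work quote above, mentions taking the open-vocabulary classification "closer to the backbone" by pooling proposal features directly with ROI-Align. The sketch below illustrates that pattern under stated assumptions: a single stride-32 feature map, 1x1 pooled outputs, and toy box coordinates; it is not the cited detectors' implementation, and the helper name is hypothetical.

```python
# Minimal sketch: pool backbone features for each proposal with ROI-Align,
# then score the pooled features against class-name text embeddings.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align


def classify_proposals(feature_map: torch.Tensor,
                       boxes: torch.Tensor,
                       class_text_embeds: torch.Tensor,
                       spatial_scale: float = 1.0 / 32) -> torch.Tensor:
    """feature_map: (1, D, H, W) backbone features for one image.
    boxes: (K, 4) proposal boxes in image coordinates (x1, y1, x2, y2).
    class_text_embeds: (C, D) text embeddings of the class names."""
    pooled = roi_align(feature_map, [boxes], output_size=(1, 1),
                       spatial_scale=spatial_scale)          # (K, D, 1, 1)
    region_feats = pooled.flatten(1)                         # (K, D)
    region_feats = F.normalize(region_feats, dim=-1)
    class_text_embeds = F.normalize(class_text_embeds, dim=-1)
    return region_feats @ class_text_embeds.t()              # (K, C) cosine logits


if __name__ == "__main__":
    fmap = torch.randn(1, 512, 20, 20)                       # stride-32 feature map (toy)
    boxes = torch.tensor([[10., 10., 200., 300.],
                          [50., 40., 400., 500.]])
    text = torch.randn(80, 512)
    print(classify_proposals(fmap, boxes, text).shape)       # torch.Size([2, 80])
```

Keeping the classification this close to the pretrained backbone features is what the excerpt argues helps preserve the vision-text alignment that freshly initialised detection modules would otherwise break.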