PromptDet: Towards Open-Vocabulary Detection Using Uncurated Images (2022)
DOI: 10.1007/978-3-031-20077-9_41

Cited by 35 publications (22 citation statements)
References 25 publications
“…Each image is annotated with five captions describing the visually-grounded objects in the image. Unlike previous works [19,49,79] that adopt extra caption datasets, like Conceptual Captions [53] with 3M image-caption pairs for pre-training, we do not use extra caption datasets or detection datasets. We follow the origin OVR-CNN [69] setting by only exploring a limited caption dataset within COCO.…”
Section: Methods
confidence: 99%
“…Meanwhile, several works decouple the learning of open vocabulary classification and detection/segmentation into a two-stage pipeline [15,21]. Recently, state-of-the-art solutions [19,28,33,71,79] for open vocabulary detection/segmentation try to adopt larger-scale dataset pre-training with the help of VLMs. For example, Detic [79] adopts the ImageNet-21k [51] dataset to enlarge the detector in a weakly supervised manner, while Prompt-Det [19] augments the detection dataset with image-caption pairs scraped from the Internet.…”
Section: Introduction
confidence: 99%
“…Bansal et al [4] introduce ZS+OV detection where the classification layer of a closed vocabulary detector is replaced with the text embeddings of the class names, an approach taken by many subsequent works [11,14,16,24,31,42,46,46], including this one. Some works [16,24,42] take the OV classification closer to the backbone features by directly extracting them from object proposals with ROI-Align [20], and optionally distill a strong OV classifier into the detector [16].…”
Section: Related Work
confidence: 99%
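The excerpt above describes the core open-vocabulary mechanism: replacing a detector's fixed classification layer with text embeddings of the class names. The sketch below is a minimal, illustrative take on that idea, not the cited papers' actual code; it assumes region features have already been projected into the same space as a frozen text encoder's class-name embeddings, and uses cosine-similarity logits with an assumed temperature value.

```python
# Minimal sketch: score ROI features against class-name text embeddings
# instead of a learned, closed-vocabulary classifier layer.
import torch
import torch.nn.functional as F


def open_vocab_scores(region_feats: torch.Tensor,
                      class_text_embeds: torch.Tensor,
                      temperature: float = 0.01) -> torch.Tensor:
    """region_feats: (N, D) region features projected into the text-embedding space.
    class_text_embeds: (C, D) embeddings of the class names from a frozen text encoder.
    Both are L2-normalised so the dot product is a cosine similarity."""
    region_feats = F.normalize(region_feats, dim=-1)
    class_text_embeds = F.normalize(class_text_embeds, dim=-1)
    # Cosine-similarity logits, scaled by a temperature as in contrastive pre-training.
    return region_feats @ class_text_embeds.t() / temperature


if __name__ == "__main__":
    feats = torch.randn(5, 512)    # 5 region proposals (toy values)
    text = torch.randn(80, 512)    # 80 class-name embeddings (toy values)
    probs = open_vocab_scores(feats, text).softmax(dim=-1)
    print(probs.shape)             # torch.Size([5, 80])
```

Because the class set only enters through the text embeddings, new categories can be added at inference time by encoding their names, which is what makes the head "open-vocabulary".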
“…(ii) A typical detector pretrains the vision backbone and language model on image-text datasets to obtain aligned image and text embeddings [14,16,31,46], but also inserts many modules (feature pyramid network [28], detection heads [13,28,38]) that are trained from scratch. The added modules break the vision-text alignment established during pretraining, and we propose to side-step this issue by modifying their architecture.…”
Section: Introduction
confidence: 99%
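This last excerpt, together with the Related Work quote above, mentions taking the open-vocabulary classification "closer to the backbone" by pooling proposal features directly with ROI-Align. The sketch below illustrates that pattern under stated assumptions: a single stride-32 feature map, 1x1 pooled outputs, and toy box coordinates; it is not the cited detectors' implementation, and the helper name is hypothetical.

```python
# Minimal sketch: pool backbone features for each proposal with ROI-Align,
# then score the pooled features against class-name text embeddings.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align


def classify_proposals(feature_map: torch.Tensor,
                       boxes: torch.Tensor,
                       class_text_embeds: torch.Tensor,
                       spatial_scale: float = 1.0 / 32) -> torch.Tensor:
    """feature_map: (1, D, H, W) backbone features for one image.
    boxes: (K, 4) proposal boxes in image coordinates (x1, y1, x2, y2).
    class_text_embeds: (C, D) text embeddings of the class names."""
    pooled = roi_align(feature_map, [boxes], output_size=(1, 1),
                       spatial_scale=spatial_scale)          # (K, D, 1, 1)
    region_feats = pooled.flatten(1)                         # (K, D)
    region_feats = F.normalize(region_feats, dim=-1)
    class_text_embeds = F.normalize(class_text_embeds, dim=-1)
    return region_feats @ class_text_embeds.t()              # (K, C) cosine logits


if __name__ == "__main__":
    fmap = torch.randn(1, 512, 20, 20)                       # stride-32 feature map (toy)
    boxes = torch.tensor([[10., 10., 200., 300.],
                          [50., 40., 400., 500.]])
    text = torch.randn(80, 512)
    print(classify_proposals(fmap, boxes, text).shape)       # torch.Size([2, 80])
```

Keeping the classification this close to the pretrained backbone features is what the excerpt argues helps preserve the vision-text alignment that freshly initialised detection modules would otherwise break.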