2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01416
Open-Vocabulary Object Detection Using Captions

Cited by 140 publications (132 citation statements)
References 27 publications
“…Experiments on the open-vocabulary LVIS [15,16] and the open-vocabulary COCO [2] … On open-vocabulary COCO, our method outperforms the previous state-of-the-art OVR-CNN [66] by 5 points with the same detector and data. Finally, we train a detector using the full ImageNet-21K with more than twenty thousand classes.…”
Section: Introduction
confidence: 91%
“…Rahman et al [38] and Li et al [29] improve the classifier embedding by introducing external text information. OVR-CNN [66] pretrains the detector on image-text pairs using contrastive learning. ViLD [15] upgrades the language embedding to CLIP [37] and distills region features from CLIP image features.…”
Section: Related Work
confidence: 99%
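The snippet above describes the core open-vocabulary mechanism shared by OVR-CNN and ViLD: class names are embedded by a text encoder (e.g. CLIP), and those embeddings replace the learned classifier weights, so region features can be scored against arbitrary categories. A minimal NumPy sketch of that classification step follows; the function name, the temperature value, and the assumption that embeddings are already computed are all illustrative, not taken from either paper.

```python
import numpy as np

def embed_classifier(region_feats, class_text_embs, temperature=0.01):
    """Score region features against class text embeddings.

    region_feats: (N, D) array of detector region features
    class_text_embs: (C, D) array of per-class text-encoder embeddings
    Returns an (N, C) array of softmax probabilities over classes.
    """
    # Cosine similarity: L2-normalize both sides, then take dot products.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    logits = r @ t.T / temperature
    # Numerically stable softmax over the class dimension.
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```

Because the classifier is just a similarity against text embeddings, adding a new category at test time only requires embedding its name; no detector weights change.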
See 1 more Smart Citation
“…We report the effects of different objectness scores in Table 5. It can be seen that our similarity entropy outperforms the initial score and maximum similarity. The initial score achieves better performance than the original Edge Boxes [46], thanks to our CLIP entropy selection.…”
Section: Ablation Study
confidence: 93%
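The ablation above compares objectness scores built from CLIP similarities: a proposal whose class-similarity distribution is sharply peaked is more likely to be a real object than one that matches all classes about equally. The cited paper's exact formula is not reproduced in the snippet, so the sketch below only illustrates the general idea of using the Shannon entropy of a normalized similarity distribution as an (inverse) objectness signal.

```python
import numpy as np

def similarity_entropy(sim_probs, eps=1e-12):
    """Shannon entropy of each proposal's class-similarity distribution.

    sim_probs: (N, C) array, each row a softmax-normalized similarity
    distribution over C classes. Lower entropy (a peaked distribution)
    suggests a confident match, i.e. a more object-like proposal.
    """
    p = np.clip(sim_probs, eps, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)
```

Proposals can then be ranked by ascending entropy (or by any monotone transform such as `1 / (1 + H)`) when selecting boxes, which is the selection role the snippet attributes to the CLIP entropy.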
“…Uijlings et al. use Multiple Instance Learning to pseudo-label data and then train a large-vocabulary detector. Recent works build open-vocabulary detectors [23,32,88] by leveraging image-caption pairs (or models like CLIP [63], which are built from the same), obtained in large quantities on the web. Even though image-caption pairs are noisy, the resulting detectors improve as the data is scaled up.…”
Section: Related Work
confidence: 99%
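The Multiple Instance Learning pseudo-labeling mentioned above treats each image-level label as a "bag" label: the image is known to contain the class somewhere, and the label is assigned to the proposal that scores highest for it. A minimal sketch of that assignment step, under the assumption that per-proposal class scores are already available (the function name is illustrative, not from the cited work):

```python
import numpy as np

def mil_pseudo_labels(proposal_scores, image_labels):
    """Assign each image-level label to its highest-scoring proposal.

    proposal_scores: (N, C) per-proposal class scores for one image
    image_labels: iterable of class indices known to be present
    Returns a list of (proposal_index, class_index) pseudo-boxes
    that can be used as training targets for a detector.
    """
    return [(int(np.argmax(proposal_scores[:, c])), c)
            for c in image_labels]
```

In practice the pseudo-labels and the detector are refined together over several rounds, since early scores are noisy; the sketch shows only a single labeling pass.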