RegionCLIP: Region-based Language-Image Pretraining

Zhong, Yiwu; Yang, Jun; Zhang, Pengchuan; Li, Chunyuan; Codella, Noel; Li, Liunian Harold; Zhou, Luowei; Dai, Xiyang; Liu, Yuan; Li, Yin; Gao, Jianfeng

doi:10.48550/arxiv.2112.09106

Cited by 8 publications

(25 citation statements)

References 40 publications

“…To learn the semantics of novel classes, recent methods [3,13,16,41,44] have simplified the problem by providing image-caption pairs as a weak supervision signal. Such pairs are cheap to acquire and make the problem tractable.…”

Section: Related Work (mentioning)

confidence: 99%

“…Most of these methods require big dataset with millions of image-caption pairs to train such a model. They either use this model to align image-regions with captions and generate object-box pseudo labels [16,44] or as region-image feature extractor to classify the regions [13]. Many weakly-supervised [1,3,7,34,43] approaches have been proposed to perform such object grounding.…”

Section: Related Work (mentioning)

confidence: 99%

“…We compare our method with recent state-of-the-art models on Open-Vocabulary. RegionClip [44] uses the CLIP [30] pre-trained model to produce region-image pseudo labels and train an object detector. CLIP (cropped reg) [13] uses the CLIP pre-trained model on 400M image-caption pairs on object proposals obtained by an object detector trained on known classes.…”

Section: Baselines Ovr (mentioning)

confidence: 99%

“…The closed-world setting restricts the object detector to only discover known annotated objects and annotating all possible objects in the world is infeasible due to high labeling costs. Therefore, research of open-world detectors, which can also discover unmarked objects, has recently come into focus [13,41,44].…”

Section: Introduction (mentioning)

confidence: 99%

Localized Vision-Language Matching for Open-vocabulary Object Detection

Bravo, Mittal, Brox

2022

Preprint

In this work, we propose an open-world object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes. It is a two-stage training approach that first uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weakly-supervised manner and second specializes the model for the object detection task using known class annotations. We show that a simple language model fits better than a large contextualized language model for detecting novel objects. Moreover, we introduce a consistency-regularization technique to better exploit image-caption pair information. Our method compares favorably to existing open-world detection approaches while being data-efficient.

“…Recently, open-vocabulary object detection (OVD) [49] has attracted increasing attention due to its ability to expand the detection vocabulary with the help of pre-trained vision-language models (VLMs) [32]. Typical OVD methods [14,51,53] first learn an unbounded vocabulary of concepts from image-caption pairs, and then transfer the general vision-language knowledge to facilitate OVD with detection annotations of base categories alone.…”

Section: Introduction (mentioning)

confidence: 99%

Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization

Chen, Sheng, Zhang et al.

2022

Preprint

Open-vocabulary object detection (OVD) aims to scale up vocabulary size to detect objects of novel categories beyond the training vocabulary. Recent work resorts to the rich knowledge in pre-trained vision-language models. However, existing methods are ineffective in proposal-level vision-language alignment. Meanwhile, the models usually suffer from confidence bias toward base categories and perform worse on novel ones. To overcome the challenges, we present MEDet, a novel and effective OVD framework with proposal mining and prediction equalization. First, we design an online proposal mining to refine the inherited vision-semantic knowledge from coarse to fine, allowing for proposal-level detection-oriented feature alignment. Second, based on causal inference theory, we introduce a class-wise backdoor adjustment to reinforce the predictions on novel categories to improve the overall OVD performance. Extensive experiments on COCO and LVIS benchmarks verify the superiority of MEDet over the competing approaches in detecting objects of novel categories, e.g., 32.6% AP50 on COCO and 22.4% mask mAP on LVIS. Code is available at https://github.com/Pealing/MEDet.

Simple Open-Vocabulary Object Detection

Minderer, Gritsenko, Stone et al.

2022

Lecture Notes in Computer Science

Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (≈10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvement: With an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
