2021
DOI: 10.48550/arxiv.2112.09106
Preprint

RegionCLIP: Region-based Language-Image Pretraining

Abstract: Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a ne…
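To make the domain shift concrete, the sketch below scores cropped image regions directly against text prompts with an off-the-shelf CLIP model. This is the naive baseline the abstract argues performs poorly, not RegionCLIP's region-text pretraining itself. It assumes OpenAI's `clip` package and PIL are installed and that candidate boxes come from an external proposal stage; the file name, boxes, and label set are placeholders.

```python
# Minimal sketch: classify cropped regions with a whole-image CLIP model.
# Assumes the `clip` package (github.com/openai/CLIP) and Pillow are installed.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

categories = ["person", "dog", "frisbee"]            # illustrative label set
prompts = clip.tokenize([f"a photo of a {c}" for c in categories]).to(device)

image = Image.open("example.jpg")                     # placeholder input image
boxes = [(30, 40, 200, 220), (250, 80, 400, 300)]     # placeholder (x1, y1, x2, y2) proposals

with torch.no_grad():
    text_feat = model.encode_text(prompts)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    for box in boxes:
        crop = preprocess(image.crop(box)).unsqueeze(0).to(device)
        region_feat = model.encode_image(crop)
        region_feat = region_feat / region_feat.norm(dim=-1, keepdim=True)
        # Cosine similarity between the cropped region and each category prompt.
        scores = (region_feat @ text_feat.T).softmax(dim=-1).squeeze(0)
        best = scores.argmax().item()
        print(box, categories[best], round(scores[best].item(), 3))
```

Because CLIP's encoders were trained to match whole images to whole captions, features of tight crops like these fall outside that training distribution; this is the mismatch that region-level pretraining is meant to close.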

Cited by 8 publications (25 citation statements) · References 40 publications

Citation statements
“…To learn the semantics of novel classes, recent methods [3,13,16,41,44] have simplified the problem by providing image-caption pairs as a weak supervision signal. Such pairs are cheap to acquire and make the problem tractable.…”
Section: Related Work
Mentioning confidence: 99%
“…Most of these methods require a large dataset with millions of image-caption pairs to train such a model. They either use this model to align image regions with captions and generate object-box pseudo labels [16,44] or as a region-image feature extractor to classify the regions [13]. Many weakly-supervised [1,3,7,34,43] approaches have been proposed to perform such object grounding.…”
Section: Related Work
Mentioning confidence: 99%
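The statement above mentions methods that align image regions with captions to generate object-box pseudo labels [16,44]. As a rough, hypothetical sketch of that idea (not the cited papers' exact procedures), one can match caption-derived concept embeddings against region-proposal embeddings and keep the best-scoring region per concept; the concept parser, the score threshold, and the detector later trained on these labels are all omitted, and the feature inputs are placeholders for CLIP outputs.

```python
# Rough sketch of box-level pseudo-labeling: assign each caption concept to
# the region proposal whose embedding matches it best, then keep confident
# matches as pseudo labels for detector training (training itself not shown).
import torch

def pseudo_label_regions(region_feats: torch.Tensor,
                         concept_feats: torch.Tensor,
                         score_thresh: float = 0.2):
    """region_feats: (R, D) L2-normalized region embeddings.
    concept_feats: (C, D) L2-normalized text embeddings of caption concepts.
    Returns a list of (concept_index, region_index, score) pseudo labels."""
    sims = concept_feats @ region_feats.T           # (C, R) cosine similarities
    best_scores, best_regions = sims.max(dim=1)     # best region per concept
    labels = []
    for c, (score, r) in enumerate(zip(best_scores.tolist(), best_regions.tolist())):
        if score >= score_thresh:                   # drop weak matches
            labels.append((c, r, score))
    return labels

# Toy usage with random, normalized features standing in for CLIP embeddings.
torch.manual_seed(0)
regions = torch.nn.functional.normalize(torch.randn(5, 512), dim=-1)
concepts = torch.nn.functional.normalize(torch.randn(3, 512), dim=-1)
print(pseudo_label_regions(regions, concepts, score_thresh=-1.0))
```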
“…Recently, open-vocabulary object detection (OVD) [49] has attracted increasing attention due to its ability to expand the detection vocabulary with the help of pre-trained vision-language models (VLMs) [32]. Typical OVD methods [14,51,53] first learn an unbounded vocabulary of concepts from image-caption pairs, and then transfer the general vision-language knowledge to facilitate OVD with detection annotations of base categories alone.…”
Section: Introduction
Mentioning confidence: 99%