2021
DOI: 10.48550/arxiv.2112.14757
Preprint
A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model

Cited by 15 publications (33 citation statements)
References: 0 publications
“…The concurrently developed unpublished text-supervised semantic segmentation methods [29,86,90,96] also show promising results. One major difference between these methods and GroupViT is that they exploit vision-language models [32,61] pre-trained on well-prepared, large-scale 400M-1.8B image-text data, while our GroupViT is trained from scratch with much noisier data (30M images) to learn grouping and segmentation and yet achieves competitive performance.…”
Section: Related Work
confidence: 95%
“…To enable this, the core topic is how to define new classes, where using natural language is a promising direction. Pre-trained on large-scale image-text datasets [5], language-driven or text-based visual recognition has been extended from image classification [45,27] to object detection [7,61,21,34] and semantic segmentation [19,60,32,59]. The task is also related to other cross-modal recognition topics, including image captioning [55,25], visual question answering [1,56,12,20], referring expression localization [26,41,67,39], visual reasoning [62], etc.…”
Section: Related Work
confidence: 99%
“…The design principle is to create a query from the request on the fly and use the query to interact with the extracted visual features. Some open-domain recognition algorithms discussed above [60,19,21,32] were based on this framework, in which queries were generated from texts. The idea is also related to the DETR series for object detection [4,66,6,43,50] and its extension to semantic/instance/panoptic segmentation [15,9,8,18,49,64].…”
Section: Related Work
confidence: 99%
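The query-based design described in this statement can be illustrated with a minimal PyTorch sketch, in which text-derived queries cross-attend over extracted visual features; the dimensions, class count, and tensor names below are illustrative assumptions, not the cited papers' actual implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch: text embeddings act as queries that attend over
# extracted visual features (all dimensions are illustrative).
embed_dim, num_classes, num_patches = 512, 3, 196

text_queries = torch.randn(num_classes, 1, embed_dim)  # one query per class name
visual_feats = torch.randn(num_patches, 1, embed_dim)  # patch features from an image encoder

cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8)
class_embeds, attn_weights = cross_attn(text_queries, visual_feats, visual_feats)

# attn_weights has shape (1, num_classes, num_patches); reshaping over the
# 14x14 patch grid gives coarse per-class attention masks.
masks = attn_weights.squeeze(0).view(num_classes, 14, 14)
```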
“…Due to its superior zero-shot capability, CLIP makes it feasible to extract open-category semantic segmentation from untrained image sets [31]. Therefore, the higher the relatedness between the image and the participant's description, the more accurate the semantic representation included in the description.…”
Section: Machine Learning Architectures and Features Extraction Appro...
confidence: 99%
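As a rough illustration of how image-description relatedness can be scored with CLIP, the following sketch assumes the openai/CLIP package; the file name and example descriptions are placeholders, not details from the cited study.

```python
import torch
import clip
from PIL import Image

# Score how related an image is to candidate text descriptions via
# cosine similarity of CLIP embeddings (placeholder image and texts).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("scene.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize([
    "a photo of a dog on grass",
    "a photo of a city street",
]).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feats = model.encode_text(texts)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    relatedness = (image_feat @ text_feats.T).squeeze(0)  # one score per description

print(relatedness.tolist())
```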