A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model

Xu, Mengde; Zhang, Zheng; Wei, Fangyun; Lin, Yutong; Cao, Yue; Han, Hu; Bai, Xiang

doi:10.48550/arxiv.2112.14757

Cited by 15 publications

(33 citation statements)

References 0 publications


“…The concurrently developed unpublished text-supervised semantic segmentation methods [29,86,90,96] also show promising results. One major difference between these methods and GroupViT is that, they exploit vision-language model [32,61] pre-trained on well prepared large-scale 400M-1.8B image-text data, while our GroupViT is trained from scratch with much noisier data (30M images) to learn grouping and segmentation and yet achieves competitive performance.…”

Section: Related Work (mentioning)

confidence: 95%

GroupViT: Semantic Segmentation Emerges from Text Supervision

Xu, Mello, Liu et al. 2022 (Preprint)


Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 51.2% mIoU on the PASCAL VOC 2012 and 22.3% mIoU on PASCAL Context datasets, and performs competitively to state-of-the-art transfer-learning methods requiring greater levels of supervision. Project page is available at https://jerryxu.net/GroupViT.
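The abstract above reports results as mIoU (mean intersection-over-union). As a rough illustration of how that metric is usually defined, here is a minimal sketch in plain Python. The function name `mean_iou` and the flat-list input format are assumptions made for this example and are not taken from the GroupViT code. Benchmark implementations typically accumulate intersections and unions over the whole dataset before averaging, which this per-sample sketch does not do.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over the classes that appear in the prediction or the ground truth.

    `pred` and `target` are flattened per-pixel class indices (hypothetical format).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Tiny usage example: four pixels, two classes.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(f"mIoU = {score:.4f}")
```

In the usage example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.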


“…To enable this, the core topic is how to define new classes, where using natural language is a promising direction. Pre-trained on large-scale image-text datasets [5], the language-driven or text-based visual recognition has been extended from image classification [45,27] to object detection [7,61,21,34] and semantic segmentation [19,60,32,59]. The task is also related to other cross-modal recognition topics, including image captioning [55,25], visual question answering [1,56,12,20], referring expression localization [26,41,67,39], visual reasoning [62], etc.…”

Section: Related Work (mentioning)

confidence: 99%

“…The design principle is to create a query from the request on-the-fly and use the query to interact with the extracted visual features. Some open-domain recognition algorithms discussed above [60,19,21,32] were based on this framework, in which queries were generated by texts. The idea is also related to the DETR series for object detection [4,66,6,43,50] and its extension to semantic/instance/panoptic segmentation [15,9,8,18,49,64].…”

Section: Related Work (mentioning)

confidence: 99%

Visual Recognition by Request

Tang, Xie, Zhang et al. 2022 (Preprint)


In this paper, we present a novel protocol of annotation and evaluation for visual recognition. Different from traditional settings, the protocol does not require the labeler/algorithm to annotate/recognize all targets (objects, parts, etc.) at once, but instead raises a number of recognition instructions and the algorithm recognizes targets by request. This mechanism brings two beneficial properties to reduce the burden of annotation, namely, (i) variable granularity: different scenarios can have different levels of annotation, in particular, object parts can be labeled only in large and clear instances, (ii) being open-domain: new concepts can be added to the database in minimal costs. To deal with the proposed setting, we maintain a knowledge base and design a query-based visual recognition framework that constructs queries on-the-fly based on the requests. We evaluate the recognition system on two mixed-annotated datasets, CPP and ADE20K, and demonstrate its promising ability of learning from partially labeled data as well as adapting to new concepts with only text labels.


“…Due to its superior zero-shot capability, CLIP bears the feasibility of extracting open-category semantic segmentation from untrained image sets [31]. Therefore, the higher relatedness between the image and the participant's description, the more accurate the semantic representation that is included in the description.…”

Section: Machine Learning Architectures and Features Extraction Appro... (mentioning)

confidence: 99%

Multi-modal Vision-and-Language Analysis of Communication Deficits due to Alzheimer’s Disease

Liu, Collier, Paek et al. 2022 (Preprint)


Previous research has demonstrated that referential communication tasks (RCTs) can be used to detect language deficits in people with Alzheimer’s Disease (AD). This study carried out a multi-modal vision-and-language analysis on data produced during RCT. Using the CLIP model, we calculated the association between the transcripts of image descriptions collected in RCTs and the images being described. Statistical analyses were conducted to examine the differences between people with AD and cognitively healthy older adults. The analysis results are significantly different between the two groups. Moreover, the results vary significantly across different experimental conditions in the cognitively healthy group, but not in the AD group. This paper is the first study on multi-modal vision-language analysis of RCTs using CLIP. The study reveals communication deficits in vision-language association in people with AD. Further research is needed to evaluate the potential of using CLIP for automatic dementia screening using interactive image-based description tasks.
