2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00437
Mask Transfiner for High-Quality Instance Segmentation

Abstract: The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability…

Cited by 88 publications (44 citation statements)
References 77 publications
“…Video Instance Segmentation (VIS) Existing VIS methods can be summarized into three categories: two-stage, one-stage, and transformer-based. Two-stage approaches [2,19,29,30,62] extend the Mask R-CNN family [12,20] by designing an additional tracking branch for object association. One-stage works [4,27,32,63] adopt anchor-free detectors [50], generally using linear mask basis combination [3] or conditional mask generation [49].…”
Section: Related Work
confidence: 99%
“…MaskFreeVIS achieves competitive VIS performance without using any video masks or even image mask labels on all datasets. Validated on various methods and backbones, MaskFreeVIS achieves 91.25% of the performance of its fully supervised counterparts, even outperforming a few recent fully-supervised methods [11,16,19,60] on the popular YTVIS benchmark. Our simple yet effective design greatly narrows the performance gap between weakly-supervised and fully-supervised methods (Table 1).…”
Section: Introduction
confidence: 99%
“…However, with the occlusion augmentation added to the object, the visible ground truth is treated as the amodal ground truth. In the second training stage, the amodal segmentation outputs from the trained UNet in the first phase are taken as pseudo-ground truth for learning a standard instance segmenter, Mask R-CNN [24]. In the inference phase, Mask R-CNN trained on the generated amodal ground truth is expected to yield the correct AIS.…”
Section: Amodal Instance Segmentation (AIS)
confidence: 99%
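The occlusion-augmentation idea in the excerpt above can be sketched in a few lines: an occluder is pasted over an object, so the object's original (pre-occlusion) visible mask serves as the amodal ground truth for the now partially hidden object. This is a minimal illustrative sketch, not the cited authors' exact pipeline; the function name and array layout are assumptions.

```python
import numpy as np

def make_occlusion_pair(image, mask, occluder, occ_mask, top_left):
    """Synthesize an occluded training sample (illustrative sketch).

    image    : H x W uint8 grayscale image containing the object
    mask     : H x W boolean visible mask of the object (pre-occlusion)
    occluder : h x w uint8 patch to paste over the object
    occ_mask : h x w boolean mask of the occluder's shape
    top_left : (y, x) paste position of the occluder patch

    Returns (occluded_image, visible_mask, amodal_gt): after pasting, the
    original full mask becomes the amodal ground truth, while the visible
    mask shrinks where the occluder covers the object.
    """
    y, x = top_left
    h, w = occ_mask.shape
    out = image.copy()
    region = out[y:y + h, x:x + w]
    region[occ_mask] = occluder[occ_mask]   # paste occluder pixels in place
    amodal_gt = mask                        # full pre-occlusion mask
    visible = mask.copy()
    visible[y:y + h, x:x + w] &= ~occ_mask  # visible part shrinks under occluder
    return out, visible, amodal_gt
```

A model trained on such pairs learns to predict the full `amodal_gt` from the occluded image, which is the supervision signal the first-stage UNet provides in the excerpt.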
“…In QueryInst, queries are shared between the detection and segmentation tasks in each stage via dynamic convolutions. Unlike previous works which operate on regular dense tensors, Mask Transfiner [24] first decomposes and represents an image region as a hierarchical quadtree. Then, all points on the quadtree are transformed into a query sequence to predict labels.…”
Section: Query-based Image Segmentation
confidence: 99%
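The quadtree decomposition described above can be sketched as follows: a per-pixel error map is recursively split into quadrants, and only the centers of high-error leaf cells are collected, so the resulting point set (later flattened into a transformer query sequence) concentrates on hard regions such as object boundaries. This is a minimal sketch of the idea, assuming a precomputed uncertainty map; Mask Transfiner's actual node-selection criterion and features differ.

```python
import numpy as np

def quadtree_points(err, depth=3, thresh=0.5, x0=0, y0=0, size=None):
    """Recursively split a square error map into a quadtree and collect
    the centers of leaf cells whose mean error exceeds `thresh`.

    err : 2D array of per-pixel mask uncertainty (hypothetical input).
    Returns a list of (y, x) points to refine.
    """
    if size is None:
        size = err.shape[0]
    cell = err[y0:y0 + size, x0:x0 + size]
    # Leaf: maximum depth reached or single pixel.
    if depth == 0 or size == 1:
        if cell.mean() > thresh:
            return [(y0 + size // 2, x0 + size // 2)]
        return []
    # Otherwise split into four child quadrants.
    half = size // 2
    pts = []
    for dy in (0, half):
        for dx in (0, half):
            pts += quadtree_points(err, depth - 1, thresh, x0 + dx, y0 + dy, half)
    return pts
```

In a transformer-based refiner, the collected points would then be flattened into a sequence and attended over jointly, which is the query-sequence view the excerpt refers to.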
“…Recently proposed instance segmentation methods [5,8,9,13,15,16,24,33,43,47] have achieved remarkable performance owing to the availability of abundant segmentation labels for training. However, compared to other label types (e.g., bounding box or point), segmentation labels necessitate delicate pixel-level annotations, demanding far more monetary cost and human effort.…”
Section: Introduction
confidence: 99%