2022
DOI: 10.48550/arxiv.2206.02777
Preprint

Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation

Abstract: In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process…
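
The mask prediction step described in the abstract (query embeddings dot-producted with a pixel embedding map) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `predict_masks`, the array shapes, the zero threshold, and the random stand-in inputs are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def predict_masks(query_embed, pixel_embed, threshold=0.0):
    """Dot-product each query embedding with every pixel embedding.

    query_embed: (num_queries, dim) array, one embedding per query.
    pixel_embed: (dim, height, width) array, the pixel embedding map.
    Returns (logits, masks), both shaped (num_queries, height, width).
    """
    # logits[q, h, w] = sum over c of query_embed[q, c] * pixel_embed[c, h, w]
    logits = np.einsum("qc,chw->qhw", query_embed, pixel_embed)
    # Assumed binarization rule: a logit above 0 corresponds to sigmoid > 0.5.
    masks = logits > threshold
    return logits, masks


# Random arrays stand in for the real embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
num_queries, dim, height, width = 3, 8, 4, 5
queries = rng.standard_normal((num_queries, dim))
pixels = rng.standard_normal((dim, height, width))

logits, masks = predict_masks(queries, pixels)
print(masks.shape, masks.dtype)
```

Each query yields one binary mask over the full pixel grid, so the number of predicted masks equals the number of queries; in a real model the embeddings would come from the Transformer decoder and the image features rather than from a random generator.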

Citations: cited by 27 publications (34 citation statements)
References: 23 publications (72 reference statements)
“…A recurring theme in all three modes of ML is the high complexity of their models. This could be caused by a high-dimensional input size like classifying a high-resolution image database [72], or a complex problem like image segmentation [73]. A commonly used, but known to be inaccurate [74,75], measure of complexity is the parameter count of an ML model.…”
Section: Exponential Growth Of Practical Machine Learning Models
Citation type: mentioning
confidence: 99%
“…With multiple stages involved, learning is often not end-to-end. Recent work has proposed end-to-end approaches with Transformer based architectures [16,17,33,35,58,67,68], for which the model directly predicts segmentation masks and optimizes based on a bipartite graph matching loss. Nevertheless, they still require customized architectures (e.g., per instance mask generation, and mask fusion module).…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…merging multiple predictions [14,30,34,40,65,69]. Recently, end-to-end methods [16,17,33,35,58,67,68] have been proposed, based on a differentiable bipartite graph matching [7]; this effectively converts a one-to-many mapping into a one-to-one mapping based on the identified matching. However, such methods still require customized architectures and specialized loss functions with built-in inductive bias for the panoptic segmentation task.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Transformer-based networks were successfully applied in various computer vision tasks and held impressive results. Mask DINO [11] extends DINO [12] by adding a new branch to perform mask prediction for panoptic, instance and semantic segmentation. Content query embeddings from DINO [12] are used to perform mask classification for all segmentation tasks.…”
Section: Semantic Instance Segmentation
Citation type: mentioning
confidence: 99%