DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR

Liu, Shilong; Li, Feng; Zhang, Hao; Xiao, Yang; Qi, Xianbiao; Su, Hang; Zhu, Jun; Zhang, Lei

doi:10.48550/arxiv.2201.12329

Cited by 58 publications

(114 citation statements)

References 16 publications

(38 reference statements)

Supporting

Mentioning

114

Contrasting


“…Vision Transformer(ViT) [18,5] achieved state-of-the-art results on various vision tasks. To increase the convergence speed and improve accuracy, well-explored locality inductive bias have been reintroduced into vision transformer [66,22,62,41,27,61,51,19,56,26], among which, hybrid architecture of convolution and transformer design [49,57,12,21,34] can achieve state-of-the-art performance of a wide range of tasks. Our ConvMAE is highly motivated by the hybrid architecture design [21,34,12,57] in vision backbones.…”

Section: Related Work (mentioning)

confidence: 99%

ConvMAE: Masked Convolution Meets Masked Autoencoders

Gao, Ma, Li et al. 2022

Preprint


Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding [2,1,28,55] for feature pretraining and multi-scale hybrid convolution-transformer architectures [12,21,49,34,57] can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, detection and semantic segmentation. In this paper, our ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme. However, directly using the original masking strategy leads to heavy computational cost and a pretraining-finetuning discrepancy. To tackle the issue, we adopt masked convolution to prevent information leakage in the convolution blocks. A simple block-wise masking strategy is proposed to ensure computational efficiency. We also propose to more directly supervise the multi-scale features of the encoder to boost multi-scale features. ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection, ConvMAE-Base finetuned for only 25 epochs surpasses MAE-Base finetuned for 100 epochs by 2.9% box AP and 2.2% mask AP respectively. Code and pretrained models are available at https://github.com/Alpha-VL/ConvMAE.


“…Detection: Mainstream detection algorithms have been dominated by convolutional neural network-based frameworks, until recently Transformer-based detectors [2,22,18,37] achieve great progress. DETR [2] is the first end-to-end and query-based Transformer object detector, which adopts a set-prediction objective with bipartite matching.…”

Section: Related Work (mentioning)

confidence: 99%

“…Although DETR addresses both the object detection and panoptic segmentation tasks, its segmentation performance is still inferior to classical segmentation models. To improve the detection and segmentation performance of query-based models, researchers have developed specialized models for object detection [40,22,18,37], image segmentation [38,6,4], instance segmentation [10], panoptic segmentation [27], and semantic segmentation [14].…”

Section: Introduction (mentioning)

confidence: 99%

“…Among the efforts to improve object detection, DINO (DETR with Improved Denoising Anchor Boxes) [37] takes advantage of the dynamic anchor box formulation from DAB-DETR [22] and query denoising training from DN-DETR [18], and further develops contrastive denoising training, mixed query selection, and look forward twice methods to accelerate training and improve the detection performance. As a result, DINO achieves the SOTA result on the COCO object detection leaderboard for the first time as a DETR-like model.…”

Section: Introduction (mentioning)

confidence: 99%


Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation

Li, Zhang, Xu et al. 2022

Preprint

Self Cite


In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, scalable, and benefits from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K). Code will be available at https://github.com/IDEACVR/MaskDINO. * Equal contribution. † This work was done when Feng Li and Hao Zhang were interns at IDEA.


“…Following, Deformable DETR [51] develops a sparse attention module named deformable attention to fasten the convergence speed of DETR. Sharing the same spirit, many researchers [9,26,48,29] proposed various schemes to speed up the convergence of DETR. More recently, Wang et al pointed out that DETR has the issue of data hunger and proposed to solve it by augmenting the supervision.…”

Section: Related Work (mentioning)

confidence: 99%

Improving Transferability for Domain Adaptive Detection Transformers

Gong, Li, Li et al. 2022

Preprint


DETR-style detectors stand out amongst in-domain scenarios, but their properties in domain shift settings are under-explored. This paper aims to build a simple but effective baseline with a DETR-style detector on domain shift settings based on two findings. For one, mitigating the domain shift on the backbone and the decoder output features excels in getting favorable results. For another, advanced domain alignment methods in both parts further enhance the performance. Thus, we propose the Object-Aware Alignment (OAA) module and the Optimal Transport based Alignment (OTA) module to achieve comprehensive domain alignment on the outputs of the backbone and the detector. The OAA module aligns the foreground regions identified by pseudo-labels in the backbone outputs, leading to domain-invariant based features. The OTA module utilizes sliced Wasserstein distance to maximize the retention of location information while minimizing the domain gap in the decoder outputs. We implement the findings and the alignment modules into our adaptation method, and it benchmarks the DETR-style detector on the domain shift settings. Experiments on various domain adaptive scenarios validate the effectiveness of our method.
