2022
DOI: 10.1609/aaai.v36i3.20158

Anchor DETR: Query Design for Transformer-Based Detector

Abstract: In this paper, we propose a novel query design for transformer-based object detection. In previous transformer-based detectors, the object queries are a set of learned embeddings. However, each learned embedding has no explicit physical meaning, and we cannot explain where it will focus. Optimization is also difficult because the prediction slot of each object query has no specific mode; in other words, each object query does not focus on a specific region. To solve these problems, in our qu…
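To make the contrast with learned embeddings concrete, here is a minimal sketch of anchor-point-based queries in PyTorch. It assumes a sine positional encoding over a uniform grid of anchors; the names (sine_embed, the grid size) are illustrative, not the paper's exact implementation.

```python
# Minimal sketch of anchor-point-based object queries (illustrative,
# not the paper's exact code).
import math
import torch

def sine_embed(coords: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Map normalized (x, y) anchor points in [0, 1] to a dim-D embedding."""
    half = dim // 2                           # channels per coordinate
    freqs = 10000 ** (torch.arange(half // 2) * 2.0 / half)
    out = []
    for c in coords.unbind(-1):               # x, then y
        angles = c[..., None] * 2 * math.pi / freqs
        out.append(torch.cat([angles.sin(), angles.cos()], dim=-1))
    return torch.cat(out, dim=-1)             # (..., dim)

# A uniform grid of anchor points: each query now has an explicit
# physical meaning, namely the image location it should focus on.
side = 10                                     # 10 x 10 = 100 queries
ys, xs = torch.meshgrid(torch.linspace(0.05, 0.95, side),
                        torch.linspace(0.05, 0.95, side), indexing="ij")
anchors = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (100, 2)
queries = sine_embed(anchors)                 # (100, 256) decoder queries
print(queries.shape)                          # torch.Size([100, 256])
```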

Cited by 183 publications (98 citation statements)
References 24 publications
“…DETR for object detection. With the pioneering work DETR [5] introducing transformers [57] to 2D object detection, more and more follow-up works [44,13,10,62] have built various advanced extensions based on DETR because it removes the need for many hand-designed components like non-maximum suppression [47] or initial anchor boxes generation [16,51,32,35]. Deformable-DETR [81] introduced the multi-scale deformable self/cross-attention scheme, which attends to only a small set of key sampling points around a reference and achieves better performance than DETR (especially on small objects).…”
Section: Related Work
confidence: 99%
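To illustrate the sampling idea in the excerpt above, the following is a hedged single-scale, single-head sketch of deformable attention in PyTorch: each query predicts K offsets around its reference point and attends only to those K bilinearly sampled features instead of all H*W keys. The names (DeformAttnSketch, offsets, weights) are illustrative; the actual Deformable-DETR module is multi-scale and multi-head.

```python
# Single-scale, single-head sketch of deformable attention (assumption:
# a simplified stand-in, not the Deformable-DETR implementation).
import torch
import torch.nn.functional as F
from torch import nn

class DeformAttnSketch(nn.Module):
    def __init__(self, dim: int = 256, k: int = 4):
        super().__init__()
        self.offsets = nn.Linear(dim, k * 2)   # K (dx, dy) offsets per query
        self.weights = nn.Linear(dim, k)       # K attention weights per query
        self.k = k

    def forward(self, query, ref, feat):
        # query: (B, Nq, C), ref: (B, Nq, 2) in [0, 1], feat: (B, C, H, W)
        b, nq, _ = query.shape
        off = self.offsets(query).view(b, nq, self.k, 2)
        loc = (ref[:, :, None] + off).clamp(0, 1) * 2 - 1   # to [-1, 1]
        # Attend to only K sampled points per query, not all H*W keys.
        sampled = F.grid_sample(feat, loc, align_corners=False)  # (B, C, Nq, K)
        w = self.weights(query).softmax(-1)                  # (B, Nq, K)
        return (sampled * w[:, None]).sum(-1).transpose(1, 2)   # (B, Nq, C)
```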
“…POTO [59] proposes to assign the anchor with either the maximum IoU or the one closest to the object center as the positive sample, which is modified from the strategies of RetinaNet [32] or FCOS. DETR [5] and its follow-ups [44,4,62,81,34,22] apply Hungarian matching to compute one-to-one positive assignments based on the global minimum matching cost between all predictions and the ground-truth boxes. Different from the most related work POTO [59], which only uses one-to-many assignment, based on ATSS [76], to help the classification branch of FCOS [56], our approach uses Hungarian matching to perform both one-to-one and one-to-many matching following DETR, and generalizes to various vision tasks.…”
Section: Related Work
confidence: 99%
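As a concrete illustration of the one-to-one assignment mentioned above, here is a minimal sketch of Hungarian matching between predictions and ground-truth boxes using scipy. The cost terms are simplified (negative class probability plus an L1 box distance); DETR's actual matching cost also includes a generalized IoU term.

```python
# Sketch of DETR-style one-to-one matching with simplified costs.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes):
    # pred_probs: (N, num_classes); pred_boxes: (N, 4); gt_boxes: (M, 4)
    cls_cost = -pred_probs[:, gt_labels]                    # (N, M)
    box_cost = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)  # L1
    rows, cols = linear_sum_assignment(cls_cost + box_cost)
    return rows, cols   # prediction rows[j] is matched to ground truth cols[j]

# Toy usage: 4 predictions, 2 objects; each object gets exactly one prediction.
probs = np.random.rand(4, 3); probs /= probs.sum(1, keepdims=True)
rows, cols = hungarian_match(probs, np.random.rand(4, 4),
                             np.array([0, 2]), np.random.rand(2, 4))
print(rows, cols)
```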
“…PnP-DETR [54] proposes a poll-and-pool sampling strategy in its attention mechanism. Besides, Conditional DETR [10], SMCA-DETR [11], Anchor DETR [12], and DAB-DETR [55] make substantial modifications to the attention mechanism, aiming to add spatial constraints to the original cross-attention to better focus on prominent regions. Furthermore, the recently proposed DN-DETR [30] designs a novel de-noising training strategy to speed up DETR's training procedure, which also achieves very promising results.…”
Section: Transformer-Based Object Detection
confidence: 99%
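The "spatial constraints" these works add can be pictured as a decomposition of the cross-attention logits into a content term and a spatial term computed from positional embeddings, so each query can localize a region regardless of content. The sketch below is an illustrative simplification loosely in the spirit of Conditional DETR, not a reimplementation of any of the cited methods.

```python
# Illustrative decomposition of cross-attention logits into
# content and spatial terms (simplified, single head).
import torch

def constrained_attn(q_content, q_spatial, k_content, k_pos, v):
    # q_*: (Nq, C), k_*: (Nk, C), v: (Nk, C)
    logits = q_content @ k_content.T + q_spatial @ k_pos.T   # (Nq, Nk)
    attn = torch.softmax(logits / q_content.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                          # (Nq, C)
```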
“…After distilling relevant features, the object queries are used to generate instance-level detection predictions and to repeat the subsequent 'matching and feature distillation' process for refined predictions. However, as pointed out in [9,10,11,12,13], it is difficult for object queries to learn to match appropriate regions. As illustrated in Fig.…”
Section: Introduction
confidence: 99%