2022
DOI: 10.1609/aaai.v36i3.20158

Anchor DETR: Query Design for Transformer-Based Detector

Abstract: In this paper, we propose a novel query design for transformer-based object detection. In previous transformer-based detectors, the object queries are a set of learned embeddings. However, each learned embedding has no explicit physical meaning, and we cannot explain where it will focus. Optimization is also difficult because the prediction slot of each object query has no specific mode; in other words, each object query does not focus on a specific region. To solve these problems, in our qu…
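To make the contrast with learned embeddings concrete, here is a minimal sketch of anchor-point-based queries in PyTorch. It assumes a sine positional encoding over a uniform grid of anchors; the names (sine_embed, the grid size) are illustrative, not the paper's exact implementation.

```python
# Minimal sketch of anchor-point-based object queries (illustrative,
# not the paper's exact code).
import math
import torch

def sine_embed(coords: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Map normalized (x, y) anchor points in [0, 1] to a dim-D embedding."""
    half = dim // 2                           # channels per coordinate
    freqs = 10000 ** (torch.arange(half // 2) * 2.0 / half)
    out = []
    for c in coords.unbind(-1):               # x, then y
        angles = c[..., None] * 2 * math.pi / freqs
        out.append(torch.cat([angles.sin(), angles.cos()], dim=-1))
    return torch.cat(out, dim=-1)             # (..., dim)

# A uniform grid of anchor points: each query now has an explicit
# physical meaning, namely the image location it should focus on.
side = 10                                     # 10 x 10 = 100 queries
ys, xs = torch.meshgrid(torch.linspace(0.05, 0.95, side),
                        torch.linspace(0.05, 0.95, side), indexing="ij")
anchors = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (100, 2)
queries = sine_embed(anchors)                 # (100, 256) decoder queries
print(queries.shape)                          # torch.Size([100, 256])
```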

Cited by 183 publications (98 citation statements)
References 24 publications
“…DETR for object detection. With the pioneering work DETR [5] introducing transformers [57] to 2D object detection, more and more follow-up works [44,13,10,62] have built various advanced extensions based on DETR because it removes the need for many hand-designed components like non-maximum suppression [47] or initial anchor boxes generation [16,51,32,35]. Deformable-DETR [81] introduced the multi-scale deformable self/cross-attention scheme, which attends to only a small set of key sampling points around a reference and achieves better performance than DETR (especially on small objects).…”
Section: Related Work
confidence: 99%
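To illustrate the sampling idea in the excerpt above, the following is a hedged single-scale, single-head sketch of deformable attention in PyTorch: each query predicts K offsets around its reference point and attends only to those K bilinearly sampled features instead of all H*W keys. The names (DeformAttnSketch, offsets, weights) are illustrative; the actual Deformable-DETR module is multi-scale and multi-head.

```python
# Single-scale, single-head sketch of deformable attention (assumption:
# a simplified stand-in, not the Deformable-DETR implementation).
import torch
import torch.nn.functional as F
from torch import nn

class DeformAttnSketch(nn.Module):
    def __init__(self, dim: int = 256, k: int = 4):
        super().__init__()
        self.offsets = nn.Linear(dim, k * 2)   # K (dx, dy) offsets per query
        self.weights = nn.Linear(dim, k)       # K attention weights per query
        self.k = k

    def forward(self, query, ref, feat):
        # query: (B, Nq, C), ref: (B, Nq, 2) in [0, 1], feat: (B, C, H, W)
        b, nq, _ = query.shape
        off = self.offsets(query).view(b, nq, self.k, 2)
        loc = (ref[:, :, None] + off).clamp(0, 1) * 2 - 1   # to [-1, 1]
        # Attend to only K sampled points per query, not all H*W keys.
        sampled = F.grid_sample(feat, loc, align_corners=False)  # (B, C, Nq, K)
        w = self.weights(query).softmax(-1)                  # (B, Nq, K)
        return (sampled * w[:, None]).sum(-1).transpose(1, 2)   # (B, Nq, C)
```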
“…POTO [59] proposes to assign the anchor with either the maximum IoU or the one closest to the object center as the positive sample, which is modified from the strategies of RetinaNet [32] or FCOS. DETR [5] and its follow-ups [44,4,62,81,34,22] apply Hungarian matching to compute one-to-one positive assignments based on the global minimum matching cost between all predictions and the ground-truth boxes. Different from the most related work POTO [59], which only uses one-to-many assignment, based on ATSS [76], to help the classification branch of FCOS [56], our approach uses Hungarian matching to perform both one-to-one and one-to-many matching following DETR, and generalizes to various vision tasks.…”
Section: Related Work
confidence: 99%
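As a concrete illustration of the one-to-one assignment mentioned above, here is a minimal sketch of Hungarian matching between predictions and ground-truth boxes using scipy. The cost terms are simplified (negative class probability plus an L1 box distance); DETR's actual matching cost also includes a generalized IoU term.

```python
# Sketch of DETR-style one-to-one matching with simplified costs.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes):
    # pred_probs: (N, num_classes); pred_boxes: (N, 4); gt_boxes: (M, 4)
    cls_cost = -pred_probs[:, gt_labels]                    # (N, M)
    box_cost = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)  # L1
    rows, cols = linear_sum_assignment(cls_cost + box_cost)
    return rows, cols   # prediction rows[j] is matched to ground truth cols[j]

# Toy usage: 4 predictions, 2 objects; each object gets exactly one prediction.
probs = np.random.rand(4, 3); probs /= probs.sum(1, keepdims=True)
rows, cols = hungarian_match(probs, np.random.rand(4, 4),
                             np.array([0, 2]), np.random.rand(2, 4))
print(rows, cols)
```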
“…PnP-DETR [54] proposes a poll-and-pool sampling strategy in its attention mechanism. Besides, Conditional DETR [10], SMCA-DETR [11], Anchor DETR [12], and DAB-DETR [55] make substantial modifications to the attention mechanism, aiming to add spatial constraints to the original cross-attention to better focus on prominent regions. Furthermore, the recently proposed DN-DETR [30] designs a novel de-noising training strategy to speed up DETR's training procedure, which also achieves very promising results.…”
Section: Transformer-Based Object Detection
confidence: 99%
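The "spatial constraints" these works add can be pictured as a decomposition of the cross-attention logits into a content term and a spatial term computed from positional embeddings, so each query can localize a region regardless of content. The sketch below is an illustrative simplification loosely in the spirit of Conditional DETR, not a reimplementation of any of the cited methods.

```python
# Illustrative decomposition of cross-attention logits into
# content and spatial terms (simplified, single head).
import torch

def constrained_attn(q_content, q_spatial, k_content, k_pos, v):
    # q_*: (Nq, C), k_*: (Nk, C), v: (Nk, C)
    logits = q_content @ k_content.T + q_spatial @ k_pos.T   # (Nq, Nk)
    attn = torch.softmax(logits / q_content.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                          # (Nq, C)
```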
“…After distilling relevant features, the object queries are used to generate instance-level detection predictions and to repeat the subsequent 'matching and feature distillation' process for refined predictions. However, as pointed out in [9,10,11,12,13], it is difficult for object queries to learn to match appropriate regions. As illustrated in Fig.…”
Section: Introduction
confidence: 99%