2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01325
DN-DETR: Accelerate DETR Training by Introducing Query DeNoising

Cited by 243 publications (148 citation statements) · References 9 publications
“…Decoder. We adopt the transformer encoder-decoder framework as the decoder, which has shown promising detection results in DETR [2], Conditional DETR [17], DAB-DETR [14], Deformable DETR [30], DN-DETR [10], and DINO [28]. Group DETR [3] further improves the training convergence speed and the detection performance of various DETR variants.…”
Section: Architecture
confidence: 99%
“…Unlike previous methods, our proposed CFT does not estimate depth and completely removes camera parameters. Inspired by advanced vision transformers [29,24,20], CFT decouples the positional and content embeddings in the position-aware enhancement and further mines richer 3D information, thereby effectively learning stable BEV representations. Instead of point-wise attention with camera guidance or redundant global attention, a view attention is presented to reduce the computational cost and accelerate the establishment of transformation relations.…”
Section: Related Work
confidence: 99%
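The statement above refers to decoupling positional and content embeddings in attention, a design used in decoders such as Conditional DETR and DAB-DETR, which are cited throughout this record. Below is a minimal sketch of what such decoupling can look like in a single attention head; the function name, shapes, and single-head formulation are illustrative assumptions, not the CFT implementation.

```python
# Hedged sketch: decoupled content/positional cross-attention.
# Concatenating the content and positional parts of queries and keys makes each
# attention logit the sum of a content-content term and a position-position term.
import torch

def decoupled_cross_attention(q_content, q_pos, k_content, k_pos, v):
    # q_content, q_pos: (num_queries, d); k_content, k_pos, v: (num_keys, d)
    q = torch.cat([q_content, q_pos], dim=-1)      # (num_queries, 2d)
    k = torch.cat([k_content, k_pos], dim=-1)      # (num_keys, 2d)
    logits = q @ k.t() / (q.shape[-1] ** 0.5)      # (num_queries, num_keys)
    attn = logits.softmax(dim=-1)
    return attn @ v                                # (num_queries, d)

# Toy usage with random embeddings.
d, nq, nk = 32, 5, 10
out = decoupled_cross_attention(torch.randn(nq, d), torch.randn(nq, d),
                                torch.randn(nk, d), torch.randn(nk, d),
                                torch.randn(nk, d))
print(out.shape)  # torch.Size([5, 32])
```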
“…As a new paradigm for object detection, the detection transformer (DETR) [13] eliminates the need for hand-designed components and shows promising performance compared with most classical detectors based on convolutional architectures, owing to the global information captured by self-attention [14]. In the ensuing years, many improved DETR-like methods [15][16][17] have been proposed to address the slow training convergence of DETR and the unclear meaning of its queries. Among them, DETR with improved denoising anchor boxes (DINO) [18] became a new state-of-the-art approach on COCO 2017 [19], proving that transformer-based object-detection models can also achieve superior performance.…”
Section: Introduction
confidence: 99%
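The statement above points to the denoising idea introduced by DN-DETR and extended by DINO: perturbed ground-truth boxes are fed as extra decoder queries whose target is the original box, giving the decoder an easier auxiliary reconstruction task that speeds convergence. Below is a minimal sketch of the box-noising step only; the function name, noise scale, and box format are illustrative assumptions rather than the paper's exact implementation.

```python
# Hedged sketch of generating noised ground-truth boxes as denoising queries.
import torch

def make_denoising_queries(gt_boxes, box_noise_scale=0.4):
    # gt_boxes: (num_gt, 4) in normalized (cx, cy, w, h) format.
    noise = (torch.rand_like(gt_boxes) * 2 - 1) * box_noise_scale
    noisy = gt_boxes.clone()
    noisy[:, :2] += noise[:, :2] * gt_boxes[:, 2:]   # jitter centers by a fraction of w, h
    noisy[:, 2:] *= (1 + noise[:, 2:])               # jitter widths and heights
    return noisy.clamp(0.0, 1.0)                     # keep boxes inside the image

gt = torch.tensor([[0.5, 0.5, 0.2, 0.3]])
noisy_queries = make_denoising_queries(gt)
# A denoising loss would regress `noisy_queries` back to `gt`, alongside the
# usual Hungarian-matched loss on the learnable object queries.
```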
“…Compared with the well-established CNN-based detectors, efficient domain-adaptation methods for enhancing the cross-domain performance of DETR-like detectors remain rarely explored. The design draws on DN-DETR [17], DAB-DETR [16], and Deformable DETR [15], with DINO achieving exceptional results on public datasets. However, as with previous object detectors, it cannot be directly applied to new scenarios when environmental conditions change, which results in significant performance degradation.…”
Section: Introduction
confidence: 99%