2022
DOI: 10.48550/arxiv.2206.14451
Preprint
SRCN3D: Sparse R-CNN 3D Surround-View Camera Object Detection and Tracking for Autonomous Driving

Cited by 1 publication (2 citation statements)
References 0 publications
“…Multi-camera 3D object detection predicts the 3D bounding boxes of the objects of interest from the input surrounding views. Motivated by typical works in 2D detection [6,55,38], researchers combine 3D priors and propose different 3D object detection frameworks to directly achieve sparse object-level feature extraction [45,9,37,11]. In recent new paradigms of autonomous driving, the BEV space attracts much attention because of its advantages in perception [35,47,56,7,33], prediction [1,14], multi-task learning [46,51,28,5], downstream planning [33], etc.…”
Section: Related Work
confidence: 99%
“…3D object detection from multi-camera 2D images is a critical perception technique for autonomous driving systems, compared to expensive LiDAR-based [10,18,3] or multi-modal approaches [42,41,50,49,2,30,12]. Recent approaches emphasize transforming 2D image features to sparse instance-level [9,37,45] or dense Bird's Eye View (BEV) representations [16,22,26], characterizing the 3D structure of the surrounding environment. Although some depth-based detectors [16,17,21,26,51] incorporate depth estimation to introduce such 3D information, extra depth supervision is required for more precise detection.…”
Section: Introduction
confidence: 99%
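The "dense BEV" idea mentioned in the citation statement above can be sketched with a minimal lift-and-splat step: weight per-pixel image features by a predicted depth distribution, back-project each pixel at each depth bin with a pinhole model, and accumulate the result into a bird's-eye-view grid. This is an illustrative sketch only, not the SRCN3D method; all names, shapes, and the simple pinhole geometry are assumptions.

```python
import numpy as np

def lift_splat_bev(feats, depth_probs, fx, cx, depth_bins,
                   bev_size=32, bev_range=40.0):
    """Illustrative lift-and-splat (assumed shapes, not SRCN3D itself).

    feats:       (C, H, W) image feature map.
    depth_probs: (D, H, W) softmax over D discrete depth bins per pixel.
    fx, cx:      assumed pinhole intrinsics (focal length, principal point).
    depth_bins:  list of D metric depths, one per bin.
    Returns a (C, bev_size, bev_size) BEV feature grid covering
    [-bev_range, bev_range] laterally and [0, 2*bev_range] forward.
    """
    C, H, W = feats.shape
    bev = np.zeros((C, bev_size, bev_size))
    cell = (2.0 * bev_range) / bev_size          # metres per BEV cell
    for d_idx, z in enumerate(depth_bins):       # each discrete depth
        for u in range(W):                       # each image column
            x = (u - cx) * z / fx                # pinhole back-projection
            ix = int((x + bev_range) // cell)    # lateral BEV index
            iz = int(z // cell)                  # forward BEV index
            if 0 <= ix < bev_size and 0 <= iz < bev_size:
                # weight the column's features by its mean depth
                # probability for this bin, pooling over image rows
                w = depth_probs[d_idx, :, u].mean()
                bev[:, iz, ix] += w * feats[:, :, u].mean(axis=1)
    return bev
```

In practice such splatting is done with vectorized scatter operations rather than Python loops, and the depth distribution is learned end-to-end; this loop version only makes the geometry explicit.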