BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

Zhang, Yunpeng; Zhu, Zheng; Zheng, Wenzhao; Huang, Junjie; Guan, Huaijin; Zhou, Jie; Lu, Jiwen

doi:10.48550/arxiv.2205.09743

Cited by 22 publications

(35 citation statements)

References 56 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…For the 60m × 30m setting, we adopt VPN [15], Lift- Splat-Shoot [18], HDMapNet [10], BEVSegFormer [17], and BEVerse [24] for comparsion. The comparison results are shown in Tab.…”

Section: Resultsmentioning

confidence: 99%

“…Temporal fusion in BEV With the basis of spatial fusion, temporal fusion could further boost the representation in BEV space. The mainstream methods of temporal fusion are the warp-based method [8,11,24]. The main idea of the warp-based method is to warp and align BEV spaces at different time steps based on the ego motions of vehicles.…”

Section: Related Workmentioning

confidence: 99%

“…The first one is the 100m × 100m setting [11,18,23] with two classes road and lane. The other one is the 60m × 30m setting [10,17,24] with three classes boundary, divider, and ped crossing. In this work, we also propose a new 160m × 100m setting for a more comprehensive evaluation, as shown in Tab.…”

Section: Dataset and Evaluation Settingsmentioning

confidence: 99%

See 2 more Smart Citations

UniFusion: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View

Qin¹,

Chen²,

Chen³

et al. 2022

Preprint

View full text Add to dashboard Cite

Bird's eye view (BEV) representation is a new perception formulation for autonomous driving, which is based on spatial fusion. Further, temporal fusion is also introduced in BEV representation and gains great success. In this work, we propose a new method that unifies both spatial and temporal fusion and merges them into a unified mathematical formulation. The unified fusion could not only provide a new perspective on BEV fusion but also brings new capabilities. With the proposed unified spatial-temporal fusion, our method could support long-range fusion, which is hard to achieve in conventional BEV methods. Moreover, the BEV fusion in our work is temporal-adaptive and the weights of temporal fusion are learnable. In contrast, conventional methods mainly use fixed and equal weights for temporal fusion. Besides, the proposed unified fusion could avoid information lost in conventional BEV fusion methods and make full use of features. Extensive experiments and ablation studies on the NuScenes dataset show the effectiveness of the proposed method and our method gains the state-of-the-art performance in the map segmentation task.

show abstract

“…For the 60m × 30m setting, we adopt VPN [15], Lift- Splat-Shoot [18], HDMapNet [10], BEVSegFormer [17], and BEVerse [24] for comparsion. The comparison results are shown in Tab.…”

Section: Resultsmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

See 1 more Smart Citation

UniFusion: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View

Qin¹,

Chen²,

Chen³

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

“…StretchBEV [101] samples latent variables at each timestamp and estimates residual changes for producing future states. To reduce the memory consumption, BEVerse [72] designs iterative flow for efficient generation of future states and jointly reasons 3D detection, semantic map reconstruction, and motion prediction tasks Fig. 14.…”

Section: Multi-task Learning Under Bevmentioning

confidence: 99%

Vision-Centric BEV Perception: A Survey

Ma¹,

Wang²,

Bai³

et al. 2022

Preprint

View full text Add to dashboard Cite

Vision-centric BEV perception has recently received increased attention from both industry and academia due to its inherent merits, including presenting a natural representation of the world and being fusion-friendly. With the rapid development of deep learning, numerous methods have been proposed to address the vision-centric BEV perception. However, there is no recent survey for this novel and growing research field. To stimulate its future research, this paper presents a comprehensive survey of recent progress of vision-centric BEV perception and its extensions. It collects and organizes the recent knowledge, and gives a systematic review and summary of commonly used algorithms. It also provides in-depth analyses and comparative results on several BEV perception tasks, facilitating the comparisons of future works and inspiring future research directions. Moreover, empirical implementation details are also discussed and shown to benefit the development of related algorithms.

show abstract

“…Recent approaches emphasize transforming 2D image features to sparse instance-level [9,37,45] or dense Bird's Eye View (BEV) representation [16,22,26], characterizing the 3D structure of the surrounding environment. Although some depth-based detectors [16,17,21,26,51] incorporate depth estimation to introduce such 3D information, the extra depth supervision is acquired for preciser detection. Therefore, other paradigms [22,45] directly learn the transformation based on the attention mechanism [40].…”

Section: Introductionmentioning

confidence: 99%

Multi-Camera Calibration Free BEV Representation for 3D Object Detection

Jiang¹,

Meng²,

Zhu³

et al. 2022

Preprint

View full text Add to dashboard Cite

In advanced paradigms of autonomous driving, learning Bird's Eye View (BEV) representation from surrounding views is crucial for multi-task framework. However, existing methods based on depth estimation or camera-driven attention are not stable to obtain transformation under noisy camera parameters, mainly with two challenges, accurate depth prediction and calibration. In this work, we present a completely Multi-Camera Calibration Free Transformer (CFT) for robust BEV representation, which focuses on exploring implicit mapping, not relied on camera intrinsics and extrinsics. To guide better feature learning from image views to BEV, CFT mines potential 3D information in BEV via our designed position-aware enhancement (PA). Instead of camera-driven point-wise or global transformation, for interaction within more effective region and lower computation cost, we propose a view-aware attention which also reduces redundant computation and promotes converge. CFT achieves 49.7% NDS on the nuScenes detection task leaderboard, which is the first work removing camera parameters, comparable to other geometry-guided methods. Without temporal input and other modal information, CFT achieves second highest performance with a smaller image input (1600 × 640). Thanks to view-attention variant, CFT reduces memory and transformer FLOPs for vanilla attention by about 12% and 60%, respectively, with improved NDS by 1.0%. Moreover, its natural robustness to noisy camera parameters makes CFT more competitive.

show abstract

BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

Cited by 22 publications

References 56 publications

UniFusion: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View

UniFusion: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View

Vision-Centric BEV Perception: A Survey

Multi-Camera Calibration Free BEV Representation for 3D Object Detection

Contact Info

Product

Resources

About