2022
DOI: 10.48550/arxiv.2205.09743
Preprint

BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

Abstract: In this paper, we present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems. Unlike existing studies focusing on the improvement of single-task approaches, BEVerse produces spatio-temporal Bird's-Eye-View (BEV) representations from multi-camera videos and jointly reasons about multiple tasks for vision-centric autonomous driving. Specifically, BEVerse first performs shared feature extraction and lifting to generate 4D BEV representations from multi-timesta…

Cited by 22 publications (35 citation statements)
References 56 publications
“…For the 60m × 30m setting, we adopt VPN [15], Lift-Splat-Shoot [18], HDMapNet [10], BEVSegFormer [17], and BEVerse [24] for comparison. The comparison results are shown in Tab.…”
Section: Results
confidence: 99%
“…Temporal fusion in BEV: Building on spatial fusion, temporal fusion can further strengthen the representation in BEV space. The mainstream temporal-fusion methods are warp-based [8,11,24]. The main idea of the warp-based method is to warp and align the BEV spaces of different time steps based on the ego motion of the vehicle.…”
Section: Related Work
confidence: 99%
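The warp-based alignment described in this excerpt can be illustrated with a minimal sketch. This is not code from any of the cited papers; the function name, the nearest-neighbour sampling, and the (dx, dy, dθ) ego-motion parameterization are all illustrative assumptions. It pulls each cell of the current-frame BEV grid from its location in the previous-frame grid under the inverse ego motion:

```python
import numpy as np

def warp_bev(prev_bev, dx, dy, dtheta, resolution=0.5):
    """Warp a previous-frame BEV feature map into the current ego frame.

    prev_bev: (C, H, W) feature map; dx, dy: ego translation in metres;
    dtheta: ego yaw change in radians; resolution: metres per BEV cell.
    Nearest-neighbour sampling; cells with no valid source are zero-filled.
    """
    C, H, W = prev_bev.shape
    # Cell centres of the target (current-frame) grid, in metres, origin at map centre.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x_m = (xs - W / 2) * resolution
    y_m = (ys - H / 2) * resolution
    # Inverse ego motion: map current-frame coordinates back into the previous frame.
    cos_t, sin_t = np.cos(-dtheta), np.sin(-dtheta)
    x_prev = cos_t * (x_m - dx) - sin_t * (y_m - dy)
    y_prev = sin_t * (x_m - dx) + cos_t * (y_m - dy)
    # Convert back to cell indices and sample where the source lies inside the grid.
    xi = np.round(x_prev / resolution + W / 2).astype(int)
    yi = np.round(y_prev / resolution + H / 2).astype(int)
    valid = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
    out = np.zeros_like(prev_bev)
    out[:, ys[valid], xs[valid]] = prev_bev[:, yi[valid], xi[valid]]
    return out
```

In practice the cited methods implement this warp differentiably (e.g. with bilinear grid sampling) inside the network; the nearest-neighbour version above only shows the geometry.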
“…StretchBEV [101] samples latent variables at each timestamp and estimates residual changes to produce future states. To reduce memory consumption, BEVerse [72] designs an iterative flow for efficient generation of future states and jointly reasons about 3D detection, semantic map reconstruction, and motion prediction (Fig. 14).…”
Section: Multi-task Learning Under BEV
confidence: 99%
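The "iterative flow" idea this excerpt attributes to BEVerse — rolling a BEV state forward one step at a time with a predicted flow field, instead of predicting all future states at once — can be sketched as follows. This is not BEVerse's actual implementation; the backward-flow convention, nearest-neighbour sampling, and function name are assumptions for illustration:

```python
import numpy as np

def roll_out_future_states(bev, flows):
    """Generate future BEV states by iteratively warping with predicted flow.

    bev: (C, H, W) current BEV state; flows: list of (2, H, W) backward flow
    fields, one per future step (flow[0] = dy, flow[1] = dx in cells,
    pointing from each target cell back to its source cell).
    Returns the list of predicted future states, one per flow field.
    """
    C, H, W = bev.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    states, cur = [], bev
    for flow in flows:
        # Backward warp: each target cell pulls features from its source cell.
        yi = np.clip(np.round(ys + flow[0]).astype(int), 0, H - 1)
        xi = np.clip(np.round(xs + flow[1]).astype(int), 0, W - 1)
        cur = cur[:, yi, xi]
        states.append(cur)
    return states
```

The memory saving the excerpt mentions comes from this recurrence: only the current state and one flow field are held per step, rather than materializing every future state simultaneously.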
“…Recent approaches emphasize transforming 2D image features into sparse instance-level [9,37,45] or dense Bird's Eye View (BEV) representations [16,22,26], characterizing the 3D structure of the surrounding environment. Although some depth-based detectors [16,17,21,26,51] incorporate depth estimation to introduce such 3D information, extra depth supervision is required for more precise detection. Therefore, other paradigms [22,45] directly learn the transformation with an attention mechanism [40].…”
Section: Introduction
confidence: 99%
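The depth-based 2D-to-BEV lifting this excerpt refers to (the Lift-Splat family) weights each pixel's feature by a categorical depth distribution, back-projects it at each candidate depth, and sum-pools into BEV cells. A minimal one-image-row sketch — not code from any cited paper, with all names, the pinhole model, and the collapsed depth axis being simplifying assumptions:

```python
import numpy as np

def lift_splat_row(feat, depth_prob, depths, fx, cx, bev_w, res):
    """Lift one row of image features into a 1-D lateral BEV line.

    feat: (C, W) features for one image row; depth_prob: (D, W) categorical
    depth distribution per pixel; depths: (D,) candidate depths in metres;
    fx, cx: pinhole intrinsics; bev_w: number of lateral BEV cells;
    res: metres per cell. Returns (C, bev_w) sum-pooled BEV features.
    """
    C, W = feat.shape
    bev = np.zeros((bev_w, C))
    us = np.arange(W)
    for d_idx, d in enumerate(depths):
        # Back-project each pixel at this candidate depth: x = (u - cx) * d / fx.
        x = (us - cx) * d / fx
        cells = np.round(x / res + bev_w / 2).astype(int)
        valid = (cells >= 0) & (cells < bev_w)
        # Splat depth-weighted features into BEV cells, accumulating duplicates.
        weighted = (feat[:, valid] * depth_prob[d_idx, valid]).T  # (N, C)
        np.add.at(bev, cells[valid], weighted)
    return bev.T
```

The excerpt's point is visible here: the quality of `depth_prob` decides where features land, which is why the cited depth-based detectors benefit from explicit depth supervision, while attention-based paradigms learn the mapping without it.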