2022
DOI: 10.48550/arxiv.2203.04050
Preprint
BEVSegFormer: Bird's Eye View Semantic Segmentation From Arbitrary Camera Rigs

Abstract: Semantic segmentation in bird's eye view (BEV) is an important task for autonomous driving. Though this task has attracted a large amount of research effort, it remains challenging to flexibly cope with arbitrary (single or multiple) camera sensors equipped on the autonomous vehicle. In this paper, we present BEVSegFormer, an effective transformer-based method for BEV semantic segmentation from arbitrary camera rigs. Specifically, our method first encodes image features from arbitrary cameras with a shared ba…

Cited by 11 publications (18 citation statements). References 27 publications.
“…We can observe that BEVerse-Tiny already obtains the mIoU of 48.7 and outperforms existing methods. Furthermore, BEVerse-Small achieves 51.7 mIoU, which is 7.1 points higher than the previous best method [49]. Motion prediction.…”
Section: Results (mentioning, confidence: 82%)
“…It also proposes a learning method to build BEV features from sensory input and predicts vectorized map elements. BEVSegFormer [49] proposes the multi-camera deformable attention to transform image-view features to BEV representations for semantic map construction. Different from these single-task approaches, our BEVerse incorporates the semantic map construction as part of the multitask framework and uses vanilla convolutional layers for segmentation prediction.…”
Section: Semantic Map Construction (mentioning, confidence: 99%)
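The multi-camera deformable attention described in the excerpt above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: each BEV query samples every camera's feature map at a reference point plus learned offsets, and the sampled features are combined with normalized attention weights. All function names, shapes, and parameters here are illustrative assumptions.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample a (H, W, C) feature map at fractional coords (x, y)."""
    H, W, _ = feat.shape
    x0 = int(np.floor(np.clip(x, 0, W - 1)))
    y0 = int(np.floor(np.clip(y, 0, H - 1)))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx = float(np.clip(x - x0, 0.0, 1.0))
    wy = float(np.clip(y - y0, 0.0, 1.0))
    return ((1 - wx) * (1 - wy) * feat[y0, x0] + wx * (1 - wy) * feat[y0, x1]
            + (1 - wx) * wy * feat[y1, x0] + wx * wy * feat[y1, x1])

def deformable_bev_attention(cam_feats, ref_points, offsets, attn_weights):
    """For each BEV query, sample each camera's features at a reference point
    plus learned offsets, then mix with attention weights (normalized per query).

    cam_feats:    list of (H, W, C) per-camera feature maps
    ref_points:   (Q, n_cams, 2) reference (x, y) per query and camera
    offsets:      (Q, n_cams, P, 2) sampling offsets
    attn_weights: (Q, n_cams, P) weights summing to 1 per query
    """
    Q, n_cams, P, _ = offsets.shape
    C = cam_feats[0].shape[2]
    out = np.zeros((Q, C))
    for q in range(Q):
        for c in range(n_cams):
            for p in range(P):
                x = ref_points[q, c, 0] + offsets[q, c, p, 0]
                y = ref_points[q, c, 1] + offsets[q, c, p, 1]
                out[q] += attn_weights[q, c, p] * bilinear_sample(cam_feats[c], x, y)
    return out
```

Because only sparse sampling locations are visited per query, this avoids the dense image-to-BEV correspondence that a full cross-attention would require.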
“…Later, View Parsing Network (VPN) [15] uses a fully connected layer to transform the image features into the BEV features and directly supervise the features in the BEV space in an end-to-end manner. Similarly, BEVSegFormer [17] uses the deformable attention [25] mechanism to achieve end-to-end mapping. These methods avoid the explicit mapping between image and BEV spaces, but this property also makes them hard to adopt the geometry prior.…”
Section: Related Work (mentioning, confidence: 99%)
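The fully connected view transform that the excerpt attributes to VPN can be sketched in a few lines: every image-feature location is linearly mixed into every BEV cell, with no use of camera geometry. This is an illustrative sketch with random weights standing in for learned ones; all sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

Hi, Wi, C = 8, 16, 32   # image-view feature map size (illustrative)
Hb, Wb = 10, 10         # BEV grid size (illustrative)

# Learned view-transform matrix mapping every image location to every BEV cell.
# In VPN this is a fully connected layer trained end-to-end; here it is random.
M = rng.normal(scale=0.01, size=(Hb * Wb, Hi * Wi))

img_feat = rng.normal(size=(Hi, Wi, C))
flat = img_feat.reshape(Hi * Wi, C)          # (N_img, C)
bev_feat = (M @ flat).reshape(Hb, Wb, C)     # (Hb, Wb, C), geometry-free
```

The geometry-free nature is visible here: `M` is a dense matrix over all location pairs, so nothing constrains a BEV cell to attend to the pixels a calibrated camera would actually project there.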
“…The first one is the 100m × 100m setting [11,18,23] with two classes road and lane. The other one is the 60m × 30m setting [10,17,24] with three classes boundary, divider, and ped crossing. In this work, we also propose a new 160m × 100m setting for a more comprehensive evaluation, as shown in Tab.…”
Section: Dataset and Evaluation Settings (mentioning, confidence: 99%)
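The metric ranges quoted above become concrete BEV grid sizes once a cell resolution is chosen. A small sketch; the 0.5 m-per-cell resolution is an assumption for illustration, not taken from the excerpt:

```python
# The BEV evaluation settings quoted above, as (length, width) in meters.
settings = {
    "100m x 100m": (100.0, 100.0),
    "60m x 30m":   (60.0, 30.0),
    "160m x 100m": (160.0, 100.0),
}

def grid_shape(length_m, width_m, res_m):
    """Number of BEV cells along each axis for a given metric range."""
    return int(round(length_m / res_m)), int(round(width_m / res_m))

resolution = 0.5  # meters per BEV cell (assumed for illustration)
shapes = {name: grid_shape(l, w, resolution) for name, (l, w) in settings.items()}
```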