2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.01499
FIERY: Future Instance Prediction in Bird’s-Eye View from Surround Monocular Cameras

Cited by 148 publications (113 citation statements)
References 30 publications

“…PYVA [50] proposes a cross-view transformer that converts the front-view monocular image into the BEV, but this paradigm is not suitable for fusing multi-camera features due to the computational cost of the global attention mechanism [42]. In addition to spatial information, previous works [18,38,6] also incorporate temporal information by stacking BEV features from several timestamps. Stacking BEV features constrains the available temporal information to a fixed time window and incurs extra computational cost.…”
Section: Camera-based 3D Perception
confidence: 99%
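For concreteness, below is a minimal PyTorch sketch of the stacking scheme this quote criticizes. It is illustrative code, not taken from any cited paper; the horizon T, the assumption that past features are already warped into the current ego frame, and the fusion convolution are all assumptions.

```python
import torch

def stack_bev_features(bev_history, fusion_conv):
    """bev_history: list of T BEV feature maps, each (C, H, W), oldest first,
    already warped into the current ego frame."""
    stacked = torch.cat(bev_history, dim=0)      # (T*C, H, W): fixed horizon T
    return fusion_conv(stacked.unsqueeze(0))[0]  # fuse back down to (C, H, W)

# Usage: the window is fixed at T frames, and both memory and compute grow
# linearly with T, which is exactly the cost the quote points out.
T, C, H, W = 3, 64, 200, 200
history = [torch.randn(C, H, W) for _ in range(T)]
fusion_conv = torch.nn.Conv2d(T * C, C, kernel_size=3, padding=1)
current_bev = stack_bev_features(history, fusion_conv)  # (64, 200, 200)
```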
“…Compared to simply stacking BEV features as in [18,38,6], our temporal self-attention can model long temporal dependencies more effectively. BEVFormer extracts temporal information from the previous BEV features rather than from multiple stacked BEV features, and thus requires less computation and suffers less from distracting information.…”
Section: Temporal Self-attention
confidence: 99%
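A minimal sketch of the recurrent alternative described above. BEVFormer itself uses deformable attention with learned BEV queries; the plain multi-head attention and all shapes below are simplifying assumptions, not the paper's implementation.

```python
import torch

# One attention layer reused every frame; embed_dim matches the BEV channels.
attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

def temporal_self_attention(curr_bev, prev_bev):
    """curr_bev, prev_bev: (B, H*W, C) flattened BEV grids; prev_bev is the
    fused output of the previous frame, already aligned to the current pose."""
    # Keys/values mix the previous BEV with the current one, so history
    # propagates recurrently through a single feature map instead of a stack.
    keys = torch.cat([prev_bev, curr_bev], dim=1)
    fused, _ = attn(query=curr_bev, key=keys, value=keys)
    return fused

B, C, grid = 1, 64, 50                     # 50x50 BEV grid -> 2500 tokens
prev = torch.randn(B, grid * grid, C)
curr = torch.randn(B, grid * grid, C)
out = temporal_self_attention(curr, prev)  # (1, 2500, 64); carries history
```

Because history is folded into a single previous BEV map, the per-frame cost stays constant regardless of how far back information originated, in contrast to the linear growth of the stacking sketch above.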
“…Cam2BEV [1] applies a spatial transformer module that converts perspective features from the surrounding inputs into BEV space via IPM, a straightforward way to link image space to BEV under a flat-ground assumption. Methods in [2], [5]-[7] utilize depth information to perform the view transformation. For example, Lift-Splat-Shoot [2] first estimates implicit pixel-wise depth and then uses camera geometry to build the connection between the BEV segmentation and the image feature maps.…”
Section: Related Work
confidence: 99%
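A minimal sketch of the "lift" step the quote attributes to Lift-Splat-Shoot [2]: each pixel's feature is spread over D depth hypotheses, weighted by a predicted categorical depth distribution. Shapes are illustrative, and the subsequent "splat" of these frustum points into the BEV grid via camera intrinsics/extrinsics is omitted.

```python
import torch

def lift(image_feats, depth_logits):
    """image_feats: (B, C, H, W) per-pixel image features.
    depth_logits: (B, D, H, W) unnormalized scores over D discrete depths."""
    depth_probs = depth_logits.softmax(dim=1)                # (B, D, H, W)
    # Outer product over the depth axis: (B, 1, C, H, W) * (B, D, 1, H, W)
    frustum = image_feats.unsqueeze(1) * depth_probs.unsqueeze(2)
    return frustum                                           # (B, D, C, H, W)

B, C, D, H, W = 1, 64, 41, 8, 22
frustum = lift(torch.randn(B, C, H, W), torch.randn(B, D, H, W))
```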
“…As in nuScenes, we scrape 6 cameras, 1 LiDAR, ego-motion (for temporal fusion of LiDAR scans), and 3D bounding boxes from CARLA. On the right, we show the target binary image for the bird's-eye-view vehicle segmentation task that we consider in this paper [43,37,22]. It is hard to perfectly label all scene materials and to model the complicated interactions between sensors and objects, at least today [34,8,56,42].…”
Section: Introduction
confidence: 99%
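A minimal sketch of how a binary BEV vehicle-segmentation target of the kind the quote describes could be rasterized from 3D boxes. It assumes axis-aligned boxes on an ego-centered metric grid; real pipelines rasterize yaw-rotated polygons, and all names and parameters here are illustrative.

```python
import numpy as np

def rasterize_boxes_bev(boxes_xywl, grid_m=50.0, res_m=0.5):
    """boxes_xywl: (N, 4) array of (x, y, width, length) in ego meters.
    Returns a binary (H, W) BEV mask covering [-grid_m, grid_m] on each axis."""
    size = int(2 * grid_m / res_m)
    mask = np.zeros((size, size), dtype=np.uint8)
    for x, y, w, l in boxes_xywl:
        # Convert metric box extents to pixel index ranges, clipped to the grid.
        x0 = int(np.clip((x - l / 2 + grid_m) / res_m, 0, size))
        x1 = int(np.clip((x + l / 2 + grid_m) / res_m, 0, size))
        y0 = int(np.clip((y - w / 2 + grid_m) / res_m, 0, size))
        y1 = int(np.clip((y + w / 2 + grid_m) / res_m, 0, size))
        mask[y0:y1, x0:x1] = 1
    return mask

target = rasterize_boxes_bev(np.array([[5.0, -2.0, 2.0, 4.5]]))  # (200, 200)
```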