2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.01499
FIERY: Future Instance Prediction in Bird’s-Eye View from Surround Monocular Cameras

Cited by 148 publications (113 citation statements)
References 30 publications

“…PYVA [50] proposes a cross-view transformer that converts the front-view monocular image into the BEV, but this paradigm is not suitable for fusing multi-camera features due to the computational cost of the global attention mechanism [42]. In addition to spatial information, previous works [18,38,6] also incorporate temporal information by stacking BEV features from several timestamps. Stacking BEV features constrains the available temporal information to a fixed time window and incurs extra computational cost.…”
Section: Camera-based 3D Perception
confidence: 99%
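For concreteness, below is a minimal PyTorch sketch of the stacking scheme this quote criticizes. It is illustrative code, not taken from any cited paper; the horizon T, the assumption that past features are already warped into the current ego frame, and the fusion convolution are all assumptions.

```python
import torch

def stack_bev_features(bev_history, fusion_conv):
    """bev_history: list of T BEV feature maps, each (C, H, W), oldest first,
    already warped into the current ego frame."""
    stacked = torch.cat(bev_history, dim=0)      # (T*C, H, W): fixed horizon T
    return fusion_conv(stacked.unsqueeze(0))[0]  # fuse back down to (C, H, W)

# Usage: the window is fixed at T frames, and both memory and compute grow
# linearly with T, which is exactly the cost the quote points out.
T, C, H, W = 3, 64, 200, 200
history = [torch.randn(C, H, W) for _ in range(T)]
fusion_conv = torch.nn.Conv2d(T * C, C, kernel_size=3, padding=1)
current_bev = stack_bev_features(history, fusion_conv)  # (64, 200, 200)
```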
“…Compared to simply stacking BEV features as in [18,38,6], our temporal self-attention can model long temporal dependencies more effectively. BEVFormer extracts temporal information from the previous BEV features rather than from multiple stacked BEV features, and thus requires less computation and suffers less from distracting information.…”
Section: Temporal Self-attention
confidence: 99%
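A minimal sketch of the recurrent alternative described above. BEVFormer itself uses deformable attention with learned BEV queries; the plain multi-head attention and all shapes below are simplifying assumptions, not the paper's implementation.

```python
import torch

# One attention layer reused every frame; embed_dim matches the BEV channels.
attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

def temporal_self_attention(curr_bev, prev_bev):
    """curr_bev, prev_bev: (B, H*W, C) flattened BEV grids; prev_bev is the
    fused output of the previous frame, already aligned to the current pose."""
    # Keys/values mix the previous BEV with the current one, so history
    # propagates recurrently through a single feature map instead of a stack.
    keys = torch.cat([prev_bev, curr_bev], dim=1)
    fused, _ = attn(query=curr_bev, key=keys, value=keys)
    return fused

B, C, grid = 1, 64, 50                     # 50x50 BEV grid -> 2500 tokens
prev = torch.randn(B, grid * grid, C)
curr = torch.randn(B, grid * grid, C)
out = temporal_self_attention(curr, prev)  # (1, 2500, 64); carries history
```

Because history is folded into a single previous BEV map, the per-frame cost stays constant regardless of how far back information originated, in contrast to the linear growth of the stacking sketch above.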
“…Cam2BEV [1] applies a spatial transformer module that converts perspective features from the surrounding inputs into BEV space via IPM, a straightforward way to link image space to BEV under a flat-ground assumption. Methods in [2], [5]-[7] utilize depth information to perform the view transformation. For example, Lift-Splat-Shoot [2] first estimates implicit pixel-wise depth and then uses camera geometry to build the connection between the BEV segmentation and the image feature maps.…”
Section: Related Work
confidence: 99%
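A minimal sketch of the "lift" step the quote attributes to Lift-Splat-Shoot [2]: each pixel's feature is spread over D depth hypotheses, weighted by a predicted categorical depth distribution. Shapes are illustrative, and the subsequent "splat" of these frustum points into the BEV grid via camera intrinsics/extrinsics is omitted.

```python
import torch

def lift(image_feats, depth_logits):
    """image_feats: (B, C, H, W) per-pixel image features.
    depth_logits: (B, D, H, W) unnormalized scores over D discrete depths."""
    depth_probs = depth_logits.softmax(dim=1)                # (B, D, H, W)
    # Outer product over the depth axis: (B, 1, C, H, W) * (B, D, 1, H, W)
    frustum = image_feats.unsqueeze(1) * depth_probs.unsqueeze(2)
    return frustum                                           # (B, D, C, H, W)

B, C, D, H, W = 1, 64, 41, 8, 22
frustum = lift(torch.randn(B, C, H, W), torch.randn(B, D, H, W))
```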
“…As in nuScenes, we scrape 6 cameras, 1 LiDAR, ego-motion (for temporal fusion of LiDAR scans), and 3D bounding boxes from CARLA. On the right, we show the target binary image for the bird's-eye-view vehicle segmentation task that we consider in this paper [43,37,22]. It is hard to perfectly label all scene materials and to model the complicated interactions between sensors and objects, at least today [34,8,56,42].…”
Section: Introduction
confidence: 99%
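A minimal sketch of how a binary BEV vehicle-segmentation target of the kind the quote describes could be rasterized from 3D boxes. It assumes axis-aligned boxes on an ego-centered metric grid; real pipelines rasterize yaw-rotated polygons, and all names and parameters here are illustrative.

```python
import numpy as np

def rasterize_boxes_bev(boxes_xywl, grid_m=50.0, res_m=0.5):
    """boxes_xywl: (N, 4) array of (x, y, width, length) in ego meters.
    Returns a binary (H, W) BEV mask covering [-grid_m, grid_m] on each axis."""
    size = int(2 * grid_m / res_m)
    mask = np.zeros((size, size), dtype=np.uint8)
    for x, y, w, l in boxes_xywl:
        # Convert metric box extents to pixel index ranges, clipped to the grid.
        x0 = int(np.clip((x - l / 2 + grid_m) / res_m, 0, size))
        x1 = int(np.clip((x + l / 2 + grid_m) / res_m, 0, size))
        y0 = int(np.clip((y - w / 2 + grid_m) / res_m, 0, size))
        y1 = int(np.clip((y + w / 2 + grid_m) / res_m, 0, size))
        mask[y0:y1, x0:x1] = 1
    return mask

target = rasterize_boxes_bev(np.array([[5.0, -2.0, 2.0, 4.5]]))  # (200, 200)
```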