“…Considering the dimension gap between the 2D image input and the 3D prediction, recent studies on vision-based 3D perception first construct BEV feature representations and then perform various downstream tasks in the BEV space [20,29,31,39,60,40,62,42,19,1,44]. To transform perspective image features into BEV features, LSS [40] and its follow-ups [42,29,19,60] predict a pixel-wise depth distribution to lift the image features into 3D points, which are then voxelized into BEV features.…”
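The lift-splat operation described above can be sketched in a few lines of NumPy. This is a simplified illustration, not LSS's actual implementation: the function name `lift_splat`, the tensor shapes, and the precomputed frustum point coordinates `frustum_xyz` (which in practice come from camera intrinsics/extrinsics) are all assumptions made for the example.

```python
import numpy as np

def lift_splat(img_feats, depth_logits, frustum_xyz, bev_shape, voxel_size):
    """Lift image features to 3D with a predicted depth distribution,
    then splat the resulting points onto a BEV grid (LSS-style sketch)."""
    C, H, W = img_feats.shape
    # 1. Lift: softmax over D depth bins gives a per-pixel depth distribution.
    e = np.exp(depth_logits - depth_logits.max(axis=0, keepdims=True))
    depth_prob = e / e.sum(axis=0, keepdims=True)             # (D, H, W)
    # Outer product: each candidate 3D point carries the pixel's features
    # weighted by the probability of its depth bin.
    frustum_feats = depth_prob[None] * img_feats[:, None]     # (C, D, H, W)
    # 2. Splat: voxelize points into BEV cells and sum their features.
    xs = np.floor(frustum_xyz[..., 0] / voxel_size).astype(int)
    ys = np.floor(frustum_xyz[..., 1] / voxel_size).astype(int)
    keep = (xs >= 0) & (xs < bev_shape[0]) & (ys >= 0) & (ys < bev_shape[1])
    bev = np.zeros((bev_shape[0], bev_shape[1], C))
    np.add.at(bev, (xs[keep], ys[keep]), frustum_feats[:, keep].T)
    return bev.transpose(2, 0, 1)                             # (C, X, Y)

# Toy example: random features/logits, frustum points inside a 16x16 grid.
rng = np.random.default_rng(0)
C, D, H, W = 4, 8, 3, 3
feats = rng.normal(size=(C, H, W))
logits = rng.normal(size=(D, H, W))
xyz = rng.uniform(0.0, 16.0, size=(D, H, W, 3))
bev_out = lift_splat(feats, logits, xyz, (16, 16), voxel_size=1.0)
```

Because the depth distribution sums to one per pixel, the total feature mass per channel is preserved when every frustum point lands inside the grid, which makes the splat step easy to sanity-check.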