2022
DOI: 10.48550/arxiv.2203.08195
Preprint

DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection

Abstract: Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving. While prevalent multi-modal methods [34,36] simply decorate raw lidar point clouds with camera features and feed them directly to existing 3D detection models, our study shows that fusing camera features with deep lidar features instead of raw points can lead to better performance. However, as those features are often augmented and aggregated, a key challenge in fusion is how to effectively a…
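The abstract contrasts decorating raw lidar points with camera features against fusing camera features with deep lidar features. The sketch below illustrates the deep-feature route in the simplest possible way: each deep lidar feature attends over a set of camera features and the attended context is concatenated back. It is a minimal numpy sketch under assumed shapes and random toy weights, not the paper's LearnableAlign module or training setup.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_deep_features(lidar_feat, cam_feat, rng=None):
    """Fuse deep lidar features (N, D) with camera features (M, D): each lidar
    feature attends over the camera features, and the attended camera context
    is concatenated back onto the lidar feature."""
    rng = rng if rng is not None else np.random.default_rng(0)
    d = lidar_feat.shape[1]
    # Toy projection weights; a real model would learn these.
    Wq = rng.normal(scale=0.1, size=(d, d))
    Wk = rng.normal(scale=0.1, size=(d, d))
    Wv = rng.normal(scale=0.1, size=(d, d))
    q = lidar_feat @ Wq                     # (N, D) queries from lidar
    k = cam_feat @ Wk                       # (M, D) keys from camera
    v = cam_feat @ Wv                       # (M, D) values from camera
    attn = softmax(q @ k.T / np.sqrt(d))    # (N, M) soft alignment weights
    cam_context = attn @ v                  # (N, D) camera context per lidar cell
    return np.concatenate([lidar_feat, cam_context], axis=-1)   # (N, 2D)

lidar = np.random.default_rng(1).normal(size=(128, 64))
cam = np.random.default_rng(2).normal(size=(256, 64))
print(fuse_deep_features(lidar, cam).shape)   # (128, 128)
```

Concatenating after attention keeps the lidar branch intact, so the camera context acts as an additional cue rather than replacing the lidar features.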

Cited by 3 publications (7 citation statements)
References 32 publications
“…[4,58,1,20] into camera world and use them as queries to select corresponding image features. This line of work constitutes the state-of-the-art methods of 3D BEV perception.…”
Section: Camera Network
Mentioning, confidence: 99%
“…However, it is often difficult to regress 3D bounding boxes on pure image inputs due to the lack of depth information, and similarly, it is difficult to classify objects on point clouds when LiDAR does not receive enough points. Previous fusion methods can be broadly categorized into (a) point-level fusion mechanisms [41,44,45,44,16,57] that project image features onto raw point clouds, and (b) feature-level fusion mechanisms [4,58,1,20] that project LiDAR features or proposals onto each view image separately to extract RGB information. (c) In contrast, we propose a novel yet surprisingly simple framework that disentangles the camera network from LiDAR inputs.…”
Section: Introduction
Mentioning, confidence: 99%
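To make the quoted distinction concrete, here is a rough numpy sketch of the point-level ("painting") route: each raw lidar point is projected into the image with an assumed 4x4 lidar-to-camera extrinsic and 3x3 intrinsic matrix, and the feature at the nearest pixel is appended to the point. All names and shapes are illustrative assumptions, not any specific codebase's API.

```python
import numpy as np

def paint_points(points_xyz, image_feat, lidar_to_cam, intrinsics):
    """points_xyz: (N, 3) raw lidar points; image_feat: (H, W, C) camera feature
    map (or RGB image); lidar_to_cam: (4, 4) extrinsic; intrinsics: (3, 3) camera
    matrix. Returns points decorated with the image feature at their projection,
    shape (N, 3 + C)."""
    n = points_xyz.shape[0]
    homo = np.concatenate([points_xyz, np.ones((n, 1))], axis=1)      # (N, 4) homogeneous
    cam_pts = (lidar_to_cam @ homo.T).T[:, :3]                        # camera frame
    valid = cam_pts[:, 2] > 1e-3                                      # in front of the camera
    uvw = (intrinsics @ cam_pts.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-3, None)                # pixel coordinates
    h, w, _ = image_feat.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    painted = image_feat[v, u].astype(float)                          # (N, C) nearest-pixel gather
    painted[~valid] = 0.0                                             # zero features for points behind the camera
    return np.concatenate([points_xyz, painted], axis=1)
```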
“…Specifically, these methods rely on the LiDAR-to-world and camera-to-world calibration matrices to project a LiDAR point onto the image plane, where it serves as a query of image features [33,34,31,40,8,44]. Deep fusion methods extract deep features from pre-trained neural networks for both modalities in a unified space [1,12,9,4,45,16,15], where a popular choice of such a space is the bird's eye view (BEV) [1,45]. While both early and deep fusion mechanisms usually occur within a neural network pipeline, the late fusion scheme usually contains two independent perception models to generate 3D bounding box predictions for both modalities, and then fuses these predictions using post-processing techniques [4,21].…”
Section: Related Work
Mentioning, confidence: 99%
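The quoted early-fusion recipe (project a LiDAR point through the calibration matrices and use the hit pixel as a query) can be written out directly. The sketch below chains lidar-to-world and camera-to-world transforms and then applies the intrinsics; matrix names and conventions are assumptions for illustration.

```python
import numpy as np

def lidar_point_to_pixel(p_lidar, lidar_to_world, cam_to_world, intrinsics):
    """Project a single lidar point onto the image plane by chaining the
    lidar-to-world and camera-to-world calibration matrices, then applying the
    camera intrinsics. Returns (u, v) or None if the point lies behind the camera."""
    p = np.append(p_lidar, 1.0)                      # homogeneous coordinates (4,)
    p_world = lidar_to_world @ p                     # lidar frame -> world frame
    p_cam = np.linalg.inv(cam_to_world) @ p_world    # world frame -> camera frame
    x, y, z = p_cam[:3]
    if z <= 0:
        return None                                  # no valid image query
    u, v, w = intrinsics @ np.array([x, y, z])
    return np.array([u / w, v / w])                  # pixel used to query image features
```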
“…However, TransFusion [1] mainly explores the robustness against camera inputs, and ignores the noisy LiDAR and temporal misalignment cases. DeepFusion [12] examines the model robustness by adding noise to LiDAR reflections and camera pixels. Though the noise settings of DeepFusion [12] are straightforward and brief, the noisy cases almost never appear in real scenes.…”
Section: Related Work
Mentioning, confidence: 99%
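For reference, the kind of perturbation the quote attributes to DeepFusion's robustness study (noise added to LiDAR reflections and camera pixels) can be sketched as below. The noise magnitudes and the (x, y, z, reflectance) channel layout are assumptions, not the paper's exact protocol.

```python
import numpy as np

def corrupt_inputs(points, image, reflect_sigma=0.05, pixel_sigma=0.02, seed=0):
    """points: (N, 4) lidar points as (x, y, z, reflectance); image: (H, W, 3) in [0, 1].
    Adds Gaussian noise to the reflectance channel and to the camera pixels."""
    rng = np.random.default_rng(seed)
    noisy_points = points.copy()
    noisy_points[:, 3] = noisy_points[:, 3] + rng.normal(0.0, reflect_sigma, size=points.shape[0])
    noisy_image = np.clip(image + rng.normal(0.0, pixel_sigma, size=image.shape), 0.0, 1.0)
    return noisy_points, noisy_image
```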