“…However, a single camera is inherently inaccurate for 3D localization. Other works explore dedicated depth sensors such as stereo cameras [8], [9], [10], [11], which are also relatively low-cost and provide effective depth information but have a limited sensing range; and LiDAR [12], [13], [14], [15], [16], [17], [18], which offers accurate 3D localization but is less informative and sensitive to reflective conditions (e.g., rain, car windows). To achieve robust perception, modern self-driving vehicles tend to be equipped with multiple complementary sensors, whose 3D information is represented in quite different ways (e.g., high-level semantic cues from monocular images, pixel-level disparity from stereo images, and sparse but geometry-aware point clouds from LiDAR).…”