“…However, a single camera is inherently inaccurate for 3D localization. Other works explore dedicated depth sensors such as stereo cameras [8], [9], [10], [11], which are also relatively low-cost and provide effective depth information but have a limited sensing range; and LiDAR [12], [13], [14], [15], [16], [17], [18], which offers accurate 3D localization but is less informative and sensitive to reflective conditions (e.g., rain, car windows). To achieve robust perception, modern self-driving vehicles tend to be equipped with multiple complementary sensors, whose 3D information is represented in quite different ways (e.g., high-level semantic cues from monocular images, pixel-level disparity from stereo images, and sparse but geometry-aware point clouds from LiDAR).…”