Deep Surface Normal Estimation With Hierarchical RGB-D Fusion

Zeng, Jin; Tong, Yanfeng; Huang, Yunmu; Yan, Qing; Sun, Wenxiu; Jing, Chen; Wang, Yongtian

doi:10.1109/cvpr.2019.00631

Cited by 70 publications

(55 citation statements)

References 30 publications


“…Therefore, a confidence map network branch was set following the method proposed by Zeng et al [ 11 ], which generates confidence maps to indicate whether side effects resulted from pixel holes on […] or not. Confidence maps [ 19 ] of depth image were produced by combining mask images [ 21 ] ([…]) with relative coarse depth images ([…]) and were denoted as […] according to resolution.…”

Section: Our Methods (mentioning)

confidence: 99%

“…As is shown in Figure 3 , pixel holes in mask images ([…]) suggest that there are lots of missing pixels in ground truth depth images, which inevitably causes deviation to supervised learning. Therefore, we adopted a multi-layer convolution network ([…]) for producing confidence map […] [ 19 ] of input depth images. […] stands for scale value of images, i.e., if the resolution of 2D images can be denoted as […], then the corresponding […] is defined as […].…”

Section: Our Methods (mentioning)

confidence: 99%

“…Surface normal guidance has been introduced by previous studies [ 10 , 19 , 20 ], where they employed surface normal maps as 3D cues for improving the geometric quality of monocular depth images. Qi et al [ 10 ] jointly calculated depth and surface normal from a single image, making the final estimation geometrically more precise.…”

Section: Related Work (mentioning)

confidence: 99%

“…Qi et al [ 10 ] jointly calculated depth and surface normal from a single image, making the final estimation geometrically more precise. In work of Zeng et al [ 19 ], a skip-connected architecture was proposed to fuse features from different layers for surface normal estimation. A novel 3D geometric feature virtual normal was proposed by Yin et al [ 20 ] to refine the predicted depth maps.…”

Section: Related Work (mentioning)

confidence: 99%

“…Alhashim et al [ 6 ] proved that a very simple transfer learning-based decoder robustly achieves high-resolution depth maps. Previous studies [ 6 , 9 , 19 ] proved that the Dense-net is more suitable for depth estimation than models like SE-Net, Res-Net, and Mobile-net. However, experiments in references [ 6 , 9 ] showed that pre-trained deep learning structures based on Densenet-161, Densenet-169, and Densenet-201 models cannot afford real-time depth estimation.…”

Section: Introduction (mentioning)

confidence: 99%


Superb Monocular Depth Estimation Based on Transfer Learning and Surface Normal Guidance

Huang, Chen, et al. 2020, Sensors


Accurately sensing the surrounding 3D scene is indispensable for drones or robots to execute path planning and navigation. In this paper, a novel monocular depth estimation method was proposed that primarily utilizes a lighter-weight Convolutional Neural Network (CNN) structure for coarse depth prediction and then refines the coarse depth images by combining surface normal guidance. Specifically, the coarse depth prediction network is designed as pre-trained encoder–decoder architecture for describing the 3D structure. When it comes to surface normal estimation, the deep learning network was designed as a two-stream encoder–decoder structure, which hierarchically merges red-green-blue-depth (RGB-D) images for capturing more accurate geometric boundaries. Relying on fewer network parameters and simpler learning structure, better detailed depth maps are produced than the existing states. Moreover, 3D point cloud maps reconstructed from depth prediction images confirm that our framework can be conveniently adopted as components of a monocular simultaneous localization and mapping (SLAM) paradigm.

