“…Lee et al [37] demonstrated that by using DenseNet161 [18] as the encoder backbone for the NYUv2 [4] dataset, their method's accuracy was higher than when using ResNet101 [17]. Song et al [41] further demonstrated in their ablation studies that for the KITTI [5] dataset, the ResNeXt [36] encoder provides the best performance for their model, which matched the findings by Lee et al [37] for this dataset. Ablation studies from Bhat et al [21] illustrate that the use of the EfficientNet-B5 [1] can produce very good predictive performance with a basic decoder.…”
Section: Encoders For Monocular Depth Estimation (supporting)
confidence: 55%
“…On top of their dilated ResNet-101 backbone in Stage 1, they use the ASPP module [20] to gather global contextual information in Stage 2. ASPP modules in different forms have since been adopted by Yin et al [35], Lee et al [37] and Song et al [41].…”
Section: Dilated Convolutions In Depth Estimation (mentioning)
confidence: 99%
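The ASPP idea referenced in the snippet above — probing the same feature map with convolutions at several dilation rates to gather context at multiple scales — can be illustrated with a minimal NumPy sketch. This is not the implementation from any of the cited papers; the fixed averaging kernel and the function names `dilated_conv2d` and `aspp` are illustrative assumptions (in a real network the kernels are learned and the branches include 1x1 convolutions and image-level pooling).

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """Naive 'same'-padded 2D convolution of a single-channel map x
    with a 3x3 kernel whose taps are spaced `rate` pixels apart."""
    k = kernel.shape[0]
    pad = rate * (k // 2)
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * xp[i * rate:i * rate + x.shape[0],
                                     j * rate:j * rate + x.shape[1]]
    return out

def aspp(x, rates=(1, 2, 4)):
    """Toy ASPP: apply the same kernel at several dilation rates in
    parallel and stack the responses along a channel axis."""
    kernel = np.full((3, 3), 1.0 / 9.0)  # toy fixed smoothing kernel
    return np.stack([dilated_conv2d(x, kernel, r) for r in rates])
```

Larger rates enlarge the receptive field without adding parameters or reducing resolution, which is why ASPP variants are attractive for dense prediction tasks like depth estimation.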
“…Merging low-resolution features with high-resolution features in decoders for monocular depth estimation transfers strong global contextual information from the lower resolutions to the higher-resolution reconstructions. Lee et al [37] and Song et al [41] both employ a variation of this method to improve their model's predictive performance. We propose a simpler component to reduce the computational overhead for a more efficient and accurate decoder structure.…”
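The feature-merging scheme described in the snippet above — upsampling coarse, context-rich features to the fine grid and combining them with high-resolution features — can be sketched in a few lines of NumPy. This is a generic illustration under stated assumptions (nearest-neighbour upsampling, channel-wise concatenation), not the specific component proposed by any of the cited works.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def merge_features(low, high):
    """Bring low-res features onto the high-res grid and concatenate
    along the channel axis, so fine-scale decoding layers can see
    both local detail and global context."""
    factor = high.shape[1] // low.shape[1]
    return np.concatenate([upsample_nearest(low, factor), high], axis=0)
```

In a learned decoder the concatenation is typically followed by a convolution that fuses the two streams; the concatenation itself is the cheap part, which is why reducing the surrounding overhead is the focus of the proposed component.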
Depth estimation is an essential component in computer vision systems for achieving 3D scene understanding. Efficient and accurate depth map estimation has numerous applications including self-driving vehicles and virtual reality. This paper presents a new deep network, called D-Net, for depth estimation from a single RGB image. The proposed network is designed as an efficient, accurate and universal model that can adopt a wide range of encoder backbones. Our approach gathers strong global and local contextual features at multiple resolutions and transfers these to high resolutions for clearer depth maps. For the encoder backbone we adopt state-of-the-art models including EfficientNet [1], HRNet [2] and Swin Transformer [3] to obtain densely labelled depth maps. The proposed D-Net can be trained end-to-end and is designed to have minimal parameters and reduced computational complexity. Extensive evaluations on the NYUv2 [4] and KITTI [5] benchmark datasets show that our model is highly accurate across multiple backbones and achieves state-of-the-art performance on both benchmark datasets when combined with the Swin Transformer and HRNet backbones.
“…They employed a reinforcement learning algorithm to automatically prune redundant channels of MDE by finding a relatively optimal pruning policy. Song et al [28] proposed a simple but effective scheme by incorporating the Laplacian pyramid into the decoder architecture. Specifically, encoded features were fed into different streams for decoding depth residuals.…”
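The Laplacian-pyramid decoding mentioned in the snippet above rests on a classical decomposition: an image is split into per-scale residuals plus a coarse low-pass band, and summing the residuals back scale by scale reconstructs it exactly. The NumPy sketch below shows that decomposition under simplifying assumptions (2x2 average pooling and nearest-neighbour upsampling rather than Gaussian filtering); it is an illustration of the pyramid itself, not the cited decoder.

```python
import numpy as np

def downsample(img):
    """2x2 average pooling (assumes even spatial dimensions)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Nearest-neighbour 2x upsampling."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels=3):
    """Per-scale residuals plus the coarsest low-pass image."""
    pyramid, current = [], img
    for _ in range(levels):
        low = downsample(current)
        pyramid.append(current - upsample(low))  # detail lost at this scale
        current = low
    pyramid.append(current)
    return pyramid

def reconstruct(pyramid):
    """Invert the pyramid: upsample and add residuals, coarse to fine."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = upsample(current) + residual
    return current
```

Decoding depth residuals stream-by-stream in this way lets each decoder branch specialise in one frequency band of the depth map.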
Predicting a convincing depth map from a single monocular image is a daunting task in the field of computer vision. In this paper, we propose a novel detail-preserving depth estimation (DPDE) algorithm based on a modified fully convolutional residual network and a gradient network. Specifically, we first introduce a new deep network that combines the fully convolutional residual network (FCRN) and a U-shaped architecture to generate the global depth map. Meanwhile, an efficient feature similarity-based loss term is introduced to train this network better. Then, we devise a gradient network to generate the local details of the scene based on gradient information. Finally, an optimization-based fusion scheme is proposed to integrate the depth and depth gradients to generate a reliable depth map with better details. Three benchmark RGBD datasets are evaluated qualitatively and quantitatively; the experimental results show that the designed depth prediction algorithm is superior to several classic depth prediction approaches and can reconstruct plausible depth maps.
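The optimization-based fusion described in the abstract above — combining a coarse depth estimate with separately predicted depth gradients — is commonly posed as a least-squares problem. The 1D NumPy sketch below is a hypothetical, simplified version of such a scheme (the function name `fuse_depth_1d`, the weighting `lam`, and the 1D setting are all assumptions, not the paper's formulation): it seeks the depth signal closest to the coarse estimate whose finite differences match the target gradients.

```python
import numpy as np

def fuse_depth_1d(d0, g, lam=10.0):
    """Solve min_d ||d - d0||^2 + lam^2 ||D d - g||^2, where D is the
    finite-difference operator, d0 a coarse depth signal and g the
    target gradients. Returns the fused depth."""
    n = d0.size
    D = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    A = np.vstack([np.eye(n), lam * D])   # stack data and gradient terms
    b = np.concatenate([d0, lam * g])
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d
```

Raising `lam` trusts the gradient network more (sharper edges), while lowering it trusts the global depth network more; the 2D analogue leads to a Poisson-like linear system.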
“…The first works in this area [14], [15] used ground truth depth for supervised learning. Later research contributed mainly by proposing architectural innovations [16]-[19]. All these methods rely on accurate ground truth labels at training time, which is not trivial to obtain in many application domains.…”
Estimating depth from endoscopic images is a pre-requisite for a wide set of AI-assisted technologies, namely accurate localization, measurement of tumors, or identification of non-inspected areas. As the domain specificity of colonoscopies (a deformable low-texture environment with fluids, poor lighting conditions and abrupt sensor motions) poses challenges to multi-view approaches, single-view depth learning stands out as a promising line of research. In this paper, we explore for the first time Bayesian deep networks for single-view depth estimation in colonoscopies. Their uncertainty quantification offers great potential for such a critical application area. Our specific contribution is two-fold: 1) an exhaustive analysis of Bayesian deep networks for depth estimation in three different datasets, highlighting challenges and conclusions regarding synthetic-to-real domain changes and supervised vs. self-supervised methods; and 2) a novel teacher-student approach to deep depth learning that takes into account the teacher uncertainty.