This paper addresses the weak adaptability of current monocular depth estimation algorithms to viewpoint (angle) transformations: these CNN-based algorithms lack estimation accuracy and robustness. The paper proposes a lightweight network based on convolution and capsule feature fusion (CNNapsule). First, it introduces a fusion block module that integrates CNN features and matrix capsule features to improve the network's adaptability to perspective transformations. The fused and deconvolved features are then combined through skip connections to generate the depth image. In addition, a loss function is designed according to the long-tail distribution, gradient similarity, and structural similarity of the datasets. Finally, comparisons on the NYU Depth V2 and KITTI datasets show that the proposed method achieves better accuracy on the C1 and C2 indices and a better visual effect than traditional methods and deep learning methods without transfer learning, while requiring 65% fewer trainable parameters than methods presented in the literature. The generalization of the method is verified through comparative tests on data collected from the internet and mobile phones.
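The abstract's composite loss combines three ingredients: a term that compensates for the long-tail depth distribution, a gradient-similarity term, and a structural-similarity term. A minimal sketch of such a loss is shown below; the function name, weights, and the specific choice of a log-space L1 term for the long-tail compensation are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def depth_loss(pred, gt, w_grad=1.0, w_struct=1.0, eps=1e-6):
    """Illustrative composite depth loss (hypothetical weights and terms):
    log-depth error + gradient similarity + an SSIM-style structural term."""
    # Log-space L1: compresses large depths so the rare, distant pixels of a
    # long-tailed depth distribution do not dominate the loss.
    log_diff = np.log(pred + eps) - np.log(gt + eps)
    l_depth = np.mean(np.abs(log_diff))

    # Gradient similarity: penalize differences between the vertical and
    # horizontal depth gradients to preserve depth edges.
    dpy, dpx = np.gradient(pred)
    dgy, dgx = np.gradient(gt)
    l_grad = np.mean(np.abs(dpx - dgx)) + np.mean(np.abs(dpy - dgy))

    # Structural term: a global SSIM-style comparison of means and variances.
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = ((pred - mu_p) * (gt - mu_g)).mean()
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    l_struct = (1.0 - ssim) / 2.0

    return l_depth + w_grad * l_grad + w_struct * l_struct
```

With identical prediction and ground truth, all three terms vanish, so the loss is zero; any depth, edge, or structure mismatch increases it.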
Objective Obtaining scene depth is crucial for 3D reconstruction, autonomous driving, and related tasks. Current methods based on lidar or time-of-flight (ToF) cameras are not widely applicable because of their high cost. In contrast, inferring scene depth from a single RGB image is more cost-effective and has broader potential for applications. Inspired by the recent success of deep learning on various ill-posed problems, many researchers adopt convolutional neural networks to estimate reasonable and accurate monocular depths. However, most existing deep-learning studies focus on enhancing the feature extraction capability of the network and pay little attention to the distribution of image depths. Estimating the pixel distributions of images can not only improve inference precision but also make the reconstructed 3D images more consistent with the ground truth. Therefore, we propose a new adaptive depth distribution module, which allows the model to predict a different depth distribution for each image during training. Methods The NYU Depth V2 dataset created by New York University is employed. Overall, our model is built on an encoder-decoder structure with skip connections, which has been shown to guide image generation more effectively. An indirect representation of depth maps based on plane coefficients is also introduced to implicitly add a plane constraint to the depth estimation and obtain smoother results in the planar regions of a scene. Specifically, two subnetworks with different lightweight designs are adopted at the bottleneck and the other upsampling stages of the network to enhance the model's feature extraction capability.
In addition, an adaptive depth distribution estimation module is designed to estimate a different depth distribution for each input image, which brings the pixel distribution of the predicted depth maps closer to the ground truth. A two-stage training strategy is employed: in the first stage, we load weights pretrained on ImageNet into the backbone network and optimize the model with a loss function at the 2D level only; in the second stage, we perform joint training with loss functions at both the 2D and 3D levels. Results and Discussions Our study employs multiple metrics, including root mean square error (RMSE), relative error (REL), and intersection over union (IoU), to quantitatively evaluate the inference ability of the proposed model. As shown in Table 1, the proposed lightweight network outperforms most of the listed methods with only 46 M parameters, which shows that the overall structure of the model is concise and effective. The visual comparison of 3D depth reconstructions (Fig. 5) demonstrates that the proposed network outputs smoother and more continuous depth predictions in planar regions, as well as reasonable predictions in partially occluded or missing areas of those regions. In terms of depth distribution, the carefully designed adap...
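For reference, the standard definitions behind the RMSE and REL metrics cited above (plus the commonly reported δ-threshold accuracy) can be sketched as follows; the function name is illustrative, and the paper's exact evaluation protocol (depth caps, crops, valid-pixel masks) may differ.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular-depth evaluation metrics over valid pixels:
    RMSE, mean absolute relative error, and delta_1 threshold accuracy."""
    # Root mean square error in metric depth.
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    # Mean absolute relative error, normalized by the ground-truth depth.
    rel = np.mean(np.abs(pred - gt) / gt)
    # Threshold accuracy: fraction of pixels with max(pred/gt, gt/pred) < 1.25.
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)
    return rmse, rel, delta1
```

A perfect prediction yields RMSE = 0, REL = 0, and δ₁ = 1; doubling every depth yields REL = 1 and δ₁ = 0, since every pixel's ratio is 2.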
Objective Vision-based depth estimation is an important research direction in computer vision and is of great significance for three-dimensional (3D) reconstruction, semantic segmentation, navigation, and related tasks. Monocular depth estimation has the advantages of low cost and easy installation, which binocular stereo vision and lidar cannot offer, and it has received increasing attention in recent years. There is a strong correlation between the out-of-
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.