Abstract: In this paper, we propose an efficient multi-scale geometric consistency guided multi-view stereo method for accurate and complete depth map estimation. We first present our basic multi-view stereo method with Adaptive Checkerboard sampling and Multi-Hypothesis joint view selection (ACMH). It leverages structured region information to sample better candidate hypotheses for propagation and infer the aggregation view subset at each pixel. For the depth estimation of low-textured areas, we further propose to comb…
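For concreteness, here is a minimal sketch of the adaptive checkerboard sampling idea: each pixel collects one candidate hypothesis from each of several surrounding checkerboard regions, taking the lowest-cost neighbor per region. The region layout, cost handling, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_candidate_hypotheses(costs, hypotheses, x, y, regions):
    """Toy sketch of adaptive checkerboard sampling.

    costs:      H x W array of current matching costs per pixel.
    hypotheses: H x W x 4 array of per-pixel plane hypotheses (depth + normal).
    regions:    list of pixel-offset lists, one per checkerboard region
                around (x, y) (e.g. V-shaped and long-strip areas).

    Returns one candidate per region: the hypothesis of the lowest-cost
    pixel inside that region.
    """
    h, w = costs.shape
    candidates = []
    for offsets in regions:
        best_cost, best_hyp = np.inf, None
        for dx, dy in offsets:
            px, py = x + dx, y + dy
            if 0 <= px < w and 0 <= py < h and costs[py, px] < best_cost:
                best_cost, best_hyp = costs[py, px], hypotheses[py, px]
        if best_hyp is not None:
            candidates.append(best_hyp)
    return candidates
```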
“…Further, Liu et al. [21] improved the metric by using Gaussian filtering to counteract the effect of noise. COLMAP [22] and some following works [23], [24] handled this problem by dataset-wide pixel-wise view selection using patch color distribution. Our network learns to predict the pixel-wise visibility for all the given source views and uses the prediction in multi-view feature aggregation, which can be trained end-to-end and improves the robustness to occlusions.…”
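As an illustration of how predicted visibility can drive multi-view feature aggregation, here is a minimal sketch in which per-view features are averaged with visibility-derived weights; the normalized weighted average is an assumption for illustration, not the paper's exact network.

```python
import numpy as np

def aggregate_features(src_features, visibility, eps=1e-6):
    """Weighted average of per-view features using predicted visibility.

    src_features: (V, C, H, W) array, one deep feature map per source view.
    visibility:   (V, 1, H, W) array of per-pixel visibility scores in [0, 1].

    Views predicted to be occluded at a pixel get a small weight and thus
    contribute little to the aggregated feature at that pixel.
    """
    weights = visibility / (visibility.sum(axis=0, keepdims=True) + eps)
    return (weights * src_features).sum(axis=0)  # (C, H, W)
```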
We introduce VA-Point-MVSNet, a novel visibility-aware point-based deep framework for multi-view stereo (MVS). Distinct from existing cost-volume approaches, our method directly processes the target scene as point clouds. More specifically, our method predicts the depth in a coarse-to-fine manner. We first generate a coarse depth map, convert it into a point cloud and refine the point cloud iteratively by estimating the residual between the depth of the current iteration and that of the ground truth. Our network leverages 3D geometry priors and 2D texture information jointly and effectively by fusing them into a feature-augmented point cloud, and processes the point cloud to estimate the 3D flow for each point. This point-based architecture allows higher accuracy, greater computational efficiency and more flexibility than cost-volume-based counterparts. Furthermore, our visibility-aware multi-view feature aggregation allows the network to aggregate multi-view appearance cues while taking occlusions into account. Experimental results show that our approach achieves a significant improvement in reconstruction quality compared with state-of-the-art methods on the DTU and Tanks and Temples datasets. The code of VA-Point-MVSNet proposed in this work will be released at https://github.com/callmeray/PointMVSNet.
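A toy sketch of the coarse-to-fine refinement loop described above: the current depth map is back-projected into a point cloud, a placeholder predictor estimates per-point depth residuals, and the depth map is updated. The helper names and the residual parameterization are assumptions for illustration, not the paper's network.

```python
import numpy as np

def unproject(depth, K):
    """Back-project a depth map into a point cloud in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T           # one viewing ray per pixel
    return rays * depth.reshape(-1, 1)        # scale rays by depth

def refine_depth(coarse_depth, K, predict_residual, num_iters=2):
    """Coarse-to-fine refinement sketch: at each iteration a learned
    predictor (stood in for by `predict_residual`) estimates a per-point
    depth residual, which is added to the current estimate."""
    depth = coarse_depth.astype(float).copy()
    for _ in range(num_iters):
        points = unproject(depth, K)
        residual = predict_residual(points)   # (H*W,) depth corrections
        depth += residual.reshape(depth.shape)
    return depth
```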
“…(Galliani, Lasinger, and Schindler 2015) utilizes a diffusion-like propagation scheme to make better use of the parallelization of GPUs. By inheriting the checkerboard pattern of (Galliani, Lasinger, and Schindler 2015), ACMH (Xu and Tao 2019) designs an adaptive checkerboard sampling strategy to propagate more reliable hypotheses. Moreover, ACMH further exploits these hypotheses to infer pixelwise view selection.…”
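A rough sketch of a diffusion-like red-black propagation schedule of the kind referenced above: pixels of one checkerboard color are updated from their opposite-color neighbors, then the colors alternate. The sequential loop below only illustrates the schedule; in practice each color's updates are independent and run in parallel on the GPU. `cost_fn` and the 4-neighborhood are illustrative assumptions.

```python
import numpy as np

def checkerboard_propagation(depth, cost_fn, num_iters=4):
    """Diffusion-like red-black propagation sketch.

    Pixels are split into two interleaved colors; all pixels of one color
    adopt the best hypothesis among their opposite-color neighbors, then
    the colors swap. cost_fn(y, x, d) scores hypothesis d at pixel (y, x).
    """
    h, w = depth.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    for it in range(num_iters):
        color = (yy + xx + it) % 2            # alternate red / black pixels
        for y, x in zip(*np.nonzero(color)):
            best_d, best_c = depth[y, x], cost_fn(y, x, depth[y, x])
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w:
                    c = cost_fn(y, x, depth[ny, nx])
                    if c < best_c:
                        best_d, best_c = depth[ny, nx], c
            depth[y, x] = best_d
    return depth
```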
Section: Related Work
“…Due to the difficulty of solving such optimization problems, these methods are inefficient and easily become trapped in local optima. Recently, PatchMatch multi-view stereo methods (Zheng et al. 2014; Galliani, Lasinger, and Schindler 2015; Schönberger et al. 2016; Xu and Tao 2019) have become popular, as the PatchMatch-based optimization they employ (Barnes et al. 2009) makes depth map estimation efficient and accurate. As these methods do not explicitly model planar priors, they still fail in low-textured areas.…”
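For reference, a minimal PatchMatch-style depth estimation loop (random initialization, spatial propagation, random refinement); the propagation order and refinement schedule are simplified assumptions, not any particular paper's implementation.

```python
import numpy as np

def patchmatch_depth(cost_fn, h, w, d_min, d_max, num_iters=3, rng=None):
    """Minimal PatchMatch-style depth estimation sketch.

    cost_fn(y, x, d) returns the photometric matching cost of depth d at
    pixel (y, x).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    depth = rng.uniform(d_min, d_max, size=(h, w))   # random initialization
    for _ in range(num_iters):
        for y in range(h):
            for x in range(w):
                # Propagation: test hypotheses of already-updated neighbors.
                for ny, nx in ((y - 1, x), (y, x - 1)):
                    if ny >= 0 and nx >= 0 and \
                       cost_fn(y, x, depth[ny, nx]) < cost_fn(y, x, depth[y, x]):
                        depth[y, x] = depth[ny, nx]
                # Random refinement: perturb the current hypothesis.
                cand = depth[y, x] + rng.normal(0, 0.05 * (d_max - d_min))
                if d_min <= cand <= d_max and \
                   cost_fn(y, x, cand) < cost_fn(y, x, depth[y, x]):
                    depth[y, x] = cand
    return depth
```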
The completeness of 3D models is still a challenging problem in multi-view stereo (MVS) due to the unreliable photometric consistency in low-textured areas. Since low-textured areas usually exhibit strong planarity, planar models are advantageous to the depth estimation of low-textured areas. On the other hand, PatchMatch multi-view stereo is very efficient for its sampling and propagation scheme. By taking advantage of planar models and PatchMatch multi-view stereo, we propose a planar prior assisted PatchMatch multi-view stereo framework in this paper. In detail, we utilize a probabilistic graphical model to embed planar models into PatchMatch multi-view stereo and contribute a novel multi-view aggregated matching cost. This novel cost takes both photometric consistency and planar compatibility into consideration, making it suited for the depth estimation of both non-planar and planar regions. Experimental results demonstrate that our method can efficiently recover the depth information of extremely low-textured areas, thus obtaining highly complete 3D models and achieving state-of-the-art performance.
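A toy sketch of a matching cost that mixes photometric consistency with planar compatibility: the hypothesis is penalized for deviating from the depth induced by a planar prior. The Gaussian penalty and the fixed weight `alpha` are illustrative assumptions, not the paper's probabilistic graphical model formulation.

```python
import numpy as np

def aggregated_cost(photo_cost, depth, prior_depth, sigma=0.1, alpha=0.5):
    """Toy combination of photometric and planar-prior terms.

    photo_cost:  photometric matching cost of the current hypothesis.
    depth:       depth of the current hypothesis.
    prior_depth: depth induced by the planar prior at this pixel.

    The planar term penalizes deviation from the prior depth; alpha balances
    the two terms. Well-textured pixels are dominated by photo_cost, while
    low-textured pixels benefit from the planar compatibility term.
    """
    planar_cost = 1.0 - np.exp(-((depth - prior_depth) ** 2) / (2 * sigma ** 2))
    return (1 - alpha) * photo_cost + alpha * planar_cost
```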
“…The patch model is essentially a local tangent plane approximation of a surface. Algorithms proposed by References 4, 13, 14, 16, 17, 26, and 27 utilize a pixel window in the reference view to represent the projection of a small planar patch in the scene and conduct a region-growing procedure to generate dense correspondences and depth maps. Methods such as those in References 9-12 and 28 directly encode the patch model as an oriented rectangle in three-dimensional space, with one of its edges parallel to the x-axis of the reference camera.…”
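Since the patch is a local tangent plane, matching a reference-view window against a source view can be expressed with the standard plane-induced homography; the sketch below computes it for a plane given in reference-camera coordinates (the function name and conventions are illustrative).

```python
import numpy as np

def plane_induced_homography(K_ref, K_src, R, t, n, d):
    """Homography mapping reference-image pixels lying on the plane
    n^T X = d (in reference-camera coordinates) into the source image.

    R, t: rotation and translation from the reference to the source camera.
    n:    unit plane normal; d: plane-to-origin distance, so n^T X = d.
    """
    n = n.reshape(3, 1)
    t = t.reshape(3, 1)
    return K_src @ (R + (t @ n.T) / d) @ np.linalg.inv(K_ref)
```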
In this article, we propose the novel folding patch model, which can replace the traditional patch model used in patch-based multiview stereo (MVS) methods to significantly improve reconstruction results. The patch model serves as a local first-order approximation of the scene surface in the geometric estimation procedure. By minimizing the photometric discrepancy of the patch model's projection onto multiple source images, patch-based MVS algorithms optimize the position and normal values of the 3D hypothesis for the target pixel. This optimization rests on the assumption that the patch model can fit the target scene surface perfectly. However, for complex scenes crowded with sharp edges, splintery surfaces, or rounded surfaces, the patch model is inherently unsuitable, since even at a microscopic scale these surfaces are not entirely flat. We construct the folding patch model by folding the traditional patch model along its middle line. By adjusting the folding angle and direction, the folding patch model can fit complex surfaces more flexibly. We apply our folding patch model to the representative open-source patch-based multiview stereo (PMVS) and COLMAP, and validate its effectiveness on the ETH3D benchmark and data sets captured in nature. The results demonstrate that the folding patch model can significantly improve the behavior of PMVS and COLMAP, especially on data sets that mainly consist of complex surfaces from plants.
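A minimal sketch of one way to parameterize such a folding patch: the rectangular patch is split along its middle line and each half is rotated by an opposite half of the fold angle about that line. The corner ordering and the Rodrigues-based rotation are assumptions for illustration, not the authors' exact parameterization.

```python
import numpy as np

def rotation_about_axis(axis, angle):
    """Rodrigues rotation matrix about a unit axis."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def fold_patch(corners, fold_angle):
    """Fold a planar rectangular patch about its middle line.

    corners: (4, 3) array ordered top-left, top-right, bottom-right,
             bottom-left. The middle line joins the midpoints of the top
             and bottom edges; the left and right halves rotate by opposite
             half-angles about that line, yielding two half-planes that can
             fit edges and curved surfaces better than a single plane.
    """
    mid_top = 0.5 * (corners[0] + corners[1])
    mid_bot = 0.5 * (corners[3] + corners[2])
    axis = mid_bot - mid_top
    R_left = rotation_about_axis(axis, +fold_angle / 2)
    R_right = rotation_about_axis(axis, -fold_angle / 2)
    folded = corners.astype(float).copy()
    for i in (0, 3):                          # left half of the patch
        folded[i] = mid_top + R_left @ (corners[i] - mid_top)
    for i in (1, 2):                          # right half of the patch
        folded[i] = mid_top + R_right @ (corners[i] - mid_top)
    return folded
```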