“…Compared with the recent methods [11, 42, 43, 45], the proposed model has achieved competitive results in completeness and overall metrics. The main difference between the proposed method and WT‐MVSNet [45] is that WT‐MVSNet [45] adopts ViTs to replace the frequently used 3D convolutions for cost volume regularization.…”
Most existing multi‐view stereo (MVS) methods fail to consider global context information in the stages of feature extraction and cost aggregation. As transformers have shown remarkable performance on various vision tasks due to their ability to perceive global contextual information, this paper proposes a transformer‐based feature enhancement network (TF‐MVSNet) to facilitate feature representation learning by combining local features (both 2D and 3D) with long‐range contextual information. To reduce the memory consumption of feature matching, the cross‐attention mechanism is leveraged to efficiently construct 3D cost volumes under the epipolar constraint. Additionally, a colour‐guided network is designed to refine depth maps at the coarse stage, hence reducing incorrect depth predictions at the fine stage. Extensive experiments were performed on the DTU dataset and the Tanks and Temples (T&T) benchmark, and results are reported.
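As a rough illustration of the idea in this abstract, the sketch below (in PyTorch, with hypothetical tensor shapes and plain dot-product attention; it is not TF-MVSNet's actual architecture) restricts cross-attention for each reference pixel to source features sampled along its epipolar line, so the attention cost scales with the number of depth hypotheses rather than with the full source image.

```python
import torch
import torch.nn.functional as F

def epipolar_cross_attention(ref_feat, src_feat_samples):
    """Cross-attention restricted to epipolar samples (illustrative sketch).

    ref_feat:          (B, N, C)    one query feature per reference pixel
    src_feat_samples:  (B, N, D, C) source features sampled at D depth
                                    hypotheses along each pixel's epipolar line
    returns:           (B, N, D)    matching scores usable as a cost volume slice
    """
    scale = ref_feat.shape[-1] ** 0.5
    # (B, N, 1, C) x (B, N, C, D) -> (B, N, 1, D): each query attends only
    # to its own D epipolar samples, not to the whole source image.
    attn = torch.matmul(ref_feat.unsqueeze(2),
                        src_feat_samples.transpose(-1, -2)) / scale
    return F.softmax(attn.squeeze(2), dim=-1)

# Toy usage: 1 image, 8 pixels, 32 depth hypotheses, 16-dim features.
scores = epipolar_cross_attention(torch.randn(1, 8, 16),
                                  torch.randn(1, 8, 32, 16))
print(scores.shape)  # torch.Size([1, 8, 32])
```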
“…In another line of research, GRU‐based methods have been exploited for more lightweight solutions using higher resolution images (Wang, Zhu, et al., 2022). Binary search strategies have also been exploited for efficient memory handling (Mi et al., 2022). More recently, transformer architectures and attention‐based mechanisms have been proposed for more efficient incorporation of the global context (Ding et al., 2022; Wang, Galliani, et al., 2022; Yu, Guo, et al., 2021; Zhang et al., 2021; Zhu et al., 2021).…”
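For readers unfamiliar with the binary-search idea cited above (Mi et al., 2022), a minimal per-pixel sketch follows; `match_prob` is a hypothetical stand-in for a learned matching score over depth bins, and the real method operates on whole cost volumes and keeps only the few active bins in memory rather than processing one pixel at a time.

```python
def binary_search_depth(match_prob, d_min, d_max, n_bins=4, n_iters=4):
    """Shrink the depth interval to the best-scoring bin at each step.

    match_prob(lo, hi, n_bins) is assumed to return n_bins matching
    scores for depths sampled uniformly in [lo, hi]; higher is better.
    """
    lo, hi = d_min, d_max
    for _ in range(n_iters):
        scores = match_prob(lo, hi, n_bins)
        best = max(range(n_bins), key=lambda i: scores[i])
        width = (hi - lo) / n_bins
        # Keep only the winning bin, so memory stays constant instead of
        # growing with a dense sampling of the whole depth range.
        lo, hi = lo + best * width, lo + (best + 1) * width
    return 0.5 * (lo + hi)

# Toy usage with a synthetic unimodal score peaked at depth 4.2:
probe = lambda lo, hi, n: [-abs(lo + (i + 0.5) * (hi - lo) / n - 4.2)
                           for i in range(n)]
print(round(binary_search_depth(probe, 0.0, 10.0), 3))  # converges near 4.2
```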
Abstract: 3D reconstruction of scenes using multiple images, relying on robust correspondence search and depth estimation, has been thoroughly studied for the two‐view and multi‐view scenarios in recent years. Multi‐view stereo (MVS) algorithms aim to generate a rich, dense 3D model of the scene in the form of a dense point cloud or a triangulated mesh. In a typical MVS pipeline, the robust estimates of the camera poses, along with the sparse points obtained from structure from motion (SfM), are used as input. During this process, the depth of essentially every pixel of the scene is to be calculated. Several methods, either conventional or, more recently, learning‐based, have been developed for solving the correspondence search problem. A vast amount of research exists in the literature using local, global or semi‐global stereo matching approaches, with the PatchMatch algorithm being among the most popular and efficient conventional ones of the last decade. Yet, despite the widespread evolution of the algorithms, yielding complete, accurate and aesthetically pleasing 3D representations of a scene remains an open issue in real‐world and large‐scale photogrammetric applications. This work aims to provide a concrete survey of the most widely used MVS methods, investigating underlying concepts and challenges. To this end, the theoretical background and relevant literature are discussed for both conventional and learning‐based approaches, with a particular focus on close‐range 3D reconstruction applications.
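Since this survey abstract singles out PatchMatch as among the most efficient conventional approaches, a minimal single-scanline sketch of its core loop is given below; `photo_cost` is a hypothetical photometric cost function, and real PatchMatch stereo propagates full slanted-plane parameters in several directions over the whole image rather than plain depths left to right.

```python
import random

def patchmatch_1d(photo_cost, width, d_min, d_max, n_iters=3):
    """Core PatchMatch loop on one scanline: random initialization,
    neighbour propagation, and shrinking random refinement.

    photo_cost(x, d) is assumed to return the photometric matching cost
    of assigning depth d to pixel x (lower is better).
    """
    depth = [random.uniform(d_min, d_max) for _ in range(width)]
    for it in range(n_iters):
        radius = (d_max - d_min) / 2 ** (it + 1)      # shrinking search radius
        for x in range(width):
            candidates = [depth[x]]
            if x > 0:                                 # propagate from neighbour
                candidates.append(depth[x - 1])
            candidates.append(min(d_max, max(d_min,   # random refinement
                              depth[x] + random.uniform(-radius, radius))))
            depth[x] = min(candidates, key=lambda d: photo_cost(x, d))
    return depth

# Toy usage: ground-truth depth is a ramp; the cost is distance to it.
cost = lambda x, d: abs(d - (1.0 + 0.1 * x))
print([round(d, 2) for d in patchmatch_1d(cost, 8, 0.0, 5.0)])
```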
“…Coarse-to-fine Learning. The coarse-to-fine strategy plays an important role in learning-based stereo matching [51, 59, 63, 13], MVS [13, 64, 56, 29], and optical flow [32, 42, 60, 66]. CasMVSNet [13] builds coarse cost volumes at early stages with large depth ranges and lets later stages refine the details.…”
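To make the coarse-to-fine scheme concrete, the sketch below (hypothetical stage settings, loosely in the spirit of CasMVSNet; the real network builds and regularizes a full cost volume at every stage) narrows the per-pixel depth hypothesis range around the previous stage's upsampled estimate while doubling the spatial resolution.

```python
import torch
import torch.nn.functional as F

def next_stage_hypotheses(prev_depth, range_scale, n_hyp):
    """Build per-pixel depth hypotheses for the next (finer) stage.

    prev_depth:  (B, H, W) depth map estimated at the coarser stage
    range_scale: half-width of the new search range around prev_depth
    n_hyp:       number of hypotheses per pixel at this stage
    returns:     (B, n_hyp, 2H, 2W) hypotheses at doubled resolution
    """
    up = F.interpolate(prev_depth.unsqueeze(1), scale_factor=2,
                       mode='bilinear', align_corners=False)   # (B,1,2H,2W)
    offsets = torch.linspace(-1.0, 1.0, n_hyp) * range_scale   # (n_hyp,)
    # Broadcast: each upsampled depth gets n_hyp offsets around it.
    return up + offsets.view(1, n_hyp, 1, 1)

# Toy usage: the coarse stage searched a wide range; the fine stage samples
# 8 hypotheses inside a much narrower band around each upsampled depth.
coarse = torch.full((1, 4, 4), 5.0)
hyp = next_stage_hypotheses(coarse, range_scale=0.5, n_hyp=8)
print(hyp.shape)  # torch.Size([1, 8, 8, 8])
```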
Learning robust local image feature matching is a fundamental low-level vision task, which has been widely explored in the past few years. Recently, detector-free local feature matchers based on transformers have shown promising results, largely outperforming pure Convolutional Neural Network (CNN)-based ones. But correlations produced by transformer-based methods are spatially limited to the centers of the source views' coarse patches, owing to the cost of attention learning. In this work, we rethink this issue and find that such a matching formulation degrades pose estimation, especially for low-resolution images. So we propose a transformer-based cascade matching model, the Cascade feature Matching TRansformer (CasMTR), to efficiently learn dense feature correlations, which allows us to choose more reliable matching pairs for relative pose estimation. Instead of re-training a new detector, we use a simple yet effective Non-Maximum Suppression (NMS) post-process to filter keypoints through the confidence map, largely improving the matching precision. CasMTR achieves state-of-the-art performance in indoor and outdoor pose estimation as well as visual localization. Moreover, thorough ablations show the efficacy of the proposed components and techniques.
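The NMS post-process mentioned in this abstract can be sketched with a standard max-pooling trick (a common implementation pattern, not necessarily CasMTR's exact code): a pixel is kept as a keypoint only if it is the maximum of its local window and its confidence clears a threshold.

```python
import torch
import torch.nn.functional as F

def nms_keypoints(conf, window=5, threshold=0.5):
    """Keep keypoints that are local maxima of the confidence map.

    conf: (B, 1, H, W) matching-confidence map in [0, 1]
    returns a boolean mask of the same shape marking retained keypoints.
    """
    # A pixel is a local maximum iff max-pooling leaves its value unchanged.
    local_max = F.max_pool2d(conf, window, stride=1, padding=window // 2)
    return (conf == local_max) & (conf >= threshold)

# Toy usage on a random confidence map:
conf = torch.rand(1, 1, 32, 32)
mask = nms_keypoints(conf, window=5, threshold=0.8)
print(int(mask.sum()), "keypoints retained")
```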