“…Compared with the recent methods [11, 42, 43, 45], the proposed model has achieved competitive results in completeness and overall metrics. The main difference between the proposed method and WT‐MVSNet [45] is that WT‐MVSNet [45] adopts ViTs to replace the frequently used 3D convolutions for cost volume regularization.…”
Most existing multi‐view stereo (MVS) methods fail to consider global context information in the stages of feature extraction and cost aggregation. As transformers have shown remarkable performance on various vision tasks due to their ability to perceive global contextual information, this paper proposes a transformer‐based feature enhancement network (TF‐MVSNet) to facilitate feature representation learning by combining local features (both 2D and 3D) with long‐range contextual information. To reduce the memory consumption of feature matching, the cross‐attention mechanism is leveraged to efficiently construct 3D cost volumes under the epipolar constraint. Additionally, a colour‐guided network is designed to refine depth maps at the coarse stage, hence reducing incorrect depth predictions at the fine stage. Extensive experiments were performed on the DTU dataset and the Tanks and Temples (T&T) benchmark, and results are reported.
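As a rough illustration of the idea in this abstract, the sketch below (in PyTorch, with hypothetical tensor shapes and plain dot-product attention; it is not TF-MVSNet's actual architecture) restricts cross-attention for each reference pixel to source features sampled along its epipolar line, so the attention cost scales with the number of depth hypotheses rather than with the full source image.

```python
import torch
import torch.nn.functional as F

def epipolar_cross_attention(ref_feat, src_feat_samples):
    """Cross-attention restricted to epipolar samples (illustrative sketch).

    ref_feat:          (B, N, C)    one query feature per reference pixel
    src_feat_samples:  (B, N, D, C) source features sampled at D depth
                                    hypotheses along each pixel's epipolar line
    returns:           (B, N, D)    matching scores usable as a cost volume slice
    """
    scale = ref_feat.shape[-1] ** 0.5
    # (B, N, 1, C) x (B, N, C, D) -> (B, N, 1, D): each query attends only
    # to its own D epipolar samples, not to the whole source image.
    attn = torch.matmul(ref_feat.unsqueeze(2),
                        src_feat_samples.transpose(-1, -2)) / scale
    return F.softmax(attn.squeeze(2), dim=-1)

# Toy usage: 1 image, 8 pixels, 32 depth hypotheses, 16-dim features.
scores = epipolar_cross_attention(torch.randn(1, 8, 16),
                                  torch.randn(1, 8, 32, 16))
print(scores.shape)  # torch.Size([1, 8, 32])
```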
“…In another line of research, GRU‐based methods have been exploited for more lightweight solutions using higher resolution images (Wang, Zhu, et al., 2022). Binary search strategies have also been exploited for efficient memory handling (Mi et al., 2022). More recently, transformer architectures and attention‐based mechanisms have been proposed for more efficient incorporation of the global context (Ding et al., 2022; Wang, Galliani, et al., 2022; Yu, Guo, et al., 2021; Zhang et al., 2021; Zhu et al., 2021).…”
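For readers unfamiliar with the binary-search idea cited above (Mi et al., 2022), a minimal per-pixel sketch follows; `match_prob` is a hypothetical stand-in for a learned matching score over depth bins, and the real method operates on whole cost volumes and keeps only the few active bins in memory rather than processing one pixel at a time.

```python
def binary_search_depth(match_prob, d_min, d_max, n_bins=4, n_iters=4):
    """Shrink the depth interval to the best-scoring bin at each step.

    match_prob(lo, hi, n_bins) is assumed to return n_bins matching
    scores for depths sampled uniformly in [lo, hi]; higher is better.
    """
    lo, hi = d_min, d_max
    for _ in range(n_iters):
        scores = match_prob(lo, hi, n_bins)
        best = max(range(n_bins), key=lambda i: scores[i])
        width = (hi - lo) / n_bins
        # Keep only the winning bin, so memory stays constant instead of
        # growing with a dense sampling of the whole depth range.
        lo, hi = lo + best * width, lo + (best + 1) * width
    return 0.5 * (lo + hi)

# Toy usage with a synthetic unimodal score peaked at depth 4.2:
probe = lambda lo, hi, n: [-abs(lo + (i + 0.5) * (hi - lo) / n - 4.2)
                           for i in range(n)]
print(round(binary_search_depth(probe, 0.0, 10.0), 3))  # converges near 4.2
```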
Abstract: 3D reconstruction of scenes using multiple images, relying on robust correspondence search and depth estimation, has been thoroughly studied for the two‐view and multi‐view scenarios in recent years. Multi‐view stereo (MVS) algorithms aim to generate a rich, dense 3D model of the scene in the form of a dense point cloud or a triangulated mesh. In a typical MVS pipeline, the robust estimates of the camera poses, along with the sparse points obtained from structure from motion (SfM), are used as input. During this process, the depth of essentially every pixel of the scene is to be calculated. Several methods, either conventional or, more recently, learning‐based, have been developed for solving the correspondence search problem. A vast amount of research exists in the literature using local, global or semi‐global stereo matching approaches, with the PatchMatch algorithm being among the most popular and efficient conventional ones of the last decade. Yet, despite the widespread evolution of the algorithms, yielding complete, accurate and aesthetically pleasing 3D representations of a scene remains an open issue in real‐world and large‐scale photogrammetric applications. This work aims to provide a concrete survey of the most widely used MVS methods, investigating underlying concepts and challenges. To this end, the theoretical background and relevant literature are discussed for both conventional and learning‐based approaches, with a particular focus on close‐range 3D reconstruction applications.
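Since this survey abstract singles out PatchMatch as among the most efficient conventional approaches, a minimal single-scanline sketch of its core loop is given below; `photo_cost` is a hypothetical photometric cost function, and real PatchMatch stereo propagates full slanted-plane parameters in several directions over the whole image rather than plain depths left to right.

```python
import random

def patchmatch_1d(photo_cost, width, d_min, d_max, n_iters=3):
    """Core PatchMatch loop on one scanline: random initialization,
    neighbour propagation, and shrinking random refinement.

    photo_cost(x, d) is assumed to return the photometric matching cost
    of assigning depth d to pixel x (lower is better).
    """
    depth = [random.uniform(d_min, d_max) for _ in range(width)]
    for it in range(n_iters):
        radius = (d_max - d_min) / 2 ** (it + 1)      # shrinking search radius
        for x in range(width):
            candidates = [depth[x]]
            if x > 0:                                 # propagate from neighbour
                candidates.append(depth[x - 1])
            candidates.append(min(d_max, max(d_min,   # random refinement
                              depth[x] + random.uniform(-radius, radius))))
            depth[x] = min(candidates, key=lambda d: photo_cost(x, d))
    return depth

# Toy usage: ground-truth depth is a ramp; the cost is distance to it.
cost = lambda x, d: abs(d - (1.0 + 0.1 * x))
print([round(d, 2) for d in patchmatch_1d(cost, 8, 0.0, 5.0)])
```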
“…Coarse-to-fine Learning. The coarse-to-fine strategy plays an important role in learning-based stereo matching [51, 59, 63, 13], MVS [13, 64, 56, 29], and optical flow [32, 42, 60, 66]. CasMVSNet [13] builds coarse cost volumes at early stages with large depth ranges and lets later stages refine the details.…”
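To make the coarse-to-fine scheme concrete, the sketch below (hypothetical stage settings, loosely in the spirit of CasMVSNet; the real network builds and regularizes a full cost volume at every stage) narrows the per-pixel depth hypothesis range around the previous stage's upsampled estimate while doubling the spatial resolution.

```python
import torch
import torch.nn.functional as F

def next_stage_hypotheses(prev_depth, range_scale, n_hyp):
    """Build per-pixel depth hypotheses for the next (finer) stage.

    prev_depth:  (B, H, W) depth map estimated at the coarser stage
    range_scale: half-width of the new search range around prev_depth
    n_hyp:       number of hypotheses per pixel at this stage
    returns:     (B, n_hyp, 2H, 2W) hypotheses at doubled resolution
    """
    up = F.interpolate(prev_depth.unsqueeze(1), scale_factor=2,
                       mode='bilinear', align_corners=False)   # (B,1,2H,2W)
    offsets = torch.linspace(-1.0, 1.0, n_hyp) * range_scale   # (n_hyp,)
    # Broadcast: each upsampled depth gets n_hyp offsets around it.
    return up + offsets.view(1, n_hyp, 1, 1)

# Toy usage: the coarse stage searched a wide range; the fine stage samples
# 8 hypotheses inside a much narrower band around each upsampled depth.
coarse = torch.full((1, 4, 4), 5.0)
hyp = next_stage_hypotheses(coarse, range_scale=0.5, n_hyp=8)
print(hyp.shape)  # torch.Size([1, 8, 8, 8])
```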
Learning robust local image feature matching is a fundamental low-level vision task, which has been widely explored in the past few years. Recently, detector-free local feature matchers based on transformers have shown promising results, largely outperforming pure Convolutional Neural Network (CNN)-based ones. But correlations produced by transformer-based methods are spatially limited to the centers of the source views' coarse patches, owing to the cost of attention learning. In this work, we rethink this issue and find that such a matching formulation degrades pose estimation, especially for low-resolution images. So we propose a transformer-based cascade matching model, the Cascade feature Matching TRansformer (CasMTR), to efficiently learn dense feature correlations, which allows us to choose more reliable matching pairs for relative pose estimation. Instead of re-training a new detector, we use a simple yet effective Non-Maximum Suppression (NMS) post-process to filter keypoints through the confidence map, largely improving the matching precision. CasMTR achieves state-of-the-art performance in indoor and outdoor pose estimation as well as visual localization. Moreover, thorough ablations show the efficacy of the proposed components and techniques.
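The NMS post-process mentioned in this abstract can be sketched with a standard max-pooling trick (a common implementation pattern, not necessarily CasMTR's exact code): a pixel is kept as a keypoint only if it is the maximum of its local window and its confidence clears a threshold.

```python
import torch
import torch.nn.functional as F

def nms_keypoints(conf, window=5, threshold=0.5):
    """Keep keypoints that are local maxima of the confidence map.

    conf: (B, 1, H, W) matching-confidence map in [0, 1]
    returns a boolean mask of the same shape marking retained keypoints.
    """
    # A pixel is a local maximum iff max-pooling leaves its value unchanged.
    local_max = F.max_pool2d(conf, window, stride=1, padding=window // 2)
    return (conf == local_max) & (conf >= threshold)

# Toy usage on a random confidence map:
conf = torch.rand(1, 1, 32, 32)
mask = nms_keypoints(conf, window=5, threshold=0.8)
print(int(mask.sum()), "keypoints retained")
```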