Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching

Gu, Xiaodong; Fan, Zhiwen; Zhu, Siyu; Dai, Zuozhuo; Tan, Feitong; Tan, Ping

doi:10.1109/cvpr42600.2020.00257

Cited by 563 publications

(638 citation statements)

References 42 publications

Citation types: Supporting, Mentioning (546), Contrasting

“…NLCA-Net [ 32 ] replaces the concatenation operation by calculating the variance of extracted features, which can reduce the C channel by half. For the D channel, recent CSN [ 26 ] reduces this dimension by generating a disparity candidate range and gradually refining the disparity map in a coarse-to-fine manner. These methods can reduce the memory and computational cost to a certain extent.…”

Section: Related Work (mentioning)

confidence: 99%
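
The channel saving described in the excerpt above can be illustrated with a minimal sketch. It assumes a two-view (stereo) setting with one C-dimensional feature vector per pixel per view. The function names `concat_cost` and `variance_cost` are hypothetical and are not taken from NLCA-Net or any library.

```python
def concat_cost(left, right):
    """Concatenate two per-pixel feature vectors: C + C = 2C channels."""
    return list(left) + list(right)


def variance_cost(left, right):
    """Per-channel variance across the two views: C channels."""
    out = []
    for a, b in zip(left, right):
        mean = (a + b) / 2.0
        out.append(((a - mean) ** 2 + (b - mean) ** 2) / 2.0)
    return out


# Hypothetical feature vectors with C = 4 channels.
left = [1.0, 2.0, 3.0, 4.0]
right = [1.0, 0.0, 3.0, 8.0]

print(len(concat_cost(left, right)))   # 8 channels
print(variance_cost(left, right))      # [0.0, 1.0, 0.0, 4.0]
```

The variance form keeps C channels regardless of how the two vectors differ, which is half of what concatenation produces for a stereo pair.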

“…In terms of different matching cost computation methods, current neural network-based stereo methods can be mainly divided into the following: 2D networks [ 18 , 19 , 20 , 21 , 22 , 23 ] with cost volumes generated by traditional methods or the correlation layer. 3D networks [ 24 , 25 , 26 , 27 ] with cost volumes generated by concatenation. According to published papers on KITTI official website, these two architectures have obvious differences in speed and accuracy, as shown in Figure 1 .…”

Section: Introduction (mentioning)

confidence: 99%

A Joint 2D-3D Complementary Network for Stereo Matching

Jia, Chen, Liang et al. 2021. Sensors

Stereo matching is an important research field in computer vision. Because of the dimensionality of cost aggregation, current neural network-based stereo methods struggle to balance speed and accuracy. To this end, we integrate fast 2D stereo methods with accurate 3D networks to improve performance and reduce running time. We leverage a 2D encoder-decoder network to generate a rough disparity map and construct a disparity range to guide the 3D aggregation network, which can significantly improve the accuracy and reduce the computational cost. We use a stacked hourglass structure to refine the disparity from coarse to fine. We evaluated our method on three public datasets. According to the KITTI official website results, our network can generate an accurate result in 80 ms on a modern GPU. Compared to other 2D stereo networks (AANet, DeepPruner, FADNet, etc.), our network has a big improvement in accuracy. Meanwhile, it is significantly faster than other 3D stereo networks (5× faster than PSMNet, 7.5× faster than CSN and 22.5× faster than GANet, etc.), demonstrating the effectiveness of our method.

“…Their method only calculates one depth map at a time instead of calculating the entire 3D scene. Gu et al [ 58 ] further improved MVSNet, which solved the cubic increase in computational complexity as the image resolution increased. Most 3D reconstruction algorithms are only applicable to static scenes.…”

Section: Related Work (mentioning)

confidence: 99%
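
The cubic growth mentioned in the excerpt above can be checked with simple arithmetic. This is a minimal sketch, assuming a dense cost volume with one entry per pixel per depth hypothesis, and assuming the number of depth hypotheses is scaled together with the image resolution. The function name and the sizes are hypothetical.

```python
def cost_volume_voxels(width, height, num_depths):
    """Number of entries in a dense width x height x num_depths cost volume."""
    return width * height * num_depths


# Hypothetical sizes: double the width, the height and the depth sampling.
base = cost_volume_voxels(160, 128, 48)
doubled = cost_volume_voxels(320, 256, 96)

print(doubled // base)   # 8, i.e. 2 ** 3
```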

Parallel Structure from Motion for Sparse Point Cloud Generation in Large-Scale Scenes

Bao, Lin et al. 2021. Sensors

Scene reconstruction uses images or videos as input to reconstruct a 3D model of a real scene and has important applications in smart cities, surveying and mapping, military, and other fields. Structure from motion (SFM) is a key step in scene reconstruction, which recovers sparse point clouds from image sequences. However, large-scale scenes cannot be reconstructed using a single compute node. Image matching and geometric filtering take up a lot of time in the traditional SFM problem. In this paper, we propose a novel divide-and-conquer framework to solve the distributed SFM problem. First, we use the global navigation satellite system (GNSS) information from images to calculate the GNSS neighborhood. The number of images matched is greatly reduced by matching each image to only valid GNSS neighbors. This way, a robust matching relationship can be obtained. Second, the calculated matching relationship is used as the initial camera graph, which is divided into multiple subgraphs by the clustering algorithm. The local SFM is executed on several computing nodes to register the local cameras. Finally, all of the local camera poses are integrated and optimized to complete the global camera registration. Experiments show that our system can accurately and efficiently solve the structure from motion problem in large-scale scenes.
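
The GNSS-neighborhood step in the abstract above can be sketched as follows. This is an illustration only: it assumes positions already converted to planar metric coordinates and a fixed distance threshold, neither of which is stated in the abstract, and `gnss_neighbors` is a hypothetical helper rather than the authors' implementation.

```python
import math


def gnss_neighbors(positions, radius):
    """Map each image id to the ids whose position lies within `radius`."""
    neighbors = {}
    for i, p in positions.items():
        neighbors[i] = sorted(
            j for j, q in positions.items()
            if j != i and math.dist(p, q) <= radius
        )
    return neighbors


# Hypothetical image positions in metres (x, y).
positions = {
    "a": (0.0, 0.0),
    "b": (3.0, 4.0),
    "c": (50.0, 0.0),
    "d": (52.0, 1.0),
}

pairs = gnss_neighbors(positions, radius=10.0)
print(pairs)   # {'a': ['b'], 'b': ['a'], 'c': ['d'], 'd': ['c']}
```

With these four images, exhaustive matching would test 6 pairs, while the neighbor restriction leaves 2.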

“…The MVSNet [1] is proposed to estimate the depth map for each view by building a cost volume followed by 3D CNN regularization. Moreover, due to the unideal run-time and memory requirements, the cascade pyramid structure [2] is proposed to build cost volume and infer depth in coarse to fine, which greatly reduces run-time and memory consumption. Besides, some unsupervised methods [7,3] are proposed to overcome the difficulty of obtaining ground-truth depth maps.…”

Section: Related Work (mentioning)

confidence: 99%
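
The coarse-to-fine idea in the excerpt above can be sketched for a single pixel. This is a simplified illustration, not the cited method: the stage settings are hypothetical, and picking the plane closest to a known depth stands in for the network's prediction.

```python
def depth_hypotheses(center, num_planes, interval):
    """Return `num_planes` depths spaced by `interval`, centered on `center`."""
    start = center - interval * (num_planes - 1) / 2.0
    return [start + i * interval for i in range(num_planes)]


def cascade_estimate(true_depth, initial_center, stages):
    """Refine one depth estimate stage by stage with narrowing ranges."""
    estimate = initial_center
    for num_planes, interval in stages:
        planes = depth_hypotheses(estimate, num_planes, interval)
        # Stand-in for the learned prediction: choose the closest plane.
        estimate = min(planes, key=lambda d: abs(d - true_depth))
    return estimate


# Hypothetical stages: (number of planes, spacing between planes).
STAGES = [(8, 40.0), (4, 10.0), (4, 2.5)]

result = cascade_estimate(613.0, 600.0, STAGES)
total_planes = sum(n for n, _ in STAGES)
print(result, total_planes)   # 613.75 16
```

Under these assumptions, 16 planes in total reach a 2.5-unit spacing, whereas a single sweep over the first stage's range (460 to 740) at that spacing would need 113 planes.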

“…Depth estimation from multi-view images has a wide range of applications, such as 3D reconstruction, scene understanding, view synthesis, and robot vision. Recently, deep learning-based MVS methods have achieved promising results [1,2], and most of them are used for 3D reconstruction tasks. However, most learning-based methods rely on ground-truth depth as supervision, which is difficult to obtain so that the application scenarios of supervised methods are very limited.…”

Section: Introduction (mentioning)

confidence: 99%

Real-Time Unsupervised Multi-View Depth Estimation Network For Virtual View Synthesis

Qiu, Liu et al. 2021. 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW)

Existing learning-based multi-view stereo (MVS) approaches achieve impressive results compared with traditional methods. However, most of them rely on ground-truth 3D data as supervision, and the acquisition of high-quality ground truth for various scenes is a challenging problem. In this paper, we propose a novel real-time unsupervised multi-view depth estimation network for virtual view synthesis tasks and use multi-view images as supervision. To improve the completeness and accuracy of the virtual viewpoint, we propose a novel shared occlusion mask to deal with the artifacts caused by occlusion in the reconstructed image, and filter out the unreliable points in the depth map. We also design a mask-based photometric loss to guide our network to generate more reasonable masks and high-quality depth maps. Experimental results on the IEEE1857.9 virtual viewpoint synthesis dataset demonstrate that our proposed method outperforms other recent MVS methods and achieves better real-time performance.
