DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion

Düzçeker, Arda; Galliani, Silvano; Vogel, Christoph; Speciale, Pablo; Dusmanu, Mihai; Pollefeys, Marc

doi:10.1109/cvpr46437.2021.01507

Cited by 54 publications

(51 citation statements)

References 32 publications


“…Video-based depth estimation has attracted extensive attentions recently. They are mainly categorized into multi-view stereo based approaches [11,24,36] and hybrid methods [19,25]. The former try to improve the traditional structure-from-motion and multi-view stereo pipeline with some learning-based modules, such as a differentiable depth and pose modules or a depth estimation uncertainty predictor.…”

Section: Related Work (mentioning)

confidence: 99%

Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth

Xu, Yin, Chen et al. 2022

Preprint


Existing monocular depth estimation shows excellent robustness in the wild, but the affine-invariant prediction requires aligning with the ground truth globally while being converted into the metric depth. In this work, we firstly propose a modified locally weighted linear regression strategy to leverage sparse ground truth and generate a flexible depth transformation to correct the coarse misalignment brought by global recovery strategy. Applying this strategy, we achieve significant improvement (more than 50% at most) over most recent state-of-the-art methods on five zero-shot datasets. Moreover, we train a robust depth estimation model with 6.3 million data and analyze the training process by decoupling the inaccuracy into coarse misalignment inaccuracy and detail missing inaccuracy. As a result, our model based on ResNet50 even outperforms the state-of-the-art DPT ViT-Large model with the help of our recovery strategy. In addition to accuracy, the consistency is also boosted for simple per-frame video depth estimation. Compared with monocular depth estimation, robust video depth estimation, and depth completion methods, our pipeline obtains state-of-the-art performance on video depth estimation without any post-processing. Experiments of 3D scene reconstruction from consistent video depth are conducted for intuitive comparison as well.
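The abstract above contrasts a single global alignment of affine-invariant depth with a locally weighted one. A minimal sketch of that idea follows, not the authors' implementation: the function names, the Gaussian kernel on pixel distance, and the `bandwidth` parameter are all assumptions chosen for illustration.

```python
import math

def fit_scale_shift(pred, gt, weights=None):
    """Weighted least-squares fit of gt ≈ s * pred + t; returns (s, t)."""
    if weights is None:
        weights = [1.0] * len(pred)  # unweighted case = global alignment
    w_sum = sum(weights)
    mean_p = sum(w * p for w, p in zip(weights, pred)) / w_sum
    mean_g = sum(w * g for w, g in zip(weights, gt)) / w_sum
    var_p = sum(w * (p - mean_p) ** 2 for w, p in zip(weights, pred))
    cov = sum(w * (p - mean_p) * (g - mean_g)
              for w, p, g in zip(weights, pred, gt))
    if var_p < 1e-12:
        # Degenerate case (constant prediction): fall back to a pure shift.
        return 1.0, mean_g - mean_p
    s = cov / var_p
    return s, mean_g - s * mean_p

def local_scale_shift(query_xy, anchors_xy, pred, gt, bandwidth):
    """Fit (s, t) for one pixel, weighting sparse ground-truth anchors by a
    Gaussian kernel on image distance (kernel choice is an assumption)."""
    qx, qy = query_xy
    weights = [
        math.exp(-((qx - x) ** 2 + (qy - y) ** 2) / (2.0 * bandwidth ** 2))
        for x, y in anchors_xy
    ]
    return fit_scale_shift(pred, gt, weights)
```

With `weights=None` the fit reduces to the usual global scale-and-shift recovery; evaluating `local_scale_shift` per pixel yields a spatially varying transformation, which is the kind of flexibility the abstract describes.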



“…For example, CNN-SLAM [23] predicts depth images for keyframes, then refines them using small-baseline multi-view stereo from surrounding non-keyframe images. DeepVideoMVS [25] extends a cost volume-based encoder-decoder with a ConvLSTM cell at the bottleneck layer to leverage past scene geometry to improve depth prediction at the current time step. Unlike standard ConvLSTM methods, it can make geometrically correct predictions because of its underlying use of MVS at each time step.…”

Section: A Real-time Dense Monocular 3d Reconstruction (mentioning)

confidence: 99%

Real-Time Hybrid Mapping of Populated Indoor Scenes using a Low-Cost Monocular UAV

Golodetz, Vankadari, Everitt et al. 2022

Preprint


Unmanned aerial vehicles (UAVs) have been used for many applications in recent years, from urban search and rescue, to agricultural surveying, to autonomous underground mine exploration. However, deploying UAVs in tight, indoor spaces, especially close to humans, remains a challenge. One solution, when limited payload is required, is to use micro-UAVs, which pose less risk to humans and typically cost less to replace after a crash. However, micro-UAVs can only carry a limited sensor suite, e.g. a monocular camera instead of a stereo pair or LiDAR, complicating tasks like dense mapping and markerless multi-person 3D human pose estimation, which are needed to operate in tight environments around people. Monocular approaches to such tasks exist, and dense monocular mapping approaches have been successfully deployed for UAV applications. However, despite many recent works on both marker-based and markerless multi-UAV single-person motion capture, markerless single-camera multi-person 3D human pose estimation remains a much earlier-stage technology, and we are not aware of existing attempts to deploy it in an aerial context. In this paper, we present what is thus, to our knowledge, the first system to perform simultaneous mapping and multi-person 3D human pose estimation from a monocular camera mounted on a single UAV. In particular, we show how to loosely couple state-of-the-art monocular depth estimation and monocular 3D human pose estimation approaches to reconstruct a hybrid map of a populated indoor scene in real time. We validate our component-level design choices via extensive experiments on the large-scale ScanNet and GTA-IM datasets. To evaluate our system-level performance, we also construct a new Oxford Hybrid Mapping dataset of populated indoor scenes.

I. INTRODUCTION

Recent years have seen huge improvements in the flight stability and obstacle avoidance capabilities of unmanned aerial vehicles, driven by applications including aerial search and rescue [1], aerial tracking and surveillance [2], drone cinematography [3], robotic agriculture [4], and the exploration of everything from mines [5] to other planets [6]. However, deploying drones in confined indoor spaces close to people remains challenging. This is unfortunate, because numerous applications, from awareness systems for emergency responders to indoor drone cinematography for film-makers, could benefit significantly from such a capability.

To operate in such an environment, it is helpful for a drone to be able to both map its geometry and detect/track the people moving within it, ideally in real time. At the same time, however, the physical constraints imposed by the environment encourage the use of a small drone (e.g. ≈10cm

All authors are with the University of Oxford. M. Vankadari, A. Everitt and S. Shin assert joint second authorship.


“…Inspired by stereo matching networks [26,3], MVS studies [43,4,16,22,20,39] have developed cost volume for unstructured multi-view matching. Relying on basic frameworks, such as DPSNet [22] or MVSNet [43], follow-up research proposes point-based depth refinement [4], cascaded depth refinement [16], and temporal fusion network [20,10]. After exhaustively estimating a collection of depth maps, depth fusion [14,8] starts to reconstruct the global 3D scene.…”

Section: Multi-view Stereo (mentioning)

confidence: 99%

“…Since this strategy proved to be effective, it became the most commonly used technique to build a cost volume. As a result, it has also been widely applied in un-rectified multi-view stereo pipelines [20,22,10]. Nonetheless, it appears that this representation is not appropriate for multi-view stereo.…”

Section: Posed Convolution Layer (mentioning)

confidence: 99%

“…To address this issue, we propose to closely mimic the traditional 3D reconstruction pipeline with two distinct stages: local reconstruction and global fusion. However, unlike the previous studies [43,10,32] and concurrent papers [38,1], we integrate these two stages in an end-to-end manner. First, our network computes the local geometry, i.e., dense depth maps from neighboring frames.…”

Section: Introduction (mentioning)

confidence: 99%


VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction

Choe, Im, Rameau et al. 2021

Preprint


To reconstruct a 3D scene from a set of calibrated views, traditional multi-view stereo techniques rely on two distinct stages: local depth maps computation and global depth maps fusion. Recent studies concentrate on deep neural architectures for depth estimation by using conventional depth fusion method or direct 3D reconstruction network by regressing Truncated Signed Distance Function (TSDF). In this paper, we advocate that replicating the traditional two stages framework with deep neural networks improves both the interpretability and the accuracy of the results. As mentioned, our network operates in two steps: 1) the local computation of the local depth maps with a deep MVS technique, and, 2) the depth maps and images' features fusion to build a single TSDF volume. In order to improve the matching performance between images acquired from very different viewpoints (e.g., large-baseline and rotations), we introduce a rotation-invariant 3D convolution kernel called PosedConv. The effectiveness of the proposed architecture is underlined via a large series of experiments conducted on the ScanNet dataset where our approach compares favorably against both traditional and deep learning techniques.
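For context on the "conventional depth fusion" the abstract refers to, the following is a minimal sketch of a classical TSDF running-average update along a single camera ray. It illustrates the hand-crafted baseline only, not the learned fusion network of VolumeFusion; the function name, the unit per-observation weight, and the one-dimensional ray layout are assumptions made for brevity.

```python
def integrate_tsdf(tsdf, weights, voxel_depths, observed_depth, trunc):
    """Fuse one depth observation into the voxels along one ray, in place.

    tsdf, weights : per-voxel running TSDF value and accumulated weight
    voxel_depths  : depth of each voxel centre along the ray
    observed_depth: depth measured for this ray in the current frame
    trunc         : truncation distance of the signed distance function
    """
    for i, z in enumerate(voxel_depths):
        sdf = observed_depth - z           # positive in front of the surface
        if sdf < -trunc:
            continue                       # far behind the surface: unobserved
        d = min(1.0, sdf / trunc)          # truncate and normalise to [-1, 1]
        w = weights[i]
        # Running average with unit weight per observation (an assumption).
        tsdf[i] = (tsdf[i] * w + d) / (w + 1.0)
        weights[i] = w + 1.0
```

Calling this once per frame and per ray accumulates depth maps into a single volume; the surface lies where the fused TSDF crosses zero.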

