Monocular depth estimators can be trained with various forms of self-supervision from binocular-stereo data, circumventing the need for high-quality laser scans or other ground-truth data. The disadvantage, however, is that the photometric reprojection losses used with self-supervised learning typically have multiple local minima. These plausible-looking alternatives to ground truth can restrict what a regression network learns, causing it to predict depth maps of limited quality. As one prominent example, depth discontinuities around thin structures are often incorrectly estimated by current state-of-the-art methods. Here, we study the problem of ambiguous reprojections in depth prediction from stereo-based self-supervision, and introduce Depth Hints to alleviate their effects. Depth Hints are complementary depth suggestions obtained from simple off-the-shelf stereo algorithms. These hints augment an existing photometric loss function and guide a network toward better weights. They require no additional data, and are assumed to be right only sometimes. We show that using our Depth Hints gives a substantial boost when training several leading self-supervised-from-stereo models, not just our own. Further, combined with other good practices, we produce state-of-the-art depth predictions on the KITTI benchmark.
We demonstrate that our selective training using Depth Hints is a general enhancement that can improve multiple leading self-supervised training algorithms, allowing our implementations to reach better minima. The Depth Hints can come from the same stereo image data, via, e.g., OpenCV's stereo estimates [13,14]. We show that our selective training with Depth Hints, coupled with sensible network design choices, outperforms most other algorithms. We achieve state-of-the-art results on the KITTI dataset [8], outperforming both our baseline model and previously published results.
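The selective use of hints described above can be sketched as follows. This is a minimal, hedged illustration, not the paper's implementation: it assumes per-pixel photometric reprojection losses have already been computed for both the network's prediction and the stereo-algorithm hint, and it trusts the hint only at pixels where the hint reprojects better, consistent with hints being "right only sometimes". The function names and the log-depth regression term are hypothetical choices for this sketch.

```python
import numpy as np

def depth_hint_mask(pred_photo_loss, hint_photo_loss):
    # Hypothetical helper: trust the hint only at pixels where its
    # photometric reprojection loss beats the network prediction's.
    return hint_photo_loss < pred_photo_loss

def hinted_loss(pred_photo_loss, hint_photo_loss, pred_depth, hint_depth):
    # Base self-supervised term: mean photometric loss of the prediction.
    loss = pred_photo_loss.mean()
    # Where the hint reprojects better, add a supervised term pulling
    # the predicted depth toward the hint (here, an L1 in log-depth,
    # an assumed choice for illustration).
    mask = depth_hint_mask(pred_photo_loss, hint_photo_loss)
    if mask.any():
        loss += np.abs(np.log(pred_depth[mask]) - np.log(hint_depth[mask])).mean()
    return loss
```

In this formulation a bad hint at a given pixel simply fails the mask test and contributes nothing, which is why hints need only be correct some of the time to help training.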