Monocular depth estimators can be trained with various forms of self-supervision from binocular-stereo data, circumventing the need for high-quality laser scans or other ground-truth data. The disadvantage, however, is that the photometric reprojection losses used with self-supervised learning typically have multiple local minima. These plausible-looking alternatives to ground truth can restrict what a regression network learns, causing it to predict depth maps of limited quality. As one prominent example, depth discontinuities around thin structures are often incorrectly estimated by current state-of-the-art methods. Here, we study the problem of ambiguous reprojections in depth prediction from stereo-based self-supervision, and introduce Depth Hints to alleviate their effects. Depth Hints are complementary depth suggestions obtained from simple off-the-shelf stereo algorithms. These hints augment an existing photometric loss function and guide a network toward better weights. They require no additional data, and are assumed to be right only sometimes. We show that using our Depth Hints gives a substantial boost when training several leading self-supervised-from-stereo models, not just our own. Further, combined with other good practices, we produce state-of-the-art depth predictions on the KITTI benchmark.
We demonstrate that our selective training using Depth Hints is a general enhancement that can improve multiple leading self-supervised training algorithms, allowing our implementations to reach better minima. The Depth Hints can come from the same stereo image data, via, e.g., OpenCV's stereo estimates [13,14]. We show that our selective training with Depth Hints, coupled with sensible network design choices, outperforms most other algorithms. We achieve state-of-the-art results on the KITTI dataset [8], outperforming both our baseline model and previously published results.
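The selective use of hints described above can be sketched as follows. This is a minimal, hedged illustration, not the paper's implementation: it assumes per-pixel photometric reprojection losses have already been computed for both the network's prediction and the stereo-algorithm hint, and it trusts the hint only at pixels where the hint reprojects better, consistent with hints being "right only sometimes". The function names and the log-depth regression term are hypothetical choices for this sketch.

```python
import numpy as np

def depth_hint_mask(pred_photo_loss, hint_photo_loss):
    # Hypothetical helper: trust the hint only at pixels where its
    # photometric reprojection loss beats the network prediction's.
    return hint_photo_loss < pred_photo_loss

def hinted_loss(pred_photo_loss, hint_photo_loss, pred_depth, hint_depth):
    # Base self-supervised term: mean photometric loss of the prediction.
    loss = pred_photo_loss.mean()
    # Where the hint reprojects better, add a supervised term pulling
    # the predicted depth toward the hint (here, an L1 in log-depth,
    # an assumed choice for illustration).
    mask = depth_hint_mask(pred_photo_loss, hint_photo_loss)
    if mask.any():
        loss += np.abs(np.log(pred_depth[mask]) - np.log(hint_depth[mask])).mean()
    return loss
```

In this formulation a bad hint at a given pixel simply fails the mask test and contributes nothing, which is why hints need only be correct some of the time to help training.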