Supervised deep learning often suffers from a lack of sufficient training data. In the context of monocular depth map prediction specifically, it is hardly possible to obtain dense ground-truth depth images in realistic dynamic outdoor environments. When using LiDAR sensors, for instance, the distance measurements are noisy, the calibration between sensors cannot be perfect, and the measurements are typically much sparser than the camera images. In this paper, we propose a novel approach to depth map prediction from monocular images that learns in a semi-supervised way. While we use sparse ground-truth depth for supervised learning, we also encourage our deep network to produce photoconsistent dense depth maps in a stereo setup through a direct image alignment loss. In experiments we demonstrate superior performance in depth map prediction from single images compared to state-of-the-art methods.
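To make the direct image alignment loss concrete, the following is a minimal PyTorch sketch for a rectified stereo pair: the right image is warped into the left view using the predicted depth, the known focal length, and the stereo baseline, and the photometric difference is penalized. The function names, the plain L1 penalty, and the rectified-stereo assumption are illustrative choices, not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def image_alignment_loss(left_img, right_img, depth_left, focal, baseline):
    """Warp the right image into the left view via predicted depth and
    penalize photometric differences (hypothetical L1 variant)."""
    b, _, h, w = left_img.shape
    # Horizontal disparity from depth: d = f * B / Z (in pixels).
    disparity = focal * baseline / depth_left.clamp(min=1e-3)  # (b,1,h,w)

    # For rectified stereo, a left pixel at x corresponds to x - d
    # in the right image; build the sampling grid accordingly.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=left_img.device, dtype=torch.float32),
        torch.arange(w, device=left_img.device, dtype=torch.float32),
        indexing="ij",
    )
    x_src = xs.unsqueeze(0) - disparity.squeeze(1)  # (b,h,w)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack(
        (2.0 * x_src / (w - 1) - 1.0,
         2.0 * ys.unsqueeze(0).expand_as(x_src) / (h - 1) - 1.0),
        dim=-1,
    )
    warped = F.grid_sample(right_img, grid, align_corners=True,
                           padding_mode="border")
    return (warped - left_img).abs().mean()
```

Because the loss depends on the predicted depth only through the differentiable warp, gradients flow back into the depth network, which is what lets the sparse LiDAR supervision be complemented by dense photoconsistency.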
While ground-truth depth data remains hard to obtain, self-supervised monocular depth estimation methods enjoy growing attention. Much research in this area aims at improving loss functions or network architectures. Most works, however, do not leverage self-supervision to its full potential. They stick to the standard closed-world train-test pipeline, assuming the network parameters to be fixed after training is finished. Such an assumption prevents adaptation to new scenes, whereas with self-supervision this becomes possible without extra annotations. In this paper, we propose a novel self-supervised Continuous Monocular Depth Adaptation method (CoMoDA), which adapts a pretrained model on a test video on the fly. As opposed to existing test-time refinement methods that use isolated frame triplets, we opt for continuous adaptation, making use of previous experience from the same scene. We additionally augment the proposed procedure with experience from the distant past, preventing the model from overfitting and thus forgetting already learnt information. We demonstrate that our method can be used for both intra- and cross-dataset adaptation. By adapting the model from the train to the test set of the Eigen split of KITTI, we achieve state-of-the-art depth estimation performance and surpass all existing methods that use standard architectures. We also show that our method runs 15 times faster than existing test-time refinement methods.
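The sketch below illustrates the general shape of such a continuous adaptation loop with a replay buffer standing in for the "experience from the distant past". The model, optimizer, loss function, buffer size, and replay probability are all placeholders for exposition; they are not the authors' implementation or hyperparameters.

```python
import random
import torch

def adapt_on_video(model, optimizer, self_supervised_loss, video_frames,
                   replay_buffer, buffer_size=1000, replay_prob=0.5):
    """Predict depth for each incoming frame while updating the model
    online; mix in stored triplets so old scenes are not forgotten."""
    depths = []
    for t in range(2, len(video_frames)):
        triplet = video_frames[t - 2:t + 1]  # consecutive frame triplet

        # Occasionally train on a triplet from the distant past instead
        # of the most recent one, counteracting overfitting to the
        # current scene (a simple stand-in for the paper's mechanism).
        if replay_buffer and random.random() < replay_prob:
            batch = random.choice(replay_buffer)
        else:
            batch = triplet

        optimizer.zero_grad()
        loss = self_supervised_loss(model, batch)  # e.g. photometric loss
        loss.backward()
        optimizer.step()

        # Store the current triplet for future replay (reservoir-style).
        if len(replay_buffer) < buffer_size:
            replay_buffer.append(triplet)
        else:
            replay_buffer[random.randrange(buffer_size)] = triplet

        with torch.no_grad():
            depths.append(model(triplet[1]))  # depth for the middle frame
    return depths
```

The key contrast with isolated test-time refinement is that the parameters are never reset between frames, so each update builds on all previous experience from the same video.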
In recent years, there has been significant progress in overcoming the negative effects of domain shift in semantic segmentation. Yet, existing unsupervised domain adaptation methods operate in an offline fashion, which imposes multiple restrictions on their deployment in real-world scenarios. In this paper, we introduce the problem of online domain adaptation for semantic segmentation, which requires producing predictions for new frames of target-domain videos while simultaneously and continuously adapting the model to them. To tackle this problem, we propose a novel method that utilizes unsupervised structure-from-motion cues as the primary source of domain adaptation. By optimizing online the representation shared between the depth and semantics networks, our geometry-guided algorithm achieves semantic segmentation performance comparable to state-of-the-art offline methods, without using any target-domain training data.
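One way to picture the geometry-guided scheme is the sketch below: a shared encoder feeds both a segmentation head and a depth head, and only the self-supervised structure-from-motion loss on consecutive target frames updates the shared representation. All names (encoder, seg_head, depth_head, pose_net, sfm_loss) are hypothetical stand-ins, not the paper's API.

```python
import torch

def online_segmentation(encoder, seg_head, depth_head, pose_net,
                        sfm_loss, optimizer, frame_stream):
    """Yield a segmentation for each frame while adapting the shared
    encoder with an unsupervised structure-from-motion loss."""
    prev_frame = None
    for frame in frame_stream:
        features = encoder(frame)

        # Prediction for the current frame; no target labels are used.
        with torch.no_grad():
            yield seg_head(features)

        # Self-supervised update: predicted depth and relative pose
        # should explain how prev_frame warps into the current frame.
        if prev_frame is not None:
            depth = depth_head(features)
            pose = pose_net(prev_frame, frame)
            loss = sfm_loss(prev_frame, frame, depth, pose)
            optimizer.zero_grad()
            loss.backward()  # gradients reach the shared encoder
            optimizer.step()
        prev_frame = frame.detach()
```

Because the segmentation head itself receives no target-domain supervision, any improvement on the target video comes from the geometric signal refining the shared features online.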