2022 International Conference on 3D Vision (3DV)
DOI: 10.1109/3dv57658.2022.00077

MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer

Cited by 92 publications (56 citation statements)
References 64 publications

“…Table 5 reports the depth estimation results for Monodepth2 and MonoViT trained on KITTI and evaluated on KITTI and the OOD test set vKITTI. The results are provided using the standard metrics [54,16,1]: absolute relative error (Abs Rel), root mean squared error (RMSE), and accuracy δ1 (δ < 1.25). Comparing the results, it is evident that the depth estimates on vKITTI are significantly worse in all three metrics.…”
Section: Depth Estimation Results on vKITTI
Mentioning confidence: 99%
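
The metrics quoted above are the standard KITTI depth-evaluation measures. As a reference, here is a minimal NumPy sketch of how they are typically computed; the function name, the validity mask, and the return layout are illustrative assumptions, not code from the cited papers.

```python
# Minimal sketch (illustrative, not from the cited papers) of the standard
# monocular depth metrics mentioned above: Abs Rel, RMSE, and delta_1,
# the fraction of pixels where max(pred/gt, gt/pred) < 1.25.
import numpy as np

def depth_metrics(gt: np.ndarray, pred: np.ndarray) -> dict:
    mask = gt > 0                    # evaluate only pixels with valid ground truth
    gt, pred = gt[mask], pred[mask]

    abs_rel = np.mean(np.abs(gt - pred) / gt)    # absolute relative error
    rmse = np.sqrt(np.mean((gt - pred) ** 2))    # root mean squared error
    delta1 = np.mean(np.maximum(gt / pred, pred / gt) < 1.25)  # accuracy δ < 1.25

    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```
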
“…Models: Compared to recent work [46,41,22], we evaluate not only on Monodepth2 [16] but also on two recently published transformer-based models, Pixelformer [1] and MonoViT [54]. In the case of NYU, Monodepth2 and Pixelformer are trained in a supervised manner.…”
Section: Methods
Mentioning confidence: 99%

“…Monodepth2 [16] used an auto-masking loss to reject objects moving at a speed similar to the camera, proposed a per-pixel minimum reprojection loss to handle occlusion, and introduced a multi-scale sampling method to reduce visual artifacts. Lite-Mono [31] proposed a Consecutive Dilated Convolutions (CDC) module to extract rich multi-scale local features and a Local-Global Features Interaction (LGFI) module to encode long-range global information into the features. R-MSFM [32] proposed recurrent multi-scale feature modulation: it extracts per-pixel features, builds a multi-scale feature modulation module, and iteratively updates the inverse depth at a fixed resolution through a parameter-shared decoder. FeatDepth [17] introduced the FeatureNet architecture for single-view reconstruction, in addition to the cross-view reconstruction networks DepthNet and PoseNet.…”
Section: Self-Supervised Depth Estimation
Mentioning confidence: 99%
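
Since the quote above summarizes Monodepth2's minimum reprojection loss and auto-masking, a compact PyTorch sketch may help make the mechanism concrete. It substitutes a plain L1 error for the SSIM+L1 photometric term used in the paper, and all function and tensor names are illustrative assumptions, not the authors' code.

```python
# Sketch of Monodepth2-style per-pixel minimum reprojection with auto-masking.
# Uses a plain L1 photometric error for brevity (the paper mixes SSIM and L1).
import torch

def photometric_error(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # (B, 3, H, W) images -> (B, 1, H, W) per-pixel error
    return (a - b).abs().mean(dim=1, keepdim=True)

def min_reprojection_loss(target, warped_sources, raw_sources):
    # Per-pixel minimum over the warped source views mitigates occlusion:
    # each pixel only needs to be explained by one source frame.
    reproj = torch.stack([photometric_error(target, w) for w in warped_sources])
    min_reproj = reproj.min(dim=0).values

    # Auto-masking: ignore pixels where the *unwarped* source already matches
    # the target at least as well (static scenes, objects moving with the camera).
    identity = torch.stack([photometric_error(target, s) for s in raw_sources])
    min_identity = identity.min(dim=0).values
    mask = (min_reproj < min_identity).float()

    return (mask * min_reproj).sum() / mask.sum().clamp(min=1.0)
```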