2020 25th International Conference on Pattern Recognition (ICPR), 2021
DOI: 10.1109/icpr48806.2021.9412108

Delivering Meaningful Representation for Monocular Depth Estimation

Cited by 6 publications (8 citation statements)
References 26 publications
“…To estimate the depth in meters of the maps generated by pix2pix and CycleGAN, the GLPN model fine-tuned on NYUv2 was used, as described in [25]. The model can be found at [26]. The pipeline returns a dictionary with two entries.…”
Section: Results Obtained and Ablation Studies (unclassified)
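The two-entry dictionary the statement mentions matches the Hugging Face depth-estimation pipeline. Below is a minimal sketch of that usage; it assumes the checkpoint referred to as [26] is the publicly hosted "vinvino02/glpn-nyu" GLPN model fine-tuned on NYUv2, and the input filename is hypothetical:

```python
from transformers import pipeline
from PIL import Image

# Assumed checkpoint: GLPN fine-tuned on NYUv2, hosted on the Hugging Face Hub.
depth_estimator = pipeline("depth-estimation", model="vinvino02/glpn-nyu")

result = depth_estimator(Image.open("indoor_scene.jpg"))  # hypothetical input image

# The pipeline returns a dictionary with two entries:
#   "predicted_depth": the raw depth tensor produced by the model
#   "depth":           a PIL image of the depth map, rescaled for viewing
print(result["predicted_depth"].shape)
result["depth"].save("depth_map.png")
```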
“…Recently, Li et al [25] design MonoIndoor++, a framework that takes into account the main challenges of indoor scenarios. Kim et al [26] propose GLPDepth, a global-local transformer network that extracts meaningful features at different scales, with a Selective Feature Fusion CNN block in the decoder. The authors also integrate a revisited version of the CutDepth data augmentation method [27], which improves training on the NYU Depth v2 dataset without requiring additional data.…”
Section: B. ViT-based MDE Methods (mentioning)
confidence: 99%
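For context, the CutDepth augmentation cited above pastes a random crop of the ground-truth depth map onto the input RGB image, injecting depth cues without extra data. A minimal NumPy sketch of that idea follows; the application probability, rectangle-size ranges, and normalization are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def cutdepth(image, depth, p=0.75, rng=np.random.default_rng()):
    """CutDepth-style augmentation sketch (after [27]): replace a random
    rectangle of the RGB input with the corresponding ground-truth depth.
    image: (H, W, 3) float array in [0, 1]; depth: (H, W) float array."""
    if rng.random() > p:  # apply with probability p (assumed value)
        return image
    h, w = depth.shape
    # Sample rectangle size and position; ranges are illustrative assumptions.
    rw = int(w * rng.uniform(0.1, 0.5))
    rh = int(h * rng.uniform(0.1, 0.5))
    x0 = rng.integers(0, w - rw + 1)
    y0 = rng.integers(0, h - rh + 1)
    out = image.copy()
    crop = depth[y0:y0 + rh, x0:x0 + rw]
    # Normalize the depth crop to [0, 1] and replicate it across channels.
    crop = (crop - crop.min()) / (crop.max() - crop.min() + 1e-8)
    out[y0:y0 + rh, x0:x0 + rw] = crop[..., None]
    return out
```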
“…Kim et al introduced GLPDepth [18], a Transformer-based architecture and training strategy for monocular depth estimation that considers both the global and local context of the image. It uses the SegFormer encoder [40] to capture global dependencies and a lightweight decoder with skip connections to integrate local information.…”
Section: A. Relevant State-of-the-art Vision Transformer Models for Se... (mentioning)
confidence: 99%
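The global-plus-local design described here fuses upsampled decoder features with same-resolution encoder skip features at each scale. A minimal PyTorch sketch of a Selective-Feature-Fusion-style block in that spirit; the exact layer layout in GLPDepth [18] may differ from this approximation:

```python
import torch
import torch.nn as nn

class SelectiveFusion(nn.Module):
    """Sketch of an SFF-style block: predict per-pixel weights that decide
    how much to trust the global decoder stream vs. the local skip stream."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, global_feat, local_feat):
        # Two attention maps, one per input stream.
        w = self.attn(torch.cat([global_feat, local_feat], dim=1))
        return global_feat * w[:, 0:1] + local_feat * w[:, 1:2]

# Usage: fuse an upsampled decoder feature with a same-resolution skip feature.
fuse = SelectiveFusion(channels=64)
out = fuse(torch.randn(1, 64, 120, 160), torch.randn(1, 64, 120, 160))
```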
“…Unlike them, our method is a simple single-stage method that takes an image as input and performs the joint segmentation and depth estimation tasks in a single forward pass.
• For this purpose, we designed a hybrid encoding and decoding framework based on the Vision Transformer variants SegFormer [40] and GLPDepth [18].
• We chose the best model for each task (segmentation and depth estimation) to design a multitask model, based on a thorough assessment of their advantages and drawbacks.…”
Section: Introduction (mentioning)
confidence: 99%
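To make the single-forward-pass idea concrete, here is a minimal PyTorch sketch of a shared encoder with two task heads. The tiny convolutional encoder merely stands in for the SegFormer/GLPDepth hybrid the authors actually build, and all layer sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSegDepth(nn.Module):
    """Single-stage multitask sketch: one shared encoder, two light heads,
    both outputs produced in a single forward pass."""
    def __init__(self, num_classes=19, width=64):
        super().__init__()
        # Stand-in encoder; the cited framework uses a SegFormer-style
        # Transformer backbone instead of these convolutions.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, width, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.seg_head = nn.Conv2d(width, num_classes, 1)           # class logits
        self.depth_head = nn.Sequential(nn.Conv2d(width, 1, 1), nn.Sigmoid())

    def forward(self, x):
        feats = self.encoder(x)
        size = x.shape[2:]
        seg = F.interpolate(self.seg_head(feats), size=size,
                            mode="bilinear", align_corners=False)
        depth = F.interpolate(self.depth_head(feats), size=size,
                              mode="bilinear", align_corners=False)
        return seg, depth  # both tasks from one pass over shared features

seg_logits, depth_map = JointSegDepth()(torch.randn(1, 3, 240, 320))
```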