Self-Supervised Monocular Depth Estimation Using Hybrid Transformer Encoder

Hwang, Seung-Jun; Park, Sung-Jun; Baek, Joong-Hwan; Kim, Byung-Kyu

doi:10.1109/jsen.2022.3199265

Cited by 16 publications

(14 citation statements)

References 52 publications

“…3. Unlike Hwang et al. [27], which utilizes residual blocks to improve local features, our objective in developing HFM is to comprehensively integrate local detailed features from the ResNet branch and global features from the Transformer branch using adaptive feature alignment. The module generates four fused features $\{F_i\}_{i=1}^{4}$ with a channel number of 64, reducing model complexity, enhancing computational efficiency, and preventing overfitting.…”

Section: HFM Module (mentioning)

confidence: 99%

“…However, the pure Transformer model lacks the ability to model local information due to the absence of spatial inductive bias. To achieve more satisfactory results, some methods have started to combine Transformer with CNNs [13,22,26–28] to leverage the strengths of both approaches. This combination allows for better performance in MDE tasks [13,22,26], as illustrated in Fig.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, if the features acquired from the previous stage are not accurate enough, it may affect the subsequent features. Alternatively, fewer methods adopt a parallel strategy [27,28], as shown in Fig. 1a, to obtain the last layer of features for fusion through a parallel backbone network to improve the effectiveness and performance of the model.…”

Section: Introduction (mentioning)

confidence: 99%

PCTDepth: Exploiting Parallel CNNs and Transformer via Dual Attention for Monocular Depth Estimation

Xia, Duan, Gao et al. 2024

Neural Process Lett

Monocular depth estimation (MDE) has made great progress with the development of convolutional neural networks (CNNs). However, these approaches suffer from essential shortsightedness due to the utilization of insufficient feature-based reasoning. To this end, we propose an effective parallel CNNs and Transformer model for MDE via dual attention (PCTDepth). Specifically, we use two stream backbones to extract features, where ResNet and Swin Transformer are utilized to obtain local detail features and global long-range dependencies, respectively. Furthermore, a hierarchical fusion module (HFM) is designed to actively exchange beneficial information for the complementation of each representation during the intermediate fusion. Finally, a dual attention module is incorporated for each fused feature in the decoder stage to improve the accuracy of the model by enhancing inter-channel correlations and focusing on relevant spatial locations. Comprehensive experiments on the KITTI dataset demonstrate that the proposed model consistently outperforms the other state-of-the-art methods.

“…The main idea behind the self-supervised monocular depth prediction is to use view synthesis [3] to construct photometric consistency loss as supervision. Typically, the self-supervised monocular depth predictions [4–6] construct two neural networks to estimate depth and pose, using a photometric and gradient-based loss, called appearance loss, for training. Since the appearance loss is pretty fragile for illumination variations, to further improve the robustness and accuracy of monocular depth prediction, many loss schemes, such as ICP loss [7] and scale consistency geometric constraints [8–10] are proposed to promote the self-supervised learning.…”

Section: Introduction (mentioning)

confidence: 99%
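
The citation statement above describes an appearance loss built from a photometric term and a gradient-based term, and notes that it is fragile under illumination changes. The following is a minimal sketch of that idea, not the formulation used by any of the cited papers: the function name `appearance_loss`, the L1 form of both terms, and the `grad_weight` parameter are assumptions made for illustration, and the view-synthesis (warping) step is assumed to have already produced `synthesized`.

```python
import numpy as np


def appearance_loss(target, synthesized, grad_weight=0.5):
    """Illustrative appearance loss: photometric L1 plus image-gradient L1.

    `target` is the real frame and `synthesized` is the frame reconstructed
    by view synthesis; both are arrays of identical shape (H, W) or (H, W, C).
    `grad_weight` is a hypothetical weighting, not a value from the source.
    """
    target = np.asarray(target, dtype=np.float64)
    synthesized = np.asarray(synthesized, dtype=np.float64)

    # Photometric term: mean absolute intensity difference.
    photometric = np.abs(target - synthesized).mean()

    # Gradient term: compare finite differences along width and height.
    grad_x = np.abs(np.diff(target, axis=1) - np.diff(synthesized, axis=1)).mean()
    grad_y = np.abs(np.diff(target, axis=0) - np.diff(synthesized, axis=0)).mean()

    return photometric + grad_weight * (grad_x + grad_y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((4, 5))
    # A perfect reconstruction gives zero loss.
    print(appearance_loss(frame, frame))
    # A uniform brightness shift leaves gradients unchanged but is still
    # penalized by the photometric term.
    print(appearance_loss(frame, frame + 0.1))
```

In this sketch a uniform brightness offset between the two frames is penalized by the photometric term even when the reconstruction is geometrically perfect, which is one way to read the fragility to illumination variations that the statement mentions.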

Self‐supervised depth completion with multi‐view geometric constraints

Xiong, Zhang, Liu et al. 2023

IET Image Processing

Self‐supervised learning‐based depth completion is a cost‐effective way for 3D environment perception. However, it is also a challenging task because sparse depth may deactivate neural networks. In this paper, a novel Sparse‐Dense Depth Consistency Loss (SDDCL) is proposed to penalize not only the estimated depth map with sparse input points but also consecutive completed dense depth maps. Combined with the pose consistency loss, a new self‐supervised learning scheme is developed, using multi‐view geometric constraints, to achieve more accurate depth completion results. Moreover, to tackle the sparsity issue of input depth, a Quasi Dense Representations (QDR) module with triplet branches for spatial pyramid pooling is proposed to produce more dense feature maps. Extensive experimental results on VOID, NYUv2, and KITTI datasets show that the method outperforms state‐of‐the‐art self‐supervised depth completion methods.

“…While this approach is effective at leveraging prior knowledge such as object shape and textures, it is limited in its ability to learn the geometry and the motion of the scene. By contrast, using multiple frames [1], [2], [16] as input has the potential to provide a more comprehensive view of the scene and to help the model better understand the relationships between objects and their motions.…”

Section: Introduction (mentioning)

confidence: 99%

Forecasting of depth and ego-motion with transformers and self-supervision

Boulahbal, Voicila, Comport 2022

2022 26th International Conference on Pattern Recognition (ICPR)

This paper addresses the problem of end-to-end self-supervised forecasting of depth and ego-motion. Given a sequence of raw images, the aim is to forecast both the geometry and ego-motion using a self-supervised photometric loss. The architecture is designed using both convolution and transformer modules. This leverages the benefits of both modules: the inductive bias of CNNs and the multi-head attention of transformers, thus enabling a rich spatio-temporal representation that enables accurate depth forecasting. Prior work attempts to solve this problem using multi-modal input/output with supervised ground-truth data, which is not practical since a large annotated dataset is required. Alternatively to prior methods, this paper forecasts depth and ego-motion using only self-supervised raw images as input. The approach performs significantly well on the KITTI dataset benchmark with several performance criteria being even comparable to prior non-forecasting self-supervised monocular depth inference methods.
