“…Recently, a trend of applying the Transformer architecture [61] to the computer vision domain has emerged [12,19,28,33,36,57,71,83]. The Vision Transformer (ViT), which relies solely on the self-attention mechanism inherited from the Transformer architecture, has set many state-of-the-art (SOTA) records in image classification [6,21,60], object detection [1,3,17,47], tracking [14,46,72], semantic segmentation [15,20,84], depth estimation [40,75], human pose estimation [39], 3D object animation [7], image retrieval [22], and image enhancement [8,44,74]. However, despite these impressive results, ViTs sacrifice model compactness, portability, and trainability in exchange for high accuracy.…”
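The self-attention mechanism that ViTs inherit from the Transformer can be sketched as scaled dot-product attention over a sequence of patch embeddings. The following is a minimal NumPy illustration, not the paper's implementation; the function name, shapes, and random projections are hypothetical and chosen only to show the computation.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a token sequence.

    x: (n_tokens, d_model) patch embeddings;
    w_q, w_k, w_v: (d_model, d_head) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Pairwise token affinities, scaled by sqrt of the head dimension.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Row-wise softmax turns affinities into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted mix of all value vectors.
    return weights @ v  # (n_tokens, d_head)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # e.g. 4 patches, dim 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

Note the quadratic cost in the number of tokens (the `(n_tokens, n_tokens)` score matrix), which is one reason ViTs trade away the lightweight capacity mentioned above.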