2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.01596

Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction

Cited by 132 publications (76 citation statements)
References 44 publications
“…Recently, transformer architectures such as the Vision Transformer (ViT) (Ranftl et al, 2021) and the Data-efficient image Transformer (DeiT) (Touvron et al, 2021) have shown strong results on image classification (Bhojanapalli et al, 2021; Paul and Chen, 2021). Motivated by their success, researchers have replaced CNN encoders with transformers in scene understanding tasks such as object detection (Carion et al, 2020; Liu et al, 2021), semantic segmentation (Zheng et al, 2021; Strudel et al, 2021), and supervised monocular depth estimation (Ranftl et al, 2020; Yang et al, 2021).…”
Section: Related Work (mentioning)
confidence: 99%
“…For supervised monocular depth estimation, the Dense Prediction Transformer (DPT) (Ranftl et al, 2020) uses ViT as the encoder with a convolutional decoder and shows more coherent predictions than CNNs due to the global receptive field of transformers. TransDepth (Yang et al, 2021) additionally uses a ResNet projection layer and attention gates in the decoder to induce the spatial locality of CNNs for supervised monocular depth and surface-normal estimation. Lately, some works have incorporated elements of transformers such as self-attention (Vaswani et al, 2017) into self-supervised monocular depth estimation (Johnston and Carneiro, 2020; Xiang et al, 2021).…”
Section: Related Work (mentioning)
confidence: 99%
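
As a concrete illustration of the encoder-decoder pattern this excerpt describes (a ViT-style transformer encoder over image patches followed by a convolutional decoder that produces a dense, pixel-wise output such as depth), the following minimal PyTorch sketch may help. It is not the DPT or TransDepth implementation; the class name, layer sizes, and hyperparameters are illustrative assumptions.

# Minimal sketch of a transformer-encoder / convolutional-decoder model for
# dense prediction. Not DPT or TransDepth; all names and sizes are assumptions.
import torch
import torch.nn as nn

class TransformerDensePredictor(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        grid = img_size // patch_size                      # tokens per side
        # Patch embedding: a strided convolution maps each patch to one token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, grid * grid, dim))
        # Transformer encoder: every token attends to every other token,
        # giving the global receptive field mentioned in the excerpt.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Convolutional decoder: reshape tokens to a 2D feature map and
        # upsample back to full resolution for pixel-wise prediction.
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 3, padding=1),                # 1 output channel, e.g. depth
        )

    def forward(self, x):                                  # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                       # (B, dim, H/16, W/16)
        b, c, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2) + self.pos_embed  # (B, N, dim)
        tokens = self.encoder(tokens)                      # global self-attention
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)  # back to a 2D map
        return self.decoder(feat)                          # (B, 1, H, W)

out = TransformerDensePredictor()(torch.randn(1, 3, 224, 224))  # torch.Size([1, 1, 224, 224])

The point the excerpt emphasizes is visible in the encoder: even the first layer lets every patch token attend to every other token, whereas a CNN must stack many layers before its receptive field covers the whole image.
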
“…GC-Transformer Decoder. We encourage readers to refer to (Dosovitskiy et al 2021) for the standard Transformer structure, which achieves state-of-the-art results on many tasks (Li et al 2021; Yang et al 2021). We propose the GC-Transformer decoder, which inherits the classical structure with customized designs for 3D meshes.…”
Section: Geometry-contrastive Transformer (mentioning)
confidence: 99%
“…Recently, a new trend of leveraging the Transformer architecture [61] in the computer vision domain has emerged [12,19,28,33,36,57,71,83]. The Vision Transformer (ViT), which relies solely on the self-attention mechanism inherited from the Transformer architecture, has set many state-of-the-art (SOTA) records in image classification [6,21,60], object detection [1,3,17,47], tracking [14,46,72], semantic segmentation [15,20,84], depth estimation [40,75], human pose estimation [39], 3D object animation [7], image retrieval [22], and image enhancement [8,44,74]. However, despite these impressive results, ViTs have sacrificed lightweight model capacity, portability, and trainability in return for high accuracy.…”
mentioning
confidence: 99%
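
Since this excerpt centers on the self-attention mechanism that ViT inherits from the Transformer (Vaswani et al, 2017), a minimal sketch of multi-head scaled dot-product self-attention is given below; the tensor shapes, names, and sizes are illustrative assumptions rather than any cited paper's implementation.

# Minimal sketch of scaled dot-product self-attention; shapes and names are
# illustrative assumptions, not a specific paper's code.
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)   # joint projection to queries/keys/values
        self.proj = nn.Linear(dim, dim)      # output projection

    def forward(self, x):                    # x: (B, N, dim), N = number of tokens
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (B, heads, N, head_dim)
        q, k, v = (t.reshape(b, n, self.heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # Every token attends to every other token -> global receptive field.
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)  # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

tokens = torch.randn(2, 196, 256)            # e.g. 14x14 patch tokens
print(SelfAttention()(tokens).shape)         # torch.Size([2, 196, 256])

The N-by-N attention map in this sketch grows quadratically with the number of tokens, which partly explains the capacity and portability trade-off the excerpt mentions for ViTs.
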