2021
DOI: 10.48550/arxiv.2103.13413
Preprint

Vision Transformers for Dense Prediction

Abstract: We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at …
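The core operation described in the abstract, reassembling patch tokens from a transformer stage into an image-like feature map that a convolutional decoder can consume, can be illustrated with a short sketch. This is a minimal PyTorch illustration, not the authors' implementation; the function name tokens_to_feature_map, the single leading class token, and the tensor shapes are assumptions made for the sketch.

```python
import torch

def tokens_to_feature_map(tokens, grid_size):
    """Reshape a sequence of ViT patch tokens into an image-like feature map.

    tokens: (batch, 1 + H_p * W_p, dim) tensor from a transformer stage,
            assumed to carry one leading class token.
    grid_size: (H_p, W_p), the number of patches along each spatial axis.
    Returns a (batch, dim, H_p, W_p) tensor.
    """
    b, n, d = tokens.shape
    h, w = grid_size
    patch_tokens = tokens[:, 1:, :]        # drop the class token
    fmap = patch_tokens.transpose(1, 2)    # (b, d, H_p * W_p)
    return fmap.reshape(b, d, h, w)

# Example: a 384x384 image with 16x16 patches gives a 24x24 token grid.
stage_tokens = torch.randn(1, 1 + 24 * 24, 768)
feature_map = tokens_to_feature_map(stage_tokens, (24, 24))  # (1, 768, 24, 24)
```

Maps recovered this way from several stages can be resampled to different resolutions and progressively fused, which is the role of the convolutional decoder mentioned in the abstract.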

Cited by 36 publications (59 citation statements)
References 38 publications
“…Vision Transformers (ViTs). Since Dosovitskiy et al (Dosovitskiy et al 2020) first successfully applied transformers to image classification by dividing images into non-overlapping patches, many ViT variants have been proposed (Wang et al 2021b; Han et al 2021; Chen et al 2021a; Ranftl, Bochkovskiy, and Koltun 2021; Liu et al 2021; Chen, Fan, and Panda 2021; Zhang et al 2021a; Xie et al 2021; Zhang et al 2021b; Jonnalagadda, Wang, and Eckstein 2021; Wang et al 2021d; Fang et al 2021; Huang et al 2021; Gao et al 2021; Rao et al 2021; Yu et al 2021; Zhou et al 2021b; El-Nouby et al 2021; Wang et al 2021c; Xu et al 2021). In this section, we mainly review several closely related works for training ViTs.…”
Section: Related Work (mentioning)
confidence: 99%
“…• Loss function: For semantic segmentation, we apply the cross-entropy loss (weight=1) and a deep supervision loss (weight=0.4). For depth estimation, we employ the scale- and shift-invariant trimmed loss (weight=1) that operates on an inverse depth representation [16] and the gradient-matching loss [10] (weight=1).…”
Section: Batch Size and Training Time (mentioning)
confidence: 99%
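The weighted loss combination for semantic segmentation quoted above can be written out directly. The following is a minimal PyTorch sketch, not the cited paper's code: the auxiliary head producing aux_logits and the tensor shapes are assumptions, and only the weights 1 and 0.4 come from the statement.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(main_logits, aux_logits, target,
                      main_weight=1.0, aux_weight=0.4):
    """Weighted sum of the main cross-entropy loss and a deep-supervision
    (auxiliary head) cross-entropy loss, using the 1 / 0.4 weights quoted above.

    main_logits, aux_logits: (batch, num_classes, H, W) score maps.
    target: (batch, H, W) integer class labels.
    """
    main = F.cross_entropy(main_logits, target)
    aux = F.cross_entropy(aux_logits, target)
    return main_weight * main + aux_weight * aux

# Example: 2 images, 21 classes, 64x64 predictions.
logits = torch.randn(2, 21, 64, 64)
aux = torch.randn(2, 21, 64, 64)
labels = torch.randint(0, 21, (2, 64, 64))
loss = segmentation_loss(logits, aux, labels)
```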
“…Unlike light-weight CNNs that are easy to optimize and integrate with task-specific networks, ViTs are heavy-weight (e.g., ViT-B/16 vs. MobileNetv3: 86 vs. 7.5 million parameters), harder to optimize, need extensive data augmentation and L2 regularization to prevent over-fitting (Touvron et al, 2021a; Wang et al, 2021), and require expensive decoders for down-stream tasks, especially dense prediction tasks. For instance, a ViT-based segmentation network (Ranftl et al, 2021) learns about 345 million parameters and achieves performance similar to the CNN-based network DeepLabv3 with 59 million parameters. The need for more parameters in ViT-based models is likely because they lack the image-specific inductive bias that is inherent in CNNs.…”
Section: Introduction (mentioning)
confidence: 99%