2021
DOI: 10.48550/arxiv.2103.13413
Preprint

Vision Transformers for Dense Prediction

Abstract: We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at …
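The core operation described in the abstract, reassembling patch tokens from a transformer stage into an image-like feature map that a convolutional decoder can consume, can be illustrated with a short sketch. This is a minimal PyTorch illustration, not the authors' implementation; the function name tokens_to_feature_map, the single leading class token, and the tensor shapes are assumptions made for the sketch.

```python
import torch

def tokens_to_feature_map(tokens, grid_size):
    """Reshape a sequence of ViT patch tokens into an image-like feature map.

    tokens: (batch, 1 + H_p * W_p, dim) tensor from a transformer stage,
            assumed to carry one leading class token.
    grid_size: (H_p, W_p), the number of patches along each spatial axis.
    Returns a (batch, dim, H_p, W_p) tensor.
    """
    b, n, d = tokens.shape
    h, w = grid_size
    patch_tokens = tokens[:, 1:, :]        # drop the class token
    fmap = patch_tokens.transpose(1, 2)    # (b, d, H_p * W_p)
    return fmap.reshape(b, d, h, w)

# Example: a 384x384 image with 16x16 patches gives a 24x24 token grid.
stage_tokens = torch.randn(1, 1 + 24 * 24, 768)
feature_map = tokens_to_feature_map(stage_tokens, (24, 24))  # (1, 768, 24, 24)
```

Maps recovered this way from several stages can be resampled to different resolutions and progressively fused, which is the role of the convolutional decoder mentioned in the abstract.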

Cited by 36 publications (59 citation statements)
References 38 publications
“…Vision Transformers (ViTs). Since Dosovitskiy et al (Dosovitskiy et al 2020) first successfully applied transformers to image classification by dividing images into non-overlapping patches, many ViT variants have been proposed (Wang et al 2021b; Han et al 2021; Chen et al 2021a; Ranftl, Bochkovskiy, and Koltun 2021; Liu et al 2021; Chen, Fan, and Panda 2021; Zhang et al 2021a; Xie et al 2021; Zhang et al 2021b; Jonnalagadda, Wang, and Eckstein 2021; Wang et al 2021d; Fang et al 2021; Huang et al 2021; Gao et al 2021; Rao et al 2021; Yu et al 2021; Zhou et al 2021b; El-Nouby et al 2021; Wang et al 2021c; Xu et al 2021). In this section, we mainly review several closely related works for training ViTs.…”
Section: Related Work (mentioning)
confidence: 99%
“…• Loss function: For semantic segmentation, we apply the cross-entropy loss (weight=1) and a deep supervision loss (weight=0.4). For depth estimation, we employ the scale- and shift-invariant trimmed loss (weight=1) that operates on an inverse depth representation [16] and the gradient-matching loss [10] (weight=1).…”
Section: Batch Size and Training Time (mentioning)
confidence: 99%
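The weighted loss combination for semantic segmentation quoted above can be written out directly. The following is a minimal PyTorch sketch, not the cited paper's code: the auxiliary head producing aux_logits and the tensor shapes are assumptions, and only the weights 1 and 0.4 come from the statement.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(main_logits, aux_logits, target,
                      main_weight=1.0, aux_weight=0.4):
    """Weighted sum of the main cross-entropy loss and a deep-supervision
    (auxiliary head) cross-entropy loss, using the 1 / 0.4 weights quoted above.

    main_logits, aux_logits: (batch, num_classes, H, W) score maps.
    target: (batch, H, W) integer class labels.
    """
    main = F.cross_entropy(main_logits, target)
    aux = F.cross_entropy(aux_logits, target)
    return main_weight * main + aux_weight * aux

# Example: 2 images, 21 classes, 64x64 predictions.
logits = torch.randn(2, 21, 64, 64)
aux = torch.randn(2, 21, 64, 64)
labels = torch.randint(0, 21, (2, 64, 64))
loss = segmentation_loss(logits, aux, labels)
```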
“…Unlike light-weight CNNs that are easy to optimize and integrate with task-specific networks, ViTs are heavy-weight (e.g., ViT-B/16 vs. MobileNetv3: 86 vs. 7.5 million parameters), harder to optimize, need extensive data augmentation and L2 regularization to prevent over-fitting (Touvron et al, 2021a; Wang et al, 2021), and require expensive decoders for down-stream tasks, especially dense prediction tasks. For instance, a ViT-based segmentation network (Ranftl et al, 2021) learns about 345 million parameters and achieves performance similar to the CNN-based network DeepLabv3 with 59 million parameters. The need for more parameters in ViT-based models is likely because they lack the image-specific inductive bias that is inherent in CNNs.…”
Section: Introduction (mentioning)
confidence: 99%