2022
DOI: 10.48550/arxiv.2202.10108
Preprint

ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond

Abstract: Vision transformers have shown great potential in various computer vision tasks owing to their strong capability to model long-range dependency using the self-attention mechanism. Nevertheless, they treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) in modeling local visual structures and dealing with scale variance, which is instead learned implicitly from large-scale training data with longer training schedules. In this paper, we propose a Vision Transformer Advanced b…
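To make the abstract's point about 1D tokenization concrete, here is a minimal patch-embedding sketch in PyTorch. It is not code from the paper; the class name, shapes, and hyperparameters are illustrative assumptions. It shows how a plain ViT flattens an image into a sequence of visual tokens with no built-in locality or multi-scale handling, which is the inductive-bias gap the paper targets.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal sketch: flatten an image into a 1D sequence of visual tokens.

    A plain ViT-style embedding like this carries no intrinsic locality or
    scale-variance prior; those have to be learned from data.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Non-overlapping patches projected to embed_dim with one strided conv.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, N, D): 1D token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```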

Cited by 15 publications (21 citation statements)
References 68 publications (130 reference statements)

“…Furthermore, various configurations of an MLP-like variant offer limited gains despite the increased number of parameters. Recently, the visual community has conducted some scaling-up research on vision Transformers with self-supervised pre-training, including V-MoE [90], SwinV2 [117], and ViTAEv2 [140], which afford a considerable performance boost. Nevertheless, scaling-up techniques specific to MLPs need further exploration.…”
Section: Discussion
confidence: 99%
“…We also use LayerScale [70] to train deep models. Like previous studies [5, 66], we further fine-tune iFormer at an input size of 384 × 384, with a weight decay of 1 × 10⁻⁸, a learning rate of 1 × 10⁻⁵, and a batch size of 512. For fairness, we adopt timm [71] to implement and train iFormer.…”
Section: Results on Image Classification
confidence: 99%
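The statement above reads as a fine-tuning recipe, so here is a hedged sketch of what such a setup might look like with timm. The 384 × 384 input, learning rate 1 × 10⁻⁵, weight decay 1 × 10⁻⁸, and batch size 512 come from the quote; the model name (a stock timm ViT standing in for iFormer), the AdamW optimizer, and the toy batch are assumptions for illustration only.

```python
import timm
import torch
import torch.nn.functional as F

# Stand-in backbone from timm; the cited work fine-tunes iFormer, which may not
# ship with timm, so a standard 384-input ViT is used here purely for illustration.
model = timm.create_model("vit_base_patch16_384", pretrained=True, num_classes=1000)

# Hyperparameters quoted above: lr 1e-5, weight decay 1e-8 (optimizer choice assumed).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-8)

images = torch.randn(4, 3, 384, 384)     # toy batch; the quote reports batch size 512
labels = torch.randint(0, 1000, (4,))

optimizer.zero_grad()
loss = F.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()
```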
“…We fine-tune our model end-to-end on MS COCO [34] for the object detection and instance segmentation tasks. We replace the ViT backbone with our pretrained LoMaR model in the ViTDet [33] and ViTAE [60] frameworks. We report object detection results in AP^box and instance segmentation results in AP^mask.…”
Section: Comparison With Other Self-supervised Approaches
confidence: 99%
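The backbone swap described above amounts to loading self-supervised pretrained encoder weights into the detector's ViT before COCO fine-tuning. The sketch below is a hypothetical illustration of that step only, not the ViTDet or ViTAE code: the placeholder encoder, the checkpoint path, and the key layout are all assumptions.

```python
import torch
from torch import nn

class ViTBackbone(nn.Module):
    """Placeholder encoder standing in for the detection framework's ViT backbone."""
    def __init__(self, embed_dim=768, depth=12, nhead=12):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(embed_dim, nhead=nhead, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, tokens):          # tokens: (B, N, D)
        for blk in self.blocks:
            tokens = blk(tokens)
        return tokens

backbone = ViTBackbone()

# Assumed checkpoint name; in practice the keys may need remapping to match
# the detector's backbone before the model is trained end-to-end on MS COCO.
# state = torch.load("lomar_pretrained.pth", map_location="cpu")
# missing, unexpected = backbone.load_state_dict(state, strict=False)
```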