ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
Preprint, 2021
DOI: 10.48550/arxiv.2106.03348

Abstract: Transformers have shown great potential in various computer vision tasks owing to their strong capability in modeling long-range dependencies using the self-attention mechanism. Nevertheless, vision transformers treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) for modeling local visual structures and dealing with scale variance. Instead, they require large-scale training data and longer training schedules to learn the IB implicitly. In this paper, we propose a novel V…
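
To make the inductive-bias argument concrete, below is a minimal, hypothetical PyTorch sketch (module names and sizes are my own, not the paper's actual reduction/normal cells): the patch embedding flattens an image into the 1D token sequence the abstract describes, and a toy block pairs global self-attention with a parallel depth-wise convolution on the token grid to inject a local inductive bias explicitly.

```python
# Illustrative sketch only -- not the ViTAE architecture.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Flatten an image into a 1D sequence of visual tokens (as in plain ViT)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=192):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, N, dim): 1D token sequence

class LocalGlobalBlock(nn.Module):
    """Hypothetical block: global self-attention plus a parallel depth-wise
    convolution over the token grid, so local structure is modeled explicitly."""
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, hw):                  # x: (B, N, dim), hw: token-grid size
        h, w = hw
        y = self.norm(x)
        attn_out, _ = self.attn(y, y, y)       # long-range dependency via attention
        grid = y.transpose(1, 2).reshape(x.size(0), -1, h, w)
        conv_out = self.local(grid).flatten(2).transpose(1, 2)  # local inductive bias
        return x + attn_out + conv_out         # fuse global and local branches

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))   # (1, 196, 192)
out = LocalGlobalBlock()(tokens, (14, 14))            # (1, 196, 192)
```

ViTAE itself realizes this idea with purpose-built cells, so the sketch should be read only as an illustration of the design direction, not as the paper's method.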

Cited by 15 publications (17 citation statements)
References: 89 publications

Citation statements (ordered by relevance):

“…A preliminary version of this work was presented in [85]. This paper extends the previous study by introducing three major improvements.…”
Section: Comparison to the Conference Version (mentioning)
Confidence: 67%
“…Therefore, current research mainly focuses on applying explicit inductive bias to vision transformers [5,18,33,37] and reducing the computational complexity from quadratic to linear [9,13]. XCiT [9] is one study that focused on reducing the computational complexity.…”
Section: Related Work (mentioning)
Confidence: 99%
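
As a side note on the linear-complexity line of work mentioned above, the sketch below illustrates cross-covariance-style attention in the spirit of XCiT [9]: the attention map is computed over the d x d channel dimensions rather than the N x N token pairs, so the cost grows linearly with the number of tokens N. The normalization and temperature details follow my reading of the approach and may differ from the official implementation.

```python
# Hedged sketch of channel ("cross-covariance") attention; not the official XCiT code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossCovarianceAttention(nn.Module):
    def __init__(self, dim=192):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, x):                                # x: (B, N, d)
        q, k, v = self.qkv(x).chunk(3, dim=-1)           # each (B, N, d)
        # L2-normalize along the token dimension so each feature column has unit norm
        q = F.normalize(q, dim=1)
        k = F.normalize(k, dim=1)
        attn = (k.transpose(1, 2) @ q) / self.temperature  # (B, d, d) channel map
        attn = attn.softmax(dim=-1)
        out = v @ attn                                   # (B, N, d): cost O(N * d^2)
        return self.proj(out)

x = torch.randn(2, 196, 192)
print(CrossCovarianceAttention(192)(x).shape)            # torch.Size([2, 196, 192])
```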
“…Moreover, the non-linear embedding method (SNE) also improves the classification rates on the ImageNet dataset, compared to the baseline model which uses the linear embedding method.

Model | #Params | Type | Top-1 Acc. (%)
ViT-Ti [8] | 5.7M | T | 68.7
T2T-ViT-7 [37] | 4.3M | T | 71.7
DeiT-Ti [29] | 5.7M | T | 72.2
Mobile-Former-96M [3] | 4.6M | C+T | 72.8
LocalViT-T2T [18] | 4.3M | C+T | 72.5
PiT-Ti [11] | 4.9M | C+T | 73.0
ConViT-Ti [5] | 5.7M | C+T | 73.1
ConT-Ti [34] | 5.8M | C+T | 74.9
LocalViT-T [18] | 5.9M | C+T | 74.8
ViTAE-T [33] | 4.8M | C+T | 75.3
EfficientNet-B0 [28] | 5.3M | C | 76.3
CeiT-T [36] | 6.4M | C+T | 76.4
T2T-ViT-12 [37] | 6.9M | T | 76.5
ConT-S [34] | 10.1M | C+T | 76.5
CoaT-Lite Tiny [32] | 5.7M | C+T | 76.6
Swin-1G [3,19] | 7.3M | T | 77.3
XCiT-T12 (baseline) [9] | 6.7M | T | 77.…”
Section: ImageNet Classification (mentioning)
Confidence: 99%
“…A large body of research has been devoted to improving the efficiency and effectiveness of vision transformers (Dosovitskiy et al., 2020). Recent advances improve the original vision transformer from various perspectives, such as data-efficient training (Touvron et al., 2020), adopting pyramid architectures (Wang et al., 2021a; Liu et al., 2021a; Heo et al., 2021), incorporating convolutional modules into the transformer (Xu et al., 2021; d'Ascoli et al., 2021), and reducing computational costs by restricting the scope of the self-attention (Liu et al., 2021a; Dong et al., 2021).…”
Section: Related Work (mentioning)
Confidence: 99%