2022
DOI: 10.1609/aaai.v36i2.20094
Can Vision Transformers Learn without Natural Images?

Abstract: Is it possible to complete Vision Transformer (ViT) pre-training without natural images and human-annotated labels? This question has become increasingly relevant because current ViT pre-training relies heavily on large collections of natural images and human-annotated labels, and this reliance has raised problems of privacy violation, inadequate fairness protection, and labor-intensive annotation. In this paper, we experimentally verify that…

Cited by 12 publications (3 citation statements) · References 27 publications
“…Instead, the model learns from images automatically generated using fractal geometry, computer graphics, and other methods. Existing studies [15,16,21,23,24] have shown that such models can effectively learn representations from fractal images, Bézier curves [25], and Perlin noise [26], improving the interpretability of the features while performing almost as well as models pre-trained on ImageNet.…”
Section: Formula-Driven Supervised Learning
confidence: 99%
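The citation statement above describes formula-driven supervised learning (FDSL), in which training images are rendered from mathematical formulas such as iterated function systems (IFS) rather than collected from the real world. As a minimal illustrative sketch (not the paper's actual FractalDB pipeline, which samples affine maps randomly and weights them by probabilities), the following generates fractal point sets with the chaos-game algorithm over a fixed IFS; the Barnsley fern parameters are used here only as a well-known example:

```python
import random

def generate_ifs_points(transforms, n_points=10000, seed=0):
    """Generate 2D points from an iterated function system (IFS) via the
    chaos game: repeatedly apply a randomly chosen affine map
    (x, y) -> (a*x + b*y + e, c*x + d*y + f) to the current point.
    Note: maps are chosen uniformly here for simplicity; FractalDB-style
    generators typically weight maps by their determinants."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    points = []
    for i in range(n_points + 100):
        a, b, c, d, e, f = rng.choice(transforms)
        x, y = a * x + b * y + e, c * x + d * y + f
        if i >= 100:  # discard burn-in iterations before convergence to the attractor
            points.append((x, y))
    return points

# Barnsley fern affine maps (a, b, c, d, e, f) -- a classic IFS example
FERN = [
    (0.0, 0.0, 0.0, 0.16, 0.0, 0.0),
    (0.85, 0.04, -0.04, 0.85, 0.0, 1.6),
    (0.2, -0.26, 0.23, 0.22, 0.0, 1.6),
    (-0.15, 0.28, 0.26, 0.24, 0.0, 0.44),
]

pts = generate_ifs_points(FERN, n_points=5000)
```

Rasterizing such point sets into images, with category labels defined by the IFS parameters themselves, yields a labeled dataset with no natural images and no human annotation.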
“…This finding shows that fractal geometry plays an important role in dataset construction using FDSL. Nakashima et al. [48] confirmed that the FDSL framework is effective for pre-training Vision Transformers (ViTs). More interestingly, they suggest that FDSL is more likely to benefit ViTs than CNNs.…”
Section: Introduction
confidence: 98%
“…* indicates that 5,000 epochs were used during fine-tuning. † refers to a result quoted from [48]. Underlined bold and bold scores indicate the best and second-best values, respectively.…”
confidence: 99%