2021
DOI: 10.48550/arxiv.2108.08810
Preprint
Do Vision Transformers See Like Convolutional Neural Networks?

Abstract: Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, …
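The analysis the abstract alludes to compares internal representations across layers of ViTs and CNNs; comparisons of this kind are typically measured with linear centered kernel alignment (CKA). The snippet below is a minimal sketch under that assumption; the function name, activation arrays, and shapes are illustrative placeholders, not the paper's code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two sets of layer activations X and Y,
    each of shape (n_examples, n_features). Higher means the two
    layers encode more similar representations of the same inputs."""
    # Center each feature dimension across examples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Similarity of the example-by-example Gram structure.
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return numerator / denominator

# Hypothetical usage: compare a ViT block's token features with a CNN
# stage's flattened feature map over the same batch of images.
vit_block_act = np.random.randn(512, 768)   # placeholder activations
cnn_stage_act = np.random.randn(512, 1024)  # placeholder activations
print(linear_cka(vit_block_act, cnn_stage_act))
```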

Cited by 42 publications (52 citation statements)
References 38 publications (58 reference statements)
“…Ablation Study The class attention A can be obtained from any Transformer block in ViTs. Due to the global receptive field, the class attention does not differ much across blocks [36,14]. We first study the effect of the attention matrix generated at different depths d for DeiT-S. Then we follow [1,5] to compute the attention rollout, which aggregates the attention matrices from all blocks by matrix multiplication.…”
Section: B Additional Results
confidence: 99%
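The excerpt above refers to attention rollout, which aggregates per-block attention matrices by matrix multiplication. Below is a minimal sketch of that aggregation as it is commonly described (average heads, add the identity for the residual connection, re-normalize rows); the tensor shapes and random example inputs are illustrative, not taken from the cited work.

```python
import numpy as np

def attention_rollout(attn_per_block):
    """Aggregate per-block attention maps into one map by matrix
    multiplication. attn_per_block: list of arrays, each of shape
    (heads, tokens, tokens), ordered from the first block onward."""
    tokens = attn_per_block[0].shape[-1]
    rollout = np.eye(tokens)
    for attn in attn_per_block:
        a = attn.mean(axis=0)                 # average over heads
        a = a + np.eye(tokens)                # account for residual connection
        a = a / a.sum(axis=-1, keepdims=True) # keep rows stochastic
        rollout = a @ rollout                 # compose with earlier blocks
    return rollout

# Hypothetical usage: 12 blocks, 6 heads, 197 tokens (CLS + 14x14 patches).
blocks = [np.random.dirichlet(np.ones(197), size=(6, 197)) for _ in range(12)]
cls_to_patches = attention_rollout(blocks)[0, 1:]  # CLS-token attention map
```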
“…We were surprised to find that both convolution- and transformer-based backbone networks attain similar performance. We conjecture that, although it has been widely studied that convolutions and transformers see differently [47], the representations learned by the models end up being much alike because they are pretrained on the same dataset [7]. Note that we only utilized backbones with a pyramidal structure, and the results may differ if other backbone networks are used; we leave this exploration for future work.…”
Section: Ablation Study
confidence: 98%
“…What about transformer-based backbone networks? As addressed in many works [9,47], CNNs and transformers see images differently, which means the choice of backbone network may affect performance significantly, but this has not been explored for this task. We thus employ several well-known vision transformer architectures to explore the potential differences.…”
Section: Ablation Study
confidence: 99%
“…The Vision Transformer (ViT) architecture was first proposed in (Dosovitskiy et al., 2020), which uses the attention mechanism (Vaswani et al., 2017) to solve various vision tasks. Compared to traditional CNN structures that operate on a fixed-size window with restricted spatial interactions (Raghu et al., 2021), ViT allows all positions in an image to interact through transformer blocks. Since then, many variants have been proposed (Graham et al., 2021; Liu et al., 2021c; Yuan et al., 2021a; Wang et al., 2021b; Han et al., 2021; Wu et al., 2021; Chen et al., 2021b; Steiner et al., 2021; El-Nouby et al., 2021; Liu et al., 2021a; Wang et al., 2021a; Bao et al., 2021).…”
Section: Vision Transformers
confidence: 99%
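To make the contrast in the excerpt concrete, here is a minimal, self-contained sketch of a ViT-style block in PyTorch: a convolutional patch embedding followed by global self-attention, in which every patch token can attend to every other token, unlike a convolution's fixed local window. All sizes, names, and the single-block structure are illustrative simplifications, not the architecture of any specific cited model.

```python
import torch
import torch.nn as nn

class MinimalViTBlock(nn.Module):
    """Patch embedding plus one transformer encoder block, showing the
    global token-to-token interaction that distinguishes ViTs from the
    local windows of a convolution. Sizes are illustrative."""
    def __init__(self, img_size=224, patch=16, dim=384, heads=6):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Non-overlapping patches projected to dim-dimensional tokens.
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, x):                          # x: (batch, 3, H, W)
        t = self.to_patches(x).flatten(2).transpose(1, 2) + self.pos
        n = self.norm1(t)
        a, _ = self.attn(n, n, n)                  # every token attends to all
        t = t + a                                  # residual connection
        return t + self.mlp(self.norm2(t))         # per-token MLP

out = MinimalViTBlock()(torch.randn(2, 3, 224, 224))  # (2, 196, 384)
```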