2021
DOI: 10.48550/arxiv.2107.02174
Preprint

What Makes for Hierarchical Vision Transformer?

Abstract: Recent studies show that a hierarchical Vision Transformer with interleaved non-overlapping intra-window self-attention & shifted-window self-attention is able to achieve state-of-the-art performance in various visual recognition tasks and challenges the CNN's dense sliding-window paradigm. Most follow-up works try to replace the shifted-window operation with other kinds of cross-window communication while treating self-attention as the de facto standard for intra-window information aggregation. In this short preprint, we…
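The mechanism the abstract refers to alternates plain and shifted non-overlapping window attention between successive blocks. Below is a minimal PyTorch sketch of that interleaving; it is not the authors' code, the names (WindowAttentionBlock, window_partition, window_size) are illustrative assumptions, and the boundary attention mask used by Swin-style shifted windows is omitted for brevity.

```python
# Hedged sketch: interleaved non-overlapping window self-attention and
# shifted-window self-attention, as described in the abstract.
import torch
import torch.nn as nn

def window_partition(x, ws):
    # (B, H, W, C) -> (num_windows * B, ws * ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    # inverse of window_partition: (num_windows * B, ws * ws, C) -> (B, H, W, C)
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class WindowAttentionBlock(nn.Module):
    """One block of intra-window multi-head self-attention.

    When `shift` is True, the feature map is cyclically rolled by half a
    window before partitioning, so stacking blocks with shift=False and
    shift=True gives the interleaving of plain and shifted windows.
    (The cross-boundary attention mask is omitted in this sketch.)
    """
    def __init__(self, dim, num_heads, window_size=7, shift=False):
        super().__init__()
        self.ws, self.shift = window_size, shift
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C), H and W divisible by window_size
        B, H, W, C = x.shape
        shortcut, s = x, self.ws // 2
        x = self.norm(x)
        if self.shift:  # cyclic shift by half a window
            x = torch.roll(x, shifts=(-s, -s), dims=(1, 2))
        win = window_partition(x, self.ws)      # attention stays inside each window
        win, _ = self.attn(win, win, win)
        x = window_reverse(win, self.ws, H, W)
        if self.shift:  # undo the cyclic shift
            x = torch.roll(x, shifts=(s, s), dims=(1, 2))
        return shortcut + x

# Example: alternate a plain-window block with a shifted-window block.
blocks = nn.Sequential(
    WindowAttentionBlock(96, 3, window_size=7, shift=False),
    WindowAttentionBlock(96, 3, window_size=7, shift=True),
)
out = blocks(torch.randn(2, 56, 56, 96))  # (B, H, W, C)
```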

Cited by 1 publication (2 citation statements)
References 12 publications

“…LSA only achieves similar top-1 accuracy as DwConv, which is lower than DDF, but it requires more floating-point operations (FLOPs). This phenomenon has also been observed in recent papers [17,25,32,83], but they lack detailed analysis of the reason behind such performances, and it motivates us to raise a question: what makes local self-attention mediocre?…”
Section: Introduction (mentioning)
Confidence: 82%

“…Swin Transformer [48], as a milestone, also leverages local self-attention (LSA) to embed detailed information in high-resolution finer-level features. Despite these successes, several studies [17,25,32,83] observe that the performance of LSA is just on par with convolution in both upstream and downstream tasks [32]. The reasons behind this phenomenon are not clear, and in-depth comparisons under the same conditions are valuable.…”
Section: Related Work (mentioning)
Confidence: 99%
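Both citation statements contrast local self-attention (LSA) with depthwise convolution (DwConv). To make that comparison concrete, here is a minimal, hypothetical sketch of the two operators, not code from the cited papers: the depthwise convolution aggregates each k x k neighbourhood with fixed per-channel weights, while windowed self-attention computes input-dependent weights from queries and keys, which is where its extra FLOPs come from. The class names and the single-head simplification are assumptions made for illustration.

```python
# Hedged illustration of the two local operators compared above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseConv(nn.Module):
    """Depthwise convolution: one static k x k filter per channel."""
    def __init__(self, dim, k=7):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)

    def forward(self, x):  # (B, C, H, W)
        return self.conv(x)

class LocalSelfAttention(nn.Module):
    """Non-overlapping window self-attention (single head, minimal form)."""
    def __init__(self, dim, window=7):
        super().__init__()
        self.w = window
        self.qkv = nn.Linear(dim, 3 * dim)

    def forward(self, x):  # (B, C, H, W), H and W divisible by `window`
        B, C, H, W = x.shape
        w = self.w
        # split the map into (H/w) * (W/w) windows of w*w tokens each
        t = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(-1, w * w, C)
        q, k, v = self.qkv(t).chunk(3, dim=-1)
        # attention weights are computed from the input itself
        attn = F.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
        t = (attn @ v).view(B, H // w, W // w, w, w, C)
        return t.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

x = torch.randn(2, 96, 56, 56)
y_conv = DepthwiseConv(96)(x)          # static per-channel filtering
y_lsa = LocalSelfAttention(96)(x)      # input-dependent windowed aggregation
```

In this sketch, per output position DwConv spends roughly k^2 multiply-adds per channel, while LSA spends about 3C for the qkv projection plus 2 w^2 per channel for forming and applying the attention map, which illustrates why the quoted statements find LSA more expensive in FLOPs at similar accuracy.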