2021
DOI: 10.48550/arxiv.2107.02174
Preprint

What Makes for Hierarchical Vision Transformer?

Abstract: Recent studies show that a hierarchical Vision Transformer with interleaved non-overlapping intra-window self-attention & shifted-window self-attention is able to achieve state-of-the-art performance in various visual recognition tasks and challenges the CNN's dense sliding-window paradigm. Most follow-up works try to replace the shifted-window operation with other kinds of cross-window communication while treating self-attention as the de facto standard for intra-window information aggregation. In this short preprint, we…
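The mechanism the abstract refers to alternates plain and shifted non-overlapping window attention between successive blocks. Below is a minimal PyTorch sketch of that interleaving; it is not the authors' code, the names (WindowAttentionBlock, window_partition, window_size) are illustrative assumptions, and the boundary attention mask used by Swin-style shifted windows is omitted for brevity.

```python
# Hedged sketch: interleaved non-overlapping window self-attention and
# shifted-window self-attention, as described in the abstract.
import torch
import torch.nn as nn

def window_partition(x, ws):
    # (B, H, W, C) -> (num_windows * B, ws * ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    # inverse of window_partition: (num_windows * B, ws * ws, C) -> (B, H, W, C)
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class WindowAttentionBlock(nn.Module):
    """One block of intra-window multi-head self-attention.

    When `shift` is True, the feature map is cyclically rolled by half a
    window before partitioning, so stacking blocks with shift=False and
    shift=True gives the interleaving of plain and shifted windows.
    (The cross-boundary attention mask is omitted in this sketch.)
    """
    def __init__(self, dim, num_heads, window_size=7, shift=False):
        super().__init__()
        self.ws, self.shift = window_size, shift
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C), H and W divisible by window_size
        B, H, W, C = x.shape
        shortcut, s = x, self.ws // 2
        x = self.norm(x)
        if self.shift:  # cyclic shift by half a window
            x = torch.roll(x, shifts=(-s, -s), dims=(1, 2))
        win = window_partition(x, self.ws)      # attention stays inside each window
        win, _ = self.attn(win, win, win)
        x = window_reverse(win, self.ws, H, W)
        if self.shift:  # undo the cyclic shift
            x = torch.roll(x, shifts=(s, s), dims=(1, 2))
        return shortcut + x

# Example: alternate a plain-window block with a shifted-window block.
blocks = nn.Sequential(
    WindowAttentionBlock(96, 3, window_size=7, shift=False),
    WindowAttentionBlock(96, 3, window_size=7, shift=True),
)
out = blocks(torch.randn(2, 56, 56, 96))  # (B, H, W, C)
```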

Cited by 1 publication (2 citation statements)
References 12 publications

“…LSA only achieves similar top-1 accuracy as DwConv, which is lower than DDF, but it requires more floating-point operations (FLOPs). This phenomenon has also been observed in recent papers [17,25,32,83], but they lack detailed analysis of the reason behind such performances, and it motivates us to raise a question: what makes local self-attention mediocre?…”
Section: Introduction (mentioning)
Confidence: 82%

“…Swin Transformer [48], as a milestone, also leverages local self-attention (LSA) to embed detailed information in high-resolution finer-level features. Despite these successes, several studies [17,25,32,83] observe that the performance of LSA is just on par with convolution in both upstream and downstream tasks [32]. The reasons behind this phenomenon are not clear, and in-depth comparisons under the same conditions are valuable.…”
Section: Related Work (mentioning)
Confidence: 99%
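Both citation statements contrast local self-attention (LSA) with depthwise convolution (DwConv). To make that comparison concrete, here is a minimal, hypothetical sketch of the two operators, not code from the cited papers: the depthwise convolution aggregates each k x k neighbourhood with fixed per-channel weights, while windowed self-attention computes input-dependent weights from queries and keys, which is where its extra FLOPs come from. The class names and the single-head simplification are assumptions made for illustration.

```python
# Hedged illustration of the two local operators compared above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseConv(nn.Module):
    """Depthwise convolution: one static k x k filter per channel."""
    def __init__(self, dim, k=7):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)

    def forward(self, x):  # (B, C, H, W)
        return self.conv(x)

class LocalSelfAttention(nn.Module):
    """Non-overlapping window self-attention (single head, minimal form)."""
    def __init__(self, dim, window=7):
        super().__init__()
        self.w = window
        self.qkv = nn.Linear(dim, 3 * dim)

    def forward(self, x):  # (B, C, H, W), H and W divisible by `window`
        B, C, H, W = x.shape
        w = self.w
        # split the map into (H/w) * (W/w) windows of w*w tokens each
        t = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(-1, w * w, C)
        q, k, v = self.qkv(t).chunk(3, dim=-1)
        # attention weights are computed from the input itself
        attn = F.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
        t = (attn @ v).view(B, H // w, W // w, w, w, C)
        return t.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

x = torch.randn(2, 96, 56, 56)
y_conv = DepthwiseConv(96)(x)          # static per-channel filtering
y_lsa = LocalSelfAttention(96)(x)      # input-dependent windowed aggregation
```

In this sketch, per output position DwConv spends roughly k^2 multiply-adds per channel, while LSA spends about 3C for the qkv projection plus 2 w^2 per channel for forming and applying the attention map, which illustrates why the quoted statements find LSA more expensive in FLOPs at similar accuracy.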