2022
DOI: 10.48550/arxiv.2205.14756
Preprint

EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition

Abstract: Vision Transformer (ViT) has achieved remarkable performance in many vision tasks. However, ViT is inferior to convolutional neural networks (CNNs) when targeting high-resolution mobile vision applications. The key computational bottleneck of ViT is the softmax attention module, which has quadratic computational complexity with respect to the input resolution. It is essential to reduce the cost of ViT to deploy it on edge devices. Existing methods (e.g., Swin, PVT) restrict the softmax attention within local windows or re…
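
To make the quadratic-complexity claim concrete, here is a rough back-of-the-envelope sketch (not from the paper; the patch size and embedding dimension are illustrative assumptions) of how the softmax attention cost grows with input resolution:

```python
# A rough sketch of why softmax attention dominates at high resolution: the token
# count N grows quadratically with the image side, and softmax attention costs on
# the order of N^2 * D, so doubling the input resolution multiplies the attention
# cost by roughly 16x. Patch size and embedding dimension are assumed values.

def softmax_attention_flops(image_side: int, patch: int = 16, dim: int = 192) -> int:
    n = (image_side // patch) ** 2      # number of tokens
    return 2 * n * n * dim              # Q K^T plus attention-weighted V, roughly

for side in (224, 448, 896):
    print(side, f"{softmax_attention_flops(side):.3e}")
```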

Cited by 10 publications (14 citation statements)
References 34 publications
“…Despite the lower computational complexity, linear attention always presents a clear performance drop compared to Softmax [8,74,64]. We also observe this drop when applying linear attention to video Transformers, as shown in Tables 1 and 2.…”
Section: Achilles Heel of Linear Attention (mentioning)
confidence: 62%
“…Since we always have N ≫ D in practice, O(ND²) is approximately equal to O(N), i.e., growing linearly with the sequence length. As discussed in [64,8], a simple ReLU [1] is a good candidate for the kernel function, which satisfies the requirement of being non-negative and easily decomposable. The attended output is formulated with row-wise normalization, i.e.,…”
Section: Linear Attention (mentioning)
confidence: 99%
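
For context on the kernel formulation this statement refers to, below is a minimal sketch of ReLU-kernel linear attention with row-wise normalization. It is my own illustration under assumed tensor shapes, not the cited implementation; the epsilon guard is an added assumption for numerical stability.

```python
import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear attention sketch: q, k of shape (B, N, D); v of shape (B, N, Dv).

    Cost is O(N * D * Dv) instead of the O(N^2 * D) of softmax attention.
    """
    q = torch.relu(q)                                # non-negative feature map phi(Q)
    k = torch.relu(k)                                # non-negative feature map phi(K)
    kv = torch.einsum("bnd,bne->bde", k, v)          # sum_n phi(k_n) v_n^T -> (B, D, Dv)
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1))  # row-wise normalizer phi(q_n) . sum_m phi(k_m)
    out = torch.einsum("bnd,bde->bne", q, kv) / (z.unsqueeze(-1) + eps)
    return out
```

Because the keys and values are aggregated once into a fixed-size D × Dv summary, each query only touches that summary rather than all N keys, which is where the linear scaling in sequence length comes from.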
“…Boosting ViT efficiency has therefore been a very vibrant area. One stream of approaches is rooted in the efficient deep learning literature and cuts down on network complexity by leveraging popular methods such as efficient attention [3,41,4], network compression [7,8,34,67], dynamic inference [69,48], operator adaptation [43], token merging and manipulation [42,66], etc. These methods can yield off-the-shelf speedups on target ViT backbones, but are also limited by the original backbone's accuracy and capacity.…”
Section: Related Work (mentioning)
confidence: 99%
“…We also avoid squeeze-and-excitation operators and minimize Layer Normalization for the higher-resolution stages (i.e., 1 and 2), as these layers tend to be memory-bound. Later stages (i.e., 3 and 4) in the architecture tend to be math-limited, as GPU hardware spends more time on compute compared to the memory transfer cost. As a result, applying multi-head attention will not be a bottleneck.…”
Section: Architecture (mentioning)
confidence: 99%
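
To illustrate the stage-wise reasoning in this statement, here is a hypothetical configuration sketch; the stage resolutions, block types, and flags are assumptions for illustration, not the cited paper's actual settings.

```python
# Hypothetical stage-wise layout: early, high-resolution (memory-bound) stages avoid
# SE and LayerNorm and stick to cheap conv blocks; later, math-limited stages use
# multi-head attention, where compute rather than memory traffic dominates.
STAGES = [
    # (stage, feature-map side, block type,        use_se, use_layernorm)
    (1,       56,               "conv",            False,  False),
    (2,       28,               "conv",            False,  False),
    (3,       14,               "multi_head_attn", True,   True),
    (4,        7,               "multi_head_attn", True,   True),
]
```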