2022
DOI: 10.48550/arxiv.2205.14756
Preprint

EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition

Abstract: Vision Transformer (ViT) has achieved remarkable performance in many vision tasks. However, ViT is inferior to convolutional neural networks (CNNs) when targeting high-resolution mobile vision applications. The key computational bottleneck of ViT is the softmax attention module, which has quadratic computational complexity with respect to the input resolution. It is essential to reduce the cost of ViT to deploy it on edge devices. Existing methods (e.g., Swin, PVT) restrict the softmax attention within local windows or re…
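
To make the quadratic-complexity claim concrete, here is a rough back-of-the-envelope sketch (not from the paper; the patch size and embedding dimension are illustrative assumptions) of how the softmax attention cost grows with input resolution:

```python
# A rough sketch of why softmax attention dominates at high resolution: the token
# count N grows quadratically with the image side, and softmax attention costs on
# the order of N^2 * D, so doubling the input resolution multiplies the attention
# cost by roughly 16x. Patch size and embedding dimension are assumed values.

def softmax_attention_flops(image_side: int, patch: int = 16, dim: int = 192) -> int:
    n = (image_side // patch) ** 2      # number of tokens
    return 2 * n * n * dim              # Q K^T plus attention-weighted V, roughly

for side in (224, 448, 896):
    print(side, f"{softmax_attention_flops(side):.3e}")
```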

Cited by 10 publications (14 citation statements)
References 34 publications
“…Despite the lower computational complexity, linear attention always presents a clear performance drop compared to Softmax [8,74,64]. We also observe this drop when applying linear attention to video Transformers, as shown in Tables 1 and 2.…”
Section: Achilles Heel of Linear Attention (mentioning)
confidence: 62%
“…Since we always have N ≫ D in practice, O(ND²) is approximately equal to O(N), i.e., growing linearly with the sequence length. As discussed in [64,8], a simple ReLU [1] is a good candidate for the kernel function, which satisfies the requirement of being non-negative and easily decomposable. The attended output is formulated with row-wise normalization, i.e.,…”
Section: Linear Attention (mentioning)
confidence: 99%
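
For context on the kernel formulation this statement refers to, below is a minimal sketch of ReLU-kernel linear attention with row-wise normalization. It is my own illustration under assumed tensor shapes, not the cited implementation; the epsilon guard is an added assumption for numerical stability.

```python
import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear attention sketch: q, k of shape (B, N, D); v of shape (B, N, Dv).

    Cost is O(N * D * Dv) instead of the O(N^2 * D) of softmax attention.
    """
    q = torch.relu(q)                                # non-negative feature map phi(Q)
    k = torch.relu(k)                                # non-negative feature map phi(K)
    kv = torch.einsum("bnd,bne->bde", k, v)          # sum_n phi(k_n) v_n^T -> (B, D, Dv)
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1))  # row-wise normalizer phi(q_n) . sum_m phi(k_m)
    out = torch.einsum("bnd,bde->bne", q, kv) / (z.unsqueeze(-1) + eps)
    return out
```

Because the keys and values are aggregated once into a fixed-size D × Dv summary, each query only touches that summary rather than all N keys, which is where the linear scaling in sequence length comes from.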
“…Boosting ViT efficiency has therefore been a very vibrant area. One stream of approaches is rooted in the efficient deep learning literature and cuts down on network complexity by leveraging popular methods such as efficient attention [3,41,4], network compression [7,8,34,67], dynamic inference [69,48], operator adaptation [43], token merging and manipulation [42,66], etc. These methods can yield off-the-shelf speedups on target ViT backbones, but are also limited by the original backbone's accuracy and capacity.…”
Section: Related Work (mentioning)
confidence: 99%
“…We also avoid squeeze-and-excitation operators and minimize Layer Normalization for the higher-resolution stages (i.e., 1 and 2), as these layers tend to be memory-bound. Later stages (i.e., 3 and 4) in the architecture tend to be math-limited, as GPU hardware spends more time on compute compared to the memory transfer cost. As a result, applying multi-head attention will not be a bottleneck.…”
Section: Architecture (mentioning)
confidence: 99%
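
To illustrate the stage-wise reasoning in this statement, here is a hypothetical configuration sketch; the stage resolutions, block types, and flags are assumptions for illustration, not the cited paper's actual settings.

```python
# Hypothetical stage-wise layout: early, high-resolution (memory-bound) stages avoid
# SE and LayerNorm and stick to cheap conv blocks; later, math-limited stages use
# multi-head attention, where compute rather than memory traffic dominates.
STAGES = [
    # (stage, feature-map side, block type,        use_se, use_layernorm)
    (1,       56,               "conv",            False,  False),
    (2,       28,               "conv",            False,  False),
    (3,       14,               "multi_head_attn", True,   True),
    (4,        7,               "multi_head_attn", True,   True),
]
```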