2021
DOI: 10.48550/arxiv.2110.15343
Preprint
Scatterbrain: Unifying Sparse and Low-rank Attention Approximation

Abstract: Recent advances in efficient Transformers have exploited either the sparsity or low-rank properties of attention matrices to reduce the computational and memory bottlenecks of modeling long sequences. However, it is still challenging to balance the trade-off between model quality and efficiency to perform a one-size-fits-all approximation for different tasks. To better understand this trade-off, we observe that sparse and low-rank approximations excel in different regimes, determined by the softmax temperature…
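As a rough illustration of the sparse-plus-low-rank idea the abstract describes, the sketch below combines a Performer-style random-feature (low-rank) estimate of softmax attention with an exact correction on a few selected query-key pairs. This is a minimal NumPy sketch, not the authors' implementation: the function names (`sparse_plus_low_rank_attention`, `softmax_kernel_features`) and the top-k pair selection are assumptions made here for illustration; Scatterbrain itself selects the sparse pattern with locality-sensitive hashing and does not form the full score matrix.

```python
import numpy as np

def softmax_kernel_features(x, projection):
    """Positive random features so that phi(q) @ phi(k) approximates exp(q . k)
    in expectation (Performer-style estimator)."""
    proj = x @ projection                                   # (n, m)
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)  # (n, 1)
    return np.exp(proj - sq_norm) / np.sqrt(projection.shape[1])

def sparse_plus_low_rank_attention(q, k, v, num_features=64, topk=8, seed=0):
    """Minimal sketch of sparse + low-rank attention approximation.

    The low-rank part uses random features for the softmax kernel; the sparse
    part replaces the approximate scores with exact ones on a small set of
    (query, key) pairs. Pair selection here is a simple top-k over exact
    scores purely for illustration (an O(n^2) step); a practical method would
    locate the pairs with LSH-style hashing instead.
    """
    n, d = q.shape
    rng = np.random.default_rng(seed)
    scale = d ** -0.25                        # split the usual 1/sqrt(d) between q and k
    qs, ks = q * scale, k * scale
    omega = rng.standard_normal((d, num_features))

    q_f = softmax_kernel_features(qs, omega)  # (n, m)
    k_f = softmax_kernel_features(ks, omega)  # (n, m)

    # Low-rank part in O(n * m): unnormalized numerator and softmax normalizer.
    num = q_f @ (k_f.T @ v)                   # (n, d)
    den = q_f @ k_f.sum(axis=0)               # (n,)

    # Sparse correction: on the selected pairs, swap the low-rank estimate
    # for the exact exp(score) so those entries incur no approximation error.
    scores = qs @ ks.T                        # illustration only, see docstring
    idx = np.argpartition(-scores, topk, axis=1)[:, :topk]
    for i in range(n):
        j = idx[i]
        delta = np.exp(scores[i, j]) - q_f[i] @ k_f[j].T   # exact minus approximate
        num[i] += delta @ v[j]
        den[i] += delta.sum()

    return num / den[:, None]

# Usage: approximate attention output for a toy sequence.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    q, k, v = (rng.standard_normal((128, 32)) for _ in range(3))
    out = sparse_plus_low_rank_attention(q, k, v)
    print(out.shape)  # (128, 32)
```

Adding the exact-minus-approximate difference only on the selected entries keeps the cheap low-rank estimate everywhere else, so the extra cost scales with the number of corrected pairs rather than with the square of the sequence length.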

Cited by 6 publications (5 citation statements)
References 48 publications (85 reference statements)
“…Transformer Acceleration Various methods have been explored for reducing Transformers' high computational cost, including designing alternative lightweight attention formulations [11,28,31,46,50,54,68], removing unnecessary network modules [17,40,53], approximating attention multiplications with low-rank decompositions [6,12,55], distilling knowledge into a more efficient student network [48,51,69], and extending network quantization techniques to Transformers [1,18,30,49,67]. Furthermore, acceleration techniques specific to ViTs have been proposed [19,34,41,44,47,61,63] that exploit the redundancy in the input patches to drop tokens early and save computation.…”
Section: Related Work
confidence: 99%
“…Another way to reduce the memory requirements and computational complexity is low-rank approximation [38]. In [39], a method called Scatterbrain is proposed that exploits both sparsity and low-rank approximation. Scatterbrain is shown to outperform methods that employ only sparsity or only low-rank approximation, illustrating that the two can be exploited synergistically.…”
Section: Related Work
confidence: 99%
“…This combination significantly hinders deployment on devices with constrained computational and memory resources, particularly in real-time applications such as autonomous driving [13] and virtual reality [14], where meeting low-latency requirements and delivering a high-quality user experience are crucial. This underscores the pressing need for advances in model compression techniques such as pruning [15], quantization [16], knowledge distillation [17], and low-rank factorization [18]. Moreover, the rapid adoption of ViTs can be attributed not only to algorithmic innovations and data availability but also to improvements in processor performance.…”
Section: Introduction
confidence: 99%