2021
DOI: 10.48550/arxiv.2110.15343
Preprint
Scatterbrain: Unifying Sparse and Low-rank Attention Approximation

Abstract: Recent advances in efficient Transformers have exploited either the sparsity or low-rank properties of attention matrices to reduce the computational and memory bottlenecks of modeling long sequences. However, it is still challenging to balance the trade-off between model quality and efficiency to perform a one-size-fits-all approximation for different tasks. To better understand this trade-off, we observe that sparse and low-rank approximations excel in different regimes, determined by the softmax temperature…
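As a rough illustration of the sparse-plus-low-rank idea the abstract describes, the sketch below combines a Performer-style random-feature (low-rank) estimate of softmax attention with an exact correction on a few selected query-key pairs. This is a minimal NumPy sketch, not the authors' implementation: the function names (`sparse_plus_low_rank_attention`, `softmax_kernel_features`) and the top-k pair selection are assumptions made here for illustration; Scatterbrain itself selects the sparse pattern with locality-sensitive hashing and does not form the full score matrix.

```python
import numpy as np

def softmax_kernel_features(x, projection):
    """Positive random features so that phi(q) @ phi(k) approximates exp(q . k)
    in expectation (Performer-style estimator)."""
    proj = x @ projection                                   # (n, m)
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)  # (n, 1)
    return np.exp(proj - sq_norm) / np.sqrt(projection.shape[1])

def sparse_plus_low_rank_attention(q, k, v, num_features=64, topk=8, seed=0):
    """Minimal sketch of sparse + low-rank attention approximation.

    The low-rank part uses random features for the softmax kernel; the sparse
    part replaces the approximate scores with exact ones on a small set of
    (query, key) pairs. Pair selection here is a simple top-k over exact
    scores purely for illustration (an O(n^2) step); a practical method would
    locate the pairs with LSH-style hashing instead.
    """
    n, d = q.shape
    rng = np.random.default_rng(seed)
    scale = d ** -0.25                        # split the usual 1/sqrt(d) between q and k
    qs, ks = q * scale, k * scale
    omega = rng.standard_normal((d, num_features))

    q_f = softmax_kernel_features(qs, omega)  # (n, m)
    k_f = softmax_kernel_features(ks, omega)  # (n, m)

    # Low-rank part in O(n * m): unnormalized numerator and softmax normalizer.
    num = q_f @ (k_f.T @ v)                   # (n, d)
    den = q_f @ k_f.sum(axis=0)               # (n,)

    # Sparse correction: on the selected pairs, swap the low-rank estimate
    # for the exact exp(score) so those entries incur no approximation error.
    scores = qs @ ks.T                        # illustration only, see docstring
    idx = np.argpartition(-scores, topk, axis=1)[:, :topk]
    for i in range(n):
        j = idx[i]
        delta = np.exp(scores[i, j]) - q_f[i] @ k_f[j].T   # exact minus approximate
        num[i] += delta @ v[j]
        den[i] += delta.sum()

    return num / den[:, None]

# Usage: approximate attention output for a toy sequence.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    q, k, v = (rng.standard_normal((128, 32)) for _ in range(3))
    out = sparse_plus_low_rank_attention(q, k, v)
    print(out.shape)  # (128, 32)
```

Adding the exact-minus-approximate difference only on the selected entries keeps the cheap low-rank estimate everywhere else, so the extra cost scales with the number of corrected pairs rather than with the square of the sequence length.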

Cited by 6 publications (5 citation statements)
References 48 publications (85 reference statements)
“…Transformer Acceleration Various methods have been explored for reducing Transformers' high computational cost, including designing alternative lightweight attention formulations [11,28,31,46,50,54,68], removing unnecessary network modules [17,40,53], approximating attention multiplications with low-rank decompositions [6,12,55], distilling knowledge into a more efficient student network [48,51,69], and extending network quantization techniques to Transformers [1,18,30,49,67]. Furthermore, acceleration techniques specific to ViTs have been proposed [19,34,41,44,47,61,63] that exploit the redundancy in the input patches to drop tokens early and save computation.…”
Section: Related Work
confidence: 99%
“…Another way to reduce the memory requirements and computational complexity is low-rank approximation [38]. In [39], a method called Scatterbrain is proposed that exploits both sparsity and low-rank approximation. Scatterbrain is shown to outperform methods that employ only sparsity or only low-rank approximation, illustrating that the two can be exploited synergistically.…”
Section: Related Work
confidence: 99%
“…This combination significantly hinders deployment on devices with constrained computational and memory resources, particularly in real-time applications such as autonomous driving [13] and virtual reality [14], where meeting low-latency requirements and delivering a high-quality user experience are crucial. This underscores the pressing need for advances in model compression techniques such as pruning [15], quantization [16], knowledge distillation [17], and low-rank factorization [18]. Moreover, the rapid adoption of ViTs can be attributed not only to algorithmic innovations and data availability but also to improvements in processor performance.…”
Section: Introduction
confidence: 99%