2021
DOI: 10.48550/arxiv.2107.00910
Preprint

Learned Token Pruning for Transformers

Cited by 9 publications (10 citation statements)
References 0 publications
“…Nonetheless, this approach requires dynamically sorting tokens and heads by their importance to select the top-k candidates, which needs specialized hardware. Similar to our work, the recently published [16] also adopts a threshold-based pruning approach, which removes unimportant tokens as the input passes through the Transformer layers. However, this method requires a three-step training procedure to obtain a per-layer learned threshold, which again prevents the technique from being easily deployed across a wide range of pre-trained networks.…”
Section: Related Work
Mentioning confidence: 92%
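The statement above describes threshold-based token pruning: tokens whose importance score falls below a per-layer threshold are dropped as the input flows through the Transformer layers. The following is a minimal NumPy sketch of that idea for a single layer; the column-mean importance score, the fixed float threshold, and the rule of always keeping the first token are illustrative assumptions here, whereas [16] obtains its per-layer thresholds through a multi-step training procedure.

```python
import numpy as np

def token_importance(attn):
    """Score each token by the attention it receives, averaged over
    heads and query positions (the column mean of the attention map)."""
    # attn: (num_heads, seq_len, seq_len) attention probabilities
    return attn.mean(axis=(0, 1))             # shape: (seq_len,)

def prune_tokens(hidden, attn, threshold):
    """Drop tokens whose importance falls below the layer's threshold.
    `threshold` is a plain float here; the cited method learns it per layer."""
    keep = token_importance(attn) >= threshold
    keep[0] = True                             # always keep the [CLS]-style first token
    return hidden[keep], keep

# Toy usage: one layer, 2 heads, 8 tokens, hidden size 4.
rng = np.random.default_rng(0)
attn = rng.random((2, 8, 8))
attn /= attn.sum(axis=-1, keepdims=True)       # row-normalize like a softmax output
hidden = rng.standard_normal((8, 4))
pruned, mask = prune_tokens(hidden, attn, threshold=0.12)
print(mask, pruned.shape)
```

In a deployed setting the same mask would also be applied to the attention mask of subsequent layers, so that pruned positions are never attended to again.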
“…An increasing number of works focusing on MHSA pruning have recently emerged. These mainly aim at reducing the number of attention heads in each Transformer layer [22,23,33] and at token pruning [10,15,16,34]. Eliminating attention heads completely to speed up processing might significantly impact accuracy.…”
Section: Introduction
Mentioning confidence: 99%
“…All token selection strategies are conducted under our structure-preserving strategy without the layer-to-stage training schedule. The token selection strategies include: randomly selecting the informative tokens (random selection); utilizing the class attention of the last layer for selection in all layers via twice inference (last class attention); and taking the column mean of the attention matrix as the score of each token, as proposed in (Kim et al. 2021) (column mean). […] attention outperforms the other selection strategies and common sub-sampling methods on both accuracy and efficiency.…”
Section: Ablation Analysis
Mentioning confidence: 99%
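To make the compared selection rules concrete, here is a small sketch, assuming a single-head attention matrix and a top-k interface, of two of the strategies named in the statement: uniform random selection and the column-mean attention score of (Kim et al. 2021). The last-class-attention variant would additionally require the class token's attention row from the final layer, so it is omitted here.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_selection(n_tokens, k):
    """Baseline: pick k 'informative' tokens uniformly at random."""
    return np.sort(rng.choice(n_tokens, size=k, replace=False))

def column_mean_selection(attn, k):
    """Score each token by the column mean of the attention matrix
    (how much attention it receives on average), then keep the top-k."""
    scores = attn.mean(axis=0)                 # shape: (seq_len,)
    return np.sort(np.argsort(scores)[-k:])

# Toy attention matrix for 10 tokens (rows already softmax-normalized).
attn = rng.random((10, 10))
attn /= attn.sum(axis=-1, keepdims=True)
print("random :", random_selection(10, 4))
print("colmean:", column_mean_selection(attn, 4))
```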
“…As Equation 1 indicates, the computational complexity for computing the attention matrix is O(d²n + n²d), which scales quadratically with the sequence length. As such, the attention operation becomes a bottleneck when applied to long sequences such as source code [25].…”
Section: DietCodeBERT: Program Simplification For CodeBERT
Mentioning confidence: 99%
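As a quick sanity check on the O(d²n + n²d) figure, the sketch below counts approximate multiply-adds for one self-attention layer; the factor of 4 for the Q/K/V/output projections and the BERT-base-like d = 768 are assumptions for illustration. Doubling n quadruples the n²d term, so it dominates once sequences get long, which is the bottleneck the statement refers to.

```python
def attention_flops(n, d):
    """Rough multiply-add count for one self-attention layer:
    d^2 * n per linear projection (Q, K, V, output),
    n^2 * d for Q·K^T and for the attention-weighted sum over V."""
    projections = 4 * n * d * d      # the d^2 * n term
    attention = 2 * n * n * d        # the n^2 * d term
    return projections + attention

d = 768                               # hidden size of a BERT-base-like model (assumed)
for n in (128, 512, 2048):            # increasing sequence lengths
    print(n, f"{attention_flops(n, d):.3e}")
```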