2021
DOI: 10.48550/arxiv.2107.00910
Preprint

Learned Token Pruning for Transformers

Cited by 9 publications (10 citation statements)
References 0 publications
“…Nonetheless, this approach requires dynamically sorting tokens and heads by their importance to select the top-k candidates, which needs specialized hardware. Similar to our work, the recently published [16] also adopts a threshold-based pruning approach, which removes unimportant tokens as the input passes through the Transformer layers. However, this method requires a three-step training procedure to obtain a per-layer learned threshold, which again prevents the technique from being easily deployed across a wide range of pre-trained networks.…”
Section: Related Work
Mentioning confidence: 92%
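The statement above describes threshold-based token pruning: tokens whose importance score falls below a per-layer threshold are dropped as the input flows through the Transformer layers. The following is a minimal NumPy sketch of that idea for a single layer; the column-mean importance score, the fixed float threshold, and the rule of always keeping the first token are illustrative assumptions here, whereas [16] obtains its per-layer thresholds through a multi-step training procedure.

```python
import numpy as np

def token_importance(attn):
    """Score each token by the attention it receives, averaged over
    heads and query positions (the column mean of the attention map)."""
    # attn: (num_heads, seq_len, seq_len) attention probabilities
    return attn.mean(axis=(0, 1))             # shape: (seq_len,)

def prune_tokens(hidden, attn, threshold):
    """Drop tokens whose importance falls below the layer's threshold.
    `threshold` is a plain float here; the cited method learns it per layer."""
    keep = token_importance(attn) >= threshold
    keep[0] = True                             # always keep the [CLS]-style first token
    return hidden[keep], keep

# Toy usage: one layer, 2 heads, 8 tokens, hidden size 4.
rng = np.random.default_rng(0)
attn = rng.random((2, 8, 8))
attn /= attn.sum(axis=-1, keepdims=True)       # row-normalize like a softmax output
hidden = rng.standard_normal((8, 4))
pruned, mask = prune_tokens(hidden, attn, threshold=0.12)
print(mask, pruned.shape)
```

In a deployed setting the same mask would also be applied to the attention mask of subsequent layers, so that pruned positions are never attended to again.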
“…An increasing number of works focusing on MHSA pruning have recently emerged. These mainly aim at reducing the number of attention heads in each Transformer layer [22,23,33] and at token pruning [10,15,16,34]. Eliminating attention heads completely to speed up processing might significantly impact accuracy.…”
Section: Introduction
Mentioning confidence: 99%
“…All token selection strategies are conducted under our structure-preserving strategy without the layer-to-stage training schedule. The token selection strategies include: randomly selecting the informative tokens (random selection); utilizing the class attention of the last layer for selection in all layers via twice inference (last class attention); and taking the column mean of the attention matrix as the score of each token, as proposed in (Kim et al. 2021) (column mean). […] attention outperforms the other selection strategies and common sub-sampling methods on both accuracy and efficiency.…”
Section: Ablation Analysis
Mentioning confidence: 99%
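To make the compared selection rules concrete, here is a small sketch, assuming a single-head attention matrix and a top-k interface, of two of the strategies named in the statement: uniform random selection and the column-mean attention score of (Kim et al. 2021). The last-class-attention variant would additionally require the class token's attention row from the final layer, so it is omitted here.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_selection(n_tokens, k):
    """Baseline: pick k 'informative' tokens uniformly at random."""
    return np.sort(rng.choice(n_tokens, size=k, replace=False))

def column_mean_selection(attn, k):
    """Score each token by the column mean of the attention matrix
    (how much attention it receives on average), then keep the top-k."""
    scores = attn.mean(axis=0)                 # shape: (seq_len,)
    return np.sort(np.argsort(scores)[-k:])

# Toy attention matrix for 10 tokens (rows already softmax-normalized).
attn = rng.random((10, 10))
attn /= attn.sum(axis=-1, keepdims=True)
print("random :", random_selection(10, 4))
print("colmean:", column_mean_selection(attn, 4))
```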
“…As Equation 1 indicates, the computational complexity for computing the attention matrix is O(d²n + n²d), which scales quadratically with the sequence length. As such, the attention operation becomes a bottleneck when applied to long sequences such as source code [25].…”
Section: DietCodeBERT: Program Simplification For CodeBERT
Mentioning confidence: 99%
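As a quick sanity check on the O(d²n + n²d) figure, the sketch below counts approximate multiply-adds for one self-attention layer; the factor of 4 for the Q/K/V/output projections and the BERT-base-like d = 768 are assumptions for illustration. Doubling n quadruples the n²d term, so it dominates once sequences get long, which is the bottleneck the statement refers to.

```python
def attention_flops(n, d):
    """Rough multiply-add count for one self-attention layer:
    d^2 * n per linear projection (Q, K, V, output),
    n^2 * d for Q·K^T and for the attention-weighted sum over V."""
    projections = 4 * n * d * d      # the d^2 * n term
    attention = 2 * n * n * d        # the n^2 * d term
    return projections + attention

d = 768                               # hidden size of a BERT-base-like model (assumed)
for n in (128, 512, 2048):            # increasing sequence lengths
    print(n, f"{attention_flops(n, d):.3e}")
```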