Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.250

schuBERT: Optimizing Elements of BERT

Abstract: Transformers (Vaswani et al., 2017) have gradually become a key component of many state-of-the-art natural language representation models. A recent Transformer-based model, BERT (Devlin et al., 2018), achieved state-of-the-art results on various natural language processing tasks, including GLUE, SQuAD v1.1, and SQuAD v2.0. This model, however, is computationally prohibitive and has a huge number of parameters. In this work we revisit the architecture choices of BERT in an effort to obtain a lighter model. We focu…
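The "architecture choices" mentioned in the abstract are BERT's design dimensions: the number of encoder layers, the number of attention heads per layer, the hidden size, and the feed-forward (intermediate) size, which together determine the parameter count. As a hedged illustration of how these elements drive model size (not the actual schuBERT search procedure), the sketch below uses the Hugging Face transformers library to instantiate BERT variants and compare parameter counts; the reduced sizes are assumptions chosen for demonstration only.

```python
# A minimal sketch, assuming the Hugging Face `transformers` library.
# The smaller sizes below are illustrative assumptions, not the
# configurations found by schuBERT.
from transformers import BertConfig, BertModel

def count_parameters(config: BertConfig) -> int:
    """Build an (untrained) BERT encoder and count its parameters."""
    model = BertModel(config)
    return sum(p.numel() for p in model.parameters())

bert_base = BertConfig()          # defaults: 12 layers, 12 heads, hidden 768, intermediate 3072
lighter = BertConfig(
    num_hidden_layers=12,
    num_attention_heads=8,        # fewer attention heads per layer
    hidden_size=512,              # smaller hidden (embedding) dimension
    intermediate_size=1536,       # smaller feed-forward layer
)

print(f"BERT-base parameters:   {count_parameters(bert_base):,}")
print(f"Lighter variant params: {count_parameters(lighter):,}")
```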

Cited by 15 publications (12 citation statements)
References 33 publications
“…Mao et al. (2020) combined distillation with unstructured pruning, while Hou et al. (2020) combined distillation with structured pruning. When compared with only structured pruning (Khetan and Karnin, 2020), we see that Hou et al. (2020) achieved both a smaller model size (12.4%) and a smaller drop in accuracy (0.96%).…”
Section: Comparison and Analysis
Mentioning confidence: 75%
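For context on what "combining distillation with pruning" means in practice: a typical training objective mixes a supervised task loss, a soft-label distillation term against the teacher, and a sparsity penalty on gate variables attached to prunable structures such as attention heads. The PyTorch sketch below is a generic illustration of that recipe, not the specific objectives of Mao et al. (2020) or Hou et al. (2020); the gate-variable setup and the weighting hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def distill_and_prune_loss(student_logits, teacher_logits, labels,
                           head_gates, temperature=2.0,
                           alpha=0.5, l1_weight=1e-3):
    """Combine task loss, logit distillation, and a sparsity penalty
    on per-head gate variables (the structured-pruning component)."""
    # Supervised task loss on the hard labels.
    task_loss = F.cross_entropy(student_logits, labels)

    # Soft-label distillation: KL divergence between teacher and
    # student output distributions, softened by the temperature.
    t = temperature
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # L1 penalty drives whole attention-head gates toward zero,
    # so entire heads can be removed after training.
    sparsity = l1_weight * head_gates.abs().sum()

    return alpha * kd_loss + (1.0 - alpha) * task_loss + sparsity
```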
“…However, this is not a completely fair comparison, as Sanh et al. (2019) did not use attention as a distillation target. When we compare other methods, we find that Jiao et al. (2020) beat Khetan and Karnin (2020) in terms of both model size and accuracy. This shows that structured pruning outperforms student models trained using distillation only on encoder outputs and output logits, but fails against distillation on attention maps.…”
Section: Matrix Decomposition and Dynamic Inference
Mentioning confidence: 95%
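The distinction drawn here is between distilling only a student's output logits and hidden states versus also matching the teacher's attention maps layer by layer. Below is a hedged PyTorch sketch of an attention-map distillation term; the uniform layer mapping and the assumption that student and teacher use the same number of attention heads are illustrative choices, not details taken from Jiao et al. (2020).

```python
import torch
import torch.nn.functional as F

def attention_map_distillation(student_attn, teacher_attn, layer_map):
    """Mean-squared error between student and teacher attention maps.

    student_attn / teacher_attn: lists of [batch, heads, seq, seq] tensors,
    one per layer. layer_map[i] is the teacher layer matched to student
    layer i (a uniform mapping is assumed; head counts must agree).
    """
    loss = 0.0
    for i, attn_s in enumerate(student_attn):
        attn_t = teacher_attn[layer_map[i]]
        loss = loss + F.mse_loss(attn_s, attn_t)
    return loss / len(student_attn)

# Example: a 4-layer student distilled from a 12-layer teacher,
# matching every third teacher layer.
layer_map = [2, 5, 8, 11]
```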