Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.250

schuBERT: Optimizing Elements of BERT

Abstract: Transformers (Vaswani et al., 2017) have gradually become a key component of many state-of-the-art natural language representation models. A recent Transformer-based model, BERT (Devlin et al., 2018), achieved state-of-the-art results on various natural language processing tasks, including GLUE, SQuAD v1.1, and SQuAD v2.0. This model, however, is computationally prohibitive and has a huge number of parameters. In this work we revisit the architecture choices of BERT in an effort to obtain a lighter model. We focu…
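The "architecture choices" mentioned in the abstract are BERT's design dimensions: the number of encoder layers, the number of attention heads per layer, the hidden size, and the feed-forward (intermediate) size, which together determine the parameter count. As a hedged illustration of how these elements drive model size (not the actual schuBERT search procedure), the sketch below uses the Hugging Face transformers library to instantiate BERT variants and compare parameter counts; the reduced sizes are assumptions chosen for demonstration only.

```python
# A minimal sketch, assuming the Hugging Face `transformers` library.
# The smaller sizes below are illustrative assumptions, not the
# configurations found by schuBERT.
from transformers import BertConfig, BertModel

def count_parameters(config: BertConfig) -> int:
    """Build an (untrained) BERT encoder and count its parameters."""
    model = BertModel(config)
    return sum(p.numel() for p in model.parameters())

bert_base = BertConfig()          # defaults: 12 layers, 12 heads, hidden 768, intermediate 3072
lighter = BertConfig(
    num_hidden_layers=12,
    num_attention_heads=8,        # fewer attention heads per layer
    hidden_size=512,              # smaller hidden (embedding) dimension
    intermediate_size=1536,       # smaller feed-forward layer
)

print(f"BERT-base parameters:   {count_parameters(bert_base):,}")
print(f"Lighter variant params: {count_parameters(lighter):,}")
```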

Cited by 15 publications (12 citation statements)
References 33 publications
“…Mao et al. (2020) combined distillation with unstructured pruning, while Hou et al. (2020) combined distillation with structured pruning. When compared with only structured pruning (Khetan and Karnin, 2020), we see that Hou et al. (2020) achieved both a smaller model size (12.4%) and a smaller drop in accuracy (0.96%).…”
Section: Comparison and Analysis
Mentioning confidence: 75%
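For context on what "combining distillation with pruning" means in practice: a typical training objective mixes a supervised task loss, a soft-label distillation term against the teacher, and a sparsity penalty on gate variables attached to prunable structures such as attention heads. The PyTorch sketch below is a generic illustration of that recipe, not the specific objectives of Mao et al. (2020) or Hou et al. (2020); the gate-variable setup and the weighting hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def distill_and_prune_loss(student_logits, teacher_logits, labels,
                           head_gates, temperature=2.0,
                           alpha=0.5, l1_weight=1e-3):
    """Combine task loss, logit distillation, and a sparsity penalty
    on per-head gate variables (the structured-pruning component)."""
    # Supervised task loss on the hard labels.
    task_loss = F.cross_entropy(student_logits, labels)

    # Soft-label distillation: KL divergence between teacher and
    # student output distributions, softened by the temperature.
    t = temperature
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # L1 penalty drives whole attention-head gates toward zero,
    # so entire heads can be removed after training.
    sparsity = l1_weight * head_gates.abs().sum()

    return alpha * kd_loss + (1.0 - alpha) * task_loss + sparsity
```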
“…However, this is not a completely fair comparison, as Sanh et al. (2019) did not use attention as a distillation target. When we compare other methods, we find that Jiao et al. (2020) beat Khetan and Karnin (2020) in terms of both model size and accuracy. This shows that structured pruning outperforms student models trained using distillation only on encoder outputs and output logits, but fails against distillation on attention maps.…”
Section: Matrix Decomposition and Dynamic Inference
Mentioning confidence: 95%
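The distinction drawn here is between distilling only a student's output logits and hidden states versus also matching the teacher's attention maps layer by layer. Below is a hedged PyTorch sketch of an attention-map distillation term; the uniform layer mapping and the assumption that student and teacher use the same number of attention heads are illustrative choices, not details taken from Jiao et al. (2020).

```python
import torch
import torch.nn.functional as F

def attention_map_distillation(student_attn, teacher_attn, layer_map):
    """Mean-squared error between student and teacher attention maps.

    student_attn / teacher_attn: lists of [batch, heads, seq, seq] tensors,
    one per layer. layer_map[i] is the teacher layer matched to student
    layer i (a uniform mapping is assumed; head counts must agree).
    """
    loss = 0.0
    for i, attn_s in enumerate(student_attn):
        attn_t = teacher_attn[layer_map[i]]
        loss = loss + F.mse_loss(attn_s, attn_t)
    return loss / len(student_attn)

# Example: a 4-layer student distilled from a 12-layer teacher,
# matching every third teacher layer.
layer_map = [2, 5, 8, 11]
```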