Patient Knowledge Distillation for BERT Model Compression
Preprint, 2019
DOI: 10.48550/arxiv.1908.09355

Cited by 94 publications (142 citation statements)
References 0 publications
“…In recent years, many approaches have been proposed for the acceleration of PLMs. One popular way is to distill large-scale PLMs into lightweight models (Sanh et al., 2019; Sun et al., 2019), so that the computation cost is reduced in proportion to the reduction in model size. In addition, efficient Transformer variants have been proposed that reduce the time complexity of self-attention from quadratic to linear (or log-linear).…”
Section: Related Work
confidence: 99%
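As a concrete illustration of the distillation these statements refer to, the sketch below shows the standard soft-target (logit) distillation loss in PyTorch. It is a minimal sketch under assumed settings: the function name, temperature, and weighting `alpha` are illustrative choices, not the cited papers' exact recipe.

```python
# Minimal sketch of logit-based (soft-target) knowledge distillation.
# Hyperparameters and the function name are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend cross-entropy on hard labels with a KL term that pushes the
    student toward the teacher's temperature-softened predictions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 so gradient magnitudes stay comparable
    # across temperatures.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

In BERT-PKD, a soft-target term of this kind is combined with an additional loss on intermediate representations (sketched after the next citation statement).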
“…Transformer-based pre-trained language models like BERT (Devlin et al., 2018) exhibit excellent performance but are also computationally expensive. There have been many works that attempt to compress Transformer-based models with knowledge distillation (Jiao et al., 2019; Sun et al., 2019a; Wang et al., 2020). The distilled knowledge may be soft target probabilities, embedding outputs, hidden representations, or attention weight distributions.…”
Section: Related Work
confidence: 99%
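The statement above lists the forms that distilled knowledge can take. The sketch below illustrates one of them, matching intermediate hidden representations in the spirit of BERT-PKD's "patient" loss on [CLS] hidden states; the layer mapping, input shapes, and function name are assumptions for illustration, not the paper's actual code.

```python
# Rough sketch of a patient-style loss: each chosen student layer imitates
# the normalized [CLS] hidden state of a chosen teacher layer (e.g., every
# other teacher layer, as in the "skip" mapping). Shapes are assumptions.
import torch
import torch.nn.functional as F

def patient_loss(student_hidden, teacher_hidden, teacher_layer_ids):
    """student_hidden: list of [batch, hidden] CLS vectors, one per student layer.
    teacher_hidden: list of [batch, hidden] CLS vectors, one per teacher layer.
    teacher_layer_ids: which teacher layer each student layer should imitate."""
    loss = 0.0
    for s_vec, t_idx in zip(student_hidden, teacher_layer_ids):
        t_vec = teacher_hidden[t_idx]
        # Normalize so the MSE compares directions rather than magnitudes.
        loss = loss + F.mse_loss(F.normalize(s_vec, dim=-1),
                                 F.normalize(t_vec, dim=-1))
    return loss / len(student_hidden)
```

Attention-based distillation (as in the cited Jiao et al. and Wang et al. work) follows the same pattern, but matches attention weight distributions instead of hidden vectors.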
“…Task-specific methods require that the teacher be trained for each downstream task. Distilled bidirectional long short-term memory network (Distilled BiLSTM) (Tang et al., 2019), Patient Knowledge Distillation for a BERT model (BERT-PKD) (Sun et al., 2019), and Stacked Internal Distillation (SID) (Aguilar et al., 2020) are all considered task-specific methods. Task-agnostic methods, on the other hand, use one teacher for several downstream tasks.…”
Section: Knowledge Distillation
confidence: 99%