Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1441

Patient Knowledge Distillation for BERT Model Compression

Abstract: Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student). Different from previous knowledge distillati…
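
The truncated abstract describes a teacher-student compression setup. As a rough illustration only (the exact objective is not given here), the sketch below shows how a "patient" distillation loss of this kind is commonly assembled in PyTorch: a soft-label distillation term on the output logits plus an extra term that matches intermediate [CLS] hidden states of selected student and teacher layers. The function name, hyperparameters, and layer pairing are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha=0.5, beta=10.0):
    """Illustrative teacher-student loss: soft-label KD plus an
    intermediate-layer ("patient") term. Hyperparameters are placeholders."""
    # Standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    # Soft-label term: KL divergence between temperature-scaled
    # teacher and student output distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # "Patient" term: match normalized [CLS] hidden states of paired
    # student/teacher layers (pairing and equal hidden sizes assumed).
    pt = 0.0
    for s_h, t_h in zip(student_hidden, teacher_hidden):
        s_cls = F.normalize(s_h[:, 0], dim=-1)
        t_cls = F.normalize(t_h[:, 0], dim=-1)
        pt = pt + F.mse_loss(s_cls, t_cls)

    return (1 - alpha) * ce + alpha * kd + beta * pt
```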

Cited by 444 publications (467 citation statements) · References 30 publications

Citation statements (ordered by relevance):
“…Before the birth of BERT, KD had been applied to several specific tasks like machine translation (Kim and Rush, 2016; Tan et al., 2019) in NLP. While the recent studies of distilling large pre-trained models focus on finding general distillation methods that work on various tasks and are receiving more and more attention (Sanh et al., 2019; Jiao et al., 2019; Sun et al., 2019a; Tang et al., 2019; Clark et al., 2019; …).…”
Section: Introduction · Citation type: mentioning · Confidence: 99%
“…Furthermore, using BERT for inference poses latency challenges in a production system. A promising direction of future work that we plan on investigating is leveraging distilled versions of BERT (Sun et al., 2019; Wang et al., 2020) for the task. Table 6 shows the results of the NN analysis.…”
Section: Discussion · Citation type: mentioning · Confidence: 99%
“…As such, we look into Knowledge Distillation (KD) (Hinton et al., 2015) to transfer the language modeling capability of BART while keeping its copying behavior. Transferring the language model of massive pre-trained models into smaller models has been of high interest recently (Sanh et al., 2019; Turc et al., 2020; Sun et al., 2019). Knowledge transfer to simple models has also been discussed to a lesser extent (Tang et al., 2019; Mukherjee and Awadallah, 2019).…”
Section: Distilling BART · Citation type: mentioning · Confidence: 99%