Self-Knowledge Distillation in Natural Language Processing
2019 · Preprint · DOI: 10.48550/arxiv.1908.01851

Cited by 17 publications (18 citation statements) · References 0 publications
“…Knowledge Distillation Knowledge distillation (KD) is a prominent neural model compression technique (Hinton et al., 2015) in which the output of a teacher network is used as auxiliary supervision alongside the ground-truth training labels. Later on, it was shown that KD can also be used to improve the performance of neural networks in the so-called born-again (Furlanello et al., 2018) or self-distillation frameworks (Kim et al., 2020; Yun et al., 2020; Hahn & Choi, 2019). Self-distillation is a regularization technique that tries to improve the performance of a network using its own internal knowledge.…”
Section: Related Work · Citation type: mentioning · Confidence: 99%
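The excerpt above describes the core mechanism: the teacher's output serves as auxiliary supervision next to the hard labels. As a concrete reference point, here is a minimal PyTorch sketch of the standard Hinton-style KD objective; the function name, temperature, and mixing weight `alpha` are illustrative assumptions rather than values taken from the cited papers.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Hinton-style KD: hard-label cross entropy plus KL divergence between
    temperature-softened teacher and student output distributions."""
    # Supervision from the ground-truth training labels.
    ce = F.cross_entropy(student_logits, labels)
    # Auxiliary supervision from the teacher's softened output distribution.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Convex combination of the two supervision signals.
    return (1.0 - alpha) * ce + alpha * kl
```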
“…In other words, in self-distillation scenarios, the student becomes its own teacher. While KD has shown great success in different ASR tasks (Pang et al., 2018; Huang et al., 2018; Takashima et al., 2018; Kim et al., 2019; Chebotar & Waters, 2016; Fukuda et al., 2017; Yoon et al., 2020), self-distillation has been investigated more in the computer vision and natural language processing (NLP) domains (Haun & Choi, 2019; Hahn & Choi, 2019). To the best of our knowledge, we incorporate the self-KD approach for the first time in training ASR models.…”
Section: Related Work · Citation type: mentioning · Confidence: 99%
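Since this excerpt frames self-distillation as the student becoming its own teacher, the sketch below shows one common way to instantiate that idea: an earlier frozen snapshot of the same network provides the soft targets. It reuses the hypothetical `kd_loss` from the previous sketch and is a simplified illustration, not the exact procedure of the cited ASR or NLP works.

```python
import torch

def self_distillation_step(model, teacher_snapshot, optimizer, inputs, labels,
                           temperature=2.0, alpha=0.5):
    """One training step where the teacher is a frozen earlier snapshot of the
    same model (e.g., from the previous epoch or born-again generation)."""
    with torch.no_grad():
        teacher_logits = teacher_snapshot(inputs)  # soft targets from the model's own past knowledge
    student_logits = model(inputs)
    loss = kd_loss(student_logits, teacher_logits, labels, temperature, alpha)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```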
“…Results show that SD can almost replicate the accuracy regardless of whether a well-trained large model or a big dataset is available, such as in the image classification task (Zhang et al., 2019). SD has also been applied to NLP tasks such as language modeling and neural machine translation (Hahn and Choi, 2019), obtaining promising results. Despite achieving higher accuracy and better performance, modern deep learning models suffer from miscalibration and overconfidence (Müller et al., 2019; Naeini et al., 2015; Lakshminarayanan et al., 2016).…”
Section: Introduction · Citation type: mentioning · Confidence: 97%
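The excerpt also points to miscalibration and overconfidence, which are usually quantified with the Expected Calibration Error of Naeini et al. (2015). Below is a small NumPy sketch of ECE; the number of bins and the equal-width binning scheme are conventional choices, not details taken from the cited papers.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """ECE: confidence-weighted gap between average confidence and accuracy per bin."""
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            accuracy = (predictions[in_bin] == labels[in_bin]).mean()
            avg_confidence = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(avg_confidence - accuracy)
    return ece
```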
“…Knowledge Distillation KD (Hinton et al., 2015) is a well-known method for neural model compression and has also been shown to be an effective regularizer that improves the performance of neural networks in the self-distillation (Yun et al., 2020; Hahn and Choi, 2019) or born-again (Furlanello et al., 2018) setups. KD adds a particular loss term to the regular cross-entropy (CE) classification loss:…”
Section: Introduction · Citation type: mentioning · Confidence: 99%
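The equation itself is truncated in the excerpt. For reference, the standard combined objective from Hinton et al. (2015), which the citing paper's formulation presumably follows up to notation, is

```latex
\mathcal{L} = (1-\lambda)\,\mathrm{CE}\!\left(y,\ \sigma(z_s)\right)
            + \lambda\, T^{2}\, \mathrm{KL}\!\left(\sigma(z_t/T)\,\middle\|\,\sigma(z_s/T)\right)
```

where z_s and z_t are the student and teacher logits, \sigma is the softmax, T is the distillation temperature, and \lambda weights the KD term against the CE term.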