2021
DOI: 10.48550/arxiv.2110.08532
Preprint

Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher

Mehdi Rezagholizadeh, Aref Jafari, Puneeth Salad, et al.

Abstract: With the ever-growing scale of neural models, knowledge distillation (KD) attracts more attention as a prominent tool for neural model compression. However, there are counterintuitive observations in the literature showing some challenging limitations of KD. A case in point is that the best-performing checkpoint of the teacher might not necessarily be the best teacher for training the student in KD. Therefore, one important question is how to find the best checkpoint of the teacher for distillation. Searchi…
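For context, the sketch below shows the standard KD objective that the abstract builds on, assuming a PyTorch setup; the temperature, loss weighting, and function name are illustrative and not taken from the paper.

```python
# Minimal sketch of a Hinton-style KD loss, assuming PyTorch.
# Temperature and alpha are illustrative defaults, not the paper's configuration.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend a soft distillation term with the usual cross-entropy term."""
    # Soft targets: KL divergence between temperature-softened distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```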

Cited by 2 publications (3 citation statements)
References 23 publications

“…Recent research has shown that distillation from a large teacher to a small student has only marginal benefits (Jin et al., 2019; Cho & Hariharan, 2019), mainly due to the large prediction discrepancy. Traditional solutions have resorted to introducing auxiliary teacher assistant models (Mirzadeh et al., 2020; Rezagholizadeh et al., 2021), but training and storing auxiliary models can be costly in memory and computation.…”
Section: Resolving Prediction Discrepancy
confidence: 99%

“…This suggests that the teacher's knowledge should align with the student's capabilities. Furthermore, Rezagholizadeh et al. [45] proposed progressive distillation for natural language processing (NLP) and classification tasks to minimize the capability gap between the student and the teacher. Progressive distillation gradually distills knowledge from a smoother teacher to a fully trained teacher, allowing the student to learn from a teacher at the appropriate level.…”
Section: Knowledge Distillation
confidence: 99%

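To make the idea in the statement above concrete, here is a minimal sketch of a checkpoint-following distillation loop in the spirit of Pro-KD, assuming PyTorch, a list of saved teacher snapshots ordered from early ("smooth") to fully trained, and the `kd_loss` helper sketched earlier; all names and the schedule are illustrative, not the authors' exact procedure.

```python
# Progressive distillation sketch: the student is trained against a sequence of
# teacher checkpoints, following the teacher's own training trajectory.
# Assumes: `student` and `teacher` are nn.Modules, `teacher_checkpoints` is a list
# of state-dict paths (earliest first), `loader` yields (inputs, labels) batches,
# and `kd_loss` is the illustrative helper defined above.
import torch

def progressive_distill(student, teacher, teacher_checkpoints, loader, epochs_per_stage=1):
    optimizer = torch.optim.AdamW(student.parameters(), lr=3e-5)
    teacher.eval()
    for ckpt_path in teacher_checkpoints:  # follow the teacher's training footsteps
        teacher.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
        for _ in range(epochs_per_stage):
            for inputs, labels in loader:
                with torch.no_grad():
                    teacher_logits = teacher(inputs)
                student_logits = student(inputs)
                loss = kd_loss(student_logits, teacher_logits, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return student
```

The key design choice this illustrates is that the teacher's capacity is ramped up over stages rather than fixed at its final checkpoint, so the student never faces the full teacher-student gap at once.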
“…The enhanced PyNET teacher model is depicted in Figure 2b, with the enhancements highlighted in magenta. Taking inspiration from progressive distillation [45] in classification settings, we propose progressive distillation for generative models, which also adaptively adjusts the teacher's level to smooth the knowledge transfer during the distillation process.…”
Section: B. Progressive Distillation
confidence: 99%