2023
DOI: 10.48550/arxiv.2303.11098
Preprint

A closer look at the training dynamics of knowledge distillation

Abstract: In this paper we revisit the efficacy of knowledge distillation as a function matching and metric learning problem. In doing so we verify three important design decisions, namely the normalisation, soft maximum function, and projection layers as key ingredients. We theoretically show that the projector implicitly encodes information on past examples, enabling relational gradients for the student. We then show that the normalisation of representations is tightly coupled with the training dynamics of this projec…
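The abstract frames feature distillation as matching normalised student and teacher representations through a trainable projector, with a soft maximum function applied before matching. The sketch below is a minimal, hypothetical PyTorch illustration of those three ingredients only; the paper's actual loss, projector design, and normalisation choice may differ, and all names, dimensions, and the temperature value here are assumptions.

    # Hypothetical sketch of feature distillation with a linear projector and
    # normalised representations, loosely following the abstract's ingredients.
    # The paper's exact formulation may differ; shapes and the loss are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureDistiller(nn.Module):
        def __init__(self, student_dim: int, teacher_dim: int):
            super().__init__()
            # Trainable projector mapping student features into the teacher's space.
            self.projector = nn.Linear(student_dim, teacher_dim)

        def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
            # Project the student features, then normalise both representations
            # (L2 normalisation is used here purely for simplicity).
            z_s = F.normalize(self.projector(student_feat), dim=-1)
            z_t = F.normalize(teacher_feat, dim=-1)
            # Soften both representations over the feature dimension as a stand-in
            # for the "soft maximum function" named in the abstract (temperature 0.1
            # is an arbitrary illustrative choice).
            log_p_s = F.log_softmax(z_s / 0.1, dim=-1)
            p_t = F.softmax(z_t / 0.1, dim=-1)
            # KL divergence between the softened representations as the matching loss.
            return F.kl_div(log_p_s, p_t, reduction="batchmean")

    # Example usage (hypothetical backbones):
    # distiller = FeatureDistiller(student_dim=512, teacher_dim=2048)
    # loss = distiller(student_backbone(x), teacher_backbone(x).detach())

A linear projector is the simplest possible choice here; the abstract's theoretical point is that even such a projector accumulates information about past examples during training, which is what makes the student's gradients relational.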

Cited by 2 publications (2 citation statements)
References 38 publications
“…According to Hugging Face Model repositories, the BERT models fine-tuned for the GLUE tasks have already been downloaded about 138,000 times in total at the time of writing. Research communities leverage torchdistill not only for knowledge distillation studies (Li et al., 2022a; Lin et al., 2022; Dong et al., 2022; Miles and Mikolajczyk, 2023), but also for the machine learning reproducibility challenge (MLRC) (Lee and Lee, 2023) and reproducible deep learning studies (Matsubara et al., 2022a,c; Furutanpey et al., 2023b,a; …). torchdistill is publicly available as a pip-installable PyPI package and will be maintained and upgraded to encourage coding-free reproducible deep learning and knowledge distillation studies.…”
Section: Discussion
confidence: 99%
“…Deep Neural Networks progressively generate features [36][37][38], with higher layers capturing critical features that are more closely related to the main task. Considering the training process of a DNN as the problem and its learned weights and parameters as the solution, the features generated within the depths of the DNN can be viewed as intermediate results of the solving process.…”
Section: Flow of Solution Procedures (FSP) Matrix
confidence: 99%
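The FSP matrix named in this section heading is commonly defined in the knowledge-distillation literature as the spatially averaged inner product between the channels of two feature maps, summarising how one layer's features transform into another's. Below is a minimal sketch under that standard definition; the layer pairing, shapes, and the loss mentioned in the comments are illustrative assumptions, not details taken from the citing paper.

    # Minimal sketch of an FSP (Flow of Solution Procedure) matrix between two
    # feature maps of the same network, assuming they share spatial size.
    import torch

    def fsp_matrix(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a: (B, Ca, H, W), feat_b: (B, Cb, H, W) with matching H and W.
        # Returns a (B, Ca, Cb) matrix of channel-wise inner products averaged
        # over spatial positions.
        b, ca, h, w = feat_a.shape
        cb = feat_b.shape[1]
        a = feat_a.reshape(b, ca, h * w)              # (B, Ca, HW)
        c = feat_b.reshape(b, cb, h * w)              # (B, Cb, HW)
        return torch.bmm(a, c.transpose(1, 2)) / (h * w)

    # In FSP-style distillation, the student is typically trained to minimise
    # e.g. the mean squared error between its FSP matrices and the teacher's
    # for corresponding layer pairs.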