2020
DOI: 10.48550/arxiv.2006.05525
Preprint

Knowledge Distillation: A Survey

Jianping Gou, Baosheng Yu, Stephen John Maybank, et al.

Abstract: In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also the large storage requirements. To this …


Cited by 35 publications (48 citation statements)
References 179 publications
“…Knowledge distillation [42] was originally designed for training a smaller model that can be deployed on edge devices. It has since become a stepping stone for numerous algorithms [43]. The core concept of knowledge distillation is to provide a meaningful label representation from a pre-trained model, called the teacher model.…”
Section: Knowledge Distillation (mentioning)
confidence: 99%
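The statement above summarizes the standard teacher-student recipe covered by the survey. As a minimal, hedged sketch (not code from the survey or the citing paper), the classic soft-label distillation loss of Hinton et al. can be written in PyTorch as follows; the function name, temperature, and weighting are illustrative assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Classic soft-label knowledge distillation (Hinton et al., 2015).

    The student is trained to match the teacher's temperature-softened
    class distribution (KL term) while still fitting the ground-truth
    labels (cross-entropy term). `temperature` and `alpha` are
    illustrative defaults, not values prescribed by the survey.
    """
    # Soft targets produced by the pre-trained (frozen) teacher.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between teacher and student distributions,
    # scaled by T^2 so its gradients stay comparable to the CE term.
    kd_term = F.kl_div(log_student, soft_targets,
                       reduction="batchmean") * temperature ** 2

    # Ordinary supervised loss on the hard labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term
```

A higher temperature spreads the teacher's probability mass over more classes, which is what makes the "meaningful label representation" mentioned above richer than one-hot labels.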
“…We point the interested reader to Gou et al. (2020) for a sweeping survey of the many developments in knowledge distillation over the past half decade. In addition to the references discussing theoretical aspects of knowledge distillation provided in Sec.…”
Section: A. Extended Literature Review (mentioning)
confidence: 99%
“…Related work. Since we cannot review the vast literature on KD in its entirety, we point the interested reader to Gou et al. (2020) for a recent overview of the field. We devote this section to reviewing theoretical advances in the understanding of KD and summarize complementary empirical studies and applications of KD in the extended literature review in App.…”
Section: Introduction (mentioning)
confidence: 99%
“…A well-trained model captures meaningful knowledge or information for a specific task. The knowledge distillation approach aims to distill the learning capacity of a larger deep neural network (teacher model) to a smaller network (student model) [28,29]. It has shown efficacy in cross-modal scenarios, where the teacher model is trained on one modality and the knowledge is transferred to another modality [14,30].…”
Section: Introduction (mentioning)
confidence: 99%
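This last statement also mentions cross-modal transfer, where the teacher sees one modality and the student another. As a hedged sketch of what one training step of such a setup might look like (the model objects, optimizer, input pairing, and hyperparameters are all assumptions for illustration, not the method of the cited works):

```python
import torch
import torch.nn.functional as F

def cross_modal_distill_step(teacher, student, teacher_input, student_input,
                             labels, optimizer, temperature=4.0, alpha=0.5):
    """One hypothetical training step transferring a frozen teacher's
    predictions to a smaller student. In a cross-modal setup,
    `teacher_input` and `student_input` are different modalities of the
    same samples (e.g. RGB frames for the teacher, depth for the
    student); that pairing is assumed here for illustration.
    """
    with torch.no_grad():              # the teacher stays frozen
        teacher_logits = teacher(teacher_input)
    student_logits = student(student_input)

    # Soft-target matching plus the usual hard-label loss.
    soft = F.softmax(teacher_logits / temperature, dim=-1)
    log_s = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_s, soft, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    loss = alpha * kd + (1.0 - alpha) * ce

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```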