Self-Knowledge Distillation in Natural Language Processing

Hahn, Sangchul; Choi, Heeyoul

doi:10.26615/978-954-452-056-4_050

Cited by 68 publications

(29 citation statements)

References 13 publications

“…Predictions on different samples belonging to the same class are distilled to mitigate overconfident predictions. Self distillation has also been used in natural language processing [33]. Supervising the same model at different depths is explored in [34].…”

Section: Self KD Methods (mentioning)

confidence: 99%

Confidence Conditioned Knowledge Distillation

Mishra, Sundaram

2021

Preprint

In this paper, a novel confidence conditioned knowledge distillation (CCKD) scheme for transferring the knowledge from a teacher model to a student model is proposed. Existing state-of-the-art methods employ fixed loss functions for this purpose and ignore the different levels of information that need to be transferred for different samples. In addition, these methods are inefficient in terms of data usage. CCKD addresses these issues by leveraging the confidence assigned by the teacher model to the correct class to devise sample-specific loss functions (CCKD-L formulation) and targets (CCKD-T formulation). Further, CCKD improves data efficiency by employing self-regulation to stop samples on which the student model learns faster from participating in the distillation process. Empirical evaluations on several benchmark datasets show that CCKD methods generalize at least as well as other state-of-the-art methods while being data efficient. Student models trained through CCKD methods do not retain most of the misclassifications committed by the teacher model on the training set. Distillation through CCKD methods improves the resilience of the student models against adversarial attacks compared to the conventional KD method. Experiments show at least a 3% increase in performance against adversarial attacks for the MNIST and Fashion MNIST datasets, and at least a 6% increase for the CIFAR10 dataset.

“…Compared to direct training, knowledge distillation provides a more stable training process which leads to better performing student models (Hinton et al., 2015; Phuong and Lampert, 2019). Recent work (Furlanello et al., 2018; Hahn and Choi, 2019) also sheds light on leveraging knowledge distillation for training a high-performing student model with the same size as the teacher (see the discussion in the next section).…”

Section: Knowledge Distillation (mentioning)

confidence: 99%

Noisy Self-Knowledge Distillation for Text Summarization

Liu, Shen, Lapata

2021

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua

In this paper we apply self-knowledge distillation to text summarization, which we argue can alleviate problems with maximum-likelihood training on single-reference and noisy datasets. Instead of relying on one-hot annotation labels, our student summarization model is trained with guidance from a teacher which generates smoothed labels to help regularize training. Furthermore, to better model uncertainty during training, we introduce multiple noise signals for both teacher and student models. We demonstrate experimentally on three benchmarks that our framework boosts the performance of both pretrained and non-pretrained summarizers, achieving state-of-the-art results.

“…For the Seq2Seq model, Kim and Rush (2016) proposes to use the generated sequences as the sequence-level knowledge to guide the student network training. Moreover, self-knowledge distillation (Hahn and Choi, 2019) even shows that knowledge (representations) from the student network itself can improve the performance.…”

Section: Knowledge Distillation (mentioning)

confidence: 99%

Weight Distillation: Transferring the Knowledge in Neural Network Parameters

Lin, Li, Wang et al.

2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer

Knowledge distillation has proven to be effective in model acceleration and compression. It transfers knowledge from a large neural network to a small one by using the large network's predictions as targets for the small network. But this approach ignores the knowledge inside the large neural network, e.g., its parameters. Our preliminary study, as well as the recent success of pre-training, suggests that transferring parameters is more effective for distilling knowledge. In this paper, we propose Weight Distillation to transfer the knowledge in the parameters of a large neural network to a small neural network through a parameter generator. On the WMT16 En-Ro, NIST12 Zh-En, and WMT14 En-De machine translation tasks, our experiments show that weight distillation learns a small network that is 1.88∼2.94× faster than the large network but with competitive BLEU performance. When fixing the size of the small networks, weight distillation outperforms knowledge distillation by 0.51∼1.82 BLEU points.
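As a rough illustration of the baseline this abstract contrasts itself with (using a large network's predictions as targets for a small one), the following is a minimal sketch of a generic prediction-matching loss. It is not the Weight Distillation method and is not taken from any of the papers listed here; the function names, the example logits, and the temperature value are all assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student's predictions.

    The temperature of 2.0 is an illustrative assumption, not a value from the source.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
matched = distillation_loss([4.0, 1.0, 0.5], teacher)     # student agrees with teacher
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher)  # student disagrees

print(matched < mismatched)
```

The loss is smallest when the student's softened distribution equals the teacher's, which is why the matched student scores lower than the mismatched one. In practice this term is usually combined with the ordinary loss on the ground-truth labels.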
