Self-Knowledge Distillation in Natural Language Processing

Hahn, Sangchul; Choi, Heeyoul

doi:10.48550/arxiv.1908.01851

Cited by 17 publications

(18 citation statements)

References 0 publications

“…Knowledge Distillation. Knowledge distillation (KD) is a prominent neural model compression technique (Hinton et al, 2015) in which the output of a teacher network is used as an auxiliary supervision besides the ground-truth training labels. Later on, it was shown that KD can be used for improving the performance of neural networks in the so-called born-again (Furlanello et al, 2018) or self-distillation frameworks (Kim et al, 2020; Yun et al, 2020; Hahn & Choi, 2019). Self-distillation is a regularization technique trying to improve the performance of a network using its internal knowledge.…”

Section: Related Work (mentioning)

confidence: 99%

“…In other words, in self-distillation scenarios, the student becomes its own teacher. While KD has shown great success in different ASR tasks (Pang et al, 2018; Huang et al, 2018; Takashima et al, 2018; Kim et al, 2019; Chebotar & Waters, 2016; Fukuda et al, 2017; Yoon et al, 2020), self-distillation is more investigated in computer vision and natural language processing (NLP) domains (Haun & Choi, 2019; Hahn & Choi, 2019). To the best of our knowledge, we incorporate the self-KD approach for the first time in training ASR models.…”

Section: Related Work (mentioning)

confidence: 99%

Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation

Haidar, Chen, Rezagholizadeh

2021

Preprint

End-to-end automatic speech recognition (ASR), unlike conventional ASR, does not have modules to learn the semantic representation from the speech encoder. Moreover, the higher frame-rate of the speech representation prevents the model from learning the semantic representation properly. Therefore, models that are constructed with a lower frame-rate speech encoder lead to better performance. For Transformer-based ASR, the lower frame-rate is important not only for learning a better semantic representation but also for reducing the computational complexity due to the self-attention mechanism, which has O(n^2) order of complexity in both training and inference. In this paper, we propose a Transformer-based ASR model with a time reduction layer, in which we incorporate the time reduction layer inside the transformer encoder layers, in addition to traditional sub-sampling methods applied to the input features, to further reduce the frame-rate. This can help in reducing the computational cost of the self-attention process for training and inference with performance improvement. Moreover, we introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD) which further improves the performance of our ASR model. Experiments on LibriSpeech datasets show that our proposed methods outperform all other Transformer-based ASR systems. Furthermore, with language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models with just 30 million parameters trained without any external data.
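The link between frame-rate and self-attention cost described in this abstract can be illustrated with a small sketch. The abstract does not say how the time reduction layer is built, so the version below assumes one common form (concatenating adjacent frames); the names `time_reduce` and `attention_pairs` are hypothetical and not taken from the paper.

```python
def time_reduce(frames, factor=2):
    """Merge every `factor` consecutive frames into one longer frame.

    Assumption: concatenation of adjacent frames, which is only one possible
    way to build a time reduction layer. Leftover frames at the end that do
    not fill a whole group are dropped.
    """
    usable = len(frames) - len(frames) % factor
    reduced = []
    for start in range(0, usable, factor):
        merged = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)
        reduced.append(merged)
    return reduced


def attention_pairs(num_frames):
    """Number of query-key score pairs in full self-attention, i.e. n^2."""
    return num_frames * num_frames


# Eight frames with two features each (toy values).
frames = [[float(i), float(i) + 0.5] for i in range(8)]
reduced = time_reduce(frames, factor=2)

print(len(frames), "->", len(reduced), "frames")
print(attention_pairs(len(frames)), "->", attention_pairs(len(reduced)), "score pairs")
```

Under this assumption, reducing the sequence length by a factor r cuts the number of attention score pairs by a factor of r^2, at the price of a wider feature vector per frame.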

“…Results show that SD can almost replicate the accuracy regardless of a well-trained large model or big dataset such as in the image classification task (Zhang et al, 2019). SD has also been applied to NLP tasks such as language model and neural machine translation (Hahn and Choi, 2019) and obtains promising results. Despite obtaining higher accuracy and better performance, modern deep learning models face drawbacks of miscalibration and overconfidence (Müller et al, 2019;Naeini et al, 2015;Lakshminarayanan et al, 2016).…”

Section: Introduction (mentioning)

confidence: 97%

Learning ULMFiT and Self-Distillation with Calibration for Medical Dialogue System

Ao, Acharya

2021

Preprint

A medical dialogue system is essential for healthcare service, providing primary clinical advice and diagnoses. It has been gradually adopted and practiced in medical organizations in the form of a conversational bot, largely due to the advancement of NLP. In recent years, the introduction of state-of-the-art deep learning models and transfer learning techniques like Universal Language Model Fine Tuning (ULMFiT) and Knowledge Distillation (KD) has largely contributed to the performance of NLP tasks. However, some deep neural networks are poorly calibrated and wrongly estimate the uncertainty. Hence the model is not trustworthy, especially in sensitive medical decision-making systems and safety tasks. In this paper, we investigate the well-calibrated model for ULMFiT and self-distillation (SD) in a medical dialogue system. The calibrated ULMFiT (CULMFiT) is obtained by incorporating label smoothing (LS), a commonly used regularization technique to achieve a well-calibrated model. Moreover, we apply temperature scaling (TS), a technique to recalibrate the confidence score, with KD to observe its correlation with network calibration. To further understand the relation between SD and calibration, we use both fixed and optimal temperatures to fine-tune the whole model. All experiments are conducted on the consultation backpain dataset collected by experts and then further validated using a large publicly available medical dialogue corpus. We empirically show that our proposed methodologies outperform conventional methods in terms of accuracy and robustness.
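The two calibration tools named in this abstract, label smoothing and temperature scaling, can be sketched in a few lines. This is a generic illustration of the standard definitions, not the authors' implementation; the function names, the smoothing value 0.1, and the temperature 2.0 are assumptions chosen for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 flattens the output)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_labels(label, num_classes, epsilon=0.1):
    """Label smoothing: mix the one-hot target with a uniform distribution."""
    return [
        (1.0 - epsilon) * (1.0 if i == label else 0.0) + epsilon / num_classes
        for i in range(num_classes)
    ]


logits = [2.0, 1.0, 0.1]            # toy model outputs for three classes
plain = softmax(logits)             # uncalibrated confidence
cooled = softmax(logits, 2.0)       # temperature-scaled confidence
target = smooth_labels(0, 3, 0.1)   # smoothed training target for class 0

print("max confidence, T=1:", max(plain))
print("max confidence, T=2:", max(cooled))
print("smoothed target:", target)
```

In practice the temperature is usually fitted on held-out data rather than fixed by hand, which matches the abstract's distinction between fixed and optimal temperatures.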

“…Knowledge Distillation KD (Hinton et al, 2015) is a well-known method for neural model compression and also is shown to be an effective regularizer in improving the performance of neural networks in the self-distillation (Yun et al, 2020;Hahn and Choi, 2019) or born-again (Furlanello et al, 2018) setups. KD adds a particular loss term to the regular cross entropy (CE) classification loss:…”

Section: Introduction (mentioning)

confidence: 99%
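The citation statement above breaks off before its equation, and that equation is not reproduced here. As a rough illustration of what "adding a loss term to the cross-entropy loss" typically looks like, the sketch below follows the commonly used formulation from Hinton et al. (2015): a weighted sum of cross-entropy on the hard label and a temperature-softened KL term between teacher and student. The weighting `alpha`, the temperature, and all function names are assumptions, and the citing paper's exact form may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs):
    """Cross entropy H(target, pred), skipping zero-probability targets."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred_probs) if t > 0)


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """(1 - alpha) * CE(label, student) + alpha * T^2 * KL(teacher || student).

    The T^2 factor is the usual rescaling that keeps gradient magnitudes of the
    soft term comparable across temperatures. In self-distillation the teacher
    logits would come from the model itself (for example an earlier checkpoint).
    """
    num_classes = len(student_logits)
    one_hot = [1.0 if i == label else 0.0 for i in range(num_classes)]
    ce = cross_entropy(one_hot, softmax(student_logits))

    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(soft_teacher, soft_student) if t > 0)

    return (1.0 - alpha) * ce + alpha * temperature ** 2 * kl


student = [1.0, 2.0, 3.0]
teacher = [0.5, 2.5, 3.5]
print("KD loss:", kd_loss(student, teacher, label=2))
```

When the student and teacher logits coincide, the KL term vanishes and the loss reduces to the weighted cross-entropy alone.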

Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher

Rezagholizadeh, Jafari, Salad, et al.

2021

Preprint

With the ever-growing scale of neural models, knowledge distillation (KD) attracts more attention as a prominent tool for neural model compression. However, there are counterintuitive observations in the literature showing some challenging limitations of KD. A case in point is that the best performing checkpoint of the teacher might not necessarily be the best teacher for training the student in KD. Therefore, one important question is how to find the best checkpoint of the teacher for distillation. Searching through the checkpoints of the teacher would be a very tedious and computationally expensive process, which we refer to as the checkpoint-search problem. Moreover, another observation is that larger teachers might not necessarily be better teachers in KD, which is referred to as the capacity-gap problem. To address these challenging problems, in this work, we introduce our progressive knowledge distillation (Pro-KD) technique, which defines a smoother training path for the student by following the training footprints of the teacher instead of solely relying on distilling from a single mature fully-trained teacher. We demonstrate that our technique is quite effective in mitigating the capacity-gap problem and the checkpoint-search problem. We evaluate our technique using a comprehensive set of experiments on different tasks such as image classification (CIFAR-10 and CIFAR-100), natural language understanding tasks of the GLUE benchmark, and question answering (SQuAD 1.1 and 2.0) using BERT-based models, and consistently obtain superior results over state-of-the-art techniques.
