Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)
DOI: 10.18653/v1/n19-1192
Online Distilling from Checkpoints for Neural Machine Translation

Abstract: Current predominant neural machine translation (NMT) models often have a deep structure with large amounts of parameters, making them hard to train and prone to over-fitting. A common practice is to utilize a validation set to evaluate the training process and select the best checkpoint. Averaging and ensembling techniques on checkpoints can lead to further performance improvement. However, as these methods do not affect the training process, the system performance is restricted to the checkpo…
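The averaging technique mentioned in the abstract simply takes the element-wise mean of the parameters stored in several saved checkpoints and decodes with the resulting weights. The following minimal Python sketch illustrates that step, assuming each checkpoint file holds a plain PyTorch state dict; function and file names are illustrative and not taken from the paper's code.

import torch

def average_checkpoints(checkpoint_paths):
    """Return a state dict whose tensors are the element-wise mean of the
    tensors stored in the given checkpoint files."""
    avg_state = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {name: tensor.clone().float() for name, tensor in state.items()}
        else:
            for name, tensor in state.items():
                avg_state[name] += tensor.float()
    return {name: tensor / len(checkpoint_paths) for name, tensor in avg_state.items()}

# Illustrative usage: load the averaged weights into a model of the same
# architecture before decoding the test set.
# model.load_state_dict(average_checkpoints(["ckpt_epoch10.pt", "ckpt_epoch11.pt"]))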

Cited by 28 publications (12 citation statements). References 19 publications.

Citation statements:
“…Knowledge distillation is extensively studied in the field of natural language processing (NLP) in order to obtain lightweight, efficient, and effective language models. More and more KD methods have been proposed for the numerous NLP tasks (Liu et al, 2019b; Gordon and Duh, 2019; Haidar and Rezagholizadeh, 2019; Yang et al, 2020d; Tang et al, 2019; Hu et al, 2018; Nakashole and Flauger, 2017; Jiao et al, 2020; Wang et al, 2018d; Zhou et al, 2019a; Sanh et al, 2019; Turc et al, 2019; Arora et al, 2019; Clark et al, 2019; Kim and Rush, 2016; Mou et al, 2016; Liu et al, 2019f; Hahn and Choi, 2019; Kuncoro et al, 2016; Cui et al, 2017; Wei et al, 2019; Freitag et al, 2017; Shakeri et al, 2019; Aguilar et al, 2020; Fu et al, 2021; Yang et al, 2020d; Zhang et al, 2021b; Chen et al, 2020b; Wang and Du, 2021). The NLP tasks that use KD include neural machine translation (NMT) (Hahn and Choi, 2019; Zhou et al, 2019a; Li et al, 2021; Kim and Rush, 2016; Gordon and Duh, 2019; Wei et al, 2019; Freitag et al, 2017; Zhang et al, 2021b), text generation (Chen et al, 2020b; Haidar and Rezagholizad...…”
Section: KD in NLP
confidence: 99%
“…Knowledge distillation is extensively studied in the field of natural language processing (NLP) in order to obtain lightweight, efficient, and effective language models. More and more KD methods have been proposed for the numerous NLP tasks (Liu et al, 2019b; Gordon and Duh, 2019; Haidar and Rezagholizadeh, 2019; Yang et al, 2020b; Tang et al, 2019; Hu et al, 2018; Nakashole and Flauger, 2017; Jiao et al, 2019; Wang et al, 2018c; Zhou et al, 2019a; Sanh et al, 2019; Turc et al, 2019; Arora et al, 2019; Clark et al, 2019; Kim and Rush, 2016; Mou et al, 2016; Liu et al, 2019e; Hahn and Choi, 2019; Kuncoro et al, 2016; Cui et al, 2017; Wei et al, 2019; Freitag et al, 2017; Shakeri et al, 2019; Aguilar et al, 2020). The NLP tasks that use KD include neural machine translation (NMT) (Hahn and Choi, 2019; Zhou et al, 2019a; Kim and Rush, 2016; Wei et al, 2019; Freitag et al, 2017; Gordon and Duh, 2019), question answering (Wang et al, 2018c; Arora et al, 2019; Yang et al, 2020b; Hu et al, 2018), document retrieval (Shakeri et al, 2019), event detection (Liu et al, 2019b), text generation (Haidar and Rezagholizadeh, 2019)...…”
Section: KD in NLP
confidence: 99%
“…In natural language processing, neural machine translation is one of the most prominent applications. Many extended knowledge distillation methods have been proposed for neural machine translation (Hahn and Choi, 2019; Zhou et al, 2019a; Kim and Rush, 2016; Gordon and Duh, 2019; Wei et al, 2019; Freitag et al, 2017; …). In (Zhou et al, 2019a), an empirical analysis of how knowledge distillation affects non-autoregressive machine translation (NAT) models was presented.…”
Section: KD in NLP
confidence: 99%
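Several of the NMT distillation methods cited in the statement above build on sequence-level knowledge distillation (Kim and Rush, 2016), in which the student is trained on the teacher's own translations of the training sources rather than on the reference targets. The sketch below shows that data-construction step; the teacher_model.translate() helper and all names are illustrative assumptions, not any specific toolkit's API.

def build_distilled_corpus(teacher_model, source_sentences, beam_size=5):
    """Pair every training source with the teacher's best beam-search
    hypothesis to form the student's training data."""
    distilled_pairs = []
    for src in source_sentences:
        hyp = teacher_model.translate(src, beam_size=beam_size)  # assumed helper
        distilled_pairs.append((src, hyp))
    return distilled_pairs

# The student (for example a smaller or non-autoregressive model, as analysed
# in Zhou et al, 2019a) is then trained with ordinary cross-entropy on these
# distilled pairs instead of the original references.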
“…These methods slow down the learning of parameters that are important for previous tasks. Another well-known regularization-based approach is knowledge distillation [64], which preserves the predictions produced by the model learned on the previous task. This method creates a forward transfer of knowledge from a large network (teacher) to a small network (student) [65], such that the student learns to follow the predictions of the teacher.…”
Section: Regularization
confidence: 99%
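As a minimal illustration of the teacher-student objective described in the statement above, the sketch below combines a soft-target term (KL divergence between temperature-softened teacher and student distributions) with the usual hard-label cross-entropy, in the style of Hinton-type distillation; the temperature and alpha values are illustrative assumptions rather than settings from the cited works.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_targets,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of the soft-target KL term and the hard-label
    cross-entropy term used to train the student."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_p_student, p_teacher,
                         reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, hard_targets)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss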