Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)
DOI: 10.18653/v1/n19-1192
Online Distilling from Checkpoints for Neural Machine Translation

Abstract: Current predominant neural machine translation (NMT) models often have a deep structure with large amounts of parameters, making them hard to train and prone to over-fitting. A common practice is to utilize a validation set to evaluate the training process and select the best checkpoint. Averaging and ensembling techniques on checkpoints can lead to further performance improvement. However, as these methods do not affect the training process, the system performance is restricted to the checkpo…
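The averaging technique mentioned in the abstract simply takes the element-wise mean of the parameters stored in several saved checkpoints and decodes with the resulting weights. The following minimal Python sketch illustrates that step, assuming each checkpoint file holds a plain PyTorch state dict; function and file names are illustrative and not taken from the paper's code.

import torch

def average_checkpoints(checkpoint_paths):
    """Return a state dict whose tensors are the element-wise mean of the
    tensors stored in the given checkpoint files."""
    avg_state = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {name: tensor.clone().float() for name, tensor in state.items()}
        else:
            for name, tensor in state.items():
                avg_state[name] += tensor.float()
    return {name: tensor / len(checkpoint_paths) for name, tensor in avg_state.items()}

# Illustrative usage: load the averaged weights into a model of the same
# architecture before decoding the test set.
# model.load_state_dict(average_checkpoints(["ckpt_epoch10.pt", "ckpt_epoch11.pt"]))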

Cited by 28 publications (12 citation statements). References 19 publications.

Citation statements:
“…Knowledge distillation is extensively studied in the field of natural language processing (NLP) in order to obtain lightweight, efficient, and effective language models. More and more KD methods have been proposed for the numerous NLP tasks (Liu et al, 2019b; Gordon and Duh, 2019; Haidar and Rezagholizadeh, 2019; Yang et al, 2020d; Tang et al, 2019; Hu et al, 2018; Nakashole and Flauger, 2017; Jiao et al, 2020; Wang et al, 2018d; Zhou et al, 2019a; Sanh et al, 2019; Turc et al, 2019; Arora et al, 2019; Clark et al, 2019; Kim and Rush, 2016; Mou et al, 2016; Liu et al, 2019f; Hahn and Choi, 2019; Kuncoro et al, 2016; Cui et al, 2017; Wei et al, 2019; Freitag et al, 2017; Shakeri et al, 2019; Aguilar et al, 2020; Fu et al, 2021; Yang et al, 2020d; Zhang et al, 2021b; Chen et al, 2020b; Wang and Du, 2021). The NLP tasks that use KD include neural machine translation (NMT) (Hahn and Choi, 2019; Zhou et al, 2019a; Li et al, 2021; Kim and Rush, 2016; Gordon and Duh, 2019; Wei et al, 2019; Freitag et al, 2017; Zhang et al, 2021b), text generation (Chen et al, 2020b; Haidar and Rezagholizad...…”
Section: KD in NLP
confidence: 99%
“…Knowledge distillation is extensively studied in the field of natural language processing (NLP) in order to obtain lightweight, efficient, and effective language models. More and more KD methods have been proposed for the numerous NLP tasks (Liu et al, 2019b; Gordon and Duh, 2019; Haidar and Rezagholizadeh, 2019; Yang et al, 2020b; Tang et al, 2019; Hu et al, 2018; Nakashole and Flauger, 2017; Jiao et al, 2019; Wang et al, 2018c; Zhou et al, 2019a; Sanh et al, 2019; Turc et al, 2019; Arora et al, 2019; Clark et al, 2019; Kim and Rush, 2016; Mou et al, 2016; Liu et al, 2019e; Hahn and Choi, 2019; Kuncoro et al, 2016; Cui et al, 2017; Wei et al, 2019; Freitag et al, 2017; Shakeri et al, 2019; Aguilar et al, 2020). The NLP tasks that use KD include neural machine translation (NMT) (Hahn and Choi, 2019; Zhou et al, 2019a; Kim and Rush, 2016; Wei et al, 2019; Freitag et al, 2017; Gordon and Duh, 2019), question answering (Wang et al, 2018c; Arora et al, 2019; Yang et al, 2020b; Hu et al, 2018), document retrieval (Shakeri et al, 2019), event detection (Liu et al, 2019b), text generation (Haidar and Rezagholizadeh, 2019)...…”
Section: KD in NLP
confidence: 99%
“…In natural language processing, neural machine translation is one of the most prominent applications. Many extended knowledge distillation methods have been proposed for neural machine translation (Hahn and Choi, 2019; Zhou et al, 2019a; Kim and Rush, 2016; Gordon and Duh, 2019; Wei et al, 2019; Freitag et al, 2017; …). In (Zhou et al, 2019a), an empirical analysis of how knowledge distillation affects non-autoregressive machine translation (NAT) models was presented.…”
Section: KD in NLP
confidence: 99%
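Several of the NMT distillation methods cited in the statement above build on sequence-level knowledge distillation (Kim and Rush, 2016), in which the student is trained on the teacher's own translations of the training sources rather than on the reference targets. The sketch below shows that data-construction step; the teacher_model.translate() helper and all names are illustrative assumptions, not any specific toolkit's API.

def build_distilled_corpus(teacher_model, source_sentences, beam_size=5):
    """Pair every training source with the teacher's best beam-search
    hypothesis to form the student's training data."""
    distilled_pairs = []
    for src in source_sentences:
        hyp = teacher_model.translate(src, beam_size=beam_size)  # assumed helper
        distilled_pairs.append((src, hyp))
    return distilled_pairs

# The student (for example a smaller or non-autoregressive model, as analysed
# in Zhou et al, 2019a) is then trained with ordinary cross-entropy on these
# distilled pairs instead of the original references.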
“…These methods slow down the learning of parameters that are important for previous tasks. Another well-known regularization-based approach is knowledge distillation [64], which preserves the predictions produced by the model learned on the previous task. This method creates a forward transfer of knowledge from a large network (teacher) to a small network (student) [65], such that the student learns to follow the predictions of the teacher.…”
Section: Regularization
confidence: 99%
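As a minimal illustration of the teacher-student objective described in the statement above, the sketch below combines a soft-target term (KL divergence between temperature-softened teacher and student distributions) with the usual hard-label cross-entropy, in the style of Hinton-type distillation; the temperature and alpha values are illustrative assumptions rather than settings from the cited works.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_targets,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of the soft-target KL term and the hard-label
    cross-entropy term used to train the student."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_p_student, p_teacher,
                         reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, hard_targets)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss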