Proceedings - Natural Language Processing in a Deep Learning World 2019
DOI: 10.26615/978-954-452-056-4_050

Self-Knowledge Distillation in Natural Language Processing

Abstract: Since deep learning became a key player in natural language processing (NLP), many deep learning models have shown remarkable performance on a variety of NLP tasks, in some cases even outperforming humans. Such high performance can be explained by the efficient knowledge representation of deep learning models. While many methods have been proposed to learn more efficient representations, knowledge distillation from pretrained deep networks suggests that we can use more information from the soft…
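The abstract refers to the extra information carried by a network's soft predictions. As background, the sketch below shows the standard soft-target distillation loss in the sense of Hinton et al. (2015), not the paper's own formulation; the function name, temperature, and mixing weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a KL term that matches the
    student's temperature-softened distribution to the teacher's.
    `temperature` and `alpha` are illustrative hyperparameters."""
    # Ordinary supervised loss on the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft targets: both distributions are smoothed by the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd
```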

Cited by 68 publications (29 citation statements)
References 13 publications
“…Predictions on different samples belonging to the same class are distilled to mitigate overconfident predictions. Self distillation has also been used in natural language processing [33]. Supervising the same model at different depths is explored in [34].…”
Section: Self KD Methods (mentioning, confidence: 99%)
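The class-wise variant quoted above distills between samples that share a label. The following is only a generic illustration of that idea under assumed names and hyperparameters; it is not the exact method of references [33] or [34], whose details are not given here.

```python
import torch
import torch.nn.functional as F

def class_wise_self_distillation(model, x_a, x_b, labels,
                                 temperature=4.0, beta=1.0):
    """Illustrative class-wise self-distillation step: x_a and x_b hold
    different samples with the same labels, so the model's detached
    prediction on x_b acts as the soft target for x_a."""
    logits_a = model(x_a)
    with torch.no_grad():               # the "teacher" is the same model, frozen for this step
        logits_b = model(x_b)
    ce = F.cross_entropy(logits_a, labels)          # ordinary supervised loss
    p_b = F.softmax(logits_b / temperature, dim=-1)
    log_p_a = F.log_softmax(logits_a / temperature, dim=-1)
    kd = F.kl_div(log_p_a, p_b, reduction="batchmean") * temperature ** 2
    return ce + beta * kd
```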
“…Compared to direct training, knowledge distillation provides a more stable training process which leads to better performing student models (Hinton et al., 2015; Phuong and Lampert, 2019). Recent work (Furlanello et al., 2018; Hahn and Choi, 2019) also sheds light on leveraging knowledge distillation for training a high-performing student model with the same size as the teacher (see the discussion in the next section).…”
Section: Knowledge Distillation (mentioning, confidence: 99%)
“…For the Seq2Seq model, Kim and Rush (2016) proposes to use the generated sequences as the sequence-level knowledge to guide the student network training. Moreover, self-knowledge distillation (Hahn and Choi, 2019) even shows that knowledge (representations) from the student network itself can improve the performance.…”
Section: Knowledge Distillation (mentioning, confidence: 99%)
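The last statement notes that supervision can come from the student network itself. The sketch below shows only the generic idea of blending the model's own detached, softened prediction into the training target; it is an assumption-laden simplification, not the specific way Hahn and Choi (2019) construct their soft targets, and all names and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(logits, labels, temperature=2.0, mix=0.3):
    """Illustrative self-knowledge distillation: the model's own detached,
    temperature-softened prediction is mixed into the one-hot target, so the
    network supplies part of its own supervision signal."""
    num_classes = logits.size(-1)
    hard = F.one_hot(labels, num_classes).float()
    with torch.no_grad():                       # soft target comes from the model itself
        soft = F.softmax(logits / temperature, dim=-1)
    target = (1.0 - mix) * hard + mix * soft    # blended soft label
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```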