Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.485
BERT Learns to Teach: Knowledge Distillation with Meta Learning


Cited by 30 publications (26 citation statements); references 0 publications.
“…Teacher assistant-based distillation [14,15,17] has been shown to trade teacher scale for student performance by inserting an intermediate-scale teacher assistant. This is also supported by other work finding that better student performance can be attained with a slightly lower teacher learning capacity [52]. However, setting the teacher assistant to a small scale while keeping its performance high enough for the student is nontrivial.…”
Section: Related Work (supporting)
confidence: 57%
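To make the teacher-assistant idea in the statement above concrete, here is a minimal sketch of two-stage distillation (teacher → assistant → student). The model construction, temperature, loss weighting, and the `kd_loss`/`distill` helpers are illustrative assumptions, not the setup of the cited works.

```python
# Minimal sketch of two-stage teacher-assistant distillation (illustrative
# hyperparameters; not the configuration of the cited works).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Soft-label KD loss: KL divergence between temperature-softened outputs."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T ** 2

def distill(teacher, student, loader, epochs=1, lr=1e-3, alpha=0.5):
    """Train `student` to match `teacher`'s soft labels plus the hard labels."""
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            loss = alpha * kd_loss(s_logits, t_logits) \
                 + (1 - alpha) * F.cross_entropy(s_logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student

# Two stages: large teacher -> intermediate assistant -> small student.
# (teacher, assistant, student, train_loader are hypothetical placeholders.)
# assistant = distill(teacher, assistant, train_loader)
# student = distill(assistant, student, train_loader)
```

The same `distill` routine is reused for both stages; the difficulty the citing authors point to is choosing an assistant small enough to be cheap yet strong enough to still help the student.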
“…To further illustrate the advantage of our method, we compare it against current typical distillation methods, following previous work [41], as shown in Table 2. The baseline results in Table 2 are those reported by [51].…”
Section: Results on CIFAR-100 (mentioning)
confidence: 99%
“…In Table 4, we further explore the effect of PESF-KD on an NLP dataset; the other baselines are those reported by [51]. Time in Table 4 refers to the training resource cost, which is lowest for our PESF-KD among all baselines except vanilla KD.…”
Section: Results on GLUE (mentioning)
confidence: 99%
“…In this work, we aim to leverage meta-learning in a more flexible manner by refining the pseudo-labels instead of reweighting. Approach-wise, the most related works are (Pham et al., 2021; Zhou et al., 2022) from computer vision and model distillation, respectively, which also refine the teacher's parameters from student feedback. However, they work with samples from clean distributions, while we anticipate the noise memorization effect and enhance our framework with teacher warm-up and confidence filtering to suppress error propagation.…”
Section: Related Work (mentioning)
confidence: 99%
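As a rough illustration of the "refine the teacher from student feedback" loop that these citing works attribute to Pham et al. (2021) and Zhou et al. (2022), the sketch below takes one differentiable student update on the teacher's soft labels and then backpropagates the updated student's loss on a held-out quiz batch into the teacher. The use of `torch.func.functional_call`, the single inner SGD step, and the quiz-batch feedback signal are assumptions for illustration, not the exact procedure of either paper.

```python
# Minimal sketch of distillation with teacher updates from student feedback
# (one differentiable inner step; illustrative, not the exact published method).
import torch
import torch.nn.functional as F
from torch.func import functional_call

def kd_loss(student_logits, teacher_logits, T=2.0):
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T ** 2

def teacher_feedback_loss(teacher, student, x_train, x_quiz, y_quiz, inner_lr=1e-3):
    """Return a loss whose gradient w.r.t. the teacher reflects how much the
    teacher's soft labels improve the student on a held-out quiz batch."""
    # 1) Simulated (differentiable) one-step student update on the KD loss.
    t_logits = teacher(x_train)              # keep the graph: teacher receives gradients
    inner_loss = kd_loss(student(x_train), t_logits)
    names, params = zip(*student.named_parameters())
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    updated = {n: p - inner_lr * g for n, p, g in zip(names, params, grads)}

    # 2) Evaluate the hypothetically-updated student on the quiz batch.
    quiz_logits = functional_call(student, updated, (x_quiz,))
    return F.cross_entropy(quiz_logits, y_quiz)   # backprop flows into the teacher

# Usage (hypothetical models, optimizer, and batches):
# loss = teacher_feedback_loss(teacher, student, x_train, x_quiz, y_quiz)
# teacher_opt.zero_grad(); loss.backward(); teacher_opt.step()
# ...then take an ordinary distillation step on the student with the updated teacher.
```

Note that `loss.backward()` also leaves gradients on the student's parameters; only the teacher optimizer steps here, and the student is trained separately with a standard distillation objective.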