2020
DOI: 10.1016/j.neucom.2020.07.048

Adaptive multi-teacher multi-level knowledge distillation

Abstract: Knowledge distillation (KD) is an effective learning paradigm for improving the performance of lightweight student networks by utilizing additional supervision knowledge distilled from teacher networks. Most pioneering studies either learn from only a single teacher in their distillation learning methods, neglecting the potential that a student can learn from multiple teachers simultaneously, or simply treat each teacher to be equally important, unable to reveal the different importance of teachers for specifi…
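The abstract describes the standard knowledge-distillation setup in which a lightweight student is trained with additional supervision from a teacher's softened outputs. As a point of reference, the following is a minimal sketch of that soft-label distillation loss in PyTorch; the function name and the `temperature` and `alpha` hyper-parameters are illustrative assumptions, not values taken from the paper itself.

```python
# Minimal sketch of a soft-label knowledge distillation loss.
# Names and hyper-parameters are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, temperature=4.0, alpha=0.5):
    """Combine hard-label cross-entropy with softened teacher supervision."""
    # Cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, targets)
    # KL divergence between temperature-softened student and teacher outputs.
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # alpha balances the distillation term against the hard-label term.
    return alpha * kl + (1.0 - alpha) * ce
```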


Cited by 111 publications (47 citation statements) · References 22 publications

“…The accuracy difference between the first and the last model and between the second and the last model was 0.49% and 0.35%, respectively. The proposed method was compared to three knowledge distillation methods: DML [12], AvgMKD [27], and AMTML-KD [14], as shown in Table 3. Each method was used to train three student models on the CIFAR-10, CIFAR-100, and TinyImageNet datasets.…”
Section: Results on CIFAR-10, CIFAR-100, and TinyImageNet
confidence: 99%
“…However, by treating each teacher equally, the differences between teacher models could be lost. In [14], the authors proposed an adaptive multi-teacher knowledge distillation method (named AMTML-KD) that extended the previous method by adding an adaptive weight for each teacher model and transferring intermediate-level knowledge from the hidden layers of the teacher models to the student models.…”
Section: Related Work
confidence: 99%
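The citation above describes the core idea of adaptive multi-teacher distillation: instead of fixed, equal weights, each teacher receives a learned weight when its soft targets are combined. Below is a hedged sketch of one way such per-sample adaptive weighting could be implemented; the gating module, its inputs, and the hyper-parameters are assumptions for illustration and not the exact AMTML-KD formulation.

```python
# Sketch of adaptive multi-teacher soft-label distillation: a small gating
# module produces per-sample weights over the teachers, and the student is
# distilled from the weighted mixture of their soft targets.
# The gating design and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherGate(nn.Module):
    """Maps a student feature vector to one weight per teacher."""
    def __init__(self, feat_dim, num_teachers):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_teachers)

    def forward(self, student_feat):
        # Softmax so the per-sample teacher weights sum to one.
        return F.softmax(self.fc(student_feat), dim=1)

def multi_teacher_kd_loss(student_logits, student_feat, teacher_logits_list,
                          gate, temperature=4.0):
    weights = gate(student_feat)                                  # (batch, num_teachers)
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    # Stack teacher distributions: (batch, num_teachers, num_classes).
    soft_teachers = torch.stack(
        [F.softmax(t / temperature, dim=1) for t in teacher_logits_list], dim=1)
    # Per-sample weighted mixture of teacher soft targets.
    mixed_target = (weights.unsqueeze(-1) * soft_teachers).sum(dim=1)
    return F.kl_div(soft_student, mixed_target,
                    reduction="batchmean") * temperature ** 2
```

The design point is that the weights are computed per example, so the student can rely on different teachers for different inputs rather than trusting all teachers uniformly.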
“…However, knowledge capacity and diversity may be constrained in some instances, such as cross-model KD [51]. To cope with this issue, the training of a portable student network by several teachers has been investigated [52]. In this setting, a student learns to perform the same or a different task from several teachers, rather than from just one.…”
Section: B. Multi-teacher KD
confidence: 99%
“…You et al. (2017) average the soft labels of multiple teachers and propose to transfer relative dissimilarity among intermediate representations, using teacher voting to select the best ordering relationships. Liu et al. (2020) combine the soft labels of multiple teachers with learnable weights, distill structural knowledge between data examples, and transfer intermediate layer representations, making each teacher responsible for a specific group of layers in the student network. Both papers relate to the Computer Vision field, both use models with different architectures as teachers, and both show that 5 teachers are better than 3 for their methods (in terms of classification accuracy), but not better for the original knowledge distillation.…”
Section: Related Work
confidence: 99%
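The statement above mentions transferring intermediate layer representations with each teacher responsible for a specific group of student layers. The following is a rough sketch of how such a layer-to-teacher assignment with a feature-matching loss could look; the adapter convolutions, the assignment structure, and the assumption that adapted student features match the teacher feature shapes are all illustrative, not the exact setup of Liu et al. (2020).

```python
# Sketch of intermediate-level distillation where each teacher supervises a
# specific group of student layers via an L2 feature-matching term.
# Data layout and adapter design are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def intermediate_kd_loss(student_feats, teacher_feats, assignment, adapters):
    """
    student_feats: list of student feature maps, one per selected layer.
    teacher_feats: dict teacher_idx -> list of teacher feature maps, aligned
        with the student layers assigned to that teacher.
    assignment: dict teacher_idx -> list of student layer indices it supervises.
    adapters: nn.ModuleList of 1x1 convs projecting student channels so that
        adapted student features match the teacher feature shapes.
    """
    loss = torch.tensor(0.0)
    for t_idx, layer_ids in assignment.items():
        for j, l_idx in enumerate(layer_ids):
            s = adapters[l_idx](student_feats[l_idx])   # project student features
            t = teacher_feats[t_idx][j]                 # matching teacher features
            loss = loss + F.mse_loss(s, t)              # feature-matching term
    return loss
```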