2020
DOI: 10.1016/j.neucom.2020.07.048

Adaptive multi-teacher multi-level knowledge distillation

Abstract: Knowledge distillation (KD) is an effective learning paradigm for improving the performance of lightweight student networks by utilizing additional supervision knowledge distilled from teacher networks. Most pioneering studies either learn from only a single teacher in their distillation learning methods, neglecting the potential that a student can learn from multiple teachers simultaneously, or simply treat each teacher to be equally important, unable to reveal the different importance of teachers for specifi…
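The abstract describes the standard knowledge-distillation setup in which a lightweight student is trained with additional supervision from a teacher's softened outputs. As a point of reference, the following is a minimal sketch of that soft-label distillation loss in PyTorch; the function name and the `temperature` and `alpha` hyper-parameters are illustrative assumptions, not values taken from the paper itself.

```python
# Minimal sketch of a soft-label knowledge distillation loss.
# Names and hyper-parameters are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, temperature=4.0, alpha=0.5):
    """Combine hard-label cross-entropy with softened teacher supervision."""
    # Cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, targets)
    # KL divergence between temperature-softened student and teacher outputs.
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # alpha balances the distillation term against the hard-label term.
    return alpha * kl + (1.0 - alpha) * ce
```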


Cited by 111 publications (47 citation statements) · References 22 publications

“…The accuracy difference between the first and the last model and between the second and the last model was 0.49% and 0.35%, respectively. The proposed method was compared to three knowledge distillation methods: DML [12], AvgMKD [27], and AMTML-KD [14], as shown in Table 3. Each method was used to train three student models on the CIFAR-10, CIFAR-100, and TinyImageNet datasets.…”
Section: Results on CIFAR-10, CIFAR-100, and TinyImageNet
confidence: 99%
“…However, by treating each teacher equally, the differences between teacher models could be lost. In [14], the authors proposed an adaptive multi-teacher knowledge distillation method (named AMTML-KD) that extended the previous method by adding an adaptive weight for each teacher model and transferring intermediate-level knowledge from the hidden layers of the teacher models to the student models.…”
Section: Related Work
confidence: 99%
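The citation above describes the core idea of adaptive multi-teacher distillation: instead of fixed, equal weights, each teacher receives a learned weight when its soft targets are combined. Below is a hedged sketch of one way such per-sample adaptive weighting could be implemented; the gating module, its inputs, and the hyper-parameters are assumptions for illustration and not the exact AMTML-KD formulation.

```python
# Sketch of adaptive multi-teacher soft-label distillation: a small gating
# module produces per-sample weights over the teachers, and the student is
# distilled from the weighted mixture of their soft targets.
# The gating design and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherGate(nn.Module):
    """Maps a student feature vector to one weight per teacher."""
    def __init__(self, feat_dim, num_teachers):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_teachers)

    def forward(self, student_feat):
        # Softmax so the per-sample teacher weights sum to one.
        return F.softmax(self.fc(student_feat), dim=1)

def multi_teacher_kd_loss(student_logits, student_feat, teacher_logits_list,
                          gate, temperature=4.0):
    weights = gate(student_feat)                                  # (batch, num_teachers)
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    # Stack teacher distributions: (batch, num_teachers, num_classes).
    soft_teachers = torch.stack(
        [F.softmax(t / temperature, dim=1) for t in teacher_logits_list], dim=1)
    # Per-sample weighted mixture of teacher soft targets.
    mixed_target = (weights.unsqueeze(-1) * soft_teachers).sum(dim=1)
    return F.kl_div(soft_student, mixed_target,
                    reduction="batchmean") * temperature ** 2
```

The design point is that the weights are computed per example, so the student can rely on different teachers for different inputs rather than trusting all teachers uniformly.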
“…However, knowledge capacity and diversity may be constrained in some instances, such as cross-model KD [51]. To cope with this issue, the training of a portable student network by several teachers has been investigated [52]. In this setting, a student learns to perform the same or a different task from several teachers, rather than from just one.…”
Section: B. Multi-teacher KD
confidence: 99%
“…You et al. (2017) average the soft labels of multiple teachers and propose to transfer relative dissimilarity among intermediate representations, using teacher voting to select the best ordering relationships. Liu et al. (2020) combine the soft labels of multiple teachers with learnable weights, distill structural knowledge between data examples, and transfer intermediate layer representations, making each teacher responsible for a specific group of layers in the student network. Both papers relate to the Computer Vision field, both use models with different architectures as teachers, and both show that 5 teachers are better than 3 for their methods (in terms of classification accuracy), but not better for the original knowledge distillation.…”
Section: Related Work
confidence: 99%
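The statement above mentions transferring intermediate layer representations with each teacher responsible for a specific group of student layers. The following is a rough sketch of how such a layer-to-teacher assignment with a feature-matching loss could look; the adapter convolutions, the assignment structure, and the assumption that adapted student features match the teacher feature shapes are all illustrative, not the exact setup of Liu et al. (2020).

```python
# Sketch of intermediate-level distillation where each teacher supervises a
# specific group of student layers via an L2 feature-matching term.
# Data layout and adapter design are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def intermediate_kd_loss(student_feats, teacher_feats, assignment, adapters):
    """
    student_feats: list of student feature maps, one per selected layer.
    teacher_feats: dict teacher_idx -> list of teacher feature maps, aligned
        with the student layers assigned to that teacher.
    assignment: dict teacher_idx -> list of student layer indices it supervises.
    adapters: nn.ModuleList of 1x1 convs projecting student channels so that
        adapted student features match the teacher feature shapes.
    """
    loss = torch.tensor(0.0)
    for t_idx, layer_ids in assignment.items():
        for j, l_idx in enumerate(layer_ids):
            s = adapters[l_idx](student_feats[l_idx])   # project student features
            t = teacher_feats[t_idx][j]                 # matching teacher features
            loss = loss + F.mse_loss(s, t)              # feature-matching term
    return loss
```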