Residual error based knowledge distillation (2021)
DOI: 10.1016/j.neucom.2020.10.113

Cited by 32 publications (19 citation statements)
References 21 publications (24 reference statements)
“…The model capacity gap between the large deep neural network and a small student neural network can degrade knowledge transfer (Mirzadeh et al, 2020;Gao et al, 2021). To effectively transfer knowledge to student networks, a variety of methods have been proposed for a controlled reduction of the model complexity (Zhang et al, 2018b;Nowak and Corso, 2018;Crowley et al, 2018;Liu et al, 2019a,i;Wang et al, 2018a;Gu and Tresp, 2020).…”
Section: Teacher-student Architecture (mentioning)
confidence: 99%
“…However, some recent studies have argued a different view. [27] and [10] found that a large capacity gap between teacher and student may hinder knowledge transfer, and introduced assistant networks to narrow the gap. [29] proposed learning a student-friendly teacher by plugging in student branches during the training procedure.…”
Section: B. Experience Ensemble Knowledge Distillation (mentioning)
confidence: 99%
“…Some researchers find that KD can lead students to suboptimal converged performance when the accuracy gap between teacher and student is too large [15,31]. Moreover, [1,14] have shown that the very early phase of training is essential for the network, so such a severe gap may damage overall performance during these critical early phases.…”
Section: Analysis of Training (mentioning)
confidence: 99%
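The excerpts above all discuss how a capacity gap affects knowledge distillation. For context, the distillation objective these works build on is typically the standard Hinton-style soft-target loss: a temperature-softened KL term between teacher and student outputs plus a cross-entropy term on the true label. A minimal pure-Python sketch (function names, temperature `T`, and weight `alpha` are illustrative defaults, not values from the paper):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    z = [l / T for l in logits]
    m = max(z)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.9):
    """Hinton-style distillation loss for a single example:
    alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, label).
    The T^2 factor keeps gradient magnitudes comparable across temperatures."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_t, p_s))
    ce = -math.log(softmax(student_logits)[label])
    return alpha * T * T * kl + (1 - alpha) * ce

# Example: a confident teacher distilling into a weaker student.
teacher = [4.0, 1.0, 0.2]
student = [2.0, 1.5, 0.5]
loss = kd_loss(student, teacher, label=0)
```

When the teacher is much stronger than the student, its softened distribution is still hard for the student to match, which is exactly the capacity-gap failure mode the cited works ([27], [10], [29]) try to mitigate with assistant networks or student-friendly teachers.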