An Embarrassingly Simple Approach for Knowledge Distillation
Preprint, 2018
DOI: 10.48550/arxiv.1812.01819

Abstract: Knowledge Distillation (KD) aims at improving the performance of a low-capacity student model by inheriting knowledge from a high-capacity teacher model. Previous KD methods typically train a student by minimizing a task-related loss and the KD loss simultaneously, using a pre-defined loss weight to balance these two terms. In this work, we propose to first transfer the backbone knowledge from a teacher to the student, and then only learn the task-head of the student network. Such a decomposition of the training…
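To make the contrast concrete, here is a minimal PyTorch-style sketch comparing the conventional single-stage objective (task loss plus a weighted KD loss) with the two-stage decomposition described in the abstract. The loss weight `alpha`, temperature `T`, and the use of an MSE feature-matching loss in stage one are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Conventional KD: task loss and KD loss are minimized jointly,
# balanced by a pre-defined weight (alpha and T are assumed values).
def joint_kd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=4.0):
    task_loss = F.cross_entropy(student_logits, labels)
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * task_loss + alpha * kd_loss

# Two-stage decomposition sketched in the abstract:
# stage 1 aligns the student's backbone features with the teacher's,
# stage 2 then trains only the student's task head on the labels.
def stage1_backbone_loss(student_feat, teacher_feat):
    return F.mse_loss(student_feat, teacher_feat)

def stage2_head_loss(head_logits, labels):
    return F.cross_entropy(head_logits, labels)
```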

Cited by 7 publications (11 citation statements)
References 15 publications (39 reference statements)
“…Zagoruyko and Komodakis [22] averaged the feature map across the channel dimension to obtain a spatial attention map, Yim et al [21] defined inter-layer flow by computing the inner product of two feature maps, and Lee et al [14] improved this idea with singular value decomposition (SVD). A recent work [7] demonstrated the effectiveness of mimicking feature maps directly in the KD task.…”
Section: Related Work
confidence: 99%
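The channel-averaging idea attributed to Zagoruyko and Komodakis [22] in the quote above can be sketched as follows; the exact normalization and the L2 matching loss are assumptions for illustration, not the cited papers' precise definitions.

```python
import torch
import torch.nn.functional as F

def spatial_attention_map(feature_map: torch.Tensor) -> torch.Tensor:
    # Collapse an (N, C, H, W) feature map into an (N, H*W) spatial attention
    # map by averaging over the channel dimension, then L2-normalizing per sample.
    attn = feature_map.mean(dim=1)    # (N, H, W): channel average
    attn = attn.flatten(1)            # (N, H*W)
    return F.normalize(attn, p=2, dim=1)

def attention_transfer_loss(student_feat, teacher_feat):
    # Match student and teacher spatial attention maps with a squared-error distance.
    return (spatial_attention_map(student_feat) -
            spatial_attention_map(teacher_feat)).pow(2).mean()
```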
“…In addition to initialization, some methods combine KD with other techniques to transfer knowledge from T to S more efficiently. Belagiannis et al [3] introduced adversarial learning into KD by employing a discriminator to tell whether the outputs of S and T are close enough, Ashok et al [1] exploited reinforcement learning to find the best network structure for S under the guidance of T, and Wang et al [19] and Gao et al [7] drew on progressive learning to transfer knowledge step by step. Nevertheless, all of the above methods use a single model, S, to learn from T, and the knowledge is distilled only once.…”
Section: Related Work
confidence: 99%
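As a hedged illustration of the adversarial variant mentioned above (Belagiannis et al [3]), the sketch below trains a discriminator to distinguish teacher logits from student logits while the student is updated to fool it; the discriminator architecture, the placeholder `num_classes`, and the BCE objective are illustrative assumptions rather than the cited paper's exact setup.

```python
import torch
import torch.nn as nn

num_classes = 10  # assumed placeholder for the task's output dimension
discriminator = nn.Sequential(
    nn.Linear(num_classes, 64), nn.ReLU(), nn.Linear(64, 1)
)
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(student_logits, teacher_logits):
    # The discriminator learns to label teacher outputs as 1 and student outputs as 0.
    real = bce(discriminator(teacher_logits.detach()),
               torch.ones(teacher_logits.size(0), 1))
    fake = bce(discriminator(student_logits.detach()),
               torch.zeros(student_logits.size(0), 1))
    return real + fake

def student_adversarial_loss(student_logits):
    # The student is updated so its outputs become indistinguishable from the teacher's.
    return bce(discriminator(student_logits),
               torch.ones(student_logits.size(0), 1))
```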
“…Thus, adopting the teacher's knowledge as supervision guides the student to be more discriminative. To improve transfer efficiency, many recent related papers focus on designing different kinds of knowledge [1,5,14,15,17,23,24,32,33,39,41] or on extending training strategies [7,10,11,22,28,33,36,37,38,40,42,43]. These works have obtained positive results.…”
Section: Introduction
confidence: 99%