An Embarrassingly Simple Approach for Knowledge Distillation
Preprint, 2018
DOI: 10.48550/arxiv.1812.01819

Abstract: Knowledge Distillation (KD) aims at improving the performance of a low-capacity student model by inheriting knowledge from a high-capacity teacher model. Previous KD methods typically train a student by minimizing a task-related loss and the KD loss simultaneously, using a pre-defined loss weight to balance these two terms. In this work, we propose to first transfer the backbone knowledge from a teacher to the student, and then only learn the task-head of the student network. Such a decomposition of the training…
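To make the contrast concrete, here is a minimal PyTorch-style sketch comparing the conventional single-stage objective (task loss plus a weighted KD loss) with the two-stage decomposition described in the abstract. The loss weight `alpha`, temperature `T`, and the use of an MSE feature-matching loss in stage one are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Conventional KD: task loss and KD loss are minimized jointly,
# balanced by a pre-defined weight (alpha and T are assumed values).
def joint_kd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=4.0):
    task_loss = F.cross_entropy(student_logits, labels)
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * task_loss + alpha * kd_loss

# Two-stage decomposition sketched in the abstract:
# stage 1 aligns the student's backbone features with the teacher's,
# stage 2 then trains only the student's task head on the labels.
def stage1_backbone_loss(student_feat, teacher_feat):
    return F.mse_loss(student_feat, teacher_feat)

def stage2_head_loss(head_logits, labels):
    return F.cross_entropy(head_logits, labels)
```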

Cited by 7 publications (11 citation statements)
References 15 publications (39 reference statements)
“…Zagoruyko and Komodakis [22] averaged the feature map across the channel dimension to obtain a spatial attention map, Yim et al [21] defined inter-layer flow by computing the inner product of two feature maps, and Lee et al [14] improved this idea with singular value decomposition (SVD). A recent work [7] demonstrated the effectiveness of mimicking feature maps directly in the KD task.…”
Section: Related Work
confidence: 99%
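The channel-averaging idea attributed to Zagoruyko and Komodakis [22] in the quote above can be sketched as follows; the exact normalization and the L2 matching loss are assumptions for illustration, not the cited papers' precise definitions.

```python
import torch
import torch.nn.functional as F

def spatial_attention_map(feature_map: torch.Tensor) -> torch.Tensor:
    # Collapse an (N, C, H, W) feature map into an (N, H*W) spatial attention
    # map by averaging over the channel dimension, then L2-normalizing per sample.
    attn = feature_map.mean(dim=1)    # (N, H, W): channel average
    attn = attn.flatten(1)            # (N, H*W)
    return F.normalize(attn, p=2, dim=1)

def attention_transfer_loss(student_feat, teacher_feat):
    # Match student and teacher spatial attention maps with a squared-error distance.
    return (spatial_attention_map(student_feat) -
            spatial_attention_map(teacher_feat)).pow(2).mean()
```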
“…In addition to initialization, some methods combine KD with other techniques to transfer knowledge from T to S more efficiently. Belagiannis et al [3] introduced adversarial learning into KD by employing a discriminator to tell whether the outputs of S and T are close enough, Ashok et al [1] exploited reinforcement learning to find the best network structure for S under the guidance of T, and Wang et al [19] and Gao et al [7] drew on progressive learning to transfer knowledge step by step. Nevertheless, all of the above methods use a single model, S, to learn from T, and the knowledge is distilled only once.…”
Section: Related Work
confidence: 99%
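As a hedged illustration of the adversarial variant mentioned above (Belagiannis et al [3]), the sketch below trains a discriminator to distinguish teacher logits from student logits while the student is updated to fool it; the discriminator architecture, the placeholder `num_classes`, and the BCE objective are illustrative assumptions rather than the cited paper's exact setup.

```python
import torch
import torch.nn as nn

num_classes = 10  # assumed placeholder for the task's output dimension
discriminator = nn.Sequential(
    nn.Linear(num_classes, 64), nn.ReLU(), nn.Linear(64, 1)
)
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(student_logits, teacher_logits):
    # The discriminator learns to label teacher outputs as 1 and student outputs as 0.
    real = bce(discriminator(teacher_logits.detach()),
               torch.ones(teacher_logits.size(0), 1))
    fake = bce(discriminator(student_logits.detach()),
               torch.zeros(student_logits.size(0), 1))
    return real + fake

def student_adversarial_loss(student_logits):
    # The student is updated so its outputs become indistinguishable from the teacher's.
    return bce(discriminator(student_logits),
               torch.ones(student_logits.size(0), 1))
```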
“…Thus, adopting the teacher's knowledge as supervision guides the student to be more discriminative. To improve transfer efficiency, many recent related papers focus on designing different kinds of knowledge [1,5,14,15,17,23,24,32,33,39,41] or on extending training strategies [7,10,11,22,28,33,36,37,38,40,42,43]. These works have obtained positive results.…”
Section: Introduction
confidence: 99%