2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00201
A Comprehensive Overhaul of Feature Distillation

Abstract: We investigate the design aspects of feature distillation methods achieving network compression and propose a novel feature distillation method in which the distillation loss is designed to make a synergy among various aspects: teacher transform, student transform, distillation feature position and distance function. Our proposed distillation loss includes a feature transform with a newly designed margin ReLU, a new distillation feature position, and a partial L2 distance function to skip redundant information…
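The two loss ingredients named in the abstract, the margin ReLU (teacher transform) and the partial L2 distance, can be sketched compactly. The snippet below is a minimal, illustrative reconstruction in PyTorch, not the authors' released code: it assumes NCHW feature tensors and a precomputed per-channel margin vector (in the paper the margin is estimated from the teacher's pre-ReLU statistics), and the function names `margin_relu` and `partial_l2_loss` are placeholders.

```python
import torch

# Illustrative sketch (assumed names, not the paper's released code).
# margin: per-channel margin vector of shape (C,), typically negative,
# estimated from the teacher's pre-ReLU activation statistics.

def margin_relu(teacher_feat, margin):
    # Teacher transform: element-wise max(x, m) with a per-channel margin m.
    return torch.max(teacher_feat, margin.view(1, -1, 1, 1))

def partial_l2_loss(student_feat, teacher_feat):
    # Partial L2 distance: skip positions where the teacher target is negative
    # and the student response is already below it; squared error elsewhere.
    sq_err = (student_feat - teacher_feat) ** 2
    skip = (teacher_feat <= 0) & (student_feat <= teacher_feat)
    return torch.where(skip, torch.zeros_like(sq_err), sq_err).sum()
```

Per the paper, the student feature is taken before the ReLU (the new distillation feature position) and passed through a learned regressor before this distance is computed.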

Cited by 362 publications (208 citation statements)
References 16 publications
“…Network Compression. Generally, compression methods can be categorized into five types: quantization [3,31,7,43,1], knowledge distillation [14,23,41,53,32], low-rank decomposition [38,6,22,56], weight sparsification [10,26,51], and filter pruning [34,27,13,40]. Quantization methods accelerate deep CNNs by replacing high-precision floating-point operations with low-precision fixed-point ones, which usually incurs a significant accuracy drop.…”
Section: Sparsity
confidence: 99%
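As a concrete illustration of the quantization idea described in this excerpt, the sketch below shows generic symmetric per-tensor int8 quantization in PyTorch; it is not the method of any of the cited works, and the function names are placeholders.

```python
import torch

# Generic symmetric linear quantization: map a float tensor to int8 with a
# per-tensor scale, then dequantize to inspect the approximation error.

def quantize_int8(x):
    scale = x.abs().max() / 127.0                      # per-tensor scale factor
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

x = torch.randn(4, 8)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)
print((x - x_hat).abs().max())                         # error is bounded by ~scale/2
```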
“…Many methods have been proposed to minimize the performance gap between a student and a teacher. We discuss different forms of knowledge in the following categories: response-based knowledge [26,27,35], feature-based knowledge [28,29,36-45], and relation-based knowledge [30,31,46-49].…”
Section: Knowledge Distillation
confidence: 99%
“…Ahn et al. proposed Variational Information Distillation (VID) [41], which maximizes a lower bound on the mutual information between the student network and the teacher network. Heo et al. proposed Overhaul of Feature Distillation (OFD) [42] to transfer the magnitude of the feature response, which contains both the activation status of each neuron and the feature information. Wang et al. proposed Attentive Feature Distillation (AFD) [43], which dynamically learns not only the features to transfer but also the unimportant neurons to skip.…”
Section: Feature-based Knowledge
confidence: 99%
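To make the OFD description in this excerpt concrete, the following sketch shows how such a feature-distillation term is typically wired: the student feature is projected to the teacher's channel width with a 1x1 convolution and compared against the margin-ReLU teacher target using the partial L2 distance from the sketch after the abstract. The class name, the normalization by element count, and the module structure are assumptions for illustration, not a verbatim reproduction of any cited method.

```python
import torch.nn as nn

# Hedged sketch (assumed structure) of a feature-distillation module in the
# spirit of OFD; reuses margin_relu and partial_l2_loss from the earlier sketch.

class FeatureDistiller(nn.Module):
    def __init__(self, s_channels, t_channels):
        super().__init__()
        # Student transform: 1x1 conv + BN projecting student features
        # to the teacher's channel width.
        self.regressor = nn.Sequential(
            nn.Conv2d(s_channels, t_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(t_channels),
        )

    def forward(self, s_feat, t_feat, margin):
        s_proj = self.regressor(s_feat)
        t_target = margin_relu(t_feat, margin)          # teacher transform
        return partial_l2_loss(s_proj, t_target) / s_feat.numel()
```

The resulting term would then be added to the task loss with a small weight, e.g. loss = ce_loss + beta * distill_loss, where beta is a tuning hyperparameter.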
“…Knowledge distillation, as another compression strategy, aims to transfer dark knowledge in logit outputs [14], feature maps [13,18], and relationship diagrams [26] from a larger pre-trained teacher network to a smaller student network, allowing the student network to mimic the teacher's performance. Knowledge distillation can improve the accuracy of smaller networks more than training them directly with one-hot labels.…”
Section: Introduction
confidence: 99%
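The logit-based ("dark knowledge") transfer mentioned in this excerpt is commonly implemented as a temperature-softened KL divergence combined with the usual cross-entropy on hard labels, following Hinton et al.'s formulation; the sketch below is a generic version, with the temperature T and weight alpha as illustrative hyperparameters.

```python
import torch.nn.functional as F

# Classic logit distillation: KL divergence between temperature-softened
# teacher and student distributions, mixed with cross-entropy on hard labels.

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale gradients by T^2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```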