A Comprehensive Overhaul of Feature Distillation

Heo, Byeongho; Kim, Jeesoo; Yun, Sangdoo; Park, Hyojin; Kwak, Nojun; Choi, Jin Young

doi:10.1109/iccv.2019.00201

Cited by 362 publications

(208 citation statements)

References 16 publications

Mentioning: 204


“…Network Compression. Generally, compression methods can be categorized into five types: quantization [3,31,7,43,1], knowledge distillation [14,23,41,53,32], low-rank decomposition [38,6,22,56], weight sparsification [10,26,51], and filter pruning [34,27,13,40]. Quantization methods accelerate deep CNNs by replacing high-precision float point operations with low-precision fixed point ones, which usually incurs significantly accuracy drop.…”

Section: Sparsity (mentioning)

confidence: 99%

Dynamic Group Convolution for Accelerating Convolutional Neural Networks

Fang, Kang, et al. 2020. Computer Vision – ECCV 2020


Replacing normal convolutions with group convolutions can significantly increase the computational efficiency of modern deep convolutional networks, a practice widely adopted in compact network architecture designs. However, existing group convolutions undermine the original network structures by cutting off some connections permanently, resulting in significant accuracy degradation. In this paper, we propose dynamic group convolution (DGC), which adaptively selects which input channels are connected within each group for individual samples on the fly. Specifically, we equip each group with a small feature selector to automatically select the most important input channels conditioned on the input images. Multiple groups can adaptively capture abundant and complementary visual/semantic features for each input image. DGC preserves the original network structure while retaining computational efficiency similar to that of conventional group convolution. Extensive experiments on multiple image classification benchmarks, including CIFAR-10, CIFAR-100 and ImageNet, demonstrate its superiority over existing group convolution techniques and dynamic execution methods. The code is available at https://github.com/zhuogege1943/dgc.
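The per-sample channel selection described in this abstract can be pictured with a small sketch. It is only an illustration under stated assumptions: the abstract mentions a small learned feature selector, whereas the code below swaps in a fixed heuristic (mean absolute activation per channel) and a hypothetical `keep_ratio` parameter. None of the names come from the DGC code base.

```python
def channel_saliency(feature_map):
    """Score each channel by its mean absolute activation.

    feature_map: list of channels, each a flat list of floats.
    Assumption: this heuristic stands in for the learned selector.
    """
    return [sum(abs(v) for v in channel) / len(channel) for channel in feature_map]


def select_channels(feature_map, keep_ratio=0.5):
    """Return the sorted indices of the highest-scoring channels for one sample."""
    scores = channel_saliency(feature_map)
    k = max(1, round(len(scores) * keep_ratio))  # always keep at least one channel
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Two different inputs keep different channels, which is the "dynamic" part.
sample_a = [[0.1, -0.1], [2.0, -3.0], [0.0, 0.0], [1.0, 1.0]]
sample_b = [[5.0, 5.0], [0.0, 0.1], [4.0, -4.0], [0.2, 0.2]]
kept_a = select_channels(sample_a)
kept_b = select_channels(sample_b)
```

In a real group convolution each group would run its own selector and convolve only the channels it kept; the sketch stops at the selection step.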


“…Many methods have been proposed to minimize the performance gap between a student and a teacher. We discuss different forms of knowledge in the following categories: response-based knowledge [26,27,35], feature-based knowledge [28,29,36-45], and relation-based knowledge [30,31,46-49].…”

Section: Knowledge Distillation (mentioning)

confidence: 99%

“…Ahn et al proposed Variational Information Distillation (VID) [41] that maximizes a lower boundary for the mutual information between the student network and the teacher network. Heo et al proposed Overhaul of Feature Distillation (OFD) [42] to transfer the magnitude of feature response which contains both the activation status of each neuron and feature information. Wang et al proposed Attentive Feature Distillation (AFD) [43] which dynamically learns not only the features to transfer, but also the unimportant neurons to skip.…”

Section: Feature-based Knowledge (mentioning)

confidence: 99%

Ensemble Learning of Lightweight Deep Learning Models Using Knowledge Distillation for Image Classification

Kang, Gwak. 2020. Mathematics


In recent years, deep learning models have been used successfully in almost every field, in both industry and academia, especially for computer vision tasks. However, these models are huge, with millions (or even billions) of parameters, and thus cannot be deployed on systems and devices with limited resources (e.g., embedded systems and mobile phones). To tackle this, several model compression and acceleration techniques have been proposed. As a representative example, knowledge distillation offers a way to effectively learn a small student model from one or more large teacher models. It has attracted increasing attention because of its promising performance. In this work, we propose an ensemble model that combines feature-based, response-based, and relation-based lightweight knowledge distillation models for simple image classification tasks. In our knowledge distillation framework, we use ResNet-20 as the student network and ResNet-110 as the teacher network. Experimental results demonstrate that our proposed ensemble model outperforms other knowledge distillation models as well as the large teacher model on image classification tasks, with less computational power than the teacher model.
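The citation statement above says OFD transfers the magnitude of the feature response, including each neuron's activation status. One way to picture this is a partial L2 distance that skips positions where the teacher response is non-positive and the student is already at or below it. The skip rule is an assumption about the cited method, since this page does not spell it out, and the function name is hypothetical.

```python
def partial_l2(teacher, student):
    """Partial L2 distance between two flat lists of pre-activation responses.

    Assumption: a position contributes nothing when the teacher response is
    non-positive and the student response does not exceed it, because both
    would be inactive after a ReLU anyway.
    """
    total = 0.0
    for t, s in zip(teacher, student):
        if s <= t <= 0:
            continue  # student is already "inactive enough"; no penalty
        total += (t - s) ** 2
    return total


teacher_features = [-1.0, 2.0, -1.0]
student_features = [-2.0, 1.0, 0.5]
loss = partial_l2(teacher_features, student_features)
```

Here the first position is skipped, while the second and third contribute because the student either undershoots an active teacher response or fires where the teacher does not.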


“…Knowledge distillation, as another compression strategy, aims to transfer dark knowledge in logits outputs [14], feature maps [13,18], and relationship diagrams [26] from a larger pre-trained teacher network to a smaller student network, allowing the student network to mimic the teacher network performance. The strategy of knowledge distillation can better improve some smaller networks' accuracy than directly training them with one-hot labels.…”

Section: Introduction (mentioning)

confidence: 99%

Knowledge from the original network: restore a better pruned network with knowledge distillation

Chen et al. 2021. Complex Intell. Syst.


To deploy deep neural networks to edge devices with limited computation and storage, model compression is necessary. Pruning, a traditional model compression technique, seeks to reduce the number of model weights. However, when a deep neural network is pruned, its accuracy decreases significantly. The traditional way to reduce this accuracy loss is fine-tuning. When too many parameters are pruned, the pruned network's capacity is heavily reduced and it cannot recover high accuracy. In this paper, we apply the knowledge distillation strategy to reduce the accuracy loss of pruned models. The original, unpruned network is used as the teacher network, with the aim of transferring its dark knowledge to the pruned sub-network. We apply three mainstream knowledge distillation methods: response-based knowledge, feature-based knowledge, and relation-based knowledge (Gou et al. in Knowledge distillation: a survey. arXiv:2006.05525, 2020), and compare the results with the traditional fine-tuning method using ground-truth labels. Experiments are conducted on the CIFAR100 dataset with several deep convolutional neural networks. Results show that a pruned network recovered by knowledge distillation from its original network achieves better accuracy than one recovered by fine-tuning with sample labels. This paper also validates that the original network performs better as the teacher than differently structured networks with the same accuracy.
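The response-based (logit) distillation mentioned in this entry is commonly written as a KL divergence between temperature-softened teacher and student distributions. The sketch below shows that common formulation, not the specific loss used by Chen et al. The default temperature of 4.0 and the function names are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by temperature squared.

    Assumption: logits are moderate enough that no student probability
    underflows to zero.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0  # terms with zero teacher probability contribute nothing
    )
    return temperature ** 2 * kl


teacher_logits = [5.0, 1.0, -2.0]
matched = distillation_loss([5.0, 1.0, -2.0], teacher_logits)
mismatched = distillation_loss([0.0, 3.0, 1.0], teacher_logits)
```

In practice this term is usually combined with the ordinary cross-entropy on the one-hot labels; how the two are weighted is a design choice this page does not specify.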
