Progressive Network Grafting for Few-Shot Knowledge Distillation

Shen, Chengchao; Wang, Xinchao; Yin, Youtan; Song, Jie; Luo, Sihui; Song, Mingli

doi:10.1609/aaai.v35i3.16356

Cited by 28 publications

(16 citation statements)

References 40 publications

(39 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Some studies have also improved layer-wisely distillation and proposed cross-distillation, which can effectively reduce the estimation error of layer-wisely distillation by cross-training the hidden layer network of teachers and students [8]. Besides the cross-distillation model, a principled dual-stage distillation scheme based on small samples has also been proposed, in which the student modules are grafted into the teacher network for training, then the trained student modules are spliced together and grafted into the teacher network, and finally the teacher network is replaced [9]. In some of the above methods, some will add additional convolutional layers to the compressed network during training, which increases the complexity of the network structure.…”

Section: Related Workmentioning

confidence: 99%

“…D EEP neural networks are widely used in various computer vision tasks [1]- [4] and have achieved remarkable results [5], [6]. However, the current state-of-the-art deep models suffer from huge energy consumption, high operating and storage costs, which greatly hinder their deployment in resource-efficient situations [7]- [9]. To solve this problem, a lot of works have been proposed to compress neural networks for obtaining more lightweight neural network models.…”

Section: Introductionmentioning

confidence: 99%

“…Network pruning is usually performed on the trained network to remove unimportant channels or weights, and then the pruned network is retrained to restore the performance of the original network [15], [16]. These pruning methods usually require a large amount of labeled data, and the training process is very time-consuming [9], [17]. The knowledge distillation method transfers knowledge from the pre-trained teacher network to the student network, and trains the student network by making students imitate the output of the teacher network, so as to achieve the performance of the teacher network [18]- [20].…”

Section: Introductionmentioning

confidence: 99%

“…The knowledge distillation method transfers knowledge from the pre-trained teacher network to the student network, and trains the student network by making students imitate the output of the teacher network, so as to achieve the performance of the teacher network [18]- [20]. However, since the student network is usually set to be randomly initialized, it needs to rely on a large amount of data for knowledge transfer to train a model with good performance [9], [21]. Therefore, it is difficult for the existing methods to recover the lost accuracy with few training samples.…”

Section: Introductionmentioning

confidence: 99%

“…Zhou et al [22] introduced a progressive training strategy to achieve knowledge transfer between student network and teacher network by matching the feature distribution between them. Shen et al [9] proposed a progressive network grafting method, which trains the student network through block grafting and network grafting, reduces its parameter space, and enhances the robustness of knowledge distillation. However, these methods rarely take into account both the feature information of the teacher network and the classification result information to optimize the student network simultaneously.…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Progressive Network Grafting With Local Features Embedding for Few-Shot Knowledge Distillation

2022

IEEE Access

View full text Add to dashboard Cite

Compared with traditional knowledge distillation, which relies on a large amount of data, few-shot knowledge distillation can distill student networks with good performance using only a small number of samples. Some recent studies treat the network as a combination of a series of network blocks, adopt a progressive graft strategy, and use the output of the teacher network to distill the student network. However, this strategy ignores the importance of the local feature information generated by the teacher block, which indicates what features should be learned by the corresponding student block. In this paper, we argue that using the features output from the teacher block can guide the student block to further learn more useful information from the teacher block. Therefore, we propose a joint learning framework for few-shot knowledge distillation that exploits both the output of the teacher network and the local features generated by the teacher block to optimize the student network. The local features will guide the student block to learn the output of the teacher block, and the output of the teacher network will allow the student network to take its learned local features to better contribute to the classification. In addition, further model compression was carried out to design a series of student networks with fewer number of parameters by reducing the number of network channels. Finally, extensive experiments using the model on CIFAR10 and CIFAR100 datasets show that our method outperforms SOTA, and our method has considerable advantages even with a very small number of parameters in further model compression experiments.

show abstract

Section: Related Workmentioning

confidence: 99%