2020
DOI: 10.1609/aaai.v34i04.5718
Few Shot Network Compression via Cross Distillation

Abstract: Model compression has been widely adopted to obtain lightweight deep neural networks. Most prevalent methods, however, require fine-tuning with sufficient training data to ensure accuracy, which could be challenged by privacy and security issues. As a compromise between privacy and performance, in this paper we investigate few shot network compression: given few samples per class, how can we effectively compress the network with negligible performance drop? The core challenge of few shot network compression…
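The abstract only outlines the setting, so below is a minimal, illustrative sketch (PyTorch; layer sizes, sample counts, and data are placeholder assumptions, not the authors' released code) of few-shot compression by distillation: a narrower student is regressed onto a frozen teacher's outputs using only a handful of samples. Cross distillation, as proposed in the paper, refines this by aligning teacher and student layer by layer.

```python
# Illustrative sketch only: few-shot compression by output distillation.
# All sizes and data below are placeholders, not the paper's actual setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Frozen, pretrained teacher (stand-in for a large CNN).
teacher = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
teacher.requires_grad_(False)

# Compressed student: same depth, narrower hidden layer.
student = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))

# "Few shot": e.g. two samples per class for ten classes; labels are not
# needed when the student is simply regressed onto the teacher's outputs.
few_shot_x = torch.randn(20, 64)

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
mse = nn.MSELoss()

for step in range(200):
    opt.zero_grad()
    loss = mse(student(few_shot_x), teacher(few_shot_x))
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step}: distillation loss = {loss.item():.4f}")
```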

Cited by 47 publications (38 citation statements)
References 18 publications (33 reference statements)
“…A majority of meta-learning methods include metric-based (Snell et al, 2017; Pan et al, 2019), model-based (Santoro et al, 2016; Bartunov et al, 2020) and model-agnostic approaches (Finn et al, 2017, 2018; Vuorio et al, 2019). Meta-learning can also be applied to KD in some computer vision tasks (Lopes et al, 2017; Jang et al, 2019; Bai et al, 2020; Li et al, 2020). For example, Lopes et al (2017) record per-layer metadata for the teacher model to reconstruct a training set, and then adopt a standard training procedure to obtain the student model.…”
Section: Transfer Learning and Meta-learning (mentioning)
Confidence: 99%
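The Lopes et al (2017) procedure is described above in a single sentence; a hedged sketch of the idea it refers to (record per-layer activation statistics as "metadata", reconstruct surrogate inputs from them, then distill) might look as follows. Shapes, layer choices, and the single mean-matching statistic are assumptions made for illustration, not details from that paper.

```python
# Hedged sketch of metadata-based reconstruction; shapes and the single
# mean-activation statistic are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

teacher = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
teacher.requires_grad_(False)

# 1) Record "metadata": the teacher's mean hidden activation on (simulated)
#    original data, stored instead of the data itself.
real_x = torch.randn(256, 64)
with torch.no_grad():
    meta_mean = teacher[1](teacher[0](real_x)).mean(dim=0)

# 2) Reconstruct surrogate inputs whose activations match the recorded metadata.
synth_x = torch.randn(64, 64, requires_grad=True)
opt = torch.optim.Adam([synth_x], lr=0.1)
for _ in range(300):
    opt.zero_grad()
    loss = F.mse_loss(teacher[1](teacher[0](synth_x)).mean(dim=0), meta_mean)
    loss.backward()
    opt.step()

# 3) Standard distillation of a smaller student on the reconstructed set.
student = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
s_opt = torch.optim.Adam(student.parameters(), lr=1e-3)
synth_x = synth_x.detach()
for _ in range(200):
    s_opt.zero_grad()
    F.mse_loss(student(synth_x), teacher(synth_x)).backward()
    s_opt.step()
```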
“…However, there is a semantic gap between the external knowledge and the samples. Therefore, we propose a knowledge distillation framework [17, 18] for transferring cross-modal knowledge. Recently, many cross-modal knowledge distillation frameworks have been proposed.…”
Section: Related Work (mentioning)
Confidence: 99%
“…Li et al [11] presented few-sample knowledge distillation (FSKD), which is used for network compression where the student model is obtained by pruning the teacher model. Subsequently, Bai et al [12] proposed a novel layer-wise knowledge distillation approach for effectively compressing networks with few data. Recently, Shen et al [13] proposed a novel grafting strategy for few-shot knowledge distillation.…”
Section: Introduction (mentioning)
Confidence: 99%
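Both FSKD and the layer-wise approach mentioned above fit the student to the teacher block by block on a small calibration set. A hedged sketch of plain layer-wise distillation under that reading follows; block count, widths, and optimizer settings are assumptions for illustration, and the actual cross-distillation procedure additionally crosses teacher and student hidden states rather than using the teacher's inputs alone.

```python
# Hedged sketch of layer-wise distillation on a few samples; blocks, widths,
# and optimizer settings are illustrative. (In the pruning setting the student
# blocks would be narrower; equal widths keep this sketch short.)
import torch
import torch.nn as nn

torch.manual_seed(0)

teacher_blocks = nn.ModuleList([nn.Linear(32, 32) for _ in range(3)])
teacher_blocks.requires_grad_(False)
student_blocks = nn.ModuleList([nn.Linear(32, 32) for _ in range(3)])

calib_x = torch.randn(16, 32)   # small "few shot" calibration batch
mse = nn.MSELoss()

t_in = calib_x
for t_block, s_block in zip(teacher_blocks, student_blocks):
    t_out = t_block(t_in)
    opt = torch.optim.Adam(s_block.parameters(), lr=1e-2)
    for _ in range(100):
        opt.zero_grad()
        # Fit the student block to reproduce the teacher block's output given
        # the teacher's own input, so estimation errors do not accumulate
        # across layers (the failure mode cross distillation targets).
        loss = mse(s_block(t_in), t_out)
        loss.backward()
        opt.step()
    t_in = t_out                 # advance along the teacher's feature path
```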