2024
DOI: 10.1109/tnnls.2022.3223018
On Representation Knowledge Distillation for Graph Neural Networks


Cited by 13 publications (10 citation statements)
References: 43 publications
“…We also demonstrate that, in general, the use of a projector will scale much more favourably for larger batch sizes and feature dimensions. We also note that the handcrafted design of kernel functions [28,21,38] may not generalise to large scale or complex real-world datasets. From the results in table 2, we observe that when fixing all other settings, the choice of normalisation can significantly affect the student's performance.…”
Section: Revisiting Knowledge Distillation (mentioning)
confidence: 99%
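The quoted passage emphasises two design choices in representation distillation: passing the student's features through a learned projector and picking a feature normalisation. The following is a minimal PyTorch sketch of that setup, not the cited papers' exact implementations; the names (`StudentProjector`, `distill_loss`) and the feature widths are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentProjector(nn.Module):
    """Illustrative linear projector mapping student features to the teacher's width."""
    def __init__(self, d_student: int, d_teacher: int):
        super().__init__()
        self.proj = nn.Linear(d_student, d_teacher)

    def forward(self, h):
        return self.proj(h)

def distill_loss(h_student, h_teacher, projector, normalise=True):
    """Feature-matching loss; `normalise` toggles L2 normalisation,
    the choice the quoted passage reports as significant."""
    z_s = projector(h_student)
    z_t = h_teacher.detach()  # teacher features are frozen
    if normalise:
        z_s = F.normalize(z_s, dim=-1)
        z_t = F.normalize(z_t, dim=-1)
    return F.mse_loss(z_s, z_t)

# Toy usage with random node embeddings (32 nodes, hypothetical widths 64 -> 256).
proj = StudentProjector(d_student=64, d_teacher=256)
h_s, h_t = torch.randn(32, 64), torch.randn(32, 256)
loss = distill_loss(h_s, h_t, proj, normalise=True)
loss.backward()
```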
“…GFKD [46], RDD [47], GKD [48], GLNN [49], Distill2Vec [50], MT-GCN [51], TinyGNN [52], GLocalKD [53], SCR [54], ROD [55], EGNN [56]; Middle layer: LWC-KD [57], MustaD [58], EGAD [59], AGNN [60], Cold Brew [61], PGD [62], OAD [63], CKD [64], BGNN [65], EGSC [66], HSKDM [67]; Constructed graph: GRL [68], GFL [69], HGKT [70], CPF [71], LSP [16], scGCN [72], MetaHG [73], G-CRD [74], HIRE [75]; SKD methods…”
Section: Output Layer (mentioning)
confidence: 99%
“…To further enrich and provide more general knowledge, GNNs additionally learn the topological structure and node-relationship information of the teacher model with the help of constructed graphs [16, 68-75], so as to deeply explore the knowledge contained in the teacher model.…”
Section: Constructed Graph Knowledge (mentioning)
confidence: 99%
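One common way to realise the "constructed graph" knowledge the quote describes is to build a pairwise relation (similarity) graph over node embeddings for both teacher and student and ask the student to match the teacher's relations. The sketch below shows that generic idea in PyTorch under stated assumptions; it is not the specific formulation of any single cited method (e.g. LSP [16] or G-CRD [74]), and `pairwise_relation` / `relational_distill_loss` are hypothetical names.

```python
import torch
import torch.nn.functional as F

def pairwise_relation(h):
    """Cosine-similarity matrix over node embeddings: one simple way to encode
    'node relationship' knowledge as a constructed graph."""
    z = F.normalize(h, dim=-1)
    return z @ z.t()

def relational_distill_loss(h_student, h_teacher):
    """Match the student's pairwise relations to the (frozen) teacher's relations."""
    r_s = pairwise_relation(h_student)
    r_t = pairwise_relation(h_teacher.detach())
    return F.mse_loss(r_s, r_t)

# Toy usage: 100 nodes; differing embedding widths are fine because the loss
# is computed on N x N relation matrices rather than on raw features.
h_s = torch.randn(100, 64, requires_grad=True)
h_t = torch.randn(100, 256)
loss = relational_distill_loss(h_s, h_t)
loss.backward()
```

A practical appeal of this relational view is that student and teacher may have different feature dimensions, since only the N x N relation matrices are compared.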
“…We also investigated a number of different transformation functions and the effect they have on performance. The transformations we utilized include: the identity transformation (when appropriate); learned linear transformations and MLP projection heads; as well as variants of the structure-preserving transformations described in [36]. Our results (see Appendix E.1) showed that the examined transformations either hurt performance or only provided a marginal improvement in performance compared to our default linear mapping.…”
Section: Effect of Transformation Function (mentioning)
confidence: 99%
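The transformation choices the quote compares (identity where dimensions match, learned linear maps, MLP projection heads) can be summarised in a small factory plus a feature-matching objective. This is a hedged sketch, not the authors' code; `make_transform` and `feature_matching_loss` are illustrative names, and the MLP head is a generic two-layer variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_transform(kind: str, d_in: int, d_out: int) -> nn.Module:
    """Illustrative factory for the transformation applied to student features
    before matching the teacher: identity, linear map, or a small MLP head."""
    if kind == "identity":
        assert d_in == d_out, "identity only applies when dimensions already match"
        return nn.Identity()
    if kind == "linear":
        return nn.Linear(d_in, d_out)
    if kind == "mlp":
        return nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(), nn.Linear(d_out, d_out))
    raise ValueError(f"unknown transform kind: {kind}")

def feature_matching_loss(h_student, h_teacher, transform):
    """Simple L2 feature-matching objective after the chosen transformation."""
    return F.mse_loss(transform(h_student), h_teacher.detach())

# Toy comparison on random features (identity is skipped because the
# hypothetical widths 64 and 256 differ).
h_s, h_t = torch.randn(32, 64), torch.randn(32, 256)
for kind in ("linear", "mlp"):
    t = make_transform(kind, 64, 256)
    print(kind, feature_matching_loss(h_s, h_t, t).item())
```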