“…One needs to ensure that paired features are similar in both modeling capacity and relevance to the output. Most research on feature-based distillation on graphs has so far focused on models that have only one type of (scalar) feature in single-output classification tasks [26,27,28,29], thereby reducing the problem to selecting which layers to pair across the student and the teacher. This is often simplified further by using student and teacher models of the same architecture.…”