2021
DOI: 10.48550/arxiv.2112.00459
Preprint

Information Theoretic Representation Distillation

Abstract: Despite the empirical success of knowledge distillation, there still lacks a theoretical foundation that can naturally lead to computationally inexpensive implementations. To address this concern, we forge an alternative connection between information theory and knowledge distillation using a recently proposed entropy-like functional. In doing so, we introduce two distinct complementary losses which aim to maximise the correlation and mutual information between the student and teacher representations. Our meth…
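The abstract describes two losses that maximise the correlation and mutual information between student and teacher representations. As a rough illustration only, and not the paper's actual formulation, the sketch below shows a simple correlation-style distillation term between L2-normalised student and teacher features; every name, shape, and default here is hypothetical.

```python
# Illustrative sketch only -- NOT the paper's exact loss.
# Assumes student/teacher feature tensors of shape (batch, dim).
import torch
import torch.nn.functional as F

def correlation_distillation_loss(f_student: torch.Tensor,
                                  f_teacher: torch.Tensor) -> torch.Tensor:
    """Encourage high correlation between student and teacher features.

    The teacher features are detached so gradients only flow into the student.
    """
    zs = F.normalize(f_student, dim=1)           # unit-norm student features
    zt = F.normalize(f_teacher.detach(), dim=1)  # unit-norm teacher features
    # Per-sample cosine similarity; maximising it <=> minimising its negative.
    return -(zs * zt).sum(dim=1).mean()

# Hypothetical usage with random features standing in for real networks.
if __name__ == "__main__":
    fs = torch.randn(8, 128, requires_grad=True)  # stand-in student features
    ft = torch.randn(8, 128)                      # stand-in teacher features
    loss = correlation_distillation_loss(fs, ft)
    loss.backward()
    print(float(loss))
```

In a typical distillation setup the teacher is frozen, which is why its features are detached before the loss is computed.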

Cited by 5 publications (7 citation statements)
References 32 publications
“…Through the quantification of visual concepts encoded in teacher networks, Cheng et al [96] explain the success of knowledge distillation with three hypotheses: learning more visual concepts, learning diverse visual concepts, and yielding more stable optimization directions. Miles et al [97] integrate information-theoretic analysis with knowledge distillation using infinitely divisible kernels, which yields a computationally efficient learning process on cross-model transfer tasks. Quantifying knowledge is therefore a promising future research direction, aiming to analyze how much important knowledge can potentially be captured before the knowledge-learning process.…”
Section: Quality Of Knowledge (mentioning)
confidence: 99%
“…(15) follows by noticing that T_n(x) ∈ [−1, 1] for any x ∈ [0, v]. (16) follows by applying Lemma A.3 to R(i−α, 2α+1), similar to (13). (17) follows from Euler's reflection formula, similar to (14).…”
Section: Tr(p (mentioning)
confidence: 99%
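For readers tracing the steps quoted above: the Chebyshev bound and Euler's reflection formula that the excerpt invokes are standard identities, restated below for reference; the interval [0, v] and the equation numbers (13)–(17) belong to the citing paper and are not reproduced here.

```latex
% Standard identities referenced in the excerpt above.
% Chebyshev polynomials of the first kind are bounded on [-1, 1]:
\[
  |T_n(x)| \le 1 \quad \text{for all } x \in [-1, 1],\ n \ge 0 .
\]
% Euler's reflection formula for the Gamma function:
\[
  \Gamma(z)\,\Gamma(1 - z) = \frac{\pi}{\sin(\pi z)}, \qquad z \notin \mathbb{Z}.
\]
```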
“…Inspired by the quantum generalization of Rényi's definition [8], this new family of information measures is defined on the eigenspectrum of a normalized Hermitian matrix constructed by projecting data points into a reproducing kernel Hilbert space (RKHS), thus avoiding explicit estimation of the underlying data distributions. Because of their intriguing properties in high-dimensional scenarios, the matrix-based Rényi's entropy and mutual information have been successfully applied in various data science applications, ranging from classical dimensionality reduction [9], [10] and feature selection [11] problems to advanced deep learning problems such as robust learning against covariate shift [5], network pruning [12] and knowledge distillation [13].…”
Section: Introduction (mentioning)
confidence: 99%
“…Inspired by the quantum generalization of Rényi's definition [8], this new family of information measures is defined on the eigenspectrum of a normalized Hermitian matrix constructed by projecting data points into a reproducing kernel Hilbert space (RKHS), thus avoiding explicit estimation of the underlying data distributions. Because of their intriguing properties in high-dimensional scenarios, the matrix-based Rényi's entropy and mutual information have been successfully applied in various data science applications, ranging from classical dimensionality reduction [9] and feature selection [10] problems to advanced deep learning problems such as robust learning against covariate shift [5], network pruning [11] and knowledge distillation [12].…”
Section: Introduction (mentioning)
confidence: 99%
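Both excerpts above refer to the matrix-based Rényi's α-entropy, which is computed from the eigenspectrum of a trace-normalised Gram matrix rather than from an explicit density estimate. The following is a minimal sketch of that computation, assuming a Gaussian kernel and a hand-picked bandwidth; the function name and defaults are illustrative and not taken from the cited works.

```python
# Minimal sketch of matrix-based Renyi's alpha-entropy (illustrative only).
# Assumes data X of shape (n_samples, n_features) and a Gaussian kernel.
import numpy as np

def matrix_renyi_entropy(X: np.ndarray, alpha: float = 2.0,
                         sigma: float = 1.0) -> float:
    """Entropy from the eigenspectrum of a trace-normalised Gram matrix.

    S_alpha(A) = 1 / (1 - alpha) * log2( sum_i lambda_i(A)^alpha ),
    where A = K / trace(K) and K is a Gaussian Gram matrix.
    """
    # Pairwise squared distances and Gaussian Gram matrix.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Normalise so the matrix has unit trace (eigenvalues sum to 1).
    A = K / np.trace(K)
    # Hermitian matrix, so eigvalsh applies; clip tiny negatives from rounding.
    eigvals = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return float(np.log2(np.sum(eigvals ** alpha)) / (1.0 - alpha))

# Hypothetical usage on random data.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 16))
    print(matrix_renyi_entropy(X, alpha=2.0, sigma=1.0))
```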