2021 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn52387.2021.9533993

Generalization Self-distillation with Epoch-wise Regularization

Cited by 3 publications (3 citation statements)
References 17 publications
“…Different from traditional knowledge distillation with its two-stage scheme, self-distillation is a one-phase training scheme that incurs no extra computation cost: it gradually utilizes the model's own knowledge to soften the ground-truth targets and improve generalization performance. Specifically, let $\mathbf{p}_T$ be the prediction from the mean-teacher branch; we can utilize the mean-teacher prediction to soften the ground-truth label $\mathbf{y}$, described as [31,32]
$$\mathcal{L}_{SKD}(\theta) = H\bigl(\eta\,\mathbf{y} + (1-\eta)\,\mathbf{p}_T,\; \mathbf{p}_S\bigr),$$
where $\mathbf{p}_S$ is the current student prediction and $\eta$ controls how much we trust the knowledge from the teacher. In our approach, the uniform and reversed samplings are fed into the teacher branches to obtain predictions with different distributions, $\mathbf{p}_T^A$ and $\mathbf{p}_T^B$.…”
Section: Proposed Methods (mentioning)
confidence: 99%
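The softened-target loss quoted above is straightforward to reproduce. Below is a minimal PyTorch-style sketch; the function name `skd_loss` and the default value of `eta` are illustrative choices rather than values from the paper, and the teacher probabilities are assumed to come from a mean-teacher (EMA) branch that is not trained through this loss.

```python
import torch
import torch.nn.functional as F

def skd_loss(student_logits, teacher_probs, targets, eta=0.7):
    """Soft-label self-distillation loss L_SKD = H(eta*y + (1-eta)*p_T, p_S).

    student_logits: raw student outputs, shape (N, C)
    teacher_probs:  mean-teacher softmax predictions p_T, shape (N, C)
    targets:        integer ground-truth labels y, shape (N,)
    eta:            weight on the ground truth vs. the teacher prediction
    """
    num_classes = student_logits.size(1)
    y_onehot = F.one_hot(targets, num_classes).float()
    # Soften the one-hot label with the (detached) teacher prediction.
    soft_target = eta * y_onehot + (1.0 - eta) * teacher_probs.detach()
    log_p_s = F.log_softmax(student_logits, dim=1)        # log p_S
    return -(soft_target * log_p_s).sum(dim=1).mean()     # cross-entropy H(., p_S)
```

With `eta = 1` the loss reduces to ordinary cross-entropy on the hard labels; lowering `eta` shifts trust toward the teacher branch.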
“…Different from traditional knowledge distillation with its two-stage scheme, self-distillation is a one-phase training scheme that incurs no extra computation cost: it gradually utilizes the model's own knowledge to soften the ground-truth targets and improve generalization performance. Specifically, let $\mathbf{p}_T$ be the prediction from the mean-teacher branch; we can utilize the mean-teacher prediction to soften the ground-truth label $\mathbf{y}$, described as [31,32]…”
Section: Self-distillation Guided Knowledge Transfer By Joint Head-to... (mentioning)
confidence: 99%
“…In the recent literature, several works proposing self-distillation [25,26] have also emerged. The work in [27] introduced a weighting mechanism that dynamically puts less weight on uncertain samples and showed promising results. Huang et.…”
Section: Recommender System (mentioning)
confidence: 99%
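The exact weighting rule of [27] is not given in this excerpt; a common way to realize "less weight on uncertain samples" is to scale each sample's distillation term by a confidence measure such as the teacher's maximum predicted probability. The sketch below is a generic illustration under that assumption, not the cited work's actual method.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_distillation(student_logits, teacher_probs):
    """Illustrative uncertainty-aware distillation: samples the teacher is
    unsure about (low max probability) contribute less to the loss."""
    log_p_s = F.log_softmax(student_logits, dim=1)
    # Per-sample KL divergence between teacher and student distributions.
    per_sample_kl = F.kl_div(log_p_s, teacher_probs, reduction="none").sum(dim=1)
    confidence = teacher_probs.max(dim=1).values   # in (0, 1]; proxy for certainty
    weights = confidence / confidence.sum()        # normalize so weights sum to 1
    return (weights * per_sample_kl).sum()
```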