2021 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn52387.2021.9533993

Generalization Self-distillation with Epoch-wise Regularization

Cited by 3 publications (3 citation statements)
References 17 publications
“…Different from traditional knowledge distillation with its two-stage scheme, self-distillation is a one-phase training scheme that incurs no extra computation cost: it gradually utilizes the model's own knowledge to soften the ground-truth targets and improve generalization performance. Specifically, let $\mathbf{p}_T$ be the prediction from the mean-teacher branch; we can utilize the mean-teacher prediction to soften the ground-truth label $\mathbf{y}$, described as [31,32]
$$\mathcal{L}_{SKD}(\theta) = H\bigl(\eta\,\mathbf{y} + (1-\eta)\,\mathbf{p}_T,\; \mathbf{p}_S\bigr),$$
where $\mathbf{p}_S$ is the current student prediction and $\eta$ controls how much we trust the knowledge from the teacher. In our approach, the uniform and reversed samplings are fed into the teacher branches to obtain predictions with different distributions, $\mathbf{p}_T^A$ and $\mathbf{p}_T^B$.…”
Section: Proposed Methods (mentioning)
confidence: 99%
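The softened-target loss quoted above is straightforward to reproduce. Below is a minimal PyTorch-style sketch; the function name `skd_loss` and the default value of `eta` are illustrative choices rather than values from the paper, and the teacher probabilities are assumed to come from a mean-teacher (EMA) branch that is not trained through this loss.

```python
import torch
import torch.nn.functional as F

def skd_loss(student_logits, teacher_probs, targets, eta=0.7):
    """Soft-label self-distillation loss L_SKD = H(eta*y + (1-eta)*p_T, p_S).

    student_logits: raw student outputs, shape (N, C)
    teacher_probs:  mean-teacher softmax predictions p_T, shape (N, C)
    targets:        integer ground-truth labels y, shape (N,)
    eta:            weight on the ground truth vs. the teacher prediction
    """
    num_classes = student_logits.size(1)
    y_onehot = F.one_hot(targets, num_classes).float()
    # Soften the one-hot label with the (detached) teacher prediction.
    soft_target = eta * y_onehot + (1.0 - eta) * teacher_probs.detach()
    log_p_s = F.log_softmax(student_logits, dim=1)        # log p_S
    return -(soft_target * log_p_s).sum(dim=1).mean()     # cross-entropy H(., p_S)
```

With `eta = 1` the loss reduces to ordinary cross-entropy on the hard labels; lowering `eta` shifts trust toward the teacher branch.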
“…Different from traditional knowledge distillation with its two-stage scheme, self-distillation is a one-phase training scheme that incurs no extra computation cost: it gradually utilizes the model's own knowledge to soften the ground-truth targets and improve generalization performance. Specifically, let $\mathbf{p}_T$ be the prediction from the mean-teacher branch; we can utilize the mean-teacher prediction to soften the ground-truth label $\mathbf{y}$, described as [31,32]…”
Section: Self-distillation Guided Knowledge Transfer By Joint Head-to... (mentioning)
confidence: 99%
“…In the recent literature, several works proposing self-distillation [25,26] have also emerged. The work in [27] introduced a weighting mechanism that dynamically puts less weight on uncertain samples and showed promising results. Huang et.…”
Section: Recommender System (mentioning)
confidence: 99%
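The exact weighting rule of [27] is not given in this excerpt; a common way to realize "less weight on uncertain samples" is to scale each sample's distillation term by a confidence measure such as the teacher's maximum predicted probability. The sketch below is a generic illustration under that assumption, not the cited work's actual method.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_distillation(student_logits, teacher_probs):
    """Illustrative uncertainty-aware distillation: samples the teacher is
    unsure about (low max probability) contribute less to the loss."""
    log_p_s = F.log_softmax(student_logits, dim=1)
    # Per-sample KL divergence between teacher and student distributions.
    per_sample_kl = F.kl_div(log_p_s, teacher_probs, reduction="none").sum(dim=1)
    confidence = teacher_probs.max(dim=1).values   # in (0, 1]; proxy for certainty
    weights = confidence / confidence.sum()        # normalize so weights sum to 1
    return (weights * per_sample_kl).sum()
```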