2021 IEEE International Conference on Data Mining (ICDM)
DOI: 10.1109/icdm51629.2021.00069

Attention-based Feature Interaction for Efficient Online Knowledge Distillation

Cited by 8 publications (12 citation statements)
References 32 publications
“…ResNet-32 is a typical lightweight baseline model that is widely adopted by many previous advanced methods, so we further conduct comparison experiments on ResNet-32. The results in Tab. VII show that EKD-FWSNet is only slightly inferior to AFID [65] and PCL [66].…”
Section: Classification on Lightweight Baseline Models (mentioning)
confidence: 92%
“…[20]-[24] all design a student-classmate ensemble training framework to obtain the knowledge of an ensemble teacher, which can guide both the student and the classmate efficiently in an end-to-end manner. AFID [65] directly employs one additional complete sub-net to construct a two-branch ensemble training network. Besides distilling knowledge from the ensemble teacher, it further proposes a feature interaction module that performs mutual learning between the attentive feature maps of the two sub-nets.…”
Section: B. Knowledge Distillation Guided Training Framework (mentioning)
confidence: 99%
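The two-branch design described in the statement above (an extra complete sub-net plus a feature interaction module performing mutual learning between attentive feature maps) can be pictured with a small PyTorch-style sketch. The module name AttentiveInteraction, the 1x1-conv spatial attention, and the symmetric MSE mutual-learning term are assumptions for illustration only, not the exact AFID implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveInteraction(nn.Module):
    """Hypothetical feature-interaction module: each branch's feature map is
    re-weighted by a spatial attention map before a mutual-learning loss."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convs producing a single-channel spatial attention map per branch
        self.att_a = nn.Conv2d(channels, 1, kernel_size=1)
        self.att_b = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat_a, feat_b):
        # attention-refined features of the two sub-nets
        a = feat_a * torch.sigmoid(self.att_a(feat_a))
        b = feat_b * torch.sigmoid(self.att_b(feat_b))
        # symmetric mutual-learning term pulling the attentive maps together
        return F.mse_loss(a, b.detach()) + F.mse_loss(b, a.detach())
```

In a training loop, this interaction loss would typically be added to each branch's cross-entropy term and the distillation term from the ensemble teacher.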
“…The intermediate feature representations from each teacher block are followed by multiple student blocks, respectively, where the teacher and the student are simultaneously trained by minimizing the differences in the feature representations and logits between the teacher and the student.…”
[The remainder of this statement is a flattened survey table listing distillation methods, e.g. FitNets [17], Deep Mutual Learning [22], Attention-based Feature Interaction for Efficient [49], and Peer Collaborative Learning [50], together with the distance measures they use (L2, cosine, and others).]
Section: Single Teacher (mentioning)
confidence: 99%
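The statement above describes the generic pattern of jointly minimizing feature and logit differences between teacher and student. A minimal PyTorch-style sketch of such a combined objective is shown below; the function name distillation_loss, the temperature T, and the weights alpha and beta are illustrative assumptions, not taken from any of the cited papers.

```python
import torch.nn.functional as F

def distillation_loss(s_logits, t_logits, s_feat, t_feat, labels,
                      T=4.0, alpha=1.0, beta=1.0):
    """Illustrative combined objective: task CE + logit KD (KL) + feature L2."""
    ce = F.cross_entropy(s_logits, labels)
    # temperature-scaled KL divergence between student and teacher logits
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    # L2 matching of intermediate feature representations
    feat = F.mse_loss(s_feat, t_feat)
    return ce + alpha * kd + beta * feat
```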
[This statement opens with rows of a flattened taxonomy table: online learning [22], [38], [45]-[50]; self-learning [52]-[56], [64]; teacher and student status "to be trained"; static vs. dynamic.]
“…To better boost the knowledge distillation process, Su et al. [49] additionally introduce an attention mechanism to capture important and high-level knowledge, so that teachers and students can be dynamically and effectively trained with the help of the valuable knowledge.…”
Section: Role Status (mentioning)
confidence: 99%
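The attention mechanism mentioned here (capturing important, high-level knowledge before distillation) can be illustrated, under assumption, as a lightweight channel-attention gate that re-weights feature channels. The SE-style design below is a sketch only and is not necessarily the mechanism used in [49].

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Illustrative SE-style gate: emphasises informative channels so that the
    distillation signal focuses on the most important feature responses."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (N, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))         # global average pool -> (N, C)
        return x * w.unsqueeze(-1).unsqueeze(-1)
```

Applying such a gate to a feature map before computing a feature-matching loss concentrates the distillation signal on the channels the gate deems informative.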