2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01174
DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers

Cited by 23 publications (17 citation statements) · References 27 publications
“…In Tab. 1, G2SD is compared with 1) supervised methods including MobileNet-v3 [19], ResNet [15,48], DeiT [41,42], Swin Transformer [28] and ConvNeXt [29]; 2) self-supervised methods upon ViT-Small, like BEiT [4] and CAE [8]; and 3) distillation methods upon vanilla ViTs, like DeiT⚗ [41], DearKD [7], Manifold [21], MKD [27], SSTA [49] and DMAE [3]. G2SD achieves 82.5% top-1 accuracy, outperforming the CNN-based ConvNeXt by 0.4% while using fewer parameters (22M vs. 29M).…”
Section: Results (mentioning)
confidence: 99%
“…One solution is to explicitly introduce convolutional operators into ViTs [31,50] to make them competitive with lightweight CNNs [19]. The other is to let large models act as teachers that transfer inductive bias to ViTs in a knowledge distillation fashion [7,41,50]. This study focuses on the latter.…”
Section: Related Work (mentioning)
confidence: 99%
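The teacher-to-student transfer described in this citation statement reduces, at its core, to adding a distillation term to the training loss. Below is a minimal PyTorch sketch of a soft distillation objective in the spirit of DeiT [41] and DearKD [7]; the function name, the temperature tau, and the weight alpha are illustrative assumptions, not the exact recipe of either paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, tau=3.0):
    """Blend the usual cross-entropy with a KL term that pulls the
    student toward the temperature-softened teacher distribution.
    alpha and tau are illustrative hyperparameters."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.log_softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (tau * tau)  # rescale so gradient magnitude is comparable across tau
    return (1.0 - alpha) * ce + alpha * kd

# Usage sketch: teacher is a frozen CNN, student a ViT.
# with torch.no_grad():
#     t_logits = teacher(images)
# loss = distillation_loss(student(images), t_logits, labels)
```

With a frozen CNN teacher (e.g., a ResNet), the KL term is the channel through which convolutional inductive bias reaches the ViT student; DeiT⚗ additionally routes it through a dedicated distillation token, which this sketch omits.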
“…Therefore, many algorithms have been proposed to improve the efficiency of vision transformers. Recent works demonstrate that some popular model compression methods such as network pruning [17,7,8,70], knowledge distillation [20,54,9], and quantization [46,51] can be applied to ViTs. Besides, other methods introduce CNN properties such as hierarchy and locality into the transformers to alleviate the burden of computing global attention [35,5].…”
Section: Efficient Vision Transformers (mentioning)
confidence: 99%
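As a concrete instance of the compression methods this citation statement lists, post-training dynamic quantization can be applied to a ViT's linear layers in a few lines with PyTorch's built-in quantize_dynamic. The sketch below is illustrative, not any cited paper's method, and the timm model name is an assumption.

```python
import torch
import timm  # assumption: timm is installed and provides pretrained ViTs

# Illustrative model choice, not a specific paper's architecture.
model = timm.create_model("vit_small_patch16_224", pretrained=True).eval()

# Post-training dynamic quantization: nn.Linear weights are stored in
# int8 and dequantized on the fly at inference time, shrinking the
# model and speeding up CPU inference without any retraining.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = quantized(x)
print(logits.shape)  # torch.Size([1, 1000])
```

Because dynamic quantization only rewrites weights at rest and needs no calibration data, it is a convenient first pass before the more invasive pruning or distillation approaches the quote cites.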
“…Despite its promising accuracy, the ViT [12] is a computational heavyweight. To address this issue, several algorithms have been proposed to improve the efficiency of vision transformers in different ways [5,6,21,28,37,40,52].…”
Section: Efficient Vision Transformers (mentioning)
confidence: 99%