2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv48630.2021.00333

Data-free Knowledge Distillation for Object Detection

Cited by 47 publications (19 citation statements)
References 17 publications

“…In the literature, Lopes et al. propose the first data-free approach for knowledge distillation, which utilizes statistical information of the original training data to reconstruct a synthetic set during knowledge distillation (Lopes, Fenu, and Starner 2017). This seminal work has spawned several follow-ups, which have achieved impressive progress on several tasks including detection (Chawla et al 2021), segmentation (Fang et al 2019), text classification (Ma et al 2020), graph classification (Deng and Zhang 2021) and Federated Learning (Zhu, Hong, and Zhou 2021). Despite the impressive progress, a vexing problem remains in DFKD, i.e., the inefficiency of data synthesis, which makes data-free training extraordinarily time-consuming.…”
Section: Related Work (mentioning)
confidence: 99%
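The excerpt above summarizes the basic DFKD recipe: invert the teacher to synthesize a surrogate dataset, then distill on it. Below is a minimal PyTorch sketch of that two-stage loop, not the cited papers' implementations; the `teacher`, `student`, optimizer, 10-class output, and CIFAR-like 32x32 inputs are all illustrative assumptions. The inner synthesis loop is also where the cost comes from, which is the inefficiency the excerpt points out.

```python
# Minimal DFKD sketch (assumptions: pretrained `teacher`, untrained `student`,
# 10 classes, 32x32 RGB inputs, plain softened-logit distillation).
import torch
import torch.nn.functional as F

def dfkd_step(teacher, student, student_opt, batch_size=64, synth_steps=200, T=4.0):
    teacher.eval()
    # 1) Synthesize a batch: optimize noise so the teacher assigns it to
    #    randomly chosen classes (a stand-in for the missing training data).
    x = torch.randn(batch_size, 3, 32, 32, requires_grad=True)
    x_opt = torch.optim.Adam([x], lr=0.05)
    targets = torch.randint(0, 10, (batch_size,))
    for _ in range(synth_steps):
        x_opt.zero_grad()
        loss = F.cross_entropy(teacher(x), targets)
        loss.backward()
        x_opt.step()

    # 2) Distill: match softened student logits to the teacher's on the synthetic batch.
    student.train()
    student_opt.zero_grad()
    with torch.no_grad():
        t_logits = teacher(x.detach())
    s_logits = student(x.detach())
    kd_loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                       F.softmax(t_logits / T, dim=1),
                       reduction="batchmean") * T * T
    kd_loss.backward()
    student_opt.step()
    return kd_loss.item()
```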
“…To learn a comparable student model, the synthetic set should contain sufficient samples to enable a comprehensive knowledge transfer from teachers. Consequently, this poses a significant challenge to DFKD, since synthesizing a large-scale dataset is inevitably time-consuming, especially for sophisticated tasks like ImageNet recognition (Yin et al 2019) and COCO detection (Chawla et al 2021).…”
Section: Introduction (mentioning)
confidence: 99%
“…GAN-based methods [8,34,55,63] synthesize training samples by maximizing the response on the discriminator. Prior-based methods [5] provide another perspective for data-free KD, where the synthetic data are forced to satisfy a pre-defined prior, such as a total variance prior [3,36] or batch normalization statistics [5,8]. However, they all suffer from mode collapse [6,45], so we propose a boundary-preserving intra-divergence loss for DeepInversion [56] to generate diverse samples.…”
Section: Related Work (mentioning)
confidence: 99%
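As a rough illustration of the two priors named in this excerpt, here is a hedged PyTorch sketch in the spirit of DeepInversion-style synthesis: a total-variation term that smooths the images and a term matching batch feature statistics to the running mean/variance stored in the teacher's BatchNorm layers. Function names and weights are assumptions, not the cited papers' exact losses; the teacher is assumed to be an eval-mode CNN with BatchNorm2d layers.

```python
# Two image priors used when optimizing synthetic inputs against a fixed teacher.
import torch
import torch.nn as nn

def total_variation(x):
    # Penalize differences between neighboring pixels -> smoother images.
    dh = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs().mean()
    dw = (x[:, :, :, 1:] - x[:, :, :, :-1]).abs().mean()
    return dh + dw

def bn_statistics_prior(model, x):
    # Match the synthetic batch's per-layer feature statistics to the running
    # mean/var of the teacher's BatchNorm layers (teacher assumed in eval mode).
    losses = []
    def hook(module, inputs, _):
        feat = inputs[0]
        mean = feat.mean(dim=(0, 2, 3))
        var = feat.var(dim=(0, 2, 3), unbiased=False)
        losses.append((mean - module.running_mean).norm() ** 2 +
                      (var - module.running_var).norm() ** 2)
    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    model(x)
    for h in handles:
        h.remove()
    return sum(losses)

# Typical usage: add both terms to the synthesis objective, e.g.
# loss = ce_loss + 1e-4 * total_variation(x) + 1e-2 * bn_statistics_prior(teacher, x)
```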
“…[38,39] opt to use a large general proxy dataset to query the teacher and use the teacher's outputs on this data to train the student. Other methods [40,41,42,43] generate proxy data directly from the trained models and use this data to train the students. Further, [44,45] also encourage generating samples the student and teacher disagree on.…”
Section: Related Work (mentioning)
confidence: 99%
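The "disagree on" idea mentioned in this excerpt is usually realized as an adversarial game between a generator and the student. The sketch below shows one plausible formulation, not the cited methods' exact losses: the generator maximizes the teacher-student output discrepancy, while the student is then pulled back toward the teacher on fresh generated samples. The generator interface, the L1 divergence between softmax outputs, and the batch sizes are assumptions.

```python
# Adversarial "disagreement" sketch for data-free distillation.
import torch
import torch.nn.functional as F

def disagreement_step(generator, teacher, student, g_opt, s_opt,
                      z_dim=100, batch_size=64):
    teacher.eval()

    # 1) Generator step: produce samples the student and teacher DISAGREE on,
    #    i.e. maximize their output divergence (minimize its negative).
    g_opt.zero_grad()
    x = generator(torch.randn(batch_size, z_dim))
    div = F.l1_loss(F.softmax(student(x), dim=1), F.softmax(teacher(x), dim=1))
    (-div).backward()
    g_opt.step()

    # 2) Student step: on a fresh generated batch, match the teacher's outputs.
    s_opt.zero_grad()
    with torch.no_grad():
        x = generator(torch.randn(batch_size, z_dim))
        t_prob = F.softmax(teacher(x), dim=1)
    kd = F.kl_div(F.log_softmax(student(x), dim=1), t_prob, reduction="batchmean")
    kd.backward()
    s_opt.step()
    return div.item(), kd.item()
```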