“…The architectures of the quantized students are therefore the same as those of the teachers, so that they are comparable to other QAT and PTQ methods. Here, we compare our method with QAT methods including DoReFa-Net, LSQ [27], DSQ [20], and QKD [40], and with PTQ methods including BRECQ [36], PWLQ [54], and ZeroQ [19]. As shown in Table 3, for ResNet, Arch-Net outperforms these methods by a large margin.…”
Section: Results on ImageNet (mentioning)
confidence: 99%
“…In [39,21], knowledge distillation is directly applied to training quantized networks as low as 2W8A. In [40], a three-phase method is adopted to obtain quantization of ResNet as low as 3W3A with little loss of accuracy. However, the data problem remains, because these methods rely on large amounts of data with ground-truth labels.…”
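The snippet above concerns training very low-bit networks (e.g., 2W8A or 3W3A) under knowledge distillation. As a rough illustration of the kind of quantizer involved, the sketch below shows a uniform symmetric k-bit weight quantizer with a straight-through estimator; the clipping scheme, granularity, and distillation losses used in [39,21,40] may differ, and this is a generic sketch rather than their exact method.

```python
import torch

def quantize_weights(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric k-bit weight quantizer with a straight-through estimator.

    Generic sketch only; not the exact quantizer of the cited works.
    """
    qmax = 2 ** (bits - 1) - 1                     # e.g. bits=3 -> integer levels in [-3, 3]
    scale = w.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale (a simplification)
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward uses w_q, backward passes gradients to w.
    return w + (w_q - w).detach()
```

In a distillation setting, the quantized forward pass produced this way feeds the student, whose outputs are then matched against a full-precision teacher.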
The vast computational requirements of Deep Neural Networks are a major hurdle to their real-world applications. Many recent Application Specific Integrated Circuit (ASIC) chips feature dedicated hardware support for Neural Network acceleration. However, as ASICs take multiple years to develop, they are inevitably outpaced by the latest developments in neural architecture research. For example, Transformer networks do not have native support on many popular chips, and hence are difficult to deploy. In this paper, we propose Arch-Net, a family of Neural Networks made up of only operators efficiently supported across most ASIC architectures. When an Arch-Net is produced, less common network constructs, like Layer Normalization and Embedding layers, are eliminated in a progressive manner through label-free Blockwise Model Distillation, while sub-eight-bit quantization is performed at the same time to maximize performance. Empirical results on machine translation and image classification tasks confirm that we can transform the latest neural architectures into fast-running and equally accurate Arch-Nets, ready for deployment on multiple mass-produced ASIC chips. The code will be available at https://github.com/megvii-research/Arch-Net.
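The abstract describes label-free Blockwise Model Distillation, where each quantized student block is trained to mimic the corresponding full-precision teacher block on unlabeled inputs. The sketch below captures that idea in a minimal form; the block partitioning, the choice of which features feed the student, the loss weighting, and the progressive elimination schedule are simplifying assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def blockwise_distill_step(teacher_blocks, student_blocks, x, optimizer):
    """One label-free distillation step: match per-block outputs with an MSE loss.

    `teacher_blocks` / `student_blocks` are aligned lists of nn.Module;
    only the student is updated. Simplified sketch of blockwise distillation.
    """
    loss = 0.0
    t_feat = s_feat = x
    for t_blk, s_blk in zip(teacher_blocks, student_blocks):
        with torch.no_grad():
            t_feat = t_blk(t_feat)        # frozen full-precision teacher block
        s_feat = s_blk(s_feat)            # quantized student block
        loss = loss + F.mse_loss(s_feat, t_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only intermediate features are matched, no ground-truth labels are needed, which is what makes the procedure label-free.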
“…After the warm-up stage in phase 2, we set T = 2, α = 0.5, β = 0.5. We did not conduct a grid search to find hyper-parameters but chose them based on recommendations from related works [21,12,1]. For the learning rate of the learnable step size S_w, we multiply the initial learning rate of the model parameters by 10⁻⁴ because of its sensitivity.…”
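The quoted settings (T = 2, α = 0.5, β = 0.5, and a 10⁻⁴× learning rate for the learnable step size S_w) suggest a standard soft/hard distillation loss plus LSQ-style step sizes kept in a separate optimizer parameter group. The sketch below shows that common setup under those assumptions; the exact loss composition and parameter-group helpers (`weight_parameters`, `step_size_parameters`) are hypothetical.

```python
import torch
import torch.nn.functional as F

T, alpha, beta = 2.0, 0.5, 0.5

def kd_loss(student_logits, teacher_logits, targets):
    """Weighted sum of soft (distillation) and hard (cross-entropy) losses."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + beta * hard

# Learnable step sizes get a much smaller learning rate than the model weights.
# base_lr and the two parameter-group helpers are assumed, not from the paper:
# optimizer = torch.optim.SGD(
#     [
#         {"params": model.weight_parameters(), "lr": base_lr},
#         {"params": model.step_size_parameters(), "lr": base_lr * 1e-4},
#     ],
#     momentum=0.9,
# )
```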
Section: Methods (mentioning)
confidence: 99%
“…The teacher network transfers its knowledge to the student network to enhance the student's performance. Feature maps [8,9,10] and the logits of a network [11,12] are widely used as knowledge. Model compression has been actively studied, mainly on computer vision tasks.…”
As edge devices become prevalent, deploying Deep Neural Networks (DNNs) on edge devices has become a critical issue. However, DNNs require computational resources that are rarely available on edge devices. To handle this, we propose a novel model compression method for devices with limited computational resources, called PQK, consisting of pruning, quantization, and knowledge distillation (KD) processes. Unlike traditional pruning and KD, PQK makes use of the unimportant weights pruned in the pruning process to build a teacher network for training a better student network, without pre-training the teacher model. PQK has two phases. Phase 1 exploits iterative pruning and quantization-aware training to make a lightweight and power-efficient model. In phase 2, we make a teacher network by adding the unimportant weights unused in phase 1 back to the pruned network. Using this teacher network, we train the pruned network as a student network. In doing so, we do not need a pre-trained teacher network for the KD framework, because the teacher and the student networks coexist within the same network (see Fig. 1). We apply our method to recognition models and verify the effectiveness of PQK on keyword spotting (KWS) and image recognition.
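PQK's phase 2 builds the teacher by adding the pruned ("unimportant") weights back, so teacher and student share a single set of parameters. A rough sketch of that masking idea for one weight tensor is below; per-layer handling, iterative pruning schedules, and the quantization step are simplifying assumptions, and `keep_ratio` is an illustrative parameter.

```python
import torch

def make_student_and_teacher(weight: torch.Tensor, keep_ratio: float = 0.5):
    """Split one weight tensor into a pruned student and a denser teacher.

    The student keeps only the largest-magnitude weights; the teacher adds the
    pruned weights back, as in PQK phase 2 (simplified sketch).
    """
    k = int(weight.numel() * keep_ratio)
    # Threshold such that exactly k weights satisfy |w| >= threshold.
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    keep_mask = (weight.abs() >= threshold).float()
    student_w = weight * keep_mask      # pruned network (phase 1)
    teacher_w = weight                  # pruned + unimportant weights (phase 2 teacher)
    return student_w, teacher_w
```

Since both views are derived from the same tensor, no separate pre-trained teacher has to be stored or trained, which is the point the abstract makes.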
“…Binary networks [22,42] constrain both weights and activations to binary values, which brings great benefits to specialized hardware devices. Designing efficient strategies for training low-precision [71,29,70] or any-precision networks [27,65] that can flexibly adjust the precision during inference is another recent trend in quantization. Despite recent progress, the problem of quantization for video recognition models is rarely explored.…”
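Binary networks constrain weights (and often activations) to ±1, typically with a sign function in the forward pass and a straight-through gradient in the backward pass. A minimal sketch of such a binarizer follows; the per-tensor scaling factor and the hard-tanh gradient clipping are common choices but vary across [22,42].

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through gradient (clipped to |w| <= 1)."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        alpha = w.abs().mean()          # per-tensor scaling factor (XNOR-Net style)
        return torch.sign(w) * alpha

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Pass gradients through only where |w| <= 1 (hard-tanh straight-through).
        return grad_output * (w.abs() <= 1).float()

# Usage: binarized = BinarizeSTE.apply(layer.weight)
```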
Deep convolutional networks have recently achieved great success in video recognition, yet their practical realization remains a challenge due to the large amount of computational resources required for robust recognition. Motivated by the effectiveness of quantization for boosting efficiency, in this paper we propose a dynamic network quantization framework that selects the optimal precision for each frame, conditioned on the input, for efficient video recognition. Specifically, given a video clip, we train a very lightweight network in parallel with the recognition network to produce a dynamic policy indicating which numerical precision to use per frame when recognizing videos. We train both networks effectively using standard backpropagation with a loss that achieves both the competitive performance and the resource efficiency required for video recognition. Extensive experiments on four challenging and diverse benchmark datasets demonstrate that our proposed approach provides significant savings in computation and memory usage while outperforming existing state-of-the-art methods. Project page: https://cs-people.bu.edu/sunxm/VideoIQ/project.html.
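The abstract describes a lightweight policy network, trained jointly with the recognition network by standard backpropagation, that picks a precision per frame. One common way to keep such a discrete per-frame choice differentiable is a Gumbel-Softmax relaxation, sketched below; the candidate bit-widths, the policy head architecture, and the feature interface are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BIT_CHOICES = [2, 4, 8, 32]  # hypothetical candidate precisions per frame

class FramePolicy(nn.Module):
    """Tiny policy head that picks a precision for each frame of a clip."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, len(BIT_CHOICES))

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim) from a lightweight backbone.
        logits = self.head(frame_feats)
        # Gumbel-Softmax keeps the discrete per-frame choice differentiable,
        # so the policy can be trained with standard backpropagation.
        return F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
```

The resulting one-hot tensor can gate which quantized copy of the recognition network processes each frame, with an efficiency penalty added to the recognition loss to trade accuracy against computation.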