2019
DOI: 10.48550/arxiv.1911.12491
Preprint
QKD: Quantization-aware Knowledge Distillation

Cited by 25 publications (42 citation statements)
References 23 publications
“…The architectures of the quantized students are therefore the same as that of the teachers, so that they are comparable to other QAT and PTQ methods. Here, we compare our method with QAT methods includes DoReFa-Net, LSQ [27], DSQ [20], QKD [40] and PTQ method includes BRECQ [36], PWLQ [54] and ZeroQ [19]. As shown in Table 3, for ResNet, Arch-Net outperforms these methods by a big advantage.…”
Section: Results On ImageNet (mentioning, confidence: 99%)
“…In [39,21], knowledge distillation is directly applied to training as low as 2W8A quantized networks. While in [40], a three phases method is adopted to get as low as 3W3A quantization of ResNet with little loss of accuracy. However, the data problem is still remaining because these methods rely on large amount of data with ground truth.…”
Section: Related Work (mentioning, confidence: 99%)
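For context on the bit-width shorthand in the excerpt above, "2W8A" and "3W3A" denote the bit-widths of weights (W) and activations (A). A minimal sketch of a symmetric uniform quantizer of that kind follows; it is a generic illustration under assumed conventions, not the implementation of QKD or of the cited works.

```python
import torch

def uniform_quantize(x: torch.Tensor, bits: int, step: float) -> torch.Tensor:
    """Symmetric uniform quantizer: scale by the step size, round to a signed
    b-bit integer grid, clamp to the representable range, and rescale.
    "3W3A" means both weights and activations pass through a grid like this
    with bits=3; `step` plays the role of the (possibly learnable) step size
    referred to as Sw in a later excerpt."""
    qmin = -(2 ** (bits - 1))
    qmax = 2 ** (bits - 1) - 1
    return torch.clamp(torch.round(x / step), qmin, qmax) * step

# Example: 3-bit quantization of a weight tensor with a max-abs step size.
w = torch.randn(4, 4)
w_q = uniform_quantize(w, bits=3, step=w.abs().max().item() / (2 ** (3 - 1) - 1))
```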
“…After the warm up stage in phase 2, we set T = 2, α = 0.5, β = 0.5. We did not conduct a grid search for finding hyper-parameters but choose them based on recommendations from related works [21,12,1]. For the learning rate of learnable step-size Sw, we multiply 10⁻⁴ to the initial learning rate of model parameters because of its sensitivity.…”
Section: Methods (mentioning, confidence: 99%)
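The T, α, and β in the excerpt above are the temperature and loss weights of a Hinton-style distillation objective. The sketch below shows that standard loss for reference; the names and defaults are illustrative, and the exact objective used in QKD may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0, alpha: float = 0.5, beta: float = 0.5) -> torch.Tensor:
    """Weighted sum of hard-label cross-entropy and the KL divergence between
    temperature-softened teacher and student distributions."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps the soft-target gradient scale comparable across temperatures
    return alpha * hard + beta * soft
```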
“…Teacher network transfers its knowledge to student network to enhance the performance of student network. Feature maps [8,9,10] and logits of a network [11,12] are widely used as knowledge. Model compression has actively been studied mainly on computer vision tasks.…”
Section: Introduction (mentioning, confidence: 99%)
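In the excerpt above, using "feature maps as knowledge" usually means matching intermediate activations of teacher and student, for example with an L2 hint loss behind a small adapter that aligns channel widths. A minimal sketch of such a hint loss follows; the 1x1-convolution adapter is an assumption for illustration, not the formulation of the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureHintLoss(nn.Module):
    """Project student features to the teacher's channel width with a 1x1
    convolution, then penalize the mean squared difference to the (detached)
    teacher feature map."""
    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        self.adapter = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        return F.mse_loss(self.adapter(student_feat), teacher_feat.detach())
```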
“…Binary networks [22,42] constrain both weights and activations to binary values, which brings great benefits to specialized hardware devices. Designing efficient strategies for training low-precision [71,29,70] or anyprecision networks [27,65] that can flexibly adjust the precision during inference is also another recent trend in quantization. Despite recent progress, the problem of quantization for video recognition models is rarely explored.…”
Section: Related Work (mentioning, confidence: 99%)