“…The architectures of the quantized students are therefore the same as those of the teachers, so that they are comparable to other QAT and PTQ methods. Here, we compare our method with QAT methods including DoReFa-Net, LSQ [27], DSQ [20], and QKD [40], and with PTQ methods including BRECQ [36], PWLQ [54], and ZeroQ [19]. As shown in Table 3, for ResNet, Arch-Net outperforms these methods by a large margin.…”
Section: Results on ImageNet (mentioning)
confidence: 99%
“…In [39,21], knowledge distillation is directly applied to training quantized networks as low as 2W8A. In [40], a three-phase method is adopted to obtain quantization of ResNet as low as 3W3A with little loss of accuracy. However, the data problem remains, because these methods rely on large amounts of data with ground-truth labels.…”
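The snippet above concerns training very low-bit networks (e.g., 2W8A or 3W3A) under knowledge distillation. As a rough illustration of the kind of quantizer involved, the sketch below shows a uniform symmetric k-bit weight quantizer with a straight-through estimator; the clipping scheme, granularity, and distillation losses used in [39,21,40] may differ, and this is a generic sketch rather than their exact method.

```python
import torch

def quantize_weights(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric k-bit weight quantizer with a straight-through estimator.

    Generic sketch only; not the exact quantizer of the cited works.
    """
    qmax = 2 ** (bits - 1) - 1                     # e.g. bits=3 -> integer levels in [-3, 3]
    scale = w.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale (a simplification)
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward uses w_q, backward passes gradients to w.
    return w + (w_q - w).detach()
```

In a distillation setting, the quantized forward pass produced this way feeds the student, whose outputs are then matched against a full-precision teacher.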
The vast computational requirements of Deep Neural Networks are a major hurdle to their real-world applications. Many recent Application Specific Integrated Circuit (ASIC) chips feature dedicated hardware support for Neural Network acceleration. However, as ASICs take multiple years to develop, they are inevitably outpaced by the latest developments in neural architecture research. For example, Transformer networks do not have native support on many popular chips, and hence are difficult to deploy. In this paper, we propose Arch-Net, a family of Neural Networks made up of only operators efficiently supported across most ASIC architectures. When an Arch-Net is produced, less common network constructs, like Layer Normalization and Embedding layers, are eliminated in a progressive manner through label-free Blockwise Model Distillation, while sub-eight-bit quantization is performed at the same time to maximize performance. Empirical results on machine translation and image classification tasks confirm that we can transform the latest neural architectures into fast-running and equally accurate Arch-Nets, ready for deployment on multiple mass-produced ASIC chips. The code will be available at https://github.com/megvii-research/Arch-Net.
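The abstract describes label-free Blockwise Model Distillation, where each quantized student block is trained to mimic the corresponding full-precision teacher block on unlabeled inputs. The sketch below captures that idea in a minimal form; the block partitioning, the choice of which features feed the student, the loss weighting, and the progressive elimination schedule are simplifying assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def blockwise_distill_step(teacher_blocks, student_blocks, x, optimizer):
    """One label-free distillation step: match per-block outputs with an MSE loss.

    `teacher_blocks` / `student_blocks` are aligned lists of nn.Module;
    only the student is updated. Simplified sketch of blockwise distillation.
    """
    loss = 0.0
    t_feat = s_feat = x
    for t_blk, s_blk in zip(teacher_blocks, student_blocks):
        with torch.no_grad():
            t_feat = t_blk(t_feat)        # frozen full-precision teacher block
        s_feat = s_blk(s_feat)            # quantized student block
        loss = loss + F.mse_loss(s_feat, t_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only intermediate features are matched, no ground-truth labels are needed, which is what makes the procedure label-free.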
“…After the warm-up stage in phase 2, we set T = 2, α = 0.5, β = 0.5. We did not conduct a grid search to find hyper-parameters but chose them based on recommendations from related works [21,12,1]. For the learning rate of the learnable step size S_w, we multiply the initial learning rate of the model parameters by 10⁻⁴ because of its sensitivity.…”
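The quoted settings (T = 2, α = 0.5, β = 0.5, and a 10⁻⁴× learning rate for the learnable step size S_w) suggest a standard soft/hard distillation loss plus LSQ-style step sizes kept in a separate optimizer parameter group. The sketch below shows that common setup under those assumptions; the exact loss composition and parameter-group helpers (`weight_parameters`, `step_size_parameters`) are hypothetical.

```python
import torch
import torch.nn.functional as F

T, alpha, beta = 2.0, 0.5, 0.5

def kd_loss(student_logits, teacher_logits, targets):
    """Weighted sum of soft (distillation) and hard (cross-entropy) losses."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + beta * hard

# Learnable step sizes get a much smaller learning rate than the model weights.
# base_lr and the two parameter-group helpers are assumed, not from the paper:
# optimizer = torch.optim.SGD(
#     [
#         {"params": model.weight_parameters(), "lr": base_lr},
#         {"params": model.step_size_parameters(), "lr": base_lr * 1e-4},
#     ],
#     momentum=0.9,
# )
```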
Section: Methods (mentioning)
confidence: 99%
“…The teacher network transfers its knowledge to the student network to enhance the student's performance. Feature maps [8,9,10] and the logits of a network [11,12] are widely used as knowledge. Model compression has been actively studied, mainly on computer vision tasks.…”
As edge devices become prevalent, deploying Deep Neural Networks (DNNs) on edge devices has become a critical issue. However, DNNs require computational resources that are rarely available on edge devices. To handle this, we propose a novel model compression method for devices with limited computational resources, called PQK, consisting of pruning, quantization, and knowledge distillation (KD) processes. Unlike traditional pruning and KD, PQK makes use of the unimportant weights pruned in the pruning process to build a teacher network for training a better student network, without pre-training the teacher model. PQK has two phases. Phase 1 exploits iterative pruning and quantization-aware training to make a lightweight and power-efficient model. In phase 2, we make a teacher network by adding the unimportant weights unused in phase 1 back to the pruned network. Using this teacher network, we train the pruned network as a student network. In doing so, we do not need a pre-trained teacher network for the KD framework, because the teacher and the student networks coexist within the same network (see Fig. 1). We apply our method to recognition models and verify the effectiveness of PQK on keyword spotting (KWS) and image recognition.
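PQK's phase 2 builds the teacher by adding the pruned ("unimportant") weights back, so teacher and student share a single set of parameters. A rough sketch of that masking idea for one weight tensor is below; per-layer handling, iterative pruning schedules, and the quantization step are simplifying assumptions, and `keep_ratio` is an illustrative parameter.

```python
import torch

def make_student_and_teacher(weight: torch.Tensor, keep_ratio: float = 0.5):
    """Split one weight tensor into a pruned student and a denser teacher.

    The student keeps only the largest-magnitude weights; the teacher adds the
    pruned weights back, as in PQK phase 2 (simplified sketch).
    """
    k = int(weight.numel() * keep_ratio)
    # Threshold such that exactly k weights satisfy |w| >= threshold.
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    keep_mask = (weight.abs() >= threshold).float()
    student_w = weight * keep_mask      # pruned network (phase 1)
    teacher_w = weight                  # pruned + unimportant weights (phase 2 teacher)
    return student_w, teacher_w
```

Since both views are derived from the same tensor, no separate pre-trained teacher has to be stored or trained, which is the point the abstract makes.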
“…Binary networks [22,42] constrain both weights and activations to binary values, which brings great benefits to specialized hardware devices. Designing efficient strategies for training low-precision [71,29,70] or any-precision networks [27,65] that can flexibly adjust the precision during inference is another recent trend in quantization. Despite recent progress, the problem of quantization for video recognition models is rarely explored.…”
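Binary networks constrain weights (and often activations) to ±1, typically with a sign function in the forward pass and a straight-through gradient in the backward pass. A minimal sketch of such a binarizer follows; the per-tensor scaling factor and the hard-tanh gradient clipping are common choices but vary across [22,42].

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through gradient (clipped to |w| <= 1)."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        alpha = w.abs().mean()          # per-tensor scaling factor (XNOR-Net style)
        return torch.sign(w) * alpha

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Pass gradients through only where |w| <= 1 (hard-tanh straight-through).
        return grad_output * (w.abs() <= 1).float()

# Usage: binarized = BinarizeSTE.apply(layer.weight)
```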
Deep convolutional networks have recently achieved great success in video recognition, yet their practical realization remains a challenge due to the large amount of computational resources required for robust recognition. Motivated by the effectiveness of quantization for boosting efficiency, in this paper we propose a dynamic network quantization framework that selects the optimal precision for each frame, conditioned on the input, for efficient video recognition. Specifically, given a video clip, we train a very lightweight network in parallel with the recognition network to produce a dynamic policy indicating which numerical precision to use per frame when recognizing videos. We train both networks effectively using standard backpropagation with a loss that achieves both the competitive performance and the resource efficiency required for video recognition. Extensive experiments on four challenging and diverse benchmark datasets demonstrate that our proposed approach provides significant savings in computation and memory usage while outperforming existing state-of-the-art methods. Project page: https://cs-people.bu.edu/sunxm/VideoIQ/project.html.
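The abstract describes a lightweight policy network, trained jointly with the recognition network by standard backpropagation, that picks a precision per frame. One common way to keep such a discrete per-frame choice differentiable is a Gumbel-Softmax relaxation, sketched below; the candidate bit-widths, the policy head architecture, and the feature interface are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BIT_CHOICES = [2, 4, 8, 32]  # hypothetical candidate precisions per frame

class FramePolicy(nn.Module):
    """Tiny policy head that picks a precision for each frame of a clip."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, len(BIT_CHOICES))

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim) from a lightweight backbone.
        logits = self.head(frame_feats)
        # Gumbel-Softmax keeps the discrete per-frame choice differentiable,
        # so the policy can be trained with standard backpropagation.
        return F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
```

The resulting one-hot tensor can gate which quantized copy of the recognition network processes each frame, with an efficiency penalty added to the recognition loss to trade accuracy against computation.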