2022
DOI: 10.48550/arxiv.2201.08442
Preprint

Neural Network Quantization with AI Model Efficiency Toolkit (AIMET)

Abstract: While neural networks have advanced the frontiers in many machine learning applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is vital to integrating modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additional noise it induces can lead to accuracy degradation. In this white paper, we present an overview of neural network…

Cited by 4 publications (6 citation statements)
References 6 publications
“…While post-training quantization (with 4, 8, and 16 bits) has been shown to reduce model size by 4× and speed up inference by 2–3×, quantization-aware training is recommended for microcontroller-class models to mitigate layerwise quantization error due to the large range of weights across channels [47], [80]. This is achieved through the injection of simulated quantization operations, weight clamping, and fusion of special layers [51], allowing up to 8× model size reduction for the same or a lower accuracy drop. However, care must be taken to ensure that the target hardware supports the bitwidth used.…”
Section: A Common Model Compression Techniques
confidence: 99%
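To make the "injection of simulated quantization operations" mentioned in the statement above concrete, the sketch below fake-quantizes a weight tensor in the forward pass while letting gradients flow through unchanged (a straight-through estimator). It is a minimal, generic PyTorch illustration, not the scheme implemented by the cited frameworks; the function name is ours.

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate uniform asymmetric quantization of a weight tensor.

    The tensor is quantized to `num_bits` integers and immediately
    dequantized, so later layers see quantization noise while the
    computation stays in floating point.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax)
    w_deq = (q - zero_point) * scale
    # Straight-through estimator: the forward pass uses the quantized
    # value, but the gradient flows through as if the op were identity.
    return w + (w_deq - w).detach()

# Example: inject simulated quantization into a layer's forward pass so
# training adapts to the induced rounding noise.
layer = torch.nn.Linear(128, 64)
x = torch.randn(4, 128)
y = torch.nn.functional.linear(x, fake_quantize(layer.weight, num_bits=8), layer.bias)
```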
“…Pruning policies for intermittent computing treat pruning as a hyperparameter tuning problem, sweeping through the memory, energy, and accuracy spaces to build a Pareto frontier. Some frameworks [51], [54] provide support for structured pruning, allowing policies for channel and filter pruning rather than pruning weights in an irregular fashion.…”
Section: A Common Model Compression Techniques
confidence: 99%
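As a rough sketch of the structured (channel/filter) pruning described above, PyTorch's built-in torch.nn.utils.prune utilities can zero out whole convolution filters ranked by their L2 norm, rather than pruning individual weights irregularly. This is a generic example, not the pruning policy of the cited frameworks [51], [54].

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)

# Structured pruning: remove 50% of output filters (dim=0) with the
# smallest L2 norm (n=2), instead of pruning scattered weights.
prune.ln_structured(conv, name="weight", amount=0.5, n=2, dim=0)

# The mask zeroes entire filters; make the pruning permanent if desired.
prune.remove(conv, "weight")
zeroed = (conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(f"{zeroed} of {conv.out_channels} filters zeroed")
```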
“…Most of the time, computing resources are restricted under high workload utilization. Helpful techniques such as pruning [156], quantization [157], [158], and aggregation can be applied to optimize the ML model. Similarly, as discussed in [159], the computational cost of a deep learning model can be reduced by lowering its spatial complexity, for example through pruning of model parameters, parameter sharing, network quantization, and related methods.…”
Section: Prediction Layer
confidence: 99%
“…These techniques attempt to reduce computation cost while keeping accuracy nearly the same. In [158], researchers at Qualcomm AI Research investigate how quantization can reduce the computational cost and latency of neural networks. The authors discuss the AI Model Efficiency Toolkit (AIMET), a library for quantization and compression of AI models.…”
Section: Prediction Layer
confidence: 99%
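For readers unfamiliar with the toolkit, the sketch below outlines how AIMET's PyTorch quantization simulation is typically driven, following the QuantizationSimModel workflow described in AIMET's public documentation. The toy model, calibration loop, and export paths are placeholders, and exact signatures may vary across AIMET versions; treat it as illustrative rather than as the paper's reference usage.

```python
import torch
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

# Placeholder model and input shape, used only for illustration.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 10),
).eval()
dummy_input = torch.randn(1, 3, 224, 224)

def calibrate(sim_model, _):
    """Run a few unlabeled batches so AIMET can observe activation ranges."""
    sim_model.eval()
    with torch.no_grad():
        for _ in range(8):
            sim_model(torch.randn(1, 3, 224, 224))

# Wrap the model with simulated 8-bit quantizers on weights and activations.
sim = QuantizationSimModel(model,
                           dummy_input=dummy_input,
                           quant_scheme=QuantScheme.post_training_tf_enhanced,
                           default_param_bw=8,
                           default_output_bw=8)

# Compute per-quantizer scale/offset encodings from the calibration pass.
sim.compute_encodings(forward_pass_callback=calibrate,
                      forward_pass_callback_args=None)

# Export the simulation model and its encodings for on-device deployment.
sim.export(path="./output", filename_prefix="model_int8", dummy_input=dummy_input)
```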
“…Existing quantization methods are either post-training quantization (PTQ) or in-training / quantization-aware training (QAT). PTQ is applied after model training is complete, compresses models into 8-bit representations, and is relatively well supported by various libraries [3,4,5,6,7,8], such as TensorFlow Lite [9] and AIMET [10] for on-device deployment. However, almost no existing PTQ library supports customized quantization configurations that compress machine learning (ML) layers and kernels into sub-8-bit (S8B) regimes [11].…”
Section: Introduction
confidence: 99%
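To illustrate the sub-8-bit point above, uniform affine quantization generalizes directly to arbitrary bitwidths; what most PTQ libraries lack is kernel and deployment support, not the math. Below is a minimal NumPy sketch with a configurable bitwidth; the function names are illustrative and not tied to any of the cited libraries.

```python
import numpy as np

def affine_quantize(x: np.ndarray, num_bits: int = 4):
    """Uniform asymmetric (affine) quantization to an arbitrary bitwidth.

    Returns integer codes plus the (scale, zero_point) pair needed to
    dequantize; nothing in the math is specific to 8 bits.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    x_range = max(float(x.max() - x.min()), 1e-8)
    scale = x_range / (qmax - qmin)
    zero_point = int(round(-float(x.min()) / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point

def affine_dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(64).astype(np.float32)
q, scale, zp = affine_quantize(w, num_bits=4)          # codes in [0, 15]
recon_err = np.abs(w - affine_dequantize(q, scale, zp)).max()
print(f"max reconstruction error at 4 bits: {recon_err:.4f}")
```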