Abstract: While neural networks have advanced the frontiers in many machine learning applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is vital to integrating modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additional noise it induces can lead to accuracy degradation. In this white paper, we present an overview of neural network…
“…While posttraining quantization (with 4, 8, and 16 bits) has been shown to reduce the model size by 4× and speed up inference by 2-3×, quantization-aware training is recommended for microcontroller-class models to mitigate layerwise quantization error due to a large range of weights across channels [47], [80]. This is achieved through the injection of simulated quantization operations, weight clamping, and fusion of special layers [51], allowing up to 8× model size reduction for same or lower accuracy drop. However, care must be taken to ensure that the target hardware supports the used bitwidth.…”
Section: A Common Model Compression Techniques (citation type: mentioning, confidence: 99%)
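To make the "injection of simulated quantization operations" concrete, below is a minimal sketch of a fake-quantization function of the kind inserted during QAT, using a straight-through estimator so gradients pass through the rounding step. The function name and the asymmetric min/max scheme are illustrative assumptions, not the exact procedure of [47], [51], or [80].

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate asymmetric uniform quantization in the forward pass (QAT sketch)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = torch.clamp(x_max - x_min, min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x_min / scale)
    # Quantize, clamp to the representable range, then dequantize.
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    dq = (q - zero_point) * scale
    # Straight-through estimator: forward uses dq, backward treats the op as identity.
    return x + (dq - x).detach()
```

During training, an op like this would wrap each weight tensor and activation; the weight clamping and layer fusion mentioned in the snippet happen before or alongside the insertion of these simulated quantizers.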
“…Pruning policies for intermittent computing treat pruning as a hyperparameter tuning problem, sweeping through the memory, energy, and accuracy spaces to build a Pareto frontier. Some frameworks [51], [54] provide support for structured pruning, allowing policies for channel and filter pruning rather than pruning weights in an irregular fashion.…”
Section: A Common Model Compression Techniques (citation type: mentioning)
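As an illustration of the structured-versus-irregular distinction, the sketch below ranks Conv2d filters by their L1 norm and keeps the strongest fraction, a common structured-pruning heuristic; it is a generic example, not the specific policy of [51] or [54].

```python
import torch

def prune_filters_l1(conv_weight: torch.Tensor, keep_ratio: float = 0.5):
    """Structured pruning sketch: keep whole filters (output channels) by L1 score.

    conv_weight has shape (out_channels, in_channels, kH, kW).
    """
    num_keep = max(1, int(conv_weight.shape[0] * keep_ratio))
    scores = conv_weight.abs().sum(dim=(1, 2, 3))   # one L1 norm per filter
    keep = torch.topk(scores, num_keep).indices.sort().values
    # Returning the surviving indices lets the next layer's input channels
    # be sliced to match, which is what makes the pruning "structured".
    return conv_weight[keep], keep
```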
Advances in machine learning (ML) have opened a new opportunity to bring intelligence to low-end Internet-of-Things (IoT) nodes such as microcontrollers. Conventional ML deployments have high memory and compute footprints, hindering their direct deployment on ultra-resource-constrained microcontrollers. This article highlights the unique requirements of enabling onboard ML for microcontroller-class devices. Researchers use a specialized model development workflow for resource-limited applications to ensure that the compute and latency budget stays within the device limits while still maintaining the desired performance. We characterize a closed-loop, widely applicable workflow of ML model development for microcontroller-class devices and show that several classes of applications adopt a specific instance of it. We present both qualitative and numerical insights into different stages of model development by showcasing several use cases. Finally, we identify the open research challenges and unsolved questions that demand careful consideration moving forward.
“…Computing resources are often most constrained precisely when workload utilization is high. Techniques such as pruning [156], quantization [157], [158], and aggregation can be applied to optimize the ML model. Similarly, as discussed in [159], the computational cost of a deep learning model can be reduced by lowering its spatial complexity, e.g., by pruning model parameters, parameter sharing, and network quantization.…”
Section: Prediction Layer (citation type: mentioning, confidence: 99%)
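To give one concrete reading of "parameter sharing," the sketch below clusters a weight tensor with a small 1-D k-means loop and replaces each weight by its centroid, so only the centroid table and per-weight indices need to be stored. This is a generic illustration, not the exact method of [159].

```python
import numpy as np

def share_weights_kmeans(weights: np.ndarray, num_clusters: int = 16, iters: int = 20):
    """Parameter-sharing sketch: 1-D k-means over weights, then centroid lookup."""
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), num_clusters)
    for _ in range(iters):
        # Assign every weight to its nearest centroid, then recenter.
        assign = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(num_clusters):
            members = flat[assign == k]
            if members.size:                 # leave empty clusters where they are
                centroids[k] = members.mean()
    shared = centroids[assign].reshape(weights.shape)
    return shared, centroids, assign.reshape(weights.shape)
```

With 16 clusters, each weight index needs only 4 bits plus a small shared centroid table, which is where the storage saving comes from.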
“…These techniques attempt to reduce computation cost while keeping accuracy nearly unchanged. In [158], researchers at Qualcomm AI Research investigate how quantization can reduce the computational cost and latency of neural networks. The authors also discuss the AI Model Efficiency Toolkit (AIMET), a library for quantization and compression of AI models.…”
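For orientation, the snippet below shows the general shape of an AIMET quantization-simulation flow with its PyTorch frontend. Treat the argument names as version-dependent rather than authoritative; the tiny model and the `calibration_pass` callback are placeholder assumptions standing in for a trained network and a real calibration set.

```python
import torch
from aimet_torch.quantsim import QuantizationSimModel

# Placeholder model; in practice this would be a trained torch.nn.Module.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU())

def calibration_pass(m, _):
    # Run a few representative batches so AIMET can observe activation
    # ranges and compute quantization encodings.
    m(torch.randn(1, 3, 224, 224))

sim = QuantizationSimModel(model,
                           dummy_input=torch.randn(1, 3, 224, 224),
                           default_param_bw=8,    # weight bitwidth
                           default_output_bw=8)   # activation bitwidth
sim.compute_encodings(calibration_pass, None)
# sim.model is now a quantization-simulated model that can be evaluated
# or fine-tuned (QAT) before export.
```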
Internet of Things (IoT) services have grown substantially in recent years. Consequently, IoT service providers (SPs) are emerging in the market and competing to offer their services. Many IoT applications utilize these services in an integrated manner with different Quality-of-Service (QoS) requirements. Thus, the provisioning of end-to-end QoS is becoming indispensable for IoT platforms. However, provisioning the system using only QoS metrics, without considering user experience, is not sufficient. Recently, the Quality-of-Experience (QoE) model has become a promising approach to quantify the actual user experience of services. A holistic design approach that jointly considers the constraints of various QoS/QoE metrics is needed to satisfy the requirements of these applications and services. Besides, IoT services may operate in environments with limited resources. Therefore, effective management of services and system resources is essential for QoS/QoE support. This paper provides a comprehensive survey of state-of-the-art studies on IoT services from a QoS/QoE perspective. Our contributions are threefold: (1) a QoE-driven architecture is demonstrated by classifying vital components according to QoE-related functions in prior studies, (2) QoE metrics and QoE optimization objectives are classified by the corresponding system and resource control problems in the architecture, and (3) QoE-aware resource management, e.g., QoE-aware offloading, placement, and data caching policies with recent machine learning approaches, is extensively reviewed.
INDEX TERMS: Internet of Things, Quality of Service, Quality of Experience, IoT services, IoT applications, QoS for IoT services, QoS metrics, QoE metrics, IoT architecture

Recently, emerging multilayer IoT architectures, e.g., Mobile Edge Computing (MEC), Fog Computing, and Cloud Computing, have been proposed to improve user experiences.
“…Existing quantization methods fall into post-training quantization (PTQ) or in-training / quantization-aware training (QAT). PTQ is applied after model training is complete, compressing models into 8-bit representations, and is relatively well supported by various libraries [3,4,5,6,7,8], such as TensorFlow Lite [9] and AIMET [10] for on-device deployment. However, almost no existing PTQ method supports customized quantization configurations that compress machine learning (ML) layers and kernels into sub-8-bit (S8B) regimes [11].…”
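As a reference point for the well-supported 8-bit PTQ path, here is the standard TensorFlow Lite conversion flow with full-integer quantization. The saved-model path and the `representative_data()` generator are placeholder assumptions.

```python
import numpy as np
import tensorflow as tf

def representative_data():
    # Placeholder calibration set: a few samples drawn from the training
    # distribution so the converter can estimate activation ranges.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
# Force full-integer (8-bit) kernels; sub-8-bit configurations such as those
# discussed in [11] are not expressible through this API.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
open("model_int8.tflite", "wb").write(tflite_model)
```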
For on-device automatic speech recognition (ASR), quantization-aware training (QAT) is ubiquitous for achieving the trade-off between model predictive performance and efficiency. A major drawback of existing QAT methods is that the quantization centroids must be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, "soft-to-hard" compression mechanism with self-adjustable centroids in a µ-law constrained space, resulting in a simpler yet more versatile quantization scheme called General Quantizer (GQ). We apply GQ to ASR tasks using Recurrent Neural Network Transducer (RNN-T) and Conformer architectures on both LibriSpeech and de-identified far-field datasets. Without accuracy degradation, GQ can compress both RNN-T and Conformer into sub-8-bit, and, for some RNN-T layers, down to 1-bit for fast and accurate inference. We observe a 30.73% memory footprint saving and a 31.75% user-perceived latency reduction compared to 8-bit QAT via physical device benchmarking.
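To unpack the "µ-law constrained space," the sketch below quantizes a weight tensor on a grid that is uniform after µ-law companding, so resolution concentrates near zero where most weights lie. It illustrates only the companding idea; GQ's self-adjustable centroids and soft-to-hard assignment are omitted, and the function is a hypothetical stand-in rather than the paper's algorithm.

```python
import numpy as np

def mu_law_quantize(w: np.ndarray, num_bits: int = 4, mu: float = 255.0) -> np.ndarray:
    """Quantize on a mu-law companded grid (illustration of the GQ search space)."""
    scale = np.abs(w).max()
    if scale == 0.0:
        return w
    x = w / scale                                            # normalize to [-1, 1]
    comp = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    levels = 2 ** num_bits - 1
    # Uniform rounding in companded space -> non-uniform centroids in weight space.
    q = np.round((comp + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0
    decomp = np.sign(q) * np.expm1(np.abs(q) * np.log1p(mu)) / mu
    return decomp * scale
```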