Abstract: While neural networks have advanced the frontiers in many machine learning applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is vital to integrating modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additional noise it induces can lead to accuracy degradation. In this white paper, we present an overview of neural network…
“…While posttraining quantization (with 4, 8, and 16 bits) has been shown to reduce the model size by 4× and speed up inference by 2-3×, quantization-aware training is recommended for microcontroller-class models to mitigate layerwise quantization error due to a large range of weights across channels [47], [80]. This is achieved through the injection of simulated quantization operations, weight clamping, and fusion of special layers [51], allowing up to 8× model size reduction for same or lower accuracy drop. However, care must be taken to ensure that the target hardware supports the used bitwidth.…”
Section: A Common Model Compression Techniques (citation type: mentioning, confidence: 99%)
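To make the "injection of simulated quantization operations" concrete, below is a minimal sketch of a fake-quantization function of the kind inserted during QAT, using a straight-through estimator so gradients pass through the rounding step. The function name and the asymmetric min/max scheme are illustrative assumptions, not the exact procedure of [47], [51], or [80].

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate asymmetric uniform quantization in the forward pass (QAT sketch)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = torch.clamp(x_max - x_min, min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x_min / scale)
    # Quantize, clamp to the representable range, then dequantize.
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    dq = (q - zero_point) * scale
    # Straight-through estimator: forward uses dq, backward treats the op as identity.
    return x + (dq - x).detach()
```

During training, an op like this would wrap each weight tensor and activation; the weight clamping and layer fusion mentioned in the snippet happen before or alongside the insertion of these simulated quantizers.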
“…Pruning policies for intermittent computing treat pruning as a hyperparameter tuning problem, sweeping through the memory, energy, and accuracy spaces to build a Pareto frontier. Some frameworks [51], [54] provide support for structured pruning, allowing policies for channel and filter pruning rather than pruning weights in an irregular fashion.…”
Section: A Common Model Compression Techniques (citation type: mentioning)
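As an illustration of the structured-versus-irregular distinction, the sketch below ranks Conv2d filters by their L1 norm and keeps the strongest fraction, a common structured-pruning heuristic; it is a generic example, not the specific policy of [51] or [54].

```python
import torch

def prune_filters_l1(conv_weight: torch.Tensor, keep_ratio: float = 0.5):
    """Structured pruning sketch: keep whole filters (output channels) by L1 score.

    conv_weight has shape (out_channels, in_channels, kH, kW).
    """
    num_keep = max(1, int(conv_weight.shape[0] * keep_ratio))
    scores = conv_weight.abs().sum(dim=(1, 2, 3))   # one L1 norm per filter
    keep = torch.topk(scores, num_keep).indices.sort().values
    # Returning the surviving indices lets the next layer's input channels
    # be sliced to match, which is what makes the pruning "structured".
    return conv_weight[keep], keep
```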
Advances in machine learning (ML) have opened a new opportunity to bring intelligence to low-end Internet-of-Things (IoT) nodes such as microcontrollers. Conventional ML deployments have high memory and compute footprints, hindering their direct deployment on ultra-resource-constrained microcontrollers. This article highlights the unique requirements of enabling onboard ML for microcontroller-class devices. Researchers use a specialized model development workflow for resource-limited applications to ensure that the compute and latency budget stays within the device limits while still maintaining the desired performance. We characterize a closed-loop, widely applicable workflow of ML model development for microcontroller-class devices and show that several classes of applications adopt a specific instance of it. We present both qualitative and numerical insights into different stages of model development by showcasing several use cases. Finally, we identify the open research challenges and unsolved questions that demand careful consideration moving forward.
“…Computing resources are often most constrained precisely when workload utilization is high. Techniques such as pruning [156], quantization [157], [158], and aggregation can be applied to optimize the ML model. Similarly, as discussed in [159], the computational cost of a deep learning model can be reduced by lowering its spatial complexity, e.g., by pruning model parameters, parameter sharing, and network quantization.…”
Section: Prediction Layer (citation type: mentioning, confidence: 99%)
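To give one concrete reading of "parameter sharing," the sketch below clusters a weight tensor with a small 1-D k-means loop and replaces each weight by its centroid, so only the centroid table and per-weight indices need to be stored. This is a generic illustration, not the exact method of [159].

```python
import numpy as np

def share_weights_kmeans(weights: np.ndarray, num_clusters: int = 16, iters: int = 20):
    """Parameter-sharing sketch: 1-D k-means over weights, then centroid lookup."""
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), num_clusters)
    for _ in range(iters):
        # Assign every weight to its nearest centroid, then recenter.
        assign = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(num_clusters):
            members = flat[assign == k]
            if members.size:                 # leave empty clusters where they are
                centroids[k] = members.mean()
    shared = centroids[assign].reshape(weights.shape)
    return shared, centroids, assign.reshape(weights.shape)
```

With 16 clusters, each weight index needs only 4 bits plus a small shared centroid table, which is where the storage saving comes from.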
“…These techniques attempt to reduce computation cost while keeping accuracy nearly unchanged. In [158], researchers at Qualcomm AI Research investigate how quantization can reduce the computational cost and latency of neural networks. The authors also discuss the AI Model Efficiency Toolkit (AIMET), a library for quantization and compression of AI models.…”
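For orientation, the snippet below shows the general shape of an AIMET quantization-simulation flow with its PyTorch frontend. Treat the argument names as version-dependent rather than authoritative; the tiny model and the `calibration_pass` callback are placeholder assumptions standing in for a trained network and a real calibration set.

```python
import torch
from aimet_torch.quantsim import QuantizationSimModel

# Placeholder model; in practice this would be a trained torch.nn.Module.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU())

def calibration_pass(m, _):
    # Run a few representative batches so AIMET can observe activation
    # ranges and compute quantization encodings.
    m(torch.randn(1, 3, 224, 224))

sim = QuantizationSimModel(model,
                           dummy_input=torch.randn(1, 3, 224, 224),
                           default_param_bw=8,    # weight bitwidth
                           default_output_bw=8)   # activation bitwidth
sim.compute_encodings(calibration_pass, None)
# sim.model is now a quantization-simulated model that can be evaluated
# or fine-tuned (QAT) before export.
```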
Internet of Things (IoT) services have grown substantially in recent years. Consequently, IoT service providers (SPs) are emerging in the market and competing to offer their services. Many IoT applications utilize these services in an integrated manner with different Quality-of-Service (QoS) requirements. Thus, the provisioning of end-to-end QoS is becoming indispensable for IoT platforms. However, provisioning the system using only QoS metrics, without considering user experience, is not sufficient. Recently, the Quality-of-Experience (QoE) model has become a promising approach to quantify the actual user experience of services. A holistic design approach that jointly considers the constraints of various QoS/QoE metrics is needed to satisfy the requirements of these applications and services. Besides, IoT services may operate in environments with limited resources. Therefore, effective management of services and system resources is essential for QoS/QoE support. This paper provides a comprehensive survey of state-of-the-art studies on IoT services from a QoS/QoE perspective. Our contributions are threefold: (1) a QoE-driven architecture is demonstrated by classifying vital components according to QoE-related functions in prior studies, (2) QoE metrics and QoE optimization objectives are classified by the corresponding system and resource control problems in the architecture, and (3) QoE-aware resource management, e.g., QoE-aware offloading, placement, and data caching policies with recent machine learning approaches, is extensively reviewed.
INDEX TERMS: Internet of Things, Quality of Service, Quality of Experience, IoT services, IoT applications, QoS for IoT services, QoS metrics, QoE metrics, IoT architecture

Recently, emerging multilayer IoT architectures, e.g., Mobile Edge Computing (MEC), Fog Computing, and Cloud Computing, have been proposed to improve user experiences.
“…Existing quantization methods fall into post-training quantization (PTQ) or in-training / quantization-aware training (QAT). PTQ is applied after model training is complete, compressing models into 8-bit representations, and is relatively well supported by various libraries [3,4,5,6,7,8], such as TensorFlow Lite [9] and AIMET [10] for on-device deployment. However, almost no existing PTQ method supports customized quantization configurations that compress machine learning (ML) layers and kernels into sub-8-bit (S8B) regimes [11].…”
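As a reference point for the well-supported 8-bit PTQ path, here is the standard TensorFlow Lite conversion flow with full-integer quantization. The saved-model path and the `representative_data()` generator are placeholder assumptions.

```python
import numpy as np
import tensorflow as tf

def representative_data():
    # Placeholder calibration set: a few samples drawn from the training
    # distribution so the converter can estimate activation ranges.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
# Force full-integer (8-bit) kernels; sub-8-bit configurations such as those
# discussed in [11] are not expressible through this API.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
open("model_int8.tflite", "wb").write(tflite_model)
```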
For on-device automatic speech recognition (ASR), quantization-aware training (QAT) is ubiquitous for achieving the trade-off between model predictive performance and efficiency. A major drawback of existing QAT methods is that the quantization centroids must be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, "soft-to-hard" compression mechanism with self-adjustable centroids in a µ-law constrained space, resulting in a simpler yet more versatile quantization scheme called General Quantizer (GQ). We apply GQ to ASR tasks using Recurrent Neural Network Transducer (RNN-T) and Conformer architectures on both LibriSpeech and de-identified far-field datasets. Without accuracy degradation, GQ can compress both RNN-T and Conformer into sub-8-bit, and, for some RNN-T layers, down to 1-bit for fast and accurate inference. We observe a 30.73% memory footprint saving and a 31.75% user-perceived latency reduction compared to 8-bit QAT via physical device benchmarking.
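To unpack the "µ-law constrained space," the sketch below quantizes a weight tensor on a grid that is uniform after µ-law companding, so resolution concentrates near zero where most weights lie. It illustrates only the companding idea; GQ's self-adjustable centroids and soft-to-hard assignment are omitted, and the function is a hypothetical stand-in rather than the paper's algorithm.

```python
import numpy as np

def mu_law_quantize(w: np.ndarray, num_bits: int = 4, mu: float = 255.0) -> np.ndarray:
    """Quantize on a mu-law companded grid (illustration of the GQ search space)."""
    scale = np.abs(w).max()
    if scale == 0.0:
        return w
    x = w / scale                                            # normalize to [-1, 1]
    comp = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    levels = 2 ** num_bits - 1
    # Uniform rounding in companded space -> non-uniform centroids in weight space.
    q = np.round((comp + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0
    decomp = np.sign(q) * np.expm1(np.abs(q) * np.log1p(mu)) / mu
    return decomp * scale
```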