2022
DOI: 10.1145/3508390
TAB: Unified and Optimized Ternary, Binary, and Mixed-precision Neural Network Inference on the Edge

Abstract: Ternary Neural Networks (TNNs) and mixed-precision Ternary-Binary Networks (TBNs) have demonstrated higher accuracy than Binary Neural Networks (BNNs) while providing fast, low-power, and memory-efficient inference. Related works have improved the accuracy of TNNs and TBNs but overlooked their optimization on CPU and GPU platforms. First, there is no unified encoding for the binary and ternary values in TNNs and TBNs. Second, existing works store the 2-bit quantized data sequentially in 32/64-bit integ…
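The two problems the abstract names (no unified binary/ternary encoding, and 2-bit values stored sequentially rather than bit-packed) can be made concrete with a small bit-plane sketch. The encoding below, including the nz/sg bit-planes and all function names, is an assumption for illustration, not necessarily TAB's actual scheme: each ternary value is split across two packed bitmaps, and one AND/XOR/popcount kernel then serves both ternary-ternary and ternary-binary dot products.

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>

// Hypothetical bit-plane encoding: 64 ternary values {-1, 0, +1} per word pair.
struct TernaryWord {
    uint64_t nz;  // bit i = 1 iff value i is nonzero
    uint64_t sg;  // bit i = 1 iff value i is -1 (only meaningful where nz is set)
};

// Ternary-ternary dot product over 64 lanes with AND/XOR/popcount.
inline int ternary_dot(TernaryWord a, TernaryWord b) {
    uint64_t nz  = a.nz & b.nz;         // product is nonzero only where both inputs are
    uint64_t neg = (a.sg ^ b.sg) & nz;  // product is -1 where the signs differ
    return std::popcount(nz) - 2 * std::popcount(neg);
}

// A binary operand ({-1, +1}) is the special case nz = all-ones, so the same
// kernel also serves mixed ternary-binary (TBN) layers.
inline int ternary_binary_dot(TernaryWord t, uint64_t b_sign) {
    return ternary_dot(t, TernaryWord{~0ULL, b_sign});
}
```

Because a binary operand is just a ternary operand whose nonzero mask is all ones, a single kernel can cover TNN-, TBN-, and BNN-style layers, which gestures at the kind of unification the abstract describes.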

Cited by 5 publications (3 citation statements) · References 31 publications
“…These instructions can noticeably speed up eight-bit QNN inference [16]. Fast implementations are also available for ternary [17-19] and binary networks [18, 20]. However, binary and ternary networks still suffer from accuracy loss compared to full-precision or eight-bit quantized networks with a similar number of parameters and architecture, which limits their suitability for certain tasks.…”
Section: Related Work
confidence: 99%
“…There are 21 pairs (N_x, N_w) which satisfy (6): (255, 3), (127, 5), (85, 7), (63, 9), (51, 11), (43, 13), (37, 15), (31, 17), (29, 19), (25, 21), (23, 23), and the symmetrical ones. If we compute the average bitwidth required to store x and w as (log2 N_x + log2 N_w)/2, we obtain values in the range 4.51-4.79.…”
Section: High-Performance Matrix Multiplication
confidence: 99%
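As a quick check of the quoted arithmetic (constraint (6) itself is not reproduced in the excerpt), the sketch below recomputes the average bitwidth (log2 N_x + log2 N_w)/2 for the listed pairs; the results do fall in roughly the quoted 4.51-4.79 range.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // The eleven distinct (N_x, N_w) pairs from the quotation; the symmetric
    // counterparts give the same average bitwidth and are omitted.
    const int pairs[][2] = {{255, 3}, {127, 5}, {85, 7},  {63, 9},  {51, 11},
                            {43, 13}, {37, 15}, {31, 17}, {29, 19}, {25, 21},
                            {23, 23}};
    for (const auto& p : pairs) {
        double avg = (std::log2(p[0]) + std::log2(p[1])) / 2.0;
        std::printf("(%3d, %2d) -> average bitwidth %.2f\n", p[0], p[1], avg);
    }
    return 0;
}
```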
“…Quantization with lower precision can further reduce memory consumption and computation. Ultra-low-precision (1- or 2-bit) operations can often be computed efficiently with bit-wise arithmetic, achieving significant computation acceleration [28]. However, due to the large quantization noise, the benefits of low-precision quantization often come at the cost of significant accuracy degradation.…”
Section: Introduction
confidence: 99%
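The claim that 1-bit operations map onto bit-wise arithmetic is commonly realized with an XNOR/popcount kernel. The sketch below is a generic illustration of that idea, not the cited paper's exact kernel: 64 values in {-1, +1} are packed one per bit, and a full dot product reduces to one XOR and one popcount.

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>

// Generic XNOR/popcount-style binary dot product: 64 values in {-1, +1}
// packed one per bit (bit = 1 encodes +1, bit = 0 encodes -1).
// dot = agreements - disagreements = 64 - 2 * popcount(a XOR b).
inline int binary_dot64(uint64_t a, uint64_t b) {
    return 64 - 2 * std::popcount(a ^ b);
}
```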