2021
DOI: 10.48550/arxiv.2102.04503
Preprint

VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

Steve Dai, Rangharajan Venkatesan, Haoxing Ren, et al.

Abstract: Quantization enables efficient acceleration of deep neural networks by reducing model memory footprint and exploiting low-cost integer math hardware units. Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors. Excessive quantization, reducing precision too aggressively, results in accuracy degradation. When scale factors are shared at a coarse granularity across many dimensions of each tensor, effective precision of individual elements w…
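The abstract describes mapping floating-point weights and activations to low-bitwidth integers with scale factors, and the paper's central idea is to share a scale factor per small vector of elements rather than per tensor. The NumPy sketch below is an illustration of that idea only, not the authors' implementation; the function name vs_quant, the 16-element vector size, the 4-bit setting, and the max-magnitude calibration are assumptions chosen for the example (the paper also uses a second-level scale, which is omitted here).

```python
import numpy as np

def vs_quant(weights, bits=4, vector_size=16):
    """Quantize a 1-D weight array with one scale factor per small vector.

    Illustrative sketch only: real per-vector scaled quantization also
    handles activations and uses a second-level scale factor.
    """
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for signed 4-bit
    # Pad so the tensor splits evenly into vectors of `vector_size` elements.
    pad = (-len(weights)) % vector_size
    w = np.pad(weights, (0, pad))
    vectors = w.reshape(-1, vector_size)

    # One scale factor per vector, derived from that vector's max magnitude.
    scales = np.abs(vectors).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                        # avoid divide-by-zero

    q = np.clip(np.round(vectors / scales), -qmax - 1, qmax).astype(np.int8)
    dequant = (q * scales).flatten()[:len(weights)]  # reconstruction for error check
    return q, scales, dequant

# Example: per-vector scaling keeps small-magnitude vectors from being
# crushed by a single outlier located elsewhere in the tensor.
w = np.random.randn(64).astype(np.float32)
w[3] = 25.0                                          # outlier in one vector only
q, s, w_hat = vs_quant(w, bits=4, vector_size=16)
print("mean abs error:", np.abs(w - w_hat).mean())
```

With the outlier confined to one 16-element vector, only that vector's scale grows; the remaining vectors keep fine-grained resolution, which is the intuition behind per-vector scale factors as opposed to a single per-tensor scale.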

Cited by 3 publications (2 citation statements) · References 21 publications (28 reference statements)

Citation statements:
“…Weight quantization, on the other hand, reduces the numerical precision of the model parameters, leading to significant reductions in both model size and computational requirements. Various weight quantization techniques have been proposed, including binary [38], ternary [39], and vector quantization [40]. Despite the advantages of weight quantization, it may introduce quantization errors that can affect the model's performance, especially when extreme quantization levels are applied.…”
Section: Model Compression Methods
confidence: 99%
“…Post-training quantization (PTQ) enables the user to convert an already trained float model and quantize it without retraining [10,23,7,11]. However, it can also result in drastic reduction in model quality.…”
Section: Related Work
confidence: 99%
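The statement above notes that post-training quantization (PTQ) converts an already-trained float model without retraining, but can sharply reduce model quality. As a rough illustration of the simplest form of that workflow, the sketch below derives a single per-tensor scale from max-magnitude calibration; the names ptq_per_tensor and calib_acts are hypothetical, and this is not the procedure used by the cited PTQ works, which apply more refined calibration.

```python
import numpy as np

def ptq_per_tensor(weights, calib_acts, bits=8):
    """Naive post-training quantization: one scale per tensor, no retraining.

    Illustrative sketch only; coarse per-tensor scales like these are the
    kind of granularity that per-vector scaling aims to improve on.
    """
    qmax = 2 ** (bits - 1) - 1
    w_scale = np.abs(weights).max() / qmax       # scale for the whole weight tensor
    a_scale = np.abs(calib_acts).max() / qmax    # activation scale from a calibration batch
    q_w = np.clip(np.round(weights / w_scale), -qmax - 1, qmax).astype(np.int8)
    return q_w, w_scale, a_scale

w = np.random.randn(256, 256).astype(np.float32)
acts = np.random.randn(1024, 256).astype(np.float32)   # stand-in calibration data
q_w, w_scale, a_scale = ptq_per_tensor(w, acts)
print(q_w.dtype, w_scale, a_scale)
```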