2021
DOI: 10.48550/arxiv.2102.10462
Preprint
BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization

Cited by 8 publications (14 citation statements)
References 16 publications
“…Quantization reduces the number of data bits and parameter bits, and it is an actively studied area with a broad adoption in the industry [10,15,18,39,50]. There are many flavors of quantization including binary parameterization [10,39], low-precision fixed-point [15,32], and mixed-precision training [4,54]. While many of them require a dedicated hardware-level support, we limit our focus to the pure algorithmic solutions, and consider combining our method with an algorithmic quantization in Section 6.…”
Section: DNN Compression Methods (mentioning)
confidence: 99%
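To make the "low-precision fixed-point" flavor mentioned in the excerpt above concrete, the following minimal sketch applies symmetric per-tensor uniform quantization to a weight array. The function name, the 4-bit setting, and the per-tensor scaling choice are illustrative assumptions, not details taken from BSQ or the citing paper.

import numpy as np

def quantize_uniform(w, bits=8):
    """Symmetric uniform (fixed-point style) quantization of a weight array.

    Maps floating-point weights onto evenly spaced integer levels and returns
    the de-quantized values, so the rounding error is directly visible.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    scale = np.max(np.abs(w)) / qmax      # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                      # de-quantized weights

w = np.random.randn(4, 4).astype(np.float32)
print(np.abs(w - quantize_uniform(w, bits=4)).max())   # quantization error at 4 bits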
“…However, uniformly quantizing a model to ultra low-precision can cause significant accuracy degradation. It is possible to address this with mixed-precision quantization [51,80,100,180,191,201,226,233,236,250,273]. In this approach, each layer is quantized with different bit precision, as illustrated in Figure 8.…”
Section: B. Mixed-Precision Quantization (mentioning)
confidence: 99%
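As a rough illustration of the per-layer mixed precision described in this excerpt, the sketch below quantizes each layer's weights with its own bit width. The layer names and the specific bit allocation are hypothetical, chosen only to show the mechanism, not taken from any of the cited works.

import numpy as np

# Hypothetical per-layer bit allocation: sensitive first and last layers keep
# more bits, middle layers are pushed to lower precision.
bit_allocation = {"conv1": 8, "conv2": 4, "conv3": 2, "fc": 8}

def quantize_uniform(w, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

layers = {name: np.random.randn(64, 64).astype(np.float32) for name in bit_allocation}
quantized = {name: quantize_uniform(w, bit_allocation[name]) for name, w in layers.items()}

for name, w in layers.items():
    err = np.abs(w - quantized[name]).mean()
    print(f"{name}: {bit_allocation[name]} bits, mean abs error {err:.4f}")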
“…While the previously discussed work [10,11,25,28,39] takes a more systematic approach, others [15,35,36,38] leverage machine learning to address the challenge of mixed precision's large search space. [15,36] are more heavy-handed in their approaches.…”
Section: Mixed Precision Quantization (mentioning)
confidence: 99%
“…They also show that keeping NAS and quantization as separate processes yields models that perform worse than their combined NN+Quantization search with respect to accuracy, model size, and energy efficiency. [35,38] take a more traditional QAT approach when finding the best mixed precision schemes by learning the best mixed precision quantization parameters during QAT. [35] claims that learning the quantization function's parameters is possible if a good parameterization is chosen during training.…”
Section: Mixed Precision (mentioning)
confidence: 99%
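The claim that the quantization function's parameters can be learned during QAT, given a good parameterization, can be sketched with a learnable step size and a straight-through estimator. This is a generic PyTorch illustration under those assumptions, not the parameterization used in BSQ or in references [35,38]; the class name and initial step value are made up for the example.

import torch
import torch.nn as nn

class LearnedStepQuantizer(nn.Module):
    """Quantizer whose step size is a trainable parameter (QAT-style).

    Rounding has zero gradient almost everywhere, so the straight-through
    estimator passes gradients around torch.round via the detach trick.
    """
    def __init__(self, bits=4, init_step=0.05):
        super().__init__()
        self.qmax = 2 ** (bits - 1) - 1
        self.step = nn.Parameter(torch.tensor(init_step))

    def forward(self, w):
        scaled = w / self.step
        q = torch.clamp(torch.round(scaled), -self.qmax, self.qmax)
        # Straight-through estimator: forward uses the rounded q,
        # backward sees the identity on `scaled`.
        q = scaled + (q - scaled).detach()
        return q * self.step

quant = LearnedStepQuantizer(bits=4)
w = torch.randn(8, 8, requires_grad=True)
loss = (quant(w) ** 2).mean()
loss.backward()
print(quant.step.grad)   # the step size receives a gradient and can be learned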