Hessian-Aware Pruning and Optimal Neural Implant

Yu, Shixing; Yao, Zhewei; Gholami, Amir; Dong, Zhen; Kim, Sehoon; Mahoney, Michael W.; Keutzer, Kurt

doi:10.1109/wacv51458.2022.00372

Cited by 20 publications

(10 citation statements)

References 36 publications


“…It is easy to see this through a second-order Taylor series expansion, where the perturbation is dependent on not just the weight magnitude but also the Hessian (LeCun et al, 1990). As such there are several works that use second-order based pruning (LeCun et al, 1990; Hassibi and Stork, 1993; Hassibi et al, 1993; Wang et al, 2019a; Yu et al, 2021).…”

Section: Technology State-of-the-art (mentioning)

confidence: 99%
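The expansion this statement refers to can be written out as below. The symbols (loss L, weights w, gradient g, Hessian H, unit vector e_i) are notation chosen here for illustration and do not come from the quoted text; the last line is the diagonal-Hessian saliency usually associated with LeCun et al. (1990).

```latex
\begin{align}
\Delta L &= L(w + \delta w) - L(w)
  \approx g^{\top} \delta w + \tfrac{1}{2}\, \delta w^{\top} H\, \delta w,
  \qquad g = \nabla L(w),\quad H = \nabla^{2} L(w), \\
\Delta L &\approx \tfrac{1}{2}\, \delta w^{\top} H\, \delta w
  \qquad \text{(assuming a converged model, so } g \approx 0\text{)}, \\
\Delta L_i &\approx \tfrac{1}{2}\, H_{ii}\, w_i^{2}
  \qquad \text{(pruning only weight } i\text{: } \delta w = -w_i e_i\text{)}.
\end{align}
```

Under these assumptions, two weights of equal magnitude can have very different pruning costs when their curvature terms H_ii differ, which is the point the statement makes against purely magnitude-based pruning.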

Applications and Techniques for Fast Machine Learning in Science

Deiana, Tran, Agar et al. 2022. Front. Big Data

Self Cite


In this community review report, we discuss applications and techniques for fast machine learning (ML) in science—the concept of integrating powerful ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlapping challenges across the multiple scientific domains where common solutions can be found. This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions. This is followed by a high-level overview and organization of technical advances, including an abundance of pointers to source material, which can enable these breakthroughs.


“…It reduces the DNN model size and lower computation cost by replacing the floating point weights with low precision fixed-point data. Common quantization methods including directly apply uniform quantizers [24], [25], quantization-aware fine-tuning [26] and mixed-precision quantization [7], [8], [27]. Quantization can largely decrease DNN's arithmetic intensity but still cause significant accuracy degradation for ultra-low data precision.…”

Section: B Model Compression (mentioning)

confidence: 99%

HCE: Improving Performance and Efficiency with Heterogeneously Compressed Neural Network Ensemble

Zhang, Yang, Li 2023. Preprint


Ensemble learning has gained attention in recent deep learning research as a way to further boost the accuracy and generalizability of deep neural network (DNN) models. Recent ensemble training methods explore different training algorithms or settings on multiple sub-models with the same model architecture, which leads to a significant burden on the memory and computation cost of the ensemble model. Meanwhile, the heuristically induced diversity may not lead to significant performance gain. We propose a new perspective on exploring the intrinsic diversity within a model architecture to build efficient DNN ensembles. We make an intriguing observation that pruning and quantization, while both leading to efficient model architectures at the cost of a small accuracy drop, lead to distinct behavior in the decision boundary. To this end, we propose Heterogeneously Compressed Ensemble (HCE), where we build an efficient ensemble with the pruned and quantized variants from a pretrained DNN model. A diversity-aware training objective is proposed to further boost the performance of the HCE ensemble. Experimental results show that HCE achieves significant improvement in the efficiency-accuracy tradeoff compared to both traditional DNN ensemble training methods and previous model compression methods.


“…Finally, Q-BERT (Shen et al, 2020) employed approximate information about the Hessian spectrum in order to choose the quantization bit-widths applied to each layer. Follow-up work (Yu et al, 2022) applied a similar approach to structured pruning in the context of convolutional and language models, using an approximation of the Hessian trace to decide which layers should be pruned. We note that this approach is quite different from the one we employ here, as we use completely different inverse-Hessian approximations to perform pruning decisions.…”

Section: Background and Related Work (mentioning)

confidence: 99%
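The statement above mentions "an approximation of the Hessian trace" without saying how it is obtained. One standard way to estimate a trace from matrix-vector products alone is Hutchinson's randomized estimator, sketched below. This is a sketch under stated assumptions, not the cited authors' implementation: the names `hutchinson_trace` and `make_spd`, the toy layers, and the explicit Hessian matrices are all hypothetical. In a real network the `matvec` callable would typically be a Hessian-vector product computed by automatic differentiation.

```python
import numpy as np


def hutchinson_trace(matvec, dim, num_samples=200, seed=0):
    """Estimate trace(H) using only products H @ v (Hutchinson's estimator)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe: entries are +1 or -1, so E[v v^T] = I
        # and therefore E[v^T H v] = trace(H).
        v = rng.choice([-1.0, 1.0], size=dim)
        total += v @ matvec(v)
    return total / num_samples


def make_spd(dim, scale, seed):
    """Hypothetical stand-in for a per-layer Hessian block (symmetric PSD)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((dim, dim))
    return scale * (a @ a.T) / dim


# Two toy "layers": one with high curvature, one with low curvature.
layers = {
    "layer_a": make_spd(50, 5.0, seed=1),
    "layer_b": make_spd(50, 0.1, seed=2),
}

# Estimate each layer's Hessian trace from matrix-vector products only.
estimates = {
    name: hutchinson_trace(lambda v, h=h: h @ v, h.shape[0])
    for name, h in layers.items()
}

# Layers sorted from smallest to largest estimated trace; under a
# trace-based sensitivity heuristic the first entries would be pruned first.
ranked = sorted(estimates, key=estimates.get)
```

The estimator never forms or inverts the Hessian, which is what makes a trace-based criterion cheap compared with the inverse-Hessian approximations the quoted passage contrasts it with.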

The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

Kurtic, Campos, Nguyen et al. 2022. Preprint


Pre-trained Transformer-based language models have become a key building block for natural language processing (NLP) tasks. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, structured and unstructured pruning are known to be applicable to decrease model size and increase inference speed. In this context, this paper's contributions are two-fold. We begin with an in-depth study of the accuracy-compression trade-off for unstructured weight pruning in the context of BERT models, and introduce Optimal BERT Surgeon (O-BERT-S), an efficient and accurate weight pruning method based on approximate second-order information, which we show to yield state-of-the-art results in terms of the compression/accuracy trade-off. Specifically, Optimal BERT Surgeon extends existing work on second-order pruning by allowing for pruning blocks of weights, and by being applicable at BERT scale. Second, we investigate the impact of this pruning method when compounding compression approaches for Transformer-based models, which allows us to combine state-of-the-art structured and unstructured pruning together with quantization, in order to obtain highly compressed, but accurate models. The resulting compression framework is powerful, yet general and efficient: we apply it to both the fine-tuning and pre-training stages of language tasks, to obtain state-of-the-art results on the accuracy-compression trade-off with relatively simple compression recipes. For example, we obtain 10x model size compression with < 1% relative drop in accuracy to the dense BERT-base, 10x end-to-end CPU-inference speedup with < 2% relative drop in accuracy, and 29x inference speedups with < 7.5% relative accuracy drop.
