2018
DOI: 10.1007/978-3-030-01237-3_12
A Systematic DNN Weight Pruning Framework Using Alternating Direction Method of Multipliers

Abstract: Weight pruning methods for deep neural networks (DNNs) have been investigated recently, but prior work in this area is mainly based on heuristic, iterative pruning and therefore lacks guarantees on the weight reduction ratio and convergence time. To mitigate these limitations, we present a systematic weight pruning framework for DNNs using the alternating direction method of multipliers (ADMM). We first formulate the weight pruning problem of DNNs as a nonconvex optimization problem with combinatorial constraints specifying…
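Purely for orientation, the following is a minimal sketch of the alternating scheme the abstract describes, assuming PyTorch; `model`, `loss_fn`, `data_loader`, and the uniform per-layer sparsity are placeholders/assumptions, not the paper's exact setup.

```python
import torch

def project_topk(w, k):
    """Euclidean projection onto {card(w) <= k}: keep the k largest-magnitude entries."""
    if k <= 0:
        return torch.zeros_like(w)
    flat = w.flatten().abs()
    if k >= flat.numel():
        return w.clone()
    thresh = flat.topk(k).values.min()
    return torch.where(w.abs() >= thresh, w, torch.zeros_like(w))

def admm_prune(model, loss_fn, data_loader, sparsity=0.9, rho=1e-3,
               admm_steps=10, sgd_steps=100, lr=1e-3):
    # Prune only weight tensors (dim > 1); biases are left dense.
    params = [p for p in model.parameters() if p.dim() > 1]
    Z = [project_topk(p.detach().clone(), int(p.numel() * (1 - sparsity))) for p in params]
    U = [torch.zeros_like(p) for p in params]
    opt = torch.optim.SGD(model.parameters(), lr=lr)

    for _ in range(admm_steps):
        # W-update: a few SGD steps on loss(W) + (rho/2) * ||W - Z + U||^2.
        for step, (x, y) in enumerate(data_loader):
            if step >= sgd_steps:
                break
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            for p, z, u in zip(params, Z, U):
                loss = loss + (rho / 2) * (p - z + u).pow(2).sum()
            loss.backward()
            opt.step()
        # Z-update: project W + U onto the sparsity constraint, then dual (U) update.
        with torch.no_grad():
            for i, p in enumerate(params):
                k = int(p.numel() * (1 - sparsity))
                Z[i] = project_topk(p + U[i], k)
                U[i] = U[i] + p - Z[i]
    return Z  # sparse targets; a final hard-pruning/retraining pass typically follows
```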

Cited by 382 publications (244 citation statements) | References 27 publications
“…Problem Formulation: Consider an $N$-layer DNN, and we focus on the most computationally intensive CONV layers. The weights and biases of layer $k$ are denoted by $W_k$ and $b_k$, respectively, and the loss function of the DNN is denoted by $f(\{W_k\}_{k=1}^{N}, \{b_k\}_{k=1}^{N})$; see [64] for more details. In our discussion, $\{W_k\}_{k=1}^{N}$ and $\{b_k\}_{k=1}^{N}$ respectively denote the collections of weights and biases from layer 1 to layer $N$.…”
Section: Kernel Pattern and Connectivity Pruning Algorithm (citation type: mentioning; confidence: 99%)
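For reference, the optimization problem the excerpt above is setting up has the following general shape (a sketch; the constraint sets $S_k$, e.g. cardinality constraints $\mathrm{card}(W_k) \le \alpha_k$ as in the cited ADMM pruning work, are an assumption, since the excerpt does not spell them out):

```latex
% Weight-pruning problem for an N-layer DNN (sketch; S_k assumed to be
% cardinality-type constraint sets such as {W_k : card(W_k) <= alpha_k}).
\begin{aligned}
\min_{\{W_k\},\,\{b_k\}} \quad & f\bigl(\{W_k\}_{k=1}^{N}, \{b_k\}_{k=1}^{N}\bigr) \\
\text{subject to} \quad & W_k \in S_k, \qquad k = 1, \dots, N.
\end{aligned}
```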
“…Early efforts on DNN model compression [8,12,14,15,19,42,54] mainly rely on iterative and heuristic methods, with limited and non-uniform model compression rates. Recently, a systematic DNN model compression framework (ADMM-NN) has been developed using the powerful mathematical optimization tool ADMM (Alternating Direction Method of Multipliers) [4,21,39], currently achieving the best performance (in terms of model compression rate under the same accuracy) on weight pruning [49,64] and one of the best on weight quantization [35].…”
Section: Introduction (citation type: mentioning; confidence: 99%)
“…FPGA hardware accelerators [19], [20] have also been investigated to accommodate pruned CNNs by leveraging the reconfigurability of on-chip resources. Recently, the authors of [14] developed a systematic weight pruning framework based on the powerful optimization tool ADMM (Alternating Direction Method of Multipliers) [21]. This framework consistently achieves higher pruning rates than prior art.…”
Section: B. CNN Weight Pruning (citation type: mentioning; confidence: 99%)
“…Different from the existing ADMM-based approach [14], which performs pruning in the space domain, SPEC2 performs end-to-end spectral pruning without transforming between the spatial-domain and spectral-domain kernel weights. Such spectral pruning enables us to exploit computation redundancy in both the sliding-window operation and the spectral kernel weights.…”
Section: A. Overview (citation type: mentioning; confidence: 99%)
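As a toy illustration (not SPEC2 itself, and only assuming NumPy), the equivalence this excerpt relies on is that convolution in the space domain equals elementwise multiplication in the spectral domain, so zeroing spectral kernel entries removes those multiplications directly:

```python
import numpy as np

n = 16
rng = np.random.default_rng(0)
x = rng.standard_normal(n)        # input signal
h = np.zeros(n)
h[:3] = rng.standard_normal(3)    # small spatial kernel, zero-padded to length n

# Space domain: circular (periodic) convolution via an explicit sliding window.
y_spatial = np.array([sum(x[(i - j) % n] * h[j] for j in range(n)) for i in range(n)])

# Spectral domain: elementwise product of FFTs (convolution theorem).
H = np.fft.fft(h)
y_spectral = np.real(np.fft.ifft(np.fft.fft(x) * H))
assert np.allclose(y_spatial, y_spectral)

# Pruning the spectral kernel H removes multiplications one-for-one (lossy).
H_pruned = np.where(np.abs(H) >= np.quantile(np.abs(H), 0.5), H, 0.0)
y_approx = np.real(np.fft.ifft(np.fft.fft(x) * H_pruned))
```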
“…In order to deploy DNNs on these embedded devices, DNN model compression techniques, such as weight pruning, have been proposed for storage reduction and computation acceleration. Recently, works such as [5, 20] have made breakthroughs in weight pruning methods for DNNs while maintaining network accuracy. However, the network structure and weight storage after pruning become highly irregular, so the storage of indexing is non-negligible, which undermines both the compression ratio and performance.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
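To make the indexing overhead mentioned in the excerpt above concrete, here is a small sketch (assuming NumPy/SciPy, with an arbitrary 512×512 layer and 90% unstructured sparsity, both made-up numbers) showing that the CSR index arrays can rival the surviving weights in size:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)

# Unstructured magnitude pruning to 90% sparsity.
thresh = np.quantile(np.abs(W), 0.9)
W[np.abs(W) < thresh] = 0.0

S = csr_matrix(W)
value_bytes = S.data.nbytes                       # surviving weights (float32)
index_bytes = S.indices.nbytes + S.indptr.nbytes  # column indices + row pointers (int32)
print(f"values: {value_bytes} B, indices: {index_bytes} B, "
      f"effective compression vs. dense: {(512 * 512 * 4) / (value_bytes + index_bytes):.1f}x "
      f"(the ideal 10x is not reached because of the index storage)")
```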