2020
DOI: 10.1609/aaai.v34i04.5954

PCONV: The Missing but Desirable Sparsity in DNN Weight Pruning for Real-Time Execution on Mobile Devices

Abstract: Model compression techniques on Deep Neural Network (DNN) have been widely acknowledged as an effective way to achieve acceleration on a variety of platforms, and DNN weight pruning is a straightforward and effective method. There are currently two mainstreams of pruning methods representing two extremes of pruning regularity: non-structured, fine-grained pruning can achieve high sparsity and accuracy, but is not hardware friendly; structured, coarse-grained pruning exploits hardware-efficient structures in pr…
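The abstract contrasts the two pruning extremes it names. As a rough, hedged illustration (not the paper's code), the sketch below builds a fine-grained magnitude mask and a whole-filter mask over a toy convolution weight tensor; the tensor shape, the 75% sparsity target, and the L2 filter-norm criterion are assumptions made only for this example.

```python
import numpy as np

# Illustrative sketch only: the two pruning extremes named in the abstract,
# applied to a toy Conv weight tensor of shape (out_channels, in_channels, 3, 3).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4, 3, 3))
sparsity = 0.75  # assumed target, for illustration

# Non-structured (fine-grained) pruning: zero the smallest-magnitude weights
# anywhere in the tensor -> high sparsity, but irregular and hardware-unfriendly.
threshold = np.quantile(np.abs(W), sparsity)
fine_mask = np.abs(W) > threshold

# Structured (coarse-grained) pruning: drop whole filters with the smallest
# L2 norm -> regular and hardware friendly, but coarse.
filter_norms = np.linalg.norm(W.reshape(W.shape[0], -1), axis=1)
keep = filter_norms >= np.quantile(filter_norms, sparsity)
coarse_mask = np.zeros_like(W, dtype=bool)
coarse_mask[keep] = True

print("fine-grained sparsity:", 1 - fine_mask.mean())
print("structured sparsity:  ", 1 - coarse_mask.mean())
```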

Cited by 132 publications (83 citation statements) | References 29 publications
“…Chen et al. proposed sparse complementary convolution, in which half of the weights, arranged in regular patterns within the original convolution kernels, can be removed with little accuracy loss [8]. Ma et al. proposed pattern-based kernel pruning [29]. The convolution kernels can only be pruned to one of several pre-defined patterns, so the pruned model retains some regular structure.…”
Section: Related Work
mentioning
confidence: 99%
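To make the quoted idea of pre-defined kernel patterns concrete, here is a minimal, hypothetical sketch (not Ma et al.'s actual pattern set or implementation): each 3x3 kernel is projected onto whichever of a few made-up 4-entry patterns preserves the most weight magnitude.

```python
import numpy as np

# Hypothetical pattern-based kernel pruning: every 3x3 kernel keeps only the
# entries of one pre-defined pattern. The patterns below are invented for
# illustration; PCONV derives its own small pattern set.
PATTERNS = [
    np.array([[0, 1, 0], [1, 1, 1], [0, 0, 0]], dtype=bool),
    np.array([[0, 0, 0], [1, 1, 1], [0, 1, 0]], dtype=bool),
    np.array([[0, 1, 0], [1, 1, 0], [0, 1, 0]], dtype=bool),
    np.array([[0, 1, 0], [0, 1, 1], [0, 1, 0]], dtype=bool),
]

def prune_to_pattern(kernel):
    """Keep the pre-defined pattern that preserves the most weight magnitude."""
    scores = [np.abs(kernel[p]).sum() for p in PATTERNS]
    best = PATTERNS[int(np.argmax(scores))]
    return kernel * best

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4, 3, 3))            # (out, in, kh, kw)
W_pruned = np.stack([[prune_to_pattern(k) for k in filt] for filt in W])
print("kept weights per kernel:", int((W_pruned[0, 0] != 0).sum()))  # -> 4
```

Because every kernel ends up with the same number of nonzeros in one of a few known layouts, the resulting sparsity is regular enough for a compiler or hardware backend to exploit.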
“…The performance gains are limited due to the sparse nature of the computation. Another approach is to design more hardware-amenable pruning strategies [8,29]. For example, a hybrid strategy combining structured and non-structured pruning can achieve good accuracy while maintaining some regular patterns in the pruned model for efficient hardware processing [29,33].…”
mentioning
confidence: 99%
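The regularity argument above can be illustrated with an assumed storage format (again a sketch, not any paper's actual runtime): once kernels are constrained to a small fixed pattern set, each pruned 3x3 kernel reduces to a pattern index plus its few kept weights, which is the kind of layout a compiler or hardware backend can process efficiently.

```python
import numpy as np

# Assumed compact encoding for pattern-pruned kernels (illustration only):
# store (pattern_id, 4 kept weights) instead of 9 mostly-zero weights.
PATTERNS = [
    np.array([[0, 1, 0], [1, 1, 1], [0, 0, 0]], dtype=bool),
    np.array([[0, 0, 0], [1, 1, 1], [0, 1, 0]], dtype=bool),
]

def encode(kernel):
    """Return (pattern_id, kept_weights) for a kernel whose zeros match a pattern."""
    for pid, p in enumerate(PATTERNS):
        if np.array_equal(kernel != 0, p):
            return pid, kernel[p]          # 1 small index + 4 floats instead of 9
    raise ValueError("kernel does not match any pre-defined pattern")

def decode(pid, values):
    """Rebuild the dense 3x3 kernel from its compact form."""
    kernel = np.zeros((3, 3))
    kernel[PATTERNS[pid]] = values
    return kernel

k = np.array([[0.0, 0.3, 0.0], [0.5, -1.2, 0.8], [0.0, 0.0, 0.0]])
pid, vals = encode(k)
assert np.allclose(decode(pid, vals), k)
```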
“…For example, Park et al. (2017) proposed that weight pruning could be used to reduce the complexity of a deep neural network, thereby improving its real-time performance on mobile platforms. Ma et al. (2020) proposed that weights can be quantized to reduce the computational cost of executing deep neural network applications on mobile platforms. Furthermore, Zhou et al. (2019) proposed a new deep neural network designed specifically for mobile platforms to complete the same task quickly and accurately, with supporting experimental results.…”
Section: Performance Optimization Of Neural Network For Mobile-cloud
mentioning
confidence: 99%
“…Chen et al. [16] introduced the idea of model tensorization, defining each node of the computation graph as a tensor expression and using machine learning to find the best mapping from tensor expressions to low-level programs. Ma et al. [17] proposed a new pruning pattern and developed a novel compiler for these vision-inspired convolution kernels to assist DNN inference, achieving real-time execution of PCONV models without sacrificing accuracy. However, these methods aim to optimize model execution time and do not address the self-adaptation of deep learning models.…”
unclassified