Peiyan Dong scite author profile

Automatic speech recognition based on recurrent neural networks (RNNs) has become prevalent on mobile devices such as smartphones. However, previous RNN compression techniques suffer either from hardware performance overhead due to irregularity, or from significant accuracy loss due to the regularity preserved for hardware friendliness. In this work, we propose RTMobile, which leverages both a novel block-based pruning approach and compiler optimizations to accelerate RNN inference on mobile devices. RTMobile is the first work that can achieve real-time RNN inference on mobile platforms. Experimental results demonstrate that RTMobile significantly outperforms existing RNN hardware acceleration methods in terms of inference accuracy and time. Compared with prior FPGA-based work, RTMobile running a GRU on the Adreno 640 embedded GPU improves energy efficiency by about 40× while maintaining the same inference time.
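The abstract does not describe the block-based pruning procedure itself. As a minimal sketch, assuming it means splitting the weight matrix into horizontal blocks of rows and keeping only the strongest columns inside each block, the idea might look like the following. The function name `block_column_prune`, its parameters, and the L2-norm ranking are illustrative assumptions, not RTMobile's actual method.

```python
from math import sqrt

def block_column_prune(weights, block_rows, keep_cols):
    """Zero the weakest columns inside each horizontal block of rows.

    weights    : list of equal-length rows (hypothetical dense weight matrix)
    block_rows : rows per block (must divide the number of rows)
    keep_cols  : columns kept per block, ranked by L2 norm within the block
    """
    if len(weights) % block_rows != 0:
        raise ValueError("block_rows must divide the number of rows")
    n_cols = len(weights[0])
    pruned = [row[:] for row in weights]  # leave the input untouched
    for start in range(0, len(weights), block_rows):
        block = range(start, start + block_rows)
        # L2 norm of each column, restricted to the rows of this block
        norms = [sqrt(sum(weights[r][c] ** 2 for r in block)) for c in range(n_cols)]
        ranked = sorted(range(n_cols), key=lambda c: norms[c], reverse=True)
        keep = set(ranked[:keep_cols])
        for r in block:
            for c in range(n_cols):
                if c not in keep:
                    pruned[r][c] = 0.0
    return pruned

# Toy 4x4 matrix: two blocks of two rows, two columns kept per block.
W = [
    [0.9, 0.1, -0.8, 0.05],
    [0.7, -0.2, 0.6, 0.0],
    [0.1, 0.9, 0.05, -0.7],
    [-0.2, 0.8, 0.1, 0.6],
]
P = block_column_prune(W, block_rows=2, keep_cols=2)
for row in P:
    print(row)
```

Because each block drops whole columns, the surviving weights inside a block stay dense and regular, while different blocks are free to keep different columns. That is one way to trade off the irregularity and accuracy-loss extremes the abstract contrasts.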

RTMobile: Beyond Real-Time Mobile Acceleration of RNNs for Speech Recognition

Dong, Wang, Niu et al. 2020. Preprint.

SPViT: Enabling Faster Vision Transformers via Soft Token Pruning

Kong, Dong, Ma et al. 2021. Preprint.

NS-FDN: Near-Sensor Processing Architecture of Feature-Configurable Distributed Network for Beyond-Real-Time Always-on Keyword Spotting

Qin, Liu, Dong et al. 2021. IEEE Trans. Circuits Syst. I.

DARB: A Density-Adaptive Regular-Block Pruning for Deep Neural Networks

Ren, Zhang, Wang et al. 2020. AAAI.

The rapidly growing parameter volume of deep neural networks (DNNs) hinders artificial intelligence applications on resource-constrained devices, such as mobile and wearable devices. Neural network pruning, one of the mainstream model compression techniques, is under extensive study as a way to reduce model size and thus the amount of computation, so that state-of-the-art DNNs can be deployed on those devices with high runtime energy efficiency. In contrast to irregular pruning, which incurs high index storage and decoding overhead, structured pruning techniques have been proposed as promising solutions. However, prior studies on structured pruning tackle the problem mainly from the perspective of facilitating hardware implementation, without analyzing the characteristics of sparse neural networks in depth. This neglect leads to an inefficient trade-off between regularity and pruning ratio. Consequently, the potential of structurally pruning neural networks has not been sufficiently exploited.

In this work, we examine the structural characteristics of irregularly pruned weight matrices, such as the diverse redundancy of different rows, the sensitivity of different rows to pruning, and the positional characteristics of retained weights. Guided by these insights, we first propose the novel block-max weight masking (BMWM) method, which effectively retains the salient weights while imposing high regularity on the weight matrix. As a further optimization, we propose density-adaptive regular-block (DARB) pruning, which takes advantage of the intrinsic characteristics of neural networks and thereby outperforms prior structured pruning work in both pruning ratio and decoding efficiency.

Our experimental results show that DARB achieves a 13× to 25× pruning ratio, a 2.8× to 4.3× improvement over state-of-the-art counterparts on multiple neural network models and tasks. Moreover, DARB achieves 14.3× the decoding efficiency of block pruning, with a higher pruning ratio.
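The abstract names block-max weight masking but does not define it. A minimal sketch, assuming it means keeping only the largest-magnitude weight in each fixed-size block of a row and zeroing the rest, could look like this. The function name `block_max_mask` and the fixed `block_size` are illustrative assumptions. The density-adaptive step of DARB is not shown.

```python
def block_max_mask(row, block_size):
    """Keep only the largest-magnitude weight in each block of a row.

    row        : list of floats (one hypothetical weight-matrix row)
    block_size : number of consecutive weights per block
    """
    masked = [0.0] * len(row)
    for start in range(0, len(row), block_size):
        block = range(start, min(start + block_size, len(row)))
        # Index of the weight with the largest absolute value in this block
        best = max(block, key=lambda i: abs(row[i]))
        masked[best] = row[best]
    return masked

row = [0.1, -0.9, 0.3, 0.2, 0.05, 0.4, -0.6, 0.0]
print(block_max_mask(row, block_size=4))
# -> [0.0, -0.9, 0.0, 0.0, 0.0, 0.0, -0.6, 0.0]
```

Under this reading, every block holds exactly one nonzero value, so a decoder only needs a small in-block offset per kept weight rather than a full column index. That is consistent with the abstract's emphasis on regularity and decoding efficiency, though the paper's actual encoding may differ.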

Mobile or FPGA? A Comprehensive Evaluation on Energy Efficiency and a Unified Optimization Framework

Yuan, Dong, Sun et al. 2022. ACM Trans. Embed. Comput. Syst.

Efficient deployment of deep neural networks (DNNs) on edge devices (i.e., FPGAs and mobile platforms) is very challenging, especially given the recent growth in DNN model size and complexity. Model compression strategies, including weight quantization and pruning, are widely recognized as effective ways to significantly reduce computation and memory intensity, and have been implemented for many DNNs on edge devices. However, most state-of-the-art works focus on ad hoc optimizations, and a thorough study that comprehensively reveals the potential and constraints of different edge devices under different compression strategies is lacking. In this paper, we qualitatively and quantitatively compare the energy efficiency of FPGA-based and mobile-based DNN execution using a mobile GPU, and provide a detailed analysis. Based on the observations from this analysis, we propose a unified optimization framework that uses block-based pruning to reduce weight storage and accelerate inference on mobile devices and FPGAs, achieving high hardware performance and energy-efficiency gains while maintaining accuracy.

GRIM: A General, Real-Time Deep Learning Inference Framework for Mobile Devices Based on Fine-Grained Structured Weight Sparsity

Niu, Li, Ma et al. 2022. IEEE Trans. Pattern Anal. Mach. Intell.
