2019
DOI: 10.1109/access.2019.2917312
Acceleration of LSTM With Structured Pruning Method on FPGA

Abstract: This paper focuses on accelerating long short-term memory (LSTM), which is one of the popular types of recurrent neural networks (RNNs). Because of the large number of weight memory accesses and the high computation complexity of its cascade-dependent structure, it is a big challenge to efficiently implement the LSTM on field-programmable gate arrays (FPGAs). To speed up the inference on FPGA, considering its limited resource, a structured pruning method that can not only reduce the LSTM model's size without los…
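For context, the standard LSTM cell equations (a generic textbook formulation, not quoted from the paper's own notation) make the two bottlenecks named in the abstract concrete: every timestep reads eight weight matrices, and h_t cannot be computed until c_t and h_{t-1} are available, which is the cascade dependence.

```latex
% Generic LSTM cell; the paper may use a different but equivalent formulation.
\begin{aligned}
i_t &= \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)}\\
f_t &= \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)}\\
o_t &= \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)}\\
g_t &= \tanh\!\left(W_g x_t + U_g h_{t-1} + b_g\right) && \text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && \text{(cell update, element-wise)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The W matrices act on the input x_t and the U matrices on the recurrent state h_{t-1}; pruning these eight matrices is what reduces both the weight memory traffic and the multiply-accumulate count.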

Cited by 32 publications (11 citation statements) | References 14 publications
“…Although this method is effective in convolutional layers, it fails to work in the fully-connected layers, where removing one column can cause significant information loss as it is equivalent to removing one input activation. Prior work (Wen et al. 2017; Wang et al. 2019) adopts this strategy in RNNs but only achieves about 2× parameter reduction. Block pruning performs pruning at the scale of blocks (Van Keirsbilck, Keller, and Yang 2019), but grouping neighboring weights into a specific structure is a strong constraint which is not an effective way to keep the salient weights.…”
Section: Background and Related Work (mentioning, confidence: 99%)
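As a minimal illustration of why column removal is so costly in fully-connected and recurrent layers (a hypothetical NumPy sketch, not the criterion used in any of the cited works): deleting column j of the weight matrix discards input activation j entirely, so a column-wise structured-pruning pass has to be conservative.

```python
import numpy as np

# Hypothetical column-wise structured pruning by column L2 norm.
# W maps an input vector x of size n_in to pre-activations, so dropping
# column j is equivalent to dropping input activation x[j] everywhere.
def prune_columns(W: np.ndarray, keep_ratio: float):
    """Return (W with low-salience columns removed, indices of kept columns)."""
    col_norms = np.linalg.norm(W, axis=0)            # salience of each input activation
    n_keep = max(1, int(round(keep_ratio * W.shape[1])))
    keep = np.sort(np.argsort(col_norms)[-n_keep:])  # keep the largest-norm columns
    return W[:, keep], keep

rng = np.random.default_rng(0)
W = rng.normal(size=(4 * 128, 64))     # e.g. stacked LSTM gate weights, 64-dim input
W_pruned, kept = prune_columns(W, keep_ratio=0.5)
print(W_pruned.shape)                  # (512, 32): half of the input activations are gone
```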
“…To compress the model for the hardware, two widely applied methods are (a) model selection/structured pruning, i.e. choosing a model structure with pruned layers/channels and small performance degradation [4,5], and (b) zero-weight compression/sparse pruning, i.e. pruning small-value weights to zero [6,7].…”
Section: Introduction (mentioning, confidence: 99%)
“…pruning small-value weights to zero [6,7]. Model selection differs from sparse pruning in that it deletes entire channels or layers, showing a more efficient speedup during inference, yet with a more severe performance degradation [4,5]. These two types of methods are usually complementary: after being structurally pruned, a model can also undergo further zero-weight compression to improve the inference speed.…”
Section: Introduction (mentioning, confidence: 99%)
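The complementary use of the two methods described above can be sketched as follows (hypothetical NumPy code; the thresholds and salience criteria are illustrative, not those of references [4-7]): first delete whole columns (structured pruning/model selection), then zero out the smallest surviving weights (zero-weight/sparse compression).

```python
import numpy as np

def magnitude_sparsify(W: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero-weight compression: zero the smallest-magnitude weights (illustrative criterion)."""
    flat = np.abs(W).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return W.copy()
    threshold = np.partition(flat, k - 1)[k - 1]     # k-th smallest magnitude
    W_sparse = W.copy()
    W_sparse[np.abs(W_sparse) <= threshold] = 0.0
    return W_sparse

# Chain the two complementary steps: structured pruning first, sparse pruning second.
rng = np.random.default_rng(1)
W = rng.normal(size=(256, 64))
keep = np.sort(np.argsort(np.linalg.norm(W, axis=0))[-32:])   # structured: keep 32 of 64 columns
W_structured = W[:, keep]
W_final = magnitude_sparsify(W_structured, sparsity=0.8)      # sparse: zero ~80% of what remains
print(W_final.shape, round(float((W_final == 0).mean()), 2))  # (256, 32) 0.8
```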
“…More recently, some works have indicated the suitability of LSTM (Long Short-Term Memory) neural networks owing to their well-known ability to process sequential data, such as the time series of industrial process variables (de Oliveira et al. 2020; Jalayer et al. 2021). However, such networks demand more processing and memory resources because of their cascade-dependent structure, which creates bottlenecks when performing inference (Wang et al. 2019; Gao et al. 2020). Moreover, in many industrial systems, such as those based on the IIoT (Industrial Internet of Things), these computational resources are quite limited, which means that neural-network-based techniques must be employed in a more computationally efficient way, without harming their performance, for their adoption to be viable.…”
Section: Introduction (unclassified)
“…Applying compression techniques to LSTM neural networks is necessary because of the high parameterization of this type of network, which can easily reach millions of parameters (Kadetotad et al. 2020). Compressing these models reduces both the memory they occupy (Wang et al. 2019) and the processing required for their inference.…”
Section: Introduction (unclassified)