2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
DOI: 10.1109/fccm48280.2020.00011

Optimizing Reconfigurable Recurrent Neural Networks

Abstract: This paper proposes a novel latency-hiding hardware architecture based on column-wise matrix-vector multiplication to eliminate data dependency, improving the throughput of systems running RNN models. In addition, a flexible checkerboard tiling strategy is introduced to allow large weight matrices, while supporting element-based parallelism and vector-based parallelism. These optimizations improve the exploitation of the available parallelism to increase run-time hardware utilization and boost inference throughput.…
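The column-wise formulation is what enables the latency hiding: a conventional row-wise dot product cannot finalize any output element until the entire input vector is available, whereas a column-wise accumulation consumes each input element as soon as it is produced. The following is a minimal NumPy sketch of that contrast, assuming generic matrix and vector shapes; it illustrates the idea only and is not the authors' hardware implementation.

```python
# Row-wise vs. column-wise matrix-vector multiplication.
# In the column-wise form the j-th column of W is consumed as soon as
# x[j] becomes available, so an accelerator does not have to wait for
# the complete previous hidden state before starting to accumulate the
# next time step's result.
import numpy as np

def mvm_row_wise(W, x):
    # y[i] needs every element of x, so the full vector must be ready first.
    y = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            y[i] += W[i, j] * x[j]
    return y

def mvm_column_wise(W, x_stream):
    # x_stream yields (j, x_j) pairs in the order the elements arrive;
    # each column's contribution is accumulated immediately.
    y = np.zeros(W.shape[0])
    for j, x_j in x_stream:
        y += W[:, j] * x_j  # update all output rows in parallel
    return y

# Example: elements of the hidden state arrive one at a time.
W = np.random.rand(4, 3)
x = np.random.rand(3)
stream = ((j, x[j]) for j in range(len(x)))
assert np.allclose(mvm_row_wise(W, x), mvm_column_wise(W, stream))
```

On the FPGA, the per-column update y += W[:, j] * x_j would presumably map to parallel multiply-accumulate units across the output rows, which is where the element-based and vector-based parallelism mentioned above comes in.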

Cited by 23 publications (18 citation statements)
References 30 publications

“…Their paper has proposed a hardware architecture for an LSTM by exploiting its inherent parallelism, aiming to outperform software implementations. Z. Que et al. [23] have proposed a novel latency-hiding hardware architecture based on a column-wise matrix-vector multiplication mechanism to eliminate data dependency and to improve the throughput of RNN models. The proposed architecture has been implemented on Arria 10 and Stratix 10 FPGAs.…”
Section: Overview of LSTM Network and Reversible Logic (A. Long Short-Term Memory (LSTM) Network)
confidence: 99%
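The abstract also names a flexible checkerboard tiling strategy for weight matrices larger than the on-chip compute array. The sketch below shows generic row/column blocking combined with the column-wise update; the paper's actual tile shapes, schedule, and buffering are not described on this page, so the tile sizes and loop order are illustrative assumptions only.

```python
# Generic tiled matrix-vector multiplication: a large weight matrix is
# processed in (tile_rows x tile_cols) blocks, and within each block the
# columns are consumed in the latency-hiding, column-wise order.
import numpy as np

def tiled_mvm(W, x, tile_rows=2, tile_cols=2):
    R, C = W.shape
    y = np.zeros(R)
    for r0 in range(0, R, tile_rows):          # tile over output rows
        for c0 in range(0, C, tile_cols):      # tile over input columns
            tile = W[r0:r0 + tile_rows, c0:c0 + tile_cols]
            for j in range(tile.shape[1]):
                # column-wise accumulation inside the tile
                y[r0:r0 + tile_rows] += tile[:, j] * x[c0 + j]
    return y

W = np.arange(24, dtype=float).reshape(6, 4)
x = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(tiled_mvm(W, x), W @ x)
```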
“…In [26], a novel timestep (TS) buffer is introduced to avoid redundant calculations of LSTM gate operations and thereby reduce system latency. In [27], the authors propose a novel latency-hiding hardware architecture based on column-wise matrix-vector multiplication to eliminate data dependency, improving the throughput of LSTM/GRU models. These LSTM implementations store all the weights in the on-chip memory of FPGAs.…”
Section: Previous Work
confidence: 99%
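Both optimizations target the recurrent data dependency inside the gate computations. In a standard LSTM step, written below as a plain NumPy sketch (an assumed textbook formulation, not code from [26] or [27]), only the Wh @ h_prev term is blocked by the previous time step; the input term Wx @ x_t can be computed as soon as x_t arrives, which indicates where buffering and latency hiding have room to work.

```python
# Minimal LSTM cell with the gate pre-activation split into an input
# part (independent of the recurrence) and a recurrent part (blocked by
# h_{t-1}). Gates are stacked as [i, f, g, o].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    pre_x = Wx @ x_t      # available as soon as x_t arrives
    pre_h = Wh @ h_prev   # must wait for the previous hidden state
    i, f, g, o = np.split(pre_x + pre_h + b, 4)
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h_t = sigmoid(o) * np.tanh(c_t)
    return h_t, c_t

# Example usage with small, illustrative sizes.
D, H = 3, 4
rng = np.random.default_rng(0)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c,
                 rng.normal(size=(4 * H, D)),
                 rng.normal(size=(4 * H, H)),
                 np.zeros(4 * H))
```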
“…We perform the bit-sparse quantization of the LSTM model through retraining, a fine-tuning process commonly used for fixed-point quantization [13]. We quantize all of the weights used to update the LSTM gates to the bit-sparse data type and keep the remaining weights in fixed point.…”
Section: Bit-Sparse Data Representation
confidence: 99%
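The excerpt does not spell out the bit-sparse format. A common reading is that each weight is restricted to a small number of nonzero bits, i.e., a signed sum of at most k powers of two, so a multiplication reduces to a few shifts and adds. The sketch below uses that assumption with an illustrative k = 2 and a greedy fit; the function name and parameter ranges are hypothetical, not taken from the citing paper.

```python
# Hedged sketch: approximate each weight by a signed sum of at most k
# powers of two (a "bit-sparse" value), so a hardware multiply becomes a
# few shift-and-add operations.
import numpy as np

def quantize_bit_sparse(w, k=2, exp_min=-8, exp_max=0):
    """Greedy approximation of w by at most k signed power-of-two terms."""
    approx, residual = 0.0, float(w)
    for _ in range(k):
        if residual == 0.0:
            break
        sign = 1.0 if residual > 0 else -1.0
        exp = int(np.clip(np.round(np.log2(abs(residual))), exp_min, exp_max))
        approx += sign * 2.0 ** exp
        residual = float(w) - approx
    return approx

weights = np.array([0.8371, -0.26, 0.04, 0.61])
print(np.vectorize(quantize_bit_sparse)(weights))  # bit-sparse approximations
```

In a retraining (fine-tuning) flow, the remaining weights would stay in ordinary fixed point and the network would be trained further with the quantized gate weights in place to recover accuracy, as the citation statement describes.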
“…Various approaches have been proposed for energy-efficient LSTM/RNN inference accelerators [3, 4, 6, 7, 11, 13, 15-17]. [6] designed a low-power LSTM accelerator for keyword spotting that runs under 5 µW with an energy efficiency of 60 nJ/inference.…”
Section: Related Work (6.1 Efficient LSTM Inference Accelerator)
confidence: 99%