2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)
DOI: 10.1109/reconfig.2016.7857151

An FPGA implementation of a long short-term memory neural network

Abstract: Our work proposes a hardware architecture for a Long Short-Term Memory (LSTM) Neural Network, aiming to outperform software implementations by exploiting its inherent parallelism. The main design decisions are presented, along with the proposed network architecture and a description of its main building blocks. The network is synthesized for various sizes and platforms, and the performance results are presented and analyzed. Our synthesized network achieves a 251 times speed-up o…
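For context, the computation such an accelerator parallelizes is the standard LSTM cell recurrence. The sketch below is a plain software rendering of one timestep; the gate ordering, vector sizes, and use of float are illustrative assumptions, not details from the paper, whose hardware would use fixed-point arithmetic and parallel dot products:

    // Minimal sketch of one LSTM timestep (standard formulation).
    #include <cmath>
    #include <vector>

    static float sigmoidf(float x) { return 1.0f / (1.0f + std::exp(-x)); }

    // x: input vector; h, c: hidden and cell states, updated in place.
    // W holds the four gate weight matrices (input i, forget f, candidate g,
    // output o), each with H rows over the concatenation [x; h]; b holds the
    // four bias vectors of length H.
    void lstm_step(const std::vector<float>& x,
                   std::vector<float>& h, std::vector<float>& c,
                   const std::vector<std::vector<float>> W[4],
                   const std::vector<float> b[4]) {
        const size_t H = h.size(), X = x.size();
        std::vector<float> gate[4];
        for (int g = 0; g < 4; ++g) {                 // pre-activations of all gates
            gate[g].assign(H, 0.0f);
            for (size_t j = 0; j < H; ++j) {
                float acc = b[g][j];
                for (size_t k = 0; k < X; ++k) acc += W[g][j][k] * x[k];     // input part
                for (size_t k = 0; k < H; ++k) acc += W[g][j][X + k] * h[k]; // recurrent part
                gate[g][j] = acc;
            }
        }
        for (size_t j = 0; j < H; ++j) {
            float i  = sigmoidf(gate[0][j]);
            float f  = sigmoidf(gate[1][j]);
            float gv = std::tanh(gate[2][j]);
            float o  = sigmoidf(gate[3][j]);
            c[j] = f * c[j] + i * gv;        // cell state update
            h[j] = o * std::tanh(c[j]);      // hidden state output
        }
    }

The four gate matrix-vector products are independent of one another, which is the parallelism an FPGA implementation can exploit within each timestep.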

Cited by 43 publications (30 citation statements). References 15 publications.
“…However, exponential terms appear in function calculation, which makes it very difficult to directly implement them in FPGA. Lookup table [25] and polynomial approximation [26] are commonly used alternatives at present.…”
Section: Activation Function
confidence: 99%
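The lookup-table alternative mentioned in this citation is straightforward to sketch in software. Below is a minimal interpolated-LUT sigmoid; the table size, input range, and use of linear interpolation are illustrative assumptions, not the scheme of reference [25]:

    // Sigmoid approximated by a precomputed lookup table with linear
    // interpolation, avoiding the exponential at evaluation time.
    #include <algorithm>
    #include <array>
    #include <cmath>

    constexpr int   LUT_SIZE  = 256;
    constexpr float LUT_RANGE = 8.0f;   // sigmoid saturates outside [-8, 8]

    struct SigmoidLUT {
        std::array<float, LUT_SIZE + 1> table{};
        SigmoidLUT() {
            // Precompute once; on an FPGA this would be a ROM initialized
            // at synthesis time rather than at run time.
            for (int i = 0; i <= LUT_SIZE; ++i) {
                float x = -LUT_RANGE + 2.0f * LUT_RANGE * i / LUT_SIZE;
                table[i] = 1.0f / (1.0f + std::exp(-x));
            }
        }
        float operator()(float x) const {
            x = std::clamp(x, -LUT_RANGE, LUT_RANGE);   // saturate the tails
            float pos  = (x + LUT_RANGE) * LUT_SIZE / (2.0f * LUT_RANGE);
            int   idx  = std::min(static_cast<int>(pos), LUT_SIZE - 1);
            float frac = pos - idx;                     // interpolation weight
            return table[idx] + frac * (table[idx + 1] - table[idx]);
        }
    };

A polynomial approximation replaces the table with a few multiply-adds per evaluation, trading memory for arithmetic; which is cheaper depends on the target device's BRAM and DSP budgets.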
“…However, this strategy largely increases computation latency and power dissipation. Another FPGA-based work reported in Ferreira and Fonseca (2016) fully uses both logic units and memory cells in FPGA to speed up computation and suppress the power dissipation. Work in Chang and Culurciello (2017) balances the data communication that both on-chip LUT and off-chip DRAM are used for internal storage of matrix multiplication to reduce the latency due to off-chip memory access and workload of on-chip communication.…”
Section: Memory Access
confidence: 99%
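The on-chip/off-chip trade-off this citation describes can be sketched as a tiled matrix-vector product: weights stream from DRAM in tiles into a small on-chip buffer, and each tile is fully reused before the next fetch. The tile size and buffer model below are illustrative assumptions, not details of the cited designs:

    // y = W * x with W held "off chip" (a plain vector standing in for
    // DRAM) and a TILE-row buffer standing in for on-chip BRAM.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    constexpr std::size_t TILE = 64;   // rows of W cached on chip at a time

    void tiled_matvec(const std::vector<float>& W,   // rows*cols, row-major
                      const std::vector<float>& x,
                      std::vector<float>& y,
                      std::size_t rows, std::size_t cols) {
        std::vector<float> buf(TILE * cols);         // "on-chip" weight buffer
        for (std::size_t r0 = 0; r0 < rows; r0 += TILE) {
            std::size_t nrows = std::min(TILE, rows - r0);
            // Burst transfer: copy one tile of weights into the local buffer.
            std::copy(W.begin() + r0 * cols,
                      W.begin() + (r0 + nrows) * cols, buf.begin());
            // Compute on the buffered tile; in hardware these row dot
            // products can run in parallel once operands are on chip.
            for (std::size_t r = 0; r < nrows; ++r) {
                float acc = 0.0f;
                for (std::size_t c = 0; c < cols; ++c)
                    acc += buf[r * cols + c] * x[c];
                y[r0 + r] = acc;
            }
        }
    }

Larger tiles amortize DRAM burst overhead but consume more on-chip memory, which is the balance the cited works tune per platform.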
“…There has been much previous work on FPGA-based LSTM implementations using on-chip memory to store all the weights. Ferreira et al. proposed an FPGA accelerator of LSTM in [7] for a learning problem of adding two 8-bit numbers with weights stored in on-chip memory. Rybalkin et al. [8] presented the first hardware architecture designed for BiLSTM for OCR.…”
Section: B. Related Work
confidence: 99%
“…FPGAs have been used to speed up the inference of LSTMs [4,5,6,7], which offer benefits of low latency and low power when compared to CPUs or GPUs. Although FPGA-based LSTM accelerators have advantages in latency and power consumption, they are limited by the memory bandwidth of the FPGA board.…”
Section: Introduction
confidence: 99%