Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017
DOI: 10.23919/date.2017.7927210
Hardware architecture of Bidirectional Long Short-Term Memory Neural Network for Optical Character Recognition

Cited by 41 publications (50 citation statements)
References 8 publications
“…RNN needs to run for multiple time-steps for each task to be completed. The computation of the recurrent unit can be unrolled over timesteps [92]. However, this cannot be fully parallelized, as discussed earlier.…”
Section: Compute-specific (mentioning)
confidence: 99%
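The excerpt above notes that a recurrent unit's computation can be unrolled over timesteps, yet not fully parallelized, because each step depends on the previous hidden state. A minimal NumPy sketch of that sequential dependency; the function name and the toy dimensions are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def rnn_unrolled(x_seq, W_x, W_h, h0):
    """Unroll a simple recurrent unit over the timesteps of x_seq.

    The loop body at step t needs h from step t-1, so the steps
    cannot run in parallel; only the work inside one step can.
    """
    h = h0
    hidden_states = []
    for x_t in x_seq:  # sequential dependency: h_t requires h_{t-1}
        h = np.tanh(W_x @ x_t + W_h @ h)
        hidden_states.append(h)
    return hidden_states

# hypothetical toy sizes: 3 timesteps, input dim 4, hidden dim 2
rng = np.random.default_rng(0)
x_seq = [rng.standard_normal(4) for _ in range(3)]
W_x = rng.standard_normal((2, 4))
W_h = rng.standard_normal((2, 2))
out = rnn_unrolled(x_seq, W_x, W_h, np.zeros(2))
print(len(out))  # one hidden state per timestep
```

Within a single step, the matrix-vector products can be parallelized in hardware; the outer loop over timesteps is what remains inherently serial.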
“…Ferreira et al proposed an FPGA accelerator of LSTM in [7] for a learning problem of adding two 8-bit numbers with weights stored in on-chip memory. Rybalkin et al [8] presented the first hardware architecture designed for BiLSTM for OCR. The architecture was implemented with 5-bit fixed-point numbers for weights and activations which were stored in on-chip memory.…”
Section: B. Related Work (mentioning)
confidence: 99%
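The excerpt above says the BiLSTM architecture of [8] stores weights and activations as 5-bit fixed-point numbers in on-chip memory. A hedged sketch of what such quantization could look like; the particular split of 1 sign, 1 integer, and 3 fractional bits, and the function name, are assumptions for illustration, not details from the paper:

```python
import numpy as np

def quantize_fixed_point(w, total_bits=5, frac_bits=3):
    """Quantize to signed fixed-point with total_bits including the
    sign bit and frac_bits fractional bits (range [-2.0, 1.875] here)."""
    scale = 2 ** frac_bits
    lo = -(2 ** (total_bits - 1))       # most negative integer code
    hi = 2 ** (total_bits - 1) - 1      # most positive integer code
    codes = np.clip(np.round(w * scale), lo, hi)
    return codes / scale

w = np.array([0.42, -1.7, 0.05, 3.0])
q = quantize_fixed_point(w)
print(q)  # values snapped to multiples of 1/8, clipped to [-2, 1.875]
```

With only 32 codes per value, such a representation keeps the full weight set small enough to fit in on-chip memory, which is the trade-off the cited work exploits.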
“…There has been previous work [7,8,9,10] with FPGA based implementations such that all the weights are stored in the on-chip memory, but this is expensive and limits the size of models that can be deployed. When the RNN model is too large that the weights need to be stored on an external DRAM, it is not efficient because the fetched weights are typically used only once for each output computation.…”
Section: Introduction (mentioning)
confidence: 99%
“…An option is FPGAs, which allow designing a specialized hardware architecture for DNNs at much less effort than building a computer chip from scratch. There are several examples of FPGA implementations dealing with redundancy in DNNs [5,6,7,8,9,10]. FPGAs consume little energy, which makes them good candidates for embedded applications.…”
Section: Related Work (mentioning)
confidence: 99%