High-Performance CNN Accelerator on FPGA Using Unified Winograd-GEMM Architecture (2019)
DOI: 10.1109/tvlsi.2019.2941250

Cited by 86 publications (42 citation statements)
References 22 publications
“…The authors of [9][10][11][12] proposed to accelerate CNN on FPGA using simplified numerical precision to save chip resource consumption. The authors of [13,14] proposed CNN architecture implemented in FPGA with the Winograd algorithm to reduce the complexity of convolution operation and accelerate the computation process. Bai et al [15] specifically used depthwise separable convolution to implement the CNN accelerator.…”
Section: Related Work
confidence: 99%
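The excerpt above contrasts three acceleration strategies: reduced numerical precision, the Winograd algorithm, and depthwise separable convolution (Bai et al. [15]). As a rough illustration of why the depthwise separable variant saves work, the sketch below compares multiplication counts for the two layer types; the layer shape used is a hypothetical example, not taken from any of the cited papers.

```python
# Multiplication counts: standard convolution vs. depthwise separable.
# Shapes (56x56 feature map, 3x3 kernel, 64->128 channels) are
# illustrative assumptions only.

def standard_conv_muls(h, w, k, c_in, c_out):
    # Each output pixel of each output channel needs a full
    # k*k*c_in dot product.
    return h * w * c_out * k * k * c_in

def depthwise_separable_muls(h, w, k, c_in, c_out):
    depthwise = h * w * c_in * k * k   # one k*k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution mixing channels
    return depthwise + pointwise

h = w = 56
k = 3
c_in, c_out = 64, 128

std = standard_conv_muls(h, w, k, c_in, c_out)
sep = depthwise_separable_muls(h, w, k, c_in, c_out)
print(f"standard: {std:,} muls, separable: {sep:,} muls, "
      f"ratio: {sep / std:.3f}")
```

The ratio works out to roughly 1/c_out + 1/k², i.e. about 0.12 for this shape, which is the usual back-of-envelope argument for depthwise separable layers.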
“…CNNs extract important features embedded in the input data and are increasingly computationally efficient. As recent studies have shown the effectiveness of FPGA as a hardware accelerator for the CNNs [51][52][53], the CNN in this system is to be built on FPGA as a real-time and low power consumption system. The CNN is built using Theano [54], and it consists of two convolutional layers, two pooling layers, one all-to-all connection layer and one output layer, as shown in Figure 1F.…”
Section: Regression Neural Network
confidence: 99%
“…Winograd filtering is a known technique to reduce the number of multiplications of a convolution. The technique was efficiently implemented on FPGA [147][148][149][150].…”
Section: Hardware-Oriented Deep Neural Network Optimizations
confidence: 99%
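The Winograd claim in the excerpt above can be made concrete with the smallest 1-D case, F(2,3): two outputs of a 3-tap convolution computed with 4 multiplications instead of the 6 a direct sliding dot product needs. This is a textbook sketch of the transform, not the FPGA implementations cited in [147]-[150].

```python
def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap FIR using 4 multiplications.

    d: 4 input samples, g: 3 filter taps.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (precomputable once per filter, so its
    # arithmetic is not counted against the per-tile budget).
    G0 = g0
    G1 = (g0 + g1 + g2) / 2
    G2 = (g0 - g1 + g2) / 2
    G3 = g2
    # The 4 element-wise multiplications.
    m1 = (d0 - d2) * G0
    m2 = (d1 + d2) * G1
    m3 = (d2 - d1) * G2
    m4 = (d1 - d3) * G3
    # Output transform recombines the products.
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv(d, g):
    # Reference: sliding dot product, 6 multiplications.
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

print(winograd_f23([1, 2, 3, 4], [1, 1, 1]))  # -> [6.0, 9.0]
print(direct_conv([1, 2, 3, 4], [1, 1, 1]))   # -> [6, 9]
```

The 2-D case used for 3x3 CNN layers, F(2x2, 3x3), nests this transform and cuts 36 multiplications per tile down to 16, which is where the FPGA savings come from.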
“…The main data quantization and data reduction techniques are summarized in Table 4 [146]-[150]. Data reduction techniques are normally applied together with data quantization. Together, they generate very efficient solutions with a small accuracy reduction when compared to solutions without optimizations.…”
Section: Hardware-Oriented Deep Neural Network Optimizations
confidence: 99%
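To illustrate the data-quantization half of the excerpt above, the sketch below applies symmetric linear int8 quantization to a small weight vector. The scale convention (max-abs mapped to 127) is a common choice, assumed here for illustration; it is not the specific scheme surveyed in the cited work.

```python
def quantize_int8(xs):
    """Symmetric linear quantization of floats to int8 range [-127, 127]."""
    scale = max(abs(x) for x in xs) / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from the integer codes.
    return [v * scale for v in q]

weights = [0.41, -1.3, 0.025, 0.9]          # illustrative values
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
err = max(abs(a - b) for a, b in zip(weights, approx))
print(q, f"scale={scale:.5f}, max abs error={err:.5f}")
```

The worst-case rounding error is bounded by half the scale step, which is the "small accuracy reduction" the quoted survey refers to; FPGA implementations then store and multiply the int8 codes instead of 32-bit floats.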