Convolutional neural networks (CNNs) have become a practical means to perform vision tasks, particularly image classification. FPGAs are well known to be able to perform convolutions efficiently; however, most recent efforts to run CNNs on FPGAs have shown limited advantages over other devices such as GPUs. Previous FPGA approaches have often been memory bound due to the limited external memory bandwidth on the FPGA device. We show a novel architecture written in OpenCL™, which we refer to as a Deep Learning Accelerator (DLA), that maximizes data reuse and minimizes external memory bandwidth. Furthermore, we show how we can use the Winograd transform to significantly boost the performance of the FPGA. As a result, when running our DLA on Intel's Arria 10 device we achieve a performance of 1020 img/s, or 23 img/s/W, when running the AlexNet CNN benchmark. This comes to 1382 GFLOPS and is 10x faster, with 8.4x more GFLOPS and 5.8x better efficiency, than the state of the art on FPGAs. Additionally, 23 img/s/W is competitive against the best publicly known implementation of AlexNet on NVIDIA's TitanX GPU.

Keywords: Deep Neural Network, Convolutional Neural Network

Due to the contributions above, we are able to implement all layers of AlexNet [7] on Intel's Arria 10 FPGA and achieve over 10x better throughput and 8.4x more GFLOPS than the state-of-the-art FPGA implementation of AlexNet [20]. Furthermore, we show that, to the best of our knowledge, this is the first FPGA implementation whose performance per watt is competitive against the same-generation, highly optimized TitanX GPU results [3, 9, 10].

The rest of the paper is organized as follows. Section 2 provides background on CNNs and related work. Section 3 describes the DLA architecture. Section 4 describes our analytical model for design space exploration. Finally, Sections 5 and 6 describe our results.

BACKGROUND

Deep neural networks are machine learning algorithms that are inspired by the structure and function of the human brain. They consist of several interconnected artificial neurons that are modeled after the neurons of the human nervous system. An artificial neuron accepts numerical input from other neurons and produces an output. For DNNs, the output is computed as a dot-product of the neuron's inputs and its weights.
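The following is a minimal illustrative sketch (not code from the paper) of two ideas mentioned above: a convolution output as a sliding dot-product of inputs and filter weights, and the 1D Winograd F(2,3) transform, which produces two outputs of a 3-tap filter with four element-wise multiplications instead of six. The transform matrices are the standard ones from the Winograd/Lavin-Gray formulation; function names and the toy data are assumptions for illustration only.

```python
import numpy as np

def neuron(x, w):
    """Artificial neuron output: dot-product of its inputs and its weights."""
    return np.dot(x, w)

def conv1d_direct(d, g):
    """Direct 'valid' convolution: each output is a dot-product of a
    3-element input window with the 3-tap filter g (6 multiplies per tile)."""
    return np.array([np.dot(d[i:i + 3], g) for i in range(len(d) - 2)])

# Standard Winograd F(2,3) transform matrices.
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two conv outputs over a 4-element tile using 4 element-wise
    multiplies: y = AT @ ((G @ g) * (BT @ d))."""
    return AT @ ((G @ g) * (BT @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])   # toy 4-element input tile
g = np.array([0.5, 1.0, -1.0])       # toy 3-tap filter
assert np.allclose(conv1d_direct(d, g), winograd_f23(d, g))
```

The savings grow for the 2D tiles used in CNN layers, which is why a Winograd formulation can substantially raise the effective throughput of a fixed number of FPGA multipliers.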
Overlays have shown significant promise for field-programmable gate arrays (FPGAs) as they allow for fast development cycles and remove many of the challenges of the traditional FPGA hardware design flow. However, this often comes with a significant performance burden, resulting in very little adoption of overlays for practical applications. In this paper, we tailor an overlay to a specific application domain, and we show how we maintain its full programmability without paying for the performance overhead traditionally associated with overlays. Specifically, we introduce an overlay targeted for deep neural network inference with only ~1% overhead to support the control and reprogramming logic using a lightweight very-long instruction word (VLIW) network. Additionally, we implement a sophisticated domain-specific graph compiler that compiles deep learning languages such as Caffe or TensorFlow to easily target our overlay. We show how our graph compiler performs architecture-driven software optimizations to significantly boost performance of both convolutional and recurrent neural networks (CNNs/RNNs): we demonstrate a 3× improvement on ResNet-101 and a 12× improvement for long short-term memory (LSTM) cells, compared to naïve implementations. Finally, we describe how we can tailor our hardware overlay, and use our graph compiler to achieve ~900 fps on GoogLeNet on an Intel Arria 10 1150, the fastest ever reported on comparable FPGAs.
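As a purely illustrative sketch of the kind of architecture-driven graph rewrite such a compiler performs (this is not the DLA graph compiler; the toy IR, op names, and fusion rule are assumptions), the example below folds a ReLU node into the convolution that feeds it, so the fused operation can be executed in one pass through a fixed hardware pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                                        # e.g. "conv", "relu", "pool"
    inputs: list = field(default_factory=list)     # indices of producer nodes

def fuse_conv_relu(graph):
    """Return a new graph in which every relu whose sole input is a conv
    is folded into that conv node (emitted as a fused 'conv_relu' op)."""
    fused, remap = [], {}                          # remap: old index -> new index
    for i, node in enumerate(graph):
        if node.op == "relu" and len(node.inputs) == 1:
            producer = graph[node.inputs[0]]
            if producer.op == "conv":
                # Reuse the already-emitted conv node, just mark it as fused.
                fused[remap[node.inputs[0]]].op = "conv_relu"
                remap[i] = remap[node.inputs[0]]
                continue
        fused.append(Node(node.op, [remap[j] for j in node.inputs]))
        remap[i] = len(fused) - 1
    return fused

# Tiny example graph: input -> conv -> relu -> pool
graph = [Node("input"), Node("conv", [0]), Node("relu", [1]), Node("pool", [2])]
print([n.op for n in fuse_conv_relu(graph)])       # ['input', 'conv_relu', 'pool']
```

Fusing operations like this reduces the number of passes over feature-map data, which is one way a graph compiler can exploit knowledge of the underlying overlay to improve throughput.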