Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
DOI: 10.1145/2847263.2847276
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Cited by 459 publications (275 citation statements)
References 10 publications
“…[8] and [23] explore in-memory processing to accelerate CNNs. [28] develops an OpenCL-based HLS tool to implement CNN accelerators that use different modules for different kinds of layers, but all convolutional layers are computed with a single CLP.…”
Section: Related Work
confidence: 99%
“…The intersection of the roofline curve with a vertical line at a particular arithmetic intensity gives the theoretical peak performance point, which is either compute-bound or memory-bound. In particular, we consider the binarized [31,21] and 8-bit fixed-point [25] implementations of the popular AlexNet [14], both of which require 1.4 billion operations (1.4 GOP) to classify one image.…”
Section: Estimating Performance Using Rooflines
confidence: 99%
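The roofline estimate quoted above reduces to a one-line formula: attainable performance is the minimum of peak compute throughput and memory bandwidth times arithmetic intensity. A minimal sketch, with hypothetical device numbers (not taken from the paper):

```python
def roofline(peak_gops, bandwidth_gbs, intensity_ops_per_byte):
    """Attainable performance in GOP/s for a kernel with the given
    arithmetic intensity (operations per byte moved from memory)."""
    return min(peak_gops, bandwidth_gbs * intensity_ops_per_byte)

# Assumed device: 1000 GOP/s peak compute, 25 GB/s DRAM bandwidth.
# At 10 op/byte the kernel is memory-bound (25 * 10 = 250 GOP/s);
# at 50 op/byte it hits the compute roof (1000 GOP/s).
print(roofline(1000, 25, 10))   # 250
print(roofline(1000, 25, 50))   # 1000
```

Reading off where a kernel's intensity falls relative to the ridge point (peak compute divided by bandwidth) tells you whether to optimise data movement or compute.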
“…In these cases, the majority of reported results include GOp/s when a favourable batch size is used. In [16], an OpenCL-based high-throughput accelerator is proposed which employs batch processing in order to sustain high resource utilisation and hide the host-accelerator communication overhead. In [17], Chen et al. used batch processing to maximise weight reuse in ConvNet layers across multiple inputs.…”
Section: Performance Comparison
confidence: 99%
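The batching argument in the excerpt above can be made concrete with a simple latency model: each batch pays one fixed host-accelerator transfer cost, amortised over the images in the batch. The numbers below are illustrative assumptions, not measurements from the cited works:

```python
def throughput(batch, transfer_s, compute_per_image_s):
    """Images/s when each batch pays one fixed transfer cost plus
    per-image compute time (no transfer/compute overlap assumed)."""
    return batch / (transfer_s + batch * compute_per_image_s)

# Assumed costs: 5 ms transfer per batch, 1 ms compute per image.
print(round(throughput(1, 0.005, 0.001), 1))    # 166.7 images/s
print(round(throughput(64, 0.005, 0.001), 1))   # 927.5 images/s
```

This is why reported GOp/s figures depend strongly on batch size: the fixed overhead dominates at batch 1 but nearly vanishes at large batches.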