2009 20th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP)
DOI: 10.1109/asap.2009.25

A Massively Parallel Coprocessor for Convolutional Neural Networks

Cited by 201 publications (96 citation statements)
References 11 publications
“…25) can be used to classify the DNN dataflows in recent works [82][83][84][85][86][87][88][89][90][91][92][93] based on their data handling characteristics [80]:…”
Section: B. Energy-Efficient Dataflow for Accelerators
Confidence: 99%
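The classification referred to in this excerpt groups accelerator dataflows by which operand each processing element keeps in its local storage. As a minimal illustration (not taken from the cited paper; the loop bounds, array names, and the weight-stationary vs. output-stationary split are assumptions drawn from the general survey literature), the two loop nests below compute the same convolution but reuse different operands locally:

```python
import numpy as np

# Illustrative only: two loop orderings that differ in which operand is
# reused from a PE's local storage (the "data handling characteristic").
# Shapes are hypothetical: K output channels, C input channels,
# R x R filters, H x W output feature map.
K, C, R, H, W = 4, 3, 3, 8, 8
weights = np.random.rand(K, C, R, R)
inputs = np.random.rand(C, H + R - 1, W + R - 1)

# Weight-stationary flavour: a filter tap is loaded once and reused
# across every output pixel before the next tap is fetched.
out_ws = np.zeros((K, H, W))
for k in range(K):
    for c in range(C):
        for i in range(R):
            for j in range(R):
                w = weights[k, c, i, j]          # held "stationary"
                for y in range(H):
                    for x in range(W):
                        out_ws[k, y, x] += w * inputs[c, y + i, x + j]

# Output-stationary flavour: one output pixel accumulates fully in
# local storage before being written back.
out_os = np.zeros((K, H, W))
for k in range(K):
    for y in range(H):
        for x in range(W):
            acc = 0.0                            # held "stationary"
            for c in range(C):
                for i in range(R):
                    for j in range(R):
                        acc += weights[k, c, i, j] * inputs[c, y + i, x + j]
            out_os[k, y, x] = acc

assert np.allclose(out_ws, out_os)
```

Both orderings produce identical results; what the taxonomy captures is which operand stays put in each processing element, and therefore how much traffic flows between off-chip memory, on-chip buffers, and the PEs.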
“…[1,2,3,4] Due to the specific computation pattern of CNN, general purpose processors hardly meet the implementation requirement, which encourages the proposal of various hardware implementations based on FPGA, GPU and ASIC [5,6,7]. CNN contains numerous 2D convolutions, which are responsible for more than 90% of the whole computation [8].…”
Section: Introduction
Confidence: 99%
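The 90% figure quoted above follows from the arithmetic of a convolutional layer: every output pixel of every output map needs one multiply-accumulate per filter tap per input channel. A rough sketch, with hypothetical layer sizes chosen only for illustration (not taken from the cited papers):

```python
# Rough operation count for one layer (hypothetical sizes).
def conv_layer_macs(K, C, R, H, W):
    """MACs for K output maps of H x W, C input maps, R x R filters."""
    return K * C * R * R * H * W

def fc_layer_macs(n_in, n_out):
    """MACs for a fully connected layer."""
    return n_in * n_out

# Example: a mid-sized convolutional layer vs. a large fully connected layer.
conv = conv_layer_macs(K=256, C=128, R=3, H=28, W=28)   # ~231 M MACs
fc = fc_layer_macs(n_in=4096, n_out=4096)                # ~16.8 M MACs
print(f"conv layer: {conv/1e6:.0f} M MACs, fc layer: {fc/1e6:.1f} M MACs")
```

Even a mid-sized convolutional layer performs an order of magnitude more multiply-accumulates than a large fully connected layer, which is why 2D convolution dominates the workload and is the natural target for dedicated hardware.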
“…To solve this problem, many efforts have been made [1,4,9,10,11]. Among these approaches, the architecture which is inspired by [12], first introduced into CNN by [1], is commonly adopted.…”
Section: Introduction
Confidence: 99%
“…Other recent works propose different CNN acceleration hardware. For example, [3,[10][11][12]22] focus on 2D-convolvers, which play the roles of both compute modules and data caches. Meanwhile, [18,19] use FMA units for computation.…”
Section: Related Work
Confidence: 99%
“…Several key similarities cause these methods to suffer from the underutilization problem we observe in our Single-CLP design. For example, the 2D-convolvers used in [3,10,12,22] must be provisioned for the largest filter across layers; they will necessarily be underutilized when computing layers with smaller filters. In [19], the organization of the compute modules depends on the number of output feature maps and their number of rows.…”
Section: Related Work
Confidence: 99%
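The underutilization argument in this excerpt can be made concrete with a short back-of-the-envelope calculation: a 2D-convolver array sized for the largest filter in the network leaves most of its multipliers idle on layers with smaller filters. The filter sizes below are hypothetical (an AlexNet-like mix), not figures from the cited papers:

```python
# Utilization of a 2D-convolver provisioned for the largest filter
# across layers (filter sizes are hypothetical examples).
layer_filter_sizes = [11, 5, 3, 3, 3]

k_max = max(layer_filter_sizes)          # convolver built as k_max x k_max
provisioned_macs = k_max * k_max         # multipliers physically present

for k in layer_filter_sizes:
    used = k * k                         # multipliers active for this layer
    util = used / provisioned_macs
    print(f"{k}x{k} filter on a {k_max}x{k_max} convolver: "
          f"{util:.0%} of multipliers used")

# A 3x3 layer uses only 9 of the 121 multipliers (~7%), which is the
# underutilization the Single-CLP comparison points out.
```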