2015
DOI: 10.1145/2775054.2694358
PuDianNao

Abstract: Machine Learning (ML) techniques are pervasive tools in various emerging commercial applications, but must be hosted by powerful computer systems to process very large data sets. Although general-purpose CPUs and GPUs provide straightforward solutions, their energy efficiency is limited by their excessive support for flexibility. Hardware accelerators can achieve better energy efficiency, but each accelerator typically supports only a single ML technique (family). According to the famous No…

Cited by 29 publications (3 citation statements)
References 31 publications
“…Many dense architectures have been proposed in the literature that optimize compute [14,25,30] and memory bandwidth [13,32] for CNN inferences. Quantization of weights and activations using log [35,58] and linear [17,30] techniques further reduces the memory footprint.…”
Section: Related Work
Confidence: 99%
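The log and linear quantization mentioned in this excerpt is easy to illustrate. Below is a minimal NumPy sketch, not code from any of the cited papers: it assumes symmetric per-tensor scaling for the linear case and signed power-of-two codes (rounded log2 magnitudes) for the log case; `bits`, the function names, and the clipping range are all illustrative choices.

```python
import numpy as np

def linear_quantize(x, bits=8):
    # Uniform ("linear") quantization: map values onto 2^bits evenly
    # spaced levels with a single symmetric per-tensor scale.
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    codes = np.round(x / scale)            # integer codes, e.g. [-127, 127] for 8 bits
    return codes * scale                   # dequantized approximation

def log_quantize(x, bits=4):
    # Logarithmic quantization: keep the sign and round log2 of the
    # magnitude, so each value becomes +/- a power of two and a multiply
    # can be implemented as a bit shift in hardware.
    sign = np.sign(x)
    mag = np.maximum(np.abs(x), np.finfo(np.float32).tiny)  # avoid log(0)
    exp = np.clip(np.round(np.log2(mag)), -(2 ** bits - 1), 0)
    return sign * np.power(2.0, exp)

w = (np.random.randn(6) * 0.5).astype(np.float32)
print("original:", w)
print("linear  :", linear_quantize(w))
print("log     :", log_quantize(w))
```

Either scheme shrinks the memory footprint the same way: an 8-bit linear code or a 4-bit exponent replaces a 32-bit float, cutting weight traffic by 4x to 8x before any architectural optimization.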
“…ASIC Cloud-worthy accelerators with planet-scale applicability are numerous, including those targeting graph processing [28], database servers [29], Web Search RankBoost [30], Machine Learning [31,32], gzip/gunzip [33] and Big Data Analytics [34]. Tandon et al [35] designed accelerators for similarity measurement in natural language processing.…”
Section: Related Work
Confidence: 99%
“…To optimize memory access and data movement, DianNao [8] uses a customized on-chip buffer to minimize energy-hungry DRAM accesses. In contrast, the next generation of accelerators in the DianNao family (DaDianNao [8], ShiDianNao [9], and PuDianNao [49]) relies entirely on on-chip embedded DRAM and SRAM (Static Random Access Memory) to eliminate DRAM accesses. As deep learning performance has scaled with ever-larger data, prevalent hardware architectures [8,9,45] are limited by inefficient data transfer between processing elements and main memory.…”
Section: Existing DNN Accelerators
Confidence: 99%
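The excerpt's point about data movement can be made concrete with a toy model. The sketch below is not the DianNao dataflow; the tile size, read counters, and matrix shapes are assumptions purely for illustration. It stages one tile of the input vector in a hypothetical on-chip buffer, reuses it across every output row, and counts how many off-chip ("DRAM") reads that saves versus refetching the inputs for each row.

```python
import numpy as np

TILE = 64  # hypothetical on-chip buffer capacity, in elements

def tiled_matvec(W, x):
    """Matrix-vector product that reuses a buffered input tile across rows."""
    n_out, n_in = W.shape
    y = np.zeros(n_out, dtype=W.dtype)
    dram_reads = 0
    for j0 in range(0, n_in, TILE):
        j1 = min(j0 + TILE, n_in)
        x_tile = x[j0:j1]                  # fetched from "DRAM" once per tile...
        dram_reads += j1 - j0
        for i in range(n_out):             # ...then reused by every output row
            dram_reads += j1 - j0          # weights are streamed once each
            y[i] += W[i, j0:j1] @ x_tile
    return y, dram_reads

W = np.random.randn(8, 256).astype(np.float32)
x = np.random.randn(256).astype(np.float32)
y, tiled_reads = tiled_matvec(W, x)
naive_reads = W.size + W.shape[0] * x.size  # refetch all of x for every row
print(f"off-chip reads: tiled={tiled_reads}, naive={naive_reads}")
assert np.allclose(y, W @ x, atol=1e-4)
```

Here the buffered inputs are read off chip once instead of once per output row; real accelerators push the same idea further by also holding weight tiles and partial sums on chip.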