2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) 2020
DOI: 10.1109/isca45697.2020.00023
Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads

Cited by 51 publications (38 citation statements)
References 39 publications
“…Although plenty of other notable architectures exist (see Table II), a pattern begins to emerge, as most specialized processors rely on a series of sub-processing elements which each contribute to increasing throughput of a larger processor [81], [82]. Whilst there are plenty of ways to achieve MAC parallelism, one of the most renowned techniques is the systolic array, and is utilized by Groq [85] and Google, amongst numerous other chip developers. This is not a new concept: systolic architectures were first proposed back in the late 1970s [86], [87], and have become widely popularized since powering the hardware DeepMind used for the AlphaGo system to defeat Lee Sedol, the world champion of the board game Go in October 2015.…”
Section: ) Edge-ai Dnn Accelerators Suitable For Biomedical Applicationsmentioning
confidence: 99%
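The excerpt names the systolic array as a leading technique for MAC parallelism. The following is a minimal, illustrative simulation of an output-stationary systolic array in plain Python; the function name, the skewing scheme, and the grid layout are generic textbook choices, not details of the TSP or Groq's design.

```python
# Hypothetical sketch: simulate an n x m grid of PEs computing A @ B.
# A's rows stream in from the left and B's columns from the top, each
# skewed by one cycle per row/column; every PE multiplies the pair of
# values passing through it and accumulates the product locally.

def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on an n x m grid of PEs."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    acc = [[0] * m for _ in range(n)]      # one accumulator per PE
    # Cycles needed for the last skewed operand to reach PE (n-1, m-1).
    for t in range(k + n + m - 2):
        for i in range(n):
            for j in range(m):
                # Because of the input skew, PE (i, j) sees operand
                # pair index s = t - i - j at cycle t.
                s = t - i - j
                if 0 <= s < k:
                    acc[i][j] += A[i][s] * B[s][j]
    return acc

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))   # [[19, 22], [43, 50]]
```

The skew is what makes the design "systolic": operands pulse through the grid one hop per cycle, so every PE performs one MAC per cycle once the pipeline fills, with no global fan-out of data.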
“…On the other hand, for some applications, 16-bit floating-point MAC units are necessary [51] to reduce significant development cost. State-of-the-art CNN accelerators [51], [53], [56] have both 8-bit MAC units for efficient execution of CNNs and 16-bit floating-point MAC units for accurate execution.…”
Section: A Quantizationmentioning
confidence: 99%
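The excerpt describes accelerators pairing 8-bit integer MACs with floating-point ones. A hedged sketch of the trade-off, with illustrative scale factors and a generic symmetric quantization scheme (none of these specifics come from the cited accelerators):

```python
# Hypothetical sketch: an 8-bit integer dot product with a wide exact
# accumulator, compared against the full-precision float result.

def quantize(xs, scale):
    """Map floats to signed 8-bit integers: q = clip(round(x / scale))."""
    return [max(-128, min(127, round(x / scale))) for x in xs]

def int8_dot(xq, wq, sx, sw):
    """8-bit MAC loop: products accumulate exactly in a wide integer
    accumulator; a single float multiply rescales the result."""
    acc = 0
    for a, b in zip(xq, wq):
        acc += a * b            # each product fits easily in 32 bits
    return acc * sx * sw

x = [0.5, -1.25, 2.0]
w = [1.0, 0.75, -0.5]
sx = sw = 2.0 / 127             # assume inputs lie in [-2, 2]
approx = int8_dot(quantize(x, sx), quantize(w, sw), sx, sw)
exact = sum(a * b for a, b in zip(x, w))
print(exact, approx)            # approx is close to, not equal to, exact
```

The integer path is cheap in silicon but introduces quantization error at the inputs, which is why the excerpt's accelerators keep 16-bit floating-point units alongside for layers where that error is unacceptable.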
“…In this section, we review recent trends in CNN accelerator research which are not covered in the previous sections. In the previous sections, we mentioned the trends of CNN accelerators, for example, large on-chip memories (144 MB [51], 220 MB [56]), 8-bit fixed-point MAC units and 16-bit floating-point MAC units for CNN accelerators, popularity of streaming architectures [41], [52], [53].…”
Section: Trends In Recent Cnn Acceleratormentioning
confidence: 99%
“…To keep pace with the rapid advancement of DNN models, the computing throughput of spatial accelerators scales up to tens or hundreds of TOPS [1,3,14]. And the number of PEs in a spatial accelerator also increases rapidly at the same time.…”
Section: Spatial Dnn Acceleratorsmentioning
confidence: 99%
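The excerpt ties throughput scaling to PE count. A back-of-the-envelope check of how PE count maps to peak TOPS, using the common convention that one MAC counts as two operations; the PE counts and 1 GHz clock below are illustrative assumptions, not figures from any cited accelerator.

```python
# Hypothetical sketch: peak throughput of a spatial accelerator as a
# function of PE count, assuming every PE retires one MAC per cycle.

def peak_tops(num_pes, clock_ghz, ops_per_pe_per_cycle=2):
    """Peak throughput in TOPS; a MAC counts as 2 ops (mul + add)."""
    return num_pes * ops_per_pe_per_cycle * clock_ghz * 1e9 / 1e12

for pes in (4_096, 65_536, 262_144):
    print(f"{pes:>7} PEs @ 1 GHz -> {peak_tops(pes, 1.0):.1f} TOPS")
```

Under these assumptions, tens of thousands of PEs at a ~1 GHz clock already land in the hundreds-of-TOPS range the excerpt mentions, which is why PE count grows in step with headline throughput.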