2012
DOI: 10.1145/2133382.2133388
A Massively Parallel, Energy Efficient Programmable Accelerator for Learning and Classification

Abstract: Applications that use learning and classification algorithms operate on large amounts of unstructured data, and have stringent performance constraints. For such applications, the performance of general-purpose processors scales poorly with data size because of their limited support for fine-grained parallelism and absence of software-managed caches. The large intermediate data in these applications also limits achievable performance on many-core processors such as GPUs. To accelerate such learning applications…
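As a rough illustration of what the abstract means by a software-managed cache (this is not code from the paper; the scratchpad size and function name below are hypothetical), a kernel can stage tiles of a large input in a small on-chip buffer under explicit program control instead of relying on a hardware cache:

#include <stddef.h>
#include <string.h>

#define SCRATCH_WORDS 1024   /* hypothetical on-chip scratchpad capacity */

/* Tiled dot-product accumulation: stream the large input through a small,
 * software-managed scratchpad.  Names and sizes are illustrative only. */
float tiled_dot(const float *x, const float *w, size_t n)
{
    static float scratch[SCRATCH_WORDS];   /* stands in for on-chip memory */
    float acc = 0.0f;

    for (size_t base = 0; base < n; base += SCRATCH_WORDS) {
        size_t tile = (n - base < SCRATCH_WORDS) ? (n - base) : SCRATCH_WORDS;

        /* Explicitly stage one tile of x on chip (the "software-managed cache"). */
        memcpy(scratch, x + base, tile * sizeof(float));

        /* Consume the staged tile; w is streamed directly. */
        for (size_t i = 0; i < tile; i++)
            acc += scratch[i] * w[base + i];
    }
    return acc;
}

On an accelerator, the memcpy would correspond to an explicit transfer into local memory scheduled by the program rather than by a cache hierarchy.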

Cited by 40 publications (13 citation statements)
References 21 publications
“…Moreover, the neurons they implement are inspired by biology, i.e., spiking neurons; they do not implement the CNNs and DNNs which are the focus of our architecture. Majumdar et al. [37] investigate a parallel architecture for various machine-learning algorithms, including but not limited to neural networks; unlike our architecture, they have an off-chip banked memory, and they introduce memory banks close to the PEs (similar to those found in GPUs) for caching purposes. Finally, beyond neural networks and machine-learning tasks, other large-scale custom architectures have been proposed, such as the recently proposed Anton 2 [60], for molecular dynamics simulation.…”
Section: Related Work
confidence: 99%
“…The prevalence and compute-intensive nature of RM applications has led to efforts to optimize them using parallel software on multi-core and many-core processors [1,2], specialized hardware accelerators [3,4,14] and custom circuits [5]. StoRM is an accelerator for RM applications, but utilizes an entirely different approach (SC), which leads to significant benefits compared to previous efforts.…”
Section: Related Work
confidence: 99%
“…As a result, realizing efficient implementations of RM workloads is a problem that has attracted great interest, with solutions proposed ranging from optimized software on multi-core and many-core processors [1,2] to specialized hardware accelerators [3,4] and custom mixed-signal circuits [5].…”
Section: Introduction
confidence: 99%
“…al. [11] described the many-core MAPLE architecture, which was designed to accelerate a number of learning and classification problems, including SVMs. Vector processing elements in a two-dimensional grid were used to perform linear algebra.…”
Section: Introduction
confidence: 99%
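The statement above summarizes MAPLE as a two-dimensional grid of vector processing elements performing linear algebra. As a minimal sketch of that general idea (grid dimensions, function names, and the assumption that the matrix size divides evenly are all illustrative, not taken from the paper), a matrix-vector product can be partitioned so that each PE owns a block of output rows and a slice of input columns:

#include <stdlib.h>

#define GRID_ROWS 4   /* hypothetical PE grid height */
#define GRID_COLS 4   /* hypothetical PE grid width  */

/* One PE (r, c): partial sums for its block of output rows, using only its
 * slice of input columns.  This mirrors the general idea of mapping linear
 * algebra onto a 2-D PE array, not the exact MAPLE dataflow. */
static void pe_partial(const float *A, const float *x, float *partial,
                       int n, int row0, int rows, int col0, int cols)
{
    for (int i = 0; i < rows; i++) {
        float acc = 0.0f;
        for (int j = 0; j < cols; j++)
            acc += A[(row0 + i) * n + (col0 + j)] * x[col0 + j];
        partial[i] = acc;
    }
}

void grid_matvec(const float *A, const float *x, float *y, int n)
{
    int rows_per_pe = n / GRID_ROWS;   /* assumes n divides evenly */
    int cols_per_pe = n / GRID_COLS;
    float *partial = malloc(rows_per_pe * sizeof(float));

    for (int i = 0; i < n; i++)
        y[i] = 0.0f;

    /* Sequentially emulate the PE grid; on real hardware the (r, c) loops
     * run concurrently across the array. */
    for (int r = 0; r < GRID_ROWS; r++) {
        for (int c = 0; c < GRID_COLS; c++) {
            pe_partial(A, x, partial, n,
                       r * rows_per_pe, rows_per_pe,
                       c * cols_per_pe, cols_per_pe);
            for (int i = 0; i < rows_per_pe; i++)
                y[r * rows_per_pe + i] += partial[i];  /* row-wise reduction */
        }
    }
    free(partial);
}

On real hardware the per-PE work would execute in parallel and the row-wise reduction would map to on-chip accumulation; here both are emulated sequentially for clarity.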
“…While both Ly [14] and Majumdar [11] targeted maximum performance in batch learning tasks, ours is designed for single-FPGA, floating-point embedded applications in which minimising latency and compactness are the key design goals. Similar architectures have been applied to the acceleration of linear algebra problems, utilising both spatial parallelism and pipelining to achieve high performance.…”
Section: Introduction
confidence: 99%