Sub-Word Parallel Precision-Scalable MAC Engines for Efficient Embedded DNN Inference

Mei, Linyan; Dandekar, Mohit; Rodopoulos, Dimitrios; Constantin, Joseph; Debacker, Peter; Lauwereins, Rudy; Verhelst, Marian

doi:10.1109/aicas.2019.8771481

“…The concepts of Sum Apart (SA) and Sum Together (ST) were introduced at PE level by Mei et. al [16] to qualify two opposite ways of accumulating subword-parallel computations: SA keeps the parallel-generated products separately, while ST sums them together to form one single output result. These concepts can be applied to differentiate algorithm-level characteristics of neural-network workloads.…”

Section: A Sa and St At Algorithm Levelmentioning

confidence: 99%

“…The Sum Together (ST) version of the SWP MAC unit, introduced by Mei et. al [16] is also a 2D symmetric scalable architecture, based on an array multiplier. But unlike SWP SA, SWP ST adds all subword results together by activating the array multiplier into an opposite diagonal pattern, as shown in Fig.…”

Section: G Subword-parallel St (St)mentioning

confidence: 99%

“…Figs. [16][17][18][19] show the breakdown of memory bandwidth, energy per operation, and silicon area of the circuits. The left subfigures display symmetric scaling scenarios while the right ones display weight-only scaling.…”

Section: E Dvfsmentioning

confidence: 99%

“…This has lead to a new trend of ultra-efficient run-time precisionscalable neural processors for embedded Deep Neural Network (DNN) processing in mobile devices and IoT nodes. Accordingly, many run-time precision-scalable MAC architectures have been introduced in the recent years, built either with high parallelization capabilities [13]- [16] or serial approaches [17]- [21]. However, it has been difficult to assess their efficiency or to decide on a topology for several reasons.…”

mentioning

confidence: 99%

See 2 more Smart Citations

Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural-Network Processing

Camus

¹

,

Mei

²

,

Enz

³

et al. 2019

IEEE J. Emerg. Sel. Topics Circuits Syst.

Self Cite

View full text Add to dashboard Cite

The current trend for deep learning has come with an enormous computational need for billions of Multiply-Accumulate (MAC) operations per inference. Fortunately, reduced precision has demonstrated large benefits with low impact on accuracy, paving the way towards processing in mobile devices and IoT nodes. To this end, various precision-scalable MAC architectures optimized for neural networks have recently been proposed. Yet, it has been hard to comprehend their differences and make a fair judgment of their relative benefits as they have been implemented with different technologies and performance targets. To overcome this, this work exhaustively reviews the state-of-the-art precision-scalable MAC architectures and unifies them in a new taxonomy. Subsequently, these different topologies are thoroughly benchmarked in a 28 nm commercial CMOS process, across a wide range of performance targets, and with precision ranging from 2 to 8 bits. Circuits are analyzed for each precision as well as jointly in practical use cases, highlighting the impact of architectures and scalability in terms of energy, throughput, area and bandwidth, aiming to understand the key trends to reduce computation costs in neural-network processing.Index Terms-ASIC, deep neural networks, precision-scalable circuits, configurable circuits, MAC, multiply-accumulate units. I. INTRODUCTIONE MBEDDED deep learning has gained a lot of attention nowadays due to its broad application prospects and vast potential market. However, the main challenge to embrace this era of edge intelligence comes from the supply-anddemand gap between the limited energy budget of embedded devices, often battery powered, and the computationallyintensive deep-learning algorithms, requiring billions of Multiply-Accumulate (MAC) operations and data movements.To alleviate this unbalanced relationship, many approaches have been investigated at different levels of abstraction. At algorithmic level, researchers have introduced hardware-

show abstract

“…It has also highlighted that less scalability levels can be a good trade-off thanks to lower circuit overheads. Future works could propose a more extensive analysis and cover additional configurable or low-precision design techniques [8]- [10].…”

Section: Discussionmentioning

confidence: 99%

Survey of Precision-Scalable Multiply-Accumulate Units for Neural-Network Processing

Camus

¹

,

Enz

²

,

Verhelst

³

2019

2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)

Self Cite

View full text Add to dashboard Cite

The current trend for deep learning has come with an enormous computational need for billions of Multiply-Accumulate (MAC) operations per inference. Fortunately, reduced precision has demonstrated large benefits with low impact on accuracy, paving the way towards processing in mobile devices and IoT nodes. Precision-scalable MAC architectures optimized for neural networks have recently gained interest thanks to their subword parallel or bit-serial capabilities. Yet, it has been hard to make a fair judgment of their relative benefits as they have been implemented with different technologies and performance targets. In this work, run-time configurable MAC units from ISSCC 2017 and 2018 are implemented and compared objectively under diverse precision scenarios. All circuits are synthesized in a 28 nm commercial CMOS process with precision ranging from 2 to 8 bits. This work analyzes the impact of scalability and compares the different MAC units in terms of energy, throughput and area, aiming to understand the optimal architectures to reduce computation costs in neural-network processing.

show abstract

Machine Learning at the Edge

Verhelst

¹

,

Murmann

²

2020

The Frontiers Collection

View full text Add to dashboard Cite

Sub-Word Parallel Precision-Scalable MAC Engines for Efficient Embedded DNN Inference

Cited by 31 publications

References 13 publications

Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural-Network Processing

Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural-Network Processing

Survey of Precision-Scalable Multiply-Accumulate Units for Neural-Network Processing

Machine Learning at the Edge

Contact Info

Product

Resources

About