2018
DOI: 10.1109/tcad.2018.2857019

XNOR Neural Engine: A Hardware Accelerator IP for 21.6-fJ/op Binary Neural Network Inference

Abstract: [abstract text not recovered] The paper reproduces the loop nest of a convolutional layer:

    # Loop nest of a convolutional layer (Python-style pseudocode).
    # N_out, N_in: output/input channel counts; h_out, w_out: output height/width;
    # fs: filter size; W: weights; x: input activations; y: output activations.
    for k_out in range(0, N_out):
        for k_in in range(0, N_in):
            for i in range(0, h_out):
                for j in range(0, w_out):
                    if k_in == 0:
                        y[k_out, i, j] = 0  # zero each accumulator once, not per input channel
                    for u_i in range(0, fs):
                        for u_j in range(0, fs):
                            y[k_out, i, j] += W[k_out, k_in, u_i, u_j] * x[k_in, i + u_i, j + u_j]
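Since the accelerator targets binary neural networks, the interesting point is how this loop nest collapses under binarization: with weights and activations constrained to {-1, +1} and bit-packed into machine words, the inner multiply-accumulate reduces to an XNOR followed by a popcount. A minimal illustrative sketch, not the paper's implementation (binary_dot, w_bits, x_bits, and n are made-up names):

    # Dot product of two {-1, +1} vectors packed as n-bit integers,
    # with bit = 1 encoding +1 and bit = 0 encoding -1.
    def binary_dot(w_bits: int, x_bits: int, n: int) -> int:
        mask = (1 << n) - 1
        xnor = ~(w_bits ^ x_bits) & mask   # bit i is 1 where the signs agree
        matches = bin(xnor).count("1")     # popcount of the agreement mask
        return 2 * matches - n             # agreements minus disagreements

    # w = [+1, -1, +1, +1], x = [+1, +1, +1, -1] (LSB first) -> dot product 0
    assert binary_dot(0b1101, 0b0111, 4) == 0

In hardware this replaces the multiplier array with XNOR gates feeding a popcount tree, which is the usual source of the fJ/op-scale efficiency of binary designs like the XNE.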

Cited by 108 publications (91 citation statements)
References 45 publications
“…By considering technology scaling, we see that the energy efficiency (in terms of TOP/s/W) of PPAC is comparable to that of the two fully-digital designs in [23], [24] but 7.9× and 2.3× lower than that of the mixed-signal designs in [6] and [19], respectively, where the latter is implemented in a comparable technology node as PPAC. As noted in Section III-D, mixed-signal designs are particularly useful for tasks that are resilient to noise or process variation, such as neural network inference.…”
Section: B. Comparison With Existing Accelerators
confidence: 97%
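For scale, the paper's headline figure and the TOP/s/W metric quoted above are reciprocals of each other, since 1 W sustained is 1 J/s. A quick illustrative conversion:

    # Convert energy per operation (fJ/op) into TOP/s/W.
    energy_per_op_fj = 21.6                            # headline figure of the XNE paper
    ops_per_joule = 1.0 / (energy_per_op_fj * 1e-15)   # op/J, i.e. op/s per W
    tops_per_watt = ops_per_joule / 1e12               # 1 TOP/s/W = 1e12 op/s per W
    print(f"{tops_per_watt:.1f}")                      # 46.3

So 21.6 fJ/op corresponds to roughly 46.3 TOP/s/W peak efficiency, before any technology-scaling normalization of the kind applied in the quoted comparison.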
“…Each lane consists of a First-In First-Out (FIFO) queue to buffer read and write data. An address generator based on the one presented by Schuiki et al. [8] and Conti et al. [9] assigns memory addresses to the stream-based accesses performed by the core. The lane can be put into read mode, in which case the address generator is used to fetch data from memory and store it in the FIFO.…”
Section: Data Mover
confidence: 99%
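The cited address generators map a nested-loop access pattern onto a flat stream of memory addresses. A minimal sketch of that idea, with a hypothetical interface (gen_addresses and its base/lengths/strides parameters are illustrative, not the actual interface of [8] or [9]):

    # Yield byte addresses for a nested strided access pattern:
    # addr = base + sum(idx[d] * strides[d]), with idx iterated like nested loops.
    def gen_addresses(base, lengths, strides):
        assert len(lengths) == len(strides)
        idx = [0] * len(lengths)
        while True:
            yield base + sum(i * s for i, s in zip(idx, strides))
            for d in reversed(range(len(idx))):   # increment innermost index first
                idx[d] += 1
                if idx[d] < lengths[d]:
                    break
                idx[d] = 0                        # carry into the next-outer loop
            else:
                return                            # all indices wrapped: pattern done

    # Example: a 4x8 tile of 32-bit words at 0x1000 with a 64-byte row pitch.
    addrs = list(gen_addresses(0x1000, lengths=[4, 8], strides=[64, 4]))

In read mode, each address from such a generator would trigger a memory fetch whose data is pushed into the lane's FIFO; write mode reverses the direction.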
“…Umuroglu et al. [67] have created FINN, a framework for binarized Field Programmable Gate Array (FPGA) accelerators, which was further expanded to larger models by Fraser et al. [23]. Other binarized accelerators have been proposed, targeting FPGAs [49,51,72,76], Application-Specific Integrated Circuits (ASICs) [2,10,18,63], and in-memory compute [11,36]. Yang et al. [71] have developed BMXNet, an extension of MXNet [13] based on the binarized GEMM kernel.…”
Section: Binarized Neural Network
confidence: 99%