2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC)
DOI: 10.1109/vlsi-soc.2019.8920343

A Product Engine for Energy-Efficient Execution of Binary Neural Networks Using Resistive Memories

Abstract: The need for running complex Machine Learning (ML) algorithms, such as Convolutional Neural Networks (CNNs), in edge devices, which are highly constrained in terms of computing power and energy, makes it important to execute such applications efficiently. The situation has led to the popularization of Binary Neural Networks (BNNs), which significantly reduce execution time and memory requirements by representing the weights (and possibly the data being operated) using only one bit. Because approximately 90% of…
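To make the one-bit representation concrete, here is a minimal sketch (not the paper's code) of how full-precision weights can be binarized and bit-packed; the array sizes and the 32x figure assume 32-bit floats as the baseline:

```python
# A minimal sketch (not the paper's code) of the one-bit weight
# representation behind BNNs: weights are reduced to their signs and
# bit-packed, for a 32x memory saving over 32-bit floats.
import numpy as np

rng = np.random.default_rng(0)
w_fp32 = rng.standard_normal(1024).astype(np.float32)  # full-precision weights

w_bin = w_fp32 >= 0            # sign binarization: True -> +1, False -> -1
w_packed = np.packbits(w_bin)  # 8 weights per byte

print(w_fp32.nbytes)    # 4096 bytes (32 bits per weight)
print(w_packed.nbytes)  # 128 bytes (1 bit per weight) -> 32x smaller
```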

Cited by 8 publications (11 citation statements) · References 26 publications

“…Most closely related to our approach is the paper of Vieira et al., which details a full-system evaluation strategy of AIMC acceleration. As in our case, the authors also base their approach on AIMC-dedicated extensions to the gem5 environment [23]. Nonetheless, their approach is limited to modelling the simple case of binary CNNs, and their per-kernel mapping strategy does not scale to the larger and more general applications we tackle in this paper.…”
Section: B. Simulations of AIMC-Based Systems
Mentioning · confidence: 99%
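For illustration, here is a hedged numpy sketch of what a per-kernel mapping of the kind attributed to Vieira et al. could look like; the function names and the idealized, noise-free analog matrix-vector multiply are assumptions, not the cited tool's API:

```python
# A hedged illustration (not the cited tool's API) of "per-kernel mapping":
# each convolution kernel is flattened into one crossbar column, so every
# analog matrix-vector product yields one output value per kernel.
import numpy as np

def map_kernels_to_crossbar(kernels):
    """kernels: (n_kernels, c, kh, kw) -> crossbar of shape (c*kh*kw, n_kernels)."""
    n = kernels.shape[0]
    return kernels.reshape(n, -1).T       # one column per kernel

def aimc_mvm(crossbar, patch):
    """Idealized in-memory matrix-vector multiply (device noise not modeled)."""
    return patch.reshape(-1) @ crossbar

rng = np.random.default_rng(0)
kernels = np.sign(rng.standard_normal((8, 3, 3, 3)))  # 8 binary 3x3x3 kernels
xbar = map_kernels_to_crossbar(kernels)               # (27, 8) conductances
patch = np.sign(rng.standard_normal((3, 3, 3)))       # one input patch
print(aimc_mvm(xbar, patch))                          # 8 output activations
```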
“…By itself, this allows for the system-level implementation of classic loosely-coupled AIMC-enabled systems. To simulate the tightly-coupled AIMC-enabled architectures, we extend the accelerator modeling in [23] such that the custom ARMv8 ISA extension can also interface with peripheral I/O (PIO) devices like our wrapper object. For this, we add connections between the ISA extension and the PIO device via the system object (i.e., the simulated system that is instantiated on gem5-X's launch).…”
Section: B. AIMC-Enabled Systems in gem5-X
Mentioning · confidence: 99%
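As a speculative sketch of the wiring the quote describes (runnable only inside gem5's Python environment), the following gem5-style configuration attaches a memory-mapped accelerator wrapper to the simulated system object. `AIMCWrapper` and its parameters are hypothetical, which is why those lines stay commented out; only `System` and `SystemXBar` are standard gem5 constructs:

```python
# A speculative sketch of attaching a memory-mapped accelerator wrapper to
# the simulated system object, in the spirit of the quote. AIMCWrapper is a
# hypothetical SimObject, hence commented out.
from m5.objects import System, SystemXBar

system = System()
system.membus = SystemXBar()

# Hypothetical PIO device exposing the AIMC crossbar at a fixed address range:
# system.aimc = AIMCWrapper(pio_addr=0x2F000000, pio_size=0x1000)
# system.aimc.pio = system.membus.mem_side_ports

# A custom ISA extension would then issue loads/stores into that range to
# drive the in-memory compute, instead of going through a loosely coupled
# driver stack.
```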
“…Xilinx Vivado reports that the floating-point arithmetic units responsible for computing the distances account for 12% of the on-chip energy consumption. Furthermore, the memory accesses dominate the energy consumption of the entire system, reaching as high as 90% of the total energy consumption [24]. Consequently, the units responsible for computing the distances account for only 1% of the total energy consumption.…”
Section: Energy Efficiency Improvements
Mentioning · confidence: 99%
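The ~1% figure follows directly from combining the two quoted numbers, under the assumption that the 12% applies to the non-memory (compute) share of the energy budget:

```python
# Sanity check of the quoted figures, assuming the 12% applies to the
# non-memory (compute) share of the energy budget:
memory_share = 0.90                    # memory accesses: ~90% of total energy
compute_share = 1.0 - memory_share     # remaining on-chip compute: ~10%
distance_units = 0.12 * compute_share  # 12% of the compute share
print(f"{distance_units:.1%}")         # -> 1.2%, consistent with the ~1% quoted
```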
“…Moreover, adding logic gates to the SA can realize more complex functions like addition.

BNN (XNOR + popcnt): IMCE [117] and CA-PIM [8] on SOT-MRAM; PSA-BNN [118], SRAM-CIM [65], XNORAM [119], and XNOR-SRAM [69] on SRAM; CIM-SR [120] on SRAM and ReRAM; XNOR-BNN [113], ReRAM-BNN [64], FPSA-BNN [121], BDPE [122], and 2T2R-TCAM [66] on ReRAM; VR-XNOR [123] on memristor; EEIM-BNN [124] on SOT-MRAM; MLC-CIM [125] and PIMBALL [126] on STT-MRAM.
TNN (ternary multiplication or Gated-XNOR + popcnt): TiM-DNN [68] on SRAM-TPC; XNOR-SRAM [69] on SRAM; TeC-Cell [127] on FeRAM; 4T2R-IM-DP [128] on ReRAM; SpinLiM [129] on SOT-MRAM; Ter-LiM [130] on memristor; IMC-CD-TNN [70] on switched-capacitor.
BWN (dense addition): ParaPIM [4] on SOT-MRAM; MRIMA [5] on STT-MRAM.
TWN (sparse addition): the proposed FAT [9] on STT-MRAM.
AdderNet (dense add + sub): the proposed iMAD [131] on STT-MRAM.

IMC accelerators for BNNs receive great research effort thanks to BNNs' simple computation workflow: BNNs replace the 1-bit multiplication with XNOR and the 1-bit accumulation with popcnt (counting the number of "1"s in a binary value) in the binary dot product.…”
Section: In-Memory Computing
Mentioning · confidence: 99%
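A minimal sketch of that binary dot product on bit-packed operands; the encoding (bit 1 for +1, bit 0 for -1) and the helper name are illustrative:

```python
# Binary dot product via XNOR + popcnt: with elements in {-1,+1} encoded as
# bits (1 -> +1, 0 -> -1), multiplication is XNOR and accumulation is a
# population count.
def bnn_dot(x_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two n-element {-1,+1} vectors packed into integers."""
    xnor = ~(x_bits ^ w_bits) & ((1 << n) - 1)  # 1 wherever signs agree
    matches = bin(xnor).count("1")              # popcnt
    return 2 * matches - n                      # (+1)*matches + (-1)*(n-matches)

# Example: x = [+1,-1,+1,-1], w = [+1,+1,-1,-1] -> dot = 1 - 1 - 1 + 1 = 0
x, w = 0b1010, 0b1100
print(bnn_dot(x, w, 4))  # 0
```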
“…FPSA-BNN [121] uses (+1, -1) for weights and (+1, 0) for input neurons so that it can fuse the XNOR, popcnt, and sign function to create a Fully Parallel RRAM Synaptic Array (FPSA), achieving high parallelism by reading out several consecutive rows simultaneously. BDPE [122] integrates a Binary Dot Product Engine (BDPE) inside the CPU for fast and energy-efficient XNOR and popcnt operations utilizing ReRAM. 2T2R-TCAM [66] creates a 2-transistor-2-ReRAM (2T2R) Ternary Content Addressable Memory (TCAM) that supports in-memory logic and XNOR/XOR-based binary dot products.…”
Section: In-Memory Computing
Mentioning · confidence: 99%
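To illustrate the fusion described for FPSA-BNN in arithmetic terms only, note that the sign of the binary dot product reduces to a popcount threshold, so the fused form never materializes a full-precision sum. This is a hedged sketch using the simpler ±1/±1 encoding rather than FPSA-BNN's (+1, 0) input encoding, and it models none of the RRAM array:

```python
# Fused XNOR + popcnt + sign: the binarized output activation is just a
# threshold on the number of sign matches, since
#   sign(2*matches - n) >= 0  <=>  matches >= n/2.
import numpy as np

def fused_xnor_popcnt_sign(x_bits, w_bits):
    """x_bits, w_bits: boolean arrays; returns the binarized activation."""
    matches = np.count_nonzero(~(x_bits ^ w_bits))  # XNOR + popcnt
    return matches >= x_bits.size / 2               # threshold, no full sum

x = np.array([True, False, True, True])   # +1, -1, +1, +1
w = np.array([True, True, True, False])   # +1, +1, +1, -1
print(fused_xnor_popcnt_sign(x, w))        # True: 2 of 4 match, sign(0) -> +1
```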