2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
DOI: 10.1109/isca45697.2020.00071

iPIM: Programmable In-Memory Image Processing Accelerator Using Near-Bank Architecture

Cited by 50 publications (32 citation statements)
References 61 publications
“…A large body of prior work examines Processing-Near-Memory (PNM) [3, 4, 8, 9, 16, 30-32, 39, 47, 52, 57, 66, 68, 76-78, 81, 89, 90, 101, 102, 109, 110, 112, 126, 129, 130, 144, 151, 166, 167, 176, 177, 179-181, 191, 194, 199, 206, 212, 223, 224, 232, 240, 269, 271, 280, 281]. PNM integrates processing units near or inside the memory via a 3D PNM configuration (i.e., processing units are located at the logic layer of 3D-stacked memories) [3, 30-32, 47, 57, 76, 166, 180, 181, 206, 269, 271, 281], a 2.5D PNM configuration (i.e., processing units are located in the same package as the CPU connected via silicon interposers) [68,81,223], a 2D PNM configuration (i.e., processing units are placed inside DDRX DIMMs) [9,16,44,89,90,126,143,147,148,179,185,199,212,282], or at the memory controller of CPU systems [101,102,167]. These works propose hardware designs for irregular applications like graph processing [3,4,31,32,52,180,281], bioinformatics [39,81,130,147,148], neural networks [29,30,48,68,78,89,129,…”
Section: Related Work
confidence: 99%
“…Most near-bank PIM architectures [16,44,45,55,82,89,94,140,145,151,179,199,240] support several PIM-enabled memory chips connected to a host CPU via memory channels. Each memory chip comprises multiple PIM cores, which are low-area and low-power cores with relatively low computation capability [82,94], and each of them is located close to a DRAM bank [16,44,45,55,82,89,94,140,145,151,179,199,240]. Each PIM core can access data located on its local DRAM bank, and typically there is no direct communication channel among PIM cores.…”
Section: Introduction
confidence: 99%
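The execution model in the excerpt above (host-managed scatter and gather, per-bank compute, no core-to-core channel) is the defining constraint of near-bank PIM, and the minimal Python sketch below makes it concrete. Every name in it is hypothetical for illustration; this is not the iPIM ISA or any vendor's PIM SDK.

```python
# Minimal sketch of the near-bank PIM execution model quoted above.
# Each "PIM core" computes only on the shard in its local DRAM bank;
# the host scatters inputs and gathers results, because there is no
# direct communication channel among PIM cores. All names hypothetical.
from typing import Callable, List

def split_contiguous(data: List[int], n: int) -> List[List[int]]:
    """Split data into n contiguous shards (one per PIM core/bank)."""
    k, m = divmod(len(data), n)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n)]

def run_on_pim(data: List[int], n_cores: int,
               kernel: Callable[[List[int]], List[int]]) -> List[int]:
    shards = split_contiguous(data, n_cores)        # host -> banks
    partials = [kernel(shard) for shard in shards]  # per-core, local data only
    out: List[int] = []
    for p in partials:                              # banks -> host
        out.extend(p)
    return out

if __name__ == "__main__":
    # Element-wise brighten: a memory-bound image operation that maps well
    # to near-bank PIM, since each pixel touches only its local bank.
    pixels = list(range(0, 256, 16))
    print(run_on_pim(pixels, n_cores=4,
                     kernel=lambda s: [min(v + 10, 255) for v in s]))
```

Note how any operation needing data from two shards would have to round-trip through the host, which is exactly why these works emphasize data placement.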
“…In an NMP system with 3D memory cubes, the processing capability is in the base logic die under a stack of DRAM layers to utilize the ample internal bandwidth [5]. Later research also proposes near-bank processing with logic near memory banks in the same DRAM layer to exploit even higher bandwidth [20,21], such as FIMDRAM [22] announced recently by Samsung. Recent proposals [23,24,25,26,27] have also explored augmenting traditional DIMMs with computation in the buffer die to provide low-cost but bandwidth-limited NMP solutions.…”
Section: Near-memory Processing
confidence: 99%
“…However, the elapsed time of operators from the R³ cluster takes 52% of total time, making R³-like operators (memory-intensive, highly parallel operators) the actual bottleneck, not Conv2D. Instead of accelerating Conv2D, which would require more computation resources or larger on-chip memory, our analysis recommends that the architecture be designed with higher effective memory bandwidth, such as processing-in-memory architectures [15,22,30,33], for R³-like operators, because they take the majority of the elapsed time.…”
Section: Application
confidence: 99%
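The 52%/48% split quoted above puts an Amdahl's-law ceiling on accelerating Conv2D alone, which is the heart of the argument; the quick check below works the numbers. The split comes from the quote, while the 4× PIM factor is an illustrative assumption.

```python
# Amdahl's-law check of the profile quoted above: R^3-like operators
# take 52% of elapsed time, everything else (including Conv2D) 48%.
# The 4x PIM speedup is an illustrative assumption, not a measured number.

def overall_speedup(f: float, s: float) -> float:
    """Amdahl's law: fraction f of runtime is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Ceiling from accelerating the non-R^3 48% alone, even infinitely:
print(f"{overall_speedup(0.48, float('inf')):.2f}x")  # ~1.92x
# Accelerating the memory-bound 52% by an assumed 4x via PIM:
print(f"{overall_speedup(0.52, 4.0):.2f}x")           # ~1.64x
```

Even an infinitely fast Conv2D engine cannot beat roughly 1.92× end to end, whereas a modest bandwidth-driven speedup on the R³-like operators already approaches that bound, supporting the recommendation of PIM-style architectures.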