Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware

Zhu, Qunxiong; Graf, Tobias; Sumbul, H. Ekin; Pileggi, Larry; Franchetti, Franz

doi:10.1109/hpec.2013.6670336

Cited by 83 publications

(47 citation statements)

References 23 publications


“…A similar operation is performed for assembling C by using a single "vertical CAM," which activates individual horizontal CAM blocks only if their corresponding column indices are matched. High-level system simulations in [12] show that such a LiM based CAM-SpGEMM core can be used as a low-power hardware accelerator in 3D IC stacks. Sparse matrices are decomposed into sub-blocks and then mapped to DRAM rows for maximizing off-chip DRAM row buffer hit.…”

Section: LiM Synthesis Example (mentioning)

confidence: 99%
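The sub-block decomposition mentioned in the statement above can be illustrated with a minimal sketch. It assumes the matrix is given as coordinate triples and that square blocks of a fixed size are used; the function name `partition_into_blocks` and the block size are hypothetical, and the actual mapping of blocks to DRAM rows in [12] is not described here.

```python
from collections import defaultdict


def partition_into_blocks(entries, block_size):
    """Group non-zero entries (row, col, value) into square sub-blocks.

    Assumption: each sub-block would later be stored contiguously (for
    example in one DRAM row) so that its entries are fetched together.
    Returns a dict keyed by (block_row, block_col) whose values are lists
    of (local_row, local_col, value) in input order.
    """
    blocks = defaultdict(list)
    for row, col, value in entries:
        key = (row // block_size, col // block_size)
        blocks[key].append((row % block_size, col % block_size, value))
    return dict(blocks)


# A 4x4 matrix with four non-zeros, split into 2x2 sub-blocks.
example_entries = [(0, 0, 1.0), (1, 3, 2.0), (2, 2, 3.0), (3, 2, 4.0)]
example_blocks = partition_into_blocks(example_entries, 2)
```

Only non-empty blocks appear in the result, so the storage cost follows the number of non-zeros rather than the full matrix dimensions.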

“…To improve the column-by-column algorithm for SpGEMM, Zhu et al explored the data storage and access patterns in [12] and showed that the SpGEMM operations can be effectively mapped to LiM based content addressable memory (CAM) blocks. As matrix sparsity requires storing only the non-zero elements that are accompanied by their row and column indices, the single cycle "matching" capability of CAMs facilitates index comparison and alignment.…”

Section: LiM Synthesis Example (mentioning)

confidence: 99%
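For readers unfamiliar with the column-by-column SpGEMM formulation referred to above, the following software sketch may help. It is not the hardware design from the cited work: a Python dictionary keyed by row index stands in for the index-matching role the statement attributes to CAM blocks, and the column-dictionary storage layout and the name `spgemm_column_by_column` are assumptions made for illustration.

```python
from collections import defaultdict


def spgemm_column_by_column(a_cols, b_cols):
    """Compute C = A * B one column of C at a time.

    a_cols and b_cols map a column index to a list of (row, value)
    pairs holding only the non-zero entries of that column.
    """
    c_cols = {}
    for j, b_col in b_cols.items():
        # Accumulator keyed by row index: looking up a row index here is
        # the software analogue of a single-cycle CAM match.
        acc = defaultdict(float)
        for k, b_kj in b_col:
            # Column k of A is scaled by B[k, j] and merged into column j of C.
            for i, a_ik in a_cols.get(k, []):
                acc[i] += a_ik * b_kj
        if acc:
            c_cols[j] = sorted(acc.items())
    return c_cols


# A = [[1, 0], [2, 3]] and B = [[0, 4], [5, 0]], stored by column.
a_example = {0: [(0, 1.0), (1, 2.0)], 1: [(1, 3.0)]}
b_example = {0: [(1, 5.0)], 1: [(0, 4.0)]}
c_example = spgemm_column_by_column(a_example, b_example)
```

Each column of C is built from only those columns of A selected by the non-zeros of the matching column of B, which is why fast index comparison and alignment dominate the cost.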

“…There are various examples of "processor-in-memory" designs wherein processing units are placed in memory abstraction [2] [3], or more recently, near data computing in 3D and 2.5D stacks to provide more bandwidth with less energy [11]. It has also been shown that various data intensive applications highly benefit from LiM blocks [12] [13]. However, there remains the need for physical synthesis (rather than compilation) of LiM blocks that are compatible with a full chip physical synthesis and offer the ability of system-level exploration.…”

Section: Introduction (mentioning)

confidence: 98%


A synthesis methodology for application-specific logic-in-memory designs

Sumbul, Vaidyanathan, Zhu et al. 2015

Proceedings of the 52nd Annual Design Automation Conference


For deeply scaled digital integrated systems, the power required for transporting data between memory and logic can exceed the power needed for computation, thereby limiting the efficacy of synthesizing logic and compiling memory independently. Logic-in-Memory (LiM) architectures address this challenge by embedding logic within the memory block to perform basic operations on data locally for specific functions. While custom smart memories have been successfully constructed for various applications, a fully automated LiM synthesis flow enables architectural exploration that has heretofore not been possible. In this paper we present a tool and design methodology for LiM physical synthesis that performs co-design of algorithms and architectures to explore system level trade-offs. The resulting layouts and timing models can be incorporated within any physical synthesis tool. Silicon results shown in this paper demonstrate a 250x performance improvement and 310x energy savings for a data-intensive application example.


“…At the bottom of the Fig. 6, we show the structures of LiM core customized for the 2D FFT and SpGEMM respectively [4], [33]. As we can see, both LiM cores involve embedded memory arrays, on-chip buffers, arithmetic units, as well as the control models such as DRAM to Local Memory (D2L) and Local Memory to Core (L2C).…”

Section: D. LiM Accelerated Data Intensive Applications (mentioning)

confidence: 99%

“…The CAM based SpGEMM is designed to match the specific sparse data access pattern, and it is able to process the sparse data in an extremely high throughput to match the TSV bandwidth. The design details are beyond the scope of this paper and can be found in another accompanying work [33].…”

Section: D. LiM Accelerated Data Intensive Applications (mentioning)

confidence: 99%

A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing

Zhu, Akin, Sumbul et al. 2013

2013 IEEE International 3D Systems Integration Conference (3DIC)


This paper introduces a 3D-stacked logic-in-memory (LiM) system that integrates the 3D die-stacked DRAM architecture with the application-specific LiM IC to accelerate important data-intensive computing. The proposed system comprises a fine-grained rank-level 3D die-stacked DRAM device and extra LiM layers implementing logic-enhanced SRAM blocks that are dedicated to a particular application. Through silicon vias (TSVs) are used for vertical interconnections providing the required bandwidth to support the high performance LiM computing. We performed a comprehensive 3D DRAM design space exploration and exploit the efficient architectures to accelerate the computing that can balance the performance and power. Our experiments demonstrate orders of magnitude of performance and power efficiency improvements compared with the traditional multithreaded software implementation on modern CPU.


OpenFAM: Programming disaggregated memory

Singhal, Crasta, Abdulla K et al. 2023

Concurrency and Computation


High performance computing (HPC) clusters are increasingly handling workloads where working data sets cannot be easily partitioned or are too large to fit into local node memory. In order to enable HPC workloads to access memory external to the node, HPE has defined a programming API (OpenFAM) for developing applications that use large‐scale disaggregated memory. In this paper we describe an open‐source reference implementation of OpenFAM that can be used on scale‐up machines, traditional HPC clusters, as well as emerging disaggregated memory architectures. We demonstrate the efficiency of the implementation using micro‐benchmarks on InfiniBand and Slingshot‐based clusters.
