2020
DOI: 10.48550/arxiv.2012.03112
Preprint
A Modern Primer on Processing in Memory

Abstract: Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially …

Cited by 11 publications (17 citation statements)
References 251 publications
“…Synergy With PIM. Processing-in-memory (PIM) systems improve system performance and/or energy consumption by performing computations directly within a memory chip, thereby avoiding unnecessary data movement [25,26,57,58,60,116,118,137,139]. Prior works propose a broad range of PIM systems [5-8, 13, 22-24, 34, 38, 44, 48, 49, 54, 55, 58, 59, 65, 66, 71, 72, 89, 98, 100, 103, 107, 113, 115, 119, 120, 124, 133-135, 137-139, 142, 148, 164, 168] in the context of various workloads and memory devices.…”
Section: Motivation and Goal
confidence: 99%
“…Therefore, QUAC-TRNG offers a new design point that can enable new applications that were previously infeasible with alternative TRNGs, especially for systems where the costs of on-chip TRNGs may be prohibitive (e.g., heavily constrained embedded systems, processing-in-memory architectures). For example, QUAC-TRNG would enable processing-in-memory systems [62,116,137,157] to execute security workloads, as it enables true random number generation directly within a DRAM chip.…”
Section: Non-DRAM-Based TRNGs That Require Specialized Hardware
confidence: 99%
“…Stacked memory architectures vertically stack DRAM layers on top of each other and connect the vertical partitions of memory using high-bandwidth through-silicon vias (TSVs). A typical 3D-stacked memory configuration can employ thousands of TSVs [45], which makes its internal memory bandwidth far exceed that of traditional memory systems. At the bottom of the memory stack, there is a logic layer that can host hardware logic capable of interacting with both the host processor and the DRAM memory.…”
Section: Background and Assumptions, A. Processing-in-Memory
confidence: 99%
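As a rough illustration of the bandwidth argument in the statement above, the following Python sketch compares the aggregate internal bandwidth of thousands of TSVs against a single conventional DDR4 channel. The TSV count, per-TSV data rate, and DDR4 transfer rate are illustrative assumptions, not figures taken from the cited works.

```python
# Back-of-envelope comparison of internal 3D-stacked memory bandwidth
# versus a single conventional DDR4 channel.
# All figures below are illustrative assumptions.

def tsv_internal_bandwidth_gbs(num_tsvs: int, per_tsv_gbit_s: float) -> float:
    """Aggregate internal bandwidth in GB/s across all through-silicon vias."""
    return num_tsvs * per_tsv_gbit_s / 8.0  # Gbit/s -> GB/s

def ddr4_channel_bandwidth_gbs(transfer_mt_s: int, bus_width_bits: int = 64) -> float:
    """Peak bandwidth in GB/s of a single DDR4 channel."""
    return transfer_mt_s * 1e6 * bus_width_bits / 8.0 / 1e9

if __name__ == "__main__":
    internal = tsv_internal_bandwidth_gbs(num_tsvs=2048, per_tsv_gbit_s=2.0)  # assumed
    external = ddr4_channel_bandwidth_gbs(transfer_mt_s=3200)                 # DDR4-3200
    print(f"Assumed internal TSV bandwidth: {internal:.0f} GB/s")
    print(f"DDR4-3200 channel bandwidth:    {external:.1f} GB/s")
    print(f"Ratio: {internal / external:.1f}x")
```

With these assumed numbers the internal path delivers roughly 512 GB/s versus 25.6 GB/s for one DDR4-3200 channel, which is the kind of gap the citing paper alludes to when it says internal bandwidth "far exceeds" that of traditional memory systems.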
“…Because it sits near memory, an application offloaded to PIM gains high memory bandwidth, as its data does not have to move across the slow memory bus. Moreover, in 3D-stacked memories, the TSV connections between the layers naturally provide more internal bandwidth [45]. This makes PIM well suited for memory-bound workloads and for workloads with erratic memory access patterns.…”
Section: A. Processing-in-Memory
confidence: 99%
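The trade-off described in that statement can be made concrete with a simple roofline-style model: a kernel with low arithmetic intensity (memory-bound) benefits from the higher internal bandwidth behind the TSVs even if the PIM logic has a much lower compute peak. The sketch below is a minimal Python illustration; all bandwidth and peak-compute figures are hypothetical assumptions, not parameters of any cited system.

```python
# Roofline-style sketch: decide whether a kernel is a good PIM offload
# candidate based on its arithmetic intensity (FLOPs per byte moved).
# All bandwidth and peak-compute figures are illustrative assumptions.

HOST_PEAK_GFLOPS = 1000.0   # assumed host compute peak
HOST_BUS_GBS = 25.6         # assumed off-chip memory bandwidth
PIM_PEAK_GFLOPS = 100.0     # assumed (simpler) PIM compute peak
PIM_INTERNAL_GBS = 512.0    # assumed internal TSV bandwidth

def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Classic roofline: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)

def prefer_pim(intensity: float) -> bool:
    """True if the PIM side attains higher throughput than the host for this kernel."""
    host = attainable_gflops(intensity, HOST_PEAK_GFLOPS, HOST_BUS_GBS)
    pim = attainable_gflops(intensity, PIM_PEAK_GFLOPS, PIM_INTERNAL_GBS)
    return pim > host

if __name__ == "__main__":
    for intensity in (0.25, 1.0, 8.0, 64.0):
        side = "PIM" if prefer_pim(intensity) else "host"
        print(f"intensity {intensity:>5} FLOP/B -> run on {side}")
```

Under these assumptions, kernels below a few FLOPs per byte (typical of pointer-chasing and streaming workloads) land on the PIM side, while compute-heavy kernels stay on the host.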
“…This motivates using processing-in-memory (PIM) to gain the much-needed speedups in graph mining. While PIM is not the only potential solution for hardware acceleration of graph mining, we select PIM because (1) it represents one of the most promising trends to tackle the memory bottleneck [69,128], outperforming other approaches [153], (2) it offers well-understood designs [129], and (3) numerous works illustrate that it brings very large speedups in simple graph algorithms such as BFS or PageRank (see more than 15 works in Table 7), also using processing fully inside DRAM [10]. Yet, graph mining algorithms are much more complex: they employ deep recursion, create many intermediate data structures with non-trivial inter-dependencies, and have high load imbalance [62,186].…”
Section: Introduction
confidence: 99%