LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory

Boroumand, Amirali; Ghose, Saugata; Patel, Minesh; Hassan, Hasan; Lucia, Brandon; Hsieh, Kevin; Malladi, Krishna T.; Zheng, Hongzhong; Mutlu, Onur

doi:10.1109/lca.2016.2577557

Cited by 127 publications

(118 citation statements)

References 48 publications

Supporting

Mentioning

118

Contrasting


“…In a block group, the metadata block stores the sequence ID (SID), which is the unique number in the memory log area to represent a block group, and the metadata (BLK-1. Note that memory controllers are becoming increasingly more intelligent and complex to deal with various scheduling and performance management issues in multi-core and heterogeneous systems (e.g., [5], [6], [7], [8], [11], [12], [13], [14], [21], [25], [26], [27], [32], [33], [34], [35], [38], [39], [42], [45], [46], [49], [50], [51], [52], [53], [54], [61], [62], [64], [65], [66], [67], [68], [81], [84], [85], [86], [87], [88], [89], [97], [98], [108], [110], [112], [113],…”

Section: Eager Commit (mentioning)

confidence: 99%

Improving the Performance and Endurance of Persistent Memory with Loose-Ordering Consistency

Shu

Sun

et al. 2024

IEEE Trans. Parallel Distrib. Syst.

Self Cite


Abstract: Persistent memory provides high-performance data persistence at main memory. Memory writes need to be performed in strict order to satisfy storage consistency requirements and enable correct recovery from system crashes. Unfortunately, adhering to such a strict order significantly degrades system performance and persistent memory endurance. This paper introduces a new mechanism, Loose-Ordering Consistency (LOC), that satisfies the ordering requirements at significantly lower performance and endurance loss. LOC consists of two key techniques. First, Eager Commit eliminates the need to perform a persistent commit record write within a transaction. We do so by ensuring that we can determine the status of all committed transactions during recovery by storing necessary metadata information statically with blocks of data written to memory. Second, Speculative Persistence relaxes the write ordering between transactions by allowing writes to be speculatively written to persistent memory. A speculative write is made visible to software only after its associated transaction commits. To enable this, our mechanism supports the tracking of committed transaction ID and multi-versioning in the CPU cache. Our evaluations show that LOC reduces the average performance overhead of memory persistence from 66.9% to 34.9% and the memory write traffic overhead from 17.1% to 3.4% on a variety of workloads.


“…Although a typical PIM consists of a processing unit (PU), a DRAM controller, and at least one more DRAM, recent PIM proposals have not questioned the necessity of using a cache for PIM [5,6,8,9,10,11]. Existing cache architectures for PIM may be classified under two large groups, one inside of PIM [5,10,11] and one outside [6,8,9]. A host processor is the CPU of the system, and a PIM management unit (PMU) receives and passes packets for the operation of PIM from the host processor to the PIM subsystem.…”

Section: Cache Management Policies For PIM (mentioning)

confidence: 99%

Cache memory organization for processing in memory

Kim

Moon

Kim

et al. 2019

IEICE Electron. Express


A promising solution for assuring ultra-low latency in data-intensive application processing systems is processing in memory (PIM). Although most studies that have examined PIM-based computing systems have used cache memory, few have adequately explored a reasonable cache management policy for PIM. Therefore, this paper studies cache management policies for PIM-based computing systems and classifies existing PIM policies according to where they are located and how they are managed. To evaluate the policies, we model three types of PIM-based computing systems used in an in-memory system architecture. One model employs an internal-single cache, another an external cache hierarchy, and the other internal multiple cache-based PIM. We also simulate the performance and power consumption of the three models by their workloads, each with diverse characteristics. The experimental results show how cache policies influence the performance and power of PIM-based in-memory computing systems.


“…As mentioned in Section 5.1, for the evaluation, we generate workloads with different cache miss ratios and marked them with "-H" for the original workload, "-M" for a medium miss ratio setting, and "-L" for a low miss ratio setting. In the evaluation, we compare CAIRO with two naïve methods: (i) disabling offloading, denoted as the "no-offloading" method; and (ii) offloading all eligible candidates, denoted as the "all-offloading" method. We observe that while the "all-offloading" decision is beneficial for high miss ratio settings, it degrades performance for low miss ratio settings.…”

Section: Evaluation Of CPU Workloads (mentioning)

confidence: 99%

CAIRO

Hadidi

Nai

Kim

2017

ACM Trans. Archit. Code Optim.


Three-dimensional (3D)-stacking technology and the memory-wall problem have popularized processing-in-memory (PIM) concepts again, which offers the benefits of bandwidth and energy savings by offloading computations to functional units inside the memory. Several memory vendors have also started to integrate computation logics into the memory, such as Hybrid Memory Cube (HMC), the latest version of which supports up to 18 in-memory atomic instructions. Although industry prototypes have motivated studies for investigating efficient methods and architectures for PIM, researchers have not proposed a systematic way for identifying the benefits of instruction-level PIM offloading. As a result, compiler support for recognizing offloading candidates and utilizing instruction-level PIM offloading is unavailable. In this article, we analyze the advantages of instruction-level PIM offloading in the context of HMC-atomic instructions for graph-computing applications and propose CAIRO, a compiler-assisted technique and decision model for enabling instruction-level offloading of PIM without any burden on programmers. To develop CAIRO, we analyzed how instruction offloading enables performance gain in both CPU and GPU workloads. Our studies show that performance gain from bandwidth savings, the ratio of number of cache misses to total cache accesses, and the overhead of host atomic instructions are the key factors in selecting an offloading candidate. Based on our analytical models, we characterize the properties of beneficial and nonbeneficial candidates for offloading. We evaluate CAIRO with 27 multithreaded CPU and 36 GPU benchmarks. In our evaluation, CAIRO not only doubles the speedup for a set of PIM-beneficial workloads by exploiting HMC-atomic instructions but also prevents slowdown caused by incorrect offloading decisions for other workloads.
CCS Concepts: • Hardware → 3D integrated circuits; Emerging architectures; Memory and dense storage; • Software and its engineering → Compilers;

