2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2016.7446065
Minimal disturbance placement and promotion

Cited by 12 publications (17 citation statements) · References 23 publications
“…This is in contrast with prior predictors [2], [4]- [6], which need to access a dedicated predictor table upon every single LLC access. Because modern multicore processors feature distributed last-level caches, accesses to dedicated prediction tables introduce detrimental latency and energy overheads in traversing the on-chip interconnect to query such structures.…”
Section: Introduction
confidence: 94%
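The excerpt above contrasts metadata-embedded prediction with predictors that query a dedicated table on every LLC access. A minimal sketch of such a table-based dead-block predictor, assuming a hypothetical design with PC-indexed saturating counters (all names and parameters here are illustrative, not from the cited papers):

```python
# Hypothetical sketch of a table-based dead-block predictor: a small
# saturating-counter table indexed by a hash of the accessing PC,
# queried on every LLC access (the per-access overhead the excerpt
# criticizes) and trained on eviction.

TABLE_SIZE = 4096
COUNTER_MAX = 3       # 2-bit saturating counters
DEAD_THRESHOLD = 2

table = [0] * TABLE_SIZE

def index(pc):
    # Simple illustrative hash of the program counter.
    return (pc ^ (pc >> 12)) % TABLE_SIZE

def predict_dead(pc):
    # Queried on every single LLC access.
    return table[index(pc)] >= DEAD_THRESHOLD

def train(pc, was_dead):
    # On eviction: strengthen the prediction if the block saw no
    # reuse before leaving the cache, weaken it otherwise.
    i = index(pc)
    if was_dead:
        table[i] = min(COUNTER_MAX, table[i] + 1)
    else:
        table[i] = max(0, table[i] - 1)
```

In a distributed LLC, each such lookup must traverse the on-chip interconnect to reach the predictor table, which is the latency and energy cost the quoted passage highlights.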
“…We evaluate the performance of SPEC CPU 2006 benchmarks using a modified version of CMP$im [15] provided with the JILP Cache Replacement Championship [16] and used in prior research in dead block prediction [2], [4], [5], [14]. Table III summarizes the features of the simulated processor.…”
Section: SPEC CPU 2006
confidence: 99%
“…2 History-based predictive schemes such as the state-ofthe-art Hawkeye [26] and many others [5,10,13,28,29,49,53] learn past reuse behavior of cache blocks by employing sophisticated storage-intensive prediction mechanisms. A large body of recent works focus on history-based schemes as they generally provide higher performance than the lightweight schemes for a wide range of applications.…”
Section: F. Prior Hardware Schemes
confidence: 99%
“…These hardware schemes aim to perform two tasks: (1) identify cache blocks that are likely to exhibit high reuse, and (2) protect high reuse cache blocks from cache thrashing. To accomplish the first task, these schemes deploy either probabilistic or prediction-based hardware mechanisms [5,10,13,26,28,29,41,49,51,52,53,57,58,59,60]. However, our work finds that graph-dependent irregular access patterns prevent these schemes from correctly learning which cache blocks to preserve, rendering them deficient for the broad domain of graph analytics.…”
Section: Introduction
confidence: 96%
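The excerpt above describes schemes whose second task is protecting high-reuse blocks from thrashing. A minimal sketch of one such protection mechanism for a single cache set, assuming a hypothetical reuse-bit policy (not any specific scheme from the cited works): a block becomes protected once it is re-referenced, and victim selection prefers blocks that never showed reuse.

```python
# Hypothetical sketch of reuse-based protection in one cache set:
# a block is marked "reused" on its first re-reference, and eviction
# prefers blocks that never showed reuse, shielding high-reuse blocks
# from a thrashing access stream.

class Block:
    def __init__(self, tag):
        self.tag = tag
        self.reused = False  # set on first re-reference

class CacheSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = []  # index 0 = LRU position, last = MRU

    def access(self, tag):
        for b in self.blocks:
            if b.tag == tag:
                b.reused = True         # reuse observed: protect it
                self.blocks.remove(b)
                self.blocks.append(b)   # promote to MRU
                return True             # hit
        self._insert(tag)
        return False                    # miss

    def _insert(self, tag):
        if len(self.blocks) == self.ways:
            # Prefer evicting a block that never showed reuse;
            # fall back to plain LRU if every block is protected.
            victim = next((b for b in self.blocks if not b.reused),
                          self.blocks[0])
            self.blocks.remove(victim)
        self.blocks.append(Block(tag))
```

The quoted passage's point is that irregular, graph-dependent access patterns defeat exactly this kind of learning: a block's past reuse stops being a reliable signal, so such mechanisms protect the wrong blocks.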
“…With so much of the available on-die resources invested in the cache hierarchy, an efficient, high performance design requires intelligent cache management techniques. While many cache management and speculation techniques such as alternate replacement policies [6,14,15,18,27], dead-block/hit prediction [17,20,28,33,36], and prefetching techniques [2,10,16,19,23,26,31,32] have been extensively explored, many of these are piecemeal, one-off solutions that often interact poorly when implemented together and typically only address one level of the memory-system hierarchy. There has been little work exploring the interactions between these policies across multiple levels of the memory hierarchy and examining the information needed across boundaries in the system from software to the core, to the last level cache.…”
Section: Introduction
confidence: 99%