2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
DOI: 10.1109/isca.2016.46
Accelerating Dependent Cache Misses with an Enhanced Memory Controller

Cited by 56 publications (11 citation statements)
References: 49 publications
“…Concretely, we avoid high-frequency interactions with internal CPU resources, avoid tracking dependencies among non-load instructions and avoid the increased verification costs associated with complicating the CPU design. The cost of exact criticality and dependency tracking has motivated researchers to develop heuristic approaches [64,65,66]. Such heuristics are similar to the heuristics used in architecture-centric accounting which we have shown to be less accurate than dataflow accounting.…”
Section: Related Work
confidence: 97%
“…A hardware mechanism called cache-conscious wavefront scheduling, which uses an intra-wavefront locality detector to capture locality, was proposed in [14]. To minimize dependent cache miss latency, Hashemi and others [16] proposed adding enough functionality to dynamically identify instructions at the core and migrate them to the memory controller for execution.…”
Section: Previous Work
confidence: 99%
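The citation above summarizes the core idea of the paper: when a load misses, the chain of instructions that depend on it is identified at the core and shipped to an enhanced memory controller for execution close to memory. The sketch below illustrates only that identification step, assuming a simplified micro-op window; the field names, window representation, and chain-length cap are illustrative assumptions, not the authors' hardware design.

```python
# Hypothetical sketch (not the authors' hardware implementation) of identifying
# the dependence chain of a missed load so that it could be shipped to an
# enhanced memory controller, in the spirit of Hashemi et al. [16].

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MicroOp:
    opcode: str
    dst: Optional[str]                  # destination register, if any
    srcs: list = field(default_factory=list)

def dependence_chain(window, miss_idx, max_len=16):
    """Collect ops after the missed load at window[miss_idx] that transitively
    consume its result, up to max_len ops (illustrative cap)."""
    tainted = {window[miss_idx].dst}    # registers derived from the missed load
    chain = []
    for op in window[miss_idx + 1:]:
        if tainted & set(op.srcs):      # op consumes a value derived from the miss
            chain.append(op)
            if op.dst is not None:
                tainted.add(op.dst)
            if len(chain) == max_len:
                break
        elif op.dst in tainted:
            tainted.discard(op.dst)     # register overwritten independently: taint ends
    return chain

# Pointer-chasing example: the second load depends on the first (missed) load.
window = [
    MicroOp("load",  "r1", ["r0"]),     # misses in the cache
    MicroOp("add",   "r2", ["r1", "r3"]),
    MicroOp("load",  "r4", ["r2"]),     # dependent cache miss
    MicroOp("store", None, ["r5", "r6"]),
]
print([op.opcode for op in dependence_chain(window, 0)])   # ['add', 'load']
```

Running the example prints ['add', 'load']: the address computation and the dependent pointer-chasing load are exactly the operations that would benefit from executing at the memory controller instead of waiting for the first miss to return to the core.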
“…A new memory-buffer chip called Centaur, which provides up to 128 MB of embedded DRAM buffer cache per processor along with an improved DRAM scheduler, was proposed in [15]. To minimize dependent cache miss latency, Hashemi and others [16] proposed adding enough functionality to dynamically identify instructions at the core and migrate them to the memory controller for execution. In [17], a dynamic scheduling algorithm was proposed for a set of sporadic real-time tasks that efficiently co-schedules processor and DMA execution to hide memory transfer latency.…”
Section: Previous Work
confidence: 99%
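The latency-hiding argument behind co-scheduling a processor with DMA transfers, mentioned in the [17] snippet above, can be shown with a small back-of-the-envelope model. This is a generic double-buffering calculation under assumed per-chunk times, not the scheduling algorithm proposed in [17].

```python
# Back-of-the-envelope model (not the scheduler from [17]) of how overlapping
# DMA transfers with computation hides memory transfer latency. The per-chunk
# times below are illustrative assumptions.

def serial_time(n_chunks, t_dma, t_compute):
    """Fetch each chunk, then process it, strictly one after the other."""
    return n_chunks * (t_dma + t_compute)

def overlapped_time(n_chunks, t_dma, t_compute):
    """Double buffering: while chunk i is processed, chunk i+1 is fetched,
    so the steady-state cost per chunk is max(t_dma, t_compute)."""
    return t_dma + n_chunks * max(t_dma, t_compute)

n, t_dma, t_compute = 100, 4.0, 5.0   # e.g. microseconds per chunk (assumed)
print(serial_time(n, t_dma, t_compute))      # 900.0
print(overlapped_time(n, t_dma, t_compute))  # 504.0 -> transfer latency mostly hidden
```

With these assumed numbers, overlapping shrinks total time from 900 to 504 time units, because each chunk's transfer is hidden behind the computation on the previous chunk.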
“…Various prior works [1,2,3,5,7,8,25,31,33,34,35,36,38,40,42,46,47,56,62,69,92,109,111,112,114,125,126,128,129,134,139,149] examine processing in memory to reduce DRAM latency. Other prior works propose memory scheduling techniques [4,37,49,66,67,74,99,100,103,104,135,136,137,138,141], which generally reduce the latency to access DRAM.…”
Section: Other Latency Reduction Mechanisms
confidence: 99%
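As an illustration of how memory scheduling can reduce DRAM access latency, the sketch below implements the widely known row-hit-first (FR-FCFS-style) policy for a single bank: the oldest request that hits the currently open row is served first, avoiding costly precharge/activate cycles. The request fields and single-bank model are simplifying assumptions and do not correspond to any specific proposal cited above.

```python
# Simplified sketch of a row-hit-first (FR-FCFS-style) DRAM request scheduler,
# one well-known way memory schedulers reduce average access latency.

from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    row: int        # DRAM row targeted by the request
    arrival: int    # arrival order / timestamp

def pick_next(queue: deque, open_row: Optional[int]):
    """Prefer the oldest request hitting the open row (cheap column access);
    otherwise fall back to the oldest request overall (FCFS)."""
    for req in queue:            # queue is kept in arrival order
        if req.row == open_row:
            return req
    return queue[0]

# Example: with row 7 open, the row-hit request is served before an older miss.
q = deque([Request(row=3, arrival=0), Request(row=7, arrival=1), Request(row=3, arrival=2)])
nxt = pick_next(q, open_row=7)
print(nxt)                        # Request(row=7, arrival=1)
q.remove(nxt)                     # request leaves the queue once scheduled
```

The design choice is a latency/fairness trade-off: serving row hits first keeps the row buffer productive and lowers average latency, at the risk of starving row-miss requests unless an age cap is added.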