2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
DOI: 10.1109/micro.2010.51

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior

Abstract: In a modern chip-multiprocessor system, memory is a shared resource among multiple concurrently executing threads. The memory scheduling algorithm should resolve memory contention by arbitrating memory access in such a way that competing threads progress at a relatively fast and even pace, resulting in high system throughput and fairness. Previously proposed memory scheduling algorithms are predominantly optimized for only one of these objectives: no scheduling algorithm provides the best system throughput and…
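The paper's title points to its central idea: grouping threads into clusters according to their memory access behavior and scheduling the clusters differently. The sketch below only illustrates that general notion; the clustering rule, the MPKI metric, the 50/50 split, and the priority policy are illustrative assumptions, not the paper's actual algorithm.

```python
# Simplified sketch (not the paper's algorithm): split threads into a
# latency-sensitive cluster (low memory intensity, prioritized) and a
# bandwidth-sensitive cluster (high memory intensity). Threshold and
# policy choices here are assumptions for illustration only.

def cluster_threads(mpki, bandwidth_fraction=0.5):
    """mpki: misses-per-kilo-instruction per thread; returns two clusters."""
    # Sort thread ids by memory intensity, least intensive first.
    order = sorted(range(len(mpki)), key=lambda t: mpki[t])
    cut = int(len(order) * (1.0 - bandwidth_fraction))
    latency_sensitive = order[:cut]     # served with higher priority
    bandwidth_sensitive = order[cut:]   # scheduled to share bandwidth fairly
    return latency_sensitive, bandwidth_sensitive

# Example: four threads with different memory intensities (MPKI).
print(cluster_threads([0.5, 35.0, 2.0, 50.0]))
# ([0, 2], [1, 3]) -> low-MPKI threads prioritized, high-MPKI threads grouped
```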

Cited by 363 publications (390 citation statements)
References 21 publications
“…These designs do not aim to provide worst-case bounds and can underestimate memory interference. Future memory controllers might incorporate ideas like batching and thread prioritization (e.g., [28,19]). This will lead to a different analysis, which could be interesting future work that builds on ours.…”
Section: Related Work
confidence: 99%
“…Each rank consists of multiple banks that share an internal bus for reading/writing data. Because each bank acts as an independent entity, banks can serve multiple memory requests in parallel, offering bank-level parallelism [17,21,32]. A DRAM bank is further sub-divided into multiple subarrays [18,37,44] as shown in Figure 2.…”
Section: DRAM System Organization
confidence: 99%
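The excerpt above describes why banks matter: requests to different banks can proceed concurrently, while requests to the same bank serialize. The minimal sketch below models only that property; the cycle count and the flat per-access service time are assumptions, not real DRAM timing parameters.

```python
# Toy timing model of bank-level parallelism (illustrative assumptions only).

BANK_ACCESS_CYCLES = 50  # assumed per-access service time, not a real DRAM value

def schedule(requests, num_banks=8):
    """requests: list of (arrival_cycle, bank_id); returns completion cycles."""
    bank_free_at = [0] * num_banks      # when each bank next becomes idle
    completions = []
    for arrival, bank in requests:
        start = max(arrival, bank_free_at[bank])  # wait only for *this* bank
        finish = start + BANK_ACCESS_CYCLES
        bank_free_at[bank] = finish
        completions.append(finish)
    return completions

# Two requests to different banks overlap; two to the same bank serialize.
print(schedule([(0, 0), (0, 1)]))  # [50, 50]  -> served in parallel
print(schedule([(0, 0), (0, 0)]))  # [50, 100] -> serialized on one bank
```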
“…All programs are compiled by gcc 4.4.3 with the -O3 optimizations. Similar to previous work, we use weighted speedup [12] (WS) to measure system performance and maximum slowdown (MS) [12] for fairness. We compare several memory allocation schemes including the unmodified paging system in the Linux kernel, utility-based partitioning [15], DRAM bank partitioning [16], random allocation [18] and our proposed HVR system. Figure 11 shows that for workloads that benefit from cache-only and bank-only partitioning (50 workloads in Quadrant I of Figure 3), VP can accumulate the performance gains.…”
Section: Experimental Methodology
confidence: 99%
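The excerpt relies on two metrics that are standard in this literature: weighted speedup for system throughput and maximum slowdown for unfairness. The sketch below uses their common definitions (WS = Σᵢ IPCᵢ_shared / IPCᵢ_alone, MS = maxᵢ IPCᵢ_alone / IPCᵢ_shared); the IPC numbers are made up for illustration.

```python
# Common definitions of the two metrics; example IPC values are illustrative.

def weighted_speedup(ipc_shared, ipc_alone):
    """WS = sum_i IPC_i^shared / IPC_i^alone (higher is better)."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def maximum_slowdown(ipc_shared, ipc_alone):
    """MS = max_i IPC_i^alone / IPC_i^shared (lower means fairer)."""
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))

ipc_alone  = [2.0, 1.5, 1.0, 0.8]   # each thread running alone
ipc_shared = [1.2, 1.0, 0.5, 0.6]   # same threads sharing the memory system
print(weighted_speedup(ipc_shared, ipc_alone))   # ~2.52
print(maximum_slowdown(ipc_shared, ipc_alone))   # 2.0 (thread 2 slowed the most)
```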
“…Previous research efforts [12,29] show that contention can significantly degrade the overall system performance and many solutions have been proposed to mitigate the contention problems.…”
Section: Page-coloring Based Memory Management
confidence: 99%