2012
DOI: 10.1145/2133382.2133384
When Prefetching Works, When It Doesn’t, and Why

Abstract: In emerging and future high-end processor systems, tolerating increasing cache miss latency and properly managing memory bandwidth will be critical to achieving high performance. Prefetching, in both hardware and software, is among our most important available techniques for doing so; yet, we claim that prefetching is perhaps also the least well-understood. Thus, the goal of this study is to develop a novel, foundational understanding of both the benefits and limitations of hardware and software prefetching. Ou…

Cited by 109 publications (73 citation statements); references 43 publications.
“…For instance, stride prefetchers use the distance (i.e., the stride of the load) between the current and last memory addresses referenced by a load instruction to fetch the address formed by the last address plus the stride distance. For a complete review of hardware prefetchers, please refer to [Lee et al 2012]. Usually, real-time application designers disable hardware prefetchers to improve predictability.…”
Section: Future Directions
confidence: 99%
“…Similarly, Laurenzano et al [10] proposed a runtime mechanism that finds opportunities to insert non-temporal prefetch instructions in batch applications to conserve LLC space, so that user-facing applications' performance in datacenters remains predictable. Lee et al [11] investigated combining hardware prefetching and software prefetching for single-threaded applications, concluding that caution should be exercised when mixing the two. In contrast to their work, we have shown that hardware prefetching can be combined with software prefetching in a useful way to increase throughput performance in multicores.…”
Section: Related Work
confidence: 99%
“…Data caches are a nice example where prefetching is automatically driven in hardware: a full set of data (a cache line) is loaded when a single data item contained in it is explicitly accessed. A very interesting recent study [32] tries to draw some “guiding lines” on the usage of prefetching to achieve an actual performance gain. However, no timing guarantees are provided, nor is timing predictability among the goals of the most advanced prefetching techniques.…”
Section: Related Work
confidence: 99%