Sequential hardware prefetching in shared-memory multiprocessors

Dahlgren, Fredrik; Dubois, Michel; Stenström, Per

doi:10.1109/71.395402

Cited by 117 publications

(90 citation statements)

References 27 publications

“…A first insight is the significant increase in performance resulting from the inclusion of even a small MSHR file. This is illustrated by the difference in performance between the configurations with no MSHR file (1,0), (2,0), (4,0), (8,0) and the rest. We combine collected average performance numbers with area information and determine the Pareto-optimal design space points.…”

Section: DSE Results (mentioning)

confidence: 99%

Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores

Caragea, Tzannes, Keceli, et al. 2011

Int J Parallel Prog

Abstract. Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies however have been unable to keep up with this trend, with modern designs allowing only 4-16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g. GPUs) where power and area per-core budget favor numerous lighter cores with less resources, further reducing support for MLP on a per-core basis. Support for hardware and software prefetch increases MLP pressure since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches supported, and optimized for the same. We implemented our algorithm in a GCC-derived compiler and evaluated its performance using an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% in run-time on average across benchmarks and the state-of-the-art GCC implementation by up to 34.79%, depending upon hardware configuration. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism, and show run-time improvements of up to 24.61%. To demonstrate the robustness of our approach, we conduct a design-space exploration (DSE) for the considered target architecture by varying (i) the amount of chip resources designated for per-core prefetch storage and (ii) off-chip bandwidth. We show that the RAP algorithm is robust in that it improves performance across all design points considered. We also identify the Pareto-optimal hardware-software configuration which delivers 53.66% run-time improvement on average while using only 5.47% more chip area than the bare-bones design.
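
The Pareto-optimal selection described in the abstract can be illustrated with a minimal sketch. It assumes each design point is summarized by two costs, relative chip area and relative run-time, both to be minimized. The labels and numbers are made-up placeholders, not data from the paper, and `dominates` and `pareto_front` are hypothetical helper names.

```python
from typing import List, Tuple

# (label, relative chip area, relative run-time); lower is better for both.
# All values are illustrative placeholders, not measurements from the paper.
DesignPoint = Tuple[str, float, float]


def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """True if a is no worse than b on both costs and strictly better on one."""
    no_worse = a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(points: List[DesignPoint]) -> List[DesignPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [
    ("A", 1.00, 1.00),  # smallest area, slowest
    ("B", 1.05, 0.50),
    ("C", 1.10, 0.55),  # larger and slower than B, so dominated
    ("D", 1.20, 0.45),  # largest area, fastest
]
front = pareto_front(points)
print([label for label, _, _ in front])
```

A point stays on the front as long as no alternative is at least as good on both area and run-time, so several configurations can be Pareto-optimal at once; choosing among them is a separate design decision.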

“…Prefetching schemes for parallel architectures in both software [19,31,20] and hardware (e.g. [8]) build upon uni-processor prefetching by taking into consideration issues caused by sharing of data and resources, such as coherence traffic and overheads.…”

Section: Related Work (mentioning)

confidence: 99%

Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores

Caragea, Tzannes, Keceli, et al. 2011

Int J Parallel Prog

“…Prefetching schemes for parallel architectures in both software [14], [18], [19] and hardware (e.g. [20]) build upon uni-processor prefetching by taking into consideration issues caused by sharing of data and resources, such as coherence traffic and overheads.…”

Section: Related Work (mentioning)

confidence: 99%

Resource-Aware Compiler Prefetching for Many-Cores

Caragea, Tzannes, Keceli, et al. 2010

2010 Ninth International Symposium on Parallel and Distributed Computing

Abstract. Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies however have been unable to keep up with this trend, with modern designs allowing only 4-16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g. GPUs) where power and area per-core budget favor lighter cores with less resources. Support for hardware and software prefetch increases MLP pressure since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches supported, and optimized for the same. We show that in situations where not enough resources are available to issue prefetch instructions for all references in a loop, it is more beneficial to decrease the prefetch distance and prefetch for as many references as possible, rather than use a fixed prefetch distance and skip prefetching for some references, as in current approaches. We implemented our algorithm in a GCC-derived compiler and evaluated its performance using an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% and the state-of-the-art GCC implementation by up to 34.79%. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism, and show improvements of up to 24.61%.
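
The trade-off stated in the abstract (shrink the prefetch distance so more references are covered, rather than keep the distance fixed and skip references) can be sketched with a toy model. This is not the RAP algorithm itself. It assumes, purely for illustration, that a reference prefetched `distance` iterations ahead keeps `distance` prefetches in flight and that the hardware offers a fixed number of prefetch `slots`. The function names and the model are hypothetical.

```python
from typing import Tuple


def fixed_distance_plan(num_refs: int, distance: int, slots: int) -> Tuple[int, int]:
    """Keep the distance fixed; references that do not fit are skipped.

    Returns (references covered, distance used). Assumes distance >= 1.
    """
    covered = min(num_refs, slots // distance)
    return covered, distance


def resource_aware_plan(num_refs: int, distance: int, slots: int) -> Tuple[int, int]:
    """Shrink the distance (never below 1) so that more references fit.

    Toy model only: each covered reference occupies `distance` slots.
    """
    if num_refs == 0:
        return 0, distance
    # Largest distance that still lets every reference fit, capped at the
    # requested distance and floored at 1.
    shrunk = max(1, min(distance, slots // num_refs))
    covered = min(num_refs, slots // shrunk)
    return covered, shrunk


# Six references in a loop, a preferred distance of 4, and 8 prefetch slots.
print(fixed_distance_plan(6, 4, 8))   # covers 2 references at distance 4
print(resource_aware_plan(6, 4, 8))   # covers all 6 references at distance 1
```

In this model the resource-aware plan gives up some latency hiding per reference (a shorter distance) in exchange for covering every reference; whether that wins in practice depends on memory latency and loop body length, which the sketch does not model.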

“…Prefetching schemes based on hardware [3,4], software [5], or both [6,7] have been studied extensively. In hardware prefetching schemes, the prefetching activities are controlled solely by the hardware.…”

Section: Introduction (mentioning)

confidence: 99%