1991
DOI: 10.1016/0743-7315(91)90014-z
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors

Cited by 276 publications
(216 citation statements)
References 4 publications
“…Mowry proposed a compiler algorithm for selective prefetching in uniprocessor and multiprocessor systems [10]. Simulation analysis based on DASH-like CC-NUMA multiprocessors showed considerable improvements, between 6% and 53% reduction in execution time.…”
Section: Related Work
confidence: 99%
“…In software-controlled cache prefetching, a processor executes a special Pf instruction, which initiates a non-blocking fetch operation that brings a data block, expected to be used by that processor, into its cache [10]. Ideally, the data block arrives at the cache before it is needed by the processor, and its load instruction results in a cache hit (Figure 1b, processor P2).…”
Section: Introduction
confidence: 99%
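The mechanism this excerpt describes can be sketched in portable C. As an illustration only: `__builtin_prefetch` (a GCC/Clang builtin) stands in for the special non-blocking Pf instruction, and `PREFETCH_DIST` is an assumed prefetch distance that would be tuned per machine; neither name comes from the paper.

```c
#include <stddef.h>

/* Assumed prefetch distance: how many iterations ahead to issue the
   non-blocking fetch so the block arrives before the demand load. */
#define PREFETCH_DIST 16

long sum_with_prefetch(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Non-blocking hint: fetch a future element into the cache.
           Second arg 0 = read, third arg 1 = low temporal locality. */
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1);
        sum += a[i];  /* ideally now a cache hit, as in Figure 1b */
    }
    return sum;
}
```

Because the prefetch is only a hint, the result is identical with or without it; the hoped-for effect is on memory latency, not on correctness.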
“…In his comprehensive work on software data prefetching, Mowry [22] explores the effect on execution time of varying the number of outstanding prefetch requests that can be handled simultaneously by the hardware. The author also compares two versions of the prefetch issue buffer hardware: one in which the processor stalls when the buffer is full, and one where additional requests are simply dropped.…”
Section: Related Work
confidence: 99%
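The two issue-buffer policies the excerpt contrasts can be modeled with a small sketch. Everything here is illustrative, not from the paper: `BUF_SLOTS`, the `pf_buffer` type, and the function names are assumptions, and only the drop policy is shown in code (the stall policy would block the issuing processor until `pf_complete` frees a slot, which needs concurrency to model faithfully).

```c
#include <stdbool.h>
#include <stddef.h>

#define BUF_SLOTS 4  /* assumed number of simultaneously outstanding prefetches */

typedef struct {
    const void *pending[BUF_SLOTS];  /* addresses of in-flight prefetches */
    size_t count;                    /* outstanding requests */
    size_t dropped;                  /* requests discarded under the drop policy */
} pf_buffer;

/* Drop policy: a full buffer silently discards the new request; the later
   demand load then simply misses in the cache instead of hitting. */
bool pf_issue_drop(pf_buffer *b, const void *addr) {
    if (b->count == BUF_SLOTS) {
        b->dropped++;
        return false;
    }
    b->pending[b->count++] = addr;
    return true;
}

/* One outstanding prefetch finishes filling its cache block, freeing a slot. */
void pf_complete(pf_buffer *b) {
    if (b->count > 0)
        b->count--;
}
```

The design trade-off the excerpt points at: stalling preserves every prefetch but can erase the latency the prefetches were meant to hide, while dropping keeps the processor running at the cost of occasional extra cache misses.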
“…The concept of overlapping computation with I/O, network, and other long-latency operations is an old one. Prefetching techniques [15][16][17] and thread speculation [1,18,19] also exploit this kind of overlap. Most previous work on prefetching focused on moving data (mostly contiguous data) from main memory to local memory (either registers or cache) prior to execution.…”
Section: Related Work
confidence: 99%