"Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense

Barnes, R. D.; Ryoo, Shane; Hwu, Wen-mei W.

doi:10.1109/micro.2005.1

Cited by 28 publications

(21 citation statements)

References 20 publications


“…Mutlu and Patt, and Barnes et al subsequently demonstrated the benefits of runahead execution in the context of out-of-order [21] and EPIC [3] microarchitectures, respectively. There have been a number of subsequent enhancements since then involving more sophisticated checkpoint mechanisms [1] as well as the ability to runahead without requiring the re-execution of subsequent instructions [28].…”

Section: Related Work (mentioning)

confidence: 99%

Runahead execution vs. conventional data prefetching in the IBM POWER6 microprocessor

Cain

Nagpurkar

2010

2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS)


After many years of prefetching research, most commercially available systems support only two types of prefetching: software-directed prefetching and hardware-based prefetchers using simple sequential or stride-based prefetching algorithms. More sophisticated prefetching proposals, despite promises of improved performance, have not been adopted by industry. In this paper, we explore the efficacy of both hardware and software prefetching in the context of an IBM POWER6 commercial server. Using a variety of applications that have been compiled with an aggressively optimizing compiler to use software prefetching when appropriate, we perform the first study of a new runahead prefetching feature adopted by the POWER6 design, evaluating it in isolation and in conjunction with a conventional hardware-based sequential stream prefetcher and compiler-inserted software prefetching. We find that the POWER6 implementation of runahead prefetching is quite effective on many of the memory-intensive applications studied; in isolation it improves performance as much as 36% and on average 10%. However, it outperforms the hardware-based stream prefetcher on only two of the benchmarks studied, and in those by a small margin. When used in conjunction with the conventional prefetching mechanisms, the runahead feature adds an additional 6% on average, and 39% in the best case (GemsFDTD).



“…Like Multipass [3], iCFP may make multiple rally passes over the slice buffer, initiating a pass every time a pending miss returns. Each rally pass processes fewer instructions, until the slice is completely processed.…”

Section: Advance and Rally (mentioning)

confidence: 99%
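The rally behaviour described in the statement above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the iCFP or Multipass hardware: each slice-buffer entry is tagged directly with the set of outstanding misses it waits on (real designs track this through register dependences), and the names `SliceEntry`, `rally`, and the miss identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceEntry:
    """One deferred instruction and the outstanding misses it waits on."""
    name: str
    waits_on: frozenset  # hypothetical miss identifiers, e.g. {"A"}


def rally(slice_buffer, miss_return_order):
    """Start one rally pass per returning miss.

    Each pass walks the buffer in program order, executes the entries whose
    misses have all returned, and keeps the rest for a later pass.
    Returns the number of entries examined in each pass and any leftovers.
    """
    returned = set()
    pass_sizes = []
    for miss in miss_return_order:
        returned.add(miss)
        pass_sizes.append(len(slice_buffer))  # entries examined in this pass
        slice_buffer = [e for e in slice_buffer if not e.waits_on <= returned]
    return pass_sizes, slice_buffer


# Illustrative slice: four deferred instructions, three outstanding misses.
buffer = [
    SliceEntry("i1", frozenset({"A"})),
    SliceEntry("i2", frozenset({"A", "B"})),
    SliceEntry("i3", frozenset({"B"})),
    SliceEntry("i4", frozenset({"A", "B", "C"})),
]
sizes, leftover = rally(buffer, ["A", "B", "C"])
print(sizes, leftover)
```

In this example the passes examine 4, 3, and 1 entries, matching the statement that each rally pass processes fewer instructions until the slice is empty.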

“…Like SLTP, iCFP un-blocks the pipeline on cache misses, drains miss-dependent instructions, along with their miss-independent side inputs, into a slice buffer and then re-executes only the slice when the miss returns. Re-executing only the miss-dependent slice gives SLTP and iCFP a performance advantage over techniques like Runahead execution [8] and "flea-flicker" Multipass pipelining [3], which un-block the pipeline on a miss but then re-process all post-miss instructions. iCFP has an additional advantage over SLTP.…”

Section: Introduction (mentioning)

confidence: 99%
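To make the contrast in the statement above concrete, the sketch below computes the miss-dependent slice of a short post-miss instruction stream and compares its size with the full stream. It is a toy register-only model and an assumption of this note, not a description of any of the cited designs: memory dependences and stores are ignored, and the function name, tuple layout, and register names are all hypothetical.

```python
def miss_dependent_slice(instructions, missed_dest):
    """Return names of instructions that transitively depend on a missed load.

    `instructions` is a program-order list of (name, dest, sources) tuples
    covering everything after the load; `missed_dest` is the register the
    missing load writes.
    """
    poisoned = {missed_dest}  # registers whose value is not yet known
    slice_names = []
    for name, dest, sources in instructions:
        if poisoned & set(sources):
            poisoned.add(dest)          # result depends on the miss
            slice_names.append(name)
        else:
            poisoned.discard(dest)      # overwritten with a known value
    return slice_names


# Illustrative stream following a load into r1 that misses.
post_miss = [
    ("add", "r2", ["r1", "r3"]),  # depends on the miss
    ("mul", "r4", ["r5", "r6"]),  # independent
    ("sub", "r7", ["r2", "r4"]),  # depends on the miss through r2
    ("mov", "r1", ["r5"]),        # independent; r1 becomes known again
    ("or",  "r8", ["r1", "r4"]),  # independent
]
deferred = miss_dependent_slice(post_miss, "r1")
print(f"slice re-execution: {len(deferred)} instructions; "
      f"re-processing everything: {len(post_miss)} instructions")
```

Under these assumptions only `add` and `sub` would need to be re-executed when the miss returns, whereas a scheme that re-processes all post-miss instructions would handle all five.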

iCFP: Tolerating all-level cache misses in in-order processors

Hilton

Nagarakatte

Roth

2009

2009 IEEE 15th International Symposium on High Performance Computer Architecture


Growing concerns about power have revived interest in in-order pipelines. In-order pipelines sacrifice single-thread performance. Specifically, they do not allow execution to flow freely around data cache misses. As a result, they have difficulties overlapping independent misses with one another. Previously proposed techniques like Runahead execution and Multipass pipelining have attacked this problem. In this paper, we go a step further and introduce iCFP (in-order Continual Flow Pipeline), an adaptation of the CFP concept to an in-order processor. When iCFP encounters a primary data cache or L2 miss, it checkpoints the register file and transitions into an "advance" execution mode. Miss-independent instructions execute as usual and even update register state. Miss-dependent instructions are diverted into a slice buffer, un-blocking the pipeline latches. When the miss returns, iCFP "rallies" and executes the contents of the slice buffer, merging miss-dependent state with miss-independent state along the way. An enhanced register dependence tracking scheme and a novel store buffer design facilitate the merging process. Cycle-level simulations show that iCFP out-performs Runahead, Multipass, and SLTP, another non-blocking in-order pipeline design.

Keywords: multiprocessing systems, pipeline processing, Runahead execution, all-level cache, in-order continual flow pipeline, in-order pipelines, in-order processors, miss-independent instructions, multipass pipelining, register dependence tracking scheme, register file

Authors: Andrew Hilton, Santosh Nagarakatte, and Amir Roth, Department of Computer and Information Science, University of Pennsylvania. This conference paper is available at ScholarlyCommons: http://repository.upenn.edu/cis_papers/410


“…Techniques like flea-flicker [Barnes et al. 2003, 2005] and dual-core execution [Zhou 2005] tolerate load miss latencies by holding missed loads and their dependent instructions in intra- or intercore queues for deferred processing by a second core. When the load miss is resolved, all instructions are committed in sequential order by the second core.…”

Section: Related Work (mentioning)

confidence: 99%

Performance scalability of decoupled software pipelining

Rangan

Vachharajani

Ottoni

et al.

2008

ACM Trans. Archit. Code Optim.


Any successful solution to using multicore processors to scale general-purpose program performance will have to contend with rising intercore communication costs while exposing coarse-grained parallelism. Recently proposed pipelined multithreading (PMT) techniques have been demonstrated to have general-purpose applicability and are also able to effectively tolerate intercore latencies through pipelined interthread communication. These desirable properties make PMT techniques strong candidates for program parallelization on current and future multicore processors, and understanding their performance characteristics is critical to their deployment. To that end, this paper evaluates the performance scalability of a general-purpose PMT technique called decoupled software pipelining (DSWP) and presents a thorough analysis of the communication bottlenecks that must be overcome for optimal DSWP scalability.

