Proceedings of the 48th International Symposium on Microarchitecture 2015
DOI: 10.1145/2830772.2830812
Filtered runahead execution with a runahead buffer

Cited by 23 publications (23 citation statements) · References 31 publications
“…These new instructions are executed if their source data is available, thereby generating additional memory accesses and boosting MLP. Hashemi et al. [17] observed that Runahead incurs significant energy overhead because the core front-end remains operational for the entire runahead duration. They proposed to filter out the instructions leading up to the memory accesses and buffer them in a Runahead Buffer.…”
Section: Related Work
“…However, many instructions are not necessary to calculate the memory addresses used in subsequent long-latency loads. Hashemi et al. [4] propose a technique to track and execute only the chain of instructions that leads to a long-latency load. Upon a full-window stall, they perform an expensive backward data-flow walk in the ROB and the store queue to find a dependency chain that leads to another instance of the same stalling load.…”
Section: Filtered Runahead Execution
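The backward data-flow walk described above can be sketched as follows. This is a minimal illustrative model, not the paper's implementation: the ROB is modeled as a list of micro-ops with source and destination register sets, and all names (`extract_dependency_chain`, the `dst`/`src` keys) are hypothetical.

```python
# Hypothetical sketch of a backward data-flow walk through the ROB:
# starting from the stalling load, walk toward older instructions and
# keep only those whose destination registers feed the load's address
# computation (data structures are illustrative, not from the paper).

def extract_dependency_chain(rob, stall_index):
    """rob: list of micro-ops, oldest first; each op is a dict with
    'dst' (registers written) and 'src' (registers read).
    stall_index: position of the stalling load in the ROB.
    Returns the indices of the dependency chain, oldest first."""
    needed = set(rob[stall_index]["src"])  # live-ins of the stalling load
    chain = [stall_index]
    for i in range(stall_index - 1, -1, -1):  # walk backward
        produced = rob[i]["dst"] & needed
        if produced:
            chain.append(i)
            needed -= produced             # these inputs are now satisfied
            needed |= set(rob[i]["src"])   # but the producer has inputs too
    chain.reverse()  # oldest-first order, ready for replay
    return chain
```

For example, if instruction 0 produces `r1`, instruction 2 computes `r2` from `r1`, and the load at index 3 is addressed by `r2`, the walk returns `[0, 2, 3]`, skipping the unrelated instruction in between.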
“…Instructions that are not part of a dependency chain that generates a long-latency load waste processor resources that could otherwise be used to generate prefetch requests. To improve the energy efficiency and performance of runahead execution, the runahead buffer [4] filters out unnecessary runahead instructions. In runahead mode, this technique identifies the chain of instructions that generates the stalling load, stores it in the runahead buffer, and keeps replaying only this instruction chain in a loop.…”
Section: Introduction
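The replay loop described above can be sketched in miniature: the buffered chain is executed repeatedly while the original miss is outstanding, and each iteration yields a new prefetch address. Everything here (`replay_chain`, the micro-op tuples, the iteration cap) is an assumption for illustration, not the paper's mechanism.

```python
# Hypothetical sketch of runahead-buffer replay: the extracted chain is
# held in a small buffer and replayed in a loop, issuing a prefetch for
# the address the chain's final load computes on each iteration.

def replay_chain(chain_ops, regfile, max_iters=4):
    """chain_ops: list of (dst_reg, fn, src_regs) micro-ops ending in the
    stalling load's address computation; regfile: dict of register values.
    max_iters stands in for 'while the real miss is still pending'.
    Returns the list of prefetch addresses generated."""
    prefetches = []
    for _ in range(max_iters):
        for dst, fn, srcs in chain_ops:   # replay the buffered chain
            regfile[dst] = fn(*(regfile[s] for s in srcs))
        prefetches.append(regfile[chain_ops[-1][0]])  # address of next load
    return prefetches
```

With a single-op chain that strides a pointer by `r2` (e.g. `r1 = r1 + r2`), starting from `r1 = 0x100` and `r2 = 0x40`, four replay iterations yield prefetches for `0x140, 0x180, 0x1c0, 0x200` — illustrating how replaying only the chain keeps generating useful memory-level parallelism.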
“…Other than explicitly launching a helper thread, many proposals have dealt with reducing the chance that a conventional microarchitecture is blocked [2], [13], [14], [19], [30], [38], [39], [41], [51], [55], [69], [73], [100]. Many designs share a common theme: checkpoint important state, then clean up some structures to allow further (speculative) execution.…”
Section: Background and Related Work