FIFOrder MicroArchitecture: Ready-Aware Instruction Scheduling for OoO Processors

Alipour, Mehdi; Kumar, Rakesh; Black-Schaffer, David

doi:10.23919/date.2019.8715034

Cited by 10 publications

(8 citation statements)

References 14 publications

(19 reference statements)


“…In contrast to CRISP, OOO techniques, however, fail to improve performance if there do not exist sufficient independent instructions after the delinquent load (see Figure 1). Instruction criticality has been leveraged to improve scheduling in prior works including Fiforder [4], Longterm parking [102], and Delay-and-Bypass [3]. These works partition the instruction queue into smaller sub-queues holding ready, non-ready, critical, and non-critical instructions to improve the scheduling energy-efficiency.…”

Section: Related Work (mentioning)

confidence: 99%

CRISP: critical slice prefetching

Litz

Ayers

Ranganathan

2022

Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems


The high access latency of DRAM continues to be a performance challenge for contemporary microprocessor systems. Prefetching is a well-established technique to address this problem; however, existing implemented designs fail to provide any performance benefits in the presence of irregular memory access patterns. The hardware complexity of prior techniques that can predict irregular memory accesses, such as runahead execution, has proven untenable for implementation in real hardware. We propose a lightweight mechanism to hide the high latency of irregular memory access patterns by leveraging criticality-based scheduling. In particular, our technique executes delinquent loads and their load slices as early as possible, hiding a significant fraction of their latency. Furthermore, we observe that the latency induced by branch mispredictions and other high latency instructions can be hidden with a similar approach. Our proposal only requires minimal hardware modifications by performing memory access classification, load and branch slice extraction, as well as priority analysis exclusively in software. As a result, our technique is feasible to implement, introducing only a simple new instruction prefix while requiring minimal modifications of the instruction scheduler. Our technique increases the IPC of memory-latency-bound applications by up to 38% and by 8.4% on average.


“…Therefore, they propose an architecture that attempts to execute all instructions via in-order pipeline stages before dispatching the unexecuted ones to the OoO pipeline, thereby reducing scheduling energy. FIFOrder [2], instead of trying to execute all instructions via in-order stages, dispatches ready instructions to a FIFO issue queue and non-ready instructions to an OoO (content addressable memory-based) issue queue. As the OoO queue handles fewer instructions, FIFOrder reduces its depth and width, thus reducing the scheduling energy cost.…”

Section: Energy-efficient Core Design (mentioning)

confidence: 99%

Dependence-aware Slice Execution to Boost MLP in Slice-out-of-order Cores

Kumar

Alipour

Black-Schaffer

2022

ACM Trans. Archit. Code Optim.

Self Cite


Exploiting memory-level parallelism (MLP) is crucial to hide long memory and last-level cache access latencies. While out-of-order (OoO) cores, and techniques building on them, are effective at exploiting MLP, they deliver poor energy efficiency due to their complex and energy-hungry hardware. This work revisits slice-out-of-order (sOoO) cores as an energy-efficient alternative for MLP exploitation. sOoO cores achieve energy efficiency by constructing and executing slices of MLP-generating instructions out-of-order only with respect to the rest of instructions; the slices and the remaining instructions, by themselves, execute in-order. However, we observe that existing sOoO cores miss significant MLP opportunities due to their dependence-oblivious in-order slice execution, which causes dependent slices to frequently block MLP generation. To boost MLP generation, we introduce Freeway, a sOoO core based on a new dependence-aware slice execution policy that tracks dependent slices and keeps them from blocking subsequent independent slices and MLP extraction. The proposed core incurs minimal area and power overheads, yet approaches the MLP benefits of fully OoO cores. Our evaluation shows that Freeway delivers 12% better performance than the state-of-the-art sOoO core and is within 7% of the MLP limits of full OoO execution.
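The steering policy summarized in the citation statement above (ready instructions go to a FIFO issue queue, non-ready instructions go to a CAM-based OoO issue queue) can be illustrated with a toy model. This is only a sketch under stated assumptions, not the FIFOrder design itself: the names `Instr` and `steer`, the register-name representation, and the readiness test (all source registers available at dispatch) are hypothetical choices made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Instr:
    """Hypothetical instruction record: a label, source registers, one destination."""
    name: str
    srcs: tuple = ()
    dst: str = ""


def steer(instrs, ready_regs):
    """Split instructions, in program order, into a FIFO queue and an OoO queue.

    Assumption: an instruction counts as "ready" when every source register
    is currently available. Its destination is then marked pending, so later
    consumers in the same dispatch group are treated as non-ready.
    """
    ready = set(ready_regs)      # copy so the caller's set is not modified
    fifo_q = deque()             # cheap in-order queue for ready instructions
    ooo_q = []                   # stands in for the CAM-based issue queue
    for ins in instrs:
        if all(r in ready for r in ins.srcs):
            fifo_q.append(ins)
        else:
            ooo_q.append(ins)
        if ins.dst:
            ready.discard(ins.dst)   # result not produced yet at dispatch time
    return fifo_q, ooo_q


if __name__ == "__main__":
    program = [
        Instr("load", srcs=("r0",), dst="r1"),
        Instr("add", srcs=("r1",), dst="r2"),   # waits on the load result
        Instr("sub", srcs=("r3",), dst="r4"),
    ]
    fifo_q, ooo_q = steer(program, {"r0", "r3"})
    print([i.name for i in fifo_q], [i.name for i in ooo_q])
```

In this toy run only the instruction that depends on the pending load reaches the OoO queue, which is the intuition behind shrinking that queue's depth and width.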


“…FIFOrder: The FIFOrder architecture, proposed by Alipour et al. [19], offloads and issues instructions from three FIFO queues covering ready, "almost-ready", and "load tail" instructions. By separating instructions into these classes they can reduce cross-FIFO stalls due to dependencies on long-latency loads.…”

Section: A Ready-aware Approaches (mentioning)

confidence: 99%

“…This allows for the use of smaller and/or narrower IQs without hurting performance. Examples of this approach include filtering instructions that can be executed earlier [18], "parking" instructions that will not be ready for a while [3], and bypassing instructions that do not benefit from out-of-order scheduling [19]. Implicit to the approach of reducing IQ pressure is the need to identify instructions that do not benefit from the expensive scheduling capabilities of the IQ.…”

Section: Introduction (mentioning)

confidence: 99%

Delay and Bypass: Ready and Criticality Aware Instruction Scheduling in Out-of-Order Processors

Alipour

Black-Schaffer

Kumar

2020

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)

Self Cite


Flexible instruction scheduling is essential for performance in out-of-order processors. This is typically achieved by using CAM-based Instruction Queues (IQs) that provide complete flexibility in choosing ready instructions for execution, but at the cost of significant scheduling energy. In this work we seek to reduce the instruction scheduling energy by reducing the depth and width of the IQ. We do so by classifying instructions based on their readiness and criticality, and using this information to bypass the IQ for instructions that will not benefit from its expensive scheduling structures and delay instructions that will not harm performance. Combined, these approaches allow us to offload a significant portion of the instructions from the IQ to much cheaper FIFO-based scheduling structures without hurting performance. As a result we can reduce the IQ depth and width by half, thereby saving energy. Our design, Delay and Bypass (DNB), is the first design to explicitly address both readiness and criticality to reduce scheduling energy. By handling both classes we are able to achieve 95% of the baseline out-of-order performance while only using 33% of the scheduling energy. This represents a significant improvement over previous designs which addressed only criticality or readiness (91%/89% performance at 74%/53% energy).
