The Forward Slice Core Microarchitecture

Lakshminarasimhan, Kartik; Naithani, Ajeya; Feliu, Josué; Eeckhout, Lieven

doi:10.1145/3410463.3414629

Cited by 11 publications

(11 citation statements)

References 27 publications

“…CESP steers load instructions and their consumers to the same queue, which leads to a better overall balance across the queues. Finally, it is interesting to note that the performance improvement over CESP increases with increasing pipeline width: we reported an average 4.5% (and up to 12.6%) improvement for a 2-wide FSC configuration and SPEC CPU2017, see the conference paper [21], while we now report an 11% average improvement for the three-wide configuration.…”

Section: Comparison Against CESP (mentioning)

confidence: 49%

“…Relative to the 3-wide OoO core, we find that the FSC core occupies 47% less chip overhead. This is a more significant saving in chip area than for the 2-wide configurations, i.e., we reported a 37% reduction in chip area for the 2-wide configurations in the conference paper [21]. The bottom line is that the reduction in hardware overhead for FSC relative to an OoO baseline increases with increasing pipeline width.…”

Section: Hardware Overhead (mentioning)

confidence: 66%

“…For a 3-wide core and SPEC CPU2017, the additional power consumption incurred by the additional FSC structures over a baseline in-order core amounts to 32 mW relative to our baseline InO core which consumes 3.12 W versus 8.15 W for the OoO core. Similarly, for a 2-wide setup reported in the conference paper [21], the additional FSC structures account for 19.4 mW, versus 2.99 W and 6.95 W for the InO and OoO cores, respectively. As we scale the superscalar pipeline width from 2 to 3, FSC incurs a 4.7% increase in power consumption, versus 17.3% for the OoO core.…”

Section: Power Consumption (mentioning)

confidence: 73%
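
The scaling percentages quoted in the power-consumption statement can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not the authors' methodology: it assumes that total FSC power equals the InO baseline plus the additional FSC structures, and the names `INO_W`, `FSC_EXTRA_W`, `OOO_W`, and `pct_increase` are illustrative.

```python
# Power figures (in watts) as quoted in the citation statement above,
# keyed by superscalar pipeline width.
INO_W = {2: 2.99, 3: 3.12}          # baseline in-order core
FSC_EXTRA_W = {2: 0.0194, 3: 0.032}  # additional FSC structures (19.4 mW, 32 mW)
OOO_W = {2: 6.95, 3: 8.15}          # out-of-order core


def pct_increase(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


# Assumption: total FSC power = InO baseline + additional FSC structures.
FSC_W = {width: INO_W[width] + FSC_EXTRA_W[width] for width in (2, 3)}

fsc_growth = pct_increase(FSC_W[2], FSC_W[3])
ooo_growth = pct_increase(OOO_W[2], OOO_W[3])

print(f"FSC: {fsc_growth:.1f}%  OoO: {ooo_growth:.1f}%")
# prints "FSC: 4.7%  OoO: 17.3%"
```

Under that assumption, the result agrees with the 4.7% and 17.3% increases reported in the statement.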

“…The below evaluation primarily focuses on 2-wide versus 3-wide performance results as well as contrasting FSC performance across the different benchmark suites. The original conference paper [21] provides additional analyses including CPI stack analysis for individual workloads, lane distribution statistics, ILP and MHP analysis, and various sensitivity analyses to lane configuration, lane size, number of waiting cycles for HL re-direction, and memory disambiguation.…”

Section: Discussion (mentioning)

confidence: 99%

“…2-wide setup, see our conference paper [21], the additional FSC structures amount to 0.06 mm², or a 1.0% increase over a 5.98 mm² InO core. As we scale the superscalar pipeline width from 2 to 3, FSC incurs a 3.9% increase in chip core area.…”

Section: Hardware Overhead (mentioning)

confidence: 91%

The Forward Slice Core: A High-Performance, Yet Low-Complexity Microarchitecture

Lakshminarasimhan, Naithani, Feliu, et al. 2022

ACM Trans. Archit. Code Optim.

Self Cite

Superscalar out-of-order cores deliver high performance at the cost of increased complexity and power budget. In-order cores, in contrast, are less complex and have a smaller power budget, but offer low performance. A processor architecture should ideally provide high performance in a power- and cost-efficient manner. Recently proposed slice-out-of-order (sOoO) cores identify backward slices of memory operations which they execute out-of-order with respect to the rest of the dynamic instruction stream for increased instruction-level and memory-hierarchy parallelism. Unfortunately, constructing backward slices is imprecise and hardware-inefficient, leaving performance on the table. In this article, we propose Forward Slice Core (FSC), a novel core microarchitecture that builds on a stall-on-use in-order core and extracts more instruction-level and memory-hierarchy parallelism than slice-out-of-order cores. FSC does so by identifying and steering forward slices (rather than backward slices) to dedicated in-order FIFO queues. Moreover, FSC puts load-consumers that depend on L1 D-cache misses on the side to enable younger independent load-consumers to execute faster. Finally, FSC eliminates the need for dynamic memory disambiguation by replicating store-address instructions across queues. Considering 3-wide pipeline configurations, we find that FSC improves performance by 27.1%, 21.1%, and 14.6% on average compared to Freeway, the state-of-the-art sOoO core, across SPEC CPU2017, GAP, and DaCapo, respectively, while at the same time incurring reduced hardware complexity. Compared to an OoO core, FSC reduces power consumption by 61.3% and chip area by 47%, providing a microarchitecture with high performance at low complexity.
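
To make the notion of a forward slice in this abstract concrete, the following is a minimal software sketch, not the hardware mechanism described in the paper. It assumes that a forward slice is the set of instructions that depend, directly or transitively through registers, on the result of a load. The `Instr` type, the `forward_slice` function, and the register names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record used only for this illustration."""
    name: str
    dest: Optional[str]
    srcs: Tuple[str, ...]
    is_load: bool = False


def forward_slice(program: List[Instr]) -> List[str]:
    """Return names of instructions that consume a load-dependent register.

    A register is 'tainted' if it currently holds a load result or a value
    computed from one. Overwriting it with an independent value clears it.
    """
    tainted = set()
    in_slice = []
    for ins in program:
        depends = any(src in tainted for src in ins.srcs)
        if depends:
            in_slice.append(ins.name)
        if ins.dest is not None:
            if ins.is_load or depends:
                tainted.add(ins.dest)
            else:
                tainted.discard(ins.dest)
    return in_slice


program = [
    Instr("i0", "r1", ("r0",), is_load=True),  # r1 <- load [r0]
    Instr("i1", "r2", ("r1", "r3")),           # consumes the load result
    Instr("i2", "r4", ("r5", "r6")),           # independent of any load
    Instr("i3", "r1", ("r5", "r6")),           # overwrites r1, clearing its taint
    Instr("i4", "r7", ("r1", "r2")),           # r2 is still load-dependent
]

print(forward_slice(program))  # prints ['i1', 'i4']
```

In this toy model, the instructions in the returned list are the ones that would be steered away from the main in-order stream, while independent instructions such as `i2` and `i3` could proceed without waiting on the load.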

ERrOR: Improving Performance and Fault Tolerance Using Early Execution

Choudhary, Patel, Singh 2023

2023 IEEE 29th International Symposium on on-Line Testing and Robust System Design (IOLTS)

Vector Runahead

Naithani, Ainsworth, Jones, et al. 2021

2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)

Self Cite

The memory wall places a significant limit on performance for many modern workloads. These applications feature complex chains of dependent, indirect memory accesses, which cannot be picked up by even the most advanced microarchitectural prefetchers. The result is that current out-of-order superscalar processors spend the majority of their time stalled. While it is possible to build special-purpose architectures to exploit the fundamental memory-level parallelism, a microarchitectural technique to automatically improve their performance in conventional processors has remained elusive. Runahead execution is a tempting proposition for hiding latency in program execution. However, to achieve high memory-level parallelism, a standard runahead execution skips ahead of cache misses. In modern workloads, this means it only prefetches the first cache-missing load in each dependent chain. We argue that this is not a fundamental limitation. If runahead were instead to stall on cache misses to generate dependent chain loads, then it could regain performance if it could stall on many at once. With this insight, we present Vector Runahead, a technique that prefetches entire load chains and speculatively reorders scalar operations from multiple loop iterations into vector format to bring in many independent loads at once. Vectorization of the runahead instruction stream increases the effective fetch/decode bandwidth with reduced resource requirements, to achieve high degrees of memory-level parallelism at a much faster rate. Across a variety of memory-latency-bound indirect workloads, Vector Runahead achieves a 1.79× performance speedup on a large out-of-order superscalar system, significantly improving on state-of-the-art techniques.
