SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores

Tran, Kim-Anh; Jimborean, Alexandra; Carlson, Trevor E.; Koukos, Konstantinos; Själander, Magnus

doi:10.1145/3192366.3192393

Cited by 10 publications

(2 citation statements)

References 79 publications

(71 reference statements)


“…Compile-time application analysis is also used to categorize and prioritize instructions by predicting the critical path of the execution [34], [35]. Solutions with good balance between performance and energy efficiency use modified hardware equipped with the appropriate compile-time support to statically reorder instructions in advance [36]-[42]. But, unlike our work, these solutions require modification to the application itself and do not provide backward compatibility for deployed applications.…”

Section: Related Work (mentioning)

confidence: 99%

Efficient Instruction Scheduling using Real-time Load Delay Tracking

Diavastos, Carlson

2021

Preprint

Self Cite


Many hardware structures in today's high-performance out-of-order processors do not scale efficiently. To address this, different solutions have been proposed that build execution schedules in an energy-efficient manner. Issue time prediction processors are one such solution: they use dataflow dependencies and predefined instruction latencies to predict the issue times of repeated instructions. In this work, we aim to improve their accuracy, and consequently their performance, in an energy-efficient way. We accomplish this by taking advantage of two key observations. First, memory accesses often take longer to arrive than the static, predefined access latency that is used to describe these systems. This is due to contention in the memory hierarchy and variability in DRAM access times. Using this observed delay is important for optimizing a processor's execution schedule, as previous works that use predefined information demonstrate performance losses as high as 25%. Second, we find that these memory access delays often repeat across iterations of the same code. This, in turn, allows us to predict the arrival time of these accesses.

In this work, we introduce a new processor microarchitecture that replaces a complex reservation-station-based scheduler with an efficient, scalable alternative. Our proposed scheduling technique tracks real-time delays of loads to accurately predict instruction issue times, and uses a reordering mechanism to prioritize instructions based on that prediction, achieving close-to-out-of-order processor performance. To accomplish this in an energy-efficient manner, we introduce: (1) an instruction delay learning mechanism that monitors repeated load instructions and learns their latest delay, (2) an issue time predictor that uses learned delays and data-flow dependencies to predict instruction issue times, and (3) priority queues that reorder instructions based on their issue time prediction. Together, our processor achieves 86.2% of the performance of a traditional out-of-order processor, higher than previous efficient scheduler proposals, while still consuming 30% less power.
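The three mechanisms named in this abstract (delay learning, issue time prediction, priority-based reordering) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the `Instr` record, the `DelayTable` class, the function names, and the latency constants (4 cycles for a load, 1 cycle otherwise) are all hypothetical and chosen for illustration.

```python
import heapq
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical static latencies (cycles); real values are design-specific.
STATIC_LOAD_LATENCY = 4
ALU_LATENCY = 1


@dataclass(frozen=True)
class Instr:
    pc: int                      # program counter, used as the learning key
    dst: Optional[str]           # destination register, if any
    srcs: Tuple[str, ...]        # source registers
    is_load: bool = False


class DelayTable:
    """Remembers the latest observed delay of each load, keyed by PC."""

    def __init__(self):
        self._delay = {}

    def learn(self, pc, observed_cycles):
        self._delay[pc] = observed_cycles

    def latency(self, instr):
        if instr.is_load:
            # Fall back to the static latency until a delay has been learned.
            return self._delay.get(instr.pc, STATIC_LOAD_LATENCY)
        return ALU_LATENCY


def predict_issue_times(instrs, table):
    """Predict each instruction's issue cycle from data-flow dependencies."""
    ready = {}   # register -> cycle at which its value becomes available
    times = []
    for ins in instrs:
        issue = max((ready.get(s, 0) for s in ins.srcs), default=0)
        if ins.dst is not None:
            ready[ins.dst] = issue + table.latency(ins)
        times.append(issue)
    return times


def issue_order(instrs, table):
    """Order instructions by predicted issue time using a priority queue."""
    heap = [(t, i) for i, t in enumerate(predict_issue_times(instrs, table))]
    heapq.heapify(heap)
    order = []
    while heap:
        _, i = heapq.heappop(heap)
        order.append(instrs[i].pc)
    return order


program = [
    Instr(pc=0x10, dst="r1", srcs=("r0",), is_load=True),  # r1 = load [r0]
    Instr(pc=0x14, dst="r2", srcs=("r1",)),                # depends on the load
    Instr(pc=0x18, dst="r3", srcs=("r0",)),                # independent
]
table = DelayTable()
static_times = predict_issue_times(program, table)
table.learn(0x10, 20)   # the load was observed to take 20 cycles
learned_times = predict_issue_times(program, table)
order = issue_order(program, table)
```

In this toy run the dependent instruction's predicted issue time moves from the static estimate to the learned one, while the independent instruction is scheduled ahead of it. Ties are broken by program order here, which is a choice of this sketch rather than something the abstract specifies.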



“…Hardware-software cooperative techniques involve new instructions, advanced profiling, or binary translation for separating critical instruction slices, see for example DAE [57], speculative slice execution [71], flea-flicker multi-pass pipelining [7], braid processing [65], and OUTRIDER [16]. Instruction slices have also been exploited to improve the energy-efficiency of both in-order and OoO processors [11,35,54,63,64]. PRE does not require a helper thread, hardware context, or support from software for converting demand misses into hits.…”

Section: Related Work (mentioning)

confidence: 99%

Precise Runahead Execution

Naithani, Feliu, Adileh, et al. 2020

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)


Runahead execution improves processor performance by accurately prefetching long-latency memory accesses. When a long-latency load causes the instruction window to fill up and halt the pipeline, the processor enters runahead mode and keeps speculatively executing code to trigger accurate prefetches. A recent improvement tracks the chain of instructions that leads to the long-latency load, stores it in a runahead buffer, and executes only this chain during runahead execution, with the purpose of generating more prefetch requests. Unfortunately, all prior runahead proposals have shortcomings that limit performance and energy efficiency because they release processor state when entering runahead mode and then need to re-fill the pipeline to restart normal operation. Moreover, the runahead buffer limits prefetch coverage by tracking only a single chain of instructions that leads to the same long-latency load.

We propose precise runahead execution (PRE), which builds on the key observation that when entering runahead mode, the processor has enough issue queue and physical register file resources to speculatively execute instructions. This mitigates the need to release and re-fill processor state in the ROB, issue queue, and physical register file. In addition, PRE pre-executes only those instructions in runahead mode that lead to full-window stalls, using a novel register renaming mechanism to quickly free physical registers in runahead mode, further improving efficiency and effectiveness. Finally, PRE optionally buffers decoded runahead micro-ops in the front-end to save energy. Our experimental evaluation using a set of memory-intensive applications shows that PRE achieves an additional 18.2% performance improvement over the recent runahead proposals while at the same time reducing energy consumption by 6.8%.
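The "chain of instructions that leads to the long-latency load" mentioned in this abstract is essentially a backward dependence slice. The following is a minimal software sketch of that idea only, assuming a trace of `(dst, srcs)` register tuples in program order; the function name `backward_slice` and the trace encoding are hypothetical and do not reflect how the runahead buffer or PRE identify the chain in hardware.

```python
def backward_slice(trace, load_index):
    """Return indices (in program order) of the instructions that the
    instruction at load_index transitively depends on, including itself.

    trace: list of (dst, srcs) tuples, where dst is a register name (or None)
    and srcs is a tuple of register names.
    """
    needed = set(trace[load_index][1])   # registers still awaiting a producer
    chain = [load_index]
    # Walk backwards; the first producer found for a register is the most recent.
    for idx in range(load_index - 1, -1, -1):
        dst, srcs = trace[idx]
        if dst in needed:
            needed.discard(dst)
            needed.update(srcs)
            chain.append(idx)
    chain.reverse()
    return chain


trace = [
    ("r1", ("r0",)),   # 0: r1 = r0 + 8
    ("r5", ("r6",)),   # 1: unrelated work
    ("r2", ("r1",)),   # 2: r2 = load [r1]
    ("r3", ("r2",)),   # 3: r3 = load [r2]  (the long-latency load)
]
slice_indices = backward_slice(trace, 3)
```

Only the instructions in the returned slice would need to be re-executed to regenerate the load's address; the unrelated instruction at index 1 is skipped.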
