Speculation techniques for improving load related instruction scheduling

Yoaz, Adi; Erez, Mattan; Ronen, Ronny; Jourdan, S.

doi:10.1109/isca.1999.765938

Cited by 76 publications

(78 citation statements)

References 19 publications

“…Thus, we do not require such costly dynamic techniques. In this paper, we show that a simple ld/st vectorization is useful (in the context of scientific loops) to solve the same problems tackled in [1,5,7,4]. Coupling our costless software optimization technique with the actual imprecise memory disambiguation mechanisms is less expensive than pure hardware methods, giving nonetheless good performance improvement.…”

Section: Related Work (mentioning)

confidence: 88%

“…Even if we do not avoid all situations of bad relative array offsets in all hardware platforms, and thus few memory disambiguation penalties persist, we showed that we still get high speedups in all experimented processors (up to 54% of performance gain). This simple software solution coupled with imprecise memory disambiguation mechanisms are less expensive than sophisticated totally hardware approaches such as [1,6,5,7,4].…”

Section: Discussion (mentioning)

confidence: 99%

“…Another similar hardware improvement has been proposed by Sethumadhavan et al in [5]. A speculative technique for memory dependence prediction has been proposed by Yoaz et al in [7]: the hardware tries to predict colliding loads, relying on the fact that such loads tend to repeat their delinquent behavior. Another speculative technique devoted to superscalar processors was presented by S. Onder [4].…”

Section: Related Work (mentioning)

confidence: 99%
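
The idea in the last statement, that colliding loads tend to repeat their behavior, can be illustrated with a small history table. The following is a minimal sketch and not the design from Yoaz et al.: the class name `CollisionPredictor`, the table size, and the 2-bit saturating counters are all assumptions chosen for illustration.

```python
class CollisionPredictor:
    """Hypothetical PC-indexed table of 2-bit saturating counters."""

    def __init__(self, entries=1024):
        self.entries = entries              # assumed table size
        self.counters = [0] * entries       # each counter is in 0..3

    def _index(self, load_pc):
        # Simple modulo hash of the load's program counter (an assumption).
        return load_pc % self.entries

    def predict_colliding(self, load_pc):
        # Counter values 2 and 3 mean "this load collided recently".
        return self.counters[self._index(load_pc)] >= 2

    def update(self, load_pc, collided):
        # Train on the observed outcome once the load has resolved.
        i = self._index(load_pc)
        if collided:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

A scheduler built on such a table would presumably hold back loads predicted as colliding and let the others issue ahead of older stores.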

Improving load/store queues usage in scientific computing

Lemuet, Jalby, Touati

2004

International Conference on Parallel Processing, 2004. ICPP 2004.

Memory disambiguation mechanisms, coupled with load/store queues in out-of-order processors, are crucial to increase instruction level parallelism (ILP), especially for memory-bound scientific codes. Designing ideal memory disambiguation mechanisms is too complex because it would require precise address bits comparators; thus, modern microprocessors implement simplified and imprecise ones that perform only partial address comparisons. In this paper, we study the impact of such simplifications on the sustained performance of some real processors such as Alpha 21264, Power 4 and Itanium 2. Despite all the advanced features of these processors, we demonstrate in this article that memory address disambiguation mechanisms can cause significant performance loss. We demonstrate that, even if data are located in low cache levels and enough ILP exists, execution can be up to 21 times slower if no care is taken on the order of accessing independent memory addresses. Instead of proposing a hardware solution to improve load/store queues, as done in [G. Chrysos et al., (1998), S. Sethumadhavan et al., (2003), I. Park et al., (2003), A. Yoaz et al., (1999), S. Onder (2002)], we show that a software (compilation) technique is possible. Such a solution is based on the classical (and robust) ld/st vectorization. Our experiments highlight the effectiveness of such a method on BLAS 1 codes that are representative of vector scientific loops.
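
To make the cost of partial address comparison concrete, here is a minimal sketch of how comparing only the low-order address bits can report a dependence between two independent accesses. The 12-bit comparator width and the function names are assumptions for illustration; the abstract does not give the widths used by the processors it studies, and access sizes are ignored here.

```python
PARTIAL_BITS = 12  # assumed comparator width; real processors differ

def may_alias_partial(load_addr, store_addr, bits=PARTIAL_BITS):
    """Imprecise check: compare only the low `bits` bits of each address."""
    mask = (1 << bits) - 1
    return (load_addr & mask) == (store_addr & mask)

def truly_alias(load_addr, store_addr):
    """Precise check: compare the full addresses."""
    return load_addr == store_addr

# Two arrays whose base addresses differ by a multiple of 2**12:
x_elem = 0x10000
y_elem = 0x20000
false_dependence = (may_alias_partial(x_elem, y_elem)
                    and not truly_alias(x_elem, y_elem))
```

In this sketch `false_dependence` is true, so a load from one array would be treated as conflicting with a store to the other even though the addresses differ. Changing the relative array offsets or the access order, as the software technique above does, avoids such matches.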

“…Chkmiss is an informing memory operation [30] which provides early warning on upcoming stalling code, essential for a timely control flow change to the alternative execution path. Lightweight techniques to predict misses in the cache hierarchy have been proposed [77,80] and refined to detect a last-level cache (LLC) miss in one cycle [62]. We encode the presence of an LLC cache line in the TLB entries, using a simple bitmap (e.g., 64 bits for 64-byte cache lines in a 4kB page).…”

Section: Chkmiss (mentioning)

confidence: 99%
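
The bitmap arithmetic in the statement above (a 4 kB page holds 4096 / 64 = 64 cache lines, so one bit per line fits in 64 bits) can be sketched as follows. This is only an illustration of the encoding: the function names are hypothetical, and how the bits are kept in sync with the LLC is not described in the quoted text.

```python
PAGE_SIZE = 4096                          # 4 kB page, as in the statement
LINE_SIZE = 64                            # 64-byte cache lines
LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE   # 64 lines, one bit each

def line_index(vaddr):
    """Index of the cache line within its page (0..63)."""
    return (vaddr % PAGE_SIZE) // LINE_SIZE

def mark_present(bitmap, vaddr):
    """Set the bit for the line containing vaddr."""
    return bitmap | (1 << line_index(vaddr))

def mark_evicted(bitmap, vaddr):
    """Clear the bit for the line containing vaddr."""
    return bitmap & ~(1 << line_index(vaddr))

def is_present(bitmap, vaddr):
    """A set bit means the line is expected to hit in the LLC."""
    return ((bitmap >> line_index(vaddr)) & 1) == 1
```

Under these assumptions, a check like `is_present` is a single shift and mask on data already held in the TLB entry, which is consistent with the lightweight miss detection the statement describes.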

SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores

et al. 2018

Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve both performance and high energy efficiency. In this paper, we address one of the main performance bottlenecks, last-level cache misses, through a software-hardware co-design. Our approach is able to hide memory latency and attain increased memory and instruction level parallelism by orchestrating a non-speculative, execute-ahead paradigm in software (SWOOP). While out-of-order (OoO) architectures attempt to hide memory latency by dynamically reordering instructions, they do so through expensive, power-hungry, speculative mechanisms. We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching. In contrast to previous approaches we do not rely on either software or hardware speculation that can be detrimental to efficiency. Our SWOOP compiler is enhanced with lightweight architectural support, thus being able to transform applications that include highly complex control-flow and indirect memory accesses. The effectiveness of our software-hardware co-design is proven on the most limited but energy-efficient microarchitectures, non-speculative, in-order execution (InO) cores, which rely entirely on compile-time instruction scheduling.

“…We introduce a scalable SQ design that implements store-load forwarding without associative search. As each dynamic load is renamed, we use store-load dependence prediction [3,9,22] to predict the single in-flight store from which that load is most likely to forward. As illustrated in Figure 1(b), when a load executes, it accesses the SQ only at this predicted index, not associatively.…”

Section: Introduction (mentioning)

confidence: 99%

Scalable Store-Load Forwarding via Store Queue Index Prediction

Sha, Martin, Roth

38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05)

Conventional processors use a fully-associative store queue (SQ) to implement store-load forwarding. Associative search latency does not scale well to capacities and bandwidths required by wide-issue, large window processors. In this work, we improve SQ scalability by implementing store-load forwarding using speculative indexed access rather than associative search. Our design uses prediction to identify the single SQ entry from which each dynamic load is most likely to forward. When a load executes, it either obtains its value from the predicted SQ entry (if the address of the entry matches the load address) or the data cache (otherwise). A forwarding mis-prediction (detected by pre-commit filtered load re-execution) results in a pipeline flush. SQ index prediction is generally accurate, but for some loads it cannot reliably identify a single SQ entry. To avoid flushes on these difficult loads while keeping the single-SQ-access-per-load invariant, a second predictor delays difficult loads until all but the youngest of their "candidate" stores have committed. Our predictors are inspired by store-load dependence predictors for load scheduling (Store Sets and the Exclusive Collision Predictor) and unify load scheduling and forwarding. Experiments on the SPEC2000 and MediaBench benchmarks show that on an 8-way issue processor with a 512-entry reorder buffer, our technique performs within 3.3% of an ideal associative SQ (same latency as the data cache) and either matches or exceeds the performance of a realistic associative SQ (slower than data cache) on 31 of 47 programs.
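
The indexed access described in this abstract can be sketched in a few lines. This is a simplified model under stated assumptions, not the authors' implementation: the store queue is a hypothetical list of `(address, value)` tuples, the data cache is a plain dictionary, and the pre-commit re-execution that detects forwarding mis-predictions is left out.

```python
def execute_load(load_addr, predicted_index, store_queue, data_cache):
    """Read one predicted SQ entry; fall back to the data cache on mismatch.

    store_queue: list whose slots hold (addr, value) or None (hypothetical layout).
    predicted_index: SQ slot chosen by a predictor, or None for "no forwarding".
    Returns (value, source) where source names where the value came from.
    """
    if predicted_index is not None and 0 <= predicted_index < len(store_queue):
        entry = store_queue[predicted_index]
        # Single indexed access: only the predicted slot is compared.
        if entry is not None and entry[0] == load_addr:
            return entry[1], "store_queue"
    # No associative search of the other slots; read the cache instead.
    return data_cache.get(load_addr, 0), "data_cache"


sq = [(0x100, 7), None, (0x200, 9)]
cache = {0x100: 1, 0x300: 5}
forwarded = execute_load(0x200, 2, sq, cache)   # predicted slot matches
stale = execute_load(0x100, 2, sq, cache)       # wrong slot predicted
```

In the second call the load reads a stale value from the cache because the matching store sits in a slot that was not predicted. This is the kind of forwarding mis-prediction that, per the abstract, is detected before commit and repaired with a pipeline flush.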
