Two hardware methods for remedying the effects of true data dependences are studied. The first method, dependence speculation, is used to eliminate address generation-load dependences. This is enabled by address prediction, which permits load instructions to proceed speculatively without waiting for their address operands. The second technique, dependence collapsing, is used to eliminate data dependences by combining multiple dependent instructions into a single instruction. The potential of these techniques for improving processor performance is demonstrated via trace-driven simulation. When both techniques are used with maximum issue widths of 4, 8, 16, and 32, the overall speedups in comparison to a base instruction-level parallel machine are 1.20, 1.35, 1.51, and 1.66, respectively. In general, dependence collapsing contributes the majority of the improvement in performance. Under the dependence collapsing model, 29% to 47% of the total number of instructions in a trace may be collapsed. The distance separating the collapsed instructions is nearly always less than 8. Our experimentation also suggests that further performance improvements can be achieved by incorporating mechanisms that increase the address prediction rate.
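The dependence-collapsing idea described above can be illustrated with a toy trace model. This is a hedged sketch only: the `collapse` function, the `(dest, src1, src2)` tuple encoding, and the greedy producer/consumer pairing are illustrative assumptions, not the paper's actual hardware mechanism.

```python
# Toy model of dependence collapsing over an instruction trace.
# Each instruction is encoded as (dest, src1, src2). A producer and a
# dependent consumer within a small window are fused into one
# three-input operation; the abstract reports that the distance between
# collapsed instructions is nearly always less than 8.

WINDOW = 8  # collapsing distance limit


def collapse(trace, window=WINDOW):
    """Greedily pair a producer with the first consumer of its result
    that appears fewer than `window` instructions later.
    Returns the number of collapsed pairs."""
    collapsed = 0
    used = [False] * len(trace)
    for i, (dest, _, _) in enumerate(trace):
        if used[i]:
            continue
        # Look ahead within the window for a consumer of `dest`.
        for j in range(i + 1, min(i + window, len(trace))):
            if used[j]:
                continue
            _, s1, s2 = trace[j]
            if dest in (s1, s2):
                used[i] = used[j] = True
                collapsed += 1
                break
    return collapsed


trace = [("r1", "r2", "r3"),   # r1 = r2 op r3
         ("r4", "r1", "r5"),   # r4 = r1 op r5 -> collapses with previous
         ("r6", "r7", "r8")]   # independent, not collapsed
print(collapse(trace))  # -> 1
```

In this greedy model, the fraction `2 * collapse(trace) / len(trace)` plays the role of the 29%-47% collapsed-instruction fraction reported in the abstract.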
This paper presents the Alpha EV8 conditional branch predictor. The Alpha EV8 microprocessor project, canceled in June 2001 in a late phase of development, envisioned an aggressive 8-wide issue out-of-order superscalar microarchitecture featuring a very deep pipeline and simultaneous multithreading. The performance of such a processor is highly dependent on the accuracy of its branch predictor, and consequently a very large silicon area was devoted to branch prediction on EV8. The Alpha EV8 branch predictor relies on global history and features a total of 352 Kbits. The focus of this paper is on the different trade-offs made to overcome various implementation constraints for the EV8 branch predictor. One such instance is the pipelining of the predictor over two cycles to facilitate the prediction of up to 16 branches per cycle from any two dynamically successive, 8-instruction fetch blocks. This resulted in the use of three-fetch-block-old compressed branch history information for accessing the predictor. Implementation constraints also restricted the composition of the index functions for the predictor and forced the use of only single-ported memory cells. Nevertheless, we show that the Alpha EV8 branch predictor achieves prediction accuracy in the same range as state-of-the-art academic global history branch predictors that do not consider implementation constraints in great detail.
1 Introduction

The Alpha EV8 microprocessor [2] features an 8-wide superscalar, deeply pipelined microarchitecture. With a minimum branch misprediction penalty of 14 cycles, the performance of this microprocessor is highly dependent on branch prediction accuracy. The architecture and technology of the Alpha EV8 are very aggressive, and new challenges were confronted in the design of the branch predictor. This paper presents the Alpha EV8 branch predictor in great detail. The paper expounds on the different constraints faced during the definition of the predictor, and on the various trade-offs that led to the final design. In particular, we elucidate the following: (a) use of a global history branch prediction scheme, (b) choice of the prediction scheme derived from the hybrid skewed branch predictor 2Bc-gskew [19], (c) redefinition of the information vector used for indexing the predictor, which combines compressed branch history and path history, (d) different prediction and hysteresis table sizes: prediction tables and hysteresis tables are accessed at different pipeline stages, and hence can be implemented as physically distinct tables, (e) variable history lengths: the four logical tables in the EV8 predictor are accessed using four different history lengths, (f) guaranteed conflict-free access to the bank-interleaved predictor with single-ported memory cells for up to 16 branch predictions from any two dynamically successive 8-instruction fetch blocks, and (g) careful definition of the index functions for the predictor tables. This work demonstrates that in ...

* This work was done while the authors were with Compaq during 1999.
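The indexing scheme sketched in points (c) and (e), where four logical tables are accessed with different global history lengths, can be illustrated as follows. This is a hedged sketch: the `table_index` hash, the table size, and the `HIST_LENS` values are hypothetical and do not reproduce the actual EV8 index functions, whose exact composition was constrained by the implementation.

```python
# Illustrative global-history index function in the spirit of skewed
# predictors such as 2Bc-gskew: each logical table folds a different
# number of global history bits and combines them with the PC.
# (Hypothetical sketch; not the real EV8 hash functions.)

TABLE_BITS = 12  # hypothetical table size: 2^12 entries per table


def table_index(pc, ghist, hist_len, table_bits=TABLE_BITS):
    """Fold the low `hist_len` bits of the global history register
    into `table_bits` bits by repeated XOR, then XOR with the PC."""
    mask = (1 << table_bits) - 1
    h = ghist & ((1 << hist_len) - 1)
    folded = 0
    while h:
        folded ^= h & mask
        h >>= table_bits
    return (pc ^ folded) & mask


# Four logical tables indexed with four different history lengths,
# as in the EV8 predictor (the lengths below are made up).
HIST_LENS = [0, 13, 17, 21]
indices = [table_index(0x4A3C, 0b1011011101101, L) for L in HIST_LENS]
```

Because each table sees a different history length, the same (PC, history) pair maps to different entries in each table, which is what gives the skewed organization its resistance to aliasing.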
Temperature has become an important constraint in high-performance processors, especially multicores. Thread migration will be essential to exploit the full potential of future thermally constrained multicores. We propose and study a thread migration method that maximizes performance under a temperature constraint, while minimizing the number of migrations and ensuring fairness between threads. We show that thread migration brings important performance gains and that it is most effective during the first tens of seconds following a decrease of the number of running threads.
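A minimal sketch of the kind of temperature-triggered migration policy described above is given below. Everything here is an assumption for illustration: the `plan_migrations` function, the fixed `TEMP_LIMIT`, and the greedy hottest-to-coolest pairing are not the paper's actual method, which also balances migration count and inter-thread fairness.

```python
# Hedged sketch of a temperature-triggered thread migration policy:
# when a core exceeds a thermal limit, move its thread to the coolest
# idle core, keeping the number of migrations small by migrating only
# the cores that are actually over the limit.

TEMP_LIMIT = 80.0  # hypothetical thermal threshold in degrees Celsius


def plan_migrations(core_temps, running):
    """core_temps: {core_id: temperature}; running: {core_id: thread or None}.
    Returns a list of (thread, from_core, to_core) migrations."""
    moves = []
    # Idle cores, coolest first, available as migration targets.
    cool_cores = sorted((t, c) for c, t in core_temps.items()
                        if running.get(c) is None)
    # Visit occupied cores hottest first.
    for core, temp in sorted(core_temps.items(), key=lambda x: -x[1]):
        thread = running.get(core)
        if thread is None or temp <= TEMP_LIMIT or not cool_cores:
            continue
        _, target = cool_cores.pop(0)
        moves.append((thread, core, target))
    return moves


temps = {0: 92.0, 1: 55.0, 2: 70.0, 3: 60.0}
threads = {0: "T0", 1: None, 2: "T2", 3: None}
print(plan_migrations(temps, threads))  # -> [('T0', 0, 1)]
```

Idle cores exist exactly in the situation the abstract highlights: after the number of running threads decreases, the spare cores act as cool targets, which is why migration is most effective in the tens of seconds that follow.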