Our program benchmarks and simulations of novel circuits indicate that large-window processors are feasible. Using our redesigned superscalar components, a large-window processor implemented in today's technology can achieve an increase of 10-60% (geometric mean of 31%) in program speed compared to today's processors. The processor operates at clock speeds comparable to today's processors, but achieves significantly higher ILP.To measure the impact of a large window on clock speed, we design and simulate new implementations of the logic components that most limit the critical path of our large-window processor: the schedule logic and the wake-up logic. We use log-depth cyclic segmented prefix (CSP) circuits to reimplement these components. Our layouts and simulations of critical paths through these circuits indicate that our large-window processor could be clocked at frequencies exceeding 500MHz in today's technology. Our commit logic and rename logic can also run at these speeds.To measure the impact of a large window on ILP, we compare two microarchitectures, the first has a 128-instruction window, an 8-wide fetch unit, and 20-wide issue (four integer, branch, multiply, float, and memory units), whereas the second has a 32-instruction window, and a 4-wide fetch unit and is comparable to today's processors. For each, we simulate different window reuse and bypass policies. Our simulations show that the large-window processor achieves significantly higher IPC. This performance increase comes despite the fact that the large-window processor uses a wrap-around window while the small-window processor uses a compressing window, thus effectively increasing its number of outstanding instructions. Furthermore, the large-window processor sometimes pays an extra clock cycle for bypassing.
The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path lengths of many components in existing implementations grow as n 2 where n is the fetch width, the issue width, or the window size. This paper presents a novel implementation, called the Ultrascalar processor, that dramatically reduces the asymptotic critical-path length of a superscalar processor. The processor is implemented by a large collection of ALUs with controllers (together called execution stations) connected together by a network of parallel-prefix tree circuits. A fat-tree network connects an interleaved cache to the execution stations. These networks provide the full functionality of superscalar processors including renaming, out-of-order execution, and speculative execution. The Ultrascalar's criticalpath length due to gate delays is gates = log n. The wire delays and chip size depend on the provided memory bandwidth and the layout. If the provided memory bandwidth is Mn memory operations per clock cycle then, using an H-tree VLSI layout, the critical-path length due to wire delay (speed-of-light delay) is wires = 8 : n 1=2 if Mn is On 1=2, for 0, [optimal] n 1=2 log n if Mn is n 1=2 , and [near optimal]Mn if Mn is n 1=2+ for 0, [optimal] (with M suitably constrained.) The area is the square of the wire delay.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.