Load latency contributes significantly to execution time. Because most cache accesses hit, cache-hit latency becomes an important component of expected load latency. Most modern microprocessors have base+offset addressing loads; thus effective cache-hit latency includes an addition as well as the RAM access. This paper introduces a new technique used in the UltraSPARC III microprocessor, Sum-Addressed Memory (SAM), which performs true addition using the decoder of the RAM array, with very low latency. We compare SAM with other methods for reducing the add part of load latency. These methods include sum-prediction with recovery and bitwise indexing with duplicate-tolerance. The results demonstrate the superior performance of SAM.
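The abstract does not spell out how a decoder can "perform true addition", so the following is only a minimal sketch of one well-known way to do it: a carry-free test of whether base + offset equals a given row index K, evaluated independently for every row. The function names `sam_match` and `sam_decode`, the `width` parameter, and the software loop over rows are assumptions made for illustration; they are not taken from the paper, and the real decoder is a circuit, not a loop.

```python
def sam_match(base: int, offset: int, row: int, width: int) -> bool:
    """Return True if (base + offset) mod 2**width == row, without adding.

    For each bit position, the sum bit can equal the row bit only for one
    specific carry-in. That requirement is compared against the carry-out
    the neighbouring lower bit would produce, so no carry has to ripple.
    """
    mask = (1 << width) - 1
    a, b, k = base & mask, offset & mask, row & mask

    # Carry-in each bit position needs so that its sum bit equals k's bit.
    required_carry_in = a ^ b ^ k

    # Carry-out each bit position produces if its sum bit equals k's bit.
    implied_carry_out = ((a & b) | ((a | b) & ~k)) & mask

    # Carry-out of bit i-1 must equal the carry-in required by bit i.
    # Shifting left by one also forces the carry into bit 0 to be zero.
    return required_carry_in == ((implied_carry_out << 1) & mask)


def sam_decode(base: int, offset: int, width: int) -> list[int]:
    """Model the decoder: list every row whose match test succeeds."""
    return [row for row in range(1 << width)
            if sam_match(base, offset, row, width)]
```

Under these assumptions exactly one row matches for any base and offset, which is what lets the address addition be folded into the row decode instead of preceding it.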
UltraSPARC-III (US-III) is a 64b 800MHz 4-instruction-issue superscalar microprocessor for high-performance desktop workstation, work group server, and enterprise server platforms. On-chip caches include a 64kB 4-way associative data cache (D$), a 32kB 4-way associative instruction cache (I$), a 2kB 4-way associative data prefetch cache (P$), and a 2kB 4-way associative write cache (W$). A 90kB on-chip tag array supports the off-chip 8MB unified second-level cache (E$) [1]. The 23M-transistor chip in a 0.15µm, 7-layer metal process consumes 60W from a 1.5V supply [2]. The architecture is driven by performance, scalability and compatibility. The design is SPARC V9-compliant, maintaining binary compatibility with all 10,000+ existing SPARC applications [3]. Scalability in two directions is required: 1) taking full entitlement of future process improvements to scale clock rate and 2) off-chip interfaces that enable scaling multi-processor (MP) systems to 1000+ processors. Performance can be achieved in multiple ways. Clock rate is prioritized over IPC improvements, setting a goal of 1.5x the clock rate compared to the previous designs in the same process technology, as well as IPC and compiler improvement goals of 1.15x each, for a doubling of overall performance (1.5 × 1.15 × 1.15 ≈ 1.98) [4]. This requires different approaches to the micro-architecture, as well as more aggressive circuit and physical design, compared to previous UltraSPARC processors [5,6]. Eight static gates are budgeted for each of the 14 pipeline stages, vs. 9 stages and 20 static gates/stage on US-I/II. Timing is more critical in the instruction fetch, integer execution, and floating-point (FP) areas, where dynamic logic is used liberally, than in the memory system.