POWER5 offers significantly increased performance over previous POWER designs by incorporating simultaneous multithreading, an enhanced memory subsystem, and extensive RAS and power management support. The 276M-transistor processor is implemented in 130nm silicon-on-insulator technology with 8 levels of Cu metallization and operates at >1.5 GHz.

General Terms: Design

Keywords: POWER5, Microprocessor Design, Simultaneous Multi-threading (SMT), Temperature Sensor, Power Reduction, Clock Gating

POWER5™ is the next generation of IBM's POWER microprocessors. This design, shown below in Figure 1, sets a new standard of industry-leading server performance by incorporating simultaneous multithreading (SMT), an enhanced distributed switch and memory subsystem supporting 1- to 64-way SMP, and extensive RAS support. First-pass hardware using IBM's 130nm silicon-on-insulator technology operates above 1.5 GHz at 1.3V.

POWER5's dual-threaded SMT [1] creates up to two virtual processors per core, improving execution unit utilization and masking memory latency. Although a simplistic SMT implementation promised a ~20% performance improvement, resizing critical microarchitectural resources almost doubles the SMT performance benefit in many cases, at a 24% area cost per core.

The two SMT cores interface with an enhanced memory subsystem. The cache hierarchy includes a larger (1.9MB) L2 cache, reduced L3 latency, and a larger (36MB) L3 cache located on a custom DRAM companion chip. The new on-chip main memory controller improves latency and the enhanced interconnect fabric extends SMP scalability. Figure 2 depicts the microarchitectural changes introduced with the POWER5 chip.
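As a rough back-of-the-envelope reading of the SMT figures quoted above, the sketch below compares the performance gain against the area cost. It assumes that "almost doubles" can be approximated as twice the ~20% baseline gain and that the 24% area cost is relative to a core without SMT; neither figure is stated exactly in the text, and the function name is purely illustrative.

```python
BASE_SMT_GAIN = 0.20                     # ~20% from a simplistic SMT design (from the text)
ASSUMED_TUNED_GAIN = 2 * BASE_SMT_GAIN   # assumption: "almost doubles" taken as an upper bound of 2x
AREA_COST = 0.24                         # 24% area cost per core (from the text)


def perf_per_area(gain, area_cost):
    """Relative performance divided by relative area, versus a non-SMT core."""
    return (1.0 + gain) / (1.0 + area_cost)


# Ratio > 1 means performance grows faster than area under these assumptions.
tuned_ratio = perf_per_area(ASSUMED_TUNED_GAIN, AREA_COST)
print(f"performance per unit area vs. non-SMT core: {tuned_ratio:.2f}")
```

Under these assumptions the ratio stays above 1, which is consistent with the abstract presenting the area cost as worthwhile; the actual benefit varies by workload ("in many cases").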
The POWER4 chip, functioning in the laboratory at frequencies >1GHz, contains two independent processor cores, a shared L2, an L3 directory and all of the logic needed to form large SMPs. The chip, containing over 170M transistors, is fabricated using a 0.18µm CMOS SOI technology with 7-layer copper metallization. The physical design challenges for this chip are to guarantee functionality of all circuits, meet cycle time goals, check complex ground rules, verify that the transistors implement the VHDL properly, and meet test, power, and clock-distribution requirements on an aggressive schedule with a design team at multiple geographically-separated sites.

Each POWER4 core [1] is an out-of-order superscalar design containing an instruction fetch unit with its 64kB L1 instruction cache, an instruction decode unit, two fixed-point and two floating-point execution units, dual load store execution units with a dual-ported 32kB L1 data cache, a branch execution unit, an execution unit to perform logical operations on the condition register, and a sequencing unit to manage instructions in flight. Instructions can be issued to each execution unit every cycle. Up to 8 data and 3 instruction cache misses are supported. In excess of 200 instructions can be in various stages of execution. The two cores share an 8-way set-associative unified L2 organized as 3 independent cache controllers. In aggregate, 12 outstanding L2 misses can be supported by the L2. Figure 15.2.1 shows an 8-way module, with 4 POWER4 chips, that is used as a system building block. A photo of the actual multi-chip module is shown in Figure 15.2.2. All logic necessary to communicate between POWER4 chips is contained on the chip. Multiple modules can be interconnected to form larger SMP systems. POWER4 to POWER4 buses on and off module operate at half the processor speed. Buses to and from an off-chip L3 and memory operate at one-third the processor speed.
Figure 15.2.3 lists the number of objects that are placed on the chip. The chip, with 2208 signal I/O C4s and over 5500 total C4s including power and ground, supports greater than 1Tb/s peak bandwidth.

The chip physical design is built on a hierarchy of transistors, macros, units, microprocessor cores and chip. Three types of macros are employed: custom, SRAM and synthesized. During the high-level design phase, the macros, units, core and chip are all assigned contracts for timing, area, shape, wiring tracks and I/O. Timing and physical design of the chip are done concurrently on all levels of the hierarchy. All major buses are routed early in the design. Figure 15.2.4 shows the floorplanned buses. As the design progresses, contracts are modified to reflect the actual design. Significant design constraints include maintaining a slew rate of <300ps on all transitions, with a wire signal delay of approximately 100ps/mm. These constraints require more than 70k buffers/inverters to be inserted. In the final months of the design, turn-around-time from entering design changes to a chip level timing run is <1 day. ...
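To give a feel for the wire-delay constraint quoted above, the following sketch converts the approximate 100ps/mm figure into a delay and a fraction of a clock cycle. The wire lengths and the 1 GHz clock used in the example are illustrative assumptions (the text only says the chip runs at >1GHz), and the helper names are hypothetical.

```python
WIRE_DELAY_PS_PER_MM = 100.0  # approximate buffered-wire delay (from the text)


def wire_delay_ps(length_mm):
    """Signal delay along a wire of the given length, in picoseconds."""
    return WIRE_DELAY_PS_PER_MM * length_mm


def cycles_for_wire(length_mm, clock_ghz):
    """Wire delay expressed as a fraction of one clock cycle."""
    cycle_ps = 1000.0 / clock_ghz  # 1 GHz corresponds to a 1000 ps cycle
    return wire_delay_ps(length_mm) / cycle_ps


# Illustrative lengths only; actual bus lengths are not given in the text.
for length_mm in (2.0, 5.0, 10.0):
    print(length_mm, "mm:", wire_delay_ps(length_mm), "ps,",
          cycles_for_wire(length_mm, clock_ghz=1.0), "cycles at an assumed 1 GHz")
```

This ignores latch overhead and logic delay, so it only bounds how far a signal can travel in a cycle; it does suggest why major buses are floorplanned early in the design.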
The POWER5 microprocessor is implemented in 130nm technology and is a larger chip than its predecessor with increased clocking and timing challenges. Clock-gating is used to reduce average power but this causes increased power supply noise. To help characterize this, on-chip measurement circuits (called Skitter circuits) are used to measure timing uncertainty from the combined effects of PLL jitter, clock distribution skew and jitter, and power-grid noise. The Skitter circuits are placed at three locations on the chip and allow measurement of timing variations while running arbitrary functional test patterns or applications. Placement and interconnect delays between the cross-coupled Skitters are shown in Fig. 19.7.1.

Clock distribution latency is not scaling with microprocessor cycle time. On this chip the clock distribution latency from a single PLL to approximately 15,000 clock pins was ~1.5 cycles at 2 GHz. Even with a highly optimized clock distribution [1] the global clock latency is sensitive to power supply noise. As a result, the jitter is now mostly from the clock distribution rather than the PLL in high performance microprocessors.

A complicated interaction exists among the power supply noise effects on both clock distribution and logic circuit delays. It is valuable to measure the total timing uncertainty rather than separately measuring the variations in logic delay, clock period, and skew, since these effects are not independent. Simulations show that the total effect is less than a simple sum.

The Skitter circuit contains a register-tapped delay line with 129 low fan-out inverters, each with a nominal delay of 8ps and tapped with a sampling latch (Fig. 19.7.2). The latch is a standard master-slave flip-flop, with clock c1 for the master and c2 for the slave. A multiplexer selects one of four inputs to send to the delay chain. The four inputs are different versions of local or distant clock signals.
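The tapped delay line described above can be illustrated with a small behavioral model. This is a minimal sketch, not the actual circuit: it assumes an ideal 50% duty-cycle clock at 2 GHz (500 ps period, borrowed from the clock-latency figure in the text), uniform 8 ps taps, and tap values with the per-stage inversion already undone. All function names are hypothetical.

```python
N_TAPS = 129          # inverters in the delay line (from the text)
TAP_DELAY_PS = 8.0    # nominal delay per inverter (from the text)


def clock_level(t_ps, period_ps):
    """Ideal 50%-duty clock: high during the first half of each period."""
    return 1 if (t_ps % period_ps) < period_ps / 2 else 0


def sample_delay_line(period_ps, phase_ps=0.0):
    """Tap values captured at one sampling instant.

    Tap i holds the clock as it was (i + 1) * TAP_DELAY_PS earlier.
    Inversion by each stage is assumed to be already undone.
    """
    return [clock_level(phase_ps - (i + 1) * TAP_DELAY_PS, period_ps)
            for i in range(N_TAPS)]


def find_edges(taps):
    """Indices i where tap i and tap i + 1 differ, i.e. a captured clock edge."""
    return [i for i in range(len(taps) - 1) if taps[i] != taps[i + 1]]


def estimate_period_ps(edges):
    """Period estimate from the spacing of alternate (same-polarity) edges."""
    return (edges[2] - edges[0]) * TAP_DELAY_PS


window_ps = N_TAPS * TAP_DELAY_PS            # total time span the line can observe
base = find_edges(sample_delay_line(500.0))  # 500 ps period, i.e. 2 GHz (assumed)
shifted = find_edges(sample_delay_line(500.0, phase_ps=16.0))  # 16 ps timing shift
print("window:", window_ps, "ps")
print("edges:", base, "-> after 16 ps shift:", shifted)
print("estimated period:", estimate_period_ps(base), "ps")
```

In this model the 129 taps span 1032 ps, so about two cycles of a 2 GHz clock fit in the line, and a 16 ps timing shift moves each captured edge by two taps. The period estimate is quantized to the 8 ps tap delay, which sets the measurement resolution.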
The sampling latches are identical to standard latches and are clocked normally. If a local clock is sent to the delay chain, the sampling latches record how many delay-chain inverters the clock edges travel through before being latched. If