A six-issue, four-fetch, out-of-order execution, 600MHz Alpha microprocessor achieves an estimated 40 SPECint95, 60 SPECfp95 and 1800MB/s on McCalpin Stream. The 16.7x18.8mm² die contains 15.2M transistors and dissipates an estimated 72W. It is in 2.0V, 6-metal, 0.35µm CMOS with CMP planarization (Table 1) [1]. The chip is in a 587-pin ceramic IPGA with 198 pins for VDD/VSS that includes a CuW heat slug for low thermal resistance between die and detachable heat sink. An on-chip PLL performs frequency multiplication of a differential PECL reference and synchronizes I/O by phase-aligning a CPU clock to the reference. Figure 1 is a detailed floorplan of the chip. Figure 2 depicts a block/pipeline diagram of major sections and functions.

The instruction fetcher (Figure 3) reads four instructions per cycle plus a next-address pointer from a 64kB, 2-way pseudo-set associative, virtual instruction cache. The next-address pointer predicts the address of the subsequent four instructions and indexes the cache in the next cycle. In parallel, a branch predictor resolves the prediction. It contains three tables: a PC-indexed prediction table, a path-indexed prediction table, and a path-indexed table that dynamically chooses one of the former two predictions, based on the success of previous predictions.

Fetched instructions are dispatched to integer/memory (INT/MEM) and floating point (FP) pipelines, issued and executed out of order and retired in order. During dispatch, register specifiers are renamed to eliminate false dependencies by two twelve-port register mappers that dynamically map the architectural registers into a pool of physical registers (80 integer and 72 FP). Resulting map state is retained in an array until the instruction retires. Pre-retire map state is used to generate a list of remaining free physical registers.
Buffered map state is restored when the CPU is redirected following a branch mispredict or exception.

Mapped instructions enter a 20-entry INT/MEM or a 15-entry FP issue queue. The INT/MEM queue arbiter identifies the 4 oldest data-ready instructions. They issue to the integer execution unit (EBOX) and are removed from the INT/MEM queue. Similarly, the FP queue issues the 2 oldest data-ready instructions to the FP execution unit (FBOX) and removes them from the FP queue.

The EBOX (Figure 4) is divided into two clusters, CL0 and CL1; each cluster contains 2 independent execution pipelines surrounding an 80-entry register file. Coherency between the two register file copies is maintained by broadcasting results across intercluster buses. Each of the four pipelines executes and bypasses arithmetic and logical operations in one cycle. Bypassed results between clusters take an additional cycle. The upper pipelines handle branches and shifts; CL0 contains a pipelined multimedia engine (3-cycle latency) and CL1 contains a pipelined multiplier (7-cycle latency). The lower pipelines handle displacement address calculations for memory operations. The FBOX contains 2 independent execution pipelines surrounding a 72-en...
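
The oldest-first, data-ready selection performed by the issue queues above can be illustrated with a short behavioral sketch. This is a software model only, not the arbiter circuit: the names `QueueEntry` and `select_oldest_ready`, the `age` field, and the representation of readiness as a set of physical register numbers are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueueEntry:
    age: int             # dispatch order; a smaller value means an older instruction
    name: str            # label used only for illustration
    sources: tuple = ()  # physical source registers the instruction reads


def select_oldest_ready(queue, ready_regs, width):
    """Issue up to `width` of the oldest entries whose sources are all ready.

    Issued entries are removed from `queue`, mirroring the text's statement
    that issued instructions leave the issue queue.
    """
    ready = [e for e in queue if all(s in ready_regs for s in e.sources)]
    issued = sorted(ready, key=lambda e: e.age)[:width]
    for entry in issued:
        queue.remove(entry)
    return issued


# Hypothetical queue contents; register 3 has not been produced yet.
int_queue = [
    QueueEntry(0, "add", (1, 2)),
    QueueEntry(1, "ld", (3,)),
    QueueEntry(2, "sub", (1,)),
    QueueEntry(3, "br", (2,)),
    QueueEntry(4, "or", (1, 2)),
    QueueEntry(5, "xor", (2,)),
]
issued_now = select_oldest_ready(int_queue, ready_regs={1, 2}, width=4)
```

With a width of 4 (the INT/MEM case), the four oldest ready entries issue and the load waiting on register 3 stays queued together with the younger ready instruction; calling the same function with a width of 2 would model the FP queue.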
The clocking methodology for the 600MHz Alpha microprocessor depicted in Figure 1 allows increased performance goals to be met through multi-level buffering [1]. In addition, power savings are realized through reduced metal usage and conditional clocks. Two distinct analysis methods are required to verify the clock design. One is used for large, globally distributed clocks and the other is applied to small, locally distributed clocks.

Figure 2 shows the clock hierarchy. The clock is generated from an 80-200MHz reference clock multiplied by an on-chip phase-locked loop (PLL) to a nominal frequency of 600MHz. The clock distribution network up to and including the global clock (GCLK) is included in the feedback loop of the PLL to control phase alignment. GCLK is the primary timing reference for the chip. GCLK generation begins at the PLL, and the clock is routed through a high-gain buffer network to a central point on the die. From there the clock is driven through buffered X, H and RC trees as shown in Figure 3 to distributed GCLK drivers located in a windowpane pattern across the chip [2]. The final physical stage of the global clock distribution network is a grid of upper-level low-impedance metal that covers the entire die.

Here ends the resemblance to a traditional approach to microprocessor clock design [3]. To allow circuit designers more options to meet performance and power goals, the design has a hierarchy of clocks beyond GCLK. There are six other global clocks, referred to as box clocks, that drive large grids over their respective execution units: floating point, bus interface, load/store, integer, pads and instruction issue. Smaller, local clocks are generated as needed from any clock, including other local clocks. Designers created local clocks without strict limits on the number, size, or logic function of the local buffers or requirements on the duty cycle of the generated clocks, as long as the race and speed critical path constraints were met.
Freedom, albeit limited, is also permitted in the allowable ranges for buffer beta ratios and clock edge rates. This flexible methodology is advantageous in a variety of ways. For example, clocks are generated with as many as eight buffers after GCLK; adjusting the number of buffers between GCLK and state elements allows the "borrowing" of time from temporally adjacent clock cycles; and conditional logic gates are employed as clock buffers to realize power savings.

Within such an unconstrained framework, careful verification is needed to ensure functionality and speed. Skew and edge rate under worst-case operating conditions are characterized for the global clocks and accounted for in the verification tools. Figure 4 shows skew across the chip as simulated by AWEsim based on extracted layout [4]. Since the timing verification tools use skew and edge-rate specifications of global clocks, complete characterization under all expected operating conditions is essential. The primary factors affecting clock performance not directly attributable to the PLL are variations i...
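
The clock arithmetic and the time "borrowing" described above can be made concrete with a small numeric sketch. It assumes ideal edge-triggered state elements, zero skew, and a uniform per-buffer delay; the 50ps buffer delay and the buffer counts are hypothetical values chosen for illustration and do not come from the source, and the function names are likewise invented here.

```python
def cycle_time_ps(freq_mhz):
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1e6 / freq_mhz


def pll_multiplier(target_mhz, ref_mhz):
    """Ratio the PLL must apply to the reference to reach the target."""
    return target_mhz / ref_mhz


def path_budget_ps(cycle_ps, launch_buffers, capture_buffers, buffer_delay_ps):
    """Time available to a path between two state elements.

    Each element's local clock arrives (buffers * buffer_delay_ps) after GCLK,
    so a capture clock with more buffers than the launch clock lengthens the
    path budget, and the following path loses the same amount.
    """
    return cycle_ps + (capture_buffers - launch_buffers) * buffer_delay_ps


period = cycle_time_ps(600)       # nominal 600MHz clock
low_ref = pll_multiplier(600, 80)   # multiplier at the slowest reference
high_ref = pll_multiplier(600, 200)  # multiplier at the fastest reference

BUFFER_DELAY_PS = 50  # hypothetical per-buffer delay, not from the source

# Elements A -> B -> C, with B's local clock two buffers later than A's and C's.
a_to_b = path_budget_ps(period, 2, 4, BUFFER_DELAY_PS)  # gains 100ps
b_to_c = path_budget_ps(period, 4, 2, BUFFER_DELAY_PS)  # gives back 100ps
```

Under these assumptions the two path budgets always sum to exactly two clock periods: delaying the middle element's clock moves time from one cycle to its neighbour without creating any. In a real design the race (hold) constraints mentioned in the text limit how far this can be pushed.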