AMD's 2-core "Bulldozer" module contains 213 million transistors in an 11-metal-layer 32nm HKMG SOI CMOS process and is designed to operate from 0.8 to 1.3V. This new micro-architecture [1] improves performance and frequency while reducing area and power compared to a previous AMD x86-64 CPU in the same process [2]. To achieve these goals, the design reduced the number of FO4 inverter delays/cycle by more than 20%, achieving higher frequencies in the same power envelope even with increased core counts. The 2-core CPU module area (including 2MB L2 cache) is 30.9mm² (Fig. 4.5.7). The module design contains 84 unique custom macros and 317,000 scannable flops. Module-level VSS power gating (CC6) reduces leakage power by 95% when both cores are idle [2]. Transistor Vts across the design are mostly regular (47%) and long-channel regular (46%).

The Bulldozer micro-architecture is cycle-based, using soft-edge flip-flops (SEF) to provide high-frequency performance, process variation tolerance, and low power consumption (Fig. 4.5.1). Performance and process tolerance are provided by a 2-clock design: early and late clocks (ECLK, LCLK) create a soft timing edge, allowing limited cycle stealing. Power is reduced in low-power SEFs by internally gated slave latch clocks. The majority of flops (78%) are low-power, using high-performance flops only on timing-critical paths.

In contrast to leveraged power-optimized CPU designs [2,4], Bulldozer's ground-up design requires co-development of power efficiency, timing, and functionality. Initially, micro-architectural power is optimized using a power-aware high-level performance model. Next, before schematic completion, the team tracks and analyzes RTL-based clock and flip-flop activity (a proxy for switching power) to meet clock gating goals. Finally, a new power model enables early mixed schematic/layout analysis of transistor-level power. This enables aggressive power optimizations while the implementation is still malleable.
The result is a design with low power consumption for typical applications, making it well-suited to active power management and boost (Fig. 4.5.2).

The L1 caches are split, with an I-cache residing in the instruction unit and a D-cache located in each load/store unit of the 2 cores. The 2-way, 64KB I-cache consists of an 8×2 array of 4KB bank macros, with 2 more arrays for pre-decode bits. Load/store area in the 2 cores is at a premium, so the D-cache uses a 4-way 16KB array with performance features described later in the paper. Both L1 caches use an 8T storage cell. The change from a 6T cell in 45nm to 8T in 32nm was required to improve low-voltage margin and read timing and to reduce power. Use of the 8T cell also eliminated a difficult D-cache read-modify-write timing path. Reads use a 2-level pre-charged local/super bitline structure with delayed-onset keeper, single-rail, full-swing signals, and glitch latches.

Several D-cache performance features reduce conflicts and power. First, microbanking reduces read conflicts to the same rate as a previous 16-bank 64KB desi...
Sun Microsystems, Palo Alto, CA

This 3rd-generation, superscalar processor, implementing the SPARC V9 64b architecture, improves performance over previous processors through improvements in the on-chip memory system and circuit designs that enhance the speed of critical paths beyond the process entitlement [1,2]. In the on-chip memory system, both bandwidth and latency are scaled. Keys to scaling memory latency are a sum-addressed memory data cache, which allows the average memory latency to scale by more than the clock ratio, and the use of a prefetch data cache [3]. Memory bandwidth is improved by using wave-pipelined SRAM designs for on-chip caches and a write cache for store traffic [4]. The chip operates at 800MHz and dissipates <60W from a 1.5V supply. It contains 23M transistors (12M in RAM cells) on a 244mm² die. Figure 25.2.1 contrasts this 7-metal-layer aluminum, 0.15µm CMOS design with the previous generations' designs. To deal with the growing microprocessor complexity, more aggressive circuit techniques, interconnect delay optimization, crosstalk reduction, improved power and clock distribution schemes, and better thermal management are used.

For minimum power dissipation and simplified verification, the primary circuit style is static CMOS using synthesis and automatic place and route. Where synthesis is not enough and full custom design not appropriate, a hybrid approach is used. Domino cells are manually placed and CAD tools shield all wires, route clocks, and insert power and ground. A commercial router completes routing of signals. For the most critical paths, custom dynamic logic design is used. Delayed reset logic is used in the SRAM structures for power minimization and to simplify clock distribution. Large caches use a self-timed latency control circuit for one-cycle throughput and two-cycle latency.
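
The abstract names a sum-addressed memory data cache but does not describe its circuit. As a rough illustration only, the sketch below shows the commonly described idea behind sum-addressed decoding: each row tests whether base + offset equals its own index using purely bitwise logic, so no carry has to ripple through an adder before the decoder. The function names and the 4-bit width are hypothetical and are not taken from the paper.

```python
def sum_addressed_match(base: int, offset: int, row: int, bits: int) -> bool:
    """Return True when (base + offset) mod 2**bits == row, with no carry chain."""
    mask = (1 << bits) - 1
    base, offset, row = base & mask, offset & mask, row & mask
    # Carry-in each bit position would need for its sum bit to equal the row bit.
    required_carry_in = base ^ offset ^ row
    # Carry-out each position produces if it did receive that required carry-in.
    carry_out = (base & offset) | ((base | offset) & ~row & mask)
    # Bit i's carry-in is bit i-1's carry-out; bit 0 has no carry-in.
    implied_carry_in = (carry_out << 1) & mask
    return required_carry_in == implied_carry_in


def decode_row(base: int, offset: int, bits: int) -> int:
    """Software stand-in for a bank of per-row comparators (hypothetical helper)."""
    matches = [r for r in range(1 << bits)
               if sum_addressed_match(base, offset, r, bits)]
    assert len(matches) == 1  # exactly one row line should fire
    return matches[0]
```

In this model every row comparison depends only on neighboring bit pairs, which is why such a decoder can overlap the address addition with the array access; whether the chip's cache is built exactly this way is an assumption.
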
A predecode flip-flop circuit incorporates the predecode logic function, eliminating 2 logic levels and significantly speeding up the address decoding critical path. Logical structures are traditional domino logic as well as delayed clocking domino logic with overlapping multiphase non-blocking clocking. Critical signals are never gated by clocks, creating a pseudo-transparent evaluation phase that maximizes speed. Consecutive logic stages are clocked by delayed phases with enough overlap to guarantee safe signal transition. A family of edge-triggered flip-flops includes dynamic flip-flops producing monotonic outputs for domino logic [5]. Members of this family also embed a full logic level while maintaining a low input-to-output delay, allowing a pipeline with only 8 logic stages per clock cycle. For ease of verification, dynamic design is chiefly confined to fully-shielded full-custom structures.

To facilitate single-cycle transfers, the working register file (WRF), which handles regular read/write operations, and the architectural register file (ARF), which stores 8 windows, are interleaved into one physical unit, a WARF (Figure 25.2.2). The WARF performs read, write, and transfer simultaneously. The 32...