AMD's 2-core "Bulldozer" module contains 213 million transistors in an 11-metal-layer 32nm HKMG SOI CMOS process and is designed to operate from 0.8 to 1.3V. This new micro-architecture [1] improves performance and frequency while reducing area and power compared to a previous AMD x86-64 CPU in the same process [2]. To achieve these goals, the design reduced the number of FO4 inverter delays/cycle by more than 20%, achieving higher frequencies in the same power envelope even with increased core counts. The 2-core CPU module area (including 2MB L2 cache) is 30.9mm² (Fig. 4.5.7).

The module design contains 84 unique custom macros and 317,000 scannable flops. Module-level VSS power gating (CC6) reduces leakage power by 95% when both cores are idle [2]. Transistor Vts across the design are mostly regular (47%) and long-channel regular (46%).

The Bulldozer micro-architecture is cycle-based, using soft-edge flip-flops (SEF) to provide high-frequency performance, process variation tolerance, and low power consumption (Fig. 4.5.1). Performance and process tolerance are provided by a 2-clock design: early and late clocks (ECLK, LCLK) create a soft timing edge, allowing limited cycle stealing. Power is reduced in low-power SEFs by internally gated slave latch clocks. The majority of flops (78%) are low-power, using high-performance flops only on timing-critical paths.

In contrast to leveraged power-optimized CPU designs [2,4], Bulldozer's ground-up design requires co-development of power efficiency, timing, and functionality. Initially, micro-architectural power is optimized using a power-aware high-level performance model. Next, before schematic completion, the team tracks and analyzes RTL-based clock and flip-flop activity (a proxy for switching power) to meet clock gating goals. Finally, a new power model enables early mixed schematic/layout analysis of transistor-level power. This enables aggressive power optimizations while the implementation is still malleable.
The result is a design with low power consumption for typical applications, making it well-suited to active power management and boost (Fig. 4.5.2).

The L1 caches are split, with the I-cache residing in the instruction unit and a D-cache located in each load/store unit of the 2 cores. The 2-way, 64KB I-cache consists of an 8×2 array of 4KB bank macros, with 2 more arrays for pre-decode bits. Load/store area in the 2 cores is at a premium, so the D-cache uses a 4-way 16KB array with performance features described later in the paper. Both L1 caches use an 8T storage cell. The change from a 6T cell in 45nm to 8T in 32nm was required to improve low-voltage margin and read timing and to reduce power. Use of the 8T cell also eliminated a difficult D-cache read-modify-write timing path. Reads use a 2-level pre-charged local/super bitline structure with delayed-onset keeper, single-rail, full-swing signals, and glitch latches.

Several D-cache performance features reduce conflicts and power. First, microbanking reduces read conflicts to the same rate as a previous 16-bank 64KB desi...
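
The L1 cache organization given above can be sanity-checked with a little arithmetic. The following is only an illustrative sketch, not part of the paper: the `CacheGeometry` class and its field names are hypothetical, and the 64-byte line size is an assumption, since the text does not state a line size.

```python
from dataclasses import dataclass

KB = 1024


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical helper describing a set-associative cache."""
    name: str
    size_bytes: int
    ways: int
    line_bytes: int = 64  # ASSUMPTION: line size is not given in the text

    @property
    def sets(self) -> int:
        # capacity = sets * ways * line size
        return self.size_bytes // (self.ways * self.line_bytes)

    @property
    def index_bits(self) -> int:
        # number of address bits needed to select a set (sets is a power of 2)
        return (self.sets - 1).bit_length()


# I-cache data storage: an 8x2 array of 4KB bank macros (from the text)
icache_bank_total = 8 * 2 * 4 * KB

icache = CacheGeometry("L1 I-cache", 64 * KB, ways=2)
dcache = CacheGeometry("L1 D-cache", 16 * KB, ways=4)

# The bank macros should add up to the stated 64KB capacity.
assert icache_bank_total == icache.size_bytes

for cache in (icache, dcache):
    print(f"{cache.name}: {cache.sets} sets, {cache.index_bits} index bits")
```

Under the assumed 64-byte line, this gives 512 sets for the 2-way 64KB I-cache and 64 sets for the 4-way 16KB D-cache; a different line size would change the set counts but not the bank-capacity check.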