This paper describes the system packaging and technologies of the IBM System z9 enterprise-class server. The central electronic complex of the system consists of four nodes, each housing a multichip module (MCM) with 16 chips consuming up to 1,200 W. The z9 server doubles the multiprocessor performance of the System z990 by increasing the central processing unit (CPU) configuration and using an internally developed elastic interface to increase interconnect speed on all high-speed buses. In contrast to all previous zSeries designs, in which these interconnects ran at half the processor speed, the packaging interconnects on the multichip module run at the same speed as the processor (1.72 GHz). High frequencies and massively parallel connectivity lead to a raw packaging bandwidth of up to 1,764 GB/s between processors and cache within a single frame for a fully configured four-node z9 system.
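To put the quoted bandwidth in perspective, the short sketch below divides the 1,764 GB/s figure by the node count and by the 1.72 GHz clock. It assumes, for illustration only, that the bandwidth is spread evenly across the four nodes, that every contributing bus runs at the processor clock, and that 1 GB means 10^9 bytes; the abstract states none of these.

```python
# Figures quoted in the abstract above.
TOTAL_BANDWIDTH_GB_S = 1764.0  # raw packaging bandwidth, fully configured system
NODES = 4                      # nodes in a fully configured system
CLOCK_GHZ = 1.72               # processor and MCM interconnect clock


def per_node_bandwidth(total_gb_s, nodes):
    """Assumption: the total is split evenly across nodes."""
    return total_gb_s / nodes


def bytes_per_cycle(total_gb_s, clock_ghz):
    """GB/s divided by GHz gives bytes per clock cycle (the 10^9 factors cancel)."""
    return total_gb_s / clock_ghz


node_bw = per_node_bandwidth(TOTAL_BANDWIDTH_GB_S, NODES)
system_bpc = bytes_per_cycle(TOTAL_BANDWIDTH_GB_S, CLOCK_GHZ)

print(f"{node_bw:.0f} GB/s per node")
print(f"{system_bpc:.0f} bytes per cycle system-wide")
```

Under these assumptions the system moves roughly a kilobyte between processors and cache on every clock cycle.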
The next-generation System z design introduces a new microprocessor chip (CP) and a system controller chip (SC) aimed at providing a substantial boost to maximum system capacity and performance compared to the previous zEC12 design in 32 nm [1,2]. As shown in the die photo, the CP chip includes 8 high-frequency processor cores, 64 MB of eDRAM L3 cache, interface IOs ("XBUS") to connect to two other processor chips and the L4 cache chip, along with memory interfaces, 2 PCIe Gen3 interfaces, and an I/O bus controller (GX). The design is implemented on a 678 mm² die with 4.0 billion transistors and 17 levels of metal interconnect in IBM's high-performance 22 nm high-κ CMOS SOI technology [3]. The SC chip is also a 678 mm² die, with 7.1 billion transistors, running at half the clock frequency of the CP chip, in the same 22 nm technology, but with 15 levels of metal. It provides 480 MB of eDRAM L4 cache, an increase of more than 2× from zEC12 [1,2], and contains an 18 MB eDRAM L4 directory, along with multi-processor cache control/coherency logic to manage inter-processor and system-level communications. Both the CP and SC chips incorporate significant logical, physical, and electrical design innovations.

Systems are built from configurable nodes of tightly-coupled CP and SC chips, each packaged on single-chip modules (Fig. 4.1.1). This structure provides improved flexibility and modularity compared to the multi-chip modules used previously. All high-speed node-to-node and drawer-to-drawer communication is through the SC chip using micro-controllers to manage the flow. Each SC chip contains over 440 of these micro-controllers along with a series of wide multiplexers to manage the traffic.
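As a rough illustration of the die figures quoted above, the following sketch computes average transistor density for the CP and SC chips from the stated die area and transistor counts. The function and variable names are hypothetical, and the result is a whole-die average only; the excerpt does not break the counts down by logic and eDRAM.

```python
# Die figures quoted in the excerpt above.
DIE_AREA_MM2 = 678.0                        # both chips share the same die size
TRANSISTORS = {"CP": 4.0e9, "SC": 7.1e9}    # transistor count per chip


def density_millions_per_mm2(transistors, area_mm2):
    """Average transistor density in millions of transistors per mm^2."""
    return transistors / area_mm2 / 1e6


densities = {
    name: density_millions_per_mm2(count, DIE_AREA_MM2)
    for name, count in TRANSISTORS.items()
}
ratio = densities["SC"] / densities["CP"]

for name, value in densities.items():
    print(f"{name}: {value:.2f} M transistors/mm^2")
print(f"SC/CP density ratio: {ratio:.3f}")
```

On the same die area the SC chip averages close to 1.8 times the transistor density of the CP chip, which is at least consistent with its 480 MB of eDRAM L4 cache, although the excerpt does not attribute the difference to any single cause.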
Both the CP and SC chips support high levels of I/O bandwidth, with about 5 Tb/s total bandwidth for each CP or SC chip, running at speeds of up to 5 Gb/s (single-ended) and 9.6 Gb/s (differential).

The CP chip adopted a unique floorplan configuration, driven by the width of the cores, which were too wide to fit four across on the die. This floorplan created significant logical and physical complexities in the L3 design, but careful engineering prevented these issues from having any meaningful impact on latency or bandwidth of the L3. The entire L3 and all 8 cores are covered with a single large "mega-mesh" clock domain, maximizing on-chip bus bandwidth. The unified mega-mesh design enables double-pumping of many on-chip buses for wider effective bandwidth, and eliminates any mesh-to-mesh timing margins in critical core-to-L3 timing paths.

The CP processor core design, shown in Fig. 4.1.2, improves upon the zEC12 processor [4] with two vector execution units, significantly higher instruction-per-cycle throughput, and a new SMT2 micro-architecture supporting simultaneous execution of two threads. The microprocessor core features a wide superscalar, out-of-order pipeline that can sustain an instruction fetch, decode, dispatch and completion rate of six CISC instructions per cycle. The instruction execution path is predicted by multi-level bra...
The IBM POWER8 processor is a 649-mm², 4.2-billion-transistor, high-frequency microprocessor fabricated in the IBM 22-nm silicon on insulator (SOI) technology with embedded dynamic random access memory (eDRAM) and 15 layers of metal. With its twelve architecturally enhanced, eight-way multithreaded cores, 96-MB high-bandwidth shared third-level cache, and increased on- and off-chip bandwidth, the POWER8 processor delivers industry-leading performance. This paper describes the circuit techniques and design methodologies that were employed for implementing this chip and that allowed it to maintain the power dissipation at the level of its predecessor while delivering a threefold increase in per-socket performance. Among the innovative technologies employed by the processor are resonant clocking, on-chip per-core voltage regulation, and enhanced eDRAM arrays.

Chip overview

The IBM POWER8 processor [1,2] is the eighth generation of IBM Power Architecture implemented in IBM's 22-nm embedded dynamic random access memory (eDRAM) silicon on insulator (SOI) technology [3]. The 649 mm² POWER8 processor die includes twelve architecturally enhanced eight-way multithreaded cores with high-throughput private second-level caches, a 96-MB high-bandwidth eDRAM third-level cache, an on-chip symmetric multi-processor (SMP) fabric, a set of cryptography and memory compression accelerators, memory controllers with I/O links capable of connecting to a maximum of eight memory buffer chips [4], six high-bandwidth off-chip SMP links, and 32 third-generation PCI Express (PCIe) lanes. Figure 1 shows the die photograph of the POWER8 processor. The processor cores are grouped into four quadrants. Each core has a private 512-KB level-2 (L2) cache with a read bandwidth of 64 bytes per cycle. The shared 96-MB level-3 (L3) cache is physically placed into the core quadrants.
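The per-core cache figures above can be rolled up into chip-level totals, as in the sketch below. The capacities, core count, and 64 bytes per cycle come from the text; the 4.0 GHz clock used in the last step is a purely hypothetical value chosen for illustration, since the excerpt does not state the core frequency.

```python
# Per-core and chip-level figures quoted in the text above.
CORES = 12
QUADRANTS = 4
L2_KB_PER_CORE = 512
L2_READ_BYTES_PER_CYCLE = 64   # per core
L3_MB_TOTAL = 96

cores_per_quadrant = CORES // QUADRANTS               # cores in each quadrant
total_l2_mb = CORES * L2_KB_PER_CORE / 1024           # combined private L2 capacity
l3_mb_per_core = L3_MB_TOTAL / CORES                  # shared L3 divided by core count
aggregate_l2_read = CORES * L2_READ_BYTES_PER_CYCLE   # bytes per cycle, all cores


def l2_read_gb_s(clock_ghz):
    """Aggregate L2 read bandwidth in GB/s for a given core clock in GHz."""
    return aggregate_l2_read * clock_ghz


HYPOTHETICAL_CLOCK_GHZ = 4.0  # assumption, not taken from the source

print(f"{cores_per_quadrant} cores per quadrant")
print(f"{total_l2_mb:.0f} MB of L2 in total, {l3_mb_per_core:.0f} MB of L3 per core")
print(f"{aggregate_l2_read} bytes/cycle aggregate L2 read bandwidth")
print(f"{l2_read_gb_s(HYPOTHETICAL_CLOCK_GHZ):.0f} GB/s at {HYPOTHETICAL_CLOCK_GHZ} GHz")
```

The L3 share per core is simple division of the shared capacity by the core count; the text describes the L3 as shared, so this is not a statement about how the cache is partitioned.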
The on-chip SMP buses connecting the processor cores, memory controllers, accelerators, and I/O units run through the horizontal stripe in the center of the chip, referred to as Fabric in Figure 1, and the vertical wiring channel in the middle. The on-node SMP buses, responsible for intra-node communication, are located along the top edge of the die, while the off-node SMP buses, responsible for inter-node communication, are located along the bottom edge together with the PCIe links. The memory links connecting the POWER8 processor to a maximum of eight memory buffer chips are located on the left and right sides of the die. The accelerator units are located between the two core quadrants in the upper half of the processor die. The POWER8 processor contains approximately 4.2 billion transistors. Compared to its predecessor, the POWER7 processor [2, 5, 6], the POWER8 chip achieves a 50% improvement in single-thread performance, a two-fold increase in per-core performance, and a three-fold increase in chip throughput when measured at the same frequency [7]. In terms of the maximum core and SMP bus frequencies, the POWER8 processor achieves an incremental ...