The IBM zEnterprise® 196 (z196) system, announced in the second quarter of 2010, is the latest generation of the IBM System z® mainframe. The system is designed with a new microprocessor and memory subsystem, which distinguishes it from its z10® predecessor. The system delivers up to 40% better performance for traditional z/OS® workloads and up to 60% more capacity than the z10. The memory subsystem has a four-level cache hierarchy (L1 through L4) and builds the L3 and L4 caches from embedded DRAM silicon technology, which achieves approximately three times the cache density of traditional static RAM. The microprocessor has 50% more decode and dispatch bandwidth than the z10 microprocessor, as well as an out-of-order design that can issue and execute up to five instructions every cycle. The microprocessor has an advanced branch prediction structure and employs enhanced store-queue management algorithms. At the date of product announcement, the microprocessor was the fastest complex-instruction-set computing (CISC) processor in the industry, running at a sustained 5.2 GHz and implementing approximately 1,100 instructions, 220 of which are cracked into reduced-instruction-set computing (RISC)-type operations, to achieve large performance gains in legacy online transaction processing and compute-intensive workloads.
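The multi-level lookup behavior described above can be sketched as a toy model. This is an illustrative sketch only: the capacities and latencies below are invented placeholders, not z196 specifications, and the fill policy is deliberately naive.

```python
# Toy four-level cache hierarchy (L1 through L4). All sizes and
# latencies are hypothetical example values for illustration.

class CacheLevel:
    def __init__(self, name, capacity_lines, latency_cycles):
        self.name = name
        self.capacity = capacity_lines
        self.latency = latency_cycles
        self.lines = set()

    def lookup(self, addr):
        return addr in self.lines

    def fill(self, addr):
        # Naive eviction: drop an arbitrary line when full.
        if len(self.lines) >= self.capacity:
            self.lines.pop()
        self.lines.add(addr)

def access(hierarchy, addr, memory_latency=200):
    """Walk L1 -> L4; on a miss at every level, charge memory latency
    and fill all levels on the way back (inclusive-style behavior)."""
    total_cycles = 0
    for level in hierarchy:
        total_cycles += level.latency
        if level.lookup(addr):
            return total_cycles
    total_cycles += memory_latency
    for level in hierarchy:
        level.fill(addr)
    return total_cycles

hierarchy = [CacheLevel("L1", 4, 1), CacheLevel("L2", 16, 4),
             CacheLevel("L3", 64, 15), CacheLevel("L4", 256, 40)]
cold = access(hierarchy, 0x100)   # misses at every level: 1+4+15+40+200 = 260
warm = access(hierarchy, 0x100)   # now hits in L1: 1 cycle
```

The sketch shows why dense lower-level caches matter: a hit anywhere in the hierarchy avoids the large memory-latency term, so tripling L3/L4 density (as eDRAM does here relative to SRAM) increases the fraction of accesses resolved on-chip.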
Interest in the concept of clustered caches has been growing in recent years. The advantages of sharing data and instruction streams among two or more microprocessors are well understood; however, clustering also introduces new challenges in cache and memory coherency when system design requirements dictate that two or more of these clusters are needed. This paper describes the shared L2 cache cluster design found in the S/390® G4 server. This novel cache design consists of multiple shared-cache clusters, each supporting up to three microprocessors, forming a tightly coupled symmetric multiprocessor with fully coherent caches and main memory. Because this cache provides the link between an existing S/390 system bus and the new, high-performance S/390 G4 microprocessor chips, the paper addresses the challenges unique to operating shared caches on a common system bus.
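The coherency problem the abstract raises can be illustrated with a textbook invalidation protocol. The sketch below encodes generic MESI state transitions; MESI is used here purely as a standard illustration of coherent shared caches, not as the actual S/390 G4 coherency scheme, which the paper itself describes.

```python
# Generic MESI coherence transitions for one cache line, seen from
# one cluster's perspective. Illustrative only; not the G4 protocol.

MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

def on_local_read(state, others_have_copy):
    # A read miss fetches the line; ownership depends on peer copies.
    if state == INVALID:
        return SHARED if others_have_copy else EXCLUSIVE
    return state  # M, E, and S all satisfy a local read unchanged

def on_local_write(state):
    # Writing requires exclusive ownership; peers get invalidated.
    return MODIFIED

def on_remote_read(state):
    # A peer's read downgrades our M/E copy to Shared
    # (a Modified line must be written back first).
    return SHARED if state in (MODIFIED, EXCLUSIVE) else state

def on_remote_write(state):
    # A peer's write invalidates our copy.
    return INVALID
```

With multiple shared-cache clusters on a common system bus, every cluster must observe remote reads and writes and apply transitions like these, which is exactly the coherency traffic the paper's design must manage.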
The next-generation System z design introduces a new microprocessor chip (CP) and a system controller chip (SC) aimed at providing a substantial boost to maximum system capacity and performance compared to the previous zEC12 design in 32nm [1,2]. As shown in the die photo, the CP chip includes 8 high-frequency processor cores, 64MB of eDRAM L3 cache, interface IOs ("XBUS") to connect to two other processor chips and the L4 cache chip, along with memory interfaces, 2 PCIe Gen3 interfaces, and an I/O bus controller (GX). The design is implemented on a 678mm² die with 4.0 billion transistors and 17 levels of metal interconnect in IBM's high-performance 22nm high-κ CMOS SOI technology [3]. The SC chip is also a 678mm² die, with 7.1 billion transistors, running at half the clock frequency of the CP chip, in the same 22nm technology, but with 15 levels of metal. It provides 480MB of eDRAM L4 cache, an increase of more than 2× from zEC12 [1,2], and contains an 18MB eDRAM L4 directory, along with multi-processor cache control/coherency logic to manage inter-processor and system-level communications. Both the CP and SC chips incorporate significant logical, physical, and electrical design innovations. Systems are built from configurable nodes of tightly-coupled CP and SC chips, each packaged on single-chip modules (Fig. 4.1.1). This structure provides improved flexibility and modularity compared to the multi-chip modules used previously. All high-speed node-to-node and drawer-to-drawer communication is through the SC chip, using microcontrollers to manage the flow. Each SC chip contains over 440 of these microcontrollers along with a series of wide multiplexers to manage the traffic.
Both the CP and SC chips support high levels of I/O bandwidth, with about 5Tb/s total bandwidth for each CP or SC chip, running at speeds of up to 5Gb/s (single-ended) and 9.6Gb/s (differential). The CP chip adopted a unique floorplan configuration, driven by the width of the cores, which were too wide to fit four across on the die. This floorplan created significant logical and physical complexities in the L3 design, but careful engineering prevented these issues from having any meaningful impact on the latency or bandwidth of the L3. The entire L3 and all 8 cores are covered by a single large "mega-mesh" clock domain, maximizing on-chip bus bandwidth. The unified mega-mesh design enables double-pumping of many on-chip buses for wider effective bandwidth, and eliminates any mesh-to-mesh timing margins in critical core-to-L3 timing paths. The CP processor core design, shown in Fig. 4.1.2, improves upon the zEC12 processor [4] with two vector execution units, significantly higher instruction-per-cycle throughput, and a new SMT2 micro-architecture supporting simultaneous execution of two threads. The microprocessor core features a wide superscalar, out-of-order pipeline that can sustain an instruction fetch, decode, dispatch and completion rate of six CISC instructions per cycle. The instruction execution path is predicted by multi-level branch prediction...
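The double-pumping mentioned above transfers data on both phases of the clock, doubling effective bus bandwidth at a fixed wire width. The back-of-envelope sketch below uses invented example values (bus width and clock rate are hypothetical, not disclosed figures for this design) to show the arithmetic.

```python
# Effective bandwidth of a double-pumped on-chip bus.
# width_bits and clock_ghz below are hypothetical example values.

def effective_bandwidth_gbps(width_bits, clock_ghz, pumps=1):
    """Gb/s = bits per transfer * transfers per cycle * cycles per ns."""
    return width_bits * pumps * clock_ghz

single = effective_bandwidth_gbps(128, 2.5, pumps=1)  # 320.0 Gb/s
double = effective_bandwidth_gbps(128, 2.5, pumps=2)  # 640.0 Gb/s
```

The same wires carry twice the traffic, which is why covering the L3 and all cores with one unified clock domain pays off: double-pumping only works cleanly when both ends of the bus share the same clock, with no mesh-to-mesh margin in between.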