This third-generation chip-multithreading (CMT) SPARC processor targets high-performance servers and is optimized for both single- and multi-threaded applications. The architecture highlights are provided in [1]; this paper focuses on the physical implementation, giving an overview of circuit innovations in memory arrays, register files, and floating-point hardware that boost performance and circuit robustness with low area overhead. The 396mm² chip, shown in Fig. 4.2.1, is fabricated in an 11M 65nm CMOS process and operates at a nominal frequency of 2.3GHz, consuming a maximum power of 250W at 1.2V. Power-management techniques include clock gating at the core-cluster level and power throttling through a single-thread-issue mode of operation. This mode is used in power-constrained systems without sacrificing single-thread performance.

The chip floorplan is symmetrical, with the 4 core clusters placed at the corners of the chip and the shared L2-cache crossbar switch in the middle, as shown in Fig. 4.2.1. The SerDes I/O interface occupies three of the chip sides. Process variability and high leakage call for extensive use of static CMOS circuits to improve circuit robustness, minimizing design effort and design time while achieving timing and area targets. Standard- and custom-cell libraries adhere to common library template rules to facilitate block composition through standard place-and-route tools. Dynamic circuits are limited to high-speed flip-flops and SRAM blocks.

The data cache has double the read bandwidth of the other arrays. The split-wordline cell, shown in Fig. 4.2.2, combined with a single-ended sensing scheme, eliminates the potentially marginal self-timed circuits and the 2× area overhead of dual-port-cell or double-pumped circuit implementations. All arrays use a special bitline-precharge and keeper cell that completes the edge memory-row structure by abutment.
This reduces the 16-row memory array overhead to 20%, instead of the typical 35%. Clock gating of the non-active portions of the array reduces the overall power. During the write operation, both wordlines are turned on and the complementary data inputs are applied to the bitlines, as in a traditional single-port cell design. During the read operation, however, the single-ended scheme allows the two wordlines and bitlines to be used separately to perform a dual-port-style read, as shown in Fig. 4.2.3.

The L2 data array operates at the core supply voltage and has a 2-CPU-cycle latency. A single read is performed every 2 cycles, and each write operation includes a read of the old data before overwriting it with the new data, as required by the ECC architecture implementation. The read-before-write operation at the same address occurs without a precharge cycle in between. Several optimization techniques help reduce the area overhead and meet the timing constraints in the L2 cache array. The I/O circuitry, including sense amplifiers, write drivers, and output-data latch, is shared by two 128×128b arrays. The column muxes for the top and ...
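
The write and dual-port-style read behavior of the split-wordline cell described above can be illustrated with a purely behavioral sketch. It is not the circuit of Fig. 4.2.2 or Fig. 4.2.3: the class name, the method names, and the assumption that the second read port senses the complement node of the cell (and is inverted back afterwards) are all hypothetical, chosen only to show how two separately controlled wordlines let one cycle read two rows.

```python
class SplitWordlineArray:
    """Behavioral (not electrical) model of a split-wordline SRAM array.

    Hypothetical illustration: each cell has a true node reached through
    wordline A and bitline BL, and a complement node reached through
    wordline B and bitline BLB.
    """

    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]

    def write(self, row, data):
        # Write: both wordlines of the row are asserted and complementary
        # data are driven on BL and BLB, as in a single-port cell.
        bl = list(data)
        blb = [1 - bit for bit in bl]
        # The stored value is the one driven on the true bitline; the
        # complement node would hold the value driven on BLB.
        self.cells[row] = [t for t, c in zip(bl, blb) if t != c or True]

    def dual_read(self, row_a, row_b):
        # Read: wordline A is asserted on row_a only, wordline B on row_b
        # only, so each bitline is sensed single-endedly for a different row.
        bl = list(self.cells[row_a])               # true nodes of row_a
        blb = [1 - bit for bit in self.cells[row_b]]  # complement nodes of row_b
        # Assumed: the sense path on BLB inverts back to true data.
        return bl, [1 - bit for bit in blb]


array = SplitWordlineArray(rows=16, cols=4)
array.write(3, [1, 0, 1, 1])
array.write(7, [0, 0, 1, 0])
port_a, port_b = array.dual_read(3, 7)
print(port_a, port_b)
```

The point of the sketch is only the access pattern: one write port that uses both wordlines together, and two single-ended read ports that use them independently.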
The I/O interface of the third-generation 16-core, 32-thread chip-multithreading SPARC processor has 1.1Tbps I/O throughput with 112 Tx/176 Rx SerDes channels in 46mm². Individual links run at a BER of 1E-12 on FR4 PCBs at 4.08-0.5Gbps full-half rate, and consume 18mW/ch/Gbps at 2.67Gbps. Each link has linear equalization, 15 de-emphasis and 8 output-swing control settings, and a latency of 8UI in Rx and 14-16UI in Tx.

Introduction

Memory access time and capacity are critical to computer system performance. With increased clock speeds and parallel processing, instruction execution rates increase, requiring higher memory bandwidth between caches and main memory [1,2]. To accommodate this throughput demand, the third-generation chip-multithreading (CMT) SPARC processor employs embedded-clock high-speed serializer/deserializer (SerDes) links with low-latency signal-processing techniques [3]. These links provide memory and system I/O interfacing and ease pin-field congestion; adaptive linear equalization mitigates the intersymbol interference, crosstalk, jitter, and attenuation that come with complex routing on multilayer boards. DFE-based transceivers can achieve higher speeds [4,5] but are susceptible to error propagation, producing error bursts that compromise data integrity in high-end computing systems. Error-detection-and-correction encoding is not strong enough and costs latency. Transceivers based on feed-forward equalization [6] have complex adaptation, and controlling that adaptation may require a backchannel. Adaptive linear equalization is therefore attractive for our application.
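
The headline link figures can be cross-checked with simple arithmetic. The sketch below rests on assumptions the text does not state: that the aggregate throughput sums all Tx and Rx channels running at the 4.08Gbps full rate, that bit errors are independent so the mean time between errors on a link is 1/(rate × BER), and that the 18mW/ch/Gbps figure scales linearly with data rate. The function and constant names are illustrative only.

```python
TX_CHANNELS = 112          # from the text
RX_CHANNELS = 176          # from the text
FULL_RATE_GBPS = 4.08      # from the text
BER = 1e-12                # from the text
MW_PER_CH_PER_GBPS = 18.0  # from the text, quoted at 2.67 Gbps
POWER_RATE_GBPS = 2.67     # from the text


def aggregate_throughput_tbps(tx, rx, rate_gbps):
    """Total throughput if every channel runs at rate_gbps (assumption)."""
    return (tx + rx) * rate_gbps / 1000.0


def mean_seconds_between_errors(rate_gbps, ber):
    """Mean time between bit errors on one link, assuming independent errors."""
    return 1.0 / (rate_gbps * 1e9 * ber)


def channel_power_mw(mw_per_gbps, rate_gbps):
    """Per-channel power at the rate where the efficiency figure is quoted."""
    return mw_per_gbps * rate_gbps


print(aggregate_throughput_tbps(TX_CHANNELS, RX_CHANNELS, FULL_RATE_GBPS))
print(mean_seconds_between_errors(FULL_RATE_GBPS, BER))
print(channel_power_mw(MW_PER_CH_PER_GBPS, POWER_RATE_GBPS))
```

Under these assumptions the 288 channels give roughly 1.18 Tbps, in line with the quoted 1.1Tbps; a single link at full rate sees on average one bit error about every 245 seconds; and each channel draws about 48 mW at 2.67Gbps.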