With up to 65,536 compute nodes and a peak performance of more than 360 teraflops, the Blue Genet/L (BG/L) supercomputer represents a new level of massively parallel systems. The system software stack for BG/L creates a programming and operating environment that harnesses the raw power of this architecture with great effectiveness. The design and implementation of this environment followed three major principles: simplicity, performance, and familiarity. By specializing the services provided by each component of the system architecture, we were able to keep each one simple and leverage the BG/L hardware features to deliver high performance to applications. We also implemented standard programming interfaces and programming languages that greatly simplified the job of porting applications to BG/L. The effectiveness of our approach has been demonstrated by the operational success of several prototype and production machines, which have already been scaled to 16,384 nodes.
Many optimizing compilers perform global register allocation using a Chaitin-style graph coloring algorithm. Live ranges that cannot be allocated to registers are spilled to memory. The amount of code required to spill the live range depends on the spilling heuristic used. Chaitin's spilling heuristic offers some guidance in reducing the amount of spill code produced. However, this heuristic does not allow the partial spilling of live ranges and the reduction in spill code is limited to a local level. In this paper, we present a global technique called interference region spilling that improves the spilling granularity of any local spilling heuristic. Our technique works above the local spilling heuristic, limiting the normal insertion of spill code to a portion of each spilled live range. By partially spilling live ranges, we can achieve large reductions in dynamically executed spill code; up to 75% in some cases and an average of 33.6% across the benchmarks tested.
We describe the design of a dual-issue single-instruction, multipledata-like (SIMD-like) extension of the IBM PowerPCt 440 floating-point unit (FPU) core and the compiler and algorithmic techniques to exploit it. This extended FPU is targeted at both the IBM massively parallel Blue Genet/L machine and the more pervasive embedded platforms. We discuss the hardware and software codesign that was essential in order to fully realize the performance benefits of the FPU when constrained by the memory bandwidth limitations and high penalties for misaligned data access imposed by the memory hierarchy on a Blue Gene/L node. Using both hand-optimized and compiled code for key linear algebraic kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Our measurements show that the combination of algorithm, compiler, and hardware delivers a significant fraction of peak floating-point performance for compute-bound-kernels, such as matrix multiplication, and delivers a significant fraction of peak memory bandwidth for memorybound kernels, such as DAXPY, while remaining largely insensitive to data alignment.
Power ISA™ Version 3.1 has introduced a new family of matrix math instructions, collectively known as the Matrix-Multiply Assist (MMA) facility. The instructions in this facility implement numerical linear algebra operations on small matrices and are meant to accelerate computation-intensive kernels, such as matrix multiplication, convolution and discrete Fourier transform. These instructions have led to a power-and area-efficient implementation of a high throughput math engine in the future POWER10 processor. Performance per core is 4 times better, at constant frequency, than the previous generation POWER9 processor. We also advocate the use of compiler builtins as the preferred way of leveraging these instructions, which we illustrate through case studies covering matrix multiplication and convolution.
182 Moreira et al.This article is both a description of the Blue Gene/L supercomputer as well as an account of how that system was designed, developed, and delivered. It reports on the technical characteristics of the system that made it possible to build such a powerful supercomputer. It also reports on how teams across the world worked around the clock to accomplish this milestone of high-performance computing.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.