Over the last twenty years, the open source community has provided more and more software on which the world's High Performance Computing (HPC) systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. But although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and GPUs. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.
Abstract - We have extended the Falkon execution framework to make loosely coupled petascale systems a practical and useful programming model. This work studies and measures the performance factors involved in applying this approach to enable the use of these systems by a broader user community, and with greater ease. Our work enables the execution of highly parallel computations composed of loosely coupled serial jobs with no modifications to the respective applications. This approach allows a new, and potentially far larger, class of applications to leverage petascale systems, such as the IBM Blue Gene/P supercomputer. We present the challenges of I/O performance encountered in making this model practical, and show results using both microbenchmarks and real applications from the domains of economic energy modeling and molecular docking. Our benchmarks show that we can scale up to 160K processor cores with high efficiency, and can achieve sustained execution rates of thousands of tasks per second.
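To make this many-task style of execution concrete, the sketch below shows a minimal MPI master/worker dispatcher that farms independent serial jobs out to worker ranks as they become free. The task count, the ./serial_app command line, and the use of system() are illustrative assumptions for this sketch, not the actual Falkon implementation.

/* Minimal sketch of a master/worker dispatcher for loosely coupled serial
   tasks in the spirit of the many-task model described above. The task
   count, command line, and use of system() are illustrative assumptions,
   not the actual Falkon implementation. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int ntasks = 1000;          /* assumed number of independent serial jobs */

    if (rank == 0) {
        /* Dispatcher: hand the next task ID to whichever worker asks for work. */
        int next = 0, stopped = 0, dummy;
        MPI_Status st;
        while (stopped < size - 1) {
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                stopped++;
            }
        }
    } else {
        /* Worker: request a task, run the unmodified serial application, repeat. */
        int task, ready = 0;
        MPI_Status st;
        for (;;) {
            MPI_Send(&ready, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            char cmd[256];
            snprintf(cmd, sizeof cmd, "./serial_app input.%d > output.%d", task, task);
            system(cmd);
        }
    }

    MPI_Finalize();
    return 0;
}

Run, for example, with mpiexec -n 4 ./dispatcher: rank 0 hands out task IDs on demand while the remaining ranks each execute one serial job at a time, which is the essence of loosely coupled execution on a large machine.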
The time is now

When processor clock speeds flatlined in 2004, after more than 15 years of exponential increases, the computational science community lost the key to the automatic performance improvements its applications had traditionally enjoyed. Subsequent developments in processor and system design (hundreds of thousands of nodes, millions of cores, reduced bandwidth and memory available to cores, and the inclusion of special purpose elements) have made it clear that a broad divide has now opened up between the software infrastructure that we have and the one we will certainly need in order to perform the kind of computationally intensive and data intensive work that tomorrow's scientists and engineers will require. Given the daunting conceptual and technical problems that such a change in design paradigms brings with it, we believe that this software gap will require an unprecedented level of cooperation and coordination within the worldwide open source software community. In forming the International Exascale Software Project (IESP), we hope to plan for and catalyze the kind of community-wide effort that we believe is necessary to meet this historic challenge.

Our belief in the need for broad-based, coordinated action by the global scientific software community to address the looming crisis reflects, in part, the fact that computational methods are now universally accepted as indispensable to future progress in science and engineering. The last time a disruption of comparable dimensions occurred, during the transition from vector to distributed-memory supercomputers more than two decades ago, only a relatively small part of the scientific community felt the consequences of the struggle to replace, wholesale, the programming models, numerical and communication libraries, and all the other software components and tools on which application scientists were already building. Computational science was still relatively young, and computationally intensive methods were still largely the province of a relatively small scientific elite in a relatively small number of physical sciences. Today, aided by the success of the scientific software research and development community, researchers in nearly every field of science and engineering have been able to use computational modeling/simulation and high-throughput data analysis to open new areas of inquiry (e.g., the very small, very large, very hazardous, very complex), to dramatically increase research productivity, and to amplify the social and economic impact of their work. Recent reports [7,10] make a compelling case, in terms of both scope and importance, for the profound expansion of our research horizons that will occur if we can rise to the challenge of peta/exascale computing. But in light of the radical changes in computing we are currently undergoing, it is clear that the software infrastructure necessary to make that ascent does not yet exist, and that we are a long way from being in a position to create it. At the same time, the increasing use of computationally intensive methods ...
We investigate operating system noise, which we identify as one of the main reasons for a lack of synchronicity in parallel applications. Using a microbenchmark, we measure the noise on several contemporary platforms and find that, even with a general-purpose operating system, noise can be limited if certain precautions are taken. We then inject artificially generated noise into a massively parallel system and measure its influence on the performance of collective operations. Our experiments indicate that on extreme-scale platforms, the performance is correlated with the largest interruption to the application, even if the probability of such an interruption on a single process is extremely small. We demonstrate that synchronizing the noise can significantly reduce its negative influence.
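A fixed-work timing loop of the kind such noise microbenchmarks rely on can be sketched as follows: time a short, constant amount of work many times and attribute any iteration that runs well past the undisturbed minimum to an operating-system interruption. The work loop, iteration counts, and 2x threshold below are illustrative assumptions, not the benchmark actually used in the study.

/* Minimal sketch of a fixed-work OS-noise microbenchmark: iterations that
   take noticeably longer than the minimum are counted as "detours" caused
   by external interference. Parameters are illustrative assumptions. */
#include <stdio.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static volatile double sink;       /* keeps the work loop from being optimized away */

static void fixed_work(void)
{
    double x = 1.0;
    for (int i = 0; i < 10000; i++)
        x = x * 1.0000001 + 1e-9;
    sink = x;
}

int main(void)
{
    const int iters = 100000;
    const double threshold = 2.0;   /* flag iterations slower than 2x the minimum */
    double min_t = 1e9, lost = 0.0;
    int detours = 0;

    /* Warm-up pass: estimate the undisturbed (minimum) iteration time. */
    for (int i = 0; i < iters; i++) {
        double t0 = now_sec();
        fixed_work();
        double dt = now_sec() - t0;
        if (dt < min_t) min_t = dt;
    }
    /* Measurement pass: count detours and sum the time lost to them. */
    for (int i = 0; i < iters; i++) {
        double t0 = now_sec();
        fixed_work();
        double dt = now_sec() - t0;
        if (dt > threshold * min_t) {
            detours++;
            lost += dt - min_t;
        }
    }
    printf("min iteration: %.3f us, detours: %d, time lost to noise: %.3f ms\n",
           min_t * 1e6, detours, lost * 1e3);
    return 0;
}

Running one copy per core of a node gives a per-process noise profile; the abstract's point is that at extreme scale it is the largest such detour, not the average, that governs the slowdown of collective operations.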
Considerable work has been done on providing fault tolerance capabilities for different software components on large-scale high-end computing systems. Thus far, however, these fault-tolerant components have worked in isolation from one another, and information about faults is rarely shared. Such a lack of system-wide fault tolerance is emerging as one of the biggest problems on leadership-class systems. In this paper, we propose a coordinated infrastructure, named CIFTS, that enables system software components to share fault information with each other and adapt to faults in a holistic manner. Central to the CIFTS infrastructure is a Fault Tolerance Backplane (FTB) that enables fault notification and awareness throughout the software stack, including fault-aware libraries, middleware, and applications. We present details of the CIFTS infrastructure and the interface specification that has allowed various software programs, including MPICH2, MVAPICH, Open MPI, and PVFS, to plug into the CIFTS infrastructure. Further, through a detailed evaluation we demonstrate the nonintrusive, low-overhead capability of CIFTS, which lets applications run with minimal performance degradation.
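The sketch below illustrates the publish/subscribe pattern that a fault backplane of this kind builds on: any component can publish a fault event, and every registered component is notified and can react. The type and function names here (ftb_event, backplane_publish, backplane_subscribe) and the event fields are hypothetical stand-ins for illustration, not the actual CIFTS/FTB interface.

/* Illustrative in-process sketch of a publish/subscribe fault backplane.
   Names and fields are hypothetical, not the real CIFTS/FTB API. */
#include <stdio.h>
#include <string.h>

#define MAX_SUBSCRIBERS 16

typedef struct {
    char source[32];    /* which component raised the fault, e.g. a file system */
    char name[32];      /* fault name, e.g. "DISK_FAILURE" */
    int  severity;      /* 0 = info, 1 = warning, 2 = fatal */
} ftb_event;

typedef void (*ftb_handler)(const ftb_event *ev);

static ftb_handler subscribers[MAX_SUBSCRIBERS];
static int nsubscribers;

/* A component (MPI library, scheduler, application) registers for fault events. */
static void backplane_subscribe(ftb_handler h)
{
    if (nsubscribers < MAX_SUBSCRIBERS)
        subscribers[nsubscribers++] = h;
}

/* Any component can publish a fault; every subscriber is notified. */
static void backplane_publish(const ftb_event *ev)
{
    for (int i = 0; i < nsubscribers; i++)
        subscribers[i](ev);
}

/* Example subscriber: a communication library that reacts to fatal faults. */
static void mpi_fault_handler(const ftb_event *ev)
{
    if (ev->severity >= 2)
        printf("[mpi] rerouting around fault %s reported by %s\n",
               ev->name, ev->source);
}

int main(void)
{
    backplane_subscribe(mpi_fault_handler);

    ftb_event ev;
    strcpy(ev.source, "file_system");
    strcpy(ev.name, "DISK_FAILURE");
    ev.severity = 2;
    backplane_publish(&ev);    /* the file system reports a failing disk */
    return 0;
}

In the real infrastructure the backplane is a system-wide service rather than an in-process table, but the design idea is the same: fault producers and fault consumers are decoupled, so each layer of the stack can adapt to faults it did not detect itself.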