2014
DOI: 10.1177/1094342014522573

Addressing failures in exascale computing

Abstract: We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach…

Cited by 278 publications (97 citation statements)
References 134 publications (160 reference statements)
“…In contrast, the Cray XK6/XK7 (Titan) at Oak Ridge National Laboratory (10-20/27 petaflops) achieves an MTBI (mean time between interrupts) of 132/173 h [2]. The anticipated failure rate of an exascale machine is likely to be higher than that of present systems [8,9,23,28], and application resilience is therefore critical to maintaining the usefulness of any future exascale system.…”
Section: Introduction (citation type: mentioning, confidence: 90%)
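For context, a minimal sketch of how an MTBI figure translates into the chance that a long-running job finishes without interruption. It assumes exponentially distributed interrupts, a common modeling assumption that is not stated in the excerpt; the function name is illustrative.

import math

def completion_probability(job_hours, mtbi_hours):
    # P(no interrupt during the job) under an exponential interrupt
    # model with the given mean time between interrupts.
    return math.exp(-job_hours / mtbi_hours)

# At Titan's reported MTBI of 173 h, a 24 h job usually survives;
# a roughly 10x higher failure rate makes interruption the likely outcome.
print(completion_probability(24, 173))   # ~0.87
print(completion_probability(24, 17.3))  # ~0.25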
“…Algorithm and software resilience is now one of the greatest concerns in striving towards exascale, and interruption due to component failure is now considered a major barrier to effectively using an exascale system with current numerical codes [9,28]. Both hardware and software errors, such as component failures or operating system crashes, may interrupt simulations or lead to non-deterministic results [27].…”
Section: Introduction (citation type: mentioning, confidence: 99%)
“…2. With probability $\frac{1}{2}$ the error has struck in the other $2^{x-1}$ nodes and we do not need to recompute any of the first $2^{x-1}$ nodes. We can write…”
Section: ABFT (citation type: mentioning, confidence: 99%)
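The quoted passage is cut off mid-derivation. One plausible form of the recurrence this bisection argument sets up — an assumption for illustration, with $C(2^x)$ a hypothetical symbol for the expected number of recomputed nodes, not notation from the cited paper — is

$$C(2^x) = \tfrac{1}{2}\,2^{x-1} + \tfrac{1}{2}\,C(2^{x-1}), \qquad C(2^0) = 1,$$

which unrolls into a geometric series with ratio $\tfrac{1}{4}$, giving $C(2^x) \approx \tfrac{1}{3}\,2^x$: in expectation, roughly a third of the nodes are recomputed.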
“…Future large-scale systems are projected to have higher error rates, with MTBFs (mean times between failures) as low as 20 minutes [1]. We focus on latent errors, which are not detected immediately after their occurrence.…”
Section: Introduction (citation type: mentioning, confidence: 99%)
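A small illustration of why latent errors complicate checkpointing: every checkpoint written inside the detection window may already contain the corrupted state, so several must be retained. The helper name and retention rule below are illustrative, derived from the definition of detection latency rather than from the cited paper.

import math

def checkpoints_to_retain(detection_latency_s, checkpoint_period_s):
    # An error can surface up to detection_latency_s after it occurs,
    # so all checkpoints written in that window are suspect; keep one
    # older, known-good checkpoint beyond them.
    return math.ceil(detection_latency_s / checkpoint_period_s) + 1

# A 1 h detection latency with checkpoints every 20 min means four
# checkpoints must be kept to guarantee a clean restart point.
print(checkpoints_to_retain(3600, 1200))  # 4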
“…The issues are different for supercomputers, whose storage nodes typically comprise tens of thousands of individual disks interconnected through a dedicated high-speed storage network and managed by a parallel file system. Due to the scale of such infrastructures and the dramatic decrease of the mean time between failures (MTBF), many papers consider application checkpointing [1]. Accurately modeling and simulating the impact of reading and writing checkpointed data on disks is thus crucial to designing efficient policies.…”
Section: Introduction (citation type: mentioning, confidence: 99%)
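As a sketch of the kind of policy such models inform, Young's first-order approximation gives the checkpoint period that balances checkpoint cost against expected lost work. This is a standard textbook formula, not necessarily the model of the cited work, and the numbers are illustrative.

import math

def young_period(checkpoint_cost_s, mtbf_s):
    # Young's approximation: the period between checkpoints that
    # minimizes expected overhead is sqrt(2 * C * MTBF).
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# A 5 min checkpoint cost on a platform with the 20 min MTBF projected
# above leaves barely any window for useful computation.
print(young_period(300, 1200) / 60)  # ~14.1 min between checkpoints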