Failure Detection and Propagation in HPC systems

Bosilca, George; Bouteiller, Aurélien; Guermouche, Amina; Hérault, Thomas; Robert, Yves; Sens, Pierre; Dongarra, Jack

doi:10.1109/sc.2016.26

Cited by 15 publications

(9 citation statements)

References 29 publications

“…A collection of works on ULFM [9, 16–18, 21, 23, 26] has investigated the applicability of ULFM and benchmarked individual operations of it. Bosilca et al [7,8] and Katti et al [19] propose efficient fault detection algorithms to integrate with ULFM. Teranishi et al [31] use spare processes to replace failed processes for local recovery so as to accelerate recovery of ULFM.…”

Section: Related Work (mentioning)

confidence: 99%

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance

Georgakoudis

Guo

Laguna

2020

Lecture Notes in Computer Science

Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, redeploying an application incurs overhead by tearing down and reinstating execution, and possibly limiting checkpointing retrieval from slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting, up to 6×, or ULFM, up to 3×, and that it scales excellently as the number of MPI processes grows.

“…Algorithm‐based fault tolerance (ABFT) refers to algorithms which include fault detection or recovery. For example, matrix computation algorithms could recover from faults by way of “hot‐replacement.” 11 One common fault detection method is based on the communication timeouts, such as the logical ring topology proposed by Bosilca et al, 12 sending periodic keep‐alive messages in parallel with application execution. Fault‐aware MPI 13 represents another approach for applications to address faults by defining “transactions” which could be either committed or rolled‐back in the case of a fault, comparable to ULFM.…”

Section: Related Work (mentioning)

confidence: 99%

Tree‐based fault‐tolerant collective operations for MPI

Margolin

Barak

2020

Concurrency and Computation

With the increase in size and complexity of high‐performance computing systems, the probability of failures and the cost of recovery grow. Parallel applications running on these systems should be able to continue running in spite of node failures at arbitrary times. Collective operations are essential for many parallel MPI applications, and are often the first to detect such failures. This work presents tree‐based fault‐tolerant collective operations, which combine fault detection and recovery as an integral part of each operation. We do this by extending existing tree‐based algorithms to allow a collective operation to succeed despite nodes failing before or during its run. This differs from other approaches, where recovery takes place after such operations have failed. The article includes a comparison between the performance of the proposed algorithm and other approaches, as well as a simulator‐based analysis of performance at scale.
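The citation statement above describes timeout-based detection on a logical ring, where keep-alive messages are sent periodically alongside the application. The following is a minimal sketch of that idea as a toy discrete-time simulation, not the algorithm of any of the cited papers. The names `Proc` and `simulate`, the tick-based clock, the zero-latency message delivery, the instantaneous propagation of a suspicion to all live processes, and the reconnection rule (observe the nearest non-suspected predecessor) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Proc:
    rank: int
    observed: int                 # rank whose heartbeats this process watches
    alive: bool = True
    last_heard: int = 0           # tick of the last heartbeat received
    suspected: set = field(default_factory=set)


def simulate(n, period, timeout, crash_at, ticks):
    """Toy ring detector (hypothetical helper, not a library API).

    Each process watches its ring predecessor. A live predecessor is heard
    every `period` ticks; silence longer than `timeout` ticks raises a
    suspicion. `crash_at` maps rank -> tick at which that rank stops.
    Returns (detections, procs), detections being (tick, observer, failed).
    """
    assert timeout >= period, "a shorter timeout would suspect live processes"
    procs = [Proc(rank=r, observed=(r - 1) % n) for r in range(n)]
    detections = []
    for t in range(1, ticks + 1):
        # Apply crashes scheduled for this tick (silent, fail-stop).
        for rank, when in crash_at.items():
            if t == when:
                procs[rank].alive = False
        for p in procs:
            if not p.alive:
                continue
            target = procs[p.observed]
            # Zero-latency delivery: a live target is heard on heartbeat ticks.
            if target.alive and t % period == 0:
                p.last_heard = t
            if t - p.last_heard > timeout:
                detections.append((t, p.rank, target.rank))
                # Propagate the suspicion to every live process (assumed instant).
                for q in procs:
                    if q.alive:
                        q.suspected.add(target.rank)
                # Reconnect the ring: watch the nearest non-suspected predecessor.
                nxt = (target.rank - 1) % n
                while nxt in p.suspected and nxt != p.rank:
                    nxt = (nxt - 1) % n
                p.observed = nxt
                p.last_heard = t  # restart the timeout for the new target
    return detections, procs


if __name__ == "__main__":
    detections, _ = simulate(n=5, period=2, timeout=4, crash_at={2: 5}, ticks=20)
    print(detections)
```

In this sketch only the ring successor of a failed process notices the silence, so the steady-state cost is one heartbeat per process per period regardless of the number of processes; the price is that detection latency is bounded by the timeout rather than by the heartbeat period.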

“…Within an MPI communication this can result in a deadlock due to open MPI requests. These failures are a main motivation behind the design of the ULFM extension [7]. If a hard failure occurs it is not straight forward to continue the computation.…”

Section: A Faults and Failures (mentioning)

confidence: 99%

“…Current MPI implementations thus typically terminate (or deadlock) in such a situation. The most prominent proposal which suggests a suitable extension to the MPI standard currently is User-Level Failure Mitigation (ULFM) [7], [8]. It allows users to define a workaround for the node loss scenario, e.g.…”

Section: Introduction (mentioning)

confidence: 99%

A High-Level C++ Approach to Manage Local Errors, Asynchrony and Faults in an MPI Application

Engwer

Altenbernd

Dreier

et al.

2018

2018 26th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)

C++ advocates exceptions as the preferred way to handle unexpected behaviour of an implementation in the code. This does not integrate well with the error handling of MPI, which more or less always results in program termination in case of MPI failures. In particular, a local C++ exception can currently lead to a deadlock due to unfinished communication requests on remote hosts. At the same time, future MPI implementations are expected to include an API to continue computations even after a hard fault (node loss), i.e. the worst possible unexpected behaviour. In this paper we present an approach that adds extended exception propagation support to C++ MPI programs. Our technique allows to propagate local exceptions to remote hosts to avoid deadlocks, and to map MPI failures on remote hosts to local exceptions. A use case of particular interest are asynchronous 'local failure local recovery' resilience approaches. Our prototype implementation uses MPI-3.0 features only. In addition we present a dedicated implementation, which integrates seamlessly with MPI-ULFM, i.e. the most prominent proposal for extending MPI towards fault tolerance. Our implementation is available at https://gitlab.dune-project.org/christi/test-mpi-exceptions.
