A failure detector for HPC platforms

Bosilca, George; Bouteiller, Aurélien; Guermouche, Amina; Hérault, Thomas; Robert, Yves; Sens, Pierre; Dongarra, Jack

doi:10.1177/1094342017711505

Cited by 14 publications

(15 citation statements)

References 32 publications


“…However, in ULFM, application time grows significantly as the number of ranks increases. ULFM extends MPI with an always-on, periodic heartbeat mechanism [8] to detect failures and also modifies communication primitives for fault tolerant operation. Following from our measurements, those extensions noticeably increase the original application execution time.…”

Section: Discussion (mentioning)

confidence: 99%

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance

Georgakoudis

Guo

Laguna

2020

Lecture Notes in Computer Science


Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, redeploying an application incurs overhead by tearing down and reinstating execution, and possibly limiting checkpointing retrieval from slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting, up to 6×, or ULFM, up to 3×, and that it scales excellently as the number of MPI processes grows.
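The citation statement above describes an always-on, periodic heartbeat mechanism for detecting failures. As a rough illustration of that general idea, the following is a minimal sketch of a generic timeout-based detector driven by a simulated clock. It is not the algorithm of the cited paper or of ULFM; the class name, the `timeout` and `period` values, and the crash scenario are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Generic timeout-based failure detector (illustrative, not ULFM's)."""
    timeout: float                                  # assumed: max silence before suspicion
    last_seen: dict = field(default_factory=dict)   # rank -> time of last heartbeat

    def heartbeat(self, rank: int, now: float) -> None:
        """Record that `rank` emitted a heartbeat at time `now`."""
        self.last_seen[rank] = now

    def suspected(self, now: float) -> list:
        """Return the ranks that have been silent for longer than `timeout`."""
        return sorted(r for r, t in self.last_seen.items()
                      if now - t > self.timeout)


def simulate() -> list:
    """Run 4 hypothetical ranks for 10 ticks; rank 2 stops sending at t = 4."""
    detector = HeartbeatDetector(timeout=3.0)
    period = 1.0                 # assumed heartbeat period
    crashed_at = {2: 4.0}        # assumed crash scenario
    history = []
    for step in range(10):
        now = step * period
        for rank in range(4):
            if rank in crashed_at and now >= crashed_at[rank]:
                continue         # a crashed rank emits nothing
            detector.heartbeat(rank, now)
        history.append((now, detector.suspected(now)))
    return history


if __name__ == "__main__":
    for now, suspects in simulate():
        print(f"t={now:.0f} suspected={suspects}")
```

In this sketch the heartbeats are sent on every tick whether or not a failure ever occurs, which is the kind of always-on cost the citing authors attribute to the heartbeat approach. A shorter `timeout` detects failures sooner but tolerates less delay before falsely suspecting a live rank.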


“…In the failure detection stage, stop & restart techniques cause the entire application to abort when one or several processes fail, while the ULFM resilience constructs enable failure notification to some or all the remaining live processes without global cancellation of the application. Besides, the existence of a well-defined propagation mechanism (i.e., communication revocation), exposed through the ULFM API, allows for highly optimized implementations, as proposed in [8]. Such implementations take advantage of underlying MPI capabilities and the structure of applications to improve the speed at which process faults are detected and to deliver a fast and reliable multicast using the same … [53] simulates the main procedures in a 3D method of characteristics (MOC) code for the numerical solution of the steady-state neutron transport equation.…”

Section: Resilient Vs Stop and Restart Solutions (mentioning)

confidence: 99%

Fault tolerance of MPI applications in exascale systems: The ULFM solution

Losada

González

Martín

et al. 2020

Future Generation Computer Systems

Self Cite


The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications executing in future exascale systems, not only to ensure the completion of their execution in these systems but also to improve their energy consumption. Although the Message Passing Interface (MPI) is the most popular programming model for distributed-memory HPC systems, as of now, it does not provide any fault-tolerant construct for users to handle failures. Thus, the recovery procedure is postponed until the application is aborted and re-spawned. The proposal of the User Level Failure Mitigation (ULFM) interface in the MPI forum provides new opportunities in this field, enabling the implementation of resilient MPI applications, system runtimes, and programming language constructs able to detect and react to failures without aborting their execution. This paper presents a global overview of the resilience interfaces provided by the ULFM specification, covers archetypal usage patterns and building blocks, and surveys the wide variety of application-driven solutions that have exploited them in recent years. The large and varied number of approaches in the literature proves that ULFM provides the necessary flexibility to implement efficient fault-tolerant MPI applications. All the proposed solutions are based on application-driven recovery mechanisms, which allows reducing the overhead and obtaining the required level of efficiency needed in the future exascale platforms.


“…It can introduce memory access and communication latency to the application execution and further affect the application execution efficiency. As reported in a ULFM paper [24], ULFM implements a constantly heartbeat mechanism for failures detection, and also amends MPI communication interfaces for failure recovery operations. These changes must have an impact on application execution [figure: bar chart, axis ticks 0 to 300, series RESTART-FTI, REINIT-FTI, ULFM-FTI] Furthermore, we observe that the times for writing checkpoints in RESTART-FTI and REINIT-FTI cases are close.…”

Section: Performance Comparison On Different Scaling Sizes (mentioning)

confidence: 99%

MATCH: An MPI Fault Tolerance Benchmark Suite

Guo

Georgakoudis

Parasyris

et al. 2020

2020 IEEE International Symposium on Workload Characterization (IISWC)


MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques have been proposed to enable MPI application execution to efficiently resume from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, so to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation derives useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) Using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.
