Fault-Tolerant MPI

Bouteiller, Aurélien

doi:10.1007/978-3-319-20943-2_3

Cited by 6 publications

(2 citation statements)

References 87 publications


“…It seems quite natural to accept redundancies to improve the fault tolerance of a system, e.g. by combining multiple physical storage components to a Redundant Array of Inexpensive Disks (RAID) system (Patterson et al, 1988) or by using techniques that can be applied to enable an automated MPI application recovery (Bouteiller, 2015). There may also be performance improvements associated with it.…”

Section: Accepting Redundant Computations In Parallel Applications (mentioning)

confidence: 99%

Multiple execution of the same MPI application to exploit parallelism at hotspots with minimal code changes: a case study with FESOM2-Iceberg and FESOM2-REcoM

Himstedt

2023

Preprint


Abstract. For a typical climate model, parallelization based on a domain decomposition is a predominant technique to speed up its computation as an MPI (Message Passing Interface) application on an HPC (High Performance Computing) system. In this contribution, it is shown how the potential of simultaneously executing multiple instances of such an MPI application can be exploited to achieve a further speedup with an additional parallelization of suitable compute-intensive loops. In contrast to a parallelization based on OpenMP (Open Multi-Processing), no special synchronization effort is required if MPI calls occur in the iterations of the original loop. Splitting the work at such hotspots between the instances represents an independent level of parallelization on top of the domain decomposition. The simple implementation can be performed within the familiar MPI world, where the climate model can largely be considered as a black box. Outside of the hotspots, however, the same computations are performed in all instances. Some examples will show that such a conscious acceptance of redundant computations for parallelization approaches is quite common in other disciplines to reduce the time-to-solution. These approaches thus also represent the main inspiration for the approach presented in this contribution. Experimental results show for the example of the additional parallelization of an iceberg and a biogeochemical model, each embedded into FESOM2, how the time-to-solution can be further reduced with a small number of instances at appropriate efficiency. With the non-parallelized part outside of hotspots, however, the meaningful utilization of a larger number of instances will not be easily possible in practice, which will be explained in more detail in some efficiency considerations with the reference to Amdahl’s Law. Nevertheless, the implementation of the approach for other simulation models with similar properties seems promising, if the further reduction of the time-to-solution is in the focus, but a limit for the scalability based on the domain decomposition is reached.


“…Early works regarding fault tolerance, including dynamic message passing interface (MPI) programs with checkpointing and resilient versions of MPI, are described in the work of Agbaria and Friedman and by Bosilca et al with a relatively recent summary found in the work of Dongarra et al and of Bouteiller. Cappello et al have summarized recent developments in resiliency that targets exascale.…”

Section: Related Work (mentioning)

confidence: 99%

Node failure resiliency for Uintah without checkpointing

Sahasrabudhe

Berzins

Schmidt

2019

Concurrency and Computation


The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many-core architectures if component failure rates remain unchanged. This potential increase in failure frequency coupled with I/O challenges at exascale may prove problematic for current resiliency approaches such as checkpoint restarting, although the use of fast intermediate memory may help. Algorithm-based fault tolerance (ABFT) using adaptive mesh refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework: both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables, such as positivity or boundedness, may be violated during interpolation. These challenges can be addressed by the combination of two techniques: (1) a fault-tolerant message passing interface (MPI) implementation to recover from runtime node failures, and (2) high-order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a "limited essentially nonoscillatory" (LENO) scheme along with AMR to rebuild the lost data without checkpointing using Uintah. Experiments were carried out using a fault-tolerant MPI-user-level failure mitigation to recover from runtime failure and LENO to recover data on patches belonging to failed ranks, while the simulation was continued to the end. Results show that this ABFT approach is up to 10× faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and not subject to the overshoots found in other interpolation methods.


The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints

Levy

Ferreira

Widener

2018

Concurrency and Computation


Coordinated checkpoint/restart is currently the dominant approach to mitigating the impact of failures on important scientific applications running on large-scale distributed systems. However, there is widespread evidence that coordinated checkpointing may no longer be viable on next-generation systems. Uncoordinated checkpoint/restart attempts to address the shortcomings of coordinated checkpoint/restart by allowing application processes to checkpoint their state independently. However, eliminating coordination may significantly degrade application performance. In this paper, we propose an approach that leverages existing coordination in important scientific applications to approximately coordinate checkpoints. Specifically, we propose to extend MPI implementations to force checkpoints to occur immediately after the completion of a collective operation. We evaluate the performance implications of this approach using an existing validated simulation framework. Our results demonstrate that approximately coordinated checkpointing can significantly improve application performance relative to totally uncoordinated checkpointing. We also show that forcing checkpoints to occur following a collective operation has a small impact on the nominal checkpoint interval for several important workloads. As a whole, the results presented in this paper demonstrate that approximately coordinated checkpointing may provide significant performance benefits without significantly increasing the cost of failure recovery.
