MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI

Bouteiller, Aurélien; Hérault, Thomas; Krawezik, Géraud; Lemarinier, Pierre; Cappello, Franck

doi:10.1177/1094342006067469

Cited by 107 publications

(73 citation statements)

References 29 publications

(72 reference statements)

“…Typical values for ρ lie in the interval [1; 2], meaning that re-execution time can be reduced by up to half for some applications [16]. Fortunately, the introduction of λ and ρ is not difficult to account for in the expression of the expected waste: in Equation (25), we replace WORK by λ WORK and RE-EXEC by RE-EXEC/ρ and obtain…”

Section: Work Time (mentioning)

confidence: 99%

Fault-Tolerance Techniques for High-Performance Computing

2015

Computer Communications and Networks

This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de-facto standard technique for resilience in High Performance Computing. We present the two main protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. We cover the Young/Daly formula for the optimal period and much more! Next we explain how the efficiency of checkpointing can be improved via fault prediction or replication. Then we move to application-specific methods, such as ABFT. We conclude the report by discussing techniques to cope with silent errors (or silent data corruption). This report is a slightly modified version of the first chapter of the monograph Fault tolerance techniques for high-performance computing edited by Thomas Herault and Yves Robert, and to be published by Springer Verlag.

“…Often global snapshots are used for fault recovery, but, as this paper demonstrates, can also be used to support reverse execution while debugging the MPI application. For an analysis of the performance implications of integrating C/R into an MPI implementation we refer the reader to previous literature on the subject [9,11].…”

Section: Related Work (mentioning)

confidence: 99%

“…The state of the communication channels is usually captured by a C/R-enabled MPI implementation, such as Open MPI [6]. Although C/R is not part of the MPI standard, it is often provided as a transparent service by MPI implementations [6][7][8][9]. The state of the process is captured by a Checkpoint/Restart Service (CRS), such as Berkeley Lab Checkpoint/Restart (BLCR) [10].…”

Section: Related Work (mentioning)

confidence: 99%

Checkpoint/Restart-Enabled Parallel Debugging

Hursey, January, O'Connor et al. 2010

Recent Advances in the Message Passing Interface

Abstract. Debugging is often the most time-consuming part of software development. HPC applications prolong the debugging process by adding more processes interacting in dynamic ways for longer periods of time. Checkpoint/restart-enabled parallel debugging returns the developer to an intermediate state closer to the bug. This focuses the debugging process, saving developers considerable amounts of time, but requires parallel debuggers cooperating with MPI implementations and checkpointers. This paper presents a design specification for such a cooperative relationship. Additionally, this paper discusses the application of this design to the GDB and DDT debuggers, Open MPI, and BLCR projects.

“…More recently, MPICH-V [3] system support a wide range of fault tolerance protocols. Its generic framework covers coordinated, uncoordinated, pessimistic logging and causal logging, etc.…”

Section: Related Work (mentioning)

confidence: 99%

“…As the systems scale, these systems have become increasingly fragile; an ideal MPI implementation thus must embody fault tolerance properties that support long execution times in fragile underlying execution environments, while being easy and flexible to use as well as being portable and adaptable of their execution environment and fault characteristics. Previous fault tolerant MPI implementations such as LAM/MPI [16] and MPICH-V [3] are easy to use for the user in that fault tolerance is largely transparent to the programmer, but their recovery protocol is fixed and thus not adaptable. This may not be desirable, as for example, the desirable recovery method is obviously different when a fault is transient software one confined in a single process, versus a repetitive one caused by a faulty hardware.…”

Section: Introduction (mentioning)

confidence: 99%

ABARIS: An Adaptable Fault Detection/Recovery Component Framework for MPIs

Jitsumoto, Endo, Matsuoka

2007

2007 IEEE International Parallel and Distributed Processing Symposium

Long-running MPI applications on clusters and grids that are prone to node and network failures motivate the use of fault tolerant MPI implementations. However, previous fault tolerant MPIs lack the ability to allow the user to easily choose appropriate fault recovery strategies according to the execution environment, independent of the application codes; rather, the user often had to hard-code restoration strategies in accordance with diverse sets of fault patterns, which could be numerous. For instance, if the fault is transient to a particular process, we merely have to restart the process on the same computing node; on the other hand, if the fault is due to repetitive hardware unreliability, we must migrate the process to a new node in its recovery. ABARIS is our new Fault/Recovery model aware component framework for MPI, where users can customize MPI fault detection and recovery algorithms according to their application and execution environmental requirements by merely selecting appropriate fault/recovery components, independent of the application code. Currently, the ABARIS framework prototype is implemented on top of MPICH-P4MPD. Preliminary evaluation of the prototype using NPB on our MPI fault simulator demonstrates that overhead compared to the original MPICH-P4MPD is almost negligible (less than 1%) under normal execution, and that when faults occur, appropriate selection and pairing of fault model and recovery method components for the execution environment is significant to the overall execution time.
