2013
DOI: 10.1177/1094342013488238

Post-failure recovery of MPI communication capability

Abstract: As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequence of failures on MPI communications. Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect failures and resume communications afterward. In this paper, we present a set of extensions to MPI that all…

Cited by 164 publications (121 citation statements)
References 29 publications
“…In this paper, we tackle this issue from a different perspective, with the goal of improving the efficiency of the existing implementation of the fault tolerant constructs added to the Message Passing Interface (MPI) by the User-Level Failure Mitigation (ULFM) [3] proposal. Moreover, MPI being a de-facto parallel programming paradigm, one of our main concerns will be the time-to-solution, more specifically the scalability, of the proposed agreement.…”
Section: Use Case: the ULFM Agreement (mentioning)
confidence: 99%
“…Mini-Ckpts is a known example of a framework that emphasizes the recovery of the OS environment by preserving kernel structures in persistent memory [31]. Similarly, the ULFM MPI provides recovery of the communication environment from the failure of processes by reconstructing the MPI communicator by creating consensus among the remaining set of processes [7].…”
Section: Environment State Pattern (mentioning)
confidence: 99%
“…While the pattern seeks to restructure the sub-systems in an operating state that is functionally equivalent to the fault-free state, the pattern may result in the operation of the system in degraded condition, which incurs additional time overhead to the system. Existing solutions that restructure the system in response to an event include the ULFM extension to the MPI standard [7], which allows parallel applications to get notifications of process failures. ULFM provides a set of routines to revoke and restructure a MPI communicator that consists of the remaining active processes.…”
Section: Restructure Pattern (mentioning)
confidence: 99%
“…User-Level Failure Mitigation (ULFM) [22] is the proposal currently being discussed and iteratively refined at the MPI Forum. Although we can find implementations supporting it experimentally, it cannot be considered MPI compliant, and its contents are likely to change before its possible final incorporation into a future version of the Standard.…”
Section: MPI Resilience (mentioning)
confidence: 99%
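
Several of the statements above describe the same recovery sequence: surviving processes learn of a failure, revoke the damaged communicator, and continue on a restructured communicator that contains only the remaining processes. The following is a minimal toy model of that sequence in plain Python. It makes no MPI calls and is not the ULFM interface; `ToyComm`, `send`, `revoke`, `shrink`, and `recover` are hypothetical names chosen for illustration, and the dense renumbering of survivors is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class ToyComm:
    """Toy stand-in for a communicator: an ordered list of process ids."""
    members: list
    revoked: bool = False

    def revoke(self):
        # After revocation, this object refuses all further communication.
        self.revoked = True

    def send(self, src, dst, failed):
        # A send is reported as an error if the communicator was revoked
        # or if the destination process is in the failed set.
        if self.revoked:
            raise RuntimeError("communicator revoked")
        if self.members[dst] in failed:
            raise RuntimeError("peer process failed")
        return (self.members[src], self.members[dst])

    def shrink(self, failed):
        # Build a new communicator from the survivors. Ranks are renumbered
        # densely, keeping the survivors' original relative order
        # (an assumption of this toy model).
        survivors = [p for p in self.members if p not in failed]
        return ToyComm(survivors)


def recover(comm, failed):
    """Revoke the damaged communicator, then continue on a shrunken one."""
    comm.revoke()
    return comm.shrink(failed)
```

In this model, a send to a failed peer raises an error, which plays the role of the failure notification; calling `recover` then yields a smaller communicator on which the survivors can communicate again. The consensus step that the real proposal needs so that all survivors agree on the failed set is deliberately left out here.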