2004
DOI: 10.1023/b:clus.0000039491.64560.8a

MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware

Cited by 37 publications (18 citation statements)
References 18 publications
“…Software-based solutions detect the anomalies in the behavior of a system's data variables or control flow attributes to determine the presence of a fault. Heartbeat monitoring is used for liveness checking of MPI processes, which enables detection of imminent failure of the MPI communicator [6].…”
Section: Fault Treatment Pattern (mentioning)
confidence: 99%
“…These ABFT techniques typically require a fault-tolerant message passing environment. There have been a number of these resilient message passing libraries based on MPI, including; FT-MPI [22,17], AMPI [9], MPI/FT [3], and C³ [8]. The differences between these libraries is beyond the scope of this work, but each of these libraries allows for an application to continue operating in the presence of faults, possibly in a degraded mode, and it is left up to the application to ensure the result is correct.…”
Section: Fault Tolerant Userspace Libraries (mentioning)
confidence: 99%
“…As a result, they can get away with not completely handling nondeterministic events. Other projects such as FT-MPI [43] and MPI/FT [44] have extended MPI to implement process replicas on MPI applications for hard faults. PLR applies replicas for transient fault tolerance on general-purpose multicore machines.…”
Section: Related Work (mentioning)
confidence: 99%
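
One of the citation statements above refers to heartbeat monitoring as a liveness check for MPI processes. As a rough illustration of that general idea only, the sketch below tracks when each process rank last reported in and flags ranks that have been silent for longer than a timeout. It is not taken from MPI/FT or from any of the citing papers, and it makes no MPI calls. The class name `HeartbeatMonitor`, its methods, and the `timeout_s` parameter are all hypothetical names chosen for this example.

```python
import time


class HeartbeatMonitor:
    """Hypothetical liveness tracker: records the last heartbeat per rank."""

    def __init__(self, ranks, timeout_s, clock=time.monotonic):
        # `clock` is injectable so the monitor can be driven by a fake clock.
        self.timeout_s = timeout_s
        self.clock = clock
        now = clock()
        # Every rank is treated as alive at start-up.
        self.last_seen = {rank: now for rank in ranks}

    def beat(self, rank):
        """Record a heartbeat received from `rank`."""
        self.last_seen[rank] = self.clock()

    def suspected(self):
        """Return the sorted ranks whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            rank
            for rank, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

In a real deployment the heartbeats would arrive over the network and the monitor would run periodically. A timeout can only raise a suspicion, since a slow process and a failed process look the same to the monitor.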