2007 IEEE International Parallel and Distributed Processing Symposium 2007
DOI: 10.1109/ipdps.2007.370603

ABARIS: An Adaptable Fault Detection/Recovery Component Framework for MPIs

Abstract: Long-running MPI applications on clusters and grids that are prone to node and network failures motivate the use of fault-tolerant MPI implementations. However, previous fault-tolerant MPIs do not let the user easily choose appropriate fault recovery strategies according to the execution environment, independent of the application code; rather, the user often had to hard-code restoration strategies in accordance with diverse sets of fault patterns, which could be numerous: for instance, if th…

citations
Cited by 12 publications
(11 citation statements)
references
References 14 publications
“…This paper focuses on the former techniques. It builds on recently developed techniques such as checkpointing (with restarts or rollbacks) or redundant computing in high-performance computing [3,6,13,21,34] or API extensions for checkpointing [15,24]. A common challenge of transparent resiliency lies in the detection of faults, which is also a requirement for fault-awareness as proposed in MPI-3 [1].…”
Section: Introduction (mentioning)
confidence: 99%
“…MPI implementations must improve their ability to handle failures, such as broken connections and dead processes, to the extent possible. A number of research efforts in fault-tolerant MPI implementation exist [2,7,13]. However, production MPI implementations need to improve their support for fault tolerance.…”
Section: Fault Tolerance (mentioning)
confidence: 99%
“…This system could initiate different recovery actions designed to remedy different fault types. Jitsumoto and colleagues [44] also developed a detector that differentiated between hardware, process, and transmission faults. Here, users were allowed to pre-select a recovery procedure to be invoked in response to occurrence of a particular fault type.…”
Section: Detecting Faults and Failures in Grid Resources (mentioning)
confidence: 99%
“…If the host processor has not failed, temporal redundancy can be used to roll back and restart the process on the same platform. As in other systems, this method is widely used in grids [16,44,47]. However, if the host has failed, the process may be migrated, or transferred, to a different execution environment.…”
Section: Checkpoint and Recovery (mentioning)
confidence: 99%